14:00:17 <tdurakov> #startmeeting Nova Live Migration
14:00:21 <openstack> Meeting started Tue Aug  2 14:00:17 2016 UTC and is due to finish in 60 minutes.  The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:25 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:30 <tdurakov> hi everyone
14:00:32 <davidgiluk> o/
14:00:39 <lpetrut> hi
14:01:10 <tdurakov> agenda - https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:19 * kashyap waves
14:01:44 <mdbooth> o/
14:01:52 <tdurakov> let's wait a minute for others, and will start
14:02:01 <paul-carlton2> hi
14:02:12 <pkoniszewski> o/
14:02:47 <tdurakov> so
14:02:50 <tdurakov> #topic Libvirt image backend
14:03:21 <tdurakov> mdbooth: any updates on that?
14:03:32 * andrearosa is late
14:03:47 <mdbooth> tdurakov: I sent a big email the week before last, and was on vacation last week
14:04:02 <mdbooth> The changes are gradually merging
14:04:28 <tdurakov> mdbooth: anything to help with?
14:04:32 <mdbooth> There's a specific change I'd like to call out, because it changes live migration quite a bit
14:04:32 <tdurakov> or just reviews?
14:04:35 * mdbooth finds the link
14:05:09 <mdbooth> https://review.openstack.org/#/c/342224/
14:05:56 * tdurakov starred change
14:06:23 <mdbooth> Note that's in the middle of a very long series
14:06:38 <tdurakov> #action review this https://review.openstack.org/#/c/342224/
14:06:49 <mdbooth> In general there are very few functional changes in the series, but that's a functional change.
14:07:01 <tdurakov> mdbooth: could you share the very bottom patch to follow?
14:07:13 <mdbooth> tdurakov: Hah
14:07:32 <tdurakov> mdbooth: very bottom that still requires review
14:07:33 <mdbooth> It's currently https://review.openstack.org/#/c/344168/
14:07:45 <mdbooth> But that's about 20 patches prior to the above.
14:08:15 <mdbooth> I need reviews on those, too, but if you only review 1 really closely, please look at the pre live migration one
14:08:41 <tdurakov> #link https://review.openstack.org/#/c/344168/ - current bottom change for series
14:09:40 <tdurakov> mdbooth: ok, anything to discuss on this?
14:10:38 <mdbooth> Pre live migration patch is the most relevant thing.
14:10:45 <mdbooth> Apart from that, all reviews welcome.
14:10:49 <tdurakov> mdbooth: acked will take a look
14:11:12 <tdurakov> #action to review Libvirt image backend series
14:11:57 <tdurakov> let's move on then
14:12:11 <tdurakov> #topic Storage pools
14:12:50 <tdurakov> paul-carlton2: anything to discuss on this topic?
14:13:12 <mriedem> did danpb review the storage pools spec yet?
14:13:21 <paul-carlton2> Would like to get the specs approved in next few days if possible
14:13:30 <paul-carlton2> mriedem, nope
14:14:03 <tdurakov> paul-carlton2: this one: https://review.openstack.org/#/c/310505/  right?
14:14:14 <paul-carlton2> but doesn't matter if not, will be working on implementation when I get back from holiday and resubmit specs for ocata anyway
14:14:45 <paul-carlton2> yep and https://review.openstack.org/#/c/310538/
14:15:18 <paul-carlton2> plan is to work on this and get some of the implementation done so it can be completed in Ocata
14:15:48 <tdurakov> paul-carlton2: acked
14:16:43 <tdurakov> let's go to the next topic then
14:16:49 <paul-carlton2> some parts of the implementation depend on the work mdbooth is doing but there is some work that doesn't
14:16:56 <paul-carlton2> ta
14:17:34 <mdbooth> paul-carlton2: Are you likely to work on the local root BDM thing?
14:17:58 <mdbooth> Also, BDMs for config disks
14:19:04 <paul-carlton2> mdbooth, nope, Paul Murray changed his mind and said I should focus on the libvirt storage pools stuff when I told him Diane was working on this
14:19:14 <mdbooth> paul-carlton2: Ok, np.
14:20:08 <tdurakov> so...
14:20:12 <tdurakov> #topic CI
14:20:28 <tdurakov> https://bugs.launchpad.net/nova/+bug/1524898 - still valid
14:20:28 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:20:44 <tdurakov> I've acked cinder folks to take a look
14:20:56 <davidgiluk> the previous bet was that it was iSCSI config, wasn't it?
14:21:10 <tdurakov> davidgiluk: yes, I think so
14:21:17 <kashyap> tdurakov: I've checked a few times on -cinder IRC in the past few weeks, just radio silence
14:21:32 <kashyap> Even with specific pointers to current state of analysis on the bug.
14:21:58 <kashyap> Seems like this is one of those bugs that'd just rot away without attention due to a lack of proper coordination
14:22:16 <tdurakov> #action tdurakov to start thread on ml for cinder-nova teams
14:22:24 <tdurakov> mriedem: any ideas?
14:22:30 <tdurakov> kashyap: yes(
14:22:40 * mriedem hasn't been following
14:22:40 <kashyap> tdurakov: Raising it on the mailing list is the best bet
14:22:51 <kashyap> With a proper action item for Cinder folks with iSCSI / Kernel expertise.
14:23:02 <tdurakov> kashyap: yes
14:23:02 <mriedem> oh,
14:23:05 <tdurakov> agree
14:23:10 <kashyap> mriedem: No worries, you could catch up with the summary on the list
14:23:14 <mriedem> i don't have anything if hemna or danpb aren't looking at it
14:23:34 <mriedem> my feeling is,
14:23:40 <kashyap> mriedem: danpb, and davidgiluk narrowed down the issue to Kernel / iSCSI, if you see the bug's analysis
14:23:45 <mriedem> if that test is keeping us from making the live migration job voting, we should skip it
14:24:04 <mriedem> would in-qemu iscsi help?
14:24:06 <davidgiluk> we really should fix the test
14:24:10 <mriedem> is that available in xenial?
14:24:21 <mriedem> davidgiluk: fix the test or fix the bug?
14:24:21 <davidgiluk> mriedem: We should understand the problem before changing it
14:24:31 <mdbooth> mriedem: in-qemu iscsi doesn't (didn't?) support multipath
14:24:40 <mriedem> this is multipath?
14:24:55 <tdurakov> davidgiluk: as I understood, mriedem proposes to temporarily skip this test, right?
14:24:56 <mdbooth> Not afaik, but it means it's not a functional replacement yet
14:25:18 <mriedem> mdbooth: ok, but we don't use multipath in the gate anywhere as far as i know,
14:25:19 <kashyap> These are the iSCSI errors that Kernel is throwing:
14:25:19 <kashyap> Jun 30 14:28:09 ubuntu-xenial-2-node-ovh-gra1-2121639 iscsid[525]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
14:25:28 <mriedem> so i was thinking if in-qemu iscsi fixes this for the live migration job, we should use that
14:25:30 <mriedem> if available
14:25:40 <mdbooth> Have we implemented in-qemu iscsi?
14:25:43 <mriedem> is there any possible hack workaround we can do in the code?
14:25:44 <davidgiluk> mriedem: I would say we should not do that - we should understand the problem
14:25:52 <mriedem> davidgiluk: ideally yes,
14:25:56 <mriedem> davidgiluk: but who's doing that?
14:26:11 <davidgiluk> mriedem: Do we not have any friendly iscsi people we know?
14:26:15 <mriedem> i don't want to keep the live migration job non-voting forever just because of this one test that no one is working on
14:26:25 <kashyap> mdbooth: I think this is what you were looking for - https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/qemu-built-in-iscsi-initiator.html
14:26:28 <mriedem> davidgiluk: hemna, but i'm sure he's preoccupied
14:26:38 <mriedem> mdbooth: yeah that ^
14:26:49 <mriedem> but ubuntu kept the patch out of their qemu package
14:26:58 <mriedem> at least <xenial, i'm not sure about xenial
14:27:02 <mdbooth> kashyap: Was it implemented?
14:27:06 <mriedem> mdbooth: yeah
14:27:15 <kashyap> Ah, okay.  Was trying to confirm that
14:27:16 <mriedem> but a total "you have to patch qemu yourself to use this"
* mdbooth wonders if it gets tested
14:27:26 <mriedem> it does not
14:27:33 * mdbooth suspects it's broken :)
14:27:40 <mriedem> the patch has details, but ubuntu didn't carry the in-qemu iscsi support
14:27:45 <mriedem> in their package
14:27:49 <mdbooth> It would be a substantially different code path
14:27:53 <davidgiluk> mriedem: We don't even know if in-qemu iscsi would fix the problem
14:27:59 <mriedem> davidgiluk: i realize,
14:28:05 <mriedem> but it's a thread to pull on
14:28:05 <mriedem> right?
14:28:19 <tdurakov> mriedem: +
14:28:42 <mdbooth> mriedem: I agree that it would be interesting diagnostically to know if in-qemu iscsi made it go away.
14:29:05 <mriedem> well, i thought danpb's long-term vision was that all things would be qemu native
14:29:06 <tdurakov> any volunteers on that?
14:29:41 <mdbooth> mriedem: Right, it would be awesome. I understood the blocker was just multipath. I didn't even know it was implemented.
14:30:19 <mriedem> the qemu blocker
14:30:23 <mriedem> not the gate job blocker
14:30:30 <mriedem> i'm starting to speak your language :)
14:30:48 <mriedem> anyway, maybe we take a note that we should investigate in-qemu in the live migration job
14:30:58 <mriedem> (9:29:50 AM) danpb: then again, it might give us a nicer error message in qemu that acutally shows us the real problem
14:30:58 <mriedem> (9:30:33 AM) danpb: as the error reporting from the kernel iscsi client is awful (and that's being polite)
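(Editor's note, for context: the in-qemu iSCSI path discussed above means QEMU's built-in initiator opens the target directly, so the disk appears in the libvirt domain XML as a network disk rather than a kernel-attached block device, and errors surface in QEMU instead of iscsid. A purely illustrative sketch of the two disk-XML shapes; the device path, portal, and IQN below are made up:)

```python
import xml.etree.ElementTree as ET

def kernel_attached_disk(dev_path):
    # Kernel initiator: iscsid logs in, the LUN shows up as /dev/sdX,
    # and libvirt is handed a plain block device.
    disk = ET.Element('disk', type='block', device='disk')
    ET.SubElement(disk, 'source', dev=dev_path)
    ET.SubElement(disk, 'target', dev='vda', bus='virtio')
    return disk

def in_qemu_iscsi_disk(portal, iqn, lun):
    # In-QEMU initiator: no host block device at all; QEMU speaks
    # iSCSI itself via a <disk type='network'> element.
    disk = ET.Element('disk', type='network', device='disk')
    source = ET.SubElement(disk, 'source', protocol='iscsi',
                           name='%s/%d' % (iqn, lun))
    ET.SubElement(source, 'host', name=portal, port='3260')
    ET.SubElement(disk, 'target', dev='vda', bus='virtio')
    return disk

kernel_xml = ET.tostring(kernel_attached_disk('/dev/sdb')).decode()
qemu_xml = ET.tostring(
    in_qemu_iscsi_disk('192.0.2.10', 'iqn.2016-08.org.example:vol1', 0)).decode()
```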
14:31:01 <tdurakov> so... my proposal for this: temporarily exclude this test from the l-m job, and start an investigation on that
14:31:15 <mriedem> tdurakov: i'm fine with that
14:31:20 <mriedem> we'd still have it in the multinode job
14:31:28 <tdurakov> right
14:31:35 <mriedem> i would like to get more stable runs on the l-m job though
14:31:39 <mriedem> so we can start digging out
14:32:09 <tdurakov> btw, has anyone reproduced this locally?
14:33:09 <tdurakov> that's kind of the problem^
14:33:31 <tdurakov> #action tdurakov to skip test in live-migration job
14:34:01 <tdurakov> #action find volunteer for the underlying bug, will make a call on the ml
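(Editor's note: the "skip temporarily" action usually amounts to decorating the failing test with a skip that cites the bug, so the rest of the job can stabilize and go voting. A minimal generic sketch using stdlib unittest; the test and class names are made up, and tempest has its own `skip_because`-style decorators for this:)

```python
import unittest

class LiveMigrationTest(unittest.TestCase):
    @unittest.skip("Skipped pending bug 1524898 (volume-backed live "
                   "migration aborts); still exercised in the multinode job")
    def test_volume_backed_live_migration(self):
        self.fail("would exercise the flaky iSCSI path")

    def test_shared_storage_live_migration(self):
        # Unrelated tests keep running, so the job's signal stays useful.
        self.assertTrue(True)

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(LiveMigrationTest))
```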
14:34:11 <tdurakov> let's move on
14:34:38 <mdbooth> tdurakov: If anybody has a paying customer hitting this, getting cycles for a reproducer should be simple
14:34:38 <tdurakov> just fyi https://review.openstack.org/#/c/329466/ - updated patch, so if it's ok we could enable nfs again soon
14:34:55 <mriedem> http://packages.ubuntu.com/xenial/qemu-block-extra has the package we need
14:36:02 <tdurakov> #topic Migration object
14:36:20 <tdurakov> I want to discuss usage of migration object in nova
14:36:47 <tdurakov> it turns out for resize/evacuate it's being created implicitly during the claim
14:36:59 <tdurakov> so, it's kind of related to the live-migration thing
14:37:10 <tdurakov> that I'd like to change
14:37:31 <tdurakov> I'd prefer to create it explicitly in conductor instead
14:37:37 <tdurakov> thoughts?^
14:38:07 <mdbooth> Without looking at the code, explicit always wins for me.
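(Editor's note: roughly what "implicit vs explicit" means here, using toy stand-ins rather than the real Nova objects or RPC — the point being that the conductor owns the Migration record's lifecycle instead of the claim creating one as a side effect:)

```python
class Migration:
    """Toy stand-in for the Nova Migration object."""
    def __init__(self, instance_uuid, migration_type):
        self.instance_uuid = instance_uuid
        self.migration_type = migration_type
        self.status = 'accepted'

def claim_resources(instance_uuid, migration=None):
    # Implicit style: if the caller didn't pass a migration, the claim
    # quietly creates one (roughly how resize/evacuate behave today).
    if migration is None:
        migration = Migration(instance_uuid, 'resize')
    migration.status = 'pre-migrating'
    return migration

def conductor_live_migrate(instance_uuid):
    # Explicit style: the conductor creates the record up front, so its
    # lifecycle is visible in one place and the claim just updates it.
    migration = Migration(instance_uuid, 'live-migration')
    return claim_resources(instance_uuid, migration=migration)
```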
14:39:37 <tdurakov> ok, will send mail with details on that
14:39:56 <tdurakov> #topic Plan for Ocata
14:40:19 <tdurakov> as I already understood it will be Storage pools
14:40:32 <tdurakov> anything else that requires bp/spec?
14:40:51 <tdurakov> from my side it will be the FSM for migrations, working on that now
14:40:58 <tdurakov> anything else?
14:41:00 <mdbooth> I would really like to take a hard look at how we negotiate shared storage
14:41:17 <mdbooth> Right now, working out what's shared and what's not between 2 hosts is a mess
14:41:59 <tdurakov> mdbooth: big + on that
14:42:34 <tdurakov> I'd also expect this one: post-copy interrupts networking
14:42:35 <mdbooth> I had an idea to be explicit about it somehow. i.e. Have the target communicate what it already has to the source.
14:43:20 <tdurakov> mdbooth: could work
14:44:39 <davidgiluk> tdurakov: I think Luis said he was away this week for that post-copy/networking one - I'm assuming he's back next week but not sure
14:44:40 <tdurakov> the way the migrate_data object contains a 'dozen' flags for shared/not shared makes this tricky every time
14:45:05 <mdbooth> tdurakov: Right, they're unfathomable, and they still don't cover the edge cases
14:45:24 <tdurakov> davidgiluk: ok
14:45:31 <mdbooth> like 2 hosts which use separate ceph setups
14:45:40 <mdbooth> currently marked as shared
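(Editor's note: one way to read the "have the target communicate what it already has" idea — purely illustrative, not Nova code: instead of a pile of booleans, each host advertises opaque storage-backend identities (e.g. an rbd cluster fsid, an NFS export), and "shared" is just the intersection, which naturally distinguishes two separate ceph clusters:)

```python
def shared_backends(source_backends, dest_backends):
    # A backend is shared only if both hosts report the *same* identity.
    # Two distinct ceph clusters have different fsids, so ('rbd', fsid)
    # tuples correctly fail to match -- unlike a bare 'is_shared_rbd' flag.
    return set(source_backends) & set(dest_backends)

source = {('rbd', 'fsid-aaaa'), ('nfs', 'filer1:/export/nova')}
dest_same_ceph = {('rbd', 'fsid-aaaa'), ('local', 'host2')}
dest_other_ceph = {('rbd', 'fsid-bbbb'), ('local', 'host2')}
```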
14:46:00 <tdurakov> yes, looks like a big work item for ocata
14:46:14 <tdurakov> we have ~10 minutes, so let's go next
14:46:19 * mdbooth can't guarantee the cycles to work on it, but if anybody's interested...
14:46:27 <tdurakov> #topic Networking
14:46:55 <tdurakov> any updates on this one https://review.openstack.org/#/c/275073?
14:48:11 <tdurakov> #action to figure out status for setup_networks_on_host for Neutron
14:49:23 <tdurakov> johnthetubaguy: hi, any updates on this item:  Future port info spec to be worked on for Ocata
14:50:37 <tdurakov> the same action then
14:50:56 <tdurakov> so next topic
14:51:00 <johnthetubaguy> ah, so yes, same action
14:51:19 <johnthetubaguy> been looking into that, but not yet at the bottom of things, mostly due to holiday end of last week and yesterday
14:51:33 <johnthetubaguy> there is a patch in review we want to get merged, which should help
14:51:51 <tdurakov> johnthetubaguy: acked, thanks for update
14:51:59 <johnthetubaguy> I am booked to go to the neutron midycle to help talk about the plan for next cycle
14:52:12 <johnthetubaguy> so let me know if there are things folks want raised there
14:53:28 <tdurakov> #action reach johnthetubaguy with nova-neutron things to be discussed during the Neutron mid-cycle
14:53:54 <tdurakov> #topic Open discussion
14:54:35 <mdbooth> https://bugs.launchpad.net/nova/+bug/1597644
14:54:35 <openstack> Launchpad bug 1597644 in OpenStack Compute (nova) "Quobyte: Permission denied on console.log during instance startup" [High,Fix released] - Assigned to Silvan Kaiser (2-silvan)
14:54:44 <mdbooth> This bug came out of my series the other day
14:54:56 <mdbooth> However, the thing I'd like to discuss here is
14:55:22 <mdbooth> The bug describes that Quobyte CI deliberately configures cinder and nova to be able to write to each others' instance files
14:55:39 <mdbooth> Can anybody think of a reason that they would do that, or how it might not be broken?
* tdurakov hasn't seen Quobyte and its CI yet
14:56:42 * mdbooth hadn't heard of it until his patch got reverted ;)
14:56:58 <mdbooth> However, that was easily worked around
14:57:14 <mdbooth> Shared access to storage between cinder and nova just sounds scary
14:58:18 <pkoniszewski> i have one more thing
14:58:44 <pkoniszewski> i just sent an e-mail to os-dev list about removing live_migration_flag and what to do with live_migration_tunnelled
14:58:45 <pkoniszewski> http://lists.openstack.org/pipermail/openstack-dev/2016-August/100657.html
14:59:14 <pkoniszewski> if you can take a look at it - this VIR_MIGRATE_TUNNELLED flag has been a pain for a really long time, and maybe this is a good time to get rid of it
14:59:23 <pkoniszewski> that's all :)
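(Editor's note: VIR_MIGRATE_TUNNELLED routes migration traffic through the libvirtd connection instead of a direct QEMU-to-QEMU socket. A rough sketch of how a single boolean option like the thread's `live_migration_tunnelled` could map onto libvirt's migration flag bits once the free-form `live_migration_flag` string is removed; the constant values are as defined in libvirt's virDomainMigrateFlags enum, but the composing function is hypothetical:)

```python
# Flag values from libvirt's virDomainMigrateFlags enum.
VIR_MIGRATE_LIVE = 1
VIR_MIGRATE_PEER2PEER = 2
VIR_MIGRATE_TUNNELLED = 4

def migration_flags(live_migration_tunnelled):
    # With live_migration_flag gone, the driver composes the flags
    # itself; the one remaining user-visible knob is whether to tunnel.
    flags = VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER
    if live_migration_tunnelled:
        flags |= VIR_MIGRATE_TUNNELLED
    return flags
```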
14:59:41 <tdurakov> mdbooth: agree on that, need walk through code
15:00:02 <tdurakov> pkoniszewski: flags... right
15:00:08 * amrith coughs discreetly in the back of the room
15:00:15 <gothicmindfood> ohhai
15:00:37 <tdurakov> pkoniszewski: I'd expect we remove it
15:00:39 <tdurakov> anyway
15:00:48 <gothicmindfood> amrith: are we crashing the nova meeting rn?
15:00:50 <tdurakov> it looks like we need to end
15:00:55 <tdurakov> #endmeeting