14:00:17 <tdurakov> #startmeeting Nova Live Migration
14:00:21 <openstack> Meeting started Tue Aug 2 14:00:17 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:25 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:30 <tdurakov> hi everyone
14:00:32 <davidgiluk> o/
14:00:39 <lpetrut> hi
14:01:10 <tdurakov> agenda - https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:19 * kashyap waves
14:01:44 <mdbooth> o/
14:01:52 <tdurakov> let's wait a minute for others, and we'll start
14:02:01 <paul-carlton2> hi
14:02:12 <pkoniszewski> o/
14:02:47 <tdurakov> so
14:02:50 <tdurakov> #topic Libvirt image backend
14:03:21 <tdurakov> mdbooth: any updates on that?
14:03:32 * andrearosa is late
14:03:47 <mdbooth> tdurakov: I sent a big email the week before last, and was on vacation last week
14:04:02 <mdbooth> The changes are gradually merging
14:04:28 <tdurakov> mdbooth: anything to help with?
14:04:32 <mdbooth> There's a specific change I'd like to call out, because it changes live migration quite a bit
14:04:32 <tdurakov> or just reviews?
14:04:35 * mdbooth finds the link
14:05:09 <mdbooth> https://review.openstack.org/#/c/342224/
14:05:56 * tdurakov starred change
14:06:23 <mdbooth> Note that's in the middle of a very long series
14:06:38 <tdurakov> #action review this https://review.openstack.org/#/c/342224/
14:06:49 <mdbooth> In general there are very few functional changes in the series, but that's a functional change.
14:07:01 <tdurakov> mdbooth: could you share the very bottom patch to follow?
14:07:13 <mdbooth> tdurakov: Hah
14:07:32 <tdurakov> mdbooth: very bottom that still requires review
14:07:33 <mdbooth> It's currently https://review.openstack.org/#/c/344168/
14:07:45 <mdbooth> But that's about 20 patches prior to the above.
14:08:15 <mdbooth> I need reviews on those, too, but if you only review 1 really closely, please look at the pre live migration one
14:08:41 <tdurakov> #link https://review.openstack.org/#/c/344168/ - current bottom change for series
14:09:40 <tdurakov> mdbooth: ok, anything to discuss on this?
14:10:38 <mdbooth> Pre live migration patch is the most relevant thing.
14:10:45 <mdbooth> Apart from that, all reviews welcome.
14:10:49 <tdurakov> mdbooth: acked, will take a look
14:11:12 <tdurakov> #action to review Libvirt image backend series
14:11:57 <tdurakov> let's move on then
14:12:11 <tdurakov> #topic Storage pools
14:12:50 <tdurakov> paul-carlton2: anything to discuss on this topic?
14:13:12 <mriedem> did danpb review the storage pools spec yet?
14:13:21 <paul-carlton2> Would like to get the specs approved in next few days if possible
14:13:30 <paul-carlton2> mriedem, nope
14:14:03 <tdurakov> paul-carlton2: this one: https://review.openstack.org/#/c/310505/ right?
14:14:14 <paul-carlton2> but doesn't matter if not, will be working on implementation when I get back from holiday and resubmit specs for ocata anyway
14:14:45 <paul-carlton2> yep, and https://review.openstack.org/#/c/310538/
14:15:18 <paul-carlton2> plan is to work on this and get some of the implementation done so it can be completed in ocata
14:15:48 <tdurakov> paul-carlton2: acked
14:16:43 <tdurakov> let's go to the next topic then
14:16:49 <paul-carlton2> some parts of the implementation depend on the work mdbooth is doing but there is some work that doesn't
14:16:56 <paul-carlton2> ta
14:17:34 <mdbooth> paul-carlton2: Are you likely to work on the local root BDM thing?
14:17:58 <mdbooth> Also, BDMs for config disks
14:19:04 <paul-carlton2> mdbooth, nope, Paul Murray changed his mind and said I should focus on the libvirt storage pools stuff when I told him Diane was working on this
14:19:14 <mdbooth> paul-carlton2: Ok, np.
14:20:08 <tdurakov> so...
14:20:12 <tdurakov> #topic CI
14:20:28 <tdurakov> https://bugs.launchpad.net/nova/+bug/1524898 - still valid
14:20:28 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:20:44 <tdurakov> I've asked cinder folks to take a look
14:20:56 <davidgiluk> the previous bet was it was iSCSI config, wasn't it?
14:21:10 <tdurakov> davidgiluk: yes, I think so
14:21:17 <kashyap> tdurakov: I've checked a few times on -cinder IRC in the past few weeks, just radio silence
14:21:32 <kashyap> Even with specific pointers to current state of analysis on the bug.
14:21:58 <kashyap> Seems like this is one of those bugs that'd just rot away without any attention, due to lack of proper coordination
14:22:16 <tdurakov> #action tdurakov to start thread on ml for cinder-nova teams
14:22:24 <tdurakov> mriedem: any ideas?
14:22:30 <tdurakov> kashyap: yes(
14:22:40 * mriedem hasn't been following
14:22:40 <kashyap> tdurakov: Raising it on the mailing list is the best bet
14:22:51 <kashyap> With a proper action item for Cinder folks with iSCSI / Kernel expertise.
14:23:02 <tdurakov> kashyap: yes
14:23:02 <mriedem> oh,
14:23:05 <tdurakov> agree
14:23:10 <kashyap> mriedem: No worries, you could catch up with the summary on the list
14:23:14 <mriedem> i don't have anything if hemna or danpb aren't looking at it
14:23:34 <mriedem> my feeling is,
14:23:40 <kashyap> mriedem: danpb and davidgiluk narrowed down the issue to Kernel / iSCSI, if you see the bug's analysis
14:23:45 <mriedem> if that test is keeping us from making the live migration job voting, we should skip it
14:24:04 <mriedem> would in-qemu iscsi help?
14:24:06 <davidgiluk> we really should fix the test
14:24:10 <mriedem> is that available in xenial?
14:24:21 <mriedem> davidgiluk: fix the test or fix the bug?
14:24:21 <davidgiluk> mriedem: We should understand the problem before changing it
14:24:31 <mdbooth> mriedem: in-qemu iscsi doesn't (didn't?) support multipath
14:24:40 <mriedem> this is multipath?
14:24:55 <tdurakov> davidgiluk: as I understood, mriedem proposes to temporarily skip this test, right?
14:24:56 <mdbooth> Not afaik, but it means it's not a functional replacement yet
14:25:18 <mriedem> mdbooth: ok, but we don't use multipath in the gate anywhere as far as i know,
14:25:19 <kashyap> These are the iSCSI errors that the Kernel is throwing:
14:25:19 <kashyap> Jun 30 14:28:09 ubuntu-xenial-2-node-ovh-gra1-2121639 iscsid[525]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
14:25:28 <mriedem> so i was thinking if in-qemu iscsi fixes this for the live migration job, we should use that
14:25:30 <mriedem> if available
14:25:40 <mdbooth> Have we implemented in-qemu iscsi?
14:25:43 <mriedem> is there any possible hack workaround we can do in the code?
14:25:44 <davidgiluk> mriedem: I would say we should not do that - we should understand the problem
14:25:52 <mriedem> davidgiluk: ideally yes,
14:25:56 <mriedem> davidgiluk: but who's doing that?
14:26:11 <davidgiluk> mriedem: Do we not have any friendly iscsi people we know?
14:26:15 <mriedem> i don't want to keep the live migration job non-voting forever just because of this one test that no one is working on
14:26:25 <kashyap> mdbooth: I think this is what you were looking for - https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/qemu-built-in-iscsi-initiator.html
14:26:28 <mriedem> davidgiluk: hemna, but i'm sure he's preoccupied
14:26:38 <mriedem> mdbooth: yeah that ^
14:26:49 <mriedem> but ubuntu kept the patch out of their qemu package
14:26:58 <mriedem> at least <xenial, i'm not sure about xenial
14:27:02 <mdbooth> kashyap: Was it implemented?
14:27:06 <mriedem> mdbooth: yeah
14:27:15 <kashyap> Ah, okay. Was trying to confirm that
14:27:16 <mriedem> but a total "you have to patch qemu yourself to use this"
14:27:22 * mdbooth wonders if it gets tested
14:27:26 <mriedem> it does not
14:27:33 * mdbooth suspects it's broken :)
14:27:40 <mriedem> the patch has details, but ubuntu didn't carry the in-qemu iscsi support
14:27:45 <mriedem> in their package
14:27:49 <mdbooth> It would be a substantially different code path
14:27:53 <davidgiluk> mriedem: We don't even know if in-qemu iscsi would fix the problem
14:27:59 <mriedem> davidgiluk: i realize,
14:28:05 <mriedem> but it's a thread to pull on
14:28:05 <mriedem> right?
14:28:19 <tdurakov> mriedem: +
14:28:42 <mdbooth> mriedem: I agree that it would be interesting diagnostically to know if in-qemu iscsi made it go away.
14:29:05 <mriedem> well, i thought danpb's long-term vision was that all things would be qemu native
14:29:06 <tdurakov> any volunteers on that?
14:29:41 <mdbooth> mriedem: Right, it would be awesome. I understood the blocker was just multipath. I didn't even know it was implemented.
14:30:19 <mriedem> the qemu blocker
14:30:23 <mriedem> not the gate job blocker
14:30:30 <mriedem> i'm starting to speak your language :)
14:30:48 <mriedem> anyway, maybe we take a note that we should investigate in-qemu in the live migration job
14:30:58 <mriedem> (9:29:50 AM) danpb: then again, it might give us a nicer error message in qemu that actually shows us the real problem
14:30:58 <mriedem> (9:30:33 AM) danpb: as the error reporting from the kernel iscsi client is awful (and that's being polite)
14:31:01 <tdurakov> so... my proposal for this: temporarily exclude this test from the l-m job, and start investigation on that
14:31:15 <mriedem> tdurakov: i'm fine with that
14:31:20 <mriedem> we'd still have it in the multinode job
14:31:28 <tdurakov> right
14:31:35 <mriedem> i would like to get more stable runs on the l-m job though
14:31:39 <mriedem> so we can start digging out
14:32:09 <tdurakov> btw, has anyone reproduced this locally?
14:33:09 <tdurakov> that's kind of the problem^
14:33:31 <tdurakov> #action: tdurakov to skip test in live-migration job
14:34:01 <tdurakov> #action find volunteer for underlying bug issue, will make a call in ml
14:34:11 <tdurakov> let's move on
14:34:38 <mdbooth> tdurakov: If anybody has a paying customer hitting this, getting cycles for a reproducer should be simple
14:34:38 <tdurakov> just fyi https://review.openstack.org/#/c/329466/ - updated patch, so if it's ok we could enable nfs again soon
14:34:55 <mriedem> http://packages.ubuntu.com/xenial/qemu-block-extra has the package we need
14:36:02 <tdurakov> #topic Migration object
14:36:20 <tdurakov> I want to discuss usage of the migration object in nova
14:36:47 <tdurakov> it turns out for resize/evacuate it's being created implicitly during claim
14:36:59 <tdurakov> so it's kind of related to the live-migration thing
14:37:10 <tdurakov> that I'd like to change
14:37:31 <tdurakov> I'd prefer to create it explicitly in conductor instead
14:37:37 <tdurakov> thoughts?^
14:38:07 <mdbooth> Without looking at the code, explicit always wins for me.
14:39:37 <tdurakov> ok, will send mail with details on that
14:39:56 <tdurakov> #topic Plan for Ocata
14:40:19 <tdurakov> as I already understood it will be Storage pools
14:40:32 <tdurakov> anything else that requires bp/spec?
14:40:51 <tdurakov> from my side it will be fsm for migrations, working on that now
14:40:58 <tdurakov> anything else?
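[Editor's note: the explicit-vs-implicit Migration object creation tdurakov proposes above could be sketched roughly as below. All names here (`Migration`, `live_migrate_instance`) are hypothetical stand-ins, not Nova's actual conductor or `nova.objects.Migration` code; the point is only that the record is created and persisted up front in conductor rather than materializing as a side effect of the resource-tracker claim.]

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Migration:
    # Hypothetical stand-in for a nova.objects.Migration record.
    instance_uuid: str
    source_compute: str
    dest_compute: str
    migration_type: str = "live-migration"
    status: str = "queued"
    uuid: str = field(default_factory=lambda: str(uuid4()))

def live_migrate_instance(instance_uuid, source, dest):
    """Conductor-side flow: create the Migration record explicitly
    before handing off to compute, instead of relying on the claim
    on the compute node to create it implicitly."""
    migration = Migration(instance_uuid, source, dest)
    # ...persist the record, then pass it down the compute RPC call...
    return migration

m = live_migrate_instance("inst-1", "host-a", "host-b")
```

The benefit of the explicit shape is that every migration type flows through one creation point, so callers always have a Migration handle to update status on, even before a claim happens.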
14:41:00 <mdbooth> I would really like to take a hard look at how we negotiate shared storage 14:41:17 <mdbooth> Right now, working out what's shared and what's not between 2 hosts is a mess 14:41:59 <tdurakov> mdbooth: big + on that 14:42:34 <tdurakov> I'd also expect this one post-copy interrupts networking 14:42:35 <mdbooth> I had an idea to be explicit about it somehow. i.e. Have the target communicate what it already has to the source. 14:43:20 <tdurakov> mdbooth: could work 14:44:39 <davidgiluk> tdurakov: I think Luis said he was away this week for that post-copy/networking one - I'm assuming he's back next week but not sure 14:44:40 <tdurakov> the way how migrate_data object contains 'dozen' of flags for shared/not shared make this thing tricky every time 14:45:05 <mdbooth> tdurakov: Right, they're unfathomable, and they still don't cover the edge cases 14:45:24 <tdurakov> davidgiluk: ok 14:45:31 <mdbooth> like 2 hosts which use separate ceph setups 14:45:40 <mdbooth> currently marked as shared 14:46:00 <tdurakov> yes, looks like a big work item for ocata 14:46:14 <tdurakov> we have ~10 minutes, so let's go next 14:46:19 * mdbooth can't guarantee the cycles to work on it, but if anybody's interested... 14:46:27 <tdurakov> #topic Networking 14:46:55 <tdurakov> any updates on this one https://review.openstack.org/#/c/275073? 
14:48:11 <tdurakov> #action to figure out status for setup_networks_on_host for Neutron 14:49:23 <tdurakov> johnthetubaguy: hi, any updates on this item: Future port info spec to be worked on for Ocata 14:50:37 <tdurakov> the same action then 14:50:56 <tdurakov> so next topic 14:51:00 <johnthetubaguy> ah, so yes, same action 14:51:19 <johnthetubaguy> been looking into that, but not yet at the bottom of things, mostly due to holiday end of last week and yesterday 14:51:33 <johnthetubaguy> there is a patch in review we want to get merged, which should help 14:51:51 <tdurakov> johnthetubaguy: acked, thanks for update 14:51:59 <johnthetubaguy> I am booked to go to the neutron midycle to help talk about the plan for next cycle 14:52:12 <johnthetubaguy> so let me know if there are things folks want raised there 14:53:28 <tdurakov> #action, reach johnthetubaguy with things for nova-neutron to be discussed during Neutron mid-cycle 14:53:54 <tdurakov> #topic Open discussion 14:54:35 <mdbooth> https://bugs.launchpad.net/nova/+bug/1597644 14:54:35 <openstack> Launchpad bug 1597644 in OpenStack Compute (nova) "Quobyte: Permission denied on console.log during instance startup" [High,Fix released] - Assigned to Silvan Kaiser (2-silvan) 14:54:44 <mdbooth> This bug came out of my series the other day 14:54:56 <mdbooth> However, the thing I'd like to discuss here is 14:55:22 <mdbooth> The bug describes that Quobyte CI deliberately configures cinder and nova to be able to write to each others' instance files 14:55:39 <mdbooth> Can anybody think of a reason that they would do that, or how it might not be broken? 
14:56:27 * tdurakov hasn't seen Quobyte and its CI yet
14:56:42 * mdbooth hadn't heard of it until his patch got reverted ;)
14:56:58 <mdbooth> However, that was easily worked around
14:57:14 <mdbooth> Shared access to storage between cinder and nova just sounds scary
14:58:18 <pkoniszewski> i have one more thing
14:58:44 <pkoniszewski> i just sent an e-mail to the os-dev list about removing live_migration_flag and what to do with live_migration_tunnelled
14:58:45 <pkoniszewski> http://lists.openstack.org/pipermail/openstack-dev/2016-August/100657.html
14:59:14 <pkoniszewski> if you can take a look at it - this VIR_MIGRATE_TUNNELLED flag has been a pain for a really long time now, and maybe this is a good time to get rid of it
14:59:23 <pkoniszewski> that's all :)
14:59:41 <tdurakov> mdbooth: agree on that, need to walk through the code
15:00:02 <tdurakov> pkoniszewski: flags... right
15:00:08 * amrith coughs discreetly in the back of the room
15:00:15 <gothicmindfood> ohhai
15:00:37 <tdurakov> pkoniszewski: I'd expect we remove it
15:00:39 <tdurakov> anyway
15:00:48 <gothicmindfood> amrith: are we crashing the nova meeting rn?
15:00:50 <tdurakov> it looks like we need to end
15:00:55 <tdurakov> #endmeeting
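[Editor's note: for context on the live_migration_flag / live_migration_tunnelled discussion above, the change being debated on the list is roughly the move from the deprecated free-form flag string to a single boolean in nova.conf. The fragment below is illustrative only; the exact option defaults vary by release and should be checked against the Nova configuration reference, not taken from here.]

```ini
[libvirt]
# Deprecated free-form flag list being removed (example value):
# live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED

# Replacement boolean. Tunnelling routes migration traffic through
# the libvirtd connection, trading performance for not needing the
# hypervisor migration ports open between hosts.
live_migration_tunnelled = false
```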