14:00:17 #startmeeting Nova Live Migration
14:00:21 Meeting started Tue Aug 2 14:00:17 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:25 The meeting name has been set to 'nova_live_migration'
14:00:30 hi everyone
14:00:32 o/
14:00:39 hi
14:01:10 agenda - https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:19 * kashyap waves
14:01:44 o/
14:01:52 let's wait a minute for others, and then we'll start
14:02:01 hi
14:02:12 o/
14:02:47 so
14:02:50 #topic Libvirt image backend
14:03:21 mdbooth: any updates on that?
14:03:32 * andrearosa is late
14:03:47 tdurakov: I sent a big email the week before last, and was on vacation last week
14:04:02 The changes are gradually merging
14:04:28 mdbooth: anything to help with?
14:04:32 There's a specific change I'd like to call out, because it changes live migration quite a bit
14:04:32 or just reviews?
14:04:35 * mdbooth finds the link
14:05:09 https://review.openstack.org/#/c/342224/
14:05:56 * tdurakov starred the change
14:06:23 Note that's in the middle of a very long series
14:06:38 #action review this https://review.openstack.org/#/c/342224/
14:06:49 In general there are very few functional changes in the series, but that one is a functional change.
14:07:01 mdbooth: could you share the very bottom patch to follow?
14:07:13 tdurakov: Hah
14:07:32 mdbooth: the very bottom one that still requires review
14:07:33 It's currently https://review.openstack.org/#/c/344168/
14:07:45 But that's about 20 patches prior to the above.
14:08:15 I need reviews on those too, but if you only review one really closely, please look at the pre live migration one
14:08:41 #link https://review.openstack.org/#/c/344168/ - current bottom change for the series
14:09:40 mdbooth: ok, anything to discuss on this?
14:10:38 The pre live migration patch is the most relevant thing.
14:10:45 Apart from that, all reviews welcome.
14:10:49 mdbooth: acked, will take a look
14:11:12 #action review the Libvirt image backend series
14:11:57 let's move on then
14:12:11 #topic Storage pools
14:12:50 paul-carlton2: anything to discuss on this topic?
14:13:12 did danpb review the storage pools spec yet?
14:13:21 Would like to get the specs approved in the next few days if possible
14:13:30 mriedem, nope
14:14:03 paul-carlton2: this one: https://review.openstack.org/#/c/310505/ right?
14:14:14 but it doesn't matter if not; I'll be working on the implementation when I get back from holiday and will resubmit the specs for Ocata anyway
14:14:45 yep, and https://review.openstack.org/#/c/310538/
14:15:18 the plan is to work on this and get some of the implementation done so it can be completed in Ocata
14:15:48 paul-carlton2: acked
14:16:43 let's go to the next topic then
14:16:49 some parts of the implementation depend on the work mdbooth is doing, but there is some work that doesn't
14:16:56 ta
14:17:34 paul-carlton2: Are you likely to work on the local root BDM thing?
14:17:58 Also, BDMs for config disks
14:19:04 mdbooth, nope, Paul Murray changed his mind and said I should focus on the libvirt storage pools stuff when I told him Diane was working on this
14:19:14 paul-carlton2: Ok, np.
14:20:08 so...
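For anyone picking up the review actions above, the changes can be pulled down for local review with git-review; a minimal sketch, assuming a nova checkout with git-review configured:

    # fetch the pre live migration change (with its ancestors) into a local branch
    git review -d 342224
    # or start from the current bottom of the series
    git review -d 344168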
14:20:12 #topic CI
14:20:28 https://bugs.launchpad.net/nova/+bug/1524898 - still valid
14:20:28 Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:20:44 I've asked cinder folks to take a look
14:20:56 the previous bet was that it was iSCSI config, wasn't it?
14:21:10 davidgiluk: yes, I think so
14:21:17 tdurakov: I've checked a few times on -cinder IRC in the past few weeks, just radio silence
14:21:32 Even with specific pointers to the current state of analysis on the bug.
14:21:58 Seems like this is one of those bugs that'd just rot away without any attention, due to a proper lack of coordination
14:22:16 #action tdurakov to start a thread on the ML for the cinder-nova teams
14:22:24 mriedem: any ideas?
14:22:30 kashyap: yes :(
14:22:40 * mriedem hasn't been following
14:22:40 tdurakov: Raising it on the mailing list is the best bet
14:22:51 With a proper action item for Cinder folks with iSCSI / kernel expertise.
14:23:02 kashyap: yes
14:23:02 oh,
14:23:05 agree
14:23:10 mriedem: No worries, you could catch up with the summary on the list
14:23:14 i don't have anything if hemna or danpb aren't looking at it
14:23:34 my feeling is,
14:23:40 mriedem: danpb and davidgiluk narrowed the issue down to kernel/iSCSI, if you look at the bug's analysis
14:23:45 if that test is keeping us from making the live migration job voting, we should skip it
14:24:04 would in-qemu iscsi help?
14:24:06 we really should fix the test
14:24:10 is that available in xenial?
14:24:21 davidgiluk: fix the test or fix the bug?
14:24:21 mriedem: We should understand the problem before changing it
14:24:31 mriedem: in-qemu iscsi doesn't (didn't?) support multipath
14:24:40 this is multipath?
14:24:55 davidgiluk: as I understood, mriedem proposes to temporarily skip this test, right?
14:24:56 Not afaik, but it means it's not a functional replacement yet
14:25:18 mdbooth: ok, but we don't use multipath in the gate anywhere as far as i know,
14:25:19 These are the iSCSI errors that the kernel is throwing:
14:25:19 Jun 30 14:28:09 ubuntu-xenial-2-node-ovh-gra1-2121639 iscsid[525]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
14:25:28 so i was thinking if in-qemu iscsi fixes this for the live migration job, we should use that
14:25:30 if available
14:25:40 Have we implemented in-qemu iscsi?
14:25:43 is there any possible hack workaround we can do in the code?
14:25:44 mriedem: I would say we should not do that - we should understand the problem
14:25:52 davidgiluk: ideally yes,
14:25:56 davidgiluk: but who's doing that?
14:26:11 mriedem: Do we not have any friendly iscsi people we know?
14:26:15 i don't want to keep the live migration job non-voting forever just because of this one test that no one is working on
14:26:25 mdbooth: I think this is what you were looking for - https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/qemu-built-in-iscsi-initiator.html
14:26:28 davidgiluk: hemna, but i'm sure he's preoccupied
14:26:38 mdbooth: yeah that ^
14:26:49 but ubuntu kept the patch out of their qemu package
14:26:58 at least
kashyap: Was it implemented?
14:27:06 mdbooth: yeah
14:27:15 Ah, okay. Was trying to confirm that
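For reference, the spec linked above wires QEMU's built-in iSCSI initiator in through nova's libvirt volume driver mapping; a sketch of the nova.conf knob, assuming the Kilo-era option name and class path:

    [libvirt]
    # route iscsi volumes through QEMU's built-in initiator
    # (LibvirtNetVolumeDriver) instead of the kernel initiator
    volume_drivers = iscsi=nova.virt.libvirt.volume.LibvirtNetVolumeDriver

As comes up just below, though, this path isn't exercised in the gate.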
14:27:16 but a total "you have to patch qemu yourself to use this"
14:27:22 * mdbooth wonders if it gets tested
14:27:26 it does not
14:27:33 * mdbooth suspects it's broken :)
14:27:40 the patch has details, but ubuntu didn't carry the in-qemu iscsi support
14:27:45 in their package
14:27:49 It would be a substantially different code path
14:27:53 mriedem: We don't even know if in-qemu iscsi would fix the problem
14:27:59 davidgiluk: i realize,
14:28:05 but it's a thread to pull on
14:28:05 right?
14:28:19 mriedem: +
14:28:42 mriedem: I agree that it would be interesting diagnostically to know if in-qemu iscsi made it go away.
14:29:05 well, i thought danpb's long-term vision was that all things would be qemu native
14:29:06 any volunteers on that?
14:29:41 mriedem: Right, it would be awesome. I understood the blocker was just multipath. I didn't even know it was implemented.
14:30:19 the qemu blocker
14:30:23 not the gate job blocker
14:30:30 i'm starting to speak your language :)
14:30:48 anyway, maybe we take a note that we should investigate in-qemu iscsi in the live migration job
14:30:58 (9:29:50 AM) danpb: then again, it might give us a nicer error message in qemu that actually shows us the real problem
14:30:58 (9:30:33 AM) danpb: as the error reporting from the kernel iscsi client is awful (and that's being polite)
14:31:01 so... my proposal for this: temporarily exclude this test from the l-m job, and start investigating it
14:31:15 tdurakov: i'm fine with that
14:31:20 we'd still have it in the multinode job
14:31:28 right
14:31:35 i would like to get more stable runs on the l-m job though
14:31:39 so we can start digging out
14:32:09 btw, has anyone reproduced this locally?
14:33:09 that's kind of the problem^
14:33:31 #action tdurakov to skip the test in the live-migration job
14:34:01 #action find a volunteer for the underlying bug; will make a call on the ML
14:34:11 let's move on
14:34:38 tdurakov: If anybody has a paying customer hitting this, getting cycles for a reproducer should be simple
14:34:38 just fyi https://review.openstack.org/#/c/329466/ - updated patch, so if it's ok we could enable nfs again soon
14:34:55 http://packages.ubuntu.com/xenial/qemu-block-extra has the package we need
14:36:02 #topic Migration object
14:36:20 I want to discuss the usage of the migration object in nova
14:36:47 it turns out that for resize/evacuate it's being created implicitly during the claim
14:36:59 so it's kind of related to the live-migration thing
14:37:10 that I'd like to change
14:37:31 I'd prefer to create it explicitly in conductor instead
14:37:37 thoughts?^
14:38:07 Without looking at the code, explicit always wins for me.
14:39:37 ok, will send a mail with details on that
14:39:56 #topic Plan for Ocata
14:40:19 as I already understood, it will be Storage pools
14:40:32 anything else that requires a bp/spec?
14:40:51 from my side it will be an FSM for migrations; working on that now
14:40:58 anything else?
14:41:00 I would really like to take a hard look at how we negotiate shared storage
14:41:17 Right now, working out what's shared and what's not between 2 hosts is a mess
14:41:59 mdbooth: big + on that
14:42:34 I'd also add this one: post-copy interrupts networking
14:42:35 I had an idea to be explicit about it somehow, i.e. have the target communicate what it already has to the source.
14:43:20 mdbooth: could work
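A rough illustration of mdbooth's idea (not Nova code; the identity scheme is invented for the example): instead of inferring shared storage from boolean flags, the destination reports concrete storage identities and the source compares them, which would also distinguish two hosts sitting on separate Ceph clusters:

    # illustrative sketch only: negotiate shared storage by comparing
    # storage identities (e.g. filesystem UUID, or Ceph fsid + pool name)
    # reported by the destination, rather than shared/not-shared booleans
    def negotiate_shared_storage(source_disks, dest_identities):
        """Return the disk paths the destination can already access.

        source_disks: dict mapping disk path -> storage identity string
        dest_identities: set of storage identities visible on the destination
        """
        return {path for path, ident in source_disks.items()
                if ident in dest_identities}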
14:44:39 tdurakov: I think Luis said he was away this week for that post-copy/networking one - I'm assuming he's back next week but not sure
14:44:40 the way the migrate_data object carries a dozen flags for shared/not shared makes this tricky every time
14:45:05 tdurakov: Right, they're unfathomable, and they still don't cover the edge cases
14:45:24 davidgiluk: ok
14:45:31 like 2 hosts which use separate ceph setups
14:45:40 currently marked as shared
14:46:00 yes, looks like a big work item for ocata
14:46:14 we have ~10 minutes, so let's move on
14:46:19 * mdbooth can't guarantee the cycles to work on it, but if anybody's interested...
14:46:27 #topic Networking
14:46:55 any updates on this one: https://review.openstack.org/#/c/275073?
14:48:11 #action figure out the status of setup_networks_on_host for Neutron
14:49:23 johnthetubaguy: hi, any updates on this item: Future port info spec to be worked on for Ocata
14:50:37 the same action then
14:50:56 so next topic
14:51:00 ah, so yes, same action
14:51:19 been looking into that, but not yet at the bottom of things, mostly due to holiday at the end of last week and yesterday
14:51:33 there is a patch in review we want to get merged, which should help
14:51:51 johnthetubaguy: acked, thanks for the update
14:51:59 I am booked to go to the neutron midcycle to help talk about the plan for next cycle
14:52:12 so let me know if there are things folks want raised there
14:53:28 #action reach out to johnthetubaguy with nova-neutron items to be discussed during the Neutron midcycle
14:53:54 #topic Open discussion
14:54:35 https://bugs.launchpad.net/nova/+bug/1597644
14:54:35 Launchpad bug 1597644 in OpenStack Compute (nova) "Quobyte: Permission denied on console.log during instance startup" [High,Fix released] - Assigned to Silvan Kaiser (2-silvan)
14:54:44 This bug came out of my series the other day
14:54:56 However, the thing I'd like to discuss here is
14:55:22 The bug describes that Quobyte CI deliberately configures cinder and nova to be able to write to each other's instance files
14:55:39 Can anybody think of a reason that they would do that, or how it might not be broken?
14:56:27 * tdurakov hasn't seen Quobyte and its CI yet
14:56:42 * mdbooth hadn't heard of it until his patch got reverted ;)
14:56:58 However, that was easily worked around
14:57:14 Shared access to storage between cinder and nova just sounds scary
14:58:18 i have one more thing
14:58:44 i just sent an e-mail to the os-dev list about removing live_migration_flag and what to do with live_migration_tunnelled
14:58:45 http://lists.openstack.org/pipermail/openstack-dev/2016-August/100657.html
14:59:14 if you can, take a look at it; this VIR_MIGRATE_TUNNELLED flag has been a pain for a really long time now, and maybe this is a good time to get rid of it
14:59:23 that's all :)
14:59:41 mdbooth: agree on that, need to walk through the code
15:00:02 pkoniszewski: flags... right
15:00:08 * amrith coughs discreetly in the back of the room
15:00:15 ohhai
15:00:37 pkoniszewski: I'd expect we remove it
15:00:39 anyway
15:00:48 amrith: are we crashing the nova meeting rn?
15:00:50 it looks like we need to end
15:00:55 #endmeeting
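For context on the thread linked above: the proposal replaces the old free-form libvirt flag list with a single boolean that nova translates into migration flags itself; roughly, assuming the Mitaka/Newton-era option names:

    [libvirt]
    # deprecated: raw libvirt flags, including VIR_MIGRATE_TUNNELLED
    #live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED
    # replacement: tunnelling on/off as a single switch
    live_migration_tunnelled = False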