14:00:04 <PaulMurray> #startmeeting Nova Live Migration
14:00:08 <openstack> Meeting started Tue Sep 6 14:00:04 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:12 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:32 <PaulMurray> Hello everyone
14:00:49 <davidgiluk> hi
14:00:51 <PaulMurray> As usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:00:55 <mdbooth> o/
14:01:14 <luis5tb> o/
14:01:28 <pkoniszewski> o/
14:01:29 <PaulMurray> Hopefully everyone is back from summer now
14:01:32 <PaulMurray> I am
14:02:05 <mdbooth> Apparently summer's about to restart. It had previously ended, though. Here's hoping.
14:02:36 <PaulMurray> We are past feature freeze now, so I thought we would skip feature stuff for now
14:02:46 <PaulMurray> #topic CI
14:03:02 <PaulMurray> I have been trying to catch up
14:03:12 <PaulMurray> I saw something about flaky NFS?
14:03:18 <PaulMurray> Any news
14:04:14 <PaulMurray> I guess that's a no
14:04:47 <PaulMurray> If no-one has anything to say we'll go on
14:05:01 <PaulMurray> #topic Bugs
14:05:07 <pkoniszewski> tdurakov is on vacation, so no update probably
14:05:23 <PaulMurray> pkoniszewski, thanks
14:05:28 <mdbooth> johnthetubaguy: https://bugs.launchpad.net/nova/+bug/1605016
14:05:28 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:06:07 <mdbooth> I wondered if we could add a 'if post-copy: do_network_stuff()' at the end of pre_live_migration()
14:06:25 <PaulMurray> I don't know if johnthetubaguy is here, but I saw the comments
14:06:25 <pkoniszewski> well, we can also go other way
14:06:51 <pkoniszewski> we do have a control on this switch
14:06:59 <davidgiluk> luis5tb: did you have a chance to look at that postcopy network stuff?
14:07:02 <pkoniszewski> i mean it's nova decision to switch pre-copy to post-copy
14:07:07 <raj_singh> mdbooth: johnthetubaguy is on vacation till Wednesday
14:07:13 <mdbooth> raj_singh: Ok, thanks.
14:07:30 <PaulMurray> pkoniszewski, but nova doesn't know when the switch has happened
14:07:46 <pkoniszewski> it doesn't?
14:08:00 <PaulMurray> i don't think it's synchronous is it?
14:08:09 <pkoniszewski> davidgiluk: ?
14:08:12 <pkoniszewski> davidgiluk: ^
14:08:12 <PaulMurray> I think nova triggers the switch
14:08:24 <davidgiluk> pkoniszewski: I only know it from the qemu side; luis5tb ^
14:08:26 <luis5tb> davidgiluk: no, not yet. I'm moving so I've been pretty busy the last weeks
14:08:30 <mdbooth> The thing is, if we're going to add a new callback it needs to be asynchronous, and concurrent with the migration.
14:08:31 <PaulMurray> danpb, said there is an event at the destination that we can pick up
14:08:52 <danpb> yes, libvirt emits an event when post-copy starts
14:08:53 <PaulMurray> I noted it in the bug
14:08:58 <mdbooth> danpb: Can we trigger nova code with context at the destination?
14:09:03 <danpb> so you can use that to trigger the network change
14:09:17 <danpb> on source it'll emit PAUSED + reason POST_COPY
14:09:18 <mdbooth> So libvirt can give us an async callback?
14:09:23 <danpb> on target it'll emit RUNNING + reason POST_COPY
14:09:39 <danpb> nova already has an event handler registered with libvirt that'll be receiving these
14:09:55 <danpb> so you merely need to wire up some code to trigger on it
14:10:10 <mdbooth> PaulMurray: I see your comment now. Read backwards and missed it.
14:10:15 <danpb> shouldn't be any more complex than what nova already has to do to switch network when pre-copy finished
14:11:10 <mdbooth> Is this considered a bug that's eligible for Newton?
14:11:32 <mdbooth> mriedem: ^^^ ?
14:11:58 <mriedem> how much backscroll do i need to read here?
14:12:03 <danpb> if it isn't considered eligible then you pretty much have to block any use of post-copy in Newton
14:12:09 <mdbooth> mriedem: About 5 lines.
14:12:10 <danpb> as it is useless if network is fubar
14:12:10 <PaulMurray> I'm not sure it's super critical
14:12:18 <PaulMurray> but it would be good if it can be done
14:12:30 <mriedem> talking about https://bugs.launchpad.net/nova/+bug/1605016 ?
14:12:30 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:12:31 <danpb> PaulMurray: complete lack of working network while in post-copy phase seems pretty critical to me
14:12:34 <mdbooth> mriedem: Yes
14:12:49 <mriedem> bugs are fair game
14:12:56 <mdbooth> So if this will be considered in Newton, I don't mind taking this.
14:13:05 <PaulMurray> danpb, yeah, but I meant we can do migration without it - so it's not a show stopper
14:13:06 <mriedem> is this like 100% fail with post-copy?
14:13:20 <mdbooth> mriedem: No, just no networking until it completes.
14:13:25 <davidgiluk> which is pretty bad
14:13:30 <mdbooth> Which is a potentially long time, of course.
14:13:35 <danpb> that may be anywhere from a few milliseconds to many minutes
14:13:57 <pkoniszewski> mdbooth: is that true for ANY neutron configuration?
14:14:15 <PaulMurray> that's why the problem is there - testing with vms that aren't really doing much means you don't notice
14:14:25 <PaulMurray> but with a busy vm it would be a problem
14:14:25 <mdbooth> pkoniszewski: I haven't looked at it in detail, yet, but I don't see how it could be otherwise?
14:14:57 <PaulMurray> pkoniszewski, it's not neutron that makes the difference
14:15:08 <PaulMurray> it's the time it takes to complete the post-copy
14:15:15 <pkoniszewski> c
14:15:17 <pkoniszewski> sorry
14:15:48 <mdbooth> Ok, sounds like we agree this is pretty bad and eligible for Newton.
14:15:53 <mdbooth> I'll have a stab at it.
14:16:04 <davidgiluk> mdbooth: Prod me if you need any help understanding postcopy
14:16:12 <mdbooth> davidgiluk: Will do, thanks.
14:16:15 <danpb> mdbooth: you should wear a stab proof vest, because live migration always stabs back :-P
14:16:22 <davidgiluk> haha
14:16:28 <PaulMurray> mdbooth, thanks - I'll see if we can help too
14:16:36 <PaulMurray> at least with testing
14:16:54 <pkoniszewski> PaulMurray: just checked that nova-network does not require post copy steps to keep networking up
14:16:56 <mriedem> i've tagged it with neutron-rc-potential
14:17:04 <luis5tb> I'm happy to help too, but as this is high prio and I will not have too much time available within the next couple of weeks I prefer to not be the responsible one!
14:17:05 <pkoniszewski> and i'm pretty sure that it is only DVR case for neutron
14:17:22 <PaulMurray> #action mdbooth to fix https://bugs.launchpad.net/nova/+bug/1605016
14:17:22 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:18:11 <PaulMurray> Do we have any other critical bugs ?
14:18:23 <pkoniszewski> yeah, i think
14:18:32 <pkoniszewski> this fixes regression that we introduced in newton
14:18:38 <pkoniszewski> https://review.openstack.org/#/c/358599/
14:19:17 <PaulMurray> I saw that just now
14:19:24 <pkoniszewski> basically because of this bug everyone needs to set vnc/spice to listen all/localhost
14:19:53 <PaulMurray> is the patch close to done ?
14:20:17 <pkoniszewski> i think that it can be merged in the current state
14:20:34 <mriedem> i'll take a look at it after meetings
14:20:38 <markus_z> This fixes an older issue with live migration + serial console ports: https://review.openstack.org/#/c/275801/
14:21:04 <markus_z> I wasn't sure if you're aware of that one.
14:21:49 <pkoniszewski> another one is that it is impossible to use dedicated interfaces for live migration using our default configuration - https://review.openstack.org/#/c/356558/
14:21:56 <pkoniszewski> basically it has never worked
14:22:18 <PaulMurray> There are related changes - this one has +2 https://review.openstack.org/#/c/335132/7
14:22:35 <danpb> pkoniszewski: not entirely true - it worked fine if using VIR_MIGRATE_TUNNELLED
14:22:42 <danpb> pkoniszewski: which used to be our default until recently
14:22:51 <danpb> but i agree we should fix it
14:23:14 <pkoniszewski> i wasn't fast enough to add that it works while tunnelling is on, which is not our default configuration anymore
14:23:17 <pkoniszewski> thanks danpb
14:23:42 <danpb> IOW, this is a regression from previous state, so we should fix it
14:24:06 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1455252
14:24:06 <openstack> Launchpad bug 1455252 in OpenStack Compute (nova) "enabling serial console breaks live migration" [High,In progress] - Assigned to sahid (sahid-ferdjaoui)
14:24:07 <mriedem> plug for long-standing live migration + dvr change that neutron wants in for newton: https://review.openstack.org/#/c/275073/ - i have a todo to get back to reviewing that again
14:24:49 <PaulMurray> mriedem, yes, that's on the agenda
14:24:59 <PaulMurray> just noting these console bugs
14:25:02 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1595962
14:25:02 <openstack> Launchpad bug 1595962 in OpenStack Compute (nova) "live migration with disabled vnc/spice not possible" [Medium,In progress] - Assigned to Markus Zoeller (markus_z) (mzoeller)
14:25:32 <mriedem> PaulMurray: if we have regressions in newton then we should tag those bugs with newton-rc-potential
14:25:39 <mriedem> at least to get them attention in the next 2 weeks
14:25:48 <PaulMurray> ok - will do
14:25:50 <mriedem> RC1 is in 2 weeks
14:26:07 <PaulMurray> mriedem, the neutron one still has a failing tempest test
14:26:30 <PaulMurray> is anyone looking at that ?
14:26:57 <PaulMurray> https://review.openstack.org/#/c/286855/
14:27:31 <PaulMurray> that's the test
14:27:34 <PaulMurray> the fix is:
14:27:36 <PaulMurray> https://review.openstack.org/#/c/275073
14:27:36 <mriedem> PaulMurray: just to be clear, that's a WIP test
14:27:46 <mriedem> i mean, it's throw away, just to test that nova dvr change
14:28:02 <mriedem> gate-tempest-dsvm-neutron-dvr-multinode-full is the job in that tempest change we care about
14:28:04 <mriedem> which is passing
14:28:14 <PaulMurray> ah, good news
14:28:40 <PaulMurray> So we just need review on that fix then
14:28:51 <mriedem> well i'm rechecking the tempest fix
14:28:59 <mriedem> to run that job again, it hasn't run on the latest nova PS
14:29:00 <mriedem> but yeah
14:29:18 <mriedem> i reviewed the nova change a couple of weeks ago, some perf improvements and lots of missing unit tests, but otherwise it looks pretty close
14:30:29 <PaulMurray> #action ALL look at https://review.openstack.org/#/c/275073
14:30:45 <PaulMurray> Any others to mention
14:31:21 <mriedem> not a bug, but,
14:31:36 <mriedem> i haven't yet gone through the libvirt imagebackend refactor series to start -2ing
14:31:40 <mriedem> but expect that early this week
14:31:54 <mriedem> i think that's the last bp i have left to sort out for what's in FF and what's not
14:32:08 <PaulMurray> The list is here https://bugs.launchpad.net/nova/+bugs?field.tag=live-migration+
14:32:25 <mdbooth> mriedem: Thanks :)
14:32:27 <PaulMurray> There are a number of 'high' importance bugs in progress
14:32:59 <PaulMurray> markus_z, have you been checking if bugs actually are in progress ?
14:34:42 <PaulMurray> There are about 7 or 8 and we have discussed some already
14:34:48 <PaulMurray> so let's have a quick look at the others
14:35:23 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1419577
14:35:23 <openstack> Launchpad bug 1419577 in OpenStack Compute (nova) "when live-migrate failed, lun-id couldn't be rollback in havana" [High,In progress] - Assigned to Lee Yarwood (lyarwood)
14:36:08 <PaulMurray> anyone know about this? it has a fix here https://review.openstack.org/#/c/338929/
14:36:27 <PaulMurray> it's by lee yarwood
14:37:22 <PaulMurray> The next two we know about
14:37:43 <mdbooth> Lee is off for a bit
14:37:50 <mdbooth> Not sure when he's back
14:37:55 <PaulMurray> the fourth is the volume breaks CI one
14:38:00 <PaulMurray> ok mdbooth
14:38:26 <PaulMurray> This one looks quite specific
14:38:30 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1566622
14:38:30 <openstack> Launchpad bug 1566622 in OpenStack Compute (nova) "live migration fails with xenapi virt driver and SRs with old-style naming convention" [High,In progress] - Assigned to Corey Wright (coreywright)
14:39:03 <PaulMurray> the fix that was up for review was abandoned in july
14:39:53 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1600251
14:39:53 <openstack> Launchpad bug 1600251 in OpenStack Compute (nova) "live migration does not honor server group policy" [High,In progress] - Assigned to Paul Carlton (paul-carlton2)
14:40:27 <PaulMurray> this has a fix up that should be easy to finish off
14:40:29 <PaulMurray> https://review.openstack.org/#/c/339588/
14:40:33 <paul-carlton2> I looked at that this morning, will fix it when done with current task
14:40:48 <PaulMurray> thanks paul-carlton2
14:41:23 <PaulMurray> the next one we discussed already
14:41:45 <PaulMurray> then we have
14:41:47 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1607996
14:41:47 <openstack> Launchpad bug 1607996 in OpenStack Compute (nova) "Live migration does not update numa hugepages info in xml" [High,In progress]
14:42:00 <PaulMurray> unassigned
14:43:09 <PaulMurray> danpb, do you know anything about this ? I know you haven't commented on the bug
14:43:17 * mdbooth wonders why sfinucan unassigned himself
14:44:23 <danpb> PaulMurray: it's not entirely surprising - i mean the original numa/hugepage code never tried to make migration work
14:44:45 * danpb doesn't even know if we update NUMA topology upon migration yet, let alone hugepages
14:45:29 <PaulMurray> danpb, I think that's all work in progress at the moment - along with claims on destination etc
14:45:36 <danpb> yeah
14:45:59 <danpb> there's probably another open bug somewhere you can mark it a duplicate of
14:46:11 <PaulMurray> I'll take a look
14:46:40 <mdbooth> danpb: Is numa topology allowed to change during live migration?
14:46:48 <PaulMurray> That's all the high importance bugs.
14:47:06 <danpb> mdbooth: guest NUMA topology cannot change but the host<->guest placement can
14:47:15 <mdbooth> Got it.
14:47:17 <danpb> it's just that nova has never tried to
14:47:19 <PaulMurray> the only ones not in progress are the post-copy network problem and the volume based live migration one that's messing up CI
14:47:29 <danpb> because it has complex interactions in the scheduler that need fixing first
14:48:31 <PaulMurray> most have reviews up, so let's try to get those in at least.
14:48:46 <PaulMurray> I'll make a list and put it on the ML and our tracking page
14:49:06 <PaulMurray> Feel free to add if something needs attention that is not there
14:49:22 <PaulMurray> #topic Open discussion
14:49:30 <PaulMurray> anything left over to discuss ?
14:50:37 <PaulMurray> I think we're done
14:50:45 <PaulMurray> Thanks for coming
14:50:53 <PaulMurray> #endmeeting
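
Editor's note on the post-copy discussion (14:08 to 14:10): danpb describes an event with PAUSED + reason POST_COPY on the source and RUNNING + reason POST_COPY on the target, which could be used to trigger the network switch on the destination as soon as post-copy starts. The sketch below only illustrates that dispatch idea. It is not nova or libvirt code: `LifecycleEvent`, `handle_lifecycle_event`, the `host_role` strings, and the `switch_network` callback are all hypothetical names, and the state and reason strings simply mirror the wording used in the meeting rather than real libvirt constants.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LifecycleEvent:
    """Hypothetical stand-in for a lifecycle event seen by the compute host."""
    instance_uuid: str
    state: str   # "PAUSED" or "RUNNING", as named in the discussion
    reason: str  # e.g. "POST_COPY"


def handle_lifecycle_event(
    event: LifecycleEvent,
    host_role: str,                         # "source" or "destination" (assumed labels)
    switch_network: Callable[[str], None],  # hypothetical hook that moves networking
) -> bool:
    """Trigger the network switch when post-copy starts on the destination.

    Returns True if the switch was triggered, False otherwise.
    """
    if event.reason != "POST_COPY":
        return False  # unrelated event, nothing to do here
    if host_role == "destination" and event.state == "RUNNING":
        # The guest is now executing on the destination, so networking
        # should follow it without waiting for the migration to complete.
        switch_network(event.instance_uuid)
        return True
    # On the source the guest is merely paused; no switch is triggered there.
    return False


# Usage: record which instances had their network switched.
switched = []
handle_lifecycle_event(LifecycleEvent("uuid-1", "PAUSED", "POST_COPY"), "source", switched.append)
handle_lifecycle_event(LifecycleEvent("uuid-1", "RUNNING", "POST_COPY"), "destination", switched.append)
print(switched)
```

Keeping the switch behind a callback reflects mdbooth's point that any new hook has to run asynchronously and concurrently with the migration; how that callback would actually be wired into nova's existing libvirt event handler is not covered in the log.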