14:00:04 #startmeeting Nova Live Migration
14:00:08 Meeting started Tue Sep 6 14:00:04 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:09 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:12 The meeting name has been set to 'nova_live_migration'
14:00:32 Hello everyone
14:00:49 hi
14:00:51 As usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:00:55 o/
14:01:14 o/
14:01:28 o/
14:01:29 Hopefully everyone is back from summer now
14:01:32 I am
14:02:05 Apparently summer's about to restart. It had previously ended, though. Here's hoping.
14:02:36 We are past feature freeze now, so I thought we would skip feature stuff for now
14:02:46 #topic CI
14:03:02 I have been trying to catch up
14:03:12 I saw something about flaky NFS?
14:03:18 Any news?
14:04:14 I guess that's a no
14:04:47 If no-one has anything to say we'll go on
14:05:01 #topic Bugs
14:05:07 tdurakov is on vacation, so no update probably
14:05:23 pkoniszewski, thanks
14:05:28 johnthetubaguy: https://bugs.launchpad.net/nova/+bug/1605016
14:05:28 Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:06:07 I wondered if we could add a 'if post-copy: do_network_stuff()' at the end of pre_live_migration()
14:06:25 I don't know if johnthetubaguy is here, but I saw the comments
14:06:25 well, we can also go the other way
14:06:51 we do have control of this switch
14:06:59 luis5tb: did you have a chance to look at that postcopy network stuff?
14:07:02 i mean it's nova's decision to switch pre-copy to post-copy
14:07:07 mdbooth: johnthetubaguy is on vacation till Wednesday
14:07:13 raj_singh: Ok, thanks.
14:07:30 pkoniszewski, but nova doesn't know when the switch has happened
14:07:46 it doesn't?
14:08:00 i don't think it's synchronous, is it?
14:08:09 davidgiluk: ?
14:08:12 davidgiluk: ^
14:08:12 I think nova triggers the switch
14:08:24 pkoniszewski: I only know it from the qemu side; luis5tb ^
14:08:26 davidgiluk: no, not yet. I'm moving so I've been pretty busy the last few weeks
14:08:30 The thing is, if we're going to add a new callback it needs to be asynchronous, and concurrent with the migration.
14:08:31 danpb said there is an event at the destination that we can pick up
14:08:52 yes, libvirt emits an event when post-copy starts
14:08:53 I noted it in the bug
14:08:58 danpb: Can we trigger nova code with context at the destination?
14:09:03 so you can use that to trigger the network change
14:09:17 on source it'll emit PAUSED + reason POST_COPY
14:09:18 So libvirt can give us an async callback?
14:09:23 on target it'll emit RUNNING + reason POST_COPY
14:09:39 nova already has an event handler registered with libvirt that'll be receiving these
14:09:55 so you merely need to wire up some code to trigger on it
14:10:10 PaulMurray: I see your comment now. Read backwards and missed it.
14:10:15 shouldn't be any more complex than what nova already has to do to switch network when pre-copy finishes
14:11:10 Is this considered a bug that's eligible for Newton?
14:11:32 mriedem: ^^^ ?
14:11:58 how much backscroll do i need to read here?
14:12:03 if it isn't considered eligible then you pretty much have to block any use of post-copy in Newton
14:12:09 mriedem: About 5 lines.
14:12:10 as it is useless if network is fubar
14:12:10 I'm not sure it's super critical
14:12:18 but it would be good if it can be done
14:12:30 talking about https://bugs.launchpad.net/nova/+bug/1605016 ?
14:12:30 Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:12:31 PaulMurray: complete lack of working network while in post-copy phase seems pretty critical to me
14:12:34 mriedem: Yes
14:12:49 bugs are fair game
14:12:56 So if this will be considered in Newton, I don't mind taking this.
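A minimal sketch of the wiring described at 14:08:52 to 14:09:55, assuming the existing event handler receives a (state, reason) pair for the instance. The State, Reason and Side enums and both function names are hypothetical stand-ins for illustration; they are not libvirt or nova identifiers, and real code would use the constants from the libvirt bindings. Triggering only on the target side is also an assumption, based on the remark about picking up the event at the destination.

```python
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    """Hypothetical stand-in for the domain state carried by an event."""
    PAUSED = "paused"
    RUNNING = "running"


class Reason(Enum):
    """Hypothetical stand-in for the reason attached to the state change."""
    POST_COPY = "post-copy"
    OTHER = "other"


class Side(Enum):
    SOURCE = "source"
    TARGET = "target"


def postcopy_side(state: State, reason: Reason) -> Optional[Side]:
    """Map an event to the host it was seen on, per the log:
    PAUSED + POST_COPY on the source, RUNNING + POST_COPY on the target.
    Any other combination is not a post-copy switch and yields None."""
    if reason is not Reason.POST_COPY:
        return None
    if state is State.PAUSED:
        return Side.SOURCE
    if state is State.RUNNING:
        return Side.TARGET
    return None


def on_lifecycle_event(state: State, reason: Reason,
                       switch_network: Callable[[], None]) -> bool:
    """Call switch_network() when post-copy starts on the target.
    Returns True if the callback fired, False otherwise."""
    if postcopy_side(state, reason) is Side.TARGET:
        switch_network()
        return True
    return False


if __name__ == "__main__":
    fired = []
    on_lifecycle_event(State.RUNNING, Reason.POST_COPY,
                       lambda: fired.append("network switched"))
    print(fired)
```

The callback is passed in rather than hard-coded so that the classification can be checked without any networking code; what the callback actually does to move the ports is outside the scope of this sketch.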
14:13:05 danpb, yeah, but I meant we can do migration without it - so it's not a show stopper
14:13:06 is this like 100% fail with post-copy?
14:13:20 mriedem: No, just no networking until it completes.
14:13:25 which is pretty bad
14:13:30 Which is a potentially long time, of course.
14:13:35 that may be anywhere from a few milliseconds to many minutes
14:13:57 mdbooth: is that true for ANY neutron configuration?
14:14:15 that's why the problem is there - testing with VMs that aren't really doing much means you don't notice
14:14:25 but with a busy vm it would be a problem
14:14:25 pkoniszewski: I haven't looked at it in detail, yet, but I don't see how it could be otherwise?
14:14:57 pkoniszewski, it's not neutron that makes the difference
14:15:08 it's the time it takes to complete the post-copy
14:15:15 c
14:15:17 sorry
14:15:48 Ok, sounds like we agree this is pretty bad and eligible for Newton.
14:15:53 I'll have a stab at it.
14:16:04 mdbooth: Prod me if you need any help understanding postcopy
14:16:12 davidgiluk: Will do, thanks.
14:16:15 mdbooth: you should wear a stab proof vest, because live migration always stabs back :-P
14:16:22 haha
14:16:28 mdbooth, thanks - I'll see if we can help too
14:16:36 at least with testing
14:16:54 PaulMurray: just checked that nova-network does not require post copy steps to keep networking up
14:16:56 i've tagged it with neutron-rc-potential
14:17:04 I'm happy to help too, but as this is high prio and I will not have too much time available within the next couple of weeks I prefer to not be the responsible one!
14:17:05 and i'm pretty sure that it is only the DVR case for neutron
14:17:22 #action mdbooth to fix https://bugs.launchpad.net/nova/+bug/1605016
14:17:22 Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:18:11 Do we have any other critical bugs ?
14:18:23 yeah, i think
14:18:32 this fixes a regression that we introduced in newton
14:18:38 https://review.openstack.org/#/c/358599/
14:19:17 I saw that just now
14:19:24 basically because of this bug everyone needs to set vnc/spice to listen all/localhost
14:19:53 is the patch close to done ?
14:20:17 i think that it can be merged in the current state
14:20:34 i'll take a look at it after meetings
14:20:38 This fixes an older issue with live migration + serial console ports: https://review.openstack.org/#/c/275801/
14:21:04 I wasn't sure if you're aware of that one.
14:21:49 another one is that it is impossible to use dedicated interfaces for live migration using our default configuration - https://review.openstack.org/#/c/356558/
14:21:56 basically it has never worked
14:22:18 There are related changes - this one has +2 https://review.openstack.org/#/c/335132/7
14:22:35 pkoniszewski: not entirely true - it worked fine if using VIR_MIGRATE_TUNNELLED
14:22:42 pkoniszewski: which used to be our default until recently
14:22:51 but i agree we should fix it
14:23:14 i wasn't fast enough to add that it works while tunnelling is on, which is not our default configuration anymore
14:23:17 thanks danpb
14:23:42 IOW, this is a regression from previous state, so we should fix it
14:24:06 https://bugs.launchpad.net/nova/+bug/1455252
14:24:06 Launchpad bug 1455252 in OpenStack Compute (nova) "enabling serial console breaks live migration" [High,In progress] - Assigned to sahid (sahid-ferdjaoui)
14:24:07 plug for long-standing live migration + dvr change that neutron wants in for newton: https://review.openstack.org/#/c/275073/ - i have a todo to get back to reviewing that again
14:24:49 mriedem, yes, that's on the agenda
14:24:59 just noting these console bugs
14:25:02 https://bugs.launchpad.net/nova/+bug/1595962
14:25:02 Launchpad bug 1595962 in OpenStack Compute (nova) "live migration with disabled vnc/spice not possible" [Medium,In progress] - Assigned to Markus Zoeller (markus_z) (mzoeller)
14:25:32 PaulMurray: if we have regressions in newton then we should tag those bugs with newton-rc-potential
14:25:39 at least to get them attention in the next 2 weeks
14:25:48 ok - will do
14:25:50 RC1 is in 2 weeks
14:26:07 mriedem, the neutron one still has a failing tempest test
14:26:30 is anyone looking at that ?
14:26:57 https://review.openstack.org/#/c/286855/
14:27:31 that's the test
14:27:34 the fix is:
14:27:36 https://review.openstack.org/#/c/275073
14:27:36 PaulMurray: just to be clear, that's a WIP test
14:27:46 i mean, it's throw away, just to test that nova dvr change
14:28:02 gate-tempest-dsvm-neutron-dvr-multinode-full is the job in that tempest change we care about
14:28:04 which is passing
14:28:14 ah, good news
14:28:40 So we just need review on that fix then
14:28:51 well i'm rechecking the tempest fix
14:28:59 to run that job again, it hasn't run on the latest nova PS
14:29:00 but yeah
14:29:18 i reviewed the nova change a couple of weeks ago, some perf improvements and lots of missing unit tests, but otherwise it looks pretty close
14:30:29 #action ALL look at https://review.openstack.org/#/c/275073
14:30:45 Any others to mention
14:31:21 not a bug, but,
14:31:36 i haven't gone through the libvirt imagebackend refactor series yet to start -2ing
14:31:40 but expect that early this week
14:31:54 i think that's the last bp i have left to sort out for what's in FF and what's not
14:32:08 The list is here https://bugs.launchpad.net/nova/+bugs?field.tag=live-migration+
14:32:25 mriedem: Thanks :)
14:32:27 There are a number of 'high' importance bugs in progress
14:32:59 markus_z, have you been checking if bugs actually are in progress ?
14:34:42 There are about 7 or 8 and we have discussed some already
14:34:48 so let's have a quick look at the others
14:35:23 https://bugs.launchpad.net/nova/+bug/1419577
14:35:23 Launchpad bug 1419577 in OpenStack Compute (nova) "when live-migrate failed, lun-id couldn't be rollback in havana" [High,In progress] - Assigned to Lee Yarwood (lyarwood)
14:36:08 anyone know about this? it has a fix here https://review.openstack.org/#/c/338929/
14:36:27 it's by lee yarwood
14:37:22 The next two we know about
14:37:43 Lee is off for a bit
14:37:50 Not sure when he's back
14:37:55 the fourth is the volume breaks CI one
14:38:00 ok mdbooth
14:38:26 This one looks quite specific
14:38:30 https://bugs.launchpad.net/nova/+bug/1566622
14:38:30 Launchpad bug 1566622 in OpenStack Compute (nova) "live migration fails with xenapi virt driver and SRs with old-style naming convention" [High,In progress] - Assigned to Corey Wright (coreywright)
14:39:03 the fix that was up for review was abandoned in July
14:39:53 https://bugs.launchpad.net/nova/+bug/1600251
14:39:53 Launchpad bug 1600251 in OpenStack Compute (nova) "live migration does not honor server group policy" [High,In progress] - Assigned to Paul Carlton (paul-carlton2)
14:40:27 this has a fix up that should be easy to finish off
14:40:29 https://review.openstack.org/#/c/339588/
14:40:33 I looked at that this morning, will fix it when done with current task
14:40:48 thanks paul-carlton2
14:41:23 the next one we discussed already
14:41:45 then we have
14:41:47 https://bugs.launchpad.net/nova/+bug/1607996
14:41:47 Launchpad bug 1607996 in OpenStack Compute (nova) "Live migration does not update numa hugepages info in xml" [High,In progress]
14:42:00 unassigned
14:43:09 danpb, do you know anything about this ? I know you haven't commented on the bug
14:43:17 * mdbooth wonders why sfinucan unassigned himself
14:44:23 PaulMurray: it's not entirely surprising - i mean the original numa/hugepage code never tried to make migration work
14:44:45 * danpb doesn't even know if we update NUMA topology upon migration yet, let alone hugepages
14:45:29 danpb, I think that's all work in progress at the moment - along with claims on destination etc
14:45:36 yeah
14:45:59 there's probably another open bug somewhere you can mark it a duplicate of
14:46:11 I'll take a look
14:46:40 danpb: Is numa topology allowed to change during live migration?
14:46:48 That's all the high importance bugs.
14:47:06 mdbooth: guest NUMA topology cannot change but the host<->guest placement can
14:47:15 Got it.
14:47:17 it's just that nova has never tried to
14:47:19 the only ones not in progress are the post-copy network problem and the volume based live migration one that's messing up CI
14:47:29 because it has complex interactions in the scheduler that need fixing first
14:48:31 most have reviews up, so let's try to get those in at least.
14:48:46 I'll make a list and put it on the ML and our tracking page
14:49:06 Feel free to add if something needs attention that is not there
14:49:22 #topic Open discussion
14:49:30 anything left over to discuss ?
14:50:37 I think we're done
14:50:45 Thanks for coming
14:50:53 #endmeeting