14:00:04 <PaulMurray> #startmeeting Nova Live Migration
14:00:08 <openstack> Meeting started Tue Sep  6 14:00:04 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:12 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:32 <PaulMurray> Hello everyone
14:00:49 <davidgiluk> hi
14:00:51 <PaulMurray> As usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:00:55 <mdbooth> o/
14:01:14 <luis5tb> o/
14:01:28 <pkoniszewski> o/
14:01:29 <PaulMurray> Hopefully everyone is back from summer now
14:01:32 <PaulMurray> I am
14:02:05 <mdbooth> Apparently summer's about to restart. It had previously ended, though. Here's hoping.
14:02:36 <PaulMurray> We are past feature freeze now, so I thought we would skip feature stuff for now
14:02:46 <PaulMurray> #topic CI
14:03:02 <PaulMurray> I have been trying to catch up
14:03:12 <PaulMurray> I saw something about flaky NFS?
14:03:18 <PaulMurray> Any news
14:04:14 <PaulMurray> I guess that's a no
14:04:47 <PaulMurray> If no-one has anything to say we'll go on
14:05:01 <PaulMurray> #topic Bugs
14:05:07 <pkoniszewski> tdurakov is on vacation, so probably no update
14:05:23 <PaulMurray> pkoniszewski, thanks
14:05:28 <mdbooth> johnthetubaguy: https://bugs.launchpad.net/nova/+bug/1605016
14:05:28 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:06:07 <mdbooth> I wondered if we could add a 'if post-copy: do_network_stuff()' at the end of pre_live_migration()
14:06:25 <PaulMurray> I don't know if johnthetubaguy is here, but I saw the comments
14:06:25 <pkoniszewski> well, we can also go other way
14:06:51 <pkoniszewski> we do have a control on this switch
14:06:59 <davidgiluk> luis5tb: did you have a chance to look at that postcopy network stuff?
14:07:02 <pkoniszewski> i mean it's nova decision to switch pre-copy to post-copy
14:07:07 <raj_singh> mdbooth: johnthetubaguy is on vacation till Wednesday
14:07:13 <mdbooth> raj_singh: Ok, thanks.
14:07:30 <PaulMurray> pkoniszewski, but nova doesn't know when the switch has happened
14:07:46 <pkoniszewski> it doesn't?
14:08:00 <PaulMurray> I don't think it's synchronous, is it?
14:08:09 <pkoniszewski> davidgiluk: ?
14:08:12 <pkoniszewski> davidgiluk: ^
14:08:12 <PaulMurray> I think nova triggers the switch
14:08:24 <davidgiluk> pkoniszewski: I only know it from the qemu side; luis5tb ^
14:08:26 <luis5tb> davidgiluk: no, not yet. I'm moving, so I've been pretty busy the last few weeks
14:08:30 <mdbooth> The thing is, if we're going to add a new callback it needs to be asynchronous, and concurrent with the migration.
14:08:31 <PaulMurray> danpb, said there is an event at the destination that we can pick up
14:08:52 <danpb> yes, libvirt emits an event when post-copy starts
14:08:53 <PaulMurray> I noted it in the bug
14:08:58 <mdbooth> danpb: Can we trigger nova code with context at the destination?
14:09:03 <danpb> so you can use that to trigger the network change
14:09:17 <danpb> on source it'll emit PAUSED + reason POST_COPY
14:09:18 <mdbooth> So libvirt can give us an async callback?
14:09:23 <danpb> on target it'll emit RUNNING + reason POST_COPY
14:09:39 <danpb> nova already has an event handler registered with libvirt that'll be receiving these
14:09:55 <danpb> so you merely need to wire up some code to trigger on it
14:10:10 <mdbooth> PaulMurray: I see your comment now. Read backwards and missed it.
14:10:15 <danpb> shouldn't be any more complex than what nova already has to do to switch network when pre-copy finishes
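
A minimal sketch of the wiring danpb describes above, not nova's actual code. It assumes the "PAUSED + reason POST_COPY" and "RUNNING + reason POST_COPY" notifications arrive as libvirt lifecycle events SUSPENDED and RESUMED with a POSTCOPY detail, and it copies the numeric values of those libvirt enum members into local constants so the example runs without libvirt-python installed. The names classify_postcopy_event, make_lifecycle_callback and on_postcopy_started are hypothetical helpers introduced for illustration only.

```python
# Values mirror libvirt's public enums (virDomainEventType and its detail
# enums). With libvirt-python installed, use libvirt.VIR_DOMAIN_EVENT_* instead.
VIR_DOMAIN_EVENT_SUSPENDED = 3
VIR_DOMAIN_EVENT_RESUMED = 4
VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY = 7
VIR_DOMAIN_EVENT_RESUMED_POSTCOPY = 3


def classify_postcopy_event(event, detail):
    """Return 'source', 'target', or None for a lifecycle event (hypothetical helper)."""
    if (event == VIR_DOMAIN_EVENT_SUSPENDED
            and detail == VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY):
        # Guest was paused on the source because post-copy started.
        return "source"
    if (event == VIR_DOMAIN_EVENT_RESUMED
            and detail == VIR_DOMAIN_EVENT_RESUMED_POSTCOPY):
        # Guest is now running on the target in post-copy mode.
        return "target"
    return None


def make_lifecycle_callback(on_postcopy_started):
    """Build a callback with the (conn, dom, event, detail, opaque) signature
    that libvirt-python passes to lifecycle event handlers."""
    def _callback(conn, dom, event, detail, opaque):
        side = classify_postcopy_event(event, detail)
        if side is not None:
            # Assumption: this is where the network switch would be triggered.
            on_postcopy_started(side, dom)
    return _callback
```

With libvirt-python, such a callback would be registered through conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, callback, None). Because the event is delivered asynchronously, the handler runs concurrently with the migration, which is the property mdbooth asks for above.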
14:11:10 <mdbooth> Is this considered a bug that's eligible for Newton?
14:11:32 <mdbooth> mriedem: ^^^ ?
14:11:58 <mriedem> how much backscroll do i need to read here?
14:12:03 <danpb> if it isn't considered eligible then you pretty much have to block any use of post-copy in Newton
14:12:09 <mdbooth> mriedem: About 5 lines.
14:12:10 <danpb> as it is useless if network is fubar
14:12:10 <PaulMurray> I'm not sure it's super critical
14:12:18 <PaulMurray> but it would be good if it can be done
14:12:30 <mriedem> talking about https://bugs.launchpad.net/nova/+bug/1605016 ?
14:12:30 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:12:31 <danpb> PaulMurray: complete lack of working network while in post-copy phase seems pretty critical to me
14:12:34 <mdbooth> mriedem: Yes
14:12:49 <mriedem> bugs are fair game
14:12:56 <mdbooth> So if this will be considered in Newton, I don't mind taking this.
14:13:05 <PaulMurray> danpb, yeah, but I meant we can do migration without it - so it's not a show stopper
14:13:06 <mriedem> is this like 100% fail with post-copy?
14:13:20 <mdbooth> mriedem: No, just no networking until it completes.
14:13:25 <davidgiluk> which is pretty bad
14:13:30 <mdbooth> Which is a potentially long time, of course.
14:13:35 <danpb> that may be anywhere from a few milliseconds to many minutes
14:13:57 <pkoniszewski> mdbooth: is that true for ANY neutron configuration?
14:14:15 <PaulMurray> that's why the problem is there - testing with VMs that aren't really doing much means you don't notice
14:14:25 <PaulMurray> but with a busy vm it would be a problem
14:14:25 <mdbooth> pkoniszewski: I haven't looked at it in detail, yet, but I don't see how it could be otherwise?
14:14:57 <PaulMurray> pkoniszewski, it's not neutron that makes the difference
14:15:08 <PaulMurray> it's the time it takes to complete the post-copy
14:15:15 <pkoniszewski> c
14:15:17 <pkoniszewski> sorry
14:15:48 <mdbooth> Ok, sounds like we agree this is pretty bad and eligible for Newton.
14:15:53 <mdbooth> I'll have a stab at it.
14:16:04 <davidgiluk> mdbooth: Prod me if you need any help understanding postcopy
14:16:12 <mdbooth> davidgiluk: Will do, thanks.
14:16:15 <danpb> mdbooth: you should wear a stab proof vest, because live migration always stabs back :-P
14:16:22 <davidgiluk> haha
14:16:28 <PaulMurray> mdbooth, thanks - I'll see if we can help too
14:16:36 <PaulMurray> at least with testing
14:16:54 <pkoniszewski> PaulMurray: just checked that nova-network does not require post copy steps to keep networking up
14:16:56 <mriedem> i've tagged it with neutron-rc-potential
14:17:04 <luis5tb> I'm happy to help too, but as this is high prio and I will not have much time available within the next couple of weeks, I prefer not to be the responsible one!
14:17:05 <pkoniszewski> and i'm pretty sure that it is only DVR case for neutron
14:17:22 <PaulMurray> #action mdbooth to fix https://bugs.launchpad.net/nova/+bug/1605016
14:17:22 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed]
14:18:11 <PaulMurray> Do we have any other critical bugs ?
14:18:23 <pkoniszewski> yeah, i think
14:18:32 <pkoniszewski> this fixes regression that we introduced in newton
14:18:38 <pkoniszewski> https://review.openstack.org/#/c/358599/
14:19:17 <PaulMurray> I saw that just now
14:19:24 <pkoniszewski> basically because of this bug everyone needs to set vnc/spice to listen all/localhost
14:19:53 <PaulMurray> is the patch close to done ?
14:20:17 <pkoniszewski> i think that it can be merged in the current state
14:20:34 <mriedem> i'll take a look at it after meetings
14:20:38 <markus_z> This fixes an older issue with live migration + serial console ports: https://review.openstack.org/#/c/275801/
14:21:04 <markus_z> I wasn't sure if you're aware of that one.
14:21:49 <pkoniszewski> another one is that it is impossible to use dedicated interfaces for live migration using our default configuration - https://review.openstack.org/#/c/356558/
14:21:56 <pkoniszewski> basically it has never worked
14:22:18 <PaulMurray> There are related changes - this one has +2 https://review.openstack.org/#/c/335132/7
14:22:35 <danpb> pkoniszewski: not entirely true - it worked fine if using VIR_MIGRATE_TUNNELLED
14:22:42 <danpb> pkoniszewski: which used to be our default until recently
14:22:51 <danpb> but i agree we should fix it
14:23:14 <pkoniszewski> i wasn't fast enough to add that it works while tunnelling is on, which is not our default configuration anymore
14:23:17 <pkoniszewski> thanks danpb
14:23:42 <danpb> IOW, this is a regression from previous state, so we should fix it
14:24:06 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1455252
14:24:06 <openstack> Launchpad bug 1455252 in OpenStack Compute (nova) "enabling serial console breaks live migration" [High,In progress] - Assigned to sahid (sahid-ferdjaoui)
14:24:07 <mriedem> plug for long-standing live migration + dvr change that neutron wants in for newton: https://review.openstack.org/#/c/275073/ - i have a todo to get back to reviewing that again
14:24:49 <PaulMurray> mriedem, yes, that's on the agenda
14:24:59 <PaulMurray> just noting these console bugs
14:25:02 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1595962
14:25:02 <openstack> Launchpad bug 1595962 in OpenStack Compute (nova) "live migration with disabled vnc/spice not possible" [Medium,In progress] - Assigned to Markus Zoeller (markus_z) (mzoeller)
14:25:32 <mriedem> PaulMurray: if we have regressions in newton then we should tag those bugs with newton-rc-potential
14:25:39 <mriedem> at least to get them attention in the next 2 weeks
14:25:48 <PaulMurray> ok - will do
14:25:50 <mriedem> RC1 is in 2 weeks
14:26:07 <PaulMurray> mriedem, the neutron one still has a failing tempest test
14:26:30 <PaulMurray> is anyone looking at that ?
14:26:57 <PaulMurray> https://review.openstack.org/#/c/286855/
14:27:31 <PaulMurray> that's the test
14:27:34 <PaulMurray> the fix is:
14:27:36 <PaulMurray> https://review.openstack.org/#/c/275073
14:27:36 <mriedem> PaulMurray: just to be clear, that's a WIP test
14:27:46 <mriedem> i mean, it's throw away, just to test that nova dvr change
14:28:02 <mriedem> gate-tempest-dsvm-neutron-dvr-multinode-full is the job in that tempest change we care about
14:28:04 <mriedem> which is passing
14:28:14 <PaulMurray> ah, good news
14:28:40 <PaulMurray> So we just need review on that fix then
14:28:51 <mriedem> well i'm rechecking the tempest fix
14:28:59 <mriedem> to run that job again, it hasn't run on the latest nova PS
14:29:00 <mriedem> but yeah
14:29:18 <mriedem> i reviewed the nova change a couple of weeks ago, some perf improvements and lots of missing unit tests, but otherwise it looks pretty close
14:30:29 <PaulMurray> #action ALL look at https://review.openstack.org/#/c/275073
14:30:45 <PaulMurray> Any others to mention
14:31:21 <mriedem> not a bug, but,
14:31:36 <mriedem> i haven't gone through the libvirt imagebackend refactor series yet to start -2ing
14:31:40 <mriedem> but expect that early this week
14:31:54 <mriedem> i think that's the last bp i have left to sort out for what's in FF and what's not
14:32:08 <PaulMurray> The list is here https://bugs.launchpad.net/nova/+bugs?field.tag=live-migration+
14:32:25 <mdbooth> mriedem: Thanks :)
14:32:27 <PaulMurray> There are a number of 'high' importance bugs in progress
14:32:59 <PaulMurray> markus_z, have you been checking if bugs actually are in progress ?
14:34:42 <PaulMurray> There are about 7 or 8 and we have discussed some already
14:34:48 <PaulMurray> so let's have a quick look at the others
14:35:23 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1419577
14:35:23 <openstack> Launchpad bug 1419577 in OpenStack Compute (nova) "when live-migrate failed, lun-id couldn't be rollback in havana" [High,In progress] - Assigned to Lee Yarwood (lyarwood)
14:36:08 <PaulMurray> anyone know about this? it has a fix here https://review.openstack.org/#/c/338929/
14:36:27 <PaulMurray> it's by Lee Yarwood
14:37:22 <PaulMurray> The next two we know about
14:37:43 <mdbooth> Lee is off for a bit
14:37:50 <mdbooth> Not sure when he's back
14:37:55 <PaulMurray> the fourth is the volume breaks CI one
14:38:00 <PaulMurray> ok mdbooth
14:38:26 <PaulMurray> This one looks quite specific
14:38:30 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1566622
14:38:30 <openstack> Launchpad bug 1566622 in OpenStack Compute (nova) "live migration fails with xenapi virt driver and SRs with old-style naming convention" [High,In progress] - Assigned to Corey Wright (coreywright)
14:39:03 <PaulMurray> the fix that was up for review was abandoned in July
14:39:53 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1600251
14:39:53 <openstack> Launchpad bug 1600251 in OpenStack Compute (nova) "live migration does not honor server group policy" [High,In progress] - Assigned to Paul Carlton (paul-carlton2)
14:40:27 <PaulMurray> this has a fix up that should be easy to finish off
14:40:29 <PaulMurray> https://review.openstack.org/#/c/339588/
14:40:33 <paul-carlton2> I looked at that this morning, will fix it when done with current task
14:40:48 <PaulMurray> thanks paul-carlton2
14:41:23 <PaulMurray> the next one we discussed already
14:41:45 <PaulMurray> then we have
14:41:47 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1607996
14:41:47 <openstack> Launchpad bug 1607996 in OpenStack Compute (nova) "Live migration does not update numa hugepages info in xml" [High,In progress]
14:42:00 <PaulMurray> unassigned
14:43:09 <PaulMurray> danpb, do you know anything about this ? I know you haven't commented on the bug
14:43:17 * mdbooth wonders why sfinucan unassigned himself
14:44:23 <danpb> PaulMurray: it's not entirely surprising - i mean the original numa/hugepage code never tried to make migration work
14:44:45 * danpb doesn't even know if we update NUMA topology upon migration yet, let alone hugepages
14:45:29 <PaulMurray> danpb, I think that's all work in progress at the moment - along with claims on destination etc
14:45:36 <danpb> yeah
14:45:59 <danpb> there's probably another open bug somewhere you can mark it a duplicate of
14:46:11 <PaulMurray> I'll take a look
14:46:40 <mdbooth> danpb: Is numa topology allowed to change during live migration?
14:46:48 <PaulMurray> That's all the high importance bugs.
14:47:06 <danpb> mdbooth: guest NUMA topology cannot change but the host<->guest placement can
14:47:15 <mdbooth> Got it.
14:47:17 <danpb> it's just that nova has never tried to
14:47:19 <PaulMurray> the only ones not in progress are the post-copy network problem and the volume based live migration one that's messing up CI
14:47:29 <danpb> because it has complex interactions in the scheduler that need fixing first
14:48:31 <PaulMurray> most have reviews up, so let's try to get those in at least.
14:48:46 <PaulMurray> I'll make a list and put it on the ML and our tracking page
14:49:06 <PaulMurray> Feel free to add if something needs attention that is not there
14:49:22 <PaulMurray> #topic Open discussion
14:49:30 <PaulMurray> anything left over to discuss ?
14:50:37 <PaulMurray> I think we're done
14:50:45 <PaulMurray> Thanks for coming
14:50:53 <PaulMurray> #endmeeting