14:00:36 <PaulMurray> #startmeeting Nova Live Migration
14:00:37 <openstack> Meeting started Tue Sep 13 14:00:36 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:38 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:40 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:46 <mdbooth> o/
14:00:49 <pkoniszewski> o/
14:00:51 <abhishekk> o/
14:00:57 <PaulMurray> hi all
14:01:17 <mriedem> o/
14:01:33 <PaulMurray> agenda: https://wiki.openstack.org/wiki/Nova/Newton_Release_Schedule
14:02:08 <PaulMurray> release schedule says RC1 16th Sept
14:02:23 <mriedem> really EOD dansmith time on thursday
14:02:26 <PaulMurray> that's friday
14:02:49 * mriedem updates wiki
14:03:44 <PaulMurray> So we will go over the newton-rc-potential bugs
14:03:49 <PaulMurray> but first
14:04:01 <PaulMurray> #topic CI
14:04:09 <PaulMurray> anything to do on CI ?
14:04:36 <PaulMurray> I haven't kept track - so I wanted to make sure there is nothing urgent to attend to ?
14:05:05 <pkoniszewski> i'm trying to work on LM job with grenade, any advice would be helpful https://review.openstack.org/#/c/364809/
14:05:23 <pkoniszewski> heard that mriedem or dansmith tried to do something with that, but haven't found any patches yet
14:06:35 <mdbooth> pkoniszewski: Relevant to the bug I'm working on, btw. Would definitely be good to get that.
14:07:08 <mriedem> i haven't
14:07:15 <PaulMurray> #link Add new job to test live migration with grenade https://review.openstack.org/#/c/364809/
14:07:18 <mriedem> we'd basically need a multinode grenade job to run live migration
14:07:30 <mriedem> we already have multinode grenade jobs
14:07:35 <mriedem> where n-cpu is backlevel on one compute
14:07:54 <mriedem> i think we'd just need to flip the live migration flag in tempest in that job to test it
14:08:06 <mriedem> but it would probably be experimental queue to start
14:08:10 <pkoniszewski> okay, i will check that
14:08:12 <mriedem> unless we used the in-tree hook
14:08:36 <pkoniszewski> so I wanted to use in-tree hook, but those tests that we have right now are not enough
14:08:58 <pkoniszewski> we need tests that will live migrate an instance back and forth, in our tests we always just move an instance to another host and validate it
14:09:19 <mdbooth> pkoniszewski: You thinking of cleanup bugs?
14:09:50 <pkoniszewski> what do you mean?
14:09:50 <mdbooth> Like the bug where you couldn't cold migrate back to a host because the instance directory hadn't been deleted?
14:10:43 <pkoniszewski> also about this, but the priority for me is to check whether we can move VM between two versions in a basic scenario
14:10:52 <mdbooth> Ok
14:12:19 <PaulMurray> let's move on
14:12:30 <PaulMurray> #topic Bugs
14:12:46 <PaulMurray> Starting with live migration bugs on https://bugs.launchpad.net/nova/+bugs?field.tag=newton-rc-potential
14:13:20 <PaulMurray> I think the first is https://bugs.launchpad.net/nova/+bug/1605016
14:13:21 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] - Assigned to Matthew Booth (mbooth-9)
14:13:31 <PaulMurray> that's the one mdbooth is looking at
14:13:43 <PaulMurray> i see you have one patch up
14:13:50 <mdbooth> So, it's taken me a while to get a 'reproducer'
14:14:11 <mdbooth> And I'm still not 100% sure I'm testing the right thing, but I can see something
14:14:29 <mdbooth> I don't currently think this is the show stopper it sounded like last week
14:14:33 <mdbooth> Because of 2 things
14:14:54 <mdbooth> Firstly, the period after post-copy switch over is really short, because post-copy is really fast
14:15:03 <mdbooth> and efficient
14:15:16 <davidgiluk> mdbooth: That depends a little on how painful your workload is
14:15:26 <mdbooth> Secondly, because even if you leave it alone and *never* call the network fixup stuff
14:15:33 <mdbooth> Something fixes it up anyway
14:15:40 <mdbooth> Max 60 seconds outage
14:15:47 <mdbooth> I'd love to know what that is, btw
14:15:59 <mdbooth> I looked, but I don't know enough about neutron
14:16:05 <davidgiluk> mdbooth: Yeh it's worrying knowing something is lurking fiddling with the config but not knowing what
14:16:14 <PaulMurray> mdbooth, that's with DVR
14:16:18 <mdbooth> Yeah
14:16:41 <PaulMurray> haleyb are you lurking ?
14:16:43 <mdbooth> That said, I'm working on fixing it anyway
14:17:03 <mdbooth> First patch is here, and it's an RPC change: https://review.openstack.org/#/c/369423/
14:17:13 <haleyb> PaulMurray: sort-of, neutron meeting now too
14:17:33 <mdbooth> I'd love to get eyes on ^^^ from somebody with a deep understanding of what those calls actually do for various backends
14:17:37 <PaulMurray> I think the issue is there are a variety of different neutron backend implementations and we don't know how they all behave
14:17:41 <mdbooth> I'm just shifting them around
14:17:56 <mdbooth> PaulMurray: Yup.
14:18:19 <PaulMurray> mdbooth, but 60 sec outage is a bummer
14:18:39 <mdbooth> My initial testing of the above patch suggests it works fine, and slightly reduces the network outage in the non-post-copy case.
14:19:36 <PaulMurray> #action please review https://review.openstack.org/#/c/369423/
14:19:47 <PaulMurray> do you have the follow on coming ?
14:20:07 <mriedem> to be clear on https://review.openstack.org/#/c/369423/ ,
14:20:10 <mdbooth> I only just knocked that out this morning, so not yet :)
14:20:18 <mriedem> if it doesn't make newton, it's probably not going to be backported b/c of the rpc change
14:20:25 <PaulMurray> mdbooth, that was hours ago !
14:20:27 <mdbooth> However, the follow-on should be pretty simple in comparison, and won't involve rpc change
14:20:38 <PaulMurray> :)
14:20:52 <mdbooth> mriedem: Right. It would be great to get ^^^ in, even if we don't get the follow-on in
14:21:06 <mdbooth> So please, eyes on that urgently.
14:21:17 <PaulMurray> ok
14:21:18 <PaulMurray> next
14:21:28 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1615613
14:21:30 <openstack> Launchpad bug 1615613 in OpenStack Compute (nova) "Live migration always fails when VNC/SPICE is listening at non-local, non-catch-all address" [High,In progress] - Assigned to Paulo Matias (paulo-matias)
14:21:30 <mdbooth> The follow-on will be a simple backport.
14:21:42 <PaulMurray> anyone know anything about this ?
14:22:33 <PaulMurray> There is a revert here with one +2: https://review.openstack.org/#/c/368732/
14:23:17 <PaulMurray> then: https://review.openstack.org/#/c/358599/6
14:23:34 <PaulMurray> So reviews for those as well please
14:23:57 <mriedem> pkoniszewski: for https://review.openstack.org/#/c/368732/
14:24:09 <mriedem> have you tested N->M and M->N and verified it fixes that vnc/spice console issue?
14:24:17 <pkoniszewski> yes, i did
14:24:28 <mriedem> ok, great
14:24:36 <pkoniszewski> so the problem was backward compatibility, i.e. live migrating from newton to mitaka
14:24:39 <mriedem> btw, pkoniszewski is once again our rc week live migration savior :)
14:24:57 <pkoniszewski> it solves the issue so let's just prepare for fixing the check in ocata
14:25:37 <pkoniszewski> i mean this change https://review.openstack.org/#/c/358599/ is a requirement to finally move the check in Ocata
14:26:02 <PaulMurray> pkoniszewski, are you saying we only need the first revert in newton
14:26:16 <pkoniszewski> the one prepared by johnthetubaguy, yes
14:26:32 <PaulMurray> understood
14:26:58 <pkoniszewski> and if we can land "fill destination check data..." change in Newton, then we will be able to move the check in Ocata
14:27:13 <pkoniszewski> and then land all Markus's changes in Ocata to fix serial console
14:27:32 <mriedem> https://review.openstack.org/#/c/368732/ is approved now
14:28:39 <PaulMurray> #action review  https://review.openstack.org/#/c/358599/
14:28:59 <PaulMurray> the next is on consoles again: https://bugs.launchpad.net/nova/+bug/1595962
14:29:00 <openstack> Launchpad bug 1595962 in OpenStack Compute (nova) "live migration with disabled vnc/spice not possible" [Medium,In progress] - Assigned to Markus Zoeller (markus_z) (mzoeller)
14:29:23 <pkoniszewski> this will be in a merge conflict once the revert is merged
14:30:07 <PaulMurray> it also has a +2 from bauzas and a -1 from alaski
14:30:15 <PaulMurray> on https://review.openstack.org/#/c/335132/
14:30:58 <alaski> with the revert that one may not be necessary
14:31:05 <alaski> at least that was my understanding
14:31:56 <PaulMurray> alaski, do we need to test it to confirm again with the revert just approved
14:32:31 <PaulMurray> or were you really asking a question
14:33:02 <alaski> I'm hoping someone can confirm my understanding
14:33:17 <alaski> we could also get https://review.openstack.org/#/c/338416/ which should provide some testing
14:33:25 <PaulMurray> pkoniszewski, you were on both patches - any thoughts ?
14:33:36 <PaulMurray> or can you try it out ?
14:33:40 <pkoniszewski> yeah, what exactly are you asking about alaski?
14:33:50 <pkoniszewski> and i will try it anyway
14:34:05 <alaski> with the revert at https://review.openstack.org/#/c/368732/ is https://review.openstack.org/#/c/358599/ still necessary?
14:34:22 <pkoniszewski> yes, it is
14:34:58 <pkoniszewski> so the point of this issue was that we moved the check but we never moved the code that was responsible for populating migrate data with graphic addresses
14:35:21 <alaski> right. but the revert moves the check back
14:35:34 <pkoniszewski> exactly
14:36:05 <alaski> so the check should now be in a place that has the graphic addresses
14:36:12 <pkoniszewski> we just can't move the check to check_can_live_migrate_source when older release does not populate data in check_can_live_migrate_destination
14:37:04 <pkoniszewski> once we merge the change you just pasted here we will be able to move the check in next release
14:37:05 <alaski> I see. so the patch from markus_z no longer fixes a bug after the revert, but it is necessary to move the check in O?
14:37:21 <pkoniszewski> because older release will populate the data earlier, in check at destination
14:37:25 <pkoniszewski> right now Mitaka can't do that
14:38:12 <pkoniszewski> just fyi, RPC chain during prechecks looks like conductor-> destination compute (check_can_live_migrate_destination) -> source compute (check_can_live_migrate_source)
14:38:12 <alaski> yeah. so the patch is needed. but not for a bug fix, but so that we can move code in O
14:38:26 <pkoniszewski> exactly
14:38:51 <alaski> okay. I'll let markus_z rebase and see where it's at
14:39:09 <PaulMurray> thanks alaski
14:39:33 <pkoniszewski> so, well, you are right, we can merge markus_z changes still in Newton, but the code needs to be on top of the revert
14:40:11 <PaulMurray> the next one is: https://bugs.launchpad.net/nova/+bug/1621709
14:40:12 <openstack> Launchpad bug 1621709 in OpenStack Compute (nova) "There is no allocation record for migration action" [Medium,In progress] - Assigned to Alex Xu (xuhj)
14:41:46 <PaulMurray> Does anyone know what is going on with this? there is a chain of patches
14:42:27 <PaulMurray> is alex_xu around?
14:44:15 <PaulMurray> mriedem, do you know about this one ? ^^^^
14:44:44 <mriedem> PaulMurray: alex_xu wants to get the bottom change into newton
14:44:50 <mriedem> to stop leaking resources
14:44:59 <PaulMurray> just the bottom change
14:45:23 <PaulMurray> this one: https://review.openstack.org/#/c/369147/4
14:45:28 <PaulMurray> mriedem, ^
14:47:21 <PaulMurray> reading the comment it looks like that is what he meant
14:47:27 <PaulMurray> next is: https://bugs.launchpad.net/nova/+bug/1622854
14:47:28 <openstack> Launchpad bug 1622854 in OpenStack Compute (nova) "pci: double pci migration is putting vm in ERROR" [Medium,Confirmed]
14:47:35 <mriedem> yes that's the one he wants in newton
14:47:39 <mriedem> he said the rest could be ocata
14:47:44 <PaulMurray> thanks
14:47:52 <PaulMurray> no one is working this last one
14:49:08 <mdbooth> Incidentally, I assume we're anticipating an rc2? i.e. There's still an opportunity to fix bugs after Thursday?
14:49:10 <PaulMurray> mriedem, there is a comment on this last bug saying we can go without it and do a back port later
14:49:57 <PaulMurray> mdbooth, I think that's usually the case, but only for really critical bugs
14:50:06 <PaulMurray> ?
14:50:27 <PaulMurray> need to ask the boss
14:50:45 <mriedem> there is usually another rc for translations
14:50:48 <mdbooth> PaulMurray: Does bug 1605016 come under that description? If not, I'll likely go do other stuff.
14:50:49 <openstack> bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] https://launchpad.net/bugs/1605016 - Assigned to Matthew Booth (mbooth-9)
14:51:35 * mdbooth can't gauge the chances of that patch getting in by Thursday, but I'd guess they're low, right?
14:52:14 <PaulMurray> mdbooth, I would hope the chances are good for anything that actually works and is considered worth the tag
14:52:20 <PaulMurray> they get attention
14:52:33 <PaulMurray> mriedem, ^^
14:53:20 <mdbooth> dansmith didn't have the bandwidth to look when I spoke to him earlier. I didn't appreciate that might be a complete blocker.
14:53:33 <mriedem> dan is working on the placement stuff
14:53:43 <mriedem> i've tagged the bug
14:53:44 <mriedem> so it's on the list
14:53:53 <mriedem> and it's in the etherpad
14:54:04 <mriedem> it's not a trivial change though, i haven't reviewed yet
14:54:12 <mriedem> and if it's not required for post-copy to work, then it might slide
14:54:20 <mdbooth> mriedem: So, the critical bit is just the rpc change, because the follow-on will be backportable
14:54:34 <dansmith> it's way too disruptive for this point in the cycle, IMHO.. post-copy is still optional yes?
14:54:36 <mriedem> if the rpc change is a noop it might be ok
14:54:37 <mdbooth> mriedem: And it's not required for post-copy
14:54:59 <mriedem> if the rpc change is a mistake though and we find that out later, we can't revert it
14:55:07 <mdbooth> dansmith: Yup, this don't break post-copy.
14:55:24 <mdbooth> It just makes it sub-optimal when using dvr
14:55:40 <dansmith> mdbooth: so only when using post-copy and dvr right?
14:55:44 <mdbooth> And post-copy is also disabled by default iirc?
14:55:49 <mdbooth> dansmith: Yup
14:55:53 <davidgiluk> mdbooth: IMHO it's much worse than suboptimal
14:55:55 <pkoniszewski> it is disabled by default
14:56:02 <dansmith> right, so.. I would not make this change before the release, IMHO
14:56:13 <mdbooth> dansmith: Cool, thanks.
14:56:31 * mdbooth will still work on it, just no longer top priority
14:56:54 <PaulMurray> so are we dropping the newton-rc-potential tag for that ?
14:56:59 <mdbooth> Sounds like it
14:57:02 <davidgiluk> :-(
14:57:06 <PaulMurray> ok
14:57:10 <dansmith> yeah, drop IMHO
14:57:21 <PaulMurray> we're out of time
14:57:28 <PaulMurray> finished just before the bell
14:57:33 <PaulMurray> thanks for coming
14:57:34 <abhishekk> hi, I have a backport patch for stable/mitaka, https://review.openstack.org/#/c/353851, please review the same and give your suggestions when you get some time, thank you
14:58:00 <PaulMurray> thanks abhishekk - sorry no time for discussion
14:58:08 <PaulMurray> in the open section
14:58:13 <PaulMurray> bye
14:58:19 <PaulMurray> #endmeeting