14:00:36 <PaulMurray> #startmeeting Nova Live Migration
14:00:37 <openstack> Meeting started Tue Sep 13 14:00:36 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:38 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:40 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:46 <mdbooth> o/
14:00:49 <pkoniszewski> o/
14:00:51 <abhishekk> o/
14:00:57 <PaulMurray> hi all
14:01:17 <mriedem> o/
14:01:33 <PaulMurray> agenda: https://wiki.openstack.org/wiki/Nova/Newton_Release_Schedule
14:02:08 <PaulMurray> release schedule says RC1 16th Sept
14:02:23 <mriedem> really EOD dansmith time on thursday
14:02:26 <PaulMurray> that's friday
14:02:49 * mriedem updates wiki
14:03:44 <PaulMurray> So we will go over the newton-rc-potential bugs
14:03:49 <PaulMurray> but first
14:04:01 <PaulMurray> #topic CI
14:04:09 <PaulMurray> anything to do on CI ?
14:04:36 <PaulMurray> I haven't kept track - so I wanted to make sure there is nothing urgent to attend to ?
14:05:05 <pkoniszewski> i'm trying to work on LM job with grenade, any advice would be helpful https://review.openstack.org/#/c/364809/
14:05:23 <pkoniszewski> heard that mriedem or dansmith tried to do something with that, but haven't found any patches yet
14:06:35 <mdbooth> pkoniszewski: Relevant to the bug I'm working on, btw. Would definitely be good to get that.
14:07:08 <mriedem> i haven't
14:07:15 <PaulMurray> #link Add new job to test live migration with grenade https://review.openstack.org/#/c/364809/
14:07:18 <mriedem> we'd basically need a multinode grenade job to run live migration
14:07:30 <mriedem> we already have multinode grenade jobs
14:07:35 <mriedem> where n-cpu is backlevel on one compute
14:07:54 <mriedem> i think we'd just need to flip the live migration flag in tempest in that job to test it
14:08:06 <mriedem> but it would probably be experimental queue to start
14:08:10 <pkoniszewski> okay, i will check that
14:08:12 <mriedem> unless we used the in-tree hook
14:08:36 <pkoniszewski> so I wanted to use in-tree hook, but those tests that we have right now are not enough
14:08:58 <pkoniszewski> we need tests that will live migrate an instance back and forth, in our tests we always just move an instance to another host and validate it
14:09:19 <mdbooth> pkoniszewski: You thinking of cleanup bugs?
14:09:50 <pkoniszewski> what do you mean?
14:09:50 <mdbooth> Like the bug where you couldn't cold migrate back to a host because the instance directory hadn't been deleted?
14:10:43 <pkoniszewski> also about this, but the priority for me is to check whether we can move VM between two versions in a basic scenario
14:10:52 <mdbooth> Ok
14:12:19 <PaulMurray> let's move on
14:12:30 <PaulMurray> #topic Bugs
14:12:46 <PaulMurray> Starting with live migration bugs on https://bugs.launchpad.net/nova/+bugs?field.tag=newton-rc-potential
14:12:53 <PaulMurray> https://bugs.launchpad.net/nova/+bugs?field.tag=newton-rc-potential
14:13:20 <PaulMurray> I think the first is https://bugs.launchpad.net/nova/+bug/1605016
14:13:21 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] - Assigned to Matthew Booth (mbooth-9)
14:13:31 <PaulMurray> that's the one mdbooth is looking at
14:13:43 <PaulMurray> i see you have one patch up
14:13:50 <mdbooth> So, it's taken me a while to get a 'reproducer'
14:14:11 <mdbooth> And I'm still not 100% sure I'm testing the right thing, but I can see something
14:14:29 <mdbooth> I don't currently think this is the show stopper it sounded like last week
14:14:33 <mdbooth> Because of 2 things
14:14:54 <mdbooth> Firstly, the period after post-copy switch over is really short, because post-copy is really fast
14:15:03 <mdbooth> and efficient
14:15:16 <davidgiluk> mdbooth: That depends a little on how painful your workload is
14:15:26 <mdbooth> Secondly, because even if you leave it alone and *never* call the network fixup stuff
14:15:33 <mdbooth> Something fixes it up anyway
14:15:40 <mdbooth> Max 60 seconds outage
14:15:47 <mdbooth> I'd love to know what that is, btw
14:15:59 <mdbooth> I looked, but I don't know enough about neutron
14:16:05 <davidgiluk> mdbooth: Yeh it's worrying knowing something is lurking fiddling with the config but not knowing what
14:16:14 <PaulMurray> mdbooth, that's with DVR
14:16:18 <mdbooth> Yeah
14:16:41 <PaulMurray> haleyb are you lurking ?
14:16:43 <mdbooth> That said, I'm working on fixing it anyway
14:17:03 <mdbooth> First patch is here, and it's an RPC change: https://review.openstack.org/#/c/369423/
14:17:13 <haleyb> PaulMurray: sort-of, neutron meeting now too
14:17:33 <mdbooth> I'd love to get eyes on ^^^ from somebody with a deep understanding of what those calls actually do for various backends
14:17:37 <PaulMurray> I think the issue is there are a variety of different neutron backend implementations and we don't know how they all behave
14:17:41 <mdbooth> I'm just shifting them around
14:17:56 <mdbooth> PaulMurray: Yup.
14:18:19 <PaulMurray> mdbooth, but 60 sec outage is a bummer
14:18:39 <mdbooth> My initial testing of the above patch suggests it works fine, and slightly reduces the network outage in the non-post-copy case.
14:19:36 <PaulMurray> #action please review https://review.openstack.org/#/c/369423/
14:19:47 <PaulMurray> do you have the follow on coming ?
14:20:07 <mriedem> to be clear on https://review.openstack.org/#/c/369423/ ,
14:20:10 <mdbooth> I only just knocked that out this morning, so not yet :)
14:20:18 <mriedem> if it doesn't make newton, it's probably not going to be backported b/c of the rpc change
14:20:25 <PaulMurray> mdbooth, that was hours ago !
14:20:27 <mdbooth> However, the follow-on should be pretty simple in comparison, and won't involve rpc change
14:20:38 <PaulMurray> :)
14:20:52 <mdbooth> mriedem: Right. It would be great to get ^^^ in, even if we don't get the follow-on in
14:21:06 <mdbooth> So please, eyes on that urgently.
14:21:17 <PaulMurray> ok
14:21:18 <PaulMurray> next
14:21:28 <PaulMurray> https://bugs.launchpad.net/nova/+bug/1615613
14:21:30 <openstack> Launchpad bug 1615613 in OpenStack Compute (nova) "Live migration always fails when VNC/SPICE is listening at non-local, non-catch-all address" [High,In progress] - Assigned to Paulo Matias (paulo-matias)
14:21:30 <mdbooth> The follow-on will be a simple backport.
14:21:42 <PaulMurray> anyone know anything about this ?
14:22:33 <PaulMurray> There is a revert here with one +2: https://review.openstack.org/#/c/368732/
14:23:17 <PaulMurray> then: https://review.openstack.org/#/c/358599/6
14:23:34 <PaulMurray> So reviews for those as well please
14:23:57 <mriedem> pkoniszewski: for https://review.openstack.org/#/c/368732/
14:24:09 <mriedem> have you tested N->M and M->N and verified it fixes that vnc/spice console issue?
14:24:17 <pkoniszewski> yes, i did
14:24:28 <mriedem> ok, great
14:24:36 <pkoniszewski> so the problem was backward compatibility, i.e. live migrating from newton to mitaka
14:24:39 <mriedem> btw, pkoniszewski is once again our rc week live migration savior :)
14:24:57 <pkoniszewski> it solves the issue so let's just prepare for fixing the check in ocata
14:25:37 <pkoniszewski> i mean this change https://review.openstack.org/#/c/358599/ is a requirement to finally move the check in Ocata
14:26:02 <PaulMurray> pkoniszewski, are you saying we only need the first revert in newton
14:26:16 <pkoniszewski> the one prepared by johnthetubaguy, yes
14:26:32 <PaulMurray> understood
14:26:58 <pkoniszewski> and if we can land "fill destination check data..." change in Newton, then we will be able to move the check in Ocata
14:27:13 <pkoniszewski> and then land all Markus's changes in Ocata to fix serial console
14:27:32 <mriedem> https://review.openstack.org/#/c/368732/ is approved now
14:28:39 <PaulMurray> #action review https://review.openstack.org/#/c/358599/
14:28:59 <PaulMurray> the next is on consoles again: https://bugs.launchpad.net/nova/+bug/1595962
14:29:00 <openstack> Launchpad bug 1595962 in OpenStack Compute (nova) "live migration with disabled vnc/spice not possible" [Medium,In progress] - Assigned to Markus Zoeller (markus_z) (mzoeller)
14:29:23 <pkoniszewski> this will be in a merge conflict once the revert is merged
14:30:07 <PaulMurray> it also has a +2 from bauzas and a -1 from alaski
14:30:15 <PaulMurray> on https://review.openstack.org/#/c/335132/
14:30:58 <alaski> with the revert that one may not be necessary
14:31:05 <alaski> at least that was my understanding
14:31:56 <PaulMurray> alaski, do we need to test it to confirm again with the revert just approved
14:32:31 <PaulMurray> or were you really asking a question
14:33:02 <alaski> I'm hoping someone can confirm my understanding
14:33:17 <alaski> we could also get https://review.openstack.org/#/c/338416/ which should provide some testing
14:33:25 <PaulMurray> pkoniszewski, you were on both patches - any thoughts ?
14:33:36 <PaulMurray> or can you try it out ?
14:33:40 <pkoniszewski> yeah, what exactly are you asking about alaski?
14:33:50 <pkoniszewski> and i will try it anyway
14:34:05 <alaski> with the revert at https://review.openstack.org/#/c/368732/ is https://review.openstack.org/#/c/358599/ still necessary?
14:34:22 <pkoniszewski> yes, it is
14:34:58 <pkoniszewski> so the point of this issue was that we moved the check but we never moved the code that was responsible for populating migrate data with graphic addresses
14:35:21 <alaski> right. but the revert moves the check back
14:35:34 <pkoniszewski> exactly
14:36:05 <alaski> so the check should now be in a place that has the graphic addresses
14:36:12 <pkoniszewski> we just can't move the check to check_can_live_migrate_source when older release does not populate data in check_can_live_migrate_destination
14:37:04 <pkoniszewski> once we merge the change you just pasted here we will be able to move the check in next release
14:37:05 <alaski> I see. so the patch from markus_z no longer fixes a bug after the revert, but it is necessary to move the check in O?
14:37:21 <pkoniszewski> because older release will populate the data earlier, in check at destination
14:37:25 <pkoniszewski> right now Mitaka can't do that
14:38:12 <pkoniszewski> just fyi, RPC chain during prechecks looks like conductor-> destination compute (check_can_live_migrate_destination) -> source compute (check_can_live_migrate_source)
14:38:12 <alaski> yeah. so the patch is needed. but not for a bug fix, but so that we can move code in O
14:38:26 <pkoniszewski> exactly
14:38:51 <alaski> okay. I'll let markus_z rebase and see where it's at
14:39:09 <PaulMurray> thanks alaski
14:39:33 <pkoniszewski> so, well, you are right, we can merge markus_z changes still in Newton, but the code needs to be on top of the revert
14:40:11 <PaulMurray> the next one is: https://bugs.launchpad.net/nova/+bug/1621709
14:40:12 <openstack> Launchpad bug 1621709 in OpenStack Compute (nova) "There is no allocation record for migration action" [Medium,In progress] - Assigned to Alex Xu (xuhj)
14:41:46 <PaulMurray> Does anyone know what is going on with this? there is a chain of patches
14:42:27 <PaulMurray> is alex_xu around?
14:44:15 <PaulMurray> mriedem, do you know about this one ? ^^^^
14:44:44 <mriedem> PaulMurray: alex_xu wants to get the bottom change into newton
14:44:50 <mriedem> to stop leaking resources
14:44:59 <PaulMurray> just the bottom change
14:45:23 <PaulMurray> this one: https://review.openstack.org/#/c/369147/4
14:45:28 <PaulMurray> mriedem, ^
14:47:21 <PaulMurray> reading the comment it looks like that is what he meant
14:47:27 <PaulMurray> next is: https://bugs.launchpad.net/nova/+bug/1622854
14:47:28 <openstack> Launchpad bug 1622854 in OpenStack Compute (nova) "pci: double pci migration is putting vm in ERROR" [Medium,Confirmed]
14:47:35 <mriedem> yes that's the one he wants in newton
14:47:39 <mriedem> he said the rest could be ocata
14:47:44 <PaulMurray> thanks
14:47:52 <PaulMurray> no one is working this last one
14:49:08 <mdbooth> Incidentally, I assume we're anticipating an rc2? i.e. There's still an opportunity to fix bugs after Thursday?
14:49:10 <PaulMurray> mriedem, there is a comment on this last bug saying we can go without it and do a back port later
14:49:57 <PaulMurray> mdbooth, I think that's usually the case, but only for really critical bugs
14:50:06 <PaulMurray> ?
14:50:27 <PaulMurray> need to ask the boss
14:50:45 <mriedem> there is usually another rc for translations
14:50:48 <mdbooth> PaulMurray: Does bug 1605016 come under that description? If not, I'll likely go do other stuff.
14:50:49 <openstack> bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] https://launchpad.net/bugs/1605016 - Assigned to Matthew Booth (mbooth-9)
14:51:35 * mdbooth can't gauge the chances of that patch getting in by Thursday, but I'd guess they're low, right?
14:52:14 <PaulMurray> mdbooth, I would hope the chances are good for anything that actually works and is considered worth the tag
14:52:20 <PaulMurray> they get attention
14:52:33 <PaulMurray> mriedem, ^^
14:53:20 <mdbooth> dansmith didn't have the bandwidth to look when I spoke to him earlier. I didn't appreciate that might be a complete blocker.
14:53:33 <mriedem> dan is working on the placement stuff
14:53:43 <mriedem> i've tagged the bug
14:53:44 <mriedem> so it's on the list
14:53:53 <mriedem> and it's in the etherpad
14:54:04 <mriedem> it's not a trivial change though, i haven't reviewed yet
14:54:12 <mriedem> and if it's not required for post-copy to work, then it might slide
14:54:20 <mdbooth> mriedem: So, the critical bit is just the rpc change, because the follow-on will be backportable
14:54:34 <dansmith> it's way too disruptive for this point in the cycle, IMHO.. post-copy is still optional yes?
14:54:36 <mriedem> if the rpc change is a noop it might be ok
14:54:37 <mdbooth> mriedem: And it's not required for post-copy
14:54:59 <mriedem> if the rpc change is a mistake though and we find that out later, we can't revert it
14:55:07 <mdbooth> dansmith: Yup, this doesn't break post-copy.
14:55:24 <mdbooth> It just makes it sub-optimal when using dvr
14:55:40 <dansmith> mdbooth: so only when using post-copy and dvr right?
14:55:44 <mdbooth> And post-copy is also disabled by default iirc?
14:55:49 <mdbooth> dansmith: Yup
14:55:53 <davidgiluk> mdbooth: IMHO it's much worse than suboptimal
14:55:55 <pkoniszewski> it is disabled by default
14:56:02 <dansmith> right, so.. I would not make this change before the release, IMHO
14:56:13 <mdbooth> dansmith: Cool, thanks.
14:56:31 * mdbooth will still work on it, just no longer top priority
14:56:54 <PaulMurray> so are we dropping the newton-rc-potential tag for that ?
14:56:59 <mdbooth> Sounds like it
14:57:02 <davidgiluk> :-(
14:57:06 <PaulMurray> ok
14:57:10 <dansmith> yeah, drop IMHO
14:57:21 <PaulMurray> we're out of time
14:57:28 <PaulMurray> finished just before the bell
14:57:33 <PaulMurray> thanks for coming
14:57:34 <abhishekk> hi, I have a backport patch for stable/mitaka, https://review.openstack.org/#/c/353851, please review the same and give your suggestions when you get some time, thank you
14:58:00 <PaulMurray> thanks abhishekk - sorry no time for discussion
14:58:08 <PaulMurray> in the open section
14:58:13 <PaulMurray> bye
14:58:19 <PaulMurray> #endmeeting