14:01:20 <tdurakov> #startmeeting Nova Live Migration
14:01:21 <openstack> Meeting started Tue Feb 14 14:01:20 2017 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:25 <openstack> The meeting name has been set to 'nova_live_migration'
14:01:29 <tdurakov> hi everyone
14:01:32 <davidgiluk> hi
14:01:44 <siva_krishnan> hi
14:01:58 <tdurakov> since there were no meeting last week let's walk through the previous topics
14:02:08 <tdurakov> #topic CI
14:02:39 <tdurakov> so, any news about the CI progress?
14:03:03 * johnthetubaguy lurks because of all the bug fixes patches he has pending
14:04:21 <pkoniszewski> one thing
14:04:28 <pkoniszewski> Matt uploaded new patch, let me find it
14:04:37 <siva_krishnan> https://review.openstack.org/#/c/431848/
14:05:10 <pkoniszewski> exactly this one
14:05:11 <pkoniszewski> thanks!
14:05:42 <pkoniszewski> this one sets LIVE_MIGRATE_BACK_AND_FORTH to True in project config
14:05:51 <pkoniszewski> instead of devstack-gate
14:05:59 <tdurakov> oh, it's a workaround for the back-and-froth
14:06:07 <tdurakov> s/forth
14:06:41 <tdurakov> I wonder, have clarkb seen it?
14:06:59 <pkoniszewski> i don't know
14:07:30 <siva_krishnan> we can add him as reviewer
14:08:29 <tdurakov> will add him
14:09:36 <tdurakov> serial console hook is still on review...
14:10:14 <tdurakov> https://review.openstack.org/#/c/346815/ - will catch markus_z after meeting, to figure out if he still working on this one
14:10:22 <tdurakov> ok
14:10:26 <tdurakov> let's move on
14:10:38 <tdurakov> #bugs
14:11:04 <tdurakov> https://bugs.launchpad.net/nova/+bug/1658877
14:11:05 <openstack> Launchpad bug 1658877 in OpenStack Compute (nova) "live migration failed with XenServer as hypervisor" [High,In progress] - Assigned to huan (huan-xie)
14:12:14 <tdurakov> overall a fix for that: https://review.openstack.org/#/c/424428/ looks ok, but there are no CI
14:12:45 <tdurakov> so, I've tried to deploy multinode locally, so had issues with deployment itself
14:13:30 <tdurakov> johnthetubaguy: thoughts?
14:18:05 <johnthetubaguy> yeah, I was trusting the folks from Citrix on having tested that one
14:18:49 <tdurakov> ok, let's merge it then...
14:19:03 <tdurakov> pkoniszewski: any updates on claims?
14:19:31 <pkoniszewski> no updates this time
14:19:37 <johnthetubaguy> tdurakov: thats my take, just needs another core at this point
14:19:59 <johnthetubaguy> I have some interesting bugs around the timeouts folks hit in production
14:20:48 <johnthetubaguy> Bug 1644248
14:20:48 <openstack> bug 1644248 in OpenStack Compute (nova) ocata "Nova incorrectly tracks live migration progress" [High,In progress] https://launchpad.net/bugs/1644248 - Assigned to Matt Riedemann (mriedem)
14:20:52 <johnthetubaguy> https://review.openstack.org/#/c/430218/
14:21:43 <johnthetubaguy> and Bug 1662626
14:21:43 <openstack> bug 1662626 in OpenStack Compute (nova) "live-migrate left in migrating as domain not found" [Medium,In progress] https://launchpad.net/bugs/1662626 - Assigned to John Garbutt (johngarbutt)
14:21:56 <johnthetubaguy> https://review.openstack.org/#/c/430404 (and the dependent patch)
14:22:12 <johnthetubaguy> OSIC has been doing live-migrate benchmarking
14:22:27 <johnthetubaguy> and these appear to be the issues that were blocking getting 100% success
14:22:38 <johnthetubaguy> (it was for a given "tuned" workload)
14:23:13 <tdurakov> johnthetubaguy: thanks for these 2
14:23:19 <johnthetubaguy> so would be great to get eyes those
14:23:20 <tdurakov> second looks interesting...
14:23:28 <tdurakov> yeah
14:23:45 <johnthetubaguy> FWIW, after applying these patches (and the one we got into RC2) we were able to get 100% live-migrate success
14:24:15 <tdurakov> got it, anyway, will review
14:24:29 <johnthetubaguy> pkoniszewski: I would love a deep libvirt look at that domain not found one, it was a bit strange
14:24:47 <tdurakov> pkoniszewski: do you need some help with claims?
14:24:55 <pkoniszewski> claims are on a good way\
14:25:02 <pkoniszewski> i will take a look on this one johnthetubaguy
14:25:11 <tdurakov> ok, cool
14:25:13 <johnthetubaguy> I am curious if folks are OK with the removal of the completion timeout, it seemed the best thing to do
14:25:23 <johnthetubaguy> oops
14:25:28 <johnthetubaguy> I mean the progress timeout
14:25:37 <johnthetubaguy> and changing the auto post_copy trigger
14:25:37 <johnthetubaguy> https://review.openstack.org/#/c/430218
14:26:27 <johnthetubaguy> pkoniszewski: I messed up one of your changes a little bit too, trying to re-work it on top of all of these changes: https://review.openstack.org/#/c/408002
14:27:10 <pkoniszewski> i haven't had time to take a deep look yet
14:27:26 <johnthetubaguy> pkoniszewski: no worries
14:28:01 <johnthetubaguy> I am really wondering if we can ship defaults that mean (when post copy is available) users never hit live-migration timeouts, that would be kinda cool
14:29:16 <pkoniszewski> some operators keep saying that they dont want to use post copy
14:29:17 <johnthetubaguy> I do wonder if operators want to control the action on competition timeout, either pause, post_copy or error out the live-migrate
14:29:27 <davidgiluk> johnthetubaguy: Depending how you do storage migration possibly
14:29:49 <pkoniszewski> i think that Chet is one of them
14:29:56 <johnthetubaguy> yeah, if you timeout migrating storage, thats a looog pause
14:30:06 <pkoniszewski> he said that it is a way too big risk with way too big impact on workloads
14:30:26 <johnthetubaguy> pkoniszewski: so he favours a longer downtime I believe, I think he uses 1000 ms rather than 500 ms, but we seem to keep rejecting that change upstream
14:30:53 <johnthetubaguy> maybe pause the VM by default, if you hit the timeout (assuming disk copy has completed)?
14:32:50 <johnthetubaguy> anyways, some patches up around that area, feels like we can make live-migrate feel way more reliable with those changes
14:32:59 <pkoniszewski> should this be new option in nova.conf? I mean, action_on_timeout or something like that?
14:33:15 <pkoniszewski> so it might be changed between abort, pause and post-copy
14:33:19 <pkoniszewski> or abort|force
14:33:43 <pkoniszewski> i guess that was your initial proposition, johnthetubaguy?
14:34:01 <johnthetubaguy> possibly
14:34:12 <johnthetubaguy> currently I just do post_copy if enabled, or abort
14:34:23 <johnthetubaguy> since thats the only timeout we have for the auto trigger now
14:34:49 * davidgiluk still thinks it's not a per-host config; if people are worried about postcopy then I'd assume they're worried about it for some hyper-critical VMs
14:35:18 <pkoniszewski> got disconnected, sorry
14:35:24 <tdurakov> why not to leave this options to operator
14:35:29 <johnthetubaguy> davidgiluk: you mean for the force-live-migrate call?
14:35:55 <davidgiluk> johnthetubaguy: Yeh - I mean the problem is whether it's a 'this is what we do for all our VMs' or a 'this is what we do for this class of VMs'
14:36:06 <johnthetubaguy> davidgiluk: honestly, it feels more like an image property or flavor property thing, than an API thing
14:36:14 <davidgiluk> johnthetubaguy: Yeh
14:36:26 <johnthetubaguy> so really I want to make it so folks never really have to call foce-live-migrate
14:36:39 <johnthetubaguy> so if they get some preference from the image, I am OK with that
14:37:04 <johnthetubaguy> the conf is more just "allow or disable post-copy" then
14:37:18 <johnthetubaguy> but anyways, thats a separate spec
14:37:37 <johnthetubaguy> I should probably do a spec for this, now we have the bug fix out the way (default to no progress timeout)
14:41:02 <johnthetubaguy> we should move on to the next thing I guess
14:41:37 <tdurakov> johnthetubaguy: the thing is that the behaviour will be different, and not sure it will be fine to store it on image meta
14:43:03 <johnthetubaguy> tdurakov: we change behaviour quite a lot with other image properties, either way, it needs a spec to discuss all the details, will put that on my TODO list
14:43:33 <tdurakov> kk
14:43:55 <siva_krishnan> we can try to do it based in the info that we can get from vm diagnostics as well
14:44:51 <tdurakov> #topic pike bps
14:45:12 <johnthetubaguy> siva_krishnan: not sure what you mean, I am curious, can talk more offline
14:45:31 <johnthetubaguy> I guess, what do we need to cover at the PTG / who will be there?
14:45:33 <siva_krishnan> sure johnthetubaguy
14:46:06 <pkoniszewski> i will be there, cfriesen added claim patches to the list
14:46:38 <tdurakov> me too
14:46:49 <johnthetubaguy> cool, I am going, happy to talk about those things above, and the results of the OSIC testing
14:47:16 <johnthetubaguy> I know raj_singh said he would be there
14:47:41 <johnthetubaguy> well, thats a good number for a good discussion
14:49:05 <tdurakov> let's discuss it on the ptg then
14:49:15 <tdurakov> #topic open discussions
14:49:29 <siva_krishnan> https://launchpad.net/bugs/1662626
14:49:29 <openstack> Launchpad bug 1662626 in OpenStack Compute (nova) "live-migrate left in migrating as domain not found" [Medium,In progress] - Assigned to John Garbutt (johngarbutt)
14:50:16 <siva_krishnan> sorry wrong bug link
14:50:19 <siva_krishnan> https://bugs.launchpad.net/nova/+bug/1605016
14:50:19 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] - Assigned to Sivasathurappan Radhakrishnan (siva-radhakrishnan)
14:50:40 <siva_krishnan> I am trying to do network port binding based on VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY and VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED. Any thoughts/suggestions on this one ?
14:51:18 <johnthetubaguy> so I was suggesting we go for the event that means the VM is now paused on the source node
14:51:56 <johnthetubaguy> so we get a change to update the network updated before we resume on the destination host, but not too soon, while the VM might still take a while to actually trigger the post copy action
14:52:03 <johnthetubaguy> s/change/chance/
14:52:16 <davidgiluk> I think there's an event you can get from the destination
14:52:33 <johnthetubaguy> but on the destination its too late I believe, its already moved
14:52:41 <siva_krishnan> I have a patch ready for it based on your suggestions johnthetubaguy. just wanted to know tdurakov, davidgiluk and pkoniszewski thoughts on it as well
14:52:56 <johnthetubaguy> you got the link handy?
14:53:01 <davidgiluk> johnthetubaguy: Not if it's running with -S - qemu wont have autostarted it; I was just worrying about the source stopping but the destination not quite being ready yet
14:53:43 <davidgiluk> siva_krishnan: Can you loop jdenemar in - he knows what it looks like from a libvirt view of it, I only know what it looks like from deeper in qemu
14:53:53 <johnthetubaguy> davidgiluk: sorry, I totally lost you there
14:54:06 <siva_krishnan> As of now I just have WIP patch https://review.openstack.org/#/c/432370/ whcih I shared with you earlier
14:54:18 <davidgiluk> siva_krishnan: Can you add that to the lp if it isn't already ?
14:54:43 <johnthetubaguy> davidgiluk: I don't believe we need the destination to be ready before we make the neutron switch, at least thats what the neutron folks were saying, as the VIF plug already happened in a pre-live-migrate step
14:55:02 <davidgiluk> johnthetubaguy: Ah ok, that's ok then
14:55:33 <johnthetubaguy> thats an assumption people nodded at, rather than me being 100% certainly though
14:55:50 <johnthetubaguy> but seems like we are good for the 80% case for sure
14:56:53 <johnthetubaguy> siva_krishnan: you OK finding jdenemar and asking them what they think?
14:57:26 <siva_krishnan> johnthetubaguy: yeah sure will do it.
14:57:36 <tdurakov> so, need to finish, thanks to everyone
14:57:41 <tdurakov> #endmeeting
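
The timeout handling discussed under the bugs topic can be sketched roughly as follows. This is only an illustration, not Nova code: `action_on_timeout` was floated by pkoniszewski as a possible nova.conf option and was not agreed, and the names `TimeoutAction` and `choose_timeout_action` are hypothetical. The default branch mirrors what johnthetubaguy described for the current patch (post_copy if enabled, or abort). The fallback when post-copy is requested but unavailable is an assumption made for the example.

```python
from enum import Enum


class TimeoutAction(Enum):
    """Hypothetical values for an 'action_on_timeout' style setting."""
    ABORT = "abort"
    PAUSE = "pause"
    POST_COPY = "post_copy"


def choose_timeout_action(configured, post_copy_available):
    """Pick what to do when a live migration hits its completion timeout.

    configured: a TimeoutAction, or None when nothing was set explicitly.
    post_copy_available: True when post-copy is enabled and supported.
    """
    if configured is None:
        # Behaviour described in the meeting: post_copy if enabled, or abort.
        if post_copy_available:
            return TimeoutAction.POST_COPY
        return TimeoutAction.ABORT
    if configured is TimeoutAction.POST_COPY and not post_copy_available:
        # Assumption for this sketch: abort rather than silently pause the VM.
        return TimeoutAction.ABORT
    return configured
```

Whether the preference would live in nova.conf, an image property, or a flavor property was left open in the meeting pending a spec; the sketch takes the value as a plain argument so it stays neutral on that question.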