14:01:20 #startmeeting Nova Live Migration
14:01:21 Meeting started Tue Feb 14 14:01:20 2017 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:25 The meeting name has been set to 'nova_live_migration'
14:01:29 hi everyone
14:01:32 hi
14:01:44 hi
14:01:58 since there were no meeting last week let's walk through the previous topics
14:02:08 #topic CI
14:02:39 so, any news about the CI progress?
14:03:03 * johnthetubaguy lurks because of all the bug fixes patches he has pending
14:04:21 one thing
14:04:28 Matt uploaded new patch, let me find it
14:04:37 https://review.openstack.org/#/c/431848/
14:05:10 exactly this one
14:05:11 thanks!
14:05:42 this one sets LIVE_MIGRATE_BACK_AND_FORTH to True in project config
14:05:51 instead of devstack-gate
14:05:59 oh, it's a workaround for the back-and-froth
14:06:07 s/forth
14:06:41 I wonder, have clarkb seen it?
14:06:59 i don't know
14:07:30 we can add him as reviewer
14:08:29 will add him
14:09:36 serial console hook is still on review...
14:10:14 https://review.openstack.org/#/c/346815/ - will catch markus_z after meeting, to figure out if he still working on this one
14:10:22 ok
14:10:26 let's move on
14:10:38 #bugs
14:11:04 https://bugs.launchpad.net/nova/+bug/1658877
14:11:05 Launchpad bug 1658877 in OpenStack Compute (nova) "live migration failed with XenServer as hypervisor" [High,In progress] - Assigned to huan (huan-xie)
14:12:14 overall a fix for that: https://review.openstack.org/#/c/424428/ looks ok, but there are no CI
14:12:45 so, I've tried to deploy multinode locally, so had issues with deployment itself
14:13:30 johnthetubaguy: thoughts?
14:18:05 yeah, I was trusting the folks from Citrix on having tested that one
14:18:49 ok, let's merge it then...
14:19:03 pkoniszewski: any updates on claims?
14:19:31 no updates this time
14:19:37 tdurakov: thats my take, just needs another core at this point
14:19:59 I have some interesting bugs around the timeouts folks hit in production
14:20:48 Bug 1644248
14:20:48 bug 1644248 in OpenStack Compute (nova) ocata "Nova incorrectly tracks live migration progress" [High,In progress] https://launchpad.net/bugs/1644248 - Assigned to Matt Riedemann (mriedem)
14:20:52 https://review.openstack.org/#/c/430218/
14:21:43 and Bug 1662626
14:21:43 bug 1662626 in OpenStack Compute (nova) "live-migrate left in migrating as domain not found" [Medium,In progress] https://launchpad.net/bugs/1662626 - Assigned to John Garbutt (johngarbutt)
14:21:56 https://review.openstack.org/#/c/430404 (and the dependent patch)
14:22:12 OSIC has been doing live-migrate benchmarking
14:22:27 and these appear to be the issues that were blocking getting 100% success
14:22:38 (it was for a given "tuned" workload)
14:23:13 johnthetubaguy: thanks for these 2
14:23:19 so would be great to get eyes those
14:23:20 second looks interesting...
14:23:28 yeah
14:23:45 FWIW, after applying these patches (and the one we got into RC2) we were able to get 100% live-migrate success
14:24:15 got it, anyway, will review
14:24:29 pkoniszewski: I would love a deep libvirt look at that domain not found one, it was a bit strange
14:24:47 pkoniszewski: do you need some help with claims?
14:24:55 claims are on a good way\
14:25:02 i will take a look on this one johnthetubaguy
14:25:11 ok, cool
14:25:13 I am curious if folks are OK with the removal of the completion timeout, it seemed the best thing to do
14:25:23 oops
14:25:28 I mean the progress timeout
14:25:37 and changing the auto post_copy trigger
14:25:37 https://review.openstack.org/#/c/430218
14:26:27 pkoniszewski: I messed up one of your changes a little bit too, trying to re-work it on top of all of these changes: https://review.openstack.org/#/c/408002
14:27:10 i haven't had time to take a deep look yet
14:27:26 pkoniszewski: no worries
14:28:01 I am really wondering if we can ship defaults that mean (when post copy is available) users never hit live-migration timeouts, that would be kinda cool
14:29:16 some operators keep saying that they dont want to use post copy
14:29:17 I do wonder if operators want to control the action on competition timeout, either pause, post_copy or error out the live-migrate
14:29:27 johnthetubaguy: Depending how you do storage migration possibly
14:29:49 i think that Chet is one of them
14:29:56 yeah, if you timeout migrating storage, thats a looog pause
14:30:06 he said that it is a way too big risk with way too big impact on workloads
14:30:26 pkoniszewski: so he favours a longer downtime I believe, I think he uses 1000 ms rather than 500 ms, but we seem to keep rejecting that change upstream
14:30:53 maybe pause the VM by default, if you hit the timeout (assuming disk copy has completed)?
14:32:50 anyways, some patches up around that area, feels like we can make live-migrate feel way more reliable with those changes
14:32:59 should this be new option in nova.conf? I mean, action_on_timeout or something like that?
14:33:15 so it might be changed between abort, pause and post-copy
14:33:19 or abort|force
14:33:43 i guess that was your initial proposition, johnthetubaguy?
14:34:01 possibly
14:34:12 currently I just do post_copy if enabled, or abort
14:34:23 since thats the only timeout we have for the auto trigger now
14:34:49 * davidgiluk still thinks it's not a per-host config; if people are worried about postcopy then I'd assume they're worried about it for some hyper-critical VMs
14:35:18 got disconnected, sorry
14:35:24 why not to leave this options to operator
14:35:29 davidgiluk: you mean for the force-live-migrate call?
14:35:55 johnthetubaguy: Yeh - I mean the problem is whether it's a 'this is what we do for all our VMs' or a 'this is what we do for this class of VMs'
14:36:06 davidgiluk: honestly, it feels more like an image property or flavor property thing, than an API thing
14:36:14 johnthetubaguy: Yeh
14:36:26 so really I want to make it so folks never really have to call foce-live-migrate
14:36:39 so if they get some preference from the image, I am OK with that
14:37:04 the conf is more just "allow or disable post-copy" then
14:37:18 but anyways, thats a separate spec
14:37:37 I should probably do a spec for this, now we have the bug fix out the way (default to no progress timeout)
14:41:02 we should move on to the next thing I guess
14:41:37 johnthetubaguy: the thing is that the behaviour will be different, and not sure it will be fine to store it on image meta
14:43:03 tdurakov: we change behaviour quite a lot with other image properties, either way, it needs a spec to discuss all the details, will put that on my TODO list
14:43:33 kk
14:43:55 we can try to do it based in the info that we can get from vm diagnostics as well
14:44:51 #topic pike bps
14:45:12 siva_krishnan: not sure what you mean, I am curious, can talk more offline
14:45:31 I guess, what do we need to cover at the PTG / who will be there?
14:45:33 sure johnthetubaguy
14:46:06 i will be there, cfriesen added claim patches to the list
14:46:38 me too
14:46:49 cool, I am going, happy to talk about those things above, and the results of the OSIC testing
14:47:16 I know raj_singh said he would be there
14:47:41 well, thats a good number for a good discussion
14:49:05 let's discuss it on the ptg then
14:49:15 #topic open discussions
14:49:29 https://launchpad.net/bugs/1662626
14:49:29 Launchpad bug 1662626 in OpenStack Compute (nova) "live-migrate left in migrating as domain not found" [Medium,In progress] - Assigned to John Garbutt (johngarbutt)
14:50:16 sorry wrong bug link
14:50:19 https://bugs.launchpad.net/nova/+bug/1605016
14:50:19 Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] - Assigned to Sivasathurappan Radhakrishnan (siva-radhakrishnan)
14:50:40 I am trying to do network port binding based on VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY and VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED. Any thoughts/suggestions on this one ?
14:51:18 so I was suggesting we go for the event that means the VM is now paused on the source node
14:51:56 so we get a change to update the network updated before we resume on the destination host, but not too soon, while the VM might still take a while to actually trigger the post copy action
14:52:03 s/change/chance/
14:52:16 I think there's an event you can get from the destination
14:52:33 but on the destination its too late I believe, its already moved
14:52:41 I have a patch ready for it based on your suggestions johnthetubaguy. just wanted to know tdurakov, davidgiluk and pkoniszewski thoughts on it as well
14:52:56 you got the link handy?
14:53:01 johnthetubaguy: Not if it's running with -S - qemu wont have autostarted it; I was just worrying about the source stopping but the destination not quite being ready yet
14:53:43 siva_krishnan: Can you loop jdenemar in - he knows what it looks like from a libvirt view of it, I only know what it looks like from deeper in qemu
14:53:53 davidgiluk: sorry, I totally lost you there
14:54:06 As of now I just have WIP patch https://review.openstack.org/#/c/432370/ whcih I shared with you earlier
14:54:18 siva_krishnan: Can you add that to the lp if it isn't already ?
14:54:43 davidgiluk: I don't believe we need the destination to be ready before we make the neutron switch, at least thats what the neutron folks were saying, as the VIF plug already happened in a pre-live-migrate step
14:55:02 johnthetubaguy: Ah ok, that's ok then
14:55:33 thats an assumption people nodded at, rather than me being 100% certainly though
14:55:50 but seems like we are good for the 80% case for sure
14:56:53 siva_krishnan: you OK finding jdenemar and asking them what they think?
14:57:26 johnthetubaguy: yeah sure will do it.
14:57:36 so, need to finish, thanks to everyone
14:57:41 #endmeeting
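
The timeout-action idea discussed between 14:29 and 14:34 (choosing between abort, pause and post-copy when a live migration hits its timeout) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Nova code: the option name action_on_timeout was only floated in the meeting, and TimeoutAction, resolve_timeout_action and the fall-back-to-abort rule are hypothetical names and behaviour chosen for this sketch.

```python
from enum import Enum


class TimeoutAction(Enum):
    """Hypothetical values for the action_on_timeout option floated in the meeting."""
    ABORT = "abort"
    PAUSE = "pause"
    POST_COPY = "post_copy"


def resolve_timeout_action(configured, post_copy_available, disk_copy_done=True):
    """Pick what to do when a live migration hits its timeout.

    Assumption (not taken from Nova): any action that cannot be applied
    safely falls back to aborting the migration.
    """
    action = TimeoutAction(configured)  # raises ValueError on unknown values
    # Post-copy can only be triggered if the hypervisor stack supports it.
    if action is TimeoutAction.POST_COPY and not post_copy_available:
        return TimeoutAction.ABORT
    # Pausing while storage is still being copied would mean a very long pause.
    if action is TimeoutAction.PAUSE and not disk_copy_done:
        return TimeoutAction.ABORT
    return action
```

With post-copy configured but unavailable, the sketch returns TimeoutAction.ABORT, which mirrors the "post_copy if enabled, or abort" behaviour described at 14:34:12. Whether such a preference belongs in nova.conf, a flavor or an image property was left to a future spec.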