14:01:20 <tdurakov> #startmeeting Nova Live Migration
14:01:21 <openstack> Meeting started Tue Feb 14 14:01:20 2017 UTC and is due to finish in 60 minutes.  The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:25 <openstack> The meeting name has been set to 'nova_live_migration'
14:01:29 <tdurakov> hi everyone
14:01:32 <davidgiluk> hi
14:01:44 <siva_krishnan> hi
14:01:58 <tdurakov> since there was no meeting last week, let's walk through the previous topics
14:02:08 <tdurakov> #topic CI
14:02:39 <tdurakov> so, any news about the CI progress?
14:03:03 * johnthetubaguy lurks because of all the bug fixes patches he has pending
14:04:21 <pkoniszewski> one thing
14:04:28 <pkoniszewski> Matt uploaded new patch, let me find it
14:04:37 <siva_krishnan> https://review.openstack.org/#/c/431848/
14:05:10 <pkoniszewski> exactly this one
14:05:11 <pkoniszewski> thanks!
14:05:42 <pkoniszewski> this one sets LIVE_MIGRATE_BACK_AND_FORTH to True in project config
14:05:51 <pkoniszewski> instead of devstack-gate
14:05:59 <tdurakov> oh, it's a workaround for the back-and-froth
14:06:07 <tdurakov> s/forth
14:06:41 <tdurakov> I wonder, has clarkb seen it?
14:06:59 <pkoniszewski> i don't know
14:07:30 <siva_krishnan> we can add him as reviewer
14:08:29 <tdurakov> will add him
14:09:36 <tdurakov> serial console hook is still on review...
14:10:14 <tdurakov> https://review.openstack.org/#/c/346815/ - will catch markus_z after meeting, to figure out if he is still working on this one
14:10:22 <tdurakov> ok
14:10:26 <tdurakov> let's move on
14:10:38 <tdurakov> #topic bugs
14:11:04 <tdurakov> https://bugs.launchpad.net/nova/+bug/1658877
14:11:05 <openstack> Launchpad bug 1658877 in OpenStack Compute (nova) "live migration failed with XenServer as hypervisor" [High,In progress] - Assigned to huan (huan-xie)
14:12:14 <tdurakov> overall a fix for that: https://review.openstack.org/#/c/424428/ looks ok, but there is no CI
14:12:45 <tdurakov> so, I've tried to deploy multinode locally, but had issues with the deployment itself
14:13:30 <tdurakov> johnthetubaguy: thoughts?
14:18:05 <johnthetubaguy> yeah, I was trusting the folks from Citrix on having tested that one
14:18:49 <tdurakov> ok, let's merge it then...
14:19:03 <tdurakov> pkoniszewski: any updates on claims?
14:19:31 <pkoniszewski> no updates this time
14:19:37 <johnthetubaguy> tdurakov: that's my take, just needs another core at this point
14:19:59 <johnthetubaguy> I have some interesting bugs around the timeouts folks hit in production
14:20:48 <johnthetubaguy> Bug 1644248
14:20:48 <openstack> bug 1644248 in OpenStack Compute (nova) ocata "Nova incorrectly tracks live migration progress" [High,In progress] https://launchpad.net/bugs/1644248 - Assigned to Matt Riedemann (mriedem)
14:20:52 <johnthetubaguy> https://review.openstack.org/#/c/430218/
14:21:43 <johnthetubaguy> and Bug 1662626
14:21:43 <openstack> bug 1662626 in OpenStack Compute (nova) "live-migrate left in migrating as domain not found" [Medium,In progress] https://launchpad.net/bugs/1662626 - Assigned to John Garbutt (johngarbutt)
14:21:56 <johnthetubaguy> https://review.openstack.org/#/c/430404 (and the dependent patch)
14:22:12 <johnthetubaguy> OSIC has been doing live-migrate benchmarking
14:22:27 <johnthetubaguy> and these appear to be the issues that were blocking getting 100% success
14:22:38 <johnthetubaguy> (it was for a given "tuned" workload)
14:23:13 <tdurakov> johnthetubaguy: thanks for these 2
14:23:19 <johnthetubaguy> so would be great to get eyes on those
14:23:20 <tdurakov> second looks interesting...
14:23:28 <tdurakov> yeah
14:23:45 <johnthetubaguy> FWIW, after applying these patches (and the one we got into RC2) we were able to get 100% live-migrate success
14:24:15 <tdurakov> got it, anyway, will review
14:24:29 <johnthetubaguy> pkoniszewski: I would love a deep libvirt look at that domain not found one, it was a bit strange
14:24:47 <tdurakov> pkoniszewski: do you need some help with claims?
14:24:55 <pkoniszewski> claims are on track
14:25:02 <pkoniszewski> i will take a look at this one johnthetubaguy
14:25:11 <tdurakov> ok, cool
14:25:13 <johnthetubaguy> I am curious if folks are OK with the removal of the completion timeout, it seemed the best thing to do
14:25:23 <johnthetubaguy> oops
14:25:28 <johnthetubaguy> I mean the progress timeout
14:25:37 <johnthetubaguy> and changing the auto post_copy trigger
14:25:37 <johnthetubaguy> https://review.openstack.org/#/c/430218
14:26:27 <johnthetubaguy> pkoniszewski: I messed up one of your changes a little bit too, trying to re-work it on top of all of these changes: https://review.openstack.org/#/c/408002
14:27:10 <pkoniszewski> i haven't had time to take a deep look yet
14:27:26 <johnthetubaguy> pkoniszewski: no worries
14:28:01 <johnthetubaguy> I am really wondering if we can ship defaults that mean (when post copy is available) users never hit live-migration timeouts, that would be kinda cool
14:29:16 <pkoniszewski> some operators keep saying that they don't want to use post copy
14:29:17 <johnthetubaguy> I do wonder if operators want to control the action on completion timeout, either pause, post_copy or error out the live-migrate
14:29:27 <davidgiluk> johnthetubaguy: Depending how you do storage migration possibly
14:29:49 <pkoniszewski> i think that Chet is one of them
14:29:56 <johnthetubaguy> yeah, if you timeout migrating storage, that's a looog pause
14:30:06 <pkoniszewski> he said that it is way too big a risk with way too big an impact on workloads
14:30:26 <johnthetubaguy> pkoniszewski: so he favours a longer downtime I believe, I think he uses 1000 ms rather than 500 ms, but we seem to keep rejecting that change upstream
14:30:53 <johnthetubaguy> maybe pause the VM by default, if you hit the timeout (assuming disk copy has completed)?
14:32:50 <johnthetubaguy> anyways, some patches up around that area, feels like we can make live-migrate feel way more reliable with those changes
14:32:59 <pkoniszewski> should this be new option in nova.conf? I mean, action_on_timeout or something like that?
14:33:15 <pkoniszewski> so it might be changed between abort, pause and post-copy
14:33:19 <pkoniszewski> or abort|force
14:33:43 <pkoniszewski> i guess that was your initial proposition, johnthetubaguy?
14:34:01 <johnthetubaguy> possibly
14:34:12 <johnthetubaguy> currently I just do post_copy if enabled, or abort
14:34:23 <johnthetubaguy> since that's the only timeout we have for the auto trigger now
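
The option floated above (abort, pause or post-copy on timeout) could be resolved roughly as in the sketch below. This is an illustration only, not Nova code: the names `TimeoutAction` and `resolve_timeout_action` are hypothetical, and falling back to abort when the preferred action is unavailable is an assumption based on the "post_copy if enabled, or abort" behaviour and the "assuming disk copy has completed" caveat mentioned in the discussion.

```python
from enum import Enum


class TimeoutAction(Enum):
    """Hypothetical values for an 'action on timeout' style option."""
    ABORT = "abort"
    PAUSE = "pause"
    POST_COPY = "post_copy"


def resolve_timeout_action(configured, post_copy_available, disk_copy_done=True):
    """Pick what to do when a live migration hits its completion timeout.

    Assumption: if the preferred action cannot be used safely, abort.
    """
    action = TimeoutAction(configured)  # raises ValueError on unknown values
    if action is TimeoutAction.POST_COPY and not post_copy_available:
        # Post-copy requested but not enabled or supported on this host.
        return TimeoutAction.ABORT
    if action is TimeoutAction.PAUSE and not disk_copy_done:
        # Pausing during storage migration could mean a very long pause.
        return TimeoutAction.ABORT
    return action


if __name__ == "__main__":
    print(resolve_timeout_action("post_copy", post_copy_available=False))
```

Whether such a preference belongs in nova.conf, a flavor, or an image property is exactly the open question in the surrounding discussion; the sketch takes no position on that.
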
14:34:49 * davidgiluk still thinks it's not a per-host config; if people are worried about postcopy then I'd assume they're worried about it for some hyper-critical VMs
14:35:18 <pkoniszewski> got disconnected, sorry
14:35:24 <tdurakov> why not leave this option to the operator?
14:35:29 <johnthetubaguy> davidgiluk: you mean for the force-live-migrate call?
14:35:55 <davidgiluk> johnthetubaguy: Yeh - I mean the problem is whether it's a 'this is what we do for all our VMs' or a 'this is what we do for this class of VMs'
14:36:06 <johnthetubaguy> davidgiluk: honestly, it feels more like an image property or flavor property thing, than an API thing
14:36:14 <davidgiluk> johnthetubaguy: Yeh
14:36:26 <johnthetubaguy> so really I want to make it so folks never really have to call force-live-migrate
14:36:39 <johnthetubaguy> so if they get some preference from the image, I am OK with that
14:37:04 <johnthetubaguy> the conf is more just "allow or disable post-copy" then
14:37:18 <johnthetubaguy> but anyways, thats a separate spec
14:37:37 <johnthetubaguy> I should probably do a spec for this, now we have the bug fix out the way (default to no progress timeout)
14:41:02 <johnthetubaguy> we should move on to the next thing I guess
14:41:37 <tdurakov> johnthetubaguy: the thing is that the behaviour will be different, and not sure it will be fine to store it on image meta
14:43:03 <johnthetubaguy> tdurakov: we change behaviour quite a lot with other image properties, either way, it needs a spec to discuss all the details, will put that on my TODO list
14:43:33 <tdurakov> kk
14:43:55 <siva_krishnan> we can try to do it based on the info that we can get from vm diagnostics as well
14:44:51 <tdurakov> #topic pike bps
14:45:12 <johnthetubaguy> siva_krishnan: not sure what you mean, I am curious, can talk more offline
14:45:31 <johnthetubaguy> I guess, what do we need to cover at the PTG / who will be there?
14:45:33 <siva_krishnan> sure johnthetubaguy
14:46:06 <pkoniszewski> i will be there, cfriesen added claim patches to the list
14:46:38 <tdurakov> me too
14:46:49 <johnthetubaguy> cool, I am going, happy to talk about those things above, and the results of the OSIC testing
14:47:16 <johnthetubaguy> I know raj_singh said he would be there
14:47:41 <johnthetubaguy> well, thats a good number for a good discussion
14:49:05 <tdurakov> let's discuss it at the PTG then
14:49:15 <tdurakov> #topic open discussions
14:49:29 <siva_krishnan> https://launchpad.net/bugs/1662626
14:49:29 <openstack> Launchpad bug 1662626 in OpenStack Compute (nova) "live-migrate left in migrating as domain not found" [Medium,In progress] - Assigned to John Garbutt (johngarbutt)
14:50:16 <siva_krishnan> sorry wrong bug link
14:50:19 <siva_krishnan> https://bugs.launchpad.net/nova/+bug/1605016
14:50:19 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] - Assigned to Sivasathurappan Radhakrishnan (siva-radhakrishnan)
14:50:40 <siva_krishnan> I am trying to do network port binding based on VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY and VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED. Any thoughts/suggestions on this one?
14:51:18 <johnthetubaguy> so I was suggesting we go for the event that means the VM is now paused on the source node
14:51:56 <johnthetubaguy> so we get a change to get the network updated before we resume on the destination host, but not too soon, while the VM might still take a while to actually trigger the post copy action
14:52:03 <johnthetubaguy> s/change/chance/
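
The idea of switching the port binding when the VM pauses on the source could look roughly like the sketch below. It is not the patch under review: `PortBinder` and its `bind_ports` callable are hypothetical, the event details are passed as plain strings so the sketch runs without the libvirt bindings, and binding at most once per instance is an assumption made here because either event might be delivered.

```python
# Names of the libvirt "suspended" event details mentioned in the discussion.
# Kept as strings so the sketch runs without the libvirt bindings installed.
SOURCE_PAUSED_DETAILS = frozenset({
    "VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY",
    "VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED",
})


class PortBinder:
    """Hypothetical helper that switches port bindings once per instance."""

    def __init__(self, bind_ports):
        # bind_ports is an assumed callable: bind_ports(instance_id, dest_host)
        self._bind_ports = bind_ports
        self._done = set()

    def on_suspended(self, instance_id, dest_host, detail):
        """Handle a 'suspended' lifecycle event seen on the source host.

        Returns True if the binding was switched by this call.
        """
        if detail not in SOURCE_PAUSED_DETAILS:
            return False  # e.g. a user-initiated pause, not a migration step
        if instance_id in self._done:
            return False  # already switched for this instance
        self._bind_ports(instance_id, dest_host)
        self._done.add(instance_id)
        return True


if __name__ == "__main__":
    binder = PortBinder(lambda vm, host: print("bind", vm, "->", host))
    binder.on_suspended("vm1", "dest-host", "VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY")
```
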
14:52:16 <davidgiluk> I think there's an event you can get from the destination
14:52:33 <johnthetubaguy> but on the destination it's too late I believe, it's already moved
14:52:41 <siva_krishnan> I have a patch ready for it based on your suggestions johnthetubaguy. just wanted to know tdurakov's, davidgiluk's and pkoniszewski's thoughts on it as well
14:52:56 <johnthetubaguy> you got the link handy?
14:53:01 <davidgiluk> johnthetubaguy: Not if it's running with -S - qemu won't have autostarted it; I was just worrying about the source stopping but the destination not quite being ready yet
14:53:43 <davidgiluk> siva_krishnan: Can you loop jdenemar in - he knows what it looks like from a libvirt view of it, I only know what it looks like from deeper in qemu
14:53:53 <johnthetubaguy> davidgiluk: sorry, I totally lost you there
14:54:06 <siva_krishnan> As of now I just have a WIP patch https://review.openstack.org/#/c/432370/ which I shared with you earlier
14:54:18 <davidgiluk> siva_krishnan: Can you add that to the lp if it isn't already ?
14:54:43 <johnthetubaguy> davidgiluk: I don't believe we need the destination to be ready before we make the neutron switch, at least that's what the neutron folks were saying, as the VIF plug already happened in a pre-live-migrate step
14:55:02 <davidgiluk> johnthetubaguy: Ah ok, that's ok then
14:55:33 <johnthetubaguy> that's an assumption people nodded at, rather than me being 100% certain though
14:55:50 <johnthetubaguy> but seems like we are good for the 80% case for sure
14:56:53 <johnthetubaguy> siva_krishnan: you OK finding jdenemar and asking them what they think?
14:57:26 <siva_krishnan> johnthetubaguy: yeah sure will do it.
14:57:36 <tdurakov> so, need to finish, thanks to everyone
14:57:41 <tdurakov> #endmeeting