14:00:25 <tdurakov> #startmeeting Nova Live Migration
14:00:26 <openstack> Meeting started Tue Nov 22 14:00:25 2016 UTC and is due to finish in 60 minutes.  The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:29 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:35 <tdurakov> hello everyone
14:00:54 <tdurakov> #link agenda https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:16 <markus_z> o/
14:01:34 <tdurakov> hm
14:01:45 <tdurakov> only me and markus_z O_o
14:01:51 <pkoniszewski> o/
14:02:16 <markus_z> whoop whoop, three's a party :)
14:02:22 <davidgiluk> o/
14:02:36 <tdurakov> last week there were only 3:(
14:02:41 <tdurakov> so now it's 4)
14:02:42 * kashyap waves
14:02:47 <tdurakov> ok, let's start
14:02:51 <tdurakov> #topic ci
14:03:12 <tdurakov> I think grenade and ceph bits are stuck on tempest part
14:03:35 <pkoniszewski> that's right
14:03:45 <tdurakov> https://review.openstack.org/#/c/379638/
14:03:45 <tdurakov> https://review.openstack.org/#/c/389767/
14:04:15 <tdurakov> tried to catch JordanP last week, no results
14:05:46 <tdurakov> sent a message to #openstack-qa with a review request
14:06:01 <wznoinsk> late hi
14:06:08 <tdurakov> pkoniszewski: any updates on stats for post-copy testing?
14:06:57 * tdurakov trying to make predictable mem load on instance - work in progress
14:07:02 <pkoniszewski> none yet, i was buried in reviewing serial console and claims patches
14:07:04 <pkoniszewski> i'm still a bit
14:07:14 <tdurakov> ok
14:07:21 <tdurakov> let's go next topic
14:07:31 <tdurakov> #topic bugs
14:07:53 <tdurakov> so, let's discuss those patches for serial-console and live migration
14:08:13 <markus_z> I wanted to ask here for a general direction on those patches.
14:08:17 <markus_z> For context: https://bugs.launchpad.net/nova/+bug/1455252
14:08:17 <openstack> Launchpad bug 1455252 in OpenStack Compute (nova) "enabling serial console breaks live migration" [High,In progress] - Assigned to sahid (sahid-ferdjaoui)
14:08:39 <markus_z> I took over from sahid on that series and discussed it yesterday with dansmith.
14:09:06 <markus_z> Let me get the reference to that, one moment
14:09:35 <markus_z> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2016-11-21.log.html#t2016-11-21T15:10:15
14:09:57 <tdurakov> https://review.openstack.org/#/c/397276/ -is that fix?
14:10:35 <markus_z> ^ is one of the preparation patches. The actual fix is https://review.openstack.org/#/c/275801/
14:10:48 <markus_z> I'm rebasing that after this meeting.
14:11:00 <pkoniszewski> we are almost there, I think
14:11:08 <pkoniszewski> there is one change that I don't like in those patches
14:11:08 <tdurakov> markus_z: I'm ok with None too
14:11:10 <pkoniszewski> I mean this check move
14:11:20 <pkoniszewski> this is completely unrelated to the fix
14:11:45 <markus_z> OK, cool, that's the two major hurdles at the moment.
14:11:53 <markus_z> I got it and will change them accordingly.
14:12:02 <pkoniszewski> i'm fine with everything else and I agree with Dan's point
14:12:18 <markus_z> Anything else which would make the reviews easier?
14:13:16 <tdurakov> hold on
14:13:57 <tdurakov> https://review.openstack.org/#/c/398389/2/nova/virt/libvirt/migration.py - won't this raise obj_attr unset?
14:14:11 <pkoniszewski> can't find anything else
14:14:37 <pkoniszewski> tdurakov: according to Dan's comment I don't think we still need this change in current form
14:14:45 <markus_z> tdurakov: yes it would. I already reverted that but didn't yet push it. It will be in ps3.
14:14:45 <pkoniszewski> I mean, we still need to have a guard there
14:14:54 <pkoniszewski> great! :)
14:15:05 <tdurakov> ok)
14:15:51 <markus_z> Please be aware that tempest isn't yet capable of testing the serial console with live-migration. That's done with: https://review.openstack.org/#/c/346815/
14:16:34 <markus_z> To be precise, this one in Nova does the prep step: https://review.openstack.org/#/c/347471/
14:16:45 <markus_z> Just to be sure to give the full context here.
14:17:03 <tdurakov> markus_z: I could update hook
14:17:50 <markus_z> tdurakov: That would be nice, thanks :)
14:17:56 <tdurakov> so, if it works, we could depend on that
14:18:16 <tdurakov> again, not sure the tempest tests will be merged soon
14:18:40 <tdurakov> maybe it will be easier to create tempest plugin for live-migration
14:19:13 <tdurakov> anything else on that?
14:19:23 <markus_z> For testing, right now I use https://github.com/markuszoeller/openstack/tree/master/scripts/vagrant/live-migration-U1404-VB like the savage I am. :)
14:19:39 <markus_z> No, that's all on that, thanks for your time.
14:19:52 <tdurakov> :)
14:20:01 <tdurakov> let's move on
14:20:07 <tdurakov> #topic specs
14:20:19 <tdurakov> johnthetubaguy: are you arounds?
14:20:26 <tdurakov> s/around
14:20:29 <johnthetubaguy> I am
14:20:32 <pkoniszewski> are we past spec freeze already?
14:20:43 <johnthetubaguy> yes, it was last week, thursday
14:20:58 <tdurakov> but there are still specs unmerged
14:20:59 <tdurakov> so
14:21:00 <tdurakov> https://review.openstack.org/#/c/347161/
14:21:30 <tdurakov> I'm fine with that one, except I do not want checks to be made in a sync manner
14:22:19 <tdurakov> johnthetubaguy: could you please take a look once again
14:22:50 <johnthetubaguy> I thought the problem with that, was the issue around recreating the libvirt xml?
14:23:45 <johnthetubaguy> FWIW, it seems worth adding that quick RPC call to get capabilities.
14:24:03 <johnthetubaguy> I know we don't want RPC calls in the API, but at least it should be a quick thing to process
14:24:29 <tdurakov> what's the difference with other l-m pre-checks here?
14:24:48 <johnthetubaguy> they had to call out to cinder and things, and it was two sets of RPC calls
14:24:53 <tdurakov> why not get this check too over migration status and instance-actions
14:24:54 <johnthetubaguy> they took time
14:25:25 <johnthetubaguy> I think it's a fine line, but if we can fail early, very reliably, that doesn't seem so bad
14:25:32 <johnthetubaguy> it's only needed if the VM is in Rescue state
14:25:49 <tdurakov> btw, is there a plan to store that data on the api side in db?
14:25:51 <johnthetubaguy> unless we check if live-migrate is supported at all, I guess, which I guess we should
14:26:10 <johnthetubaguy> probably never, that would be a child DB thing if it did happen
14:26:16 <johnthetubaguy> at least in my head
14:26:57 <tdurakov> okay, still against that, but looks like I'm in minority on that question
14:27:14 <johnthetubaguy> so the key bit, for me, is the API experience
14:27:15 <tdurakov> if we do rpc.call, let's do it direct to compute
14:27:31 <tdurakov> without extra hop to conductor
14:27:34 <johnthetubaguy> right, that's what we are asking, direct RPC call to compute
14:28:02 <johnthetubaguy> anyways, the key bit is the API semantics
14:28:08 <johnthetubaguy> the implementation can change over time
14:28:30 <tdurakov> ok, will re-review that soon
14:29:27 <tdurakov> johnthetubaguy: thanks
14:29:34 <tdurakov> let's go next
14:30:02 <tdurakov> #topic Open discussion
14:30:46 <tdurakov> do we have anything?
14:31:23 <siva_krish> Hi All! I am trying to reproduce this bug https://bugs.launchpad.net/nova/+bug/1605016. Would someone be able to help me with that ?
14:31:23 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed] - Assigned to Sarafraj Singh (sarafraj-singh)
14:31:42 <pkoniszewski> siva_krish: sure
14:32:01 <pkoniszewski> siva_krish: the easiest way would be to have environment with DVR
14:32:13 <pkoniszewski> as it appears only in certain neutron configurations
14:32:16 <tdurakov> right
14:32:51 <siva_krish> I tried reproducing it with DVR setup only but I wasn't able to reproduce it
14:33:05 <siva_krish> pkoniszewski: ^
14:33:24 <tdurakov> siva_krish: how are you detecting network failure?
14:34:08 <kashyap> siva_krish: The reproducer of it is non-trivial
14:34:17 <kashyap> As you might've noticed the description from Matt Booth, there
14:34:24 <siva_krish> tdurakov: by pinging the VM from outside the network, and I also ran the stress command on the VM before migrating
14:35:01 <tdurakov> siva_krish: hm, are you sure that migration is switched to post-copy?
14:36:41 <siva_krish> tdurakov: I changed the configuration to use post copy in nova.conf. Is there any other way to check it ?
14:36:59 <davidgiluk> siva_krish: Make sure the guest is running something heavy that won't migrate with precopy
14:37:09 <pkoniszewski> did you just wait or did you use force-complete API?
14:37:14 <davidgiluk> siva_krish: e.g. stressapptest hammering memory
14:38:05 <siva_krish> pkoniszewski:  I didn't use force-complete
14:38:16 <tdurakov> siva_krish: that's the point
14:38:25 <mdbooth> precopy is pretty much irrelevant here in practice. You should see packet loss/interruption to a degree in any case.
14:38:28 <tdurakov> you need to raise stress level on vm
14:38:50 <mdbooth> Raising stress level is only required to trigger post-copy in the first place
14:39:15 <mdbooth> However, post-copy is subsequently fast enough that you're unlikely to notice the additional network switchover delay due to post-copy
14:39:58 <siva_krish> mdbooth: tdurakov will try doing that and let you know what happened
14:40:00 <mdbooth> I found that the easiest way to test it was to introduce an artificial delay in the code.
14:40:18 <siva_krish> davidgiluk: will try that out
14:40:40 <mdbooth> Also, frig the code to switch to post-copy mode immediately
14:40:50 <mdbooth> Then it doesn't require stress in the guest
14:42:24 <tdurakov> anything else?
14:42:29 <pkoniszewski> hmm, i think it still does require
14:43:00 <pkoniszewski> we don't know when LM will be switched to post copy, do we?
14:43:44 <pkoniszewski> davidgiluk: can QEMU switch LM to post-copy in the middle of particular iteration?
14:44:03 <pkoniszewski> or does it wait till the end of iteration and then starts post-copy in subsequent iteration?
14:44:16 <davidgiluk> pkoniszewski: That...depends
14:44:20 <mdbooth> pkoniszewski: Yep, because we do the switch
14:44:33 <pkoniszewski> mdbooth: but it's still async...
14:44:49 <mdbooth> pkoniszewski: Nope
14:44:51 <davidgiluk> pkoniszewski: It only checks at certain points, if you've set a bandwidth limit I think it'll probably notice before the end of an iteration but no guarantee
14:44:56 <mdbooth> Nova triggers it explicitly
14:45:04 <mdbooth> And waits
14:45:16 <johnthetubaguy> sounds like load in the VM, and making the code trigger it straight away would help?
14:45:32 <davidgiluk> yeh, guest load is dead easy anyway
14:45:34 <tdurakov> mdbooth: nova scheduled it, no?
14:45:44 <pkoniszewski> yeah, nova just schedules it
14:45:49 <siva_krish> mdbooth: Sorry had an internet connection issue.
14:46:08 <tdurakov> siva_krish: the resolution is the same, good load level:)
14:46:28 <siva_krish> tdurakov: mdbooth will try your suggestions
14:46:29 <pkoniszewski> and switch to post-copy explicitly
14:46:50 <johnthetubaguy> siva_krish: we have folks trying to simulate load in the ops/qa team right now, that can probably share that with you
14:46:56 <johnthetubaguy> yeah, and somehow trigger post-copy
14:47:07 <johnthetubaguy> API call, or via changing the code a bit
14:47:08 <tdurakov> live-migration-force-complete
14:47:24 <siva_krish> johnthetubaguy: that might be helpful as well
14:47:35 <mdbooth> Nova triggers the switch to post-copy explicitly in the _live_migration_monitor loop
14:48:07 <mdbooth> The issue is that it doesn't do network switchover until that loop completes
14:48:13 <mdbooth> LM is really a red herring here
14:48:18 <mdbooth> The bug is the design of the loop
14:48:43 <mdbooth> It's not worth spending a lot of time on a complicated setup to trigger a long post-copy
14:48:52 <mdbooth> Because post-copy isn't what's causing the issue
14:48:57 <johnthetubaguy> so the fix can be done in parallel, I would say
14:49:03 <mdbooth> It's what Nova does when post-copy happens which is the issue
14:49:58 <mdbooth> (and it's not a big issue)
14:50:41 <siva_krish> mdbooth: johnthetubaguy will start working on the fix.
14:54:48 <siva_krish> ^ thanks for all of your info/suggestions on this bug
14:55:21 <tdurakov> thanks everyone for joining
14:55:28 <tdurakov> #endmeeting
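
Note on the bug 1605016 discussion above: mdbooth's point is that the interruption comes from when Nova switches the network relative to the post-copy switch, not from post-copy itself. The following is a toy timing model of that ordering, not Nova code. The names Timeline, network_gap, "loop_end" and "postcopy_start" are hypothetical and exist only for this illustration. It assumes the guest starts running on the destination at the post-copy switch and that the network follows either at that moment or only once the monitor loop finishes.

```python
# Toy model only: every name here is hypothetical, none of this is Nova code.
from dataclasses import dataclass


@dataclass
class Timeline:
    """Times in seconds since the migration started."""
    postcopy_switch: float  # guest starts executing on the destination
    job_completed: float    # monitor loop sees the migration job finish


def network_gap(timeline: Timeline, switch_network_at: str) -> float:
    """Seconds during which the guest already runs on the destination
    while its network still points at the source host."""
    if switch_network_at == "loop_end":
        # Behaviour described in the meeting: wait for the loop to complete.
        network_ready = timeline.job_completed
    elif switch_network_at == "postcopy_start":
        # Alternative ordering: switch as soon as post-copy begins.
        network_ready = timeline.postcopy_switch
    else:
        raise ValueError(f"unknown policy: {switch_network_at}")
    return max(0.0, network_ready - timeline.postcopy_switch)


# A fast post-copy phase gives a gap that is easy to miss with ping.
fast = Timeline(postcopy_switch=10.0, job_completed=10.5)
# An artificial delay before completion makes the same gap obvious.
delayed = Timeline(postcopy_switch=10.0, job_completed=40.0)

for name, timeline in (("fast", fast), ("delayed", delayed)):
    print(name,
          network_gap(timeline, "loop_end"),
          network_gap(timeline, "postcopy_start"))
```

In this model the gap under "loop_end" grows with the length of the post-copy phase, which matches the suggestion to add an artificial delay when reproducing the bug. Under "postcopy_start" the gap is zero regardless of how long post-copy takes.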