14:00:25 <tdurakov> #startmeeting Nova Live Migration
14:00:26 <openstack> Meeting started Tue Nov 22 14:00:25 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:29 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:35 <tdurakov> hello everyone
14:00:54 <tdurakov> #link agenda https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:16 <markus_z> o/
14:01:34 <tdurakov> hm
14:01:45 <tdurakov> only me and markus_z O_o
14:01:51 <pkoniszewski> o/
14:02:16 <markus_z> whoop whoop, three's a party :)
14:02:22 <davidgiluk> o/
14:02:36 <tdurakov> last week there were only 3:(
14:02:41 <tdurakov> so now it's 4)
14:02:42 * kashyap waves
14:02:47 <tdurakov> ok, let's start
14:02:51 <tdurakov> #topic ci
14:03:12 <tdurakov> I think grenade and ceph bits are stuck on tempest part
14:03:35 <pkoniszewski> that's right
14:03:45 <tdurakov> https://review.openstack.org/#/c/379638/
14:03:45 <tdurakov> https://review.openstack.org/#/c/389767/
14:04:15 <tdurakov> tried to catch JordanP last week, no results
14:05:46 <tdurakov> send a message for #openstack-qa with review request
14:06:01 <wznoinsk> late hi
14:06:08 <tdurakov> pkoniszewski: any updates on stats for post-copy testing?
14:06:57 * tdurakov trying to make predictable mem load on instance - work in progress
14:07:02 <pkoniszewski> none yet, i was buried in reviewing serial console and claims patches
14:07:04 <pkoniszewski> i'm still a bit
14:07:14 <tdurakov> ok
14:07:21 <tdurakov> let's go next topic
14:07:31 <tdurakov> #topic bugs
14:07:53 <tdurakov> so, let's discuss that patches for serial-console and live migration
14:08:13 <markus_z> I wanted to ask here for a general direction on that patches.
14:08:17 <markus_z> For context: https://bugs.launchpad.net/nova/+bug/1455252
14:08:17 <openstack> Launchpad bug 1455252 in OpenStack Compute (nova) "enabling serial console breaks live migration" [High,In progress] - Assigned to sahid (sahid-ferdjaoui)
14:08:39 <markus_z> I took over from sahid on that series and discussed it yesterday with dansmith.
14:09:06 <markus_z> Let me get the reference to that, one moment
14:09:35 <markus_z> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2016-11-21.log.html#t2016-11-21T15:10:15
14:09:57 <tdurakov> https://review.openstack.org/#/c/397276/ -is that fix?
14:10:35 <markus_z> ^ is one of the preparation patches. The actual fix is https://review.openstack.org/#/c/275801/
14:10:48 <markus_z> I'm rebasing that after this meeting.
14:11:00 <pkoniszewski> we are almost there, I think
14:11:08 <pkoniszewski> there is one change that I don't like in those patches
14:11:08 <tdurakov> markus_z: I'm ok with None too
14:11:10 <pkoniszewski> I mean this check move
14:11:20 <pkoniszewski> this is completely unrelated to the fix
14:11:45 <markus_z> OK, cool, that's the two major hurdles at the moment.
14:11:53 <markus_z> I got it and change them accordingly.
14:12:02 <pkoniszewski> i'm fine with everything else and I agree with Dan's point
14:12:18 <markus_z> Anything else which would make the reviews easier?
14:13:16 <tdurakov> hold on
14:13:57 <tdurakov> https://review.openstack.org/#/c/398389/2/nova/virt/libvirt/migration.py - won't this raise obj_attr unset?
14:14:11 <pkoniszewski> can't find anything else
14:14:37 <pkoniszewski> tdurakov: according to Dan's comment I don't think we still need this change in current form
14:14:45 <markus_z> tdurakov: yes it would. I already reverted that but didn't yet push it. It will be in ps3.
14:14:45 <pkoniszewski> I mean, we still need to have a guard there
14:14:54 <pkoniszewski> great! :)
14:15:05 <tdurakov> ok)
14:15:51 <markus_z> Please be aware that tempest isn't yet capable of testing the serial console with live-migration. That's done with: https://review.openstack.org/#/c/346815/
14:16:34 <markus_z> To be precise, this one in Nova does the prep step: https://review.openstack.org/#/c/347471/
14:16:45 <markus_z> Just to be sure to give the full context here.
14:17:03 <tdurakov> markus_z: I could update hook
14:17:50 <markus_z> tdurakov: That would be nice, thanks :)
14:17:56 <tdurakov> so, if it works, we could depends on that
14:18:16 <tdurakov> again not sure tempest tests to be merged soon
14:18:40 <tdurakov> maybe it will be easier to create tempest plugin for live-migration
14:19:13 <tdurakov> anything else on that?
14:19:23 <markus_z> For testing, right now I use https://github.com/markuszoeller/openstack/tree/master/scripts/vagrant/live-migration-U1404-VB like the savage I am. :)
14:19:39 <markus_z> No, that's all on that, thanks for your time.
14:19:52 <tdurakov> :)
14:20:01 <tdurakov> let's move on
14:20:07 <tdurakov> #topic specs
14:20:19 <tdurakov> johnthetubaguy: are you arounds?
14:20:26 <tdurakov> s/around
14:20:29 <johnthetubaguy> I am
14:20:32 <pkoniszewski> are we past spec freeze already?
14:20:43 <johnthetubaguy> yes, it was last week, thursday
14:20:58 <tdurakov> but there still specs unmerged
14:20:59 <tdurakov> so
14:21:00 <tdurakov> https://review.openstack.org/#/c/347161/
14:21:30 <tdurakov> I'm fine with that one, except I do not want checks to be made in sync manner
14:22:19 <tdurakov> johnthetubaguy: could you please take a look once again
14:22:50 <johnthetubaguy> I thought the problem with that, was the issue around recreating the libvirt xml?
14:23:45 <johnthetubaguy> FWIW, it seems worth adding that quick RPC call to get capabilities.
14:24:03 <johnthetubaguy> I know we don't want RPC calls in the API, but at least it should be a quick thing to process
14:24:29 <tdurakov> what's the difference with other l-m pre-checks here?
14:24:48 <johnthetubaguy> they had to call out to cinder and things, and it was two sets of RPC calls
14:24:53 <tdurakov> why not to get this check too over migration status and instance-actions
14:24:54 <johnthetubaguy> they took time
14:25:25 <johnthetubaguy> I think its a fine line, but if we can fail early, very reliably, that doesn't seem so bad
14:25:32 <johnthetubaguy> its only needed if the VM is in Rescue state
14:25:49 <tdurakov> btw, is there plan to store that data on the api side in db?
14:25:51 <johnthetubaguy> unless we check if live-migrate is supported at all, I guess, which I guess we should
14:26:10 <johnthetubaguy> probably never, that would be a child DB thing if it did happen
14:26:16 <johnthetubaguy> at least in my head
14:26:57 <tdurakov> okay, still against that, but looks like I'm in minority on that question
14:27:14 <johnthetubaguy> so the key bit, for me, is the API experience
14:27:15 <tdurakov> if we do rpc.call, let's do it direct to compute
14:27:31 <tdurakov> without extra hop to conductor
14:27:34 <johnthetubaguy> right, thats what we are asking, direct RPC call to compute
14:28:02 <johnthetubaguy> anyways, the key bit is the API semantics
14:28:08 <johnthetubaguy> the implementation can change over time
14:28:30 <tdurakov> ok, will re-review that soon
14:29:27 <tdurakov> johnthetubaguy: thanks
14:29:34 <tdurakov> let's go next
14:30:02 <tdurakov> #topic Open discussion
14:30:46 <tdurakov> do we have anything?
14:31:23 <siva_krish> Hi All! I am trying to reproduce this bug https://bugs.launchpad.net/nova/+bug/1605016. Would someone be able to help me with that ?
14:31:23 <openstack> Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,Confirmed] - Assigned to Sarafraj Singh (sarafraj-singh)
14:31:42 <pkoniszewski> siva_krish: sure
14:32:01 <pkoniszewski> siva_krish: the easiest way would be to have environment with DVR
14:32:13 <pkoniszewski> as it appears only in certain neutron configurations
14:32:16 <tdurakov> right
14:32:51 <siva_krish> I tried reproducing it with DVR setup only but I wasn't able to reproduce it
14:33:05 <siva_krish> pkoniszewski: ^
14:33:24 <tdurakov> siva_krish: how are you detecting network failure?
14:34:08 <kashyap> siva_krish: The reproducer of it is non-trivial
14:34:17 <kashyap> As you might've noticed the description from Matt Booth, there
14:34:24 <siva_krish> tdurakov: by pinging the VM from outside the network and also ran stress command on VM before migrating
14:35:01 <tdurakov> siva_krish: hm, are you sure that migration is switched to post-copy?
14:36:41 <siva_krish> tdurakov: I changed the configuration to use post copy in nova.conf. Is there any other way to check it ?
14:36:59 <davidgiluk> siva_krish: Make sure the guest is running something heavy that won't migrate with precopy
14:37:09 <pkoniszewski> did you just wait or did you use force-complete API?
14:37:14 <davidgiluk> siva_krish: e.g. stressapptest hammering memory
14:38:05 <siva_krish> pkoniszewski: I didn't use force-complete
14:38:16 <tdurakov> siva_krish: that's the point
14:38:25 <mdbooth> precopy is pretty much irrelevant here in practice. You should see packet loss/interruption to a degree in any case.
14:38:28 <tdurakov> you need to raise stress level on vm
14:38:50 <mdbooth> Raising stress level is only required to trigger post-copy in the first place
14:39:15 <mdbooth> However, post-copy is subsequently fast enough that you're unlikely to notice the additional network switchover delay due to post-copy
14:39:58 <siva_krish> mdbooth: tdurakov will try doing that and let you know what happened
14:40:00 <mdbooth> I found that the easiest way to test it was to introduce an artificial delay in the code.
14:40:18 <siva_krish> davidgiluk: will try that out
14:40:40 <mdbooth> Also, frig the code to switch to post-copy mode immediately
14:40:50 <mdbooth> Then it doesn't require stress in the guest
14:42:24 <tdurakov> anything else?
14:42:29 <pkoniszewski> hmm, i think it still does require
14:43:00 <pkoniszewski> we don't know when LM will be switched to post copy, do we?
14:43:44 <pkoniszewski> davidgiluk: can QEMU switch LM to post-copy in the middle of particular iteration?
14:44:03 <pkoniszewski> or does it wait till the end of iteration and then starts post-copy in subsequent iteration?
14:44:16 <davidgiluk> pkoniszewski: That...depends
14:44:20 <mdbooth> pkoniszewski: Yep, because we do the switch
14:44:33 <pkoniszewski> mdbooth: but it's still async...
14:44:49 <mdbooth> pkoniszewski: Nope
14:44:51 <davidgiluk> pkoniszewski: It only checks at certain points, if you've set a bandwidth limit I think it'll probably notice before the end of an iteration but no guarantee
14:44:56 <mdbooth> Nova triggers it explicitly
14:45:04 <mdbooth> And waits
14:45:16 <johnthetubaguy> sounds like load in the VM, and making the code trigger it straight away would help?
14:45:32 <davidgiluk> yeh, guest load is dead easy anyway
14:45:34 <tdurakov> mdbooth: nova scheduled it, no?
14:45:44 <pkoniszewski> yeah, nova just schedules it
14:45:49 <siva_krish> mdbooth: Sorry had an internet connection issue.
14:46:08 <tdurakov> siva_krish: the resolution is the same, good load level:)
14:46:28 <siva_krish> tdurakov: mdbooth will try your suggestions
14:46:29 <pkoniszewski> and switch to post-copy explicitly
14:46:50 <johnthetubaguy> siva_krish: we have folks trying to simulate load in the ops/qa team right now, that can probably share that with you
14:46:56 <johnthetubaguy> yeah, and somehow trigger post-copy
14:47:07 <johnthetubaguy> API call, or via changing the code a bit
14:47:08 <tdurakov> live-migration-force-complete
14:47:24 <siva_krish> johnthetubaguy: that might be helpful as well
14:47:35 <mdbooth> Nova triggers the switch to post-copy explicitly in the _live_migration_monitor loop
14:48:07 <mdbooth> The issue is that it doesn't do network switchover until that loop completes
14:48:13 <mdbooth> LM is really a red herring here
14:48:18 <mdbooth> The bug is the design of the loop
14:48:43 <mdbooth> It's not worth spending a lot of time on a complicated setup to trigger a long post-copy
14:48:52 <mdbooth> Because post-copy isn't what's causing the issue
14:48:57 <johnthetubaguy> so the fix can be done in parallel, I would say
14:49:03 <mdbooth> It's what Nova does when post-copy happens which is the issue
14:49:58 <mdbooth> (and it's not a big issue)
14:50:41 <siva_krish> mdbooth: johnthetubaguy will start working on the fix.
14:54:48 <siva_krish> ^ thanks for all of your info/suggestions on this bug
14:55:21 <tdurakov> thanks everyone for joining
14:55:28 <tdurakov> #endmeeting
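
Note on the ping-based check from the open discussion (14:33:24 to 14:34:24): the log does not say how the ping output was evaluated, so the following is only a minimal sketch of one way to turn ping replies into a measurable outage. It assumes the reply timestamps (in seconds) have already been collected from outside the network while the migration runs, and that pings are sent at a fixed interval. The names `find_outages` and `total_downtime`, and the `interval` and `tolerance` parameters, are hypothetical and not part of Nova or any tool mentioned in the meeting.

```python
from typing import List, Tuple


def find_outages(reply_times: List[float],
                 interval: float = 1.0,
                 tolerance: float = 0.5) -> List[Tuple[float, float]]:
    """Return (last_reply_before_gap, first_reply_after_gap) pairs.

    A gap counts as an outage when two consecutive replies are further
    apart than the ping interval plus a tolerance for jitter.
    (Hypothetical helper, assumed for illustration.)
    """
    outages = []
    for prev, cur in zip(reply_times, reply_times[1:]):
        if cur - prev > interval + tolerance:
            outages.append((prev, cur))
    return outages


def total_downtime(outages: List[Tuple[float, float]],
                   interval: float = 1.0) -> float:
    """Sum the time lost in each gap beyond the one expected interval."""
    return sum(end - start - interval for start, end in outages)


if __name__ == "__main__":
    # Replies arrive every second, except for a gap between t=2 and t=6.
    times = [0.0, 1.0, 2.0, 6.0, 7.0]
    gaps = find_outages(times)
    print(gaps)
    print(total_downtime(gaps))
```

Comparing the measured downtime between a run with and without the forced switch to post-copy would give a number to attach to the bug, rather than relying on eyeballing the ping output. A shorter ping interval gives finer resolution, since a gap shorter than `interval + tolerance` is not reported.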