14:00:21 <PaulMurray> #startmeeting Nova Live Migration
14:00:23 <openstack> Meeting started Tue Mar 8 14:00:21 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:35 <PaulMurray> hi, who is here for lm?
14:00:40 <efried> .
14:00:46 <andrearosa> I am
14:00:49 <mdbooth> o/
14:00:57 <jlanoux> hi
14:01:24 <PaulMurray> lets wait a minute for any others
14:02:06 <PaulMurray> ok - lets get going
14:02:18 <PaulMurray> people are lurking if not putting their hands up
14:02:34 <davidgiluk> o/
14:02:41 * davidgiluk wouldn't want to lurk
14:02:46 <PaulMurray> agenda here as usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:50 * kashyap waves
14:02:54 <PaulMurray> :)
14:03:08 <PaulMurray> #topic Documentation
14:03:34 <PaulMurray> We have a lot of features in
14:03:41 <PaulMurray> and they need to be documented
14:03:46 <ndipanov> o/
14:03:58 <eliqiao> o/
14:04:13 <PaulMurray> So look for command descriptions in api and in
14:04:27 <PaulMurray> concepts guide and see if there is anything affected by recent features
14:04:30 <PaulMurray> please
14:04:34 <andrearosa> PaulMurray: I have a nova bug assigned to me because of the DOCIMPACT tag in the commit message
14:04:41 <andrearosa> without any details
14:04:46 <andrearosa> so I put a comment on the bug
14:04:48 <andrearosa> with more details
14:04:51 <andrearosa> is that enough?
14:05:00 <andrearosa> https://bugs.launchpad.net/nova/+bug/1550525
14:05:00 <openstack> Launchpad bug 1550525 in OpenStack Compute (nova) " Abort an ongoing live migration" [Undecided,Confirmed] - Assigned to Andrea Rosa (andrea-rosa-m)
14:05:17 <PaulMurray> I've not seen that before
14:05:20 <eliqiao> andrearosa: that's because that you have 'DOCImpact' in your commit message.
14:05:59 <PaulMurray> eliqiao, are they usually assigned to the committer ?
14:06:19 <andrearosa> PaulMurray: not automatically
14:06:40 <andrearosa> I assigned it to myself after makus pinged me on IRC
14:06:47 <eliqiao> PaulMurray: not really.
14:07:09 <eliqiao> yeah, someone pinged me to.
14:07:14 <eliqiao> s/to/too
14:07:33 <PaulMurray> I'm not sure what the process is or if its changed
14:08:02 <andrearosa> after this meeting I am going to ask in the #openstack-infra what I have to do
14:08:11 <PaulMurray> But it would be good to look for what needs to be done. And do remember the concepts guide, which is in nova
14:08:46 <PaulMurray> #action ALL check document changes needed related to new features
14:09:03 <PaulMurray> #topic Top check queue failures
14:09:30 <eliqiao> PaulMurray: I did some investigation about that bug
14:09:47 <PaulMurray> http://lists.openstack.org/pipermail/openstack-dev/2016-March/088437.html
14:09:53 <eliqiao> PaulMurray: I think you are talking about https://bugs.launchpad.net/nova/+bug/1539271
14:09:53 <openstack> Launchpad bug 1539271 in OpenStack Compute (nova) "Libvirt live block migration migration stalls" [High,Confirmed] - Assigned to Eli Qiao (taget-9)
14:09:55 <PaulMurray> matt pointed out those two bugs
14:09:59 <PaulMurray> go ahead eliqiao
14:10:24 <eliqiao> PaulMurray: I don't find the root cause
14:10:30 <PaulMurray> #link Libvirt live block migration migration stalls: https://bugs.launchpad.net/nova/+bug/1539271
14:10:34 <eliqiao> seems not from nova
14:10:51 <eliqiao> libvirt and nova are all do right thing from the log which I can see.
14:11:49 <eliqiao> libvirt invoke migrate command to qemu, and wait for job status, but no memory transferred at all.
14:11:56 <PaulMurray> this comment from matts message: "The problem, unfortunately, could be
14:11:56 <PaulMurray> something latent in libvirt 1.2.2 or qemu 2.0.0"
14:12:29 <PaulMurray> has know of anything like that in libvirt or qemu ?
14:12:31 <eliqiao> PaulMurray: yeah, can we call for some low level expert ?
14:12:45 <PaulMurray> mdbooth, do you know anyone ?
14:12:59 <eliqiao> danpb or someone from qemu.
14:13:03 <mdbooth> Not personally, unfortunately
14:13:11 <mdbooth> kashyap: ring any bells?
14:13:20 <kashyap> mdbooth: Looking
14:13:29 <kashyap> Sorry, was lost in chat in another channel
14:13:49 <PaulMurray> I can ask our (HPE) virt people to have a look - but not always as quick as one of us
14:13:51 <kashyap> Wow, block migration completely stalls?
14:14:10 <eliqiao> kashyap: yeah.
14:14:17 <kashyap> As of a week ago, I tested with libvirt-1.3.1
14:14:37 <kashyap> And, qemu-system-x86-2.5.0-6.fc23 -- it worked fine here
14:14:41 <davidgiluk> hmm it's probably first important to figure out which type of block migration it's using - is it qemu block migration or is it virtio setting up nbd ?
14:14:49 <mdbooth> Presumably this is a non-deterministic failure, right?
14:15:04 <kashyap> davidgiluk: A quick look at the log should tell us
14:15:10 <PaulMurray> mdbooth, right
14:15:14 <mdbooth> i.e. success once isn't conclusive
14:15:44 <kashyap> The bug doesn't have any libvirtd logs?
14:15:54 <eliqiao> kashyap: it has
14:16:29 <eliqiao> kashyap: http://logs.openstack.org/84/286084/3/check/gate-tempest-dsvm-multinode-full/a5d70a2/logs/subnode-2/libvirt/libvirtd.txt.gz
14:17:00 <kashyap> eliqiao: Thanks
14:17:38 * kashyap will be on a plane for 12 hours tomorrow, and can only spend about 2 hours today on this
14:17:43 <PaulMurray> maybe if you can have a look and put anything you find on the bug - including any missing details
14:17:52 <PaulMurray> There is another one:
14:17:57 <davidgiluk> kashyap: Take a libvirtd bug with you for entertainment!
14:18:00 <mdbooth> kashyap: Can you ping me with anything you find out?
14:18:03 <kashyap> davidgiluk: Looking sir
14:18:08 <pkoniszewski> o/
14:18:12 <PaulMurray> #link Volume based live migration aborted unexpectedly: https://bugs.launchpad.net/nova/+bug/1524898
14:18:12 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:18:12 <kashyap> mdbooth: Sure. I'll note anything that I can find
14:18:29 <PaulMurray> has anyone had a chance to look at this one ?
14:19:09 <PaulMurray> The occurrences of these has increased in last 48 hours
14:19:20 <kashyap> eliqiao: "subnode-2" means, it's the _source_ machine, right? (Despite the name 'subnode-2')
14:21:16 <kashyap> davidgiluk: Okay, it's using old style ("inc":true" approach)
14:21:23 <PaulMurray> danpb thinks its this one is probably qemu - but don't know why it would start failing more now
14:21:32 <eliqiao> kashyap: yeah, it should be. I checked the log, monitor thread is running on that.
14:21:51 <davidgiluk> kashyap: Do we have qemu logs? Also wasn't there an NFS+migration one from a few weeks back - did that get fixed?
14:21:57 <mdbooth> PaulMurray: Matt says in that bug:
14:21:59 <kashyap> (s/true/false)
14:22:02 <mdbooth> This fails primarily on OVH nodes which are the slowest ones we have in the community infra system, so that indicates there is a speed component to this.
14:22:15 <kashyap> davidgiluk: There will be - DevStack records it, let me find
14:22:16 <mdbooth> I wonder if it's a race in an old qemu which is triggered by a slow system
14:22:22 <mdbooth> And our system got slow again lately?
14:23:39 <kashyap> davidgiluk: The same dreadful "signal 15" (SIGTERM) - http://logs.openstack.org/84/286084/3/check/gate-tempest-dsvm-multinode-full/a5d70a2/logs/subnode-2/libvirt/qemu/instance-00000022.txt.gz
14:23:47 <PaulMurray> tdurakov, does the live migration job use the same setup ?
14:24:04 <eliqiao> kashyap: the qemu logs seems good.
14:24:14 <davidgiluk> kashyap: So I think that means libvirt decided to kill it - nothing that qemu is complaining about there
14:24:16 <kashyap> davidgiluk: Also, they (the test) seem to have supplied TUNNELED flag.
14:24:52 <davidgiluk> kashyap: Yeh so isn't it the constraint that if you have tunneled on then you have to use qemu block migration?
14:25:08 <eliqiao> davidgiluk: tempest case will kill it if LM case no response after 250s?(maybe)
14:25:14 <kashyap> We should really be testing with latest upstream stable of libvirt/QEMU. Markus (Z) was following it up on the upstream list - not sure about its status
14:25:57 <andrearosa> we even have a kill of the job if the ligration is not making progress for more than 150 seconds
14:26:03 <kashyap> davidgiluk: Upstream libvirt will fall back to the legacy approach even when the tunnelled flag is supplied
14:26:03 <andrearosa> that is reported in the compute log
14:26:19 <andrearosa> ligration = live migration
14:27:17 <davidgiluk> andrearosa: What's the definition of progress? If it's shoveling block data and not moving much RAM data what happens?
14:27:53 <andrearosa> davidgiluk: I think so, but I have to check the code. I saw that in my tests few times but I never dig into it
14:29:15 <mdbooth> I'm a bit concerned that we may well be wasting time tracking down issues which are down to bugs which are already fixed in stable libvirt/qemu
14:29:27 <mdbooth> I'm not sure this is the best use of peoples' time
14:29:41 <mdbooth> Any chance we can upgrade?
14:30:06 <kashyap> mdbooth: Fully agree with you
14:30:27 <kashyap> Markus (on #openstack-nova) just provided some status on it (newer Virt components in Gate)
14:30:28 <PaulMurray> mdbooth, agree - we were looking at getting up to date libvirt/qemu in some CI jobs
14:30:30 <kashyap> https://etherpad.openstack.org/p/libvirt-latest-test-job
14:30:34 <davidgiluk> yes it would be better if you could taste our newer bugs
14:30:37 <PaulMurray> anyone know where we are with that
14:30:39 <mdbooth> If we were seeing these issues with the latest stable libvirt/qemu it would be quite clear that there's something we need to look at
14:31:25 <kashyap> PaulMurray: mdbooth: https://review.openstack.org/#/c/289255/
14:31:37 <kashyap> That's the WIP test job
14:31:48 <markus_z> hey hey
14:31:59 <mdbooth> kashyap: That's related to features, though. I'm more interested in stability.
14:32:26 <kashyap> mdbooth: Some of the "bugs" we investigate upstream are already fixed in the latest stable versions of libvirt/QEMU
14:32:30 <markus_z> It's also interesting for stability as we know it upfront when we do a min_libvirt_version bump
14:32:47 <kashyap> It's just due to the UCA having old versions that are Just Buggy Enough
14:33:01 <kashyap> ("UCA" == Ubuntu Cloud Archive)
14:33:25 <markus_z> That's the version we use to poke through the happy path. We haven't fully figured out how to use the very latest.
14:34:31 <davidgiluk> andrearosa: If you hit it again I'd be interested in what the idea of 'progress' is and what behaviour you're seeing; I can imagine that perhaps it's just shifting block data and not making much progress on the RAM, but if it's an incremental I wonder why it's taking so long on the block
14:35:09 <andrearosa> davidgiluk: ok I let you know if I have a reproducible
14:35:44 <PaulMurray> I'm not sure we can do anything more discussing this now
14:36:18 <PaulMurray> It would be good if we can try to figure out the problem and see if it matches a known bug in libvirt/qemu
14:36:39 <kashyap> Yep
14:36:58 <PaulMurray> moving on
14:37:10 <PaulMurray> #link CI Coverage
14:37:26 <PaulMurray> is tdurakov here?
14:37:49 <PaulMurray> there is not much on this pad: https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:38:16 <davidgiluk> Is there anything that describes the current CI ?
14:38:37 <PaulMurray> Good question, what are you looking for?
14:39:33 <PaulMurray> there are some links here: https://etherpad.openstack.org/p/mitaka-live-migration
14:40:22 <kashyap> PaulMurray: I think what davidgiluk is asking is: What _kind_ of tests current live migration CI does
14:40:36 <davidgiluk> well it's just if there's a thing saying to improve the CI I was just wondering what's already there
14:41:17 <PaulMurray> there is a job in the experimental queue
14:41:49 <PaulMurray> It has several types of storage
14:42:09 * PaulMurray scrambles around for a link
14:42:19 <jlanoux> It's what is in Tempest currently: LM of a server (ephemeral), LM of a server booted from a volume, LM of a server in paused state
14:43:00 <jlanoux> There are several additional tests under review
14:43:00 <davidgiluk> what workload do you run on the server while it's migrated?
14:43:07 <jlanoux> none
14:43:32 <davidgiluk> do you check the status of the server afterwards - e.g. check it's not rebooted, check it can still do IO to its disk, network
14:44:11 <jlanoux> no - Matt Riedmann submitted a change to do a ssh check before and after LM
14:44:25 <jlanoux> but ssh check is not 100% reliable currently
14:44:55 <jlanoux> so little chance to get that voting
14:45:07 <davidgiluk> yeh, checking the state of the VM after migration is pretty important
14:45:52 <PaulMurray> Maybe we can get someone to put the current state of CI - what it does now on https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:46:10 <PaulMurray> jlanoux, could you organise that with tdurakov please ?
14:46:29 <jlanoux> PaulMurray: yep, no problem
14:46:37 <PaulMurray> #action jlanoux tdurakov to add current state of live migration to https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:46:57 <PaulMurray> and add ideas
14:47:20 <davidgiluk> we've had qemu bugs where you'd find the guest rebooted instead of retaining the state; and you don't notice that if you just check for migration completing
14:48:23 <PaulMurray> davidgiluk, saw the addition to the page - I guess that's you
14:48:28 <davidgiluk> yep
14:48:57 <PaulMurray> good - feel free to add the check as well - I think there are some patches in tempest for this job now
14:49:24 <PaulMurray> so all look for things to review - do point any reviews on the priorities page as you go if they are not there
14:49:32 <PaulMurray> #link Bugs
14:49:42 <PaulMurray> Any other bugs (10 mins left)
14:50:27 * PaulMurray is sure there are other bugs - really checking if anyone wants to bring any up
14:50:42 <PaulMurray> #link Open Discussion
14:50:46 <PaulMurray> anything else ?
14:51:37 <mdbooth> I posted something to the mailing list yesterday, prompted by a tdurakov patch
14:51:58 <PaulMurray> mdbooth, do you have a link
14:52:00 <mdbooth> I would be interested if anybody has thought through the implications of this for things I may not have thought of
14:52:22 * mdbooth hunts
14:52:54 <mdbooth> http://lists.openstack.org/pipermail/openstack-dev/2016-March/088578.html
14:52:55 <PaulMurray> mdbooth, is it this: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088578.html
14:52:55 <markus_z> just as a side note, potential rc blocker bugs get tagged with "mitaka-rc-potential"
14:53:11 <mdbooth> PaulMurray: yup
14:53:19 <PaulMurray> you got there first
14:53:42 <mdbooth> It occurs to me that cleanup in the event of failure might be more complicated
14:54:21 <mdbooth> However, I'd be interested to know if people think that solves more than it causes
14:54:41 <PaulMurray> mdbooth, I'll take a look
14:55:18 <PaulMurray> mdbooth, could also start a similar thread asking if operators see problems - but put in the context of running systems ?
14:55:26 <PaulMurray> mdbooth, on operators list
14:55:46 <mdbooth> I don't know how I'd frame it other than: do you find migration to be a bit fragile? ;)
14:55:58 <mdbooth> I don't think the response would be useful :)
14:56:03 <PaulMurray> mdbooth, I think we know the answer to that
14:56:24 <PaulMurray> markus_z, what is the process about mitaka-rc-potential bugs?
14:56:53 <markus_z> #link http://lists.openstack.org/pipermail/openstack-dev/2016-March/088205.html
14:57:03 <markus_z> PaulMurray: ^
14:57:15 <PaulMurray> #info potential rc blocker bugs get tagged with "mitaka-rc-potential" - see: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088205.html
14:57:20 <PaulMurray> thanks
14:57:47 <PaulMurray> I think we're done for today - so better end before that Stable lot come and throw us out again
14:58:15 <PaulMurray> thanks for coming - please see if you can help with those CI bugs
14:58:19 <sigmavirus24> PaulMurray: yeahh best you be finishing :P
14:58:40 <PaulMurray> ...ooo I'm scared... :) 
14:58:46 <PaulMurray> #endmeeting
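
Editor's note (not part of the meeting log): the 14:25-14:27 exchange leaves open what "progress" means for the kill after 150 seconds without progress. The following is a minimal sketch of one possible definition, assuming progress means that the reported amount of data remaining reaches a new low. The class name, the method name, and the decision about what is counted in data_remaining are hypothetical and are not taken from nova; only the 150 second figure comes from the log.

```python
import time
from typing import Callable, Optional


class ProgressWatchdog:
    """Flags a migration as stalled when 'data remaining' stops shrinking.

    Hypothetical illustration only: this is not nova's implementation.
    """

    def __init__(self, timeout_s: float = 150.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        # Lowest data_remaining value seen so far (None until first sample).
        self._low_watermark: Optional[int] = None
        # Time of the last sample that set a new low watermark.
        self._last_progress = clock()

    def update(self, data_remaining: int) -> bool:
        """Record one sample; return True if the migration looks stalled."""
        now = self._clock()
        if self._low_watermark is None or data_remaining < self._low_watermark:
            # A new low counts as progress and resets the stall timer.
            self._low_watermark = data_remaining
            self._last_progress = now
        return (now - self._last_progress) > self.timeout_s
```

Under this definition the scenario davidgiluk raises matters: if data_remaining counted only RAM, a migration that is busy copying block data would look stalled even though it is doing useful work. Whether the check discussed in the meeting includes block data is not stated in the log.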