14:00:21 <PaulMurray> #startmeeting Nova Live Migration
14:00:23 <openstack> Meeting started Tue Mar  8 14:00:21 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:35 <PaulMurray> hi, who is here for lm?
14:00:40 <efried> .
14:00:46 <andrearosa> I am
14:00:49 <mdbooth> o/
14:00:57 <jlanoux> hi
14:01:24 <PaulMurray> let's wait a minute for any others
14:02:06 <PaulMurray> ok - let's get going
14:02:18 <PaulMurray> people are lurking if not putting their hands up
14:02:34 <davidgiluk> o/
14:02:41 * davidgiluk wouldn't want to lurk
14:02:46 <PaulMurray> agenda here as usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:50 * kashyap waves
14:02:54 <PaulMurray> :)
14:03:08 <PaulMurray> #topic Documentation
14:03:34 <PaulMurray> We have a lot of features in
14:03:41 <PaulMurray> and they need to be documented
14:03:46 <ndipanov> o/
14:03:58 <eliqiao> o/
14:04:13 <PaulMurray> So look for command descriptions in api and in
14:04:27 <PaulMurray> concepts guide and see if there is anything affected by recent features
14:04:30 <PaulMurray> please
14:04:34 <andrearosa> PaulMurray: I have a nova bug assigned to me because of the DOCIMPACT tag in the commit message
14:04:41 <andrearosa> without any details
14:04:46 <andrearosa> so I put a comment on the bug
14:04:48 <andrearosa> with more details
14:04:51 <andrearosa> is that enough?
14:05:00 <andrearosa> https://bugs.launchpad.net/nova/+bug/1550525
14:05:00 <openstack> Launchpad bug 1550525 in OpenStack Compute (nova) " Abort an ongoing live migration" [Undecided,Confirmed] - Assigned to Andrea Rosa (andrea-rosa-m)
14:05:17 <PaulMurray> I've not seen that before
14:05:20 <eliqiao> andrearosa: that's because you have 'DOCImpact' in your commit message.
14:05:59 <PaulMurray> eliqiao, are they usually assigned to the committer?
14:06:19 <andrearosa> PaulMurray: not automatically
14:06:40 <andrearosa> I assigned it to myself after Markus pinged me on IRC
14:06:47 <eliqiao> PaulMurray: not really.
14:07:09 <eliqiao> yeah, someone pinged me to.
14:07:14 <eliqiao> s/to/too
14:07:33 <PaulMurray> I'm not sure what the process is or if it's changed
14:08:02 <andrearosa> after this meeting I am going to ask in #openstack-infra what I have to do
14:08:11 <PaulMurray> But it would be good to look for what needs to be done. And do remember the concepts guide, which is in nova
14:08:46 <PaulMurray> #action ALL check document changes needed related to new features
14:09:03 <PaulMurray> #topic Top check queue failures
14:09:30 <eliqiao> PaulMurray: I did some investigation about that bug
14:09:47 <PaulMurray> http://lists.openstack.org/pipermail/openstack-dev/2016-March/088437.html
14:09:53 <eliqiao> PaulMurray: I think you are talking about https://bugs.launchpad.net/nova/+bug/1539271
14:09:53 <openstack> Launchpad bug 1539271 in OpenStack Compute (nova) "Libvirt live block migration migration stalls" [High,Confirmed] - Assigned to Eli Qiao (taget-9)
14:09:55 <PaulMurray> matt pointed out those two bugs
14:09:59 <PaulMurray> go ahead eliqiao
14:10:24 <eliqiao> PaulMurray: I haven't found the root cause
14:10:30 <PaulMurray> #link Libvirt live block migration migration stalls: https://bugs.launchpad.net/nova/+bug/1539271
14:10:34 <eliqiao> it seems not to come from nova
14:10:51 <eliqiao> libvirt and nova both do the right thing, from what I can see in the logs.
14:11:49 <eliqiao> libvirt invokes the migrate command on qemu and waits for the job status, but no memory is transferred at all.
14:11:56 <PaulMurray> this comment is from matt's message: "The problem, unfortunately, could be
14:11:56 <PaulMurray> something latent in libvirt 1.2.2 or qemu 2.0.0"
14:12:29 <PaulMurray> does anyone know of anything like that in libvirt or qemu?
14:12:31 <eliqiao> PaulMurray: yeah, can we call for some low level expert ?
14:12:45 <PaulMurray> mdbooth, do you know anyone ?
14:12:59 <eliqiao> danpb or someone from qemu.
14:13:03 <mdbooth> Not personally, unfortunately
14:13:11 <mdbooth> kashyap: ring any bells?
14:13:20 <kashyap> mdbooth: Looking
14:13:29 <kashyap> Sorry, was lost in chat in another channel
14:13:49 <PaulMurray> I can ask our (HPE) virt people to have a look - but not always as quick as one of us
14:13:51 <kashyap> Wow, block migration completely stalls?
14:14:10 <eliqiao> kashyap: yeah.
14:14:17 <kashyap> As of a week ago, I tested with libvirt-1.3.1
14:14:37 <kashyap> And, qemu-system-x86-2.5.0-6.fc23  -- it worked fine here
14:14:41 <davidgiluk> hmm it's probably first important to figure out which type of block migration it's using - is it qemu block migration or is it virtio setting up nbd ?
14:14:49 <mdbooth> Presumably this is a non-deterministic failure, right?
14:15:04 <kashyap> davidgiluk: A quick look at the log should tell us
14:15:10 <PaulMurray> mdbooth, right
14:15:14 <mdbooth> i.e. success once isn't conclusive
14:15:44 <kashyap> The bug doesn't have any libvirtd logs?
14:15:54 <eliqiao> kashyap: it has
14:16:29 <eliqiao> kashyap: http://logs.openstack.org/84/286084/3/check/gate-tempest-dsvm-multinode-full/a5d70a2/logs/subnode-2/libvirt/libvirtd.txt.gz
14:17:00 <kashyap> eliqiao: Thanks
14:17:38 * kashyap will be on a plane for 12 hours tomorrow, and can only spend about 2 hours today on this
14:17:43 <PaulMurray> maybe if you can have a look and put anything you find on the bug - including any missing details
14:17:52 <PaulMurray> There is another one:
14:17:57 <davidgiluk> kashyap: Take a libvirtd bug with you for entertainment!
14:18:00 <mdbooth> kashyap: Can you ping me with anything you find out?
14:18:03 <kashyap> davidgiluk: Looking sir
14:18:08 <pkoniszewski> o/
14:18:12 <PaulMurray> #link Volume based live migration aborted unexpectedly: https://bugs.launchpad.net/nova/+bug/1524898
14:18:12 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:18:12 <kashyap> mdbooth: Sure.  I'll note anything that I can find
14:18:29 <PaulMurray> has anyone had a chance to look at this one ?
14:19:09 <PaulMurray> The occurrences of these have increased in the last 48 hours
14:19:20 <kashyap> eliqiao: "subnode-2" means, it's the _source_ machine, right?  (Despite the name 'subnode-2')
14:21:16 <kashyap> davidgiluk: Okay, it's using old style ("inc":true" approach)
14:21:23 <PaulMurray> danpb thinks this one is probably qemu - but don't know why it would start failing more now
14:21:32 <eliqiao> kashyap: yeah, it should be. I checked the log, monitor thread is running on that.
14:21:51 <davidgiluk> kashyap: Do we have qemu logs? Also wasn't there an NFS+migration one from a few weeks back - did that get fixed?
14:21:57 <mdbooth> PaulMurray: Matt says in that bug:
14:21:59 <kashyap> (s/true/false)
14:22:02 <mdbooth> This fails primarily on OVH nodes which are the slowest ones we have in the community infra system, so that indicates there is a speed component to this.
14:22:15 <kashyap> davidgiluk: There will be - DevStack records it, let me find
14:22:16 <mdbooth> I wonder if it's a race in an old qemu which is triggered by a slow system
14:22:22 <mdbooth> And our system got slow again lately?
14:23:39 <kashyap> davidgiluk: The same dreadful "signal 15" (SIGTERM) - http://logs.openstack.org/84/286084/3/check/gate-tempest-dsvm-multinode-full/a5d70a2/logs/subnode-2/libvirt/qemu/instance-00000022.txt.gz
14:23:47 <PaulMurray> tdurakov, does the live migration job use the same setup ?
14:24:04 <eliqiao> kashyap: the qemu logs seem good.
14:24:14 <davidgiluk> kashyap: So I think that means libvirt decided to kill it - nothing that qemu is complaining about there
14:24:16 <kashyap> davidgiluk: Also, they (the test) seem to have supplied TUNNELED flag.
14:24:52 <davidgiluk> kashyap: Yeh so isn't it the constraint that if you have tunneled on then you have to use qemu block migration?
14:25:08 <eliqiao> davidgiluk: the tempest case will kill it if the LM case gets no response after 250s? (maybe)
14:25:14 <kashyap> We should really be testing with latest upstream stable of libvirt/QEMU.  Markus (Z) was following it up on the upstream list - not sure about its status
14:25:57 <andrearosa> we even have a kill of the job if the ligration is not making progress for more than 150 seconds
14:26:03 <kashyap> davidgiluk: Upstream libvirt will fall back to the legacy approach even when the tunnelled flag is supplied
14:26:03 <andrearosa> that is reported in the compute log
14:26:19 <andrearosa> ligration = live migration
14:27:17 <davidgiluk> andrearosa: What's the definition of progress? If it's shoveling block data and not moving much RAM data what happens?
14:27:53 <andrearosa> davidgiluk: I think so, but I have to check the code. I saw that in my tests a few times but I never dug into it
14:29:15 <mdbooth> I'm a bit concerned that we may well be wasting time tracking down issues which are down to bugs which are already fixed in stable libvirt/qemu
14:29:27 <mdbooth> I'm not sure this is the best use of people's time
14:29:41 <mdbooth> Any chance we can upgrade?
14:30:06 <kashyap> mdbooth: Fully agree with you
14:30:27 <kashyap> Markus (on #openstack-nova) just provided some status on it (newer Virt components in Gate)
14:30:28 <PaulMurray> mdbooth, agree - we were looking at getting up to date libvirt/qemu in some CI jobs
14:30:30 <kashyap> https://etherpad.openstack.org/p/libvirt-latest-test-job
14:30:34 <davidgiluk> yes it would be better if you could taste our newer bugs
14:30:37 <PaulMurray> anyone know where we are with that
14:30:39 <mdbooth> If we were seeing these issues with the latest stable libvirt/qemu it would be quite clear that there's something we need to look at
14:31:25 <kashyap> PaulMurray: mdbooth: https://review.openstack.org/#/c/289255/
14:31:37 <kashyap> That's the WIP test job
14:31:48 <markus_z> hey hey
14:31:59 <mdbooth> kashyap: That's related to features, though. I'm more interested in stability.
14:32:26 <kashyap> mdbooth: Some of the "bugs" we investigate upstream are already fixed in the latest stable versions of libvirt/QEMU
14:32:30 <markus_z> It's also interesting for stability as we know it upfront when we do a min_libvirt_version bump
14:32:47 <kashyap> It's just due to the UCA having old versions that are Just Buggy Enough
14:33:01 <kashyap> ("UCA" == Ubuntu Cloud Archive)
14:33:25 <markus_z> That's the version we use to poke through the happy path. We haven't fully figured out how to use the very latest.
14:34:31 <davidgiluk> andrearosa: If you hit it again I'd be interested in what the idea of 'progress' is and what behaviour you're seeing; I can imagine that perhaps it's just shifting block data and not making much progress on the RAM, but if it's an incremental I wonder why it's taking so long on the block
14:35:09 <andrearosa> davidgiluk: ok, I'll let you know if I have a reproducer
14:35:44 <PaulMurray> I'm not sure we can do anything more discussing this now
14:36:18 <PaulMurray> It would be good if we can try to figure out the problem and see if it matches a known bug in libvirt/qemu
14:36:39 <kashyap> Yep
14:36:58 <PaulMurray> moving on
14:37:10 <PaulMurray> #topic CI Coverage
14:37:26 <PaulMurray> is tdurakov here?
14:37:49 <PaulMurray> there is not much on this pad: https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:38:16 <davidgiluk> Is there anything that describes the current CI ?
14:38:37 <PaulMurray> Good question, what are you looking for?
14:39:33 <PaulMurray> there are some links here: https://etherpad.openstack.org/p/mitaka-live-migration
14:40:22 <kashyap> PaulMurray: I think what davidgiluk is asking is: what _kind_ of tests does the current live migration CI run?
14:40:36 <davidgiluk> well it's just if there's a thing saying to improve the CI I was just wondering what's already there
14:41:17 <PaulMurray> there is a job in the experimental queue
14:41:49 <PaulMurray> It has several types of storage
14:42:09 * PaulMurray scrambles around for a link
14:42:19 <jlanoux> It's what is in Tempest currently: LM of a server (ephemeral), LM of a server booted from a volume, LM of a server in paused state
14:43:00 <jlanoux> There are several additional tests under review
14:43:00 <davidgiluk> what workload do you run on the server while it's migrated?
14:43:07 <jlanoux> none
14:43:32 <davidgiluk> do you check the status of the server afterwards - e.g. check it's not rebooted, check it can still do IO to its disk, network
14:44:11 <jlanoux> no - Matt Riedemann submitted a change to do an ssh check before and after LM
14:44:25 <jlanoux> but ssh check is not 100% reliable currently
14:44:55 <jlanoux> so little chance to get that voting
14:45:07 <davidgiluk> yeh, checking the state of the VM after migration is pretty important
14:45:52 <PaulMurray> Maybe we can get someone to put the current state of CI - what it does now on https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:46:10 <PaulMurray> jlanoux, could you organise that with tdurakov please ?
14:46:29 <jlanoux> PaulMurray: yep, no problem
14:46:37 <PaulMurray> #action jlanoux tdurakov to add current state of live migration to https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:46:57 <PaulMurray> and add ideas
14:47:20 <davidgiluk> we've had qemu bugs where you'd find the guest rebooted instead of retaining the state; and you don't notice that if you just check for migration completing
14:48:23 <PaulMurray> davidgiluk, saw the addition to the page - I guess that's you
14:48:28 <davidgiluk> yep
14:48:57 <PaulMurray> good - feel free to add the check as well - I think there are some patches in tempest for this job now
14:49:24 <PaulMurray> so all look for things to review - do put any reviews on the priorities page as you go if they are not there
14:49:32 <PaulMurray> #topic Bugs
14:49:42 <PaulMurray> Any other bugs (10 mins left)
14:50:27 * PaulMurray is sure there are other bugs - really checking if anyone wants to bring any up
14:50:42 <PaulMurray> #topic Open Discussion
14:50:46 <PaulMurray> anything else ?
14:51:37 <mdbooth> I posted something to the mailing list yesterday, prompted by a tdurakov patch
14:51:58 <PaulMurray> mdbooth, do you have a link
14:52:00 <mdbooth> I would be interested if anybody has thought through the implications of this for things I may not have thought of
14:52:22 * mdbooth hunts
14:52:54 <mdbooth> http://lists.openstack.org/pipermail/openstack-dev/2016-March/088578.html
14:52:55 <PaulMurray> mdbooth, is it this: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088578.html
14:52:55 <markus_z> just as a side note, potential rc blocker bugs get tagged with "mitaka-rc-potential"
14:53:11 <mdbooth> PaulMurray: yup
14:53:19 <PaulMurray> you got there first
14:53:42 <mdbooth> It occurs to me that cleanup in the event of failure might be more complicated
14:54:21 <mdbooth> However, I'd be interested to know if people think that solves more than it causes
14:54:41 <PaulMurray> mdbooth, I'll take a look
14:55:18 <PaulMurray> mdbooth, could also start a similar thread asking if operators see problems - but put in the context of running systems ?
14:55:26 <PaulMurray> mdbooth, on operators list
14:55:46 <mdbooth> I don't know how I'd frame it other than: do you find migration to be a bit fragile? ;)
14:55:58 <mdbooth> I don't think the response would be useful :)
14:56:03 <PaulMurray> mdbooth, I think we know the answer to that
14:56:24 <PaulMurray> markus_z, what is the process about mitaka-rc-potential bugs?
14:56:53 <markus_z> #link http://lists.openstack.org/pipermail/openstack-dev/2016-March/088205.html
14:57:03 <markus_z> PaulMurray: ^
14:57:15 <PaulMurray> #info  potential rc blocker bugs get tagged with "mitaka-rc-potential" - see: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088205.html
14:57:20 <PaulMurray> thanks
14:57:47 <PaulMurray> I think we're done for today - so better end before that Stable lot come and throw us out again
14:58:15 <PaulMurray> thanks for coming - please see if you can help with those CI bugs
14:58:19 <sigmavirus24> PaulMurray: yeahh best you be finishing :P
14:58:40 <PaulMurray> ...ooo I'm scared... :)
14:58:46 <PaulMurray> #endmeeting