14:00:21 #startmeeting Nova Live Migration
14:00:23 Meeting started Tue Mar 8 14:00:21 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 The meeting name has been set to 'nova_live_migration'
14:00:35 hi, who is here for lm?
14:00:40 .
14:00:46 I am
14:00:49 o/
14:00:57 hi
14:01:24 let's wait a minute for any others
14:02:06 ok - let's get going
14:02:18 people are lurking if not putting their hands up
14:02:34 o/
14:02:41 * davidgiluk wouldn't want to lurk
14:02:46 agenda here as usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:50 * kashyap waves
14:02:54 :)
14:03:08 #topic Documentation
14:03:34 We have a lot of features in
14:03:41 and they need to be documented
14:03:46 o/
14:03:58 o/
14:04:13 So look for command descriptions in the api and in
14:04:27 the concepts guide and see if there is anything affected by recent features
14:04:30 please
14:04:34 PaulMurray: I have a nova bug assigned to me because of the DOCIMPACT tag in the commit message
14:04:41 without any details
14:04:46 so I put a comment on the bug
14:04:48 with more details
14:04:51 is that enough?
14:05:00 https://bugs.launchpad.net/nova/+bug/1550525
14:05:00 Launchpad bug 1550525 in OpenStack Compute (nova) "Abort an ongoing live migration" [Undecided,Confirmed] - Assigned to Andrea Rosa (andrea-rosa-m)
14:05:17 I've not seen that before
14:05:20 andrearosa: that's because you have 'DOCImpact' in your commit message.
14:05:59 eliqiao, are they usually assigned to the committer?
14:06:19 PaulMurray: not automatically
14:06:40 I assigned it to myself after markus pinged me on IRC
14:06:47 PaulMurray: not really.
14:07:09 yeah, someone pinged me to.
14:07:14 s/to/too
14:07:33 I'm not sure what the process is or if it's changed
14:08:02 after this meeting I am going to ask in #openstack-infra what I have to do
14:08:11 But it would be good to look for what needs to be done. And do remember the concepts guide, which is in nova
14:08:46 #action ALL check document changes needed related to new features
14:09:03 #topic Top check queue failures
14:09:30 PaulMurray: I did some investigation about that bug
14:09:47 http://lists.openstack.org/pipermail/openstack-dev/2016-March/088437.html
14:09:53 PaulMurray: I think you are talking about https://bugs.launchpad.net/nova/+bug/1539271
14:09:53 Launchpad bug 1539271 in OpenStack Compute (nova) "Libvirt live block migration stalls" [High,Confirmed] - Assigned to Eli Qiao (taget-9)
14:09:55 matt pointed out those two bugs
14:09:59 go ahead eliqiao
14:10:24 PaulMurray: I couldn't find the root cause
14:10:30 #link Libvirt live block migration stalls: https://bugs.launchpad.net/nova/+bug/1539271
14:10:34 seems not from nova
14:10:51 libvirt and nova both do the right thing, from the logs I can see.
14:11:49 libvirt invokes the migrate command to qemu and waits for the job status, but no memory is transferred at all.
14:11:56 this comment from matt's message: "The problem, unfortunately, could be something latent in libvirt 1.2.2 or qemu 2.0.0"
14:12:29 does anyone know of anything like that in libvirt or qemu?
14:12:31 PaulMurray: yeah, can we call for some low-level expert?
14:12:45 mdbooth, do you know anyone?
14:12:59 danpb or someone from qemu.
14:13:03 Not personally, unfortunately
14:13:11 kashyap: ring any bells?
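For anyone picking up bug 1539271, here is a minimal sketch (assuming python-libvirt; this is not nova's actual monitor loop) of how the stall eliqiao describes - a migration job that is running but transferring no memory - shows up when polling libvirt's job statistics. The connection URI and domain name are placeholders.

```python
# Hedged sketch, not nova code: poll libvirt job stats during a live (block)
# migration and flag a stall when no memory/disk data moved between polls.
# URI and domain name are placeholders; available keys vary by libvirt version.
import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000022')   # placeholder domain name

last = None
for _ in range(30):                            # observe for ~150 seconds
    stats = dom.jobStats()                     # near-empty dict when no job is running
    moved = stats.get('memory_processed', 0) + stats.get('disk_processed', 0)
    if last is not None and moved == last:
        print('no data transferred since the last poll - possible stall')
    last = moved
    time.sleep(5)
```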
14:13:20 mdbooth: Looking
14:13:29 Sorry, was lost in chat in another channel
14:13:49 I can ask our (HPE) virt people to have a look - but not always as quick as one of us
14:13:51 Wow, block migration completely stalls?
14:14:10 kashyap: yeah.
14:14:17 As of a week ago, I tested with libvirt-1.3.1
14:14:37 And, qemu-system-x86-2.5.0-6.fc23 -- it worked fine here
14:14:41 hmm it's probably first important to figure out which type of block migration it's using - is it qemu block migration or is it virtio setting up nbd?
14:14:49 Presumably this is a non-deterministic failure, right?
14:15:04 davidgiluk: A quick look at the log should tell us
14:15:10 mdbooth, right
14:15:14 i.e. success once isn't conclusive
14:15:44 The bug doesn't have any libvirtd logs?
14:15:54 kashyap: it has
14:16:29 kashyap: http://logs.openstack.org/84/286084/3/check/gate-tempest-dsvm-multinode-full/a5d70a2/logs/subnode-2/libvirt/libvirtd.txt.gz
14:17:00 eliqiao: Thanks
14:17:38 * kashyap will be on a plane for 12 hours tomorrow, and can only spend about 2 hours today on this
14:17:43 maybe if you can have a look and put anything you find on the bug - including any missing details
14:17:52 There is another one:
14:17:57 kashyap: Take a libvirtd bug with you for entertainment!
14:18:00 kashyap: Can you ping me with anything you find out?
14:18:03 davidgiluk: Looking sir
14:18:08 o/
14:18:12 #link Volume based live migration aborted unexpectedly: https://bugs.launchpad.net/nova/+bug/1524898
14:18:12 Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:18:12 mdbooth: Sure. I'll note anything that I can find
14:18:29 has anyone had a chance to look at this one?
14:19:09 The occurrences of these have increased in the last 48 hours
14:19:20 eliqiao: "subnode-2" means it's the _source_ machine, right? (Despite the name 'subnode-2')
14:21:16 davidgiluk: Okay, it's using the old style ("inc":true approach)
14:21:23 danpb thinks this one is probably qemu - but don't know why it would start failing more now
14:21:32 kashyap: yeah, it should be. I checked the log, the monitor thread is running on that.
14:21:51 kashyap: Do we have qemu logs? Also wasn't there an NFS+migration one from a few weeks back - did that get fixed?
14:21:57 PaulMurray: Matt says in that bug:
14:21:59 (s/true/false)
14:22:02 This fails primarily on OVH nodes which are the slowest ones we have in the community infra system, so that indicates there is a speed component to this.
14:22:15 davidgiluk: There will be - DevStack records it, let me find
14:22:16 I wonder if it's a race in an old qemu which is triggered by a slow system
14:22:22 And our system got slow again lately?
14:23:39 davidgiluk: The same dreadful "signal 15" (SIGTERM) - http://logs.openstack.org/84/286084/3/check/gate-tempest-dsvm-multinode-full/a5d70a2/logs/subnode-2/libvirt/qemu/instance-00000022.txt.gz
14:23:47 tdurakov, does the live migration job use the same setup?
14:24:04 kashyap: the qemu logs seem good.
14:24:14 kashyap: So I think that means libvirt decided to kill it - nothing that qemu is complaining about there
14:24:16 davidgiluk: Also, they (the test) seem to have supplied the TUNNELLED flag.
14:24:52 kashyap: Yeh so isn't it the constraint that if you have tunnelled on then you have to use qemu block migration?
14:25:08 davidgiluk: the tempest case will kill it if the LM case has no response after 250s? (maybe)
14:25:14 We should really be testing with latest upstream stable of libvirt/QEMU. Markus (Z) was following it up on the upstream list - not sure about its status
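As background to the "inc":true vs NBD question above, a rough sketch of the two block-migration modes in libvirt terms (illustrative only, not nova's driver code): with VIR_MIGRATE_TUNNELLED set, libvirt cannot use the NBD/drive-mirror path and falls back to qemu's legacy block migration, which is the "inc" style seen in the logs. The domain name and destination URI are placeholders.

```python
# Illustrative flag combination only - not what nova passes verbatim.
import libvirt

flags = (libvirt.VIR_MIGRATE_LIVE |
         libvirt.VIR_MIGRATE_PEER2PEER |
         libvirt.VIR_MIGRATE_TUNNELLED |      # dropping this allows the NBD path
         libvirt.VIR_MIGRATE_NON_SHARED_INC)  # incremental block migration

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000022')  # placeholder domain name
# Placeholder libvirtd URI for the destination compute host.
dom.migrateToURI('qemu+tcp://dest-host/system', flags, None, 0)
```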
14:25:57 we even have a kill of the job if the ligration is not making progress for more than 150 seconds
14:26:03 davidgiluk: Upstream libvirt will fall back to the legacy approach even when the tunnelled flag is supplied
14:26:03 that is reported in the compute log
14:26:19 ligration = live migration
14:27:17 andrearosa: What's the definition of progress? If it's shoveling block data and not moving much RAM data what happens?
14:27:53 davidgiluk: I think so, but I have to check the code. I saw that in my tests a few times but I never dug into it
14:29:15 I'm a bit concerned that we may well be wasting time tracking down issues which are down to bugs which are already fixed in stable libvirt/qemu
14:29:27 I'm not sure this is the best use of people's time
14:29:41 Any chance we can upgrade?
14:30:06 mdbooth: Fully agree with you
14:30:27 Markus (on #openstack-nova) just provided some status on it (newer virt components in the gate)
14:30:28 mdbooth, agree - we were looking at getting up to date libvirt/qemu in some CI jobs
14:30:30 https://etherpad.openstack.org/p/libvirt-latest-test-job
14:30:34 yes it would be better if you could taste our newer bugs
14:30:37 anyone know where we are with that
14:30:39 If we were seeing these issues with the latest stable libvirt/qemu it would be quite clear that there's something we need to look at
14:31:25 PaulMurray: mdbooth: https://review.openstack.org/#/c/289255/
14:31:37 That's the WIP test job
14:31:48 hey hey
14:31:59 kashyap: That's related to features, though. I'm more interested in stability.
14:32:26 mdbooth: Some of the "bugs" we investigate upstream are already fixed in the latest stable versions of libvirt/QEMU
14:32:30 It's also interesting for stability as we know upfront when we do a min_libvirt_version bump
14:32:47 It's just due to the UCA having old versions that are Just Buggy Enough
14:33:01 ("UCA" == Ubuntu Cloud Archive)
14:33:25 That's the version we use to poke through the happy path. We haven't fully figured out how to use the very latest.
14:34:31 andrearosa: If you hit it again I'd be interested in what the idea of 'progress' is and what behaviour you're seeing; I can imagine that perhaps it's just shifting block data and not making much progress on the RAM, but if it's an incremental I wonder why it's taking so long on the block
14:35:09 davidgiluk: ok, I'll let you know if I have a reproducer
14:35:44 I'm not sure we can do anything more discussing this now
14:36:18 It would be good if we can try to figure out the problem and see if it matches a known bug in libvirt/qemu
14:36:39 Yep
14:36:58 moving on
14:37:10 #topic CI Coverage
14:37:26 is tdurakov here?
14:37:49 there is not much on this pad: https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:38:16 Is there anything that describes the current CI?
14:38:37 Good question, what are you looking for?
14:39:33 there are some links here: https://etherpad.openstack.org/p/mitaka-live-migration
14:40:22 PaulMurray: I think what davidgiluk is asking is: what _kind_ of tests the current live migration CI does
14:40:36 well it's just if there's a thing saying to improve the CI I was just wondering what's already there
14:41:17 there is a job in the experimental queue
14:41:49 It has several types of storage
14:42:09 * PaulMurray scrambles around for a link
14:42:19 It's what is in Tempest currently: LM of a server (ephemeral), LM of a server booted from a volume, LM of a server in paused state
14:43:00 There are several additional tests under review
14:43:00 what workload do you run on the server while it's migrated?
14:43:07 none
14:43:32 do you check the status of the server afterwards - e.g. check it's not rebooted, check it can still do IO to its disk, network
14:44:11 no - Matt Riedemann submitted a change to do an ssh check before and after LM
14:44:25 but the ssh check is not 100% reliable currently
14:44:55 so little chance to get that voting
14:45:07 yeh, checking the state of the VM after migration is pretty important
14:45:52 Maybe we can get someone to put the current state of the CI - what it does now - on https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:46:10 jlanoux, could you organise that with tdurakov please?
14:46:29 PaulMurray: yep, no problem
14:46:37 #action jlanoux tdurakov to add current state of live migration CI to https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:46:57 and add ideas
14:47:20 we've had qemu bugs where you'd find the guest rebooted instead of retaining the state; and you don't notice that if you just check for the migration completing
14:48:23 davidgiluk, saw the addition to the page - I guess that's you
14:48:28 yep
14:48:57 good - feel free to add the check as well - I think there are some patches in tempest for this job now
14:49:24 so all look for things to review - do point any reviews on the priorities page as you go if they are not there
14:49:32 #topic Bugs
14:49:42 Any other bugs (10 mins left)
14:50:27 * PaulMurray is sure there are other bugs - really checking if anyone wants to bring any up
14:50:42 #topic Open Discussion
14:50:46 anything else?
14:51:37 I posted something to the mailing list yesterday, prompted by a tdurakov patch
14:51:58 mdbooth, do you have a link
14:52:00 I would be interested if anybody has thought through the implications of this for things I may not have thought of
14:52:22 * mdbooth hunts
14:52:54 http://lists.openstack.org/pipermail/openstack-dev/2016-March/088578.html
14:52:55 mdbooth, is it this: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088578.html
14:52:55 just as a side note, potential rc blocker bugs get tagged with "mitaka-rc-potential"
14:53:11 PaulMurray: yup
14:53:19 you got there first
14:53:42 It occurs to me that cleanup in the event of failure might be more complicated
14:54:21 However, I'd be interested to know if people think that it solves more than it causes
14:54:41 mdbooth, I'll take a look
14:55:18 mdbooth, could also start a similar thread asking if operators see problems - but put it in the context of running systems?
14:55:26 mdbooth, on the operators list
14:55:46 I don't know how I'd frame it other than: do you find migration to be a bit fragile? ;)
14:55:58 I don't think the response would be useful :)
14:56:03 mdbooth, I think we know the answer to that
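Picking up davidgiluk's point from the CI discussion about verifying the guest was not silently rebooted: a hedged sketch of the kind of check that could sit alongside the ssh test (not the actual tempest patch). The guest IP, user and key path are placeholders; /proc/sys/kernel/random/boot_id changes on every boot, so an unchanged value means the guest survived the migration without rebooting.

```python
# Hedged sketch of a post-migration guest check; IP, user and key are placeholders.
import subprocess

def guest_boot_id(ip, key='~/.ssh/test_key'):
    # boot_id is regenerated by the kernel on every boot
    return subprocess.check_output(
        ['ssh', '-i', key, '-o', 'StrictHostKeyChecking=no',
         'cirros@%s' % ip, 'cat /proc/sys/kernel/random/boot_id']).strip()

before = guest_boot_id('192.0.2.10')
# ... trigger the live migration here and wait for the server to return to ACTIVE ...
after = guest_boot_id('192.0.2.10')
assert after == before, 'guest rebooted during live migration'
```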
14:56:24 markus_z, what is the process about mitaka-rc-potential bugs?
14:56:53 #link http://lists.openstack.org/pipermail/openstack-dev/2016-March/088205.html
14:57:03 PaulMurray: ^
14:57:15 #info potential rc blocker bugs get tagged with "mitaka-rc-potential" - see: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088205.html
14:57:20 thanks
14:57:47 I think we're done for today - so we'd better end before that Stable lot come and throw us out again
14:58:15 thanks for coming - please see if you can help with those CI bugs
14:58:19 PaulMurray: yeahh best you be finishing :P
14:58:40 ...ooo I'm scared... :)
14:58:46 #endmeeting
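As a follow-up to the DOCIMPACT item on bug 1550525 from the first topic, a hedged sketch of the call the documentation needs to cover: aborting an in-progress live migration is DELETE /servers/{server_id}/migrations/{migration_id}, added with compute API microversion 2.24. The endpoint, token and IDs below are placeholders.

```python
# Hedged sketch; endpoint, token, server and migration IDs are placeholders.
import requests

compute = 'http://controller:8774/v2.1'         # placeholder compute endpoint
headers = {
    'X-Auth-Token': '<token>',                  # placeholder auth token
    'X-OpenStack-Nova-API-Version': '2.24',     # microversion adding the abort call
}
server_id = '<server-uuid>'
migration_id = '<migration-id>'                 # from GET /servers/{server_id}/migrations

resp = requests.delete('%s/servers/%s/migrations/%s'
                       % (compute, server_id, migration_id), headers=headers)
print(resp.status_code)                         # expect 202 on success
```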