13:59:53 #startmeeting Nova Live Migration
13:59:55 Meeting started Tue Mar 15 13:59:53 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:59:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:59:58 The meeting name has been set to 'nova_live_migration'
14:00:02 o/
14:00:02 o/
14:00:06 o/
14:00:11 o/
14:00:11 hi all
14:00:23 o/
14:00:28 ho
14:00:31 *hi
14:00:40 Agenda here as usual: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:00:49 o/
14:01:19 hi
14:01:21 please tell me if I start using link instead of topic by mistake again this week
14:01:40 #topic Documentation
14:01:47 o/
14:02:13 I saw some comments in openstack-nova
14:02:48 if you have DocImpact on your patch you need to supply enough info in the commit message
14:02:57 or in the bug for docs team to deal with it
14:03:11 so please check it is all ok if you have a patch like that merged
14:03:25 that's not actually correct now i don't think,
14:03:30 oh
14:03:31 DocImpact commits generate a bug in nova now
14:03:40 b/c the docs team got tired of dealing with these
14:04:01 if it's actually a bug for the manuals (which the docs team owns), then it has to be triaged and routed to the proper project
14:04:41 e.g. a lot of people put DocImpact in their commit for config option changes, which are automatically generated in docs.o.o, so there is no need for the docs team to know about those
14:04:44 We did see some bugs being assigned to the original committer
14:05:12 do we need to do anything besides make sure there is enough info ?
14:06:00 well,
14:06:09 if it's meant to stay within nova, i don't think it requires DocImpact,
14:06:23 if you're not sure, then i'd say tag it and yeah, provide a bunch of info in the commit message
14:06:37 e.g. new APIs should be doc'ed in openstack/api-site
14:07:04 if the doc impact is just for having a release note in nova, then don't use the tag, just provide a reno in the change itself
14:07:06 etc etc
14:07:21 right
14:08:03 so everyone make sure you follow up if your change needs a document update - wherever it is
14:08:25 #topic Bugs
14:08:33 This is the main one for today
14:09:24 mriedem, these seem to be occurring slightly less : http://status.openstack.org/elastic-recheck/
14:09:33 https://bugs.launchpad.net/nova/+bug/1524898
14:09:33 Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:09:40 https://bugs.launchpad.net/nova/+bug/1539271
14:09:40 Launchpad bug 1539271 in OpenStack Compute (nova) "Libvirt live block migration migration stalls" [High,Confirmed] - Assigned to Eli Qiao (taget-9)
14:09:50 156 fails in 10 days, 68 fails in 10 days
14:09:52 :(
14:09:58 I think we didn't make any progress
14:10:01 i'd have to bring up the graphite charts to see
14:10:18 tdurakov was saying yesterday that the live migration job is going to try and hook in with the trunk libvirt devstack plugin,
14:10:29 in the hopes that newer libvirt/qemu can resolve the random live migration failures
14:10:57 mriedem: as we discussed earlier: once devstack plugin is ready, I'm going to test l-m with latest versions
14:11:49 any alternative approaches for that?
14:11:53 honestly,
14:12:07 i wouldn't be surprised if just isolating the live migration tests to their own job might not help the issue,
14:12:15 i.e. running fewer tests in serial might make it go away
14:12:29 mriedem: it's not only isolation
14:12:32 with the random aborts/stalls we see today, those could be due to other tests hitting the host at the same time
14:12:43 separate job i mean
14:12:59 and it also suffer from same issues
14:13:08 mriedem, tdurakov we need newer libvirt to test newer features at all
14:13:13 tdurakov: is the separate job running tests concurrently or in serial?
14:13:30 mriedem: small amount of live-migration tests concurrently
14:13:44 tdurakov: i'd be interested to see what happens if you run that in serial
14:13:51 PaulMurray: agree
14:14:07 mriedem, I could try in separate patch to hook
14:14:14 let me check your idea
14:14:30 mriedem, I think we have two problems - what we do about that existing test affecting the check queue
14:14:43 and what we do for better live migration tests
14:15:26 PaulMurray: unfortunately both jobs suffer from same issues
14:15:41 so, I think we have same rootcause
14:16:22 PaulMurray: unfortunately you're in a bind,
14:16:30 the job can't be voting with a failure rate that high
14:16:45 but to get the failure rate down, you'd have to (1) fix the bugs or (2) skip those tests
14:16:53 and (2) drops the coverage you care about in the first place
14:17:26 mriedem, yes - we need to try and fix it
14:17:51 and i'm not sure we can fix (1) in nova, it might be a fix in newer libvirt/qemu
14:17:58 but we won't know until we have that dsvm plugin
14:18:12 if there is a workaround/fix in nova, danpb would probably have to tell us what it is
14:18:29 tdurakov, how far are you from getting the newer libvirt/qemu ?
14:18:32 like, can we detect the random abort/stall and retry the live migration?
14:18:43 with a workaround config option
14:18:52 PaulMurray: it's on markus_z right now
14:19:16 tdurakov, what does he need to do ?
14:19:43 let's ask markus_z, I seen patch on review
14:19:50 What's the technical issue with testing against newer libvirt/qemu, btw?
14:20:34 mdbooth: it's just not in the ubuntu 14.04 images we gate on
14:20:46 mriedem: Ah... ok.
14:20:49 That's a pita.
14:21:07 for whatever reason we haven't just enabled cloud archive in the jobs to get them,
14:21:14 or used centos 7.x
14:21:25 maybe the devstack plugin uses the cloud archive, i don't know
14:21:29 So, is CI planning to maintain its own libvirt/qemu stack?
14:21:36 no
14:21:40 well, i don't think so
14:21:56 i'm hoping they aren't planning on rebuilding libvirt/qemu from source as part of the job
14:23:12 tdurakov, is this what you were thinking about: https://review.openstack.org/#/c/289255/
14:23:29 the change from markus_z
14:23:55 PaulMurray: not exactly
14:24:20 https://github.com/tbreeds/tbreeds-devstack-plugin-additional-pkg-repos.git
14:24:26 https://github.com/tbreeds/tbreeds-devstack-plugin-additional-pkg-repos
14:24:50 https://github.com/tbreeds/tbreeds-devstack-plugin-additional-pkg-repos/blob/master/devstack/lib/libvirt#L18
14:24:53 so it's not latest libvirt,
14:24:56 it's the liberty cloud archive
14:26:34 This doesn't look like it is going to be resolved very soon
14:26:42 mriedem: let's check your idea: https://review.openstack.org/#/c/292907/
14:27:55 PaulMurray: Hypothetically, could it be resolved with another body?
14:27:56 PaulMurray: it might, it's for the experimental queue
14:28:08 so it doesn't hurt to land the job and tweak it
14:28:20 i just had some questions in there about qemu
14:28:36 i think the other thing was,
14:28:45 tonyb was looking to get https://github.com/tbreeds/tbreeds-devstack-plugin-additional-pkg-repos into the openstack namespace
14:29:00 which i think we actually could just have that live in nova as a devstack plugin
14:29:09 some projects have devstack plugins in tree, some have separate repos
14:29:21 if it were in tree we could test changes to it via the experimental queue job
14:29:23 paul-carlton1, I think you saw one of the bugs on your setup running just one migration ?
14:30:34 yes
14:30:52 I'll refresh and try again
14:31:18 so the concurrency limited to 1 might not make a difference
14:31:27 but worth trying...?
14:31:33 PaulMurray: true
14:31:49 just to make sure
14:32:03 I think we need to move on
14:32:04 paul-carlton1: If you do hit it again I'd like to understand exactly what 'stuck' means - i.e. do we have any info migrate output to see if any data is moving
14:32:30 davidgiluk: i think danpb had notes on the stuck bug
14:32:37 from what he saw in the libvirt logs
14:32:57 ok
14:33:16 davidgiluk: It's hard to justify committing a lot of resources to it when there's a good chance it's already fixed.
14:33:28 What I saw was it trying to migrate, i.e. evidence from libvirt of it starting but no progress
14:33:41 mdbooth: Yep
14:33:43 mdbooth: well, this is someone saying they see the same but on newer libvirt/qemu https://bugs.launchpad.net/nova/+bug/1539271/comments/7
14:33:43 Launchpad bug 1539271 in OpenStack Compute (nova) "Libvirt live block migration migration stalls" [High,Confirmed] - Assigned to Eli Qiao (taget-9)
14:33:50 libvirt 1.2.12-0ubuntu14.2~cloud0
14:33:54 qemu 1:2.2+dfsg-5expubuntu9.6~cloud0
14:34:03 trusty is libvirt 1.2.2 and qemu 2.0 i think
14:35:02 still not the latest... has anyone looked for the bug/fix in qemu notes ?
14:36:12 I can get someone to follow up on checking that
14:36:32 should probably ask danpb
14:36:40 yes
14:36:42 eliqiao said in the bug that he didn't see anything helpful in the libvirt logs
14:37:07 I'm trying to summarise what we are doing
14:37:09 ...
14:37:36 tdurakov, you said markus_z is doing the dsvm plugin - is that right?
14:37:38 PaulMurray: If paul-carlton1 can reproduce, then retrying with newer libvirt/qemu seems like exactly the way to go
14:37:53 mdbooth, he has been trying to set that up but having trouble
14:37:56 right, devstack plugin
14:38:04 tdurakov, thanks
14:38:20 paul-carlton1 is trying to reproduce with latest libvirt/qemu
14:38:51 We have an attempt to see if serializing it does anything
14:39:07 Reproducer details would be really useful. I could run it against Fedora, which is normally pretty close to upstream latest.
14:39:08 And we need someone to look at a workaround
14:39:41 Anyone volunteer to follow up on a workaround (at least coming up with a plan)?
14:40:13 I'm happy to follow up if I can reproduce on Fedora
14:40:24 mdbooth, thanks
14:40:27 mdbooth: If you can, ping me and I'll prod the qemu
14:40:30 If paul-carlton1 can share steps
14:40:38 I think that is all we can do right now
14:40:56 Does anyone have other bugs to discuss ?
14:41:08 I don't think we have any blockers for rc
14:41:54 i might poke on the abort bug to see if i can work around it
14:41:58 by retrying/restarting the migration
14:42:24 mriedem, thanks
14:42:27 mriedem: If you can at least detect it then if you can see if qemu responds to an 'info migrate' at the point it's apparently ill that would be good
14:42:34 mdbooth will do when I have it working
14:42:56 paul-carlton1: Can you share the reproducer for old libvirt?
14:43:09 Just basic steps/setup would be good. Anything you've got.
14:43:15 #topic CI Coverage
14:43:25 https://etherpad.openstack.org/p/nova-live-migration-CI-ideas
14:43:38 Saw current status has been filled in
14:43:45 and only a few ideas
14:44:13 It will be good to progress any bigger ideas before the summit
14:44:36 Not sure there is much else to say there
14:45:01 #topic Open Discussion
14:45:11 just a few minutes
14:45:34 could you review this: https://review.openstack.org/#/c/292271/
14:45:50 johnthetubaguy has put up an etherpad for summit ideas
14:45:56 that is something that is good to add
14:46:04 Postcopy is now no-longer-experimental in qemu; so that clears the way for jdenemar's libvirt support
14:46:26 PaulMurray: ok, will add
14:46:37 maybe I should make another etherpad like last time for newton
14:46:37 davidgiluk: great to hear that!
14:46:50 tdurakov: Does that bp answer my question about conductor?
14:47:01 * mdbooth will review it
14:47:08 mdbooth: I've updated patches with poc
14:47:17 mdbooth: thanks!
14:47:41 tdurakov, mdbooth it would be good to do something about the depth of the call chain in pre_live_migration if nothing
14:47:43 else
14:48:00 I have seen rpc timeouts due to volume operations being slow
14:48:13 I am sure it will be fragile because of the nested rpcs
14:48:19 PaulMurray: there is no rpc calls now
14:48:19 PaulMurray: the async patches might help that, no?
14:48:26 in spec and poc
14:48:38 so it will possibly solve your problem
14:48:38 tdurakov, mdbooth cool - I need to catch up
14:48:56 honestly, we probably need to just stop trying to do all that during an API call, and fall back on the error handling we have in instance actions
14:49:12 there might be some upfront checks, but let's stop pretending.
14:49:39 that would bring it more in line with every other operation, but yeah tdurakov has some good plans around that area
14:50:05 tdurakov, do you have links to hand ?
14:50:23 https://review.openstack.org/#/c/291161/
14:50:24 PaulMurray: it's on reference chapter of spec
14:50:50 thanks - just realised you gave a link above already - duh
14:51:08 We also need to progress storage pools
14:51:14 mdbooth, paul-carlton1 ^^
14:51:15 johnthetubaguy: could we discuss your idea in details?
14:51:47 PaulMurray: Yep. I haven't touched it in a couple of weeks, tbh. Been taking a breather from it, but I'll pick it up again rsn.
14:52:18 yep, next on my list after live mig bugs
14:52:24 johnthetubaguy, who will be organising the sessions ? is it new PTL ?
14:53:22 tdurakov: I'm assuming that johnthetubaguy is saying we should just kick actions off, and have them set the instance to an error state asynchronously if they fail?
14:53:37 I'll try to get a full list of work going on related to live migration - look out for a ML post
14:53:48 mdbooth: ok
14:53:50 Rather than attempting to return success from a synchronous rpc call
14:53:57 However, I may have misread
14:54:20 tdurakov: I'm pretty sure that's one of the things your patches do anyway, though
14:54:30 mdbooth: yes
14:54:48 PaulMurray: yeah, the new PTL, and their team, will be choosing them
14:55:10 In general, call() isn't a great idea for most things, I think
14:55:24 Except queries
14:55:36 johnthetubaguy, BTW - thanks for being so helpful for us - I hope you will stay around
14:55:48 PaulMurray: +1
14:55:52 +1
14:56:07 I totally plan on staying around, and focusing on Nova
14:56:28 * andrearosa is glad to hear that
14:57:07 Ok, I'm going to call it there. mriedem is already in the room and you know he can get pushy :)
14:57:12 thanks for coming
14:57:25 #endmeeting