14:02:50 #startmeeting Nova Live Migration
14:02:51 Meeting started Tue Jun 14 14:02:50 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:52 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:54 The meeting name has been set to 'nova_live_migration'
14:03:06 o/
14:03:06 * andrearosa was not late at all
14:03:07 o/
14:03:07 hi
14:03:09 o/
14:03:17 o/
14:03:26 Sorry guys - I got "reset by peer" at exactly 15:00:52
14:03:36 * kashyap waves
14:03:57 remembering where I am....
14:04:01 o/
14:04:09 the agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:04:15 ...is relatively empty
14:04:30 #topic CI
14:04:43 Does anyone have an update on CI ?
14:05:01 do we know who is working on this right now, maybe tdurakov?
14:05:07 o/
14:05:16 hey
14:05:17 last week mriedem was helping out
14:05:25 missed notification
14:05:32 and tdurakov is usually
14:05:42 o/
14:05:52 tdurakov, we're on CI
14:05:57 do you have an update ?
14:06:32 last week we were talking about post-config and devstack-gate
14:06:46 NFS and ceph setup are busted on xenial
14:06:58 https://review.openstack.org/#/c/327886/
14:07:05 i tried skipping NFS but it failed in setting up ceph
14:07:13 looks like probably an issue with systemd on 16.04
14:07:45 but the stuff before the shared storage is kinda working now?
14:07:51 i don't have a 16.04 devstack vm available to troubleshoot the scripts
14:08:00 yes, scenario #1 is working
14:08:12 has the instability gone, do we know?
14:08:16 so i guess our option is to skip nfs and ceph (2-4)
14:08:23 i don't know that yet
14:08:27 until we can get the job passing
14:08:33 i have a bug open for the NFS setup issue
14:08:38 right, that's what I wondered
14:08:45 i might as well open one for ceph, and then skip both until someone fixes them
14:08:49 it's very tempting to get *something* passing soon to check that
14:08:56 yeah, agree
14:09:03 i can update that patch to skip ceph also
14:09:07 cool, ping me if those patches appear
14:09:11 awesome
14:09:29 thanks mriedem
14:09:35 johnthetubaguy, mriedem nfs potential fix https://review.openstack.org/#/c/329466/
14:10:29 tdurakov: ok i'll keep an eye on that
14:10:46 * johnthetubaguy nods
14:10:46 Anything else happening in CI ?
14:11:21 moving on
14:11:33 #topic Libvirt Storage Pools
14:11:42 mdbooth, ?
14:11:57 i have an item in our kanban this week to focus on reviewing the imagebackend refactor series
14:12:04 diana_clarke has a series posted
14:12:06 so, harassing me to review those this week is fair game :)
14:12:18 It's all green except for virtuozzo
14:12:25 Looks like we might have a real bug there
14:12:56 will it block you ?
14:13:08 We don't currently see any blockers
14:13:20 Actively looking for review there
14:13:27 has anyone talked to mnestratov?
14:13:53 As previously noted, we don't have coverage of the flat backend yet, but that's in hand
14:14:03 I understand, anyway?
14:14:48 not really
14:14:53 depends on post-config working in devstack-gate
14:15:12 which is not sounding like a priority for the QA team, and i haven't been trying to help out there, and don't really want to keep bugging sdague about it
14:15:20 If that stalls, what's the impact on us?
14:15:30 Would we make that a blocker?
14:15:51 mriedem: I've been working through it, it's close
14:16:06 mdbooth: i don't think we'd make it a blocker no
14:16:08 Could we, perhaps, have a certain level of confidence by running tempest against it manually?
14:16:17 It's not as if we're reducing test coverage
14:16:17 mdbooth: was just going to say that
14:16:37 https://review.openstack.org/#/c/326585/ passes everything but multinode, but it could use more hands to move it faster
14:17:05 sdague: yikes i need to drop those kilo jobs :)
14:17:29 mriedem: heh, yeh, there will be stable backports to devstack / grenade to make it all work completely in the end
14:17:30 diana_clarke is currently busy, but I wonder if she's up for setting up a private tempest environment for the flat backend
14:17:38 Would you take our word on it that it passed?
14:18:05 i trust no one
14:18:12 j/k, sort of
14:18:16 Understandable, and very wise
14:18:27 honestly i wouldn't block on that unless i personally found the time to test it out too
14:18:30 which probably won't happen
14:18:47 I do think if we aren't regressing test coverage we should move forward, and just know what we need to get in place before the end of cycle to feel confident
14:18:57 Yeah. I'd prefer to see it pass at least once, but we already risk breaking it every time we touch it anyway.
14:19:07 i think there are ways i could test this out upstream too
14:19:17 with some WIP patches to d-g
14:19:21 we've done that before
14:19:26 mriedem: is this just needing post-config to flip a nova bit?
14:19:50 we can definitely do one-off tests for that in upstream
14:19:51 sdague: yeah, but honestly i could just do a WIP devstack change that sets images_type=flat and depends on the top of the refactor series
14:19:56 mriedem: yeh
14:19:57 when that passes we throw the patch away
14:20:03 same as glance v2
14:20:13 mdbooth: so i'll just do that
14:20:16 right, I think that's a completely acceptable approach
14:20:21 mriedem: Awesome, thanks.
14:20:35 #action write one-off devstack change to test image backend refactor series with flat type
14:20:38 until we get the rest of the test infrastructure retooled
14:20:41 #undo
14:20:48 mriedem: Will it require mriedem's magic key, or if you document the hack could we rerun it?
14:20:49 #action mriedem to write one-off devstack change to test image backend refactor series with flat type
14:21:04 mdbooth: it's pretty simple, i'll just add you to the review when it's up
14:21:15 mriedem: diana_clarke too, please
14:21:15 just set a flag in devstack and add the depends-on in the commit message
14:21:18 sure
14:21:53 Thanks mriedem mdbooth
14:22:04 are we done with that ?
14:22:08 yeah
14:22:15 yup
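[Editor's note: for anyone reproducing the throwaway test mriedem describes, the flag would presumably be set through devstack's standard post-config mechanism in local.conf, with the refactor series pulled in via a Depends-On footer in the devstack change's commit message. A minimal sketch, not the actual WIP patch:]

```ini
# local.conf fragment: inject the flat image backend into nova.conf
[[post-config|$NOVA_CONF]]
[libvirt]
images_type = flat
```

[The commit message footer would then carry `Depends-On: I<change-id>`, where the change-id points at the top patch of the imagebackend refactor series, so the gate builds the whole series under test.]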
14:22:31 I know it comes after, but I always worry about plans for the storage pools part
14:22:40 as the specs haven't changed recently
14:22:54 paul-carlton2, are you there ?
14:23:19 PaulMurray: I think he is not available for the meeting today
14:23:26 Maybe I can catch up with him later
14:23:36 thanks andrearosa
14:23:49 #topic Review request
14:24:15 I put all three items there for review
14:24:19 there are a few on the list
14:24:34 pkoniszewski, anything to say about them ?
14:24:46 https://review.openstack.org/#/c/234659/
14:24:47 yeah, just short comments
14:24:58 https://review.openstack.org/#/c/328910/
14:25:06 https://review.openstack.org/#/q/topic:bp/libvirt-clean-driver
14:25:35 first of all, the patch that fixes tunnelled block live migration is merged into master; the backport to mitaka was conflicting so i'd ask for some eyes there as i don't want to miss anything - https://review.openstack.org/#/c/328910/
14:26:12 given that we are broken out of the box in mitaka...
14:26:56 second thing is that we are a bit stuck with luis5tb implementing automatic live migration completion, especially the post-copy bits, because the live migration monitor is super complex and it starts to break pep8
14:27:32 danpb proposed a series of patches to split the monitor into methods, so before we move forward with automatic live migration completion we need to have the monitor split https://review.openstack.org/#/q/topic:bp/libvirt-clean-driver
14:28:03 and the last is that i finally managed to make live migration work with an iso9660 config drive, patch is up for review - https://review.openstack.org/#/c/234659/
14:28:04 that's all
14:28:06 thanks :)
14:28:32 are they on the review page ?
14:28:39 ok, I'll rebase the post-copy patches to include danpb's patch on live migration monitoring
14:28:40 not yet
14:28:46 i'll put all of them
14:28:48 * mriedem runs to another meeting
14:28:56 thanks
14:29:41 Anyone have any other reviews to mention ?
14:29:57 * mdbooth has some incidental to some aob
14:30:25 moving on
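[Editor's note: for background, the split pkoniszewski refers to is the classic extract-method refactor: each go/no-go decision is pulled out of the big monitor loop into a named helper so the loop body drops back under pep8's complexity threshold. A simplified sketch of that shape, with invented names and logic standing in for the real nova.virt.libvirt.driver code in danpb's series:]

```python
import time


class FakeMigrationMonitor(object):
    """Simplified live migration monitor illustrating the extract-method split."""

    def __init__(self, progress_timeout=150, completion_timeout=800):
        self.progress_timeout = progress_timeout
        self.completion_timeout = completion_timeout

    def _progress_stalled(self, now, progress_time):
        # Extracted decision: has data_remaining failed to drop for too long?
        return (now - progress_time) > self.progress_timeout

    def _migration_overran(self, now, start_time):
        # Extracted decision: has the migration exceeded its total time budget?
        return (now - start_time) > self.completion_timeout

    def monitor(self, get_job_info, abort):
        # The loop itself stays short; each policy lives in a named helper.
        start_time = progress_time = time.time()
        progress_watermark = None
        while True:
            info = get_job_info()
            now = time.time()
            if progress_watermark is None or info.data_remaining < progress_watermark:
                # Record the new low-water mark and restart the stall timer.
                progress_watermark = info.data_remaining
                progress_time = now
            if info.data_remaining == 0:
                return  # copying finished
            if (self._progress_stalled(now, progress_time) or
                    self._migration_overran(now, start_time)):
                abort()
                return
            time.sleep(0.5)
```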
14:30:37 #topic Open Discussion
14:30:46 I had one thing on the agenda
14:31:01 related to: http://lists.openstack.org/pipermail/openstack-dev/2016-June/097016.html
14:31:03 * mdbooth has a thing, but didn't put it on the agenda :/
14:31:16 * PaulMurray I've got it in mind
14:31:44 That thread is about sync issues in live migration functions
14:32:12 it reminded me that we had a situation where pre_live_migration was running at the same time as rollback
14:32:17 due to RPC timeouts
14:32:33 PaulMurray: I think that's a scheduler thing. live migration is just a trigger.
14:32:52 mdbooth, well the point is that there are other sync issues
14:32:54 * pkoniszewski goes offline
14:33:01 besides the resource counting
14:33:28 I wanted to check if that was being looked at
14:33:28 So, forgive my high level ignorance here. Is the resource tracker co-located with the scheduler?
14:33:36 no
14:33:36 i.e. it's not running on the compute host, right
14:33:43 it's on the compute host
14:33:44 It *is* running on the compute host?
14:33:52 Oh.... in that case it might be interesting
14:34:12 the resource tracker is supposed to be the source of truth for resource consumption
14:34:20 but it does refresh
14:34:31 Sources of truth shouldn't need refreshes
14:34:55 so I guess the hypervisor is really the source of truth for resources
14:35:03 and the DB is the source for their consumption
14:35:10 based on what instances are supposed to be on the host
14:35:24 the resource tracker just calculates from there
14:36:09 there has been a debate about moving the tracking to the scheduler instead
14:36:26 but that has not been done
14:36:36 * mdbooth doesn't understand how info from the resource tracker is consumed by the scheduler.
14:36:54 It sounds to me like the symptom is that the scheduler is sending an instance somewhere which can't handle it.
14:37:13 that should be ok
14:37:32 because the compute manager checks it fits when it accepts it
14:37:50 ...and sends it back for a reschedule if it doesn't
14:38:48 My concern was that the migration functions don't have synchronization
14:39:00 like the boot/delete etc. do
14:39:20 so you can do more than one operation on the instance at a time by mistake
14:40:07 is tdurakov still here ?
14:40:42 yes
14:40:50 tdurakov, this all seems relevant to the refactor you are doing
14:41:03 race discussed on ml?
14:41:27 more sync in general
14:41:38 between live migration functions and other operations
14:42:11 tdurakov, or do you think it's a separate issue
14:43:28 well, need to look closely, but briefly yes, it's connected
14:43:42 ok - we can talk later
14:44:04 * tdurakov starred the ml thread, will respond later
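[Editor's note: the per-instance serialization the boot/delete paths use is built on oslo.concurrency's lockutils. A minimal sketch of that pattern as it might apply here; the wrapper function and its body are assumptions for illustration, not an actual Nova patch:]

```python
from oslo_concurrency import lockutils


def pre_live_migration(context, instance, block_migration):
    """Hypothetical wrapper showing the locking pattern, not real Nova code."""

    @lockutils.synchronized(instance.uuid)
    def _do_pre_live_migration():
        # Work done while holding the instance-scoped lock: a late rollback
        # (e.g. one fired after an RPC timeout) that takes the same lock
        # cannot interleave with this setup, which is the race PaulMurray
        # describes above.
        print('pre_live_migration for %s' % instance.uuid)

    _do_pre_live_migration()
```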
14:44:17 mdbooth, do you want to do your item now ?
14:44:42 Yeah, I'm looking at trying to consolidate driver._create_image and driver._create_images_and_backing
14:44:51 They both do very similar things slightly differently
14:45:34 The critical difference I'm interested in is that _create_images_and_backing uses the libvirt xml as source of truth
14:45:43 _create_image uses bdms/instance
14:45:54 live migration uses the former
14:46:05 Can anybody think of any reason those 2 might get out of whack?
14:46:54 In the process of investigating this, I'm intending to write up a ml post describing the various different block device data structures, what they are and where they come from
14:47:23 Don't know
14:47:37 Couple of cleanup patches I wrote while digging: https://review.openstack.org/#/c/329366/ https://review.openstack.org/#/c/329381/
14:47:42 I think the xml is not really supposed to be the source of truth ?
14:47:42 the latter is pretty egregious
14:48:03 PaulMurray: That's also my understanding, looking for edge cases
14:48:59 Do you know why it takes the xml as truth ?
14:49:06 PaulMurray: No.
14:49:21 Which makes me slightly nervous.
14:49:34 paul-carlton2 realised something similar with live migrate rescued
14:50:12 rescue keeps the old xml so it can unrescue, but you can just do a hard reboot to rebuild the whole state from scratch instead
14:50:20 so no real need for the old xml
14:50:22 Right
14:50:30 They *should* be interchangeable
14:50:47 And afaik they are, but I wouldn't be surprised to discover they're not.
14:51:23 I don't know enough about how the xml is generated
14:51:42 maybe if you have an instance in place it needs some ids to stay the same ?
14:51:48 Ok, was hoping maybe somebody might know off the top of their head.
14:51:51 mdbooth: Very nice, just noticed your intention to write up the different block device data structures in Nova.
14:52:14 mdbooth, also noticed you nearly lost it this morning :)
14:52:37 PaulMurray: Hehe, cursing bad naming?
14:53:02 Anyway.... anything else for the last couple of minutes ?
14:53:06 yes
14:53:12 Although not on the agenda either
14:53:19 While working on the post-copy patches I discovered (what I think is) a bug that may trigger live-migration aborts too early
14:53:25 https://bugs.launchpad.net/nova/+bug/1591240
14:53:25 Launchpad bug 1591240 in OpenStack Compute (nova) "progress_watermark is not updated" [Undecided,New]
14:54:14 luis5tb, are you planning to do the fix ?
14:54:24 sure
14:55:01 it just needs an extra condition in the if condition
14:55:20 cool
14:55:37 once it is confirmed I'll do that
14:56:09 I think that is the part of the code that danpb is moving out of the driver
14:56:40 try having a quick chat with him in the nova channel
14:56:50 if it's clear what is wrong you can just do it
14:57:07 yep, maybe he can just fix it at the same time he commits his patch
14:57:17 ahh, ok, either works
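[Editor's note: going only by the bug title and luis5tb's description, the monitor keeps a progress_watermark of the lowest data_remaining seen and aborts when it stops dropping, so the "extra condition" would rebase the watermark when a new copy iteration restarts with more data. A hedged sketch of one plausible shape for the fix, not the merged patch; field and variable names mirror libvirt's job info but are illustrative:]

```python
def update_watermark(info, now, progress_watermark, progress_time):
    """Return the updated (progress_watermark, progress_time) pair.

    Sketch of bug 1591240: with only the first branch, the watermark can
    only ever go down, so once a new memory copy iteration starts with more
    data, real progress never beats the stale low-water mark and the
    migration can be aborted as "stalled" too early.
    """
    if progress_watermark is None or info.data_remaining < progress_watermark:
        # Remaining data shrank: record the new low and restart the stall timer.
        return info.data_remaining, now
    if info.data_remaining > progress_watermark:
        # The guessed extra condition: a new copy iteration began with more
        # data, so rebase the watermark instead of treating every later
        # sample as "no progress".
        return info.data_remaining, progress_time
    return progress_watermark, progress_time
```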
14:57:41 time to end
14:57:50 thanks everyone for coming
14:57:57 #endmeeting