14:02:50 <PaulMurray> #startmeeting Nova Live Migration
14:02:51 <openstack> Meeting started Tue Jun 14 14:02:50 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:52 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:54 <openstack> The meeting name has been set to 'nova_live_migration'
14:03:06 <davidgiluk> o/
14:03:06 * andrearosa was not late at all
14:03:07 <pkoniszewski> o/
14:03:07 <andreas_s> hi
14:03:09 <mdbooth> o/
14:03:17 <luis5tb> o/
14:03:26 <PaulMurray> Sorry guys - I got "reset by peer" at exactly 15:00:52
14:03:36 * kashyap waves
14:03:57 <PaulMurray> remembering where I am....
14:04:01 <mriedem> o/
14:04:09 <PaulMurray> the agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:04:15 <PaulMurray> ...is relatively empty
14:04:30 <PaulMurray> #topic CI
14:04:43 <PaulMurray> Does anyone have an update on CI ?
14:05:01 <johnthetubaguy> do we know who is working on this right now, maybe tdurakov?
14:05:07 <diana_clarke> o/
14:05:16 <tdurakov> hey
14:05:17 <PaulMurray> last week mriedem was helping out
14:05:25 <tdurakov> missed notification
14:05:32 <PaulMurray> and tdurakov is usually
14:05:42 <woodster_> o/
14:05:52 <PaulMurray> tdurakov, we're on CI
14:05:57 <PaulMurray> do you have an update ?
14:06:32 <PaulMurray> last week we were talking about post-config and devstack-gate
14:06:46 <mriedem> NFS and ceph setup are busted on xeial
14:06:48 <mriedem> *xenial
14:06:58 <mriedem> https://review.openstack.org/#/c/327886/
14:07:05 <mriedem> i tried skipping NFS but it failed in setting up ceph
14:07:13 <mriedem> looks like probably an issue with systemd on 16.04
14:07:45 <johnthetubaguy> but the stuff before the shared storage is kinda working now?
14:07:51 <mriedem> i don't have a 16.04 devstack vm available to troubleshoot the scripts
14:08:00 <mriedem> yes, scenario #1 is working
14:08:12 <johnthetubaguy> has the instability gone, do we know?
14:08:16 <mriedem> so i guess our option is skip nfs and ceph (2-4)
14:08:23 <mriedem> i don't know that yet
14:08:27 <mriedem> until we can get the job passing
14:08:33 <mriedem> i have a bug open for the NFS setup issue
14:08:38 <johnthetubaguy> right, thats what I wondered
14:08:45 <mriedem> i might as well open one for ceph, and then skip both until someone fixes them
14:08:49 <johnthetubaguy> its very tempting to get *something* passing soon to check that
14:08:56 <mriedem> yeah, agree
14:09:03 <mriedem> i can update that patch to skip ceph also
14:09:07 <johnthetubaguy> cool, ping me if those patches appear
14:09:11 <johnthetubaguy> awesome
14:09:29 <PaulMurray> thanks mriedem
14:09:35 <tdurakov> johnthetubaguy, mriedem nfs potential fix https://review.openstack.org/#/c/329466/
14:10:29 <mriedem> tdurakov: ok i'll keep an eye on that
14:10:46 * johnthetubaguy nods
14:10:46 <PaulMurray> Anything else happening in CI ?
14:11:21 <PaulMurray> moving on
14:11:33 <PaulMurray> #topic Libvirt Storage Pools
14:11:42 <PaulMurray> mdbooth, ?
14:11:57 <mriedem> i have an item in our kanban this week to focus on reviewing the imagebackend refactor series
14:12:04 <mdbooth> diana_clarke has a series posted
14:12:06 <mriedem> so, harassing me to review those this week is fair game :)
14:12:18 <mdbooth> It's all green except for virtuozzo
14:12:25 <mdbooth> Looks like we might have a real bug there
14:12:56 <PaulMurray> will it block you ?
14:13:08 <mdbooth> We don't currently see any blockers
14:13:20 <mdbooth> Actively looking for review there
14:13:27 <mriedem> has anyone talked to mnestratov?
14:13:53 <mdbooth> As previously noted, we don't have coverage of the flat backend yet, but that's in hand
14:14:03 <mdbooth> I understand, anyway?
14:14:48 <mriedem> not really
14:14:53 <mriedem> depends on post-config working in devstack-gate
14:15:12 <mriedem> which is not sounding like a priority for the QA team, and i haven't been trying to help out there, and don't really want to keep bugging sdague about it
14:15:20 <mdbooth> If that stalls, what's the impact on us?
14:15:30 <mdbooth> Would we make that a blocker?
14:15:51 <sdague> mriedem: I've been working through it, it's close
14:16:06 <mriedem> mdbooth: i don't think we'd make it a blocker no
14:16:08 <mdbooth> Could we, perhaps, have a certain level of confidence by running tempest against it manually?
14:16:17 <mdbooth> It's not as if we're reducing test coverage
14:16:17 <mriedem> mdbooth: was just going to say that
14:16:37 <sdague> https://review.openstack.org/#/c/326585/ passes everything but multinode, but it could use more hands to move it faster
14:17:05 <mriedem> sdague: yikes i need to drop those kilo jobs :)
14:17:29 <sdague> mriedem: heh, yeh, there will be stable backports to devstack / grenade to make it all work completely in the end
14:17:30 <mdbooth> diana_clarke is currently busy, but I wonder if she's up for setting up a private tempest environment for the flat backend
14:17:38 <mdbooth> Would you take our word on it that it passed?
14:18:05 <mriedem> i trust no one
14:18:12 <mriedem> j/k, sort of
14:18:16 <mdbooth> Understandable, and very wise
14:18:27 <mriedem> honestly i wouldn't block on that unless i personally found the time to test it out too
14:18:30 <mriedem> which probably won't happen
14:18:47 <sdague> I do think if we aren't regressing test coverage we should move forward, and just know what we need to get in place before the end of cycle to feel confident
14:18:57 <mdbooth> Yeah. I'd prefer to see it pass at least once, but we already risk breaking it every time we touch it anyway.
14:19:07 <mriedem> i think there are ways i could test this out upstream too
14:19:17 <mriedem> with some WIP patches to d-g
14:19:21 <mriedem> we've done that before
14:19:26 <sdague> mriedem: is this just needing post config to flip a nova bit?
14:19:50 <sdague> we can definitely do one off tests for that in upstream
14:19:51 <mriedem> sdague: yeah, but honestly i could just do a WIP devstack change that sets the images_type=flat and depend on the top of the refactor series
14:19:56 <sdague> mriedem: yeh
14:19:57 <mriedem> when that passes we throw the patch away
14:20:03 <mriedem> same as glance v2
14:20:13 <mriedem> mdbooth: so i'll just do that
14:20:16 <sdague> right, I think that's a completely acceptable approach
14:20:21 <mdbooth> mriedem: Awesome, thanks.
14:20:35 <mriedem> #action write one off devstack change to test image backend refactor series with flat type
14:20:38 <sdague> until we get the rest of the test infrastructure retooled
14:20:41 <mriedem> #undo
14:20:48 <mdbooth> mriedem: Will it require mriedem's magic key, or if you document the hack could we rerun it?
14:20:49 <mriedem> #action mriedem to write one off devstack change to test image backend refactor series with flat type
14:21:04 <mriedem> mdbooth: it's pretty simple, i'll just add you to the review when it's up
14:21:15 <mdbooth> mriedem: diana_clarke too, please
14:21:15 <mriedem> just set a flag in devstack and add the depends-on in the commit message
14:21:18 <mriedem> sure
14:21:53 <PaulMurray> Thanks mriedem mdbooth
14:22:04 <PaulMurray> are we done with that ?
14:22:08 <mriedem> yeah
14:22:15 <mdbooth> yup
14:22:31 <PaulMurray> I know it comes after but I always worry about plans for the storage pools part
14:22:40 <PaulMurray> as the specs haven't changed recently
14:22:54 <PaulMurray> paul-carlton2, are you there ?
14:23:19 <andrearosa> PaulMurray: I think he is not available for the meeting today
14:23:26 <PaulMurray> Maybe I can catch up with him later
14:23:36 <PaulMurray> thanks andrearosa
14:23:49 <PaulMurray> #topic Review request
14:24:15 <pkoniszewski> I put all three items there for review
14:24:19 <PaulMurray> there are a few on the list
14:24:34 <PaulMurray> pkoniszewski, anything to say about them ?
14:24:46 <PaulMurray> https://review.openstack.org/#/c/234659/
14:24:47 <pkoniszewski> yeah, just short comments
14:24:58 <PaulMurray> https://review.openstack.org/#/c/328910/
14:25:06 <PaulMurray> https://review.openstack.org/#/q/topic:bp/libvirt-clean-driver
14:25:35 <pkoniszewski> first of all, the patch that fixes tunnelled block live migration is merged into master, the backport to mitaka was conflicting so i'd ask for some eyes there as i don't want to miss anything - https://review.openstack.org/#/c/328910/
14:26:12 <pkoniszewski> given that we are broken out of the box in mitaka...
14:26:56 <pkoniszewski> second thing is that we are a bit stuck with luis5tb implementing automatic live migration completion, especially the post-copy bits, because the live migration monitor is super complex and it starts to break pep8
14:27:32 <pkoniszewski> danpb proposed a series of patches to split the monitor into methods, so before we move forward with automatic live migration completion we need to have the monitor split https://review.openstack.org/#/q/topic:bp/libvirt-clean-driver
14:28:03 <pkoniszewski> and the last is that i finally managed to make live migration work with an iso9660 config drive, patch is up for review - https://review.openstack.org/#/c/234659/
14:28:04 <pkoniszewski> thats all
14:28:06 <pkoniszewski> thanks :)
14:28:32 <PaulMurray> are they on the review page ?
14:28:39 <luis5tb> ok, I'll rebase the post-copy patches to include danpb's patch on live migration monitoring
14:28:40 <pkoniszewski> not yet
14:28:46 <pkoniszewski> i'll put all of them
14:28:48 * mriedem runs to another meeting
14:28:56 <PaulMurray> thanks
14:29:41 <PaulMurray> Anyone have any other reviews to mention
14:29:43 <PaulMurray> ?
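[Context: the one-off test mriedem took as an action above amounts to flipping a single nova option through devstack and chaining the throwaway change onto the refactor series. A minimal sketch of the two pieces he describes, using devstack's local.conf post-config syntax (the same post-config mechanism discussed under the CI topic); the Change-Id is a placeholder for the top patch of the imagebackend refactor series:

    [[post-config|$NOVA_CONF]]
    [libvirt]
    images_type = flat

    # and in the commit message footer of the throwaway WIP change:
    Depends-On: <Change-Id of the top patch in the imagebackend refactor series>

Once the job passes with this in place, the WIP change is abandoned, as with the glance v2 exercise mriedem compares it to.]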
14:29:57 * mdbooth has some incidental to some aob
14:30:25 <PaulMurray> moving on
14:30:37 <PaulMurray> #topic Open Discussion
14:30:46 <PaulMurray> I had one thing on the agenda
14:31:01 <PaulMurray> related to: http://lists.openstack.org/pipermail/openstack-dev/2016-June/097016.html
14:31:03 * mdbooth has a thing, but didn't put it on the agenda :/
14:31:16 * PaulMurray I've got it in mind
14:31:44 <PaulMurray> That thread is about sync issues in live migration functions
14:32:12 <PaulMurray> it reminded me that we had a situation where pre_live_migration was running at the same time as rollback
14:32:17 <PaulMurray> due to RP timeouts
14:32:25 <PaulMurray> s/RP/RPC/
14:32:33 <mdbooth> PaulMurray: I think that's a scheduler thing. live migration is just a trigger.
14:32:52 <PaulMurray> mdbooth, well the point is that there are other sync issues
14:32:54 * pkoniszewski goes offline
14:33:01 <PaulMurray> besides the resource counting
14:33:28 <PaulMurray> I wanted to check if that was being looked at
14:33:28 <mdbooth> So, forgive my high level ignorance here. Is the resource tracker co-located with the scheduler?
14:33:36 <PaulMurray> no
14:33:36 <mdbooth> i.e. it's not running on the compute host, right
14:33:43 <PaulMurray> its on the compute host
14:33:44 <mdbooth> It *is* running on the compute host?
14:33:52 <mdbooth> Oh.... in that case it might be interesting
14:34:12 <PaulMurray> resource tracker is supposed to be the source of truth for resource consumption
14:34:20 <PaulMurray> but it does refresh
14:34:31 <mdbooth> Sources of truth shouldn't need refreshes
14:34:55 <PaulMurray> so I guess the hypervisor is really the source of truth for resources
14:35:03 <PaulMurray> and DB is the source for their consumption
14:35:10 <PaulMurray> based on what instances are supposed to be on the host
14:35:24 <PaulMurray> the resource tracker just calculates from there
14:36:09 <PaulMurray> there has been a debate about moving the tracking to the scheduler instead
14:36:26 <PaulMurray> but that has not been done
14:36:36 * mdbooth doesn't understand how info from the resource tracker is consumed by the scheduler.
14:36:54 <mdbooth> It sounds to me like the symptom is that the scheduler is sending an instance somewhere which can't handle it.
14:37:13 <PaulMurray> that should be ok
14:37:32 <PaulMurray> because the compute manager checks it fits when it accepts it
14:37:50 <PaulMurray> ...and sends it back for a reschedule if it doesn't
14:38:48 <PaulMurray> My concern was that the migration functions don't have synchronization
14:39:00 <PaulMurray> like boot/delete etc. do
14:39:20 <PaulMurray> so you can do more than one operation on the instance at a time by mistake
14:40:07 <PaulMurray> is tdurakov still here ?
14:40:42 <tdurakov> yes
14:40:50 <PaulMurray> tdurakov, this all seems relevant to the refactor you are doing
14:41:03 <tdurakov> race discussed on ml?
14:41:27 <PaulMurray> more sync in general
14:41:38 <PaulMurray> between live migration functions and other operations
14:42:11 <PaulMurray> tdurakov, or do you think it's a separate issue
14:43:28 <tdurakov> well, need to look closely, but briefly yes, it's connected
14:43:42 <PaulMurray> ok - we can talk later
14:44:04 * tdurakov starred ml, will respond later
14:44:17 <PaulMurray> mdbooth, do you want to do your item now ?
14:44:42 <mdbooth> Yeah, I'm looking at trying to consolidate driver._create_image and driver._create_images_and_backing
14:44:51 <mdbooth> They both do very similar things slightly differently
14:45:34 <mdbooth> The critical difference I'm interested in is that _create_images_and_backing uses libvirt xml as source of truth
14:45:43 <mdbooth> _create_image uses bdms/instance
14:45:54 <mdbooth> live migration uses the former
14:46:05 <mdbooth> Can anybody think of any reason those 2 might get out of whack?
14:46:54 <mdbooth> In the process of investigating this, I'm intending to write up a ml post describing the various different block device data structures, what they are and where they come from
14:47:23 <PaulMurray> Don't know
14:47:37 <mdbooth> Couple of cleanup patches I wrote while digging: https://review.openstack.org/#/c/329366/ https://review.openstack.org/#/c/329381/
14:47:42 <PaulMurray> I think the xml is not really supposed to be source of truth ?
14:47:42 <mdbooth> latter is pretty egregious
14:48:03 <mdbooth> PaulMurray: That's also my understanding, looking for edge cases
14:48:59 <PaulMurray> Do you know why it takes xml as truth ?
14:49:06 <mdbooth> PaulMurray: No.
14:49:21 <mdbooth> Which makes me slightly nervous.
14:49:34 <PaulMurray> paul-carlton2, realised something similar with live migrate rescued
14:50:12 <PaulMurray> rescue keeps the old xml so it can unrescue, but you can just do a hard reboot to rebuild the whole state from scratch instead
14:50:20 <PaulMurray> so no real need for the old xml
14:50:22 <mdbooth> Right
14:50:30 <mdbooth> They *should* be interchangeable
14:50:47 <mdbooth> And afaik they are, but I wouldn't be surprised to discover they're not.
14:51:23 <PaulMurray> I don't know enough about how the xml is generated
14:51:42 <PaulMurray> maybe if you have an instance in place it needs some ids to stay the same ?
14:51:48 <mdbooth> Ok, was hoping maybe somebody might know offhand.
14:51:51 <kashyap> mdbooth: Very nice, just noticed your intention to write up the different block device data structures in Nova.
14:52:14 <PaulMurray> mdbooth, also noticed you nearly lost it this morning :)
14:52:37 <mdbooth> PaulMurray: Hehe, cursing bad naming?
14:53:02 <PaulMurray> Anyway.... anything else for the last couple of minutes ?
14:53:06 <luis5tb> yes
14:53:12 <luis5tb> Although not in the agenda either
14:53:19 <luis5tb> While working on the post-copy patches I discovered (what I think is) a bug that may make live migrations abort too early
14:53:25 <luis5tb> https://bugs.launchpad.net/nova/+bug/1591240
14:53:25 <openstack> Launchpad bug 1591240 in OpenStack Compute (nova) "progress_watermark is not updated" [Undecided,New]
14:54:14 <PaulMurray> luis5tb, are you planning to do the fix ?
14:54:24 <luis5tb> sure
14:55:01 <luis5tb> it is just including an extra condition in the if condition
14:55:20 <PaulMurray> cool
14:55:37 <luis5tb> once it is confirmed I'll do that
14:56:09 <luis5tb> I think it is the part of the code that danpb is moving out of the driver
14:56:40 <PaulMurray> try having a quick chat with him in the nova channel
14:56:50 <PaulMurray> if it's clear what is wrong you can just do it
14:57:07 <luis5tb> yep, maybe he can just fix it at the same time he commits his patch
14:57:17 <luis5tb> ahh, ok, either works
14:57:41 <PaulMurray> time to end
14:57:50 <PaulMurray> thanks everyone for coming
14:57:57 <PaulMurray> #endmeeting
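[Context on bug 1591240: the libvirt driver's live migration monitor (the code danpb's libvirt-clean-driver series splits into methods) tracks the lowest data_remaining value libvirt has reported, the "progress watermark", and the time it last dropped; if the watermark has not moved within the progress timeout, the migration is treated as stalled and aborted. A minimal sketch of that idea, with simplified and partly hypothetical names (check_progress, _MonitorState), not nova's actual code:

    import time

    class _MonitorState:
        """Holds the monitor's progress tracking between iterations."""
        progress_watermark = None  # lowest data_remaining seen so far
        progress_time = 0.0        # when the watermark last dropped

    def check_progress(state, data_remaining, progress_timeout=150):
        """Return True if the migration looks stalled and should abort.

        data_remaining comes from libvirt's migration job stats;
        progress_timeout stands in for nova's configured progress timeout.
        """
        now = time.time()
        if (state.progress_watermark is None or
                state.progress_watermark > data_remaining):
            # The transfer got further than ever before: record progress.
            state.progress_watermark = data_remaining
            state.progress_time = now
        # Bug 1591240: when the guest dirties memory faster than it is
        # copied, data_remaining can grow and stay above the old
        # watermark, so progress_time is never refreshed and a healthy
        # migration is aborted as "stalled". The extra condition luis5tb
        # mentions would presumably also refresh the watermark in that
        # case, though the exact fix was still unconfirmed at this point.
        return (now - state.progress_time) > progress_timeout

Whether that fix lands before or after danpb's split of the monitor is exactly the coordination question raised in the last few minutes above.]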