14:02:50 <PaulMurray> #startmeeting Nova Live Migration
14:02:51 <openstack> Meeting started Tue Jun 14 14:02:50 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:52 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:54 <openstack> The meeting name has been set to 'nova_live_migration'
14:03:06 <davidgiluk> o/
14:03:06 * andrearosa was not late at all
14:03:07 <pkoniszewski> o/
14:03:07 <andreas_s> hi
14:03:09 <mdbooth> o/
14:03:17 <luis5tb> o/
14:03:26 <PaulMurray> Sorry guys - I got "reset by peer" at exactly 15:00:52
14:03:36 * kashyap waves
14:03:57 <PaulMurray> remembering where I am....
14:04:01 <mriedem> o/
14:04:09 <PaulMurray> the agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:04:15 <PaulMurray> ...is relatively empty
14:04:30 <PaulMurray> #topic CI
14:04:43 <PaulMurray> Does anyone have an update on CI ?
14:05:01 <johnthetubaguy> do we know who is working on this right now, maybe tdurakov?
14:05:07 <diana_clarke> o/
14:05:16 <tdurakov> hey
14:05:17 <PaulMurray> last week mriedem was helping out
14:05:25 <tdurakov> missed notification
14:05:32 <PaulMurray> and tdurakov usually is
14:05:42 <woodster_> o/
14:05:52 <PaulMurray> tdurakov, we're on CI
14:05:57 <PaulMurray> do you have an update ?
14:06:32 <PaulMurray> last week we were talking about post-config and devstack-gate
14:06:46 <mriedem> NFS and ceph setup are busted on xenial
14:06:58 <mriedem> https://review.openstack.org/#/c/327886/
14:07:05 <mriedem> i tried skipping NFS but it failed in setting up ceph
14:07:13 <mriedem> looks like probably an issue with systemd on 16.04
14:07:45 <johnthetubaguy> but the stuff before the shared storage is kinda working now?
14:07:51 <mriedem> i don't have a 16.04 devstack vm available to troubleshoot the scripts
14:08:00 <mriedem> yes, scenario #1 is working
14:08:12 <johnthetubaguy> has the instability gone, do we know?
14:08:16 <mriedem> so i guess our option is to skip nfs and ceph (scenarios 2-4)
14:08:23 <mriedem> i don't know that yet
14:08:27 <mriedem> until we can get the job passing
14:08:33 <mriedem> i have a bug open for the NFS setup issue
14:08:38 <johnthetubaguy> right, that's what I wondered
14:08:45 <mriedem> i might as well open one for ceph, and then skip both until someone fixes them
14:08:49 <johnthetubaguy> it's very tempting to get *something* passing soon to check that
14:08:56 <mriedem> yeah, agree
14:09:03 <mriedem> i can update that patch to skip ceph also
14:09:07 <johnthetubaguy> cool, ping me if those patches appear
14:09:11 <johnthetubaguy> awesome
14:09:29 <PaulMurray> thanks mriedem
14:09:35 <tdurakov> johnthetubaguy, mriedem nfs potential fix https://review.openstack.org/#/c/329466/
14:10:29 <mriedem> tdurakov: ok i'll keep an eye on that
14:10:46 * johnthetubaguy nods
14:10:46 <PaulMurray> Anything else happening in CI ?
14:11:21 <PaulMurray> moving on
14:11:33 <PaulMurray> #topic Libvirt Storage Pools
14:11:42 <PaulMurray> mdbooth, ?
14:11:57 <mriedem> i have an item in our kanban this week to focus on reviewing the imagebackend refactor series
14:12:04 <mdbooth> diana_clarke has a series posted
14:12:06 <mriedem> so, harassing me to review those this week is fair game :)
14:12:18 <mdbooth> It's all green except for virtuozzo
14:12:25 <mdbooth> Looks like we might have a real bug there
14:12:56 <PaulMurray> will it block you ?
14:13:08 <mdbooth> We don't currently see any blockers
14:13:20 <mdbooth> Actively looking for review there
14:13:27 <mriedem> has anyone talked to mnestratov?
14:13:53 <mdbooth> As previously noted, we don't have coverage of the flat backend yet, but that's in hand
14:14:03 <mdbooth> As I understand it, anyway
14:14:48 <mriedem> not really
14:14:53 <mriedem> depends on post-config working in devstack-gate
14:15:12 <mriedem> which is not sounding like a priority for the QA team, and i haven't been trying to help out there, and don't really want to keep bugging sdague about it
14:15:20 <mdbooth> If that stalls, what's the impact on us?
14:15:30 <mdbooth> Would we make that a blocker?
14:15:51 <sdague> mriedem: I've been working through it, it's close
14:16:06 <mriedem> mdbooth: i don't think we'd make it a blocker no
14:16:08 <mdbooth> Could we, perhaps, have a certain level of confidence by running tempest against it manually?
14:16:17 <mdbooth> It's not as if we're reducing test coverage
14:16:17 <mriedem> mdbooth: was just going to say that
14:16:37 <sdague> https://review.openstack.org/#/c/326585/ passes everything but multinode, but it could use more hands to move it faster
14:17:05 <mriedem> sdague: yikes i need to drop those kilo jobs :)
14:17:29 <sdague> mriedem: heh, yeh, there will be stable backports to devstack / grenade to make it all work completely in the end
14:17:30 <mdbooth> diana_clarke is currently busy, but I wonder if she's up for setting up a private tempest environment for the flat backend
14:17:38 <mdbooth> Would you take our word on it that it passed?
14:18:05 <mriedem> i trust no one
14:18:12 <mriedem> j/k, sort of
14:18:16 <mdbooth> Understandable, and very wise
14:18:27 <mriedem> honestly i wouldn't block on that unless i personally found the time to test it out too
14:18:30 <mriedem> which probably won't happen
14:18:47 <sdague> I do think if we aren't regressing test coverage we should move forward, and just know what we need to get in place before the end of cycle to feel confident
14:18:57 <mdbooth> Yeah. I'd prefer to see it pass at least once, but we already risk breaking it every time we touch it anyway.
14:19:07 <mriedem> i think there are ways i could test this out upstream too
14:19:17 <mriedem> with some WIP patches to d-g
14:19:21 <mriedem> we've done that before
14:19:26 <sdague> mriedem: is this just needing post config to flip a nova bit?
14:19:50 <sdague> we can definitely do one off tests for that in upstream
14:19:51 <mriedem> sdague: yeah, but honestly i could just do a WIP devstack change that sets images_type=flat and depends on the top of the refactor series
14:19:56 <sdague> mriedem: yeh
14:19:57 <mriedem> when that passes we throw the patch away
14:20:03 <mriedem> same as glance v2
14:20:13 <mriedem> mdbooth: so i'll just do that
14:20:16 <sdague> right, I think that's a completely acceptable approach
14:20:21 <mdbooth> mriedem: Awesome, thanks.
14:20:35 <mriedem> #action write one-off devstack change to test image backend refactor series with flat type
14:20:38 <sdague> until we get the rest of the test infrastructure retooled
14:20:41 <mriedem> #undo
14:20:48 <mdbooth> mriedem: Will it require mriedem's magic key, or if you document the hack could we rerun it?
14:20:49 <mriedem> #action mriedem to write one-off devstack change to test image backend refactor series with flat type
14:21:04 <mriedem> mdbooth: it's pretty simple, i'll just add you to the review when it's up
14:21:15 <mdbooth> mriedem: diana_clarke too, please
14:21:15 <mriedem> just set a flag in devstack and add the depends-on in the commit message
14:21:18 <mriedem> sure
14:21:53 <PaulMurray> Thanks mriedem mdbooth
14:22:04 <PaulMurray> are we done with that ?
14:22:08 <mriedem> yeah
14:22:15 <mdbooth> yup
14:22:31 <PaulMurray> I know it comes after but I always worry about plans for the storage pools part
14:22:40 <PaulMurray> as the specs haven't changed recently
14:22:54 <PaulMurray> paul-carlton2, are you there ?
14:23:19 <andrearosa> PaulMurray: I think he is not available for the meeting today
14:23:26 <PaulMurray> Maybe I can catch up with him later
14:23:36 <PaulMurray> thanks andrearosa
14:23:49 <PaulMurray> #topic Review request
14:24:15 <pkoniszewski> I put all three items there for review
14:24:19 <PaulMurray> there are a few on the list
14:24:34 <PaulMurray> pkoniszewski, anything to say about them ?
14:24:46 <PaulMurray> https://review.openstack.org/#/c/234659/
14:24:47 <pkoniszewski> yeah, just short comments
14:24:58 <PaulMurray> https://review.openstack.org/#/c/328910/
14:25:06 <PaulMurray> https://review.openstack.org/#/q/topic:bp/libvirt-clean-driver
14:25:35 <pkoniszewski> first of all, the patch that fixes tunnelled block live migration is merged into master; the backport to mitaka was conflicting, so i'd ask for some eyes there as i don't want to miss anything - https://review.openstack.org/#/c/328910/
14:26:12 <pkoniszewski> given that we are broken out of the box in mitaka...
14:26:56 <pkoniszewski> second thing is that we are a bit stuck with luis5tb implementing automatic live migration completion, especially the post-copy bits, because the live migration monitor is super complex and it starts to break pep8
14:27:32 <pkoniszewski> danpb proposed a series of patches to split the monitor into methods, so before we move forward with automatic live migration completion we need to have the monitor split https://review.openstack.org/#/q/topic:bp/libvirt-clean-driver
14:28:03 <pkoniszewski> and the last is that i finally managed to make live migration work with an iso9660 config drive, patch is up for review - https://review.openstack.org/#/c/234659/
14:28:04 <pkoniszewski> thats all
14:28:06 <pkoniszewski> thanks :)
14:28:32 <PaulMurray> are they on the review page ?
14:28:39 <luis5tb> ok, I'll rebase the post-copy patches to include danpb patch on live migration monitoring
14:28:40 <pkoniszewski> not yet
14:28:46 <pkoniszewski> i'll put all of them
14:28:48 * mriedem runs to another meeting
14:28:56 <PaulMurray> thanks
14:29:41 <PaulMurray> Anyone have any other reviews to mention
14:29:43 <PaulMurray> ?
14:29:57 * mdbooth has some, incidental to some AOB
14:30:25 <PaulMurray> moving on
14:30:37 <PaulMurray> #topic Open Discussion
14:30:46 <PaulMurray> I had one thing on the agenda
14:31:01 <PaulMurray> related to: http://lists.openstack.org/pipermail/openstack-dev/2016-June/097016.html
14:31:03 * mdbooth has a thing, but didn't put it on the agenda :/
14:31:16 * PaulMurray has it in mind
14:31:44 <PaulMurray> That thread is about sync issues in live migration functions
14:32:12 <PaulMurray> it reminded me that we had a situation where pre_live_migration was running at the same time as rollback
14:32:17 <PaulMurray> due to RPC timeouts
14:32:33 <mdbooth> PaulMurray: I think that's a scheduler thing. live migration is just a trigger.
14:32:52 <PaulMurray> mdbooth, well the point is that there are other sync issues
14:32:54 * pkoniszewski goes offline
14:33:01 <PaulMurray> besides the resource counting
14:33:28 <PaulMurray> I wanted to check if that was being looked at
14:33:28 <mdbooth> So, forgive my high level ignorance here. Is resource tracker co-located with the scheduler?
14:33:36 <PaulMurray> no
14:33:36 <mdbooth> i.e. it's not running on the compute host, right
14:33:43 <PaulMurray> its on the compute host
14:33:44 <mdbooth> It *is* running on the compute host?
14:33:52 <mdbooth> Oh.... in that case it might be interesting
14:34:12 <PaulMurray> resource tracker is supposed to be the source of truth for resource consumption
14:34:20 <PaulMurray> but it does refresh
14:34:31 <mdbooth> Sources of truth shouldn't need refreshes
14:34:55 <PaulMurray> so I guess the hypervisor is really the source of truth for resources
14:35:03 <PaulMurray> and the DB is the source for their consumption
14:35:10 <PaulMurray> based on what instances are supposed to be on the host
14:35:24 <PaulMurray> the resource tracker just calculates from there
14:36:09 <PaulMurray> there has been a debate about moving the tracking to the scheduler instead
14:36:26 <PaulMurray> but that has not been done
14:36:36 * mdbooth doesn't understand how info from the resource tracker is consumed by the scheduler.
14:36:54 <mdbooth> It sounds to me like the symptom is that the scheduler is sending an instance somewhere which can't handle it.
14:37:13 <PaulMurray> that should be ok
14:37:32 <PaulMurray> because the compute manager checks that it fits when it accepts it
14:37:50 <PaulMurray> ...and sends it back for a reschedule if it doesn't
14:38:48 <PaulMurray> My concern was that the migration functions don't have synchronization
14:39:00 <PaulMurray> like the boot/delete etc. do
14:39:20 <PaulMurray> so you can do more than one operation on the instance at a time by mistake
14:40:07 <PaulMurray> is tdurakov still here ?
14:40:42 <tdurakov> yes
14:40:50 <PaulMurray> tdurakov, this all seems relevant to the refactor you are doing
14:41:03 <tdurakov> race discussed on ml?
14:41:27 <PaulMurray> more sync in general
14:41:38 <PaulMurray> between live migration functions and other operations
14:42:11 <PaulMurray> tdurakov, or do you think it's a separate issue
14:43:28 <tdurakov> well, need to look closely, but briefly yes, it's connected
14:43:42 <PaulMurray> ok - we can talk later
14:44:04 * tdurakov starred the ml thread, will respond later
14:44:17 <PaulMurray> mdbooth, do you want to do your item now ?
14:44:42 <mdbooth> Yeah, I'm looking at trying to consolidate driver._create_image and driver._create_images_and_backing
14:44:51 <mdbooth> They both do very similar things slightly differently
14:45:34 <mdbooth> The critical difference I'm interested in is that _create_images_and_backing uses libvirt xml as source of truth
14:45:43 <mdbooth> _create_image uses bdms/instance
14:45:54 <mdbooth> live migration uses the former
14:46:05 <mdbooth> Can anybody think of any reason those 2 might get out of whack?
14:46:54 <mdbooth> In the process of investigating this, I'm intending to write up a ml post describing the various different block device data structures, what they are and where they come from
14:47:23 <PaulMurray> Don't know
14:47:37 <mdbooth> Couple of cleanup patches I wrote while digging: https://review.openstack.org/#/c/329366/ https://review.openstack.org/#/c/329381/
14:47:42 <PaulMurray> I think the xml is not really supposed to be source of truth ?
14:47:42 <mdbooth> the latter is pretty egregious
14:48:03 <mdbooth> PaulMurray: That's also my understanding, looking for edge cases
14:48:59 <PaulMurray> Do you know why it takes xml as truth ?
14:49:06 <mdbooth> PaulMurray: No.
14:49:21 <mdbooth> Which makes me slightly nervous.
14:49:34 <PaulMurray> paul-carlton2, realised something similar with live migrate rescued
14:50:12 <PaulMurray> rescue keeps the old xml so it can unrescue, but you can just do a hard reboot to rebuild the whole state from scratch instead
14:50:20 <PaulMurray> so no real need for the old xml
14:50:22 <mdbooth> Right
14:50:30 <mdbooth> They *should* be interchangeable
14:50:47 <mdbooth> And afaik they are, but I wouldn't be surprised to discover they're not.
14:51:23 <PaulMurray> I don't know enough about how the xml is generated
14:51:42 <PaulMurray> maybe if you have an instance in place it needs some ids to stay the same ?
14:51:48 <mdbooth> Ok, was hoping maybe somebody might know off the top of their head.
14:51:51 <kashyap> mdbooth: Very nice, just noticed your intention to write-up on different block device data structures in Nova.
14:52:14 <PaulMurray> mdbooth, also noticed you nearly lost it this morning :)
14:52:37 <mdbooth> PaulMurray: Hehe, cursing bad naming?
14:53:02 <PaulMurray> Anyway.... anything else for last couple of minutes ?
14:53:06 <luis5tb> yes
14:53:12 <luis5tb> although it's not on the agenda either
14:53:19 <luis5tb> While working on the post-copy patches I discovered (what I think is) a bug that may cause live migrations to abort too early
14:53:25 <luis5tb> https://bugs.launchpad.net/nova/+bug/1591240
14:53:25 <openstack> Launchpad bug 1591240 in OpenStack Compute (nova) "progress_watermark is not updated" [Undecided,New]
14:54:14 <PaulMurray> luis5tb, are you planning to do the fix ?
14:54:24 <luis5tb> sure
14:55:01 <luis5tb> it just requires including an extra condition in the if statement
14:55:20 <PaulMurray> cool
14:55:37 <luis5tb> once it is confirmed I'll do that
14:56:09 <luis5tb> I think it is the part of the code that danpb is moving out of the driver
14:56:40 <PaulMurray> try having a quick chat with him in the nova channel
14:56:50 <PaulMurray> if it's clear what is wrong you can just do it
14:57:07 <luis5tb> yep, maybe he can just fix it at the same time he commits his patch
14:57:17 <luis5tb> ahh, ok, either works
14:57:41 <PaulMurray> time to end
14:57:50 <PaulMurray> thanks everyone for coming
14:57:57 <PaulMurray> #endmeeting