14:01:02 #startmeeting Nova Live Migration
14:01:05 Meeting started Tue May 17 14:01:02 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:09 The meeting name has been set to 'nova_live_migration'
14:01:15 o/
14:01:17 o/
14:01:22 o/
14:01:28 Hi all, just do a ping on other channel
14:01:30 \o
14:01:31 * johnthetubaguy lurks with mild intent
14:01:37 hi
14:01:43 hi
14:01:44 o/
14:01:48 o/
14:01:58 o/
14:02:06 I normally do that in advance but got a little delayed
14:02:28 agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:46 o/
14:03:00 * kashyap waves
14:03:12 #topic CI
14:03:24 jumping straight in
14:03:30 o/
14:03:35 we had a couple of actions from last week
14:03:56 tdurakov, did you get anything from markus or tony ?
14:04:06 the action was: tdurakov to follow up with markus_z/tonyb about the trunk libvirt/qemu repo
14:04:41 don't see a wave from tdurakov - maybe he will turn up in a minute
14:04:47 tdurakov was also looking at using xenial nodes
14:05:00 i think he had a patch up actually
14:05:05 * mriedem looks
14:05:27 https://review.openstack.org/#/c/314636/
14:05:55 according to that, gate-tempest-dsvm-multinode-live-migration should be running on xenial nodes now
14:05:56 ah good, do you know if its made any difference ?
14:06:12 i haven't seen that switch yet, but i know where to look
14:06:34 (For those not familiar, "xenial" == Ubuntu 16.04 Long Term Support (LTS))
14:06:40 http://logs.openstack.org/03/315703/1/experimental/gate-tempest-dsvm-multinode-live-migration/1288a98/
14:06:58 http://logs.openstack.org/03/315703/1/experimental/gate-tempest-dsvm-multinode-live-migration/1288a98/logs/dpkg-l.txt.gz
14:07:06 libvirt 1.3.1
14:07:13 qemu 2.5
14:07:31 that's more recent than we need, so good
14:07:34 kashyap: Thanks, I was assuming it was hypervisor related
14:07:35 and considering https://review.openstack.org/#/c/315703/ no it doesn't appear to have helped with stability
14:07:48 mdbooth: Me too. I thought it was a play on 'Xen'
14:08:13 in that change, the job has failed 3 times
14:08:20 before it ever gets to setup ceph
14:08:21 :(
14:08:29 so the default config is failing testes
14:08:31 *tests
14:08:56 http://logs.openstack.org/03/315703/1/experimental/gate-tempest-dsvm-multinode-live-migration/1288a98/logs/screen-n-cpu.txt.gz?level=TRACE#_2016-05-13_21_32_30_216
14:08:58 mriedem: Have you looked at the failures, btw?
14:09:03 so
14:09:03 Unable to find CPU definition: gate64
14:09:10 mriedem: Right, thanks
14:09:10 it looks like the job itself is just busted
14:09:30 there was a lot of tinkering for cpu models for the multinode live migration job on trusty,
14:09:49 switching to xenial didn't magically work so the cpu models will have to be tinkered with again it looks like
14:10:01 mriedem: I recall clarkb playing with CPU models in Gate Infra
14:10:06 yeah
14:10:54 ok, so something to get into
14:11:02 mriedem: I take it the infrastructure is genuinely heterogeneous?
14:11:20 Specifically this: "Update libvirt cpu map before starting nova " -- https://review.openstack.org/#/c/168407/
14:11:23 mdbooth: yes
14:11:38 mriedem: Ok. Maybe we could just hardcode some lcd?
14:11:55 what's lcd ?
14:12:05 lowest common denominator
14:12:18 :) its always the easy ones
14:12:20 i thought that's what they were doing
14:12:21 It's not as if the cpu is really that important to us
14:12:26 mdbooth: It was hard-coded, if you're referring to CPU features: https://review.openstack.org/#/c/168407/9/tools/cpu_map_update.py
14:12:38 probably an action item for someone to talk to clarkb in infra after the meeting
14:12:49 k
14:13:12 mriedem: I can follow up
14:13:27 I'll make it an action
14:14:00 (Or if someone with better Infra access than me wants to take it up, that's fine too.)
14:14:07 #action kashyap to follow up with clarkb about experimental job running xenial fix (presumed cpu model)
14:14:39 its a good start anyway
14:14:44 https://review.openstack.org/#/c/168407/9/tools/cpu_map_update.py doesn't appear to bomb out at all
14:14:54 i.e. if it can't find what it's looking for i don't see it fail
14:15:59 http://logs.openstack.org/03/315703/1/experimental/gate-tempest-dsvm-multinode-live-migration/1288a98/logs/libvirt/cpu_map.xml
14:16:15
14:16:16 is in there
14:17:02 Hmm, but somehow the CPU def. is lost / not recognized.
14:17:26 it's also in the subnode http://logs.openstack.org/03/315703/1/experimental/gate-tempest-dsvm-multinode-live-migration/1288a98/logs/subnode-2/libvirt/cpu_map.xml
14:18:21 lets not debug this here, we can pick it up outside the meeting
14:18:37 Yeah.
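The remark above that cpu_map_update.py "doesn't appear to bomb out" when it can't find what it's looking for suggests an explicit check that fails loudly. What follows is only a rough sketch of such a check, not the script under review: the function names and the sample XML are hypothetical, and the sample assumes the cpus/arch/model layout used by libvirt's cpu_map.xml in this era. Since the log shows the gate64 definition was present in the file, a check like this would address only the silent-failure observation, not the actual job failure.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a libvirt cpu_map.xml (assumption: the real file
# is far larger and its gate64 definition will differ from this one).
SAMPLE_CPU_MAP = """
<cpus>
  <arch name='x86'>
    <model name='gate64'>
      <vendor name='Intel'/>
      <feature name='sse2'/>
    </model>
  </arch>
</cpus>
"""


def model_defined(cpu_map_xml, arch, model):
    """Return True if <arch name=arch> contains a <model name=model>."""
    root = ET.fromstring(cpu_map_xml)
    for arch_el in root.findall("arch"):
        if arch_el.get("name") != arch:
            continue
        if any(m.get("name") == model for m in arch_el.findall("model")):
            return True
    return False


def require_model(cpu_map_xml, arch, model):
    """Raise instead of silently continuing when the model is absent."""
    if not model_defined(cpu_map_xml, arch, model):
        raise RuntimeError(
            "CPU model %r not defined for arch %r" % (model, arch))
```

Calling require_model(SAMPLE_CPU_MAP, "x86", "gate64") passes quietly, while asking for a model that is not in the map raises RuntimeError, which is the "bomb out" behaviour the script was observed to lack.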
14:18:40 The next action was on mriedem
14:18:43 mriedem to hack up devstack-gate changes to test lvm and raw image backends
14:18:51 I saw a patch
14:19:00 https://review.openstack.org/#/c/316298/
14:19:05 ^ is the lvm experimental queue job
14:19:13 there is a dependency on a devstack change and a nova change
14:19:54 this one isn't merged: https://review.openstack.org/#/c/316295/
14:20:04 yeah i need to update it quick
14:20:07 easy peasy
14:20:09 should be merged today
14:20:22 then just need another +2 on the infra change, which i can wrastle up in -infra later
14:21:09 is it this: https://review.openstack.org/#/c/215929/
14:21:16 that's merged ?
14:21:19 no
14:21:48 oh, you mean your one
14:21:57 yeah the nova change for the blacklist rc file
14:22:04 the file has a .txt extension which i need to drop
14:22:09 oomichi: found that
14:22:12 quick fix
14:22:23 anyway, should have this all merged today so we can use the job
14:22:36 note it's on the experimental queue, so to run it you have to comment with 'check experimental' on your patch
14:22:39 mdbooth: ^
14:23:06 mriedem: Cool, thanks. We're not quite there yet, though. diana_clarke will update.
14:23:15 as for a raw job,
14:23:19 i didn't get that far
14:23:32 last week we talked about maybe switching one of the existing qcow2 jobs to use raw
14:23:49 since everything except ceph and this new lvm job is using qcow2
14:24:10 mriedem: yes, thanks!
14:24:20 i could test that out with a devstack-gate patch too, i think it's just a matter of setting use_cow_images=False right mdbooth?
14:24:44 mriedem: Assuming images_type is defaulted, then yes.
14:26:36 Do you know which job to use?
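The use_cow_images / images_type exchange above can be read as a small decision rule. A minimal sketch of that rule follows, assuming the behaviour is exactly as described in the conversation: the function name is hypothetical and this is not Nova's own code, only the two option names come from the log.

```python
def effective_image_backend(images_type="default", use_cow_images=True):
    """Sketch of how the two options combine, per the exchange above.

    Assumption: an explicit images_type (e.g. "lvm", "rbd") always wins;
    only when it is left at "default" does use_cow_images pick between
    qcow2 and raw.
    """
    if images_type != "default":
        return images_type
    return "qcow2" if use_cow_images else "raw"
```

Under this reading, an existing qcow2 job only needs use_cow_images=False to exercise the raw backend, whereas the lvm job is unaffected by that flag because its images_type is set explicitly.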
14:26:50 no
14:26:59 was thinking maybe the postgresql job
14:27:13 that's already an odd duck of a job anyway
14:27:38 :)
14:28:04 you still happy to do it
14:28:16 i can push the d-g patch yeah
14:28:18 and see how it goes
14:28:24 lots of meetings today though
14:28:47 #action mriedem to change an existing job to use raw instead of qcow2
14:28:54 that's life at the top
14:29:17 Anything else needed for CI today ?
14:29:38 ok - moving on
14:29:43 #topic Libvirt Storage Pools
14:29:51 any update mdbooth diana_clarke
14:30:10 I've added the new methods (create_from_image & create_from_func) to the following backends: Ploop, Rbd, Flat (aka Raw, aka NoBacking). I'll toss them up for review later today
14:30:39 thanks
14:30:44 I believe Matt is auditing the BDM object usage in libvirt/driver.py in preparation for adding the driver_info to the BDM object. Please correct me if I'm wrong, Matt.
14:31:21 Yup. It's not quite as clear cut as I'd hoped. I need to draw some pictures to work out how BDM info gets from compute/manager into the driver
14:31:52 If anybody here is intimately familiar with that process and has some time I'd love to pick your brains, btw
14:32:18 mdbooth, there's a wiki or doc page somewhere for bdm - if you discover something not on there it would be good to update it
14:32:41 mdbooth, or tell me and I'll update it
14:32:51 mdbooth: i know some of it
14:32:59 My specs for storage pools and libvirt migration are awaiting review https://review.openstack.org/#/c/310505/ and https://review.openstack.org/#/c/310538/
14:33:08 mdbooth: it's all wrapped up in the mystical magical 3 different block_device.py modules in nova
14:33:19 you might have a dict, you might have an object, you might have a dict that wraps an object
14:33:30 mriedem: Yeah, it's the relationship between those which isn't yet clear to me.
14:33:55 * mdbooth suspects that these days you probably always have an object, and that the code is unnecessarily crufty
14:34:09 probably true
14:34:13 or its close
14:34:50 Was it nikola who was doing that
14:34:56 Originally, yes
14:35:03 yeah, some of the cruft is from the legacy bdm stuff
14:35:07 converting v1 bdms to v2
14:35:17 so there is a lot of facade stuff and wrapping
14:35:26 I remember he was the only one who knew some of this stuff
14:35:37 i've fixed some bugs in some of it
14:35:46 while we're deprecating apis, maybe we should deprecated bdm v1
14:35:51 *deprecate
14:36:44 getting a bit off topic now
14:36:46 mriedem: Yeah. It's weird that we're not converting those at the api layer.
14:36:51 PaulMurray: Indeed, sorry.
14:37:05 paul-carlton1, mentioned his specs above
14:37:19 would be good to get those sorted
14:37:39 Is there anything else to add on that paul-carlton1
14:37:41 ?
14:38:43 hopefully they speak for themselves but I'm anxious that they don't miss cut for Newton specs
14:38:58 paul-carlton1: When is that?
14:39:01 its priority so there is a bit to go, but yes
14:39:22 n-1
14:39:25 6/2
14:39:31 is non-priority spec approval freeze
14:39:40 end of may I think
14:39:48 we said, however, that libvirt storage pools are a priority
14:39:55 mriedem, 6th of February - great
14:40:02 this is our release schedule https://wiki.openstack.org/wiki/Nova/Newton_Release_Schedule
14:40:10 PaulMurray: 'merican date format only here
14:40:32 priorities https://specs.openstack.org/openstack/nova-specs/priorities/newton-priorities.html#libvirt-storage-pools-live-migration
14:40:35 ok, so no immediate panic!
14:40:49 but better to get it settled anyway
14:41:13 moving on again
14:41:24 #topic Specs
14:41:31 well, sort of moving on
14:41:48 where is the migration force/postcopy/autoconverge/etc spec up to?
14:41:53 the Post-copy and auto-converge merged with auto completion seems close to agreement
14:42:23 * PaulMurray ah, trying to cut paste
14:42:24 mriedem, when is priority feature freeze, I don't see it on wiki page
14:42:37 https://review.openstack.org/#/c/306561 - Automatic Live Migration Completion
14:42:40 paul-carlton1: same as normal FF
14:42:50 paul-carlton1: sept 2nd
14:43:32 That spec did have a +2 from danpb
14:43:37 so isn't switching to post-copy automated somehow?
14:43:42 ta
14:43:54 johnthetubaguy, nope
14:43:58 johnthetubaguy, no, we have to tell it to do it
14:44:11 johnthetubaguy, so the spec talks about when and how aggressive to be
14:44:20 in monitoring thread
14:44:24 gotcha
14:44:52 It also adds a --force-complete flag to live migrate API
14:44:53 I just don't like config options change the API semantics, but I kinda get why we want that here, I just need some theory about that first
14:45:32 johnthetubaguy: Automated by Nova.
14:46:06 johnthetubaguy, the idea is always use post-copy if you can
14:46:19 but allow ops to disable it if they really want
14:46:28 so default is to have it
14:47:01 if available, post-copy is a recent feature
14:47:36 This spec is not priority so we could do with reviews
14:47:48 hmm, I thought we said post-copy was dangerous, but I guess its optional
14:47:51 at earliest convenience
14:48:23 johnthetubaguy: I think the 'dangerous' moniker is largely misinformation.
14:48:57 mdbooth, I think the exposure is small, but the main problem is its new
14:49:06 johnthetubaguy, there are risks of instance failing and needing a reboot if network goes down but networks are pretty reliable and a reboot is not end of the world
14:49:09 and in production we find new = don't always work
14:49:13 PaulMurray: Always apropos bugs.
14:49:51 the alternative may be a suspend or auto-converge which is potentially worse
14:50:06 Any other specs to mention?
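The policy stated above ("always use post-copy if you can, but allow ops to disable it") can be illustrated with a small selection function. This is a hypothetical sketch only: the function and parameter names are invented for illustration, the fallback order to auto-converge is an assumption drawn from the discussion, and it does not model the spec's monitoring thread or its rules for when and how aggressively to switch.

```python
def choose_completion_strategy(postcopy_available,
                               postcopy_permitted=True,
                               autoconverge_available=False):
    """Pick how a live migration should be driven to completion.

    Assumptions (not taken from the spec): post-copy is preferred whenever
    the hypervisor supports it and the operator has not disabled it;
    auto-converge is the fallback; otherwise plain pre-copy continues.
    """
    if postcopy_available and postcopy_permitted:
        return "post-copy"
    if autoconverge_available:
        return "auto-converge"
    return "pre-copy"
```

The default of postcopy_permitted=True mirrors the "so default is to have it" remark, leaving the operator opt-out as the only way to suppress post-copy where it is available.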
14:50:12 Basically, almost everything which would cause post-copy to fail would also have caused some other failure.
14:50:12 https://review.openstack.org/#/c/307131/
14:50:47 johnthetubaguy, you -1 that before ^^^
14:50:48 Live Migration of rescued instances, I have an implementation ready(ish) for this
14:50:55 I think it covers what you asked for ?
14:51:12 I've updated it since speaking to johnthetubaguy
14:51:39 cool, I should get back to that one and take another look soon
14:51:39 time is moving on
14:51:58 #topic reviews
14:52:08 anything specific in review ?
14:52:20 hi
14:52:24 https://review.openstack.org/#/c/215483/ - Set migration status to 'error' on live-migration failure
14:53:01 I think that one is there now - just needs core review
14:53:23 yes, please do the needful when get time:)
14:53:35 not sure i can handle an 8 LOC patch
14:53:53 I'll ping the cores that had an opinion on it if they don't respond soon
14:54:15 #topic Open Discussion
14:54:18 thank you PaulMurray
14:54:30 a few minutes for any other business
14:55:34 I guess we're done
14:55:39 thank you all for coming
14:55:46 please do some reviews for others
14:55:59 I guess the etherpad is all up to date right?
14:56:05 #action all review sub team patches
14:56:17 the sub team section on the review tracking page is
14:56:27 I need to clean up our own page
14:56:46 but for now the https://etherpad.openstack.org/p/newton-nova-priorities-tracking is the place to go
14:56:55 * mdbooth is getting substantive review on the bottom of the instance storage patch stack, btw
14:57:22 good mdbooth
14:57:25 bye
14:57:25 which is awesome
14:57:31 #endmeeting