14:00:06 #startmeeting Nova Live Migration
14:00:07 Meeting started Tue May 31 14:00:06 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:10 The meeting name has been set to 'nova_live_migration'
14:00:25 Hi all
14:00:27 o/
14:00:29 o/
14:01:01 o/
14:01:03 o/
14:01:10 hi
14:01:12 o/
14:01:20 agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:53 Sorry the agenda was updated so late, it has been a holiday here
14:02:04 #topic Specs
14:02:20 hi
14:02:23 It's nearly non-prio spec freeze
14:02:45 So looking for any last minute things to sort out
14:02:55 this is on the agenda
14:03:14 #link https://review.openstack.org/#/c/320849/ : Use finite machine in migration process
14:04:10 PaulMurray: commented on this spec
14:04:12 I don't think tangchen is here
14:04:15 o/
14:04:30 I saw your comment tdurakov
14:04:34 is this the general direction folks here want?
14:04:58 johnthetubaguy: which one?
14:05:02 tdurakov, did you get a chance to look at what Nikola had done before?
14:05:13 the above spec 320849
14:05:26 * mdbooth completely agrees with the concept of a state machine here, but I haven't had a chance to look at this spec in detail
14:05:48 PaulMurray: the patch you've mentioned? yes
14:06:21 I think it was going that way but need tangchen to confirm that's what he meant
14:06:28 if so it looks good to me
14:06:31 what do you think?
14:06:40 it feels like we should focus on tdurakov's changes around the conductor this cycle
14:07:09 the state machine would be cool, but not sure we have agreed the direction on how to use that yet
14:07:26 PaulMurray: if we want to go this way - sure, but from my side we could use state machines for the whole process instead of dealing with statuses only
14:08:12 johnthetubaguy, PaulMurray, I'll start an ML thread for this, so we can discuss it
14:08:29 ok, let's go that way
14:08:47 #action tdurakov to start ML thread about migration state machine
14:08:54 I just think it will be clearer how this helps, after we remove the compute <-> compute bits
14:09:08 johnthetubaguy: +1
14:09:21 let's move on then
14:09:33 I'll comment on the spec about that
14:09:34 * mdbooth still doesn't understand the rationale for removing the compute<->compute bits, but perhaps I ought to read that spec again
14:09:48 :)
14:10:22 any other specs to go over
14:10:35 I ignored ones that had no progress for the last month
14:10:54 are there any more we need at this point?
14:11:00 the postcopy/force finish one looks like it's getting there
14:11:18 I think that's merged now
14:11:21 ?
14:11:29 https://review.openstack.org/#/c/306561/ - this one?
14:11:34 yup, it's merged
14:11:36 PaulMurray: it is
14:11:49 yes, it's good to see that merged
14:12:12 I think we got most of the big ticket items done this morning
14:12:18 just wondering what other bits we have lingering
14:12:20 The only ones that have had progress in the last month either merged or are priority specs
14:12:36 so what are the outstanding priority ones?
14:12:38 i.e. the rest of the storage pools ones
14:12:48 ah, the storage pool ones
14:12:53 for completeness: the use-target-vif spec still requires discussion on the neutron side
14:12:54 https://review.openstack.org/#/c/310505 Use libvirt storage pools (spec)
14:12:55 https://review.openstack.org/#/c/310538/ Migrate libvirt volumes (spec)
14:13:24 oh, those are spec links, I thought they were code links
14:13:37 that's clearer now
14:14:09 Aside: does anybody know Feodor Tersin's nick, or what tz he's in?
14:15:07 the storage pools specs have been getting reviews, but still to be finished
14:15:33 mdbooth: 'ftersin' seems to be his nick
14:15:58 let's move on to the fun stuff then
14:16:01 #topic CI
14:16:19 The gate64 discussion
14:16:31 http://lists.openstack.org/pipermail/openstack-dev/2016-May/095811.html
14:16:36 starts here ^
14:16:54 Looks like we're a bit stuck for now
14:16:57 PaulMurray: Yeah, the current status is:
14:17:28 It is a legit regression in libvirt; _but_, libvirt devs insist that what we're doing (manually updating cpu_map.xml) in DevStack is a hack.
14:17:53 And, nova.conf should be enhanced to let operators enable/disable specific CPU flags.
14:18:06 I'm going to file an upstream RFE for that before this is all forgotten
14:18:29 kashyap: Is the regression going to be fixed?
14:18:53 mdbooth: Jiri (migration dev) said today that it is not trivial at all to fix, but that said, it'll be fixed as part of some upcoming libvirt CPU driver fixes
14:19:02 mdbooth: Which might take a month-ish
14:20:06 kashyap: if it's a hack in devstack, any recommendation on dealing with it the proper way?
14:20:18 See danpb's comment: http://lists.openstack.org/pipermail/openstack-dev/2016-May/096251.html
14:20:24 tdurakov: ^
14:20:52 kashyap: Yeah, that's pretty much what I was thinking
14:21:34 IMHO we shouldn't really be trying to create custom cpu models in the first place
14:21:48 libvirt has explicit support for turning off individual features against a pre-existing cpu model
14:21:51 mdbooth: Do you have any idea how the gate64 set came to be - i.e. why it was core2duo minus those options?
14:21:58 we've just not got that wired up in nova.conf
14:22:33 if we used the proper libvirt support for this, instead of inventing gate64, we would not have hit a problem
14:22:38 davidgiluk: Nope. Guessing trial and error. I doubt specific cpu features are all that important in the gate as long as it works.
14:22:52 danpb: that sounds like it matches the CPU mask I am used to in XenServer to do something similar
14:22:52 danpb: You're referring to the <cpu> element here (with 'custom' mode), right? http://libvirt.org/formatdomain.html
14:23:03 yes
14:23:15 mdbooth: What's odd to me is why you don't just use core2duo - I'm curious why there is something less capable than a core2duo being used
14:23:17 johnthetubaguy: Is there a solution to this problem in devstack for XenServer?
14:23:42 davidgiluk: certain public cloud providers aren't exposing nice cpus to the guest
14:23:59 iirc, in particular there were one or two key feature flags that all qemu models include which were not exposed by some clouds
14:24:07 mdbooth: nope
14:24:32 of course since the CI is apparently using TCG and not KVM, the choice of CPU model ought not to matter at all
14:24:59 so there's also probably another bug in nova where it is mistakenly doing a host/guest CPU comparison for TCG guests, when it should just ignore it
14:25:00 Yeah, that regression bug won't trigger with domain type KVM.
14:25:04 danpb: orly? Could that be the cause of other issues?
14:25:14 i.e. tcg
14:25:27 TCG can expose any CPU model it likes, regardless of whether the host cpu has the same features
14:25:39 since it is 100% cpu emulation with no hardware acceleration
14:26:16 mdbooth: by other issues, do you mean the lockups during live-migrate?
14:26:34 danpb: Although how many of those features are tested in TCG....
14:27:00 johnthetubaguy: For example, or anything tbh. Whenever I've mentioned TCG to KVM folks in the past they've been pretty much: don't do that.
14:27:21 davidgiluk: true, but the point is we shouldn't need to invent custom cpu models at all for TCG - we can use any cpu model we want that exists
14:27:37 eg just use the default qemu64
14:29:12 Anyone want to give it a go...
14:30:56 danpb, if TCG doesn't need a custom model why was one introduced?
14:31:04 danpb, was there no TCG at the time?
14:31:45 PaulMurray: It was introduced in this change, with this rationale:
14:31:47 "We are trying to get a working 64bit qemu cpu model in the gate for nova live migration testing. It appears that we need to make this change prior to nova starting."
14:31:53 https://review.openstack.org/#/c/168407/
14:32:00 either the CI was using KVM (nested) originally (?) or there is a bug in nova where it is trying to do CPU comparisons for both KVM and TCG, not just KVM
14:32:21 or even both
14:33:00 yeah, the latter sounds correct
14:33:09 I'm really ignorant on this stuff... I would think anything we can try would be a good idea
14:33:20 I don't think the CPU comparison is conditional, but my mind could be playing tricks on me
14:33:54 # NOTE(berendt): virConnectCompareCPU not working for Xen
14:33:55 if CONF.libvirt.virt_type not in ['qemu', 'kvm']:
14:33:55 return
14:34:02 that's in the _compare_cpu method
14:34:09 removing 'qemu' from that list might be sufficient
14:34:37 ah, that could do it
14:34:53 there might be other cases of that issue elsewhere in the code of course
14:37:50 Would someone like to pick this up?
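The guard quoted above can be sketched as a standalone function to show danpb's suggested change. This is a simplified, hypothetical version of the check inside nova's `_compare_cpu`, not the actual driver code: the point is simply that dropping 'qemu' from the skip-list's complement means TCG guests no longer get a host/guest CPU comparison.

```python
# Sketch of the virt_type guard discussed above, written as a
# standalone helper (hypothetical; in nova the check lives inside
# LibvirtDriver._compare_cpu).  danpb's suggestion: treat 'qemu'
# (TCG, pure emulation) like Xen and skip the comparison, leaving
# only 'kvm' subject to the host/guest CPU compatibility check.

def should_compare_cpu(virt_type):
    """Return True only when a host/guest CPU comparison makes sense.

    KVM guests use the host CPU, so compatibility must be checked.
    TCG ('qemu') is 100% emulation, so any CPU model works and the
    check can be skipped; virConnectCompareCPU is broken for Xen.
    """
    return virt_type in ['kvm']


if __name__ == '__main__':
    assert should_compare_cpu('kvm') is True
    assert should_compare_cpu('qemu') is False  # TCG: skip the check
    assert should_compare_cpu('xen') is False
    print('ok')
```

As noted in the log, there may be other virt_type checks elsewhere in the driver that would need the same treatment.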
14:39:20 I'm not intimately familiar with this area of the Nova codebase, but can give it a shot
14:39:34 Thanks kashyap
14:39:38 Unless someone wants to work on it immediately
14:40:00 I'll see if I can get someone for it if you're not keen, but you might be the best bet for now.
14:40:21 PaulMurray: will try to help kashyap
14:40:29 thanks
14:40:31 tdurakov: Great
14:41:02 we have never used nested virt on rax as far as I know
14:41:06 #action kashyap and tdurakov to look into using qemu defaults for CI testing instead of gate64 cpu model
14:41:25 because of the xen vs kvm issues. and that is where we run into the non-homogeneous cpu issues
14:41:50 sdague pushed a patch that stopped using a common cpu model type and got it to fail recently so you can start there
14:42:28 clarkb, thanks, have you seen the discussion above about using TCG?
14:42:30 clarkb: thanks for the details
14:43:02 clarkb, does that make any difference to you?
14:43:35 I have no idea, just want to avoid confusion with the test setup as it currently runs
14:43:44 ok, thanks
14:43:47 Sean seems to have tried on the RAX env without specifying that custom model: "Experimenting with not doing the gate64 cpu setting failed in one of the live migration jobs on RAX because of cpu compat." "Here is the cpu comparison between the master and subnode" -- http://paste.openstack.org/show/505672/
14:44:35 if we can pull the compat check like danpb says, because we don't need it for qemu, that seems like the most sensible fix
14:45:06 especially as we discovered nested kvm breaks too often for us, so we're unlikely to enable it
14:45:32 yeah, it's still a bit touchy
14:46:04 well, we've stayed on this topic for a long time because it's important, but let's move on now
14:46:25 If it breaks, bugs ought to be filed. Upstream Kernel / KVM is quite responsive on these issues from my first-hand experience, FWIW.
14:46:33 I think mriedem is still away
14:46:54 #topic Libvirt Storage Pools
14:47:05 anything you need mdbooth ?
14:47:11 or anything you want to tell us?
14:47:12 PaulMurray: it would be good to make sure this gets resolved, because this inhibits moving to xenial in upstream testing (unless we disable live migration testing entirely)
14:47:21 PaulMurray: Nope, diana_clarke and I have had our heads down
14:47:28 I have some qcow2 code to revisit, but otherwise I'm pretty close to being done with the new imagebackend methods. And Matt is starting to use these new methods in the driver code.
14:47:36 sdague, plan to follow it through - kashyap and tdurakov are going to look at it
14:47:38 We're *hoping* to be testing out CI by the end of the week.
14:48:26 That's good - are you getting what you need from everyone else?
14:48:46 PaulMurray: Yup. Getting excellent review attention.
14:49:04 good
14:49:08 Could do with more eyes on diana_clarke's series now, though
14:49:22 They're still in flux, but nothing major I think
14:49:22 diana_clarke, you got a link?
14:49:48 PaulMurray: Links should be on the wiki?
14:50:26 https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:libvirt-instance-storage
14:50:54 (ignore the most recent two for the moment)
14:51:16 are you putting this in here: https://etherpad.openstack.org/p/newton-nova-priorities-tracking
14:53:00 #action all look at these https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:libvirt-instance-storage
14:53:22 only a few minutes to go
14:53:32 #topic reviews and open discussion
14:53:47 anything?
14:54:48 I think we are done
14:54:56 thanks for coming
14:55:05 #endmeeting
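For reference on the gate64 discussion above: the libvirt mechanism danpb described (taking an existing CPU model and disabling individual features, instead of inventing a custom model in cpu_map.xml) looks roughly like this in domain XML. This is a sketch based on libvirt's formatdomain documentation; the choice of qemu64 and the disabled flag are illustrative, not what the gate actually uses.

```xml
<!-- Sketch: use an existing model with 'custom' mode and switch off
     individual feature flags, rather than defining a gate64 model.
     The model and the disabled flag here are examples only. -->
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>qemu64</model>
  <feature policy='disable' name='svm'/>
</cpu>
```

Wiring this up would still need the nova.conf enhancement kashyap mentioned (letting operators enable/disable specific CPU flags), which is the RFE he planned to file.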