14:00:06 <PaulMurray> #startmeeting Nova Live Migration
14:00:07 <openstack> Meeting started Tue May 31 14:00:06 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:10 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:25 <PaulMurray> Hi all
14:00:27 <diana_clarke> o/
14:00:29 <tdurakov> o/
14:01:01 <paul-carlton> o/
14:01:03 <davidgiluk> o/
14:01:10 <andrearosa> hi
14:01:12 <mdbooth> o/
14:01:20 <PaulMurray> agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:53 <PaulMurray> Sorry the agenda was updated so late, it has been a holiday here
14:02:04 <PaulMurray> #topic Specs
14:02:20 <andreas_s> hi
14:02:23 <PaulMurray> It's nearly non-prio spec freeze
14:02:45 <PaulMurray> So looking for any last minute things to sort out
14:02:55 <PaulMurray> this is on the agenda
14:03:14 <PaulMurray> #link https://review.openstack.org/#/c/320849/ : Use finite machine in migration process
14:04:10 <tdurakov> PaulMurray: commented this spec
14:04:12 <PaulMurray> I don't think tangchen is here
14:04:15 <woodster_> o/
14:04:30 <PaulMurray> I saw your comment tdurakov
14:04:34 <johnthetubaguy> is this the general direction folks here want?
14:04:58 <tdurakov> johnthetubaguy: which one?
14:05:02 <PaulMurray> tdurakov, did you get a chance to look at what Nikola had done before ?
14:05:13 <johnthetubaguy> the above spec 320849
* mdbooth completely agrees with the concept of a state machine here, but I haven't had a chance to look at this spec in detail
14:05:48 <tdurakov> PaulMurray: patch you've mentioned? yes
14:06:21 <PaulMurray> I think it was going that way but need tangchen to confirm that's what he meant
14:06:28 <PaulMurray> if so it looks good to me
14:06:31 <PaulMurray> what do you think?
14:06:40 <johnthetubaguy> it feels like we should focus on tdurakov's changes around the conductor this cycle
14:07:09 <johnthetubaguy> the statemachine would be cool, but not sure we have agreed the direction on how to use that yet
14:07:26 <tdurakov> PaulMurray: if we want to go this way - sure, but from my perspective we could use state machines for the whole process instead of dealing with statuses only
14:08:12 <tdurakov> johnthetubaguy, PaulMurray, I'll start ml for this, so we could discuss it
14:08:29 <PaulMurray> ok, lets go that way
14:08:47 <PaulMurray> #action tdurakov to start ML about migration state machine
14:08:54 <johnthetubaguy> I just think it will be clearer how this helps, after we remove the compute <-> compute bits
14:09:08 <tdurakov> johnthetubaguy: +1
14:09:21 <PaulMurray> lets move on then
14:09:33 <PaulMurray> I'll comment on the spec about that
14:09:34 * mdbooth still doesn't understand the rationale for removing the compute<->compute bits, but perhaps I ought to read that spec again
14:09:48 <tdurakov> :)
14:10:22 <PaulMurray> any other specs to go over?
14:10:35 <PaulMurray> I ignored ones that had no progress for the last month
14:10:54 <johnthetubaguy> are there any more we need at this point?
14:11:00 <davidgiluk> the postcopy/force finish looks like it's getting there
14:11:18 <PaulMurray> I think that's merged now
14:11:21 <PaulMurray> ?
14:11:29 <tdurakov> https://review.openstack.org/#/c/306561/ - this one?
14:11:34 <tdurakov> yup, it's merged
14:11:36 <andrearosa> PaulMurray: it is
14:11:49 <davidgiluk> yes, it's good to see that merged
14:12:12 <johnthetubaguy> I think we got most of the big ticket items done this morning
14:12:18 <johnthetubaguy> just wondering what other bits we have lingering
14:12:20 <PaulMurray> The only ones that have had progress in last month either merged or are priority specs
14:12:36 <johnthetubaguy> so what are the outstanding priority one?
14:12:38 <PaulMurray> i.e. rest of storage pools ones
14:12:48 <johnthetubaguy> ah, the storage pool ones
14:12:53 <andreas_s> for completeness: the use-target-vif spec still requires discussion on the neutron side
14:12:54 <PaulMurray> https://review.openstack.org/#/c/310505 Use libvirt storage pools (spec)
14:12:55 <PaulMurray> https://review.openstack.org/#/c/310538/ Migrate libvirt volumes (spec)
14:13:24 <johnthetubaguy> oh, those are spec links, I thought they were code links
14:13:37 <johnthetubaguy> thats clearer now
14:14:09 <mdbooth> Aside: does anybody know Feodor Tersin's nick, or what tz he's in?
14:15:07 <PaulMurray> the storage pools specs have been getting reviews, but still to be finished
14:15:33 <kashyap> mdbooth: 'ftersin' seems to be his nick
14:15:58 <PaulMurray> lets move on to the fun stuff then
14:16:01 <PaulMurray> #topic CI
14:16:19 <PaulMurray> The gate 64 discussion
14:16:31 <PaulMurray> http://lists.openstack.org/pipermail/openstack-dev/2016-May/095811.html
14:16:36 <PaulMurray> starts here ^
14:16:54 <PaulMurray> Looks like we're a bit stuck for now
14:16:57 <kashyap> PaulMurray: Yeah, the current status is:
14:17:28 <kashyap> It is a legit regression in libvirt; _but_, libvirt devs insist that what we're doing (manually updating cpu_map.xml) in DevStack is a hack.
14:17:53 <kashyap> And, nova.conf should be enhanced to let operators enable/disable specific CPU flags.
14:18:06 <kashyap> I'm going to file an upstream RFE for that before this is all forgotten
14:18:29 <mdbooth> kashyap: Is the regression going to be fixed?
14:18:53 <kashyap> mdbooth: Jiri (migration dev) said today that it is not trivial at all to fix, but that said, it'll be fixed as part of upcoming libvirt CPU driver fixes
14:19:02 <kashyap> mdbooth: Which might take a month-ish
14:20:06 <tdurakov> kashyap: if it's a hack in devstack, any recommendation on dealing with it the proper way?
14:20:18 <kashyap> See danpb's comment: http://lists.openstack.org/pipermail/openstack-dev/2016-May/096251.html
14:20:24 <kashyap> tdurakov: ^
14:20:52 <mdbooth> kashyap: Yeah, that's pretty much what I was thinking
14:21:34 <danpb> IMHO we shouldn't really be trying to create custom cpu models in the first place
14:21:48 <danpb> libvirt has explicit support for turning off individual features against a pre-existing cpu model
14:21:51 <davidgiluk> mdbooth: Do you have any idea how the gate64 set came to be - i.e. why it was core2duo minus those options?
14:21:58 <danpb> we've just not got that wired up in nova.conf
14:22:33 <danpb> if we used the proper libvirt support for this, instead of inventing gate64, we would not have hit a problem
14:22:38 <mdbooth> davidgiluk: Nope. Guessing trial and error. I doubt specific cpu features are all that important in the gate as long as it works.
14:22:52 <johnthetubaguy> danpb: that sounds like it matches the CPU mask I am used to in XenServer to do something similar
14:22:52 <kashyap> danpb: You're referring to the <cpu> element here (with 'custom' mode), right? http://libvirt.org/formatdomain.html
14:23:03 <danpb> yes
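[Editor's note: the libvirt `<cpu>` element kashyap and danpb refer to above can be sketched as follows. This is a Python snippet purely to show the XML shape; the qemu64 model and svm feature names are illustrative examples, not something stated in the meeting.]

```python
import xml.etree.ElementTree as ET

# 'custom' mode names a pre-existing CPU model and per-feature
# policies disable individual flags, rather than inventing a whole
# new model (like gate64) in cpu_map.xml. Model/feature names below
# are examples only.
cpu_xml = """
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>qemu64</model>
  <feature policy='disable' name='svm'/>
</cpu>
"""

cpu = ET.fromstring(cpu_xml)
print(cpu.get('mode'))  # custom
```

Nova would need the per-feature enable/disable part wired up in nova.conf, which is the RFE kashyap mentions filing.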
14:23:15 <davidgiluk> mdbooth: What's odd to me is why don't you just use core2duo - I'm curious why there is something less capable than a core2duo being used
14:23:17 <mdbooth> johnthetubaguy: Is there a solution to this problem in devstack for XenServer?
14:23:42 <danpb> davidgiluk: certain public cloud providers aren't exposing nice cpus to the guest
14:23:59 <danpb> iirc, in particular there was one or two key feature flags that all qemu models include which were not exposed by some clouds
14:24:07 <johnthetubaguy> mdbooth: nope
14:24:32 <danpb> of course since the CI is apparently using TCG and not KVM, the choice of CPU model ought not to matter at all
14:24:59 <danpb> so there's also probably another bug in nova where it is mistakenly doing a host/guest CPU comparison for TCG guests, when it should just ignore it
14:25:00 <kashyap> Yeah, that regression bug won't trigger in a domain type KVM.
14:25:04 <mdbooth> danpb: orly? Could that be the cause of other issues?
14:25:14 <mdbooth> i.e. tcg
14:25:27 <danpb> TCG can expose any CPU model it likes, regardless of whether the host cpu has the same features
14:25:39 <danpb> since it is 100% cpu emulation with no hardware acceleration
14:26:16 <johnthetubaguy> mdbooth: by other issues, do you mean the lockups during live-migrate?
14:26:34 <davidgiluk> danpb: Although how many of those features are tested in TCG....
14:27:00 <mdbooth> johnthetubaguy: For example, or anything tbh. Whenever I've mentioned TCG to KVM folks in the past they've been pretty much: don't do that.
14:27:21 <danpb> davidgiluk: true, but the point is we shouldn't need to invent custom cpu models at all for TCG - we can use any cpu model we want that exists
14:27:37 <danpb> eg just use the default  qemu64
14:29:12 <PaulMurray> Anyone want to give it a go...
14:30:56 <PaulMurray> danpb, if TCG doesn't need a custom model, why was one introduced?
14:31:04 <PaulMurray> danpb, was there no TCG at the time?
14:31:45 <kashyap> PaulMurray: It was introduced in this change, with this rationale:
14:31:47 <kashyap> "We are trying to get a working 64bit qemu cpu model in the gate for nova live migration testing. It appears that we need to make this change prior to nova starting. "
14:31:53 <kashyap> https://review.openstack.org/#/c/168407/
14:32:00 <danpb> either the CI was using KVM (nested) originally (?) or there is a bug in nova where it is trying to do CPU comparisons for both KVM and TCG, not just KVM
14:32:21 <danpb> or even both
14:33:00 <johnthetubaguy> yeah, the latter sounds correct
14:33:09 <PaulMurray> I'm really ignorant on this stuff... I would think anything we can try would be a good idea
14:33:20 <johnthetubaguy> I don't think the CPU comparison is conditional, but my mind could be playing tricks on me
14:33:54 <danpb> # NOTE(berendt): virConnectCompareCPU not working for Xen
14:33:55 <danpb> if CONF.libvirt.virt_type not in ['qemu', 'kvm']:
14:33:55 <danpb> return
14:34:02 <danpb> that's in the _compare_cpu method
14:34:09 <danpb> removing 'qemu' from that list might be sufficient
14:34:37 <johnthetubaguy> ah, that could do it
14:34:53 <danpb> there might be other cases of that issue elsewhere in the code of course
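[Editor's note: the change danpb sketches above would look something like the following. This is a simplified stand-in built from the quoted _compare_cpu guard, not the actual Nova patch; the helper name is hypothetical.]

```python
# danpb's suggestion: drop 'qemu' from the guard so the host/guest
# CPU compatibility comparison only runs for hardware-accelerated
# (KVM) guests. TCG guests are fully emulated, so the host CPU's
# feature set is irrelevant to them.

def should_compare_cpu(virt_type):
    # Original guard returned early (skipping the check) for any
    # virt_type not in ['qemu', 'kvm']; removing 'qemu' restricts
    # the comparison to KVM only.
    return virt_type == 'kvm'
```

Under this sketch a TCG gate job (virt_type 'qemu') using the stock qemu64 model would no longer trip the comparison, which is why no custom gate64 model would be needed.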
14:37:50 <PaulMurray> Would someone like to pick this up ?
14:39:20 <kashyap> I'm not intimately familiar with the Nova codebase of this area, but can give it a shot
14:39:34 <PaulMurray> Thanks kashyap
14:39:38 <kashyap> Unless someone wants to work on it immediately
14:40:00 <PaulMurray> I'll see if I can get someone for it if you're not keen, but you might be the best bet for now.
14:40:21 <tdurakov> PaulMurray: will try to help kashyap
14:40:29 <PaulMurray> thanks
14:40:31 <kashyap> tdurakov: Great
14:41:02 <clarkb> we have never used nested virt on rax as far as I know
14:41:06 <PaulMurray> #action kashyap and tdurakov to look into using qemu defaults for CI testing instead of gate64 cpu model
14:41:25 <clarkb> because of the xen vs kvm issues. and that is where we run into the non-homogeneous cpu issues
14:41:50 <clarkb> sdague pushed a patch that stopped using a common cpu model type and got it to fail recently so you can start there
14:42:28 <PaulMurray> clarkb, thanks, have you seen the discussion above about using TCG ?
14:42:30 <tdurakov> clarkb: thanks for details
14:43:02 <PaulMurray> clarkb, does that make any difference to you ?
14:43:35 <clarkb> I have no idea just want to avoid confusion with the test setup as it currently runs
14:43:44 <PaulMurray> ok, thanks
14:43:47 <kashyap> Sean seems to have tried on RAX env without specifying that custom model:  "Experimenting with not doing the gate64 cpu setting failed in one of the live migration jobs on RAX because of cpu compat."  "Here is the cpu comparison between the master and subnode" -- http://paste.openstack.org/show/505672/
14:44:35 <sdague> if we can pull the compat check like danpb says, because we don't need it for qemu, that seems like the most sensible fix
14:45:06 <sdague> especially as we discovered nested kvm breaks too often for us, so we're unlikely to enable it
14:45:32 <davidgiluk> yeh it's still a bit touchy
14:46:04 <PaulMurray> well, we've stayed on this topic for a long time because it's important, but let's move on now
14:46:25 <kashyap> If it breaks, bugs ought to be filed.  Upstream Kernel / KVM is quite responsive in these issues from my first-hand experience, FWIW.
14:46:33 <PaulMurray> I think mriedem is still away
14:46:54 <PaulMurray> #topic Libvirt Storage Pools
14:47:05 <PaulMurray> anything you need mdbooth ?
14:47:11 <PaulMurray> or anything you want to tell us ?
14:47:12 <sdague> PaulMurray: it would be good to make sure this gets resolved, because this inhibits moving to xenial in upstream testing (unless we disable live migration testing entirely)
14:47:21 <mdbooth> PaulMurray: Nope, diana_clarke and I have had our heads down
14:47:28 <diana_clarke> I have some qcow2 code to revisit, but otherwise I'm pretty close to being done with the new imagebackend methods. And Matt is starting to use these new methods in the driver code.
14:47:36 <PaulMurray> sdague, plan to follow it through - kashyap and tdurakov are going to look at it
14:47:38 <mdbooth> We're *hoping* to be testing out CI by the end of the week.
14:48:26 <PaulMurray> That's good - are you getting what you need from everyone else ?
14:48:46 <mdbooth> PaulMurray: Yup. Getting excellent review attention.
14:49:04 <PaulMurray> good
14:49:08 <mdbooth> Could do with more eyes on diana_clarke's series now, though
14:49:22 <mdbooth> They're still in flux, but nothing major I think
14:49:22 <PaulMurray> diana_clarke, you got a link ?
14:49:48 <mdbooth> PaulMurray: Links should be on the wiki?
14:50:26 <diana_clarke> https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:libvirt-instance-storage
14:50:54 <diana_clarke> (ignore the most recent two for the moment)
14:51:16 <PaulMurray> are you putting this in here: https://etherpad.openstack.org/p/newton-nova-priorities-tracking
14:53:00 <PaulMurray> #action all look at these https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:libvirt-instance-storage
14:53:22 <PaulMurray> only a few minutes to go
14:53:32 <PaulMurray> #topic reviews and open discussion
14:53:47 <PaulMurray> anything ?
14:54:48 <PaulMurray> I think we are done
14:54:56 <PaulMurray> thanks for coming
14:55:05 <PaulMurray> #endmeeting