14:00:17 <tdurakov> #startmeeting Nova Live Migration
14:00:21 <openstack> Meeting started Tue Sep 27 14:00:17 2016 UTC and is due to finish in 60 minutes.  The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:24 <tdurakov> hi everyone
14:00:25 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:42 * kashyap waves
14:00:54 <mrhillsman> o/
14:00:56 <mriedem> o/
14:00:57 <mdbooth> o/
14:00:57 <raj_singh> o/
14:01:09 <davidgiluk> o/
14:01:11 <ltomasbo> o/
14:01:13 <tdurakov> agenda for the meeting  https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration#Agenda_for_next_meeting
14:02:00 <tdurakov> #topic CI
14:03:04 <tdurakov> let me start, I'm working in the background on the patch to re-enable ceph for the tempest l-m job, will notify once ready
14:03:47 <mriedem> fyi on ceph
14:03:53 <mriedem> we're hitting some other issues with the ceph job on xenial
14:04:00 * johnthetubaguy lurks
14:04:14 <mriedem> https://bugs.launchpad.net/cinder/+bug/1627220
14:04:16 <openstack> Launchpad bug 1627220 in devstack-plugin-ceph "ceph: test_volume_boot_pattern fails with "ImageNotFound: error protecting snapshot"" [Undecided,In progress]
14:04:22 <tdurakov> mriedem: so, we are blocked on *both* backends
14:04:32 <mriedem> tdurakov: well, ^ might not impact live migration tests
14:04:38 <mriedem> it only seems to blow up tests that use snapshots
14:04:55 <mdbooth> mriedem: Does that impact nova? Bug seems cindery.
14:05:07 <tdurakov> mriedem then it should not
14:05:09 <mriedem> it impacts everything in the ceph job
14:05:16 <mriedem> but note the bug isn't in nova
14:05:23 <tdurakov> you are right
14:05:26 <mriedem> there is a guy from red hat with a patch for the devstack plugin
14:05:27 <mdbooth> mriedem: Gotcha
14:05:31 <mriedem> sounds like a regression in the version of ceph in xenial
14:06:20 <tdurakov> any updates on grenade job?
14:06:46 <tdurakov> pkoniszewski: hi, are you around?
14:08:17 <tdurakov> anyone else have updates on that?
14:09:03 <tdurakov> seems no
14:09:08 <johnthetubaguy> I don't, I know raj_singh is hoping to take a look at that, if pkoniszewski doesn't get there
14:09:22 <mriedem> i know the patch, sec
14:09:28 <johnthetubaguy> yeah, the patch is up
14:09:33 <mriedem> https://review.openstack.org/#/c/364809/
14:09:39 <mriedem> i had reviewed it
14:09:40 <raj_singh> yes I pinged pkoniszewski to ask if he needs any help, have not heard back
14:09:54 <mriedem> i have some concerns about it,
14:09:55 <johnthetubaguy> #link https://review.openstack.org/#/c/364809
14:10:09 <mriedem> we should probably test that in a d-g WIP patch first to see if it works
14:10:10 <johnthetubaguy> mriedem: didn't you have a patch at one point that helped with that?
14:10:16 <mriedem> b/c by default grenade jobs only run smoke tests
14:10:24 <mriedem> and the live migration tests aren't smoke tests
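(context: a grenade job can be pointed at non-smoke tests by overriding the tempest test selection; a minimal sketch of that override, where DEVSTACK_GATE_TEMPEST_REGEX is a real devstack-gate variable but the regex value is only an assumed example)

    # select the live-migration tests instead of the default smoke set (regex value assumed)
    export DEVSTACK_GATE_TEMPEST_REGEX="live_migration"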
14:10:31 <johnthetubaguy> mriedem: ah, rather than a brand new thing
14:10:34 <tdurakov> mriedem: right
14:10:34 <mriedem> johnthetubaguy: they are just one-offs
14:10:52 <tdurakov> mriedem: how could we test it without merging the initial job definition?
14:10:56 <mriedem> johnthetubaguy: basically you take the variables and shell stuff from https://review.openstack.org/#/c/364809/5/jenkins/jobs/devstack-gate.yaml and put that into a d-g change
14:11:34 <mriedem> tdurakov: like this https://review.openstack.org/#/c/362441/
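(context: the pattern mriedem describes is to copy the job's variables out of the project-config definition and hardcode them in a WIP devstack-gate change; a minimal sketch using real devstack-gate variable names, with values assumed for this job)

    # hardcoded in a WIP/DNM devstack-gate change to simulate the proposed grenade multinode live-migration job
    export DEVSTACK_GATE_TOPOLOGY="multinode"   # live migration needs at least two nodes
    export DEVSTACK_GATE_GRENADE="pullup"       # grenade upgrade mode: upgrade the old side to the new release
    export DEVSTACK_GATE_TEMPEST=1              # run tempest after the upgrade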
14:12:23 <tdurakov> mriedem: will it help?
14:12:23 <johnthetubaguy> raj_singh: could you get someone to try that out?
14:12:30 <johnthetubaguy> nice to know it would work
14:12:51 <tdurakov> I think there will be no such job registered in zuul, am I wrong??
14:13:03 <raj_singh> johnthetubaguy: yes sure
14:13:27 <johnthetubaguy> so modify the existing job, in a DNM path, depend on that on some nova patch, and we test out to see if grenade + live-migrate works
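(context: "depend on that on some nova patch" refers to Zuul's Depends-On commit-message footer, which makes the nova test patch gate against the modified devstack-gate change; a sketch of such a footer, with a hypothetical Change-Id)

    DNM: exercise grenade multinode live-migration

    Depends-On: I0123456789abcdef0123456789abcdef01234567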
14:13:50 <mriedem> tdurakov: the d-g change would test it
14:14:07 <mriedem> project-config defines the job and just passes variables to d-g
14:14:08 <tdurakov> mriedem: so, existing job for now, as a debug/PoC?
14:14:15 <mriedem> we can WIP a d-g change to hardcode those same vars to test things
14:14:17 <mriedem> yeah
14:14:25 <mriedem> there is a multinode grenade job that runs on d-g already
14:14:36 <tdurakov> got it, yeah this will work
14:15:14 <mriedem> the post_test_hook is something i haven't figured out how to stub in d-g yet
14:15:29 <mriedem> actually maybe i have https://review.openstack.org/#/c/377113/2/devstack-vm-gate-wrap.sh
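(context: devstack-gate will call a post_test_hook shell function after the main test run if the job defines one; a minimal sketch of the pattern, where the hook-script path inside the nova tree is an assumption)

    # defined in the job and picked up by devstack-vm-gate-wrap.sh after the main run
    function post_test_hook {
        # assumed location of the live-migration hook script in the nova repo
        cd $BASE/new/nova/gate/live_migration/hooks
        ./run_tests.sh
    }
    export -f post_test_hook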
14:16:04 <tdurakov> mriedem: l-m job we have is based on post-test hook
14:16:12 <mriedem> yeah i know
14:16:22 <mriedem> so, i can take a todo to try and run this through d-g today
14:16:26 <mriedem> at least start it
14:16:37 <johnthetubaguy> oh, nice, just redefine it to call us
14:16:45 <johnthetubaguy> cheeky
14:17:04 <tdurakov> +1, p.s. I've commented on that patch
14:17:13 <tdurakov> so let's use one directory for such hooks
14:18:26 <tdurakov> do we have anything else on CI?
14:19:00 <tdurakov> let's go next then
14:19:10 <tdurakov> #topic Bugs
14:21:09 <tdurakov> has anyone seen https://review.openstack.org/#/c/375644/ ?
14:21:43 <mriedem> nope
14:21:49 <mriedem> there isn't a bug associated with that either
14:22:22 <mdbooth> mriedem: It's not really a bug
14:22:48 <tdurakov> mdbooth: what is it then?
14:22:53 <tdurakov> looks like a bug
14:23:03 <mdbooth> tdurakov: It's a statement
14:23:42 <tdurakov> mdbooth: do we need to backport that 'statement'?
14:24:06 <mdbooth> I'm not convinced I agree it's a good change, tbh
14:24:20 <mdbooth> I'd want input from danpb on this
14:25:01 <tdurakov> so, please review that
14:26:06 <tdurakov> there are also several new bugs
14:27:26 <mdbooth> tdurakov: I've asked danpb to look.
14:28:43 <tdurakov> mdbooth: thank you
14:28:56 <tdurakov> there are also 4 new/confirmed bugs for live-migration
14:29:24 <tdurakov> https://bugs.launchpad.net/nova/+bugs?field.searchtext=&orderby=-importance&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&assignee_option=none&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=liberty-backport-potential+&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&search=Search
14:30:22 <mdbooth> tdurakov: Could you paste the 4 bug links?
14:30:32 <davidgiluk> and/or a short URL for the search
14:30:35 <mdbooth> Right
14:31:38 * tdurakov loves launchpad
14:31:39 <tdurakov> https://bugs.launchpad.net/nova/+bug/1573875
14:31:41 <openstack> Launchpad bug 1573875 in OpenStack Compute (nova) "The same ceph rbd device is used by multiple instances" [Undecided,New]
14:31:51 <tdurakov> https://bugs.launchpad.net/nova/+bug/1573944
14:31:53 <openstack> Launchpad bug 1573944 in OpenStack Compute (nova) "target-lun id of volume changed when live-migration failed" [Undecided,New] - Assigned to Xuanzhou Perry Dong (oss-xzdong)
14:32:01 <tdurakov> https://bugs.launchpad.net/nova/+bug/1583107
14:32:02 <openstack> Launchpad bug 1583107 in OpenStack Compute (nova) "live-migration abort parameters are not honored" [Undecided,New]
14:32:09 <tdurakov> https://bugs.launchpad.net/nova/+bug/1583145
14:32:11 <openstack> Launchpad bug 1583145 in OpenStack Compute (nova) "live-migration monitoring is not working properly" [Undecided,New]
14:32:26 <mdbooth> tdurakov: From the first bug: "We use Kilo from Ubuntu cloud archive"
14:32:30 <mdbooth> Can we close that one?
14:33:03 <tdurakov> it looks like a duplicate https://bugs.launchpad.net/nova/+bug/1419577
14:33:04 <openstack> Launchpad bug 1419577 in OpenStack Compute (nova) "when live-migrate failed, lun-id couldn't be rollback in havana" [High,In progress] - Assigned to Lee Yarwood (lyarwood)
14:33:29 <tdurakov> second one
14:34:09 <tdurakov> mdbooth: I'll move the first one to incomplete and ask them to check on a newer version
14:34:35 <mdbooth> tdurakov: Further down they mention mitaka
14:34:54 <tdurakov> #action to triage bugs above
14:35:20 <tdurakov> let's do it after meeting and go to the next topic
14:35:27 <tdurakov> #topic specs
14:36:05 <tdurakov> I've walked through the latest submitted/updated specs
14:36:12 <tdurakov> so added 2 to the agenda
14:36:53 <tdurakov> if you have any other specs, please update agenda
14:37:13 <mdbooth> Related, I need to re-propose persistent instance metadata
14:37:25 <tdurakov> so, first one: https://review.openstack.org/#/c/347161/
14:37:29 * mdbooth hasn't quite gotten round to it yet, though
14:38:04 <tdurakov> paul-carlton2: hi, are you around?
14:38:10 <mdbooth> So, I have a point related to that spec
14:38:39 <mdbooth> paul-carlton2: Why is that APIImpact, btw?
14:38:52 <mdbooth> That's not it, anyway.
14:39:00 <paul-carlton2> yep
14:39:47 <paul-carlton2> api impact because attempting to live migrate a rescued instance has a different outcome
14:40:01 <mdbooth> So, currently the libvirt driver assumes that the db is canonical. i.e. At any point we can throw away libvirt xml and regenerate it.
14:40:13 <mdbooth> This spec follows that model, so it's fine.
14:40:23 <paul-carlton2> it will return the same sort of error for the XenAPI hypervisor but on libvirt it will migrate
14:40:24 <tdurakov> paul-carlton2: I agree with mdbooth on that, it's just a kind of fix, no api change actually
14:41:10 <paul-carlton2> I'm happy to remove it, pretty sure I was told to add it on the mitaka version of this, will check who by
14:41:17 <mdbooth> However, I was recently thinking about this assumption in a few different contexts. We (RH) got a customer bug about disk ordering changing after a reboot.
14:41:43 <mdbooth> At that point, I started thinking if that's something we could fix by persisting this info in persistent instance metadata.
14:42:06 <tdurakov> paul-carlton2: yes, please, I'll leave a comment on that as well
14:42:23 <mdbooth> Then I got to wondering if we're just going to end up tying ourselves in knots, and we should update everything which currently assumes that the db is canonical and switch to libvirt being canonical.
14:42:57 <mdbooth> The reason I bring it up here specifically is that it's relevant to this spec.
14:43:02 <mdbooth> paul-carlton2: Any thoughts ^^^ ?
14:44:03 <paul-carlton2> I think nova db should be the authority
14:44:05 <mdbooth> So, if libvirt domain xml was canonical, we would need to separately store storage format or location. Also we would never lose device ordering, address assignment.
14:44:28 <paul-carlton2> we rebuild libvirt stuff from nova db in hard reboot rebuild etc
14:44:38 <mdbooth> paul-carlton2: Right. That's something we'd fix.
14:44:39 <paul-carlton2> gets us out of jail sometimes
14:44:46 <mdbooth> There's probably about 4 places we'd need to fix.
14:44:50 <tdurakov> domain.xml shouldn't be the source of truth
14:45:02 <mdbooth> tdurakov: I'm arguing against that.
14:45:24 <paul-carlton2> mdbooth, it is turning stuff on its head and db should be the authority for all hypervisors
14:45:34 <mdbooth> domain.xml *isn't* the canonical source of truth. I'm arguing it might be easier if it was, though.
14:46:03 <mdbooth> paul-carlton2: It means we don't need to store driver-specific detail to the Nth degree in the db.
14:46:05 <tdurakov> mdbooth: there is a bug on hard_reboot of instances with config drive, and the way nova rebuilds domain.xml makes things tricky
14:46:10 <paul-carlton2> but that would only impact libvirt, other hypervisors might be different
14:46:15 <mdbooth> See above around address assignment, device ordering.
14:46:41 <mdbooth> tdurakov: Right. I'm suggesting nova should never rebuild domain.xml.
14:47:01 <mdbooth> Unless doing an actual rebuild, eg for evacuate.
14:47:03 <paul-carlton2> trouble is we can't assume what current or future hypervisors can do
14:47:16 <mdbooth> paul-carlton2: Exactly, that's libvirt's problem.
14:47:39 <paul-carlton2> better to store everything you need in db and build instance, whatever the hypervisor, from that
14:48:01 <mdbooth> paul-carlton2: But then you implicitly *lose* all the hypervisor's new features unless you also add them to Nova.
14:48:08 <mdbooth> We're duplicating libvirt functionality in Nova.
14:48:18 <mdbooth> Unless we don't, in which case we lose that functionality.
14:48:20 <paul-carlton2> I have no problem with some of the info in the db being hypervisor specific, like your new object
14:48:30 <mdbooth> e.g. persistent device addressing, device ordering.
14:48:53 <mdbooth> If we want this stuff to work under the direction we're currently taking, we need to implement everything libvirt does in Nova.
14:48:56 <paul-carlton2> all can be stored in a hypervisor-specific object in the db
14:49:07 <mdbooth> If we just create a domain, then use libvirt to manipulate it.
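(context: mdbooth's alternative amounts to reading device ordering and address assignments back from libvirt's persisted domain definition instead of regenerating the XML from the nova db; a sketch at the virsh level, with a hypothetical instance name)

    # read the persistent (inactive) domain definition back from libvirt,
    # rather than regenerating it from the nova db on e.g. hard reboot
    virsh dumpxml --inactive instance-00000001
    # device ordering and address assignment are preserved in that XML, e.g.:
    #   <target dev='vda' bus='virtio'/>
    #   <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>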
14:53:06 <tdurakov> please start ml thread for that, unfortunately we are limited in time on that meeting, let's switch to the next
14:53:06 <mdbooth> libvirt is responsible for maintaining that and updating it.
14:53:08 <paul-carlton2> by the way, Andrea Rosa said add APIImpact to rescued migration
14:53:08 <mdbooth> The other reason I bring it up is it would obsolete my persistent instance storage metadata spec
14:53:08 <tdurakov> another spec: https://review.openstack.org/#/c/301090/
14:53:08 <tdurakov> andreas_s: I think it's yours, right?
14:53:08 <paul-carlton2> he has left openstack work now so I guess I can pull that if I want, but do you actually think he was right?
14:53:08 <andreas_s> tdurakov, yes
14:53:09 <andreas_s> we're currently trying to get things on the neutron side clear
14:53:09 <andreas_s> https://review.openstack.org/309416
14:53:09 <tdurakov> haven't read it yet, do you have anything to discuss right now on the spec?
14:53:09 <andreas_s> that's why work on the Nova spec was laid aside for a while
14:53:09 <mdbooth> johnthetubaguy: Did I see a spec from you about neutron apis, or am I getting confused with the cinder one? ^^^
14:53:10 <tdurakov> #action review https://review.openstack.org/#/c/301090/
14:53:10 <andreas_s> tdurakov, no - the big questions that need to be solved are on the Neutron side - once that is in place (hopefully at the summit) we can follow up
14:53:23 <tdurakov> andreas_s: acked
14:53:51 <mdbooth> johnthetubaguy: nm, you're already reviewing it
14:53:58 <tdurakov> any other specs?
14:54:39 <tdurakov> then we have 5 minutes left
14:54:54 <tdurakov> #topic Open discussion
14:55:04 <mdbooth> I put a meta-point on the agenda
14:55:11 <mdbooth> Should we have a separate sub-team for the libvirt driver?
14:55:47 <mdbooth> The overlap between this team and a libvirt driver sub-team would be significant, but not total.
14:55:56 <tdurakov> mdbooth: I'd bring this up on nova weekly tbh
14:56:13 <mdbooth> tdurakov: Ok, will do. Anybody here got a quick opinion on it, though?
14:56:28 <mdbooth> It would likely affect most people on this team.
14:56:46 <tdurakov> mdbooth: I do not mind that
14:58:03 <tdurakov> so, let's wrap up
14:58:11 <tdurakov> thanks everyone for coming
14:58:31 <tdurakov> #endmeeting