14:00:17 <tdurakov> #startmeeting Nova Live Migration
14:00:21 <openstack> Meeting started Tue Sep 27 14:00:17 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:24 <tdurakov> hi everyone
14:00:25 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:42 * kashyap waves
14:00:54 <mrhillsman> o/
14:00:56 <mriedem> o/
14:00:57 <mdbooth> o/
14:00:57 <raj_singh> o/
14:01:09 <davidgiluk> o/
14:01:11 <ltomasbo> o/
14:01:13 <tdurakov> agenda for the meeting https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration#Agenda_for_next_meeting
14:02:00 <tdurakov> #topic CI
14:03:04 <tdurakov> let me start, I'm working in background on the patch to re-enable ceph for the tempest l-m job, so will notify, once ready
14:03:47 <mriedem> fyi on ceph
14:03:53 <mriedem> we're hitting some other issues with the ceph job on xenial
14:04:00 * johnthetubaguy lurks
14:04:14 <mriedem> https://bugs.launchpad.net/cinder/+bug/1627220
14:04:16 <openstack> Launchpad bug 1627220 in devstack-plugin-ceph "ceph: test_volume_boot_pattern fails with "ImageNotFound: error protecting snapshot"" [Undecided,In progress]
14:04:22 <tdurakov> mriedem: so, we are blocked on *both* backends
14:04:32 <mriedem> tdurakov: well, ^ might not impact live migration tests
14:04:38 <mriedem> it only seems to blow up tests that use snapshots
14:04:55 <mdbooth> mriedem: Does that impact nova? Bug seems cindery.
14:05:07 <tdurakov> mriedem then it should not
14:05:09 <mriedem> it impacts everything in the ceph job
14:05:16 <mriedem> but the bug isn't in nova
14:05:23 <tdurakov> you are right
14:05:26 <mriedem> there is a guy from red hat with a patch for the devstack plugin
14:05:27 <mdbooth> mriedem: Gotcha
14:05:31 <mriedem> sounds like a regression in the version of ceph in xenial
14:06:20 <tdurakov> any updates on grenade job?
14:06:46 <tdurakov> pkoniszewski: hi, are you around?
14:08:17 <tdurakov> anyone else has updates on that?
14:09:03 <tdurakov> seems no
14:09:08 <johnthetubaguy> I don't, I know raj_singh is hoping to take a look at that, if pkoniszewski doesn't get there
14:09:22 <mriedem> i know the patch, sec
14:09:28 <johnthetubaguy> yeah, the patch is up
14:09:33 <mriedem> https://review.openstack.org/#/c/364809/
14:09:39 <mriedem> i had reviewed it
14:09:40 <raj_singh> yes I pinged pkoniszewski if he needs any help, have not heard back
14:09:54 <mriedem> i have some concerns about it,
14:09:55 <johnthetubaguy> #link https://review.openstack.org/#/c/364809
14:10:09 <mriedem> we should probably test that in a d-g WIP patch first to see if it works
14:10:10 <johnthetubaguy> mriedem: didn't you have a patch at one point that helped with that?
14:10:16 <mriedem> b/c by default grenade jobs only run smoke tests
14:10:24 <mriedem> and the live migration tests aren't smoke tests
14:10:31 <johnthetubaguy> mriedem: ah, rather than a brand new thing
14:10:34 <tdurakov> mriedem: right
14:10:34 <mriedem> johnthetubaguy: they are just one-offs
14:10:52 <tdurakov> mriedem: how could we test it without merging initial job definition?
14:10:56 <mriedem> johnthetubaguy: basically you take the variables and shell stuff from https://review.openstack.org/#/c/364809/5/jenkins/jobs/devstack-gate.yaml and put that into a d-g change
14:11:34 <mriedem> tdurakov: like this https://review.openstack.org/#/c/362441/
14:12:23 <tdurakov> mriedem: will it help?
14:12:23 <johnthetubaguy> raj_singh: could you get someone to try that out?
14:12:30 <johnthetubaguy> nice to know it would work
14:12:51 <tdurakov> I think there will be no such job registered in zuul, am I wrong??
14:13:03 <raj_singh> johnthetubaguy: yes sure
14:13:27 <johnthetubaguy> so modify the existing job, in a DNM path, depend on that on some nova patch, and we test out to see if grenade + live-migrate works
14:13:50 <mriedem> tdurakov: the d-g change would test it
14:14:07 <mriedem> project-config defines the job and just passes variables to d-g
14:14:08 <tdurakov> mriedem: so, existing job for now, as a debug/PoC?
14:14:15 <mriedem> we can WIP a d-g change to hardcode those same vars to test things
14:14:17 <mriedem> yeah
14:14:25 <mriedem> there is a multinode grenade job that runs on d-g already
14:14:36 <tdurakov> got it, yeah this will work
14:15:14 <mriedem> the post_test_hook is something i haven't figured out how to stub in d-g yet
14:15:29 <mriedem> actually maybe i have https://review.openstack.org/#/c/377113/2/devstack-vm-gate-wrap.sh
14:16:04 <tdurakov> mriedem: l-m job we have is based on post-test hook
14:16:12 <mriedem> yeah i know
14:16:22 <mriedem> so, i can take a todo to try and run this through d-g today
14:16:26 <mriedem> at least start it
14:16:37 <johnthetubaguy> oh, nice, just redefine it to call us
14:16:45 <johnthetubaguy> cheeky
14:17:04 <tdurakov> +1, p.s I've commented that patch
14:17:13 <tdurakov> so let's use one directory for such hooks
14:18:26 <tdurakov> do we have anything else on CI?
14:19:00 <tdurakov> let's go next then
14:19:10 <tdurakov> #topic Bugs
14:21:09 <tdurakov> has anyone seen https://review.openstack.org/#/c/375644/ ?
14:21:43 <mriedem> nope
14:21:49 <mriedem> there isn't a bug associated with that either
14:22:22 <mdbooth> mriedem: It's not really a bug
14:22:48 <tdurakov> mdbooth: what is it then?
14:22:53 <tdurakov> looks like a bug
14:23:03 <mdbooth> tdurakov: It's a statement
14:23:42 <tdurakov> mdbooth: do we need to backport that 'statement'?
14:24:06 <mdbooth> I'm not convinced I agree it's a good change, tbh
14:24:20 <mdbooth> I'd want input from danpb on this
14:25:01 <tdurakov> so, please review that
14:26:06 <tdurakov> there are also several new bugs
14:27:26 <mdbooth> tdurakov: I've asked danpb to look.
14:28:43 <tdurakov> mdbooth: thank you
14:28:56 <tdurakov> there are also 4 new/confirmed bugs for live-migration
14:29:24 <tdurakov> https://bugs.launchpad.net/nova/+bugs?field.searchtext=&orderby=-importance&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&assignee_option=none&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=liberty-backport-potential+&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used
14:29:24 <tdurakov> =&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&search=Search
14:30:22 <mdbooth> tdurakov: Could you paste the 4 bug links?
14:30:32 <davidgiluk> and/or a short URL for the search
14:30:35 <mdbooth> Right
14:31:38 * tdurakov love launchpad
14:31:39 <tdurakov> https://bugs.launchpad.net/nova/+bug/1573875
14:31:41 <openstack> Launchpad bug 1573875 in OpenStack Compute (nova) "The same ceph rbd device is used by multiple instances" [Undecided,New]
14:31:51 <tdurakov> https://bugs.launchpad.net/nova/+bug/1573944
14:31:53 <openstack> Launchpad bug 1573944 in OpenStack Compute (nova) "target-lun id of volume changed when live-migration failed" [Undecided,New] - Assigned to Xuanzhou Perry Dong (oss-xzdong)
14:32:01 <tdurakov> https://bugs.launchpad.net/nova/+bug/1583107
14:32:02 <openstack> Launchpad bug 1583107 in OpenStack Compute (nova) "live-migration abort parameters are not honored" [Undecided,New]
14:32:09 <tdurakov> https://bugs.launchpad.net/nova/+bug/1583145
14:32:11 <openstack> Launchpad bug 1583145 in OpenStack Compute (nova) "live-migration monitoring is not working properly" [Undecided,New]
14:32:26 <mdbooth> tdurakov: From the first bug: "We use Kilo from Ubuntu cloud archive"
14:32:30 <mdbooth> Can we close that one?
14:33:03 <tdurakov> it looks like a duplicate https://bugs.launchpad.net/nova/+bug/1419577
14:33:04 <openstack> Launchpad bug 1419577 in OpenStack Compute (nova) "when live-migrate failed, lun-id couldn't be rollback in havana" [High,In progress] - Assigned to Lee Yarwood (lyarwood)
14:33:29 <tdurakov> second one
14:34:09 <tdurakov> mdbooth: I'll move first to incomplete and ask to check on newer version
14:34:35 <mdbooth> tdurakov: Further down they mention mitaka
14:34:54 <tdurakov> #action to triage bugs above
14:35:20 <tdurakov> let's do it after meeting and go to the next topic
14:35:27 <tdurakov> #topic specs
14:36:05 <tdurakov> I've walked through latest submitted/updated specs
14:36:12 <tdurakov> so added 2 in agenda
14:36:53 <tdurakov> if you have any other specs, please update agenda
14:37:13 <mdbooth> Related, I need to re-propose persistent instance metadata
14:37:25 <tdurakov> so, first one: https://review.openstack.org/#/c/347161/
14:37:29 * mdbooth hasn't quite gotten round to it yet, though
14:38:04 <tdurakov> paul-carlton2: hi, are you around?
14:38:10 <mdbooth> So, I have a point related to that spec
14:38:39 <mdbooth> paul-carlton2: Why is that APIImpact, btw?
14:38:52 <mdbooth> That's not it, anyway.
14:39:00 <paul-carlton2> yep
14:39:47 <paul-carlton2> api impact because attempting to live migrate a rescued instance has different outcome
14:40:01 <mdbooth> So, currently the libvirt driver assumes that the db is canonical. i.e. At any point we can throw away libvirt xml and regenerate it.
14:40:13 <mdbooth> This spec follows that model, so it's fine.
14:40:23 <paul-carlton2> it will return the same sort of error for XenAPI hypervisor but on libvirt it will migrate
14:40:24 <tdurakov> paul-carlton2: I agree with mdbooth on that, it's just kind of fix, no api change actually
14:41:10 <paul-carlton2> I'm happy to remove it, pretty sure I was told to add it on the mitaka version of this, will check who by
14:41:17 <mdbooth> However, I was recently thinking about this assumption in a few different contexts. We (RH) got a customer bug about disk ordering changing after a reboot.
14:41:43 <mdbooth> At that point, I started thinking if that's something we could fix by persisting this info in persistent instance metadata.
14:42:06 <tdurakov> paul-carlton2: yes, please, I'll leave a comment on that either
14:42:23 <mdbooth> Then I got to wondering if we're just going to end up tying ourselves in knots, and we should update everything which currently assumes that the db is canonical and switch to libvirt being canonical.
14:42:57 <mdbooth> The reason I bring it up here specifically is that it's relevant to this spec.
14:43:02 <mdbooth> paul-carlton2: Any thoughts ^^^ ?
14:44:03 <paul-carlton2> I think nova db should be the authority
14:44:05 <mdbooth> So, if libvirt domain xml was canonical, we would need to separately store storage format or location. Also we would never lose device ordering, address assignment.
14:44:28 <paul-carlton2> we rebuild libvirt stuff from nova db in hard reboot rebuild etc
14:44:38 <mdbooth> paul-carlton2: Right. That's something we'd fix.
14:44:39 <paul-carlton2> gets us out of jail sometimes
14:44:46 <mdbooth> There's probably about 4 places we'd need to fix.
14:44:50 <tdurakov> domain.xml shouldn't be source of truth
14:45:02 <mdbooth> tdurakov: I'm arguing against that.
14:45:24 <paul-carlton2> mdbooth, it is turning stuff on its head and db should be the authority for all hypervisors
14:45:34 <mdbooth> domain.xml *isn't* canonical source of truth. I'm arguing it might be easier if it was, though.
14:46:03 <mdbooth> paul-carlton2: It means we don't need to store driver-specific detail to the Nth degree in the db.
14:46:05 <tdurakov> mdbooth: there is a bug on hard_reboot instances with config drive, and the way nova rebuilds domain.xml makes things tricky
14:46:10 <paul-carlton2> but that would only impact libvirt, other hypervisors might be different
14:46:15 <mdbooth> See above around address assignment, device ordering.
14:46:41 <mdbooth> tdurakov: Right. I'm suggesting nova should never rebuild domain.xml.
14:47:01 <mdbooth> Unless doing an actual rebuild, eg for evacuate.
14:47:03 <paul-carlton2> trouble is we can't assume what current or future hypervisors can do
14:47:16 <mdbooth> paul-carlton2: Exactly, that's libvirt's problem.
14:47:39 <paul-carlton2> better to store everything you need in db and build instance, whatever the hypervisor, from that
14:48:01 <mdbooth> paul-carlton2: But then you implicitly *lose* all the hypervisor's new features unless you also add them to Nova.
14:48:08 <mdbooth> We're duplicating libvirt functionality in Nova.
14:48:18 <mdbooth> Unless we don't, in which case we lose that functionality.
14:48:20 <paul-carlton2> I have no problem with some of the info in the db being hypervisor specific, like your new object
14:48:30 <mdbooth> e.g. persistent device addressing, device ordering.
14:48:53 <mdbooth> If we want this stuff to work under the direction we're currently taking, we need to implement everything libvirt does in Nova.
14:48:56 <paul-carlton2> all can be stored in a hypervisor specific object in db
14:49:07 <mdbooth> If we just create a domain, then use libvirt to manipulate it.
14:53:06 <tdurakov> please start ml thread for that, unfortunately we are limited in time on that meeting, let's switch to the next
14:53:06 <mdbooth> libvirt is responsible for maintaining that and updating it.
14:53:08 <paul-carlton2> by the way Andrea Rosa said add apiimpact to rescued migration
14:53:08 <mdbooth> The other reason I bring it up is it would obsolete my persistent instance storage metadata spec
14:53:08 <tdurakov> another spec: https://review.openstack.org/#/c/301090/
14:53:08 <tdurakov> andreas_s: I think it's yours, right?
14:53:08 <paul-carlton2> he has left openstack work now so guess I can pull that if I want but you actually think he was right
14:53:08 <andreas_s> tdurakov, yes
14:53:09 <andreas_s> we're currently trying to get things on neutron side clear
14:53:09 <andreas_s> https://review.openstack.org/309416
14:53:09 <tdurakov> haven't read it yet, do you have anything to discuss right now on the spec
14:53:09 <andreas_s> that's why work on the Nova spec was laid aside for a while
14:53:09 <mdbooth> johnthetubaguy: Did I see a spec from you about neutron apis, or am I getting confused with the cinder one? ^^^
14:53:10 <tdurakov> #action review https://review.openstack.org/#/c/301090/
14:53:10 <andreas_s> tdurakov, no - the big questions are on Neutron side that need to be solved - once that is in place (hopefully at the summit) we can follow up
14:53:23 <tdurakov> andreas_s: acked
14:53:51 <mdbooth> johnthetubaguy: nm, you're already reviewing it
14:53:58 <tdurakov> any other specs?
14:54:39 <tdurakov> then we have 5 minutes and
14:54:54 <tdurakov> #topic Open discussion
14:55:04 <mdbooth> I put a meta-point on the agenda
14:55:11 <mdbooth> Should we have a separate sub-team for the libvirt driver?
14:55:47 <mdbooth> The overlap between this team and a libvirt driver sub-team would be significant, but not total.
14:55:56 <tdurakov> mdbooth: I'd bring this up on nova weekly tbh
14:56:13 <mdbooth> tdurakov: Ok, will do. Anybody here got a quick opinion on it, though?
14:56:28 <mdbooth> It would likely affect most people on this team.
14:56:46 <tdurakov> mdbooth: I do not mind on that
14:58:03 <tdurakov> so, let's wrap up
14:58:11 <tdurakov> thanks everyone for coming
14:58:31 <tdurakov> #endmeeting