14:01:36 <PaulMurray> #startmeeting Nova Live Migration
14:01:37 <openstack> Meeting started Tue Sep 20 14:01:36 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:39 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:41 <openstack> The meeting name has been set to 'nova_live_migration'
14:01:55 <mdbooth> o/
14:02:02 <PaulMurray> hi - sorry I'm a minute or two late
14:02:08 <davidgiluk> o/
14:02:15 <ltomasbo> o/
14:02:28 <PaulMurray> agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:50 <PaulMurray> actually there are no real items on the agenda, just headings
14:03:09 * kashyap waves
14:03:12 <PaulMurray> #topic CI
14:03:17 <mdbooth> PaulMurray: I was going to add something to the agenda, but the lack of anything else on it made me suspect I had the wrong place
14:03:19 <PaulMurray> Anything for CI ?
14:03:34 <PaulMurray> mdbooth, no worries
14:03:41 <johnthetubaguy> so I think there are patches up for grenade tests from pkoniszewski
14:03:54 <PaulMurray> johnthetubaguy, do you have links ?
14:04:00 <johnthetubaguy> yeah, just looking
14:04:12 <johnthetubaguy> #link https://review.openstack.org/#/c/364809
14:04:48 <johnthetubaguy> I am hoping raj_singh can get some folks to help you with that, if needed
14:05:08 <johnthetubaguy> do we have any more news on that iscsi bug that breaks the tests sometimes?
14:05:32 <PaulMurray> no, I don't think so
14:05:49 <PaulMurray> It is something we need to get back to
14:06:29 <johnthetubaguy> OK, just making sure I am not missing out :)
14:07:23 <PaulMurray> is raj_singh around ?
14:07:58 <PaulMurray> is raj_singh one of yours johnthetubaguy ?
14:08:02 <johnthetubaguy> it's probably still early in Texas, so not 100% sure
14:08:17 <johnthetubaguy> he is a senior dev on the OSIC team
14:08:32 <PaulMurray> ok, thanks
14:09:23 <PaulMurray> That grenade patch looks like something that is needed very soon
14:09:49 <PaulMurray> are we worried about newton release ?
14:10:13 <PaulMurray> or just something for the future ?
14:10:34 <johnthetubaguy> hmm, I think I saw some good bug reports and fixes go in, that makes me feel good about us catching the big things
14:10:53 <johnthetubaguy> the change to turn on tunnelling by default is probably the biggest gain in the release, ironically
14:11:45 <PaulMurray> ok, lets move on then - anything else for CI ?
14:12:30 <PaulMurray> #topic Bugs
14:13:39 <PaulMurray> Anything to talk about here ?
14:14:20 <PaulMurray> I think I saw that all the newton blockers are dealt with or not blockers
14:14:30 <kashyap> PaulMurray: Sorry for jumping in late, a quick question on https://bugs.launchpad.net/nova/+bug/1524898/
14:14:31 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:14:37 <johnthetubaguy> yeah, thats all I have seen
14:14:39 <mdbooth> I think at this stage in the release cycle we're only considering bugs which may result in immediate death, right?
14:15:02 <johnthetubaguy> erm, master is open for ocata now, but just taking it steady so we don't cause issues with backports, I think
14:15:05 <kashyap> Since it's not reproducible anymore, tdurakov seems to have submitted this https://review.openstack.org/#/c/353002/ ("Skipping test_volume_backed_live_migration for live_migration job")
14:15:08 <PaulMurray> yeah - there is nothing on the list
14:15:13 <PaulMurray> of that type
14:15:24 <PaulMurray> kashyap, go ahead
14:15:33 <mriedem> immediate death or something akin to passing a stone
14:15:57 <johnthetubaguy> hmm, I thought that volume thing was still causing issues in the gate, it was always an occasional thing
14:16:07 <johnthetubaguy> mriedem: heh, love the definition
14:16:10 <kashyap> PaulMurray: Just checking if anyone is able to reproduce it, maybe it's only "occasionally" reproducible
14:16:20 <mriedem> johnthetubaguy: the volume-backed test is skipped in the voting live migration job
14:16:30 <PaulMurray> We tried to reproduce it in house in HPE but failed
14:16:35 <mriedem> it's not skipped in the non-voting multinode tempest full job
14:16:59 <kashyap> Yeah, we've worked around it by skipping the test, but if someone hits it in production, they'll get back to this bug
14:16:59 <mriedem> which is why https://review.openstack.org/#/c/353002/ is just a regex skip in the nova tree for the dedicated live migration job
14:17:19 <mriedem> danpb thought it was something with the kernel, or native scsi i thought
14:17:48 <mriedem> there was talk of trying to reproduce with native in-qemu scsi, which danpb thought would at least give a better error message for what the root failure is, if it fails
14:17:49 <johnthetubaguy> mriedem, yeah, that was my understanding too
14:19:17 <kashyap> mriedem: Someone from Cinder team should look at it
14:19:36 <kashyap> I've requested on the upstream channel with a few specific pointers, but only crickets...Anyway
14:19:58 <kashyap> This is where one of those "cross-project relations" should kick in
14:20:42 <PaulMurray> mriedem, is this something we can bring up at a PTL to PTL level ?
14:20:58 <johnthetubaguy> seems like a kernel issue to be honest
14:21:12 <mriedem> kashyap: hemna was the person to look probably, but he's busy atm
14:21:17 <mriedem> from cinder i mean
14:21:37 <PaulMurray> mriedem, yes, I asked him a couple of times, didn't want to bother him more
14:21:50 <kashyap> Yeah, I hear you.  But my guess is at least there's another person who must be well-versed with iSCSI internals
14:21:55 <mriedem> kashyap: so,
14:22:02 <PaulMurray> but in our experience the problems have come from particular storage or network drivers
14:22:06 <johnthetubaguy> I did hear talk that it is specific to the iSCSI target and user being on the same box, but I am not sure about that
14:22:12 <mriedem> wouldn't the red hat guys be the best equipped to at least try a recreate with the in-qemu scsi?
14:22:16 <pkoniszewski> o/
14:22:19 <kashyap> :-)
14:22:24 <mdbooth> On a related topic, the functional gap between host and qemu iscsi is multipath. If we switch to qemu iscsi for the simplicity and lose multipath, do we think anybody would be upset?
14:22:27 <davidgiluk> johnthetubaguy: I thought we concluded it was an iscsi config problem
14:22:42 <kashyap> mriedem: Sure, I can try, once these two nasty bugs off my plate are gone
14:22:46 <mriedem> mdbooth: i don't think we test multipath anywhere
14:22:47 <johnthetubaguy> davidgiluk: I haven't heard anything that concrete, its possible I guess
14:23:03 <mdbooth> mriedem: Right, I think it's one of those theoretical features.
14:24:47 <mriedem> i know it breaks our london cloud running kilo + multipath :)
14:25:05 <mriedem> 'what's different about storage on this cloud?' 'oh right, multipath...
14:25:25 * mriedem runs to another meeting
14:25:36 <mdbooth> Ironic Availability: when your primary source of downtime is your HA solution.
14:25:55 <johnthetubaguy> mriedem: its real, thats a good datapoint
14:26:38 <mdbooth> While we're on bugs, maybe raise https://review.openstack.org/#/c/370089/
14:27:04 <mdbooth> johnthetubaguy looked at this and was understandably uneasy about randomly deleting things with vague warnings attached.
14:27:36 <mdbooth> Me too, btw. Wondering if anybody can add anything to it.
14:27:53 <PaulMurray> Have we got rid of nova-networking yet ?
14:27:58 <mdbooth> No
14:28:01 <mdbooth> Nearly, but no
14:28:08 <PaulMurray> because those multiple calls are needed for nova-network
14:28:18 <mdbooth> Do you know that they are?
14:28:24 <mdbooth> Or are you just trusting the comments?
14:28:34 <PaulMurray> ha - good question
14:28:49 <mdbooth> Because without looking at the implementation, it seems weird that calling the same thing multiple times with the same arguments would have different results.
14:29:00 <PaulMurray> I looked a while back so trusting to memory
14:29:09 <PaulMurray> and can't remember that clearly
14:29:30 <mdbooth> However, if they really are required then we can't delete them, because nova-networking isn't gone yet.
14:30:34 <johnthetubaguy> I think this was something like the dhcp stuff reads out of something that gets populated the first time around, but I don't remember the details
14:30:48 <PaulMurray> I think it's something to do with the sequence of events on the network side - the parameters are the same, but the state in the network isn't
14:31:21 <mdbooth> PaulMurray: Do we know anybody who could say for sure?
14:31:48 <PaulMurray> Is it important to you for something ?
14:31:48 <johnthetubaguy> I should go study the code and try to get my memory back on that, not sure who has been in there recently
14:32:11 <johnthetubaguy> so post copy is a lot easier if we don't call things one million times
14:32:15 <mdbooth> I think it makes the next patch a bit simpler.
14:32:26 <mdbooth> johnthetubaguy: Right
14:32:33 <mriedem> mdbooth: https://review.openstack.org/#/c/370089/ worries me for dvr
14:32:37 <mriedem> so i added swami
14:32:50 <mdbooth> mriedem: DVR is the only situation I've tested it in :)
14:32:56 <mriedem> oh right
14:32:57 <mdbooth> That one definitely works fine.
14:33:06 <mriedem> then it worries me for nova-network :)
14:33:13 <mriedem> i'm just constantly worried
14:33:13 <mdbooth> :)
14:33:45 <PaulMurray> I would expect it to work for DVR because until swami's last patch went in these calls were a no-op for neutron
14:34:11 <mdbooth> PaulMurray: Incidentally, that follow-on patch which just moves network calls to the source significantly decreases network downtime for DVR in the non-postcopy case
14:34:17 <mdbooth> Can't remember if I mentioned that last time
14:34:27 <mdbooth> I haven't studied exactly why, but that's what I observe
14:34:55 <PaulMurray> I saw your comments about that somewhere (IRC, patch, somewhere)
14:34:56 <mdbooth> The 3-4 second gap you always get when live migrating is almost entirely eliminated.
14:35:16 <PaulMurray> I saw three pings dropped in those tests I did
14:35:19 <PaulMurray> so that agrees
14:36:02 <mdbooth> Presumably means there's something in that code path which takes about that long. Don't know what it is, though.
14:37:04 <PaulMurray> So I don't know what you want to do about the multiple setup_networks_on_host() calls
14:37:20 <PaulMurray> should at least validate with someone about nova network side
14:37:28 <PaulMurray> or wait until nova network goes away
14:37:30 <PaulMurray> ?
14:37:40 <mdbooth> PaulMurray: nova-network seems to be where the unease is originating.
14:38:03 <PaulMurray> yep, I am pretty sure it's not relevant for neutron
14:38:07 <mdbooth> If that code is maintained at all any more, it would be good if somebody could look at it authoritatively.
14:38:56 <mdbooth> Meta note: looking at that comment, although it's great the author mentions 'dhcp', more detail would have been great. A bug# in the comment would have been fantastic.
14:39:01 <PaulMurray> maybe this needs an ML thread - don't know who to reach out to
14:39:09 * mdbooth will try to note that personally when adding hacks like this.
14:39:47 <PaulMurray> Anything else on bugs ?
14:40:24 <PaulMurray> #topic Ocata Specs
14:40:51 <PaulMurray> Need to make sure any slipped specs get re-proposed
14:41:05 <PaulMurray> (I think there's a few of those)
14:41:16 <PaulMurray> Also any new work
14:41:31 <PaulMurray> Any questions ?
14:42:11 <PaulMurray> next....
14:42:23 <PaulMurray> #topic Open Discussion
14:42:33 <PaulMurray> Anything else to raise this week ?
14:42:53 * mdbooth wanted to raise https://etherpad.openstack.org/p/nova-newton-retrospective
14:43:06 <PaulMurray> go for it
14:43:16 <mdbooth> If you haven't already, I'd encourage you to look and participate
14:44:01 <PaulMurray> I thought you summed it up pretty well
14:44:18 <PaulMurray> the long stack of patches is a problem
14:44:41 <mdbooth> From my own pov, I've been advocating baby steps towards a subsystem maintainer model lately. I summed up my proposal in a ml post this morning: http://lists.openstack.org/pipermail/openstack-dev/2016-September/104093.html
14:44:46 <PaulMurray> I saw mriedem brought that up in his PTL candidacy message too
14:45:16 <mdbooth> Would something like that have been useful to this team over the last cycle in practice?
14:45:25 <mdbooth> Or are our pain points elsewhere?
14:45:52 <PaulMurray> have you seen the discussions about this in the past few years?
14:46:25 <PaulMurray> are you proposing that some people get +2 but not +A?
14:46:25 <mdbooth> PaulMurray: Yes. Many times.
14:46:46 <kashyap> ccccccdlvjjtblfgnffchujbdhignbkklncbntbjlvhv
14:46:56 <kashyap> Urgh, sorry, it's my YubiKey
14:47:05 <mdbooth> PaulMurray: For this meeting, I'm really interested to know if we think this or anything like it would have been useful to us.
14:47:40 <mdbooth> PaulMurray: But in general, I wanted to propose something conservative which could be piloted without changing the universe.
14:48:13 <PaulMurray> It would be useful, but I think the consensus in the past has been that this is done informally anyway among the cores
14:48:39 <mdbooth> PaulMurray: So you don't think it would be useful?
14:48:54 <mdbooth> Because we do it already?
14:49:11 <kashyap> PaulMurray: I think what mdbooth is saying is that, if it's already done informally with some unspoken rules, why not reasonably enforce it with some predictable rules
14:49:14 <PaulMurray> to be honest I am just repeating past arguments
14:49:19 <kashyap> (Hope I didn't misrepresent Matt's point)
14:49:46 * mdbooth is soliciting input, not advocating :)
14:49:56 <mdbooth> PaulMurray: Yeah, I've heard those arguments.
14:50:03 <PaulMurray> kashyap, the same argument goes the other way - some +1 are taken as virtual +2s and some cores refrain from doing +A where they are not sure
14:50:20 <mriedem> as i've said before,
14:50:21 <PaulMurray> the real issue is bandwidth and attention
14:50:32 <mriedem> there are people i look for in certain subsystems for a +1
14:50:38 <mriedem> and add to changes
14:51:31 <kashyap> PaulMurray: Also that: there's simply no way a core (that is not an expert in that area) can meaningfully comment on a deep change involving say e.g. virt drivers.  So for that, the subsystem model that is proposed makes sense, IMHO.
14:51:39 <mriedem> i've generally liked the idea of a dashboard where subteams can star changes that are important to them and they've reviewed - more like the etherpad but easier to process
14:51:51 <mdbooth> PaulMurray: My hope is that by formalising a subsystem maintainer model we'd halve the number of core +2s required.
14:52:04 <mriedem> here is a perfect example of where the subsystem thing breaks down for a virt driver https://review.openstack.org/#/c/253666/
14:52:08 <mdbooth> We'd also devolve responsibility.
14:52:15 <mriedem> not trying to pick on ^ but it just came up this morning
14:53:19 <mdbooth> mriedem: I don't want to take core review out of the loop. Baby steps :)
14:54:29 <mdbooth> Anyway, I think it's important to participate in the retrospective. If my own proposal gains traction, then awesome.
14:54:31 <PaulMurray> mdbooth, I think your case is unusual because you are working in a very contained piece of code
14:54:53 <PaulMurray> and you are the expert
14:55:29 <PaulMurray> os-brick and os-win have been broken off into other projects for pretty much the same reason
14:55:32 <PaulMurray> I think
14:55:43 <mriedem> well,
14:55:51 <mriedem> those were broken off because of code duplication
14:55:57 <mdbooth> They're a bit different, because the expertise was in other projects.
14:56:02 <mriedem> os-brick was literally code in both nova and cinder
14:56:14 <mriedem> os-win is hyper-v stuff that's in multiple projects, like utility code
14:56:23 <mriedem> similar with oslo.vmware
14:56:45 <PaulMurray> ok, so that was not the reason, but it still works ok
14:56:53 <PaulMurray> because it is separable
14:56:54 <mdbooth> Anyway, I'm not proposing splitting the repo. I'm saying that if tdurakov gives a live migration patch a +2, we only need 1 mriedem to approve it.
14:57:59 <mdbooth> Also, tdurakov should know when he needs additional review, and who to ask for it.
14:58:17 <kashyap> Exactly, FWIW, I fully agree with the above.
14:58:48 <mdbooth> Sorry, now I really am advocating :)
14:58:55 <PaulMurray> I need to end the meeting
14:58:56 <kashyap> Trusting people to do the right thing, and there's always hammers like 'git revert'
14:59:04 <PaulMurray> it's nearly the top of the hour
14:59:09 * mdbooth will stop that. If anybody has anything useful to add, please respond on ML.
14:59:12 <PaulMurray> this will have to continue on the ML thread
14:59:24 <mdbooth> or etherpad
14:59:28 <PaulMurray> sorry - bye
14:59:37 <PaulMurray> #endmeeting