14:01:36 #startmeeting Nova Live Migration
14:01:37 Meeting started Tue Sep 20 14:01:36 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:39 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:41 The meeting name has been set to 'nova_live_migration'
14:01:55 o/
14:02:02 hi - sorry I'm a minute or two late
14:02:08 o/
14:02:15 o/
14:02:28 agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:50 actually there are no real items on the agenda, just headings
14:03:09 * kashyap waves
14:03:12 #topic CI
14:03:17 PaulMurray: I was going to add something to the agenda, but the lack of anything else on it made me suspect I had the wrong place
14:03:19 Anything for CI?
14:03:34 mdbooth, no worries
14:03:41 so I think there are patches up for grenade tests from pkoniszewski
14:03:54 johnthetubaguy, do you have links?
14:04:00 yeah, just looking
14:04:12 #link https://review.openstack.org/#/c/364809
14:04:48 I am hoping raj_singh can get some folks to help you with that, if needed
14:05:08 do we have any more news on that iscsi bug that breaks the tests sometimes?
14:05:32 no, I don't think so
14:05:49 It is something we need to get back to
14:06:29 OK, just making sure I am not missing out :)
14:07:23 is raj_singh around?
14:07:58 is raj_singh one of yours, johnthetubaguy?
14:08:02 it's probably still early in Texas, so not 100% sure
14:08:17 he is a senior dev on the OSIC team
14:08:32 ok, thanks
14:09:23 That grenade patch looks like something that is needed very soon
14:09:49 are we worried about the Newton release?
14:10:13 or just something for the future?
14:10:34 hmm, I think I saw some good bug reports and fixes go in, that makes me feel good about us catching the big things
14:10:53 the change to turn on tunnelling by default is probably the biggest gain in the release, ironically
14:11:45 ok, let's move on then - anything else for CI?
14:12:30 #topic Bugs
14:13:39 Anything to talk about here?
14:14:20 I think I saw that all the Newton blockers are dealt with or are not blockers
14:14:30 PaulMurray: Sorry for jumping in late, a quick question on https://bugs.launchpad.net/nova/+bug/1524898/
14:14:31 Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:14:37 yeah, that's all I have seen
14:14:39 I think at this stage in the release cycle we're only considering bugs which may result in immediate death, right?
14:15:02 erm, master is open for Ocata now, but just taking it steady so we don't cause issues with backports, I think
14:15:05 Since it's not reproducible anymore, tdurakov seems to have submitted this https://review.openstack.org/#/c/353002/ ("Skipping test_volume_backed_live_migration for live_migration job")
14:15:08 yeah - there is nothing on the list
14:15:13 of that type
14:15:24 kashyap, go ahead
14:15:33 immediate death or something akin to passing a stone
14:15:57 hmm, I thought that volume thing was still causing issues in the gate, it was always an occasional thing
14:16:07 mriedem: heh, love the definition
14:16:10 PaulMurray: Just checking if anyone is able to reproduce it, maybe it's only "occasionally" reproducible
14:16:20 johnthetubaguy: the volume-backed test is skipped in the voting live migration job
14:16:30 We tried to reproduce it in-house at HPE but failed
14:16:35 it's not skipped in the non-voting multinode tempest full job
14:16:59 Yeah, we've worked around it by skipping the test, but if someone hits it in production, they'll get back to this bug
14:16:59 which is why https://review.openstack.org/#/c/353002/ is just a regex skip in the nova tree for the dedicated live migration job
14:17:19 danpb thought it was something with the kernel, or native scsi I thought
14:17:48 there was talk of trying to reproduce with native in-qemu scsi, which danpb thought would at least give a better error message for what the root failure is, if it fails
14:17:49 mriedem, yeah, that was my understanding too
14:19:17 mriedem: Someone from the Cinder team should look at it
14:19:36 I've requested on the upstream channel with a few specific pointers, but only crickets... Anyway
14:19:58 This is where one of those "cross-project relations" should kick in
14:20:42 mriedem, is this something we can bring up at a PTL-to-PTL level?
14:20:58 seems like a kernel issue to be honest
14:21:12 kashyap: hemna was the person to look probably, but he's busy atm
14:21:17 from cinder i mean
14:21:37 mriedem, yes, I asked him a couple of times, didn't want to bother him more
14:21:50 Yeah, I hear you. But my guess is at least there's one other person who must be well-versed in iSCSI internals
14:21:55 kashyap: so,
14:22:02 but in our experience the problems have come from particular storage or network drivers
14:22:06 I did hear talk that it is specific to the iSCSI target and user being on the same box, but I am not sure about that
14:22:12 wouldn't the Red Hat guys be the best equipped to at least try a recreate with the in-qemu scsi?
14:22:16 o/
14:22:19 :-)
14:22:24 On a related topic, the functional gap between host and qemu iscsi is multipath. If we switch to qemu iscsi for the simplicity and lose multipath, do we think anybody would be upset?
14:22:27 johnthetubaguy: I thought we concluded it was an iscsi config problem
14:22:42 mriedem: Sure, I can try, once these two nasty bugs are off my plate
14:22:46 mdbooth: i don't think we test multipath anywhere
14:22:47 davidgiluk: I haven't heard anything that concrete, it's possible I guess
14:23:03 mriedem: Right, I think it's one of those theoretical features.
14:24:47 i know it breaks our london cloud running kilo + multipath :)
14:25:05 'what's different about storage on this cloud?' 'oh right, multipath...'
14:25:25 * mriedem runs to another meeting
14:25:36 Ironic Availability: when your primary source of downtime is your HA solution.
14:25:55 mriedem: it's real, that's a good datapoint
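(A note on the host vs in-qemu iSCSI point above: with the host-based attach, open-iscsi logs into the target on the compute node and, optionally, dm-multipath aggregates paths before the guest sees a plain block device; with QEMU's built-in initiator, QEMU speaks iSCSI itself, so there is no device-mapper layer to provide multipath. The sketch below contrasts the two libvirt disk shapes; the IQN, portal address, and device path are illustrative placeholders, not values from the bug.)

```python
# Illustrative only: the two libvirt <disk> shapes discussed above.
# The IQN, portal IP and /dev/disk path are made-up placeholders.
import xml.etree.ElementTree as ET

# Host-based attach: open-iscsi (and optionally dm-multipath) logs into the
# target on the compute host; the guest just sees a local block device.
HOST_ISCSI_DISK = """
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/disk/by-path/ip-192.0.2.10:3260-iscsi-iqn.2010-10.org.openstack:volume-0000-lun-1'/>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

# QEMU built-in initiator: QEMU talks iSCSI to the portal directly, so there
# is no host block device and no device-mapper layer to do multipathing.
QEMU_ISCSI_DISK = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='iscsi' name='iqn.2010-10.org.openstack:volume-0000/1'>
    <host name='192.0.2.10' port='3260'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

for label, snippet in (("host-based", HOST_ISCSI_DISK), ("in-qemu", QEMU_ISCSI_DISK)):
    disk = ET.fromstring(snippet)
    source = ET.tostring(disk.find("source")).decode().strip()
    print(label, disk.get("type"), source)
```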
14:26:38 While we're on bugs, maybe raise https://review.openstack.org/#/c/370089/
14:27:04 johnthetubaguy looked at this and was understandably uneasy about randomly deleting things with vague warnings attached.
14:27:36 Me too, btw. Wondering if anybody can add anything to it.
14:27:53 Have we got rid of nova-network yet?
14:27:58 No
14:28:01 Nearly, but no
14:28:08 because those multiple calls are needed for nova-network
14:28:18 Do you know that they are?
14:28:24 Or are you just trusting the comments?
14:28:34 ha - good question
14:28:49 Because without looking at the implementation, it seems weird that calling the same thing multiple times with the same arguments would have different results.
14:29:00 I looked a while back so I'm trusting to memory
14:29:09 and can't remember that clearly
14:29:30 However, if they really are required then we can't delete them, because nova-network isn't gone yet.
14:30:34 I think this was something like the dhcp stuff reads out of something that gets populated the first time around, but I don't remember the details
14:30:48 I think it's something to do with the sequence of events on the network side - the parameters are the same, but the state in the network isn't
14:31:21 PaulMurray: Do we know anybody who could say for sure?
14:31:48 Is it important to you for something?
14:31:50 I should go study the code, and try to get my memory back on that, not sure who has been in there recently
14:32:11 so post-copy is a lot easier if we don't call things one million times
14:32:15 I think it makes the next patch a bit simpler.
14:32:26 johnthetubaguy: Right
14:32:33 mdbooth: https://review.openstack.org/#/c/370089/ worries me for dvr
14:32:37 so i added swami
14:32:50 mriedem: DVR is the only situation I've tested it in :)
14:32:56 oh right
14:32:57 That one definitely works fine.
14:33:06 then it worries me for nova-network :)
14:33:13 i'm just constantly worried
14:33:13 :)
14:33:45 I would expect it to work for DVR because until swami's last patch went in these calls were a no-op for neutron
14:34:11 PaulMurray: Incidentally, that follow-on patch which just moves network calls to the source significantly decreases network downtime for DVR in the non-postcopy case
14:34:17 Can't remember if I mentioned that last time
14:34:27 I haven't studied exactly why, but that's what I observe
14:34:55 I saw your comments about that somewhere (IRC, patch, somewhere)
14:34:56 The 3-4 second gap you always get when live migrating is almost entirely eliminated.
14:35:16 I saw three pings dropped in those tests I did
14:35:19 so that agrees
14:36:02 Presumably means there's something in that code path which takes about that long. Don't know what it is, though.
14:37:04 So I don't know what you want to do about the multiple setup_networks_on_host() calls
14:37:20 should at least validate with someone about the nova-network side
14:37:28 or wait until nova-network goes away
14:37:30 ?
14:37:40 PaulMurray: nova-network seems to be where the unease is originating.
14:38:03 yep, I am pretty sure it's not relevant for neutron
14:38:07 If that code is maintained at all any more, it would be good if somebody could look at it authoritatively.
14:38:56 Meta note: looking at that comment, although it's great the author mentions 'dhcp', more detail would have been great. A bug# in the comment would have been fantastic.
14:39:01 maybe this needs an ML thread - don't know who to reach out to
14:39:09 * mdbooth will try to note that personally when adding hacks like this.
14:39:47 Anything else on bugs?
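(For readers following the https://review.openstack.org/#/c/370089/ discussion: the sketch below is a much-simplified, hypothetical illustration of the repeated setup_networks_on_host() calls being debated, not nova's actual code; the hook ordering follows the live migration flow described above, and the comments about Neutron and nova-network behaviour reflect the discussion rather than a verified reading of the tree.)

```python
# Hypothetical sketch of the call pattern under discussion; names mirror
# nova's compute manager hooks but the bodies are illustrative only.
class FakeNetworkAPI:
    """Stand-in for nova's network API; just records the duplicate calls."""
    def __init__(self):
        self.calls = []

    def setup_networks_on_host(self, instance, host, teardown=False):
        # Per the discussion: for Neutron these calls were historically close
        # to a no-op, while nova-network's first call may populate DHCP/host
        # state that the later, identical-looking calls depend on.
        self.calls.append((instance, host, teardown))


def live_migrate(network_api, instance, source, dest):
    # pre_live_migration: the destination sets up networking before the move.
    network_api.setup_networks_on_host(instance, dest)
    # ... guest memory/state is migrated here ...
    # post_live_migration: the source tears down its networking ...
    network_api.setup_networks_on_host(instance, source, teardown=True)
    # ... and the destination is set up again with the same arguments, which
    # is the repetition https://review.openstack.org/#/c/370089/ asks about.
    network_api.setup_networks_on_host(instance, dest)


api = FakeNetworkAPI()
live_migrate(api, "instance-0001", "compute1", "compute2")
print(api.calls)  # ('instance-0001', 'compute2', False) appears twice
```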
14:40:24 #topic Ocata Specs
14:40:51 Need to make sure any slipped specs get re-proposed
14:41:05 (I think there are a few of those)
14:41:16 Also any new work
14:41:31 Any questions?
14:42:11 next....
14:42:23 #topic Open Discussion
14:42:33 Anything else to raise this week?
14:42:53 * mdbooth wanted to raise https://etherpad.openstack.org/p/nova-newton-retrospective
14:43:06 go for it
14:43:16 If you haven't already, I'd encourage you to look and participate
14:44:01 I thought you summed it up pretty well
14:44:18 the long stack of patches is a problem
14:44:41 From my own pov, I've been advocating baby steps towards a subsystem maintainer model lately. I summed up my proposal in an ML post this morning: http://lists.openstack.org/pipermail/openstack-dev/2016-September/104093.html
14:44:46 I saw mriedem brought that up in his PTL candidacy message too
14:45:16 Would something like that have been useful to this team over the last cycle in practice?
14:45:25 Or are our pain points elsewhere?
14:45:52 have you seen the discussions about this in the past few years?
14:46:25 are you proposing that some people get +2 but not +A?
14:46:25 PaulMurray: Yes. Many times.
14:46:46 ccccccdlvjjtblfgnffchujbdhignbkklncbntbjlvhv
14:46:56 Urgh, sorry, it's my YubiKey
14:47:05 PaulMurray: For this meeting, I'm really interested to know if we think this or anything like it would have been useful to us.
14:47:40 PaulMurray: But in general, I wanted to propose something conservative which could be piloted without changing the universe.
14:48:13 It would be useful, but I think the consensus in the past has been that this is done informally anyway among the cores
14:48:39 PaulMurray: So you don't think it would be useful?
14:48:54 Because we do it already?
14:49:11 PaulMurray: I think what mdbooth is saying is that it may be done informally with some unspoken rules, so why not reasonably enforce it with some predictable rules
14:49:14 to be honest I am just repeating past arguments
14:49:19 (Hope I didn't misrepresent Matt's point)
14:49:46 * mdbooth is soliciting input, not advocating :)
14:49:56 PaulMurray: Yeah, I've heard those arguments.
14:50:03 kashyap, the same argument goes the other way - some +1s are taken as virtual +2s and some cores refrain from doing +A where they are not sure
14:50:20 as i've said before,
14:50:21 the real issue is bandwidth and attention
14:50:32 there are people i look for in certain subsystems for a +1
14:50:38 and add to changes
14:51:31 PaulMurray: Also that: there's simply no way a core (who is not an expert in that area) can meaningfully comment on a deep change involving, say, virt drivers. So for that, the subsystem model that is proposed makes sense, IMHO.
14:51:39 i've generally liked the idea of a dashboard where subteams can star changes that are important to them and they've reviewed - more like the etherpad but easier to process
14:51:51 PaulMurray: My hope is that by formalising a subsystem maintainer model we'd halve the number of core +2s required.
14:52:04 here is a perfect example of where the subsystem thing breaks down for a virt driver https://review.openstack.org/#/c/253666/
14:52:08 We'd also devolve responsibility.
14:52:15 not trying to pick on ^ but it just came up this morning
14:53:19 mriedem: I don't want to take core review out of the loop.
Baby steps :)
14:54:29 Anyway, I think it's important to participate in the retrospective. If my own proposal gains traction, then awesome.
14:54:31 mdbooth, I think your case is unusual because you are working in a very contained piece of code
14:54:53 and you are the expert
14:55:29 os-brick and os-win have been broken off into other projects for pretty much the same reason
14:55:32 I think
14:55:43 well,
14:55:51 those were broken off because of code duplication
14:55:57 They're a bit different, because the expertise was in other projects.
14:56:02 os-brick was literally code in both nova and cinder
14:56:14 os-win is hyper-v stuff that's in multiple projects, like utility code
14:56:23 similar with oslo.vmware
14:56:45 ok, so that was not the reason, but it still works ok
14:56:53 because it is separable
14:56:54 Anyway, I'm not proposing splitting the repo. I'm saying that if tdurakov gives a live migration patch a +2, we only need 1 mriedem to approve it.
14:57:59 Also, tdurakov should know when he needs additional review, and who to ask for it.
14:58:17 Exactly, FWIW, I fully agree with the above.
14:58:48 Sorry, now I really am advocating :)
14:58:55 I need to end the meeting
14:58:56 Trusting people to do the right thing, and there are always hammers like 'git revert'
14:59:04 it's nearly the top of the hour
14:59:09 * mdbooth will stop that. If anybody has anything useful to add, please respond on the ML.
14:59:12 this will have to continue on the ML thread
14:59:24 or the etherpad
14:59:28 sorry - bye
14:59:37 #endmeeting