21:03:32 <ttx> #startmeeting project
21:03:32 <dhellmann> o/
21:03:33 <markmcclain> o/
21:03:33 <openstack> Meeting started Tue Dec 17 21:03:32 2013 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:36 <openstack> The meeting name has been set to 'project'
21:03:44 <devananda> o/
21:03:44 <markwash> o/
21:03:46 <ttx> Agenda for today:
21:03:48 <ttx> #link http://wiki.openstack.org/Meetings/ProjectMeeting
21:03:51 <kgriffs> o/
21:03:51 <stevebaker> \o
21:03:58 <russellb> o/
21:04:05 <ttx> #topic Icehouse-2 roadmap
21:04:05 <SergeyLukjanov> o/
21:04:23 <ttx> All looks good from our 1:1s
21:04:43 <ttx> we'll skip the next two meetings
21:04:56 <ttx> and check back on progress at the Jan 7th meeting
21:05:02 <hub_cap> bam
21:05:26 <ttx> #topic Gate checks (notmyname)
21:05:33 <notmyname> hello
21:05:36 <lifeless> hello!
21:05:40 <ttx> notmyname: hi! care to introduce the topic ?
21:05:51 <notmyname> here's where we start:
21:05:53 <notmyname> I've been hearing (and experiencing) some major frustration at the amount of effort it takes to get stuff through the gate queue
21:06:14 <notmyname> in some cases, it takes days of rechecks. other times, it's merely a dozen hours or so
21:06:25 <notmyname> so I started using the stats to graph out what's happening
21:06:29 <notmyname> http://not.mn/gate_status.html
21:07:05 <notmyname> and the end result, as shown on the graph above, is that we've got about a 60-70% chance of failure for gate jobs, just based on nondeterministic bugs
21:07:20 <jog0> notmyname: we also wedged the gate twice in less than 3 months
21:07:25 <notmyname> this means that any patch that tries to land has a pretty poor chance of actually passing
21:07:52 <notmyname> note that over the last 14 days, there are 9 days where a coin flip would have given you better odds on the top job in the gate passing
21:08:04 <russellb> I feel like folks like jog0 and sdague have done a nice job watching this status and raising extra awareness for important issues
21:08:16 <russellb> there's plenty of room for more attention to some of the bugs, though, for sure
21:08:25 <russellb> notmyname: but do you have anything in particular you'd like to propose?
21:08:28 <notmyname> so I want to do 2 things
21:08:42 <notmyname> (1) raise awareness of the issue (now with real data!)
21:08:53 <notmyname> (2) propose some ideas to fix it
21:09:02 <russellb> i feel like everyone has been very aware already :-) ... but your graph is neat
21:09:03 <notmyname> which leads to other ideas, I hope
21:09:30 <notmyname> so for (1), I claim that a 60% pass chance for gate jobs is unacceptable
21:09:39 <dolphm> ++
21:09:43 <markwash> +1
21:09:48 <david-lyle_> +1
21:10:01 <russellb> i don't think anyone is going to argue with failures being bad
21:10:04 <notmyname> and I have 3 proposals of how we can potentially still move forward with day-to-day dev work
21:10:04 <dolphm> can we gate on pass chance? :P
21:10:12 <jog0> russellb: I would disagree with me doing a good job of raising awareness and watching status. we haven't been able to get the baseline low enough and get enough bugs fixed. we have been able to track how bad it is and prioritize, but that isn't enough
21:10:32 <russellb> jog0: OK, well just trying to give props where it's due for those working extra hard on things
21:10:34 <sdague> dolphm: only if we can take people's +2 away from them for a week when they push a 100% guaranteed-to-fail change to the gate :)
21:10:37 <russellb> your reports help me
21:10:49 <dolphm> sdague: where do we sign people up
21:11:04 <dhellmann> jog0: yeah, +1 to what russellb said, don't knock yourself for not having super powers
21:11:07 <notmyname> russellb: yes, I agree that the -infra team has done a great job triaging things when they get critical. but let's not stay there (as we have been)
21:11:09 <sdague> which was actually a huge part of the issue the last 4 days with all the grizzly changes
21:11:11 <notmyname> first idea: multi-gate-queue
21:11:25 <notmyname> in this case, instead of having one gate queue, have 3
21:11:34 <notmyname> Have N gate queues (for this example, let's use 3). In gate A, run all the patches like today. In gate B, run all but the top patch. In gate C, run all but the top 2. This way, if gate A fails, you already have a head start on the rechecks (and same for B->C). If gate A passes, then throw away the results of B and C.
21:11:46 <notmyname> this is a pessimistic version of what we have today
21:12:25 <markwash> sdague: I would love to drill down on that past your warranted frustrations
21:12:34 <notmyname> idea two: cut down on what's tested
21:12:51 <jeblair> notmyname: i would be happy to have zuul start exploring alternate scenarios sooner, even ones heuristically based on observed conditions like job failure rates
21:13:10 <jeblair> notmyname: that's not a simple change, so it'd be great if someone wants to sign up to dev that.
21:13:16 <jog0> proposal 1 doesn't help get things to merge, it just gets them to merge faster
21:13:20 <notmyname> in this case, there is no need to test the same code for both postgres and mysql functionality (or normal and large ops) if the patch doesn't affect those at all
21:13:21 <jog0> or fail faster
21:13:40 <notmyname> jog0: correct. things eventually merge today
21:13:40 <jeblair> jog0: i agree with that.
21:13:50 <notmyname> where eventually is really long
21:14:03 <dolphm> jog0: faster dev cycle is always appreciated, at least
21:14:09 <portante> and seems too long
21:14:10 <notmyname> for idea two, I'm proposing that the set of things that are tested is winnowed down
21:14:30 <russellb> i'm -1 on testing things less in general ... if things fail, they're broken, and should just be fixed
21:14:41 <russellb> i don't think the answer to failures is to do less testing
21:14:42 <jog0> notmyname: I am much more concerned about false gate failures than gate delay. if you fix false gate failures you fix gate delay too
21:14:50 <notmyname> eg why test postgres and mysql functionality in neutron for a glance client test?
21:14:55 <jeblair> notmyname: one of the benefits of running extra jobs -- even ones that don't seem to be needed (testing mysql/pg) is that we do hit nondeterministic failures more often
21:15:05 <markmcclain> I think testing less items is a bad idea too
21:15:18 <notmyname> in all cases, the nondeterministic bugs need to be squashed
21:15:18 <mordred> the gate issues are actual openstack bugs
21:15:27 <jeblair> notmyname: neutron was in a bad state for a while because it only ran 1 test whereas everyone else ran 6; it was way more apt to fail changes
21:15:34 <jog0> I would rather make it harder to get the gate to pass than have these nondeterministic failures leak out into the releases for users to experience
21:15:37 <sdague> notmyname: yeh, we invented more jobs for neutron for exactly that case
21:16:05 <markwash> to notmyname's point, though... we just recheck through those failures of actual nondeterministic bugs mostly, do we not?
21:16:08 <dolphm> jog0: so you're opposed to option 2?
21:16:14 <sdague> and I agree that race conditions need to be stomped out
21:16:19 <dolphm> jog0: err, idea 2
21:16:23 <markwash> rechecking is just *slow* ignoring
21:16:23 <jog0> dolphm: very much so, we need more tests
21:16:25 <notmyname> but the point is, if neutron jobs are still failing a lot, then they don't need to be run for every code repo
21:16:33 <notmyname> s/neutron/whatever/
21:16:37 <david-lyle_> all projects are gated on those failures, related or not
21:16:40 <lifeless> uhh
21:16:43 <jog0> markwash: that is a problem
21:16:44 <lifeless> I don't follow your logic
21:16:45 <sdague> markwash: you need to stop thinking about those as non deterministic, they are race conditions
21:17:00 <torgomatic> it's not a matter of "run less things because they fail", it's a matter of "run less things because they're not needed"
21:17:10 <markwash> sdague: agreed, both to me carry the same level of badness (high)
21:17:11 <jeblair> notmyname: neutron even got so bad that we pulled it out of the integrated gate -- it pretty much _instantly_ fully broke
21:17:17 <mordred> ++
21:17:29 <dolphm> what's the realistic maximum number of changes openstack has ever seen merge cleanly in succession?
21:17:34 <notmyname> torgomatic phrased it better than I was doing
21:17:36 <dolphm> 4? 5?
21:17:40 <sdague> dolphm: 20+
21:17:41 <dhellmann> torgomatic: but they *are* needed because the failures don't occur all of the time, so we need as many examples of failures as possible to debug
21:17:42 <jog0> dolphm: I saw 10 recently
21:17:42 <torgomatic> like, does keystone really need the gate job with neutron-large-ops? I don't think you can break Keystone in such a way as to only hose the large ops jobs
21:17:43 <dolphm> sdague: wow
21:17:48 <ttx> dolphm: I witnessed 25 myself
21:17:51 <lifeless> notmyname: if we don't run it, and there is any dependency in that thing on the other projects we let change, we have asymmetric gating.
21:17:52 <sdague> it's not been a good couple of weeks
21:17:53 <jeblair> notmyname: so we've learned that with no testing, real solid bugs (as opposed to transient ones) land almost immediately in repo.
21:17:59 <jog0> torgomatic: yes it does
21:18:03 <sdague> we also had a lot of external events in these 2 weeks
21:18:06 <lifeless> notmyname: asymmetric gating is a great way to wedge another project entirely, instantly.
21:18:06 <ttx> dolphm: granted, it was full moon outside.
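A quick back-of-the-envelope model of the numbers quoted above: a 60-70% failure chance per gate run, roughly six integrated jobs per change, yet observed streaks of 20-25 clean merges on good days. It assumes failures are independent and identically distributed, which the rest of the discussion shows is not really true (failures cluster around specific race bugs and external events), so the near-zero streak probability it predicts mostly illustrates how much the failure rate swings from day to day. The functions and numbers below are illustrative only, not anything from the meeting's tooling.

    # Rough arithmetic behind the failure rates and merge streaks discussed above.
    # Assumes each gate run fails independently with probability p_fail due to
    # transient bugs -- a simplification, since real failures cluster in time.

    def chance_of_clean_streak(p_fail, streak):
        """Probability that `streak` consecutive changes all pass the gate."""
        return (1.0 - p_fail) ** streak

    def per_job_false_failure_rate(p_fail_total, num_jobs):
        """Per-job false failure rate that would compound into the observed
        overall failure rate, if num_jobs independent jobs must all pass."""
        return 1.0 - (1.0 - p_fail_total) ** (1.0 / num_jobs)

    if __name__ == "__main__":
        for p in (0.6, 0.7):
            print("overall failure %.0f%% -> odds of 20 clean merges in a row: %.8f"
                  % (p * 100, chance_of_clean_streak(p, 20)))
        # With ~6 integrated jobs per change, even a modest per-job false
        # failure rate compounds into a large overall failure rate.
        print("per-job rate implying 60%% overall failure across 6 jobs: %.1f%%"
              % (per_job_false_failure_rate(0.6, 6) * 100))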
21:18:18 <sdague> sphinx, puppetlabs repo, jenkins splode
21:18:19 <jog0> both nova and neutron use keystone so it can break neutron-large-ops
21:18:24 <mordred> yup. we've seen that almost every time we've had asymmetric gating
21:18:29 <torgomatic> jog0: maybe a bad example, then, but there are other cases where the difference between two gate jobs has 0 effect on the patch being tested
21:18:34 <dhellmann> I would rather spend the effort needed to figure out which subset of all our tests need to be run for any given change on fixing these race conditions themselves
21:18:48 <sdague> dhellmann: +1
21:18:48 <russellb> dhellmann: +100
21:18:53 <markmcclain> dhellmann: +1
21:19:07 <jeblair> dhellmann: +1
21:19:16 <jog0> dhellmann: amen!
21:19:16 * jd__ nods
21:19:22 <mordred> dhellmann: ++
21:19:26 <notmyname> ok, so option 3: enforce strong SAO-style interaction between projects
21:19:32 <sdague> hey, look we even have a reasonable curated list - http://status.openstack.org/elastic-recheck/ - (will continue to try to make it better)
21:19:33 <markwash> dhellmann: that's obviously better but I hope we *do* it
21:19:35 <notmyname> Embrace API contracts between projects. If one project uses another openstack project, treat it as any other dependency with version constraints and a defined API. Use pip or packages to install it. And when a project does gate checks, only check based on that project's tests.
21:19:45 <notmyname> This is consistent with what we do today for other dependencies. If there are changes, then we can talk cross-project. That's the good stuff we have, so let's not throw that out.
21:19:45 <dhellmann> notmyname: SOA?
21:20:08 <lifeless> dhellmann: service orientated archivetture
21:20:10 <lifeless> dhellmann: what we have
21:20:11 <notmyname> service oriented architecture. IOW, just have well defined APIs with the commitment to not break it and only use that
21:20:14 <lifeless> bah, architecture
21:20:19 <dhellmann> lifeless: I know SOA, I didn't know SAO
21:20:26 <jd__> you typed SAO :)
21:20:32 <lifeless> oh lol, my brain refused to notice that
21:20:40 <russellb> so, version pinning between openstack projects?
21:20:42 <notmyname> sdague: the problem with elastic recheck (which is good) is that it's hand-curated
21:20:49 <russellb> seems like we'd just be kicking the "find the breakage" can down the road
21:20:55 <mordred> russellb: ++
21:20:56 <mordred> in fact
21:21:05 <sdague> notmyname: it's 54% of all the fails, and super easy to add another one
21:21:07 <mordred> when you wanted to update the requirement, you would not have been testing the two together
21:21:14 <notmyname> well, what happens now for other dependencies? eg we don't run eventlet tests for every openstack patch and vice versa
21:21:16 <portante> so then why are we not pulling sphinx builds into our jobs?
21:21:16 <sdague> we approve them super fast
21:21:30 <markmcclain> I don't think we can use pip packages; otherwise, for projects with strong integration, we run into issues landing coordinated patches in the master branches
21:21:30 <notmyname> or sphinx, as portante stated
21:21:37 <mordred> notmyname: those are libraries, not things that do SDN
21:21:40 <dhellmann> the whole point of gating on trunk is to ensure that trunk continues to work so we can prepare the integrated release, right?
21:21:43 <sdague> because sphinx isn't openstack
21:21:51 <notmyname> markmcclain: that's exactly my point. it needs strong API contracts
21:21:56 <dhellmann> for other dependencies, we should be doing the same gate checks on the requirements project (if we're not already)
21:21:57 <mordred> it's more than an API
21:21:59 <notmyname> dhellmann: it still would
21:22:00 <lifeless> notmyname: we want to run eventlet tests on upstream pull requests actually.
21:22:08 <markmcclain> notmyname: those contracts evolve
21:22:13 <lifeless> notmyname: that's a test-the-world concept that infra have been kicking around
21:22:21 <lifeless> notmyname: so that we're not broken by things like sphinx 1.2
21:22:23 <notmyname> markmcclain: of course, that's where dependency versions come from
21:22:28 <mordred> the longer we diverge between these projects, the harder re-aligning is going to be
21:22:55 <mordred> it also makes it REALLY painful for folks running CD from master
21:23:03 <markmcclain> we do integrated releases so the tests should be integrated
21:23:09 <ttx> yes, it's not as if dependencies did not break us badly in the past
21:23:14 <russellb> mordred: painful as in ... we stop testing that use case completely :(
21:23:18 <notmyname> mordred: yes. integration is hard, so it needs to be continually done. if something breaks, fix it. what I'm suggesting is treating the interdependencies as more decoupled things
21:23:21 <mordred> russellb: yup
21:23:26 <jog0> so one of the problems we have seen is that gate has so many false positives that it's very easy for more to sneak in
21:23:30 <mordred> notmyname: but they're not
21:23:31 <lifeless> mmm, from a CD perspective, I don't object to carefully versioned API transitions upstream
21:23:32 <jog0> we have a horrible baseline to compare against
21:23:33 <mordred> they're quite interrelated
21:23:34 <lifeless> but
21:23:43 <lifeless> I strongly object to big step integrations
21:23:47 <mordred> lifeless: ++
21:23:48 <portante> mordred: how are they not?
21:23:57 <notmyname> mordred: again, that's why I'm here talking about this today. we've got a problem, and I'm throwing out ideas to help resolve it
21:24:02 <mordred> because these are things with side effects
21:24:03 <lifeless> if we bump the API a few times a day, that would be fine with me
21:24:16 <lifeless> but more than that and we'll start to see nasty surprises I expect
21:24:20 <portante> things with side effects sounds kinda general, no?
21:24:40 <mordred> there is a reason that side effects are a bad idea in well constructed code - they aren't accounted for in the API
21:24:48 <mordred> but
21:24:52 <portante> would notmyname's idea really make things worse than what we have today?
21:24:52 <mordred> sometimes they're necessary
21:24:57 <mordred> which is why scheme isn't actually used
21:25:01 <mordred> yes
21:25:04 <mordred> it would make it worse
21:25:06 <mordred> unless
21:25:09 <markmcclain> portante: yes
21:25:10 <mordred> you happen to not care about integration
21:25:21 <portante> how will it make it worse than what we have today?
21:25:25 <mordred> if you don't care about integration, it would make your experience as a developer better
21:25:34 <mordred> portante: define "worse"
21:25:36 <notmyname> mordred: I didn't see portante say anything about not caring about integration
21:25:44 <notmyname> (ever in fact)
21:25:58 <russellb> point is, that's the case where it's not worse
21:26:17 <dhellmann> russellb: ?
21:26:24 <mordred> notmyname: I'm saying that delaying integration until we have larger sets of things to integrate is going to make it more likely to introduce issues, and harder to track them down when they happen
21:26:30 <mordred> I believe that will be worse
21:26:31 <russellb> heh, mordred is saying it's worse, unless you don't care about integration
21:26:34 <jeblair> notmyname: because the proposal would mean we would perform integration testing less, essentially only once and on API bumps.
21:26:41 <mordred> however, doing such a delay
21:26:56 <jog0> we rarely change APIs
21:27:01 <portante> integration tests would still be run at the same rate
21:27:01 <mordred> will increase the pleasurability of folks doing development if those people are not concerned about the problems encountered in integration
21:27:13 <dhellmann> portante: how so?
21:27:17 <mordred> not against combinations that would show you that a patch introduced an issue
21:27:36 <mordred> which means that your patch against glance has no way of knowing that it breaks when combined with a recent patch to keystone
21:27:44 <mordred> when neither patch has landed yet
21:27:53 <portante> we would still run the same job sets as we do today, that would not change, it's just that we would be working with sets of changes from projects instead of individual commits
21:27:55 <mordred> which means you have to BUNDLE all of the possible new patches until there is a new release
21:28:10 <mordred> which means _hundreds_ of patches
21:28:20 <jeblair> and then bisect those out when you have a problem
21:28:28 <jog0> so I think this whole discussion is looking at things the wrong way. Gate is effectively broken, we don't trust it and it's slowing down development. The solution is to fix the bugs, not find ways of running less tests
21:28:30 <mordred> considering that it's hard enough to get it right when we're doing exact patch for patch matching
21:28:40 <russellb> jog0: +1
21:28:40 <markmcclain> jog0: +1
21:28:40 <portante> but why would my patch break something else without also breaking the API contract?
21:28:42 <jeblair> jog0: ++
21:28:48 <mordred> think about how much worse it will be when you only test every few hundred patches
21:28:50 <mordred> jog0: ++
21:29:02 <dhellmann> jog0: +1
21:29:03 <mordred> portante: because it can and will
21:29:07 <jog0> one thing that would help is making sure we are collecting good data against master all the time
21:29:08 <ttx> jog0: I think notmyname's point is that it cannot ever be fixed so you need new ideas
21:29:10 <mordred> because that's the actual reality
21:29:29 <jog0> so if we have free resources, run gate against it so we get more data to analyze and debug with
21:29:32 <dolphm> mordred: ++; it's happened plenty in our history
21:29:36 <sdague> jog0: +1 ... so basically the ask back is what do we do (me & jog0 ... as I'm signing him up for this) to get better data in elastic recheck to help bring focus to the stuff that needs fixing
21:29:41 <jog0> ttx: I am not ready to accept that answer yet
21:29:42 <ttx> jog0: do you think we can get to the bottom of those issues ?
21:29:50 <notmyname> ttx: no, not that it can't be fixed, per se. but that openstack has grown to a scale where perhaps existing methods aren't as valuable
21:29:53 <jog0> ttx: yes, it may take a lot of effort but yes
21:29:57 <mordred> I think the methods are fine
21:30:04 <mordred> the main problem is getting people to participate
21:30:12 <sdague> yeh, agree with mordred
21:30:12 <markwash> I think we probably need some sort of painful freeze to draw attention to fixing these bugs
21:30:13 <mordred> introducing more slack into the system will not help that
21:30:33 <portante> it does not seem to be about adding more slack
21:30:33 <sdague> markwash: if only developers were feeling some pain.... ;)
21:30:34 <torgomatic> markwash: more pain as the answer to gate pain?
21:30:36 <mordred> the fact that we all know that jog0 and sdague have been killing themselves on this
21:30:38 <mordred> is very sad
21:30:46 <mordred> and many people should feel shame
21:30:51 <markwash> torgomatic: yeah, in one big dose, to reduce future gate pain
21:30:51 <portante> but targeting a finite set of resources on the point of integration
21:30:53 <mordred> because everyone should be
21:31:08 <mordred> portante: it's batching integration
21:31:16 <mordred> portante: which is the opposite of continuous integration
21:31:29 <dolphm> was the idea of prioritizing the gate queue ever shot down? (landing [transient] bug fixes before bp's, for example) or was that just an implementation challenge
21:31:32 <mordred> and which will be a step backwards and will be a nightmare
21:31:51 <portante> mordred: if the current system causes developers to assemble large patches unbeknownst to you, isn't that the same thing?
21:31:54 <jeblair> dolphm: we just added the ability to do that
21:32:00 <jog0> so we are tracking 27 different bugs in http://status.openstack.org/elastic-recheck/ and that doesn't cover all the failures. Fixing these bugs takes a lot of effort
21:32:02 <sdague> dolphm: we have manual ways to promote now. We've used it recently
21:32:05 <dolphm> jeblair: oh cool - where can i find details?
21:32:34 <torgomatic> it seems like we're saying that we can leave the gate as-is if we would just stop writing intermittent bugs
21:32:35 <jeblair> dolphm: we've done it ~twice now; it's a manual process that we can use for patches that are expected to fix gate-blocking bugs, and are limiting it to that for now.
21:32:37 <notmyname> portante: that's actually my biggest fear. that current gate issues encourage people to go into corners to contribute to forks. which is bad for everyone
21:32:37 <sdague> this is the in-progress data to narrow things down further - http://paste.openstack.org/show/55185/
21:32:46 <torgomatic> and if we can stop doing that, let's just stop writing bugs at all and throw the gate out
21:32:47 <mordred> notmyname: what forks?
21:32:51 <mordred> notmyname: what forks of openstack are there?
21:33:03 <dolphm> jeblair: is the process to ping -infra when we need to land a community priority change then?
21:33:11 <mordred> notmyname: and which developers are hacking on them?
21:33:15 <jeblair> dolphm: yes
21:33:25 <dolphm> jeblair: sdague: easy enough, thanks!
21:33:26 <hub_cap> mordred: maybe internal "forks" cuz patches take a while to land?
21:33:33 * hub_cap guesses
21:33:37 <markwash> mordred: I guess many companies run private forks
21:33:37 <portante> mordred: no names, don't want the nsa to take them out. ;)
21:33:46 <mordred> portante: ;)
21:33:48 <notmyname> hub_cap: yes. but to portante's point, it happens privately
21:33:52 <hub_cap> portante: the nsa knows already
21:33:53 <markwash> guesses the nsa runs a fork :-)
21:33:58 <mordred> well, those companies usually learn pretty quickly
21:33:58 <portante> it does!?
21:33:59 <creiht> what company doesn't have a fork of every openstack component as they try to get features in?
21:33:59 <russellb> private forks seem natural
21:34:04 <torgomatic> alternately, we can accept that bugs happen, including intermittent bugs, and restructure things to be less annoying when they do
21:34:11 <mordred> that getting out of sync significantly is super painful
21:34:12 * portante smashes laptop on the ground
21:34:15 <russellb> and honestly just seems like FUD
21:34:18 <notmyname> torgomatic: yes!
21:34:18 * jd__ smells FUD
21:34:21 <jog0> many of the bugs we see in gate are really bad ones
21:34:22 <russellb> jd__: jinx
21:34:26 <jd__> raaah
21:34:28 <markwash> portante: lol
21:34:44 <sdague> yeh, a lot of these races are pretty fundamental things
21:34:51 * mordred hands portante a new laptop that he promises has no malware on it
21:34:57 <sdague> where compute should go to a state... and it doesn't
21:35:17 * portante thankful for kind folks with hardware
21:35:19 <ttx> the tension is because some developers are slowed down by issues happening in other corners of the project and over which they have limited influence
21:35:25 <torgomatic> to that end, I think notmyname's first two suggestions are both good ones
21:35:42 <russellb> ttx: and the dangerous response is to continue not to care what's happening in the other corners
21:35:53 <lifeless> we're all in this together :)
21:35:54 <jeblair> ttx: they don't have limited influence though
21:35:56 <russellb> lifeless: yes!
21:35:59 <portante> can we at least run experiments with the suggestions to play them out?
21:36:03 <sdague> honestly, in the past we keep going in cycles where gate gets bad, pitch forks come out, people work on bugs, it gets better
21:36:05 <ttx> but if you take the viewpoint of openstack as a whole, some parts may be slowed down, but the result is better in the end
21:36:15 <sdague> this time... the number of folks working these bugs isn't showing up
21:36:25 <russellb> portante: which ones? for #2 and #3 there were fundamental disagreements from many people
21:36:27 <sdague> which is really the crux of the problem
21:36:31 <markwash> one policy that might help: as we triage a race-condition based failure in the gate, we need to require unit / lower level / faster tests that reproduce those failures to land in the projects themselves and fail every time
21:36:33 <jeblair> i hit a transient bug on a devstack-gate change, and with some help from sdague we tracked it down to a real bug in keystone, i filed the bug, wrote an er query and moved on
21:36:33 <russellb> for #1, jeblair invited some help with zuul dev to add it
21:36:48 <jeblair> i think that was beneficial to the project
21:36:53 <lifeless> so I proposed that gate affecting bugs be critical by default
21:37:04 <jog0> markwash: that won't work; many times we don't know why something is breaking
21:37:07 <lifeless> I think the stats we have here suggest that perhaps that isn't as bad an idea as folk thought :)
21:37:07 <portante> it is okay to disagree, can't hurt to try a few things to see if they pan out
21:37:09 <russellb> can someone ban d0ugal? the join/parts are really annoying
21:37:10 <jeblair> and i was glad i could help even though i knew that my shell script change to devstack-gate didn't cause it.
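A note on the "er query" jeblair mentions: elastic-recheck keeps a hand-curated list of failure signatures, each tied to a Launchpad bug, and matches them against the logs of failed jobs so a recheck can point at a known bug instead of an anonymous failure. The sketch below only illustrates that idea; it is not elastic-recheck's actual code or query syntax (the real tool runs elasticsearch queries against indexed job logs), and the bug numbers and log strings are invented.

    # Illustration only: NOT elastic-recheck's real code or query format.
    # The real tool matches hand-curated elasticsearch queries, keyed by
    # Launchpad bug number, against indexed job logs. The bug numbers and
    # log strings here are made up for the example.

    KNOWN_SIGNATURES = {
        # hypothetical bug number -> substring expected in a failed job's console log
        1000001: "Timed out waiting for server to become ACTIVE",
        1000002: "Connection to the service failed: maximum attempts reached",
    }

    def classify_failure(console_log):
        """Return the bug numbers whose signatures appear in this log."""
        return [bug for bug, needle in KNOWN_SIGNATURES.items()
                if needle in console_log]

    if __name__ == "__main__":
        sample_log = "... Timed out waiting for server to become ACTIVE ..."
        hits = classify_failure(sample_log)
        if hits:
            print("known transient failure, recheck bug %d" % hits[0])
        else:
            print("unclassified failure -- needs triage and a new signature")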
21:37:19 <ttx> russellb: I use them as a clock
21:37:21 <jog0> take the http2 lib file descriptor bug
21:37:25 <creiht> what if we just turn off the gate for a specific project until they fix the bugs that are clogging it?
21:37:30 <markwash> jog0: ah, okay... yeah, it's only for bugs where we understand the race but it's hard to fix
21:37:39 <dolphm> ttx: rofl. russellb: can your client hide join/parts?
21:37:46 <markwash> creiht: +1
21:37:56 <russellb> dolphm: probably, but i don't want to hide the non-broken ones
21:38:02 <creiht> well, prevent the project from landing any further patches until they fix gate critical bugs
21:38:02 <markmcclain> creiht: that is a bad idea… we've done this before and it caused more problems than it solved
21:38:09 <jeblair> lifeless: ++critical
21:38:25 <creiht> markmcclain: my first explanation wasn't as clear, sorry
21:38:28 <russellb> heh, and now we have a pile of critical bugs that the same small number of people are looking at
21:38:31 <dolphm> creiht: not sure i follow - block that project from being tested or block that project from landing irrelevant changes?
21:38:39 <russellb> just saying, that alone doesn't get people to work on them :)
21:38:50 <creiht> block from landing any changes until the critical bugs are fixed
21:38:58 <lifeless> russellb: sure, but can't we also say 'when there are critical bugs, we won't be reviewing or landing anything else' ?
21:39:10 <lifeless> russellb: like, make it really crystal clear that these things are /what matters/
21:39:16 <russellb> lifeless: sure, something, just saying that labeling things critical doesn't do anything by itself
21:39:18 <ttx> creiht: I think we have that option, yes
21:39:20 <lifeless> russellb: ack, agreed.
21:39:27 <jeblair> markmcclain: you've done that once or twice, right? prioritized critical fixes to the exclusion of other patches?
21:39:35 <dolphm> idea: can http://status.openstack.org/rechecks/ be redesigned so that you can see the most impactful bugs per project (i.e. the project the associated bugs are tracked against)?
21:39:42 <ttx> creiht: if we can really identify a project that doesn't play ball
21:39:54 <russellb> dolphm: have you seen http://status.openstack.org/elastic-recheck/ ?
21:39:54 <dolphm> it's impossible for me to glance at that page and see where i can help
21:40:04 <sdague> dolphm: yes, moving towards eliminating it with the elastic recheck dashboard
21:40:04 <markmcclain> jeblair: yes.. we blocked approvals until fixes landed
21:40:13 <jog0> dolphm: keystone doesn't have any gate issues as far as I know
21:40:18 <dolphm> russellb: yeah, that's not what i want either
21:40:18 <sdague> it just... takes time
21:40:22 <dolphm> jog0: understood, but still
21:40:22 <jeblair> sdague, dolphm: ++
21:40:28 <clarkb> jog0: it does
21:40:31 <clarkb> the port issue
21:40:35 <sdague> jog0: that's not true
21:40:37 <creiht> ttx: it isn't about playing ball... if there are critical bugs blocking the gate, then your project gets no new patches in until that bug is fixed
21:40:38 <jog0> clarkb: link
21:40:39 <sdague> it bounced stuff this morning
21:40:50 <jeblair> dolphm, jog0: and the keystoneclient issue we found yesterday
21:41:00 <dolphm> jog0: actually we do have a couple issues ;)
21:41:02 <portante> creiht: if there are critical bugs blocking the gate from your project, then your project ....
21:41:13 <creiht> yes
21:41:15 <jog0> in that case I think most integrated projects have critical bugs
21:41:17 <jog0> if not all
21:41:34 <portante> great, so let's do that creiht thingy then
21:41:40 <creiht> lol
21:41:40 <markwash> I mean, maybe they all need to stop and fix those
21:41:51 <ttx> creiht: in some cases it's not as binary as that. Some bugs take time to investigate/reproduce, and blocking the project that makes progress on them is probably not very useful
21:42:02 <lifeless> ttx: so, I disagree
21:42:08 <torgomatic> that approach acknowledges that bugs happen, so it's got that going for it
21:42:11 <creiht> ttx: it seems more useful than just letting the status quo go on
21:42:21 <lifeless> ttx: when you make changes there is a chance you introduce new bugs right ?
21:42:26 <lifeless> ttx: or make the current ones worse!
21:42:29 <markwash> race condition bugs are a good situation for tough love
21:42:29 <portante> nothing changes if nothing changes
21:42:31 <notmyname> well, that brings up another point. elastic-recheck doesn't do any alerting to a project. maybe that should be added
21:42:46 <sdague> notmyname: agreed
21:42:46 <lifeless> ttx: so if you have critical issues, changing things that aren't fixing that issue, is just fundamentally a bad idea.
21:42:57 <jeblair> notmyname: sounds like a good idea
21:43:18 <russellb> or perhaps an openstack-dev email for each bug that gets added? or would that be too much?
21:43:29 <portante> public flogging?
21:43:29 <lifeless> might be too little
21:43:32 <russellb> heh
21:43:33 <jog0> notmyname: so one issue is many times we don't know which project the bug is in
21:43:34 <sdague> we were talking about that, if we can determine the project, or set of projects where the bug is, it should alert those channels whenever it fails a patch
21:43:52 <ttx> ok, I think we are not maling anymore progress now
21:43:55 <ttx> or making
21:43:58 <sdague> so people get shamed by how often they are breaking things
21:44:05 <notmyname> so what's next, then?
21:44:07 <notmyname> ttx: ^
21:44:09 <lifeless> I don't think shame really helps
21:44:11 <creiht> status quo!
21:44:12 <creiht> :)
21:44:21 <lifeless> no one wanted to introduce these bugs
21:44:22 <markmcclain> the downside of public shaming is that sometimes the initial point of fault could be incorrect
21:44:28 <russellb> what's next? how to get more people helping fix bus?
21:44:30 <russellb> bugs*
21:44:33 <lifeless> right!
21:44:41 <ttx> practical actions
21:44:43 <russellb> continued work to raise awareness of the most important things is part of it
21:44:43 <jog0> russellb: agreed
21:44:45 <markwash> yeah, not about shame, just about how do we progress when there are critical bugs
21:44:51 <russellb> and i think some ideas are being tossed around for that right now
21:45:14 <lifeless> is everyone raising their gate critical bugs in each weekly meeting ?
21:45:15 <russellb> and then what hammers are available when not enough progress is made, and when do we use them
21:45:16 <ttx> notmyname: I think everyone agreed your suggestion 1 was interesting, just missing dev manpower to make it happen
21:45:21 <russellb> and i'm not sure we have good answers for that part yet
21:45:26 <lifeless> Like as a dedicated section? And getting volunteers to work on them ?
21:45:29 <ttx> (the multigate thing)
21:45:34 <markmcclain> lifeless: it's the 1st real item in our meeting each week
21:45:35 <torgomatic> some of us are giant fans of suggestion 2 as well
21:46:00 <torgomatic> (suggestion 2 is removing redundant gate jobs)
21:46:24 <markmcclain> torgomatic: no, the extra data points are very helpful for diagnosing some of the race conditions
21:46:26 <lifeless> torgomatic: what redundant jobs?
21:46:27 <ttx> I think that one was far from consensual
21:46:36 <markmcclain> it also helps us to prioritize based on frequency
21:46:42 <markwash> I think we should just have a post-gate master integration job that is wired up to a thermonuclear device... when the failure rate hits 50% it blows
21:46:51 <lifeless> markwash: sweet
21:46:54 <russellb> ttx: if anything, more consensus on "no" for 2 and 3 IMO
21:47:03 <torgomatic> lifeless: like running devstack 5 times against every project, when there's not always a way for that project's patches to break stuff
21:47:11 <torgomatic> well, not only for one
21:47:14 <torgomatic> I meant to say
21:47:22 <lifeless> torgomatic: yes, your analysis is missing something
21:47:27 <lifeless> torgomatic: which we discussed
21:47:34 <jog0> https://bugs.launchpad.net/openstack/+bugs?search=Search&field.importance=Critical&field.status=New&field.status=Incomplete&field.status=Confirmed&field.status=Triaged&field.status=In+Progress&field.status=Fix+Committed
21:47:34 <russellb> don't want to rehash it
21:47:38 <lifeless> torgomatic: which is that the break relationship is often bidirectional, and transitive.
21:47:40 <torgomatic> as in, I'm sure I can write a Swift patch that breaks devstack for everything, but I cannot write one that only breaks devstack-neutron-large-ops
21:47:40 <jog0> 117 critical bugs
21:47:58 <jog0> torgomatic: yes you can
21:48:13 <torgomatic> jog0: great, please provide an existence proof in the form of a patch
21:48:21 <lifeless> let's get out of the rabbit hole
21:48:27 <jog0> put some timeouts in swift to make things super slow for glance
21:48:30 <lifeless> back to how do we get more people working on critical bugs
21:48:45 <jeblair> btw, some projects have started tagging bugs with 'gate-failure' which can help folks searching for these bugs
21:48:47 <sdague> jog0: you probably want to remove git committed
21:49:02 <markwash> s/git/fix/
21:49:20 <russellb> which brings it to 44
21:49:27 <ttx> lifeless: suggestions ?
21:49:31 <lifeless> jog0: that includes non-integrated projects
21:49:43 <ttx> We shall soon move on to the rest of the meeting content
21:49:52 <jog0> lifeless: yeah, do you have a better link?
21:50:07 <lifeless> jog0: not in time for the meeting
21:50:11 <lifeless> jog0: LP limitation
21:50:14 <ttx> I see no reason why we can't continue to discuss this on the ML, btw
21:50:35 <ttx> Everyone agrees it's an issue
21:50:53 <ttx> Just absence of convergence on solutions
21:50:57 <russellb> let's fix it, and not by doing less testing of the continuous or the integrated varieties.
21:51:12 <ttx> except suggestion 1 which was pretty consensual
21:51:20 <ttx> just missing resources to make it happen
21:51:35 <sdague> yeh, that's going to require dev resources on zuul
21:52:00 <sdague> but jeblair said he'd be happy to entertain those adaptive algorithms
21:52:09 <jeblair> and it's worth remembering, that's just speeding up the failures.
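For readers who skimmed past notmyname's first idea earlier in the log, the sketch below is a toy model of that "pessimistic" multi-queue speculation: run the queue as today, plus alternates that assume the top one or two changes will fail. It is not how Zuul actually works, the names are invented for illustration, and, as jeblair notes just above, it only gives failures and restarts a head start; it does not change whether a given change eventually merges.

    # Toy model of "idea 1" (multi-gate-queue): besides the normal gate run
    # (scenario 0), also pre-test the queue states you would fall back to if
    # the changes at the head fail. NOT how Zuul works; names are invented.

    def speculative_scenarios(queue, num_scenarios=3):
        """Return the lists of queued changes to test in parallel.

        Scenario 0 is today's gate: every queued change applied in order.
        Scenario k assumes the top k changes will fail and be ejected, so
        it tests the queue with those changes removed.
        """
        scenarios = []
        for k in range(min(num_scenarios, len(queue) + 1)):
            scenarios.append(list(queue[k:]))
        return scenarios

    if __name__ == "__main__":
        queue = ["change-A", "change-B", "change-C", "change-D"]
        for k, changes in enumerate(speculative_scenarios(queue)):
            print("scenario %d tests: %s" % (k, ", ".join(changes) or "(empty)"))
        # If scenario 0 passes, scenarios 1 and 2 are simply thrown away.
        # If change-A fails, scenario 1 already has a head start on the
        # rerun that would otherwise begin only after the failure.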
21:52:09 <jog0> so I am not too keen on the first idea
21:52:10 <jog0> actually
21:52:24 <jog0> I think we can use the compute and human resources much better
21:52:27 <russellb> jog0: I don't think it hurts, while the others arguably do hurt
21:52:29 <jog0> if we fix the gate, issue one goes away
21:52:29 <sdague> well honestly, it also requires effort
21:52:32 <jeblair> russellb: ++
21:52:51 <sdague> so if someone is signing up for it, cool. If people are just "someone else should do it" then it won't happen
21:53:13 <markwash> it seems like idea #1 is just tuning the existing optimizations we have in place, not sure why it would be bad if someone showed up with a patch?
21:53:14 <russellb> like most things :)
21:53:26 <ttx> ok, 7 minutes left let's move on
21:53:35 <ttx> #topic Red Flag District / Blocked blueprints
21:53:39 <russellb> i like this new cross project meeting style :)
21:53:48 <russellb> we never had time for stuff like this before
21:53:52 <portante> exciting
21:54:00 <ttx> No blocked blueprint afaict
21:54:11 <ttx> russellb: yes, we used to put that dust under carpets
21:54:23 <ttx> at least we now voice the anger
21:54:28 <ttx> "put the dead fish on the table"
21:54:41 * markwash googles
21:54:47 <notmyname> ttx: I don't think "anger" is the right word
21:55:07 * jeblair thinks a failed patch in the queue should be called a dead fish
21:55:30 <russellb> jeblair: so the red circle in the zuul status page should be a dead fish instead?
21:55:31 <notmyname> I think there is frustration, but there is quite a bit of grace given to the current state of things by those who are frustrated
21:55:32 <ttx> we still have a conflict between heat and keystone around service-scoped-role-definition
21:55:44 <jeblair> russellb: with little stink lines
21:55:44 <ttx> notmyname: yes, frustration is a better term, sorry
21:56:05 <ttx> heat/management-api still needs keystone/service-scoped-role-definition
21:56:15 <ttx> stevebaker, dolphm: did you solve it ?
21:56:16 <stevebaker> ttx: that dep should be removed
21:56:20 <dolphm> i followed up on that last week - heat really shouldn't be blocked on that
21:56:31 <ttx> stevebaker: ah, great
21:56:32 <dolphm> although heat *could* take advantage of it - and i understand the desire to
21:56:36 <stevebaker> i thought I did that
21:56:47 <creiht> notmyname: well said
21:57:17 <ttx> stevebaker: yep it's removed now, thx
21:57:27 <ttx> Any other blocked work that this meeting could try to help unblock ?
21:58:06 <ttx> I'll take that as a "no"
21:58:09 <ttx> #topic Incubated projects
21:58:48 <ttx> devananda, kgriffs, SergeyLukjanov: around ? any question ?
21:59:01 <SergeyLukjanov> ttx, I'm here
21:59:08 <SergeyLukjanov> ttx, no questions atm
21:59:19 <kgriffs> no questions here
21:59:20 <devananda> aside from wondering how much slower development on ironic will be when we get integration testing .... nope :)
21:59:33 <kgriffs> +1 for raising the bar on code quality
21:59:34 <SergeyLukjanov> ttx, first working code of heat integration already landed, waiting for reviews on tempest patches
21:59:39 <ttx> kgriffs: had a question for you about when you wanted to switch to release management handling your milestones
22:00:02 <kgriffs> ah, great question
22:00:11 <kgriffs> tbh, I don't have a good feel for what that entails
22:00:12 <ttx> I see your i1 is still open
22:00:32 <ttx> kgriffs: we should talk. Will ping you tomorrow ?
22:00:33 <kgriffs> hmm. Thought I closed it.
22:00:34 * kgriffs hides
22:00:42 <kgriffs> ttx: sounds good
22:00:49 <ttx> kgriffs: it's inactive but it looks in progress :)
22:00:55 <kgriffs> I've been trying to move closer to tracking the i milestones, so this is timely
22:01:07 <kgriffs> ttx: oic
22:01:07 <ttx> kgriffs: awesome, talk to you tomorrow
22:01:10 <kgriffs> kk
22:01:14 <ttx> and.. time is up
22:01:16 <ttx> #endmeeting