22:01:05 <corvus> #startmeeting zuul
22:01:06 <openstack> Meeting started Mon Feb 5 22:01:05 2018 UTC and is due to finish in 60 minutes. The chair is corvus. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:01:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:01:09 <openstack> The meeting name has been set to 'zuul'
22:01:11 <corvus> #topic Agenda
22:01:31 <corvus> there is no agenda in the wiki https://wiki.openstack.org/wiki/Meetings/Zuul
22:01:36 <jhesketh> o/
22:01:41 <corvus> #link agenda (or lack thereof) https://wiki.openstack.org/wiki/Meetings/Zuul
22:01:45 <fungi> the best kind of agenda
22:02:05 <corvus> anyone have anything they want to talk about?
22:02:42 <clarkb> the inap situation has maybe shown us that fixing timeouts is more important?
22:02:48 <clarkb> granted the actual fix there was to fix the cloud
22:02:59 <corvus> #topic timeouts
22:03:04 <corvus> clarkb: can you elaborate?
22:03:21 <clarkb> Last week we had trouble with instances in inap which resulted in slow disk and slow networking (as I understand it)
22:03:23 <corvus> which timeouts, and how were they broken?
22:03:39 <clarkb> this caused jobs to timeout but it took them like 6 hours to do so because we apply the timeout to each run stage rather than the job as a whole
22:04:06 <corvus> ah yep. that much at least should be a relatively easy change, as soon as we decide how we want to implement it.
22:04:08 <clarkb> this was painful because it meant that jobs weren't getting rescheduled (relatively) quickly on new nodes and instead were sitting around for a quarter of a day before failing
22:04:27 <clarkb> (I think the rescheduling would've had to be manual via recheck in this case)
22:04:52 <fungi> pathological scenario though, where the reused ssh connection basically becomes a blackhole for the commands being passed in
22:05:27 <corvus> we could go ahead and give the entire job the timeout budget. so if the timeout is 2h, and the pre playbook takes 2h, we will timeout.
22:05:39 <corvus> that's probably not ideal, but it's probably better than what we have now.
22:06:03 <corvus> (i think the ideal thing would maybe be per-playbook timeouts, so pre could have a 10m timeout, and run could have 2h)
22:06:09 <clarkb> ya I think that's what I have in mind as far as addressing it
22:06:22 <clarkb> another approach would be to specify three timeouts one for each run stage
22:06:23 <corvus> but we can implement cumulative job timeout fairly easily, and then maybe talk about per-playbook timeouts later.
22:06:30 <corvus> or per-stage
22:06:38 <clarkb> but I think for user simplicity a single timeout is easy to understand
22:06:38 <mordred> yah. I think entire job the timeout budget to start, and maybe enhancing in the future to have per-playbook timeouts?
22:07:33 <corvus> i feel like both of those are things we can do now, and change later without much disruption
22:07:39 <mordred> ++
22:07:48 <fungi> any of the above ideas seems fine to me. as long as a job that sometimes needs 3h for its run playbook doesn't end up potentially hung through 5x that because it gets the same timeout applied to two run playbooks and two post playbooks
22:07:57 <corvus> with the first update to cumulative timeout, folks *may* need to increase timeouts a bit. but hopefully not much.
22:08:08 <clarkb> was there a particular issue that prevented us from implementing this behavior before (I want to say I heard there was but don't know details)
22:08:14 <fungi> er, to two pre playbooks and two post playbooks
22:08:20 <dmsimard> fungi: I guess it's even worse if the timeout ends up occurring in pre which has the job retry
22:08:23 <clarkb> corvus: I think most people implemented the timeout values mostly as if they were cumulative
22:08:28 <corvus> clarkb: nope, just nobody typed the words into a text editor.
22:08:31 <fungi> dmsimard: which happened in some cases
22:08:38 <clarkb> cool, in that case I may try to poke at changing the behavior
22:09:51 <corvus> dmsimard: if it times out in pre, it will retry the job. (and therefore, the timer would reset). i assume that we'd generally want to continue that.
22:10:10 <fungi> clarkb: i think the reason was that for converted jobs we had one timeout value, and passing a timeout to the playbook was relatively trivial to implement
22:10:33 <fungi> since that's an ansible feature already
22:10:42 <corvus> there will be a case where we'll hit the timeout 3 times in 3 retries, and we'll be sad. but i think by and large, the kind of error we'd expect a timeout to represent is exactly the kind of error we usually want to retry.
22:10:50 <clarkb> fungi: gotcha so it was just less accounting in zuul makes sense
22:10:57 <dmsimard> corvus: +1
22:11:27 <corvus> (in pre, of course)
22:11:42 <fungi> ahh, yeah i guess timing out one of the pre playbooks would have gotten the job aborted regardless of the ssh connection state
22:11:43 <clarkb> I can look into that probably tomorrow if not later today
22:11:49 <clarkb> still catching up on being largely afk for almost two weeks
22:12:10 <corvus> #action clarkb make timeout cumulative in executor
22:12:19 <corvus> any other topics?
22:12:52 <fungi> some changes landed for a memory governor i guess? and there's question as to whether it's working as intended?
22:13:22 <corvus> yeah, i'm trying to sort that out now.
22:13:47 <corvus> we're hoping that will keep us from oom-killing the log streamer
22:13:59 <fungi> fingers crossed
22:14:28 <corvus> but at this point, i have a bunch of confusing data. i'll keep brain-dumping into #zuul as i work through it, and bug people when i've got a coherent idea sorted
22:14:38 <fungi> thanks!
22:15:44 <corvus> if there's nothing else -- let's get back to it :)
22:15:47 <Shrews> fyi, i will be away beginning this thursday through (and including) the following thursday
22:15:56 <fungi> nothing else from me
22:15:58 <Shrews> so don't break nuttin
22:15:58 <corvus> Shrews: ah thanks!
22:16:14 <corvus> Shrews: anything we should try to get into prod before you leave?
22:16:21 <corvus> or get merged
22:16:44 <Shrews> corvus: nothing urgent. we've merged some good fixes to nodepool recently. perhaps we should restart the launchers?
22:16:52 <clarkb> I will be missing the next meeting as I'm doing taxes and it's cheaper if I do them before the 15th
22:17:20 <corvus> Shrews: probably a good idea
22:17:30 <corvus> okay, thanks everyone!
22:17:32 <corvus> #endmeeting
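
Note on the timeouts topic: the cumulative timeout agreed in the #action can be pictured roughly as below. This is only a sketch and not Zuul's executor code. The names `run_with_budget`, `JobTimeout` and `worst_case_per_playbook` are hypothetical, and each playbook is assumed to be a callable that enforces whatever `timeout` (in seconds) it is handed. The idea is that every playbook receives only the time the job still has left, instead of a fresh copy of the full timeout.

```python
import time


class JobTimeout(Exception):
    """Hypothetical error raised when the job's shared budget runs out."""


def run_with_budget(playbooks, job_timeout, clock=time.monotonic):
    """Run playbooks in order against one cumulative deadline.

    Assumption: each playbook is a callable taking a `timeout` keyword
    (seconds) and enforcing it itself, e.g. by aborting its subprocess.
    """
    deadline = clock() + job_timeout
    for playbook in playbooks:
        remaining = deadline - clock()
        if remaining <= 0:
            # Earlier playbooks used the whole budget; do not start more.
            raise JobTimeout("job timeout budget exhausted")
        playbook(timeout=remaining)


def worst_case_per_playbook(num_playbooks, timeout):
    """Worst-case wall time when every playbook gets the full timeout."""
    return num_playbooks * timeout
```

With the per-playbook behavior described in the meeting, a 3h timeout across two pre playbooks, one run playbook and two post playbooks gives `worst_case_per_playbook(5, 3)`, i.e. 15 hours, which is the "5x" fungi mentions. With a shared budget the same job stops after at most 3h. Retrying on a pre timeout (which resets the timer) would sit outside this function.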