19:01:06 #startmeeting infra
19:01:06 Meeting started Tue Jul 26 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 The meeting name has been set to 'infra'
19:01:08 o/
19:01:29 #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000346.html Our Agenda
19:01:45 I had no announcements so I'm just going to dive right into the topic list
19:01:55 #topic Topics
19:02:08 #topic Improving CD throughput
19:02:29 I'm not aware of any changes to this since the last meeting, but wanted to make sure I wasn't overlooking anything important or actionable
19:03:42 Sounds like there aren't any updates from others either
19:03:51 #topic Updating Grafana Management Tooling
19:03:56 #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:04:00 #link https://review.opendev.org/q/topic:grafana-json
19:04:34 Thank you ianw for putting this together
19:04:46 I've managed to review the stack and had a few pieces of feedback, but overall I think this looks good
19:05:01 yes thanks, i need to respond to your comments
19:05:37 I suspect once we clarify a few of those things a second reviewer should be able to land them if the second reviewer is happy too
19:05:45 one was about having two jobs; they probably could be combined. one is an explicit syntax check
19:06:27 ya I +2'd them all as my feedback was all minor and I'd be happy to address things in a followup if we decide that addressing those items is a good idea
19:07:25 I guess that is probably all there is to say on this :) a second reviewer would be great if anyone else has time
19:07:54 #topic Bastion Host Updates
19:08:12 We discovered recently that Zuul leaks console log streaming artifacts (task log files essentially) in /tmp on bridge
19:08:41 I wrote a simple (probably too simple) change to have a periodic cleanup of those files run on bridge. But ianw had the even better idea of updating zuul to clean up after itself
19:09:04 Considering how long this has been happening I don't think it is an emergency and I can abandon my change while we work to land ianw's fixes to zuul
19:09:19 if anyone is concerned with that plan let me know and I'll work to make my change less bad and it can be a temporary fix
19:09:46 ianw: I also wanted to call out that corvus made note on the base change of the zuul fix stack that we probably do need tmp reaper functionality in zuul itself too for aborted jobs
19:09:53 i just noticed there was some discussion in matrix over adding a periodic cleaner to the zuul-console daemon
19:10:00 yup
19:10:17 I got the impression the current stack can go in as is, but we should look at a followup to close the aborted job gap
19:10:29 to be clear, the current behavior is not an oversight in zuul
19:10:29 since the current stack is a strict improvement. It just doesn't fully solve the problem
19:10:45 yeah. i guess my concern with that a priori is the same thing that made me feel a bit weird about cleaning it via the cron job, in that it's a global namespace there
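(For reference, a minimal sketch of the kind of periodic /tmp cleanup change discussed above. The console-*.log filename pattern and the 30-day cutoff are assumptions rather than details confirmed in the meeting; keeping the match narrow and age-based is one way to respect the concern that /tmp is a shared namespace.)

```python
#!/usr/bin/env python3
"""Remove stale Zuul console log files from /tmp on bridge.

Illustrative sketch only: the glob pattern and the 30-day cutoff are
assumptions, not values taken from the meeting above.
"""
import glob
import os
import time

TMPDIR = "/tmp"
PATTERN = "console-*.log"   # assumed name format of the leaked files
MAX_AGE = 30 * 24 * 3600    # roughly "older than a month"

now = time.time()
for path in glob.glob(os.path.join(TMPDIR, PATTERN)):
    try:
        if now - os.path.getmtime(path) > MAX_AGE:
            os.unlink(path)
    except OSError:
        # A still-running or just-finished job may remove its own file
        # in the meantime; ignore races rather than failing the run.
        pass
```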
19:10:46 i think an improvement is fine
19:10:54 but it's not like we just forgot to do that
19:11:14 we understood that it's nearly impossible to actually remove these files synchronously
19:11:37 which is why we expected one of 2 things: either the node disappears, or a tmp reaper/reboot fixes it
19:11:59 but we can't really change the name of the file on disk until we are happy enough that there are no zuul_console processes out there looking for the old name
19:12:09 ianw's change is an improvement in that it deletes many of the files much of the time, but it's not 100%; the only 100% fix is async tmp cleanup
19:13:11 corvus: right
19:13:24 anyway I think we can improve what we've got for now, then look into further improvement as a followup
19:14:23 if we've deleted files in /tmp on bridge, then we've probably got a year of headroom :)
19:14:36 hopefully it won't take that long :)
19:14:44 yup I deleted all files older than a month following the rough format used by zuul currently
19:14:47 should be plenty :)
19:15:41 Any other bastion host changes to call out? I think the ansible in a venv work hasn't happened yet as other items have come up
19:15:57 ianw: thanks for your work on this -- it's def a good improvement. it also scares me a lot which is why i'm trying to bring up as much info/caveats as possible.
19:16:05 not sure if that comes through in text :)
19:17:20 corvus: thanks, and yes touching anything related to command:/shell: also worries me :)
19:18:16 #topic Upgrading Bionic servers to Focal/Jammy
19:18:26 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:18:50 I was hoping to spin up a jammy replacement server for something like zp01 late last week, then that jammy kernel thing happened
19:19:26 Since then I've also realized that I keep intending to spin up a prometheus and help with mailman 3 work. I'm now thinking I'm going to start here by seeing what mailman3 on jammy looks like in CI
19:19:57 I think that kills two birds with one stone as far as spinning up jammy in our configuration management goes. I don't really expect any problems
19:20:28 agreed!
19:20:35 But don't let me stop anyone else from chipping away at this either. I think there is enough here to do that we can work it concurrently :)
19:21:20 ++ yes taking any steps helps! :)
19:21:21 If you do find jammy differences that are notable please call them out (on the etherpad?)
19:21:47 do we have system-config-base jobs on jammy yet?
19:22:00 that's probably an easy place to start
19:22:14 ianw: we have some jobs primarily for wheel building iirc. I don't know if that made it as far as system-config-base jobs. But that is a good call out
19:22:47 I can probably look at that this afternoon
19:24:14 #topic Zuul job POST_FAILUREs
19:24:44 I haven't heard anyone complaining about these recently, but we did end up landing the base job update to record log upload target swift locations before uploading
19:25:14 this means if we do start to get reports of these again we can query their logs (on the executor, since the upload failed) to see where they were uploading to. Then we can check if they are all consistently to a single target
19:25:58 and take the debugging from there. I'm still slightly suspicious the glibc fix may have helped make things better, but only because when we updated glibc the problem seemed to stop being reported. I haven't explicitly gone looking for the problem afterwards
19:27:03 it could also be that the log pruning to reduce the total count of files has made a noticeable impact
19:28:19 #topic Service Coordinator Elections
19:29:00 About 6 months ago I pencilled in August 2-16, 2022 as our nomination period. I think that scheduling continues to work and was going to make sure there were no objections here before sending an email about it to the service-discuss list today
19:29:22 I'm still happy for someone else to give it a go too :)
19:30:24 ... this is the problem with doing the job too well :)
19:30:52 heh. Are you suggesting I should do worse? :P
19:31:17 I won't send the email immediately, so let me know if you've got any objections to that timing and we can take it from there. Otherwise late today (relative to me) I'll get that sent out
19:31:30 #topic Open Discussion
19:33:12 I'm in day 3 of a ~6-8 day heat wave. I really only ever worry about power outages in weather like this or ice storms. Though things seem to be holding up so far.
19:34:11 I've also got a meeting with the Works on Arm folks Thursday at 8am Pacific.
19:34:23 ianw: it would be great to have you, but I don't think it is more important than your sleep :)
19:34:49 I should be able to handle it fine, but happy to forward the email with details to anyone else if interested
19:36:13 ok, i guess my input is "we like running jobs on arm, and i think it benefits everyone having them" :)
19:36:20 ++
19:37:13 i think i pulled a few stats on the total number of jobs run, i wonder if there's an exact way to tell?
19:37:53 the graphite/grafana numbers are probably the best we've got
19:38:17 We might also be able to scrape the zuul api but I don't think it is very efficient at collecting stats like that through the rest api
19:38:31 (you'd have to iterate through all the jobs you are interested in)
19:38:33 yeah, pointing to that is probably the most compelling thing, since you can see it
19:39:00 yep, and things like system-config have two nodes, which doubles usage
19:39:17 for a one off, you could run a sql query
19:39:22 corvus: oh good point
19:39:42 and ya I agree that the visual data is always good
19:40:57 Anything else before we call it a meeting?
19:41:37 i didn't have anything
19:42:19 Sounds like that may be it. Thank you everyone.
19:42:24 We'll be back here next week
19:42:39 #endmeeting
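(For reference, a minimal sketch of the per-job REST API iteration mentioned in the open discussion above, using Zuul's /builds endpoint. The tenant URL and job name are placeholders, and paging through builds like this is exactly the inefficiency noted in the meeting; for a one-off count, a SQL query against Zuul's build database would likely be the better tool.)

```python
#!/usr/bin/env python3
"""Count recorded builds for a handful of jobs via the Zuul REST API.

Sketch only: the tenant and job names below are placeholders, and the
count reflects whatever build history the API still retains.
"""
import requests

ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack"  # assumed tenant
JOBS = ["example-arm64-job"]  # placeholder job names of interest
PAGE = 100

for job in JOBS:
    count = 0
    skip = 0
    while True:
        resp = requests.get(
            f"{ZUUL_API}/builds",
            params={"job_name": job, "limit": PAGE, "skip": skip},
            timeout=30,
        )
        resp.raise_for_status()
        builds = resp.json()
        count += len(builds)
        # A short page means we have reached the end of the history.
        if len(builds) < PAGE:
            break
        skip += PAGE
    print(f"{job}: {count} builds recorded")
```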