19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Jul 26 19:01:06 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <ianw> o/
19:01:29 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000346.html Our Agenda
19:01:45 <clarkb> I had no announcements so I'm just going to dive right into the topic list
19:01:55 <clarkb> #topic Topics
19:02:08 <clarkb> #topic Improving CD throughput
19:02:29 <clarkb> I'm not aware of any changes to this since the last meeting, but wanted to make sure I wasn't overlooking anything important or actionable
19:03:42 <clarkb> Sounds like there aren't any updates from others either
19:03:51 <clarkb> #topic Updating Grafana Management Tooling
19:03:56 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:04:00 <clarkb> #link https://review.opendev.org/q/topic:grafana-json
19:04:34 <clarkb> Thank you ianw for putting this together
19:04:46 <clarkb> I've managed to review the stack and had a few pieces of feedback but overall I think this looks good
19:05:01 <ianw> yes thanks, i need to respond to your comments
19:05:37 <clarkb> I suspect once we clarify a few of those things a second reviewer should be able to land them, if that second reviewer is happy too
19:05:45 <ianw> one was about having two jobs; they probably could be combined.  one is an explicit syntax check
19:06:27 <clarkb> ya I +2'd them all as my feedback was all minor and I'd be happy to address things in a followup if we decide that addressing those items is a good idea
19:07:25 <clarkb> I guess that is probably all there is to say on this :) second reviewer would be great if anyone else has time
19:07:54 <clarkb> #topic Bastion Host Updates
19:08:12 <clarkb> We discovered recently that Zuul leaks console log streaming artifacts (task log files essentially) in /tmp on bridge
19:08:41 <clarkb> I wrote a simple (probably too simple) change to have a periodic cleanup of those files run on bridge. But ianw had the even better idea of updating zuul to clean up after itself
19:09:04 <clarkb> Considering how long this has been happening I don't think it is an emergency and I can abandon my change while we work to land ianw's fixes to zuul
19:09:19 <clarkb> if anyone is concerned with that plan let me know and I'll work to make my change less bad and it can be a temporary fix
19:09:46 <clarkb> ianw: I also wanted to call out that corvus made a note on the base change of the zuul fix stack that we probably do need tmp reaper functionality in zuul itself too for aborted jobs
19:09:53 <ianw> i just noticed there was some discussion in matrix over adding a periodic cleaner to the zuul-console daemon
19:10:00 <clarkb> yup
19:10:17 <clarkb> I got the impression the current stack can go in as is, but we should look at a followup to close the aborted job gap
19:10:29 <corvus> to be clear, the current behavior is not an oversight in zuul
19:10:29 <clarkb> since the current stack is a strict improvement. It just doesn't fully solve the problem
19:10:45 <ianw> yeah.  i guess my concern with that a priori is the same thing that made me feel a bit weird about cleaning it via the cron job, in that it's a global namespace there
19:10:46 <corvus> i think an improvement is fine
19:10:54 <corvus> but it's not like we just forgot to do that
19:11:14 <corvus> we understood that it's nearly impossible to actually remove these files synchronously
19:11:37 <corvus> which is why we expected one of 2 things: either the node disappears, or a tmp reaper/reboot fixes it
19:11:59 <ianw> but we can't really change the name of the file on disk until we are happy enough that there are no zuul_console processes out there looking for the old name
19:12:09 <corvus> ianw's change is an improvement in that it deletes many of the files much of the time, but it's not 100%; the only 100% fix is async tmp cleanup
19:13:11 <clarkb> corvus: right
19:13:24 <clarkb> anyway I think we can improve what we've got for now, then look into further improvement as a followup
19:14:23 <corvus> if we've deleted files in /tmp on bridge, then we've probably got a year of headroom :)
19:14:36 <corvus> hopefully it won't take that long :)
19:14:44 <clarkb> yup, I deleted all files older than a month matching the rough naming format zuul currently uses
19:14:47 <clarkb> should be plenty :)
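[For context, a minimal sketch of the kind of periodic /tmp cleanup discussed above. The console-*.log glob and the 30 day threshold are assumptions for illustration, not the exact pattern or age used on bridge.]

    #!/usr/bin/env python3
    # Sketch: remove leaked zuul console log files from /tmp once they are
    # older than roughly a month. Pattern and age threshold are assumptions.
    import glob
    import os
    import time

    MAX_AGE = 30 * 24 * 3600  # ~30 days in seconds

    now = time.time()
    for path in glob.glob('/tmp/console-*.log'):
        try:
            if now - os.path.getmtime(path) > MAX_AGE:
                os.unlink(path)
        except OSError:
            # The file may have been removed already by the job that owns it.
            pass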
19:15:41 <clarkb> Any other bastion host changes to call out? I think the ansible in a venv work hasn't happened yet as other items have come up
19:15:57 <corvus> ianw: thanks for your work on this -- it's def a good improvement.  it also scares me a lot which is why i'm trying to bring up as much info/caveats as possible.
19:16:05 <corvus> not sure if that comes through in text :)
19:17:20 <ianw> corvus: thanks, and yes touching anything related to command:/shell: also worries me :)
19:18:16 <clarkb> #topic Upgrading Bionic servers to Focal/Jammy
19:18:26 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:18:50 <clarkb> I was hoping to spin up a jammy replacement server for something like zp01 late last week, but then that jammy kernel thing happened
19:19:26 <clarkb> Since then I've also realized that I keep intending to spin up a prometheus server and help with the mailman 3 work. I'm now thinking I'm going to start here by seeing what mailman3 on jammy looks like in CI
19:19:57 <clarkb> I think that kills two birds with one stone as far as spinning up jammy in our configuration management goes. I don't really expect any problems
19:20:28 <fungi> agreed!
19:20:35 <clarkb> But don't let me stop anyone else from chipping away at this either. I think there is enough here to do that we can work on it concurrently :)
19:21:20 <ianw> ++ yes taking any steps helps! :)
19:21:21 <clarkb> If you do find jammy differences that are notable please call them out (on the etherpad?)
19:21:47 <ianw> do we have system-config-base jobs on jammy yet?
19:22:00 <ianw> that's probably an easy place to start
19:22:14 <clarkb> ianw: we have some jobs primarily for wheel building iirc. I don't know if that made it as far as system-config-base jobs. But that is a good call out
19:22:47 <clarkb> I can probably look at that this afternoon
19:24:14 <clarkb> #topic Zuul job POST_FAILUREs
19:24:44 <clarkb> I haven't heard anyone complaining about these recently, but we did end up landing the base job update to record log upload target swift locations before uploading
19:25:14 <clarkb> this means if we do start to get reports of these again we can query their logs (on the executor since upload failed) to see where they were uploading to. Then we can check if they are all consistently to a single target
19:25:58 <clarkb> and take the debugging from there. I'm still slightly suspicious the glibc fix may have helped make things better, but only because when we updated glibc the problem seemed to stop being reported. I haven't explicitly gone looking for the problem afterwards
19:27:03 <clarkb> it could also be that the log pruning to reduce the total count of files has made a noticeable impact
19:28:19 <clarkb> #topic Service Coordinator Elections
19:29:00 <clarkb> About 6 months ago I pencilled in August 2-16, 2022 as our nomination period. I think that scheduling continues to work and was going to make sure there were no objections here before sending email about it to the service-discuss list today
19:29:22 <clarkb> I'm still happy for someone else to give it a go too :)
19:30:24 <ianw> ... this is the problem with doing the job too well :)
19:30:52 <clarkb> heh. Are you suggesting I should do worse? :P
19:31:17 <clarkb> I won't send the email immediately, so let me know if you've got any objections to that timing and we can take it from there. Otherwise late today (relative to me) I'll get that sent out
19:31:30 <clarkb> #topic Open Discussion
19:33:12 <clarkb> I'm in day 3 of a ~6-8 day heat wave. I really only ever worry about power outages in weather like this or ice storms. Though things seem to be holding up so far.
19:34:11 <clarkb> I've also got a meeting with the works on arm folks thursday at 8am pacific.
19:34:23 <clarkb> ianw: it would be great to have you, but I don't think it is more important than your sleep :)
19:34:49 <clarkb> I should be able to handle it fine, but happy to forward the email with details to anyone else if interested
19:36:13 <ianw> ok, i guess my input is "we like running jobs on arm, and i think it benefits everyone having them" :)
19:36:20 <clarkb> ++
19:37:13 <ianw> i think i pulled a few stats on the total number of jobs run, i wonder if there's an exact way to tell?
19:37:53 <clarkb> the graphite/grafana numbers are probably the best we've got
19:38:17 <clarkb> We might also be able to scrape the zuul api but I don't think it is very efficient at collecting stats like that through the rest api
19:38:31 <clarkb> (you'd have to iterate through all the jobs you are interested in)
19:38:33 <ianw> yeah, pointing to that is probably the most compelling thing, since you can see it
19:39:00 <ianw> yep, and things like system-config have two nodes, which doubles usage
19:39:17 <corvus> for a one off, you could run a sql query
19:39:22 <clarkb> corvus: oh good point
19:39:42 <clarkb> and ya I agree that the visual data is always good
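[For context, a rough sketch of the kind of one-off SQL query corvus suggests for counting arm builds. The table and column names (zuul_build, job_name, start_time) reflect a general understanding of Zuul's SQL reporter schema, and the host, credentials, and the 'arm64' job-name substring heuristic are placeholders/assumptions.]

    import pymysql

    # Sketch: count builds over the last 30 days whose job name suggests an
    # arm64 node. Connection details and the LIKE heuristic are assumptions.
    conn = pymysql.connect(host='zuul-db.example.org', user='zuul',
                           password='secret', database='zuul')
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT COUNT(*) FROM zuul_build "
                "WHERE job_name LIKE %s "
                "AND start_time > NOW() - INTERVAL 30 DAY",
                ('%arm64%',))
            print(cur.fetchone()[0])
    finally:
        conn.close()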
19:40:57 <clarkb> Anything else before we call it a meeting?
19:41:37 <fungi> i didn't have anything
19:42:19 <clarkb> Sounds like that may be it. Thank you everyone.
19:42:24 <clarkb> We'll be back here next week
19:42:39 <clarkb> #endmeeting