Tuesday, 2022-07-26

clarkbMeeting time!19:00
fungiohai!19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Jul 26 19:01:06 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
ianwo/19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-July/000346.html Our Agenda19:01
clarkbI had no announcements so I'm just going to dive right into the topic list19:01
clarkb#topic Topics19:01
clarkb#topic Improving CD throughput19:02
clarkbI'm not aware of any changes to this since the last meeting, but wanted to make sure I wasn't overlooking anything important or actionable19:02
clarkbSounds like there aren't any updates from others either19:03
clarkb#topic Updating Grafana Management Tooling19:03
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html19:03
clarkb#link https://review.opendev.org/q/topic:grafana-json19:04
clarkbThank you ianw for putting this together19:04
clarkbI've managed to review the stack and had a few pieces of feedback but overall I think this looks good19:04
ianwyes thanks, i need to respond to your comments19:05
clarkbI suspect once we clarify a few of those things a second reviewer should be able to land them if that reviewer is happy too19:05
ianwone was about having two jobs; they probably could be combined.  one is an explicit syntax check19:05
clarkbya I +2'd them all as my feedback was all minor and I'd be happy to address things in a followup if we decide that addressing those items is a good idea19:06
clarkbI guess that is probably all there is to say on this :) second reviewer would be great if anyone else has time19:07
clarkb#topic Bastion Host Updates19:07
clarkbWe discovered recently that Zuul leaks console log streaming artifacts (task log files essentially) in /tmp on bridge19:08
clarkbI wrote a simple (probably too simple) change to have a periodic cleanup of those files run on bridge. But ianw had the even better idea of updating zuul to clean up after itself19:08
clarkbConsidering how long this has been happening I don't think it is an emergency and I can abandon my change while we work to land ianw's fixes to zuul19:09
clarkbif anyone is concerned with that plan let me know and I'll work to make my change less bad and it can be a temporary fix19:09
clarkbianw: I also wanted to call out that corvus made a note on the base change of the zuul fix stack that we probably do need tmp reaper functionality in zuul itself too for aborted jobs19:09
ianwi just noticed there was some discussion in matrix over adding a periodic cleaner to the zuul-console daemon19:09
clarkbyup19:10
clarkbI got the impression the current stack can go in as is, but we should look at a followup to close the aborted job gap19:10
corvusto be clear, the current behavior is not an oversight in zuul19:10
clarkbsince the current stack is a strict improvement. It just doesn't fully solve the problem19:10
ianwyeah.  i guess my concern with that a priori is the same thing that made me feel a bit weird about cleaning it via the cron job, in that it's a global namespace there19:10
corvusi think an improvement is fine19:10
corvusbut it's not like we just forgot to do that19:10
corvuswe understood that it's nearly impossible to actually remove these files synchronously19:11
corvuswhich is why we expected one of 2 things: either the node disappears, or a tmp reaper/reboot fixes it19:11
ianwbut we can't really change the name of the file on disk until we are happy enough that there are no zuul_console processes out there looking for the old name19:11
corvusianw's change is an improvement in that it deletes many of the files much of the time, but it's not 100%; the only 100% fix is async tmp cleanup19:12
clarkbcorvus: right19:13
clarkbanyway I think we can improve what we've got for now, then look into further improvement as a followup19:13
corvusif we've deleted files in /tmp on bridge, then we've probably got a year of headroom :)19:14
corvushopefully it won't take that long :)19:14
clarkbyup I deleted all files older than a month following the rough format used by zuul currently19:14
clarkbshould be plenty :)19:14
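As a rough illustration of the age-based cleanup clarkb describes above, something like the sketch below could do it; the /tmp file pattern and the 30-day cutoff are assumptions here, not the exact naming zuul_console uses:

    import glob
    import os
    import time

    # Assumed naming for the leaked console log artifacts; verify the
    # actual pattern zuul_console writes before relying on this.
    PATTERN = "/tmp/console-*.log"
    CUTOFF = time.time() - 30 * 24 * 3600  # roughly "older than a month"

    for path in glob.glob(PATTERN):
        try:
            if os.path.getmtime(path) < CUTOFF:
                os.remove(path)
        except OSError:
            # A file may vanish between the glob and the unlink.
            pass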
clarkbAny other bastion host changes to call out? I think the ansible in a venv work hasn't happened yet as other items have come up19:15
corvusianw: thanks for your work on this -- it's def a good improvement.  it also scares me a lot which is why i'm trying to bring up as much info/caveats as possible.19:15
corvusnot sure if that comes through in text :)19:16
ianwcorvus: thanks, and yes touching anything related to command:/shell: also worries me :)19:17
clarkb#topic Upgrading Bionic servers to Focal/Jammy19:18
clarkb#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.19:18
clarkbI was hoping to spin up a jammy replacement server for something like zp01 late last week then that jammy kernel thing happened19:18
clarkbSince then I've also realized that I keep intending on spinning up a prometheus and helping with mailman 3 work. I'm now thinking I'm going to start here by seeing what mailman3 on jammy looks like in CI19:19
clarkbI think that kills two birds with one stone as far as spinning up jammy in our configuration management goes. I don't really expect any problems19:19
fungiagreed!19:20
clarkbBut don't let me stop anyone else from chipping away at this either. I think there is enough here to do that we can work on it concurrently :)19:20
ianw++ yes taking any steps helps! :)19:21
clarkbIf you do find jammy differences that are notable please call them out (on the etherpad?)19:21
ianwdo we have system-config-base jobs on jammy yet?19:21
ianwthat's probably an easy place to start19:22
clarkbianw: we have some jobs primarily for wheel building iirc. I don't know if that made it as far as system-config-base jobs. But that is a good call out19:22
clarkbI can probably look at that this afternoon19:22
clarkb#topic Zuul job POST_FAILUREs19:24
clarkbI haven't heard anyone complaining about these recently, but we did end up landing the base job update to record log upload target swift locations before uploading19:24
clarkbthis means if we do start to get reports of these again we can query their logs (on the executor since upload failed) to see where they were uploading to. Then we can check if they are all consistently to a single target19:25
clarkband take the debugging from there. I'm still slightly suspicious the glibc fix may have helped make things better, but only because when we updated glibc the problem seemed to stop being reported. I haven't explicitly gone looking for the problem afterwards19:25
clarkbit could also be that the log pruning to reduce the total count of files has made a noticeable impact19:27
clarkb#topic Service Coordinator Elections19:28
clarkbAbout 6 months ago I pencilled in August 2-16, 2022 as our nomination period. I think that scheduling continues to work and I was going to make sure there were no objections here before sending email about it to the service-discuss list today19:29
clarkbI'm still happy for someone else to give it a go too :)19:29
ianw... this is the problem with doing the job too well :)19:30
clarkbheh. Are you suggesting I should do worse? :P19:30
clarkbI won't send the email immediately, so let me know if you've got any objections to that timing and we can take it from there. Otherwise late today (relative to me) I'll get that sent out19:31
clarkb#topic Open Discussion19:31
clarkbI'm in day 3 of a ~6-8 day heat wave. I really only ever worry about power outages in weather like this or ice storms. Though things seem to be holding up so far.19:33
clarkbI've also got a meeting with the works on arm folks thursday at 8am pacific.19:34
clarkbianw: it would be great to have you, but I don't think it is more important than your sleep :)19:34
clarkbI should be able to handle it fine, but happy to forward the email with details to anyone else if interested19:34
ianwok, i guess my input is "we like running jobs on arm, and i think it benefits everyone having them" :)19:36
clarkb++19:36
ianwi think i pulled a few stats on the total number of jobs run, i wonder if there's an exact way to tell?19:37
clarkbthe graphite/grafana numbers are probably the best we've got19:37
clarkbWe might also be able to scrape the zuul api but I don't think it is very efficient at collecting stats like that through the rest api19:38
clarkb(you'd have to iterate through all the jobs you are interested in)19:38
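To illustrate the iteration clarkb mentions, here is a sketch of paging through the Zuul builds REST API with the job_name, limit, and skip query parameters; the tenant and job names below are placeholders:

    import requests

    API = "https://zuul.opendev.org/api/tenant/openstack/builds"
    JOBS = ["example-arm64-job"]  # hypothetical list of jobs of interest

    total = 0
    for job in JOBS:
        skip = 0
        while True:
            resp = requests.get(
                API, params={"job_name": job, "limit": 100, "skip": skip}
            )
            resp.raise_for_status()
            builds = resp.json()
            if not builds:
                break
            total += len(builds)
            skip += len(builds)

    print(f"total builds counted: {total}")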
ianwyeah, pointing to that is probably the most compelling thing, since you can see it19:38
ianwyep, and things like system-config have two nodes, which doubles usage 19:39
corvusfor a one off, you could run a sql query19:39
clarkbcorvus: oh good point19:39
clarkband ya I agree that the visual data is always good19:39
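For the one-off SQL approach corvus suggests, the query could look roughly like the sketch below; the connection URL, table, and column names are assumptions about Zuul's database schema and should be checked against the real deployment:

    import sqlalchemy as sa

    # Placeholder connection string; use the real Zuul DB credentials.
    engine = sa.create_engine("mysql+pymysql://user:pass@db.example.org/zuul")

    query = sa.text(
        "SELECT job_name, COUNT(*) AS runs "
        "FROM zuul_build "
        "WHERE job_name LIKE :pattern "
        "GROUP BY job_name ORDER BY runs DESC"
    )

    with engine.connect() as conn:
        for row in conn.execute(query, {"pattern": "%arm64%"}):
            print(row.job_name, row.runs)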
clarkbAnything else before we call it a meeting?19:40
fungii didn't have anything19:41
clarkbSounds like that may be it. Thank you everyone.19:42
clarkbWe'll be back here next week19:42
clarkb#endmeeting19:42
opendevmeetMeeting ended Tue Jul 26 19:42:39 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:42
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-26-19.01.html19:42
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-26-19.01.txt19:42
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-26-19.01.log.html19:42
fungithanks!19:43
