19:00:18 <clarkb> #startmeeting infra
19:00:18 <opendevmeet> Meeting started Tue Dec 17 19:00:18 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:18 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:18 <opendevmeet> The meeting name has been set to 'infra'
19:00:24 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MID2FVRVSSZBARY7TM64ZWOE2WXI264E/ Our Agenda
19:00:27 <clarkb> #topic Announcements
19:01:02 <clarkb> This will be our last meeting of 2024. The next two Tuesdays overlap with major holidays (December 24 and 31) so we'll skip the next two weeks of meetings and be back here January 7, 2025
19:03:07 <clarkb> #topic Zuul-launcher image builds
19:03:29 <clarkb> Not sure if we've got corvus around but yesterday I'm pretty sure I saw zuul changes get proposed for adding image build management API stuff to zuul web
19:03:36 <clarkb> including things like buttons to delete images
19:03:49 <corvus> o/
19:04:02 <corvus> yep i'm partway through that
19:04:12 <corvus> a little more work needed on the backend for the image deletion lifecycle
19:04:25 <corvus> then the buttons will actually do something
19:04:29 <corvus> then we're ready to try it out. :)
19:04:54 <clarkb> corvus: zuul-client will get similar support too I hope?
19:05:56 <corvus> that should be able to happen, but it's not a priority for me at the moment.
19:06:02 <clarkb> ack
19:06:14 <clarkb> I guess if it is in the API then any api client can manage it
19:06:20 <clarkb> just a matter of adding support
19:06:20 <corvus> yeah
19:06:47 <corvus> i think the web is going to be a much nicer way of interacting with this stuff
19:06:54 <corvus> so i'd recommend it for opendev admins. :)
19:07:16 <corvus> (it's a lot of uuids)
19:07:49 <clarkb> any other image build updates?
19:07:58 <corvus> not from me
19:08:10 <clarkb> #topic Upgrading old servers
19:08:27 <clarkb> I have updates on adding noble support but that has a topic of its own
19:08:37 <clarkb> anyone else have updates for general server update related efforts?
19:08:38 <tonyb> Nothing new from me
19:08:46 <clarkb> cool let's dive into the next topic then
19:08:53 <clarkb> #topic Docker compose plugin with podman service for servers
19:09:25 <clarkb> The background on this is that Ubuntu Noble comes with python3.12 and docker-compose, the python tool, doesn't run on python3.12. Rather than try and resurrect that tool we instead switch to the docker compose golang implementation
19:09:39 <clarkb> then because it's a clean break we take this as an opportunity to run containers with podman on the backend rather than docker proper
19:09:50 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/937641 docker compose and podman for Noble
19:10:23 <clarkb> This change and the rest of topic:podman-prep show that this generally works for a number of canary services (paste, meetpad, gitea, gerrit, mailman) within our CI jobs
19:10:57 <clarkb> There are a few things that do break though and most of topic:podman-prep has been written to update things to be forward and backward compatible so the same config management works on pre-Noble and Noble installations
19:11:43 <clarkb> things we've found so far: podman doesn't support syslog logging output so we update that to journald which seems to be largely equivalent. Docker compose changes the actual underlying container names so we switch to using docker-compose/docker compose commands which use logical container names.
19:12:13 <clarkb> oh and docker-compose pull is more verbose than docker compose pull so we had to rewrite some ansible that checks if we pulled new images to more explicitly check if that has happened
19:12:28 <clarkb> on the whole fairly minor things that we've been able to work around in every case which I think is really promising
19:12:38 <tonyb> There is a CLI flag to keep the python style names if you want
19:12:56 <clarkb> tonyb: no I think this is better since it gets us into a more forward looking position
19:13:07 <clarkb> and the changes are straightforward
19:13:18 <clarkb> as far as the implementation itself goes there are a few things to be aware
19:13:21 <clarkb> *aware of
19:13:36 <tonyb> oka
19:13:41 <tonyb> *okay
19:13:46 <clarkb> I've configured the podman.socket systemd unit to listen on docker's default socket path and disabled docker / put docker.socket on a different socket path
19:14:01 <clarkb> what this means is if you run `docker` or `docker compose` you automatically talk to podman on the backend
19:14:28 <tonyb> Which is kinda cute :)
19:14:39 <clarkb> when you run `podman` you also talk to podman on the backend but it seems to use a different (podman default) socket path. The good news is that despite the change of socket path podman commands also see all of the docker command created resources so I don't think this is an issue
19:15:04 <clarkb> I also put a shell shim at /usr/local/bin/docker-compose to replace the python docker-compose command that would normally go there and it just passes through to `docker compose`
19:15:19 <clarkb> this is a compatibility layer for our configuration management that we can drop once everything is on noble or newer
19:15:44 <clarkb> oh and I've opted to install both docker-compose-v2 and podman from ubuntu distro packages. Not sure if we want to try and install one or the other or both from upstream sources
19:16:05 <clarkb> I went with distro packages because it made bootstrapping simpler, but now that we know it generally works we can expand to other sources if we like
19:17:02 <clarkb> I think where we are at now is deciding how we want to proceed with this since nothing has exploded dramatically
19:17:26 <clarkb> in particular I think we have two options. One is to continue to use that WIP change to check our services in CI and fix them up before we land the changes to the install-docker role
19:17:41 <corvus> i think using docker cli commands when we run things manually makes sense to match our automation
19:17:49 <clarkb> the other is to accept we've probably done enough testing to proceed with something in production (like paste) and then get something into production more quickly
19:17:54 <clarkb> corvus: ++
19:18:08 <clarkb> I've mostly used docker cli commands on the held node to test things out of habit too
19:18:30 <clarkb> if we want to proceed to production with a service first I can clean up the WIP change to make it mergeable then next step would be redeploying paste I think
19:18:47 <clarkb> otherwise I'll continue to iterate through our services and see what needs updates and fix them under the topic:podman-prep topic
19:19:22 <tonyb> I like the get it into production approach
19:19:44 <frickler> +1 just maybe not right before the holidays?
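For illustration, the socket override and docker-compose shim clarkb describes above could be expressed roughly as follows; this is a minimal sketch only, with assumed file paths and task layout rather than the actual system-config role:

# Hypothetical ansible playbook approximating the setup described above.
- hosts: all
  become: true
  tasks:
    - name: Create a drop-in directory for podman.socket
      file:
        path: /etc/systemd/system/podman.socket.d
        state: directory
        mode: "0755"

    - name: Make podman.socket listen on docker's default socket path
      copy:
        dest: /etc/systemd/system/podman.socket.d/10-docker-sock.conf
        content: |
          [Socket]
          ListenStream=
          ListenStream=/var/run/docker.sock
      # A daemon-reload plus socket restart (and disabling docker.socket or
      # moving it to a different path) would follow in the real role.

    - name: Install a docker-compose shim that execs the compose v2 plugin
      copy:
        dest: /usr/local/bin/docker-compose
        mode: "0755"
        content: |
          #!/bin/sh
          # Compatibility layer until everything is on Noble or newer.
          exec docker compose "$@"

With that in place, both `docker` and the legacy `docker-compose` entry point end up talking to podman behind the scenes, which matches the behavior described above.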
19:19:51 <corvus> i recognize this is unhelpful, but i'm ambivalent about whether we go into prod now or continue to iterate. our pre-merge testing is pretty good. but maybe putting something simple in production will help us find unknown unknowns sooner.
19:20:07 <tonyb> Given the switch is combined with the OS update I think it's fairly safe and controllable
19:20:20 <clarkb> ya the unknown unknowns is probably the biggest concern at this point so pushing something into production probably makes the most sense
19:20:28 <clarkb> tonyb: I agree cc frickler
19:20:28 <corvus> i think if paste blows up we can roll it back pretty easily despite the holidays.
19:20:58 <clarkb> ok sounds like I should clean up that change so we can merge it and then maybe tomorrowish start a new paste
19:21:39 <clarkb> oh the last thing I wanted to call out about this effort is we landed changes to "fix" logging for several services yesterday all of which have restarted onto the new docker compose config except for gerrit
19:22:06 <clarkb> after lunch today I'd like to restart gerrit so that it is running under that new config (mostly a heads up; I should be able to do that without trouble since the images aren't updating, just the logging config for docker)
19:22:14 <tonyb> We can keep the existing paste around, then rollback if needed would be DNS and data dump+restore
19:22:18 <corvus> yeah i think doing a gerrit restart soon would be good
19:22:31 <clarkb> tonyb: ++
19:22:48 <clarkb> ok I think that is the feedback I needed to take the next few steps on this effort. Thanks!
19:22:54 <clarkb> any other questions/concerns/feedback?
19:23:01 <corvus> thank you clarkb
19:23:26 <tonyb> clarkb: Thanks for doing this, I'm sorry I slowed things down.
19:23:40 <clarkb> tonyb: I don't think you slowed things down. I wasn't able to get to it until I got to it
19:23:45 <tonyb> clarkb: also ++ on a gerrit restart
19:25:03 <clarkb> #topic Docker Hub Rate Limits
19:25:29 <clarkb> Last week when we updated Gerrit's image to pick up the openid fix I wrote, we discovered that we were at the docker hub rate limit despite not pulling any images for days
19:25:45 <clarkb> this led to us discovering that docker hub treats entire ipv6 /64s as a single IP for rate limit purposes
19:26:16 <clarkb> at the time we worked around this by editing /etc/hosts to use ipv4 addresses for the two locations that are used to fetch docker images from docker hub and that sorted us out
19:26:42 <clarkb> I think we could update our zuul jobs to maybe do that on the fly if we like though I haven't looked into doing that myself
19:26:58 <clarkb> I know corvus is advocating for more of a get off docker hub approach which is partially what is driving the noble podman work
19:27:11 <clarkb> basically I've been going down that path myself too
19:27:35 <corvus> i think we have everything we need to start mirroring images to quay
19:27:49 <clarkb> oh right the role to drive that landed in zuul-jobs last week too
19:28:08 <corvus> next step for that would be to write a simple zuul job that used the new role and run it in a periodic pipeline
19:28:12 <clarkb> I think starting with the python base image(s) and mariadb image(s) would be a good start
19:28:19 <corvus> ++
19:29:22 <clarkb> corvus: where do you think that job should live and what namespace should we publish to?
19:29:35 <clarkb> like should opendev host the job and also host the images in the quay opendev location?
19:30:11 <corvus> yeah for those images, i think so
19:30:36 <clarkb> ya I think that makes sense since we would be primary consumers of them
19:30:45 <corvus> what's a little less clear is:
19:31:17 <corvus> say the zuul project needs something else; should it make its own job and use its own namespace, or should we have the opendev job be open to anyone?
19:31:57 <corvus> i don't think it'd be a huge imposition on us to review one-liner additions.... but also, it can be self-serve for individual projects so maybe it should be?
19:32:21 <clarkb> ya getting us out of the critical path unless there is broader usefulness is probably a good thing
19:32:28 <corvus> (also, i guess if the list of images gets very long, we'd need multiple jobs anyway...)
19:33:17 <corvus> ok, so multiple jobs managed by projects; sounds good to me
19:33:33 <tonyb> Yeah I think I'm leaning that way also
19:33:51 <corvus> (we might also want one job per image for rate limit purposes :)
19:34:11 <clarkb> that is a very good point actually
19:34:59 <corvus> oh also, i'd put the "pull" part of the job in a pre-run playbook. ;)
19:35:06 <clarkb> also a good low impact change over holidays for anyone who finds time to do it
19:35:14 <clarkb> that is also a good idea
19:35:23 <corvus> oh phooey we can't
19:35:31 <corvus> we'd need to split the role. oh well.
19:35:41 <clarkb> can cross that bridge if/when we get there
19:35:53 <corvus> yep.
19:35:59 <clarkb> anything else on this topic?
19:36:23 <corvus> nak
19:36:36 <clarkb> #topic Rax-ord Noble Nodes with 1 VCPU
19:36:52 <clarkb> I don't really have anything new on this topic, but it seems like this problem occurs very infrequently
19:37:12 <clarkb> I wanted to make sure I wasn't missing anything important on this or that I'm wrong about ^ that observation
19:37:36 <tonyb> I trust your investigation
19:37:53 <clarkb> the rax flex folks indicated that anything xen related is not in their wheelhouse unfortunately
19:38:10 <clarkb> so I suspect that addressing this is going to require someone to file a proper ticket with rax
19:38:22 <clarkb> not sure that is high on my priority list unless the error rate increases
19:38:40 <tonyb> Can we add something to detect this early and fail the job, which would get a new node from the pool?
19:39:08 <clarkb> tonyb: yes I think you could do a check along the lines of ansible bios version is too low and num cpu is 1 then exit with failure
19:39:46 <clarkb> I don't think we only want to check the cpu count since we may have jobs that intentionally try to use lower cpus and then wonder why they fail in a year or three, so try and constrain it as much as possible to the symptoms we have identified
19:40:31 <tonyb> clarkb: ++
19:40:51 <clarkb> #topic Ubuntu-ports mirror fixups
19:40:54 <tonyb> I'll see if I can get something written today.
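A minimal sketch of the early-abort check discussed above might look like the following; the BIOS version cutoff here is a placeholder assumption and would need to be taken from the affected rax-ord hosts:

# Hypothetical pre-run task: bail out early on suspect nodes so the job can
# be retried on a fresh node instead of failing later in confusing ways.
- hosts: all
  tasks:
    - name: Fail fast on nodes showing the old-Xen-BIOS-plus-1-VCPU symptom
      fail:
        msg: >-
          Node reports {{ ansible_processor_vcpus }} VCPU with BIOS version
          {{ ansible_bios_version }}; aborting so a replacement node is used.
      when:
        - ansible_processor_vcpus | int == 1
        - ansible_bios_version is version('4.1.5', '<')  # placeholder cutoff, not verified

Keying on both facts keeps the check constrained to the observed symptom rather than penalizing jobs that legitimately run on low-CPU nodes, per the discussion above.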
19:40:56 <clarkb> thanks
19:41:18 <clarkb> fungi discovered a few of our afs mirrors were stale and managed to fix them all in a pretty straightforward manner except for ubuntu-ports
19:42:03 <clarkb> reprepro was complaining about a corrupt Berkeley DB for ubuntu-ports, fungi rebuilt the db which fixed the db issue but then some tempfile which records package info was growing to infinity recording the same packages over and over again
19:42:14 <clarkb> eventually that would hit disk limits for the disk the temp file was written to and we would fail
19:42:40 <clarkb> where we've ended up is that fungi has deleted the ubuntu-ports RW volume content and has started reprepro over from scratch
19:42:55 <clarkb> the RO volumes still have the old stale content so our jobs should continue to run successfully
19:43:11 <clarkb> this is mostly a heads up about this situation as it may take some time to correct
19:43:29 <clarkb> the rebuild from scratch is running in a screen on mirror-update. Though I haven't checked in on it myself yet
19:44:04 <clarkb> fungi is enjoying some well deserved vacation time so we don't have an update from him on this but we can check directly on progress in the screen if anything comes up in the short term
19:44:13 <clarkb> sounded like fungi would try to monitor it as he can though
19:44:22 <clarkb> #topic Open Discussion
19:44:24 <clarkb> Anything else?
19:45:00 <clarkb> I guess it is worth mentioning that I'm expecting to be around this week. But then the two weeks after I'll be in and out as things happen with the kids etc
19:45:14 <clarkb> I don't really have a concrete schedule as I'm not currently traveling anywhere so it will be more organic time off I guess
19:45:59 <tonyb> Same, although I'll be more busy with my kids the week after Christmas
19:46:52 <clarkb> I hope everyone else gets to enjoy some time off too.
19:47:14 <clarkb> Also worth mentioning that since we don't have any weekly meetings for a while please do bring up important topics via our typical comms channels if necessary
19:47:24 <tonyb> noted
19:48:09 <clarkb> and thank you everyone for all the help this year. I think it ended up being quite productive within OpenDev
19:48:28 <tonyb> clarkb: and thank you
19:48:32 <corvus> thanks all and thank you clarkb !
19:48:57 <tonyb> clarkb: Oh did you have annual report stuff to write that you'd like eyes on?
19:49:13 <clarkb> tonyb: they have changed how we draft that stuff this year so I don't
19:49:20 <clarkb> there is still a small section of content but it's much smaller in scope
19:50:09 <tonyb> clarkb: Oh cool?
19:50:27 <clarkb> I've used it to call out software updates (gerrit, gitea, etherpad etc) as well as onboarding new clouds (raxflex) as well as a notice that we lost one of two arm clouds
19:50:51 <tonyb> nice
19:50:53 <clarkb> more of a quick highlights than an in-depth recap
19:51:04 <clarkb> if you feel strongly about something not in that list let me know and I can probably add it
19:52:30 <clarkb> and with that I think we can end the meeting
19:52:30 <tonyb> I don't have any strong feelings about it. It just occurred to me that we normally have $stuff to do at this time of year
19:52:34 <clarkb> ack
19:52:34 <tonyb> ++
19:52:49 <clarkb> we'll be back here at our normal time and location on January 7
19:53:05 <clarkb> #endmeeting
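As an aside on the image mirroring discussion above, the general shape of the periodic job being proposed could look roughly like this; the sketch uses a plain skopeo copy rather than the new zuul-jobs role, and the destination namespace and image choice are assumptions:

# Hypothetical playbook a periodic-pipeline job could run. Credential
# handling (logging in to quay.io) is omitted, and a real setup would likely
# use one job per image for rate limit purposes as discussed above.
- hosts: localhost
  tasks:
    - name: Mirror mariadb from Docker Hub to quay.io
      command: >-
        skopeo copy --all
        docker://docker.io/library/mariadb:10.11
        docker://quay.io/opendevmirror/mariadb:10.11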