19:00:18 <clarkb> #startmeeting infra
19:00:18 <opendevmeet> Meeting started Tue Dec 17 19:00:18 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:18 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:18 <opendevmeet> The meeting name has been set to 'infra'
19:00:24 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MID2FVRVSSZBARY7TM64ZWOE2WXI264E/ Our Agenda
19:00:27 <clarkb> #topic Announcements
19:01:02 <clarkb> This will be our last meeting of 2024. The next two Tuesdays overlap with major holidays (December 24 and 31) so we'll skip the next two weeks of meetings and be back here January 7, 2025
19:03:07 <clarkb> #topic Zuul-launcher image builds
19:03:29 <clarkb> Not sure if we've got corvus around but yesterday I'm pretty sure I saw zuul changes get proposed for adding image build management API stuff to zuul web
19:03:36 <clarkb> including things like buttons to delete images
19:03:49 <corvus> o/
19:04:02 <corvus> yep i'm partway through that
19:04:12 <corvus> a little more work needed on the backend for the image deletion lifecycle
19:04:25 <corvus> then the buttons will actually do something
19:04:29 <corvus> then we're ready to try it out. :)
19:04:54 <clarkb> corvus: zuul-client will get similar support too I hope?
19:05:56 <corvus> that should be able to happen, but it's not a priority for me at the moment.
19:06:02 <clarkb> ack
19:06:14 <clarkb> I guess if it is in the API then any api client can manage it
19:06:20 <clarkb> just a matter of adding support
19:06:20 <corvus> yeah
19:06:47 <corvus> i think the web is going to be a much nicer way of interacting with this stuff
19:06:54 <corvus> so i'd recommend it for opendev admins. :)
19:07:16 <corvus> (it's a lot of uuids)
19:07:49 <clarkb> any other image build updates?
19:07:58 <corvus> not from me
19:08:10 <clarkb> #topic Upgrading old servers
19:08:27 <clarkb> I have updates on adding noble support but that has a topic of its own
19:08:37 <clarkb> anyone else have updates for general server update related efforts?
19:08:38 <tonyb> Nothing new from me
19:08:46 <clarkb> cool let's dive into the next topic then
19:08:53 <clarkb> #topic Docker compose plugin with podman service for servers
19:09:25 <clarkb> The background on this is that Ubuntu Noble comes with python3.12 and docker-compose, the python tool, doesn't run on python3.12. Rather than try and resurrect that tool we instead switch to the docker compose golang implementation
19:09:39 <clarkb> then because it's a clean break we take this as an opportunity to run containers with podman on the backend rather than docker proper
19:09:50 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/937641 docker compose and podman for Noble
19:10:23 <clarkb> This change and the rest of topic:podman-prep show that this generally works for a number of canary services (paste, meetpad, gitea, gerrit, mailman) within our CI jobs
19:10:57 <clarkb> There are a few things that do break though and most of topic:podman-prep has been written to update things to be forward and backward compatible so the same config management works on pre-Noble and Noble installations
19:11:43 <clarkb> things we've found so far: podman doesn't support syslog logging output so we update that to journald which seems to be largely equivalent. Docker compose changes the actual underlying container names so we switch to using docker-compose/docker compose commands which use logical container names.
19:12:13 <clarkb> oh and docker-compose pull is more verbose than docker compose pull so we had to rewrite some ansible that checks if we pulled new images to more explicitly check if that has happened
19:12:28 <clarkb> on the whole fairly minor things that we've been able to work around in every case which I think is really promising
19:12:38 <tonyb> There is a CLI flag to keep the python style names if you want
19:12:56 <clarkb> tonyb: no I think this is better since it gets us into a more forward looking position
19:13:07 <clarkb> and the changes are straightforward
19:13:18 <clarkb> as far as the implementation itself goes there are a few things to be aware
19:13:21 <clarkb> *aware of
19:13:36 <tonyb> oka
19:13:41 <tonyb> *okay
19:13:46 <clarkb> I've configured the podman.socket systemd unit to listen on docker's default socket path and disabled docker / put docker.socket on a different socket path
19:14:01 <clarkb> what this means is if you run `docker` or `docker compose` you automatically talk to podman on the backend
19:14:28 <tonyb> Which is kinda cute :)
19:14:39 <clarkb> when you run `podman` you also talk to podman on the backend but it seems to use a different (podman default) socket path. The good news is that despite the change of socket path podman commands also see all of the docker command created resources so I don't think this is an issue
19:15:04 <clarkb> I also put a shell shim at /usr/local/bin/docker-compose to replace the python docker-compose command that would normally go there and it just passes through to `docker compose`
19:15:19 <clarkb> this is a compatibility layer for our configuration management that we can drop once everything is on noble or newer
19:15:44 <clarkb> oh and I've opted to install both docker-compose-v2 and podman from ubuntu distro packages. Not sure if we want to try and install one or the other or both from upstream sources
19:16:05 <clarkb> I went with distro packages because it made bootstrapping simpler, but now that we know it generally works we can expand to other sources if we like
19:17:02 <clarkb> I think where we are at now is deciding how we want to proceed with this since nothing has exploded dramatically
19:17:26 <clarkb> in particular I think we have two options. One is to continue to use that WIP change to check our services in CI and fix them up before we land the changes to the install-docker role
19:17:41 <corvus> i think using docker cli commands when we run things manually makes sense to match our automation
19:17:49 <clarkb> the other is to accept we've probably done enough testing to proceed with something in production (like paste) and then get something into production more quickly
19:17:54 <clarkb> corvus: ++
19:18:08 <clarkb> I've mostly used docker cli commands on the held node to test things out of habit too
19:18:30 <clarkb> if we want to proceed to production with a service first I can clean up the WIP change to make it mergeable then next step would be redeploying paste I think
19:18:47 <clarkb> otherwise I'll continue to iterate through our services and see what needs updates and fix them under the topic:podman-prep topic
19:19:22 <tonyb> I like the get it into production approach
19:19:44 <frickler> +1 just maybe not right before the holidays?
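For illustration, the socket override and docker-compose shim clarkb describes above could be expressed roughly as follows; this is a minimal sketch only, with assumed file paths and task layout rather than the actual system-config role:

# Hypothetical ansible playbook approximating the setup described above.
- hosts: all
  become: true
  tasks:
    - name: Create a drop-in directory for podman.socket
      file:
        path: /etc/systemd/system/podman.socket.d
        state: directory
        mode: "0755"

    - name: Make podman.socket listen on docker's default socket path
      copy:
        dest: /etc/systemd/system/podman.socket.d/10-docker-sock.conf
        content: |
          [Socket]
          ListenStream=
          ListenStream=/var/run/docker.sock
      # A daemon-reload plus socket restart (and disabling docker.socket or
      # moving it to a different path) would follow in the real role.

    - name: Install a docker-compose shim that execs the compose v2 plugin
      copy:
        dest: /usr/local/bin/docker-compose
        mode: "0755"
        content: |
          #!/bin/sh
          # Compatibility layer until everything is on Noble or newer.
          exec docker compose "$@"

With that in place, both `docker` and the legacy `docker-compose` entry point end up talking to podman behind the scenes, which matches the behavior described above.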
19:19:51 <corvus> i recognize this is unhelpful, but i'm ambivalent about whether we go into prod now or continue to iterate. our pre-merge testing is pretty good. but maybe putting something simple in production will help us find unknown unknowns sooner.
19:20:07 <tonyb> Given the switch is combined with the OS update I think it's fairly safe and controllable
19:20:20 <clarkb> ya the unknown unknowns is probably the biggest concern at this point so pushing something into production probably makes the most sense
19:20:28 <clarkb> tonyb: I agree cc frickler
19:20:28 <corvus> i think if paste blows up we can roll it back pretty easily despite the holidays.
19:20:58 <clarkb> ok sounds like I should clean up that change so we can merge it and then maybe tomorrowish start a new paste
19:21:39 <clarkb> oh the last thing I wanted to call out about this effort is we landed changes to "fix" logging for several services yesterday all of which have restarted onto the new docker compose config except for gerrit
19:22:06 <clarkb> after lunch today I'd like to restart gerrit so that it is running under that new config (mostly a heads up; I should be able to do that without trouble since the images aren't updating, just the logging config for docker)
19:22:14 <tonyb> We can keep the existing paste around, then rollback if needed would be DNS and data dump+restore
19:22:18 <corvus> yeah i think doing a gerrit restart soon would be good
19:22:31 <clarkb> tonyb: ++
19:22:48 <clarkb> ok I think that is the feedback I needed to take the next few steps on this effort. Thanks!
19:22:54 <clarkb> any other questions/concerns/feedback?
19:23:01 <corvus> thank you clarkb
19:23:26 <tonyb> clarkb: Thanks for doing this, I'm sorry I slowed things down.
19:23:40 <clarkb> tonyb: I don't think you slowed things down. I wasn't able to get to it until I got to it
19:23:45 <tonyb> clarkb: also ++ on a gerrit restart
19:25:03 <clarkb> #topic Docker Hub Rate Limits
19:25:29 <clarkb> Last week when we updated Gerrit's image to pick up the openid fix I wrote, we discovered that we were at the docker hub rate limit despite not pulling any images for days
19:25:45 <clarkb> this led to us discovering that docker hub treats entire ipv6 /64s as a single IP for rate limit purposes
19:26:16 <clarkb> at the time we worked around this by editing /etc/hosts to use ipv4 addresses for the two locations that are used to fetch docker images from docker hub and that sorted us out
19:26:42 <clarkb> I think we could update our zuul jobs to maybe do that on the fly if we like though I haven't looked into doing that myself
19:26:58 <clarkb> I know corvus is advocating for more of a get off docker hub approach which is partially what is driving the noble podman work
19:27:11 <clarkb> basically I've been going down that path myself too
19:27:35 <corvus> i think we have everything we need to start mirroring images to quay
19:27:49 <clarkb> oh right the role to drive that landed in zuul-jobs last week too
19:28:08 <corvus> next step for that would be to write a simple zuul job that used the new role and run it in a periodic pipeline
19:28:12 <clarkb> I think starting with the python base image(s) and mariadb image(s) would be a good start
19:28:19 <corvus> ++
19:29:22 <clarkb> corvus: where do you think that job should live and what namespace should we publish to?
19:29:35 <clarkb> like should opendev host the job and also host the images in the quay opendev location?
19:30:11 <corvus> yeah for those images, i think so
19:30:36 <clarkb> ya I think that makes sense since we would be primary consumers of them
19:30:45 <corvus> what's a little less clear is:
19:31:17 <corvus> say the zuul project needs something else; should it make its own job and use its own namespace, or should we have the opendev job be open to anyone?
19:31:57 <corvus> i don't think it'd be a huge imposition on us to review one-liner additions.... but also, it can be self-serve for individual projects so maybe it should be?
19:32:21 <clarkb> ya getting us out of the critical path unless there is broader usefulness is probably a good thing
19:32:28 <corvus> (also, i guess if the list of images gets very long, we'd need multiple jobs anyway...)
19:33:17 <corvus> ok, so multiple jobs managed by projects; sounds good to me
19:33:33 <tonyb> Yeah I think I'm leaning that way also
19:33:51 <corvus> (we might also want one job per image for rate limit purposes :)
19:34:11 <clarkb> that is a very good point actually
19:34:59 <corvus> oh also, i'd put the "pull" part of the job in a pre-run playbook. ;)
19:35:06 <clarkb> also a good low impact change over holidays for anyone who finds time to do it
19:35:14 <clarkb> that is also a good idea
19:35:23 <corvus> oh phooey we can't
19:35:31 <corvus> we'd need to split the role. oh well.
19:35:41 <clarkb> can cross that bridge if/when we get there
19:35:53 <corvus> yep.
19:35:59 <clarkb> anything else on this topic?
19:36:23 <corvus> nak
19:36:36 <clarkb> #topic Rax-ord Noble Nodes with 1 VCPU
19:36:52 <clarkb> I don't really have anything new on this topic, but it seems like this problem occurs very infrequently
19:37:12 <clarkb> I wanted to make sure I wasn't missing anything important on this or that I'm wrong about ^ that observation
19:37:36 <tonyb> I trust your investigation
19:37:53 <clarkb> the rax flex folks indicated that anything xen related is not in their wheelhouse unfortunately
19:38:10 <clarkb> so I suspect that addressing this is going to require someone to file a proper ticket with rax
19:38:22 <clarkb> not sure that is high on my priority list unless the error rate increases
19:38:40 <tonyb> Can we add something to detect this early and fail the job, which would get a new node from the pool?
19:39:08 <clarkb> tonyb: yes I think you could do a check along the lines of ansible bios version is too low and num cpu is 1 then exit with failure
19:39:46 <clarkb> I don't think we only want to check the cpu count since we may have jobs that intentionally try to use lower cpus and then wonder why they fail in a year or three, so try and constrain it as much as possible to the symptoms we have identified
19:40:31 <tonyb> clarkb: ++
19:40:51 <clarkb> #topic Ubuntu-ports mirror fixups
19:40:54 <tonyb> I'll see if I can get something written today.
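A minimal sketch of the early-abort check discussed above might look like the following; the BIOS version cutoff here is a placeholder assumption and would need to be taken from the affected rax-ord hosts:

# Hypothetical pre-run task: bail out early on suspect nodes so the job can
# be retried on a fresh node instead of failing later in confusing ways.
- hosts: all
  tasks:
    - name: Fail fast on nodes showing the old-Xen-BIOS-plus-1-VCPU symptom
      fail:
        msg: >-
          Node reports {{ ansible_processor_vcpus }} VCPU with BIOS version
          {{ ansible_bios_version }}; aborting so a replacement node is used.
      when:
        - ansible_processor_vcpus | int == 1
        - ansible_bios_version is version('4.1.5', '<')  # placeholder cutoff, not verified

Keying on both facts keeps the check constrained to the observed symptom rather than penalizing jobs that legitimately run on low-CPU nodes, per the discussion above.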
19:40:56 <clarkb> thanks
19:41:18 <clarkb> fungi discovered a few of our afs mirrors were stale and managed to fix them all in a pretty straightforward manner except for ubuntu-ports
19:42:03 <clarkb> reprepro was complaining about a corrupt Berkeley DB for ubuntu-ports, fungi rebuilt the db which fixed the db issue but then some tempfile which records package info was growing to infinity recording the same packages over and over again
19:42:14 <clarkb> eventually that would hit disk limits for the disk the temp file was written to and we would fail
19:42:40 <clarkb> where we've ended up is that fungi has deleted the ubuntu-ports RW volume content and has started reprepro over from scratch
19:42:55 <clarkb> the RO volumes still have the old stale content so our jobs should continue to run successfully
19:43:11 <clarkb> this is mostly a heads up about this situation as it may take some time to correct
19:43:29 <clarkb> the rebuild from scratch is running in a screen on mirror-update. Though I haven't checked in on it myself yet
19:44:04 <clarkb> fungi is enjoying some well deserved vacation time so we don't have an update from him on this but we can check directly on progress in the screen if anything comes up in the short term
19:44:13 <clarkb> sounded like fungi would try to monitor it as he can though
19:44:22 <clarkb> #topic Open Discussion
19:44:24 <clarkb> Anything else?
19:45:00 <clarkb> I guess it is worth mentioning that I'm expecting to be around this week. But then the two weeks after I'll be in and out as things happen with the kids etc
19:45:14 <clarkb> I don't really have a concrete schedule as I'm not currently traveling anywhere so it will be more organic time off I guess
19:45:59 <tonyb> Same, although I'll be more busy with my kids the week after Christmas
19:46:52 <clarkb> I hope everyone else gets to enjoy some time off too.
19:47:14 <clarkb> Also worth mentioning that since we don't have any weekly meetings for a while please do bring up important topics via our typical comms channels if necessary
19:47:24 <tonyb> noted
19:48:09 <clarkb> and thank you everyone for all the help this year. I think it ended up being quite productive within OpenDev
19:48:28 <tonyb> clarkb: and thank you
19:48:32 <corvus> thanks all and thank you clarkb !
19:48:57 <tonyb> clarkb: Oh did you have annual report stuff to write that you'd like eyes on?
19:49:13 <clarkb> tonyb: they have changed how we draft that stuff this year so I don't
19:49:20 <clarkb> there is still a small section of content but it's much smaller in scope
19:50:09 <tonyb> clarkb: Oh cool?
19:50:27 <clarkb> I've used it to call out software updates (gerrit, gitea, etherpad etc) as well as onboarding new clouds (raxflex) as well as a notice that we lost one of two arm clouds
19:50:51 <tonyb> nice
19:50:53 <clarkb> more of a quick highlights than an in-depth recap
19:51:04 <clarkb> if you feel strongly about something not in that list let me know and I can probably add it
19:52:30 <clarkb> and with that I think we can end the meeting
19:52:30 <tonyb> I don't have any strong feelings about it. It just occurred to me that we normally have $stuff to do at this time of year
19:52:34 <clarkb> ack
19:52:34 <tonyb> ++
19:52:49 <clarkb> we'll be back here at our normal time and location on January 7
19:53:05 <clarkb> #endmeeting
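As an aside on the image mirroring discussion above, the general shape of the periodic job being proposed could look roughly like this; the sketch uses a plain skopeo copy rather than the new zuul-jobs role, and the destination namespace and image choice are assumptions:

# Hypothetical playbook a periodic-pipeline job could run. Credential
# handling (logging in to quay.io) is omitted, and a real setup would likely
# use one job per image for rate limit purposes as discussed above.
- hosts: localhost
  tasks:
    - name: Mirror mariadb from Docker Hub to quay.io
      command: >-
        skopeo copy --all
        docker://docker.io/library/mariadb:10.11
        docker://quay.io/opendevmirror/mariadb:10.11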