19:00:04 <clarkb> #startmeeting infra
19:00:04 <opendevmeet> Meeting started Tue Nov 19 19:00:04 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:04 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:04 <opendevmeet> The meeting name has been set to 'infra'
19:00:05 * tonyb is on a train so coverage may be sporadic
19:00:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BQ2IWL7BUMNUXYWCQV62DZQCF2AI7E5U/ Our Agenda
19:00:25 <clarkb> #topic Announcements
19:00:35 <tonyb> also in an EU timezone for the next few weeks
19:00:39 <clarkb> oh fun
19:00:56 <tonyb> should be!
19:01:14 <clarkb> Next week is a major US holiday week. I plan to be around Monday and Tuesday and will host the weekly meeting. But we should probably expect slower response times from various places
19:01:47 <clarkb> Anything else to announce?
19:03:15 <clarkb> #topic Zuul-launcher image builds
19:03:41 <clarkb> corvus has continued to iterate on the mechanics of uploading images to swift, downloading them to the launcher and reuploading to the clouds
19:04:12 <clarkb> and the good news is that the time to build a qcow2 image and then shuffle it around is close to, if not better than, the time the current builders take
19:04:19 <clarkb> #link https://review.opendev.org/935455 Setup a raw image cloud for raw image testing
19:04:37 <clarkb> qcow2 images are relatively small compared to the raw and vhd images we also deal with so the next step is testing this process with the larger image types
19:04:57 <clarkb> There is still opportunity to add image build jobs for other distros and releases as well
19:05:32 <tonyb> it's high on my to-do list
19:05:45 <clarkb> cool
19:05:50 <clarkb> anything else on this topic cc corvus
19:06:01 <clarkb> seems like good slow but steady progress
19:07:17 <clarkb> I'll keep the meeting moving as we have a number of items to get through. We can always swing back to topics if we have time at the end or after the meeting etc
19:07:22 <clarkb> #topic Backup Server Pruning
19:07:38 <clarkb> the smaller backup server got close to filling up its disk again and fungi pruned it again. Thank you for that
19:07:57 <clarkb> but this is a good reminder that we have a couple of changes proposed to help alleviate some of that by purging things from the backup servers once they are no longer needed
19:08:04 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/933700 Backup deletions managed through Ansible
19:08:08 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/934768 Handle backup verification for purged backups
19:08:29 <clarkb> oh I missed that fungi had +2'd but not approved the first one
19:09:00 <fungi> i wasn't sure if those needed close attention after merging
19:09:04 <clarkb> should we go ahead and either fix the indentation then approve or just approve it?
19:09:25 <fungi> i think we can just approve
19:10:22 <tonyb> I agree, we can do the indentation after if we want
19:10:32 <clarkb> I think the test case on line 34 in https://review.opendev.org/c/opendev/system-config/+/933700/23/testinfra/test_borg_backups.py should ensure that it is fairly safe
19:11:30 <clarkb> then after it is landed we can touch the retired flag in the ethercalc dir and then add ethercalc to the list of retirements (to catch up the other server), check that worked, then add it to the purge list (to catch up the other server) and check that worked
19:11:40 <clarkb> then if that is good we can retire the rest of the items in our retirement list
19:12:15 <clarkb> I think this should make managing disk consumption a bit more sane
19:12:26 <clarkb> anything else related to backups?
19:13:28 <clarkb> actually I shouldn't need to manually touch the retired file
19:13:42 <clarkb> going through the motions to retire it on the other server should fix the one I did manually
19:13:53 <clarkb> #topic Upgrading old servers
19:14:14 <clarkb> tonyb: anything new to report on wiki or other server replacements?
19:14:35 <clarkb> (I did have a note about the docker compose situation on noble but was going to bring that up during the docker compose portion of the agenda)
19:15:10 <tonyb> nothing new.
19:15:50 <tonyb> I guess I discovered that noble is going to be harder than expected due to python being too new for docker-compose v1
19:15:50 <clarkb> #topic Docker Hub Rate Limits
19:15:53 <clarkb> #undo
19:15:53 <opendevmeet> Removing item from minutes: #topic Docker Hub Rate Limits
19:16:09 <clarkb> ya that's the bit I was going to discuss during the docker compose podman section since they are related
19:16:15 <clarkb> I have an idea for that that may not be terrible
19:16:21 <tonyb> okay that's cool
19:16:21 <clarkb> #topic Docker Hub Rate Limits
19:16:36 <corvus> [i'm here now]
19:16:44 <clarkb> Before we get there another related topic is that people have been noticing we're hitting docker hub rate limits more often
19:16:49 <clarkb> #link https://www.docker.com/blog/november-2024-updated-plans-announcement/
19:16:50 <tonyb> I don't know why it's only just come up as we have noble servers
19:17:02 <clarkb> tonyb: because mirrors don't use docker-compose I think
19:17:08 <clarkb> basically you've avoided the intersection so far
19:17:21 <clarkb> so that blog post says anonymous requests will get 10 pulls per hour now
19:17:45 <clarkb> which is a reduction from whatever the old value is. However, if I go through the dance of getting an anonymous pull token and inspect that token it says 100 pulls per 6 hours
19:18:09 <corvus> maybe they walked it back due to complaints...
19:18:12 <clarkb> I've also experimentally checked docker image pull against 12 different alpine image tags and about 12 various library images from docker hub
19:18:19 <clarkb> and had no errors
19:19:04 <clarkb> corvus: ya that could be. Another thought I had was maybe they rate limit the library images differently than the normal images but once you hit the limit it fails for all pulls. But kolla/base reported the same 100 pulls per 6 hours limit that the library images did so I don't think it is that
19:19:08 <frickler> well 100/6h is 16/h, not too far off, just a bit more burst allowed
19:19:25 <clarkb> frickler: ya the burst is important for CI workloads though particularly since we cache things (if you use the proxy cache)
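As an aside, the token-and-headers dance clarkb describes looks roughly like the sketch below. It follows the flow in Docker's own documentation (the ratelimitpreview/test repository is the example Docker uses for this check); treat the numbers as best-effort.

```python
# Sketch: query Docker Hub's anonymous pull rate limit for this IP.
# Assumes the documented auth.docker.io token endpoint and ratelimit-* headers.
import requests

REPO = "ratelimitpreview/test"  # repository Docker's docs use for this check

# 1. Fetch an anonymous bearer token scoped to pulling that repo.
token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io", "scope": f"repository:{REPO}:pull"},
    timeout=30,
).json()["token"]

# 2. A HEAD on a manifest returns the current limits; per Docker's docs a HEAD
#    should not itself consume a pull, but don't rely on that too hard.
resp = requests.head(
    f"https://registry-1.docker.io/v2/{REPO}/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)

# Header values look like "100;w=21600", i.e. 100 pulls per 21600s (6h) window.
print("ratelimit-limit:       ", resp.headers.get("ratelimit-limit"))
print("ratelimit-remaining:   ", resp.headers.get("ratelimit-remaining"))
print("docker-ratelimit-source:", resp.headers.get("docker-ratelimit-source"))
```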
19:19:43 <corvus> one contingency plan would be to mirror the dockerhub images we need on quay.io (or elsewhere); i started on https://review.opendev.org/935574 for that.
19:20:32 <clarkb> planning for contingencies and generally trying to be on the lookout for anything that helps us understand their changes (if any) would be good
19:20:53 <clarkb> but otherwise I now feel like I understand less today than I did yesterday. This doesn't feel like a drop everything emergency but something we should work to understand and then address
19:20:59 <tonyb> assuming that still works with speculative builds that seems like a solid contingency
19:21:35 <clarkb> another suggested improvement was to lean on buildset registries more to stash all the images a buildset will need and not just those we may build locally
19:21:53 <clarkb> this way we're fetching images like mariadb once per buildset instead of N times for each of N jobs using it
19:22:20 <fungi> heck, some jobs themselves may be pulling the same image multiple times for different hosts
19:22:44 <tonyb> true.
19:23:22 <clarkb> so ya be on the lookout for more concrete info and feel free to experiment with making the jobs more image pull efficient
19:23:28 <clarkb> and maybe we'll know more next week
19:23:30 <clarkb> #topic Docker compose plugin with podman service for servers
19:23:59 <clarkb> This agenda item is related to the previous one in that it would allow us to migrate off of docker hub for opendev images and preserve our speculative testing of our images
19:24:24 <clarkb> additionally tonyb found that python docker-compose doesn't run on python3.12 on noble
19:24:32 <clarkb> which is another reason to switch to docker compose
19:24:44 <corvus> (in particular, getting python-base and python-builder on quay.io could be a big win for total image pulls)
19:24:46 <clarkb> all of this is coming together to make this effort a bit of a higher priority
19:25:09 <clarkb> I'd like to talk about the noble docker compose thing first
19:25:53 <clarkb> I suspect that in our install docker role we can do something hacky like have it install an alias/symlink/something that maps docker-compose to docker compose and then we won't have to rewrite our playbooks/roles until after everything has left docker-compose behind
19:26:21 <clarkb> the two tools have similar enough command lines that I suspect the only place we would run into trouble is anywhere we parse command output and we might have to check both versions instead
19:26:32 <tonyb> yeah, that's kinda gross but it'd work for now
19:26:48 <clarkb> but this way we don't need to rewrite every role using docker-compose today, and as long as we don't do an in place upgrade we're replacing the servers anyway and they'll use the proper tool in that transition
19:26:59 <clarkb> I think this wouldn't work if we did an in place switch between docker-compose and docker compose
19:27:16 <clarkb> but as long as it's an old focal server replaced by a new noble server that should mostly work
19:27:45 <tonyb> yeah.   I guess we can come back and tidy up the roles after the fact
19:27:59 <clarkb> assuming we do that the next question is do we also have the install docker role configure docker-compose (which is now really docker compose) to run podman for everything? I think we were leaning that way when we first discussed this
19:28:12 <clarkb> the upside to that is we get the speculative testing and don't have to be on docker hub
19:28:17 <clarkb> tonyb: exactly
19:28:22 <corvus> i think the last time someone talked about parsing output of "docker-compose" i suggested an alternative...  like maybe an "inspect" command we could use instead.
19:28:27 <clarkb> corvus: ++
19:28:55 <corvus> it may actually be simpler to do it once with podman
19:29:10 <clarkb> so to tl;dr all my typing above I think there are two improvements we should make to our docker installation role. A) set it up to alias docker-compose to docker compose somehow and B) configure docker compose to rely on podman as the runtime
19:29:26 <tonyb> well one place that'd be hard is docker compose pull and looking for updates, but yes generally avoiding parsing the output is good
19:29:32 <corvus> just because the difference between them is so small, but it's all at install time.  so switching from docker to podman later is more work than "docker compose" plugin with podman now.
19:29:51 <clarkb> tonyb: ya I think that's the only place we do it. So we'd inspect, pull, inspect and see if they change or something
19:29:59 <clarkb> corvus: exactly
19:30:02 <corvus> tonyb: yeah, that's the thing i suggested an alternative to.  no one took me up on it at the time, but it's in scrollback somewhere.
19:30:11 <tonyb> okay
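A rough sketch of the inspect-pull-inspect idea mentioned above: compare local image IDs before and after the pull instead of scraping pull output. The image reference here is purely illustrative.

```python
# Sketch: detect whether "docker compose pull" fetched a newer image by
# comparing the local image ID before and after, instead of parsing output.
import subprocess

IMAGE = "docker.io/opendevorg/lodgeit:latest"  # illustrative image reference


def image_id(ref: str) -> str:
    """Return the local image ID for ref, or "" if it isn't present yet."""
    proc = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Id}}", ref],
        capture_output=True, text=True,
    )
    return proc.stdout.strip() if proc.returncode == 0 else ""


before = image_id(IMAGE)
subprocess.run(["docker", "compose", "pull"], check=True)
after = image_id(IMAGE)

if before != after:
    print("image updated; restart the service")
    subprocess.run(["docker", "compose", "up", "-d"], check=True)
else:
    print("image unchanged; nothing to do")
```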
19:30:32 <clarkb> tonyb: I know you brought this up as a need for noble servers. I'm not sure if you are interested in making those changes to the docker installation role
19:30:46 <tonyb> yeah I can
19:31:00 <corvus> i think the only outstanding question for https://review.opendev.org/923084 is how to set up the podman socket path -- i think in a previous meeting we identified a potential easy way of doing that.
19:31:06 <tonyb> I don't know exactly what's needed for the podman enablement
19:31:08 <clarkb> I can probably help though I feel like this week is already swamped and next week is the holiday, but ping me for reviews or ideas etc. I ended up brainstorming this a bit the other day so enough is paged in I think I can be useful
19:31:24 <tonyb> and making it nice but I can take direction
19:31:33 <clarkb> tonyb: 923084 does it in a different context so we have to map that into system-config
19:31:43 <tonyb> noted
19:31:43 <clarkb> and i guess address the question of the socket path
19:32:08 <tonyb> okay
19:32:30 <clarkb> sounds like no one is terribly concerned about these hacks and we should be able to get away from them as soon as we're sufficiently migrated
19:32:38 <clarkb> Anything else on these subjects?
19:32:42 <corvus> oh yeah docker "contexts" is the thing
19:32:47 <corvus> that might make setting the path easy
19:33:05 <tonyb> okay cool
19:33:07 <corvus> #link https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-01-19.00.log.html#l-91 docker contexts
19:33:13 <tonyb> and this is for noble+
19:33:13 <clarkb> thanks!
19:33:16 <clarkb> tonyb: yes
19:33:17 <corvus> yep
19:33:22 <tonyb> perfect
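For reference, the docker contexts approach corvus links above boils down to something like the following sketch; the rootful podman socket path is the usual systemd-activated one and is an assumption here.

```python
# Sketch: point the "docker" CLI (and its compose plugin) at podman's socket
# by creating and selecting a docker context.
import subprocess

PODMAN_SOCK = "unix:///run/podman/podman.sock"  # assumes podman.socket is enabled

subprocess.run(
    ["docker", "context", "create", "podman", "--docker", f"host={PODMAN_SOCK}"],
    check=True,
)
subprocess.run(["docker", "context", "use", "podman"], check=True)

# After this, "docker compose up -d" and friends talk to podman instead of dockerd.
```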
19:34:09 <corvus> oh ha
19:34:21 <corvus> clarkb: also you suggested we could set the env var in a "docker-compose" compat tool shim :)
19:34:27 <corvus> (in that meeting)
19:34:41 <tonyb> ( for the record I'm about to get off the train which probably equates to offline)
19:34:43 <clarkb> no wonder when I was thinking about tonyb's problem I was like this is the solution
19:35:01 <tonyb> yeah that could work I guess
19:35:02 <corvus> why not both?
19:35:02 <clarkb> tonyb: ack don't hurt yourself trying to type and walk at the same time
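Putting the two ideas together ("why not both?"), a compat shim could look roughly like this: forward everything to the docker compose plugin and point it at podman via DOCKER_HOST. The socket path, the install location, and setting DOCKER_HOST in the shim at all are assumptions, not a settled design.

```python
#!/usr/bin/env python3
# Sketch of a /usr/local/bin/docker-compose shim for noble hosts: forwards
# all arguments to the "docker compose" plugin and points it at podman.
# Socket path and the DOCKER_HOST-in-the-shim idea are assumptions.
import os
import sys

# Only set DOCKER_HOST if the caller hasn't already chosen an endpoint.
os.environ.setdefault("DOCKER_HOST", "unix:///run/podman/podman.sock")

# Replace this process with "docker compose <original args>".
os.execvp("docker", ["docker", "compose", *sys.argv[1:]])
```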
19:35:16 <clarkb> #topic Enabling mailman3 bounce processing
19:35:20 <clarkb> let's keep this show moving forward
19:35:39 <clarkb> last week lists.opendev.org and lists.zuul-ci.org lists were all set to (or already set to) enable bounce processing
19:35:52 <clarkb> fungi: do you know if openstack-discuss got its config updated?
19:36:29 <clarkb> then separately I haven't received any notifications of members hitting the limits for any of the lists I moderate and can't find evidence of anyone with a score higher than 3 (5 is the threshold)
19:36:34 <clarkb> so I'm curious if anyone has seen that in action yet
19:37:04 <fungi> clarkb: i've not done that yet, no
19:37:28 <clarkb> ok it would probably be good to set that, as I suspect that's where we'll get the most timely feedback
19:37:57 <clarkb> then if we're still happy with the results enabling this by default on new lists is likely to require we define a custom mailman list style and create new lists with that style
19:38:25 <clarkb> the documentation is pretty sparse on how you're actually supposed to create a new style unfortunately
19:38:38 <fungi> i've just now switched "process bounces" to "yes" for openstack-discuss
19:38:48 <clarkb> (there is a rest api endpoint for it but without info on how you set the millions of config options)
19:38:48 <clarkb> fungi: thanks!
19:39:00 <clarkb> fungi: you don't happen to know what would be required to set up a new list style do you?
19:39:29 <clarkb> (we probably actually need two, one for private lists and one for public lists)
19:39:33 <fungi> no clue, short of the instructions for making a mailman plugin in python. might be worth one of us asking on the mailman-users ml
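For whoever picks this up later: creating a list with a non-default style through Mailman core's REST API looks roughly like the sketch below. The credentials and port are placeholders matching mailman.cfg's [webservice] section, the style name is hypothetical, and how the custom style itself gets defined is exactly the part that is still unclear.

```python
# Sketch: create a list using a named style via Mailman core's REST API.
import requests

API = "http://localhost:8001/3.1"
AUTH = ("restadmin", "restpass")  # placeholders for the real REST credentials

# List the styles the server currently knows about.
print(requests.get(f"{API}/lists/styles", auth=AUTH).json())

# Create a list with a specific style instead of the site default.
resp = requests.post(
    f"{API}/lists",
    auth=AUTH,
    data={
        "fqdn_listname": "example-list@lists.opendev.org",  # illustrative list
        "style_name": "opendev-public",  # hypothetical custom style
    },
)
resp.raise_for_status()
```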
19:39:34 <corvus> i think i have an instance of a user where bounce processing is not working
19:40:47 <clarkb> corvus: are they hitting the threshold then not getting removed?
19:42:06 <clarkb> the bounce disable warnings configuration item implies there is some delay that must be reached after being above the threshold before you are removed
19:42:18 <clarkb> I wonder if the 7 day threshold reset is resetting them to 0 before hitting that if so
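The knobs being discussed here live in each list's REST config; a quick, hedged way to eyeball them is below. Credentials are placeholders and the list id uses the dotted form rather than an @.

```python
# Sketch: dump a list's bounce-related settings via Mailman core's REST API
# to sanity-check thresholds, stale-score resets, and warning delays.
import requests

API = "http://localhost:8001/3.1"
AUTH = ("restadmin", "restpass")  # placeholder REST credentials

config = requests.get(
    f"{API}/lists/service-discuss.lists.opendev.org/config", auth=AUTH
).json()

# Print anything bounce-related rather than hard-coding attribute names.
for key in sorted(config):
    if key.startswith("bounce") or key == "process_bounces":
        print(f"{key}: {config[key]}")
```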
19:42:20 <corvus> hrm, i got a message about an unprocessed bounce a while back, but the user does not appear to be a member anymore.  so this may not be actionable.
19:42:24 <corvus> not sure what happened in the interim.
19:42:43 <clarkb> ack, I guess we monitor for current behavior and debug from there
19:42:53 <clarkb> fungi: ++ that seems like a good idea.
19:43:08 <clarkb> I'm going to keep things moving as we have less than 20 minutes and still several topics to cover
19:43:14 <clarkb> #topic Intermediate Insecure CI Registry Pruning
19:43:27 <corvus> 0x8e
19:43:31 <clarkb> As scheduled/announced we started this on Friday. We hit a few issues. The first was 404s on object delete requests
19:43:59 <clarkb> that was a simple fix as we can just ignore 404 errors when trying to delete something. The other was that we weren't paginating object listings so were capped at 10k objects per listing request
19:44:13 <clarkb> this gave the pruning process an incomplete (and possibly inaccurate) picture of what should be deleted vs kept
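For context, Swift caps a single container listing at 10k objects, so a complete listing has to page using the marker parameter, roughly as sketched below. The storage URL, container name, and token are placeholders.

```python
# Sketch: marker-based pagination over a Swift container listing, so the
# pruner sees every object instead of only the first 10k.
import requests

STORAGE_URL = "https://swift.example.com/v1/AUTH_project"  # placeholder
CONTAINER = "intermediate_registry"                         # placeholder
TOKEN = "gAAAA..."                                          # placeholder auth token


def list_all_objects():
    marker = ""
    while True:
        resp = requests.get(
            f"{STORAGE_URL}/{CONTAINER}",
            params={"format": "json", "limit": 10000, "marker": marker},
            headers={"X-Auth-Token": TOKEN},
            timeout=60,
        )
        resp.raise_for_status()
        page = resp.json()
        if not page:  # an empty page means we've walked past the last object
            return
        yield from page
        marker = page[-1]["name"]  # continue after the last name we saw


names = [obj["name"] for obj in list_all_objects()]
print(f"{len(names)} objects total")
```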
19:44:42 <corvus> that means we've fixed two bugs that could have caused the previous issues!  :)
19:44:43 <clarkb> the process was restarted after fixing the problems and has been running since late friday. We anticipate it will take at least 6 days though I think it is slowly trending longer as it goes on
19:45:12 <clarkb> as far as I can tell no one has had problems with the intermediate registry while this is running either which is a good sign we're not over deleting anything
19:45:18 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/935542 Enable daily pruning after this bulk prune is complete
19:45:33 <corvus> 0x8e out of 0xff means we're 55% through the blobs.
19:45:38 <clarkb> once this is done we should be able to run regular pruning that doesn't take days to complete since we'll have far fewer objects to contend with
19:46:18 <clarkb> that change enables a cron job to do this. Which is probably good for hygiene purposes but shouldn't be merged until after we complete this manual run and are happy with it
19:46:36 <clarkb> we'll be much more disk efficient as a result \o/
19:46:49 <clarkb> corvus: napkin math has that taking closer to 8 days now?
19:47:51 * clarkb looks at the clock and continues on
19:47:53 <clarkb> #topic Gerrit 3.10 Upgrade Planning
19:47:54 <corvus> are we starting from friday or saturday?
19:48:11 <clarkb> corvus: I think we started at roughly 00:00 UTC saturday
19:48:30 <clarkb> and we're almost at 20:00 UTC tuesday so like 7.75 days?
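Spelling out the napkin math: a uniform rate would give just under 7 days total; the slowdown clarkb mentioned is what nudges the guess toward 8.

```python
# Napkin math for the prune ETA: 0x8e of 0xff blob prefixes done after
# roughly Saturday 00:00 UTC -> Tuesday ~19:45 UTC of wall-clock time.
done = 0x8e / 0xff                  # ~0.557 of the prefixes processed
elapsed_hours = 3 * 24 + 19.75      # ~91.75 hours elapsed
total_hours = elapsed_hours / done  # ~165 hours if the rate held steady

print(f"estimated total: {total_hours / 24:.1f} days")               # ~6.9 days
print(f"remaining: {(total_hours - elapsed_hours) / 24:.1f} days")   # ~3.0 days
```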
19:48:39 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:49:07 <clarkb> I've announced the upgrade for December 6. I'm hoping that I can focus on getting the last bits of checks and testing done next week before the holiday so that the week after we aren't rushed
19:49:29 <clarkb> If you have time to read the etherpad and call out any additional concerns I would appreciate it. There is a held node too if you are interested in checking out the newer version of gerrit
19:49:51 <clarkb> I'm not too terribly concerned about this though as there is a straightforward rollback path which i have also tested
19:50:17 <clarkb> #topic mirror.sjc3.raxflex.opendev.org cinder volume issues
19:50:42 <clarkb> Yesterday after someone complained about this mirror not working I discovered it couldn't read sector 0 on its cinder volume backing the cache dirs
19:51:01 <clarkb> and then ~5 hours later it seems the server itself shutdown (maybe due to kernel panic after being in this state?)
19:51:14 <fungi> nope, services got restarted
19:51:28 <clarkb> in general that shouldn't restart VMs though? I guess maybe if you restart libvirt or something
19:51:30 <fungi> which apparently resulted in a lot of server instances shutting off or rebooting
19:51:50 <fungi> well, that was the explanation we got anyway
19:51:52 <clarkb> anyway klamath in #opendev reports that it should be happier now and that this wasn't intentional so we've stuck the mirror back into service
19:52:16 <clarkb> there are two things to consider though. One is that we are using a volume of type capacity instead of type standard and it is suggested we could change that
19:52:25 <clarkb> the other is if we rebuild our networks we will get bigger mtus
19:52:53 <fungi> full 1500-byte mtus to the server instances, specifically
19:52:53 <clarkb> to rebuild the networks I think the most straightforward option is to simply delete everything we've got and let cloud launcher recreate that stuff for us
19:53:30 <clarkb> doing that likely means deleting the mirror anyway based on mordred's report of not being able to change ports for standard port creation on instance create processes
19:54:07 <clarkb> so long story short we should pick a time where we intentionally stop using this cloud, delete the networks and all servers, rerun cloud launcher to recreate networks, then rebuild our mirror using the new networks and a new cinder volume of type standard
19:54:24 <fungi> sounds good
19:54:46 <clarkb> that might be a good big exercise for the week after the gerrit upgrade
19:54:59 <clarkb> in theory things will slow down as we near the end of the year making those changes less impactful
19:55:11 <clarkb> we should also check that the cloud launcher is running happily before we do that
19:55:18 <clarkb> (to avoid delay in reconfiguring the networks)
19:55:52 <fungi> yep
19:56:03 <clarkb> #topic Open Discussion
19:56:21 <clarkb> https://review.opendev.org/c/opendev/lodgeit/+/935712 someone noticed problems with captcha rendering in lodgeit today and has already pushed up a fix
19:56:44 <clarkb> we've also been updating our openafs packages and rolling those out with reboots to affected servers
19:57:41 <fungi> #link https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs
19:58:06 <fungi> i'm still planning to do the openinfra.org mailing list migration on december 2, i have the start of a migration plan outline in an etherpad i'll share when i get it a little more fleshed out
19:58:25 <fungi> sending an announcement about it to the foundation ml later today
19:58:44 <clarkb> fungi: is that something we can/should test in the system-config-run mm3 job?
19:59:03 <clarkb> I assume we create a new domain then move the lists over then add the lists to our config?
19:59:15 <clarkb> I'll wait for the etherpad no need to run through the whole process here
19:59:20 <fungi> no, database update query
19:59:23 <clarkb> oh fun
19:59:35 <fungi> mailman core, hyperkitty and postorius all use the django db
20:00:04 <fungi> so can basically just change the domain/host references there and then update our ansible data to match so it doesn't recreate the old lists
20:00:04 <clarkb> and we are at time
20:00:10 <clarkb> fungi: got it
20:00:12 <clarkb> makes sense
20:00:25 <clarkb> thank you everyone! as mentioned we'll be back here next week per usual despite the holiday for several of us
20:00:31 <fungi> thanks clarkb!
20:00:33 <clarkb> #endmeeting