19:00:26 <clarkb> #startmeeting infra
19:00:26 <opendevmeet> Meeting started Tue Nov 26 19:00:26 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:26 <opendevmeet> The meeting name has been set to 'infra'
19:00:34 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/HVQKZ6YZUHSL6JQSADEIHVNK4PTCSM2E/ Our Agenda
19:00:39 <clarkb> #topic Announcements
19:01:08 <clarkb> Reminder that this week is a major US holiday. I will be busy with family stuff the next few days and I expect that others will be in a similar boat
19:01:19 <fungi> yes, same
19:02:00 * frickler will be around
19:02:58 <clarkb> #topic Zuul-launcher image builds
19:03:15 <clarkb> corvus continues to make progress on this item
19:03:40 <corvus> yeah we're all set to test raw image upload but i have not done so yet
19:03:44 <clarkb> last I saw there were promising results using zlib to compress raw images for shuffling them between locations. It gets the images down to a size similar to qcow2, and the timing after compression is similar too
19:04:00 <corvus> clarkb: zstd actually
19:04:07 <clarkb> oh oops my bad
19:04:13 <corvus> i benchmarked a bunch and zstd hit the sweet spot for time/space
19:04:24 <clarkb> but the important bit is that we can do a relatively quick compression that also gets us reasonable file sizes
19:04:50 <clarkb> and a reminder that adding new image builds for things other than debian bullseye is still helpful
19:05:01 <corvus> yep, so i'm expecting something like 20m for every image format we deal with
19:06:32 <clarkb> corvus: I suppose vhd may not compress similarly to raw but that seems unlikely (iirc our vhd images are very similar to raw)
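For context, the compression being discussed here is a single-pass zstd run over the raw image file. A minimal sketch in Python using the zstandard module might look like the following; the compression level and thread count are assumptions, not the values the zuul-launcher code actually uses.

    import zstandard

    # Multi-threaded zstd compression of a raw image; level 3 trades a
    # modest ratio for speed, and threads=-1 uses all available CPUs.
    cctx = zstandard.ZstdCompressor(level=3, threads=-1)
    with open("image.raw", "rb") as src, open("image.raw.zst", "wb") as dst:
        cctx.copy_stream(src, dst)

In practice this gets a raw image down to roughly the size of its qcow2 equivalent, which is the result described above.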
19:07:09 <clarkb> anything else on this topic?
19:07:33 <corvus> not from me
19:07:35 <clarkb> #topic Backup Server Pruning
19:07:52 <clarkb> the changes to automate the retirement and purging of backup users from the backup servers have landed
19:08:17 <clarkb> we also landed a change to retire and purge the borg-ethercalc02 user/backups from the vexxhost server to catch my manual work there up with the new automation
19:08:22 <clarkb> best I can tell this all went fine
19:08:28 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/936203 retire these old unneeded backups in prep for eventual purging
19:08:42 <clarkb> this change is the next step in retiring the unnecessary backups that I identified previously
19:09:20 <clarkb> that should mark each of them retired, then we should do a manual prune step which will reduce the total backups for each retired server to a single backup (rather than the weekly/monthly/etc set we keep)
19:10:06 <clarkb> we should do that to ensure the scripts work as expected before continuing to the next step. I think for future retirements this would happen somewhat naturally as we'd retire the server/service when it gets replaced or removed then a few months later we would prune due to normal disk utilization
19:10:28 <clarkb> then at some point later we would purge them. In this case I'm hoping we can speed that up since we're catching up with the new system against some very old backups
19:10:47 <clarkb> so basically land the retirements, manually run the prune and ensure the script is happy, then land a change to purge the backups
19:11:11 <clarkb> all that to say I think 936203 should be relatively safe at this point as the next major step is manual pruning and by definition we're doing that when we can monitor it
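As a rough illustration of the pruning step described above (not the actual OpenDev script), reducing a retired server's archives to a single backup amounts to running borg prune with a --keep-last 1 retention policy against that user's repository. The repository path below is hypothetical.

    import subprocess

    # Hypothetical repository path for a retired backup user.
    repo = "/opt/backups/borg-ethercalc02/backup"

    # Keep only the most recent archive for a retired server; active
    # servers keep the normal weekly/monthly retention set instead.
    subprocess.run(
        ["borg", "prune", "--keep-last", "1", "--stats", repo],
        check=True,
    )

The eventual purge step mentioned above then removes the remaining backup entirely, which is why the retire/prune/purge ordering matters.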
19:12:24 <clarkb> thank you ianw for the good brainstorming and implementation of this
19:12:44 <clarkb> sometimes it is really helpful to have an outside perspective when you're just focused on fixing the immediate issue (high disk utilization)
19:12:53 <clarkb> #topic Upgrading old servers
19:13:28 <clarkb> tonyb: any updates here? fungi had to manually add some robots.txt handling and user agent filtering to the existing wiki to prevent the server from being overwhelmed
19:13:39 <clarkb> those may need to be resynced into the changes for deploying a new wiki
19:14:24 <fungi> i also did another quick check of user agents hitting wiki today and suspect the top offender(s) is/are bots with faked random user agent strings
19:14:57 <tonyb> sorry, no updates from me.
19:15:02 <corvus> did "brief downtime" help with reading robots.txt?  i missed the end of that story
19:15:12 <corvus> re-reading robots.txt after update
19:15:21 <fungi> there are 6 distinct agents making up >99% of the request volume and all claiming to be smart phones
19:15:35 <fungi> corvus: it didn't seem to slow anything down, no
19:15:56 <clarkb> even the new meta agent doesn't seem to respect crawl delay
19:16:10 <fungi> load average there is nearly 100 at the moment, after telling a couple of llm training crawlers who did identify themselves to go away
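The user agent check mentioned here is essentially a tally of request counts per agent string from the Apache access logs. A quick sketch of that kind of analysis (not the exact commands used, and the log path is an assumption):

    import re
    from collections import Counter

    # Apache combined log format ends with "referer" "user-agent";
    # grab the final quoted field as the user agent string.
    agent_re = re.compile(r'"[^"]*" "([^"]*)"\s*$')

    counts = Counter()
    with open("/var/log/apache2/access.log") as log:
        for line in log:
            match = agent_re.search(line)
            if match:
                counts[match.group(1)] += 1

    for agent, hits in counts.most_common(10):
        print(f"{hits:8d}  {agent}")

That sort of breakdown is what showed a handful of agents, all claiming to be phones, accounting for nearly all of the request volume.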
19:16:22 <clarkb> it's unfortunate that the internet has basically become a race to download as much info as possible. I seem to recall a futurama episode with this plot
19:16:24 <corvus> :(
19:17:00 <corvus> those who don't watch futurama reruns are doomed to repeat them or something i think
19:18:30 <clarkb> anyway if there are no other updates we can continue on
19:18:43 <clarkb> #topic Docker Hub Rate Limits
19:19:03 <clarkb> we managed to disable proxy caching for docker hub and my anecdotal perception is that this has helped without making the problem go away
19:20:26 <clarkb> I don't really have any other input at this point other than to say reducing any use of docker hub hosted images will only improve the situation
19:20:45 <clarkb> so if you've got easy places to address that (for example we fixed where zuul-registry is fetched for buildset registry jobs) that would be good
19:20:54 <corvus> if anyone wants to pitch in on my mirror change, feel free
19:20:58 <clarkb> oh and there is the rehosting toolset in zuul-jobs that corvus is working on
19:21:03 * clarkb finds a link
19:21:09 <corvus> it's a simple change with complex testing :)
19:21:18 <clarkb> #link https://review.opendev.org/c/zuul/zuul-jobs/+/935574
19:21:25 <corvus> that's the one
19:21:51 <corvus> i think it's almost there
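For a sense of what the rehosting toolset does, copying an image out of Docker Hub into another registry can be done with a tool like skopeo. This is only a rough sketch of the idea, not the zuul-jobs role in 935574, and the source and destination references are hypothetical.

    import subprocess

    # Hypothetical source image and mirror destination.
    source = "docker://docker.io/library/httpd:2.4"
    destination = "docker://quay.io/example-mirror/httpd:2.4"

    # skopeo copies the manifest and layers between registries without
    # needing a local docker daemon, so jobs can then pull from the
    # mirror instead of Docker Hub.
    subprocess.run(["skopeo", "copy", source, destination], check=True)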
19:24:04 <clarkb> #topic Docker compose plugin with podman service for servers
19:24:38 <clarkb> just to follow up from last week I wanted to record that we're going to try and modify our docker and docker compose installation roles to behave differently on noble and newer so that we can somewhat transparently transition things as they are deployed on newer hosts
19:26:04 <clarkb> this will aid our transition to new docker compose and podman by making the transition point a redeployment from older ubuntu to noble or newer
19:26:22 <clarkb> but other than that I think it's mostly just a matter of getting that work done at this point
19:26:31 <clarkb> so I may drop this from next week's meeting
19:26:38 <clarkb> any questions/concerns/thoughts before we move on?
19:28:27 <clarkb> #topic Gerrit 3.10 Upgrade Planning
19:28:47 <clarkb> Gerrit just released new bugfix releases for 3.9 and 3.10
19:29:18 <clarkb> my plan is to get new images built and test nodes held today (I hope) so that I can test these new images with some manual upgrade and downgrade testing early next week
19:29:28 <clarkb> assuming that all comes together I think we are still on track for upgrading on December 6
19:29:38 <clarkb> I want to say they did a release last year during thanksgiving week too...
19:31:04 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:31:15 <clarkb> review of that very much welcome particularly as we're less than 2 weeks away
19:31:27 <clarkb> but otherwise I think the ball is back in my court to catch up to these new gerrit releases
19:31:43 <clarkb> #topic Renaming mailman's openinfra.dev vhost to openinfra.org
19:31:48 <clarkb> #link https://etherpad.opendev.org/p/lists-openinfra-org-migration migration plan document
19:31:57 <clarkb> fungi is planning to do this work on December 2 (that's monday I think)
19:32:04 <fungi> yep
19:32:17 <fungi> currently hacking away on it
19:33:58 <fungi> hoping to have the ansible change up for review later today
19:34:10 <clarkb> anything else we should know or can do to help?
19:34:19 <fungi> (have everything done except the apache redirects, i think)
19:34:54 <fungi> i've got a held job node i'm importing a production data set into with exim disabled, and will step through the database queries once i finish nailing them down
19:35:54 <fungi> i'll also send out a note to service-announce after today's meeting about downtime, if there are no objections
19:36:08 <fungi> planning for 15:00-17:00 but should hopefully go faster
19:37:05 <clarkb> wfm
19:39:39 <clarkb> #topic Upgrading Gitea to 1.22.4
19:39:44 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/936198 Upgrade gitea to 1.22.4
19:40:04 <clarkb> Gitea made a new 1.22 release yesterday. Unfortunately I don't think they expect this to address the OOM issue with 1.22
19:40:17 <clarkb> however there are a number of other bugfixes and performance improvements so this seems worthwhile?
19:40:44 <clarkb> not sure if we want to send it in now or delay for next week
19:40:53 <clarkb> testing seems clean and none of the templates we override were updated
19:43:31 <clarkb> #topic Open Discussion
19:43:51 <clarkb> As mentioned Gerrit just did releases which I'll start poking at after lunch
19:44:13 <clarkb> also I'm trying to land the stack that ends at https://review.opendev.org/c/opendev/system-config/+/936297 once we can confirm the captchas are rendering better with the screenshot
19:44:22 <clarkb> trying to encourage people who fix bugs like that for us
19:44:49 <clarkb> In the process of working on ^ I noticed that we may have overdeleted an insecure ci registry hosted blob as part of the original pruning process
19:45:21 <clarkb> I don't understand that yet but will look at it. If it is a timing issue then I suspect our daily prunes will be less susceptible as they take less than two hours but the initial prune took ~7 days
19:45:24 <fungi> i've got the aforementioned lists.openinfra.org gerrit change up for review and linked in the maintenance planning etherpad, i'll set it wip for now though
19:47:25 <fungi> if anybody spots potential problems with it, let me know, but otherwise my focus is shifting to migration testing on the held node so i can flesh out the database queries in the pad
19:47:43 <clarkb> considering we haven't seen any other complaints about 404s from the insecure ci registry hosted blobs my hunch is that this is more of a corner case than something very wrong, which is different from the other bugs we have already fixed in the process
19:47:49 <clarkb> fungi: will do thanks
19:48:23 <clarkb> I'll give it until 19:50 then call it if there is nothing else
19:50:47 <clarkb> ok thanks everyone!
19:50:53 <clarkb> enjoy the holiday and we'll be back next week
19:50:55 <clarkb> #endmeeting