19:00:26 <clarkb> #startmeeting infra
19:00:26 <opendevmeet> Meeting started Tue Nov 26 19:00:26 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:26 <opendevmeet> The meeting name has been set to 'infra'
19:00:34 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/HVQKZ6YZUHSL6JQSADEIHVNK4PTCSM2E/ Our Agenda
19:00:39 <clarkb> #topic Announcements
19:01:08 <clarkb> Reminder that this week is a major US holiday. I will be busy with family stuff the next few days and I expect that others will be in a similar boat
19:01:19 <fungi> yes, same
19:02:00 * frickler will be around
19:02:58 <clarkb> #topic Zuul-launcher image builds
19:03:15 <clarkb> corvus continues to make progress on this item
19:03:40 <corvus> yeah we're all set to test raw image upload but i have not done so yet
19:03:44 <clarkb> last I saw there were promising results using zlib to compress raw images for their shuffling between locations. Gets the images down to a size similar to qcow2 and then timing after the compression is similar
19:04:00 <corvus> clarkb: zstd actually
19:04:07 <clarkb> oh oops my bad
19:04:13 <corvus> i benchmarked a bunch and zstd hit the sweet spot for time/space
19:04:24 <clarkb> but the important bit is that we can do a relatively quick compression that also gets us reasonable file sizes
19:04:50 <clarkb> and a reminder that adding new image builds for things other than debian bullseye is still helpful
19:05:01 <corvus> yep, so i'm expecting something like 20m for every image format we deal with
19:06:32 <clarkb> corvus: I suppose vhd may not compress similarly to raw but that seems unlikely (iirc our vhd images are very similar to raw)
19:07:09 <clarkb> anything else on this topic?
19:07:33 <corvus> not from me
19:07:35 <clarkb> #topic Backup Server Pruning
19:07:52 <clarkb> the changes to automate the retirement and purging of backup users from the backup servers have landed
19:08:17 <clarkb> we also landed a change to retire and purge the borg-ethercalc02 user/backups from the vexxhost server to catch my manual work there up with the new automation
19:08:22 <clarkb> best I can tell this all went fine
19:08:28 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/936203 retire these old unneeded backups in prep for eventual purging
19:08:42 <clarkb> this change is the next step in retiring the unnecessary backups that I identified previously
19:09:20 <clarkb> that should mark each of them retired, then we should do a manual prune step which will reduce the total backups for each retired server to a single backup (rather than the weekly monthly etc set we keep)
19:10:06 <clarkb> we should do that to ensure the scripts work as expected before continuing to the next step. I think for future retirements this would happen somewhat naturally as we'd retire the server/service when it gets replaced or removed then a few months later we would prune due to normal disk utilization
19:10:28 <clarkb> then at some point later we would purge them. In this case I'm hoping we can speed that up since we're catching up with the new system against some very old backups
19:10:47 <clarkb> so basically land the retirements, manually run the prune and ensure the script is happy, then land a change to purge the backups
19:11:11 <clarkb> all that to say I think 936203 should be relatively safe at this point as the next major step is manual pruning and by definition we're doing that when we can monitor it
19:12:24 <clarkb> thank you ianw for the good brainstorming and implementation of this
19:12:44 <clarkb> sometimes it is really helpful to have an outside perspective when you're just focused on fixing the immediate issue (high disk utilization)
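The manual prune step discussed above amounts to reducing each retired user's borg repository to its single most recent archive. The following is a minimal sketch of that kind of operation, not the actual script used on the OpenDev backup servers; the repository path is a made-up example and the flags assume borg 1.2 or newer.

    #!/usr/bin/env python3
    # Illustrative sketch only: prune a retired backup user's borg repository
    # down to one most-recent archive. The path below is hypothetical and this
    # is not the production pruning tooling discussed in the meeting.
    import subprocess

    repo = "/opt/backups/borg-example-user/backup"  # hypothetical repository path

    # Dry run first so the archives that would be removed can be reviewed.
    subprocess.run(
        ["borg", "prune", "--dry-run", "--list", "--keep-last", "1", repo],
        check=True,
    )

    # Once the dry run looks right, prune for real and then reclaim the space
    # (borg compact is available in borg 1.2+).
    subprocess.run(["borg", "prune", "--list", "--keep-last", "1", repo], check=True)
    subprocess.run(["borg", "compact", repo], check=True)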
19:12:53 <clarkb> #topic Upgrading old servers
19:13:28 <clarkb> tonyb: any updates here? fungi had to manually add some robots.txt handling and user agent filtering to the existing wiki to prevent the server from being overwhelmed
19:13:39 <clarkb> those may need to be resynced into the changes for deploying a new wiki
19:14:24 <fungi> i also did another quick check of user agents hitting wiki today and suspect the top offender(s) is/are bots with faked random user agent strings
19:14:57 <tonyb> sorry no updates from me.
19:15:02 <corvus> did "brief downtime" help with reading robots.txt? i missed the end of that story
19:15:12 <corvus> re-reading robots.txt after update
19:15:21 <fungi> there are 6 distinct agents making up >99% of the request volume and all claiming to be smart phones
19:15:35 <fungi> corvus: it didn't seem to slow anything down, no
19:15:56 <clarkb> even the new meta agent doesn't seem to respect crawl delay
19:16:10 <fungi> load average there is nearly 100 at the moment, after telling a couple of llm training crawlers who did identify themselves to go away
19:16:22 <clarkb> it's unfortunate that the internet has basically become a race to download as much info as possible. I seem to recall a futurama episode with this plot
19:16:24 <corvus> :(
19:17:00 <corvus> those who don't watch futurama reruns are doomed to repeat them or something i think
19:18:30 <clarkb> anyway if there are no other updates we can continue on
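For context, the "quick check of user agents" fungi mentions above is the sort of tally sketched below. This is an illustrative example only: it assumes a standard Apache combined-format access log, and the log path is a placeholder rather than the wiki server's real configuration.

    #!/usr/bin/env python3
    # Rough sketch of counting user agents in an Apache combined-format access
    # log, similar in spirit to the check described in the meeting. The log
    # path is hypothetical.
    import collections
    import re

    LOG = "/var/log/apache2/access.log"  # placeholder path

    # In the combined log format the user agent is the final quoted field.
    agent_re = re.compile(r'"([^"]*)"$')

    counts = collections.Counter()
    with open(LOG, errors="replace") as fh:
        for line in fh:
            match = agent_re.search(line.strip())
            if match:
                counts[match.group(1)] += 1

    # Print the ten busiest user agents with their request counts.
    for agent, hits in counts.most_common(10):
        print(f"{hits:8d}  {agent}")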
19:18:43 <clarkb> #topic Docker Hub Rate Limits
19:19:03 <clarkb> we managed to disable proxy caching for docker hub and my anecdotal perception is that this has helped without making the problem go away
19:20:26 <clarkb> I don't really have any other input at this point other than to say reducing any use of docker hub hosted images will only improve the situation
19:20:45 <clarkb> so if you've got easy places to address that (for example we fixed where zuul-registry is fetched for buildset registry jobs) that would be good
19:20:54 <corvus> if anyone wants to pitch in on my mirror change, feel free
19:20:58 <clarkb> oh and there is the rehosting toolset in zuul-jobs that corvus is working on
19:21:03 * clarkb finds a link
19:21:09 <corvus> it's a simple change with complex testing :)
19:21:18 <clarkb> #link https://review.opendev.org/c/zuul/zuul-jobs/+/935574
19:21:25 <corvus> that's the one
19:21:51 <corvus> i think it's almost there
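To illustrate the kind of image rehosting discussed in the Docker Hub topic above (copying an image we depend on out of Docker Hub into another registry), here is a minimal sketch using skopeo. It is not the zuul-jobs role under review in 935574; the image names and destination registry are invented examples.

    #!/usr/bin/env python3
    # Minimal sketch of rehosting an image from Docker Hub to another registry
    # with skopeo. Source and destination names are hypothetical, not the
    # images or registry used by the zuul-jobs work linked above.
    import subprocess

    source = "docker://docker.io/library/python:3.12-slim"
    destination = "docker://quay.io/example-org/python:3.12-slim"

    # skopeo copies the manifest and layers between registries without needing
    # a local docker daemon.
    subprocess.run(["skopeo", "copy", source, destination], check=True)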
19:24:04 <clarkb> #topic Docker compose plugin with podman service for servers
19:24:38 <clarkb> just to follow up from last week I wanted to record that we're going to try and modify our docker and docker compose installation roles to behave differently on noble and newer so that we can somewhat transparently transition to things as they are deployed on newer hosts
19:26:04 <clarkb> this will aid our transition to new docker compose and podman by making the transition point a redeployment from older ubuntu to noble or newer
19:26:22 <clarkb> but other than that I think it's mostly just a matter of getting that work done at this point
19:26:31 <clarkb> so I may drop this off of next week's meeting
19:26:38 <clarkb> any questions/concerns/thoughts before we move on?
19:28:27 <clarkb> #topic Gerrit 3.10 Upgrade Planning
19:28:47 <clarkb> Gerrit just published new bugfix releases for 3.9 and 3.10
19:29:18 <clarkb> my plan is to get new images built and test nodes held today (I hope) so that I can test these new images with some manual upgrade and downgrade testing early next week
19:29:28 <clarkb> assuming that all comes together I think we are still on track for upgrading on December 6
19:29:38 <clarkb> I want to say they did a release last year during thanksgiving week too...
19:31:04 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:31:15 <clarkb> review of that is very much welcome particularly as we're less than 2 weeks away
19:31:27 <clarkb> but otherwise I think the ball is back in my court to catch up to these new gerrit releases
19:31:43 <clarkb> #topic Renaming mailman's openinfra.dev vhost to openinfra.org
19:31:48 <clarkb> #link https://etherpad.opendev.org/p/lists-openinfra-org-migration migration plan document
19:31:57 <clarkb> fungi is planning to do this work on December 2 (that's monday I think)
19:32:04 <fungi> yep
19:32:17 <fungi> currently hacking away on it
19:33:58 <fungi> hoping to have the ansible change up for review later today
19:34:10 <clarkb> anything else we should know or can do to help?
19:34:19 <fungi> (have everything done except the apache redirects, i think)
19:34:54 <fungi> i've got a held job node i'm importing a production data set into with exim disabled, and will step through the database queries once i finish nailing them down
19:35:54 <fungi> i'll also send out a note to service-announce after today's meeting about downtime, if there are no objections
19:36:08 <fungi> planning for 15:00-17:00 but should hopefully go faster
19:37:05 <clarkb> wfm
19:39:39 <clarkb> #topic Upgrading Gitea to 1.22.4
19:39:44 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/936198 Upgrade gitea to 1.22.4
19:40:04 <clarkb> Gitea made a new 1.22 release yesterday. Unfortunately I don't think they expect this to address the OOM issue with 1.22
19:40:17 <clarkb> however there are a number of other bugfixes and performance improvements so this seems worthwhile?
19:40:44 <clarkb> not sure if we want to send it in now or delay for next week
19:40:53 <clarkb> testing seems clean and none of the templates we override were updated
19:43:31 <clarkb> #topic Open Discussion
19:43:51 <clarkb> As mentioned, Gerrit just did releases which I'll start poking at after lunch
19:44:13 <clarkb> also I'm trying to land the stack that ends at https://review.opendev.org/c/opendev/system-config/+/936297 once we can confirm the captchas are rendering better with the screenshot
19:44:22 <clarkb> trying to encourage people who fix bugs like that for us
19:44:49 <clarkb> In the process of working on ^ I noticed that we may have overdeleted an insecure ci registry hosted blob as part of the original pruning process
19:45:21 <clarkb> I don't understand that yet but will look at it. If it is a timing issue then I suspect our daily prunes will be less susceptible as they take less than two hours but the initial prune took ~7 days
19:45:24 <fungi> i've got the aforementioned lists.openinfra.org gerrit change up for review and linked in the maintenance planning etherpad, i'll set it wip for now though
19:47:25 <fungi> if anybody spots potential problems with it, let me know, but otherwise my focus is shifting to migration testing on the held node so i can flesh out the database queries in the pad
19:47:43 <clarkb> considering we haven't seen any other complaints about 404s from the insecure ci registry hosted blobs my hunch is that this is more of a corner case than something very wrong, which is different from the other bugs we have already fixed in the process
19:47:49 <clarkb> fungi: will do thanks
19:48:23 <clarkb> I'll give it until 19:50 then call it if there is nothing else
19:50:47 <clarkb> ok thanks everyone!
19:50:53 <clarkb> enjoy the holiday and we'll be back next week
19:50:55 <clarkb> #endmeeting