19:00:04 #startmeeting infra
19:00:04 Meeting started Tue Nov 19 19:00:04 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:04 The meeting name has been set to 'infra'
19:00:05 * tonyb is on a train so coverage may be sporadic
19:00:18 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BQ2IWL7BUMNUXYWCQV62DZQCF2AI7E5U/ Our Agenda
19:00:25 #topic Announcements
19:00:35 also in an EU timezone for the next few weeks
19:00:39 oh fun
19:00:56 should be!
19:01:14 Next week is a major US holiday week. I plan to be around Monday and Tuesday and will host the weekly meeting. But we should probably expect slower response times from various places
19:01:47 Anything else to announce?
19:03:15 #topic Zuul-launcher image builds
19:03:41 corvus has continued to iterate on the mechanics of uploading images to swift, downloading them to the launcher, and reuploading them to the clouds
19:04:12 and good news is the time to build a qcow2 image then shuffle it around is close to, if not better than, the time the current builders do it in
19:04:19 #link https://review.opendev.org/935455 Setup a raw image cloud for raw image testing
19:04:37 qcow2 images are relatively small compared to the raw and vhd images we also deal with, so the next step is testing this process with the larger image types
19:04:57 There is still opportunity to add image build jobs for other distros and releases as well
19:05:32 it's high on my to-do list
19:05:45 cool
19:05:50 anything else on this topic cc corvus
19:06:01 seems like good slow but steady progress
19:07:17 I'll keep the meeting moving as we have a number of items to get through. We can always swing back to topics if we have time at the end or after the meeting etc
19:07:22 #topic Backup Server Pruning
19:07:38 the smaller backup server got close to filling up its disk again and fungi pruned it again. Thank you for that
19:07:57 but this is a good reminder that we have a couple of changes proposed to help alleviate some of that by purging things from the backup servers once they are no longer needed
19:08:04 #link https://review.opendev.org/c/opendev/system-config/+/933700 Backup deletions managed through Ansible
19:08:08 #link https://review.opendev.org/c/opendev/system-config/+/934768 Handle backup verification for purged backups
19:08:29 oh I missed that fungi had +2'd but not approved the first one
19:09:00 i wasn't sure if those needed close attention after merging
19:09:04 should we go ahead and either fix the indentation then approve, or just approve it?
19:09:25 i think we can just approve
19:10:22 I agree, we can do the indentation after if we want
19:10:32 I think the test case on line 34 in https://review.opendev.org/c/opendev/system-config/+/933700/23/testinfra/test_borg_backups.py should ensure that it is fairly safe
19:11:30 then after it is landed we can touch the retired flag in the ethercalc dir, then add ethercalc to the list of retirements (to catch up the other server) and check that worked, then add it to the purge list (to catch up the other server) and check that worked
19:11:40 then if that is good we can retire the rest of the items in our retirement list
19:12:15 I think this should make managing disk consumption a bit more sane
19:12:26 anything else related to backups?
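As a rough illustration of the two-stage flow described above, here is a minimal sketch; the directory layout and flag name are assumptions for illustration only, and the real behavior is whatever the Ansible change in 933700 implements:

    # Stage 1: a backup target gets a "retired" marker (set by the tooling
    # when the target is added to the retirement list, or by hand); its
    # existing archives are kept but it is no longer backed up or pruned.
    backup_dir=/opt/backups/borg-ethercalc02   # hypothetical path
    touch "$backup_dir/retired"

    # Stage 2: once the target is also on the purge list and the retirement
    # has been verified, the whole tree is removed to reclaim disk.
    # rm -rf "$backup_dir"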
19:13:28 actually I shouldn't need to manually touch the retired file
19:13:42 going through the motions to retire it on the other server should fix the one I did manually
19:13:53 #topic Upgrading old servers
19:14:14 tonyb: anything new to report on wiki or other server replacements?
19:14:35 (I did have a note about the docker compose situation on noble but was going to bring that up during the docker compose portion of the agenda)
19:15:10 nothing new.
19:15:50 I guess I discovered that noble is going to be harder than expected due to python being too new for docker-compose v1
19:15:50 #topic Docker Hub Rate Limits
19:15:53 #undo
19:15:53 Removing item from minutes: #topic Docker Hub Rate Limits
19:16:09 ya that's the bit I was going to discuss during the docker compose podman section since they are related
19:16:15 I have an idea for that that may not be terrible
19:16:21 okay that's cool
19:16:21 #topic Docker Hub Rate Limits
19:16:36 [i'm here now]
19:16:44 Before we get there another related topic is that people have been noticing we're hitting docker hub rate limits more often
19:16:49 #link https://www.docker.com/blog/november-2024-updated-plans-announcement/
19:16:50 I don't know why it's only just come up as we have noble servers
19:17:02 tonyb: because mirrors don't use docker-compose I think
19:17:08 basically you've avoided the intersection so far
19:17:21 so that blog post says anonymous requests will get 10 pulls per hour now
19:17:45 which is a reduction from whatever the old value was. However, if I go through the dance of getting an anonymous pull token and inspect that token it says 100 pulls per 6 hours
19:18:09 maybe they walked it back due to complaints...
19:18:12 I've also experimentally checked docker image pull against 12 different alpine image tags and about 12 various library images from docker hub
19:18:19 and had no errors
19:19:04 corvus: ya that could be. Another thought I had was maybe they rate limit the library images differently than the normal images but once you hit the limit it fails for all pulls. But kolla/base reported the same 100 pulls per 6 hours limit that the other library images did so I don't think it is that
19:19:08 well 100/6h is 16/h, not too far off, just a bit more burst allowed
19:19:25 frickler: ya the burst is important for CI workloads though, particularly since we cache things (if you use the proxy cache)
19:19:43 one contingency plan would be to mirror the dockerhub images we need on quay.io (or elsewhere); i started on https://review.opendev.org/935574 for that.
19:20:32 planning for contingencies and generally trying to be on the lookout for anything that helps us understand their changes (if any) would be good
19:20:53 but otherwise I now feel like I understand less today than I did yesterday. This doesn't feel like a drop-everything emergency but something we should work to understand and then address
19:20:59 assuming that still works with speculative builds that seems like a solid contingency
19:21:35 another suggested improvement was to lean on buildset registries more to stash all the images a buildset will need and not just those we may build locally
19:21:53 this way we're fetching images like mariadb once per buildset instead of N times for each of N jobs using it
19:22:20 heck, some jobs themselves may be pulling the same image multiple times for different hosts
19:22:44 true.
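For reference, one way to check the limits described above is Docker's documented rate limit check; this sketch uses the ratelimitpreview/test image from their docs, and per those docs a HEAD request on a manifest should not itself count against the limit:

    # Fetch an anonymous pull token, then read the rate limit headers
    # returned on a manifest HEAD request (requires curl and jq).
    TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
    curl -sI -H "Authorization: Bearer $TOKEN" \
      https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
      | grep -i '^ratelimit'
    # Expect headers like: ratelimit-limit: 100;w=21600  (100 pulls per 21600s / 6h window)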
19:23:22 so ya be on the lookout for more concrete info and feel free to experiment with making the jobs more image pull efficient
19:23:28 and maybe we'll know more next week
19:23:30 #topic Docker compose plugin with podman service for servers
19:23:59 This agenda item is related to the previous one in that it would allow us to migrate off of docker hub for opendev images and preserve our speculative testing of our images
19:24:24 additionally tonyb found that python docker-compose doesn't run on python3.12 on noble
19:24:32 which is another reason to switch to docker compose
19:24:44 (in particular, getting python-base and python-builder on quay.io could be a big win for total image pulls)
19:24:46 all of this is coming together to make this effort a bit of a higher priority
19:25:09 I'd like to talk about the noble docker compose thing first
19:25:53 I suspect that in our install docker role we can do something hacky like have it install an alias/symlink/something that maps docker-compose to docker compose, and then we won't have to rewrite our playbooks/roles until after everything has left docker-compose behind
19:26:21 the two tools have similar enough command lines that I suspect the only place we would run into trouble is anywhere we parse command output, and we might have to check both versions instead
19:26:32 yeah, that's kinda gross but it'd work for now
19:26:48 but this way we don't need to rewrite every role using docker-compose today, and as long as we don't do in-place upgrades we're replacing the servers anyway and they'll use the proper tool in that transition
19:26:59 I think this wouldn't work if we did an in-place switch between docker-compose and docker compose
19:27:16 but as long as it's an old focal server replaced by a new noble server that should mostly work
19:27:45 yeah. I guess we can come back and tidy up the roles after the fact
19:27:59 assuming we do that, the next question is do we also have install-docker configure docker-compose (which is now really docker compose) to run podman for everything? I think we were leaning that way when we first discussed this
19:28:12 the upside to that is we get the speculative testing and don't have to be on docker hub
19:28:17 tonyb: exactly
19:28:22 i think the last time someone talked about parsing output of "docker-compose" i suggested an alternative... like maybe an "inspect" command we could use instead.
19:28:27 corvus: ++
19:28:55 it may actually be simpler to do it once with podman
19:29:10 so to tl;dr all my typing above: I think there are two improvements we should make to our docker installation role. A) set it up to alias docker-compose to docker compose somehow and B) configure docker compose to rely on podman as the runtime
19:29:26 well one place that'd be hard is docker compose pull and looking for updates, but yes generally avoiding parsing the output is good
19:29:32 just because the difference between them is so small, but it's all at install time. so switching from docker to podman later is more work than "docker compose" plugin with podman now.
19:29:51 tonyb: ya I think that's the only place we do it. So we'd inspect, pull, inspect and see if they change or something
19:29:59 corvus: exactly
19:30:02 tonyb: yeah, that's the thing i suggested an alternative to. no one took me up on it at the time, but it's in scrollback somewhere.
19:30:11 okay
19:30:32 tonyb: I know you brought this up as a need for noble servers. I'm not sure if you are interested in making those changes to the docker installation role
19:30:46 yeah I can
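A minimal sketch of the shim idea discussed above, assuming an install path of /usr/local/bin/docker-compose and that pointing the client at a podman socket via DOCKER_HOST is acceptable (both are assumptions, not decisions from this meeting):

    #!/bin/sh
    # Hypothetical docker-compose compatibility shim: forward the legacy
    # command line to the compose plugin so existing roles keep working.
    # If compose should talk to podman instead of dockerd, the client can
    # be pointed at the podman socket; the exact path is an assumption.
    # export DOCKER_HOST=unix:///run/podman/podman.sock
    exec docker compose "$@"

For the update-detection concern, comparing "docker image inspect --format '{{.Id}}'" output before and after a pull would avoid parsing "docker compose pull" output entirely.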
19:31:00 i think the only outstanding question for https://review.opendev.org/923084 is how to set up the podman socket path -- i think in a previous meeting we identified a potential easy way of doing that.
19:31:06 I don't know exactly what's needed for the podman enablement
19:31:08 I can probably help, though I feel like this week is already swamped and next week is the holiday, but ping me for reviews or ideas etc. I ended up brainstorming this a bit the other day so enough is paged in I think I can be useful
19:31:24 and making it nice but I can take direction
19:31:33 tonyb: 923084 does it in a different context so we have to map that into system-config
19:31:43 noted
19:31:43 and i guess address the question of the socket path
19:32:08 okay
19:32:30 sounds like no one is terribly concerned about these hacks and we should be able to get away from them as soon as we're sufficiently migrated
19:32:38 Anything else on these subjects?
19:32:42 oh yeah docker "contexts" is the thing
19:32:47 that might make setting the path easy
19:33:05 okay cool
19:33:07 #link https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-01-19.00.log.html#l-91 docker contexts
19:33:13 and this is for noble+
19:33:13 thanks!
19:33:16 tonyb: yes
19:33:17 yep
19:33:22 perfect
19:34:09 oh ha
19:34:21 clarkb: also you suggested we could set the env var in a "docker-compose" compat tool shim :)
19:34:27 (in that meeting)
19:34:41 ( for the record I'm about to get off the train which probably equates to offline)
19:34:43 no wonder when I was thinking about tonyb's problem I was like this is the solution
19:35:01 yeah that could work I guess
19:35:02 ¿por qué no los dos? (why not both?)
19:35:02 tonyb: ack, don't hurt yourself trying to type and walk at the same time
19:35:16 #topic Enabling mailman3 bounce processing
19:35:20 let's keep this show moving forward
19:35:39 last week lists.opendev.org and lists.zuul-ci.org lists were all set to (or already set to) enable bounce processing
19:35:52 fungi: do you know if openstack-discuss got its config updated?
19:36:29 then separately I haven't received any notifications of members hitting the limits for any of the lists I moderate and can't find evidence of anyone with a score higher than 3 (5 is the threshold)
19:36:34 so I'm curious if anyone has seen that in action yet
19:37:04 clarkb: i've not done that yet, no
19:37:28 ok it would probably be good to set, as I suspect that's where we'll get the most timely feedback
19:37:57 then if we're still happy with the results, enabling this by default on new lists is likely to require we define a custom mailman list style and create new lists with that style
19:38:25 the documentation is pretty sparse on how you're actually supposed to create a new style unfortunately
19:38:38 i've just now switched "process bounces" to "yes" for openstack-discuss
19:38:41 (there is a rest api endpoint for it but without info on how you set the millions of config options)
19:38:48 fungi: thanks!
19:39:00 fungi: you don't happen to know what would be required to set up a new list style do you?
19:39:29 (we probably actually need two, one for private lists and one for public lists)
19:39:33 no clue, short of the instructions for making a mailman plugin in python. might be worth one of us asking on the mailman-users ml
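If a custom style does get defined (apparently via a small Mailman plugin), Mailman core's REST API can apply a named style at list creation time; a rough sketch, where the REST credentials, port, and style name are placeholders rather than actual deployment values:

    # "opendev-public" is a hypothetical custom style that a plugin would
    # have to register first; restadmin/restpass and port 8001 are the
    # stock Mailman core REST defaults, not necessarily what we deploy.
    curl -s -u restadmin:restpass \
      -d fqdn_listname=example-list@lists.opendev.org \
      -d style_name=opendev-public \
      http://localhost:8001/3.1/lists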
19:39:34 i think i have an instance of a user where bounce processing is not working
19:40:47 corvus: are they hitting the threshold then not getting removed?
19:42:06 the bounce disable warnings configuration item implies there is some delay that must be reached after being above the threshold before you are removed
19:42:18 if so, I wonder if the 7 day threshold reset is resetting them to 0 before hitting that
19:42:20 hrm, i got a message about an unprocessed bounce a while back, but the user does not appear to be a member anymore. so this may not be actionable.
19:42:24 not sure what happened in the interim.
19:42:43 ack, I guess we monitor for current behavior and debug from there
19:42:53 fungi: ++ that seems like a good idea.
19:43:08 I'm going to keep things moving as we have less than 20 minutes and still several topics to cover
19:43:14 #topic Intermediate Insecure CI Registry Pruning
19:43:27 0x8e
19:43:31 As scheduled/announced we started this on Friday. We hit a few issues. The first was 404s on object delete requests
19:43:59 that was a simple fix as we can simply ignore 404 errors when trying to delete something. The other was that we weren't paginating object listings so were capped at 10k objects per listing request
19:44:13 this gave the pruning process an incomplete (and possibly inaccurate) picture of what should be deleted vs kept
19:44:42 that means we've fixed two bugs that could have caused the previous issues! :)
19:44:43 the process was restarted after fixing the problems and has been running since late friday. We anticipate it will take at least 6 days, though I think it is trending slowly to be longer as it goes on
19:45:12 as far as I can tell no one has had problems with the intermediate registry while this is running either, which is a good sign we're not over-deleting anything
19:45:18 #link https://review.opendev.org/c/opendev/system-config/+/935542 Enable daily pruning after this bulk prune is complete
19:45:33 0x8e out of 0xff means we're 55% through the blobs.
19:45:38 once this is done we should be able to run regular pruning that doesn't take days to complete since we'll have far fewer objects to contend with
19:46:18 that change enables a cron job to do this. Which is probably good for hygiene purposes but shouldn't be merged until after we complete this manual run and are happy with it
19:46:36 we'll be much more disk efficient as a result \o/
19:46:49 corvus: napkin math has that taking closer to 8 days now?
19:47:51 * clarkb looks at the clock and continues on
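For the pagination bug, this is roughly what marker-based paging against a Swift container listing looks like; the endpoint and container names below are made up and TOKEN is an already-obtained auth token. Swift caps each listing response (10000 names by default) and returns an empty body once the marker is past the last object:

    url="https://swift.example.com/v1/AUTH_example/intermediate_registry"
    marker=""
    while page=$(curl -sf -G -H "X-Auth-Token: $TOKEN" \
            --data-urlencode "limit=10000" \
            --data-urlencode "marker=$marker" "$url"); do
        [ -z "$page" ] && break                     # empty page: listing complete
        printf '%s\n' "$page"                       # process this page of object names
        marker=$(printf '%s' "$page" | tail -n 1)   # resume after the last name seen
    done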
19:47:53 #topic Gerrit 3.10 Upgrade Planning
19:47:54 are we starting from friday or saturday?
19:48:11 corvus: I think we started at roughly 00:00 UTC saturday
19:48:30 and we're almost at 20:00 UTC tuesday so like 7.75 days?
19:48:39 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:49:07 I've announced the upgrade for December 6. I'm hoping that I can focus on getting the last bits of checks and testing done next week before the holiday so that the week after we aren't rushed
19:49:29 If you have time to read the etherpad and call out any additional concerns I would appreciate it. There is a held node too if you are interested in checking out the newer version of gerrit
19:49:51 I'm not too terribly concerned about this though as there is a straightforward rollback path which i have also tested
19:50:17 #topic mirror.sjc3.raxflex.opendev.org cinder volume issues
19:50:42 Yesterday, after someone complained about this mirror not working, I discovered it couldn't read sector 0 on its cinder volume backing the cache dirs
19:51:01 and then ~5 hours later it seems the server itself shut down (maybe due to kernel panic after being in this state?)
19:51:14 nope, services got restarted
19:51:28 in general that shouldn't restart VMs though? I guess maybe if you restart libvirt or something
19:51:30 which apparently resulted in a lot of server instances shutting off or rebooting
19:51:50 well, that was the explanation we got anyway
19:51:52 anyway klamath in #opendev reports that it should be happier now and that this wasn't intentional, so we've stuck the mirror back into service
19:52:16 there are two things to consider though. One is that we are using a volume of type capacity instead of type standard and it is suggested we could change that
19:52:25 the other is if we rebuild our networks we will get bigger MTUs
19:52:53 full 1500-byte MTUs to the server instances, specifically
19:52:53 to rebuild the networks I think the most straightforward option is to simply delete everything we've got and let cloud launcher recreate that stuff for us
19:53:30 doing that likely means deleting the mirror anyway, based on mordred's report of not being able to change ports for standard port creation on instance create processes
19:54:07 so long story short, we should pick a time when we intentionally stop using this cloud, delete the networks and all servers, rerun cloud launcher to recreate networks, then rebuild our mirror using the new networks and a new cinder volume of type standard
19:54:24 sounds good
19:54:46 that might be a good big exercise for the week after the gerrit upgrade
19:54:59 in theory things will slow down as we near the end of the year, making those changes less impactful
19:55:11 we should also check that the cloud launcher is running happily before we do that
19:55:18 (to avoid delay in reconfiguring the networks)
19:55:52 yep
19:56:03 #topic Open Discussion
19:56:21 https://review.opendev.org/c/opendev/lodgeit/+/935712 someone noticed problems with captcha rendering in lodgeit today and has already pushed up a fix
19:56:44 we've also been updating our openafs packages and rolling those out with reboots to affected servers
19:57:41 #link https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs
19:58:06 i'm still planning to do the openinfra.org mailing list migration on december 2, i have the start of a migration plan outline in an etherpad i'll share when i get it a little more fleshed out
19:58:25 sending an announcement about it to the foundation ml later today
19:58:44 fungi: is that something we can/should test in the system-config-run mm3 job?
19:59:03 I assume we create a new domain then move the lists over then add the lists to our config?
19:59:15 I'll wait for the etherpad, no need to run through the whole process here
19:59:20 no, database update query
19:59:23 oh fun
19:59:35 mailman core, hyperkitty and postorius all use the django db
20:00:04 so we can basically just change the domain/host references there and then update our ansible data to match so it doesn't recreate the old lists
20:00:04 and we are at time
20:00:10 fungi: got it
20:00:12 makes sense
20:00:25 thank you everyone! as mentioned we'll be back here next week per usual, despite the holiday for several of us
20:00:31 thanks clarkb!
20:00:33 #endmeeting