19:00:19 #startmeeting infra
19:00:19 Meeting started Tue Sep 23 19:00:19 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:19 The meeting name has been set to 'infra'
19:00:25 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YGIEBCAQV4W5TXZVDQTYOHFZQ47SRBPP/ Our Agenda
19:00:31 #topic Announcements
19:00:40 I'm out on Thursday just a heads up
19:01:08 then also I have someone from my ISP coming out to check my regular internet connectivity troubles later today so don't be surprised if my internets drop out this afternoon
19:01:16 Did anyone else have announcements?
19:01:51 i did not
19:02:56 seems like we can dive right in
19:02:59 #topic Gerrit 3.11 Upgrade Planning
19:03:31 I don't have any new updates there. Unfortunately I've been distracted by the raxflex double nic issue and preparing for the summit. But considering that the openstack release is happening in a week this is probably not super urgent at the moment
19:03:43 Did anyone else have thoughts/concerns/ideas around the gerrit 3.11 upgrade?
19:04:09 i too have been distracted by other tasks, so not yet
19:04:17 #link https://zuul.opendev.org/t/openstack/build/54f6629a3041466ca2b1cc6bf17886c4
19:04:21 #link https://zuul.opendev.org/t/openstack/build/c9051c435bf7414b986c37256f71538e
19:04:30 these job links should point at held nodes for anyone who wants to look at them
19:04:47 these were refreshed for the new gerrit container images after we rebuilt for 3.10.8 and 3.11.5
19:05:29 #topic Upgrading old servers
19:05:43 I think the openafs and kerberos clusters are fully upgraded to noble now
19:05:58 thank you fungi for driving that and getting it done. it took a while but slow steady progress saw it to the end
19:06:48 anything to note about those upgrades? I guess we have to watch out for the single cpu boots when upgrading in place to noble as well as updating network configs
19:06:51 yeah, all done
19:07:06 but I'm not sure we'll do any more in-place upgrades to noble? Maybe for lists one day (it is jammy so not in a hurry on that one)
19:07:17 i haven't cleaned up the old eth0 interface configs, but they aren't hurting anything
19:07:48 ya should be completely ignored at this point
19:07:54 also our use of openafs from a ppa complicated the upgrades in ways that won't affect the lists server upgrade
19:08:47 with these servers done the graphite and backup servers are next up
19:09:33 I briefly thought about how we might handle the backup servers and I think deploying new noble nodes with new backup volumes is probably ideal. Then after backups have shifted to the new nodes we can move the old backup volumes to the new nodes to preserve that data for a bit then delete the old backup servers
19:10:13 wfm
19:10:27 or use new volumes even, and just retire the old servers and volumes after a while
19:10:31 there are two concerns I want to look into before doing this. The first is what does having 3 or 4 backup servers in the inventory look like when it comes to the backup cron jobs on nodes (will we back up to 4 locations or can we be smarter about that?) and then also we need a newer version of borg on the server side on noble due to python versions.
19:10:33 We'll want to test that the old borg version on older nodes can back up to the new version
19:10:48 we already know that new borg backing up to old borg seems to work so I expect it to work the other way around but it is worth checking
19:11:18 graphite is probably a bit more straightforward. We'll need a new server then take a downtime to move the data and start services up again on the new server after updating dns
19:11:27 we might need to restart services that report to graphite if they cache the dns record
19:12:08 I'm not sure either of these two efforts is on my immediate todo list so help is appreciated. I'm also happy to rearrange my priorities if others think these should be bumped up over say zuul launcher debugging and gerrit upgrade planning and opendev matrix work
19:12:09 if the new borg (needed because of python versions) is a problem, we could use a borg container in the near term
19:12:31 tonyb: that is a good point. Given it works the other way around I don't expect it to be a problem but we can fall back to that too
19:12:42 and ya as more things update to noble fewer things will be in that situation anyway
19:12:59 ok any other thoughts on server upgrades?
19:13:49 #topic AFS mirror content cleanup
19:14:05 once the afs servers were updated fungi went ahead and performed the cleanups we had talked about
19:14:14 debian stretch and bullseye backports content has been cleared out
19:14:19 as has the openeuler mirror content
19:14:35 yeah, i ran into one stumbling block
19:14:37 I still wonder if we can put the reprepro cleanup steps in the regular cronjob for mirror updates
19:14:42 oh I missed that
19:15:06 our reprepro mirror script keeps separate state in the form of a list of packages to remove on the next pass
19:15:20 if those packages don't exist anywhere, then that command fails
19:15:43 i'll try to remember to push up an addition to our docs that says to clear out that file
19:15:58 because it confused the heck out of me for hours before i realized that was the problem
19:16:07 the cleanup steps clear out files that would be listed in that file and cleared out in the next pass?
19:16:23 yes
19:16:34 got it. That might also complicate automating the cleanup steps
19:16:36 and since they're not present, the clearout errors
19:17:19 it was dead simple once i realized what was happening, but basically took me realizing that the update was working and then stepping through our script to find the failing command
19:17:29 we can also probably start looking at mirror cleanups for puppetlabs, ceph, and docker content that is old/ancient. These are unlikely to have the same level of impact from a disk consumption perspective though
19:17:45 and then once that is done we can also clean up old python wheel cache content for distro releases we no longer have test nodes for
19:17:55 it tripped me up on the bionic arm64 cleanup too, but i somehow "fixed" the problem at some point without realizing it
19:17:57 so still lots of cleanup possible but more of a long tail in terms of disk consumption
19:18:27 oh, the follow-up change for the openeuler mirror removal still needs reviews
19:18:40 #link https://review.opendev.org/959892 Clean up OpenEuler mirroring infrastructure
19:19:08 I'll review that after the meeting
19:19:11 thanks!
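
A minimal sketch of the cross-version borg check discussed above under server upgrades, assuming a throwaway test repo on a hypothetical new Noble backup server. The hostname, user, and repo path are placeholders and this is not our production backup tooling; it just drives the stock borg 1.x CLI from a host that still has the older client installed.

```python
#!/usr/bin/env python3
"""Smoke test: older borg client backing up to a newer server-side borg.
All names below are placeholders for a hypothetical test server and repo."""

import subprocess

# Hypothetical repo on a new Noble backup server running the newer borg.
REPO = "ssh://borg-test@backup-test.example.org/opt/backups/compat-test-repo"


def borg(subcommand, *args):
    """Run the locally installed (older) borg client and fail loudly."""
    # --remote-path picks the borg binary/wrapper on the server side.
    cmd = ["borg", subcommand, "--remote-path", "borg", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Round trip: create an unencrypted throwaway repo, push one tiny archive,
# then list it back. A version incompatibility should surface as an error.
borg("init", "--encryption=none", REPO)
borg("create", REPO + "::compat-test-1", "/etc/hostname")
borg("list", REPO)
```

If that init/create/list round trip succeeds from an older-borg host against the newer server-side borg, the fallback tonyb mentioned (running borg from a container in the near term) likely isn't needed.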
19:19:22 at this point the cleanup is probably sufficient that we can entertain adding new mirrors again
19:19:45 I still think we need to be careful about doing so because it is easy to add content but more difficult to clear it out as this long overdue cleanup illustrates
19:19:53 yes, though we still need to be thinking about how soon to yank ubuntu-bionic
19:20:18 (amd64)
19:20:22 yup I suspect but don't know for certain that starlingx is relying on bionic
19:20:36 I think they plan to do a release next month so we may want to check with them first?
19:20:48 or after their release, if that's what you meant
19:20:53 but other than that I feel like we can be full steam ahead on that cleanup
19:21:08 fungi: ya either after their release or if they give an indication it's unlikely to affect them
19:22:24 spot checking a random recent change they run a lot of things on debian bullseye
19:22:33 so maybe bionic won't impact them like I think it will
19:22:47 but yes clearing out bionic will be a good thing all around
19:22:56 it's old, its mirror is not small, and it's an image build that we don't need anymore
19:23:15 also centos 9 stream's mirror grows at a rate that requires regular quota bumps
19:23:33 someone like spotz might have insight into how we can manage that better
19:24:24 any other afs mirror content insights/thoughts/cleanups we want to talk about?
19:25:05 not it
19:25:21 #topic Lists Server Slowness
19:25:44 also done, i think (other than deleting the old copy of mailman data)
19:26:06 as noted last week we think we tracked this down to iops limits on the disk that was hosting the lists service. Since then fungi attached a new cinder volume with many more iops and migrated mailman onto it. I haven't had any issues with slow mailman since the migration
19:26:11 if nobody's aware of any problems since friday, i'll go ahead and do that
19:26:28 I haven't seen any. I also plan to take this off of the meeting agenda for next week as things seem much happier now
19:26:34 the new cinder volume is ssd-backed too, for clarity
19:26:55 Oh I wrote a tool to verify the repo metadata for an RPM-based distribution. I could possibly extend that to generate an rsync filter to help us mirror less
19:27:04 tonyb: that would be neat
19:27:10 very neat indeed!
19:27:14 if we're okay being a partial mirror not a complete one
19:27:25 tonyb: yes, we already try to trim out things we don't need
19:27:30 i expect we're very fine with that
19:27:36 it's just not feasible for us to maintain complete mirrors for each of the distros we're working with
19:27:37 Okay I'm on it
19:27:53 we also have incomplete mirrors of deb-oriented repos too
19:27:59 if you notice any slowness from lists/mailman please let us know. Otherwise I think we're considering this fixed
19:28:06 like not mirroring all architectures and suites
19:28:24 Okay.
19:28:53 basically we mirror content if a large subset of our ci jobs will use it. anything jobs aren't using doesn't need to be mirrored
19:29:18 That's what I thought
19:29:24 we don't intend, and don't even want, for these to be treated as general-purpose mirrors
19:30:39 #topic Deleting ze11
19:30:59 a while back we disabled ze11 in the zuul cluster because it was not able to clone nova within the 10 minute timeout
19:31:10 other nodes take about 3-4 minutes. ze11 needs about 13 or so
19:31:34 rather than try and debug that further at this point I think we should probably just delete it. We don't need that many executors with our workload anymore
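
On tonyb's idea above of generating an rsync filter from RPM repo metadata: a rough sketch of the general shape, under some assumptions -- a placeholder mirror URL, gzip-compressed primary metadata, and plain include/exclude filter rules. tonyb's actual tool may look quite different.

```python
#!/usr/bin/env python3
"""Read a yum/dnf repo's repodata and emit rsync filter rules that keep only
the packages the metadata actually references. The repo URL is a placeholder
and gzip-compressed primary metadata is assumed."""

import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical repo root (the directory containing repodata/).
REPO_URL = "https://mirror.example.org/centos-stream/9-stream/BaseOS/x86_64/os"
NS = {
    "repo": "http://linux.duke.edu/metadata/repo",
    "common": "http://linux.duke.edu/metadata/common",
}


def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()


# repomd.xml points at the primary metadata, which lists every package path.
repomd = ET.fromstring(fetch(f"{REPO_URL}/repodata/repomd.xml"))
primary_href = repomd.find("repo:data[@type='primary']/repo:location", NS).get("href")
primary = ET.fromstring(gzip.decompress(fetch(f"{REPO_URL}/{primary_href}")))

print("+ */")              # keep the directory tree so nested includes match
print("+ /repodata/***")   # always keep the repo metadata itself
for pkg in primary.findall("common:package", NS):
    print("+ /" + pkg.find("common:location", NS).get("href"))
print("- *")               # anything not explicitly included is skipped
```

Feeding that output to rsync via --filter='merge rules.txt' along with --prune-empty-dirs would, in principle, skip anything the repodata no longer references; whether that plays nicely with our existing mirror-update scripts is the part that would need working out.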
19:31:45 then if we want to we can also delete ze12 to avoid confusion about the gap
19:32:15 mostly want to bring this up to see if anyone has concerns (maybe we do need that many executors?) and feedback on whether ze12 should meet the same fate if ze11 goes away
19:32:30 Fine by me, I don't really care about the gap
19:32:41 https://review.opendev.org/961530 is related
19:33:06 If corvus agrees then I think that's a fair plan
19:33:07 that's intended to clean up the gauges (would take effect now, it's orthogonal to deletion)
19:33:34 yeah, i think deleting it is fine
19:34:00 cool. I can put that on my todo list as I'm not hearing any objections
19:34:09 does that clear out the data, so if we take a server offline for a day we lose all its history?
19:34:54 the commit message does say "delete"
19:34:59 i don't think so; it should just stop sending bogus values
19:35:09 they're deleted from the statsd server, not from graphite
19:35:12 ah
19:35:26 okay, that sounds a little better
19:35:56 so if we bring a server back online a few days later, there's just a reporting gap from where it was not in the statsd server
19:36:03 yep
19:36:14 thanks@
19:36:18 s/@/!/
19:36:19 the trick is getting the interval right
19:36:38 how long is long enough to say it's dead versus just idle
19:36:46 24h is my starting guess :)
19:37:00 wfm
19:37:02 seems reasonable
19:37:49 #topic Zuul Launcher Updates
19:38:10 then also related to zuul I wanted to recap some of the launcher stuff that has happened recently because there was some confusion over what had been done
19:38:58 After zuul restarts ~10 days ago the launchers stopped being able to boot nodes in rax-flex. They were getting "multiple networks found" errors. I discovered that about a week ago and added explicit network config to clouds.yaml for rax flex not realizing it was already in the launcher config
19:39:12 after restarting on the clouds.yaml update, instances started getting multiple nics, breaking their routing tables
19:39:25 we then dropped the zuul launcher config and things worked as expected with one nic
19:40:02 then, thinking the combo of clouds.yaml and launcher config was why we got two interfaces, we flipped things around to having no network config in clouds.yaml and used launcher config to configure the network. This put us back into the multiple nic broken situation
19:40:12 so now we're back to only defining networks in the clouds.yaml for that cloud region
19:40:56 I plan to write a script to try and reproduce this outside of the launcher code so that we can track it down. One idea i have is that it could be related to using floating ips. This is our only fip cloud and maybe attaching an fip is adding another interface for some reason but I have no evidence for this
19:41:19 then separately we disabled rax dfw (classic) because its api started having problems like we had previously seen in iad and ord (classic)
19:41:54 I sent email to rackers asking about this and also gave them a list of old error'd nodepool nodes in ord to clean up. I don't think they are real instances but they exist in the nova db so I think they count against our quota so getting them cleaned up would be nice
19:42:17 oh also I deleted all other nodepool booted instances in our cloud regions. The only remainders should be those in ord that are stuck in an error state that I cannot delete
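
On the plan at 19:40:56 to reproduce the rax-flex multiple-NIC behaviour outside the launcher: a minimal sketch of what such a script could look like, using openstacksdk directly rather than any launcher code. The cloud name, image, flavor, and network below are placeholders; it just boots one throwaway server, optionally lets a floating IP be attached, and counts the neutron ports bound to it.

```python
#!/usr/bin/env python3
"""Boot a throwaway server with openstacksdk and report how many neutron
ports end up attached to it. Cloud, image, flavor, and network names are
placeholders, not our real configuration."""

import openstack

CLOUD = "rax-flex-placeholder"
IMAGE = "ubuntu-noble-test"
FLAVOR = "test-flavor"
NETWORK = "opendev-test-network"

conn = openstack.connect(cloud=CLOUD)

server = conn.create_server(
    name="multi-nic-repro-test",
    image=IMAGE,
    flavor=FLAVOR,
    network=NETWORK,   # comment this out to mimic "no explicit network"
    auto_ip=True,      # attach a floating IP, to exercise the fip theory
    wait=True,
    timeout=600,
)

# List every neutron port bound to the instance; more than one means we
# have reproduced the multi-NIC situation outside the launcher.
ports = list(conn.network.ports(device_id=server.id))
print(f"{server.name}: {len(ports)} port(s)")
for port in ports:
    fixed = [ip["ip_address"] for ip in port.fixed_ips]
    print(f"  port {port.id} on network {port.network_id}: {fixed}")

# Clean up the throwaway node (and its floating IP) when done.
conn.delete_server(server.id, wait=True, delete_ips=True)
```

Running it once with and once without the explicit network argument, and with auto_ip toggled, would at least show whether the extra port appears at boot time or only when the floating IP gets attached.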
19:42:46 it occurs to me that we might want to check for nodepool era images that need cleanup now too
19:42:51 I haven't done that
19:43:23 last week there was also a bug in the launcher that prevented cleanup of leaked nodes
19:43:26 that should be fixed now
19:43:32 anything else related to the launchers?
19:43:41 what's the next step for rax classic?
19:43:54 corvus: in terms of re-enabling rax-dfw or ?
19:43:57 other than hoping we hear back from them
19:44:01 yeah
19:44:30 I guess if we don't hear back this week we could try turning it back on and see if the issues were resolved and we just weren't notified?
19:44:44 and if still unhappy write another email?
19:45:36 james denton was very responsive when in irc but isn't there right now unfortunately
19:45:41 that sounds like a plan.
19:46:33 i'm good with that
19:47:03 #topic Matrix for OpenDev Comms
19:47:12 we're running out of time so I want to keep things moving
19:47:27 I have not started on this yet (see notes about distractions that kept me from gerrit things earlier in the meeting)
19:47:39 but it is on my todo list to start bootstrapping things from the spec
19:47:50 lmk if you want to throw a work item my way
19:47:54 will do
19:48:13 Element just updated and looks slightly different fwiw. Not sure I'm a huge fan but it's not terrible. Also personal spaces are becoming helpful as I add more matrix rooms over time
19:48:33 #topic Pre PTG Planning
19:48:39 #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:48:44 Times: Tuesday October 7 1800-2000 UTC, Wednesday October 8 1500-1700 UTC, Thursday October 9 1500-1700 UTC
19:48:50 this is ~2 weeks away
19:49:05 we'll have our normal meeting here next week, then the week after that our meeting will be the pre ptg
19:49:27 I did want to note that TheJulia added a topic about exploring AI/LLM driven code review processes/tooling
19:50:04 I suspect that that can be driven entirely via zuul jobs by communities interested in doing so. I also suggested that there may be an opportunity to collaborate with other zuul users in zuul-jobs to build jobs/roles that enable the functionality
19:50:18 FWIW: I don't think I will be in a US timezone for the pre-ptg but I'll adjust my sleep schedule to maximise time overlap that week
19:50:34 but I wanted to call that out as it's not something we're already dealing with day to day so it is a topic some of us may wish to read up on beforehand
19:50:40 tonyb: thank you for the heads up
19:51:01 tonyb: will you be +11ish? (I think that is at least close to your typical timezone)
19:52:39 in any case feel free to add your ideas for the pre ptg to the etherpad and take a moment to read the current list to ensure that we're ready to dive in in a couple of weeks
19:52:48 #topic Etherpad 2.5.0 Upgrade
19:53:06 on the previous topic, i'm happy to find apac-friendly times to hang out as well
19:53:13 ya we can probably adjust as necessary too
19:53:21 clarkb: I think that's the correct UTC offset
19:53:25 fungi: for etherpad did you check the test node?
19:53:33 104.130.127.119 is a held node for testing.
19:54:03 mostly wondering if anyone else has checked it to either confirm or deny what I'm seeing in my browser in terms of layout behavior still being meh for the front page but ok on the etherpads
19:55:24 Don't adjust for me. 1500-1700 UTC is a little rough but it's only a couple of days, and I'm kinda optional anyway
19:56:17 ok we've only got a few minutes left. If someone can check the held etherpad node and let me know if the layout looks terrible for them too then I can work on a followup issue update for upstream
19:56:22 #topic Open Discussion
19:56:51 If you are attending the summit, Friday night is the one night I'll have free to get together, so if you'd keep that night free too we can do an informal dinner get-together or something along those lines
19:57:16 anything else?
19:57:25 wanted to make sure we didn't run out of time before opening the floor
19:59:00 clarkb: Friday 17th? (clarifying)
19:59:09 tonyb: yes
19:59:13 noted
19:59:59 I fly in thursday night, friday is the first day of the event, then saturday and sunday I've got things already planned for me, and then I'm out mondayish (I'm actually spending the day in paris monday and flying out tuesday morning)
20:00:19 and we are at time
20:00:21 thank you everyone
20:00:23 noted.
20:00:26 we'll be back here same time and location next week
20:00:26 Thanks all
20:00:28 see you there
20:00:30 #endmeeting