19:00:19 #startmeeting infra
19:00:19 Meeting started Tue Sep 23 19:00:19 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:19 The meeting name has been set to 'infra'
19:00:25 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YGIEBCAQV4W5TXZVDQTYOHFZQ47SRBPP/ Our Agenda
19:00:31 #topic Announcements
19:00:40 I'm out on Thursday just a heads up
19:01:08 then also I have someone from my ISP coming out to check my regular internet connectivity troubles later today so don't be surprised if my internets drop out this afternoon
19:01:16 Did anyone else have announcements?
19:01:51 i did not
19:02:56 seems like we can dive right in
19:02:59 #topic Gerrit 3.11 Upgrade Planning
19:03:31 I don't have any new updates there. Unfortunately I've been distracted by the raxflex double nic issue and preparing for the summit. But considering that the openstack release is happening in a week this is probably not super urgent at the moment
19:03:43 Did anyone else have thoughts/concerns/ideas around the gerrit 3.11 upgrade?
19:04:09 i too have been distracted by other tasks, so not yet
19:04:17 #link https://zuul.opendev.org/t/openstack/build/54f6629a3041466ca2b1cc6bf17886c4
19:04:21 #link https://zuul.opendev.org/t/openstack/build/c9051c435bf7414b986c37256f71538e
19:04:30 these job links should point at held nodes for anyone who wants to look at them
19:04:47 these were refreshed for the new gerrit container images after we rebuilt for 3.10.8 and 3.11.5
19:05:29 #topic Upgrading old servers
19:05:43 I think the openafs and kerberos clusters are fully upgraded to noble now
19:05:58 thank you fungi for driving that and getting it done. it took a while but slow steady progress saw it to the end
19:06:48 anything to note about those upgrades? I guess we have to watch out for the single cpu boots when upgrading in place to noble as well as updating network configs
19:06:51 yeah, all done
19:07:06 but I'm not sure we'll do any more in-place upgrades to noble? Maybe for lists one day (it is jammy so not in a hurry on that one)
19:07:17 i haven't cleaned up the old eth0 interface configs, but they aren't hurting anything
19:07:48 ya should be completely ignored at this point
19:07:54 also our use of openafs from a ppa complicated the upgrades in ways that won't affect the lists server upgrade
19:08:47 with these servers done the graphite and backup servers are next up
19:09:33 I briefly thought about how we might handle the backup servers and I think deploying new noble nodes with new backup volumes is probably ideal. Then after backups have shifted to the new nodes we can move the old backup volumes to the new nodes to preserve that data for a bit then delete the old backup servers
19:10:13 wfm
19:10:27 or use new volumes even, and just retire the old servers and volumes after a while
19:10:31 there are two concerns I want to look into before doing this. The first is what does having 3 or 4 backup servers in the inventory look like when it comes to the backup cron jobs on nodes (will we back up to 4 locations or can we be smarter about that?) and then also we need a newer version of borg on the server side on noble due to python versions.
19:10:33 We'll want to test that the old borg version on older nodes can back up to the new version
19:10:48 we already know that new borg backing up to old borg seems to work so I expect it to work the other way around but it is worth checking
19:11:18 graphite is probably a bit more straightforward. We'll need a new server then take a downtime to move the data and start services up again on the new server after updating dns
19:11:27 we might need to restart services that report to graphite if they cache the dns record
19:12:08 I'm not sure either of these two efforts is on my immediate todo list so help is appreciated. I'm also happy to rearrange my priorities if others think these should be bumped up over say zuul launcher debugging and gerrit upgrade planning and opendev matrix work
19:12:09 if the new borg (needed because of python versions) is a problem, we could use a borg container in the near term
19:12:31 tonyb: that is a good point. Given it works the other way around I don't expect it to be a problem but we can fall back to that too
19:12:42 and ya as more things update to noble fewer things will be in that situation anyway
19:12:59 ok any other thoughts on server upgrades?
19:13:49 #topic AFS mirror content cleanup
19:14:05 once the afs servers were updated fungi went ahead and performed the cleanups we had talked about
19:14:14 debian stretch and bullseye backports content has been cleared out
19:14:19 as has the openeuler mirror content
19:14:35 yeah, i ran into one stumbling block
19:14:37 I still wonder if we can put the reprepro cleanup steps in the regular cronjob for mirror updates
19:14:42 oh I missed that
19:15:06 our reprepro mirror script keeps separate state in the form of a list of packages to remove on the next pass
19:15:20 if those packages don't exist anywhere, then that command fails
19:15:43 i'll try to remember to push up an addition to our docs that says to clear out that file
19:15:58 because it confused the heck out of me for hours before i realized that was the problem
19:16:07 the cleanup steps clear out files that would be listed in that file and cleared out in the next pass?
19:16:23 yes
19:16:34 got it. That might also complicate automating the cleanup steps
19:16:36 and since they're not present, the clearout errors
19:17:19 it was dead simple once i realized what was happening, but basically took me realizing that the update was working and then stepping through our script to find the failing command
19:17:29 we can also probably start looking at mirror cleanups for puppetlabs, ceph, and docker content that is old/ancient. These are unlikely to have the same level of impact from a disk consumption perspective though
19:17:45 and then once that is done we can also clean up old python wheel cache content for distro releases we no longer have test nodes for
19:17:55 it tripped me up on the bionic arm64 cleanup too, but i somehow "fixed" the problem at some point without realizing it
19:17:57 so still lots of cleanup possible but more of a long tail in terms of disk consumption
19:18:27 oh, the follow-up change for the openeuler mirror removal still needs reviews
19:18:40 #link https://review.opendev.org/959892 Clean up OpenEuler mirroring infrastructure
19:19:08 I'll review that after the meeting
19:19:11 thanks!
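
A minimal sketch of the cross-version borg check discussed above under server upgrades, assuming a throwaway test repo on a hypothetical new Noble backup server. The hostname, user, and repo path are placeholders and this is not our production backup tooling; it just drives the stock borg 1.x CLI from a host that still has the older client installed.

```python
#!/usr/bin/env python3
"""Smoke test: older borg client backing up to a newer server-side borg.
All names below are placeholders for a hypothetical test server and repo."""

import subprocess

# Hypothetical repo on a new Noble backup server running the newer borg.
REPO = "ssh://borg-test@backup-test.example.org/opt/backups/compat-test-repo"


def borg(subcommand, *args):
    """Run the locally installed (older) borg client and fail loudly."""
    # --remote-path picks the borg binary/wrapper on the server side.
    cmd = ["borg", subcommand, "--remote-path", "borg", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Round trip: create an unencrypted throwaway repo, push one tiny archive,
# then list it back. A version incompatibility should surface as an error.
borg("init", "--encryption=none", REPO)
borg("create", REPO + "::compat-test-1", "/etc/hostname")
borg("list", REPO)
```

If that init/create/list round trip succeeds from an older-borg host against the newer server-side borg, the fallback tonyb mentioned (running borg from a container in the near term) likely isn't needed.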
19:19:22 at this point the cleanup is probably sufficient that we can entertain adding new mirrors again
19:19:45 I still think we need to be careful about doing so because it is easy to add content but more difficult to clear it out as this long overdue cleanup illustrates
19:19:53 yes, though we still need to be thinking about how soon to yank ubuntu-bionic
19:20:18 (amd64)
19:20:22 yup I suspect but don't know for certain that starlingx is relying on bionic
19:20:36 I think they plan to do a release next month so we may want to check with them first?
19:20:48 or after their release, if that's what you meant
19:20:53 but other than that I feel like we can be full steam ahead on that cleanup
19:21:08 fungi: ya either after their release or if they give an indication it's unlikely to affect them
19:22:24 spot checking a random recent change they run a lot of things on debian bullseye
19:22:33 so maybe bionic won't impact them like I think it will
19:22:47 but yes clearing out bionic will be a good thing all around
19:22:56 it's old, its mirror is not small, and it's an image build that we don't need anymore
19:23:15 also centos 9 stream's mirror grows at a rate that requires regular quota bumps
19:23:33 someone like spotz might have insight into how we can manage that better
19:24:24 any other afs mirror content insights/thoughts/cleanups we want to talk about?
19:25:05 not it
19:25:21 #topic Lists Server Slowness
19:25:44 also done, i think (other than deleting the old copy of mailman data)
19:26:06 as noted last week we think we tracked this down to iops limits on the disk that was hosting the lists service. Since then fungi attached a new cinder volume with many more iops and migrated mailman onto it. I haven't had any issues with slow mailman since the migration
19:26:11 if nobody's aware of any problems since friday, i'll go ahead and do that
19:26:28 I haven't seen any. I also plan to take this off of the meeting agenda for next week as things seem much happier now
19:26:34 the new cinder volume is ssd-backed too, for clarity
19:26:55 Oh I wrote a tool to verify the repo metadata for an RPM-based distribution. I could possibly extend that to generate an rsync filter to help us mirror less
19:27:04 tonyb: that would be neat
19:27:10 very neat indeed!
19:27:14 if we're okay being a partial mirror not a complete one
19:27:25 tonyb: yes, we already try to trim out things we don't need
19:27:30 i expect we're very fine with that
19:27:36 it's just not feasible for us to maintain complete mirrors for each of the distros we're working with
19:27:37 Okay I'm on it
19:27:53 we also have incomplete mirrors of deb-oriented repos too
19:27:59 if you notice any slowness from lists/mailman please let us know. Otherwise I think we're considering this fixed
19:28:06 like not mirroring all architectures and suites
19:28:24 Okay.
19:28:53 basically we mirror content if a large subset of our ci jobs will use it. anything jobs aren't using doesn't need to be mirrored
19:29:18 That's what I thought
19:29:24 we don't intend, and don't even want, for these to be treated as general-purpose mirrors
19:30:39 #topic Deleting ze11
19:30:59 a while back we disabled ze11 in the zuul cluster because it was not able to clone nova within the 10 minute timeout
19:31:10 other nodes take about 3-4 minutes. ze11 needs about 13 or so
19:31:34 rather than try and debug that further at this point I think we should probably just delete it. We don't need that many executors with our workload anymore
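
On tonyb's idea above of generating an rsync filter from RPM repo metadata: a rough sketch of the general shape, under some assumptions -- a placeholder mirror URL, gzip-compressed primary metadata, and plain include/exclude filter rules. tonyb's actual tool may look quite different.

```python
#!/usr/bin/env python3
"""Read a yum/dnf repo's repodata and emit rsync filter rules that keep only
the packages the metadata actually references. The repo URL is a placeholder
and gzip-compressed primary metadata is assumed."""

import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical repo root (the directory containing repodata/).
REPO_URL = "https://mirror.example.org/centos-stream/9-stream/BaseOS/x86_64/os"
NS = {
    "repo": "http://linux.duke.edu/metadata/repo",
    "common": "http://linux.duke.edu/metadata/common",
}


def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()


# repomd.xml points at the primary metadata, which lists every package path.
repomd = ET.fromstring(fetch(f"{REPO_URL}/repodata/repomd.xml"))
primary_href = repomd.find("repo:data[@type='primary']/repo:location", NS).get("href")
primary = ET.fromstring(gzip.decompress(fetch(f"{REPO_URL}/{primary_href}")))

print("+ */")              # keep the directory tree so nested includes match
print("+ /repodata/***")   # always keep the repo metadata itself
for pkg in primary.findall("common:package", NS):
    print("+ /" + pkg.find("common:location", NS).get("href"))
print("- *")               # anything not explicitly included is skipped
```

Feeding that output to rsync via --filter='merge rules.txt' along with --prune-empty-dirs would, in principle, skip anything the repodata no longer references; whether that plays nicely with our existing mirror-update scripts is the part that would need working out.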
19:31:45 then if we want to we can also delete ze12 to avoid confusion about the gap
19:32:15 mostly want to bring this up to see if anyone has concerns (maybe we do need that many executors?) and feedback on whether ze12 should meet the same fate if ze11 goes away
19:32:30 Fine by me, I don't really care about the gap
19:32:41 https://review.opendev.org/961530 is related
19:33:06 If corvus agrees then I think that's a fair plan
19:33:07 that's intended to clean up the gauges (would take effect now, it's orthogonal to deletion)
19:33:34 yeah, i think deleting it is fine
19:34:00 cool. I can put that on my todo list as I'm not hearing any objections
19:34:09 does that clear out the data, so if we take a server offline for a day we lose all its history?
19:34:54 the commit message does say "delete"
19:34:59 i don't think so; it should just stop sending bogus values
19:35:09 they're deleted from the statsd server, not from graphite
19:35:12 ah
19:35:26 okay, that sounds a little better
19:35:56 so if we bring a server back online a few days later, there's just a reporting gap from where it was not in the statsd server
19:36:03 yep
19:36:14 thanks@
19:36:18 s/@/!/
19:36:19 the trick is getting the interval right
19:36:38 how long is long enough to say it's dead versus just idle
19:36:46 24h is my starting guess :)
19:37:00 wfm
19:37:02 seems reasonable
19:37:49 #topic Zuul Launcher Updates
19:38:10 then also related to zuul I wanted to recap some of the launcher stuff that has happened recently because there was some confusion over what had been done
19:38:58 After zuul restarts ~10 days ago the launchers stopped being able to boot nodes in rax-flex. They were getting "multiple networks found" errors. I discovered that about a week ago and added explicit network config to clouds.yaml for rax flex not realizing it was already in the launcher config
19:39:12 after restarting on the clouds.yaml update, instances started getting multiple nics, breaking their routing tables
19:39:25 we then dropped the zuul launcher config and things worked as expected with one nic
19:40:02 then, thinking the combo of clouds.yaml and launcher config was why we got two interfaces, we flipped things around to having no network config in clouds.yaml and used launcher config to configure the network. This put us back into the multiple nic broken situation
19:40:12 so now we're back to only defining networks in the clouds.yaml for that cloud region
19:40:56 I plan to write a script to try and reproduce this outside of the launcher code so that we can track it down. One idea i have is that it could be related to using floating ips. This is our only fip cloud and maybe attaching an fip is adding another interface for some reason but I have no evidence for this
19:41:19 then separately we disabled rax dfw (classic) because its api started having problems like we had previously seen in iad and ord (classic)
19:41:54 I sent email to rackers asking about this and also gave them a list of old error'd nodepool nodes in ord to clean up. I don't think they are real instances but they exist in the nova db so I think they count against our quota so getting them cleaned up would be nice
19:42:17 oh also I deleted all other nodepool booted instances in our cloud regions. The only remainders should be those in ord that are stuck in an error state that I cannot delete
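
On the plan at 19:40:56 to reproduce the rax-flex multiple-NIC behaviour outside the launcher: a minimal sketch of what such a script could look like, using openstacksdk directly rather than any launcher code. The cloud name, image, flavor, and network below are placeholders; it just boots one throwaway server, optionally lets a floating IP be attached, and counts the neutron ports bound to it.

```python
#!/usr/bin/env python3
"""Boot a throwaway server with openstacksdk and report how many neutron
ports end up attached to it. Cloud, image, flavor, and network names are
placeholders, not our real configuration."""

import openstack

CLOUD = "rax-flex-placeholder"
IMAGE = "ubuntu-noble-test"
FLAVOR = "test-flavor"
NETWORK = "opendev-test-network"

conn = openstack.connect(cloud=CLOUD)

server = conn.create_server(
    name="multi-nic-repro-test",
    image=IMAGE,
    flavor=FLAVOR,
    network=NETWORK,   # comment this out to mimic "no explicit network"
    auto_ip=True,      # attach a floating IP, to exercise the fip theory
    wait=True,
    timeout=600,
)

# List every neutron port bound to the instance; more than one means we
# have reproduced the multi-NIC situation outside the launcher.
ports = list(conn.network.ports(device_id=server.id))
print(f"{server.name}: {len(ports)} port(s)")
for port in ports:
    fixed = [ip["ip_address"] for ip in port.fixed_ips]
    print(f"  port {port.id} on network {port.network_id}: {fixed}")

# Clean up the throwaway node (and its floating IP) when done.
conn.delete_server(server.id, wait=True, delete_ips=True)
```

Running it once with and once without the explicit network argument, and with auto_ip toggled, would at least show whether the extra port appears at boot time or only when the floating IP gets attached.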
19:42:46 it occurs to me that we might want to check for nodepool era images that need cleanup now too
19:42:51 I haven't done that
19:43:23 last week there was also a bug in the launcher that prevented cleanup of leaked nodes
19:43:26 that should be fixed now
19:43:32 anything else related to the launchers?
19:43:41 what's the next step for rax classic?
19:43:54 corvus: in terms of re-enabling rax-dfw or ?
19:43:57 other than hoping we hear back from them
19:44:01 yeah
19:44:30 I guess if we don't hear back this week we could try turning it back on and see if the issues were resolved and we just weren't notified?
19:44:44 and if still unhappy write another email?
19:45:36 james denton was very responsive when in irc but isn't there right now unfortunately
19:45:41 that sounds like a plan.
19:46:33 i'm good with that
19:47:03 #topic Matrix for OpenDev Comms
19:47:12 we're running out of time so I want to keep things moving
19:47:27 I have not started on this yet (see notes about distractions that kept me from gerrit things earlier in the meeting)
19:47:39 but it is on my todo list to start bootstrapping things from the spec
19:47:50 lmk if you want to throw a work item my way
19:47:54 will do
19:48:13 Element just updated and looks slightly different fwiw. Not sure I'm a huge fan but it's not terrible. Also personal spaces are becoming helpful as I add more matrix rooms over time
19:48:33 #topic Pre PTG Planning
19:48:39 #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:48:44 Times: Tuesday October 7 1800-2000 UTC, Wednesday October 8 1500-1700 UTC, Thursday October 9 1500-1700 UTC
19:48:50 this is ~2 weeks away
19:49:05 we'll have our normal meeting here next week, then the week after that our meeting will be the pre ptg
19:49:27 I did want to note that TheJulia added a topic about exploring AI/LLM driven code review processes/tooling
19:50:04 I suspect that that can be driven entirely via zuul jobs by communities interested in doing so. I also suggested that there may be an opportunity to collaborate with other zuul users in zuul-jobs to build jobs/roles that enable the functionality
19:50:18 FWIW: I don't think I will be in a US timezone for the pre-ptg but I'll adjust my sleep schedule to maximise time overlap that week
19:50:34 but I wanted to call that out as it's not something we're already dealing with day to day so it is a topic some of us may wish to read up on beforehand
19:50:40 tonyb: thank you for the heads up
19:51:01 tonyb: will you be +11ish? (I think that is at least close to your typical timezone)
19:52:39 in any case feel free to add your ideas for the pre ptg to the etherpad and take a moment to read the current list to ensure that we're ready to dive in in a couple of weeks
19:52:48 #topic Etherpad 2.5.0 Upgrade
19:53:06 on the previous topic, i'm happy to find apac-friendly times to hang out as well
19:53:13 ya we can probably adjust as necessary too
19:53:21 clarkb: I think that's the correct UTC offset
19:53:25 fungi: for etherpad did you check the test node?
19:53:33 104.130.127.119 is a held node for testing.
19:54:03 mostly wondering if anyone else has checked it to either confirm or deny what I'm seeing in my browser in terms of layout behavior still being meh for the front page but ok on the etherpads
19:55:24 Don't adjust for me. 1500-1700 UTC is a little rough but it's only a couple of days, and I'm kinda optional anyway
19:56:17 ok we've only got a few minutes left. If someone can check the held etherpad node and let me know if the layout looks terrible for them too then I can work on a followup issue update for upstream
19:56:22 #topic Open Discussion
19:56:51 If you are attending the summit, Friday night is the one night I'll have free to get together, so if you'd keep that night free too we can do an informal dinner get-together or something along those lines
19:57:16 anything else?
19:57:25 wanted to make sure we didn't run out of time before opening the floor
19:59:00 clarkb: Friday 17th? (clarifying)
19:59:09 tonyb: yes
19:59:13 noted
19:59:59 I fly in thursday night, friday is the first day of the event, then saturday and sunday I've got things already planned for me, and then I'm out mondayish (I'm actually spending the day in paris monday and flying out tuesday morning)
20:00:19 and we are at time
20:00:21 thank you everyone
20:00:23 noted.
20:00:26 we'll be back here same time and location next week
20:00:26 Thanks all
20:00:28 see you there
20:00:30 #endmeeting