Tuesday, 2025-09-23

seunghunleeHello everyone. I'm not sure if this is the right place to ask. But is there a way to find out if Zuul had a problem last Friday?09:27
mnasiadkaseunghunlee: that’s the channel for the regular weekly OpenDev meeting - it’s better to ask on #opendev10:06
seunghunleemnasiadka: thanks10:11
*** NeilHanlon_ is now known as NeilHanlon15:04
clarkbjust about meeting time18:58
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep 23 19:00:19 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YGIEBCAQV4W5TXZVDQTYOHFZQ47SRBPP/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI'm out on Thursday, just a heads up19:00
clarkbthen also I have someone from my ISP coming out to check my regular internet connectivity troubles later today so don't be surprised if my internets drop out this afternoon19:01
clarkbDid anyone else have announcements?19:01
fungii did not19:01
clarkbseems like we can dive right in19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:02
clarkbI don't have any new updates there. Unfortunately I've been distracted by the raxflex double nic issue and preparing for the summit. But considering that the openstack release is happening in a week this is probably not super urgent at the moment19:03
clarkbDid anyone else have thoughts/concerns/ideas around the gerrit 3.11 upgrade?19:03
fungii too have been distracted by other tasks, so not yet19:04
clarkb#link https://zuul.opendev.org/t/openstack/build/54f6629a3041466ca2b1cc6bf17886c419:04
clarkb#link https://zuul.opendev.org/t/openstack/build/c9051c435bf7414b986c37256f71538e19:04
clarkbthese job links should point at held nodes for anyone who wants to look at them19:04
clarkbthese were refreshed for the new gerrit container images after we rebuilt for 3.10.8 and 3.11.519:04
clarkb#topic Upgrading old servers19:05
clarkbI think the openafs and kerberos clusters are fully upgraded to noble now19:05
clarkbthank you fungi for driving that and getting it done. it took a while but slow steady progress saw it to the end19:05
clarkbanything to note about those upgrades? I guess we have to watch out for the single cpu boots when upgrading in place to noble as well as updating network configs19:06
fungiyeah, all done19:06
clarkbbut I'm not sure we'll do any more in-place upgrades to noble? Maybe for lists one day (it is jammy so not in a hurry on that one)19:07
fungii haven't cleaned up the old eth0 interface configs, but they aren't hurting anything19:07
clarkbya should be completely ignored at this point19:07
fungialso our use of openafs from a ppa complicated the upgrades in ways that won't impact the lists server upgrade19:07
clarkbwith these servers done the graphite and backup servers are next up19:08
clarkbI briefly thought about how we might handle the backup servers and I think deploying new noble nodes with new backup volumes is probably ideal. Then after backups have shifted to the new nodes we can move the old backup volumes to the new nodes to preserve that data for a bit then delete the old backup servers19:09
fungiwfm19:10
fungior use new volumes even, and just retire the old servers and volumes after a while19:10
clarkbthere are two concerns I want to look into before doing this. The first is what does having 3 or 4 backup servers in the inventory look like when it comes to the backup cron jobs on nodes (will we back up to 4 locations or can we be smarter about that?) and then also we need a newer version of borg on the server side on noble due to python versions. We'll want to test that the19:10
clarkbold borg version on older nodes can back up to the new version19:10
clarkbwe already know that new borg backing up to old borg seems to work so I expect it to work the other way around but it is worth checking19:10
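A minimal sketch of that cross-version check, assuming a throwaway test repo; the host name, user and paths below are placeholders, not our real backup hosts. Run from an older (pre-noble) client, it exercises the newer server-side borg via --remote-path:

    #!/usr/bin/env python3
    # Hypothetical compatibility check: old borg client pushing to a repo
    # served by a newer borg binary on the (noble) server side.
    import subprocess

    REMOTE_REPO = "ssh://borg-test@backup-test.example.org/opt/backups-test/compat"
    REMOTE_BORG = "/usr/bin/borg"  # newer borg on the server, run as "borg serve"

    def run(*args):
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    # Initialize a throwaway repo, push a tiny archive from the old client,
    # then list it back to confirm the two versions interoperate.
    run("borg", "init", "--encryption=none", "--remote-path", REMOTE_BORG, REMOTE_REPO)
    run("borg", "create", "--remote-path", REMOTE_BORG,
        REMOTE_REPO + "::compat-test-{now}", "/etc/hostname")
    run("borg", "list", "--remote-path", REMOTE_BORG, REMOTE_REPO)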
clarkbgraphite is probably a bit more straightforward. We'll need a new server then take a downtime to move the data and start services up again on the new server after updating dns19:11
clarkbwe might need to restart services that report to graphite if they cache the dns record19:11
clarkbI'm not sure either of these two efforts are on my immediate todo list so help is appreciated. I'm also happy to rearrange my priorities if others think these should be bumped up over say zuul launcher debugging and gerrit upgrade planning and opendev matrix work19:12
tonybif the new borg required because of python versions is a problem, we could use a borg container in the near term19:12
clarkbtonyb: that is a good point. Given it works the other way around I don't expect it to be a problem but we can fall back to that too19:12
clarkband ya as more things update to noble fewer things will be in that situation anyway19:12
clarkbok any other thoughts on server upgrades?19:12
clarkb#topic AFS mirror content cleanup19:13
clarkbonce the afs servers were updated fungi went ahead and performed the cleanups we had talked about19:14
clarkbdebian stretch and bullseye backports content has been cleared out19:14
clarkbas has the openeuler mirror content19:14
fungiyeah, i ran into one stumbling block19:14
clarkbI still wonder if we can put the reprepro cleanup steps in the regular cronjob for mirror updates19:14
clarkboh I missed that19:14
fungiour reprepro mirror script keeps separate state in the form of a list of packages to remove on the next pass19:15
fungiif those packages don't exist anywhere, then that command fails19:15
fungii'll try to remember to push up an addition to our docs that says to clear out that file19:15
fungibecause it confused the heck out of me for hours before i realized that was the problem19:15
clarkbthe cleanup steps clear out files that would be listed in that file and cleared out in the next pass?19:16
fungiyes19:16
clarkbgot it. That might also complicate automating the cleanup steps19:16
fungiand since they're not present, the clearout errors19:16
fungiit was dead simple once i realized what was happening, but basically took me realizing that the update was working and then stepping through our script to find the failing command19:17
clarkbwe can also probably start looking at mirror cleanups for puppetlabs, ceph, and docker content that is old/ancient. These are unlikely to have the same level of impact from a disk consumption perspective though19:17
clarkband then once that is done we can also clean up old python wheel cache content for distro releases we no longer have test nodes for19:17
fungiit tripped me up on the bionic arm64 cleanup too, but i somehow "fixed" the problem at some point without realizing it19:17
clarkbso still lots of cleanup possible but more of a long tail in terms of disk consumption19:17
fungioh, the follow-up change for the openeuler mirror removal still needs reviews19:18
fungi#link https://review.opendev.org/959892 Clean up OpenEuler mirroring infrastructure19:18
clarkbI'll review that after the meeting19:19
fungithanks!19:19
clarkbat this point the cleanup is probably sufficient that we can entertain adding new mirrors again19:19
clarkbI still think we need to be careful about doing so because it is easy to add content but more difficult to clear it out as this long overdue cleanup illustrates19:19
fungiyes, though we still need to be thinking about how soon to yank ubuntu-bionic19:19
fungi(amd64)19:20
clarkbyup I suspect but don't know for certain that starlingx is relying on bionic19:20
clarkbI think they plan to do a release next month so we may want to check with them first?19:20
fungior after their release, if that's what you meant19:20
clarkbbut other than that I feel like we can be full steam ahead on that cleanup19:20
clarkbfungi: ya either after their release or if they give an indication its unlikely to affect them19:21
clarkbspot checking a random recent change they run a lot of things on debian bullseye19:22
clarkbso maybe bionic won't impact them like I think it will19:22
clarkbbut yes clearing out bionic will be a good thing all around19:22
clarkbit's old, its mirror is not small, and it's an image build that we don't need anymore19:22
clarkbalso centos 9 stream's mirror grows at a rate that requires regular quota bumps19:23
clarkbsomeone like spotz might have insight into how we can manage that better19:23
clarkbany other afs mirror content insights/thoughts/cleanups we want to talk about?19:24
funginot it19:25
clarkb#topic Lists Server Slowness19:25
fungialso done, i think (other than deleting the old copy of mailman data)19:25
clarkbas noted last week we think we tracked this down to iops limits on the disk that was hosting the lists service. Since then fungi attached a new cinder volume with many more iops and migrated mailman on to it. I haven't had any issues with slow mailman since the migration19:26
fungiif nobody's aware of any problems since friday, i'll go ahead and do that19:26
clarkbI haven't seen any. I also plan to take this off of the meeting agenda for next week as things seem much happier now19:26
fungithe new cinder volume is ssd-backed too, for clarity19:26
tonybOh I wrote a tool to verify the repo metadata for an RPM-based distribution.  I could possibly extend that to generate an rsync filter to help us mirror less19:26
clarkbtonyb: that would be neat19:27
fungivery neat indeed!19:27
tonybif we're okay being a partial mirror not a complete one19:27
clarkbtonyb: yes, we already try to trim out things we don't need19:27
fungii expect we're very fine with that19:27
clarkbit's just not feasible for us to maintain complete mirrors for each of the distros we're working with19:27
tonybOkay I'm on it19:27
fungiwe also have incomplete mirrors of deb-oriented repos too19:27
clarkbif you notice any slowness from lists/mailman please let us know. Otherwise I think we're considering this fixed19:27
fungilike not mirroring all architectures and suites19:28
tonybOkay.19:28
fungibasically we mirror content if a large subset of our ci jobs will use it. anything jobs aren't using doesn't need to be mirrored19:28
tonybThat's what I thought19:29
fungiwe don't intend, and don't even want, for these to be treated as general-purpose mirrors19:29
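Going back to tonyb's mirror-filter idea above, a rough sketch of what generating such a filter could look like, assuming we only keep the files a repo's repomd.xml actually references (the repo path is a placeholder, and a real tool would also walk the package list in primary.xml):

    #!/usr/bin/env python3
    # Rough sketch, not an existing opendev tool: read a yum/dnf repo's
    # repomd.xml and emit an rsync filter that includes only the metadata
    # files it references, excluding everything else.
    import xml.etree.ElementTree as ET

    NS = {"repo": "http://linux.duke.edu/metadata/repo"}

    def wanted_paths(repomd_path):
        """Yield the data files (primary, filelists, ...) listed in repomd.xml."""
        root = ET.parse(repomd_path).getroot()
        yield "repodata/repomd.xml"
        for data in root.findall("repo:data", NS):
            location = data.find("repo:location", NS)
            if location is not None:
                yield location.get("href")

    def write_rsync_filter(paths, out="mirror.filter"):
        """Write rules usable with rsync --filter='merge mirror.filter'."""
        paths = sorted(set(paths))
        dirs = set()
        for p in paths:
            parts = p.split("/")[:-1]
            for i in range(1, len(parts) + 1):
                dirs.add("/".join(parts[:i]) + "/")
        with open(out, "w") as f:
            for d in sorted(dirs):        # parent dirs must be included too
                f.write("+ /%s\n" % d)
            for p in paths:
                f.write("+ /%s\n" % p)
            f.write("- *\n")              # drop anything not listed above

    if __name__ == "__main__":
        # placeholder path into a local copy of a repo's metadata
        write_rsync_filter(wanted_paths("repodata/repomd.xml"))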
clarkb#topic Deleting ze1119:30
clarkba while back we disabled ze11 in the zuul cluster because it was not able to clone nova within the 10 minute timeout19:30
clarkbother nodes take about 3-4 minutes. ze11 needs about 13 or so19:31
clarkbrather than try and debug that further at this point I think we should probably just delete it. We don't need that many executors with our workload anymore19:31
clarkbthen if we want to we can also delete ze12 to avoid confusion about the gap19:31
clarkbmostly want to bring this up to see if anyone has concerns (maybe we do need that many executors?) and feedback on whether ze12 should meet the same fate if ze11 goes away19:32
tonybFine by me, I don't really care about the gap19:32
corvushttps://review.opendev.org/961530 is related19:32
tonybIf corvus agrees then I think that's a fair plan19:33
corvusthat's intended to clean up the gauges (would take effect now, it's orthogonal to deletion)19:33
corvusyeah, i think deleting it is fine19:33
clarkbcool. I can put that on my todo list as I'm not hearing any objections19:34
fungidoes that clear out the data, so if we take a server offline for a day we lose all its history?19:34
clarkbthe commit message does say "delete"19:34
corvusi don't think so; it should just stop sending bogus values19:34
corvusthey're deleted from the statsd server, not from graphite19:35
clarkbah19:35
fungiokay, that sounds a little better19:35
fungiso if we bring a server back online a few days later, there's just a reporting gap from where it was not in the statsd server19:35
corvusyep19:36
fungithanks@19:36
fungis/@/!/19:36
corvusthe trick is getting the interval right19:36
corvushow long is long enough to say it's dead versus just idle19:36
corvus24h is my starting guess :)19:36
fungiwfm19:37
clarkbseems reasonable19:37
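For context, a generic illustration of the expiry idea corvus describes (not Zuul's actual implementation, which is the change linked above): remember when each gauge was last updated and drop any that have been idle past a cutoff such as 24 hours:

    import time

    GAUGE_TTL = 24 * 60 * 60  # seconds of silence before a gauge is considered dead

    class ExpiringGauges:
        """Toy gauge store that forgets gauges nothing has updated recently."""

        def __init__(self, ttl=GAUGE_TTL):
            self.ttl = ttl
            self.values = {}     # gauge name -> last value
            self.last_seen = {}  # gauge name -> monotonic time of last update

        def update(self, name, value):
            self.values[name] = value
            self.last_seen[name] = time.monotonic()

        def emit(self):
            """Return only the live gauges; expired ones stop being reported."""
            now = time.monotonic()
            for name in list(self.values):
                if now - self.last_seen[name] > self.ttl:
                    # e.g. a deleted executor: graphite then shows a gap
                    # instead of a stale flat line.
                    del self.values[name]
                    del self.last_seen[name]
            return dict(self.values)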
clarkb#topic Zuul Launcher Updates19:37
clarkbthen also related to zuul I wanted to recap some of the launcher stuff that has happened recently because there was some confusion over what had been done19:38
clarkbAfter zuul restarts ~10 days ago the launchers stopped being able to boot nodes in rax-flex. They were getting multiple networks found errors. I discovered that about a week ago and added explicit network config to clouds.yaml for rax flex not realizing it was already in the launcher config19:38
clarkbafter restarting on the clouds.yaml update instances started getting multiple nics breaking their routing tables19:39
clarkbwe then dropped the zuul launcher config and things worked as expected with one nic19:39
clarkbthen, thinking the combo of clouds.yaml and launcher config was why we got two interfaces, we flipped things around to having no network config in clouds.yaml and used the launcher config to configure the network. This put us back into the multiple nic broken situation19:40
clarkbso now we're back to only defining networks in the clouds.yaml for that cloud region19:40
clarkbI plan to write a script to try and reproduce this outside of the launcher code so that we can track it down. One idea I have is that it could be related to using floating ips. This is our only fip cloud and maybe attaching an fip is adding another interface for some reason, but I have no evidence for this19:40
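A sketch of what such a standalone reproducer might look like with openstacksdk; the cloud, image, flavor and network names here are placeholders rather than our real config:

    #!/usr/bin/env python3
    # Boot a test server in the suspect region, let the SDK attach a
    # floating ip, then dump the server's interfaces to see whether a
    # second nic shows up outside of the launcher code path.
    import openstack

    conn = openstack.connect(cloud="raxflex-test")  # placeholder cloud name

    server = conn.create_server(
        name="multinic-repro-test",
        image="ubuntu-noble",                # placeholder image
        flavor="gp.0.2.4",                   # placeholder flavor
        network="opendevzuul-network",       # drop this to test the implicit-network case
        auto_ip=True,                        # allocate and attach a floating ip
        wait=True,
    )

    # More than one interface listed here would reproduce the broken
    # multi-nic routing situation seen on the held instances.
    for iface in conn.compute.server_interfaces(server):
        print(iface.port_id, iface.net_id,
              [fip["ip_address"] for fip in iface.fixed_ips])

    conn.delete_server(server.id, wait=True, delete_ips=True)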
clarkbthen separately we disabled rax dfw (classic) because its api started having problems like we had previously seen in iad and ord (classic)19:41
clarkbI sent email to rackers asking about this and also gave them a list of old error'd nodepool nodes in ord to clean up. I don't think they are real instances but they exist in the nova db so I think they count against our quota so getting them cleaned up would be nice19:41
clarkboh also I deleted all other nodepool booted instances in our cloud regions. The only remainders should be those in ord that are stuck in an error state that I cannot delete19:42
clarkbit occurs to me that we might want to check for nodepool era images that need cleanup now too19:42
clarkbI haven't done that19:42
clarkblast week there was also a bug in the launcher that prevented cleanup of leaked nodes19:43
clarkbthat should be fixed now19:43
clarkbanything else related to the launchers?19:43
corvuswhat's the next step for rax classic?19:43
clarkbcorvus: in terms of reenabling rax-dfw or ?19:43
fungiother than hoping we hear back from them19:43
corvusyeah19:44
clarkbI guess if we don't hear back this week we could try turning it back on and see if the issues were resolved and we just weren't notified?19:44
clarkband if still unhappy write another email?19:44
clarkbjames denton was very responsive when he was in irc but isn't there right now unfortunately19:45
corvusthat sounds like a plan.19:45
fungii'm good with that19:46
clarkb#topic Matrix for OpenDev Comms19:47
clarkbwe're running out of time so I want to keep things moving19:47
clarkbI have not started on this yet (see notes about distractions that kept me from gerrit things earlier in the meeting)19:47
clarkbbut it is on my todo list to start bootstrapping things from the spec19:47
corvuslmk if you want to throw a work item my way19:47
clarkbwill do19:47
clarkbElement just updated and looks slightly different fwiw. Not sure I'm a huge fan but it's not terrible. Also personal spaces are becoming helpful as I add more matrix rooms over time19:48
clarkb#topic Pre PTG Planning19:48
clarkb#link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document19:48
clarkbTimes: Tuesday October 7 1800-2000 UTC, Wednesday October 8 1500-1700 UTC, Thursday October 9 1500-170019:48
clarkbthis is ~2 weeks away19:48
clarkbwe'll have our normal meeting here next week then the week after our meeting will be the pre ptg19:49
clarkbI did want to note that TheJulia added a topic about exploring AI/LLM driven code review processes/tooling19:49
clarkbI suspect that can be driven entirely via zuul jobs by communities interested in doing so. I also suggested that there may be opportunity to collaborate with other zuul users in zuul-jobs to build jobs/roles that enable the functionality19:50
tonybFWIW: I don't think I will be in a US timezone for the pre-ptg but I'll adjust my sleep schedule to maximise time overlap that week19:50
clarkbbut I wanted to call that out as its not something we're already dealing with day to day so it is a topic some of us may wish to read up on before hand19:50
clarkbtonyb: thank you for the heads up19:50
clarkbtonyb: will you be +11ish ? (I think that is at least close to your typical timezone)19:51
clarkbin any case feel free to add your ideas for the pre ptg to the etherpad and take a moment to read the current list to ensure that we're ready to dive in in a couple of weeks19:52
clarkb#topic Etherpad 2.5.0 Upgrade19:52
fungion the previous topic, i'm happy to find apac-friendly times to hang out as well19:53
clarkbya we can probably adjust as necessary too19:53
tonybclarkb: I think that's the correct UTC offset19:53
clarkbfungi: for etherpad did you check the test node?19:53
clarkb104.130.127.119 is a held node for testing.19:53
clarkbmostly wondering if anyone else has checked it to either confirm or deny what I'm seeing in my browser in terms of layout behavior still being meh for the front page but ok on the etherpads19:54
tonybDon't adjust for me.  1500-1700 UTC is a little rough but it's only a couple of days, and I'm kinda optional anyway19:55
clarkbok we've only got a few minutes left. If someone can check the held etherpad node and let me know if the layout looks terrible for them too then I can work on a followup issue update for upstream19:56
clarkb#topic Open Discussion19:56
clarkbIf you are attending the summit I think Friday night is the one night I'll have free to get together so if you'd keep that night free too we can do an informal dinner get together or something along those lines19:56
clarkbanything else?19:57
clarkbwanted to make sure we didn't run out of time before opening the floor19:57
tonybclarkb: Friday 17th? (clarifying)19:59
clarkbtonyb: yes19:59
tonybnoted19:59
clarkbI fly in thursday night, friday is the first day of the event, then saturday and sunday I've got things already planned for me and then I'm out mondayish (I'm actually spending the day in paris monday and flying out tuesday morning)19:59
clarkband we are at time20:00
clarkbthank you everyone20:00
tonybnoted.20:00
clarkbwe'll be back here same time and location next week20:00
tonybThanks all20:00
clarkbsee you there20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Sep 23 20:00:30 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.log.html20:00
