Tuesday, 2025-10-28

clarkb	meeting time	19:00
clarkb	#startmeeting infra	19:00
opendevmeet	Meeting started Tue Oct 28 19:00:27 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.	19:00
opendevmeet	Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.	19:00
opendevmeet	The meeting name has been set to 'infra'	19:00
clarkb	#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MRT6BQQHJPYJ43ENYTRSH4IOT6AR7FIW/ Our Agenda	19:01
clarkb	#topic Announcements	19:01
clarkb	I decided to have this meeting today despite the PTG happening this week because it has been some time since we had a meeting	19:03
clarkb	But also keep in mind the PTG is happening this week. I've already put the meetpad servers in the ansible emergency file so they don't get randomly updated by upstream container image updates	19:04
clarkb	#topic Gerrit 3.11 Upgrade Planning	19:05
clarkb	After a zuul launcher upgrade issue my existing holds for testing this are no longer valid so I need to refresh them	19:05
clarkb	At the summit the Gerrit folks didn't feel we were super far behind so that was encouraging	19:06
clarkb	I'm hoping to really start focusing on this again after this week and the ptg	19:07
clarkb	#topic Gerrit Spontaneous Shutdown During Summit	19:07
clarkb	That said during the Summit fungi and ianychoi noticed that Gerrit was not running. It was spontaneously in a shutdown state	19:07
clarkb	fungi was able to restart the VM and then start containers. The main issue was that the h2 cache backing files were not cleared out before doing so, which made startup take a while. But it did start up and has been running since	19:08
clarkb	just keep that in mind if you're restarting Gerrit for any reason: clearing out the h2 cache backing files can speed up startup.	19:08
clarkb	We spoke to mfick about improving this at the summit and he felt he knew what a workable solution was and in fact had merged an attempt at it but it didn't accommodate existing plugin expectations so was reverted	19:08
clarkb	but hopefully that means the issue can be addressed once plugin compatibility is addressed	19:08
clarkb	As for why Gerrit shut down we also spoke to nova folks at the summit and something like running out of memory on the hypervisor could cause libvirt to request VMs shut down and then nova wouldn't read that as an error that is bubbled back up to users	19:09
clarkb	so it seems like this is what caused the problem. Checking on that and mitigating the issue is something we'll have to bring up with vexxhost (which I don't think we have done yet)	19:10
fungi	yeah, it's outwardly consistent with the hypervisor host's oom killer reaping the qemu process, but without access to the host logs we don't know for sure	19:10
fungi	my money's on that though, because it's a very high-ram flavor	19:10
clarkb	ya	19:10
clarkb	#topic Upgrading old servers	19:13
clarkb	I'm not aware of any movement on this topic. But as mentioned previously I think the backup servers are a good next location to focus this effort	19:13
clarkb	we can replace them one at a time and transplant the current backup volumes onto the new servers to preserve that data	19:13
tonyb	I did refresh the mediawiki patches	19:14
fungi	and we have plenty of cinder quota at this point, so new volumes are fine	19:14
clarkb	tonyb: oh cool, should we be reviewing them then?	19:14
tonyb	I think the container build is probably mergeable	19:14
clarkb	excellent /me makes a note on the todo list	19:14
fungi	i would actually add fresh volumes and use the server replacement as an excuse to rotate them	19:14
fungi	and we can still detach the old volumes and attach to the new servers to make them easier to access	19:14
tonyb	I also updated the ansible-devel series so reviewing them would be good	19:15
clarkb	ack also added to my todo list	19:15
tonyb	I added a new, short lived, ansible-next job that targets ansible-11 on bridge, rather than master	19:15
tonyb	I figure we'll want to update bridge before we get rid of focal but that may not be a valid conclusion	19:16
clarkb	makes sense since we're not really in a position to provide feedback upstream for unreleased stuff but having signal about where incompatibilities with the next release are is helpful to us	19:16
clarkb	tonyb: the current bridge is jammy so the order of operations there is a bit more vague I think	19:16
tonyb	Okay	19:17
clarkb	Anything else as far as updating servers goes? I'm glad there is progress and I just need to catch up!	19:18
tonyb	I can change the ansible-next job to jammy if that's a more reasonable target	19:18
clarkb	tonyb: might be worth checking just to see if we can run ansible on jammy with the python version there	19:18
clarkb	to see if we need to upgrade to get new ansible or not	19:18
tonyb	I still haven't actually tested the held MW node, but apart from that I think I'm good	19:18
tonyb	clarkb: noted.	19:19
clarkb	#topic AFS mirror content cleanup	19:20
clarkb	I think this effort has largely stalled out (which is fine, major improvements have been made and the wins we see going forward are much smaller)	19:21
clarkb	I'm curious if A) anyone is interested in chasing that long tail of cleanup and B) if we think we're ready to start mirroring new stuff like say trixie packages?	19:21
tonyb	I think "a" is still valuable, but I don't have cycles for it in the short term. I have no opinion on "b".	19:22
clarkb	ya maybe we need to put A on the backlog list etherpad linked to from specs	19:23
clarkb	for B I'm happy to stick with the current status quo until people find they need it	19:23
clarkb	mostly taking temperature on that I guess	19:23
fungi	noonedeadpunk indicated in #openstack-ansible earlier this week that he'd look into adding a reprepro config patch for trixie soon	19:24
clarkb	cool so there is interest and we can probably wait for that change to show up then	19:24
fungi	they apparently hit some mirror host identification bug in their jobs which was causing the pip.conf to list deb.debian.org as the pypi index host	19:24
fungi	traced back to having an empty mirror host variable	19:25
clarkb	that's weird	19:25
frickler	yes, I had a similar issue with devstack	19:25
clarkb	an unexpected fallback behavior for sure	19:25
frickler	iirc that is because we had to work around the missing mirror in dib/image builds	19:26
clarkb	I think the dib fallback was to use the upstream mirrors though	19:27
clarkb	anyway it's worth tracking down and we don't need to debug it now	19:27
mnasiadka	I can help with cleanup if needed (or in some other area)	19:27
clarkb	anything else related to afs mirroring? I think we can followup on A and B after the meeting as people have time	19:28
tonyb	Possibly related to: https://review.opendev.org/c/zuul/zuul-jobs/+/965008 "Allow mirror_fqdn to be overriden"	19:28
clarkb	#topic Zuul Launcher Updates	19:30
clarkb	As a heads up there is a bug in zuul launcher that currently affects nodesets if the requested node boots fail	19:30
clarkb	zuul tries to recover inappropriately and then fails the nodeset	19:30
clarkb	there is a fix for this currently in the zuul gate, but zuul ci hit problems due to the new pip release so it's been a slow march to get the fix landed	19:31
clarkb	there was also a fix to some test cases identified to hopefully make the test cases more reliable. I'm hopeful with those two fixes in place we'll be able to land the launcher fix then restart launchers to address the node failure problem	19:31
clarkb	at this point I think we're on the right path to correcting this but wanted people to be aware	19:31
clarkb	any other zuul launcher concerns or feedback?	19:32
clarkb	#link https://review.opendev.org/c/zuul/zuul/+/964893 this is the node failure fixup	19:32
clarkb	#topic Matrix for OpenDev comms	19:33
clarkb	In addition to the Gerrit upgrade this is the other item that is high on my todo list	19:33
clarkb	I should be able to start on room creation and work through some of the bits of the spec that don't require user facing changes	19:34
clarkb	then when we're happy with the state of things we can make it more official and start porting usage over	19:34
tonyb	Sounds good	19:34
clarkb	#topic Etherpad 2.5.1 Upgrade	19:36
clarkb	Etherpad 2.5.0 was the version I was looking at previously with the broken but slightly improved css	19:36
clarkb	since then there is a new 2.5.1 release so I need to update the upgrade change and recycle test nodes and check if css is happy now	19:37
clarkb	but I didn't want to do that prior to or during the ptg so this is probably going to wait for a bit	19:37
clarkb	#link https://github.com/ether/etherpad-lite/blob/v2.5.1/CHANGELOG.md Is the upstream changelog	19:37
clarkb	I would say that often times their changelog is very incomplete	19:38
clarkb	#topic Gitea 1.24.7 Upgrade	19:38
clarkb	Gitea has pushed a new release too	19:38
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/964899/ Upgrade Gitea to 1.24.7	19:38
clarkb	I think we can probably proceed with updating this service if it looks like the service itself is stable and not falling over due to crawlers	19:38
clarkb	the screenshots looked good to me but please double check when you review the change	19:39
clarkb	#topic Gitea Performance	19:39
tonyb	They look good to me	19:39
clarkb	which brings us to the general gitea performance issue	19:39
clarkb	prior to the summit we thought that part of the problem was crawlers hitting backends directly	19:39
clarkb	this meant that the load balancer couldn't really balance effectively as it is unaware of any direct connections	19:39
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access	19:40
clarkb	this change is a response to that. It will limit our ability to test specific backends without doing something like ssh port forwarding	19:40
clarkb	however, yesterday performance was poor and the traffic did seem to be going through the load balancer	19:40
clarkb	so forcing everything through the load balancer is unlikely to fix all the issues. That said I suspect it will generally be an improvement	19:40
clarkb	yesterday I had to block a particularly bad crawler's ip addresses after confirming it was crawling with an odd and what appeared to be bogus user agent	19:41
clarkb	after doing that things settled down a bit and the service seemed happier. Spot checking now seems to show things are still reasonably happy	19:41
clarkb	I did identify one other problematic crawler that I intended on blocking if things didn't improve after the first was blocked but that was not necessary	19:42
clarkb	(this crawler is using a specific cloud provider and I was going to block that cloud provider's ip range....)	19:42
clarkb	anyway I guess the point here is the battle is ongoing and I'm less certain 964728 will help significantly but I'm willing to try it if others think it is a good idea	19:43
clarkb	I'm also open to other ideas and help	19:43
tonyb	We can also (maybe?) use our existing UA-filter to create a block list for haproxy	19:44
tonyb	something like:	19:44
fungi	not at that layer	19:44
tonyb	#link https://discourse.haproxy.org/t/howto-block-badbots-crawlers-scrapers-using-list-file/995	19:44
clarkb	ya we're currently load balancing tcp not https	19:44
fungi	we'd have to do it in apache since that's where https is terminated	19:44
clarkb	but maybe if we force all traffic through the load balancer then a reasonable next step is terminating https there?	19:45
clarkb	makes debugging even more difficult as clients don't see the backend specific altname	19:45
tonyb	Ahhh I see.	19:45
clarkb	but we could do more magic with haproxy if it mitm'd the service	19:45
clarkb	I'm open to experiments though and ideas like that are worth pursuing if we can reconfigure the service setup to match	19:46
clarkb	#topic PBR Updates to Handle Setuptools Deprecations	19:47
clarkb	The last thing I wanted to call out today is that setuptools set a date of October 31 for removing some deprecated code that pbr relies on (specifically easy_install related stuff)	19:47
clarkb	#link https://review.opendev.org/c/openstack/pbr/+/964712/ and children aim to address this	19:47
clarkb	We think this stack of changes should hopefully mitigate (thank you stephenfin)	19:48
fungi	looks like they're passing again now	19:48
clarkb	the pip release broke pbr tests though so I had to fix those yesterday and now we're trying to land things again	19:48
clarkb	hopefully we can land the changes and get a release out tomorrow? but then be on the lookout for the next setuptools release and for any problems related to it	19:48
clarkb	I was brainstorming ways we might mitigate if necessary and I think we could do things like pin setuptools in our container images if not already doing so for things building container images. And also we could add pyproject.toml files to pin setuptools elsewhere	19:49
clarkb	this assumes that becomes necessary and we're hoping it won't be	19:49
clarkb	definitely say something if you notice problems with setuptools in the near future.	19:50
clarkb	#topic Open Discussion	19:50
clarkb	Anything else?	19:50
clarkb	I'm going to be out on the 10th. The 11th is a holiday but I expect to be around and have a meeting	19:51
clarkb	Sounds like that may be everything. Thank you everyone! We'll be back here next week at the same time and location.	19:53
clarkb	#endmeeting	19:53
opendevmeet	Meeting ended Tue Oct 28 19:53:31 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)	19:53
opendevmeet	Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.html	19:53
opendevmeet	Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.txt	19:53
opendevmeet	Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.log.html	19:53
tonyb	Thanks all	19:53
fungi	thanks!	19:54
