Tuesday, 2025-11-04

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov  4 19:00:56 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/O2QLEHP5SBVJLBCK5WVAGWFSXJEMDI52/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbIf the time of this meeting surprises you then note that we hold the meeting at 1900 UTC which doesn't observe daylight saving time19:01
tonybone of its best features!19:01
clarkbI will be popping out a little early on Friday and am out on Monday19:02
clarkbwe should expect a meeting in a week but the agenda may go out late19:02
clarkbAnything else to announce?19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:04
clarkbI'm perpetually not keeping up with Gerrit :/19:04
clarkbthere is a new set of releases today from upstream for 3.10.9 and 3.11.719:04
clarkb#link https://review.opendev.org/c/opendev/system-config/+/966084 Update Gerrit images for 3.10.9 and 3.11.719:04
clarkbI also had to delete one of my holds for Gerrit testing due to a launcher issue but that isn't a big deal as I need to refresh for ^ anyway19:05
clarkbthe tl;dr: I think I'm back to needing to catch the gerrit deployment up to a happy spot, then I'll refresh node holds and hopefully start testing things19:05
clarkbthe linked change above has a parent change that addresses a general issue with gerrit restarts as well and we should probably plan to get both changes in together then restart gerrit to ensure everything is happy about it19:05
clarkbany questions or concerns? (we'll talk about the unexpected shutdowns next in its own topic)19:06
clarkb#topic Gerrit Spontaneous Shutdowns19:07
clarkbThe other big gerrit topic is the issue of unexpected server shutdowns19:07
clarkbWe had one occur during the summit and we just had one early today UTC time19:07
clarkbthank you tonyb for dealing with the shutdown that occurred today19:07
tonybAll good.19:08
clarkbThe first thing I want to note is that the massive h2 cache files and their lockfiles in /home/gerrit2/review_site/cache can be deleted before starting gerrit again in order to speed up gerrit start19:08
fungiyes, huge thanks tonyb, if that had sat waiting for me to wake up it would have severely impacted the publication timeline for an openstack security advisory19:09
tonybI did delete the files > 1G but not the lockfiles19:09
clarkbthe issue there is that gerrit processes those massive files on startup and prunes things into shape, which is done serially database by database and prevents many other actions from happening in the system. If we delete the files gerrit simply creates new caches and populates them as it goes19:09
clarkbtonyb: oh interesting, but gerrit startup was still slow?19:09
tonybYes19:09
clarkbok so maybe there is an additional issue there19:09
clarkbfwiw we don't want to delete all the h2 cache files as some manage things like user sessions19:10
clarkbbut caches like gerrit_file_diff.h2.db grow very large and can be deleted (but I've only ever deleted them with their lock files)19:10
clarkbI wonder if it waited for some lock timeout or something like that19:10
tonybYeah.  that was my assumption and based on my passive following of the h2 issue I was confident about deleting the 2 large files19:10
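
A minimal sketch of the cleanup described above, run with Gerrit stopped; the cache path and the gerrit_file_diff cache name come from the log, while the lock file naming and which other caches are safe to delete are assumptions:

    # Run only while Gerrit is stopped; path is the one mentioned above.
    cd /home/gerrit2/review_site/cache
    ls -lSh *.db                  # inspect sizes before deleting anything
    # Delete the oversized diff cache together with its lock file; leave
    # small caches (e.g. web sessions) alone. Lock file naming is an
    # assumption here and may vary by H2 version.
    rm -f gerrit_file_diff.h2.db gerrit_file_diff*.lock.db
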
clarkbI did discuss that issue with upstream at the summit and mfick thought it was fixable and had attempted a fix that got reverted due to breaking plugins. But he still felt that it could be done without breaking plugins so hopefully soon it gets addressed19:11
tonybI'm not familiar enough with what the gerrit logs look like to really know what's "normal"19:11
clarkbya I also noticed in syslog there was something complaining about not being able to connect to gerrit on 29418 for some time, so I think it was basically doing whatever startup routines it needs to in order to feel ready and that took some time19:12
clarkbpreviously we had believed that to be largely stuck in db processing but maybe there is something else19:12
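
For reference, an illustrative readiness probe along the lines of what syslog was effectively reporting failures for; the host and poll interval are placeholders:

    # Wait until Gerrit's SSH port accepts connections, then report ready.
    until nc -z localhost 29418; do
        sleep 10
    done
    echo "gerrit sshd is accepting connections"
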
clarkbreally the ideal situation here is to have the service be more resilient and sane about this stuff, which is happening (slowly)19:13
tonybIt's hard to debug given it's the production server ;P19:13
clarkbthen on the cloud side I sent email to vexxhost support calling out the issue and asking them for advice on mitigating it19:13
clarkbso hopefully we can make that better too19:13
clarkbI cc'd infra rooters on that19:13
clarkbI'm happy to try and schedule the 3.10.9 update restart for a time where as many people as are interested can follow along with the process so that we are more aware of what a "good" restart looks like19:14
clarkb(I still expect the shutdown to timeout unfortunately)19:14
clarkbto summarize gerrit shutdown happened again. Startup in that situation is still slower than we'd like. We may need to debug further or it may simply be a matter of deleting h2 db lock files when deleting h2 cache dbs. And we've engaged with the cloud host to debug the underlying issue19:15
clarkbanything else to note before we move on?19:15
tonybclarkb: I'd be interested in learning more if we can make that work19:15
clarkbtonyb: yup let's coordinate as the followup changes get into a mergeable state19:16
clarkb#topic Upgrading old servers19:17
clarkbtonyb: I did review both the wiki change and the ansible update stack19:17
clarkbtonyb: on the wiki change I think it may be worthwhile to publish the image to quay.io and deploy the new server on noble so that we're not doing the docker hub -> quay.io and jammy -> noble migrations later unnecessarily19:17
clarkbbasically let's skip ahead to the state we want to be in long term if we can19:17
clarkb(I noted this on the change)19:18
tonybThank you!  I got distracted with other things and I'll update the series soon19:18
tonybI agree.  I'll figure that out based on gerrit/hound containers19:18
fungiand thanks again for working on it19:18
tonybnp, sorry for the really long pause19:18
clarkbthen on the ansible side of things I think we're in an odd position where ansible 11 needs python3.11 or newer and jammy by default is 3.10. We can install 3.11 on jammy but I'm not sure how well that will work, so I suggested we test that with your change stack and determine if ansible 11 implies a noble bridge or if we can make it work on jammy so that we're not mixing19:19
clarkbtogether bridge updates and ansible updates19:19
clarkband I think your change stack there is a good framework for testing that sort of thing19:19
clarkband based on what we learn from that we can make some planning decisions for bridge19:19
clarkbanything else related to server upgrades?19:20
tonybYeah I'm working on that. I think I'll move the 'ansible-next' to the end of the stack.  I've done some testing with 3.11 and it seems fine.  I'm working on a 'clunky' way to update the ansible-venv when needed19:20
clarkbgreat thanks!19:21
tonybWell really move aside and recreate it with 3.1119:21
clarkbright virtualenv updates are often weird and starting over is usually simplest19:21
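
A hedged sketch of the move-aside-and-recreate approach tonyb describes; the venv path and version pin are illustrative, not the real bridge layout:

    # Install 3.11 alongside Jammy's default 3.10, then rebuild the venv
    # rather than trying to upgrade it in place.
    sudo apt-get install -y python3.11 python3.11-venv
    sudo mv /opt/ansible-venv /opt/ansible-venv.bak    # keep as a fallback
    sudo python3.11 -m venv /opt/ansible-venv
    sudo /opt/ansible-venv/bin/pip install 'ansible>=11,<12'
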
tonybThat's all from me on $topic19:22
clarkb#topic AFS mirror content updates19:22
clarkbthe assertion last week that trixie nodes were getting upstream deb mirrors set as pypi proxies had me confused for a bit so I dug into our mirroring stuff19:22
clarkbI think I understand it. basically those jobs must be overriding our generic base mirror fqdn variable which assumes all things are mirrored at the same location but they are setting the value to the upstream deb value19:23
clarkbI believe you can separately set the pypi mirror location (and in this case you'd set it to upstream too)19:23
clarkbso one solution here is to basically micromanage each of those mirror locations. Or we can just mirror trixie and the existing tooling should just work19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/965334 Mirror trixie packages19:23
clarkbI've pushed ^ to take the just mirror trixie approach19:23
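
As a quick sanity check once 965334 lands, something like the following can confirm trixie content is actually being served before jobs rely on it (the mirror hostname and path are illustrative):

    # HEAD the Release file for trixie on a region mirror.
    curl -sfI https://mirror.dfw.rax.opendev.org/debian/dists/trixie/Release \
        && echo "trixie is mirrored" \
        || echo "not mirrored yet; jobs would need upstream deb.debian.org"
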
clarkbI also wondered why rocky linux (and now alma linux) weren't affected by this issue and the reason appears to be that we have specific mirror configuration support for each distro19:24
clarkbthis means that debian having support means each debian release wants the same setup. But rockylinux having never been configured is fine19:24
clarkbso long story short I think we have two options for distros like debian which are currently in a mixed state. Option one is to mirror all the things for that distro so it isn't in a mixed state, and option two is to configure each mirror url separately to be correct and point at upstream when not mirrored19:25
clarkbthen separately there is a spec in zuul-jobs to improve how mirrors are configured (by being more explicit and less implicit)19:25
clarkbthat hasn't been implemented yet, but if there is interest in pushing that over the finish line we can in theory take advantage to do this better19:26
clarkbtonyb: I think you were working on that at one point?19:26
tonybI was.  I didn't get very far.19:26
clarkb(and to be clear this is a long standing item in zuul-jobs)19:26
clarkback mnasiadka I know you were talking about looking at things to help. This might be a good option as it shouldn't require any special privileges and needs someone who understands different distro behaviors to accommodate those in the end result19:27
clarkblet me see if I can find the docs link again19:27
clarkbmnasiadka: (or anyone else interested) https://zuul-ci.org/docs/zuul-jobs/latest/mirror.html19:28
clarkband I don't mean to scare tonyb off just thinking this is a good option for someone who doesn't have root access since it should all be driven through zuul19:28
mnasiadkaSure, can have a look :)19:28
tonybOh for sure.  I was going to suggest the same thing :)19:29
clarkbanything else related to afs content management?19:29
clarkb#topic Zuul Launcher Updates19:30
clarkbThere is a new launcher bug that can create mixed provider nodesets19:30
clarkb#link https://review.opendev.org/c/zuul/zuul/+/965954 Fix assignment of unassigned nodes.19:30
clarkbthis is the fix19:30
clarkbunfortunately I think there are a couple other zuul bugs that need to be fixed before that change can land, but it is in the queue/todo list19:31
clarkbThe situation where this happens seems to be infrequent as it relies on the launcher reassigning unused nodes from requests that are no longer needed to new requests19:31
clarkbso you have to get things aligned just right to hit it19:31
clarkbThen separately I discovered yesterday that the quota in raxflex iad3 is smaller than I had thought. I had thought it was for 10 instances but we can only boot 5 due to cpu quotas (and 6 if memory quotas are the limit)19:32
clarkbthis is why I dropped my held gerrit nodes to free up two held nodes in rax flex iad3 so that some openstack helm requests for 5 nodes could be handled19:32
clarkbcardoe brought up the OSH issue and I asked cardoe to follow up with cloudnull about bumping quotas. Otherwise we may need to consider dropping that region for now19:33
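
The arithmetic clarkb describes can be checked with the CLI; the cloud/region names and the 8-vCPU flavor below are assumptions used for illustration:

    # Show the absolute limits and see which one binds first.
    openstack --os-cloud raxflex --os-region-name IAD3 \
        limits show --absolute -f value -c Name -c Value | grep maxTotal
    # e.g. maxTotalInstances 10 but maxTotalCores 40: with an 8 vCPU
    # flavor that is 40 / 8 = 5 bootable instances despite the
    # instance quota.
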
tonybAhhhh that's what was causing the helm issue.19:33
clarkbI just checked and the quotas haven't been bumped yet19:33
corvusin the mean time, might be good to avoid holding nodes in raxflex-iad319:34
clarkb++19:34
corvusand if that's not tenable, yeah, maybe we should turn it down.19:34
funginot that we can avoid holding nodes in a specific provider, but we can certainly delete and reset the hold if it lands in one19:35
clarkbI'm willing to wait another day or two to see if quotas bump but if that doesn't happen then I'm good with turning it off while we wait19:35
corvusyep19:35
tonybsounds reasonable19:36
fungisounds fine to me. 5 nodes worth of quota is a drop in the bucket, but was good for making sure the provider is nominally operable for us19:36
clarkb#topic Matrix for OpenDev comms19:36
clarkbThis item has not made it into the list of things I'm currently juggling :/19:36
clarkbsomeone (sorry I don't remember who) pointed out that mjolnir has a successor implementation19:37
clarkbso we may want to jump straight to that19:37
clarkbbut otherwise I haven't seen any movement on this one19:37
tonybthat was mnasiadka (I think)19:37
mnasiadkaYeah, stumbled across that on Ansible community Matrix rooms and got interested19:38
clarkback thank you for calling it out. That is good to know so that we don't end up implementing something twice19:38
fungiwhat are the benefits of the successor? that it's actively maintained and the original isn't, i guess?19:38
mnasiadkaclarkb: if there’s anything I can do to help re Matrix - happy to do that (but next week Mon/Tue I’m out)19:39
clarkbhttps://github.com/the-draupnir-project/Draupnir looks like it has simpler management ux19:39
fungiit's not nearly that time-sensitive19:39
clarkbmnasiadka: thanks I'll let you know. I think the first step is for me or another admin to create the room19:39
clarkband once that is done we can start experimenting and others can help out with tooling etc19:39
clarkb#topic Etherpad 2.5.2 Upgrade19:40
clarkbsorry I'm going to keep things moving along to make sure we cover all the agenda items19:40
corvusi'm happy to do statusbot for matrix19:40
clarkblast we spoke we were worried about etherpad 2.5.1 css issues19:40
clarkbcorvus: thanks19:40
clarkbsince then I filed a bug with etherpad and they fixed it quickly and now there is a 2.5.2 which seems to work19:40
clarkb#link https://github.com/ether/etherpad-lite/blob/v2.5.2/CHANGELOG.md19:40
clarkb#link https://review.opendev.org/c/opendev/system-config/+/95659319:41
clarkbAt this point myself, tonyb and mnasiadka have tested a held etherpad 2.5.2 node19:41
clarkbI think we're basically ready to upgrade etherpad19:41
clarkbI didn't want to do it last week with the PTG happening but that is over now19:41
clarkbI'm game for trying to do that today after a lunch and a possible bike ride, otherwise I will probably plan to do it first thing tomorrow19:41
fungioh, and is meetpad still in the disable list?19:41
clarkbfungi: oh I think it is. We should pull it out.19:42
clarkbProbably pull out review03 at the same time?19:42
fungion it19:42
tonyband review03 if we're sure that isn't a problem19:42
fungiand taking review03 out too, yes19:42
fungidone19:42
clarkbtonyb: I'm like 99% certain it's ok. The first spontaneous shutdown created that file as a directory and nothing exploded19:42
clarkbtonyb: shouldn't be any worse to have it as an empty file19:42
tonyb\o/19:43
clarkbso ya if you'd like to test etherpad do so soon. Otherwise expect it to be upgraded by sometime tomorrow19:43
clarkb#topic Gitea 1.25.0 Upgrade19:43
clarkbAfter updating Gitea to 1.24.7 last week they released 1.25.019:43
clarkb#link https://review.opendev.org/c/opendev/system-config/+/965960 Upgrade Gitea to 1.25.019:43
clarkbThat change includes a link to the upstream changelog which is quite long, but the breaking changes list is small and doesn't affect us19:43
clarkbI suspect this is a relatively straightforward upgrade for us, but there is a held node you can interact with and double checking the changelog is always appreciated19:44
clarkbusually by the time we do that they release a .1 or .2 as well and that is what we actually upgrade to19:44
clarkbmnasiadka did some poking around and didn't find any obvious issues19:44
clarkbI do update the versions of golang and nodejs as well, and switch to pnpm to match upstream19:45
clarkbso it's still more than a bugfix upgrade19:45
clarkb#topic Gitea Performance19:46
clarkbthen in parallel we're still seeing occasional gitea performance issues19:46
clarkb#link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access19:46
clarkbThe idea behind this one is that we'll remove any direct backend crawling, which should force access through the lb allowing it to do its job more accurately19:47
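
One hedged way enforcement along the lines of 964728 could look at the firewall level; the load balancer address and backend port here are placeholders, not what the actual change does:

    # Accept gitea traffic only from the load balancer, drop everything else.
    iptables -A INPUT -p tcp --dport 3081 -s 203.0.113.10 -j ACCEPT
    iptables -A INPUT -p tcp --dport 3081 -j DROP
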
clarkb#link https://review.opendev.org/c/opendev/system-config/+/965420 Increase memcached cache size to mitigate effect of crawlers poisoning the cache19:47
clarkbthe idea behind this one is that crawling effectively poisons the cache. Increasing the size of the cache may mitigate the effects of the poisoning on the cache19:47
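
To judge whether the larger cache in 965420 helps, eviction counters are the thing to watch; an illustrative check against memcached's default port:

    # High evictions relative to get_hits means the cache is being churned.
    printf 'stats\r\nquit\r\n' | nc -q1 localhost 11211 \
        | egrep 'evictions|get_hits|get_misses|limit_maxbytes'
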
clarkbI'm willing to try one or both or neither of these. Consider both changes a request for comment/feedback. Happy to try other approaches too19:48
clarkbbut I'm hopeful that load balancing all crawlers and having larger caches will mean no single backend ends up with a fully poisoned cache, improving performance generally19:48
fungi964728 seems to have plenty of consensus19:49
* tonyb is in favor of both changes.19:49
clarkboh cool I missed the consensus on the load balancing change. I can plan to land that after the etherpad upgrade then19:49
clarkband then once we've made some changes we can reevaluate if we need to do more or revert etc19:50
fungii fat-fingered my +2 on 965420 and accidentally approved it for a brief moment, so it may need rechecking in a bit, sorry19:50
clarkbseems you did that quickly enough that zuul never enqueued it to the gate19:50
clarkbfast typing there19:50
clarkbI can recheck it19:51
clarkb#topic Raxflex DFW3 Disabled19:51
clarkbthe last item on the agenda is to cover what happened in raxflex dfw3 yesterday19:51
clarkbI discovered that the mirror was throwing errors at jobs and investigating the server directly showed some afs cache problems19:51
clarkban fs flushall seemed to get stuck and system load was steadily climbing19:52
clarkbthis resulted in me attempting to reboot the server. Doing so via the host itself got stuck waiting for afs to unmount19:52
clarkbafter waiting ~10 minutes I asked nova to stop the server which it did. Asking nova to start the server again does not start the server again19:52
clarkbthe server task state goes to powering-on according to server show but it never starts. unfortunately, it never reports an error either19:53
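
For the record, roughly the sequence described above; the server name is illustrative:

    fs flushall    # flush the local AFS cache; this is the step that hung
    # Graceful reboot stalled waiting on the AFS unmount, so fall back to nova:
    openstack server stop mirror01.dfw3.raxflex.opendev.org
    openstack server start mirror01.dfw3.raxflex.opendev.org
    # Observed failure mode: task_state sits at "powering-on" with no error.
    openstack server show mirror01.dfw3.raxflex.opendev.org \
        -c status -c OS-EXT-STS:task_state -c OS-EXT-STS:power_state
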
clarkbWe pinged dan_with about it in #opendev but haven't heard anything since19:53
tonyband nothing on the console?19:53
clarkbtonyb: you can't get the console bceause the server isn't running19:53
clarkbdfw3 has since been disabled in zuul19:53
tonybOh okay.  that level of 'never starts'19:54
clarkbI think our options are to either wait for rackspace to help us fix it (maybe we need to file an issue for that?) or we can make a new mirror and just start over19:54
clarkbConsidering the historical cinder volume issues there I think there is value in getting the cloud to investigate if we can, but maybe we don't need to wait for that while we return the region to service with a new mirror19:54
fungiwill likely need a new cinder volume too19:55
clarkbya19:55
corvushow about booting a new server+volume and then open a ticket for the old one.  if nothing happens in a week, delete it?19:55
tonyb^^ That's what I was going to suggest19:55
clarkbcorvus: I'd be happy to use that approach. I'm not sure I personally have time to drive that given my current todo list19:55
corvusi'm assuming we have enough quota in the non-zuul account there to run 2 mirrors19:55
clarkbyes I think we do19:56
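
Roughly the shape of the replacement approach corvus suggests; the flavor, image, and volume size are placeholders rather than our real launch parameters:

    # Boot a fresh mirror plus a new cinder volume for its cache,
    # leaving the wedged server in place for the support ticket.
    openstack volume create --size 200 mirror02-cache
    openstack server create --flavor gp.medium --image ubuntu-noble \
        --key-name infra-root mirror02.dfw3.raxflex.opendev.org --wait
    openstack server add volume mirror02.dfw3.raxflex.opendev.org mirror02-cache
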
clarkbI'm happy for someone else to drive that and will help as I can. This week is just really busy for me and I'm out Monday so don't want to overcommit19:57
clarkb#topic Open Discussion19:57
clarkbwe have a few minutes to cover anything else if there is anything else19:57
clarkbapologies if it felt like I was speed running through all of that. I wanted to make sure the listed items got covered. Always feel free to followup outside the meeting on IRC or on the mailing list19:59
clarkbAnd thank you everyone for attending and helping to keep opendev running!19:59
corvusi am happy with your chairing :)19:59
fungiexcellent timekeeping!19:59
tonybhear hear!19:59
clarkbAs I mentioned we should be back here next week at the same time and location, but the agenda email may be delayed19:59
tonybchairing and cheering20:00
clarkband now we are at time. Thanks again!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  4 20:00:13 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.log.html20:00
*** mnaser[m] is now known as mnaser20:46
