Tuesday, 2025-09-09

corvuscomm check18:56
fungireceived18:57
corvus_++18:57
fungier, 10-4?18:57
corvusfive by five18:57
fungiover18:57
corvusroger, roger18:57
clarkbhello!19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep  9 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QE3L6OGX345E2EKS6N6ASANSHWJZV2W4/ Our Agenda19:00
clarkb#topic Announcements19:01
clarkbI didn't have anything to announce19:01
clarkbguessing no one else does either given the silence19:02
clarkblets dive in19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:02
clarkbI feel bad about this one because I still haven't had time to really drive this forward for months. But we're deep into the openstack release cycle now so maybe that's for the best19:03
clarkb#link https://review.opendev.org/c/opendev/system-config/+/957555 this change to update to the latest gerrit bugfix releases could still use reviews though19:04
clarkb#topic Upgrading old servers19:05
clarkbPlenty of updates on this topic courtesy of fungi 19:05
clarkball of the afs and kerberos servers were updated to jammy then over the weekend fungi updated afs01.dfw.openstack.org to noble and while it booted the network didn't come up19:06
clarkbdebugging today showed that the host was trying to configure eth0 and eth1 but those interfaces no longer exist. They are enX0 and enX119:06
fungiyeah, i should have just enjoyed the weekend instead19:06
clarkbfungi fixed the network config and rebooted and things came up again thankfully. However, with only one vcpu19:06
fungithat upgrade was already plagued by an ongoing ubuntu/canonical infrastructure incident that delayed things by several days19:07
clarkbapplying the fix suggested in https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html then rebooting again fixed the vcpu count and now the host sees 819:07
clarkbI did an audit of the ansible fact cache and expect that all of the afs and kerberos servers are affected by the one vcpu issue except for afs02.dfw and kdc0419:07
clarkbso further in place upgrades will need to accommodate both the interface renames and the vcpu count issue19:07
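
(As an aside, a minimal Python sketch of the kind of fact cache audit described above; the cache path is an assumption, while ansible_processor_vcpus and ansible_interfaces are standard Ansible fact names.)

    import json
    import pathlib

    # Assumed location of the JSON fact cache; the real deployment may differ.
    FACT_CACHE = pathlib.Path("/var/cache/ansible/facts")

    if FACT_CACHE.is_dir():
        for fact_file in sorted(FACT_CACHE.glob("*")):
            try:
                facts = json.loads(fact_file.read_text())
            except (OSError, ValueError):
                continue
            vcpus = facts.get("ansible_processor_vcpus")
            legacy = [i for i in facts.get("ansible_interfaces", [])
                      if i.startswith("eth")]
            if vcpus == 1 or legacy:
                # Hosts flagged here likely need the grub workaround and/or an
                # /etc/network/interfaces update before an in-place upgrade.
                print(f"{fact_file.name}: vcpus={vcpus} legacy_ifaces={legacy}")
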
fungiso at this point, i'm preparing to move the rw volumes back to afs01.dfw and upgrade the remaining servers in that group19:08
clarkbfungi: not sure if there is any testing we can/should do of afs01 before you proceed. can we force an afs client to fetch ro data from a specific server?19:08
fungiyeah, i'll basically amend /etc/network/interfaces and /etc/default/grub on all of them before moving to noble19:08
clarkbthat is probably overkill if the cluster reports happiness though19:08
fungii can move a less-important volume to it first19:09
clarkbfungi: except for afs02 and kdc04 they shouldn't need the grub thing19:09
clarkbfungi: I also meant to ask you if you had to specify a special image for the rescue image19:10
corvusi don't really understand what the vcpu count issue is -- other than something about servers having or not having vcpus, and maybe it's related to the weirdness affecting some rax legacy nodes that we've seen in jobs.  is there something i should know or check on if i do a new launch-node?19:10
fungiclarkb: i didn't do anything special for the rescue, just asked rackspace (via their web-based dashboard) to reboot the machine into rescue mode and then used the supplied root password to ssh into it and mount the old device to /mnt so i could chroot into it19:11
clarkbcorvus: based on ansible facts rax classic has two sets of hypervisors. One with an older version than the other. Booting noble on the new hypervisor has no problems. Booting noble on the old hypervisor hits: https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html and those nodes only have one vcpu addressable19:11
clarkbcorvus: I have already patched launch-node to reject nodes that have fewer than 2 vcpus. So you may do a launch-node run and have it fail and have to retry19:11
clarkbcorvus: I think this is primarily a problem for doing in place upgrades since we can't request they migrate to the new hypervisors without submitting a ticket and hoping that the migration is possible/successful. I think using the workaround fungi found is reasonable instead19:12
corvusokay, thanks.  i feel caught up now.19:12
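
(The actual launch-node patch is not quoted in the log; purely as a hedged sketch, the check described above amounts to something like the following, rejecting a freshly booted server that reports too few vCPUs so the launch can simply be retried.)

    import subprocess

    MIN_VCPUS = 2  # anything less suggests the old-hypervisor boot problem


    def count_vcpus(host: str) -> int:
        # Ask the freshly booted guest how many CPUs it actually sees.
        result = subprocess.run(
            ["ssh", host, "nproc"], capture_output=True, text=True, check=True
        )
        return int(result.stdout.strip())


    def verify_new_server(host: str) -> None:
        vcpus = count_vcpus(host)
        if vcpus < MIN_VCPUS:
            raise RuntimeError(
                f"{host} only exposes {vcpus} vCPU(s); delete it and retry the launch"
            )
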
clarkbfungi: ack thanks19:13
clarkbanything else on this topic?19:13
funginothing, other than i'm going to get the volume moves back to afs01.dfw rolling today so i can upgrade the rest to noble soon and take them all back out of the emergency disable list again19:13
clarkbsounds good19:14
clarkb#topic Matrix for OpenDev comms19:14
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing19:14
clarkbI'd like to raise the "last call on this spec" flag at this point19:14
clarkbfeedback is positive and we have even heard from people outside of the team. fungi tonyb maybe you can try to weigh in this week otherwise I'll plan to merge it early next week?19:14
* JayF notes that even he has element installed and in use now19:15
JayF🏳️19:16
clarkb#topic Pre PTG Planning19:16
fungii use matrix every day but from a (somewhat incomplete) plugin in my irc client19:16
clarkb#link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document19:16
clarkbI still want to encourage folks with ideas and interest to add those ideas to the planning document19:16
clarkbso far only my thoughts have ended up there, and I'm sure there are more things to cover19:17
clarkband a reminder that the pre ptg will replace our meeting on october 7. See you there19:17
clarkb(it will actually start at 1800 UTC)19:17
clarkb#topic Loss of upstream Debian bullseye-backports mirror19:18
clarkb#link https://review.opendev.org/c/zuul/zuul-jobs/+/95860519:18
clarkbI just approved this change to not enable backports by default going forward. Today was the announced change date for zuul-jobs19:18
clarkbif anyone screams about this they can add the configure_mirrors_extra_repos: true flag to their jobs. But keep in mind that our next step is to delete bullseye backports from our mirrors19:19
clarkbI suspect that will happen once fungi is done with the server upgrades19:19
clarkbthen we can clean up the workaround to ignore undefined targets19:20
fungiyeah, it's next on my list19:20
fungiafter afs/kerberos servers19:20
clarkb#topic Etherpad 2.5.0 Upgrade19:21
clarkb#link https://github.com/ether/etherpad-lite/blob/v2.5.0/CHANGELOG.md19:21
clarkb104.130.127.119 is a held node for testing. You need to edit /etc/hosts to point etherpad.opendev.org at that IP.19:21
clarkbI have a clarkb-test pad already on that held node19:21
clarkbIf anyone else wants to look at the root page rendering and decide if it is too ugly and we need to fix it before the upgrade that would be great19:22
clarkbit is better than when they first broke it but not as shiny as what we have currently deployed19:22
clarkb#link https://review.opendev.org/c/opendev/system-config/+/956593/19:23
clarkbmaybe leave your thoughts about the state of etherpad-lite's no skin skin there19:23
clarkb#topic Moving OpenDev's python-base/python-builder/uwsig-base Images to Quay19:24
clarkb#link https://review.opendev.org/c/opendev/system-config/+/95727719:24
clarkbwe recently updated how zuul-jobs' container image building is done with docker to make this all actually work with docker speculative image builds (not docker runtime though)19:25
clarkbthere was one complaint that we broke things and then we managed to fix that particular complaint. Since then there have been no complaints19:25
clarkbI think that means we're now in a good spot to consider actually landing this change and then updating all the images that are built off of these base images19:25
clarkbI have removed my WIP vote to reflect that19:26
clarkbinfra-root can you weigh in on that with any other concerns you may have or potential testing gaps? Given the struggles we've had with this move in the past I don't want to rush with this change, but I also think I've got it in a good spot finally19:26
clarkb#topic Adding Debian Trixie Base Python Container Images19:27
corvuson prev topic19:27
clarkb#undo19:27
opendevmeetRemoving item from minutes: #topic Adding Debian Trixie Base Python Container Images19:27
corvuswe should probably send an annoucement to service-discuss for downstream users.19:27
corvusobviously zuul is represented here and we can switch19:28
corvusbut we'll want to let others know...19:28
clarkbcorvus: ack I can probably do that today19:28
corvusdo we want to pull those images from dockerhub at some point?  or just leave them there.19:28
clarkbI think we need to leave them for some time while we get our consumers ported over19:28
corvusclarkb: i was thinking after we make the switch; i don't think we need to announce before19:28
clarkbcorvus: ack19:29
clarkbgiven that we've reverted in the past too I don't want to clean things up immediately. But maybe in say 3 months or something we should do cleanup?19:29
clarkbthat should be long enough for us to know we're unlikely to go back to docker hub but not so long that people are stuck on the old stuff forever19:29
corvusyeah.  also, i think it's fine to leave them there.  just wanted to talk about the plan.19:29
corvusif we do remove, i think 3 months sounds good19:30
clarkbI think the main value in cleaning things up there is that it will serve as a signal to people that those images are no longer usable19:30
clarkbin a more direct manner than an email announcement19:30
clarkbI'll put that as a rough timeline in the announcement after the change lands19:31
clarkb#topic Adding Debian Trixie Base Python Container Images19:31
clarkbonce we're publishing to quay I'd also like to add trixie based images19:31
clarkb#link https://review.opendev.org/c/opendev/system-config/+/95848019:31
clarkbI'm already noticing some trixie packages get updates that are not going into bookworm or bullseye so having the option to update to that platform seems like a good idea19:32
clarkbthis should be safe to land once the move to quay is complete so this is mostly a heads up and request for reviews19:32
clarkbmake sure I didn't cross any wires when adding the new stuff and publish trixie content to bookworm or vice versa19:33
clarkb#topic Dropping Ubuntu Bionic Test Nodes19:33
clarkb(If anyone thinks I'm rushing feel free to jump in and ask me to slow down)19:34
clarkbAt this point I think opendev's direct reliance on bionic is gone19:34
clarkbbut I wouldn't be surprised to learn I've missed some cleanup somewhere. Feel free to point any out to me or push fixes up yourselves19:34
clarkbthe ansible 11 default change has caused people to notice that bionic isn't working and we're seeing slow cleanups elsewhere too19:35
clarkbcorvus: I suspect that zuul will drop ansible 9 soon enough that opendev probably doesn't need to get ahead of that. We should mostly just ensure that we're not relying on it as much as possible then when zuul updates we can drop the images entirely in opendev19:35
clarkbcorvus: any concerns with that approach?19:35
corvus...19:36
clarkbthen the other thing is python2.7 has been running on bionic in a lot of places. It should still be possible to run python2.7 jobs on ubuntu jammy, but you need to pin tox to <4 as tox>=4 is not compatible with python2.7 virtualenvs19:37
corvusyes that all sounds right19:37
clarkbgreat, then we can delete bionic from our mirrors19:37
fungii did already clean up bionic arm64 packages from the mirrors to free up space for volume moves19:38
clarkbya we dropped arm64 bionic a while ago. So this is just x86_64 cleanup19:39
clarkbbut that should still have a big impact (like 250GB or something along those lines)19:39
fungiremoving bionic amd64 packages should free a similar amount of space, yes19:39
clarkb#topic Lists Server Slowness19:39
fungii'm happy to repeat those steps, we have them documented but it's a bit convoluted to clean up the indices and make reprepro forget things from its database19:39
clarkbfungi: thanks!19:40
clarkbmore and more people are noticing that lists.o.o suffers from frequent slowness19:40
clarkblast week we updated UA filter lists based on what was seen there and also restarted services to get it out of swap19:40
fungiyeah, we could i guess resize to a larger flavor without needing to move to a new ip address19:40
clarkbunfortunately this hasn't fixed the problem19:41
clarkbfungi: top reports high iowait while things are slow19:41
fungiyeah, which would be consistent with paging to swap as well19:41
clarkbmaybe before resizing the instance we should try and determine where the iowait originates from as a resize may not help?19:41
clarkbfungi: yup though it happened after you restarted things and swap was almost empty19:41
fungiright19:42
clarkbI also suspect that mailman3 is not designed to cope with a barrage of crawler bots19:42
fungibut also it almost immediately moved a bunch of stuff to swap as cache/buffers use swelled19:42
clarkbI noticed in the logs that there are a lot of query url requests which I assume means mailman3 is linking to those queries so bots find them while crawling19:43
clarkband then I suspect that something about those queries makes them less cacheable in django so it's like a snowball of all the problems19:43
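
(A hypothetical example of that kind of log digging: a short Python pass over the Apache access log counting which user agents request query URLs; the log path and combined log format are assumptions.)

    import collections
    import re

    # Assumed combined-format access log for the lists vhost.
    LOG = "/var/log/apache2/lists-access.log"
    PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    counts = collections.Counter()
    with open(LOG, errors="replace") as handle:
        for line in handle:
            match = PATTERN.search(line)
            if match and "?" in match.group(1):  # only requests with query strings
                counts[match.group(2)] += 1

    for agent, hits in counts.most_common(10):
        print(f"{hits:8d}  {agent}")
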
corvustake a look at the cacti graphs19:43
fungiyes, the model of mailing list archives being held in a database accessed through a django frontend does seem a bit resource-hungry compared to the pipermail flat files served for mm2 archives19:43
corvusi don't see constant swap activity19:43
fungiit could just be that the database is slammed and it's not swap really, agreed19:44
corvusat first glance it looks like memory usage is "okay" -- in that there's sufficient free ram and it's really just swapping unused stuff out.19:44
clarkbmaybe the database is the problem, ya19:44
clarkbit's possible we need to tune mariadb to better handle this workload. That seems like a promising thread to pull on before resorting to resizing the node19:45
fungimm3 has some caching in there, but a cache probably works against your interests when crawlers are trying to access every single bit of the content19:45
clarkbyup, we saw similar with gitea19:45
corvusincreasing the ram may allow more caches, so that's still something to consider.  just noting it doesn't look like "out of memory", and looks more like the other stuff.19:45
clarkb(and had to replace its built in cache system with memcached)19:45
clarkbthe django install is set up to use diskcache which is a sqlite based caching system for python19:46
clarkbnot sure if it also uses in memory caches. But could also be that the sqlite system is io bound or mariadb or both19:46
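
(For context, diskcache stores cache entries in a SQLite database on disk, so heavy cache churn from crawlers shows up as filesystem I/O rather than memory pressure; a minimal illustration using an example path, not the real django cache directory:)

    import diskcache

    # Illustrative path only; the real django cache directory will differ.
    cache = diskcache.Cache("/tmp/example-diskcache")


    def rendered_page(key: str) -> str:
        page = cache.get(key)  # each lookup hits the on-disk SQLite index
        if page is None:
            page = f"rendered archive page for {key}"  # stand-in for a django view
            cache.set(key, page, expire=300)           # every set is a disk write
        return page
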
corvusi bet that is an opportunity for improvement19:46
fungiis iotop a thing on linux?19:47
corvusit is19:47
fungii know i've used it on *bsd to figure out where most of the i/o activity is concentrated19:47
clarkb++ using iotop to determine where io is slow then debugging from there sounds like a great next step19:48
clarkbI think some of the ebpf tools can be used in a similar way if we have problems with iotop19:48
fungiyeah, whatever tools are good for it anyway, we need something that can work fairly granularly since device-level utilization won't tell us much in this case19:49
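
(If iotop proves awkward, a psutil-based sample along these lines can rank processes by bytes read and written over an interval on Linux; this is only a sketch, and the interval and output format are arbitrary choices.)

    import time

    import psutil


    def top_io_processes(interval: float = 5.0, limit: int = 10) -> None:
        # Snapshot per-process I/O counters, wait, then rank by bytes moved.
        start = {}
        for proc in psutil.process_iter(["pid", "name"]):
            try:
                start[proc.pid] = (proc.info["name"], proc.io_counters())
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        time.sleep(interval)
        deltas = []
        for pid, (name, before) in start.items():
            try:
                after = psutil.Process(pid).io_counters()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
            moved = (after.read_bytes - before.read_bytes
                     + after.write_bytes - before.write_bytes)
            deltas.append((moved, pid, name))
        for moved, pid, name in sorted(deltas, reverse=True)[:limit]:
            print(f"{moved:12d} bytes  {name} (pid {pid})")


    if __name__ == "__main__":
        top_io_processes()
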
clarkbright. I should be able to look into that today or tomorrow19:49
clarkbI've already been staring at too many apache logs to try and understand what is happening on the front end better19:49
fungialso we could implement fairly naive request rate limiters with an apache mod if we need something more immediate19:50
clarkbemail seems to still be processed reasonably quickly so I haven't been treating this as an emergency19:50
clarkbbut as more people notice I just want to make sure it is on our radar and that we have a plan. Which it sounds like we now do19:50
fungiright, it's just the webui which has been struggling19:50
fungie-mails may be getting delayed by seconds as well, but that's less obvious than when a web request is delayed by seconds19:51
clarkbyup19:51
clarkbI'll see what I can find about where the iowait is happening and we can take it from there19:51
clarkb#topic Open Discussion19:51
clarkbBefore our hour is up was there anything else?19:52
fungiopeneuler package mirror?19:52
fungii have a couple of changes up to rip it out, but it's a judgement call whether we keep it hoping someone will turn up to re-add images19:52
clarkboh yes, so iirc where that ended up was those interested in openeuler swapped the content of the mirror from release N-1 to N. But then ran into problems bootstrapping release N in dib and therefore nodepool19:52
clarkband we haven't heard or seen anything since19:53
clarkbfungi: maybe we should go ahead and delete the content from the mirror for now but leave the openafs volume in place. That way it is easy to rebuild if someone shows up19:53
fungiit's been about a year since we paused the broken image builds19:53
clarkbbut even then if someone shows up I think we should ask them to use the public mirror infrastructure like rocky does to start19:53
fungisure, freeing the data utilization is most of the win anyway19:54
corvusit should be much easier for someone to work on that now.19:54
clarkbI'm +2 on cleaning up the content in openafs19:54
fungi#link https://review.opendev.org/959892 Stop mirroring OpenEuler packages19:55
fungi#link https://review.opendev.org/959893 Remove mirror.openeuler utilization graph19:55
clarkbI'll review those changes after the meeting19:55
fungithe latter we could leave in for now i guess if we aren't planning to delete the volume itself19:55
fungibut the first change is obviously necessary if i'm going to delete the data19:56
clarkbfungi: I just left a quick question on the first change19:57
clarkbbasically you can make it a two step process if you want to reduce the amount of manual work19:57
clarkbI'll leave that up to you if you're volunteering to do the manual work though19:57
fungiyeah, automating the deletion makes some sense if we're not deleting the volume, since there's no manual steps required19:58
fungiif we were going to delete the volume, there's manual steps regardless19:58
fungialso we have a fair number of empty and abandoned afs volumes that could probably stand to be removed19:58
clarkbright. I guess the main reason I'm thinking keep the volume is that it allows someone to add the mirror easily without infra-root intervention beyond code review19:58
clarkband unlike rocky/alma I worry that their mirror infrastructure is very China-centric, so we may actually need to mirror it ourselves if we're running test nodes on openeuler19:59
funginormally i wouldn't notice, but i've become acutely aware while moving them from server to server19:59
clarkbbut we can always recreate that volume and others if we end up in that situation19:59
clarkband we're just about at time.19:59
clarkbThank you everyone. See you back here at the same time and location next week. Until then thanks again for working on OpenDev19:59
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Sep  9 20:00:02 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.log.html20:00
fungithanks clarkb!20:00
