19:00:11 <clarkb> #startmeeting infra
19:00:11 <opendevmeet> Meeting started Tue Sep  9 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 <opendevmeet> The meeting name has been set to 'infra'
19:00:23 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QE3L6OGX345E2EKS6N6ASANSHWJZV2W4/ Our Agenda
19:01:25 <clarkb> #topic Announcements
19:01:29 <clarkb> I didn't have anything to announce
19:02:36 <clarkb> guessing no one else does either given the silence
19:02:41 <clarkb> lets dive in
19:02:45 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:03:12 <clarkb> I feel bad about this one because I haven't had time to really drive it forward for months. But we're deep into the openstack release cycle now so it may be for the best
19:04:05 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/957555 this change to update to the latest gerrit bugfix releases could still use reviews though
19:05:31 <clarkb> #topic Upgrading old servers
19:05:38 <clarkb> Plenty of updates on this topic courtesy of fungi
19:06:08 <clarkb> all of the afs and kerberos servers were updated to jammy, then over the weekend fungi updated afs01.dfw.openstack.org to noble and while it booted the network didn't come up
19:06:33 <clarkb> debugging today showed that the host was trying to configure eth0 and eth1, but those interfaces no longer exist; they are now enX0 and enX1
19:06:34 <fungi> yeah, i should have just enjoyed the weekend instead
19:06:53 <clarkb> fungi fixed the network config and rebooted and things came up again thankfully. However, the host came up with only one vcpu
19:07:05 <fungi> that upgrade was already plagued by an ongoing ubuntu/canonical infrastructure incident that delayed things by several days
19:07:11 <clarkb> applying the fix suggested in https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html then rebooting again fixed the vcpu count and now the host sees 8
19:07:41 <clarkb> I did an audit of the ansible fact cache and expect that all of the afs and kerberos servers are affected by the one vcpu issue except for afs02.dfw and kdc04
19:07:55 <clarkb> so further in place upgrades will need to accommodate both the interface renames and the vcpu count issue
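(A rough sketch of the fact cache audit mentioned above; the cache location on bridge and the exact fact formatting are assumptions.)

    # list cached hosts whose facts report a single vcpu (path is an assumption)
    grep -rlE '"ansible_processor_vcpus": 1[,}]' /var/cache/ansible/facts/
    # sanity-check a suspect host's live view
    ssh afs01.dfw.openstack.org nproc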
19:08:11 <fungi> so at this point, i'm preparing to move the rw volumes back to afs01.dfw and upgrade the remaining servers in that group
19:08:43 <clarkb> fungi: not sure if there is any testing we can/should do of afs01 before you proceed. can we force an afs client to fetch ro data from a specific server?
19:08:50 <fungi> yeah, i'll basically amend /etc/network/interfaces and /etc/default/grub on all of them before moving to noble
19:08:51 <clarkb> that is probably overkill if the cluster reports happiness though
19:09:07 <fungi> i can move a less-important volume to it first
19:09:14 <clarkb> fungi: except for afs02 and kdc04 they shouldn't need the grub thing
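(A minimal sketch of the interface rename portion of the fix; the exact layout of /etc/network/interfaces on these hosts is an assumption, the point is just a search-and-replace of the legacy names before rebooting into noble.)

    # confirm the names the kernel actually assigned
    ip -br link
    # rewrite the legacy eth0/eth1 references
    sudo sed -i 's/\beth0\b/enX0/g; s/\beth1\b/enX1/g' /etc/network/interfaces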
19:10:21 <clarkb> fungi: I also meant to ask you if you had to specify a special image for the rescue image
19:10:21 <corvus> i don't really understand what the vcpu count issue is -- other than something about servers having or not having vcpus, and maybe it's related to the weirdness affecting some rax legacy nodes that we've seen in jobs.  is there something i should know or check on if i do a new launch-node?
19:11:23 <fungi> clarkb: i didn't do anything special for the rescue, just asked rackspace (via their web-based dashboard) to reboot the machine into rescue mode and then used the supplied root password to ssh into it and mount the old device to /mnt so i could chroot into it
19:11:25 <clarkb> corvus: based on ansible facts rax classic has two sets of hypervisors, one with an older version than the other. Booting noble on the new hypervisor has no problems. Booting noble on the old hypervisor hits https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html and those nodes only have one vcpu addressable
19:11:49 <clarkb> corvus: I have already patched launch-node to reject nodes that have fewer than 2 vcpus. So you may do a launch-node run and have it fail and have to retry
19:12:26 <clarkb> corvus: I think this is primarily a problem for doing in place upgrades since we can't request they migrate to the new hypervisors without submitting a ticket and hoping that the migration is possible/successful. I think using the workaround fungi found is reasonable instead
19:12:54 <corvus> okay, thanks.  i feel caught up now.
19:13:18 <clarkb> fungi: ack thanks
19:13:22 <clarkb> anything else on this topic?
19:13:58 <fungi> nothing, other than i'm going to get the volume moves back to afs01.dfw rolling today so i can upgrade the rest to noble soon and take them all back out of the emergency disable list again
19:14:13 <clarkb> sounds good
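(A hedged sketch of the volume moves being discussed; the volume name and partition letters are placeholders, not the real ones.)

    # move an RW volume back to afs01.dfw
    vos move -id mirror.example -fromserver afs02.dfw.openstack.org -frompartition vicepa \
        -toserver afs01.dfw.openstack.org -topartition vicepa -localauth
    # push updated RO replicas once the move completes
    vos release mirror.example -localauth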
19:14:18 <clarkb> #topic Matrix for OpenDev comms
19:14:25 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing
19:14:33 <clarkb> I'd like to raise the "last call on this spec" flag at this point
19:14:57 <clarkb> feedback is positive and we have even heard from people outside of the team. fungi, tonyb: maybe you can try to weigh in this week? otherwise I'll plan to merge it early next week
19:15:45 * JayF notes that even he has element installed and in use now
19:16:09 <JayF> 🏳️
19:16:20 <clarkb> #topic Pre PTG Planning
19:16:22 <fungi> i use matrix every day but from a (somewhat incomplete) plugin in my irc client
19:16:29 <clarkb> #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:16:44 <clarkb> I still want to encourage folks with ideas and interest to add those ideas to the planning document
19:17:15 <clarkb> so far only my thoughts have ended up there, and I'm sure there are more things to cover
19:17:48 <clarkb> and a reminder that the pre ptg will replace our meeting on october 7. See you there
19:17:56 <clarkb> (it will actually start at 1800 UTC)
19:18:10 <clarkb> #topic Loss of upstream Debian bullseye-backports mirror
19:18:24 <clarkb> #link https://review.opendev.org/c/zuul/zuul-jobs/+/958605
19:18:45 <clarkb> I just approved this change to not enable backports by default going forward. Today was the announced change date for zuul-jobs
19:19:20 <clarkb> if anyone screams about this they can add the configure_mirrors_extra_repos: true flag to their jobs. But keep in mind that our next step is to delete bullseye backports from our mirrors
19:19:33 <clarkb> I suspect that will happen once fungi is done with the server upgrades
19:20:08 <clarkb> then we can clean up the workaround to ignore undefined targets
19:20:41 <fungi> yeah, it's next on my list
19:20:49 <fungi> after afs/kerberos servers
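(For anyone who does scream: re-enabling the extra repos is roughly the snippet below, assuming a Zuul job definition; the job name is made up.)

    - job:
        name: my-bullseye-backports-job
        vars:
          configure_mirrors_extra_repos: true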
19:21:29 <clarkb> #topic Etherpad 2.5.0 Upgrade
19:21:35 <clarkb> #link https://github.com/ether/etherpad-lite/blob/v2.5.0/CHANGELOG.md
19:21:42 <clarkb> 104.130.127.119 is a held node for testing. You need to edit /etc/hosts to point etherpad.opendev.org at that IP.
19:21:49 <clarkb> I have a clarkb-test pad already on that held node
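(For testing the held node, an /etc/hosts entry on your own machine along these lines should do it.)

    104.130.127.119 etherpad.opendev.org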
19:22:13 <clarkb> If anyone else wants to look at the root page rendering and decide if it is too ugly and we need to fix it before the upgrade that would be great
19:22:26 <clarkb> it is better than when they first broke it but not as shiny as what we have currently deployed
19:23:07 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/956593/
19:23:19 <clarkb> maybe leave your thoughts about the state of etherpad-lite's no skin skin there
19:24:06 <clarkb> #topic Moving OpenDev's python-base/python-builder/uwsgi-base Images to Quay
19:24:43 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/957277
19:25:19 <clarkb> we recently updated how zuul-jobs' container image building is done with docker to make this all actually work with docker speculative image builds (not docker runtime though)
19:25:38 <clarkb> there was one complaint that we broke things and then we managed to fix that particular complaint. Since then there have been no complaints
19:25:58 <clarkb> I think that means we're now in a good spot to consider actually landing this change and then updating all the images that are built off of these base images
19:26:23 <clarkb> I have removed my WIP vote to reflect that
19:26:52 <clarkb> infra-root can you weigh in on that with any other concerns you may have or potential testing gaps? Given the struggles we've had with this move in the past I don't want to rush this change, but I also think I've finally got it in a good spot
19:27:35 <clarkb> #topic Adding Debian Trixie Base Python Container Images
19:27:42 <corvus> on prev topic
19:27:45 <clarkb> #undo
19:27:45 <opendevmeet> Removing item from minutes: #topic Adding Debian Trixie Base Python Container Images
19:27:59 <corvus> we should probably send an announcement to service-discuss for downstream users.
19:28:15 <corvus> obviously zuul is represented here and we can switch
19:28:24 <corvus> but we'll want to let others know...
19:28:31 <clarkb> corvus: ack I can probably do that today
19:28:40 <corvus> do we want to pull those images from dockerhub at some point?  or just leave them there.
19:28:55 <clarkb> I think we need to leave them for some time while we get our consumers ported over
19:28:58 <corvus> clarkb: i was thinking after we make the switch; i don't think we need to announce before
19:29:02 <clarkb> corvus: ack
19:29:20 <clarkb> given that we've reverted in the past too I don't want to clean things up immediately. But maybe in say 3 months or something we should do cleanup?
19:29:46 <clarkb> that should be long enough for us to know we're unlikely to go back to docker hub but not so long that people are stuck on the old stuff forever
19:29:53 <corvus> yeah.  also, i think it's fine to leave them there.  just wanted to talk about the plan.
19:30:14 <corvus> if we do remove, i think 3 months sounds good
19:30:22 <clarkb> I think the main value in cleaning things up there is that it will serve as a signal to people that those images are no longer usable
19:30:29 <clarkb> in a more direct manner than an email announcement
19:31:22 <clarkb> I'll put that as a rough timeline in the announcement after the change lands
19:31:26 <clarkb> #topic Adding Debian Trixie Base Python Container Images
19:31:40 <clarkb> once we're publishing to quay I'd also like to add trixie based images
19:31:46 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/958480
19:32:06 <clarkb> I'm already noticing some trixie packages get updates that are not going into bookworm or bullseye so having the option to update to that platform seems like a good idea
19:32:44 <clarkb> this should be safe to land once the move to quay is complete so this is mostly a heads up and request for reviews
19:33:02 <clarkb> make sure I didn't cross any wires when adding the new stuff and publish trixie content to bookworm or vice versa
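(Illustration only: once this lands, consumers would switch with FROM lines along these lines, assuming the new tags follow the existing <python-version>-<debian-codename> pattern; the exact python version and tag names are assumptions.)

    FROM quay.io/opendevorg/python-builder:3.12-trixie AS builder
    # ... build wheels here ...
    FROM quay.io/opendevorg/python-base:3.12-trixie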
19:33:45 <clarkb> #topic Dropping Ubuntu Bionic Test Nodes
19:34:04 <clarkb> (If anyone thinks I'm rushing feel free to jump in and ask me to slow down)
19:34:18 <clarkb> At this point I think opendev's direct reliance on bionic is gone
19:34:39 <clarkb> but I wouldn't be surprised to learn I've missed some cleanup somewhere. Feel free to point any out to me or push fixes up yourselves
19:35:03 <clarkb> the ansible 11 default change has caused people to notice that bionic isn't working and we're seeing slow cleanups elsewhere too
19:35:41 <clarkb> corvus: I suspect that zuul will drop ansible 9 soon enough that opendev probably doesn't need to get ahead of that. We should mostly just ensure that we're not relying on it as much as possible, then when zuul updates we can drop the images entirely in opendev
19:35:45 <clarkb> corvus: any concerns with that approach?
19:36:30 <corvus> ...
19:37:16 <clarkb> then the other thing is python2.7 has been running on bionic in a lot of places. It should still be possible to run python2.7 jobs on ubuntu jammy, but you need to pin tox to <4 as tox>=4 is not compatible with python2.7 virtualenvs
19:37:18 <corvus> yes that all sounds right
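(A minimal sketch of the tox pin for python2.7 jobs on jammy; the surrounding job plumbing varies, this is just the constraint that matters.)

    # tox>=4 cannot build python2.7 virtualenvs, so stay on the 3.x series
    python3 -m pip install 'tox<4'
    tox -e py27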
19:37:56 <clarkb> great, then we can delete bionic from our mirrors
19:38:44 <fungi> i did already clean up bionic arm64 packages from the mirrors to free up space for volume moves
19:39:03 <clarkb> ya we dropped arm64 bionic a while ago. So this is just x86_64 cleanup
19:39:15 <clarkb> but that should still have a big impact (like 250GB or something along those lines)
19:39:20 <fungi> removing bionic amd64 packages should free a similar amount of space, yes
19:39:56 <clarkb> #topic Lists Server Slowness
19:39:58 <fungi> i'm happy to repeat those steps, we have them documented but it's a bit convoluted to clean up the indices and make reprepro forget things from its database
19:40:06 <clarkb> fungi: thanks!
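(Very rough sketch of the reprepro side of that cleanup, assuming the process is roughly: drop the bionic suites from conf/distributions, then let reprepro clear the orphaned data; the paths, volume name, and real runbook details are assumptions.)

    # after removing the bionic entries from conf/distributions:
    reprepro -b /afs/.openstack.org/mirror/ubuntu --delete clearvanished
    reprepro -b /afs/.openstack.org/mirror/ubuntu deleteunreferenced
    vos release mirror.ubuntu -localauth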
19:40:26 <clarkb> more and more people are noticing that lists.o.o suffers from frequent slowness
19:40:54 <clarkb> last week we updated UA filter lists based on what was seen there and also restarted services to get it out of swap
19:40:57 <fungi> yeah, we could i guess resize to a larger flavor without needing to move to a new ip address
19:41:00 <clarkb> unfortunately this hasn't fixed the problem
19:41:17 <clarkb> fungi: top reports high iowait while things are slow
19:41:37 <fungi> yeah, which would be consistent with paging to swap as well
19:41:42 <clarkb> maybe before resizing the instance we should try and determine where the iowait originates from as a resize may not help?
19:41:53 <clarkb> fungi: yup though it happened after you restarted things and swap was almost empty
19:42:03 <fungi> right
19:42:29 <clarkb> I also suspect that mailman3 is not designed to cope with a barrage of crawler bots
19:42:45 <fungi> but also it almost immediately moved a bunch of stuff to swap as cache/buffers use swelled
19:43:22 <clarkb> I noticed in the logs that there are a lot of query url requests which I assume means mailman3 is linking to those queries so bots find them while crawling
19:43:40 <clarkb> and then I suspect that something about those queries makes them less cacheable in django so it's like a snowball of all the problems
19:43:40 <corvus> take a look at the cacti graphs
19:43:47 <fungi> yes, the model of mailing list archives being held in a database accessed through a django frontend does seem a bit resource-hungry compared to the pipermail flat files served for mm2 archives
19:43:48 <corvus> i don't see constant swap activity
19:44:14 <fungi> it could just be that the database is slammed and it's not swap really, agreed
19:44:17 <corvus> at first glance it looks like memory usage is "okay" -- in that there's sufficient free ram and it's really just swapping unused stuff out.
19:44:31 <clarkb> maybe the database is the problem, ya
19:45:09 <clarkb> it's possible we need to tune mariadb to better handle this workload. That seems like a promising thread to pull on before resorting to resizing the node
19:45:29 <fungi> mm3 has some caching in there, but a cache probably works against your interests when crawlers are trying to access every single bit of the content
19:45:37 <clarkb> yup, we saw similar with gitea
19:45:47 <corvus> increasing the ram may allow more caches, so that's still something to consider.  just noting it doesn't look like "out of memory", and looks more like the other stuff.
19:45:47 <clarkb> (and had to replace its built in cache system with memcached)
19:46:25 <clarkb> the django install is set up to use diskcache which is a sqlite based caching system for python
19:46:45 <clarkb> not sure if it also uses in memory caches. But could also be that the sqlite system is io bound or mariadb or both
19:46:58 <corvus> i bet that is an opportunity for improvement
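(Rough illustration of the two cache backends being discussed, in Django settings terms; whether mailman-web's settings look exactly like this is an assumption.)

    # current-style setup: diskcache's sqlite-backed Django backend
    CACHES = {
        'default': {
            'BACKEND': 'diskcache.DjangoCache',
            'LOCATION': '/var/cache/mailman-web',  # path is an assumption
        },
    }
    # possible alternative, along the lines of what gitea needed: memcached
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
            'LOCATION': '127.0.0.1:11211',
        },
    }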
19:47:10 <fungi> is iotop a thing on linux?
19:47:26 <corvus> it is
19:47:31 <fungi> i know i've used it on *bsd to figure out where most of the i/o activity is concentrated
19:48:11 <clarkb> ++ using iotop to determine where io is slow then debugging from there sounds like a great next step
19:48:25 <clarkb> I think some of the ebpf tools can be used in a similar way if we have problems with iotop
19:49:08 <fungi> yeah, whatever tools are good for it anyway, we need something that can work fairly granularly since device-level utilization won't tell us much in this case
19:49:33 <clarkb> right. I should be able to look into that today or tomorrow
19:49:44 <clarkb> I've already been staring at too many apache logs to try and understand what is happening on the front end better
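(The sort of invocation that should answer where the iowait comes from: only processes actually doing I/O, per-process, with accumulated totals; the batch form is handy for pasting results into channel.)

    sudo iotop -o -P -a
    # or a one-shot batch sample
    sudo iotop -o -P -b -n 3 -d 5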
19:50:02 <fungi> also we could implement fairly naive request rate limiters with an apache mod if we need something more immediate
19:50:33 <clarkb> email seems to still be processed reasonably quickly so I haven't been treating this as an emergency
19:50:52 <clarkb> but as more people notice I just want to make sure it is on our radar and that we have a plan. Which it sounds like we now do
19:50:55 <fungi> right, it's just the webui which has been struggling
19:51:28 <fungi> e-mails may be getting delayed by seconds as well, but that's less obvious than when a web request is delayed by seconds
19:51:36 <clarkb> yup
19:51:51 <clarkb> I'll see what I can find about where the iowait is happening and we can take it from there
19:51:59 <clarkb> #topic Open Discussion
19:52:04 <clarkb> Before our hour is up was there anything else?
19:52:16 <fungi> openeuler package mirror?
19:52:57 <fungi> i have a couple of changes up to rip it out, but it's a judgement call whether we keep it hoping someone will turn up to re-add images
19:52:58 <clarkb> oh yes, so iirc where that ended up was those interested in openeuler swapped the content of the mirror from release N-1 to N, but then ran into problems bootstrapping release N in dib and therefore nodepool
19:53:05 <clarkb> and we haven't heard or seen anything since
19:53:30 <clarkb> fungi: maybe we should go ahead and delete the content from the mirror for now but leave the openafs volume in place. That way it is easy to rebuild if someone shows up
19:53:30 <fungi> it's been about a year since we paused the broken image builds
19:53:47 <clarkb> but even then if someone shows up I think that we ask them to use the public mirror infrastructure like rocky does to start
19:54:01 <fungi> sure, freeing the data utilization is most of the win anyway
19:54:02 <corvus> it should be much easier for someone to work on that now.
19:54:05 <clarkb> I'm +2 on cleaning up the content in openafs
19:55:08 <fungi> #link https://review.opendev.org/959892 Stop mirroring OpenEuler packages
19:55:25 <fungi> #link https://review.opendev.org/959893 Remove mirror.openeuler utilization graph
19:55:32 <clarkb> I'll review those changes after the meeting
19:55:51 <fungi> the latter we could leave in for now i guess if we aren't planning to delete the volume itself
19:56:19 <fungi> but the first change is obviously necessary if i'm going to delete the data
19:57:02 <clarkb> fungi: I just left a quick question on the first change
19:57:12 <clarkb> basically you can make it a two-step process if you want to reduce the amount of manual work
19:57:21 <clarkb> I'll leave that up to you if you're volunteering to do the manual work though
19:58:05 <fungi> yeah, automating the deletion makes some sense if we're not deleting the volume, since there's no manual steps required
19:58:19 <fungi> if we were going to delete the volume, there's manual steps regardless
19:58:48 <fungi> also we have a fair number of empty and abandoned afs volumes that could probably stand to be removed
19:58:48 <clarkb> right. I guess the main reason I'm thinking keep the volume is that it allows someone to add the mirror easily without infra-root intervention beyond code review
19:59:17 <clarkb> and unlike rocky/alma I worry that their mirror infrastructure is very china centric, so we may actually need to mirror it ourselves if we're running test nodes on openeuler
19:59:24 <fungi> normally i wouldn't notice, but i've become acutely aware while moving them from server to server
19:59:26 <clarkb> but we can always recreate that volume and others if we end up in that situation
19:59:40 <clarkb> and we're just about at time.
19:59:59 <clarkb> Thank you everyone. See you back here at the same time and location next week. Until then thanks again for working on OpenDev
20:00:02 <clarkb> #endmeeting