Tuesday, 2025-09-09

corvuscomm check18:56
fungireceived18:57
corvus_++18:57
fungier, 10-4?18:57
corvusfive by five18:57
fungiover18:57
corvusroger, roger18:57
clarkbhello!19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep  9 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QE3L6OGX345E2EKS6N6ASANSHWJZV2W4/ Our Agenda19:00
clarkb#topic Announcements19:01
clarkbI didn't have anything to announce19:01
clarkbguessing no one else does either given the silence19:02
clarkblets dive in19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:02
clarkbI feel bad about this one because I still haven't had time to really drive this forward for months. But we're deep into the openstack release cycle now so maybe that's for the best19:03
clarkb#link https://review.opendev.org/c/opendev/system-config/+/957555 this change to update to the latest gerrit bugfix releases could still use reviews though19:04
clarkb#topic Upgrading old servers19:05
clarkbPlenty of updates on this topic courtesy of fungi 19:05
clarkball of the afs and kerberos servers were updated to jammy then over the weekend fungi updated afs01.dfw.openstack.org to noble and while it booted the network didn't come up19:06
clarkbdebugging today showed that the host was trying to configure eth0 and eth1 but those interfaces no longer exist. They are enX0 and enX119:06
fungiyeah, i should have just enjoyed the weekend instead19:06
clarkbfungi fixed the network config and rebooted and things came up again thankfully. However, with only one vcpu19:06
fungithat upgrade was already plagued by an ongoing ubuntu/canonical infrastructure incident that delayed things by several days19:07
clarkbapplying the fix suggested in https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html then rebooting again fixed the vcpu count and now the host sees 819:07
clarkbI did an audit of the ansible fact cache and expect that all of the afs and kerberos servers are affected by the one vcpu issue except for afs02.dfw and kdc0419:07
clarkbso further in place upgrades will need to accommodate both the interface renames and the vcpu count issue19:07
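
(As an aside, a minimal Python sketch of the kind of fact cache audit described above; the cache path is an assumption, while ansible_processor_vcpus and ansible_interfaces are standard Ansible fact names.)

    import json
    import pathlib

    # Assumed location of the JSON fact cache; the real deployment may differ.
    FACT_CACHE = pathlib.Path("/var/cache/ansible/facts")

    if FACT_CACHE.is_dir():
        for fact_file in sorted(FACT_CACHE.glob("*")):
            try:
                facts = json.loads(fact_file.read_text())
            except (OSError, ValueError):
                continue
            vcpus = facts.get("ansible_processor_vcpus")
            legacy = [i for i in facts.get("ansible_interfaces", [])
                      if i.startswith("eth")]
            if vcpus == 1 or legacy:
                # Hosts flagged here likely need the grub workaround and/or an
                # /etc/network/interfaces update before an in-place upgrade.
                print(f"{fact_file.name}: vcpus={vcpus} legacy_ifaces={legacy}")
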
fungiso at this point, i'm preparing to move the rw volumes back to afs01.dfw and upgrade the remaining servers in that group19:08
clarkbfungi: not sure if there is any testing we can/should do of afs01 before you proceed. can we force an afs client to fetch ro data from a specific server?19:08
fungiyeah, i'll basically amend /etc/network/interfaces and /etc/default/grub on all of them before moving to noble19:08
clarkbthat is probably overkill if the cluster reports happiness though19:08
fungii can move a less-important volume to it first19:09
clarkbfungi: except for afs02 and kdc04 they shouldn't need the grub thing19:09
clarkbfungi: I also meant to ask you if you had to specify a special image for the rescue image19:10
corvusi don't really understand what the vcpu count issue is -- other than something about servers having or not having vcpus, and maybe it's related to the weirdness affecting some rax legacy nodes that we've seen in jobs.  is there something i should know or check on if i do a new launch-node?19:10
fungiclarkb: i didn't do anything special for the rescue, just asked rackspace (via their web-based dashboard) to reboot the machine into rescue mode and then used the supplied root password to ssh into it and mount the old device to /mnt so i could chroot into it19:11
clarkbcorvus: based on ansible facts rax classic has two sets of hypervisors. One with an older version than the other. Booting noble on the new hypervisor has no problems. Booting noble on the old hypervisor hits: https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html and those nodes only have one vcpu addressable19:11
clarkbcorvus: I have already patched launch-node to reject nodes that have fewer than 2 vcpus. So you may do a launch-node run and have it fail and have to retry19:11
clarkbcorvus: I think this is primarily a problem for doing in place upgrades since we can't request they migrate to the new hypervisors without submitting a ticket and hoping that the migration is possible/successful. I think using the workaround fungi found is reasonable instead19:12
corvusokay, thanks.  i feel caught up now.19:12
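
(The actual launch-node patch is not quoted in the log; purely as a hedged sketch, the check described above amounts to something like the following, rejecting a freshly booted server that reports too few vCPUs so the launch can simply be retried.)

    import subprocess

    MIN_VCPUS = 2  # anything less suggests the old-hypervisor boot problem


    def count_vcpus(host: str) -> int:
        # Ask the freshly booted guest how many CPUs it actually sees.
        result = subprocess.run(
            ["ssh", host, "nproc"], capture_output=True, text=True, check=True
        )
        return int(result.stdout.strip())


    def verify_new_server(host: str) -> None:
        vcpus = count_vcpus(host)
        if vcpus < MIN_VCPUS:
            raise RuntimeError(
                f"{host} only exposes {vcpus} vCPU(s); delete it and retry the launch"
            )
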
clarkbfungi: ack thanks19:13
clarkbanything else on this topic?19:13
funginothing, other than i'm going to get the volume moves back to afs01.dfw rolling today so i can upgrade the rest to noble soon and take them all back out of the emergency disable list again19:13
clarkbsounds good19:14
clarkb#topic Matrix for OpenDev comms19:14
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing19:14
clarkbI'd like to raise the "last call on this spec" flag at this point19:14
clarkbfeedback is positive and we have even heard from people outside of the team. fungi tonyb maybe you can try to weigh in this week otherwise I'll plan to merge it early next week?19:14
* JayF notes that even he has element installed and in use now19:15
JayF🏳️19:16
clarkb#topic Pre PTG Planning19:16
fungii use matrix every day but from a (somewhat incomplete) plugin in my irc client19:16
clarkb#link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document19:16
clarkbI still want to encourage folks with ideas and interest to add those ideas to the planning document19:16
clarkbso far only my thoughts have ended up there, and I'm sure there are more things to cover19:17
clarkband a reminder that the pre ptg will replace our meeting on october 7. See you there19:17
clarkb(it will actually start at 1800 UTC)19:17
clarkb#topic Loss of upstream Debian bullseye-backports mirror19:18
clarkb#link https://review.opendev.org/c/zuul/zuul-jobs/+/95860519:18
clarkbI just approved this change to not enable backports by default going forward. Today was the announced change date for zuul-jobs19:18
clarkbif anyone screams about this they can add the configure_mirrors_extra_repos: true flag to their jobs. But keep in mind that our next step is to delete bullseye backports from our mirrors19:19
clarkbI suspect that will happen once fungi is done with the server upgrades19:19
clarkbthen we can clean up the workaround to ignore undefined targets19:20
fungiyeah, it's next on my list19:20
fungiafter afs/kerberos servers19:20
clarkb#topic Etherpad 2.5.0 Upgrade19:21
clarkb#link https://github.com/ether/etherpad-lite/blob/v2.5.0/CHANGELOG.md19:21
clarkb104.130.127.119 is a held node for testing. You need to edit /etc/hosts to point etherpad.opendev.org at that IP.19:21
clarkbI have a clarkb-test pad already on that held node19:21
clarkbIf anyone else wants to look at the root page rendering and decide if it is too ugly and we need to fix it before the upgrade that would be great19:22
clarkbit is better than when they first broke it but not as shiny as what we have currently deployed19:22
clarkb#link https://review.opendev.org/c/opendev/system-config/+/956593/19:23
clarkbmaybe leave your thoughts about the state of etherpad-lite's no skin skin there19:23
clarkb#topic Moving OpenDev's python-base/python-builder/uwsig-base Images to Quay19:24
clarkb#link https://review.opendev.org/c/opendev/system-config/+/95727719:24
clarkbwe recently updated how zuul-jobs' container image building is done with docker to make this all actually work with docker speculative image builds (not docker runtime though)19:25
clarkbthere was one complaint that we broke things and then we managed to fix that particular complaint. Since then there have been no complaints19:25
clarkbI think that means we're now in a good spot to consider actually landing this change and then updating all the images that are built off of these base images19:25
clarkbI have removed my WIP vote to reflect that19:26
clarkbinfra-root can you weigh in on that with any other concerns you may have or potential testing gaps? Given the struggles we've had with this move in the past I don't want to rush with this change, but I also think I've got it in a good spot finally19:26
clarkb#topic Adding Debian Trixie Base Python Container Images19:27
corvuson prev topic19:27
clarkb#undo19:27
opendevmeetRemoving item from minutes: #topic Adding Debian Trixie Base Python Container Images19:27
corvuswe should probably send an annoucement to service-discuss for downstream users.19:27
corvusobviously zuul is represented here and we can switch19:28
corvusbut we'll want to let others know...19:28
clarkbcorvus: ack I can probably do that today19:28
corvusdo we want to pull those images from dockerhub at some point?  or just leave them there.19:28
clarkbI think we need to leave them for some time while we get our consumers ported over19:28
corvusclarkb: i was thinking after we make the switch; i don't think we need to announce before19:28
clarkbcorvus: ack19:29
clarkbgiven that we've reverted in the past too I don't want to clean things up immediately. But maybe in say 3 months or something we should do cleanup?19:29
clarkbthat should be long enough for us to know we're unlikely to go back to docker hub but not so long that people are stuck on the old stuff forever19:29
corvusyeah.  also, i think it's fine to leave them there.  just wanted to talk about the plan.19:29
corvusif we do remove, i think 3 months sounds good19:30
clarkbI think the main value in cleaning things up there is that it will serve as a signal to people that those images are no longer usable19:30
clarkbin a more direct manner than an email announcement19:30
clarkbI'll put that as a rough timeline in the announcement after the change lands19:31
clarkb#topic Adding Debian Trixie Base Python Container Images19:31
clarkbonce we're publishing to quay I'd also like to add trixie based images19:31
clarkb#link https://review.opendev.org/c/opendev/system-config/+/95848019:31
clarkbI'm already noticing some trixie packages get updates that are not going into bookworm or bullseye so having the option to update to that platform seems like a good idea19:32
clarkbthis should be safe to land once the move to quay is complete so this is mostly a heads up and request for reviews19:32
clarkbmake sure I didn't cross any wires when adding the new stuff and publish trixie content to bookworm or vice versa19:33
clarkb#topic Dropping Ubuntu Bionic Test Nodes19:33
clarkb(If anyone thinks I'm rushing feel free to jump in and ask me to slow down)19:34
clarkbAt this point I think opendev's direct reliance on bionic is gone19:34
clarkbbut I wouldn't be surprised to learn I've missed some cleanup somewhere. Feel free to point any out to me or push fixes up yourselves19:34
clarkbthe ansible 11 default change has caused people to notice that bionic isn't working and we're seeing slow cleanups elsewhere too19:35
clarkbcorvus: I suspect that zuul will drop ansible 9 soon enough that opendev probably doesn't need to get ahead of that. We should mostly just ensure that we're not relying on it as much as possible then when zuul updates we can drop the images entirely in opendev19:35
clarkbcorvus: any concerns with that approach?19:35
corvus...19:36
clarkbthen the other thing is python2.7 has been running on bionic in a lot of places. It should still be possible to run python2.7 jobs on ubuntu jammy, but you need to pin tox to <4 as tox>=4 is not compatible with python2.7 virtualenvs19:37
corvusyes that all sounds right19:37
clarkbgreat, then we can delete bionic from our mirrors19:37
fungii did already clean up bionic arm64 packages from the mirrors to free up space for volume moves19:38
clarkbya we dropped arm64 bionic a while ago. So this is just x86_64 cleanup19:39
clarkbbut that should still have a big impact (like 250GB or something along those lines)19:39
fungiremoving bionic amd64 packages should free a similar amount of space, yes19:39
clarkb#topic Lists Server Slowness19:39
fungii'm happy to repeat those steps, we have them documented but it's a bit convoluted to clean up the indices and make reprepro forget things from its database19:39
clarkbfungi: thanks!19:40
clarkbmore and more people are noticing that lists.o.o suffers from frequent slowness19:40
clarkblast week we updated UA filter lists based on what was seen there and also restarted services to get it out of swap19:40
fungiyeah, we could i guess resize to a larger flavor without needing to move to a new ip address19:40
clarkbunfortunately this hasn't fixed the problem19:41
clarkbfungi: top reports high iowait while things are slow19:41
fungiyeah, which would be consistent with paging to swap as well19:41
clarkbmaybe before resizing the instance we should try and determine where the iowait originates from as a resize may not help?19:41
clarkbfungi: yup though it happened after you restarted things and swap was almost empty19:41
fungiright19:42
clarkbI also suspect that mailman3 is not designed to cope with a barrage of crawler bots19:42
fungibut also it almost immediately moved a bunch of stuff to swap as cache/buffers use swelled19:42
clarkbI noticed in the logs that there are a lot of query url requests which I assume means mailman3 is linking to those queries so bots find them while crawling19:43
clarkband then I suspect that something about those queries makes them less cacheable in django so it's like a snowball of all the problems19:43
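
(A hypothetical example of that kind of log digging: a short Python pass over the Apache access log counting which user agents request query URLs; the log path and combined log format are assumptions.)

    import collections
    import re

    # Assumed combined-format access log for the lists vhost.
    LOG = "/var/log/apache2/lists-access.log"
    PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    counts = collections.Counter()
    with open(LOG, errors="replace") as handle:
        for line in handle:
            match = PATTERN.search(line)
            if match and "?" in match.group(1):  # only requests with query strings
                counts[match.group(2)] += 1

    for agent, hits in counts.most_common(10):
        print(f"{hits:8d}  {agent}")
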
corvustake a look at the cacti graphs19:43
fungiyes, the model of mailing list archives being held in a database accessed through a django frontend does seem a bit resource-hungry compared to the pipermail flat files served for mm2 archives19:43
corvusi don't see constant swap activity19:43
fungiit could just be that the database is slammed and it's not swap really, agreed19:44
corvusat first glance it looks like memory usage is "okay" -- in that there's sufficient free ram and it's really just swapping unused stuff out.19:44
clarkbmaybe the database is the problem, ya19:44
clarkbit's possible we need to tune mariadb to better handle this workload. That seems like a promising thread to pull on before resorting to resizing the node19:45
fungimm3 has some caching in there, but a cache probably works against your interests when crawlers are trying to access every single bit of the content19:45
clarkbyup, we saw similar with gitea19:45
corvusincreasing the ram may allow more caches, so that's still something to consider.  just noting it doesn't look like "out of memory", and looks more like the other stuff.19:45
clarkb(and had to replace its built in cache system with memcached)19:45
clarkbthe django install is set up to use diskcache which is a sqlite based caching system for python19:46
clarkbnot sure if it also uses in memory caches. But could also be that the sqlite system is io bound or mariadb or both19:46
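
(For context, diskcache stores cache entries in a SQLite database on disk, so heavy cache churn from crawlers shows up as filesystem I/O rather than memory pressure; a minimal illustration using an example path, not the real django cache directory:)

    import diskcache

    # Illustrative path only; the real django cache directory will differ.
    cache = diskcache.Cache("/tmp/example-diskcache")


    def rendered_page(key: str) -> str:
        page = cache.get(key)  # each lookup hits the on-disk SQLite index
        if page is None:
            page = f"rendered archive page for {key}"  # stand-in for a django view
            cache.set(key, page, expire=300)           # every set is a disk write
        return page
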
corvusi bet that is an opportunity for improvement19:46
fungiis iotop a thing on linux?19:47
corvusit is19:47
fungii know i've used it on *bsd to figure out where most of the i/o activity is concentrated19:47
clarkb++ using iotop to determine where io is slow then debugging from there sounds like a great next step19:48
clarkbI think some of the ebpf tools can be used in a similar way if we have problems with iotop19:48
fungiyeah, whatever tools are good for it anyway, we need something that can work fairly granularly since device-level utilization won't tell us much in this case19:49
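
(If iotop proves awkward, a psutil-based sample along these lines can rank processes by bytes read and written over an interval on Linux; this is only a sketch, and the interval and output format are arbitrary choices.)

    import time

    import psutil


    def top_io_processes(interval: float = 5.0, limit: int = 10) -> None:
        # Snapshot per-process I/O counters, wait, then rank by bytes moved.
        start = {}
        for proc in psutil.process_iter(["pid", "name"]):
            try:
                start[proc.pid] = (proc.info["name"], proc.io_counters())
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        time.sleep(interval)
        deltas = []
        for pid, (name, before) in start.items():
            try:
                after = psutil.Process(pid).io_counters()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
            moved = (after.read_bytes - before.read_bytes
                     + after.write_bytes - before.write_bytes)
            deltas.append((moved, pid, name))
        for moved, pid, name in sorted(deltas, reverse=True)[:limit]:
            print(f"{moved:12d} bytes  {name} (pid {pid})")


    if __name__ == "__main__":
        top_io_processes()
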
clarkbright. I should be able to look into that today or tomorrow19:49
clarkbI've already been staring at too many apache logs to try and understand what is happening on the front end better19:49
fungialso we could implement fairly naive request rate limiters with an apache mod if we need something more immediate19:50
clarkbemail seems to still be processed reasonably quickly so I haven't been treating this as an emergency19:50
clarkbbut as more people notice I just want to make sure it is on our radar and that we have a plan. Which it sounds like we now do19:50
fungiright, it's just the webui which has been struggling19:50
fungie-mails may be getting delayed by seconds as well, but that's less obvious than when a web request is delayed by seconds19:51
clarkbyup19:51
clarkbI'll see what I can find about where the iowait is happening and we can take it from there19:51
clarkb#topic Open Discussion19:51
clarkbBefore our hour is up was there anything else?19:52
fungiopeneuler package mirror?19:52
fungii have a couple of changes up to rip it out, but it's a judgement call whether we keep it hoping someone will turn up to re-add images19:52
clarkboh yes, so iirc where that ended up was those interested in openeuler swapped the content of the mirror from release N-1 to N. But then ran into problems bootstrapping release N in dib and therefore nodepool19:52
clarkband we haven't heard or seen anything since19:53
clarkbfungi: maybe we should go ahead and delete the content from the mirror for now but leave the openafs volume in place. That way it is easy to rebuild if someone shows up19:53
fungiit's been about a year since we paused the broken image builds19:53
clarkbbut even then if someone shows up I think we should ask them to use the public mirror infrastructure like rocky does to start19:53
fungisure, freeing the data utilization is most of the win anyway19:54
corvusit should be much easier for someone to work on that now.19:54
clarkbI'm +2 on cleaning up the content in openafs19:54
fungi#link https://review.opendev.org/959892 Stop mirroring OpenEuler packages19:55
fungi#link https://review.opendev.org/959893 Remove mirror.openeuler utilization graph19:55
clarkbI'll review those changes after the meeting19:55
fungithe latter we could leave in for now i guess if we aren't planning to delete the volume itself19:55
fungibut the first change is obviously necessary if i'm going to delete the data19:56
clarkbfungi: I just left a quick question on the first change19:57
clarkbbasically you can make it a two step process if you want to reduce the amount of manual work19:57
clarkbI'll leave that up to you if you're volunteering to do the manual work though19:57
fungiyeah, automating the deletion makes some sense if we're not deleting the volume, since there's no manual steps required19:58
fungiif we were going to delete the volume, there's manual steps regardless19:58
fungialso we have a fair number of empty and abandoned afs volumes that could probably stand to be removed19:58
clarkbright. I guess the main reason I'm thinking keep the volume is that it allows someone to add the mirror easily without infra-root intervention beyond code review19:58
clarkband unlike rocky/alma I worry that their mirror infrastructure is very China-centric, so we may actually need to mirror it ourselves if we're running test nodes on openeuler19:59
funginormally i wouldn't notice, but i've become acutely aware while moving them from server to server19:59
clarkbbut we can always recreate that volume and others if we end up in that situation19:59
clarkband we're just about at time.19:59
clarkbThank you everyone. See you back here at the same time and location next week. Until then thanks again for working on OpenDev19:59
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Sep  9 20:00:02 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.log.html20:00
fungithanks clarkb!20:00
