19:00:11 #startmeeting infra
19:00:11 Meeting started Tue Sep 9 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 The meeting name has been set to 'infra'
19:00:23 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QE3L6OGX345E2EKS6N6ASANSHWJZV2W4/ Our Agenda
19:01:25 #topic Announcements
19:01:29 I didn't have anything to announce
19:02:36 guessing no one else does either given the silence
19:02:41 let's dive in
19:02:45 #topic Gerrit 3.11 Upgrade Planning
19:03:12 I feel bad about this one because I continue to not have had time to really drive this forward for months. But we're deep into the openstack release cycle now so it may be for the best
19:04:05 #link https://review.opendev.org/c/opendev/system-config/+/957555 this change to update to the latest gerrit bugfix releases could still use reviews though
19:05:31 #topic Upgrading old servers
19:05:38 Plenty of updates on this topic courtesy of fungi
19:06:08 all of the afs and kerberos servers were updated to jammy, then over the weekend fungi updated afs01.dfw.openstack.org to noble and while it booted the network didn't come up
19:06:33 debugging today showed that the host was trying to configure eth0 and eth1 but those interfaces no longer exist. They are enX0 and enX1
19:06:34 yeah, i should have just enjoyed the weekend instead
19:06:53 fungi fixed the network config and rebooted and things came up again thankfully. However, with only one vcpu
19:07:05 that upgrade was already plagued by an ongoing ubuntu/canonical infrastructure incident that delayed things by several days
19:07:11 applying the fix suggested in https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html then rebooting again fixed the vcpu count and now the host sees 8
19:07:41 I did an audit of the ansible fact cache and expect that all of the afs and kerberos servers are affected by the one vcpu issue except for afs02.dfw and kdc04
19:07:55 so further in place upgrades will need to accommodate both the interface renames and the vcpu count issue
19:08:11 so at this point, i'm preparing to move the rw volumes back to afs01.dfw and upgrade the remaining servers in that group
19:08:43 fungi: not sure if there is any testing we can/should do of afs01 before you proceed. can we force an afs client to fetch ro data from a specific server?
19:08:50 yeah, i'll basically amend /etc/network/interfaces and /etc/default/grub on all of them before moving to noble
19:08:51 that is probably overkill if the cluster reports happiness though
19:09:07 i can move a less-important volume to it first
19:09:14 fungi: except for afs02 and kdc04 they shouldn't need the grub thing
19:10:21 fungi: I also meant to ask you if you had to specify a special image for the rescue image
19:10:22 i don't really understand what the vcpu count issue is -- other than something about servers having or not having vcpus, and maybe it's related to the weirdness affecting some rax legacy nodes that we've seen in jobs. is there something i should know or check on if i do a new launch-node?
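A minimal sketch of the interface rename fix described above, assuming a stock ifupdown config that only references eth0/eth1; the actual stanzas on these servers are not shown in the log, so treat the commands as illustrative rather than the exact steps taken:

    # hedged sketch: rename the legacy interface names to the new enX* names
    sudo sed -i -e 's/\beth0\b/enX0/g' -e 's/\beth1\b/enX1/g' /etc/network/interfaces
    # the grub workaround from the Oracle note is an edit to /etc/default/grub,
    # then regenerating the config before the next reboot
    sudo update-grub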
19:11:23 clarkb: i didn't do anything special for the rescue, just asked rackspace (via their web-based dashboard) to reboot the machine into rescue mode and then used the supplied root password to ssh into it and mount the old device to /mnt so i could chroot into it
19:11:25 corvus: based on ansible facts rax classic has two sets of hypervisors. One with an older version than the other. Booting noble on the new hypervisor has no problems. Booting noble on the old hypervisor hits: https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html and those nodes only have one vcpu addressable
19:11:49 corvus: I have already patched launch node to reject nodes that have less than 2 vcpus. So you may do a launch node and have it fail and have to retry
19:12:26 corvus: I think this is primarily a problem for doing in place upgrades since we can't request they migrate to the new hypervisors without submitting a ticket and hoping that the migration is possible/successful. I think using the workaround fungi found is reasonable instead
19:12:54 okay, thanks. i feel caught up now.
19:13:18 fungi: ack thanks
19:13:22 anything else on this topic?
19:13:58 nothing, other than i'm going to get the volume moves back to afs01.dfw rolling today so i can upgrade the rest to noble soon and take them all back out of the emergency disable list again
19:14:13 sounds good
19:14:18 #topic Matrix for OpenDev comms
19:14:25 #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing
19:14:33 I'd like to raise the "last call on this spec" flag at this point
19:14:57 feedback is positive and we have even heard from people outside of the team. fungi tonyb maybe you can try to weigh in this week, otherwise I'll plan to merge it early next week?
19:15:45 * JayF notes that even he has element installed and in use now
19:16:09 🏳️
19:16:20 #topic Pre PTG Planning
19:16:22 i use matrix every day but from a (somewhat incomplete) plugin in my irc client
19:16:29 #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:16:44 I still want to encourage folks with ideas and interest to add those ideas to the planning document
19:17:15 so far only my thoughts have ended up there, and I'm sure there are more things to cover
19:17:48 and a reminder that the pre ptg will replace our meeting on october 7. See you there
19:17:56 (it will actually start at 1800 UTC)
19:18:10 #topic Loss of upstream Debian bullseye-backports mirror
19:18:24 #link https://review.opendev.org/c/zuul/zuul-jobs/+/958605
19:18:45 I just approved this change to not enable backports by default going forward. Today was the announced change date for zuul-jobs
19:19:20 if anyone screams about this they can add the configure_mirrors_extra_repos: true flag to their jobs. But keep in mind that our next step is to delete bullseye backports from our mirrors
19:19:33 I suspect that will happen once fungi is done with the server upgrades
19:20:08 then we can clean up the workaround to ignore undefined targets
19:20:41 yeah, it's next on my list
19:20:49 after afs/kerberos servers
19:21:29 #topic Etherpad 2.5.0 Upgrade
19:21:35 #link https://github.com/ether/etherpad-lite/blob/v2.5.0/CHANGELOG.md
19:21:42 104.130.127.119 is a held node for testing. You need to edit /etc/hosts to point etherpad.opendev.org at that IP.
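A minimal example of the /etc/hosts override mentioned above for testing against the held node (remove the entry again once testing is done):

    104.130.127.119  etherpad.opendev.org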
19:21:49 I have a clarkb-test pad already on that held node
19:22:13 If anyone else wants to look at the root page rendering and decide if it is too ugly and we need to fix it before the upgrade that would be great
19:22:26 it is better than when they first broke it but not as shiny as what we have currently deployed
19:23:07 #link https://review.opendev.org/c/opendev/system-config/+/956593/
19:23:19 maybe leave your thoughts about the state of etherpad-lite's "no skin" skin there
19:24:06 #topic Moving OpenDev's python-base/python-builder/uwsgi-base Images to Quay
19:24:43 #link https://review.opendev.org/c/opendev/system-config/+/957277
19:25:19 we recently updated how zuul-jobs' container image building is done with docker to make this all actually work with docker speculative image builds (not docker runtime though)
19:25:38 there was one complaint that we broke things and then we managed to fix that particular complaint. Since then there have been no complaints
19:25:58 I think that means we're now in a good spot to consider actually landing this change and then updating all the images that are built off of these base images
19:26:23 I have removed my WIP vote to reflect that
19:26:52 infra-root can you weigh in on that with any other concerns you may have or potential testing gaps? Given the struggles we've had with this move in the past I don't want to rush with this change, but I also think I've got it in a good spot finally
19:27:35 #topic Adding Debian Trixie Base Python Container Images
19:27:42 on prev topic
19:27:45 #undo
19:27:45 Removing item from minutes: #topic Adding Debian Trixie Base Python Container Images
19:27:59 we should probably send an announcement to service-discuss for downstream users.
19:28:15 obviously zuul is represented here and we can switch
19:28:24 but we'll want to let others know...
19:28:31 corvus: ack I can probably do that today
19:28:40 do we want to pull those images from dockerhub at some point? or just leave them there.
19:28:55 I think we need to leave them for some time while we get our consumers ported over
19:28:58 clarkb: i was thinking after we make the switch; i don't think we need to announce before
19:29:02 corvus: ack
19:29:20 given that we've reverted in the past too I don't want to clean things up immediately. But maybe in say 3 months or something we should do cleanup?
19:29:46 that should be long enough for us to know we're unlikely to go back to docker hub but not too long that people are stuck in the old stuff forever
19:29:53 yeah. also, i think it's fine to leave them there. just wanted to talk about the plan.
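For downstream consumers of these base images, the eventual switch is roughly the hedged Dockerfile sketch below; the tag shown and the assemble/install-from-bindep helper pattern are assumptions based on how python-builder/python-base are typically consumed, so check the tags actually published to quay.io before copying this:

    # hypothetical downstream Dockerfile once the base images publish to quay.io
    FROM quay.io/opendevorg/python-builder:3.12-bookworm AS builder
    COPY . /tmp/src
    RUN assemble

    FROM quay.io/opendevorg/python-base:3.12-bookworm
    COPY --from=builder /output/ /output
    RUN /output/install-from-bindep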
19:30:14 if we do remove, i think 3 months sounds good
19:30:22 I think the main value in cleaning things up there is that it will serve as a signal to people that those images are no longer usable
19:30:29 in a more direct manner than an email announcement
19:31:22 I'll put that as a rough timeline in the announcement after the change lands
19:31:26 #topic Adding Debian Trixie Base Python Container Images
19:31:40 once we're publishing to quay I'd also like to add trixie based images
19:31:46 #link https://review.opendev.org/c/opendev/system-config/+/958480
19:32:06 I'm already noticing some trixie packages get updates that are not going into bookworm or bullseye so having the option to update to that platform seems like a good idea
19:32:44 this should be safe to land once the move to quay is complete so this is mostly a heads up and request for reviews
19:33:02 make sure I didn't cross any wires when adding the new stuff and publish trixie content to bookworm or vice versa
19:33:45 #topic Dropping Ubuntu Bionic Test Nodes
19:34:04 (If anyone thinks I'm rushing feel free to jump in and ask me to slow down)
19:34:18 At this point I think opendev's direct reliance on bionic is gone
19:34:39 but I wouldn't be surprised to learn I've missed some cleanup somewhere. Feel free to point any out to me or push fixes up yourselves
19:35:03 the ansible 11 default change has caused people to notice that bionic isn't working and we're seeing slow cleanups elsewhere too
19:35:41 corvus: I suspect that zuul will drop ansible 9 soon enough that opendev probably doesn't need to get ahead of that. We should mostly just ensure that we're not relying on it as much as possible then when zuul updates we can drop the images entirely in opendev
19:35:45 corvus: any concerns with that approach?
19:36:30 ...
19:37:16 then the other thing is python2.7 has been running on bionic in a lot of places. It should still be possible to run python2.7 jobs on ubuntu jammy, but you need to pin tox to <4 as tox>=4 is not compatible with python2.7 virtualenvs
19:37:18 yes that all sounds right
19:37:56 great, then we can delete bionic from our mirrors
19:38:44 i did already clean up bionic arm64 packages from the mirrors to free up space for volume moves
19:39:03 ya we dropped arm64 bionic a while ago. So this is just x86_64 cleanup
19:39:15 but that should still have a big impact (like 250GB or something along those lines)
19:39:20 removing bionic amd64 packages should free a similar amount of space, yes
19:39:56 #topic Lists Server Slowness
19:39:58 i'm happy to repeat those steps, we have them documented but it's a bit convoluted to clean up the indices and make reprepro forget things from its database
19:40:06 fungi: thanks!
19:40:26 more and more people are noticing that lists.o.o suffers from frequent slowness
19:40:54 last week we updated UA filter lists based on what was seen there and also restarted services to get it out of swap
19:40:57 yeah, we could i guess resize to a larger flavor without needing to move to a new ip address
19:41:00 unfortunately this hasn't fixed the problem
19:41:17 fungi: top reports high iowait while things are slow
19:41:37 yeah, which would be consistent with paging to swap as well
19:41:42 maybe before resizing the instance we should try and determine where the iowait originates from as a resize may not help?
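A hedged sketch of the triage discussed here and in the messages that follow (the log goes on to suggest iotop); the flags shown are just one reasonable invocation, not the exact commands run on the server:

    # check whether memory pressure/swap or raw disk I/O lines up with the slow periods
    free -h
    vmstat 5 12          # watch the si/so (swap) and wa (iowait) columns
    # per-process view of who is actually doing the I/O, as suggested below
    sudo iotop -o -b -d 5 -n 12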
19:41:53 fungi: yup though it happened after you restarted things and swap was almost empty
19:42:03 right
19:42:29 I also suspect that mailman3 is not designed to cope with a barrage of crawler bots
19:42:45 but also it almost immediately moved a bunch of stuff to swap as cache/buffers use swelled
19:43:22 I noticed in the logs that there are a lot of query url requests which I assume means mailman3 is linking to those queries so bots find them while crawling
19:43:40 and then I suspect that something about those queries makes them less cacheable in django so it's like a snowball of all the problems
19:43:40 take a look at the cacti graphs
19:43:47 yes, the model of mailing list archives being held in a database accessed through a django frontend does seem a bit resource-hungry compared to the pipermail flat files served for mm2 archives
19:43:48 i don't see constant swap activity
19:44:14 it could just be that the database is slammed and it's not swap really, agreed
19:44:17 at first glance it looks like memory usage is "okay" -- in that there's sufficient free ram and it's really just swapping unused stuff out.
19:44:31 maybe the database is the problem ya that
19:45:09 it's possible we need to tune mariadb to better handle this workload. That seems like a promising thread to pull on before resorting to resizing the node
19:45:29 mm3 has some caching in there, but a cache probably works against your interests when crawlers are trying to access every single bit of the content
19:45:37 yup, we saw similar with gitea
19:45:47 increasing the ram may allow more caches, so that's still something to consider. just noting it doesn't look like "out of memory", and looks more like the other stuff.
19:45:47 (and had to replace its built-in cache system with memcached)
19:46:25 the django install is set up to use diskcache which is a sqlite based caching system for python
19:46:45 not sure if it also uses in memory caches. But could also be that the sqlite system is io bound or mariadb or both
19:46:58 i bet that is an opportunity for improvement
19:47:10 is iotop a thing on linux?
19:47:26 it is
19:47:31 i know i've used it on *bsd to figure out where most of the i/o activity is concentrated
19:48:11 ++ using iotop to determine where io is slow then debugging from there sounds like a great next step
19:48:25 I think some of the ebpf tools can be used in a similar way if we have problems with iotop
19:49:08 yeah, whatever tools are good for it anyway, we need something that can work fairly granularly since device-level utilization won't tell us much in this case
19:49:33 right. I should be able to look into that today or tomorrow
19:49:44 I've already been staring at too many apache logs to try and understand what is happening on the front end better
19:50:02 also we could implement fairly naive request rate limiters with an apache mod if we need something more immediate
19:50:33 email seems to still be processed reasonably quickly so I haven't been treating this as an emergency
19:50:52 but as more people notice I just want to make sure it is on our radar and that we have a plan. Which it sounds like we now do
19:50:55 right, it's just the webui which has been struggling
19:51:28 e-mails may be getting delayed by seconds as well, but that's less obvious than when a web request is delayed by seconds
19:51:36 yup
19:51:51 I'll see what I can find about where the iowait is happening and we can take it from there
19:51:59 #topic Open Discussion
19:52:04 Before our hour is up was there anything else?
19:52:16 openeuler package mirror?
19:52:57 i have a couple of changes up to rip it out, but it's a judgement call whether we keep it hoping someone will turn up to re-add images
19:52:58 oh yes, so iirc where that ended up was those interested in openeuler swapped the content of the mirror from release N-1 to N. But then ran into problems bootstrapping release N in dib and therefore nodepool
19:53:05 and we haven't heard or seen anything since
19:53:30 fungi: maybe we should go ahead and delete the content from the mirror for now but leave the openafs volume in place. That way it is easy to rebuild if someone shows up
19:53:30 it's been about a year since we paused the broken image builds
19:53:47 but even then if someone shows up I think that we ask them to use the public mirror infrastructure like rocky does to start
19:54:01 sure, freeing the data utilization is most of the win anyway
19:54:02 it should be much easier for someone to work on that now.
19:54:05 I'm +2 on cleaning up the content in openafs
19:55:08 #link https://review.opendev.org/959892 Stop mirroring OpenEuler packages
19:55:25 #link https://review.opendev.org/959893 Remove mirror.openeuler utilization graph
19:55:32 I'll review those changes after the meeting
19:55:51 the latter we could leave in for now i guess if we aren't planning to delete the volume itself
19:56:19 but the first change is obviously necessary if i'm going to delete the data
19:57:02 fungi: I just left a quick question on the first change
19:57:12 basically you can make it a two step process if you want to reduce the amount of manual work
19:57:21 I'll leave that up to you if you're volunteering to do the manual work though
19:58:05 yeah, automating the deletion makes some sense if we're not deleting the volume, since there's no manual steps required
19:58:19 if we were going to delete the volume, there's manual steps regardless
19:58:48 also we have a fair number of empty and abandoned afs volumes that could probably stand to be removed
19:58:48 right. I guess the main reason I'm thinking keep the volume is that it allows someone to add the mirror easily without infra-root intervention beyond code review
19:59:17 and unlike rocky/alma I worry that their mirror infrastructure is very china centric so we may actually need to mirror if we're running test nodes on openeuler
19:59:24 normally i wouldn't notice, but i've become acutely aware while moving them from server to server
19:59:26 but we can always recreate that volume and others if we end up in that situation
19:59:40 and we're just about at time.
19:59:59 Thank you everyone. See you back here at the same time and location next week. Until then thanks again for working on OpenDev
20:00:02 #endmeeting