| nick | message | time |
|---|---|---|
| corvus | comm check | 18:56 |
| fungi | received | 18:57 |
| corvus_ | ++ | 18:57 |
| fungi | er, 10-4? | 18:57 |
| corvus | five by five | 18:57 |
| fungi | over | 18:57 |
| corvus | roger, roger | 18:57 |
| clarkb | hello! | 19:00 |
| clarkb | #startmeeting infra | 19:00 |
| opendevmeet | Meeting started Tue Sep 9 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
| opendevmeet | The meeting name has been set to 'infra' | 19:00 |
| clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QE3L6OGX345E2EKS6N6ASANSHWJZV2W4/ Our Agenda | 19:00 |
| clarkb | #topic Announcements | 19:01 |
| clarkb | I didn't have anything to announce | 19:01 |
| clarkb | guessing no one else does either given the silence | 19:02 |
| clarkb | lets dive in | 19:02 |
| clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:02 |
| clarkb | I feel bad about this one because I still haven't had time to really drive this forward for months. But we're deep into the openstack release cycle now so it may be for the best | 19:03 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/957555 this change to update to the latest gerrit bugfix releases could still use reviews though | 19:04 |
| clarkb | #topic Upgrading old servers | 19:05 |
| clarkb | Plenty of updates on this topic courtesy of fungi | 19:05 |
| clarkb | all of the afs and kerberos servers were updated to jammy, then over the weekend fungi updated afs01.dfw.openstack.org to noble, and while it booted the network didn't come up | 19:06 |
| clarkb | debugging today showed that the host was trying to configure eth0 and eth1 but those interfaces no longer exist. They are enX0 and enX1 | 19:06 |
| fungi | yeah, i should have just enjoyed the weekend instead | 19:06 |
| clarkb | fungi fixed the network config and rebooted and things came up again thankfully. However, with only one vcpu | 19:06 |
| fungi | that upgrade was already plagued by an ongoing ubuntu/canonical infrastructure incident that delayed things by several days | 19:07 |
| clarkb | applying the fix suggested in https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html then rebooting again fixed the vcpu count and now the host sees 8 | 19:07 |
| clarkb | I did an audit of the ansible fact cache and expect that all of the afs and kerberos servers are affected by the one vcpu issue except for afs02.dfw and kdc04 | 19:07 |
| clarkb | so further in-place upgrades will need to accommodate both the interface renames and the vcpu count issue | 19:07 |
| fungi | so at this point, i'm preparing to move the rw volumes back to afs01.dfw and upgrade the remaining servers in that group | 19:08 |
| clarkb | fungi: not sure if there is any testing we can/should do of afs01 before you proceed. can we force an afs client to fetch ro data from a specific server? | 19:08 |
| fungi | yeah, i'll basically amend /etc/network/interfaces and /etc/default/grub on all of them before moving to noble | 19:08 |
| clarkb | that is probably overkill if the cluster reports happiness though | 19:08 |
| fungi | i can move a less-important volume to it first | 19:09 |
| clarkb | fungi: afs02 and kdc04 shouldn't need the grub change, but the rest will | 19:09 |
| clarkb | fungi: I also meant to ask you if you had to specify a special image for the rescue image | 19:10 |
| corvus | i don't really understand what the vcpu count issue is -- other than something about servers having or not having vcpus, and maybe it's related to the weirdness affecting some rax legacy nodes that we've seen in jobs. is there something i should know or check on if i do a new launch-node? | 19:10 |
| fungi | clarkb: i didn't do anything special for the rescue, just asked rackspace (via their web-based dashboard) to reboot the machine into rescue mode and then used the supplied root password to ssh into it and mount the old device to /mnt so i could chroot into it | 19:11 |
| clarkb | corvus: based on ansible facts rax classic has two sets of hypervisors, one with an older version than the other. Booting noble on the new hypervisor has no problems. Booting noble on the old hypervisor hits https://docs.oracle.com/en/operating-systems/uek/8/relnotes8.0/38006792.html and those nodes only have one vcpu addressable | 19:11 |
| clarkb | corvus: I have already patched launch-node to reject nodes that have less than 2 vcpus. So you may do a launch-node and have it fail and have to retry | 19:11 |
| clarkb | corvus: I think this is primarily a problem for doing in place upgrades since we can't request they migrate to the new hypervisors without submitting a ticket and hoping that the migration is possible/successful. I think using the workaround fungi found is reasonable instead | 19:12 |
| corvus | okay, thanks. i feel caught up now. | 19:12 |
| clarkb | fungi: ack thanks | 19:13 |
| clarkb | anything else on this topic? | 19:13 |
| fungi | nothing, other than i'm going to get the volume moves back to afs01.dfw rolling today so i can upgrade the rest to noble soon and take them all back out of the emergency disable list again | 19:13 |
| clarkb | sounds good | 19:14 |
| clarkb | #topic Matrix for OpenDev comms | 19:14 |
| clarkb | #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing | 19:14 |
| clarkb | I'd like to raise the "last call on this spec" flag at this point | 19:14 |
| clarkb | feedback is positive and we have even heard from people outside of the team. fungi tonyb maybe you can try to weigh in this week otherwise I'll plan to merge it early next week? | 19:14 |
| * JayF | notes that even he has element installed and in use now | 19:15 |
| JayF | 🏳️ | 19:16 |
| clarkb | #topic Pre PTG Planning | 19:16 |
| fungi | i use matrix every day but from a (somewhat incomplete) plugin in my irc client | 19:16 |
| clarkb | #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document | 19:16 |
| clarkb | I still want to encourage folks with ideas and interest to add those ideas to the planning document | 19:16 |
| clarkb | so far only my thoughts have ended up there, and I'm sure there are more things to cover | 19:17 |
| clarkb | and a reminder that the pre ptg will replace our meeting on october 7. See you there | 19:17 |
| clarkb | (it will actually start at 1800 UTC) | 19:17 |
| clarkb | #topic Loss of upstream Debian bullseye-backports mirror | 19:18 |
| clarkb | #link https://review.opendev.org/c/zuul/zuul-jobs/+/958605 | 19:18 |
| clarkb | I just approved this change to not enable backports by default going forward. Today was the announced change date for zuul-jobs | 19:18 |
| clarkb | if anyone screams about this they can add the configure_mirrors_extra_repos: true flag to their jobs. But keep in mind that our next step is to delete bullseye backports from our mirrors | 19:19 |
| clarkb | I suspect that will happen once fungi is done with the server upgrades | 19:19 |
| clarkb | then we can clean up the workaround to ignore undefined targets | 19:20 |
| fungi | yeah, it's next on my list | 19:20 |
| fungi | after afs/kerberos servers | 19:20 |
| clarkb | #topic Etherpad 2.5.0 Upgrade | 19:21 |
| clarkb | #link https://github.com/ether/etherpad-lite/blob/v2.5.0/CHANGELOG.md | 19:21 |
| clarkb | 104.130.127.119 is a held node for testing. You need to edit /etc/hosts to point etherpad.opendev.org at that IP. | 19:21 |
| clarkb | I have a clarkb-test pad already on that held node | 19:21 |
| clarkb | If anyone else wants to look at the root page rendering and decide if it is too ugly and we need to fix it before the upgrade that would be great | 19:22 |
| clarkb | it is better than when they first broke it but not as shiny as what we have currently deployed | 19:22 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/956593/ | 19:23 |
| clarkb | maybe leave your thoughts about the state of etherpad-lite's "no skin" skin there | 19:23 |
| clarkb | #topic Moving OpenDev's python-base/python-builder/uwsgi-base Images to Quay | 19:24 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/957277 | 19:24 |
| clarkb | we recently updated how zuul-jobs' container image building is done with docker to make this all actually work with docker speculative image builds (not docker runtime though) | 19:25 |
| clarkb | there was one complaint that we broke things and then we managed to fix that particular complaint. Since then there have been no complaints | 19:25 |
| clarkb | I think that means we're now in a good spot to consider actually landing this change and then updating all the images that are built off of these base images | 19:25 |
| clarkb | I have removed my WIP vote to reflect that | 19:26 |
| clarkb | infra-root can you weigh in on that with any other concerns you may have or potential testing gaps? Given the struggles we've had with this move in the past I don't want to rush this change, but I also think I've got it in a good spot finally | 19:26 |
| clarkb | #topic Adding Debian Trixie Base Python Container Images | 19:27 |
| corvus | on prev topic | 19:27 |
| clarkb | #undo | 19:27 |
| opendevmeet | Removing item from minutes: #topic Adding Debian Trixie Base Python Container Images | 19:27 |
| corvus | we should probably send an announcement to service-discuss for downstream users. | 19:27 |
| corvus | obviously zuul is represented here and we can switch | 19:28 |
| corvus | but we'll want to let others know... | 19:28 |
| clarkb | corvus: ack I can probably do that today | 19:28 |
| corvus | do we want to pull those images from dockerhub at some point? or just leave them there. | 19:28 |
| clarkb | I think we need to leave them for some time while we get our consumers ported over | 19:28 |
| corvus | clarkb: i was thinking after we make the switch; i don't think we need to announce before | 19:28 |
| clarkb | corvus: ack | 19:29 |
| clarkb | given that we've reverted in the past too I don't want to clean things up immediately. But maybe in say 3 months or something we should do cleanup? | 19:29 |
| clarkb | that should be long enough for us to know we're unlikely to go back to docker hub but not so long that people are stuck in the old stuff forever | 19:29 |
| corvus | yeah. also, i think it's fine to leave them there. just wanted to talk about the plan. | 19:29 |
| corvus | if we do remove, i think 3 months sounds good | 19:30 |
| clarkb | I think the main value in cleaning things up there is it will serve as a signal to people that those images are no longer usable | 19:30 |
| clarkb | in a more direct manner than an email announcement | 19:30 |
| clarkb | I'll put that as a rough timeline in the announcement after the change lands | 19:31 |
| clarkb | #topic Adding Debian Trixie Base Python Container Images | 19:31 |
| clarkb | once we're publishing to quay I'd also like to add trixie based images | 19:31 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/958480 | 19:31 |
| clarkb | I'm already noticing some trixie packages get updates that are not going into bookworm or bullseye so having the option to update to that platform seems like a good idea | 19:32 |
| clarkb | this should be safe to land once the move to quay is complete so this is mostly a heads up and request for reviews | 19:32 |
| clarkb | make sure I didn't cross any wires when adding the new stuff and publish trixie content to bookworm or vice versa | 19:33 |
| clarkb | #topic Dropping Ubuntu Bionic Test Nodes | 19:33 |
| clarkb | (If anyone thinks I'm rushing feel free to jump in and ask me to slow down) | 19:34 |
| clarkb | At this point I think opendev's direct reliance on bionic is gone | 19:34 |
| clarkb | but I wouldn't be surprised to learn I've missed some cleanup somewhere. Feel free to point any out to me or push fixes up yourselves | 19:34 |
| clarkb | the ansible 11 default change has caused people to notice that bionic isn't working and we're seeing slow cleanups elsewhere too | 19:35 |
| clarkb | corvus: I suspect that zuul will drop ansible 9 soon enough that opendev probably doesn't need to get ahead of that. We should mostly just ensure that we're not relying on it as much as possible, then when zuul updates we can drop the images entirely in opendev | 19:35 |
| clarkb | corvus: any concerns with that approach? | 19:35 |
| corvus | ... | 19:36 |
| clarkb | then the other thing is python2.7 has been running on bionic in a lot of places. It should still be possible to run python2.7 jobs on ubuntu jammy, but you need to pin tox to <4 as tox>=4 is not compatible with python2.7 virtualenvs | 19:37 |
| corvus | yes that all sounds right | 19:37 |
| clarkb | great, then we can delete bionic from our mirrors | 19:37 |
| fungi | i did already clean up bionic arm64 packages from the mirrors to free up space for volume moves | 19:38 |
| clarkb | ya we dropped arm64 bionic a while ago. So this is just x86_64 cleanup | 19:39 |
| clarkb | but that should still have a big impact (like 250GB or something along those lines) | 19:39 |
| fungi | removing bionic amd64 packages should free a similar amount of space, yes | 19:39 |
| clarkb | #topic Lists Server Slowness | 19:39 |
| fungi | i'm happy to repeat those steps, we have them documented but it's a bit convoluted to clean up the indices and make reprepro forget things from its database | 19:39 |
| clarkb | fungi: thanks! | 19:40 |
| clarkb | more and more people are noticing that lists.o.o suffers from frequent slowness | 19:40 |
| clarkb | last week we updated UA filter lists based on what was seen there and also restarted services to get it out of swap | 19:40 |
| fungi | yeah, we could i guess resize to a larger flavor without needing to move to a new ip address | 19:40 |
| clarkb | unfortunately this hasn't fixed the problem | 19:41 |
| clarkb | fungi: top reports high iowait while things are slow | 19:41 |
| fungi | yeah, which would be consistent with paging to swap as well | 19:41 |
| clarkb | maybe before resizing the instance we should try and determine where the iowait originates from as a resize may not help? | 19:41 |
| clarkb | fungi: yup though it happened after you restarted things and swap was almost empty | 19:41 |
| fungi | right | 19:42 |
| clarkb | I also suspect that mailman3 is not designed to cope with a barrage of crawler bots | 19:42 |
| fungi | but also it almost immediately moved a bunch of stuff to swap as cache/buffers use swelled | 19:42 |
| clarkb | I noticed in the logs that there are a lot of query url requests which I assume means mailman3 is linking to those queries so bots find them while crawling | 19:43 |
| clarkb | and then I suspect that something about those queries makes them less cacheable in django so it's like a snowball of all the problems | 19:43 |
| corvus | take a look at the cacti graphs | 19:43 |
| fungi | yes, the model of mailing list archives being held in a database accessed through a django frontend does seem a bit resource-hungry compared to the pipermail flat files served for mm2 archives | 19:43 |
| corvus | i don't see constant swap activity | 19:43 |
| fungi | it could just be that the database is slammed and it's not swap really, agreed | 19:44 |
| corvus | at first glance it looks like memory usage is "okay" -- in that there's sufficient free ram and it's really just swapping unused stuff out. | 19:44 |
| clarkb | maybe the database is the problem, ya | 19:44 |
| clarkb | it's possible we need to tune mariadb to better handle this workload. That seems like a promising thread to pull on before resorting to resizing the node | 19:45 |
| fungi | mm3 has some caching in there, but a cache probably works against your interests when crawlers are trying to access every single bit of the content | 19:45 |
| clarkb | yup, we saw similar with gitea | 19:45 |
| corvus | increasing the ram may allow more caches, so that's still something to consider. just noting it doesn't look like "out of memory", and looks more like the other stuff. | 19:45 |
| clarkb | (and had to replace its built in cache system with memcached) | 19:45 |
| clarkb | the django install is set up to use diskcache which is a sqlite based caching system for python | 19:46 |
| clarkb | not sure if it also uses in memory caches. But could also be that the sqlite system is io bound or mariadb or both | 19:46 |
| corvus | i bet that is an opportunity for improvement | 19:46 |
| fungi | is iotop a thing on linux? | 19:47 |
| corvus | it is | 19:47 |
| fungi | i know i've used it on *bsd to figure out where most of the i/o activity is concentrated | 19:47 |
| clarkb | ++ using iotop to determine where io is slow then debugging from there sounds like a great next step | 19:48 |
| clarkb | I think some of the ebpf tools can be used in a similar way if we have problems with iotop | 19:48 |
| fungi | yeah, whatever tools are good for it anyway, we need something that can work fairly granularly since device-level utilization won't tell us much in this case | 19:49 |
| clarkb | right. I should be able to look into that today or tomorrow | 19:49 |
| clarkb | I've already been staring at too many apache logs to try and understand what is happening on the front end better | 19:49 |
| fungi | also we could implement fairly naive request rate limiters with an apache mod if we need something more immediate | 19:50 |
| clarkb | email seems to still be processed reasonably quickly so I haven't been treating this as an emergency | 19:50 |
| clarkb | but as more people notice I just want to make sure it is on our radar and that we have a plan. Which it sounds like we now do | 19:50 |
| fungi | right, it's just the webui which has been struggling | 19:50 |
| fungi | e-mails may be getting delayed by seconds as well, but that's less obvious than when a web request is delayed by seconds | 19:51 |
| clarkb | yup | 19:51 |
| clarkb | I'll see what I can find about where the iowait is happening and we can take it from there | 19:51 |
| clarkb | #topic Open Discussion | 19:51 |
| clarkb | Before our hour is up was there anything else? | 19:52 |
| fungi | openeuler package mirror? | 19:52 |
| fungi | i have a couple of changes up to rip it out, but it's a judgement call whether we keep it hoping someone will turn up to re-add images | 19:52 |
| clarkb | oh yes, so iirc where that ended up was those interested in openeuler swapped the content of the mirror from release N-1 to N. But then ran into problems bootstrapping release N in dib and therefore nodepool | 19:52 |
| clarkb | and we haven't heard or seen anything since | 19:53 |
| clarkb | fungi: maybe we should go ahead and delete the content from the mirror for now but leave the openafs volume in place. That way it is easy to rebuild if someone shows up | 19:53 |
| fungi | it's been about a year since we paused the broken image builds | 19:53 |
| clarkb | but even then if someone shows up I think that we ask them to use the public mirror infrastructure like rocky does to start | 19:53 |
| fungi | sure, freeing the data utilization is most of the win anyway | 19:54 |
| corvus | it should be much easier for someone to work on that now. | 19:54 |
| clarkb | I'm +2 on cleaning up the content in openafs | 19:54 |
| fungi | #link https://review.opendev.org/959892 Stop mirroring OpenEuler packages | 19:55 |
| fungi | #link https://review.opendev.org/959893 Remove mirror.openeuler utilization graph | 19:55 |
| clarkb | I'll review those changes after the meeting | 19:55 |
| fungi | the latter we could leave in for now i guess if we aren't planning to delete the volume itself | 19:55 |
| fungi | but the first change is obviously necessary if i'm going to delete the data | 19:56 |
| clarkb | fungi: I just left a quick question on the first change | 19:57 |
| clarkb | basically you can make it a two step process if you want to reduce the amount of manual work | 19:57 |
| clarkb | I'll leave that up to you if you're volunteering to do the manual work though | 19:57 |
| fungi | yeah, automating the deletion makes some sense if we're not deleting the volume, since there are no manual steps required | 19:58 |
| fungi | if we were going to delete the volume, there's manual steps regardless | 19:58 |
| fungi | also we have a fair number of empty and abandoned afs volumes that could probably stand to be removed | 19:58 |
| clarkb | right. I guess the main reason I'm thinking keep the volume is that it allows someone to add the mirror easily without infra-root intervention beyond code review | 19:58 |
| clarkb | and unlike rocky/alma I worry that their mirror infrastructure is very china centric so may actually need us to mirror if we're running test nodes on openeuler | 19:59 |
| fungi | normally i wouldn't notice, but i've become acutely aware while moving them from server to server | 19:59 |
| clarkb | but we can always recreate that volume and others if we end up in that situation | 19:59 |
| clarkb | and we're just about at time. | 19:59 |
| clarkb | Thank you everyone. See you back here at the same time and location next week. Until then thanks again for working on OpenDev | 19:59 |
| clarkb | #endmeeting | 20:00 |
| opendevmeet | Meeting ended Tue Sep 9 20:00:02 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.html | 20:00 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.txt | 20:00 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-09-19.00.log.html | 20:00 |
| fungi | thanks clarkb! | 20:00 |
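
A rough illustration of the fact-cache audit clarkb describes at 19:07. It is only a sketch, not the script actually used: the cache directory path is an assumption (point it at wherever the jsonfile fact cache lives on the bastion), and it simply flags hosts that report a single vCPU or still show legacy ethN interface names.

```python
#!/usr/bin/env python3
# Sketch only: scan a jsonfile-format Ansible fact cache and flag hosts
# that look affected by the one-vCPU or interface-rename issues.
import json
import pathlib

CACHE_DIR = pathlib.Path("/var/cache/ansible/facts")  # assumed location

for fact_file in sorted(CACHE_DIR.iterdir()):
    try:
        facts = json.loads(fact_file.read_text())
    except (json.JSONDecodeError, OSError):
        continue  # skip anything that isn't a readable fact dump
    vcpus = facts.get("ansible_processor_vcpus")
    interfaces = facts.get("ansible_interfaces", [])
    # Flag hosts that only see one vCPU or still report legacy ethN names.
    if vcpus == 1 or any(name.startswith("eth") for name in interfaces):
        print(f"{fact_file.name}: vcpus={vcpus} interfaces={interfaces}")
```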
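The launch-node patch mentioned at 19:11 is not reproduced here; the snippet below is only a hypothetical sketch of the kind of check it implies, assuming SSH access to the freshly booted server and a working `nproc` on it.

```python
import subprocess

def has_enough_vcpus(host: str, minimum: int = 2) -> bool:
    """Return True if the host reports at least `minimum` online CPUs."""
    result = subprocess.run(
        ["ssh", host, "nproc"], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip()) >= minimum

# Example: refuse to keep a node that only came up with one vCPU.
if not has_enough_vcpus("server.example.org"):  # hypothetical hostname
    raise SystemExit("node only exposes one vCPU; delete and retry the launch")
```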
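On the caching discussion at 19:45-19:46: mailman3's web UI is a Django application, and the gitea precedent was moving from a built-in cache to memcached. The settings excerpt below is illustrative only; the cache directory and memcached location are assumptions, not the deployed mailman-web configuration.

```python
# settings.py excerpt -- illustrative only, not the deployed configuration.

# Current setup as described in the meeting: diskcache, a SQLite-backed
# cache, which turns every cache hit into local disk I/O.
CACHES = {
    "default": {
        "BACKEND": "diskcache.DjangoCache",
        "LOCATION": "/var/cache/mailman-web",  # assumed path
    }
}

# One possible alternative, mirroring what was done for gitea: an external
# memcached instance, which keeps the hot cache in RAM instead of on disk.
# CACHES = {
#     "default": {
#         "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
#         "LOCATION": "127.0.0.1:11211",
#     }
# }
```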
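Finally, a small sketch of the apache-log staring mentioned at 19:43 and 19:49: tallying query-style archive URLs and the user agents requesting them. The log path and the assumption of a standard combined log format are both guesses.

```python
import collections
import re

# Combined log format: the request is the first quoted field and the user
# agent is the last. This is a loose parse for eyeballing, not a robust one.
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*".*"([^"]*)"$')

paths = collections.Counter()
agents = collections.Counter()
with open("/var/log/apache2/lists.opendev.org-access.log") as log:  # assumed path
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        path, agent = match.groups()
        if "?q=" in path or "/search" in path:  # query-ish archive URLs
            paths[path.split("?")[0]] += 1
            agents[agent] += 1

for item, count in paths.most_common(10):
    print(count, item)
for item, count in agents.most_common(10):
    print(count, item)
```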