19:00:27 <clarkb> #startmeeting infra
19:00:27 <opendevmeet> Meeting started Tue Oct 28 19:00:27 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:27 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:27 <opendevmeet> The meeting name has been set to 'infra'
19:01:30 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MRT6BQQHJPYJ43ENYTRSH4IOT6AR7FIW/ Our Agenda
19:01:35 <clarkb> #topic Announcements
19:03:45 <clarkb> I decided to have this meeting today despite the PTG happening this week because it has been some time since we had a meeting
19:04:11 <clarkb> But also keep in mind the PTG is happening this week. I've already put the meetpad servers in the ansible emergency file so they don't get randomly updated by upstream container image updates
19:05:30 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:05:53 <clarkb> After a zuul launcher upgrade issue my existing holds for testing this are no longer valid so I need to refresh them
19:06:55 <clarkb> At the summit the Gerrit folks didn't feel we were super far behind so that was encouraging
19:07:06 <clarkb> I'm hoping to really start focusing on this again after this week and the ptg
19:07:14 <clarkb> #topic Gerrit Spontaneous Shutdown During Summit
19:07:37 <clarkb> That said, during the Summit fungi and ianychoi noticed that Gerrit was not running. It had spontaneously ended up in a shutdown state
19:08:03 <clarkb> fungi was able to restart the VM and then start the containers. The main issue was that the h2 cache backing files were not cleared out before doing so, which made startup take a while. But it did start up and has been running since
19:08:18 <clarkb> just keep that in mind: if you're restarting Gerrit for any reason, clearing out the h2 cache backing files can speed up startup.
19:08:46 <clarkb> We spoke to mfick about improving this at the summit and he felt he knew what a workable solution was, and in fact had merged an attempt at it, but it didn't accommodate existing plugin expectations so was reverted
19:08:56 <clarkb> but hopefully that means the issue can be addressed once plugin compatibility is sorted out
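As a minimal sketch of the cache cleanup mentioned above, assuming a conventional Gerrit site layout (the site path below is an assumption, not the actual deployment's; Gerrit's persistent caches live as h2 database files under the site's cache/ directory):

```python
# Minimal sketch: remove Gerrit's h2 cache backing files so they get
# rebuilt on startup. Only run this while Gerrit is stopped.
from pathlib import Path

GERRIT_SITE = Path("/var/gerrit")  # assumed site path; adjust for the real deployment


def clear_h2_caches(site: Path) -> None:
    """Delete the *.h2.db cache files under the site's cache/ directory."""
    for cache_file in site.glob("cache/*.h2.db"):
        print(f"removing {cache_file}")
        cache_file.unlink()


if __name__ == "__main__":
    clear_h2_caches(GERRIT_SITE)
```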
19:09:34 <clarkb> As for why Gerrit shut down, we also spoke to nova folks at the summit and something like running out of memory on the hypervisor could cause libvirt to request that VMs shut down, and nova wouldn't treat that as an error that gets bubbled back up to users
19:10:00 <clarkb> so it seems like this is what caused the problem. Checking on that and mitigating the issue is something we'll have to bring up with vexxhost (which I don't think we have done yet)
19:10:05 <fungi> yeah, it's outwardly consistent with the hypervisor host's oom killer reaping the qemu process, but without access to the host logs we don't know for sure
19:10:20 <fungi> my money's on that though, because it's a very high-ram flavor
19:10:25 <clarkb> ya
19:13:02 <clarkb> #topic Upgrading old servers
19:13:19 <clarkb> I'm not aware of any movement on this topic. But as mentioned previously I think the backup servers are a good next location to focus this effort
19:13:33 <clarkb> we can replace them one at a time and transplant the current backup volumes onto the new servers to preserve that data
19:14:01 <tonyb> I did refresh the mediawiki patches
19:14:09 <fungi> and we have plenty of cinder quota at this point, so new volumes are fine
19:14:24 <clarkb> tonyb: oh cool, should we be reviewing them then?
19:14:26 <tonyb> I think the container build is probably mergeable
19:14:33 <clarkb> excellent /me makes a note on the todo list
19:14:38 <fungi> i would actually add fresh volumes and use the server replacement as an excuse to rotate them
19:14:57 <fungi> and we can still detach the old volumes and attach to the new servers to make them easier to access
19:15:00 <tonyb> I also updated the ansible-devel series so reviewing them would be good
19:15:23 <clarkb> ack also added to my todo list
19:15:38 <tonyb> I added a new, short lived, ansible-next job that targets ansible-11 on bridge, rather than master
19:16:15 <tonyb> I figure we'll want to update bridge before we get rid of focal but that may not be a valid conclusion
19:16:21 <clarkb> makes sense since we're not really in a position to provide feedback upstream for unreleased stuff, but having signal about where incompatibilities with the next release are is helpful to us
19:16:40 <clarkb> tonyb: the current bridge is jammy so the order of operations there is a bit more vague I think
19:17:16 <tonyb> Okay
19:18:01 <clarkb> Anything else as far as updating servers goes? I'm glad there is progress and I just need to catch up!
19:18:16 <tonyb> I can change the ansible-next job to jammy if that's a more reasonable target
19:18:39 <clarkb> tonyb: might be worth checking just to see if we can run ansible on jammy with the python version there
19:18:46 <clarkb> to see if we need to upgrade to get new ansible or not
19:18:54 <tonyb> I still haven't actually tested the held MW node, but apart from that I think I'm good
19:19:12 <tonyb> clarkb: noted.
19:20:41 <clarkb> #topic AFS mirror content cleanup
19:21:01 <clarkb> I think this effort has largely stalled out (which is fine, major improvements have been made and the wins we see going forward are much smaller)
19:21:24 <clarkb> I'm curious if A) anyone is interested in chasing that long tail of cleanup and B) if we think we're ready to start mirroring new stuff like say trixie packages?
19:22:56 <tonyb> I think "a" is still valuable, but I don't have cycles for it in the short term.  I have no opinion on "b".
19:23:29 <clarkb> ya maybe we need to put A on the backlog list etherpad linked to from specs
19:23:48 <clarkb> for B I'm happy to stick with the current status quo until people find they need it
19:23:56 <clarkb> mostly taking the temperature on that I guess
19:24:03 <fungi> noonedeadpunk indicated in #openstack-ansible earlier this week that he'd look into adding a reprepro config patch for trixie soon
19:24:19 <clarkb> cool so there is interest and we can probably wait for that change to show up then
19:24:51 <fungi> they apparently hit some mirror host identification bug in their jobs which was causing the pip.conf to list deb.debian.org as the pypi index host
19:25:14 <fungi> traced back to having an empty mirror host variable
19:25:32 <clarkb> that's weird
19:25:33 <frickler> yes, I had a similar issue with devstack
19:25:45 <clarkb> an unexpected fallback behavior for sure
19:26:43 <frickler> iirc that is because we had to work around the missing mirror in dib/image builds
19:27:05 <clarkb> I think the dib fallback was to use the upstream mirrors though
19:27:22 <clarkb> anyway it's worth tracking down, but we don't need to debug it now
19:27:37 <mnasiadka> I can help with cleanup if needed (or in some other area)
19:28:13 <clarkb> anything else related to afs mirroring? I think we can followup on A and B after the meeting as people have time
19:28:24 <tonyb> Possibly related to: https://review.opendev.org/c/zuul/zuul-jobs/+/965008 "Allow mirror_fqdn to be overriden"
19:30:13 <clarkb> #topic Zuul Launcher Updates
19:30:30 <clarkb> As a heads up there is a bug in zuul launcher that currently affects nodesets if the requested node boots fail
19:30:40 <clarkb> zuul tries to recover inappropriately and then fails the nodeset
19:31:02 <clarkb> there is a fix for this currently in the zuul gate, but zuul ci hit problems due to the new pip release so it's been a slow march to get the fix landed
19:31:36 <clarkb> a fix was also identified for some of the test cases to hopefully make them more reliable. I'm hopeful that with those two fixes in place we'll be able to land the launcher fix and then restart the launchers to address the node failure problem
19:31:59 <clarkb> at this point I think we're on the right path to correcting this but wanted people to be aware
19:32:26 <clarkb> any other zuul launcher concerns or feedback?
19:32:41 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/964893 this is the node failure fixup
19:33:29 <clarkb> #topic Matrix for OpenDev comms
19:33:44 <clarkb> In addition to the Gerrit upgrade this is the other item that is high on my todo list
19:34:06 <clarkb> I should be able to start on room creation and work through some of the bits of the spec that don't require user facing changes
19:34:20 <clarkb> then when we're happy with the state of things we can make it more official and start porting usage over
19:34:39 <tonyb> Sounds good
19:36:33 <clarkb> #topic Etherpad 2.5.1 Upgrade
19:36:51 <clarkb> Etherpad 2.5.0 was the version I was looking at previously with the broken but slightly improved css
19:37:13 <clarkb> since then there is a new 2.5.1 release so I need to update the upgrade change and recycle test nodes and check if css is happy now
19:37:26 <clarkb> but I didn't want to do that prior to or during the ptg so this is probably going to wait for a bit
19:37:38 <clarkb> #link https://github.com/ether/etherpad-lite/blob/v2.5.1/CHANGELOG.md Is the upstream changelog
19:38:06 <clarkb> I would say that their changelog is often quite incomplete
19:38:16 <clarkb> #topic Gitea 1.24.7 Upgrade
19:38:21 <clarkb> Gitea has pushed a new release too
19:38:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/964899/ Upgrade Gitea to 1.24.7
19:38:47 <clarkb> I think we can probably proceed with updating this service if it looks like the service itself is stable and not falling over due to crawlers
19:39:04 <clarkb> the screenshots looked good to me but please double check when you review the change
19:39:10 <clarkb> #topic Gitea Performance
19:39:16 <tonyb> They look good to me
19:39:24 <clarkb> which brings us to the general gitea performance issue
19:39:42 <clarkb> prior to the summit we thought that part of the problem was crawlers hitting backends directly
19:39:55 <clarkb> this meant that the load balancer couldn't really balance effectively as it is unaware of any direct connections
19:40:00 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access
19:40:14 <clarkb> this change is a response to that. It will limit our ability to test specific backends without doing something like ssh port forwarding
19:40:34 <clarkb> however, yesterday performance was poor and the traffic did seem to be going through the load balancer
19:40:56 <clarkb> so forcing everything through the load balancer is unlikely to fix all the issues. That said I suspect it will generally be an improvement
19:41:26 <clarkb> yesterday I had to block a particularly bad crawler's ip addresses after confirming it was crawling with an odd and apparently bogus user agent
19:41:50 <clarkb> after doing that things settled down a bit and the service seemed happier. Spot checking now seems to show things are still reasonably happy
19:42:19 <clarkb> I did identify one other problematic crawler that I intended on blocking if things didn't improve after the first was blocked but that was not necessary
19:42:33 <clarkb> (this crawler is using a specific cloud provider and I was going to block that cloud provider's ip range....)
19:43:02 <clarkb> anyway I guess the point here is the battle is ongoing and I'm less certain 964728 will help significantly but I'm willing to try it if others think it is a good idea
19:43:16 <clarkb> I'm also open to other ideas and help
19:44:12 <tonyb> We can also (maybe?) use our existing UA-filter to create a block list for haproxy
19:44:29 <tonyb> something like:
19:44:30 <fungi> not at that layer
19:44:33 <tonyb> #link https://discourse.haproxy.org/t/howto-block-badbots-crawlers-scrapers-using-list-file/995
19:44:44 <clarkb> ya we're currently load balancing tcp not https
19:44:48 <fungi> we'd have to do it in apache since that's where https is terminated
19:45:13 <clarkb> but maybe if we force all traffic through the load balancer then a reasonable next step is terminating https there?
19:45:29 <clarkb> makes debugging even more difficult as clients don't see the backend specific altname
19:45:34 <tonyb> Ahhh I see.
19:45:40 <clarkb> but we could do more magic with haproxy if it mitm'd the service
19:46:31 <clarkb> I'm open to experiments though and ideas like that are worth pursuing if we can reconfigure the service setup to match
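If we do end up filtering by user agent at the Apache layer (where HTTPS is terminated today, per the discussion above), a hypothetical sketch could look something like the following; the vhost placement and the user-agent pattern are placeholders, not observed crawler strings:

```apache
# Hypothetical user-agent block for the Apache vhost fronting a gitea
# backend. The pattern below is a placeholder; substitute the actual
# bogus user agents observed in the logs.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} "BadBot|SketchyScraper" [NC]
    RewriteRule ^ - [F]
</IfModule>
```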
19:47:27 <clarkb> #topic PBR Updates to Handle Setuptools Deprecations
19:47:50 <clarkb> The last thing I wanted to call out today is that setuptools set a date of october 31 for removing some deprecated code that pbr relies on (specifically easy_install related stuff)
19:47:57 <clarkb> #link https://review.opendev.org/c/openstack/pbr/+/964712/ and children aim to address this
19:48:15 <clarkb> We think this stack of changes should hopefully mitigate the problem (thank you stephenfin)
19:48:28 <fungi> looks like they're passing again now
19:48:32 <clarkb> the pip release broke pbr tests though so I had to fix those yesterday and now we're trying to land things again
19:48:59 <clarkb> hopefully we can land the changes and get a release out tomorrow, but then be on the lookout for the next setuptools release and for any problems related to it
19:49:35 <clarkb> I was brainstorming ways we might mitigate if necessary, and I think we could do things like pin setuptools in our container images (if not already doing so) for things building container images. We could also add pyproject.toml files to pin setuptools elsewhere
19:49:43 <clarkb> this assumes that becomes necessary and we're hoping it won't be
19:50:26 <clarkb> definitely say something if you notice problems with setuptools in the near future.
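As a hedged sketch of the pyproject.toml pinning idea mentioned above (the pbr and setuptools version bounds are placeholders, not tested constraints):

```toml
# Hypothetical pyproject.toml for a pbr-based project, pinning setuptools
# below a release that drops the deprecated bits pbr still relies on.
[build-system]
requires = ["pbr>=6.0", "setuptools<80"]
build-backend = "setuptools.build_meta"
```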
19:50:30 <clarkb> #topic Open Discussion
19:50:32 <clarkb> Anything else?
19:51:10 <clarkb> I'm going to be out on the 10th. The 11th is a holiday but I expect to be around and have a meeting
19:53:21 <clarkb> Sounds like that may be everything. Thank you everyone! We'll be back here next week at the same time and location.
19:53:31 <clarkb> #endmeeting