| Nick | Message | Time |
|---|---|---|
| clarkb | meeting time | 19:00 |
| clarkb | #startmeeting infra | 19:00 |
| opendevmeet | Meeting started Tue Oct 28 19:00:27 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
| opendevmeet | The meeting name has been set to 'infra' | 19:00 |
| clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MRT6BQQHJPYJ43ENYTRSH4IOT6AR7FIW/ Our Agenda | 19:01 |
| clarkb | #topic Announcements | 19:01 |
| clarkb | I decided to have this meeting today despite the PTG happening this week because it has been some time since we had a meeting | 19:03 |
| clarkb | But also keep in mind the PTG is happening this week. I've already put the meetpad servers in the ansible emergency file so they don't get randomly updated by upstream container image updates | 19:04 |
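
As an aside for anyone unfamiliar with the emergency file mechanism: it is a host list on the bridge server that excludes hosts from the periodic ansible runs. A minimal sketch of what such an entry could look like, assuming an INI-style inventory with a `disabled` group; the path, group name, and hostnames here are illustrative guesses rather than the real OpenDev configuration:

```ini
# Hypothetical emergency inventory on the bridge host; the real path
# and group name in OpenDev's deployment may differ.
[disabled]
meetpad01.opendev.org
jvb01.opendev.org
```
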
| clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:05 |
| clarkb | After a zuul launcher upgrade issue my existing holds for testing this are no longer valid so I need to refresh them | 19:05 |
| clarkb | At the summit the Gerrit folks didn't feel we were super far behind so that was encouraging | 19:06 |
| clarkb | I'm hoping to really start focusing on this again after this week and the ptg | 19:07 |
| clarkb | #topic Gerrit Spontaneous Shutdown During Summit | 19:07 |
| clarkb | That said during the Summit fungi and ianychoi noticed that Gerrit was not running. It was spontaneously in a shutdown state | 19:07 |
| clarkb | fungi was able to restart the VM and then start containers. The main issue was that the h2 cache backing files were not cleared out before doing so, which made startup take a while. But it did start up and has been running since | 19:08 |
| clarkb | just keep that in mind: if you're restarting Gerrit for any reason, clearing out the h2 cache backing files can speed up startup. | 19:08 |
| clarkb | We spoke to mfick about improving this at the summit and he felt he knew what a workable solution was and in fact had merged an attempt at it, but it didn't accommodate existing plugin expectations so it was reverted | 19:08 |
| clarkb | but hopefully that means the issue can be addressed once plugin compatibility is addressed | 19:08 |
| clarkb | As for why Gerrit shut down, we also spoke to nova folks at the summit and something like running out of memory on the hypervisor could cause libvirt to request that VMs shut down, and then nova wouldn't treat that as an error that gets bubbled back up to users | 19:09 |
| clarkb | so it seems like this is what caused the problem. Checking on that and mitigating the issue is something we'll have to bring up with vexxhost (which I don't think we have done yet) | 19:10 |
| fungi | yeah, it's outwardly consistent with the hypervisor host's oom killer reaping the qemu process, but without access to the host logs we don't know for sure | 19:10 |
| fungi | my money's on that though, because it's a very high-ram flavor | 19:10 |
| clarkb | ya | 19:10 |
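
A minimal sketch of the h2 cache cleanup described above, assuming the conventional Gerrit site layout where the backing files live under `review_site/cache`; the path and file patterns are assumptions rather than a record of the actual procedure, and this should only run while Gerrit is stopped:

```python
# Sketch: remove Gerrit's h2 cache backing files before startup.
# Assumes Gerrit is stopped and that the site directory is
# /home/gerrit2/review_site -- adjust for the real deployment.
from pathlib import Path

cache_dir = Path("/home/gerrit2/review_site/cache")
for pattern in ("*.h2.db", "*.lock.db"):  # h2 data and lock files
    for f in cache_dir.glob(pattern):
        print(f"removing {f}")
        f.unlink()
```

Gerrit rebuilds these caches on startup, which is why removing them trades a cold cache for a much faster boot when the files have grown large.
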
| clarkb | #topic Upgrading old servers | 19:13 |
| clarkb | I'm not aware of any movement on this topic. But as mentioned previously I think the backup servers are a good next location to focus this effort | 19:13 |
| clarkb | we can replace them one at a time and transplant the current backup volumes onto the new servers to preserve that data | 19:13 |
| tonyb | I did refresh the mediawiki patches | 19:14 |
| fungi | and we have plenty of cinder quota at this point, so new volumes are fine | 19:14 |
| clarkb | tonyb: oh cool, should we be reviewing them then? | 19:14 |
| tonyb | I think the container build is probably mergeable | 19:14 |
| clarkb | excellent /me makes a note on the todo list | 19:14 |
| fungi | i would actually add fresh volumes and use the server replacement as an excuse to rotate them | 19:14 |
| fungi | and we can still detach the old volumes and attach to the new servers to make them easier to access | 19:14 |
| tonyb | I also updated the ansible-devel series so reviewing them would be good | 19:15 |
| clarkb | ack also added to my todo list | 19:15 |
| tonyb | I added a new, short-lived ansible-next job that targets ansible-11 on bridge, rather than master | 19:15 |
| tonyb | I figure we'll want to update bridge before we get rid of focal but that may not be a valid conclusion | 19:16 |
| clarkb | makes sense since we're not really in a position to provide feedback upstream for unreleased stuff but having signal about where incompatibilities with the next release are is helpful to us | 19:16 |
| clarkb | tonyb: the current bridge is jammy so the order of operations there is a bit more vague I think | 19:16 |
| tonyb | Okay | 19:17 |
| clarkb | Anything else as far as updating servers goes? I'm glad there is progress and I just need to catch up! | 19:18 |
| tonyb | I can change the ansible-next job to jammy if that's a more reasonable target | 19:18 |
| clarkb | tonyb: might be worth checking just to see if we can run ansible on jammy with the python version there | 19:18 |
| clarkb | to see if we need to upgrade to get new ansible or not | 19:18 |
| tonyb | I still haven't actually tested the held MW node, but apart from that I think I'm good | 19:18 |
| tonyb | clarkb: noted. | 19:19 |
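
As a concrete illustration of that check: newer ansible-core releases raise the minimum controller-side Python version (from memory, ansible-core 2.18, the basis of Ansible 11, wants Python 3.11+, while jammy ships 3.10 by default; both floors should be verified against the upstream release notes). A quick sketch:

```python
# Quick controller-side compatibility check; the (3, 11) floor for
# ansible-core 2.18 / Ansible 11 is an assumption to verify upstream.
import sys

REQUIRED = (3, 11)
if sys.version_info[:2] >= REQUIRED:
    print("this interpreter should be new enough for Ansible 11")
else:
    print("too old: need Python %d.%d+" % REQUIRED)
```
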
| clarkb | #topic AFS mirror content cleanup | 19:20 |
| clarkb | I think this effort has largely stalled out (which is fine, major improvements have been made and the wins we see going forward are much smaller) | 19:21 |
| clarkb | I'm curious if A) anyone is interested in chasing that long tail of cleanup and B) if we think we're ready to start mirroring new stuff like say trixie packages? | 19:21 |
| tonyb | I think "a" is still valuable, but I don't have cycles for it in the short term. I have no opinion on "b". | 19:22 |
| clarkb | ya maybe we need to put A on the backlog list etherpad linked to from the specs | 19:23 |
| clarkb | for B I'm happy to stick with the current status quo until people find they need it | 19:23 |
| clarkb | mostly taking the temperature on that I guess | 19:23 |
| fungi | noonedeadpunk indicated in #openstack-ansible earlier this week that he'd look into adding a reprepro config patch for trixie soon | 19:24 |
| clarkb | cool so there is interest and we can probably wait for that change to show up then | 19:24 |
| fungi | they apparently hit some mirror host identification bug in their jobs which was causing the pip.conf to list deb.debian.org as the pypi index host | 19:24 |
| fungi | traced back to having an empty mirror host variable | 19:25 |
| clarkb | that's weird | 19:25 |
| frickler | yes, I had a similar issue with devstack | 19:25 |
| clarkb | an unexpected fallback behavior for sure | 19:25 |
| frickler | iirc that is because we had to work around the missing mirror in dib/image builds | 19:26 |
| clarkb | I think the dib fallback was to use the upstream mirrors though | 19:27 |
| clarkb | anyway it's worth tracking down and we don't need to debug it now | 19:27 |
| mnasiadka | I can help with cleanup if needed (or in some other area) | 19:27 |
| clarkb | anything else related to afs mirroring? I think we can followup on A and B after the meeting as people have time | 19:28 |
| tonyb | Possibly related to: https://review.opendev.org/c/zuul/zuul-jobs/+/965008 "Allow mirror_fqdn to be overriden" | 19:28 |
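
For reference, adding a new suite to a reprepro mirror is typically just a new stanza in the `conf/distributions` and `conf/updates` files. This is a hedged sketch of roughly what a trixie entry could look like, not the actual patch noonedeadpunk was planning; the architectures, components, and update rule name are guesses:

```
# conf/distributions (sketch)
Origin: Debian
Label: Debian
Codename: trixie
Architectures: amd64 arm64
Components: main
Update: debian-trixie

# conf/updates (sketch)
Name: debian-trixie
Method: https://deb.debian.org/debian
Suite: trixie
Components: main
```
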
| clarkb | #topic Zuul Launcher Updates | 19:30 |
| clarkb | As a heads up there is a bug in zuul launcher that currently affects nodesets if the requested node boots fail | 19:30 |
| clarkb | zuul tries to recover inappropriately and then fails the nodeset | 19:30 |
| clarkb | there is a fix for this currently in the zuul gate, but zuul ci hit problems due to the new pip release so it's been a slow march to get the fix landed | 19:31 |
| clarkb | there was also a fix identified for some test cases, to hopefully make them more reliable. I'm hopeful that with those two fixes in place we'll be able to land the launcher fix and then restart launchers to address the node failure problem | 19:31 |
| clarkb | at this point I think we're on the right path to correcting this but wanted people to be aware | 19:31 |
| clarkb | any other zuul launcher concerns or feedback? | 19:32 |
| clarkb | #link https://review.opendev.org/c/zuul/zuul/+/964893 this is the node failure fixup | 19:32 |
| clarkb | #topic Matrix for OpenDev comms | 19:33 |
| clarkb | In addition to the Gerrit upgrade this is the other item that is high on my todo list | 19:33 |
| clarkb | I should be able to start on room creation and work through some of the bits of the spec that don't require user facing changes | 19:34 |
| clarkb | then when we're happy with the state of things we can make it more official and start porting usage over | 19:34 |
| tonyb | Sounds good | 19:34 |
| clarkb | #topic Etherpad 2.5.1 Upgrade | 19:36 |
| clarkb | Etherpad 2.5.0 was the version I was looking at previously with the broken but slightly improved css | 19:36 |
| clarkb | since then there is a new 2.5.1 release so I need to update the upgrade change and recycle test nodes and check if css is happy now | 19:37 |
| clarkb | but I didn't want to do that prior to or during the ptg so this is probably going to wait for a bit | 19:37 |
| clarkb | #link https://github.com/ether/etherpad-lite/blob/v2.5.1/CHANGELOG.md Is the upstream changelog | 19:37 |
| clarkb | I would say that their changelog is often very incomplete | 19:38 |
| clarkb | #topic Gitea 1.24.7 Upgrade | 19:38 |
| clarkb | Gitea has pushed a new release too | 19:38 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/964899/ Upgrade Gitea to 1.24.7 | 19:38 |
| clarkb | I think we can probably proceed with updating this service if it looks like the service itself is stable and not falling over due to crawlers | 19:38 |
| clarkb | the screenshots looked good to me but please double check when you review the change | 19:39 |
| clarkb | #topic Gitea Performance | 19:39 |
| tonyb | They look good to me | 19:39 |
| clarkb | which brings us to the general gitea performance issue | 19:39 |
| clarkb | prior to the summit we thought that part of the problem was crawlers hitting backends directly | 19:39 |
| clarkb | this meant that the load balancer couldn't really balance effectively as it is unaware of any direct connections | 19:39 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access | 19:40 |
| clarkb | this change is a response to that. It will limit our ability to test specific backends without doing something like ssh port forwarding | 19:40 |
| clarkb | however, yesterday performance was poor and the traffic did seem to be going through the load balancer | 19:40 |
| clarkb | so forcing everything through the load balancer is unlikely to fix all the issues. That said I suspect it will generally be an improvement | 19:40 |
| clarkb | yesterday I had to block a particularly bad crawler's ip addresses after confirming it was crawling with odd and apparently bogus user agents | 19:41 |
| clarkb | after doing that things settled down a bit and the service seemed happier. Spot checking now seems to show things are still reasonably happy | 19:41 |
| clarkb | I did identify one other problematic crawler that I intended on blocking if things didn't improve after the first was blocked but that was not necessary | 19:42 |
| clarkb | (this crawler is using a specific cloud provider and I was going to block that cloud provider's ip range....) | 19:42 |
| clarkb | anyway I guess the point here is the battle is ongoing and I'm less certain 964728 will help significantly but I'm willing to try it if others think it is a good idea | 19:43 |
| clarkb | I'm also open to other ideas and help | 19:43 |
| tonyb | We can also (maybe?) use our existing UA-filter to create a block list for haproxy | 19:44 |
| tonyb | something like: | 19:44 |
| fungi | not at that layer | 19:44 |
| tonyb | #link https://discourse.haproxy.org/t/howto-block-badbots-crawlers-scrapers-using-list-file/995 | 19:44 |
| clarkb | ya we're currently load balancing tcp not https | 19:44 |
| fungi | we'd have to do it in apache since that's where https is terminated | 19:44 |
| clarkb | but maybe if we force all traffic through the load balancer then a reasonable next step is terminating https there? | 19:45 |
| clarkb | that makes debugging even more difficult as clients don't see the backend-specific altname | 19:45 |
| tonyb | Ahhh I see. | 19:45 |
| clarkb | but we could do more magic with haproxy if it mitm'd the service | 19:45 |
| clarkb | I'm open to experiments though and ideas like that are worth pursuing if we can reconfigure the service setup to match | 19:46 |
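
To make the constraint fungi and clarkb describe concrete: in TCP mode haproxy only sees an opaque TLS byte stream, so a user-agent deny list along the lines of the linked howto only works if the frontend terminates TLS and runs in HTTP mode. A hedged sketch, with file paths, names, and addresses invented for illustration:

```
# haproxy.cfg sketch -- only possible if haproxy terminates TLS itself
frontend gitea_https
    bind :443 ssl crt /etc/haproxy/certs/gitea.pem
    mode http
    # deny requests whose User-Agent matches an entry in the bot list
    http-request deny if { req.hdr(user-agent) -m sub -i -f /etc/haproxy/badbots.lst }
    default_backend gitea_backends

backend gitea_backends
    mode http
    balance leastconn
    server gitea09 10.0.0.9:3081 check
```

The trade-off discussed above applies: terminating TLS at the load balancer enables this kind of filtering but hides the backend-specific certificate altnames from clients, making per-backend debugging harder.
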
| clarkb | #topic PBR Updates to Handle Setuptools Deprecations | 19:47 |
| clarkb | The last thing I wanted to call out today is that setuptools set a date of October 31 for removing some deprecated code that pbr relies on (specifically easy_install related stuff) | 19:47 |
| clarkb | #link https://review.opendev.org/c/openstack/pbr/+/964712/ and children aim to address this | 19:47 |
| clarkb | We think this stack of changes should hopefully mitigate that (thank you stephenfin) | 19:48 |
| fungi | looks like they're passing again now | 19:48 |
| clarkb | the pip release broke pbr tests though so I had to fix those yesterday and now we're trying to land things again | 19:48 |
| clarkb | hopefully we can land the changes and get a release out tomorrow? but then be on the lookout for the next setuptools release and for any problems related to it | 19:48 |
| clarkb | I was brainstorming ways we might mitigate if necessary and I think we could do things like pin setuptools in our container images (if not already doing so) for things building container images. And also we could add pyproject.toml files to pin setuptools elsewhere | 19:49 |
| clarkb | this assumes that becomes necessary and we're hoping it won't be | 19:49 |
| clarkb | definitely say something if you notice problems with setuptools in the near future. | 19:50 |
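
As an example of the pyproject.toml mitigation mentioned above, a project could cap the build-time setuptools in its build-system requirements; the exact bound here is a placeholder for illustration, not a tested cutoff:

```toml
# pyproject.toml sketch: cap setuptools at build time until pbr has a
# released fix; the <80 bound is illustrative, not a verified number.
[build-system]
requires = ["pbr", "setuptools<80"]
build-backend = "setuptools.build_meta"
```
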
| clarkb | #topic Open Discussion | 19:50 |
| clarkb | Anything else? | 19:50 |
| clarkb | I'm going to be out on the 10th. The 11th is a holiday but I expect to be around and have a meeting | 19:51 |
| clarkb | Sounds like that may be everything. Thank you everyone! We'll be back here next week at the same time and location. | 19:53 |
| clarkb | #endmeeting | 19:53 |
| opendevmeet | Meeting ended Tue Oct 28 19:53:31 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:53 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.html | 19:53 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.txt | 19:53 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.log.html | 19:53 |
| tonyb | Thanks all | 19:53 |
| fungi | thanks! | 19:54 |