Tuesday, 2025-10-28

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Oct 28 19:00:27 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MRT6BQQHJPYJ43ENYTRSH4IOT6AR7FIW/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbI decided to have this meeting today despite the PTG happening this week because it has been some time since we had a meeting19:03
clarkbBut also keep in mind the PTG is happening this week. I've already put the meetpad servers in the ansible emergency file so they don't get randomly updated by upstream container image updates19:04
clarkb#topic Gerrit 3.11 Upgrade Planning19:05
clarkbAfter a zuul launcher upgrade issue my existing holds for testing this are no longer valid so I need to refresh them19:05
clarkbAt the summit the Gerrit folks didn't feel we were super far behind so that was encouraging19:06
clarkbI'm hoping to really start focusing on this again after this week and the ptg19:07
clarkb#topic Gerrit Spontaneous Shutdown During Summit19:07
clarkbThat said during the Summit fungi and ianychoi noticed that Gerrit was not running. It was spontaneously in a shutdown state19:07
clarkbfungi was able to restart the VM and then start containers. The main issue was that the h2 cache backing files were not cleared out before doing so, which made startup take a while. But it did start up and has been running since19:08
clarkbjust keep that in mind if you're restarting Gerrit for any reason: clearing out the h2 cache backing files can speed up startup.19:08
clarkbWe spoke to mfick about improving this at the summit and he felt he knew what a workable solution was and in fact had merged an attempt at it, but it didn't accommodate existing plugin expectations so it was reverted19:08
clarkbbut hopefully that means the issue can be addressed once plugin compatibility is addressed19:08
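The h2 cache cleanup mentioned above could be scripted roughly as follows. This is a minimal sketch, not the procedure OpenDev actually uses: the function name `clear_h2_cache_files` is hypothetical, the cache directory must be supplied by the caller, and the file suffixes are an assumption about how the H2 backing files are named on disk. It should only be run while Gerrit is stopped.

```python
from pathlib import Path


def clear_h2_cache_files(cache_dir, suffixes=(".h2.db", ".lock.db", ".trace.db")):
    """Delete H2 cache backing files in cache_dir and return the removed names.

    The suffixes are an assumption about the on-disk naming of the H2 files;
    check them against the actual cache directory before relying on this.
    """
    removed = []
    for path in sorted(Path(cache_dir).iterdir()):
        # Only touch regular files whose names end with one of the suffixes.
        if path.is_file() and path.name.endswith(tuple(suffixes)):
            path.unlink()
            removed.append(path.name)
    return removed
```

Returning the list of removed names makes it easy to log what was cleared before the containers are started again.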
clarkbAs for why Gerrit shut down, we also spoke to nova folks at the summit and something like running out of memory on the hypervisor could cause libvirt to request that VMs shut down, and nova wouldn't read that as an error that is bubbled back up to users19:09
clarkbso it seems like this is what caused the problem. Checking on that and mitigating the issue is something we'll have to bring up with vexxhost (which I don't think we have done yet)19:10
fungiyeah, it's outwardly consistent with the hypervisor host's oom killer reaping the qemu process, but without access to the host logs we don't know for sure19:10
fungimy money's on that though, because it's a very high-ram flavor19:10
clarkbya19:10
clarkb#topic Upgrading old servers19:13
clarkbI'm not aware of any movement on this topic. But as mentioned previously I think the backup servers are a good next location to focus this effort19:13
clarkbwe can replace them one at a time and transplant the current backup volumes onto the new servers to preserve that data19:13
tonybI did refresh the mediawiki patches19:14
fungiand we have plenty of cinder quota at this point, so new volumes are fine19:14
clarkbtonyb: oh cool, should we be reviewing them then?19:14
tonybI think the container build is probably mergeable19:14
clarkbexcellent /me makes a note on the todo list19:14
fungii would actually add fresh volumes and use the server replacement as an excuse to rotate them19:14
fungiand we can still detach the old volumes and attach to the new servers to make them easier to access19:14
tonybI also updated the ansible-devel series so reviewing them would be good19:15
clarkback also added to my todo list19:15
tonybI added a new, short-lived, ansible-next job that targets ansible-11 on bridge, rather than master19:15
tonybI figure we'll want to update bridge before we get rid of focal but that may not be a valid conclusion19:16
clarkbmakes sense since we're not really in a position to provide feedback upstream for unreleased stuff but having signal about where incompatibilities with the next release are is helpful to us19:16
clarkbtonyb: the current bridge is jammy so the order of operations there is a bit more vague I think19:16
tonybOkay 19:17
clarkbAnything else as far as updating servers goes? I'm glad there is progress and I just need to catch up!19:18
tonybI can change the ansible-next job to jammy if that's a more reasonable target19:18
clarkbtonyb: might be worth checking just to see if we can run ansible on jammy with the python version there19:18
clarkbto see if we need to upgrade to get new ansible or not19:18
tonybI still haven't actually tested the held MW node, but apart from that I think I'm good19:18
tonybclarkb: noted.19:19
clarkb#topic AFS mirror content cleanup19:20
clarkbI think this effort has largely stalled out (which is fine, major improvements have been made and the wins we see going forward are much smaller)19:21
clarkbI'm curious if A) anyone is interested in chasing that long tail of cleanup and B) if we think we're ready to start mirroring new stuff like say trixie packages?19:21
tonybI think "a" is still valuable, but I don't have cycles for it in the short term.  I have no opinion on "b".19:22
clarkbya maybe we need to put A on the backlog list etherpad linked to from specs19:23
clarkbfor B I'm happy to stick with the current status quo until people find they need it19:23
clarkbmostly taking temperature on that I guess19:23
funginoonedeadpunk indicated in #openstack-ansible earlier this week that he'd look into adding a reprepro config patch for trixie soon19:24
clarkbcool so there is interest and we can probably wait for that change to show up then19:24
fungithey apparently hit some mirror host identification bug in their jobs which was causing the pip.conf to list deb.debian.org as the pypi index host19:24
fungitraced back to having an empty mirror host variable19:25
clarkbthat's weird19:25
frickleryes, I had a similar issue with devstack19:25
clarkban unexpected fallback behavior for sure19:25
frickleriirc that is because we had to work around the missing mirror in dib/image builds19:26
clarkbI think the dib fallback was to use the upstream mirrors though19:27
clarkbanyway it's worth tracking down and we don't need to debug it now19:27
mnasiadkaI can help with cleanup if needed (or in some other area)19:27
clarkbanything else related to afs mirroring? I think we can followup on A and B after the meeting as people have time19:28
tonybPossibly related to: https://review.opendev.org/c/zuul/zuul-jobs/+/965008 "Allow mirror_fqdn to be overriden"19:28
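One way to guard against the empty mirror host variable described above is to fail loudly instead of rendering a pip.conf with an unintended index host. The sketch below is illustrative only: `pypi_index_url` is a hypothetical helper, the `/pypi/simple` path is an assumed layout, and it does not reflect how the affected jobs or zuul-jobs roles actually build the URL.

```python
def pypi_index_url(mirror_host):
    """Build a PyPI index URL from a mirror host, rejecting empty values.

    Raising here avoids silently falling back to some other host when the
    mirror variable was never set. The URL layout is an assumption.
    """
    host = (mirror_host or "").strip()
    if not host:
        raise ValueError("mirror host is empty; refusing to render pip.conf")
    return f"https://{host}/pypi/simple"
```

Calling it with a real host name returns the index URL, while an empty or unset value raises before any configuration file is written.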
clarkb#topic Zuul Launcher Updates19:30
clarkbAs a heads up there is a bug in zuul launcher that currently affects nodesets if the requested node boots fail19:30
clarkbzuul tries to recover inappropriately and then fails the nodeset19:30
clarkbthere is a fix for this currently in the zuul gate, but zuul ci hit problems due to the new pip release so it's been a slow march to get the fix landed19:31
clarkbthere was also a fix to some test cases identified to hopefully make the test cases more reliable. I'm hopeful with those two fixes in place we'll be able to land the launcher fix then restart launchers to address the node failure problem19:31
clarkbat this point I think we're on the right path to correcting this but wanted people to be aware19:31
clarkbany other zuul launcher concerns or feedback?19:32
clarkb#link https://review.opendev.org/c/zuul/zuul/+/964893 this is the node failure fixup19:32
clarkb#topic Matrix for OpenDev comms19:33
clarkbIn addition to the Gerrit upgrade this is the other item that is high on my todo list19:33
clarkbI should be able to start on room creation and work through some of the bits of the spec that don't require user facing changes19:34
clarkbthen when we're happy with the state of things we can make it more official and start porting usage over19:34
tonybSounds good19:34
clarkb#topic Etherpad 2.5.1 Upgrade19:36
clarkbEtherpad 2.5.0 was the version I was looking at previously with the broken but slightly improved css19:36
clarkbsince then there is a new 2.5.1 release so I need to update the upgrade change and recycle test nodes and check if css is happy now19:37
clarkbbut I didn't want to do that prior to or during the ptg so this is probably going to wait for a bit19:37
clarkb#link https://github.com/ether/etherpad-lite/blob/v2.5.1/CHANGELOG.md Is the upstream changelog19:37
clarkbI would say that often times their changelog is very incomplete19:38
clarkb#topic Gitea 1.24.7 Upgrade19:38
clarkbGitea has pushed a new release too19:38
clarkb#link https://review.opendev.org/c/opendev/system-config/+/964899/ Upgrade Gitea to 1.24.719:38
clarkbI think we can probably proceed with updating this service if it looks like the service itself is stable and not falling over due to crawlers19:38
clarkbthe screenshots looked good to me but please double check when you review the change19:39
clarkb#topic Gitea Performance19:39
tonybThey look good to me19:39
clarkbwhich brings us to the general gitea performance issue19:39
clarkbprior to the summit we thought that part of the problem was crawlers hitting backends directly19:39
clarkbthis meant that the load balancer couldn't really balance effectively as it is unaware of any direct connections19:39
clarkb#link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access19:40
clarkbthis change is a response to that. It will limit our ability to test specific backends without doing something like ssh port forwarding19:40
clarkbhowever, yesterday performance was poor and the traffic did seem to be going through the load balancer19:40
clarkbso forcing everything through the load balancer is unlikely to fix all the issues. That said I suspect it will generally be an improvement19:40
clarkbyesterday I had to block a particularly bad crawler's ip addresses after confirming it was crawling with an odd and apparently bogus user agent19:41
clarkbafter doing that things settled down a bit and the service seemed happier. Spot checking now seems to show things are still reasonably happy19:41
clarkbI did identify one other problematic crawler that I intended on blocking if things didn't improve after the first was blocked but that was not necessary19:42
clarkb(this crawler is using a specific cloud provider and I was going to block that cloud provider's ip range....)19:42
clarkbanyway I guess the point here is the battle is ongoing and I'm less certain 964728 will help significantly but I'm willing to try it if others think it is a good idea19:43
clarkbI'm also open to other ideas and help19:43
tonybWe can also (maybe?) use our existing UA-filter to create a block list for haproxy 19:44
tonybsomething like:19:44
funginot at that layer19:44
tonyb#link https://discourse.haproxy.org/t/howto-block-badbots-crawlers-scrapers-using-list-file/99519:44
clarkbya we're currently load balancing tcp not https19:44
fungiwe'd have to do it in apache since that's where https is terminated19:44
clarkbbut maybe if we force all traffic through the load balancer then a reasonable next step is terminating https there?19:45
clarkbmakes debugging even more difficult as clients don't see the backend specific altname19:45
tonybAhhh I see.19:45
clarkbbut we could do more magic with haproxy if it mitm'd the service19:45
clarkbI'm open to experiments though and ideas like that are worth pursuing if we can reconfigure the service setup to match19:46
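The user agent block list idea discussed above boils down to matching each request's User-Agent header against a list of known bad patterns. The sketch below shows that check in isolation, assuming simple case-insensitive substring matching; `is_blocked` and the example pattern are hypothetical and are not taken from the existing UA filter. As noted in the discussion, a check like this only works at a layer that can see the decrypted HTTP request, which a TCP-mode load balancer cannot.

```python
def is_blocked(user_agent, blocked_substrings):
    """Return True if user_agent contains any blocked substring.

    Matching is case-insensitive. Empty patterns are ignored so that a stray
    blank line in a list file does not block every request.
    """
    ua = (user_agent or "").lower()
    return any(pattern.lower() in ua for pattern in blocked_substrings if pattern)


# Example with a made-up crawler name:
# is_blocked("Mozilla/5.0 (compatible; ExampleBot/1.0)", ["examplebot"])
```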
clarkb#link PBR Updates to Handle Setuptools Deprecations19:47
clarkbThe last thing I wanted to call out today is that setuptools set a date of October 31 for removing some deprecated code that pbr relies on (specifically easy_install related stuff)19:47
clarkb#link https://review.opendev.org/c/openstack/pbr/+/964712/ and children aim to address this19:47
clarkbWe think this stack of changes should hopefully mitigate (thank you stephenfin)19:48
fungilooks like they're passing again now19:48
clarkbthe pip release broke pbr tests though so I had to fix those yesterday and now we're trying to land things again19:48
clarkbhopefully we can land the changes and get a release out tomorrow? but then be on the lookout for the next setuptools release and for any problems related to it19:48
clarkbI was brainstorming ways we might mitigate if necessary and I think we could do things like pin setuptools in our container images if not already doing so for things building container images. And also we could add pyproject.toml files to pin setuptools elsewhere19:49
clarkbthis assumes that becomes necessary and we're hoping it won't be19:49
clarkbdefinitely say something if you notice problems with setuptools in the near future.19:50
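The pyproject.toml pinning idea mentioned above could look roughly like the fragment rendered by this sketch. It is an illustration under assumptions: `render_build_system` is a hypothetical helper, the upper bound is a placeholder the caller must choose (the log does not say which setuptools release removes the deprecated code), and the `pbr.build` backend name should be checked against the pbr documentation for the version in use.

```python
def render_build_system(setuptools_upper_bound):
    """Return a pyproject.toml [build-system] fragment that caps setuptools.

    setuptools_upper_bound is a placeholder version string chosen by the
    caller; no real version number is implied here.
    """
    requires = ["pbr", f"setuptools<{setuptools_upper_bound}"]
    quoted = ", ".join(f'"{req}"' for req in requires)
    return (
        "[build-system]\n"
        f"requires = [{quoted}]\n"
        'build-backend = "pbr.build"\n'
    )
```

A cap like this only affects isolated PEP 517 builds that read pyproject.toml, so container images that install setuptools directly would still need their own pin.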
clarkb#topic Open Discussion19:50
clarkbAnything else?19:50
clarkbI'm going to be out on the 10th. The 11th is a holiday but I expect to be around and have a meeting19:51
clarkbSounds like that may be everything. Thank you everyone! We'll be back here next week at the same time and location.19:53
clarkb#endmeeting19:53
opendevmeetMeeting ended Tue Oct 28 19:53:31 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:53
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.html19:53
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.txt19:53
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-10-28-19.00.log.html19:53
tonybThanks all19:53
fungithanks!19:54
