19:00:27 #startmeeting infra
19:00:27 Meeting started Tue Oct 28 19:00:27 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:27 The meeting name has been set to 'infra'
19:01:30 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MRT6BQQHJPYJ43ENYTRSH4IOT6AR7FIW/ Our Agenda
19:01:35 #topic Announcements
19:03:45 I decided to have this meeting today despite the PTG happening this week because it has been some time since we had a meeting
19:04:11 But also keep in mind the PTG is happening this week. I've already put the meetpad servers in the ansible emergency file so they don't get randomly updated by upstream container image updates
19:05:30 #topic Gerrit 3.11 Upgrade Planning
19:05:53 After a zuul launcher upgrade issue my existing holds for testing this are no longer valid so I need to refresh them
19:06:55 At the summit the Gerrit folks didn't feel we were super far behind, so that was encouraging
19:07:06 I'm hoping to really start focusing on this again after this week and the ptg
19:07:14 #topic Gerrit Spontaneous Shutdown During Summit
19:07:37 That said, during the Summit fungi and ianychoi noticed that Gerrit was not running. It was spontaneously in a shutdown state
19:08:03 fungi was able to restart the VM and then start the containers. The main issue was that the h2 cache backing files were not cleared out before doing so, which made startup take a while. But it did start up and has been running since
19:08:18 just keep that in mind: if you're restarting Gerrit for any reason, clearing out the h2 cache backing files can speed up startup.
19:08:46 We spoke to mfick about improving this at the summit and he felt he knew what a workable solution was, and in fact had merged an attempt at it, but it didn't accommodate existing plugin expectations so it was reverted
19:08:56 but hopefully that means the issue can be addressed once plugin compatibility is addressed
19:09:34 As for why Gerrit shut down, we also spoke to nova folks at the summit, and something like running out of memory on the hypervisor could cause libvirt to request that VMs shut down, and nova wouldn't treat that as an error that gets bubbled back up to users
19:10:00 so it seems like this is what caused the problem. Checking on that and mitigating the issue is something we'll have to bring up with vexxhost (which I don't think we have done yet)
19:10:05 yeah, it's outwardly consistent with the hypervisor host's oom killer reaping the qemu process, but without access to the host logs we don't know for sure
19:10:20 my money's on that though, because it's a very high-ram flavor
19:10:25 ya
19:13:02 #topic Upgrading old servers
19:13:19 I'm not aware of any movement on this topic. But as mentioned previously, I think the backup servers are a good next location to focus this effort
19:13:33 we can replace them one at a time and transplant the current backup volumes onto the new servers to preserve that data
19:14:01 I did refresh the mediawiki patches
19:14:09 and we have plenty of cinder quota at this point, so new volumes are fine
19:14:24 tonyb: oh cool, should we be reviewing them then?
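A minimal sketch of the h2 cache cleanup mentioned in the Gerrit discussion above, assuming Gerrit's disk caches are persisted as H2 database files (*.h2.db and friends) under the site's cache/ directory and that the container is already stopped; the site path is illustrative, not necessarily the production layout:

```python
#!/usr/bin/env python3
"""Clear Gerrit's H2 cache backing files before a restart.

Assumptions: the review site lives at /home/gerrit2/review_site
(illustrative path) and the Gerrit container is already stopped.
Removing these files trades warm caches for a faster startup; Gerrit
rebuilds them as it runs.
"""
from pathlib import Path

CACHE_DIR = Path("/home/gerrit2/review_site/cache")  # assumed site layout


def clear_h2_caches(cache_dir: Path) -> None:
    # H2 keeps the database plus trace/lock companions; match all of them.
    for pattern in ("*.h2.db", "*.trace.db", "*.lock.db"):
        for cache_file in cache_dir.glob(pattern):
            print(f"removing {cache_file}")
            cache_file.unlink()


if __name__ == "__main__":
    clear_h2_caches(CACHE_DIR)
```

After clearing the files, start the containers as usual; the first requests repopulate the caches.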
19:14:26 I think the container build is probably mergeable
19:14:33 excellent /me makes a note on the todo list
19:14:38 i would actually add fresh volumes and use the server replacement as an excuse to rotate them
19:14:57 and we can still detach the old volumes and attach them to the new servers to make them easier to access
19:15:00 I also updated the ansible-devel series so reviewing them would be good
19:15:23 ack, also added to my todo list
19:15:38 I added a new, short-lived ansible-next job that targets ansible-11 on bridge, rather than master
19:16:15 I figure we'll want to update bridge before we get rid of focal but that may not be a valid conclusion
19:16:21 makes sense since we're not really in a position to provide feedback upstream for unreleased stuff, but having signal about where incompatibilities with the next release are is helpful to us
19:16:40 tonyb: the current bridge is jammy so the order of operations there is a bit more vague I think
19:17:16 Okay
19:18:01 Anything else as far as updating servers goes? I'm glad there is progress and I just need to catch up!
19:18:16 I can change the ansible-next job to jammy if that's a more reasonable target
19:18:39 tonyb: might be worth checking just to see if we can run ansible on jammy with the python version there
19:18:46 to see if we need to upgrade to get new ansible or not
19:18:54 I still haven't actually tested the held MW node, but apart from that I think I'm good
19:19:12 clarkb: noted.
19:20:41 #topic AFS mirror content cleanup
19:21:01 I think this effort has largely stalled out (which is fine, major improvements have been made and the wins we see going forward are much smaller)
19:21:24 I'm curious if A) anyone is interested in chasing that long tail of cleanup and B) if we think we're ready to start mirroring new stuff like say trixie packages?
19:22:56 I think "a" is still valuable, but I don't have cycles for it in the short term. I have no opinion on "b".
19:23:29 ya maybe we need to put A on the backlog list etherpad linked to from specs
19:23:48 for B I'm happy to stick with the current status quo until people find they need it
19:23:56 mostly taking temperature on that I guess
19:24:03 noonedeadpunk indicated in #openstack-ansible earlier this week that he'd look into adding a reprepro config patch for trixie soon
19:24:19 cool so there is interest and we can probably wait for that change to show up then
19:24:51 they apparently hit some mirror host identification bug in their jobs which was causing the pip.conf to list deb.debian.org as the pypi index host
19:25:14 traced back to having an empty mirror host variable
19:25:32 that's weird
19:25:33 yes, I had a similar issue with devstack
19:25:45 an unexpected fallback behavior for sure
19:26:43 iirc that is because we had to work around the missing mirror in dib/image builds
19:27:05 I think the dib fallback was to use the upstream mirrors though
19:27:22 anyway it's worth tracking down and we don't need to debug it now
19:27:37 I can help with cleanup if needed (or in some other area)
19:28:13 anything else related to afs mirroring?
I think we can follow up on A and B after the meeting as people have time
19:28:24 Possibly related to: https://review.opendev.org/c/zuul/zuul-jobs/+/965008 "Allow mirror_fqdn to be overriden"
19:30:13 #topic Zuul Launcher Updates
19:30:30 As a heads up there is a bug in zuul launcher that currently affects nodesets if the requested node boots fail
19:30:40 zuul tries to recover inappropriately and then fails the nodeset
19:31:02 there is a fix for this currently in the zuul gate, but zuul ci hit problems due to the new pip release so it's been a slow march to get the fix landed
19:31:36 there was also a fix to some test cases identified to hopefully make the test cases more reliable. I'm hopeful that with those two fixes in place we'll be able to land the launcher fix, then restart launchers to address the node failure problem
19:31:59 at this point I think we're on the right path to correcting this but wanted people to be aware
19:32:26 any other zuul launcher concerns or feedback?
19:32:41 #link https://review.opendev.org/c/zuul/zuul/+/964893 this is the node failure fixup
19:33:29 #topic Matrix for OpenDev comms
19:33:44 In addition to the Gerrit upgrade this is the other item that is high on my todo list
19:34:06 I should be able to start on room creation and work through some of the bits of the spec that don't require user facing changes
19:34:20 then when we're happy with the state of things we can make it more official and start porting usage over
19:34:39 Sounds good
19:36:33 #topic Etherpad 2.5.1 Upgrade
19:36:51 Etherpad 2.5.0 was the version I was looking at previously with the broken but slightly improved css
19:37:13 since then there is a new 2.5.1 release so I need to update the upgrade change, recycle test nodes, and check if the css is happy now
19:37:26 but I didn't want to do that prior to or during the ptg so this is probably going to wait for a bit
19:37:38 #link https://github.com/ether/etherpad-lite/blob/v2.5.1/CHANGELOG.md is the upstream changelog
19:38:06 I would say that oftentimes their changelog is very incomplete
19:38:16 #topic Gitea 1.24.7 Upgrade
19:38:21 Gitea has pushed a new release too
19:38:26 #link https://review.opendev.org/c/opendev/system-config/+/964899/ Upgrade Gitea to 1.24.7
19:38:47 I think we can probably proceed with updating this service if it looks like the service itself is stable and not falling over due to crawlers
19:39:04 the screenshots looked good to me but please double check when you review the change
19:39:10 #topic Gitea Performance
19:39:16 They look good to me
19:39:24 which brings us to the general gitea performance issue
19:39:42 prior to the summit we thought that part of the problem was crawlers hitting backends directly
19:39:55 this meant that the load balancer couldn't really balance effectively as it is unaware of any direct connections
19:40:00 #link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access
19:40:14 this change is a response to that. It will limit our ability to test specific backends without doing something like ssh port forwarding
19:40:34 however, yesterday performance was poor and the traffic did seem to be going through the load balancer
19:40:56 so forcing everything through the load balancer is unlikely to fix all the issues.
That said I suspect it will generally be an improvement
19:41:26 yesterday I had to block a particularly bad crawler's ip addresses after confirming it was crawling with an odd and what appeared to be a bogus user agent
19:41:50 after doing that things settled down a bit and the service seemed happier. Spot checking now seems to show things are still reasonably happy
19:42:19 I did identify one other problematic crawler that I intended on blocking if things didn't improve after the first was blocked, but that was not necessary
19:42:33 (this crawler is using a specific cloud provider and I was going to block that cloud provider's ip range....)
19:43:02 anyway I guess the point here is the battle is ongoing and I'm less certain 964728 will help significantly, but I'm willing to try it if others think it is a good idea
19:43:16 I'm also open to other ideas and help
19:44:12 We can also (maybe?) use our existing UA-filter to create a block list for haproxy
19:44:29 something like:
19:44:30 not at that layer
19:44:33 #link https://discourse.haproxy.org/t/howto-block-badbots-crawlers-scrapers-using-list-file/995
19:44:44 ya we're currently load balancing tcp not https
19:44:48 we'd have to do it in apache since that's where https is terminated
19:45:13 but maybe if we force all traffic through the load balancer then a reasonable next step is terminating https there?
19:45:29 makes debugging even more difficult as clients don't see the backend specific altname
19:45:34 Ahhh I see.
19:45:40 but we could do more magic with haproxy if it mitm'd the service
19:46:31 I'm open to experiments though and ideas like that are worth pursuing if we can reconfigure the service setup to match
19:47:27 #topic PBR Updates to Handle Setuptools Deprecations
19:47:50 The last thing I wanted to call out today is that setuptools set a date of October 31 for removing some deprecated code that pbr relies on (specifically easy_install related stuff)
19:47:57 #link https://review.opendev.org/c/openstack/pbr/+/964712/ and children aim to address this
19:48:15 We think this stack of changes should hopefully mitigate that (thank you stephenfin)
19:48:28 looks like they're passing again now
19:48:32 the pip release broke pbr tests though so I had to fix those yesterday and now we're trying to land things again
19:48:59 hopefully we can land the changes and get a release out tomorrow? but then be on the lookout for the next setuptools release and for any problems related to it
19:49:35 I was brainstorming ways we might mitigate if necessary, and I think we could do things like pin setuptools in our container images if we aren't already doing so for things building container images. And we could also add pyproject.toml files to pin setuptools elsewhere
19:49:43 this assumes that becomes necessary and we're hoping it won't be
19:50:26 definitely say something if you notice problems with setuptools in the near future.
19:50:30 #topic Open Discussion
19:50:32 Anything else?
19:51:10 I'm going to be out on the 10th. The 11th is a holiday but I expect to be around and have a meeting
19:53:21 Sounds like that may be everything. Thank you everyone! We'll be back here next week at the same time and location.
19:53:31 #endmeeting
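A small hedged aid for the "be on the lookout for the next setuptools release" item above, assuming the deprecated code pbr leans on includes the setuptools.command.easy_install module; this only reports whether that module is still importable from the installed setuptools and is not a substitute for running the pbr test suite:

```python
#!/usr/bin/env python3
"""Report whether the installed setuptools still ships easy_install.

Assumption: the deprecated pieces pbr relies on include the
setuptools.command.easy_install module, so its disappearance in a new
setuptools release is an early warning worth flagging.
"""
import importlib.util

import setuptools


def easy_install_present() -> bool:
    # find_spec returns None once the module is removed from setuptools
    return importlib.util.find_spec("setuptools.command.easy_install") is not None


if __name__ == "__main__":
    status = "still present" if easy_install_present() else "REMOVED"
    print(f"setuptools {setuptools.__version__}: easy_install {status}")
```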