19:00:56 <clarkb> #startmeeting infra
19:00:56 <opendevmeet> Meeting started Tue Nov  4 19:00:56 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:56 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:56 <opendevmeet> The meeting name has been set to 'infra'
19:01:09 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/O2QLEHP5SBVJLBCK5WVAGWFSXJEMDI52/ Our Agenda
19:01:12 <clarkb> #topic Announcements
19:01:30 <clarkb> If the time of this meeting surprises you then note that we hold the meeting at 1900 UTC which doesn't observe daylight saving time
19:01:56 <tonyb> one of its best features!
19:02:05 <clarkb> I will be popping out a little early on Friday and am out on Monday
19:02:18 <clarkb> we should expect a meeting in a week but the agenda may go out late
19:02:23 <clarkb> Anything else to announce?
19:04:06 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:04:16 <clarkb> I'm perpetually not keeping up with Gerrit :/
19:04:25 <clarkb> there is a new set of releases today from upstream for 3.10.9 and 3.11.7
19:04:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/966084 Update Gerrit images for 3.10.9 and 3.11.7
19:05:08 <clarkb> I also had to delete one of my holds for Gerrit testing due to a launcher issue but that isn't a big deal as I need to refresh for ^ anyway
19:05:28 <clarkb> the tl;dr is that I'm back to needing to catch the gerrit deployment up to a happy spot, then I'll refresh node holds and hopefully start testing things
19:05:53 <clarkb> the linked change above has a parent change that addresses a general issue with gerrit restarts as well and we should probably plan to get both changes in together then restart gerrit to ensure everything is happy about it
19:06:15 <clarkb> any questions or concerns? (we'll talk about the unexpected shutdowns next in its own topic)
19:07:19 <clarkb> #topic Gerrit Spontaneous Shutdowns
19:07:33 <clarkb> The other big gerrit topic is the issue of unexpected server shutdowns
19:07:42 <clarkb> We had one occur during the summit and we just had one early today UTC time
19:07:51 <clarkb> thank you tonyb for dealing with the shutdown that occurred today
19:08:26 <tonyb> All good.
19:08:35 <clarkb> The first thing I want to note is that the massive h2 cache files and their lockfiles in /home/gerrit2/review_site/cache can be deleted before starting gerrit again in order to speed up gerrit start
19:09:03 <fungi> yes, huge thanks tonyb, if that had sat waiting for me to wake up it would have severely impacted the publication timeline for an openstack security advisory
19:09:03 <tonyb> I did delete the files > 1G but not the lockfiles
19:09:09 <clarkb> the issue there is that gerrit processes those massive files on startup and prunes things into shape, which is done serially database by database and prevents many other actions from happening in the system. If we delete the files gerrit simply creates new caches and populates them as it goes
19:09:18 <clarkb> tonyb: oh interesting, but gerrit startup was still slow?
19:09:22 <tonyb> Yes
19:09:56 <clarkb> ok so maybe there is an additional issue there
19:10:08 <clarkb> fwiw we don't want to delete all the h2 cache files as some manage things like user sessions
19:10:25 <clarkb> but caches like gerrit_file_diff.h2.db grow very large and can be deleted (but I've only ever deleted them with their lock files)
19:10:33 <clarkb> I wonder if it waited for some lock timeout or something like that
19:10:48 <tonyb> Yeah, that was my assumption, and based on my passive following of the h2 issue I was confident about deleting the 2 large files
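[For reference, a minimal sketch of the "delete the big caches plus their lock files" approach described above. The cache directory and the gerrit_file_diff.h2.db example come from this discussion, but the 1 GiB threshold and the <name>.lock.db lock file naming are assumptions, and this would only ever be run while Gerrit is stopped, since Gerrit rebuilds deleted caches on startup.]

    #!/usr/bin/env python3
    """Prune oversized Gerrit h2 cache databases before a restart (sketch only)."""
    from pathlib import Path

    CACHE_DIR = Path("/home/gerrit2/review_site/cache")  # path mentioned above
    THRESHOLD = 1 * 1024**3  # roughly the ">1G" rule of thumb used here

    for db in sorted(CACHE_DIR.glob("*.h2.db")):
        if db.stat().st_size < THRESHOLD:
            continue  # leave the small caches (sessions, accounts, etc.) alone
        lock = db.with_name(db.name.replace(".h2.db", ".lock.db"))  # assumed naming
        print(f"removing {db} ({db.stat().st_size / 1024**3:.1f} GiB)")
        db.unlink()
        if lock.exists():
            print(f"removing {lock}")
            lock.unlink()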
19:11:34 <clarkb> I did discuss that issue with upstream at the summit and mfick thought it was fixable and had attempted a fix that got reverted due to breaking plugins. But he still felt that it could be done without breaking plugins so hopefully soon it gets addressed
19:11:57 <tonyb> I'm not familiar enough with what the gerrit logs look like to really know what's "normal"
19:12:34 <clarkb> ya I also noticed in syslog there was something complaining about not being able to connect to gerrit on 29418 for some time, so I think it was basically doing whatever startup routines it needs to in order to feel ready and that took some time
19:12:45 <clarkb> previously we had believed that to be largely stuck in db processing but maybe there is something else
19:13:16 <clarkb> really the ideal situation here is to have the service be more resilient and sane about this stuff, which is happening (slowly)
19:13:28 <tonyb> It's hard to debug given it's the production server ;P
19:13:34 <clarkb> then on the cloud side I sent email to vexxhost support calling out the issue and asking them for advice on mitigating it
19:13:40 <clarkb> so hopefully we can make that better too
19:13:47 <clarkb> I cc'd infra-root on that
19:14:26 <clarkb> I'm happy to try and schedule the 3.10.9 update restart at a time when as many interested people as possible can follow along with the process so that we are more aware of what a "good" restart looks like
19:14:35 <clarkb> (I still expect the shutdown to timeout unfortunately)
19:15:25 <clarkb> to summarize: gerrit shutdown happened again. Startup in that situation is still slower than we'd like. We may need to debug further or it may simply be a matter of deleting h2 db lock files when deleting h2 cache dbs. And we've engaged with the cloud host to debug the underlying issue
19:15:32 <clarkb> anything else to note before we move on?
19:15:54 <tonyb> clarkb: I'd be interested in learning more if we can make that work
19:16:58 <clarkb> tonyb: yup let's coordinate as the followup changes get into a mergeable state
19:17:10 <clarkb> #topic Upgrading old servers
19:17:20 <clarkb> tonyb: I did review both the wiki change and the ansible update stack
19:17:46 <clarkb> tonyb: on the wiki change I think it may be worthwhile to publish the image to quay.io and deploy the new server on noble so that we're not doing the docker hub -> quay.io and jammy -> noble migrations later unnecessarily
19:17:54 <clarkb> basically lets skip ahead to the state we want to be in long term if we can
19:18:00 <clarkb> (I noted this on the change)
19:18:01 <tonyb> Thank you!  I got distracted with other things and I'll update the series soon
19:18:26 <tonyb> I agree.  I'll figure that out based on gerrit/hound containers
19:18:36 <fungi> and thanks again for working on it
19:18:50 <tonyb> np, sorry for the really long pause
19:19:02 <clarkb> then on the ansible side of things I think we're in an odd position where ansible 11 needs python 3.11 or newer and jammy by default is 3.10. We can install 3.11 on jammy but I'm not sure how well that will work, so I suggested we test that with your change stack and determine if ansible 11 implies a noble bridge or if we can make it work on jammy so that we're not mixing
19:19:04 <clarkb> together bridge updates and ansible updates
19:19:12 <clarkb> and I think your change stack there is a good framework for testing that sort of thing
19:19:48 <clarkb> and based on what we learn from that we can make some planning decisions for bridge
19:20:29 <clarkb> anything else related to server upgrades?
19:20:44 <tonyb> Yeah, I'm working on that. I think I'll move 'ansible-next' to the end of the stack. I've done some testing with 3.11 and it seems fine. I'm working on a 'clunky' way to update the ansible-venv when needed
19:21:04 <clarkb> great thanks!
19:21:07 <tonyb> Well really move aside and recreate it with 3.11
19:21:57 <clarkb> right virtualenv updates are often weird and starting over is usually simplest
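[A rough sketch of that "move aside and recreate" idea using only the stdlib. The /opt/ansible-venv path and the ansible 11 pin are hypothetical placeholders, and the script would need to be run with the python3.11 interpreter the new venv should be built from.]

    #!/usr/bin/env python3.11
    """Move the old ansible venv aside and rebuild it with the running interpreter."""
    import subprocess
    import sys
    import time
    import venv
    from pathlib import Path

    VENV = Path("/opt/ansible-venv")  # hypothetical location

    if VENV.exists():
        # keep the old venv around for rollback instead of mutating it in place
        VENV.rename(VENV.with_name(f"ansible-venv.old.{int(time.time())}"))

    venv.EnvBuilder(with_pip=True, upgrade_deps=True).create(VENV)
    subprocess.run([str(VENV / "bin/pip"), "install", "ansible>=11,<12"], check=True)
    print(f"rebuilt {VENV} with python {sys.version.split()[0]}")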
19:22:12 <tonyb> That's all from me on $topic
19:22:18 <clarkb> #topic AFS mirror content updates
19:22:44 <clarkb> the assertion last week that trixie nodes were getting upstream deb mirrors set as pypi proxies had me confused for a bit so I dug into our mirroring stuff
19:23:16 <clarkb> I think I understand it. basically those jobs must be overriding our generic base mirror fqdn variable (which assumes everything is mirrored at the same location) and setting it to the upstream deb mirror value
19:23:27 <clarkb> I believe you can separately set the pypi mirror location (and in this case you'd set it to upstream too)
19:23:46 <clarkb> so one solution here is to basically micromanage each of those mirror locations. Or we can just mirror trixie and the existing tooling should just work
19:23:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/965334 Mirror trixie packages
19:23:59 <clarkb> I've pushed ^ to take the just mirror trixie approach
19:24:26 <clarkb> I also wondered why rocky linux (and now alma linux) weren't affected by this issue and the reason appears to be that we have specific mirror configuration support for each distro
19:24:44 <clarkb> this means that because debian has that support, each debian release wants the same setup. But rockylinux, having never been configured, is fine
19:25:31 <clarkb> so long story short I think we have two options for distros like debian which are currently in a mixed state. Option one is to just mirror all the things for that distro so it isn't in a mixed state, and option two is to configure each mirror url separately to be correct and point at upstream when not mirrored
19:25:48 <clarkb> then separately there is a spec in zuul-jobs to improve how mirrors are configured (by being more explicit and less implicit)
19:26:07 <clarkb> that hasn't been implemented yet, but if there is interest in pushing that over the finish line we can in theory take advantage to do this better
19:26:13 <clarkb> tonyb: I think you were working on that at one point?
19:26:41 <tonyb> I was.  I didn't get very far.
19:26:47 <clarkb> (and to be clear this is a long standing item in zuul-jobs)
19:27:21 <clarkb> ack mnasiadka I know you were talking about looking at things to help. This might be a good option as it shouldn't require any special privileges and needs someone who understands different distro behaviors to accommodate those in the end result
19:27:32 <clarkb> let me see if I can find the docs link again
19:28:02 <clarkb> mnasiadka: (or anyone else interested) https://zuul-ci.org/docs/zuul-jobs/latest/mirror.html
19:28:27 <clarkb> and I don't mean to scare tonyb off just thinking this is a good option for someone who doesn't have root access since it should all be driven through zuul
19:28:52 <mnasiadka> Sure, can have a look :)
19:29:11 <tonyb> Oh for sure.  I was going to suggest the same thing :)
19:29:44 <clarkb> anything else related to afs content management?
19:30:33 <clarkb> #topic Zuul Launcher Updates
19:30:44 <clarkb> There is a new launcher bug that can create mixed provider nodesets
19:30:50 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/965954 Fix assignment of unassigned nodes.
19:30:52 <clarkb> this is the fix
19:31:10 <clarkb> unfortunately I think there are a couple other zuul bugs that need to be fixed before that change can land, but it is in the queue/todo list
19:31:43 <clarkb> The situation where this happens seems to be infrequent as it relies on the launcher reassigning unused nodes from requests that are no longer needed to new requests
19:31:52 <clarkb> so you have to get things aligned just right to hit it
19:32:23 <clarkb> Then separately I discovered yesterday that the quota in raxflex iad3 is smaller than I had thought. I had thought it was for 10 instances but we can only boot 5 due to cpu quotas (and 6 if memory quotas are the limit)
19:32:47 <clarkb> this is why I dropped my held gerrit nodes: to free up two nodes in raxflex iad3 so that some openstack helm requests for 5 nodes could be handled
19:33:20 <clarkb> cardoe brought up the OSH issue and I asked cardoe to follow up with cloudnull about bumping quotas. Otherwise we may need to consider dropping that region for now
19:33:41 <tonyb> Ahhhh that's what was causing the helm issue.
19:33:56 <clarkb> I just checked and the quotas haven't been bumped yet
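[As an aside, the quota surprise is the usual "capacity is the minimum across all quota dimensions" math. The numbers below are hypothetical, not the real iad3 quotas, but they reproduce the 10-instances-but-only-5-bootable shape described above.]

    # Hypothetical numbers for illustration only, not the actual raxflex iad3 quotas.
    def usable_instances(quota_instances, quota_cores, quota_ram_mb,
                         flavor_vcpus, flavor_ram_mb):
        # effective capacity is the minimum across every quota dimension
        return min(quota_instances,
                   quota_cores // flavor_vcpus,
                   quota_ram_mb // flavor_ram_mb)

    # a 10 instance quota can still cap out at 5 nodes if cores run out first
    print(usable_instances(10, 40, 49152, flavor_vcpus=8, flavor_ram_mb=8192))  # -> 5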
19:34:46 <corvus> in the mean time, might be good to avoid holding nodes in raxflex-iad3
19:34:50 <clarkb> ++
19:34:58 <corvus> and if that's not tenable, yeah, maybe we should turn it down.
19:35:33 <fungi> not that we can avoid holding nodes in a specific provider, but we can certainly delete and reset the hold if it lands in one
19:35:48 <clarkb> I'm willing to wait another day or two to see if quotas bump but if that doesn't happen then I'm good with turning it off while we wait
19:35:52 <corvus> yep
19:36:11 <tonyb> sounds reasonable
19:36:21 <fungi> sounds fine to me. 5 nodes worth of quota is a drop in the bucket, but was good for making sure the provider is nominally operable for us
19:36:36 <clarkb> #topic Matrix for OpenDev comms
19:36:46 <clarkb> This item has not made it into the list of things I'm currently juggling :/
19:37:02 <clarkb> someone (sorry I don't remember who) pointed out that mjolnir has a successor implementation
19:37:07 <clarkb> so we may want to jump straight to that
19:37:14 <clarkb> but otherwise I haven't seen any movement on this one
19:37:19 <tonyb> that was mnasiadka (I think)
19:38:00 <mnasiadka> Yeah, stumbled across that on Ansible community Matrix rooms and got interested
19:38:34 <clarkb> ack thank you for calling it out. That is good to know so that we don't end up implementing something twice
19:38:35 <fungi> what are the benefits of the successor? that it's actively maintained and the original isn't, i guess?
19:39:07 <mnasiadka> clarkb: if there’s anything I can do to help re Matrix - happy to do that (but next week Mon/Tue I’m out)
19:39:15 <clarkb> https://github.com/the-draupnir-project/Draupnir looks like it has simpler management ux
19:39:23 <fungi> it's not nearly that time-sensitive
19:39:35 <clarkb> mnasiadka: thanks I'll let you know. I think the first step is for me or another admin to create the room
19:39:44 <clarkb> and once that is done we can start experimenting and others can help out with tooling etc
19:40:17 <clarkb> #topic Etherpad 2.5.2 Upgrade
19:40:25 <clarkb> sorry I'm going to keep things moving along to make sure we cover all the agenda items
19:40:34 <corvus> i'm happy to do statusbot for matrix
19:40:36 <clarkb> last we spoke we were worried about etherpad 2.5.1 css issues
19:40:38 <clarkb> corvus: thanks
19:40:53 <clarkb> since then I filed a bug with etherpad and they fixed it quickly and now there is a 2.5.2 which seems to work
19:40:57 <clarkb> #link https://github.com/ether/etherpad-lite/blob/v2.5.2/CHANGELOG.md
19:41:01 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/956593
19:41:12 <clarkb> At this point myself, tonyb and mnasiadka have tested a held etherpad 2.5.2 node
19:41:20 <clarkb> I think we're basically ready to upgrade etherpad
19:41:30 <clarkb> I didn't want to do it last week with the PTG happening but that is over now
19:41:48 <clarkb> I'm game for trying to do that today after lunch and a possible bike ride, otherwise I'll probably plan to do it first thing tomorrow
19:41:53 <fungi> oh, and is meetpad still in the disable list?
19:42:02 <clarkb> fungi: oh I think it is. We should pull it out.
19:42:09 <clarkb> Probably pull out review03 at the same time?
19:42:09 <fungi> on it
19:42:29 <tonyb> and review03 if we're sure that isn't a problem
19:42:40 <fungi> and taking review03 out too, yes
19:42:46 <fungi> done
19:42:49 <clarkb> tonyb: I'm like 99% certain it's ok. The first spontaneous shutdown created that file as a directory and nothing exploded
19:42:57 <clarkb> tonyb: shouldn't be any worse to have it as an empty file
19:43:01 <tonyb> \o/
19:43:17 <clarkb> so ya if you'd like to test etherpad do so soon. Otherwise expect it to be upgraded by sometime tomorrow
19:43:23 <clarkb> #topic Gitea 1.25.0 Upgrade
19:43:34 <clarkb> After updating Gitea to 1.24.7 last week they released 1.25.0
19:43:41 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/965960 Upgrade Gitea to 1.25.0
19:43:56 <clarkb> That change includes a link to the upstream changelog which is quite long, but the breaking changes list is small and doesn't affect us
19:44:17 <clarkb> I suspect this is a relatively straightforward upgrade for us, but there is a held node you can interact with and double checking the changelog is always appreciated
19:44:28 <clarkb> usually by the time we do that they release a .1 or .2 as well and that is what we actually upgrade to
19:44:35 <clarkb> mnasiadka did some poking around and didn't find any obvious issues
19:45:00 <clarkb> I do update the versions of golang and nodejs too, as well as switch to pnpm to match upstream
19:45:21 <clarkb> so it's still more than a bugfix upgrade
19:46:19 <clarkb> #topic Gitea Performance
19:46:39 <clarkb> then in parallel we're still seeing occasional gitea performance issues
19:46:44 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access
19:47:06 <clarkb> The idea behind this one is that we'll remove any direct backend crawling which should force access through the lb allowing it to do its job more accurately
19:47:12 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/965420 Increase memcached cache size to mitigate effect of crawlers poisoning the cache
19:47:36 <clarkb> the idea behind this one is that crawling effectively poisons the cache. Increasing the size of the cache may mitigate the effects of the poisoning on the cache
19:48:06 <clarkb> I'm willing to try one or both or neither of these. Consider both changes a request for comment/feedback. Happy to try other approaches too
19:48:59 <clarkb> but I'm hopeful that load balancing all crawlers and having larger caches will result in no single backend having a fully poisoned cache, improving performance generally
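[To make the cache-poisoning argument behind 965420 concrete, a toy LRU simulation — nothing to do with memcached's real internals, and all sizes and traffic ratios are made up: crawler requests over many unique keys evict the small set of hot entries interactive users depend on, and a larger cache keeps the hot-key hit rate high despite that churn.]

    # Toy model: all sizes and traffic ratios below are made up.
    import random
    from collections import OrderedDict

    def hot_hit_rate(cache_size, hot_keys=200, crawler_keys=500_000, requests=200_000):
        cache = OrderedDict()
        hits = lookups = 0
        for _ in range(requests):
            if random.random() < 0.05:                 # interactive traffic over a few hot pages
                key = ("hot", random.randrange(hot_keys))
                lookups += 1
                hits += key in cache
            else:                                      # crawler traffic over many unique pages
                key = ("crawl", random.randrange(crawler_keys))
            if key in cache:
                cache.move_to_end(key)
            else:
                cache[key] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)          # evict the least recently used entry
        return hits / lookups

    random.seed(0)
    for size in (1_000, 10_000, 100_000):
        print(size, round(hot_hit_rate(size), 2))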
19:49:01 <fungi> 964728 seems to have plenty of consensus
19:49:05 * tonyb is in favor of both changes.
19:49:37 <clarkb> oh cool I missed the consensus on the load balancing change. I can plan to land that after the etherpad upgrade then
19:50:03 <clarkb> and then once we've made some changes we can reevaluate if we need to do more or revert etc
19:50:22 <fungi> i fat-fingered my +2 on 965420 and accidentally approved it for a brief moment, so it may need rechecking in a bit, sorry
19:50:55 <clarkb> seems you did that quickly enough that zuul never enqueued it to the gate
19:50:58 <clarkb> fast typing there
19:51:02 <clarkb> I can recheck it
19:51:04 <clarkb> #topic Raxflex DFW3 Disabled
19:51:30 <clarkb> the last item on the agenda is to cover what happened in raxflex dfw3 yesterday
19:51:47 <clarkb> I discovered that the mirror was throwing errors at jobs and investigating the server directly showed some afs cache problems
19:52:13 <clarkb> an fs flushall seemed to get stuck and system load was steadily climbing
19:52:27 <clarkb> this resulted in me attempting to reboot the server. Doing so via the host itself got stuck waiting for afs to unmount
19:52:52 <clarkb> after waiting ~10 minutes I asked nova to stop the server which it did. Asking nova to start the server again does not actually start it
19:53:13 <clarkb> the server task state goes to powering-on according to server show but it never starts. unfortunately, it never reports an error either
19:53:26 <clarkb> We pinged dan_with about it in #opendev but haven't heard anything since
19:53:36 <tonyb> and nothing on the console?
19:53:45 <clarkb> tonyb: you can't get the console because the server isn't running
19:53:59 <clarkb> dfw3 has since been disabled in zuul
19:54:16 <tonyb> Oh okay.  that level of 'never starts'
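[For anyone wanting to poke at the stuck server, a rough openstacksdk sketch of the kind of check described above — the cloud and server names are placeholders and the actual debugging was done with the openstack CLI equivalents. On this server the status stayed SHUTOFF with the task state stuck at powering-on and no error reported.]

    # Placeholders for the clouds.yaml entry and server name; assumes openstacksdk.
    import time
    import openstack

    conn = openstack.connect(cloud="raxflex-dfw3-placeholder")
    server = conn.compute.find_server("mirror-server-placeholder")

    conn.compute.start_server(server)
    for _ in range(30):
        server = conn.compute.get_server(server.id)
        print(server.status, server.task_state)  # here: SHUTOFF, powering-on, no error
        if server.status == "ACTIVE":
            break
        time.sleep(10)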
19:54:26 <clarkb> I think our options are to either wait for rackspace to help us fix it (maybe we need to file an issue for that?) or we can make a new mirror and just start over
19:54:51 <clarkb> Considering the historical cinder volume issues there I think there is value in getting the cloud to investigate if we can, but maybe we don't need to wait for that while we return the region to service with a new mirror
19:55:00 <fungi> will likely need a new cinder volume too
19:55:05 <clarkb> ya
19:55:31 <corvus> how about booting a new server+volume and then open a ticket for the old one.  if nothing happens in a week, delete it?
19:55:49 <tonyb> ^^ That's what I was going to suggest
19:55:51 <clarkb> corvus: I'd be happy to use that approach. I'm not sure I personally have time to drive that given my current todo list
19:55:56 <corvus> i'm assuming we have enough quota in the non-zuul account there to run 2 mirrors
19:56:11 <clarkb> yes I think we do
19:57:13 <clarkb> I'm happy for someone else to drive that and will help as I can. This week is just really busy for me and I'm out Monday so don't want to overcommit
19:57:27 <clarkb> #topic Open Discussion
19:57:36 <clarkb> we have a few minutes to cover anything else if there is anything else
19:59:10 <clarkb> apologies if it felt like I was speed running through all of that. I wanted to make sure the listed items got covered. Always feel free to followup outside the meeting on IRC or on the mailing list
19:59:38 <clarkb> And thank you everyone for attending and helping to keep opendev running!
19:59:39 <corvus> i am happy with your chairing :)
19:59:53 <fungi> excellent timekeeping!
19:59:57 <tonyb> hear hear!
19:59:58 <clarkb> As I mentioned we should be back here next week at the same time and location, but the agenda email may be delayed
20:00:10 <tonyb> chairing and cheering
20:00:11 <clarkb> and now we are at time. Thanks again!
20:00:13 <clarkb> #endmeeting