Tuesday, 2025-11-04

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov  4 19:00:56 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/O2QLEHP5SBVJLBCK5WVAGWFSXJEMDI52/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbIf the time of this meeting surprises you then note that we hold the meeting at 1900 UTC which doesn't observe daylight saving time19:01
tonybone of its best features!19:01
clarkbI will be popping out a little early on Friday and am out on Monday19:02
clarkbwe should expect a meeting in a week but the agenda may go out late19:02
clarkbAnything else to announce?19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:04
clarkbI'm perpetually not keeping up with Gerrit :/19:04
clarkbthere is a new set of releases today from upstream for 3.10.9 and 3.11.719:04
clarkb#link https://review.opendev.org/c/opendev/system-config/+/966084 Update Gerrit images for 3.10.9 and 3.11.719:04
clarkbI also had to delete one of my holds for Gerrit testing due to a launcher issue but that isn't a big deal as I need to refresh for ^ anyway19:05
clarkbthe tl;dr: I think I'm back to needing to catch the gerrit deployment up to a happy spot, then I'll refresh node holds and hopefully start testing things19:05
clarkbthe linked change above has a parent change that addresses a general issue with gerrit restarts as well and we should probably plan to get both changes in together then restart gerrit to ensure everything is happy about it19:05
clarkbany questions or concerns? (we'll talk about the unexpected shutdowns next in its own topic)19:06
clarkb#topic Gerrit Spontaneous Shutdowns19:07
clarkbThe other big gerrit topic is the issue of unexpected server shutdowns19:07
clarkbWe had one occur during the summit and we just had one early today UTC time19:07
clarkbthank you tonyb for dealing with the shutdown that occurred today19:07
tonybAll good.19:08
clarkbThe first thing I want to note is that the massive h2 cache files and their lockfiles in /home/gerrit2/review_site/cache can be deleted before starting gerrit again in order to speed up gerrit start19:08
fungiyes, huge thanks tonyb, if that had sat waiting for me to wake up it would have severely impacted the publication timeline for an openstack security advisory19:09
tonybI did delete the files > 1G but not the lockfiles19:09
clarkbthe issue there is that gerrit processes those massive files on startup and prunes things into shape, which is done serially database by database and prevents many other actions from happening in the system. If we delete the files gerrit simply creates new caches and populates them as it goes19:09
clarkbtonyb: oh interesting, but gerrit startup was still slow?19:09
tonybYes19:09
clarkbok so maybe there is an additional issue there19:09
clarkbfwiw we don't want to delete all the h2 cache files as some manage things like user sessions19:10
clarkbbut caches like gerrit_file_diff.h2.db grow very large and can be deleted (but I've only ever deleted them with their lock files)19:10
clarkbI wonder if it waited for some lock timeout or something like that19:10
tonybYeah.  that was my assumption and based on my passive following of the h2 issue I was confident about deleting the 2 large files19:10
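
A minimal sketch of the cleanup described above, run with Gerrit stopped; the cache path and the gerrit_file_diff cache name come from the log, while the lock file naming and which other caches are safe to delete are assumptions:

    # Run only while Gerrit is stopped; path is the one mentioned above.
    cd /home/gerrit2/review_site/cache
    ls -lSh *.db                  # inspect sizes before deleting anything
    # Delete the oversized diff cache together with its lock file; leave
    # small caches (e.g. web sessions) alone. Lock file naming is an
    # assumption here and may vary by H2 version.
    rm -f gerrit_file_diff.h2.db gerrit_file_diff*.lock.db
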
clarkbI did discuss that issue with upstream at the summit and mfick thought it was fixable and had attempted a fix that got reverted due to breaking plugins. But he still felt that it could be done without breaking plugins so hopefully soon it gets addressed19:11
tonybI'm not familiar enough with what the gerrit logs look like to really know what's "normal"19:11
clarkbya I also noticed in syslog there was something complaining about not being able to connect to gerrit on 29418 for some time, so I think it was basically doing whatever startup routines it needs to in order to feel ready and that took some time19:12
clarkbpreviously we had believed that to be largely stuck in db processing but maybe there is something else19:12
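
For reference, an illustrative readiness probe along the lines of what syslog was effectively reporting failures for; the host and poll interval are placeholders:

    # Wait until Gerrit's SSH port accepts connections, then report ready.
    until nc -z localhost 29418; do
        sleep 10
    done
    echo "gerrit sshd is accepting connections"
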
clarkbreally the ideal situation here is to have the service be more resilient and sane about this stuff, which is happening (slowly)19:13
tonybIt's hard to debug given it's the production server ;P19:13
clarkbthen on the cloud side I sent email to vexxhost support calling out the issue and asking them for advice on mitigating it19:13
clarkbso hopefully we can make that better too19:13
clarkbI cc'd infra rooters on that19:13
clarkbI'm happy to try and schedule the 3.10.9 update restart for a time where as many people as are interested can follow along with the process so that we are more aware of what a "good" restart looks like19:14
clarkb(I still expect the shutdown to timeout unfortunately)19:14
clarkbto summarize gerrit shutdown happened again. Startup in that situation is still slower than we'd like. We may need to debug further or it may simply be a matter of deleting h2 db lock files when deleting h2 cache dbs. And we've engaged with the cloud host to debug the underlying issue19:15
clarkbanything else to note before we move on?19:15
tonybclarkb: I'd be interested in learning more if we can make that work19:15
clarkbtonyb: yup let's coordinate as the followup changes get into a mergeable state19:16
clarkb#topic Upgrading old servers19:17
clarkbtonyb: I did review both the wiki change and the ansible update stack19:17
clarkbtonyb: on the wiki change I think it may be worthwhile to publish the image to quay.io and deploy the new server on noble so that we're not doing the docker hub -> quay.io and jammy -> noble migrations later unnecessarily19:17
clarkbbasically let's skip ahead to the state we want to be in long term if we can19:17
clarkb(I noted this on the change)19:18
tonybThank you!  I got distracted with other things and I'll update the series soon19:18
tonybI agree.  I'll figure that out based on gerrit/hound containers19:18
fungiand thanks again for working on it19:18
tonybnp, sorry for the really long pause19:18
clarkbthen on the ansible side of things I think we're in an odd position where ansible 11 needs python3.11 or newer and jammy by default is 3.10. We can install 3.11 on jammy but I'm not sure how well that will work, so I suggested we test that with your change stack and determine if ansible 11 implies a noble bridge or if we can make it work on jammy so that we're not mixing19:19
clarkbtogether bridge updates and ansible updates19:19
clarkband I think your change stack there is a good framework for testing that sort of thing19:19
clarkband based on what we learn from that we can make some planning decisions for bridge19:19
clarkbanything else related to server upgrades?19:20
tonybYeah I'm working on that. I think I'll move the 'ansible-next' to the end of the stack.  I've done some testing with 3.11 and it seems fine.  I'm working on a 'clunky' way to update the ansible-venv when needed19:20
clarkbgreat thanks!19:21
tonybWell really move aside and recreate it with 3.1119:21
clarkbright virtualenv updates are often weird and starting over is usually simplest19:21
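
A hedged sketch of the move-aside-and-recreate approach tonyb describes; the venv path and version pin are illustrative, not the real bridge layout:

    # Install 3.11 alongside Jammy's default 3.10, then rebuild the venv
    # rather than trying to upgrade it in place.
    sudo apt-get install -y python3.11 python3.11-venv
    sudo mv /opt/ansible-venv /opt/ansible-venv.bak    # keep as a fallback
    sudo python3.11 -m venv /opt/ansible-venv
    sudo /opt/ansible-venv/bin/pip install 'ansible>=11,<12'
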
tonybThat's all from me on $topic19:22
clarkb#topic AFS mirror content updates19:22
clarkbthe assertion last week that trixie nodes were getting upstream deb mirrors set as pypi proxies had me confused for a bit so I dug into our mirroring stuff19:22
clarkbI think I understand it. basically those jobs must be overriding our generic base mirror fqdn variable which assumes all things are mirrored at the same location but they are setting the value to the upstream deb value19:23
clarkbI believe you can separately set the pypi mirror location (and in this case you'd set it to upstream too)19:23
clarkbso one solution here is to basically micromanage each of those mirror locations. Or we can just mirror trixie and the existing tooling should just work19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/965334 Mirror trixie packages19:23
clarkbI've pushed ^ to take the just mirror trixie approach19:23
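
As a quick sanity check once 965334 lands, something like the following can confirm trixie content is actually being served before jobs rely on it (the mirror hostname and path are illustrative):

    # HEAD the Release file for trixie on a region mirror.
    curl -sfI https://mirror.dfw.rax.opendev.org/debian/dists/trixie/Release \
        && echo "trixie is mirrored" \
        || echo "not mirrored yet; jobs would need upstream deb.debian.org"
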
clarkbI also wondered why rocky linux (and now alma linux) weren't affected by this issue and the reason appears to be that we have specific mirror configuration support for each distro19:24
clarkbthis means that debian having support means each debian release wants the same setup. But rockylinux having never been configured is fine19:24
clarkbso long story short I think we have two options for distros like debian which are currently in a mixed state. Option one is to mirror all the things for that distro so it isn't in a mixed state, and option two is to configure each mirror url separately to be correct and point at upstream when not mirrored19:25
clarkbthen separately there is a spec in zuul-jobs to improve how mirrors are configured (by being more explicit and less implicit)19:25
clarkbthat hasn't been implemented yet, but if there is interest in pushing that over the finish line we can in theory take advantage to do this better19:26
clarkbtonyb: I think you were working on that at one point?19:26
tonybI was.  I didn't get very far.19:26
clarkb(and to be clear this is a long standing item in zuul-jobs)19:26
clarkback mnasiadka I know you were talking about looking at things to help. This might be a good option as it shouldn't require any special privileges and needs someone who understands different distro behaviors to accommodate those in the end result19:27
clarkblet me see if I can find the docs link again19:27
clarkbmnasiadka: (or anyone else interested) https://zuul-ci.org/docs/zuul-jobs/latest/mirror.html19:28
clarkband I don't mean to scare tonyb off just thinking this is a good option for someone who doesn't have root access since it should all be driven through zuul19:28
mnasiadkaSure, can have a look :)19:28
tonybOh for sure.  I was going to suggest the same thing :)19:29
clarkbanything else related to afs content management?19:29
clarkb#topic Zuul Launcher Updates19:30
clarkbThere is a new launcher bug that can create mixed provider nodesets19:30
clarkb#link https://review.opendev.org/c/zuul/zuul/+/965954 Fix assignment of unassigned nodes.19:30
clarkbthis is the fix19:30
clarkbunfortunately I think there are a couple other zuul bugs that need to be fixed before that change can land, but it is in the queue/todo list19:31
clarkbThe situation where this happens seems to be infrequent as it relies on the launcher reassigning unused nodes from requests that are no longer needed to new requests19:31
clarkbso you have to get things aligned just right to hit it19:31
clarkbThen separately I discovered yesterday that the quota in raxflex iad3 is smaller than I had thought. I had thought it was for 10 instances but we can only boot 5 due to cpu quotas (and 6 if memory quotas are the limit)19:32
clarkbthis is why I dropped my held gerrit nodes to free up two held nodes in rax flex iad3 so that some openstack helm requests for 5 nodes could be handled19:32
clarkbcardoe brought up the OSH issue and I asked cardoe to follow up with cloudnull about bumping quotas. Otherwise we may need to consider dropping that region for now19:33
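
The arithmetic clarkb describes can be checked with the CLI; the cloud/region names and the 8-vCPU flavor below are assumptions used for illustration:

    # Show the absolute limits and see which one binds first.
    openstack --os-cloud raxflex --os-region-name IAD3 \
        limits show --absolute -f value -c Name -c Value | grep maxTotal
    # e.g. maxTotalInstances 10 but maxTotalCores 40: with an 8 vCPU
    # flavor that is 40 / 8 = 5 bootable instances despite the
    # instance quota.
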
tonybAhhhh that's what was causing the helm issue.19:33
clarkbI just checked and the quotas haven't been bumped yet19:33
corvusin the mean time, might be good to avoid holding nodes in raxflex-iad319:34
clarkb++19:34
corvusand if that's not tenable, yeah, maybe we should turn it down.19:34
funginot that we can avoid holding nodes in a specific provider, but we can certainly delete and reset the hold if it lands in one19:35
clarkbI'm willing to wait another day or two to see if quotas bump but if that doesn't happen then I'm good with turning it off while we wait19:35
corvusyep19:35
tonybsounds reasonable19:36
fungisounds fine to me. 5 nodes worth of quota is a drop in the bucket, but was good for making sure the provider is nominally operable for us19:36
clarkb#topic Matrix for OpenDev comms19:36
clarkbThis item has not made it into the list of things I'm currently juggling :/19:36
clarkbsomeone (sorry I don't remember who) pointed out that mjolnir has a successor implementation19:37
clarkbso we may want to jump straight to that19:37
clarkbbut otherwise I haven't seen any movement on this one19:37
tonybthat was mnasiadka (I think)19:37
mnasiadkaYeah, stumbled across that on Ansible community Matrix rooms and got interested19:38
clarkback thank you for calling it out. That is good to know so that we don't end up implementing something twice19:38
fungiwhat are the benefits of the successor? that it's actively maintained and the original isn't, i guess?19:38
mnasiadkaclarkb: if there’s anything I can do to help re Matrix - happy to do that (but next week Mon/Tue I’m out)19:39
clarkbhttps://github.com/the-draupnir-project/Draupnir looks like it has simpler management ux19:39
fungiit's not nearly that time-sensitive19:39
clarkbmnasiadka: thanks I'll let you know. I think the first step is for me or another admin to create the room19:39
clarkband once that is done we can start experimenting and others can help out with tooling etc19:39
clarkb#topic Etherpad 2.5.2 Upgrade19:40
clarkbsorry I'm going to keep things moving along to make sure we cover all the agenda items19:40
corvusi'm happy to do statusbot for matrix19:40
clarkblast we spoke we were worried about etherpad 2.5.1 css issues19:40
clarkbcorvus: thanks19:40
clarkbsince then I filed a bug with etherpad and they fixed it quickly and now there is a 2.5.2 which seems to work19:40
clarkb#link https://github.com/ether/etherpad-lite/blob/v2.5.2/CHANGELOG.md19:40
clarkb#link https://review.opendev.org/c/opendev/system-config/+/95659319:41
clarkbAt this point myself, tonyb and mnasiadka have tested a held etherpad 2.5.2 node19:41
clarkbI think we're basically ready to upgrade etherpad19:41
clarkbI didn't want to do it last week with the PTG happening but that is over now19:41
clarkbI'm game for trying to do that today after a lunch and a possible bike ride, otherwise I will probably plan to do it first thing tomorrow19:41
fungioh, and is meetpad still in the disable list?19:41
clarkbfungi: oh I think it is. We should pull it out.19:42
clarkbProbably pull out review03 at the same time?19:42
fungion it19:42
tonyband review03 if we're sure that isn't a problem19:42
fungiand taking review03 out too, yes19:42
fungidone19:42
clarkbtonyb: I'm like 99% certain it's ok. The first spontaneous shutdown created that file as a directory and nothing exploded19:42
clarkbtonyb: shouldn't be any worse to have it as an empty file19:42
tonyb\o/19:43
clarkbso ya if you'd like to test etherpad do so soon. Otherwise expect it to be upgraded by sometime tomorrow19:43
clarkb#topic Gitea 1.25.0 Upgrade19:43
clarkbAfter updating Gitea to 1.24.7 last week they released 1.25.019:43
clarkb#link https://review.opendev.org/c/opendev/system-config/+/965960 Upgrade Gitea to 1.25.019:43
clarkbThat change includes a link to the upstream changelog which is quite long, but the breaking changes list is small and doesn't affect us19:43
clarkbI suspect this is a relatively straightforward upgrade for us, but there is a held node you can interact with and double checking the changelog is always appreciated19:44
clarkbusually by the time we do that they release a .1 or .2 as well and that is what we actually upgrade to19:44
clarkbmnasiadka did some poking around and didn't find any obvious issues19:44
clarkbI do update the versions of golang and nodejs as well, and switch to pnpm to match upstream19:45
clarkbso it's still more than a bugfix upgrade19:45
clarkb#topic Gitea Performance19:46
clarkbthen in parallel we're still seeing occasional gitea performance issues19:46
clarkb#link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access19:46
clarkbThe idea behind this one is that we'll remove any direct backend crawling, which should force access through the lb allowing it to do its job more accurately19:47
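
One hedged way enforcement along the lines of 964728 could look at the firewall level; the load balancer address and backend port here are placeholders, not what the actual change does:

    # Accept gitea traffic only from the load balancer, drop everything else.
    iptables -A INPUT -p tcp --dport 3081 -s 203.0.113.10 -j ACCEPT
    iptables -A INPUT -p tcp --dport 3081 -j DROP
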
clarkb#link https://review.opendev.org/c/opendev/system-config/+/965420 Increase memcached cache size to mitigate effect of crawlers poisoning the cache19:47
clarkbthe idea behind this one is that crawling effectively poisons the cache. Increasing the size of the cache may mitigate the effects of the poisoning on the cache19:47
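
To judge whether the larger cache in 965420 helps, eviction counters are the thing to watch; an illustrative check against memcached's default port:

    # High evictions relative to get_hits means the cache is being churned.
    printf 'stats\r\nquit\r\n' | nc -q1 localhost 11211 \
        | egrep 'evictions|get_hits|get_misses|limit_maxbytes'
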
clarkbI'm willing to try one or both or neither of these. Consider both changes a request for comment/feedback. Happy to try other approaches too19:48
clarkbbut I'm hopeful that load balancing all crawlers and having larger caches will mean no single backend ends up with a fully poisoned cache, improving performance generally19:48
fungi964728 seems to have plenty of consensus19:49
* tonyb is in favor of both changes.19:49
clarkboh cool I missed the consensus on the load balancing change. I can plan to land that after the etherpad upgrade then19:49
clarkband then once we've made some changes we can reevaluate if we need to do more or revert etc19:50
fungii fat-fingered my +2 on 965420 and accidentally approved it for a brief moment, so it may need rechecking in a bit, sorry19:50
clarkbseems you did that quickly enough that zuul never enqueued it to the gate19:50
clarkbfast typing there19:50
clarkbI can recheck it19:51
clarkb#topic Raxflex DFW3 Disabled19:51
clarkbthe last item on the agenda is to cover what happened in raxflex dfw3 yesterday19:51
clarkbI discovered that the mirror was throwing errors at jobs and investigating the server directly showed some afs cache problems19:51
clarkban fs flushall seemed to get stuck and system load was steadily climbing19:52
clarkbthis resulted in me attempting to reboot the server. Doing so via the host itself got stuck waiting for afs to unmount19:52
clarkbafter waiting ~10 minutes I asked nova to stop the server which it did. Asking nova to start the server again does not start the server again19:52
clarkbthe server task state goes to powering-on according to server show but it never starts. unfortunately, it never reports an error either19:53
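
For the record, roughly the sequence described above; the server name is illustrative:

    fs flushall    # flush the local AFS cache; this is the step that hung
    # Graceful reboot stalled waiting on the AFS unmount, so fall back to nova:
    openstack server stop mirror01.dfw3.raxflex.opendev.org
    openstack server start mirror01.dfw3.raxflex.opendev.org
    # Observed failure mode: task_state sits at "powering-on" with no error.
    openstack server show mirror01.dfw3.raxflex.opendev.org \
        -c status -c OS-EXT-STS:task_state -c OS-EXT-STS:power_state
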
clarkbWe pinged dan_with about it in #opendev but haven't heard anything since19:53
tonyband nothing on the console?19:53
clarkbtonyb: you can't get the console bceause the server isn't running19:53
clarkbdfw3 has since been disabled in zuul19:53
tonybOh okay.  that level of 'never starts'19:54
clarkbI think our options are to either wait for rackspace to help us fix it (maybe we need to file an issue for that?) or we can make a new mirror and just start over19:54
clarkbConsidering the historical cinder volume issues there I think there is value in getting the cloud to investigate if we can, but maybe we don't need to wait for that while we return the region to service with a new mirror19:54
fungiwill likely need a new cinder volume too19:55
clarkbya19:55
corvushow about booting a new server+volume and then open a ticket for the old one.  if nothing happens in a week, delete it?19:55
tonyb^^ That's what I was going to suggest19:55
clarkbcorvus: I'd be happy to use that approach. I'm not sure I personally have time to drive that given my current todo list19:55
corvusi'm assuming we have enough quota in the non-zuul account there to run 2 mirrors19:55
clarkbyes I think we do19:56
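
Roughly the shape of the replacement approach corvus suggests; the flavor, image, and volume size are placeholders rather than our real launch parameters:

    # Boot a fresh mirror plus a new cinder volume for its cache,
    # leaving the wedged server in place for the support ticket.
    openstack volume create --size 200 mirror02-cache
    openstack server create --flavor gp.medium --image ubuntu-noble \
        --key-name infra-root mirror02.dfw3.raxflex.opendev.org --wait
    openstack server add volume mirror02.dfw3.raxflex.opendev.org mirror02-cache
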
clarkbI'm happy for someone else to drive that and will help as I can. This week is just really busy for me and I'm out Monday so don't want to overcommit19:57
clarkb#topic Open Discussion19:57
clarkbwe have a few minutes to cover anything else if there is anything else19:57
clarkbapologies if it felt like I was speed running through all of that. I wanted to make sure the listed items got covered. Always feel free to followup outside the meeting on IRC or on the mailing list19:59
clarkbAnd thank you everyone for attending and helping to keep opendev running!19:59
corvusi am happy with your chairing :)19:59
fungiexcellent timekeeping!19:59
tonybhear hear!19:59
clarkbAs I mentioned we should be back here next week at the same time and location, but the agenda email may be delayed19:59
tonybchairing and cheering20:00
clarkband now we are at time. Thanks again!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  4 20:00:13 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.log.html20:00
*** mnaser[m] is now known as mnaser20:46
