| nick | message | time |
|---|---|---|
| clarkb | #startmeeting infra | 19:00 |
| opendevmeet | Meeting started Tue Nov 4 19:00:56 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
| opendevmeet | The meeting name has been set to 'infra' | 19:00 |
| clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/O2QLEHP5SBVJLBCK5WVAGWFSXJEMDI52/ Our Agenda | 19:01 |
| clarkb | #topic Announcements | 19:01 |
| clarkb | If the time of this meeting surprises you then note that we hold the meeting at 1900 UTC which doesn't observe daylight saving time | 19:01 |
| tonyb | one of its best features! | 19:01 |
| clarkb | I will be popping out a little early on Friday and am out on Monday | 19:02 |
| clarkb | we should expect a meeting in a week but the agenda may go out late | 19:02 |
| clarkb | Anything else to announce? | 19:02 |
| clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:04 |
| clarkb | I'm perpetually not keeping up with Gerrit :/ | 19:04 |
| clarkb | there is a new set of releases today from upstream for 3.10.9 and 3.11.7 | 19:04 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/966084 Update Gerrit images for 3.10.9 and 3.11.7 | 19:04 |
| clarkb | I also had to delete one of my holds for Gerrit testing due to a launcher issue but that isn't a big deal as I need to refresh for ^ anyway | 19:05 |
| clarkb | to tl;dr I think I'm back to needing to catch the gerrit deployment up to a happy spot then will refresh node holds and hopefully start testing things | 19:05 |
| clarkb | the linked change above has a parent change that addresses a general issue with gerrit restarts as well and we should probably plan to get both changes in together then restart gerrit to ensure everything is happy about it | 19:05 |
| clarkb | any questions or concerns? (we'll talk about the unexpected shutdowns next in its own topic) | 19:06 |
| clarkb | #topic Gerrit Spontaneous Shutdowns | 19:07 |
| clarkb | The other big gerrit topic is the issue of unexpected server shutdowns | 19:07 |
| clarkb | We had one occur during the summit and we just had one early today UTC time | 19:07 |
| clarkb | thank you tonyb for dealing with the shutdown that occurred today | 19:07 |
| tonyb | All good. | 19:08 |
| clarkb | The first thing I want to note is that the massive h2 cache files and their lockfiles in /home/gerrit2/review_site/cache can be deleted before starting gerrit again in order to speed up gerrit start | 19:08 |
| fungi | yes, huge thanks tonyb, if that had sat waiting for me to wake up it would have severely impacted the publication timeline for an openstack security advisory | 19:09 |
| tonyb | I did delete the files > 1G but not the lockfiles | 19:09 |
| clarkb | the issue there is that gerrit processes those massive files on startup and prunes things into shape which is done serially database by database and prevents many other actions from happening in the system. If we delete the files gerrit simply creates new caches and populates them as it goes | 19:09 |
| clarkb | tonyb: oh interesting, but gerrit startup was still slow? | 19:09 |
| tonyb | Yes | 19:09 |
| clarkb | ok so maybe there is an additional issue there | 19:09 |
| clarkb | fwiw we don't want to delete all the h2 cache files as some manage things like user sessions | 19:10 |
| clarkb | but caches like gerrit_file_diff.h2.db grow very large and can be deleted (but I've only ever deleted them with their lock files) | 19:10 |
| clarkb | I wonder if it waited for some lock timeout or something like that | 19:10 |
| tonyb | Yeah, that was my assumption and based on my passive following of the h2 issue I was confident about deleting the 2 large files | 19:10 |
| clarkb | I did discuss that issue with upstream at the summit and mfick thought it was fixable and had attempted a fix that got reverted due to breaking plugins. But he still felt that it could be done without breaking plugins so hopefully soon it gets addressed | 19:11 |
| tonyb | I'm not familiar enough with what the gerrit logs look like to really know what's "normal" | 19:11 |
| clarkb | ya I also noticed in syslog there was something complaining about not being able to connect to gerrit on 29418 for some time so I think it was basically doing whatever startup routines it needs to in order to feel ready and that took some time | 19:12 |
| clarkb | previously we had believed that to be largely stuck in db processing but maybe there is something else | 19:12 |
| clarkb | really the ideal situation here is to have the service be more resilient and sane about this stuff which is happening (slowly) | 19:13 |
| tonyb | It's hard to debug given it's the production server ;P | 19:13 |
| clarkb | then on the cloud side I sent email to vexxhost support calling out the issue and asking them for advice on mitigating it | 19:13 |
| clarkb | so hopefully we can make that better too | 19:13 |
| clarkb | I cc'd infra rooters on that | 19:13 |
| clarkb | I'm happy to try and schedule the 3.10.9 update restart for a time where as many people as are interested can follow along with the process so that we are more aware of what a "good" restart looks like | 19:14 |
| clarkb | (I still expect the shutdown to timeout unfortunately) | 19:14 |
| clarkb | to summarize gerrit shutdown happened again. Startup in that situation is still slower than we'd like. We may need to debug further or it may simply be a matter of deleting h2 db lock files when deleting h2 cache dbs. And we've engaged with the cloud host to debug the underlying issue | 19:15 |
| clarkb | anything else to note before we move on? | 19:15 |
| tonyb | clarkb: I'd be interested in learning more if we can make that work | 19:15 |
| clarkb | tonyb: yup let's coordinate as the followup changes get into a mergeable state | 19:16 |
| clarkb | #topic Upgrading old servers | 19:17 |
| clarkb | tonyb: I did review both the wiki change and the ansible update stack | 19:17 |
| clarkb | tonyb: on the wiki change I think it may be worthwhile to publish the image to quay.io and deploy the new server on noble so that we're not doing the docker hub -> quay.io and jammy -> noble migrations later unnecessarily | 19:17 |
| clarkb | basically let's skip ahead to the state we want to be in long term if we can | 19:17 |
| clarkb | (I noted this on the change) | 19:18 |
| tonyb | Thank you! I got distracted with other things and I'll update the series soon | 19:18 |
| tonyb | I agree. I'll figure that out based on gerrit/hound containers | 19:18 |
| fungi | and thanks again for working on it | 19:18 |
| tonyb | np, sorry for the really long pause | 19:18 |
| clarkb | then on the ansible side of things I think we're in an odd position where ansible 11 needs python3.11 or newer and jammy by default is 3.10. We can install 3.11 on jammy but I'm not sure how well that will work so suggested we test that with your change stack and determine if ansible 11 implies a noble bridge or if we can make it work on jammy so that we're not mixing | 19:19 |
| clarkb | together bridge updates and ansible updates | 19:19 |
| clarkb | and I think your change stack there is a good framework for testing that sort of thing | 19:19 |
| clarkb | and based on what we learn from that we can make some planning decisions for bridge | 19:19 |
| clarkb | anything else related to server upgrades? | 19:20 |
| tonyb | Yeah, I'm working on that. I think I'll move 'ansible-next' to the end of the stack. I've done some testing with 3.11 and it seems fine. I'm working on a 'clunky' way to update the ansible-venv when needed | 19:20 |
| clarkb | great thanks! | 19:21 |
| tonyb | Well, really move it aside and recreate it with 3.11 | 19:21 |
| clarkb | right virtualenv updates are often weird and starting over is usually simplest | 19:21 |
| tonyb | That's all from me on $topic | 19:22 |
| clarkb | #topic AFS mirror content updates | 19:22 |
| clarkb | the assertion last week that trixie nodes were getting upstream deb mirrors set as pypi proxies had me confused for a bit so I dug into our mirroring stuff | 19:22 |
| clarkb | I think I understand it. Basically those jobs must be overriding our generic base mirror fqdn variable, which assumes all things are mirrored at the same location, but setting the value to the upstream deb mirror instead | 19:23 |
| clarkb | I believe you can separately set the pypi mirror location (and in this case you'd set it to upstream too) | 19:23 |
| clarkb | so one solution here is to basically micromanage each of those mirror locations. Or we can just mirror trixie and the existing tooling should just work | 19:23 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/965334 Mirror trixie packages | 19:23 |
| clarkb | I've pushed ^ to take the just mirror trixie approach | 19:23 |
| clarkb | I also wondered why rocky linux (and now alma linux) weren't affected by this issue and the reason appears to be that we have specific mirror configuration support for each distro | 19:24 |
| clarkb | this means that debian having support means each debian release wants the same setup. But rockylinux having never been configured is fine | 19:24 |
| clarkb | so long story short I think we have two options for distros like debian which are currently in a mixed state. Option one is just mirror all the things for that distro so it isn't in a mixed state and option two is configure each mirror url separately to be correct and point at upstream when not mirrored | 19:25 |
| clarkb | then separately there is a spec in zuul-jobs to improve how mirrors are configured (by being more explicit and less implicit) | 19:25 |
| clarkb | that hasn't been implemented yet, but if there is interest in pushing that over the finish line we can in theory take advantage to do this better | 19:26 |
| clarkb | tonyb: I think you were working on that at one point? | 19:26 |
| tonyb | I was. I didn't get very far. | 19:26 |
| clarkb | (and to be clear this is a long standing item in zuul-jobs) | 19:26 |
| clarkb | ack mnasiadka I know you were talking about looking at things to help. This might be a good option as it shouldn't require any special privileges and needs someone who understands different distro behaviors to accommodate those in the end result | 19:27 |
| clarkb | let me see if I can find the docs link again | 19:27 |
| clarkb | mnasiadka: (or anyone else interested) https://zuul-ci.org/docs/zuul-jobs/latest/mirror.html | 19:28 |
| clarkb | and I don't mean to scare tonyb off just thinking this is a good option for someone who doesn't have root access since it should all be driven through zuul | 19:28 |
| mnasiadka | Sure, can have a look :) | 19:28 |
| tonyb | Oh for sure. I was going to suggest the same thing :) | 19:29 |
| clarkb | anything else related to afs content management? | 19:29 |
| clarkb | #topic Zuul Launcher Updates | 19:30 |
| clarkb | There is a new launcher bug that can create mixed provider nodesets | 19:30 |
| clarkb | #link https://review.opendev.org/c/zuul/zuul/+/965954 Fix assignment of unassigned nodes. | 19:30 |
| clarkb | this is the fix | 19:30 |
| clarkb | unfortunately I think there are a couple other zuul bugs that need to be fixed before that change can land, but it is in the queue/todo list | 19:31 |
| clarkb | The situation where this happens seems to be infrequent as it relies on the launcher reassigning unused nodes from requests that are no longer needed to new requests | 19:31 |
| clarkb | so you have to get things aligned just right to hit it | 19:31 |
| clarkb | Then separately I discovered yesterday that the quota in raxflex iad3 is smaller than I had thought. I had thought it was for 10 instances but we can only boot 5 due to cpu quotas (and 6 if memory quotas are the limit) | 19:32 |
| clarkb | this is why I dropped my held gerrit nodes to free up two held nodes in rax flex iad3 so that some openstack helm requests for 5 nodes could be handled | 19:32 |
| clarkb | cardoe brought up the OSH issue and I asked cardoe to followup with cloudnull about bumping quotas. Otherwise we may need to consider dropping that region for now | 19:33 |
| tonyb | Ahhhh that's what was causing the helm issue. | 19:33 |
| clarkb | I just checked and the quotas haven't been bumped yet | 19:33 |
| corvus | in the mean time, might be good to avoid holding nodes in raxflex-iad3 | 19:34 |
| clarkb | ++ | 19:34 |
| corvus | and if that's not tenable, yeah, maybe we should turn it down. | 19:34 |
| fungi | not that we can avoid holding nodes in a specific provider, but we can certainly delete and reset the hold if it lands in one | 19:35 |
| clarkb | I'm willing to wait another day or two to see if quotas bump but if that doesn't happen then I'm good with turning it off while we wait | 19:35 |
| corvus | yep | 19:35 |
| tonyb | sounds reasonable | 19:36 |
| fungi | sounds fine to me. 5 nodes worth of quota is a drop in the bucket, but was good for making sure the provider is nominally operable for us | 19:36 |
| clarkb | #topic Matrix for OpenDev comms | 19:36 |
| clarkb | This item has not made it into the list of things I'm currently juggling :/ | 19:36 |
| clarkb | someone (sorry I don't remember who) pointed out that mjolnir has a successor implementation | 19:37 |
| clarkb | so we may want to jump straight to that | 19:37 |
| clarkb | but otherwise I haven't seen any movement on this one | 19:37 |
| tonyb | that was mnasiadka (I think) | 19:37 |
| mnasiadka | Yeah, stumbled across that on Ansible community Matrix rooms and got interested | 19:38 |
| clarkb | ack thank you for calling it out. That is good to know so that we don't end up implementing something twice | 19:38 |
| fungi | what are the benefits of the successor? that it's actively maintained and the original isn't, i guess? | 19:38 |
| mnasiadka | clarkb: if there’s anything I can do to help re Matrix - happy to do that (but next week Mon/Tue I’m out) | 19:39 |
| clarkb | https://github.com/the-draupnir-project/Draupnir looks like it has simpler management ux | 19:39 |
| fungi | it's not nearly that time-sensitive | 19:39 |
| clarkb | mnasiadka: thanks I'll let you know. I think the first step is for me or another admin to create the room | 19:39 |
| clarkb | and once that is done we can start experimenting and others can help out with tooling etc | 19:39 |
| clarkb | #topic Etherpad 2.5.2 Upgrade | 19:40 |
| clarkb | sorry I'm going to keep things moving along to make sure we cover all the agenda items | 19:40 |
| corvus | i'm happy to do statusbot for matrix | 19:40 |
| clarkb | last we spoke we were worried about etherpad 2.5.1 css issues | 19:40 |
| clarkb | corvus: thanks | 19:40 |
| clarkb | since then I filed a bug with etherpad and they fixed it quickly and now there is a 2.5.2 which seems to work | 19:40 |
| clarkb | #link https://github.com/ether/etherpad-lite/blob/v2.5.2/CHANGELOG.md | 19:40 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/956593 | 19:41 |
| clarkb | At this point myself, tonyb and mnasiadka have tested a held etherpad 2.5.2 node | 19:41 |
| clarkb | I think we're basically ready to upgrade etherpad | 19:41 |
| clarkb | I didn't want to do it last week with the PTG happening but that is over now | 19:41 |
| clarkb | I'm game for trying to do that today after lunch and a possible bike ride; otherwise I will probably plan to do it first thing tomorrow | 19:41 |
| fungi | oh, and is meetpad still in the disable list? | 19:41 |
| clarkb | fungi: oh I think it is. We should pull it out. | 19:42 |
| clarkb | Probably pull out review03 at the same time? | 19:42 |
| fungi | on it | 19:42 |
| tonyb | and review03 if we're sure that isn't a problem | 19:42 |
| fungi | and taking review03 out too, yes | 19:42 |
| fungi | done | 19:42 |
| clarkb | tonyb: I'm like 99% certain its ok. The first spontaneous shutdown created that file as a directory and nothing exploded | 19:42 |
| clarkb | tonyb: shouldn't be any worse to have it as an empty file | 19:42 |
| tonyb | \o/ | 19:43 |
| clarkb | so ya if you'd like to test etherpad do so soon. Otherwise expect it to be upgraded by sometime tomorrow | 19:43 |
| clarkb | #topic Gitea 1.25.0 Upgrade | 19:43 |
| clarkb | After updating Gitea to 1.24.7 last week they released 1.25.0 | 19:43 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/965960 Upgrade Gitea to 1.25.0 | 19:43 |
| clarkb | That change includes a link to the upstream changelog which is quite long, but the breaking changes list is small and doesn't affect us | 19:43 |
| clarkb | I suspect this is a relatively straightforward upgrade for us, but there is a held node you can interact with and double checking the changelog is always appreciated | 19:44 |
| clarkb | usually by the time we do that they release a .1 or .2 as well and that is what we actually upgrade to | 19:44 |
| clarkb | mnasiadka did some poking around and didn't find any obvious issues | 19:44 |
| clarkb | I do update the versions of golang and nodejs too as well as switch to pnpm to match upstream | 19:45 |
| clarkb | so it's still more than a bugfix upgrade | 19:45 |
| clarkb | #topic Gitea Performance | 19:46 |
| clarkb | then in parallel we're still seeing occasional gitea performance issues | 19:46 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access | 19:46 |
| clarkb | The idea behind this one is that we'll remove any direct backend crawling which should force access through the lb allowing it to do its job more accurately | 19:47 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/965420 Increase memcached cache size to mitigate effect of crawlers poisoning the cache | 19:47 |
| clarkb | the idea behind this one is that crawling effectively poisons the cache. Increasing the size of the cache may mitigate the effects of the poisoning on the cache | 19:47 |
| clarkb | I'm willing to try one or both or neither of these. Consider both changes a request for comment/feedback. Happy to try other approaches too | 19:48 |
| clarkb | but I'm hopeful that load balancing all crawlers and having larger caches will result in no single backend having a fully poisoned cache, improving performance generally | 19:48 |
| fungi | 964728 seems to have plenty of consensus | 19:49 |
| tonyb | * tonyb is in favor of both changes. | 19:49 |
| clarkb | oh cool I missed the consensus on the load balancing change. I can plan to land that after the etherpad upgrade then | 19:49 |
| clarkb | and then once we've made some changes we can reevaluate if we need to do more or revert etc | 19:50 |
| fungi | i fat-fingered my +2 on 965420 and accidentally approved it for a brief moment, so it may need rechecking in a bit, sorry | 19:50 |
| clarkb | seems you did that quickly enough that zuul never enqueued it to the gate | 19:50 |
| clarkb | fast typing there | 19:50 |
| clarkb | I can recheck it | 19:51 |
| clarkb | #topic Raxflex DFW3 Disabled | 19:51 |
| clarkb | the last item on the agenda is to cover what happened in raxflex dfw3 yesterday | 19:51 |
| clarkb | I discovered that the mirror was throwing errors at jobs and investigating the server directly showed some afs cache problems | 19:51 |
| clarkb | an fs flushall seemed to get stuck and system load was steadily climbing | 19:52 |
| clarkb | this resulted in me attempting to reboot the server. Doing so via the host itself got stuck waiting for afs to unmount | 19:52 |
| clarkb | after waiting ~10 minutes I asked nova to stop the server which it did. Asking nova to start the server again does not start the server again | 19:52 |
| clarkb | the server task state goes to powering-on according to server show but it never starts. Unfortunately, it never reports an error either | 19:53 |
| clarkb | We pinged dan_with about it in #opendev but haven't heard anything since | 19:53 |
| tonyb | and nothing on the console? | 19:53 |
| clarkb | tonyb: you can't get the console because the server isn't running | 19:53 |
| clarkb | dfw3 has since been disabled in zuul | 19:53 |
| tonyb | Oh okay. that level of 'never starts' | 19:54 |
| clarkb | I think our options are to either wait for rackspace to help us fix it (maybe we need to file an issue for that?) or we can make a new mirror and just start over | 19:54 |
| clarkb | Considering the historical cinder volume issues there I think there is value in getting the cloud to investigate if we can, but maybe we don't need to wait for that while we return the region to service with a new mirror | 19:54 |
| fungi | will likely need a new cinder volume too | 19:55 |
| clarkb | ya | 19:55 |
| corvus | how about booting a new server+volume and then open a ticket for the old one. if nothing happens in a week, delete it? | 19:55 |
| tonyb | ^^ That's what I was going to suggest | 19:55 |
| clarkb | corvus: I'd be happy to use that approach. I'm not sure I personally have time to drive that given my current todo list | 19:55 |
| corvus | i'm assuming we have enough quota in the non-zuul account there to run 2 mirrors | 19:55 |
| clarkb | yes I think we do | 19:56 |
| clarkb | I'm happy for someone else to drive that and will help as I can. This week is just really busy for me and I'm out Monday so don't want to overcommit | 19:57 |
| clarkb | #topic Open Discussion | 19:57 |
| clarkb | we have a few minutes to cover anything else if there is anything else | 19:57 |
| clarkb | apologies if it felt like I was speed running through all of that. I wanted to make sure the listed items got covered. Always feel free to followup outside the meeting on IRC or on the mailing list | 19:59 |
| clarkb | And thank you everyone for attending and helping to keep opendev running! | 19:59 |
| corvus | i am happy with your chairing :) | 19:59 |
| fungi | excellent timekeeping! | 19:59 |
| tonyb | hear hear! | 19:59 |
| clarkb | As I mentioned we should be back here next week at the same time and location, but the agenda email may be delayed | 19:59 |
| tonyb | chairing and cheering | 20:00 |
| clarkb | and now we are at time. Thanks again! | 20:00 |
| clarkb | #endmeeting | 20:00 |
| opendevmeet | Meeting ended Tue Nov 4 20:00:13 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.html | 20:00 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.txt | 20:00 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-04-19.00.log.html | 20:00 |
| *** | mnaser[m] is now known as mnaser | 20:46 |
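Below are a few illustrative sketches for operational points discussed in the log above; none of them are the exact tooling used on the OpenDev servers. First, the Gerrit h2 cache cleanup discussed around 19:08-19:10: a minimal sketch that deletes only oversized cache databases (and their lock files) under /home/gerrit2/review_site/cache while leaving small caches such as session data alone. The 1 GiB threshold and the `<name>.lock.db` lock-file naming are assumptions drawn from the discussion and H2's usual conventions, and the sketch assumes Gerrit is already stopped.

```python
#!/usr/bin/env python3
"""Sketch: prune oversized Gerrit h2 cache databases before a restart.

Assumes Gerrit is already stopped. The path and the size threshold mirror
values mentioned in the meeting; verify both before running for real.
"""
from pathlib import Path

CACHE_DIR = Path("/home/gerrit2/review_site/cache")
THRESHOLD = 1 * 1024 ** 3  # only touch caches larger than ~1 GiB


def prune_large_caches(dry_run=True):
    for db in sorted(CACHE_DIR.glob("*.h2.db")):
        if db.stat().st_size < THRESHOLD:
            continue  # leave small caches (e.g. session data) alone
        # H2 normally keeps the lock next to the database as <name>.lock.db
        lock = db.with_name(db.name.replace(".h2.db", ".lock.db"))
        for victim in (db, lock):
            if victim.exists():
                print("would delete" if dry_run else "deleting", victim)
                if not dry_run:
                    victim.unlink()


if __name__ == "__main__":
    prune_large_caches(dry_run=True)  # flip to False once the listing looks right
```

Running it with dry_run=True first and checking the listing before deleting anything keeps it in line with the caution expressed in the meeting about which caches are safe to drop.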
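For the ansible-venv rebuild tonyb mentioned at 19:21 (move the old venv aside, recreate it with python3.11), here is a rough sketch. The venv path, python binary name, and ansible version pin are illustrative assumptions rather than the real bridge layout, and it assumes python3.11 is already installed on the host.

```python
#!/usr/bin/env python3
"""Sketch: move an old ansible venv aside and rebuild it with python3.11."""
import shutil
import subprocess
from pathlib import Path

VENV = Path("/usr/ansible-venv")   # hypothetical location, not the real bridge path
PYTHON = "python3.11"              # assumed to already be installed on the host
PIN = "ansible>=11,<12"            # illustrative pin for the ansible 11 series


def rebuild_venv():
    if VENV.exists():
        backup = VENV.with_name(VENV.name + ".old")
        if backup.exists():
            shutil.rmtree(backup)
        VENV.rename(backup)        # keep the old env around until the new one is proven
    subprocess.run([PYTHON, "-m", "venv", str(VENV)], check=True)
    subprocess.run(
        [str(VENV / "bin" / "pip"), "install", "--upgrade", "pip", PIN],
        check=True,
    )


if __name__ == "__main__":
    rebuild_venv()
```

Renaming rather than deleting the old venv matches the "move aside and recreate" approach from the log and leaves an easy rollback path.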
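The raxflex iad3 quota surprise at 19:32 comes down to taking the minimum across the instance, core, and RAM quotas rather than just the instance count. The sketch below reproduces the 5-by-cpu / 6-by-memory split using an assumed 8 vCPU / 8 GiB test-node flavor and hypothetical quota values; the real numbers for the region are not in the log.

```python
"""Sketch: why an instance quota of 10 can still mean only 5 bootable nodes."""


def max_bootable(instances_quota, cores_quota, ram_quota_mb,
                 vcpus_per_node, ram_mb_per_node):
    by_instances = instances_quota
    by_cores = cores_quota // vcpus_per_node
    by_ram = ram_quota_mb // ram_mb_per_node
    detail = {"instances": by_instances, "cores": by_cores, "ram": by_ram}
    return min(detail.values()), detail


# Assumed 8 vCPU / 8 GiB flavor with hypothetical quota values chosen to
# reproduce the split described in the meeting.
limit, detail = max_bootable(instances_quota=10, cores_quota=40,
                             ram_quota_mb=50 * 1024,
                             vcpus_per_node=8, ram_mb_per_node=8 * 1024)
print(limit, detail)  # -> 5; cores are the binding limit, ram alone would allow 6
```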
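For the memcached cache-size change discussed at 19:47, one way to judge whether a larger cache actually relieves the crawler-induced cache poisoning is to watch memcached's eviction and hit/miss counters via its text-protocol `stats` command. The host and port below are assumptions about how a gitea backend exposes memcached; adjust to the real deployment.

```python
"""Sketch: read memcached stats to see whether the cache is under eviction pressure."""
import socket


def memcached_stats(host="localhost", port=11211):
    """Return the key/value pairs from memcached's text-protocol 'stats' command."""
    stats = {}
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    for line in buf.decode().splitlines():
        if line.startswith("STAT "):
            _, key, value = line.split(" ", 2)
            stats[key] = value
    return stats


if __name__ == "__main__":
    s = memcached_stats()
    for key in ("limit_maxbytes", "bytes", "evictions", "get_hits", "get_misses"):
        print(key, s.get(key))
```

Falling evictions and a rising hit ratio after the cache-size bump would suggest the change is doing what was hoped; flat numbers would point back at the load-balancing change as the more important lever.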
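Finally, the dfw3 mirror recovery attempt at 19:52-19:53 was essentially a stop/wait/start sequence against nova, sketched here with openstacksdk. The cloud name and server name are hypothetical; in the actual incident the start call was accepted (task_state showed powering-on) but the instance never came back up and no error was reported, which is why the region was disabled in zuul.

```python
"""Sketch: stop/start an instance via openstacksdk, as was tried on the dfw3 mirror."""
import openstack

# Both names are hypothetical; substitute the real clouds.yaml entry and mirror name.
conn = openstack.connect(cloud="opendev-raxflex-dfw3")
server = conn.compute.find_server("mirror01.dfw3.raxflex.opendev.org")

# Ask nova to stop the instance and wait until it actually reaches SHUTOFF.
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status="SHUTOFF", wait=600)

# Ask nova to start it again, then re-fetch it to see what state it lands in.
conn.compute.start_server(server)
refreshed = conn.compute.get_server(server.id)
print(refreshed.status, refreshed.task_state)
```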