19:00:56 #startmeeting infra
19:00:56 Meeting started Tue Nov  4 19:00:56 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:56 The meeting name has been set to 'infra'
19:01:09 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/O2QLEHP5SBVJLBCK5WVAGWFSXJEMDI52/ Our Agenda
19:01:12 #topic Announcements
19:01:30 If the time of this meeting surprises you, note that we hold the meeting at 1900 UTC, which doesn't observe daylight saving time
19:01:56 one of its best features!
19:02:05 I will be popping out a little early on Friday and am out on Monday
19:02:18 we should expect a meeting in a week but the agenda may go out late
19:02:23 Anything else to announce?
19:04:06 #topic Gerrit 3.11 Upgrade Planning
19:04:16 I'm perpetually not keeping up with Gerrit :/
19:04:25 there is a new set of releases today from upstream for 3.10.9 and 3.11.7
19:04:49 #link https://review.opendev.org/c/opendev/system-config/+/966084 Update Gerrit images for 3.10.9 and 3.11.7
19:05:08 I also had to delete one of my holds for Gerrit testing due to a launcher issue, but that isn't a big deal as I need to refresh for ^ anyway
19:05:28 the tl;dr is that I'm back to needing to catch the gerrit deployment up to a happy spot, then I will refresh node holds and hopefully start testing things
19:05:53 the linked change above has a parent change that addresses a general issue with gerrit restarts as well, and we should probably plan to get both changes in together then restart gerrit to ensure everything is happy about it
19:06:15 any questions or concerns? (we'll talk about the unexpected shutdowns next in its own topic)
19:07:19 #topic Gerrit Spontaneous Shutdowns
19:07:33 The other big gerrit topic is the issue of unexpected server shutdowns
19:07:42 We had one occur during the summit and we just had one early today UTC time
19:07:51 thank you tonyb for dealing with the shutdown that occurred today
19:08:26 All good.
19:08:35 The first thing I want to note is that the massive h2 cache files and their lockfiles in /home/gerrit2/review_site/cache can be deleted before starting gerrit again in order to speed up gerrit start
19:09:03 yes, huge thanks tonyb, if that had sat waiting for me to wake up it would have severely impacted the publication timeline for an openstack security advisory
19:09:03 I did delete the files > 1G but not the lockfiles
19:09:09 the issue there is that gerrit processes those massive files on startup and prunes things into shape, which is done serially database by database and prevents many other actions from happening in the system. If we delete the files gerrit simply creates new caches and populates them as it goes
19:09:18 tonyb: oh interesting, but gerrit startup was still slow?
19:09:22 Yes
19:09:56 ok so maybe there is an additional issue there
19:10:08 fwiw we don't want to delete all the h2 cache files as some manage things like user sessions
19:10:25 but caches like gerrit_file_diff.h2.db grow very large and can be deleted (but I've only ever deleted them with their lock files)
19:10:33 I wonder if it waited for some lock timeout or something like that
19:10:48 Yeah. That was my assumption, and based on my passive following of the h2 issue I was confident about deleting the 2 large files
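For reference, a minimal sketch of the cache cleanup described above, assuming the layout mentioned in the meeting (/home/gerrit2/review_site/cache, *.h2.db files with companion *.lock.db files). The 1 GiB threshold and the keep-list of session-style caches are illustrative assumptions, and this should only ever run while Gerrit is stopped:

```python
#!/usr/bin/env python3
"""Prune oversized Gerrit H2 cache databases before a restart.

A sketch of the cleanup discussed above; the cache path and the
gerrit_file_diff example come from the meeting, while the 1 GiB
threshold, the KEEP set, and the <name>.lock.db naming are assumptions.
Run only while Gerrit is stopped.
"""
from pathlib import Path

CACHE_DIR = Path("/home/gerrit2/review_site/cache")
SIZE_LIMIT = 1 * 1024 ** 3   # only touch caches larger than ~1 GiB
KEEP = {"web_sessions"}      # caches we never delete (assumed name)


def prune_caches(dry_run: bool = True) -> None:
    for db in sorted(CACHE_DIR.glob("*.h2.db")):
        name = db.name[: -len(".h2.db")]
        if name in KEEP or db.stat().st_size < SIZE_LIMIT:
            continue
        lock = CACHE_DIR / f"{name}.lock.db"
        for path in (db, lock):
            if path.exists():
                print(f"{'would remove' if dry_run else 'removing'} {path}")
                if not dry_run:
                    path.unlink()


if __name__ == "__main__":
    prune_caches(dry_run=True)  # flip to False after reviewing the output
```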
19:11:34 I did discuss that issue with upstream at the summit and mfick thought it was fixable and had attempted a fix that got reverted due to breaking plugins. But he still felt that it could be done without breaking plugins so hopefully soon it gets addressed
19:11:57 I'm not familiar enough with what the gerrit logs look like to really know what's "normal"
19:12:34 ya I also noticed in syslog there was something complaining about not being able to connect to gerrit on 29418 for some time, so I think it was basically doing whatever startup routines it needs to in order to feel ready and that took some time
19:12:45 previously we had believed that to be largely stuck in db processing but maybe there is something else
19:13:16 really the ideal situation here is to have the service be more resilient and sane about this stuff, which is happening (slowly)
19:13:28 It's hard to debug given it's the production server ;P
19:13:34 then on the cloud side I sent an email to vexxhost support calling out the issue and asking them for advice on mitigating it
19:13:40 so hopefully we can make that better too
19:13:47 I cc'd infra-root on that
19:14:26 I'm happy to try and schedule the 3.10.9 update restart for a time when as many people as are interested can follow along with the process so that we are more aware of what a "good" restart looks like
19:14:35 (I still expect the shutdown to time out, unfortunately)
19:15:25 to summarize: a gerrit shutdown happened again. Startup in that situation is still slower than we'd like. We may need to debug further, or it may simply be a matter of deleting h2 db lock files when deleting h2 cache dbs. And we've engaged with the cloud host to debug the underlying issue
19:15:32 anything else to note before we move on?
19:15:54 clarkb: I'd be interested in learning more if we can make that work
19:16:58 tonyb: yup, let's coordinate as the followup changes get into a mergeable state
19:17:10 #topic Upgrading old servers
19:17:20 tonyb: I did review both the wiki change and the ansible update stack
19:17:46 tonyb: on the wiki change I think it may be worthwhile to publish the image to quay.io and deploy the new server on noble so that we're not doing the docker hub -> quay.io and jammy -> noble migrations later unnecessarily
19:17:54 basically let's skip ahead to the state we want to be in long term if we can
19:18:00 (I noted this on the change)
19:18:01 Thank you! I got distracted with other things and I'll update the series soon
19:18:26 I agree. I'll figure that out based on the gerrit/hound containers
19:18:36 and thanks again for working on it
19:18:50 np, sorry for the really long pause
19:19:02 then on the ansible side of things I think we're in an odd position where ansible 11 needs python3.11 or newer and jammy by default is 3.10. We can install 3.11 on jammy but I'm not sure how well that will work, so I suggested we test that with your change stack and determine if ansible 11 implies a noble bridge or if we can make it work on jammy so that we're not mixing together bridge updates and ansible updates
19:19:12 and I think your change stack there is a good framework for testing that sort of thing
19:19:48 and based on what we learn from that we can make some planning decisions for bridge
19:20:29 anything else related to server upgrades?
19:20:44 Yeah, I'm working on that. I think I'll move the 'ansible-next' changes to the end of the stack. I've done some testing with 3.11 and it seems fine. I'm working on a 'clunky' way to update the ansible-venv when needed
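As a rough illustration of the "rebuild the ansible venv on Python 3.11" idea being discussed here, a sketch follows. The venv path, interpreter location, and ansible version pin are placeholder assumptions, not the actual bridge layout:

```python
#!/usr/bin/env python3
"""Recreate an ansible virtualenv against Python 3.11.

A minimal sketch of the approach under discussion: move the old venv
aside, then build a fresh one on a newer interpreter. The paths and the
ansible pin below are placeholders, not the production bridge layout.
"""
import shutil
import subprocess
from pathlib import Path

VENV = Path("/opt/ansible-venv")   # placeholder venv location
PYTHON = "/usr/bin/python3.11"     # interpreter satisfying ansible 11's >=3.11 floor
ANSIBLE_SPEC = "ansible>=11,<12"   # placeholder version pin


def recreate_venv() -> None:
    if VENV.exists():
        backup = VENV.with_name(VENV.name + ".old")
        shutil.rmtree(backup, ignore_errors=True)  # drop any previous backup
        VENV.rename(backup)                        # move the old venv aside for rollback
    subprocess.run([PYTHON, "-m", "venv", str(VENV)], check=True)
    pip = VENV / "bin" / "pip"
    subprocess.run([str(pip), "install", "--upgrade", "pip", ANSIBLE_SPEC], check=True)


if __name__ == "__main__":
    recreate_venv()
```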
19:21:04 great thanks!
19:21:07 Well, really move it aside and recreate it with 3.11
19:21:57 right, virtualenv updates are often weird and starting over is usually simplest
19:22:12 That's all from me on $topic
19:22:18 #topic AFS mirror content updates
19:22:44 the assertion last week that trixie nodes were getting upstream deb mirrors set as pypi proxies had me confused for a bit, so I dug into our mirroring stuff
19:23:16 I think I understand it. Basically those jobs must be overriding our generic base mirror fqdn variable, which assumes all things are mirrored at the same location, but they are setting the value to the upstream deb mirror
19:23:27 I believe you can separately set the pypi mirror location (and in this case you'd set it to upstream too)
19:23:46 so one solution here is to basically micromanage each of those mirror locations. Or we can just mirror trixie and the existing tooling should just work
19:23:51 #link https://review.opendev.org/c/opendev/system-config/+/965334 Mirror trixie packages
19:23:59 I've pushed ^ to take the "just mirror trixie" approach
19:24:26 I also wondered why rocky linux (and now alma linux) weren't affected by this issue, and the reason appears to be that we have specific mirror configuration support for each distro
19:24:44 this means that debian having support means each debian release wants the same setup. But rockylinux, having never been configured, is fine
19:25:31 so long story short I think we have two options for distros like debian which are currently in a mixed state. Option one is to mirror all the things for that distro so it isn't in a mixed state, and option two is to configure each mirror url separately to be correct and point at upstream when not mirrored
19:25:48 then separately there is a spec in zuul-jobs to improve how mirrors are configured (by being more explicit and less implicit)
19:26:07 that hasn't been implemented yet, but if there is interest in pushing that over the finish line we can in theory take advantage of it to do this better
19:26:13 tonyb: I think you were working on that at one point?
19:26:41 I was. I didn't get very far.
19:26:47 (and to be clear this is a long-standing item in zuul-jobs)
19:27:21 ack mnasiadka I know you were talking about looking at things to help. This might be a good option as it shouldn't require any special privileges and needs someone who understands different distro behaviors to accommodate those in the end result
19:27:32 let me see if I can find the docs link again
19:28:02 mnasiadka: (or anyone else interested) https://zuul-ci.org/docs/zuul-jobs/latest/mirror.html
19:28:27 and I don't mean to scare tonyb off, I just think this is a good option for someone who doesn't have root access since it should all be driven through zuul
19:28:52 Sure, can have a look :)
19:29:11 Oh for sure. I was going to suggest the same thing :)
19:29:44 anything else related to afs content management?
19:30:33 #topic Zuul Launcher Updates
19:30:44 There is a new launcher bug that can create mixed-provider nodesets
19:30:50 #link https://review.opendev.org/c/zuul/zuul/+/965954 Fix assignment of unassigned nodes.
19:30:52 this is the fix
19:31:10 unfortunately I think there are a couple of other zuul bugs that need to be fixed before that change can land, but it is in the queue/todo list
19:31:43 The situation where this happens seems to be infrequent as it relies on the launcher reassigning unused nodes from requests that are no longer needed to new requests
19:31:52 so you have to get things aligned just right to hit it
19:32:23 Then separately I discovered yesterday that the quota in raxflex iad3 is smaller than I had thought. I had thought it was for 10 instances but we can only boot 5 due to cpu quotas (and 6 if memory quotas are the limit)
19:32:47 this is why I dropped my held gerrit nodes to free up two held nodes in rax flex iad3 so that some openstack helm requests for 5 nodes could be handled
19:33:20 cardoe brought up the OSH issue and I asked cardoe to follow up with cloudnull about bumping quotas. Otherwise we may need to consider dropping that region for now
19:33:41 Ahhhh that's what was causing the helm issue.
19:33:56 I just checked and the quotas haven't been bumped yet
19:34:46 in the meantime, might be good to avoid holding nodes in raxflex-iad3
19:34:50 ++
19:34:58 and if that's not tenable, yeah, maybe we should turn it down.
19:35:33 not that we can avoid holding nodes in a specific provider, but we can certainly delete and reset the hold if it lands in one
19:35:48 I'm willing to wait another day or two to see if quotas bump, but if that doesn't happen then I'm good with turning it off while we wait
19:35:52 yep
19:36:11 sounds reasonable
19:36:21 sounds fine to me. 5 nodes' worth of quota is a drop in the bucket, but it was good for making sure the provider is nominally operable for us
19:36:36 #topic Matrix for OpenDev comms
19:36:46 This item has not made it into the list of things I'm currently juggling :/
19:37:02 someone (sorry I don't remember who) pointed out that mjolnir has a successor implementation
19:37:07 so we may want to jump straight to that
19:37:14 but otherwise I haven't seen any movement on this one
19:37:19 that was mnasiadka (I think)
19:38:00 Yeah, stumbled across that in the Ansible community Matrix rooms and got interested
19:38:34 ack thank you for calling it out. That is good to know so that we don't end up implementing something twice
19:38:35 what are the benefits of the successor? that it's actively maintained and the original isn't, i guess?
19:39:07 clarkb: if there's anything I can do to help re Matrix - happy to do that (but next week Mon/Tue I'm out)
19:39:15 https://github.com/the-draupnir-project/Draupnir looks like it has simpler management ux
19:39:23 it's not nearly that time-sensitive
19:39:35 mnasiadka: thanks, I'll let you know. I think the first step is for me or another admin to create the room
19:39:44 and once that is done we can start experimenting and others can help out with tooling etc.
19:40:17 #topic Etherpad 2.5.2 Upgrade
19:40:25 sorry, I'm going to keep things moving along to make sure we cover all the agenda items
19:40:34 i'm happy to do statusbot for matrix
19:40:36 last we spoke we were worried about etherpad 2.5.1 css issues
19:40:38 corvus: thanks
19:40:53 since then I filed a bug with etherpad and they fixed it quickly, and now there is a 2.5.2 which seems to work
19:40:57 #link https://github.com/ether/etherpad-lite/blob/v2.5.2/CHANGELOG.md
19:41:01 #link https://review.opendev.org/c/opendev/system-config/+/956593
19:41:12 At this point myself, tonyb and mnasiadka have tested a held etherpad 2.5.2 node
19:41:20 I think we're basically ready to upgrade etherpad
19:41:30 I didn't want to do it last week with the PTG happening but that is over now
19:41:48 I'm game for trying to do that today after a lunch and a possible bike ride, otherwise I will probably plan to do it first thing tomorrow
19:41:53 oh, and is meetpad still in the disable list?
19:42:02 fungi: oh I think it is. We should pull it out.
19:42:09 Probably pull out review03 at the same time?
19:42:09 on it
19:42:29 and review03 if we're sure that isn't a problem
19:42:40 and taking review03 out too, yes
19:42:46 done
19:42:49 tonyb: I'm like 99% certain it's ok. The first spontaneous shutdown created that file as a directory and nothing exploded
19:42:57 tonyb: shouldn't be any worse to have it as an empty file
19:43:01 \o/
19:43:17 so ya, if you'd like to test etherpad do so soon. Otherwise expect it to be upgraded by sometime tomorrow
19:43:23 #topic Gitea 1.25.0 Upgrade
19:43:34 After updating Gitea to 1.24.7 last week they released 1.25.0
19:43:41 #link https://review.opendev.org/c/opendev/system-config/+/965960 Upgrade Gitea to 1.25.0
19:43:56 That change includes a link to the upstream changelog which is quite long, but the breaking changes list is small and doesn't affect us
19:44:17 I suspect this is a relatively straightforward upgrade for us, but there is a held node you can interact with and double-checking the changelog is always appreciated
19:44:28 usually by the time we do that they release a .1 or .2 as well and that is what we actually upgrade to
19:44:35 mnasiadka did some poking around and didn't find any obvious issues
19:45:00 I do update the versions of golang and nodejs too, as well as switch to pnpm to match upstream
19:45:21 so it's still more than a bugfix upgrade
19:46:19 #topic Gitea Performance
19:46:39 then in parallel we're still seeing occasional gitea performance issues
19:46:44 #link https://review.opendev.org/c/opendev/system-config/+/964728 Don't allow direct backend access
19:47:06 The idea behind this one is that we'll remove any direct backend crawling, which should force access through the lb, allowing it to do its job more accurately
19:47:12 #link https://review.opendev.org/c/opendev/system-config/+/965420 Increase memcached cache size to mitigate effect of crawlers poisoning the cache
19:47:36 the idea behind this one is that crawling effectively poisons the cache. Increasing the size of the cache may mitigate the effects of the poisoning
19:48:06 I'm willing to try one or both or neither of these. Consider both changes a request for comment/feedback. Happy to try other approaches too
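To illustrate the reasoning behind the memcached change, here is a toy LRU model. This is not Gitea's or memcached's actual eviction logic; it is just a sketch showing how a stream of unique crawler URLs evicts the hot pages interactive users rely on, and how a larger capacity keeps the hot set resident:

```python
"""Toy model of crawler cache poisoning against an LRU cache."""
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data: OrderedDict[str, str] = OrderedDict()

    def get_or_load(self, key: str) -> bool:
        """Return True on a hit; otherwise load the key and maybe evict."""
        if key in self.data:
            self.data.move_to_end(key)        # refresh recency on a hit
            return True
        self.data[key] = "payload"
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict the least recently used entry
        return False


def hot_hit_rate(capacity: int) -> float:
    cache = LRUCache(capacity)
    hot = [f"hot-{i}" for i in range(50)]     # pages interactive users keep hitting
    hits = total = 0
    for round_ in range(100):
        for key in hot:
            hits += cache.get_or_load(key)
            total += 1
        for i in range(500):                  # crawler walks unique, never-repeated URLs
            cache.get_or_load(f"crawl-{round_}-{i}")
    return hits / total


if __name__ == "__main__":
    for capacity in (100, 1000):
        print(f"capacity={capacity:5d}  hot hit rate={hot_hit_rate(capacity):.2f}")
```

With the small capacity the crawler evicts the hot set every round and the hit rate collapses; with a capacity larger than one round of crawler keys the hot pages stay cached.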
19:48:59 but I'm hopeful that by load balancing all crawlers and having larger caches, no single backend will end up with a fully poisoned cache, improving performance generally
19:49:01 964728 seems to have plenty of consensus
19:49:05 * tonyb is in favor of both changes.
19:49:37 oh cool, I missed the consensus on the load balancing change. I can plan to land that after the etherpad upgrade then
19:50:03 and then once we've made some changes we can reevaluate if we need to do more or revert etc.
19:50:22 i fat-fingered my +2 on 965420 and accidentally approved it for a brief moment, so it may need rechecking in a bit, sorry
19:50:55 seems you did that quickly enough that zuul never enqueued it to the gate
19:50:58 fast typing there
19:51:02 I can recheck it
19:51:04 #topic Raxflex DFW3 Disabled
19:51:30 the last item on the agenda is to cover what happened in raxflex dfw3 yesterday
19:51:47 I discovered that the mirror was throwing errors at jobs, and investigating the server directly showed some afs cache problems
19:52:13 an fs flushall seemed to get stuck and system load was steadily climbing
19:52:27 this resulted in me attempting to reboot the server. Doing so via the host itself got stuck waiting for afs to unmount
19:52:52 after waiting ~10 minutes I asked nova to stop the server, which it did. Asking nova to start the server again does not start the server again
19:53:13 the server task state goes to powering-on according to server show but it never starts. Unfortunately, it never reports an error either
19:53:26 We pinged dan_with about it in #opendev but haven't heard anything since
19:53:36 and nothing on the console?
19:53:45 tonyb: you can't get the console because the server isn't running
19:53:59 dfw3 has since been disabled in zuul
19:54:16 Oh okay, that level of 'never starts'
19:54:26 I think our options are to either wait for rackspace to help us fix it (maybe we need to file an issue for that?) or we can make a new mirror and just start over
19:54:51 Considering the historical cinder volume issues there I think there is value in getting the cloud to investigate if we can, but maybe we don't need to wait for that while we return the region to service with a new mirror
19:55:00 will likely need a new cinder volume too
19:55:05 ya
19:55:31 how about booting a new server+volume and then opening a ticket for the old one? if nothing happens in a week, delete it
19:55:49 ^^ That's what I was going to suggest
19:55:51 corvus: I'd be happy to use that approach. I'm not sure I personally have time to drive that given my current todo list
19:55:56 i'm assuming we have enough quota in the non-zuul account there to run 2 mirrors
19:56:11 yes I think we do
19:57:13 I'm happy for someone else to drive that and will help as I can. This week is just really busy for me and I'm out Monday so I don't want to overcommit
19:57:27 #topic Open Discussion
19:57:36 we have a few minutes to cover anything else if there is anything else
19:59:10 apologies if it felt like I was speed-running through all of that. I wanted to make sure the listed items got covered. Always feel free to follow up outside the meeting on IRC or on the mailing list
19:59:38 And thank you everyone for attending and helping to keep opendev running!
19:59:39 i am happy with your chairing :)
19:59:53 excellent timekeeping!
19:59:57 hear hear!
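As a sketch of the "boot a new server + volume" plan discussed above for the DFW3 mirror, using openstacksdk: the cloud name, server name, image, flavor, network, and volume size are all placeholders and this is not OpenDev's real launch-node tooling.

```python
#!/usr/bin/env python3
"""Boot a replacement mirror plus cinder volume rather than waiting on the stuck one.

A minimal sketch only; every name and size below is a placeholder assumption.
"""
import openstack


def launch_replacement_mirror() -> None:
    conn = openstack.connect(cloud="raxflex-dfw3")      # assumed clouds.yaml entry
    server = conn.create_server(
        name="mirror02.dfw3.raxflex.opendev.org",       # placeholder server name
        image=conn.get_image("Ubuntu 24.04"),           # placeholder image
        flavor=conn.get_flavor("gp.4.8.20"),            # placeholder flavor
        network="public",                               # placeholder network
        wait=True,
        auto_ip=False,
    )
    # Separate cinder volume for the mirror's cache filesystems, size is a guess.
    volume = conn.create_volume(size=200, name="mirror02-dfw3-cache", wait=True)
    conn.attach_volume(server, volume, wait=True)
    print(f"booted {server.name} ({server.id}) with volume {volume.id}")


if __name__ == "__main__":
    launch_replacement_mirror()
```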
19:59:58 As I mentioned, we should be back here next week at the same time and location, but the agenda email may be delayed
20:00:10 chairing and cheering
20:00:11 and now we are at time. Thanks again!
20:00:13 #endmeeting