19:00:57 <clarkb> #startmeeting infra
19:00:57 <opendevmeet> Meeting started Tue Jul 22 19:00:57 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:57 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:57 <opendevmeet> The meeting name has been set to 'infra'
19:01:57 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/7PMI6EMXUSHZA4J2CJ5XNXM3BKHH3CXH/ Our Agenda
19:02:27 <clarkb> #topic Announcements
19:02:29 <clarkb> Anything to announce?
19:03:19 <clarkb> Sounds like no
19:03:25 <clarkb> #topic Zuul-launcher
19:03:26 <fungi> i didn't have anything
19:03:28 <clarkb> We can dive right in then
19:04:24 <clarkb> mixed cloud nodesets are still happening (at a lower rate)
19:04:27 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/955545 is the current proposal due to lack of verbose errors from openstack clouds
19:04:47 <clarkb> there is also a bugfix to move image hashing into the image upload role so that we can hash the correct format of the data
19:04:58 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/955621/
19:05:09 <corvus> yeah, afaict they're happening because we're hitting quota, but the clouds give us error messages that don't tell us that.
19:05:27 <corvus> we're unable to upload raw images until 621 lands
19:05:37 <clarkb> and at least some of those errors we expect to be included in the error response but there is a bug in nova
19:05:42 <fungi> because you need a time machine to interpret 15 years of openstack error messages as they evolved
19:06:02 <corvus> we do a lot, but some of them are just plain empty strings
19:06:28 <fungi> not much you can do with a null response
19:06:42 <corvus> so even though zuul-launcher is basically at the point where anything with "exceed" or "quota" in the string is treated as a quota error, yep, not much else we can do with nothing.
19:07:15 <clarkb> I see you noted centos 9 failures. Those issues are indicative of a broken mirror (due to upstream updates happening to files out of the correct order)
19:07:29 <clarkb> we mirror directly from upstream now so this is upstream not updating their mirror files in the correct order aiui
19:07:32 <corvus> there's also a bug in the launcher that's causing extra "pending" uploads; a fix for that is in progress
19:07:50 <corvus> clarkb: that failed a recheck too... how long should we expect those periods to last?
19:09:38 <clarkb> corvus: sometimes it's until the next mirror sync, which I think happens after 4 or 6 hours
19:09:47 <corvus> oof
19:10:00 <clarkb> but sometimes we've seen it go on for days when upstream isn't concerned about fixing that stuff
19:10:40 <clarkb> corvus: for problems like this I feel like we can force merge as long as most of the images build
19:11:05 <corvus> yeah, maybe the way to go. we could also think about making things nonvoting
19:11:09 <clarkb> though maybe that impacts record keeping for subsequent image uploads?
19:11:25 <clarkb> nonvoting is an interesting idea too since it should work when images build?
19:11:36 <clarkb> and that would make things lazy/optimistic/eventually consistent
19:11:45 <corvus> yep. either way should work.
19:12:26 <clarkb> cool.
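(For readers, the error classification corvus describes above boils down to a substring match on whatever error text the cloud returns. The sketch below is only an illustration of that heuristic, not the actual zuul-launcher code; the function name and sample messages are made up for the example.)

    def looks_like_quota_error(message: str) -> bool:
        """Heuristic: treat any error text mentioning quota limits as a quota failure."""
        if not message:
            # Some clouds return an empty error string, so nothing can be inferred.
            return False
        lowered = message.lower()
        return "quota" in lowered or "exceed" in lowered

    # Example: a descriptive nova error is classified, an empty one cannot be.
    assert looks_like_quota_error("Quota exceeded for instances") is True
    assert looks_like_quota_error("") is False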
19:12:26 <clarkb> The other thing I wanted to note here is that nodepool is completely shut down at this point, so if you see errors or have questions we need to refer to niz now
19:12:47 <corvus> yep, and i'm ready to delete the nodepool servers whenever
19:13:06 <corvus> how does "now" sound? :)
19:13:49 <clarkb> I think I'm ready if you are. Rolling back seems unlikely at this point as we've been able to roll forward on the majority of the workload for several weeks
19:13:57 <corvus> https://review.opendev.org/955229 is the change to remove nodepool config completely; after that merges i can issue the delete commands
19:14:02 <clarkb> and worst case we can create new nodepool servers later if there is a reason to
19:14:54 <corvus> looks like that change may need to run the gauntlet (it failed on 2 different jobs on 2 different rechecks)
19:14:58 <corvus> but i think it's ready
19:15:09 <fungi> now is good with me
19:15:15 <clarkb> it runs a lot of jobs
19:15:41 <corvus> okay i will make it so
19:15:44 <clarkb> I suspect due to the changes to inventory/ we may be able to optimize the job selection there better if it becomes a problem
19:16:12 <clarkb> anything else on the topic of nodepool in zuul?
19:16:26 <corvus> i think that's it
19:16:34 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:16:57 <clarkb> Last Friday fungi and I landed all the gerrit image backlog changes and restarted gerrit
19:17:23 <clarkb> the end result is images that we can use to test the gerrit upgrade, and they are unlikely to change much between now and the actual upgrade
19:17:28 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:17:33 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:17:52 <fungi> yay!
19:17:59 <clarkb> If you get a moment, looking over the release notes and making notes in that etherpad for things to test/check would be great
19:18:09 <clarkb> I have also held new test nodes
19:18:14 <clarkb> #link https://zuul.opendev.org/t/openstack/build/f1ca0d1f2e054829a4506ececb58bed3
19:18:19 <clarkb> #link https://zuul.opendev.org/t/openstack/build/588723b923e94901af3065143d9df818
19:18:44 <clarkb> these two builds are the builds with held nodes. I have not done any upgrade testing on them yet. But the 3.11 nodes might be good to interact with for any ui changes
19:19:12 <clarkb> My main concern at the moment is that after the recent 3.11.4 update there are two different reports on the upstream mailing list for issues that would be problematic for us
19:19:29 <clarkb> first is offline reindexing not working (it spins forever after completing 99% of the work)
19:20:07 <clarkb> and the other is the replication plugin refusing to attempt replication to targets after some time. Even asking for a full replication run doesn't work. You have to restart gerrit to get it to try again
19:20:23 <clarkb> so I want to see if I can test whether these are problems in gerrit 3.11.4 itself or with the specific deployments involved as part of upgrade testing
19:21:28 <clarkb> I'm hoping to start digging into this tomorrow
19:21:47 <clarkb> any questions or concerns about the gerrit upgrade situation?
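(For readers unfamiliar with how the two reported 3.11.4 issues above might be exercised on a held node, the sketch below drives the standard Gerrit offline reindex invocation and the replication plugin's SSH command from Python. It is only an illustrative assumption of one possible test approach, not part of the meeting; the site path and SSH host/port are placeholders for whatever the held node actually uses.)

    import subprocess

    # Placeholders: adjust to match the held test node.
    GERRIT_SITE = "/home/gerrit2/review_site"
    GERRIT_SSH = ["ssh", "-p", "29418", "admin@review.test"]

    def offline_reindex() -> None:
        # Offline reindex runs while Gerrit is stopped; the reported 3.11.4 bug
        # is that it appears to spin forever after completing ~99% of the work.
        subprocess.run(
            ["java", "-jar", f"{GERRIT_SITE}/bin/gerrit.war", "reindex", "-d", GERRIT_SITE],
            check=True,
        )

    def trigger_full_replication() -> None:
        # Ask the replication plugin for a full run; the reported bug is that it
        # stops attempting replication over time and only a restart recovers it.
        subprocess.run(GERRIT_SSH + ["replication", "start", "--all", "--wait"], check=True)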
19:22:55 <clarkb> #topic Upgrading old servers
19:23:09 <clarkb> The other train of thought I've kicked off this week is starting to look at replacing the eavesdrop server
19:23:14 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/955544/ prep step for eavesdrop move to Noble
19:23:43 <clarkb> this change is a prep step to make the existing docker compose configs compatible with docker compose + podman on noble. It should be forward and backward compatible and is something we've applied elsewhere without issue
19:24:19 <clarkb> once that is in and happy on the old system I'll look into booting a new system and determining what the cutover looks like. I believe it's something like shutting down all the irc bots on the old server, deploying the new server, and ensuring all the bots are started there and writing to afs happily again
19:24:32 <clarkb> (all the actual data is in afs iirc so we don't need to migrate volumes, but I still need to double check that assertion)
19:25:50 <clarkb> fungi: any word on refstack yet?
19:26:41 <fungi> no, i'm planning to just write up an announcement and get someone to agree it's okay
19:27:01 <fungi> and then i'll send it out to openstack-discuss once i get approval
19:27:19 <clarkb> sounds good
19:27:21 <clarkb> thanks
19:27:35 <clarkb> anyone else have server replacement updates? I don't think we've done any recently but happy to have missed some :)
19:28:12 <fungi> i think nobody on the foundation staff cares what happens wrt refstack and associated git repos, it's mostly just me wanting to make sure users aren't surprised and we have some information to point at when there are questions
19:28:21 <clarkb> ++
19:28:50 <clarkb> #topic Matrix for OpenDev Comms
19:28:58 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing
19:29:01 <clarkb> I wrote the spec
19:29:27 <clarkb> I tried to capture why we've used IRC, what we like about it, and how Matrix helps fill those needs while also being more approachable to those who are more familiar with the modern Internet
19:30:32 <clarkb> I don't think anyone has reviewed it yet so I'm mostly hoping for some feedback
19:30:38 <clarkb> but feel free to leave that on the change itself
19:31:26 <clarkb> #topic Working through our TODO list
19:31:31 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:31:39 <clarkb> I have not migrated this into a more permanent home yet
19:31:59 <corvus> (thanks for the spec, i'll take a look at it soon!)
19:32:00 <clarkb> I did do some cleanups to the specs repo and I'm thinking maybe I can port it in there as a high level list of things that don't have the depth of detail of full specs
19:32:19 <clarkb> but maybe a list of stubs that could become specs if necessary and otherwise capture the need
19:32:30 <clarkb> maybe I will just try that and see if I like it
19:32:47 <clarkb> #topic Pre PTG Planning
19:32:51 <clarkb> #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:32:58 <clarkb> and all of that feeds into planning for our october pre ptg event
19:33:23 <clarkb> if you've got topics you want to cover feel free to add them.
19:33:23 <clarkb> My plan is to port things that need discussion from that todo list into there, as well as anything that is more currently topical
19:34:19 <clarkb> we have a lot of time to get ready though so no rush
19:34:28 <clarkb> #topic Open Discussion
19:34:41 <clarkb> I did want to note that as july ends we approach the service coordinator election period in August
19:34:53 <clarkb> I'll start putting together a plan for that next week before August actually rolls around
19:35:06 <clarkb> if you are interested in running I'm happy to help/support anyone with the interest
19:36:24 <clarkb> and then fungi you are working on updating the gitea main page content
19:36:38 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/952407
19:36:47 <clarkb> this is the resulting squashed change so that we don't do more than one rolling update of gitea
19:36:49 <fungi> yeah, that's the squashed version now that we have some consensus
19:37:00 <clarkb> I'll re-review that shortly but maybe we can get that deployed today
19:37:22 <fungi> if folks are still okay with it, then whenever we're ready for another round of gitea restarts...
19:37:31 <fungi> i'm around all day
19:37:39 <fungi> happy to help monitor the deploy
19:39:11 <clarkb> great. Anything else to discuss before we end the meeting?
19:40:14 <clarkb> sounds like that may be it. Thank you everyone
19:40:23 <clarkb> We'll be back here same time and location next week
19:40:27 <clarkb> #endmeeting