19:00:45 #startmeeting infra
19:00:45 Meeting started Tue Jul 8 19:00:45 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:45 The meeting name has been set to 'infra'
19:00:54 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EHPWD6ZYIOJ6KPI2D6QUN36KUXOV6YHL/ Our Agenda
19:00:57 #topic Announcements
19:01:36 I'll be out next week from the 15th to the 17th. Sounded like fungi was willing to chair next week's meeting if there is sufficient interest
19:01:42 yeah, i can take it
19:01:53 unless someone else wants to
19:02:13 anything else to announce or should we jump into the agenda?
19:03:21 i have nothing
19:03:35 #topic Zuul-launcher
19:03:59 the agenda is a bit out of date on this topic since things are moving so quickly
19:04:26 but we noticed mixed provider nodes continuing to impact jobs. corvus found bugs and fixed them
19:04:40 fixes for all known bugs related to mixed provider nodesets are in production; i spot-checked some periodic builds this morning and didn't see failures related to this
19:04:55 so if someone sees a new issue, pls raise it
19:05:07 yeah, other folks who had been tracking that symptom in their project reported no further incidents
19:05:35 excellent
19:05:56 Another problem we ran into was the gnutls errors from image builds fetching git repos to update our git repo image caches
19:05:58 i have changes for a couple of issues related to old/stuck requests; those are gating now
19:06:47 mnasiadka updated dib to retry git requests. I found that one of the nodes I looked at seemed to alternate between requesting git resources via the ipv4 and ipv6 protocols
19:07:08 when it stalled out it was trying ipv6, then after the timeout and retry it used ipv4 and no further problems were observed, even after it switched back to ipv6 again
19:07:26 I thought that maybe there were stray interface updates, but spot checking that provider we configure the interfaces statically for ipv6, so I'm really stumped
19:07:39 what do test nodes use for dns now?
19:07:48 I think for now relying on the retries is probably a decent enough workaround and we can dig in further if problems become worse
19:08:03 corvus: should be unbound listening on localhost with unbound forwarding to google and cloudflare
19:08:18 if the host has an ipv6 address the ipv6 google and cloudflare addrs are used. If not then ipv4
19:08:27 (they don't recurse themselves)
19:08:54 so if there's dns flapping, it would be google/cloudflare...
19:09:00 however image builds occur in a chroot which may impact /etc/resolv.conf; I haven't checked on that more closely
19:09:13 corvus: ya, either google or cloudflare, or something about chroot setup overriding the host dns maybe
19:09:31 we have a 1h ttl on opendev.org...
19:09:51 right, and the flapping between the protocols occurred over a time span of less than an hour and it did so multiple times
19:09:58 it's definitely a really odd situation.
19:10:03 weird. nothing is jumping out at me either.
19:10:09 another idea is that it's just Internet failures
19:10:32 in which case retrying is probably the most correct thing to do and we're doing that now
19:10:34 that seems like the most probable cause fitting all the observed behaviors
19:10:50 and retrying seems to be sufficient?
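For illustration, a minimal sketch of the kind of retry-on-transient-failure behavior being relied on here. This is not the actual diskimage-builder change; the helper name, attempt count, timeout, and backoff values are all hypothetical, and the real workaround lives inside dib's git handling.

```python
import shutil
import subprocess
import time


def mirror_repo_with_retries(repo_url: str, dest: str,
                             attempts: int = 3, timeout: int = 600) -> None:
    """Mirror a git repo for an image cache, retrying transient network failures.

    Hypothetical helper: a stalled fetch (e.g. a black-holed IPv6 flow) usually
    succeeds on the next attempt once a fresh connection is made, possibly over
    the other address family.
    """
    for attempt in range(1, attempts + 1):
        # Clear any partial clone left behind by a killed/timed-out attempt.
        shutil.rmtree(dest, ignore_errors=True)
        try:
            subprocess.run(
                ["git", "clone", "--mirror", repo_url, dest],
                check=True,
                timeout=timeout,
            )
            return
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == attempts:
                raise
            # Back off briefly before retrying; the new connection may avoid
            # the bad network path entirely.
            time.sleep(30 * attempt)
```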
19:11:08 corvus: yes, in the example I was digging into there was a single retry for one failure and it proceeded successfully from there
19:11:17 it appears to be quite intermittent
19:11:36 i'm happy enough to just back away slowly from this then. :)
19:11:42 somebody's got a router somewhere in the path that's inadvertently hashing a small percentage of flows into a black hole
19:11:58 and maybe only for ipv6
19:12:39 corvus: ya, that's about where I've ended up
19:13:17 there are also trixie and EL10 image builds that I figured we could touch on briefly, but I gave them their own topics
19:13:26 anything other than ^ we want to discuss on this topic?
19:13:45 oh yeah
19:14:00 i think we're really close to having zero nodepool traffic
19:14:17 and i think the probability of us wanting to roll back is nearing zero
19:14:36 so... do we want to decommission the nodepool servers?
19:14:59 that would be awesome
19:15:00 I'm on board with that. I was thinking it is probably a good idea to stop services for a week or so and keep an eye out for any unexpected corner cases we missed
19:15:29 sounds like a plan. a day or so after it looks like zero traffic, i'll stop, then wait a week to delete
19:15:38 then shutdown and remove the servers entirely after that period. Mostly just thinking there may be jobs that run infrequently enough or are lost in the periodic job noise that we may not notice until we shut it down properly
19:15:44 corvus: sounds great
19:16:12 eot from me
19:16:23 #topic Adding Debian Trixie Images
19:16:40 related to the last topic, frickler has done some work to add Debian Trixie images, and only in zuul-launcher, not nodepool
19:16:54 I think that is the correct choice at this point, particularly since we are close to shutting down Nodepool, but worth calling out
19:17:08 ++
19:17:19 The other bits worth noting are that we are relying on a prerelease version of dib to build those images, since they are really debian testing images until trixie is released
19:17:30 we will likely need to make some small updates to s/testing/trixie/ post release too
19:18:17 thinking out loud here, we should probably try to get a dib release out soon too to capture these updates, but I think I'm ok with using depends-on and pulling dib from source. We may need to be careful about reviewing depends-on before approving zuul-provider changes if we do that long term though?
19:18:44 i think we switched zuul-provider to just always build from source
19:18:55 but even with depends-on, zuul-provider won't merge and upload an image until the dib change is reviewed and approved
19:19:10 so I think risk there is low.
Mostly just want zuul-provider reviewers to know they need to think about any dib depends-on too
19:19:39 yes, zuul-providers always uses dib from source now, so releases are irrelevant to it
19:20:03 (for all image builds)
19:20:24 cool
19:20:34 oh, the other thing is we aren't mirroring testing/trixie
19:20:50 this may break jobs that try to run on the new nodes, as I think we try to configure mirrors for all debian images today
19:20:58 we can figure that out if/when it becomes a problem after images are built
19:21:38 #topic EL10 Images
19:21:51 then related to that, we're also in the process of adding CentOS 10 Stream and Rocky Linux 10 images
19:21:58 #link https://review.opendev.org/c/opendev/zuul-providers/+/953460 CentOS 10 Stream
19:22:03 #link https://review.opendev.org/c/opendev/zuul-providers/+/954265 Rocky Linux 10
19:22:22 in addition to all of the prior notes about dib and mirrors, the extra consideration here is that these nodes cannot boot in rackspace classic
19:22:42 I think I've come to terms with that, and I think it may be a good way to push other providers (and rax) to expand on more capable hardware
19:23:04 but it does mean that like half our resources can't boot those images right now
19:23:57 thank you to everyone pushing both trixie and EL10 along. I think they have been a great illustration of how zuul-launcher is an improvement over nodepool when it comes to debug cycles and adding new images
19:24:11 we're able to do everything upfront rather than post merge on nodepool's daily rebuild cycle
19:24:41 and this opens us up to the possibility of adding image acceptance tests
19:25:19 as actual zuul jobs
19:26:10 #topic Gerrit 3.11 Upgrade Planning
19:26:24 The only thing I've done on this topic recently is clean up the old held nodes
19:26:45 I realized they are probably stale and don't represent our current images and config anymore (for example the cleanup of h2 compaction timeout increases)
19:27:12 The other thing I've realized is that we've got a handful of unrelated changes that all touch on the gerrit images that we might want to bundle up and apply with a single gerrit restart
19:27:18 There is the move to host gerrit images on quay
19:27:24 #link https://review.opendev.org/c/opendev/system-config/+/882900 Host Gerrit images on quay.io
19:27:31 There is the update to remove cla stuff
19:27:37 (which has content baked into the image iirc)
19:28:03 then there are the updates to the zuul status viewer which I don't think we've deployed yet either
19:28:19 fungi: maybe you and I can sync up on doing quay and the cla stuff one day and do a single restart to pick up both?
19:29:41 yeah, i expect at least a week lead time to announce merging the acl change to remove cla enforcement from all remaining projects in our gerrit
19:29:43 then once that is done I want to hold new nodes with that up to date configuration state and container images and dig into planning the upgrade from there
19:29:51 correct, i don't think the status plugin is deployed, but it was tested in a system-config build.
19:30:12 fungi: ok, let's sync up after the meeting and figure out what planning for that looks like
19:30:28 but once all of that is caught up I can dig into the release notes and update the plan with testing input
19:30:33 #link https://www.gerritcodereview.com/3.11.html
19:30:39 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:30:53 any other questions, concerns or comments in relation to the gerrit 3.10 to 3.11 upgrade planning?
19:31:20 none on my side
19:31:49 #topic Upgrading old servers
19:32:07 we had a good sprint on this ~last week and the week prior
19:32:15 but since then I haven't seen any new servers (which is fine)
19:32:17 no news on refstack
19:32:35 ack. Other than refstack, the next "easy" server I've identified is eavesdrop
19:32:44 so when I find time that is the most likely next server to get replaced
19:34:36 #topic Trialing Matrix for OpenDev Comms
19:35:03 unlike gerrit upgrade prep, I didn't dive into this because I realized there were prereqs. I just ran out of time between debugging image build and gitea stuff and the holiday
19:35:23 that said, I should be able to write that spec this week as long as fires stay away. Having no holiday helps too
19:35:52 it's on my todo list after reviewing PBR updates and some local laptop updates I need to do
19:35:59 oh, and some paperwork that is due today
19:36:57 so ya, hopefully soon
19:37:05 #topic Working through our TODO list
19:37:17 And now it's time for the weekly reminder that we have a TODO list that still needs a new home
19:37:21 #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:37:41 that is related enough to spec work that I should probably roll those two efforts into adjacent blocks of time
19:37:51 #topic Pre PTG Planning
19:37:56 #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:38:20 I've started to put the planning for this 'event' into its own dedicated document.
19:38:56 Feel free to add agenda items and suggestions for specific timeframes and days in the etherpad. Once we've got a bit more of a schedule I'll announce this on the mailing list as well to make it more official
19:39:57 I can also port over still relevant items from the last event to this one. We've managed to complete a few things, but the list is long and there are likely updates
19:41:00 #topic Open Discussion
19:41:14 ~August will be our next Service Coordinator election
19:41:33 I'm calling that out now in part to remind myself to figure out planning for that, but also to encourage others to volunteer if interested.
19:44:52 sounds like that may be everything?
19:45:01 Thank you everyone for keeping OpenDev up and running
19:45:14 thanks clarkb!
19:45:31 thanks!
19:45:44 #endmeeting