19:00:45 #startmeeting infra
19:00:45 Meeting started Tue Jul 8 19:00:45 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:45 The meeting name has been set to 'infra'
19:00:54 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EHPWD6ZYIOJ6KPI2D6QUN36KUXOV6YHL/ Our Agenda
19:00:57 #topic Announcements
19:01:36 I'll be out next week from the 15th to the 17th. Sounded like fungi was willing to chair next week's meeting if there is sufficient interest
19:01:42 yeah, i can take it
19:01:53 unless someone else wants to
19:02:13 anything else to announce or should we jump into the agenda?
19:03:21 i have nothing
19:03:35 #topic Zuul-launcher
19:03:59 the agenda is a bit out of date on this topic since things are moving so quickly
19:04:26 but we noticed mixed provider nodes continuing to impact jobs. corvus found bugs and fixed them
19:04:40 fixes for all known bugs related to mixed provider nodesets are in production; i spot-checked some periodic builds this morning and didn't see failures related to this
19:04:55 so if someone sees a new issue, pls raise it
19:05:07 yeah, other folks who had been tracking that symptom in their project reported no further incidents
19:05:35 excellent
19:05:56 Another problem we ran into was the gnutls errors from image builds fetching git repos to update our git repo image caches
19:05:58 i have changes for a couple of issues related to old/stuck requests; those are gating now
19:06:47 mnasiadka updated dib to retry git requests. I found that one of the nodes I looked at seemed to alternate between requesting git resources via the ipv4 and ipv6 protocols
19:07:08 when it stalled out it was trying ipv6, then after the timeout and retry it used ipv4 and no further problems were observed, even after it switched back to ipv6 again
19:07:26 I thought that maybe there were stray interface updates, but spot checking that provider we configure the interfaces statically for ipv6, so I'm really stumped
19:07:39 what do test nodes use for dns now?
19:07:48 I think for now relying on the retries is probably a decent enough workaround and we can dig in further if problems become worse
19:08:03 corvus: should be unbound listening on localhost with unbound forwarding to google and cloudflare
19:08:18 if the host has an ipv6 address the ipv6 google and cloudflare addrs are used. If not then ipv4
19:08:27 (they don't recurse themselves)
19:08:54 so if there's dns flapping, it would be google/cloudflare...
19:09:00 however image builds occur in a chroot which may impact /etc/resolv.conf; I haven't checked on that more closely
19:09:13 corvus: ya, either google or cloudflare, or something about chroot setup overriding the host dns maybe
19:09:31 we have a 1h ttl on opendev.org...
19:09:51 right, and the flapping between the protocols occurred over a time span of less than an hour and it did so multiple times
19:09:58 it's definitely a really odd situation.
19:10:03 weird. nothing is jumping out at me either.
19:10:09 another idea is that it's just Internet failures
19:10:32 in which case retrying is probably the most correct thing to do and we're doing that now
19:10:34 that seems like the most probable cause fitting all the observed behaviors
19:10:50 and retrying seems to be sufficient?
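For illustration, a minimal sketch of the kind of retry-on-transient-failure behavior being relied on here. This is not the actual diskimage-builder change; the helper name, attempt count, timeout, and backoff values are all hypothetical, and the real workaround lives inside dib's git handling.

```python
import shutil
import subprocess
import time


def mirror_repo_with_retries(repo_url: str, dest: str,
                             attempts: int = 3, timeout: int = 600) -> None:
    """Mirror a git repo for an image cache, retrying transient network failures.

    Hypothetical helper: a stalled fetch (e.g. a black-holed IPv6 flow) usually
    succeeds on the next attempt once a fresh connection is made, possibly over
    the other address family.
    """
    for attempt in range(1, attempts + 1):
        # Clear any partial clone left behind by a killed/timed-out attempt.
        shutil.rmtree(dest, ignore_errors=True)
        try:
            subprocess.run(
                ["git", "clone", "--mirror", repo_url, dest],
                check=True,
                timeout=timeout,
            )
            return
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == attempts:
                raise
            # Back off briefly before retrying; the new connection may avoid
            # the bad network path entirely.
            time.sleep(30 * attempt)
```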
19:11:08 corvus: yes, in the example I was digging into there was a single retry for one failure and it proceeded successfully from there
19:11:17 it appears to be quite intermittent
19:11:36 i'm happy enough to just back away slowly from this then. :)
19:11:42 somebody's got a router somewhere in the path that's inadvertently hashing a small percentage of flows into a black hole
19:11:58 and maybe only for ipv6
19:12:39 corvus: ya, that's about where I've ended up
19:13:17 there are also trixie and EL10 image builds that I figured we could touch on briefly, but I gave them their own topics
19:13:26 anything other than ^ we want to discuss on this topic?
19:13:45 oh yeah
19:14:00 i think we're really close to having zero nodepool traffic
19:14:17 and i think the probability of us wanting to roll back is nearing zero
19:14:36 so... do we want to decommission the nodepool servers?
19:14:59 that would be awesome
19:15:00 I'm on board with that. I was thinking it is probably a good idea to stop services for a week or so and keep an eye out for any unexpected corner cases we missed
19:15:29 sounds like a plan. a day or so after it looks like zero traffic, i'll stop, then wait a week to delete
19:15:38 then shutdown and remove the servers entirely after that period. Mostly just thinking there may be jobs that run infrequently enough or are lost in the periodic job noise that we may not notice until we shut it down properly
19:15:44 corvus: sounds great
19:16:12 eot from me
19:16:23 #topic Adding Debian Trixie Images
19:16:40 related to the last topic, frickler has done some work to add Debian Trixie images, and only in zuul-launcher, not nodepool
19:16:54 I think that is the correct choice at this point, particularly since we are close to shutting down Nodepool, but worth calling out
19:17:08 ++
19:17:19 The other bits worth noting are that we are relying on a prerelease version of dib to build those images, since they are really debian testing images until trixie is released
19:17:30 we will likely need to make some small updates to s/testing/trixie/ post release too
19:18:17 thinking out loud here, we should probably try to get a dib release out soon too to capture these updates, but I think I'm ok with using depends-on and pulling dib from source. We may need to be careful about reviewing depends-on before approving zuul-provider changes if we do that long term though?
19:18:44 i think we switched zuul-provider to just always build from source
19:18:55 but even with depends-on, zuul-provider won't merge and upload an image until the dib change is reviewed and approved
19:19:10 so I think risk there is low.
Mostly just want zuul-provider reviewers to know they need to think about any dib depends-on too
19:19:39 yes, zuul-providers always uses dib from source now, so releases are irrelevant to it
19:20:03 (for all image builds)
19:20:24 cool
19:20:34 oh, the other thing is we aren't mirroring testing/trixie
19:20:50 this may break jobs that try to run on the new nodes, as I think we try to configure mirrors for all debian images today
19:20:58 we can figure that out if/when it becomes a problem after images are built
19:21:38 #topic EL10 Images
19:21:51 then related to that, we're also in the process of adding CentOS 10 Stream and Rocky Linux 10 images
19:21:58 #link https://review.opendev.org/c/opendev/zuul-providers/+/953460 CentOS 10 Stream
19:22:03 #link https://review.opendev.org/c/opendev/zuul-providers/+/954265 Rocky Linux 10
19:22:22 in addition to all of the prior notes about dib and mirrors, the extra consideration here is that these nodes cannot boot in rackspace classic
19:22:42 I think I've come to terms with that, and I think it may be a good way to push other providers (and rax) to expand on more capable hardware
19:23:04 but it does mean that like half our resources can't boot those images right now
19:23:57 thank you to everyone pushing both trixie and EL10 along. I think they have been a great illustration of how zuul-launcher is an improvement over nodepool when it comes to debug cycles and adding new images
19:24:11 we're able to do everything upfront rather than post merge on nodepool's daily rebuild cycle
19:24:41 and this opens us up to the possibility of adding image acceptance tests
19:25:19 as actual zuul jobs
19:26:10 #topic Gerrit 3.11 Upgrade Planning
19:26:24 The only thing I've done on this topic recently is clean up the old held nodes
19:26:45 I realized they are probably stale and don't represent our current images and config anymore (for example the cleanup of h2 compaction timeout increases)
19:27:12 The other thing I've realized is that we've got a handful of unrelated changes that all touch on the gerrit images that we might want to bundle up and apply with a single gerrit restart
19:27:18 There is the move to host gerrit images on quay
19:27:24 #link https://review.opendev.org/c/opendev/system-config/+/882900 Host Gerrit images on quay.io
19:27:31 There is the update to remove cla stuff
19:27:37 (which has content baked into the image iirc)
19:28:03 then there are the updates to the zuul status viewer which I don't think we've deployed yet either
19:28:19 fungi: maybe you and I can sync up on doing quay and the cla stuff one day and do a single restart to pick up both?
19:29:41 yeah, i expect at least a week lead time to announce merging the acl change to remove cla enforcement from all remaining projects in our gerrit
19:29:43 then once that is done I want to hold new nodes with that up to date configuration state and container images and dig into planning the upgrade from there
19:29:51 correct, i don't think the status plugin is deployed, but it was tested in a system-config build.
19:30:12 fungi: ok, let's sync up after the meeting and figure out what planning for that looks like
19:30:28 but once all of that is caught up I can dig into the release notes and update the plan with testing input
19:30:33 #link https://www.gerritcodereview.com/3.11.html
19:30:39 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:30:53 any other questions, concerns or comments in relation to the gerrit 3.10 to 3.11 upgrade planning?
19:31:20 none on my side
19:31:49 #topic Upgrading old servers
19:32:07 we had a good sprint on this ~last week and the week prior
19:32:15 but since then I haven't seen any new servers (which is fine)
19:32:17 no news on refstack
19:32:35 ack. Other than refstack, the next "easy" server I've identified is eavesdrop
19:32:44 so when I find time that is the most likely next server to get replaced
19:34:36 #topic Trialing Matrix for OpenDev Comms
19:35:03 unlike gerrit upgrade prep, I didn't dive into this because I realized there were prereqs. I just ran out of time between debugging image build and gitea stuff and the holiday
19:35:23 that said, I should be able to write that spec this week as long as fires stay away. Having no holiday helps too
19:35:52 it's on my todo list after reviewing PBR updates and some local laptop updates I need to do
19:35:59 oh, and some paperwork that is due today
19:36:57 so ya, hopefully soon
19:37:05 #topic Working through our TODO list
19:37:17 And now it's time for the weekly reminder that we have a TODO list that still needs a new home
19:37:21 #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:37:41 that is related enough to spec work that I should probably roll those two efforts into adjacent blocks of time
19:37:51 #topic Pre PTG Planning
19:37:56 #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:38:20 I've started to put the planning for this 'event' into its own dedicated document.
19:38:56 Feel free to add agenda items and suggestions for specific timeframes and days in the etherpad. Once we've got a bit more of a schedule I'll announce this on the mailing list as well to make it more official
19:39:57 I can also port over still relevant items from the last event to this one. We've managed to complete a few things, but the list is long and there are likely updates
19:41:00 #topic Open Discussion
19:41:14 ~August will be our next Service Coordinator election
19:41:33 I'm calling that out now in part to remind myself to figure out planning for that, but also to encourage others to volunteer if interested.
19:44:52 sounds like that may be everything?
19:45:01 Thank you everyone for keeping OpenDev up and running
19:45:14 thanks clarkb!
19:45:31 thanks!
19:45:44 #endmeeting