19:00:05 <clarkb> #startmeeting infra
19:00:05 <opendevmeet> Meeting started Tue Jul  1 19:00:05 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:05 <opendevmeet> The meeting name has been set to 'infra'
19:00:23 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YQ43ECZ5W6GKF4CTPVARL2U5XCQOKQCB/ Our Agenda
19:00:29 <clarkb> #topic Announcements
19:00:40 <clarkb> Friday is a US holiday so I expect several of us won't be around that day
19:01:00 <clarkb> I'm also going to be out the 15th-17th which means someone else will need to run the meeting on the 15th or we can skip it
19:01:33 <clarkb> I'm happy to defer that decision to everyone else since I won't be here :)
19:01:35 <fungi> i expect to be around
19:01:58 <corvus> me too
19:02:06 <fungi> happy to run meetings whenever
19:02:31 <clarkb> thanks, I'll let you decide that week if you need to send a meeting agenda then fungi
19:02:36 <clarkb> Anything else to announce?
19:02:49 <fungi> openinfra summit schedule is out
19:04:12 <clarkb> #topic Zuul-launcher
19:04:27 <clarkb> I guess the big news here is that we're using zuul launcher for the majority of nodes now
19:04:38 <fungi> yay!
19:04:49 <clarkb> Be on the lookout for unexpected behaviors. I found one yesterday and corvus quickly had a patch for that just merged
19:04:59 <corvus> yep!  a few nodes still supplied for images we don't have yet
19:05:04 <clarkb> (more often than expected you can get nodesets with nodes from different cloud regions)
19:05:09 <corvus> also, there are a few labels we don't have for images that we do have (like nested-virt)
19:05:32 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/953269 Add Ubuntu Focal and Bionic images
19:05:45 <fungi> someone reported a cirros image file missing in a job just a little bit ago, not sure if it was on a zuul or nodepool built node though
19:05:56 <corvus> i'm hoping to restart launchers with the latest fixes today; those also include additional metrics, so i'll merge updates to the graphs then too.
19:06:00 <clarkb> fungi: ya unfortunately they didn't provide a build log to check
19:06:36 <fungi> and we did at least confirm that we're building images with that file in the expected path
19:06:44 <fungi> or configured to do so anyway
19:07:07 <corvus> fungi: for nodepool, niz, or both?
19:07:14 <fungi> both
19:07:49 <fungi> but hard to dig deeper until we get more details
19:08:09 <corvus> good... there's a lot of image metadata just in the web ui now; so looking into that should be a little bit easier
19:08:21 <corvus> but i know all of the web ui stuff still needs some work
19:08:45 <clarkb> corvus: both mnasiadka's focal+bionic change and frickler's debian trixie depend on dib changes that haven't merged. If/when they do merge are we using dib from releases or master in the image build jobs?
19:09:16 <fungi> otherwise we'll need to do another dib release
19:09:36 <corvus> releases, i think, and if that's right, that makes the depends-on testing a dangerous non-co-gating situation
19:10:11 <corvus> (i wonder if we should have the job run, but fail, if it uses dib from source)
19:10:25 <corvus> but let's double check that, since i'm not sure
19:10:35 <clarkb> corvus: ++ that seems like a good plan. or we can use it from source too
19:10:40 <clarkb> from source all the time I mean
19:10:50 <corvus> yeah
19:11:24 <clarkb> ok so to summarize continue to try and debug issues that may be related to the niz switch, support corvus in improving things via code reviews etc, review the dib changes necessary to build additional images, and then try to rollout more images
19:11:38 <clarkb> and maybe we need to toggle how we're installing dib to make things less flaky post merge
19:11:54 <corvus> ++
19:12:04 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/951471 Debian Trixie Image builds
19:12:54 <clarkb> One thing I wanted to do this morning when looking at the "cirros image is missing" claim was to log into a node booted from an image built by niz to double check. But the web ui only shows in-use nodes and doesn't supply their IPs. It sounds like the json blob may have some of that info if anyone else needs to look it up
19:13:06 <corvus> i'm pretty sure we only install dib from source if the repo is there
19:13:08 <clarkb> I suspect improving the web ui around that sort of thing is going to happen too as we get deeper into this
19:13:26 <corvus> so easiest way to make it install from source all the time is just to add it to required-projects for the image build jobs
19:13:59 <clarkb> makes sense
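A minimal sketch of what that might look like in the zuul-providers image build job definition (the job name and layout here are illustrative, not the actual config):

    - job:
        name: opendev-build-diskimage-base   # hypothetical job name
        required-projects:
          # having the repo checked out causes the job to install
          # diskimage-builder from source rather than from a release
          - openstack/diskimage-builder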
19:14:00 <corvus> clarkb: yeah, i think we should also be able to show the image id we used in the web ui/json
19:14:12 <clarkb> ++
19:14:14 <fungi> listing not-yet-used nodes in the ui could be weird, since the nodes list is tenant scoped and those nodes wouldn't be associated with a tenant yet?
19:14:37 <clarkb> I have confirmed that nodepool list doesn't seem to show you niz images. I didn't expect it to but thought maybe since they both use zk there would be enough overlap for that to magically work
19:15:06 <corvus> fungi: yeah... though we do show building nodes that are assigned to a tenant
19:15:28 <corvus> the only thing that won't show up is unassigned ready nodes (typically from min-ready, but possibly from aborted buildsets)
19:15:28 <fungi> fair, if they've got an associated node request
19:15:55 <corvus> there is a way to filter for "ready nodes that could possibly be used by this tenant"; it's a bit more complex, but i think we can/should do it.
19:16:14 <fungi> cool!
19:16:17 <clarkb> I think that would've been useful for me today so I'm ++ on doing that
19:16:36 <corvus> (so then those ready nodes would show up in multiple tenant listings, until their probability field collapses)
19:17:02 <clarkb> each unassigned node is a quantum computer
19:17:06 <fungi> nodes that are "available" to the tenant
19:17:32 <clarkb> anything else on this topic?
19:17:49 <corvus> i think that's it from me
19:18:04 <clarkb> oh mnasiadka just mentioned in #opendev that image build jobs are running out of disk
19:18:22 <clarkb> so we may need to do more optimization of the disk usage in the jobs. But details are scarce right now. Needs more characterization
19:18:35 <corvus> oof.  i guess we'll followup on that in opendev
19:18:36 <corvus> we do have cacti graphs
19:18:53 <clarkb> I think this was on the image build node itself
19:18:57 <clarkb> not the launchers fwiw
19:19:06 <clarkb> but ya we can followup there later
19:19:07 <corvus> oh derp sorry
19:19:12 <clarkb> #topic Gerrit shutdown problems
19:19:26 <clarkb> Last week I finally got around to doing the "testing" of gerrit shutdown processes in production
19:19:47 <clarkb> And I think the hunch that h2 db compaction was the cause of slow gerrit shutdown was accurate.
19:20:19 <clarkb> we ran a manual kill -HUP against gerrit to rule out sigint vs sighup behavior differences and sighup produced the same slow shutdown. It ended up taking about 6-7 minutes to finally shut down
19:20:45 <clarkb> while we were waiting I ran a strace against gerrit and it was reading/writing/seeking in h2 db files. And after the process completed the db files were smaller than when we started
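Roughly the commands involved in that check (the pid placeholder and exact strace filters are from memory, not a capture of the session):

    # send SIGHUP to the gerrit java process to start the shutdown
    kill -HUP <gerrit-pid>
    # watch what it does while it winds down; the output showed
    # read/write/lseek activity against the h2 cache db files
    strace -f -p <gerrit-pid> -e trace=read,write,lseek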
19:21:08 <clarkb> we did the restart to apply the revert of h2 compaction so the testing was also the fix and I expect the next restart will be happy
19:21:19 <clarkb> fungi: you helped out with ^ anything else to add?
19:21:42 <fungi> nope, just that there was good solid evidence to support your guess
19:22:02 <fungi> looking forward to smoother restarts in the future
19:22:12 <clarkb> so ya good news I think we've corrected this which means I can return to figuring out a gerrit 3.11 upgrade
19:22:17 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:22:29 <clarkb> however, I haven't started on this since hopefully fixing gerrit restarts
19:22:38 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:22:46 <clarkb> reading over the release notes is still helpful if you haven't done it yet
19:22:53 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:23:13 <clarkb> and you can put any thoughts or notes in that etherpad. I'd like to pick this up again either this week or next so hopefully there will be proper updates soon
19:23:47 <clarkb> #topic Upgrading old servers
19:23:53 <clarkb> There are updates on this topic!
19:24:08 <fungi> lots
19:24:15 <clarkb> late last week and over the weekend corvus replaced all of the core zuul servers with noble nodes
19:24:25 <clarkb> (schedulers, mergers, executors, launchers)
19:24:30 <corvus> i did self-approve some pro-forma changes.  hope that's okay.  :)
19:24:44 <fungi> perfectly
19:25:01 <clarkb> then I followed that work up by replacing the zookeeper nodes behind zuul and nodepool yesterday
19:25:12 <corvus> one thing that went wrong with the zuul upgrade: the load balancer
19:25:30 <clarkb> corvus: the problem is that we hardcode IP addrs in the proxy config right?
19:25:37 <corvus> entirely my fault, because i forgot 2 things: 1) we moved the config files; and 2) the backend server config is explicit
19:26:03 <clarkb> I feel like we do that because then we aren't reliant on DNS for that to work. Maybe we should just let haproxy do dns lookups?
19:26:15 <corvus> i cleaned up our old config file locations on both load balancers (zuul and gitea), so hopefully that won't bite anyone else :)
19:26:44 <corvus> yeah, i'm kind of thinking that just letting dns or ansible write that automatically might be okay
19:27:01 <clarkb> I'm on board with that. Worst case we notice flakyness on the frontend and can revert
19:27:16 <clarkb> then tackle it some other way
19:27:47 <corvus> automatic config is... uh... what i was expecting to happen... and i think our process for replacing servers would work with that.
19:28:29 <corvus> probably just need to still be able to take a server in and out easily, but just not specify its ip address.
19:28:29 <clarkb> ya we do that with zookeeper actually
19:28:41 <clarkb> the servers are all listed with explicit IPs but ansible figures out what they are for us and puts them in the config file
19:29:04 <corvus> i like that approach
19:29:05 <clarkb> and it does so via looking at hosts in the zookeeper ansible group
19:29:54 <corvus> sounds like consensus to change that.  that was the only followup from my weekend
19:29:54 <tonyb> Makes sense to me
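A rough sketch of the zookeeper-style approach applied to the load balancer, where ansible fills in backend addresses from an inventory group at deploy time (the group name, port, and template layout are made up for illustration, not our actual roles):

    backend zuul-web
    {% for host in groups['zuul-web'] %}
        server {{ host }} {{ hostvars[host]['ansible_default_ipv4']['address'] }}:9000 check
    {% endfor %}

That keeps the rendered haproxy.cfg free of hand-maintained IPs while still listing concrete addresses, so haproxy itself never has to do DNS lookups at runtime.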
19:30:03 <clarkb> zookeeper replacement went smoothly. I did have one unexpected election behavior but afterwards I thought it through and the behavior makes sense to me after the fact
19:30:17 <clarkb> #link https://review.opendev.org/c/opendev/zone-opendev.org/+/953844 Remove old zk servers from DNS
19:30:52 <clarkb> after lunch today I plan to approve ^ unless there are objections to remove the old zk servers from DNS. Then I'll plan to delete the old zk servers after or tomorrow morning (again if there are no objections)
19:31:08 <clarkb> corvus already acked doing ^ so let me know otherwise I'm planning to proceed with cleanup
19:31:18 <corvus> clarkb: one thought: now that we've shown we can do that particular upgrade process, we should make sure not to copy those questionable notes about upgrades from the etherpad.
19:31:35 <corvus> or revise them or whatever
19:31:42 <clarkb> corvus: ya I did put a note about them possibly being FUD in the etherpad already but I should go ahead and delete them or cross them out and mark them invalid
19:32:02 <corvus> ++
19:32:30 <clarkb> now that we have all these nodes running on noble with docker compose instead of docker-compose we can clean up their docker-compose.yaml files
19:32:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/953846
19:32:45 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/953848
19:32:59 <clarkb> it was writing these changes that helped discover the nodes from multiple clouds behavior
19:34:12 <clarkb> those aren't urgent but it might be nice to get rid of the warnings
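Assuming the warnings in question are compose v2's "version is obsolete" notices, the cleanup amounts to dropping the top-level version key, e.g.:

    # before (docker-compose v1 era file)
    version: '2'
    services:
      zookeeper:
        image: docker.io/library/zookeeper:3.8   # image name illustrative
    # after: remove the obsolete "version" key; compose v2 ignores it anyway
    services:
      zookeeper:
        image: docker.io/library/zookeeper:3.8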
19:34:24 <corvus> looking forward to that :)
19:34:34 <clarkb> I also swapped out mirror-update servers, not sure if we discussed that previously.
19:34:48 <clarkb> Eavesdrop and refstack are the "easy" nodes I have remaining on the easy list
19:34:58 <clarkb> fungi: I know we've been busy with plenty of other stuff but any word on refstack cleanup?
19:37:32 <fungi> nothing
19:37:56 <clarkb> ok. The list is still quite big but we're slowly whittling it down. Thanks for the help and happy to have more
19:38:02 <clarkb> #topic OFTC Matrix bridge no longer supporting new users
19:38:20 <clarkb> I have an action item to go write a spec for this I just haven't gotten to it yet. Maybe I should do that before I start looking at gerrit 3.11
19:40:31 <clarkb> I guess there isn't anything new on this until I do that
19:40:39 <clarkb> #topic Working through our TODO list
19:40:43 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:40:56 <clarkb> I also need to migrate this list to something a bit more permanent/better
19:41:09 <clarkb> but a friendly reminder that the list exists if you get bored :)
19:43:19 <clarkb> #topic Pre PTG Planning
19:43:44 <clarkb> I haven't heard any additional feedback on the week of October 6-10 for holding an opendev pre ptg
19:44:09 <clarkb> I think we should all pencil in those dates and I'll start on an announcement and an agenda that we can fill in before then
19:47:01 <clarkb> ok no more feedback is good feedback, I'll proceed with that as the plan for now
19:47:07 <clarkb> #topic Open Discussion
19:47:09 <clarkb> Anything else?
19:51:16 <fungi> i got nothin'
19:52:10 <clarkb> in that case thanks everyone for your time here and elsewhere keeping opendev up and running
19:52:21 <clarkb> we'll be back next week at the same time and location
19:52:22 <clarkb> #endmeeting