19:00:05 <clarkb> #startmeeting infra
19:00:05 <opendevmeet> Meeting started Tue Jul 1 19:00:05 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:05 <opendevmeet> The meeting name has been set to 'infra'
19:00:23 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YQ43ECZ5W6GKF4CTPVARL2U5XCQOKQCB/ Our Agenda
19:00:29 <clarkb> #topic Announcements
19:00:40 <clarkb> Friday is a US holiday so I expect several of us won't be around that day
19:01:00 <clarkb> I'm also going to be out the 15th-17th which means someone else will need to run the meeting on the 15th or we can skip it
19:01:33 <clarkb> I'm happy to defer that decision to everyone else since I won't be here :)
19:01:35 <fungi> i expect to be around
19:01:58 <corvus> me too
19:02:06 <fungi> happy to run meetings whenever
19:02:31 <clarkb> thanks, I'll let you decide that week if you need to send a meeting agenda then, fungi
19:02:36 <clarkb> Anything else to announce?
19:02:49 <fungi> openinfra summit schedule is out
19:04:12 <clarkb> #topic Zuul-launcher
19:04:27 <clarkb> I guess the big news here is that we're using zuul launcher for the majority of nodes now
19:04:38 <fungi> yay!
19:04:49 <clarkb> Be on the lookout for unexpected behaviors. I found one yesterday and corvus quickly had a patch for it that just merged
19:04:59 <corvus> yep! a few nodes still supplied for images we don't have yet
19:05:04 <clarkb> (more often than expected you can get nodesets with nodes from different cloud regions)
19:05:09 <corvus> also, there are a few labels we don't have for images that we do have (like nested-virt)
19:05:32 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/953269 Add Ubuntu Focal and Bionic images
19:05:45 <fungi> someone reported a cirros image file missing in a job just a little bit ago, not sure if it was on a zuul or nodepool built node though
19:05:56 <corvus> i'm hoping to restart launchers with the latest fixes today; those also include additional metrics, so i'll merge updates to the graphs then too.
19:06:00 <clarkb> fungi: ya unfortunately they didn't provide a build log to check
19:06:36 <fungi> and we did at least confirm that we're building images with that file in the expected path
19:06:44 <fungi> or configured to do so anyway
19:07:07 <corvus> fungi: for nodepool, niz, or both?
19:07:14 <fungi> both
19:07:49 <fungi> but hard to dig deeper until we get more details
19:08:09 <corvus> good... there's a lot of image metadata in the web ui now, so looking into that should be a little bit easier
19:08:21 <corvus> but i know all of the web ui stuff still needs some work
19:08:45 <clarkb> corvus: both mnasiadka's focal+bionic change and frickler's debian trixie change depend on dib changes that haven't merged. If/when they do merge are we using dib from releases or master in the image build jobs?
19:09:16 <fungi> otherwise we'll need to do another dib release
19:09:36 <corvus> releases, i think, and if that's right, that makes the depends-on testing a dangerous non-co-gating situation
19:10:11 <corvus> (i wonder if we should have the job run, but fail, if it uses dib from source)
19:10:25 <corvus> but let's double check that, since i'm not sure
19:10:35 <clarkb> corvus: ++ that seems like a good plan.
19:10:40 <clarkb> or we can use it from source too, from source all the time I mean
19:10:50 <corvus> yeah
19:11:24 <clarkb> ok so to summarize: continue to try and debug issues that may be related to the niz switch, support corvus in improving things via code reviews etc, review the dib changes necessary to build additional images, and then try to roll out more images
19:11:38 <clarkb> and maybe we need to toggle how we're installing dib to make things less flaky post merge
19:11:54 <corvus> ++
19:12:04 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/951471 Debian Trixie Image builds
19:12:54 <clarkb> One thing I wanted to do this morning when looking at the "cirros image is missing" claim was to log into an image built by niz to double check. But the web ui only shows in-use nodes and doesn't supply their IPs. It sounds like the json blob may have some of that info if anyone else needs to look it up
19:13:06 <corvus> i'm pretty sure we only install dib from source if the repo is there
19:13:08 <clarkb> I suspect improving the web ui around that sort of thing is going to happen too as we get deeper into this
19:13:26 <corvus> so easiest way to make it install from source all the time is just to add it to required-projects for the image build jobs
19:13:59 <clarkb> makes sense
19:14:00 <corvus> clarkb: yeah, i think we should also be able to show the image id we used in the web ui/json
19:14:12 <clarkb> ++
19:14:14 <fungi> listing not-yet-used nodes in the ui could be weird, since the nodes list is tenant scoped and those nodes wouldn't be associated with a tenant yet?
19:14:37 <clarkb> I have confirmed that nodepool list doesn't seem to show you niz images. I didn't expect it to but thought maybe since they both use zk there would be enough overlap for that to magically work
19:15:06 <corvus> fungi: yeah... though we do show building nodes that are assigned to a tenant
19:15:28 <corvus> the only thing that won't show up is unassigned ready nodes (typically from min-ready, but possibly from aborted buildsets)
19:15:28 <fungi> fair, if they've got an associated node request
19:15:55 <corvus> there is a way to filter for "ready nodes that could possibly be used by this tenant"; it's a bit more complex, but i think we can/should do it.
19:16:14 <fungi> cool!
19:16:17 <clarkb> I think that would've been useful for me today so I'm ++ on doing that
19:16:36 <corvus> (so then those ready nodes would show up in multiple tenant listings, until their probability field collapses)
19:17:02 <clarkb> each unassigned node is a quantum computer
19:17:06 <fungi> nodes that are "available" to the tenant
19:17:32 <clarkb> anything else on this topic?
19:17:49 <corvus> i think that's it from me
19:18:04 <clarkb> oh mnasiadka just mentioned in #opendev that image build jobs are running out of disk
19:18:22 <clarkb> so we may need to do more optimization of the disk usage in the jobs. But details are scarce right now. Needs more characterization
19:18:35 <corvus> oof.
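For context, a minimal sketch of what corvus's required-projects suggestion above could look like in a zuul-providers job definition. The job name and description here are made up for illustration; only the required-projects mechanism and the openstack/diskimage-builder project come from the discussion.

- job:
    name: opendev-build-diskimage-example
    description: Build a disk image, always consuming dib from source.
    required-projects:
      # Listing the repo here makes Zuul prepare it on the build node, so the
      # job's "install from source if the repo is present" logic always takes
      # the from-source path instead of falling back to the latest release.
      - openstack/diskimage-builder

Built that way, a Depends-On against a diskimage-builder change is actually exercised by the image build job instead of being silently ignored until the next dib release.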
19:18:36 <corvus> i guess we'll follow up on that in opendev. we do have cacti graphs
19:18:53 <clarkb> I think this was on the image build node itself
19:18:57 <clarkb> not the launchers fwiw
19:19:06 <clarkb> but ya we can follow up there later
19:19:07 <corvus> oh derp sorry
19:19:12 <clarkb> #topic Gerrit shutdown problems
19:19:26 <clarkb> Last week I finally got around to doing the "testing" of gerrit shutdown processes in production
19:19:47 <clarkb> And I think the hunch that h2 db compaction was the cause of slow gerrit shutdown was accurate.
19:20:19 <clarkb> we ran a manual kill -HUP against gerrit to rule out sigint vs sighup behavior differences and sighup produced the same slow shutdown. It ended up taking about 6-7 minutes to finally shut down
19:20:45 <clarkb> while we were waiting I ran a strace against gerrit and it was reading/writing/seeking in h2 db files. And after the process completed the db files were smaller than when we started
19:21:08 <clarkb> we did the restart to apply the revert of h2 compaction so the testing was also the fix and I expect the next restart will be happy
19:21:19 <clarkb> fungi: you helped out with ^ anything else to add?
19:21:42 <fungi> nope, just that there was good solid evidence to support your guess
19:22:02 <fungi> looking forward to smoother restarts in the future
19:22:12 <clarkb> so ya good news I think we've corrected this, which means I can return to figuring out a gerrit 3.11 upgrade
19:22:17 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:22:29 <clarkb> however, I haven't started on this since (hopefully) fixing gerrit restarts
19:22:38 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:22:46 <clarkb> reading over the release notes is still helpful if you haven't done it yet
19:22:53 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:23:13 <clarkb> and you can put any thoughts or notes in that etherpad. I'd like to pick this up again either this week or next so hopefully there will be proper updates soon
19:23:47 <clarkb> #topic Upgrading old servers
19:23:53 <clarkb> There are updates on this topic!
19:24:08 <fungi> lots
19:24:15 <clarkb> late last week and over the weekend corvus replaced all of the core zuul servers with noble nodes
19:24:25 <clarkb> (schedulers, mergers, executors, launchers)
19:24:30 <corvus> i did self-approve some pro-forma changes. hope that's okay. :)
19:24:44 <fungi> perfectly
19:25:01 <clarkb> then I followed that work up by replacing the zookeeper nodes behind zuul and nodepool yesterday
19:25:12 <corvus> one thing that went wrong with the zuul upgrade: the load balancer
19:25:30 <clarkb> corvus: the problem is that we hardcode IP addrs in the proxy config right?
19:25:37 <corvus> entirely my fault, because i forgot 2 things: 1) we moved the config files; and 2) the backend server config is explicit
19:26:03 <clarkb> I feel like we do that because then we aren't reliant on DNS for that to work. Maybe we should just let haproxy do dns lookups?
19:26:15 <corvus> i cleaned up our old config file locations on both load balancers (zuul and gitea), so hopefully that won't bite anyone else :)
19:26:44 <corvus> yeah, i'm kind of thinking that just letting dns or ansible write that automatically might be okay
19:27:01 <clarkb> I'm on board with that. Worst case we notice flakiness on the frontend and can revert
19:27:16 <clarkb> then tackle it some other way
19:27:47 <corvus> automatic config is... uh...
19:28:29 <corvus> what i was expecting to happen... and i think our process for replacing servers would work with that. probably just need to still be able to take a server in and out easily, but just not specify its ip address.
19:28:29 <clarkb> ya we do that with zookeeper actually
19:28:41 <clarkb> the servers are all listed with explicit IPs but ansible figures out what they are for us and puts them in the config file
19:29:04 <corvus> i like that approach
19:29:05 <clarkb> and it does so by looking at hosts in the zookeeper ansible group
19:29:54 <corvus> sounds like consensus to change that. that was the only followup from my weekend
19:29:54 <tonyb> Makes sense to me
19:30:03 <clarkb> zookeeper replacement went smoothly. I did have one unexpected election behavior but afterwards I thought it through and the behavior makes sense to me
19:30:17 <clarkb> #link https://review.opendev.org/c/opendev/zone-opendev.org/+/953844 Remove old zk servers from DNS
19:30:52 <clarkb> after lunch today I plan to approve ^ to remove the old zk servers from DNS unless there are objections. Then I'll plan to delete the old zk servers afterwards or tomorrow morning (again, if there are no objections)
19:31:08 <clarkb> corvus already acked doing ^ so let me know; otherwise I'm planning to proceed with cleanup
19:31:18 <corvus> clarkb: one thought: now that we've shown we can do that particular upgrade process, we should make sure not to copy those questionable notes about upgrades from the etherpad.
19:31:35 <corvus> or revise them or whatever
19:31:42 <clarkb> corvus: ya I did put a note about them possibly being FUD in the etherpad already but I should go ahead and delete them or cross them out and mark them invalid
19:32:02 <corvus> ++
19:32:30 <clarkb> now that we have all these nodes running on noble with docker compose instead of docker-compose we can clean up their docker-compose.yaml files
19:32:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/953846
19:32:45 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/953848
19:32:59 <clarkb> it was writing these changes that helped discover the nodes from multiple clouds behavior
19:34:12 <clarkb> those aren't urgent but it might be nice to get rid of the warnings
19:34:24 <corvus> looking forward to that :)
19:34:34 <clarkb> I also swapped out the mirror-update servers, not sure if we discussed that previously.
19:34:48 <clarkb> Eavesdrop and refstack are the "easy" nodes I have remaining on the easy list
19:34:58 <clarkb> fungi: I know we've been busy with plenty of other stuff but any word on refstack cleanup?
19:37:32 <fungi> nothing
19:37:56 <clarkb> ok. The list is still quite big but we're slowly whittling it down. Thanks for the help and happy to have more
19:38:02 <clarkb> #topic OFTC Matrix bridge no longer supporting new users
19:38:20 <clarkb> I have an action item to go write a spec for this, I just haven't gotten to it yet.
19:40:31 <clarkb> Maybe I should do that before I start looking at gerrit 3.11. I guess there isn't anything new on this until I do that
19:40:39 <clarkb> #topic Working through our TODO list
19:40:43 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:40:56 <clarkb> I also need to migrate this list to something a bit more permanent/better
19:41:09 <clarkb> but a friendly reminder that the list exists if you get bored :)
19:43:19 <clarkb> #topic Pre PTG Planning
19:43:44 <clarkb> I haven't heard any additional feedback on the week of October 6-10 for holding an opendev pre ptg
19:44:09 <clarkb> I think we should all pencil in those dates and I'll start on an announcement and an agenda that we can fill in before then
19:47:01 <clarkb> ok, no more feedback is good feedback. I'll proceed with that as the plan for now
19:47:07 <clarkb> #topic Open Discussion
19:47:09 <clarkb> Anything else?
19:51:16 <fungi> i got nothin'
19:52:10 <clarkb> in that case thanks everyone for your time here and elsewhere keeping opendev up and running
19:52:21 <clarkb> we'll be back next week at the same time and location
19:52:22 <clarkb> #endmeeting
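A footnote on the load balancer discussion from the server upgrades topic: below is a minimal sketch, with a hypothetical group name, port, and destination path, of the group-driven pattern clarkb described for the zookeeper config, applied to writing the haproxy backend list so that no IP addresses are hardcoded in the role. The real system-config roles are almost certainly organized differently, and in practice the Jinja would more likely live in a separate .j2 file consumed by the template module.

- name: Write haproxy backend servers from the inventory group
  # Illustrative only: Ansible resolves each group member's address from the
  # inventory/facts, so replacing a backend server only means updating the
  # inventory rather than editing the load balancer config by hand.
  ansible.builtin.copy:
    dest: /etc/haproxy/conf.d/zuul-web-backends.cfg
    content: |
      backend zuul-web
      {% for host in groups['zuul_web'] %}
        server {{ host }} {{ hostvars[host]['ansible_host'] | default(host) }}:9000 check
      {% endfor %}

The same kind of loop over the zookeeper group is what clarkb described as already producing the zookeeper server entries, which is why that replacement needed no hand-edited IPs.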