19:00:06 <clarkb> #startmeeting infra
19:00:06 <opendevmeet> Meeting started Tue Jul 29 19:00:06 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:06 <opendevmeet> The meeting name has been set to 'infra'
19:00:12 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/SMACAD5RVHJU466XJGM3QKJ2GC7OT3XY/ Our Agenda
19:00:16 <clarkb> #topic Announcements
19:00:25 <clarkb> I'm taking next Monday off so will not be around that day
19:01:02 <fungi> i'll be around
19:01:07 <clarkb> I've also drawn up a quick plan for service coordinator elections which we'll talk about later (but want people to be aware)
19:02:17 <clarkb> Anything else to announce?
19:04:25 <clarkb> sounds like no. Let's jump into the agenda then
19:04:31 <clarkb> #topic Zuul-launcher
19:05:00 <clarkb> The change to prevent mixed provider nodesets merged and deployed over the weekend. Mixing providers is now only possible if it is the only way to fulfill a request (say k8s pods and openstack vms in one nodeset)
19:05:26 <corvus> i think all the bugfixes are in now
19:05:29 <clarkb> yesterday we discovered that ovh bhs1 was fully utilized, and that seems to have been due to a bug in handling failed node boots leading to leaked nodes
19:05:42 <clarkb> and ya corvus restarted services with the fix for ^ this morning
19:06:15 <clarkb> the raxflex sjc3 graph still looks weird, but a different weird after the restart. I half suspect we may have leaked floating ips there and are hitting quota limits on those
19:06:20 <clarkb> I'll check on that after the meeting and lunch
19:06:45 <corvus> yeah, z-l doesn't detect floating ip quota yet
19:08:18 <clarkb> then separately last week we discovered that at least part of the problem with image builds was gitea-lb02 losing its ipv6 address constantly
19:08:44 <fungi> that was a super fun rabbit hole
19:08:48 <clarkb> the address would work for a few minutes, then disappear, then return an hour later and work for a bit before going away again. We replaced it with a new noble gitea-lb03 node as there was some indication it could be a jammy bug
19:08:54 <clarkb> specifically in systemd-networkd
19:09:11 <clarkb> in addition to that we improved the haproxy health checks to use the gitea health api
19:09:39 <fungi> but other servers running jammy there didn't exhibit this behavior, so odds it's a version-specific issue are slim
19:10:01 <clarkb> ya. I've kept the old gitea-lb02 server around after cleaning up system-config and dns in case vexxhost wants to dig in further
19:10:15 <clarkb> but probably at the end of this week we can clean the server up if we don't make progress on that
19:10:27 <fungi> thanks!
19:10:34 <clarkb> the last thing I've got on the agenda notes for this topic is that nodepool is gone
19:10:40 <clarkb> the servers are deleted etc
19:10:59 <clarkb> corvus: I think openstack/project-config still has nodepool/ dir contents. Is there a change to clean those up? Might be good to avoid future confusion
19:11:48 <corvus> i don't think so. i'll take a look. but i'd like to keep grafana dashboards around for a while.
19:12:13 <clarkb> ++ I think keeping those around is fine. I'm more worried about people trying to fix image builds or change max-server counts
19:12:22 <corvus> yep. i'll get rid of the rest
19:13:22 <clarkb> thanks
19:13:29 <clarkb> anything else on this topic?
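
For reference, the floating ip check mentioned above could look roughly like the sketch below; the exact commands were not discussed in the meeting, and the --os-cloud value is a placeholder rather than the real clouds.yaml entry name:

    # list floating ips in DOWN state, which are the likely leak candidates
    openstack --os-cloud <raxflex-cloud> --os-region-name SJC3 floating ip list --status DOWN

    # compare the count against the project's floating ip quota
    openstack --os-cloud <raxflex-cloud> --os-region-name SJC3 quota show | grep -i floating
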
19:13:45 <corvus> not from me
19:15:06 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:15:34 <clarkb> Short of testing an upgrade with our production data, I've done what I think is reasonable to try to reproduce the reported offline reindexing problems with gerrit 3.11.4
19:15:39 <clarkb> I have been unable to reproduce the problem
19:16:34 <clarkb> given that, I think I'll proceed with testing the upgrade itself. Hopefully tomorrow
19:16:40 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:16:53 <clarkb> these next two links are for the jobs whose held nodes I'll use for testing
19:16:58 <clarkb> #link https://zuul.opendev.org/t/openstack/build/f1ca0d1f2e054829a4506ececb58bed3
19:17:02 <clarkb> #link https://zuul.opendev.org/t/openstack/build/588723b923e94901af3065143d9df818
19:17:11 <clarkb> the nodes ran under zuul launcher so didn't get lost in the nodepool cleanup
19:17:50 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:18:30 <clarkb> Unfortunately, the delays have us getting into openstack's end-of-release-cycle activities
19:18:48 <clarkb> so planning an actual date may be painful. But we'll figure something out once I've got a better picture of the upgrade itself
19:19:01 <clarkb> any comments, concerns or feedback on this topic before we move on?
19:19:07 <fungi> i have none
19:20:16 <clarkb> #topic Upgrading old servers
19:20:39 <clarkb> The change to switch the matrix and irc bot container logging from syslog to journald landed on the existing server
19:20:48 <clarkb> this is a prereq to upgrading to noble and using podman as the runtime backend
19:21:38 <clarkb> seems to be working fine. I then looked briefly at how the current server is configured to get a sense for what is required to replace it. The main thing is the logging data is on a cinder volume. I think this means we need a downtime to either move the cinder volume from host A to host B or sync the data from one volume to the other
19:22:04 <clarkb> it's a bit tricky because for the limnoria irc bot and the matrix eavesdrop bot I think we really don't want them running concurrently on two different hosts
19:22:35 <clarkb> long story short I think we should pick a day (fridays are probably best due to when meetings occur) to stop services on the old server, land a change to configure the new server, and copy the data between them
19:23:13 <clarkb> something like boot new server with new volume, rsync data, time passes, rsync data again, stop services on old server, approve change to configure new server, rsync data, check that the deployment brings everything back up again
19:23:19 <clarkb> guessing at least an hour for that
19:23:27 <fungi> that would work well
19:24:03 <clarkb> cool, I would volunteer to do that this Friday but I'm meeting up with folks for lunch during FOSSY so it will probably have to happen next week, or someone else can take it on
19:24:19 <fungi> i likely can
19:25:14 <clarkb> fungi: do you want to boot the new server and get things prepped for that or should I?
19:25:22 <fungi> i'll do it
19:25:27 <clarkb> perfect, thanks
19:25:49 <clarkb> then for refstack, you mentioned announcing it was going away and then proceeding with that.
19:25:53 <clarkb> Any progress there?
19:26:01 <fungi> though if we're moving all services at the same time, any reason not to move the cinder volume?
19:26:25 <fungi> detach/attach instead of rsync
19:26:35 <clarkb> fungi: probably not. I always worry the cinder volumes won't detach cleanly, but that is probably an overblown concern
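
A rough sketch of the two data migration options discussed above, where the hostnames, mount point, and volume id are placeholders rather than the real server details:

    # option 1: rsync between volumes; run a first pass ahead of time, then a
    # final pass after services are stopped to pick up the remaining delta
    rsync -avxH /mnt/eavesdrop-data/ root@<new-server>.opendev.org:/mnt/eavesdrop-data/

    # option 2: move the cinder volume itself once services are stopped
    openstack server remove volume <old-server> <volume-id>
    openstack server add volume <new-server> <volume-id> --device /dev/vdb
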
19:26:41 <fungi> and no, haven't written up an announcement for refstack going away yet
19:26:58 <clarkb> the main reason to avoid it would be if we were moving providers, but I don't think we should do that in this case
19:27:15 <fungi> for some reason i thought we had moved logs into afs, but i hadn't looked at it recently
19:27:20 <clarkb> when you boot the new noble node don't forget to use the --config-drive flag if booting it in rax classic (it's required for that image)
19:27:28 <clarkb> fungi: yes I had thought so too but we haven't
19:27:30 <fungi> yep, will do
19:28:19 <clarkb> anything else on this topic?
19:28:29 <fungi> i guess it was the meetings site hosting on static.o.o that threw me
19:28:45 <fungi> but i suppose it's proxied to eavesdrop still
19:28:46 <clarkb> ya I think the yaml2ical data is published to afs from zuul jobs?
19:28:59 <fungi> that's what it was, yep
19:29:08 <fungi> nothing else from me
19:29:30 <clarkb> #topic Vexxhost backup server inaccessible
19:29:42 <clarkb> yesterday fungi noticed the vexxhost backup server was inaccessible and backups to it were failing
19:29:50 <clarkb> grabbing the console log failed, as did ping and ssh
19:30:11 <clarkb> we waited a day, then corvus asked nova to reboot it today. The situation didn't change, except that the console log was available today
19:30:13 <fungi> though nova claimed it was "active"
19:30:31 <fungi> the whole time
19:30:35 <clarkb> guilhermesp managed to take a look today and root caused it to an OVS issue
19:30:46 <clarkb> which explains why network connectivity was sad but nova saw it as active/up
19:31:15 <fungi> which would explain why a reboot didn't fix it, though i'm not sure why console logs were initially unreachable, maybe both caused by a single incident
19:31:19 <clarkb> anyway this has been corrected by guilhermesp. guilhermesp indicates that this would not have been correctable by an end user, so if we see similar symptoms in the future we should file a ticket with vexxhost
19:31:52 <clarkb> manual inspection of the host seems to show things are happy again. But we should keep an eye on the infra-root inbox to double check backups aren't still erroring to it
19:32:27 <clarkb> anything else we want to call out about this incident before we move on?
19:32:48 <fungi> not i
19:33:13 <clarkb> #topic Matrix for OpenDev comms
19:33:19 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing
19:33:41 <clarkb> frickler reviewed the spec. I've responded but haven't pushed a new patchset yet. Was hoping for a bit more feedback before I add in some of those suggestions
19:33:56 <clarkb> so if you have 15 minutes to read what I've written that would be great
19:34:33 <clarkb> it is probably also worth noting that the matrix.org homeserver now requires users to be at least 18 years old in response to a recent UK law change...
19:34:40 <corvus> thanks, i'll take a look (intended to earlier, but things came up!)
19:34:41 <fungi> i suppose a new development there is element's announcement that no one under 18 years of age is allowed to have a matrix.org account any longer, though using another homeserver is a potential workaround for affected users
19:34:46 <clarkb> anyone using that homeserver should've gotten a message from them with the terms of service update
19:35:06 <clarkb> fungi: yup. I'm not sure that is a huge barrier for our current user base, but something to consider
19:35:11 <corvus> that also seems like something that may ultimately not be limited to matrix
19:35:22 <clarkb> corvus: yes, apparently wikimedia is challenging the law
19:35:32 <fungi> i think the internet is about to become 18-and-up
19:37:13 <clarkb> happy to update the spec with that info too if we think it is important to capture
19:37:30 <fungi> not at this stage, i don't expect
19:37:52 <fungi> something we can deal with down the road if it becomes relevant
19:38:13 <clarkb> #topic Working through our TODO list
19:38:22 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:38:39 <clarkb> just our weekly reminder that if anyone ends up bored (ha) they can check the list for the next thing to chew on
19:38:49 <clarkb> #topic Pre PTG Planning
19:39:31 <clarkb> similarly we can figure out what our list looks like during our Pre PTG
19:39:38 <clarkb> #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:39:56 <clarkb> please add topic ideas to that etherpad with things you'd like to see covered with a bit more depth than we can do day to day or in this meeting
19:40:36 <clarkb> I think we can do a retrospective on the noble + podman switch. A few bumps but overall it seems to work well (as an example)
19:40:50 <clarkb> #topic Service Coordinator Election Planning
19:41:05 <clarkb> it has been almost 6 months since we last elected our service coordinator (me)
19:41:12 <clarkb> which means it is time to make a plan for the next election
19:41:21 <clarkb> Proposal: Nomination Period open from August 5, 2025 to August 19, 2025. If necessary we will hold an election from August 20, 2025 to August 27, 2025. All date ranges and times will be in UTC.
19:41:44 <clarkb> this is my proposal. It basically mimics what we've done in the last several elections.
19:41:54 <fungi> wfm
19:42:11 <clarkb> I'd like to make that official today, so if there are any comments, questions, or concerns please bring them up before EOD
19:42:23 <clarkb> (I'll make it official via email to the service-discuss list)
19:42:46 <clarkb> then I'd like to say I'm happy for someone else to take on more of these organizational and liaison duties if there is interest
19:42:55 <clarkb> I'm happy to hang around in a supporting role
19:43:37 <clarkb> let me know if there is interest or if you have any questions
19:43:41 <fungi> i too would be thrilled to help out supporting a new coordinator if there is one
19:44:27 <clarkb> #topic Open Discussion
19:44:30 <clarkb> Anything else?
19:46:50 <clarkb> sounds like that may be everything
19:46:54 <fungi> thanks clarkb!
19:46:56 <clarkb> thank you everyone for your time and help running opendev
19:47:05 <clarkb> we'll be back here next week at the same time and location
19:47:09 <clarkb> #endmeeting