19:00:06 <clarkb> #startmeeting infra
19:00:06 <opendevmeet> Meeting started Tue Jul 29 19:00:06 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:06 <opendevmeet> The meeting name has been set to 'infra'
19:00:12 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/SMACAD5RVHJU466XJGM3QKJ2GC7OT3XY/ Our Agenda
19:00:16 <clarkb> #topic Announcements
19:00:25 <clarkb> I'm taking next Monday off so will not be around that day
19:01:02 <fungi> i'll be around
19:01:07 <clarkb> I've also drawn up a quick plan for service coordinator elections which we'll talk about later (but want people to be aware)
19:02:17 <clarkb> Anything else to announce?
19:04:25 <clarkb> sounds like no. Let's jump into the agenda then
19:04:31 <clarkb> #topic Zuul-launcher
19:05:00 <clarkb> The change to prevent mixed provider nodesets merged and deployed over the weekend. Mixing providers is now only possible if it is the only way to fulfill a request (say k8s pods and openstack vms in one nodeset)
19:05:26 <corvus> i think all the bugfixes are in now
19:05:29 <clarkb> yesterday we discovered that ovh bhs1 was fully utilized, and that seems to have been due to a bug in handling failed node boots leading to leaked nodes
19:05:42 <clarkb> and ya corvus restarted services with the fix for ^ this morning
19:06:15 <clarkb> the raxflex sjc3 graph still looks weird, but a different weird after the restart. I half suspect we may have leaked floating ips there and are hitting quota limits on those
19:06:20 <clarkb> I'll check on that after the meeting and lunch
19:06:45 <corvus> yeah, z-l doesn't detect floating ip quota yet
19:08:18 <clarkb> then separately last week we discovered that at least part of the problem with image builds was gitea-lb02 losing its ipv6 address constantly
19:08:44 <fungi> that was a super fun rabbit hole
19:08:48 <clarkb> the address would work for a few minutes, then disappear, then return an hour later and work for a bit before going away again. We replaced it with a new noble gitea-lb03 node as there was some indication it could be a jammy bug
19:08:54 <clarkb> specifically in systemd-networkd
19:09:11 <clarkb> in addition to that we improved the haproxy health checks to use the gitea health api
19:09:39 <fungi> but other servers running jammy there didn't exhibit this behavior, so odds it's a version-specific issue are slim
19:10:01 <clarkb> ya. I've kept the old gitea-lb02 server around after cleaning up system-config and dns in case vexxhost wants to dig in further
19:10:15 <clarkb> but probably at the end of this week we can clean the server up if we don't make progress on that
19:10:27 <fungi> thanks!
19:10:34 <clarkb> the last thing I've got on the agenda notes for this topic is that nodepool is gone
19:10:40 <clarkb> the servers are deleted etc
19:10:59 <clarkb> corvus: I think openstack/project-config still has nodepool/ dir contents. Is there a change to clean those up? Might be good to avoid future confusion
19:11:48 <corvus> i don't think so. i'll take a look. but i'd like to keep grafana dashboards around for a while.
19:12:13 <clarkb> ++ I think keeping those around is fine. I'm more worried about people trying to fix image builds or change max-server counts
19:12:22 <corvus> yep. i'll get rid of the rest
19:13:22 <clarkb> thanks
19:13:29 <clarkb> anything else on this topic?
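
For reference, the floating ip check mentioned above could look roughly like the sketch below; the exact commands were not discussed in the meeting, and the --os-cloud value is a placeholder rather than the real clouds.yaml entry name:

    # list floating ips in DOWN state, which are the likely leak candidates
    openstack --os-cloud <raxflex-cloud> --os-region-name SJC3 floating ip list --status DOWN

    # compare the count against the project's floating ip quota
    openstack --os-cloud <raxflex-cloud> --os-region-name SJC3 quota show | grep -i floating
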
19:13:45 <corvus> not from me
19:15:06 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:15:34 <clarkb> Short of testing an upgrade with our production data, I've done what I think is reasonable to try to reproduce the reported offline reindexing problems with gerrit 3.11.4
19:15:39 <clarkb> I have been unable to reproduce the problem
19:16:34 <clarkb> given that, I think I'll proceed with testing the upgrade itself. Hopefully tomorrow
19:16:40 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:16:53 <clarkb> these next two links are for the jobs whose held nodes I'll use for testing
19:16:58 <clarkb> #link https://zuul.opendev.org/t/openstack/build/f1ca0d1f2e054829a4506ececb58bed3
19:17:02 <clarkb> #link https://zuul.opendev.org/t/openstack/build/588723b923e94901af3065143d9df818
19:17:11 <clarkb> the nodes ran under zuul launcher so didn't get lost in the nodepool cleanup
19:17:50 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:18:30 <clarkb> Unfortunately, the delays have us getting into openstack's end-of-release-cycle activities
19:18:48 <clarkb> so planning an actual date may be painful. But we'll figure something out once I've got a better picture of the upgrade itself
19:19:01 <clarkb> any comments, concerns or feedback on this topic before we move on?
19:19:07 <fungi> i have none
19:20:16 <clarkb> #topic Upgrading old servers
19:20:39 <clarkb> The change to switch the matrix and irc bot container logging from syslog to journald landed on the existing server
19:20:48 <clarkb> this is a prereq to upgrading to noble and using podman as the runtime backend
19:21:38 <clarkb> seems to be working fine. I then looked briefly at how the current server is configured to get a sense for what is required to replace it. The main thing is the logging data is on a cinder volume. I think this means we need a downtime to either move the cinder volume from host A to host B or sync the data from one volume to the other
19:22:04 <clarkb> it's a bit tricky because for the limnoria irc bot and the matrix eavesdrop bot I think we really don't want them running concurrently on two different hosts
19:22:35 <clarkb> long story short I think we should pick a day (fridays are probably best due to when meetings occur) to stop services on the old server, land a change to configure the new server, and copy the data between them
19:23:13 <clarkb> something like boot new server with new volume, rsync data, time passes, rsync data again, stop services on old server, approve change to configure new server, rsync data, check that the deployment brings everything back up again
19:23:19 <clarkb> guessing at least an hour for that
19:23:27 <fungi> that would work well
19:24:03 <clarkb> cool, I would volunteer to do that this Friday but I'm meeting up with folks for lunch during FOSSY so it will probably have to happen next week, or someone else can take it on
19:24:19 <fungi> i likely can
19:25:14 <clarkb> fungi: do you want to boot the new server and get things prepped for that or should I?
19:25:22 <fungi> i'll do it
19:25:27 <clarkb> perfect, thanks
19:25:49 <clarkb> then for refstack, you mentioned announcing it was going away and then proceeding with that.
19:25:53 <clarkb> Any progress there?
19:26:01 <fungi> though if we're moving all services at the same time, any reason not to move the cinder volume?
19:26:25 <fungi> detach/attach instead of rsync
19:26:35 <clarkb> fungi: probably not. I always worry the cinder volumes won't detach cleanly, but that is probably an overblown concern
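
A rough sketch of the two data migration options discussed above, where the hostnames, mount point, and volume id are placeholders rather than the real server details:

    # option 1: rsync between volumes; run a first pass ahead of time, then a
    # final pass after services are stopped to pick up the remaining delta
    rsync -avxH /mnt/eavesdrop-data/ root@<new-server>.opendev.org:/mnt/eavesdrop-data/

    # option 2: move the cinder volume itself once services are stopped
    openstack server remove volume <old-server> <volume-id>
    openstack server add volume <new-server> <volume-id> --device /dev/vdb
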
19:26:41 <fungi> and no, haven't written up an announcement for refstack going away yet
19:26:58 <clarkb> the main reason to avoid it would be if we were moving providers, but I don't think we should do that in this case
19:27:15 <fungi> for some reason i thought we had moved logs into afs, but i hadn't looked at it recently
19:27:20 <clarkb> when you boot the new noble node don't forget to use the --config-drive flag if booting it in rax classic (it's required for that image)
19:27:28 <clarkb> fungi: yes I had thought so too but we haven't
19:27:30 <fungi> yep, will do
19:28:19 <clarkb> anything else on this topic?
19:28:29 <fungi> i guess it was the meetings site hosting on static.o.o that threw me
19:28:45 <fungi> but i suppose it's proxied to eavesdrop still
19:28:46 <clarkb> ya I think the yaml2ical data is published to afs from zuul jobs?
19:28:59 <fungi> that's what it was, yep
19:29:08 <fungi> nothing else from me
19:29:30 <clarkb> #topic Vexxhost backup server inaccessible
19:29:42 <clarkb> yesterday fungi noticed the vexxhost backup server was inaccessible and backups to it were failing
19:29:50 <clarkb> grabbing the console log failed, as did ping and ssh
19:30:11 <clarkb> we waited a day, then corvus asked nova to reboot it today. The situation didn't change, except that the console log was available today
19:30:13 <fungi> though nova claimed it was "active"
19:30:31 <fungi> the whole time
19:30:35 <clarkb> guilhermesp managed to take a look today and root caused it to an OVS issue
19:30:46 <clarkb> which explains why network connectivity was sad but nova saw it as active/up
19:31:15 <fungi> which would explain why a reboot didn't fix it, though i'm not sure why console logs were initially unreachable, maybe both caused by a single incident
19:31:19 <clarkb> anyway this has been corrected by guilhermesp. guilhermesp indicates that this would not have been correctable by an end user, so if we see similar symptoms in the future we should file a ticket with vexxhost
19:31:52 <clarkb> manual inspection of the host seems to show things are happy again. But we should keep an eye on the infra-root inbox to double check backups aren't still erroring to it
19:32:27 <clarkb> anything else we want to call out about this incident before we move on?
19:32:48 <fungi> not i
19:33:13 <clarkb> #topic Matrix for OpenDev comms
19:33:19 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing
19:33:41 <clarkb> frickler reviewed the spec. I've responded but haven't pushed a new patchset yet. Was hoping for a bit more feedback before I add in some of those suggestions
19:33:56 <clarkb> so if you have 15 minutes to read what I've written that would be great
19:34:33 <clarkb> it is probably also worth noting that the matrix.org homeserver now requires users to be at least 18 years old in response to a recent UK law change...
19:34:40 <corvus> thanks, i'll take a look (intended to earlier, but things came up!)
19:34:41 <fungi> i suppose a new development there is element's announcement that no one under 18 years of age is allowed to have a matrix.org account any longer, though using another homeserver is a potential workaround for affected users
19:34:46 <clarkb> anyone using that homeserver should've gotten a message from them with the terms of service update
19:35:06 <clarkb> fungi: yup. I'm not sure that is a huge barrier for our current user base, but something to consider
19:35:11 <corvus> that also seems like something that may ultimately not be limited to matrix
19:35:22 <clarkb> corvus: yes, apparently wikimedia is challenging the law
19:35:32 <fungi> i think the internet is about to become 18-and-up
19:37:13 <clarkb> happy to update the spec with that info too if we think it is important to capture
19:37:30 <fungi> not at this stage, i don't expect
19:37:52 <fungi> something we can deal with down the road if it becomes relevant
19:38:13 <clarkb> #topic Working through our TODO list
19:38:22 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:38:39 <clarkb> just our weekly reminder that if anyone ends up bored (ha) they can check the list for the next thing to chew on
19:38:49 <clarkb> #topic Pre PTG Planning
19:39:31 <clarkb> similarly we can figure out what our list looks like during our Pre PTG
19:39:38 <clarkb> #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:39:56 <clarkb> please add topic ideas to that etherpad with things you'd like to see covered with a bit more depth than we can do day to day or in this meeting
19:40:36 <clarkb> I think we can do a retrospective on the noble + podman switch. A few bumps but overall it seems to work well (as an example)
19:40:50 <clarkb> #topic Service Coordinator Election Planning
19:41:05 <clarkb> it has been almost 6 months since we last elected our service coordinator (me)
19:41:12 <clarkb> which means it is time to make a plan for the next election
19:41:21 <clarkb> Proposal: Nomination Period open from August 5, 2025 to August 19, 2025. If necessary we will hold an election from August 20, 2025 to August 27, 2025. All date ranges and times will be in UTC.
19:41:44 <clarkb> this is my proposal. It basically mimics what we've done in the last several elections.
19:41:54 <fungi> wfm
19:42:11 <clarkb> I'd like to make that official today, so if there are any comments, questions, or concerns please bring them up before EOD
19:42:23 <clarkb> (I'll make it official via email to the service-discuss list)
19:42:46 <clarkb> then I'd like to say I'm happy for someone else to take on more of these organizational and liaison duties if there is interest
19:42:55 <clarkb> I'm happy to hang around in a supporting role
19:43:37 <clarkb> let me know if there is interest or if you have any questions
19:43:41 <fungi> i too would be thrilled to help out supporting a new coordinator if there is one
19:44:27 <clarkb> #topic Open Discussion
19:44:30 <clarkb> Anything else?
19:46:50 <clarkb> sounds like that may be everything
19:46:54 <fungi> thanks clarkb!
19:46:56 <clarkb> thank you everyone for your time and help running opendev
19:47:05 <clarkb> we'll be back here next week at the same time and location
19:47:09 <clarkb> #endmeeting