Tuesday, 2025-07-29

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Jul 29 19:00:06 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/SMACAD5RVHJU466XJGM3QKJ2GC7OT3XY/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI'm taking next Monday off so will not be around that day19:00
fungii'll be around19:01
clarkbI've also drawn up a quick plan for service coordinator elections which we'll talk about later (but want people to be aware)19:01
clarkbAnything else to announce?19:02
clarkbsounds like no. Lets jump into the agenda then19:04
clarkb#topic Zuul-launcher19:04
clarkbThe change to prevent mixed provider nodesets merged and deployed over the weekend. Mixing is now only possible if it is the only way to fulfill a request (say k8s pods and openstack vms in one nodeset)19:05
corvusi think all the bugfixes are in now19:05
clarkbyesterday we discovered that ovh bhs1 was fully utilized and that seems to have been a bug in handling failed node boots leading to leaks19:05
clarkband ya corvus restarted services with the fix for ^ this morning19:05
clarkbthe raxflex sjc3 graph looks weird still but different weird after the restart. I half suspect we maybe leaked floating ips there and we're hitting quota limits on those19:06
clarkbI'll check on that after the meeting and lunch19:06
corvusyeah, z-l doesn't detect floating ip quota yet19:06
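[Editor's note: a leaked-floating-IP check of the kind being discussed might look like the following OpenStack CLI sketch; the status filter and the cleanup step are assumptions about how one would investigate, not commands run during the meeting.]

```
# Sketch, assuming the OpenStack CLI is configured for the affected
# region (e.g. raxflex sjc3). Floating IPs in DOWN status are allocated
# to the project but not attached to any port -- likely leak candidates.
openstack floating ip list --status DOWN

# After confirming an IP is leaked, release it by ID (placeholder shown):
openstack floating ip delete <FIP_ID>
```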
clarkbthen separately last week we discovered that at least part of the problem with image builds was gitea-lb02 losing its ipv6 address constantly19:08
fungithat was a super fun rabbit hole19:08
clarkbthe address would work for a few minutes then disappear then return an hour later and work for a bit before going away again. We replaced it with a new noble gitea-lb03 node as there was some indication it could be a jammy bug19:08
clarkbspecifically in systemd-networkd19:08
clarkbin addition to that we improved the haproxy health checks to use the gitea health api19:09
fungibut other servers running jammy there didn't exhibit this behavior, so odds it's a version-specific issue are slim19:09
clarkbya. I've kept the old gitea-lb02 server around after cleaning up system-config and dns in case vexxhost wants to dig in further19:10
clarkbbut probably at the end of this week we can clean the server up if we don't make progress on that19:10
fungithanks!19:10
clarkbthe last thing I've got on the agenda notes for this topic is that nodepool is gone19:10
clarkbthe servers are deleted etc19:10
clarkbcorvus: I think openstack/project-config still has nodepool/ dir contents. Is there a change to clean those up? Might be good to avoid future confusion19:10
corvusi don't think so.  i'll take a look.  but i'd like to keep grafana dashboards around for a while.19:11
clarkb++ I think keeping those around is fine. I'm more worried about people trying to fix image builds or change max-server counts19:12
corvusyep.  i'll get rid of the rest19:12
clarkbthanks19:13
clarkbanything else on this topic?19:13
corvusnot from me19:13
clarkb#topic Gerrit 3.11 Upgrade Planning19:15
clarkbShort of testing an upgrade with our production data I've done what I think is reasonable to try and reproduce the reported offline reindexing problems with gerrit 3.11.419:15
clarkbI have been unable to reproduce the problem19:15
clarkbgiven that I think I'll proceed with testing the upgrade itself. Hopefully tomorrow19:16
clarkb#link https://www.gerritcodereview.com/3.11.html19:16
clarkbthese two job links are for jobs that held nodes that I'll use for testing19:16
clarkb#link https://zuul.opendev.org/t/openstack/build/f1ca0d1f2e054829a4506ececb58bed319:16
clarkb#link https://zuul.opendev.org/t/openstack/build/588723b923e94901af3065143d9df81819:17
clarkbthe nodes ran under zuul launcher so didn't get lost in the nodepool cleanup19:17
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade19:17
clarkbUnfortunately, the delays have us getting into openstack's end of release cycle activities19:18
clarkbso planning an actual date may be painful. But we'll figure something out once I've got a better picture of the upgrade itself19:18
clarkbany comments, concerns or feedback on this topic before we move on?19:19
fungii have none19:19
clarkb#topic Upgrading old servers19:20
clarkbThe change to update matrix and irc bot container logging to journald from syslog landed on the existing server19:20
clarkbthis is a prereq to upgrading to noble and using podman as the runtime backend19:20
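[Editor's note: a minimal sketch of the journald logging arrangement being described; the container and image names are hypothetical.]

```
# podman (the runtime on Noble) supports the journald log driver;
# with docker it has to be selected explicitly.
podman run -d --log-driver=journald --name limnoria <some-bot-image>

# Container output then lands in the journal and can be read with:
journalctl CONTAINER_NAME=limnoria
```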
clarkbseems to be working fine. I then looked briefly at how the current server is configured to get a sense for what is required to replace it. The main thing is that the logging data is on a cinder volume. I think this means we need a downtime to either move the cinder volume from host A to host B, or sync the data from one volume to another19:21
clarkbit's a bit tricky because for the limnoria irc bot and the matrix eavesdrop bot I think we really don't want them running concurrently on two different hosts19:22
clarkblong story short I think we should pick a day (fridays are probably best due to when meetings occur) to stop services on the old server, land a change to configure the new server, and copy the data between them19:22
clarkbsomething like boot new server with new volume, rsync data, time passes, rsync data again, stop services on old server, approve change to configure new server, rsync data, check deployment brings everything back up again19:23
clarkbguessing at least an hour for that19:23
fungithat would work well19:23
clarkbcool I would volunteer to do that this Friday but I'm meeting up with folks for lunch during FOSSY so it will probably have to happen next week, or someone else can take it on19:24
fungii likely can19:24
clarkbfungi: do you want to boot the new server and get things prepped for that or should I?19:25
fungii'll do it19:25
clarkbperfect thanks19:25
clarkbthen for refstack you mentioned announcing it was going away then planning to proceed with that.19:25
clarkbAny progress there?19:25
fungithough if we're moving all services at the same time, any reason not to move the cinder volume?19:26
fungidetach/attach instead of rsync19:26
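[Editor's note: the detach/attach alternative fungi suggests would be a short sequence like this; server and volume names are placeholders.]

```
# Detach the logs volume from the old server and attach it to the new
# one, instead of copying the data over the network.
openstack server remove volume <old-server> <logs-volume>
openstack server add volume <new-server> <logs-volume> --device /dev/vdb

# Then mount the volume on the new server as it was on the old one.
```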
clarkbfungi: probably not. I always worry the cinder volumes won't detach cleanly but that is probably an overblown concern19:26
fungiand no, haven't written up an announcement for refstack going away yet19:26
clarkbthe main reason to avoid it would be moving providers but I don't think we should do that in this case19:26
fungifor some reason i thought we had moved logs into afs, but i hadn't looked at it recently19:27
clarkbwhen you boot the new noble node don't forget to use the --config-drive flag if booting it in rax classic (its required for that image)19:27
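[Editor's note: a boot command honoring the config-drive reminder above might look like this; image, flavor, key, and server names are placeholders.]

```
# rax classic requires a config drive for this image, hence
# --config-drive True on server create.
openstack server create \
  --image <noble-image> \
  --flavor <flavor> \
  --key-name <key> \
  --config-drive True \
  <new-server-name>
```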
clarkbfungi: yes I had thought so too but we haven't19:27
fungiyep, will do19:27
clarkbanything else on this topic?19:28
fungii guess it was the meetings site hosting on static.o.o that threw me19:28
fungibut i suppose it's proxied to eavesdrop still19:28
clarkbya I think the yaml2ical data is published to afs from zuul jobs?19:28
fungithat's what it was, yep19:28
funginothing else from me19:29
clarkb#topic Vexxhost backup server inaccessible19:29
clarkbyesterday fungi noticed the vexxhost backup server was inaccessible and backups to it were failing19:29
clarkbgrabbing the console log failed as did ping and ssh19:29
clarkbwe waited a day then corvus asked nova to reboot it today. The situation didn't change except the console log was available today19:30
fungithough nova claimed it was "active"19:30
fungithe whole time19:30
clarkbguilhermesp managed to take a look today and root caused it to an OVS issue19:30
clarkbwhich explains why network connectivity was sad but nova saw it as active/up19:30
fungiwhich would explain why a reboot didn't fix it, though i'm not sure why console logs were initially unreachable, maybe both caused by a single incident19:31
clarkbanyway this has been corrected by guilhermesp. guilhermesp indicates that this would not have been correctable by an end user so if we see similar symptoms in the future we should file a ticket with vexxhost19:31
clarkbmanual inspection of the host seems to show things are happy again. But we should keep an eye on the infra-root inbox to double check backups aren't still erroring to it19:31
clarkbanything else we want to call out about this incident before we move on?19:32
funginot i19:32
clarkb#topic Matrix for OpenDev comms19:33
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing19:33
clarkbfrickler reviewed the spec. I've responded but haven't pushed a new patchset yet. Was hoping for a bit more feedback before I add in some of those suggestions19:33
clarkbso if you have 15 minutes to read what I've written that would be great19:33
clarkbit is probably also worth noting that the matrix.org homeserver now requires users to be at least 18 years old in response to a recent UK law change...19:34
corvusthanks, i'll take a look (intended to earlier, but things came up!)19:34
fungii suppose a new development there is element's announcement that no one under 18 years of age is allowed to have a matrix.org account any longer, though using another homeserver is a potential workaround for affected users19:34
clarkbanyone using that homeserver should've gotten a message from them with the terms of service update19:34
clarkbfungi: yup. I'm not sure that is a huge barrier for our current user base, but something to consider19:35
corvusthat also seems like something that may ultimately not be limited to matrix19:35
clarkbcorvus: yes, apparently wikimedia is challenging the law19:35
fungii think the internet is about to become 18-and-up19:35
clarkbhappy to update the spec with that info too if we think it is important to capture19:37
funginot at this stage, i don't expect19:37
fungisomething we can deal with down the road if it becomes relevant19:37
clarkb#topic Working through our TODO list19:38
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:38
clarkbjust our weekly reminder that if anyone ends up bored (ha) they can check the list for the next thing to chew on19:38
clarkb#topic Pre PTG Planning19:38
clarkbsimilarly we can figure out what our list looks like during our Pre PTG19:39
clarkb#link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document19:39
clarkbplease add topic ideas to that etherpad with things you'd like to see covered with a bit more depth than we can do day to day or in this meeting19:39
clarkbI think we can do a retrospective on the noble + podman switch. A few bumps but overall seems to work well (as an example)19:40
clarkb#topic Service Coordinator Election Planning19:40
clarkbit has been almost 6 months since we last elected our service coordinator (me)19:41
clarkbwhich means it is time to make a plan for the next election19:41
clarkbProposal: Nomination Period open from August 5, 2025 to August 19, 2025. If necessary we will hold an election from August 20, 2025 to August 27, 2025. All date ranges and times will be in UTC.19:41
clarkbthis is my proposal. It basically mimics what we've done the last several elections.19:41
fungiwfm19:41
clarkbI'd like to make that official today so if there are any comments, questions, or concerns please bring them up before EOD19:42
clarkb(I'll make it official via email to the service-discuss list)19:42
clarkbthen I'd like to say I'm happy for someone else to take on more of these organizational and liaison duties if there is interest19:42
clarkbI'm happy to hang around in a supporting role19:42
clarkblet me know if there is interest or if you have any questions19:43
fungii too would be thrilled to help out supporting a new coordinator if there is one19:43
clarkb#topic Open Discussion19:44
clarkbAnything else?19:44
clarkbsounds like that may be everything19:46
fungithanks clarkb!19:46
clarkbthank you everyone for your time and help running opendev19:46
clarkbwe'll be back here next week at the same time and location19:47
clarkb#endmeeting19:47
opendevmeetMeeting ended Tue Jul 29 19:47:09 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:47
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-29-19.00.html19:47
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-29-19.00.txt19:47
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-29-19.00.log.html19:47

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!