nick | message | time |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Jul 29 19:00:06 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/SMACAD5RVHJU466XJGM3QKJ2GC7OT3XY/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | I'm taking next Monday off so will not be around that day | 19:00 |
fungi | ill be around | 19:01 |
clarkb | I've also drawn up a quick plan for service coordinator elections which we'll talk about later (but want people to be aware) | 19:01 |
clarkb | Anything else to announce? | 19:02 |
clarkb | sounds like no. Let's jump into the agenda then | 19:04 |
clarkb | #topic Zuul-launcher | 19:04 |
clarkb | The change to prevent mixed provider nodesets merged and deployed over the weekend. Mixing providers is now only possible if it is the only way to fulfill a request (say k8s pods and openstack vms in one nodeset) | 19:05 |
corvus | i think all the bugfixes are in now | 19:05 |
clarkb | yesterday we discovered that ovh bhs1 was fully utilized and that seems to have been a bug in handling failed node boots leading to leaks | 19:05 |
clarkb | and ya corvus restarted services with the fix for ^ this morning | 19:05 |
clarkb | the raxflex sjc3 graph looks weird still but different weird after the restart. I half suspect we maybe leaked floating ips there and we're hitting quota limits on those | 19:06 |
clarkb | I'll check on that after the meeting and lunch | 19:06 |
corvus | yeah, z-l doesn't detect floating ip quota yet | 19:06 |
clarkb | then separately last week we discovered that at least part of the problem with image builds was gitea-lb02 losing its ipv6 address constantly | 19:08 |
fungi | that was a super fun rabbit hole | 19:08 |
clarkb | the address would work for a few minutes then disappear then return an hour later and work for a bit before going away again. We replaced it with a new noble gitea-lb03 node as there was some indication it could be a jammy bug | 19:08 |
clarkb | specifically in systemd-networkd | 19:08 |
clarkb | in addition to that we improved the haproxy health checks to use the gitea health api | 19:09 |
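(Editor's note: a minimal sketch of the kind of haproxy stanza being described, checking an application health endpoint instead of a bare TCP connect. The backend name, server names, port, and the `/api/healthz` path are assumptions for illustration, not the actual system-config change.)

```
# Hypothetical haproxy backend that polls Gitea's health API;
# a backend is only kept in rotation while the check returns 200.
backend balance_git_https
    option httpchk GET /api/healthz
    http-check expect status 200
    server gitea09 gitea09.example.org:3081 check check-ssl verify none
    server gitea10 gitea10.example.org:3081 check check-ssl verify none
```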
fungi | but other servers running jammy there didn't exhibit this behavior, so odds it's a version-specific issue are slim | 19:09 |
clarkb | ya. I've kept the gitea-lb02 old server around after cleaning up system-config and dns in case vexxhost wants to dig in further | 19:10 |
clarkb | but probably at the end of this week we can clean the server up if we don't make progress on that | 19:10 |
fungi | thanks! | 19:10 |
clarkb | the last thing I've got on the agenda notes for this topic is that nodepool is gone | 19:10 |
clarkb | the servers are deleted etc | 19:10 |
clarkb | corvus: I think openstack/project-config still has nodepool/ dir contents. Is there a change to clean those up? Might be good to avoid future confusion | 19:10 |
corvus | i don't think so. i'll take a look. but i'd like to keep grafana dashboards around for a while. | 19:11 |
clarkb | ++ I think keeping those around is fine. I'm more worried about people trying to fix image builds or change max-server counts | 19:12 |
corvus | yep. i'll get rid of the rest | 19:12 |
clarkb | thanks | 19:13 |
clarkb | anything else on this topic? | 19:13 |
corvus | not from me | 19:13 |
clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:15 |
clarkb | Short of testing an upgrade with our production data I've done what I think is reasonable to try and reproduce the reported offline reindexing problems with gerrit 3.11.4 | 19:15 |
clarkb | I have been unable to reproduce the problem | 19:15 |
clarkb | given that I think I'll proceed with testing the upgrade itself. Hopefully tomorrow | 19:16 |
clarkb | #link https://www.gerritcodereview.com/3.11.html | 19:16 |
clarkb | these two links are for jobs whose held nodes I'll use for testing | 19:16 |
clarkb | #link https://zuul.opendev.org/t/openstack/build/f1ca0d1f2e054829a4506ececb58bed3 | 19:16 |
clarkb | #link https://zuul.opendev.org/t/openstack/build/588723b923e94901af3065143d9df818 | 19:17 |
clarkb | the nodes ran under zuul launcher so didn't get lost in the nodepool cleanup | 19:17 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade | 19:17 |
clarkb | Unfortunately, the delays have us getting into openstack's end of release cycle activities | 19:18 |
clarkb | so planning an actual date may be painful. But we'll figure something out once I've got a better picture of the upgrade itself | 19:18 |
clarkb | any comments, concerns or feedback on this topic before we move on? | 19:19 |
fungi | i have none | 19:19 |
clarkb | #topic Upgrading old servers | 19:20 |
clarkb | The change to update matrix and irc bot container logging to journald from syslog landed on the existing server | 19:20 |
clarkb | this is a prereq to upgrading to noble and using podman as the runtime backend | 19:20 |
clarkb | seems to be working fine. I then looked briefly at how the current server is configured to get a sense of what is required to replace it. The main thing is that the logging data is on a cinder volume, which I think means we need a downtime to either move the cinder volume between hosts or sync the data from one volume to another | 19:21 |
clarkb | it's a bit tricky because for the limnoria irc bot and the matrix eavesdrop bot I think we really don't want them running concurrently on two different hosts | 19:22 |
clarkb | long story short I think we should pick a day (fridays are probably best due to when meetings occur) to stop services on the old server, land a change to configure the new server, and copy the data between them | 19:22 |
clarkb | something like boot new server with new volume, rsync data, time passes, rsync data again, stop services on old server, approve change to configure new server, rsync data, check deployment brings everything back up again | 19:23 |
clarkb | guessing at least an hour for that | 19:23 |
fungi | that would work well | 19:23 |
clarkb | cool I would volunteer to do that this Friday but I'm meeting up with folks for lunch during FOSSY so it will probably have to happen next week, or someone else can take it on | 19:24 |
fungi | i likely can | 19:24 |
clarkb | fungi: do you want to boot the new server and get things prepped for that or should I? | 19:25 |
fungi | i'll do it | 19:25 |
clarkb | perfect thanks | 19:25 |
clarkb | then for refstack you mentioned announcing it was going away then planning to proceed with that. | 19:25 |
clarkb | Any progress there? | 19:25 |
fungi | though if we're moving all services at the same time, any reason not to move the cinder volume? | 19:26 |
fungi | detach/attach instead of rsync | 19:26 |
clarkb | fungi: probably not. I always worry the cinder volumes won't detach cleanly but that is probably an overblown concern | 19:26 |
fungi | and no, haven't written up an announcement for refstack going away yet | 19:26 |
clarkb | the main reason to avoid it would be moving providers but I don't think we should do that in this case | 19:26 |
fungi | for some reason i thought we had moved logs into afs, but i hadn't looked at it recently | 19:27 |
clarkb | when you boot the new noble node don't forget to use the --config-drive flag if booting it in rax classic (its required for that image) | 19:27 |
clarkb | fungi: yes I had thought so too but we haven't | 19:27 |
fungi | yep, will do | 19:27 |
clarkb | anything else on this topic? | 19:28 |
fungi | i guess it was the meetings site hosting on static.o.o that threw me | 19:28 |
fungi | but i suppose it's proxied to eavesdrop still | 19:28 |
clarkb | ya I think the yaml2ical data is published to afs from zuul jobs? | 19:28 |
fungi | that's what it was, yep | 19:28 |
fungi | nothing else from me | 19:29 |
clarkb | #topic Vexxhost backup server inaccessible | 19:29 |
clarkb | yesterday fungi noticed the vexxhost backup server was inaccessible and backups to it were failing | 19:29 |
clarkb | grabbing the console log failed as did ping and ssh | 19:29 |
clarkb | we waited a day then corvus asked nova to reboot it today. Nothing changed except that the console log became available | 19:30 |
fungi | though nova claimed it was "active" | 19:30 |
fungi | the whole time | 19:30 |
clarkb | guilhermesp managed to take a look today and root caused it to an OVS issue | 19:30 |
clarkb | which explains why network connectivity was sad but nova saw it as active/up | 19:30 |
fungi | which would explain why a reboot didn't fix it, though i'm not sure why console logs were initially unreachable, maybe both caused by a single incident | 19:31 |
clarkb | anyway this has been corrected by guilhermesp. guilhermesp indicates that this would not have been correctable as an end user so if we see similar symptoms in the future we should file a ticket with vexxhost | 19:31 |
clarkb | manual inspection of the host seems to show things are happy again. But we should keep an eye on the infra-root inbox to double check backups aren't still erroring to it | 19:31 |
clarkb | anything else we want to call out about this incident before we move on? | 19:32 |
fungi | not i | 19:32 |
clarkb | #topic Matrix for OpenDev comms | 19:33 |
clarkb | #link https://review.opendev.org/c/opendev/infra-specs/+/954826 Spec outlining the motivation and plan for Matrix trialing | 19:33 |
clarkb | frickler reviewed the spec. I've responded but haven't pushed a new patchset yet. Was hoping for a bit more feedback before I add in some of those suggestions | 19:33 |
clarkb | so if you have 15 minutes to read what I've written that would be great | 19:33 |
clarkb | it is probably also worth noting that the matrix.org homeserver now requires users to be at least 18 years old in response to a recent UK law change... | 19:34 |
corvus | thanks, i'll take a look (intended to earlier, but things came up!) | 19:34 |
fungi | i suppose a new development there is element's announcement that no one under 18 years of age is allowed to have a matrix.org account any longer, though using another homeserver is a potential workaround for affected users | 19:34 |
clarkb | anyone using that homeserver should've gotten a message from them with the terms of service update | 19:34 |
clarkb | fungi: yup. I'm not sure that is a huge barrier for our current user base, but something to consider | 19:35 |
corvus | that also seems like something that may ultimately not be limited to matrix | 19:35 |
clarkb | corvus: yes, apparently wikimedia is challenging the law | 19:35 |
fungi | i think the internet is about to become 18-and-up | 19:35 |
clarkb | happy to update the spec with that info too if we think it is important to capture | 19:37 |
fungi | not at this stage, i don't expect | 19:37 |
fungi | something we can deal with down the road if it becomes relevant | 19:37 |
clarkb | #topic Working through our TODO list | 19:38 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:38 |
clarkb | just our weekly reminder that if anyone ends up bored (ha) they can check the list for the next thing to chew on | 19:38 |
clarkb | #topic Pre PTG Planning | 19:38 |
clarkb | similarly we can figure out what our list looks like during our Pre PTG | 19:39 |
clarkb | #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document | 19:39 |
clarkb | please add topic ideas to that etherpad with things you'd like to see covered with a bit more depth than we can do day to day or in this meeting | 19:39 |
clarkb | I think we can do a retrospective on the noble + podman switch. A few bumps but overall seems to work well (as an example) | 19:40 |
clarkb | #topic Service Coordinator Election Planning | 19:40 |
clarkb | it has been almost 6 months since we last elected our service coordinator (me) | 19:41 |
clarkb | which means it is time to make a plan for the next election | 19:41 |
clarkb | Proposal: Nomination Period open from August 5, 2025 to August 19, 2025. If necessary we will hold an election from August 20, 2025 to August 27, 2025. All date ranges and times will be in UTC. | 19:41 |
clarkb | this is my proposal. It basically mimics what we've done the last several elections. | 19:41 |
fungi | wfm | 19:41 |
clarkb | I'd like to make that official today so if there are any comments, questions, or concerns please bring them up before EOD | 19:42 |
clarkb | (I'll make it official via email to the service-discuss list) | 19:42 |
clarkb | then I'd like to say I'm happy for someone else to take on more of these organizational and liaison duties if there is interest | 19:42 |
clarkb | I'm happy to hang around in a supporting role | 19:42 |
clarkb | let me know if there is interest or if you have any questions | 19:43 |
fungi | i too would be thrilled to help out supporting a new coordinator if there is one | 19:43 |
clarkb | #topic Open Discussion | 19:44 |
clarkb | Anything else? | 19:44 |
clarkb | sounds like that may be everything | 19:46 |
fungi | thanks clarkb! | 19:46 |
clarkb | thank you everyone for your time and help running opendev | 19:46 |
clarkb | we'll be back here next week at the same time and location | 19:47 |
clarkb | #endmeeting | 19:47 |
opendevmeet | Meeting ended Tue Jul 29 19:47:09 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:47 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-29-19.00.html | 19:47 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-29-19.00.txt | 19:47 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-29-19.00.log.html | 19:47 |