Tuesday, 2025-09-23

seunghunleeHello everyone. I'm not sure if this is the right place to ask. But is there a way to find out if Zuul had a problem last Friday?09:27
mnasiadkaseunghunlee: that’s the channel for the regular weekly OpenDev meeting - it’s better to ask on #opendev10:06
seunghunleemnasiadka: thanks10:11
*** NeilHanlon_ is now known as NeilHanlon15:04
clarkbjust about meeting time18:58
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep 23 19:00:19 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YGIEBCAQV4W5TXZVDQTYOHFZQ47SRBPP/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI'm out on Thursday, just a heads up19:00
clarkbthen also I have someone from my ISP coming out to check my regular internet connectivity troubles later today so don't be surprised if my internets drop out this afternoon19:01
clarkbDid anyone else have announcements?19:01
fungii did not19:01
clarkbseems like we can dive right in19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:02
clarkbI don't have any new updates there. Unfortunately I've been distracted by the raxflex double nic issue and preparing for the summit. But considering that the openstack release is happening in a week this is probably not super urgent at the moment19:03
clarkbDid anyone else have thoughts/concerns/ideas around the gerrit 3.11 upgrade?19:03
fungii too have been distracted by other tasks, so not yet19:04
clarkb#link https://zuul.opendev.org/t/openstack/build/54f6629a3041466ca2b1cc6bf17886c419:04
clarkb#link https://zuul.opendev.org/t/openstack/build/c9051c435bf7414b986c37256f71538e19:04
clarkbthese job links should point at held nodes for anyone who wants to look at them19:04
clarkbthese were refreshed for the new gerrit container images after we rebuilt for 3.10.8 and 3.11.519:04
clarkb#topic Upgrading old servers19:05
clarkbI think the openafs and kerberos clusters are fully upgraded to noble now19:05
clarkbthank you fungi for driving that and getting it done. it took a while but slow steady progress saw it to the end19:05
clarkbanything to note about those upgrades? I guess we have to watch out for the single cpu boots when upgrading in place to noble as well as updating network configs19:06
fungiyeah, all done19:06
clarkbbut I'm not sure we'll do any more in-place upgrades to noble? Maybe for lists one day (it is jammy so not in a hurry on that one)19:07
fungii haven't cleaned up the old eth0 interface configs, but they aren't hurting anything19:07
clarkbya should be completely ignored at this point19:07
fungialso our use of openafs from a ppa complicated the upgrades in ways that won't impact the lists server upgrade19:07
clarkbwith these servers done the graphite and backup servers are next up19:08
clarkbI briefly thought about how we might handle the backup servers and I think deploying new noble nodes with new backup volumes is probably ideal. Then after backups have shifted to the new nodes we can move the old backup volumes to the new nodes to preserve that data for a bit then delete the old backup servers19:09
fungiwfm19:10
fungior use new volumes even, and just retire the old servers and volumes after a while19:10
clarkbthere are two concerns I want to look into before doing this. The first is what does having 3 or 4 backup servers in the inventory look like when it comes to the backup cron jobs on nodes (will we back up to 4 locations or can we be smarter about that?) and then also we need a newer version of borg on the server side on noble due to python versions. We'll want to test that the19:10
clarkbold borg version on older nodes can back up to the new version19:10
clarkbwe already know that new borg backing up to old borg seems to work so I expect it to work the other way around but it is worth checking19:10
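A minimal sketch of that cross-version check, assuming a throwaway test repo; the host name, user and paths below are placeholders, not our real backup hosts. Run from an older (pre-noble) client, it exercises the newer server-side borg via --remote-path:

    #!/usr/bin/env python3
    # Hypothetical compatibility check: old borg client pushing to a repo
    # served by a newer borg binary on the (noble) server side.
    import subprocess

    REMOTE_REPO = "ssh://borg-test@backup-test.example.org/opt/backups-test/compat"
    REMOTE_BORG = "/usr/bin/borg"  # newer borg on the server, run as "borg serve"

    def run(*args):
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    # Initialize a throwaway repo, push a tiny archive from the old client,
    # then list it back to confirm the two versions interoperate.
    run("borg", "init", "--encryption=none", "--remote-path", REMOTE_BORG, REMOTE_REPO)
    run("borg", "create", "--remote-path", REMOTE_BORG,
        REMOTE_REPO + "::compat-test-{now}", "/etc/hostname")
    run("borg", "list", "--remote-path", REMOTE_BORG, REMOTE_REPO)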
clarkbgraphite is probably a bit more straightforward. We'll need a new server then take a downtime to move the data and start services up again on the new server after updating dns19:11
clarkbwe might need to restart services that report to graphite if they cache the dns record19:11
clarkbI'm not sure either of these two efforts are on my immediate todo list so help is appreciated. I'm also happy to rearrange my priorities if others think these should be bumped up over say zuul launcher debugging and gerrit upgrade planning and opendev matrix work19:12
tonybif the new borg required because of python versions is a problem, we could use a borg container in the near term19:12
clarkbtonyb: that is a good point. Given it works the other way around I don't expect it to be a problem but we can fall back to that too19:12
clarkband ya as more things update to noble fewer things will be in that situation anyway19:12
clarkbok any other thoughts on server upgrades?19:12
clarkb#topic AFS mirror content cleanup19:13
clarkbonce the afs servers were updated fungi went ahead and performed the cleanups we had talked about19:14
clarkbdebian stretch and bullseye backports content has been cleared out19:14
clarkbas has the openeuler mirror content19:14
fungiyeah, i ran into one stumbling block19:14
clarkbI still wonder if we can put the reprepro cleanup steps in the regular cronjob for mirror updates19:14
clarkboh I missed that19:14
fungiour reprepro mirror script keeps separate state in the form of a list of packages to remove on the next pass19:15
fungiif those packages don't exist anywhere, then that command fails19:15
fungii'll try to remember to push up an addition to our docs that says to clear out that file19:15
fungibecause it confused the heck out of me for hours before i realized that was the problem19:15
clarkbthe cleanup steps clear out files that would be listed in that file and cleared out in the next pass?19:16
fungiyes19:16
clarkbgot it. That might also complicate automating the cleanup steps19:16
fungiand since they're not present, the clearout errors19:16
fungiit was dead simple once i realized what was happening, but basically took me realizing that the update was working and then stepping through our script to find the failing command19:17
clarkbwe can also probably start looking at mirror cleanups for puppetlabs, ceph, and docker content that is old/ancient. These are unlikely to have the same level of impact from a disk consumption perspective though19:17
clarkband then once that is done we can also clean up old python wheel cache content for distro releases we no longer have test nodes for19:17
fungiit tripped me up on the bionic arm64 cleanup too, but i somehow "fixed" the problem at some point without realizing it19:17
clarkbso still lots of cleanup possible but more of a long tail in terms of disk consumption19:17
fungioh, the follow-up change for the openeuler mirror removal still needs reviews19:18
fungi#link https://review.opendev.org/959892 Clean up OpenEuler mirroring infrastructure19:18
clarkbI'll review that after the meeting19:19
fungithanks!19:19
clarkbat this point the cleanup is probably sufficient that we can entertain adding new mirrors again19:19
clarkbI still think we need to be careful about doing so because it is easy to add content but more difficult to clear it out as this long overdue cleanup illustrates19:19
fungiyes, though we still need to be thinking about how soon to yank ubuntu-bionic19:19
fungi(amd64)19:20
clarkbyup I suspect but don't know for certain that starlingx is relying on bionic19:20
clarkbI think they plan to do a release next month so we may want to check with them first?19:20
fungior after their release, if that's what you meant19:20
clarkbbut other than that I feel like we can be full steam ahead on that cleanup19:20
clarkbfungi: ya either after their release or if they give an indication its unlikely to affect them19:21
clarkbspot checking a random recent change they run a lot of things on debian bullseye19:22
clarkbso maybe bionic won't impact them like I think it will19:22
clarkbbut yes clearing out bionic will be a good thing all around19:22
clarkbit's old, its mirror is not small, and it's an image build that we don't need anymore19:22
clarkbalso centos 9 stream's mirror grows at a rate that requires regular quota bumps19:23
clarkbsomeone like spotz might have insight into how we can manage that better19:23
clarkbany other afs mirror content insights/thoughts/cleanups we want to talk about?19:24
funginot it19:25
clarkb#topic Lists Server Slowness19:25
fungialso done, i think (other than deleting the old copy of mailman data)19:25
clarkbas noted last week we think we tracked this down to iops limits on the disk that was hosting the lists service. Since then fungi attached a new cinder volume with many more iops and migrated mailman on to it. I haven't had any issues with slow mailman since the migration19:26
fungiif nobody's aware of any problems since friday, i'll go ahead and do that19:26
clarkbI haven't seen any. I also plan to take this off of the meeting agenda for next week as things seem much happier now19:26
fungithe new cinder volume is ssd-backed too, for clarity19:26
tonybOh I wrote a tool to verify the repo metadata for an RPM-based distribution.  I could possibly extend that to generate an rsync filter to help us mirror less19:26
clarkbtonyb: that would be neat19:27
fungivery neat indeed!19:27
tonybif we're okay being a partial mirror not a complete one19:27
clarkbtonyb: yes, we already try to trim out things we don't need19:27
fungii expect we're very fine with that19:27
clarkbit's just not feasible for us to maintain complete mirrors for each of the distros we're working with19:27
tonybOkay I'm on it19:27
fungiwe also have incomplete mirrors of deb-oriented repos too19:27
clarkbif you notice any slowness from lists/mailman please let us know. Otherwise I think we're considering this fixed19:27
fungilike not mirroring all architectures and suites19:28
tonybOkay.19:28
fungibasically we mirror content if a large subset of our ci jobs will use it. anything jobs aren't using doesn't need to be mirrored19:28
tonybThat's what I thought19:29
fungiwe don't intend, and don't even want, for these to be treated as general-purpose mirrors19:29
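Going back to tonyb's mirror-filter idea above, a rough sketch of what generating such a filter could look like, assuming we only keep the files a repo's repomd.xml actually references (the repo path is a placeholder, and a real tool would also walk the package list in primary.xml):

    #!/usr/bin/env python3
    # Rough sketch, not an existing opendev tool: read a yum/dnf repo's
    # repomd.xml and emit an rsync filter that includes only the metadata
    # files it references, excluding everything else.
    import xml.etree.ElementTree as ET

    NS = {"repo": "http://linux.duke.edu/metadata/repo"}

    def wanted_paths(repomd_path):
        """Yield the data files (primary, filelists, ...) listed in repomd.xml."""
        root = ET.parse(repomd_path).getroot()
        yield "repodata/repomd.xml"
        for data in root.findall("repo:data", NS):
            location = data.find("repo:location", NS)
            if location is not None:
                yield location.get("href")

    def write_rsync_filter(paths, out="mirror.filter"):
        """Write rules usable with rsync --filter='merge mirror.filter'."""
        paths = sorted(set(paths))
        dirs = set()
        for p in paths:
            parts = p.split("/")[:-1]
            for i in range(1, len(parts) + 1):
                dirs.add("/".join(parts[:i]) + "/")
        with open(out, "w") as f:
            for d in sorted(dirs):        # parent dirs must be included too
                f.write("+ /%s\n" % d)
            for p in paths:
                f.write("+ /%s\n" % p)
            f.write("- *\n")              # drop anything not listed above

    if __name__ == "__main__":
        # placeholder path into a local copy of a repo's metadata
        write_rsync_filter(wanted_paths("repodata/repomd.xml"))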
clarkb#topic Deleting ze1119:30
clarkba while back we disabled ze11 in the zuul cluster because it was not able to clone nova within the 10 minute timeout19:30
clarkbother nodes take about 3-4 minutes. ze11 needs about 13 or so19:31
clarkbrather than try and debug that further at this point I think we should probably just delete it. We don't need that many executors with our workload anymore19:31
clarkbthen if we want to we can also delete ze12 to avoid confusion about the gap19:31
clarkbmostly want to bring this up to see if anyone has concerns (maybe we do need that many executors?) and feedback on whether ze12 should meet the same fate if ze11 goes away19:32
tonybFine by me, I don't really care about the gap19:32
corvushttps://review.opendev.org/961530 is related19:32
tonybIf corvus agrees then I think that's a fair plan19:33
corvusthat's intended to clean up the gauges (would take effect now, it's orthogonal to deletion)19:33
corvusyeah, i think deleting it is fine19:33
clarkbcool. I can put that on my todo list as I'm not hearing any objections19:34
fungidoes that clear out the data, so if we take a server offline for a day we lose all its history?19:34
clarkbthe commit message does say "delete"19:34
corvusi don't think so; it should just stop sending bogus values19:34
corvusthey're deleted from the statsd server, not from graphite19:35
clarkbah19:35
fungiokay, that sounds a little better19:35
fungiso if we bring a server back online a few days later, there's just a reporting gap from where it was not in the statsd server19:35
corvusyep19:36
fungithanks@19:36
fungis/@/!/19:36
corvusthe trick is getting the interval right19:36
corvushow long is long enough to say it's dead versus just idle19:36
corvus24h is my starting guess :)19:36
fungiwfm19:37
clarkbseems reasonable19:37
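For context, a generic illustration of the expiry idea corvus describes (not Zuul's actual implementation, which is the change linked above): remember when each gauge was last updated and drop any that have been idle past a cutoff such as 24 hours:

    import time

    GAUGE_TTL = 24 * 60 * 60  # seconds of silence before a gauge is considered dead

    class ExpiringGauges:
        """Toy gauge store that forgets gauges nothing has updated recently."""

        def __init__(self, ttl=GAUGE_TTL):
            self.ttl = ttl
            self.values = {}     # gauge name -> last value
            self.last_seen = {}  # gauge name -> monotonic time of last update

        def update(self, name, value):
            self.values[name] = value
            self.last_seen[name] = time.monotonic()

        def emit(self):
            """Return only the live gauges; expired ones stop being reported."""
            now = time.monotonic()
            for name in list(self.values):
                if now - self.last_seen[name] > self.ttl:
                    # e.g. a deleted executor: graphite then shows a gap
                    # instead of a stale flat line.
                    del self.values[name]
                    del self.last_seen[name]
            return dict(self.values)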
clarkb#topic Zuul Launcher Updates19:37
clarkbthen also related to zuul I wanted to recap some of the launcher stuff that has happened recently because there was some confusion over what had been done19:38
clarkbAfter zuul restarts ~10 days ago the launchers stopped being able to boot nodes in rax-flex. They were getting multiple networks found errors. I discovered that about a week ago and added explicit network config to clouds.yaml for rax flex not realizing it was already in the launcher config19:38
clarkbafter restarting on the clouds.yaml update instances started getting multiple nics breaking their routing tables19:39
clarkbwe then dropped the zuul launcher config and things worked as expected with one nic19:39
clarkbthen, thinking the combo of clouds.yaml and launcher config was why we got two interfaces, we flipped things around to having no network config in clouds.yaml and used the launcher config to configure the network. This put us back into the multiple nic broken situation19:40
clarkbso now we're back to only defining networks in the clouds.yaml for that cloud region19:40
clarkbI plan to write a script to try and reproduce this outside of the launcher code so that we can track it down. One idea I have is that it could be related to using floating ips. This is our only fip cloud and maybe attaching an fip is adding another interface for some reason, but I have no evidence for this19:40
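A sketch of what such a standalone reproducer might look like with openstacksdk; the cloud, image, flavor and network names here are placeholders rather than our real config:

    #!/usr/bin/env python3
    # Boot a test server in the suspect region, let the SDK attach a
    # floating ip, then dump the server's interfaces to see whether a
    # second nic shows up outside of the launcher code path.
    import openstack

    conn = openstack.connect(cloud="raxflex-test")  # placeholder cloud name

    server = conn.create_server(
        name="multinic-repro-test",
        image="ubuntu-noble",                # placeholder image
        flavor="gp.0.2.4",                   # placeholder flavor
        network="opendevzuul-network",       # drop this to test the implicit-network case
        auto_ip=True,                        # allocate and attach a floating ip
        wait=True,
    )

    # More than one interface listed here would reproduce the broken
    # multi-nic routing situation seen on the held instances.
    for iface in conn.compute.server_interfaces(server):
        print(iface.port_id, iface.net_id,
              [fip["ip_address"] for fip in iface.fixed_ips])

    conn.delete_server(server.id, wait=True, delete_ips=True)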
clarkbthen separately we disabled rax dfw (classic) because its api started having problems like we had previously seen in iad and ord (classic)19:41
clarkbI sent email to rackers asking about this and also gave them a list of old error'd nodepool nodes in ord to clean up. I don't think they are real instances but they exist in the nova db so I think they count against our quota so getting them cleaned up would be nice19:41
clarkboh also I deleted all other nodepool booted instances in our cloud regions. The only remainders should be those in ord that are stuck in an error state that I cannot delete19:42
clarkbit occurs to me that we might want to check for nodepool era images that need cleanup now too19:42
clarkbI haven't done that19:42
clarkblast week there was also a bug in the launcher that prevented cleanup of leaked nodes19:43
clarkbthat should be fixed now19:43
clarkbanything else related to the launchers?19:43
corvuswhat's the next step for rax classic?19:43
clarkbcorvus: in terms of reenabling rax-dfw or ?19:43
fungiother than hoping we hear back from them19:43
corvusyeah19:44
clarkbI guess if we don't hear back this week we could try turning it back on and see if the issues were resolved and we just weren't notified?19:44
clarkband if still unhappy write another email?19:44
clarkbjames denton was very responsive when he was in irc but isn't there right now unfortunately19:45
corvusthat sounds like a plan.19:45
fungii'm good with that19:46
clarkb#topic Matrix for OpenDev Comms19:47
clarkbwe're running out of time so I want to keep things moving19:47
clarkbI have not started on this yet (see notes about distractions that kept me from gerrit things earlier in the meeting)19:47
clarkbbut it is on my todo list to start bootstrapping things from the spec19:47
corvuslmk if you want to throw a work item my way19:47
clarkbwill do19:47
clarkbElement just updated and looks slightly different fwiw. Not sure I'm a huge fan but it's not terrible. Also personal spaces are becoming helpful as I add more matrix rooms over time19:48
clarkb#topic Pre PTG Planning19:48
clarkb#link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document19:48
clarkbTimes: Tuesday October 7 1800-2000 UTC, Wednesday October 8 1500-1700 UTC, Thursday October 9 1500-170019:48
clarkbthis is ~2 weeks away19:48
clarkbwe'll have our normal meeting here next week then the week after our meeting will be the pre ptg19:49
clarkbI did want to note that TheJulia added a topic about exploring AI/LLM driven code review processes/tooling19:49
clarkbI suspect that can be driven entirely via zuul jobs by communities interested in doing so. I also suggested that there may be opportunity to collaborate with other zuul users in zuul-jobs to build jobs/roles that enable the functionality19:50
tonybFWIW: I don't think I will be in a US timezone for the pre-ptg but I'll adjust my sleep schedule to maximise time overlap that week19:50
clarkbbut I wanted to call that out as its not something we're already dealing with day to day so it is a topic some of us may wish to read up on before hand19:50
clarkbtonyb: thank you for the heads up19:50
clarkbtonyb: will you be +11ish ? (I think that is at least close to your typical timezone)19:51
clarkbin any case feel free to add your ideas for the pre ptg to the etherpad and take a moment to read the current list to ensure that we're ready to dive in in a couple of weeks19:52
clarkb#topic Etherpad 2.5.0 Upgrade19:52
fungion the previous topic, i'm happy to find apac-friendly times to hang out as well19:53
clarkbya we can probably adjust as necessary too19:53
tonybclarkb: I think that's the correct UTC offset19:53
clarkbfungi: for etherpad did you check the test node?19:53
clarkb104.130.127.119 is a held node for testing.19:53
clarkbmostly wondering if anyone else has checked it to either confirm or deny what I'm seeing in my browser in terms of layout behavior still being meh for the front page but ok on the etherpads19:54
tonybDon't adjust for me.  1500-1700 UTC is a little rough but it's only a couple of days, and I'm kinda optional anyway19:55
clarkbok we've only got a few minutes left. If someone can check the held etherpad node and let me know if the layout looks terrible for them too then I can work on a followup issue update for upstream19:56
clarkb#topic Open Discussion19:56
clarkbIf you are attending the summit I think Friday night is the one night I'll have free to get together so if you'd keep that night free too we can do an informal dinner get together or something along those lines19:56
clarkbanything else?19:57
clarkbwanted to make sure we didn't run out of time before opening the floor19:57
tonybclarkb: Friday 17th? (clarifying)19:59
clarkbtonyb: yes19:59
tonybnoted19:59
clarkbI fly in thursday night, friday is the first day of the event, then saturday and sunday I've got things already planned for me and then I'm out mondayish (I'm actually spending the day in paris monday and flying out tuesday morning)19:59
clarkband we are at time20:00
clarkbthank you everyone20:00
tonybnoted.20:00
clarkbwe'll be back here same time and location next week20:00
tonybThanks all20:00
clarkbsee you there20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Sep 23 20:00:30 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.log.html20:00
