| seunghunlee | Hello everyone. I'm not sure if this is the right place to ask, but is there a way to find out whether Zuul had a problem last Friday? | 09:27 |
|---|---|---|
| mnasiadka | seunghunlee: that’s the channel for the regular weekly OpenDev meeting - it’s better to ask on #opendev | 10:06 |
| seunghunlee | mnasiadka: thanks | 10:11 |
| *** | NeilHanlon_ is now known as NeilHanlon | 15:04 |
| clarkb | just about meeting time | 18:58 |
| clarkb | #startmeeting infra | 19:00 |
| opendevmeet | Meeting started Tue Sep 23 19:00:19 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
| opendevmeet | The meeting name has been set to 'infra' | 19:00 |
| clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YGIEBCAQV4W5TXZVDQTYOHFZQ47SRBPP/ Our Agenda | 19:00 |
| clarkb | #topic Announcements | 19:00 |
| clarkb | I'm out on Thursday, just a heads up | 19:00 |
| clarkb | then also I have someone from my ISP coming out to check my regular internet connectivity troubles later today so don't be surprised if my internets drop out this afternoon | 19:01 |
| clarkb | Did anyone else have announcements? | 19:01 |
| fungi | i did not | 19:01 |
| clarkb | seems like we can dive right in | 19:02 |
| clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:02 |
| clarkb | I don't have any new updates there. Unfortunately I've been distracted by the raxflex double nic issue and preparing for the summit. But considering that the openstack release is happening in a week this is probably not super urgent at the moment | 19:03 |
| clarkb | Did anyone else have thoughts/concerns/ideas around the gerrit 3.11 upgrade? | 19:03 |
| fungi | i too have been distracted by other tasks, so not yet | 19:04 |
| clarkb | #link https://zuul.opendev.org/t/openstack/build/54f6629a3041466ca2b1cc6bf17886c4 | 19:04 |
| clarkb | #link https://zuul.opendev.org/t/openstack/build/c9051c435bf7414b986c37256f71538e | 19:04 |
| clarkb | these job links should point at held nodes for anyone who wants to look at them | 19:04 |
| clarkb | these were refreshed for the new gerrit container images after we rebuilt for 3.10.8 and 3.11.5 | 19:04 |
| clarkb | #topic Upgrading old servers | 19:05 |
| clarkb | I think the openafs and kerberos clusters are fully upgraded to noble now | 19:05 |
| clarkb | thank you fungi for driving that and getting it done. it took a while but slow steady progress saw it to the end | 19:05 |
| clarkb | anything to note about those upgrades? I guess we have to watch out for the single cpu boots when upgrading in place to noble as well as updating network configs | 19:06 |
| fungi | yeah, all done | 19:06 |
| clarkb | but I'm not sure we'll do any more in-place upgrades to noble? Maybe for lists one day (it is jammy so not in a hurry on that one) | 19:07 |
| fungi | i haven't cleaned up the old eth0 interface configs, but they aren't hurting anything | 19:07 |
| clarkb | ya should be completely ignored at this point | 19:07 |
| fungi | also our use of openafs from a ppa complicated the upgrades in ways that won't impact the lists server upgrade | 19:07 |
| clarkb | with these servers done the graphite and backup servers are next up | 19:08 |
| clarkb | I briefly thought about how we might handle the backup servers and I think deploying new noble nodes with new backup volumes is probably ideal. Then after backups have shifted to the new nodes we can move the old backup volumes to the new nodes to preserve that data for a bit then delete the old backup servers | 19:09 |
| fungi | wfm | 19:10 |
| fungi | or use new volumes even, and just retire the old servers and volumes after a while | 19:10 |
| clarkb | there are two concerns I want to look into before doing this. The first is what does having 3 or 4 backup servers in the inventory look like when it comes to the backup cron jobs on nodes (will we back up to 4 locations or can we be smarter about that?) and then also we need a newer version of borg on the server side on noble due to python versions. We'll want to test that the old borg version on older nodes can back up to the new version | 19:10 |
| clarkb | we already know that new borg backing up to old borg seems to work so I expect it to work the other way around but it is worth checking | 19:10 |
| clarkb | graphite is probably a bit more straightforward. We'll need a new server then take a downtime to move the data and start services up again on the new server after updating dns | 19:11 |
| clarkb | we might need to restart services that report to graphite if they cache the dns record | 19:11 |
| clarkb | I'm not sure either of these two efforts are on my immediate todo list so help is appreciated. I'm also happy to rearrange my priorities if others think these should be bumped up over say zuul launcher debugging and gerrit upgrade planning and opendev matrix work | 19:12 |
| tonyb | if the new borg needed because of python versions is a problem, we could use a borg container in the near term | 19:12 |
| clarkb | tonyb: that is a good point. Given it works the other way around I don't expect it to be a problem but we can fall back to that too | 19:12 |
| clarkb | and ya as more things update to noble fewer things will be in that situation anyway | 19:12 |
| clarkb | ok any other thoughts on server upgrades? | 19:12 |
| clarkb | #topic AFS mirror content cleanup | 19:13 |
| clarkb | once the afs servers were updated fungi went ahead and performed the cleanups we had talked about | 19:14 |
| clarkb | debian stretch and bullseye backports content has been cleared out | 19:14 |
| clarkb | as has the openeuler mirror content | 19:14 |
| fungi | yeah, i ran into one stumbling block | 19:14 |
| clarkb | I still wonder if we can put the reprepro cleanup steps in the regular cronjob for mirror updates | 19:14 |
| clarkb | oh I missed that | 19:14 |
| fungi | our reprepro mirror script keeps separate state in the form of a list of packages to remove on the next pass | 19:15 |
| fungi | if those packages don't exist anywhere, then that command fails | 19:15 |
| fungi | i'll try to remember to push up an addition to our docs that says to clear out that file | 19:15 |
| fungi | because it confused the heck out of me for hours before i realized that was the problem | 19:15 |
| clarkb | the cleanup steps clear out files that would be listed in that file and cleared out in the next pass? | 19:16 |
| fungi | yes | 19:16 |
| clarkb | got it. That might also complicate automating the cleanup steps | 19:16 |
| fungi | and since they're not present, the clearout errors | 19:16 |
| fungi | it was dead simple once i realized what was happening, but basically took me realizing that the update was working and then stepping through our script to find the failing command | 19:17 |
| clarkb | we can also probably start looking at mirror cleanups for puppetlabs, ceph, and docker content that is old/ancient. These are unlikely to have the same level of impact from a disk consumption perspective though | 19:17 |
| clarkb | and then once that is done we can also clean up old python wheel cache content for distro releases we no longer have test nodes for | 19:17 |
| fungi | it tripped me up on the bionic arm64 cleanup too, but i somehow "fixed" the problem at some point without realizing it | 19:17 |
| clarkb | so still lots of cleanup possible but more of a long tail in terms of disk consumption | 19:17 |
| fungi | oh, the follow-up change for the openeuler mirror removal still needs reviews | 19:18 |
| fungi | #link https://review.opendev.org/959892 Clean up OpenEuler mirroring infrastructure | 19:18 |
| clarkb | I'll review that after the meeting | 19:19 |
| fungi | thanks! | 19:19 |
| clarkb | at this point the cleanup is probably sufficient that we can entertain adding new mirrors again | 19:19 |
| clarkb | I still think we need to be careful about doing so because it is easy to add content but more difficult to clear it out as this long overdue cleanup illustrates | 19:19 |
| fungi | yes, though we still need to be thinking about how soon to yank ubuntu-bionic | 19:19 |
| fungi | (amd64) | 19:20 |
| clarkb | yup I suspect but don't know for certain that starlingx is relying on bionic | 19:20 |
| clarkb | I think they plan to do a release next month so we may want to check with them first? | 19:20 |
| fungi | or after their release, if that's what you meant | 19:20 |
| clarkb | but other than that I feel like we can be full steam ahead on that cleanup | 19:20 |
| clarkb | fungi: ya either after their release or if they give an indication its unlikely to affect them | 19:21 |
| clarkb | spot checking a random recent change, they run a lot of things on debian bullseye | 19:22 |
| clarkb | so maybe bionic won't impact them like I think it will | 19:22 |
| clarkb | but yes clearing out bionic will be a good thing all around | 19:22 |
| clarkb | it's old, its mirror is not small, and it's an image build that we don't need anymore | 19:22 |
| clarkb | also centos 9 stream's mirror grows at a rate that requires regular quota bumps | 19:23 |
| clarkb | someone like spotz might have insight into how we can manage that better | 19:23 |
| clarkb | any other afs mirror content insights/thoughts/cleanups we want to talk about? | 19:24 |
| fungi | not it | 19:25 |
| clarkb | #topic Lists Server Slowness | 19:25 |
| fungi | also done, i think (other than deleting the old copy of mailman data) | 19:25 |
| clarkb | as noted last week we think we tracked this down to iops limits on the disk that was hosting the lists service. Since then fungi attached a new cinder volume with many more iops and migrated mailman on to it. I haven't had any issues with slow mailman since the migration | 19:26 |
| fungi | if nobody's aware of any problems since friday, i'll go ahead and do that | 19:26 |
| clarkb | I haven't seen any. I also plan to take this off of the meeting agenda for next week as things seem much happier now | 19:26 |
| fungi | the new cinder volume is ssd-backed too, for clarity | 19:26 |
| tonyb | Oh I wrote a tool to verify the repo meta-data for an RPM based distribution. I could possibly extend that to generate an rsync filter to help us mirror less | 19:26 |
| clarkb | tonyb: that would be neat | 19:27 |
| fungi | very neat indeed! | 19:27 |
| tonyb | if we're okay being a partial mirror, not a complete one | 19:27 |
| clarkb | tonyb: yes, we already try to trim out things we don't need | 19:27 |
| fungi | i expect we're very fine with that | 19:27 |
| clarkb | it's just not feasible for us to maintain complete mirrors for each of the distros we're working with | 19:27 |
| tonyb | Okay I'm on it | 19:27 |
| fungi | we also have incomplete mirrors of deb-oriented repos | 19:27 |
| clarkb | if you notice any slowness from lists/mailman please let us know. Otherwise I think we're considering this fixed | 19:27 |
| fungi | like not mirroring all architectures and suites | 19:28 |
| tonyb | Okay. | 19:28 |
| fungi | basically we mirror content if a large subset of our ci jobs will use it. anything jobs aren't using doesn't need to be mirrored | 19:28 |
| tonyb | That's what I thought | 19:29 |
| fungi | we don't intend, and don't even want, for these to be treated as general-purpose mirrors | 19:29 |
| clarkb | #topic Deleting ze11 | 19:30 |
| clarkb | a while back we disabled ze11 in the zuul cluster because it was not able to clone nova within the 10 minute timeout | 19:30 |
| clarkb | other nodes take about 3-4 minutes. ze11 needs about 13 or so | 19:31 |
| clarkb | rather than try and debug that further at this point I think we should probably just delete it. We don't need that many executors with our workload anymore | 19:31 |
| clarkb | then if we want to we can also delete ze12 to avoid confusion about the gap | 19:31 |
| clarkb | mostly want to bring this up to see if anyone has concerns (maybe we do need that many executors?) and feedback on whether ze12 should meet the same fate if ze11 goes away | 19:32 |
| tonyb | Fine by me, I don't really care about the gap | 19:32 |
| corvus | https://review.opendev.org/961530 is related | 19:32 |
| tonyb | If corvus agrees then I think that's a fair plan | 19:33 |
| corvus | that's intended to clean up the gauges (would take effect now, it's orthogonal to deletion) | 19:33 |
| corvus | yeah, i think deleting it is fine | 19:33 |
| clarkb | cool. I can put that on my todo list as I'm not hearing any objections | 19:34 |
| fungi | does that clear out the data, so if we take a server offline for a day we lose all its history? | 19:34 |
| clarkb | the commit message does say "delete" | 19:34 |
| corvus | i don't think so; it should just stop sending bogus values | 19:34 |
| corvus | they're deleted from the statsd server, not from graphite | 19:35 |
| clarkb | ah | 19:35 |
| fungi | okay, that sounds a little better | 19:35 |
| fungi | so if we bring a server back online a few days later, there's just a reporting gap from where it was not in the statsd server | 19:35 |
| corvus | yep | 19:36 |
| fungi | thanks@ | 19:36 |
| fungi | s/@/!/ | 19:36 |
| corvus | the trick is getting the interval right | 19:36 |
| corvus | how long is long enough to say it's dead versus just idle | 19:36 |
| corvus | 24h is my starting guess :) | 19:36 |
| fungi | wfm | 19:37 |
| clarkb | seems reasonable | 19:37 |
| clarkb | #topic Zuul Launcher Updates | 19:37 |
| clarkb | then also related to zuul I wanted to recap some of the launcher stuff that has happened recently because there was some confusion over what had been done | 19:38 |
| clarkb | After zuul restarts ~10 days ago the launchers stopped being able to boot nodes in rax-flex. They were getting multiple networks found errors. I discovered that about a week ago and added explicit network config to clouds.yaml for rax flex not realizing it was already in the launcher config | 19:38 |
| clarkb | after restarting on the clouds.yaml update instances started getting multiple nics breaking their routing tables | 19:39 |
| clarkb | we then dropped the zuul launcher config and things worked as expected with one nic | 19:39 |
| clarkb | then, thinking the combo of clouds.yaml and launcher config was why we got two interfaces, we flipped things around to having no network config in clouds.yaml and used launcher config to configure the network. This put us back into the multiple nic broken situation | 19:40 |
| clarkb | so now we're back to only defining networks in the clouds.yaml for that cloud region | 19:40 |
| clarkb | I plan to write a script to try and reproduce this outside of the launcher code so that we can track it down. One idea i have is that it could be related to using floating ips. This is our only fip cloud and maybe attaching an fip is adding another interface for some reason but I have no evidence for this | 19:40 |
| clarkb | then separately we disabled rax dfw (classic) because its api started having problems like we had previously seen in iad and ord (classic) | 19:41 |
| clarkb | I sent email to rackers asking about this and also gave them a list of old error'd nodepool nodes in ord to clean up. I don't think they are real instances but they exist in the nova db so I think they count against our quota so getting them cleaned up would be nice | 19:41 |
| clarkb | oh also I deleted all other nodepool booted instances in our cloud regions. The only remainders should be those in ord that are stuck in an error state that I cannot delete | 19:42 |
| clarkb | it occurs to me that we might want to check for nodepool era images that need cleanup now too | 19:42 |
| clarkb | I haven't done that | 19:42 |
| clarkb | last week there was also a bug in the launcher that prevented cleanup of leaked nodes | 19:43 |
| clarkb | that should be fixed now | 19:43 |
| clarkb | anything else related to the launchers? | 19:43 |
| corvus | what's the next step for rax classic? | 19:43 |
| clarkb | corvus: in terms of reenabling rax-dfw or ? | 19:43 |
| fungi | other than hoping we hear back from them | 19:43 |
| corvus | yeah | 19:44 |
| clarkb | I guess if we don't hear back this week we could try turning it back on and see if the issues were resolved and we just weren't notified? | 19:44 |
| clarkb | and if still unhappy write another email? | 19:44 |
| clarkb | james denton was very responsive when he was in irc but isn't there right now unfortunately | 19:45 |
| corvus | that sounds like a plan. | 19:45 |
| fungi | i'm good with that | 19:46 |
| clarkb | #topic Matrix for OpenDev Comms | 19:47 |
| clarkb | we're running out of time so I want to keep things moving | 19:47 |
| clarkb | I have not started on this yet (see notes about distractions that kept me from gerrit things earlier in the meeting) | 19:47 |
| clarkb | but it is on my todo list to start bootstrapping things from the spec | 19:47 |
| corvus | lmk if you want to throw a work item my way | 19:47 |
| clarkb | will do | 19:47 |
| clarkb | Element just updated and looks slightly different fwiw. Not sure I'm a huge fan but it's not terrible. Also personal spaces are becoming helpful as I add more matrix rooms over time | 19:48 |
| clarkb | #topic Pre PTG Planning | 19:48 |
| clarkb | #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document | 19:48 |
| clarkb | Times: Tuesday October 7 1800-2000 UTC, Wednesday October 8 1500-1700 UTC, Thursday October 9 1500-1700 UTC | 19:48 |
| clarkb | this is ~2 weeks away | 19:48 |
| clarkb | we'll have our normal meeting here next week then the week after our meeting will be the pre ptg | 19:49 |
| clarkb | I did want to note that TheJulia added a topic about exploring AI/LLM driven code review processes/tooling | 19:49 |
| clarkb | I suspect that that can be driven entirely via zuul jobs by communities interested in doing so. I also suggested that there may be opportunity to collaborate with other zuul users in zuul-jobs to build jobs/roles that enable the functionality | 19:50 |
| tonyb | FWIW: I don't think I will be in a US timezone for the pre-ptg but I'll adjust my sleep schedule to maximise time overlap that week | 19:50 |
| clarkb | but I wanted to call that out as it's not something we're already dealing with day to day so it is a topic some of us may wish to read up on beforehand | 19:50 |
| clarkb | tonyb: thank you for the heads up | 19:50 |
| clarkb | tonyb: will you be +11ish ? (I think that is at least close to your typical timezone) | 19:51 |
| clarkb | in any case feel free to add your ideas for the pre ptg to the etherpad and take a moment to read the current list to ensure that we're ready to dive in in a couple of weeks | 19:52 |
| clarkb | #topic Etherpad 2.5.0 Upgrade | 19:52 |
| fungi | on the previous topic, i'm happy to find apac-friendly times to hang out as well | 19:53 |
| clarkb | ya we can probably adjust as necessary too | 19:53 |
| tonyb | clarkb: I think that's the correct UTC offset | 19:53 |
| clarkb | fungi: for etherpad did you check the test node? | 19:53 |
| clarkb | 104.130.127.119 is a held node for testing. | 19:53 |
| clarkb | mostly wondering if anyone else has checked it to either confirm or deny what I'm seeing in my browser in terms of layout behavior still being meh for the front page but ok on the etherpads | 19:54 |
| tonyb | Don't adjust for me. 1500-1700 UTC is a little rough but it's only a couple of days, and I'm kinda optional anyway | 19:55 |
| clarkb | ok we've only got a few minutes left. If someone can check the held etherpad node and let me know if the layout looks terrible for them too then I can work on a followup issue update for upstream | 19:56 |
| clarkb | #topic Open Discussion | 19:56 |
| clarkb | If you are attending the summit, Friday night is the one night I'll have free to get together, so if you'd keep that night free too we can do an informal dinner get together or something along those lines | 19:56 |
| clarkb | anything else? | 19:57 |
| clarkb | wanted to make sure we didn't run out of time before opening the floor | 19:57 |
| tonyb | clarkb: Friday 17th? (clarifying) | 19:59 |
| clarkb | tonyb: yes | 19:59 |
| tonyb | noted | 19:59 |
| clarkb | I fly in thursday night, friday is first day of the event, then saturday and sunday I've got things already planned for me and then I'm out mondayish (I'm actually spending the day in paris monday and flying out tuesday morning) | 19:59 |
| clarkb | and we are at time | 20:00 |
| clarkb | thank you everyone | 20:00 |
| tonyb | noted. | 20:00 |
| clarkb | we'll be back here same time and location next week | 20:00 |
| tonyb | Thanks all | 20:00 |
| clarkb | see you there | 20:00 |
| clarkb | #endmeeting | 20:00 |
| opendevmeet | Meeting ended Tue Sep 23 20:00:30 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.html | 20:00 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.txt | 20:00 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-09-23-19.00.log.html | 20:00 |