19:00:14 #startmeeting infra
19:00:14 Meeting started Tue Sep 16 19:00:14 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:14 The meeting name has been set to 'infra'
19:00:20 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/3XPEGJQNZUENYS54A2BRGINSG2EU7X6I/ Our Agenda
19:00:23 #topic Announcements
19:00:43 We're nearing the openstack release. I think starlingx may be working on a release too. Just keep that in mind as we're making changes
19:01:03 Also I will be out on the 25th (that is Thursday next week)
19:01:19 that was all I wanted to mention here. Did anyone else have anything to announce?
19:01:41 i'm taking off most/all of the day this thursday
19:02:15 (but am around all next week)
19:02:45 thanks for the heads up
19:03:25 #topic Moving OpenDev's python-base/python-builder/uwsgi-base Images to Quay
19:03:36 #link https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/HO6Z66QIMDIDY7CCVAREDOPSYZYNKIT3/
19:03:45 The images were moved and the move was announced in the link above
19:04:18 since then my changes to update opendev's consumption of those images have all merged
19:04:36 I believe that means this task is basically complete. I even dropped the mirroring of these images from docker hub to quay since quay is the canonical location now
19:05:19 The one thing I haven't pushed a change for, and will do so immediately after the meeting, is flipping our default container command for the generic zuul-jobs container roles back to docker from podman. The changes that have landed all selectively opted into that already, so this is a noop that will simply ensure new images do the right thing
19:05:43 thank you everyone for all the help with this. It has been a big effort over a couple of years at this point to be able to do this
19:06:04 I should note that some images that are not part of the python base image inheritance path are still published to docker hub (gitea, mailman, etc)
19:06:25 but the servers they run on are not on noble yet, which means not on podman yet, so they aren't able to run speculative image builds at runtime
19:06:49 but we've got a good chunk moved and I think we're proving out that this generally works, which is nice
19:07:08 any questions/concerns/comments around this change? it's a fairly big one but also largely behind the scenes as long as everything is working properly
19:08:19 oh as a note we also added trixie base images for python3.12 and 3.11
19:08:33 so eventually we'll want to start migrating things onto that platform and look at adding 3.13 images too
19:08:51 gerrit will be a bit trickier as there are jvm implications with that. Which we can talk about now
19:08:56 #topic Gerrit 3.11 Upgrade Planning
19:09:28 One of the things I was waiting on with Gerrit was getting the container images updated to 3.10.8 and 3.11.5 (the latest bugfix releases). We did that yesterday and also updated the base image location to quay
19:09:45 I then set up two new node holds on jobs using those newer images
19:09:53 #link https://zuul.opendev.org/t/openstack/build/54f6629a3041466ca2b1cc6bf17886c4 3.10.8 held node
19:10:02 #link https://zuul.opendev.org/t/openstack/build/c9051c435bf7414b986c37256f71538e 3.11.5 held node
19:10:40 due to the proximity of the openstack release I don't think we'll upgrade in the next couple of weeks, however I think we can make a good push to get everything ready to upgrade sometime next month (maybe after the summit, that tends to be a quieter time for us)
19:11:08 then once we are upgraded to 3.11.5 we can switch to trixie, as I believe 3.11.5 is the first java 21 compatible release
19:11:35 The other Gerrit thing we learned yesterday was that when our h2 cache files get very large gerrit shutdown is very slow and hits the podman/docker shutdown timeout
19:11:42 that timeout is currently set to 300 seconds
19:12:15 I think that we should consider more regular gerrit restarts to try and avoid this problem. We could also "embrace" it and set the timeout to a much shorter value, say 60 seconds
19:12:42 and basically expect that we're going to forcefully stop gerrit while it is trying to prune its h2 db caches, which we will delete as soon as it's shut down
19:12:58 I'm interested in feedback on ^ but I don't think either item is urgent right now
19:14:09 #topic Upgrading old servers
19:14:45 fungi has continued to make great progress with the openafs and kerberos clusters. Everything is upgraded to noble except for afs02.dfw.openstack.org. Waiting on one more mirror volume to move its RW copy to afs01.dfw before proceeding with afs02
19:14:56 yeah, we're a little over 30 hours into the mirror.ubuntu rw volume move from afs02.dfw to afs01.dfw, but once that completes i'll be able to upgrade afs02.dfw from jammy to noble
19:15:08 at least one afs server and one kdc have been removed from the emergency file list too, so ansible seems happy with the results as well
19:15:36 fungi: do you think we should remove all of the nodes from the emergency file except for afs02 at this point, or should we wait for afs02 to be done and do the whole lot at once?
19:15:56 doesn't matter, i can do it now sure
19:16:08 mostly I don't want us to forget
19:16:24 done now, only afs02.dfw is still disabled at this point
19:16:57 once these servers are done the next on the list are graphite and the backup servers. Then we can start looking at getting an early jump on jammy server upgrades
19:17:12 (as well as continued effort to uplift the truly ancient servers)
19:17:19 but ya this is great progress, thank you!
19:17:27 Any other questions/concerns/comments around server upgrades?
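As a point of reference for the Gerrit shutdown timeout discussed above: the 300 seconds is the grace period the container runtime allows between SIGTERM and SIGKILL when stopping the Gerrit container. A minimal sketch of what a 60 second grace period looks like from the CLI; the container name "gerrit" is a placeholder, and in a compose-managed deployment the value would more likely live in a stop_grace_period setting, so treat this as illustrative only:

    # give gerrit 60 seconds to exit cleanly, then send SIGKILL
    docker stop -t 60 gerrit
    # or, with podman
    podman stop -t 60 gerrit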
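For context on the volume move fungi described, RW volume moves in OpenAFS are driven by the vos tool. A rough sketch, assuming the default vicepa partition on both fileservers and -localauth run as root on a machine holding the cell key; the partition names are a guess, not taken from the log:

    # move the RW copy of mirror.ubuntu from afs02.dfw to afs01.dfw
    vos move -id mirror.ubuntu \
        -fromserver afs02.dfw.openstack.org -frompartition vicepa \
        -toserver afs01.dfw.openstack.org -topartition vicepa \
        -localauth -verbose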
19:18:41 #topic AFS mirror content cleanup
19:18:53 fungi discovered we're carrying some openafs mirror content that we don't need any longer
19:18:57 this is next on my plate after the volume moves/upgrades
19:19:15 specifically openeuler (we don't have openeuler test nodes) and debian stretch content
19:19:30 #link https://review.opendev.org/959892 Stop mirroring OpenEuler packages
19:19:38 then hopefully not too far into the future we'll be dropping ubuntu bionic test nodes too and we can clear that content out as well
19:19:42 we stopped mirroring debian stretch long ago
19:19:47 just didn't delete the files
19:20:10 and with that all done we'll be able to answer questions about whether or not we can mirror trixie or centos 10 stream or whatever
19:20:34 oh, also we can drop bullseye backports
19:20:43 since it's vanished upstream
19:20:45 ++
19:21:03 and as of about a week ago nothing should be accidentally using those files anyway
19:21:20 so they can safely disappear now
19:21:34 if you identify any other stale content that should be removed, say something
19:21:55 though I think after these items it's mostly old docker and ceph packages, which are relatively tiny and don't have similar impact
19:22:20 i expect there's plenty of puppet and ceph package mirroring that can be cleaned up
19:22:29 er, docker, yeah
19:23:09 potentially puppet too. I think either the deb or rpm puppet mirror is quite large too, but it also seems to still be used
19:23:21 it's possible the changes to puppet binary releases will affect that though
19:23:34 (puppet upstream is only releasing source code for future work aiui and will no longer supply packages or binaries)
19:23:41 well, we may be mirroring a lot of old packages that nobody's using too
19:23:49 for puppet
19:23:50 true
19:24:07 could be worth a quick check given its relative size. Start there rather than docker or ceph
19:25:27 #topic Matrix for OpenDev comms
19:25:33 The spec (954826) has merged.
19:26:07 I'm thinking this is a good task to get going while openstack release stuff is happening, as its impact on that process should be nonexistent
19:26:17 so once I dig out of my current backlog a bit I'll try to start on this
19:27:06 I don't think there is anything else really to do here other than start on the process as described in the spec. I'll let everyone know if I hit any issues
19:27:11 #topic Pre PTG Planning
19:27:22 #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:27:28 Times: Tuesday October 7 1800-2000 UTC, Wednesday October 8 1500-1700 UTC, Thursday October 9 1500-1700 UTC
19:27:52 This is 3 weeks away. Add your ideas to the etherpad if you've got them
19:28:27 #topic Etherpad 2.5.0 Upgrade
19:28:32 #link https://github.com/ether/etherpad-lite/blob/v2.5.0/CHANGELOG.md
19:28:37 #link https://review.opendev.org/c/opendev/system-config/+/956593/
19:28:59 as mentioned previously I think the root page css is still a bit odd, but I'm hoping others will have a chance to check it and indicate whether or not they feel this is a blocker for us
19:29:05 104.130.127.119 is a held node for testing. You need to edit /etc/hosts to point etherpad.opendev.org at that IP.
19:29:15 I set up the clarkb-test etherpad there if you want to see some existing edits
19:29:29 what was the link to the upstream issue about the css regression?
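For anyone testing the held Etherpad node mentioned above, the /etc/hosts override is a one-line entry on your own workstation (not on any production host); remove it again when you are done:

    # temporarily resolve etherpad.opendev.org to the held test node
    echo "104.130.127.119 etherpad.opendev.org" | sudo tee -a /etc/hosts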
19:29:38 i guess there's been no further movement there
19:30:07 fungi: they "fixed" it while you were on vacation
19:30:12 and the problem went away for the etherpad pages
19:30:27 https://github.com/ether/etherpad-lite/issues/7065
19:30:37 but the root page still looks odd (not as bad as with the 2.4.2 release though)
19:30:53 so it's mostly a question of whether this is good enough or whether I should reopen the issue / file a new issue
19:31:00 oh nice
19:31:24 that's probably a new issue about an incomplete fix
19:32:08 and reference the prior issue in it so github adds a mention
19:32:11 I also wondered if we could just edit some css to fix it
19:32:25 but I haven't looked beyond the visual check to say oh huh, it's still a bit weird
19:32:39 worth a try for sure, might be an excuse for us to custom "theme" that page i guess
19:32:40 so ya, feedback on whether we think it is an issue would be appreciated
19:33:38 #topic Lists Server Slowness
19:33:55 "good" news! I think I managed to track down the source of the iowait on this server
19:34:30 the tl;dr is that the server is using a "standard" flavor, not a "performance" flavor, and the flavors have disk_io_index properties. The standard flavor is set to 2 and performance is set to 40
19:35:00 that is a 20x difference, something that is experimentally confirmed using fio's randread test. I get about 1k iops on standard there and 20k iops on performance
19:35:31 considering that iowait is very high during busy periods/when mailman is slow, I think the solution here is to move mailman onto a device with better iops performance
19:35:50 earlier today fungi attached an ssd volume to the instance and started copying data in preparation for such a move
19:36:02 at this point i've primed a copy of all the mailman state (archives, database, et cetera) onto an ssd-backed cinder volume. first copy took 53m30s to complete, i'm doing a second one now to see how much faster that goes
19:36:11 fungi: it just occurred to me that you can run the fio tests on that ssd volume when the rsync is done just to confirm iops are better
19:36:19 good idea
19:36:28 fungi: keep in mind that fio will create files to read against and not delete them afterwards, so you may need to do manual cleanup
19:36:37 but I think that is a good sanity check before we commit to this solution
19:36:58 as for the cut-over, i've outlined a rough maintenance plan to minimize service downtime
19:37:06 #link https://etherpad.opendev.org/p/2025-09-mailman-volume-maintenance
19:37:35 the fio commands should be in my user's history on that server. It creates the files in the current working dir. Note there is a read test and a randread test. You should probably run both
19:37:42 or I can run them if you like, just let me know if that is easier
19:38:10 i can give it a shot after i get back from dinner
19:39:01 cool, I'll take a look at that etherpad plan shortly too
19:39:21 anything else on this topic?
19:39:27 not from me
19:40:00 #topic Open Discussion
19:40:28 I don't know who will be attending the summit next month, but if you are going to be there it looks like Friday evening is the opportunity for an opendev+zuul type of get together
19:40:54 let's get opendev and zuul together
19:40:54 I don't plan on doing anything formal, but we'll probably try to aim at a common location for dinner that night if anyone else is interested
19:41:15 you got your opendev in my zuul! you got your zuul in my opendev!
19:41:38 oh so we're going to cross the streams?
19:41:48 that would be... not so bad?
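Back on the lists server topic above, the disk_io_index values can be read directly off the flavor definitions with the openstack CLI. A sketch only; the flavor names below are placeholders, not the actual flavors in use:

    # compare the extra specs (including disk_io_index) of the two flavor classes
    openstack flavor show 8GB-standard -c properties
    openstack flavor show 8GB-performance -c properties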
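The fio comparison from that topic is also easy to reproduce. The exact invocations used are only in clarkb's shell history on the server, so the commands below are an illustrative approximation (job names and sizes are made up); run them once on the existing disk and once on the new ssd volume, from a directory on the device being tested:

    # sequential read test, then a 4k random read test; fio writes its data files to the current directory
    fio --name=seqread --rw=read --bs=1M --size=1G --ioengine=libaio --direct=1 --runtime=60 --time_based
    fio --name=randread --rw=randread --bs=4k --size=1G --ioengine=libaio --direct=1 --runtime=60 --time_based
    # fio does not clean up after itself, so remove the test files afterwards
    rm -f seqread.0.0 randread.0.0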
19:41:57 also I may have found some stickers....
19:42:06 I just have to remember to pack them
19:42:26 yay!
19:43:45 Last call. Anything else that we haven't covered that should be discussed? Also, we can always discuss things on the mailing list or in the regular irc channel
19:45:26 sounds like now
19:45:28 *no
19:45:40 thanks clarkb!
19:45:43 thank you everyone! we'll be back here same time and location next week. Until then, thank you for your help working on opendev
19:45:51 and see you there
19:45:53 #endmeeting