19:00:27 #startmeeting infra
19:00:27 Meeting started Tue Jun 25 19:00:27 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:27 The meeting name has been set to 'infra'
19:00:52 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IWHUW7OFDHZSL7MGPOK53LJ3ZR43OSOF/ Our Agenda
19:01:16 #topic Announcements
19:01:36 It's been a couple of weeks since the last meeting, but other than that I'm not aware of any important announcements
19:01:56 might be worth noting that a week from Thursday is a major US holiday and I expect those of us in the US will be busy enjoying the day off
19:02:27 anything else to announce?
19:04:33 seems like that's it
19:04:37 #topic Upgrading Old Servers
19:05:03 tonyb continues to poke at configuration management and related infrastructure for a rebuilt wiki server
19:05:40 Yup, it's getting closer
19:05:42 The last big question that came up was where logs should live for the apache embedded in the container images that is responsible for running the php stuff. I suggested that we could hook that into our regular container-logs-to-syslog-to-log-files-on-disk system
19:06:20 I like that approach as it keeps the logs distinct from the host-side ssl-terminating apache that we use on most of our systems and are likely to use here as well
19:06:32 but there are many ways to do that, so I guess let tonyb know if you have different ideas
19:07:13 I still need to configure Elasticsearch, but I can build a server in CI that "works" and seems to do the important stuff
19:07:23 that is great progress
19:07:49 was the conclusion to directly expose the container apache or reverse-proxy to it from an outer apache?
19:07:53 At some stage soon I want to try importing the images from the existing server and pointing a held node at the trove DB
19:08:33 fungi: I'm still using a reverse proxy
19:08:36 tonyb: if you do that you'll probably need to make a copy of the db? I'm not sure it's safe to have two different installs talking to the same db
19:09:20 but that sounds like a great step to include as part of the how-do-we-migrate story
19:09:39 I'll double check on that ... fortunately there is a mariadb server just waiting for data :)
19:10:02 yeah, weird as it sounds, proxying from an apache to another apache makes the most sense for consistency with our other systems
19:10:11 just wanted to double-check
19:10:53 fungi: when I get a more complete install I'll need you to poke at the antispam stuff to see if it's working as expected
19:11:04 yep
19:11:13 It's installed but that's about as far as I have gotten
19:12:05 It's funny to connect to http://new-server/ and watch all the 30X replies to get to a 200 ;P
19:12:16 redirect all the things
19:12:39 tonyb: you were also poking at booting noble nodes but ran into trouble with the afs packaging situation on noble (due to our use of a PPA that doesn't have packages yet)
19:13:00 tonyb: has there been any progress on that? That's something I wanted to catch up on after my week off and haven't managed to
19:13:26 I have patches up that I think passed CI yesterday; the topic is noble-mirror or something like that
19:14:09 I poked the Ubuntu stable team to get the fixed packages moved into -updates
19:14:12 #link https://review.opendev.org/q/topic:noble-mirror Deploying noble based mirror servers
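For anyone wanting to reproduce tonyb's experiment of watching the 30X replies resolve to a 200, here is a minimal Python sketch that walks a redirect chain hop by hop using only the standard library; "new-server" is a stand-in hostname, not the real server name:

    #!/usr/bin/env python3
    """Sketch: follow a redirect chain by hand, printing each hop's status."""
    import urllib.request
    from urllib.error import HTTPError
    from urllib.parse import urljoin

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        # Returning None makes urllib raise HTTPError on 3xx responses
        # instead of silently following them, so every hop is visible.
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    opener = urllib.request.build_opener(NoRedirect)
    url = "http://new-server/"  # placeholder; substitute the held node's address
    for _ in range(10):  # cap the chain length, just in case
        try:
            with opener.open(url, timeout=10) as resp:
                print(resp.status, url)  # should end at a 200
                break
        except HTTPError as err:
            print(err.code, url)
            if err.code // 100 != 3 or "Location" not in err.headers:
                break
            url = urljoin(url, err.headers["Location"])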
19:15:08 anything else for server upgrades?
19:15:32 I think that's it for now
19:15:42 #topic AFS Mirror Cleanups
19:15:55 This is something I've started pushing on again recently.
19:16:48 topic:drop-ubuntu-xenial has at least one new change. This change drops xenial from system-config base role testing. I noted in the commit message for that change that this might be one of the more "dangerous" changes in the xenial cleanup for us. The reason for that is we require the infra-prod-service-base job to succeed before running most other jobs
19:17:15 so if we start failing against xenial we could prevent other jobs from running even for stuff not on xenial. That said, I think the risk is still relatively low as the base role stuff doesn't change often
19:17:32 open to feedback if you've got it (probably best to keep that in the review though)
19:18:21 I've also been trying to catch up with centos 8 stream cleanup now that none of those test nodes are really functional
19:18:57 topic:drop-centos-8-stream has a whole bunch of fun changes to make that cleanup happen. The vast majority should be safe. There is an openstack-zuul-jobs change that is -1'd by zuul because projects have c8s jobs for FIPS
19:19:20 I've been trying to get the word out and brought this up with the openstack qa team and tc. Sounds like the qa team may send email about it
19:19:42 at some point I expect we may end up force merging that change though and forcing projects to address the problems if they haven't already
19:21:00 I think the risk is really low though, considering that all the jobs are failing on that platform since no packages are available.
19:21:12 #topic Gitea 1.22 Upgrade
19:21:46 This is the other item that is/was high on my todo list. Unfortunately there still isn't a 1.22.1 release yet. I don't think many, if any, of the issues they have with 1.22.0 will affect us, but there were a lot of issues and some of them seemed important (for general use)
19:21:58 so I'm feeling more confident waiting for a 1.22.1 release before we upgrade
19:22:35 once we have upgraded we can proceed with fixing up the database encodings and all that
19:22:47 anyway, no new info here other than 1.22.1 is still absent
19:22:59 #topic Improving Mailman Mail Throughput
19:23:22 as discussed last time, we likely need to increase both mailman queue batch sizes and exim verp batch sizes
19:23:42 I don't think a change has been pushed for that yet, unless I missed it in today's early morning scrollback, but fungi was looking into it
19:23:48 fungi: is there anything else to add?
19:24:43 #link https://review.opendev.org/c/opendev/system-config/+/922703
19:24:49 oh, i pushed one, sorry
19:24:59 I think that's what we're talking about, right?
19:25:01 oh hey, I did miss it in scrollback, thanks
19:25:03 thanks tonyb
19:25:04 yup
19:25:46 I'll give that a review after the meeting
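For a rough sense of why those batch sizes matter: each batch of envelope recipients costs one SMTP transaction, and full VERP personalization (a unique bounce address per recipient) effectively forces a batch size of one. A back-of-the-envelope Python sketch; the subscriber count and batch sizes are made-up illustrations, not opendev's actual settings:

    import math

    def smtp_transactions(recipients: int, batch_size: int) -> int:
        # One SMTP transaction per batch of envelope recipients.
        return math.ceil(recipients / batch_size)

    # Hypothetical 3000-subscriber list at various batch sizes.
    for batch in (1, 10, 50, 500):
        print(f"batch={batch:>3}: {smtp_transactions(3000, batch)} transactions")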
19:26:14 #topic OpenMetal Cloud Rebuild
19:26:22 So this was all going great until it wasn't :)
19:26:48 We had the cloud fully running in nodepool with max-servers 50 set and jobs were running etc. But then there were failures booting nodes and frickler's debugging indicated we ran out of disk?
19:27:14 sounds like maybe ceph isn't backing all of the disk-related stuff that we expect it to be backing, and that led to us filling the smaller portion of disk not allocated to ceph
19:27:25 frickler: is that a reasonable summary?
19:27:30 I sent a mail about that to the openmetal thread
19:28:00 the issue is that glance and nova cannot share the ceph pool properly in the current configuration
19:28:03 ya, you also suggested some kolla setting changes that could be applied to fix it. I think it is a good idea to work through this with the openmetal folks as I suspect it may affect their product as a whole, and this is something we want to ensure is working for us and everyone else
19:28:24 thus nova needs to download each image and upload it to ceph again
19:28:31 aha
19:28:34 that would be inefficient
19:28:50 which fills up the root partition, which is only like 200G, and our images are large
19:29:09 it was actually openmetal who discovered the disk full condition, too
19:29:52 maybe you can poke yuri again since there was no response yet to my mail afaict
19:29:53 frickler: I haven't seen a response to your email yet, any chance they responded to you directly instead of cc'ing the rest of us? or should i try and follow up with them in a bit?
19:29:58 ack, can do
19:30:43 there was also the question of the number of ceph placement groups per osd being very low. The docs suggest 100 per osd but we're at like 4? I can mention that too
19:30:55 once that is fixed we can do another production attempt and then possibly decide about the ceph tuning
19:30:59 though the docs are also a bit hand-wavy around how much it actually matters
19:31:07 #link https://docs.ceph.com/en/latest/dev/placement-group/
19:31:24 sounds good, thank you for helping with the debugging on that
19:31:29 well, it will worsen the performance a bit, but difficult to tell how much
19:32:48 anything else openmetal related?
19:33:08 I think that should be all for now
19:33:20 oh
19:33:46 ask about the monitoring thing again, I think that also went unnoticed in your mail, or at least unresponded to
19:34:12 can do /me scribbles some notes on the todo list
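For context on the placement group numbers above: the usual rule of thumb from the Ceph docs is on the order of 100 PGs per OSD, divided across pools by the replication factor and rounded to a power of two. A small Python sketch of that arithmetic; the replica count of 3 and the 12-OSD example are assumptions for illustration, not the openmetal cloud's actual values:

    def suggested_pg_num(osds: int, target_per_osd: int = 100, replicas: int = 3) -> int:
        # Rule-of-thumb pool sizing: total PG capacity across OSDs divided
        # by the replication factor, rounded up to the next power of two.
        raw = osds * target_per_osd / replicas
        pg = 1
        while pg < raw:
            pg *= 2
        return pg

    print(suggested_pg_num(12))  # hypothetical 12-OSD cluster -> 512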
19:34:36 #topic Testing Rackspace's New Cloud Offering
19:35:22 rax recently reached out to fungi and me about this. Still not a ton of info, but I'm working to try and schedule a short meeting to allow us to discuss it more synchronously and determine the next step here
19:35:55 I proposed July 8 as a possible day as it avoids the holiday next week and isn't this week, but I have no idea what their schedule is like and haven't heard back yet.
19:36:12 do you have a mail about this that you could share?
19:36:34 But hopefully we'll be able to sync up and learn more and do something productive around this. I think helping them burn in the new product if we can is a great idea
19:36:49 frickler: just the ticket they filed against the nodepool account
19:37:01 i think the details were the same as what they sent opendev, yeah
19:37:49 ah, that sounded like you were contacted directly, ok
19:38:02 ya, it's basically just "we have a new product in test/beta/limited availability, would you like to be an early user"
19:38:11 the separate outreach was technically to the foundation staff about using it, but the actual foundation wouldn't really have much use for it beyond supplying resources for opendev to use, i think
19:38:43 ack
19:38:54 rackspace contacted business development folks on the foundation staff, who basically forwarded it to me and clarkb
19:39:20 and in the end it was a matter of "yeah, they already reached out to opendev about this same thing"
19:39:51 and without any more info than we got in the ticket
19:40:09 once I hear back anything more actionable I can share that info
19:40:18 so more business than technical; I'm fine to be left out of that ;)
19:40:49 #topic Open Discussion
19:41:15 I wanted to mention that dib was also hit by the c8s stuff but is getting sorted out
19:41:43 mostly an fyi, I don't think we need help with it and jobs should be passing again as of today
19:42:07 some of the recent zuul performance improvements have resulted in a significant (40%+) reduction in peak zk data size in opendev:
19:42:11 https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-30d&to=now&viewPanel=38
19:42:21 starlingx also ran into pip 24.1 problems (there is a bunch of discussion of that in #opendev from today) related to pip failing to handle metadata on some packages
19:42:24 oh wow!
19:42:33 that's a nice performance gain
19:42:44 corvus: awesome!
19:43:00 fungi: yeah, i was not expecting it to be so big with opendev's specific characteristics, so was a pleasant surprise :)
19:43:55 clarkb: is an underlying cause the age of the package in question? i'm wondering if there may be more time-bombs like that...
19:44:49 well, there are a number of backward-incompatible changes in pip 24.1
19:45:49 corvus: sort of. The package is older and newer versions do fix it
19:46:00 corvus: but I think you could hit the same issue with modern packages too
19:46:08 ack
19:46:16 it dropped support for python 3.7, refuses to install anything with a non-PEP 440-compliant version string, and also vendors newer libs for things like metadata processing which may have gotten more strict than they used to be
19:49:29 I may end up being away from irc/matrix at some point today and/or tomorrow. I really need to finally get around to RMAing this laptop, and before I do I want to retest with ubuntu noble and wayland vs x11 etc to ensure this isn't just a software problem. I think I'm going to start diving into that today if I can and then sit on the phone tomorrow with lenovo if still necessary
19:49:46 I got a talk accepted to the openinfra summit event in korea and need a working laptop before that event
19:50:41 the annoying thing is it almost mostly works if I disable modesetting in the kernel, but if I do that everything has to run at 1920x1080 (including external displays) and I can't control display brightness
19:50:45 clarkb: congrats and noted
19:51:02 Last call, otherwise I think we can have 10 minutes back for $meal/sleep/etc
19:51:26 .zZ
19:52:09 thanks everyone!
19:52:11 #endmeeting
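As a postscript to the pip 24.1 discussion: the stricter version handling comes from pip's vendored packaging library rejecting non-PEP 440 version strings outright rather than falling back to a legacy scheme. The same check can be reproduced with the standalone packaging distribution (pip install packaging); the version strings below are made-up examples, not the package starlingx actually hit:

    from packaging.version import InvalidVersion, Version

    for candidate in ("1.2.3", "2.0.dev1", "1.0.0-SNAPSHOT"):
        try:
            print(candidate, "->", Version(candidate))
        except InvalidVersion:
            # pip 24.1+ refuses to install distributions whose version
            # strings fail this parse; "1.0.0-SNAPSHOT" is one example.
            print(candidate, "-> rejected (not PEP 440 compliant)")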