19:00:27 #startmeeting infra
19:00:27 Meeting started Tue Jun 25 19:00:27 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:27 The meeting name has been set to 'infra'
19:00:52 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IWHUW7OFDHZSL7MGPOK53LJ3ZR43OSOF/ Our Agenda
19:01:16 #topic Announcements
19:01:36 It's been a couple of weeks since the last meeting, but other than that I'm not aware of any important announcements
19:01:56 might be worth noting that a week from Thursday is a major US holiday and I expect those of us in the US will be busy enjoying the day off
19:02:27 anything else to announce?
19:04:33 seems like that's it
19:04:37 #topic Upgrading Old Servers
19:05:03 tonyb continues to poke at configuration management and related infrastructure for a rebuilt wiki server
19:05:40 Yup, it's getting closer
19:05:42 The last big question that came up was where logs should live for the apache embedded in the container images that is responsible for running the php stuff. I suggested that we could hook that into our regular container-logs-to-syslog-to-log-files-on-disk system
19:06:20 I like that approach as it keeps the logs distinct from the host-side ssl-terminating apache that we use on most of our systems and are likely to use here as well
19:06:32 but there are many ways to do that, so I guess let tonyb know if you have different ideas
19:07:13 I still need to configure Elasticsearch, but I can build a server in CI that "works" and seems to do the important stuff
19:07:23 that is great progress
19:07:49 was the conclusion to directly expose the container apache or reverse-proxy to it from an outer apache?
19:07:53 At some stage soon I want to try importing the images from the existing server and pointing a held node at the trove DB
19:08:33 fungi: I'm still using a reverse proxy
19:08:36 tonyb: if you do that you'll probably need to make a copy of the db? I'm not sure it's safe to have two different installs talking to the same db
19:09:20 but that sounds like a great step to include as part of the how-do-we-migrate story
19:09:39 I'll double check on that ... fortunately there is a mariadb server just waiting for data :)
19:10:02 yeah, weird as it sounds, proxying from an apache to another apache makes the most sense for consistency with our other systems
19:10:11 just wanted to double-check
19:10:53 fungi: when I get a more complete install I'll need you to poke at the antispam stuff to see if it's working as expected
19:11:04 yep
19:11:13 It's installed but that's about as far as I have gotten
19:12:05 It's funny to connect to http://new-server/ and watch all the 30X replies to get to a 200 ;P
19:12:16 redirect all the things
19:12:39 tonyb: you were also poking at booting noble nodes but ran into trouble with the afs packaging situation on noble (due to our use of a PPA that doesn't have packages yet)
19:13:00 tonyb: has there been any progress on that? That's something I wanted to catch up on after my week off and haven't managed to
19:13:26 I have patches up that I think passed CI yesterday; the topic is noble-mirror or something like that
19:14:09 I poked the Ubuntu stable team to get the fixed packages moved into -updates
19:14:12 #link https://review.opendev.org/q/topic:noble-mirror Deploying noble based mirror servers
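For anyone wanting to reproduce tonyb's experiment of watching the 30X replies resolve to a 200, here is a minimal Python sketch that walks a redirect chain hop by hop using only the standard library; "new-server" is a stand-in hostname, not the real server name:

    #!/usr/bin/env python3
    """Sketch: follow a redirect chain by hand, printing each hop's status."""
    import urllib.request
    from urllib.error import HTTPError
    from urllib.parse import urljoin

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        # Returning None makes urllib raise HTTPError on 3xx responses
        # instead of silently following them, so every hop is visible.
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    opener = urllib.request.build_opener(NoRedirect)
    url = "http://new-server/"  # placeholder; substitute the held node's address
    for _ in range(10):  # cap the chain length, just in case
        try:
            with opener.open(url, timeout=10) as resp:
                print(resp.status, url)  # should end at a 200
                break
        except HTTPError as err:
            print(err.code, url)
            if err.code // 100 != 3 or "Location" not in err.headers:
                break
            url = urljoin(url, err.headers["Location"])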
19:15:08 anything else for server upgrades?
19:15:32 I think that's it for now
19:15:42 #topic AFS Mirror Cleanups
19:15:55 This is something I've started pushing on again recently.
19:16:48 topic:drop-ubuntu-xenial has at least one new change. This change drops xenial from system-config base role testing. I noted in the commit message for that change that this might be one of the more "dangerous" changes in the xenial cleanup for us. The reason for that is we require the infra-prod-service-base job to succeed before running most other jobs
19:17:15 so if we start failing against xenial we could prevent other jobs from running even for stuff not on xenial. That said, I think the risk is still relatively low as the base role stuff doesn't change often
19:17:32 open to feedback if you've got it (probably best to keep that in the review though)
19:18:21 I've also been trying to catch up with centos 8 stream cleanup now that none of those test nodes are really functional
19:18:57 topic:drop-centos-8-stream has a whole bunch of fun changes to make that cleanup happen. The vast majority should be safe. There is an openstack-zuul-jobs change that is -1'd by zuul because projects have c8s jobs for FIPS
19:19:20 I've been trying to get the word out and brought this up with the openstack qa team and tc. Sounds like the qa team may send email about it
19:19:42 at some point I expect we may end up force merging that change though and forcing projects to address the problems if they haven't already
19:21:00 I think the risk is really low though, considering that all the jobs are failing on that platform since no packages are available.
19:21:12 #topic Gitea 1.22 Upgrade
19:21:46 This is the other item that is/was high on my todo list. Unfortunately there still isn't a 1.22.1 release yet. I don't think many, if any, of the issues they have with 1.22.0 will affect us, but there were a lot of issues and some of them seemed important (for general use)
19:21:58 so I'm feeling more confident waiting for a 1.22.1 release before we upgrade
19:22:35 once we have upgraded we can proceed with fixing up the database encodings and all that
19:22:47 anyway, no new info here other than 1.22.1 is still absent
19:22:59 #topic Improving Mailman Mail Throughput
19:23:22 as discussed last time, we likely need to increase both mailman queue batch sizes and exim verp batch sizes
19:23:42 I don't think a change has been pushed for that yet, unless I missed it in today's early morning scrollback, but fungi was looking into it
19:23:48 fungi: is there anything else to add?
19:24:43 #link https://review.opendev.org/c/opendev/system-config/+/922703
19:24:49 oh, i pushed one, sorry
19:24:59 I think that's what we're talking about, right?
19:25:01 oh hey, I did miss it in scrollback, thanks
19:25:03 thanks tonyb
19:25:04 yup
19:25:46 I'll give that a review after the meeting
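For a rough sense of why those batch sizes matter: each batch of envelope recipients costs one SMTP transaction, and full VERP personalization (a unique bounce address per recipient) effectively forces a batch size of one. A back-of-the-envelope Python sketch; the subscriber count and batch sizes are made-up illustrations, not opendev's actual settings:

    import math

    def smtp_transactions(recipients: int, batch_size: int) -> int:
        # One SMTP transaction per batch of envelope recipients.
        return math.ceil(recipients / batch_size)

    # Hypothetical 3000-subscriber list at various batch sizes.
    for batch in (1, 10, 50, 500):
        print(f"batch={batch:>3}: {smtp_transactions(3000, batch)} transactions")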
19:26:14 #topic OpenMetal Cloud Rebuild
19:26:22 So this was all going great until it wasn't :)
19:26:48 We had the cloud fully running in nodepool with max-servers 50 set and jobs were running etc. But then there were failures booting nodes and frickler's debugging indicated we ran out of disk?
19:27:14 sounds like maybe ceph isn't backing all of the disk-related stuff that we expect it to be backing, and that led to us filling the smaller portion of disk not allocated to ceph
19:27:25 frickler: is that a reasonable summary?
19:27:30 I sent a mail about that to the openmetal thread
19:28:00 the issue is that glance and nova cannot share the ceph pool properly in the current configuration
19:28:03 ya, you also suggested some kolla setting changes that could be applied to fix it. I think it is a good idea to work through this with the openmetal folks as I suspect it may affect their product as a whole, and this is something we want to ensure is working for us and everyone else
19:28:24 thus nova needs to download each image and upload it to ceph again
19:28:31 aha
19:28:34 that would be inefficient
19:28:50 which fills up the root partition, which is only like 200G, and our images are large
19:29:09 it was actually openmetal who discovered the disk full condition, too
19:29:52 maybe you can poke yuri again since there was no response yet to my mail afaict
19:29:53 frickler: I haven't seen a response to your email yet, any chance they responded to you directly instead of cc'ing the rest of us? or should i try and follow up with them in a bit?
19:29:58 ack, can do
19:30:43 there was also the question of the number of ceph placement groups per osd being very low. The docs suggest 100 per osd but we're at like 4? I can mention that too
19:30:55 once that is fixed we can do another production attempt and then possibly decide about the ceph tuning
19:30:59 though the docs are also a bit hand-wavy around how much it actually matters
19:31:07 #link https://docs.ceph.com/en/latest/dev/placement-group/
19:31:24 sounds good, thank you for helping with the debugging on that
19:31:29 well, it will worsen the performance a bit, but difficult to tell how much
19:32:48 anything else openmetal related?
19:33:08 I think that should be all for now
19:33:20 oh
19:33:46 ask about the monitoring thing again, I think that also went unnoticed in your mail, or at least unresponded to
19:34:12 can do /me scribbles some notes on the todo list
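For context on the placement group numbers above: the usual rule of thumb from the Ceph docs is on the order of 100 PGs per OSD, divided across pools by the replication factor and rounded to a power of two. A small Python sketch of that arithmetic; the replica count of 3 and the 12-OSD example are assumptions for illustration, not the openmetal cloud's actual values:

    def suggested_pg_num(osds: int, target_per_osd: int = 100, replicas: int = 3) -> int:
        # Rule-of-thumb pool sizing: total PG capacity across OSDs divided
        # by the replication factor, rounded up to the next power of two.
        raw = osds * target_per_osd / replicas
        pg = 1
        while pg < raw:
            pg *= 2
        return pg

    print(suggested_pg_num(12))  # hypothetical 12-OSD cluster -> 512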
19:34:36 #topic Testing Rackspace's New Cloud Offering
19:35:22 rax recently reached out to fungi and me about this. Still not a ton of info, but I'm working to try and schedule a short meeting to allow us to discuss it more synchronously and determine the next step here
19:35:55 I proposed July 8 as a possible day as it avoids the holiday next week and isn't this week, but I have no idea what their schedule is like and haven't heard back yet.
19:36:12 do you have a mail about this that you could share?
19:36:34 But hopefully we'll be able to sync up and learn more and do something productive around this. I think helping them burn in the new product if we can is a great idea
19:36:49 frickler: just the ticket they filed against the nodepool account
19:37:01 i think the details were the same as what they sent opendev, yeah
19:37:49 ah, that sounded like you were contacted directly, ok
19:38:02 ya, it's basically just "we have a new product in test/beta/limited availability, would you like to be an early user"
19:38:11 the separate outreach was technically to the foundation staff about using it, but the actual foundation wouldn't really have much use for it beyond supplying resources for opendev to use, i think
19:38:43 ack
19:38:54 rackspace contacted business development folks on the foundation staff, who basically forwarded it to me and clarkb
19:39:20 and in the end it was a matter of "yeah, they already reached out to opendev about this same thing"
19:39:51 and without any more info than we got in the ticket
19:40:09 once I hear back anything more actionable I can share that info
19:40:18 so more business than technical; I'm fine to be left out of that ;)
19:40:49 #topic Open Discussion
19:41:15 I wanted to mention that dib was also hit by the c8s stuff but is getting sorted out
19:41:43 mostly an fyi, I don't think we need help with it and jobs should be passing again as of today
19:42:07 some of the recent zuul performance improvements have resulted in a significant (40%+) reduction in peak zk data size in opendev:
19:42:11 https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-30d&to=now&viewPanel=38
19:42:21 starlingx also ran into pip 24.1 problems (there is a bunch of discussion of that in #opendev from today) related to pip failing to handle metadata on some packages
19:42:24 oh wow!
19:42:33 that's a nice performance gain
19:42:44 corvus: awesome!
19:43:00 fungi: yeah, i was not expecting it to be so big with opendev's specific characteristics, so was a pleasant surprise :)
19:43:55 clarkb: is an underlying cause the age of the package in question? i'm wondering if there may be more time-bombs like that...
19:44:49 well, there are a number of backward-incompatible changes in pip 24.1
19:45:49 corvus: sort of. The package is older and newer versions do fix it
19:46:00 corvus: but I think you could hit the same issue with modern packages too
19:46:08 ack
19:46:16 it dropped support for python 3.7, refuses to install anything with a non-PEP 440-compliant version string, and also vendors newer libs for things like metadata processing which may have gotten more strict than they used to be
19:49:29 I may end up being away from irc/matrix at some point today and/or tomorrow. I really need to finally get around to RMAing this laptop, and before I do I want to retest with ubuntu noble and wayland vs x11 etc to ensure this isn't just a software problem. I think I'm going to start diving into that today if I can and then sit on the phone tomorrow with lenovo if still necessary
19:49:46 I got a talk accepted to the openinfra summit event in korea and need a working laptop before that event
19:50:41 the annoying thing is it almost mostly works if I disable modesetting in the kernel, but if I do that everything has to run at 1920x1080 (including external displays) and I can't control display brightness
19:50:45 clarkb: congrats and noted
19:51:02 Last call, otherwise I think we can have 10 minutes back for $meal/sleep/etc
19:51:26 .zZ
19:52:09 thanks everyone!
19:52:11 #endmeeting
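As a postscript to the pip 24.1 discussion: the stricter version handling comes from pip's vendored packaging library rejecting non-PEP 440 version strings outright rather than falling back to a legacy scheme. The same check can be reproduced with the standalone packaging distribution (pip install packaging); the version strings below are made-up examples, not the package starlingx actually hit:

    from packaging.version import InvalidVersion, Version

    for candidate in ("1.2.3", "2.0.dev1", "1.0.0-SNAPSHOT"):
        try:
            print(candidate, "->", Version(candidate))
        except InvalidVersion:
            # pip 24.1+ refuses to install distributions whose version
            # strings fail this parse; "1.0.0-SNAPSHOT" is one example.
            print(candidate, "-> rejected (not PEP 440 compliant)")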