19:00:39 <clarkb> #startmeeting infra
19:00:39 <opendevmeet> Meeting started Tue Jun 4 19:00:39 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:39 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:39 <opendevmeet> The meeting name has been set to 'infra'
19:00:46 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/JH2FNAUEA32AS4GH475AHYEPLP4FUGPE/ Our Agenda
19:01:09 <clarkb> #topic Announcements
19:01:23 <clarkb> I will not be able to run the meeting on June 18
19:01:39 <clarkb> that is two weeks from today. More than happy for someone else to take over or we can skip if people prefer that
19:02:17 <tonyb> I'll be back in Australia by then so I probably can't run it
19:02:42 <clarkb> I will be afk from the 14-19th
19:03:18 <tonyb> okay. I'll be travelling for part of that.
19:03:22 <clarkb> we can sort out the plan next week. Plenty of time
19:03:23 <tonyb> poor fungi
19:03:29 <tonyb> sounds good
19:03:43 <clarkb> #topic Upgrading old servers
19:03:48 <clarkb> #link https://etherpad.opendev.org/p/opendev-mediawiki-upgrade
19:03:54 <clarkb> I think this has been the recent focus of this effort
19:04:00 <tonyb> yup
19:04:05 <clarkb> looks like there was a change just pushed that I should #link too
19:04:18 <tonyb> there are some things to read about the plan etc
19:04:19 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921321 Mediawiki deployment for OpenDev
19:05:00 <tonyb> reviews very welcome. it works for local testing but needs more comprehensive testing
19:05:04 <clarkb> tonyb: anything specific in the plan etherpad you want us to be looking at or careful about?
19:05:50 <tonyb> There is a small announcement that could do with a review
19:05:50 <clarkb> tonyb: ya I think if we can get something close enough to migrate data into we can live with small issues like theming not working and plan a cutover. I assume we'll want to shut down the old wiki so we don't have content divergence?
19:06:01 <tonyb> as I'd like to get that out reasonably soon
19:06:32 <clarkb> #link https://etherpad.opendev.org/p/opendev-wiki-announce Announcement for wiki changes
19:06:35 <clarkb> that one I assume
19:06:50 <tonyb> Yes we'll shut down the current server ASAP
19:07:39 <tonyb> There is plenty of planning stuff like moving away from rax-trove etc but I'm pretty happy with the progress
19:07:58 <clarkb> yup I think this is great. I'm planning to dig into it more today after meetings and lunch
19:08:03 <tonyb> the bare bones upgrade of the host OS is IMO a solid improvement
19:08:27 <tonyb> I'll try a noble server this week
19:08:46 <tonyb> I think that's pretty much all there is to say for server upgrades
19:08:53 <clarkb> thanks
19:09:00 <clarkb> #topic AFS Mirror Cleanup
19:09:10 <clarkb> Not much new here other than that devstack-gate has been fully retired
19:09:26 <clarkb> I've been distracted by many other things like gerrit upgrades and gitea upgrades and cloud shutdowns which we'll get to shortly
19:09:37 <clarkb> #topic Gerrit 3.9 Upgrade
19:09:57 <clarkb> This happened. It seemed to go well. People are even noticing some of the new features like suggested edits
19:10:13 <clarkb> Has anyone else seen or heard of any issues since the upgrade?
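
For anyone wanting to double-check which version is actually serving requests after the upgrade, the stock Gerrit REST API exposes the server version; a minimal Python sketch (nothing OpenDev-specific beyond the review.opendev.org hostname):

    # Query Gerrit's standard /config/server/version endpoint and strip the
    # XSSI-protection prefix Gerrit puts in front of JSON responses.
    import json
    import requests

    resp = requests.get("https://review.opendev.org/config/server/version", timeout=10)
    resp.raise_for_status()
    version = json.loads(resp.text.removeprefix(")]}'").strip())
    print(version)  # expected to report a 3.9.x release after the upgrade
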
19:10:26 <clarkb> I guess there was the small gertty issue which is resolvable by starting with a new sqlite db
19:11:13 <tonyb> That's all I'm aware of
19:11:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920938 is ready when we think we are unlikely to revert
19:12:07 <clarkb> I'm going to go ahead and remove my WIP vote on that change since we aren't aware of any problems
19:13:21 <tonyb> I'm happy to go ahead whenever
19:13:22 <clarkb> That change has a few children which add the 3.10 image builds and testing. The testing change seems to error when trying to pull the 3.10 image which I thought should be in the intermediate ci registry. But I wonder if it is because the tag doesn't exist at all
19:13:45 <clarkb> corvus: ^ just noting that in case you've seen it. does the tooling try to fetch from docker hub proper and then fail even if it could find stuff locally?
19:14:11 <clarkb> in any case that isn't urgent but getting some early feedback on whether or not the next release works and is upgradeable is always good (particularly after the last upgrade was broken initially)
19:14:58 <clarkb> thank you to everyone that helped with the upgrade. Despite my concerns about feeling underprepared things went smoothly which probably speaks to how prepared we actually were
19:15:08 <clarkb> #topic Gitea 1.22 Upgrade
19:15:44 <clarkb> I was hoping there would be a 1.22.1 by now but as far as I can tell there isn't. I'm also likely going to put this on the back burner for the immediate future as I've got several other more time sensitive things to worry about before I take that time off
19:16:27 <tonyb> That's fair.
19:16:49 <clarkb> That said I think the next step is going to be getting the upgrade change into shape so if people can review that it is still helpful
19:17:04 <clarkb> then we can upgrade, and after that run the doctor tooling one by one on our backend nodes
19:17:15 <tonyb> Whatever works, if we decide we need it then we're pretty prepared thanks to your work
19:17:37 <clarkb> testing of the doctor tool seems to indicate running it is straightforward. We should be able to do it with minimal impact, taking one node out of service, doctoring it, and doing that in a loop
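
A rough sketch of the one-node-at-a-time loop described above. This is illustration, not the actual runbook: the backend hostnames, compose path, container name, and doctor subcommand are all assumptions, and pulling nodes in and out of the load balancer is left as a placeholder.

    # Illustrative only: doctor the gitea backends one at a time.
    import subprocess

    backends = [f"gitea{n:02d}.opendev.org" for n in range(9, 15)]  # assumed names

    for host in backends:
        # 1. remove this backend from the load balancer here (not shown)
        # 2. run gitea doctor inside the container on that host
        subprocess.run(
            ["ssh", host,
             "cd /etc/gitea-docker && "
             "docker compose exec -T gitea-web gitea doctor check --all"],
            check=True,
        )
        # 3. add the backend back to the load balancer before the next one
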
19:17:51 <clarkb> tonyb: ya I think reviews are the most useful thing at this point for the gitea upgrade
19:17:58 <clarkb> since the initial pass is working, as is the doctor tool in testing
19:18:56 <tonyb> Cool
19:19:00 <clarkb> #topic Fixing Cloud Launcher Ansible
19:19:30 <clarkb> frickler fixed up the security groups in the osuosl cloud but then almost immediately we ran into the git perms trust issue that is a side effect of recent git packaging updates for security concern fixes
19:19:50 <clarkb> This appears to be our only infra-prod-* job affected by the git updates so overall pretty good
19:20:05 <clarkb> for fixing cloud launcher I went ahead and wrote a change that trusts the ansible role repos when we clone them
19:20:07 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921061 workaround the Git issue
19:20:19 <clarkb> I did find that you can pass config options to git clone but they only apply after most of the cloning is complete
19:20:34 <clarkb> so I'm like 99% sure that doing that won't work for this case as the perms check occurs before cloning begins
19:21:21 <tonyb> Yeah it'd be nice if the options applied "earlier" but I think what you've done is good for now
19:21:36 <clarkb> One thing to keep in mind reviewing that change is I'm pretty sure we don't have any real test coverage for it
19:21:50 <clarkb> so careful review is a good idea and tonyb already caught one big issue with it (now fixed)
19:22:10 <tonyb> The general git safe.directory issue doesn't have a one-size-fits-all solution
19:22:47 <corvus> clarkb: (sorry i'm late) i don't think lack of image in dockerhub should be a problem; that may point to a bug/flaw
19:23:19 <clarkb> corvus: ok, I feel like this has happened before and I've double checked the dependencies and provides/requires and I think everything is working in the order I would expect
19:23:44 <clarkb> and I just wondered if using a new tag is part of the issue since 3.10 doesn't exist anywhere yet but a :latest would be present typically
19:24:04 <clarkb> I guess I'll look more closely
19:24:12 <clarkb> #topic Increase Mailman 3 Out Runner Count
19:24:20 <corvus> yeah; lemme know if you want me to dig into details with you
19:24:23 <clarkb> corvus: thanks
19:24:43 <clarkb> I don't know if anyone else has noticed but recently I had some confusion over emails being in the openstack-discuss list archive and not in my inbox, thinking there was some problem
19:24:59 <clarkb> the issue was that delivery for that list takes time. Upwards of 10 minutes.
19:25:12 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920765 Increase the number of out runners to speed up outbound mail delivery
19:25:36 <clarkb> In that change I linked to a discussion about similar problems another mailing list host had and ultimately the biggest impact for them was increasing the number of runner instances
19:25:59 <tonyb> Makes sense to me.
19:26:01 <clarkb> I've gone ahead and proposed we do the same. This isn't super critical but I think it will help avoid confusion in the future and keep discussions moving without unnecessary delay
19:26:33 <tonyb> I'm cool with your fix but it'd be cool to have a way to verify it has the intended impact
19:26:45 <corvus> are we still letting exim do the verp?
19:26:50 <tonyb> not that I'd block the change until that exists
19:26:56 <clarkb> corvus: yes I believe exim is doing verp
19:27:12 <clarkb> corvus: and disabling verp is one of the suggestions in the thread I found about improving this performance
19:27:33 <clarkb> however, it seems like we can try this first since it is less impactful to behavior (in theory anyway)
19:28:11 <corvus> k. exim doing verp should make things pretty fast, since if we're delivering 10k messages, and 5k are for gmail and 4k are for redhat and 1k are for everyone else (made up numbers), that should boil down to just a handful of outgoing smtp transactions for mailman.
19:29:02 <clarkb> corvus: makes sense. And ya I suspect the bottleneck may be between mailman and exim based on that thread (but I haven't profiled it locally)
19:29:07 <corvus> if, however, mailman is doing 5k smtp transactions to exim for gmail, then we've lost the advantage
19:29:29 <fungi> from what i gather, mailman 3 limits the number of addresses per message to the mta to a fairly small number in order to avoid some spam detection heuristics that may trigger when you have too many recipients
19:29:40 <corvus> (it should do, ideally, 1, but there are recipient limits, so maybe 10 total, 500 recipients each)
19:30:08 <fungi> i don't recall what the default is for sure, but think it may be something like 10 addresses per delivery
19:30:12 <corvus> okay, so there might be an opportunity to tune that, so that we can maximize the work exim does
19:30:43 <clarkb> sounds good. Do we think that is something to do instead of increasing the out runner instances or something to try next after this update?
19:30:49 <fungi> but yeah, the usual recommendation is to increase the number of threads mailman/django will use to submit messages to the mta so it doesn't just serialize and block
19:31:16 <corvus> my gut says try both, order doesn't matter
19:31:29 <clarkb> ack I guess we proceed with this and try the other thing too
19:31:35 <corvus> (non-exclusive)
19:31:37 <corvus> yep
19:32:01 <corvus> (sending multiple 500 recipient messages to exim in parallel is an ideal outcome)
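
A back-of-the-envelope illustration of the trade-off described above, using the made-up subscriber count from the discussion; the 500-recipient-per-message cap is an assumed value for the sketch.

    # If mailman hands exim large recipient batches and lets exim do VERP,
    # the number of SMTP transactions mailman performs shrinks dramatically.
    from math import ceil

    recipients = 10_000        # made-up total from the discussion
    max_recipients = 500       # assumed per-message recipient cap

    per_recipient = recipients                      # one message per recipient
    batched = ceil(recipients / max_recipients)     # batched submissions
    print(f"one message per recipient: {per_recipient} smtp transactions")
    print(f"batched ({max_recipients}/message): {batched} smtp transactions")
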
19:32:43 <clarkb> #topic OpenMetal Cloud Rebuild
19:33:10 <clarkb> The Inmotion/OpenMetal folks sent email recently calling out that they have updated their openstack cloud as a service tooling to deploy on new platforms and deploy a newer openstack version
19:33:35 <clarkb> they have volunteered to help out with the provisioning in the near future so I've been trying to prepare cleanup/shutdown of the existing cloud so that we can gracefully replace it
19:34:02 <clarkb> The hardware needs to be reused rather than setting up a new cloud adjacent to the old one which means shutting everything down first
19:34:08 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/921072 Nodepool cleanups
19:34:14 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921075 System-config cleanups
19:34:37 <clarkb> my goal is to get these changes landed over the next day or so as I'm meeting with Yuriys at 1pm Pacific time tomorrow to discuss further actions
19:35:06 <clarkb> I expect that nodepool will be in a dirty state (with stuck nodes and maybe stuck images) after the cleanup changes land. corvus pointed out the erase command which I'll use if it comes to that
19:35:19 <clarkb> Also if anyone else is interested in joining tomorrow just let me know. I can share the conference link
19:35:19 <tonyb> Happy to help with all/any of that
19:35:52 <clarkb> tonyb: will do. I think the immediate next step is to land the change which should clean up images in nodepool and see where that gets us
19:35:58 <clarkb> oh and the system-config change should be landable too
19:36:06 <corvus> ping me if there are nodepool cleanup probs
19:36:10 <clarkb> corvus: will do
19:36:17 <clarkb> then maybe tomorrow we land the final cleanup and do the erase step
19:37:02 <clarkb> I'm hopeful that by the end of the week or early next week we'll be able to spin up a new mirror and add the new cloud to nodepool under its new openmetal name
19:37:29 <tonyb> That seems doable :)
19:38:10 <corvus> ++
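
Once the cleanup changes land and nodepool records are erased, a quick audit of what is still left running in the old cloud can help confirm the teardown actually finished; a minimal openstacksdk sketch, with a placeholder clouds.yaml entry name since the real one isn't given here.

    # List any servers and images still present in the old cloud before the
    # hardware is handed back for the rebuild. The cloud name is assumed.
    import openstack

    conn = openstack.connect(cloud="old-inmotion-cloud")  # assumed clouds.yaml name
    leftover_servers = list(conn.compute.servers())
    leftover_images = list(conn.image.images())
    for server in leftover_servers:
        print("server:", server.name, server.status)
    for image in leftover_images:
        print("image:", image.name, image.status)
    print(f"{len(leftover_servers)} servers, {len(leftover_images)} images remaining")
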
19:38:30 <clarkb> #topic Testing Rackspace's New Cloud Product
19:39:05 <clarkb> Yesterday a ticket was opened in our rax nodepool account to let us know that their new openstack flex product is in a beta/testing/trial period and we could opt into testing it. They seem specifically interested in any feedback we may have
19:39:31 <clarkb> I think this is a good idea, but the info is a bit scarce so not sure if it is a good fit for us yet. I'm also pretty swamped with the other stuff going on so not sure I'll have time to run this down before I take that time off
19:39:52 <corvus> heh my first question is "are you sure?" :)
19:39:57 <corvus> you=rax
19:40:11 <clarkb> Wanted to call it out if anyone else wants to look into this more closely. The product is called openstack flex and it might be worth pinging cloudnull to get details or a referral to someone else with that info
19:40:28 <clarkb> corvus: ya I think the upside is we really do batter a cloud pretty good so if they take that data and feedback and improve with it then we're helping
19:40:34 <clarkb> at the same time we might make their lives miserable :)
19:40:35 <tonyb> I can take a stab at that if you'd like
19:41:04 <clarkb> tonyb: sure, I think the main thing is starting up some communication to see if this fits our use case and then go for it I guess
19:42:29 <clarkb> #topic Open Discussion
19:42:39 <clarkb> dansmith discovered today that centos 8 stream seems to have been deleted upstream
19:42:54 <clarkb> our mirrors have faithfully synced this state and now centos 8 stream jobs are breaking
19:43:03 <tonyb> Oh nice
19:43:23 <clarkb> I think that means we can probably put removing centos 8 stream nodes on the todo list for nodepool cleanups
19:43:34 <clarkb> and we can also remove the centos afs mirror as everything should live in centos-stream now
19:43:44 <fungi> the new rax service might be interesting if it performs better, is kvm-based, has stable support for nested kvm acceleration, etc
19:43:45 <clarkb> but all the content has been deleted already so that is mostly just bookkeeping
19:43:50 <clarkb> fungi: ++
19:44:26 <fungi> we do get a lot of users complaining about performance or lack of cpu features in our current rax nodes
19:45:38 <tonyb> I can find out more
19:45:54 <tonyb> I assume I need to do that via the Account/webUI?
19:46:36 <tonyb> I can also ping cloudnull for an informal chat
19:46:38 <clarkb> tonyb: that would be one approach though it might get us to the first level of support first
19:46:47 <clarkb> an informal chat might be better if cloudnull has time to at least point us at someone else
19:48:01 <tonyb> Okay
19:50:18 <clarkb> sounds like that might be everything
19:50:24 <clarkb> thank you for your time
19:50:28 <clarkb> #endmeeting