19:00:39 <clarkb> #startmeeting infra
19:00:39 <opendevmeet> Meeting started Tue Jun  4 19:00:39 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:39 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:39 <opendevmeet> The meeting name has been set to 'infra'
19:00:46 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/JH2FNAUEA32AS4GH475AHYEPLP4FUGPE/ Our Agenda
19:01:09 <clarkb> #topic Announcements
19:01:23 <clarkb> I will not be able to run the meeting on June 18
19:01:39 <clarkb> that is two weeks from today. More than happy for someone else to take over or we can skip if people prefer that
19:02:17 <tonyb> I'll be back in Australia by then so I probably can't run it
19:02:42 <clarkb> I will be afk from the 14th-19th
19:03:18 <tonyb> okay.  I'll be travelling for part of that.
19:03:22 <clarkb> we can sort out the plan next week. Plenty of time
19:03:23 <tonyb> poor fungi
19:03:29 <tonyb> sounds good
19:03:43 <clarkb> #topic Upgrading old servers
19:03:48 <clarkb> #link https://etherpad.opendev.org/p/opendev-mediawiki-upgrade
19:03:54 <clarkb> I think this has been the recent focus of this effort
19:04:00 <tonyb> yup
19:04:05 <clarkb> looks like there was a change just pushed too, I should #link that
19:04:18 <tonyb> there are some things to read about the plan etc
19:04:19 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921321 Mediawiki deployment for OpenDev
19:05:00 <tonyb> reviews very welcome. It works for local testing but needs more comprehensive testing
19:05:04 <clarkb> tonyb: anything specific in the plan etherpad you want us to be looking at or careful about?
19:05:50 <tonyb> There is a small announcement that could do with a review
19:05:50 <clarkb> tonyb: ya I think if we can get something close enough to migrate data into, we can live with small issues like theming not working and plan a cutover. I assume we'll want to shut down the old wiki so we don't have content divergence?
19:06:01 <tonyb> as I'd like to get that out reasonably soon
19:06:32 <clarkb> #link https://etherpad.opendev.org/p/opendev-wiki-announce Announcement for wiki changes
19:06:35 <clarkb> that one I assume
19:06:50 <tonyb> Yes we'll shutdown the current server ASAP
19:07:39 <tonyb> There is plenty of planning stuff like moving away from rax-trove etc but I'm pretty happy with the progress
19:07:58 <clarkb> yup I think this is great. I'm planning to dig into it more today after meetings and lunch
19:08:03 <tonyb> the bare bones upgrade of the host OS is IMO a solid improvement
19:08:27 <tonyb> I'll try a noble server this week
19:08:46 <tonyb> I think that's pretty much all there is to say for server upgrades
19:08:53 <clarkb> thanks
19:09:00 <clarkb> #topic AFS Mirror Cleanup
19:09:10 <clarkb> Not much new here other than that devstack-gate has been fully retired
19:09:26 <clarkb> I've been distracted by many other things like gerrit upgrades and gitea upgrades and cloud shutdowns which we'll get to shortly
19:09:37 <clarkb> #topic Gerrit 3.9 Upgrade
19:09:57 <clarkb> This happened. It seemed to go well. People are even noticing some of the new features like suggested edits
19:10:13 <clarkb> Has anyone else seen or heard of any issues since the upgrade?
19:10:26 <clarkb> I guess there was the small gertty issue which is resolvable by starting with a new sqlite db
19:11:13 <tonyb> That's all I'm aware of
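For anyone else hitting that gertty issue, a minimal sketch of the workaround, assuming gertty's default database location (~/.gertty.db, overridable via dbpath in ~/.gertty.yaml):

    # Move the old database aside rather than deleting it, just in case.
    mv ~/.gertty.db ~/.gertty.db.bak
    # On the next start gertty creates a fresh database and resyncs the
    # subscribed projects from Gerrit.
    gertty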
19:11:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920938 is ready when we think we are unlikely to revert
19:12:07 <clarkb> I'm going to go ahead and remove my WIP vote on that change since we aren't aware of any problems
19:13:21 <tonyb> I'm happy to go ahead whenever
19:13:22 <clarkb> That change has a few children which add the 3.10 image builds and testing. The testing change seems to error when trying to pull the 3.10 image, which I thought should be in the intermediate ci registry. But I wonder if it is because the tag doesn't exist at all
19:13:45 <clarkb> corvus: ^ just noting that in case you've seen it like does the tooling try to fetch from docker hub proper and then fail even if it could find stuff locally?
19:14:11 <clarkb> in any case that isn't urgent but getting some early feedback on whether or not the next release works and is upgradeable is always good (particularly after the last upgrade was broken initially)
19:14:58 <clarkb> thank you to everyone that helped with the upgrade. Despite my concerns about feeling underprepared, things went smoothly, which probably speaks to how prepared we actually were
19:15:08 <clarkb> #topic Gitea 1.22 Upgrade
19:15:44 <clarkb> I was hoping there would be a 1.22.1 by now but as far as I can tell there isn't. I'm also likely going to put this on the back burner for the immediate future as I've got several other more time sensitive things to worry about before I take that time off
19:16:27 <tonyb> That's fair.
19:16:49 <clarkb> That said I think the next step is going to be getting the upgrade change into shape so if people can review that it is still helpful
19:17:04 <clarkb> then we can upgrade, and then do the doctor tooling one by one on our backend nodes
19:17:15 <tonyb> Whatever works, if we decide we need it then we're pretty prepared thanks to your work
19:17:37 <clarkb> testing of the doctor tool seems to indicate running it is straightforward. We should be able to do it with minimal impact taking one node out of service and doctoring it and doing that in a loop
19:17:51 <clarkb> tonyb: ya I think reviews are the most useful thing at this point for the gitea upgrade
19:17:58 <clarkb> since the initial pass is working as is the doctor tool in testing
19:18:56 <tonyb> Cool
19:19:00 <clarkb> #topic Fixing Cloud Launcher Ansible
19:19:30 <clarkb> frickler fixed up the security groups in the osuosl cloud, but then almost immediately we ran into the git permissions/trust issue that is a side effect of recent git security fixes
19:19:50 <clarkb> This appears to be our only infra-prod-* job affected by the git updates so overall pretty good
19:20:05 <clarkb> for fixing cloud launcher I went ahead and wrote a change that trusts the ansible role repos when we clone them
19:20:07 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921061 workaround the Git issue
19:20:19 <clarkb> I did find that you can pass config options to git clone but they only apply after most of the cloning is complete
19:20:34 <clarkb> so I'm like 99% sure that doing that won't work for this case as the perms check occurs before cloning begins
19:21:21 <tonyb> Yeah it'd be nice if the options applied "earlier" but I think what you've done is good for now
19:21:36 <clarkb> One thing to keep in mind reviewing that change is I'm pretty sure we don't have any real test coverage for it
19:21:50 <clarkb> so careful review is a good idea and tonyb already caught one big issue with it (now fixed)
19:22:10 <tonyb> The general git safe.directory doesn't have a one size fits all solution
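For context on 921061, a rough sketch of the workaround being described, using a hypothetical role checkout path; the actual paths and mechanism in the change may differ:

    # Mark the cloned ansible role repo as trusted despite the ownership mismatch.
    # As far as I know, safe.directory is only honored from global/system git
    # config (repo-local settings are ignored), which is also why passing it at
    # clone time via `git clone -c` is unlikely to help here.
    git config --global --add safe.directory /etc/ansible/roles/example-role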
19:22:47 <corvus> clarkb: (sorry i'm late) i don't think lack of image in dockerhub should be a problem; that may point to a bug/flaw
19:23:19 <clarkb> corvus: ok, I feel like this has happened before and I've double checked the dependencies and provides/requires and I think everything is working in the order I would expect
19:23:44 <clarkb> and I just wondered if using a new tag is part of the issue since 3.10 doesn't exist anywhere yet but a :latest would be present typically
19:24:04 <clarkb> I guess I'll look more closely
19:24:12 <clarkb> #topic Increase Mailman 3 Out Runner Count
19:24:20 <corvus> yeah; lemme know if you want me to dig into details with you
19:24:23 <clarkb> corvus: thanks
19:24:43 <clarkb> I don't know if anyone else has noticed but recently I had some confusion over emails being in the openstack-discuss list archive and not in my inbox thinking there was some problem
19:24:59 <clarkb> the issue was that delivery for that list takes time. Upwards of 10 minutes.
19:25:12 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920765 Increase the number of out runners to speed up outbound mail delivery
19:25:36 <clarkb> In that change I linked to a discussion about similar problems another mail list host had and ultimately the biggest impact for them was increasing the number of runner instances
19:25:59 <tonyb> Makes sense to me.
19:26:01 <clarkb> I've gone ahead and proposed we do the same. This isn't super critical but I think it will help avoid confusion in the future and keep discussions moving without unnecessary delay
19:26:33 <tonyb> I'm cool with your fix but it'd be cool to have a way to verify it has the intended impact
19:26:45 <corvus> are we still letting exim do the verp?
19:26:50 <tonyb> not that I'd block the change until that exists
19:26:56 <clarkb> corvus: yes I believe exim is doing verp
19:27:12 <clarkb> corvus: and disabling verp is one of the suggestions in the thread I found about improving this performance
19:27:33 <clarkb> however, it seems like we can try this first since it is less impactful to behavior (in theory anyway)
19:28:11 <corvus> k.  exim doing verp make things pretty fast, since if we're delivering 10k messages, and 5k are for gmail and 4k are for redhat and 1k are for everyone else (made up numbers), that should boil down to just a handful of outgoing smtp transactions for mailman.
19:28:25 <corvus> s/make things/should make things/
19:29:02 <clarkb> corvus: makes sense. And ya I suspect the bottleneck may be between mailman and exim based on that thread (but I haven't profiled it locally)
19:29:07 <corvus> if, however, mailman is doing 5k smtp transactions to exim for gmail, then we've lost the advantage
19:29:29 <fungi> from what i gather, mailman 3 limits the number of addresses per message to the mta to a fairly small number in order to avoid some spam detection heuristics that may trigger when you have too many recipients
19:29:40 <corvus> (it should do, ideally, 1, but there are recipient limits, so maybe 10 total, 500 recipients each)
19:30:08 <fungi> i don't recall what the default is for sure, but think it may be something like 10 addresses per delivery
19:30:12 <corvus> okay, so there might be an opportunity to tune that, so that we can maximize the work exim does
19:30:43 <clarkb> sounds good. Do we think that is something to do instead of increasing the out runner instances or something to try next after this update?
19:30:49 <fungi> but yeah, the usual recommendation is to increase the number of threads mailman/django will use to submit messages to the mta so it doesn't just serialize and block
19:31:16 <corvus> my gut says try both, order doesn't matter
19:31:29 <clarkb> ack I guess we proceed with this and try the other thing too
19:31:35 <corvus> (non-exclusive)
19:31:37 <corvus> yep
19:32:01 <corvus> (sending multiple 500 recipient messages to exim in parallel is an ideal outcome)
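For reference, a rough sketch of the kind of mailman.cfg tuning under discussion; the values are illustrative rather than what 920765 actually sets, and the [mta] option name is from memory of Mailman 3's schema, so treat it as an assumption:

    [runner.out]
    # Run several outgoing runners in parallel; Mailman slices the hash space
    # across instances, so this is expected to be a power of 2.
    instances: 4

    [mta]
    # Cap on recipients per SMTP transaction handed to exim (option name per
    # my recollection of schema.cfg; verify before changing it).
    max_recipients: 500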
19:32:43 <clarkb> #topic OpenMetal Cloud Rebuild
19:33:10 <clarkb> The Inmotion/OpenMetal folks sent email recently calling out that they have updated their openstack cloud as a service tooling to deploy on new platforms and deploy a newer openstack version
19:33:35 <clarkb> they have volunteered to help out with the provisioning in the near future so I've been trying to prepare cleanup/shutdown of the existing cloud so that we can gracefully replace it
19:34:02 <clarkb> The hardware needs to be reused rather than setting up a new cloud adjacent to the old one which means shutting everything down first
19:34:08 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/921072 Nodepool cleanups
19:34:14 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/921075 System-config cleanups
19:34:37 <clarkb> my goal is to get these changes landed over the next day or so as I'm meeting with Yuriys at 1pm Pacific time tomorrow to discuss further actions
19:35:06 <clarkb> I expect that nodepool will be in a dirty state (with stuck nodes and maybe stuck images) after the cleanup changes land. corvus  pointed out the erase command which I'll use if it comes to that
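For the record, a sketch of the kind of cleanup being described, assuming the provider is still named inmotion in the nodepool config and that the erase subcommand is available in our nodepool version; double check both before running anything:

    # See what is still recorded for the provider after the config removal lands.
    nodepool list | grep inmotion
    nodepool image-list | grep inmotion
    # If nodes or images are stuck, wipe the provider's ZooKeeper data.
    nodepool erase inmotion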
19:35:19 <clarkb> Also if anyone else is interested in joining tomorrow just let me know. I can share the conference link
19:35:19 <tonyb> Happy to help with all/any of that
19:35:52 <clarkb> tonyb: will do. I think the immediate next step is to land the change which should clean up images in nodepool and see where that gets us
19:35:58 <clarkb> oh and the system-config change should be landable too
19:36:06 <corvus> ping me if there are nodepool cleanup probs
19:36:10 <clarkb> corvus: will do
19:36:17 <clarkb> then maybe tomorrow we land the final cleanup and do the erase step
19:37:02 <clarkb> I'm hopeful that by the end of the week or early next week we'll be able to spin up a new mirror and add the new cloud to nodepool under its new openmetal name
19:37:29 <tonyb> That seems doable :)
19:38:10 <corvus> ++
19:38:30 <clarkb> #topic Testing Rackspace's New Cloud Product
19:39:05 <clarkb> Yesterday a ticket was opened in our rax nodepool account to let us know that their new openstack flex product is in a beta/testing/trial period and we could opt into testing it. They seem specifically interested in any feedback we may have
19:39:31 <clarkb> I think this is a good idea, but the info is a bit scarce so not sure if it is a good fit for us yet. I'm also pretty swamped with the other stuff going on so not sure I'll have time to run this down before I take that time off
19:39:52 <corvus> heh my first question is "are you sure?" :)
19:39:57 <corvus> you=rax
19:40:11 <clarkb> Wanted to call it out if anyone else wants to look into this more closely. The product is called openstack flex and it might be worth pinging cloudnull to get details or a referral to someone else with that info
19:40:28 <clarkb> corvus: ya I think the upside is we really do batter a cloud pretty good so if they take that data and feedback and improve with it then we're helping
19:40:34 <clarkb> at the same time we might make their lives miserable :)
19:40:35 <tonyb> I can take a stab at that if you'd like
19:41:04 <clarkb> tonyb: sure, I think the main thing is starting up some communication to see if this fits our use case and then go for it I guess
19:42:29 <clarkb> #topic Open Discussion
19:42:39 <clarkb> dansmith discovered today that centos 8 stream seems to have been deleted upstream
19:42:54 <clarkb> our mirrors have faithfully synced this state and now centos 8 stream jobs are breaking
19:43:03 <tonyb> Oh nice
19:43:23 <clarkb> I think that means we can probably put removing centos 8 stream nodes on the todo list for nodepool cleanups
19:43:34 <clarkb> and we can also remove the centos afs mirror as everything should live in centos-stream now
19:43:44 <fungi> the new rax service might be interesting if it performs better, is kvm-based, has stable support for nested kvm acceleration, etc
19:43:45 <clarkb> but all the content has been deleted already so that is mostly just bookkeeping
19:43:50 <clarkb> fungi: ++
19:44:26 <fungi> we do get a lot of users complaining about performance or lack of cpu features in our current rax nodes
19:45:38 <tonyb> I can find out more
19:45:54 <tonyb> I assume I need to do that via the Account/webUI?
19:46:36 <tonyb> I can also ping cloudnull for an informal chat
19:46:38 <clarkb> tonyb: that would be one approach though it might get us to the first level of support first
19:46:47 <clarkb> an informal chat might be better if cloudnull has time to at least point us at someone else
19:48:01 <tonyb> Okay
19:50:18 <clarkb> sounds like that might be everything
19:50:24 <clarkb> thank you for your time
19:50:28 <clarkb> #endmeeting