19:00:07 <clarkb> #startmeeting infra
19:00:07 <opendevmeet> Meeting started Tue Sep 10 19:00:07 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:07 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:07 <opendevmeet> The meeting name has been set to 'infra'
19:00:21 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/LIHCPHNOQTZLO26CYOYY6Y5LNGOOB2CV/ Our Agenda
19:00:58 <clarkb> #topic Announcements
19:01:06 <NeilHanlon> o/
19:01:10 <fungi> ohai
19:01:33 <clarkb> This didn't make it to the agenda because it was made official earlier today but https://www.socallinuxexpo.org/scale/22x/events/open-infra-days is happening in March 2025. You have until November 1 to get your cfp submissions in if interested
19:02:14 <clarkb> also not really an announcement, but I suspect my scheduling will be far more normal at this point. School started again, family returned to their homes, and I'm back from the asia summit
19:03:14 <clarkb> #topic Upgrading Old Servers
19:03:29 <clarkb> It's been a little while since I reviewed the wiki config management changes, any updates there?
19:04:25 <clarkb> looks like no new patchsets since my last review
19:04:45 <clarkb> fungi deployed a new mirror in the new raxflex region and had to make a small edit to the dns keys package installation order
19:04:58 <clarkb> are there any other server replacement/upgrade/addition items to be aware of?
19:06:41 <clarkb> sounds like no. Let's move on and we can get back to this later if necessary
19:06:49 <clarkb> #topic AFS Mirror Cleanups
19:07:09 <clarkb> When we updated the prepare-workspace-git role to do more of its work in python rather than ansible tasks we did end up breaking xenial due to the use of f-strings
19:07:19 <clarkb> I pointed out that I have a change to remove xenial testing from system-config
19:07:41 <clarkb> https://review.opendev.org/c/opendev/system-config/+/922680 but rather than merge that we dropped the use of f-strings in prepare-workspace-git
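(For context: xenial's default python3 is 3.5, which predates f-strings, so any role code using them fails there with a SyntaxError. A minimal sketch of the kind of fallback involved, not the actual prepare-workspace-git code, might look like:)

    # Hypothetical illustration only -- not the actual prepare-workspace-git code.
    # Python 3.5 (xenial) raises SyntaxError on f-strings, which were added in
    # Python 3.6, so 3.5-compatible code falls back to str.format().
    repo = "opendev/system-config"
    dest = "/home/zuul/src"

    # Breaks on Python 3.5:
    #   msg = f"cloning {repo} into {dest}"

    # Works on Python 3.5 and later:
    msg = "cloning {} into {}".format(repo, dest)
    print(msg)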
19:07:56 <clarkb> anyway calling that out as this sort of cleanup is a necessary precursor to removing the content in the mirrors
19:08:02 <clarkb> other than that I don't have any updates here
19:08:53 <clarkb> #topic Rackspace Flex Cloud
19:09:14 <fungi> seems to be working well
19:09:34 <fungi> though our ram quota is only enough for 32 of our standard nodes for now
19:09:37 <clarkb> yup since we last met fungi et al have set this cloud up in nodepool and we are running with max-servers set to 32 utilizing all of our quota there
19:10:10 <frickler> but cardoe also offered to bump our quota if we want to go further
19:11:07 <clarkb> ya I think the next step is to put together an email to rax folks with what we've learned (the ssh keyscan timeout thing is weird but also seems to be mitigated?)
19:11:27 <corvus> (or just not happening today)
19:11:29 <clarkb> and then they can decide if they are interested in bumping our quota and if so on what schedule
19:11:36 <fungi> hard to be sure, yep
19:12:02 <corvus> the delay between keyscan and ready we noted earlier today could be a nodepool problem, or it could be a symptom of some other issue; it's probably worth analyzing a bit more
19:12:37 <corvus> that doesn't currently rise to the level of "problem" only "weird"
19:12:46 <corvus> (but weird can be a precursor to problem)
19:12:47 <clarkb> right, it's information not accusation
19:13:24 <clarkb> anyway thank you everyone for putting this together I ended up mostly busy with travel and other things so wasn't as helpful as I'd hoped
19:13:49 <clarkb> and it seems to be working. Be aware that some jobs may notice the use of floating IPs like swift did. They were trying to bind to the public fip address which isn't configured on the servers and that failed
19:14:14 <clarkb> switching over to 0.0.0.0/127.0.0.1 or ipv6 equivalents would work as would using the private host ip (I think the private ip is what swift switched to)
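(For context: with floating IPs the public address is NATed by neutron and never appears on the instance's interfaces, so binding it fails. A small sketch of that failure mode, using a documentation address rather than a real fip:)

    import socket

    # Sketch of the failure mode; 203.0.113.10 is a documentation address standing
    # in for a floating IP that is NATed and not configured on the instance.
    public_fip = "203.0.113.10"

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((public_fip, 0))
    except OSError as err:
        # EADDRNOTAVAIL: the address isn't present on any local interface
        print("binding the floating IP fails:", err)
    finally:
        sock.close()

    # Binding the wildcard (or the instance's private fixed IP) works instead:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("0.0.0.0", 0))
    print("bound to", sock.getsockname())
    sock.close()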
19:14:46 <fungi> oh, something else worth pointing out, we resized/renumbered the internal address cidr
19:14:57 <fungi> from /24 to /20
19:15:02 <clarkb> I can work on a draft email in an etherpad after lunch if we want to capture the things we noticed (like the ephemeral drive mounting) and the networking
19:15:17 <clarkb> fungi: did we do that via the cloud launcher stuff or in the cloud directly?
19:15:31 <fungi> so the addresses now may be anywhere in the range of 10.0.16.0-10.0.31.255
19:15:39 <fungi> cloud launcher config
19:15:47 <corvus> why?  (curious)
19:15:48 <fungi> in the inventory
19:15:58 <fungi> in case we get a quota >250 nodes
19:16:07 <fungi> frickler's suggestion
19:16:26 <fungi> easier to resize it before it was in use
19:16:56 <corvus> oh this is on a subnet we make
19:16:58 <fungi> since it involves deleting the network/router/interface from neutron
19:17:04 <fungi> yep
19:17:12 <corvus> makes sense thx
19:17:17 <fungi> there's no shared provider net (hence the fips)
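(For context: a quick check of what the resize buys. The /20 prefix is inferred from the 10.0.16.0-10.0.31.255 range quoted above, and the old /24's exact prefix is a guess for comparison:)

    import ipaddress

    # 10.0.16.0/20 is inferred from the quoted range; the old /24 prefix is assumed.
    old = ipaddress.ip_network("10.0.16.0/24")
    new = ipaddress.ip_network("10.0.16.0/20")

    print(new[0], "-", new[-1])                               # 10.0.16.0 - 10.0.31.255
    print("usable hosts in the /24:", old.num_addresses - 2)  # 254, tight for >250 nodes
    print("usable hosts in the /20:", new.num_addresses - 2)  # 4094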
19:17:25 <clarkb> anything else re rax flex?
19:17:32 <fungi> i didn't have anything
19:17:35 <clarkb> #topic Etherpad 2.2.4 Upgrade
19:17:42 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/926078 Change implementing the upgrade
19:18:11 <clarkb> tl;dr here is we are on 2.1.1. 2.2.2 broke ep_headings plugin so we switched to ep_headings2 which appears to work with our old markup (thank you fungi for organizing that testing)
19:18:36 <fungi> yeah, proposed upgrade lgtm
19:18:46 <clarkb> the current release is 2.2.4 and I think we should make a push to upgrade to that version. The first step is in the parent to 926078 where we'll tag a 2.1.1 etherpad image so that we can rollback if the headings plugin explodes spectacularly
19:19:04 <fungi> also changelog makes a vague mention of security fixes, so probably should take it as somewhat urgent
19:19:24 <clarkb> maybe after the meeting we can land that first change and make sure we get the fallback image tagged and all that. Then do the upgrade afterwards or first thing tomorrow?
19:19:29 <fungi> and we should remember to test meetpad/jitsi integration after we upgrade
19:19:35 <clarkb> ++
19:19:45 <fungi> especially with the ptg coming up next month
19:19:56 <clarkb> ya they changed some paths which required editing the mod rewrite rules in apache
19:20:05 <clarkb> it's theoretically possible that meetpad would also be angry about the new paths
19:21:03 <clarkb> reviews and assistance welcome. Let me know if you have any concerns that I need to address before we proceed
19:21:41 <clarkb> #topic OSUOSL ARM Cloud Issues
19:22:04 <clarkb> The nodepool builder appears to have successfully built images for arm things
19:22:11 <clarkb> that resolves one issue there
19:22:38 <clarkb> As for job slowness things appear to have improved a bit but aren't consistently good. The kolla image build job does succeed occasionally now but does also still timeout a fair bit
19:22:47 <frickler> also ramereth did some ram upgrades iiuc
19:22:48 <fungi> seems like Ramereth's upgrade work helped the slow jobs
19:22:49 <clarkb> I suspect that these improvements are due to upgraded hardware that they have been rolling in
19:22:52 <clarkb> ya
19:23:02 <clarkb> there was also a ceph disk that was sad and got replaced?
19:23:14 <fungi> yep
19:23:38 <clarkb> So anyway not completely resolved but actions being taken appear to be improving the situation. I'll probably drop this off of the next meeting's agenda but I wanted to followup today and thank those who helped debug and improve things. Thank you!
19:24:00 <clarkb> and if things start trending in the wrong direction again let us know and we'll see what we can do
19:24:02 <fungi> though getting a second arm provider again would still be good
19:24:16 <fungi> even if just for resiliency
19:24:29 <clarkb> ++
19:24:50 <fungi> though Ramereth did seem to imply that we could ask for more quota in osuosl too
19:24:58 <fungi> if we decide we need it
19:25:09 <clarkb> also worth noting that most jobs are fine on that cloud
19:25:34 <clarkb> the issues specifically appear to be io related, so jobs doing a lot of io like kolla image builds were hit with slowness but unittests and similar run in reasonable amounts of time
19:26:34 <clarkb> #topic Updating ansible+ansible-lint versions in our repos
19:27:03 <clarkb> Now that we have updated centos 9 arm images I was able to get the ozj change related to this passing
19:27:08 <clarkb> #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970
19:27:21 <clarkb> There is also a project-config change but it has passed all along
19:27:23 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/926848
19:27:54 <clarkb> neither of these changes are particularly fun reviews but hopefully we can get them done then not bother with this stuff too much for another few years
19:28:05 <clarkb> and they are largely mechanical updates
19:28:24 <clarkb> do be on the lookout for any potential behavior changes in those updates because there shouldn't be any
19:28:50 <clarkb> getting this done will get these jobs running on noble too which is nice
19:29:27 <clarkb> #topic Zuul-launcher Image Builds
19:29:49 <clarkb> zuul-launcher (zuul's internal daemon for managing historically nodepool related things) has landed
19:30:11 <fungi> yay!
19:30:15 <clarkb> the next step in proving out that all works is to get opendev building test node images using jobs and zuul-launcher to upload them rather than nodepool itself
19:30:29 <clarkb> There are dib roles in zuul-jobs to simplify the job side
19:30:52 <clarkb> corvus has a question about where the resulting jobs and configs should live for this
19:30:55 <corvus> i wanted to discuss where we might put those image jobs
19:31:05 <clarkb> my hunch is that opendev/project-config is the appropriate location
19:31:28 <clarkb> since we're building these images for use across opendev and want to have control of that and this would be one less thing to migrate out of openstack/project-config if we manage it there
19:31:30 <corvus> yes -- but if we put them in an untrusted-project, then speculative changes would be possible
19:32:19 <clarkb> and you're suggesting that would be useful in this situation?
19:32:33 <corvus> i don't think we have any untrusted central repos like that (like openstack-zuul-jobs is for openstack)
19:32:45 <clarkb> correct opendev doesn't currently have a repo set up that way
19:32:54 <corvus> perhaps!  no one knows yet... but it certainly was useful when i was working on the initial test job
19:33:18 <clarkb> but if that is useful for image builds then maybe it should. Something like opendev/zuul-configs ?
19:33:19 <corvus> it would let people propose changes to dib settings and see if they work
19:33:27 <clarkb> (that name is maybe not descriptive enough)
19:34:22 <corvus> yeah, we can do something like that, or go for something with "image" in the name if we want to narrow the scope a bit
19:34:26 <clarkb> thinking out loud here: if we had an untrusted base-jobs-like repo in opendev we could potentially use that to refactor the base jobs a bit to do less work
19:35:08 <fungi> that's basically the model recommended in zuul's pti document
19:35:08 <clarkb> I know this is something we've talked about for ages but never done it because the current system mostly works, but if there are other reasons to split the configs into trusted and untrusted then maybe we'll finally do that refactoring too
19:35:49 <fungi> back when we were operating on a single zuul tenant that's what we used the openstack/openstack-zuul-jobs repo for
19:36:03 <fungi> and we still have some stuff in there that's not openstack-specific
19:36:25 <corvus> yeah -- though a lot of that thinking was originally related to base role testing, and we've pretty much got that nailed down now, so it's not quite as important as it was.  but it's still valid and could be useful, and if we had it then i probably would have said lets start there and not started this conversation :)
19:36:26 <clarkb> in any case a new repo with a better name than the one I suggested makes sense to me. It can house stuff a level above base jobs as well as image build things
19:36:55 <corvus> so if we feel like "generic opendev untrusted low-level jobs" is a useful repo to have then sure let's start there :)
19:37:10 <clarkb> ++
19:37:24 <corvus> incidentally, there is a related thing:
19:38:16 <corvus> not only will we define the image build jobs as we just discussed, but we will have zuul configuration objects for the images themselves, and we want to include those in every tenant.  we will eventually want to decide if we want to put those in this repo, or some other repo dedicated to images.
19:38:44 <corvus> we can kick that down the road a bit, and move things around later.  and the ability to include and exclude config objects in each tenant gives us a lot of flexibility
19:39:03 <corvus> but i just wanted to point that out as a distinct but related concern as we start to develop our mental model around this stuff
19:39:07 <clarkb> if we use this repo for "base" job stuff then we may end up wanting to include it everywhere for jobs anyway
19:39:16 <corvus> yep
19:39:17 <clarkb> yup its a good callout since they are distinct config items
19:39:57 <corvus> we will definitely only want to run the image build jobs in one tenant, so if we put them in the "untrusted base job" repo, we will only include the project stanza in the opendev tenant
19:40:13 <corvus> i think that would be normal behavior for that repo anyway
19:40:53 <corvus> okay, i think that's enough to proceed; i'll propose a new repo and then sync up with tonyb on making jobs
19:41:20 <fungi> thanks!
19:41:21 <clarkb> sounds good thanks
19:41:27 <clarkb> #topic Open Discussion
19:41:33 <clarkb> Anything else before we end the meeting?
19:42:08 <NeilHanlon> i have something
19:42:15 <clarkb> go for it
19:42:25 <NeilHanlon> wanted to discuss adding rocky mirrors, following up from early august
19:43:54 <clarkb> this was brought up in the context of I think kolla and the two things I suggested to start there are 1) determine if the jobs are failing due to mirror/internet flakiness (so that we're investing effort where it has most impact) and 2) determine how large the mirror content is to determine if it will fit in the existing afs filesystem
19:44:01 <clarkb> https://grafana.opendev.org/d/9871b26303/afs?orgId=1 shows current utilization
19:45:18 <clarkb> hrm the graphs don't show us free space just utilization and used space (which we can use to math out free space; must be 5tb total or about 1.2tb currently free?)
19:45:42 <fungi> related to utilization, openeuler is already huge/full and is asking to add an additional major release
19:45:44 <fungi> #link https://review.opendev.org/927462 Update openEuler mirror repo
19:46:06 <clarkb> then assuming the mirrors will help and we've got space we just need to add an rsync mirroring script for rocky (as well as the afs volume bits I guess)
19:46:31 <clarkb> fungi: in openEuler's case I think we replace the old with new instead of carrying both
19:46:52 <clarkb> fungi: the old version was never really used much aiui and there probably isn't much reason to double the capacity needs there
19:47:10 <clarkb> NeilHanlon: do you know how big the mirror size would be for rockylinux 9?
19:47:12 <NeilHanlon> yeah, it came up again in OSA a few weeks back, not directly related to jobs, but not unrelated, either
19:47:29 <clarkb> NeilHanlon: I guess we could also do x86 or arm or both so that may be more than one number
19:47:42 <fungi> yeah, wrt openeuler i was hinting at that as well, though the author left a comment about some existing jobs continuing to rely on the old version
19:48:23 <clarkb> fungi: right I'm saying delete those jobs like we deleted centos 7 and opensuse and hopefully soon xenial jobs :)
19:48:43 <clarkb> it's not ideal but pruning like that enables us to do more interesting things, so rather than try to keep all things alive with minimal use we just need to move on
19:48:55 <clarkb> NeilHanlon: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update this is what the centos stream sync script looks like
19:49:29 <NeilHanlon> 8 and 9, with images, it's about 800GiB; w/o, less.. and with only x86/arm and no debuginfo/source... probably under 300
19:49:39 <clarkb> NeilHanlon: it should be fairly consistent rpm mirror to rpm mirror. We just need to tailor the source location and rsync excludes to the specific mirror
19:49:58 <NeilHanlon> it should be nominally the same as whatever the c9s one is, really
19:50:11 <clarkb> NeilHanlon: ya we won't mirror images or debuginfo/source stuff. The thought there is you can fetch those when necessary from the source but the vast majority of ci jobs don't need that info
19:50:23 <NeilHanlon> right, makes sense
19:50:29 <clarkb> NeilHanlon: oh that's good input, c9s is currently 303GB
19:50:37 <clarkb> (from that grafana dashboard I linked above)
19:51:27 <clarkb> so napkin math says we can fit a rockylinux mirror
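(Spelling out that napkin math with the approximate figures mentioned above:)

    # Napkin math using the approximate numbers from the discussion above.
    afs_total_gb = 5000        # ~5TB total AFS capacity (approximate)
    afs_free_gb = 1200         # ~1.2TB currently free (approximate)
    rocky_estimate_gb = 300    # Rocky 8+9, x86/arm only, no images/debuginfo/source

    print("headroom after adding rocky: ~{} GB".format(afs_free_gb - rocky_estimate_gb))
    # roughly 900GB would remain free, so the mirror should fit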
19:52:03 <clarkb> a good next step would be to just confirm that the failures we're seeing are expected to be addressed by a mirror then push up a change to add the syncing script
19:52:05 <NeilHanlon> excellent :) I do agree we should have some firm "evidence" it's needed, even making the assumption that it will fix some percentage of test failures
19:52:31 <clarkb> an opendev admin will need to create the afs volume which they can do as part of the coordination for merging the change that adds the sync script
19:52:45 <NeilHanlon> that 'some percentage' should be quantifiable, somehow. I'll take an action to try and scrape some logs for commonalities
19:53:18 <clarkb> NeilHanlon: I think even if it is "in the last week we had X failures related to mirror download failures" that would be sufficient. Just enough to know we're actually solving a problem and not simply assuming it will help
19:53:41 <NeilHanlon> roger
19:53:48 <clarkb> (because one of the reasons I didn't want to mirror rocky initially was to challenge the assumption we've had that it is necessary; if evidence proves it is still necessary then let's fix that)
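(A rough sketch of the kind of log scrape volunteered above. The directory layout and error patterns here are hypothetical placeholders, not the real job log store or exact failure messages:)

    import pathlib
    import re

    # Hypothetical patterns for mirror/download-style failures in job logs.
    patterns = re.compile(
        r"(Failed to download|Cannot download|Curl error|Timeout was reached|"
        r"No more mirrors to try)", re.IGNORECASE)

    log_root = pathlib.Path("failed-job-logs")   # hypothetical local copy of failed-job logs
    hits = 0
    for logfile in log_root.rglob("*.txt"):
        text = logfile.read_text(errors="replace")
        if patterns.search(text):
            hits += 1

    print("jobs with mirror/download-style failures:", hits)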
19:54:14 <NeilHanlon> in fairness, I think it's gone really well all things considered, though I don't know that.. just a feeling :)
19:55:09 <NeilHanlon> alright, well, thank you. I've got my marching orders 🫡
19:55:25 <clarkb> fungi: openeuler doesn't appear to have a nodeset in opendev/base-jobs as just a point of reference
19:55:54 <clarkb> fungi: and codesearch doesn't show any obvious CI usage of that platform
19:56:12 <clarkb> so ya I'd like to see evidence that we're going to break something important before we do a replacement rather than a side by side rollover
19:56:52 <clarkb> just a few more minutes, last call for anything else
19:58:04 <fungi> looks like kolla dropped openeuler testing this cycle
19:58:29 <fungi> per a comment in https://review.opendev.org/927466
19:59:08 <fungi> though kolla-ansible stable branches may be running jobs on it
20:00:09 <clarkb> and we are at time
20:00:12 <clarkb> Thank you everyone
20:00:18 <clarkb> we'll be back here same time and location next week
20:00:20 <clarkb> #endmeeting