19:00:07 #startmeeting infra
19:00:07 Meeting started Tue Sep 10 19:00:07 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:07 The meeting name has been set to 'infra'
19:00:21 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/LIHCPHNOQTZLO26CYOYY6Y5LNGOOB2CV/ Our Agenda
19:00:58 #topic Announcements
19:01:06 o/
19:01:10 ohai
19:01:33 This didn't make it to the agenda because it was made official earlier today, but https://www.socallinuxexpo.org/scale/22x/events/open-infra-days is happening in March 2025. You have until November 1 to get your cfp submissions in if interested
19:02:14 also not really an announcement, but I suspect my scheduling will be far more normal at this point. School started again, family returned to their homes, and I'm back from the asia summit
19:03:14 #topic Upgrading Old Servers
19:03:29 It's been a little while since I reviewed the wiki config management changes, any updates there?
19:04:25 looks like no new patchsets since my last review
19:04:45 fungi deployed a new mirror in the new raxflex region and had to make a small edit to the dns keys package installation order
19:04:58 are there any other server replacement/upgrade/addition items to be aware of?
19:06:41 sounds like no. Let's move on and we can get back to this later if necessary
19:06:49 #topic AFS Mirror Cleanups
19:07:09 When we updated the prepare-workspace-git role to do more of its work in python rather than ansible tasks we did end up breaking xenial due to the use of f-strings
19:07:19 I pointed out that I have a change to remove xenial testing from system-config
19:07:41 https://review.opendev.org/c/opendev/system-config/+/922680 but rather than merge that we dropped the use of f-strings in prepare-workspace-git
19:07:56 anyway, calling that out as this sort of cleanup is a necessary precursor to removing the content in the mirrors
19:08:02 other than that I don't have any updates here
19:08:53 #topic Rackspace Flex Cloud
19:09:14 seems to be working well
19:09:34 though our ram quota is only enough for 32 of our standard nodes for now
19:09:37 yup, since we last met fungi et al have set this cloud up in nodepool and we are running with max-servers set to 32, utilizing all of our quota there
19:10:10 but cardoe also offered to bump our quota if we want to go further
19:11:07 ya I think the next step is to put together an email to rax folks with what we've learned (the ssh keyscan timeout thing is weird but also seems to be mitigated?)
19:11:27 (or just not happening today)
19:11:29 and then they can decide if they are interested in bumping our quota and if so on what schedule
19:11:36 hard to be sure, yep
19:12:02 the delay between keyscan and ready we noted earlier today could be a nodepool problem, or it could be a symptom of some other issue; it's probably worth analyzing a bit more
19:12:37 that doesn't currently rise to the level of "problem", only "weird"
19:12:46 (but weird can be a precursor to problem)
19:12:47 right, it's information, not accusation
19:13:24 anyway, thank you everyone for putting this together. I ended up mostly busy with travel and other things so wasn't as helpful as I'd hoped
19:13:49 and it seems to be working. Be aware that some jobs may notice the use of floating IPs like swift did. They were trying to bind to the public fip address which isn't configured on the servers and that failed
19:14:14 switching over to 0.0.0.0/127.0.0.1 or ipv6 equivalents would work, as would using the private host ip (I think the private ip is what swift switched to)
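
(A minimal sketch of the kind of bind change being described, assuming a simple TCP service; the address and port are illustrative, not swift's actual configuration:

    import socket

    # Binding to the public floating IP typically fails: the fip is NATed by
    # neutron and never actually configured on the instance's interfaces.
    #   sock.bind(("203.0.113.10", 8080))  # -> OSError: Cannot assign requested address
    # Binding to the wildcard address (or the instance's private fixed IP) works,
    # because that address really exists on the host.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("0.0.0.0", 8080))
    sock.listen()
)
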
19:14:46 oh, something else worth pointing out, we resized/renumbered the internal address cidr
19:14:57 from /24 to /20
19:15:02 I can work on a draft email in an etherpad after lunch if we want to capture the things we noticed (like the ephemeral drive mounting) and the networking
19:15:17 fungi: did we do that via the cloud launcher stuff or in the cloud directly?
19:15:31 so the addresses now may be anywhere in the range of 10.0.16.0-10.0.31.255
19:15:39 cloud launcher config
19:15:47 why? (curious)
19:15:48 in the inventory
19:15:58 in case we get a quota >250 nodes
19:16:07 frickler's suggestion
19:16:26 easier to resize it before it was in use
19:16:56 oh, this is on a subnet we make
19:16:58 since it involves deleting the network/router/interface from neutron
19:17:04 yep
19:17:12 makes sense thx
19:17:17 there's no shared provider net (hence the fips)
19:17:25 anything else re rax flex?
19:17:32 i didn't have anything
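
(A quick sanity check of the renumbered range using Python's standard ipaddress module; the numbers below just restate the figures quoted above:

    import ipaddress

    # The tenant network was renumbered from a /24 to a /20.
    net = ipaddress.ip_network("10.0.16.0/20")
    print(net.num_addresses)   # 4096 addresses, vs 256 in a /24
    print(net[0], net[-1])     # 10.0.16.0 10.0.31.255 -- the quoted range
)
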
19:17:35 #topic Etherpad 2.2.4 Upgrade
19:17:42 #link https://review.opendev.org/c/opendev/system-config/+/926078 Change implementing the upgrade
19:18:11 tl;dr here is we are on 2.1.1. 2.2.2 broke the ep_headings plugin so we switched to ep_headings2, which appears to work with our old markup (thank you fungi for organizing that testing)
19:18:36 yeah, proposed upgrade lgtm
19:18:46 the current release is 2.2.4 and I think we should make a push to upgrade to that version. The first step is in the parent to 926078 where we'll tag a 2.1.1 etherpad image so that we can roll back if the headings plugin explodes spectacularly
19:19:04 also the changelog makes a vague mention of security fixes, so probably should take it as somewhat urgent
19:19:24 maybe after the meeting we can land that first change and make sure we get the fallback image tagged and all that. Then do the upgrade afterwards or first thing tomorrow?
19:19:29 and we should remember to test meetpad/jitsi integration after we upgrade
19:19:35 ++
19:19:45 especially with the ptg coming up next month
19:19:56 ya, they changed some paths which required editing the mod rewrite rules in apache
19:20:05 it's theoretically possible that meetpad would also be angry about the new paths
19:21:03 reviews and assistance welcome. Let me know if you have any concerns that I need to address before we proceed
19:21:41 #topic OSUOSL ARM Cloud Issues
19:22:04 The nodepool builder appears to have successfully built images for arm things
19:22:11 that resolves one issue there
19:22:38 As for job slowness, things appear to have improved a bit but aren't consistently good. The kolla image build job does succeed occasionally now but does also still time out a fair bit
19:22:47 also ramereth did some ram upgrades iiuc
19:22:48 seems like Ramereth's upgrade work helped the slow jobs
19:22:49 I suspect that these improvements are due to upgraded hardware that they have been rolling in
19:22:52 ya
19:23:02 there was also a ceph disk that was sad and got replaced?
19:23:14 yep
19:23:38 So anyway, not completely resolved, but the actions being taken appear to be improving the situation. I'll probably drop this off of the next meeting's agenda, but I wanted to follow up today and thank those who helped debug and improve things. Thank you!
19:24:00 and if things start trending in the wrong direction again let us know and we'll see what we can do
19:24:02 though getting a second arm provider again would still be good
19:24:16 even if just for resiliency
19:24:29 ++
19:24:50 though Ramereth did seem to imply that we could ask for more quota in osuosl too
19:24:58 if we decide we need it
19:25:09 also worth noting that most jobs are fine on that cloud
19:25:34 the issues specifically appear to be io related, so jobs doing a lot of io like kolla image builds were hit with slowness, but unittests and similar run in reasonable amounts of time
19:26:34 #topic Updating ansible+ansible-lint versions in our repos
19:27:03 Now that we have updated centos 9 arm images I was able to get the ozj change related to this passing
19:27:08 #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970
19:27:21 There is also a project-config change but it has passed all along
19:27:23 #link https://review.opendev.org/c/openstack/project-config/+/926848
19:27:54 neither of these changes are particularly fun reviews, but hopefully we can get them done and then not bother with this stuff too much for another few years
19:28:05 and they are largely mechanical updates
19:28:24 do be on the lookout for any potential behavior changes in those updates, because there shouldn't be any
19:28:50 getting this done will get these jobs running on noble too, which is nice
19:29:27 #topic Zuul-launcher Image Builds
19:29:49 zuul-launcher (zuul's internal daemon for managing historically nodepool related things) has landed
19:30:11 yay!
19:30:15 the next step in proving out that it all works is to get opendev building test node images using jobs and zuul-launcher to upload them rather than nodepool itself
19:30:29 There are dib roles in zuul-jobs to simplify the job side
19:30:52 corvus has a question about where the resulting jobs and configs should live for this
19:30:55 i wanted to discuss where we might put those image jobs
19:31:05 my hunch is that opendev/project-config is the appropriate location
19:31:28 since we're building these images for use across opendev and want to have control of that, and this would be one less thing to migrate out of openstack/project-config if we manage it there
19:31:30 yes -- but if we put them in an untrusted-project, then speculative changes would be possible
19:32:19 and you're suggesting that would be useful in this situation?
19:32:33 i don't think we have any untrusted central repos like that (like openstack-zuul-jobs is for openstack)
19:32:45 correct, opendev doesn't currently have a repo set up that way
19:32:54 perhaps! no one knows yet... but it certainly was useful when i was working on the initial test job
19:33:18 but if that is useful for image builds then maybe it should. Something like opendev/zuul-configs ?
19:33:19 it would let people propose changes to dib settings and see if they work
19:33:27 (that name is maybe not descriptive enough)
19:34:22 yeah, we can do something like that, or go for something with "image" in the name if we want to narrow the scope a bit
19:34:26 thinking out loud here: if we had an untrusted base-jobs-like repo in opendev we could potentially use that to refactor the base jobs a bit to do less work
19:35:08 that's basically the model recommended in zuul's pti document
19:35:08 I know this is something we've talked about for ages but never done because the current system mostly works, but if there are other reasons to split the configs into trusted and untrusted then maybe we'll finally do that refactoring too
19:35:49 back when we were operating on a single zuul tenant that's what we used the openstack/openstack-zuul-jobs repo for
19:36:03 and we still have some stuff in there that's not openstack-specific
19:36:25 yeah -- though a lot of that thinking was originally related to base role testing, and we've pretty much got that nailed down now, so it's not quite as important as it was. but it's still valid and could be useful, and if we had it then i probably would have said let's start there and not started this conversation :)
19:36:26 in any case a new repo with a better name than the one I suggested makes sense to me. It can house stuff a level above base jobs as well as image build things
19:36:55 so if we feel like "generic opendev untrusted low-level jobs" is a useful repo to have then sure, let's start there :)
19:37:10 ++
19:37:24 incidentally, there is a related thing:
19:38:16 not only will we define the image build jobs as we just discussed, but we will have zuul configuration objects for the images themselves, and we want to include those in every tenant. we will eventually want to decide if we want to put those in this repo, or some other repo dedicated to images.
19:38:44 we can kick that down the road a bit, and move things around later. and the ability to include and exclude config objects in each tenant gives us a lot of flexibility
19:39:03 but i just wanted to point that out as a distinct but related concern as we start to develop our mental model around this stuff
19:39:07 if we use this repo for "base" job stuff then we may end up wanting to include it everywhere for jobs anyway
19:39:16 yep
19:39:17 yup, it's a good callout since they are distinct config items
19:39:57 we will definitely only want to run the image build jobs in one tenant, so if we put them in the "untrusted base job" repo, we will only include the project stanza in the opendev tenant
19:40:13 i think that would be normal behavior for that repo anyway
19:40:53 okay, i think that's enough to proceed; i'll propose a new repo and then sync up with tonyb on making jobs
19:41:20 thanks!
19:41:21 sounds good thanks
19:41:27 #topic Open Discussion
19:41:33 Anything else before we end the meeting?
19:42:08 i have something
19:42:15 go for it
19:42:25 wanted to discuss adding rocky mirrors, following up from early august
19:43:54 this was brought up in the context of I think kolla, and the two things I suggested to start there are 1) determine if the jobs are failing due to mirror/internet flakiness (so that we're investing effort where it has most impact) and 2) determine how large the mirror content is, to see if it will fit in the existing afs filesystem
19:44:01 https://grafana.opendev.org/d/9871b26303/afs?orgId=1 shows current utilization
19:45:18 hrm, the graphs don't show us free space, just utilization and used space (which we can use to math out free space; must be 5tb total or about 1.2tb currently free?)
19:45:42 related to utilization, openeuler is already huge/full and is asking to add an additional major release
19:45:44 #link https://review.opendev.org/927462 Update openEuler mirror repo
19:46:06 then assuming the mirrors will help and we've got space, we just need to add an rsync mirroring script for rocky (as well as the afs volume bits I guess)
19:46:31 fungi: in openEuler's case I think we replace the old with new instead of carrying both
19:46:52 fungi: the old version was never really used much aiui and there probably isn't much reason to double the capacity needs there
19:47:10 NeilHanlon: do you know how big the mirror size would be for rockylinux 9?
19:47:12 yeah, it came up again in OSA a few weeks back, not directly related to jobs, but not unrelated, either
19:47:29 NeilHanlon: I guess we could also do x86 or arm or both, so that may be more than one number
19:47:42 yeah, wrt openeuler i was hinting at that as well, though the author left a comment about some existing jobs continuing to rely on the old version
19:48:23 fungi: right, I'm saying delete those jobs like we deleted centos 7 and opensuse and hopefully soon xenial jobs :)
19:48:43 it's not ideal, but pruning like that enables us to do more interesting things, so rather than try to keep all things alive with minimal use we just need to move on
19:48:55 NeilHanlon: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update this is what the centos stream sync script looks like
19:49:29 8 and 9, with images, it's about 800GiB; w/o, less.. and with only x86/arm and no debuginfo/source... probably under 300
19:49:39 NeilHanlon: it should be fairly consistent rpm mirror to rpm mirror. We just need to tailor the source location and rsync excludes to the specific mirror
19:49:58 it should be nominally the same as whatever the c9s one is, really
19:50:11 NeilHanlon: ya, we won't mirror images or debuginfo/source stuff. The thought there is you can fetch those when necessary from the source, but the vast majority of ci jobs don't need that info
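
(For a concrete sense of "tailor the source location and rsync excludes", a rough sketch of the kind of sync such a script wraps, expressed here in Python via subprocess; the source URL, AFS path, and exclude patterns are hypothetical, and the real script would follow the centos-stream-mirror-update example linked above:

    import subprocess

    # Hypothetical upstream rsync module and AFS destination, for illustration only.
    src = "rsync://mirror.example.org/rocky/9/"
    dst = "/afs/.example.org/mirror/rocky/9/"

    # Skip install images, ISOs, debuginfo, and source packages -- the content
    # the jobs don't need, which keeps the size near the ~300GB estimate above.
    excludes = [
        "--exclude=images/",
        "--exclude=isos/",
        "--exclude=live/",
        "--exclude=*-debug*",
        "--exclude=source/",
        "--exclude=Sources/",
    ]

    subprocess.run(
        ["rsync", "-rltvz", "--delete", "--delete-excluded", *excludes, src, dst],
        check=True,
    )
)
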
19:50:23 right, makes sense
19:50:29 NeilHanlon: oh, that's good input, c9s is currently 303GB
19:50:37 (from that grafana dashboard I linked above)
19:51:27 so napkin math says we can fit a rockylinux mirror
19:52:03 a good next step would be to just confirm that the failures we're seeing are expected to be addressed by a mirror, then push up a change to add the syncing script
19:52:05 excellent :) I do agree we should have some firm "evidence" it's needed, even making the assumption that it will fix some percentage of test failures
19:52:31 an opendev admin will need to create the afs volume, which they can do as part of the coordination for merging the change that adds the sync script
19:52:45 that 'some percentage' should be quantifiable, somehow. I'll take an action to try and scrape some logs for commonalities
19:53:18 NeilHanlon: I think even if it is "in the last week we had X failures related to mirror download failures" that would be sufficient. Just enough to know we're actually solving a problem and not simply assuming it will help
19:53:41 roger
19:53:48 (because one of the reasons I didn't want to mirror rocky initially was to challenge the assumption we've had that it is necessary; if evidence proves it is still necessary then let's fix that)
19:54:14 in fairness, I think it's gone really well all things considered, though I don't know that.. just a feeling :)
19:55:09 alright, well, thank you. I've got my marching orders 🫡
19:55:25 fungi: openeuler doesn't appear to have a nodeset in opendev/base-jobs, as just a point of reference
19:55:54 fungi: and codesearch doesn't show any obvious CI usage of that platform
19:56:12 so ya, I'd like to see evidence that we're going to break something important before we do a side by side rollover rather than a replacement
19:56:52 just a few more minutes, last call for anything else
19:58:04 looks like kolla dropped openeuler testing this cycle
19:58:29 per a comment in https://review.opendev.org/927466
19:59:08 though kolla-ansible stable branches may be running jobs on it
20:00:09 and we are at time
20:00:12 Thank you everyone
20:00:18 we'll be back here same time and location next week
20:00:20 #endmeeting