Tuesday, 2024-09-10

clarkbJust about meeting time18:59
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep 10 19:00:07 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/LIHCPHNOQTZLO26CYOYY6Y5LNGOOB2CV/ Our Agenda19:00
clarkb#topic Announcements19:00
NeilHanlono/ 19:01
fungiohai19:01
clarkbThis didn't make it to the agenda because it was made official earlier today but https://www.socallinuxexpo.org/scale/22x/events/open-infra-days is happening in March 2025. You have until November 1 to get your cfp submissions in if interested19:01
clarkbalso not really an announcement, but I suspect my scheduling will be far more normal at this point. School started again, family returned to their homes, and I'm back from the Asia summit19:02
clarkb#topic Upgrading Old Servers19:03
clarkbIt's been a little while since I reviewed the wiki config management changes; any updates there?19:03
clarkblooks like no new patchsets since my last review19:04
clarkbfungi deployed a new mirror in the new raxflex region and had to make a small edit to the dns keys package installation order19:04
clarkbare there any other server replacement/upgrade/addition items to be aware of?19:04
clarkbsounds like no. Let's move on and we can get back to this later if necessary19:06
clarkb#topic AFS Mirror Cleanups19:06
clarkbWhen we updated the prepare-workspace-git role to do more of its work in python rather than ansible tasks we did end up breaking xenial due to the use of f-strings19:07
clarkbI pointed out that I have a change to remove xenial testing from system-config19:07
clarkbhttps://review.opendev.org/c/opendev/system-config/+/922680 but rather than merge that we dropped the use of f-strings in prepare-workspace-git19:07
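For context, f-strings are a Python 3.6+ feature, so any module using them fails to parse on Xenial's Python 3.5. A minimal sketch of the difference (the variable and value here are hypothetical, just for illustration):

    # f-strings only parse on Python 3.6+; on Xenial's Python 3.5 the line
    # below is a SyntaxError before any code runs:
    #   print(f"updating {repo}")
    # The 3.5-compatible spelling uses str.format() instead:
    repo = "opendev/system-config"  # hypothetical example value
    print("updating {}".format(repo))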
clarkbanyway calling that out as this sort of cleanup is a necessary precursor to removing the content in the mirrors19:07
clarkbother than that I don't have any updates here19:08
clarkb#topic Rackspace Flex Cloud19:08
fungiseems to be working well19:09
fungithough our ram quota is only enough for 32 of our standard nodes for now19:09
clarkbyup since we last met fungi et al have set this cloud up in nodepool and we are running with max-servers set to 32 utilizing all of our quota there19:09
fricklerbut cardoe also offered to bump our quota if we want to go further19:10
clarkbya I think the next step is to put together an email to rax folks with what we've learned (the ssh keyscan timeout thing is weird but also seems to be mitigated?)19:11
corvus(or just not happening today)19:11
clarkband then they can decide if they are interested in bumping our quota and if so on what schedule19:11
fungihard to be sure, yep19:11
corvusthe delay between keyscan and ready we noted earlier today could be a nodepool problem, or it could be a symptom of some other issue; it's probably worth analyzing a bit more19:12
corvusthat doesn't currently rise to the level of "problem" only "weird"19:12
corvus(but weird can be a precursor to problem)19:12
clarkbright, it's information, not accusation19:12
clarkbanyway, thank you everyone for putting this together. I ended up mostly busy with travel and other things so wasn't as helpful as I'd hoped19:13
clarkband it seems to be working. Be aware that some jobs may notice the use of floating IPs like swift did. They were trying to bind to the public fip address, which isn't configured on the servers, and that failed19:13
clarkbswitching over to 0.0.0.0/127.0.0.1 or ipv6 equivalents would work, as would using the private host IP (I think the private IP is what swift switched to)19:14
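To illustrate that failure mode, a minimal sketch (203.0.113.10 is a documentation address standing in for a floating IP; it is not taken from the actual jobs):

    import socket

    def can_bind(addr, port=0):
        """Return True if a TCP socket can be bound to addr on this host."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((addr, port))
            return True
        except OSError:
            return False
        finally:
            sock.close()

    # 0.0.0.0 (all interfaces) and 127.0.0.1 always bind; a floating IP is
    # NATed to the instance and never configured on a local interface, so
    # binding to it fails with EADDRNOTAVAIL.
    print(can_bind("0.0.0.0"))       # True
    print(can_bind("127.0.0.1"))     # True
    print(can_bind("203.0.113.10"))  # False on such an instance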
fungioh, something else worth pointing out, we resized/renumbered the internal address cidr19:14
fungifrom /24 to /2019:14
clarkbI can work on a draft email in an etherpad after lunch if we want to capture the things we noticed (like the ephemeral drive mounting) and the networking19:15
clarkbfungi: did we do that via the cloud launcher stuff or in the cloud directly?19:15
fungiso the addresses now may be anywhere in the range of 10.0.16.0-10.0.31.25519:15
fungicloud launcher config19:15
corvuswhy?  (curious)19:15
fungiin the inventory19:15
fungiin case we get a quota >250 nodes19:15
fungifrickler's suggestion19:16
fungieasier to resize it before it was in use19:16
corvusoh this is on a subnet we make19:16
fungisince it involves deleting the network/router/interface from neutron19:16
fungiyep19:17
corvusmakes sense thx19:17
fungithere's no shared provider net (hence the fips)19:17
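As a quick sanity check of the new range, a minimal sketch using Python's ipaddress module (the /20 and the >250 node quota figure are the ones from the discussion above):

    import ipaddress

    subnet = ipaddress.ip_network("10.0.16.0/20")
    print(subnet[0], "-", subnet[-1])   # 10.0.16.0 - 10.0.31.255
    print(subnet.num_addresses - 2)     # 4094 usable hosts, vs ~254 for a /24
    # comfortably more than a >250 node quota would need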
clarkbanything else re rax flex?19:17
fungii didn't have anything19:17
clarkb#topic Etherpad 2.2.4 Upgrade19:17
clarkb#link https://review.opendev.org/c/opendev/system-config/+/926078 Change implementing the upgrade19:17
clarkbtl;dr here is we are on 2.1.1. 2.2.2 broke the ep_headings plugin so we switched to ep_headings2, which appears to work with our old markup (thank you fungi for organizing that testing)19:18
fungiyeah, proposed upgrade lgtm19:18
clarkbthe current release is 2.2.4 and I think we should make a push to upgrade to that version. The first step is in the parent to 926078 where we'll tag a 2.1.1 etherpad image so that we can roll back if the headings plugin explodes spectacularly19:18
fungialso the changelog makes a vague mention of security fixes, so we should probably treat it as somewhat urgent19:19
clarkbmaybe after the meeting we can land that first change and make sure we get the fallback image tagged and all that. Then do the upgrade afterwards or first thing tomorrow?19:19
fungiand we should remember to test meetpad/jitsi integration after we upgrade19:19
clarkb++19:19
fungiespecially with the ptg coming up next month19:19
clarkbya they changed some paths, which required editing the mod_rewrite rules in Apache19:19
clarkbit's theoretically possible that meetpad would also be angry about the new paths19:20
clarkbreviews and assistance welcome. Let me know if you have any concerns that I need to address before we proceed19:21
clarkb#topic OSUOSL ARM Cloud Issues19:21
clarkbThe nodepool builder appears to have successfully built images for arm things19:22
clarkbthat resolves one issue there19:22
clarkbAs for job slowness things appear to have improved a bit but aren't consistently good. The kolla image build job does succeed occasionally now but does also still timeout a fair bit19:22
frickleralso ramereth did some ram upgrades iiuc19:22
fungiseems like Ramereth's upgrade work helped the slow jobs19:22
clarkbI suspect that these improvements are due to upgraded hardware that they have been rolling in19:22
clarkbya19:22
clarkbthere was also a ceph disk that was sad and got replaced?19:23
fungiyep19:23
clarkbSo anyway not completely resolved but actions being taken appear to be improving the situation. I'll probably drop this off of the next meeting's agenda but I wanted to followup today and thank those who helped debug and improve things. Thank you!19:23
clarkband if things start trending in the wrong direction again let us know and we'll see what we can do19:24
fungithough getting a second arm provider again would still be good19:24
fungieven if just for resiliency19:24
clarkb++19:24
fungithough Ramereth did seem to imply that we could ask for more quota in osuosl too19:24
fungiif we decide we need it19:24
clarkbalso worth noting that most jobs are fine on that cloud19:25
clarkbthe issues specifically appear to be IO related, so anything doing a lot of IO (like the kolla image builds) was hit with slowness, but unit tests and similar run in reasonable amounts of time19:25
clarkb#topic Updating ansible+ansible-lint versions in our repos19:26
clarkbNow that we have updated centos 9 arm images I was able to get the ozj change related to this passing19:27
clarkb#link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/92697019:27
clarkbThere is also a project-config change but it has passed all along19:27
clarkb#link https://review.opendev.org/c/openstack/project-config/+/92684819:27
clarkbneither of these changes is a particularly fun review, but hopefully we can get them done and then not bother with this stuff too much for another few years19:27
clarkband they are largely mechanical updates19:28
clarkbdo be on the lookout for any potential behavior changes in those updates because there shouldn't be any19:28
clarkbgetting this done will get these jobs running on noble too which is nice19:28
clarkb#topic Zuul-launcher Image Builds19:29
clarkbzuul-launcher (zuul's internal daemon for managing historically nodepool related things) has landed19:29
fungiyay!19:30
clarkbthe next step in proving out that this all works is to get opendev building test node images using jobs, with zuul-launcher uploading them rather than nodepool itself19:30
clarkbThere are dib roles in zuul-jobs to simplify the job side19:30
clarkbcorvus has a question about where the resulting jobs and configs should live for this19:30
corvusi wanted to discuss where we might put those image jobs19:30
clarkbmy hunch is that opendev/project-config is the appropriate location19:31
clarkbsince we're building these images for use across opendev and want to have control of that, and managing them there would be one less thing to migrate out of openstack/project-config19:31
corvusyes -- but if we put them in an untrusted-project, then speculative changes would be possible19:31
clarkband you're suggesting that would be useful in this situation?19:32
corvusi don't think we have any untrusted central repos like that (like openstack-zuul-jobs is for openstack)19:32
clarkbcorrect opendev doesn't currently have a repo set up that way19:32
corvusperhaps!  no one knows yet... but it certainly was useful when i was working on the initial test job19:32
clarkbbut if that is useful for image builds then maybe it should. Something like opendev/zuul-configs ?19:33
corvusit would let people propose changes to dib settings and see if they work19:33
clarkb(that name is maybe not descriptive enough)19:33
corvusyeah, we can do something like that, or go for something with "image" in the name if we want to narrow the scope a bit19:34
clarkbthinking out loud here: if we had an untrusted base-jobs-like repo in opendev we could potentially use that to refactor the base jobs a bit to do less work19:34
fungithat's basically the model recommended in zuul's pti document19:35
clarkbI know this is something we've talked about for ages but never done because the current system mostly works, but if there are other reasons to split the configs into trusted and untrusted then maybe we'll finally do that refactoring too19:35
fungiback when we were operating on a single zuul tenant that's what we used the openstack/openstack-zuul-jobs repo for19:35
fungiand we still have some stuff in there that's not openstack-specific19:36
corvusyeah -- though a lot of that thinking was originally related to base role testing, and we've pretty much got that nailed down now, so it's not quite as important as it was.  but it's still valid and could be useful, and if we had it then i probably would have said lets start there and not started this conversation :)19:36
clarkbin any case a new repo with a better name than the one I suggested makes sense to me. It can house stuff a level above base jobs as well as image build things19:36
corvusso if we feel like "generic opendev untrusted low-level jobs" is a useful repo to have then sure let's start there :)19:36
clarkb++19:37
corvusincidentally, there is a related thing:19:37
corvusnot only will we define the image build jobs as we just discussed, but we will have zuul configuration objects for the images themselves, and we want to include those in every tenant.  we will eventually want to decide if we want to put those in this repo, or some other repo dedicated to images.19:38
corvuswe can kick that down the road a bit, and move things around later.  and the ability to include and exclude config objects in each tenant gives us a lot of flexibility19:38
corvusbut i just wanted to point that out as a distinct but related concern as we start to develop our mental model around this stuff19:39
clarkbif we use this repo for "base" job stuff then we may end up wanting to include it everywhere for jobs anyway19:39
corvusyep19:39
clarkbyup, it's a good callout since they are distinct config items19:39
corvuswe will definitely only want to run the image build jobs in one tenant, so if we put them in the "untrusted base job" repo, we will only include the project stanza in the opendev tenant19:39
corvusi think that would be normal behavior for that repo anyway19:40
corvusokay, i think that's enough to proceed; i'll propose a new repo and then sync up with tonyb on making jobs19:40
fungithanks!19:41
clarkbsounds good thanks19:41
clarkb#topic Open Discussion19:41
clarkbAnything else before we end the meeting?19:41
NeilHanloni have something19:42
clarkbgo for it19:42
NeilHanlonwanted to discuss adding rocky mirrors, following up from early august19:42
clarkbthis was brought up in the context of (I think) kolla, and the two things I suggested to start with are 1) determine if the jobs are failing due to mirror/internet flakiness (so that we're investing effort where it has the most impact) and 2) determine how large the mirror content is, to see if it will fit in the existing afs filesystem19:43
clarkbhttps://grafana.opendev.org/d/9871b26303/afs?orgId=1 shows current utilization19:44
clarkbhrm the graphs don't show us free space, just utilization and used space (which we can use to math out free space; must be 5TB total, or about 1.2TB currently free?)19:45
fungirelated to utilization, openeuler is already huge/full and is asking to add an additional major release19:45
fungi#link https://review.opendev.org/927462 Update openEuler mirror repo19:45
clarkbthen assuming the mirrors will help and we've got space we just need to add an rsync mirroring script for rocky (as well as the afs volume bits I guess)19:46
clarkbfungi: in openEuler's case I think we replace the old with the new instead of carrying both19:46
clarkbfungi: the old version was never really used much aiui and there probably isn't much reason to double the capacity needs there19:46
clarkbNeilHanlon: do you know how big the mirror size would be for rockylinux 9?19:47
NeilHanlonyeah, it came up again in OSA a few weeks back, not directly related to jobs, but not unrelated, either19:47
clarkbNeilHanlon: I guess we could also do x86 or arm or both so that may be more than one number19:47
fungiyeah, wrt openeuler i was hinting at that as well, though the author left a comment about some existing jobs continuing to rely on the old version19:47
clarkbfungi: right, I'm saying delete those jobs like we deleted the centos 7 and opensuse jobs, and hopefully soon the xenial jobs :)19:48
clarkbit's not ideal, but pruning like that enables us to do more interesting things, so rather than try to keep all things alive with minimal use we just need to move on19:48
clarkbNeilHanlon: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update this is what the centos stream sync script looks like19:48
NeilHanlon8 and 9, with images, it's about 800GiB; w/o, less.. and with only x86/arm and no debuginfo/source... probably under 30019:49
clarkbNeilHanlon: it should be fairly consistent rpm mirror to rpm mirror. We just need to tailor the source location and rsync excludes to the specific mirror19:49
NeilHanlonit should be nominally the same as whatever the c9s one is, really19:49
clarkbNeilHanlon: ya we won't mirror images or debuginfo/source stuff. The thought there is you can fetch those when necessary from the source but the vast majority of ci jobs don't need that info19:50
NeilHanlonright, makes sense19:50
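For a rough sense of what tailoring that involves, a sketch of the kind of rsync invocation used (the upstream mirror URL, AFS path, and exclude list here are hypothetical placeholders; the real scripts in system-config are shell and also handle the AFS vos release step):

    import subprocess

    # Hypothetical upstream rsync module and local AFS volume path.
    SOURCE = "rsync://mirror.example.org/rocky/9/"
    DEST = "/afs/.openstack.org/mirror/rocky/9/"

    # Skip content CI jobs don't need, as with the CentOS Stream mirror:
    # install images, debuginfo, and source RPMs.
    EXCLUDES = [
        "--exclude=*.iso",
        "--exclude=images/",
        "--exclude=debug/",
        "--exclude=source/",
    ]

    subprocess.run(
        ["rsync", "-rltvz", "--delete-after", "--timeout=1800",
         *EXCLUDES, SOURCE, DEST],
        check=True,
    )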
clarkbNeilHanlon: oh that's good input, c9s is currently 303GB19:50
clarkb(from that grafana dashboard I linked above)19:50
clarkbso napkin math says we can fit a rockylinux mirror19:51
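Spelling out that napkin math with the figures from the discussion (roughly 1.2 TB of AFS space free, and c9s's 303 GB as a proxy for a trimmed Rocky mirror):

    afs_free_gb = 1200          # ~1.2 TB free, estimated above
    rocky_estimate_gb = 303     # "nominally the same as whatever the c9s one is"
    print(afs_free_gb - rocky_estimate_gb)   # ~900 GB of headroom would remain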
clarkba good next step would be to just confirm that the failures we're seeing are expected to be addressed by a mirror then push up a change to add the syncing script19:52
NeilHanlonexcellent :) I do agree we should have some firm "evidence" it's needed, even making the assumption that it will fix some percentage of test failures19:52
clarkban opendev admin will need to create the afs volume which they can do as part of the coordination for merging the change that adds the sync script19:52
NeilHanlonthat 'some percentage' should be quantifiable, somehow. I'll take an action to try and scrape some logs for commonalities 19:52
clarkbNeilHanlon: I think even if it is "in the last week we had X failures related to mirror download failures" that would be sufficient. Just enough to know we're actually solving a problem and not simply assuming it will help19:53
NeilHanlonroger19:53
clarkb(because one of the reasons I didn't want to mirror rocky initially was to challenge the assumption we've had that it is necessary; if evidence proves it is still necessary then let's fix that)19:53
NeilHanlonin fairness, I think it's gone really well all things considered, though I don't know that.. just a feeling :) 19:54
NeilHanlonalright, well, thank you. I've got my marching orders 🫡19:55
clarkbfungi: openeuler doesn't appear to have a nodeset in opendev/base-jobs, just as a point of reference19:55
clarkbfungi: and codesearch doesn't show any obvious CI usage of that platform19:55
clarkbso ya I'd like to see evidence that we're going to break something important before we do a replacement rather than a side by side rollover19:56
clarkbjust a few more minutes, last call for anything else19:56
fungilooks like kolla dropped openeuler testing this cycle19:58
fungiper a comment in https://review.opendev.org/92746619:58
fungithough kolla-ansible stable branches may be running jobs on it19:59
clarkband we are at time20:00
clarkbThank you everyone20:00
clarkbwe'll be back here same time and location next week20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Sep 10 20:00:20 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.log.html20:00
NeilHanlonthanks clarkb, fungi, all :) 20:01
