clarkb | Just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Sep 10 19:00:07 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/LIHCPHNOQTZLO26CYOYY6Y5LNGOOB2CV/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
NeilHanlon | o/ | 19:01 |
fungi | ohai | 19:01 |
clarkb | This didn't make it to the agenda because it was made official earlier today but https://www.socallinuxexpo.org/scale/22x/events/open-infra-days is happening in March 2025. You have until November 1 to get your cfp submissions in if interested | 19:01 |
clarkb | also not really an announcement, but I suspect my scheduling will be far more normal at this point. School started again, family returned to their homes, and I'm back from the Asia summit | 19:02 |
clarkb | #topic Upgrading Old Servers | 19:03 |
clarkb | It's been a little while since I reviewed the wiki config management changes, any updates there? | 19:03 |
clarkb | looks like no new patchsets since my last review | 19:04 |
clarkb | fungi deployed a new mirror in the new raxflex region and had to make a small edit to the dns keys package installation order | 19:04 |
clarkb | are there any other server replacement/upgrade/addition items to be aware of? | 19:04 |
clarkb | sounds like no. Let's move on and we can get back to this later if necessary | 19:06 |
clarkb | #topic AFS Mirror Cleanups | 19:06 |
clarkb | When we updated the prepare-workspace-git role to do more of its work in python rather than ansible tasks we did end up breaking xenial due to the use of f-strings | 19:07 |
clarkb | I pointed out that I have a change to remove xenial testing from system-config | 19:07 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/922680 but rather than merge that we dropped the use of f-strings in prepare-workspace-git | 19:07 |
clarkb | anyway calling that out as this sort of cleanup is a necessary precursor to removing the content in the mirrors | 19:07 |
clarkb | other than that I don't have any updates here | 19:08 |
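(Background on the f-string breakage above: f-strings were added in Python 3.6, while Xenial's python3 is 3.5. A minimal sketch of the kind of fallback the role switched to; the example value is made up:)

```python
# f-strings (PEP 498) require Python >= 3.6; Ubuntu Xenial ships python3 3.5,
# so code that still needs to run there has to use str.format() or %-formatting.
ref = "refs/changes/80/922680/1"  # made-up example value

# new_style = f"fetching {ref}"        # SyntaxError when parsed by Python 3.5
old_style = "fetching {}".format(ref)  # works on 3.5 and newer
print(old_style)
```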
clarkb | #topic Rackspace Flex Cloud | 19:08 |
fungi | seems to be working well | 19:09 |
fungi | though our ram quota is only enough for 32 of our standard nodes for now | 19:09 |
clarkb | yup since we last met fungi et al have set this cloud up in nodepool and we are running with max-servers set to 32 utilizing all of our quota there | 19:09 |
frickler | but cardoe also offered to bump our quota if we want to go further | 19:10 |
clarkb | ya I think the next step is to put together an email to rax folks with what we've learned (the ssh keyscan timeout thing is weird but also seems to be mitigated?) | 19:11 |
corvus | (or just not happening today) | 19:11 |
clarkb | and then they can decide if they are interested in bumping our quota and if so on what schedule | 19:11 |
fungi | hard to be sure, yep | 19:11 |
corvus | the delay between keyscan and ready we noted earlier today could be a nodepool problem, or it could be a symptom of some other issue; it's probably worth analyzing a bit more | 19:12 |
corvus | that doesn't currently rise to the level of "problem" only "weird" | 19:12 |
corvus | (but weird can be a precursor to problem) | 19:12 |
clarkb | right, it's information not accusation | 19:12 |
clarkb | anyway thank you everyone for putting this together I ended up mostly busy with travel and other things so wasn't as helpful as I'd hoped | 19:13 |
clarkb | and it seems to be working. Be aware that some jobs may notice the use of floating IPs like swift did. They were trying to bind to the public fip address which isn't configured on the servers and that failed | 19:13 |
clarkb | switching over to 0.0.0.0/127.0.0.1 or ipv6 equivalents would work as would using the private host ip (I think the private ip is what swift switched to) | 19:14 |
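(A minimal sketch of the bind failure described above, with a hypothetical floating IP from the TEST-NET-3 documentation range; this is not swift's actual configuration:)

```python
import socket

# Binding to the wildcard address works: the kernel listens on every local
# interface, including the private one the floating IP is NAT'ed to.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 8080))
s.close()

# Binding directly to the public floating IP fails, because that address is
# never configured on the server itself.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("203.0.113.10", 8080))  # hypothetical floating IP
except OSError as err:
    print(err)  # typically EADDRNOTAVAIL: "Cannot assign requested address"
finally:
    s.close()
```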
fungi | oh, something else worth pointing out, we resized/renumbered the internal address cidr | 19:14 |
fungi | from /24 to /20 | 19:14 |
clarkb | I can work on a draft email in an etherpad after lunch if we want to capture the things we noticed (like the ephemeral drive mounting) and the networking | 19:15 |
clarkb | fungi: did we do that via the cloud launcher stuff or in the cloud directly? | 19:15 |
fungi | so the addresses now may be anywhere in the range of 10.0.16.0-10.0.31.255 | 19:15 |
fungi | cloud launcher config | 19:15 |
corvus | why? (curious) | 19:15 |
fungi | in the inventory | 19:15 |
fungi | in case we get a quota >250 nodes | 19:15 |
fungi | frickler's suggestion | 19:16 |
fungi | easier to resize it before it was in use | 19:16 |
corvus | oh this is on a subnet we make | 19:16 |
fungi | since it involves deleting the network/router/interface from neutron | 19:16 |
fungi | yep | 19:17 |
corvus | makes sense thx | 19:17 |
fungi | there's no shared provider net (hence the fips) | 19:17 |
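(A quick sanity check of the resize using Python's ipaddress module; the exact /24 that was replaced is an assumption here:)

```python
import ipaddress

old = ipaddress.ip_network("10.0.16.0/24")  # assumed original /24; the log only says "/24"
new = ipaddress.ip_network("10.0.16.0/20")

print(old.num_addresses)  # 256  -- too small for a quota much beyond ~250 nodes
print(new.num_addresses)  # 4096
print(new[0], new[-1])    # 10.0.16.0 10.0.31.255, matching the range quoted above
```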
clarkb | anything else re rax flex? | 19:17 |
fungi | i didn't have anything | 19:17 |
clarkb | #topic Etherpad 2.2.4 Upgrade | 19:17 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/926078 Change implementing the upgrade | 19:17 |
clarkb | tl;dr here is we are on 2.1.1. 2.2.2 broke ep_headings plugin so we switched to ep_headings2 which appears to work with our old markup (thank you fungi for organizing that testing) | 19:18 |
fungi | yeah, proposed upgrade lgtm | 19:18 |
clarkb | the current release is 2.2.4 and I think we should make a push to upgrade to that version. The first step is in the parent to 926078 where we'll tag a 2.1.1 etherpad image so that we can rollback if the headings plugin explodes spectacularly | 19:18 |
fungi | also changelog makes a vague mention of security fixes, so probably should take it as somewhat urgent | 19:19 |
clarkb | maybe after the meeting we can land that first change and make sure we get the fallback image tagged and all that. Then do the upgrade afterwards or first thing tomorrow? | 19:19 |
fungi | and we should remember to test meetpad/jitsi integration after we upgrade | 19:19 |
clarkb | ++ | 19:19 |
fungi | especially with the ptg coming up next month | 19:19 |
clarkb | ya they changed some paths which required editing the mod rewrite rules in apache | 19:19 |
clarkb | it's theoretically possible that meetpad would also be angry about the new paths | 19:20 |
clarkb | reviews and assistance welcome. Let me know if you have any concerns that I need to address before we proceed | 19:21 |
clarkb | #topic OSUOSL ARM Cloud Issues | 19:21 |
clarkb | The nodepool builder appears to have successfully built images for arm things | 19:22 |
clarkb | that resolves one issue there | 19:22 |
clarkb | As for job slowness things appear to have improved a bit but aren't consistently good. The kolla image build job does succeed occasionally now but does also still timeout a fair bit | 19:22 |
frickler | also ramereth did some ram upgrades iiuc | 19:22 |
fungi | seems like Ramereth's upgrade work helped the slow jobs | 19:22 |
clarkb | I suspect that these improvements are due to upgraded hardware that they have been rolling in | 19:22 |
clarkb | ya | 19:22 |
clarkb | there was also a ceph disk that was sad and got replaced? | 19:23 |
fungi | yep | 19:23 |
clarkb | So anyway not completely resolved but actions being taken appear to be improving the situation. I'll probably drop this off of the next meeting's agenda but I wanted to followup today and thank those who helped debug and improve things. Thank you! | 19:23 |
clarkb | and if things start trending in the wrong direction again let us know and we'll see what we can do | 19:24 |
fungi | though getting a second arm provider again would still be good | 19:24 |
fungi | even if just for resiliency | 19:24 |
clarkb | ++ | 19:24 |
fungi | though Ramereth did seem to imply that we could ask for more quota in osuosl too | 19:24 |
fungi | if we decide we need it | 19:24 |
clarkb | also worth noting that most jobs are fine on that cloud | 19:25 |
clarkb | the issues specifically appear to be IO related, so IO-heavy work like kolla image builds was hit with slowness, but unittests and similar run in reasonable amounts of time | 19:25 |
clarkb | #topic Updating ansible+ansible-lint versions in our repos | 19:26 |
clarkb | Now that we have updated centos 9 arm images I was able to get the ozj change related to this passing | 19:27 |
clarkb | #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970 | 19:27 |
clarkb | There is also a project-config change but it has passed all along | 19:27 |
clarkb | #link https://review.opendev.org/c/openstack/project-config/+/926848 | 19:27 |
clarkb | neither of these changes are particularly fun reviews but hopefully we can get them done then not bother with this stuff too much for another few years | 19:27 |
clarkb | and they are largely mechanical updates | 19:28 |
clarkb | do be on the lookout for any potential behavior changes in those updates because there shouldn't be any | 19:28 |
clarkb | getting this done will get these jobs running on noble too which is nice | 19:28 |
clarkb | #topic Zuul-launcher Image Builds | 19:29 |
clarkb | zuul-launcher (zuul's internal daemon for managing historically nodepool related things) has landed | 19:29 |
fungi | yay! | 19:30 |
clarkb | the next step in proving out that all works is to get opendev building test node images using jobs and zuul-launcher to upload them rather than nodepool itself | 19:30 |
clarkb | There are dib roles in zuul-jobs to simplify the job side | 19:30 |
clarkb | corvus has a question about where the resulting jobs and configs should live for this | 19:30 |
corvus | i wanted to discuss where we might put those image jobs | 19:30 |
clarkb | my hunch is that opendev/project-config is the appropriate location | 19:31 |
clarkb | since we're building these images for use across opendev and want to have control of that and this would be one less thing to migrate out of openstack/project-config if we manage it there | 19:31 |
corvus | yes -- but if we put them in an untrusted-project, then speculative changes would be possible | 19:31 |
clarkb | and you're suggesting that would be useful in this situation? | 19:32 |
corvus | i don't think we have any untrusted central repos like that (like openstack-zuul-jobs is for openstack) | 19:32 |
clarkb | correct opendev doesn't currently have a repo set up that way | 19:32 |
corvus | perhaps! no one knows yet... but it certainly was useful when i was working on the initial test job | 19:32 |
clarkb | but if that is useful for image builds then maybe it should. Something like opendev/zuul-configs ? | 19:33 |
corvus | it would let people propose changes to dib settings and see if they work | 19:33 |
clarkb | (that name is maybe not descriptive enough) | 19:33 |
corvus | yeah, we can do something like that, or go for something with "image" in the name if we want to narrow the scope a bit | 19:34 |
clarkb | thinking out loud here: if we had an untrusted base jobs like repo in opendev we could potentially use that to refactor the base jobs a bit to do less work | 19:34 |
fungi | that's basically the model recommended in zuul's pti document | 19:35 |
clarkb | I know this is something we've talked about for ages but never done it because the current system mostly works, but if there are other reasons to split the configs into trusted and untrusted then maybe we'll finally do that refactoring too | 19:35 |
fungi | back when we were operating on a single zuul tenant that's what we used the openstack/openstack-zuul-jobs repo for | 19:35 |
fungi | and we still have some stuff in there that's not openstack-specific | 19:36 |
corvus | yeah -- though a lot of that thinking was originally related to base role testing, and we've pretty much got that nailed down now, so it's not quite as important as it was. but it's still valid and could be useful, and if we had it then i probably would have said lets start there and not started this conversation :) | 19:36 |
clarkb | in any case a new repo with a better name than the one I suggested makes sense to me. It can house stuff a level above base jobs as well as image build things | 19:36 |
corvus | so if we feel like "generic opendev untrusted low-level jobs" is a useful repo to have then sure let's start there :) | 19:36 |
clarkb | ++ | 19:37 |
corvus | incidentally, there is a related thing: | 19:37 |
corvus | not only will we define the image build jobs as we just discussed, but we will have zuul configuration objects for the images themselves, and we want to include those in every tenant. we will eventually want to decide if we want to put those in this repo, or some other repo dedicated to images. | 19:38 |
corvus | we can kick that down the road a bit, and move things around later. and the ability to include and exclude config objects in each tenant gives us a lot of flexibility | 19:38 |
corvus | but i just wanted to point that out as a distinct but related concern as we start to develop our mental model around this stuff | 19:39 |
clarkb | if we use this repo for "base" job stuff then we may end up wanting to include it everywhere for jobs anyway | 19:39 |
corvus | yep | 19:39 |
clarkb | yup its a good callout since they are distinct config items | 19:39 |
corvus | we will definitely only want to run the image build jobs in one tenant, so if we put them in the "untrusted base job" repo, we will only include the project stanza in the opendev tenant | 19:39 |
corvus | i think that would be normal behavior for that repo anyway | 19:40 |
corvus | okay, i think that's enough to proceed; i'll propose a new repo and then sync up with tonyb on making jobs | 19:40 |
fungi | thanks! | 19:41 |
clarkb | sounds good thanks | 19:41 |
clarkb | #topic Open Discussion | 19:41 |
clarkb | Anything else before we end the meeting? | 19:41 |
NeilHanlon | i have something | 19:42 |
clarkb | go for it | 19:42 |
NeilHanlon | wanted to discuss adding rocky mirrors, following up from early august | 19:42 |
clarkb | this was brought up in the context of I think kolla and the two things I suggested to start there are 1) determine if the jobs are failing due to mirror/internet flakiness (so that we're investing effort where it has most impact) and 2) determine how large the mirror content is to determine if it will fit in the existing afs filesystem | 19:43 |
clarkb | https://grafana.opendev.org/d/9871b26303/afs?orgId=1 shows current utilization | 19:44 |
clarkb | hrm the graphs don't show us free space just utilization and used space (which we can use to math out free space; must be 5tb total or about 1.2tb currently free?) | 19:45 |
fungi | related to utilization, openeuler is already huge/full and is asking to add an additional major release | 19:45 |
fungi | #link https://review.opendev.org/927462 Update openEuler mirror repo | 19:45 |
clarkb | then assuming the mirrors will help and we've got space we just need to add an rsync mirroring script for rocky (as well as the afs volume bits I guess) | 19:46 |
clarkb | fungi: in openEuler's case I think we replace the old with new instead of carrying both | 19:46 |
clarkb | fungi: the old version was never really used much aiui and there probably isn't much reason to double the capacity needs there | 19:46 |
clarkb | NeilHanlon: do you know how big the mirror size would be for rockylinux 9? | 19:47 |
NeilHanlon | yeah, it came up again in OSA a few weeks back, not directly related to jobs, but not unrelated, either | 19:47 |
clarkb | NeilHanlon: I guess we could also do x86 or arm or both so that may be more than one number | 19:47 |
fungi | yeah, wrt openeuler i was hinting at that as well, though the author left a comment about some existing jobs continuing to rely on the old version | 19:47 |
clarkb | fungi: right I'm saying delete those jobs like we deleted centos 7 and opensuse and hopefully soon xenial jobs :) | 19:48 |
clarkb | it's not ideal but pruning like that enables us to do more interesting things, so rather than try and keep all things alive with minimal use we just need to move on | 19:48 |
clarkb | NeilHanlon: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update this is what the centos stream sync script looks like | 19:48 |
NeilHanlon | 8 and 9, with images, it's about 800GiB; w/o, less.. and with only x86/arm and no debuginfo/source... probably under 300 | 19:49 |
clarkb | NeilHanlon: it should be fairly consistent rpm mirror to rpm mirror. We just need to tailor the source location and rsync excludes to the specific mirror | 19:49 |
NeilHanlon | it should be nominally the same as whatever the c9s one is, really | 19:49 |
clarkb | NeilHanlon: ya we won't mirror images or debuginfo/source stuff. The thought there is you can fetch those when necessary from the source but the vast majority of ci jobs don't need that info | 19:50 |
NeilHanlon | right, makes sense | 19:50 |
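(Roughly the shape of the sync, shown as a Python sketch rather than the shell scripts opendev actually keeps under playbooks/roles/mirror-update/files/; the upstream rsync URL, AFS path, and exact exclude patterns are placeholders reflecting the images/debuginfo/source discussion above:)

```python
import subprocess

# Placeholder source and destination -- the real values would be picked when
# a rocky-mirror-update script is actually written.
SOURCE = "rsync://mirror.example.org/rocky/9/"
DEST = "/afs/.openstack.org/mirror/rocky/9/"

subprocess.run(
    [
        "rsync", "-rlptDvz", "--delete", "--delete-excluded",
        "--exclude=*/images/*",  # skip install/cloud images
        "--exclude=*/isos/*",    # skip ISO images
        "--exclude=*/debug/*",   # skip debuginfo packages
        "--exclude=*/source/*",  # skip source packages
        SOURCE,
        DEST,
    ],
    check=True,
)
```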
clarkb | NeilHanlon: oh that's good input, c9s is currently 303GB | 19:50 |
clarkb | (from that grafana dashboard I linked above) | 19:50 |
clarkb | so napkin math says we can fit a rockylinux mirror | 19:51 |
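(The napkin math spelled out, using the approximate numbers quoted above; all figures are rough:)

```python
afs_free_gb = 1200       # roughly 1.2 TB currently free out of ~5 TB total
c9s_used_gb = 303        # existing centos-stream-9 volume, for comparison
rocky_estimate_gb = 300  # rocky 8+9, x86_64 + aarch64, no images/debuginfo/source

print(afs_free_gb - rocky_estimate_gb)  # roughly 900 GB of headroom would remain
```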
clarkb | a good next step would be to just confirm that the failures we're seeing are expected to be addressed by a mirror then push up a change to add the syncing script | 19:52 |
NeilHanlon | excellent :) I do agree we should have some firm "evidence" it's needed, even making the assumption that it will fix some percentage of test failures | 19:52 |
clarkb | an opendev admin will need to create the afs volume which they can do as part of the coordination for merging the change that adds the sync script | 19:52 |
NeilHanlon | that 'some percentage' should be quantifiable, somehow. I'll take an action to try and scrape some logs for commonalities | 19:52 |
clarkb | NeilHanlon: I think even if it is "in the last week we had X failures related to mirror download failures" that would be sufficient. Just enough to know we're actually solving a problem and not simply assuming it will help | 19:53 |
NeilHanlon | roger | 19:53 |
clarkb | (because one of the reasons I didn't want to mirror rocky initially was to challenge the assumption we've had that it is necessary; if evidence proves it is still necessary then let's fix that) | 19:53 |
NeilHanlon | in fairness, I think it's gone really well all things considered, though I don't know that.. just a feeling :) | 19:54 |
NeilHanlon | alright, well, thank you. I've got my marching orders 🫡 | 19:55 |
clarkb | fungi: openeuler doesn't appear to have a nodeset in opendev/base-jobs as just a point of reference | 19:55 |
clarkb | fungi: and codesearch doesn't show any obvious CI usage of that platform | 19:55 |
clarkb | so ya I'd like to see evidence that we're going to break something important before we do a replacement rather than a side by side rollover | 19:56 |
clarkb | just a few more minutes, last call for anything else | 19:56 |
fungi | looks like kolla dropped openeuler testing this cycle | 19:58 |
fungi | per a comment in https://review.opendev.org/927466 | 19:58 |
fungi | though kolla-ansible stable branches may be running jobs on it | 19:59 |
clarkb | and we are at time | 20:00 |
clarkb | Thank you everyone | 20:00 |
clarkb | we'll be back here same time and location next week | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Sep 10 20:00:20 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-10-19.00.log.html | 20:00 |
NeilHanlon | thanks clarkb, fungi, all :) | 20:01 |