Friday, 2025-08-22

00:04 <Ramereth> clarkb: I think I fixed the issue with our cluster
00:06 <Clark[m]> Thanks I think we see it is happy again from our side too
00:06 <Ramereth> My little network outage earlier caused a bunch of nova-compute agents to get stuck for some reason
02:15 <opendevreview> OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/957995
02:30 *** elibrokeit__ is now known as elibrokeit
02:41 <opendevreview> Merged opendev/zuul-providers master: Update bindep-fallback path  https://review.opendev.org/c/opendev/zuul-providers/+/958247
07:46 *** clarkb is now known as Guest24718
13:06 <fungi> 448m28.615s to complete the docs rw volume move to ord
13:06 <fungi> i'll get the rest of the ord ones done and then work on the local dfw moves
13:07 <fungi> ~7.5 hours
13:09 <fungi> so as luck would have it, moments before i gave up for the evening and dropped offline
13:09 <fungi> or moments after
13:13 <fungi> 6 minutes to move docs.dev to ord
13:16 <fungi> the rest of them went quickly, next longest was project.airship which took 37 seconds
13:17 <fungi> getting to work on the afs01.dfw to afs02.dfw moves next
13:32 <fungi> 46 volumes, the complete list i'm working from is at https://paste.opendev.org/show/bpYk6CtVQ3vfrRgyC5ka/
13:34 <fungi> running through those in a loop in the root screen session on afs01.dfw now
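(A minimal sketch of what such a loop looks like, assuming the volume names sit one per line in a volumes.txt pulled from that paste, both file servers use partition vicepa, and the commands run with -localauth on the file server; the real session may have differed:)

    # move each rw volume from afs01.dfw to afs02.dfw, timing each one
    while read -r vol; do
      echo "moving ${vol}"
      time vos move -id "${vol}" \
        -fromserver afs01.dfw.openstack.org -frompartition vicepa \
        -toserver afs02.dfw.openstack.org -topartition vicepa \
        -localauth
    done < volumes.txt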
13:47 *** Guest24718 is now known as clarkb
13:48 <clarkb> fungi: I remembered kdc03 backups were failing and looking at email seems to still be the case. Should we go ahead and remove the two kdc servers from the emergency file (and I guess remove their ansible fact cache files too first) and see if that corrects?
13:49 <fungi> these are also not exactly fast, i suspect the distro package mirror volumes could even take days to complete
13:49 <clarkb> probably need to move the old borg venv aside first as well
13:49 <fungi> yeah, i'll take those back out in a sec
13:50 <clarkb> in theory those moves should be much quicker within the same cloud region due to shorter round trip times impacting the small window sizes less. But the small window size is the limiter aiui
13:51 <fungi> mirror.centos-stream is copying now, i guess this will give us some idea of how fast these larger volumes move between file servers in the same region
13:51 <clarkb> I guess I can try and work through those next week while you're out then we resume the upgrades when you're back assuming I can get through the list
13:53 <fungi> i've taken the kdcs out of the emergency list and moved /opt/borg to borg.old on kdc03
13:54 <fungi> we'll need to do that again after it's upgraded to noble
13:55 <clarkb> I still see /var/cache/ansible/facts/kdc03.opendev.org.yaml and kdc04.opendev.org.yaml. I believe it is safe to (re)move these files and ansible will repopulate the cache entries when it next connects to the nodes
13:55 <clarkb> that is on bridge
13:56 <fungi> oh, right i'll take care of that too
13:57 <clarkb> otherwise it may think the two nodes are still focal etc
13:57 <clarkb> I think this matters less for focal -> jammy but is impactful once on noble
13:58 <clarkb> since noble changes borg versions and if we were running containers switches us to podman all of which is determined by the fact values
13:59 <fungi> removed /var/cache/ansible/facts/kdc0{3,4}.openstack.org
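(For the record, the cleanup is just removing the cache files on bridge, using the paths reported above; ansible re-gathers facts and repopulates the cache the next time it connects to the hosts:)

    # on bridge: drop the cached facts so the next ansible run re-gathers them
    rm /var/cache/ansible/facts/kdc0{3,4}.openstack.org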
14:26 <opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138
14:26 <tkajinam> I've seen 503 errors from repository mirror in CI. Is there any maintenance activity going on?
14:27 <opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138
14:28 <tkajinam> http://mirror-int.ord.rax.opendev.org:8080/rdo/centos9-master/current/delorean.repo shows https://c566b6b62447a25ced42-92a01e1da4e956a5b782691fa3feaba0.ssl.cf2.rackcdn.com/openstack/516333bc458241f094abf33d2983d9a4/logs/etc/yum.repos.d/delorean.repo.txt
14:28 <fungi> tkajinam: i'm trying to move the rw volumes to a different file server but it's supposed to be transparent
14:29 <fungi> tkajinam: you won't be able to reach the mirror-int addresses remotely, those are only accessible from within their respective cloud regions because they're on rfc-1918 addresses
14:29 <fungi> though if you drop the "-int" part of the hostname it should work
14:29 <tkajinam> this error is seen only in puppet jobs, which pull the rdo mirror, so it can be specific to that part (I saw that the other jobs like lint jobs on ubuntu are passing)
14:30 <tkajinam> ah, ok
14:30 <tkajinam> I didn't get that content locally but that's what is obtained during a CI run
14:30 <fungi> the mirror.apt-puppetlabs volume was the first one i moved in this batch, mirror.centos-stream has been in progress for a while
14:30 <tkajinam> ok
14:31 <tkajinam> that's from rdo so it may not be related
14:31 <tkajinam> I mean the file/url I'm pointing to
14:31 <fungi> oh! yeah the :8080/rdo may be a proxy not a mirror
14:32 <tkajinam> ah, ok
14:32 <fungi> so that's probably entirely unrelated to what i'm working on in that case
14:32 <tkajinam> ok seems rdo repository is down
14:32 <fungi> that would definitely explain it
14:33 <fungi> besides, i wouldn't have expected afs issues to result in 5xx errors (4xx would have been more likely)
14:33 <tkajinam> yeah
14:33 <tkajinam> let me discuss it with rdo team
14:33 <fungi> 5xx coming from apache mod_proxy is typical
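(Illustrative only: the /rdo path on the mirror hosts is a reverse proxy to trunk.rdoproject.org rather than an AFS-backed mirror, conceptually something along the lines of the mod_proxy stanza below; the actual directives in system-config may differ:)

    # hypothetical shape of the proxy config; an unreachable upstream surfaces as 502/503
    ProxyPass        "/rdo/" "https://trunk.rdoproject.org/"
    ProxyPassReverse "/rdo/" "https://trunk.rdoproject.org/"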
14:33 <tkajinam> yeah
14:33 <tkajinam> it looks like that
14:33 <fungi> thanks for checking into it!
14:34 <tkajinam> :-)
14:34 <tkajinam> by the way I'm trying to add openvox to the mirror due to nearly-dead puppet https://review.opendev.org/c/opendev/system-config/+/957299
14:34 <tkajinam> it seems I have to ask the team to create the repository? it'd be nice if it can be looked into after that data migration
14:34 <fungi> ah, yes, i'll take a look
14:34 <tkajinam> I mean, create the "directory" in afs
14:39 <fungi> tkajinam: correct, looks like this is going to use two new volumes (one for the rpms, one for the debs). those will need manual creation steps to make the two rw volumes and ro replicas as well as set quotas for them. do you have a rough guess as to how much space these will require currently? about the same as the equivalent puppet mirrors?
14:44 <tkajinam> fungi, these should be even smaller than puppet mirrors, because they provide only recent versions
14:45 <tkajinam> but the same size would be a reasonable estimate
14:46 <fungi> sounds good, thanks
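(For context, the manual creation steps for a new mirror volume look roughly like the sketch below; the volume name, servers, partition, and quota here are all guesses for illustration:)

    # create the rw volume with a quota, define ro sites, then publish the ro copies
    vos create -server afs01.dfw.openstack.org -partition vicepa -name mirror.openvox-rpm -maxquota 50000000 -localauth
    vos addsite -server afs01.dfw.openstack.org -partition vicepa -id mirror.openvox-rpm -localauth
    vos addsite -server afs02.dfw.openstack.org -partition vicepa -id mirror.openvox-rpm -localauth
    vos release mirror.openvox-rpm -localauth
    # repeat for the deb volume, then add mountpoints for both under the mirror tree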
14:55 <opendevreview> Merged openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138
15:00 <clarkb> tkajinam: fungi: fwiw I'm trying to get us out of mirroring stuff as much as possible
15:01 <clarkb> the mirrors have historically been one of the places where we struggle with significant tech debt. For big things that we use a lot the cost benefits are reasonable. eg distro packages for popular distros
15:01 <clarkb> but for random tools we just end up carrying things forever
15:12 <clarkb> tkajinam: fungi: also a quick way to check if something is proxied or mirrored is to check the index at https://mirror.dfw.rax.opendev.org/ everything there is mirrored. Anything else is proxied
15:13 <clarkb> and if you want an example of the tech debt I'm talking about https://mirror.dfw.rax.opendev.org/yum-puppetlabs/puppet5/el/7/x86_64/ is a good illustration
15:13 <clarkb> we don't have centos 7 nodes anymore but no one is cleaning up the mirrored content for that release
15:15 <clarkb> and then all of that additional content makes doing openafs server maintenance take longer as we're discovering right now with the volume moves
15:17 <clarkb> https://trunk.rdoproject.org/ is what /rdo/ proxies to
15:20 <clarkb> I get "server is taking too long to respond" which apache probably timed out on as well then sent a 5xx down to the client
15:21 <tkajinam> I had this which is supposed to clear unused things, which has been kept for some time, but I agree these are mostly ignored https://review.opendev.org/c/opendev/system-config/+/946501
15:23 <tkajinam> if external traffic is not a concern then I'll probably just keep using upstream repo, until we actually face hard requirements to have mirrors (such as bw limit set in upstream repo)
15:23 <clarkb> right, that's been the approach I've been trying to shift us to. For things that we aren't updating in every single job let's see if we can get away with fetching from upstream directly occasionally
15:24 <clarkb> definitely for the base distro packages that we update and install from constantly it's worth the effort to mirror.
15:25 <clarkb> then if we notice problems we can reconsider
15:30 <clarkb> tkajinam: re https://review.opendev.org/c/opendev/system-config/+/946501 I have a note/question about the rsync pattern matching rules. someone like fungi may know off the top of their head if that is a problem
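(For anyone following along, the question is about rsync filter semantics; a toy example with entirely made-up patterns and paths:)

    # an unanchored pattern matches at any depth, a leading / anchors it at the
    # transfer root, '*' stops at slashes while '**' crosses them
    rsync -av --delete \
      --exclude='puppet5*' \
      --exclude='/el/7/**' \
      rsync://example.org/puppetlabs/ /srv/mirror/yum-puppetlabs/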
15:31 <clarkb> it is forecast to be about 38C today so I'm going to pop outside now for some less hot outdoor time. Back in a bit
16:09 <opendevreview> Takashi Kajinami proposed opendev/system-config master: Update exclude list for puppetlabs repository sync  https://review.opendev.org/c/opendev/system-config/+/946501
16:11 <opendevreview> Takashi Kajinami proposed opendev/system-config master: Update exclude list for puppetlabs repository sync  https://review.opendev.org/c/opendev/system-config/+/946501
17:04 *** dhill is now known as Guest24778
17:46 <clarkb> fungi: I +2'd but didn't approve https://review.opendev.org/c/opendev/system-config/+/946501 because I'm not sure if we're concerned about that impacting the work you're doing moving the rw volumes
17:46 <clarkb> fungi: I guess you can approve it if you think it's safe enough
17:47 <clarkb> fungi: then for the work you're doing if you can give me a mini braindump on which volumes need moving and the commands to do so I can pick that up again on monday (as I'm operating on the assumption it won't be done today)
17:50 <fungi> it's probably fine to approve, i don't expect the churn to be all that significant
17:51 <clarkb> ok I've approved it
17:51 <fungi> as for what's in flight, there's a for loop running in a root screen session on afs01.dfw moving the 46 rw volumes at the top of https://paste.opendev.org/show/bpYk6CtVQ3vfrRgyC5ka/ to afs02.dfw
17:51 <fungi> the 8 listed below there for afs01.ord have already been moved
17:51 <clarkb> what about the ones below line 59. I think you did those too?
17:52 <clarkb> except for docs-old?
17:52 <fungi> the "afs01.dfw->???" section there is the ones i don't know what we should do about because they currently have their only ro replica locally on afs01.dfw or have no ro replica at all
17:53 <fungi> i expect we can just ignore them, mirror.logs is the only one that seems like it might end up being a problem
17:53 <clarkb> I think you moved the rw for them though?
17:53 <fungi> i did not
17:53 <fungi> at least not yet
17:54 <clarkb> ah ok. We had talked about that being possible but I guess it hasn't been done yet. And ya my inclination would be to move the rw for those to another server and possibly create a new ro for them on afs02 as well
17:54 <clarkb> that way we can flip flop like everything else and avoid problems
17:54 <fungi> though keep in mind that this is all for a handful of reboots and a few minutes of afs down-time during package upgrades, so the impact may be minimal
17:55 <fungi> but yeah, we could add replicas for them on afs02.dfw probably
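(Adding a replica for one of those would presumably be a per-volume pair of commands along these lines, again assuming vicepa on afs02.dfw; mirror.logs just serves as the example here:)

    # define a ro site on afs02.dfw and push the current rw contents to it
    vos addsite -server afs02.dfw.openstack.org -partition vicepa -id mirror.logs -localauth
    vos release mirror.logs -localauth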
17:55 <clarkb> ya but is the service volume hosting all the service.foo stuff that hosts websites for zuul, opendev, etc?
17:55 <clarkb> basically if that goes down it is likely going to be noticed. And we write to the logs dir pretty persistently so it may make things unhappy downstream of that.
17:56 <clarkb> anyway I suspect that the lack of those ro volumes is an oversight and not intentional and correcting that is a good idea imo
17:56 <fungi> service is an empty volume that just has the mountpoints for the service.foo volumes, so i don't know how critical it is to have a rw or ro volume up 100% of the time. maybe it will affect browsing the parent directory of those?
17:57 <clarkb> ya my suspicion is that the child dirs will become inaccessible
17:57 <clarkb> like nested nfs mounts
17:57 <fungi> anyway, yes those 4 volumes are in need of some decision making and probably extra work
17:57 <clarkb> looking at ps on afs01.dfw I got excited thinking that centos was done then realized that is a largely empty volume
17:57 <clarkb> so centos-stream is the first large one and it is still running
17:57 <fungi> yeah, centos took 2 seconds
17:58 <fungi> centos-stream has been in progress for hours
17:58 <fungi> also i'll note that in retrospect i should have echoed the volume names from the loop, so you have to infer from the list and prior volume numeric ids what it's working on
17:59 <clarkb> they show up in ps -elf output
17:59 <fungi> or that
17:59 <clarkb> so just an extra step to figure that out
18:01 <fungi> anyway, once all the important rw volumes are off afs01.dfw and there's a ro replica for them on at least one other server (that's the case already for the ones the current loop is iterating over) then we should probably go ahead and upgrade afs01.dfw to jammy and then to noble, since all the others are already on jammy now
18:02 <fungi> then we can move all the rw volumes back to afs01.dfw and upgrade everything else to noble
18:02 <clarkb> ++
18:02 <clarkb> we'll get there eventually
18:02 <fungi> also this can be paused at any time we want, and resumed when i get back too
18:02 <fungi> having them running a mix of jammy and noble shouldn't be a problem
18:03 <fungi> mostly i just want to avoid moving these rw volumes around any more than we absolutely have to
18:04 <clarkb> ya I think your plan of getting afs01.dfw to jammy then it being the first to go to noble is a good one
18:04 <clarkb> then the other thing is we won't update the borg situation on kdc03 by default until daily runs at 0200. Do we want to enqueue some change to trigger the service-borg-backup.yaml playbook sooner?
18:04 <clarkb> I'm not terribly concerned about it since that data has been largely static in recent weeks/months so the existing backups should be good
18:04 <clarkb> but if we do want to accelerate the fixup there we need to intervene
18:15 <clarkb> if anyone is wondering the centos 9 stream vm images are still broken
18:15 <clarkb> I'm half wondering if we should just drop centos 9 stream testing from dib and only test it with the -minimal element
18:39 <opendevreview> Merged opendev/system-config master: Update exclude list for puppetlabs repository sync  https://review.opendev.org/c/opendev/system-config/+/946501
19:04 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Correct Kerberos realm var documentation  https://review.opendev.org/c/opendev/system-config/+/958302
19:04 <fungi> clarkb: that ^ should do the trick, yeah?
19:04 <fungi> doubles as an actual fix, albeit a minor one
19:09 <clarkb> fungi: I think it is actually this job: infra-prod-service-borg-backup that we need to run
19:09 <clarkb> and that one doesn't match on the kerberos role but does match on the borg roles.
19:09 <fungi> d'oh! yep, no idea what i was thinking
19:10 <clarkb> +2 to your change as it is a good fix. But we may need something similar in one of the roles that the borg job triggers off of
19:10 <fungi> yeah, working on that now
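(The underlying mechanism: the infra-prod jobs only trigger when a change touches paths listed in the job's files matchers, so a docs-only tweak has to land in one of the matched paths. A hypothetical sketch of the shape of such a matcher, not the actual definition in system-config:)

    - job:
        name: infra-prod-service-borg-backup
        files:
          - playbooks/service-borg-backup.yaml
          - playbooks/roles/borg-backup/.*
          - playbooks/roles/borg-backup-server/.*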
19:18 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Expand explanation of borg version pinning  https://review.opendev.org/c/opendev/system-config/+/958305
19:19 <fungi> clarkb: ^ i tried not to make that entirely useless, though it's admittedly a stretch
19:19 <fungi> i'm going to go grab something to eat, should be back within the hour
19:22 <clarkb> I'm working on lunch too, but I've gone ahead and approved that
20:15 <clarkb> ~15 minutes away from the borg role update landing
20:44 <clarkb> arg it failed. I'll recheck it
20:52 <fungi> centos-stream volume move is still in progress
20:52 <fungi> hah! just as i said that... 435m9.175s
20:53 <fungi> so now it's on to mirror.ceph-deb-reef
20:53 <fungi> the next half dozen should go quickly, and then it's on to mirror.debian which will be a similar order of magnitude to centos-stream
21:06 <clarkb> zoom zoom not really
21:28 <clarkb> ok borg change is back in the gate now
22:11 <opendevreview> Merged opendev/system-config master: Expand explanation of borg version pinning  https://review.opendev.org/c/opendev/system-config/+/958305
22:18 <clarkb> /opt/borg/bin/borg exists on kdc03 again
22:19 <clarkb> and the kdc03 facts cache file reports jammy
22:20 <clarkb> so that looks good to me. Will have to see if the backups are happy next
22:21 <clarkb> and the vos move is up to mirror.debian-security
22:21 <clarkb> that comes after mirror.debian. Was it faster because we're not updating debian right now due to that key error maybe?
22:22 <fungi> that could be
