Friday, 2025-08-22

00:04 <Ramereth> clarkb: I think I fixed the issue with our cluster
00:06 <Clark[m]> Thanks I think we see it is happy again from our side too
00:06 <Ramereth> My little network outage earlier caused a bunch of nova-compute agents to get stuck for some reason
02:15 <opendevreview> OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/957995
02:30 *** elibrokeit__ is now known as elibrokeit
02:41 <opendevreview> Merged opendev/zuul-providers master: Update bindep-fallback path  https://review.opendev.org/c/opendev/zuul-providers/+/958247
07:46 *** clarkb is now known as Guest24718
13:06 <fungi> 448m28.615s to complete the docs rw volume move to ord
13:06 <fungi> i'll get the rest of the ord ones done and then work on the local dfw moves
13:07 <fungi> ~7.5 hours
13:09 <fungi> so as luck would have it, moments before i gave up for the evening and dropped offline
13:09 <fungi> or moments after
13:13 <fungi> 6 minutes to move docs.dev to ord
13:16 <fungi> the rest of them went quickly, next longest was project.airship which took 37 seconds
13:17 <fungi> getting to work on the afs01.dfw to afs02.dfw moves next
13:32 <fungi> 46 volumes, the complete list i'm working from is at https://paste.opendev.org/show/bpYk6CtVQ3vfrRgyC5ka/
13:34 <fungi> running through those in a loop in the root screen session on afs01.dfw now
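(A minimal sketch of what such a loop looks like, assuming the volume names sit one per line in a volumes.txt pulled from that paste, both file servers use partition vicepa, and the commands run with -localauth on the file server; the real session may have differed:)

    # move each rw volume from afs01.dfw to afs02.dfw, timing each one
    while read -r vol; do
      echo "moving ${vol}"
      time vos move -id "${vol}" \
        -fromserver afs01.dfw.openstack.org -frompartition vicepa \
        -toserver afs02.dfw.openstack.org -topartition vicepa \
        -localauth
    done < volumes.txt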
13:47 *** Guest24718 is now known as clarkb
13:48 <clarkb> fungi: I remembered kdc03 backups were failing and looking at email seems to still be the case. Should we go ahead and remove the two kdc servers from the emergency file (and I guess remove their ansible fact cache files too first) and see if that corrects?
13:49 <fungi> these are also not exactly fast, i suspect the distro package mirror volumes could even take days to complete
13:49 <clarkb> probably need to move the old borg venv aside first as well
13:49 <fungi> yeah, i'll take those back out in a sec
13:50 <clarkb> in theory those moves should be much quicker within the same cloud region due to shorter round trip times impacting the small window sizes less. But the small window size is the limiter aiui
13:51 <fungi> mirror.centos-stream is copying now, i guess this will give us some idea of how fast these larger volumes move between file servers in the same region
13:51 <clarkb> I guess I can try and work through those next week while you're out then we resume the upgrades when you're back assuming I can get through the list
13:53 <fungi> i've taken the kdcs out of the emergency list and moved /opt/borg to borg.old on kdc03
13:54 <fungi> we'll need to do that again after it's upgraded to noble
13:55 <clarkb> I still see /var/cache/ansible/facts/kdc03.opendev.org.yaml and kdc04.opendev.org.yaml. I believe it is safe to (re)move these files and ansible will repopulate the cache entries when it next connects to the nodes
13:55 <clarkb> that is on bridge
13:56 <fungi> oh, right i'll take care of that too
13:57 <clarkb> otherwise it may think the two nodes are still focal etc
13:57 <clarkb> I think this matters less for focal -> jammy but is impactful once on noble
13:58 <clarkb> since noble changes borg versions and if we were running containers switches us to podman all of which is determined by the fact values
13:59 <fungi> removed /var/cache/ansible/facts/kdc0{3,4}.openstack.org
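(For the record, the cleanup is just removing the cache files on bridge, using the paths reported above; ansible re-gathers facts and repopulates the cache the next time it connects to the hosts:)

    # on bridge: drop the cached facts so the next ansible run re-gathers them
    rm /var/cache/ansible/facts/kdc0{3,4}.openstack.org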
14:26 <opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138
14:26 <tkajinam> I've seen 503 errors from repository mirror in CI. Is there any maintenance activity going on?
14:27 <opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138
14:28 <tkajinam> http://mirror-int.ord.rax.opendev.org:8080/rdo/centos9-master/current/delorean.repo shows https://c566b6b62447a25ced42-92a01e1da4e956a5b782691fa3feaba0.ssl.cf2.rackcdn.com/openstack/516333bc458241f094abf33d2983d9a4/logs/etc/yum.repos.d/delorean.repo.txt
14:28 <fungi> tkajinam: i'm trying to move the rw volumes to a different file server but it's supposed to be transparent
14:29 <fungi> tkajinam: you won't be able to reach the mirror-int addresses remotely, those are only accessible from within their respective cloud regions because they're on rfc-1918 addresses
14:29 <fungi> though if you drop the "-int" part of the hostname it should work
14:29 <tkajinam> this error is seen only in puppet jobs, which pull the rdo mirror, so it can be specific to that part (I saw that the other jobs like lint jobs on ubuntu are passing)
14:30 <tkajinam> ah, ok
14:30 <tkajinam> I didn't get that content locally but that's what is obtained during a CI run
14:30 <fungi> the mirror.apt-puppetlabs volume was the first one i moved in this batch, mirror.centos-stream has been in progress for a while
14:30 <tkajinam> ok
14:31 <tkajinam> that's from rdo so it may not be related
14:31 <tkajinam> I mean the file/url I'm pointing to
14:31 <fungi> oh! yeah the :8080/rdo may be a proxy not a mirror
14:32 <tkajinam> ah, ok
14:32 <fungi> so that's probably entirely unrelated to what i'm working on in that case
14:32 <tkajinam> ok seems rdo repository is down
14:32 <fungi> that would definitely explain it
14:33 <fungi> besides, i wouldn't have expected afs issues to result in 5xx errors (4xx would have been more likely)
14:33 <tkajinam> yeah
14:33 <tkajinam> let me discuss it with rdo team
14:33 <fungi> 5xx coming from apache mod_proxy is typical
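(Illustrative only: the /rdo path on the mirror hosts is a reverse proxy to trunk.rdoproject.org rather than an AFS-backed mirror, conceptually something along the lines of the mod_proxy stanza below; the actual directives in system-config may differ:)

    # hypothetical shape of the proxy config; an unreachable upstream surfaces as 502/503
    ProxyPass        "/rdo/" "https://trunk.rdoproject.org/"
    ProxyPassReverse "/rdo/" "https://trunk.rdoproject.org/"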
14:33 <tkajinam> yeah
14:33 <tkajinam> it looks like that
14:33 <fungi> thanks for checking into it!
14:34 <tkajinam> :-)
14:34 <tkajinam> by the way I'm trying to add openvox to the mirror due to nearly-dead puppet https://review.opendev.org/c/opendev/system-config/+/957299
14:34 <tkajinam> it seems I have to ask the team to create the repository? it'd be nice if it can be looked into after that data migration
14:34 <fungi> ah, yes, i'll take a look
14:34 <tkajinam> I mean, create the "directory" in afs
14:39 <fungi> tkajinam: correct, looks like this is going to use two new volumes (one for the rpms, one for the debs). those will need manual creation steps to make the two rw volumes and ro replicas as well as set quotas for them. do you have a rough guess as to how much space these will require currently? about the same as the equivalent puppet mirrors?
14:44 <tkajinam> fungi, these should be even smaller than puppet mirrors, because they provide only recent versions
14:45 <tkajinam> but the same size would be a reasonable estimate
14:46 <fungi> sounds good, thanks
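(For context, the manual creation steps for a new mirror volume look roughly like the sketch below; the volume name, servers, partition, and quota here are all guesses for illustration:)

    # create the rw volume with a quota, define ro sites, then publish the ro copies
    vos create -server afs01.dfw.openstack.org -partition vicepa -name mirror.openvox-rpm -maxquota 50000000 -localauth
    vos addsite -server afs01.dfw.openstack.org -partition vicepa -id mirror.openvox-rpm -localauth
    vos addsite -server afs02.dfw.openstack.org -partition vicepa -id mirror.openvox-rpm -localauth
    vos release mirror.openvox-rpm -localauth
    # repeat for the deb volume, then add mountpoints for both under the mirror tree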
14:55 <opendevreview> Merged openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138
15:00 <clarkb> tkajinam: fungi: fwiw I'm trying to get us out of mirroring stuff as much as possible
15:01 <clarkb> the mirrors have historically been one of the places where we struggle with significant tech debt. For big things that we use a lot the cost benefits are reasonable. eg distro packages for popular distros
15:01 <clarkb> but for random tools we just end up carrying things forever
15:12 <clarkb> tkajinam: fungi: also a quick way to check if something is proxied or mirrored is to check the index at https://mirror.dfw.rax.opendev.org/ everything there is mirrored. Anything else is proxied
15:13 <clarkb> and if you want an example of the tech debt I'm talking about https://mirror.dfw.rax.opendev.org/yum-puppetlabs/puppet5/el/7/x86_64/ is a good illustration
15:13 <clarkb> we don't have centos 7 nodes anymore but no one is cleaning up the mirrored content for that release
15:15 <clarkb> and then all of that additional content makes doing openafs server maintenance take longer as we're discovering right now with the volume moves
15:17 <clarkb> https://trunk.rdoproject.org/ is what /rdo/ proxies to
15:20 <clarkb> I get "server is taking too long to respond" which apache probably timed out on as well then sent a 5xx down to the client
15:21 <tkajinam> I had this which is supposed to clear unused things, which has been kept for some time, but I agree these are mostly ignored https://review.opendev.org/c/opendev/system-config/+/946501
15:23 <tkajinam> if external traffic is not a concern then I'll probably just keep using upstream repo, until we actually face hard requirements to have mirrors (such as bw limit set in upstream repo)
15:23 <clarkb> right, that's been the approach I've been trying to shift us to. For things that we aren't updating in every single job let's see if we can get away with fetching from upstream directly occasionally
15:24 <clarkb> definitely for the base distro packages that we update and install from constantly it's worth the effort to mirror.
15:25 <clarkb> then if we notice problems we can reconsider
15:30 <clarkb> tkajinam: re https://review.opendev.org/c/opendev/system-config/+/946501 I have a note/question about the rsync pattern matching rules. someone like fungi may know off the top of their head if that is a problem
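(For anyone following along, the question is about rsync filter semantics; a toy example with entirely made-up patterns and paths:)

    # an unanchored pattern matches at any depth, a leading / anchors it at the
    # transfer root, '*' stops at slashes while '**' crosses them
    rsync -av --delete \
      --exclude='puppet5*' \
      --exclude='/el/7/**' \
      rsync://example.org/puppetlabs/ /srv/mirror/yum-puppetlabs/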
15:31 <clarkb> it is forecast to be about 38C today so I'm going to pop outside now for some less hot outdoor time. Back in a bit
16:09 <opendevreview> Takashi Kajinami proposed opendev/system-config master: Update exclude list for puppetlabs repository sync  https://review.opendev.org/c/opendev/system-config/+/946501
16:11 <opendevreview> Takashi Kajinami proposed opendev/system-config master: Update exclude list for puppetlabs repository sync  https://review.opendev.org/c/opendev/system-config/+/946501
17:04 *** dhill is now known as Guest24778
17:46 <clarkb> fungi: I +2'd but didn't approve https://review.opendev.org/c/opendev/system-config/+/946501 because I'm not sure if we're concerned about that impacting the work you're doing moving the rw volumes
17:46 <clarkb> fungi: I guess you can approve it if you think it's safe enough
17:47 <clarkb> fungi: then for the work you're doing if you can give me a mini braindump on which volumes need moving and the commands to do so I can pick that up again on monday (as I'm operating on the assumption it won't be done today)
17:50 <fungi> it's probably fine to approve, i don't expect the churn to be all that significant
17:51 <clarkb> ok I've approved it
17:51 <fungi> as for what's in flight, there's a for loop running in a root screen session on afs01.dfw moving the 46 rw volumes at the top of https://paste.opendev.org/show/bpYk6CtVQ3vfrRgyC5ka/ to afs02.dfw
17:51 <fungi> the 8 listed below there for afs01.ord have already been moved
17:51 <clarkb> what about the ones below line 59. I think you did those too?
17:52 <clarkb> except for docs-old?
17:52 <fungi> the "afs01.dfw->???" section there is the ones i don't know what we should do about because they currently have their only ro replica locally on afs01.dfw or have no ro replica at all
17:53 <fungi> i expect we can just ignore them, mirror.logs is the only one that seems like it might end up being a problem
17:53 <clarkb> I think you moved the rw for them though?
17:53 <fungi> i did not
17:53 <fungi> at least not yet
17:54 <clarkb> ah ok. We had talked about that being possible but I guess it hasn't been done yet. And ya my inclination would be to move the rw for those to another server and possibly create a new ro for them on afs02 as well
17:54 <clarkb> that way we can flip flop like everything else and avoid problems
17:54 <fungi> though keep in mind that this is all for a handful of reboots and a few minutes of afs down-time during package upgrades, so the impact may be minimal
17:55 <fungi> but yeah, we could add replicas for them on afs02.dfw probably
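(Adding a replica for one of those would presumably be a per-volume pair of commands along these lines, again assuming vicepa on afs02.dfw; mirror.logs just serves as the example here:)

    # define a ro site on afs02.dfw and push the current rw contents to it
    vos addsite -server afs02.dfw.openstack.org -partition vicepa -id mirror.logs -localauth
    vos release mirror.logs -localauth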
17:55 <clarkb> ya but is the service volume hosting all the service.foo stuff that hosts websites for zuul, opendev, etc?
17:55 <clarkb> basically if that goes down it is likely going to be noticed. And we write to the logs dir pretty persistently so it may make things unhappy downstream of that.
17:56 <clarkb> anyway I suspect that the lack of those ro volumes is an oversight and not intentional and correcting that is a good idea imo
17:56 <fungi> service is an empty volume that just has the mountpoints for the service.foo volumes, so i don't know how critical it is to have a rw or ro volume up 100% of the time. maybe it will affect browsing the parent directory of those?
17:57 <clarkb> ya my suspicion is that the child dirs will become inaccessible
17:57 <clarkb> like nested nfs mounts
17:57 <fungi> anyway, yes those 4 volumes are in need of some decision making and probably extra work
17:57 <clarkb> looking at ps on afs01.dfw I got excited thinking that centos was done then realized that is a largely empty volume
17:57 <clarkb> so centos-stream is the first large one and it is still running
17:57 <fungi> yeah, centos took 2 seconds
17:58 <fungi> centos-stream has been in progress for hours
17:58 <fungi> also i'll note that in retrospect i should have echoed the volume names from the loop, so you have to infer from the list and prior volume numeric ids what it's working on
17:59 <clarkb> they show up in ps -elf output
17:59 <fungi> or that
17:59 <clarkb> so just an extra step to figure that out
18:01 <fungi> anyway, once all the important rw volumes are off afs01.dfw and there's a ro replica for them on at least one other server (that's the case already for the ones the current loop is iterating over) then we should probably go ahead and upgrade afs01.dfw to jammy and then to noble, since all the others are already on jammy now
18:02 <fungi> then we can move all the rw volumes back to afs01.dfw and upgrade everything else to noble
18:02 <clarkb> ++
18:02 <clarkb> we'll get there eventually
18:02 <fungi> also this can be paused at any time we want, and resumed when i get back too
18:02 <fungi> having them running a mix of jammy and noble shouldn't be a problem
18:03 <fungi> mostly i just want to avoid moving these rw volumes around any more than we absolutely have to
18:04 <clarkb> ya I think your plan of getting afs01.dfw to jammy then it being the first to go to noble is a good one
18:04 <clarkb> then the other thing is we won't update the borg situation on kdc03 by default until daily runs at 0200. Do we want to enqueue some change to trigger the service-borg-backup.yaml playbook sooner?
18:04 <clarkb> I'm not terribly concerned about it since that data has been largely static in recent weeks/months so the existing backups should be good
18:04 <clarkb> but if we do want to accelerate the fixup there we need to intervene
18:15 <clarkb> if anyone is wondering the centos 9 stream vm images are still broken
18:15 <clarkb> I'm half wondering if we should just drop centos 9 stream testing from dib and only test it with the -minimal element
18:39 <opendevreview> Merged opendev/system-config master: Update exclude list for puppetlabs repository sync  https://review.opendev.org/c/opendev/system-config/+/946501
19:04 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Correct Kerberos realm var documentation  https://review.opendev.org/c/opendev/system-config/+/958302
19:04 <fungi> clarkb: that ^ should do the trick, yeah?
19:04 <fungi> doubles as an actual fix, albeit a minor one
19:09 <clarkb> fungi: I think it is actually this job: infra-prod-service-borg-backup that we need to run
19:09 <clarkb> and that one doesn't match on the kerberos role but does match on the borg roles.
19:09 <fungi> d'oh! yep, no idea what i was thinking
19:10 <clarkb> +2 to your change as it is a good fix. But we may need something similar in one of the roles that the borg job triggers off of
19:10 <fungi> yeah, working on that now
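(The underlying mechanism: the infra-prod jobs only trigger when a change touches paths listed in the job's files matchers, so a docs-only tweak has to land in one of the matched paths. A hypothetical sketch of the shape of such a matcher, not the actual definition in system-config:)

    - job:
        name: infra-prod-service-borg-backup
        files:
          - playbooks/service-borg-backup.yaml
          - playbooks/roles/borg-backup/.*
          - playbooks/roles/borg-backup-server/.*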
19:18 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Expand explanation of borg version pinning  https://review.opendev.org/c/opendev/system-config/+/958305
19:19 <fungi> clarkb: ^ i tried not to make that entirely useless, though it's admittedly a stretch
19:19 <fungi> i'm going to go grab something to eat, should be back within the hour
19:22 <clarkb> I'm working on lunch too, but I've gone ahead and approved that
20:15 <clarkb> ~15 minutes away from the borg role update landing
20:44 <clarkb> arg it failed. I'll recheck it
20:52 <fungi> centos-stream volume move is still in progress
20:52 <fungi> hah! just as i said that... 435m9.175s
20:53 <fungi> so now it's on to mirror.ceph-deb-reef
20:53 <fungi> the next half dozen should go quickly, and then it's on to mirror.debian which will be a similar order of magnitude to centos-stream
21:06 <clarkb> zoom zoom not really
21:28 <clarkb> ok borg change is back in the gate now
22:11 <opendevreview> Merged opendev/system-config master: Expand explanation of borg version pinning  https://review.opendev.org/c/opendev/system-config/+/958305
22:18 <clarkb> /opt/borg/bin/borg exists on kdc03 again
22:19 <clarkb> and the kdc03 facts cache file reports jammy
22:20 <clarkb> so that looks good to me. Will have to see if the backups are happy next
22:21 <clarkb> and the vos move is up to mirror.debian-security
22:21 <clarkb> that comes after mirror.debian. Was it faster because we're not updating debian right now due to that key error maybe?
22:22 <fungi> that could be
