Ramereth | clarkb: I think I fixed the issue with our cluster | 00:04 |
Clark[m] | Thanks I think we see it is happy again from our side too | 00:06 |
Ramereth | My little network outage earlier caused a bunch of nova-compute agents to get stuck for some reason | 00:06 |
opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/957995 | 02:15 |
*** elibrokeit__ is now known as elibrokeit | 02:30 | |
opendevreview | Merged opendev/zuul-providers master: Update bindep-fallback path https://review.opendev.org/c/opendev/zuul-providers/+/958247 | 02:41 |
*** clarkb is now known as Guest24718 | 07:46 | |
fungi | 448m28.615s to complete the docs rw volume move to ord | 13:06 |
fungi | i'll get the rest of the ord ones done and then work on the local dfw moves | 13:06 |
fungi | ~7.5 hours | 13:07 |
fungi | so as luck would have it, moments before i gave up for the evening and dropped offline | 13:09 |
fungi | or moments after | 13:09 |
fungi | 6 minutes to move docs.dev to ord | 13:13 |
fungi | the rest of them went quickly, next longest was project.airship which took 37 seconds | 13:16 |
fungi | getting to work on the afs01.dfw to afs02.dfw moves next | 13:17 |
fungi | 46 volumes, the complete list i'm working from is at https://paste.opendev.org/show/bpYk6CtVQ3vfrRgyC5ka/ | 13:32 |
fungi | running through those in a loop in the root screen session on afs01.dfw now | 13:34 |
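For reference, the loop described above would look roughly like the sketch below. This is a reconstruction, not the actual command history from the screen session: the volume names are just a few from the paste, and the partition name (vicepa), server FQDNs, and use of -localauth are assumptions.

```shell
# rough sketch of the kind of loop described above -- not the actual screen session
for vol in mirror.centos-stream mirror.debian mirror.ubuntu; do   # subset of the 46 in the paste
  echo "=== ${vol} ==="    # echo the name up front (see the retrospective note later in the log)
  time vos move -id "${vol}" \
    -fromserver afs01.dfw.openstack.org -frompartition vicepa \
    -toserver afs02.dfw.openstack.org -topartition vicepa \
    -localauth
done
```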
*** Guest24718 is now known as clarkb | 13:47 | |
clarkb | fungi: I remembered kdc03 backups were failing and looking at email that seems to still be the case. Should we go ahead and remove the two kdc servers from the emergency file (and I guess remove their ansible fact cache files too first) and see if that corrects it? | 13:48 |
fungi | these are also not exactly fast, i suspect the distro package mirror volumes could even take days to complete | 13:49 |
clarkb | probably need to move the old borg venv aside first as well | 13:49 |
fungi | yeah, i'll take those back out in a sec | 13:49 |
clarkb | in theory those moves should be much quicker within the same cloud region, since the shorter round trip times hurt less given the small window sizes. But the small window size is still the limiter aiui | 13:50 |
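To illustrate why the window size is the limiter: a window-limited transfer tops out at roughly window / RTT. The numbers below are purely illustrative guesses, not the real transfer window or measured RTTs.

```shell
# illustrative only: hypothetical 64 KB window, guessed intra- and cross-region RTTs
awk 'BEGIN {
  w = 64 * 1024 * 8                                     # window size in bits
  printf "intra-region  (1ms RTT): ~%.0f Mbit/s\n", w / 0.001 / 1e6
  printf "cross-region (40ms RTT): ~%.1f Mbit/s\n", w / 0.040 / 1e6
}'
```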
fungi | mirror.centos-stream is copying now, i guess this will give us some idea of how fast these larger volumes move between file servers in the same region | 13:51 |
clarkb | I guess I can try and work through those next week while you're out, then we resume the upgrades when you're back, assuming I can get through the list | 13:51 |
fungi | i've taken the kdcs out of the emergency list and moved /opt/borg to borg.old on kdc03 | 13:53 |
fungi | we'll need to do that again after it's upgraded to noble | 13:54 |
clarkb | I still see /var/cache/ansible/facts/kdc03.opendev.org.yaml and kdc04.opendev.org.yaml. I believe it is safe to (re)move these files and ansible will repopulate the cache entries when it next connects to the nodes | 13:55 |
clarkb | that is on bridge | 13:55 |
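A minimal sketch of the cleanup described above, run on bridge; moving the files aside rather than deleting them is just a cautious variant, since ansible repopulates the cache entries on the next connection either way.

```shell
# on bridge: move the stale fact cache entries aside (ansible regenerates them)
mv /var/cache/ansible/facts/kdc03.opendev.org.yaml{,.old}
mv /var/cache/ansible/facts/kdc04.opendev.org.yaml{,.old}
```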
fungi | oh, right i'll take care of that too | 13:56 |
clarkb | otherwise it may think the two nodes are still focal etc | 13:57 |
clarkb | I think this matters less for focal -> jammy but is impactful once on noble | 13:57 |
clarkb | since noble changes borg versions and, if we were running containers, would switch us to podman, all of which is determined by the fact values | 13:58 |
fungi | removed /var/cache/ansible/facts/kdc0{3,4}.openstack.org | 13:59 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 14:26 |
tkajinam | I've seen 503 errors from the repository mirror in CI. Is there any maintenance activity going on? | 14:26 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 14:27 |
tkajinam | http://mirror-int.ord.rax.opendev.org:8080/rdo/centos9-master/current/delorean.repo shows https://c566b6b62447a25ced42-92a01e1da4e956a5b782691fa3feaba0.ssl.cf2.rackcdn.com/openstack/516333bc458241f094abf33d2983d9a4/logs/etc/yum.repos.d/delorean.repo.txt | 14:28 |
fungi | tkajinam: i'm trying to move the rw volumes to a different file server but it's supposed to be transparent | 14:28 |
fungi | tkajinam: you won't be able to reach the mirror-int addresses remotely, those are only accessible from within their respective cloud regions because they're on rfc-1918 addresses | 14:29 |
fungi | though if you drop the "-int" part of the hostname it should work | 14:29 |
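In other words, the same path becomes reachable externally once the "-int" is dropped from the hostname, along the lines of the request below (whether that particular backend was healthy at the time is a separate question, as the rest of the discussion shows).

```shell
# public mirror name, same path as the mirror-int URL quoted above
curl -I http://mirror.ord.rax.opendev.org:8080/rdo/centos9-master/current/delorean.repo
```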
tkajinam | this error is seen only in puppet jobs, which pull the rdo mirror, so it may be specific to that part (I saw that other jobs like lint jobs on ubuntu are passing) | 14:29 |
tkajinam | ah, ok | 14:30 |
tkajinam | I didn't get that content locally but that's what is obtained during CI run | 14:30 |
fungi | the mirror.apt-puppetlabs volume was the first one i moved in this batch, mirror.centos-stream has been in progress for a while | 14:30 |
tkajinam | ok | 14:30 |
tkajinam | that's from rdo so it may not be related | 14:31 |
tkajinam | I mean the file/url I'm pointing to | 14:31 |
fungi | oh! yeah the :8080/rdo may be a proxy not a mirror | 14:31 |
tkajinam | ah, ok | 14:32 |
fungi | so that's probably entirely unrelated to what i'm working on in that case | 14:32 |
tkajinam | ok seems rdo repository is down | 14:32 |
fungi | that would definitely explain it | 14:32 |
fungi | also i wouldn't have expected afs issues to result in 5xx errors (4xx would have been more likely) | 14:33 |
tkajinam | yeah | 14:33 |
tkajinam | let me discuss it with rdo team | 14:33 |
fungi | 5xx coming from apache mod_proxy is typical | 14:33 |
tkajinam | yeah | 14:33 |
tkajinam | it looks like that | 14:33 |
fungi | thanks for checking into it! | 14:33 |
tkajinam | :-) | 14:34 |
tkajinam | by the way I'm trying to add openvox to the mirrors, since puppet is nearly dead https://review.opendev.org/c/opendev/system-config/+/957299 | 14:34 |
tkajinam | it seems I have to ask the team to create the repository? it'd be nice if it can be looked at after that data migration | 14:34 |
fungi | ah, yes, i'll take a look | 14:34 |
tkajinam | I mean, create the "directory" in afs | 14:34 |
fungi | tkajinam: correct, looks like this is going to use two new volumes (one for the rpms, one for the debs). those will need manual creation steps to make the two rw volumes and ro replicas as well as set quotas for them. do you have a rough guess as to how much space these will require currently? about the same as the equivalent puppet mirrors? | 14:39 |
tkajinam | fungi, these should be even smaller than puppet mirrors, because they provide only recent versions | 14:44 |
tkajinam | but the same size would be a reasonable estimate | 14:45 |
fungi | sounds good, thanks | 14:46 |
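For the record, the manual steps fungi mentions usually amount to creating the rw volume with a quota, adding ro sites, mounting it, and releasing it. The volume name, quota, servers, partition, and mount path below are all placeholders, not values from the actual change; only the command shapes are meant to be accurate.

```shell
# hypothetical names/quota/paths -- a sketch of the usual manual steps
vos create -server afs01.dfw.openstack.org -partition vicepa \
  -name mirror.openvox-deb -maxquota 50000000 -localauth       # quota in KB (~50GB, a guess)
vos addsite -server afs01.dfw.openstack.org -partition vicepa -id mirror.openvox-deb -localauth
vos addsite -server afs02.dfw.openstack.org -partition vicepa -id mirror.openvox-deb -localauth
fs mkmount /afs/.openstack.org/mirror/openvox/deb mirror.openvox-deb
vos release mirror.openvox-deb -localauth
# ...and the same again for the rpm volume
```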
opendevreview | Merged openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 14:55 |
clarkb | tkajinam: fungi: fwiw I'm trying to get us out of mirroring stuff as much as possible | 15:00 |
clarkb | the mirrors have historically been one of the places where we struggle with significant tech debt. For big things that we use a lot the cost/benefit is reasonable, e.g. distro packages for popular distros | 15:01 |
clarkb | but for random tools we just end up carrying things forever | 15:01 |
clarkb | tkajinam: fungi: also a quick way to check if something is proxied or mirrored is to check the index at https://mirror.dfw.rax.opendev.org/ ; everything there is mirrored. Anything else is proxied | 15:12 |
clarkb | and if you want an example of the tech debt I'm talking about https://mirror.dfw.rax.opendev.org/yum-puppetlabs/puppet5/el/7/x86_64/ is a good illustration | 15:13 |
clarkb | we don't have centos 7 nodes anymore but no one is cleaning up the mirrored content for that release | 15:13 |
clarkb | and then all of that additional content makes doing openafs server maintenance take longer as we're discovering right now with the volume moves | 15:15 |
clarkb | https://trunk.rdoproject.org/ is what /rdo/ proxies to | 15:17 |
clarkb | I get "server is taking too long to respond" which apache probably timed out on as well then sent a 5xx down to the client | 15:20 |
tkajinam | I had this change, which is supposed to clear unused things; it has been sitting for some time, but I agree these are mostly ignored https://review.opendev.org/c/opendev/system-config/+/946501 | 15:21 |
tkajinam | if external traffic is not a concern then I'll probably just keep using the upstream repo, until we actually face hard requirements to have mirrors (such as a bw limit set in the upstream repo) | 15:23 |
clarkb | right, that's been the approach I've been trying to shift us to. For things that we aren't updating in every single job, let's see if we can get away with fetching from upstream directly occasionally | 15:23 |
clarkb | definitely for the base distro packages that we update and install from constantly it's worth the effort to mirror. | 15:24 |
clarkb | then if we notice problems we can reconsider | 15:25 |
clarkb | tkajinam: re https://review.opendev.org/c/opendev/system-config/+/946501 I have a note/question about the rsync pattern matching rules. someone like fungi may know off the top of their head if that is a problem | 15:30 |
clarkb | it is forecast to be about 38C today so I'm going to pop outside now for some less hot outdoor time. Back in a bit | 15:31 |
opendevreview | Takashi Kajinami proposed opendev/system-config master: Update exclude list for puppetlabs repository sync https://review.opendev.org/c/opendev/system-config/+/946501 | 16:09 |
opendevreview | Takashi Kajinami proposed opendev/system-config master: Update exclude list for puppetlabs repository sync https://review.opendev.org/c/opendev/system-config/+/946501 | 16:11 |
*** dhill is now known as Guest24778 | 17:04 | |
clarkb | fungi: I +2'd but didn't approve https://review.opendev.org/c/opendev/system-config/+/946501 because I'm not sure if we're concerned about that impacting the work you're doing moving the rw volumes | 17:46 |
clarkb | fungi: I guess you can approve it if you think it's safe enough | 17:46 |
clarkb | fungi: then for the work you're doing if you can give me a mini braindump on which volumes need moving and the commands to do so I can pick that up again on monday (as I'm operating on the assumption it won't be done today) | 17:47 |
fungi | it's probably fine to approve, i don't expect the churn to be all that significant | 17:50 |
clarkb | ok I've approved it | 17:51 |
fungi | as for what's in flight, there's a for loop running in a root screen session on afs01.dfw moving the 46 rw volumes at the top of https://paste.opendev.org/show/bpYk6CtVQ3vfrRgyC5ka/ to afs02.dfw | 17:51 |
fungi | the 8 listed below there for afs01.ord have already been moved | 17:51 |
clarkb | what about the ones below line 59. I think you did those too? | 17:51 |
clarkb | except for docs-old? | 17:52 |
fungi | the "afs01.dfw->???" section there is the ones i don't know what we should do about because they currently have their only ro replica locally on afs01.dfw or have no ro replica at all | 17:52 |
fungi | i expect we can just ignore them, mirror.logs is the only one that seems like it might end up being a problem | 17:53 |
clarkb | I think you moved the rw for them though? | 17:53 |
fungi | i did not | 17:53 |
fungi | at least not yet | 17:53 |
clarkb | ah ok. We had talked about that being possible but I guess it hasn't been done yet. And ya my inclination would be to move the rw for those to another server and possibly create a new ro for them on afs02 as well | 17:54 |
clarkb | that way we can flip flop like everything else and avoid problems | 17:54 |
fungi | though keep in mind that this is all for a handful of reboots and a few minutes of afs down-time during package upgrades, so the impact may be minimal | 17:54 |
fungi | but yeah, we could add replicas for them on afs02.dfw probably | 17:55 |
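Adding a replica for one of those volumes would look roughly like this; mirror.logs is the volume named in the discussion, while the partition name and server FQDN are assumptions.

```shell
# check current sites, then add and populate a ro replica on afs02.dfw
vos examine mirror.logs
vos addsite -server afs02.dfw.openstack.org -partition vicepa -id mirror.logs -localauth
vos release mirror.logs -localauth
```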
clarkb | ya but isn't the service volume hosting all the service.foo stuff that hosts the websites for zuul, opendev, etc? | 17:55 |
clarkb | basically if that goes down it is likely going to be noticed. And we write to the logs dir pretty persistently, so it may make things unhappy downstream of that. | 17:55 |
clarkb | anyway I suspect that the lack of those ro volumes is an oversight and not intentional and correcting that is a good idea imo | 17:56 |
fungi | service is an empty volume that just has the mountpoints for the service.foo volumes, so i don't know how critical it is to have a rw or ro volume up 100% of the time. maybe it will affect browsing the parent directory of those? | 17:56 |
clarkb | ya my suspicion is that the child dirs will become inaccessible | 17:57 |
clarkb | like nested nfs mounts | 17:57 |
fungi | anyway, yes those 4 volumes are in need of some decision making and probably extra work | 17:57 |
clarkb | looking at ps on afs01.dfw I got excited thinking that centos was done then realized that is a largely empty volume | 17:57 |
clarkb | so centos-stream is the first large one and it is still running | 17:57 |
fungi | yeah, centos took 2 seconds | 17:57 |
fungi | centos-stream has been in progress for hours | 17:58 |
fungi | also i'll note that in retrospect i should have echoed the volume names from the loop, so you have to infer from the list and prior volume numeric ids what it's working on | 17:58 |
clarkb | they show up in ps -elf output | 17:59 |
fungi | or that | 17:59 |
clarkb | so just an extra step to figure that out | 17:59 |
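That is, something like the following on afs01.dfw shows which volume the loop is currently moving:

```shell
# the in-flight vos move shows up in the process list; [v] keeps grep from matching itself
ps -elf | grep '[v]os move'
```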
fungi | anyway, once all the important rw volumes are off afs01.dfw and there's a ro replica for them on at least one other server (that's the case already for the ones the current loop is iterating over) then we should probably go ahead and upgrade afs01.dfw to jammy and then to noble, since all the others are already on jammy now | 18:01 |
fungi | then we can move all the rw volumes back to afs01.dfw and upgrade everything else to noble | 18:02 |
clarkb | ++ | 18:02 |
clarkb | we'll get there eventually | 18:02 |
fungi | also this can be paused at any time we want, and resumed when i get back too | 18:02 |
fungi | having them running a mix of jammy and noble shouldn't be a problem | 18:02 |
fungi | mostly i just want to avoid moving these rw volumes around any more than we absolutely have to | 18:03 |
clarkb | ya I think your plan of getting afs01.dfw to jammy then it being the first to go to noble is a good one | 18:04 |
clarkb | then the other thing is we won't update the borg situation on kdc03 by default until the daily runs at 0200. Do we want to enqueue some change to trigger the service-borg-backup.yaml playbook sooner? | 18:04 |
clarkb | I'm not terribly concerned about it since that data has been largely static in recent weeks/months so the existing backups should be good | 18:04 |
clarkb | but if we do want to accelerate the fixup there we need to intervene | 18:04 |
clarkb | if anyone is wondering the centos 9 stream vm images are still broken | 18:15 |
clarkb | I'm half wondering if we should just drop centos 9 stream testing from dib and only test it with the -minimal element | 18:15 |
opendevreview | Merged opendev/system-config master: Update exclude list for puppetlabs repository sync https://review.opendev.org/c/opendev/system-config/+/946501 | 18:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Correct Kerberos realm var documentation https://review.opendev.org/c/opendev/system-config/+/958302 | 19:04 |
fungi | clarkb: that ^ should do the trick, yeah? | 19:04 |
fungi | doubles as an actual fix, albeit a minor one | 19:04 |
clarkb | fungi: I think it is actually this job: infra-prod-service-borg-backup that we need to run | 19:09 |
clarkb | and that one doesn't match on the kerberos role but does match on the borg roles. | 19:09 |
fungi | d'oh! yep, no idea what i was thinking | 19:09 |
clarkb | +2 to your change as it is a good fix. But we may need something similar in one of the roles that the borg job triggers off of | 19:10 |
fungi | yeah, working on that now | 19:10 |
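For context, infra-prod jobs in system-config are triggered via Zuul file matchers, which is why the change needs to touch a file the borg job matches on. The sketch below shows the shape involved; the paths listed are illustrative guesses, not copied from the actual job definition.

```yaml
# illustrative only -- the real matchers live in system-config's zuul.d/ job definitions
- job:
    name: infra-prod-service-borg-backup
    files:
      - playbooks/service-borg-backup.yaml
      - playbooks/roles/borg-backup/.*          # hypothetical role paths; editing a file
      - playbooks/roles/borg-backup-server/.*   # matched here is what triggers the deploy job
```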
opendevreview | Jeremy Stanley proposed opendev/system-config master: Expand explanation of borg version pinning https://review.opendev.org/c/opendev/system-config/+/958305 | 19:18 |
fungi | clarkb: ^ i tried not to make that entirely useless, though it's admittedly a stretch | 19:19 |
fungi | i'm going to go grab something to eat, should be back within the hour | 19:19 |
clarkb | I'm working on lunch too, but I've gone ahead and approved that | 19:22 |
clarkb | ~15 minutes away from the borg role update landing | 20:15 |
clarkb | arg it failed. I'll recheck it | 20:44 |
fungi | centos-stream volume move is still in progress | 20:52 |
fungi | hah! just as i said that... 435m9.175s | 20:52 |
fungi | so now it's on to mirror.ceph-deb-reef | 20:53 |
fungi | the next half dozen should go quickly, and then it's on to mirror.debian which will be a similar order of magnitude to centos-stream | 20:53 |
clarkb | zoom zoom not really | 21:06 |
clarkb | ok borg change is back in the gate now | 21:28 |
opendevreview | Merged opendev/system-config master: Expand explanation of borg version pinning https://review.opendev.org/c/opendev/system-config/+/958305 | 22:11 |
clarkb | /opt/borg/bin/borg exists on kdc03 again | 22:18 |
clarkb | and the kdc03 facts cache file reports jammy | 22:19 |
clarkb | so that looks good to me. Will have to see if the backups are happy next | 22:20 |
clarkb | and the vos move is up to mirror.debian-security | 22:21 |
clarkb | that comes after mirror.debian. Was it faster because we're not updating debian right now due to that key error maybe? | 22:21 |
fungi | that could be | 22:22 |