*** tkajinam is now known as Guest1349 | 01:24 | |
*** dhill is now known as Guest1368 | 05:03 | |
opendevreview | Sumanth Kumar Batchu proposed opendev/gerritlib master: Added a comment https://review.opendev.org/c/opendev/gerritlib/+/910568 | 06:59 |
noonedeadpunk | clarkb: from our gut feeling, what indeed takes most of the time is connection and forking. And I'm fully clueless about what can be done there, except optimizing SSH settings, like using lightweight ciphers, disabling GSSAPI/Kerberos, tuning persistent connections, and then disabling things like the dynamic motd which is part of the default ubuntu setup | 08:12 |
noonedeadpunk | And another thing we were focusing on is trying to reduce the number of variables at runtime. Like disabling INJECT_FACTS_AS_VARS, which gives quite a significant performance improvement, but is largely negated by external dependency roles. | 08:14 |
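A minimal sketch of the INJECT_FACTS_AS_VARS tweak mentioned above, written as a shell snippet that appends the setting to ansible.cfg; the config file path is an assumption, only the option name comes from the discussion:

    # Hedged sketch: stop copying gathered facts into each host's top-level
    # variable namespace, so tasks only see them under ansible_facts.
    # INJECT_FACTS_AS_VARS maps to this ini key in the [defaults] section.
    cat >> /etc/ansible/ansible.cfg <<'EOF'
    [defaults]
    inject_facts_as_vars = False
    EOF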
noonedeadpunk | But the task engine is indeed something we didn't touch.... | 08:14 |
noonedeadpunk | I pretty much hoped that their aggressive deprecation of Python versions was exactly so they could drop legacy code and improve the performance of forking | 08:15 |
noonedeadpunk | and then indeed we came to the point of discussing replacing whole roles that we call multiple times (like for creating a systemd service) with Ansible modules. That would improve runtime speed dramatically, but basically it's a path towards Jinja Charms... | 08:18 |
*** elodilles_pto is now known as elodilles | 08:29 | |
opendevreview | Artem Goncharov proposed openstack/project-config master: Add OpenAPI related repos to OpenStackSDK project https://review.opendev.org/c/openstack/project-config/+/910580 | 08:32 |
opendevreview | Artem Goncharov proposed openstack/project-config master: Add OpenAPI related repos to OpenStackSDK project https://review.opendev.org/c/openstack/project-config/+/910580 | 08:33 |
opendevreview | Lukas Kranz proposed zuul/zuul-jobs master: Make prepare-workspace-git fail faster. https://review.opendev.org/c/zuul/zuul-jobs/+/910582 | 08:49 |
*** tosky_ is now known as tosky | 11:48 | |
fungi | noonedeadpunk: is pipelining an option, keeping a persistent ssh connection open for the duration of the play and reusing it rather than reconnecting for each task? or does it already do some of that? | 13:44 |
noonedeadpunk | Yup, we do have pipelining enabled - using `-C -o ControlMaster=auto -o ControlPersist=300` for ssh arguments | 13:47 |
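For context, a hedged sketch of where those flags would live in ansible.cfg and what each one does; the file path is an assumption, the flag values are the ones quoted above:

    # -C compresses the SSH channel, ControlMaster=auto multiplexes every
    # task's commands over one shared connection per host, and
    # ControlPersist=300 keeps that master connection open for 300s after the
    # last task so subsequent tasks skip the TCP and key-exchange overhead.
    cat >> /etc/ansible/ansible.cfg <<'EOF'
    [ssh_connection]
    pipelining = True
    ssh_args = -C -o ControlMaster=auto -o ControlPersist=300
    EOF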
*** ralonsoh_ is now known as ralonsoh | 13:59 | |
fungi | so it's probably more fork initialization and cpython interpreter startup time i suppose | 14:03 |
Clark[m] | fungi: noonedeadpunk: yes I think it is python overhead due to inefficient process management. And I think it occurs both on the controller (no preforking of the -f threads) and on the remote side (starting a new python for each task) | 15:34 |
Clark[m] | But I haven't looked in the code in a long time. There was a huge performance regression after a refactor of this stuff though | 15:35 |
Clark[m] | Way back when I mean. And it hasn't really improved since | 15:35 |
fungi | just add more ram. it's the solution to every performance regression | 15:36 |
noonedeadpunk | + overhead for copying the module I assume | 15:45 |
noonedeadpunk | but yes, I agree. and unfortunately, the way of dealing with it is slightly beyond my scope/time constraints | 15:47 |
opendevreview | Brian Rosmaita proposed openstack/project-config master: Add more permissions for 'glance-ptl' group https://review.opendev.org/c/openstack/project-config/+/910641 | 15:49 |
clarkb | the debian mirror update is currently running (and holding the lock). I'll try to grab it as soon as it finishes then I'll approve the change for debian mirror cleanup | 16:12 |
fungi | thanks! | 16:16 |
frickler | fungi: clarkb: could one of you have a look at https://review.opendev.org/c/openstack/project-config/+/904837 please? | 16:29 |
clarkb | fungi probably has more context on that than I do but I can take a look too | 16:30 |
frickler | I was actually thinking the same, but then I didn't want to make you feel excluded ;) | 16:32 |
clarkb | I have the debian reprepro lock. I'm approving https://review.opendev.org/c/opendev/system-config/+/910032 now | 16:33 |
clarkb | oh double checking the change looks like i need locks for all the debian things | 16:33 |
clarkb | debian, debian-security, and debian-ceph-octopus | 16:34 |
clarkb | grabbing the other two before I approve | 16:34 |
clarkb | those were not held so I got them immediately | 16:36 |
fungi | that change's mix of grep and bash pattern manipulation is mind-bending. i'm not confident i know, for example, why the script uses more toothpicks for grep ${NEW_BRANCH/\//-}-eol than for ${NEW_BRANCH//@(stable\/|unmaintained\/)}-eol | 16:45 |
clarkb | fungi: I copy pasted into bash locally and the outputs looked good fwiw | 16:45 |
clarkb | but I think it is because it is doing a regex replace vs a dumb replace | 16:45 |
clarkb | and bash is heavy on the syntax | 16:46 |
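A small runnable sketch of the two expansions being compared; the branch name is hypothetical, and note the second form is bash extglob alternation (enabled via shopt) rather than a regex:

    #!/bin/bash
    shopt -s extglob
    NEW_BRANCH=unmaintained/2023.1   # hypothetical value for illustration

    # Replace only the first "/" with "-":
    echo "${NEW_BRANCH/\//-}-eol"                            # unmaintained-2023.1-eol

    # Strip a "stable/" or "unmaintained/" prefix entirely via @(...|...):
    echo "${NEW_BRANCH//@(stable\/|unmaintained\/)}-eol"     # 2023.1-eol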
opendevreview | Merged openstack/project-config master: Adapt make_branch script to new 'unmaintained/<series>' branch https://review.opendev.org/c/openstack/project-config/+/904837 | 16:54 |
opendevreview | Merged opendev/mqtt_statsd master: Revert "Retire this repo" https://review.opendev.org/c/opendev/mqtt_statsd/+/905142 | 16:55 |
opendevreview | Merged opendev/system-config master: Remove debian buster package mirrors https://review.opendev.org/c/opendev/system-config/+/910032 | 17:18 |
clarkb | the updated reprepro configs appear to have been applied. I'm running the cleanup for ceph octopus first | 17:52 |
clarkb | I completed the steps documented in https://docs.opendev.org/opendev/system-config/latest/reprepro.html#removing-components but https://mirror.bhs1.ovh.opendev.org/ceph-deb-octopus/dists/buster/ still exists. | 17:55 |
clarkb | Before I drop the lockfile I think I should manually delete the dists/buster dir in that repo? | 17:55 |
clarkb | fungi: ^ | 17:55 |
clarkb | then I can rerun the vos release and drop the lock | 17:55 |
clarkb | there are also a few files to remove from https://mirror.bhs1.ovh.opendev.org/ceph-deb-octopus/lists/ | 17:56 |
clarkb | if that looks correct to you I'll do that after eating some breakfast, then I'll write a docs update change and then finally continue on with debian and debian-security | 17:56 |
clarkb | actually I can delete the files then rerun a regular sync. If they come back then I know they should stick around. If they don't then it is proper cleanup. I'll proceed with that plan | 18:08 |
fungi | clarkb: yeah, maybe delete the dir *and* rerun the mirror script again to make sure it doesn't recreate anything | 18:08 |
fungi | right, what you also just said | 18:08 |
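A hedged sketch of the plan just agreed on; the AFS paths and the volume name are assumptions inferred from the mirror URL, not copied from the real runbook:

    # Remove the leftover buster metadata from the ceph-octopus reprepro tree,
    # re-run the normal sync to confirm nothing recreates it, then publish the
    # read-only replicas. Volume name mirror.ceph-octopus is assumed.
    rm -rf /afs/.openstack.org/mirror/ceph-deb-octopus/dists/buster
    rm -f /afs/.openstack.org/mirror/ceph-deb-octopus/lists/*buster*
    # ... normal reprepro sync script run here (invocation assumed) ...
    vos release mirror.ceph-octopus -localauth -verbose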
clarkb | I did a vos release after the cleanup just to see it reflected on the mirrors. Will run a regular reprepro sync now | 18:16 |
clarkb | ok rerun didn't readd the files so I think this is correct to do | 18:19 |
clarkb | I'll write up a quick docs change and then proceed with debian-security and debian | 18:19 |
fungi | perfect | 18:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update reprepro cleanup docs to cover dists/ and lists/ cleanup https://review.opendev.org/c/opendev/system-config/+/910655 | 18:33 |
clarkb | something like that | 18:33 |
clarkb | now doing debian security. | 18:34 |
clarkb | and now finally debian proper | 18:45 |
clarkb | I note that stretch still has those extra files hanging around. I know we also have ubuntu ports (and probably ubuntu releases?) with similar issues. I think I'm going to focus on cleaning up buster then we can do a broader cleanup of these extra files for other releases later | 18:47 |
clarkb | tonyb may want to dig into that after setting up openafs creds? It has a lot of interactions with interesting afs things | 18:48 |
clarkb | there are a lot of files to clean up in buster proper :) | 18:55 |
clarkb | only into the r packages | 18:56 |
clarkb | Error in vos release command. | 19:08 |
clarkb | Volume needs to be salvaged | 19:08 |
clarkb | that is unfortunate and annoying | 19:08 |
clarkb | I'm not sure I understand what this means yet either | 19:10 |
clarkb | ok afs01.dfw.openstack.org has ext4 errors according to dmesg. I suspect this is the cause | 19:11 |
clarkb | I may need help here. I think xvdb disappeared on us | 19:13 |
clarkb | infra-root ^ fyi | 19:13 |
fungi | argh! | 19:14 |
clarkb | afs02 seems fine | 19:14 |
clarkb | so do we reboot or just try to remount so it isn't remounted ro? | 19:15 |
fungi | [Thu Feb 29 18:48:13 2024] INFO: task jbd2/dm-0-8:541 blocked for more than 120 seconds. | 19:15 |
fungi | that looks like the start of the incident | 19:15 |
clarkb | I think the order of operations here is going to be make afs01 happy with xvdb again and then we have to make afs happy | 19:15 |
fungi | probably first we check for rackspace tickets about hardware drama | 19:16 |
clarkb | fungi: I didn't see any emails at least | 19:16 |
fungi | k | 19:16 |
clarkb | but feel free to double check | 19:16 |
clarkb | I'm wondering if we need to grab locks for everything so this problem doesn't spread? | 19:17 |
clarkb | but it may be too late for that? | 19:17 |
clarkb | and yes argh | 19:17 |
fungi | we'll probably want to make sure that all active volumes are being served from one of the other two fileservers | 19:17 |
clarkb | fungi: you mean move RW to afs02 if it is on afs01? | 19:18 |
clarkb | the RO stuff should already be on both? | 19:18 |
fungi | yeah | 19:18 |
clarkb | thats a lot of volumes... I can start by holding mirror-update locks I guess | 19:19 |
clarkb | then we move things then we cry? | 19:19 |
clarkb | :) | 19:19 |
fungi | looks like we have this too: https://docs.opendev.org/opendev/system-config/latest/afs.html#recovering-a-failed-fileserver | 19:20 |
fungi | doesn't seem to recommend failing over volumes like in https://docs.opendev.org/opendev/system-config/latest/afs.html#afs0x-openstack-org | 19:21 |
fungi | but it does say we should pause writes | 19:21 |
clarkb | I've grabbed all the lockfiles on mirror update | 19:25 |
clarkb | this doesn't deal with docs/tarballs/etc | 19:25 |
clarkb | looking for that now | 19:25 |
fungi | grabbing a cup of tea real quick since the rest of our afternoons just flashed before my eyes | 19:25 |
clarkb | I can never find where we run that crontab | 19:26 |
clarkb | but I'm trying to find it now | 19:26 |
fungi | clarkb: it's on mirror-update | 19:26 |
fungi | */5 * * * * /opt/afs-release/release-volumes.py -d >> /var/log/afs-release/afs-release.log 2>&1 | 19:27 |
fungi | in root's crontab | 19:27 |
fungi | that's what does tarballs, docs, etc. i think if we just comment out all the cronjobs and put mirror-update in emergency disable we'll be fine | 19:27 |
clarkb | ya and in that script the lockfile is /var/run/release-volumes.lock | 19:27 |
clarkb | fungi: ok I grabbed all the locks. The publish logs thing doesn't appear to do a lockfile | 19:29 |
clarkb | we need to put the server in the emergency file before editing the crontab | 19:29 |
fungi | you're just holding all the locks instead of commenting out the cronjobs. that works and i guess keeps us from needing to worry about putting mirror-update in the emergency file | 19:29 |
fungi | or we can do both if you like | 19:30 |
clarkb | I'm doing both because of the logging publishing | 19:30 |
clarkb | ok we should be idled now | 19:31 |
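For reference, a hedged sketch of holding one of those locks by hand, assuming the cron-driven scripts take them with flock(1); the lock path is the one mentioned above:

    # Hold the release-volumes lock so the every-5-minutes cron job cannot
    # start; interrupting the command releases the lock again.
    flock -n /var/run/release-volumes.lock -c 'echo lock held; sleep infinity'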
fungi | This message is to inform you that our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, afs01.dfw.opendev.org/main01, '1c4c3a46-2571-4442-a85a-5603ff91a68d' at 2024-02-29T19:16:47.073638. We are currently investigating the issue and will update you as soon as we have additional information regarding the alert. Please do not access or modify | 19:32 |
fungi | '1c4c3a46-2571-4442-a85a-5603ff91a68d' during this process. | 19:32 |
fungi | Please reference this incident ID if you need to contact support: CBSHD-5cf24b43 | 19:32 |
fungi | found that in the rackspace dashboard just now when i logged in | 19:32 |
fungi | that's from Thursday, February 29, 2024 at 7:16 PM UTC | 19:32 |
fungi | so about 15 minutes ago | 19:32 |
fungi | we should probably refrain from rebooting the server until they've given the all-clear signal? | 19:33 |
clarkb | ++ | 19:33 |
clarkb | reading our docs and upstream salvage docs I'm not fully sure I understand the implications of this situation | 19:33 |
clarkb | like will we replace the content on afs01 with afs02 content automagically? Or maybe it just comes up and is happy? | 19:34 |
fungi | openafs should try to auto-salvage the volumes i think | 19:34 |
clarkb | gotcha | 19:34 |
fungi | if it can't for some reason, then we take additional steps to replace/recover the volume | 19:34 |
fungi | separately, we'll likely have hung/lost transactions from vos release commands that occurred during the incident | 19:35 |
fungi | and we'll need to cancel them manually | 19:35 |
clarkb | fungi: ya I think my concern is if we do a vos release will it potentially overwrite good RO content on afs02 with bad content from afs01. But sounds like it should do consistency checks and in theory we'd do a manual promotion to afs02 then potentially redo my debian cleanup | 19:35 |
clarkb | fungi: how do we clean those up? | 19:36 |
clarkb | though maybe we should avoid cleaning those up until afs01 is happy? | 19:36 |
clarkb | otherwise we may create io load we don't want | 19:36 |
fungi | the tasks? i have to refresh my memory on the exact command, but i would wait until after we get the underlying filesystem fixed | 19:36 |
clarkb | ++ | 19:37 |
clarkb | this doesn't give me a lot of confidence that centos 7 and xenial cleanups will go smoothly... | 19:37 |
clarkb | though it could be coincidence | 19:37 |
fungi | looks like in the past we've used `vos status -server afs01.dfw.openstack.org -localauth` to check the transactions, then `vos endtrans -v -localauth afs01.dfw.openstack.org <transaction_id>` to clean them up, and maybe manually unlocked the volumes with `vos unlock -localauth some.volume` if necessary | 19:40 |
clarkb | I've brought up if we should ask openstack release to idle too (but docs won't be idled by that alone) | 19:41 |
clarkb | corvus: you too may be interested in this as it may impact zuul stuff | 19:41 |
fungi | or perhaps `vos endtrans -server localhost -transaction <transaction_id> -localauth -verbose` looking at shell history | 19:41 |
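Putting those together, the likely cleanup sequence for a stuck transaction looks roughly like this (transaction ID and volume name are placeholders; as it turns out below, none were needed this time):

    # List in-flight volserver transactions on the affected fileserver.
    vos status -server afs01.dfw.openstack.org -localauth
    # End a hung transaction by ID, then unlock the VLDB entry if it was left locked.
    vos endtrans -server afs01.dfw.openstack.org -transaction <transaction_id> -localauth -verbose
    vos unlock -localauth some.volume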
clarkb | ya there are post failures in tarball jobs | 19:44 |
clarkb | fungi: an email from rax says "pending customer" - is that an indication that maybe they want us to check if it is happy again? | 19:45 |
clarkb | talking out loud here but I wonder if we should disable afs services on afs01 before rebooting it (once we get to that point) that way we can enable services manually and check them post boot? | 19:47 |
clarkb | would allow us to fsck too I think | 19:47 |
clarkb | fungi: if you are still logged into the dashboard can you check the ticket status? | 19:48 |
fungi | i'm refreshing now | 19:50 |
fungi | This message is to inform you that your Cloud Block Storage device,afs01.dfw.opendev.org/main01, 1c4c3a46-2571-4442-a85a-5603ff91a68d has been returned to service. | 19:50 |
fungi | Thursday, February 29, 2024 at 7:47 PM UTC | 19:51 |
clarkb | about 4 minutes ago | 19:51 |
fungi | so we should be clear to reboot the server now | 19:51 |
clarkb | fungi: did we want to disable services first so that we can fsck or whatever first? | 19:51 |
clarkb | first post reboot I mean | 19:51 |
clarkb | or just follow the doc and see what happens? | 19:51 |
fungi | i would just follow the doc from here | 19:51 |
fungi | i'll pull up the server console though in case it wants to fsck for a while at boot | 19:52 |
clarkb | ok, it says "fix any filesystem errors", but I'm not sure an fsck will run to detect those, which is why I ask | 19:52 |
clarkb | fungi: sounds good let me know when I should issue a reboot command on the server | 19:52 |
fungi | /dev/main/vicepa /vicepa ext4 errors=remount-ro,barrier=0 0 2 | 19:52 |
fungi | but i expect the lvm2 layer shields us here since the volume was put into a non-writeable state as soon as errors cropped up on the underlying pv | 19:53 |
clarkb | I see | 19:53 |
clarkb | the 2 should cause a fsck to happen though right? | 19:53 |
fungi | yes, once the lvm volume is activated, before its filesystem is mounted | 19:54 |
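For anyone reading along, the fields of the fstab line quoted above break down like this:

    # /dev/main/vicepa    LVM logical volume "vicepa" in volume group "main"
    # /vicepa             mount point the OpenAFS fileserver stores volumes under
    # ext4                filesystem type
    # errors=remount-ro   on ext4 errors, remount read-only instead of panicking
    # barrier=0           write barriers disabled
    # 0                   dump field: not backed up by dump(8)
    # 2                   fsck pass number: checked at boot, after the root filesystem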
clarkb | alright then ready for me to reboot? | 19:55 |
clarkb | or you can do it when you are ready with the console if you prefer | 19:55 |
fungi | i have the console up now | 19:55 |
clarkb | I see you have a shell too | 19:55 |
fungi | it doesn't seem to accept input from my keyboard, but there is a "send ctrlaltdel" button | 19:55 |
fungi | but yeah, i can reboot it from the shell anyway | 19:56 |
clarkb | ya I think we resort to the other thing if that doesn't work | 19:56 |
clarkb | since this should be the most graceful option | 19:56 |
fungi | previous uptime 312 days | 19:56 |
fungi | server is rebooting | 19:56 |
clarkb | still no ping responses from there | 19:58 |
clarkb | s/there/here/ | 19:58 |
fungi | yeah, console is blank at the moment | 19:59 |
fungi | haven't seen it start booting yet | 19:59 |
fungi | may still be in the process of shutting down | 19:59 |
clarkb | ya could be | 19:59 |
clarkb | systemd makes sshd go away fast but other things may still be slowly shutting down | 19:59 |
fungi | sure is taking a while though | 20:01 |
clarkb | ~5 or 6 minutes now? | 20:02 |
clarkb | it just started pinging | 20:02 |
fungi | yeah, i see a fsck of xvda1 ran | 20:02 |
clarkb | xvdb is the device that had a sad | 20:03 |
clarkb | fwiw pvs,lvs,vgs looks ok | 20:03 |
clarkb | I see afs and bos services running | 20:03 |
fungi | boot.log says it did fsck /vicepa | 20:04 |
fungi | so seems that was fine | 20:04 |
clarkb | I think the next thing to find is the salvager logs? | 20:04 |
fungi | i've got a root screen session on afs01.dfw | 20:04 |
* clarkb joins | 20:04 | |
clarkb | looks like it did some salvaging. It isn't clear to me if salvaging is still in progress though | 20:06 |
clarkb | it also looks like it primarily needed to salvage where I couldn't get things idled fast enough (mirror.logs for example) | 20:08 |
fungi | looks like it salvaged mirror.logs and project.readonly | 20:08 |
fungi | the latter is probably due to some docs jobs | 20:08 |
fungi | looks like it didn't complain about anything else | 20:09 |
* clarkb will update docs for recovering a file server to add notes about where to find the salvager logs | 20:09 | |
clarkb | ya so next step is looking for stuck volume transactions (lets get the docs for that updated too) | 20:09 |
fungi | the manpage claims it will be SalvageLog but seems it's actually SalsrvLog | 20:10 |
clarkb | agreed | 20:11 |
fungi | okay, process of elimination, we need to use the global v4 address like `vos status -server 104.130.138.161 -localauth` | 20:11 |
clarkb | fungi: we should also check transactions on the other two servers | 20:12 |
clarkb | just in case they "own" the transaction that may be stuck | 20:12 |
clarkb | cool no transactions to worry about | 20:13 |
fungi | no active transactions for any of the three, so i think we lucked out? | 20:13 |
clarkb | yup, I guess it cancelled my vos release properly | 20:13 |
clarkb | so now we need to vos release the many many volumes. | 20:13 |
fungi | we'll find out when we try to manually vos release each volume, which is our next step | 20:13 |
clarkb | ++ | 20:14 |
clarkb | should we start with the smaller volumes and get them all out of the way? | 20:14 |
clarkb | then we can leave the big ones to sit for a while (and I can grab lunch) | 20:14 |
fungi | yeah | 20:15 |
clarkb | I'll make a list in an etherpad | 20:15 |
fungi | just looking over the `vos listvldb` output now to make sure they look okay | 20:15 |
fungi | do you need a raw volume list for the pad or do you have one handy? | 20:16 |
clarkb | I have one | 20:17 |
clarkb | fungi: I notice a few of them look "weird" like docs-old and root.cell and root.afs | 20:17 |
clarkb | I assume we treat them like any other volume though | 20:17 |
fungi | docs-old we didn't bother making replicas for i think | 20:17 |
fungi | and the others are internal in some way, but yeah we can still try to vos release them | 20:18 |
fungi | lemme know the pad url when you're ready | 20:20 |
clarkb | https://etherpad.opendev.org/p/cPfI3Q3fsexFvhTXbhkp | 20:21 |
clarkb | trying to organize them now. Sorry they ended up with a - suffix that needs to be trimmed | 20:21 |
clarkb | some may have had their names truncated too? | 20:21 |
corvus | i picked a good time to afk! | 20:21 |
clarkb | oh no its just ubuntu-cloud not ubuntu-cloud-archive | 20:21 |
clarkb | corvus: heh yes | 20:21 |
corvus | root.X are technically just like any other volume, only their name is magic | 20:21 |
opendevreview | Brian Rosmaita proposed openstack/project-config master: Add more permissions for 'glance-ptl' group https://review.opendev.org/c/openstack/project-config/+/910641 | 20:21 |
clarkb | corvus: got it | 20:23 |
fungi | we need to release parent volumes after their children for thoroughness, right? | 20:23 |
fungi | or is that just when adding/removing volumes? | 20:23 |
corvus | order shouldn't matter | 20:25 |
fungi | okay, cool | 20:25 |
fungi | i think it's when adding new volumes that it's bit me | 20:26 |
fungi | clarkb: should i just start from the top then? | 20:26 |
corvus | (the mounts are just pointers by name, so in this case, they'll always resolve to something. if there's no read-only copy of a volume though you can't mount it, so that's probably what you're thinking of) | 20:27 |
fungi | yeah | 20:27 |
clarkb | fungi: ya I think we can start at the top | 20:27 |
clarkb | I've almost got the list organized | 20:27 |
fungi | will do it in that root screen session | 20:27 |
fungi | vos release project -localauth -verbose | 20:28 |
fungi | that command look right? | 20:28 |
corvus | lgtm | 20:29 |
fungi | i guess i can stick the volume name on the end to make subsequent ones easier | 20:29 |
fungi | Released volume project successfully | 20:29 |
fungi | i'll proceed down the list and raise concerns if i see any error | 20:30 |
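A hedged sketch of how the list could be walked mechanically; the volume names would come from the etherpad, and in practice each release was run by hand so errors could be inspected one at a time:

    # volumes.txt is a hypothetical one-volume-per-line copy of the etherpad list.
    while read -r vol; do
        vos release "$vol" -localauth -verbose || echo "NEEDS ATTENTION: $vol"
    done < volumes.txt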
clarkb | also I count 58 in the etherpad (41, 14, 3) which seemed to match the count a previous listvldb had | 20:32 |
clarkb | just as a sanity check I didn't lose any in the shuffle | 20:32 |
fungi | agreed | 20:32 |
fungi | aha, no point in doing a vos release of non-replicated volumes | 20:35 |
fungi | so `vos release service` errors accordingly | 20:35 |
clarkb | yup says it is meaningless | 20:36 |
fungi | Volume 536870921 has no replicas - release operation is meaningless! | 20:36 |
clarkb | fungi: when you get a chance can you fetch the transaction listing command out of your command history? | 20:36 |
clarkb | I'm writing a docs update to capture this | 20:36 |
fungi | presumably we just ignore that condition unless it happens with a volume we know should have replicas | 20:36 |
fungi | clarkb: vos status -server 23.253.73.143 -localauth | 20:37 |
clarkb | thanks | 20:37 |
fungi | et cetera, one for each server | 20:37 |
fungi | if memory serves, the raw address is required when running on the server due to quirks of name resolution? | 20:38 |
fungi | if you run it from elsewhere i think you can get by with the normal dns names | 20:38 |
clarkb | interesting. I wrote it down using ip addrs | 20:38 |
clarkb | not a big deal | 20:38 |
fungi | root@afs01:~# vos release -localauth -verbose starlingx.io | 20:39 |
fungi | RW Volume is not found in VLDB entry for volume 536871056 | 20:39 |
clarkb | fungi: I'm guessing that starlingx volume was unused due to project.starlingx existing? | 20:39 |
fungi | maybe we meant to delete that? | 20:39 |
clarkb | and ya it only has two RO volumes | 20:39 |
clarkb | fungi: probably | 20:39 |
clarkb | we should make sure we know what content is in it first though | 20:39 |
clarkb | test.corvus is also RO only | 20:40 |
fungi | yeah, i'm bolding ones that don't release for some reason, and striking through those which do | 20:41 |
fungi | we can revisit later | 20:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add more info to afs fileserver recovery docs https://review.opendev.org/c/opendev/system-config/+/910662 | 20:43 |
clarkb | anyone have a sense for whether or not I should rerun my debian mirror reprepro commands and vos release? | 20:43 |
clarkb | That was the volume I was working with when things went south and I'm half worried that we may have incomplete cleanup there if I don't start over? | 20:44 |
fungi | can't hurt, but probably wait until we've released the other volumes | 20:44 |
clarkb | for sure | 20:44 |
fungi | just because they can fight for bandwidth | 20:45 |
clarkb | ya I want everything to be as happy as possible before trying this again :) | 20:45 |
clarkb | also "Volume needs to be salvaged" is what vos release reported previously which makes me wonder if we will need to take some extra intervention against mirror.debian despite the salvager not complaining about it | 20:49 |
fungi | getting ready to start on the "slow" volumes now | 20:52 |
clarkb | I'm waiting for them to start taking longer than a few seconds then I can go eat lunch :) do we want to start a few in parallel or is that too risky we think? | 20:52 |
fungi | mirror.logs has no replicas | 20:52 |
fungi | shall i save mirror.debian for last? | 20:55 |
clarkb | fungi: ++ | 20:55 |
clarkb | these others are going quickly but that one may need to sit and churn a bit | 20:55 |
fungi | my thoughts exactly | 20:56 |
clarkb | lol fedora is a complete release | 20:56 |
clarkb | our optimizing by skipping debian may not have saved much time | 20:56 |
clarkb | I guess now the question becomes do we want to run a few in parallel? | 20:56 |
clarkb | I think I'm ok with that but am happy to be cautious if we prefer | 20:57 |
fungi | wonder if we should have deleted this volume some time ago | 20:57 |
clarkb | oh right it should be mostly empty? | 20:57 |
clarkb | ya only 9kb according to grafana maybe it won't be too slow to do a complete release? | 20:57 |
fungi | mirror.epel was "a complete release" too and only took a few seconds | 20:58 |
fungi | Volume needs to be salvaged | 20:58 |
fungi | maybe we should flag this one and come back to it | 20:58 |
clarkb | interesting. neither afs01 nor afs02 in dfw shows disk issues currently | 20:58 |
clarkb | so ya I don't think this is a recurrence of our previous issue. ++ to skipping | 20:58 |
fungi | entirely possible this one was broken before | 20:59 |
fungi | mirror.openeuler is taking a while, wonder if it had a bunch of unreleased updates | 21:03 |
clarkb | oh it may because it is at quota | 21:03 |
clarkb | its possible we're releasing an inconsistent mirror state there | 21:03 |
clarkb | whereas previously we wouldn't release because it would error in the rsync | 21:03 |
clarkb | I'm not going to worry about that too much though | 21:04 |
clarkb | reading bos salvage docs I think it will clean up the RO volume(s) if they are discovered to be corrupted | 21:05 |
clarkb | in the case of fedora we may want to just clean the whole thing up as you say. | 21:05 |
clarkb | we may need to update our mirror configs for that though so maybe just creating an empty RO volume is easiest for now (if that is the case) | 21:05 |
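A hedged sketch of what an on-demand salvage of one of these volumes would look like per the bos salvage docs being read here; it was not actually run during this incident, and the partition name assumes /vicepa:

    # Salvage a single volume on afs01's /vicepa partition; the volume is
    # taken offline while the salvager works on it.
    bos salvage -server afs01.dfw.openstack.org -partition a -volume mirror.fedora -localauth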
clarkb | I'm going to eat something while we wait for openeuler | 21:05 |
fungi | well, it finished, but you don't need to stick around and watch if there's food with your name on it | 21:06 |
Clark[m] | I have a sandwich | 21:08 |
fungi | mirror.ubuntu-ports needs to be salvaged | 21:09 |
clarkb | ubuntu ports is also at quota | 21:12 |
fungi | mirror.debian is all that's left other than the ones that reported they needed to be salvaged | 21:12 |
fungi | should i proceed? | 21:12 |
clarkb | yes I think if it is all that remains we should go ahead | 21:12 |
clarkb | then we'll figure out the salvaged needed volumes | 21:12 |
fungi | k | 21:12 |
clarkb | fwiw still nothing in dmesg indicating disk trouble currently causing the need for salvaging | 21:13 |
clarkb | I wish it would indicate what is wrong requiring it to be salvaged | 21:14 |
clarkb | ok debian also needs to be salvaged | 21:14 |
clarkb | fungi: I think we should start with fedora since it is unused | 21:14 |
fungi | not entirely surprised | 21:15 |
clarkb | fungi: oh look in /var/log/openafs | 21:15 |
clarkb | there is a new salvage log and it appears to have auto salvaged debian after deciding it needed to be salvaged | 21:15 |
clarkb | oh its gone but the other file was updated | 21:16 |
clarkb | I think it did fedora too. Lets list the vldb entries for them to see if we have all the expected RW and RO replicas then maybe try rerunning vos release? | 21:17 |
clarkb | ya vos listvldb shows the three entries for all three volumes that reported needing salvaging | 21:18 |
fungi | Released volume mirror.fedora successfully | 21:18 |
clarkb | ya so I guess it autodetects the need for salvaging but then bails on the release | 21:18 |
clarkb | but I think all three are worth trying to release since they have the replicas we expect | 21:18 |
fungi | openeuler and ubuntu-ports might need a quota bump first? | 21:19 |
clarkb | fungi: openeuler completed right? | 21:19 |
fungi | er, which was the other one that needed a quota increase? | 21:20 |
clarkb | but ya they both need quota bumps generally. I think we should make sure they release manually then bump them then we can remove cron locks and let cron jobs update them normally | 21:20 |
clarkb | fungi: those two need quota bumps but only one needed salvaging and that one was ubuntu ports | 21:20 |
fungi | ah okay | 21:20 |
fungi | Released volume mirror.debian successfully | 21:21 |
clarkb | assuming ubuntu-ports releases properly I think the next step is for me to manually finish the debian mirror cleanup. While I do that you can bump quotas on ubuntu-ports and openeuler (because that needs locks held) then with that all done we can reenable crons on mirror-update and remove the node from emergency | 21:23 |
fungi | sounds good | 21:23 |
clarkb | and by manually finish I actually mean manually start all the steps over again just to be sure everything applied properly | 21:24 |
fungi | planning to raise openeuler from 300gb to 350gb and ubuntu-ports from 550gb to 600gb... that work? | 21:27 |
clarkb | ++ | 21:28 |
clarkb | the debian and opensuse cleanup is on the order of 350GB I think? maybe 300GB so should be plenty of headroom | 21:28 |
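A hedged sketch of the quota bumps being discussed, using vos setfields (whether that or fs setquota was actually used here is an assumption); OpenAFS quotas are set in kilobytes, so the numbers below correspond to 350GB and 600GB:

    vos setfields -id mirror.openeuler -maxquota 367001600 -localauth
    vos setfields -id mirror.ubuntu-ports -maxquota 629145600 -localauth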
clarkb | ubuntu ports released successfully | 21:28 |
clarkb | I'll proceed with redoing the debian buster cleanup | 21:28 |
fungi | yep, i'll proceed with the aforementioned quota bumps and re-release the two affected volumes | 21:29 |
clarkb | the two reprepro commands appear to have nooped but I'm redoing a vos release for debian for good measure | 21:30 |
fungi | okay, volume increases and rereleases are done | 21:30 |
clarkb | then I'll do the manual file deletions and run reprepro-mirror-update | 21:30 |
fungi | awesome, and then we can release locks/take stuff out of the emergency list and i can start cooking dinner | 21:31 |
clarkb | reprepro is running against debian for a sanity check sync now and that will include a vos release then ya I can start undoing the lock holds and so on | 21:34 |
fungi | i'll go ahead and close out the rackspace ticket | 21:36 |
* clarkb looks at the day's todo list and has a sad | 21:37 | |
fungi | my todo list just said "fight fires" | 21:37 |
fungi | (i wish) | 21:37 |
clarkb | if you get bored you can review my doc updates :) | 21:38 |
clarkb | though probably best to approve those after we remove the held locks and reenable cron jobs | 21:38 |
fungi | and after i cook/eat dinner | 21:38 |
clarkb | ++ | 21:39 |
clarkb | if anything makes this better it's a cold, rainy, windy, miserable day here | 21:39 |
clarkb | so I have nothing better to do than sit at my desk | 21:39 |
clarkb | checkpool fast is running | 21:39 |
clarkb | and now the last vos release is running | 21:41 |
clarkb | ok thats done. I'm going to drop all my held locks now then uncomment the cronjobs | 21:42 |
clarkb | fungi: would be good if you can double check the crontab when this is done | 21:42 |
fungi | can do, though also so will ansible | 21:43 |
clarkb | fungi: thats done | 21:45 |
clarkb | I have removed mirror-update02 from the emergency file | 21:46 |
fungi | root crontab on mirror-update lgtm | 21:46 |
fungi | and i let #openstack-release know we're done | 21:46 |
clarkb | I disconnected from the root screen if you want to drop it | 21:47 |
clarkb | and thank you for checking | 21:47 |
clarkb | In order to help me not forget we need to investigate cleanup of RO only volumes, investigate removal of the fedora volume, and do manual dists/ and lists/ cleanup for old distro releases in debuntu reprepro mirrors | 21:48 |
clarkb | but all of that can wait until after we've had a break and dinner :) | 21:49 |
fungi | screen session gone | 21:52 |
clarkb | we should also maybe make note of the volumes that have problems. We may want to remove them | 21:52 |
clarkb | xvdb on afs01.dfw.openstack.org in this case if future me looks in logs | 21:52 |
clarkb | fungi: thank you for helping me get through that | 21:55 |
fungi | thank you for the same! | 21:57 |
clarkb | the afsmon crontab should fire in a couple of minutes and that should update our grafana graphs | 21:58 |
clarkb | oh nope I read it as every half hour but it runs on the half hour | 22:01 |
clarkb | so 29 minutes away now | 22:01 |
clarkb | graphs updated and show the changes we made (deletions and quota bumps both) | 22:37 |
ianw | (i'm just following along being glad that you're handling it, but if it's getting late and you want me to watch anything/kick of anything etc. i'm around) | 22:40 |
clarkb | ianw: thanks! I think everything appears to be back to normal now | 22:47 |
clarkb | ianw: maybe you want to review the changes I wrote related to this? otherwise I think we're good | 22:47 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/910662 and parent | 22:48 |
ianw | all lgtm! | 22:53 |
clarkb | I think we did end up clearing about 400GB total between debian and opensuse | 22:53 |
clarkb | centos 7 and xenial should make a big dent too | 22:53 |
opendevreview | Merged opendev/system-config master: Update reprepro cleanup docs to cover dists/ and lists/ cleanup https://review.opendev.org/c/opendev/system-config/+/910655 | 23:15 |