*** tkajinam is now known as Guest1349 | 01:24 | |
*** dhill is now known as Guest1368 | 05:03 | |
opendevreview | Sumanth Kumar Batchu proposed opendev/gerritlib master: Added a comment https://review.opendev.org/c/opendev/gerritlib/+/910568 | 06:59 |
noonedeadpunk | clarkb: from our gut feeling, what indeed takes most of the time is connection and forking. And I'm fully clueless about what can be done there, except optimizing SSH settings, like using lightweight ciphers, disabling GSSAPI/Kerberos, tuning persistent connections, and then disabling things like the dynamic motd which is part of the default ubuntu setup | 08:12 |
noonedeadpunk | And another thing we were focusing on is trying to reduce the number of variables at runtime. Like disabling INJECT_FACTS_AS_VARS, which gives quite a significant performance improvement, but is largely negated by external dependency roles. | 08:14 |
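A minimal sketch of the INJECT_FACTS_AS_VARS tweak mentioned above, written as a shell snippet that appends the setting to ansible.cfg; the config file path is an assumption, only the option name comes from the discussion:

    # Hedged sketch: stop copying gathered facts into each host's top-level
    # variable namespace, so tasks only see them under ansible_facts.
    # INJECT_FACTS_AS_VARS maps to this ini key in the [defaults] section.
    cat >> /etc/ansible/ansible.cfg <<'EOF'
    [defaults]
    inject_facts_as_vars = False
    EOF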
noonedeadpunk | But the task engine is indeed something we didn't touch.... | 08:14 |
noonedeadpunk | I pretty much hoped that their aggressive deprecation of Python versions was exactly so they could drop legacy code and improve the performance of forking | 08:15 |
noonedeadpunk | and then indeed we came to the point of discussing replacing whole roles that we call multiple times (like for creating a systemd service) with Ansible modules. That would improve runtime speed dramatically, but basically it's a path towards Jinja Charms... | 08:18 |
*** elodilles_pto is now known as elodilles | 08:29 | |
opendevreview | Artem Goncharov proposed openstack/project-config master: Add OpenAPI related repos to OpenStackSDK project https://review.opendev.org/c/openstack/project-config/+/910580 | 08:32 |
opendevreview | Artem Goncharov proposed openstack/project-config master: Add OpenAPI related repos to OpenStackSDK project https://review.opendev.org/c/openstack/project-config/+/910580 | 08:33 |
opendevreview | Lukas Kranz proposed zuul/zuul-jobs master: Make prepare-workspace-git fail faster. https://review.opendev.org/c/zuul/zuul-jobs/+/910582 | 08:49 |
*** tosky_ is now known as tosky | 11:48 | |
fungi | noonedeadpunk: is pipelining an option, keeping a persistent ssh connection open for the duration of the play and reusing it rather than reconnecting for each task? or does it already do some of that? | 13:44 |
noonedeadpunk | Yup, we do have pipelining enabled - using `-C -o ControlMaster=auto -o ControlPersist=300` for ssh arguments | 13:47 |
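For context, a hedged sketch of where those flags would live in ansible.cfg and what each one does; the file path is an assumption, the flag values are the ones quoted above:

    # -C compresses the SSH channel, ControlMaster=auto multiplexes every
    # task's commands over one shared connection per host, and
    # ControlPersist=300 keeps that master connection open for 300s after the
    # last task so subsequent tasks skip the TCP and key-exchange overhead.
    cat >> /etc/ansible/ansible.cfg <<'EOF'
    [ssh_connection]
    pipelining = True
    ssh_args = -C -o ControlMaster=auto -o ControlPersist=300
    EOF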
*** ralonsoh_ is now known as ralonsoh | 13:59 | |
fungi | so it's probably more fork initialization and cpython interpreter startup time i suppose | 14:03 |
Clark[m] | fungi: noonedeadpunk: yes I think it is python overhead due to inefficient process management. And I think it occurs both on the controller (no preforking of the -f threads) and on the remote side (starting a new python for each task) | 15:34 |
Clark[m] | But I haven't looked in the code in a long time. There was a huge performance regression after a refactor of this stuff though | 15:35 |
Clark[m] | Way back when I mean. And it hasn't really improved since | 15:35 |
fungi | just add more ram. it's the solution to every performance regression | 15:36 |
noonedeadpunk | + overhead for copying the module I assume | 15:45 |
noonedeadpunk | but yes, I agree. and unfortunately, the way of dealing with it is slightly beyond my scope/time constraints | 15:47 |
opendevreview | Brian Rosmaita proposed openstack/project-config master: Add more permissions for 'glance-ptl' group https://review.opendev.org/c/openstack/project-config/+/910641 | 15:49 |
clarkb | the debian mirror update is currently running (and holding the lock). I'll try to grab it as soon as it finishes then I'll approve the change for debian mirror cleanup | 16:12 |
fungi | thanks! | 16:16 |
frickler | fungi: clarkb: could one of you have a look at https://review.opendev.org/c/openstack/project-config/+/904837 please? | 16:29 |
clarkb | fungi probably has more context on that than I do but I can take a look too | 16:30 |
frickler | I was actually thinking the same, but then I didn't want to make you feel excluded ;) | 16:32 |
clarkb | I have the debian reprepro lock. I'm approving https://review.opendev.org/c/opendev/system-config/+/910032 now | 16:33 |
clarkb | oh double checking the change looks like i need locks for all the debian things | 16:33 |
clarkb | debian, debian-security, and debian-ceph-octopus | 16:34 |
clarkb | grabbing the other two before I approve | 16:34 |
clarkb | those were not held so I got them immediately | 16:36 |
fungi | that change's mix of grep and bash pattern manipulation is mind-bending. i'm not confident i know, for example, why the script uses more toothpicks for grep ${NEW_BRANCH/\//-}-eol than for ${NEW_BRANCH//@(stable\/|unmaintained\/)}-eol | 16:45 |
clarkb | fungi: I copy pasted into bash locally and the outputs looked good fwiw | 16:45 |
clarkb | but I think it is because it is doing a regex replace vs a dumb replace | 16:45 |
clarkb | and bash is heavy on the syntax | 16:46 |
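A small runnable sketch of the two expansions being compared; the branch name is hypothetical, and note the second form is bash extglob alternation (enabled via shopt) rather than a regex:

    #!/bin/bash
    shopt -s extglob
    NEW_BRANCH=unmaintained/2023.1   # hypothetical value for illustration

    # Replace only the first "/" with "-":
    echo "${NEW_BRANCH/\//-}-eol"                            # unmaintained-2023.1-eol

    # Strip a "stable/" or "unmaintained/" prefix entirely via @(...|...):
    echo "${NEW_BRANCH//@(stable\/|unmaintained\/)}-eol"     # 2023.1-eol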
opendevreview | Merged openstack/project-config master: Adapt make_branch script to new 'unmaintained/<series>' branch https://review.opendev.org/c/openstack/project-config/+/904837 | 16:54 |
opendevreview | Merged opendev/mqtt_statsd master: Revert "Retire this repo" https://review.opendev.org/c/opendev/mqtt_statsd/+/905142 | 16:55 |
opendevreview | Merged opendev/system-config master: Remove debian buster package mirrors https://review.opendev.org/c/opendev/system-config/+/910032 | 17:18 |
clarkb | the updated reprepro configs appear to have been applied. I'm running the cleanup for ceph octopus first | 17:52 |
clarkb | I completed the steps documented in https://docs.opendev.org/opendev/system-config/latest/reprepro.html#removing-components but https://mirror.bhs1.ovh.opendev.org/ceph-deb-octopus/dists/buster/ still exists. | 17:55 |
clarkb | Before I drop the lockfile I think I should manually delete the dists/buster dir in that repo? | 17:55 |
clarkb | fungi: ^ | 17:55 |
clarkb | then I can rerun the vos release and drop the lock | 17:55 |
clarkb | there are also a few files to remove from https://mirror.bhs1.ovh.opendev.org/ceph-deb-octopus/lists/ | 17:56 |
clarkb | if that looks correct to you I'll do that after eating some breakfast, then I'll write a docs update change and then finally continue on with debian and debian-security | 17:56 |
clarkb | actually I can delete the files then rerun a regular sync. If they come back then I know they should stick around. If they don't then it is proper cleanup. I'll proceed with that plan | 18:08 |
fungi | clarkb: yeah, maybe delete the dir *and* rerun the mirror script again to make sure it doesn't recreate anything | 18:08 |
fungi | right, what you also just said | 18:08 |
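A hedged sketch of the plan just agreed on; the AFS paths and the volume name are assumptions inferred from the mirror URL, not copied from the real runbook:

    # Remove the leftover buster metadata from the ceph-octopus reprepro tree,
    # re-run the normal sync to confirm nothing recreates it, then publish the
    # read-only replicas. Volume name mirror.ceph-octopus is assumed.
    rm -rf /afs/.openstack.org/mirror/ceph-deb-octopus/dists/buster
    rm -f /afs/.openstack.org/mirror/ceph-deb-octopus/lists/*buster*
    # ... normal reprepro sync script run here (invocation assumed) ...
    vos release mirror.ceph-octopus -localauth -verbose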
clarkb | I did a vos release after the cleanup just to see it reflected on the mirrors. Will run a regular reprepro sync now | 18:16 |
clarkb | ok rerun didn't readd the files so I think this is correct to do | 18:19 |
clarkb | I'll write up a quick docs change and then proceed with debian-security and debian | 18:19 |
fungi | perfect | 18:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update reprepro cleanup docs to cover dists/ and lists/ cleanup https://review.opendev.org/c/opendev/system-config/+/910655 | 18:33 |
clarkb | something like that | 18:33 |
clarkb | now doing debian security. | 18:34 |
clarkb | and now finally debian proper | 18:45 |
clarkb | I note that stretch still has those extra files hanging around. I know we also have ubuntu ports (and probably ubuntu releases?) with similar issues. I think I'm going to focus on cleaning up buster then we can do a broader cleanup of these extra files for other releases later | 18:47 |
clarkb | tonyb may want to dig into that after setting up openafs creds? It has a lot of interactions with interesting afs things | 18:48 |
clarkb | there are a lot of files to clean up in buster proper :) | 18:55 |
clarkb | only into the r packages | 18:56 |
clarkb | Error in vos release command. | 19:08 |
clarkb | Volume needs to be salvaged | 19:08 |
clarkb | that is unfortunate and annoying | 19:08 |
clarkb | I'm not sure I understand what this means yet either | 19:10 |
clarkb | ok afs01.dfw.openstack.org has ext4 errors according to dmesg. I suspect this is the cause | 19:11 |
clarkb | I may need help here. I think xvdb disappeared on us | 19:13 |
clarkb | infra-root ^ fyi | 19:13 |
fungi | argh! | 19:14 |
clarkb | afs02 seems fine | 19:14 |
clarkb | so do we reboot or just try to remount so it isn't remounted ro? | 19:15 |
fungi | [Thu Feb 29 18:48:13 2024] INFO: task jbd2/dm-0-8:541 blocked for more than 120 seconds. | 19:15 |
fungi | that looks like the start of the incident | 19:15 |
clarkb | I think the order of operations here is going to be make afs01 happy with xvdb again and then we have to make afs happy | 19:15 |
fungi | probably first we check for rackspace tickets about hardware drama | 19:16 |
clarkb | fungi: I didn't see any emails at least | 19:16 |
fungi | k | 19:16 |
clarkb | but feel free to double check | 19:16 |
clarkb | I'm wondering if we need to grab locks for everything so this problem doesn't spread? | 19:17 |
clarkb | but it may be too late for that? | 19:17 |
clarkb | and yes argh | 19:17 |
fungi | we'll probably want to make sure that all active volumes are being served from one of the other two fileservers | 19:17 |
clarkb | fungi: you mean move RW to afs02 if it is on afs01? | 19:18 |
clarkb | the RO stuff should already be on both? | 19:18 |
fungi | yeah | 19:18 |
clarkb | thats a lot of volumes... I can start by holding mirror-update locks I guess | 19:19 |
clarkb | then we move things then we cry? | 19:19 |
clarkb | :) | 19:19 |
fungi | looks like we have this too: https://docs.opendev.org/opendev/system-config/latest/afs.html#recovering-a-failed-fileserver | 19:20 |
fungi | doesn't seem to recommend failing over volumes like in https://docs.opendev.org/opendev/system-config/latest/afs.html#afs0x-openstack-org | 19:21 |
fungi | but it does say we should pause writes | 19:21 |
clarkb | I've grabbed all the lockfiles on mirror update | 19:25 |
clarkb | this doesn't deal with docs/tarballs/etc | 19:25 |
clarkb | looking for that now | 19:25 |
fungi | grabbing a cup of tea real quick since the rest of our afternoons just flashed before my eyes | 19:25 |
clarkb | I can never find where we run that crontab | 19:26 |
clarkb | but I'm trying to find it now | 19:26 |
fungi | clarkb: it's on mirror-update | 19:26 |
fungi | */5 * * * * /opt/afs-release/release-volumes.py -d >> /var/log/afs-release/afs-release.log 2>&1 | 19:27 |
fungi | in root's crontab | 19:27 |
fungi | that's what does tarballs, docs, etc. i think if we just comment out all the cronjobs and put mirror-update in emergency disable we'll be fine | 19:27 |
clarkb | ya and in that script the lockfile is /var/run/release-volumes.lock | 19:27 |
clarkb | fungi: ok I grabbed all the locks. The publish logs thing doesn't appear to do a lockfile | 19:29 |
clarkb | we need to put the server in the emergency file before editing the crontab | 19:29 |
fungi | you're just holding all the locks instead of commenting out the cronjobs. that works and i guess keeps us from needing to worry about putting mirror-update in the emergency file | 19:29 |
fungi | or we can do both if you like | 19:30 |
clarkb | I'm doing both because of the logging publishing | 19:30 |
clarkb | ok we should be idled now | 19:31 |
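For reference, a hedged sketch of holding one of those locks by hand, assuming the cron-driven scripts take them with flock(1); the lock path is the one mentioned above:

    # Hold the release-volumes lock so the every-5-minutes cron job cannot
    # start; interrupting the command releases the lock again.
    flock -n /var/run/release-volumes.lock -c 'echo lock held; sleep infinity'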
fungi | This message is to inform you that our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, afs01.dfw.opendev.org/main01, '1c4c3a46-2571-4442-a85a-5603ff91a68d' at 2024-02-29T19:16:47.073638. We are currently investigating the issue and will update you as soon as we have additional information regarding the alert. Please do not access or modify | 19:32 |
fungi | '1c4c3a46-2571-4442-a85a-5603ff91a68d' during this process. | 19:32 |
fungi | Please reference this incident ID if you need to contact support: CBSHD-5cf24b43 | 19:32 |
fungi | found that in the rackspace dashboard just now when i logged in | 19:32 |
fungi | that's from Thursday, February 29, 2024 at 7:16 PM UTC | 19:32 |
fungi | so about 15 minutes ago | 19:32 |
fungi | we should probably refrain from rebooting the server until they've given the all-clear signal? | 19:33 |
clarkb | ++ | 19:33 |
clarkb | reading our docs and upstream salvage docs I'm not fully sure I understand the implications of this situation | 19:33 |
clarkb | like will we replace the content on afs01 with afs02 content automagically? Or maybe it just comes up and is happy? | 19:34 |
fungi | openafs should try to auto-salvage the volumes i think | 19:34 |
clarkb | gotcha | 19:34 |
fungi | if it can't for some reason, then we take additional steps to replace/recover the volume | 19:34 |
fungi | separately, we'll likely have hung/lost transactions from vos release commands that occurred during the incident | 19:35 |
fungi | and we'll need to cancel them manually | 19:35 |
clarkb | fungi: ya I think my concern is if we do a vos release will it potentially overwrite good RO content on afs02 with bad content from afs01. But sounds like it should do consistency checks and in theory we'd do a manual promotion to afs02 then potentially redo my debian cleanup | 19:35 |
clarkb | fungi: how do we clean those up? | 19:36 |
clarkb | though maybe we should avoid cleaning those up until afs01 is happy? | 19:36 |
clarkb | otherwise we may create io load we don't want | 19:36 |
fungi | the tasks? i have to refresh my memory on the exact command, but i would wait until after we get the underlying filesystem fixed | 19:36 |
clarkb | ++ | 19:37 |
clarkb | this doesn't give me a lot of confidence that centos 7 and xenial cleanups will go smoothly... | 19:37 |
clarkb | though it could be coincidence | 19:37 |
fungi | looks like in the past we've used `vos status -server afs01.dfw.openstack.org -localauth` to check the transactions, then `vos endtrans -v -localauth afs01.dfw.openstack.org <transaction_id>` to clean them up, and maybe manually unlocked the volumes with `vos unlock -localauth some.volume` if necessary | 19:40 |
clarkb | I've brought up if we should ask openstack release to idle too (but docs won't be idled by that alone) | 19:41 |
clarkb | corvus: you too may be interested in this as it may impact zuul stuff | 19:41 |
fungi | or perhaps `vos endtrans -server localhost -transaction <transaction_id> -localauth -verbose` looking at shell history | 19:41 |
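Putting those together, the likely cleanup sequence for a stuck transaction looks roughly like this (transaction ID and volume name are placeholders; as it turns out below, none were needed this time):

    # List in-flight volserver transactions on the affected fileserver.
    vos status -server afs01.dfw.openstack.org -localauth
    # End a hung transaction by ID, then unlock the VLDB entry if it was left locked.
    vos endtrans -server afs01.dfw.openstack.org -transaction <transaction_id> -localauth -verbose
    vos unlock -localauth some.volume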
clarkb | ya there are post failures in tarball jobs | 19:44 |
clarkb | fungi: an email from rax says "pending customer" - is that an indication that maybe they want us to check if it is happy again? | 19:45 |
clarkb | talking out loud here but I wonder if we should disable afs services on afs01 before rebooting it (once we get to that point) that way we can enable services manually and check them post boot? | 19:47 |
clarkb | would allow us to fsck too I think | 19:47 |
clarkb | fungi: if you are still logged into the dashboard can you check the ticket status? | 19:48 |
fungi | i'm refreshing now | 19:50 |
fungi | This message is to inform you that your Cloud Block Storage device,afs01.dfw.opendev.org/main01, 1c4c3a46-2571-4442-a85a-5603ff91a68d has been returned to service. | 19:50 |
fungi | Thursday, February 29, 2024 at 7:47 PM UTC | 19:51 |
clarkb | about 4 minutes ago | 19:51 |
fungi | so we should be clear to reboot the server now | 19:51 |
clarkb | fungi: did we want to disable services first so that we can fsck or whatever first? | 19:51 |
clarkb | first post reboot I mean | 19:51 |
clarkb | or just follow the doc and see what happens? | 19:51 |
fungi | i would just follow the doc from here | 19:51 |
fungi | i'll pull up the server console though in case it wants to fsck for a while at boot | 19:52 |
clarkb | ok, it says "fix any filesystem errors", but I'm not sure an fsck will run to detect those, which is why I ask | 19:52 |
clarkb | fungi: sounds good let me know when I should issue a reboot command on the server | 19:52 |
fungi | /dev/main/vicepa /vicepa ext4 errors=remount-ro,barrier=0 0 2 | 19:52 |
fungi | but i expect the lvm2 layer shields us here since the volume was put into a non-writeable state as soon as errors cropped up on the underlying pv | 19:53 |
clarkb | I see | 19:53 |
clarkb | the 2 should cause a fsck to happen though right? | 19:53 |
fungi | yes, once the lvm volume is activated, before its filesystem is mounted | 19:54 |
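For anyone reading along, the fields of the fstab line quoted above break down like this:

    # /dev/main/vicepa    LVM logical volume "vicepa" in volume group "main"
    # /vicepa             mount point the OpenAFS fileserver stores volumes under
    # ext4                filesystem type
    # errors=remount-ro   on ext4 errors, remount read-only instead of panicking
    # barrier=0           write barriers disabled
    # 0                   dump field: not backed up by dump(8)
    # 2                   fsck pass number: checked at boot, after the root filesystem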
clarkb | alright then ready for me to reboot? | 19:55 |
clarkb | or you can do it when you are ready with the console if you prefer | 19:55 |
fungi | i have the console up now | 19:55 |
clarkb | I see you have a shell too | 19:55 |
fungi | it doesn't seem to accept input from my keyboard, but there is a "send ctrlaltdel" button | 19:55 |
fungi | but yeah, i can reboot it from the shell anyway | 19:56 |
clarkb | ya I think we resort to the other thing if that doesn't work | 19:56 |
clarkb | since this should be the most graceful option | 19:56 |
fungi | previous uptime 312 days | 19:56 |
fungi | server is rebooting | 19:56 |
clarkb | still no ping responses from there | 19:58 |
clarkb | s/there/here/ | 19:58 |
fungi | yeah, console is blank at the moment | 19:59 |
fungi | haven't seen it start booting yet | 19:59 |
fungi | may still be in the process of shutting down | 19:59 |
clarkb | ya could be | 19:59 |
clarkb | systemd makes sshd go away fast but other things may still be slowly shutting down | 19:59 |
fungi | sure is taking a while though | 20:01 |
clarkb | ~5 or 6 minutes now? | 20:02 |
clarkb | it just started pinging | 20:02 |
fungi | yeah, i see a fsck of xvda1 ran | 20:02 |
clarkb | xvdb is the device that had a sad | 20:03 |
clarkb | fwiw pvs,lvs,vgs looks ok | 20:03 |
clarkb | I see afs and bos services running | 20:03 |
fungi | boot.log says it did fsck /vicepa | 20:04 |
fungi | so seems that was fine | 20:04 |
clarkb | I think the next thing to find is the salvager logs? | 20:04 |
fungi | i've got a root screen session on afs01.dfw | 20:04 |
* clarkb joins | 20:04 | |
clarkb | looks like it did some salvaging. It isn't clear to me if salvaging is still in progress though | 20:06 |
clarkb | it also looks like it primarily needed to salvage where I couldn't get things idled fast enough (mirror.logs for example) | 20:08 |
fungi | looks like it salvaged mirror.logs and project.readonly | 20:08 |
fungi | the latter is probably due to some docs jobs | 20:08 |
fungi | looks like it didn't complain about anything else | 20:09 |
* clarkb will update docs for recovering a file server to add notes about where to find the salvager logs | 20:09 | |
clarkb | ya so next step is looking for stuck volume transactions (lets get the docs for that updated too) | 20:09 |
fungi | the manpage claims it will be SalvageLog but seems it's actually SalsrvLog | 20:10 |
clarkb | agreed | 20:11 |
fungi | okay, process of elimination, we need to use the global v4 address like `vos status -server 104.130.138.161 -localauth` | 20:11 |
clarkb | fungi: we should also check transactions on the other two servers | 20:12 |
clarkb | just in case they "own" the transaction that may be stuck | 20:12 |
clarkb | cool no transactions to worry about | 20:13 |
fungi | no active transactions for any of the three, so i think we lucked out? | 20:13 |
clarkb | yup, I guess it cancelled my vos release properly | 20:13 |
clarkb | so now we need to vos release the many many volumes. | 20:13 |
fungi | we'll find out when we try to manually vos release each volume, which is our next step | 20:13 |
clarkb | ++ | 20:14 |
clarkb | should we start with the smaller volumes and get them all out of the way? | 20:14 |
clarkb | then we can leave the big ones to sit for a while (and I can grab lunch) | 20:14 |
fungi | yeah | 20:15 |
clarkb | I'll make a list in an etherpad | 20:15 |
fungi | just looking over the `vos listvldb` output now to make sure they look okay | 20:15 |
fungi | do you need a raw volume list for the pad or do you have one handy? | 20:16 |
clarkb | I have one | 20:17 |
clarkb | fungi: I notice a few of them look "weird" like docs-old and root.cell and root.afs | 20:17 |
clarkb | I assume we treat them like any other volume though | 20:17 |
fungi | docs-old we didn't bother making replicas for i think | 20:17 |
fungi | and the others are internal in some way, but yeah we can still try to vos release them | 20:18 |
fungi | lemme know the pad url when you're ready | 20:20 |
clarkb | https://etherpad.opendev.org/p/cPfI3Q3fsexFvhTXbhkp | 20:21 |
clarkb | trying to organize them now. Sorry they ended up with a - suffix that needs to be trimmed | 20:21 |
clarkb | some may have had their names truncated too? | 20:21 |
corvus | i picked a good time to afk! | 20:21 |
clarkb | oh no its just ubuntu-cloud not ubuntu-cloud-archive | 20:21 |
clarkb | corvus: heh yes | 20:21 |
corvus | root.X are technically just like any other volume, only their name is magic | 20:21 |
opendevreview | Brian Rosmaita proposed openstack/project-config master: Add more permissions for 'glance-ptl' group https://review.opendev.org/c/openstack/project-config/+/910641 | 20:21 |
clarkb | corvus: got it | 20:23 |
fungi | we need to release parent volumes after their children for thoroughness, right? | 20:23 |
fungi | or is that just when adding/removing volumes? | 20:23 |
corvus | order shouldn't matter | 20:25 |
fungi | okay, cool | 20:25 |
fungi | i think it's when adding new volumes that it's bit me | 20:26 |
fungi | clarkb: should i just start from the top then? | 20:26 |
corvus | (the mounts are just pointers by name, so in this case, they'll always resolve to something. if there's no read-only copy of a volume though you can't mount it, so that's probably what you're thinking of) | 20:27 |
fungi | yeah | 20:27 |
clarkb | fungi: ya I think we can start at the top | 20:27 |
clarkb | I've almost got the list organized | 20:27 |
fungi | will do it in that root screen session | 20:27 |
fungi | vos release project -localauth -verbose | 20:28 |
fungi | that command look right? | 20:28 |
corvus | lgtm | 20:29 |
fungi | i guess i can stick the volume name on the end to make subsequent ones easier | 20:29 |
fungi | Released volume project successfully | 20:29 |
fungi | i'll proceed down the list and raise concerns if i see any error | 20:30 |
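A hedged sketch of how the list could be walked mechanically; the volume names would come from the etherpad, and in practice each release was run by hand so errors could be inspected one at a time:

    # volumes.txt is a hypothetical one-volume-per-line copy of the etherpad list.
    while read -r vol; do
        vos release "$vol" -localauth -verbose || echo "NEEDS ATTENTION: $vol"
    done < volumes.txt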
clarkb | also I count 58 in the etherpad (41, 14, 3) which seemed to match the count a previous listvldb had | 20:32 |
clarkb | just as a sanity check I didn't lose any in the shuffle | 20:32 |
fungi | agreed | 20:32 |
fungi | aha, no point in doing a vos release of non-replicated volumes | 20:35 |
fungi | so `vos release service` errors accordingly | 20:35 |
clarkb | yup says it is meaningless | 20:36 |
fungi | Volume 536870921 has no replicas - release operation is meaningless! | 20:36 |
clarkb | fungi: when you get a chance can you fetch the transaction listing command out of your command history? | 20:36 |
clarkb | I'm writing a docs update to capture this | 20:36 |
fungi | presumably we just ignore that condition unless it happens with a volume we know should have replicas | 20:36 |
fungi | clarkb: vos status -server 23.253.73.143 -localauth | 20:37 |
clarkb | thanks | 20:37 |
fungi | et cetera, one for each server | 20:37 |
fungi | if memory serves, the raw address is required when running on the server due to quirks of name resolution? | 20:38 |
fungi | if you run it from elsewhere i think you can get by with the normal dns names | 20:38 |
clarkb | interesting. I wrote it down using ip addrs | 20:38 |
clarkb | not a big deal | 20:38 |
fungi | root@afs01:~# vos release -localauth -verbose starlingx.io | 20:39 |
fungi | RW Volume is not found in VLDB entry for volume 536871056 | 20:39 |
clarkb | fungi: I'm guessing that starlingx volume was unused due to project.starlingx existing? | 20:39 |
fungi | maybe we meant to delete that? | 20:39 |
clarkb | and ya it only has two RO volumes | 20:39 |
clarkb | fungi: probably | 20:39 |
clarkb | we should make sure we know what content is in it first though | 20:39 |
clarkb | test.corvus is also RO only | 20:40 |
fungi | yeah, i'm bolding ones that don't release for some reason, and striking through those which do | 20:41 |
fungi | we can revisit later | 20:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add more info to afs fileserver recovery docs https://review.opendev.org/c/opendev/system-config/+/910662 | 20:43 |
clarkb | anyone have a sense for whether or not I should rerun my debian mirror reprepro commands and vos release? | 20:43 |
clarkb | That was the volume I was working with when things went south and I'm half worried that we may have incomplete cleanup there if I don't start over? | 20:44 |
fungi | can't hurt, but probably wait until we've released the other volumes | 20:44 |
clarkb | for sure | 20:44 |
fungi | just because they can fight for bandwidth | 20:45 |
clarkb | ya I want everything to be as happy as possible before trying this again :) | 20:45 |
clarkb | also "Volume needs to be salvaged" is what vos release reported previously which makes me wonder if we will need to take some extra intervention against mirror.debian despite the salvager not complaining about it | 20:49 |
fungi | getting ready to start on the "slow" volumes now | 20:52 |
clarkb | I'm waiting for them to start taking longer than a few seconds then I can go eat lunch :) do we want to start a few in parallel or is that too risky we think? | 20:52 |
fungi | mirror.logs has no replicas | 20:52 |
fungi | shall i save mirror.debian for last? | 20:55 |
clarkb | fungi: ++ | 20:55 |
clarkb | these others are going quickly but that one may need to sit and churn a bit | 20:55 |
fungi | my thoughts exactly | 20:56 |
clarkb | lol fedora is a complete release | 20:56 |
clarkb | our optimizing by skipping debian may not have saved much time | 20:56 |
clarkb | I guess now the question becomes do we want to run a few in parallel? | 20:56 |
clarkb | I think I'm ok with that but am happy to be cautious if we prefer | 20:57 |
fungi | wonder if we should have deleted this volume some time ago | 20:57 |
clarkb | oh right it should be mostly empty? | 20:57 |
clarkb | ya only 9kb according to grafana maybe it won't be too slow to do a complete release? | 20:57 |
fungi | mirror.epel was "a complete release" too and only took a few seconds | 20:58 |
fungi | Volume needs to be salvaged | 20:58 |
fungi | maybe we should flag this one and come back to it | 20:58 |
clarkb | interesting. neither afs01 nor afs02 in dfw shows disk issues currently | 20:58 |
clarkb | so ya I don't think this is a recurrence of our previous issue. ++ to skipping | 20:58 |
fungi | entirely possible this one was broken before | 20:59 |
fungi | mirror.openeuler is taking a while, wonder if it had a bunch of unreleased updates | 21:03 |
clarkb | oh it may because it is at quota | 21:03 |
clarkb | its possible we're releasing an inconsistent mirror state there | 21:03 |
clarkb | whereas previously we wouldn't release because it would error in the rsync | 21:03 |
clarkb | I'm not going to worry about that too much though | 21:04 |
clarkb | reading bos salvage docs I think it will clean up the RO volume(s) if they are discovered to be corrupted | 21:05 |
clarkb | in the case of fedora we may want to just clean the whole thing up as you say. | 21:05 |
clarkb | we may need to update our mirror configs for that though so maybe just creating an empty RO volume is easiest for now (if that is the case) | 21:05 |
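A hedged sketch of what an on-demand salvage of one of these volumes would look like per the bos salvage docs being read here; it was not actually run during this incident, and the partition name assumes /vicepa:

    # Salvage a single volume on afs01's /vicepa partition; the volume is
    # taken offline while the salvager works on it.
    bos salvage -server afs01.dfw.openstack.org -partition a -volume mirror.fedora -localauth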
clarkb | I'm going to eat something while we wait for openeuler | 21:05 |
fungi | well, it finished, but you don't need to stick around and watch if there's food with your name on it | 21:06 |
Clark[m] | I have a sandwich | 21:08 |
fungi | mirror.ubuntu-ports needs to be salvaged | 21:09 |
clarkb | ubuntu ports is also at quota | 21:12 |
fungi | mirror.debian is all that's left other than the ones that reported they needed to be salvaged | 21:12 |
fungi | should i proceed? | 21:12 |
clarkb | yes I think if it is all that remains we should go ahead | 21:12 |
clarkb | then we'll figure out the salvaged needed volumes | 21:12 |
fungi | k | 21:12 |
clarkb | fwiw still nothing in dmesg indicating disk trouble currently causing the need for salvaging | 21:13 |
clarkb | I wish it would indicate what is wrong requiring it to be salvaged | 21:14 |
clarkb | ok debian also needs to be salvaged | 21:14 |
clarkb | fungi: I think we should start with fedora since it is unused | 21:14 |
fungi | not entirely surprised | 21:15 |
clarkb | fungi: oh look in /var/log/openafs | 21:15 |
clarkb | there is a new salvage log and it appears to have auto salvaged debian after deciding it needed to be salvaged | 21:15 |
clarkb | oh its gone but the other file was updated | 21:16 |
clarkb | I think it did fedora too. Lets list the vldb entries for them to see if we have all the expected RW and RO replicas then maybe try rerunning vos release? | 21:17 |
clarkb | ya vos listvldb shows the three entries for all three volumes that reported needing salvaging | 21:18 |
fungi | Released volume mirror.fedora successfully | 21:18 |
clarkb | ya so I guess it autodetects the need for salvaging but then bails on the release | 21:18 |
clarkb | but I think all three are worth trying to release since they have the replicas we expect | 21:18 |
fungi | openeuler and ubuntu-ports might need a quota bump first? | 21:19 |
clarkb | fungi: openeuler completed right? | 21:19 |
fungi | er, which was the other one that needed a quota increase? | 21:20 |
clarkb | but ya they both need quota bumps generally. I think we should make sure they release manually then bump them then we can remove cron locks and let cron jobs update them normally | 21:20 |
clarkb | fungi: those two need quota bumps but only one needed salvaging and that one was ubuntu ports | 21:20 |
fungi | ah okay | 21:20 |
fungi | Released volume mirror.debian successfully | 21:21 |
clarkb | assuming ubuntu-ports releases properly I think the next step is for me to manually finish the debian mirror cleanup. While I do that you can bump quotas on ubuntu-ports and openeuler (because that needs locks held) then with that all done we can reenable crons on mirror-update and remove the node from emergency | 21:23 |
fungi | sounds good | 21:23 |
clarkb | and by manually finish I actually mean manually start all the steps over again just to be sure everything applied properly | 21:24 |
fungi | planning to raise openeuler from 300gb to 350gb and ubuntu-ports from 550gb to 600gb... that work? | 21:27 |
clarkb | ++ | 21:28 |
clarkb | the debian and opensuse cleanup is on the order of 350GB I think? maybe 300GB so should be plenty of headroom | 21:28 |
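A hedged sketch of the quota bumps being discussed, using vos setfields (whether that or fs setquota was actually used here is an assumption); OpenAFS quotas are set in kilobytes, so the numbers below correspond to 350GB and 600GB:

    vos setfields -id mirror.openeuler -maxquota 367001600 -localauth
    vos setfields -id mirror.ubuntu-ports -maxquota 629145600 -localauth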
clarkb | ubuntu ports released successfully | 21:28 |
clarkb | I'll proceed with redoing the debian buster cleanup | 21:28 |
fungi | yep, i'll proceed with the aforementioned quota bumps and re-release the two affected volumes | 21:29 |
clarkb | the two reprepro commands appear to have nooped but I'm redoing a vos release for debian for good measure | 21:30 |
fungi | okay, volume increases and rereleases are done | 21:30 |
clarkb | then I'll do the manual file deletions and run reprepro-mirror-update | 21:30 |
fungi | awesome, and then we can release locks/take stuff out of the emergency list and i can start cooking dinner | 21:31 |
clarkb | reprepro is running against debian for a sanity check sync now and that will include a vos release then ya I can start undoing the lock holds and so on | 21:34 |
fungi | i'll go ahead and close out the rackspace ticket | 21:36 |
* clarkb looks at the day's todo list and has a sad | 21:37 | |
fungi | my todo list just said "fight fires" | 21:37 |
fungi | (i wish) | 21:37 |
clarkb | if you get bored you can review my doc updates :) | 21:38 |
clarkb | though probably best to approve those after we remove the held locks and reenable cron jobs | 21:38 |
fungi | and after i cook/eat dinner | 21:38 |
clarkb | ++ | 21:39 |
clarkb | if anything makes this better it's a cold, rainy, windy, miserable day here | 21:39 |
clarkb | so I have nothing better to do than sit at my desk | 21:39 |
clarkb | checkpool fast is running | 21:39 |
clarkb | and now the last vos release is running | 21:41 |
clarkb | ok thats done. I'm going to drop all my held locks now then uncomment the cronjobs | 21:42 |
clarkb | fungi: would be good if you can double check the crontab when this is done | 21:42 |
fungi | can do, though also so will ansible | 21:43 |
clarkb | fungi: thats done | 21:45 |
clarkb | I have removed mirror-update02 from the emergency file | 21:46 |
fungi | root crontab on mirror-update lgtm | 21:46 |
fungi | and i let #openstack-release know we're done | 21:46 |
clarkb | I disconnected from the root screen if you want to drop it | 21:47 |
clarkb | and thank you for checking | 21:47 |
clarkb | In order to help me not forget we need to investigate cleanup of RO only volumes, investigate removal of the fedora volume, and do manual dists/ and lists/ cleanup for old distro releases in debuntu reprepro mirrors | 21:48 |
clarkb | but all of that can wait until after we've had a break and dinner :) | 21:49 |
fungi | screen session gone | 21:52 |
clarkb | we should also maybe make note of the volumes that have problems. We may want to remove them | 21:52 |
clarkb | xvdb on afs01.dfw.openstack.org in this case if future me looks in logs | 21:52 |
clarkb | fungi: thank you for helping me get through that | 21:55 |
fungi | thank you for the same! | 21:57 |
clarkb | the afsmon crontab should fire in a couple of minutes and that should update our grafana graphs | 21:58 |
clarkb | oh nope I read it as every half hour but it runs on the half hour | 22:01 |
clarkb | so 29 minutes away now | 22:01 |
clarkb | graphs updated and show the changes we made (deletions and quota bumps both) | 22:37 |
ianw | (i'm just following along being glad that you're handling it, but if it's getting late and you want me to watch anything/kick of anything etc. i'm around) | 22:40 |
clarkb | ianw: thanks! I think everything appears to be back to normal now | 22:47 |
clarkb | ianw: maybe you want to review the changes I wrote related to this? otherwise I think we're good | 22:47 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/910662 and parent | 22:48 |
ianw | all lgtm! | 22:53 |
clarkb | I think we did end up clearing about 400GB total between debian and opensuse | 22:53 |
clarkb | centos 7 and xenial should make a big dent too | 22:53 |
opendevreview | Merged opendev/system-config master: Update reprepro cleanup docs to cover dists/ and lists/ cleanup https://review.opendev.org/c/opendev/system-config/+/910655 | 23:15 |