*** pojadhav|out is now known as pojadhav|ruck | 04:26 | |
*** ykarel|away is now known as ykarel | 04:29 | |
*** pojadhav|ruck is now known as pojadhav|afk | 09:27 | |
*** ykarel is now known as ykarel|lunch | 09:39 | |
*** pojadhav|afk is now known as pojadhav|ruck | 09:40 | |
*** pojadhav|ruck is now known as pojadhav|lunch | 10:02 | |
*** jpena|off is now known as jpena | 10:10 | |
*** pojadhav|lunch is now known as pojadhav|ruck | 10:30 | |
*** ykarel|lunch is now known as ykarel | 10:54 | |
*** pojadhav is now known as pojadhav|ruck | 11:01 | |
opendevreview | Merged openstack/project-config master: kolla-cli: end gating for retirement https://review.opendev.org/c/openstack/project-config/+/814580 | 11:03 |
*** pojadhav|ruck is now known as pojadhav|afk | 11:04 | |
*** pojadhav|afk is now known as pojadhav|rucl | 12:21 | |
*** pojadhav|rucl is now known as pojadhav|ruck | 12:21 | |
*** jpena is now known as jpena|lunch | 12:27 | |
*** ykarel is now known as ykarel|afk | 12:33 | |
*** pojadhav is now known as pojadhav|ruck | 12:53 | |
*** jpena|lunch is now known as jpena | 13:26 | |
fungi | just seen on the starlingx-discuss ml, github has followed our lead and dropped git protocol support: https://github.blog/2021-09-01-improving-git-protocol-security-github/ | 14:06 |
fungi | or has at least started doing brown-outs for it | 14:06 |
fungi | as of today | 14:06 |
clarkb | with the improvements to the http protocol there isn't much performance argument for git:// anymore and it suffers from a lack of security features. Good for github | 14:27 |
*** ykarel|afk is now known as ykarel | 14:29 | |
fungi | yep, agreed | 14:37 |
fungi | it just seems to be taking some folks by surprise where they had git:// urls in automation and config management | 14:37 |
*** ykarel is now known as ykarel|away | 14:42 | |
frickler | nice | 14:51 |
corvus | i'd like to restart zuul on master. there will probably be an error (we added extra debug info to try to understand it). i'll keep a close eye and be prepared to roll back with minimal disruption. | 14:56 |
clarkb | corvus: is this a multi scheduler restart or just on one for now? | 14:57 |
clarkb | also queues look quiet right now. Should I warn the openstack release team? | 14:58 |
corvus | clarkb: just one; i think we saw the change_cache bug show up after a restart on one scheduler | 14:58 |
corvus | yeah, letting release know would probably be good | 14:58 |
fungi | agreed, we saw it continue after a full restart with the second scheduler stopped and after clearing zk | 14:59 |
clarkb | done | 14:59 |
fungi | we logged it pretty much immediately and constantly, so hopefully shouldn't take too long to reproduce | 15:00 |
corvus | btw, this is the manual rollback procedure: https://paste.opendev.org/show/810342/ | 15:00 |
fungi | thanks | 15:01 |
corvus | fungi: well, it took a bit to show up the first time i think. | 15:01 |
corvus | "a bit" ~= hour? | 15:01 |
fungi | oh, maybe yeah | 15:01 |
corvus | but yeah, hard to miss once it happens :) | 15:01 |
fungi | though also it was the weekend | 15:01 |
fungi | higher volume probably means we'll trip it sooner | 15:02 |
corvus | that would be convenient :) | 15:02 |
corvus | restarting now | 15:02 |
fungi | though i suppose we need to be careful with repeated restarts in rapid succession so we don't anger the github api rate limits | 15:03 |
corvus | hopefully 2 will be ok | 15:10 |
fungi | yeah, i think it's been three or more within an hour that we've seen trouble | 15:13 |
corvus | re-enqueing | 15:14 |
fungi | strangely, they both claim to have run unattended-upgrades at roughly the same time | 15:15 |
fungi | er, wrong channel | 15:16 |
fungi | not seeing "AttributeError: 'NoneType' object has no attribute 'cache_key'" in the scheduler's debug log yet | 15:31 |
corvus | reenqueue finished | 15:36 |
fungi | thanks! | 15:37 |
corvus | there are already lines like this: 2021-11-02 15:36:37,852 ERROR zuul.ChangeCache.gerrit: Removing cache key <ChangeKey gerrit None GerritChange 816333 1 hash=a7193a06329b0486dec727317ce838287e3d2fb6a3479b641550591784f7f874> with corrupt data node uuid 2c8df83c5904485b9b2c0af924aa1f78 data b'' len 0 | 15:41 |
corvus | but i suspect those entries were from the previous restart, so i don't think we have the full picture yet | 15:41 |
*** sshnaidm_ is now known as sshnaidm | 15:49 | |
*** pojadhav|ruck is now known as pojadhav|out | 16:01 | |
corvus | maybe if nothing shows up after 2 hours we should start a 2nd scheduler? | 16:08 |
fungi | yeah, that seems reasonable | 16:11 |
fungi | i'm checking periodically, but i've also got meetings 17:00-18:00 and 19:00-20:00 utc today so i'm only around intermittently | 16:11 |
*** marios is now known as marios|out | 16:47 | |
clarkb | sorry I had to step away for breakfast. I think that is reasonable but like fungi I have a couple of meetings shortly | 16:49 |
fungi | i'm tailing /var/log/zuul/debug.log filtering for any occurrences of "cache_key" so will hopefully notice if that terminal updates | 16:50 |
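A filtered tail of the sort fungi describes might look like the following; the exact invocation is an assumption, with the log path taken from the discussion above:

    tail -f /var/log/zuul/debug.log | grep --line-buffered 'cache_key'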
*** jpena is now known as jpena|off | 17:32 | |
clarkb | I've just remembered nb03 is still sad. I'll try rebooting it after this meeting | 17:34 |
fungi | oh, yep thanks! | 17:34 |
opendevreview | Andre Aranha proposed zuul/zuul-jobs master: Add enable-fips as false to unittest job https://review.opendev.org/c/zuul/zuul-jobs/+/816385 | 17:59 |
clarkb | I have asked nb03 to reboot now | 18:05 |
clarkb | it has begun responding to ping again but no ssh yet | 18:06 |
clarkb | oh maybe it never stopped responding to ping and is being slow to properly shut down. Not surprising given the issues | 18:06 |
clarkb | I can try a hard reboot in a bit if it doesn't manage on its own | 18:07 |
fungi | yeah, i expected it would be unlikely to shut down cleanly, given the behaviors it was exhibiting | 18:12 |
clarkb | ya I've escalated to a normal reboot requested through the nova api | 18:13 |
clarkb | that is acpi based iirc. It isn't looking any better after that so I will try a hard reboot, which is like pulling the power | 18:13 |
clarkb | oh wait it rebooted after the gentle api request | 18:14 |
clarkb | success | 18:14 |
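The escalation path clarkb walks through maps onto openstackclient roughly as below; the server name is hypothetical and a soft (ACPI) reboot is the default:

    openstack server reboot nb03          # soft reboot: ACPI request, lets the guest shut down cleanly
    openstack server reboot --hard nb03   # hard reboot: hypervisor power-cycle, like pulling the power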
clarkb | and now I've started the nodepool builder container there. System load looks fine | 18:15 |
clarkb | hrm it is in a restart loop though | 18:15 |
clarkb | ModuleNotFoundError: No module named 'openshift.dynamic' | 18:15 |
clarkb | now I'm wondering if the next restart of nodepool launchers will have us dead in the water | 18:17 |
clarkb | openshift==0.0.1 <- how did that get installed | 18:20 |
clarkb | because there is only a wheel for 0.0.1 | 18:21 |
clarkb | did this work before because we weren't loading all drivers unless necessary and nodepool updated? | 18:21 |
clarkb | or maybe they deleted all their wheels except one? | 18:22 |
fungi | or maybe we started only installing wheels? there's a pip option for that | 18:22 |
fungi | oh, you know what? | 18:22 |
fungi | new pip, if it encounters an error building a wheel from sdist, will try the next newest sdist | 18:23 |
fungi | and so on, until it finds a wheel it can use without having to successfully build | 18:23 |
fungi | but pip silently discards all the output from the build failures | 18:23 |
clarkb | fungi: we set the flag to prefer wheels when we first started building arm64 images. It was the only way to get sane build times | 18:23 |
fungi | because it's helpful that way and doesn't want to burden users with errors about sdists they can't use on their platform | 18:24 |
clarkb | so in this case it is picking 0.0.1 because it is a wheel | 18:24 |
clarkb | I think we can fix this by setting a lower bound then it will make a wheel | 18:24 |
clarkb | I'm working on that change now | 18:24 |
fungi | got it, so we set that option *because* of pip's "try building every version until you find one which succeeds" behavior | 18:24 |
fungi | i guess | 18:24 |
clarkb | fungi: no, because building wheels for cryptography and pynacl and cffi is really slow on qemu-emulated arm64 | 18:24 |
clarkb | so we told it to prefer wheels instead. But that means we need to set lower bounds on things like openshift | 18:25 |
fungi | ahh, yeah | 18:25 |
fungi | well, declaring a minimum version the software should work with is probably a good thing anyway | 18:26 |
clarkb | https://mirror.dfw.rax.opendev.org/wheel/debian-10-aarch64/openshift/ <- that is why it worked before | 18:28 |
clarkb | fungi: ++ | 18:28 |
clarkb | remote: https://review.opendev.org/c/zuul/nodepool/+/816389 Fix openshift dep for arm64 docker image builds | 18:28 |
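A minimal sketch of the failure mode and the fix under discussion; the lower bound shown is illustrative, not necessarily the one 816389 pins:

    # with wheels preferred, a bare requirement resolves to openshift 0.0.1,
    # the only release on PyPI that ships a wheel:
    pip install --prefer-binary openshift
    # a lower bound rules out 0.0.1, so pip builds a wheel from the sdist instead:
    pip install --prefer-binary 'openshift>=0.11.0'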
fungi | oh fun | 18:30 |
fungi | so we were prebuilding wheels, but then we switched our container images to be based on a different platform which we weren't building wheels for | 18:31 |
clarkb | yes | 18:31 |
fungi | or, at least, stopped building openshift wheels for | 18:32 |
fungi | we still have a lot of others under https://mirror.dfw.rax.opendev.org/wheel/debian-11-aarch64/ | 18:32 |
clarkb | https://opendev.org/openstack/requirements/commit/e706eab641ccafc791eb86b794d4cdeb954073d2 is why that happened | 18:32 |
fungi | https://review.opendev.org/679366 Remove openshift from requirements | 18:33 |
fungi | Submitted-at: Fri, 30 Aug 2019 08:53:34 +0000 | 18:33 |
fungi | yeah, just found it myself | 18:33 |
fungi | okay, mystery solved! | 18:33 |
fungi | and still no hits for cache_key in the scheduler's debug log | 18:35 |
clarkb | separately I wish everyone with a pure python package would make wheels for them | 18:38 |
clarkb | it is a simple extra step for packaging and uploading to pypi that greatly simplifies users' lives | 18:38 |
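The extra packaging step clarkb is asking for amounts to two commands on the maintainer's side, sketched here with the standard build/twine tooling:

    python -m build      # produces both an sdist and a wheel in dist/
    twine upload dist/*  # uploads them to PyPI together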
corvus | i guess i'll start another scheduler? | 18:51 |
corvus | zuul01 scheduler starting | 18:52 |
corvus | status pages are going to look weird while it starts. don't panic. :) | 18:53 |
corvus | (look weird or return errors) | 18:53 |
*** ianw_pto is now known as ianw | 19:00 | |
corvus | looks like the bug hit | 19:07 |
clarkb | probably related to multiple schedulers then since it wasn't a problem for a long time? | 19:07 |
fungi | i concur, i see it reported now | 19:08 |
corvus | okay, i want to get one bit of info from the repl, then i'll restart | 19:09 |
fungi | thanks! | 19:11 |
corvus | stopping zuul | 19:16 |
corvus | deleting state | 19:17 |
corvus | starting zuul | 19:22 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: fedora-container: update to Fedora 35 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 19:27 |
ianw | clarkb: ^ needed a merge; if you would be able to look it over that would be good. very simple, just seems to need one more package because everyone loves changing dependencies | 19:32 |
clarkb | ianw: ya I'll look after the meeting | 19:32 |
clarkb | ianw: also remote: https://review.opendev.org/c/zuul/nodepool/+/816389 Fix openshift dep for arm64 docker image builds | 19:32 |
johnsom | I am getting "Something went wrong" when loading zuul status page | 19:34 |
clarkb | johnsom: zuul is being restarted | 19:35 |
clarkb | it does that when the js in your browser can't load the json from the server | 19:35 |
fungi | it should be fully started here in a moment | 19:36 |
clarkb | if you load the root you'll see all of the loaded tenants https://zuul.opendev.org/tenants just waiting on openstack now (it is the largest and slowest to load) | 19:42 |
corvus | it's up now | 19:46 |
johnsom | Do we need to recheck jobs? I'm not seeing my job come back up on the status page | 19:50 |
corvus | i'm re-enqueing now | 19:52 |
corvus | it won't hurt to recheck if you'd prefer | 19:52 |
fungi | but if it was already enqueued as of 19:16 utc when the restart began, then it should be reenqueued without any intervention | 19:53 |
clarkb | I need to eat lunch but looking at the gerrit theming stuff, gerrit's docs still say you can do that via review_site/etc files https://gerrit-review.googlesource.com/Documentation/config-themes.html | 20:11 |
clarkb | that might more closely match what we are trying to do than overloading the plugin interface, which is significantly more complicated now, it seems | 20:11 |
clarkb | A held node would be a good place to experiment I guess | 20:12 |
fungi | yeah, if there's a way to do it without requiring a polygerrit plugin, that's even better | 20:12 |
ianw | yeah i do have a held node, i will poke | 20:15 |
ianw | fungi: btw the iptables thing in podman got fixed by moving the dependency to some plugin container bit https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=997976 | 20:17 |
ianw | we can probably drop the explicit install of that now | 20:17 |
corvus | the re-enqueue is done | 20:19 |
fungi | ianw: oh neat | 20:58 |
fungi | ianw: the way i read that bug closure, they simply added a depends | 20:59 |
clarkb | corvus: I don't think the reenqueue enqueued https://review.opendev.org/c/zuul/nodepool/+/816389 so I have rechecked it | 21:01 |
fungi | ianw: oh! yeah i misread, so they added it as a dependency in an updated golang-github-containernetworking-plugins package | 21:02 |
clarkb | fungi: ianw does that mean https://review.opendev.org/c/zuul/nodepool/+/815766/3/Dockerfile should be updated? | 21:02 |
ianw | clarkb: yep we can drop the explicit iptables probably | 21:03 |
clarkb | ok I won't recheck it then | 21:03 |
corvus | clarkb: oh yeah, sorry i did a re-enqueue from backup because the bug was affecting the status json and omitted the non-openstack tenants. i should have said so; sorry. | 21:03 |
clarkb | got it | 21:04 |
fungi | ianw: https://packages.debian.org/sid/containernetworking-plugins that's where it ended up, but that version isn't in bullseye | 21:04 |
clarkb | fungi: the change is pulling from unstable | 21:04 |
fungi | are we still using the version from sid? | 21:04 |
fungi | yeah, that should work then | 21:04 |
ianw | yes -- presumably that pulls containernetworking-plugins from sid too though it's worth checking | 21:05 |
clarkb | fungi: we moved from suse's obs repo to bullseye but the proposed change switches to unstable to get the libc fix for selinux or whatever was tripping over the different libc | 21:05 |
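One common way to pull selected packages from unstable on a bullseye base, purely as an illustration of what "switches to unstable" can look like (the package list and approach are assumptions, not the contents of 815766):

    echo 'deb http://deb.debian.org/debian unstable main' >> /etc/apt/sources.list
    apt-get update && apt-get install -y -t unstable podman containernetworking-plugins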
ianw | sooo many layers to all this :/ | 21:05 |
clarkb | remember when they promised containers would make this easy? | 21:05 |
clarkb | turns out not entirely true if you aren't able to handle libc changes | 21:05 |
clarkb | oh this reminds me for some weird reason. ianw: backup02 is at 91% disk usage and fungi and I thought it would be good for one of us that is not you to run the prune while you are around | 21:06 |
opendevreview | Ian Wienand proposed opendev/system-config master: Drop Fedora 33 mirror https://review.opendev.org/c/opendev/system-config/+/815931 | 21:06 |
clarkb | I think it is all scripted and we basically need to start a screen on the host and run the script but wanted to make sure we had your insight if things got sad | 21:06 |
ianw | clarkb: yep, sure -- that's all it should be | 21:07 |
ianw | that we do the weekly fsck makes me feel better about it too | 21:07 |
clarkb | ok I'll look at doing that after the school run unless fungi feels he really wants to do it | 21:07 |
opendevreview | Ian Wienand proposed opendev/system-config master: Add Fedora 35 mirror https://review.opendev.org/c/opendev/system-config/+/816404 | 21:09 |
ianw | heh i have to do reverse school run, thankfully! kids finally back to "in-person learning" full-time | 21:10 |
clarkb | I normally do the morning run but thought I had a meeting this morning (didn't) so I've got the afternoon run | 21:10 |
ianw | https://opendev.org/openstack/openstacksdk/src/branch/master/.zuul.yaml#L206 | 21:15 |
ianw | MAGNUM_GUEST_IMAGE_URL: https://tarballs.openstack.org/magnum/images/fedora-atomic-f23-dib.qcow2 | 21:15 |
ianw | clarkb: i'm actually not sure magnum is referencing the mirror copy, still looking | 21:16 |
ianw | unsurprisingly openstacksdk-functional-devstack-magnum is non-voting | 21:18 |
ianw | i think we can just backport https://review.opendev.org/c/openstack/magnum/+/816407 | 21:36 |
clarkb | that makes sense | 21:55 |
clarkb | ianw: fungi: ok I'm ready to start a screen and run the prune script. It has a noop option. Do we usually want to run it with the noop option first or just go for it now that we've run it a couple of times? | 21:57 |
ianw | clarkb: i think just go for it | 21:59 |
clarkb | ok it is running in a screen. window 0 is the command and window 1 is tailing the log file | 22:01 |
ianw | https://review.opendev.org/q/Ie3c8dff0bad6db483b54086afed0402ef24b0b4b is the stack to clear magnum | 22:08 |
clarkb | seems to be doing what I expect re pruning and our keep policy | 22:09 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Drop Fedora Atomic mirrors https://review.opendev.org/c/opendev/system-config/+/816416 | 22:12 |
fungi | clarkb: i'll volunteer to do it next time, just finished cleaning up from dinner | 22:20 |
fungi | and thanks! | 22:20 |
ianw | magnum-functional-k8s - NODE_FAILURE -- this seems to go back to legacy-dsvm-base and legacy-ubuntu-xenial | 22:38 |
ianw | i'm not sure why this has such old branches | 22:39 |
clarkb | oh wow | 22:41 |
clarkb | in that case maybe we can just rip things out and ask them to cleanup instead | 22:41 |
ianw | i mean i can force merge, but missing mirror bits is the least of its worries :) | 22:42 |
fungi | worth noting, per a few minutes ago in #openstack-tc they're in the throes of a leadership handover too, so may be a bit preoccupied | 22:42 |
fungi | sounds like flwang is leaving, and strigazi may be stepping in as interim ptl until the next election | 22:45 |
opendevreview | Merged opendev/system-config master: Drop Fedora 33 mirror https://review.opendev.org/c/opendev/system-config/+/815931 | 22:45 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Drop Fedora Atomic mirrors https://review.opendev.org/c/opendev/system-config/+/816416 | 23:05 |
clarkb | infra-root the backup prune is done. All prunes report they terminated with success. Two things I notice: for old servers we're eventually just going to prune down to a single annual backup? servers like ask and review01? Also for review01 it seems like it is mixing up the filesystem and mysql backups because the filesystem backup doesn't have a suffix and it matches the mysql | 23:20 |
clarkb | backups too? | 23:20 |
clarkb | re review01 it didn't prune anything on this pass so I think that ship sailed during a previous pruning | 23:21 |
ianw | hrm, if you had asked me, i would have thought that we'd still keep the latest weekly/monthly backups for disabled servers | 23:23 |
clarkb | ianw: we might, it isn't clear to me yet as they aren't old enough to potentially be pruned out | 23:24 |
clarkb | looks like lists, review-dev01 and review01 have the problem of not having the -filesystem suffix | 23:24 |
clarkb | in the case of lists I think we're backing up both lists and lists-filesystem | 23:24 |
ianw | hrm, i may well have fiddled that and not removed an old file | 23:25 |
ianw | i have a memory of trying to convert everything to have that -filesystem extension | 23:25 |
clarkb | I think review01 is the only one I'd worry about as we seem to have pruned out the old filesystem backups because the mysql backups are considered too | 23:25 |
clarkb | in the case of review01 I think the only thing we might want to do is ensure we don't accidentally prune out the same content on backup01? | 23:26 |
clarkb | it's an old server and review02 carries those backups now and review02 lgtm (though it seems to have a mysql backup that we maybe decided we didn't need anymore since it is just the reviewed flag which isn't important) | 23:27 |
clarkb | ianw: ya I think we're good for everything but review01 actually. Looking at lists it seems it converted and we just have some of the old ones around so that's fine | 23:27 |
clarkb | and for review01 all we want to do is avoid pruning it on backup01? | 23:27 |
ianw | "up to 7 most recent days with backups (days without backups do not count)." | 23:28 |
clarkb | aha | 23:28 |
ianw | so yeah, that's what i was assuming, that if backups stop it just keeps the last thing | 23:28 |
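The wording ianw quotes is borg's own description of --keep-daily; a prune invocation with retention flags of that shape might look like this (the actual values used by the script are assumptions here):

    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 12 "${BORG_REPO}"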
clarkb | then ya the only real issue here is that pruning treated review01 and review01-mysql as equivalent and so it is keeping the -mysql stuff and pruning out the old filesystem stuff | 23:28 |
clarkb | oh also it seems we don't have storyboard db backups? | 23:28 |
clarkb | ianw: maybe we can rename the review01 backups on backup01 to review01-filesystem to avoid this conflict with pruning? I don't think we've pruned on that server so we should be able to keep them around that way | 23:29 |
ianw | perhaps that script could look in the directory for a ".noprune" file and skip | 23:29 |
ianw | when we shutdown a server, we can put in a .noprune and just leave it in an archived state | 23:30 |
ianw | or; move it to attic/ manually | 23:30 |
ianw | moving it is probably the KISS approach | 23:31 |
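ianw's .noprune idea would amount to a small guard in the prune loop, along these lines (the loop and paths are hypothetical, borrowed from the repo paths mentioned in the log):

    for repo in /opt/backups/borg-*; do
        # archived/retired server: keep everything, skip pruning entirely
        [ -f "${repo}/.noprune" ] && continue
        # ... existing per-archive prune logic ...
    done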
clarkb | ianw: /opt/backups/prune-2021-08-04-04-42-05.log shows it pruning the review01 stuff | 23:32 |
clarkb | but now I'm even more confused because what it pruned was review01-filesystem | 23:32 |
clarkb | like it considered them to be equivalent backups for pruning purposes. paste01 had similar output but in that case it kept filesystem stuff | 23:33 |
ianw | hrm, just paging in what happened | 23:34 |
ianw | the main change was | 23:34 |
ianw | https://review.opendev.org/c/opendev/system-config/+/771738 | 23:34 |
clarkb | ok I think I see it. We prune based on prefix | 23:35 |
clarkb | in the review01 case the prefix we used was 'review01' which matched both -filesystem and -mysql | 23:35 |
ianw | my memory from the time was that this was all new, and i restarted each backup so everything had a "-filesystem" archive | 23:35 |
clarkb | and the mysql stuff must run after the filesystem stuff so was newer in all cases and kept | 23:36 |
clarkb | ianw: ya I think this is a general issue if we use a prefix that is a subprefix of another prefix | 23:39 |
clarkb | lists is suffering from this but because we don't do a db backup and a filesystem backup it's all just filesystem but with two prefixes we are fine | 23:40 |
clarkb | review01 did this and is a problem because the review01 prefix matches both review01-filesystem and review01-mysql and then it prunes out -filesystem in favor of -mysql | 23:40 |
clarkb | which means for review01 we want to make sure we have unique prefixes before we prune? | 23:42 |
clarkb | er sorry for backup01 | 23:42 |
ianw | we should only be doing one archive at a time | 23:43 |
ianw | # Look at all archives and strip the timestamp, leaving just the archive names | 23:43 |
ianw | # We limit the prune by --prefix so each archive is considered separately | 23:43 |
ianw | archives=$(/opt/borg/bin/borg list ${BORG_REPO} | awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort | uniq) | 23:43 |
clarkb | yes but in the case of review01 the prefix was review01 which matched review01-filesystem and review01-mysql | 23:43 |
clarkb | ianw: | Wed Aug 4 05:26:51 UTC 2021 Pruning /opt/backups/borg-review01/backup archive review01 <- is the log line from the log in the previous prune that shows this happening | 23:44 |
clarkb | the last term there is the prefix used | 23:44 |
clarkb | this is also happening for lists but in the lists case we care less because lists and lists-filesystem are equivalent archives | 23:44 |
clarkb | (there is no db or secondary backup) | 23:44 |
ianw | ahh right, the problem with review01 is that there is still a generic archive | 23:45 |
clarkb | when I try to run that borg list it complains about some previously unknown repo and doesn't want to proceed by default | 23:45 |
ianw | review01-2020-11-30T05:51:02 Mon, 2020-11-30 05:51:04 [b14cfe17e7802e4be4cf048f90890c1fabb78f7b556ef64a65fb16f1ac86a4e2] | 23:45 |
clarkb | ianw: well and all of the -filesystem archive is gone | 23:45 |
clarkb | this is my main concern as I don't think we've pruned backup01 yet so we have the opportunity to prevent accidentally over pruning that one | 23:45 |
ianw | $ /opt/borg/bin/borg list ./backup/ | awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort | uniq | 23:45 |
ianw | review01 | 23:45 |
ianw | review01-mysql-accountPatchReviewDb | 23:45 |
clarkb | if I run borg list as root it complains. Should I run it as my normal user? | 23:46 |
ianw | right; this *should* have had "review01-filesystem" and "review01-mysql-accountPatchReviewDb" in that list only | 23:46 |
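To make the collision concrete: --prefix does plain string matching on archive names, all of which end in a -<timestamp> suffix here. One way to scope a prune to a single archive family is borg 1.1+'s --glob-archives, anchored on the timestamp (pattern illustrative):

    # matches review01-*, review01-filesystem-* and review01-mysql-* alike:
    borg prune --prefix review01 --keep-daily 7 "${BORG_REPO}"
    # matches only the generic review01 archives, because the glob requires
    # the timestamp to follow the name immediately:
    borg prune --glob-archives 'review01-20[0-9][0-9]-*' --keep-daily 7 "${BORG_REPO}"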
ianw | it's the /opt symlink that confuses it | 23:47 |
clarkb | ah so cd into the path first, got it | 23:47 |
ianw | just set "BORG_RELOCATED_REPO_ACCESS_IS_OK=y" | 23:47 |
clarkb | I still get the error setting that var | 23:49 |
clarkb | well if I do it as a normal user I get told that I don't have permissions to set the lock file and checking perms on backup/ this is true | 23:49 |
ianw | you should probably do it as the backup user | 23:49 |
ianw | for that host | 23:50 |
ianw | backup01 is right | 23:50 |
clarkb | aha, that is what the prune script does, it sudo -u's | 23:50 |
ianw | https://paste.opendev.org/show/810350/ | 23:50 |
ianw | i feel like we've actually discussed this before now ... | 23:50 |
clarkb | ianw: I don't think that is correct on 01 either? | 23:50 |
clarkb | we flipped the servers over much later in the year? | 23:51 |
clarkb | basically we should have a matching -filesystem and -mysql pair for each of the retained backups | 23:51 |
ianw | i mean all the archives have unique names there | 23:51 |
clarkb | but if it got pruned there too then not much we can do at this point | 23:51 |
clarkb | ah got it | 23:51 |
clarkb | ianw: is that a complete list of backups though? it seems like we should have more for that server (I can't remember when exactly we switched the servers but it was in the NA summer?) | 23:53 |
ianw | oh no, that was just an extract | 23:53 |
ianw | $ /opt/borg/bin/borg list ./backup/ | awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort | uniq | 23:53 |
ianw | review01-filesystem | 23:53 |
ianw | review01-mysql | 23:53 |
ianw | review01-mysql-accountPatchReviewDb | 23:53 |
clarkb | got it | 23:54 |
ianw | that's the list of archives it will prune on backup01 -- all are unique enough | 23:54 |
ianw | i can't find it in my notes or irc logs, but i have memory of this now | 23:54 |
clarkb | I guess we should double check all of them for unique prefixes? backup02 lists is another without unique prefixes, but as mentioned it is less of a concern since they are equivalent | 23:54 |
clarkb | can we convert lists to lists-filesystem archive somehow? | 23:54 |
ianw | i feel like we realised the same thing we're concluding now, we had pruned the archives on the vexxhost backup server, but things were correctly named on the rax one | 23:55 |
clarkb | ya I'm recalling similar discussion now but only vaguely :) | 23:55 |
clarkb | as long as the review01 prefixes on backup01 are unique we should be good for future prunes there. I think at this point my only real concerns are that we double check there aren't any additional servers in that situation (backup02: review01 and lists are known) and decide if we need to rename anything like with lists if possible | 23:56 |
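A quick way to double check a repo for the sub-prefix problem, reusing the listing pipeline from the script (a sketch; the collision report format is an assumption):

    names=$(/opt/borg/bin/borg list "${BORG_REPO}" | awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort -u)
    for a in $names; do
        for b in $names; do
            # a name that is a strict prefix of a sibling will over-match in prune
            [ "$a" != "$b" ] && case "$b" in "$a"-*) echo "collision: $a matches $b";; esac
        done
    done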
clarkb | oh and storyboard db backups should happen | 23:56 |
clarkb | I don't think anything is urgently broken. The only thing I can identify as wrong has been that way since august | 23:57 |
ianw | https://paste.opendev.org/show/810351/ | 23:59 |
clarkb | ianw: I'm going to have to do dinner here in a couple of minutes. Maybe ping me or email or whatever if there is anything I should pick up in the morning. Like skimming through the listings and looking for any non-unique prefixes we don't already know about. Then we can decide if/how we want to handle any of them | 23:59 |
ianw | lists and review are the ones affected by this | 23:59 |
clarkb | ianw: did you run that on backup01 as well? | 23:59 |
clarkb | curious if it has any others that might be affected | 23:59 |
ianw | no, just backup02 for now, but i will do for both | 23:59 |