Tuesday, 2021-11-02

*** pojadhav|out is now known as pojadhav|ruck04:26
*** ykarel|away is now known as ykarel04:29
*** pojadhav|ruck is now known as pojadhav|afk09:27
*** ykarel is now known as ykarel|lunch09:39
*** pojadhav|afk is now known as pojadhav|ruck09:40
*** pojadhav|ruck is now known as pojadhav|lunch10:02
*** jpena|off is now known as jpena10:10
*** pojadhav|lunch is now known as pojadhav|ruck10:30
*** ykarel|lunch is now known as ykarel10:54
*** pojadhav is now known as pojadhav|ruck11:01
opendevreviewMerged openstack/project-config master: kolla-cli: end gating for retirement  https://review.opendev.org/c/openstack/project-config/+/81458011:03
*** pojadhav|ruck is now known as pojadhav|afk11:04
*** pojadhav|afk is now known as pojadhav|rucl12:21
*** pojadhav|rucl is now known as pojadhav|ruck12:21
*** jpena is now known as jpena|lunch12:27
*** ykarel is now known as ykarel|afk12:33
*** pojadhav is now known as pojadhav|ruck12:53
*** jpena|lunch is now known as jpena13:26
fungijust seen on the starlingx-discuss ml, github has followed our lead and dropped git protocol support: https://github.blog/2021-09-01-improving-git-protocol-security-github/14:06
fungior has at least started doing brown-outs for it14:06
fungias of today14:06
clarkbwith the improvements to the http protocol there isn't much performance argument for git:// anymore and it suffers from a lack of security features. Good for github14:27
*** ykarel|afk is now known as ykarel14:29
fungiyep, agreed14:37
fungiit just seems to be taking some folks by surprise where they had git:// urls in automation and config management14:37
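For automation still carrying git:// remotes, the fix is a mechanical URL rewrite. A minimal sketch of that rewrite (the repository URL is only an example, not one from this discussion):

```python
def to_https(url: str) -> str:
    """Rewrite a git:// remote URL to its https:// equivalent."""
    prefix = "git://"
    if url.startswith(prefix):
        return "https://" + url[len(prefix):]
    return url

print(to_https("git://github.com/openstack/nova.git"))
# → https://github.com/openstack/nova.git
```

In practice git can apply this rewrite itself via `git config --global url."https://github.com/".insteadOf git://github.com/`, which also covers submodules and tooling that shells out to git.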
*** ykarel is now known as ykarel|away14:42
corvusi'd like to restart zuul on master.  there will probably be an error (we added extra debug info to try to understand it).  i'll keep a close eye and be prepared to roll back with minimal disruption.14:56
clarkbcorvus: is this a multi scheduler restart or just on one for now?14:57
clarkbalso queues look quiet right now. Should I warn the openstack release team?14:58
corvusclarkb: just one; i think we saw the change_cache bug show up after a restart on one scheduler14:58
corvusyeah, letting release know would probably be good14:58
fungiagreed, we saw it continue after a full restart with the second scheduler stopped and after clearing zk14:59
fungiwe logged it pretty much immediately and constantly, so hopefully shouldn't take too long to reproduce15:00
corvusbtw, this is the manual rollback procedure: https://paste.opendev.org/show/810342/15:00
corvusfungi: well, it took a bit to show up the first time i think.15:01
corvus"a bit" ~= hour?15:01
fungioh, maybe yeah15:01
corvusbut yeah, hard to miss once it happens :)15:01
fungithough also it was the weekend15:01
fungihigher volume probably means we'll trip it sooner15:02
corvusthat would be convenient :)15:02
corvusrestarting now15:02
fungithough i suppose we need to be careful with repeated restarts in rapid succession so we don't anger the github api rate limits15:03
corvushopefully 2 will be ok15:10
fungiyeah, i think it's been three or more within an hour that we've seen trouble15:13
fungistrangely, they both claim to have run unattended-upgrades at roughly the same time15:15
fungier, wrong channel15:16
funginot seeing "AttributeError: 'NoneType' object has no attribute 'cache_key'" in the scheduler's debug log yet15:31
corvusreenqueue finished15:36
corvusthere are already lines like this: 2021-11-02 15:36:37,852 ERROR zuul.ChangeCache.gerrit: Removing cache key <ChangeKey gerrit None GerritChange 816333 1 hash=a7193a06329b0486dec727317ce838287e3d2fb6a3479b641550591784f7f874> with corrupt data node uuid 2c8df83c5904485b9b2c0af924aa1f78 data b'' len 015:41
corvusbut i suspect those entries were from the previous restart, so i don't think we have the full picture yet15:41
*** sshnaidm_ is now known as sshnaidm15:49
*** pojadhav|ruck is now known as pojadhav|out16:01
corvusmaybe if nothing shows up after 2 hours we should start a 2nd scheduler?16:08
fungiyeah, that seems reasonable16:11
fungii'm checking periodically, but i've also got meetings 17:00-18:00 and 19:00-20:00 utc today so i'm only around intermittently16:11
*** marios is now known as marios|out16:47
clarkbsorry I had to step away for breakfast. I think that is reasonable but like fungi have a couple of meetings shortly16:49
fungii'm tailing /var/log/zuul/debug.log filtering for any occurrences of "cache_key" so will hopefully notice if that terminal updates16:50
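The filter fungi is tailing amounts to scanning the debug log for the known traceback line; a sketch of that scan (the error string is the one quoted above, the sample lines are invented):

```python
import re

# the exact error string reported earlier in the channel
PATTERN = re.compile(
    r"AttributeError: 'NoneType' object has no attribute 'cache_key'")

def hits(lines):
    """Return only the log lines matching the known scheduler traceback."""
    return [line for line in lines if PATTERN.search(line)]

sample = [
    "2021-11-02 15:31:00,000 DEBUG zuul.Scheduler: tick",
    "AttributeError: 'NoneType' object has no attribute 'cache_key'",
]
print(hits(sample))  # only the second line matches
```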
*** jpena is now known as jpena|off17:32
clarkbI've just remembered nb03 is still sad. I'll try rebooting it after this meeting17:34
fungioh, yep thanks!17:34
opendevreviewAndre Aranha proposed zuul/zuul-jobs master: Add enable-fips as false to unittest job  https://review.opendev.org/c/zuul/zuul-jobs/+/81638517:59
clarkbI have asked nb03 to reboot now18:05
clarkbit has begun responding to ping again but no ssh yet18:06
clarkboh maybe it never stopped responding to ping and is being slow to properly shut down. Not surprising given the issues18:06
clarkbI can try a hard reboot in a bit if it doesn't manage on its own18:07
fungiyeah, i expected it would be unlikely to shut down cleanly, given the behaviors it was exhibiting18:12
clarkbya I've escalated to a normal reboot requested through the nova api18:13
clarkbthat is acpi based iirc. It isn't looking any better after that so I will try a hard reboot which is like pulling the power18:13
clarkboh wait it rebooted after the gentle api request18:14
clarkband now I've started the nodepool builder container there. System load looks fine18:15
clarkbhrm it is in a restart loop though18:15
clarkbModuleNotFoundError: No module named 'openshift.dynamic'18:15
clarkbnow I'm wondering if the next restart of nodepool launchers will have us dead in the water18:17
clarkbopenshift==0.0.1 <- how did that get installed18:20
clarkbbecause there is only a wheel for 0.0.118:21
clarkbdid this work before because we weren't loading all drivers unless necessary and nodepool updated?18:21
clarkbor maybe they deleted all their wheels except one?18:22
fungior maybe we started only installing wheels? there's a pip option for that18:22
fungioh, you know what?18:22
funginew pip, if it encounters an error building a wheel from sdist, will try the next newest sdist18:23
fungiand so on, until it finds a wheel it can use without having to successfully build18:23
fungibut pip silently discards all the output from the build failures18:23
clarkbfungi: we set the flag to prefer wheels when we first started building arm64 images. It was the only way to get sane build times18:23
fungibecause it's helpful that way and doesn't want to burden users with errors about sdists they can't use on their platform18:24
clarkbso in this case it is picking 0.0.1 because it is a wheel18:24
clarkbI think we can fix this by setting a lower bound then it will make a wheel18:24
clarkbI'm working on that change now18:24
fungigot it, so we set that option *because* of pip's "try building every version until you find one which succeeds" behavior18:24
fungii guess18:24
clarkbfungi: no because building wheels for cryptography and pynacl and cffi is really slow on qemu emulated arm6418:24
clarkbso we told it to prefer wheels instead. But that means we need to set lower bounds on things like openshift18:25
fungiahh, yeah18:25
fungiwell, declaring a minimum version the software should work with is probably a good thing anyway18:26
clarkbhttps://mirror.dfw.rax.opendev.org/wheel/debian-10-aarch64/openshift/ <- that is why it worked before18:28
clarkbfungi: ++18:28
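The failure mode clarkb describes can be modeled in a few lines. This is a sketch of the resolution order implied by "prefer wheels", not pip's actual resolver, and the version numbers stand in for the openshift releases discussed:

```python
def _key(v):
    """Order version strings numerically, e.g. '0.12.1' -> (0, 12, 1)."""
    return tuple(int(p) for p in v.split("."))

def pick_version(candidates, minimum=None):
    """Model a prefer-binary install: among versions satisfying the bound,
    any wheel beats any sdist, and newest wins within each group.
    candidates maps version string -> True if a wheel is available."""
    eligible = {v: w for v, w in candidates.items()
                if minimum is None or _key(v) >= _key(minimum)}
    if not eligible:
        raise LookupError("no version satisfies the bound")
    return max(eligible, key=lambda v: (eligible[v], _key(v)))

# hypothetical index: only the ancient release ships a wheel
releases = {"0.0.1": True, "0.12.1": False}
print(pick_version(releases))                  # → 0.0.1 (the lone wheel wins)
print(pick_version(releases, minimum="0.11"))  # → 0.12.1 (bound forces the sdist)
```

This is why adding a lower bound fixes the image build: once 0.0.1 is excluded, the installer has no wheel to prefer and builds the newer sdist.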
clarkbremote:   https://review.opendev.org/c/zuul/nodepool/+/816389 Fix openshift dep for arm64 docker image builds18:28
fungioh fun18:30
fungiso we were prebuilding wheels, but then we switched our container images to be based on a different platform which we weren't building wheels for18:31
fungior, at least, stopped building openshift wheels for18:32
fungiwe still have a lot of others under https://mirror.dfw.rax.opendev.org/wheel/debian-11-aarch64/18:32
clarkbhttps://opendev.org/openstack/requirements/commit/e706eab641ccafc791eb86b794d4cdeb954073d2 is why that happened18:32
fungihttps://review.opendev.org/679366 Remove openshift from requirements18:33
fungiSubmitted-at: Fri, 30 Aug 2019 08:53:34 +000018:33
fungiyeah, just found it myself18:33
fungiokay, mystery solved!18:33
fungiand still no hits for cache_key in the scheduler's debug log18:35
clarkbseparately I wish everyone with a pure python package would make wheels for them18:38
clarkbit is a simple extra step for packaging and uploading to pypi that greatly simplifies users' lives18:38
corvusi guess i'll start another scheduler?18:51
corvuszuul01 scheduler starting18:52
corvusstatus pages are going to look weird while it starts.  don't panic.  :)18:53
corvus(look weird or return errors)18:53
*** ianw_pto is now known as ianw19:00
corvuslooks like the bug hit19:07
clarkbprobably related to multiple schedulers then since it wasn't a problem for a long time?19:07
fungii concur, i see it reported now19:08
corvusokay, i want to get one bit of info from the repl, then i'll restart19:09
corvusstopping zuul19:16
corvusdeleting state19:17
corvusstarting zuul19:22
opendevreviewIan Wienand proposed openstack/diskimage-builder master: fedora-container: update to Fedora 35  https://review.opendev.org/c/openstack/diskimage-builder/+/81557419:27
ianwclarkb: ^ needed a merge, if you would be able to look over would be good.  very simple, just seems to need one more package because everyone loves changing dependencies19:32
clarkbianw: ya I'll look after the meeting19:32
clarkbianw: also remote:   https://review.opendev.org/c/zuul/nodepool/+/816389 Fix openshift dep for arm64 docker image builds19:32
johnsomI am getting "Something went wrong" when loading zuul status page19:34
clarkbjohnsom: zuul is being restarted19:35
clarkbit does that when the js in your browser can't load the json from the server19:35
fungiit should be fully started here in a moment19:36
clarkbif you load the root you'll see all of the loaded tenants https://zuul.opendev.org/tenants just waiting on openstack now (it is the largest and slowest to load)19:42
corvusit's up now19:46
johnsomDo we need to recheck jobs? I'm not seeing my job come back up on the status page19:50
corvusi'm re-enqueing now19:52
corvusit won't hurt to recheck if you'd prefer19:52
fungibut if it was already enqueued as of 19:16 utc when the restart began, then it should be reenqueued without any intervention19:53
clarkbI need to eat lunch but looking at the gerrit theming stuff, gerrit's docs still say you can do that via review_site/etc files https://gerrit-review.googlesource.com/Documentation/config-themes.html20:11
clarkbthat might more closely match what we are trying to do than overloading the plugin interface which is significantly more complicated now it seems like20:11
clarkbA held node would be a good place to experiment I guess20:12
fungiyeah, if there's a way to do it without requiring a polygerrit plugin, that's even better20:12
ianwyeah i do have a held node, i will poke20:15
ianwfungi: btw the iptables thing in podman got fixed by moving the dependency to some plugin container bit https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=99797620:17
ianwwe can probably drop the explicit install of that now20:17
corvusthe re-enqueue is done20:19
fungiianw: oh neat20:58
fungiianw: the way i read that bug closure, they simply added a depends20:59
clarkbcorvus: I don't think the reenqueue enqueued https://review.opendev.org/c/zuul/nodepool/+/816389 I have rechecked it21:01
fungiianw: oh! yeah i misread, so they added it as a dependency in an updated golang-github-containernetworking-plugins package21:02
clarkbfungi: ianw  does that mean https://review.opendev.org/c/zuul/nodepool/+/815766/3/Dockerfile should be updated?21:02
ianwclarkb: yep we can drop the explicit iptables probably21:03
clarkbok I won't recheck it then21:03
corvusclarkb: oh yeah, sorry i did a re-enqueue from backup because the bug was affecting the status json and omitted the non-openstack tenants.  i should have said so; sorry.21:03
clarkbgot it21:04
fungiianw: https://packages.debian.org/sid/containernetworking-plugins that's where it ended up, but that version isn't in bullseye21:04
clarkbfungi: the change is pulling from unstable21:04
fungiare we still using the version from sid?21:04
fungiyeah, that should work then21:04
ianwyes -- presumably that pulls containernetworking-plugins from sid too though it's worth checking21:05
clarkbfungi: we moved from suse's obs repo to bullseye but the proposed change switches to unstable to get the libc fix for selinux or whatever was tripping over the different libc21:05
ianwsooo many layers to all this :/21:05
clarkbremember when they promised containers would make this easy?21:05
clarkbturns out not entirely true if you aren't able to handle libc changes21:05
clarkboh this reminds me for some weird reason. ianw  backup02 is at 91% of disk space and fungi and I thought it would be good for one of us that is not you to run the prune while you are around21:06
opendevreviewIan Wienand proposed opendev/system-config master: Drop Fedora 33 mirror  https://review.opendev.org/c/opendev/system-config/+/81593121:06
clarkbI think it is all scripted and we basically need to start a screen on the host and run the script but wanted to make sure we had your insight if things got sad21:06
ianwclarkb: yep, sure -- that's all it should be21:07
ianwthat we do the weekly fsck makes me feel better about it too21:07
clarkbok I'll look at doing that after the school run unless fungi feels he really wants to do it21:07
opendevreviewIan Wienand proposed opendev/system-config master: Add Fedora 35 mirror  https://review.opendev.org/c/opendev/system-config/+/81640421:09
ianwheh i have to do reverse school run, thankfully!  kids finally back to "in-person learning" full-time21:10
clarkbI normally do the morning run but thought I had a meeting this morning (didn't) so I've got the afternoon run21:10
ianw MAGNUM_GUEST_IMAGE_URL: https://tarballs.openstack.org/magnum/images/fedora-atomic-f23-dib.qcow221:15
ianwclarkb: i'm actually not sure magnum is referencing the mirror copy, still looking21:16
ianwunsurprisingly openstacksdk-functional-devstack-magnum is non-voting21:18
ianwi think we can just backport https://review.opendev.org/c/openstack/magnum/+/81640721:36
clarkbthat makes sense21:55
clarkbianw: fungi: ok I'm ready to start a screen and run the prune script. It has a noop option. Do we usually want to run it with the noop option first or just go for it now that we've run it a couple of times?21:57
ianwclarkb: i think just go for it21:59
clarkbok it is running in a screen. window 0 is the command and window 1 is tailing the log file22:01
ianwhttps://review.opendev.org/q/Ie3c8dff0bad6db483b54086afed0402ef24b0b4b is the stack to clear magnum22:08
clarkbseems to be doing what I expect re pruning and our keep policy22:09
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: Drop Fedora Atomic mirrors  https://review.opendev.org/c/opendev/system-config/+/81641622:12
fungiclarkb: i'll volunteer to do it next time, just finished cleanup up from dinner22:20
fungiand thanks!22:20
ianwmagnum-functional-k8s - NODE_FAILURE -- this seems to go back to legacy-dsvm-base and legacy-ubuntu-xenial22:38
ianwi'm not sure why this has such old branches22:39
clarkboh wow22:41
clarkbin that case maybe we can just rip things out and ask them to cleanup instead22:41
ianwi mean i can force merge, but missing mirror bits is the least of its worries :)22:42
fungiworth noting, per a few minutes ago in #openstack-tc they're in the throes of a leadership handover too, so may be a bit preoccupied22:42
fungisounds like flwang is leaving, and strigazi may be stepping in as interim ptl until the next election22:45
opendevreviewMerged opendev/system-config master: Drop Fedora 33 mirror  https://review.opendev.org/c/opendev/system-config/+/81593122:45
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: Drop Fedora Atomic mirrors  https://review.opendev.org/c/opendev/system-config/+/81641623:05
clarkbinfra-root the backup prune is done. All prunes report they terminated with success. Two things I notice: for old servers we're eventually just going to prune down to a single annual backup? servers like ask and review01? Also for review01 it seems like it is mixing up the filesystem and mysql backups because the filesystem backup doesn't have a suffix and it matches the mysql23:20
clarkbbackups too?23:20
clarkbre review01 it didn't prune anything on this pass so I think that ship sailed during a previous pruning23:21
ianwhrm, if you had asked me, i would have thought that we'd still keep the latest weekly/monthly backups for disabled servers23:23
clarkbianw: we might, it isn't clear to me yet as they aren't old enough to potentially be pruned out23:24
clarkblooks like lists, review-dev01 and review01 have the problem of not having the -filesystem suffix23:24
clarkbin the case of lists I think we're backing up both lists and lists-filesystem23:24
ianwhrm, i may well have fiddled that and not removed an old file23:25
ianwi have a memory of trying to convert everything to have that -filesystem extension23:25
clarkbI think review01 is the only one I'd worry about as we seem to have pruned out the old filesystem backups because the mysql backups are considered too23:25
clarkbin the case of review01 I think the only thing we might want to do is ensure we don't accidentally prune out the same content on backup01?23:26
clarkbit's an old server and review02 carries those backups now and review02 lgtm (though it seems to have a mysql backup that we maybe decided we didn't need anymore since it is just the reviewed flag which isn't important)23:27
clarkbianw: ya I think we're good for everything but review01 actually. Looking at lists it seems it converted and we just have some of the old ones around so that's fine23:27
clarkband for review01 all we want to do is avoid pruning it on backup01?23:27
ianw"up to 7 most recent days with backups (days without backups do not count)."23:28
ianwso yeah, that's what i assuming, that if backups stop it just keeps the last thing23:28
clarkbthen ya the only real issue here is that pruning treated review01 and review01-mysql as equivalent and so it is keeping the -mysql stuff and pruning out the old filesystem stuff23:28
clarkboh also it seems we don't have storyboard db backups?23:28
clarkbianw: maybe we can rename the review01 backups on backup01 to review01-filesystem to avoid this conflict with pruning? I don't think we've pruned on that server so we should be able to keep them around that way23:29
ianwperhaps that script could look in the directory for a ".noprune" file and skip23:29
ianwwhen we shutdown a server, we can put in a .noprune and just leave it in an archived state23:30
ianwor; move it to attic/ manually23:30
ianwmoving it is probably the KISS approach23:31
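The `.noprune` marker ianw floats would be a small guard in the prune loop. A sketch of the idea only (this is not implemented anywhere; the `borg-*` directory layout is assumed from the discussion):

```python
import tempfile
from pathlib import Path

def prunable_repos(backup_root):
    """Yield backup repo directories eligible for pruning, skipping any
    that carry a .noprune marker (archived/retired servers)."""
    for repo in sorted(Path(backup_root).iterdir()):
        if repo.is_dir() and not (repo / ".noprune").exists():
            yield repo

# tiny demo layout: one active repo, one retired repo with the marker
root = Path(tempfile.mkdtemp())
(root / "borg-lists").mkdir()
(root / "borg-review01").mkdir()
(root / "borg-review01" / ".noprune").touch()
print([r.name for r in prunable_repos(root)])  # → ['borg-lists']
```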
clarkbianw: /opt/backups/prune-2021-08-04-04-42-05.log shows it pruning the review01 stuff23:32
clarkbbut now I'm even more confused because what it pruned was review01-filesystem23:32
clarkblike it considered them to be equivalent backups for pruning purposes. paste01 had similar output but in that case it kept filesystem stuff23:33
ianwhrm, just paging in what happened23:34
ianwthe main change was23:34
clarkbok I think I see it. We prune based on prefix23:35
clarkbin the review01 case the prefix we used was 'review01' which matched both -filesystem and -mysql23:35
ianwmy memory from the time was that this was all new, and i restarted each backup so everything had a "-filesystem" archive23:35
clarkband the mysql stuff must run after the filesystem stuff so was newer in all cases and kept23:36
clarkbianw: ya I think this is a general issue if we use a prefix that is a subprefix of another prefix23:39
clarkblists is suffering from this but because we don't do a db backup and a filesystem backup it's all just filesystem, so even with two prefixes we are fine23:40
clarkbreview01 did this and is a problem because the review01 prefix matches both review01-filesystem and review01-mysql and then it prunes out -filesystem in favor of -mysql23:40
clarkbwhich means for review01 we want to make sure we have unique prefixes before we prune?23:42
clarkber sorry for backup0123:42
ianwwe should only be doing one archive at a time23:43
ianw    # Look at all archives and strip the timestamp, leaving just the archive names23:43
ianw    # We limit the prune by --prefix so each archive is considered separately23:43
ianw    archives=$(/opt/borg/bin/borg list ${BORG_REPO} | awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort | uniq)23:43
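That awk amounts to: take the first field of each `borg list` line, drop the trailing 20-character timestamp (e.g. `-2021-08-01T05:20:11`), and de-duplicate. A Python rendering, with illustrative archive names:

```python
def archive_prefixes(listing_lines):
    """Mirror the awk in the prune script: first field of each borg list
    line, minus the 20-char timestamp suffix, de-duplicated and sorted."""
    return sorted({line.split()[0][:-20] for line in listing_lines})

listing = [
    "review01-filesystem-2021-08-01T05:20:11  Sun, 2021-08-01 05:20:13 [abc]",
    "review01-mysql-accountPatchReviewDb-2021-08-01T05:40:02  Sun, [def]",
]
print(archive_prefixes(listing))
# → ['review01-filesystem', 'review01-mysql-accountPatchReviewDb']
```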
clarkbyes but in the case of review01 the prefix was review01 which matched review01-filesystem and review01-mysql23:43
clarkbianw: | Wed Aug  4 05:26:51 UTC 2021 Pruning /opt/backups/borg-review01/backup archive review01 <- is the log line from the log in the previous prune that shows this happening23:44
clarkbthe last term there is the prefix used23:44
clarkbthis is also happening for lists but in lists case we care less because lists and lists-filesystem are equiavlent archives23:44
clarkb(there is no db or secondary backup)23:44
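The over-pruning then follows from plain prefix matching: a bare `review01` prefix also matches every suffixed archive name. A minimal demonstration of the matching behavior (archive names illustrative):

```python
def matching(archives, prefix):
    """borg prune --prefix considers every archive whose name starts with
    the prefix, so a bare 'review01' swallows the suffixed archives too."""
    return [a for a in archives if a.startswith(prefix)]

archives = [
    "review01-2020-11-30T05:51:02",
    "review01-filesystem-2021-08-01T05:20:11",
    "review01-mysql-accountPatchReviewDb-2021-08-01T05:40:02",
]
print(len(matching(archives, "review01")))             # 3: all lumped together
print(len(matching(archives, "review01-filesystem")))  # 1: correctly isolated
```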
ianwahh right, the problem with review01 is that there is still a generic archive23:45
clarkbwhen I try to run that borg list it complains about some previously unknown repo and doesn't want to proceed by default23:45
ianwreview01-2020-11-30T05:51:02         Mon, 2020-11-30 05:51:04 [b14cfe17e7802e4be4cf048f90890c1fabb78f7b556ef64a65fb16f1ac86a4e2]23:45
clarkbianw: well and all of the -filesystem archive is gone23:45
clarkbthis is my main concern as I don't think we've pruned backup01 yet so we have the opportunity to prevent accidentally over pruning thato ne23:45
ianw$ /opt/borg/bin/borg list  ./backup/ |  awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort | uniq23:45
clarkbif I run borg list as root it complains. Should I run it as my normal user?23:46
ianwright; this *should* have had "review01-filesystem" and "review01-mysql-accountPatchReviewDb" in that list only23:46
ianwit's the /opt symlink that confuses it23:47
clarkbah so cd into the path first, got it23:47
ianwjust set "BORG_RELOCATED_REPO_ACCESS_IS_OK=y"23:47
clarkbI still get the error setting that var23:49
clarkbwell if I do it as a normal user I get told that I don't have permissions to set the lock file and checking perms on backup/ this is true23:49
ianwyou should probably do it as the backup user23:49
ianwfor that host23:50
ianwbackup01 is right23:50
clarkbaha that is what the prune script does it sudo -u's23:50
ianwi feel like we've actually discussed this before now ...23:50
clarkbianw: I don't think that is correct on 01 either?23:50
clarkbwe flipped the servers over much later in the year?23:51
clarkbbasically we should have a matching -filesystem and -mysql pair for each of the retained backups23:51
ianwi mean all the archives have unique names there23:51
clarkbbut if it got pruned there too then not much we can do at this point23:51
clarkbah got it23:51
clarkbianw: is that a complete list of backups though? it seems like we should have more for that server (I can't remember when exactly we switched the servers but it was in the NA summer?)23:53
ianwoh no, that was just an extract23:53
ianw$ /opt/borg/bin/borg list  ./backup/ | awk '{$1 = substr($1, 0, length($1)-20); print $1}' | sort | uniq23:53
clarkbgot it23:54
ianwthat's the list of archives it will prune on backup01 -- all are unique enough23:54
ianwi can't find it in my notes or irc logs, but i have memory of this now23:54
clarkbI guess we should double check all of them for unique prefixes? backup02 lists is another without unique prefixes, but as mentioned it is less of a concern since they are equivalent23:54
clarkbcan we convert lists to lists-filesystem archive somehow?23:54
ianwi feel like we realised the same thing we're concluding now, we had pruned the archives on the vexxhost backup server, but things were correctly named on the rax one23:55
clarkbya I'm recalling similar discussion now but only vaguely :)23:55
clarkbas long as the review01 prefixes on backup01 are unique we should be good for future prunes there. I think at this point my only real concerns are that we double check there aren't any additional servers in that situation (backup02: review01 and lists are known) and decide if we need to rename anything like with lists if possible23:56
clarkboh and storyboard db backups should happen23:56
clarkbI don't think anything is urgently broken. The only thing I can identify as wrong has been that way since august23:57
clarkbianw: I'm going to have to do dinner here in a couple of minutes. Maybe ping me or email or whatever if there is anything I should pick up in the morning. Like skimming through the listings and looking for any non-unique prefixes we don't already know about. Then we can decide if/how we want to handle any of them23:59
ianwlists and review are the ones affected by this23:59
clarkbianw: did you run that on backup01 as well?23:59
clarkbcurious if it has any others that might be affected23:59
ianwno, just backup02 for now, but i will do for both23:59

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!