Nick | Message | Time |
---|---|---|
ianw | clarkb: scrolling back :) | 00:00 |
clarkb | ianw: sorry it's a bit stream of consciousness going on there | 00:02 |
ianw | i do think we've probably never had something as old as <1.0.9 in the mix | 00:05 |
clarkb | ok thats good to know | 00:05 |
clarkb | I expect that makes our risk here extra low | 00:05 |
clarkb | (because anything that gets deleted probably should be deleted) | 00:05 |
clarkb | and hopefully borg mount isn't completely unhappy with the situation :) | 00:06 |
ianw | from the initial commit 028d6553750eaafc24ef7c244fbd4832d68a11a2 | 00:06 |
clarkb | should know soon. I'm on the node in question just waiting for the job to actually get to that point now | 00:06 |
ianw | +borg_version: 1.1.13 | 00:06 |
clarkb | ya I see that too. | 00:07 |
clarkb | I didn't even think to check that (too many other things) | 00:07 |
ianw | so i think that clears up the TAM issue. probably no way to do things other than what you're doing, working through issues trying to use the 1.2 client | 00:13 |
clarkb | ya and probably we'll continue to learn more if/when we eventually get 1.2 into production for paste | 00:14 |
clarkb | it's interesting how much of a forcing function python3.12 has become | 00:14 |
ianw | upgrading the borg version has been on my todo list for ... forever. i sort of vaguely had it in my head the best way might be to make sure that running 1.4 it's done as a totally separate thing, in parallel. basically, work things so that two borgs can run separately | 00:14 |
ianw | that seems to be most future proof, in that when a new major comes out you cut over to it in a fresh backup, and stop the old version when you're happy with the new | 00:15 |
clarkb | the 1.2 to 1.4 upgrade doesn't actually look too bad. And 1.1 to 1.2 is really only bad for the tam stuff. Everything else they've written down seems solvable | 00:15 |
clarkb | but ya maybe another option is a new backup server also running noble and running 1.4 and then paste runs 1.4 | 00:15 |
clarkb | we shall see where the testing gets us | 00:16 |
clarkb | `borg mount not available: no FUSE support, BORG_FUSE_IMPL=pyfuse3,llfuse.` adding -e was helpful | 00:16 |
ianw | ++ broke my own rules of being "set -eu" safe there :) | 00:17 |
clarkb | ok I think I see how to fix this. New patch in a bit. | 00:19 |
ianw | oh i didn't think of /var/log/borg-... updating while it's being backed up. probably should have put that in a subdir that is ignored | 00:20 |
ianw | i guess it's been warning/failing to back that up all the time, just 1.2 tells you about it? | 00:20 |
clarkb | ianw: ya the change is 1.2 exits 1 if there are warnings | 00:21 |
clarkb | I've just written a monstrosity of a jinja template string. Let's see if it works | 00:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use newer borgbackup on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/939491 | 00:23 |
clarkb | this might have to be the last patchset before dinner though. Feel free to poke at that change if there is interest. Otherwise I'll pick it up tomorrow | 00:23 |
ianw | you probably need a "+" in -> 'borgbackup[fuse]=='borg_version | 00:24 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use newer borgbackup on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/939491 | 00:30 |
clarkb | thanks! I figured I would get something wrong in that | 00:30 |
clarkb | the new pyfuse3 lib is lgpl licensed and maintained by the borg person so that seems fine to switch to | 00:31 |
clarkb | I don't know why they broke out 'fuse' extras into different library options. Would be nice if they kept fuse and also added the more specific options for people who need them | 00:31 |
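
For context, a rough sketch of what the pip invocation boils down to once the Jinja concatenation above is fixed. The virtualenv path is an illustrative assumption (the role installs borg into its own venv), and the versions are the ones discussed later in the log:

```bash
# Illustrative venv path; newer (Noble) hosts get 1.2.x with the pyfuse3 extra:
/opt/borg/bin/pip install 'borgbackup[pyfuse3]==1.2.8'
# Older hosts stay on 1.1.x, which still ships the combined "fuse" extra:
/opt/borg/bin/pip install 'borgbackup[fuse]==1.1.18'
```
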
opendevreview | Clark Boylan proposed opendev/system-config master: Use newer borgbackup on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/939491 | 00:53 |
clarkb | it failed quickly enough with an actionable error message to get another ps in | 00:53 |
MattCrees[m] | Hello #opendev, would anyone be available to hold a node for me please? I'd like to get in and debug the CI failures here https://review.opendev.org/c/openstack/kolla-ansible/+/924623 | 10:54 |
frickler | MattCrees[m]: which job? and then I'd need your ssh public key | 11:07 |
MattCrees[m] | kolla-ansible-ubuntu-podman | 11:07 |
MattCrees[m] | ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFSiegoucsoBqvk22UEqHPX1NL48kAGIBT3/drMGTNed mattc@stackhpc.com | 11:08 |
frickler | MattCrees[m]: ok, hold set up and recheck triggered, I'll let you know when the node is ready | 11:10 |
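
For reference, setting up such a hold on the Zuul side looks roughly like the sketch below; the reason text and count are made up for illustration, and the flags follow zuul-client's autohold command:

```bash
zuul-client autohold --tenant openstack \
  --project openstack/kolla-ansible \
  --job kolla-ansible-ubuntu-podman \
  --change 924623 \
  --reason "MattCrees debugging kolla-ansible-ubuntu-podman" \
  --count 1
```
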
MattCrees[m] | Brilliant, thank you :) | 11:10 |
fungi | MattCrees[m]: ssh root@213.32.78.123 | 13:45 |
MattCrees[m] | I'm in, thanks | 13:46 |
fungi | no sweat, just let us know once you're done so we can release its resources back into the pool | 13:48 |
priteau | Hello. Is there an issue with opendev right now? It is suddenly very slow for me, causing timeouts to fetch upper constraints. | 14:38 |
frickler | priteau: I'm seeing sporadic failures, too, like: stderr: 'fatal: unable to access 'https://opendev.org/zuul/zuul-jobs/': gnutls_handshake() failed: The TLS connection was non-properly terminated.' | 14:42 |
priteau | I seem to remember a similar issue a while ago. Apache might need a restart? | 14:53 |
fungi | well, if recent experience is any indication, it's being overrun by new llm training crawlers we need to block | 14:57 |
priteau | Lovely | 14:58 |
fungi | load average on all 6 gitea servers is over 20 | 14:58 |
mnasiadka | Sweet… | 14:59 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/apache-ua-filter/files/ua-filter.conf is the list of what we're blocking so far | 15:05 |
MattCrees[m] | Hi again, I'm finished with that node now, feel free to release it | 15:15 |
fungi | thanks MattCrees[m]! i've cleaned up your hold | 15:18 |
fungi | so, analyzing apache logs on gitea09, in the past 10 minutes it received 32107 requests from clients claiming this ua string: | 15:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use newer borgbackup on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/939491 | 15:21 |
fungi | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43" | 15:21 |
clarkb | fungi: that is not a modern chrome though I suppose it could be a current edge/safari. Does Edge report itself with the missing e? | 15:22 |
clarkb | fungi: not urgent keep looking at gitea but I woke up this morning with the idea that I needed to refactor 939491 a bit to make it more sane and that is all the latest ps does. Should be equivalent just easier to understand and modify in the future | 15:22 |
fungi | the other backends seem to be seeing similar request volumes with the exact same ua | 15:23 |
fungi | note that it's two orders of magnitude more requests than for the second most active ua | 15:23 |
fungi | it looks like real edge uses "Edge/some.version" instead | 15:24 |
fungi | i'll stick this one into the list | 15:25 |
clarkb | ++ | 15:25 |
clarkb | frickler: not sure if you've seen 939491 but its commit message is probably worth reading if you have time. Mostly to call out any additional things we should test | 15:28 |
clarkb | I have since confirmed that our prune and verify scripts are tested in CI so I expect that a 1.2.8 client pushing to a 1.1.18 server can prune and verify/check safely using 1.1.18 | 15:28 |
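
Those scripts boil down to borg operations along these lines; the repository path and retention values here are illustrative, not the production settings:

```bash
REPO=/opt/backups/borg-paste02/backup   # illustrative path, not the real layout
borg check --verify-data "$REPO"        # what the weekly verification exercises
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$REPO"   # prune old archives
```
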
clarkb | the biggest issues will come later for existing hosts if/when we migrate them to 1.2. In particular if we have to clean up TAMless archives and then there is also a step to run to generate some index data I think | 15:29 |
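
A hedged sketch of the kind of follow-up the borg 1.2 upgrade notes describe for existing repositories; consult the 1.2 changelog before running anything like this, and note the repo path is again illustrative:

```bash
REPO=/opt/backups/borg-example/backup
# Add TAMs to archives written by older clients (borg >= 1.2.6):
borg upgrade --archives-tam "$REPO"
# In 1.2, prune no longer frees space on its own; compaction is a separate step:
borg compact "$REPO"
```
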
opendevreview | Jeremy Stanley proposed opendev/system-config master: Block another bogus crawler user agent https://review.opendev.org/c/opendev/system-config/+/939541 | 15:32 |
fungi | priteau: hopefully that ^ will help | 15:32 |
clarkb | fungi: ^ is missing the [OR] condition at the end of line on the line prior to your addition | 15:32 |
fungi | d'oh | 15:33 |
clarkb | fungi: also you are doing a direct match so can use = instead of anchoring the regex | 15:33 |
clarkb | that might make it slightly faster | 15:33 |
clarkb | look at examples near the beginning of the list and compare to the end | 15:33 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Block another bogus crawler user agent https://review.opendev.org/c/opendev/system-config/+/939541 | 15:34 |
clarkb | you can avoid extra \s that way | 15:34 |
clarkb | but the regex should work so is fine too | 15:34 |
fungi | can do | 15:34 |
clarkb | ya I always have to remember to add the [OR] because it's easy to yy p and then forget | 15:35 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Block another bogus crawler user agent https://review.opendev.org/c/opendev/system-config/+/939541 | 15:35 |
fungi | indeed, exactly what i did too | 15:36 |
clarkb | I've gone ahead and approved it. I don't think that can make things worse if CI passes and it should hopefully make things better | 15:36 |
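
One quick way to confirm the new rule behaves once deployed: request a page with the offending UA string (copied from the logs above) and expect a 403, while a normal browser UA should still get a 200:

```bash
curl -sS -o /dev/null -w '%{http_code}\n' \
  -A 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43' \
  https://opendev.org/opendev/system-config
# expect 403 once the rule is live; repeat with your own browser UA to confirm 200
```
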
clarkb | then I guess if we get gitea settled down we have to consider if we're gonna send the borg update. | 15:37 |
clarkb | I think our CI coverage is pretty good so it's mostly a matter of ensuring the change itself is structurally sound and that the various scary borg upgrade warnings have been considered | 15:38 |
fungi | thanks | 15:39 |
tobias-urdin | stupid crawlers, holding my thumbs – been trying to clone https://opendev.org/openstack/puppet-barbican for a while now | 15:47 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use newer borgbackup on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/939491 | 15:56 |
clarkb | ok that was a silly mistake in my refactor (the whole point was to capture the different fuse extras and then I didn't) but I'm hopeful for this latest ps | 15:56 |
fungi | seems like i missed that oopsie | 15:58 |
fungi | pypi has added the ability to "archive" deprecated projects: https://discuss.python.org/t/13937 | 16:01 |
priteau | fungi: Thanks. Just pointing out that a Google Search finds lots of User Agent entries for "Edg/114.0.1823.43", unlike with "Edge", so this may be a genuine one (that might still be used by a crawler) | 16:05 |
fungi | interesting, when i went looking up the ua for edge it seemed the official one was spelled out fully | 16:06 |
clarkb | also that version string is old | 16:07 |
fungi | maybe it changed at some point | 16:07 |
clarkb | I think even if it were valid we'd tell people to 1) upgrade and 2) stop being a nuisance | 16:07 |
clarkb | but yes it looks like Edg/version is potentially valid | 16:09 |
fungi | i wonder what prompted them to abbreviate a 4-character name to 3 | 16:10 |
frickler | for me, opendev.org is still essentially down, so maybe force-merging 939541 once checks pass would be good? or manually applying? | 16:20 |
Clark[m] | Manually applying would be quickest and shouldn't get undone by hourly jobs | 16:25 |
clarkb | ok breakfast and morning tasks are done. I'm starting to manually apply the rule on 14 and working my way down | 17:07 |
clarkb | the rule should be in place everywhere. I accidentally restarted apache on gitea10 instead of reloading it. Everywhere else got a reload | 17:13 |
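
For the record, the manual application on each backend amounts to editing the deployed copy of the filter and reloading Apache; the config path here is an assumption (wherever the apache-ua-filter role installs it):

```bash
sudo vi /etc/apache2/ua-filter.conf                           # path is an assumption
sudo apachectl configtest && sudo systemctl reload apache2    # reload, not restart
```
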
clarkb | I don't expect that restart was very noticeable with all the other problems. | 17:13 |
clarkb | load seems to be steadily falling | 17:13 |
clarkb | things appear slow but improving in my testing of the web ui | 17:15 |
clarkb | I suspect some of that is that the crawling pollutes the caches in gitea, so the things interactive users want need to be re-cached before they get quicker | 17:15 |
clarkb | looking at logs we're sending a lot of 403s now | 17:22 |
clarkb | seems like the experience I get now is initial load of a new page is slow (likely because gitea has to rebuild and then cache the content) then subsequent loads are reasonable | 17:32 |
clarkb | I suspect this is far more usable now | 17:32 |
opendevreview | Merged opendev/system-config master: Block another bogus crawler user agent https://review.opendev.org/c/opendev/system-config/+/939541 | 17:56 |
clarkb | applying ^ should be a noop at this point but good to get it into config so it doesn't unapply later | 17:57 |
clarkb | fungi: if you're not otherwise distracted by other obligations I think my priority is borg fix then gitea then gerrit. And honestly I'd be happy just to muddle through the borg fix. If we want to prioritize based on risk I'd say gitea is less risky, then borg and gerrit are probably similarly likely to need intervention (gerrit if the restart goes sideways again) | 18:07 |
fungi | clarkb: i'm back from lunch so taking a look now | 18:16 |
fungi | have to find some time this afternoon to tidy up the mil's yard a little since the weekend weather is looking gross, but otherwise nothing pressing | 18:18 |
fungi | borg fix first sounds good | 18:19 |
clarkb | as mentioned before I think with the scope of this initial borg update change the worst case should be that new paste02 server backups are sad and everyone else continues as is. But our ci job seems to show general happiness. I am hopeful | 18:20 |
clarkb | I guess I should triple check logs in the job but I looked and I think check/verify and prune are both happy. Can also triple check the correct versions are installed on the correct hosts | 18:21 |
fungi | yeah, i looked through the borg release notes and agree the risk is low and manageable in the worst case | 18:21 |
fungi | checking build logs now | 18:21 |
clarkb | https://zuul.opendev.org/t/openstack/build/4f9d33fd3a4b4d24a17c937c0fe227ee/log/borg-backup01.region.provider.opendev.org/verify-borg-backups.log https://zuul.opendev.org/t/openstack/build/4f9d33fd3a4b4d24a17c937c0fe227ee/log/borg-backup01.region.provider.opendev.org/prune-borg-backups.log and | 18:22 |
clarkb | https://zuul.opendev.org/t/openstack/build/4f9d33fd3a4b4d24a17c937c0fe227ee/log/borg-backup-noble.opendev.org/borg-backup-borg-backup01.region.provider.opendev.org.log lgtm | 18:22 |
clarkb | I'm trying to dig up installations to verify versions next | 18:22 |
clarkb | https://d9c6b55444bb367e4cfd-f794ac2492944f87fa91e71db0988d60.ssl.cf2.rackcdn.com/939491/11/check/system-config-run-borg-backup/4f9d33f/bridge99.opendev.org/ara-report/playbooks/4.html ^F pip then check the entry for each node | 18:23 |
clarkb | I think I'm happy with ^ (there is a test case that tries to check the installed version too) | 18:24 |
clarkb | fungi: I guess let me know if you think the new version isn't more readable/understandable than what I had written yesterday. I think I personally prefer this as it's less magic and more explicit | 18:25 |
clarkb | it should also be easier to override on a server by server basis if we change borg_version | 18:25 |
clarkb | as a followup we might want to make the _fuse_extras value overridable too but that doesn't seem important yet | 18:25 |
fungi | yeah, i don't see anything concerning in the logs | 18:30 |
clarkb | oh as a note nothing currently sets borg_version so we're keeping the overridable default for potential future use | 18:31 |
clarkb | I see your +2. Do we want to go ahead and send it in now? I suspect there won't be anyone else to review it until Monday. Not sure if you think this is worth waiting on and then switching to other stuff in the interim | 18:33 |
clarkb | I think I'd like to get it in before the weekend because that is when the verification script runs iirc. But I can go either way | 18:33 |
clarkb | I'll defer to you on what you feel is safest for your availability. I expect to be able to be around all day if necessary | 18:36 |
fungi | yeah, approved | 18:46 |
fungi | and yes the backup verifications are weekly on sunday | 19:05 |
fungi | that's what was conflicting with the gerrit server backups for a while | 19:05 |
clarkb | cool getting this in now should get us good exercise then. ~4-6 backups then verification sunday | 19:06 |
clarkb | assuming it works. It won't be the first time it fails but at least this time we are directly testing it so odds seem better | 19:06 |
clarkb | the change should merge around when hourly jobs end so we should know soon | 19:08 |
fungi | hourly jobs have already finished | 19:22 |
fungi | last gate job is wrapping up finally | 19:23 |
opendevreview | Merged opendev/system-config master: Use newer borgbackup on Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/939491 | 19:24 |
clarkb | here we go | 19:24 |
clarkb | the job has begun | 19:25 |
fungi | load average on the gitea servers has fallen down around 1 | 19:30 |
clarkb | paste02 has crontab entries for backups now | 19:30 |
clarkb | fungi: wouldn't surprise me if a good chunk of that is simply returning 403s | 19:31 |
fungi | yeah | 19:32 |
fungi | deploy job succeeded | 19:32 |
clarkb | https://zuul.opendev.org/t/openstack/build/b2813525b82a4d129c2e3b41c064f768 yup | 19:32 |
clarkb | 0526 and 1726 are the two backup attempt times | 19:32 |
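
Confirming those entries on paste02 is just a matter of listing root's crontab (assuming the role installs the backup jobs for root, as the cron times above suggest):

```bash
sudo crontab -l | grep -i borg
```
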
clarkb | so we're solidly in the middle of that. We could manually run them but I'm not sure that is necessary unless you want to give that a go | 19:33 |
clarkb | for https://review.opendev.org/c/opendev/system-config/+/938826 I guess we have to decide if we're concerned about the load impacting that somehow though as you mention load seems consistently reasonable now | 19:34 |
clarkb | but given my priority list ^ would be next | 19:34 |
clarkb | I'm going to go clean up some autoholds I made for backup testing | 19:34 |
clarkb | tkajinam still has a hold on a heat-functional setup | 19:35 |
clarkb | tkajinam: ^ can we clean that up? | 19:35 |
fungi | yeah, i think gitea activity has calmed down enough that rotating through the restarts should have minimal impact | 19:36 |
fungi | clarkb: i can approve 938826 if you're ready for that | 19:37 |
clarkb | fungi: I think I am. Not sure what other testing I would do for that at this point. I guess triple checking the backup situation. The gate jobs take about an hour though so I say approve it and we can unapprove if we discover problems in the next hour with other things that demand attention | 19:38 |
clarkb | I'm going to check borg versions on some representative servers | 19:38 |
fungi | yeah, the held node looked fine to me too | 19:38 |
fungi | approved the gitea upgrade just now | 19:38 |
clarkb | spot checking a backup server, paste02 and etherpad the borg versions are as expected | 19:39 |
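
The spot check is essentially asking each installed borg for its version; the venv path below is the same illustrative assumption as earlier:

```bash
/opt/borg/bin/borg --version   # expect 1.2.8 on paste02 (Noble) and 1.1.18 on older hosts
```
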
fungi | i'm going to spend a few minutes on yard cleanup while that filters through the gate | 19:41 |
clarkb | sounds good | 19:42 |
clarkb | I'll be making something for lunch soon myself | 19:42 |
fungi | gate jobs are looking good so far, zuul is estimating another 50 minutes to merge | 19:57 |
clarkb | and I'm going to make food now. | 19:58 |
fungi | looks like it'll be merging in the next few minutes, deploy may end up fighting the hourlies | 20:46 |
clarkb | I suspect if it merges according to the estimate it will deploy before the hourlies | 20:47 |
fungi | yeah, it finished faster than i expected | 20:47 |
opendevreview | Merged opendev/system-config master: Update to Gitea 1.23 https://review.opendev.org/c/opendev/system-config/+/938826 | 20:48 |
fungi | bam | 20:48 |
clarkb | I've got a git clone command ready to run against gitea09 once it is done and have gitea09 opened to system-config to check web | 20:50 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ loads for me and reports the expected version | 20:52 |
clarkb | git clone works too | 20:52 |
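
The clone test goes straight at the backend, bypassing the load balancer, using the URL given above:

```bash
git clone https://gitea09.opendev.org:3081/opendev/system-config
```
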
clarkb | mariadb switch seems to have mostly nooped as we expect | 20:53 |
clarkb | image ids match etc but container was restarted as part of the update and we report the quay location as the image name | 20:53 |
clarkb | web ui seems to be working for me | 20:54 |
clarkb | 10 is done now too | 20:55 |
fungi | yeah, it's working fine for me | 20:55 |
clarkb | 11 is done. Halfway there | 20:56 |
clarkb | now down to waiting on 14 to be done | 21:00 |
clarkb | fungi: if you happen to have a new patchset for an existing change you can push that would be good to check replication. I'll look at my change list too once 14 is done | 21:00 |
fungi | load on them is spiking up around 5 but starting to fall again, so probably just additional startup pressure | 21:00 |
clarkb | fungi: I think it must be doing upgrade tasks (db migrations?) the ssh service isn't starting for like 30-50 seconds while ansible waits for web to come up | 21:01 |
fungi | yeah, seems likely | 21:02 |
fungi | load average is back to 1 on gitea09 | 21:02 |
clarkb | and now 14 is done | 21:02 |
clarkb | https://zuul.opendev.org/t/openstack/build/ccec84325b7e4a0cbbe9e55da57355b8 job reports success too | 21:03 |
clarkb | now to find something that will trigger replication | 21:03 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Reapply "Move lodgeit image publication to quay.io" https://review.opendev.org/c/opendev/lodgeit/+/939385 | 21:04 |
clarkb | that change needed a recheck anyway so I just updated the commit message to trigger reruns of jobs | 21:04 |
clarkb | now to check replication | 21:04 |
clarkb | `git fetch origin refs/changes/85/939385/2` shows me what I expect with origin https://opendev.org/opendev/lodgeit (fetch) | 21:06 |
clarkb | so I think replication is working | 21:06 |
clarkb | includes the edit to the commit message and the sha matches | 21:06 |
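
Putting the replication check together: fetch the new patchset ref from the opendev.org mirror and compare what comes back with Gerrit. A sketch using the ref above, in a clone whose origin is https://opendev.org/opendev/lodgeit:

```bash
git fetch origin refs/changes/85/939385/2
git log -1 --format='%H %s' FETCH_HEAD   # sha and subject should match patchset 2 in Gerrit
```
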
opendevreview | Jeremy Stanley proposed opendev/system-config master: Install ssl-cert-check from distro package not Git https://review.opendev.org/c/opendev/system-config/+/939187 | 21:06 |
* fungi | was too slow | 21:06 |
clarkb | no it is good to sanity check I'm not doing anything wrong | 21:07 |
clarkb | looks like load is elevated on 9, 13, and 14 | 21:07 |
clarkb | nothing crazy though | 21:07 |
clarkb | also falling. It could just be luck of the draw for those getting reconnects or similar from clients that balance to them. I don't see anything that makes me think there is a problem | 21:08 |
clarkb | I can't remember who was pointing out the occasional issues with their jobs fetching constraints. Was it noonedeadpunk? Anyway I am hopeful that the reliability changes that have gone into 1.23 make that better now | 21:11 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix a typo in the uwsgi-base mirror short name field https://review.opendev.org/c/opendev/system-config/+/939562 | 21:16 |
clarkb | I'm trying to find the periodic job logs now to see if that just failed previously. It doesn't look like there is a 'uwsgi base' image created for us anyway | 21:16 |
clarkb | ya the job failed and eventually went into retry failure | 21:18 |
clarkb | landing 929562 should hopefully be all we need to correct that | 21:19 |
fungi | ERROR: failed to solve: quay.io/opendevmirror/uwsgi-base:3.11-bookworm: failed to resolve source metadata for quay.io/opendevmirror/uwsgi-base:3.11-bookworm: unexpected status from HEAD request to https://quay.io/v2/opendevmirror/uwsgi-base/manifests/3.11-bookworm: 401 UNAUTHORIZED | 21:20 |
fungi | reason for that i guess? | 21:21 |
clarkb | yes | 21:22 |
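
Once 939562 merges and the mirror job runs again, the tag should resolve; a quick way to check (either tool works, whichever is handy):

```bash
skopeo inspect docker://quay.io/opendevmirror/uwsgi-base:3.11-bookworm
# or simply: docker pull quay.io/opendevmirror/uwsgi-base:3.11-bookworm
```
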
fungi | 929562 is abandoned "DNM - Trest CentOS" so i assume you meant another change | 21:22 |
clarkb | 939562 the change I just pushed. That's what I get for typing the change out by hand | 21:22 |
clarkb | ironic given that was likely also the source of the original bug | 21:22 |
fungi | hah, whoops (re: 939562) | 21:24 |
opendevreview | Merged opendev/system-config master: Fix a typo in the uwsgi-base mirror short name field https://review.opendev.org/c/opendev/system-config/+/939562 | 21:34 |
clarkb | what an eventful day. I think it's basically EOD for fungi so we can hold off on Gerrit today (it is less urgent) | 21:49 |
clarkb | and that gives me an opportunity to maybe go out for a walk in the sunlight before it is gone | 21:49 |
fungi | do it! | 21:50 |
fungi | and yeah, i'm mostly checked out at this point, but will try to keep an eye on stuff off and on over the weekend as time permits | 21:51 |