corvus | something about how swiftclient uploaded them did that | 00:00 |
---|---|---|
corvus | but things i uploaded through sdk do get deleted with that method | 00:00 |
corvus | so, tldr, with that patch we will have all the functionality we need while working around the two known bugs in the service | 00:00 |
clarkb | got it | 00:01 |
opendevreview | Merged opendev/system-config master: Add basic docs on updating the OpenAFS ppa https://review.opendev.org/c/opendev/system-config/+/935026 | 06:13 |
frickler | this sounds interesting, might be worth discussing whether we can/should work on adding opendev to that list https://docs.pypi.org/trusted-publishers/internals/ | 09:51 |
fungi | frickler: it would have to be our keycloak oidc that gets added, i think we need to work on finishing the migration off launchpad if we're going to consider using it to authenticate package publication and supply attestations | 13:12 |
fungi | also one of the criteria they listed in conversations about adding other code forges is whether they employ on-call staff. they've essentially ruled out community-run providers (ironically, since i don't even think pypi itself lives up to the expectations they're setting for the organizations running trusted publishes) | 13:15 |
fungi | s/publishes/publishers/ | 13:15 |
frickler | hmm, bummer | 13:53 |
fungi | at the very end of that document they hint at it: "Reliability & notability: The effort necessary to integrate with a new Trusted Publisher is not exceptional, but not trivial either. In the interest of making the best use of PyPI's finite resources, we only plan to support platforms that have a reasonable level of usage among PyPI users for publishing. Additionally, we have high | 13:56 |
fungi | standards for overall reliability and security in the operation of a supported Identity Provider: in practice, this means that a home-grown or personal use IdP will not be eligible." | 13:56 |
fungi | we're publishing maybe 1k different packages to pypi, tops. they've been focusing on integrating providers multiple orders of magnitude larger than that | 13:58 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Switch to using openstacksdk for image uploads https://review.opendev.org/c/opendev/zuul-jobs/+/935218 | 14:51 |
corvus | fungi frickler actually the upcoming zuul oidc may be a better choice for pypi integration. however, their requirement regarding a resurrection attack would need a solution (as there is no "owner account id" in zuul). in principle i don't think there's a vulnerability here because there are no usernames to reassign, but figuring out how to meet or waive that requirement is work. | 15:02 |
fungi | ah, yeah i hadn't thought about zuul-as-an-oidc there | 15:02 |
fungi | i imagined it calling out to keycloak to do the tokens | 15:03 |
corvus | yeah, trusted publishing of artifacts is really the use case it's intended for; it's a good match. | 15:04 |
fungi | right, i even read through that spec, it should have dawned on me | 15:04 |
corvus | pypi is worried about github changing user names, but in opendev the repo is the repo; we don't reassign repos to different "users". | 15:05 |
corvus | #status log disabled slow query log on zuul-db01 | 15:08 |
opendevstatus | corvus: finished logging | 15:08 |
corvus | fungi: one thing in our favor in convincing them to add us: we publish some packages with rather high download numbers. | 15:10 |
fungi | this is true, a few hundred are in the "critical" whatever category they initially used to decide who would end up with mandatory 2fa enforcement | 15:11 |
corvus | yep. so maybe some time next year when they're through the initial busy phase and we've got the oidc thing in place... might be worth looking into it then. :) | 15:12 |
fungi | corvus: looks like that last fix to 935218 probably isn't the last | 15:16 |
corvus | yeah, i was skeptical of that, but it was easy, and the only one i could see without starting to take stuff apart again. :/ | 15:17 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Switch to using openstacksdk for image uploads https://review.opendev.org/c/opendev/zuul-jobs/+/935218 | 15:32 |
corvus | that should be it. | 15:32 |
corvus | clarkb: i've started pruning the registry in a root screen session on the server | 15:46 |
Clark[m] | Thanks! | 15:47 |
corvus | #status log started manual prune of intermediate registry | 15:47 |
opendevstatus | corvus: finished logging | 15:47 |
corvus | it ran for a bit, then stopped with an error; it got a 404 when trying to delete something. the url it was deleting also showed up in the dry run, so it's been there for a bit (ie, it's not racing a new upload or anything). since it has shown up in two object listings, this feels like a situation where the object backend is out of sync with the listing. | 15:57 |
corvus | i think we should just restart the process and see if it fails at the same place, or continues. based on what i was seeing with rax flex yesterday, issuing a delete and getting a 404 may actually be effective in putting the object listing back in sync with the backend. | 15:57 |
Clark[m] | WFM I think we already established that we expect the process to be resumable | 15:58 |
corvus | ack, restarting with a new log file name (-2) | 15:58 |
corvus | it resumed without error and proceeded past that point; looks like the problem object did not show up in the listing after the 404-delete | 15:59 |
corvus | ftr, it was upload 0895b58737d54a44a63252b3c8333651 -- if you want to see that in the logs later | 16:00 |
Clark[m] | Should the prune proceed if it gets a 404 from delete calls? | 16:03 |
Clark[m] | That seems like it should generally be safe since 404 means there is nothing there? | 16:03 |
corvus | yeah i think so. it stopped again on a similar error; it will probably save us time to actually make that change :) | 16:14 |
clarkb | ok, finally sitting down at the computer. are you working on that or should I? | 16:28 |
corvus | almost done | 16:28 |
clarkb | ack | 16:28 |
corvus | remote: https://review.opendev.org/c/zuul/zuul-registry/+/935370 Ignore 404 on swift object delete [NEW] | 16:29 |
corvus | (sorry for the delay, i was also stuffing my face with breakfast) | 16:29 |
clarkb | my delays are also morning related. I now have tea | 16:29 |
corvus | the prune stopped again; i think we should just wait for that to merge before proceeding | 16:30 |
clarkb | wfm | 16:30 |
clarkb | corvus: will that retry function retry on a 404 several times too, so we know that it's really not found? | 16:31 |
corvus | clarkb: no, it stops on 404 | 16:31 |
clarkb | oh no, NotFound raises immediately | 16:31 |
fungi | lgtm | 16:31 |
corvus | but if we ever get a 404 that's a lie (i don't think we have, but if we did) then presumably it would show up in a future object listing and a future prune would find and delete it. | 16:32 |
clarkb | fungi raced me to the +2. I've approved it now | 16:32 |
clarkb | corvus: and we'll log it so it will be somewhat traceable | 16:32 |
corvus | ya | 16:33 |
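(For illustration, a minimal sketch of the catch-and-log pattern being discussed here; it is not the actual zuul-registry change 935370. It assumes an openstacksdk connection, and the exception class name and `ignore_missing` parameter may differ between SDK versions.)

```python
# Hypothetical sketch: treat a 404 on delete as "already gone", but keep it
# visible in the logs so stale listing entries remain traceable.
import logging

from openstack import exceptions as os_exc

log = logging.getLogger("registry.swift")


def delete_object_tolerating_404(conn, container, name):
    try:
        # ignore_missing=False so the SDK surfaces the 404 instead of
        # silently swallowing it; we then log it and carry on.
        conn.object_store.delete_object(
            name, container=container, ignore_missing=False)
    except os_exc.NotFoundException:
        log.debug("NotFound error when deleting %s/%s", container, name)
```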
clarkb | registry should get merged and promoted in the next ~10 minutes. We can then wait for hourly jobs to update the running service and its image or we can do that manually and speed things up by ~15 minutes | 16:41 |
clarkb | promote succeeded | 16:53 |
corvus | i will pull and restart now | 16:54 |
corvus | restarted; log #4 | 16:55 |
clarkb | I see it going in screen too | 16:55 |
clarkb | corvus: I think since it's an exec we're still running the old code? | 16:56 |
corvus | oh huh i thought we went with run oops | 16:56 |
clarkb | and that the hourly autoupdate may interrupt us? Not a big deal | 16:56 |
clarkb | no I tried run and it failed due to using host networking or something | 16:57 |
corvus | bad memory on my part | 16:57 |
clarkb | we can see what causes it to die first: a 404 or the hourly runs :) | 16:57 |
corvus | so yeah, it'll be interrupted one way or another in the next few mins. then we'll run the new code. :) | 16:57 |
corvus | exactly | 16:57 |
corvus | 404 wins | 16:58 |
corvus | down/up and restarted on log #5 | 16:59 |
corvus | 2024-11-15 17:00:14,730 DEBUG registry.swift: NotFound error when deleting _local/uploads/5e7f892ab4824e9b832b3f2139d215de/metadata | 17:00 |
corvus | that's our new log message; it did not stop for that, so that's good. | 17:01 |
clarkb | ++ | 17:01 |
clarkb | fwiw the hourly jobs for zuul-registry have completed and they didn't restart containers (as expected) and pruning continues | 17:20 |
corvus | it finished the uploads and is pruning manifests now. | 17:21 |
corvus | based on line numbers from the dry run, we're 9.8% through. | 17:38 |
corvus | we are doing 8 delete calls/second | 17:40 |
clarkb | and we've been running for a grand total of about an hour? | 17:41 |
clarkb | so this won't complete until EOD ish? not bad considering how long it's been since we pruned | 17:41 |
corvus | yeah, i think i estimated 51 hours if each delete took 1 second... so i think it's something in the ball park of 8 hours total with those figures. | 17:42 |
corvus | (about 1.5 hours for everything except the deletes, add about 6 hours of deletes) | 17:43 |
corvus | that's napkin math, but hopefully the right oom. :) | 17:43 |
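(Working that napkin math through: ~51 hours at 1 delete/second implies roughly 51 × 3600 ≈ 184k deletes; at the observed 8 deletes/second that is about 23,000 seconds, or ~6.4 hours of deletes, plus the ~1.5 hours of listing, which lands near the 8-hour total above.)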
clarkb | fungi: re ^ I suspect we can do something similar for a general swift container deletion tool. It won't run quickly but we can basically say "delete everything older than X days and ignore 404s" and then let that run in the background somewhere for various containers | 17:53 |
fungi | for the containers i'm interested in cleaning up, their contents are at least 3-4 years old (some haven't had new objects created in a decade) | 17:54 |
fungi | but also some of them have millions of objects | 17:55 |
fungi | so no clue how long that would take | 17:55 |
clarkb | fungi: 8 deletes per second :) | 17:55 |
* clarkb does some math | | 17:55 |
clarkb | day and a half per million? | 17:55 |
fungi | this is deletions in old rackspace classic's swift, yeah? | 17:56 |
clarkb | yes | 17:56 |
fungi | cool, maybe doable then | 17:57 |
fungi | yeah, logs_periodic has an object count of 2359532 for example | 18:12 |
fungi | logs_periodic-stable is 1198139 | 18:12 |
clarkb | in theory we could do them in parallel but still that's a few days | 18:13 |
clarkb | but that should be doable | 18:13 |
fungi | there's also 256 different logs_NN containers that need deleting | 18:13 |
clarkb | are those not the ones we use today? | 18:13 |
clarkb | I can't remember the sharding scheme we use today | 18:14 |
fungi | nope, zuul_opendev_logs_NNN now (1024 of them) | 18:14 |
fungi | sorry, 2048 of those | 18:14 |
clarkb | oh right we moved to a prefix and suffix system in zuul-jobs because s3 doesn't do name collisions | 18:14 |
fungi | bad math day | 18:14 |
clarkb | and ceph mimics s3 | 18:15 |
fungi | the logs_NN containers seem to average in the thousands of objects each, so will go faster individually but there are still 256 of them | 18:16 |
clarkb | and in all cases for your containers we don't care about object timestamps/ttls; we want to delete them all and then delete the container, which should simplify things a little | 18:17 |
fungi | exactly | 18:17 |
fungi | i manually removed the cdn mapping for all of them already months ago too | 18:17 |
fungi | so they're no longer published/reachable | 18:17 |
clarkb | in that case ya I think we should just do a script that basically does, for each object in container foo, a delete (ignoring 404s), then attempts a container delete at the end, and if that fails we come back around and figure that out | 18:18 |
clarkb | I suspect that 80% of the containers will clean up just fine that way, then we'll find the corner cases that don't | 18:18 |
clarkb | I guess we may also need some logic in there to get more than 10k objects back in each listing, or whatever the limit is (would have to look at openstacksdk to see if it doesn't do that already) | 18:19 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a swift container deletion script https://review.opendev.org/c/opendev/system-config/+/935395 | 18:47 |
clarkb | fungi: ^ a very rough first draft. I think I need to understand better what the subdir vs non-subdir listing relationship is and how many objects we'll get back per iteration, and probably adding some argparse would be good too | 18:48 |
clarkb | I also haven't run that so it probably fails on a syntax error somewhere | 18:48 |
clarkb | but I think that is the general shape of a (slow) container deletion tool | 18:48 |
clarkb | after yesterday's adventures with swift and client tooling I'm not sure I'm in the mood to actually run and debug a tool like that, but I wanted to braindump so I can pick it up with a better mood next week | 18:49 |
fungi | thanks! | 18:52 |
fungi | and yeah, that's lower priority than whatever else we've got going on right now | 18:52 |
clarkb | oh ya we shouldn't need prefix and delimiter because we aren't actually interested in treating this like an fs tree (the original data may have been stored that way but our intent is to simply delete all the objects so we can just list them) | 19:01 |
fungi | right, absolutely no need to filter the object lists in any way | 19:02 |
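(A rough sketch of the container-wipe flow being discussed, purely for illustration; it is not the 935395 draft. It assumes openstacksdk, that the SDK's `objects()` generator pages through listings on its own, and that its default delete behavior tolerates 404s — both of which are the open questions above and worth verifying. Cloud and container names are placeholders.)

```python
# Hypothetical "delete every object, then the container" sketch.
import logging

import openstack
from openstack import exceptions as os_exc

log = logging.getLogger("swift-container-wipe")


def wipe_container(conn, container):
    # No prefix/delimiter: we want every object, not an fs-like tree view.
    for obj in conn.object_store.objects(container):
        # Per the discussion above, a 404 here should just be ignored: it
        # means the listing was stale and the object is already gone.
        conn.object_store.delete_object(obj.name, container=container)
    try:
        conn.object_store.delete_container(container)
    except os_exc.ConflictException:
        # Container still reports objects (listing lag); come back around.
        log.warning("%s not empty yet; retry later", container)


if __name__ == "__main__":
    conn = openstack.connect(cloud="example-cloud")
    wipe_container(conn, "logs_periodic")
```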
clarkb | reading the docs I think we get back a max of 10k objects per listing each time. I'm not seeing where zuul-registry handles that (it's possible that zuul-registry doesn't? I wonder if that would lead to extra deletes since we potentially wouldn't see all manifests to keep? cc corvus though I think we mitigate that in zuul-registry by listing each namespace (which is a container | 19:07 |
clarkb | image) separately) | 19:07 |
clarkb | corvus: ^ fyi I think that it may be possible that if a container image namespace has more than 10k entries we may overdelete since some manifests may not end up in the manifests to keep list | 19:07 |
clarkb | there are ~5k top level uploads and ~640 top level repos | 19:14 |
clarkb | and there are at least 15k blobs? I think because these are over the limit it's hard to say there aren't more | 19:16 |
clarkb | however since manifests are generally below the limit I think we should be fine, but this is still potentially buggy? | 19:16 |
fungi | openstack container show should give you an object_count | 19:20 |
fungi | if you need to know an exact number | 19:20 |
fungi | not sure what the sdk method is for that, but presumably something similar | 19:20 |
clarkb | fungi: in this case it is more complicated than that because zuul-registry is treating it more like a filesystem so its never listing the whole thing at any one time | 19:21 |
fungi | ahh | 19:21 |
clarkb | I think zuul/zuul is a problem. it has 10000 manifests according to my log grepping | 19:21 |
clarkb | which is curiously right at the limit. I'm working on collating the info from the logs to see if there are any others | 19:21 |
clarkb | at this point I think I can ^C and it will stop before deleting any blobs. It will only have deleted manifests | 19:22 |
clarkb | which I think is still in the "it hasn't broken anything yet" portion of the prune | 19:22 |
clarkb | its when we start deleting blobs that we'll have problems | 19:22 |
clarkb | ok there are several | 19:25 |
clarkb | but ironically I think the only one that matters is _local/repos/openstackswift/saio/manifests/ | 19:26 |
clarkb | the reason for that is all of the others are old zuul on dockerhub images and those are all old enough they should all get deleted anyway so it doesn't matter if we accidentally overdelete their blobs | 19:27 |
clarkb | however I think I will ^C at this point | 19:27 |
clarkb | if I don't hear any objections to doing that in the next couple minutes I will ^C | 19:30 |
clarkb | https://paste.opendev.org/show/baPZzWHsFxvDeo6CVOuf/ these have 10k object listings +1 entry in the logs for the top level manifest entry | 19:30 |
clarkb | the problem is we make a list of objects in all of the manifests to keep and we don't delete those. If we have an incomplete manifests to keep list then we will/can accidentally delete too many objects and then docker will fail | 19:31 |
clarkb | ok I ^C'd it | 19:38 |
clarkb | corvus: ^ fyi tl;dr is that looking into mass container deletion I realized that 10k objects is the limit for listings too, and that we may have truncated manifest listings; looking at the dry run data I was able to confirm that for a number of repos (in the paste above) | 19:38 |
clarkb | corvus: I ^C'd the prune run while it was still pruning manifests so I don't think we should have any problems at this point, but we need to figure out how we can safely prune things, which isn't immediately obvious to me because there isn't true pagination as far as I can tell and we delete any blob we don't explicitly list as something to keep | 19:40 |
timburke | clarkb, as far as pagination, look into using `marker` in your query -- you can take the last name returned, pass that as the marker, and only get objects (or containers, if listing the account) that come after it | 19:42 |
clarkb | timburke: my concern with that (which may be a non issue but docs are unclear) is that we use prefix and delimiter | 19:43 |
clarkb | timburke: if you look in https://paste.opendev.org/show/baPZzWHsFxvDeo6CVOuf/ each one of those prefixes has at least 10k entres under it. Would setting marker be relative to the beginning of that prefix or to the content after the prefix? | 19:44 |
clarkb | I guess we can work with both we just need to know which is appropriate | 19:44 |
clarkb | hrm also this is inherently racy because the data is hashed. So a new upload at an earlier marker position would not go into the keep list and would get its blobs deleted | 19:50 |
clarkb | and actually that is the case regardless of whether we paginate or not | 19:50 |
clarkb | oh but that is why we have a timelimit | 19:50 |
clarkb | so I think we're generally ok for ^ but we should increase the timelimit for these long prune runs as right now I think it is an hour. But since we could take many hours that is probably not conservative enough | 19:52 |
clarkb | paginating and increasing that to say 24 hours is probably enough to be reliable | 19:52 |
clarkb | https://review.opendev.org/c/zuul/zuul-registry/+/935403 and https://review.opendev.org/c/zuul/zuul-registry/+/935404 have been pushed to mitigate these issues | 20:11 |
clarkb | I'm not sure I have the pagination working properly (I went with full path as the marker value since the marker docs don't say prefix/delimiter affect marker behavior I'm assuming they don't and we must use the full path) | 20:12 |
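(A hypothetical sketch of the marker loop timburke describes above, not the actual 935403/935404 patches; `list_page` stands in for whatever issues a single listing request.)

```python
def list_all(list_page, prefix=None, delimiter=None):
    """Yield every entry under a prefix using marker-based pagination.

    list_page is an assumed callable that performs one swift listing
    request with the given query parameters and returns a list of full
    object names.
    """
    marker = None
    while True:
        page = list_page(prefix=prefix, delimiter=delimiter, marker=marker)
        if not page:
            # An empty page means we are done (hence one extra listing
            # per prefix at the end).
            return
        yield from page
        # Swift resumes a listing *after* the marker, so pass the last full
        # object name from this page to fetch the next page.
        marker = page[-1]
```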
clarkb | corvus: I did confirm that the run I ^C'd did not get to the paths like _local/repos/zuul/zuul-executor/manifests/ so in theory we can do another dry run and count those results and they should be > 10k | 20:12 |
clarkb | corvus: `grep 'registry.storage: \(Prune\|Keep\) _local/repos/.*/.*/manifests/' zuul-registry-prune-dry.log | cut -d' ' -f6 | sed -e 's/\(.*\/manifests\/\).*/\1/' | uniq -c | sort -u` is my super ugly data processing script | 20:19 |
clarkb | zuul development has also noticed an uptick in docker hub rate limit errors. I half suspect that these may originate in our proxy caches | 20:24 |
clarkb | because otherwise we should have a fairly distributed set of IP addresses in the CI system. I hate this suggestion because it's silly, but we might be better off not using the docker registry caches anymore | 20:24 |
fungi | trade random disconnects for random quota rejections | 20:25 |
clarkb | I don't think these errors are a side effect of pruning in the intermediate registry | 20:30 |
clarkb | the reason for that is alpine doesn't show up in my pruning logs as having any manifests in the registry at all | 20:31 |
clarkb | so we would never have retrieved the sad image from there in the first place | 20:31 |
clarkb | but if anyone has evidence to the contrary please point it out | 20:32 |
clarkb | instead I half wonder if cardoe's OSH interest has resulted in a lot more docker requests (or something along those lines; basically our total request count is up due to normal workload) | 20:33 |
fungi | mnaser reported similar problems, yeah? i don't recall whether his jobs were using the proxy or no | 20:37 |
Clark[m] | Ya it was zuul-jobs tests pulling from docker | 20:39 |
Clark[m] | I suppose these jobs might not use the cache and that is the problem | 20:39 |
cardoe | there I go breaking the internet again. | 20:44 |
fungi | nah, the internet takes fridays off | 20:45 |
cardoe | So it's certainly possible cause OSH gets very few patches and I've been slinging a handful against it and the PTL has been running a DNM one over and over to play around with some of the packaging changes I've been talking to you all about. | 20:45 |
clarkb | cardoe: I don't have any actual evidence that is the cause, it just wouldn't surprise me if there is a usage change that has us tripping over it | 20:48 |
clarkb | it looks like the job that failed is using our caching proxy | 20:48 |
clarkb | so ya filtering all requests through a single ip if we haven't already cached the data | 20:48 |
clarkb | we might actually be better off not using the caches which I just hate | 20:49 |
clarkb | iirc their rate limiting is manifest based, so the more images you have and depend on the worse it gets. We (opendev and zuul) have really shallow depth on that, so I guess theoretically something that isn't shallow could quickly exceed limits | 20:51 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: [gentoo] Fix+Update CI for 23.0 profile https://review.opendev.org/c/openstack/diskimage-builder/+/923985 | 21:42 |
corvus | clarkb: ack re pagination | 23:02 |
Clark[m] | corvus: on my bike ride I realized that I could test pagination similarly to my manual ACL setting via a repl | 23:11 |
corvus | Clark: yeah i was just about to start manually testing that | 23:12 |
corvus | Clark: i left a -1 on the first change; i don't think it's necessary | 23:12 |
Clark[m] | corvus: oh ya I guess that is the case since we take the timestamp before doing any work | 23:17 |
corvus | second change lgtm; i think we should manually test that and make sure we get what we expect; then merge it and resume. | 23:18 |
Clark[m] | ++ I need about 15 minutes post bike ride but then I can help | 23:19 |
corvus | cool, i'll get started with a test script. | 23:19 |
Clark[m] | And a dry run would probably be good to confirm we get more than 10k zuul/zuul-executor manifests back before repruning for real? | 23:20 |
corvus | yep | 23:20 |
corvus | i'm actually toying with the idea of just pulling from the intermediate registry to do a dry run... but we'd have to figure out docker run. :) | 23:20 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: [gentoo] Fix+Update CI for 23.0 profile https://review.opendev.org/c/openstack/diskimage-builder/+/923985 | 23:21 |
corvus | Clark: i rebased it, fixed one issue with it, and ran it in dry-run on the server in a second screen window | 23:31 |
corvus | output is in /var/registry/conf/zuul-registry-dry-2.log | 23:31 |
corvus | i see a bit over 16k zuul-executor manifests | 23:32 |
corvus | and it did 3 object listings for zuul-executor. | 23:32 |
corvus | so that all seems okay to me. | 23:32 |
corvus | (n+1 listings since we do a final one that should get back the empty list) | 23:32 |
corvus | Clark: i left a +2 on the updated change; i'll wait for you to get back and check things out, then i think we should aprv it. | 23:35 |
clarkb | three would be expected: 10k, then 6k, then none | 23:40 |
clarkb | corvus: I guess my only other concern is that maybe this could regress normal operations; do we want to do a release before we merge this, or consider the dry run sufficient? | 23:40 |
clarkb | honestly the dry run is probably sufficient since it will exercise this more than anything else? | 23:41 |
clarkb | corvus: oh yup your change is a good one :) | 23:42 |
clarkb | let me run my ugly script against the log file and then if that looks good I guess we proceed | 23:42 |
clarkb | https://paste.opendev.org/show/bnC7PZJmrfJRxGigpiuq/ there is enough variability there I'm inclined to believe this is working | 23:44 |
clarkb | corvus: I +2'd but didn't approve in case you want to consider making a release of registry (just slightly worried we could break it and create a chicken and egg situation) | 23:44 |
clarkb | though we tagged a release before we started making any of these prune changes, which is probably fine to fall back to | 23:45 |
corvus | yeah, i think merge without release | 23:48 |
corvus | i approved | 23:48 |
clarkb | ack | 23:48 |
corvus | clarkb: okay for me to stop the dry-run now? | 23:48 |
clarkb | corvus: ya I don't think we'll learn anything more from the blob listings | 23:48 |
corvus | k. i will do that and clean up. | 23:48 |
clarkb | the manifest listings show this appears to work which is probably sufficient | 23:48 |
corvus | clarkb: i moved the log file to ~root | 23:49 |
clarkb | the other thing I realized on my bike ride is for fungi's deletion needs we should just list then pass that straight to delete, and delete 10k at a time | 23:50 |
clarkb | we can't do that here as easily (though possibly could) | 23:50 |
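(One way to read "pass the listing straight to delete, 10k at a time" is swift's bulk-delete middleware; the sketch below is hypothetical and assumes the cluster enables that middleware, which typically caps requests at around 10k objects, and that the openstacksdk proxy's raw post() passthrough is available.)

```python
# Hypothetical bulk-delete sketch: one POST per listing page of object names.
from urllib.parse import quote


def delete_page(conn, container, names):
    # The bulk middleware takes a text/plain body of /container/object paths,
    # one per line, URL-encoded.
    body = "\n".join(quote(f"/{container}/{name}") for name in names)
    resp = conn.object_store.post(
        "/?bulk-delete",
        data=body,
        headers={"Content-Type": "text/plain", "Accept": "application/json"},
    )
    resp.raise_for_status()
```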