Friday, 2024-11-15

corvussomething about how swiftclient uploaded them did that00:00
corvusbut things i uploaded through sdk do get deleted with that method00:00
corvusso, tldr, with that patch we will have all the functionality we need while working around the two known bugs in the service00:00
clarkbgot it00:01
opendevreviewMerged opendev/system-config master: Add basic docs on updating the OpenAFS ppa  https://review.opendev.org/c/opendev/system-config/+/93502606:13
fricklerthis sounds interesting, might be worth discussing whether we can/should work on adding opendev to that list https://docs.pypi.org/trusted-publishers/internals/09:51
fungifrickler: it would have to be our keycloak oidc that gets added, i think we need to work on finishing the migration off launchpad if we're going to consider using it to authenticate package publication and supply attestations13:12
fungialso one of the criteria they listed in conversations about adding other code forges is whether they employ on-call staff. they've essentially ruled out community-run providers (ironically, since i don't even think pypi itself lives up to the expectations they're setting for the organizations running trusted publishes)13:15
fungis/publishes/publishers/13:15
fricklerhmm, bummer13:53
fungiat the very end of that document they hint at it: "Reliability & notability: The effort necessary to integrate with a new Trusted Publisher is not exceptional, but not trivial either. In the interest of making the best use of PyPI's finite resources, we only plan to support platforms that have a reasonable level of usage among PyPI users for publishing. Additionally, we have high13:56
fungistandards for overall reliability and security in the operation of a supported Identity Provider: in practice, this means that a home-grown or personal use IdP will not be eligible."13:56
fungiwe're publishing maybe 1k different packages to pypi, tops. they've been focusing on integrating providers multiple orders of magnitude larger than that13:58
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Switch to using openstacksdk for image uploads  https://review.opendev.org/c/opendev/zuul-jobs/+/93521814:51
corvusfungi frickler actually the upcoming zuul oidc may be a better choice for pypi integration.  however, their requirement regarding a resurrection attack would need a solution (as there is no "owner account id" in zuul).  in principle i don't think there's a vulnerability here because there are no usernames to reassign, but figuring out how to meet or waive that requirement is work.15:02
fungiah, yeah i hadn't thought about zuul-as-an-oidc there15:02
fungii imagined it calling out to keycloak to do the tokens15:03
corvusyeah, trusted publishing of artifacts is really the use case it's intended for; it's a good match.15:04
fungiright, i even read through that spec, it should have dawned on me15:04
corvusbut pypi is worried about github changing user names, but in opendev, the repo is the repo; we don't reassign repos to different "users".15:05
corvus#status log disabled slow query log on zuul-db0115:08
opendevstatuscorvus: finished logging15:08
corvusfungi: one thing in our favor in convincing them to add us: we publish some packages with rather high download numbers.15:10
fungithis is true, a few hundred are in the "critical" whatever category they initially used to decide who would end up with mandatory 2fa enforcement15:11
corvusyep.  so maybe some time next year when they're through the initial busy phase and we've got the oidc thing in place... might be worth looking into it then.  :)15:12
fungicorvus: looks like that last fix to 935218 probably isn't the last15:16
corvusyeah, i was skeptical of that, but it was easy, and the only one i could see without starting to take stuff apart again.  :/15:17
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Switch to using openstacksdk for image uploads  https://review.opendev.org/c/opendev/zuul-jobs/+/93521815:32
corvusthat should be it.15:32
corvusclarkb: i've started pruning the registry in a root screen session on the server15:46
Clark[m]Thanks!15:47
corvus#status log started manual prune of intermediate registry15:47
opendevstatuscorvus: finished logging15:47
corvusit ran for a bit, then stopped with an error; it got a 404 when trying to delete something.  the url it was deleting also showed up in the dry run, so it's been there for a bit (ie, it's not racing a new upload or anything).  since it has shown up in two object listings, this feels like a situation where the object backend is out of sync with the listing.15:57
corvusi think we should just restart the process and see if it fails at the same place, or continues.  based on what i was seeing with rax flex yesterday, issuing a delete and getting a 404 may actually be effective in putting the object listing back in sync with the backend.15:57
Clark[m]WFM I think we already established that we expect the process to be resumable 15:58
corvusack, restarting with a new log file name (-2)15:58
corvusit resumed without error and proceeded past that point; looks like the problem object did not show up in the listing after the 404-delete15:59
corvusftr, it was upload 0895b58737d54a44a63252b3c8333651 -- if you want to see that in the logs later16:00
Clark[m]Should the prune proceed if it gets a 404 from delete calls?16:03
Clark[m]That seems like it should generally be safe since 404 means there is nothing there?16:03
corvusyeah i think so.  it stopped again on a similar error; it will probably save us time to actually make that change :)16:14
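A minimal sketch of the change being discussed (treat a 404 on swift object delete as "already gone" and keep pruning). The exception and client names here are hypothetical stand-ins, not zuul-registry's actual API; the log message mirrors the one that later shows up in the prune log:

```python
class NotFoundError(Exception):
    """Stand-in for the SDK's 404/NotFound exception (hypothetical name)."""


def delete_ignoring_404(client, path, log):
    """Delete an object, logging a 404 instead of aborting the prune.

    A 404 here means the object listing was ahead of the object backend,
    so there is nothing left to remove and pruning can safely continue.
    """
    try:
        client.delete_object(path)
    except NotFoundError:
        log.append("NotFound error when deleting %s" % path)
```

With a client whose delete always raises, the prune records the miss and carries on rather than stopping.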
clarkbok finally sitting down at the computer are you working on that or should I?16:28
corvusalmost done16:28
clarkback16:28
corvusremote:   https://review.opendev.org/c/zuul/zuul-registry/+/935370 Ignore 404 on swift object delete [NEW]        16:29
corvus(sorry for the delay, i was also stuffing my face with breakfast)16:29
clarkbmy delays are also morning related. I now have tea16:29
corvusthe prune stopped again; i think we should just wait for that to merge before proceeding16:30
clarkbwfm16:30
clarkbcorvus: will that retry function retry the 404 several times too? so we know that it's really not found?16:31
corvusclarkb: no, it stops on 40416:31
clarkboh no notfound raises immediately16:31
fungilgtm16:31
corvusbut if we ever get a 404 that's a lie (i don't think we have, but if we did) then presumably it would show up in a future object listing and a future prune would find and delete it.16:32
clarkbfungi raced me to the +2. I've approved it now16:32
clarkbcorvus: and we'll log it so it will be somewhat traceable16:32
corvusya16:33
clarkbregistry should get merged and promoted in the next ~10 minutes. We can then wait for hourly jobs to update the running service and its image or we can do that manually and speed things up by ~15 minutes16:41
clarkbpromote succeeded16:53
corvusi will pull and restart now16:54
corvusrestarted; log #416:55
clarkbI see it going in screen too16:55
clarkbcorvus: I think since it's an exec we're still running the old code?16:56
corvusoh huh i thought we went with run oops16:56
clarkband that the hourly autoupdate may interrupt us? Not a big deal16:56
clarkbno I tried run and it failed due to using host networking or something16:57
corvusbad memory on my part16:57
clarkbwe can see what causes it to die first a 404 or the hourly runs :)16:57
corvusso yeah, it'll be interrupted one way or another in the next few mins.  then we'll run the new code.  :)16:57
corvusexactly16:57
corvus404 wins16:58
corvusdown/up and restarted on log #516:59
corvus2024-11-15 17:00:14,730 DEBUG registry.swift: NotFound error when deleting _local/uploads/5e7f892ab4824e9b832b3f2139d215de/metadata17:00
corvusthat's our new log message; it did not stop for that, so that's good.17:01
clarkb++17:01
clarkbfwiw the hourly jobs for zuul-registry have completed and they didn't restart containers (as expected) and pruning continues17:20
corvusit finished the uploads and is pruning manifests now.17:21
corvusbased on line numbers from the dry run, we're 9.8% through.17:38
corvuswe are doing 8 delete calls/second17:40
clarkband we've been running for a grand total of about an hour?17:41
clarkbso this won't complete until EOD ish? not bad considering how long it's been since we pruned17:41
corvusyeah, i think i estimated 51 hours if each delete took 1 second... so i think it's something in the ball park of 8 hours total with those figures.17:42
corvus(about 1.5 hours for everything except the deletes, add about 6 hours of deletes)17:43
corvusthat's napkin math, but hopefully the right oom.  :)17:43
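The napkin math above can be checked directly; the delete count is implied by the earlier 51-hour / 1-second-per-delete estimate:

```python
# 51 hours at 1 delete/second implies this many deletes:
total_deletes = 51 * 3600              # 183,600 deletes
# At the observed rate of 8 deletes/second:
delete_seconds = total_deletes / 8     # 22,950 s
delete_hours = delete_seconds / 3600   # ~6.4 hours of deletes
# Plus ~1.5 hours for everything except the deletes:
total_hours = 1.5 + delete_hours       # ~7.9 hours, near the ~8 hour estimate
```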
clarkbfungi: re ^ I suspect we can do something similar for a general swift container deletion tool. It won't run quickly but we can basically say "delete everything older than X days and ignore 404s" and then let that run in the background somewhere for various containers17:53
fungifor the containers i'm interested in cleaning up, their contents are at least 3-4 years old (some haven't had new objects created in a decade)17:54
fungibut also some of them have millions of objects17:55
fungiso no clue how long that would take17:55
clarkbfungi: 8 deletes per second :)17:55
* clarkb does some math17:55
clarkbday and a half per million?17:55
fungithis is deletions in old rackspace classic's swift, yeah?17:56
clarkbyes17:56
fungicool, maybe doable then17:57
fungiyeah, logs_periodic has an object count of 2359532 for example18:12
fungilogs_periodic-stable is 119813918:12
clarkbin theory we could do them in parallel but still thats a few days18:13
clarkbbut that should be doable18:13
fungithere's also 256 different logs_NN containers that need deleting18:13
clarkbare those not the ones we use today?18:13
clarkbI can't remember the sharding scheme we use today18:14
funginope, zuul_opendev_logs_NNN now (1024 of them)18:14
fungisorry, 2048 of those18:14
clarkboh right we moved to a prefix and suffix system in zuul-jobs because s3 doesn't do name collisions18:14
fungibad math day18:14
clarkband ceph mimic s318:15
fungithe logs_NN containers seem to average in the thousands of objects each, so will go faster individually but there are still 256 of them18:16
clarkband in all cases for your containers we don't care about object timestamps/ttls we want to delete them all and then delete the container which should simplify things a little18:17
fungiexactly18:17
fungii manually removed the cdn mapping for all of them already months ago too18:17
fungiso they're no longer published/reachable18:17
clarkbin that case ya I think we should just do a script that basically does a for each object in container foo delete (ignore 404s) and then attempt a container delete at the end and if that fails we come back around and figure that out18:18
clarkbI suspect that 80% of the containers will clean up just fine that way then we'll find the corner cases that don't18:18
clarkbI guess we may also need some logic in there to get more than 10k objects back in each listing or whatever the limit is (would have to look at openstacksdk to see if it doesn't do that already)18:19
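The general shape of the deletion loop being described might look like the following. This is not the actual script from 935395; the client is an abstract stand-in so the SDK's listing/delete/404 details stay out of the way (swift caps each listing, 10k objects by default, hence the re-list loop):

```python
def empty_and_delete_container(client, container, log):
    """Delete every object in a container, ignoring 404s, then attempt
    to delete the (hopefully now empty) container itself."""
    while True:
        # Each listing is capped, so keep listing until nothing comes back.
        names = client.list_objects(container)
        if not names:
            break
        for name in names:
            try:
                client.delete_object(container, name)
            except KeyError:  # stand-in for the SDK's 404 exception
                log.append("NotFound deleting %s/%s" % (container, name))
    client.delete_container(container)
```

If the final container delete fails (e.g. objects appeared mid-run), the script can simply come back around, per the plan above.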
opendevreviewClark Boylan proposed opendev/system-config master: Add a swift container deletion script  https://review.opendev.org/c/opendev/system-config/+/93539518:47
clarkbfungi: ^ a very rough first draft. I think I need to understand better what the subdir vs not relationship is and how many objects we'll get back per iteration and probably adding some argparse would be good too18:48
clarkbI also haven't run that so it probably fails on a syntax error somewhere18:48
clarkbbut I think that is the general shape of a (slow) container deletion tool18:48
clarkbafter yesterday's adventures with swift and client tooling I'm not sure I'm in the mood to actually run and debug a tool like that but I wanted to braindump so I can pick it up with a better mood next week18:49
fungithanks!18:52
fungiand yeah, that's lower priority than whatever else we've got going on right now18:52
clarkboh ya we shouldn't need prefix and delimiter because we aren't actually interested in treating this like an fs tree (the original data may have been stored that way but our intent is to simply delete all the objects so we can just list them)19:01
fungiright, absolutely no need to filter the object lists in any way19:02
clarkbreading the docs I think we get back a max of 10k objects per listing each time. I'm not seeing where zuul-registry handles that (it's possible that maybe zuul-registry doesn't? I wonder if that would lead to extra deletes since we wouldn't see all manifests to keep potentially? cc corvus though I think we mitigate that in zuul-registry by listing each namespace (which is a container19:07
clarkbimage) separately)19:07
clarkbcorvus: ^ fyi I think that it may be possible that if a container image namespace has more than 10k entries we may overdelete since some manifests may not end up in the manifests to keep list19:07
clarkbthere are ~5k top level uploads and ~640 top level repos19:14
clarkband there are at least 15k blobs? I think because these are over the limit it's hard to say there aren't more19:16
clarkbhowever since manifests are generally below the limit I think we should be fine but this is still potentially buggy?19:16
fungiopenstack container show should give you an object_count19:20
fungiif you need to know an exact number19:20
funginot sure what the sdk method is for that, but presumably something similar19:20
clarkbfungi: in this case it is more complicated than that because zuul-registry is treating it more like a filesystem so it's never listing the whole thing at any one time19:21
fungiahh19:21
clarkbI think zuul/zuul is a problem. it has 10000 manifests according to my log grepping19:21
clarkbwhich is curiously right at the limit. I'm working on collating the info from the logs to see if there are any others19:21
clarkbat this point I think I can ^C and it will stop before deleting any blobs. It will have only deleted manifests19:22
clarkbwhich I think is still in the "it hasn't broken anything yet" portion of the prune19:22
clarkbits when we start deleting blobs that we'll have problems19:22
clarkbok there are several19:25
clarkbbut ironically I think the only one that matters is _local/repos/openstackswift/saio/manifests/19:26
clarkbthe reason for that is all of the others are old zuul on dockerhub images and those are all old enough they should all get deleted anyway so it doesn't matter if we accidentally overdelete their blobs19:27
clarkbhowever I think I will ^C at this point19:27
clarkbif I don't hear any objections to doing that in the next couple minutes I will ^C19:30
clarkbhttps://paste.opendev.org/show/baPZzWHsFxvDeo6CVOuf/ these have 10k object listings +1 entry in the logs for the top level manifest entry19:30
clarkbthe problem is we make a list of objects in all of the manifests to keep and we don't delete those. If we have an incomplete manifests to keep list then we will/can accidentally delete too many objects and then docker will fail19:31
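The over-deletion risk described above is easy to state as code: the prune keeps only blobs referenced by manifests on the keep list, so a keep list built from a truncated listing deletes still-referenced blobs. This is illustrative only, not zuul-registry's actual implementation:

```python
def blobs_to_delete(all_blobs, kept_manifest_blobs):
    """Prune deletes every blob not referenced by a kept manifest."""
    keep = set(kept_manifest_blobs)
    return [b for b in all_blobs if b not in keep]


# Complete keep list: nothing live is deleted.
live = ["blob1", "blob2", "blob3"]
assert blobs_to_delete(live, ["blob1", "blob2", "blob3"]) == []

# A listing truncated at the cap (e.g. 10k manifests) drops blob3's
# manifest from the keep list, so a still-referenced blob gets deleted.
assert blobs_to_delete(live, ["blob1", "blob2"]) == ["blob3"]
```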
clarkbok I ^C'd it19:38
clarkbcorvus: ^ fyi tl;dr is that looking into mass container deletion I realized that 10k objects was the limit for listing too and that we may have gotten truncated manifest listings and looking at the dry run data I was able to confirm that for a number of repos (in the paste above)19:38
clarkbcorvus: I ^C'd the prune run while it was still pruning manifests so I don't think we should have any problems at this point but we need to figure out how we can safely prune things which isn't immediately obvious to me because there isn't true pagination as far as I can tell and we delete any blob we don't explicitly list as something to keep19:40
timburkeclarkb, as far as pagination, look into using `marker` in your query -- you can take the last name returned, pass that as the marker, and only get objects (or containers, if listing the account) that come after it19:42
clarkbtimburke: my concern with that (which may be a non issue but docs are unclear) is that we use prefix and delimiter19:43
clarkbtimburke: if you look in https://paste.opendev.org/show/baPZzWHsFxvDeo6CVOuf/ each one of those prefixes has at least 10k entres under it. Would setting marker be relative to the beginning of that prefix or to the content after the prefix?19:44
clarkbI guess we can work with both we just need to know which is appropriate19:44
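A sketch of the marker-based pagination timburke describes, under the assumption discussed just above (and adopted in 935403/935404) that the marker is the full object name rather than something relative to the prefix. `list_page` is a hypothetical stand-in for one capped listing call:

```python
def list_all(list_page, prefix=""):
    """Collect a complete listing by feeding the last full object name
    returned back in as the marker for the next page; an empty page
    means the listing is exhausted."""
    names = []
    marker = ""
    while True:
        page = list_page(prefix=prefix, marker=marker)
        if not page:
            return names
        names.extend(page)
        marker = page[-1]  # full object name, per the assumption above
```

With a 10k page cap this makes n+1 listing calls for n full pages, matching the 3 listings later observed for the ~16k zuul-executor manifests.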
clarkbhrm also this is inherently racy because the data is hashed. So a new upload at an earlier marker position would not go into the keep list and would get its blobs deleted19:50
clarkband actually that is the case regardless of whether we paginate or not19:50
clarkboh but that is why we have a timelimit19:50
clarkbso I think we're generally ok for ^ but we should increase the timelimit for these long prune runs as right now I think it is an hour. But since we could take many hours that is probably not conservative enough19:52
clarkbpaginating and increasing that to say 24 hours is probably enough to be reliable19:52
clarkbhttps://review.opendev.org/c/zuul/zuul-registry/+/935403 and https://review.opendev.org/c/zuul/zuul-registry/+/935404 have been pushed to mitigate these issues20:11
clarkbI'm not sure I have the pagination working properly (I went with full path as the marker value since the marker docs don't say prefix/delimiter affect marker behavior I'm assuming they don't and we must use the full path)20:12
clarkbcorvus: I did confirm that the run I ^C'd did not get to the paths like _local/repos/zuul/zuul-executor/manifests/ so in theory we can do another dry run and count those results and they should be > 10k20:12
clarkbcorvus: `grep 'registry.storage: \(Prune\|Keep\) _local/repos/.*/.*/manifests/' zuul-registry-prune-dry.log  | cut -d' ' -f6 | sed -e 's/\(.*\/manifests\/\).*/\1/' | uniq -c | sort -u` is my super ugly data processing script20:19
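The shell pipeline above can be expressed less cryptically in Python. This equivalent assumes the same log line shape ("registry.storage: Prune|Keep <path>") and counts manifest log entries per repo prefix:

```python
import re
from collections import Counter

MANIFEST = re.compile(
    r"registry\.storage: (?:Prune|Keep) "
    r"(_local/repos/[^/]+/[^/]+/manifests/)")


def count_manifests(log_lines):
    """Count Prune/Keep manifest entries per repo manifests/ prefix."""
    counts = Counter()
    for line in log_lines:
        m = MANIFEST.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

A repo sitting exactly at 10000 (plus the one top-level entry the grep also matches) is the tell for a truncated listing.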
clarkbzuul development has also noticed an uptick in docker hub rate limit errors. I half suspect that these may originate in our proxy caches20:24
clarkbbecause otherwise we should have a fairly distributed set of IP addresses in the CI system. I hate this suggestion because it's silly but we might be better off not using the docker registry caches anymore20:24
fungitrade random disconnects for random quota rejections20:25
clarkbI don't think these errors are a side effect of pruning in the intermediate registry20:30
clarkbthe reason for that is alpine doesn't show up in my pruning logs as having any manifests in the registry at all20:31
clarkbso we would never have retrieved the sad image from there in the first place20:31
clarkbbut if anyone has evidence to the contrary please point it out20:32
clarkbinstead I half wonder if cardoe's OSH interest has resulted in a lot more docker requests (or something along those lines basically our total request count is up due to normal workload)20:33
fungimnaser reported similar problems, yeah? i don't recall whether his jobs were using the proxy or no20:37
Clark[m]Ya it was zuul-jobs tests pulling from docker20:39
Clark[m]I suppose these jobs might not use the cache and that is the problem 20:39
cardoethere I go breaking the internet again.20:44
funginah, the internet takes fridays off20:45
cardoeSo it's certainly possible cause OSH gets very few patches and I've been slinging a handful against it and the PTL has been running a DNM one over and over to play around with some of the packaging changes I've been talking to you all about.20:45
clarkbcardoe: I don't have any actual evidence that is the cause, it just wouldn't surprise me if there is a usage change that has us tripping over it20:48
clarkbit looks like the job that failed is using our caching proxy20:48
clarkbso ya filtering all requests through a single ip if we haven't already cached the data20:48
clarkbwe might actually be better off not using the caches which I just hate20:49
clarkbiirc their rate limiting is manifest based so the more images you have and depend on the worse it gets. We (opendev and zuul) have really shallow depth on that so I guess theoretically something that isn't shallow could quickly exceed limits20:51
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: [gentoo] Fix+Update CI for 23.0 profile  https://review.opendev.org/c/openstack/diskimage-builder/+/92398521:42
corvusclarkb: ack re pagination23:02
Clark[m]corvus: on my bike ride I realized that I could test pagination similarly to my manual ACL setting via a repl23:11
corvusClark: yeah i was just about to start manually testing that23:12
corvusClark: i left a -1 on the first change; i don't think it's necessary23:12
Clark[m]corvus: oh ya I guess that is the case since we take the timestamp before doing any work23:17
corvussecond change lgtm; i think we should manually test that and make sure we get what we expect; then merge it and resume.23:18
Clark[m]++ I need about 15 minutes post bike ride but then I can help23:19
corvuscool, i'll get started with a test script.23:19
Clark[m]And a dry run would probably be good to confirm we get more than 10k zuul/zuul-executor manifests back before repruning for real?23:20
corvusyep23:20
corvusi'm actually toying with the idea of just pulling from the intermediate registry to do a dry run... but we'd have to figure out docker run.  :)23:20
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: [gentoo] Fix+Update CI for 23.0 profile  https://review.opendev.org/c/openstack/diskimage-builder/+/92398523:21
corvusClark: i rebased it, fixed one issue with it, and ran it in dry-run on the server in a second screen window23:31
corvusoutput is in /var/registry/conf/zuul-registry-dry-2.log23:31
corvusi see a bit over 16k zuul-executor manifests23:32
corvusand it did 3 object listings for zuul-executor.23:32
corvusso that all seems okay to me.23:32
corvus(n+1 listings since we do a final one that should get back the empty list)23:32
corvusClark: i left a +2 on the updated change; i'll wait for you to get back and check things out, then i think we should aprv it.23:35
clarkbthree would be expected: 10k, then 6k, then none23:40
clarkbcorvus: I guess my only other concern is that maybe this could regress normal operations, do we want to do a release before we merge this? or consider the dry run sufficient?23:40
clarkbhonestly the dry run is probably sufficient since it will exercise this more than anything else?23:41
clarkbcorvus: oh yup your change is a good one :)23:42
clarkblet me run my ugly script against the log file and then if that looks good I guess we proceed23:42
clarkbhttps://paste.opendev.org/show/bnC7PZJmrfJRxGigpiuq/ there is enough variability there I'm inclined to believe this is working23:44
clarkbcorvus: I +2'd but didn't approve in case you want to consider making a release of registry (just slightly worried we could break it and create a chicken and egg situation)23:44
clarkbthough we tagged a release before we started making any of these prune changes which is probably fine to fall back to23:45
corvusyeah, i think merge without release23:48
corvusi approved23:48
clarkback23:48
corvusclarkb: okay for me to stop the dry-run now?23:48
clarkbcorvus: ya I don't think we'll learn anything more from the blob listings23:48
corvusk.  i will do that and clean up.23:48
clarkbthe manifest listings show this appears to work which is probably sufficient23:48
corvusclarkb: i moved the log file to ~root23:49
clarkbthe other thing I realized on my bike ride is for fungi's deletion needs we should just list then pass that straight to delete and delete 10k at a time23:50
clarkbwe can't do that here as easily (though possibly could)23:50
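The batching idea above (list a page, hand the whole page to a bulk delete) only needs a chunker around the listing; swift's bulk-delete middleware accepts a batch of object names per request, with a deployment-specific cap (commonly 10k):

```python
def batches(names, size=10000):
    """Yield successive fixed-size batches of object names, sized to fit
    one bulk-delete request each."""
    for i in range(0, len(names), size):
        yield names[i:i + size]
```

Each batch would then go to a single bulk-delete call instead of one request per object, which is the main win over the one-at-a-time prune loop.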

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!