corvus | something about how swiftclient uploaded them did that | 00:00 |
---|---|---|
corvus | but things i uploaded through sdk do get deleted with that method | 00:00 |
corvus | so, tldr, with that patch we will have all the functionality we need while working around the two known bugs in the service | 00:00 |
clarkb | got it | 00:01 |
opendevreview | Merged opendev/system-config master: Add basic docs on updating the OpenAFS ppa https://review.opendev.org/c/opendev/system-config/+/935026 | 06:13 |
frickler | this sounds interesting, might be worth discussing whether we can/should work on adding opendev to that list https://docs.pypi.org/trusted-publishers/internals/ | 09:51 |
fungi | frickler: it would have to be our keycloak oidc that gets added, i think we need to work on finishing the migration off launchpad if we're going to consider using it to authenticate package publication and supply attestations | 13:12 |
fungi | also one of the criteria they listed in conversations about adding other code forges is whether they employ on-call staff. they've essentially ruled out community-run providers (ironically, since i don't even think pypi itself lives up to the expectations they're setting for the organizations running trusted publishes) | 13:15 |
fungi | s/publishes/publishers/ | 13:15 |
frickler | hmm, bummer | 13:53 |
fungi | at the very end of that document they hint at it: "Reliability & notability: The effort necessary to integrate with a new Trusted Publisher is not exceptional, but not trivial either. In the interest of making the best use of PyPI's finite resources, we only plan to support platforms that have a reasonable level of usage among PyPI users for publishing. Additionally, we have high | 13:56 |
fungi | standards for overall reliability and security in the operation of a supported Identity Provider: in practice, this means that a home-grown or personal use IdP will not be eligible." | 13:56 |
fungi | we're publishing maybe 1k different packages to pypi, tops. they've been focusing on integrating providers multiple orders of magnitude larger than that | 13:58 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Switch to using openstacksdk for image uploads https://review.opendev.org/c/opendev/zuul-jobs/+/935218 | 14:51 |
corvus | fungi frickler actually the upcoming zuul oidc may be a better choice for pypi integration. however, their requirement regarding a resurrection attack would need a solution (as there is no "owner account id" in zuul). in principle i don't think there's a vulnerability here because there are no usernames to reassign, but figuring out how to meet or waive that requirement is work. | 15:02 |
fungi | ah, yeah i hadn't thought about zuul-as-an-oidc there | 15:02 |
fungi | i imagined it calling out to keycloak to do the tokens | 15:03 |
corvus | yeah, trusted publishing of artifacts is really the use case it's intended for; it's a good match. | 15:04 |
fungi | right, i even read through that spec, it should have dawned on me | 15:04 |
corvus | pypi is worried about github changing user names, but in opendev the repo is the repo; we don't reassign repos to different "users". | 15:05 |
corvus | #status log disabled slow query log on zuul-db01 | 15:08 |
opendevstatus | corvus: finished logging | 15:08 |
corvus | fungi: one thing in our favor in convincing them to add us: we publish some packages with rather high download numbers. | 15:10 |
fungi | this is true, a few hundred are in the "critical" whatever category they initially used to decide who would end up with mandatory 2fa enforcement | 15:11 |
corvus | yep. so maybe some time next year when they're through the initial busy phase and we've got the oidc thing in place... might be worth looking into it then. :) | 15:12 |
fungi | corvus: looks like that last fix to 935218 probably isn't the last | 15:16 |
corvus | yeah, i was skeptical of that, but it was easy, and the only one i could see without starting to take stuff apart again. :/ | 15:17 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Switch to using openstacksdk for image uploads https://review.opendev.org/c/opendev/zuul-jobs/+/935218 | 15:32 |
corvus | that should be it. | 15:32 |
corvus | clarkb: i've started pruning the registry in a root screen session on the server | 15:46 |
Clark[m] | Thanks! | 15:47 |
corvus | #status log started manual prune of intermediate registry | 15:47 |
opendevstatus | corvus: finished logging | 15:47 |
corvus | it ran for a bit, then stopped with an error; it got a 404 when trying to delete something. the url it was deleting also showed up in the dry run, so it's been there for a bit (ie, it's not racing a new upload or anything). since it has shown up in two object listings, this feels like a situation where the object backend is out of sync with the listing. | 15:57 |
corvus | i think we should just restart the process and see if it fails at the same place, or continues. based on what i was seeing with rax flex yesterday, issuing a delete and getting a 404 may actually be effective in putting the object listing back in sync with the backend. | 15:57 |
Clark[m] | WFM I think we already established that we expect the process to be resumable | 15:58 |
corvus | ack, restarting with a new log file name (-2) | 15:58 |
corvus | it resumed without error and proceeded past that point; looks like the problem object did not show up in the listing after the 404-delete | 15:59 |
corvus | ftr, it was upload 0895b58737d54a44a63252b3c8333651 -- if you want to see that in the logs later | 16:00 |
Clark[m] | Should the prune proceed if it gets a 404 from delete calls? | 16:03 |
Clark[m] | That seems like it should generally be safe since 404 means there is nothing there? | 16:03 |
corvus | yeah i think so. it stopped again on a similar error; it will probably save us time to actually make that change :) | 16:14 |
clarkb | ok, finally sitting down at the computer. are you working on that or should I? | 16:28 |
corvus | almost done | 16:28 |
clarkb | ack | 16:28 |
corvus | remote: https://review.opendev.org/c/zuul/zuul-registry/+/935370 Ignore 404 on swift object delete [NEW] | 16:29 |
corvus | (sorry for the delay, i was also stuffing my face with breakfast) | 16:29 |
clarkb | my delays are also morning related. I now have tea | 16:29 |
corvus | the prune stopped again; i think we should just wait for that to merge before proceeding | 16:30 |
clarkb | wfm | 16:30 |
clarkb | corvus: will that retry function retry on a 404 several times too, so we know that it's really not found? | 16:31 |
corvus | clarkb: no, it stops on 404 | 16:31 |
clarkb | oh no, NotFound raises immediately | 16:31 |
fungi | lgtm | 16:31 |
corvus | but if we ever get a 404 that's a lie (i don't think we have, but if we did) then presumably it would show up in a future object listing and a future prune would find and delete it. | 16:32 |
clarkb | fungi raced me to the +2. I've approved it now | 16:32 |
clarkb | corvus: and we'll log it so it will be somewhat traceable | 16:32 |
corvus | ya | 16:33 |
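(For illustration, a minimal sketch of the catch-and-log pattern being discussed here; it is not the actual zuul-registry change 935370. It assumes an openstacksdk connection, and the exception class name and `ignore_missing` parameter may differ between SDK versions.)

```python
# Hypothetical sketch: treat a 404 on delete as "already gone", but keep it
# visible in the logs so stale listing entries remain traceable.
import logging

from openstack import exceptions as os_exc

log = logging.getLogger("registry.swift")


def delete_object_tolerating_404(conn, container, name):
    try:
        # ignore_missing=False so the SDK surfaces the 404 instead of
        # silently swallowing it; we then log it and carry on.
        conn.object_store.delete_object(
            name, container=container, ignore_missing=False)
    except os_exc.NotFoundException:
        log.debug("NotFound error when deleting %s/%s", container, name)
```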
clarkb | registry should get merged and promoted in the next ~10 minutes. We can then wait for hourly jobs to update the running service and its image or we can do that manually and speed things up by ~15 minutes | 16:41 |
clarkb | promote succeeded | 16:53 |
corvus | i will pull and restart now | 16:54 |
corvus | restarted; log #4 | 16:55 |
clarkb | I see it going in screen too | 16:55 |
clarkb | corvus: I think since it's an exec we're still running the old code? | 16:56 |
corvus | oh huh i thought we went with run oops | 16:56 |
clarkb | and that the hourly autoupdate may interrupt us? Not a big deal | 16:56 |
clarkb | no I tried run and it failed due to using host networking or something | 16:57 |
corvus | bad memory on my part | 16:57 |
clarkb | we can see what causes it to die first: a 404 or the hourly runs :) | 16:57 |
corvus | so yeah, it'll be interrupted one way or another in the next few mins. then we'll run the new code. :) | 16:57 |
corvus | exactly | 16:57 |
corvus | 404 wins | 16:58 |
corvus | down/up and restarted on log #5 | 16:59 |
corvus | 2024-11-15 17:00:14,730 DEBUG registry.swift: NotFound error when deleting _local/uploads/5e7f892ab4824e9b832b3f2139d215de/metadata | 17:00 |
corvus | that's our new log message; it did not stop for that, so that's good. | 17:01 |
clarkb | ++ | 17:01 |
clarkb | fwiw the hourly jobs for zuul-registry have completed and they didn't restart containers (as expected) and pruning continues | 17:20 |
corvus | it finished the uploads and is pruning manifests now. | 17:21 |
corvus | based on line numbers from the dry run, we're 9.8% through. | 17:38 |
corvus | we are doing 8 delete calls/second | 17:40 |
clarkb | and we've been running for a grand total of about an hour? | 17:41 |
clarkb | so this won't complete until EOD ish? not bad considering how long it's been since we pruned | 17:41 |
corvus | yeah, i think i estimated 51 hours if each delete took 1 second... so i think it's something in the ball park of 8 hours total with those figures. | 17:42 |
corvus | (about 1.5 hours for everything except the deletes, add about 6 hours of deletes) | 17:43 |
corvus | that's napkin math, but hopefully the right oom. :) | 17:43 |
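(Working that napkin math through: ~51 hours at 1 delete/second implies roughly 51 × 3600 ≈ 184k deletes; at the observed 8 deletes/second that is about 23,000 seconds, or ~6.4 hours of deletes, plus the ~1.5 hours of listing, which lands near the 8-hour total above.)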
clarkb | fungi: re ^ I suspect we can do something similar for a general swift container deletion tool. It won't run quickly but we can basically say "delete everything older than X days and ignore 404s" and then let that run in the background somewhere for various containers | 17:53 |
fungi | for the containers i'm interested in cleaning up, their contents are at least 3-4 years old (some haven't had new objects created in a decade) | 17:54 |
fungi | but also some of them have millions of objects | 17:55 |
fungi | so no clue how long that would take | 17:55 |
clarkb | fungi: 8 deletes per second :) | 17:55 |
* clarkb does some math | | 17:55 |
clarkb | day and a half per million? | 17:55 |
fungi | this is deletions in old rackspace classic's swift, yeah? | 17:56 |
clarkb | yes | 17:56 |
fungi | cool, maybe doable then | 17:57 |
fungi | yeah, logs_periodic has an object count of 2359532 for example | 18:12 |
fungi | logs_periodic-stable is 1198139 | 18:12 |
clarkb | in theory we could do them in parallel but still that's a few days | 18:13 |
clarkb | but that should be doable | 18:13 |
fungi | there's also 256 different logs_NN containers that need deleting | 18:13 |
clarkb | are those not the ones we use today? | 18:13 |
clarkb | I can't remember the sharding scheme we use today | 18:14 |
fungi | nope, zuul_opendev_logs_NNN now (1024 of them) | 18:14 |
fungi | sorry, 2048 of those | 18:14 |
clarkb | oh right we moved to a prefix and suffix system in zuul-jobs because s3 doesn't do name collisions | 18:14 |
fungi | bad math day | 18:14 |
clarkb | and ceph mimics s3 | 18:15 |
fungi | the logs_NN containers seem to average in the thousands of objects each, so will go faster individually but there are still 256 of them | 18:16 |
clarkb | and in all cases for your containers we don't care about object timestamps/ttls; we want to delete them all and then delete the container, which should simplify things a little | 18:17 |
fungi | exactly | 18:17 |
fungi | i manually removed the cdn mapping for all of them already months ago too | 18:17 |
fungi | so they're no longer published/reachable | 18:17 |
clarkb | in that case ya I think we should just do a script that basically does, for each object in container foo, a delete (ignoring 404s), then attempts a container delete at the end, and if that fails we come back around and figure that out | 18:18 |
clarkb | I suspect that 80% of the containers will clean up just fine that way, then we'll find the corner cases that don't | 18:18 |
clarkb | I guess we may also need some logic in there to get more than 10k objects back in each listing, or whatever the limit is (would have to look at openstacksdk to see if it doesn't do that already) | 18:19 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a swift container deletion script https://review.opendev.org/c/opendev/system-config/+/935395 | 18:47 |
clarkb | fungi: ^ a very rough first draft. I think I need to understand better what the subdir vs non-subdir listing relationship is and how many objects we'll get back per iteration, and probably adding some argparse would be good too | 18:48 |
clarkb | I also haven't run that so it probably fails on a syntax error somewhere | 18:48 |
clarkb | but I think that is the general shape of a (slow) container deletion tool | 18:48 |
clarkb | after yesterday's adventures with swift and client tooling I'm not sure I'm in the mood to actually run and debug a tool like that, but I wanted to braindump so I can pick it up with a better mood next week | 18:49 |
fungi | thanks! | 18:52 |
fungi | and yeah, that's lower priority than whatever else we've got going on right now | 18:52 |
clarkb | oh ya we shouldn't need prefix and delimiter because we aren't actually interested in treating this like an fs tree (the original data may have been stored that way but our intent is to simply delete all the objects so we can just list them) | 19:01 |
fungi | right, absolutely no need to filter the object lists in any way | 19:02 |
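(A rough sketch of the container-wipe flow being discussed, purely for illustration; it is not the 935395 draft. It assumes openstacksdk, that the SDK's `objects()` generator pages through listings on its own, and that its default delete behavior tolerates 404s — both of which are the open questions above and worth verifying. Cloud and container names are placeholders.)

```python
# Hypothetical "delete every object, then the container" sketch.
import logging

import openstack
from openstack import exceptions as os_exc

log = logging.getLogger("swift-container-wipe")


def wipe_container(conn, container):
    # No prefix/delimiter: we want every object, not an fs-like tree view.
    for obj in conn.object_store.objects(container):
        # Per the discussion above, a 404 here should just be ignored: it
        # means the listing was stale and the object is already gone.
        conn.object_store.delete_object(obj.name, container=container)
    try:
        conn.object_store.delete_container(container)
    except os_exc.ConflictException:
        # Container still reports objects (listing lag); come back around.
        log.warning("%s not empty yet; retry later", container)


if __name__ == "__main__":
    conn = openstack.connect(cloud="example-cloud")
    wipe_container(conn, "logs_periodic")
```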
clarkb | reading the docs I think we get back a max of 10k objects per listing each time. I'm not seeing where zuul-registry handles that (it's possible that zuul-registry doesn't? I wonder if that would lead to extra deletes since we potentially wouldn't see all manifests to keep? cc corvus though I think we mitigate that in zuul-registry by listing each namespace (which is a container | 19:07 |
clarkb | image) separately) | 19:07 |
clarkb | corvus: ^ fyi I think that it may be possible that if a container image namespace has more than 10k entries we may overdelete since some manifests may not end up in the manifests to keep list | 19:07 |
clarkb | there are ~5k top level uploads and ~640 top level repos | 19:14 |
clarkb | and there are at least 15k blobs? I think because these are over the limit it's hard to say there aren't more | 19:16 |
clarkb | however since manifests are generally below the limit I think we should be fine, but this is still potentially buggy? | 19:16 |
fungi | openstack container show should give you an object_count | 19:20 |
fungi | if you need to know an exact number | 19:20 |
fungi | not sure what the sdk method is for that, but presumably something similar | 19:20 |
clarkb | fungi: in this case it is more complicated than that because zuul-registry is treating it more like a filesystem so its never listing the whole thing at any one time | 19:21 |
fungi | ahh | 19:21 |
clarkb | I think zuul/zuul is a problem. it has 10000 manifests according to my log grepping | 19:21 |
clarkb | which is curiously right at the limit. I'm working on collating the info from the logs to see if there are any others | 19:21 |
clarkb | at this point I think I can ^C and it will stop before deleting any blobs. It will only have deleted manifests | 19:22 |
clarkb | which I think is still in the "it hasn't broken anything yet" portion of the prune | 19:22 |
clarkb | its when we start deleting blobs that we'll have problems | 19:22 |
clarkb | ok there are several | 19:25 |
clarkb | but ironically I think the only one that matters is _local/repos/openstackswift/saio/manifests/ | 19:26 |
clarkb | the reason for that is all of the others are old zuul on dockerhub images and those are all old enough they should all get deleted anyway so it doesn't matter if we accidentally overdelete their blobs | 19:27 |
clarkb | however I think I will ^C at this point | 19:27 |
clarkb | if I don't hear any objections to doing that in the next couple minutes I will ^C | 19:30 |
clarkb | https://paste.opendev.org/show/baPZzWHsFxvDeo6CVOuf/ these have 10k object listings +1 entry in the logs for the top level manifest entry | 19:30 |
clarkb | the problem is we make a list of objects in all of the manifests to keep and we don't delete those. If we have an incomplete manifests to keep list then we will/can accidentally delete too many objects and then docker will fail | 19:31 |
clarkb | ok I ^C'd it | 19:38 |
clarkb | corvus: ^ fyi tl;dr is that looking into mass container deletion I realized that 10k objects is the limit for listings too, and that we may have truncated manifest listings; looking at the dry run data I was able to confirm that for a number of repos (in the paste above) | 19:38 |
clarkb | corvus: I ^C'd the prune run while it was still pruning manifests so I don't think we should have any problems at this point, but we need to figure out how we can safely prune things, which isn't immediately obvious to me because there isn't true pagination as far as I can tell and we delete any blob we don't explicitly list as something to keep | 19:40 |
timburke | clarkb, as far as pagination, look into using `marker` in your query -- you can take the last name returned, pass that as the marker, and only get objects (or containers, if listing the account) that come after it | 19:42 |
clarkb | timburke: my concern with that (which may be a non issue but docs are unclear) is that we use prefix and delimiter | 19:43 |
clarkb | timburke: if you look in https://paste.opendev.org/show/baPZzWHsFxvDeo6CVOuf/ each one of those prefixes has at least 10k entres under it. Would setting marker be relative to the beginning of that prefix or to the content after the prefix? | 19:44 |
clarkb | I guess we can work with both we just need to know which is appropriate | 19:44 |
clarkb | hrm also this is inherently racy because the data is hashed. So a new upload at an earlier marker position would not go into the keep list and would get its blobs deleted | 19:50 |
clarkb | and actually that is the case regardless of whether we paginate or not | 19:50 |
clarkb | oh but that is why we have a timelimit | 19:50 |
clarkb | so I think we're generally ok for ^ but we should increase the timelimit for these long prune runs as right now I think it is an hour. But since we could take many hours that is probably not conservative enough | 19:52 |
clarkb | paginating and increasing that to say 24 hours is probably enough to be reliable | 19:52 |
clarkb | https://review.opendev.org/c/zuul/zuul-registry/+/935403 and https://review.opendev.org/c/zuul/zuul-registry/+/935404 have been pushed to mitigate these issues | 20:11 |
clarkb | I'm not sure I have the pagination working properly (I went with full path as the marker value since the marker docs don't say prefix/delimiter affect marker behavior I'm assuming they don't and we must use the full path) | 20:12 |
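(A hypothetical sketch of the marker loop timburke describes above, not the actual 935403/935404 patches; `list_page` stands in for whatever issues a single listing request.)

```python
def list_all(list_page, prefix=None, delimiter=None):
    """Yield every entry under a prefix using marker-based pagination.

    list_page is an assumed callable that performs one swift listing
    request with the given query parameters and returns a list of full
    object names.
    """
    marker = None
    while True:
        page = list_page(prefix=prefix, delimiter=delimiter, marker=marker)
        if not page:
            # An empty page means we are done (hence one extra listing
            # per prefix at the end).
            return
        yield from page
        # Swift resumes a listing *after* the marker, so pass the last full
        # object name from this page to fetch the next page.
        marker = page[-1]
```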
clarkb | corvus: I did confirm that the run I ^C'd did not get to the paths like _local/repos/zuul/zuul-executor/manifests/ so in theory we can do another dry run and count those results and they should be > 10k | 20:12 |
clarkb | corvus: `grep 'registry.storage: \(Prune\|Keep\) _local/repos/.*/.*/manifests/' zuul-registry-prune-dry.log | cut -d' ' -f6 | sed -e 's/\(.*\/manifests\/\).*/\1/' | uniq -c | sort -u` is my super ugly data processing script | 20:19 |
clarkb | zuul development has also noticed an uptick in docker hub rate limit errors. I half suspect that these may originate in our proxy caches | 20:24 |
clarkb | because otherwise we should have a fairly distributed set of IP addresses in the CI system. I hate this suggestion because it's silly, but we might be better off not using the docker registry caches anymore | 20:24 |
fungi | trade random disconnects for random quota rejections | 20:25 |
clarkb | I don't think these errors are a side effect of pruning in the intermediate registry | 20:30 |
clarkb | the reason for that is alpine doesn't show up in my pruning logs as having any manifests in the registry at all | 20:31 |
clarkb | so we would never have retrieved the sad image from there in the first place | 20:31 |
clarkb | but if anyone has evidence to the contrary please point it out | 20:32 |
clarkb | instead I half wonder if cardoe's OSH interest has resulted in a lot more docker requests (or something along those lines; basically our total request count is up due to normal workload) | 20:33 |
fungi | mnaser reported similar problems, yeah? i don't recall whether his jobs were using the proxy or no | 20:37 |
Clark[m] | Ya it was zuul-jobs tests pulling from docker | 20:39 |
Clark[m] | I suppose these jobs might not use the cache and that is the problem | 20:39 |
cardoe | there I go breaking the internet again. | 20:44 |
fungi | nah, the internet takes fridays off | 20:45 |
cardoe | So it's certainly possible cause OSH gets very few patches and I've been slinging a handful against it and the PTL has been running a DNM one over and over to play around with some of the packaging changes I've been talking to you all about. | 20:45 |
clarkb | cardoe: I don't have any actual evidence that is the cause, it just wouldn't surprise me if there is a usage change that has us tripping over it | 20:48 |
clarkb | it looks like the job that failed is using our caching proxy | 20:48 |
clarkb | so ya filtering all requests through a single ip if we haven't already cached the data | 20:48 |
clarkb | we might actually be better off not using the caches which I just hate | 20:49 |
clarkb | iirc their rate limiting is manifest based, so the more images you have and depend on the worse it gets. We (opendev and zuul) have really shallow depth on that, so I guess theoretically something that isn't shallow could quickly exceed limits | 20:51 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: [gentoo] Fix+Update CI for 23.0 profile https://review.opendev.org/c/openstack/diskimage-builder/+/923985 | 21:42 |
corvus | clarkb: ack re pagination | 23:02 |
Clark[m] | corvus: on my bike ride I realized that I could test pagination similarly to my manual ACL setting via a repl | 23:11 |
corvus | Clark: yeah i was just about to start manually testing that | 23:12 |
corvus | Clark: i left a -1 on the first change; i don't think it's necessary | 23:12 |
Clark[m] | corvus: oh ya I guess that is the case since we take the timestamp before doing any work | 23:17 |
corvus | second change lgtm; i think we should manually test that and make sure we get what we expect; then merge it and resume. | 23:18 |
Clark[m] | ++ I need about 15 minutes post bike ride but then I can help | 23:19 |
corvus | cool, i'll get started with a test script. | 23:19 |
Clark[m] | And a dry run would probably be good to confirm we get more than 10k zuul/zuul-executor manifests back before repruning for real? | 23:20 |
corvus | yep | 23:20 |
corvus | i'm actually toying with the idea of just pulling from the intermediate registry to do a dry run... but we'd have to figure out docker run. :) | 23:20 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: [gentoo] Fix+Update CI for 23.0 profile https://review.opendev.org/c/openstack/diskimage-builder/+/923985 | 23:21 |
corvus | Clark: i rebased it, fixed one issue with it, and ran it in dry-run on the server in a second screen window | 23:31 |
corvus | output is in /var/registry/conf/zuul-registry-dry-2.log | 23:31 |
corvus | i see a bit over 16k zuul-executor manifests | 23:32 |
corvus | and it did 3 object listings for zuul-executor. | 23:32 |
corvus | so that all seems okay to me. | 23:32 |
corvus | (n+1 listings since we do a final one that should get back the empty list) | 23:32 |
corvus | Clark: i left a +2 on the updated change; i'll wait for you to get back and check things out, then i think we should aprv it. | 23:35 |
clarkb | three would be expected: 10k, then 6k, then none | 23:40 |
clarkb | corvus: I guess my only other concern is that maybe this could regress normal operations; do we want to do a release before we merge this, or consider the dry run sufficient? | 23:40 |
clarkb | honestly the dry run is probably sufficient since it will exercise this more than anything else? | 23:41 |
clarkb | corvus: oh yup your change is a good one :) | 23:42 |
clarkb | let me run my ugly script against the log file and then if that looks good I guess we proceed | 23:42 |
clarkb | https://paste.opendev.org/show/bnC7PZJmrfJRxGigpiuq/ there is enough variability there I'm inclined to believe this is working | 23:44 |
clarkb | corvus: I +2'd but didn't approve in case you want to consider making a release of registry (just slightly worried we could break it and create a chicken and egg situation) | 23:44 |
clarkb | though we tagged a release before we started making any of these prune changes, which is probably fine to fall back to | 23:45 |
corvus | yeah, i think merge without release | 23:48 |
corvus | i approved | 23:48 |
clarkb | ack | 23:48 |
corvus | clarkb: okay for me to stop the dry-run now? | 23:48 |
clarkb | corvus: ya I don't think we'll learn anything more from the blob listings | 23:48 |
corvus | k. i will do that and clean up. | 23:48 |
clarkb | the manifest listings show this appears to work which is probably sufficient | 23:48 |
corvus | clarkb: i moved the log file to ~root | 23:49 |
clarkb | the other thing I realized on my bike ride is for fungi's deletion needs we should just list then pass that straight to delete, and delete 10k at a time | 23:50 |
clarkb | we can't do that here as easily (though possibly could) | 23:50 |
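(One way to read "pass the listing straight to delete, 10k at a time" is swift's bulk-delete middleware; the sketch below is hypothetical and assumes the cluster enables that middleware, which typically caps requests at around 10k objects, and that the openstacksdk proxy's raw post() passthrough is available.)

```python
# Hypothetical bulk-delete sketch: one POST per listing page of object names.
from urllib.parse import quote


def delete_page(conn, container, names):
    # The bulk middleware takes a text/plain body of /container/object paths,
    # one per line, URL-encoded.
    body = "\n".join(quote(f"/{container}/{name}") for name in names)
    resp = conn.object_store.post(
        "/?bulk-delete",
        data=body,
        headers={"Content-Type": "text/plain", "Accept": "application/json"},
    )
    resp.raise_for_status()
```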