Monday, 2024-11-18

08:34 *** dmellado0755391 is now known as dmellado075539
08:37 *** elodilles is now known as elodilles_pto
10:04 *** ralonsoh_ is now known as ralonsoh
11:39 <jamespage> mordred: https://bugs.launchpad.net/ubuntu/+source/python-keystoneauth1/+bug/2088451
14:58 <fungi> i need to run an errand, but should be back in an hour-ish
15:55 <clarkb> we are up to 5c8e sha prefixes now
15:57 <clarkb> we're almost 40% done? we started Friday evening and it is Monday morning now (relative to my timezone) /me does some math
15:58 <clarkb> I think that is on track for 6.25ish days total and 2.5 days have completed (iirc corvus' original estimate over the weekend was 6 days so that seems to hold)
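A rough sketch of the arithmetic behind that estimate, assuming object names are hashed so the highest sha prefix reached is proportional to overall progress; the prefix 5c8e and the 2.5 elapsed days come from the log, everything else is illustrative (clarkb's 6.25-day figure corresponds to rounding the fraction up to 40%):

    # estimate total deletion time from progress through the hex prefix space
    prefix = "5c8e"                # highest sha prefix reached so far
    elapsed_days = 2.5             # Friday evening through Monday morning

    fraction_done = int(prefix, 16) / 16 ** len(prefix)  # ~0.36 if prefixes are uniform
    total_days = elapsed_days / fraction_done            # ~6.9 days at this pace
    print(f"{fraction_done:.0%} done, ~{total_days:.1f} days total, "
          f"~{total_days - elapsed_days:.1f} days to go")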
16:01 <corvus> ++
16:02 <corvus> we should get a new deletes/second value; i've seen it hang for a bit occasionally, so the overall rate over a long period might be slower than 8/sec
16:02 <clarkb> the smaller backup server is complaining about disk utilization again. A good reminder to review the changes to do backup pruning :)
16:02 <corvus> (for use in estimating the log deletes)
16:10 <clarkb> corvus: do we want to push up a change to run this prune ~weekly?
16:35 <opendevreview> James E. Blair proposed opendev/system-config master: Revert "Temporarily disable intermediate registry prune"  https://review.opendev.org/c/opendev/system-config/+/935542
16:36 <corvus> clarkb: ^ :) apparently we used to have it set up to run daily...
16:39 <clarkb> corvus: do we need to supply the configuration path in that command?
16:40 <corvus> clarkb: i don't think so; it's an exec in the container; it should read it from the default location
17:11 <mordred> jamespage: thanks!
17:11 <jamespage> mordred: np - that will take a bit of time to work through the SRU process but I'll keep nudging it along
17:12 <mordred> jamespage: I think corvus has a local workaround for the original issue - it just jumped out at me so I thought we should get that sorted for anyone else. thanks for jumping on that
18:05 <opendevreview> Clark Boylan proposed opendev/system-config master: Add a swift container deletion script  https://review.opendev.org/c/opendev/system-config/+/935395
18:06 <clarkb> fungi: ^ that is still completely untested but I've adjusted it to use the bulk delete approach
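For reference, a minimal sketch of what Swift's bulk delete looks like at the API level, assuming the cluster has the bulk middleware enabled: POST a newline-separated list of URL-encoded container/object names to the account URL with ?bulk-delete. This is purely illustrative and not the contents of the 935395 script; the storage URL, token, container, and object names are placeholders:

    from urllib.parse import quote

    import requests

    # placeholders -- real values come from whatever auth step the script uses
    storage_url = "https://swift.example.com/v1/AUTH_abc123"
    token = "gAAAA..."
    container = "example-container"
    objects = ["path/to/object-1", "path/to/object-2"]

    # one request deletes the whole batch (the middleware caps batch size,
    # commonly 10k entries, so large containers need to be chunked)
    body = "\n".join(f"{container}/{quote(name)}" for name in objects)
    resp = requests.post(
        f"{storage_url}?bulk-delete",
        headers={"X-Auth-Token": token,
                 "Content-Type": "text/plain",
                 "Accept": "application/json"},
        data=body.encode("utf-8"),
    )
    resp.raise_for_status()
    print(resp.json())  # reports Number Deleted / Number Not Found / Errors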
18:22 <corvus> 40m build; 33m upload to swift (from ovh-bhs1); 9m download from swift; 5m upload to local glance
18:22 <corvus> that's image timings after the recent launcher efficiency updates
18:31 <fungi> fast!
18:31 <fungi> (in relative terms)
18:32 <fungi> under 1.5 hours from build start to image available in glance
18:32 <corvus> the 9m download is a huge improvement (previous was about 1 hour).  the 33m upload may be a bit of a regression; i think we got about 18m before with swiftclient.
18:33 <corvus> i'm not going to do anything about that 33m right now though; i think it warrants more data collection.
18:40 <opendevreview> James E. Blair proposed opendev/zuul-jobs master: Increase swift upload threads to 10  https://review.opendev.org/c/opendev/zuul-jobs/+/935553
18:40 <corvus> on second thought... i thought of one easy/obvious difference that we can/should correct. ^
18:40 <corvus> there's a good chance that knocks 13-15 minutes off that time.
18:58 <corvus> clarkb: fungi i think our deletion rate is something like 16k objects per hour.
18:58 <corvus> (or about 4.5/sec)
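A quick check of that conversion, plus the sort of estimate corvus mentions wanting it for; the remaining-object count here is a made-up placeholder, not a measured number:

    objects_per_hour = 16_000
    per_second = objects_per_hour / 3600          # ~4.4 objects/sec
    remaining = 5_000_000                         # placeholder for the log objects left
    days = remaining / objects_per_hour / 24
    print(f"{per_second:.1f}/sec -> ~{days:.0f} days for {remaining:,} objects")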
18:58 <fungi> that's fairly swift (pun intended)
19:01 <clarkb> with that rate, using bulk deletion for the big deletes instead of the pruning is probably a good idea
19:01 <clarkb> https://review.opendev.org/c/opendev/system-config/+/935395 should roughly have that shape now but I need to test it.
19:12 <opendevreview> Jeremy Stanley proposed opendev/infra-openafs-deb noble: Build 1.8.13 for Noble  https://review.opendev.org/c/opendev/infra-openafs-deb/+/935556
19:26 <clarkb> quick reminder to edit the meeting agenda or let me know what changes you'd like to see made to it. I have an afternoon errand but will get that sent out before the end of my day today
20:58 <JayF> I am experiencing some crazy behavior while digging through logs for an Ironic CI failure; I've validated it's not just a local caching issue as adamcarthur5 has reproduced it:
20:59 <JayF> 1) https://zuul.opendev.org/t/openstack/build/e107efab24014945a3802738abd47057 2) click "logs" 3) job-output.txt 4) notice that the job-output.txt you just displayed IS NOT the one snippeted from "task summary" at step 1
21:00 <JayF> our hunch is that e107efa (the shortened sha in the raw URL) may somehow be hitting a conflict
21:05 <frickler> JayF: well, the output in 1) is extracted from the .json, not from the .txt, but I don't see a major conflict here
21:05 <clarkb> the path is namespaced with the change, pipeline, and job in addition to the sha
21:06 <clarkb> it would be really surprising if we have a collision there
21:06 <JayF> frickler: how in the world is it only today that I learned that!
21:07 <JayF> Now I wonder why that stdout wasn't in job-output.txt
21:07 <clarkb> it is theoretically possible but like heat death of the universe and all that
21:07 <JayF> Then let me reframe my question:
21:07 <frickler> there might be some racing due to multiple devstack threads outputting things in parallel, but I do see the failure lines in the .txt, too
21:08 <JayF> why does "No tenant network is available for allocation.", which appears in the json, not appear in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e10/933104/4/check/ironic-tempest-bios-redfish-pxe/e107efa/job-output.txt when it does appear in devstacklog.txt
21:08 <JayF> I guess my expectations were wrong, but they were also broken here, so I wanna learn where the lines /actually/ are
21:14 <frickler> JayF: as I said above, I think this may happen due to devstack doing async tasks. if this is reproducible for you, you could try running with DEVSTACK_PARALLEL=False
21:15 <clarkb> ya you can see the timestamp gap
21:15 <JayF> aha, okay
21:15 <clarkb> 2024-11-18 15:24:31.204691 is the last real log entry in job-output.txt, then 2024-11-18 15:32:26.418727 records the error
21:16 <JayF> of course  /o\
21:16 <clarkb> everything extra that you're seeing in the json/console panel was written in that time window I'm sure
21:17 <clarkb> as for why things stop writing for ~8 minutes: maybe parallel execution, if accounting for the different threads goes haywire? Could also be a buffering issue where stdout/stderr buffering prevents that from being written (this is where job-output.txt collects the data)
21:17 <clarkb> but then when it exits all of the info ends up in the ansible return? I don't know
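A small illustration of that buffering hypothesis, under the assumption that the missing output came from a subprocess whose stdout was a pipe: C stdio and Python both switch from line buffering to block buffering when stdout is not a terminal, so a child's output can sit in its buffer until the process exits (or the buffer fills), even though the parent is streaming line by line. This is a generic demonstration, not a reproduction of the devstack behaviour:

    import subprocess
    import sys
    import time

    # the child prints, sleeps, prints again; with stdout as a pipe and no
    # explicit flush, both lines only reach the parent when the child exits
    child = (
        "import time\n"
        "print('early line')   # sits in the child buffer\n"
        "time.sleep(5)\n"
        "print('late line')\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", child],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:   # both lines arrive together after ~5 seconds
        print(time.strftime("%H:%M:%S"), line, end="")
    proc.wait()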
21:21 <mnaser> did we not run some sort of squid proxy for dockerhub btw?
21:21 <mnaser> i am getting failures left and right with toomanyrequests
21:21 <fungi> mnaser: we do, not squid but apache mod_proxy
21:22 <clarkb> and I suspect that cache is part of the problem
21:22 <fungi> though if the proxy itself is hitting rate limits in dockerhub (our suspicion) then that just makes matters worse
21:22 <clarkb> right
21:22 <mnaser> ah right, so it's a guaranteed failure in that case(tm)
21:22 <clarkb> I think what has happened is there is a usage pattern change in our docker hub access somewhere resulting in far more requests than a week ago.
21:23 <fungi> also we only earmark about... 40gb? for total apache mod_proxy caches on each mirror server, so pulling lots of large images could quickly make the actual caching part relatively useless
21:23 <clarkb> And then with those funneled through the proxy cache we're getting the rate limit problems
21:23 <clarkb> fungi: 100gb
21:23 <mnaser> i think dockerhub made a change tbh, because i've noticed something different here too, a lot more failures, and we don't use any mirror so our vms pull directly..
21:23 <mnaser> and i think we'd have to be _very_ lucky to have all of our ips get hit :)
21:23 <clarkb> mnaser: ah, if you're seeing it outside of the CI environment then ya I wouldn't be surprised if upstream changed something
21:23 <mnaser> well i mean both OUR side and also opendev seeing the same issue
21:24 <mnaser> so makes me feel they might have done something in dockerhub world
21:24 <fungi> like trying to get rid of their users, for example
21:24 <clarkb> one thing I noticed is that a request for library/alpine got hit with a rate limit. For some reason I thought all of those open source things were not rate limited regardless of who requested them
21:24 <clarkb> that could've been a bad interpretation of how their rate limits work on my part, but maybe it was a good interpretation and they've now changed it to apply rate limits to those resources too
21:25 <fungi> i think they have two kinds of rate limits: per-project pulls (regardless of who is pulling the images) and per-client (ip or login) pulls
21:26 <fungi> my understanding was that the first kind of rate limit (per-project) was waived for qualifying open source projects who applied and kept renewing it ~yearly
21:26 <fungi> but that you could still hit a per-client rate limit when pulling any image regardless of what the per-project rate limit situation was for that org
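For anyone wanting to see which per-client limit an IP is currently getting, Docker Hub exposes it through its documented ratelimitpreview check: fetch an anonymous pull token and read the RateLimit headers from a HEAD request on the manifest (HEAD is documented as not counting toward the limit). A minimal sketch; the header values obviously vary by client:

    import requests

    # anonymous token scoped to the special ratelimitpreview/test repository
    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io",
                "scope": "repository:ratelimitpreview/test:pull"},
    ).json()["token"]

    # HEAD the manifest and read the current limit/remaining for this client
    resp = requests.head(
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
        headers={"Authorization": f"Bearer {token}"},
    )
    print("limit:", resp.headers.get("ratelimit-limit"))
    print("remaining:", resp.headers.get("ratelimit-remaining"))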
21:28 <clarkb> anyway, workarounds that have been suggested so far include: not using the proxy cache, so that we're using more IPs and distributing the requests more; using the buildset registry more aggressively, so that we download from docker once and then use the buildset registry for downstream jobs; quay.io doesn't have limits; and so on
21:29 <fungi> also authenticated downloads may get different quotas/rate limits
21:30 <corvus> we have a workable path now for speculative testing of quay images, so we could resume the work to move system-config images to quay.  that would have the direct effect of reducing a few pulls (probably not that much) from our ci fleet; but also the indirect effect from moving the python-builder and friends images.  i don't believe that is at the top of anyone's list right now.
21:30 <clarkb> I think authenticated requests only get better limits if you pay for them
21:34 <opendevreview> Merged opendev/zuul-jobs master: Increase swift upload threads to 10  https://review.opendev.org/c/opendev/zuul-jobs/+/935553
21:35 <corvus> i wonder if there's a quay mirror for alpine, apache, etc....
21:36 <clarkb> "Unauthenticated users will be limited to 10 Docker Hub pulls/hr/IP address." from https://www.docker.com/blog/november-2024-updated-plans-announcement/
21:36 <clarkb> so ya I think this is upstream
21:36 <clarkb> gg docker
21:37 <clarkb> kozhukalov: ^ fyi
21:37 <JayF> makes me wonder if the proxy could inject auth into unauthenticated requests ... maybe possible, but I wonder if it's desirable
21:38 <clarkb> I don't know that apache could do it but a smarter proxy could. If I'm honest I don't really want to spend time on that
21:38 <clarkb> I think not using docker should be where effort is spent
21:38 <clarkb> (which, as someone who has already moved everything off of docker once, I realize is not the most straightforward of migrations)
21:39 <kozhukalov> Thanks for letting me know
21:40 <JayF> TheJulia: ^ something worth keeping in mind as we work on oci:// and bootc -- tl;dr docker is rate limiting unauthenticated requests, so we should avoid that as a dependency as we set up CI
21:41 <clarkb> to be clear, I believe they have always had rate limits
21:41 <clarkb> but originally rate limits were placed on blob objects, which they claim was too abstract for users to understand but made a lot of sense for those of us caching things
21:42 <JayF> I didn't think it was this bad before; but either way the end suggestion still applies: don't depend on dockerhub in ci
21:42 <clarkb> then they switched to rate limiting manifests instead of blobs
21:42 <TheJulia> Quay does as well, fwiw.
21:42 <corvus> s/in ci//
21:42 <clarkb> and now they've severely reduced the number of requests, from 500 per 6 hours to 10 per hour
21:42 <clarkb> TheJulia: quay says they won't fwiw
21:43 <clarkb> oh interesting, quay does have a document saying they have limits now (previously they said they didn't rate limit)
21:43 <TheJulia> Eh, I’ve seen suggestions otherwise, but it’s all good. Authentication is table stakes regardless
21:43 <clarkb> the problem with authentication is that now everyone needs an account that they have to manage in the jobs. Which is doable but really annoying
21:44 <clarkb> https://access.redhat.com/solutions/6218921 is what quay says now. A few requests per second
21:45 <clarkb> manifest objects, like pypi indexes, have very short ttls I think, because :latest may have moved
21:46 <clarkb> the actual data is in the blobs, so if you care about data transfer costs rate limiting blobs makes sense, and it is easily mitigated by caching blobs since a lot of blobs are shared among manifests, especially if you reuse layers
21:47 <clarkb> does dockerhub have bot credentials or similar like quay? That's the other downside with credentials: you may not want to use the same account for fetching images in normal jobs as you would for publishing images in privileged jobs, due to risk of exposure
21:47 <clarkb> I think the last time I looked at this there were improvements around that but I don't remember the specifics
21:48 <clarkb> this becomes important if you apply to their open source program and get a single account under that. However, I've also heard of pains with their open source program needing to be renewed annually, and sometimes that doesn't happen and suddenly everything stops working (kinda like dropping rate limits to 10 per hour I guess)
21:53 <TheJulia> I’m fairly sure with quay it is possible to have bot accounts with restricted access. I know some folks who have done it, which makes me less concerned overall. Ultimately it's a question of how the job(s) are designed and such
21:54 <clarkb> yes, quay does so
21:54 <clarkb> I don't know if docker does, which is likely to be necessary for any "just authenticate" solution
21:54 <clarkb> or I suppose you could stop running any jobs with docker hub pulls pre-review
21:57 <corvus> clarkb: zuul uses a handful of images not hosted on dockerhub; it seems feasible to host a mirror of those images on quay and update the necessary tags once per day.  does it sound reasonable to write a zuul job to do that? (with the understanding that the zuul job would be subject to the limits under discussion and may fail unless we also add authn to it)
21:57 <corvus> i'm thinking that if our external dependency images are updated roughly once a day that's probably okay...
22:00 <TheJulia> I suspect, if the answer is use quay for sanity, and we need accounts, then we can likely sort through that since it would be upstream project usage
22:00 <TheJulia> I don't know who I'd need to talk to but I'm sure with a little effort we can sort it out
22:03 <Clark[m]> corvus: you mean not hosted in quay.io? And ya opendev also has to figure this out if it becomes a problem (looks like it will)
22:05 <Clark[m]> Worth noting docker also refuses to support mirrors for anything but docker hub in their clients
22:06 <corvus> Clark: yes.  ie, i'm proposing that, for example, the zuul quickstart update to use "quay.io/zuul-ci/httpd:latest" instead of "docker.io/library/httpd:latest"
22:06 <Clark[m]> Got it
22:07 <corvus> (and we set up a job to copy the latter to the former)
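As an illustration only (not the contents of the mirror-container-images role proposed below), a daily copy like that can be as simple as shelling out to skopeo, which copies images between registries without needing a local daemon; the quay.io/zuul-ci/httpd destination is just the example from the discussion above:

    import subprocess

    # (source, destination) pairs to refresh once a day
    MIRRORS = [
        ("docker.io/library/httpd:latest", "quay.io/zuul-ci/httpd:latest"),
    ]

    for src, dst in MIRRORS:
        # --all copies every architecture in the manifest list, not just the host's
        subprocess.run(
            ["skopeo", "copy", "--all", f"docker://{src}", f"docker://{dst}"],
            check=True,
        )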
22:09 *** dmellado0755393 is now known as dmellado075539
22:38 <opendevreview> Goutham Pacha Ravi proposed opendev/irc-meetings master: Update chair for manila meetings  https://review.opendev.org/c/opendev/irc-meetings/+/935572
22:49 <opendevreview> Goutham Pacha Ravi proposed opendev/irc-meetings master: Add eventlet-removal biweekly meeting  https://review.opendev.org/c/opendev/irc-meetings/+/935573
22:49 <opendevreview> Goutham Pacha Ravi proposed opendev/irc-meetings master: Add eventlet-removal biweekly meeting  https://review.opendev.org/c/opendev/irc-meetings/+/935573
22:56 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/935574
22:57 <corvus> Clark: ^ that's a bunch of brain vomit that may do what we discussed above (but will probably just fail due to all the moving parts).
22:58 <corvus> (i mean, the actual pull/push is easy; the hard part is the simulated registry for tests)
23:01 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/935574
23:27 <clarkb> ack. I'm going to make sure we have a meeting agenda out on time today (and will include the apparent new limits on there too)
23:29 <clarkb> going to remove the rtd and developer docs job failures as I think both got resolved
23:46 <opendevreview> Clark Boylan proposed openstack/project-config master: Disable raxflex cloud  https://review.opendev.org/c/openstack/project-config/+/935575
23:47 <clarkb> you can see the mirror isn't serving anything at https://mirror.sjc3.raxflex.opendev.org/ and it appears the cinder volume backing those caches is not happy. I'm going to self approve 935575 now
23:57 <fungi> yikes, ng
23:57 <clarkb> syslog reports a very similar "cannot read sector 0" error as last time
23:58 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/935574
23:58 <clarkb> I'm currently focused on meeting agenda stuff but will put the mirror status on there too
23:58 <clarkb> disabling the region should effectively work around the problem for now
