Monday, 2024-11-18

08:34 *** dmellado0755391 is now known as dmellado075539
08:37 *** elodilles is now known as elodilles_pto
10:04 *** ralonsoh_ is now known as ralonsoh
11:39 <jamespage> mordred: https://bugs.launchpad.net/ubuntu/+source/python-keystoneauth1/+bug/2088451
14:58 <fungi> i need to run an errand, but should be back in an hour-ish
15:55 <clarkb> we are up to 5c8e sha prefixes now
15:57 <clarkb> we're almost 40% done? we started Friday evening and it is Monday morning now (relative to my timezone) /me does some math
15:58 <clarkb> I think that is on track for 6.25ish days total and 2.5 days have completed (iirc corvus' original estimate over the weekend was 6 days so that seems to hold)
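A rough sketch of the arithmetic behind that estimate, assuming object names are hashed so the highest sha prefix reached is proportional to overall progress; the prefix 5c8e and the 2.5 elapsed days come from the log, everything else is illustrative (clarkb's 6.25-day figure corresponds to rounding the fraction up to 40%):

    # estimate total deletion time from progress through the hex prefix space
    prefix = "5c8e"                # highest sha prefix reached so far
    elapsed_days = 2.5             # Friday evening through Monday morning

    fraction_done = int(prefix, 16) / 16 ** len(prefix)  # ~0.36 if prefixes are uniform
    total_days = elapsed_days / fraction_done            # ~6.9 days at this pace
    print(f"{fraction_done:.0%} done, ~{total_days:.1f} days total, "
          f"~{total_days - elapsed_days:.1f} days to go")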
16:01 <corvus> ++
16:02 <corvus> we should get a new deletes/second value; i've seen it hang for a bit occasionally, so the overall rate over a long period might be slower than 8/sec
16:02 <clarkb> the smaller backup server is complaining about disk utilization again. A good reminder to review the changes to do backup pruning :)
16:02 <corvus> (for use in estimating the log deletes)
16:10 <clarkb> corvus: do we want to push up a change to run this prune ~weekly?
16:35 <opendevreview> James E. Blair proposed opendev/system-config master: Revert "Temporarily disable intermediate registry prune"  https://review.opendev.org/c/opendev/system-config/+/935542
16:36 <corvus> clarkb: ^ :) apparently we used to have it set up to run daily...
16:39 <clarkb> corvus: do we need to supply the configuration path in that command?
16:40 <corvus> clarkb: i don't think so; it's an exec in the container; it should read it from the default location
17:11 <mordred> jamespage: thanks!
17:11 <jamespage> mordred: np - that will take a bit of time to work through the SRU process but I'll keep nudging it along
17:12 <mordred> jamespage: I think corvus has a local workaround for the original issue - it just jumped out at me so I thought we should get that sorted for anyone else. thanks for jumping on that
18:05 <opendevreview> Clark Boylan proposed opendev/system-config master: Add a swift container deletion script  https://review.opendev.org/c/opendev/system-config/+/935395
18:06 <clarkb> fungi: ^ that is still completely untested but I've adjusted it to use the bulk delete approach
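For reference, a minimal sketch of what Swift's bulk delete looks like at the API level, assuming the cluster has the bulk middleware enabled: POST a newline-separated list of URL-encoded container/object names to the account URL with ?bulk-delete. This is purely illustrative and not the contents of the 935395 script; the storage URL, token, container, and object names are placeholders:

    from urllib.parse import quote

    import requests

    # placeholders -- real values come from whatever auth step the script uses
    storage_url = "https://swift.example.com/v1/AUTH_abc123"
    token = "gAAAA..."
    container = "example-container"
    objects = ["path/to/object-1", "path/to/object-2"]

    # one request deletes the whole batch (the middleware caps batch size,
    # commonly 10k entries, so large containers need to be chunked)
    body = "\n".join(f"{container}/{quote(name)}" for name in objects)
    resp = requests.post(
        f"{storage_url}?bulk-delete",
        headers={"X-Auth-Token": token,
                 "Content-Type": "text/plain",
                 "Accept": "application/json"},
        data=body.encode("utf-8"),
    )
    resp.raise_for_status()
    print(resp.json())  # reports Number Deleted / Number Not Found / Errors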
18:22 <corvus> 40m build; 33m upload to swift (from ovh-bhs1); 9m download from swift; 5m upload to local glance
18:22 <corvus> that's image timings after the recent launcher efficiency updates
18:31 <fungi> fast!
18:31 <fungi> (in relative terms)
18:32 <fungi> under 1.5 hours from build start to image available in glance
18:32 <corvus> the 9m download is a huge improvement (previous was about 1 hour).  the 33m upload may be a bit of a regression; i think we got about 18m before with swiftclient.
18:33 <corvus> i'm not going to do anything about that 33m right now though; i think it warrants more data collection.
18:40 <opendevreview> James E. Blair proposed opendev/zuul-jobs master: Increase swift upload threads to 10  https://review.opendev.org/c/opendev/zuul-jobs/+/935553
18:40 <corvus> on second thought... i thought of one easy/obvious difference that we can/should correct. ^
18:40 <corvus> there's a good chance that knocks 13-15 minutes off that time.
18:58 <corvus> clarkb: fungi i think our deletion rate is something like 16k objects per hour.
18:58 <corvus> (or about 4.5/sec)
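A quick check of that conversion, plus the sort of estimate corvus mentions wanting it for; the remaining-object count here is a made-up placeholder, not a measured number:

    objects_per_hour = 16_000
    per_second = objects_per_hour / 3600          # ~4.4 objects/sec
    remaining = 5_000_000                         # placeholder for the log objects left
    days = remaining / objects_per_hour / 24
    print(f"{per_second:.1f}/sec -> ~{days:.0f} days for {remaining:,} objects")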
18:58 <fungi> that's fairly swift (pun intended)
19:01 <clarkb> with that rate, using bulk deletion for the big deletes instead of the pruning is probably a good idea
19:01 <clarkb> https://review.opendev.org/c/opendev/system-config/+/935395 should roughly have that shape now but I need to test it.
19:12 <opendevreview> Jeremy Stanley proposed opendev/infra-openafs-deb noble: Build 1.8.13 for Noble  https://review.opendev.org/c/opendev/infra-openafs-deb/+/935556
19:26 <clarkb> quick reminder to edit the meeting agenda or let me know what changes you'd like to see made to it. I have an afternoon errand but will get that sent out before the end of my day today
20:58 <JayF> I am experiencing some crazy behavior while digging through logs for an Ironic CI failure; I've validated it's not just a local caching issue as adamcarthur5 has reproduced it:
20:59 <JayF> 1) https://zuul.opendev.org/t/openstack/build/e107efab24014945a3802738abd47057 2) click "logs" 3) job-output.txt 4) notice that the job-output.txt you just displayed IS NOT the one snippeted from "task summary" at step 1
21:00 <JayF> our hunch is that e107efa (the shortened sha in the raw URL) may somehow be hitting a conflict
21:05 <frickler> JayF: well, the output in 1) is extracted from the .json, not from the .txt, but I don't see a major conflict here
21:05 <clarkb> the path is namespaced with the change, pipeline, and job in addition to the sha
21:06 <clarkb> it would be really surprising if we have a collision there
21:06 <JayF> frickler: how in the world is it only today that I learned that!
21:07 <JayF> Now I wonder why that stdout wasn't in job-output.txt
21:07 <clarkb> it is theoretically possible but like heat death of the universe and all that
21:07 <JayF> Then let me reframe my question:
21:07 <frickler> there might be some racing due to multiple devstack threads outputting things in parallel, but I do see the failure lines in the .txt, too
21:08 <JayF> why does "No tenant network is available for allocation.", which appears in the json, not appear in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e10/933104/4/check/ironic-tempest-bios-redfish-pxe/e107efa/job-output.txt when it does appear in devstacklog.txt
21:08 <JayF> I guess my expectations were wrong, but they were also broken here, so I wanna learn where the lines /actually/ are
21:14 <frickler> JayF: as I said above, I think this may happen due to devstack doing async tasks. if this is reproducible for you, you could try running with DEVSTACK_PARALLEL=False
21:15 <clarkb> ya you can see the timestamp gap
21:15 <JayF> aha, okay
21:15 <clarkb> 2024-11-18 15:24:31.204691 is the last real log entry in job-output.txt, then 2024-11-18 15:32:26.418727 records the error
21:16 <JayF> of course  /o\
21:16 <clarkb> everything extra that you're seeing in the json/console panel was written in that time window I'm sure
21:17 <clarkb> as for why things stop writing for ~8 minutes: maybe parallel execution, if accounting for the different threads goes haywire? Could also be a buffering issue where stdout/stderr buffering prevents that from being written (this is where job-output.txt collects the data)
21:17 <clarkb> but then when it exits all of the info ends up in the ansible return? I don't know
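A small illustration of that buffering hypothesis, under the assumption that the missing output came from a subprocess whose stdout was a pipe: C stdio and Python both switch from line buffering to block buffering when stdout is not a terminal, so a child's output can sit in its buffer until the process exits (or the buffer fills), even though the parent is streaming line by line. This is a generic demonstration, not a reproduction of the devstack behaviour:

    import subprocess
    import sys
    import time

    # the child prints, sleeps, prints again; with stdout as a pipe and no
    # explicit flush, both lines only reach the parent when the child exits
    child = (
        "import time\n"
        "print('early line')   # sits in the child buffer\n"
        "time.sleep(5)\n"
        "print('late line')\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", child],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:   # both lines arrive together after ~5 seconds
        print(time.strftime("%H:%M:%S"), line, end="")
    proc.wait()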
21:21 <mnaser> did we not run some sort of squid proxy for dockerhub btw?
21:21 <mnaser> i am getting failures left and right with toomanyrequests
21:21 <fungi> mnaser: we do, not squid but apache mod_proxy
21:22 <clarkb> and I suspect that cache is part of the problem
21:22 <fungi> though if the proxy itself is hitting rate limits in dockerhub (our suspicion) then that just makes matters worse
21:22 <clarkb> right
21:22 <mnaser> ah right, so it's a guaranteed failure in that case(tm)
21:22 <clarkb> I think what has happened is there is a usage pattern change in our docker hub access somewhere resulting in far more requests than a week ago.
21:23 <fungi> also we only earmark about... 40gb? for total apache mod_proxy caches on each mirror server, so pulling lots of large images could quickly make the actual caching part relatively useless
21:23 <clarkb> And then with those funneled through the proxy cache we're getting the rate limit problems
21:23 <clarkb> fungi: 100gb
21:23 <mnaser> i think dockerhub made a change tbh, because i've noticed something different here too, a lot more failures, and we don't use any mirror so our vms pull directly..
21:23 <mnaser> and i think we'd have to be _very_ lucky to have all of our ips get hit :)
21:23 <clarkb> mnaser: ah, if you're seeing it outside of the CI environment then ya I wouldn't be surprised if upstream changed something
21:23 <mnaser> well i mean both OUR side and also opendev seeing the same issue
21:24 <mnaser> so makes me feel they might have done something in dockerhub world
21:24 <fungi> like trying to get rid of their users, for example
21:24 <clarkb> one thing I noticed is that a request for library/alpine got hit with a rate limit. For some reason I thought all of those open source things were not rate limited regardless of who requested them
21:24 <clarkb> that could've been a bad interpretation of how their rate limits work on my part, but maybe it was a good interpretation and they've now changed it to apply rate limits to those resources too
21:25 <fungi> i think they have two kinds of rate limits: per-project pulls (regardless of who is pulling the images) and per-client (ip or login) pulls
21:26 <fungi> my understanding was that the first kind of rate limit (per-project) was waived for qualifying open source projects who applied and kept renewing it ~yearly
21:26 <fungi> but that you could still hit a per-client rate limit when pulling any image regardless of what the per-project rate limit situation was for that org
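For anyone wanting to see which per-client limit an IP is currently getting, Docker Hub exposes it through its documented ratelimitpreview check: fetch an anonymous pull token and read the RateLimit headers from a HEAD request on the manifest (HEAD is documented as not counting toward the limit). A minimal sketch; the header values obviously vary by client:

    import requests

    # anonymous token scoped to the special ratelimitpreview/test repository
    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io",
                "scope": "repository:ratelimitpreview/test:pull"},
    ).json()["token"]

    # HEAD the manifest and read the current limit/remaining for this client
    resp = requests.head(
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
        headers={"Authorization": f"Bearer {token}"},
    )
    print("limit:", resp.headers.get("ratelimit-limit"))
    print("remaining:", resp.headers.get("ratelimit-remaining"))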
21:28 <clarkb> anyway, workarounds that have been suggested so far include: not using the proxy cache, so that we're using more IPs and distributing the requests more; using the buildset registry more aggressively, so that we download from docker once and then use the buildset registry for downstream jobs; quay.io doesn't have limits; and so on
21:29 <fungi> also authenticated downloads may get different quotas/rate limits
21:30 <corvus> we have a workable path now for speculative testing of quay images, so we could resume the work to move system-config images to quay.  that would have the direct effect of reducing a few pulls (probably not that much) from our ci fleet; but also the indirect effect from moving the python-builder and friends images.  i don't believe that is at the top of anyone's list right now.
21:30 <clarkb> I think authenticated requests only get better limits if you pay for them
21:34 <opendevreview> Merged opendev/zuul-jobs master: Increase swift upload threads to 10  https://review.opendev.org/c/opendev/zuul-jobs/+/935553
21:35 <corvus> i wonder if there's a quay mirror for alpine, apache, etc....
21:36 <clarkb> "Unauthenticated users will be limited to 10 Docker Hub pulls/hr/IP address." from https://www.docker.com/blog/november-2024-updated-plans-announcement/
21:36 <clarkb> so ya I think this is upstream
21:36 <clarkb> gg docker
21:37 <clarkb> kozhukalov: ^ fyi
21:37 <JayF> makes me wonder if the proxy could inject auth into unauthenticated requests ... maybe possible, but I wonder if it's desirable
21:38 <clarkb> I don't know that apache could do it but a smarter proxy could. If I'm honest I don't really want to spend time on that
21:38 <clarkb> I think not using docker should be where effort is spent
21:38 <clarkb> (which, as someone who has already moved everything off of docker once, I realize is not the most straightforward of migrations)
21:39 <kozhukalov> Thanks for letting me know
21:40 <JayF> TheJulia: ^ something worth keeping in mind as we work on oci:// and bootc -- tl;dr docker is rate limiting unauthenticated requests, so we should avoid that as a dependency as we set up CI
21:41 <clarkb> to be clear, I believe they have always had rate limits
21:41 <clarkb> but originally rate limits were placed on blob objects, which they claim was too abstract for users to understand but made a lot of sense for those of us caching things
21:42 <JayF> I didn't think it was this bad before; but either way the end suggestion still applies: don't depend on dockerhub in ci
21:42 <clarkb> then they switched to rate limiting manifests instead of blobs
21:42 <TheJulia> Quay does as well, fwiw.
21:42 <corvus> s/in ci//
21:42 <clarkb> and now they've severely reduced the number of requests, from 500 per 6 hours to 10 per hour
21:42 <clarkb> TheJulia: quay says they won't fwiw
21:43 <clarkb> oh interesting, quay does have a document saying they have limits now (previously they said they didn't rate limit)
21:43 <TheJulia> Eh, I’ve seen suggestions otherwise, but it’s all good. Authentication is table stakes regardless
21:43 <clarkb> the problem with authentication is that now everyone needs an account that they have to manage in the jobs. Which is doable but really annoying
21:44 <clarkb> https://access.redhat.com/solutions/6218921 is what quay says now. A few requests per second
21:45 <clarkb> manifest objects, like pypi indexes, have very short ttls I think, because :latest may have moved
21:46 <clarkb> the actual data is in the blobs, so if you care about data transfer costs rate limiting blobs makes sense, and it is easily mitigated by caching blobs since a lot of blobs are shared among manifests, especially if you reuse layers
21:47 <clarkb> does dockerhub have bot credentials or similar like quay? That's the other downside with credentials: you may not want to use the same account for fetching images in normal jobs as you would for publishing images in privileged jobs, due to risk of exposure
21:47 <clarkb> I think the last time I looked at this there were improvements around that but I don't remember the specifics
21:48 <clarkb> this becomes important if you apply to their open source program and get a single account under that. However, I've also heard of pains with their open source program needing to be renewed annually, and sometimes that doesn't happen and suddenly everything stops working (kinda like dropping rate limits to 10 per hour I guess)
21:53 <TheJulia> I’m fairly sure with quay it is possible to have bot accounts with restricted access. I know some folks who have done it, which makes me less concerned overall. Ultimately it's a question of how the job(s) are designed and such
21:54 <clarkb> yes, quay does so
21:54 <clarkb> I don't know if docker does, which is likely to be necessary for any "just authenticate" solution
21:54 <clarkb> or I suppose you could stop running any jobs with docker hub pulls pre-review
21:57 <corvus> clarkb: zuul uses a handful of images not hosted on dockerhub; it seems feasible to host a mirror of those images on quay and update the necessary tags once per day.  does it sound reasonable to write a zuul job to do that? (with the understanding that the zuul job would be subject to the limits under discussion and may fail unless we also add authn to it)
21:57 <corvus> i'm thinking that if our external dependency images are updated roughly once a day that's probably okay...
22:00 <TheJulia> I suspect, if the answer is use quay for sanity, and we need accounts, then we can likely sort through that since it would be upstream project usage
22:00 <TheJulia> I don't know who I'd need to talk to but I'm sure with a little effort we can sort it out
22:03 <Clark[m]> corvus: you mean not hosted in quay.io? And ya opendev also has to figure this out if it becomes a problem (looks like it will)
22:05 <Clark[m]> Worth noting docker also refuses to support mirrors for anything but docker hub in their clients
22:06 <corvus> Clark: yes.  ie, i'm proposing that, for example, the zuul quickstart update to use "quay.io/zuul-ci/httpd:latest" instead of "docker.io/library/httpd:latest"
22:06 <Clark[m]> Got it
22:07 <corvus> (and we set up a job to copy the latter to the former)
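As an illustration only (not the contents of the mirror-container-images role proposed below), a daily copy like that can be as simple as shelling out to skopeo, which copies images between registries without needing a local daemon; the quay.io/zuul-ci/httpd destination is just the example from the discussion above:

    import subprocess

    # (source, destination) pairs to refresh once a day
    MIRRORS = [
        ("docker.io/library/httpd:latest", "quay.io/zuul-ci/httpd:latest"),
    ]

    for src, dst in MIRRORS:
        # --all copies every architecture in the manifest list, not just the host's
        subprocess.run(
            ["skopeo", "copy", "--all", f"docker://{src}", f"docker://{dst}"],
            check=True,
        )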
22:09 *** dmellado0755393 is now known as dmellado075539
22:38 <opendevreview> Goutham Pacha Ravi proposed opendev/irc-meetings master: Update chair for manila meetings  https://review.opendev.org/c/opendev/irc-meetings/+/935572
22:49 <opendevreview> Goutham Pacha Ravi proposed opendev/irc-meetings master: Add eventlet-removal biweekly meeting  https://review.opendev.org/c/opendev/irc-meetings/+/935573
22:49 <opendevreview> Goutham Pacha Ravi proposed opendev/irc-meetings master: Add eventlet-removal biweekly meeting  https://review.opendev.org/c/opendev/irc-meetings/+/935573
22:56 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/935574
22:57 <corvus> Clark: ^ that's a bunch of brain vomit that may do what we discussed above (but will probably just fail due to all the moving parts).
22:58 <corvus> (i mean, the actual pull/push is easy; the hard part is the simulated registry for tests)
23:01 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/935574
23:27 <clarkb> ack. I'm going to make sure we have a meeting agenda out on time today (and will include the apparent new limits on there too)
23:29 <clarkb> going to remove the rtd and developer docs job failures as I think both got resolved
23:46 <opendevreview> Clark Boylan proposed openstack/project-config master: Disable raxflex cloud  https://review.opendev.org/c/openstack/project-config/+/935575
23:47 <clarkb> you can see the mirror isn't serving anything at https://mirror.sjc3.raxflex.opendev.org/ and it appears the cinder volume backing those caches is not happy. I'm going to self approve 935575 now
23:57 <fungi> yikes, ng
23:57 <clarkb> syslog reports a very similar "cannot read sector 0" error as last time
23:58 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/935574
23:58 <clarkb> I'm currently focused on meeting agenda stuff but will put the mirror status on there too
23:58 <clarkb> disabling the region should effectively work around the problem for now
