*** dmellado0755391 is now known as dmellado075539 | 08:34 | |
*** elodilles is now known as elodilles_pto | 08:37 | |
*** ralonsoh_ is now known as ralonsoh | 10:04 | |
jamespage | mordred: https://bugs.launchpad.net/ubuntu/+source/python-keystoneauth1/+bug/2088451 | 11:39 |
fungi | i need to run an errand, but should be back in an hour-ish | 14:58 |
clarkb | we are up to 5c8e sha prefixes now | 15:55 |
clarkb | we're almost 40% done? we started Friday evening and it is Monday morning now (relative to my timezone) /me does some math | 15:57 |
clarkb | I think that is on track for 6.25ish days total with 2.5 days elapsed (iirc corvus' original estimate over the weekend was 6 days so that seems to hold) | 15:58 |
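(A quick reconstruction of the arithmetic behind the estimate above, assuming the "5c8e" prefix indicates progress through an evenly distributed hex keyspace; the numbers are only as rough as that assumption.)

```python
# rough reconstruction of the estimate above; assumes object names are
# evenly distributed hex shas, so the current prefix indicates progress
progress = 0x5c8e / 0x10000          # ~0.36, i.e. "almost 40%"
elapsed_days = 2.5                   # Friday evening -> Monday morning
print(f"{progress:.0%} done, ~{elapsed_days / progress:.1f} days projected")
# rounding progress up to 40% gives the 6.25-day figure quoted above,
# consistent with the original ~6 day estimate
```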
corvus | ++ | 16:01 |
corvus | we should get a new deletes/second value; i've seen it hang for a bit occasionally, so the overall rate over a long period might be slower than 8/sec | 16:02 |
clarkb | the smaller backup server is complaining about disk utilization again. A good reminder to review the changes to do backup pruning :) | 16:02 |
corvus | (for use in estimating the log deletes) | 16:02 |
clarkb | corvus: do we want to push up a change to run this prune ~weekly? | 16:10 |
opendevreview | James E. Blair proposed opendev/system-config master: Revert "Temporarily disable intermediate registry prune" https://review.opendev.org/c/opendev/system-config/+/935542 | 16:35 |
corvus | clarkb: ^ :) apparently we used to have it set up to run daily... | 16:36 |
clarkb | corvus: do we need to supply the configuration path in that command? | 16:39 |
corvus | clarkb: i don't think so; it's an exec in the container; should read it from the default location | 16:40 |
mordred | jamespage: thanks! | 17:11 |
jamespage | mordred: np - that will take a bit of time to work through the SRU process but I'll keep nudging it along | 17:11 |
mordred | jamespage: I think corvus has a local workaround for the original issue - it just jumped out at me so I thought we should get that sorted for anyone else. thanks for jumping on that | 17:12 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a swift container deletion script https://review.opendev.org/c/opendev/system-config/+/935395 | 18:05 |
clarkb | fungi: ^ that is still completely untested but I've adjusted it to try using the bulk delete approach | 18:06 |
corvus | 40m build; 33m upload to swift (from ovh-bhs1); 9m download from swift; 5m upload to local glance | 18:22 |
corvus | that's image timings after the recent launcher efficiency updates | 18:22 |
fungi | fast! | 18:31 |
fungi | (in relative terms) | 18:31 |
fungi | under 1.5 hours from build start to image available in glance | 18:32 |
corvus | the 9m download is a huge improvement (previous was about 1 hour). the 33m upload may be a bit of a regression; i think we got about 18m before with swiftclient. | 18:32 |
corvus | i'm not going to do anything about that 33m right now though; i think it warrants more data collection. | 18:33 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Increase swift upload threads to 10 https://review.opendev.org/c/opendev/zuul-jobs/+/935553 | 18:40 |
corvus | on second thought... i thought of one easy/obvious difference that we can/should correct. ^ | 18:40 |
corvus | there's a good chance that knocks 13-15 minutes off that time. | 18:40 |
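(For context on why more upload threads help: the image goes to swift as many segments, so fanning the segment uploads out over more concurrent connections cuts wall-clock time. The sketch below is illustrative only; `upload_segment()` and the segment count are hypothetical stand-ins, not the actual zuul-jobs role code.)

```python
# illustrative only: fan segment uploads out over a thread pool; the real
# role's implementation differs, and upload_segment() is hypothetical
from concurrent.futures import ThreadPoolExecutor

def upload_segment(segment_name):
    # hypothetical: PUT one image segment to the target swift container
    ...

segments = [f"image.raw/{i:08d}" for i in range(40)]   # assumed segment count
with ThreadPoolExecutor(max_workers=10) as pool:        # the value bumped in 935553
    list(pool.map(upload_segment, segments))
```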
corvus | clarkb: fungi i think our deletion rate is something like 16k objects per hour. | 18:58 |
corvus | (or about 4.5/sec) | 18:58 |
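(The rate arithmetic, plus a purely hypothetical worked example of what it implies for estimating the log deletes mentioned earlier.)

```python
rate_per_hour = 16_000
print(rate_per_hour / 3600)               # ~4.4 objects/second
# hypothetical scale for illustration only: 10 million log objects at this
# rate would take roughly 26 days, which is why bulk deletion of the big
# containers (935395 below) looks attractive
print(10_000_000 / rate_per_hour / 24)    # ~26 days
```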
fungi | that's fairly swift (pun intended) | 18:58 |
clarkb | with that rate, using bulk deletion for the big deletes instead of the pruning is probably a good idea | 19:01 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/935395 should roughly have that shape now but I need to test it. | 19:01 |
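(A minimal sketch of swift's bulk-delete middleware for anyone following along; this is not the script in 935395, and it assumes a storage URL and token already obtained from keystone. Servers cap the number of paths per request, commonly 10,000.)

```python
import requests
from urllib.parse import quote

def bulk_delete(storage_url, token, container, object_names):
    # POST newline-separated, URL-encoded /container/object paths to the
    # account URL with ?bulk-delete; the middleware deletes them server-side
    body = "\n".join(quote(f"/{container}/{name}") for name in object_names)
    resp = requests.post(
        f"{storage_url}?bulk-delete",
        headers={
            "X-Auth-Token": token,
            "Content-Type": "text/plain",
            "Accept": "application/json",
        },
        data=body.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()   # reports "Number Deleted", "Errors", etc.
```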
opendevreview | Jeremy Stanley proposed opendev/infra-openafs-deb noble: Build 1.8.13 for Noble https://review.opendev.org/c/opendev/infra-openafs-deb/+/935556 | 19:12 |
clarkb | quick reminder to edit the meeting agenda or let me know what changes you'd like to see made to it. I have an afternoon errand but will get that sent out before the end of my day today | 19:26 |
JayF | I am experiencing a crazy behavior in log digging for an Ironic CI failure; I've validated it's not just a local caching issue as adamcarthur5 has reproduced it: | 20:58 |
JayF | 1) https://zuul.opendev.org/t/openstack/build/e107efab24014945a3802738abd47057 2) click "logs" 3) job_output.txt 4) notice that the job_output.txt you just displayed IS NOT the one snippeted from "task summary" at step 1 | 20:59 |
JayF | our hunch is that somehow e107efa (the shortened sha in the raw URL) may somehow be hitting a conflict | 21:00 |
frickler | JayF: well the output in 1) is extracted from the .json, not from the .txt, but I don't see a major conflict here | 21:05 |
clarkb | the path is namespaced with the change, pipeline, and job in addition to the sha | 21:05 |
clarkb | it would be really surprising if we have a collision there | 21:06 |
JayF | frickler: how in the world is it that I only learned that today! | 21:06 |
JayF | Now I wonder why that stdout wasn't in job_output.txt | 21:07 |
clarkb | it is theoretically possible but like heat death of the universe and all that | 21:07 |
JayF | Then let me reframe my question: | 21:07 |
frickler | there might be some racing due to multiple devstack threads outputting things in parallel, but I do see the failure lines in the .txt, too | 21:07 |
JayF | why does " No tenant network is available for allocation." which appears in the json, not appear in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e10/933104/4/check/ironic-tempest-bios-redfish-pxe/e107efa/job-output.txt when it does appear in devstacklog.txt | 21:08 |
JayF | I guess my expectations were wrong but they were also broken here so I wanna learn about where the lines /actually/ are | 21:08 |
frickler | JayF: as I said above, I think this may happen due to devstack doing async tasks. if this is reproducible for you, you could try running with DEVSTACK_PARALLEL=False | 21:14 |
clarkb | ya you can see the timestamps gap | 21:15 |
JayF | aha, okay | 21:15 |
clarkb | 2024-11-18 15:24:31.204691 is the last real log entry in job-output.txt, then 2024-11-18 15:32:26.418727 records the error | 21:15 |
JayF | of course /o\ | 21:16 |
clarkb | everything you're seeing in the json/console panel that is extra was written in that time window I'm sure | 21:16 |
clarkb | as for why things stop writing for ~8 minutes, maybe parallel execution, if the accounting for the different threads goes haywire? Could also be a buffering issue where stdout/stderr buffering prevents that from being written (this is where job-output.txt collects the data) | 21:17 |
clarkb | but then when it exits all of the info ends up in the ansible return? I don't know | 21:17 |
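(A toy illustration of the buffering hypothesis above, not the actual devstack/ansible plumbing: when stdout is a pipe rather than a tty, output is block-buffered and may not reach the reader until the writer exits or flushes.)

```python
import subprocess, sys, time

# the child prints immediately, but because its stdout is a pipe the line
# sits in a block buffer until the process exits ~5s later
child = subprocess.Popen(
    [sys.executable, "-c",
     "import time; print('started'); time.sleep(5); print('finished')"],
    stdout=subprocess.PIPE, text=True)

start = time.monotonic()
first = child.stdout.readline()           # blocks for the full ~5 seconds
print(f"got {first!r} after {time.monotonic() - start:.1f}s")
child.wait()
```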
mnaser | did we not run some sort of squid proxy for dockerhub btw? | 21:21 |
mnaser | i am getting failures left and right with toomanyrequests | 21:21 |
fungi | mnaser: we do, not squid but apache mod_proxy | 21:21 |
clarkb | and I suspect that cache is part of the problem | 21:22 |
fungi | though if the proxy itself is hitting rate limits in dockerhub (our suspicion) then that just makes matters worse | 21:22 |
clarkb | right | 21:22 |
mnaser | ah right, so it's a guaranteed failure in that case(tm) | 21:22 |
clarkb | I think what has happened is there is a usage pattern change in our docker hub access somewhere resulting in far more requests than a week ago. | 21:22 |
fungi | also we only earmark about... 40gb? for total apache mod_proxy caches on each mirror server, so pulling lots of large images could quickly make the actual caching part relatively useless | 21:23 |
clarkb | And then with those funneled through the proxy cache we're getting the rate limit problems | 21:23 |
clarkb | fungi: 100gb | 21:23 |
mnaser | i think dockerhub made a change tbh, because i've noticed something different here too, a lot more failures, and we don't use any mirror so our vms pull directly.. | 21:23 |
mnaser | and i think we'd have to be _very_ lucky to have all of our ips get hit :) | 21:23 |
clarkb | mnaser: ah if you're seeing it outside of the CI environment then ya I wouldn't be surprised if upstream changed something | 21:23 |
mnaser | well i mean both OUR side and also opendev seeing the same issue | 21:23 |
mnaser | so makes me feel they might have done something in dockerhub world | 21:24 |
fungi | like trying to get rid of their users, for example | 21:24 |
clarkb | one thing I noticed is that a request for library/alpine got hit with a rate limit. For some reason I thought all of those open source things were not rate limited regardless of who requested them | 21:24 |
clarkb | that could've been a bad interpretation of how their rate limits work on my part, but maybe it was a good interpretation and they've now changed things to apply rate limits to those resources too | 21:24 |
fungi | i think they have two kinds of rate limits: per-project pulls (regardless of who is pulling the images) and per-client (ip or login) pulls | 21:25 |
fungi | my understanding was that the first kind of rate limit (per-project) was waived for qualifying open source projects who applied and kept renewing it ~yearly | 21:26 |
fungi | but that you could still hit a per-client rate limit when pulling any image regardless of what the per-project rate limit situation was for that org | 21:26 |
clarkb | anyway, workarounds that have been suggested so far include: not using the proxy cache so that we're using more IPs and distributing the requests more; using the buildset registry more aggressively so that we download from docker once then use the buildset registry for downstream jobs; moving to quay.io, which doesn't have limits; and so on | 21:28 |
fungi | also authenticated downloads may get different quotas/rate limits | 21:29 |
corvus | we have a workable path now for speculative testing of quay images, so we could resume the work to move system-config images to quay. that would have the direct effect of reducing a few pulls (probably not that much) from our ci fleet; but also the indirect effect from moving the python-builder and friends images. i don't believe that is at the top of anyone's list right now. | 21:30 |
clarkb | I think authenticated requests only get better limits if you pay for them | 21:30 |
opendevreview | Merged opendev/zuul-jobs master: Increase swift upload threads to 10 https://review.opendev.org/c/opendev/zuul-jobs/+/935553 | 21:34 |
corvus | i wonder if there's a quay mirror for alpine, apache, etc.... | 21:35 |
clarkb | "Unauthenticated users will be limited to 10 Docker Hub pulls/hr/IP address." from https://www.docker.com/blog/november-2024-updated-plans-announcement/ | 21:36 |
clarkb | so ya I think this is upstream | 21:36 |
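(Purely hypothetical numbers to show why funneling a whole region's jobs through one caching-proxy IP collides with a 10 pulls/hr/IP limit.)

```python
jobs_per_hour = 100        # assumed regional job throughput, not a measured value
pulls_per_job = 3          # assumed uncached image pulls per job
print(jobs_per_hour * pulls_per_job)   # 300 pulls/hr through one IP vs a budget of 10
```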
clarkb | gg docker | 21:36 |
clarkb | kozhukalov: ^ fyi | 21:37 |
JayF | makes me wonder if the proxy could inject auth into unauthenticated requests ... maybe possible but wonder if it's desirable | 21:37 |
clarkb | I don't know that apache could do it but a smarter proxy could. If I'm honest I don't really want to spend time on that | 21:38 |
clarkb | I think not using docker should be where effort is spent | 21:38 |
clarkb | (which, as someone who has already moved everything off of docker once, I realize is not the most straightforward of migrations) | 21:38 |
kozhukalov | Thanks for letting me know | 21:39 |
JayF | TheJulia: ^ something worth keeping in mind as we work on oci:// and bootc -- tl;dr docker is rate limiting unauthenticated requests, so we should avoid that as a dependency as we setup CI | 21:40 |
clarkb | to be clear I believe they always have had rate limits | 21:41 |
clarkb | but originally rate limits were placed on blob objects which they claim was too abstract for users to understand but made a lot of sense for those of us caching things | 21:41 |
JayF | I didn't think it was this bad before; but either way the end suggestion still applies: don't depend on dockerhub in ci | 21:42 |
clarkb | then they switched to rate limiting manifests instead of blobs | 21:42 |
TheJulia | Quay does as well, fwiw. | 21:42 |
corvus | s/in ci// | 21:42 |
clarkb | and now they've severely reduced the number of requests from 500/6 hours to 10 per hour | 21:42 |
clarkb | TheJulia: quay says they won't fwiw | 21:42 |
clarkb | oh interesting quay does have a document saying they have limits now (previously they said they didn't rate limit) | 21:43 |
TheJulia | Eh, I’ve seen suggestions otherwise, but it’s all good. Authentication is table stakes regardless | 21:43 |
clarkb | the problem with authentication is now everyone needs an account that they have to manage in the jobs. Which is doable but really annoying | 21:43 |
clarkb | https://access.redhat.com/solutions/6218921 is what quay says now. A few requests per second | 21:44 |
clarkb | manifest objects like pypi indexes have very short ttls I think because :latest may have moved | 21:45 |
clarkb | the actual data is in the blobs so if you care about data transfer costs rate limiting blobs makes sense and is easily mitigated by caching blobs since a lot of blobs are shared among manifests especially if you reuse layers | 21:46 |
clarkb | does dockerhub have bot credentials or similar like quay? That's the other downside with credentials: you may not want to use the same account for fetching images in normal jobs as you would for publishing images in privileged jobs due to risk of exposure | 21:47 |
clarkb | I think the last time I looked at this there were improvements around that but I don't remember the specifics | 21:47 |
clarkb | this becomes important if you apply to their open source program and get a single account under that. However, I've also heard of pains with their open source program needing to be renewed annually and sometimes that doesn't happen and suddenly everything stops working (kinda like dropping rate limits to 10 per hour I guess) | 21:48 |
TheJulia | I’m fairly sure with quay it is possible to have bot accounts with restricted access. I know some folks who have done it, which makes me less concerned overall. Ultimately the question of how the job(s) are designed and such | 21:53 |
clarkb | yes quay does so | 21:54 |
clarkb | I don't know if docker does which is likely to be necessary for any "just authenticate" solution | 21:54 |
clarkb | or I suppose you could stop running any jobs with docker hub pulls pre review | 21:54 |
corvus | clarkb: zuul uses a handful of images not hosted on dockerhub; it seems feasible to host a mirror of those images on quay and update the necessary tags once per day. does it sound reasonable to write a zuul job to do that? (with the understanding that the zuul job would be subject to the limits under discussion and may fail unless we also add authn to it) | 21:57 |
corvus | i'm thinking that if our external dependency images are updated roughly once a day that's probably okay... | 21:57 |
TheJulia | I suspect, if the answer is use quay for sanity, and we need accounts, then we can likely sort through that since it would be upstream project usage | 22:00 |
TheJulia | I don't know who I'd need to talk to but I'm sure with a little effort we can sort it out | 22:00 |
Clark[m] | corvus: you mean not hosted in quay.io? And ya opendev also has to figure this out if it becomes a problem (looks like it will) | 22:03 |
Clark[m] | Worth noting docker also refuses to support mirrors for anything but docker hub in their clients | 22:05 |
corvus | Clark: yes. ie, i'm proposing that, for example, the zuul quickstart update to use "quay.io/zuul-ci/httpd:latest" instead of "docker.io/library/httpd:latest" | 22:06 |
Clark[m] | Got it | 22:06 |
corvus | (and we set up a job to copy the latter to the former) | 22:07 |
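(A minimal sketch of the kind of daily copy such a job could perform, assuming skopeo is available; the actual mirror-container-images role proposed in 935574 may work differently, and the mirror list here is just the example image from the discussion.)

```python
import subprocess

# source -> destination mirror mapping; the httpd example comes from the
# discussion above, anything else would be added per project
MIRRORS = {
    "docker.io/library/httpd:latest": "quay.io/zuul-ci/httpd:latest",
}

for src, dst in MIRRORS.items():
    # --all copies every architecture in a multi-arch manifest list
    subprocess.run(
        ["skopeo", "copy", "--all", f"docker://{src}", f"docker://{dst}"],
        check=True,
    )
```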
*** dmellado0755393 is now known as dmellado075539 | 22:09 | |
opendevreview | Goutham Pacha Ravi proposed opendev/irc-meetings master: Update chair for manila meetings https://review.opendev.org/c/opendev/irc-meetings/+/935572 | 22:38 |
opendevreview | Goutham Pacha Ravi proposed opendev/irc-meetings master: Add eventlet-removal biweekly meeting https://review.opendev.org/c/opendev/irc-meetings/+/935573 | 22:49 |
opendevreview | Goutham Pacha Ravi proposed opendev/irc-meetings master: Add eventlet-removal biweekly meeting https://review.opendev.org/c/opendev/irc-meetings/+/935573 | 22:49 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 22:56 |
corvus | Clark: ^ that's a bunch of brain vomit that may do what we discussed above (but will probably just fail due to all the moving parts). | 22:57 |
corvus | (i mean, the actual pull/push is easy; the hard part is the simulated registry for tests) | 22:58 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:01 |
clarkb | ack. I'm going to make sure we have a meeting agenda out on time today (and will include the apparent new limits on there too) | 23:27 |
clarkb | going to remove the rtd and developer docs job failures as I think both got resolved | 23:29 |
opendevreview | Clark Boylan proposed openstack/project-config master: Disable raxflex cloud https://review.opendev.org/c/openstack/project-config/+/935575 | 23:46 |
clarkb | you can see the mirror isn't serving anything at https://mirror.sjc3.raxflex.opendev.org/ and it appears the cinder volume backing those caches is not happy. I'm going to self approve 935575 now | 23:47 |
fungi | yikes, ng | 23:57 |
clarkb | syslog reports a very similar cannot read sector 0 error as last time | 23:57 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:58 |
clarkb | I'm currently focused on meeting agenda stuff but will put the mirror status on there too | 23:58 |
clarkb | disabling the region should effectively work around the problem for now | 23:58 |