ykarel | thx corvus | 04:24 |
*** tosky_ is now known as tosky | 07:33 | |
*** jroll08 is now known as jroll0 | 07:36 | |
rpittau | hello everyone, since bullseye backports repo has been dropped one of our jobs is failing as a required pkg was coming from there https://ca4e422190e719ff4803-b95957069fa081f0196bfab2502640d8.ssl.cf5.rackcdn.com/openstack/d8af28e2e4da43ffbc315d64d8d1c90b/job-output.txt | 08:48 |
rpittau | it was suggested to us to explicitly enable that repository for that job, but I have no idea how to do that, is there an option in the zuul job config? | 08:48 |
jrosser | isnt that trying to take a debian bullseye package onto an ubuntu noble system? | 09:22 |
rpittau | jrosser: it's building a bullseye image on a noble system | 09:23 |
jrosser | ooooh ok | 09:23 |
rpittau | using DIB :) | 09:23 |
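For context, a minimal sketch of what explicitly enabling the dropped suite looks like at the apt level, assuming bullseye-backports has been moved to archive.debian.org after being dropped from the regular mirrors (how this gets wired into the DIB elements or the zuul job variables is not shown here and depends on the job in question):

    # assumption: the archived suite is served from archive.debian.org
    echo 'deb http://archive.debian.org/debian bullseye-backports main' \
      > /etc/apt/sources.list.d/bullseye-backports.list
    apt-get update
    # packages can then be pulled from the suite with: apt-get install -t bullseye-backports <pkg>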
opendevreview | Francisco Seruca Salgado proposed zuul/zuul-jobs master: Trigger Test https://review.opendev.org/c/zuul/zuul-jobs/+/955583 | 11:08 |
dtantsur | Are all openstack sites extremely slow just for me? | 13:09 |
fungi | dtantsur: if it's all sites whose names are aliases for static.opendev.org then it could be that server getting pounded. i'll take a look | 13:09 |
dtantsur | fungi: could be. Docs very slow, tox gets stuck on downloading requirements. | 13:10 |
fungi | my personal webserver is utterly offline this morning because it can't keep up with the llm crawlers | 13:10 |
fungi | dtantsur: yeah, looks similar to a situation we saw late last week where the server load is almost nonexistent but very few requests are being processed. i'll see if i can get debug status data out of apache before i restart the service | 13:12 |
dtantsur | thanks! | 13:12 |
fungi | last observed occurrence was thursday: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-07-17.log.html#opendev.2025-07-17.log.html%23t2025-07-17T15:47:18 | 13:14 |
fungi | i was able to get the status this time at least, scoreboard shows most slots are in a "reading request" state | 13:18 |
fungi | none were accepting connections at the time i got it to return data | 13:19 |
fungi | more than half of the clients have no associated vhost or request | 13:21 |
fungi | which strikes me as odd | 13:21 |
fungi | anyway, i'll restart apache for now and we can pore over the collected status data | 13:23 |
fungi | infra-root: i took two status dumps about 6 minutes apart, they're server-status-1316.html and server-status-1322.html in my homedir on static.o.o | 13:24 |
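A minimal sketch of how status dumps like these can be captured, assuming mod_status is enabled and /server-status is reachable from the server itself (the file names above are the actual dumps; the commands here are only illustrative):

    # human-readable scoreboard, saved with a timestamp in the name
    curl -s "http://localhost/server-status" > "server-status-$(date +%H%M).html"
    # machine-readable variant, handy for a quick look at worker states
    curl -s "http://localhost/server-status?auto" | grep -E 'BusyWorkers|IdleWorkers|Scoreboard'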
fungi | #status log Restarted apache on static.opendev.org in order to clear hung worker processes | 13:25 |
opendevstatus | fungi: finished logging | 13:25 |
fungi | dtantsur: hopefully things are working better again for now? | 13:26 |
fungi | infra-root: unrelated, it looks like we have the openmetal server move schedule now, and the machines aren't all moving at the same time but rather spread out over the course of roughly two weeks. given the limited quota we have there, i favor just taking it offline between 2025-07-29 and 2025-08-12 to avoid spurious test failures | 13:28 |
frickler | sounds reasonable, +1 | 13:31 |
dtantsur | fungi: seem to be better, thanks again | 13:32 |
bbezak | https://tarballs.opendev.org still seems a bit slow for me at least | 14:12 |
bbezak | (Hi) | 14:12 |
clarkb | why/where is tox pulling from static? | 14:22 |
clarkb | my early meeting is cancelled today so I'm going to pop out for a bike ride before it gets hot today | 14:23 |
dtantsur | clarkb: requirements or upper constraints, I assume? | 14:24 |
frickler | -c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/master} | 14:25 |
dtantsur | I'm not sure, just a guess | 14:25 |
dtantsur | yeah | 14:25 |
clarkb | thats a redirect to https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt not sure what utility that provides | 14:26 |
clarkb | master upper constraints is statically located at ^ so the redirect is maybe redundant? | 14:26 |
dtantsur | IIRC it allows referring to named branches before they exist | 14:26 |
clarkb | right but not for master | 14:27 |
dtantsur | the master version is probably redundant indeed | 14:27 |
frickler | more stable URL, not dependent on how gitea presents it? | 14:27 |
clarkb | which is where 90% of the requests will be coming from (the CI system) | 14:27 |
clarkb | (I don't think the CI system is to blame here, but we can sometimes be our own worst enemy so its worth checking) | 14:27 |
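A quick way to see the redirect being discussed, assuming a client that follows HTTP redirects (the requirements file name is the usual convention, not taken from this log):

    # show the Location header returned by the releases site for the master constraints URL
    curl -sI https://releases.openstack.org/constraints/upper/master | grep -i '^location:'
    # pip follows the redirect when resolving constraints
    pip install -c https://releases.openstack.org/constraints/upper/master -r requirements.txt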
elodilles | clarkb: this is where the redirection comes from https://review.opendev.org/c/openstack/releases/+/639011 | 14:30 |
elodilles | i mean, this patch introduced it | 14:31 |
elodilles | i mean #2, this patch automated the generation of redirections o:) | 14:32 |
elodilles | anyway, that is what provides the redirections | 14:32 |
slittle1_ | any known issues with zuul this morning? Seems like several jobs stuck in 'gate'. https://zuul.openstack.org/status?project=starlingx%2F* | 14:34 |
slittle1_ | ... and just now all the jobs got unstuck at the same moment, and cleared very quickly after that. | 14:40 |
slittle1_ | don't think I've seen jobs with all gate tests completed green just sit there for 10+ mins before. Seems strange. | 14:42 |
fungi | slittle1_: events and results all go into a fifo queue, and if a change merges that alters zuul's configuration it can result in a pause in queue processing while reconfiguration occurs, so you'll see build results start to pile up like that and then all clear as soon as queue processing resumes | 14:52 |
fungi | bbezak: yes, it looks like static.opendev.org's apache may be back in the same state it was before the restart a couple of hours ago, which suggests this is some sort of pileup related to an external factor (my money's on llm training crawlers, they seem to be the cause of most of our issues these days) | 14:54 |
fungi | i'll get a couple more debug status reports from apache and then restart it again | 14:55 |
fungi | #status log Restarted apache on static.opendev.org yet again in order to clear hung worker processes | 15:01 |
fungi | infra-root: i took two more status dumps about 6 minutes apart, they're server-status-1454.html and server-status-1500.html in my homedir on static.o.o | 15:01 |
opendevstatus | fungi: finished logging | 15:01 |
corvus | fungi: i'm not sure what to make of those server status dumps. i'm curious about what might be happening on those "reading" connections. what are they waiting on? is there some internal apache resource on which they're blocking? | 15:42 |
corvus | i don't see current behavior like that | 15:45 |
corvus | i fixed a test cleanup race in the 'require same provider' change if folks want to re +2 https://review.opendev.org/955545 | 16:18 |
clarkb | `ls -lh /var/log/apache2 | grep -v \\.gz | sort -k 5 -h` is somewhat informative | 16:42 |
clarkb | that shows where the bulk of the requests are going (just based on relative sizes of log files) and then naively looking into that log file it does look like bad crawlers to me | 16:43 |
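A small sketch of the kind of naive look described above, assuming the vhosts use apache's combined log format, where the user agent is the sixth double-quoted field:

    # ACCESS_LOG is a placeholder for whichever file the size check above points at
    ACCESS_LOG=/var/log/apache2/some-site_access.log
    # tally user agents to spot crawler traffic
    awk -F'"' '{print $6}' "$ACCESS_LOG" | sort | uniq -c | sort -rn | head -20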
clarkb | we have a lot of vhosts... I'm working on a change to apply the user agent filtering | 16:49 |
corvus | there are certainly bad crawlers, but are they causing a bunch of connection slots to get stuck in "R"? | 16:49 |
clarkb | corvus: probably not since we haven't seen this before with the same crawlers elsewhere | 16:50 |
fungi | i did some deeper analysis correlating the client addresses in the status reports to site access logs, see comments in matrix | 16:50 |
clarkb | but I figure its good hygiene to apply these rules if we're getting hit here too | 16:50 |
fungi | i've added a temporary iptables rule to block an address that was seen in the status reports during each incident trying to use rclone to parallel-download the contents of the tarballs site, which looks like it was eating up all available connection slots in apache | 17:03 |
fungi | infra-root: ^ | 17:03 |
mordred | yeah. maybe don't do that mkay? | 17:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Apply UserAgentFilter to every vhost on static https://review.opendev.org/c/opendev/system-config/+/955616 | 17:04 |
clarkb | as mentioned ^ this is unlikely to solve the problem (I think fungi's analysis is more on point for that), but I'm hoping it improves things overall and gives us more headroom | 17:04 |
fungi | if situations like the rclone one become an ongoing problem, we could add a conntrack rule like we have on the gerrit server to limit the number of parallel connections from a single address | 17:05 |
fungi | we'd need to come up with a number that would avoid exhausting apache workers while hopefully not adversely impacting clients behind overloaded nats | 17:06 |
clarkb | ++ | 17:06 |
fungi | so i'd rather not go there unless it gets worse | 17:07 |
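A rough sketch of the kind of per-address connection cap being described, assuming iptables with the connlimit match (the threshold of 20 is an arbitrary placeholder, not a value taken from the gerrit server):

    # reject new HTTPS connections from any single address that already holds 20 or more
    iptables -I INPUT -p tcp --syn --dport 443 \
      -m connlimit --connlimit-above 20 --connlimit-mask 32 \
      -j REJECT --reject-with tcp-reset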
clarkb | re openmetal I'm fine with simply shutting the whole thing down. It might be worth asking yuriy if they want to use our cloud as a guinea pig and have us use it during the migration to see what if anything breaks? | 17:07 |
clarkb | I think the risk to us is low with an easy out (stop using the cloud in zuul launcher temporarily) so that may be useful for them to see how things do. The one gotcha is our mirror there is a spof | 17:07 |
fungi | yes, the mirror server is my main concern, since if it goes offline while we keep booting test nodes, we'll get lots of job failures | 17:08 |
fungi | also if it's offline for a while and we don't put it in ansible's emergency disable group, it could cause problems for deploys | 17:09 |
clarkb | right. We could identify the node it lives on and only disable zuul launcher the day before and after that node migrates or something | 17:09 |
clarkb | but also doing the easy thing and disabling it entirely for the week is probably fine | 17:09 |
fungi | well, it's more like two weeks, but yeah | 17:09 |
fungi | (the earliest and latest move dates they listed span 13 days) | 17:10 |
clarkb | ah | 17:10 |
corvus | there's still a small bug related to zuul-launcher handling of image upload errors. things are much improved after the recent fix to avoid duplicate upload jobs, but it's still possible for the launchers to create an unlimited number of pending uploads. it's just happening in slow motion now. | 17:37 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/955619 Launcher: skip non-pending uploads [NEW] | 17:37 |
corvus | that should address what we're currently seeing | 17:37 |
corvus | (now i will go see what the upload errors actually are) | 17:37 |
corvus | https://zuul.opendev.org/t/opendev/image/ubuntu-bionic shows the pending-uploads problem that the change fixes | 17:37 |
corvus | 2025-07-22 16:38:18,238 ERROR zuul.Launcher: Exception: Downloaded file /var/lib/zuul/tmp/31560f56174145d7b6f2930613394729 sha256 digest 15f50732fa66da350ed405985aafbd30736ea2e2d45dc9c4cc11698296584153 does not match artifact <ImageBuildArtifact 31560f56174145d7b6f2930613394729 state: ready canonical_name: opendev.org%2Fopendev%2Fzuul-providers/ubuntu-bionic build_uuid: 7884342657bb4e30a2bd9bce5e99c5ea validated: True> digest | 17:39 |
corvus | d59b4f42620dee6677ef570b55073482cfb6b4fa2a09d6007bb81a30d370745c | 17:39 |
corvus | cool. cool. | 17:40 |
clarkb | that isn't the empty file sha256sum fwiw so it is downloading something | 17:41 |
corvus | i think the issue is that we are compressing it before calculating the checksum | 17:45 |
corvus | fixing that is a little tricky because we compress outside of the upload role, and we calculate the checksum inside the role... but maybe we should do both inside the upload role | 17:46 |
corvus | alternatively... we could consider dropping all the checksum stuff entirely. i'm getting the feeling no one loves it. :) | 17:47 |
clarkb | I think it is a good sanity check if we can make it work properly. As long as we checksum either compressed or inflated data consistently it should be fine? I guess the problem is magic in download processing might mean we only see the inflated data on the download side? | 17:48 |
corvus | the main reason i included it in the first place was so that we could use the same value when we uploaded the final image for the cloud provider to swift. of course, we're not actually doing that (yet). and also, we are recalculating it after download, so it's not saving us time. | 17:49 |
corvus | that's the main reason for operating on the uncompressed data | 17:50 |
corvus | but we could do both | 17:50 |
corvus | we could checksum the compressed artifact, and then uncompress it, and checksum the uncompressed version for the second upload to swift | 17:50 |
corvus | or we could omit the checksum for the second swift upload (which is what we do now, but only because it's just not implemented) | 17:51 |
corvus | i'm not worried about magic processing on download; that shouldn't be an issue (we're not setting any gzip headers here) | 17:51 |
clarkb | corvus: the second upload is to glance right? | 17:51 |
clarkb | first upload is build job to swift as an artifact. Then the zuul launcher fetches from swift, does its checks then does the second upload into the openstack image server (glance) ? | 17:52 |
corvus | i mean... the implementation varies | 17:52 |
corvus | but yes let's say glance for simplicity | 17:52 |
clarkb | ok I was trying to understand why we couldn't upload the compressed image on the second upload and that is probably because glance has no way of indicating this is a qcow2 gzipped or whatever | 17:53 |
corvus | yeah specifically in this case, we're talking about our own zuul-side zst compression of a raw image | 17:53 |
corvus | qcow2 is already compressed, so no big deal | 17:54 |
fungi | also compression is often not reproducible, so checksums for a re-compressed file may differ depending on the algorithm | 17:54 |
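A quick illustration of the mismatch being diagnosed here, assuming the zuul-side compression is zstd and using placeholder file names: hashing the compressed stream never matches the digest recorded for the raw artifact, and the compressed bytes can themselves vary with compressor version and level.

    # digest recorded for the raw artifact
    sha256sum image.raw
    # digest seen if the compressed stream is hashed instead -- always different from the above
    zstd -q -c image.raw | sha256sum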
corvus | the image create api call in sdk accepts sha256 and md5sum arguments, and the intention was to provide a value for those. | 17:54 |
clarkb | ack | 17:54 |
clarkb | fwiw it does look like glance supports some sort of "compressed" container type since train but the docs don't indicate which types of compression are supported and older clouds won't support it | 17:55 |
corvus | oh i lied, we actually do currently pass those values; i thought we hadn't gotten to that yet, but it's there. | 17:55 |
clarkb | thinking out loud here is the most accurate thing carrying hashes of the inflated data? | 17:55 |
corvus | yep | 17:56 |
clarkb | so ya moving everything into the role so we can hash first then compress may be the best solution? | 17:56 |
corvus | doing that would mean adding a few minutes to our image build/upload jobs... unless maybe we have the role keep both copies of the image | 17:56 |
corvus | because right now, we hash in parallel while uploading | 17:57 |
clarkb | I suspect we may hit disk limits if we keep both copies | 17:57 |
clarkb | and maybe that is the most important limitation rather than wall clock time? | 17:58 |
corvus | me too... though... theoretically there is a moment where we have both on disk? | 17:58 |
clarkb | hrm thats a good point. If we aren't doing streaming compression too then we're already carrying both? | 17:59 |
clarkb | so maybe its fine | 17:59 |
corvus | maybe we can: 1) start the hash in background; 2) compress; 3) join hash; 4) delete original; 5) upload | 17:59 |
corvus | we should only ever have one raw image, so we shouldn't really need to delete it... | 18:00 |
corvus | so maybe step 4 is optional | 18:00 |
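A sketch of that ordering, assuming a raw image compressed with zstd; file names are placeholders:

    # 1) start hashing the raw image in the background
    sha256sum image.raw > image.raw.sha256 &
    hash_pid=$!
    # 2) compress while the hash runs
    zstd -T0 -q image.raw -o image.raw.zst
    # 3) join the hash
    wait "$hash_pid"
    # 4) optionally remove the raw original, then 5) upload image.raw.zst plus the recorded digest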
opendevreview | James E. Blair proposed opendev/zuul-providers master: Move compression into upload role https://review.opendev.org/c/opendev/zuul-providers/+/955621 | 18:09 |
corvus | infra-root: ^ that's somewhat urgent since it's blocking uploads to clouds that use raw images | 18:10 |
corvus | the check pipeline is a waste of time on that change; gate is self-testing, so as soon as it looks good to humans, we should approve it. | 18:12 |
clarkb | corvus: quick question, did we switch to the role in zuul-jobs yet? that change seems to indicate we haven't, which is fine, just catching up | 18:12 |
corvus | we have not :( | 18:12 |
clarkb | ya the role name is image-upload-swift which matches what is in zuul-providers | 18:12 |
corvus | that will just increase the delta, but... c'est la vie | 18:13 |
clarkb | +2 from me | 18:14 |
corvus | cool i'm going to +3 it, and if anyone sees a problem in the next several hours it'll take to gate, feel free to negate it :) | 18:19 |
corvus | 2025-07-22 18:34:24.547 | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried | 18:52 |
corvus | getting that fairly consistently for centos9 | 18:52 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Move compression into upload role https://review.opendev.org/c/opendev/zuul-providers/+/955621 | 18:55 |
corvus | clarkb: ^ missed some necessary changes | 18:56 |
clarkb | https://4b3c52423d412e2d9836-485520e0725c5c3e0cb7e09d4d5f1a24.ssl.cf2.rackcdn.com/openstack/082a27481fdc49d6ac05cccac587ded6/bridge99.opendev.org/screenshots/gitea-main.png this is the screenshot of the gitea main page on the squashed update | 19:41 |
clarkb | we can't verify link targets that way but at least it renders nicely | 19:41 |
fungi | yeah, it's the best we've got, worst case it's only one change to revert if we don't like it once deployed | 19:42 |
clarkb | I +2'd the change. I'm also happy to help babysit and make sure that gitea doesn't go sideways while upgrading today. I'm going to grab lunch but ci jobs take long enough I think we can approve it now if anyone else wants to look at it quickly | 19:42 |
fungi | i'm happy to approve it now, that leaves time for others to +2 or -2 | 19:42 |
clarkb | ianw: responded on https://review.opendev.org/c/opendev/system-config/+/955544 basically journald writes to syslog so it all magically works as is. The only thing we have to change is the socket that docker/podman write to | 19:45 |
clarkb | and the CI job seems to confirm this. I left links in my response that you can confirm with too | 19:45 |
clarkb | infra-root I'm not positive that 955544 will restart our IRC bots automatically, but on the off chance it does we may want to approve that change during a quiet time for meetings | 19:46 |
clarkb | but I think it is ready for review | 19:46 |
clarkb | and now lunch | 19:46 |
fungi | yeah, updates to the compose files will either restart those containers, or if they don't then that's probably an accidental oversight. mainly a concern for the limnoria meetbot | 19:50 |
opendevreview | Merged opendev/system-config master: Multiple Gitea splash page updates https://review.opendev.org/c/opendev/system-config/+/952407 | 20:37 |
fungi | that should get underway ~immediately | 20:37 |
clarkb | fwiw I think the last meeting today may end at 2200UTC | 20:38 |
fungi | infra-prod-service-gitea is running now, i'm tailing the log on bridge | 20:39 |
fungi | https://gitea09.opendev.org:3081/ is responding again and has updated with the new content | 20:41 |
clarkb | the links all seem to work there too | 20:42 |
fungi | yeah, they tested right for me | 20:42 |
clarkb | gitea13 is starting its update now. Load continues to be high there but that was fine yesterday | 20:49 |
clarkb | seems to have updated just fine | 20:51 |
fungi | yeah, it's on the last one now | 20:52 |
fungi | and done | 20:52 |
opendevreview | Merged opendev/zuul-providers master: Move compression into upload role https://review.opendev.org/c/opendev/zuul-providers/+/955621 | 21:08 |
opendevreview | Merged zuul/zuul-jobs master: Update s3 minio tests https://review.opendev.org/c/zuul/zuul-jobs/+/954886 | 23:36 |