Tuesday, 2025-07-22

ykarelthx corvus 04:24
*** tosky_ is now known as tosky07:33
*** jroll08 is now known as jroll007:36
rpittauhello everyone, since bullseye backports repo has been dropped one of our jobs is failing as a required pkg was coming from there https://ca4e422190e719ff4803-b95957069fa081f0196bfab2502640d8.ssl.cf5.rackcdn.com/openstack/d8af28e2e4da43ffbc315d64d8d1c90b/job-output.txt08:48
rpittauit was suggested to us to explicitly enable that repository for that job, but I have no idea how to do that. Is there an option in the zuul job config?08:48
jrosserisn't that trying to take a debian bullseye package onto an ubuntu noble system?09:22
rpittaujrosser: it's building a bullseye image on a noble system09:23
jrosserooooh ok09:23
rpittauusing DIB :)09:23
opendevreviewFrancisco Seruca Salgado proposed zuul/zuul-jobs master: Trigger Test  https://review.opendev.org/c/zuul/zuul-jobs/+/95558311:08
dtantsurAre all openstack sites extremely slow just for me?13:09
fungidtantsur: if it's all sites whose names are aliases for static.opendev.org then it could be that server getting pounded. i'll take a look13:09
dtantsurfungi: could be. Docs very slow, tox gets stuck on downloading requirements.13:10
fungimy personal webserver is utterly offline this morning because it can't keep up with the llm crawlers13:10
fungidtantsur: yeah, looks similar to a situation we saw late last week where the server load is almost nonexistent but very few requests are being processed. i'll see if i can get debug status data out of apache before i restart the service13:12
dtantsurthanks!13:12
fungilast observed occurrence was thursday: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-07-17.log.html#opendev.2025-07-17.log.html%23t2025-07-17T15:47:1813:14
fungii was able to get the status this time at least, scoreboard shows most slots are in a "reading request" state13:18
funginone were accepting connections at the time i got it to return data13:19
fungimore than half of the clients have no associated vhost or request13:21
fungiwhich strikes me as odd13:21
fungianyway, i'll restart apache for now and we can pore over the collected status data13:23
fungiinfra-root: i took two status dumps about 6 minutes apart, they're server-status-1316.html and server-status-1322.html in my homedir on static.o.o13:24
fungi#status log Restarted apache on static.opendev.org in order to clear hung worker processes13:25
opendevstatusfungi: finished logging13:25
fungidtantsur: hopefully things are working better again for now?13:26
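For reference, a mod_status snapshot like the server-status dumps mentioned above can be captured along these lines; this is a generic sketch that assumes mod_status is enabled and /server-status is reachable from localhost (the exact URL and access rules on static.opendev.org may differ):

    # human-readable scoreboard page, saved with a timestamp in the name
    curl -s "http://localhost/server-status" -o "server-status-$(date -u +%H%M).html"
    # machine-readable variant, convenient for grep/diff between snapshots
    curl -s "http://localhost/server-status?auto"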
fungiinfra-root: unrelated, it looks like we have the openmetal server move schedule now, and the machines aren't all moving at the same time but rather are spread out over the course of roughly two weeks. given the limited quota we have there, i favor just taking it offline between 2025-07-29 and 2025-08-12 to avoid spurious test failures13:28
fricklersounds reasonable, +113:31
dtantsurfungi: seem to be better, thanks again13:32
bbezakhttps://tarballs.opendev.org still seems a bit slow for me at least14:12
bbezak(Hi)14:12
clarkbwhy/where is tox pulling from static?14:22
clarkbmy early meeting is cancelled today so I'm going to pop out for a bike ride before it gets hot today14:23
dtantsurclarkb: requirements or upper constraints, I assume?14:24
frickler-c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/master}14:25
dtantsurI'm not sure, just a guess14:25
dtantsuryeah14:25
clarkbthat's a redirect to https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt, not sure what utility that provides14:26
clarkbmaster upper constraints is statically located at ^ so the redirect is maybe redundant?14:26
dtantsurIIRC it allows referring to named branches before they exist14:26
clarkbright but not for master14:27
dtantsurthe master version is probably redundant indeed14:27
fricklermore stable URL, not dependent on how gitea presents it?14:27
clarkbwhich is where 90% of the requests will be from the CI system14:27
clarkb(I don't think the CI system is to blame here, but we can sometimes be our own worst enemy so it's worth checking)14:27
elodillesclarkb: this is where the redirection comes from https://review.opendev.org/c/openstack/releases/+/63901114:30
elodillesi mean, this patch introduced it14:31
elodillesi mean #2, this patch automated the generation of redirections o:)14:32
elodillesanyway, that is what provides the redirections14:32
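The redirect under discussion can be checked directly from a shell; a small sketch using the URLs quoted above (assuming the releases site answers with an ordinary HTTP redirect as described):

    # show where releases.openstack.org sends requests for master constraints
    curl -sI https://releases.openstack.org/constraints/upper/master | grep -i '^location:'
    # or just follow the redirect and fetch the constraints file itself
    curl -sL https://releases.openstack.org/constraints/upper/master | head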
slittle1_any known issues with zuul this morning?   Seems like several jobs stuck in 'gate'.  https://zuul.openstack.org/status?project=starlingx%2F*14:34
slittle1_... and just now all the jobs got unstuck at the same moment, and cleared very quickly after that.14:40
slittle1_don't think I've seen jobs with all gate tests completed green just sit there for 10+ mins before. Seems strange.14:42
fungislittle1_: events and results all go into a fifo queue, and if a change merges that alters zuul's configuration it can result in a pause in queue processing while reconfiguration occurs, so you'll see build results start to pile up like that and then all clear as soon as queue processing resumes14:52
fungibbezak: yes, it looks like static.opendev.org's apache may be back in the same state it was before the restart a couple of hours ago, which suggests this is some sort of pileup related to an external factor (my money's on llm training crawlers, they seem to be the cause of most of our issues these days)14:54
fungii'll get a couple more debug status reports from apache and then restart it again14:55
fungi#status log Restarted apache on static.opendev.org yet again in order to clear hung worker processes15:01
fungiinfra-root: i took two more status dumps about 6 minutes apart, they're server-status-1454.html and server-status-1500.html in my homedir on static.o.o15:01
opendevstatusfungi: finished logging15:01
corvusfungi: i'm not sure what to make of those server status dumps.  i'm curious about what might be happening on those "reading" connections.  what are they waiting on?  is there some internal apache resource on which they're blocking?15:42
corvusi don't see current behavior like that15:45
corvusi fixed a test cleanup race in the 'require same provider' change if folks want to re +2 https://review.opendev.org/95554516:18
clarkb`ls -lh /var/log/apache2 | grep -v \\.gz | sort -k 5 -h` is somewhat informative16:42
clarkbthat shows where the bulk of the requests are going (just based on relative sizes of log files) and then naively looking into that log file it does look like bad crawlers to me16:43
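A minimal sketch of that kind of naive log check, assuming Apache's combined log format (where the user agent is the sixth double-quote-delimited field) and per-vhost access logs under /var/log/apache2; the vhost filename is a placeholder:

    # top user agents hitting one vhost
    awk -F'"' '{print $6}' /var/log/apache2/SOME-VHOST-access.log | sort | uniq -c | sort -rn | head
    # top client addresses, to spot a single host hammering the site
    awk '{print $1}' /var/log/apache2/SOME-VHOST-access.log | sort | uniq -c | sort -rn | head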
clarkbwe have a lot of vhosts... I'm working on a change to apply the user agent filtering16:49
corvusthere are certainly bad crawlers, but are they causing a bunch of connection slots to get stuck in "R"?16:49
clarkbcorvus: probably not since we haven't seen this before with the same crawlers elsewhere16:50
fungii did some deeper analysis correlating the client addresses in the status reports to site access logs, see comments in matrix16:50
clarkbbut I figure its good hygiene to apply these rules if we're getting hit here too16:50
fungii've added a temporary iptables rule to block an address that was seen in the status reports during each incident trying to use rclone to parallel download the contents of the tarballs site, which looks like it was eating up all available connection slots in apache17:03
fungiinfra-root: ^17:03
mordredyeah. maybe don't do that mkay?17:03
opendevreviewClark Boylan proposed opendev/system-config master: Apply UserAgentFilter to every vhost on static  https://review.opendev.org/c/opendev/system-config/+/95561617:04
clarkbas mentioned ^ this is unlikely to solve the problem (I think fungi's analysis is more on point for that), but I'm hoping it improves things overall and gives us more headroom17:04
fungiif situations like the rclone one become an ongoing problem, we could add a conntrack rule like we have on the gerrit server to limit the number of parallel connections from a single address17:05
fungiwe'd need to come up with a number that would avoid exhausting apache workers while hopefully not adversely impacting clients behind overloaded nats17:06
clarkb++17:06
fungiso i'd rather not go there unless it gets worse17:07
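For context, the two mitigations mentioned above look roughly like the following sketch; the address, port and limit are placeholders, and the actual rule on the gerrit server may be written differently:

    # temporary block of a single abusive client address (documentation address as placeholder)
    iptables -I INPUT -s 203.0.113.45 -j DROP
    # connlimit-style cap on parallel TCP connections per source address
    iptables -I INPUT -p tcp --syn --dport 443 -m connlimit --connlimit-above 20 -j REJECT --reject-with tcp-reset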
clarkbre openmetal I'm fine with simply shutting the whole thing down. It might be worth asking yuriy if they want to use our cloud as a guinea pig and have us use it during the migration to see what if anything breaks?17:07
clarkbI think the risk to us is low with an easy out (stop using the cloud in zuul launcher temporarily) so that may be useful for them to see how things do. The one gotcha is our mirror there is a spof17:07
fungiyes, the mirror server is my main concern, since if it goes offline while we keep booting test nodes, we'll get lots of job failures17:08
fungialso if it's offline for a while and we don't put it in ansible's emergency disable group, it could cause problems for deploys17:09
clarkbright. We could identify the node it lives on and only disable zuul launcher the day before and after that node migrates or something17:09
clarkbbut also doing the easy thing and disabling it entirely for the week is probably fine17:09
fungiwell, it's more like two weeks, but yeah17:09
fungi(the earliest and latest move dates they listed span 13 days)17:10
clarkbah17:10
corvusthere's still a small bug related to zuul-launcher handling of image upload errors.  things are much improved after the recent fix to avoid duplicate upload jobs, but it's still possible for the launchers to create an unlimited number of pending uploads.  it's just happening in slow motion now.17:37
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/955619 Launcher: skip non-pending uploads [NEW]17:37
corvusthat should address what we're currently seeing17:37
corvus(now i will go see what the upload errors actually are)17:37
corvushttps://zuul.opendev.org/t/opendev/image/ubuntu-bionic  shows the problem with pending uploads that that change fixes17:37
corvus2025-07-22 16:38:18,238 ERROR zuul.Launcher:   Exception: Downloaded file /var/lib/zuul/tmp/31560f56174145d7b6f2930613394729 sha256 digest 15f50732fa66da350ed405985aafbd30736ea2e2d45dc9c4cc11698296584153 does not match artifact <ImageBuildArtifact 31560f56174145d7b6f2930613394729 state: ready canonical_name: opendev.org%2Fopendev%2Fzuul-providers/ubuntu-bionic build_uuid: 7884342657bb4e30a2bd9bce5e99c5ea validated: True> digest17:39
corvusd59b4f42620dee6677ef570b55073482cfb6b4fa2a09d6007bb81a30d370745c17:39
corvuscool. cool.17:40
clarkbthat isn't the empty file sha256sum fwiw so it is downloading something17:41
corvusi think the issue is that we are compressing it before calculating the checksum17:45
corvusfixing that is a little tricky because we compress outside of the upload role, and we calculate the checksum inside the role... but maybe we should do both inside the upload role17:46
corvusalternatively... we could consider dropping all the checksum stuff entirely.  i'm getting the feeling no one loves it.  :)17:47
clarkbI think it is a good sanity check if we can make it work properly. As long as we checksum either compressed or inflated data consistently it should be fine? I guess the problem is that magic in download processing might mean we only see the inflated data on the download side?17:48
corvusthe main reason i included it in the first place was so that we could use the same value when we uploaded the final image for the cloud provider to swift.  of course, we're not actually doing that (yet).  and also, we are recalculating it after download, so it's not saving us time.17:49
corvusthat's the main reason for operating on the uncompressed data17:50
corvusbut we could do both17:50
corvuswe could checksum the compressed artifact, and then uncompress it, and checksum the uncompressed version for the second upload to swift17:50
corvusor we could omit the checksum for the second swift upload (which is what we do now, but only because it's just not implemented)17:51
corvusi'm not worried about magic processing on download; that shouldn't be an issue (we're not setting any gzip headers here)17:51
clarkbcorvus: the second upload is to glance right?17:51
clarkbfirst upload is build job to swift as an artifact. Then the zuul launcher fetches from swift, does its checks then does the second upload into the openstack image server (glance) ?17:52
corvusi mean... the implementation varies17:52
corvusbut yes let's say glance for simplicity17:52
clarkbok I was trying to understand why we couldn't upload the compressed image on the second upload and that is probably because glance has no way of indicating that this is a gzipped qcow2 or whatever17:53
corvusyeah specifically in this case, we're talking about our own zuul-side zst compression of a raw image17:53
corvusqcow2 is already compressed, so no big deal17:54
fungialso compression is often not reproducible, so checksums for a re-compressed file may differ depending on the algorithm17:54
corvusthe image create api call in sdk accepts sha256 and md5sum arguments, and the intention was to provide a value for those.17:54
clarkback17:54
clarkbfwiw it does look like glance supports some sort of "compressed" container type since train, but the docs don't indicate which types of compression are supported and older clouds won't support it17:55
corvusoh i lied, we actually do currently pass those values; i thought we hadn't gotten to that yet, but it's there.17:55
clarkbthinking out loud here: is the most accurate thing carrying hashes of the inflated data?17:55
corvusyep17:56
clarkbso yeah, moving everything into the role so we can hash first, then compress, may be the best solution?17:56
corvusdoing that would mean adding a few minutes to our image build/upload jobs... unless maybe we have the role keep both copies of the image17:56
corvusbecause right now, we hash in parallel while uploading17:57
clarkbI suspect we may hit disk limits if we keep both copies17:57
clarkband maybe that is the most important limitation rather than wall clock time?17:58
corvusme too... though... theoretically there is a moment where we have both on disk?17:58
clarkbhrm that's a good point. If we aren't doing streaming compression too then we're already carrying both?17:59
clarkbso maybe its fine17:59
corvusmaybe we can: 1) start the hash in background; 2) compress; 3) join hash; 4) delete original; 5) upload17:59
corvuswe should only ever have one raw image, so we shouldn't really need to delete it...18:00
corvusso maybe step 4 is optional18:00
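A rough shell sketch of that ordering, assuming a raw image plus zstd compression as discussed; the filename is a placeholder and the real upload role will differ in detail:

    # 1) start hashing the uncompressed image in the background
    sha256sum image.raw > image.raw.sha256 &
    hash_pid=$!
    # 2) compress while the hash runs (zstd keeps the source file by default)
    zstd -T0 image.raw -o image.raw.zst
    # 3) join the hash; this digest is what gets recorded for the artifact
    wait "$hash_pid"
    read -r raw_sha256 _ < image.raw.sha256
    # 4) optionally delete image.raw, then 5) upload image.raw.zst along with $raw_sha256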
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Move compression into upload role  https://review.opendev.org/c/opendev/zuul-providers/+/95562118:09
corvusinfra-root: ^ that's somewhat urgent since it's blocking uploads to clouds that use raw images18:10
corvusthe check pipeline is a waste of time on that change; gate is self-testing, so as soon as it looks good to humans, we should approve it.18:12
clarkbcorvus: quick question, did we switch to the role in zuul-jobs yet? that change seems to indicate we haven't, which is fine, just catching up18:12
corvuswe have not :(18:12
clarkbya the role name is image-upload-swift which matches what is in zuul-providers18:12
corvusthat will just increase the delta, but... c'est la vie18:13
clarkb+2 from me18:14
corvuscool i'm going to +3 it, and if anyone sees a problem in the next several hours it'll take to gate, feel free to negate it :)18:19
corvus2025-07-22 18:34:24.547 | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried18:52
corvusgetting that fairly consistently for centos918:52
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Move compression into upload role  https://review.opendev.org/c/opendev/zuul-providers/+/95562118:55
corvusclarkb: ^ missed some necessary changes18:56
clarkbhttps://4b3c52423d412e2d9836-485520e0725c5c3e0cb7e09d4d5f1a24.ssl.cf2.rackcdn.com/openstack/082a27481fdc49d6ac05cccac587ded6/bridge99.opendev.org/screenshots/gitea-main.png this is the screenshot of the gitea main page on the squashed update19:41
clarkbwe can't verify link targets that way but at least it renders nicely19:41
fungiyeah, it's the best we've got, worst case it's only one change to revert if we don't like it once deployed19:42
clarkbI +2'd the change. I'm also happy to help babysit and make sure that gitea doesn't go sideways while upgrading today. I'm going to grab lunch, but ci jobs take long enough that I think we can approve it now if anyone else wants to look at it quickly19:42
fungii'm happy to approve it now, that leaves time for others to +2 or -219:42
clarkbianw: responded on https://review.opendev.org/c/opendev/system-config/+/955544 basically journald writes to syslog so it all magically works as is. The only thing we have to change is the socket that docker/podman write to19:45
clarkband the CI job seems to confirm this. I left links in my response that you can confirm with too19:45
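Purely as an illustration of the kind of socket change being described (not necessarily what 955544 actually does), pointing a container's syslog log driver at a specific socket in a compose file could look like this:

    # hypothetical compose fragment for one service, appended for illustration only
    printf '%s\n' \
      '    logging:' \
      '      driver: syslog' \
      '      options:' \
      '        syslog-address: "unixgram:///dev/log"' \
      >> docker-compose.yaml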
clarkbinfra-root I'm not positive that 955544 will restart our IRC bots automatically, but on the off chance it does we may want to approve that change during a quiet time for meetings19:46
clarkbbut I think it is ready for review19:46
clarkband now lunch19:46
fungiyeah, updates to the compose files will either restart those containers, or if they don't then that's probably an accidental oversight. mainly a concern for the limnoria meetbot19:50
opendevreviewMerged opendev/system-config master: Multiple Gitea splash page updates  https://review.opendev.org/c/opendev/system-config/+/95240720:37
fungithat should get underway ~immediately20:37
clarkbfwiw I think the last meeting today may end at 2200UTC20:38
fungiinfra-prod-service-gitea is running now, i'm tailing the log on bridge20:39
fungihttps://gitea09.opendev.org:3081/ is responding again and has updated with the new content20:41
clarkbthe links all seem to work there too20:42
fungiyeah, they tested right for me20:42
clarkbgitea13 is starting its update now. Load continues to be high there but that was fine yesterday20:49
clarkbseems to have updated just fine20:51
fungiyeah, it's on the last one now20:52
fungiand done20:52
opendevreviewMerged opendev/zuul-providers master: Move compression into upload role  https://review.opendev.org/c/opendev/zuul-providers/+/95562121:08
opendevreviewMerged zuul/zuul-jobs master: Update s3 minio tests  https://review.opendev.org/c/zuul/zuul-jobs/+/95488623:36

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!