ykarel | thx corvus | 04:24 |
*** tosky_ is now known as tosky | 07:33 | |
*** jroll08 is now known as jroll0 | 07:36 | |
rpittau | hello everyone, since bullseye backports repo has been dropped one of our jobs is failing as a required pkg was coming from there https://ca4e422190e719ff4803-b95957069fa081f0196bfab2502640d8.ssl.cf5.rackcdn.com/openstack/d8af28e2e4da43ffbc315d64d8d1c90b/job-output.txt | 08:48 |
rpittau | it was suggested to us to explicitly enable that repository for that job, but I have no idea how to do that, is there an option in the zuul job config? | 08:48 |
jrosser | isnt that trying to take a debian bullseye package onto an ubuntu noble system? | 09:22 |
rpittau | jrosser: it's building a bullseye image on a noble system | 09:23 |
jrosser | ooooh ok | 09:23 |
rpittau | using DIB :) | 09:23 |
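For context, a minimal sketch of what explicitly enabling the dropped suite looks like at the apt level, assuming bullseye-backports has been moved to archive.debian.org after being dropped from the regular mirrors (how this gets wired into the DIB elements or the zuul job variables is not shown here and depends on the job in question):

    # assumption: the archived suite is served from archive.debian.org
    echo 'deb http://archive.debian.org/debian bullseye-backports main' \
      > /etc/apt/sources.list.d/bullseye-backports.list
    apt-get update
    # packages can then be pulled from the suite with: apt-get install -t bullseye-backports <pkg>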
opendevreview | Francisco Seruca Salgado proposed zuul/zuul-jobs master: Trigger Test https://review.opendev.org/c/zuul/zuul-jobs/+/955583 | 11:08 |
dtantsur | Are all openstack sites extremely slow just for me? | 13:09 |
fungi | dtantsur: if it's all sites whose names are aliases for static.opendev.org then it could be that server getting pounded. i'll take a look | 13:09 |
dtantsur | fungi: could be. Docs very slow, tox gets stuck on downloading requirements. | 13:10 |
fungi | my personal webserver is utterly offline this morning because it can't keep up with the llm crawlers | 13:10 |
fungi | dtantsur: yeah, looks similar to a situation we saw late last week where the server load is almost nonexistent but very few requests are being processed. i'll see if i can get debug status data out of apache before i restart the service | 13:12 |
dtantsur | thanks! | 13:12 |
fungi | last observed occurrence was thursday: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-07-17.log.html#opendev.2025-07-17.log.html%23t2025-07-17T15:47:18 | 13:14 |
fungi | i was able to get the status this time at least, scoreboard shows most slots are in a "reading request" state | 13:18 |
fungi | none were accepting connections at the time i got it to return data | 13:19 |
fungi | more than half of the clients have no associated vhost or request | 13:21 |
fungi | which strikes me as odd | 13:21 |
fungi | anyway, i'll restart apache for now and we can pore over the collected status data | 13:23 |
fungi | infra-root: i took two status dumps about 6 minutes apart, they're server-status-1316.html and server-status-1322.html in my homedir on static.o.o | 13:24 |
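A minimal sketch of how status dumps like these can be captured, assuming mod_status is enabled and /server-status is reachable from the server itself (the file names above are the actual dumps; the commands here are only illustrative):

    # human-readable scoreboard, saved with a timestamp in the name
    curl -s "http://localhost/server-status" > "server-status-$(date +%H%M).html"
    # machine-readable variant, handy for a quick look at worker states
    curl -s "http://localhost/server-status?auto" | grep -E 'BusyWorkers|IdleWorkers|Scoreboard'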
fungi | #status log Restarted apache on static.opendev.org in order to clear hung worker processes | 13:25 |
opendevstatus | fungi: finished logging | 13:25 |
fungi | dtantsur: hopefully things are working better again for now? | 13:26 |
fungi | infra-root: unrelated, it looks like we have the openmetal server move schedule now, and the machines aren't all moving at the same time but rather spread out over the course of roughly two weeks. given the limited quota we have there, i favor just taking it offline between 2025-07-29 and 2025-08-12 to avoid spurious test failures | 13:28 |
frickler | sounds reasonable, +1 | 13:31 |
dtantsur | fungi: seem to be better, thanks again | 13:32 |
bbezak | https://tarballs.opendev.org still seems a bit slow for me at least | 14:12 |
bbezak | (Hi) | 14:12 |
clarkb | why/where is tox pulling from static? | 14:22 |
clarkb | my early meeting is cancelled today so I'm going to pop out for a bike ride before it gets hot today | 14:23 |
dtantsur | clarkb: requirements or upper constraints, I assume? | 14:24 |
frickler | -c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/master} | 14:25 |
dtantsur | I'm not sure, just a guess | 14:25 |
dtantsur | yeah | 14:25 |
clarkb | thats a redirect to https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt not sure what utility that provides | 14:26 |
clarkb | master upper constraints is statically located at ^ so the redirect is maybe redundant? | 14:26 |
dtantsur | IIRC it allows referring to named branches before they exist | 14:26 |
clarkb | right but not for master | 14:27 |
dtantsur | the master version is probably redundant indeed | 14:27 |
frickler | more stable URL, not dependent on how gitea presents it? | 14:27 |
clarkb | which is where 90% of the requests will be coming from (the CI system) | 14:27 |
clarkb | (I don't think the CI system is to blame here, but we can sometimes be our own worst enemy so its worth checking) | 14:27 |
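A quick way to see the redirect being discussed, assuming a client that follows HTTP redirects (the requirements file name is the usual convention, not taken from this log):

    # show the Location header returned by the releases site for the master constraints URL
    curl -sI https://releases.openstack.org/constraints/upper/master | grep -i '^location:'
    # pip follows the redirect when resolving constraints
    pip install -c https://releases.openstack.org/constraints/upper/master -r requirements.txt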
elodilles | clarkb: this is where the redirection comes from https://review.opendev.org/c/openstack/releases/+/639011 | 14:30 |
elodilles | i mean, this patch introduced it | 14:31 |
elodilles | i mean #2, this patch automated the generation of redirections o:) | 14:32 |
elodilles | anyway, that is what provides the redirections | 14:32 |
slittle1_ | any known issues with zuul this morning? Seems like several jobs stuck in 'gate'. https://zuul.openstack.org/status?project=starlingx%2F* | 14:34 |
slittle1_ | ... and just now all the jobs got unstuck at the same moment, and cleared very quickly after that. | 14:40 |
slittle1_ | don't think I've seen jobs with all gate tests completed green just sit there for 10+ mins before. Seems strange. | 14:42 |
fungi | slittle1_: events and results all go into a fifo queue, and if a change merges that alters zuul's configuration it can result in a pause in queue processing while reconfiguration occurs, so you'll see build results start to pile up like that and then all clear as soon as queue processing resumes | 14:52 |
fungi | bbezak: yes, it looks like static.opendev.org's apache may be back in the same state it was before the restart a couple of hours ago, which suggests this is some sort of pileup related to an external factor (my money's on llm training crawlers, they seem to be the cause of most of our issues these days) | 14:54 |
fungi | i'll get a couple more debug status reports from apache and then restart it again | 14:55 |
fungi | #status log Restarted apache on static.opendev.org yet again in order to clear hung worker processes | 15:01 |
fungi | infra-root: i took two more status dumps about 6 minutes apart, they're server-status-1454.html and server-status-1500.html in my homedir on static.o.o | 15:01 |
opendevstatus | fungi: finished logging | 15:01 |
corvus | fungi: i'm not sure what to make of those server status dumps. i'm curious about what might be happening on those "reading" connections. what are they waiting on? is there some internal apache resource on which they're blocking? | 15:42 |
corvus | i don't see current behavior like that | 15:45 |
corvus | i fixed a test cleanup race in the 'require same provider' change if folks want to re +2 https://review.opendev.org/955545 | 16:18 |
clarkb | `ls -lh /var/log/apache2 | grep -v \\.gz | sort -k 5 -h` is somewhat informative | 16:42 |
clarkb | that shows where the bulk of the requests are going (just based on relative sizes of log files) and then naively looking into that log file it does look like bad crawlers to me | 16:43 |
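A small sketch of the kind of naive look described above, assuming the vhosts use apache's combined log format, where the user agent is the sixth double-quoted field:

    # ACCESS_LOG is a placeholder for whichever file the size check above points at
    ACCESS_LOG=/var/log/apache2/some-site_access.log
    # tally user agents to spot crawler traffic
    awk -F'"' '{print $6}' "$ACCESS_LOG" | sort | uniq -c | sort -rn | head -20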
clarkb | we have a lot of vhosts... I'm working on a change to apply the user agent filtering | 16:49 |
corvus | there are certainly bad crawlers, but are they causing a bunch of connection slots to get stuck in "R"? | 16:49 |
clarkb | corvus: probably not since we haven't seen this before with the same crawlers elsewhere | 16:50 |
fungi | i did some deeper analysis correlating the client addresses in the status reports to site access logs, see comments in matrix | 16:50 |
clarkb | but I figure its good hygiene to apply these rules if we're getting hit here too | 16:50 |
fungi | i've added a temporary iptables rule to block an address that was seen in the status reports during each incident trying to use rclone to parallel-download the contents of the tarballs site, which looks like it was eating up all available connection slots in apache | 17:03 |
fungi | infra-root: ^ | 17:03 |
mordred | yeah. maybe don't do that mkay? | 17:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Apply UserAgentFilter to every vhost on static https://review.opendev.org/c/opendev/system-config/+/955616 | 17:04 |
clarkb | as mentioned ^ this is unlikely to solve the problem (I think fungi's analysis is more on point for that), but I'm hoping it improves things overall and gives us more headroom | 17:04 |
fungi | if situations like the rclone one become an ongoing problem, we could add a conntrack rule like we have on the gerrit server to limit the number of parallel connections from a single address | 17:05 |
fungi | we'd need to come up with a number that would avoid exhausting apache workers while hopefully not adversely impacting clients behind overloaded nats | 17:06 |
clarkb | ++ | 17:06 |
fungi | so i'd rather not go there unless it gets worse | 17:07 |
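A rough sketch of the kind of per-address connection cap being described, assuming iptables with the connlimit match (the threshold of 20 is an arbitrary placeholder, not a value taken from the gerrit server):

    # reject new HTTPS connections from any single address that already holds 20 or more
    iptables -I INPUT -p tcp --syn --dport 443 \
      -m connlimit --connlimit-above 20 --connlimit-mask 32 \
      -j REJECT --reject-with tcp-reset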
clarkb | re openmetal I'm fine with simply shutting the whole thing down. It might be worth asking yuriy if they want to use our cloud as a guinea pig and have us use it during the migration to see what if anything breaks? | 17:07 |
clarkb | I think the risk to us is low with an easy out (stop using the cloud in zuul launcher temporarily) so that may be useful for them to see how things do. The one gotcha is our mirror there is a spof | 17:07 |
fungi | yes, the mirror server is my main concern, since if it goes offline while we keep booting test nodes, we'll get lots of job failures | 17:08 |
fungi | also if it's offline for a while and we don't put it in ansible's emergency disable group, it could cause problems for deploys | 17:09 |
clarkb | right. We could identify the node it lives on and only disable zuul launcher the day before and after that node migrates or something | 17:09 |
clarkb | but also doing the easy thing and disabling it entirely for the week is probably fine | 17:09 |
fungi | well, it's more like two weeks, but yeah | 17:09 |
fungi | (the earliest and latest move dates they listed span 13 days) | 17:10 |
clarkb | ah | 17:10 |
corvus | there's still a small bug related to zuul-launcher handling of image upload errors. things are much improved after the recent fix to avoid duplicate upload jobs, but it's still possible for the launchers to create an unlimited number of pending uploads. it's just happening in slow motion now. | 17:37 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/955619 Launcher: skip non-pending uploads [NEW] | 17:37 |
corvus | that should address what we're currently seeing | 17:37 |
corvus | (now i will go see what the upload errors actually are) | 17:37 |
corvus | https://zuul.opendev.org/t/opendev/image/ubuntu-bionic shows the pending-uploads problem that the change fixes | 17:37 |
corvus | 2025-07-22 16:38:18,238 ERROR zuul.Launcher: Exception: Downloaded file /var/lib/zuul/tmp/31560f56174145d7b6f2930613394729 sha256 digest 15f50732fa66da350ed405985aafbd30736ea2e2d45dc9c4cc11698296584153 does not match artifact <ImageBuildArtifact 31560f56174145d7b6f2930613394729 state: ready canonical_name: opendev.org%2Fopendev%2Fzuul-providers/ubuntu-bionic build_uuid: 7884342657bb4e30a2bd9bce5e99c5ea validated: True> digest | 17:39 |
corvus | d59b4f42620dee6677ef570b55073482cfb6b4fa2a09d6007bb81a30d370745c | 17:39 |
corvus | cool. cool. | 17:40 |
clarkb | that isn't the empty file sha256sum fwiw so it is downloading something | 17:41 |
corvus | i think the issue is that we are compressing it before calculating the checksum | 17:45 |
corvus | fixing that is a little tricky because we compress outside of the upload role, and we calculate the checksum inside the role... but maybe we should do both inside the upload role | 17:46 |
corvus | alternatively... we could consider dropping all the checksum stuff entirely. i'm getting the feeling no one loves it. :) | 17:47 |
clarkb | I think it is a good sanity check if we can make it work properly. As long as we checksum either compressed or inflated data consistently it should be fine? I guess the problem is magic in download processing might mean we only see the inflated data on the download side? | 17:48 |
corvus | the main reason i included it in the first place was so that we could use the same value when we uploaded the final image for the cloud provider to swift. of course, we're not actually doing that (yet). and also, we are recalculating it after download, so it's not saving us time. | 17:49 |
corvus | that's the main reason for operating on the uncompressed data | 17:50 |
corvus | but we could do both | 17:50 |
corvus | we could checksum the compressed artifact, and then uncompress it, and checksum the uncompressed version for the second upload to swift | 17:50 |
corvus | or we could omit the checksum for the second swift upload (which is what we do now, but only because it's just not implemented) | 17:51 |
corvus | i'm not worried about magic processing on download; that shouldn't be an issue (we're not setting any gzip headers here) | 17:51 |
clarkb | corvus: the second upload is to glance right? | 17:51 |
clarkb | first upload is build job to swift as an artifact. Then the zuul launcher fetches from swift, does its checks then does the second upload into the openstack image server (glance) ? | 17:52 |
corvus | i mean... the implementation varies | 17:52 |
corvus | but yes let's say glance for simplicity | 17:52 |
clarkb | ok I was trying to understand why we couldn't upload the compressed image on the second upload and that is probably because glance has no way of indicating this is a qcow2 gzipped or whatever | 17:53 |
corvus | yeah specifically in this case, we're talking about our own zuul-side zst compression of a raw image | 17:53 |
corvus | qcow2 is already compressed, so no big deal | 17:54 |
fungi | also compression is often not reproducible, so checksums for a re-compressed file may differ depending on the algorithm | 17:54 |
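A quick illustration of the mismatch being diagnosed here, assuming the zuul-side compression is zstd and using placeholder file names: hashing the compressed stream never matches the digest recorded for the raw artifact, and the compressed bytes can themselves vary with compressor version and level.

    # digest recorded for the raw artifact
    sha256sum image.raw
    # digest seen if the compressed stream is hashed instead -- always different from the above
    zstd -q -c image.raw | sha256sum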
corvus | the image create api call in sdk accepts sha256 and md5sum arguments, and the intention was to provide a value for those. | 17:54 |
clarkb | ack | 17:54 |
clarkb | fwiw it does look like glance supports some sort of "compressed" container type since train but the docs don't indicate which types of compression are supported and older clouds won't support it | 17:55 |
corvus | oh i lied, we actually do currently pass those values; i thought we hadn't gotten to that yet, but it's there. | 17:55 |
clarkb | thinking out loud here is the most accurate thing carrying hashes of the inflated data? | 17:55 |
corvus | yep | 17:56 |
clarkb | so ya moving everything into the role so we can hash first then compress may be the best solution? | 17:56 |
corvus | doing that would mean adding a few minutes to our image build/upload jobs... unless maybe we have the role keep both copies of the image | 17:56 |
corvus | because right now, we hash in parallel while uploading | 17:57 |
clarkb | I suspect we may hit disk limits if we keep both copies | 17:57 |
clarkb | and maybe that is the most important limitation rather than wall clock time? | 17:58 |
corvus | me too... though... theoretically there is a moment where we have both on disk? | 17:58 |
clarkb | hrm thats a good point. If we aren't doing streaming compression too then we're already carrying both? | 17:59 |
clarkb | so maybe its fine | 17:59 |
corvus | maybe we can: 1) start the hash in background; 2) compress; 3) join hash; 4) delete original; 5) upload | 17:59 |
corvus | we should only ever have one raw image, so we shouldn't really need to delete it... | 18:00 |
corvus | so maybe step 4 is optional | 18:00 |
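A sketch of that ordering, assuming a raw image compressed with zstd; file names are placeholders:

    # 1) start hashing the raw image in the background
    sha256sum image.raw > image.raw.sha256 &
    hash_pid=$!
    # 2) compress while the hash runs
    zstd -T0 -q image.raw -o image.raw.zst
    # 3) join the hash
    wait "$hash_pid"
    # 4) optionally remove the raw original, then 5) upload image.raw.zst plus the recorded digest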
opendevreview | James E. Blair proposed opendev/zuul-providers master: Move compression into upload role https://review.opendev.org/c/opendev/zuul-providers/+/955621 | 18:09 |
corvus | infra-root: ^ that's somewhat urgent since it's blocking uploads to clouds that use raw images | 18:10 |
corvus | the check pipeline is a waste of time on that change; gate is self-testing, so as soon as it looks good to humans, we should approve it. | 18:12 |
clarkb | corvus: quick question, did we switch to the role in zuul-jobs yet? that change seems to indicate we haven't, which is fine, just catching up | 18:12 |
corvus | we have not :( | 18:12 |
clarkb | ya the role name is image-upload-swift which matches what is in zuul-providers | 18:12 |
corvus | that will just increase the delta, but... c'est la vie | 18:13 |
clarkb | +2 from me | 18:14 |
corvus | cool i'm going to +3 it, and if anyone sees a problem in the next several hours it'll take to gate, feel free to negate it :) | 18:19 |
corvus | 2025-07-22 18:34:24.547 | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried | 18:52 |
corvus | getting that fairly consistently for centos9 | 18:52 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Move compression into upload role https://review.opendev.org/c/opendev/zuul-providers/+/955621 | 18:55 |
corvus | clarkb: ^ missed some necessary changes | 18:56 |
clarkb | https://4b3c52423d412e2d9836-485520e0725c5c3e0cb7e09d4d5f1a24.ssl.cf2.rackcdn.com/openstack/082a27481fdc49d6ac05cccac587ded6/bridge99.opendev.org/screenshots/gitea-main.png this is the screenshot of the gitea main page on the squashed update | 19:41 |
clarkb | we can't verify link targets that way but at least it renders nicely | 19:41 |
fungi | yeah, it's the best we've got, worst case it's only one change to revert if we don't like it once deployed | 19:42 |
clarkb | I +2'd the change. I'm also happy to help babysit and make sure that gitea doesn't go sideways while upgrading today. I'm going to grab lunch but ci jobs take long enough I think we can approve it now if anyone else wants to look at it quickly | 19:42 |
fungi | i'm happy to approve it now, that leaves time for others to +2 or -2 | 19:42 |
clarkb | ianw: responded on https://review.opendev.org/c/opendev/system-config/+/955544 basically journald writes to syslog so it all magically works as is. The only thing we have to change is the socket that docker/podman write to | 19:45 |
clarkb | and the CI job seems to confirm this. I left links in my response that you can confirm with too | 19:45 |
clarkb | infra-root I'm not positive that 955544 will restart our IRC bots automatically, but on the off chance it does we may want to approve that change during a quiet time for meetings | 19:46 |
clarkb | but I think it is ready for review | 19:46 |
clarkb | and now lunch | 19:46 |
fungi | yeah, updates to the compose files will either restart those containers, or if they don't then that's probably an accidental oversight. mainly a concern for the limnoria meetbot | 19:50 |
opendevreview | Merged opendev/system-config master: Multiple Gitea splash page updates https://review.opendev.org/c/opendev/system-config/+/952407 | 20:37 |
fungi | that should get underway ~immediately | 20:37 |
clarkb | fwiw I think the last meeting today may end at 2200UTC | 20:38 |
fungi | infra-prod-service-gitea is running now, i'm tailing the log on bridge | 20:39 |
fungi | https://gitea09.opendev.org:3081/ is responding again and has updated with the new content | 20:41 |
clarkb | the links all seem to work there too | 20:42 |
fungi | yeah, they tested right for me | 20:42 |
clarkb | gitea13 is starting its update now. Load continues to be high there but that was fine yesterday | 20:49 |
clarkb | seems to have updated just fine | 20:51 |
fungi | yeah, it's on the last one now | 20:52 |
fungi | and done | 20:52 |
opendevreview | Merged opendev/zuul-providers master: Move compression into upload role https://review.opendev.org/c/opendev/zuul-providers/+/955621 | 21:08 |
opendevreview | Merged zuul/zuul-jobs master: Update s3 minio tests https://review.opendev.org/c/zuul/zuul-jobs/+/954886 | 23:36 |