| opendevreview | Tim Burke proposed openstack/project-config master: Have more files trigger test-release-openstack https://review.opendev.org/c/openstack/project-config/+/958709 | 00:11 |
|---|---|---|
| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/957995 | 02:13 |
|  | *** ralonsoh_ is now known as ralonsoh | 08:43 |
| clarkb | corvus: the openstack zuul tenant has a change (953448) stuck in the gate queue waiting on repo state for almost 5 hours now. I suspect that we might end up evicting that change and re-enqueuing it to get around whatever the problem is, but wanted to call it out before doing so as this may be a good debugging opportunity for whatever caused the issue | 14:36 |
| corvus | looks like event 528f5ced64e043369efae94dd96d8ffc for 953448,1 | 14:47 |
| corvus | 2025-08-28 09:51:19,662 INFO zuul.Pipeline.openstack.gate: [e: 528f5ced64e043369efae94dd96d8ffc] Scheduling global repo state for item <QueueItem 00241ac4439e460b83aa20cd5eff8da8 live for [<Change 0x7c8da2a15310 openstack/neutron 953448,1>] in gate> | 15:03 |
| corvus | repo state job 5edc0edc456941eaa5c1c2650e1f5db6 | 15:04 |
| corvus | ze11 | 15:05 |
| corvus | i think that merge job may still be running. going to sigusr2 ze11 | 15:11 |
| corvus | the merger thread is in a git clone | 15:17 |
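For context, a rough sketch of the thread-dump step corvus used here: Zuul writes a stack trace for every thread to its debug log when it receives SIGUSR2, which is how the stuck merger thread was identified. The process lookup below is an assumption about the deployment, not a command taken from this log.

```
# Send SIGUSR2 to the running executor so Zuul dumps per-thread stack traces
# into its debug log (the pgrep pattern is a guess at the process name):
sudo kill -USR2 "$(pgrep -f zuul-executor | head -n1)"
# Then read the dump:
sudo less /var/log/zuul/executor-debug.log
```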
| clarkb | that is all very unexpected. But it does explain why it is still waiting if the merge is still running things | 15:19 |
| clarkb | corvus: I need to pop out for breakfast things for about 15-20 minutes but I can help when I get back if there is anything helpful to do | 15:19 |
| corvus | i'm going to need to do some guessing to find the command | 15:19 |
| corvus | clarkb: ack | 15:19 |
| corvus | clarkb: it completed. perhaps the signal dislodged stuck pipes. | 15:26 |
| corvus | we don't annotate those logs, so it's hard to trace what happened. | 15:27 |
| corvus | 2025-08-28 09:51:23,986 DEBUG zuul.Repo: Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/openstack%2Fdevstack | 15:27 |
| corvus | 2025-08-28 15:23:14,925 ERROR zuul.Merger: stderr: 'Timeout: the command "git clone ssh://*****@review.opendev.org:29418/openstack/nova /var/lib/zuul/executor-git/opendev.org/openstack/openstack%2Fnova" did not complete in 600 secs.' | 15:27 |
| corvus | presumably some more stuff happened between those times, because we went from devstack to nova | 15:27 |
| corvus | and that nova git command was not that old: | 15:27 |
| corvus | zuuld 522029 2.2 0.1 12016 8972 ? S 15:11 0:15 ssh -i /var/lib/zuul/ssh/id_rsa -o SendEnv=GIT_PROTOCOL -p 29418 zuul@review.opendev.org git-upload-pack '/openstack/nova' | 15:27 |
| corvus | so i suspect it's something (handwaving) like a git subproc stuck or died in such a way that nothing detected it; the sigusr2 woke things up, it moved on to the next thing and ran a git command for nova which happened to time out. | 15:29 |
| corvus | i think we need job id annotations for the executor-mergers; it's easy to follow for the standalone mergers but not the executors | 15:30 |
| clarkb | almost like a lost sigchld got processed by sending the usr2 or something along those lines | 15:44 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 16:31 |
| opendevreview | Merged zuul/zuul-jobs master: Add OIDC support to upload-logs-s3 https://review.opendev.org/c/zuul/zuul-jobs/+/954885 | 16:48 |
| clarkb | corvus: I think ze11 is doing it again with https://zuul.opendev.org/t/zuul/stream/d33a71980f5544c1b887db92e95df201?logfile=console.log (and possibly https://zuul.opendev.org/t/openstack/stream/8c28b602b168440db78e761c5f803568?logfile=console.log) | 17:09 |
| clarkb | though I think those jobs have gotten as far as starting and it's the workspace setup that has hung, not the initial merge | 17:10 |
| clarkb | corvus: I've got an update to the change I'm working on that will evict the first one from the queue. Let me get that all squared away, but then I won't push it so that we can inspect more closely | 17:12 |
| clarkb | ok looking at ze11 overall things seem mostly fine. Disk, load, memory are all within reason. I notice there were zk load object errors related to cleaning up a dib node hold a few seconds before it started trying to process d33a71980f5544c1b887db92e95df201 | 17:18 |
| clarkb | but that should be unrelated to new jobs in a different project | 17:18 |
| clarkb | there is an openstack/nova clone that timed out that seems to still have a process hanging around | 17:20 |
| clarkb | ok that git clone from 17:09 for nova is gone now | 17:22 |
| clarkb | and my job got unstuck (at least it moved to the next thing) | 17:23 |
| corvus | no intervention this time right? | 17:23 |
| clarkb | no intervention. And I think my unstuck comment isn't quite right | 17:24 |
| clarkb | it logged that it is finishing jobs including that one with a timestamp of 17:09 | 17:24 |
| clarkb | so I think it may have just been a race for reading the logs and that message being emitted when I first looked at it | 17:24 |
| corvus | grep Timeout /var/log/zuul/executor-debug.log\|grep nova\|grep Merger\|wc -l | 17:25 |
| corvus | 24 | 17:25 |
| clarkb | I'm wondering if maybe the merger there decided it needed to fully reclone nova and our 10 minute timeout is no longer long enough for that repo? and then some issue with sequencing/serialization/whatever in the merger on the executor is backing things up? | 17:25 |
| corvus | yes i'm thinking along the same lines | 17:25 |
| corvus | this action that is timing out is the repo cache update; it is serialized | 17:26 |
| clarkb | should we stop the merger there and increase the timeout and/or do a manual clone while the executor is stopped? | 17:26 |
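If the "increase the timeout" route were taken, the knob would look roughly like this; the option name is from Zuul's merger configuration and the config path is the usual one, but both should be treated as assumptions and verified against the deployed version.

```
# Inspect the current merger settings on the executor; raising the clone/fetch
# timeout would mean setting something like git_timeout=1200 in this section
# and restarting the executor afterwards.
grep -A5 '^\[merger\]' /etc/zuul/zuul.conf
```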
| corvus | before we do that, i want to see if we have these errors for any other repos or any other executors | 17:27 |
| clarkb | ++ just let me know what i can do to help if anything | 17:27 |
| corvus | clarkb: can you use bridge to check for timeouts on other executors? | 17:27 |
| corvus | grep Timeout\|grep Merger should do it | 17:28 |
| clarkb | ya let me see | 17:28 |
| clarkb | `ansible zuul-executor -m shell -a 'grep Timeout /var/log/zuul/executor.log \|grep Merger'` produces only results for ze11 | 17:30 |
| corvus | there were 25 attempts to update the nova repo; and 24 failures; i think that's basically a 100% failure rate (since one might be running now) | 17:30 |
| corvus | good, then it seems like we have an executor-local problem | 17:30 |
| corvus | i think we should stop (graceful or hard would be fine i think). then inspect. | 17:30 |
| clarkb | sounds good to me. Do you want to do that or should I and if I'm doing it any preference for shutdown type? | 17:30 |
| corvus | clarkb: you can; no pref | 17:31 |
| clarkb | I think it's probably not going to run jobs anyway, so a hard stop without waiting for timeouts makes sense to me. To do that I can just docker-compose down, right? | 17:32 |
| clarkb | yes looks like that is what stop.yaml does in the executors. I'm proceeding with that now | 17:32 |
| clarkb | this is done | 17:33 |
| clarkb | I don't think our hourly jobs will automatically restart the executor, but I can put ze11 in the emergency file if we're worried about that | 17:34 |
| clarkb | probably worst case we just stop it again and then put it in the emergency file | 17:34 |
| clarkb | those two jobs I identified flipped back to 2nd attempt queued so that's good | 17:34 |
| corvus | i don't think emergency is necessary | 17:35 |
| corvus | yeah, the nova repo is gone, so ze11 was in its "blow it away and re-clone" phase | 17:36 |
| corvus | i'm going to run similar clone to a temp dir in a screen session and time it | 17:36 |
| clarkb | ++ | 17:37 |
| corvus | took a bit but running now | 17:41 |
| corvus | i don't see anything terribly alarming in the cacti graphs for ze11 | 17:42 |
| clarkb | I wonder if we're just seeing "nova is fairly large with many many refs and gerrit and ze11 are in different clouds half a continent away from each other" slowness | 17:43 |
| clarkb | and our timeout is simply not long enough? | 17:43 |
| corvus | yeah, good possibility. when it's done, let's repeat this on a different ze, without stopping. | 17:44 |
| clarkb | good idea | 17:44 |
| corvus | i haven't checked the gerrit server to see how loaded it is | 17:44 |
| clarkb | both system load and memory utilization look fine | 17:45 |
| corvus | clarkb: cacti hasn't been able to get a reliable reading from review03 for about a day | 17:45 |
| clarkb | interesting. Maybe that is related as cacti and ze11 are in the same cloud region | 17:46 |
| clarkb | could be that git is trying ipv6 first and waiting for a long timeout, then switching to ipv4, but the combo is taking too long | 17:46 |
| corvus | the git clone seemed responsive, just slow | 17:46 |
| clarkb | ack | 17:46 |
| corvus | like it started right away | 17:47 |
| corvus | it's just been receiving objects for most of that time at 1-2 MiB/s | 17:47 |
| corvus | but cacti is snmp over udp i think | 17:47 |
| corvus | clarkb: do we have anything else in gerrit's cloud region? | 17:48 |
| clarkb | corvus: mirror.ca-ymq-1.vexxhost.opendev.org and maybe one of the ns servers | 17:49 |
| corvus | okay maybe we next clone from ze01 then from the mirror (though that will have to be an anonymous http clone instead of git) | 17:50 |
| corvus | er instead of ssh | 17:50 |
| clarkb | fwiw the recent lists backup failures seem unrelated. Those errors are due to mailman web caches changing under the backup process. We can probably exclude the web caches entirely | 17:50 |
| clarkb | I just saw that email come in and wondered if we failed to do a rax -> vexxhost backup but no it was rax to rax and due to server internal behaviors | 17:51 |
| clarkb | 1-2MiB/s is similar to what i get from the giteas when I clone from there fwiw | 17:52 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 17:53 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update cri-o packge location https://review.opendev.org/c/zuul/zuul-jobs/+/958800 | 17:53 |
| corvus | 162.98user 21.47system 13:17.54elapsed 23%CPU (0avgtext+0avgdata 152348maxresident)k | 17:54 |
| corvus | (wow what happened to the time format?) | 17:54 |
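A likely answer to the time-format question (an inference, not something confirmed in the log): because the clone command started with a GIT_SSH_COMMAND=... assignment, bash does not treat the following `time` as its reserved word, so the external /usr/bin/time ran and printed GNU time's default format.

```
# bash keyword vs external GNU time, a minimal illustration:
time sleep 1              # shell keyword: "real 0m1.004s" style output
FOO=bar time sleep 1      # external /usr/bin/time: "0.00user 0.00system 0:01.00elapsed ..." style
```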
| clarkb | 13:17 is > 600 seconds so at least we can reproduce? | 17:54 |
| corvus | yep. running same on ze01 now | 17:55 |
| corvus | 14MiB/s | 17:55 |
| corvus | oh.... | 17:55 |
| corvus | it slowed down a lot after 80% | 17:55 |
| corvus | i didn't notice if ze11 did that... might be worth repeating and watching more closely | 17:55 |
| corvus | let's see if ze01 ends up with the same time or not | 17:56 |
| corvus | at 86% it's down to 5MiB/s | 17:56 |
| clarkb | I think that slowdown is normalish as you can get packfiles immediately then you're getting refs afterwards and it slows down | 17:56 |
| clarkb | particularly for active repos where you'll have unpacked refs like nova | 17:56 |
| clarkb | so the total runtime is probably more informative | 17:57 |
| corvus | agree | 17:57 |
| clarkb | according to my email nova was repacked ~13 hours ago and the repack process had a lot to say about nova | 17:58 |
| clarkb | (but eventually reported it was done) | 17:58 |
| corvus | it never dropped below 5. total time on ze01 was 3:41 | 17:58 |
| clarkb | just to rule out potential failure to repack over time making things sad | 17:58 |
| clarkb | cool I think the ze01 time also indicates it is unlikely to be repacking problems but probably more network related? | 17:59 |
| corvus | yeah, seems like a ze11<->review03 problem | 17:59 |
| corvus | anonymous http clone from mirror looks like ze01 so far | 18:00 |
| corvus | clarkb: should we just leave ze11 offline and see if the internet is better tomorrow? | 18:00 |
| clarkb | corvus: sure. I don't think we need all 12 executors right now | 18:01 |
|  | * clarkb makes a note to check on it | 18:01 |
| corvus | clarkb: want to add to emergency in that case? | 18:01 |
| clarkb | yup I'll do that | 18:01 |
| corvus | clarkb: i ran `GIT_SSH_COMMAND='ssh -i /var/lib/zuul/ssh/id_rsa' time git clone ssh://zuul@review.opendev.org:29418/openstack/nova` in a shell as the zuuld user | 18:02 |
| corvus | mirror clone finished in 3:28. so i think 3-4 minutes is our normal "clone nova from gerrit" time. | 18:03 |
| corvus | clarkb: maybe ze11 has noisy neighbors (network or otherwise) | 18:03 |
| clarkb | ze11 is in the emergency file now | 18:04 |
| clarkb | and I have a todo item for tomorrow's todo list to followup on it | 18:04 |
| clarkb | re 3-4 minutes being normal: given that, 600 seconds for the timeout seems reasonable | 18:06 |
| clarkb | about 2.5x the normal expectation | 18:06 |
| clarkb | corvus: do you want to try forcing ipv4 on ze11 too? | 18:07 |
| clarkb | and rerun the test there I mean with git clone -4 or using the ip address or something | 18:08 |
| corvus | sure | 18:09 |
| clarkb | looks like -4 is only an option to fetch not clone so may need to use the ip addr | 18:09 |
| corvus | looking better so far with -4 | 18:10 |
| clarkb | oh maybe it's just undocumented in the manpage and -4 does work? | 18:10 |
| corvus | zuuld 535644 5.5 0.1 19204 12556 pts/2 S+ 18:09 0:09 ssh -i /var/lib/zuul/ssh/id_rsa -o SendEnv=GIT_PROTOCOL -4 -p 29418 zuul@review.opendev.org git-upload-pack '/openstack/nova' | 18:12 |
| corvus | looks like it passes it through to ssh | 18:12 |
| corvus | but now that i think about it, i could have also put the -4 in the GIT_SSH_COMMAND | 18:13 |
| corvus | finished in 3:38 | 18:13 |
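For reference, the GIT_SSH_COMMAND variant corvus mentions above would look roughly like this, mirroring the command he pasted at 18:02; the destination directory is illustrative.

```
# IPv4-only clone by forcing -4 in the ssh invocation rather than passing -4 to git:
GIT_SSH_COMMAND='ssh -4 -i /var/lib/zuul/ssh/id_rsa' \
  time git clone ssh://zuul@review.opendev.org:29418/openstack/nova /tmp/nova-v4-test
```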
| clarkb | ok so presumably that means there is some ipv6 slowdown between ze11 and review03 that isn't present on ze01. | 18:13 |
| corvus | seems like it | 18:13 |
| corvus | i'm repeating the clone without -4. it slows to 1MiB/s at around 30% | 18:14 |
| corvus | (i also wanted to make sure it was still happening) | 18:14 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix kubernetes install methods https://review.opendev.org/c/zuul/zuul-jobs/+/958800 | 18:25 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 18:25 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix kubernetes install methods https://review.opendev.org/c/zuul/zuul-jobs/+/958800 | 18:41 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 18:42 |
| opendevreview | Clark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io https://review.opendev.org/c/opendev/grafyaml/+/958601 | 19:38 |
| clarkb | corvus: ^ that is the depends-on to check that the zuul-jobs changes correct the problem the opendev builds see | 19:38 |
| clarkb | basically the sanity check before we dive into complicated test updates | 19:38 |
| clarkb | ok, sanity checking doesn't work; we are still using the old docker build command. I suspect that this is because the roles are used in a trusted repo, so depends-on isn't updating their contents | 19:51 |
| clarkb | I'm thinking it probably doesn't make sense to make a test only copy of the role and do a whole base-test dance or whatever. Instead we should probably just take the leap of faith that given what we know this will work and build out a test case to prove it | 19:51 |
| clarkb | corvus: ^ let me know if you have thoughts on that. | 19:52 |
| clarkb | (mostly I think setting up this test case may be somewhat complicated so I was hoping to avoid investing a ton of effort if we could show it wouldn't work in the first place) | 19:52 |
| corvus | clarkb: agree. sorry i didn't catch that earlier... there's a non-zero chance we could hack together a system-config change that did what we need -- we could probably just override the run playbook of that job. if you want to burn a few minutes on that, that might salvage the idea. but any more than that and i'd agree it's better to just switch to z-j | 20:07 |
| corvus | https://zuul.opendev.org/t/openstack/build/e26cd572f18447d69cca0bbbead562ba/console | 20:07 |
| corvus | (i'm looking at what the run playbook is doing there -- and wondering if that's enough for our testing) | 20:07 |
| clarkb | corvus: I started digging into zuul-jobs and I think I can do this more easily than I thought | 20:10 |
| clarkb | corvus: I'm doing some light hacking up of the existing buildset registry tests | 20:10 |
| corvus | ++ | 20:10 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 20:17 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 20:17 |
| clarkb | corvus: I put the new tests first so that we can see them fail. I've also got a todo in the test change that you might want to check if you think it is necessary or not. I think for this first pass it's not a big deal | 20:18 |
| clarkb | and actually if the test change passes I think it's due to the todo. So this is a good sanity check | 20:22 |
| clarkb | ok hit a snag. We pull the image with docker except that docker proper never learned how to have mirrors for different registries | 20:36 |
| clarkb | however, this is the code where I have the TODO anyway so maybe I need to address this upfront | 20:36 |
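Background on the snag (general Docker behaviour, not something stated in this log): dockerd's mirror configuration only ever applies to Docker Hub pulls, so there is no daemon-level way to redirect quay.io images to a buildset or intermediate registry; that redirection has to happen in the roles themselves. The registry URL below is illustrative.

```
# The daemon-level mirror setting is consulted only for docker.io/... images;
# a docker pull of quay.io/... always goes to the upstream registry.
cat /etc/docker/daemon.json
# {
#   "registry-mirrors": ["https://buildset-registry.example:5000"]
# }
```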
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 20:42 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 20:42 |
| corvus | clarkb: but that sequence (around the todo) does work for docker if using docker.io images right? | 20:44 |
| clarkb | corvus: yes | 20:44 |
| corvus | digging more, it looks like we have previous_build_repository: docker.io/upstream/image | 20:45 |
| clarkb | corvus: ya my change switches that otherwise we aren't exercising the behavior we want in the image builds | 20:45 |
| corvus | got it missed that, and yes that's expected | 20:45 |
| clarkb | it's possible we want one set of jobs for docker.io and another for quay.io. I'm not sure where the return starts to drop off as we explode the matrix | 20:46 |
| corvus | do we need to 2x the .. yes that ^ | 20:46 |
| clarkb | we can think about that while we get the quay.io version working | 20:46 |
| clarkb | basically the status quo is docker.io and everything works because magic | 20:47 |
| corvus | i'm leaning toward "yes" as long as there's something in the docker.io path that isn't covered by the quay.io path | 20:47 |
| corvus | because we've managed to keep the docker.io path working for many years and it'd be a shame to break that | 20:47 |
| clarkb | ya I think the main thing is using docker with docker.io and getting speculative behaviors on the consumption side | 20:47 |
| clarkb | ack I can add another axis to the matrix to keep coverage of docker.io speculative behavior with docker | 20:48 |
| corvus | with the changes we're talking about, all of the "docker build" stuff moves to the quay.io path... i don't have a good handle on what else in the docker.io path would still be uncovered. | 20:48 |
| clarkb | also looks like my manual editing of the zuul config is running into problems with the auto generated jobs lists. I didn't think these were autogenerated but I'm putting comments in at a higher level and maybe those get eaten... | 20:50 |
| corvus | clarkb: looking more, i'm kind of thinking: maybe everything in that test is in the quay.io build path and maybe we don't need to 2x the jobs | 20:52 |
| corvus | like the only things i see after that point are build-X-image, and push-to-intermediate-registry which i think is all skopeo | 20:52 |
| clarkb | corvus: ya I guess if docker image builds work there with quay.io then we can expect them to work with docker.io. And we're not really doing any docker run payload so that would be covered elsewhere? | 20:53 |
| corvus | yep that's my thinking (because "docker image builds" will be "docker buildx" in all cases after this set of changes) | 20:53 |
| corvus | clarkb: the jobs aren't autogenerated, but that file is rewritten and there is a ruamel bug that will eat some comments | 20:55 |
| corvus | clarkb: just turn that comment into a "description:" :) | 20:55 |
| clarkb | will do | 20:55 |
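A rough illustration of the suggestion above (the job name and wording are made up for the example): keep the explanatory text in the job's description field, which the YAML rewrite preserves, rather than in a comment the ruamel round-trip may drop.

```
# Sketch only -- shows the shape of the job entry, not the real zuul-jobs config:
cat <<'EOF'
- job:
    name: zuul-jobs-test-registry-docker-quay
    description: >-
      New quay.io-based test; expected to fail until the buildx change lands.
EOF
```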
| clarkb | and i think we just saw our expected failure in the parent job for one of the jobs | 20:56 |
| clarkb | I need to pop out for a school run momentarily, but when I get back I'll fix the linter issue and repush with the current order so that we can see nothing fail except the expected failures. I might make them non-voting too so that we can land that change first, then land the followup change to fix things and set the jobs to voting | 20:57 |
| clarkb | I think that gives a nice iterative story to what is going on behind the scenes should we need to look into this again (it has been the never ending saga so I wouldn't be surprised if that happens) | 20:57 |
| clarkb | corvus: looks like quay.io/upstream does exist but quay.io/upstream/image does not. Do we want to use a different org (like opendevorg even maybe?) to have more control over whether or not the image exists? | 22:01 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 22:06 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 22:06 |
| clarkb | that is easy to change in a followup so I'll ignore it for now | 22:06 |
| corvus | clarkb: yeah i think quay.io/opendevorg/imagedoesnotexist would be a good idea for that | 22:24 |
| clarkb | ack. I'm working on that update as well as addressing that TODO and fixing the new linter problem | 22:31 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 22:35 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 22:35 |
| clarkb | I feel like this is getting very close to its final state | 22:35 |
| clarkb | and just in time for quay's cdn to start having a sad. But it looks like it is working otherwise. I think this is ready for review | 22:52 |
| corvus | clarkb: that looks swell! thanks! | 22:59 |
| clarkb | corvus: I'm realizing that my commit message on the change should be updated as I did update tests | 23:00 |
| clarkb | I'm going to go ahead and fix that now because it will bother me otherwise | 23:00 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 23:03 |
| clarkb | that is just a commit message edit so should be a quick rereview | 23:03 |
| corvus | done | 23:04 |
| clarkb | https://status.redhat.com/ doesn't seem to be aware yet of the quay.io cdn issues. I'll probably hold off on rechecking anything as that latest patchset just failed on the same issue too | 23:14 |
| clarkb | try again in the morning | 23:14 |
| opendevreview | Merged zuul/zuul-jobs master: Remove Artifactory cleanup playbook https://review.opendev.org/c/zuul/zuul-jobs/+/947041 | 23:39 |