| opendevreview | Tim Burke proposed openstack/project-config master: Have more files trigger test-release-openstack https://review.opendev.org/c/openstack/project-config/+/958709 | 00:11 |
|---|---|---|
| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/957995 | 02:13 |
|  | *** ralonsoh_ is now known as ralonsoh | 08:43 |
| clarkb | corvus: the openstack zuul tenant has a change (953448) stuck in the gate queue waiting on repo state for almost 5 hours now. I suspect that we might end up evicting that change and re-enqueuing it to get around whatever the problem is, but wanted to call it out before doing so as this may be a good debugging opportunity for whatever caused the issue | 14:36 |
| corvus | looks like event 528f5ced64e043369efae94dd96d8ffc for 953448,1 | 14:47 |
| corvus | 2025-08-28 09:51:19,662 INFO zuul.Pipeline.openstack.gate: [e: 528f5ced64e043369efae94dd96d8ffc] Scheduling global repo state for item <QueueItem 00241ac4439e460b83aa20cd5eff8da8 live for [<Change 0x7c8da2a15310 openstack/neutron 953448,1>] in gate> | 15:03 |
| corvus | repo state job 5edc0edc456941eaa5c1c2650e1f5db6 | 15:04 |
| corvus | ze11 | 15:05 |
| corvus | i think that merge job may still be running. going to sigusr2 ze11 | 15:11 |
| corvus | the merger thread is in a git clone | 15:17 |
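For context, a rough sketch of the thread-dump step corvus used here: Zuul writes a stack trace for every thread to its debug log when it receives SIGUSR2, which is how the stuck merger thread was identified. The process lookup below is an assumption about the deployment, not a command taken from this log.

```
# Send SIGUSR2 to the running executor so Zuul dumps per-thread stack traces
# into its debug log (the pgrep pattern is a guess at the process name):
sudo kill -USR2 "$(pgrep -f zuul-executor | head -n1)"
# Then read the dump:
sudo less /var/log/zuul/executor-debug.log
```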
| clarkb | that is all very unexpected. But it does explain why it is still waiting if the merge is still running things | 15:19 |
| clarkb | corvus: I need to pop out for breakfast things for about 15-20 minutes but I can help when I get back if there is anything helpful to do | 15:19 |
| corvus | i'm going to need to do some guessing to find the command | 15:19 |
| corvus | clarkb: ack | 15:19 |
| corvus | clarkb: it completed. perhaps the signal dislodged stuck pipes. | 15:26 |
| corvus | we don't annotate those logs, so it's hard to trace what happened. | 15:27 |
| corvus | 2025-08-28 09:51:23,986 DEBUG zuul.Repo: Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/openstack%2Fdevstack | 15:27 |
| corvus | 2025-08-28 15:23:14,925 ERROR zuul.Merger: stderr: 'Timeout: the command "git clone ssh://*****@review.opendev.org:29418/openstack/nova /var/lib/zuul/executor-git/opendev.org/openstack/openstack%2Fnova" did not complete in 600 secs.' | 15:27 |
| corvus | presumably some more stuff happened between those times, because we went from devstack to nova | 15:27 |
| corvus | and that nova git command was not that old: | 15:27 |
| corvus | zuuld 522029 2.2 0.1 12016 8972 ? S 15:11 0:15 ssh -i /var/lib/zuul/ssh/id_rsa -o SendEnv=GIT_PROTOCOL -p 29418 zuul@review.opendev.org git-upload-pack '/openstack/nova' | 15:27 |
| corvus | so i suspect it's something (handwaving) like a git subproc stuck or died in such a way that nothing detected it; the sigusr2 woke things up, it moved on to the next thing and ran a git command for nova which happened to time out. | 15:29 |
| corvus | i think we need job id annotations for the executor-mergers; it's easy to follow for the standalone mergers but not the executors | 15:30 |
| clarkb | almost like a lost sigchld got processed by sending the usr2 or something along those lines | 15:44 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 16:31 |
| opendevreview | Merged zuul/zuul-jobs master: Add OIDC support to upload-logs-s3 https://review.opendev.org/c/zuul/zuul-jobs/+/954885 | 16:48 |
| clarkb | corvus: I think ze11 is doing it again with https://zuul.opendev.org/t/zuul/stream/d33a71980f5544c1b887db92e95df201?logfile=console.log (and possibly https://zuul.opendev.org/t/openstack/stream/8c28b602b168440db78e761c5f803568?logfile=console.log) | 17:09 |
| clarkb | though I think those jobs have gotten as far as starting and it's the workspace setup that has hung, not the initial merge | 17:10 |
| clarkb | corvus: I've got an update to the change I'm working on that will evict the first one from the queue. Let me get that all squared away, but then I won't push it so that we can inspect more closely | 17:12 |
| clarkb | ok looking at ze11 overall things seem mostly fine. Disk, load, memory are all within reason. I notice there were zk load object errors related to cleaning up a dib node hold a few seconds before it started trying to process d33a71980f5544c1b887db92e95df201 | 17:18 |
| clarkb | but that should be unrelated to new jobs in a different project | 17:18 |
| clarkb | there is an openstack/nova clone that timed out that seems to still have a process hanging around | 17:20 |
| clarkb | ok that git clone from 17:09 for nova is gone now | 17:22 |
| clarkb | and my job got unstuck (at least it moved to the next thing) | 17:23 |
| corvus | no intervention this time right? | 17:23 |
| clarkb | no intervention. And I think my unstuck comment isn't quite right | 17:24 |
| clarkb | it logged that it is finishing jobs including that one with a timestamp of 17:09 | 17:24 |
| clarkb | so I think it may have just been a race for reading the logs and that message being emitted when I first looked at it | 17:24 |
| corvus | grep Timeout /var/log/zuul/executor-debug.log\|grep nova\|grep Merger\|wc -l | 17:25 |
| corvus | 24 | 17:25 |
| clarkb | I'm wondering if maybe the merger there decided it needed to fully reclone nova and our 10 minute timeout is no longer long enough for that repo? and then some issue with sequencing/serialization/whatever in the merger on the executor is backing things up? | 17:25 |
| corvus | yes i'm thinking along the same lines | 17:25 |
| corvus | this action that is timing out is the repo cache update; it is serialized | 17:26 |
| clarkb | should we stop the merger there and increase the timeout and/or do a manual clone while the executor is stopped? | 17:26 |
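If the "increase the timeout" route were taken, the knob would look roughly like this; the option name is from Zuul's merger configuration and the config path is the usual one, but both should be treated as assumptions and verified against the deployed version.

```
# Inspect the current merger settings on the executor; raising the clone/fetch
# timeout would mean setting something like git_timeout=1200 in this section
# and restarting the executor afterwards.
grep -A5 '^\[merger\]' /etc/zuul/zuul.conf
```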
| corvus | before we do that, i want to see if we have these errors for any other repos or any other executors | 17:27 |
| clarkb | ++ just let me know what i can do to help if anything | 17:27 |
| corvus | clarkb: can you use bridge to check for timeouts on other executors? | 17:27 |
| corvus | grep Timeout\|grep Merger should do it | 17:28 |
| clarkb | ya let me see | 17:28 |
| clarkb | `ansible zuul-executor -m shell -a 'grep Timeout /var/log/zuul/executor.log \|grep Merger'` produces only results for ze11 | 17:30 |
| corvus | there were 25 attempts to update the nova repo; and 24 failures; i think that's basically a 100% failure rate (since one might be running now) | 17:30 |
| corvus | good, then it seems like we have an executor-local problem | 17:30 |
| corvus | i think we should stop (graceful or hard would be fine i think). then inspect. | 17:30 |
| clarkb | sounds good to me. Do you want to do that or should I and if I'm doing it any preference for shutdown type? | 17:30 |
| corvus | clarkb: you can; no pref | 17:31 |
| clarkb | I think it's probably not going to run jobs anyway, so a hard stop without waiting for timeouts makes sense to me. To do that I can just docker-compose down, right? | 17:32 |
| clarkb | yes looks like that is what stop.yaml does in the executors. I'm proceeding with that now | 17:32 |
| clarkb | this is done | 17:33 |
| clarkb | I don't think our hourly jobs will automatically restart the executor, but I can put ze11 in the emergency file if we're worried about that | 17:34 |
| clarkb | probably worst case we just stop it again and then put it in the emergency file | 17:34 |
| clarkb | those two jobs I identified flipped back to 2nd attempt queued so that's good | 17:34 |
| corvus | i don't think emergency is necessary | 17:35 |
| corvus | yeah, the nova repo is gone, so ze11 was in its "blow it away and re-clone" phase | 17:36 |
| corvus | i'm going to run similar clone to a temp dir in a screen session and time it | 17:36 |
| clarkb | ++ | 17:37 |
| corvus | took a bit but running now | 17:41 |
| corvus | i don't see anything terribly alarming in the cacti graphs for ze11 | 17:42 |
| clarkb | I wonder if we're just seeing "nova is fairly large with many many refs and gerrit and ze11 are in different clouds half a continent away from each other" slowness | 17:43 |
| clarkb | and our timeout is simply not long enough? | 17:43 |
| corvus | yeah, good possibility. when it's done, let's repeat this on a different ze, without stopping. | 17:44 |
| clarkb | good idea | 17:44 |
| corvus | i haven't checked the gerrit server to see how loaded it is | 17:44 |
| clarkb | both system load and memory utilization look fine | 17:45 |
| corvus | clarkb: cacti hasn't been able to get a reliable reading from review03 for about a day | 17:45 |
| clarkb | interesting. Maybe that is related as cacti and ze11 are in the same cloud region | 17:46 |
| clarkb | could be that git is trying ipv6 first and waiting for a long timeout, then switching to ipv4, but the combo is taking too long | 17:46 |
| corvus | the git clone seemed responsive, just slow | 17:46 |
| clarkb | ack | 17:46 |
| corvus | like it started right away | 17:47 |
| corvus | it's just been receiving objects for most of that time at 1-2 MiB/s | 17:47 |
| corvus | but cacti is snmp over udp i think | 17:47 |
| corvus | clarkb: do we have anything else in gerrit's cloud region? | 17:48 |
| clarkb | corvus: mirror.ca-ymq-1.vexxhost.opendev.org and maybe one of the ns servers | 17:49 |
| corvus | okay maybe we next clone from ze01 then from the mirror (though that will have to be an anonymous http clone instead of git) | 17:50 |
| corvus | er instead of ssh | 17:50 |
| clarkb | fwiw the recent lists backup failures seem unrelated. Those errors are due to mailman web caches changing under the backup process. We can probably exclude the web caches entirely | 17:50 |
| clarkb | I just saw that email come in and wondered if we failed to do a rax -> vexxhost backup but no it was rax to rax and due to server internal behaviors | 17:51 |
| clarkb | 1-2MiB/s is similar to what i get from the giteas when I clone from there fwiw | 17:52 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 17:53 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update cri-o packge location https://review.opendev.org/c/zuul/zuul-jobs/+/958800 | 17:53 |
| corvus | 162.98user 21.47system 13:17.54elapsed 23%CPU (0avgtext+0avgdata 152348maxresident)k | 17:54 |
| corvus | (wow what happened to the time format?) | 17:54 |
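A likely answer to the time-format question (an inference, not something confirmed in the log): because the clone command started with a GIT_SSH_COMMAND=... assignment, bash does not treat the following `time` as its reserved word, so the external /usr/bin/time ran and printed GNU time's default format.

```
# bash keyword vs external GNU time, a minimal illustration:
time sleep 1              # shell keyword: "real 0m1.004s" style output
FOO=bar time sleep 1      # external /usr/bin/time: "0.00user 0.00system 0:01.00elapsed ..." style
```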
| clarkb | 13:17 is > 600 seconds so at least we can reproduce? | 17:54 |
| corvus | yep. running same on ze01 now | 17:55 |
| corvus | 14MiB/s | 17:55 |
| corvus | oh.... | 17:55 |
| corvus | it slowed down a lot after 80% | 17:55 |
| corvus | i didn't notice if ze11 did that... might be worth repeating and watching more closely | 17:55 |
| corvus | let's see if ze01 ends up with the same time or not | 17:56 |
| corvus | at 86% it's down to 5MiB/s | 17:56 |
| clarkb | I think that slowdown is normalish as you can get packfiles immediately then you're getting refs afterwards and it slows down | 17:56 |
| clarkb | particularly for active repos where you'll have unpacked refs like nova | 17:56 |
| clarkb | so the total runtime is probably more informative | 17:57 |
| corvus | agree | 17:57 |
| clarkb | according to my email nova was repacked ~13 hours ago and the repack process had a lot to say about nova | 17:58 |
| clarkb | (but eventually reported it was done) | 17:58 |
| corvus | it never dropped below 5. total time on ze01 was 3:41 | 17:58 |
| clarkb | just to rule out potential failure to repack over time making things sad | 17:58 |
| clarkb | cool I think the ze01 time also indicates it is unlikely to be repacking problems but probably more network related? | 17:59 |
| corvus | yeah, seems like a ze11<->review03 problem | 17:59 |
| corvus | anonymous http clone from mirror looks like ze01 so far | 18:00 |
| corvus | clarkb: should we just leave ze11 offline and see if the internet is better tomorrow? | 18:00 |
| clarkb | corvus: sure. I don't think we need all 12 executors right now | 18:01 |
|  | * clarkb makes a note to check on it | 18:01 |
| corvus | clarkb: want to add to emergency in that case? | 18:01 |
| clarkb | yup I'll do that | 18:01 |
| corvus | clarkb: i ran `GIT_SSH_COMMAND='ssh -i /var/lib/zuul/ssh/id_rsa' time git clone ssh://zuul@review.opendev.org:29418/openstack/nova` in a shell as the zuuld user | 18:02 |
| corvus | mirror clone finished in 3:28. so i think 3-4 minutes is our normal "clone nova from gerrit" time. | 18:03 |
| corvus | clarkb: maybe ze11 has noisy neighbors (network or otherwise) | 18:03 |
| clarkb | ze11 is in the emergency file now | 18:04 |
| clarkb | and I have a todo item for tomorrow's todo list to followup on it | 18:04 |
| clarkb | re 3-4 minutes being normal: given that, 600 seconds for the timeout seems reasonable | 18:06 |
| clarkb | about 2.5x the normal expectation | 18:06 |
| clarkb | corvus: do you want to try forcing ipv4 on ze11 too? | 18:07 |
| clarkb | and rerun the test there I mean with git clone -4 or using the ip address or something | 18:08 |
| corvus | sure | 18:09 |
| clarkb | looks like -4 is only an option to fetch not clone so may need to use the ip addr | 18:09 |
| corvus | looking better so far with -4 | 18:10 |
| clarkb | oh maybe it's just undocumented in the manpage and -4 does work? | 18:10 |
| corvus | zuuld 535644 5.5 0.1 19204 12556 pts/2 S+ 18:09 0:09 ssh -i /var/lib/zuul/ssh/id_rsa -o SendEnv=GIT_PROTOCOL -4 -p 29418 zuul@review.opendev.org git-upload-pack '/openstack/nova' | 18:12 |
| corvus | looks like it passes it through to ssh | 18:12 |
| corvus | but now that i think about it, i could have also put the -4 in the GIT_SSH_COMMAND | 18:13 |
| corvus | finished in 3:38 | 18:13 |
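For reference, the GIT_SSH_COMMAND variant corvus mentions above would look roughly like this, mirroring the command he pasted at 18:02; the destination directory is illustrative.

```
# IPv4-only clone by forcing -4 in the ssh invocation rather than passing -4 to git:
GIT_SSH_COMMAND='ssh -4 -i /var/lib/zuul/ssh/id_rsa' \
  time git clone ssh://zuul@review.opendev.org:29418/openstack/nova /tmp/nova-v4-test
```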
| clarkb | ok so presumably that means there is some ipv6 slowdown between ze11 and review03 that isn't present on ze01. | 18:13 |
| corvus | seems like it | 18:13 |
| corvus | i'm repeating the clone without -4. it slows to 1MiB/s at around 30% | 18:14 |
| corvus | (i also wanted to make sure it was still happening) | 18:14 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix kubernetes install methods https://review.opendev.org/c/zuul/zuul-jobs/+/958800 | 18:25 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 18:25 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix kubernetes install methods https://review.opendev.org/c/zuul/zuul-jobs/+/958800 | 18:41 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 18:42 |
| opendevreview | Clark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io https://review.opendev.org/c/opendev/grafyaml/+/958601 | 19:38 |
| clarkb | corvus: ^ that is the depends-on to check that the zuul-jobs changes correct the problem the opendev builds see | 19:38 |
| clarkb | basically the sanity check before we dive into complicated test updates | 19:38 |
| clarkb | ok, sanity checking doesn't work; we are still using the old docker build command. I suspect that this is because the roles are used in a trusted repo, so depends-on isn't updating their contents | 19:51 |
| clarkb | I'm thinking it probably doesn't make sense to make a test only copy of the role and do a whole base-test dance or whatever. Instead we should probably just take the leap of faith that given what we know this will work and build out a test case to prove it | 19:51 |
| clarkb | corvus: ^ let me know if you have thoughts on that. | 19:52 |
| clarkb | (mostly I think setting up this test case may be somewhat complicated so I was hoping to avoid investing a ton of effort if we could show it wouldn't work in the first place) | 19:52 |
| corvus | clarkb: agree. sorry i didn't catch that earlier... there's a non-zero chance we could hack together a system-config change that did what we need -- we could probably just override the run playbook of that job. if you want to burn a few minutes on that, that might salvage the idea. but any more than that and i'd agree it's better to just switch to z-j | 20:07 |
| corvus | https://zuul.opendev.org/t/openstack/build/e26cd572f18447d69cca0bbbead562ba/console | 20:07 |
| corvus | (i'm looking at what the run playbook is doing there -- and wondering if that's enough for our testing) | 20:07 |
| clarkb | corvus: I started digging into zuul-jobs and I think I can do this more easily than I thought | 20:10 |
| clarkb | corvus: I'm doing some light hacking up of the existing buildset registry tests | 20:10 |
| corvus | ++ | 20:10 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 20:17 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 20:17 |
| clarkb | corvus: I put the new tests first so that we can see them fail. I've also got a todo in the test change that you might want to check if you think it is necessary or not. I think for this first pass it's not a big deal | 20:18 |
| clarkb | and actually if the test change passes I think it's due to the todo. So this is a good sanity check | 20:22 |
| clarkb | ok hit a snag. We pull the image with docker except that docker proper never learned how to have mirrors for different registries | 20:36 |
| clarkb | however, this is the code where I have the TODO anyway so maybe I need to address this upfront | 20:36 |
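Background on the snag (general Docker behaviour, not something stated in this log): dockerd's mirror configuration only ever applies to Docker Hub pulls, so there is no daemon-level way to redirect quay.io images to a buildset or intermediate registry; that redirection has to happen in the roles themselves. The registry URL below is illustrative.

```
# The daemon-level mirror setting is consulted only for docker.io/... images;
# a docker pull of quay.io/... always goes to the upstream registry.
cat /etc/docker/daemon.json
# {
#   "registry-mirrors": ["https://buildset-registry.example:5000"]
# }
```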
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 20:42 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 20:42 |
| corvus | clarkb: but that sequence (around the todo) does work for docker if using docker.io images right? | 20:44 |
| clarkb | corvus: yes | 20:44 |
| corvus | digging more, it looks like we have previous_build_repository: docker.io/upstream/image | 20:45 |
| clarkb | corvus: ya my change switches that otherwise we aren't exercising the behavior we want in the image builds | 20:45 |
| corvus | got it missed that, and yes that's expected | 20:45 |
| clarkb | it's possible we want one set of jobs for docker.io and another for quay.io. I'm not sure where the return starts to drop off as we explode the matrix | 20:46 |
| corvus | do we need to 2x the .. yes that ^ | 20:46 |
| clarkb | we can think about that while we get the quay.io version working | 20:46 |
| clarkb | basically the status quo is docker.io and everything works because magic | 20:47 |
| corvus | i'm leaning toward "yes" as long as there's something in the docker.io path that isn't covered by the quay.io path | 20:47 |
| corvus | because we've managed to keep the docker.io path working for many years and it'd be a shame to break that | 20:47 |
| clarkb | ya I think the main thing is using docker with docker.io and getting speculative behaviors on the consumption side | 20:47 |
| clarkb | ack I can add another axis to the matrix to keep coverage of docker.io speculative behavior with docker | 20:48 |
| corvus | with the changes we're talking about, all of the "docker build" stuff moves to the quay.io path... i don't have a good handle on what else in the docker.io path would still be uncovered. | 20:48 |
| clarkb | also looks like my manual editing of the zuul config is running into problems with the auto generated jobs lists. I didn't think these were autogenerated but I'm putting comments in at a higher level and maybe those get eaten... | 20:50 |
| corvus | clarkb: looking more, i'm kind of thinking: maybe everything in that test is in the quay.io build path and maybe we don't need to 2x the jobs | 20:52 |
| corvus | like the only things i see after that point are build-X-image, and push-to-intermediate-registry which i think is all skopeo | 20:52 |
| clarkb | corvus: ya I guess if docker image builds work there with quay.io then we can expect them to work with docker.io. And we're not really doing any docker run payload so that would be covered elsewhere? | 20:53 |
| corvus | yep that's my thinking (because "docker image builds" will be "docker buildx" in all cases after this set of changes) | 20:53 |
| corvus | clarkb: the jobs aren't autogenerated, but that file is rewritten and there is a ruamel bug that will eat some comments | 20:55 |
| corvus | clarkb: just turn that comment into a "description:" :) | 20:55 |
| clarkb | will do | 20:55 |
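A rough illustration of the suggestion above (the job name and wording are made up for the example): keep the explanatory text in the job's description field, which the YAML rewrite preserves, rather than in a comment the ruamel round-trip may drop.

```
# Sketch only -- shows the shape of the job entry, not the real zuul-jobs config:
cat <<'EOF'
- job:
    name: zuul-jobs-test-registry-docker-quay
    description: >-
      New quay.io-based test; expected to fail until the buildx change lands.
EOF
```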
| clarkb | and i think we just saw our expected failure in the parent job for one of the jobs | 20:56 |
| clarkb | I need to pop out for a school run momentarily, but when I get back I'll fix the linter issue and repush with the current order so that we can see nothing fail except the expected failures. I might make them non-voting too so that we can land that change first, then land the followup change to fix things and set the jobs to voting | 20:57 |
| clarkb | I think that gives a nice iterative story to what is going on behind the scenes should we need to look into this again (it has been the never ending saga so I wouldn't be surprised if that happens) | 20:57 |
| clarkb | corvus: looks like quay.io/upstream does exist but quay.io/upstream/image does not. Do we want to use a different org (like opendevorg even maybe?) to have more control over whether or not the image exists? | 22:01 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 22:06 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 22:06 |
| clarkb | that is easy to change in a followup so I'll ignore it for now | 22:06 |
| corvus | clarkb: yeah i think quay.io/opendevorg/imagedoesnotexist would be a good idea for that | 22:24 |
| clarkb | ack. I'm working on that update as well as addressing that TODO and fixing the new linter problem | 22:31 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Update registry tests to better cover speculative image builds https://review.opendev.org/c/zuul/zuul-jobs/+/958809 | 22:35 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 22:35 |
| clarkb | I feel like this is getting very close to its final state | 22:35 |
| clarkb | and just in time for quay's cdn to start having a sad. But it looks like it is working otherwise. I think this is ready for review | 22:52 |
| corvus | clarkb: that looks swell! thanks! | 22:59 |
| clarkb | corvus: I'm realizing that my commit message on the change should be updated as I did update tests | 23:00 |
| clarkb | I'm going to go ahead and fix that now because it will bother me otherwise | 23:00 |
| opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always build docker images with custom buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/958783 | 23:03 |
| clarkb | that is just a commit message edit so should be a quick rereview | 23:03 |
| corvus | done | 23:04 |
| clarkb | https://status.redhat.com/ doesn't seem to be aware yet of the quay.io cdn issues. I'll probably hold off on rechecking anything as that latest patchset just failed on the same issue too | 23:14 |
| clarkb | try again in the morning | 23:14 |
| opendevreview | Merged zuul/zuul-jobs master: Remove Artifactory cleanup playbook https://review.opendev.org/c/zuul/zuul-jobs/+/947041 | 23:39 |