clarkb | a number of the grafana graphs on the zuul status and afs dashboards are blank (but don't have the typical no data message). I think we may need to hold a node and check if that is an artifact of selenium screenshotting before things can load or a problem with our graphs in 12.0.2 | 00:22 |
clarkb | I can probably do that tomorrow morning | 00:22 |
clarkb | and the project-config sync change failed due to differing node cloud providers https://zuul.opendev.org/t/openstack/build/6abe86a117154b43980ca6d91298ffa8/log/job-output.txt#31-53 so that's still happening but definitely less often than before | 00:26 |
clarkb | now i must find dinner | 00:26 |
corvus | the 0200 periodic jobs appear to be a non-event today after that quota calc fix | 02:48 |
frickler | corvus: the daily cron for "docker exec zuul-scheduler_scheduler_1 zuul-admin export-keys" is failing because the container is now named zuul-scheduler-scheduler-1. fallout from the docker compose upgrade? | 04:13 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 07:04 |
mnasiadka | frickler: in https://review.opendev.org/c/opendev/zuul-providers/+/953908?tab=change-view-tab-header-zuul-results-summary we seem to have the same issues - let me focus on getting the DIB retry patch working properly | 07:13 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:28 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 07:29 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:40 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:46 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:57 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 08:08 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:04 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:15 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:27 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:49 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: source-repositories: Increase http.postBuffer https://review.opendev.org/c/openstack/diskimage-builder/+/954025 | 10:24 |
mnasiadka | frickler: I'm starting to have a feeling it's not retries we need, on the latest version of the retries patch I'm getting "error: unable to rewind rpc post data - try increasing http.postBuffer" on the openstack/nova repository - constantly ;-) | 10:27 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: source-repositories: Increase http.postBuffer https://review.opendev.org/c/openstack/diskimage-builder/+/954025 | 10:30 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 10:30 |
mnasiadka | Let's see if that helps - but git docs say it rather should not be used: https://git-scm.com/docs/git-config#Documentation/git-config.txt-httppostBuffer | 10:42 |
mnasiadka | Well, cloning opendev.org/openstack/nova.git seems extremely slow on counting and compressing objects | 10:58 |
frickler | mnasiadka: is this only for the images which don't have the cached repos yet? then I fear that this is the issue I was predicting a long time ago: without a shared cache it will be difficult to build new images from scratch | 11:44 |
Clark[m] | All of our image builds should occur where we have cached images. | 12:40 |
Clark[m] | mnasiadka: I think the gitea server should support http 1.1 and chunked transfers. But maybe there is a bug there? | 12:42 |
Clark[m] | frickler: yes the _ to - conversion in the container names is a side effect of the docker-compose to docker compose transition | 12:42 |
Clark[m] | Maybe we should consider holding an image build test node so that we can test git fetches from the node to gitea and check things like what protocol versions are used and how the cache is being used. | 12:44 |
Clark[m] | Also cross check against the nodepool image builds | 12:45 |
mnasiadka | Clark[m]: setting it to an unreasonably high value did not help, I’m thinking it’s some bug in gnutls maybe? | 12:46 |
Clark[m] | I suppose that could be possible. We updated gitea semi-recently, which I think was the first update after bookworm's big package update? I noted an opendev deployment job failed to fetch a repo yesterday for similar reasons. Maybe we need to be looking closer at gitea | 12:49 |
Clark[m] | Cross checking the nodepool image build logs is probably a good first step then looking at the gitea issue tracker to see if others have hit similar | 12:50 |
mnasiadka | I think the weird thing is a retry doesn’t help (at least not on the Nova repository) but I’ll try reproducing it locally | 12:54 |
Clark[m] | Oh another thought for testing would be to try different backends. They can be addressed using https://gitea09.opendev.org:3081 through gitea14 | 12:56 |
Clark[m] | Maybe one particular backend is a problem and that explains the infrequency | 12:56 |
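For reference, a direct-to-backend test along the lines Clark describes might look like the following; it bypasses the haproxy frontend entirely, the hostname and port come from the message above (swap in gitea10 through gitea14 to compare backends), and the destination path is just illustrative.

```bash
# Clone straight from one gitea backend, skipping the opendev.org load balancer;
# gitea09..gitea14 on port 3081 are the backends mentioned above.
git clone https://gitea09.opendev.org:3081/openstack/nova /tmp/nova-gitea09
```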
corvus | anyone set a hold yet, or should i? | 13:01 |
Clark[m] | I haven't. I think you should | 13:04 |
fungi | looks like we have at least one xenial-based job still set to run on openstack/project-config, resulting in "Unable to freeze job graph: The nodeset "ubuntu-xenial" was not found." https://review.opendev.org/c/openstack/project-config/+/950771 | 13:06 |
fungi | i'll see if i can track it down after i get off my current conference call | 13:06 |
corvus | https://zuul.opendev.org/t/opendev/autoholds set for all the non-arm64 jobs (because we only have 15 arm64 nodes) | 13:07 |
corvus | and i've re-enqueued https://zuul.opendev.org/t/opendev/buildset/a014c19f845a483eb6b47c6d9075b749 | 13:07 |
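As a rough illustration of the autoholds corvus mentions, a zuul-client invocation along these lines could set one; the project, job name, and reason below are placeholders rather than the actual holds, which may also have been created through the web API.

```bash
# Hypothetical example only: hold the first failing build of one image-build job
# in the opendev tenant so the node survives for debugging.
zuul-client --zuul-url https://zuul.opendev.org autohold \
  --tenant opendev \
  --project opendev/zuul-providers \
  --job opendev-build-diskimage-ubuntu-noble \
  --reason "debugging git fetch failures during image builds" \
  --count 1
```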
corvus | fungi: project-config-bindep-fallback-ubuntu-xenial looks suspicious | 13:09 |
fungi | aha, good eye | 13:10 |
fungi | looks like we dropped it from bindep but not project-config | 13:10 |
*** cloudnull109 is now known as cloudnull10 | 13:11 | |
fungi | i notice publish-openstack-sphinx-docs-base also uses ubuntu-xenial | 13:13 |
corvus | (but it says it's deprecated) | 13:14 |
fungi | yeah, though it's the parent of publish-openstack-sphinx-docs which is in the pipelines of a number of projects, apparently according to codesearch | 13:14 |
fungi | trying to see if they're overriding it | 13:14 |
corvus | i don't see any: https://codesearch.opendev.org/?q=publish-openstack-sphinx-docs&i=nope&literal=nope&files=&excludeFiles=&repos= | 13:16 |
fungi | oh, it's commented out in some zuul configs, i misread | 13:17 |
fungi | okay, i'll try removing it too | 13:17 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop lingering ubuntu-xenial bindep job https://review.opendev.org/c/openstack/project-config/+/954035 | 13:20 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop publish-openstack-sphinx-docs and parent https://review.opendev.org/c/openstack/project-config/+/954036 | 13:20 |
corvus | maybe need to do both of those in one change | 13:21 |
fungi | i'm hoping the lack of use of the second will make that possible to split | 13:22 |
fungi | but i'll squash them if necessary, yes | 13:22 |
fungi | yeah, looks like i do need to | 13:23 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop lingering ubuntu-xenial jobs https://review.opendev.org/c/openstack/project-config/+/954035 | 13:26 |
corvus | okay that looks good now... but since it's a config-project we can't get it through gate. we could either add the nodeset back, or force-merge it. i vote force-merge. | 13:28 |
fungi | oh, yep | 13:28 |
fungi | can do in a sec | 13:29 |
Clark[m] | Ya force merge seems fine for situations where we remove jobs | 13:32 |
Clark[m] | Unlikely to break anything worse | 13:32 |
opendevreview | Merged openstack/project-config master: Drop lingering ubuntu-xenial jobs https://review.opendev.org/c/openstack/project-config/+/954035 | 13:37 |
fungi | i've rechecked https://review.opendev.org/950771 to make sure things are working again | 13:39 |
fungi | and it enqueued, so we should be all set now | 13:40 |
corvus | i started some notes on one of the previous failed image builds: https://etherpad.opendev.org/p/u6w-_J4CFOZ9eNJQmYDB | 13:51 |
corvus | that's a link to the build, and the haproxy logs that correspond to it around the time it failed | 13:52 |
corvus | i think the first 2 lines are for the previous successful update; the failed one starts at :08 | 13:53 |
opendevreview | Merged openstack/project-config master: Comply with Policy for AI Generated Content https://review.opendev.org/c/openstack/project-config/+/950771 | 13:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix zuul scheduler container name in docker exec commands https://review.opendev.org/c/opendev/system-config/+/954051 | 13:53 |
clarkb | corvus: frickler ^ I think that will handle the renamed zuul scheduler container everywhere I found us using the old name | 13:54 |
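For clarity, the rename frickler hit is the underscore-to-hyphen change in compose v2 container names, so the daily export cron needs something like the following sketch; the export destination argument is omitted here, just as it was in the quoted cron snippet.

```bash
# docker compose v2 names the container zuul-scheduler-scheduler-1 (hyphens)
# instead of the old zuul-scheduler_scheduler_1 (underscores).
docker exec zuul-scheduler-scheduler-1 zuul-admin export-keys
```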
clarkb | corvus: ya cD in the haproxy lines means "closed disconnected" iirc | 13:54 |
clarkb | the -- is a normal connection close. Let me see if I can find the haproxy docs on that cD status | 13:55 |
clarkb | ah c not C is "client side timeout expired waiting for the server to send or receive data" | 13:55 |
clarkb | *waiting for the client to send or receive data | 13:56 |
clarkb | then D means the session was in the data phase. So it seems that haproxy (and maybe gitea?) expect the client to keep talking but it doesn't. Maybe that is because the client is busy processing data trying to determine what its next request should be? | 13:57 |
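A quick way to survey how often that termination state shows up, assuming the load balancer logs haproxy to /var/log/haproxy.log (the actual log location may differ) and that the gitea frontend runs in TCP mode where the two-character termination state appears as its own field:

```bash
# Count and sample recent sessions that ended with a client-side timeout during
# the data phase ("cD"), as opposed to normal closes ("--").
sudo grep -c ' cD ' /var/log/haproxy.log
sudo grep ' cD ' /var/log/haproxy.log | tail -n 20
```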
corvus | https://zuul.opendev.org/t/opendev/stream/811315bc55aa4d4a9825d07594491e44?logfile=console.log | 13:58 |
corvus | is fetching nova from gitea10 right now | 13:58 |
clarkb | I see this process in ps listing for gitea10 `/usr/bin/git -c protocol.version=2 -c credential.helper= -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= upload-pack --stateless-rpc /data/git/repositories/openstack/nova.git` | 14:01 |
clarkb | which indicates protocol v2 is used. Not sure if that is helpful or not. | 14:01 |
clarkb | also I was just able to clone nova from scratch to my local machine from gitea09. So this works at least some of the time | 14:02 |
corvus | git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git +: | 14:02 |
clarkb | corvus: your stream log indicates it just failed but the process is still running on gitea10 | 14:02 |
corvus | that's what the dib build host (client) is running | 14:02 |
clarkb | I wonder if the proxy is creating problems and if we let it keep going it would be fine? | 14:03 |
corvus | GIT_HTTP_LOW_SPEED_LIMIT=1000 GIT_HTTP_LOW_SPEED_TIME=300 | 14:03 |
corvus | looks like we set a timeout of 5m which is when this is failing | 14:03 |
corvus | why is a fetch of nova taking 5m though | 14:03 |
corvus | root@npb9e6c9b2f06e4:/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940# du -sh . | 14:04 |
corvus | 231M. | 14:04 |
clarkb | corvus: oh interesting. We set that on the client side? Then the client is giving up early and the proxy records that as a client side timeout? The odd thing is ya why isn't the data transferring sufficiently to keep the connection timeouts happy | 14:05 |
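Putting those pieces together, the failing transfer should be reproducible from a client by rerunning the element's fetch with the same low-speed abort settings quoted above; the cache path is the one from the build node, so adjust it to wherever a copy of the repo lives.

```bash
# Abort if throughput stays below ~1000 bytes/s for 300s, matching the job settings.
GIT_HTTP_LOW_SPEED_LIMIT=1000 GIT_HTTP_LOW_SPEED_TIME=300 \
  git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
  fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git '+*:*'
```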
clarkb | corvus: what does git show master and/or HEAD look like against that repo? Just wondering if it is super stale or maybe invalid somehow | 14:05 |
corvus | commit 43d57ae63d1ecda24d8707b4750d404daadc980f (HEAD -> master) Date: Fri Jun 27 20:29:16 2025 +0000 | 14:06 |
clarkb | ok so both valid and not that old | 14:08 |
corvus | root@23.253.108.26:nova.tgz | 14:08 |
corvus | i tar'd up the nova repo from the dib build if you want to grab a copy | 14:08 |
corvus | this node should hit the autohold, but just in case | 14:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update arm64 base job nodeset https://review.opendev.org/c/opendev/system-config/+/954053 | 14:10 |
clarkb | corvus: thanks I'll fetch that now | 14:10 |
clarkb | thinking out loud here maybe once that node is held we can try and rerun that fetch and strace the process to see what the client is doing? | 14:11 |
corvus | ++. i'm also doing that locally :) | 14:12 |
corvus | actually, i'm just starting with "time ... git fetch..." | 14:12 |
clarkb | 954053 is similar to the xenial removals but in this case it's the lack of arm64 bionic and focal images. We don't need them for system-config anymore so I just cleaned it up instead | 14:12 |
corvus | +2 | 14:13 |
clarkb | oh except I made a silly copy paste error | 14:14 |
corvus | https://paste.opendev.org/show/b39Zq0Rj4G9tv8dgDIoC/ got that running the fetch locally | 14:14 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update arm64 base job nodeset https://review.opendev.org/c/opendev/system-config/+/954053 | 14:14 |
clarkb | timeout client 2m | 14:17 |
stephenfin | clarkb: fungi: Another large bundle of pbr changes starting here, whenever you have time https://review.opendev.org/c/openstack/pbr/+/954040 | 14:17 |
clarkb | this is what we set in the haproxy config for gitea-lb. I half wonder if it just takes git more than 2 minutes to figure out what to fetch from the nova repo after getting a listing of the details? | 14:17 |
stephenfin | I'm going to rebase https://review.opendev.org/c/openstack/pbr/+/949055/ on top of it once the base patch (which is in the queue) merges, dropping the second one in the process (which is currently borked) | 14:18 |
clarkb | iirc protocol v2 in particular is all about being smarter about what to negotiate to minimize network traffic, but maybe that requires us to be able to calculate the difference in under 2 minutes? | 14:18 |
clarkb | I hesitate to increase those timeouts without a better understanding simply because that could make things worse as more connections hang around | 14:19 |
clarkb | whereas a clone (which i just did successfully) is more of a copy everything and sort details out later situation | 14:20 |
clarkb | (you just grab the packfiles directly iirc rather than generating delta packfiles I think) | 14:21 |
corvus | in the strace, i think i see git reading a bunch of "have facd8951f7ca4e3dd613d70aba8" lines from a subprocess, sending some data to the remote server after each one, then a "done", then it does a select busy wait until it times out | 14:24 |
corvus | (the checksum changes for each "have" line, that's just an example) | 14:25 |
corvus | at first glance, that seems consistent with "calculating delta takes too long" | 14:25 |
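For anyone following along, a sketch of how to capture that negotiation from the client side; the repo path here matches the /tmp/dibtest copy mentioned later in the log and the output file name is arbitrary, so adjust both as needed.

```bash
# Trace the fetch, then look for the stream of "have <sha>" lines, the "done",
# and the long select() wait corvus describes; -s 256 keeps enough of each string to read.
strace -f -s 256 -o /tmp/nova-fetch.strace \
  git --git-dir=/tmp/dibtest/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
  fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git '+*:*'
grep -c 'refs/changes' /tmp/nova-fetch.strace
```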
Clark[m] | Thinking out loud again: I wonder if just cloning the repo over again is a good fallback? Or we can try increasing the timeout on haproxy and/or the client | 14:27 |
Clark[m] | Neither are likely to be super efficient, but maybe good enough? | 14:28 |
corvus | i'm running the dib git fetch command on a brand new nova clone... give me a minute and i'll tell you how that goes | 14:28 |
Clark[m] | Also maybe those caches grow crusty over time and need a repack | 14:29 |
Clark[m] | I think with long lived servers managing the data git does that automatically. Not sure if we get the same behavior with this more one shot approach | 14:29 |
corvus | that fetch command failed with the tls error on a brand new clone | 14:30 |
corvus | did, erm, dib change anything about git fetching recently? | 14:31 |
Clark[m] | Ok so repack is unlikely to help. Falling back to a clone may work but we'll always reclone | 14:31 |
corvus | because aren't we also looking at a difference between release and git master? | 14:31 |
Clark[m] | No changes to dib git stuff other than what mnasiadka has pushed to try and make this better that I am aware of | 14:32 |
Clark[m] | All of the updates have been to adding distro support recently not changing the core caching elements | 14:32 |
corvus | so just a plain "git fetch" returns in 1.45 seconds | 14:33 |
corvus | i think we need to understand what that particular fetch command is doing better | 14:33 |
Clark[m] | Could be that the nova repo has changed in such a way to tickle the behavior in git. Or maybe the bookworm updates that we may have pulled into the latest gitea upgrade updated git? | 14:33 |
Clark[m] | ++ if regular fetch is fine then let's understand that | 14:33 |
corvus | i agree with all those questions about git -- but the variable i still don't understand is nightly jobs mostly work | 14:34 |
Clark[m] | The +*:* means update every ref? | 14:34 |
corvus | that would mean all of the refs/changes too; that's not going to be in the clone | 14:35 |
Clark[m] | Oohhh | 14:35 |
Clark[m] | Ya we may need something that restricts to refs/not changes ? But maybe that is the difference | 14:35 |
Clark[m] | We should confirm that is what +: means | 14:36 |
corvus | still struggling with "how did this ever work" | 14:36 |
Clark[m] | Maybe gitea didn't advertise those change refs because it has long been ignorant of them. But after the recent upgrade that possibly changed? | 14:37 |
corvus | okay, that's plausible... nightly jobs... maybe gitea runs just a bit faster then? | 14:37 |
Clark[m] | Ya could be we sneak under timeout values when things are quieter | 14:38 |
Clark[m] | Maybe we should test this with a much smaller repo and see what the behavior is and see if that confirms some of these assumptions/ideas | 14:38 |
corvus | [pid 1637860] newfstatat(AT_FDCWD, "/tmp/dibtest/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git/refs/changes/99/99699/4", 0x7ffd64ffb720, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) | 14:38 |
corvus | tons of those in my strace ^ | 14:38 |
corvus | so the client is being told about the refs/changes and is checking to see if it has them | 14:39 |
Clark[m] | And is building a big list of things to request when done and then we timeout probably | 14:39 |
corvus | sounds plausible | 14:39 |
Clark[m] | https://github.com/git/git/commit/8e3ec76a20d6abf5dd8ceb3f5f2c157000e4c13e | 14:40 |
Clark[m] | I think we can set a negative refspec for refs/changes in dib and do a depends on to test that | 14:41 |
corvus | ++ | 14:41 |
Clark[m] | It'll be about half an hour before I can write that change. I'm happy for someone else to beat me to it | 14:42 |
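A rough sketch of the change Clark is proposing, assuming the negative refspec support from the linked git commit (git 2.29 or newer); the exact invocation gets worked out further down, and ${CACHE_PATH} mirrors the variable the dib element already uses.

```bash
# Exclude Gerrit change refs from the otherwise catch-all refspec so the client
# never negotiates over refs/changes/* it will never have locally.
git -C ${CACHE_PATH} fetch -q --prune --update-head-ok \
  https://opendev.org/openstack/nova.git '^refs/changes/*' '+*:*'
```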
mnasiadka | I don't know if from a better geographical location it's the same - but doing just "git clone https://opendev.org/openstack/nova" stalls a lot on the initial GET/POST of git-upload-pack / compressing objects, and git clone from GitHub is instantly just fetching objects | 14:43 |
corvus | one problem with this hypothesis: | 14:43 |
corvus | https://zuul.opendev.org/t/opendev/build/2e3e056fb57b46d2bbd70682008ff29f/log/job-output.txt#3277 | 14:43 |
corvus | that update of nova from last night took 2 seconds | 14:44 |
corvus | that's not just skating by... there's something different | 14:44 |
Clark[m] | Hrm | 14:46 |
mnasiadka | Well, the nova problem surfaced today from my perspective | 14:46 |
mnasiadka | So it might be totally different issue that we observed in previous days | 14:46 |
corvus | oh, so something may have changed in the last 12 hours | 14:47 |
Clark[m] | mnasiadka GitHub may cache checkpoint state for each repo allowing them to immediately start pulling that state whereas gitea (and just a normal git based server) has to construct at least some info iirc | 14:49 |
Clark[m] | I don't think it is abnormal to have delays while the data is prepared | 14:49 |
mnasiadka | that might make sense, just an observation why I normally clone from GitHub ;-) | 14:49 |
Clark[m] | The problem here seems to be more that the negotiation goes beyond what our timeouts allow for | 14:49 |
mnasiadka | anyway, it might be some package version that has been bumped in the last 12hrs - previously we've seen multiple errors like connection timed out and others - now all jobs break on nova with the same error | 14:50 |
corvus | unfortunately, we move the image-cached git repos into the dib cache at the start of the image build jobs, so on this host, i can't inspect the original state | 14:52 |
corvus | okay, i logged into a random host running a normal job, and the nova repo looks the same as on the held node for the failed image build | 14:54 |
corvus | 231M size, no change refs | 14:55 |
corvus | same head commit | 14:55 |
corvus | the successful image build from last night has not been uploaded (i suspect that's a separate problem, not highest priority); but that means that all our jobs, including the one i just looked at, and our failed image builds, are all probably running on the same image as that successful image build from last night. | 14:58 |
corvus | in other words, i don't think we're looking at a situation where our most recent image build broke something and is causing the current failures. | 14:58 |
corvus | my current thinking: i still don't see a solution other than the negative refs change Clark proposed, but i also don't understand what could have changed since our last successful nightly build. | 15:01 |
corvus | erm, coincidentally, our gitea docker containers are 12 hours old | 15:01 |
corvus | it looks like we restarted gitea because there's a newer mariadb container; i don't think we upgraded gitea though. | 15:06 |
Clark[m] | That makes sense. Another thought I had is it could be the git client version. Nodepool runs on Debian bookworm but these builds use noble iirc. And yet another idea is that this has to do with the timing of git repacking on the giteas. | 15:07 |
Clark[m] | Maybe running near a repack is more likely to succeed because you get more ref data by default fetching one packfile | 15:07 |
corvus | i haven't been looking at nodepool builds at all, so if the git client version changed, it would be because last night's zuul build got one version and the recent builds got another in the job itself (we can check that in the job logs) | 15:08 |
clarkb | corvus: maybe you can try the negative refspec locally in your env that has been reproducing? | 15:13 |
clarkb | something like `^refs/changes/* +:` ? | 15:14 |
clarkb | I also notice that there is a --refetch flag that basically makes git fetch act more like git clone | 15:17 |
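If negotiation turns out to be the expensive part, the --refetch flag Clark mentions (added in git 2.36) skips it and transfers everything as a fresh clone would; this is a hedged comparison point rather than a proposed fix.

```bash
# Refetch all objects without negotiating against the local repo; trades extra
# bandwidth for skipping the "have"/"want" exchange entirely.
git -C ${CACHE_PATH} fetch --refetch -q --prune --update-head-ok \
  https://opendev.org/openstack/nova.git \
  '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
```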
clarkb | I'm trying to reproduce using that tar'd nova repo now as well | 15:20 |
clarkb | I am not able to reproduce on tumbleweed using git version 2.50.0 | 15:21 |
clarkb | corvus: ^ fyi this has me wondering if the git client itself may be at fault somehow | 15:22 |
clarkb | stracing locally I see it processing refs/changes too fwiw. | 15:26 |
clarkb | adding '^refs/changes/*' to my local invocation doesn't seem to change the count of refs/changes showing up in the strace though | 15:32 |
opendevreview | Clark Boylan proposed opendev/zuul-providers master: Test image builds on ubuntu jammy https://review.opendev.org/c/opendev/zuul-providers/+/954058 | 15:37 |
clarkb | that is mostly for information gathering | 15:37 |
clarkb | corvus: another idea for your reproduceable environment: maybe try gitea backends directly which bypass haproxy and see if it ever succeeds | 15:54 |
clarkb | that could inform us if there is any value to increasing timeouts on haproxy | 15:54 |
clarkb | https://zuul.opendev.org/t/opendev/build/fc5901e43f4b4f2da2d3a90c32ac3d0d/log/job-output.txt#5449-5450 this ran on jammy not noble. Looking at haproxy logs it has the same cD client disconnect | 16:06 |
clarkb | I do notice that the connection was made over ipv6 not ipv4 | 16:07 |
clarkb | is it possible that we have ipv6 connectivity issues to the gitea load balancer? | 16:07 |
clarkb | (so many variables) but that may explain why I can't reproduce locally since ipv4 only | 16:07 |
corvus | i'm v4 only. i've tried some direct backend fetches and they time out | 16:11 |
corvus | git version 2.43.0 | 16:11 |
clarkb | looking at the haproxy log the cD state occurs for both ipv4 and ipv6 | 16:11 |
clarkb | so ya I don't think this is ip protocol specific | 16:11 |
corvus | i'm running the same git client locally as the test nodes | 16:12 |
clarkb | ya and jammy seems to exhibit similar issues (though the example I shared was on a repo other than noble) | 16:13 |
clarkb | *other than nova | 16:13 |
corvus | i don't think we upgrade git in the dib build job, which means we should have the same git client version on the successful and failing builds | 16:14 |
clarkb | yup | 16:15 |
corvus | the gitea servers also did a memcached upgrade a couple of hours before the mariadb upgrade | 16:17 |
clarkb | looking at the example I linked above I see the osel update on gitea12's apache server access log but I don't see the failed ospurge update there | 16:21 |
clarkb | for that particular example its like the connection made it as far as the haproxy server but not to the backend | 16:21 |
clarkb | `gnutls_handshake() failed: The TLS connection was non-properly terminated.` maybe that means we failed at l4 so apache doesn't log the request | 16:24 |
corvus | clarkb: a fetch with the +*:* directly to a backend took 325s locally | 16:27 |
corvus | s/the/`*:*`/, s/+*:*// | 16:27 |
corvus | s/the/`+*:*`/, s/+*:*// | 16:27 |
corvus | sorry -- formatting. | 16:27 |
clarkb | corvus: does adding the '^refs/changes/*' rule to the command change that? | 16:29 |
corvus | can you share your full pattern? | 16:34 |
corvus | or command | 16:35 |
corvus | maybe it's supposed to be this? git --git-dir=/tmp/nova/.git fetch -q --prune --update-head-ok https://gitea12.opendev.org:3081/openstack/nova.git '^refs/changes/' '+:*' | 16:35 |
corvus | * maybe it's supposed to be this? git --git-dir=/tmp/nova/.git fetch -q --prune --update-head-ok https://gitea12.opendev.org:3081/openstack/nova.git '^refs/changes/*' '+*:*' | 16:36 |
corvus | anyway, starting from the repo from the image, that takes 14 seconds. then repeating it takes 0.7 seconds. | 16:37 |
corvus | so i think adding the negative refspec looks ike a reasonable fix. still a mystery why it's needed. | 16:38 |
clarkb | corvus: yes sorry that second command. Basically you add it and then it's supposed to filter out negative matches from any of the positive matches | 16:39 |
clarkb | fwiw I see some warnings for slow responses to specific commits that appear to come from "bad" crawlers during the time of the failure I linked above. Not a smoking gun for why gnutls had a sad there but maybe the giteas are getting overwhelmed periodically and then not making new connections quickly enough | 16:40 |
clarkb | looking at the apache config I believe the total connection limit there is greater than the haproxy total connection limit | 16:40 |
clarkb | so maybe haproxy is rejecting new connections or gitea itself is? | 16:40 |
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=2025-07-03T15:18:19.041Z&to=2025-07-03T16:13:43.071Z&timezone=utc the error occurred between 15:56 and 15:58 and frontend connection count looks fine | 16:41 |
clarkb | historically the connection bottleneck has been the mariadb connection limit. But I don't see any indication that is occurring during these failures | 16:43 |
clarkb | (or at all) | 16:43 |
corvus | looking at all the builds in your jammy buildset: 2 of them had the tls error, and one hit the slow speed limit. (2 more errors for rockylinux i think are probably platform based and we should ignore) | 16:43 |
clarkb | corvus: slow speed limit is what we think may be the refs/changes related item right? | 16:44 |
clarkb | maybe we start there as improving that may have a domino effect on the system improving things overall (though that is probably overly hopeful) | 16:44 |
corvus | not necessarily; it could be, or it could just be slow performance... hard to say. it's not the timeout though. | 16:45 |
corvus | on the successful jobs in your jammy change, i see a fetch for nova that took 1 second | 16:45 |
corvus | https://review.opendev.org/c/openstack/diskimage-builder/+/721581/25/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories | 16:47 |
corvus | i feel like we should discuss the fact that that change does in fact change the git fetch command | 16:48 |
clarkb | but we're seeing failures without that change (it hasn't merged) | 16:49 |
corvus | the giant pile of nova fails was a depends-on to that | 16:49 |
clarkb | oh I missed that | 16:49 |
corvus | frankly, that change has some very unusual ideas about how to implement retries and i think we should just start over with something simple and straightforward. | 16:50 |
clarkb | basically ignore the nova failures for a minute. Then reset on what retries should look like and see if that helps at all with the gnu tls handshake issues | 16:51 |
corvus | (maybe) i'm trying to work out what the old command would be and if it makes any difference | 16:52 |
corvus | okay, so looking at the old side of that diff (current code)... lines 142-143 look like a very sensible fetch that will not pull all the change refs | 16:54 |
corvus | there is a fetch above on line 134-135, but that fetch is in the second half of a conditional that will short-circuit because our reporef is * | 16:55 |
corvus | so, in the end, the old code is only going to run: git -C ${CACHE_PATH} fetch -q --prune --update-head-ok $REPOLOCATION +refs/heads/*:refs/heads/* +refs/tags/*:refs/tags/* | 16:55 |
corvus | the new code tries to do a retry for the fetch that 's in the conditional and does not allow it to short-circuit | 16:56 |
corvus | i think that fully explains the nova problem | 16:56 |
clarkb | ok | 16:57 |
clarkb | I've started looking at the slow fetch error in https://zuul.opendev.org/t/opendev/build/5e57e60d98c9478b8cf405d701c4d3bb/log/ and I see gitea13 responding with a HTTP 200 | 16:58 |
clarkb | however there are multiple packs being returned there. Maybe there is an additional one that didn't occur/succeed within the time? | 16:58 |
mnasiadka | corvus: huh, thanks for the analysis | 17:03 |
corvus | mnasiadka: i think that change is fundamentally flawed and you should start over with a new one :) | 17:05 |
corvus | maybe just make a function that does git retries, then call that from all the locations. and don't try to match on error output, just use exit codes. :) | 17:05 |
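A minimal sketch of the kind of helper corvus is suggesting (not the actual dib code): retry purely on exit status, no matching on error output, with a short pause between attempts.

```bash
# Retry a git command a few times based only on its exit code.
git_retry() {
    local attempts=3 delay=30 i
    for ((i = 1; i <= attempts; i++)); do
        git "$@" && return 0
        if (( i < attempts )); then
            echo "git $* failed (attempt ${i}/${attempts}); retrying in ${delay}s" >&2
            sleep "${delay}"
        fi
    done
    echo "git $* failed after ${attempts} attempts" >&2
    return 1
}

# Example call site mirroring the element's existing fetch:
# git_retry -C "${CACHE_PATH}" fetch -q --prune --update-head-ok "$REPOLOCATION" \
#     '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
```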
mnasiadka | And I just tried to spend less time doing that... I don't want to say that it usually ends up like this ;-) | 17:05 |
mnasiadka | well, the latest version is not trying to match on error output | 17:05 |
corvus | clarkb: the cacti graphs for gitea13 at 15:56 look okay | 17:08 |
corvus | ethernet traffic ramps up to 50Mbps and then drops then, but i think that's expected traffic from that job probably. the host has previously sustained 140Mbps. | 17:09 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 17:16 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 17:17 |
corvus | i'm going to release the autoholds; any objections? | 17:20 |
clarkb | none from me. I think if we want to debug the other issues we should get new autoholds | 17:22 |
clarkb | I found https://github.com/go-gitea/gitea/pull/34842 upstream but it seems to be lfs specific so unlikely to be in play here | 17:22 |
clarkb | also https://github.com/go-gitea/gitea/issues/34003 others struggling with the archives and turning it off seems to be the suggestion (that is what we did) | 17:23 |
clarkb | separately digging through the logs both chatgpt and google appear to be crawling the frontend and now the backends | 17:25 |
clarkb | that sort of behavior makes me want to ban both of them. We aren't hosting on port 443 or port 80 | 17:25 |
clarkb | they are just multiplying the total load they generate by 6 as a result | 17:26 |
clarkb | (and thats on top of them failing to just use git clone) | 17:26 |
corvus | no objection here | 17:26 |
clarkb | the downside to that is I don't know how that might affect google search :/ | 17:27 |
clarkb | I guess another option would be to block direct backend access but that has been extremely useful for testing and similar in the past | 17:28 |
clarkb | I'll have to think about this more | 17:28 |
corvus | looking into image builds/uploads on the launcher side -- i think things are normal there; looks like there were some flex swift issues that borked the intermediate uploads, so we missed images for 2 days, but before and after are fine, and uploads are proceeding. | 17:28 |
clarkb | looks like gptbot may be ignoring our crawl-delay too, but that's possibly because they are talking to the frontend and the backend at the same time, crawling the same data on each one even | 17:32 |
clarkb | it's the most braindead thing ever | 17:32 |
corvus | we should spin up more backends to deal with the load! ;) | 17:33 |
clarkb | going back to the slow fetch I do wonder if that is just network slowness. Since everything I can find on the backend server side seems to indicate it responded successfully | 17:33 |
clarkb | then the data doesn't get back to the build server quickly enough and it errors on its own timeout which causes the cD state on haproxy | 17:34 |
clarkb | that may be a side effect of building images on random nodes in random clouds. I think for those retries and/or tuning the client side timeouts may be appropriate | 17:34 |
clarkb | that leaves us still trying to debug these gnu tls handshake issues. Which I suppose could be a similar problem but maybe during the l3 SYN/SYN-ACK/ACK exchange or the equivalent for l4? | 17:35 |
clarkb | that one I'm having a hard time getting a handle on because I never seem to see the connection in apache or the backend. Haproxy knows about it and logs the error disconnect cD state but our backends never seem to know | 17:35 |
corvus | that's kind of what i'm thinking (all 4 messages) | 17:35 |
clarkb | thinking out loud here it could be some network hiccup between the load balancer and the backends but they all live in the same cloud region. It could also be internet hiccups. I don't want to bother the cloud provider unless we can find more evidence of that sort of thing first though | 17:38 |
clarkb | https://github.com/argoproj/argo-cd/issues/3994 is interesting | 17:41 |
clarkb | essentially they are saying that openssl plays better with http 1.1 transfer-encoding: chunked than gnutls does | 17:43 |
clarkb | and that is why increasing the postbuffer helps. Because you only switch to chunked when you make requests larger than that buffer | 17:43 |
clarkb | with the implication being that maybe our combo of haproxy and apache2 and gitea http don't properly support chunked transfers? | 17:44 |
clarkb | https://github.com/go-gitea/gitea/issues/22233 gitea lfs doesn't do chunked transfers. Is it possible that gitea doesn't do chunked transfers for git as well? | 17:46 |
clarkb | oh though that seems to be an implementation detail of the git lfs client. It requires the headers to be set even though apparently this isn't strictly necessary in general | 17:47 |
clarkb | and double checking it looks like POST is the method used when git-upload-pack runs (which confusingly is what is used to download packs?) | 17:51 |
mnasiadka | Ok, the bionic/focal addition jobs look a lot better now | 17:53 |
fungi | trying to catch up with the copious scrollback, but it sounds like the current theory is some sort of networking problem between haproxy and apache on the gitea servers? | 17:54 |
clarkb | fungi: or between our git clients and haproxy | 17:55 |
clarkb | and/or http chunked transfer encoding problems within the system | 17:55 |
clarkb | I'm testing some clones locally with smaller postBuffers to see if I can trip the problem | 17:55 |
fungi | and the git clients are on job nodes? so are we seeing the issue from all providers or only some? | 17:56 |
clarkb | yes on job nodes. We're seeing problems from rax dfw and ovh bhs1 at least. | 17:56 |
fungi | ovh bhs1 and the gitea servers are geographically near one another at least | 17:57 |
clarkb | same continent yup | 17:57 |
fungi | oh, i thought they were in the same metro area. did we put the gitea servers in sjc1 instead of ca-ymq-1? | 17:58 |
clarkb | yes | 17:58 |
fungi | got it, i guess it's gerrit that's in ca-ymq-1 | 17:58 |
clarkb | yup | 17:58 |
fungi | okay, so not as close to each other as i had thought | 17:59 |
clarkb | `git -c http.postBuffer=65536 clone https://opendev.org/openstack/nova nova-small-post-buffer` this seems to work for me. A smaller post buffer is rejected by curl as being too small. The default git uses is apparently 1MB. So I don't think the buffers are the problem. I suspect it's more of a latency/throughput problem affecting timeouts | 18:10 |
clarkb | corvus: I noticed the nova team discussing rechecks and asked them to share the failures to cross check zuul-launcher stuff | 18:17 |
clarkb | corvus: https://review.opendev.org/c/openstack/nova/+/948079 is the rechecked change and each of these three failures were multinode jobs that ran in mixed cloud provider environments: | 18:17 |
clarkb | https://zuul.opendev.org/t/openstack/build/66557ea92bae4e809c09526eba619c07 https://zuul.opendev.org/t/openstack/build/c74d7a60b6cd43f7a778e72ae0e61e9b https://zuul.opendev.org/t/openstack/build/48912a161c9845a4818875470e4bd8ba | 18:18 |
clarkb | I'm wondering if we might need to more aggressively disallow mixed provider nodesets or maybe make that an optional nodeset configuration option? jobs that don't expect colocated resources can say they are cool with it and those that do can force the old behavior? | 18:18 |
corvus | clarkb: i'd like the opportunity to investigate this before deciding on a solution | 18:24 |
clarkb | yup no objections to that | 18:24 |
clarkb | nova did confirm that they don't expect rabbitmq to work in cross site situations (generally even outside of CI) | 18:24 |
clarkb | so while i'm not positive that was the cause of the failures it does seem likely | 18:24 |
corvus | 2025-07-03 13:21:32,923 WARNING zuul.Launcher: [e: 1e2dafc88ac549c296a182ab6fb9bdc1] [req: 1ebe8dd654174c29b9e7110c93809cd3] Error scanning keys: Timeout connecting to 104.130.158.87 on port 22 | 18:28 |
corvus | 2025-07-03 13:21:32,924 ERROR zuul.Launcher: [e: 1e2dafc88ac549c296a182ab6fb9bdc1] [req: 1ebe8dd654174c29b9e7110c93809cd3] Marking node <OpenstackProviderNode uuid=03db1a7fac774f1b9641aae512b64d71, label=ubuntu-noble, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/rax-dfw-main> as failed | 18:28 |
corvus | so it tried to build both in rax-dfw; one of them failed to boot; so it executed a fallback to a different provider | 18:29 |
clarkb | I guess we don't retry 3 times anymore? | 18:29 |
corvus | not in the same provider, which is generally an improvement | 18:29 |
corvus | oh actually we will retry using the same provider under some circumstances... | 18:31 |
clarkb | I don't have concrete evidence but I feel like for ssh failures like that where the cloud is reporting success otherwise retrying in the same provider is probably a reasonable thing to do | 18:32 |
clarkb | vs provider just straight up failed | 18:32 |
clarkb | but also I think the underlying issue here is that existing jobs have made assumptions that the nodes will all cohabitate the same routable network space and that breaks when we mix nodes. This is why I wondered if making that a requirement on a nodeset level makes sense | 18:32 |
corvus | at this point, with this cloud provider, it's probably that they're just taking too long to boot | 18:33 |
clarkb | mnasiadka: https://zuul.opendev.org/t/opendev/build/2f4200f24ae4492295d5c43307e93038/log/job-output.txt#4353-4357 here is a retry that succeeds. It added 5 minutes to the build but success... | 18:33 |
clarkb | corvus: ya that would be my suspicion too. But that probably isn't a universal problem, it's probably some subset of hypervisors or semi random based on noisy neighbors | 18:34 |
corvus | i wonder if we need to bump that timeout? i believe we did do that for ord, maybe we need to do that for dfw. | 18:34 |
clarkb | that could be. Maybe we're near the border so some succeed and some fail and bumping it would get us closer to 100% succeeding | 18:35 |
clarkb | I'd be willing to try that | 18:35 |
corvus | i'm still looking. give me more time. | 18:36 |
clarkb | ack | 18:37 |
clarkb | looking at that trying again case this is curious: the fetches seem to go over ipv4 first then ipv6 then ipv4 then ipv6 again | 18:39 |
clarkb | when the error occurs it was using ipv6. When it succeeds on the second attempt it appears to be using ipv4 | 18:40 |
clarkb | Then 5 minutes later it uses ipv6 again | 18:40 |
clarkb | theory time: the nodes have flaky ipv6 for some reason | 18:40 |
clarkb | which would explain clients not reading data quickly enough and the back and forth behavior we see on the haproxy frontend logs | 18:41 |
fungi | also possible git implements some sort of "if i failed using ipv6 fall back to v4 and vice versa" logic | 18:41 |
clarkb | fungi: in this case the trying again is a second git command invocation via mnasiadka's update to dib | 18:41 |
clarkb | fungi: and we start only using ipv4 | 18:41 |
corvus | clarkb: my initial reading of the code suggests that because one of the nodes was in dfw and had not failed, we should have retried that 3 times in rax-dfw before falling back on another provider; so i think we should treat this as a bug | 18:42 |
fungi | i see, and the v4 requests are fine it's only the v6 requests that have problems? | 18:42 |
clarkb | fungi: with my sample size of one yes | 18:42 |
clarkb | corvus: ack | 18:42 |
clarkb | and in this case I never see a cD the node isn't able to connect to the load balancer at all | 18:43 |
clarkb | I think we should entertain the idea that ipv6 problems on the node or between node and load balancer are causing at least some of these errors | 18:44 |
mnasiadka | clarkb: so the retries are needed, happy that it helps a bit and we can get a green run of all jobs finally | 18:44 |
clarkb | fungi: would syslog capture arp updates? Wondering what we should collect from the node logs to verify that sort of thing | 18:45 |
clarkb | another option is to hold a node and monitor its network interfaces as this would in theory happen regardless of the dib builds as they don't touch networking particularly while doing git repo updates | 18:46 |
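One low-effort way to capture what Clark describes on a held node, assuming iproute2 is available (the log path is arbitrary): leave ip monitor running in the background and later correlate its timestamps with any failed fetches.

```bash
# Log every address and route change (IPv4 and IPv6) with timestamps.
nohup ip -timestamp monitor address route > /var/log/ip-monitor.log 2>&1 &
```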
frickler | about 120 new config-errors due to "nodeset ubuntu-xenial not found", mostly for starlingx repos https://zuul.opendev.org/t/openstack/config-errors?name=Nodeset+Not+Found&skip=0 | 19:02 |
Clark[m] | I think I pushed some changes to starlingx to get ahead of that a while back and they didn't go anywhere | 19:05 |
frickler | yeah, they also didn't act upon the centos-7 and opensuse removals. I wonder when it is time to take stronger measures, like drop those repos from zuul | 19:08 |
clarkb | ildikov: ^ you may have thoughts on that. Doesn't look like slittle is in here | 19:15 |
corvus | clarkb: i think i see the bug. easy fix, probably a hard test, will write after lunch. | 19:16 |
clarkb | corvus: awesome. I'm about to pop out on a bike ride but should be able to review it when I get back | 19:17 |
clarkb | ildikov: frickler looks like I only did opensuse and centos 7 here: https://review.opendev.org/c/starlingx/zuul-jobs/+/909766 | 19:17 |
clarkb | which didn't merge but is in merge conflict now so maybe they fixed it themselves with a different change | 19:17 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/954053 and https://review.opendev.org/c/opendev/system-config/+/954051 fix some straightforward issues and should be quick reviews if you get a moment | 19:19 |
clarkb | one fixes arm64 testing for system-config and the other should fix zuul's zk stored key backups that frickler discovered weren't working due to the docker compose switch | 19:20 |
mnasiadka | clarkb: I see that https://review.opendev.org/c/opendev/zuul-providers/+/953269 is green now - so would appreciate a review on the DIB depends-on | 19:27 |
fungi | clarkb: i don't know of anything that logs arp updates normally, though there might be a sysctl toggle or somewhere in the /sys tree or /proc/net to make the kernel log them to dmesg | 19:34 |
fungi | i guess what you really want to know is arp overwrites, not just arp updates? | 19:36 |
fungi | (i.e. arp messages arriving with a different mac for the same ipv4 address while there are unexpired entries for it in the table) | 19:37 |
fungi | though also the kernel will opportunistically maintain the arp table when receiving normal (non-arp) packets too | 19:38 |
fungi | in any case, i think it treats it as an update or an overwrite regardless of whether it was an actual arp packet, just to be clear | 19:39 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 19:56 |
corvus | clarkb: okay, fixes uploaded; reviews of 954064 and 954065 would be good too | 20:08 |
corvus | (the reason why that failed is that the method did not see that there was a "main" provider for the request, so it selected a new one) | 20:09 |
corvus | with that merged, it should try 3 times on the same provider, before it falls back to a different one. | 20:10 |
opendevreview | Merged opendev/system-config master: Update arm64 base job nodeset https://review.opendev.org/c/opendev/system-config/+/954053 | 20:32 |
clarkb | fungi: I think any change to the ipv6 address or routes is what we want to know about | 21:42 |
clarkb | fungi: as it almost seems like the ip address is simply becoming invalid for 5-10 minutes then going away | 21:43 |
clarkb | basically figure out if ipv6 is going away or updating which may explain why the requests are made with differing ip protocols over time | 21:47 |
clarkb | alternatively it could be the load balancer's network that is flapping I guess | 21:49 |
fungi | clarkb: oh, for ipv6 it won't be arp (address resolution call), it'll be nd (neighbor discovery) instead | 22:16 |
fungi | s/call/protocol/ | 22:17 |
clarkb | oh right | 22:18 |
clarkb | in any case what I'm trying to figure out is how can we log/capture changes to the ipv6 (and I guess ipv4 though that seems to work consistently on the system I checked) to see if that is what causes the errors | 22:19 |
clarkb | thinking about it further I suspect that it is on the client side system that change must be happening because it doesn't seem to do the long timeout then fallback to ipv4 thing when ipv6 doesn't work | 22:30 |