Thursday, 2025-07-03

clarkba number of the grafana graphs in zuul status and in the afs graph are blank (but don't have the typical no data message). I think we may need to hold a node and check if that is an artifact of selenium screenshotting before things can load or a problem with our graphs in 12.0.200:22
clarkbI can probably do that tomorrow morning00:22
clarkband the project-config sync change failed due to differing node cloud providers https://zuul.opendev.org/t/openstack/build/6abe86a117154b43980ca6d91298ffa8/log/job-output.txt#31-53 so thats still happening but definitely less often than before00:26
clarkbnow i must find dinner00:26
corvusthe 0200 periodic jobs appear to be a non-event today after that quota calc fix02:48
fricklercorvus: the daily cron for "docker exec zuul-scheduler_scheduler_1 zuul-admin export-keys" is failing because the container is now named zuul-scheduler-scheduler-1. fallout from the docker compose upgrade?04:13
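(For reference, a minimal before/after sketch of the rename frickler describes, mirroring the command exactly as quoted above; any arguments the real cron passes are not shown:)

    # Compose v1 ("docker-compose") container name, as used by the failing cron:
    docker exec zuul-scheduler_scheduler_1 zuul-admin export-keys
    # Compose v2 ("docker compose") replaces the underscores with dashes:
    docker exec zuul-scheduler-scheduler-1 zuul-admin export-keys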
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326907:04
mnasiadkafrickler: in https://review.opendev.org/c/opendev/zuul-providers/+/953908?tab=change-view-tab-header-zuul-results-summary we seem to have the same issues - let me focus on getting the DIB retry patch working properly07:13
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout  https://review.opendev.org/c/openstack/diskimage-builder/+/72158107:28
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158107:28
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326907:29
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158107:40
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158107:46
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158107:57
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158108:08
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158109:04
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158109:15
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158109:27
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158109:49
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: source-repositories: Increase http.postBuffer  https://review.opendev.org/c/openstack/diskimage-builder/+/95402510:24
mnasiadkafrickler: I'm starting to have a feeling it's not retries we need, on the latest version of the retries patch I'm getting error: unable to rewind rpc post data - try increasing http.postBuffer on the openstack/nova repository - constantly ;-)10:27
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: source-repositories: Increase http.postBuffer  https://review.opendev.org/c/openstack/diskimage-builder/+/95402510:30
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326910:30
mnasiadkaLet's see if that helps - but git docs say it rather should not be used: https://git-scm.com/docs/git-config#Documentation/git-config.txt-httppostBuffer10:42
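(For context, http.postBuffer can be raised globally or per-invocation; a sketch only, with an arbitrary value -- git's default is about 1 MiB, and POSTs larger than the buffer are sent with HTTP/1.1 chunked transfer encoding instead:)

    # Raise the HTTP POST buffer so large requests are buffered rather than chunked.
    # 500 MiB here is purely illustrative, not a recommendation.
    git config --global http.postBuffer 524288000
    # or one-off, run inside the cached repo:
    git -c http.postBuffer=524288000 fetch https://opendev.org/openstack/nova.git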
mnasiadkaWell, cloning opendev.org/openstack/nova.git seems extremely slow on counting and compressing objects10:58
fricklermnasiadka: is this only for the images which don't have the cached repos yet? then I fear that this is the issue I was predicting a long time ago: without a shared cache it will be difficult to build new images from scratch11:44
Clark[m]All of our image builds should occur where we have cached images.12:40
Clark[m]mnasiadka: I think the gitea server should support http 1.1 and chunked transfers. But maybe there is a bug there?12:42
Clark[m]frickler: yes the _ to - conversion in the container names is a side effect of the docker-compose to docker compose transition12:42
Clark[m]Maybe we should consider holding an image build test node so that we can test git fetches from the node to gitea and check things like what protocol versions are used and how the cache is being used.12:44
Clark[m]Also cross check against the nodepool image builds12:45
mnasiadkaClark[m]: setting it to unreasonably high value did not help, I’m thinking it’s some bug in gnutls maybe?12:46
Clark[m]I suppose that could be possible. We updated gitea semi recently which I think was the first update after bookworms big package update? I noted an opendev deployment job failed to fetch a repo yesterday for similar reasons. Maybe we need to be looking closer at gitea12:49
Clark[m]Cross checking the nodepool image build logs is probably a good first step then looking at the gitea issue tracker to see if others have hit similar12:50
mnasiadkaI think the weird thing is a retry doesn’t help (at least not on the Nova repository) but I’ll try reproducing it locally12:54
Clark[m]Oh another thought for testing would be to try different backends. They can be addressed using https://gitea09.opendev.org:3081 through gitea1412:56
Clark[m]Maybe one particular backend is a problem and that explains the infrequency 12:56
corvusanyone set a hold yet, or should i?13:01
Clark[m]I haven't. I think you should13:04
fungilooks like we have at least one xenial-based job still set to run on openstack/project-config, resulting in "Unable to freeze job graph: The nodeset "ubuntu-xenial" was not found." https://review.opendev.org/c/openstack/project-config/+/95077113:06
fungii'll see if i can track it down after i get off my current conference call13:06
corvushttps://zuul.opendev.org/t/opendev/autoholds set for all the non-arm64 jobs (because we only have 15 arm64 nodes)13:07
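(For reference, a hold like these can also be requested from the command line with zuul-client; a rough sketch only -- the project and job names below are placeholders, and admin credentials/configuration for the tenant are assumed to be in place:)

    # Hypothetical example values; the real holds covered the non-arm64 image build jobs.
    zuul-client --zuul-url https://zuul.opendev.org autohold \
        --tenant opendev \
        --project opendev/zuul-providers \
        --job opendev-build-diskimage-ubuntu-noble \
        --reason "debugging git fetch failures during image builds" \
        --count 1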
corvusand i've re-enqueued https://zuul.opendev.org/t/opendev/buildset/a014c19f845a483eb6b47c6d9075b74913:07
corvusfungi project-config-bindep-fallback-ubuntu-xenial looks suspicious13:09
fungiaha, good eye13:10
fungilooks like we dropped it from bindep but not project-config13:10
*** cloudnull109 is now known as cloudnull1013:11
fungii notice publish-openstack-sphinx-docs-base also uses ubuntu-xenial13:13
corvus(but it says it's deprecated)13:14
fungiyeah, though it's the parent of publish-openstack-sphinx-docs which is in the pipelines of a number of projects, apparently according to codesearch13:14
fungitrying to see if they're overriding it13:14
corvusi don't see any: https://codesearch.opendev.org/?q=publish-openstack-sphinx-docs&i=nope&literal=nope&files=&excludeFiles=&repos=13:16
fungioh, it's commented out in some zuul configs, i misread13:17
fungiokay, i'll try removing it too13:17
opendevreviewJeremy Stanley proposed openstack/project-config master: Drop lingering ubuntu-xenial bindep job  https://review.opendev.org/c/openstack/project-config/+/95403513:20
opendevreviewJeremy Stanley proposed openstack/project-config master: Drop publish-openstack-sphinx-docs and parent  https://review.opendev.org/c/openstack/project-config/+/95403613:20
corvusmaybe need to do both of those in one change13:21
fungii'm hoping the lack of use of the second will make that possible to split13:22
fungibut i'll squash them if necessary, yes13:22
fungiyeah, looks like i do need to13:23
opendevreviewJeremy Stanley proposed openstack/project-config master: Drop lingering ubuntu-xenial jobs  https://review.opendev.org/c/openstack/project-config/+/95403513:26
corvusokay that looks good now... but since it's a config-project we can't get it through gate.  we could either add the nodeset back, or force-merge it.  i vote force-merge.13:28
fungioh, yep13:28
fungican do in a sec13:29
Clark[m]Ya force merge seems fine for situations where we remove jobs13:32
Clark[m]Unlikely to break anything worse 13:32
opendevreviewMerged openstack/project-config master: Drop lingering ubuntu-xenial jobs  https://review.opendev.org/c/openstack/project-config/+/95403513:37
fungii've rechecked https://review.opendev.org/950771 to make sure things are working again13:39
fungiand it enqueued, so we should be all set now13:40
corvusi started some notes on one of the previous failed image builds: https://etherpad.opendev.org/p/u6w-_J4CFOZ9eNJQmYDB13:51
corvusthat's a link to the build, and the haproxy logs that correspond to it around the time it failed13:52
corvusi think the first 2 lines are for the previous successful update; the failed one starts at :08)13:53
opendevreviewMerged openstack/project-config master: Comply with Policy for AI Generated Content  https://review.opendev.org/c/openstack/project-config/+/95077113:53
opendevreviewClark Boylan proposed opendev/system-config master: Fix zuul scheduler container name in docker exec commands  https://review.opendev.org/c/opendev/system-config/+/95405113:53
clarkbcorvus: frickler  ^ I think that will handle the renamed zuul scheduler container name where I found us using the old name13:54
clarkbcorvus: ya cD in the haproxy lines means "closed disconnected" iirc13:54
clarkbthe -- is a normal connection close. Let me see if I can find the haproxy docs on that cD status13:55
clarkbah c not C is "client side timeout expired waiting for the server to send or receive data"13:55
clarkb*waiting for the client to send or receive data13:56
clarkbthen D is the session was in the data phase. So it seems that haproxy (and maybe gitea?) expect the client to keep talking but it doesn't. Maybe that is because the client is busy processing data trying to determine what its next request should be?13:57
corvushttps://zuul.opendev.org/t/opendev/stream/811315bc55aa4d4a9825d07594491e44?logfile=console.log13:58
corvusis fetching nova from gitea10 right now13:58
clarkbI see this process in ps listing for gitea10 `/usr/bin/git -c protocol.version=2 -c credential.helper= -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= upload-pack --stateless-rpc /data/git/repositories/openstack/nova.git`14:01
clarkbwhich indicates protocol v2 is used. Not sure if that is helpful or not.14:01
clarkbalso I was just able to clone nova from scratch to my local machine from gitea09. So this works at least some of the time14:02
corvusgit --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git +:14:02
clarkbcorvus: your stream log indicates it just failed but the process is still running on gitea1014:02
corvusthat's what the dib build host (client) is running14:02
clarkbI wonder if the proxy is creating problems and if we let it keep going it would be fine?14:03
corvusGIT_HTTP_LOW_SPEED_LIMIT=1000 GIT_HTTP_LOW_SPEED_TIME=30014:03
corvuslooks like we set a timeout of 5m which is when this is failing14:03
corvuswhy is a fetch of nova taking 5m though14:03
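(Those two variables map to git's http.lowSpeedLimit/http.lowSpeedTime settings: abort the transfer if throughput stays below the byte-per-second limit for that many seconds, which lines up with the ~5 minute failures. A sketch of an equivalent invocation, run inside the cached repo:)

    # Give up if the transfer stays below 1000 bytes/s for 300 seconds,
    # the same behaviour as GIT_HTTP_LOW_SPEED_LIMIT=1000 GIT_HTTP_LOW_SPEED_TIME=300.
    git -c http.lowSpeedLimit=1000 -c http.lowSpeedTime=300 \
        fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git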
corvusroot@npb9e6c9b2f06e4:/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940# du -sh .14:04
corvus231M.14:04
clarkbcorvus: oh interesting. We set that on the client side? Then the client is giving up early and the proxy records that as a client side timeout? The odd thing is ya why isn't the data transferring sufficiently to keep the connection timeouts happy14:05
clarkbcorvus: what does git show master and/or HEAD look like against that repo? Just wondering if it is super stale or maybe invalid somehow14:05
corvuscommit 43d57ae63d1ecda24d8707b4750d404daadc980f (HEAD -> master) Date:   Fri Jun 27 20:29:16 2025 +000014:06
clarkbok so both valid and not that old14:08
corvusroot@23.253.108.26:nova.tgz14:08
corvusi tar'd up the nova repo from the dib build if you want to grab a copy14:08
corvusthis node should hit the autohold, but just in case14:09
opendevreviewClark Boylan proposed opendev/system-config master: Update arm64 base job nodeset  https://review.opendev.org/c/opendev/system-config/+/95405314:10
clarkbcorvus: thanks I'll fetch that now14:10
clarkbthinking out loud here maybe once that node is held we can try and rerun that fetch and strace the process to see what the client is doing?14:11
corvus++.  i'm also doing that locally :)14:12
corvusactually, i'm just starting with "time ... git fetch..."14:12
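(A sketch of how that strace could look on the held node, reusing the fetch command quoted earlier; the strace options are just a reasonable starting point:)

    # Follow forks (-f), timestamp each syscall (-tt), and write the trace to a file.
    strace -f -tt -o /tmp/nova-fetch.strace \
        git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
        fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git '+*:*'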
clarkb954053 is similar to the xenial removals but in this case its the lack of arm64 bionic and focal images. We don't need them for system config anymore so I just cleaned it up instead14:12
corvus+214:13
clarkboh except I made a silly copy paste error14:14
corvushttps://paste.opendev.org/show/b39Zq0Rj4G9tv8dgDIoC/  got that running the fetch locally14:14
opendevreviewClark Boylan proposed opendev/system-config master: Update arm64 base job nodeset  https://review.opendev.org/c/opendev/system-config/+/95405314:14
clarkbtimeout  client 2m14:17
stephenfinclarkb: fungi: Another large bundle of pbr changes starting here, whenever you have time https://review.opendev.org/c/openstack/pbr/+/95404014:17
clarkbthis is what we set in the haproxy config for gitea-lb. I half wonder if it just takes git more than 2 minutes to figure out what to fetch from the nova repo after getting a listing of the details?14:17
stephenfinI'm going rebasing https://review.opendev.org/c/openstack/pbr/+/949055/ on top of it once the base patch (which is in the queue) merges, dropping the second one in the process (which is currently borked)14:18
clarkbiirc protocol v2 in particular is all about being smarter about what to negotiate to minimize network traffic, but maybe that requires us to be able to calculate the difference in under 2 minutes?14:18
clarkbI hesitate to increase those timeouts without a better understanding simply because that could make things worse as more connections hang around14:19
clarkbwhereas a clone (which i just did successfully) is more of a copy everything and sort details out later situation14:20
clarkb(you just grab the packfiles directly iirc rather than generating delta packfiles I think)14:21
corvusin the strace, i think i see git reading a bunch of "have facd8951f7ca4e3dd613d70aba8" lines from a subprocess, sending some data to the remote server after each one, then a "done", then it does a select busy wait until it times out14:24
corvus(the checksum changes for each "have" line, that's just an example)14:25
corvusat first glance, that seems consistent with "calculating delta takes too long"14:25
Clark[m]Thinking out loud again: I wonder if a fallback of just clone the repo over again is a good fallback? Or we can try increasing the timeout on haproxy and/or the client14:27
Clark[m]Neither are likely to be super efficient, but maybe good enough?14:28
corvusi'm running the dib git fetch command on a brand new nova clone...  give me a minute and i'll tell you how that goes14:28
Clark[m]Also maybe those caches grow crusty over time and need a repack14:29
Clark[m]I think with long lived servers managing the data git does that automatically. Not sure if we get the same behavior with this more one shot approach 14:29
corvusthat fetch command failed with the tls error on a brand new clone14:30
corvusdid, erm, dib change anything about git fetching recently?14:31
Clark[m]Ok so repack is unlikely to help. Falling back to a clone may work but we'll always reclone14:31
corvusbecause aren't we also looking at a difference between release and git master?14:31
Clark[m]No changes to dib git stuff other than what mnasiadka has pushed to try and make this better that I am aware of14:32
Clark[m]All of the updates have been to adding distro support recently not changing the core caching elements14:32
corvusso just a plain "git fetch" returns in 1.45 seconds14:33
corvusi think we need to understand what that particular fetch command is doing better14:33
Clark[m]Could be that the nova repo has changed in such a way to tickle the behavior in git. Or maybe the bookworm updates that we may have pulled into the latest gitea upgrade updated git?14:33
Clark[m]++ if regular fetch is fine then let's understand that14:33
corvusi agree with all those questions about git -- but the variable i still don't understand is nightly jobs mostly work14:34
Clark[m]The +: means update every ref?14:34
corvusthat would mean all of the refs/changes too; that's not going to be in the clone14:35
Clark[m]Oohhh14:35
Clark[m]Ya we may need something that restricts to refs/not changes ? But maybe that is the difference14:35
Clark[m]We should confirm that is what +: means14:36
corvusstill struggling with "how did this ever work"14:36
Clark[m]Maybe gitea didn't advertise those change refs because it has long been ignorant of them. But after the recent upgrade that possibly changed?14:37
corvusokay, that's plausible... nightly jobs... maybe gitea runs just a bit faster then?14:37
Clark[m]Ya could be we sneak under timeout values when things are quieter14:38
Clark[m]Maybe we should test this with a much smaller repo and see what the behavior is and see if that confirms some of these assumptions/ideas14:38
corvus[pid 1637860] newfstatat(AT_FDCWD, "/tmp/dibtest/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git/refs/changes/99/99699/4", 0x7ffd64ffb720, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)14:38
corvustons of those in my strace ^14:38
corvusso the client is being told about the refs/changes and is checking to see if it has them14:39
Clark[m]And is building a big list of things to request when done and then we timeout probably 14:39
corvussounds plausible14:39
Clark[m]https://github.com/git/git/commit/8e3ec76a20d6abf5dd8ceb3f5f2c157000e4c13e14:40
Clark[m]I think we can set a negative refspec for refs/changes in dib and do a depends on to test that14:41
corvus++14:41
Clark[m]It'll be about half an hour before I can write that change. I'm happy for someone else to beat me to it 14:42
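(A sketch of the sort of invocation being proposed, based on the negative-refspec support linked above; the exact form gets worked out later in this log, and negative refspecs need a reasonably recent git:)

    # Fetch everything except Gerrit change refs from the mirror.
    git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
        fetch -q --prune --update-head-ok \
        https://opendev.org/openstack/nova.git '^refs/changes/*' '+*:*'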
mnasiadkaI don't know if from a better geographical location it's the same - but doing just "git clone https://opendev.org/openstack/nova" stalls a lot on the initial GET/POST of git-upload-pack / compressing objects, and git clone from GitHub is instantly just fetching objects14:43
corvusone problem with this hypothesis:14:43
corvushttps://zuul.opendev.org/t/opendev/build/2e3e056fb57b46d2bbd70682008ff29f/log/job-output.txt#327714:43
corvusthat update of nova from last night took 2 seconds14:44
corvusthat's not just skating by... there's something different14:44
Clark[m]Hrm14:46
mnasiadkaWell, the nova problem surfaced today from my perspective14:46
mnasiadkaSo it might be totally different issue that we observed in previous days14:46
corvusoh, so something may have changed in the last 12 hours14:47
Clark[m]mnasiadka GitHub may cache checkpoint state for each repo allowing them to immediately start pulling that state whereas gitea (and just a normal git based server) has to construct at least some info iirc14:49
Clark[m]I don't think it is abnormal to have delays while the data is prepared14:49
mnasiadkathat might make sense, just an observation why I normally clone from GitHub ;-)14:49
Clark[m]The problem here seems to be more that the negotiation goes beyond what our timeouts allow for 14:49
mnasiadkaanyway, it might be some package version that has been bumped in the last 12hrs - previously we've seen multiple errors like connection timed out and others - now all jobs break on nova with the same error14:50
corvusunfortunately, we move the image-cached git repos into the dib cache at the start of the image build jobs, so on this host, i can't inspect the original state14:52
corvusokay, i logged into a random host running a normal job, and the nova repo looks the same as on the held node for the failed image build14:54
corvus231M size, no change refs14:55
corvussame head commit14:55
corvusthe successful image build from last night has not been uploaded (i suspect that's a separate problem, not highest priority); but that means that all our jobs, including the one i just looked at, and our failed image builds, are all probably running on the same image as that successful image build from last night.14:58
corvusin other words, i don't think we're looking at a situation where our most recent image build broke something and is causing the current failures.14:58
corvusmy current thinking: i still don't see a solution other than the negative refs change Clark proposed, but i also don't understand what could have changed since our last successful nightly build.15:01
corvuserm, coincidentally, our gitea docker containers are 12 hours old15:01
corvusit looks like we restarted gitea because there's a newer mariadb container; i don't think we upgraded gitea though.15:06
Clark[m]That makes sense. Another thought I had is it could be the git client version. Nodepool runs on Debian bookworm but these builds use noble iirc. And yet another idea is that this has to do with the timing of git repacking on the giteas.15:07
Clark[m]Maybe running near a repack is more likely to succeed because you get more ref data by default fetching one packfile15:07
corvusi haven't been looking at nodepool builds at all, so if the git client version changed, it would be because last night's zuul build got one version and the recent builds got another in the job itself (we can check that in the job logs)15:08
clarkbcorvus: maybe you can try the negative refspec locally in your env that has been reproducing?15:13
clarkbsomething like `^refs/changes/* +:` ?15:14
clarkbI also notice that there is a --refetch flag that basically makes git fetch act more like git clone15:17
clarkbI'm trying to reproduce using that tar'd nova repo now as well15:20
clarkbI am not able to reproduce on tumbleweed using git version 2.50.015:21
clarkbcorvus: ^ fyi this has me wondering if the git client itself may be at fault somehow15:22
clarkbstracing locally I see it processing refs/changes too fwiw.15:26
clarkbadding '^refs/changes/*' to my local invocation doesn't seem to change the count of refs/changes showing up in the strace though15:32
opendevreviewClark Boylan proposed opendev/zuul-providers master: Test image builds on ubuntu jammy  https://review.opendev.org/c/opendev/zuul-providers/+/95405815:37
clarkbthat is mostly for information gathering15:37
clarkbcorvus: another idea for your reproduceable environment: maybe try gitea backends directly which bypass haproxy and see if it ever succeeds15:54
clarkbthat could inform us if there is any value to increasing timeouts on haproxy15:54
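(For example, something along these lines from a held node, using the backend hostnames and port Clark mentioned earlier:)

    # Time the same fetch against each gitea backend, bypassing haproxy.
    for n in 09 10 11 12 13 14; do
        echo "=== gitea${n} ==="
        time git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
            fetch -q --prune --update-head-ok \
            "https://gitea${n}.opendev.org:3081/openstack/nova.git" '+*:*'
    done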
clarkbhttps://zuul.opendev.org/t/opendev/build/fc5901e43f4b4f2da2d3a90c32ac3d0d/log/job-output.txt#5449-5450 this ran on jammy not noble. Looking at haproxy logs it has the same cD client disconnect16:06
clarkbI do notice that the connection was made over ipv6 not ipv416:07
clarkbis it possible that we have ipv6 connectivity issues to the gitea load balancer?16:07
clarkb(so many variables) but that may explain why I can't reproduce locally since ipv4 only16:07
corvusi'm v4 only.  i've tried some direct backend fetches and they time out16:11
corvusgit version 2.43.016:11
clarkblooking at the haproxy log the cD state occurs for both ipv4 and ipv616:11
clarkbso ya I don't think this is ip protocol specific16:11
corvusi'm running the same git client locally as the test nodes16:12
clarkbya and jammy seems to exhibit similar issues (though the example I shared was on a repo other than noble)16:13
clarkb*other than nova16:13
corvusi don't think we upgrade git in the dib build job, which means we should have the same git client version on the successful and failing builds16:14
clarkbyup16:15
corvusthe gitea servers also did a memcached upgrade a couple of hours before the mariadb upgrade16:17
clarkblooking at the example I linked above I see the osel update on gitea12's apache server access log but I don't see the failed ospurge update there16:21
clarkbfor that particular example its like the connection made it as far as the haproxy server but not to the backend16:21
clarkb`gnutls_handshake() failed: The TLS connection was non-properly terminated.` maybe that means we failed at l4 so apache doesn't log the request16:24
corvusclarkb: a fetch with the +*:* directly to a backend took 325s locally16:27
corvuss/the/`*:*`/, s/+*:*//16:27
corvuss/the/`+*:*`/, s/+*:*//16:27
corvussorry -- formatting.16:27
clarkbcorvus: does adding the '^refs/changes/*' rule to the command change that?16:29
corvuscan you share your full pattern?16:34
corvusor command16:35
corvusmaybe it's supposed to be this?  git --git-dir=/tmp/nova/.git fetch -q --prune --update-head-ok https://gitea12.opendev.org:3081/openstack/nova.git '^refs/changes/' '+:*'16:35
corvus * maybe it's supposed to be this?  git --git-dir=/tmp/nova/.git fetch -q --prune --update-head-ok https://gitea12.opendev.org:3081/openstack/nova.git '^refs/changes/*' '+*:*'16:36
corvusanyway, starting from the repo from the image, that takes 14 seconds.  then repeating it takes 0.7 seconds.16:37
corvusso i think adding the negative refspec looks like a reasonable fix.  still a mystery why it's needed.16:38
clarkbcorvus: yes sorry that second command. Basically you add it then it's supposed to filter out negative matches from any of the positive matches16:39
clarkbfwiw I see some warnings for slow responses to specific commits that appear to come from "bad" crawlers during the time of the failure I linked above. Not a smoking gun for why gnutls had a sad there but maybe the giteas are getting overwhelmed periodically and then not making new connections quickly enough16:40
clarkblooking at the apache config I believe the total connection limit there is greater than the haproxy total connection limit16:40
clarkbso maybe haproxy is rejecting new connections or gitea itself is?16:40
clarkbhttps://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=2025-07-03T15:18:19.041Z&to=2025-07-03T16:13:43.071Z&timezone=utc the error occurred between 15:56 and 15:58 and frontend connection count looks fine16:41
clarkbhistorically the connection bottleneck has been the mariadb connection limit. But I don't see any indication that is occurring during these failures16:43
clarkb(or at all)16:43
corvuslooking at the all builds in your jammy buildset: 2 of them had the tls error, and one hit the slow speed limit.  (2 more errors for rockylinux i think are probably platform based and we should ignore)16:43
clarkbcorvus: slow speed limit is what we think may be the refs/changes related item right?16:44
clarkbmaybe we start there as improving that may have a domino effect on the system improving things overall (though that is probably overly hopeful)16:44
corvusnot necessarily; it could be, or it could just be slow performance... hard to say.  it's not the timeout though.16:45
corvuson the successful jobs in your jammy change, i see a fetch for nova that took 1 second16:45
corvushttps://review.opendev.org/c/openstack/diskimage-builder/+/721581/25/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories16:47
corvusi feel like we should discuss the fact that that change does in fact change the git fetch command16:48
clarkbbut we're seeing failures without that change (it hasn't merged)16:49
corvusthe giant pile of nova fails was a depends-on to that16:49
clarkboh I missed that16:49
corvusfrankly, that change has some very unusual ideas about how to implement retries and i think we should just start over with something simple and straightforward.16:50
clarkbbasically ignore the nova failures for a minute. Then reset on what retries should look like and see if that helps at all with the gnu tls handshake issues16:51
corvus(maybe)  i'm trying to work out what the old command would be and if it makes any difference16:52
corvusokay, so looking at the old side of that diff (current code)... lines 142-143 look like a very sensible fetch that will not pull all the change refs16:54
corvusthere is a fetch above on line 134-135, but that fetch is in the second half of a conditional that will short-circuit because our reporef is *16:55
corvusso, in the end, the old code is only going to run: git -C ${CACHE_PATH} fetch -q --prune --update-head-ok $REPOLOCATION +refs/heads/*:refs/heads/* +refs/tags/*:refs/tags/*16:55
corvusthe new code tries to do a retry for the fetch that's in the conditional and does not allow it to short-circuit 16:56
corvusi think that fully explains the nova problem16:56
clarkbok16:57
clarkbI've started looking at the slow fetch error in https://zuul.opendev.org/t/opendev/build/5e57e60d98c9478b8cf405d701c4d3bb/log/ and I see gitea13 responding with a HTTP 20016:58
clarkbhowever there are multiple packs being returned there. Maybe there is an additional one that didn't occur/succeed within the time?16:58
mnasiadkacorvus: huh, thanks for the analysis17:03
corvusmnasiadka: i think that change is fundamentally flawed and you should start over with a new one :)17:05
corvusmaybe just make a function that does git retries, then call that from all the locations.  and don't try to match on error output, just use exit codes.  :)17:05
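(A minimal sketch of the kind of helper being suggested: retry purely on exit code, no matching on error output; the attempt count and sleep are arbitrary choices here:)

    # Retry a command a few times based only on its exit code.
    git_retry() {
        local attempts=3 delay=30 rc=0
        for ((i = 1; i <= attempts; i++)); do
            "$@" && return 0
            rc=$?
            echo "command failed (rc=$rc), attempt $i/$attempts" >&2
            sleep "$delay"
        done
        return "$rc"
    }

    # e.g. wrapping the existing source-repositories fetch:
    git_retry git -C "${CACHE_PATH}" fetch -q --prune --update-head-ok \
        "$REPOLOCATION" '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'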
mnasiadkaAnd I just tried to spend less time doing that... I don't want to say that it usually ends up like this ;-)17:05
mnasiadkawell, the latest version is not trying to match on error output17:05
corvusclarkb: the cacti graphs for gitea13 at 15:56 look okay17:08
corvusethernet traffic ramps up to 50Mbps and then drops at that point, but i think that's expected traffic from that job probably.  the host has previously sustained 140Mbps.17:09
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158117:16
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326917:17
corvusi'm going to release the autoholds; any objections?17:20
clarkbnone from me. I think if we want to debug the other issues we should get new autoholds17:22
clarkbI found https://github.com/go-gitea/gitea/pull/34842 upstream but it seems to be lfs specific so unlikely to be in play here17:22
clarkbalso https://github.com/go-gitea/gitea/issues/34003 others struggling with the archives and turning it off seems to be the suggestion (that is what we did)17:23
clarkbseparately digging through the logs both chatgpt and google appear to be crawling the frontend and now the backends17:25
clarkbthat sort of behavior makes me want to ban both of them. We aren't hosting on port 443 or port 8017:25
clarkbthey are just multiplying the total load they generate by 6 as a result17:26
clarkb(and thats on top of them failing to just use git clone)17:26
corvusno objection here17:26
clarkbthe downside to that is I don't know how that might affect google search :/17:27
clarkbI guess another option would be to block direct backend access but that has been extremely useful for testing and similar in the past17:28
clarkbI'll have to think about this more17:28
corvuslooking into image builds/uploads on the launcher side -- i think things are normal there; looks like there were some flex swift issues that borked the intermediate uploads, so we missed images for 2 days, but before and after are fine, and uploads are proceeding.17:28
clarkblooks like gptbot may be ignoring our crawldelay too, but that's possibly because they are talking to the frontend and the backend at the same time, crawling the same data on each, even17:32
clarkbits the most braindead thing ever17:32
corvuswe should spin up more backends to deal with the load!  ;)17:33
clarkbgoing back to the slow fetch I do wonder if that is just network slowness. Since everything I can find on the backend server side seems to indicate it responded successfully17:33
clarkbthen the data doesn't get back to the build server quickly enough and it errors on its own timeout which causes the cD state on haproxy17:34
clarkbthat may be a side effect of building images on random nodes in random clouds. I think for those retries and/or tuning the client side timeouts may be appropriate17:34
clarkbthat leaves us still trying to debug these gnu tls handshake issues. Which I suppose could be a similar problem but maybe during the l3 SYN / SYN-ACK / ACK handshake or the equivalent for l4?17:35
clarkbthat one I'm having a hard time getting a handle on because I never seem to see the connection in apache or the backend. Haproxy knows about it and logs the error disconnect cD state but our backends never seem to know17:35
corvusthat's kind of what i'm thinking (all 4 messages)17:35
clarkbthinking out loud here it could be some network hiccup between the load balancer and the backends but they all live in the same cloud region. It could also be internet hiccups. I don't want to bother the cloud provider unless we can find more evidence of that sort of thing first though17:38
clarkbhttps://github.com/argoproj/argo-cd/issues/3994 is interesting17:41
clarkbessentially they are saying that openssl plays better with http 1.1 transfer-encoding: chunked than gnutls does17:43
clarkband that is why increasing the postbuffer helps. Because you only switch to chunked when you make requests larger than that buffer17:43
clarkbwith the implication being that maybe our combo of haproxy and apache2 and gitea http don't properly support chunked transfers?17:44
clarkbhttps://github.com/go-gitea/gitea/issues/22233 gitea lfs doesn't do chunked transfers. Is it possible that gitea doesn't do chunked transfers for git as well?17:46
clarkboh though that seems to be an implementation detail of the git lfs client. It requires the headers to be set even though apparently this isn't strictly necessary in general17:47
clarkband double checking it looks like POST is the method used when git-upload-pack runs (which confusingly is what is used to download packs?)17:51
mnasiadkaOk, the bionic/focal addition jobs look a lot better now17:53
fungitrying to catch up with the copious scrollback, but it sounds like the current theory is some sort of networking problem between haproxy and apache on the gitea servers?17:54
clarkbfungi: or between our git clients and haproxy17:55
clarkband/or http chunked transfer encoding problems within the system17:55
clarkbI'm testing some clones locally with smaller postBuffers to see if I can trip the problem17:55
fungiand the git clients are on job nodes? so are we seeing the issue from all providers or only some?17:56
clarkbyes on job nodes. We're seeing problems from rax dfw and ovh bhs1 at least.17:56
fungiovh bhs1 and the gitea servers are geographically near one another at least17:57
clarkbsame continent yup17:57
fungioh, i thought they were in the same metro area. did we put the gitea servers in sjc1 instead of ca-ymq-1?17:58
clarkbyes17:58
fungigot it, i guess it's gerrit that's in ca-ymq-117:58
clarkbyup17:58
fungiokay, so not as close to each other as i had thought17:59
clarkb`git -c http.postBuffer=65536 clone https://opendev.org/openstack/nova nova-small-post-buffer` this seems to work for me. A smaller post buffer is rejected by curl as being too small. The default git uses is apparently 1MB. So I don't think the buffers are the problem. I suspect it's more of a latency/throughput problem affecting timeouts18:10
clarkbcorvus: I noticed the nova team discussing rechecks and asked them to share the failures to cross check zuul-launcher stuff18:17
clarkbcorvus: https://review.opendev.org/c/openstack/nova/+/948079 is the rechecked change and each of these three failures were multinode jobs that ran in mixed cloud provider environments:18:17
clarkbhttps://zuul.opendev.org/t/openstack/build/66557ea92bae4e809c09526eba619c07 https://zuul.opendev.org/t/openstack/build/c74d7a60b6cd43f7a778e72ae0e61e9b https://zuul.opendev.org/t/openstack/build/48912a161c9845a4818875470e4bd8ba18:18
clarkbI'm wondering if we might need to more aggressively disallow mixed provider nodesets or maybe make that an optional nodeset configuration option? jobs that don't expect colocated resources can say they are cool with it and those that do can force the old behavior?18:18
corvusclarkb: i'd like the opportunity to investigate this before deciding on a solution18:24
clarkbyup no objections to that18:24
clarkbnova did confirm that they don't expect rabbitmq to work in cross site situations (generally even outside of CI)18:24
clarkbso while i'm not positive that was the cause of the failures it does seem likely18:24
corvus2025-07-03 13:21:32,923 WARNING zuul.Launcher: [e: 1e2dafc88ac549c296a182ab6fb9bdc1] [req: 1ebe8dd654174c29b9e7110c93809cd3] Error scanning keys: Timeout connecting to 104.130.158.87 on port 2218:28
corvus2025-07-03 13:21:32,924 ERROR zuul.Launcher: [e: 1e2dafc88ac549c296a182ab6fb9bdc1] [req: 1ebe8dd654174c29b9e7110c93809cd3] Marking node <OpenstackProviderNode uuid=03db1a7fac774f1b9641aae512b64d71, label=ubuntu-noble, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/rax-dfw-main> as failed18:28
corvusso it tried to build both in rax-dfw; one of them failed to boot; so it executed a fallback to a different provider18:29
clarkbI guess we don't retry 3 times anymore?18:29
corvusnot in the same provider, which is generally an improvement18:29
corvusoh actually we will retry using the same provider under some circumstances...18:31
clarkbI don't have concrete evidence but I feel like for ssh failures like that where the cloud is reporting success otherwise retrying in the same provider is probably a reasonable thing to do18:32
clarkbvs provider just straight up failed18:32
clarkbbut also I think the underlying issue here is that existing jobs have made assumptions that the nodes will all cohabitate the same routable network space and that breaks when we mix nodes. This is why I wondered if making that a requirement on a nodeset level makes sense18:32
corvusat this point, with this cloud provider, it's probably that they're just taking too long to boot18:33
clarkbmnasiadka: https://zuul.opendev.org/t/opendev/build/2f4200f24ae4492295d5c43307e93038/log/job-output.txt#4353-4357 here is a retry that succeeds. It added 5 minutes to the build but success...18:33
clarkbcorvus: ya that would be my suspicion too. But that probably isn't a universal problem, it's probably some subset of hypervisors or semi random based on noisy neighbors18:34
corvusi wonder if we need to bump that timeout?  i believe we did do that for ord, maybe we need to do that for dfw.18:34
clarkbthat could be. Maybe we're near the border so some succeed and some fail and bumping it would get us closer to 100% succeeding18:35
clarkbI'd be willing to try that18:35
corvusi'm still looking.  give me more time.18:36
clarkback18:37
clarkblooking at that trying again case this is curious: the fetches seem to go over ipv4 first then ipv6 then ipv4 then ipv6 again18:39
clarkbwhen the error occurs it was using ipv6. When it succeeds on the second attempt it appears to be using ipv418:40
clarkbThen 5 minutes later it uses ipv6 again18:40
clarkbtheory time: the nodes have flaky ipv6 for some reason18:40
clarkbwhich would explain clients not reading data quickly enough and the back and forth behavior we see on the haproxy frontend logs18:41
fungialso possible git implements some sort of "if i failed using ipv6 fall back to v4 and vice versa" logic18:41
clarkbfungi: in this case the trying again is a second git command invocation via mnasiadka's update to dib18:41
clarkbfungi: and we start only using ipv418:41
corvusclarkb: my initial reading of the code suggests that because one of the nodes was in dfw and had not failed, we should have retried that 3 times in rax-dfw before falling back on another provider; so i think we should treat this as a bug18:42
fungii see, and the v4 requests are fine it's only the v6 requests that have problems?18:42
clarkbfungi: with my sample size of one yes18:42
clarkbcorvus: ack18:42
clarkband in this case I never see a cD the node isn't able to connect to the load balancer at all18:43
clarkbI think we should entertain the idea that ipv6 problems on the node or between node and load balancer are causing at least some of these errors18:44
mnasiadkaclarkb: so the retries are needed, happy that it helps a bit and we can get a green run of all jobs finally18:44
clarkbfungi: would syslog capture arp updates? Wondering what we should collect from the node logs to verify that sort of thing18:45
clarkbanother option is to hold a node and monitor its network interfaces as this would in theory happen regardless of the dib builds as they don't touch networking particularly while doing git repo updates18:46
fricklerabout 120 new config-errors due to "nodeset ubuntu-xenial not found", mostly for starlingx repos https://zuul.opendev.org/t/openstack/config-errors?name=Nodeset+Not+Found&skip=019:02
Clark[m]I think I pushed some changes to starlingx to get ahead of that a while back and they didn't go anywhere 19:05
frickleryeah, they also didn't act upon the centos-7 and opensuse removals. I wonder when it is time to take stronger measures, like drop those repos from zuul19:08
clarkbildikov: ^ you may have thoughts on that. Doesn't look like slittle is in here19:15
corvusclarkb: i think i see the bug.  easy fix, probably a hard test, will write after lunch.19:16
clarkbcorvus: awesome. I'm about to pop out on a bike ride but should be able to review it when I get back19:17
clarkbildikov: frickler  looks like I only did opensuse and centos 7 here: https://review.opendev.org/c/starlingx/zuul-jobs/+/90976619:17
clarkbwhich didn't merge but is in merge conflict now so maybe they fixed it themselves with a different change19:17
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/954053 and https://review.opendev.org/c/opendev/system-config/+/954051 fix some straightforward issues and should be quick reviews if you get a moment19:19
clarkbone fixes arm64 testing for system-config and the other should fix zuul's zk stored key backups that frickler discovered weren't working due to the docker compose switch19:20
mnasiadkaclarkb: I see that https://review.opendev.org/c/opendev/zuul-providers/+/953269 is green now - so would appreciate a review on the DIB depends-on19:27
fungiclarkb: i don't know of anything that logs arp updates normally, though there might be a sysctl toggle or somewhere in the /sys tree or /proc/net to make the kernel log them to dmesg19:34
fungii guess what you really want to know is arp overwrites, not just arp updates?19:36
fungi(i.e. arp messages arriving with a different mac for the same ipv4 address while there are unexpired entries for it in the table)19:37
fungithough also the kernel will opportunistically maintain the arp table when receiving normal (non-arp) packets too19:38
fungiin any case, i think it treats it as an update or an overwrite regardless of whether it was an actual arp packet, just to be clear19:39
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors  https://review.opendev.org/c/openstack/diskimage-builder/+/72158119:56
corvusclarkb: okay, fixes uploaded 954064 and 954065 would be good too20:08
corvus(the reason why that failed is that the method did not see that there was a "main" provider for the request, so it selected a new one)20:09
corvuswith that merged, it should try 3 times on the same provider, before it falls back to a different one.20:10
opendevreviewMerged opendev/system-config master: Update arm64 base job nodeset  https://review.opendev.org/c/opendev/system-config/+/95405320:32
clarkbfungi: I think any change to the ipv6 address or routes is what we want to know about21:42
clarkbfungi: as it almost seems like the ip address is simply becoming invalid for 5-10 minutes then going away21:43
clarkbbasically figure out if ipv6 is going away or updating which may explain why the requests are made with differing ip protocols over time21:47
clarkbalternatively it could be the load balancer's network that is flapping I guess21:49
fungiclarkb: oh, for ipv6 it won't be arp (address resolution call), it'll be nd (neighbor discovery) instead22:16
fungis/call/protocol/22:17
clarkboh right22:18
clarkbin any case what I'm trying to figure out is how can we log/capture changes to the ipv6 (and I guess ipv4 though that seems to work consistently on the system I checked) to see if that is what causes the errors22:19
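(One low-tech way to capture that on a held node is to just poll and log the v6 address and route tables and diff them afterwards; a sketch, nothing opendev-specific, and the log path/interval are arbitrary:)

    # Append a timestamped snapshot of IPv6 addresses and routes every 10 seconds.
    while true; do
        {
            date -u +%FT%TZ
            ip -6 addr show
            ip -6 route show
        } >> /var/log/ipv6-watch.log
        sleep 10
    done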
clarkbthinking about it further I suspect that the change must be happening on the client side system, because it doesn't seem to do the long timeout then fallback to ipv4 thing when ipv6 doesn't work22:30
