clarkb | a number of the grafana graphs on the zuul status and afs dashboards are blank (but don't have the typical no data message). I think we may need to hold a node and check if that is an artifact of selenium screenshotting before things can load or a problem with our graphs in 12.0.2 | 00:22 |
clarkb | I can probably do that tomorrow morning | 00:22 |
clarkb | and the project-config sync change failed due to differing node cloud providers https://zuul.opendev.org/t/openstack/build/6abe86a117154b43980ca6d91298ffa8/log/job-output.txt#31-53 so that's still happening but definitely less often than before | 00:26 |
clarkb | now i must find dinner | 00:26 |
corvus | the 0200 periodic jobs appear to be a non-event today after that quota calc fix | 02:48 |
frickler | corvus: the daily cron for "docker exec zuul-scheduler_scheduler_1 zuul-admin export-keys" is failing because the container is now named zuul-scheduler-scheduler-1. fallout from the docker compose upgrade? | 04:13 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 07:04 |
mnasiadka | frickler: in https://review.opendev.org/c/opendev/zuul-providers/+/953908?tab=change-view-tab-header-zuul-results-summary we seem to have the same issues - let me focus on getting the DIB retry patch working properly | 07:13 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:28 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 07:29 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:40 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:46 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 07:57 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 08:08 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:04 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:15 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:27 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 09:49 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: source-repositories: Increase http.postBuffer https://review.opendev.org/c/openstack/diskimage-builder/+/954025 | 10:24 |
mnasiadka | frickler: I'm starting to have a feeling it's not retries we need, on the latest version of the retries patch I'm getting "error: unable to rewind rpc post data - try increasing http.postBuffer" on the openstack/nova repository - constantly ;-) | 10:27 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: source-repositories: Increase http.postBuffer https://review.opendev.org/c/openstack/diskimage-builder/+/954025 | 10:30 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 10:30 |
mnasiadka | Let's see if that helps - but git docs say it rather should not be used: https://git-scm.com/docs/git-config#Documentation/git-config.txt-httppostBuffer | 10:42 |
mnasiadka | Well, cloning opendev.org/openstack/nova.git seems extremely slow on counting and compressing objects | 10:58 |
frickler | mnasiadka: is this only for the images which don't have the cached repos yet? then I fear that this is the issue I was predicting a long time ago: without a shared cache it will be difficult to build new images from scratch | 11:44 |
Clark[m] | All of our image builds should occur where we have cached images. | 12:40 |
Clark[m] | mnasiadka: I think the gitea server should support http 1.1 and chunked transfers. But maybe there is a bug there? | 12:42 |
Clark[m] | frickler: yes the _ to - conversion in the container names is a side effect of the docker-compose to docker compose transition | 12:42 |
Clark[m] | Maybe we should consider holding an image build test node so that we can test git fetches from the node to gitea and check things like what protocol versions are used and how the cache is being used. | 12:44 |
Clark[m] | Also cross check against the nodepool image builds | 12:45 |
mnasiadka | Clark[m]: setting it to an unreasonably high value did not help, I’m thinking it’s some bug in gnutls maybe? | 12:46 |
Clark[m] | I suppose that could be possible. We updated gitea semi-recently, which I think was the first update after bookworm's big package update? I noted an opendev deployment job failed to fetch a repo yesterday for similar reasons. Maybe we need to be looking closer at gitea | 12:49 |
Clark[m] | Cross checking the nodepool image build logs is probably a good first step then looking at the gitea issue tracker to see if others have hit similar | 12:50 |
mnasiadka | I think the weird thing is a retry doesn’t help (at least not on the Nova repository) but I’ll try reproducing it locally | 12:54 |
Clark[m] | Oh another thought for testing would be to try different backends. They can be addressed using https://gitea09.opendev.org:3081 through gitea14 | 12:56 |
Clark[m] | Maybe one particular backend is a problem and that explains the infrequency | 12:56 |
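For reference, a direct-to-backend test along the lines Clark describes might look like the following; it bypasses the haproxy frontend entirely, the hostname and port come from the message above (swap in gitea10 through gitea14 to compare backends), and the destination path is just illustrative.

```bash
# Clone straight from one gitea backend, skipping the opendev.org load balancer;
# gitea09..gitea14 on port 3081 are the backends mentioned above.
git clone https://gitea09.opendev.org:3081/openstack/nova /tmp/nova-gitea09
```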
corvus | anyone set a hold yet, or should i? | 13:01 |
Clark[m] | I haven't. I think you should | 13:04 |
fungi | looks like we have at least one xenial-based job still set to run on openstack/project-config, resulting in "Unable to freeze job graph: The nodeset "ubuntu-xenial" was not found." https://review.opendev.org/c/openstack/project-config/+/950771 | 13:06 |
fungi | i'll see if i can track it down after i get off my current conference call | 13:06 |
corvus | https://zuul.opendev.org/t/opendev/autoholds set for all the non-arm64 jobs (because we only have 15 arm64 nodes) | 13:07 |
corvus | and i've re-enqueued https://zuul.opendev.org/t/opendev/buildset/a014c19f845a483eb6b47c6d9075b749 | 13:07 |
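As a rough illustration of the autoholds corvus mentions, a zuul-client invocation along these lines could set one; the project, job name, and reason below are placeholders rather than the actual holds, which may also have been created through the web API.

```bash
# Hypothetical example only: hold the first failing build of one image-build job
# in the opendev tenant so the node survives for debugging.
zuul-client --zuul-url https://zuul.opendev.org autohold \
  --tenant opendev \
  --project opendev/zuul-providers \
  --job opendev-build-diskimage-ubuntu-noble \
  --reason "debugging git fetch failures during image builds" \
  --count 1
```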
corvus | fungi: project-config-bindep-fallback-ubuntu-xenial looks suspicious | 13:09 |
fungi | aha, good eye | 13:10 |
fungi | looks like we dropped it from bindep but not project-config | 13:10 |
*** cloudnull109 is now known as cloudnull10 | 13:11 | |
fungi | i notice publish-openstack-sphinx-docs-base also uses ubuntu-xenial | 13:13 |
corvus | (but it says it's deprecated) | 13:14 |
fungi | yeah, though it's the parent of publish-openstack-sphinx-docs which is in the pipelines of a number of projects, apparently according to codesearch | 13:14 |
fungi | trying to see if they're overriding it | 13:14 |
corvus | i don't see any: https://codesearch.opendev.org/?q=publish-openstack-sphinx-docs&i=nope&literal=nope&files=&excludeFiles=&repos= | 13:16 |
fungi | oh, it's commented out in some zuul configs, i misread | 13:17 |
fungi | okay, i'll try removing it too | 13:17 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop lingering ubuntu-xenial bindep job https://review.opendev.org/c/openstack/project-config/+/954035 | 13:20 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop publish-openstack-sphinx-docs and parent https://review.opendev.org/c/openstack/project-config/+/954036 | 13:20 |
corvus | maybe need to do both of those in one change | 13:21 |
fungi | i'm hoping the lack of use of the second will make that possible to split | 13:22 |
fungi | but i'll squash them if necessary, yes | 13:22 |
fungi | yeah, looks like i do need to | 13:23 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop lingering ubuntu-xenial jobs https://review.opendev.org/c/openstack/project-config/+/954035 | 13:26 |
corvus | okay that looks good now... but since it's a config-project we can't get it through gate. we could either add the nodeset back, or force-merge it. i vote force-merge. | 13:28 |
fungi | oh, yep | 13:28 |
fungi | can do in a sec | 13:29 |
Clark[m] | Ya force merge seems fine for situations where we remove jobs | 13:32 |
Clark[m] | Unlikely to break anything worse | 13:32 |
opendevreview | Merged openstack/project-config master: Drop lingering ubuntu-xenial jobs https://review.opendev.org/c/openstack/project-config/+/954035 | 13:37 |
fungi | i've rechecked https://review.opendev.org/950771 to make sure things are working again | 13:39 |
fungi | and it enqueued, so we should be all set now | 13:40 |
corvus | i started some notes on one of the previous failed image builds: https://etherpad.opendev.org/p/u6w-_J4CFOZ9eNJQmYDB | 13:51 |
corvus | that's a link to the build, and the haproxy logs that correspond to it around the time it failed | 13:52 |
corvus | i think the first 2 lines are for the previous successful update; the failed one starts at :08 | 13:53 |
opendevreview | Merged openstack/project-config master: Comply with Policy for AI Generated Content https://review.opendev.org/c/openstack/project-config/+/950771 | 13:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix zuul scheduler container name in docker exec commands https://review.opendev.org/c/opendev/system-config/+/954051 | 13:53 |
clarkb | corvus: frickler ^ I think that will handle the renamed zuul scheduler container everywhere I found us using the old name | 13:54 |
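For clarity, the rename frickler hit is the underscore-to-hyphen change in compose v2 container names, so the daily export cron needs something like the following sketch; the export destination argument is omitted here, just as it was in the quoted cron snippet.

```bash
# docker compose v2 names the container zuul-scheduler-scheduler-1 (hyphens)
# instead of the old zuul-scheduler_scheduler_1 (underscores).
docker exec zuul-scheduler-scheduler-1 zuul-admin export-keys
```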
clarkb | corvus: ya cD in the haproxy lines means "closed disconnected" iirc | 13:54 |
clarkb | the -- is a normal connection close. Let me see if I can find the haproxy docs on that cD status | 13:55 |
clarkb | ah c not C is "client side timeout expired waiting for the server to send or receive data" | 13:55 |
clarkb | *waiting for the client to send or receive data | 13:56 |
clarkb | then D means the session was in the data phase. So it seems that haproxy (and maybe gitea?) expect the client to keep talking but it doesn't. Maybe that is because the client is busy processing data trying to determine what its next request should be? | 13:57 |
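A quick way to survey how often that termination state shows up, assuming the load balancer logs haproxy to /var/log/haproxy.log (the actual log location may differ) and that the gitea frontend runs in TCP mode where the two-character termination state appears as its own field:

```bash
# Count and sample recent sessions that ended with a client-side timeout during
# the data phase ("cD"), as opposed to normal closes ("--").
sudo grep -c ' cD ' /var/log/haproxy.log
sudo grep ' cD ' /var/log/haproxy.log | tail -n 20
```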
corvus | https://zuul.opendev.org/t/opendev/stream/811315bc55aa4d4a9825d07594491e44?logfile=console.log | 13:58 |
corvus | is fetching nova from gitea10 right now | 13:58 |
clarkb | I see this process in ps listing for gitea10 `/usr/bin/git -c protocol.version=2 -c credential.helper= -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= upload-pack --stateless-rpc /data/git/repositories/openstack/nova.git` | 14:01 |
clarkb | which indicates protocol v2 is used. Not sure if that is helpful or not. | 14:01 |
clarkb | also I was just able to clone nova from scratch to my local machine from gitea09. So this works at least some of the time | 14:02 |
corvus | git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git +: | 14:02 |
clarkb | corvus: your stream log indicates it just failed but the process is still running on gitea10 | 14:02 |
corvus | that's what the dib build host (client) is running | 14:02 |
clarkb | I wonder if the proxy is creating problems and if we let it keep going it would be fine? | 14:03 |
corvus | GIT_HTTP_LOW_SPEED_LIMIT=1000 GIT_HTTP_LOW_SPEED_TIME=300 | 14:03 |
corvus | looks like we set a timeout of 5m which is when this is failing | 14:03 |
corvus | why is a fetch of nova taking 5m though | 14:03 |
corvus | root@npb9e6c9b2f06e4:/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940# du -sh . | 14:04 |
corvus | 231M. | 14:04 |
clarkb | corvus: oh interesting. We set that on the client side? Then the client is giving up early and the proxy records that as a client side timeout? The odd thing is ya why isn't the data transferring sufficiently to keep the connection timeouts happy | 14:05 |
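Putting those pieces together, the failing transfer should be reproducible from a client by rerunning the element's fetch with the same low-speed abort settings quoted above; the cache path is the one from the build node, so adjust it to wherever a copy of the repo lives.

```bash
# Abort if throughput stays below ~1000 bytes/s for 300s, matching the job settings.
GIT_HTTP_LOW_SPEED_LIMIT=1000 GIT_HTTP_LOW_SPEED_TIME=300 \
  git --git-dir=/opt/dib_cache/source-repositories/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
  fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git '+*:*'
```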
clarkb | corvus: what does git show master and/or HEAD look like against that repo? Just wondering if it is super stale or maybe invalid somehow | 14:05 |
corvus | commit 43d57ae63d1ecda24d8707b4750d404daadc980f (HEAD -> master) Date: Fri Jun 27 20:29:16 2025 +0000 | 14:06 |
clarkb | ok so both valid and not that old | 14:08 |
corvus | root@23.253.108.26:nova.tgz | 14:08 |
corvus | i tar'd up the nova repo from the dib build if you want to grab a copy | 14:08 |
corvus | this node should hit the autohold, but just in case | 14:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update arm64 base job nodeset https://review.opendev.org/c/opendev/system-config/+/954053 | 14:10 |
clarkb | corvus: thanks I'll fetch that now | 14:10 |
clarkb | thinking out loud here maybe once that node is held we can try and rerun that fetch and strace the process to see what the client is doing? | 14:11 |
corvus | ++. i'm also doing that locally :) | 14:12 |
corvus | actually, i'm just starting with "time ... git fetch..." | 14:12 |
clarkb | 954053 is similar to the xenial removals but in this case it's the lack of arm64 bionic and focal images. We don't need them for system-config anymore so I just cleaned it up instead | 14:12 |
corvus | +2 | 14:13 |
clarkb | oh except I made a silly copy paste error | 14:14 |
corvus | https://paste.opendev.org/show/b39Zq0Rj4G9tv8dgDIoC/ got that running the fetch locally | 14:14 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update arm64 base job nodeset https://review.opendev.org/c/opendev/system-config/+/954053 | 14:14 |
clarkb | timeout client 2m | 14:17 |
stephenfin | clarkb: fungi: Another large bundle of pbr changes starting here, whenever you have time https://review.opendev.org/c/openstack/pbr/+/954040 | 14:17 |
clarkb | this is what we set in the haproxy config for gitea-lb. I half wonder if it just takes git more than 2 minutes to figure out what to fetch from the nova repo after getting a listing of the details? | 14:17 |
stephenfin | I'm going to rebase https://review.opendev.org/c/openstack/pbr/+/949055/ on top of it once the base patch (which is in the queue) merges, dropping the second one in the process (which is currently borked) | 14:18 |
clarkb | iirc protocol v2 in particular is all about being smarter about what to negotiate to minimize network traffic, but maybe that requires us to be able to calculate the difference in under 2 minutes? | 14:18 |
clarkb | I hesitate to increase those timeouts without a better understanding simply because that could make things worse as more connections hang around | 14:19 |
clarkb | whereas a clone (which i just did successfully) is more of a copy everything and sort details out later situation | 14:20 |
clarkb | (you just grab the packfiles directly iirc rather than generating delta packfiles I think) | 14:21 |
corvus | in the strace, i think i see git reading a bunch of "have facd8951f7ca4e3dd613d70aba8" lines from a subprocess, sending some data to the remote server after each one, then a "done", then it does a select busy wait until it times out | 14:24 |
corvus | (the checksum changes for each "have" line, that's just an example) | 14:25 |
corvus | at first glance, that seems consistent with "calculating delta takes too long" | 14:25 |
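For anyone following along, a sketch of how to capture that negotiation from the client side; the repo path here matches the /tmp/dibtest copy mentioned later in the log and the output file name is arbitrary, so adjust both as needed.

```bash
# Trace the fetch, then look for the stream of "have <sha>" lines, the "done",
# and the long select() wait corvus describes; -s 256 keeps enough of each string to read.
strace -f -s 256 -o /tmp/nova-fetch.strace \
  git --git-dir=/tmp/dibtest/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git \
  fetch -q --prune --update-head-ok https://opendev.org/openstack/nova.git '+*:*'
grep -c 'refs/changes' /tmp/nova-fetch.strace
```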
Clark[m] | Thinking out loud again: I wonder if just cloning the repo over again is a good fallback? Or we can try increasing the timeout on haproxy and/or the client | 14:27 |
Clark[m] | Neither are likely to be super efficient, but maybe good enough? | 14:28 |
corvus | i'm running the dib git fetch command on a brand new nova clone... give me a minute and i'll tell you how that goes | 14:28 |
Clark[m] | Also maybe those caches grow crusty over time and need a repack | 14:29 |
Clark[m] | I think with long lived servers managing the data git does that automatically. Not sure if we get the same behavior with this more one shot approach | 14:29 |
corvus | that fetch command failed with the tls error on a brand new clone | 14:30 |
corvus | did, erm, dib change anything about git fetching recently? | 14:31 |
Clark[m] | Ok so repack is unlikely to help. Falling back to a clone may work but we'll always reclone | 14:31 |
corvus | because aren't we also looking at a difference between release and git master? | 14:31 |
Clark[m] | No changes to dib git stuff other than what mnasiadka has pushed to try and make this better that I am aware of | 14:32 |
Clark[m] | All of the updates have been to adding distro support recently not changing the core caching elements | 14:32 |
corvus | so just a plain "git fetch" returns in 1.45 seconds | 14:33 |
corvus | i think we need to understand what that particular fetch command is doing better | 14:33 |
Clark[m] | Could be that the nova repo has changed in such a way to tickle the behavior in git. Or maybe the bookworm updates that we may have pulled into the latest gitea upgrade updated git? | 14:33 |
Clark[m] | ++ if regular fetch is fine then let's understand that | 14:33 |
corvus | i agree with all those questions about git -- but the variable i still don't understand is nightly jobs mostly work | 14:34 |
Clark[m] | The +*:* means update every ref? | 14:34 |
corvus | that would mean all of the refs/changes too; that's not going to be in the clone | 14:35 |
Clark[m] | Oohhh | 14:35 |
Clark[m] | Ya we may need something that restricts to refs/not changes ? But maybe that is the difference | 14:35 |
Clark[m] | We should confirm that is what +: means | 14:36 |
corvus | still struggling with "how did this ever work" | 14:36 |
Clark[m] | Maybe gitea didn't advertise those change refs because it has long been ignorant of them. But after the recent upgrade that possibly changed? | 14:37 |
corvus | okay, that's plausible... nightly jobs... maybe gitea runs just a bit faster then? | 14:37 |
Clark[m] | Ya could be we sneak under timeout values when things are quieter | 14:38 |
Clark[m] | Maybe we should test this with a much smaller repo and see what the behavior is and see if that confirms some of these assumptions/ideas | 14:38 |
corvus | [pid 1637860] newfstatat(AT_FDCWD, "/tmp/dibtest/nova_fc7fe01d5e091fd6bdee3c94b8421f772daf2940/.git/refs/changes/99/99699/4", 0x7ffd64ffb720, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) | 14:38 |
corvus | tons of those in my strace ^ | 14:38 |
corvus | so the client is being told about the refs/changes and is checking to see if it has them | 14:39 |
Clark[m] | And is building a big list of things to request when done and then we timeout probably | 14:39 |
corvus | sounds plausible | 14:39 |
Clark[m] | https://github.com/git/git/commit/8e3ec76a20d6abf5dd8ceb3f5f2c157000e4c13e | 14:40 |
Clark[m] | I think we can set a negative refspec for refs/changes in dib and do a depends on to test that | 14:41 |
corvus | ++ | 14:41 |
Clark[m] | It'll be about half an hour before I can write that change. I'm happy for someone else to beat me to it | 14:42 |
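A rough sketch of the change Clark is proposing, assuming the negative refspec support from the linked git commit (git 2.29 or newer); the exact invocation gets worked out further down, and ${CACHE_PATH} mirrors the variable the dib element already uses.

```bash
# Exclude Gerrit change refs from the otherwise catch-all refspec so the client
# never negotiates over refs/changes/* it will never have locally.
git -C ${CACHE_PATH} fetch -q --prune --update-head-ok \
  https://opendev.org/openstack/nova.git '^refs/changes/*' '+*:*'
```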
mnasiadka | I don't know if from a better geographical location it's the same - but doing just "git clone https://opendev.org/openstack/nova" stalls a lot on the initial GET/POST of git-upload-pack / compressing objects, and git clone from GitHub is instantly just fetching objects | 14:43 |
corvus | one problem with this hypothesis: | 14:43 |
corvus | https://zuul.opendev.org/t/opendev/build/2e3e056fb57b46d2bbd70682008ff29f/log/job-output.txt#3277 | 14:43 |
corvus | that update of nova from last night took 2 seconds | 14:44 |
corvus | that's not just skating by... there's something different | 14:44 |
Clark[m] | Hrm | 14:46 |
mnasiadka | Well, the nova problem surfaced today from my perspective | 14:46 |
mnasiadka | So it might be totally different issue that we observed in previous days | 14:46 |
corvus | oh, so something may have changed in the last 12 hours | 14:47 |
Clark[m] | mnasiadka GitHub may cache checkpoint state for each repo allowing them to immediately start pulling that state whereas gitea (and just a normal git based server) has to construct at least some info iirc | 14:49 |
Clark[m] | I don't think it is abnormal to have delays while the data is prepared | 14:49 |
mnasiadka | that might make sense, just an observation why I normally clone from GitHub ;-) | 14:49 |
Clark[m] | The problem here seems to be more that the negotiation goes beyond what our timeouts allow for | 14:49 |
mnasiadka | anyway, it might be some package version that has been bumped in the last 12hrs - previously we've seen multiple errors like connection timed out and others - now all jobs break on nova with the same error | 14:50 |
corvus | unfortunately, we move the image-cached git repos into the dib cache at the start of the image build jobs, so on this host, i can't inspect the original state | 14:52 |
corvus | okay, i logged into a random host running a normal job, and the nova repo looks the same as on the held node for the failed image build | 14:54 |
corvus | 231M size, no change refs | 14:55 |
corvus | same head commit | 14:55 |
corvus | the successful image build from last night has not been uploaded (i suspect that's a separate problem, not highest priority); but that means that all our jobs, including the one i just looked at, and our failed image builds, are all probably running on the same image as that successful image build from last night. | 14:58 |
corvus | in other words, i don't think we're looking at a situation where our most recent image build broke something and is causing the current failures. | 14:58 |
corvus | my current thinking: i still don't see a solution other than the negative refs change Clark proposed, but i also don't understand what could have changed since our last successful nightly build. | 15:01 |
corvus | erm, coincidentally, our gitea docker containers are 12 hours old | 15:01 |
corvus | it looks like we restarted gitea because there's a newer mariadb container; i don't think we upgraded gitea though. | 15:06 |
Clark[m] | That makes sense. Another thought I had is it could be the git client version. Nodepool runs on Debian bookworm but these builds use noble iirc. And yet another idea is that this has to do with the timing of git repacking on the giteas. | 15:07 |
Clark[m] | Maybe running near a repack is more likely to succeed because you get more ref data by default fetching one packfile | 15:07 |
corvus | i haven't been looking at nodepool builds at all, so if the git client version changed, it would be because last night's zuul build got one version and the recent builds got another in the job itself (we can check that in the job logs) | 15:08 |
clarkb | corvus: maybe you can try the negative refspec locally in your env that has been reproducing? | 15:13 |
clarkb | something like `^refs/changes/* +:` ? | 15:14 |
clarkb | I also notice that there is a --refetch flag that basically makes git fetch act more like git clone | 15:17 |
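If negotiation turns out to be the expensive part, the --refetch flag Clark mentions (added in git 2.36) skips it and transfers everything as a fresh clone would; this is a hedged comparison point rather than a proposed fix.

```bash
# Refetch all objects without negotiating against the local repo; trades extra
# bandwidth for skipping the "have"/"want" exchange entirely.
git -C ${CACHE_PATH} fetch --refetch -q --prune --update-head-ok \
  https://opendev.org/openstack/nova.git \
  '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
```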
clarkb | I'm trying to reproduce using that tar'd nova repo now as well | 15:20 |
clarkb | I am not able to reproduce on tumbleweed using git version 2.50.0 | 15:21 |
clarkb | corvus: ^ fyi this has me wondering if the git client itself may be at fault somehow | 15:22 |
clarkb | stracing locally I see it processing refs/changes too fwiw. | 15:26 |
clarkb | adding '^refs/changes/*' to my local invocation doesn't seem to change the count of refs/changes showing up in the strace though | 15:32 |
opendevreview | Clark Boylan proposed opendev/zuul-providers master: Test image builds on ubuntu jammy https://review.opendev.org/c/opendev/zuul-providers/+/954058 | 15:37 |
clarkb | that is mostly for information gathering | 15:37 |
clarkb | corvus: another idea for your reproduceable environment: maybe try gitea backends directly which bypass haproxy and see if it ever succeeds | 15:54 |
clarkb | that could inform us if there is any value to increasing timeouts on haproxy | 15:54 |
clarkb | https://zuul.opendev.org/t/opendev/build/fc5901e43f4b4f2da2d3a90c32ac3d0d/log/job-output.txt#5449-5450 this ran on jammy not noble. Looking at haproxy logs it has the same cD client disconnect | 16:06 |
clarkb | I do notice that the connection was made over ipv6 not ipv4 | 16:07 |
clarkb | is it possible that we have ipv6 connectivity issues to the gitea load balancer? | 16:07 |
clarkb | (so many variables) but that may explain why I can't reproduce locally since ipv4 only | 16:07 |
corvus | i'm v4 only. i've tried some direct backend fetches and they time out | 16:11 |
corvus | git version 2.43.0 | 16:11 |
clarkb | looking at the haproxy log the cD state occurs for both ipv4 and ipv6 | 16:11 |
clarkb | so ya I don't think this is ip protocol specific | 16:11 |
corvus | i'm running the same git client locally as the test nodes | 16:12 |
clarkb | ya and jammy seems to exhibit similar issues (though the example I shared was on a repo other than noble) | 16:13 |
clarkb | *other than nova | 16:13 |
corvus | i don't think we upgrade git in the dib build job, which means we should have the same git client version on the successful and failing builds | 16:14 |
clarkb | yup | 16:15 |
corvus | the gitea servers also did a memcached upgrade a couple of hours before the mariadb upgrade | 16:17 |
clarkb | looking at the example I linked above I see the osel update on gitea12's apache server access log but I don't see the failed ospurge update there | 16:21 |
clarkb | for that particular example its like the connection made it as far as the haproxy server but not to the backend | 16:21 |
clarkb | `gnutls_handshake() failed: The TLS connection was non-properly terminated.` maybe that means we failed at l4 so apache doesn't log the request | 16:24 |
corvus | clarkb: a fetch with the +*:* directly to a backend took 325s locally | 16:27 |
corvus | s/the/`*:*`/, s/+*:*// | 16:27 |
corvus | s/the/`+*:*`/, s/+*:*// | 16:27 |
corvus | sorry -- formatting. | 16:27 |
clarkb | corvus: does adding the '^refs/changes/*' rule to the command change that? | 16:29 |
corvus | can you share your full pattern? | 16:34 |
corvus | or command | 16:35 |
corvus | maybe it's supposed to be this? git --git-dir=/tmp/nova/.git fetch -q --prune --update-head-ok https://gitea12.opendev.org:3081/openstack/nova.git '^refs/changes/' '+:*' | 16:35 |
corvus | * maybe it's supposed to be this? git --git-dir=/tmp/nova/.git fetch -q --prune --update-head-ok https://gitea12.opendev.org:3081/openstack/nova.git '^refs/changes/*' '+*:*' | 16:36 |
corvus | anyway, starting from the repo from the image, that takes 14 seconds. then repeating it takes 0.7 seconds. | 16:37 |
corvus | so i think adding the negative refspec looks ike a reasonable fix. still a mystery why it's needed. | 16:38 |
clarkb | corvus: yes sorry that second command. Basically you add it and then it's supposed to filter out negative matches from any of the positive matches | 16:39 |
clarkb | fwiw I see some warnings for slow responses to specific commits that appear to come from "bad" crawlers during the time of the failure I linked above. Not a smoking gun for why gnutls had a sad there but maybe the giteas are getting overwhelmed periodically and then not making new connections quickly enough | 16:40 |
clarkb | looking at the apache config I believe the total connection limit there is greater than the haproxy total connection limit | 16:40 |
clarkb | so maybe haproxy is rejecting new connections or gitea itself is? | 16:40 |
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=2025-07-03T15:18:19.041Z&to=2025-07-03T16:13:43.071Z&timezone=utc the error occurred between 15:56 and 15:58 and frontend connection count looks fine | 16:41 |
clarkb | historically the connection bottleneck has been the mariadb connection limit. But I don't see any indication that is occurring during these failures | 16:43 |
clarkb | (or at all) | 16:43 |
corvus | looking at all the builds in your jammy buildset: 2 of them had the tls error, and one hit the slow speed limit. (2 more errors for rockylinux i think are probably platform based and we should ignore) | 16:43 |
clarkb | corvus: slow speed limit is what we think may be the refs/changes related item right? | 16:44 |
clarkb | maybe we start there as improving that may have a domino effect on the system improving things overall (though that is probably overly hopeful) | 16:44 |
corvus | not necessarily; it could be, or it could just be slow performance... hard to say. it's not the timeout though. | 16:45 |
corvus | on the successful jobs in your jammy change, i see a fetch for nova that took 1 second | 16:45 |
corvus | https://review.opendev.org/c/openstack/diskimage-builder/+/721581/25/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories | 16:47 |
corvus | i feel like we should discuss the fact that that change does in fact change the git fetch command | 16:48 |
clarkb | but we're seeing failures without that change (it hasn't merged) | 16:49 |
corvus | the giant pile of nova fails was a depends-on to that | 16:49 |
clarkb | oh I missed that | 16:49 |
corvus | frankly, that change has some very unusual ideas about how to implement retries and i think we should just start over with something simple and straightforward. | 16:50 |
clarkb | basically ignore the nova failures for a minute. Then reset on what retries should look like and see if that helps at all with the gnu tls handshake issues | 16:51 |
corvus | (maybe) i'm trying to work out what the old command would be and if it makes any difference | 16:52 |
corvus | okay, so looking at the old side of that diff (current code)... lines 142-143 look like a very sensible fetch that will not pull all the change refs | 16:54 |
corvus | there is a fetch above on line 134-135, but that fetch is in the second half of a conditional that will short-circuit because our reporef is * | 16:55 |
corvus | so, in the end, the old code is only going to run: git -C ${CACHE_PATH} fetch -q --prune --update-head-ok $REPOLOCATION +refs/heads/*:refs/heads/* +refs/tags/*:refs/tags/* | 16:55 |
corvus | the new code tries to do a retry for the fetch that 's in the conditional and does not allow it to short-circuit | 16:56 |
corvus | i think that fully explains the nova problem | 16:56 |
clarkb | ok | 16:57 |
clarkb | I've started looking at the slow fetch error in https://zuul.opendev.org/t/opendev/build/5e57e60d98c9478b8cf405d701c4d3bb/log/ and I see gitea13 responding with a HTTP 200 | 16:58 |
clarkb | however there are multiple packs being returned there. Maybe there is an additional one that didn't occur/succeed within the time? | 16:58 |
mnasiadka | corvus: huh, thanks for the analysis | 17:03 |
corvus | mnasiadka: i think that change is fundamentally flawed and you should start over with a new one :) | 17:05 |
corvus | maybe just make a function that does git retries, then call that from all the locations. and don't try to match on error output, just use exit codes. :) | 17:05 |
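A minimal sketch of the kind of helper corvus is suggesting (not the actual dib code): retry purely on exit status, no matching on error output, with a short pause between attempts.

```bash
# Retry a git command a few times based only on its exit code.
git_retry() {
    local attempts=3 delay=30 i
    for ((i = 1; i <= attempts; i++)); do
        git "$@" && return 0
        if (( i < attempts )); then
            echo "git $* failed (attempt ${i}/${attempts}); retrying in ${delay}s" >&2
            sleep "${delay}"
        fi
    done
    echo "git $* failed after ${attempts} attempts" >&2
    return 1
}

# Example call site mirroring the element's existing fetch:
# git_retry -C "${CACHE_PATH}" fetch -q --prune --update-head-ok "$REPOLOCATION" \
#     '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
```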
mnasiadka | And I just tried to spend less time doing that... I don't want to say that it usually ends up like this ;-) | 17:05 |
mnasiadka | well, the latest version is not trying to match on error output | 17:05 |
corvus | clarkb: the cacti graphs for gitea13 at 15:56 look okay | 17:08 |
corvus | ethernet traffic ramps up to 50Mbps and then drops then, but i think that's expected traffic from that job probably. the host has previously sustained 140Mbps. | 17:09 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 17:16 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 17:17 |
corvus | i'm going to release the autoholds; any objections? | 17:20 |
clarkb | none from me. I think if we want to debug the other issues we should get new autoholds | 17:22 |
clarkb | I found https://github.com/go-gitea/gitea/pull/34842 upstream but it seems to be lfs specific so unlikely to be in play here | 17:22 |
clarkb | also https://github.com/go-gitea/gitea/issues/34003 others struggling with the archives and turning it off seems to be the suggestion (that is what we did) | 17:23 |
clarkb | separately digging through the logs both chatgpt and google appear to be crawling the frontend and now the backends | 17:25 |
clarkb | that sort of behavior makes me want to ban both of them. We aren't hosting on port 443 or port 80 | 17:25 |
clarkb | they are just multiplying the total load they generate by 6 as a result | 17:26 |
clarkb | (and thats on top of them failing to just use git clone) | 17:26 |
corvus | no objection here | 17:26 |
clarkb | the downside to that is I don't know how that might affect google search :/ | 17:27 |
clarkb | I guess another option would be to block direct backend access but that has been extremely useful for testing and similar in the past | 17:28 |
clarkb | I'll have to think about this more | 17:28 |
corvus | looking into image builds/uploads on the launcher side -- i think things are normal there; looks like there were some flex swift issues that borked the intermediate uploads, so we missed images for 2 days, but before and after are fine, and uploads are proceeding. | 17:28 |
clarkb | looks like gptbot may be ignoring our crawl-delay too, but that's possibly because they are talking to the frontend and the backend at the same time, crawling the same data on each one even | 17:32 |
clarkb | it's the most braindead thing ever | 17:32 |
corvus | we should spin up more backends to deal with the load! ;) | 17:33 |
clarkb | going back to the slow fetch I do wonder if that is just network slowness. Since everything I can find on the backend server side seems to indicate it responded successfully | 17:33 |
clarkb | then the data doesn't get back to the build server quickly enough and it errors on its own timeout which causes the cD state on haproxy | 17:34 |
clarkb | that may be a side effect of building images on random nodes in random clouds. I think for those retries and/or tuning the client side timeouts may be appropriate | 17:34 |
clarkb | that leaves us still trying to debug these gnu tls handshake issues. Which I suppose could be a similar problem but maybe during the l3 SYN/SYN-ACK/ACK exchange or the equivalent for l4? | 17:35 |
clarkb | that one I'm having a hard time getting a handle on because I never seem to see the connection in apache or the backend. Haproxy knows about it and logs the error disconnect cD state but our backends never seem to know | 17:35 |
corvus | that's kind of what i'm thinking (all 4 messages) | 17:35 |
clarkb | thinking out loud here it could be some network hiccup between the load balancer and the backends but they all live in the same cloud region. It could also be internet hiccups. I don't want to bother the cloud provider unless we can find more evidence of that sort of thing first though | 17:38 |
clarkb | https://github.com/argoproj/argo-cd/issues/3994 is interesting | 17:41 |
clarkb | essentially they are saying that openssl plays better with http 1.1 transfer-encoding: chunked than gnutls does | 17:43 |
clarkb | and that is why increasing the postbuffer helps. Because you only switch to chunked when you make requests larger than that buffer | 17:43 |
clarkb | with the implication being that maybe our combo of haproxy and apache2 and gitea http don't properly support chunked transfers? | 17:44 |
clarkb | https://github.com/go-gitea/gitea/issues/22233 gitea lfs doesn't do chunked transfers. Is it possible that gitea doesn't do chunked transfers for git as well? | 17:46 |
clarkb | oh though that seems to be an implementation detail of the git lfs client. It requires the headers to be set even though apparently this isn't strictly necessary in general | 17:47 |
clarkb | and double checking it looks like POST is the method used when git-upload-pack runs (which confusingly is what is used to download packs?) | 17:51 |
mnasiadka | Ok, the bionic/focal addition jobs look a lot better now | 17:53 |
fungi | trying to catch up with the copious scrollback, but it sounds like the current theory is some sort of networking problem between haproxy and apache on the gitea servers? | 17:54 |
clarkb | fungi: or between our git clients and haproxy | 17:55 |
clarkb | and/or http chunked transfer encoding problems within the system | 17:55 |
clarkb | I'm testing some clones locally with smaller postBuffers to see if I can trip the problem | 17:55 |
fungi | and the git clients are on job nodes? so are we seeing the issue from all providers or only some? | 17:56 |
clarkb | yes on job nodes. We're seeing problems from rax dfw and ovh bhs1 at least. | 17:56 |
fungi | ovh bhs1 and the gitea servers are geographically near one another at least | 17:57 |
clarkb | same continent yup | 17:57 |
fungi | oh, i thought they were in the same metro area. did we put the gitea servers in sjc1 instead of ca-ymq-1? | 17:58 |
clarkb | yes | 17:58 |
fungi | got it, i guess it's gerrit that's in ca-ymq-1 | 17:58 |
clarkb | yup | 17:58 |
fungi | okay, so not as close to each other as i had thought | 17:59 |
clarkb | `git -c http.postBuffer=65536 clone https://opendev.org/openstack/nova nova-small-post-buffer` this seems to work for me. A smaller post buffer is rejected by curl as being too small. The default git uses is apparently 1MB. So I don't think the buffers are the problem. I suspect it's more of a latency/throughput problem affecting timeouts | 18:10 |
clarkb | corvus: I noticed the nova team discussing rechecks and asked them to share the failures to cross check zuul-launcher stuff | 18:17 |
clarkb | corvus: https://review.opendev.org/c/openstack/nova/+/948079 is the rechecked change and each of these three failures were multinode jobs that ran in mixed cloud provider environments: | 18:17 |
clarkb | https://zuul.opendev.org/t/openstack/build/66557ea92bae4e809c09526eba619c07 https://zuul.opendev.org/t/openstack/build/c74d7a60b6cd43f7a778e72ae0e61e9b https://zuul.opendev.org/t/openstack/build/48912a161c9845a4818875470e4bd8ba | 18:18 |
clarkb | I'm wondering if we might need to more aggressively disallow mixed provider nodesets or maybe make that an optional nodeset configuration option? jobs that don't expect colocated resources can say they are cool with it and those that do can force the old behavior? | 18:18 |
corvus | clarkb: i'd like the opportunity to investigate this before deciding on a solution | 18:24 |
clarkb | yup no objections to that | 18:24 |
clarkb | nova did confirm that they don't expect rabbitmq to work in cross site situations (generally even outside of CI) | 18:24 |
clarkb | so while i'm not positive that was the cause of the failures it does seem likely | 18:24 |
corvus | 2025-07-03 13:21:32,923 WARNING zuul.Launcher: [e: 1e2dafc88ac549c296a182ab6fb9bdc1] [req: 1ebe8dd654174c29b9e7110c93809cd3] Error scanning keys: Timeout connecting to 104.130.158.87 on port 22 | 18:28 |
corvus | 2025-07-03 13:21:32,924 ERROR zuul.Launcher: [e: 1e2dafc88ac549c296a182ab6fb9bdc1] [req: 1ebe8dd654174c29b9e7110c93809cd3] Marking node <OpenstackProviderNode uuid=03db1a7fac774f1b9641aae512b64d71, label=ubuntu-noble, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/rax-dfw-main> as failed | 18:28 |
corvus | so it tried to build both in rax-dfw; one of them failed to boot; so it executed a fallback to a different provider | 18:29 |
clarkb | I guess we don't retry 3 times anymore? | 18:29 |
corvus | not in the same provider, which is generally an improvement | 18:29 |
corvus | oh actually we will retry using the same provider under some circumstances... | 18:31 |
clarkb | I don't have concrete evidence but I feel like for ssh failures like that where the cloud is reporting success otherwise retrying in the same provider is probably a reasonable thing to do | 18:32 |
clarkb | vs provider just straight up failed | 18:32 |
clarkb | but also I think the underlying issue here is that existing jobs have made assumptions that the nodes will all cohabitate the same routable network space and that breaks when we mix nodes. This is why I wondered if making that a requirement on a nodeset level makes sense | 18:32 |
corvus | at this point, with this cloud provider, it's probably that they're just taking too long to boot | 18:33 |
clarkb | mnasiadka: https://zuul.opendev.org/t/opendev/build/2f4200f24ae4492295d5c43307e93038/log/job-output.txt#4353-4357 here is a retry that succeeds. It added 5 minutes to the build but success... | 18:33 |
clarkb | corvus: ya that would be my suspicion too. But that probably isn't a universal problem, it's probably some subset of hypervisors or semi random based on noisy neighbors | 18:34 |
corvus | i wonder if we need to bump that timeout? i believe we did do that for ord, maybe we need to do that for dfw. | 18:34 |
clarkb | that could be. Maybe we're near the border so some succeed and some fail and bumping it would get us closer to 100% succeeding | 18:35 |
clarkb | I'd be willing to try that | 18:35 |
corvus | i'm still looking. give me more time. | 18:36 |
clarkb | ack | 18:37 |
clarkb | looking at that trying again case this is curious: the fetches seem to go over ipv4 first then ipv6 then ipv4 then ipv6 again | 18:39 |
clarkb | when the error occurs it was using ipv6. When it succeeds on the second attempt it appears to be using ipv4 | 18:40 |
clarkb | Then 5 minutes later it uses ipv6 again | 18:40 |
clarkb | theory time: the nodes have flaky ipv6 for some reason | 18:40 |
clarkb | which would explain clients not reading data quickly enough and the back and forth behavior we see on the haproxy frontend logs | 18:41 |
fungi | also possible git implements some sort of "if i failed using ipv6 fall back to v4 and vice versa" logic | 18:41 |
clarkb | fungi: in this case the trying again is a second git command invocation via mnasiadka's update to dib | 18:41 |
clarkb | fungi: and we start only using ipv4 | 18:41 |
corvus | clarkb: my initial reading of the code suggests that because one of the nodes was in dfw and had not failed, we should have retried that 3 times in rax-dfw before falling back on another provider; so i think we should treat this as a bug | 18:42 |
fungi | i see, and the v4 requests are fine it's only the v6 requests that have problems? | 18:42 |
clarkb | fungi: with my sample size of one yes | 18:42 |
clarkb | corvus: ack | 18:42 |
clarkb | and in this case I never see a cD the node isn't able to connect to the load balancer at all | 18:43 |
clarkb | I think we should entertain the idea that ipv6 problems on the node or between node and load balancer are causing at least some of these errors | 18:44 |
mnasiadka | clarkb: so the retries are needed, happy that it helps a bit and we can get a green run of all jobs finally | 18:44 |
clarkb | fungi: would syslog capture arp updates? Wondering what we should collect from the node logs to verify that sort of thing | 18:45 |
clarkb | another option is to hold a node and monitor its network interfaces as this would in theory happen regardless of the dib builds as they don't touch networking particularly while doing git repo updates | 18:46 |
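One low-effort way to capture what Clark describes on a held node, assuming iproute2 is available (the log path is arbitrary): leave ip monitor running in the background and later correlate its timestamps with any failed fetches.

```bash
# Log every address and route change (IPv4 and IPv6) with timestamps.
nohup ip -timestamp monitor address route > /var/log/ip-monitor.log 2>&1 &
```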
frickler | about 120 new config-errors due to "nodeset ubuntu-xenial not found", mostly for starlingx repos https://zuul.opendev.org/t/openstack/config-errors?name=Nodeset+Not+Found&skip=0 | 19:02 |
Clark[m] | I think I pushed some changes to starlingx to get ahead of that a while back and they didn't go anywhere | 19:05 |
frickler | yeah, they also didn't act upon the centos-7 and opensuse removals. I wonder when it is time to take stronger measures, like drop those repos from zuul | 19:08 |
clarkb | ildikov: ^ you may have thoughts on that. Doesn't look like slittle is in here | 19:15 |
corvus | clarkb: i think i see the bug. easy fix, probably a hard test, will write after lunch. | 19:16 |
clarkb | corvus: awesome. I'm about to pop out on a bike ride but should be able to review it when I get back | 19:17 |
clarkb | ildikov: frickler looks like I only did opensuse and centos 7 here: https://review.opendev.org/c/starlingx/zuul-jobs/+/909766 | 19:17 |
clarkb | which didn't merge but is in merge conflict now so maybe they fixed it themselves with a different change | 19:17 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/954053 and https://review.opendev.org/c/opendev/system-config/+/954051 fix some straightforward issues and should be quick reviews if you get a moment | 19:19 |
clarkb | one fixes arm64 testing for system-config and the other should fix zuul's zk stored key backups that frickler discovered weren't working due to the docker compose switch | 19:20 |
mnasiadka | clarkb: I see that https://review.opendev.org/c/opendev/zuul-providers/+/953269 is green now - so would appreciate a review on the DIB depends-on | 19:27 |
fungi | clarkb: i don't know of anything that logs arp updates normally, though there might be a sysctl toggle or somewhere in the /sys tree or /proc/net to make the kernel log them to dmesg | 19:34 |
fungi | i guess what you really want to know is arp overwrites, not just arp updates? | 19:36 |
fungi | (i.e. arp messages arriving with a different mac for the same ipv4 address while there are unexpired entries for it in the table) | 19:37 |
fungi | though also the kernel will opportunistically maintain the arp table when receiving normal (non-arp) packets too | 19:38 |
fungi | in any case, i think it treats it as an update or an overwrite regardless of whether it was an actual arp packet, just to be clear | 19:39 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on errors https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 19:56 |
corvus | clarkb: okay, fixes uploaded; reviews of 954064 and 954065 would be good too | 20:08 |
corvus | (the reason why that failed is that the method did not see that there was a "main" provider for the request, so it selected a new one) | 20:09 |
corvus | with that merged, it should try 3 times on the same provider, before it falls back to a different one. | 20:10 |
opendevreview | Merged opendev/system-config master: Update arm64 base job nodeset https://review.opendev.org/c/opendev/system-config/+/954053 | 20:32 |
clarkb | fungi: I think any change to the ipv6 address or routes is what we want to know about | 21:42 |
clarkb | fungi: as it almost seems like the ip address is simply becoming invalid for 5-10 minutes then going away | 21:43 |
clarkb | basically figure out if ipv6 is going away or updating which may explain why the requests are made with differing ip protocols over time | 21:47 |
clarkb | alternatively it could be the load balancer's network that is flapping I guess | 21:49 |
fungi | clarkb: oh, for ipv6 it won't be arp (address resolution call), it'll be nd (neighbor discovery) instead | 22:16 |
fungi | s/call/protocol/ | 22:17 |
clarkb | oh right | 22:18 |
clarkb | in any case what I'm trying to figure out is how can we log/capture changes to the ipv6 (and I guess ipv4 though that seems to work consistently on the system I checked) to see if that is what causes the errors | 22:19 |
clarkb | thinking about it further I suspect that it is on the client side system that change must be happening because it doesn't seem to do the long timeout then fallback to ipv4 thing when ipv6 doesn't work | 22:30 |