opendevreview | Rodolfo Alonso proposed zuul/zuul-jobs master: Block twine 6.1.0, breaking ``test-release-openstack`` CI job https://review.opendev.org/c/zuul/zuul-jobs/+/939936 | 06:52 |
---|---|---|
opendevreview | Elod Illes proposed openstack/project-config master: Use ubuntu-noble for test-release-openstack https://review.opendev.org/c/openstack/project-config/+/939947 | 10:46 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 11:26 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 11:33 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 11:55 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 12:39 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 12:53 |
opendevreview | Merged zuul/zuul-jobs master: Block twine 6.1.0, breaking ``test-release-openstack`` CI job https://review.opendev.org/c/zuul/zuul-jobs/+/939936 | 14:12 |
opendevreview | Merged openstack/project-config master: Use ubuntu-noble for test-release-openstack https://review.opendev.org/c/openstack/project-config/+/939947 | 14:39 |
opendevreview | Merged openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 14:58 |
opendevreview | Merged openstack/project-config master: Fix release ACL for whitebox-tempest-plugin https://review.opendev.org/c/openstack/project-config/+/938887 | 14:59 |
clarkb | there is a held gerrit here: https://200.225.47.41/q/status:open+-is:wip for testing h2 cache stuff. I don't think I'm going to dive right into that though as I'd like to clean up some of the remaining paste/lodgeit effort first | 15:50 |
clarkb | to that end I've approved https://review.opendev.org/c/opendev/lodgeit/+/939385 to publish lodgeit images to quay and if that is happy I'll approve the change to pull from quay as well | 15:51 |
clarkb | fungi: did you want to push a bindep release or do we think the twine problems might impact that? | 15:52 |
clarkb | fungi: to look at podman package update restart hook behavior I'm at https://packages.ubuntu.com/noble/podman on the right pane there is a ...debian.tar.xz that is where I should look for package control and hook stuff right? | 15:54 |
fungi | clarkb: i'm hoping we'll get the twine situation sorted out first, but that could be within the next few hours | 15:56 |
fungi | clarkb: for podman packaging, yes but also see the further analysis and links i posted in the etherpad where we were discussing it. i don't see any indication it would try to restart running containers | 15:57 |
clarkb | ah ok, you've looked already, that's great. | 15:57 |
clarkb | fungi: in debian/rules I see stuff doing dh_installsystemd --name=podman-restart but that is all I can find so far | 15:58 |
clarkb | but I think that is installing systemd unit files? | 15:59 |
clarkb | and podman-restart is a utility to restart containers | 15:59 |
fungi | yeah, looks like it installs /etc/systemd/system/default.target.wants/podman-restart.service (you can check the one on paste) | 15:59 |
clarkb | https://docs.podman.io/en/v5.1.0/markdown/podman-restart.1.html | 15:59 |
clarkb | ya so it's a tool we could run, but it doesn't appear tied into the packaging itself, so this is great | 15:59 |
fungi | on stop it does `podman stop $(/usr/bin/podman container ls --filter restart-policy=always -q)` | 16:00 |
fungi | so it only affects containers with a restart-policy of "always" | 16:00 |
clarkb | fungi: that is most of our containers fwiw | 16:00 |
clarkb | what do you mean by `on stop`? | 16:01 |
fungi | Service.ExecStop= | 16:01 |
clarkb | gotcha systemd service stop. For which service? | 16:01 |
fungi | podman-restart.service | 16:01 |
clarkb | got it. | 16:01 |
fungi | so i guess we need to find whether podman-restart gets stopped/started/restarted on package updates | 16:02 |
clarkb | a naive grep -r podman-restart * inside the debian xz tar contents doesn't show anything obviously doing that | 16:02 |
fungi | its purpose seems to be more for making sure containers get started on boot | 16:02 |
clarkb | and ya I think that is what is ensuring things come up on boot for us | 16:03 |
fungi | so i'm not overly concerned, but i guess we should pay attention around podman package upgrades just to be sure | 16:03 |
clarkb | sounds good | 16:03 |
clarkb | and thanks again for taking an early look | 16:03 |
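For reference, a couple of hedged spot checks that could be run on a noble host to confirm the behavior discussed above; the unit name comes from the discussion, while the dpkg maintainer-script paths are standard locations and not something verified here:

```shell
# Show the installed unit, including the ExecStop command quoted above.
systemctl cat podman-restart.service

# See whether any of podman's maintainer scripts reference the unit
# (a rough check for whether package upgrades would stop/restart it).
grep -l podman-restart /var/lib/dpkg/info/podman.* 2>/dev/null
```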
clarkb | once I feel a bit more awake I'm also going to spot check backups for paste02 just to be more confident in them. Then I think we can probably land the change to retire paste01 backups | 16:06 |
clarkb | infra-root for general container image reliability I think we have two broad actions we can take: the first is updating mariadb to fetch from quay in all of our services (paste and gitea are done). This does restart the database so care needs to be taken. Then separately updating our Dockerfiles to pull dependencies for images we don't build (because we don't use them speculatively) as | 16:09 |
clarkb | well | 16:09 |
clarkb | at first I wanted to bulk move to our python-builder and python-base images on quay but realized we would lose speculative testing of updates to those images if we did so before moving to podman as the runtime. I do update lodgeit but we're on podman for paste now so I think that is fine | 16:10 |
fungi | and, to be clear, switching to podman requires updating to noble first yeah? | 16:11 |
clarkb | yes | 16:11 |
clarkb | or at least it does currently. It may be possible to get podman running with docker compose on older platforms but every time we've tried in the past it hasn't been workable for one reason or another | 16:12 |
clarkb | noble seems to be the first case where the debuntu world and the podman world have caught up to each other in a way that makes them work nicely | 16:12 |
clarkb | we might also decide we're ok with losing speculative testing of python-base and python-builder if we have some speculative coverage of them (for example via lodgeit or something else) | 16:13 |
clarkb | zuul uses them too | 16:13 |
clarkb | so maybe it's ok to accept a small amount of risk in updating those without speculative testing for say gerrit and whatever else as long as we lean on zuul and lodgeit for coverage | 16:14 |
clarkb | oh but zuul is in a different tenant so we don't get speculative testing there either? | 16:14 |
slittle | still can't get zuul to run on https://review.opendev.org/c/starlingx/utilities/+/938743 and https://review.opendev.org/c/starlingx/vault-armada-app/+/938744 | 16:21 |
clarkb | slittle: did you try my suggestion of pushing an update to the zuul config to force zuul to evaluate the config on that branch and report back errors? | 16:22 |
clarkb | I don't see evidence of that in the changes but maybe there was a different change pushed for that | 16:22 |
fungi | remote: https://review.opendev.org/c/starlingx/utilities/+/940048 DNM: See what happens when Zuul config is modified [WIP] [NEW] | 16:26 |
clarkb | looks like we're still not getting complaints from zuul and I'm not seeing it enqueue jobs either. I guess that hack to try and get zuul to post a response isn't valid | 16:28 |
slittle | A whitespace change to .zuul.yaml in https://review.opendev.org/c/starlingx/utilities/+/938743 had no effect | 16:30 |
clarkb | zuul02 reports the same no jobs for queue item in check that we saw with the kolla change overriding config (which meant there were no jobs there; still not sure why there are no jobs here) | 16:30 |
clarkb | the list of sources still doesn't seem to contain r/stx.10.0 though | 16:31 |
clarkb | could the problem be that we are ignoring the branch for some reason? | 16:31 |
clarkb | here we go | 16:32 |
clarkb | Configuration syntax error not related to change context. Error won't be reported. | 16:32 |
fungi | so probably still related to one or more of the remaining starlingx/zuul-jobs errors? | 16:33 |
clarkb | that is my best guess right now | 16:34 |
clarkb | there doesn't seem to be an associated traceback in the log or anything indicating what the error is | 16:34 |
slittle | i find it strange that a couple dozen other starlingx gits got past this for .gitreview update on r/stx.10.0 branch | 16:35 |
clarkb | slittle: they may not depend on the broken configuration in zuul so their zuul configs for the new branch loaded | 16:35 |
clarkb | the problem appears to be that because this is a new branch there is no existing config in zuul for it. When zuul goes to load the configs for this branch it cannot do so because there are errors. | 16:36 |
clarkb | I'm still trying to sort out what the errors are | 16:38 |
fungi | you mean beyond just looking at the list of errors at https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 | 16:38 |
clarkb | ya it would be nice to see a concrete link between starlingx/utilities and starlingx/zuul-jobs errors for example (like use of the bad nodeset or something) | 16:39 |
clarkb | one option may be to trim the .zuul.yaml down to just the linters job then build up from there until it breaks | 16:39 |
clarkb | making additional guesses the problem could be with the secret | 16:41 |
clarkb | zuul requires that secrets not be changed across branches and maybe this definition is different | 16:41 |
slittle | does zuul.conf support commenting out? | 16:42 |
fungi | it does | 16:42 |
fungi | 940048,1 is trimmed down to just the linters job and zuul still didn't enqueue or report errors on the change | 16:42 |
clarkb | fungi: 940048 has all the jobs in it | 16:43 |
fungi | sorry, meant ,2 | 16:43 |
clarkb | ah I need to f5 then | 16:43 |
fungi | i revised it | 16:43 |
clarkb | fungi: it needs a rebase on the latest parent patchset | 16:43 |
fungi | i'm going to try setting it to just the noop job for check next | 16:43 |
fungi | 940048,3 just uses the noop job in check, and no response from zuul | 16:48 |
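(For context, a hedged guess at what such a trimmed-down .zuul.yaml might look like; the actual contents of 940048 are not reproduced here:)

```yaml
- project:
    check:
      jobs:
        - noop
```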
clarkb | and we appear to still get the configuration error unrelated to the change | 16:49 |
clarkb | so the error must not be in the branch (and maybe not the project?) itself | 16:49 |
opendevreview | Merged opendev/lodgeit master: Reapply "Move lodgeit image publication to quay.io" https://review.opendev.org/c/opendev/lodgeit/+/939385 | 16:50 |
clarkb | however the error list only shows starlingx/zuul-jobs so maybe that is the source of the problem | 16:52 |
clarkb | starlingx/zuul-jobs defines starlingx-common-tox-linters starlingx-common-tox-pep8 and starlingx-common-tox-pylint which are suspiciously similar to the jobs in utilities (but they have different names). Mostly just calling this out because: why? | 16:54 |
fungi | 940048,4 is just the noop job and no parent | 16:54 |
clarkb | infra-root any objection to approving https://review.opendev.org/c/opendev/system-config/+/939767 now that lodgeit is being updated in quay: https://quay.io/repository/opendevorg/lodgeit?tab=tags&tag=latest ? | 16:56 |
fungi | i've gone ahead and approved that, but if anyone disagrees with you or the 3 existing +2 votes they have time to -2 or wip it | 16:57 |
frickler | added another +2 just in case ;) | 16:58 |
clarkb | fungi: your latest ps still has the same issue according to zuul02's debug log: 2025-01-23 16:54:56,092 INFO zuul.Pipeline.openstack.check: [e: 8221001fed414122b8e0fe1cdea30352] Configuration syntax error not related to change context. Error won't be reported. | 16:59 |
clarkb | which is really odd because what in that config can be wrong | 16:59 |
fungi | 940048,6 is just the noop job with no parent change and no topic | 17:00 |
fungi | on the wild theory that same-topic functionality is related to this | 17:00 |
clarkb | 2025-01-23 17:00:42,923 INFO zuul.Pipeline.openstack.check: [e: 33910d2518e74d8abb72886dbd146143] Configuration syntax error not related to change context. Error won't be reported. | 17:01 |
clarkb | same thing | 17:01 |
clarkb | I think that really points to project config elsewhere? | 17:01 |
fungi | yeah, except starlingx/utilities doesn't have any jobs added by project-config either (i just checked) | 17:02 |
clarkb | me too and I concur | 17:02 |
clarkb | it could be config in another branch in utilities that uses a branch matcher to apply to this branch maybe | 17:03 |
clarkb | or branch wide problems like the secret being redefined or something | 17:04 |
clarkb | I think that is the problem its the secret breaking the project config project wide | 17:04 |
clarkb | maybe | 17:04 |
clarkb | https://opendev.org/starlingx/utilities/src/branch/r/stx.5.0/.zuul.yaml#L44 != https://opendev.org/starlingx/utilities/src/branch/master/.zuul.yaml#L159 | 17:05 |
clarkb | oh but the secret has different names so that should be ok | 17:05 |
clarkb | corvus: is there a trick for finding unrelated errors in the zuul debug log? I'm looking at the source and wondering if we even log them at all (that might explain why I'm not seeing anything in the logs) | 17:11 |
clarkb | corvus: tldr is fungi pushed a very minimal zuul config: https://review.opendev.org/c/starlingx/utilities/+/940048 and zuul still reports there are unrelated config errors so they won't be reported | 17:12 |
clarkb | the project doesn't have any config in openstack/project-config which would imply the only config is in the project itself. Which has me at "its probably a problem in another branch impacting this new branch" | 17:14 |
clarkb | looks like adding f/caracal branch worked ~2 months ago | 17:15 |
clarkb | and this is the first new branch since then | 17:16 |
clarkb | https://review.opendev.org/c/starlingx/utilities/+/934896 | 17:16 |
fungi | 940048,7 sets pipeline.debug just to see if that can coax any details out | 17:17 |
fungi | (hopefully i got that right, i do it so rarely) | 17:17 |
clarkb | using that timeframe I'm looking at https://review.opendev.org/q/project:starlingx/utilities+status:merged to see what has gone into the project since November 12 | 17:17 |
corvus | clarkb: i'm going through 103 lines of scrollback, give me a minute. | 17:17 |
clarkb | corvus: ack thanks | 17:17 |
clarkb | none of the merged changes to the project since November 12 touch zuul config | 17:18 |
corvus | clarkb: if i'm following correctly, the problem is that we expect starlingx/zuul-jobs to have a project config on branch r/stx.10.0 -- that can be seen here: https://opendev.org/starlingx/zuul-jobs/src/branch/r/stx.10.0/zuul.d/project.yaml but there is no configuration visible in zuul at https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/zuul-jobs for that branch and there are no errors in the config page | 17:23 |
corvus | config error page | 17:23 |
corvus | clarkb: that sounds like zuul is unaware of the branch. presumably we're not excluding the branch in main.yaml, which suggests that zuul may have just missed the branch creation. we can fix that with a reconfiguration. | 17:24 |
corvus | i'll trigger a reconfiguration of the openstack tenant | 17:25 |
clarkb | corvus: the project is starlingx/utilities but the rest of it makes sense to me | 17:26 |
clarkb | zuul-jobs may be in the same boat too | 17:26 |
clarkb | ya that repo has a r/stx.10.0 branch too so likely in the same boat | 17:26 |
corvus | it is surprising for it to have missed two. | 17:26 |
corvus | is there something unusual about how those branches were created. | 17:26 |
corvus | ? | 17:26 |
clarkb | slittle: ^ how are you creating the branches? | 17:27 |
fungi | maybe the reason https://review.opendev.org/c/starlingx/utilities/+/940048 isn't working though is that the as-created state of the r/stx.10.0 branch in starlingx/utilities refers back to starlingx/zuul-jobs which has errors, even though the proposed change would remove all association with it? | 17:27 |
clarkb | corvus: starlingx-release has gerrit create perms but not force push as far as I can tell so they should be using either the gerrit web ui to create branches or the rest api | 17:28 |
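As an aside, a hedged example of the Gerrit REST call for branch creation that clarkb mentions; the project, branch, and credential shown are illustrative only:

```shell
curl -u slittle1:HTTP_PASSWORD -X PUT \
  -H 'Content-Type: application/json' \
  -d '{"revision": "master"}' \
  'https://review.opendev.org/a/projects/starlingx%2Futilities/branches/r%2Fstx.10.0'
```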
corvus | (i have not reconfigured zuul yet -- i have halted work on that because this additional info about multiple projects being affected is weird) | 17:28 |
corvus | fungi: any existing errors should show up in the config-errors page | 17:28 |
clarkb | starlingx/vault-armada-app too | 17:28 |
clarkb | so at least three? | 17:28 |
fungi | yeah, and https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/utilities doesn't show a r/stx.10.0 tab either | 17:28 |
clarkb | I wonder if they are scripting the branch creation and they are all being created in a very short window and zuul is missing them all due to something else happening in that time period? | 17:29 |
fungi | which i'd at least expect if zuul were going to try to load configuration from it (but maybe only branches it successfully loads configuration from show up there?) | 17:29 |
corvus | fungi: it shows live config, so if it's not there it's not loaded | 17:29 |
corvus | when were the branches created? | 17:30 |
fungi | clarkb: we have in the past seen bulk branch creation events from openstack projects also end up with missed events until zuul gets told to reload from the repository states | 17:30 |
clarkb | https://opendev.org/starlingx/tools/src/branch/master/release/branch-repo.sh looks suspicious | 17:30 |
clarkb | I don't see the use of the api there but maybe git push --tags means you don't need that? | 17:32 |
clarkb | slittle: ^ is that how you create the branches? | 17:33 |
clarkb | corvus: if that script was used it also pushes the gitreview update so would've been created January 8 | 17:34 |
clarkb | corvus: per https://review.opendev.org/c/starlingx/utilities/+/938743 for starlingx/utilities | 17:34 |
fungi | i have a feeling it may be cruft, that script was last touched almost 5 years ago | 17:34 |
clarkb | could be | 17:35 |
fungi | or, otherwise, it's been doing it this way for years | 17:35 |
corvus | debug.log.15.gz:2025-01-08 23:57:52,072 DEBUG zuul.Scheduler: [e: be11c47d033f4bcdb45b54ede64d8d23] Submitting tenant reconfiguration event for openstack due to event <GerritTriggerEvent ref-updated opendev.org/starlingx/utilities r/stx.10.0> in project starlingx/utilities, ltime 1890417977107 | 17:36 |
corvus | oh sorry; the adjacent lines indicate that was the creation event | 17:37 |
clarkb | 2025-01-08 23:59:13,377 ERROR zuul.TenantParser: KeyError: 'r/stx.10.0' | 17:37 |
clarkb | there are also errors like ^ | 17:37 |
clarkb | perhaps in recursive lookups for that branch in parent jobs etc? | 17:38 |
clarkb | (just thinking out loud that maybe branch creation failed because the order of the projects matters and it was wrong here?) | 17:38 |
priteau | Is docs.openstack.org being hammered by bots like Git was the other day? It feels much slower than usual | 17:40 |
clarkb | priteau: no system load looks fine. It could be something with afs I suppose but dmesg doesn't show any recent afs complaints | 17:42 |
opendevreview | Merged opendev/system-config master: Reapply "Pull lodgeit from quay.io" https://review.opendev.org/c/opendev/system-config/+/939767 | 17:44 |
clarkb | /var/cache/openafs and /var/cache/openafs-client seem empty | 17:44 |
clarkb | maybe those paths don't do what I think they do | 17:44 |
clarkb | /var/cache/openafs is what the mirrors use for caching openafs | 17:45 |
clarkb | /opt/cache/openafs is the cache path on status | 17:46 |
clarkb | *static | 17:46 |
clarkb | and it is not empty. So we do have a cache that appears to be working on the surface | 17:46 |
clarkb | and docs.openstack.org does seem responsive to me now. | 17:47 |
priteau | yes, it's better now | 17:48 |
fungi | i can cat html files in /afs/openstack.org/docs/ from static.o.o just fine too | 17:48 |
fungi | no lag/delay | 17:49 |
slittle | https://review.opendev.org/c/starlingx/utilities/+/938743 is pretty minimal now. zuul still failing silently | 17:51 |
slittle | I guess I can comment out EVERYTHING | 17:52 |
clarkb | slittle: see the discussion above | 17:52 |
slittle | will do ... jumping between threads | 17:52 |
fungi | slittle: i did basically the same in various iterations already in https://review.opendev.org/c/starlingx/utilities/+/940048 and since then we've noticed that zuul doesn't seem to be aware of the branch itself in that repo (and several others created around the same time) | 17:53 |
clarkb | corvus: against the wall clock zuul-jobs, utilities and vault-armada-app seem to be showing up in zuul on the 8th very close to each other | 17:53 |
clarkb | corvus: and for other projects I see Loading configuration from starlingx/ptp-notification-armada-app/.zuul.yaml@r/stx.10.0 logs but not for utilities at least | 17:54 |
slittle | git push ${review_remote} ${branch}:${branch} | 17:55 |
slittle | git push ${review_remote} ${tag}:${tag} | 17:55 |
fungi | slittle: are you doing it in quick succession for many repos in a batch, or ad hoc at different times? | 17:56 |
clarkb | hrm for vault-armada-app I don't see a log showing the branch creation just the tag creation | 17:56 |
slittle | it's scripted here ... https://opendev.org/starlingx/root/src/branch/master/build-tools/branching | 17:57 |
clarkb | (it is possible I'm just not looking for the correct log messages) | 17:58 |
slittle | push_branches_tags.sh will iterate over all our repos creating the branches, as well as tags marking the point of divergence from master | 17:59 |
clarkb | oh yup zuul01 seems to have handled them and i was looking at zuul02. ok less confused about those missing right now | 18:00 |
slittle | note the with_retries function, as the failure rate was rather high | 18:01 |
clarkb | slittle: do you know what fails? | 18:01 |
clarkb | because one thing I notice is that I actually see ref-updated events for the same refs several times over the course of just over an hour | 18:02 |
clarkb | makes me wonder if you're actually succeeding and repushing unnecessarily | 18:02 |
clarkb | 2025-01-08 23:58:03,362 DEBUG zuul.Scheduler: [e: be11c47d033f4bcdb45b54ede64d8d23] Trigger event minimum reconfigure ltime of 1890417977107 newer than current reconfigure ltime of 1890417970413, aborting early | 18:05 |
clarkb | corvus: ^ could that explain it maybe? | 18:05 |
corvus | nope | 18:05 |
clarkb | that event id maps to the ref-updated for r/stx.10.0 branch creation on zuul01 | 18:05 |
corvus | i'm digging through some stuff, give me a few | 18:05 |
clarkb | ack | 18:06 |
clarkb | fyi paste did update to pull from quay and this test paste worked for me: https://paste.opendev.org/show/bcxk1PowxB9ueNuJ8U2j/ | 18:07 |
clarkb | I was able to open an old paste too, picked randomly out of my browser history | 18:08 |
clarkb | infra-root I have removed the WIP from https://review.opendev.org/c/opendev/system-config/+/939128 after checking paste02 backups against the vexxhost backup server using borg mount | 18:14 |
clarkb | head -4 against the mariadb dump shows a mariadb dump header and /etc of the filesystem dump appeared to have things I expected in it (like etc/hosts matched the paste02 content) | 18:14 |
clarkb | I have unmounted things but others can feel free to remount and investigate as part of their reviews of that change if they wish | 18:15 |
opendevreview | yatin proposed openstack/project-config master: [Neutron] Update dashboard with latest job renames https://review.opendev.org/c/openstack/project-config/+/940065 | 18:24 |
clarkb | depending on how zuul debugging goes I'll plan to approve the gerrit 3.10.4 update before lunch so that we can update to it after our meetup | 18:25 |
clarkb | if we continue to have problems with ^ today I'll probably shift gears tomorrow and start trying to reduce our reliance on docker as much as possible | 18:25 |
clarkb | oh also I think tonyb's use ipv4 for docker hack might be a good idea in our mirror jobs as I have seen some of them fail and I'm pretty sure that is due to the quota limits which are made worse by ipv6 block handling | 18:26 |
slittle | it's the branch push that fails. I don't have logs any longer. | 18:38 |
corvus | slittle: do you have any record of any push failures from that branch creation? | 18:39 |
slittle | All I know is that the first half dozen or so go through fine, then I'll see several failures in a row as if something is either overloaded, or I've triggered a DOS protection mechanism. | 18:40 |
fungi | usually if i'm doing bulk operations pushing to gerrit, i add a delay between each push (something like `sleep 3` in my loop) | 18:40 |
corvus | starlingx/utilities was one of the last events i see, so there is a correlation there. one hypothesis i have is that gerrit did not return the new branch in the api call that zuul uses to list branches after the push event. | 18:41 |
corvus | i wonder if there's a possibility the branch listing api call uses the result of an async operation which is backlogged | 18:42 |
slittle | might be in luck ... I still have terminal history | 18:42 |
slittle | Running: git review -s | 18:43 |
slittle | ssh://slittle1@review.opendev.org:29418/starlingx/app-gen-tool.git did not work. Description: ssh_exchange_identification: read: Connection reset by peer | 18:43 |
slittle | fatal: Could not read from remote repository. | 18:43 |
slittle | succeeded on the next retry after 45 sec | 18:44 |
fungi | we *do* actually have some "ddos protection" on and in gerrit, in particular a limit on the number of concurrent tcp connections allowed to the ssh api socket from the same ip address, and a limit on the number of concurrent open sessions for the same gerrit account, but they're in the 100+ range | 18:44 |
clarkb | which could be hit more easily if behind NAT | 18:45 |
clarkb | or if running things in parallel | 18:45 |
slittle | I have a second failure from 'git review -s' ... slightly different | 18:46 |
slittle | Running: git review -s | 18:46 |
slittle | Problem running 'git remote update gerrit' | 18:46 |
slittle | Fetching gerrit | 18:46 |
slittle | ssh_exchange_identification: read: Connection reset by peer | 18:46 |
slittle | fatal: Could not read from remote repository. | 18:46 |
slittle | Please make sure you have the correct access rights | 18:46 |
slittle | and the repository exists. | 18:46 |
slittle | error: Could not fetch gerrit | 18:46 |
fungi | those could be due to our overload protections, or gerrit getting overwhelmed, but could also indicate issues with the network anywhere between the client and server (e.g. a middlebox closing the session due to a full state tracking table or packet shaper limiting bandwidth utilization or security appliance freaking out over a false positive in the stream) | 18:49 |
slittle | sorry, these errors appear to be from the create_branches_and_tags.sh script, not push_branches_tags.sh | 18:50 |
corvus | [2025-01-08T23:57:08.644Z] [SSH git-receive-pack /starlingx/utilities.git (slittle1)] INFO com.google.gerrit.server.git.MultiProgressMonitor : Processing changes: refs: 1 [CONTEXT ratelimit_period="1 MINUTES [skipped: 17]" ] | 18:51 |
fungi | introducing a delay between passes through the loop would be good to help even out the load it puts on the gerrit server anyway | 18:51 |
corvus | i'm not sure what to make of that log entry. that is at the moment that zuul received the branch create event (zuul saw it 1 second later) | 18:51 |
slittle | tag push error ... | 18:53 |
slittle | git push gerrit vr/stx.10.0 | 18:53 |
slittle | ssh_exchange_identification: read: Connection reset by peer | 18:53 |
slittle | fatal: Could not read from remote repository. | 18:53 |
slittle | Please make sure you have the correct access rights | 18:53 |
slittle | and the repository exists. | 18:53 |
corvus | we are one day too late to look at the gerrit http access logs.... unless we back them up? | 18:53 |
clarkb | corvus: we should. https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#restore-from-backup | 18:54 |
clarkb | I think you can borg mount from a prior day on the host itself and navigate to the log file via fuse and view it that way. Then when you are done don't forget to run the borg umount command (it is emitted by the borg-mount script when you run it) | 18:55 |
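For anyone following along, a minimal sketch of that restore workflow using plain borg; the repository path, archive name, and log path are placeholders rather than the real values:

```shell
# List archives, then FUSE-mount the one covering the day in question.
borg list /path/to/backup/repo
mkdir -p /tmp/borg-restore
borg mount /path/to/backup/repo::example-archive-2025-01-22 /tmp/borg-restore

# Browse to the rotated Gerrit logs inside the mounted archive, then clean up.
less /tmp/borg-restore/<path-to-backup-root>/review_site/logs/httpd_log
borg umount /tmp/borg-restore
```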
fungi | looking at example MultiProgressMonitor log examples including a ratelimit_period context, it looks like that may be how it tries to avoid spamming the service log | 18:55 |
corvus | most of the work gerrit seems to be doing is for ai bon crawlers. | 18:55 |
corvus | bot. not bon. very much not bon. | 18:55 |
fungi | tres mal | 18:55 |
clarkb | corvus: I suspect that is true for the majority of the internet now | 18:55 |
slittle | roughly 10 git push errors on tags | 18:55 |
corvus | we should maybe keep the same number of days of logs for gerrit and zuul :) | 18:56 |
slittle | here is one trying to set up a gerrit review of a .gitreview file .... git review --yes --topic=r.stx.10.0 | 18:57 |
slittle | Problem running 'git remote update gerrit' | 18:57 |
slittle | Fetching gerrit | 18:57 |
slittle | ssh_exchange_identification: read: Connection reset by peer | 18:57 |
slittle | fatal: Could not read from remote repository. | 18:57 |
clarkb | corvus: look in review_site/logs | 18:57 |
fungi | it appears we've asked this question before too: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-07-23.log.html#t2024-07-23T12:35:35 | 18:57 |
clarkb | corvus: we have http logs there for the gerrit http server | 18:57 |
corvus | clarkb: oh hey cool, i was looking at apache | 18:58 |
clarkb | ya its possible we may still need apache side logs but maybe this is sufficient | 18:58 |
corvus | it has the byte count... | 18:58 |
corvus | i'm going to try to deduce the contents of the api request from the byte count. it's a long shot, but it's the only echo of what happened. | 18:59 |
slittle | roughly 5 identical 'git review' failures on that same run | 18:59 |
clarkb | slittle: fungi I suspect that connection reset by peer as part of exchange identification would be the user limit | 19:00 |
clarkb | the tcp connection limit would occur earlier in the ssh connection setup and exchanging identification is probably the earliest point that gerrit can determine who you are to apply the user limit? | 19:01 |
fungi | true, if it were the connection limit it would be something like an icmp admin-prohibited response | 19:01 |
clarkb | so either you've got 96 concurrent connections or they are not shutting down cleanly (we've seen this with some clients but I want to say it was really old paramiko not openssh) or maybe your firewalls are not cleaning things up? | 19:02 |
clarkb | once upon a time the stackalytics crew would just create a new account when they hit the limit rather than figure out how to clean up after themselves | 19:03 |
slittle | 60 gits... for each at least four transactions ... git review -s, git push <branch>, git push <tag>, git review | 19:03 |
fungi | or the socket shutdown takes too long and is happening in parallel with new connections accumulating | 19:03 |
clarkb | slittle: and git review -s is multiple transactions that are papered over for you. Including a test probe, a fetch of the commit message hook if running an older version, and a login check | 19:04 |
fungi | it's possible the tcp/ip stack tries to handle socket shutdown asynchronously | 19:04 |
clarkb | fungi: ya that could be | 19:04 |
clarkb | so ya I suspect fungi's simple idea of slowing things down a bit may be helpful here | 19:04 |
clarkb | probably enough to wait between repos and not between every operation | 19:05 |
fungi | right, if you're performing several operations in the loop, then maybe throw a `sleep 10` in there and see if the errors mostly/entirely disappear | 19:06 |
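Roughly the shape of loop being suggested here, as a sketch only; the repo list file, remote name, and variables are assumptions based on the snippets pasted earlier:

```shell
for repo in $(cat repo-list.txt); do
    (
        cd "$repo" || exit 1
        git push gerrit "${branch}:${branch}"
        git push gerrit "${tag}:${tag}"
    )
    # Pause between repos so per-account connection limits have time to drain.
    sleep 10
done
```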
slittle | I believe we tried that previously, with a 10-15s delay between repos. It made little difference | 19:10 |
fungi | in that case the problem may not be originating from the gerrit server itself. where does the script run? | 19:11 |
slittle | a workstation within the WindRiver corp network | 19:11 |
fungi | okay, so could in theory be going through security proxies, overtaxed firewalls, oversubscribed network links, who knows what else | 19:12 |
fungi | any of which have the potential to result in those exact errors | 19:12 |
slittle | the thing is ... 'repo sync' iterates over those same 60 gits without issue. We only see issues when it's gerrit that we are interacting with in a loop | 19:13 |
clarkb | looking at gerrit sshd logs you are using an old git review and it is fetching the commit message hook each time | 19:14 |
clarkb | so that one would be one optimization to try | 19:14 |
fungi | is 'repo sync' doing it over ssh on a nonstandard tcp port, or something more normal like https? | 19:14 |
clarkb | I also see 523 logins from you over the course of that day in two blocks of time | 19:15 |
clarkb | which is well above our 96 limit so ya if anything is slow to close connections (regardless of where that is happening) then you could hit the limit | 19:15 |
fungi | might be interesting to try running it from another network some time and see if the error rate is any better | 19:15 |
clarkb | ah the later block of time is spread over the 9th too | 19:16 |
slittle | 'repo' is a google tool for managing multi-git projects. it uses git commands under the hood. Our manifest is configured to pull from https://opendev.org/starlingx/* | 19:17 |
clarkb | looks like 425 logins over the course of the hour that straddles the 8th and 9th | 19:17 |
clarkb | oh you use repo | 19:17 |
clarkb | it is likely that repo is reusing connections | 19:17 |
clarkb | so you end up with a single login | 19:17 |
clarkb | or at least far fewer than 425 | 19:17 |
clarkb | https://gerrit.googlesource.com/git-repo/+/refs/tags/v2.51/git_ssh confirmed it uses ssh control persistence | 19:19 |
fungi | well, `repo` is using an https remote to our gitea haproxy, not ssh over 29418/tcp to our gerrit server, sounds like | 19:19 |
clarkb | that would be another option open to you for your script | 19:19 |
clarkb | fungi: oh ya the url above is gitea | 19:19 |
clarkb | anyway repo will reuse ssh connections too | 19:19 |
fungi | having managed lots of corporate security and network hardware in a past life, i can say without hesitation that there's plenty of stuff in the typical corporate network that will treat/handle those differently | 19:20 |
fungi | these days, https connections are heavily optimized, for example | 19:21 |
clarkb | my suggestions would be to 1) update git review so that you don't need to fetch the commit message hook and 2) try ssh control persistence | 19:21 |
clarkb | slittle: ^ | 19:21 |
clarkb | also I liked fungi's idea of trying from a different network source | 19:21 |
fungi | also 3. try to do more over https instead of ssh | 19:22 |
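A minimal ~/.ssh/config sketch of the ControlPersist suggestion; the values are illustrative, not a verified recommendation:

```
Host review.opendev.org
    Port 29418
    User slittle1
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 5m
```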
corvus | clarkb: fungi slittle i don't quite have enough info to prove what happened; i think there are too many other changes to info/refs for me to be able to assume the contents based on length alone (or, at least, to fully reconstruct it may take a very long time). | 19:29 |
corvus | however, we do know that zuul did query gerrit to get the list of branches (we see that in the logs), and i thought if there was an error in zuul, it was most likely to be that it didn't query at all. so at this point, we're looking at these choices: | 19:29 |
corvus | 1) gerrit did not include the new branch in info/refs when zuul queried it; or 2) something completely unexpected caused zuul to use the wrong data or fail to update its cache. | 19:30 |
corvus | we can't prove either at this point, but i'm leaning toward 1 | 19:31 |
fungi | corvus: this seems, at least on the surface, similar to past incidents where the openstack release team has performed bulk stable branch creation around release time and some branches have ended up not getting jobs run until a full reconfig of zuul (which we used to do more often) | 19:31 |
corvus | i'm going to add a log line to zuul that would disambiguate this in the future, so if it is #2, we have enough confidence to start looknig for weird things. | 19:31 |
corvus | fungi: ++ | 19:31 |
corvus | and for now, i think we should just reconfigure :) | 19:32 |
fungi | i agree it seems possible that gerrit's state is "eventually consistent" and that querying it for a branch may not work immediately after creation-related events are streamed, especially if it's been asked to create a lot of them in a short span of time and/or is under unrelated load | 19:32 |
fungi | that is to say, it would not surprise me to learn that emission of ref-updated events isn't held back waiting for branch creation to make it all the way down to the repositories | 19:34 |
clarkb | corvus: sounds good and thank you for digging into it | 19:34 |
clarkb | fwiw I find only one killed ssh connection logged from slittle on the 8th | 19:34 |
clarkb | was for git-receive-pack./starlingx/portieris-armada-app.git | 19:34 |
clarkb | makes me wonder if we're not logging the early connection stop for hitting the limits | 19:35 |
clarkb | but I've been digging around in gerrit source to try and figure out how that is implemented and haven't found it yet | 19:35 |
corvus | yeah, the concerning part of my theory is that it would imply that one git operation (push) would not be reflected in a subsequent git operation (info/refs). that seems super unlikely. so one of two unlikely bugs. we need to see if it has stripes to know whether it's a zebra. :) | 19:35 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/940071 Add debug log when fetching Gerrit branches [NEW] | 19:37 |
corvus | #status log issued zuul tenant-reconfigure for openstack to pick up missing starlingx branches | 19:38 |
opendevstatus | corvus: finished logging | 19:38 |
corvus | it might be a while before ^ takes effect | 19:38 |
clarkb | ok gerrit does have an explicit log at warning for max connections reached | 19:39 |
corvus | (it will have finished when the status page says the last reconfigure is after 19:39 utc) | 19:39 |
clarkb | I don't see that in error_log or sshd_log for the 8th or 9th but gerrit logging is sufficiently complicated I'm not convinced that didn't happen | 19:40 |
clarkb | with that "resolved" any objection to me trying to land the gerrit 3.10.4 update again? | 19:40 |
corvus | none here; i'm mostly afk until meetup tho | 19:41 |
clarkb | ya me too. I need lunch and don't think we'll try to restart until after the meetup anyway | 19:41 |
clarkb | I'll hit +A | 19:42 |
clarkb | or actually reenqueue it again since that avoids needing clean check | 19:42 |
fungi | i'm around and can keep an eye on it | 19:44 |
clarkb | I think it is in the trigger queue which is waiting on the reconfigure | 19:44 |
clarkb | but I did run the command to enqueue it | 19:44 |
fungi | cool | 19:44 |
fungi | thanks! | 19:44 |
tonyb | frickler: Answering your question from openstack-dev here so it's more generally visible. For issues with "Software Factory CI" 3rd-party CI you can ping myself [UTC+1000] or dpawlik [UTC+0100], with the latter being my preference as they're more likely to deal with it quickly | 19:56 |
slittle | for ssh control persistence to review.openstack.org ... do I want the 'user' to be 'git' or my user name? | 20:01 |
clarkb | slittle: your user name. Gerrit doesn't use a shared account like github | 20:16 |
clarkb | still no r/stx.10.0 branch in https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/vault-armada-app and a recheck doesn't seem to have worked | 20:22 |
clarkb | the reconfiguration says 22 minutes ago so I think it completed | 20:22 |
clarkb | so maybe this is still reproducible cc corvus (but finish lunch) | 20:22 |
clarkb | oh wait maybe I'm just impatient? | 20:23 |
clarkb | jobs are enqueued now | 20:23 |
clarkb | and the branch is there nevermind this was me expecting things to run quicker and I just needed to wait another thirty seconds | 20:23 |
clarkb | slittle: ^ fyi I think you can restore the starlingx/utilities change to its original state and it should hopefully work now too | 20:24 |
slittle | will do | 20:39 |
clarkb | meetup time | 21:01 |
tonyb | I'll be ~5mins late I lost track of time and need coffee | 21:02 |
clarkb | ack | 21:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update grafana to 10.4.14 https://review.opendev.org/c/opendev/system-config/+/940073 | 21:30 |
opendevreview | Merged opendev/system-config master: Update Gerrit images to 3.10.4 and 3.11.1 https://review.opendev.org/c/opendev/system-config/+/939167 | 21:35 |
opendevreview | Brian Haley proposed zuul/zuul-jobs master: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 21:37 |
*** tosky_ is now known as tosky | 21:41 | |
fungi | #status notice The Gerrit service on review.opendev.org will be offline momentarily while we reboot for a patch version upgrade of the software, but should return again within a few minutes | 22:44 |
opendevstatus | fungi: sending notice | 22:44 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily while we reboot for a patch version upgrade of the software, but should return again within a few minutes | 22:44 | |
opendevstatus | fungi: finished sending notice | 22:46 |
clarkb | the change to bump grafana to 10.4.14 did pass CI so now we need to check screenshots I guess | 23:09 |
clarkb | fungi: one thing I still have on my todo list for today is a bindep release. Do you know if 940074 is needed for that? | 23:13 |
clarkb | I'm not sure if the ensure-twine role proposal is related to the twine problems earlier | 23:13 |
clarkb | they linked to the issue but I'm not sure I understand why we need to ensure-twine in zuul-jobs if we are already installing it | 23:16 |
*** promethe- is now known as prometheanfire | 23:25 | |
clarkb | ok confirmed that we can't release things using openstack release tooling right now | 23:25 |
clarkb | the latest issue is ensure-twine uses pip install --user which runs afoul of the no global installs under python3.12 on noble and test-release-openstack at least moved to noble | 23:26 |
clarkb | the change above updates things to use a virtualenv but now the problem is with testing. | 23:26 |
clarkb | test-release-openstack is defined in openstack/project-config so can't speculatively load the proposed zuul-jobs ensure-twine update | 23:26 |
clarkb | I think the change itself may be fine though just having a hard time verifying it against test-release-openstack | 23:27 |
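For the record, a hedged sketch of the venv-based install pattern the proposed role update moves to; the paths and the version pin mirror the 6.1.0 block mentioned earlier and are not the role's actual variables:

```shell
python3 -m venv /tmp/twine-venv
/tmp/twine-venv/bin/pip install 'twine!=6.1.0'
/tmp/twine-venv/bin/twine check dist/*
```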
fungi | clarkb: correct (sorry, was nabbing dinner) | 23:39 |
clarkb | no problem. I reviewed bhaley's change and caught up on all the reasons for it and tried to provide that information in the review | 23:47 |
clarkb | er haleyb | 23:48 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!