mnaser | infra-root: i think github replication is brooooken :< | 00:46 |
mnaser | https://github.com/openstack/openstack-helm-infra vs https://opendev.org/openstack/openstack-helm-infra | 00:46 |
mnaser | seems specifically that repo didn't get replicated | 00:46 |
mnaser | https://github.com/openstack/openstack-helm seems fine | 00:46 |
tonyb | mnaser: can you confirm that openstack-helm-infra is using the same project templates as openstack-helm? | 00:51 |
mnaser | i'll have to check project-config, the "repo" did get retired, let's see | 00:51 |
tonyb | I recall helping someone diagnose that a while back but didn't follow up | 00:51 |
mnaser | the last commit that worked renamed things to infra_*.yaml in zuul.d so no jobs ran anymore https://opendev.org/openstack/openstack-helm-infra/commit/a96a6b955e478271f323620ef8b533c2e1e2a82e | 00:54 |
Clark[m] | So they just retired things in the wrong order | 00:55 |
tonyb | Ah. | 00:57 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: watcher: Remove notification of puppet-watcher https://review.opendev.org/c/openstack/project-config/+/947466 | 12:06 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: zaqar: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947467 | 12:08 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: keystone: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947468 | 12:10 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: keystone: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947468 | 12:11 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: zaqar: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947467 | 12:12 |
hashar | fungi: Clark[m]: hello, I have a change for git-review to enable coloring for messages received from Gerrit. I had it for a while and it seems to work for me. If one of you could have a look at it: https://review.opendev.org/c/opendev/git-review/+/914000 | 12:13 |
hashar | screenshots of before/after can be seen on our task at https://phabricator.wikimedia.org/T359981 :-] | 12:13 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: storlets: Add gerritbot notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947469 | 12:14 |
tkajinam | hmm... I see the CI jobs are not starting for https://review.opendev.org/c/openstack/puppet-openstacklib/+/947389 even after a manual recheck, but I've not yet identified anything suspicious in the repo. | 12:40 |
tkajinam | I've confirmed the jobs are started for the 2025.1 branch of the other puppet repos | 12:40 |
tkajinam | ok I see the jobs are queued but then immediately disappear in https://zuul.opendev.org/t/openstack/status ... I wonder if I can check any errors to identify the cause? | 12:43 |
tkajinam | one thing I noticed is that publish-openstack-puppet-branch-tarball job has been failing due to post failure but idk if that's related | 12:46 |
tkajinam | https://zuul.opendev.org/t/openstack/build/ec2e4915082845568f00f2235c5fc652 | 12:46 |
frickler | tkajinam: let me check zuul logs | 12:50 |
tkajinam | frickler, thx ! | 12:51 |
Clark[m] | I think this is the issue of new branch not being registered with zuul that requires a tenant config reload. | 12:56 |
Clark[m] | When was the branch created? Hopefully recently enough we still have logs | 12:56 |
tkajinam | these were created a few hours ago | 12:57 |
tkajinam | https://review.opendev.org/c/openstack/releases/+/946899 | 12:57 |
frickler | yes, I just found that the branch is missing on https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/puppet-openstacklib | 12:57 |
tkajinam | ah | 12:57 |
frickler | I'll leave it to others to check zuul logs for that | 12:58 |
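A sketch of the manual workaround Clark[m] refers to above (forcing Zuul to rebuild the tenant layout so a newly created branch is picked up); the compose service name and working directory are assumptions, and the command names follow Zuul's operator docs:

```shell
# Ask the scheduler to rebuild just the openstack tenant's layout; a
# full-reconfigure also works but rereads every tenant and is much heavier.
sudo docker compose exec scheduler zuul-scheduler tenant-reconfigure openstack
```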
clarkb | I was ssh'ing in to see if I could quickly find the relevant logs and coincidentally the server rebooted for the upgrade that corvus initiated yesterday | 13:51 |
clarkb | took me a second to figure out what happened there | 13:51 |
corvus | what's the missing branch? | 13:53 |
corvus | is it stable/2025.1? that's there in the web ui now. | 13:54 |
corvus | one of the bugfixes that i'm rebooting to pick up could cause the web ui to show old data, but not the schedulers. | 13:55 |
clarkb | corvus: yes stable/2025.1 | 13:58 |
clarkb | corvus: are you saying you think the lack of the branch in the web ui could just be a red herring and the schedulers should know about the branch? | 13:58 |
clarkb | corvus: fwiw I think comparing to puppet-tempest I see it forwarding an event for ref-updated belonging to stable/2025.1 but I haven't found that for puppet-openstacklib yet | 13:59 |
corvus | clarkb: yes, that's what i mean about the web ui. could still be a problem in the scheduler of course. just don't use the web ui as confirmation for issues seen prior to now. | 14:00 |
tkajinam | yeah stable/2025.1 is the one with the problem | 14:03 |
tkajinam | I triggered recheck and now I see the job is queued | 14:03 |
tkajinam | https://review.opendev.org/c/openstack/puppet-openstacklib/+/947389 | 14:03 |
tkajinam | https://zuul.opendev.org/t/openstack/status?change=947389 | 14:04 |
slittle | To ssh://review.opendev.org:29418/starlingx/openstack-armada-app.git | 14:06 |
slittle | ! [remote rejected] f/portable-dc -> f/portable-dc (prohibited by Gerrit: update for creating new commit object not permitted) | 14:06 |
slittle | error: failed to push some refs to 'ssh://review.opendev.org:29418/starlingx/openstack-armada-app.git' | 14:06 |
frickler | slittle: did you try to push commits to that branch without review? | 14:13 |
frickler | slittle: also allow me to take the opportunity to remind you of https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 | 14:14 |
fungi | slittle: i guess you were trying to push a merge commit? are there any commit ids in your local f/portable-dc branch history which don't exist in gerrit? | 14:16 |
slittle | it is a new branch | 14:16 |
slittle | The script that creates the branches and updates the .gitreview across all the starlingx gits is unchanged, and worked 2 weeks ago. | 14:18 |
slittle | has there been any changes on the gerrit server side ? | 14:18 |
fungi | when did you run it? moments ago? earlier today? | 14:18 |
slittle | failed run ~20 min ago | 14:19 |
fungi | i think your branch creation command must have failed, because the branch doesn't seem to exist in gerrit yet: https://review.opendev.org/admin/repos/starlingx/openstack-armada-app,branches | 14:19 |
fungi | what method is the script using to create that branch? rest api call? | 14:20 |
slittle | git push ${review_remote} ${tag} | 14:22 |
slittle | ssh -p ${port} ${host} gerrit create-branch ${path} ${branch} ${tag} | 14:22 |
slittle | git config --local --replace-all "branch.${branch}.merge" refs/heads/${branch} | 14:22 |
slittle | git review --topic="${branch/\//.}" | 14:22 |
slittle | it's failing on the tag | 14:23 |
slittle | git push gerrit vf/portable-dc | 14:23 |
slittle | remote: error: branch refs/tags/vf/portable-dc: | 14:24 |
slittle | remote: use a SHA1 visible to you, or get update permission on the ref | 14:24 |
slittle | remote: User: slittle1 | 14:24 |
slittle | remote: Contact an administrator to fix the permissions | 14:24 |
fungi | aha, so it's not getting as far as the `gerrit create-branch` command i guess? | 14:24 |
fungi | which tag did you try to push? | 14:25 |
fungi | is it listed at https://review.opendev.org/admin/repos/starlingx/openstack-armada-app,tags ? | 14:25 |
slittle | vf/portable-dc | 14:25 |
fungi | yeah, i don't see the tag in gerrit even | 14:26 |
slittle | no, it is not listed | 14:26 |
fungi | is it possible you have a local path or branch with the same name as the tag, and your `git push` is trying to push a branch of the same name as the tag? | 14:26 |
slittle | git branch --all | grep port | 14:27 |
slittle | * f/portable-dc | 14:27 |
slittle | f/portable-dc is the branch, the tag gets a 'v' prefix | 14:27 |
frickler | is the sha1 that you tagged present on gerrit? | 14:29 |
fungi | okay, are all commits in the local history of your vf/portable-dc tag represented by changes in gerrit with identical commit ids? | 14:29 |
slittle | good question | 14:30 |
fungi | yes, as frickler is implying, the error message you pasted is consistent with you having tagged a branch state that contains commits that aren't in gerrit (maybe there are similar commits in gerrit but with different commit ids, for example) | 14:30 |
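A minimal sketch of the check fungi and frickler are describing, run in the local clone; it assumes the Gerrit remote is named gerrit (as in the script above) and that its remote-tracking refs are current:

```shell
# List commits reachable from the local tag that Gerrit has never seen;
# any output here explains the "use a SHA1 visible to you" rejection.
git fetch gerrit
git rev-list vf/portable-dc --not --remotes=gerrit
```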
slittle | yah, that was the problem | 14:31 |
fungi | cool, glad it was that simple | 14:31 |
slittle | thanks | 14:33 |
fungi | you're welcome! | 14:33 |
fungi | the wiki isn't sticking logins again, rebooting the server now (uptime 29 days) | 14:56 |
fungi | still hasn't come all the way up, probably decided to fsck the giant rootfs | 15:03 |
fungi | if it goes much longer i'll try to check the oob console | 15:04 |
clarkb | stepped away for a bit but I'm going to keep digging into zuul logs around that stable/2025.1 branch creation problem | 15:08 |
clarkb | if I could get reviews on https://review.opendev.org/c/opendev/system-config/+/947044 that would be really helpful to ensure we can land that and sync up ssh host keys for gerrit prior to the server swap monday | 15:09 |
fungi | wiki finally came back up | 15:09 |
clarkb | I can also manually copy the keys if it comes to that but having config management deal with it would be great | 15:09 |
clarkb | when reviewing, checking the secret vars content would be good to ensure I didn't mix up any values | 15:09 |
fungi | #status log Rebooted wiki.openstack.org to get OpenId logins working again | 15:09 |
opendevstatus | fungi: finished logging | 15:10 |
fungi | openid session is working for me again | 15:11 |
clarkb | on zuul01 we get this event logged 2025-04-16 08:41:09,099 DEBUG zuul.ConnectionEventQueue: [e: ce57da56aa904fc6b3fcd7ee88a91522] Submitting connection event to queue which maps to 'refName': 'refs/heads/stable/2025.1' getting created | 15:18 |
clarkb | then the next thing that happens with that event id is 2025-04-16 13:44:07,356 DEBUG zuul.GerritConnection: [e: ce57da56aa904fc6b3fcd7ee88a91522] Scheduling event from gerrit on zuul02 | 15:20 |
clarkb | corvus: ^ it almost looks like the trigger event for new branches is getting stuck in queue processing somehow? | 15:20 |
clarkb | I think we can rule out gerrit failing to emit the event on the event stream. We do seem to receive it around when I would expect to | 15:21 |
clarkb | it looks like the zuul restart happens on zuul01 at 13:45 and on zuul02 at 13:51. Something about shutting down zuul01 unsticks the queues so that zuul02 can process it maybe? | 15:23 |
clarkb | the puppet-tempest event is here: 2025-04-16 08:42:28,367 DEBUG zuul.ConnectionEventQueue: [e: afaeaf2203734532901dbfdee5a650ff] Submitting connection event to queue | 15:27 |
clarkb | that is on zuul01 | 15:27 |
clarkb | then on zuul02 we process that event: 2025-04-16 08:43:00,805 DEBUG zuul.Scheduler: [e: afaeaf2203734532901dbfdee5a650ff] Forwarding trigger event almost immediately | 15:28 |
clarkb | importantly this event comes after the puppet-openstacklib event | 15:28 |
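For reference, the kind of cross-scheduler trace being described can be pulled together roughly like this; the debug log path is an assumption for illustration:

```shell
# Follow a single Zuul event id across both schedulers' debug logs.
for host in zuul01 zuul02; do
  echo "=== $host ==="
  ssh "$host" grep ce57da56aa904fc6b3fcd7ee88a91522 /var/log/zuul/debug.log
done
```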
clarkb | these events go into a zookeeper queue that is supposed to be sequenced by zookeeper and processed in a fifo manner | 15:28 |
clarkb | my suspicion is that something with the zookeeper queue implementation isn't doing what we expect and we end up out of order or skipping events somehow | 15:29 |
clarkb | maybe a race between reading the next event and writing events? | 15:29 |
clarkb | oh ya one sec | 15:29 |
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/zk/event_queues.py#L338-L344 the comment there may be a clue | 15:30 |
fungi | wiki load average is hovering around 10-15 after reboot, probably llm training crawlers again | 15:31 |
clarkb | corvus: looking at iterEvents I suspect there is some race where we're able to set event id offset to be too large ~here https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L526-L544 and when things restarted we ended up setting that value to None in the fallback for some reason which unstuck us | 15:34 |
clarkb | corvus: I think you're far more familiar with that code than I am. Does that seem plausible or do you think something else could cause this behavior? | 15:35 |
clarkb | corvus: maybe we should add logging to https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L542-L543 ~here that logs the event id offset and the current event? | 15:41 |
clarkb | then we'd theoretically be able to trace if and when a skip happens as well as the potential None reset that got things moving again? | 15:41 |
corvus | looking | 15:48 |
corvus | clarkb: you switched from event ce57da56aa904fc6b3fcd7ee88a91522 to event afaeaf2203734532901dbfdee5a650ff above.. why? | 15:52 |
clarkb | corvus: ce57 is the problematic delayed event. afae arrives after ce57 but ends up getting processed. I was trying to illustrate that ce57 appears to be skipped in the event queue | 15:52 |
corvus | got it | 15:52 |
clarkb | by showing that later events do get processed. Then it isn't until around the time zuul restarts that ce57 unsticks itself and proceeds | 15:53 |
clarkb | another idea is that maybe the listEvents here https://opendev.org/zuul/zuul/src/branch/master/zuul/zk/event_queues.py#L342 is able to return a newer queue event but not an older one if two events race each other depending on zookeeper state? However, I don't think that is how zookeeper works and we have a single event stream processor adding things to that queue so they should be | 16:01 |
clarkb | sequential from a single writer and not from multiple writers | 16:01 |
clarkb | which is why I've largely settled on the event id offset behavior for iter() | 16:01 |
corvus | https://paste.opendev.org/show/bXzpqIgUzSFAA4t6oeeS/ | 16:07 |
corvus | clarkb: that looks suspicious to me | 16:07 |
corvus | i don't see any confirmation of forwarded events between the event arrival and that error, which makes me think that error is for that particular event | 16:07 |
clarkb | oh I was looking backward in the log and the trick was looking forward | 16:07 |
clarkb | corvus: ya and I guess if we can't process the event it would stay in the queue but the event id offset might jump ahead? | 16:08 |
corvus | seems likely... then we reprocess it later.... i wonder if the event_id_offset has slightly nerfed our resiliency here... | 16:09 |
corvus | like, perhaps before we added that we might have retried this event indefinitely | 16:09 |
corvus | but let's split this into two concerns: 1) this error; 2) the event sticking in the queue and retry handling | 16:10 |
clarkb | while it is possible there is a problem to solve on the gerrit side, I think it is somewhat expected for gerrit to protect itself when its connection limits are reached | 16:10 |
corvus | focusing on 1 first since that's the most opendev part of this | 16:10 |
clarkb | we do set limits on connections per ip and per user for example | 16:10 |
corvus | you think that's a connection limit? maybe we should check the gerrit logs? can you do that? i want to check something else in zuul | 16:10 |
clarkb | I think it is possible. And yes I'll look to see if I can find anything in gerrit indicating that was the case | 16:11 |
clarkb | specifically this occurred during a number of new ref creations which is also when we've seen this happen in the past. maybe when that happens all the ci systems rush to fight for connections or zuul alone generates enough connections all at once to trip over the user limit or something | 16:11 |
corvus | the thing i was checking in zuul: when we query changes, we retry 3 times. but we don't do that for branch listings or getting the default branch. so there is an opportunity to put this in a retry loop if we find that gerrit doesn't behave well here. | 16:15 |
corvus | there's a cap to the number of parallel zuul http connections.. probably something like 7 across both schedulers? maybe a few more or less. but if the only thing going on was the branch creation, then subtract 4 from that. | 16:17 |
clarkb | ack I still haven't found any evidence of gerrit complaining about this specific request. In the error log at 18:41:09 I see the release user create the new branch then nothing complaining about the 18:41:19 request. I see requests from zuul in the httpd_log but do not see the request for anything for puppet-openstacklib there. I need to check apache logs, kernel logs and the | 16:18 |
clarkb | sshd_logs still | 16:18 |
clarkb | corvus: do you know what the url in this request would roughly be? | 16:18 |
corvus | yeah 1 sec | 16:19 |
clarkb | GET /openstack/puppet-tempest/info/refs?service=git-upload-pack maybe? | 16:20 |
corvus | should be /openstack/puppet-openstacklib/info/refs?service=git-upload-pack | 16:20 |
corvus | and yes, GET | 16:21 |
corvus | (note you had a different project there) | 16:21 |
clarkb | yup sorry that was an example I found | 16:21 |
corvus | it should be authenticated and have a zuul user agent | 16:21 |
clarkb | I think I've confirmed that no such request makes it into the gerrit httpd_log so it must've been rejected upstream | 16:21 |
corvus | clarkb: looking at the traceback, it's in a "send" call, so we might entertain the idea that it never sent the request | 16:22 |
clarkb | but also I think I can ignore the ssh log as this should be done over http | 16:22 |
corvus | yeah, maybe gerrit hung up while it was still in the socket queue or something | 16:22 |
clarkb | ya I'm going to see if apache has any clues | 16:22 |
clarkb | corvus: I'm unable to find any evidence in apache logs or syslog that the connection ever made it to gerrit | 16:26 |
corvus | oh right, because of the reverse proxy, it would have been apache that would receive the GET first... | 16:27 |
clarkb | our iptables rules that block when there are too many connections do also log and I'm trying to triple check that there really aren't any such logs in syslog | 16:27 |
corvus | (quick aside -- i was looking at the gerrit server general error log; i didn't see anything related... but i did notice we have a lot of ssh connection errors from one specific ip) | 16:28 |
clarkb | yes that IP is in IBM and I think it is an old jenkins that can't negotiate ssh because its mina is too old | 16:29 |
clarkb | I'm hoping that when we switch servers their firewall rules will stop allowing them to connect and they will go away | 16:29 |
corvus | sounds like a plan | 16:29 |
clarkb | but ya I can find no evidence we blocked the connection from zuul or that apache ever received it | 16:30 |
clarkb | maybe something on the internet lost the SYN or ACK in the initial connection setup? | 16:30 |
clarkb | we also do reject-with icmp6-port-unreachable and the python exception indicates there was no response, which doesn't seem to match up | 16:30 |
clarkb | the one thing that makes me suspicious of simply pinning this on the internet is this always seems to happen as a side effect of the openstack release team or starlingx creating a bunch of branches all at once | 16:34 |
clarkb | but maybe that rush of effort is what trips the networking to have a sad | 16:35 |
corvus | yeah that bothers me too | 16:36 |
corvus | 2025-03-21 19:42:53,359 DEBUG zuul.GerritConnection: GET: https://review.opendev.org/a/projects/openstack%2Fopenstack-ansible-os_horizon/HEAD | 16:36 |
corvus | similar failure (but getting the default branch instead of the branch list) | 16:37 |
corvus | 2025-04-11 09:28:53,964 ERROR zuul.GerritConnection: Cannot get references from openstack/puppet-ironic | 16:37 |
corvus | and that's the same error (getting the branches) | 16:37 |
corvus | those are the only other incidents in the logs | 16:38 |
corvus | not sure if you know off-hand if os_horizon was making branches then.... | 16:38 |
clarkb | https://review.opendev.org/q/project:openstack/releases this shows puppet-ironic wasn't making a new branch on 4/11 I think | 16:39 |
clarkb | there was a bunch of activity on 3/21 including https://review.opendev.org/c/openstack/releases/+/942201 and others | 16:40 |
corvus | in the place where zuul does have a retry loop for change queries, we don't log if we use it. so we don't have a good way to tell if this happens often for other types of queries. | 16:41 |
clarkb | 19:42 UTC would've been 12:42PM local and the timestamp on ^ is 11:24AM for merging so possibly overlapping in the post merge jobs | 16:41 |
corvus | i think apache is configured with 150 workers | 16:42 |
corvus | i'm not sure how many threads gerrit has | 16:42 |
clarkb | corvus: max threads is 150 (min is 20 and maxqueued is 200) | 16:42 |
clarkb | naively that seems like we should see apache reject connections before gerrit does | 16:44 |
clarkb | but I can't find apache logs saying it has done so. Maybe it doesn't log those rejections at the log level we have set? | 16:44 |
corvus | apache's default backlog is 511 | 16:45 |
corvus | given how infrequent this is, maybe at this point we just need to say "shrug; internet" and try to defend | 16:47 |
corvus | i think there's 3 changes to zuul we can make: | 16:48 |
corvus | 1) log any time we retry a request so we have better visibility | 16:48 |
corvus | 2) add retries to the branch list/head methods | 16:48 |
corvus | 3) do something with the failed events (whether it's retry, or discard) | 16:48 |
corvus | #2 should fix the problem, the others are nice to have | 16:49 |
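In shell terms, the retry idea in item 2 looks roughly like the sketch below; the URL comes from the discussion above, while the attempt count and backoff are assumptions (the real fix belongs in Zuul's Gerrit driver, and the real request is authenticated with a Zuul user agent):

```shell
# Retry the Gerrit ref listing a few times with a short backoff before
# giving up, rather than failing the event on the first network hiccup.
url='https://review.opendev.org/openstack/puppet-openstacklib/info/refs?service=git-upload-pack'
for attempt in 1 2 3; do
  curl -sf -o /dev/null "$url" && break
  echo "attempt $attempt failed, retrying" >&2
  sleep $((attempt * 2))
done
```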
corvus | i suspect that the old behavior, before the threadpool executor, was to discard failed events, so that's probably the way to go. | 16:49 |
corvus | i'll start working on those. | 16:50 |
clarkb | corvus: for 3) you're saying if after retrying we still fail then discard the event entirely? | 16:50 |
clarkb | I suspect that is the case because in the past we've had to take manual intervention to unstick things | 16:51 |
corvus | yep | 16:51 |
clarkb | from a zuul perspective being more defensive seems like a good idea. There are many reasons these requests can fail. From an opendev perspective we should try and remember to follow up and check the new log info and decide if we should do further tuning of gerrit | 16:53 |
clarkb | one other crazy thought: the zuul connections traverse via ipv6. I know frickler and maybe some others have had ipv6 issues with vexxhost network blocks. Possible this is a symptom of that? Though I think their issues were more all or nothing | 16:54 |
clarkb | not once in a blue moon problems | 16:54 |
corvus | btw, issue #3 is new since february. so the actual problem (internet/apache/gerrit failed when we asked for a ref listing) may have been there for a long time, and since we had no retries, it just would have failed and never recovered until we did a tenant reconfig. the move to the threadpool executor for event processing and #3 gave us the accidental recovery due to restarting, but i think that was completely unintentional. | 16:54 |
corvus | i think that's all consistent with observed behavior | 16:55 |
clarkb | ++ | 16:55 |
corvus | and i agree, i think we should keep an eye on this and consider whether we need to add more logging or monitoring to apache, if we see that it happens more often | 16:56 |
corvus | (i'm currently thinking that if we're missing one request every 2 weeks, it's going to be super hard to track it down; more frequent errors make that more tenable) | 16:56 |
clarkb | anyone willing to review https://review.opendev.org/c/opendev/system-config/+/947044 ? I will need to pop out in about 45 minutes to run an errand then grab lunch. But if we're happy with that I'm happy to monitor it this afternoon | 17:24 |
fungi | looking | 17:30 |
fungi | lgtm | 17:32 |
clarkb | fungi: did you check the secrets vars? | 17:33 |
clarkb | I think my biggest concern with this change is I'll do something silly and write the wrong keytype into the wrong file via mixed up values or deploy the 03 values as the prod values. Though I tried really hard to avoid those pitfalls | 17:34 |
clarkb | but assuming that continues to check out we can approve that early afternoon for me and ensure it applies the way we anticipate | 17:35 |
fungi | on bridge? they look fine, though i didn't compare them bit-for-bit with the copies on review02 | 17:37 |
clarkb | ya on bridge. If they appear to match types that's probably sufficient | 17:37 |
fungi | they do | 17:37 |
clarkb | ed25519 doesn't have the ecdsa value etc | 17:37 |
clarkb | Cool | 17:37 |
fungi | on the pubkeys specifically i guess you mean | 17:38 |
fungi | since that's not apparent in the privkey blobs | 17:38 |
clarkb | ya and I tried to use a process where I applied the privkey then appended .pub to the source when doing the pubkey to ensure everything stayed aligned | 17:39 |
clarkb | so if the pubkey looks good the privkey should be good too | 17:39 |
fungi | cool, yes all looks like it matches | 17:39 |
clarkb | and the ecdsa values all have increasing bit lengths | 17:39 |
clarkb | so we can kind of infer there that they are likely to be correct too | 17:39 |
fungi | yep | 17:40 |
fungi | i'm popping out for a bit, shouldn't be an hour | 19:02 |
opendevreview | Vladimir Kozhukalov proposed openstack/project-config master: Add official-openstack-repo-jobs to openstack-helm-infra https://review.opendev.org/c/openstack/project-config/+/947534 | 19:42 |
Clark[m] | I'm almost back. I'll plan to approve the ssh host key change once back in front of a computer | 20:01 |
fungi | okay, back as well | 20:03 |
clarkb | approved | 20:16 |
fungi | great timing | 20:16 |
clarkb | out of an abundance of caution I think I've decided I should just make a backup copy of each of the four keys on 02 | 20:17 |
fungi | fair enough, just in case the values on bridge are wrong in some way | 20:17 |
clarkb | ya | 20:19 |
opendevreview | Merged openstack/project-config master: Add official-openstack-repo-jobs to openstack-helm-infra https://review.opendev.org/c/openstack/project-config/+/947534 | 20:20 |
clarkb | once it applies we can check the values and then I'm 99% sure that review03's gerrit needs to be restarted to pick up the new keys. I'll do that and run some ssh keyscans to ensure everything matches | 20:20 |
clarkb | I only backed up the privkeys as I believe the pubkeys can be derived from the privkeys | 20:20 |
clarkb | (note we also have backups of this stuff but having local copies makes comparing and double checking everything simpler) | 20:21 |
mnaser | i dunno if this is some sort of red herring however | 20:38 |
mnaser | https://opendev.org/openstack/nova/compare/724d92d...1aeeb96 that seems to give me a 404, but we are seeing it _sometimes_ | 20:38 |
mnaser | for example, this works fine https://opendev.org/openstack/nova/compare/12cd287...3dd97ca | 20:38 |
mnaser | we use this with renovate to give us diffs - https://github.com/vexxhost/atmosphere/pull/2476 -- but i changed it now to include full digest instead of short one, but still.. odd | 20:39 |
clarkb | mnaser: the full sha works? | 20:39 |
mnaser | yeah i just updated it and so it updated the pr | 20:39 |
mnaser | https://opendev.org/openstack/nova/compare/724d92dd382bf378f03487496c93bfba23fbc5f9...1aeeb96ffa646f4b4ebd2af8336e9f6eba4e974a | 20:40 |
mnaser | works fine | 20:40 |
fungi | i guess it could be that one of those refs is missing from one of the gitea backends? | 20:40 |
clarkb | fungi: if that were the case the full sha shouldn't work either | 20:41 |
clarkb | https://opendev.org/openstack/nova/compare/724d92dd38...1aeeb96ffa also seems to work | 20:41 |
clarkb | I wonder if it is a collision with the short ref | 20:41 |
mnaser | that's kinda what i was leaning to | 20:41 |
mnaser | i can live happily with the long commit but i was just calling it out in case well, it was something else weird going on | 20:41 |
clarkb | https://opendev.org/openstack/nova/compare/724d92dd...1aeeb96 also works | 20:41 |
clarkb | but remove that last d and it doesn't so my hunch is that prefix isn't long enough to be unique | 20:42 |
clarkb | https://opendev.org/openstack/nova/compare/724d92dd...1aeeb9 also works but remove the 9 at the very end and it doesn't | 20:42 |
fungi | interestingly, https://github.com/openstack/nova/commit/724d92d and https://github.com/openstack/nova/commit/1aeeb96 both work though | 20:43 |
clarkb | fungi: ya even 1aeeb9 works on gitea | 20:44 |
mnaser | and https://github.com/openstack/nova/compare/724d92d...1aeeb96 i guess | 20:44 |
clarkb | it's possible that the colliding object (if that is the case) is in gitea but not github. Think gerrit refs (I don't think those are replicated to github) | 20:45 |
mnaser | ah right yes | 20:45 |
fungi | oh, right, for some reason i copied github urls from somewhere? and didn't even notice | 20:46 |
fungi | i meant to test https://opendev.org/openstack/nova/commit/724d92d which does indeed 404 | 20:47 |
clarkb | cool I think this is consistent with the prefix simply being too short, and nova being large with a long history is probably one of the repos most likely to have this problem | 20:48 |
fungi | while https://opendev.org/openstack/nova/commit/1aeeb96 is fine | 20:48 |
clarkb | to confirm this we'd probably have to do an object listing on gitea or something | 20:48 |
fungi | https://opendev.org/openstack/nova/commit/724d92dd works but all other 15 possibilities for the last hex digit are 404, so it's not a commit at least | 20:49 |
clarkb | ya my hunch is refs/changes/xyz/yz/meta or whatever the path is | 20:50 |
fungi | oh, unless one of those 15 also has a collision in the next position | 20:50 |
clarkb | which isn't a commit object | 20:50 |
clarkb | I don't think | 20:51 |
clarkb | I guess we could find the hash of a change meta object and see if gitea's commit/hash path would render it | 20:51 |
fungi | https://paste.opendev.org/show/b4BdBJyGnyD0e5iTkgHT/ | 20:56 |
fungi | 724d92d8 is a tree object | 20:56 |
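A sketch of one way to run that kind of check, inside a clone of openstack/nova that includes Gerrit's extra refs; only the prefix from the discussion above is assumed:

```shell
# Print every object the short prefix could refer to along with its type;
# more than one match, or a non-commit type, explains the 404 in gitea.
git rev-parse --disambiguate=724d92d | while read -r sha; do
  printf '%s %s\n' "$sha" "$(git cat-file -t "$sha")"
done
```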
clarkb | cool so that confirms it | 20:57 |
fungi | mnaser: ^ definitely a collision between truncated object hashes | 20:57 |
clarkb | its possible the gitea service log would also show an error that leads to the 404 (I would definitely expect it if it 500'd) | 20:57 |
mnaser | Wow. It's my lucky day then :) | 21:14 |
mnaser | Having said that, 404 is an unpleasant response to get as a user | 21:15 |
clarkb | I think it is accurate but under-informative | 21:15 |
clarkb | "404 - non unique identifier" instead of "404 - Not Found" would probably help a lot | 21:16 |
JayF | it's interesting there's not an http status code indicating a valid but ambiguous request | 21:17 |
JayF | honestly the ideal case is wikipedia disambiguation style pages for the human case, at least | 21:18 |
clarkb | 409 conflict might work (I think it's usually there to tell you that you can't update a resource when someone else is updating it though) | 21:20 |
JayF | oh, good call | 21:20 |
JayF | I wonder how a browser handles a 409 with a response body | 21:20 |
JayF | Hilarious that the Ironic dev forgets about the 409, our API throws 409s like nobody's business (when a node is locked) | 21:20 |
clarkb | my interpretation is that 404 is correct here | 21:21 |
clarkb | if we weren't working with git and had both /foobar and /foobaz then /foo would 404 | 21:21 |
JayF | yeah, but the application can understand and throw a proper status code; that's the case for many returned status codes | 21:22 |
clarkb | fungi: the gerrit ssh host key change should merge in the next few minutes (its getting through the testinfra test cases now) | 21:28 |
opendevreview | Merged opendev/system-config master: Manage gerrit's ecdsa and ed25519 hostkeys https://review.opendev.org/c/opendev/system-config/+/947044 | 21:31 |
clarkb | here we go | 21:32 |
clarkb | build was successful | 21:35 |
clarkb | timestamps on review02 are still from 2020, which is good; we wanted noops there | 21:36 |
clarkb | timestamps on review03 are from just about now. | 21:36 |
clarkb | ssh-keyscan shows review03 has not changed its host keys so I do need to restart the service. Doing that next | 21:36 |
clarkb | docker compose down is not moving quickly on review03. Which is weird considering it isn't really doing anything | 21:38 |
clarkb | oh we set the stop signal to sighup | 21:39 |
clarkb | however I wonder if this doesn't work due to the podman signal problem on noble and that signal was never received | 21:40 |
clarkb | I'm going to manually issue a sighup | 21:40 |
clarkb | that was exactly it | 21:40 |
clarkb | I'm going to have to think about this. I don't think it is a blocker, but it is an annoyance | 21:41 |
clarkb | infra-root ^ fyi | 21:41 |
clarkb | ssh-keyscan lgtm though it only fetches the ecdsa, ed25519 and rsa keys. ecdsa 384 and ecdsa 521 are not types you can feed into ssh-keyscan | 21:43 |
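A sketch of the comparison being described, assuming the keys in question are the ones Gerrit's sshd offers on port 29418 (host names as discussed above):

```shell
# Fetch the host keys each server offers and diff the key material;
# ssh-keyscan can only ask for the rsa, ecdsa (nistp256) and ed25519 types.
diff <(ssh-keyscan -p 29418 -t rsa,ecdsa,ed25519 review02.opendev.org 2>/dev/null | sed 's/^[^ ]* //' | sort) \
     <(ssh-keyscan -p 29418 -t rsa,ecdsa,ed25519 review03.opendev.org 2>/dev/null | sed 's/^[^ ]* //' | sort)
```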
clarkb | but the output appears to match review02 now so I'm pretty happy with that. Not happy about the sighup thing. But I think we can do something like sudo docker compose down && sudo kill -HUP $gerrit_pid for now | 21:44 |
clarkb | the next thing I want to test is a reboot of the server | 21:46 |
clarkb | we did some testing of this back in january with different podman restart options and my expectation is that the server reboots and does not start gerrit again | 21:47 |
clarkb | which I think is preferable but I want to confirm that behavior | 21:47 |
clarkb | however, that will kill our running screen | 21:47 |
clarkb | so I'll hold off on doing that until maybe tomorrow in case anyone wants to look at that context first (its a good dry run of what the outage will look like) | 21:47 |
fungi | huh, okay so hangup was fine under docker but not podman? | 21:56 |
clarkb | fungi: yes this is a podman apparmor issue on noble. We first ran into it with haproxy as we use sighup there to do graceful config reloads | 21:57 |
clarkb | specifically for sighup | 21:57 |
clarkb | it is possible that one of the allowed signals is a signal that gerrit accepts as a graceful shutdown command we can switch to that. I seem to recall that the jvm actually merges a whole bunch of signals into one action by default | 21:58 |
clarkb | but the apparmor rules for podman containers only allow like sigkill and sigint or something like that through | 21:58 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2089664 | 21:59 |
clarkb | int, quit, kill, term are allowed now | 22:00 |
clarkb | let me try and cross check with gerrit now | 22:00 |
clarkb | fungi: https://docs.oracle.com/en/java/javase/17/troubleshoot/handle-signals-and-exceptions.html#GUID-A7D91931-EA03-4BA0-B58B-53A67F9CBD21 reading that I think sigterm, sigint, and sighup are all equivalent and will trigger the jvm's shutdown handler which gerrit hooks into in daemon.java | 22:05 |
clarkb | I think that means we can convert from using sighup to sigint instead | 22:06 |
fungi | ah, neat, so we can use any of those that aren't being filtered | 22:06 |
clarkb | I'm going to get on discord and ask about this because this is my not-really-a-java-dev understanding and there are real java devs on gerrit discord | 22:07 |
fungi | not a bad idea | 22:07 |
clarkb | but ya we may be able to work around this entirely with a simple docker-compose.yaml update to use sigint instead | 22:07 |
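The docker-compose.yaml tweak being described would look roughly like this; the service name and grace period are assumptions, and the stop_signal line is the point:

```yaml
# Minimal sketch: stop Gerrit with SIGINT, which the JVM treats like SIGHUP
# but which podman's apparmor profile on noble actually delivers.
services:
  gerrit:
    stop_signal: SIGINT
    stop_grace_period: 300s  # assumed; how long compose waits before SIGKILL
```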
clarkb | and until then the workaround I did use isn't terrible (just use kill as root to send the signal we want) | 22:07 |
clarkb | https://zuul.opendev.org/t/openstack/build/1f8f38666d864d15a5d9ba7a1bc7cf15/log/job-output.txt#21121-21123 you can see the problem here in the upgrade job because we stop gerrit as part of the upgrade process. There is a 5 minute gap which is the timeout where it falls back to using sigkill? | 22:24 |
clarkb | I'm going to push up a WIP change that uses sigint and we'll see if that time delta goes down at least | 22:24 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use sigint instead of sighup to stop gerrit https://review.opendev.org/c/opendev/system-config/+/947540 | 22:29 |
clarkb | https://zuul.opendev.org/t/openstack/build/0b88866c3f52495985c32c08a0a048c2/log/job-output.txt#21119-21121 that looks much better if this is a valid way to gracefully stop gerrit | 23:15 |