Wednesday, 2025-04-16

mnaserinfra-root: i think github replication is brooooken :<00:46
mnaserhttps://github.com/openstack/openstack-helm-infra vs https://opendev.org/openstack/openstack-helm-infra00:46
mnaserseems specifically that repo didnt get replicated00:46
mnaserhttps://github.com/openstack/openstack-helm seems fine00:46
tonybmnaser: can you confirm that openstack-helm-infra is using the same project templates as openstack-helm?00:51
mnaseri'll have to check project-config, the "repo" did get retired, lets see00:51
tonybI recall helping someone diagnose that a while back but didn't follow up 00:51
mnaserthe last commit that worked renamed things to infra_*.yaml in zuul.d so no jobs ran anymore https://opendev.org/openstack/openstack-helm-infra/commit/a96a6b955e478271f323620ef8b533c2e1e2a82e00:54
Clark[m]So they just retired things in the wrong order00:55
tonybAh.00:57
opendevreviewTakashi Kajinami proposed openstack/project-config master: watcher: Remove notification of puppet-watcher  https://review.opendev.org/c/openstack/project-config/+/94746612:06
opendevreviewTakashi Kajinami proposed openstack/project-config master: zaqar: Fix missing notification about stable branches  https://review.opendev.org/c/openstack/project-config/+/94746712:08
opendevreviewTakashi Kajinami proposed openstack/project-config master: keystone: Fix missing notification about stable branches  https://review.opendev.org/c/openstack/project-config/+/94746812:10
opendevreviewTakashi Kajinami proposed openstack/project-config master: keystone: Fix missing notification about stable branches  https://review.opendev.org/c/openstack/project-config/+/94746812:11
opendevreviewTakashi Kajinami proposed openstack/project-config master: zaqar: Fix missing notification about stable branches  https://review.opendev.org/c/openstack/project-config/+/94746712:12
hasharfungi: Clark[m]: hello, I have a  change for git-review to enable coloring for messages received from Gerrit. I had it for a while and it seems to work for me. If one of you could have a look at it: https://review.opendev.org/c/opendev/git-review/+/914000 12:13
hasharscreenshots of before/after can be seen on our task at https://phabricator.wikimedia.org/T359981 :-]12:13
opendevreviewTakashi Kajinami proposed openstack/project-config master: storlets: Add gerritbot notification about stable branches  https://review.opendev.org/c/openstack/project-config/+/94746912:14
tkajinamhmm... I see CI Job is not started for https://review.opendev.org/c/openstack/puppet-openstacklib/+/947389 even after manual recheck, but I've not yet identified anything suspicious in the repo.12:40
tkajinamI've confirmed the jobs are started for 2025.1 branch of the other puppet repos12:40
tkajinamok I see the jobs are queued but then immediately disappear in https://zuul.opendev.org/t/openstack/status ... I wonder if I can check any errors to identify its cause ?12:43
tkajinamone thing I noticed is that publish-openstack-puppet-branch-tarball job has been failing due to post failure but idk if that's related12:46
tkajinamhttps://zuul.opendev.org/t/openstack/build/ec2e4915082845568f00f2235c5fc65212:46
fricklertkajinam: let me check zuul logs12:50
tkajinamfrickler, thx !12:51
Clark[m]I think this is the issue of new branch not being registered with zuul that requires a tenant config reload.12:56
Clark[m]When was the branch created? Hopefully recently enough we still have logs12:56
tkajinamthese were created a few hours ago12:57
tkajinamhttps://review.opendev.org/c/openstack/releases/+/94689912:57
frickleryes, I just found that the branch is missing on https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/puppet-openstacklib12:57
tkajinamah12:57
fricklerI'll leave it to others to check zuul logs for that12:58
clarkbI was ssh'ing in to see if I could quickly find the relevant logs and coincidentally the server rebooted for the upgrade that corvus initiated yesterday13:51
clarkbtook me a second to figure out what happened there13:51
corvuswhat's the missing branch?13:53
corvusis it stable/2025.1?  that's there in the web ui now.13:54
corvusone of the bugfixes that i'm rebooting to pick up could cause the web ui to show old data, but not the schedulers.13:55
clarkbcorvus: yes stable/2025.113:58
clarkbcorvus: are you saying you think the lack of the branch in the web ui could just be a red herring and the schedulers should know about the branch?13:58
clarkbcorvus: fwiw I think comparing to puppet-tempest I see it forwarding an event for ref-updated belonging to stable/2025.1 but I haven't found that for puppet-openstacklib yet13:59
corvusclarkb: yes, that's what i mean about the web ui.  could still be a problem in the scheduler of course. just don't use the web ui as confirmation for issues seen prior to now.14:00
tkajinamyeah stable/2025.1 is the one with the problem14:03
tkajinamI triggered recheck and now I see the job is queued14:03
tkajinamhttps://review.opendev.org/c/openstack/puppet-openstacklib/+/94738914:03
tkajinamhttps://zuul.opendev.org/t/openstack/status?change=94738914:04
slittleTo ssh://review.opendev.org:29418/starlingx/openstack-armada-app.git14:06
slittle ! [remote rejected] f/portable-dc -> f/portable-dc (prohibited by Gerrit: update for creating new commit object not permitted)14:06
slittleerror: failed to push some refs to 'ssh://review.opendev.org:29418/starlingx/openstack-armada-app.git'14:06
fricklerslittle: did you try to push commits to that branch without review?14:13
fricklerslittle: also allow me to take the opportunity to remind you of https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=014:14
fungislittle: i guess you were trying to push a merge commit? are there any commit ids in your local f/portable-dc branch history which don't exist in gerrit?14:16
slittleit is a new branch14:16
slittleThe script that creates the branches and updates the .gitreview across all the starlingx gits is unchanged, and worked 2 weeks ago.14:18
slittlehas there been any changes on the gerrit server side ?14:18
fungiwhen did you run it? moments ago? earlier today?14:18
slittlefailed run ~20 min ago14:19
fungii think your branch creation command must have failed, because the branch doesn't seem to exist in gerrit yet: https://review.opendev.org/admin/repos/starlingx/openstack-armada-app,branches14:19
fungiwhat method is the script using to create that branch? rest api call?14:20
slittlegit push  ${review_remote} ${tag}14:22
slittlessh -p ${port} ${host} gerrit create-branch ${path} ${branch} ${tag}14:22
slittlegit config --local --replace-all "branch.${branch}.merge" refs/heads/${branch}14:22
slittlegit review --topic="${branch/\//.}"14:22
slittleit's failing on the tag14:23
slittlegit push gerrit vf/portable-dc14:23
slittleremote: error: branch refs/tags/vf/portable-dc:14:24
slittleremote: use a SHA1 visible to you, or get update permission on the ref14:24
slittleremote: User: slittle114:24
slittleremote: Contact an administrator to fix the permissions14:24
fungiaha, so it's not getting as far as the `gerrit create-branch` command i guess?14:24
fungiwhich tag did you try to push?14:25
fungiis it listed at https://review.opendev.org/admin/repos/starlingx/openstack-armada-app,tags ?14:25
slittlevf/portable-dc14:25
fungiyeah, i don't see the tag in gerrit even14:26
slittleno, it is not listed14:26
fungiis it possible you have a local path or branch with the same name as the tag, and your `git push` is trying to push a branch of the same name as the tag?14:26
slittlegit branch --all | grep port14:27
slittle* f/portable-dc14:27
slittlef/portable-dc is the branch, the tag gets a 'v' prefix14:27
frickleris the sha1 that you tagged present on gerrit?14:29
fungiokay, are all commits in the local history of your vf/portable-dc tag represented by changes in gerrit with identical commit ids?14:29
slittlegood question14:30
fungiyes, as frickler is implying, the error message you pasted is consistent with you having tagged a branch state that contains commits that aren't in gerrit (maybe there are similar commits in gerrit but with different commit ids, for example)14:30
slittleyah, that was the problem14:31
fungicool, glad it was that simple14:31
slittlethanks14:33
fungiyou're welcome!14:33
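For scripts like the one pasted above, a pre-push sanity check can catch this class of rejection early. A minimal sketch in Python, assuming the remote is named `gerrit` and the tag is `vf/portable-dc` as in the commands above (the real branching script may differ); `git rev-list <tag> --not --remotes=<remote>` lists commits reachable from the tag that the remote does not already have.

```python
# Sketch of a pre-push check for the branching script above: make sure
# every commit reachable from the local tag already exists on the
# Gerrit remote.  Remote name and tag are taken from the pasted
# commands and may differ in the real script.
import subprocess
import sys

REMOTE = "gerrit"          # remote pointing at review.opendev.org
TAG = "vf/portable-dc"     # the tag being pushed

def git(*args):
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

git("fetch", REMOTE)  # refresh remote-tracking refs

# Commits reachable from the tag but not from any ref fetched from the
# remote; if this is non-empty, Gerrit will refuse the push because it
# would create new commit objects outside of review.
missing = git("rev-list", TAG, "--not", f"--remotes={REMOTE}").split()

if missing:
    print(f"{len(missing)} commit(s) in {TAG} are not on {REMOTE} yet:")
    print("\n".join(missing[:10]))
    sys.exit(1)
print(f"all commits in {TAG} already exist on {REMOTE}")
```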
fungithe wiki isn't sticking logins again, rebooting the server now (uptime 29 days)14:56
fungistill hasn't come all the way up, probably decided to fsck the giant rootfs15:03
fungiif it goes much longer i'll try to check the oob console15:04
clarkbstepped away for a bit but I'm going to keep digging into zuul logs around that stable/2025.1 branch creation problem15:08
clarkbif I could get reviews on https://review.opendev.org/c/opendev/system-config/+/947044 that would be really helpful to ensure we can land that and sync up ssh host keys for gerrit prior to the server swap monday15:09
fungiwiki finally came back up15:09
clarkbI can also manually copy the keys if it comes to that but having config management deal with it would be great15:09
clarkbwhen reviewing, checking the secret vars content would be good to ensure I didn't mix up any values15:09
fungi#status log Rebooted wiki.openstack.org to get OpenId logins working again15:09
opendevstatusfungi: finished logging15:10
fungiopenid session is working for me again15:11
clarkbon zuul01 we get this event logged 2025-04-16 08:41:09,099 DEBUG zuul.ConnectionEventQueue: [e: ce57da56aa904fc6b3fcd7ee88a91522] Submitting connection event to queue which maps to 'refName': 'refs/heads/stable/2025.1' getting created15:18
clarkbthen the next thing that happens with that event id is 2025-04-16 13:44:07,356 DEBUG zuul.GerritConnection: [e: ce57da56aa904fc6b3fcd7ee88a91522] Scheduling event from gerrit on zuul0215:20
clarkbcorvus: ^ it almost looks like the trigger event for new branches is getting stuck in queue processing somehow?15:20
clarkbI think we can rule out gerrit failing to emit the event on the event stream. We do seem to receive it around when I would expect to15:21
clarkbit looks like the zuul restart happens on zuul01 at 13:45 and on zuul02 at 13:51. Something about shutting down zuul01 unsticks the queues so that zuul02 can process it maybe?15:23
clarkbthe puppet-tempest event is here: 2025-04-16 08:42:28,367 DEBUG zuul.ConnectionEventQueue: [e: afaeaf2203734532901dbfdee5a650ff] Submitting connection event to queue15:27
clarkbthat is on zuul0115:27
clarkbthen on zuul02 we process that event: 2025-04-16 08:43:00,805 DEBUG zuul.Scheduler: [e: afaeaf2203734532901dbfdee5a650ff] Forwarding trigger event almost immediately15:28
clarkbimportantly this event comes after the puppet-openstacklib event15:28
clarkbthese events go into a zookeeper queue where they are supposed to be sequenced by zookeeper and processed in a fifo manner15:28
clarkbmy suspicion is that something with the zookeeper queue implementation isn't doing what we expect and we've ended up out of order or skipping events somehow15:29
clarkbmaybe a race between reading the next event and writing events?15:29
clarkboh ya one sec15:29
clarkbhttps://opendev.org/zuul/zuul/src/branch/master/zuul/zk/event_queues.py#L338-L344 the comment there may be a clue15:30
fungiwiki load average is hovering around 10-15 after reboot, probably llm training crawlers again15:31
clarkbcorvus: looking at iterEvents I suspect there is some race where we're able to set event id offset to be too large ~here https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L526-L544 and when things restarted we ended up setting that value to None in the fallback for some reason which unstuck us15:34
clarkbcorvus: I think you're far more familiar with that code than I am. Does that seem plausible or do you think something else could cause this behavior?15:35
clarkbcorvus: maybe we should add logging to https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L542-L543 ~here that logs the event id offset and the current event?15:41
clarkbthen we'd theoretically be able to trace if and when a skip happens as well as the potential None reset that got things moving again?15:41
corvuslooking15:48
corvusclarkb: you switched from event  ce57da56aa904fc6b3fcd7ee88a91522 to event afaeaf2203734532901dbfdee5a650ff above.. why?15:52
clarkbcorvus: ce57 is the problematic delayed event. afae arrives after ce57 but ends up getting processed. I was trying to illustrate that ce57 appears to be skipped in the event queue15:52
corvusgot it15:52
clarkbby showing that later events do get processed. Then it isn't until around the time zuul restarts that ce57 unsticks itself and proceeds15:53
clarkbanother idea is that maybe the listEvents here https://opendev.org/zuul/zuul/src/branch/master/zuul/zk/event_queues.py#L342 is able to return a newer queue event but not an older one if two events race each other depending on zookeeper state? However, I don't think that is how zookeeper works and we have a single event stream processor adding things to that queue so they should be16:01
clarkbsequential from a single writer and not from multiple writers16:01
clarkbwhich is why I've largely settled on the event id offset behavior for iter()16:01
corvushttps://paste.opendev.org/show/bXzpqIgUzSFAA4t6oeeS/16:07
corvusclarkb: that looks suspicious to me16:07
corvusi don't see any confirmation of forwarded events between the event arrival and that error, which makes me think that error is for that particular event16:07
clarkboh I was looking backward in the log and the trick was looking forward16:07
clarkbcorvus: ya and I guess if we can't process the event it would stay in the queue but the event id offset might jump ahead?16:08
corvusseems likely... then we reprocess it later.... i wonder if the event_id_offset has slightly nerfed our resiliency here...16:09
corvuslike, perhaps before we added that we might have retried this event indefinitely16:09
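To make the suspected interaction concrete, here is a toy model. This is not Zuul's actual code, just an illustration of the behavior described above: if forwarding an event raises but the cached event id offset has already advanced past it, the event sits in the queue unprocessed until the offset is reset, e.g. by a restart.

```python
# Toy model of the suspected failure mode -- illustrative only, not
# Zuul's implementation.  "queue" stands in for the ZooKeeper-backed
# connection event queue; event_id_offset mimics the cached offset.
queue = [
    {"id": 100, "ref": "refs/heads/stable/2025.1"},  # the stuck event (ce57...)
    {"id": 101, "ref": "refs/heads/stable/2025.1"},  # a later event (afae...)
]
event_id_offset = None  # reset to None on restart/fallback

def forward(event):
    # Pretend forwarding event 100 needs a ref listing from Gerrit and
    # that request fails, like the traceback in the paste above.
    if event["id"] == 100:
        raise IOError("GET .../info/refs failed")
    print("forwarded", event)

def iter_events():
    global event_id_offset
    for event in queue:
        if event_id_offset is not None and event["id"] <= event_id_offset:
            continue  # already "seen", never reprocessed
        event_id_offset = event["id"]  # advances before forwarding is known to succeed
        yield event

for ev in iter_events():
    try:
        forward(ev)
    except IOError as exc:
        print("forwarding failed, event left behind:", exc)

# Event 100 is still in "queue" but is never yielded again until
# event_id_offset goes back to None -- roughly what the scheduler
# restart appears to have done here.
```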
corvusbut let's split this into two concerns: 1) this error; 2) the event sticking in the queue and retry handling16:10
clarkbwhile it is possible there is a problem to solve on the gerrit side, I think it is somewhat expected for gerrit to protect itself when its connection limits are reached16:10
corvusfocusing on 1 first since that's the most opendev part of this16:10
clarkbwe do set limits on connections per ip and per user for example16:10
corvusyou think that's a connection limit?  maybe we should check the gerrit logs?  can you do that?  i want to check something else in zuul16:10
clarkbI think it is possible. And yes I'll look to see if I can find anything in gerrit indicating that was the case16:11
clarkbspecifically this occurred during a number of new ref creations which is also when we've seen this happen in the past. maybe when that happens all the ci systems rush to fight for connections or zuul alone generates enough connections all at once to trip over the user limit or something16:11
corvusthe thing i was checking in zuul: when we query changes, we retry 3 times.  but we don't do that for branch listings or getting the default branch.  so there is an opportunity to put this in a retry loop if we find that gerrit doesn't behave well here.16:15
corvusthere's a cap to the number of parallel zuul http connections.. probably something like 7 across both schedulers?  maybe a few more or less.  but if the only thing going on was the branch creation, then subtract 4 from that.16:17
clarkback I still haven't found any evidence of gerrit complaining about this specific request. In the error log at 18:41:09 I see the release user create the new branch then nothing complaining about the 18:41:19 request. I see requests from zuul in the httpd_log but do not see the request for anything for puppet-openstacklib there. I need to check apache logs, kernel logs and the16:18
clarkbsshd_logs still16:18
clarkbcorvus: do you know what the url in this request would roughly be?16:18
corvusyeah 1 sec16:19
clarkbGET /openstack/puppet-tempest/info/refs?service=git-upload-pack maybe?16:20
corvusshould be /openstack/puppet-openstacklib/info/refs?service=git-upload-pack16:20
corvusand yes, GET16:21
corvus(note you had a different project there)16:21
clarkbyup sorry that was an example I found16:21
corvusit should be authenticated and have a zuul user agent16:21
clarkbI think I've confirmed that no such request makes it into the gerrit httpd_log so it must've been rejected upstream16:21
corvusclarkb: looking at the traceback, it's in a "send" call, so we might entertain the idea that it never sent the request16:22
clarkbbut also I think I can ignore the ssh log as this should be done over http16:22
corvusyeah, maybe gerrit hung up while it was still in the socket queue or something16:22
clarkbya I'm going to see if apache has any clues16:22
clarkbcorvus: I'm unable to find any evidence in apache logs or syslog that the connection ever made it to gerrit16:26
corvusoh right, because of the reverse proxy, it would have been apache that would receive the GET first...16:27
clarkbour iptables rules that block when there are too many connections do also log and I'm trying to triple check that there really aren't any such logs in syslog16:27
corvus(quick aside -- i was looking at the gerrit server general error log; i didn't see anything related... but i did notice we have a lot of ssh connection errors from one specific ip)16:28
clarkbyes that IP is in IBM and I think it is an old jenkins that can't negotiate ssh because its mina is too old16:29
clarkbI'm hoping that when we switch servers their firewall rules will stop allowing them to connect and they will go away16:29
corvussounds like a plan16:29
clarkbbut ya I can find no evidence we blocked the connection from zuul or that apache ever received it16:30
clarkbmaybe something on the internet lost the SYN or ACK in the initial connection setup?16:30
clarkbwe also do reject-with icmp6-port-unreachable and the python exception indicates there was no response which doesn't seem to match up16:30
clarkbthe one thing that makes me suspicious of simply pinning this on the internet is this always seems to happen as a side effect of the openstack release team or starlingx creating a bunch of branches all at once16:34
clarkbbut maybe that rush of effort is what trips the networking to have a sad16:35
corvusyeah that bothers me too16:36
corvus2025-03-21 19:42:53,359 DEBUG zuul.GerritConnection: GET: https://review.opendev.org/a/projects/openstack%2Fopenstack-ansible-os_horizon/HEAD16:36
corvussimilar failure (but getting the default branch instead of the branch list)16:37
corvus2025-04-11 09:28:53,964 ERROR zuul.GerritConnection: Cannot get references from openstack/puppet-ironic16:37
corvusand that's the same error (getting the branches)16:37
corvusthose are the only other incidents in the logs16:38
corvusnot sure if you know off-hand if os_horizon was making branches then....16:38
clarkbhttps://review.opendev.org/q/project:openstack/releases this shows puppet-ironic wasn't making a new branch on 4/11 I think16:39
clarkbthere was a bunch of activity on 3/21 including https://review.opendev.org/c/openstack/releases/+/942201 and others16:40
corvusin the place where zuul does have a retry loop for change queries, we don't log if we use it.  so we don't have a good way to tell if this happens often for other types of queries.16:41
clarkb19:42 UTC would've been 12:42PM local and the timestamp on ^ is 11:24AM for merging so possibly overlapping in the post merge jobs16:41
corvusi think apache is configured with 150 workers16:42
corvusi'm not sure how many threads gerrit has16:42
clarkbcorvus: max threads is 150 (min is 20 and maxqueued is 200)16:42
clarkbnaively that seems like we should see apache reject connections before gerrit does16:44
clarkbbut I can't find apache logs saying it has done so. Maybe it doesn't log those rejections at the log level we have set?16:44
corvusapache's default backlog is 51116:45
corvusgiven how infrequent this is, maybe at this point we just need to say "shrug; internet" and try to defend16:47
corvusi think there's 3 changes to zuul we can make:16:48
corvus1) log any time we retry a request so we have better visibility16:48
corvus2) add retries to the branch list/head methods16:48
corvus3) do something with the failed events (whether it's retry, or discard)16:48
corvus#2 should fix the problem, the others are nice to have16:49
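A rough sketch of what item 2 could look like: wrap the ref listing in a small retry loop, logging each attempt so item 1 is covered as well. The function and session names are hypothetical, not Zuul's real API; the info/refs URL shape is the one discussed above, and the /a/ prefix for authenticated access is assumed from standard Gerrit conventions.

```python
# Hypothetical retry wrapper for the branch/ref listing (item 2 above);
# names are illustrative, not Zuul's actual code.
import time
import requests

def list_project_refs(session, base_url, project, attempts=3, delay=5):
    # e.g. https://review.opendev.org/a/openstack/puppet-openstacklib/info/refs?service=git-upload-pack
    url = f"{base_url}/a/{project}/info/refs?service=git-upload-pack"
    for attempt in range(1, attempts + 1):
        try:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as exc:
            # item 1 above: make these retries visible in the logs
            print(f"ref listing for {project} failed "
                  f"(attempt {attempt}/{attempts}): {exc}")
            if attempt == attempts:
                raise
            time.sleep(delay)
```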
corvusi suspect that the old behavior, before the threadpool executor, was to discard failed events, so that's probably the way to go.16:49
corvusi'll start working on those.16:50
clarkbcorvus: for 3) you're saying if after retrying we still fail then discard the event entirely?16:50
clarkbI suspect that is the case because in the past we've had to take manual intervention to unstick things16:51
corvusyep16:51
clarkbfrom a zuul perspective being more defensive seems like a good idea. There are many reasons these requests can fail. From an opendev perspective we should try and remember to follow up and check the new log info and decide if we should do further tuning of gerrit16:53
clarkbone other crazy thought: the zuul connections traverse via ipv6. I know frickler and maybe some others have had ipv6 issues with vexxhost network blocks. Possible this is a symptom of that? Though I think their issues were more all or nothing16:54
clarkbnot once in a blue moon problems16:54
corvusbtw, issue #3 is new since february.  so the actual problem (internet/apache/gerrit failed when we asked for a ref listing) may have been there for a long time, and since we had no retries, it just would have failed and never recovered until we did a tenant reconfig.  the move to the threadpool executor for event processing and #3 gave us the accidental recovery due to restarting, but i think that was completely unintentional.16:54
corvusi think that's all consistent with observed behavior16:55
clarkb++16:55
corvusand i agree, i think we should keep an eye on this and consider whether we need to add more logging or monitoring to apache, if we see that it happens more often16:56
corvus(i'm currently thinking that if we're missing one request every 2 weeks, it's going to be super hard to track it down; more frequent errors make that more tenable)16:56
clarkbanyone willing to review https://review.opendev.org/c/opendev/system-config/+/947044 ? I will need to pop out in about 45 minutes to run an errand then grab lunch. But if we're happy with that I'm happy to monitor it this afternoon17:24
fungilooking17:30
fungilgtm17:32
clarkbfungi: did you check the secrets vars?17:33
clarkbI think my biggest concern with this change is I'll do something silly and write the wrong keytype into the wrong file via mixed up values or deploy the 03 values as the prod values. Though I tried really hard to avoid those pitfalls17:34
clarkbbut assuming that continues to check out we can approve that early afternoon for me and ensure it applies the way we anticipate17:35
fungion bridge? they look fine, though i didn't compare them bit-for-bit with the copies on review0217:37
clarkbya on bridge. If they appear to match types thats probably sufficient17:37
fungithey do17:37
clarkbed25519 doesn't have ecdsa value etc17:37
clarkbCool17:37
fungion the pubkeys specifically i guess you mean17:38
fungisince that's not apparent in the privkey blobs17:38
clarkbya and I tried to use a process where I applied the privkey then appended .pub to the source when doing the pubkey to ensure everything stayed aligned17:39
clarkbso if the pubkey looks good the privkey should be good too17:39
fungicool, yes all looks like it matches17:39
clarkband the ecdsa values all have increasing bit lengths17:39
clarkbso we can kind of infer there that they are likely to be correct too17:39
fungiyep17:40
fungii'm popping out for a bit, shouldn't be an hour19:02
opendevreviewVladimir Kozhukalov proposed openstack/project-config master: Add official-openstack-repo-jobs to openstack-helm-infra  https://review.opendev.org/c/openstack/project-config/+/94753419:42
Clark[m]I'm almost back. I'll plan to approve the ssh host key change once back in front of a computer 20:01
fungiokay, back as well20:03
clarkbapproved20:16
fungigreat timing20:16
clarkbout of an abundance of caution I think I've decided I should just make a backup copy of each of the four keys on 0220:17
fungifair enough, just in case the values on bridge are wrong in some way20:17
clarkbya20:19
opendevreviewMerged openstack/project-config master: Add official-openstack-repo-jobs to openstack-helm-infra  https://review.opendev.org/c/openstack/project-config/+/94753420:20
clarkbonce it applies we can check the values and then I'm 99% sure that review03's gerrit needs to be restarted to pick up the new keys. I'll do that and run some ssh keyscans to ensure everything matches20:20
clarkbI only backed up the privkeys as I believe the pubkeys can be derived from the privkeys20:20
clarkb(note we also have backups of this stuff but having local copies makes comparing and double checking everything simpler)20:21
mnaseri dunno if this is some sort of red herring however20:38
mnaserhttps://opendev.org/openstack/nova/compare/724d92d...1aeeb96 that seems to give me a 404, but we are seeing it _sometimes_20:38
mnaserfor example, this works fine https://opendev.org/openstack/nova/compare/12cd287...3dd97ca20:38
mnaserwe use this with renovate to give us diffs - https://github.com/vexxhost/atmosphere/pull/2476 -- but i changed it now to include full digest instead of short one, but still.. odd20:39
clarkbmnaser: the full sha works?20:39
mnaseryeah i just updated it and so it updated the pr20:39
mnaserhttps://opendev.org/openstack/nova/compare/724d92dd382bf378f03487496c93bfba23fbc5f9...1aeeb96ffa646f4b4ebd2af8336e9f6eba4e974a20:40
mnaserworks fine20:40
fungii guess it could be that one of those refs is missing from one of the gitea backends?20:40
clarkbfungi: if that were the case the full sha shouldn't work either20:41
clarkbhttps://opendev.org/openstack/nova/compare/724d92dd38...1aeeb96ffa also seems to work20:41
clarkbI wonder if it is a collision with the short ref20:41
mnaserthat's kinda what i was leaning to20:41
mnaseri can live happily with the long commit but i was just calling it out in case well, it was something else weird going on20:41
clarkbhttps://opendev.org/openstack/nova/compare/724d92dd...1aeeb96 also works20:41
clarkbbut remove that last d and it doesn't so my hunch is that prefix isn't long enough to be unique20:42
clarkbhttps://opendev.org/openstack/nova/compare/724d92dd...1aeeb9 also works but remove the 9 at the very end and it doesn't20:42
fungiinterestingly, https://github.com/openstack/nova/commit/724d92d and  https://github.com/openstack/nova/commit/1aeeb96 both work though20:43
clarkbfungi: ya even 1aeeb9 works on gitea20:44
mnaserand https://github.com/openstack/nova/compare/724d92d...1aeeb96 i guess20:44
clarkbits possible that the colliding object (if that is the case) is in gitea but not github. Think gerrit refs (I don't think those are replicated to github)20:45
mnaserah right yes20:45
fungioh, right, for some reason i copied github urls from somewhere? and didn't even notice20:46
fungii meant to test https://opendev.org/openstack/nova/commit/724d92d which does indeed 40420:47
clarkbcool I think this is consistent with the prefix simply being too short and nova being large with a long history is probably one of the repos most likely to have this problem20:48
fungiwhile https://opendev.org/openstack/nova/commit/1aeeb96 is fine20:48
clarkbto confirm this we'd probably have to do an object listing on gitea or something20:48
fungihttps://opendev.org/openstack/nova/commit/724d92dd works but all other 15 possibilities for the last hex digit are 404, so it's not a commit at least20:49
clarkbya my hunch is refs/changes/xyz/yz/meta or whatever the path is20:50
fungioh, unless one of those 15 also has a collision in the next position20:50
clarkbwhich isn't a commit object20:50
clarkbI don't think20:51
clarkbI guess we could find the hash of a change meta object and see if gitea's commit/hash path would render it20:51
fungihttps://paste.opendev.org/show/b4BdBJyGnyD0e5iTkgHT/20:56
fungi724d92d8 is a tree object20:56
clarkbcool so that confirms it20:57
fungimnaser: ^ definitely a collision between truncated object hashes20:57
clarkbits possible the gitea service log would also show an error that leads to the 404 (I would definitely expect it if it 500'd)20:57
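A quick way to reproduce the finding locally in a clone of openstack/nova, using standard git plumbing: `git rev-parse --disambiguate` lists every object sharing the short prefix and `git cat-file -t` shows each object's type.

```python
# Reproduce the ambiguity locally: list every object in a nova clone
# whose hash starts with the short prefix from the 404ing compare URL.
import subprocess

PREFIX = "724d92d"

def git(*args):
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

for sha in git("rev-parse", "--disambiguate=" + PREFIX).split():
    objtype = git("cat-file", "-t", sha).strip()
    print(sha, objtype)

# More than one match (here a commit and a tree) means the 7-character
# prefix is not unique, which lines up with the short-sha compare URL
# 404ing while the full hashes work.
```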
mnaserWow. Its my lucky day then :)21:14
mnaserHaving said that, 404 is an unpleasant response to get as a user 21:15
clarkbI think it is accurate but under-informative21:15
clarkb"404 - non unique identifier" instead of "404 - Not Found" would probably help a lot21:16
JayFit's interesting there's not an http status code indicating a valid but ambiguous request21:17
JayFhonestly the ideal case is wikipedia disambiguation style pages for the human case, at least21:18
clarkb409 conflict might work (I think it's usually there to tell you that you can't update a resource when someone else is updating it though)21:20
JayFoh, good call21:20
JayFI wonder how  a browser handles a 409 with a response body21:20
JayFHilarious that the Ironic dev forgets about the 409, our API throws 409s like nobody's business (when a node is locked)21:20
clarkbmy interpretation is that 404 is correct here21:21
clarkbif we weren't working with git and had both /foobar and /foobaz then /foo would 40421:21
JayFyeah, but the application can understand and throw a proper status code; that's the case for many returned status codes21:22
clarkbfungi: the gerrit ssh host key change should merge in the next few minutes (its getting through the testinfra test cases now)21:28
opendevreviewMerged opendev/system-config master: Manage gerrit's ecdsa and ed25519 hostkeys  https://review.opendev.org/c/opendev/system-config/+/94704421:31
clarkbhere we go21:32
clarkbbuild was successful21:35
clarkbtimestamps on review02 are still from 2020 which is good we wanted noops there21:36
clarkbtimestamps on review03 are from just about now.21:36
clarkbssh-keyscan shows review03 has not changed its host keys so I do need to restart the service. Doing that next21:36
clarkbdocker compose down is not moving quickly on review03. Which is weird considering it isn't really doing anything21:38
clarkboh we set the stop signal to sighup21:39
clarkbhowever I wonder if this doesn't work due to the podman signal problem on noble and that signal was never received21:40
clarkbI'm going to manually issue a sighup21:40
clarkbthat was exactly it21:40
clarkbI'm going to have to think about this. I don't think it is a blocker problem but an annoying annoyance21:41
clarkbinfra-root ^ fyi21:41
clarkbssh-keyscan lgtm though it only fetches the ecdsa, ed25519 and rsa keys. ecdsa 384 and ecdsa 521 are not types you can feed into ssh-keyscan21:43
clarkbbut the output appears to match review02 now so I'm pretty happy with that. Not happy about the sighup thing. But I think we can do something like sudo docker compose down && sudo kill -HUP $gerrit_pid for now21:44
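A small sketch of the keyscan comparison being described, diffing Gerrit's SSH host keys between the old and new servers by key type. The full hostnames here are assumptions (the log only refers to review02/review03), 29418 is Gerrit's SSH port as seen earlier in the log, and as noted above ssh-keyscan only returns the rsa/ecdsa/ed25519 variants, so the 384/521-bit ecdsa keys still need a manual check.

```python
# Compare Gerrit SSH host keys between the old and new review servers.
# Hostnames are assumed; adjust to the real FQDNs.
import subprocess

HOSTS = ["review02.opendev.org", "review03.opendev.org"]
PORT = "29418"  # Gerrit's SSH port

def scan(host):
    out = subprocess.run(["ssh-keyscan", "-p", PORT, host],
                         capture_output=True, text=True, check=True).stdout
    keys = {}
    for line in out.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        _, keytype, blob = line.split()[:3]
        keys[keytype] = blob
    return keys

old, new = (scan(h) for h in HOSTS)
for keytype in sorted(set(old) | set(new)):
    same = old.get(keytype) == new.get(keytype)
    print(f"{keytype}: {'match' if same else 'DIFFERENT'}")
```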
clarkbthe next thing I want to test is a reboot of the server21:46
clarkbwe did some testing of this back in january with different podman restart options and my expectation is that the server reboots and does not start gerrit again21:47
clarkbwhich I think is preferable but I want to confirm that behavior21:47
clarkbhowever, that will kill our running screen21:47
clarkbso I'll hold off on doing that until maybe tomorrow in case anyone wants to look at that context first (its a good dry run of what the outage will look like)21:47
fungihuh, okay so hangup was fine under docker but not podman?21:56
clarkbfungi: yes this is a podman apparmor issue on noble. We first ran into it with haproxy as we use sighup there to do graceful config reloads21:57
clarkbspecifically for sighup21:57
clarkbit is possible that one of the allowed signals is a signal that gerrit accepts as a graceful shutdown command we can switch to that. I seem to recall that the jvm actually merges a whole bunch of signals into one action by default21:58
clarkbbut the apparmor rules for podman containers only allow like sigkill and sigint or something like that through21:58
clarkbhttps://bugs.launchpad.net/ubuntu/+source/libpod/+bug/208966421:59
clarkbint, quit, kill, term are allowed now22:00
clarkblet me try and cross check with gerrit now22:00
clarkbfungi: https://docs.oracle.com/en/java/javase/17/troubleshoot/handle-signals-and-exceptions.html#GUID-A7D91931-EA03-4BA0-B58B-53A67F9CBD21 reading that I think sigterm, sigint, and sighup are all equivalent and will trigger the jvm's shutdown handler which gerrit hooks into in daemon.java22:05
clarkbI think that means we can convert from using sighup to sigint instead22:06
fungiah, neat, so we can use any of those that aren't being filtered22:06
clarkbI'm going to get on discord and ask about this because this is my not-really-a-java-dev understanding and there are real java devs on gerrit discord22:07
funginot a bad idea22:07
clarkbbut ya we may be able to work around this entirely with a simple docker-compose.yaml update to use sigint instead22:07
clarkband until then the workaround I did use isn't terrible (just use kill as root to send the signal we want)22:07
clarkbhttps://zuul.opendev.org/t/openstack/build/1f8f38666d864d15a5d9ba7a1bc7cf15/log/job-output.txt#21121-21123 you can see the problem here in the upgrade job because we stop gerrit as part of the upgrade process. There is a 5 minute gap which is the timeout where it falls back to using sigkill?22:24
clarkbI'm going to push up a WIP change that uses sigint and we'll see if that time delta goes down at least22:24
opendevreviewClark Boylan proposed opendev/system-config master: Use sigint instead of sighup to stop gerrit  https://review.opendev.org/c/opendev/system-config/+/94754022:29
clarkbhttps://zuul.opendev.org/t/openstack/build/0b88866c3f52495985c32c08a0a048c2/log/job-output.txt#21119-21121 that looks much better if this is a valid way to gracefully stop gerrit23:15
