mnaser | infra-root: i think github replication is brooooken :< | 00:46 |
mnaser | https://github.com/openstack/openstack-helm-infra vs https://opendev.org/openstack/openstack-helm-infra | 00:46 |
mnaser | seems specifically that repo didn't get replicated | 00:46 |
mnaser | https://github.com/openstack/openstack-helm seems fine | 00:46 |
tonyb | mnaser: can you confirm that openstack-helm-infra is using the same project templates as openstack-helm? | 00:51 |
mnaser | i'll have to check project-config, the "repo" did get retired, let's see | 00:51 |
tonyb | I recall helping someone diagnose that a while back but didn't follow up | 00:51 |
mnaser | the last commit that worked renamed things to infra_*.yaml in zuul.d so no jobs ran anymore https://opendev.org/openstack/openstack-helm-infra/commit/a96a6b955e478271f323620ef8b533c2e1e2a82e | 00:54 |
Clark[m] | So they just retired things in the wrong order | 00:55 |
tonyb | Ah. | 00:57 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: watcher: Remove notification of puppet-watcher https://review.opendev.org/c/openstack/project-config/+/947466 | 12:06 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: zaqar: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947467 | 12:08 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: keystone: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947468 | 12:10 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: keystone: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947468 | 12:11 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: zaqar: Fix missing notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947467 | 12:12 |
hashar | fungi: Clark[m]: hello, I have a change for git-review to enable coloring for messages received from Gerrit. I had it for a while and it seems to work for me. If one of you could have a look at it: https://review.opendev.org/c/opendev/git-review/+/914000 | 12:13 |
hashar | screenshots of before/after can be seen on our task at https://phabricator.wikimedia.org/T359981 :-] | 12:13 |
opendevreview | Takashi Kajinami proposed openstack/project-config master: storlets: Add gerritbot notification about stable branches https://review.opendev.org/c/openstack/project-config/+/947469 | 12:14 |
tkajinam | hmm... I see the CI jobs are not starting for https://review.opendev.org/c/openstack/puppet-openstacklib/+/947389 even after a manual recheck, but I've not yet identified anything suspicious in the repo. | 12:40 |
tkajinam | I've confirmed the jobs are started for the 2025.1 branch of the other puppet repos | 12:40 |
tkajinam | ok I see the jobs are queued but then immediately disappear in https://zuul.opendev.org/t/openstack/status ... I wonder if I can check any errors to identify the cause? | 12:43 |
tkajinam | one thing I noticed is that publish-openstack-puppet-branch-tarball job has been failing due to post failure but idk if that's related | 12:46 |
tkajinam | https://zuul.opendev.org/t/openstack/build/ec2e4915082845568f00f2235c5fc652 | 12:46 |
frickler | tkajinam: let me check zuul logs | 12:50 |
tkajinam | frickler, thx ! | 12:51 |
Clark[m] | I think this is the issue of new branch not being registered with zuul that requires a tenant config reload. | 12:56 |
Clark[m] | When was the branch created? Hopefully recently enough we still have logs | 12:56 |
tkajinam | these were created a few hours ago | 12:57 |
tkajinam | https://review.opendev.org/c/openstack/releases/+/946899 | 12:57 |
frickler | yes, I just found that the branch is missing on https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/puppet-openstacklib | 12:57 |
tkajinam | ah | 12:57 |
frickler | I'll leave it to others to check zuul logs for that | 12:58 |
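A sketch of the manual workaround Clark[m] refers to above (forcing Zuul to rebuild the tenant layout so a newly created branch is picked up); the compose service name and working directory are assumptions, and the command names follow Zuul's operator docs:

```shell
# Ask the scheduler to rebuild just the openstack tenant's layout; a
# full-reconfigure also works but rereads every tenant and is much heavier.
sudo docker compose exec scheduler zuul-scheduler tenant-reconfigure openstack
```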
clarkb | I was ssh'ing in to see if I could quickly find the relevant logs and coincidentally the server rebooted for the upgrade that corvus initiated yesterday | 13:51 |
clarkb | took me a second to figure out what happened there | 13:51 |
corvus | what's the missing branch? | 13:53 |
corvus | is it stable/2025.1? that's there in the web ui now. | 13:54 |
corvus | one of the bugfixes that i'm rebooting to pick up could cause the web ui to show old data, but not the schedulers. | 13:55 |
clarkb | corvus: yes stable/2025.1 | 13:58 |
clarkb | corvus: are you saying you think the lack of the branch in the web ui could just be a red herring and the schedulers should know about the branch? | 13:58 |
clarkb | corvus: fwiw I think comparing to puppet-tempest I see it forwarding an event for ref-updated belonging to stable/2025.1 but I haven't found that for puppet-openstacklib yet | 13:59 |
corvus | clarkb: yes, that's what i mean about the web ui. could still be a problem in the scheduler of course. just don't use the web ui as confirmation for issues seen prior to now. | 14:00 |
tkajinam | yeah stable/2025.1 is the one with the problem | 14:03 |
tkajinam | I triggered recheck and now I see the job is queued | 14:03 |
tkajinam | https://review.opendev.org/c/openstack/puppet-openstacklib/+/947389 | 14:03 |
tkajinam | https://zuul.opendev.org/t/openstack/status?change=947389 | 14:04 |
slittle | To ssh://review.opendev.org:29418/starlingx/openstack-armada-app.git | 14:06 |
slittle | ! [remote rejected] f/portable-dc -> f/portable-dc (prohibited by Gerrit: update for creating new commit object not permitted) | 14:06 |
slittle | error: failed to push some refs to 'ssh://review.opendev.org:29418/starlingx/openstack-armada-app.git' | 14:06 |
frickler | slittle: did you try to push commits to that branch without review? | 14:13 |
frickler | slittle: also allow me to take the opportunity to remind you of https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 | 14:14 |
fungi | slittle: i guess you were trying to push a merge commit? are there any commit ids in your local f/portable-dc branch history which don't exist in gerrit? | 14:16 |
slittle | it is a new branch | 14:16 |
slittle | The script that creates the branches and updates the .gitreview across all the starlingx gits is unchanged, and worked 2 weeks ago. | 14:18 |
slittle | has there been any changes on the gerrit server side ? | 14:18 |
fungi | when did you run it? moments ago? earlier today? | 14:18 |
slittle | failed run ~20 min ago | 14:19 |
fungi | i think your branch creation command must have failed, because the branch doesn't seem to exist in gerrit yet: https://review.opendev.org/admin/repos/starlingx/openstack-armada-app,branches | 14:19 |
fungi | what method is the script using to create that branch? rest api call? | 14:20 |
slittle | git push ${review_remote} ${tag} | 14:22 |
slittle | ssh -p ${port} ${host} gerrit create-branch ${path} ${branch} ${tag} | 14:22 |
slittle | git config --local --replace-all "branch.${branch}.merge" refs/heads/${branch} | 14:22 |
slittle | git review --topic="${branch/\//.}" | 14:22 |
slittle | it's failing on the tag | 14:23 |
slittle | git push gerrit vf/portable-dc | 14:23 |
slittle | remote: error: branch refs/tags/vf/portable-dc: | 14:24 |
slittle | remote: use a SHA1 visible to you, or get update permission on the ref | 14:24 |
slittle | remote: User: slittle1 | 14:24 |
slittle | remote: Contact an administrator to fix the permissions | 14:24 |
fungi | aha, so it's not getting as far as the `gerrit create-branch` command i guess? | 14:24 |
fungi | which tag did you try to push? | 14:25 |
fungi | is it listed at https://review.opendev.org/admin/repos/starlingx/openstack-armada-app,tags ? | 14:25 |
slittle | vf/portable-dc | 14:25 |
fungi | yeah, i don't see the tag in gerrit even | 14:26 |
slittle | no, it is not listed | 14:26 |
fungi | is it possible you have a local path or branch with the same name as the tag, and your `git push` is trying to push a branch of the same name as the tag? | 14:26 |
slittle | git branch --all | grep port | 14:27 |
slittle | * f/portable-dc | 14:27 |
slittle | f/portable-dc is the branch, the tag gets a 'v' prefix | 14:27 |
frickler | is the sha1 that you tagged present on gerrit? | 14:29 |
fungi | okay, are all commits in the local history of your vf/portable-dc tag represented by changes in gerrit with identical commit ids? | 14:29 |
slittle | good question | 14:30 |
fungi | yes, as frickler is implying, the error message you pasted is consistent with you having tagged a branch state that contains commits that aren't in gerrit (maybe there are similar commits in gerrit but with different commit ids, for example) | 14:30 |
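A minimal sketch of the check fungi and frickler are describing, run in the local clone; it assumes the Gerrit remote is named gerrit (as in the script above) and that its remote-tracking refs are current:

```shell
# List commits reachable from the local tag that Gerrit has never seen;
# any output here explains the "use a SHA1 visible to you" rejection.
git fetch gerrit
git rev-list vf/portable-dc --not --remotes=gerrit
```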
slittle | yah, that was the problem | 14:31 |
fungi | cool, glad it was that simple | 14:31 |
slittle | thanks | 14:33 |
fungi | you're welcome! | 14:33 |
fungi | the wiki isn't sticking logins again, rebooting the server now (uptime 29 days) | 14:56 |
fungi | still hasn't come all the way up, probably decided to fsck the giant rootfs | 15:03 |
fungi | if it goes much longer i'll try to check the oob console | 15:04 |
clarkb | stepped away for a bit but I'm going to keep digging into zuul logs around that stable/2025.1 branch creation problem | 15:08 |
clarkb | if I could get reviews on https://review.opendev.org/c/opendev/system-config/+/947044 that would be really helpful to ensure we can land that and sync up ssh host keys for gerrit prior to the server swap monday | 15:09 |
fungi | wiki finally came back up | 15:09 |
clarkb | I can also manually copy the keys if it comes to that but having config management deal with it would be great | 15:09 |
clarkb | when reviewing, checking the secret vars content would be good to ensure I didn't mix up any values | 15:09 |
fungi | #status log Rebooted wiki.openstack.org to get OpenId logins working again | 15:09 |
opendevstatus | fungi: finished logging | 15:10 |
fungi | openid session is working for me again | 15:11 |
clarkb | on zuul01 we get this event logged 2025-04-16 08:41:09,099 DEBUG zuul.ConnectionEventQueue: [e: ce57da56aa904fc6b3fcd7ee88a91522] Submitting connection event to queue which maps to 'refName': 'refs/heads/stable/2025.1' getting created | 15:18 |
clarkb | then the next thing that happens with that event id is 2025-04-16 13:44:07,356 DEBUG zuul.GerritConnection: [e: ce57da56aa904fc6b3fcd7ee88a91522] Scheduling event from gerrit on zuul02 | 15:20 |
clarkb | corvus: ^ it almost looks like the trigger event for new branches is getting stuck in queue processing somehow? | 15:20 |
clarkb | I think we can rule out gerrit failing to emit the event on the event stream. We do seem to receive it around when I would expect to | 15:21 |
clarkb | it looks like the zuul restart happens on zuul01 at 13:45 and on zuul02 at 13:51. Something about shutting down zuul01 unsticks the queues so that zuul02 can process it maybe? | 15:23 |
clarkb | the puppet-tempest event is here: 2025-04-16 08:42:28,367 DEBUG zuul.ConnectionEventQueue: [e: afaeaf2203734532901dbfdee5a650ff] Submitting connection event to queue | 15:27 |
clarkb | that is on zuul01 | 15:27 |
clarkb | then on zuul02 we process that event: 2025-04-16 08:43:00,805 DEBUG zuul.Scheduler: [e: afaeaf2203734532901dbfdee5a650ff] Forwarding trigger event almost immediately | 15:28 |
clarkb | importantly this event comes after the puppet-openstacklib event | 15:28 |
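For reference, the kind of cross-scheduler trace being described can be pulled together roughly like this; the debug log path is an assumption for illustration:

```shell
# Follow a single Zuul event id across both schedulers' debug logs.
for host in zuul01 zuul02; do
  echo "=== $host ==="
  ssh "$host" grep ce57da56aa904fc6b3fcd7ee88a91522 /var/log/zuul/debug.log
done
```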
clarkb | these events go into a zookeeper queue that is supposed to be sequenced by zookeeper and processed in a fifo manner | 15:28 |
clarkb | my suspicion is that something with the zookeeper queue implementation isn't doing what we expect and we end up out of order or skipping events somehow | 15:29 |
clarkb | maybe a race between reading the next event and writing events? | 15:29 |
clarkb | oh ya one sec | 15:29 |
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/zk/event_queues.py#L338-L344 the comment there may be a clue | 15:30 |
fungi | wiki load average is hovering around 10-15 after reboot, probably llm training crawlers again | 15:31 |
clarkb | corvus: looking at iterEvents I suspect there is some race where we're able to set event id offset to be too large ~here https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L526-L544 and when things restarted we ended up setting that value to None in the fallback for some reason which unstuck us | 15:34 |
clarkb | corvus: I think you're far more familiar with that code than I am. Does that seem plausible or do you think something else could cause this behavior? | 15:35 |
clarkb | corvus: maybe we should add logging to https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L542-L543 ~here that logs the event id offset and the current event? | 15:41 |
clarkb | then we'd theoretically be able to trace if and when a skip happens as well as the potential None reset that got things moving again? | 15:41 |
corvus | looking | 15:48 |
corvus | clarkb: you switched from event ce57da56aa904fc6b3fcd7ee88a91522 to event afaeaf2203734532901dbfdee5a650ff above.. why? | 15:52 |
clarkb | corvus: ce57 is the problematic delayed event. afae arrives after ce57 but ends up getting processed. I was trying to illustrate that ce57 appears to be skipped in the event queue | 15:52 |
corvus | got it | 15:52 |
clarkb | by showing that later events do get processed. Then it isn't until around the time zuul restarts that ce57 unsticks itself and proceeds | 15:53 |
clarkb | another idea is that maybe the listEvents here https://opendev.org/zuul/zuul/src/branch/master/zuul/zk/event_queues.py#L342 is able to return a newer queue event but not an older one if two events race each other depending on zookeeper state? However, I don't think that is how zookeeper works and we have a single event stream processor adding things to that queue so they should be | 16:01 |
clarkb | sequential from a single writer and not from multiple writers | 16:01 |
clarkb | which is why I've largely settled on the event id offset behavior for iter() | 16:01 |
corvus | https://paste.opendev.org/show/bXzpqIgUzSFAA4t6oeeS/ | 16:07 |
corvus | clarkb: that looks suspicious to me | 16:07 |
corvus | i don't see any confirmation of forwarded events between the event arrival and that error, which makes me think that error is for that particular event | 16:07 |
clarkb | oh I was looking backward in the log and the trick was looking forward | 16:07 |
clarkb | corvus: ya and I guess if we can't process the event it would stay in the queue but the event id offset might jump ahead? | 16:08 |
corvus | seems likely... then we reprocess it later.... i wonder if the event_id_offset has slightly nerfed our resiliency here... | 16:09 |
corvus | like, perhaps before we added that we might have retried this event indefinitely | 16:09 |
corvus | but let's split this into two concerns: 1) this error; 2) the event sticking in the queue and retry handling | 16:10 |
clarkb | while it is possible there is a problem to solve on the gerrit side, I think it is somewhat expected for gerrit to protect itself when its connection limits are reached | 16:10 |
corvus | focusing on 1 first since that's the most opendev part of this | 16:10 |
clarkb | we do set limits on connections per ip and per user for example | 16:10 |
corvus | you think that's a connection limit? maybe we should check the gerrit logs? can you do that? i want to check something else in zuul | 16:10 |
clarkb | I think it is possible. And yes I'll look to see if I can find anything in gerrit indicating that was the case | 16:11 |
clarkb | specifically this occurred during a number of new ref creations which is also when we've seen this happen in the past. maybe when that happens all the ci systems rush to fight for connections or zuul alone generates enough connections all at once to trip over the user limit or something | 16:11 |
corvus | the thing i was checking in zuul: when we query changes, we retry 3 times. but we don't do that for branch listings or getting the default branch. so there is an opportunity to put this in a retry loop if we find that gerrit doesn't behave well here. | 16:15 |
corvus | there's a cap to the number of parallel zuul http connections.. probably something like 7 across both schedulers? maybe a few more or less. but if the only thing going on was the branch creation, then subtract 4 from that. | 16:17 |
clarkb | ack I still haven't found any evidence of gerrit complaining about this specific request. In the error log at 18:41:09 I see the release user create the new branch then nothing complaining about the 18:41:19 request. I see requests from zuul in the httpd_log but do not see the request for anything for puppet-openstacklib there. I need to check apache logs, kernel logs and the | 16:18 |
clarkb | sshd_logs still | 16:18 |
clarkb | corvus: do you know what the url in this request would roughly be? | 16:18 |
corvus | yeah 1 sec | 16:19 |
clarkb | GET /openstack/puppet-tempest/info/refs?service=git-upload-pack maybe? | 16:20 |
corvus | should be /openstack/puppet-openstacklib/info/refs?service=git-upload-pack | 16:20 |
corvus | and yes, GET | 16:21 |
corvus | (note you had a different project there) | 16:21 |
clarkb | yup sorry that was an example I found | 16:21 |
corvus | it should be authenticated and have a zuul user agent | 16:21 |
clarkb | I think I've confirmed that no such request makes it into the gerrit httpd_log so it must've been rejected upstream | 16:21 |
corvus | clarkb: looking at the traceback, it's in a "send" call, so we might entertain the idea that it never sent the request | 16:22 |
clarkb | but also I think I can ignore the ssh log as this should be done over http | 16:22 |
corvus | yeah, maybe gerrit hung up while it was still in the socket queue or something | 16:22 |
clarkb | ya I'm going to see if apache has any clues | 16:22 |
clarkb | corvus: I'm unable to find any evidence in apache logs or syslog that the connection ever made it to gerrit | 16:26 |
corvus | oh right, because of the reverse proxy, it would have been apache that would receive the GET first... | 16:27 |
clarkb | our iptables rules that block when there are too many connections do also log and I'm trying to triple check that there really aren't any such logs in syslog | 16:27 |
corvus | (quick aside -- i was looking at the gerrit server general error log; i didn't see anything related... but i did notice we have a lot of ssh connection errors from one specific ip) | 16:28 |
clarkb | yes that IP is in IBM and I think it is an old jenkins that can't negotiate ssh because its mina is too old | 16:29 |
clarkb | I'm hoping that when we switch servers their firewall rules will stop allowing them to connect and they will go away | 16:29 |
corvus | sounds like a plan | 16:29 |
clarkb | but ya I can find no evidence we blocked the connection from zuul or that apache ever received it | 16:30 |
clarkb | maybe something on the internet lost the SYN or ACK in the initial connection setup? | 16:30 |
clarkb | we also do reject-with icmp6-port-unreachable and the python exception indicates there was no response, which doesn't seem to match up | 16:30 |
clarkb | the one thing that makes me suspicious of simply pinning this on the internet is this always seems to happen as a side effect of the openstack release team or starlingx creating a bunch of branches all at once | 16:34 |
clarkb | but maybe that rush of effort is what trips the networking to have a sad | 16:35 |
corvus | yeah that bothers me too | 16:36 |
corvus | 2025-03-21 19:42:53,359 DEBUG zuul.GerritConnection: GET: https://review.opendev.org/a/projects/openstack%2Fopenstack-ansible-os_horizon/HEAD | 16:36 |
corvus | similar failure (but getting the default branch instead of the branch list) | 16:37 |
corvus | 2025-04-11 09:28:53,964 ERROR zuul.GerritConnection: Cannot get references from openstack/puppet-ironic | 16:37 |
corvus | and that's the same error (getting the branches) | 16:37 |
corvus | those are the only other incidents in the logs | 16:38 |
corvus | not sure if you know off-hand if os_horizon was making branches then.... | 16:38 |
clarkb | https://review.opendev.org/q/project:openstack/releases this shows puppet-ironic wasn't making a new branch on 4/11 I think | 16:39 |
clarkb | there was a bunch of activity on 3/21 including https://review.opendev.org/c/openstack/releases/+/942201 and others | 16:40 |
corvus | in the place where zuul does have a retry loop for change queries, we don't log if we use it. so we don't have a good way to tell if this happens often for other types of queries. | 16:41 |
clarkb | 19:42 UTC would've been 12:42PM local and the timestamp on ^ is 11:24AM for merging so possibly overlapping in the post merge jobs | 16:41 |
corvus | i think apache is configured with 150 workers | 16:42 |
corvus | i'm not sure how many threads gerrit has | 16:42 |
clarkb | corvus: max threads is 150 (min is 20 and maxqueued is 200) | 16:42 |
clarkb | naively that seems like we should see apache reject connections before gerrit does | 16:44 |
clarkb | but I can't find apache logs saying it has done so. Maybe it doesn't log those rejections at the log level we have set? | 16:44 |
corvus | apache's default backlog is 511 | 16:45 |
corvus | given how infrequent this is, maybe at this point we just need to say "shrug; internet" and try to defend | 16:47 |
corvus | i think there's 3 changes to zuul we can make: | 16:48 |
corvus | 1) log any time we retry a request so we have better visibility | 16:48 |
corvus | 2) add retries to the branch list/head methods | 16:48 |
corvus | 3) do something with the failed events (whether it's retry, or discard) | 16:48 |
corvus | #2 should fix the problem, the others are nice to have | 16:49 |
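In shell terms, the retry idea in item 2 looks roughly like the sketch below; the URL comes from the discussion above, while the attempt count and backoff are assumptions (the real fix belongs in Zuul's Gerrit driver, and the real request is authenticated with a Zuul user agent):

```shell
# Retry the Gerrit ref listing a few times with a short backoff before
# giving up, rather than failing the event on the first network hiccup.
url='https://review.opendev.org/openstack/puppet-openstacklib/info/refs?service=git-upload-pack'
for attempt in 1 2 3; do
  curl -sf -o /dev/null "$url" && break
  echo "attempt $attempt failed, retrying" >&2
  sleep $((attempt * 2))
done
```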
corvus | i suspect that the old behavior, before the threadpool executor, was to discard failed events, so that's probably the way to go. | 16:49 |
corvus | i'll start working on those. | 16:50 |
clarkb | corvus: for 3) you're saying if after retrying we still fail then discard the event entirely? | 16:50 |
clarkb | I suspect that is the case because in the past we've had to take manual intervention to unstick things | 16:51 |
corvus | yep | 16:51 |
clarkb | from a zuul perspective being more defensive seems like a good idea. There are many reasons these requests can fail. From an opendev perspective we should try and remember to follow up and check the new log info and decide if we should do further tuning of gerrit | 16:53 |
clarkb | one other crazy thought: the zuul connections traverse via ipv6. I know frickler and maybe some others have had ipv6 issues with vexxhost network blocks. Possible this is a symptom of that? Though I think their issues were more all or nothing | 16:54 |
clarkb | not once in a blue moon problems | 16:54 |
corvus | btw, issue #3 is new since february. so the actual problem (internet/apache/gerrit failed when we asked for a ref listing) may have been there for a long time, and since we had no retries, it just would have failed and never recovered until we did a tenant reconfig. the move to the threadpool executor for event processing and #3 gave us the accidental recovery due to restarting, but i think that was completely unintentional. | 16:54 |
corvus | i think that's all consistent with observed behavior | 16:55 |
clarkb | ++ | 16:55 |
corvus | and i agree, i think we should keep an eye on this and consider whether we need to add more logging or monitoring to apache, if we see that it happens more often | 16:56 |
corvus | (i'm currently thinking that if we're missing one request every 2 weeks, it's going to be super hard to track it down; more frequent errors make that more tenable) | 16:56 |
clarkb | anyone willing to review https://review.opendev.org/c/opendev/system-config/+/947044 ? I will need to pop out in about 45 minutes to run an errand then grab lunch. But if we're happy with that I'm happy to monitor it this afternoon | 17:24 |
fungi | looking | 17:30 |
fungi | lgtm | 17:32 |
clarkb | fungi: did you check the secrets vars? | 17:33 |
clarkb | I think my biggest concern with this change is I'll do something silly and write the wrong keytype into the wrong file via mixed up values or deploy the 03 values as the prod values. Though I tried really hard to avoid those pitfalls | 17:34 |
clarkb | but assuming that continues to check out we can approve that early afternoon for me and ensure it applies the way we anticipate | 17:35 |
fungi | on bridge? they look fine, though i didn't compare them bit-for-bit with the copies on review02 | 17:37 |
clarkb | ya on bridge. If they appear to match types that's probably sufficient | 17:37 |
fungi | they do | 17:37 |
clarkb | ed25519 doesn't have the ecdsa value etc | 17:37 |
clarkb | Cool | 17:37 |
fungi | on the pubkeys specifically i guess you mean | 17:38 |
fungi | since that's not apparent in the privkey blobs | 17:38 |
clarkb | ya and I tried to use a process where I applied the privkey then appended .pub to the source when doing the pubkey to ensure everything stayed aligned | 17:39 |
clarkb | so if the pubkey looks good the privkey should be good too | 17:39 |
fungi | cool, yes all looks like it matches | 17:39 |
clarkb | and the ecdsa values all have increasing bit lengths | 17:39 |
clarkb | so we can kind of infer there that they are likely to be correct too | 17:39 |
fungi | yep | 17:40 |
fungi | i'm popping out for a bit, shouldn't be an hour | 19:02 |
opendevreview | Vladimir Kozhukalov proposed openstack/project-config master: Add official-openstack-repo-jobs to openstack-helm-infra https://review.opendev.org/c/openstack/project-config/+/947534 | 19:42 |
Clark[m] | I'm almost back. I'll plan to approve the ssh host key change once back in front of a computer | 20:01 |
fungi | okay, back as well | 20:03 |
clarkb | approved | 20:16 |
fungi | great timing | 20:16 |
clarkb | out of an abundance of caution I think I've decided I should just make a backup copy of each of the four keys on 02 | 20:17 |
fungi | fair enough, just in case the values on bridge are wrong in some way | 20:17 |
clarkb | ya | 20:19 |
opendevreview | Merged openstack/project-config master: Add official-openstack-repo-jobs to openstack-helm-infra https://review.opendev.org/c/openstack/project-config/+/947534 | 20:20 |
clarkb | once it applies we can check the values and then I'm 99% sure that review03's gerrit needs to be restarted to pick up the new keys. I'll do that and run some ssh keyscans to ensure everything matches | 20:20 |
clarkb | I only backed up the privkeys as I believe the pubkeys can be derived from the privkeys | 20:20 |
clarkb | (note we also have backups of this stuff but having local copies makes comparing and double checking everything simpler) | 20:21 |
mnaser | i dunno if this is some sort of red herring however | 20:38 |
mnaser | https://opendev.org/openstack/nova/compare/724d92d...1aeeb96 that seems to give me a 404, but we are seeing it _sometimes_ | 20:38 |
mnaser | for example, this works fine https://opendev.org/openstack/nova/compare/12cd287...3dd97ca | 20:38 |
mnaser | we use this with renovate to give us diffs - https://github.com/vexxhost/atmosphere/pull/2476 -- but i changed it now to include full digest instead of short one, but still.. odd | 20:39 |
clarkb | mnaser: the full sha works? | 20:39 |
mnaser | yeah i just updated it and so it updated the pr | 20:39 |
mnaser | https://opendev.org/openstack/nova/compare/724d92dd382bf378f03487496c93bfba23fbc5f9...1aeeb96ffa646f4b4ebd2af8336e9f6eba4e974a | 20:40 |
mnaser | works fine | 20:40 |
fungi | i guess it could be that one of those refs is missing from one of the gitea backends? | 20:40 |
clarkb | fungi: if that were the case the full sha shouldn't work either | 20:41 |
clarkb | https://opendev.org/openstack/nova/compare/724d92dd38...1aeeb96ffa also seems to work | 20:41 |
clarkb | I wonder if it is a collision with the short ref | 20:41 |
mnaser | that's kinda what i was leaning to | 20:41 |
mnaser | i can live happily with the long commit but i was just calling it out in case well, it was something else weird going on | 20:41 |
clarkb | https://opendev.org/openstack/nova/compare/724d92dd...1aeeb96 also works | 20:41 |
clarkb | but remove that last d and it doesn't so my hunch is that prefix isn't long enough to be unique | 20:42 |
clarkb | https://opendev.org/openstack/nova/compare/724d92dd...1aeeb9 also works but remove the 9 at the very end and it doesn't | 20:42 |
fungi | interestingly, https://github.com/openstack/nova/commit/724d92d and https://github.com/openstack/nova/commit/1aeeb96 both work though | 20:43 |
clarkb | fungi: ya even 1aeeb9 works on gitea | 20:44 |
mnaser | and https://github.com/openstack/nova/compare/724d92d...1aeeb96 i guess | 20:44 |
clarkb | it's possible that the colliding object (if that is the case) is in gitea but not github. Think gerrit refs (I don't think those are replicated to github) | 20:45 |
mnaser | ah right yes | 20:45 |
fungi | oh, right, for some reason i copied github urls from somewhere? and didn't even notice | 20:46 |
fungi | i meant to test https://opendev.org/openstack/nova/commit/724d92d which does indeed 404 | 20:47 |
clarkb | cool I think this is consistent with the prefix simply being too short, and nova being large with a long history is probably one of the repos most likely to have this problem | 20:48 |
fungi | while https://opendev.org/openstack/nova/commit/1aeeb96 is fine | 20:48 |
clarkb | to confirm this we'd probably have to do an object listing on gitea or something | 20:48 |
fungi | https://opendev.org/openstack/nova/commit/724d92dd works but all other 15 possibilities for the last hex digit are 404, so it's not a commit at least | 20:49 |
clarkb | ya my hunch is refs/changes/xyz/yz/meta or whatever the path is | 20:50 |
fungi | oh, unless one of those 15 also has a collision in the next position | 20:50 |
clarkb | which isn't a commit object | 20:50 |
clarkb | I don't think | 20:51 |
clarkb | I guess we could find the hash of a change meta object and see if gitea's commit/hash path would render it | 20:51 |
fungi | https://paste.opendev.org/show/b4BdBJyGnyD0e5iTkgHT/ | 20:56 |
fungi | 724d92d8 is a tree object | 20:56 |
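A sketch of one way to run that kind of check, inside a clone of openstack/nova that includes Gerrit's extra refs; only the prefix from the discussion above is assumed:

```shell
# Print every object the short prefix could refer to along with its type;
# more than one match, or a non-commit type, explains the 404 in gitea.
git rev-parse --disambiguate=724d92d | while read -r sha; do
  printf '%s %s\n' "$sha" "$(git cat-file -t "$sha")"
done
```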
clarkb | cool so that confirms it | 20:57 |
fungi | mnaser: ^ definitely a collision between truncated object hashes | 20:57 |
clarkb | its possible the gitea service log would also show an error that leads to the 404 (I would definitely expect it if it 500'd) | 20:57 |
mnaser | Wow. It's my lucky day then :) | 21:14 |
mnaser | Having said that, 404 is an unpleasant response to get as a user | 21:15 |
clarkb | I think it is accurate but under-informative | 21:15 |
clarkb | "404 - non unique identifier" instead of "404 - Not Found" would probably help a lot | 21:16 |
JayF | it's interesting there's not an http status code indicating a valid but ambiguous request | 21:17 |
JayF | honestly the ideal case is wikipedia disambiguation style pages for the human case, at least | 21:18 |
clarkb | 409 conflict might work (I think it's usually there to tell you that you can't update a resource when someone else is updating it though) | 21:20 |
JayF | oh, good call | 21:20 |
JayF | I wonder how a browser handles a 409 with a response body | 21:20 |
JayF | Hilarious that the Ironic dev forgets about the 409, our API throws 409s like nobody's business (when a node is locked) | 21:20 |
clarkb | my interpretation is that 404 is correct here | 21:21 |
clarkb | if we weren't working with git and had both /foobar and /foobaz then /foo would 404 | 21:21 |
JayF | yeah, but the application can understand and throw a proper status code; that's the case for many returned status codes | 21:22 |
clarkb | fungi: the gerrit ssh host key change should merge in the next few minutes (its getting through the testinfra test cases now) | 21:28 |
opendevreview | Merged opendev/system-config master: Manage gerrit's ecdsa and ed25519 hostkeys https://review.opendev.org/c/opendev/system-config/+/947044 | 21:31 |
clarkb | here we go | 21:32 |
clarkb | build was successful | 21:35 |
clarkb | timestamps on review02 are still from 2020, which is good; we wanted noops there | 21:36 |
clarkb | timestamps on review03 are from just about now. | 21:36 |
clarkb | ssh-keyscan shows review03 has not changed its host keys so I do need to restart the service. Doing that next | 21:36 |
clarkb | docker compose down is not moving quickly on review03. Which is weird considering it isn't really doing anything | 21:38 |
clarkb | oh we set the stop signal to sighup | 21:39 |
clarkb | however I wonder if this doesn't work due to the podman signal problem on noble and that signal was never received | 21:40 |
clarkb | I'm going to manually issue a sighup | 21:40 |
clarkb | that was exactly it | 21:40 |
clarkb | I'm going to have to think about this. I don't think it is a blocker, but it is an annoyance | 21:41 |
clarkb | infra-root ^ fyi | 21:41 |
clarkb | ssh-keyscan lgtm though it only fetches the ecdsa, ed25519 and rsa keys. ecdsa 384 and ecdsa 521 are not types you can feed into ssh-keyscan | 21:43 |
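A sketch of the comparison being described, assuming the keys in question are the ones Gerrit's sshd offers on port 29418 (host names as discussed above):

```shell
# Fetch the host keys each server offers and diff the key material;
# ssh-keyscan can only ask for the rsa, ecdsa (nistp256) and ed25519 types.
diff <(ssh-keyscan -p 29418 -t rsa,ecdsa,ed25519 review02.opendev.org 2>/dev/null | sed 's/^[^ ]* //' | sort) \
     <(ssh-keyscan -p 29418 -t rsa,ecdsa,ed25519 review03.opendev.org 2>/dev/null | sed 's/^[^ ]* //' | sort)
```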
clarkb | but the output appears to match review02 now so I'm pretty happy with that. Not happy about the sighup thing. But I think we can do something like sudo docker compose down && sudo kill -HUP $gerrit_pid for now | 21:44 |
clarkb | the next thing I want to test is a reboot of the server | 21:46 |
clarkb | we did some testing of this back in january with different podman restart options and my expectation is that the server reboots and does not start gerrit again | 21:47 |
clarkb | which I think is preferable but I want to confirm that behavior | 21:47 |
clarkb | however, that will kill our running screen | 21:47 |
clarkb | so I'll hold off on doing that until maybe tomorrow in case anyone wants to look at that context first (its a good dry run of what the outage will look like) | 21:47 |
fungi | huh, okay so hangup was fine under docker but not podman? | 21:56 |
clarkb | fungi: yes this is a podman apparmor issue on noble. We first ran into it with haproxy as we use sighup there to do graceful config reloads | 21:57 |
clarkb | specifically for sighup | 21:57 |
clarkb | it is possible that one of the allowed signals is a signal that gerrit accepts as a graceful shutdown command we can switch to that. I seem to recall that the jvm actually merges a whole bunch of signals into one action by default | 21:58 |
clarkb | but the apparmor rules for podman containers only allow like sigkill and sigint or something like that through | 21:58 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2089664 | 21:59 |
clarkb | int, quit, kill, term are allowed now | 22:00 |
clarkb | let me try and cross check with gerrit now | 22:00 |
clarkb | fungi: https://docs.oracle.com/en/java/javase/17/troubleshoot/handle-signals-and-exceptions.html#GUID-A7D91931-EA03-4BA0-B58B-53A67F9CBD21 reading that I think sigterm, sigint, and sighup are all equivalent and will trigger the jvm's shutdown handler which gerrit hooks into in daemon.java | 22:05 |
clarkb | I think that means we can convert from using sighup to sigint instead | 22:06 |
fungi | ah, neat, so we can use any of those that aren't being filtered | 22:06 |
clarkb | I'm going to get on discord and ask about this because this is my not-really-a-java-dev understanding and there are real java devs on gerrit discord | 22:07 |
fungi | not a bad idea | 22:07 |
clarkb | but ya we may be able to work around this entirely with a simple docker-compose.yaml update to use sigint instead | 22:07 |
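The docker-compose.yaml tweak being described would look roughly like this; the service name and grace period are assumptions, and the stop_signal line is the point:

```yaml
# Minimal sketch: stop Gerrit with SIGINT, which the JVM treats like SIGHUP
# but which podman's apparmor profile on noble actually delivers.
services:
  gerrit:
    stop_signal: SIGINT
    stop_grace_period: 300s  # assumed; how long compose waits before SIGKILL
```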
clarkb | and until then the workaround I did use isn't terrible (just use kill as root to send the signal we want) | 22:07 |
clarkb | https://zuul.opendev.org/t/openstack/build/1f8f38666d864d15a5d9ba7a1bc7cf15/log/job-output.txt#21121-21123 you can see the problem here in the upgrade job because we stop gerrit as part of the upgrade process. There is a 5 minute gap which is the timeout where it falls back to using sigkill? | 22:24 |
clarkb | I'm going to push up a WIP change that uses sigint and we'll see if that time delta goes down at least | 22:24 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use sigint instead of sighup to stop gerrit https://review.opendev.org/c/opendev/system-config/+/947540 | 22:29 |
clarkb | https://zuul.opendev.org/t/openstack/build/0b88866c3f52495985c32c08a0a048c2/log/job-output.txt#21119-21121 that looks much better if this is a valid way to gracefully stop gerrit | 23:15 |