opendevreview | Merged opendev/system-config master: Switch Gerrit replication to a larger RSA key https://review.opendev.org/c/opendev/system-config/+/902169 | 00:00 |
---|---|---|
clarkb | bah it merged just too late to get ahead of the hourly jobs | 00:00 |
clarkb | so ya even later... I'm thinking I'll confirm that the config updates make it onto the server as expected, push something to sandbox and see that it replicates (so that we haven't unexpectedly broken it) then we can plan for a restart monday | 00:01 |
fungi | okay, back | 00:20 |
clarkb | the files were just written out | 00:21 |
clarkb | still waiting for the job to complete | 00:21 |
clarkb | fungi: do you want to wait for monday or go for it now? | 00:21 |
fungi | i'm good to do it now | 00:21 |
fungi | i'm around all weekend too in case something comes up | 00:21 |
fungi | looks like the job just completed | 00:21 |
clarkb | ok then I guess we proceed. Plan would be to docker-compose down, mv the waiting queue aside, move the old id_rsa aside, then docker-compose up -d | 00:21 |
clarkb | and the job completed successfully | 00:22 |
clarkb | fungi: do you want me to start a screen? I don't think it is necessary | 00:22 |
fungi | nah | 00:22 |
fungi | i'm just looking for the key | 00:22 |
clarkb | I can do the stop and file moves and all that if you want to do a #status log? and maybe prep a change to push after it starts so that we can check replication is working? | 00:23 |
fungi | the new key isn't in ~gerrit2/review_site/etc/ i guess? | 00:23 |
clarkb | no it is in /home/gerrit2/.ssh/ next to the old key | 00:23 |
fungi | aha, yep that updated | 00:24 |
clarkb | oh we should make sure that whole dir is bind mounted to the container and not individual files | 00:24 |
fungi | i see replication_id_rsa_B and replication_id_rsa_B.pub with a last updated time of a few minutes ago | 00:24 |
clarkb | I didn't think of that until just now | 00:24 |
clarkb | - /home/gerrit2/.ssh:/var/gerrit/.ssh | 00:24 |
clarkb | should be good | 00:24 |
fungi | /home/gerrit2/.ssh:/var/gerrit/.ssh | 00:24 |
fungi | yep | 00:24 |
clarkb | I'll move id_rsa there aside to id_rsa.bak | 00:25 |
fungi | k | 00:25 |
clarkb | as well as moving the replication waiting queue into /home/gerrit2/tmp/clarkb | 00:25 |
clarkb | fungi: maybe you can #status notice while I sort out those file paths? | 00:25 |
fungi | #status notice The Gerrit service on review.opendev.org will be offline momentarily to restart it onto an updated replication key | 00:26 |
opendevstatus | fungi: sending notice | 00:26 |
-opendevstatus- | NOTICE: The Gerrit service on review.opendev.org will be offline momentarily to restart it onto an updated replication key | 00:26 |
clarkb | fungi: I think I'm ready to proceed as soon as that reports it is done | 00:27 |
clarkb | that == the bot | 00:27 |
fungi | cool | 00:27 |
opendevstatus | fungi: finished sending notice | 00:29 |
fungi | there we are | 00:29 |
clarkb | ok I'm proceeding now with the stop, file/dir moves, and then start | 00:29 |
clarkb | I am not doing an image update pull | 00:29 |
fungi | sounds right | 00:30 |
clarkb | it should be back up ish now | 00:31 |
clarkb | the log says it is anyway | 00:31 |
fungi | yeah, loading for me now | 00:31 |
fungi | Powered by Gerrit Code Review (3.8.3-2-gb446549261-dirty) | 00:31 |
clarkb | now we need to push some code | 00:31 |
clarkb | arg replication is failing according to the replication log | 00:33 |
clarkb | no public keys to try so it isn't reading the ssh config as documented? | 00:33 |
fungi | replication_id_rsa_B.pub looks like the right format, same format as id_rsa.pub anyway | 00:34 |
fungi | same for the private keys | 00:34 |
clarkb | well the odd thing is it says "no keys to try" instead of "key failed" | 00:34 |
fungi | is the existence of id_rsa.pub tripping it up? | 00:35 |
clarkb | Maybe? but if it is reading the config file it shouldn't bother with id_rsa at all or its pubkey | 00:35 |
fungi | but the IdentityFile line should be pointing it at the other key, yeah | 00:35 |
clarkb | either it's ignoring the file entirely for some reason (plugin docs are bad or file permissions are wrong or something) or the specification in the file is wrong | 00:36 |
clarkb | I think our options are to 1) move the new key in place as id_rsa and put review in the emergency file or 2) revert and figure it out next week | 00:36 |
fungi | Cannot log in at gitea14.opendev.org:222 publickey: no keys to try | 00:36 |
clarkb | either way we need to do another restart | 00:36 |
clarkb | yup that's the error | 00:36 |
clarkb | my plan is after we restore replication to grep `Cannot replicate to` and manually trigger replication for the repos that didn't replicate | 00:37 |
fungi | permissions and ownership look the same on the old and new files | 00:37 |
clarkb | maybe the Host specification needs the :222 at the end since that is what we are ssh'ing to? | 00:37 |
fungi | oh, that's a strong possibility | 00:38 |
clarkb | if that is the issue I'll be cranky because Port 222 is set | 00:39 |
fungi | mmm, yeah i don't think it needs to be on the host line then, you're right | 00:39 |
clarkb | oh! I have a : on the Host line | 00:39 |
clarkb | and that shouldn't be there | 00:39 |
fungi | right! this is not yaml | 00:39 |
clarkb | this may be holdover from when I was ensuring I got the port in there | 00:39 |
clarkb | ya I think I started with the yaml from our replication config | 00:40 |
clarkb | ok so let's stop it again, remove the : and start again then see if that works? | 00:40 |
fungi | and yeah, just confirmed the host lines in my personal config don't end in : | 00:40 |
clarkb | if it does we can put review in the emergency file and fix that specific problem monday | 00:40 |
fungi | i concur | 00:40 |
clarkb | alright proceeding now | 00:40 |
clarkb | it should be up(ish) now | 00:42 |
clarkb | I'm waiting on the replication log to say something new | 00:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars https://review.opendev.org/c/opendev/system-config/+/900379 | 00:42 |
clarkb | ok replication logs look much better after ^ still need to confirm the gitea side has that commit | 00:43 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Remove a stray colon https://review.opendev.org/c/opendev/system-config/+/902461 | 00:43 |
clarkb | hrm no it's still failing | 00:44 |
clarkb | I think the previous log entries were for noop edits because I used the web ui to make the change then it took some time for the actual replication runs to fail | 00:45 |
clarkb | I don't really want to restart gerrit over and over again... but also ugh | 00:45 |
fungi | want to try the :222 or just roll back? | 00:46 |
clarkb | adding :222 is the only other thing I can figure to try right now. I guess we can do that and if it fails we can roll back | 00:46 |
fungi | wfm | 00:46 |
clarkb | ok proceeding | 00:47 |
clarkb | it should be back ish again | 00:48 |
fungi | looking at the manpage for ssh_config, the host entry refers to the patterns section, which talks about wildcards but doesn't mention any port suffix, so i have doubts this will help | 00:49 |
fungi | then again, this is mina's interpretation of openssh configuration | 00:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars https://review.opendev.org/c/opendev/system-config/+/900379 | 00:49 |
fungi | so all bets are off | 00:49 |
clarkb | ya it may not actually read that file for all we know and the docs are completely wrong | 00:50 |
clarkb | I guess that is the other thing we can do. Move the new replication key to id_rsa | 00:50 |
clarkb | and not rely on the .ssh/config file at all | 00:51 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add the port to mina's replication host pattern https://review.opendev.org/c/opendev/system-config/+/902461 | 00:51 |
clarkb | basically move that file over, put review into emergency file and then update on Monday to simply write to the id_rsa file | 00:51 |
clarkb | ok the :222 did not work. So either we revert or do ^ | 00:51 |
fungi | i'm okay with one last try | 00:52 |
clarkb | actually we don't need to revert. We should be able to just move id_rsa back into place and restart | 00:52 |
clarkb | fungi: and that one last try would be putting the new key in as id_rsa? | 00:52 |
fungi | right | 00:52 |
clarkb | ok I'll proceed with that then | 00:52 |
fungi | and undoing the config of course | 00:52 |
fungi | granted, it may simply be ignoring the config entirely | 00:53 |
fungi | or it may parse only a subset of ssh_config syntax | 00:53 |
clarkb | oh I didn't remove the config because I figured it was completely ignoring it | 00:54 |
fungi | we'll see | 00:54 |
fungi | but yes, that seems likely | 00:54 |
clarkb | wow ok it still says no keys to try | 00:57 |
clarkb | I don't get it | 00:57 |
clarkb | I think we're in full revert territory (remove config and put id_rsa.bak as id_rsa again) | 00:57 |
fungi | agreed | 00:57 |
fungi | i'll add the server to the emergency disable list | 00:57 |
clarkb | thank you | 00:57 |
clarkb | I'm proceeding with manual revert now | 00:58 |
fungi | it's in the emergency list now | 00:58 |
clarkb | I moved .ssh/config to .ssh/brokenconfig and id_rsa.bak to id_rsa | 00:59 |
clarkb | I think we can leave the other two new key files in place since nothing should point to them now | 00:59 |
clarkb | if you concur I'll start gerrit hopefully for the last time this evening | 00:59 |
fungi | yes | 01:00 |
fungi | remaining possibilities i can come up with: 1. the config is confusing the client, or 2. the new key is too large or formatted internally in a way that the client is unable to load | 01:00 |
clarkb | it is starting now | 01:00 |
fungi | git commit --amend | 01:02 |
fungi | hah | 01:02 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Revert "Switch Gerrit replication to a larger RSA key" https://review.opendev.org/c/opendev/system-config/+/902481 | 01:02 |
fungi | https://opendev.org/opendev/system-config/commit/d346d5375ffb70c3cea37def33f4d52887d8d276 | 01:02 |
fungi | replicated | 01:02 |
clarkb | cool I seem to see happy replication logs too | 01:03 |
clarkb | other idea: maybe the mina client is validating id_rsa and id_rsa.pub match | 01:03 |
clarkb | otherwise it won't use the key? | 01:03 |
clarkb | I'm going to generate a list of projects to trigger replication for now | 01:03 |
fungi | oh, i think it does do that. did you only move the private key to id_rsa but not the public one to id_rsa.pub? | 01:03 |
clarkb | yes | 01:03 |
clarkb | because well it shouldn't matter... | 01:03 |
fungi | that's a distinct possibility then | 01:04 |
clarkb | I think we can test these things with a held node next week | 01:04 |
clarkb | and try to nail down what exactly went wrong from all of these possibilities | 01:04 |
fungi | yes, it shouldn't matter but in the past we've found out the hard way that it blew up if we only installed the id_rsa and not the id_rsa.pub even though it should never use the latter | 01:04 |
fungi | i think openssh will also explode if your pubkey doesn't match the corresponding privkey when present, but is fine if it's entirely absent | 01:05 |
clarkb | openstack/openstack-helm-infra openstack/neutron opendev/system-config and openstack/openstack are the ones with reported replication failures | 01:06 |
fungi | so maybe mina-ssh is trying to mimic that safety check | 01:06 |
clarkb | at this point I smell dinner and don't really want to try another restart to check that | 01:06 |
fungi | no, we can revisit next week | 01:06 |
clarkb | assuming that is the case it still doesn't explain why the .ssh/config stuff didn't work (since the key it pointed to did have a matching pubkey) | 01:07 |
clarkb | so I think we basically want to investigate if .ssh/config works and if so how to make it work. Then decide on whether or not we need to replace id_rsa and id_rsa.pub entirely or can have a new key alongside the old one | 01:07 |
clarkb | I'm going to trigger replication for those repos | 01:07 |
fungi | looks like syntax is `replication start openstack/openstack-helm-infra openstack/neutron opendev/system-config openstack/openstack` | 01:09 |
clarkb | oh I was doing them one at a time can I list them all at once? neat | 01:09 |
fungi | it may only be one or wildcards | 01:10 |
fungi | (or none) | 01:10 |
clarkb | that is done | 01:10 |
fungi | `replication start --help` does suggest that it will accept multiple patterns | 01:11 |
fungi | replication start [PATTERN ...] [--] [--all] [--deadline VAL] [--help (-h)] [--now] [--trace] [--trace-id VAL] [--url PATTERN] [--wait] | 01:11 |
clarkb | the other thing I noticed is that my client generates an exception in error_log when loading my test change https://review.opendev.org/c/opendev/system-config/+/900379 because ps4 isn't present or something. However it seems to render fine | 01:11 |
clarkb | I suspect that maybe when we restarted it cut a reindex action for that short and it isn't in the index? | 01:12 |
fungi | that would be my guess, and reindexing will eventually solve it | 01:12 |
clarkb | Part of what backs that up is that clicking the sha says no changes | 01:12 |
clarkb | I think I'll push a new ps and see if that reindexes the whole change | 01:12 |
clarkb | otherwise we'll need to trigger full reindexing for the project another time | 01:12 |
opendevreview | Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars https://review.opendev.org/c/opendev/system-config/+/900379 | 01:13 |
clarkb | I think that did it. I can click on the sha now and it finds the change and there isn't a new traceback in the error_log | 01:14 |
clarkb | I'll do a recap so that scrollback isn't so bad | 01:16 |
clarkb | We restarted gerrit with the new .ssh/config and the two new key files (private and public) in place on review02. Replication began to fail with "no keys to try" errors. We thought there may be config file errors (of which we found at least one, but resolving that one didn't change the behavior). After several restarts I figured we'd test moving the new key in as id_rsa as a hail mary. This still didn't work. After that we reverted by hand and put the server in the emergency file | 01:17 |
clarkb | After the by hand revert everything started working again. Current thoughts: the .ssh/config is either bad because mina can't parse it for some reason or it is completely ignored by mina. And the reason that moving the new key in as id_rsa may not have worked is I didn't also move the pubkey to id_rsa.pub | 01:18 |
clarkb | On Monday we'll want to rollback things more properly so that the server can come out of the emergency file and then test how to make this work more reliably | 01:18 |
clarkb | Finally I noticed errors with my test change that appear to be related to cutting indexing of a new patchset short due to stopping gerrit. Pushing a new patchset to the change fixed this but presumably so would an explicit online reindex request for the project in question | 01:19 |
clarkb | fungi: thank you for the help | 01:19 |
fungi | also https://review.opendev.org/902481 is the revert | 01:19 |
fungi | if that merges, presumably we can take the server out of the emergency disable list? | 01:20 |
clarkb | fungi: yes I think so because then we won't try to rewrite .ssh/config | 01:21 |
clarkb | but we'll need to manually cleanup the new key and .ssh/brokenconfig | 01:21 |
fungi | ah, correct | 01:21 |
clarkb | I'm beginning to think that the way to rollforward will likely end up being a manual move of the new key to id_rsa and id_rsa.pub after backing up both files for ease of manual reverting. Restart gerrit and then if that works just update our config management to overwrite the old key with the new key data | 01:22 |
clarkb | basically don't try to have A and B keys since .ssh/config selection seems iffy | 01:22 |
clarkb | but we can attempt to do more in depth testing next week before making any decisions | 01:22 |
fungi | agreed | 01:22 |
fungi | but for now, your dinner grows ever colder | 01:23 |
clarkb | yes I can smell it :) | 01:23 |
fungi | and i have a cold beer and video games calling my name | 01:23 |
clarkb | I couldn't help myself so I started to rtfs | 01:37 |
clarkb | MINA appears to do a regex match | 01:37 |
clarkb | not a glob match | 01:37 |
clarkb | a small but very important difference in behavior | 01:37 |
clarkb | https://github.com/apache/mina-sshd/blob/master/sshd-common/src/main/java/org/apache/sshd/client/config/hosts/HostPatternsHolder.java#L30 | 01:38 |
clarkb | https://github.com/apache/mina-sshd/blob/master/sshd-common/src/main/java/org/apache/sshd/client/config/hosts/HostPatternsHolder.java#L180 | 01:38 |
clarkb | there is also port matching going on so we may need to deal with ports too | 01:40 |
opendevreview | Merged opendev/system-config master: Add debugging info to certcheck list building https://review.opendev.org/c/opendev/system-config/+/898475 | 03:29 |
fungi | wow, regex instead of glob sounds like they either completely misinterpreted what openssh does or just... didn't care | 13:21 |
Clark[m] | Ya I mean reading the code I'm like 95% sure that gitea[0-9]*.opendev.org would work now. But we should test anyway | 19:51 |
Clark[m] | And then assuming that is the solution I get to go update more docs | 19:52 |
fungi | yeah, i believe it, even as disappointing as that belief is at an existential level | 19:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Reapply "Switch Gerrit replication to a larger RSA key" https://review.opendev.org/c/opendev/system-config/+/902490 | 22:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: Reapply "Switch Gerrit replication to a larger RSA key" https://review.opendev.org/c/opendev/system-config/+/902490 | 22:26 |
clarkb | Forgot to add the forced test failures. I'm going to put holds in place for the gitea and gerrit jobs and see if we can make gerrit replicate to the held gitea that way | 22:27 |
clarkb | I think I can set up replication to replicate the test repo in gerrit to any one of the gitea repos that is empty using a direct mapping in the replication config. Then as long as I update /etc/hosts and ssh authorized keys in gitea it should be a valid test of the .ssh/config | 22:31 |
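
The regex-versus-glob difference raised in the discussion can be illustrated with a small Python sketch. It does not call MINA or reproduce its code; it only compares how one pattern string behaves under glob semantics (`fnmatch`) and under plain regex semantics (`re.fullmatch`). The pattern `gitea*.opendev.org` is an assumption, since the log never shows the actual Host line. `gitea[0-9]*.opendev.org` and `gitea14.opendev.org` are taken from the log. The helper names are made up for illustration.

```python
import fnmatch
import re

# Hostname taken from the replication error quoted in the log.
HOST = "gitea14.opendev.org"


def glob_match(pattern: str, host: str) -> bool:
    """Treat the pattern as a shell-style glob ('*' matches any run of characters)."""
    return fnmatch.fnmatchcase(host, pattern)


def regex_match(pattern: str, host: str) -> bool:
    """Treat the very same string as a regular expression matched against the whole host."""
    return re.fullmatch(pattern, host) is not None


if __name__ == "__main__":
    # First pattern is an ASSUMED Host entry; the second is the one suggested in the log.
    for pattern in ("gitea*.opendev.org", "gitea[0-9]*.opendev.org"):
        print(pattern, "glob:", glob_match(pattern, HOST), "regex:", regex_match(pattern, HOST))
```

Under these assumptions the first pattern matches only when read as a glob: as a regex, `a*` means "zero or more `a`" and the following `.` consumes a single character, so the digits in the hostname are never covered. The second pattern matches under both readings, which is consistent with the idea of testing it on a held node, though whether MINA really behaves this way is exactly what that test would need to confirm.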
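
The recovery step described in the log (grep the replication log for `Cannot replicate to`, then pass the affected projects to `replication start`) could be scripted roughly as follows. This is a sketch under stated assumptions: only the phrase `Cannot replicate to` and the `replication start PATTERN ...` syntax come from the log; the layout of the rest of each log line, the `ssh://git@...` URL shape, and the hostname `gitea09.opendev.org` are hypothetical, and the function names are made up.

```python
import re

# ASSUMED line layout: "Cannot replicate to <scheme>://<host[:port]>/<project>[.git]".
# Only the leading phrase is confirmed by the discussion; adjust the regex to the real log.
FAILURE = re.compile(
    r"Cannot replicate to \S+?://[^/\s]+/(?P<project>\S+?)(?:\.git)?(?:\s|$)"
)


def failed_projects(lines):
    """Return the sorted, de-duplicated project names found in failure lines."""
    projects = set()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            projects.add(match.group("project"))
    return sorted(projects)


def replication_command(projects):
    """Build a single 'replication start' invocation covering every failed project."""
    return "replication start " + " ".join(projects)


# Hypothetical sample lines, not copied from a real replication log.
SAMPLE = [
    "Cannot replicate to ssh://git@gitea14.opendev.org:222/openstack/neutron.git",
    "Cannot replicate to ssh://git@gitea09.opendev.org:222/openstack/neutron.git",
    "Cannot replicate to ssh://git@gitea14.opendev.org:222/opendev/system-config.git",
    "some unrelated log line",
]

if __name__ == "__main__":
    print(replication_command(failed_projects(SAMPLE)))
```

De-duplicating matters because a project that fails against several gitea backends shows up once per backend, while `replication start` only needs each project name once.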