Saturday, 2023-12-02

<opendevreview> Merged opendev/system-config master: Switch Gerrit replication to a larger RSA key
[00:00] <clarkb> bah it merged just too late to get ahead of the hourly jobs
[00:01] <clarkb> so ya even later... I'm thinking I'll confirm that the config updates make it onto the server as expected, push something to sandbox and see that it replicates (so that we haven't unexpectedly broken it) then we can plan for a restart monday
[00:20] <fungi> okay, back
[00:21] <clarkb> the files were just written out
[00:21] <clarkb> still waiting for the job to complete
[00:21] <clarkb> fungi: do you want to wait for monday or go for it now?
[00:21] <fungi> i'm good to do it now
[00:21] <fungi> i'm around all weekend too in case something comes up
[00:21] <fungi> looks like the job just completed
[00:21] <clarkb> ok then I guess we proceed. Plan would be to docker-compose down, mv the waiting queue aside, move the old id_rsa aside, then docker-compose up -d
[00:22] <clarkb> and the job completed successfully
[00:22] <clarkb> fungi: do you want me to start a screen? I don't think it is necessary
[00:22] <fungi> i'm just looking for the key
[00:23] <clarkb> I can do the stop and file moves and all that if you want to do a #status log? and maybe prep a change to push after it starts so that we can check replication is working?
[00:23] <fungi> the new key isn't in ~gerrit2/review_site/etc/ i guess?
[00:23] <clarkb> no it is in /home/gerrit2/.ssh/ next to the old key
[00:24] <fungi> aha, yep that updated
[00:24] <clarkb> oh we should make sure that whole dir is bind mounted to the container and not individual files
[00:24] <fungi> i see replication_id_rsa_B and with a last updated time of a few minutes ago
[00:24] <clarkb> I didn't think of that until just now
[00:24] <clarkb> - /home/gerrit2/.ssh:/var/gerrit/.ssh
[00:24] <clarkb> should be good
[00:25] <clarkb> I'll move id_rsa there aside to id_rsa.bak
[00:25] <clarkb> as well as moving the replication waiting queue into /home/gerrit2/tmp/clarkb
[00:25] <clarkb> fungi: maybe you can #status notice while I sort out those file paths?
[00:26] <fungi> #status notice The Gerrit service on will be offline momentarily to restart it onto an updated replication key
[00:26] <opendevstatus> fungi: sending notice
[00:26] -opendevstatus- NOTICE: The Gerrit service on will be offline momentarily to restart it onto an updated replication key
[00:27] <clarkb> fungi: I think I'm ready to proceed as soon as that reports it is done
[00:27] <clarkb> that == the bot
[00:29] <opendevstatus> fungi: finished sending notice
[00:29] <fungi> there we are
[00:29] <clarkb> ok I'm proceeding now with the stop, file/dir moves, and then start
[00:29] <clarkb> I am not doing an image update pull
[00:30] <fungi> sounds right
[00:31] <clarkb> it should be back up ish now
[00:31] <clarkb> the log says it is anyway
[00:31] <fungi> yeah, loading for me now
[00:31] <fungi> Powered by Gerrit Code Review (3.8.3-2-gb446549261-dirty)
[00:31] <clarkb> now we need to push some code
[00:33] <clarkb> arg replication is failing according to the replication log
[00:33] <clarkb> no public keys to try so it isn't reading the ssh config as documented?
[00:34] looks like the right format, same format as anyway
[00:34] <fungi> same for the private keys
[00:34] <clarkb> well the odd thing is it says "no keys to try" instead of "key failed"
[00:35] <fungi> is the existence of tripping it up?
[00:35] <clarkb> Maybe? but if it is reading the config file it shouldn't bother with id_rsa at all or its pubkey
[00:35] <fungi> but the IdentityFile line should be pointing it at the other key, yeah
[00:36] <clarkb> either it's ignoring the file entirely for some reason (plugin docs are bad or file permissions are wrong or something) or the specification in the file is wrong
[00:36] <clarkb> I think our options are to 1) move the new key in place as id_rsa and put review in the emergency file or 2) revert and figure it out next week
[00:36] <fungi> Cannot log in at publickey: no keys to try
[00:36] <clarkb> either way we need to do another restart
[00:36] <clarkb> yup that's the error
[00:37] <clarkb> my plan is after we restore replication to grep `Cannot replicate to` and manually trigger replication for the repos that didn't replicate
[00:37] <fungi> permissions and ownership look the same on the old and new files
[00:37] <clarkb> maybe the Host specification needs the :222 at the end since that is what we are ssh'ing to?
[00:38] <fungi> oh, that's a strong possibility
[00:39] <clarkb> if that is the issue I'll be cranky because Port 222 is set
[00:39] <fungi> mmm, yeah i don't think it needs to be on the host line then, you're right
[00:39] <clarkb> oh! I have a : on the Host line
[00:39] <clarkb> and that shouldn't be there
[00:39] <fungi> right! this is not yaml
[00:39] <clarkb> this may be a holdover from when I was ensuring I got the port in there
[00:40] <clarkb> ya I think I started with the yaml from our replication config
[00:40] <clarkb> ok so let's stop it again, remove the : and start again then see if that works?
[00:40] <fungi> and yeah, just confirmed the host lines in my personal config don't end in :
[00:40] <clarkb> if it does we can put review in the emergency file and fix that specific problem monday
[00:40] <fungi> i concur
[00:40] <clarkb> alright proceeding now
[00:42] <clarkb> it should be up(ish) now
[00:42] <clarkb> I'm waiting on the replication log to say something new
<opendevreview> Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars
[00:43] <clarkb> ok replication logs look much better after ^ still need to confirm the gitea side has that commit
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Remove a stray colon
[00:44] <clarkb> hrm no it's still failing
[00:45] <clarkb> I think the previous log entries were for noop edits because I used the web ui to make the change then it took some time for the actual replication runs to fail
[00:45] <clarkb> I don't really want to restart gerrit over and over again... but also ugh
[00:46] <fungi> want to try the :222 or just roll back?
[00:46] <clarkb> adding :222 is the only other thing I can figure to try right now. I guess we can do that and if it fails we can roll back
[00:47] <clarkb> ok proceeding
[00:48] <clarkb> it should be back ish again
[00:49] <fungi> looking at the manpage for ssh_config, the host entry refers to the patterns section, which talks about wildcards but doesn't mention any port suffix, so i have doubts this will help
[00:49] <fungi> then again, this is mina's interpretation of openssh configuration
<opendevreview> Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars
[00:49] <fungi> so all bets are off
[00:50] <clarkb> ya it may not actually read that file for all we know and the docs are completely wrong
[00:50] <clarkb> I guess that is the other thing we can do. Move the new replication key to id_rsa
[00:51] <clarkb> and not rely on the .ssh/config file at all
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Add the port to mina's replication host pattern
[00:51] <clarkb> basically move that file over, put review into emergency file and then update on Monday to simply write to the id_rsa file
[00:51] <clarkb> ok the :222 did not work. So either we revert or do ^
[00:52] <fungi> i'm okay with one last try
[00:52] <clarkb> actually we don't need to revert. We should be able to just move id_rsa back into place and restart
[00:52] <clarkb> fungi: and that one last try would be putting the new key in as id_rsa?
[00:52] <clarkb> ok I'll proceed with that then
[00:52] <fungi> and undoing the config of course
[00:53] <fungi> granted, it may simply be ignoring the config entirely
[00:53] <fungi> or it may parse only a subset of ssh_config syntax
[00:54] <clarkb> oh I didn't remove the config because I figured it was completely ignoring it
[00:54] <fungi> we'll see
[00:54] <fungi> but yes, that seems likely
[00:57] <clarkb> wow ok it still says no keys to try
[00:57] <clarkb> I don't get it
[00:57] <clarkb> I think we're in full revert territory (remove config and put id_rsa.bak as id_rsa again)
[00:57] <fungi> i'll add the server to the emergency disable list
[00:57] <clarkb> thank you
[00:58] <clarkb> I'm proceeding with manual revert now
[00:58] <fungi> it's in the emergency list now
[00:59] <clarkb> I moved .ssh/config to .ssh/brokenconfig and id_rsa.bak to id_rsa
[00:59] <clarkb> I think we can leave the other two new key files in place since nothing should point to them now
[00:59] <clarkb> if you concur I'll start gerrit, hopefully for the last time this evening
[01:00] <fungi> remaining possibilities i can come up with: 1. the config is confusing the client, or 2. the new key is too large or formatted internally in a way that the client is unable to load
[01:00] <clarkb> it is starting now
[01:02] <fungi> git commit --amend
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Revert "Switch Gerrit replication to a larger RSA key"
[01:03] <clarkb> cool I seem to see happy replication logs too
[01:03] <clarkb> other idea: maybe the mina client is validating id_rsa and match
[01:03] <clarkb> otherwise it won't use the key?
[01:03] <clarkb> I'm going to generate a list of projects to trigger replication for now
<fungi> oh, i think it does do that. did you only move the private key to id_rsa but not the public one to
[01:03] <clarkb> because well it shouldn't matter...
[01:04] <fungi> that's a distinct possibility then
[01:04] <clarkb> I think we can test these things with a held node next week
[01:04] <clarkb> and try to nail down what exactly went wrong from all of these possibilities
[01:04] <fungi> yes, it shouldn't matter but in the past we've found out the hard way that it blew up if we only installed the id_rsa and not the even though it should never use the latter
[01:05] <fungi> i think openssh will also explode if your pubkey doesn't match the corresponding privkey when present, but is fine if it's entirely absent
[01:06] <clarkb> openstack/openstack-helm-infra openstack/neutron opendev/system-config and openstack/openstack are the ones with reported replication failures
[01:06] <fungi> so maybe mina-ssh is trying to mimic that safety check
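The key-pair consistency check discussed above can be done by hand before a restart. Below is a minimal sketch of a hypothetical helper (not part of Gerrit, MINA, or OpenDev tooling) that derives the public key from a private key with `ssh-keygen -y -f` and compares its key type and base64 blob against the .pub file next to it, ignoring the trailing comment. It assumes an unencrypted private key; whether MINA actually enforces such a check is still unconfirmed at this point in the log.

```python
import subprocess
from pathlib import Path


def key_fields(pubkey_line: str) -> tuple[str, str]:
    """Return (key type, base64 blob) from an OpenSSH public key line.

    The line format is "<type> <base64-blob> [comment]". The comment is
    free text and does not identify the key, so it is ignored.
    """
    parts = pubkey_line.split()
    if len(parts) < 2:
        raise ValueError("not an OpenSSH public key line")
    return parts[0], parts[1]


def pubkeys_match(derived: str, on_disk: str) -> bool:
    """True if two public key lines describe the same key."""
    return key_fields(derived) == key_fields(on_disk)


def check_keypair(private_key: Path) -> bool:
    """Compare a private key with the <name>.pub file beside it.

    Hypothetical helper. Assumes the private key has no passphrase;
    `ssh-keygen -y -f FILE` prints the public key derived from FILE.
    Raises FileNotFoundError if the .pub file is missing.
    """
    derived = subprocess.run(
        ["ssh-keygen", "-y", "-f", str(private_key)],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    on_disk = private_key.with_name(private_key.name + ".pub").read_text()
    return pubkeys_match(derived, on_disk)
```

For example, `check_keypair(Path("/home/gerrit2/.ssh/id_rsa"))` would return False if a new private key had been moved into place while the old id_rsa.pub was left behind.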
[01:06] <clarkb> at this point I smell dinner and don't really want to try another restart to check that
[01:06] <fungi> no, we can revisit next week
[01:07] <clarkb> assuming that is the case it still doesn't explain why the .ssh/config stuff didn't work (since the key it pointed to did have a matching pubkey)
[01:07] <clarkb> so I think we basically want to investigate if .ssh/config works and if so how to make it work. Then decide on whether or not we need to replace id_rsa and entirely or can have a new key alongside the old one
[01:07] <clarkb> I'm going to trigger replication for those repos
[01:09] <fungi> looks like syntax is `replication start openstack/openstack-helm-infra openstack/neutron opendev/system-config openstack/openstack`
[01:09] <clarkb> oh I was doing them one at a time, can I list them all at once? neat
[01:10] <fungi> it may only be one or wildcards
[01:10] <fungi> (or none)
[01:10] <clarkb> that is done
[01:11] <fungi> `replication start --help` does suggest that it will accept multiple patterns
[01:11] <fungi> replication start [PATTERN ...] [--] [--all] [--deadline VAL] [--help (-h)] [--now] [--trace] [--trace-id VAL] [--url PATTERN] [--wait]
[01:11] <clarkb> the other thing I noticed is that my client generates an exception in error_log when loading my test change because ps4 isn't present or something. However it seems to render fine
[01:12] <clarkb> I suspect that maybe when we restarted it cut a reindex action for that short and it isn't in the index?
[01:12] <fungi> that would be my guess, and reindexing will eventually solve it
[01:12] <clarkb> Part of what backs that up is that clicking the sha says no changes
[01:12] <clarkb> I think I'll push a new ps and see if that reindexes the whole change
[01:12] <clarkb> otherwise we'll need to trigger full reindexing for the project another time
<opendevreview> Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars
[01:14] <clarkb> I think that did it. I can click on the sha now and it finds the change and there isn't a new traceback in the error_log
[01:16] <clarkb> I'll do a recap so that scrollback isn't so bad
[01:17] <clarkb> We restarted gerrit with the new .ssh/config and the two new key files (private and public) in place on review02. Replication began to fail with no possible key errors. We thought there may be config file errors (of which we found at least one, but resolving that one didn't change the behavior). After several restarts I figured we'd test moving the new key in as id_rsa as a Hail Mary.
[01:17] <clarkb> This still didn't work. After that we reverted by hand and put the server in the emergency file
[01:18] <clarkb> After the by-hand revert everything started working again. Current thoughts: the .ssh/config is either bad because mina can't parse it for some reason or it is completely ignored by mina. And the reason that moving the new key in as id_rsa may not have worked is I didn't also move the pubkey to id_rsa.pub
[01:18] <clarkb> On Monday we'll want to roll back things more properly so that the server can come out of the emergency file and then test how to make this work more reliably
[01:19] <clarkb> Finally I noticed errors with my test change that appear to be related to cutting indexing of a new patchset short due to stopping gerrit. Pushing a new patchset to the change fixed this but presumably so would an explicit online reindex request for the project in question
[01:19] <clarkb> fungi: thank you for the help
[01:19] <fungi> also is the revert
[01:20] <fungi> if that merges, presumably we can take the server out of the emergency disable list?
[01:21] <clarkb> fungi: yes I think so because then we won't try to rewrite .ssh/config
[01:21] <clarkb> but we'll need to manually clean up the new key and .ssh/brokenconfig
[01:21] <fungi> ah, correct
[01:22] <clarkb> I'm beginning to think that the way to roll forward will likely end up being a manual move of the new key to id_rsa and after backing up both files for ease of manual reverting. Restart gerrit and then if that works just update our config management to overwrite the old key with the new key data
[01:22] <clarkb> basically don't try to have A and B keys since .ssh/config selection seems iffy
[01:22] <clarkb> but we can attempt to do more in-depth testing next week before making any decisions
[01:23] <fungi> but for now, your dinner grows ever colder
[01:23] <clarkb> yes I can smell it :)
[01:23] <fungi> and i have a cold beer and video games calling my name
[01:37] <clarkb> I couldn't help myself so I started to rtfs
[01:37] <clarkb> MINA appears to do a regex match
[01:37] <clarkb> not a glob match
[01:37] <clarkb> a small but very important difference in behavior
[01:40] <clarkb> there is also port matching going on so we may need to deal with ports too
<opendevreview> Merged opendev/system-config master: Add debugging info to certcheck list building
[13:21] <fungi> wow, regex instead of glob sounds like they either completely misinterpreted what openssh does or just... didn't care
[19:51] <Clark[m]> Ya I mean reading the code I'm like 95% sure that gitea[0-9]* would work now. But we should test anyway
[19:52] <Clark[m]> And then assuming that is the solution I get to go update more docs
[19:53] <fungi> yeah, i believe it, even as disappointing as that belief is at an existential level
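Why the regex-versus-glob difference matters: the same Host pattern text means different things under the two readings. The sketch below is an illustration only, not MINA's code. It compares Python's `fnmatch` (an approximation of the shell-style wildcards that ssh_config(5) documents for Host patterns) with reading the same text as a regular expression that must match the whole host name. The host name is hypothetical, and exactly how MINA turns a Host entry into a regex is not shown in this log.

```python
import fnmatch
import re

# Hypothetical host name, used only for illustration.
HOST = "gitea09.opendev.org"


def glob_match(pattern: str, host: str) -> bool:
    # Shell-style wildcards: '*' is any run of characters, '?' is one
    # character, '.' is a literal dot. (fnmatch also accepts [...] sets,
    # which ssh_config(5) does not document.)
    return fnmatch.fnmatchcase(host, pattern)


def regex_match(pattern: str, host: str) -> bool:
    # The same text read as a regular expression: '*' repeats the
    # previous character and '.' is any character. Anchored to the
    # whole host name.
    return re.fullmatch(pattern, host) is not None


PATTERNS = [
    "gitea*",              # glob: matches; regex: no ("gite" then zero or more "a")
    "gitea*.opendev.org",  # glob: matches; regex: no
    "gitea.*",             # glob: no (needs a literal "." after "gitea"); regex: matches
    "gitea*:",             # neither: a stray trailing ":" never matches a bare host name
]

for pattern in PATTERNS:
    print(f"{pattern:20} glob={glob_match(pattern, HOST)} regex={regex_match(pattern, HOST)}")
```

Whichever reading MINA uses, a Host value that behaves the same both ways (for example a literal host name) sidesteps the ambiguity; the actual behavior still needs confirming on a held test node.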
<opendevreview> Clark Boylan proposed opendev/system-config master: Reapply "Switch Gerrit replication to a larger RSA key"
<opendevreview> Clark Boylan proposed opendev/system-config master: Reapply "Switch Gerrit replication to a larger RSA key"
[22:27] <clarkb> Forgot to add the forced test failures. I'm going to put holds in place for the gitea and gerrit jobs and see if we can make gerrit replicate to the held gitea that way
[22:31] <clarkb> I think I can set up replication to replicate the test repo in gerrit to any one of the gitea repos that is empty using a direct mapping in the replication config. Then as long as I update /etc/hosts and ssh authorized keys in gitea it should be a valid test of the .ssh/config
