*** elodilles_pto is now known as elodilles | 06:54 | |
tkajinam | seems something wrong with authentication in gerrit and now the browser is redirected to https://review.opendev.org/SignInFailure,SIGN_IN,Local+signature+verification+failed after login | 07:27 |
tkajinam | I'm not facing the problem with my browser on this laptop (because of remaining sessions) but I hit this on a different laptop where the login session was purged | 07:27
zigo | I can't login either ... :/ | 07:34 |
StutiArya[m] | I need a solution: I am working on the RHOSP platform, where I want to run the networking-infoblox plugin following its documentation. The RHOSP VM is also used by another (QA) team and we are having a conflict over the neutron ipam driver configuration. The other (QA) team needs the neutron ipam_driver left at its default value, but the Infoblox plugin needs it set to 'infoblox'. I am looking for a way where I | 07:35
StutiArya[m] | can achieve both neutron configurations. | 07:35
zigo | ianw: Do you know what's going on? | 07:38 |
frickler | infra-root: ^^ I can confirm the issue, no idea what's happening, nothing obvious to me in gerrit logs | 08:02
frickler | login to ubuntu.one and accessing lp works fine | 08:02 |
frickler | StutiArya[m]: this channel is for collaborating on the opendev infrastructure. you may want to ask in the neutron channel or on some Red Hat support platform | 08:07
frickler | o.k., after filtering out a huge number of key exchange warning messages, I see this: ERROR com.google.gerrit.httpd.auth.openid.OpenIdServiceImpl : OpenID failure: Local signature verification failed | 08:15 |
frickler | but since we didn't change anything, likely an issue with lp/u1? | 08:15 |
frickler | first timestamp I see this is 2024-01-22T22:43:30.668Z | 08:24 |
tonyb | frickler: would you like some help? even if it's just someone to bounce ideas off? | 08:25 |
frickler | tonyb: sure, anything. I've noticed someone already created https://bugs.launchpad.net/canonical-identity-provider/+bug/2050855 and I asked in #launchpad about it now. maybe jamespage also can check internally? | 08:30 |
frickler | I've tried registering a new account with ubuntu.one and logging in with that one, but that seems to give the same result | 08:32 |
priteau | frickler: I am also seeing the login issues here. Would it be worth sending a notification via IRC and Mastodon? | 08:44 |
frickler | priteau: good idea | 08:44 |
frickler | status notice all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient | 08:45 |
frickler | priteau: tonyb: ^^ does that sound good? | 08:45 |
frickler | fwiw people in #launchpad gave me an email address to contact for ubuntu one support, which I'm doing now | 08:46 |
frickler | #status notice all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient | 08:53 |
opendevstatus | frickler: sending notice | 08:53 |
-opendevstatus- NOTICE: all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient | 08:53 | |
opendevstatus | frickler: finished sending notice | 08:56 |
priteau | Thanks frickler! | 08:56 |
tonyb | frickler: some quick research indicates that it could be time related but AFAICT the time on review02 matches the time on my laptop so if that is it, it's on the LP side .... | 08:59 |
frickler | tonyb: the time on login.ubuntu.com also looks not too far off according to http headers | 09:04 |
tonyb | Okay so probably not that | 09:04 |
frickler | I've had the root mailbox on CC: for my mail but their ticket system seems to have ignored that at least in their auto-reply. will keep bouncing any further updates if/when they arrive | 09:06 |
frickler | I guess I should also look into that mutt multi-account setup. although I don't like switching inboxes, maybe I'll just add another tab | 09:08
tonyb | Okay, I don't see anything in the roo mailbox | 09:13 |
tonyb | *root | 09:13 |
frickler | oh, that's @openstack.org, I used @opendev.org instead. wondering where that ends up, it didn't bounce at least | 09:18 |
tonyb | frickler: any objections to me increasing the logging in org.openid4java (WARN) and com.google.gerrit.httpd.auth.openid (INFO) to DEBUG to try and get some more details? | 10:15 |
frickler | tonyb: how would you do that? I would like to avoid any restart as that would likely worsen the situation for people whose login is currently still working. not sure if spinning up a review99 would help? | 10:17
tonyb | https://gerrit-review.googlesource.com/Documentation/cmd-logging-set-level.html | 10:19 |
tonyb | *should* be safe to run on prod? | 10:20 |
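The command tonyb links would look roughly like this over Gerrit's admin SSH interface; treat it as a sketch (admin SSH access assumed, class names taken from the discussion above):

    # raise the two loggers discussed above to DEBUG
    ssh -p 29418 review.opendev.org gerrit logging set-level debug org.openid4java
    ssh -p 29418 review.opendev.org gerrit logging set-level debug com.google.gerrit.httpd.auth.openid
    # restore the previously configured levels afterwards
    ssh -p 29418 review.opendev.org gerrit logging set-level reset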
frickler | tonyb: sounds ok to me | 10:20 |
* frickler needs to go prep some food, bbiab | 10:25 | |
tonyb | okay | 10:25 |
frickler | hmm, those logs don't show anything useful to me. that may be due to my complete ignorance of how openid actually works though | 10:50 |
tonyb | Yeah I'm digging into both. I *think* it's at the other end but I'd like to validate that. | 10:52 |
frickler | fwiw the initial mail address I sent things to was wrong, I've resent to the new address, but no response except the auto-reply yet | 11:14
tonyb | frickler: Okay. I think I've satisfied myself that we are indeed receiving invalid signatures from Ubuntu-One/Launchpad. | 11:39 |
tonyb | frickler: the java code (specifically openid4java), sadly doesn't really do dynamic logging the way we'd like so we only get some of the potential additional output. | 11:40 |
tonyb | I've put the log levels back the way they were | 11:41 |
tonyb | I see your email and the auto-reply in the root mailbox | 11:42 |
swest | trying to log into opendev gerrit I'm currently getting redirected to https://review.opendev.org/SignInFailure,SIGN_IN,Local+signature+verification+failed which gives a 404 "Not Found" | 11:43 |
tonyb | If it's still broken when I get up tomorrow I'll try to create a more self-contained test case | 11:43 |
tonyb | swest: Yes it's a known issue, no ETA on a resolution yet | 11:43
swest | ack, thanks | 11:43 |
tonyb | swest: You can still interact with gerrit via ssh and anonymous APIs but there is a bit of a learning curve to that | 11:44 |
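For reference, the kind of non-webui interaction tonyb means looks roughly like the following; USERNAME, CHANGE and PATCHSET are placeholders:

    # vote on a change over the SSH API (works while web logins are broken)
    ssh -p 29418 USERNAME@review.opendev.org gerrit review --code-review +1 CHANGE,PATCHSET
    # anonymous REST queries need no login at all
    curl -s 'https://review.opendev.org/changes/?q=status:open+project:opendev/system-config&n=5'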
tonyb | frickler: I'm heading to bed #good_luck | 11:45 |
frickler | tonyb: thx, gn8 to you | 11:46 |
fungi | just waking up now, but ubuntuone sso goes on the fritz from time to time. usually the admins for that correct the problem within a few hours though | 13:02
fungi | looks like it's been going on for about six hours this time though? (just based on reports in irc... maybe longer but i haven't checked the logs on gerrit to see if there were significantly earlier occurrences) | 13:23 |
fungi | aha, 22:43 was mentioned | 13:25 |
fungi | so nearly 14 hours now | 13:25 |
fungi | er, nearly 15 | 13:25 |
frickler | yes, also no feedback to my ticket or the LP issue so far | 13:30 |
jrosser | i'm trying to use one of the 32G sized nodes here but failing - is there something obvious i've done wrong? https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/905199 | 13:43 |
frickler | jrosser: doesn't look wrong to me at first sight, will check nodepool logs | 13:48 |
frickler | jrosser: ah, ubuntu-jammy-32GB vs. ubuntu-jammy-32G, needed cleaned glasses to spot that ;) | 13:56 |
jrosser | ahha! well spotted! | 13:56 |
jrosser | thankyou :) | 13:57 |
frickler | yw | 13:59 |
yoctozepto | ah, so you know SSO is broken | 14:08 |
opendevreview | Merged openstack/project-config master: Add eventlet to projects available from github https://review.opendev.org/c/openstack/project-config/+/906071 | 14:12 |
fungi | yoctozepto: yep, we spammed the irc channels that subscribe to our statusbot, which also gets posted to https://fosstodon.org/@opendevinfra/ | 14:19 |
fungi | and i responded to a thread that was started on the service-discuss ml about it as well | 14:20 |
yoctozepto | yeah, I have just come to see if you know | 14:34 |
yoctozepto | and you know | 14:34 |
yoctozepto | perfect | 14:34 |
zigo | Time to get away from lp and find our own alternative? :) | 14:49 |
jamespage | frickler: apologies working odd hours this week - I'll go see what I can find out | 15:03 |
fungi | zigo: we've got a little progress on https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html but it's currently stalled behind not enough volunteers and also switching out our keycloak container images in https://review.opendev.org/905469 | 15:04 |
fungi | it's a goal of mine to make some forward progress on it this quarter, but that will all depend on what other fires erupt in the coming weeks | 15:07 |
* ajaiswal[m] uploaded an image: (25KiB) < https://matrix.org/_matrix/media/v3/download/matrix.org/JxbJGBFHSpCZkxKITWUpvvJd/image.png > | 15:08 | |
ajaiswal[m] | Hi, I am unable to log in to gerrit, any help? | 15:08
fungi | ajaiswal[m]: something's wrong with the ubuntuone sso identity provider we rely on, we've reported it to their admins and are hoping to hear something back soon | 15:09 |
fungi | ssh and authenticated rest api access are still working, just not the interactive webui | 15:10 |
fungi | if your session expires anyway, that is | 15:10 |
frickler | response from canonical in https://bugs.launchpad.net/bugs/2050855 , so I guess we need to do some further debugging or consider alternatives | 15:39 |
clarkb | are other services failing too? | 15:54 |
fungi | i do seem to be able to log out of wiki.openstack.org and then log back in with my ubuntuone id. also same with storyboard.openstack.org, so at least some of our systems do seem unaffected, and whatever's happening does appear to just be impacting gerrit this time | 15:54 |
clarkb | ok and gerrit itself reports the same version number | 15:54 |
fungi | sorry for not checking those earlier, i thought they had already been tested and were seeing the same problems but apparently not | 15:54 |
clarkb | possible update to apache maybe impacting how the redirect is parsed in gerrit? Another thing is maybe the all-users repo got corrupted somehow? | 15:55 |
clarkb | however, people are getting redirected to what appear to be the correct ubuntu one locations implying external ids are fine | 15:55 |
clarkb | that makes an all-users problem seem unlikely to me | 15:56 |
fungi | nothing new in dpkg.log since 2024-01-19 06:29:53 | 15:56 |
fungi | so it doesn't appear to be any system package updates on review.o.o | 15:56 |
clarkb | the openid path is: you click login in gerrit, which redirects you to ubuntu one. At ubuntu one you log in and, if successful, they are supposed to redirect you to a path supplied in the original gerrit -> ubuntu one redirect. Our server logs should show those redirect paths and we should be able to check them for sanity? | 15:58
clarkb | I'm wondering if we're providing a bad post auth endpoint to ubuntu one somehow | 15:58 |
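A sketch of how that sanity check could be done from the Apache access log (the log path matches the one used later in this discussion; the field names are standard OpenID request parameters):

    # openid.return_to is the post-auth endpoint gerrit hands to ubuntu one
    grep -o 'openid.return_to=[^&]*' /var/log/apache2/gerrit-ssl-access.log | sort | uniq -c
    # openid.mode=id_res marks the provider's responses coming back to gerrit
    grep 'openid.mode=id_res' /var/log/apache2/gerrit-ssl-access.log | tail -n 5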
clarkb | we also got a notice about cert expiration for review.o.o again | 16:01 |
clarkb | makes me wonder about apache | 16:01 |
clarkb | I'm wary of logging out myself to test things yet. I've also got a series of meetings all morning long | 16:05 |
fungi | vhost files haven't changed since 2022 | 16:05 |
clarkb | But ideas I'll throw out: check apache worker process staleness if for no other reason than to address the certcheck error. Next trace an auth request path and ensure that the redirects from both sides look correct (we might be able to do this through server logs and not need client side tracing) | 16:06 |
*** jonher_ is now known as jonher | 16:06 | |
*** mmalchuk_ is now known as mmalchuk | 16:06 | |
fungi | in fact, nothing in /etc/apache has a recent mtime | 16:06 |
clarkb | ya I'm thinking more of the running processes | 16:06 |
clarkb | since the certcheck error indicates a potentially very long running and stale worker or workers. Unlikely to be the source of our issue though | 16:07 |
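A minimal sketch of that staleness check, comparing worker start times against the certificate's modification time (the certificate path is an assumption, not taken from this log):

    ps -o pid,lstart,cmd -C apache2
    sudo stat -c '%y %n' /etc/letsencrypt/live/review.opendev.org/fullchain.pem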
fungi | if you have a secondary test account you can use a ff account container with that | 16:07 |
clarkb | I don't, but maybe I need to create one | 16:07
fungi | i tested logging in with my test account in a separate account container and got the expected error | 16:08 |
*** jonher_ is now known as jonher | 16:08 | |
fungi | i can restart apache if we think that might be a problem | 16:09 |
clarkb | fungi: I think we want to trace the entire http redirect path for that process. on the initial redirect from gerrit to ubuntu one there should be an included redirect path for post auth back to gerrit. I think we want to check that value on its way to ubuntu one and on its way back to gerrit | 16:09 |
clarkb | I doubt apache's stale worker is at fault since the cert is still valid | 16:10 |
clarkb | more just calling it out as a less than ideal state alongside this | 16:10
fungi | infra-prod-base started to fail again on sunday: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-base&project=opendev/system-config | 16:11 |
fungi | i'll track down the logs for it | 16:11 |
fungi | mirror02.iad3.inmotion.opendev.org : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0 | 16:11 |
fungi | i'll add it to the disable list | 16:12 |
clarkb | thanks | 16:12 |
fungi | next infra-prod-base run should succeed hopefully and we'll get cert updates, but i think this means those warnings were entirely unrelated to the openid issue | 16:13 |
clarkb | other thoughts while I'm having them: something browser specific either a specific browser or browser version or some new "make the internet more secure and less useable" functionality like the hsts stuff that has hindered our testing of held nodes | 16:13 |
clarkb | fungi: the review warning wasn't caused by that failure though | 16:14 |
fungi | right | 16:14 |
clarkb | I still think it is highly unlikely that the warning is related to the issue but the cert did refresh for review a few days back and we're still getting intermittent errors which points to apache workers that haven't restarted to pull in the new cert | 16:14 |
fungi | and yeah, i'm seeing a plenty recent cert from review.o.o | 16:15 |
fungi | Not After : Apr 17 02:10:50 2024 GMT | 16:15 |
fungi | so it was refreshed about a week ago | 16:16 |
clarkb | additional thought: hold a zuul deployed gerrit test node and test if this occurs there | 16:19 |
clarkb | if this is system state specific we may be able to identify it through comparison with a test node | 16:19 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 16:23 |
fungi | autohold set, but that will take some time. we can check other avenues in the interim | 16:25 |
clarkb | re tracing I see people get redirected to a failure url. Does that redirect come from ubuntu one or gerrit after the ubuntu one redirect back to gerrit? | 16:31 |
fungi | i don't notice any interim redirect | 16:33 |
clarkb | git grep in gerrit shows it may be from gerrit as it constructs the url in this file: java/com/google/gerrit/httpd/auth/openid/OpenIdServiceImpl.java | 16:35 |
clarkb | I don't see the appended error message yet | 16:37
clarkb | https://gerrit.googlesource.com/gerrit/+/refs/heads/stable-3.8/java/com/google/gerrit/httpd/auth/openid/OpenIdServiceImpl.java#265 this is the fail case we hit I think | 16:40 |
clarkb | https://github.com/jbufu/openid4java/blob/4ea8f55c6857d398cbb5033d12a3baa98843be59/src/org/openid4java/consumer/ConsumerManager.java#L1776-L1797 seems to be where the error originates and as the lp issue hints at has to do with associations | 16:45 |
frickler | yes, if you look at the debug logging that tonyb did earlier, you can see some references to that. do "grep -v sshd-SshDaemon /home/gerrit2/review_site/logs/error_log|less" and look for DEBUG at around 11:28 | 16:50 |
frickler | also not very comforting to look at the age of that library. but none of that explains why it should start failing out of the blue | 16:51 |
fungi | embedded trust store with a ca cert that suddenly expired? | 16:52 |
clarkb | associations appear to be a relationship between the gerrit server and ubuntu one that involve hashes of some sort to verify the responses from ubuntu one to gerrit | 16:54 |
clarkb | these are stored in an in memory db if the variable naming in that library is to be trusted | 16:54 |
clarkb | there is also mention of expiration in comments. | 16:54 |
clarkb | thinking out loud here: is it possible we restarted gerrit at $time and loaded an association with hashes that expired recently? | 16:55
fungi | i'm trying to work out when the gerrit process last started | 16:55 |
clarkb | if that theory above is the cause then I would expect the held node to load new valid associations | 16:58 |
fungi | 42d18h ago | 17:00 |
clarkb | fungi: 6 weeks ago according to docker ps | 17:00 |
fungi | yep | 17:00 |
fungi | so not all that long ago in the grand scheme of things. wish ps didn't make that so hard to figure out accurately | 17:01 |
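Two less ambiguous ways to pin down the start time, assuming the container is simply named gerrit (adjust to whatever docker ps actually shows):

    # container start timestamp straight from docker
    docker inspect -f '{{.State.StartedAt}}' gerrit
    # or the full start time of the oldest matching java process
    ps -o lstart= -p "$(pgrep -of java)"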
clarkb | no but if they recycle the hashes annually or something like that we'd maybe start and load the association with a long enough period then hit it not too long later? | 17:01 |
clarkb | I think we should figure out what the association endpoint is for ubuntu one and see if we can inspect the hashes and their expiry dates directly | 17:01 |
clarkb | to get a better feel for what would be in this database | 17:02 |
clarkb | then we can also test with the held node if it is able to login with ubuntu one happily after loading current association data. | 17:02 |
clarkb | I'm wary of just going ahead and restarting gerrit right now in prod to test this that way as this may reset all the sessions people have and create a bigger issue for us if that wasn't the solution | 17:02 |
frickler | ack, that's why I was wary of doing a restart attempt earlier, too | 17:03 |
fungi | agreed | 17:04 |
clarkb | https://github.com/jbufu/openid4java/blob/openid4java-1.0.0/src/org/openid4java/consumer/ConsumerManager.java#L684-L853 this is the code that loads/builds the association | 17:04 |
clarkb | unfortunately not small :) | 17:04 |
fungi | ugh, the dnm change above is on its second attempt building gerrit images | 17:04
fungi | both the 3.8 and 3.9 image builds are on their "2nd attempt" | 17:05 |
fungi | E: Failed to fetch https://mirror-int.ord.rax.opendev.org/ubuntu/dists/jammy-security/main/binary-amd64/Packages.gz File has unexpected size | 17:07 |
clarkb | fungi: in your DNM change we can stop the image build from running. In fact that will help us reproduce more closely 1:1 as we won't get an updated image | 17:08
clarkb | (I would just remove the job from the check jobs list as it should be a soft dep) | 17:08 |
fungi | the error above should never happen with the way we do reprepro with afs | 17:08 |
clarkb | fungi: we've seen it happen before iirc | 17:08 |
fungi | yeah, i can strip down the jobs, should have done that before | 17:09 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 17:11 |
fungi | now only system-config-run-review-3.8 is waiting | 17:11 |
fungi | and now it's queued | 17:12 |
clarkb | https://login.ubuntu.com/.well-known/openid-configuration this doesn't seem to work. Not sure what the actual endpoint should be yet | 17:15 |
fungi | and now it's running | 17:16 |
clarkb | when you load the in memory association structure the first thing it does is remove expired entries | 17:17 |
clarkb | really beginning to suspect either we're not removing expired entries, or we are but then not loading the new valid stuff | 17:17
fungi | clarkb: did i remove too much in the dnm change? the job only got as far as "pull-from-intermediate-registry: Load information from zuul_return" which failed | 17:21 |
fungi | do i need to leave the soft dependencies on that job? | 17:21 |
clarkb | yes I think we need to keep the dependencies in place we just don't run the job that builds a new image | 17:22 |
clarkb | otherwise all the speculative container image stuff doesn't have enough info to run | 17:22
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 17:24 |
fungi | oh, wait, keep the registry job too then | 17:25 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 17:26 |
clarkb | yes otherwise you have to undo all the speculative container building framework stuff | 17:26 |
clarkb | we just want the job to not run then the speculative framework stuff can do its thing and decide to use the image we have published rather than a new speculative image | 17:27 |
clarkb | https://openid.net/specs/openid-authentication-1_1.html#mode_associate ok I think unlike openid connect this doesn't rely on a well-known endpoint. Instead you just post to the openid endpoint to set up an association, then the other end sends back the association handle, shared key material and expiry data | 17:28
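A hedged sketch of that association handshake against what is assumed to be Ubuntu One's OpenID endpoint; the unencrypted session type is used purely for illustration and the provider may insist on a Diffie-Hellman session instead:

    curl -s https://login.ubuntu.com/+openid \
      -d 'openid.ns=http://specs.openid.net/auth/2.0' \
      -d 'openid.mode=associate' \
      -d 'openid.assoc_type=HMAC-SHA256' \
      -d 'openid.session_type=no-encryption'
    # a successful response carries assoc_handle, expires_in and an encoded mac_key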
clarkb | reading the code gerrit calls the authenticate() method in that ConsumerManager class. The authenticate method calls associate which should refresh the association db pruning expired entries | 17:40 |
clarkb | I'm also finding the github code viewer does not work very well | 17:41 |
fungi | argh, it didn't run the registry job even though i included it, but then complained that the registry job didn't run and was required by the gerrit test job | 17:41 |
clarkb | we also know that the association is not null because the error we hit is within a block with a valid association | 17:41 |
fungi | do i need to find a way to force the registry job to run when no images are built? | 17:41 |
clarkb | I thought it would run regardless of building images | 17:43 |
fungi | "Unable to freeze job graph: Job system-config-run-review-3.8 depends on opendev-buildset-registry which was not run." | 17:44 |
fungi | but i kept "- opendev-buildset-registry" in the check pipeline jobs | 17:44 |
fungi | i don't see any conditions on the opendev-buildset-registry job definition in opendev/base-jobs either | 17:47 |
clarkb | fungi: its working now | 17:51 |
clarkb | I think you were looking at an older patchset and not the current one? | 17:51 |
fungi | oh! okay, i was looking at an old error ;) | 17:51 |
fungi | yep, that looks better | 17:51 |
clarkb | looking at apache logs you can see the assoc_handle values | 17:53 |
clarkb | we had a value that flipped over and then this all started | 17:53 |
clarkb | the same value continues to be used and fails now. | 17:53 |
clarkb | So we've cached bad data potentially or ubuntu one has | 17:53 |
fungi | in which case probably the held test node will "just work" | 17:54 |
clarkb | but importantly this shows we aren't refreshing the association on every verification request likely because we're still considered valid so it is short circuiting somewhere? | 17:54 |
clarkb | fungi: ya I mean assuming the upstream assertion that nothing has changed is correct then all I can imagine is that something got corrupted somehow | 17:54
clarkb | the hash type is the same before and after the problem occurs | 17:55 |
clarkb | I'm trying to figure out if this association ever worked | 17:59 |
fungi | held node is 104.130.169.176 | 17:59 |
fungi | argh, we don't use openid on test deployments | 18:00 |
clarkb | the first use of the assoc handle is at 22:43:30 and the first signing failure is at 22:43:32 given the timing I believe this association simply never worked | 18:00 |
clarkb | fungi: we don't but its easy to modify to make it so | 18:01 |
clarkb | fungi: you also may need to set up /etc/hosts so that whatever redirects are used point at the test server | 18:01 |
clarkb | that all happens in your browser so /etc/hosts should be sufficient | 18:01 |
clarkb | I think we know this: we got a new association at 22:43:30 and it appears to have never worked | 18:01
fungi | yeah, i have "104.130.169.176 review.opendev.org" in /etc/hosts already | 18:01 |
clarkb | Prior to that the existing association was fine. They also appear to use the same hash type | 18:02 |
fungi | frickler: saw a "ERROR com.google.gerrit.httpd.auth.openid.OpenIdServiceImpl : OpenID failure: Local signature verification failed" at 2024-01-22T22:43:30.668Z as the first occurrence, yeah | 18:03 |
clarkb | I have double checked the hash types are the same | 18:03 |
clarkb | so this isn't a change in hash type we aren't able to process | 18:04 |
clarkb | I think all this info lends weight to the idea that either our association db is corrupt or ubuntu one's is. Since we appear to store associations in memory only, if we restart gerrit that should create a new association which hopefully works. tl;dr let's continue pushing down the path of testing with the test server | 18:05
fungi | i just need to add "type = OPENID_SSO" and the openIdSsoUrl to [auth] in place of the DEVELOPMENT_BECOME_ANY_ACCOUNT yeah? | 18:06 |
clarkb | I think so | 18:06 |
clarkb | then when you start up you should be able to log in as you would in prod | 18:06
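Since gerrit.config is git-config format, the [auth] tweak fungi describes can be expressed as follows; the site path is assumed to match the layout seen earlier in the error_log path:

    git config -f /home/gerrit2/review_site/etc/gerrit.config auth.type OPENID_SSO
    git config -f /home/gerrit2/review_site/etc/gerrit.config auth.openIdSsoUrl https://login.ubuntu.com/+openid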
fungi | it's coming up with those adjustments now | 18:06 |
fungi | yeah, it works | 18:07 |
clarkb | it works as in you are able to fully log in to the test server using ubuntu one? | 18:07
fungi | prompts me to set my displayname and username, then i'm logged in | 18:07 |
fungi | yes, ubuntuone sso openid | 18:07 |
clarkb | cool I want to check the apache logs on the held server to see if hash types align and otherwise look like prod currently | 18:08
fungi | with my fungi-three test account in a dedicated ff account container | 18:08 |
fungi | same thing that was generating an error with the prod gerrit | 18:08 |
clarkb | if you grep for assoc_handle in the gerrit-ssl-access.log you'll see the handle info | 18:09 |
clarkb | and ya that appears to match with prod. Same hash type anyway | 18:09
prjadhav | Hi need help regarding openstack zed issue | 18:11 |
prjadhav | I am trying to start instance but getting error as Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Unable to establish connection to https://controller/identity | 18:12 |
clarkb | fungi: as a final sanity check we have a non-null openid.sig value in both prod when it fails and your test node, and in old logs from prod when it succeeds. Which is the other half of the thing we're checking against | 18:12
clarkb | prjadhav: we run the development tools that the openstack project uses to build the software but aren't really experts in how to run openstack. There are better venues for that like #openstack or the openstack-discuss mailing list | 18:13 |
fungi | prjadhav: this is the channel where we discuss services run as part of the opendev collaboratory. you probably want the #openstack channel or the openstack-discuss@lists.openstack.org mailing list | 18:13 |
prjadhav | even though I hash commented the service_user section under nova.conf | 18:13 |
prjadhav | anyone faced this issue? | 18:13 |
fungi | prjadhav: you're asking in the wrong place, so you're unlikely to get any answers on that topic | 18:13 |
prjadhav | ok sorry | 18:14 |
prjadhav | Thanks for pointing | 18:14 |
clarkb | fungi: I'm not sure what else we should be checking. These associations appear to live for many days so this isn't something that will automatically clean itself out. I also don't see a way to force the gerrit openid implementation to rebuild associations short of a restart but there may be a way we can do that | 18:14
clarkb | fungi: oh I know what we can check. Can you restart the test server and log in again and we can check a new association value is made? | 18:15 |
fungi | yep, on it | 18:15 |
fungi | it's coming back up now | 18:15 |
fungi | seems to leave me logged in. should i log out and back in again? | 18:16 |
clarkb | yes | 18:16 |
fungi | clarkb: done | 18:16 |
clarkb | fungi: the assoc_handle values do seem to differ | 18:17 |
clarkb | though quite similar which may just be coincidence | 18:17 |
fungi | okay, so it generates a new one at restart | 18:17 |
fungi | i can log out and in again to see if it changes, also restart yet again and see if it's still similar | 18:17 |
clarkb | I think so. Please double check by grepping assoc_handle in the apache logs | 18:17
clarkb | ++ to both things | 18:17 |
clarkb | side note: ps confirms an apache worker older than the cert update | 18:18
fungi | okay, that explains the e-mail at least | 18:18 |
clarkb | I'm going to need to take a break soon as I haven't had breakfast yet and meetings will have me busy well through the lunch hour here | 18:20 |
clarkb | but lets validate the log out and log back in then restart and log out and log back in data first | 18:20 |
fungi | using this to filter it: | 18:21 |
fungi | sudo grep assoc_handle /var/log/apache2/gerrit-ssl-access.log|sed 's/.*assoc_handle=%7BHMAC-SHA256%7D%7B\(.*\)%7D&openid.*/\1/' | 18:21 |
fungi | stays constant when i log out and in again without restarting | 18:22 |
*** osmanlicilegi is now known as Guest16 | 18:22 | |
clarkb | yup I see that | 18:22 |
clarkb | which also mimics prod behavior and seems to indicate we're caching associations | 18:22 |
fungi | and it changes again when i restart and log out/in | 18:23 |
clarkb | agreed. Though still similar so maybe the handle itself isn't a hash | 18:24 |
clarkb | but instead refers to some hash type info that isn't being directly logged | 18:24 |
fungi | the first field looks like a counter and the second field seems to be a base64-encoded token of some kind | 18:24 |
clarkb | which would make sense if that info is sensitive which I believe it is | 18:24 |
clarkb | oh maybe the second value is the hashed info | 18:25
fungi | basically everything after the %7B (the repeated %3D at the end is what leads me to expect something like base64) | 18:25 |
clarkb | ah yup == | 18:26 |
clarkb | the spec also refers to base64 for some fields so that is very likely | 18:27 |
fungi | anyway, if we thing the assoc_handle us bad and that getting a fresh one may solve this, restarting gerrit is probably the fastest way there | 18:28 |
clarkb | alright so where we're at is that an association handle change occurred which seems to imply we updated associations (likely due to an expiry of the old association). These changes happen many days apart so are not frequent | 18:28
fungi | er, think the assoc_handle is bad | 18:28 |
clarkb | It seems that restarting gerrit creates a new handle | 18:28 |
clarkb | I've confirmed the gerrit docker images are the same on both the test node and the prod node ruling out something specific to code versions | 18:29 |
fungi | which implies the handle expired around 42 days of age | 18:29 |
clarkb | fungi: or maybe we've successfully rotated them weekly or something | 18:29 |
fungi | 42 days is suspiciously close to 1000 hours | 18:30 |
clarkb | and only hit the problem with a corrupt association this time? | 18:30
clarkb | but ya I haven't dug through logs far enough to figure out how long the old one may have been used for | 18:30
clarkb | it was in use as far back as our current log files record. We'd have to go to backups to look further | 18:31
clarkb | frickler tonyb (no corvus here?) would be good if you can weigh in on the idea that restarting the service will create a new association which hopefully works | 18:31
clarkb | I'm going to take that break now otherwise I may not get another chance for some time | 18:31 |
fungi | yeah, i'm on hand to restart the gerrit container as soon as we have consensus (explicit or tacit) | 18:32 |
clarkb | thinking out loud here: if we wanted to investigate further before a restart we could see if anyone at the ubuntu one side is willing to compare notes on these values to see if we did end up with corruption somehow | 18:32
clarkb | other than that I'm not sure what we can do | 18:33 |
fungi | which can also be done after the fact, we still have the data | 18:33 |
clarkb | ya I think the main risk with a restart is that it is common for restarts to invalidate user sessions | 18:34 |
clarkb | so if we don't fix it through the restart we'll have made the problem much more noticeable | 18:34 |
fungi | i haven't actually observed gerrit restarts invalidating my login sessions, fwiw | 18:40 |
frickler | reading backlog after tc meeting now | 18:40 |
frickler | I think that used to happen with older gerrit, but it has not happened recently | 18:41
frickler | given the testing I'd be fine with doing a gerrit restart. though I'm biased because I don't have a working session left after my earlier testing and will also not be around for the next 12h or so anyway ;) | 18:42 |
frickler | doing rechecks or +3 via cli is fine, but I didn't look into the details of doing file based comments yet | 18:47 |
clarkb | I think that may require the http rest api. I seem to recall that zuul can't do inline comments without the rest api | 18:48 |
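A sketch of what an inline (file) comment via the REST API looks like; it needs an HTTP password generated in the Gerrit settings page, and the change number, file path and line are placeholders:

    curl -s -u USERNAME:HTTP_PASSWORD \
      -H 'Content-Type: application/json' \
      -d '{"comments": {"path/to/file.py": [{"line": 42, "message": "nit: typo"}]}}' \
      'https://review.opendev.org/a/changes/CHANGE_NUMBER/revisions/current/review'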
clarkb | corvus indicates that his messages may not be reaching the channel but is also happy with a restart given the testing fungi performed | 18:54 |
*** corvus is now known as notcorvus | 18:54 | |
*** notcorvus is now known as corvus | 18:54 | |
frickler | there seemed to be issues with the matrix bridge in the tc channel, too | 18:56 |
tonyb | I have some scrollback to read before I could weigh in on that question | 18:57 |
fungi | okay, pending tonyb's feedback, i'm ready to restart the prod gerrit container at any moment (including during our meeting) | 18:58 |
* corvus twiddles thumbs | 18:58 | |
corvus | corvus vocalizes? | 18:58 |
corvus | oh hey i have a voice | 18:58 |
clarkb | hello! | 18:59 |
corvus | i think some msgs will probably be lost, but my understanding is that fungi effectively made a review-dev server with our production config, and that allays my concerns that there might be some gerrit incompatibility contributing to the issue. so i'm +1 on restart. | 18:59
fungi | correct. thanks corvus! | 19:00 |
corvus | and yes, file comments require the rest api (patchlevel comments dont). gertty uses the rest api for that. | 19:00 |
clarkb | corvus: tonyb the main risk with a restart is that it may invalidate existing sessions for some/all users | 19:01 |
clarkb | so if we don't actually fix the issue then we're in a worse spot | 19:01 |
corvus | agreed. but i think we have high confidence that it's a transient, not a systemic issue. so my remaining concern is what if we don't "reset" things enough by doing a restart. maybe worth having some gerrit cache clearing commands ready. | 19:02 |
clarkb | cool just wanted to make sure that was apparent to people reviewing scrollback | 19:03 |
fungi | what gerrit caches do we generally clear? | 19:03 |
tonyb | clarkb: That's my understanding but I wanted to check what testing had been done | 19:04 |
clarkb | fungi: I don't think we've had to clear caches in a long time (upstream made a bunch of improvements to them and they act more reliably now iirc) | 19:04
fungi | tonyb: basically held a test deployment from the gerrit 3.8 test job, adjusted its config to use the same openid auth settings we do in prod, adjusted /etc/hosts to point review.opendev.org to that held node's ip address, then successfully logged into it with an ubuntuone openid | 19:06 |
corvus | mostly just suggesting that we look up what caches there are and how to clear them before restarting, so that if we end up in a worse place and have to do some voodoo we have that ready to go | 19:06 |
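The cache commands corvus has in mind would be along these lines (admin SSH access assumed; the specific cache name is only illustrative):

    ssh -p 29418 review.opendev.org gerrit show-caches
    ssh -p 29418 review.opendev.org gerrit flush-caches --list
    ssh -p 29418 review.opendev.org gerrit flush-caches --cache web_sessions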
fungi | clarkb also noted that the association hash is normally constant but changed at the same time as the errors started, and we've checked that restarting gerrit also clears out and chooses a new association hash | 19:06 |
fungi | which gives us a reasonably high confidence that the failures are related to association renewal and that restarting gerrit will force another fresh one | 19:07 |
clarkb | I guess if there is an openid associations cache claer command we could try that first | 19:07 |
clarkb | however this happens in the library external to gerrit caches so I don't think that is a thing | 19:07 |
fungi | sounds like it's simply held in memory? | 19:08 |
tonyb | All the Association caches are in memory if I read the code correctly | 19:08 |
clarkb | fungi: the variable and class names strongly imply that yes | 19:09 |
clarkb | but we know restarting seems to get us new ones so thats good | 19:09 |
fungi | right, i mean held in process memory, which goes away if we stop the process and so has to be created fresh by the new process at restart | 19:10 |
clarkb | yup | 19:10 |
fungi | tonyb: so anyway, have an opinion on the restart idea yet? i have to be on a conference call in about half an hour, so my available attention will be dropping sharply around that time for about an hour | 19:33 |
tonyb | #makeitso | 19:34 |
fungi | okay, restarting it momentarily | 19:34 |
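A minimal sketch of the restart itself, assuming the production compose file lives under /etc/gerrit-compose (path not confirmed in this log):

    cd /etc/gerrit-compose && sudo docker-compose down && sudo docker-compose up -d
    sudo docker ps --filter name=gerrit --format '{{.Names}} {{.Status}}'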
fungi | #status notice The Gerrit service on review.opendev.org will be offline momentarily for a restart, in order to attempt to restore OpenID login functionality | 19:35 |
opendevstatus | fungi: sending notice | 19:35 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily for a restart, in order to attempt to restore OpenID login functionality | 19:35 | |
opendevstatus | fungi: finished sending notice | 19:38 |
fungi | infra-root: bad news... i now get a different failure since the restart: "SignInFailure,SIGN_IN,Contact+site+administrator" | 19:38 |
tonyb | It works for me now | 19:40 |
frickler | wfm2 | 19:41 |
tonyb | I did a full UbuntuOne logout/login as part of the redirect dance | 19:41
frickler | maybe some startup issue? | 19:41 |
clarkb | ya I was going to say looking at the logs I think I'm seeing people able to login | 19:41 |
clarkb | perhaps related to fungi's /etc/hosts updates to hit the test server? | 19:41 |
fungi | yeah, possible my browser has something cached in that profile now | 19:43 |
tonyb | There are a few ERRORs but they may be somewhat normal? | 19:44 |
clarkb | yours are the only failures so far that I see | 19:44
clarkb | tonyb: some of the startup errors were due to not sanitizing the replication queue waiting list | 19:44
clarkb | so it blew up on all of those entries it can't move | 19:44 |
fungi | yeah, weird. given others are reporting success i'll try that with my real account outside the account container rather than my test account | 19:44 |
fungi | yeah, my proper account logged in fine | 19:45 |
fungi | oh, you know what? my test account might actually be set inactive | 19:46 |
fungi | i bet that's it. i think i was testing account settings on it in prod a while back | 19:46 |
fungi | i was able to use it to reproduce the earlier openid failure because gerrit never got far enough to validate the account was active | 19:46 |
tonyb | clarkb: I was seeing part of a nullpointerexception because of the grep I was using | 19:47 |
frickler | you should certainly contact a site admin then ;) | 19:47
fungi | frickler: yep, on it! he might be gone fishing already but i'll keep trying | 19:47 |
clarkb | frickler: phew I wasn't seeing log messages I would expect given the contact site admin message | 19:47 |
clarkb | tonyb: I think the plugin manager error is a null ptr exception | 19:47 |
clarkb | and taht one is expected | 19:47 |
clarkb | er fungi my last message to frickler was intended to be aimed at you :) | 19:47 |
* frickler sneaks off now to avoid more confusion | 19:48 | |
clarkb | frickler: good night! | 19:48 |
clarkb | Now should we write this down in the opendev manual? If openid logins to gerrit fail check whether the assoc_handle updated and immediately started to fail. If so restart gerrit to force a new openid association to be generated | 19:48
tonyb | clarkb: That's a good idea. | 19:50 |
fungi | i can't help but wonder if the bad association handle is also what's been responsible for people getting occasional openid errors on the mediawiki instance, though there it only occurs when the account needs to be created | 19:50 |
fungi | but if association handles are done per-account in mediawiki's openid plugin rather than how they're cached site-wide in gerrit... | 19:51 |
fungi | maybe? | 19:51 |
fungi | it's always seemed like some users randomly get a bad response from ubuntuone sso the first time they log into mediawiki, but some weeks later trying again with the same id works and an account gets created | 19:52 |
fungi | where by "bad response" i mean something the openid plugin for mediawiki can't handle | 19:53 |
tonyb | Sounds possible | 19:53 |
fungi | though does the association handle come from the id provider, or from the client? | 19:54 |
fungi | s/client/id consumer/ | 19:54 |
clarkb | fungi: from the server I think. The consumer does a post against the server and gets that info back | 19:56 |
clarkb | fungi: https://openid.net/specs/openid-authentication-1_1.html#mode_associate this was helpful when I was wrapping my head around it | 19:56 |
fungi | yeah, so maybe if something like 0.1% of the ones ubuntuone sso returns are broken in some way that the consumer can't actually use them... | 19:57 |
clarkb | could be | 19:58 |
fungi | anyway, speculation for now. i'll put some actual facts in the lp bug report as an update. we can probably status notice that things are back to normal? | 19:58 |
clarkb | ++ | 19:59 |
fungi | i've updated https://bugs.launchpad.net/canonical-identity-provider/+bug/2050855 | 20:01 |
fungi | #status notice OpenID logins for the Gerrit WebUI on review.opendev.org should be working normally again since the recent service restart | 20:01 |
opendevstatus | fungi: sending notice | 20:01 |
-opendevstatus- NOTICE: OpenID logins for the Gerrit WebUI on review.opendev.org should be working normally again since the recent service restart | 20:01 | |
opendevstatus | fungi: finished sending notice | 20:04 |
clarkb | so I don't forget we should restart apache on review too. but that is less urgent | 20:39 |
tonyb | I can do that | 20:45 |
fungi | thanks! | 20:49 |
tonyb | Is `apache2ctl graceful` adequate? | 20:51 |
fungi | no, that gets called any time the cert is replaced. it only recycles workers as they come idle | 20:53 |
fungi | `systemctl restart apache2` is disruptive but makes sure all workers are stopped and replaced | 20:53 |
tonyb | Okay | 20:53 |
clarkb | unrelated to everything else https://gitlab.com/Linaro/kisscache may possibly be useful to us | 20:54 |
clarkb | however like most other caching tools they seem to want to limit things on the client side rather than the server side so we may have to make a feature request or write a feature to limit the resources that are accessible through the cache | 20:54
clarkb | it is written in python though so should be familiar | 20:55 |
*** blarnath is now known as d34dh0r53 | 21:00 | |
tonyb | Should I status about the restart? given it's disruptive | 21:00 |
clarkb | I think thats fine. Could also just status log it rather than hitting all the channels again | 21:01 |
tonyb | # status log Restarting web services on review.opendev.org to clear stale workers | 21:04 |
tonyb | look okay? | 21:04 |
clarkb | yup | 21:04
fungi | yep | 21:04 |
tonyb | #status log Restarting web services on review.opendev.org to clear stale workers | 21:04 |
opendevstatus | tonyb: finished logging | 21:04 |
tonyb | done | 21:06 |
clarkb | my meetings are done. I can finally find real food (an apple doesn't count). Then I have a school run and will be back to investigate that cloud with tony | 21:07 |
tonyb | ++ | 21:07 |
fungi | fwiw, my food intake so far has consisted of one umeboshi and two pieces of sourdough toast with a bit of hummus | 21:33 |
fungi | this day was over before it started | 21:33 |
fungi | taking a break now to make ผัดพริกขิง (phat prhrik khing) | 21:35 |
fungi | like a typical thai red curry but you cook down the coconut milk until it's condensed to solids and oil | 21:36 |
Clark[m] | I made a sloppy Joe out of leftovers. Feel much better | 21:37 |
fungi | ah, yep, pic on the 'paedia is pretty close to what we end up with: https://en.wikipedia.org/wiki/Phat_phrik_khing | 21:37 |
fungi | we've started keeping our own lime tree so we can have fresh leaves | 21:38 |
fungi | also grow our own thai chilis, much better with the right ingredients | 21:39 |
clarkb | the sundried peppers from the island are the red gold in my pantry. Some days I decide what we should cook just as an excuse to use the peppers | 21:48
clarkb | they go great in garlic noodles | 21:48 |
fungi | sounds amazing. they're basically a kind of bird's eye chili based on the pictures i saw? seem similar to a couple of short thai varieties we've been growing. we have 6 plants of those from last year over-wintering in a couple of kinds of ways to see how well they survive the cold | 22:13 |
fungi | we have about a gallon of them in the freezer still too | 22:13 |
clarkb | ya they are in a glass instant cover jar | 23:02 |
clarkb | tonyb: ready when you are | 23:03 |
fungi | the plants produce like mad (there were some still coming out even right up to this week's freezing conditions) | 23:03 |
fungi | we just kept chucking all the ripe ones we didn't use into a zipper bag in the freezer | 23:03 |
clarkb | Up 10 days rabbitmq | 23:12 |
clarkb | that seems like evidence in favor of frickler's hunch | 23:12 |
fungi | for the inmotion cloud? | 23:17 |
clarkb | ya | 23:17 |
fungi | keep in mind we only lost contact with the mirror instance a few days ago | 23:18 |
fungi | so it must have taken an extra week to blow everything up | 23:18 |
clarkb | looks like all of the compute services claim to be offline though | 23:21 |
clarkb | according to openstack compute service list | 23:21 |
clarkb | the containers seem to be running though so need to look at logs | 23:21 |
clarkb | there are oslo messaging timeouts | 23:25 |
fungi | sounds like openstack to me | 23:26 |
clarkb | but currently trying to sort out tonyb's ssh access | 23:27 |
clarkb | seems like tcp port 22 isn't working for tonyb | 23:27 |
fungi | odd | 23:27 |
clarkb | apparently now it works I dunno what changed | 23:28 |
fungi | the universe was spun upon its axis | 23:28 |
fungi | not sure which one, string theory says there are 11 of them | 23:29 |
clarkb | looking in rabbitmq logs there are actually a number of heartbeat issues and closed connections that seem to happen frequently | 23:41
clarkb | I think we're going to try restarting rabbitmq completely (all three cluster members). And if rabbit errors go away we can restart compute agents if they don't auto reconnect and become happy | 23:47 |
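A hedged sketch of that plan, assuming a kolla-style deployment where the containers are named rabbitmq and nova_compute (run on each controller and compute host respectively):

    sudo docker restart rabbitmq
    sudo docker exec rabbitmq rabbitmqctl cluster_status
    # once rabbit is healthy, bounce the compute agents and re-check
    sudo docker restart nova_compute
    openstack compute service list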