Tuesday, 2024-01-23

*** elodilles_pto is now known as elodilles06:54
tkajinamseems something wrong with authentication in gerrit and now the browser is redirected to  https://review.opendev.org/SignInFailure,SIGN_IN,Local+signature+verification+failed after login07:27
tkajinamI'm not facing the problem with my browser in this laptop (because of remaining sessions) but I hit this in a different laptop where login session was purged07:27
zigoI can't login either ... :/07:34
StutiArya[m]I need a solution, I am working on RHOSP platform where I want to run networking infoblox plugin and following their documentation. The RHOSP VM is also used by some other (QA) team and we are having conflict while changing the neutron ipam driver configuration. The other(QA) team needs the neutron ipam driver set as neutron.ipam_driver which is default value, but Infoblox plugin needs value as 'infoblox'. I am looking for a way where I07:35
StutiArya[m]can achieve both neutron configuration. 07:35
zigoianw: Do you know what's going on?07:38
fricklerinfra-root: ^^ I can confirm the issue, no idea what's happening, nothing obvious to me in gerrit logs08:02
fricklerlogin to ubuntu.one and accessing lp works fine08:02
fricklerStutiArya[m]: this channel is for collaborating on the opendev infrastructure. you may want to ask in the neutron channel maybe or in some redhat support platform08:07
fricklero.k., after filtering out a huge number of key exchange warning messages, I see this: ERROR com.google.gerrit.httpd.auth.openid.OpenIdServiceImpl : OpenID failure: Local signature verification failed08:15
fricklerbut since we didn't change anything, likely an issue with lp/u1?08:15
fricklerfirst timestamp I see this is 2024-01-22T22:43:30.668Z08:24
tonybfrickler: would you like some help?  even if it's just someone to bounce ideas off?08:25
fricklertonyb: sure, anything. I've noticed someone already created https://bugs.launchpad.net/canonical-identity-provider/+bug/2050855 and I asked in #launchpad about it now. maybe jamespage also can check internally?08:30
fricklerI've tried registering a new account with ubuntu.one and logging in with that one, but that seems to give the same result08:32
priteaufrickler: I am also seeing the login issues here. Would it be worth sending a notification via IRC and Mastodon?08:44
fricklerpriteau: good idea08:44
fricklerstatus notice all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient08:45
fricklerpriteau: tonyb: ^^ does that sound good?08:45
fricklerfwiw people in #launchpad gave me an email address to contact for ubuntu one support, which I'm doing now08:46
frickler#status notice all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient08:53
opendevstatusfrickler: sending notice08:53
-opendevstatus- NOTICE: all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient08:53
opendevstatusfrickler: finished sending notice08:56
priteauThanks frickler!08:56
tonybfrickler: some quick research indicates that it could be time related but AFAICT the time on review02 matches the time on my laptop so if that is it, it's on the LP side .... 08:59
fricklertonyb: the time on login.ubuntu.com also looks not too far off according to http headers09:04
tonybOkay so probably not that09:04
fricklerI've had the root mailbox on CC: for my mail but their ticket system seems to have ignored that at least in their auto-reply. will keep bouncing any further updates if/when they arrive09:06
frickleralso I guess I should also look into that mutt multi account setup. although I don't like switching inboxes, maybe I'll just add another tab09:08
tonybOkay, I don't see anything in the root mailbox09:13
frickleroh, that's @openstack.org, I used @opendev.org instead. wondering where that ends up, it didn't bounce at least09:18
tonybfrickler: any objections to me increasing the logging in org.openid4java (WARN) and com.google.gerrit.httpd.auth.openid (INFO) to DEBUG to try and get some more details?10:15
fricklertonyb: how would you do that? I would like to avoid any restart as that would likely worsen the situation for people whose login is currently still working. not sure if spinning up a help review99 would help?10:17
tonyb*should* be safe to run on prod?10:20
fricklertonyb: sounds ok to me10:20
* frickler needs to go prep some food, bbiab10:25
fricklerhmm, those logs don't show anything useful to me. that may be due to my complete ignorance of how openid actually works though10:50
tonybYeah I'm digging into both.  I *think* it's at the other end but I'd like to validate that.10:52
fricklerfwiw the initial mail address I sent things to was wrong, I've resent to the new address, but no response except the auto reply yet11:14
tonybfrickler: Okay.  I think I've satisfied myself that we are indeed receiving invalid signatures from Ubuntu-One/Launchpad. 11:39
tonybfrickler: the java code (specifically openid4java), sadly doesn't really do dynamic logging the way we'd like so we only get some of the potential additional output.11:40
tonybI've put the log levels back the way they were11:41
tonybI see your email and the auto-reply in the root mailbox11:42
swesttrying to log into opendev gerrit I'm currently getting redirected to https://review.opendev.org/SignInFailure,SIGN_IN,Local+signature+verification+failed which gives a 404 "Not Found"11:43
tonybIf it's still broken when I get up tomorrow I'll try to create a more self-contained test case 11:43
tonybswest: Yes it's a known issue, no ETA on a resolution yet11:43
swestack, thanks11:43
tonybswest: You can still interact with gerrit via ssh and anonymous APIs but there is a bit of a learning curve to that11:44
tonybfrickler: I'm heading to bed #good_luck11:45
fricklertonyb: thx, gn8 to you11:46
fungijust waking up now, but ubuntuone sso goes on the fritz from time to time. usually the admins for that correct the problem within a few hours though13:02
fungilooks like it's been going on for about six hours this time though? (just based on reports in irc... maybe longer but i haven't checked the logs on gerrit to see if there were significantly earlier occurrences)13:23
fungiaha, 22:43 was mentioned13:25
fungiso nearly 14 hours now13:25
fungier, nearly 1513:25
frickleryes, also no feedback to my ticket or the LP issue so far13:30
jrosseri'm trying to use one of the 32G sized nodes here but failing - is there something obvious i've done wrong? https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/90519913:43
fricklerjrosser: doesn't look wrong to me at first sight, will check nodepool logs13:48
fricklerjrosser: ah, ubuntu-jammy-32GB vs. ubuntu-jammy-32G, needed cleaned glasses to spot that ;)13:56
jrosserahha! well spotted!13:56
jrosserthankyou :)13:57
yoctozeptoah, so you know SSO is broken14:08
opendevreviewMerged openstack/project-config master: Add eventlet to projects available from github  https://review.opendev.org/c/openstack/project-config/+/90607114:12
fungiyoctozepto: yep, we spammed the irc channels that subscribe to our statusbot, which also gets posted to https://fosstodon.org/@opendevinfra/14:19
fungiand i responded to a thread that was started on the service-discuss ml about it as well14:20
yoctozeptoyeah, I have just come to see if you know14:34
yoctozeptoand you know14:34
zigoTime to get away from lp and find our own alternative? :)14:49
jamespagefrickler: apologies working odd hours this week - I'll go see what I can find out15:03
fungizigo: we've got a little progress on https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html but it's currently stalled behind not enough volunteers and also switching out our keycloak container images in https://review.opendev.org/90546915:04
fungiit's a goal of mine to make some forward progress on it this quarter, but that will all depend on what other fires erupt in the coming weeks15:07
* ajaiswal[m] uploaded an image: (25KiB) < https://matrix.org/_matrix/media/v3/download/matrix.org/JxbJGBFHSpCZkxKITWUpvvJd/image.png >15:08
ajaiswal[m]Hi i am unable to login to gerrit any help15:08
fungiajaiswal[m]: something's wrong with the ubuntuone sso identity provider we rely on, we've reported it to their admins and are hoping to hear something back soon15:09
fungissh and authenticated rest api access are still working, just not the interactive webui15:10
fungiif your session expires anyway, that is15:10
fricklerresponse from canonical in https://bugs.launchpad.net/bugs/2050855 , so I guess we need to do some further debugging or consider alternatives15:39
clarkbare other services failing too?15:54
fungii do seem to be able to log out of wiki.openstack.org and then log back in with my ubuntuone id. also same with storyboard.openstack.org, so at least some of our systems do seem unaffected, and whatever's happening does appear to just be impacting gerrit this time15:54
clarkbok and gerrit itself reports the same version number15:54
fungisorry for not checking those earlier, i thought they had already been tested and were seeing the same problems but apparently not15:54
clarkbpossible update to apache maybe impacting how the redirect is parsed in gerrit? Another thing is maybe the all-users repo got corrupted somehow?15:55
clarkbhowever, people are getting redirected to what appear to be the correct ubuntu one locations implying external ids are fine15:55
clarkbthat makes an all-users problem seem unlikely to me15:56
funginothing new in dpkg.log since 2024-01-19 06:29:5315:56
fungiso it doesn't appear to be any system package updates on review.o.o15:56
clarkbthe openid path is you click login in gerrit that redirects you to ubuntu one. At ubuntu one you login and if successful they are supposed to redirect you to a path supplied in the original gerrit -> ubuntu one redirect. Our server logs should show those redirect paths and we should be able to check them for sanity?15:58
clarkbI'm wondering if we're providing a bad post auth endpoint to ubuntu one somehow15:58
clarkbwe also got a notice about cert expiration for review.o.o again16:01
clarkbmakes me wonder about apache16:01
clarkbI'm wary of logging out myself to test things yet. I've also got a series of meetings all morning long16:05
fungivhost files haven't changed since 202216:05
clarkbBut ideas I'll throw out: check apache worker process staleness if for no other reason than to address the certcheck error. Next trace an auth request path and ensure that the redirects from both sides look correct (we might be able to do this through server logs and not need client side tracing)16:06
*** jonher_ is now known as jonher16:06
*** mmalchuk_ is now known as mmalchuk16:06
fungiin fact, nothing in /etc/apache has a recent mtime16:06
clarkbya I'm thinking more of the running processes16:06
clarkbsince the certcheck error indicates a potentially very long running and stale worker or workers. Unlikely to be the source of our issue though16:07
fungiif you have a secondary test account you can use a ff account container with that16:07
clarkbI don't, but maybe I need to create one16:07
fungii tested logging in with my test account in a separate account container and got the expected error16:08
*** jonher_ is now known as jonher16:08
fungii can restart apache if we think that might be a problem16:09
clarkbfungi: I think we want to trace the entire http redirect path for that process. on the initial redirect from gerrit to ubuntu one there should be an included redirect path for post auth back to gerrit. I think we want to check that value on its way to ubuntu one and on its way back to gerrit16:09
clarkbI doubt apache's stale worker is at fault since the cert is still valid16:10
clarkbmore just calling it out as a less than ideal state alongside this16:10
fungiinfra-prod-base started to fail again on sunday: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-base&project=opendev/system-config16:11
fungii'll track down the logs for it16:11
fungimirror02.iad3.inmotion.opendev.org : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=016:11
fungii'll add it to the disable list16:12
funginext infra-prod-base run should succeed hopefully and we'll get cert updates, but i think this means those warnings were entirely unrelated to the openid issue16:13
clarkbother thoughts while I'm having them: something browser specific either a specific browser or browser version or some new "make the internet more secure and less useable" functionality like the hsts stuff that has hindered our testing of held nodes16:13
clarkbfungi: the review warning wasn't caused by that failure though16:14
clarkbI still think it is highly unlikely that the warning is related to the issue but the cert did refresh for review a few days back and we're still getting intermittent errors which points to apache workers that haven't restarted to pull in the new cert16:14
fungiand yeah, i'm seeing a plenty recent cert from review.o.o16:15
fungiNot After : Apr 17 02:10:50 2024 GMT16:15
fungiso it was refreshed about a week ago16:16
clarkbadditional thought: hold a zuul deployed gerrit test node and test if this occurs there16:19
clarkbif this is system state specific we may be able to identify it through comparison with a test node16:19
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold  https://review.opendev.org/c/opendev/system-config/+/90638516:23
fungiautohold set, but that will take some time. we can check other avenues in the interim16:25
clarkbre tracing I see people get redirected to a failure url. Does that redirect come from ubuntu one or gerrit after the ubuntu one redirect back to gerrit?16:31
fungii don't notice any interim redirect16:33
clarkbgit grep in gerrit shows it may be from gerrit as it constructs the url in this file: java/com/google/gerrit/httpd/auth/openid/OpenIdServiceImpl.java16:35
clarkbI don't see the appended error message yet16:37
clarkbhttps://gerrit.googlesource.com/gerrit/+/refs/heads/stable-3.8/java/com/google/gerrit/httpd/auth/openid/OpenIdServiceImpl.java#265 this is the fail case we hit I think16:40
clarkbhttps://github.com/jbufu/openid4java/blob/4ea8f55c6857d398cbb5033d12a3baa98843be59/src/org/openid4java/consumer/ConsumerManager.java#L1776-L1797 seems to be where the error originates and as the lp issue hints at has to do with associations16:45
frickleryes, if you look at the debug logging that tonyb did earlier, you can see some references to that. do "grep -v sshd-SshDaemon /home/gerrit2/review_site/logs/error_log|less" and look for DEBUG at around 11:2816:50
frickleralso not very comforting to look at the age of that library. but none of that explains why it should start failing out of the blue16:51
fungiembedded trust store with a ca cert that suddenly expired?16:52
clarkbassociations appear to be a relationship between the gerrit server and ubuntu one that involve hashes of some sort to verify the responses from ubuntu one to gerrit16:54
clarkbthese are stored in an in memory db if the variable naming in that library is to be trusted16:54
clarkbthere is also mention of expiration in comments.16:54
clarkbthinking out loud here: is it possible we restarted gerrit at $time and loaded association with hashes that expired recently?16:55
fungii'm trying to work out when the gerrit process last started16:55
clarkbif that theory above is the cause then I would expect the held node to load new valid associations16:58
fungi42d18h ago17:00
clarkbfungi: 6 weeks ago according to docker ps17:00
fungiso not all that long ago in the grand scheme of things. wish ps didn't make that so hard to figure out accurately17:01
clarkbno but if they recycle the hashes annually or something like that we'd maybe start and load the association with a long enough period then hit it not too long later?17:01
clarkbI think we should figure out what the association endpoint is for ubuntu one and see if we can inspect the hashes and their expiry dates directly17:01
clarkbto get a better feel for what would be in this database17:02
clarkbthen we can also test with the held node if it is able to login with ubuntu one happily after loading current association data.17:02
clarkbI'm wary of just going ahead and restarting gerrit right now in prod to test this that way as this may reset all the sessions people have and create a bigger issue for us if that wasn't the solution17:02
fricklerack, that's why I was wary of doing a restart attempt earlier, too17:03
clarkbhttps://github.com/jbufu/openid4java/blob/openid4java-1.0.0/src/org/openid4java/consumer/ConsumerManager.java#L684-L853 this is the code that loads/builds the association17:04
clarkbunfortunately not small :)17:04
fungiugh, the dnm change above is on its second attempt building gerrit images17:04
fungiboth the 3.8 and 3.9 image builds are on their "2nd attempt"17:05
fungiE: Failed to fetch https://mirror-int.ord.rax.opendev.org/ubuntu/dists/jammy-security/main/binary-amd64/Packages.gz  File has unexpected size17:07
clarkbfungi: in your DNM change we can stop the image build from running. In fact that will help us reproduce more closely 1:1 as we won't get an updated image17:08
clarkb(I would just remove the job from the check jobs list as it should be a soft dep)17:08
fungithe error above should never happen with the way we do reprepro with afs17:08
clarkbfungi: we've seen it happen before iirc17:08
fungiyeah, i can strip down the jobs, should have done that before17:09
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold  https://review.opendev.org/c/opendev/system-config/+/90638517:11
funginow only system-config-run-review-3.8 is waiting17:11
fungiand now it's queued17:12
clarkbhttps://login.ubuntu.com/.well-known/openid-configuration this doesn't seem to work. Not sure what the actual endpoint should be yet17:15
fungiand now it's running17:16
clarkbwhen you load the in memory association structure the first thing it does is remove expired entries17:17
clarkbreally beginning to suspect either we're not removing expired entries or it is but then not loading the new valid stuff17:17
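A minimal Python sketch of the behavior described here, an in-memory association store that prunes expired entries on load. It is illustrative only and is not openid4java's code; the class name, method names, and the injectable clock are all hypothetical.

```python
import time


class InMemoryAssociationStore:
    """Hypothetical stand-in for an in-memory association table."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so expiry can be simulated
        self._by_handle = {}         # handle -> (mac_key, expires_at)

    def save(self, handle, mac_key, expires_in):
        # expires_in is a lifetime in seconds, as an identity provider reports it
        self._by_handle[handle] = (mac_key, self._clock() + expires_in)

    def load(self, handle):
        # Prune expired entries first, then look up the requested handle.
        self._prune_expired()
        entry = self._by_handle.get(handle)
        return None if entry is None else entry[0]

    def _prune_expired(self):
        now = self._clock()
        expired = [h for h, (_, exp) in self._by_handle.items() if exp <= now]
        for h in expired:
            del self._by_handle[h]
```

Because such a table lives only in process memory, restarting the process empties it and forces a fresh association on the next login.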
fungiclarkb: did i remove too much in the dnm change? the job only got as far as "pull-from-intermediate-registry: Load information from zuul_return" which failed17:21
fungido i need to leave the soft dependencies on that job?17:21
clarkbyes I think we need to keep the dependencies in place we just don't run the job that builds a new image17:22
clarkbotherwise all the speculative container image stuff doesn't have enough info to run17:22
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold  https://review.opendev.org/c/opendev/system-config/+/90638517:24
fungioh, wait, keep the registry job too then17:25
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold  https://review.opendev.org/c/opendev/system-config/+/90638517:26
clarkbyes otherwise you have to undo all the speculative container building framework stuff17:26
clarkbwe just want the job to not run then the speculative framework stuff can do its thing and decide to use the image we have published rather than a new speculative image17:27
clarkbhttps://openid.net/specs/openid-authentication-1_1.html#mode_associate ok I think unlike openid connect this doesn't rely on a well known endpoint. Instead you just post to the openid endpoint to set up an association then the other end sends back the pubkeys and expiry data17:28
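A rough Python sketch of the associate exchange described here. It only builds the form body and parses a key-value reply; nothing is sent over the network. It assumes OpenID 2.0 parameter names (the handles in the logs use HMAC-SHA256, which OpenID 1.1 does not define), and the sample reply values are invented for illustration.

```python
from urllib.parse import urlencode


def build_associate_body():
    """Form-encoded body for an association request (assumed OpenID 2.0 names)."""
    params = {
        "openid.ns": "http://specs.openid.net/auth/2.0",
        "openid.mode": "associate",
        "openid.assoc_type": "HMAC-SHA256",
        # "no-encryption" is only acceptable when the transport is HTTPS
        "openid.session_type": "no-encryption",
    }
    return urlencode(params)


def parse_key_value_form(text):
    """Parse 'key:value' lines, the reply format used for direct responses."""
    fields = {}
    for line in text.splitlines():
        if not line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields
```

The fields of interest in a reply would be `assoc_handle` and `expires_in` (lifetime in seconds), which is the data worth comparing against what Gerrit has cached.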
clarkbreading the code gerrit calls the authenticate() method in that ConsumerManager class. The authenticate method calls associate which should refresh the association db pruning expired entries17:40
clarkbI'm also finding the github code viewer does not work very well17:41
fungiargh, it didn't run the registry job even though i included it, but then complained that the registry job didn't run and was required by the gerrit test job17:41
clarkbwe also know that the association is not null because the error we hit is within a block with a valid association17:41
fungido i need to find a way to force the registry job to run when no images are built?17:41
clarkbI thought it would run regardless of building images17:43
fungi"Unable to freeze job graph: Job system-config-run-review-3.8 depends on opendev-buildset-registry which was not run."17:44
fungibut i kept "- opendev-buildset-registry" in the check pipeline jobs17:44
fungii don't see any conditions on the opendev-buildset-registry job definition in opendev/base-jobs either17:47
clarkbfungi: its working now17:51
clarkbI think you were looking at an older patchset and not the current one?17:51
fungioh! okay, i was looking at an old error ;)17:51
fungiyep, that looks better17:51
clarkblooking at apache logs you can see the assoc_handle values17:53
clarkbwe had a value that flipped over and then this all started17:53
clarkbthe same value continues to be used and fails now.17:53
clarkbSo we've cached bad data potentially or ubuntu one has17:53
fungiin which case probably the held test node will "just work"17:54
clarkbbut importantly this shows we aren't refreshing the association on every verification request likely because we're still considered valid so it is short circuiting somewhere?17:54
clarkbfungi: ya I mean assuming the upstream assertion nothing has changed is correct then all I can imagine is that something got corrupted somehow17:54
clarkbthe hash type is the same before and after the problem occurs17:55
clarkbI'm trying to figure out if this association ever worked17:59
fungiheld node is
fungiargh, we don't use openid on test deployments18:00
clarkbthe first use of the assoc handle is at 22:43:30 and the first signing failure is at 22:43:32 given the timing I believe this association simply never worked18:00
clarkbfungi: we don't but its easy to modify to make it so18:01
clarkbfungi: you also may need to set up /etc/hosts so that whatever redirects are used point at the test server18:01
clarkbthat all happens in your browser so /etc/hosts should be sufficient18:01
clarkbI think we know this we got a new association at 22:43:30 and it appears to have never worked18:01
fungiyeah, i have " review.opendev.org" in /etc/hosts already18:01
clarkbPrior to that the existing association was fine. They also appear to use the same hash type18:02
fungifrickler: saw a "ERROR com.google.gerrit.httpd.auth.openid.OpenIdServiceImpl : OpenID failure: Local signature verification failed" at 2024-01-22T22:43:30.668Z as the first occurrence, yeah18:03
clarkbI have double checked the hash types are the same18:03
clarkbso this isn't a change in hash type we aren't able to process18:04
clarkbI think all this info lends weight to the idea that either our association db is corrupt or ubuntu one's is. Since we appear to store associations in memory only if we restart gerrit that should create a new association which hopefully works. tl;dr let's continue pushing down the path of testing with the test server18:05
fungii just need to add "type = OPENID_SSO" and the openIdSsoUrl to [auth] in place of the DEVELOPMENT_BECOME_ANY_ACCOUNT yeah?18:06
clarkbI think so18:06
clarkbthen when you startup you should be able to login as you would in prod18:06
fungiit's coming up with those adjustments now18:06
fungiyeah, it works18:07
clarkbit works as in you are able to fully login to the test server using ubuntu one?18:07
fungiprompts me to set my displayname and username, then i'm logged in18:07
fungiyes, ubuntuone sso openid18:07
clarkbcool I want to check the apache logs on the held server to see if hash types align and otherwise look like prod currently18:08
fungiwith my fungi-three test account in a dedicated ff account container18:08
fungisame thing that was generating an error with the prod gerrit18:08
clarkbif you grep for assoc_handle in the gerrit-ssl-access.log you'll see the handle info18:09
clarkband ya that appears to match with prod. Same hash type anyway18:09
prjadhavHi need help regarding openstack zed issue18:11
prjadhavI am trying to start instance but getting error as Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Unable to establish connection to https://controller/identity18:12
clarkbfungi: as a final sanity check we have a non-null openid.sig value in both prod when it fails and your test node and old logs from prod when it succeeds. Which is the other half of the thing we're checking against18:12
clarkbprjadhav: we run the development tools that the openstack project uses to build the software but aren't really experts in how to run openstack. There are better venues for that like #openstack or the openstack-discuss mailing list18:13
fungiprjadhav: this is the channel where we discuss services run as part of the opendev collaboratory. you probably want the #openstack channel or the openstack-discuss@lists.openstack.org mailing list18:13
prjadhaveven though I hash commented the service_user section under nova.conf18:13
prjadhavanyone faced this issue?18:13
fungiprjadhav: you're asking in the wrong place, so you're unlikely to get any answers on that topic18:13
prjadhavok sorry18:14
prjadhavThanks for pointing18:14
clarkbfungi: I'm not sure what else we should be checking. These associations appear to live for many days so this isn't something that will automatically clean itself out. I also don't see a way to force the gerrit openid implementation to rebuild associations short of a restart but there may be a way we can do that18:14
clarkbfungi: oh I know what we can check. Can you restart the test server and log in again and we can check a new association value is made?18:15
fungiyep, on it18:15
fungiit's coming back up now18:15
fungiseems to leave me logged in. should i log out and back in again?18:16
fungiclarkb: done18:16
clarkbfungi: the assoc_handle values do seem to differ18:17
clarkbthough quite similar which may just be coincidence18:17
fungiokay, so it generates a new one at restart18:17
fungii can log out and in again to see if it changes, also restart yet again and see if it's still similar18:17
clarkbI think so. Please double check by grepping assoc_handle in the apache logs18:17
clarkb++ to both things18:17
clarkbside note ps confirms an older than cert update apache worker18:18
fungiokay, that explains the e-mail at least18:18
clarkbI'm going to need to take a break soon as I haven't had breakfast yet and meetings will have me busy well through the lunch hour here18:20
clarkbbut lets validate the log out and log back in then restart and log out and log back in data first18:20
fungiusing this to filter it:18:21
fungisudo grep assoc_handle /var/log/apache2/gerrit-ssl-access.log|sed 's/.*assoc_handle=%7BHMAC-SHA256%7D%7B\(.*\)%7D&openid.*/\1/'18:21
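A hedged Python equivalent of the filter above, for anyone who wants the handle split into its fields instead of the raw percent-encoded capture. The function name is hypothetical, and the `{type}{field}{field}` layout is inferred from the sed pattern in this log, not taken from any Ubuntu One documentation.

```python
import re
from urllib.parse import unquote

# Match the assoc_handle query parameter up to the next '&', space, or quote.
HANDLE_RE = re.compile(r'assoc_handle=([^&\s"]+)')


def extract_handle_fields(log_line):
    """Return the brace-delimited fields of the assoc_handle, or None."""
    match = HANDLE_RE.search(log_line)
    if match is None:
        return None
    decoded = unquote(match.group(1))        # %7B -> {, %7D -> }, %3D -> =
    return re.findall(r"\{([^}]*)\}", decoded)
```

Collecting the distinct field lists across the access log shows when the handle changes, which is the comparison being made between the restart and the log out/in cases.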
fungistays constant when i log out and in again without restarting18:22
*** osmanlicilegi is now known as Guest1618:22
clarkbyup I see that18:22
clarkbwhich also mimics prod behavior and seems to indicate we're caching associations18:22
fungiand it changes again when i restart and log out/in18:23
clarkbagreed. Though still similar so maybe the handle itself isn't a hash18:24
clarkbbut instead refers to some hash type info that isn't being directly logged18:24
fungithe first field looks like a counter and the second field seems to be a base64-encoded token of some kind18:24
clarkbwhich would make sense if that info is sensitive which I believe it is18:24
clarkboh maybe the second value is the hashed info18:25
fungibasically everything after the %7B (the repeated %3D at the end is what leads me to expect something like base64)18:25
clarkbah yup ==18:26
clarkbthe spec also refers to base64 for some fields so that is very likely18:27
fungianyway, if we thing the assoc_handle us bad and that getting a fresh one may solve this, restarting gerrit is probably the fastest way there18:28
clarkbalright so where we're at is that an association handle change occurred which seems to imply we updated associations (likely due to an expiry of the old association). These changes happen many days apart so are not frequent18:28
fungier, think the assoc_handle is bad18:28
clarkbIt seems that restarting gerrit creates a new handle18:28
clarkbI've confirmed the gerrit docker images are the same on both the test node and the prod node ruling out something specific to code versions18:29
fungiwhich implies the handle expired around 42 days of age18:29
clarkbfungi: or maybe we've successfully rotated them weekly or something18:29
fungi42 days is suspiciously close to 1000 hours18:30
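For reference, the arithmetic behind this comparison, using the 42d18h uptime reported earlier in the log. This only restates numbers from the discussion; it does not establish what lifetime Ubuntu One actually assigns to an association.

```python
from datetime import timedelta

# Gerrit process uptime reported earlier in the log: 42 days 18 hours.
uptime = timedelta(days=42, hours=18)
uptime_hours = uptime.total_seconds() / 3600

# A 1000-hour lifetime expressed in days, for comparison.
thousand_hours_in_days = 1000 / 24

print(f"uptime: {uptime_hours:.0f} hours")
print(f"1000 hours: {thousand_hours_in_days:.2f} days")
```

The uptime is measured at the time of this discussion, so the process age when the failures began (22:43 the previous day) was somewhat lower.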
clarkband only hit the problem with a corrupt association this time ?18:30
clarkbbut ya I haven't dug through logs far enough to figure out how long the old one may have been used for18:30
clarkbit was in use as far back as our current log files record. We'd have to go to backups to look further18:31
clarkbfrickler tonyb (no corvus here?) would be good if you can weigh in on the idea that restarting the service will create a new association which hopefully works18:31
clarkbI'm going to take that break now otherwise I may not get another chance for some time18:31
fungiyeah, i'm on hand to restart the gerrit container as soon as we have consensus (explicit or tacit)18:32
clarkbthinking out loud here: if we wanted to investigate further before a restart we could see if anyone at the ubuntu one side is willing to compare notes on these values to see if we did end up with corruption somehow18:32
clarkbother than that I'm not sure what we can do18:33
fungiwhich can also be done after the fact, we still have the data18:33
clarkbya I think the main risk with a restart is that it is common for restarts to invalidate user sessions18:34
clarkbso if we don't fix it through the restart we'll have made the problem much more noticeable18:34
fungii haven't actually observed gerrit restarts invalidating my login sessions, fwiw18:40
fricklerreading backlog after tc meeting now18:40
fricklerI think that used to happen with older gerrit, but I think has not happened recently18:41
fricklergiven the testing I'd be fine with doing a gerrit restart. though I'm biased because I don't have a working session left after my earlier testing and will also not be around for the next 12h or so anyway ;)18:42
fricklerdoing rechecks or +3 via cli is fine, but I didn't look into the details of doing file based comments yet18:47
clarkbI think that may require the http rest api. I seem to recall that zuul can't do inline comments without the rest api18:48
clarkbcorvus indicates that his messages may not be reaching the channel but is also happy with a restart given the testing fungi performed18:54
*** corvus is now known as notcorvus18:54
*** notcorvus is now known as corvus18:54
fricklerthere seemed to be issues with the matrix bridge in the tc channel, too18:56
tonybI have some scrollback to read before I could weigh in on that question 18:57
fungiokay, pending tonyb's feedback, i'm ready to restart the prod gerrit container at any moment (including during our meeting)18:58
* corvus twiddles thumbs18:58
corvuscorvus vocalizes?18:58
corvusoh hey i have a voice18:58
corvusi think some msgs will probably be lost, but my understanding is that fungi effectively made a review-dev server with our production config, and that allays my concerns that there might be some gerrit incompatibility contributing to the issue.  so i'm +1 on restart.18:59
fungicorrect. thanks corvus!19:00
corvusand yes, file comments require the rest api (patchlevel comments don't). gertty uses the rest api for that.19:00
clarkbcorvus: tonyb the main risk with a restart is that it may invalidate existing sessions for some/all users19:01
clarkbso if we don't actually fix the issue then we're in a worse spot19:01
corvusagreed.  but i think we have high confidence that it's a transient, not a systemic issue.  so my remaining concern is what if we don't "reset" things enough by doing a restart.  maybe worth having some gerrit cache clearing commands ready.19:02
clarkbcool just wanted to make sure that was apparent to people reviewing scrollback19:03
fungiwhat gerrit caches do we generally clear?19:03
tonybclarkb: That's my understanding but I wanted to check what testing had been done19:04
clarkbfungi: I don't think we've had to clear caches in a long time (upstream made a bunch of improvements to them and they act more reliably now iirc)19:04
fungitonyb: basically held a test deployment from the gerrit 3.8 test job, adjusted its config to use the same openid auth settings we do in prod, adjusted /etc/hosts to point review.opendev.org to that held node's ip address, then successfully logged into it with an ubuntuone openid19:06
corvusmostly just suggesting that we look up what caches there are and how to clear them before restarting, so that if we end up in a worse place and have to do some voodoo we have that ready to go19:06
fungiclarkb also noted that the association hash is normally constant but changed at the same time as the errors started, and we've checked that restarting gerrit also clears out and chooses a new association hash19:06
fungiwhich gives us a reasonably high confidence that the failures are related to association renewal and that restarting gerrit will force another fresh one19:07
clarkbI guess if there is an openid associations cache clear command we could try that first19:07
clarkbhowever this happens in the library external to gerrit caches so I don't think that is a thing19:07
fungisounds like it's simply held in memory?19:08
tonybAll the Association caches are in memory if I read the code correctly19:08
clarkbfungi: the variable and class names strongly imply that yes19:09
clarkbbut we know restarting seems to get us new ones so that's good19:09
fungiright, i mean held in process memory, which goes away if we stop the process and so has to be created fresh by the new process at restart19:10
fungitonyb: so anyway, have an opinion on the restart idea yet? i have to be on a conference call in about half an hour, so my available attention will be dropping sharply around that time for about an hour19:33
fungiokay, restarting it momentarily19:34
fungi#status notice The Gerrit service on review.opendev.org will be offline momentarily for a restart, in order to attempt to restore OpenID login functionality19:35
opendevstatusfungi: sending notice19:35
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily for a restart, in order to attempt to restore OpenID login functionality19:35
opendevstatusfungi: finished sending notice19:38
fungiinfra-root: bad news... i now get a different failure since the restart: "SignInFailure,SIGN_IN,Contact+site+administrator"19:38
tonybIt works for me now19:40
tonybI did a full UbuntuOne logout/login as part of the redirect dance19:41
fricklermaybe some startup issue?19:41
clarkbya I was going to say looking at the logs I think I'm seeing people able to login19:41
clarkbperhaps related to fungi's /etc/hosts updates to hit the test server?19:41
fungiyeah, possible my browser has something cached in that profile now19:43
tonybThere are a few ERRORs but they may be somewhat normal?19:44
clarkbyou are the only failures so far that I see19:44
clarkbtonyb: some of the startup errors were due to not sanitizing the replication queue waiting list 19:44
clarkbso it blew up on all of those entries it can't move19:44
fungiyeah, weird. given others are reporting success i'll try that with my real account outside the account container rather than my test account19:44
fungiyeah, my proper account logged in fine19:45
fungioh, you know what? my test account might actually be set inactive19:46
fungii bet that's it. i think i was testing account settings on it in prod a while back19:46
fungii was able to use it to reproduce the earlier openid failure because gerrit never got far enough to validate the account was active19:46
tonybclarkb: I was seeing part of a nullpointerexception because of the grep I was using19:47
frickleryou should certainly contact a site admin then ;)19:47
fungifrickler: yep, on it! he might be gone fishing already but i'll keep trying19:47
clarkbfrickler: phew I wasn't seeing log messages I would expect given the contact site admin message19:47
clarkbtonyb: I think the plugin manager error is a null ptr exception19:47
clarkband that one is expected19:47
clarkber fungi my last message to frickler  was intended to be aimed at you :)19:47
* frickler sneaks off now to avoid more confusion19:48
clarkbfrickler: good night!19:48
clarkbNow should we write this down in the opendev manual? If openid logins to gerrit fail check whether the assoc_handle updated and immediately started to fail. If so restart gerrit to force a new openid association to be generated19:48
tonybclarkb: That's a good idea.19:50
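The check clarkb proposes for the manual could be sketched roughly as below. This is only an illustration under stated assumptions: the `LoginEvent` record and the `suspect_handle` helper are hypothetical, and how the assoc_handle and the success/failure of each login are actually extracted from the logs is not shown in this discussion.

```python
from typing import Iterable, NamedTuple, Optional


class LoginEvent(NamedTuple):
    """Hypothetical record of one OpenID login attempt (not a Gerrit type)."""
    timestamp: str      # ISO 8601 string, so plain string sorting is chronological
    assoc_handle: str   # association handle seen for this attempt
    succeeded: bool     # whether the login completed


def suspect_handle(events: Iterable[LoginEvent]) -> Optional[str]:
    """Return the newest assoc_handle if it replaced an older one and every
    login attempted since the change has failed; otherwise return None."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    if not ordered:
        return None
    current = ordered[-1].assoc_handle
    # Collect the trailing run of attempts that used the newest handle.
    tail = []
    for event in reversed(ordered):
        if event.assoc_handle != current:
            break
        tail.append(event)
    if len(tail) == len(ordered):
        return None  # the handle never changed in this window
    if all(not event.succeeded for event in tail):
        return current  # failures began right at the handle change
    return None
```

If a handle is returned, that matches the pattern described above (handle changed, logins failed from then on), which is the case where a Gerrit restart to force a fresh association was found to help.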
fungii can't help but wonder if the bad association handle is also what's been responsible for people getting occasional openid errors on the mediawiki instance, though there it only occurs when the account needs to be created19:50
fungibut if association handles are done per-account in mediawiki's openid plugin rather than how they're cached site-wide in gerrit...19:51
fungiit's always seemed like some users randomly get a bad response from ubuntuone sso the first time they log into mediawiki, but some weeks later trying again with the same id works and an account gets created19:52
fungiwhere by "bad response" i mean something the openid plugin for mediawiki can't handle19:53
tonybSounds possible19:53
fungithough does the association handle come from the id provider, or from the client?19:54
fungis/client/id consumer/19:54
clarkbfungi: from the server I think. The consumer does a post against the server and gets that info back19:56
clarkbfungi: https://openid.net/specs/openid-authentication-1_1.html#mode_associate this was helpful when I was wrapping my head around it19:56
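For readers following the linked spec section: the provider answers the consumer's associate POST with a body in key-value form, one `key:value` pair per line, carrying fields such as `assoc_handle` and `expires_in`. A minimal parsing sketch follows, assuming that form; the helper names and the sample values are hypothetical and are not taken from Gerrit or Ubuntu One.

```python
from datetime import datetime, timedelta, timezone


def parse_kv_form(body: str) -> dict[str, str]:
    """Parse OpenID key-value form: one 'key:value' pair per line."""
    fields: dict[str, str] = {}
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            raise ValueError(f"malformed key-value line: {line!r}")
        fields[key] = value
    return fields


def association_expiry(fields: dict[str, str], received_at: datetime) -> datetime:
    """Compute when the association lapses from its expires_in (seconds)."""
    return received_at + timedelta(seconds=int(fields["expires_in"]))


# Hypothetical sample response; all values are made up for illustration.
SAMPLE_RESPONSE = (
    "assoc_type:HMAC-SHA1\n"
    "assoc_handle:{HMAC-SHA1}{example}{handle==}\n"
    "expires_in:1209600\n"
    "mac_key:ZXhhbXBsZS1rZXk=\n"
)

if __name__ == "__main__":
    parsed = parse_kv_form(SAMPLE_RESPONSE)
    print(parsed["assoc_handle"])
    print(association_expiry(parsed, datetime.now(timezone.utc)))
```

The point of the sketch is only that the handle and its lifetime are issued by the server side, which is consistent with clarkb's answer above.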
fungiyeah, so maybe if something like 0.1% of the ones ubuntuone sso returns are broken in some way that the consumer can't actually use them...19:57
clarkbcould be19:58
fungianyway, speculation for now. i'll put some actual facts in the lp bug report as an update. we can probably status notice that things are back to normal?19:58
fungii've updated https://bugs.launchpad.net/canonical-identity-provider/+bug/205085520:01
fungi#status notice OpenID logins for the Gerrit WebUI on review.opendev.org should be working normally again since the recent service restart20:01
opendevstatusfungi: sending notice20:01
-opendevstatus- NOTICE: OpenID logins for the Gerrit WebUI on review.opendev.org should be working normally again since the recent service restart20:01
opendevstatusfungi: finished sending notice20:04
clarkbso I don't forget we should restart apache on review too. but that is less urgent20:39
tonybI can do that20:45
tonybIs `apache2ctl graceful` adequate?20:51
fungino, that gets called any time the cert is replaced. it only recycles workers as they come idle20:53
fungi`systemctl restart apache2` is disruptive but makes sure all workers are stopped and replaced20:53
clarkbunrelated to everything else https://gitlab.com/Linaro/kisscache may possibly be useful to us20:54
clarkbhowever like most other caching tools they seem to want to limit things on the client side rather than the server side so we may have to make a feature request or write a feature to limit the resources that are accessible through the cache20:54
clarkbit is written in python though so should be familiar20:55
*** blarnath is now known as d34dh0r5321:00
tonybShould I status about the restart? given it's disruptive21:00
clarkbI think that's fine. Could also just status log it rather than hitting all the channels again21:01
tonyb# status log Restarting web services on review.opendev.org to clear stale workers21:04
tonyblook okay?21:04
tonyb#status log Restarting web services on review.opendev.org to clear stale workers21:04
opendevstatustonyb: finished logging21:04
clarkbmy meetings are done. I can finally find real food (an apple doesn't count). Then I have a school run and will be back to investigate that cloud with tony21:07
fungifwiw, my food intake so far has consisted of one umeboshi and two pieces of sourdough toast with a bit of hummus21:33
fungithis day was over before it started21:33
fungitaking a break now to make ผัดพริกขิง  (phat phrik khing)21:35
fungilike a typical thai red curry but you cook down the coconut milk until it's condensed to solids and oil21:36
Clark[m]I made a sloppy Joe out of leftovers. Feel much better21:37
fungiah, yep, pic on the 'paedia is pretty close to what we end up with: https://en.wikipedia.org/wiki/Phat_phrik_khing21:37
fungiwe've started keeping our own lime tree so we can have fresh leaves21:38
fungialso grow our own thai chilis, much better with the right ingredients21:39
clarkbthe sundried peppers from the island are the red gold in my pantry. Some days I decide what we should cook just as an excuse to use the peppers21:48
clarkbthey go great in garlic noodles21:48
fungisounds amazing. they're basically a kind of bird's eye chili based on the pictures i saw? seem similar to a couple of short thai varieties we've been growing. we have 6 plants of those from last year over-wintering in a couple of kinds of ways to see how well they survive the cold22:13
fungiwe have about a gallon of them in the freezer still too22:13
clarkbya they are in a glass instant cover jar23:02
clarkbtonyb: ready when you are23:03
fungithe plants produce like mad (there were some still coming out even right up to this week's freezing conditions)23:03
fungiwe just kept chucking all the ripe ones we didn't use into a zipper bag in the freezer23:03
clarkbUp 10 days                              rabbitmq23:12
clarkbthat seems like evidence in favor of frickler's hunch23:12
fungifor the inmotion cloud?23:17
fungikeep in mind we only lost contact with the mirror instance a few days ago23:18
fungiso it must have taken an extra week to blow everything up23:18
clarkblooks like all of the compute services claim to be offline though23:21
clarkbaccording to openstack compute service list23:21
clarkbthe containers seem to be running though so need to look at logs23:21
clarkbthere are oslo messaging timeouts23:25
fungisounds like openstack to me23:26
clarkbbut currently trying to sort out tonyb's ssh access23:27
clarkbseems like tcp port 22 isn't working for tonyb23:27
clarkbapparently now it works I dunno what changed23:28
fungithe universe was spun upon its axis23:28
funginot sure which one, string theory says there are 11 of them23:29
clarkblooking in rabbitmq logs there are actually a number of heartbeat issues and closed connections that seem to happen frequently23:41
clarkbI think we're going to try restarting rabbitmq completely (all three cluster members). And if rabbit errors go away we can restart compute agents if they don't auto reconnect and become happy23:47
