*** elodilles_pto is now known as elodilles | 06:54 | |
tkajinam | seems something wrong with authentication in gerrit and now the browser is redirected to https://review.opendev.org/SignInFailure,SIGN_IN,Local+signature+verification+failed after login | 07:27 |
tkajinam | I'm not facing the problem with my browser on this laptop (because of remaining sessions) but I hit this on a different laptop where the login session was purged | 07:27
zigo | I can't login either ... :/ | 07:34 |
StutiArya[m] | I need a solution: I am working on the RHOSP platform, where I want to run the networking-infoblox plugin following its documentation. The RHOSP VM is also used by another (QA) team and we are having a conflict over the neutron ipam driver configuration. The other (QA) team needs the neutron ipam_driver left at its default value, but the Infoblox plugin needs it set to 'infoblox'. I am looking for a way where I | 07:35
StutiArya[m] | can achieve both neutron configurations. | 07:35
zigo | ianw: Do you know what's going on? | 07:38 |
frickler | infra-root: ^^ I can confirm the issue, no idea what's happening, nothing obvious to me in gerrit logs | 08:02
frickler | login to ubuntu.one and accessing lp works fine | 08:02 |
frickler | StutiArya[m]: this channel is for collaborating on the opendev infrastructure. you may want to ask in the neutron channel or on some Red Hat support platform | 08:07
frickler | o.k., after filtering out a huge number of key exchange warning messages, I see this: ERROR com.google.gerrit.httpd.auth.openid.OpenIdServiceImpl : OpenID failure: Local signature verification failed | 08:15 |
frickler | but since we didn't change anything, likely an issue with lp/u1? | 08:15 |
frickler | first timestamp I see this is 2024-01-22T22:43:30.668Z | 08:24 |
tonyb | frickler: would you like some help? even if it's just someone to bounce ideas off? | 08:25 |
frickler | tonyb: sure, anything. I've noticed someone already created https://bugs.launchpad.net/canonical-identity-provider/+bug/2050855 and I asked in #launchpad about it now. maybe jamespage also can check internally? | 08:30 |
frickler | I've tried registering a new account with ubuntu.one and logging in with that one, but that seems to give the same result | 08:32 |
priteau | frickler: I am also seeing the login issues here. Would it be worth sending a notification via IRC and Mastodon? | 08:44 |
frickler | priteau: good idea | 08:44 |
frickler | status notice all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient | 08:45 |
frickler | priteau: tonyb: ^^ does that sound good? | 08:45 |
frickler | fwiw people in #launchpad gave me an email address to contact for ubuntu one support, which I'm doing now | 08:46 |
frickler | #status notice all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient | 08:53 |
opendevstatus | frickler: sending notice | 08:53 |
-opendevstatus- NOTICE: all new logins to https://review.opendev.org are currently failing. investigation is ongoing, please be patient | 08:53 | |
opendevstatus | frickler: finished sending notice | 08:56 |
priteau | Thanks frickler! | 08:56 |
tonyb | frickler: some quick research indicates that it could be time related but AFAICT the time on review02 matches the time on my laptop so if that is it, it's on the LP side .... | 08:59 |
frickler | tonyb: the time on login.ubuntu.com also looks not too far off according to http headers | 09:04 |
tonyb | Okay so probably not that | 09:04 |
frickler | I've had the root mailbox on CC: for my mail but their ticket system seems to have ignored that at least in their auto-reply. will keep bouncing any further updates if/when they arrive | 09:06 |
frickler | I guess I should also look into that mutt multi-account setup. although I don't like switching inboxes, maybe I'll just add another tab | 09:08
tonyb | Okay, I don't see anything in the roo mailbox | 09:13 |
tonyb | *root | 09:13 |
frickler | oh, that's @openstack.org, I used @opendev.org instead. wondering where that ends up, it didn't bounce at least | 09:18 |
tonyb | frickler: any objections to me increasing the logging in org.openid4java (WARN) and com.google.gerrit.httpd.auth.openid (INFO) to DEBUG to try and get some more details? | 10:15 |
frickler | tonyb: how would you do that? I would like to avoid any restart as that would likely worsen the situation for people whose login is currently still working. not sure if spinning up a review99 would help? | 10:17
tonyb | https://gerrit-review.googlesource.com/Documentation/cmd-logging-set-level.html | 10:19 |
tonyb | *should* be safe to run on prod? | 10:20 |
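The command tonyb links would look roughly like this over Gerrit's admin SSH interface; treat it as a sketch (admin SSH access assumed, class names taken from the discussion above):

    # raise the two loggers discussed above to DEBUG
    ssh -p 29418 review.opendev.org gerrit logging set-level debug org.openid4java
    ssh -p 29418 review.opendev.org gerrit logging set-level debug com.google.gerrit.httpd.auth.openid
    # restore the previously configured levels afterwards
    ssh -p 29418 review.opendev.org gerrit logging set-level reset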
frickler | tonyb: sounds ok to me | 10:20 |
* frickler needs to go prep some food, bbiab | 10:25 | |
tonyb | okay | 10:25 |
frickler | hmm, those logs don't show anything useful to me. that may be due to my complete ignorance of how openid actually works though | 10:50 |
tonyb | Yeah I'm digging into both. I *think* it's at the other end but I'd like to validate that. | 10:52 |
frickler | fwiw the initial mail address I sent things to was wrong, I've resent to the new address, but no response except the auto-reply yet | 11:14
tonyb | frickler: Okay. I think I've satisfied myself that we are indeed receiving invalid signatures from Ubuntu-One/Launchpad. | 11:39 |
tonyb | frickler: the java code (specifically openid4java), sadly doesn't really do dynamic logging the way we'd like so we only get some of the potential additional output. | 11:40 |
tonyb | I've put the log levels back the way they were | 11:41 |
tonyb | I see your email and the auto-reply in the root mailbox | 11:42 |
swest | trying to log into opendev gerrit I'm currently getting redirected to https://review.opendev.org/SignInFailure,SIGN_IN,Local+signature+verification+failed which gives a 404 "Not Found" | 11:43 |
tonyb | If it's still broken when I get up tomorrow I'll try to create a more self-contained test case | 11:43 |
tonyb | swest: Yes it's a known issue, no ETA on a resolution yet | 11:43
swest | ack, thanks | 11:43 |
tonyb | swest: You can still interact with gerrit via ssh and anonymous APIs but there is a bit of a learning curve to that | 11:44 |
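For reference, the kind of non-webui interaction tonyb means looks roughly like the following; USERNAME, CHANGE and PATCHSET are placeholders:

    # vote on a change over the SSH API (works while web logins are broken)
    ssh -p 29418 USERNAME@review.opendev.org gerrit review --code-review +1 CHANGE,PATCHSET
    # anonymous REST queries need no login at all
    curl -s 'https://review.opendev.org/changes/?q=status:open+project:opendev/system-config&n=5'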
tonyb | frickler: I'm heading to bed #good_luck | 11:45 |
frickler | tonyb: thx, gn8 to you | 11:46 |
fungi | just waking up now, but ubuntuone sso goes on the fritz from time to time. usually the admins for that correct the problem within a few hours though | 13:02
fungi | looks like it's been going on for about six hours this time though? (just based on reports in irc... maybe longer but i haven't checked the logs on gerrit to see if there were significantly earlier occurrences) | 13:23 |
fungi | aha, 22:43 was mentioned | 13:25 |
fungi | so nearly 14 hours now | 13:25 |
fungi | er, nearly 15 | 13:25 |
frickler | yes, also no feedback to my ticket or the LP issue so far | 13:30 |
jrosser | i'm trying to use one of the 32G sized nodes here but failing - is there something obvious i've done wrong? https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/905199 | 13:43 |
frickler | jrosser: doesn't look wrong to me at first sight, will check nodepool logs | 13:48 |
frickler | jrosser: ah, ubuntu-jammy-32GB vs. ubuntu-jammy-32G, needed cleaned glasses to spot that ;) | 13:56 |
jrosser | ahha! well spotted! | 13:56 |
jrosser | thankyou :) | 13:57 |
frickler | yw | 13:59 |
yoctozepto | ah, so you know SSO is broken | 14:08 |
opendevreview | Merged openstack/project-config master: Add eventlet to projects available from github https://review.opendev.org/c/openstack/project-config/+/906071 | 14:12 |
fungi | yoctozepto: yep, we spammed the irc channels that subscribe to our statusbot, which also gets posted to https://fosstodon.org/@opendevinfra/ | 14:19 |
fungi | and i responded to a thread that was started on the service-discuss ml about it as well | 14:20 |
yoctozepto | yeah, I have just come to see if you know | 14:34 |
yoctozepto | and you know | 14:34 |
yoctozepto | perfect | 14:34 |
zigo | Time to get away from lp and find our own alternative? :) | 14:49 |
jamespage | frickler: apologies working odd hours this week - I'll go see what I can find out | 15:03 |
fungi | zigo: we've got a little progress on https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html but it's currently stalled behind not enough volunteers and also switching out our keycloak container images in https://review.opendev.org/905469 | 15:04 |
fungi | it's a goal of mine to make some forward progress on it this quarter, but that will all depend on what other fires erupt in the coming weeks | 15:07 |
* ajaiswal[m] uploaded an image: (25KiB) < https://matrix.org/_matrix/media/v3/download/matrix.org/JxbJGBFHSpCZkxKITWUpvvJd/image.png > | 15:08 | |
ajaiswal[m] | Hi, I am unable to log in to gerrit, any help? | 15:08
fungi | ajaiswal[m]: something's wrong with the ubuntuone sso identity provider we rely on, we've reported it to their admins and are hoping to hear something back soon | 15:09 |
fungi | ssh and authenticated rest api access are still working, just not the interactive webui | 15:10 |
fungi | if your session expires anyway, that is | 15:10 |
frickler | response from canonical in https://bugs.launchpad.net/bugs/2050855 , so I guess we need to do some further debugging or consider alternatives | 15:39 |
clarkb | are other services failing too? | 15:54 |
fungi | i do seem to be able to log out of wiki.openstack.org and then log back in with my ubuntuone id. also same with storyboard.openstack.org, so at least some of our systems do seem unaffected, and whatever's happening does appear to just be impacting gerrit this time | 15:54 |
clarkb | ok and gerrit itself reports the same version number | 15:54 |
fungi | sorry for not checking those earlier, i thought they had already been tested and were seeing the same problems but apparently not | 15:54 |
clarkb | possible update to apache maybe impacting how the redirect is parsed in gerrit? Another thing is maybe the all-users repo got corrupted somehow? | 15:55 |
clarkb | however, people are getting redirected to what appear to be the correct ubuntu one locations implying external ids are fine | 15:55 |
clarkb | that makes an all-users problem seem unlikely to me | 15:56 |
fungi | nothing new in dpkg.log since 2024-01-19 06:29:53 | 15:56 |
fungi | so it doesn't appear to be any system package updates on review.o.o | 15:56 |
clarkb | the openid path is: you click login in gerrit, which redirects you to ubuntu one. At ubuntu one you log in and, if successful, they are supposed to redirect you to a path supplied in the original gerrit -> ubuntu one redirect. Our server logs should show those redirect paths and we should be able to check them for sanity? | 15:58
clarkb | I'm wondering if we're providing a bad post auth endpoint to ubuntu one somehow | 15:58 |
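A sketch of how that sanity check could be done from the Apache access log (the log path matches the one used later in this discussion; the field names are standard OpenID request parameters):

    # openid.return_to is the post-auth endpoint gerrit hands to ubuntu one
    grep -o 'openid.return_to=[^&]*' /var/log/apache2/gerrit-ssl-access.log | sort | uniq -c
    # openid.mode=id_res marks the provider's responses coming back to gerrit
    grep 'openid.mode=id_res' /var/log/apache2/gerrit-ssl-access.log | tail -n 5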
clarkb | we also got a notice about cert expiration for review.o.o again | 16:01 |
clarkb | makes me wonder about apache | 16:01 |
clarkb | I'm wary of logging out myself to test things yet. I've also got a series of meetings all morning long | 16:05 |
fungi | vhost files haven't changed since 2022 | 16:05 |
clarkb | But ideas I'll throw out: check apache worker process staleness if for no other reason than to address the certcheck error. Next trace an auth request path and ensure that the redirects from both sides look correct (we might be able to do this through server logs and not need client side tracing) | 16:06 |
*** jonher_ is now known as jonher | 16:06 | |
*** mmalchuk_ is now known as mmalchuk | 16:06 | |
fungi | in fact, nothing in /etc/apache has a recent mtime | 16:06 |
clarkb | ya I'm thinking more of the running processes | 16:06 |
clarkb | since the certcheck error indicates a potentially very long running and stale worker or workers. Unlikely to be the source of our issue though | 16:07 |
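A minimal sketch of that staleness check, comparing worker start times against the certificate's modification time (the certificate path is an assumption, not taken from this log):

    ps -o pid,lstart,cmd -C apache2
    sudo stat -c '%y %n' /etc/letsencrypt/live/review.opendev.org/fullchain.pem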
fungi | if you have a secondary test account you can use a ff account container with that | 16:07 |
clarkb | I don't, but maybe I need to create one | 16:07
fungi | i tested logging in with my test account in a separate account container and got the expected error | 16:08 |
*** jonher_ is now known as jonher | 16:08 | |
fungi | i can restart apache if we think that might be a problem | 16:09 |
clarkb | fungi: I think we want to trace the entire http redirect path for that process. on the initial redirect from gerrit to ubuntu one there should be an included redirect path for post auth back to gerrit. I think we want to check that value on its way to ubuntu one and on its way back to gerrit | 16:09 |
clarkb | I doubt apache's stale worker is at fault since the cert is still valid | 16:10 |
clarkb | more just calling it out as a less than ideal state alongside this | 16:10
fungi | infra-prod-base started to fail again on sunday: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-base&project=opendev/system-config | 16:11 |
fungi | i'll track down the logs for it | 16:11 |
fungi | mirror02.iad3.inmotion.opendev.org : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0 | 16:11 |
fungi | i'll add it to the disable list | 16:12 |
clarkb | thanks | 16:12 |
fungi | next infra-prod-base run should succeed hopefully and we'll get cert updates, but i think this means those warnings were entirely unrelated to the openid issue | 16:13 |
clarkb | other thoughts while I'm having them: something browser specific either a specific browser or browser version or some new "make the internet more secure and less useable" functionality like the hsts stuff that has hindered our testing of held nodes | 16:13 |
clarkb | fungi: the review warning wasn't caused by that failure though | 16:14 |
fungi | right | 16:14 |
clarkb | I still think it is highly unlikely that the warning is related to the issue but the cert did refresh for review a few days back and we're still getting intermittent errors which points to apache workers that haven't restarted to pull in the new cert | 16:14 |
fungi | and yeah, i'm seeing a plenty recent cert from review.o.o | 16:15 |
fungi | Not After : Apr 17 02:10:50 2024 GMT | 16:15 |
fungi | so it was refreshed about a week ago | 16:16 |
clarkb | additional thought: hold a zuul deployed gerrit test node and test if this occurs there | 16:19 |
clarkb | if this is system state specific we may be able to identify it through comparison with a test node | 16:19 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 16:23 |
fungi | autohold set, but that will take some time. we can check other avenues in the interim | 16:25 |
clarkb | re tracing I see people get redirected to a failure url. Does that redirect come from ubuntu one or gerrit after the ubuntu one redirect back to gerrit? | 16:31 |
fungi | i don't notice any interim redirect | 16:33 |
clarkb | git grep in gerrit shows it may be from gerrit as it constructs the url in this file: java/com/google/gerrit/httpd/auth/openid/OpenIdServiceImpl.java | 16:35 |
clarkb | I don't see the appended error message yet | 16:37
clarkb | https://gerrit.googlesource.com/gerrit/+/refs/heads/stable-3.8/java/com/google/gerrit/httpd/auth/openid/OpenIdServiceImpl.java#265 this is the fail case we hit I think | 16:40 |
clarkb | https://github.com/jbufu/openid4java/blob/4ea8f55c6857d398cbb5033d12a3baa98843be59/src/org/openid4java/consumer/ConsumerManager.java#L1776-L1797 seems to be where the error originates and as the lp issue hints at has to do with associations | 16:45 |
frickler | yes, if you look at the debug logging that tonyb did earlier, you can see some references to that. do "grep -v sshd-SshDaemon /home/gerrit2/review_site/logs/error_log|less" and look for DEBUG at around 11:28 | 16:50 |
frickler | also not very comforting to look at the age of that library. but none of that explains why it should start failing out of the blue | 16:51 |
fungi | embedded trust store with a ca cert that suddenly expired? | 16:52 |
clarkb | associations appear to be a relationship between the gerrit server and ubuntu one that involve hashes of some sort to verify the responses from ubuntu one to gerrit | 16:54 |
clarkb | these are stored in an in memory db if the variable naming in that library is to be trusted | 16:54 |
clarkb | there is also mention of expiration in comments. | 16:54 |
clarkb | thinking out loud here: is it possible we restarted gerrit at $time and loaded an association with hashes that expired recently? | 16:55
fungi | i'm trying to work out when the gerrit process last started | 16:55 |
clarkb | if that theory above is the cause then I would expect the held node to load new valid associations | 16:58 |
fungi | 42d18h ago | 17:00 |
clarkb | fungi: 6 weeks ago according to docker ps | 17:00 |
fungi | yep | 17:00 |
fungi | so not all that long ago in the grand scheme of things. wish ps didn't make that so hard to figure out accurately | 17:01 |
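Two less ambiguous ways to pin down the start time, assuming the container is simply named gerrit (adjust to whatever docker ps actually shows):

    # container start timestamp straight from docker
    docker inspect -f '{{.State.StartedAt}}' gerrit
    # or the full start time of the oldest matching java process
    ps -o lstart= -p "$(pgrep -of java)"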
clarkb | no but if they recycle the hashes annually or something like that we'd maybe start and load the association with a long enough period then hit it not too long later? | 17:01 |
clarkb | I think we should figure out what the association endpoint is for ubuntu one and see if we can inspect the hashes and their expiry dates directly | 17:01 |
clarkb | to get a better feel for what would be in this database | 17:02 |
clarkb | then we can also test with the held node if it is able to login with ubuntu one happily after loading current association data. | 17:02 |
clarkb | I'm wary of just going ahead and restarting gerrit right now in prod to test this that way as this may reset all the sessions people have and create a bigger issue for us if that wasn't the solution | 17:02 |
frickler | ack, that's why I was wary of doing a restart attempt earlier, too | 17:03 |
fungi | agreed | 17:04 |
clarkb | https://github.com/jbufu/openid4java/blob/openid4java-1.0.0/src/org/openid4java/consumer/ConsumerManager.java#L684-L853 this is the code that loads/builds the association | 17:04 |
clarkb | unfortunately not small :) | 17:04 |
fungi | ugh, the dnm change above is on its second attempt building gerrit images | 17:04
fungi | both the 3.8 and 3.9 image builds are on their "2nd attempt" | 17:05 |
fungi | E: Failed to fetch https://mirror-int.ord.rax.opendev.org/ubuntu/dists/jammy-security/main/binary-amd64/Packages.gz File has unexpected size | 17:07 |
clarkb | fungi: in your DNM change we can stop the image build from running. In fact that will help us reproduce more closely 1:1 as we won't get an updated image | 17:08
clarkb | (I would just remove the job from the check jobs list as it should be a soft dep) | 17:08 |
fungi | the error above should never happen with the way we do reprepro with afs | 17:08 |
clarkb | fungi: we've seen it happen before iirc | 17:08 |
fungi | yeah, i can strip down the jobs, should have done that before | 17:09 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 17:11 |
fungi | now only system-config-run-review-3.8 is waiting | 17:11 |
fungi | and now it's queued | 17:12 |
clarkb | https://login.ubuntu.com/.well-known/openid-configuration this doesn't seem to work. Not sure what the actual endpoint should be yet | 17:15 |
fungi | and now it's running | 17:16 |
clarkb | when you load the in memory association structure the first thing it does is remove expired entries | 17:17 |
clarkb | really beginning to suspect either we're not removing expired entries, or we are but then not loading the new valid stuff | 17:17
fungi | clarkb: did i remove too much in the dnm change? the job only got as far as "pull-from-intermediate-registry: Load information from zuul_return" which failed | 17:21 |
fungi | do i need to leave the soft dependencies on that job? | 17:21 |
clarkb | yes I think we need to keep the dependencies in place we just don't run the job that builds a new image | 17:22 |
clarkb | otherwise all the speculative container image stuff doesn't have enough info to run | 17:22
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 17:24 |
fungi | oh, wait, keep the registry job too then | 17:25 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: break gerrit tests for an autohold https://review.opendev.org/c/opendev/system-config/+/906385 | 17:26 |
clarkb | yes otherwise you have to undo all the speculative container building framework stuff | 17:26 |
clarkb | we just want the job to not run then the speculative framework stuff can do its thing and decide to use the image we have published rather than a new speculative image | 17:27 |
clarkb | https://openid.net/specs/openid-authentication-1_1.html#mode_associate ok I think unlike openid connect this doesn't rely on a well-known endpoint. Instead you just post to the openid endpoint to set up an association, then the other end sends back the association handle, shared key material and expiry data | 17:28
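A hedged sketch of that association handshake against what is assumed to be Ubuntu One's OpenID endpoint; the unencrypted session type is used purely for illustration and the provider may insist on a Diffie-Hellman session instead:

    curl -s https://login.ubuntu.com/+openid \
      -d 'openid.ns=http://specs.openid.net/auth/2.0' \
      -d 'openid.mode=associate' \
      -d 'openid.assoc_type=HMAC-SHA256' \
      -d 'openid.session_type=no-encryption'
    # a successful response carries assoc_handle, expires_in and an encoded mac_key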
clarkb | reading the code gerrit calls the authenticate() method in that ConsumerManager class. The authenticate method calls associate which should refresh the association db pruning expired entries | 17:40 |
clarkb | I'm also finding the github code viewer does not work very well | 17:41 |
fungi | argh, it didn't run the registry job even though i included it, but then complained that the registry job didn't run and was required by the gerrit test job | 17:41 |
clarkb | we also know that the association is not null because the error we hit is within a block with a valid association | 17:41 |
fungi | do i need to find a way to force the registry job to run when no images are built? | 17:41 |
clarkb | I thought it would run regardless of building images | 17:43 |
fungi | "Unable to freeze job graph: Job system-config-run-review-3.8 depends on opendev-buildset-registry which was not run." | 17:44 |
fungi | but i kept "- opendev-buildset-registry" in the check pipeline jobs | 17:44 |
fungi | i don't see any conditions on the opendev-buildset-registry job definition in opendev/base-jobs either | 17:47 |
clarkb | fungi: its working now | 17:51 |
clarkb | I think you were looking at an older patchset and not the current one? | 17:51 |
fungi | oh! okay, i was looking at an old error ;) | 17:51 |
fungi | yep, that looks better | 17:51 |
clarkb | looking at apache logs you can see the assoc_handle values | 17:53 |
clarkb | we had a value that flipped over and then this all started | 17:53 |
clarkb | the same value continues to be used and fails now. | 17:53 |
clarkb | So we've cached bad data potentially or ubuntu one has | 17:53 |
fungi | in which case probably the held test node will "just work" | 17:54 |
clarkb | but importantly this shows we aren't refreshing the association on every verification request likely because we're still considered valid so it is short circuiting somewhere? | 17:54 |
clarkb | fungi: ya I mean assuming the upstream assertion that nothing has changed is correct then all I can imagine is that something got corrupted somehow | 17:54
clarkb | the hash type is the same before and after the problem occurs | 17:55 |
clarkb | I'm trying to figure out if this association ever worked | 17:59 |
fungi | held node is 104.130.169.176 | 17:59 |
fungi | argh, we don't use openid on test deployments | 18:00 |
clarkb | the first use of the assoc handle is at 22:43:30 and the first signing failure is at 22:43:32 given the timing I believe this association simply never worked | 18:00 |
clarkb | fungi: we don't but its easy to modify to make it so | 18:01 |
clarkb | fungi: you also may need to set up /etc/hosts so that whatever redirects are used point at the test server | 18:01 |
clarkb | that all happens in your browser so /etc/hosts should be sufficient | 18:01 |
clarkb | I think we know this: we got a new association at 22:43:30 and it appears to have never worked | 18:01
fungi | yeah, i have "104.130.169.176 review.opendev.org" in /etc/hosts already | 18:01 |
clarkb | Prior to that the existing association was fine. They also appear to use the same hash type | 18:02 |
fungi | frickler: saw a "ERROR com.google.gerrit.httpd.auth.openid.OpenIdServiceImpl : OpenID failure: Local signature verification failed" at 2024-01-22T22:43:30.668Z as the first occurrence, yeah | 18:03 |
clarkb | I have double checked the hash types are the same | 18:03 |
clarkb | so this isn't a change in hash type we aren't able to process | 18:04 |
clarkb | I think all this info lends weight to the idea that either our association db is corrupt or ubuntu one's is. Since we appear to store associations in memory only, if we restart gerrit that should create a new association which hopefully works. tl;dr let's continue pushing down the path of testing with the test server | 18:05
fungi | i just need to add "type = OPENID_SSO" and the openIdSsoUrl to [auth] in place of the DEVELOPMENT_BECOME_ANY_ACCOUNT yeah? | 18:06 |
clarkb | I think so | 18:06 |
clarkb | then when you start up you should be able to log in as you would in prod | 18:06
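Since gerrit.config is git-config format, the [auth] tweak fungi describes can be expressed as follows; the site path is assumed to match the layout seen earlier in the error_log path:

    git config -f /home/gerrit2/review_site/etc/gerrit.config auth.type OPENID_SSO
    git config -f /home/gerrit2/review_site/etc/gerrit.config auth.openIdSsoUrl https://login.ubuntu.com/+openid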
fungi | it's coming up with those adjustments now | 18:06 |
fungi | yeah, it works | 18:07 |
clarkb | it works as in you are able to fully log in to the test server using ubuntu one? | 18:07
fungi | prompts me to set my displayname and username, then i'm logged in | 18:07 |
fungi | yes, ubuntuone sso openid | 18:07 |
clarkb | cool I want to check the apache logs on the held server to see if hash types align and otherwise look like prod currently | 18:08
fungi | with my fungi-three test account in a dedicated ff account container | 18:08 |
fungi | same thing that was generating an error with the prod gerrit | 18:08 |
clarkb | if you grep for assoc_handle in the gerrit-ssl-access.log you'll see the handle info | 18:09 |
clarkb | and ya that appears to match with prod. Same hash type anyway | 18:09
prjadhav | Hi need help regarding openstack zed issue | 18:11 |
prjadhav | I am trying to start instance but getting error as Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Unable to establish connection to https://controller/identity | 18:12 |
clarkb | fungi: as a final sanity check we have a non-null openid.sig value in both prod when it fails and your test node, and in old logs from prod when it succeeds. Which is the other half of the thing we're checking against | 18:12
clarkb | prjadhav: we run the development tools that the openstack project uses to build the software but aren't really experts in how to run openstack. There are better venues for that like #openstack or the openstack-discuss mailing list | 18:13 |
fungi | prjadhav: this is the channel where we discuss services run as part of the opendev collaboratory. you probably want the #openstack channel or the openstack-discuss@lists.openstack.org mailing list | 18:13 |
prjadhav | even though I hash commented the service_user section under nova.conf | 18:13 |
prjadhav | anyone faced this issue? | 18:13 |
fungi | prjadhav: you're asking in the wrong place, so you're unlikely to get any answers on that topic | 18:13 |
prjadhav | ok sorry | 18:14 |
prjadhav | Thanks for pointing | 18:14 |
clarkb | fungi: I'm not sure what else we should be checking. These associations appear to live for many days so this isn't something that will automatically clean itself out. I also don't see a way to force the gerrit openid implementation to rebuild associations short of a restart but there may be a way we can do that | 18:14
clarkb | fungi: oh I know what we can check. Can you restart the test server and log in again and we can check a new association value is made? | 18:15 |
fungi | yep, on it | 18:15 |
fungi | it's coming back up now | 18:15 |
fungi | seems to leave me logged in. should i log out and back in again? | 18:16 |
clarkb | yes | 18:16 |
fungi | clarkb: done | 18:16 |
clarkb | fungi: the assoc_handle values do seem to differ | 18:17 |
clarkb | though quite similar which may just be coincidence | 18:17 |
fungi | okay, so it generates a new one at restart | 18:17 |
fungi | i can log out and in again to see if it changes, also restart yet again and see if it's still similar | 18:17 |
clarkb | I think so. Please double check by grepping assoc_handle in the apache logs | 18:17
clarkb | ++ to both things | 18:17 |
clarkb | side note: ps confirms an apache worker older than the cert update | 18:18
fungi | okay, that explains the e-mail at least | 18:18 |
clarkb | I'm going to need to take a break soon as I haven't had breakfast yet and meetings will have me busy well through the lunch hour here | 18:20 |
clarkb | but lets validate the log out and log back in then restart and log out and log back in data first | 18:20 |
fungi | using this to filter it: | 18:21 |
fungi | sudo grep assoc_handle /var/log/apache2/gerrit-ssl-access.log|sed 's/.*assoc_handle=%7BHMAC-SHA256%7D%7B\(.*\)%7D&openid.*/\1/' | 18:21 |
fungi | stays constant when i log out and in again without restarting | 18:22 |
*** osmanlicilegi is now known as Guest16 | 18:22 | |
clarkb | yup I see that | 18:22 |
clarkb | which also mimics prod behavior and seems to indicate we're caching associations | 18:22 |
fungi | and it changes again when i restart and log out/in | 18:23 |
clarkb | agreed. Though still similar so maybe the handle itself isn't a hash | 18:24 |
clarkb | but instead refers to some hash type info that isn't being directly logged | 18:24 |
fungi | the first field looks like a counter and the second field seems to be a base64-encoded token of some kind | 18:24 |
clarkb | which would make sense if that info is sensitive which I believe it is | 18:24 |
clarkb | oh maybe the second value is the hashed info | 18:25
fungi | basically everything after the %7B (the repeated %3D at the end is what leads me to expect something like base64) | 18:25 |
clarkb | ah yup == | 18:26 |
clarkb | the spec also refers to base64 for some fields so that is very likely | 18:27 |
fungi | anyway, if we thing the assoc_handle us bad and that getting a fresh one may solve this, restarting gerrit is probably the fastest way there | 18:28 |
clarkb | alright so where we're at is that an association handle change occurred which seems to imply we updated associations (likely due to an expiry of the old association). These changes happen many days apart so are not frequent | 18:28
fungi | er, think the assoc_handle is bad | 18:28 |
clarkb | It seems that restarting gerrit creates a new handle | 18:28 |
clarkb | I've confirmed the gerrit docker images are the same on both the test node and the prod node ruling out something specific to code versions | 18:29 |
fungi | which implies the handle expired around 42 days of age | 18:29 |
clarkb | fungi: or maybe we've successfully rotated them weekly or something | 18:29 |
fungi | 42 days is suspiciously close to 1000 hours | 18:30 |
clarkb | and only hit the problem with a corrupt association this time? | 18:30
clarkb | but ya I haven't dug through logs far enough to figure out how long the old one may have been used for | 18:30
clarkb | it was in use as far back as our current log files record. We'd have to go to backups to look further | 18:31
clarkb | frickler tonyb (no corvus here?) would be good if you can weigh in on the idea that restarting the service will create a new association which hopefully works | 18:31
clarkb | I'm going to take that break now otherwise I may not get another chance for some time | 18:31 |
fungi | yeah, i'm on hand to restart the gerrit container as soon as we have consensus (explicit or tacit) | 18:32 |
clarkb | thinking out loud here: if we wanted to investigate further before a restart we could see if anyone at the ubuntu one side is willing to compare notes on these values to see if we did end up with corruption somehow | 18:32
clarkb | other than that I'm not sure what we can do | 18:33 |
fungi | which can also be done after the fact, we still have the data | 18:33 |
clarkb | ya I think the main risk with a restart is that it is common for restarts to invalidate user sessions | 18:34 |
clarkb | so if we don't fix it through the restart we'll have made the problem much more noticeable | 18:34 |
fungi | i haven't actually observed gerrit restarts invalidating my login sessions, fwiw | 18:40 |
frickler | reading backlog after tc meeting now | 18:40 |
frickler | I think that used to happen with older gerrit, but it has not happened recently | 18:41
frickler | given the testing I'd be fine with doing a gerrit restart. though I'm biased because I don't have a working session left after my earlier testing and will also not be around for the next 12h or so anyway ;) | 18:42 |
frickler | doing rechecks or +3 via cli is fine, but I didn't look into the details of doing file based comments yet | 18:47 |
clarkb | I think that may require the http rest api. I seem to recall that zuul can't do inline comments without the rest api | 18:48 |
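A sketch of what an inline (file) comment via the REST API looks like; it needs an HTTP password generated in the Gerrit settings page, and the change number, file path and line are placeholders:

    curl -s -u USERNAME:HTTP_PASSWORD \
      -H 'Content-Type: application/json' \
      -d '{"comments": {"path/to/file.py": [{"line": 42, "message": "nit: typo"}]}}' \
      'https://review.opendev.org/a/changes/CHANGE_NUMBER/revisions/current/review'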
clarkb | corvus indicates that his messages may not be reaching the channel but is also happy with a restart given the testing fungi performed | 18:54 |
*** corvus is now known as notcorvus | 18:54 | |
*** notcorvus is now known as corvus | 18:54 | |
frickler | there seemed to be issues with the matrix bridge in the tc channel, too | 18:56 |
tonyb | I have some scrollback to read before I could weigh in on that question | 18:57 |
fungi | okay, pending tonyb's feedback, i'm ready to restart the prod gerrit container at any moment (including during our meeting) | 18:58 |
* corvus twiddles thumbs | 18:58 | |
corvus | corvus vocalizes? | 18:58 |
corvus | oh hey i have a voice | 18:58 |
clarkb | hello! | 18:59 |
corvus | i think some msgs will probably be lost, but my understanding is that fungi effectively made a review-dev server with our production config, and that allays my concerns that there might be some gerrit incompatibility contributing to the issue. so i'm +1 on restart. | 18:59
fungi | correct. thanks corvus! | 19:00 |
corvus | and yes, file comments require the rest api (patchlevel comments dont). gertty uses the rest api for that. | 19:00 |
clarkb | corvus: tonyb the main risk with a restart is that it may invalidate existing sessions for some/all users | 19:01 |
clarkb | so if we don't actually fix the issue then we're in a worse spot | 19:01 |
corvus | agreed. but i think we have high confidence that it's a transient, not a systemic issue. so my remaining concern is what if we don't "reset" things enough by doing a restart. maybe worth having some gerrit cache clearing commands ready. | 19:02 |
clarkb | cool just wanted to make sure that was apparent to people reviewing scrollback | 19:03 |
fungi | what gerrit caches do we generally clear? | 19:03 |
tonyb | clarkb: That's my understanding but I wanted to check what testing had been done | 19:04 |
clarkb | fungi: I don't think we've had to clear caches in a long time (upstream made a bunch of improvements to them and they act more reliably now iirc) | 19:04
fungi | tonyb: basically held a test deployment from the gerrit 3.8 test job, adjusted its config to use the same openid auth settings we do in prod, adjusted /etc/hosts to point review.opendev.org to that held node's ip address, then successfully logged into it with an ubuntuone openid | 19:06 |
corvus | mostly just suggesting that we look up what caches there are and how to clear them before restarting, so that if we end up in a worse place and have to do some voodoo we have that ready to go | 19:06 |
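The cache commands corvus has in mind would be along these lines (admin SSH access assumed; the specific cache name is only illustrative):

    ssh -p 29418 review.opendev.org gerrit show-caches
    ssh -p 29418 review.opendev.org gerrit flush-caches --list
    ssh -p 29418 review.opendev.org gerrit flush-caches --cache web_sessions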
fungi | clarkb also noted that the association hash is normally constant but changed at the same time as the errors started, and we've checked that restarting gerrit also clears out and chooses a new association hash | 19:06 |
fungi | which gives us a reasonably high confidence that the failures are related to association renewal and that restarting gerrit will force another fresh one | 19:07 |
clarkb | I guess if there is an openid associations cache claer command we could try that first | 19:07 |
clarkb | however this happens in the library external to gerrit caches so I don't think that is a thing | 19:07 |
fungi | sounds like it's simply held in memory? | 19:08 |
tonyb | All the Association caches are in memory if I read the code correctly | 19:08 |
clarkb | fungi: the variable and class names strongly imply that yes | 19:09 |
clarkb | but we know restarting seems to get us new ones so thats good | 19:09 |
fungi | right, i mean held in process memory, which goes away if we stop the process and so has to be created fresh by the new process at restart | 19:10 |
clarkb | yup | 19:10 |
fungi | tonyb: so anyway, have an opinion on the restart idea yet? i have to be on a conference call in about half an hour, so my available attention will be dropping sharply around that time for about an hour | 19:33 |
tonyb | #makeitso | 19:34 |
fungi | okay, restarting it momentarily | 19:34 |
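A minimal sketch of the restart itself, assuming the production compose file lives under /etc/gerrit-compose (path not confirmed in this log):

    cd /etc/gerrit-compose && sudo docker-compose down && sudo docker-compose up -d
    sudo docker ps --filter name=gerrit --format '{{.Names}} {{.Status}}'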
fungi | #status notice The Gerrit service on review.opendev.org will be offline momentarily for a restart, in order to attempt to restore OpenID login functionality | 19:35 |
opendevstatus | fungi: sending notice | 19:35 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily for a restart, in order to attempt to restore OpenID login functionality | 19:35 | |
opendevstatus | fungi: finished sending notice | 19:38 |
fungi | infra-root: bad news... i now get a different failure since the restart: "SignInFailure,SIGN_IN,Contact+site+administrator" | 19:38 |
tonyb | It works for me now | 19:40 |
frickler | wfm2 | 19:41 |
tonyb | I did a full UbuntuOne logout/login as part of the redirect dance | 19:41
frickler | maybe some startup issue? | 19:41 |
clarkb | ya I was going to say looking at the logs I think I'm seeing people able to login | 19:41 |
clarkb | perhaps related to fungi's /etc/hosts updates to hit the test server? | 19:41 |
fungi | yeah, possible my browser has something cached in that profile now | 19:43 |
tonyb | There are a few ERRORs but they may be somewhat normal? | 19:44 |
clarkb | yours are the only failures so far that I see | 19:44
clarkb | tonyb: some of the startup errors were due to not sanitizing the replication queue waiting list | 19:44
clarkb | so it blew up on all of those entries it can't move | 19:44 |
fungi | yeah, weird. given others are reporting success i'll try that with my real account outside the account container rather than my test account | 19:44 |
fungi | yeah, my proper account logged in fine | 19:45 |
fungi | oh, you know what? my test account might actually be set inactive | 19:46 |
fungi | i bet that's it. i think i was testing account settings on it in prod a while back | 19:46 |
fungi | i was able to use it to reproduce the earlier openid failure because gerrit never got far enough to validate the account was active | 19:46 |
tonyb | clarkb: I was seeing part of a nullpointerexception because of the grep I was using | 19:47 |
frickler | you should certainly contact a site admin then ;) | 19:47
fungi | frickler: yep, on it! he might be gone fishing already but i'll keep trying | 19:47 |
clarkb | frickler: phew I wasn't seeing log messages I would expect given the contact site admin message | 19:47 |
clarkb | tonyb: I think the plugin manager error is a null ptr exception | 19:47 |
clarkb | and taht one is expected | 19:47 |
clarkb | er fungi my last message to frickler was intended to be aimed at you :) | 19:47 |
* frickler sneaks off now to avoid more confusion | 19:48 | |
clarkb | frickler: good night! | 19:48 |
clarkb | Now should we write this down in the opendev manual? If openid logins to gerrit fail check whether the assoc_handle updated and immediately started to fail. If so restart gerrit to force a new openid association to be generated | 19:48
tonyb | clarkb: That's a good idea. | 19:50 |
fungi | i can't help but wonder if the bad association handle is also what's been responsible for people getting occasional openid errors on the mediawiki instance, though there it only occurs when the account needs to be created | 19:50 |
fungi | but if association handles are done per-account in mediawiki's openid plugin rather than how they're cached site-wide in gerrit... | 19:51 |
fungi | maybe? | 19:51 |
fungi | it's always seemed like some users randomly get a bad response from ubuntuone sso the first time they log into mediawiki, but some weeks later trying again with the same id works and an account gets created | 19:52 |
fungi | where by "bad response" i mean something the openid plugin for mediawiki can't handle | 19:53 |
tonyb | Sounds possible | 19:53 |
fungi | though does the association handle come from the id provider, or from the client? | 19:54 |
fungi | s/client/id consumer/ | 19:54 |
clarkb | fungi: from the server I think. The consumer does a post against the server and gets that info back | 19:56 |
clarkb | fungi: https://openid.net/specs/openid-authentication-1_1.html#mode_associate this was helpful when I was wrapping my head around it | 19:56 |
fungi | yeah, so maybe if something like 0.1% of the ones ubuntuone sso returns are broken in some way that the consumer can't actually use them... | 19:57 |
clarkb | could be | 19:58 |
fungi | anyway, speculation for now. i'll put some actual facts in the lp bug report as an update. we can probably status notice that things are back to normal? | 19:58 |
clarkb | ++ | 19:59 |
fungi | i've updated https://bugs.launchpad.net/canonical-identity-provider/+bug/2050855 | 20:01 |
fungi | #status notice OpenID logins for the Gerrit WebUI on review.opendev.org should be working normally again since the recent service restart | 20:01 |
opendevstatus | fungi: sending notice | 20:01 |
-opendevstatus- NOTICE: OpenID logins for the Gerrit WebUI on review.opendev.org should be working normally again since the recent service restart | 20:01 | |
opendevstatus | fungi: finished sending notice | 20:04 |
clarkb | so I don't forget we should restart apache on review too. but that is less urgent | 20:39 |
tonyb | I can do that | 20:45 |
fungi | thanks! | 20:49 |
tonyb | Is `apache2ctl graceful` adequate? | 20:51 |
fungi | no, that gets called any time the cert is replaced. it only recycles workers as they come idle | 20:53 |
fungi | `systemctl restart apache2` is disruptive but makes sure all workers are stopped and replaced | 20:53 |
tonyb | Okay | 20:53 |
clarkb | unrelated to everything else https://gitlab.com/Linaro/kisscache may possibly be useful to us | 20:54 |
clarkb | however like most other caching tools they seem to want to limit things on the client side rather than the server side so we may have to make a feature request or write a feature to limit the resources that are accessible through the cache | 20:54
clarkb | it is written in python though so should be familiar | 20:55 |
*** blarnath is now known as d34dh0r53 | 21:00 | |
tonyb | Should I status about the restart? given it's disruptive | 21:00 |
clarkb | I think thats fine. Could also just status log it rather than hitting all the channels again | 21:01 |
tonyb | # status log Restarting web services on review.opendev.org to clear stale workers | 21:04 |
tonyb | look okay? | 21:04 |
clarkb | yup | 21:04
fungi | yep | 21:04 |
tonyb | #status log Restarting web services on review.opendev.org to clear stale workers | 21:04 |
opendevstatus | tonyb: finished logging | 21:04 |
tonyb | done | 21:06 |
clarkb | my meetings are done. I can finally find real food (an apple doesn't count). Then I have a school run and will be back to investigate that cloud with tony | 21:07 |
tonyb | ++ | 21:07 |
fungi | fwiw, my food intake so far has consisted of one umeboshi and two pieces of sourdough toast with a bit of hummus | 21:33 |
fungi | this day was over before it started | 21:33 |
fungi | taking a break now to make ผัดพริกขิง (phat prhrik khing) | 21:35 |
fungi | like a typical thai red curry but you cook down the coconut milk until it's condensed to solids and oil | 21:36 |
Clark[m] | I made a sloppy Joe out of leftovers. Feel much better | 21:37 |
fungi | ah, yep, pic on the 'paedia is pretty close to what we end up with: https://en.wikipedia.org/wiki/Phat_phrik_khing | 21:37 |
fungi | we've started keeping our own lime tree so we can have fresh leaves | 21:38 |
fungi | also grow our own thai chilis, much better with the right ingredients | 21:39 |
clarkb | the sundried peppers from the island are the red gold in my pantry. Some days I decide what we should cook just as an excuse to use the peppers | 21:48
clarkb | they go great in garlic noodles | 21:48 |
fungi | sounds amazing. they're basically a kind of bird's eye chili based on the pictures i saw? seem similar to a couple of short thai varieties we've been growing. we have 6 plants of those from last year over-wintering in a couple of kinds of ways to see how well they survive the cold | 22:13 |
fungi | we have about a gallon of them in the freezer still too | 22:13 |
clarkb | ya they are in a glass instant cover jar | 23:02 |
clarkb | tonyb: ready when you are | 23:03 |
fungi | the plants produce like mad (there were some still coming out even right up to this week's freezing conditions) | 23:03 |
fungi | we just kept chucking all the ripe ones we didn't use into a zipper bag in the freezer | 23:03 |
clarkb | Up 10 days rabbitmq | 23:12 |
clarkb | that seems like evidence in favor of frickler's hunch | 23:12 |
fungi | for the inmotion cloud? | 23:17 |
clarkb | ya | 23:17 |
fungi | keep in mind we only lost contact with the mirror instance a few days ago | 23:18 |
fungi | so it must have taken an extra week to blow everything up | 23:18 |
clarkb | looks like all of the compute services claim to be offline though | 23:21 |
clarkb | according to openstack compute service list | 23:21 |
clarkb | the containers seem to be running though so need to look at logs | 23:21 |
clarkb | there are oslo messaging timeouts | 23:25 |
fungi | sounds like openstack to me | 23:26 |
clarkb | but currently trying to sort out tonyb's ssh access | 23:27 |
clarkb | seems like tcp port 22 isn't working for tonyb | 23:27 |
fungi | odd | 23:27 |
clarkb | apparently now it works I dunno what changed | 23:28 |
fungi | the universe was spun upon its axis | 23:28 |
fungi | not sure which one, string theory says there are 11 of them | 23:29 |
clarkb | looking in rabbitmq logs there are actually a number of heartbeat issues and closed connections that seem to happen frequently | 23:41
clarkb | I think we're going to try restarting rabbitmq completely (all three cluster members). And if rabbit errors go away we can restart compute agents if they don't auto reconnect and become happy | 23:47 |
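A hedged sketch of that plan, assuming a kolla-style deployment where the containers are named rabbitmq and nova_compute (run on each controller and compute host respectively):

    sudo docker restart rabbitmq
    sudo docker exec rabbitmq rabbitmqctl cluster_status
    # once rabbit is healthy, bounce the compute agents and re-check
    sudo docker restart nova_compute
    openstack compute service list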