16:00:22 #startmeeting keystone
16:00:23 Meeting started Tue Oct 23 16:00:22 2018 UTC and is due to finish in 60 minutes. The chair is lbragstad. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:27 The meeting name has been set to 'keystone'
16:00:28 #link https://etherpad.openstack.org/p/keystone-weekly-meeting
16:00:32 o/
16:00:53 o/
16:01:06 o/
16:01:09 o/
16:02:02 Oyez oyez
16:02:16 o/
16:03:00 #topic Release status
16:03:16 #info next week is Stein-1 and specification proposal freeze
16:03:50 I assume we have real things to discuss prior to my two agenda items
16:04:04 we should be smoothing out concerns with specs sooner rather than later at this point
16:04:33 o/
16:04:42 if you have specific items wrt specs and want higher bandwidth to discuss, please let someone know
16:04:55 or throw it on the meeting agenda
16:05:13 ayoung, do you want to reorder the schedule?
16:05:22 or is that what you're suggesting?
16:06:28 Nah
16:06:31 I'm going last
16:06:40 ok
16:06:49 #topic Oath approach to federation
16:07:05 last week we talked about Oath open-sourcing their approach to federation
16:07:14 so replace uuid3 with uuid5 and I like it
16:07:46 couple other things; we could make it so the deployer chooses the namespace, and could keep that in sync across their deployments, to get "unique" IDs that are still distributed
16:07:49 tl;dr they consume Athenz tokens in place of SAML assertions and have their own auth plugin for doing their version of auto-provisioning
16:08:13 you can find the code here:
16:08:15 #link https://github.com/yahoo/openstack-collab/tree/master/keystone-federation-ocata
16:08:43 i started walking through it and comparing their implementation against what we have, just to better understand the differences
16:08:53 you can find that here, but i just started working on it
16:08:55 #link https://etherpad.openstack.org/p/keystone-shadow-mapping-athenz-delta
16:09:04 Could Athenz be done as a middleware module? Something that looks like REMOTE_USER/REMOTE_GROUPS? Or does it provide more information than we currently accept from SAML etc?
16:09:40 it doesn't really follow the SAML spec at all - from what i can tell, it gets everything from the athenz token and the auth body
16:10:35 the auth plugin decodes the token and provisions users, projects, and roles based on the values
16:10:41 Because auto-provisioning is its own thing, and we should be willing to accept that as a standalone contribution anyway.
16:11:44 yeah - i guess it's important to note that Oath developed this for replicated use cases and not auto-provisioning specifically, but the implementation is very similar to what we developed as a solution for auto-provisioning
16:12:19 also... Oath needs predictable IDs.
16:12:19 I have a WIP spec to support those.
It is more than just users, it looks like
16:12:38 i'm not sure they need those if they come from the identity provider
16:12:42 which is athenz
16:12:49 I think the inter-tubes are congested
16:13:22 https://review.openstack.org/#/c/612099/
16:13:51 why does athenz need predictable user ids?
16:14:07 lbragstad, because they need to be the same from location to location
16:14:28 so admin can't be ABC in region1 and 123 in region2
16:14:30 https://github.com/yahoo/openstack-collab/blob/master/keystone-federation-ocata/plugin/keystone/auth/plugins/athenz.py#L123-L129
16:14:56 they state they use uuid3(NAMESPACE, name)
16:14:56 the user id is generated by athenz
16:15:36 and keystone just populates it in the database, from what i can tell
16:16:04 so long as you're using athenz tokens to access keystone service providers, you should have the same user id at each site?
16:16:15 that is my understanding, yes
16:16:41 so their implementation has already achieved predictable user ids
16:16:59 right?
16:19:14 if anyone feels like parsing that code, feel free to add your comments, questions, or concerns to that etherpad
16:19:28 it might be helpful if/when we or penick go to draft a specification
16:19:42 worst case, it helps us understand their use case a bit better
16:19:59 Will take a look later.
16:20:07 thanks wxy|
16:20:11 any other questions on this?
16:20:57 * knikolla will read back. am stuck in meetings as we have the MOC workshop next week. sorry for being AWOL this time period.
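The predictable-ID scheme debated above boils down to deterministic name-based UUIDs. A minimal sketch, assuming a deployer-chosen namespace (the value below is illustrative, not something taken from the Athenz plugin):

```python
import uuid

# Deployer-chosen namespace; keeping it identical across deployments
# yields the same ID for the same user name at every site.
# (Illustrative value, not from the Athenz plugin.)
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "keystone.example.com")

def predictable_user_id(name):
    # uuid5 (SHA-1 based) is deterministic: the same namespace + name
    # always produce the same ID. The linked plugin uses uuid3
    # (MD5 based); uuid5 was the swap suggested above.
    return uuid.uuid5(NAMESPACE, name).hex
```

So "admin" maps to one ID in every region that shares the namespace, which is the predictability property being discussed.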
16:21:11 no worries - thanks knikolla
16:21:16 alright, moving on
16:21:29 #topic Another report of upgrade failures with user options
16:21:45 #link https://bugs.launchpad.net/openstack-ansible/+bug/1793389
16:21:45 Launchpad bug 1793389 in openstack-ansible "Upgrade to Ocata: Keystone Intermittent Missing 'options' Key" [Medium,Fix released] - Assigned to Alex Redinger (rexredinger)
16:21:56 we've had this one crop up a few times
16:22:17 specifically, the issue is due to caching during a live upgrade
16:22:54 from pre-Ocata to Ocata
16:23:11 it's still undetermined if this impacts FFU scenarios
16:23:18 (e.g. Newton -> Pike)
16:23:57 but it boils down to the cache returning a user reference during authentication on Ocata code that expects user['options'] to be present, but it isn't because the user was cached prior to the upgrade
16:24:20 Gah... disconnect. I'll try to catch up
16:24:45 deployment projects have a workaround to flush memcached as a way to force a miss on authentication and refetch the user
16:25:46 cmurphy, odyssey4me, and i were discussing approaches for mitigating this in keystone directly
16:26:11 there is a WIP review in gerrit
16:26:14 #link https://review.openstack.org/#/c/612686/
16:26:24 but curious if people have thoughts or concerns about this approach
16:26:40 or if there are other approaches we should consider
16:27:31 wouldn't deploying a fix like this flush the cache anyway?
16:27:38 How could they ever get in this state?
16:28:01 the memcached instance has a valid cache for a specific user
16:28:22 Is this a side effect of 0 downtime upgrades? Keep the cache up, even as we change the data out from underneath?
16:28:38 yeah - that's the problem
16:28:41 the cache remains up
16:28:48 thus holding the cached data
16:28:50 that is going to be a problem in other ways
16:29:13 needs to be part of the upgrade. Flush cache when we do ....
16:29:18 contract?
16:29:38 we change the schema in the middle.
The cache will no longer reflect the schema after some point
16:30:12 that's what https://review.openstack.org/#/c/608066/ does
16:30:19 but not in process
16:30:25 that's the problem, the question is whether we can be a bit more surgical instead of flushing the whole cache
16:30:35 I see that, but it is on a row by row basis
16:31:00 yeah, that review looks like it is in the right direction
16:31:41 so... can we tell memcache to flush all of a certain class of entry? As I recall from token revocations, that is not possible
16:31:47 also - alex's comment on https://review.openstack.org/#/c/612686/ proves this could affect FFU
16:31:57 it only knows about key/value stores
16:32:38 ayoung, are you asking about cache region support?
16:32:50 lbragstad, maybe.
16:33:02 does each region reflect a specific class of cached objects?
16:33:04 some parts of keystone rely on regions, yes
16:33:21 computed role assignments have their own region, for example
16:33:26 same with tokens
16:33:50 are regions expensive? Is there a reason to avoid using them?
16:34:04 i'm not sure - that might be a better question for kmalloc
16:34:15 no
16:34:16 #link https://review.openstack.org/#/c/612686/1/keystone/identity/core.py,unified is an attempt at creating a region specifically for users
16:34:29 could we wrap users, groups, projects, etc. each with a region, and then, as part of the sql migrations, flush the region
16:34:29 not expensive, but we have cases where we cannot invalidate an explicit cache key
16:34:40 e.g.
many entries via kwargs into a single method
16:34:46 so we need to invalidate the entire region
16:34:49 #link https://review.openstack.org/#/c/612686/1/keystone/auth/core.py,unified@389 drops the entire user region (every cached user)
16:34:57 it is better to narrow the invalidation to as small a subset as possible
16:35:15 no reason to invalidate *everything* if only computed role assignments need to be invalidated
16:35:30 kmalloc, if we change the schema on, in this case, users, we need to invalidate all cached users. Is that too specific?
16:35:41 you can do so.
16:35:52 each class of object gets its own region?
16:36:03 so far yes
16:36:13 well...
16:36:20 each manager
16:36:27 ok... so, we could tie in with the migration code, too, to identify what regions need to be invalidated
16:36:32 correct - if that region needs to be invalidated
16:36:37 and some managers have extra regions, e.g. computed assignments
16:36:41 OK, so users and groups would go together, for example?
16:36:46 right now, yes
16:37:00 but - they could be two separate regions if needed
16:37:04 ++
16:37:07 depends on the invalidation strategy
16:37:08 it's highly modular
16:37:23 Backend is probably granular enough
16:37:29 or what needs to invoke invalidation, how often, etc...
16:37:31 identity, assignment, resource
16:37:57 you can also force a cache pop by changing the argument(s)/kwargs [once https://review.openstack.org/#/c/611120/ lands] in the method signature
16:38:06 since we cache memoized
16:38:10 yech
16:38:14 let's not count on that.
16:38:37 it is a way caching works.
16:38:37 I'd hate to have to change kwargs just to force a cache pop
16:38:51 yeah, and it is ok, just not what we want to use for this requirement
16:38:56 it is the way a lot of things on the internet work, an explicit change to the request forcing a cache cycle
16:39:36 in either case you can force a cache pop.
though i would not want to do that in db_sync
16:40:02 it might make sense to do an explicit region (all regions) cache expiration/invalidation on keystone start
16:40:38 or as a keystone-manage command
16:40:59 hooking all the cache logic into db_sync seems ... rough
16:41:17 in that case, a single keystone node could invalidate the memcached instances
16:41:26 what if db_sync set the values that would then be used by the manage command
16:41:31 but that behavior also depends on cache configuration
16:41:42 like a scratch table with the set of regions to invalidate?
16:41:43 ayoung: there is no reason to do something like that
16:42:05 really, just invalidate the regions
16:42:10 they will re-warm quickly
16:42:25 upgrade steps should be expected to need a cache invalidation/rewarm
16:42:29 performance will degrade for a bit
16:42:50 also - cmurphy brought up a good point earlier that it would be nice to find a solution that wasn't super specific to just this case
16:42:54 which is fine for an upgrade process. we already say "turn everything off except X"
16:42:57 since this is likely going to happen in the future
16:43:13 so, i'd say keystone-manage that forces a region-wide invalidation
16:43:17 [all regions]
16:43:55 I'll defer. I thought we were going more specific, to flush only the regions we knew had changed, but this is ok
16:44:44 for the most part our TTLs are very narrow
16:45:10 i'll bet most of the cache is popped just by timeout (5m) during the upgrade process
16:45:16 or a restart of memcache servers as part of the deal
16:45:53 this is just explicit; another option is to add a namespace value that we change per release of keystone
16:46:05 that just forces rotation of the cache based upon the code base.
16:46:09 ok, so keystone-manage cache-invalidate [region | all ]?
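The per-region invalidation being discussed can be sketched without dogpile.cache itself. This stand-in mimics the key behavior: invalidating a region marks its existing entries stale rather than deleting them, and a hypothetical keystone-manage cache-invalidate [region | all] just picks which regions to bump:

```python
class CacheRegion:
    """Minimal stand-in for a cache region: invalidation bumps a
    generation counter, so entries written earlier read as misses."""

    def __init__(self):
        self._data = {}        # key -> (generation, value)
        self._generation = 0

    def set(self, key, value):
        self._data[key] = (self._generation, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] < self._generation:
            return None        # miss: caller refetches from the DB
        return entry[1]

    def invalidate(self):
        # No mass delete needed: stale entries simply stop matching.
        self._generation += 1

# One region per manager, as discussed above (hypothetical layout).
REGIONS = {"identity": CacheRegion(), "assignment": CacheRegion()}

def cache_invalidate(region="all"):
    """What a 'keystone-manage cache-invalidate [region | all]' could
    do: drop a single region, or every region at once."""
    targets = REGIONS.values() if region == "all" else [REGIONS[region]]
    for r in targets:
        r.invalidate()
```

This is why invalidating only the identity region after a user-schema migration is cheaper than a full memcached flush: computed role assignments, tokens, etc. stay warm.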
16:46:30 fwiw, a namespace is just "added" to the cache key (before the sha calculation)
16:46:51 which then forces a new keystone to always use a new cache namespace
16:47:02 no "don't forget to run this command"
16:47:19 (though an explicit cache invalidate command might be generally useful regardless)
16:47:50 cool. We good here?
16:48:06 i think so - we're probably at a good point to continue the discussion in review
16:48:17 we could use https://github.com/openstack/keystone/blob/master/keystone/version.py#L15 anyway. yeah we should continue discussion in review
16:48:29 Cool... ok
16:48:31 #topic open discussion
16:48:41 two things
16:48:48 1. service roles
16:48:49 we have 12 minutes to talk about whatever we wanna talk about
16:48:51 Flask has 2 more reviews, all massive code removals! yay, we're done with the migration
16:48:59 we need a way to convert people from admin-everywhere to service roles
16:49:02 * kmalloc has nothing else to talk about there, just cheering that we got there
16:49:08 so... short version:
16:49:43 * kmalloc hands the floor to ayoung... and since ayoung is now holding the entire floor, everyone falls... into the emptiness/bottomless area below the floor.
16:49:48 we roll in rules that say admin (not everywhere) is service role or is_admin_project, and leave the current mechanism in place
16:50:23 so, once we enable a bogus admin project in keystone, none of the tokens will ever have is_admin_project set
16:50:29 then we can remove those rules
16:51:05 it will let a deployer decide when to switch on service roles as the only allowed way to perform those ops
16:51:11 why wouldn't we just use system-scope and use the upgrade checks to make sure people have the right role assignments according to their policy?
16:51:59 lbragstad, so...
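The namespace idea above is just extra material mixed into the cache key before hashing, so a new release computes different keys and never hits the old cache. A sketch, with RELEASE standing in for a string like the one in keystone/version.py (the key layout here is illustrative):

```python
import hashlib

# Stand-in for the release string from keystone/version.py; when it
# changes per release, new code derives different keys, so the old
# cache simply goes cold - no "don't forget to run this command".
RELEASE = "14.0.0"

def cache_key(namespace, raw_key, release=RELEASE):
    # The namespace (and here, the release) is prepended to the raw
    # key before the sha calculation, as described above.
    material = "{}:{}:{}".format(release, namespace, raw_key)
    return hashlib.sha1(material.encode("utf-8")).hexdigest()
```

Bumping the release value changes every derived key, which achieves the same effect as a full invalidation without an operator step.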
16:52:10 that implies a big bang change
16:52:14 those never go smoothly
16:52:30 we want to be able to have people get used to using system roles, but not break their existing workflows
16:52:38 but upgrade checkers are a programmable way to help with those types of things?
16:52:51 will it make sure that Horizon works?
16:52:58 Will it make sure 3rd party apps work?
16:53:10 we want to leave the existing policy in place until they are ready to throw the switch
16:53:16 and give them a way to throw it back
16:53:33 right now, people are misusing admin tokens
16:53:47 I've seen some really crappy code along those lines
16:54:18 ayoung: that is the idea behind the deprecated policy bits, they just do a logical OR between new and old
16:54:20 we want to tell people: switch to using "service scoped tokens" and make it their choice
16:54:34 yeah, but....
16:54:39 until we remove the declaration of the "this is the deprecated rule"
16:54:47 I don't want to have to try and synchronize this across all of the projects in openstack
16:54:55 you are going to have to.
16:54:56 so... we absolutely use those
16:55:03 it's just how policy works
16:55:10 re-read what I said
16:55:26 it allows us to roll in those changes, but keep things working as-is until we throw the switch
16:55:28 you can't just wave a wand here.
16:55:41 I worked long and hard on this wand
16:56:10 it is going to be "support (new or old) or supply custom policy"
16:56:12 so, the idea is we get a common definition of service scoped admin-ness
16:56:16 the switch is thrown down the line.
16:56:25 yes!
16:56:29 and it likely will be an upgrade
16:56:36 where the old declaration is removed
16:56:46 but it COULD be re-added with a custom policy override
16:56:56 this has to be done per-project in that project's tree
16:57:10 what happens if that breaks a critical component?
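The deprecated-rule mechanism described above (enforce new OR old until the old declaration is removed) reduces to a small piece of logic. This is a sketch of that behavior, not the oslo.policy API itself, and the rule contents are hypothetical:

```python
def make_check(new_check, deprecated_check=None):
    """While a deprecated rule is still declared, enforcement passes if
    EITHER the new or the old check passes; removing the declaration
    later 'throws the switch' to new-only."""
    def check(creds):
        if new_check(creds):
            return True
        # Old behavior keeps working until the deprecated rule goes away.
        return deprecated_check(creds) if deprecated_check else False
    return check

# Hypothetical rules: new service-role check vs. legacy admin-everywhere.
new_rule = lambda creds: "service" in creds.get("roles", [])
old_rule = lambda creds: "admin" in creds.get("roles", [])

transition = make_check(new_rule, deprecated_check=old_rule)
final = make_check(new_rule)  # after the old declaration is removed
```

A deployer who still needs the old behavior after the switch would re-add it as a custom policy override, per the discussion above.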
16:57:18 they are not going to do a downgrade
16:57:22 3 things: 1) supply a fixed custom policy
16:57:28 (quick remediation)
16:57:41 2) do better UAT and/or halt upgrade
16:57:46 3) roll back to previous
16:58:10 custom policy set to the old policy string is immediate and fixes "critical path is broken"
16:58:20 So... nothing I am saying is going to break that. But it ain't going to work that smoothly
16:58:22 so...
16:58:27 here is the middle piece:
16:58:47 make it an organizational decision to enable and disable the service scoped roles as the ONLY way to enforce that policy
16:58:53 and isolate that decision
16:59:01 final minute
16:59:05 this feels like a deployer/installer choice.
16:59:07 fwiw
16:59:18 OK... one other thing
16:59:19 not something we can encode directly
16:59:35 (just because of how we sucked at building how policy works in the past)
16:59:36 I propose that the custom policies we discussed last week go to oslo-context instead of oslo-policy
16:59:42 -2
16:59:59 put them external in a new lib if it doesn't go in oslo-policy
16:59:59 oslo-context is the openstack specific code. oslo-policy is a generic rules engine.
17:00:11 there is a dependency between them for this anyway
17:00:11 context is the wrong place to put things that are policy rules.
17:00:20 so is oslo-policy, though
17:00:21 oslo.context is a holder object for context data.
17:00:32 but we insist on it for enforcing policy
17:00:33 put them in oslo-policy and then extract to a new thing
17:00:38 or put it in a new thing and fight to land it
17:00:42 oslo.context is often overridden for service specific implementations, too
17:00:53 I think it stays in a new thing, then
17:00:55 do not assume oslo.context is even in use.
17:01:08 i told you i recommend oslo-policy for one reason only
17:01:12 just for ease of landing it
17:01:15 then extract
17:01:18 ok - we're out of time folks
17:01:24 but, i am happy to support a new thing as well
17:01:31 cool.
I'll push for the new thing
17:01:32 it will just be painful to get adopted (overall)
17:01:34 reminder that we have office hours and we can continue there
17:01:48 thanks all!
17:01:54 but i am fine with +2ing lots of stuff for that as it comes down the line
17:02:10 #endmeeting