16:00:22 <lbragstad> #startmeeting keystone
16:00:23 <openstack> Meeting started Tue Oct 23 16:00:22 2018 UTC and is due to finish in 60 minutes. The chair is lbragstad. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:27 <openstack> The meeting name has been set to 'keystone'
16:00:28 <lbragstad> #link https://etherpad.openstack.org/p/keystone-weekly-meeting
16:00:32 <lbragstad> o/
16:00:53 <cmurphy> o/
16:01:06 <hrybacki> o/
16:01:09 <gagehugo> o/
16:02:02 <ayoung> Oyez oyez
16:02:16 <wxy|> o/
16:03:00 <lbragstad> #topic Release status
16:03:16 <lbragstad> #info next week is Stein-1 and specification proposal freeze
16:03:50 <ayoung> I assume we have real things to discuss prior to my two agenda items
16:04:04 <lbragstad> we should be smoothing out concerns with specs sooner rather than later at this point
16:04:33 <kmalloc> o/
16:04:42 <lbragstad> if you have specific items wrt specs and want higher bandwidth to discuss, please let someone know
16:04:55 <lbragstad> or throw it on the meeting agenda
16:05:13 <lbragstad> ayoung do you want to reorder the schedule?
16:05:22 <lbragstad> or is that what you're suggesting?
16:06:28 <ayoung> Nah
16:06:31 <ayoung> I'm going last
16:06:40 <lbragstad> ok
16:06:49 <lbragstad> #topic Oath approach to federation
16:07:05 <lbragstad> last week we talked about Oath open-sourcing their approach to federation
16:07:14 <ayoung> so replace uuid3 with uuid5 and I like
16:07:46 <ayoung> couple other things; we could make it so the deployer chooses the namespace, and could keep that in sync across their deployments, to get "unique" Ids that are still distributed
16:07:49 <lbragstad> tl;dr they consume Athenz tokens in place of SAML assertions and have their own auth plugin for doing their version of auto-provisioning
16:08:13 <lbragstad> you can find the code here:
16:08:15 <lbragstad> #link https://github.com/yahoo/openstack-collab/tree/master/keystone-federation-ocata
16:08:43 <lbragstad> i started walking through it and comparing their implementation against what we have, just to better understand the differences
16:08:53 <lbragstad> you can find that here, but i just started working on it
16:08:55 <lbragstad> #link https://etherpad.openstack.org/p/keystone-shadow-mapping-athenz-delta
16:09:04 <ayoung> Could Athenz be done as a middleware module? Something that looks like REMOTE_USER/REMOTE_GROUPS? Or does it provide more information than we currently accept from SAML etc
16:09:40 <lbragstad> it doesn't really follow the saml spec at all - from what i can tell, it gets everything from the athenz token and the auth body
16:10:35 <lbragstad> the auth plugin decodes the token and provisions users, projects, and roles based on the values
16:10:41 <ayoung> Because Autoprovisioning is its own thing, and we should be willing to accept that as a standalone contribution anyway.
16:11:44 <lbragstad> yeah - i guess it's important to note that Oath developed this for replicated usecases and not auto-provisioning specifically, but the implementation is very similar to what we developed as a solution for auto-provisioning
16:12:19 <ayoung> also...Oath needs predictable Ids.
16:12:19 <ayoung> I have a WIP spec to support those. It is more than just Users, it looks like
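
[Editor's note: a minimal sketch of the "uuid5 with a deployer-chosen namespace" idea ayoung raises above, so the same name hashes to the same ID in every region. The namespace value and helper name are illustrative only and are not taken from the Oath plugin or the WIP spec.]

    # Deterministic user IDs: shared namespace + name -> identical ID everywhere.
    import uuid

    # Hypothetical deployer-chosen namespace, kept in sync across deployments.
    DEPLOYMENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, 'cloud.example.com')

    def predictable_user_id(name):
        # uuid5 is a stable, SHA-1 based function of (namespace, name),
        # so 'admin' maps to the same ID in region1 and region2.
        return uuid.uuid5(DEPLOYMENT_NAMESPACE, name).hex

    print(predictable_user_id('admin'))
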
16:12:38 <lbragstad> i'm not sure they need those if they come from the identity provider
16:12:42 <lbragstad> which is athenz
16:12:49 <ayoung> I think the inter-tubes are congested
16:13:22 <ayoung> https://review.openstack.org/#/c/612099/
16:13:51 <lbragstad> why does athenz need predictable user ids?
16:14:07 <ayoung> lbragstad, because they need to be the same from location to location
16:14:28 <ayoung> so admin can't be one ABC on region1 and 123 in region2
16:14:30 <lbragstad> https://github.com/yahoo/openstack-collab/blob/master/keystone-federation-ocata/plugin/keystone/auth/plugins/athenz.py#L123-L129
16:14:56 <ayoung> they state they use uuid3(NAMESPACE, name)
16:14:56 <lbragstad> the user id is generated by athens
16:15:03 <lbragstad> athenz*
16:15:36 <lbragstad> and keystone just populates it in the database, from what i can tell
16:16:04 <lbragstad> so long as you're using athenz tokens to access keystone service providers, you should have the same user id at each site?
16:16:15 <ayoung> that is my understanding, yes
16:16:41 <lbragstad> so their implementation has already achieved predictable user ids
16:16:59 <lbragstad> right?
16:19:14 <lbragstad> if anyone feels like parsing that code, feel free to add your comments, questions, or concerns to that etherpad
16:19:28 <lbragstad> it might be helpful if/when we or penick go to draft a specification
16:19:42 <lbragstad> worst case, it helps us understand their usecase a bit better
16:19:59 <wxy|> Will take a look later.
16:20:07 <lbragstad> thanks wxy|
16:20:11 <lbragstad> any other questions on this?
16:20:57 * knikolla will read back. am stuck in meetings as we have the MOC workshop next week. sorry for being AWOL this time period.
16:21:11 <lbragstad> no worries - thanks knikolla
16:21:16 <lbragstad> alright, moving on
16:21:29 <lbragstad> #topic Another report of upgrade failures with user options
16:21:45 <lbragstad> #link https://bugs.launchpad.net/openstack-ansible/+bug/1793389
16:21:45 <openstack> Launchpad bug 1793389 in openstack-ansible "Upgrade to Ocata: Keystone Intermittent Missing 'options' Key" [Medium,Fix released] - Assigned to Alex Redinger (rexredinger)
16:21:56 <lbragstad> we've had this one crop up a few times
16:22:17 <lbragstad> specifically, the issue is due to caching during a live upgrade
16:22:54 <lbragstad> from pre-Ocata to Ocata
16:23:11 <lbragstad> it's still undetermined if this impacts FFU scenarios
16:23:18 <lbragstad> (e.g. Newton -> Pike)
16:23:57 <lbragstad> but it boils down to the cache returning a user reference during authentication on Ocata code that expects user['options'] to be present, but isn't because the user was cached prior to the upgrade
16:24:20 <ayoung> Gah...disconnect. I'll try to catch up
16:24:45 <lbragstad> deployment projects have a workaround to flush memcached as a way to force a miss on authentication and refetch the user
16:25:46 <lbragstad> cmurphy odyssey4me and i were discussing approaches for mitigating this in keystone directly
16:26:11 <lbragstad> there is a WIP review in gerrit
16:26:14 <lbragstad> #link https://review.openstack.org/#/c/612686/
16:26:24 <lbragstad> but curious if people have thoughts or concerns about this approach
16:26:40 <lbragstad> or if there are other approaches we should consider
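
[Editor's note: a minimal sketch of the failure mode described above, not keystone's actual auth code. A user reference cached before the Ocata upgrade has no 'options' key, so post-upgrade code that indexes it directly blows up; a defensive default is one shape a fix could take. The option name is used only for illustration.]

    # Stale entry written by pre-Ocata code vs. a reference fetched after upgrade.
    cached_user = {'id': 'abc123', 'name': 'demo'}
    fresh_user = {'id': 'abc123', 'name': 'demo', 'options': {}}

    def mfa_rules_enabled(user):
        # user['options'] raises KeyError for the stale entry; falling back to
        # an empty dict tolerates both shapes until the cache is refreshed.
        return user.get('options', {}).get('multi_factor_auth_enabled', False)

    assert mfa_rules_enabled(cached_user) is False
    assert mfa_rules_enabled(fresh_user) is False
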
16:27:31 <ayoung> wouldn't deploying a fix like this flush the cache anyway?
16:27:38 <ayoung> How could they ever get in this state?
16:28:01 <lbragstad> the memcached instance has a valid cache for a specific user
16:28:22 <ayoung> Is this a side effect of 0 downtime upgrades? Keep the cache up, even as we change the data out from underneath?
16:28:38 <lbragstad> yeah - that's the problem
16:28:41 <lbragstad> the cache remains up
16:28:48 <lbragstad> thus holding the cached data
16:28:50 <ayoung> that is going to be a problem in other ways
16:29:13 <ayoung> needs to be part of the upgrade. Flush cache when we do ....
16:29:18 <ayoung> contract?
16:29:38 <ayoung> we change the schema in the middle. The cache will no longer reflect the schema after some point
16:30:12 <lbragstad> that's what https://review.openstack.org/#/c/608066/ does
16:30:19 <lbragstad> but not in process
16:30:25 <cmurphy> that's the problem, the question is whether we can be a bit more surgical instead of flushing the whole cache
16:30:35 <ayoung> I see that, but it is on a row by row basis
16:31:00 <ayoung> yeah, that review looks like it is in the right direction
16:31:41 <ayoung> so...can we tell memcache to flush all of a certain class of entry? As I recall from token revocations, that is not possible
16:31:47 <lbragstad> also - alex's comment on https://review.openstack.org/#/c/612686/ proves this could affect FFU
16:31:57 <ayoung> it only knows about key/value stores
16:32:38 <lbragstad> ayoung are you asking about cache region support?
16:32:50 <ayoung> lbragstad, maybe.
16:33:02 <ayoung> does each region reflect a specific class of cached objects?
16:33:04 <lbragstad> some parts of keystone rely on regions, yes
16:33:21 <lbragstad> computed role assignments have their own region, for example
16:33:26 <lbragstad> same with tokens
16:33:50 <ayoung> are regions expensive? Is there a reason to avoid using them?
16:34:04 <lbragstad> i'm not sure - that might be a better question for kmalloc
16:34:15 <kmalloc> no
16:34:16 <lbragstad> #link https://review.openstack.org/#/c/612686/1/keystone/identity/core.py,unified is an attempt at creating a region specifically for users
16:34:29 <ayoung> could we wrap user, groups, projects etc each with a region, and then, as part of the sql migrations, flush the region
16:34:29 <kmalloc> not expensive, but we have cases where we cannot invalidate an explicit cache key
16:34:40 <kmalloc> e.g. many entries via kwargs into a single method
16:34:46 <kmalloc> so we need to invalidate the entire region
16:34:49 <lbragstad> #link https://review.openstack.org/#/c/612686/1/keystone/auth/core.py,unified@389 drops the entire user region (every cached user)
16:34:57 <kmalloc> it is better to narrow the invalidation to as small a subset as possible
16:35:15 <kmalloc> no reason to invalidate *everything* if only computed role assignments need to be invalidated
16:35:30 <ayoung> kmalloc, if we change the schema on, in this case, users, we need to invalidate all cached users. Is that too specific?
16:35:41 <kmalloc> you can do so.
16:35:52 <ayoung> each class of object gets its own region?
16:36:03 <kmalloc> so far yes
16:36:13 <kmalloc> well...
16:36:20 <kmalloc> each manager
16:36:27 <ayoung> ok...so, we could tie in with the migration code, too, to identify what regions need to be invalidated
16:36:32 <lbragstad> correct - if that region needs to be invalidated
16:36:37 <kmalloc> and some managers have extra regions, e.g. computed assignments
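
[Editor's note: a minimal sketch of the per-manager cache regions being discussed, written against dogpile.cache directly (keystone wraps this through oslo.cache); the region names and the in-memory backend are illustrative only.]

    from dogpile.cache import make_region

    # One region per class of object / manager, so each can be invalidated alone.
    user_region = make_region(name='users').configure('dogpile.cache.memory')
    assignment_region = make_region(
        name='computed_assignments').configure('dogpile.cache.memory')

    @user_region.cache_on_arguments()
    def get_user(user_id):
        return {'id': user_id, 'options': {}}   # stand-in for the real DB fetch

    get_user('abc123')                 # warms the user region
    user_region.invalidate(hard=True)  # drops every cached user, nothing else

Invalidating only the user region after a schema change is the "surgical" behavior cmurphy asks about above: computed assignments, tokens, and the rest stay warm.
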
16:36:41 <ayoung> OK, so users and groups would go together, for example?
16:36:46 <kmalloc> right now, yes
16:37:00 <lbragstad> but - they could be two separate regions if needed
16:37:04 <kmalloc> ++
16:37:07 <lbragstad> depends on the invalidation strategy
16:37:08 <kmalloc> it's highly modular
16:37:23 <ayoung> Backend is probably granular enough
16:37:29 <lbragstad> or what needs to invoke invalidation, how often, etc...
16:37:31 <ayoung> identity, assignment, resource
16:37:57 <kmalloc> you can also force a cache pop by changing the argument(s)/kwargs [once https://review.openstack.org/#/c/611120/ lands] in the method signature
16:38:06 <kmalloc> since we cache memoized
16:38:10 <ayoung> yech
16:38:14 <ayoung> let's not count on that.
16:38:37 <kmalloc> it is a way caching works.
16:38:37 <ayoung> I'd hate to have to change kwargs just to force a cache pop
16:38:51 <ayoung> yeah, and it is ok, just not what we want to use for this requirement
16:38:56 <kmalloc> it is a way a lot of things on the internet work, explicit change to the request forcing a cache cycle
16:39:36 <kmalloc> in either case you can force a cache pop. though i would not want to do that in db_sync
16:40:02 <kmalloc> it might make sense to do an explicit region (all region) cache expiration/invalidation on keystone start
16:40:38 <kmalloc> or as a keystone-manage command
16:40:59 <kmalloc> hooking in all the cache logic into db_sync seems ... rough
16:41:17 <lbragstad> in that case, a single keystone node could invalidate the memcached instances
16:41:26 <ayoung> what if db_sync set the values that would then be used by the manage-command
16:41:31 <lbragstad> but that behavior also depends on cache configuration
16:41:42 <ayoung> like a scratch table with the set of regions to invalidate?
16:41:43 <kmalloc> ayoung: there is no reason to do something like that
16:42:05 <kmalloc> really, just invalidate the regions
16:42:10 <kmalloc> they will re-warm quickly
16:42:25 <kmalloc> upgrade steps should be expected to need a cache invalidation/rewarm
16:42:29 <lbragstad> performance will degrade for a bit
16:42:50 <lbragstad> also - cmurphy brought up a good point earlier that it would be nice to find a solution that wasn't super specific to just this case
16:42:54 <kmalloc> which is fine for an upgrade process. we already say "turn everything off except X"
16:42:57 <lbragstad> since this is likely going to happen in the future
16:43:13 <kmalloc> so, i'd say keystone-manage that forces a region-wide invalidation
16:43:17 <kmalloc> [all regions]
16:43:55 <ayoung> I'll defer. I thought we were going more specific, to flush only regions we knew had changed, but, this is ok
16:44:44 <kmalloc> for the most part our TTLs are very narrow
16:45:10 <kmalloc> i'll bet most cache is popped just by timeout (5m) during upgrade process
16:45:16 <kmalloc> or a restart of memcache servers as part of the deal
16:45:53 <kmalloc> this is just explicit; another option is to add a namespace value that we change per release of keystone
16:46:05 <kmalloc> that just forces rotation of the cache based upon code base.
16:46:09 <ayoung> ok, so keystone-manage cache-invalidate [region | all ] ?
16:46:30 <kmalloc> fwiw, a namespace is just "added" to the cache key (before sha calculation)
16:46:51 <kmalloc> which then forces a new keystone to always use new cache namespace
16:47:02 <kmalloc> no "don't forget to run this command"
16:47:19 <kmalloc> (though an explicit cache invalidate command might be generally useful regardless)
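
[Editor's note: a minimal sketch of the "namespace per release" option kmalloc describes, where a release marker is folded into every cache key before hashing so a newly upgraded keystone never reads entries written by the old code. The key format and RELEASE constant are illustrative, not keystone's actual key generator.]

    import hashlib

    RELEASE = '15.0.0'  # e.g. sourced from keystone/version.py at import time

    def cache_key(region, method, *args):
        # Bumping RELEASE changes every key, quietly retiring the old cache
        # with no "don't forget to run this command" step.
        raw = ':'.join((RELEASE, region, method) + tuple(str(a) for a in args))
        return hashlib.sha1(raw.encode('utf-8')).hexdigest()

    print(cache_key('users', 'get_user', 'abc123'))
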
16:47:50 <ayoung> cool. We good here?
16:48:06 <lbragstad> i think so - we're probably at a good point to continue the discussion in review
16:48:17 <kmalloc> we could use https://github.com/openstack/keystone/blob/master/keystone/version.py#L15 anyway. yeah we should continue discussion in review
16:48:29 <ayoung> Cool...ok
16:48:31 <lbragstad> #topic open discussion
16:48:41 <ayoung> two things
16:48:48 <ayoung> 1. service roles
16:48:49 <lbragstad> we have 12 minutes to talk about whatever we wanna talk about
16:48:51 <kmalloc> Flask has 2 more reviews, all massive code removals! yay, we're done with the migration
16:48:59 <ayoung> we need a way to convert people from admin-everywhere to service roles
16:49:02 * kmalloc has nothing else to talk about there, just cheering that we got there
16:49:08 <ayoung> so...short version:
16:49:43 * kmalloc hands the floor to ayoung... and since ayoung is now holding the entire floor, everyone falls ... into the emptiness/bottomless area below the floor.
16:49:48 <ayoung> we roll in rules that say admin (not everywhere) is service role or is_admin_project, and leave the current mechanism in place
16:50:23 <ayoung> so, once we enable a bogus admin project in keystone, none of the tokens will ever have is_admin_project set
16:50:29 <ayoung> then we can remove those rules
16:51:05 <ayoung> it will let a deployer decide when to switch on service roles as the only allowed way to perform those ops
16:51:11 <lbragstad> why wouldn't we just use system-scope and use the upgrade checks to make sure people have the right role assignments according to their policy?
16:51:59 <ayoung> lbragstad, so...
16:52:10 <ayoung> that implied a big bang change
16:52:14 <ayoung> those never go smoothly
16:52:30 <ayoung> we want to be able to have people get used to using system roles, but not break their existing workflows
16:52:38 <lbragstad> but upgrade checkers are a programmable way to help with those types of things?
16:52:51 <ayoung> will it make sure that Horizon works?
16:52:58 <ayoung> Will it make sure 3rd party apps work?
16:53:10 <ayoung> we want to leave the existing policy in place until they are ready to throw the switch
16:53:16 <ayoung> and give them a way to throw it back
16:53:33 <ayoung> right now, people are misusing admin tokens
16:53:47 <ayoung> I've seen some really crappy code along those lines
16:54:18 <kmalloc> ayoung: that is the idea behind the deprecated policy bits, they just do a logical OR between new and old
16:54:20 <ayoung> we want to tell people: switch to using "service scoped tokens" and make it their choice
16:54:34 <ayoung> yeah, but....
16:54:39 <kmalloc> until we remove the declaration of the "this is the deprecated rule"
16:54:47 <ayoung> I don't want to have to try and synchronize this across all of the projects in openstack
16:54:55 <kmalloc> you are going to have to.
16:54:56 <ayoung> so...we absolutely use those
16:55:03 <kmalloc> it's just how policy works
16:55:10 <ayoung> re-read what I said
16:55:26 <ayoung> it allows us to roll in those changes, but keep things working as-is until we throw the switch
16:55:28 <kmalloc> you can't just wave a wand here.
16:55:41 <ayoung> I worked long and hard on this wand
16:56:10 <kmalloc> it is going to be a "support (new or old) or supply custom policy"
16:56:12 <ayoung> so, the idea is we get a common definition of service scoped admin-ness
16:56:16 <kmalloc> the switch is thrown down the line.
16:56:25 <ayoung> yes!
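
[Editor's note: a minimal sketch of the "logical OR between new and old" deprecation mechanism kmalloc refers to, using oslo.policy's DeprecatedRule. The policy name, check strings, and release tag here are illustrative, not keystone's actual defaults.]

    from oslo_policy import policy

    deprecated_list_users = policy.DeprecatedRule(
        name='identity:list_users',
        check_str='rule:admin_required')

    rules = [
        policy.DocumentedRuleDefault(
            name='identity:list_users',
            check_str='role:reader and system_scope:all',
            description='List users.',
            operations=[{'path': '/v3/users', 'method': 'GET'}],
            deprecated_rule=deprecated_list_users,
            deprecated_reason='admin-everywhere is being replaced by '
                              'system-scoped roles.',
            deprecated_since='S'),
    ]

    # Until a deployer overrides the rule, enforcement is effectively
    # "new check_str OR old check_str", so existing admin workflows keep
    # working while people move to the new scope; removing the deprecated
    # rule later is when the switch is actually thrown.
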
16:56:29 <kmalloc> and it likely will be an upgrade
16:56:36 <kmalloc> where the old declaration is removed
16:56:46 <kmalloc> but it COULD be re-added with a custom policy override
16:56:56 <kmalloc> this has to be done per-project in that project's tree
16:57:10 <ayoung> what happens if that breaks a critical component?
16:57:18 <ayoung> they are not going to do a downgrade
16:57:22 <kmalloc> 3 things: 1) supply a fixed custom policy
16:57:28 <kmalloc> (quick remediation)
16:57:41 <kmalloc> 2) do better UAT and/or halt upgrade
16:57:46 <kmalloc> 3) roll back to previous
16:58:10 <kmalloc> custom policy to the old policy string is immediate and fixes "critical path is broken"
16:58:20 <ayoung> So...nothing I am saying is going to break that. But it ain't going to work that smoothly
16:58:22 <ayoung> so...
16:58:27 <ayoung> here is the middle piece:
16:58:47 <ayoung> make it an organizational decision to enable and disable the service scoped roles as the ONLY way to enforce that policy
16:58:53 <ayoung> and isolate that decision
16:59:01 <lbragstad> final minute
16:59:05 <kmalloc> this feels like a deployer/installer choice.
16:59:07 <kmalloc> fwiw
16:59:18 <ayoung> OK...one other thing
16:59:19 <kmalloc> not something we can encode directly
16:59:35 <kmalloc> (just because of how we sucked at building how policy works in the past)
16:59:36 <ayoung> I propose that the custom policies we discussed last week go to oslo-context instead of oslo-policy
16:59:42 <kmalloc> -2
16:59:59 <kmalloc> put them external in a new lib if it doesn't go in oslo-policy
16:59:59 <ayoung> oslo-context is the openstack specific code. oslo-policy is a generic rules engine.
17:00:11 <ayoung> there is a dependency between them for this anyway
17:00:11 <kmalloc> context is the wrong place to put things that are policy rules.
17:00:20 <ayoung> so is oslo-policy, tho
17:00:21 <kmalloc> oslo context is a holder object for context data.
17:00:32 <ayoung> but we insist on it for enforcing policy
17:00:33 <kmalloc> put them in oslo-policy and then extract to new thing
17:00:38 <kmalloc> or put it in new thing and fight to land it
17:00:42 <lbragstad> oslo.context is often overridden for service specific implementations, too
17:00:53 <ayoung> I think it stays in new thing, then
17:00:55 <kmalloc> do not assume oslo.context even is in use.
17:01:08 <kmalloc> i told you i recommend oslo-policy for one reason only
17:01:12 <kmalloc> just for ease of landing it
17:01:15 <kmalloc> then extract
17:01:18 <lbragstad> ok - we're out of time folks
17:01:24 <kmalloc> but, i am happy to support a new thing as well
17:01:31 <ayoung> cool. I'll push for new thing
17:01:32 <kmalloc> it will just be painful to get adopted (overall)
17:01:34 <lbragstad> reminder that we have office hours and we can continue there
17:01:48 <lbragstad> thanks all!
17:01:54 <kmalloc> but i am fine with +2ing lots of stuff for that as it comes down the line
17:02:10 <lbragstad> #endmeeting