18:02:22 <dolphm> #startmeeting keystone
18:02:23 <openstack> Meeting started Tue Aug 9 18:02:22 2016 UTC and is due to finish in 60 minutes. The chair is dolphm. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:02:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:02:27 <openstack> The meeting name has been set to 'keystone'
18:02:27 <stevemar> dolphm: you remember how to run the meeting?
18:02:28 <stevemar> :)
18:02:35 <bknudson> just like old times.
18:02:42 <stevemar> bknudson: #bettertimes
18:02:44 <ayoung> maybe <o
18:02:55 <lbragstad> nostalgia
18:03:00 <dolphm> #topic Scheduley bits
18:03:13 <bknudson> trump will bring back the good old days.
18:03:21 <dolphm> #info August 22-26 is final release for non-client libs
18:03:35 <henrynash> (3 minutes in...and first Trump reference already..)
18:03:39 <dolphm> #info Newton-3 deadline is August ~29
18:03:41 <stevemar> hehe
18:03:44 <dstanek> "Let's make keystone great again"
18:03:49 <rderose_> ++
18:03:52 <gagehugo> +1
18:03:56 <rodrigods> lol
18:04:02 <samueldmq> hi all
18:04:12 <stevemar> dolphm: i think the client libraries freeze the same time as the server side
18:04:29 <stevemar> the non-client libs (keystoneauth and keystonemiddleware) freeze a week before
18:04:30 <bknudson> do we get to choose when our libs freeze?
18:04:36 <stevemar> bknudson: nope!
18:04:42 <dolphm> #link http://releases.openstack.org/newton/schedule.html
18:05:05 <dolphm> non-client libs freeze a week early to help ease requirements management at the end of the release
18:05:23 <dolphm> #topic Let's make HMT great again: support for delete and update
18:05:40 <dolphm> #link https://blueprints.launchpad.net/keystone/+spec/project-tree-deletion
18:05:47 <dolphm> Delete patch: https://review.openstack.org/#/c/244248/ (abandoned)
18:05:50 <notmorgan> i might be lurking today
18:05:51 * rodrigods hides in the corner
18:05:52 <dolphm> Update patch: https://review.openstack.org/#/c/243585/ (abandoned)
18:06:02 <dolphm> stevemar: have a minute for this one, since your name is on it?
18:06:10 <stevemar> dolphm: sure
18:06:17 <stevemar> right
18:06:36 <stevemar> so, we implemented all this stuff in the controller and manager and backend
18:06:41 <stevemar> and we don't have an API that can hit it
18:06:48 <stevemar> so... kinda pointless right now
18:06:55 <rderose_> Revert it all!
18:06:59 <rderose_> :)
18:07:03 <amakarov> dolphm, is my patch https://review.openstack.org/#/c/285521/ somewhere on the schedule?
18:07:28 <rodrigods> major issue is because in the HMT design, a project isn't owner of its subtree
18:07:32 <dolphm> amakarov: no? but feel free to add it https://etherpad.openstack.org/p/keystone-weekly-meeting
18:07:32 <gyee> see, see, rderose_ said it
18:07:58 <stevemar> so yeah, as rodrigods said, there were issues around policy and what it should look like for deleting / updating a tree of projects
18:08:12 <dolphm> rodrigods: so the implementation is not ready to be exposed to the API?
18:08:15 <stevemar> i'm only half joking when i say revert it all
18:08:26 <dstanek> stevemar: --
18:08:37 <rodrigods> dolphm, looks like it is impossible to implement it cleanly
18:08:48 <samueldmq> is there a REAL use case for it ?
18:08:53 <stevemar> my argument is that no one has pushed for it in months, so do we really need it?
18:08:54 <rodrigods> samueldmq, sure
18:08:55 <gyee> stevemar, trump was half-joking when he accepted the nomination too
18:08:58 <notmorgan> rderose_: revert all of HMT! /snark, (sorry)
18:09:11 <henrynash> I think part of the problem was a lack of clarity on the conceptual meaning ....i.e. is delete/update tree a single command policied in its own right...or is it shorthand for issue update/delete on every node in the tree and the regular policy applies to each one?
18:09:14 <dstanek> rodrigods: so what can we do here?
18:09:23 <stevemar> i'm trying to encourage anyone to pick this up and finish it :)
18:09:33 <lbragstad> henrynash isn't that the cascade feature?
18:09:39 <stevemar> lbragstad: yep
18:09:59 <rodrigods> dstanek, think henrynash can be more precise about the issues since he made further investigations on it
18:10:01 <samueldmq> henrynash: yes, because there is no way to specify that in the current policies
18:10:17 <notmorgan> updating a whole tree is potentially scary (tm)
18:10:22 <stevemar> anyway, i didn't expect a solution to come out of this, just a PSA
18:10:23 <henrynash> lbragstad: yes, you could argue that this is expanding the cascade capability to full delete/update (not just the enabled flag which is all you can do with it today)
18:10:32 <rodrigods> notmorgan, the update is to disable, that is required for deletion
18:10:34 <notmorgan> and/or deleting because of our lack of "soft delete"
18:10:38 <stevemar> i was cleaning old BPs and this one keeps getting bumped
18:10:46 <samueldmq> anyone picking this up ?
18:10:59 <dstanek> what does 'update a tree' actually mean?
18:11:08 <rodrigods> dstanek, disable it
18:11:09 <henrynash> I'm happy to pick it up, just not sure whether I can do this for Newton
18:11:10 <stevemar> dstanek: only 'enabled' is allowed to be updated
18:11:11 <dolphm> henrynash: why does a disable have to cascade to take effect?
18:11:12 <bknudson> what happens if nobody picks it up?
18:11:12 <notmorgan> dstanek: specifically a disable
18:11:23 <stevemar> bknudson: revert it all! <joking>
18:11:25 <notmorgan> that cascades
18:11:26 <dstanek> rodrigods: do you have to explicitly disable all child nodes then?
18:11:27 <notmorgan> bknudson: since no public api... it could be removed?
18:11:37 <dolphm> henrynash: if any of a project's parents are disabled, the project is disabled, and amakarov's table should make that check trivial
18:11:39 <notmorgan> dstanek: today iirc yes.
18:11:44 <rodrigods> dstanek, for the current design, yes
18:11:56 <dstanek> sounds like we need to fix the glitch then
18:11:56 <bknudson> ok, just checking there's no huge bug if we don't do something.
18:12:06 <henrynash> dolphm: I was just describing what the current cascade does....
18:12:15 <notmorgan> dstanek: explicit disable. it still prevents login so security is fine, just a cascade delete cannot occur w/o everything disabled as i understand it
18:12:19 <dolphm> henrynash: right
18:12:42 <dolphm> rodrigods: it sounds like the problem is with checking the enabled/disabled state of a child (which is a read query), not with "cascading" a write to a bunch of rows in the db
18:12:58 <notmorgan> dolphm: correct. a cascade write would be absurd
18:13:10 <notmorgan> and shouldn't be a thing we do.
18:13:13 <rodrigods> right
18:13:34 <stevemar> bknudson: we just end up with lots of ugly branches in our code :(
18:13:35 <rodrigods> but... as of today
18:13:36 <notmorgan> doing the "is any of my parents disabled" check should be what is done.
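The "is any of my parents disabled" check discussed above can be sketched as a read-only walk up the hierarchy. This is a minimal illustration under stated assumptions, not keystone's implementation: the `Project` class, the `is_effectively_enabled` helper, and the in-memory `projects` dict are all hypothetical stand-ins for whatever the resource backend provides.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Project:
    """Hypothetical stand-in for a project row (not keystone's model)."""
    id: str
    parent_id: Optional[str]
    enabled: bool = True


def is_effectively_enabled(project_id: str, projects: Dict[str, Project]) -> bool:
    """Return True only if the project and every ancestor are enabled.

    This is a pure read: nothing is written to child rows when a
    parent is disabled.
    """
    current = projects[project_id]
    while True:
        if not current.enabled:
            return False
        if current.parent_id is None:
            return True
        current = projects[current.parent_id]


# Example hierarchy: root -> dept -> team, with only "dept" disabled.
projects = {
    "root": Project("root", None),
    "dept": Project("dept", "root", enabled=False),
    "team": Project("team", "dept"),
}
```

With this shape, disabling "dept" makes "team" read as disabled without touching its row, and re-enabling "dept" restores it.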
18:13:41 <rodrigods> we check if everyone is disabled
18:13:57 <samueldmq> I think the code is in
18:14:01 <notmorgan> this is a QOL change for operators/users imo
18:14:03 <samueldmq> we just don't know how to expose that in the policies
18:14:22 <stevemar> notmorgan: you are hardly lurking :)
18:14:23 <stevemar> hehe
18:14:46 <dolphm> samueldmq: the perceived "enabled" state of a project is just whether a project AND all of its parents are enabled
18:14:47 <ayoung> samueldmq, policy is kindof all or nothing. It should check to see if you can do the operation based on a role, but dealing with hierarchy is kindof beyond scope
18:14:50 <stevemar> dolphm: so yeah, just a PSA, didn't expect an outcome here
18:15:10 <ayoung> but....
18:15:11 <notmorgan> stevemar: i can go away again, i just was dropping in to voice the "this should be safe from a security standpoint" view.
18:15:18 <dolphm> notmorgan: thanks :)
18:15:23 <stevemar> notmorgan: we love having you around :)
18:15:28 <dolphm> notmorgan: not for the going away again bit, though
18:15:35 <notmorgan> dolphm: lol
18:15:35 <ayoung> the part that should be checked is the applicability of the token used to request the operation to each of the node
18:15:40 <henrynash> stevemar: Happy to get this assigned to me to go think about it and come back with a spec (or not) for Ocata
18:15:41 <ayoung> nodes to be deleted that is
18:15:50 <rodrigods> ayoung, right
18:15:57 <rodrigods> check auth in the whole tree
18:15:58 <ayoung> and what do we say if the scope does not match...probably deny
18:15:59 <stevemar> henrynash: i'd be ok with that
18:16:00 <samueldmq> dolphm: that made me think of something ... if a parent is disabled, shouldn't its children be disabled too (by consequence) ?
18:16:01 <rodrigods> authz*
18:16:04 <dolphm> notmorgan: /nick notnotmorgan ?
18:16:10 <samueldmq> anyways, we don't do that today
18:16:14 <dolphm> samueldmq: correct
18:16:18 <rodrigods> samueldmq, that's the point
18:16:22 <rodrigods> we don't do that today
18:16:27 <ayoung> samueldmq, yes they should
18:16:27 <samueldmq> and we can't
18:16:34 <samueldmq> because backwards compat
18:16:51 <samueldmq> and that would disable auth
18:16:53 <dolphm> samueldmq: i wouldn't consider it to be backwards incompatible to fix that behavior
18:16:55 <rodrigods> don't think changing that breaks the API
18:16:58 <dolphm> samueldmq: great!
18:17:03 <samueldmq> can that be considered a security issue that we then fix ?
18:17:04 <notmorgan> ayoung: uhm. wouldn't the owner of the top of the tree be allowed to delete even if they don't own down the tree?
18:17:08 <dolphm> it's not an API change so much as a change in the API's behavior
18:17:10 <notmorgan> just thinking through the usability
18:17:17 <notmorgan> but traversing up the tree no.
18:17:23 <dolphm> notmorgan: right
18:17:24 <rodrigods> the API remains the same, the internal representation is what changes
18:17:29 <ayoung> notmorgan, I was just thinking about that. I need to see what a Linux system does if you try that with a FS
18:17:42 <rodrigods> although, every GET project needs to look at the parents
18:17:43 <notmorgan> ayoung: depends on the FS iirc.
18:17:47 <rodrigods> to check if someone is disabled
18:17:58 <samueldmq> dolphm: ayoung: cool so we can technically fix that
18:18:10 <stevemar> i'm excited to get to the rolling upgrade topic :)
18:18:12 <dolphm> it sounds like we have some consensus on the topic to move forward then?
18:18:23 <samueldmq> ++
18:18:25 <rodrigods> henrynash will pick it up?
18:18:28 <samueldmq> henrynash: picking it up?
18:18:34 <dolphm> rodrigods: well played
18:18:42 <henrynash> yes
18:18:43 <rodrigods> :)
18:18:56 <samueldmq> henrynash: thanks
18:18:59 <rodrigods> henrynash, o/
18:19:01 <notmorgan> ayoung: http://paste.openstack.org/show/552606/
18:19:17 <notmorgan> ayoung: now. iirc you can delete files just not directory inodes.
18:19:19 <dolphm> #topic Let's make HTTP status code documentation and descriptions great again
18:19:21 <dolphm> lbragstad: o/
18:19:28 <dolphm> "The docs team wants to provide a template that we can add to our docs that gives some information regarding HTTP status codes along with formal and informal descriptions of what they mean."
18:19:32 <ayoung> http://paste.openstack.org/show/552607/
18:19:33 <lbragstad> ^
18:19:39 <dolphm> #link https://review.openstack.org/#/c/349551/
18:19:46 <lbragstad> some projects have already started
18:19:47 <lbragstad> ^
18:19:57 <ayoung> notmorgan, so, to a directory, anything inside it is a dentry, whether file or subdir
18:20:00 <lbragstad> thoughts/comments/questions/concerns?
18:20:02 <dolphm> "Thoughts about the documentation and additional descriptions? Should we organize an API sprint to knock this out?"
18:20:08 <notmorgan> yes
18:20:09 <lbragstad> I'll take feedback to the next docs meeting
18:20:10 <notmorgan> ayoung: yes
18:20:11 <ayoung> so if I don't have perms to delete a subdir, I can't remove the dir
18:20:23 <ayoung> so if I have x/y/z
18:20:34 <notmorgan> ayoung: http://paste.openstack.org/show/552608/
18:20:37 <ayoung> and I don't have permissions to delete y, the tree is protected
18:21:04 <notmorgan> ayoung: exactly. for POSIX, non-POSIX... let's just pretend we don't care here
18:21:14 <bknudson> for keystone, each operation might have a different meaning for each status code.
18:21:17 <gagehugo> lbragstad: seems like it would look nicer than what it currently is
18:21:23 <lbragstad> bknudson yeah - that's the tough part
18:21:26 <notmorgan> bknudson: also we have duplicated use of status
18:21:30 <lbragstad> there are a lot of things that can represent a 401
18:21:35 <notmorgan> basically the answer should be an "error code" with the status
18:21:42 <notmorgan> where we can provide it for security reasons
18:21:43 <dolphm> lbragstad: it sounds like this is a reverse of the normal API documentation structure, which is "here's an API request, and the possible status codes that could result from it"
18:21:50 <lbragstad> right
18:21:59 <ayoung> so...on cascading delete, we are doing something like rm -rf proj
18:22:02 <lbragstad> for example - see https://review.openstack.org/#/c/349551/1/api-ref/source/v2/http-codes.yaml
18:22:08 <bknudson> also, we can have a single status code that has multiple reasons.
18:22:16 <notmorgan> but we should at least clearly say status possible from API X is Y,Z,Q
18:22:18 * ayoung just realized we moved on
18:22:37 <notmorgan> ayoung: yeah sorry, just wanted to hand you the infos for further digesting down the line. :)
18:22:46 <stevemar> no one stops the ayoung train
18:22:50 <ayoung> can we say "you need role R to perform this operation?"
18:22:55 <amakarov> I'd be happy if I can find out where exactly it was decided to raise 401 based on status code in non-debug mode
18:22:59 <samueldmq> bknudson: ++
18:23:04 * stevemar runs off
18:23:07 <samueldmq> bknudson: and maybe generalizing isn't the ideal
18:23:08 <stevemar> thanks dolphm
18:23:27 <notmorgan> amakarov: so, 90% of the cases a 401 cannot give much more info
18:23:28 <samueldmq> why not define the status codes in each API ? as they may mean something different
18:23:30 <bknudson> the status codes already have meanings defined by the HTTP standard.
18:23:50 <notmorgan> because we're leaking auth related data to the user then (not auth info itself, auth related)
18:24:07 <dolphm> there's a table of authentication errors by status code and their reasonings here http://developer.openstack.org/api-ref/identity/v3/index.html#authentication-and-token-management
18:24:10 <notmorgan> we would need to restructure our exceptions to directly obscure the auth data.
18:24:13 <amakarov> notmorgan, ++
18:24:15 <bknudson> http://specs.openstack.org/openstack/keystone-specs/api/v3/identity-api-v3.html#http-status-codes
18:24:15 <notmorgan> auth-related data
18:24:29 <bknudson> ^ link to the old table of status codes
18:24:42 <lbragstad> one thing the docs folks wanted was to see if the projects would be interested in this and they would provide us with a template to fill out
18:24:57 <bknudson> "Normal response codes: 201 Error response codes: 413,415,405,404,403,401,400,503,409" -- not useful!
18:24:58 <ayoung> the tricky thing we do that messes some people up is return 404 when we don't want people to know the resource is there but they don't have access
18:25:01 <lbragstad> cc annegentle was driving that template effort
18:25:10 <bknudson> a random string of numbers.
18:25:30 <bknudson> should have put "Sad!"
18:25:41 <ayoung> 418
18:25:47 <notmorgan> so, i would recommend 1 of two things.
18:25:50 <notmorgan> ayoung: i am not a teapot
18:26:01 <ayoung> notmorgan, I never said "you" were
18:26:10 <ayoung> 418 says "I" am teapot.
18:26:15 <dolphm> this seems like it's not super useful for a data-driven API, which is most of our API beyond auth
18:26:17 <notmorgan> either we restructure our exceptions
18:26:20 <notmorgan> so we don't pass data back.
18:26:29 <notmorgan> if this is something worth changing
18:26:46 <notmorgan> or we add some level of "error code" that can communicate more data to the end user
18:27:00 <notmorgan> but not leak important things like "does X exist, or do you not have any access"
18:27:03 <amakarov> notmorgan, it will ease problem solving
18:27:23 <samueldmq> looks like it's orthogonal to whether we define the status code or not ?
18:27:29 <ayoung> what I would like to see is that, if someone calls an API, but does not have any roles on their token, they get back a 401 with a list of roles that would be required
18:27:39 <bknudson> btw, these docs are looking good. Room for improvement like everything, but there has been improvement
18:27:52 <samueldmq> I see lbragstad as 'define status codes in a single place, in the same way other projects are doing' (which is this topic)
18:27:56 <ayoung> something like a token with scope but no roles...not a pure unscoped token, as that is a security thing....
18:28:11 <bknudson> we already have securityexception to indicate to hide the details from the user.
18:28:11 <dolphm> can we keep the discussion on the documentation issue?
18:28:19 <ayoung> Let the user then request a token with the appropriate scope, if they like the risk
18:28:25 <samueldmq> dolphm: ++
18:28:55 <ayoung> dolphm, is this purely a documentation question? Did not realize. I thought it was about actually changing the responses
18:28:56 <notmorgan> the doc change is less useful when hitting the auth wall
18:28:56 <samueldmq> lbragstad: I'd say that's nice to have, just get what we had already from the docs dolphm and bknudson linked above
18:28:59 <notmorgan> since everything is 401ish
18:29:00 <lbragstad> I mainly wanted to bring it up here to see what people thought - anne was interested in getting a template rolling
18:29:08 <notmorgan> otherwise the doc change is useful
18:29:30 * ayoung thought descriptions meant the message we passed back in the response. Understand better now
18:29:34 <dolphm> i'm personally skeptical of the value of this, especially for us, because it doesn't match the workflow that i'd be following if i were utilizing API docs for the first time (i'd be looking at the operation first, then the status code i'm getting -- not the other way around)
18:29:38 <lbragstad> i can go back to the doc meeting and ask for more information about the intended audience
18:29:39 <bknudson> is the ask just do we want a summary of how we generally use return codes?
18:29:56 <bknudson> or how to make something like "Normal response codes: 201 Error response codes: 413,415,405,404,403,401,400,503,409" actually useful?
18:29:59 <dolphm> lbragstad: that's basically here http://developer.openstack.org/api-ref/identity/v3/index.html#authentication-and-token-management
18:30:00 <lbragstad> dolphm i see
18:30:06 <dolphm> for auth, anyway
18:30:10 <lbragstad> dolphm sure
18:30:11 <samueldmq> dolphm: yes, having the status code described in the api description itself is more meaningful to me
18:30:14 <dstanek> bknudson: that's what i don't quite understand
18:30:18 <dolphm> which is our only "interesting" use of status codes?
18:30:21 <samueldmq> as the API (arguments, etc) would be all in one spot
18:30:25 <dstanek> who is to use this, how and why?
18:30:37 <dolphm> samueldmq: that's a better way to describe it
18:30:47 <bknudson> any application writer wants to know this information!
18:30:56 <bknudson> deployers want to know why something failed
18:31:01 <ayoung> I suspect the goal is to keep the meaning of response codes consistent across the services, by having a single doc to start with
18:31:06 <bknudson> and currently we don't provide that information.
18:31:09 <ayoung> of course, we are long past "starting"
18:31:13 <lbragstad> dstanek yeah - i can bring all this to the next meeting and see if I can clarify it further
18:31:22 <amakarov> bknudson, as for me if I can determine the code that kicked request out, the problem is solved
18:31:50 <bknudson> amakarov: what if the same code means lots of things? like a 400 ?
18:31:53 <samueldmq> lbragstad: it would be nice to have an exceptions.yaml, something like parameters.yaml
18:31:59 <amakarov> whether it'll be by status code or some magic in status message - doesn't matter
18:32:13 <amakarov> bknudson, that's the problem
18:32:15 <samueldmq> lbragstad: you reference the exceptions in the api description, then they get rendered by getting what is in that file
18:32:18 <lbragstad> ok so
18:32:20 <lbragstad> summary
18:32:30 <lbragstad> there is a fear of duplication
18:32:38 <lbragstad> (duplication of docs)
18:32:42 <lbragstad> Does it match the workflow for anyone using this (why would they use this versus the regular API)?
18:32:50 <lbragstad> Does this make sense for a data-driven API?
18:32:51 <dstanek> done...you're welcome: http://git.openstack.org/cgit/openstack/api-wg/tree/guidelines/http.rst#n83
18:33:11 <dolphm> #topic Let's make rolling upgrades great again
18:33:51 <dolphm> so, last week after we approved henry's spec for rolling upgrades, I proposed a simplification (hopefully) of the spec
18:34:05 <dolphm> this is the spec revision https://review.openstack.org/#/c/351798/
18:34:41 <dolphm> i also proposed a draft of what our docs would look like for deployers https://review.openstack.org/#/c/350793/
18:34:47 <bknudson> +154, -149 -- looks more complicated!
18:34:53 <ayoung> dolphm, any proposal to deal with revocations during that time?
18:35:18 <dolphm> and implemented a read-only mode config option to support the spec https://review.openstack.org/#/c/349700/
18:35:31 <dolphm> ayoung: yes - in that you can't disable or delete anything anyway, so no need to revoke anything ;)
18:36:04 <henrynash> so I have a number of questions/concerns about this:
18:36:08 <ayoung> dolphm, um..you do see the fallacy there, right?
18:36:14 <ayoung> You are, I assume, just joking.
18:36:19 <gyee> can't list ldap users either as we can't external IDs
18:36:23 <henrynash> 1) Obviously it is different to what other services are doing
18:36:46 <gyee> can't issue tokens and shadow users
18:36:47 <dolphm> the tl;dr of the change is that i cut out a few commands from henry's spec for managing metadata about the state of the rolling upgrade, added a read-only mode to protect the live migration process and the application from stomping on each other (which also prevents us from having to write code to handle multiple states of the DB in a single release)
18:36:50 <henrynash> 2) It isn't clear to me that a RO mode will suit everyone (and of course certainly not anyone on UUID tokens)
18:37:00 <dolphm> gyee: you can issue Fernet tokens of course, but not UUID/etc tokens
18:37:14 <gyee> dolphm, what about shadowing?
18:37:23 <henrynash> 3) Minor concern over use of config variable, since it means keystone comes up, by default, as RO (on a clean install)
18:37:31 <dolphm> gyee: there's certainly a constraint there, but my assumption is that if you're pursuing rolling upgrades, you're probably already on board with Fernet (if that's a bad assumption, i'd love to hear feedback)
18:37:38 <dolphm> gyee: it'd be blocked :(
18:37:44 <henrynash> (but you could get round that with a database variable instead)
18:37:48 <rderose_> gyee: you wouldn't be able to onboard new users during the upgrade
18:37:50 <dolphm> gyee: no writes to the db during the data migration process at all
18:38:18 <gyee> dolphm, rderose_, yeah that makes sense
18:38:46 <dolphm> henrynash: (1) is my biggest concern, but i'd also wager that no other project other than maybe glance would remain particularly useful during a read-only mode
18:38:59 <dolphm> glance can still serve images, for example
18:39:10 <dolphm> (during a rolling upgrade of a similar nature to glance)
18:39:43 <gyee> now rolling upgrade on a multi-site datacenter will be even more fun :-)
18:40:00 <dolphm> so, during a rolling upgrade of keystone, you'd suffer a "partial service outage", wherein you can't onboard new users (including shadowing), manage authz, tweak your service catalog, etc, but you'd be able to create and validate tokens all day
18:40:34 <rderose_> ++
18:40:39 <bknudson> I hope it doesn't take all day
18:40:48 <dolphm> gyee: the duration of read-only mode becomes the concern in that case -- the more nodes you're upgrading, the more time you're likely spending in read-only mode
18:40:50 <ayoung> dolphm, can we store revocations in a flat file, and run them all post upgrade?
18:40:53 <dolphm> bknudson: i hope not too :)
18:41:04 <dstanek> also PCI things would be off during that time. e.g. no recording failed auths
18:41:09 <dolphm> bknudson: in reality, you're not upgrading one node at a time, but groups of nodes at once
18:41:11 <bknudson> it would be interesting to see some kind of timing on this, since it's something we've had to guess on.
18:41:34 <rderose_> dstanek: correct
18:41:50 <bknudson> I have no idea if the read-only period is 5 mins or 5 days.
18:41:54 <henrynash> dolphm: so it's probably worth perhaps you giving the reasons why you feel the current spec isn't the right way to go
18:41:59 * ayoung wonders if we should be backing revocations to a KVS anyway
18:42:22 <bknudson> we should be backing all keystone data to a kvs.
18:42:23 <dolphm> ayoung: you can always write to the db yourself? keystone-manage revoke-things-manually
18:42:38 <ayoung> dolphm, nah, I am thinking of all the Horizon log-outs
18:42:39 <gyee> anyone gone through multi-site upgrade so far, I presume there's a step to redo the replication after the upgrade?
18:42:39 <lbragstad> raw sql for the win
18:43:01 <dolphm> ayoung: good point
18:43:31 <dstanek> gyee: what replication?
18:43:34 <bknudson> does the server return some http status?
18:43:39 <bknudson> when it's read-only
18:43:42 <henrynash> as far as I am concerned we have a perfectly good solution right now agreed...albeit with more steps required (which of course means more possibility of operator error)
18:43:43 <gyee> dstanek, like replicating data between sites?
18:43:48 <dolphm> bknudson: yes, 503 on POST and PATCH, for example
18:43:53 <lbragstad> gyee are you thinking part of the deployment is in read-only?
18:43:56 <lbragstad> and the other part isn't?
18:44:01 <dstanek> gyee: in RO mode everything is RO until the upgrade is completed
18:44:07 <lbragstad> dstanek ++
18:44:14 <lbragstad> dstanek that's how I understand it
18:44:16 <dolphm> bknudson: new exception is implemented here https://review.openstack.org/#/c/349700/3/keystone/exception.py
18:44:28 <gyee> dstanek, right, but we'll have to configure to replace any new tables don't we
18:44:41 <gyee> replicate
18:44:54 <bknudson> makes sense. I'll note that we already see 503 errors (from proxy server)
18:44:58 <dolphm> gyee: i don't understand
18:45:22 <lbragstad> gyee no we're just not allowing writes through keystone
18:45:30 <dolphm> bknudson: yeah, this would be the first 503 from keystone itself
18:45:30 <lbragstad> gyee the upgrade will still write to the database
18:45:34 <gyee> dolphm, you want to upgrade all the sites, then turn replication back on right?
18:45:44 <lbragstad> gyee which will replicate like it normally would
18:46:02 <dolphm> lbragstad: the database itself is not read-only, it's the keystone service that is rejecting write requests
18:46:02 <gyee> including the new tables and tables we don't want to replicate
18:46:12 <dolphm> gyee: *^
18:46:16 <lbragstad> gyee what dolphm said
18:46:27 <dolphm> gyee: what tables don't you want to replicate? why do they exist
18:46:37 <henrynash> dolphm: what about my question of how the RO mode is achieved...is this by code in keystone itself? If so, would this change only be executable for Newton->Ocata
18:46:59 <notmorgan> don't table creates get replicated?
18:47:03 <dolphm> henrynash: yes, it's literally this one liner https://review.openstack.org/#/c/349700/3/keystone/common/sql/core.py,unified
18:47:05 <lbragstad> henrynash read-only meaning keystone will reject requests to write to the database
18:47:09 <notmorgan> in standard replication models?
18:47:16 <lbragstad> notmorgan yeah
18:47:22 <dolphm> henrynash: any driver trying to open a write session will respond to http requests with a 503 instead
18:47:29 <notmorgan> why are we... i'm not sure what the issue here is
18:47:34 <henrynash> dolphm: so that means we can't use it to roll upgrades from Mitaka to Newton
18:47:46 <notmorgan> henrynash: that is already the case.
18:47:47 <lbragstad> we're not flipping any specific database bits to make this read only
18:48:00 <henrynash> notmorgan: ?
18:48:13 <notmorgan> henrynash: mitaka => Newton is not rolling upgrade iirc, wouldn't the base be newton to -> o?
18:48:19 <notmorgan> since support is landing in n?
18:48:22 <dolphm> henrynash: and L129 documents a workaround for Mitaka->Newton https://review.openstack.org/#/c/350793/2/doc/source/upgrading.rst
18:48:23 <henrynash> notmorgan: yes it is
18:48:38 <henrynash> notmorgan: that's what we agreed in the spec at the midcycle
18:48:54 <henrynash> and that's what is already up for review
18:49:21 <dolphm> notmorgan: yeah, good support will be newton to O
18:49:25 <dolphm> cata
18:49:27 <notmorgan> dolphm: ok :)
18:49:46 <dolphm> but in my head, this is possible with some extra effort for mitaka -> newton upgrades if you really wanted to
18:50:27 <henrynash> so although I'm always up for simplification, I don't see why we should move away from a commitment of a full RW rolling upgrade....
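The read-only behaviour described above, where the service (not the database) refuses to open write sessions and answers with a 503, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in the linked review: `ReadOnlyError`, `FakeBackend`, its `read_only` flag, and `handle` are hypothetical names invented for this example.

```python
from contextlib import contextmanager


class ReadOnlyError(Exception):
    """Hypothetical exception; maps to HTTP 503 Service Unavailable."""
    status_code = 503


class FakeBackend:
    """Hypothetical in-memory backend standing in for a SQL driver."""

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.rows = {}

    @contextmanager
    def session_for_read(self):
        # Reads are always allowed, even during the upgrade window.
        yield self.rows

    @contextmanager
    def session_for_write(self):
        # The single guard: refuse to hand out a write session at all.
        if self.read_only:
            raise ReadOnlyError("service is in read-only mode")
        yield self.rows


def handle(backend, method, key, value=None):
    """Toy request handler returning (status, body)."""
    try:
        if method == "GET":
            with backend.session_for_read() as rows:
                return 200, rows.get(key)
        with backend.session_for_write() as rows:
            rows[key] = value
            return 201, value
    except ReadOnlyError as exc:
        return exc.status_code, str(exc)
```

Because the guard sits where write sessions are opened, every POST or PATCH path is covered without touching individual drivers, while the migration tooling can still write to the database directly.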
18:52:18 <dolphm> henrynash: no other project has successfully delivered on true rolling upgrades yet - i'm skeptical that we're going to be able to achieve it in a single release without serious race conditions and other unexpected bugs related to the complexity of the required implementations
18:53:01 <dolphm> when we have the testing in place to assert that we're doing things correctly and safely (or to show that we're doing it wrong), i'll feel much more comfortable pursuing more complicated implementations
18:53:10 <gyee> yeah, let's give it a try, experimental
18:53:37 <dolphm> gyee: i can't stomach the idea of treating upgrades as "experimental"
18:53:42 <rderose_> I think the difficult part is that the code has to behave differently based on the deployment phase, read/write to old and new columns for example. Whereas with read-only strategy, it's much simpler in that respect.
18:53:42 <henrynash> given that in Newton we are doing all the migration in one go in the expand phase, I think this lowers our risk
18:54:01 <bknudson> if we implement the read-only upgrade, does that block moving to the full r-w upgrade?
18:54:01 <dolphm> gyee: almost ANY bug during an upgrade is treated as critical
18:54:15 <notmorgan> dolphm: ++
18:54:20 <rderose_> bknudson: I don't think so
18:54:54 <henrynash> bknudson: not technically...although the guidance from the offsite was have the same commands as the zero-downtime rolling upgrade approach,
18:55:00 <gyee> dolphm, yeah, but we guarantee this going to work?
18:55:08 <lbragstad> bknudson not that I am aware of - a RO upgrade is a little more restrictive and a few less cases
18:55:25 <dolphm> bknudson: no; because we can relax the requirement for read-only mode later, and introduce new commands for more granular steps in the upgrade process later
18:55:34 <rderose_> gyee: oh yeah, 100% guaranteed :)
18:55:42 <gyee> alllrighty then
18:55:48 <dolphm> bknudson: the basic expand -> migrate -> contract pattern would still hold
18:56:00 <lbragstad> a lot of the complexity comes from the write-side of the problem
18:56:12 <dolphm> all / most *
18:56:12 <ayoung> 4 minutes left
18:56:16 <dolphm> ayoung: thanks
18:56:50 <henrynash> lbragstad: which we are not doing in N anyway...we are doing the migration in one go....
18:58:01 <lbragstad> henrynash but we don't claim any upgrade support currently, right?
18:58:02 <dolphm> henrynash: i absolutely want to get to a full R/W capable upgrade, but this feels like a safe step in the correct direction for the next release or two, until we get some experience operating and testing minimal/zero downtime upgrades as a broader community
18:58:16 <rderose_> ++
18:58:19 <henrynash> lbragstad: for M->N, yes we will
18:58:46 <henrynash> lbragstad: we just don't do the migrations piecemeal, on the fly...
18:59:13 <dolphm> henrynash: you mean we can't / aren't going to refactor the migrations we've already landed?
18:59:45 <henrynash> dolphm: no, but we do repair anything they leave unset
19:00:10 <dolphm> (time)
19:00:27 <dolphm> let's carry this on in a mailing list discussion?
19:00:32 <dolphm> i'd be happy to start it
19:00:34 <dolphm> #endmeeting
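The expand -> migrate -> contract pattern mentioned in the discussion can be illustrated with a toy schema change. This is a sketch using the standard-library `sqlite3` module, not keystone's migration tooling; the `users` table, its `name` and `display_name` columns, and the three step functions are hypothetical examples chosen for this illustration.

```python
import sqlite3


def expand(conn):
    # Additive only: nodes still running old code keep using "name".
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def migrate(conn):
    # Copy data into the new column. In the read-only approach the
    # service rejects writes here, so rows cannot change underneath.
    conn.execute(
        "UPDATE users SET display_name = name WHERE display_name IS NULL"
    )


def contract(conn):
    # Destructive: run only once every node is on the new code.
    # The table is rebuilt rather than relying on DROP COLUMN support.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name)
            SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

for step in (expand, migrate, contract):
    step(conn)
```

Keeping the three phases as separate steps is what lets a later release relax the read-only requirement: the order of operations stays the same and only the migrate phase has to learn to tolerate concurrent writes.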