18:02:22 <dolphm> #startmeeting keystone
18:02:23 <openstack> Meeting started Tue Aug  9 18:02:22 2016 UTC and is due to finish in 60 minutes.  The chair is dolphm. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:02:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:02:27 <openstack> The meeting name has been set to 'keystone'
18:02:27 <stevemar> dolphm: you remember how to run the meeting?
18:02:28 <stevemar> :)
18:02:35 <bknudson> just like old times.
18:02:42 <stevemar> bknudson: #bettertimes
18:02:44 <ayoung> maybe  <o
18:02:55 <lbragstad> nostalgia
18:03:00 <dolphm> #topic Scheduley bits
18:03:13 <bknudson> trump will bring back the good old days.
18:03:21 <dolphm> #info August 22-26 is final release for non-client libs
18:03:35 <henrynash> (3 minutes in...and first Trump reference already..)
18:03:39 <dolphm> #info Newton-3 deadline is August ~29
18:03:41 <stevemar> hehe
18:03:44 <dstanek> "Let's make keystone great again"
18:03:49 <rderose_> ++
18:03:52 <gagehugo> +1
18:03:56 <rodrigods> lol
18:04:02 <samueldmq> hi all
18:04:12 <stevemar> dolphm: i think the client libraries freeze the same time as the server side
18:04:29 <stevemar> the non-client libs (keystoneauth and keystonemiddleware) freeze a week before
18:04:30 <bknudson> do we get to choose when our libs freeze?
18:04:36 <stevemar> bknudson: nope!
18:04:42 <dolphm> #link http://releases.openstack.org/newton/schedule.html
18:05:05 <dolphm> non-client libs freeze a week early to help ease requirements management at the end of the release
18:05:23 <dolphm> #topic Let's make HMT great again: support for delete and update
18:05:40 <dolphm> #link https://blueprints.launchpad.net/keystone/+spec/project-tree-deletion
18:05:47 <dolphm> Delete patch: https://review.openstack.org/#/c/244248/ (abandoned)
18:05:50 <notmorgan> i might be lurking today
18:05:51 * rodrigods hides in the corner
18:05:52 <dolphm> Update patch: https://review.openstack.org/#/c/243585/ (abandoned)
18:06:02 <dolphm> stevemar: have a minute for this one, since your name is on it?
18:06:10 <stevemar> dolphm: sure
18:06:17 <stevemar> right
18:06:36 <stevemar> so, we implemented all this stuff in the controller and manager and backend
18:06:41 <stevemar> and we don't have an API that can hit it
18:06:48 <stevemar> so... kinda pointless right now
18:06:55 <rderose_> Revert it all!
18:06:59 <rderose_> :)
18:07:03 <amakarov> dolphm, is my patch https://review.openstack.org/#/c/285521/ somewhere on the schedule?
18:07:28 <rodrigods> the major issue is that in the HMT design, a project isn't the owner of its subtree
18:07:32 <dolphm> amakarov: no? but feel free to add it https://etherpad.openstack.org/p/keystone-weekly-meeting
18:07:32 <gyee> see, see, rderose_ said it
18:07:58 <stevemar> so yeah, as rodrigods said, there were issues around policy and what it should look like for deleting / updating a tree of projects
18:08:12 <dolphm> rodrigods: so the implementation is not ready to be exposed to the API?
18:08:15 <stevemar> i'm only half joking when i say revert it all
18:08:26 <dstanek> stevemar: --
18:08:37 <rodrigods> dolphm, looks like it is impossible to implement it cleanly
18:08:48 <samueldmq> is there a REAL use case for it ?
18:08:53 <stevemar> my argument is that no one has pushed for it in months, so do we really need it?
18:08:54 <rodrigods> samueldmq, sure
18:08:55 <gyee> stevemar, trump was half-joking when he accepted the nomination too
18:08:58 <notmorgan> rderose_: revert all of HMT! /snark, (sorry)
18:09:11 <henrynash> I think part of the problem was a lack of clarity on the conceptual meaning ....i.e. is delete/update tree a single command policied in its own right...or is it shorthand for issuing update/delete on every node in the tree and the regular policy applies to each one?
18:09:14 <dstanek> rodrigods: so what can we do here?
18:09:23 <stevemar> i'm trying to encourage anyone to pick this up and finish it :)
18:09:33 <lbragstad> henrynash isn't that the cascade feature?
18:09:39 <stevemar> lbragstad: yep
18:09:59 <rodrigods> dstanek, think henrynash can be more precise about the issues since he did further investigation on it
18:10:01 <samueldmq> henrynash: yes, because there is no way to specify that in the current policies
18:10:17 <notmorgan> updating a whole tree is potentially scary (tm)
18:10:22 <stevemar> anyway, i didn't expect a solution to come out of this, just a PSA
18:10:23 <henrynash> lbragstad: yes, you could argue that this is expanding the cascade capability to full delete/update (not just the enabled flag which is all you can do with it today)
18:10:32 <rodrigods> notmorgan, the update is to disable, that is required for deletion
18:10:34 <notmorgan> and/or deleting because of our lack of "soft delete"
18:10:38 <stevemar> i was cleaning old BPs and this one keeps getting bumped
18:10:46 <samueldmq> anyone picking this up ?
18:10:59 <dstanek> what does 'update a tree' actually mean?
18:11:08 <rodrigods> dstanek, disable it
18:11:09 <henrynash> I'm happy to pick it up, just not sure whether I can do this for Newton
18:11:10 <stevemar> dstanek: only 'enabled' is allowed to be updated
18:11:11 <dolphm> henrynash: why does a disable have to cascade to take effect?
18:11:12 <bknudson> what happens if nobody picks it up?
18:11:12 <notmorgan> dstanek: specifically a disable
18:11:23 <stevemar> bknudson: revert it all! <joking>
18:11:25 <notmorgan> that cascades
18:11:26 <dstanek> rodrigods: do you have to explicitly disable all child nodes then?
18:11:27 <notmorgan> bknudson: since no public api... it could be removed?
18:11:37 <dolphm> henrynash: if any of a project's parents are disabled, the project is disabled, and amakarov's table should make that check trivial
18:11:39 <notmorgan> dstanek: today iirc yes.
18:11:44 <rodrigods> dstanek, for the current design, yes
18:11:56 <dstanek> sounds like we need to fix the glitch then
18:11:56 <bknudson> ok, just checking there's no huge bug if we don't do something.
18:12:06 <henrynash> dolphm: I was just describing what the current cascade does....
18:12:15 <notmorgan> dstanek: explicit disable. it still prevents login so security is fine, just a cascade delete cannot occur w/o everything disabled as i understand it
18:12:19 <dolphm> henrynash: right
18:12:42 <dolphm> rodrigods: it sounds like the problem is with checking the enabled/disabled state of a child (which is a read query), not with "cascading" a write to a bunch of rows in the db
18:12:58 <notmorgan> dolphm: correct. a cascade write would be absurd
18:13:10 <notmorgan> and shouldn't be a thing we do.
18:13:13 <rodrigods> right
18:13:34 <stevemar> bknudson: we just end up with lots of ugly branches in our code :(
18:13:35 <rodrigods> but... as of today
18:13:36 <notmorgan> doing the "is any of my parents disabled" check should be what is done.
18:13:41 <rodrigods> we check if everyone is disabled
18:13:57 <samueldmq> I think the code is in
18:14:01 <notmorgan> this is a QOL change for operators/users imo
18:14:03 <samueldmq> we just don't know how to expose that in the policies
18:14:22 <stevemar> notmorgan: you are hardly lurking :)
18:14:23 <stevemar> hehe
18:14:46 <dolphm> samueldmq: the perceived "enabled" state of a project is just whether a project AND all of its parents are enabled
18:14:47 <ayoung> samueldmq, policy is kind of all or nothing.  It should check to see if you can do the operation based on a role, but dealing with hierarchy is kind of beyond scope
18:14:50 <stevemar> dolphm: so yeah, just a PSA, didn't expect an outcome here
18:15:10 <ayoung> but....
18:15:11 <notmorgan> stevemar: i can go away again, i just was dropping in to voice the "this should be safe from a security standpoint" view.
18:15:18 <dolphm> notmorgan: thanks :)
18:15:23 <stevemar> notmorgan: we love having you around :)
18:15:28 <dolphm> notmorgan: not for the going away again bit, though
18:15:35 <notmorgan> dolphm: lol
18:15:35 <ayoung> the part that should be checked is the applicability of the token used to request the operation to each of the nodes
18:15:40 <henrynash> stevemar: Happy to get this assigned to me to go think about it and come back with a spec (or not) for Ocata
18:15:41 <ayoung> nodes to be deleted that is
18:15:50 <rodrigods> ayoung, right
18:15:57 <rodrigods> check auth in the whole tree
18:15:58 <ayoung> and what do we say if the scope does not match...probably deny
18:15:59 <stevemar> henrynash: i'd be ok with that
18:16:00 <samueldmq> dolphm: that made me think of something ... if a parent is disabled, shouldn't its children be disabled too (as a consequence)?
18:16:01 <rodrigods> authz*
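To make ayoung's point concrete, here is a rough sketch of the "regular policy applies to each node" reading of a cascading delete. The helpers (list_subtree, enforce_policy, delete_project) are hypothetical stand-ins, not keystone's real manager/controller API:

```python
# Illustrative sketch only: a cascade treated as shorthand for applying the
# regular delete policy to every node in the subtree before deleting anything.
def cascade_delete(context, project_id, list_subtree, enforce_policy, delete_project):
    # list_subtree is assumed to return descendants with parents before children.
    nodes = [project_id] + list_subtree(project_id)
    # Authorize every node first; if the token's scope doesn't cover any one of
    # them, deny the whole operation rather than leaving a half-deleted tree.
    for node in nodes:
        enforce_policy(context, 'identity:delete_project', target=node)
    # Only then delete, leaves first, so no parent is removed before its children.
    for node in reversed(nodes):
        delete_project(node)
```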
18:16:04 <dolphm> notmorgan: /nick notnotmorgan ?
18:16:10 <samueldmq> anyways, we don't do that today
18:16:14 <dolphm> samueldmq: correct
18:16:18 <rodrigods> samueldmq, that's the point
18:16:22 <rodrigods> we don't do that today
18:16:27 <ayoung> samueldmq, yes they should
18:16:27 <samueldmq> and we can't
18:16:34 <samueldmq> because backwards compat
18:16:51 <samueldmq> and that would disable auth
18:16:53 <dolphm> samueldmq: i wouldn't consider it backwards incompatible to fix that behavior
18:16:55 <rodrigods> don't think changing that breaks the API
18:16:58 <dolphm> samueldmq: great!
18:17:03 <samueldmq> can that be considered a security issue that we then fix?
18:17:04 <notmorgan> ayoung: uhm. wouldn't the owner of the top of the tree be allowed to delete even if they don't own down the tree?
18:17:08 <dolphm> it's not an API change so much as a change in the API's behavior
18:17:10 <notmorgan> just thinking through the usability
18:17:17 <notmorgan> but traversing up the tree no.
18:17:23 <dolphm> notmorgan: right
18:17:24 <rodrigods> the API remains the same, it's the internal representation that changes
18:17:29 <ayoung> notmorgan, I was just thinking about that.  I need to see what a Linux system does if you try that with a FS
18:17:42 <rodrigods> although, every GET project needs to look at the parents
18:17:43 <notmorgan> ayoung: depends on the FS iirc.
18:17:47 <rodrigods> to check if someone is disabled
18:17:58 <samueldmq> dolphm: ayoung: cool so we can technically fix that
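A minimal sketch of the check dolphm describes above, assuming a toy project store keyed by id with illustrative parent_id/enabled fields (not keystone's schema or API):

```python
# Illustrative sketch only (not keystone's code): a project's effective
# "enabled" state is its own flag ANDed with every ancestor's flag.
def is_effectively_enabled(project_id, projects):
    """Return True only if the project and all of its parents are enabled."""
    node = projects[project_id]
    while node is not None:
        if not node['enabled']:
            return False
        parent_id = node.get('parent_id')
        node = projects.get(parent_id) if parent_id else None
    return True


projects = {
    'root': {'enabled': True, 'parent_id': None},
    'dept': {'enabled': False, 'parent_id': 'root'},
    'team': {'enabled': True, 'parent_id': 'dept'},
}
assert is_effectively_enabled('team', projects) is False  # disabled ancestor wins
assert is_effectively_enabled('root', projects) is True
```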
18:18:10 <stevemar> i'm excited to get to the rolling upgrade topic :)
18:18:12 <dolphm> it sounds like we have some consensus on the topic to move forward then?
18:18:23 <samueldmq> ++
18:18:25 <rodrigods> henrynash will pick it up?
18:18:28 <samueldmq> henrynash: picking it up?
18:18:34 <dolphm> rodrigods: well played
18:18:42 <henrynash> yes
18:18:43 <rodrigods> :)
18:18:56 <samueldmq> henrynash: thanks
18:18:59 <rodrigods> henrynash, o/
18:19:01 <notmorgan> ayoung: http://paste.openstack.org/show/552606/
18:19:17 <notmorgan> ayoung: now. iirc you can delete files just not directory inodes.
18:19:19 <dolphm> #topic Let's make HTTP status code documentation and descriptions great again
18:19:21 <dolphm> lbragstad: o/
18:19:28 <dolphm> "The docs team wants to provide a template that we can add to our docs that gives some information regarding HTTP status codes along with formal and informal descriptions of what they mean."
18:19:32 <ayoung> http://paste.openstack.org/show/552607/
18:19:33 <lbragstad> ^
18:19:39 <dolphm> #link https://review.openstack.org/#/c/349551/
18:19:46 <lbragstad> some projects have already started
18:19:47 <lbragstad> ^
18:19:57 <ayoung> notmorgan, so, to a directory, anything inside it is a dentry, whether file or subdir
18:20:00 <lbragstad> thoughts/comments/questions/concerns?
18:20:02 <dolphm> "Thoughts about the documentation and additional descriptions? Should we organize an API sprint to knock this out?"
18:20:08 <notmorgan> yes
18:22:09 <lbragstad> I'll take feedback to the next docs meeting
18:20:10 <notmorgan> ayoung: yes
18:20:11 <ayoung> so if I don't have perms to delete a subdir, I can't remove the dir
18:20:23 <ayoung> so if I have  x/y/z
18:20:34 <notmorgan> ayoung: http://paste.openstack.org/show/552608/
18:20:37 <ayoung> and I don't have permissions to delete y, the tree is protected
18:21:04 <notmorgan> ayoung: exactly. for POSIX vs non-POSIX... let's just pretend we don't care here
18:21:14 <bknudson> for keystone, each operation might have a different meaning for each status code.
18:21:17 <gagehugo> lbragstad: seems like it would look nicer than what it currently is
18:21:23 <lbragstad> bknudson yeah - that's the tough part
18:21:26 <notmorgan> bknudson: also we have duplicated use of status
18:21:30 <lbragstad> there are a lot of things that can represent a 401
18:21:35 <notmorgan> basically the answer should be an "error code" with the status
18:21:42 <notmorgan> where we can provide it for security reasons
18:21:43 <dolphm> lbragstad: it sounds like this is a reverse of the normal API documentation structure, which is "here's an API request, and the possible status codes that could result from it"
18:21:50 <lbragstad> right
18:21:59 <ayoung> so...on cascading delete, we are doing something like rm -rf   proj
18:22:02 <lbragstad> for example - see https://review.openstack.org/#/c/349551/1/api-ref/source/v2/http-codes.yaml
18:22:08 <bknudson> also, we can have a single status code that has multiple reasons.
18:22:16 <notmorgan> but we should at least clearly say the statuses possible from API X are Y, Z, Q
18:22:18 * ayoung just realized we moved on
18:22:37 <notmorgan> ayoung: yeah sorry, just wanted to hand you the infos for further digesting down the line. :)
18:22:46 <stevemar> no one stops the ayoung train
18:22:50 <ayoung> can we say "you need role R to perform this operation?"
18:22:55 <amakarov> I'd be happy if I could find out where exactly it was decided to raise a 401 based on the status code in non-debug mode
18:22:59 <samueldmq> bknudson: ++
18:23:04 * stevemar runs off
18:23:07 <samueldmq> bknudson: and maybe generalizing isn't ideal
18:23:08 <stevemar> thanks dolphm
18:23:27 <notmorgan> amakarov: so, in 90% of cases a 401 cannot give much more info
18:23:28 <samueldmq> why not define the status codes in each API ? as they may mean something different
18:23:30 <bknudson> the status codes already have meanings defined by the HTTP standard.
18:23:50 <notmorgan> because we're leaking auth related data to the user then (not auth info itself, auth related)
18:24:07 <dolphm> there's a table of authentication errors by status code and their reasonings here http://developer.openstack.org/api-ref/identity/v3/index.html#authentication-and-token-management
18:24:10 <notmorgan> we would need to restructure our exceptions to directly obscure the auth data.
18:24:13 <amakarov> notmorgan, ++
18:24:15 <bknudson> http://specs.openstack.org/openstack/keystone-specs/api/v3/identity-api-v3.html#http-status-codes
18:24:15 <notmorgan> auth-related data
18:24:29 <bknudson> ^ link to the old table of status codes
18:24:42 <lbragstad> one thing the docs folks wanted was to see if the projects would be interested in this and they would provide us with a template to fill out
18:24:57 <bknudson> "Normal response codes: 201 Error response codes: 413,415,405,404,403,401,400,503,409" -- not useful!
18:24:58 <ayoung> the tricky thing we do that messes some people up is returning 404 when we don't want people to know a resource is there that they don't have access to
18:25:01 <lbragstad> cc annegentle was driving that template effort
18:25:10 <bknudson> a random string of numbers.
18:25:30 <bknudson> should have put "Sad!"
18:25:41 <ayoung> 418
18:25:47 <notmorgan> so, i would recommend 1 of two things.
18:25:50 <notmorgan> ayoung: i am not a teapot
18:26:01 <ayoung> notmorgan, I never said "you" were
18:26:10 <ayoung> 418 says "I" am teapot.
18:26:15 <dolphm> this seems like it's not super useful for a data-driven API, which is most of our API beyond auth
18:26:17 <notmorgan> either we restructure our exceptions
18:26:20 <notmorgan> so we don't pass data back.
18:26:29 <notmorgan> if this is something worth changing
18:26:46 <notmorgan> or we add some level of "error code" that can communicate more data to the end user
18:27:00 <notmorgan> but not leak important things like "does X exist, or do you not have any access"
18:27:03 <amakarov> notmorgan, it will ease problem solving
18:27:23 <samueldmq> looks like it's orthogonal to whether we define the status codes or not?
18:27:29 <ayoung> what I would like to see is that, if someone calls an API, but does not have any roles on their token, they get back a 401 with a list of roles that would be required
18:27:39 <bknudson> btw, these docs are looking good. Room for improvement like everything, but there has been improvement
18:27:52 <samueldmq> I read lbragstad's proposal as 'define status codes in a single place, in the same way other projects are doing' (which is this topic)
18:27:56 <ayoung> something like a token with scope but no roles...not a pure unscoped token, as that is a security thing....
18:28:11 <bknudson> we already have securityexception to indicate that the details should be hidden from the user.
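Roughly the pattern bknudson is pointing at: an exception type whose detailed reason is only surfaced in a debug mode, while API clients get a generic message and status. Class and flag names below are stand-ins, not keystone's exact implementation:

```python
# Illustrative sketch of the "hide details unless debugging" pattern.
INSECURE_DEBUG = False  # stand-in for a debug-style config option

class SecurityError(Exception):
    """Carries a detailed reason that is only exposed when debugging."""
    http_status = 401
    generic_message = 'The request you have made requires authentication.'

    def __init__(self, detail):
        self.detail = detail
        super().__init__(self.generic_message)

    def client_message(self):
        # Clients normally see only the generic text, so details such as
        # "does this resource exist" or "which role is missing" don't leak.
        return self.detail if INSECURE_DEBUG else self.generic_message


err = SecurityError('token has no role on the target project')
print(err.http_status, err.client_message())  # 401 plus the generic message
```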
18:28:11 <dolphm> can we keep the discussion on the documentation issue?
18:28:19 <ayoung> Let the user then request a token with the appropriate scope, if they like the risk
18:28:25 <samueldmq> dolphm: ++
18:28:55 <ayoung> dolphm, is this purely a documentation question?  Did not realize.  I thought it was about actually changing the responses
18:28:56 <notmorgan> the doc change is less useful when hitting the auth wall
18:28:56 <samueldmq> lbragstad: I'd say that's nice to have, just get what we had already from the docs dolphm and bknudson linked above
18:28:59 <notmorgan> since everything is 401ish
18:29:00 <lbragstad> I mainly wanted to bring it up here to see what people thought - anne was interested in getting a template rolling
18:29:08 <notmorgan> otherwise the doc change is useful
18:29:30 * ayoung thought descriptions meant the message we passed back in the response.  Understands better now
18:29:34 <dolphm> i'm personally skeptical of the value of this, especially for us, because it doesn't match the workflow that i'd be following if i were utilizing API docs for the first time (i'd be looking at the operation first, then the status code i'm getting -- not the other way around)
18:29:38 <lbragstad> i can go back to the doc meeting and ask for more information about the intended audience
18:29:39 <bknudson> is the ask just do we want a summary of how we generally use return codes?
18:29:56 <bknudson> or how to make something like "Normal response codes: 201 Error response codes: 413,415,405,404,403,401,400,503,409" actually useful?
18:29:59 <dolphm> lbragstad: that's basically here http://developer.openstack.org/api-ref/identity/v3/index.html#authentication-and-token-management
18:30:00 <lbragstad> dolphm i see
18:30:06 <dolphm> for auth, anyway
18:30:10 <lbragstad> dolphm sure
18:30:11 <samueldmq> dolphm: yes, having the status code described in the api description itself is more meaningful to me
18:30:14 <dstanek> bknudson: that's what i don't quite understand
18:30:18 <dolphm> which is our only "interesting" use of status codes?
18:30:21 <samueldmq> as the API (arguments, etc) would be all in one spot
18:30:25 <dstanek> who is to use this, how and why?
18:30:37 <dolphm> samueldmq: that's a better way to describe it
18:30:47 <bknudson> any application writer wants to know this information!
18:30:56 <bknudson> deployers want to know why something failed
18:31:01 <ayoung> I suspect the goal is to keep the meaning of response codes consistent across the services, by having a single doc to start with
18:31:06 <bknudson> and currently we don't provide that information.
18:31:09 <ayoung> of course, we are long past "starting"
18:31:13 <lbragstad> dstanek yeah - i can bring all this to the next meeting and see if I can clarify it further
18:31:22 <amakarov> bknudson, as for me, if I can determine the code that kicked the request out, the problem is solved
18:31:50 <bknudson> amakarov: what if the same code means lots of things? like a 400 ?
18:31:53 <samueldmq> lbragstad: it would be nice to have an exceptions.yaml, something like parameters.yaml
18:31:59 <amakarov> whether it's by status code or some magic in the status message - it doesn't matter
18:32:13 <amakarov> bknudson, that's the problem
18:32:15 <samueldmq> lbragstad: you reference the exceptions in the api description, then they get rendered by getting what is in that file
18:32:18 <lbragstad> ok so
18:32:20 <lbragstad> summary
18:32:30 <lbragstad> there is a fear of duplication
18:32:38 <lbragstad> (duplication of docs)
18:32:42 <lbragstad> Does it match the workflow for anyone using this (why would they use this versus the regular API)?
18:32:50 <lbragstad> Does this make sense for a data-driven API?
18:32:51 <dstanek> done...you're welcome: http://git.openstack.org/cgit/openstack/api-wg/tree/guidelines/http.rst#n83
18:33:11 <dolphm> #topic Let's make rolling upgrades great again
18:33:51 <dolphm> so, last week after we approved henry's spec for rolling upgrades, I proposed a simplification (hopefully) of the spec
18:34:05 <dolphm> this is the spec revision https://review.openstack.org/#/c/351798/
18:34:41 <dolphm> i also proposed a draft of what our docs would look like for deployers https://review.openstack.org/#/c/350793/
18:34:47 <bknudson> +154, -149 -- looks more complicated!
18:34:53 <ayoung> dolphm, any propsal to deal with revocations during that time?
18:35:18 <dolphm> and implemented a read-only mode config option to support the spec https://review.openstack.org/#/c/349700/
18:35:31 <dolphm> ayoung: yes - in that you can't disable or delete anything anyway, so no need to revoke anything ;)
18:36:04 <henrynash> so I have a number of questions/concerns about this:
18:36:08 <ayoung> dolphm, um..you do see the fallacy there, right?
18:36:14 <ayoung> You are, I assume, just joking.
18:36:19 <gyee> can't list ldap users either as we can't write external IDs
18:36:23 <henrynash> 1) Obviously it is different to what other services are doing
18:36:46 <gyee> can't issue tokens and shadow users
18:36:47 <dolphm> the tl;dr of the change is that i cut out a few commands from henry's spec for managing metadata about the state of the rolling upgrade, added a read-only mode to protect the live migration process and the application from stomping on each other (which also prevents us from having to write code to handle multiple states of the DB in a single release)
18:36:50 <henrynash> 2) It isn't clear to me that a RO mode will suit everyone (and of course certainly not anyone on UUID tokens)
18:37:00 <dolphm> gyee: you can issue Fernet tokens of course, but not UUID/etc tokens
18:37:14 <gyee> dolphm, what about shadowing?
18:37:23 <henrynash> 3) Minor concern over use of config variable, since it means keystone comes up, by default, as RO (on a clean install)
18:37:31 <dolphm> gyee: there's certainly a constraint there, but my assumption is that if you're pursuing rolling upgrades, you're probably already on board with Fernet (if that's a bad assumption, i'd love to hear feedback)
18:37:38 <dolphm> gyee: it'd be blocked :(
18:37:44 <henrynash> (but you could get round that with a database variable instead)
18:37:48 <rderose_> gyee: you wouldn't be able to onboard new users during the upgrade
18:37:50 <dolphm> gyee: no writes to the db during the data migration process at all
18:38:18 <gyee> dolphm, rderose_, yeah that make sense
18:38:46 <dolphm> henrynash: (1) is my biggest concern, but i'd also wager that no other project other than maybe glance would remain particularly useful during a read-only mode
18:38:59 <dolphm> glance can still serve images, for example
18:39:10 <dolphm> (during a rolling upgrade of a similar nature to glance)
18:39:43 <gyee> now rolling upgrade on a multi-site datacenter will be even more fun :-)
18:40:00 <dolphm> so, during a rolling upgrade of keystone, you'd suffer a "partial service outage", wherein you can't onboard new users (including shadowing), manage authz, tweak your service catalog, etc, but you'd be able to create and validate tokens all day
18:40:34 <rderose_> ++
18:40:39 <bknudson> I hope it doesn't take all day
18:40:48 <dolphm> gyee: the duration of read-only mode becomes the concern in that case -- the more nodes you're upgrading, the more time you're likely spending in read-only mode
18:40:50 <ayoung> dolphm, can we store revocations in a flat file, and run them all post-upgrade?
18:40:53 <dolphm> bknudson: i hope not too :)
18:41:04 <dstanek> also PCI things would be off during that time. e.g. no recording failed auths
18:41:09 <dolphm> bknudson: in reality, you're not upgrading one node at a time, but groups of nodes at once
18:41:11 <bknudson> it would be interesting to see some kind of timing on this, since it's something we've had to guess on.
18:41:34 <rderose_> dstanek: correct
18:41:50 <bknudson> I have no idea if the read-only period is 5 mins or 5 days.
18:41:54 <henrynash> dolphm: so it's probably worth perhaps you giving the reasons why you feel the current spec isn't teh right way to go
18:41:59 * ayoung wonders if we should be backing revocations to a KVS anyway
18:42:22 <bknudson> we should be backing all keystone data to a kvs.
18:42:23 <dolphm> ayoung: you can always write to the db yourself? keystone-manage revoke-things-manually
18:42:38 <ayoung> dolphm, nah, I am thinking of all the Horizon log-outs
18:42:39 <gyee> anyone gone through multi-site upgrade so far, I presume there's a step to redo the replication after the upgrade?
18:42:39 <lbragstad> raw sql for the win
18:43:01 <dolphm> ayoung: good point
18:43:31 <dstanek> gyee: what replication?
18:43:34 <bknudson> does the server return some http status?
18:43:39 <bknudson> when it's read-only
18:43:42 <henrynash> as far as I am concerned we have a perfectly good solution agreed right now...albeit with more steps required (which of course means more possibility of operator error)
18:43:43 <gyee> dstanek, like replicating data between sites?
18:43:48 <dolphm> bknudson: yes, 503 on POST and PATCH, for example
18:43:53 <lbragstad> gyee are you thinking part of the deployment is in read-only?
18:43:56 <lbragstad> and the other part isn't?
18:44:01 <dstanek> gyee: in RO mode everything is RO until the upgrade is completed
18:44:07 <lbragstad> dstanek ++
18:44:14 <lbragstad> dstanek thats how I understand it
18:44:16 <dolphm> bknudson: new exception is implemented here https://review.openstack.org/#/c/349700/3/keystone/exception.py
18:44:28 <gyee> dstanek, right, but we'll have to configure to replace any new tables don't we
18:44:41 <gyee> replicate
18:44:54 <bknudson> makes sense. I'll note that we already see 503 errors (from the proxy server)
18:44:58 <dolphm> gyee: i don't understand
18:45:22 <lbragstad> gyee no, we're just not allowing writes through keystone
18:45:30 <dolphm> bknudson: yeah, this would be the first 503 from keystone itself
18:45:30 <lbragstad> gyee the upgrade will still write to the database
18:45:34 <gyee> dolphm, you want to upgrade all the sites, then turn replication back on right?
18:45:44 <lbragstad> gyee which will replicate like it normally would
18:46:02 <dolphm> lbragstad: the database itself is not read-only, it's the keystone service that is rejecting write requests
18:46:02 <gyee> including the new tables and tables we don't want to replicate
18:46:12 <dolphm> gyee: *^
18:46:16 <lbragstad> gyee what dolphm said
18:46:27 <dolphm> gyee: what tables don't you want to replicate? why do they exist
18:46:37 <henrynash> dolphm: what about my question of how the RO mode is achieved...is this by code in keystone itself? If so, would this change only be executable for Newton->Ocata?
18:46:59 <notmorgan> don't table creates get replicated?
18:47:03 <dolphm> henrynash: yes, it's literally this one liner https://review.openstack.org/#/c/349700/3/keystone/common/sql/core.py,unified
18:47:05 <lbragstad> henrynash read-only meaning keystone will reject requests to write to the database
18:47:09 <notmorgan> in standard replication models?
18:47:16 <lbragstad> notmorgan yeah
18:47:22 <dolphm> henrynash: any driver trying to open a write session will respond to http requests with a 503 instead
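As a rough sketch of what such a guard could look like (a hypothetical read-only flag checked before any write session is handed out; this is not the actual review):

```python
# Illustrative sketch only: reject writes while an upgrade is in progress.
# READ_ONLY and ReadOnlyError are hypothetical names, not the real patch.
import contextlib

READ_ONLY = True  # stand-in for a config option toggled during the upgrade

class ReadOnlyError(Exception):
    """Would be translated by the HTTP layer into a 503 on write requests."""
    http_status = 503

@contextlib.contextmanager
def session_for_write(session_factory):
    # Read sessions (token validation, GETs) use a separate helper and are
    # unaffected; only code paths that ask for a write session hit this check.
    if READ_ONLY:
        raise ReadOnlyError('keystone is in read-only mode during an upgrade')
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```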
18:47:29 <notmorgan> why are we... i'm not sure what the issue here is
18:47:34 <henrynash> dolphm: so that means we can't use it to roll upgrades from Mitaka to Newton
18:47:46 <notmorgan> henrynash: that is already the case.
18:47:47 <lbragstad> we're not flipping any specific database bits to make this read only
18:48:00 <henrynash> notmorgan: ?
18:48:13 <notmorgan> henrynash: mitaka => Newton is not a rolling upgrade iirc, wouldn't the base be newton -> O?
18:48:19 <notmorgan> since support is landing in n?
18:48:22 <dolphm> henrynash: and L129 documents a workaround for Mitaka->Newton https://review.openstack.org/#/c/350793/2/doc/source/upgrading.rst
18:48:23 <henrynash> notmorgan: yes it is
18:48:38 <henrynash> notmorgan: that's what we agreed in the spec at the midcycle
18:48:54 <henrynash> and that's what is already up for review
18:49:21 <dolphm> notmorgan: yeah, good support will be newton to O
18:49:25 <dolphm> cata
18:49:27 <notmorgan> dolphm: ok :)
18:49:46 <dolphm> but in my head, this is possible with some extra effort for mitaka -> newton upgrades if you really wanted to
18:50:27 <henrynash> so although I'm always up for simplification, I don't see why we should move away from a commitment of a full RW rolling upgrade....
18:52:18 <dolphm> henrynash: no other project has successfully delivered on true rolling upgrades yet - i'm skeptical that we're going to be able to achieve it in a single release without serious race conditions and other unexpected bugs related to the complexity of the required implementations
18:53:01 <dolphm> when we have the testing in place to assert that we're doing things correctly and safely (or to show that we're doing it wrong), i'll feel much more comfortable pursuing more complicated implementations
18:53:10 <gyee> yeah, lets give it a try, experimental
18:53:37 <dolphm> gyee: i can't stomach the idea of treating upgrades as "experimental"
18:53:42 <rderose_> I think the difficult part is that the code has to behave differently based on the deployment phase, reading/writing to old and new columns for example.  Whereas with the read-only strategy, it's much simpler in that respect.
18:53:42 <henrynash> given that in Newton we are doing all the migration in one go in the expand phase, I think this lowers our risk
18:54:01 <bknudson> if we implement the read-only upgrade, does that block moving to the full r-w upgrade?
18:54:01 <dolphm> gyee: almost ANY bug during an upgrade is treated as critical
18:54:15 <notmorgan> dolphm: ++
18:54:20 <rderose_> bknudson: I don't think so
18:54:54 <henrynash> bknudson: not technically...although the guidance from the offsite was to have the same commands as the zero-downtime rolling upgrade approach,
18:55:00 <gyee> dolphm, yeah, but we guarantee this going to work?
18:55:08 <lbragstad> bknudson not that I am aware of - a RO upgrade is a little more restrictive and a few less cases
18:55:25 <dolphm> bknudson: no; because we can relax the requirement for read-only mode later, and introduce new commands for more granular steps in the upgrade process later
18:55:34 <rderose_> gyee: oh yeah, 100% guaranteed :)
18:55:42 <gyee> alllrighty then
18:55:48 <dolphm> bknudson: the basic expand -> migrate -> contract pattern would still hold
18:56:00 <lbragstad> a lot of the complexity comes from the write-side of the problem
18:56:12 <dolphm> all / most *
18:56:12 <ayoung> 4 minutes left
18:56:16 <dolphm> ayoung: thanks
18:56:50 <henrynash> lbragstad: which we are not doing in N anyway...we are doing the migration in one go....
18:58:01 <lbragstad> henrynash but we don't claim any upgrade support currently, right?
18:58:02 <dolphm> henrynash: i absolutely want to get to a full R/W capable upgrade, but this feels like a safe step in the correct direction for the next release or two, until we get some experience operating and testing minimal/zero downtime upgrades as a broader community
18:58:16 <rderose_> ++
18:58:19 <henrynash> lbragstad: for M->N, yes we will
18:58:46 <henrynash> lbragstad: we just don't do the migrations piecemeal, on the fly...
18:59:13 <dolphm> henrynash: you mean we can't / aren't going to refactor the migrations we've already landed?
18:59:45 <henrynash> dolphm: no, but we do repair anything they leave unset
19:00:10 <dolphm> (time)
19:00:27 <dolphm> let's carry this on in a mailing list discussion?
19:00:32 <dolphm> i'd be happy to start it
19:00:34 <dolphm> #endmeeting