13:04:17 <yanyanhu> #startmeeting senlin
the agenda is here https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:06:18 <Qiming> I'm still having problems thinking clearly
13:06:36 <yanyanhu> time difference?
13:06:59 <yanyanhu> currently, one or two items have been added to the agenda; if you guys have something else you want to discuss, plz feel free to add them
13:07:02 <Qiming> yes, my body is still confused about what time it is
13:07:17 <yanyanhu> sigh... you need to have a good sleep
13:07:28 <haiwei> when will you be back
13:07:35 <haiwei> to Beijing
13:07:38 <yanyanhu> #topic update work status
13:07:45 <Qiming> leaving tomorrow
13:07:58 <yanyanhu> ok, maybe lets update our on-going work first
13:08:02 <tyagiprince> lets start
13:08:10 <yanyanhu> who wants to be the first one?
13:08:28 <elynn> you mean the BPs?
13:08:34 <yanyanhu> I think some TODO items have been claimed and related bps have been filed
13:08:35 <yanyanhu> yep
13:08:53 <elynn> ok, I will start first
13:08:54 <yanyanhu> so I believe you guys have started working on something :)
13:08:59 <jruano> i spent some time looking at monasca
13:09:05 <jruano> last week
13:09:13 <yanyanhu> ok
13:09:27 <elynn> I post 3 patches for lock breaker
13:09:32 <jruano> might put that on hold, as speaking to qiming last week, it is going to be low priority
13:09:50 <yanyanhu> hi, ethan, maybe we can let jruano first :)
13:09:57 <yanyanhu> jruano, yes
13:10:03 <elynn> :)
13:10:09 <yanyanhu> it's now given low priority
13:10:25 <jruano> there are some issues with the monasca api that are different than ceilometer, namely notification api
13:10:26 <Qiming> the priority is relatively low when compared to a generic receiver api
13:10:38 <yanyanhu> since we are still not sure we should support trigger completely in senlin, @ Qiming
13:10:44 <jruano> and not really need it right now
13:10:45 <jruano> yes
13:10:56 <Qiming> some knowledge of monasca would actually help us shaping the api
13:11:05 <jruano> so i started to look at the webhook interface, and thinking of how to generalize it
13:11:18 <jruano> i will claim the receiver item
13:11:22 <jruano> and update todo
13:11:27 <yanyanhu> jruano, cool
13:11:28 <Qiming> \o/
13:12:11 <jruano> so that's me. will dig into receiver generalization this week
13:12:58 <haiwei> the next elynn?
13:13:03 <yanyanhu> nice
13:13:05 <yanyanhu> I think the design of receiver is something we really need to take care about, maybe we can make thorough discussion on it before start coding :)
13:13:13 <jruano> for sure
13:13:21 <jruano> i will draw up the blueprint for discussion
13:13:25 <elynn> :)
13:13:29 <yanyanhu> jruano, thanks
13:14:24 <elynn> for lock breaker, I post 3 patches to steal a lock from dead engine.
13:14:31 <yanyanhu> elynn, your turn now :)
13:14:40 <yanyanhu> yes, I saw it
13:14:54 <haiwei> I have reviewed one
13:15:02 <elynn> But I don't know why db api for cluster/node lock doesn't require a context.
13:15:05 <haiwei> seems good to me
13:15:25 <elynn> While other db_api always require context.
13:15:45 <yanyanhu> let me check it
13:16:12 <Qiming> elynn, there are two levels of locks
13:16:16 <yanyanhu> hmm, that's true
13:16:26 <yanyanhu> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/db/sqlalchemy/api.py#n635
13:16:43 <Qiming> first level is an action is claimed by an engine, where we don't have context, maybe we should?
13:17:02 <Qiming> the second level is an action claims the lock on a cluster/node
13:17:22 <yanyanhu> hi, Qiming, we give context when locking action http://git.openstack.org/cgit/openstack/senlin/tree/senlin/db/sqlalchemy/api.py#n1481
13:17:42 <elynn> Qiming: I haven't noticed that there is any lock claimed by an engine.
13:17:47 <Qiming> yes, that context contains a db session I think
13:17:51 <yanyanhu> yes
13:18:20 <yanyanhu> elynn, the related code is in engine/actions/base.py now
13:18:34 <yanyanhu> but we plan to move to engine/scheduler.py
13:18:39 <elynn> hmm...
13:18:51 <yanyanhu> in this patch https://review.openstack.org/244026
13:19:04 <elynn> Seems I need to take the engine lock into consideration.
13:19:34 <yanyanhu> now its here http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n483
13:19:34 <Qiming> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n483
13:19:39 <elynn> I just handle cluster/node lock for now.
13:19:40 <yanyanhu> yea
13:20:23 <Qiming> when breaking locks, we need to consider two things: actions locked by a 'dead' engine; clusters/nodes locked by those actions
13:20:48 <yanyanhu> yes
13:21:39 <elynn> yes, will cover the first case in future code. current code only covers the latter case.
13:21:40 <haiwei> elynn, I think you are considering engine lock in this patch https://review.openstack.org/#/c/243483/
13:22:17 <haiwei> the owner is engine id
13:22:25 <haiwei> 'owner'
13:22:49 <elynn> haiwei: yes, part of it, it is just aware that the action is owned by the dead engine, but it hasn't cleaned it in db.
13:23:08 <yanyanhu> elynn, I think maybe we can temporarily ignore the lock stealing of action since it may depend on more action support like suspend, resume
13:23:18 <Qiming> yes, once we have a clear picture of the 2-level locks, the problems are not that difficult to solve
13:23:21 <haiwei> ok, expect your patch later
13:24:11 <elynn> yanyanhu: Qiming, so we also hold this BP for now?
13:24:14 <Qiming> yanyanhu, why is that?
13:24:56 <Qiming> I'm not seeing a direct connection between action suspend/resume and lock breaker
13:25:03 <yanyanhu> hi, Qiming I think if the lock of an action is stolen by another engine, it means the newly coming engine tries to recover/resume this action
13:25:04 <Qiming> missed something?
13:25:23 <yanyanhu> which is currently being seized by a dead engine
13:26:13 <Qiming> if the engine is dead, the actions previously held/locked by the engine need to be unlocked
13:26:18 <haiwei> I think that should be a new action
13:27:12 <yanyanhu> Qiming, yes, but if we don't consider action resume, that lock stealing will be easy to handle I think
13:27:36 <elynn> Qiming: so after stealing the lock on an action, this engine will continue to execute the action?
13:27:36 <haiwei> yes, the previous action should go to ERROR, and the new action will get the lock
13:28:02 <jruano> i now see why we need distributed lock management :)
13:28:05 <Qiming> right, there are corner cases where an action was killed
13:28:10 <yanyanhu> elynn, that is what we want to support in future
13:28:16 <Qiming> jruano, nod
13:28:40 <yanyanhu> jruano, yep :)
13:29:12 <Qiming> elynn, continuing to execute an action sounds a little dangerous to me
13:29:21 <elynn> yanyanhu: hmm, then the scope of this BP is getting bigger than I thought.
13:29:49 <yanyanhu> elynn, I think you can just keep it in current status and focus on lock stealing of cluster/node
13:29:55 <elynn> Qiming: isn't that the action resume thing?
13:30:13 <yanyanhu> management of action lock will be another topic I believe
13:30:52 <elynn> So this BP just cover cluster/node lock stealing thing is ok?
13:31:19 <haiwei> make a dead action relive?? seems cool, but..
13:31:22 <Qiming> elynn, maybe we should go with haiwei's suggestion -- restart the action
13:32:33 <yanyanhu> hmm, but restarting an action without getting its current status is a little dangerous I think
13:33:04 <yanyanhu> e.g. some physical resources have been created, maybe we need to clean them up before restarting the action?
13:33:07 <Qiming> I'm not a big fan of lock stealing, to be honest
13:33:22 <yanyanhu> Qiming, me too actually
13:33:23 <elynn> Qiming: That should be done in lock breaker? I mean I can do it, but seems not very related to this BP.
13:33:38 <yanyanhu> it should be only used in some cases like engine die
13:33:59 <elynn> yanyanhu: agree with you.
13:34:04 <haiwei> what about supporting lock breaker first?
13:34:16 <Qiming> yanyanhu, making action execution transactional, i.e. roll it back? sounds fancy, but I don't think it is feasible
13:34:34 <haiwei> agree
13:34:42 <yanyanhu> hmm, that's true... at least in current stage
13:35:30 <Qiming> the simplest approach could be just mark all actions locked by a 'dead' engine as failed
13:35:34 <yanyanhu> so maybe we just set action to failed status if we found its owner died for some reasons?
13:35:44 <yanyanhu> yea
13:35:44 <Qiming> and remove locks on the cluster/nodes locked by that action
13:35:56 <haiwei> agreed
13:36:14 <elynn> yanyanhu: yes, that sounds easy to approach.
13:36:18 <haiwei> support this first, and see what we can do further
13:36:26 <Qiming> yep
13:36:33 <elynn> And when to do that check?
13:36:45 <haiwei> what check
13:36:52 <yanyanhu> when next time you reach the action
13:37:03 <elynn> to check if the action is owned by a dead engine.
13:37:04 <yanyanhu> maybe caused by user's request
13:37:14 <yanyanhu> or scheduling movement
13:37:33 <yanyanhu> e.g. user execute action-show
13:37:45 <yanyanhu> or a scheduler try to claim this action to run?
13:37:46 <elynn> yanyanhu: yes, two ways to do so: regularly, or just triggered by another action.
13:37:56 <haiwei> yes, when that is checked, it should be released automatically, the user should not know about it
13:39:04 <elynn> If we add a scheduler to do so, it might cause other problems in a multi-engine env.
13:39:31 <elynn> like race condition.
13:40:16 <elynn> I would prefer to set dead actions to failed when some certain other actions need to lock the cluster/node?
13:40:21 <elynn> what do you think?
13:40:39 <Qiming> my previous thought was a scavenger daemon, which will clean the mess left by dead engines, remove old actions and events from db, so on and so forth
13:41:00 <yanyanhu> and then find that the engine working on the owner action of the cluster/node has died?
13:41:12 <elynn> Qiming: That means another service?
13:41:43 <elynn> Qiming: or running in senlin-engine?
13:41:43 <Qiming> but it sounds a bit complicated, there will be another lock...
13:41:58 <Qiming> elynn, running in senlin-engine was the idea
13:42:03 <yanyanhu> Qiming, I think this is good, but still not very clear how to implement it
13:42:13 <elynn> Qiming: yes, that might introduce other lock for this service.
13:42:30 <yanyanhu> so before we have it, maybe just do passitive lock breaking?
13:42:54 <elynn> Since we can only let one service cleaning current db at one time.
13:43:38 <Qiming> yes, that is where we may need a "single" coordinator thing, or a good dlm solution
13:43:52 <haiwei> what about doing it like this? when some action needs to steal the lock, it finds the dead engine first, then cleans up the dead action, and then steals the lock
13:44:28 <yanyanhu> haiwei, this is what I mean by 'passitive' :)
13:44:35 <elynn> haiwei: sounds good to me.
13:44:40 <Qiming> but anyway, we can start with some basic primitives that will do engine aliveness checking, action unlocking
13:44:45 <elynn> yanyanhu: seems we all mean that ;)
13:44:48 <yanyanhu> agree
13:44:57 <Qiming> passive you mean, :)
13:45:01 <yanyanhu> oh, right
13:45:12 <yanyanhu> sigh...
13:45:12 <haiwei> ok
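The passive approach agreed on here can be sketched roughly as follows. This is only an illustration under assumptions, not Senlin code: the `Store` class, its fields, the status string "FAILED" and the `acquire` function are all hypothetical stand-ins for the DB tables and the engine aliveness check that the team still has to define.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for the DB tables (hypothetical, for illustration)."""
    alive_engines: set = field(default_factory=set)
    action_owner: dict = field(default_factory=dict)   # action id -> engine id
    action_status: dict = field(default_factory=dict)  # action id -> status
    resource_lock: dict = field(default_factory=dict)  # cluster/node id -> action id


def acquire(store, resource_id, action_id):
    """Try to lock a cluster/node for action_id.

    The lock is broken only when the action currently holding it is owned
    by an engine that is not alive (an action with no owner is treated the
    same way, which is an assumption of this sketch).
    """
    holder = store.resource_lock.get(resource_id)
    if holder is None or holder == action_id:
        store.resource_lock[resource_id] = action_id
        return True

    engine = store.action_owner.get(holder)
    if engine in store.alive_engines:
        return False  # the holder is healthy, so no stealing

    # Level 1: drop the dead engine's claim and mark the action as failed
    # instead of trying to resume it.
    store.action_status[holder] = "FAILED"
    store.action_owner.pop(holder, None)

    # Level 2: release every cluster/node lock held by that action.
    for rid in [r for r, a in store.resource_lock.items() if a == holder]:
        del store.resource_lock[rid]

    store.resource_lock[resource_id] = action_id
    return True
```

Nothing runs in the background in this sketch, so no extra coordinator lock is needed; a real implementation would still have to make the check-and-steal step atomic in the database to avoid the multi-engine race elynn mentioned.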
13:45:54 <yanyanhu> ok, if we are clear about this issue, lets move on?
13:46:03 <elynn> yes
13:46:13 <haiwei> the next is?
13:46:14 <yanyanhu> haiwei, your turn now?
13:46:18 <haiwei> ok
13:46:49 <haiwei> I assigned this https://blueprints.launchpad.net/senlin/+spec/http-response-modification
13:47:01 <yanyanhu> ok
13:47:05 <haiwei> not started yet, just thinking about it
13:47:21 <yanyanhu> it's about API refactor
13:47:34 <haiwei> it seems I need to modify quite a lot in senlin/common/wsgi.py
13:48:00 <yanyanhu> hope it can make our API stable and more consistent with the guide from API-WG
13:48:11 <haiwei> yes
13:48:20 <yanyanhu> haiwei, just feel free to propose the patch :)
13:48:51 <haiwei> return 202 is not difficult, we still need to add url of the resource in response body??
13:49:03 <yanyanhu> yes, I guess so
13:49:08 <Qiming> haiwei, yes, in response header
13:49:24 <haiwei> in header not body?
13:49:31 <Qiming> check the api-wg guideline
13:49:45 <haiwei> I investigated other projects, they are storing url in body
13:50:35 <Qiming> read this again: http://git.openstack.org/cgit/openstack/api-wg/tree/guidelines/http.rst#n105
13:50:40 <haiwei> I have read the guideline, maybe it does not mean to store them in the header; English is a little difficult
13:50:57 <Qiming> "* Must return a Location header set to one of the following:"
13:51:21 <haiwei> the Location header is a response header??
13:51:34 <Qiming> what else could 'header' mean then?
13:51:58 <Qiming> maybe it was a misunderstanding
13:52:20 <Qiming> pls check with the author
13:52:35 <Qiming> cannot recall his name/IRC nick
13:52:38 <haiwei> ok, I will do it
13:53:02 <yanyanhu> thanks, haiwei
13:53:06 <Qiming> Miguel Grinberg
13:53:30 <haiwei> ok
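For reference, a 202 reply that carries the resource URL in a Location response header, as the quoted guideline line requires, could look roughly like this. It is a minimal sketch with plain Python data, not the code in senlin/common/wsgi.py; the function name, the URL and the body layout are assumptions made for illustration.

```python
import json


def accepted_response(resource_url, action_id):
    """Build (status, headers, body) for an asynchronous request.

    The URL goes into the Location *header*; the body layout shown here
    is a placeholder, not a format taken from the guideline.
    """
    headers = [
        ("Content-Type", "application/json"),
        ("Location", resource_url),
    ]
    body = json.dumps({"action": action_id})
    return "202 Accepted", headers, body


status, headers, body = accepted_response(
    "http://senlin.example.com/v1/actions/1234",  # hypothetical URL
    "1234",
)
```

Whether the URL should also be repeated in the body, as some other projects do, is the open point haiwei agreed to check with the guideline author.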
13:53:42 <yanyanhu> ok, my turn I guess
13:53:49 <yanyanhu> just quick update my work
13:54:20 <yanyanhu> I started working on senlin scheduler to make it support initiative action scheduling
13:54:48 <Qiming> ok
13:54:53 <yanyanhu> currently, scheduler is not a real 'scheduler' since it will only schedule the action given to it by dispatcher
13:55:18 <yanyanhu> we hope to make it smarter and behave more like a 'scheduler'
13:55:33 <haiwei> yes
13:55:38 <yanyanhu> so the first step is to add the ability to choose a random ready action to schedule
13:55:40 <yanyanhu> https://review.openstack.org/#/c/244026/
13:55:42 <yanyanhu> patch is here
13:55:55 <Qiming> don't be too smart, just some basic 'scheduling' would suffice, :)
13:56:02 <yanyanhu> Qiming, yep :)
13:56:13 <Qiming> will jump onto that
13:56:30 <yanyanhu> many thanks
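The first step described here, choosing a random ready action, amounts to something like the sketch below. It is not taken from the patch under review; the function name, the dict of action statuses and the "READY" status string are assumptions for illustration.

```python
import random


def pick_ready_action(actions, rng=random):
    """Return the id of a randomly chosen READY action, or None.

    `actions` maps action id -> status; `rng` can be replaced by a seeded
    random.Random instance to make the choice reproducible.
    """
    ready = [aid for aid, status in actions.items() if status == "READY"]
    if not ready:
        return None
    return rng.choice(ready)
```

In a multi-engine setup the chosen action would still have to be claimed atomically, since two engines could pick the same one.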
13:56:43 <yanyanhu> ok, time is almost over
13:56:52 <yanyanhu> #topic open discussion
13:57:00 <yanyanhu> anything else want to discuss
13:57:16 <elynn> not from me :)
13:57:22 <jruano_> i am out on vacation next week. thanksgiving for us in the usa
13:57:22 <yanyanhu> I guess we need to postpone the topic of rechecking existing bps to next meeting
13:57:34 <yanyanhu> jruano_, have a good vacation :)
13:57:36 <Qiming> jruano_, best wishes, :)
13:57:45 <haiwei> hope you can have time to review this https://review.openstack.org/#/c/238753/
13:57:48 <Qiming> yanyanhu, fine
13:57:56 <elynn> jruano_: Have a nice vacation ;)
13:57:58 <haiwei> it's there for quite a long time
13:58:13 <yanyanhu> yes, we need some discussion about this issue
13:58:24 <Qiming> haiwei, okay
13:58:40 <haiwei> that's all from me
13:58:44 <yanyanhu> ok, so I guess that's all for this meeting?
