13:04:17 #startmeeting senlin
13:04:18 Meeting started Tue Nov 17 13:04:17 2015 UTC and is due to finish in 60 minutes. The chair is yanyanhu. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:04:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:04:21 The meeting name has been set to 'senlin'
13:04:31 hi, guys
13:04:36 hi
13:04:37 Hi
13:04:43 hello
13:04:51 Hey
13:04:56 hi
13:04:57 o/
13:05:00 hello
13:05:26 Qiming, I have started the meeting, you can hold it now :)
13:05:47 go ahead as the host, ;)
13:06:06 ok
13:06:12 so the agenda is here https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:06:18 I'm still having problems thinking clearly
13:06:36 time difference?
13:06:59 currently, one or two items have been added to the agenda, if you guys have something else you want to discuss, plz feel free to add it
13:07:02 yes, my body is still confused about what time it is
13:07:17 sigh... you need to have a good sleep
13:07:28 when will you be back
13:07:35 to Beijing
13:07:38 #topic update work status
13:07:45 leaving tomorrow
13:07:58 ok, maybe let's update our on-going work first
13:08:02 let's start
13:08:10 who wants to be the first one?
13:08:28 you mean the BPs?
13:08:34 I think some TODO items have been claimed and related bps have been filed
13:08:35 yep
13:08:53 ok, I will start first
13:08:54 so I believe you guys have started working on something :)
13:08:59 i spent some time looking at monasca
13:09:05 last week
13:09:13 ok
13:09:27 I posted 3 patches for the lock breaker
13:09:32 might put that on hold, as speaking to qiming last week, it is going to be low priority
13:09:50 hi, ethan, maybe we can let jruano go first :)
13:09:57 jruano, yes
13:10:03 :)
13:10:09 it's now given low priority
13:10:25 there are some issues with the monasca api that are different than ceilometer, namely the notification api
13:10:26 the priority is relatively low when compared to a generic receiver api
13:10:38 since we are still not sure we should support trigger completely in senlin, @ Qiming
13:10:44 and we don't really need it right now
13:10:45 yes
13:10:56 some knowledge of monasca would actually help us shape the api
13:11:05 so i started to look at the webhook interface, and am thinking of how to generalize it
13:11:18 i will claim the receiver item
13:11:22 and update the todo
13:11:27 jruano, cool
13:11:28 \o/
13:12:11 so that's me. will dig into receiver generalization this week
13:12:58 the next is elynn?
13:13:03 nice
13:13:05 I think the design of the receiver is something we really need to take care about, maybe we can have a thorough discussion on it before starting to code :)
13:13:13 for sure
13:13:21 i will draw up the blueprint for discussion
13:13:25 :)
13:13:29 jruano, thanks
13:14:24 for the lock breaker, I posted 3 patches to steal a lock from a dead engine.
13:14:31 elynn, your turn now :)
13:14:40 yes, I saw it
13:14:54 I have reviewed one
13:15:02 But I don't know why the db api for the cluster/node lock doesn't require a context.
13:15:05 seems good to me
13:15:25 While other db_api calls always require a context.
13:15:45 let me check it
13:16:12 elynn, there are two levels of locks
13:16:16 hmm, that's true
13:16:26 http://git.openstack.org/cgit/openstack/senlin/tree/senlin/db/sqlalchemy/api.py#n635
13:16:43 the first level is an action being claimed by an engine, where we don't have a context, maybe we should?
13:17:02 the second level is an action claiming the lock on a cluster/node
13:17:22 hi, Qiming, we give a context when locking an action http://git.openstack.org/cgit/openstack/senlin/tree/senlin/db/sqlalchemy/api.py#n1481
13:17:42 Qiming: I hadn't noticed that there is any lock claimed by an engine.
13:17:47 yes, that context contains a db session I think
13:17:51 yes
13:18:20 elynn, the related code is in engine/actions/base.py now
13:18:34 but we plan to move it to engine/scheduler.py
13:18:39 hmm...
13:18:51 in this patch https://review.openstack.org/244026
13:19:04 Seems I need to take the engine lock into consideration.
13:19:34 now it's here http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n483
13:19:39 I just handle the cluster/node lock for now.
13:19:40 yea
13:20:23 when breaking locks, we need to consider two things: actions locked by a 'dead' engine; clusters/nodes locked by those actions
13:20:48 yes
13:21:39 yes, will cover the first case in future code. the current code only covers the latter case.
13:21:40 elynn, I think you are considering the engine lock in this patch https://review.openstack.org/#/c/243483/
13:22:17 the owner is the engine id
13:22:25 'owner'
13:22:49 haiwei: yes, part of it, it's just aware the action is owned by the dead engine, but it doesn't clean it in the db yet.
13:23:08 elynn, I think maybe we can temporarily ignore the lock stealing of actions since it may depend on more action support like suspend, resume
13:23:18 yes, once we have a clear picture of the 2-level locks, the problem is not that difficult to solve
13:23:21 ok, expecting your patch later
13:24:11 yanyanhu: Qiming, so we also hold this BP for now?
13:24:14 yanyanhu, why is that?
13:24:56 I'm not seeing a direct connection between action suspend/resume and the lock breaker
13:25:03 hi, Qiming, I think if the lock of an action is stolen by another engine, it means the newly coming engine tries to recover/resume this action
13:25:04 missed something?
13:25:23 which is currently being seized by a dead engine
13:26:13 if the engine is dead, the actions previously held/locked by the engine need to be unlocked
13:26:18 I think that should be a new action
13:27:12 Qiming, yes, but if we don't consider action resume, that lock stealing will be easy to handle I think
13:27:36 Qiming: so after stealing a lock from an action, this engine will continue to execute the action?
13:27:36 yes, the previous action should go to ERROR, and the new action will get the lock
13:28:02 i now see why we need distributed lock management :)
13:28:05 right, there are corner cases where an action was killed
13:28:10 elynn, that is what we want to support in future
13:28:16 jruano, nod
13:28:40 jruano, yep :)
13:29:12 elynn, continuing to execute an action sounds a little dangerous to me
13:29:21 yanyanhu: hmm, then the scope of this BP is getting bigger than I thought.
13:29:49 elynn, I think you can just keep it in its current status and focus on lock stealing for clusters/nodes
13:29:55 Qiming: isn't that the action resume thing?
13:30:13 management of the action lock will be another topic I believe
13:30:52 So it's ok if this BP just covers the cluster/node lock stealing thing?
13:31:19 make a dead action come back to life?? seems cool, but..
13:31:22 elynn, maybe we should go with haiwei's suggestion -- restart the action
13:32:33 hmm, but restarting an action without getting its current status is a little dangerous I think
13:33:04 e.g. some physical resources have been created, maybe we need to clean them up before restarting the action?
13:33:07 I'm not a big fan of lock stealing, to be honest
13:33:22 Qiming, me too actually
13:33:23 Qiming: That should be done in the lock breaker? I mean I can do it, but it seems not very related to this BP.
13:33:38 it should be only used in some cases like engine death
13:33:59 yanyanhu: agree with you.
13:34:04 what about supporting the lock breaker first?
13:34:16 yanyanhu, making action execution transactional, i.e. roll it back? sounds fancy, but I don't think it is feasible
13:34:34 agree
13:34:42 hmm, that's true... at least at the current stage
13:35:30 the simplest approach could be to just mark all actions locked by a 'dead' engine as failed
13:35:34 so maybe we just set the action to failed status if we find its owner died for some reason?
13:35:44 yea
13:35:44 and remove locks on the clusters/nodes locked by that action
13:35:56 agreed
13:36:14 yanyanhu: yes, that sounds easy to approach.
13:36:18 support this first, and see what we can do further
13:36:26 yep
13:36:33 And when to do that check?
13:36:45 what check
13:36:52 the next time you reach the action
13:37:03 to check if the action is owned by a dead engine.
13:37:04 maybe caused by a user's request
13:37:14 or a scheduling movement
13:37:33 e.g. the user executes action-show
13:37:45 or a scheduler tries to claim this action to run?
13:37:46 yanyanhu: yes, two ways to do so, regularly or just triggered by another action.
13:37:56 yes, when that is checked, it should be released automatically, the user should not know about it
13:39:04 If we add a scheduler to do so, it might cause other problems in a multi-engine env.
13:39:31 like race conditions.
13:40:16 I would prefer to set dead actions to failed when certain other actions need to lock the cluster/node?
13:40:21 what do you think?
13:40:39 my previous thought was a scavenger daemon, which will clean the mess left by dead engines, remove old actions and events from the db, so on and so forth
13:41:00 and then find that the engine working on the owner action of the cluster/node has died?
13:41:12 Qiming: That means another service?
13:41:43 Qiming: or running in senlin-engine?
13:41:43 but it sounds a bit complicated, there will be another lock...
13:41:58 elynn, running in senlin-engine was the idea
13:42:03 Qiming, I think this is good, but it's still not very clear how to implement it
13:42:13 Qiming: yes, that might introduce another lock for this service.
13:42:30 so before we have it, maybe just do passitive lock breaking?
13:42:54 Since we can only let one service clean the current db at a time.
13:43:38 yes, that is where we may need a "single" coordinator thing, or a good dlm solution
13:43:52 what about doing it like this? when some action needs to steal the lock, it finds the dead engine first, then cleans the dead action, and then steals the lock
13:44:28 haiwei, this is what I mean by 'passitive' :)
13:44:35 haiwei: sounds good to me.
13:44:40 but anyway, we can start with some basic primitives that will do engine aliveness checking, action unlocking
13:44:45 yanyanhu: seems we all mean that ;)
13:44:48 agree
13:44:57 passive you mean, :)
13:45:01 oh, right
13:45:12 sigh...
13:45:12 ok
13:45:54 ok, if we are clear about this issue, let's move on?
13:46:03 yes
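To make the "passive" approach agreed above concrete, here is a minimal self-contained sketch: an action found to be owned by a dead engine is marked FAILED and the cluster/node locks it holds are released, rather than being resumed or restarted. The Action and NodeLock classes, the LIVE_ENGINES set and break_stale_locks() below are illustrative stand-ins for this log only, not Senlin's real objects or DB API.

```python
# Minimal sketch (not Senlin code): fail actions owned by a dead engine and
# release the cluster/node locks they hold, instead of resuming them.
from dataclasses import dataclass
from typing import List, Optional, Set

LIVE_ENGINES: Set[str] = {'engine-1'}    # stand-in for engine liveness records


@dataclass
class Action:
    id: str
    status: str = 'RUNNING'
    owner: Optional[str] = None          # id of the engine that claimed it


@dataclass
class NodeLock:
    node_id: str
    action_id: str                       # action currently holding the lock


def break_stale_locks(action: Action, node_locks: List[NodeLock]) -> bool:
    """Fail the action and free its locks if its owner engine is dead."""
    if action.owner is None or action.owner in LIVE_ENGINES:
        return False                     # owner alive (or unclaimed): nothing to do
    action.status = 'FAILED'             # do not resume or restart the action
    action.owner = None
    # drop every cluster/node lock held by that action
    node_locks[:] = [lk for lk in node_locks if lk.action_id != action.id]
    return True


if __name__ == '__main__':
    stale = Action(id='a1', owner='engine-2')             # engine-2 is dead
    locks = [NodeLock(node_id='n1', action_id='a1')]
    assert break_stale_locks(stale, locks)
    assert stale.status == 'FAILED' and locks == []
```

In Senlin itself the same check would run lazily, e.g. when another action requests the cluster/node lock or a user runs action-show, as discussed around 13:36-13:37 above.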
13:46:13 the next is?
13:46:14 haiwei, your turn now?
13:46:18 ok
13:46:49 I assigned this https://blueprints.launchpad.net/senlin/+spec/http-response-modification
13:47:01 ok
13:47:05 not started yet, just thinking about it
13:47:21 it's about the API refactor
13:47:34 it seems I need to modify quite a lot in senlin/common/wsgi.py
13:48:00 hope it can make our API stable and more consistent with the guide from the API-WG
13:48:11 yes
13:48:20 haiwei, just feel free to propose the patch :)
13:48:51 returning 202 is not difficult, but do we still need to add the url of the resource in the response body??
13:49:03 yes, I guess so
13:49:08 haiwei, yes, in the response header
13:49:24 in the header, not the body?
13:49:31 check the api-wg guideline
13:49:45 I investigated other projects, they are storing the url in the body
13:50:35 read this again: http://git.openstack.org/cgit/openstack/api-wg/tree/guidelines/http.rst#n105
13:50:40 I have read the guideline, maybe it doesn't mean to store them in the header, the English is a little difficult
13:50:57 "* Must return a Location header set to one of the following:"
13:51:21 the Location header is a response header??
13:51:34 what else could 'header' mean then?
13:51:58 maybe it was a misunderstanding
13:52:20 pls check with the author
13:52:35 cannot recall his name/irc nick
13:52:38 ok, I will do it
13:53:02 thanks, haiwei
13:53:06 Miguel Grinberg
13:53:30 ok
13:53:42 ok, my turn I guess
13:53:49 just a quick update on my work
13:54:20 I started working on the senlin scheduler to make it support initiative action scheduling
13:54:48 ok
13:54:53 currently, the scheduler is not a real 'scheduler' since it will only schedule the action given to it by the dispatcher
13:55:18 we hope to make it smarter and behave more like a 'scheduler'
13:55:33 yes
13:55:38 so the first step is to add the ability to choose a random ready action to schedule
13:55:40 https://review.openstack.org/#/c/244026/
13:55:42 patch is here
13:55:55 don't be too smart, just some basic 'scheduling' would suffice, :)
13:56:02 Qiming, yep :)
13:56:13 will jump onto that
13:56:30 many thanks
13:56:43 ok, time is almost over
13:56:52 #topic open discussion
13:57:00 anything else you want to discuss?
13:57:16 not from me :)
13:57:22 i am out on vacation next week. thanksgiving for us in the usa
13:57:22 I guess we need to postpone the topic of rechecking existing bps to the next meeting
13:57:34 jruano_, have a good vacation :)
13:57:36 jruano_, best wishes, :)
13:57:45 hope you can have time to review this https://review.openstack.org/#/c/238753/
13:57:48 yanyanhu, fine
13:57:56 jruano_: Have a nice vacation ;)
13:57:58 it has been there for quite a long time
13:58:13 yes, we need some discussion about this issue
13:58:24 haiwei, okay
13:58:40 that's all from me
13:58:44 ok, so I guess that's all for this meeting?
13:59:19 ok, thank you guys for joining, have a good night/day :)
13:59:25 bye
13:59:28 bye
13:59:30 * regXboi wanders in and looks at the clock
13:59:30 thank you!
13:59:38 bye, let's move back to the senlin channel
13:59:40 #endmeeting
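As a follow-up illustration of the API-WG point debated around 13:48-13:51, here is a minimal generic WSGI sketch of a 202 Accepted response that carries the new resource's URL in the Location header rather than only in the body. The handler name, port, path and ids are made up for the example; this is not Senlin's senlin/common/wsgi.py code.

```python
# Generic WSGI sketch (not Senlin's wsgi.py): async create returns
# 202 Accepted with the resource URL in the Location *header*.
import json
from wsgiref.simple_server import make_server


def create_cluster(environ, start_response):
    cluster_id = 'c-123'                               # illustrative id
    body = json.dumps({'cluster': {'id': cluster_id}}).encode('utf-8')
    headers = [
        ('Content-Type', 'application/json'),
        ('Content-Length', str(len(body))),
        # Per the API-WG guideline, the URL of the new resource goes here:
        ('Location', '/v1/clusters/%s' % cluster_id),
    ]
    start_response('202 Accepted', headers)
    return [body]


if __name__ == '__main__':
    # curl -i -X POST http://127.0.0.1:8777/  ->  HTTP/1.0 202 Accepted + Location
    make_server('127.0.0.1', 8777, create_cluster).serve_forever()
```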