13:00:05 <Qiming> #startmeeting senlin
13:00:06 <openstack> Meeting started Tue Dec 29 13:00:05 2015 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:09 <openstack> The meeting name has been set to 'senlin'
13:00:30 <Liuqing> hi
13:00:36 <yanyanhu> hello
13:00:41 <Qiming> hi, Liuqing
13:00:41 <elynn> o/
13:01:29 <Qiming> pls check the agenda and see if you have things to add
13:01:40 <Qiming> #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:01:58 <zhangguoqing> hi
13:02:05 <Qiming> hi, zhangguoqing
13:02:14 <Qiming> let's get started
13:02:25 <Qiming> #topic mitaka work items
13:02:37 <Qiming> #link https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:02:54 <Qiming> heat resource type support
13:03:17 <elynn> Hi Qiming
13:03:23 <Qiming> seems blocked by the action progress check
13:03:32 <elynn> yes
13:03:36 <Qiming> let's postpone that discussion
13:03:40 <elynn> still don't know how to deal with it.
13:03:46 <Qiming> it is item 2 on the agenda
13:03:56 <Qiming> client unit test
13:04:18 <Qiming> need to do a coverage test to find out where we are
13:04:30 <Qiming> unittest part 3 merged?
13:04:41 <yanyanhu> not yet, I think
13:04:55 <yanyanhu> poor network, can't open the review page
13:04:59 <Qiming> any blockers?
13:05:21 <Liuqing> gerrit's problem, so slow..
13:05:25 <yanyanhu> it just hasn't got enough reviews yet, I think
13:06:11 <Qiming> not just a gerrit problem, looks more like a problem caused by the gfw
13:06:42 <yanyanhu> oh, sorry
13:06:46 <yanyanhu> it has been merged
13:06:52 <yanyanhu> https://review.openstack.org/258416
13:07:01 <yanyanhu> my fault
13:07:23 <Qiming> great
13:07:40 <Qiming> still need to unify the client calls
13:07:55 <Qiming> especially the cluster action and node action ones
13:08:16 <Qiming> health policy
13:09:02 <Qiming> xinhui and I spent a whole afternoon together yesterday; we have a draft work plan for implementing the health manager and health policy
13:09:25 <Qiming> the first step would be about improving profiles so that do_check and do_recover are supported
13:09:38 <yanyanhu> great
13:09:40 <Qiming> then these operations will be exposed from the engine and rpc
13:10:06 <yanyanhu> which operations?
13:10:13 <yanyanhu> do_check and do_recover?
13:10:28 <yanyanhu> as actions that can be triggered by the end user?
13:10:29 <Qiming> the health manager can poll cluster nodes in the background, then invoke the do_recover() operation on a cluster if node anomalies are detected
13:10:39 <Qiming> yanyanhu, not yet
13:10:46 <Qiming> the first step is to make them an internal RPC
13:10:52 <yanyanhu> ah, I see
13:11:07 <yanyanhu> make them available for internal requests
13:11:13 <Qiming> when we feel confident/comfortable, we can expose them as REST APIs
13:11:49 <Qiming> there are some details we need to deal with
13:12:13 <Qiming> but I think xinhui is on the right track now
13:12:26 <yanyanhu> the health manager polling status and doing recovery actions sounds good
13:12:31 <yanyanhu> yep
13:12:38 <Qiming> we have to do that
13:12:57 <Qiming> or else the auto-scaling scenario won't be reliable
13:13:33 <Qiming> and we want that feature enabled without bothering users to do any complex config
13:14:01 <Qiming> there could be many knobs exposed for customization, but the first step would be about the basics
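A minimal sketch of the background polling idea described above, assuming hypothetical names (HealthManager, check_node, recover_cluster) rather than the actual Senlin internal RPC interface:

    import time


    class HealthManager(object):
        """Sketch of a background poller that recovers unhealthy clusters."""

        def __init__(self, rpc_client, interval=60):
            self.rpc = rpc_client      # placeholder for an internal RPC client to the engine
            self.interval = interval   # polling period in seconds

        def run(self, cluster):
            while True:
                # check_node stands in for however the profile's do_check()
                # operation gets exposed internally.
                unhealthy = [n.id for n in cluster.nodes
                             if not self.rpc.check_node(n.id)]
                if unhealthy:
                    # do_recover() is invoked on the cluster, not per node,
                    # as described in the discussion above.
                    self.rpc.recover_cluster(cluster.id, nodes=unhealthy)
                time.sleep(self.interval)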
13:14:08 <Liuqing> could the health policy cover the use case: instance HA?
13:14:20 <Qiming> Liuqing, yes
13:14:30 <Liuqing> great
13:14:34 <yanyanhu> Qiming, totally makes sense
13:14:40 <Liuqing> cool
13:14:43 <zhangguoqing> great
13:14:58 <Qiming> by design, a profile (say nova server) will have a do_recover(**options) operation
13:15:34 <Qiming> the base profile will implement this as a create-after-delete logic
13:15:54 <Liuqing> what does that logic mean?
13:16:24 <Qiming> a specific profile will be able to override this behavior; it can probably do a better job in recovering a node
13:16:53 <Qiming> Liuqing, the default logic (implemented in the base profile class) will be: delete the node, then recreate it
13:17:14 <Liuqing> Qiming: got it, thanks
13:17:16 <Qiming> that is the most generic solution, applicable to all profile types
13:17:46 <Qiming> a nova server profile will have more choices: reboot, rebuild, evacuate, ... recreate
13:18:16 <Qiming> these options can be exposed through a health policy
13:18:36 <Qiming> users can customize it as they see appropriate
13:18:57 <Qiming> questions?
13:19:30 <Qiming> update operation for profile types
13:19:38 <Liuqing> so I could customize it for instance HA or other cases, right?
13:19:50 <Qiming> yes, Liuqing, absolutely
13:19:58 <Liuqing> :-)
13:20:14 <Qiming> however, we will also want to add some fencing mechanisms in the future
13:20:28 <Qiming> without fencing, recovery won't be a complete solution
13:20:29 <Liuqing> yes
13:20:42 <Liuqing> now we use pacemaker for instance HA
13:20:46 <Qiming> let's do it step by step
13:21:08 <Qiming> using pacemaker is ... a choice made out of having no choice, I believe
13:21:15 <Liuqing> instance HA is very important for enterprise production
13:21:42 <Liuqing> yes, Qiming
13:21:44 <Qiming> yes, totally agreed
13:22:09 <Liuqing> customers will always ask about the HA problems....
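A sketch of the do_recover() design outlined above: the base profile falls back to delete-then-recreate, while a nova server profile can override it with cheaper operations chosen through the health policy. Apart from do_recover() itself, the method names are illustrative placeholders, not the actual code:

    class Profile(object):
        def do_recover(self, obj, **options):
            # Most generic recovery, applicable to all profile types:
            # delete the node, then recreate it.
            self.do_delete(obj)
            return self.do_create(obj)


    class ServerProfile(Profile):
        def do_recover(self, obj, **options):
            # A nova server has more choices; the operation would come
            # from the health policy spec (names here are placeholders).
            operation = options.get('operation', 'RECREATE')
            if operation == 'REBOOT':
                return self.handle_reboot(obj)
            if operation == 'REBUILD':
                return self.handle_rebuild(obj)
            if operation == 'EVACUATE':
                return self.handle_evacuate(obj)
            # Fall back to the generic delete-then-recreate logic.
            return super(ServerProfile, self).do_recover(obj, **options)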
13:22:10 <Qiming> yanyanhu, any progress on the update operation for profile types?
13:22:35 <Qiming> Liuqing, they are not yet ready for real clouds
13:22:40 <yanyanhu> I'm working on adding function calls for server metadata
13:22:49 <yanyanhu> three new methods will be added to the sdk
13:22:57 <yanyanhu> hope I can finish it in the coming week
13:23:07 <Qiming> most of the time, they use their private cloud as an advanced virtualization platform
13:23:11 <yanyanhu> and also the metadata update for the nova server profile
13:23:14 <Qiming> great, yanyanhu
13:23:34 <Qiming> maybe we should add update support for heat stacks?
13:23:54 <yanyanhu> Qiming, sure. I plan to do some investigation after the nova server related work is done
13:23:57 <Qiming> the interface is much easier compared to the nova server case
13:24:02 <Qiming> cool
13:24:05 <yanyanhu> yes, I think so :)
13:24:09 <Qiming> Receiver
13:24:17 <Qiming> are we done with it?
13:24:53 <Qiming> I pushed quite a few patches last weekend to close the loop
13:24:53 <yanyanhu> I think so
13:24:58 <yanyanhu> just need some tests
13:25:08 <Qiming> a simple test shows it is now working
13:25:11 <yanyanhu> thanks for your hard work :)
13:25:23 <yanyanhu> great, will try to add a functional test for it
13:25:40 <Qiming> hoho, added
13:25:54 <yanyanhu> :)
13:26:34 <Qiming> btw, our API doc is up to date: https://review.openstack.org/261627
13:26:39 <Qiming> already merged
13:27:03 <Qiming> lock breaker
13:27:18 <Qiming> https://review.openstack.org/262151
13:27:19 <elynn> I submitted a patch to re-enable it
13:27:25 <Qiming> yes, reviewed
13:27:29 <elynn> Not sure if it's the right way or not
13:27:30 <Qiming> I disagree with the logic
13:27:38 <Qiming> posted a comment
13:27:42 <elynn> Just saw your comment
13:28:02 <Qiming> it took me 20 minutes or so to post the comment, frustrating ... network is really bad today
13:28:18 <elynn> You intend to move it to after the retries?
13:28:28 <Qiming> elynn, please consider moving the check out of the critical path
13:28:39 <Qiming> by 'critical path', I mean the retry logic
13:28:53 <Qiming> retries could be very common if you are locking a cluster
13:29:29 <elynn> so we do it outside lock_acquire?
13:29:46 <Qiming> maybe we should even relax the number of retries before doing an engine-death check
13:30:00 <Qiming> or ...
13:30:18 <Qiming> we should move this lock breaking to engine startup
13:30:42 <elynn> the engine-death check will be very quick if the engine is alive
13:30:44 <Qiming> but when we have multiple engines, doing lock checks during startup isn't a good idea
13:30:59 <Qiming> elynn, no, the engine could be very busy
13:31:00 <elynn> it only takes some time if the engine is dead.
13:31:21 <Qiming> I ran into this several times when I was fixing the concurrency problem
13:31:42 <Qiming> many times there was a warning that the engine was dead, but the engine was still running
13:32:24 <Qiming> and .... putting it before the retry logic has led to quite a few mistakes
13:32:26 <elynn> ok... I got your point...
13:32:38 <elynn> Maybe we should add a taskrunner first
13:33:03 <elynn> Like what heat does
13:33:08 <Qiming> adding a task runner won't help, AFAICT
13:33:39 <Qiming> it will introduce more concurrency problems
13:34:00 <Qiming> elynn, pls continue digging
13:34:14 <Qiming> haiwei is not online I guess
13:34:15 <elynn> Yes, I will
13:34:21 <Qiming> let's skip the last item
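A sketch of the flow suggested in the review comments above: plain retries first, with the engine-death check (and lock stealing) done only after the retries are exhausted, keeping it out of the critical path. The db_api calls shown are assumed placeholders, not the actual Senlin lock API:

    import time

    RETRY_LIMIT = 3
    RETRY_INTERVAL = 10


    def cluster_lock_acquire(db_api, cluster_id, action_id, is_engine_alive):
        # Normal path: just retry, no liveness check.
        for _ in range(RETRY_LIMIT):
            owner = db_api.lock_acquire(cluster_id, action_id)
            if owner == action_id:
                return True
            time.sleep(RETRY_INTERVAL)

        # Out of the critical path: only now ask whether the current holder
        # belongs to a dead engine, and steal the lock in that case only.
        holder = db_api.lock_holder(cluster_id)
        if holder and not is_engine_alive(holder.engine_id):
            return db_api.lock_steal(cluster_id, action_id) == action_id
        return False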
13:34:51 <Qiming> #topic checking progress of async operations/actions
13:35:04 <Qiming> this is blocking the heat resource type work
13:35:14 <elynn> yes
13:35:19 <yanyanhu> just saw ethan's patch on the sdk side
13:35:20 <Qiming> because .... it is really a tricky thing to do
13:35:36 <elynn> need to figure out a way to receive the correct action id
13:35:40 <Qiming> we have done our best to align our APIs with the guidelines from the api-wg
13:36:03 <Qiming> you have to understand the principles behind them before starting this work
13:36:40 <Qiming> in senlin, most of the create, update and delete operations return a 202
13:36:46 <elynn> I think we are following the guidelines
13:37:01 <Qiming> and we are returning a 'location' in the header
13:37:20 <Qiming> most of the time, the location points you to an action
13:37:43 <Qiming> since we have action apis, we are not hiding this from users
13:38:10 <elynn> yes, most of the time, except for cluster update...
13:38:14 <Qiming> one thing we still need to improve is the body returned
13:38:45 <Qiming> for DELETE requests, we cannot return even a byte in the body, as the HTTP protocol says
13:39:22 <Qiming> for UPDATE, we are returning the object in the body
13:39:35 <yanyanhu> for a delete request, I think checking until a notfound exception happens is ok
13:39:39 <elynn> for cluster deletion, I can catch not_found in the heat resource.
13:40:00 <elynn> The problem is UPDATE/RESIZE
13:40:01 <Qiming> we are also returning the pointer to the object in the header
13:40:11 <Qiming> update and resize are different
13:40:17 <elynn> for RESIZE, we have a body containing the action.
13:40:25 <Qiming> UPDATE is itself a PATCH request
13:40:30 <Qiming> RESIZE is an action
13:40:42 <Qiming> these two operations follow different rules
13:40:57 <elynn> How do we check if an UPDATE is finished?
13:41:14 <Qiming> it depends on what you are updating, elynn
13:41:22 <Qiming> if you are updating the name of a cluster
13:41:30 <Qiming> you should just check the body
13:41:39 <elynn> I mean the profile?
13:41:44 <Qiming> if you are updating a cluster's profile, ... you will need to check the profile
13:41:56 <Qiming> sorry, you will need to check the action
13:42:22 <elynn> yes, I think so
13:42:34 <Qiming> if we are not returning the action in the header, that is a bug to fix
13:42:59 <elynn> Do we?
13:43:16 <Qiming> you can check the api code, I cannot remember
13:43:17 <yanyanhu> Yes, we do. We just didn't find a way to expose it in the client
13:43:45 <Qiming> okay, then the next step is to have the 'location' header used from the sdk/client side
13:43:51 <elynn> yanyanhu: that would be the problem to solve.
13:43:57 <yanyanhu> yes
13:44:15 <Qiming> if we are checking the header from senlinclient, we are requiring the whole SDK to return a tuple to us
13:44:25 <xuhaiwei> the client side can get the action id, can it be used?
13:44:27 <Qiming> or embed the header into the object
13:44:43 <Qiming> xuhaiwei, you are a ghost
13:44:54 <elynn> My patch is to embed the header into the object.
13:44:58 <xuhaiwei> sorry, didn't say hello just now
13:45:11 <Qiming> neither of the solutions above sounds elegant
13:45:13 <xuhaiwei> I am on vacation from today
13:45:37 <Qiming> since we are advised to use the function call interface from the SDK, instead of the resource interface
13:45:52 <Qiming> I'm thinking maybe we should do some work in the _proxy methods
13:46:32 <Qiming> once the response is returned to senlinclient, we get no chance to check the header
13:46:35 <elynn> Qiming: You mean directly return the response body?
13:46:44 <xuhaiwei> Qiming, you mean put the api response information into the sdk?
13:47:05 <Qiming> the _proxy commands should know what they are doing
13:47:16 <Qiming> say cluster_create(**params)
13:47:44 <Qiming> when calling this method, we are expecting a 'location' header in the response
13:48:14 <Qiming> we can either squeeze it into the json before returning to senlinclient
13:48:27 <Qiming> or we can do a wait-for-complete inside the sdk
13:48:52 <Qiming> I'd opt for the 2nd approach
13:49:16 <Qiming> since some other wait-for-something-to-complete logic is already in the sdk
13:49:22 <elynn> I'm not very sure how to implement your #2 option
13:49:43 <elynn> I will have a try.
13:50:10 <Qiming> then we can add a keyword argument 'wait=True' to cluster_create(**params)
13:50:27 <yanyanhu> I also think option 2 is better
13:50:47 <Qiming> if 'wait' is specified as true, then we check the action repeatedly in the cluster_create method
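A sketch of this second option: an optional wait inside the SDK-side helper that follows the action referenced by the 'location' header until it reaches a terminal status. The helper names (create_cluster, get_action) and status values are assumptions, not the current SDK _proxy interface:

    import time

    TERMINAL_STATUSES = ('SUCCEEDED', 'FAILED', 'CANCELLED')


    def cluster_create(proxy, wait=False, timeout=300, **params):
        # create_cluster() and get_action() stand in for the real _proxy
        # calls; 'location' is assumed to carry the URL of the triggering
        # action returned with the 202 response.
        cluster, location = proxy.create_cluster(**params)
        if not wait:
            return cluster

        action_id = location.rsplit('/', 1)[-1]
        deadline = time.time() + timeout
        while time.time() < deadline:
            action = proxy.get_action(action_id)
            if action.status in TERMINAL_STATUSES:
                return cluster
            time.sleep(2)
        raise RuntimeError('timed out waiting for action %s' % action_id)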
13:50:47 <elynn> self._create() returns a resource object; it doesn't contain any headers.
13:51:15 <Qiming> elynn, you have already added 'location' to resource.headers
13:51:59 <Qiming> we can check it there
13:52:06 <elynn> hmm... I just don't know how to use the headers :P
13:52:11 <elynn> I will have a try.
13:52:27 <Qiming> 'headers' was designed to be part of a request
13:52:38 <Qiming> now you are using it in a response
13:52:42 <Qiming> that is not good
13:52:58 <Qiming> next time you send a request, you may have to clean it
13:53:18 <Qiming> maybe adding a response_header property is a cleaner fix
13:53:56 <elynn> I put it in the response just to find a way to expose it...
13:54:10 <Qiming> then we discuss it with brian and terry, see if it is an acceptable 'hack'
13:54:23 <elynn> Or we don't have a way to set the location in the headers.
13:54:23 <Qiming> if it is not acceptable, we have to do it in a different way
13:54:39 <elynn> Ok
13:54:47 <Qiming> e.g. make 'action' a field of the object returned to senlinclient, parse it and do the wait there
13:54:47 <elynn> If we add a wait function,
13:55:03 <elynn> heat code might be blocked by this wait function
13:55:16 <elynn> I'm not sure if it's a good way to go
13:55:21 <Qiming> if you want to wait, you will have to wait
13:56:17 <elynn> For now heat uses its taskrunner to schedule tasks
13:56:27 <Qiming> that is stupid
13:56:30 <Qiming> to be honest
13:56:44 <Qiming> there are proposals to remove them all
13:56:47 <elynn> the wait function might block its taskrunner...
13:57:46 <Qiming> are we risking blocking their engine?
13:58:05 <Qiming> taskrunner is so cheap
13:58:19 <elynn> Yes, that might be the case
13:58:42 <elynn> if we don't yield from wait
13:59:31 <Qiming> okay, let's spend some time reading some more resource type implementations
13:59:49 <Qiming> I don't see a way out
14:00:06 <Qiming> sorry guys, no time for open discussions today
14:00:15 <Qiming> let's continue on #senlin
14:00:19 <Qiming> #endmeeting