13:03:09 <Qiming> #startmeeting senlin
13:03:09 <openstack> Meeting started Tue Jan 26 13:03:09 2016 UTC and is due to finish in 60 minutes.  The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:03:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:03:13 <openstack> The meeting name has been set to 'senlin'
13:03:23 <cschulz> Hi
13:03:31 <Qiming> seems they forgot to endmeeting
13:03:37 <Qiming> hi, cschulz
13:03:54 <Qiming> pls review agenda here
13:03:56 <Qiming> #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:03:59 <elynn> o/
13:04:21 <Qiming> oops, two open discussions, we got plenty of time?
13:04:38 <yanyanhu> :)
13:04:55 <Qiming> anyway, let's get started with the etherpad
13:05:01 <Qiming> #topic mitaka work items
13:05:05 <Qiming> #link https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:05:21 <Qiming> no progress on API as far as I know
13:05:42 <Qiming> heat specs proposed for node/policy/receiver
13:05:58 <Qiming> https://review.openstack.org/271949
13:05:58 <elynn> yes, waiting for review
13:06:27 <Qiming> not aware of any on-going work about testing
13:06:47 <Qiming> we got some bug reports and guys have helped work on them
13:07:09 <Qiming> health management, we have a dedicated timeslot for that
13:07:17 <Qiming> lixinhui_, it's yours
13:07:17 <lixinhui_> oh
13:07:24 <lixinhui_> https://blueprints.launchpad.net/senlin/+spec/support-health-management-customization
13:07:42 <Qiming> documentation, em, no progress either
13:08:12 <Qiming> Sample scenarios/stories, I'll get on to it
13:08:23 <lixinhui_> cool
13:08:47 <lixinhui_> just let us know if you need help
13:08:53 <Qiming> my etherpad connection broke
13:09:10 <yanyanhu> network problem?
13:09:31 <elynn> That's normal, get used to it :)
13:09:44 <lixinhui_> :)
13:09:54 <Qiming> yes
13:10:42 <Qiming> does it work for any of you?
13:10:45 <cschulz> I'm putting together an example scalable stack using a Senlin cluster.
13:10:53 <Qiming> please drive the flow
13:10:55 <yanyanhu> it works for me
13:11:08 <yanyanhu> next one is profile
13:11:20 <Qiming> okay, now I'm on it
13:11:25 <yanyanhu> I plan to work on heat stack update support first
13:11:30 <Qiming> need a proxy again
13:11:34 <lixinhui_> who is cschulz
13:11:36 <Qiming> great
13:11:58 <lixinhui_> could you introduce your name?
13:12:02 <Qiming> cschulz, that would be great
13:12:03 <cschulz> I'm Chuck Schulz from IBM Research in NY USA.
13:12:10 <yanyanhu> about disk property update support for nova server profile, will give it lower priority
13:12:23 <lixinhui_> Nice to meet you, cschulz
13:12:23 <yanyanhu> cschulz, welcome :)
13:12:24 <elynn> Hi Chuck, welcome.
13:12:49 <yanyanhu> looks like disk updating is not common
13:13:00 <Qiming> agree
13:13:10 <Qiming> just leave it there
13:13:21 <yanyanhu> yes, that is my plan
13:13:26 <Qiming> I'm experimenting with a cluster of containers
13:13:28 <cschulz> elynn, We've been exchanging email. Thanks for the NotFound patch.
13:13:30 <Qiming> using coreos
13:13:56 <Qiming> it is an attempt to find the function gaps
13:14:05 <elynn> cschulz: yes, feel free to reach if you have any problems in using senlin resources :)
13:14:18 <elynn> reach me
13:14:26 <Qiming> policies
13:14:42 <yanyanhu> most work has been done I think
13:14:48 <yanyanhu> in last week
13:14:49 <Qiming> we have released fixes to placement and deletion policies
13:15:09 <Qiming> \o/
13:15:17 <yanyanhu> lb policy also works now
13:15:27 <lixinhui_> great
13:15:29 <Qiming> all history now
13:15:33 <yanyanhu> yea
13:15:34 <elynn> That's great, we are on the way.
13:15:54 <Qiming> receiver functional test
13:16:10 <yanyanhu> will start this work after finishing heat stack update support
13:16:12 <Qiming> we have deleted the webhook test, but not yet added one for receiver
13:16:17 <Qiming> okay
13:16:36 <Qiming> engine status list is done
13:16:40 <elynn> yes
13:16:44 <Qiming> that bp closed?
13:16:50 <elynn> not yet
13:16:52 <yanyanhu> hi, elynn, about the problem I met before
13:16:54 <elynn> not yet approved..
13:17:02 <yanyanhu> has it been addressed?
13:17:15 <yanyanhu> you mentioned there could be a race condition there
13:17:20 <elynn> yanyanhu: I still don't have any idea about how to fix it.
13:17:40 <yanyanhu> ok, it didn't happen every time I restarted senlin engine
13:18:09 <elynn> trapped in meetings these two days, I think it's better to open a bug to track it.
13:18:10 <Qiming> #action Qiming to update bp status - service list
13:18:27 <Qiming> yep, filing a bug sounds like a plan
13:18:47 <elynn> Seems multiple engines will have a potential race condition when deleting the same record.
13:19:06 <yanyanhu> yes, that could be the reason
13:19:08 <Qiming> we can allocate some time for a discussion about event log generation in next meetings
13:19:12 <Qiming> not a high priority
13:19:20 <yanyanhu> Qiming, ok
13:19:23 <elynn> It's not a blocking issue I think
13:19:59 <Qiming> okay, anything missed in the etherpad?
13:20:09 <elynn> lock breaker?
13:20:22 <Qiming> yes? any progress?
13:20:34 <elynn> I'm done coding and doing some tests on a local env
13:20:40 <elynn> it's based on engine status.
13:20:47 <Qiming> okay
13:20:53 <elynn> Will steal the lock after retries
13:20:56 <Qiming> cool
13:21:14 <Qiming> yes, just be careful on the critical path
13:21:30 <yanyanhu> elynn, looking forward to your patch :)
13:21:49 <lixinhui_> cool, elynn!
13:21:53 <Qiming> #topic cleanse desired_capacity
13:22:06 <Qiming> this one is a little bit dirty today
13:22:12 <yanyanhu> yep
13:22:22 <yanyanhu> saw xujun's patch and your comment
13:22:29 <Qiming> we don't have a clear definition of desired_capacity, so we are not handling it in a consistent way
13:22:48 <Qiming> yep, xujun talked to me 30 mins ago
13:23:09 <Qiming> seems a concise definition is needed, to guide the code implementation
13:23:36 <Qiming> desired_capacity, as the name suggests, is about the 'desired' size of a cluster
13:23:49 <Qiming> it may or may not be the current size of a cluster
13:24:09 <elynn> So how do we get the current_size?
13:24:30 <Qiming> do a node_get_all
13:25:07 <Qiming> we are not maintaining the current size as a cluster property
13:25:20 <Qiming> it is always an instantaneous concept
13:25:29 <cschulz> I don't think it should be.
13:25:51 <cschulz> Current capacity is a bit of data that may change at any moment.
13:25:59 <Qiming> exactly
13:26:13 <lixinhui_> ...
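The exchange above treats the current cluster size as something computed on demand (via a node listing) while desired_capacity is the only persisted size attribute. A minimal, self-contained sketch of that distinction, not Senlin's actual code; the class and attribute names are illustrative:

```python
# Sketch only: desired_capacity is the stored target; "current size" is
# always derived from the node list and never persisted on the cluster.
from dataclasses import dataclass, field


@dataclass
class Cluster:
    desired_capacity: int
    nodes: list = field(default_factory=list)   # stand-in for a node_get_all() result

    @property
    def current_size(self):
        # Instantaneous value; may change at any moment.
        return len(self.nodes)


c = Cluster(desired_capacity=4, nodes=['n1', 'n2', 'n3'])
assert c.current_size == 3                    # observed size right now
assert c.current_size != c.desired_capacity   # drift the engine may converge later
```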
13:26:14 <Qiming> so the question is about when we are gonna change the desired_capacity
13:26:15 <elynn> ok, I might need to revise the cluster resource in heat.
13:26:20 <yanyanhu> Qiming, about that patch, actually I agree with xujun that the desired_capacity should be changed once the node object is added into the DB
13:26:44 <Qiming> yanyanhu, I agree with that patch too
13:26:51 <cschulz> As I see it desired capacity only changes based on an Action.
13:27:07 <Qiming> but I dislike the way of fixing it piece by piece
13:27:29 <Qiming> say we get a request: cluster-scale-in by 1
13:27:31 <yanyanhu> Qiming, yes, me neither
13:27:57 <Qiming> that is an intent to shrink the cluster size by 1, an intent to decrease desired_capacity
13:28:32 <cschulz> Shouldn't that scale in actually change the desired capacity?
13:28:42 <Qiming> when we remove a node from a cluster (no matter whether we destroy it or not after the removal), the intent is to decrease the desired_capacity as well
13:29:18 <Qiming> cschulz, that is the expectation, but I'm not 100% sure we are consistently enforcing this change everywhere
13:29:20 <yanyanhu> cschulz, desired_capacity is expected to be changed during that action
13:29:36 <yanyanhu> Qiming, yes, that's right
13:29:41 <cschulz> Yes I agree.
13:29:45 <Qiming> so ... I'm hoping we do a cleanse of desired_capacity
13:29:58 <yanyanhu> from the DB level?
13:30:04 <cschulz> So current capacity is always trying to match desired capacity.
13:30:15 <Qiming> the first thing to do when processing an action is to change its desired_capacity
13:30:24 <Qiming> no matter whether the action succeeds or not
13:31:00 <yanyanhu> Qiming, agree
13:31:20 <Qiming> certainly, we will do some sanity checking before changing desired_capacity, e.g. whether it is going beyond the max_size etc.
13:31:30 <cschulz> Ah ... So here is the rub. Should there be a daemon which is always looking at the current capacity and the desired capacity and triggering an action if they are different?
13:31:50 <Qiming> cschulz, you are running ahead of us, :)
13:31:53 <yanyanhu> cschulz, that's an important thing we want to do :)
13:32:01 <yanyanhu> although maybe in future
13:32:17 <Qiming> let's reach a consensus here, about desired_capacity
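One reading of the consensus above (adjust desired_capacity up front, after sanity checks, regardless of whether the action later succeeds) is sketched below. This is illustrative only, not Senlin's implementation; the helper name and the convention that a negative max_size means "unlimited" are assumptions:

```python
# Illustrative only: apply the size change to desired_capacity first,
# after bounding it, before any node is actually created or deleted.
def update_desired_capacity(cluster, delta):
    new_desired = cluster.desired_capacity + delta

    # Sanity checks against the cluster's size constraints.
    if new_desired < cluster.min_size:
        raise ValueError("request would shrink the cluster below min_size")
    if cluster.max_size >= 0 and new_desired > cluster.max_size:
        # assumption here: a negative max_size means "no upper bound"
        raise ValueError("request would grow the cluster beyond max_size")

    # Record the new intent up front; the node operations that follow may
    # still fail, leaving the current size different from desired_capacity.
    cluster.desired_capacity = new_desired
    return new_desired
```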
13:32:36 <Qiming> #topic rework NODE_CREATE, NODE_DELETE actions
13:32:55 <yanyanhu> Qiming, that's the point
13:32:56 <Qiming> these two actions are currently off the radar for policy checking
13:33:05 <Qiming> this has to be fixed
13:33:12 <yanyanhu> and maybe we also need to recheck the current implementation of cluster resize/scale actions
13:33:18 <yanyanhu> to ensure their logic is right
13:33:42 <Qiming> adding NODE_CREATE/DELETE into policy checking is gonna further complicate the policy checking logic
13:33:54 <Qiming> so my proposal is like this:
13:34:08 <Qiming> when we get a node-create request, with cluster_id specified
13:34:29 <Qiming> we create a NODE_CREATE action followed by an internal CLUSTER_ADD_NODE action
13:35:03 <Qiming> the second action will trigger the cluster policies
13:35:09 <yanyanhu> you mean cluster_add_node action depends on node_create action?
13:35:18 <Qiming> yes, a single dependency
13:35:23 <yanyanhu> ok
13:35:35 <Qiming> it is a little bit tricky to get it doen
13:35:39 <Qiming> s/doen/done
13:35:42 <yanyanhu> sounds good :)
13:35:58 <Qiming> we should still lock the cluster first before locking the node
13:36:28 <Qiming> if we do these 2 steps in the reverse order, we will risk introducing some deadlocks
13:36:33 <cschulz> Might that cause the cluster to be locked for a long time?
13:36:41 <yanyanhu> cluster should be locked only when the cluster_add_node action is executed, right?
13:36:56 <yanyanhu> cschulz, I have the same concern
13:37:05 <Qiming> so .. a rule of thumb, always lock a cluster first then lock the node you want to operate on
13:37:27 <Qiming> hopefully, it won't be that long
13:37:42 <yanyanhu> Qiming, yes, but the first node_create action has no relationship with the cluster I think
13:37:44 <cschulz> OK, but don't lock cluster before performing the create-node
13:37:50 <Qiming> it is much shorter than cluster-scale-out count=2
13:38:01 <yanyanhu> until the cluster_add_node action has started
13:38:34 <Qiming> there is a window where we have created a node db object, and we are creating a node
13:38:56 <Qiming> at the same time, another request comes in, it wants a scaling down
13:39:15 <yanyanhu> hmm, that's a problem
13:40:01 <Qiming> anyway, just a rough idea, will need to revisit the workflow and see if it fits in seamlessly
13:40:22 <yanyanhu> or maybe we split creating a node with a target cluster specified into two individual actions
13:40:26 <yanyanhu> ok
13:40:31 <yanyanhu> need further discussion
13:40:46 <Qiming> #action Qiming to add NODE_CREATE/DELETE rework into etherpad for tracking
13:41:21 <Qiming> yanyanhu, that is possible too
13:41:41 <Qiming> we will introduce some waiting logic in the engine service before adding the node to the cluster
13:41:55 <yanyanhu> Qiming, that will be nice
13:42:19 <Qiming> will dig more
13:42:27 <cschulz> I don't like waits. You never know how long is enough.
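A rough, purely hypothetical sketch of the proposal discussed above (all names and structures are illustrative, not Senlin's internals): a node-create request with a cluster_id becomes two chained actions, where the second one carries the policy checking and the locking is always cluster-first, then node, so that concurrent actions cannot deadlock by taking the locks in opposite orders.

```python
# Hypothetical sketch of the two-action proposal; not Senlin's implementation.
import uuid


def make_action(name, target, depends_on=None):
    return {'id': str(uuid.uuid4()), 'name': name,
            'target': target, 'depends_on': depends_on or []}


def node_create_request(node_name, cluster_id=None):
    actions = []
    create = make_action('NODE_CREATE', node_name)
    actions.append(create)
    if cluster_id:
        # The internal CLUSTER_ADD_NODE action runs the cluster policies and
        # only becomes runnable once NODE_CREATE has completed (a single
        # dependency).  When it executes, it would lock the cluster first and
        # then the node, never the reverse order.
        add = make_action('CLUSTER_ADD_NODE', cluster_id,
                          depends_on=[create['id']])
        actions.append(add)
    return actions
```

The open issue from the discussion remains visible in this shape: between storing the node DB object and running CLUSTER_ADD_NODE there is a window in which a concurrent scale-in could observe an inconsistent size.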
13:42:28 <Qiming> #topic health management
13:42:42 <Qiming> yes, cschulz, agreed
13:42:56 <lixinhui_> I registered a new BP https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:43:01 <lixinhui_> sorry
13:43:12 <lixinhui_> https://blueprints.launchpad.net/senlin/+spec/support-health-management-customization
13:43:26 <lixinhui_> need all of your help to review
13:43:52 <lixinhui_> in Mitaka 2, we have implemented the check and recover functions
13:43:58 <Qiming> good, but we still need some more details
13:44:17 <lixinhui_> we need to discuss how to enable them
13:44:34 <Qiming> right, those are the primitives to check if a node is 'healthy' or not and to recover a node by recreating it
13:45:09 <Qiming> I think we will need to expose those operations as APIs
13:45:24 <Qiming> triggered by some external node monitoring software
13:45:27 <yanyanhu> xinhui, I agree with what you described in the whiteboard
13:45:52 <lixinhui_> Yes
13:45:56 <Qiming> there is a subtle question here
13:46:07 <yanyanhu> maybe the first step is exposing those two actions through the API
13:46:11 <Qiming> checking logic is okay, we only check the nodes we know of
13:46:17 <cschulz> So is it possible for metric information to be part of the health check test?
13:46:42 <yanyanhu> cschulz, that depends on the profile implementation
13:46:49 <cschulz> OK
13:47:09 <Qiming> cschulz, the expectation is that we will build some internal health check polling as a starting point
13:47:18 <lixinhui_> yanyanhu, you mean the first one: manual
13:47:24 <yanyanhu> if you need it, you can implement it in the do_check routine of the profile
13:47:36 <Qiming> that polling can be disabled by the deployer easily, if they have a better alternative for health monitoring
13:47:51 <yanyanhu> lixinhui_, yes
13:47:53 <Qiming> back to the 'recover' action
13:48:20 <Qiming> we have primitives for node recover, which defaults to recreate, or nova server rebuild, this is fine too
13:48:33 <Qiming> but for cluster recover, we may need to consider the desired_capacity
13:48:49 <Qiming> this is leading us back to the question cschulz asked before
13:49:02 <yanyanhu> yes
13:49:20 <Qiming> if we find a cluster only has 2 nodes ACTIVE and 1 node ERROR, and desired_capacity is 4
13:49:46 <Qiming> the expected recover action should be rebuild one plus create one
13:49:49 <lixinhui_> Qiming and Yanyan, I agree this is a problem. but it is the concept of limits of "resource pool"
13:50:08 <lixinhui_> I do not think recover needs to think about this
13:50:14 <Qiming> that's my understanding of recover
13:50:24 <cschulz> I think the test is as Qiming stated plus no actions currently against that cluster.
13:50:49 <lixinhui_> since the error node comes from awareness of inconsistency of node status stored by senlin and nova
13:50:50 <cschulz> i.e. nothing is trying to create that 4th node.
13:50:51 <Qiming> or at least we should provide such an option to users
13:51:08 <Qiming> node level recovery is fine
13:51:36 <lixinhui_> cluster recover forks node recover nowadays
13:51:40 <Qiming> but for recovering a cluster, it means you are taking full responsibility to get it back to its desired_capacity?
13:52:03 <cschulz> Yes, I think that is the correct approach
13:52:28 <Qiming> or else, we will have to introduce another operation to do this job
13:52:38 <Qiming> it would be a headache
13:52:46 <yanyanhu> definitely
13:52:55 <lixinhui_> i think it should be another service to handle limits of capacity
13:53:05 <lixinhui_> not only for recover
13:53:18 <yanyanhu> very difficult to say which way is better before thinking about this problem more deeply
13:53:47 <Qiming> once these mechanisms are all in place, we proceed to think of making this whole process semi-automatic and/or fully-automatic
13:53:59 <Qiming> again, just a rough idea
13:54:14 <Qiming> we can think of it offline
13:54:19 <yanyanhu> sure
13:54:21 <lixinhui_> okay
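The cluster-recover interpretation Qiming describes (a 4-node cluster with 2 ACTIVE and 1 ERROR should rebuild one node and create one more) can be written down as a small planning sketch. This is illustrative only, not Senlin's implementation; the function and the plan format are assumptions:

```python
# Illustrative sketch: rebuild every ERROR node, then create enough new
# nodes to bring the cluster back to desired_capacity.
def plan_cluster_recovery(nodes, desired_capacity):
    """nodes: list of (name, status) tuples; returns a recovery plan."""
    to_rebuild = [name for name, status in nodes if status == 'ERROR']
    healthy = [name for name, status in nodes if status == 'ACTIVE']

    # After rebuilding, every existing node counts toward the capacity.
    shortfall = desired_capacity - (len(healthy) + len(to_rebuild))
    return {'rebuild': to_rebuild, 'create': max(shortfall, 0)}


# The example from the discussion: 2 ACTIVE + 1 ERROR, desired_capacity = 4,
# yields "rebuild one plus create one".
plan = plan_cluster_recovery([('n1', 'ACTIVE'), ('n2', 'ACTIVE'),
                              ('n3', 'ERROR')], desired_capacity=4)
assert plan == {'rebuild': ['n3'], 'create': 1}
```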
13:54:29 <Qiming> #topic open discussions
13:54:49 <yanyanhu> I think elynn has proposed a topic
13:54:56 <yanyanhu> into Austin summit
13:54:58 <Qiming> shoot
13:55:08 <Qiming> ... that 'meeting'
13:55:09 <yanyanhu> ?
13:55:15 <elynn> Do you have any topic to propose?
13:55:32 <yanyanhu> elynn, yours is the only one :)
13:55:33 <Qiming> I want to propose a solution for container clustering
13:55:50 <elynn> More is better :)
13:55:53 <Qiming> hopefully we can get a POC done by the summit
13:56:11 <yanyanhu> Qiming, you mean a container cluster managed by senlin?
13:56:12 <Qiming> I'm collecting all the SUR works last year
13:56:20 <lixinhui_> :)
13:56:21 <yanyanhu> not other container services
13:56:34 <Qiming> yes, openstack native container clustering
13:56:39 <yanyanhu> cool!
13:56:45 <lixinhui_> cool!
13:57:22 <Qiming> proposal deadline is approaching, need to work hard on this
13:57:53 <Qiming> or another candidate would be the health scenario
13:58:15 <lixinhui_> who to show this health scenario
13:58:18 <Qiming> as far as I know of, HA has been a long-wanted feature, not provided by any service
13:58:25 <lixinhui_> how to show
13:58:41 <Qiming> lixinhui_, you tell me, ;)
13:58:44 <yanyanhu> hmm, I think we need a more specific use case for building the health topic
13:58:51 <cschulz> As you think about HA, also consider DR.
13:59:04 <yanyanhu> otherwise, it's difficult to describe it clearly
13:59:05 <elynn> maybe should think of a HA scenario :)
13:59:07 <Qiming> cschulz, good idea
13:59:25 <lixinhui_> :)
13:59:27 <Qiming> time's up, let's switch back to #senlin
13:59:39 <Qiming> thanks for joining the meeting, guys
13:59:43 <Qiming> talk to you next week
13:59:46 <yanyanhu> thanks
13:59:47 <Qiming> #endmeeting