13:03:09 #startmeeting senlin
13:03:09 Meeting started Tue Jan 26 13:03:09 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:03:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:03:13 The meeting name has been set to 'senlin'
13:03:23 Hi
13:03:31 seems they forgot to endmeeting
13:03:37 hi, cschulz
13:03:54 pls review agenda here
13:03:56 #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:03:59 o/
13:04:21 oops, two open discussions, we got plenty of time?
13:04:38 :)
13:04:55 anyway, let's get started with the etherpad
13:05:01 #topic mitaka work items
13:05:05 #link https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:05:21 no progress on API as far as I know
13:05:42 heat specs proposed for node/policy/receiver
13:05:58 https://review.openstack.org/271949
13:05:58 yes, waiting for review
13:06:27 not aware of any on-going work about testing
13:06:47 we got some bug reports and guys have helped work on them
13:07:09 health management, we have a dedicated timeslot for that
13:07:17 lixinhui_, it's yours
13:07:17 oh
13:07:24 https://blueprints.launchpad.net/senlin/+spec/support-health-management-customization
13:07:42 documentation, em, no progress either
13:08:12 Sample scenarios/stories, I'll get on to it
13:08:23 cool
13:08:47 just let us know if you need help
13:08:53 my etherpad connection broke
13:09:10 network problem?
13:09:31 That's normal, get used to it :)
13:09:44 :)
13:09:54 yes
13:10:42 does it work for any of you?
13:10:45 I'm putting together an example scalable stack using a Senlin cluster.
13:10:53 please drive the flow
13:10:55 it works for me
13:11:08 next one is profile
13:11:20 okay, now I'm on it
13:11:25 I plan to work on heat stack update support first
13:11:30 need a proxy again
13:11:34 who is cschulz
13:11:36 great
13:11:58 could you introduce yourself?
13:12:02 cschulz, that would be great
13:12:03 I'm Chuck Schulz from IBM Research in NY USA.
13:12:10 about disk property update support for the nova server profile, will give it lower priority
13:12:23 Nice to meet you, cschulz
13:12:23 cschulz, welcome :)
13:12:24 Hi Chuck, welcome.
13:12:49 looks like disk updating is not common
13:13:00 agree
13:13:10 just leave it there
13:13:21 yes, that is my plan
13:13:26 I'm experimenting with a cluster of containers
13:13:28 elynn, We've been exchanging email. Thanks for the NotFound patch.
13:13:30 using coreos
13:13:56 it is an attempt to find the function gaps
13:14:05 cschulz: yes, feel free to reach me if you have any problems in using senlin resources :)
13:14:26 policies
13:14:42 most work has been done I think
13:14:48 in the last week
13:14:49 we have released fixes to the placement and deletion policies
13:15:09 \o/
13:15:17 lb policy also works now
13:15:27 great
13:15:29 all history now
13:15:33 yea
13:15:34 That's great, we are on the way.
13:15:54 receiver functional test
13:16:10 will start this work after finishing heat stack update support
13:16:12 we have deleted the webhook test, but not yet added one for receiver
13:16:17 okay
13:16:36 engine status list is done
13:16:40 yes
13:16:44 that bp closed?
13:16:50 not yet
13:16:52 hi, elynn, about the problem I met before
13:16:54 not yet approved..
13:17:02 has it been addressed?
13:17:15 you mentioned there could be a race condition there
13:17:20 yanyanhu: I still don't have any idea about how to fix it.
13:17:40 ok, it didn't happen every time I restarted the senlin engine
13:18:09 trapped in meetings these two days, I think it's better to open a bug to track it.
13:18:10 #action Qiming to update bp status - service list
13:18:27 yep, filing a bug sounds like a plan
13:18:47 Seems multi-engine will have a potential race condition when deleting the same record.
13:19:06 yes, that could be the reason
13:19:08 we can allocate some time for a discussion about event log generation in the next meetings
13:19:12 not a high priority
13:19:20 Qiming, ok
13:19:23 It's not a blocking issue I think
13:19:59 okay, anything missed in the etherpad?
13:20:09 lock breaker?
13:20:22 yes? any progress?
13:20:34 I'm done coding and doing some tests in a local env
13:20:40 it's based on engine status.
13:20:47 okay
13:20:53 Will steal the lock after retry
13:20:56 cool
13:21:14 yes, just be careful on the critical path
13:21:30 elynn, looking forward to your patch :)
13:21:49 cool, elynn!
13:21:53 #topic cleanse desired_capacity
13:22:06 this one is a little bit dirty today
13:22:12 yep
13:22:22 saw xujun's patch and your comment
13:22:29 we don't have a clear definition of desired_capacity, so we are not handling it in a consistent way
13:22:48 yep, xujun talked to me 30 mins ago
13:23:09 seems a concise definition is needed, to guide the code implementation
13:23:36 desired_capacity, as the name suggests, is about the 'desired' size of a cluster
13:23:49 it may or may not be the current size of a cluster
13:24:09 So how do we get the current_size?
13:24:30 do a node_get_all
13:25:07 we are not maintaining the current size as a cluster property
13:25:20 it is always an instantaneous concept
13:25:29 I don't think it should be.
13:25:51 Current capacity is a bit of data that may change at any moment.
13:25:59 exactly
13:26:13 ...
13:26:14 so the question is about when are we gonna change the desired_capacity
13:26:15 ok, I might need to revise the cluster resource in heat.
13:26:20 Qiming, about that patch, actually I agree with xujun that the desired_capacity should be changed once the node object is added into the DB
13:26:44 yanyanhu, I agree with that patch too
13:26:51 As I see it desired capacity only changes based on an Action.
13:27:07 but I dislike the way to fix it piece by piece
13:27:29 say we get a request cluster-scale-in by 1
13:27:31 Qiming, yes, me neither
13:27:57 that is an intent to shrink the cluster size by 1, an intent to decrease desired_capacity
13:28:32 Shouldn't that scale in actually change the desired capacity?
13:28:42 when we remove a node from a cluster (no matter whether we destroy it or not after the removal), the intent is to decrease the desired_capacity as well
13:29:18 cschulz, that is the expectation, but I'm not 100% sure we are consistently enforcing this change everywhere
13:29:20 cschulz, desired_capacity is expected to be changed during that action
13:29:36 Qiming, yes, that's right
13:29:41 Yes I agree.
13:29:45 so ... I'm hoping we do a cleanse of desired_capacity
13:29:58 from the DB level?
13:30:04 So current capacity is always trying to match desired capacity.
13:30:15 the first thing to do when processing an action is to change its desired_capacity
13:30:24 no matter whether the action succeeds or not
13:31:00 Qiming, agree
13:31:20 certainly, we will do some sanity checking before changing desired_capacity, e.g. whether it is going beyond the max_size etc.
13:31:30 Ah ... So here is the rub.
Should there be a daemon which is always looking at the current capacity and the desired capacity and triggering an action if they are different?
13:31:50 cschulz, you are running ahead of us, :)
13:31:53 cschulz, that's an important thing we want to do :)
13:32:01 although maybe in future
13:32:17 let's reach a consensus here, about desired_capacity
13:32:36 #topic rework NODE_CREATE, NODE_DELETE actions
13:32:55 Qiming, that's the point
13:32:56 these two actions are currently off the radar for policy checking
13:33:05 this has to be fixed
13:33:12 and maybe we also need to recheck the current implementation of cluster resize/scale actions
13:33:18 to ensure their logic is right
13:33:42 adding NODE_CREATE/DELETE into policy checking is gonna further complicate the policy checking logic
13:33:54 so my proposal is like this:
13:34:08 when we get a node-create request, with cluster_id specified
13:34:29 we create a NODE_CREATE action followed by an internal CLUSTER_ADD_NODE action
13:35:03 the second action will trigger the cluster policies
13:35:09 you mean the cluster_add_node action depends on the node_create action?
13:35:18 yes, a single dependency
13:35:23 ok
13:35:35 it is a little bit tricky to get it done
13:35:42 sounds good :)
13:35:58 we should still lock the cluster first before locking the node
13:36:28 if we do these 2 steps in the reverse order, we will risk introducing some dead locks
13:36:33 Might that cause the cluster to be locked for a long time?
13:36:41 the cluster should be locked only when the cluster_add_node action is executed, right?
13:36:56 cschulz, I have the same concern
13:37:05 so .. a rule of thumb, always lock a cluster first, then lock the node you want to operate on
13:37:27 hopefully, it won't be that long
13:37:42 Qiming, yes, but the first node_create action has no relationship with the cluster I think
13:37:50 OK, but don't lock the cluster before performing the create-node
13:38:01 it is much shorter than cluster-scale-out count=2
13:38:34 there is a window where we have created a node db object, and we are creating a node
13:38:56 at the same time, another request comes in, it wants to scale down
13:39:15 hmm. that's a problem
13:40:01 anyway, just a rough idea, will need to revisit the workflow and see if it fits in seamlessly
13:40:22 or maybe we split creating a node with a target cluster specified into two individual actions
13:40:26 ok
13:40:31 need further discussion
13:40:46 #action Qiming to add NODE_CREATE/DELETE rework into etherpad for tracking
13:41:21 yanyanhu, that is possible too
13:41:41 we will introduce some waiting logic in the engine service before adding the node to the cluster
13:41:55 Qiming, that will be nice
13:42:19 will dig more
13:42:27 I don't like waits. You never know how long is enough.
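To make the "cleanse desired_capacity" discussion above concrete, here is a minimal Python sketch of the rule the team converged on: a scaling action first adjusts desired_capacity (after sanity-checking it against min_size/max_size), before any node is actually created or destroyed and regardless of whether those node operations later succeed. The Cluster class and adjust_desired_capacity function are hypothetical illustrations under that assumption, not the actual Senlin engine code.

    # Illustrative sketch only; Cluster and adjust_desired_capacity are
    # hypothetical stand-ins, not the real Senlin engine classes.


    class Cluster(object):
        def __init__(self, desired_capacity, min_size, max_size):
            self.desired_capacity = desired_capacity
            self.min_size = min_size
            self.max_size = max_size


    def adjust_desired_capacity(cluster, delta):
        """Apply a scaling intent to a cluster's desired_capacity.

        The new value is sanity-checked against min_size/max_size before it
        is recorded; the adjustment happens before any node is created or
        destroyed, so desired_capacity always captures the intent even if
        the follow-up node operations fail.
        """
        target = cluster.desired_capacity + delta
        if target > cluster.max_size:
            raise ValueError("target size %d exceeds max_size %d"
                             % (target, cluster.max_size))
        if target < cluster.min_size:
            raise ValueError("target size %d is below min_size %d"
                             % (target, cluster.min_size))

        cluster.desired_capacity = target   # record the intent first
        return target


    # Example: a cluster-scale-in by 1 is an intent to decrease desired_capacity.
    cluster = Cluster(desired_capacity=4, min_size=1, max_size=10)
    adjust_desired_capacity(cluster, -1)    # desired_capacity becomes 3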
13:42:28 #topic health management
13:42:42 yes, cschulz, agreed
13:42:56 I registered a new BP https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:43:01 sorry
13:43:12 https://blueprints.launchpad.net/senlin/+spec/support-health-management-customization
13:43:26 need all of your help to review
13:43:52 in Mitaka 2, we have implemented the check and recover functions
13:43:58 good, but we still need some more details
13:44:17 we need to discuss how to enable them
13:44:34 right, those are the primitives to check if a node is 'healthy' or not and to recover a node by recreating it
13:45:09 I think we will need to expose those operations as APIs
13:45:24 triggered by some external node monitoring software
13:45:27 xinhui, I agree with what you described in the whiteboard
13:45:52 Yes
13:45:56 there is a subtle question here
13:46:07 maybe the first step is exposing those two actions through the API
13:46:11 the checking logic is okay, we only check the nodes we know of
13:46:17 So is it possible for metric information to be part of the health check test?
13:46:42 cschulz, that depends on the profile implementation
13:46:49 OK
13:47:09 cschulz, the expectation is that we will build some internal health check polling as a starting point
13:47:18 yanyanhu, you mean the first one: manual
13:47:24 if you need it, you can implement it in the do_check routine of the profile
13:47:36 that polling can be disabled by the deployer easily, if they have a better alternative for health monitoring
13:47:51 lixinhui_, yes
13:47:53 back to the 'recover' action
13:48:20 we have primitives for node recover, which defaults to recreate, or nova server rebuild, this is fine too
13:48:33 but for cluster recover, we may need to consider the desired_capacity
13:48:49 this is leading us back to the question cschulz asked before
13:49:02 yes
13:49:20 if we find a cluster only has 2 nodes ACTIVE and 1 node ERROR, and desired_capacity is 4
13:49:46 the expected recover action should be rebuild one plus create one
13:49:49 Qiming and Yanyan, I agree this is a problem. but it is about the concept of limits of a "resource pool"
13:50:08 I do not think recover needs to think about this
13:50:14 that's my understanding of recover
13:50:24 I think the test is as Qiming stated plus no actions currently against that cluster.
13:50:49 since the error node comes from awareness of an inconsistency between the node status stored by senlin and nova
13:50:50 i.e. nothing is trying to create that 4th node.
13:50:51 or at least we should provide such an option to users
13:51:08 node level recovery is fine
13:51:36 cluster recover forks node recover nowadays
13:51:40 but for recovering a cluster, it means you are taking full responsibility to get it back to its desired_capacity?
13:52:03 Yes, I think that is the correct approach
13:52:28 or else, we will have to introduce another operation to do this job
13:52:38 it would be a headache
13:52:46 definitely
13:52:55 i think it should be another service to handle limits of capacity
13:53:05 not only for recover
13:53:18 very difficult to say which way is better before thinking about this problem more deeply
13:53:47 once these mechanisms are all in place, we can proceed to think of making this whole process semi-automatic and/or fully automatic
13:53:59 again, just a rough idea
13:54:14 we can think of it offline
13:54:19 sure
13:54:21 okay
13:54:29 #topic open discussions
13:54:49 I think elynn has proposed a topic
13:54:56 for the Austin summit
13:54:58 shoot
13:55:08 ... that 'meeting'
13:55:09 ?
13:55:15 Do you have any topic to propose?
13:55:32 elynn, yours is the only one :)
13:55:33 I want to propose a solution for container clustering
13:55:50 More is better :)
13:55:53 hopefully we can get a POC done by the summit
13:56:11 Qiming, you mean a container cluster managed by senlin?
13:56:12 I'm collecting all the SUR work from last year
13:56:20 :)
13:56:21 not other container services
13:56:34 yes, openstack native container clustering
13:56:39 cool!
13:56:45 cool!
13:57:22 the proposal deadline is approaching, need to work hard on this
13:57:53 or another candidate would be the health scenario
13:58:15 who to show this health scenario
13:58:18 as far as I know, HA has been a long-wanted feature, not provided by any service
13:58:25 how to show
13:58:41 lixinhui_, you tell me, ;)
13:58:44 hmm, I think we need a more specific use case for building the health topic
13:58:51 As you think about HA, also consider DR.
13:59:04 otherwise, it's difficult to describe it clearly
13:59:05 maybe we should think of an HA scenario :)
13:59:07 cschulz, good idea
13:59:25 :)
13:59:27 time's up, let's switch back to #senlin
13:59:39 thanks for joining the meeting, guys
13:59:43 talk to you next week
13:59:46 thanks
13:59:47 #endmeeting
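The cluster-recover case walked through in the health management topic above (2 ACTIVE nodes, 1 ERROR node, desired_capacity of 4, so rebuild one plus create one) can be sketched as follows. This is only an illustration of the proposed semantics; recover_cluster and the node tuples used here are hypothetical, not the current Senlin implementation, which today simply forks node recover.

    # Hypothetical sketch of the proposed cluster-recover semantics: repair
    # ERROR nodes first, then create new nodes to reach desired_capacity.


    def recover_cluster(nodes, desired_capacity):
        """Return a recovery plan: rebuild ERROR nodes, then top up.

        `nodes` is a list of (name, status) tuples; statuses are assumed
        to be 'ACTIVE' or 'ERROR'.
        """
        to_rebuild = [name for name, status in nodes if status == 'ERROR']
        # After the ERROR nodes are rebuilt the cluster is back to len(nodes)
        # members, so any remaining gap to desired_capacity is filled by
        # creating new nodes.
        to_create = max(0, desired_capacity - len(nodes))
        return {'rebuild': to_rebuild, 'create': to_create}


    # The case from the meeting: 2 ACTIVE nodes, 1 ERROR node, desired_capacity 4
    plan = recover_cluster(
        [('node-1', 'ACTIVE'), ('node-2', 'ACTIVE'), ('node-3', 'ERROR')],
        desired_capacity=4)
    print(plan)   # {'rebuild': ['node-3'], 'create': 1}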