13:01:06 <yanyanhu> #startmeeting senlin
13:01:07 <openstack> Meeting started Tue Sep 20 13:01:06 2016 UTC and is due to finish in 60 minutes. The chair is yanyanhu. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:01:10 <openstack> The meeting name has been set to 'senlin'
13:01:19 <yanyanhu> hello
13:02:07 <elynn> Hi
13:02:08 <guoshan> hi, all
13:02:36 <yanyanhu> hi
13:02:52 <yanyanhu> Qiming will come soon
13:03:41 <yanyanhu> so let's go through the newton work item list first
13:04:00 <yanyanhu> https://etherpad.openstack.org/p/senlin-newton-workitems
13:04:02 <yanyanhu> this one
13:04:14 <yanyanhu> Performance test
13:04:21 <yanyanhu> no progress in the last week I think
13:04:36 <yanyanhu> #topic newton workitem
13:05:03 <yanyanhu> about more senlin support on the rally side, didn't get time to work on it recently
13:05:24 <yanyanhu> have been working on the message type of receiver support in the last two weeks
13:05:47 <yanyanhu> will resume the performance test work after this job is done
13:06:01 <yanyanhu> Health Management
13:06:33 <yanyanhu> I think Qiming has something to update. Let's skip this and wait for him to come back
13:07:00 <yanyanhu> Document, no progress I guess
13:07:12 <yanyanhu> container profile
13:07:20 <yanyanhu> haiwei is not here I think?
13:07:45 <yanyanhu> looks so. let's move on
13:08:01 <yanyanhu> Zaqar message type of receiver
13:08:57 <yanyanhu> I'm now working on it. The receiver creation part has been done, including queue/subscription creation, and trust building between the end user and the zaqar trustee user
13:09:23 <yanyanhu> and also the api and rpc interfaces
13:09:42 <yanyanhu> now working on message notification handling
13:09:58 <yanyanhu> https://review.openstack.org/373004
13:10:00 <Qiming> o/
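(For reference: a rough sketch of how a caller could trigger the cluster action bound to such a receiver by posting to its Zaqar queue with python-zaqarclient. The queue name, credentials, and message payload below are illustrative assumptions, not the exact format implemented by the patch under review.)

    # Sketch only: post a message to the Zaqar queue behind a Senlin
    # message-type receiver; queue name, credentials and payload shape
    # are assumptions for illustration.
    from zaqarclient.queues import client as zaqar_client

    conf = {
        'auth_opts': {
            'backend': 'keystone',
            'options': {
                'os_username': 'demo',                 # hypothetical end user
                'os_password': 'secret',
                'os_project_name': 'demo',
                'os_auth_url': 'http://keystone.example.com:5000/v3',
            },
        },
    }

    cli = zaqar_client.Client('http://zaqar.example.com:8888/', version=2, conf=conf)
    queue = cli.queue('senlin-receiver-queue')         # queue created by the receiver

    # Posting a message is what triggers the action bound to the receiver.
    queue.post({'body': {'action': 'CLUSTER_SCALE_OUT'}, 'ttl': 3600})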
13:10:04 <elynn> zaqar trustee user is a configuration option in senlin.conf?
13:10:05 <yanyanhu> hi, Qiming
13:10:10 <yanyanhu> elynn, yes
13:10:17 <yanyanhu> it is configurable
13:10:38 <yanyanhu> since operator can define the trustee user on the zaqar side as well
13:10:41 <Qiming> it is a configuration option stolen from oslo_config I think
13:10:51 <yanyanhu> although the default trustee user will be 'zaqar'
13:11:06 <yanyanhu> Qiming, yes
13:11:28 <yanyanhu> will document this part well to make it clear for users/operators
13:11:34 <Qiming> I'd suggest we don't add this kind of config options only to be deprecated/invalidated some day
13:11:55 <yanyanhu> Qiming, yes
13:12:08 <yanyanhu> that also depends on how zaqar supports it
13:12:58 <yanyanhu> so will keep looking at it and talking with the zaqar team to ensure our usage is correct
13:13:08 <Qiming> sounds like something negotiable
13:13:30 <yanyanhu> yes
13:14:18 <yanyanhu> it now works as expected
13:14:37 <yanyanhu> https://review.openstack.org/373004 after applying this patch which is in progress
13:15:34 <yanyanhu> after creating a message type of receiver, the user can trigger different actions on a specific cluster by posting messages to the message queue
13:16:13 <yanyanhu> the queue can be reused multiple times
13:16:17 <yanyanhu> until the receiver is deleted
13:16:20 <Qiming> need a tutorial doc on this, so that users know how to use it
13:16:32 <yanyanhu> Qiming, absolutely
13:16:40 <yanyanhu> that is necessary
13:16:48 <yanyanhu> will work on it after the basic support is done
13:17:00 <Qiming> cool, thx
13:17:05 <yanyanhu> my pleasure
13:17:19 <yanyanhu> so this is the progress of the message receiver
13:18:15 <yanyanhu> hi Qiming, we just skipped the HA topic, could you plz give the update on it. thanks
13:18:41 <Qiming> yep, there is a patch which needs some eyes
13:18:42 <Qiming> https://review.openstack.org/#/c/369937/
13:19:10 <Qiming> the only problem we found so far is about the context used when calling cluster_check
13:19:46 <Qiming> the health manager will invoke cluster_check periodically using an admin context
13:20:17 <Qiming> an admin context is special, in that it has no meaningful fields except for is_admin set to True
13:20:23 <yanyanhu> Qiming, yes
13:20:43 <Qiming> such a context will be referenced later in the action, and the action wants to record the requesting user/project
13:20:50 <Qiming> which, in this case, are both None
13:21:32 <Qiming> an action having user/project set to None cannot be deserialized later because we strictly require all action objects to have user/project associated with them
13:21:49 <yanyanhu> I see
13:22:11 <Qiming> xuefeng has helped propose a fix to this. pls help review
13:22:11 <yanyanhu> maybe we should invoke cluster_check using the senlin service context?
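(For reference: a simplified illustration, not Senlin's actual classes, of the problem just described: an action built from a bare admin context has no user/project and cannot be restored later, while a service/trustee context carries both.)

    # Simplified illustration only, not Senlin's real implementation.
    class Context(object):
        def __init__(self, user=None, project=None, is_admin=False):
            self.user = user
            self.project = project
            self.is_admin = is_admin

    admin_ctx = Context(is_admin=True)                        # user/project both None
    service_ctx = Context(user='senlin', project='service')   # hypothetical trustee identity

    def build_action(ctx, cluster_id, action='CLUSTER_CHECK'):
        # Every action record must carry a user and a project, otherwise it
        # cannot be deserialized later -- this is the failure being discussed.
        if ctx.user is None or ctx.project is None:
            raise ValueError('%s on %s has no user/project' % (action, cluster_id))
        return {'action': action, 'cluster': cluster_id,
                'user': ctx.user, 'project': ctx.project}

    build_action(service_ctx, 'web-cluster')    # fine
    try:
        build_action(admin_ctx, 'web-cluster')  # mirrors the reported bug
    except ValueError as exc:
        print(exc)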
13:22:32 <yanyanhu> Qiming, will check it
13:22:33 <Qiming> service context has a different user/project
13:22:37 <Qiming> it makes sense also
13:22:40 <yanyanhu> from the cluster owner
13:22:42 <yanyanhu> yes
13:22:48 <yanyanhu> will think about it
13:22:59 <Qiming> a service context is more appropriate imo
13:23:05 <Qiming> more accurate
13:23:17 <yanyanhu> yes, since this action is actually done by senlin
13:23:34 <Qiming> yes
13:24:19 <Qiming> I've added something to the agenda today
13:24:30 <Qiming> one thing is about desired capacity
13:24:49 <yanyanhu> ok, I noticed some recent changes are about it
13:24:51 <Qiming> I'm still dealing with the last relevant action (CLUSTER_DEL_NODES)
13:25:07 <Qiming> hopefully can get it done early tomorrow
13:25:20 <yanyanhu> great
13:25:39 <Qiming> the idea is to encourage such a usage scenario
13:25:57 <Qiming> a user observes the current/actual capacity when examining a cluster
13:26:17 <Qiming> the desired capacity means nothing, it is just an indicator of an ideal case
13:26:18 <yanyanhu> so the current logic is that all desired_capacity recalculation will be done based on the 'real' size of the cluster when adding/deleting nodes to/from the cluster
13:26:47 <Qiming> which cannot be satisfied most of the time in a dynamic environment I'm afraid
13:27:05 <Qiming> at the end of the day, users have to face the truth
13:27:22 <Qiming> they need to know the actual/current capacity and make decisions about their next steps
13:27:31 <yanyanhu> Qiming, it makes sense when the real size of the cluster is different from desired_capacity
13:27:38 <Qiming> actually, that is the logic behind our auto-scaling scenario
13:28:08 <ruijie> Will senlin provide a cron or something to do health checks automatically?
13:28:13 <Qiming> the metrics collected and then used to trigger an auto-scaling operation are based on the actual nodes a cluster has
13:28:57 <Qiming> that implies the triggering was a decision based on real capacity, not the desired capacity
13:29:22 <Qiming> I'm trying to make things consistent across all actions related to cluster size changes
13:29:30 <yanyanhu> Qiming, I think it's reasonable for node creating/deleting scenarios
13:29:45 <yanyanhu> but for cluster scaling/resizing scenarios, I'm not sure
13:30:03 <Qiming> whenever an action is gonna change a cluster's size, it means the users are expressing a new expectation, i.e. the new desired_capacity
13:30:28 <Qiming> even after those operations are performed, you will still face two numbers: actual and desired
13:30:43 <yanyanhu> especially if we want to differentiate 'scaling' and 'recovering'
13:30:56 <yanyanhu> Qiming, yes
13:31:14 <Qiming> okay, I was talking about manual operations, without taking health policy into the picture
13:31:41 <ruijie> My bad :)
13:32:01 <Qiming> when a health policy is attached, users will get more automation in keeping the actual number of nodes close to the desired number
13:32:15 <Qiming> there are some tricky cases to handle
13:32:27 <yanyanhu> Consider this case: a cluster's desired_capacity is 5, its real size is 4, so it is not totally healthy now (maybe in WARNING status)
13:32:51 <Qiming> currently, the recover operation is not trying to create or delete nodes so that the cluster size matches the desired capacity
13:33:09 <Qiming> yes, yanyan, that is a WARNING state
13:33:32 <yanyanhu> then the user performs a cluster_scale_out operation (or node_add operation). If the desired_capacity is recalculated with the real size, it will still be 5.
13:33:40 <yanyanhu> and a new node will be created/added
13:33:46 <Qiming> as we gain more experience on health policy usage, we can add options to the policy, teach it to do some automatic 'convergence' thing
13:33:54 <yanyanhu> then the cluster will switch to a healthy status (e.g. ACTIVE)
13:34:09 <Qiming> yes, in that case, the cluster is healthy
13:34:19 <Qiming> the user's new expectation is 5 nodes
13:34:26 <Qiming> and he has got 5
13:34:30 <yanyanhu> Qiming, exactly, what I want to say is, if the desired_capacity recalculation is done using the real size, cluster scaling could become a kind of recovering operation
13:34:38 <yanyanhu> and will change the cluster's health status
13:34:41 <Qiming> right
13:34:46 <Qiming> sure
13:34:55 <Qiming> that is an implication, a very subtle one
13:35:15 <Qiming> I was even thinking of a cluster_resize operation with no argument
13:35:24 <yanyanhu> so I think we may need to decide whether this kind of status switch is as expected
13:35:30 <elynn> yanyanhu: actually I think the user is expecting 6 if he does cluster_scale_out...
13:35:43 <Qiming> that operation will virtually reset the cluster's status, delete all non-active nodes and re-evaluate the cluster's status
13:35:53 <yanyanhu> elynn, yes, that is something that could confuse users
13:36:03 <yanyanhu> we may need to state it clearly
13:36:06 <elynn> since desired_capacity is what he desired before and now he wants to scale out...
13:36:07 <Qiming> if we are chasing the desired capacity, we will never end the loop
13:36:28 <Qiming> desired is always a 'dream'
13:36:32 <Qiming> so to make that clear
13:36:52 <yanyanhu> Qiming, yes, if the new desired_capacity becomes 6, the cluster's real size will be increased to e.g. 5, and the cluster will remain in WARNING
13:37:11 <Qiming> I'm proposing to add a 'current_capacity' property to the cluster, automatically calculated at the client side or before returning to the client
13:37:13 <yanyanhu> but maybe this is what the user wants :)
13:37:18 <ruijie> Maybe desired_capacity is 6, and real size is 5? And then use a health policy to keep the cluster healthy
13:37:27 <Qiming> exactly, you will never get your cluster status fixed
13:37:50 <yanyanhu> so I mean the cluster status will only be shifted when the user performs a recovering operation
13:37:56 <yanyanhu> maybe
13:38:07 <yanyanhu> since it is a health status related operation
13:38:09 <elynn> so when the user does scale_out, the cluster should: 1. check cluster size 2. create nodes up to the current desired_capacity, which is 5 3. add new nodes to reach 6
13:38:35 <Qiming> elynn, step 2 could fail, step 3 could fail
13:38:48 <yanyanhu> elynn, yes, that is a kind of implicit self recovering
13:38:55 <yanyanhu> that is possible
13:39:14 <Qiming> the only reliable status you can get is by automatically invoking the eval_status method after those operations
13:39:32 <yanyanhu> so maybe we only change the cluster's health status when the user explicitly performs a recovering operation?
13:39:43 <yanyanhu> Qiming, yes
13:39:47 <elynn> If it failed, just show warning and change desired_capacity to 6.
13:39:50 <Qiming> users will always know the 'desired' status, as he/she expressed previously, and the 'current' status, which is always a fact
13:39:53 <yanyanhu> eval_status is for that purpose
13:40:40 <Qiming> elynn, if there is no health policy, how would you make the cluster healthy?
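(For reference: a minimal sketch of the two recalculation rules being debated, applied to the desired=5 / actual=4 example above; the function names are made up for illustration.)

    # Two possible bases for recalculating desired_capacity on scale-out.
    def scale_out_from_actual(actual, desired, count):
        # Proposed rule: the request means "actual size + count".
        return actual + count

    def scale_out_from_desired(actual, desired, count):
        # Alternative reading: the request means "previous desired + count".
        return desired + count

    # Cluster with desired=5 but only 4 healthy nodes; the user scales out by 1.
    print(scale_out_from_actual(4, 5, 1))    # 5 -> cluster now looks "healed"
    print(scale_out_from_desired(4, 5, 1))   # 6 -> only 5 nodes exist, still WARNING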
13:41:05 <yanyanhu> so my thought is we keep the cluster health status unchanged after cluster scaling/resizing/node_adding/deleting
13:41:14 <Qiming> each time you want to add new nodes, you are pushing the desired capacity higher
13:41:35 <yanyanhu> Qiming, yes. maybe manually perform cluster_recover?
13:41:55 <Qiming> cluster_recover is not that reliable
13:41:56 <elynn> could we provide an operation like cluster_heal?
13:42:05 <elynn> manually performed by users?
13:42:08 <Qiming> and it is too complex to get it done right
13:42:20 <Qiming> cluster_recover + cluster_heal ?
13:43:03 <elynn> hmm... yes, that will become more complex...
13:43:53 <Qiming> it is too complicated, maybe I haven't thought it through, but I did spend a lot of time on balancing this
13:44:12 <elynn> let's forget about cluster_heal, just cluster_recover.
13:44:21 <Qiming> I was even creating a table on this ...
13:44:38 <Qiming> any url to paste a pic?
13:44:38 <yanyanhu> understand your intention. just if cluster scaling doesn't change the cluster's desired_capacity, that is confusing imho
13:44:46 <elynn> SWOT?
13:45:04 <Qiming> url
13:45:19 <Qiming> cluster scaling does change the cluster's desired_capacity
13:45:41 <Qiming> say if you have a cluster: desired=5, current=4
13:45:48 <yanyanhu> yes
13:45:49 <Qiming> and you do scale-out by 2
13:46:13 <yanyanhu> but if I scale out by 1, the new desired will still be 5?
13:46:16 <Qiming> the current explanation of that request is: users know there are 4 nodes in the cluster, he/she wants to add 2
13:46:39 <elynn> then I get d=7, c=7
13:46:41 <Qiming> so the desired_capacity is changed to 6
13:46:53 <Qiming> and we create 2 nodes as the user requested
13:46:59 <yanyanhu> if the desired is calculated with the real size, then this scaling will actually become "recover"
13:47:18 <Qiming> it is not recover
13:47:28 <Qiming> please read the recover logic
13:47:30 <yanyanhu> I mean for the scale out by 1 case
13:47:41 <Qiming> it heals those nodes that are created
13:47:55 <yanyanhu> Qiming, ah, yes, sorry I used the wrong term
13:47:55 <Qiming> it doesn't create new nodes
13:48:01 <yanyanhu> maybe cluster heal as ethan mentioned
13:48:10 <Qiming> give me a url to paste a pic pls
13:48:19 <yanyanhu> which will try to converge the cluster's real size to the desired one
13:48:21 <elynn> why can't cluster recover create nodes?
13:48:30 <Qiming> it is a limitation
13:48:35 <yanyanhu> elynn, currently, recover means recovering a failed node
13:48:40 <Qiming> we can improve it to do that
13:48:44 <yanyanhu> e.g. through recreating or rebuilding
13:48:59 <elynn> To me cluster recover should bring the cluster back to health, from its words...
13:49:08 <yanyanhu> elynn, +1
13:49:25 <yanyanhu> just we haven't made it support creating nodes
13:49:36 <yanyanhu> maybe we can improve it as Qiming said
13:49:49 <Qiming> http://picpaste.com/001-tvkIE5Aw.jpg
13:49:51 <elynn> I'm just talking about the logic here... not the implementation...
13:50:26 <Qiming> yes, go ahead and think about changing the desired_capacity based on the current desired_capacity then
13:50:34 <Qiming> see if you can fill the table with correct operations
13:50:59 <Qiming> in that picture, min_size is 1, desired is 2, max is 5
13:51:19 <Qiming> the first row contains the (active)/(total) nodes you have in a cluster
13:51:28 <Qiming> then you get a request from the user
13:51:53 <Qiming> tell me what you will do to keep the desired capacity a reality, or even near that number
13:52:36 <Qiming> so ... I was really frustrated at chasing desired_capacity in all these operations
13:52:46 <yanyanhu> Qiming, I think users will understand that the real size of the cluster could always be different from their desired
13:52:55 <Qiming> we should really let users know ... you have your cluster's status accurately reported
13:53:03 <yanyanhu> for some reasons
13:53:08 <elynn> let me go through this table, it takes some time...
13:53:18 <Qiming> you make your decisions based on the real size, not the imaginative (desired) capacity
13:53:30 <yanyanhu> but once that happens, they need to recover their cluster to match the real size to the desired
13:53:43 <Qiming> that is an ideal case, senlin will do its best, but there will be no guarantee
13:53:57 <yanyanhu> that's why we call that operation "recover" or "heal"
13:54:00 <Qiming> yanyanhu, what if recover fails
13:54:07 <Qiming> I mean, fails in the middle
13:54:20 <yanyanhu> that could happen
13:54:22 <Qiming> we cannot hide that
13:54:29 <yanyanhu> and it just means recovery failed
13:54:36 <Qiming> the recover operation still cannot solve this problem
13:54:36 <yanyanhu> and the user can try it again later maybe
13:54:48 <elynn> Qiming: let the cluster show warning status?
13:54:52 <Qiming> then why do we pretend we can achieve the desired capacity at all?
13:55:16 <elynn> And stop there..
13:55:21 <yanyanhu> Qiming, yes, but I think no one can ensure that users can always get what they want, right
13:55:29 <Qiming> yes, the recovery operation fails, what's the cluster's status?
13:55:39 <Qiming> so the logic is really simple
13:55:45 <yanyanhu> warning I think
13:55:46 <Qiming> expose the current capacity to users
13:56:00 <Qiming> let them make their decisions based on the real capacity
13:56:01 <elynn> for example, d=5, c=4, scale_out=2
13:56:03 <Qiming> not the desired capacity
13:56:11 <elynn> recover failed
13:56:22 <elynn> d=7, c=4
13:56:28 <Qiming> the desired capacity has been proved to be a failure if you have 4 nodes created for a d=5 cluster
13:56:44 <elynn> it's a warning status I think?
13:56:53 <Qiming> how do you explain scale_out=2 ?
13:57:05 <Qiming> the user means he wants to create 7 nodes?
13:57:06 <Qiming> why?
13:57:13 <ruijie> Qiming, maybe each action should just do what it should? and let other actions or policies keep real_capacity=desired
13:57:17 <yanyanhu> scale_out=2 means the desired_capacity will increase by 2
13:57:26 <elynn> the user wants to scale_out 2 nodes, he totally wants 7 nodes here...
13:57:46 <yanyanhu> we even can't guarantee the 2 new nodes will be created correctly
13:57:51 <Qiming> elynn, users already saw only 4 nodes in the cluster
13:58:09 <elynn> If he only wants current nodes +2, then he should do recover first and then scale_out I think...
13:58:09 <Qiming> alright, take a step back
13:58:16 <Qiming> think about this
13:58:32 <Qiming> you have d=3, c=2, then ceilometer triggers an autoscaling
13:58:43 <Qiming> skip the user intervention for now
13:58:47 <elynn> That's another scenario....
13:58:47 <yanyanhu> elynn, that is also what I'm thinking actually
13:58:51 <Qiming> what was that decision based on?
13:59:02 <elynn> that's based on actual nodes...
13:59:04 <Qiming> why shouldn't we keep this consistent?
13:59:18 <yanyanhu> Qiming, if the user leaves the decision to ceilometer, I think we can treat ceilometer as the user
13:59:20 <Qiming> sigh ...
14:00:01 <elynn> that's the confusing part...
14:00:08 <Qiming> pls go through that table
14:00:09 <yanyanhu> umm, time is over... maybe keep on discussing it in the senlin channel?
14:00:17 <Qiming> you will realize what a problem we are facing
14:00:41 <yanyanhu> Qiming, I see. Will think about it and have more discussion on it
14:00:58 <yanyanhu> will end meeting
14:01:01 <yanyanhu> #endmeeting