13:00:25 #startmeeting senlin
13:00:26 Meeting started Tue Aug 1 13:00:25 2017 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:29 The meeting name has been set to 'senlin'
13:00:41 evening
13:00:52 Hi
13:00:53 hi Qiming
13:01:05 hi, chohoor, ruijie
13:01:55 evening, elynn
13:02:02 Hi Qiming
13:02:38 I don't have a specific agenda for this meeting
13:02:53 How is your princess? Saw your pics in WeChat.
13:02:55 if you have topics, feel free to add them here: https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:03:07 veeeeeeeery naughty
13:03:13 uncontrollable
13:03:21 I don't, just want to ask about my patch.
13:03:52 haha, your gene is in it
13:03:57 that big one
13:03:59 I've got questions about the zone placement policy & health manager ~
13:04:04 about https://review.openstack.org/#/c/467108/
13:04:06 +677, -343
13:04:21 Would any of you be able to review and test it....
13:04:41 have you tested it?
13:04:50 I tested some basic functions: creating ports, creating floating IPs; at least it works fine
13:05:10 great
13:05:16 basic update.
13:06:22 I will add you all to review this patch later :)
13:06:34 that is fair
13:06:49 When is the RC?
13:07:18 I want to discuss some details about fast scaling and standby nodes.
13:07:28 Also I'm working on k8s this week; hopefully I can get a running k8s this week with kubeadm.
13:07:31 chohoor, later
13:07:44 RC1 is next week
13:08:05 That's all from my side :)
13:08:12 okay, thanks
13:08:27 final RC is Aug 28 - Sep 01
13:08:50 oh, no, Aug 21 - Aug 25
13:08:58 Aug 28 is the final release
13:09:13 okay, I see.
13:09:37 okay, let's switch to chohoor's proposal
13:09:43 fast scaling of clusters
13:09:46 it is a great idea
13:09:58 although there are some details to be discussed
13:10:25 I have written a spec in the doc.
13:10:40 I've read it
13:11:49 In the TODO list, the number of standby nodes is max_size - desired_capacity, but I think the max_size could be too big....
13:11:52 say I have a cluster created: min=10, max=20
13:12:09 how many standby nodes should we create at the very beginning?
13:13:06 I think it should be specified by the user.
13:13:16 right, that was my point
13:13:28 but less than max_size
13:14:02 maybe we can put that constraint aside
13:14:22 let's focus on the user's experience
13:14:45 if I'm a user, am I supposed to be aware of how many standby nodes there are?
13:14:47 that's ok
13:15:34 can we share standby nodes among two or more clusters?
13:15:37 yes, I think the user should know that.
13:16:00 No, the profile is different.
13:16:08 profile can be the same
13:16:21 the properties are very different
13:16:24 say I have created two clusters from the same profile
13:16:28 like the disk type, network, etc.
13:16:49 right, they could be ...
13:17:18 maybe each cluster should have its own standby nodes.
13:17:52 okay ... if we agree that standby nodes are still members of a cluster
13:18:13 and they are visible to the cluster's owner
13:18:37 are you gonna charge them for these standby nodes?
13:19:40 yes, we need to make sure standby nodes are in active state.
13:20:21 that would lead us to another question: if the user is consuming additional resources, why don't we charge them?
13:21:13 if we charge them, they would be confused at least ... because they are not "running" workloads on those standby nodes
13:22:56 and if they cost extra money, why not just create extra nodes ..
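The sizing question raised above (a user-specified standby count, bounded by the headroom between desired_capacity and max_size) can be summarized in a short sketch. This is illustrative Python, not Senlin code: the function name initial_standby_count and its parameters are hypothetical, and treating a negative max_size as "unbounded" is an assumption borrowed from Senlin's cluster max_size convention.

```python
# A minimal sketch, assuming the standby count is user-requested but
# clamped to the room left between desired_capacity and max_size.
# Names and defaults are hypothetical, not part of Senlin's API.
def initial_standby_count(requested, desired_capacity, max_size):
    """Clamp the user-requested standby count to the cluster's headroom."""
    if max_size < 0:
        # Assumption: a negative max_size means "no upper bound".
        return max(0, requested)
    headroom = max_size - desired_capacity
    return max(0, min(requested, headroom))


# The example from the discussion: min=10, max=20, 10 nodes running.
print(initial_standby_count(requested=5, desired_capacity=10, max_size=20))   # 5
print(initial_standby_count(requested=15, desired_capacity=10, max_size=20))  # 10
```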
13:23:12 yes
13:23:52 I have thought about this question; the standby nodes are for standby, but if a node in the cluster goes into error, we could replace it immediately.
13:23:54 so ... one of the key questions to answer is the definition of "standby"
13:25:35 if you allocate 10 nodes, all active, but only treat 8 of them as members of your cluster, the other 2 are only for "standby"
13:25:46 then why don't you run workloads on them as well?
13:27:38 you are right. I'll continue to think about it.
13:28:41 one baby step is for this case
13:28:48 suppose I have 10 nodes running now
13:29:09 my cluster has min=5, max=15
13:29:53 for whatever reason, I want to scale it in by 2 nodes now; that means I'm setting the desired_capacity to 8
13:30:06 removing 2 nodes from the cluster
13:30:30 we can lazily remove those nodes
13:30:57 if later on I want to scale the cluster back to 10 nodes, I can reuse them
13:31:21 sounds like the idea is very similar to yours
13:31:29 move the nodes to standby first, then really delete them?
13:31:56 Qiming, you mean keep them for a while, but they will still be removed after a user-defined period?
13:31:56 let the cluster owner decide how long we will keep those nodes
13:32:21 ok
13:32:30 because we have the node-delete API, users can still delete them at will at any time
13:33:22 this use case is less risky because we are sure the nodes are homogeneous
13:33:42 we are sure the nodes are configured the same way
13:34:01 we are sure there won't be data leaked from one user to another
13:35:54 if there are such extra nodes, we may want to stop them instead of deleting them
13:36:56 that leads us back to the definition of "standby" nodes
13:38:13 I'll rewrite the spec; please give me more suggestions in IRC later. thank you.
13:38:21 my pleasure
13:38:39 just want to be sure that we have thought it through before writing the code
13:39:04 sure
13:40:16 is it my turn now :)
13:40:17 anything else?
13:40:24 shoot
13:40:36 and another question, about protecting a specific node from being deleted.
13:40:38 yes Qiming, the first one is this patch: https://review.openstack.org/#/c/481670/
13:41:07 your turn.
13:41:20 the change will rerun the actions whose action.status is READY
13:41:35 that is caused by failing to acquire the lock
13:42:00 yes
13:42:20 but it will try to process the action again and again ..
13:42:47 the cluster is still locked during this period .. that will create a lot of events?
13:42:47 what do you mean by "again and again"
13:42:56 what is the interval?
13:43:02 0 sec
13:43:16 so .. that is the problem
13:43:23 yes Qiming
13:43:50 or we should add a global option lock_max_retries
13:44:23 and maybe a lock_retry_interval ..
13:44:31 :)
13:44:34 maybe both
13:45:15 the logic is not incorrect, it's just ... too frequent, too noisy
13:46:20 chohoor, you want to "lock" a node?
13:46:20 improvement is needed :)
13:46:40 yes
13:47:00 use case?
13:47:00 A node that should not be deleted.
13:47:39 maybe this is a special node from the user's point of view...
13:48:01 sounds like a use case for node roles
13:49:33 In Tencent cloud, a user can add a normal instance to a cluster as a node.
13:49:53 and the code could be protected from deletion.
13:50:06 s/code/node/
13:50:32 like node adoption
13:50:32 who is deleting it then? if the user doesn't want to delete it?
13:51:19 auto scaling, because the candidate selection could be random
13:52:03 or the oldest-first policy.
13:52:04 I see
13:52:54 So by 'protect' here you just mean cluster scaling actions won't delete it, but this node can still be deleted by the 'node delete' command?
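On the lock-retry discussion above (patch 481670): a minimal sketch of what the proposed lock_max_retries and lock_retry_interval options could look like, assuming they would be registered via oslo.config as in other OpenStack services. The option defaults and the acquire_with_retry helper are illustrative assumptions, not Senlin's actual engine code.

```python
# A minimal sketch, assuming oslo.config is used to register the two
# options mentioned in the discussion. Defaults and helper names are
# hypothetical.
import time

from oslo_config import cfg

lock_opts = [
    cfg.IntOpt('lock_max_retries', default=3,
               help='Number of times to retry acquiring a cluster lock.'),
    cfg.IntOpt('lock_retry_interval', default=10,
               help='Seconds to wait between lock acquisition attempts.'),
]
cfg.CONF.register_opts(lock_opts)


def acquire_with_retry(try_acquire):
    """Retry lock acquisition with a pause instead of spinning at 0 sec."""
    for _ in range(cfg.CONF.lock_max_retries + 1):
        if try_acquire():
            return True
        time.sleep(cfg.CONF.lock_retry_interval)
    return False
```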
13:53:33 so it is gonna be some label or tag that will change some actions' behavior
13:53:45 elynn: I think so
13:53:48 yes, I guess.
13:54:30 alright, please think about whether we can use node.role for this purpose
13:54:36 a 'protect' tag will exclude it from the scaling candidate list.
13:54:45 because ... that is a more general solution
13:55:08 em ... we don't support tags yet
13:55:08 ok
13:55:11 maybe we should
13:55:20 metadata?
13:55:22 no objection to implementing tags
13:55:38 metadata is also fine
13:56:05 say 'protected: true'
13:57:03 if a node is in the protected state, what happens if the cluster/node executes an update action?
13:57:13 feel free to propose a solution then
13:57:23 Need to think about health policy; can we rebuild it?
13:57:46 I suggest not.
13:58:46 Then update should also be rejected, I guess?
13:59:13 ...
13:59:27 just don't update the protected node.
13:59:46 thanks guys
13:59:48 hi, senlin RDO has been finished
13:59:52 https://bugzilla.redhat.com/show_bug.cgi?id=1426551
13:59:52 bugzilla.redhat.com bug 1426551 in Package Review "Review Request: Senlin - is a clustering service for OpenStack" [Unspecified,Closed: errata] - Assigned to amoralej
13:59:52 we are running out of time
14:00:00 #endmeeting
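A minimal sketch of the 'protected: true' metadata idea the discussion converged on: nodes carrying the flag are skipped when building the scale-in candidate list, while an explicit node-delete would remain possible. The node dictionaries, the scale_in_candidates helper, and the oldest-first ordering are illustrative assumptions, not Senlin's implementation.

```python
# A minimal sketch, assuming 'protected: true' lives in node metadata and
# only affects the selection of scale-in victims. Structures are hypothetical.
def scale_in_candidates(nodes, count):
    """Pick up to `count` deletion candidates, skipping protected nodes."""
    deletable = [
        n for n in nodes
        if str(n.get('metadata', {}).get('protected', '')).lower() != 'true'
    ]
    # Oldest-first selection, mirroring the oldest-first deletion policy
    # mentioned in the discussion.
    deletable.sort(key=lambda n: n['created_at'])
    return deletable[:count]


nodes = [
    {'name': 'node-1', 'created_at': 1, 'metadata': {'protected': 'true'}},
    {'name': 'node-2', 'created_at': 2, 'metadata': {}},
    {'name': 'node-3', 'created_at': 3, 'metadata': {}},
]
print([n['name'] for n in scale_in_candidates(nodes, 2)])  # ['node-2', 'node-3']
```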