13:00:06 <ruijie_> #startmeeting senlin
13:00:07 <openstack> Meeting started Tue Dec 5 13:00:06 2017 UTC and is due to finish in 60 minutes. The chair is ruijie_. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:10 <openstack> The meeting name has been set to 'senlin'
13:00:27 <XueFeng> hi ruijie_
13:00:38 <ruijie_> hi all, this is the agenda, please feel free to add topics: https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Agenda_.282017-12-05_1300_UTC.29
13:00:42 <ruijie_> hi XueFeng
13:02:39 <ruijie_> let's wait for a while
13:02:48 <elynn> hi ruijie_
13:03:00 <Qiming> hi
13:03:01 <ruijie_> hi elynn :)
13:03:03 <chenybd_> hi
13:03:05 <ruijie_> Qiming
13:03:15 <XueFeng> hi, all
13:03:19 <ruijie_> woo
13:03:44 <ruijie_> let's get started :)
13:03:57 <ruijie_> https://review.openstack.org/#/c/523965/
13:04:23 <ruijie_> the first one is the lifecycle hook for the scale-in action
13:05:19 <ruijie_> this means we will set a temporary status for the scale-in action
13:06:21 <ruijie_> and a greenthread will wait for the status to change
13:06:40 <ruijie_> so that the server or user has time to terminate their services on the server before we destroy it
13:09:09 <Qiming> spawn_after means creating a new greenthread after a given time interval?
13:09:46 <ruijie_> emm, I think he wants to create a thread in the policy to wait for the TIMEOUT to happen
13:10:09 <Qiming> he meant to invoke spawn_after
13:10:33 <ruijie_> this thread will go ahead and destroy the server if we receive a message/request directly, or we wait for the TIMEOUT and then destroy it
13:10:39 <Qiming> I'm not sure a thread is actually created, but ... anyway
13:10:54 <Qiming> it is a confirmation
13:11:02 <ruijie_> yes Qiming
13:11:09 <Qiming> WAITING_LIFECYCLE_COMPLETION
13:11:23 <Qiming> it makes sense
13:11:31 <Qiming> but there are some corner cases to check
13:11:53 <Qiming> suppose we have a cluster-scale-in request arriving, with timeout set to 180 seconds
13:12:17 <Qiming> then we cannot wait for the new confirmation for more than 30 seconds
13:12:55 <Qiming> also, in the current implementation, policies are checked DURING action execution, not BEFORE action execution
13:13:33 <Qiming> when a CLUSTER_SCALE_IN action is executed, senlin will check the policies attached to the cluster
13:13:44 <Qiming> that action will lock the cluster for the operation
13:14:21 <Qiming> for the cluster status to be consistent, we cannot release that lock
13:14:44 <Qiming> we have to wait for the WAITING_LIFECYCLE_COMPLETION signal to arrive, while holding the lock
13:15:11 <ruijie_> yes Qiming, the cluster should be locked during this period
13:15:50 <elynn> That might block the whole cluster operation?
13:15:54 <ruijie_> maybe pointing at node actions makes sense: cluster-scale-in --> node-delete (pre-check the policy, create a threadGroup/thread to process it, and release the current thread)
13:16:01 <Qiming> yes, elynn
13:17:09 <Qiming> for this particular use case, I'm wondering if we can extend the deletion policy
13:17:30 <ruijie_> once we receive a message/request, we only update the node status
13:17:49 <Qiming> say we add a property to the deletion policy:
13:17:50 <ruijie_> the action's status actually
13:17:54 <Qiming> notify:
13:18:00 <Qiming> type: zaqar
13:18:18 <Qiming> sink: wherever zaqar is
13:18:44 <Qiming> need_confirm: true
13:18:51 <Qiming> timeout: 120
13:19:24 <Qiming> we continue the cluster-scale-in or cluster-resize or node-delete operation
13:20:05 <Qiming> just the node will become an orphan node
13:20:38 <ruijie_> or reject the deletion?
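[Editor's note: Qiming's proposed extension might look roughly like the sketch below. Only the notify block with its type, sink, need_confirm, and timeout properties comes from the discussion above; the surrounding spec structure and the other property values are assumed from Senlin's deletion policy v1.0 format, and the version number is hypothetical.]

```yaml
# Sketch only -- NOT a released Senlin policy format.
type: senlin.policy.deletion
version: 1.1                 # assumed version bump for the new property
properties:
  criteria: OLDEST_FIRST
  destroy_after_deletion: true   # existing v1.0 property mentioned later in the meeting
  grace_period: 60
  # Proposed addition from the discussion above:
  notify:
    type: zaqar              # notification backend
    sink: <zaqar queue>      # "wherever zaqar is"
    need_confirm: true       # wait for a confirmation before destroying the node
    timeout: 120             # seconds to wait before destroying it anyway
```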
13:20:50 <Qiming> actually node DESTROY happens either 1) on timeout waiting for the confirm, or 2) when a confirm message is received via the proposed new action
13:21:00 <elynn> So this node was actually removed from the cluster instead of deleted?
13:21:22 <Qiming> yes ...
13:21:41 <elynn> How can senlin receive a confirm message?
13:21:41 <Qiming> no matter whether we are getting a notification or not, the node will be removed
13:21:45 <elynn> from a receiver?
13:22:07 <Qiming> the use case is actually only about "postponing the node deletion operation"
13:22:27 <Qiming> kind of a new action as proposed in the spec
13:23:34 <elynn> Also we might need a new node status.
13:23:49 <Qiming> yes, that is fine
13:24:07 <Qiming> this is a very real use case
13:24:30 <Qiming> just trying to identify some mismatches between the proposed approach and senlin's current workflow
13:25:05 <Qiming> we cannot do a policy check before locking a cluster
13:25:11 <Qiming> there are race conditions
13:25:37 <ruijie_> the advantage of a new policy is that we can extend it to support cluster-scale-out, to ask permission from supervisors?
13:26:05 <Qiming> say senlin noticed a "lifecycle" policy attached, so it decides to do something, then before senlin actually starts doing the job, the policy may get detached ...
13:26:36 <Qiming> that is true, ruijie_
13:26:55 <Qiming> still, we have to check the policy with the cluster locked
13:28:06 <Qiming> and it will introduce more concurrency issues if we lock a cluster twice: one lock for policy checking, another for the actual cluster operation
13:29:04 <ruijie_> emm, how about: cluster-scale-in --> lock the cluster --> node actions --> check the policy --> hold here .. we only focus on the node level
13:29:38 <Qiming> yes, for the scale-in operation, that is what I was talking about
13:30:05 <Qiming> to be more specific, the policy is a new deletion policy
13:30:44 <Qiming> that policy will grant users sufficient time to drain a node
13:31:17 <Qiming> the question is ... when are we releasing the cluster lock
13:32:05 <Qiming> once the required number of nodes are removed from the cluster, we treat the cluster scale-in operation as completed
13:32:34 <Qiming> however, the deletion policy leaves a thread waiting for further confirmation before destroying the node
13:32:55 <Qiming> and it is just an extension of the grace period
13:33:02 <Qiming> make sense?
13:34:07 <ruijie_> looks so.. rejection might be reasonable?
13:34:22 <Qiming> rejection?
13:34:41 <ruijie_> "you are not able to destroy the node currently, please try again"
13:35:04 <Qiming> no, the cluster-scale-in operation was successful
13:35:21 <Qiming> the number of active/working nodes was reduced
13:35:35 <Qiming> just there are some nodes not yet cleanly destroyed
13:36:08 <Qiming> leaving nodes there untouched has been treated as a valid situation before
13:36:26 <Qiming> we have a "destroy_after_deletion" property in deletion policy 1.0
13:37:09 <Qiming> what we really need is some new fields that prolong the waiting period, waiting for a confirmation before finally destroying the node
13:38:03 <Qiming> I'm also a little bit cautious about introducing new policies
13:38:43 <Qiming> one principle we had in senlin is that all policies are independent from each other
13:39:08 <Qiming> they can be used separately, and they can be combined freely
13:39:31 <Qiming> all policies have builtin priorities to make sure all possible combinations make sense
13:40:19 <Qiming> the new lifecycle policy, if permitted, will conflict with the deletion policy, I'm afraid
13:40:49 <Qiming> although, I do see how careful the author has been when drafting the spec
13:41:30 <Qiming> he/she has considered actions to handle, alternatives to the proposed approach, etc.
13:41:36 <ruijie_> exactly, it looks like an extension of the deletion policy, but useful for some cases
13:41:38 <Qiming> it is a great spec
13:42:49 <Qiming> yet another possibility ... if I failed to convince you that an extension to the deletion policy is okay for the use case
13:43:01 <Qiming> we don't call this a policy
13:43:13 <Qiming> a lifecycle hook is a hook
13:43:38 <Qiming> it is a precheck and postcheck for cluster-level operations
13:43:55 <Qiming> we don't lock the cluster for these hooks
13:44:28 <Qiming> we clearly document what can happen when a user application receives a notification from senlin
13:46:43 <ruijie_> but other actions could be triggered during this period if we do not lock the cluster/node
13:47:02 <Qiming> right, that is the dilemma
13:47:42 <Qiming> in fact, we don't have a lot of choices
13:48:15 <Qiming> one option is a synchronous call; senlin doesn't move forward until we get a response
13:48:34 <Qiming> another option is an asynchronous notification
13:49:05 <Qiming> sync calls to the outside world, with cluster locks held, are dangerous, imo
13:49:51 <ruijie_> but it matches our action flow..
13:51:15 <Qiming> exactly
13:52:15 <XueFeng> A little time for the other two topics :)
13:52:28 <ruijie_> I see, that is a problem .., sync == hold the cluster lock, async == release the lock in the current implementation
13:53:06 <Qiming> ya, don't do async calls with locks held ... ;)
13:53:49 <ruijie_> ha sorry, we do not have enough time for the rest of the topics, so please leave comments on this patch :)
13:54:18 <Qiming> yep, please jump in and leave comments
13:54:38 <Qiming> btw, ruijie_, we are about to release q-2 this week
13:54:42 <ruijie_> let's move to the senlin channel? we only have 6 minutes left
13:54:46 <ruijie_> yes Qiming
13:54:51 <ruijie_> I am working on it today
13:54:54 <Qiming> the release team is not sending notifications?
13:55:00 <Qiming> great!
13:55:51 <XueFeng> great man, ruijie_
13:55:58 <ruijie_> I am going to release the channel, we have another 2 topics we need to discuss in #senlin
13:56:09 <Qiming> ok
13:56:17 <ruijie_> thanks for joining :)
13:56:28 <ruijie_> #endmeeting
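[Editor's note: the "destroy on timeout OR on confirm, whichever comes first" mechanism debated above can be sketched with stdlib threading primitives. Senlin itself would use an eventlet greenthread (spawn_after is mentioned in the meeting); `wait_then_destroy`, `node_id`, and the destroy callback below are hypothetical stand-ins, not Senlin APIs.]

```python
import threading

def wait_then_destroy(node_id, confirm_event, timeout, destroy):
    """Destroy a node when a lifecycle-completion confirmation arrives,
    or once the timeout expires -- whichever happens first.

    This mirrors the two triggers from the discussion: 1) timeout
    waiting for the confirm, or 2) a confirm message is received.
    """
    confirmed = confirm_event.wait(timeout)  # True if signalled in time
    destroy(node_id)
    return 'confirmed' if confirmed else 'timed_out'

# Usage sketch: the confirm signal would come from the proposed new
# API action; here a timer thread simulates it arriving after 0.1s.
destroyed = []
confirm = threading.Event()
threading.Timer(0.1, confirm.set).start()
result = wait_then_destroy('node-1', confirm, 5.0, destroyed.append)

# Nobody confirms this one, so the grace period expires instead.
expired = wait_then_destroy('node-2', threading.Event(), 0.2, destroyed.append)
```

Either way the node ends up destroyed; the confirmation only shortens (and the timeout bounds) the grace period, which matches Qiming's point that this is "just an extension of the grace period".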