#openstack-meeting log

13:00:06 <ruijie_> #startmeeting senlin
13:00:07 <openstack> Meeting started Tue Dec  5 13:00:06 2017 UTC and is due to finish in 60 minutes.  The chair is ruijie_. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:10 <openstack> The meeting name has been set to 'senlin'
13:00:27 <XueFeng> hi ruijie_
13:00:38 <ruijie_> hi all, this is the agenda, please feel free to add topics: https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Agenda_.282017-12-05_1300_UTC.29
13:00:42 <ruijie_> hi XueFeng
13:02:39 <ruijie_> let's wait for a while
13:02:48 <elynn> hi ruijie_
13:03:00 <Qiming> hi
13:03:01 <ruijie_> hi elynn :)
13:03:03 <chenybd_> hi
13:03:05 <ruijie_> Qiming
13:03:15 <XueFeng> hi,all
13:03:19 <ruijie_> woo
13:03:44 <ruijie_> let's get started :)
13:03:57 <ruijie_> https://review.openstack.org/#/c/523965/
13:04:23 <ruijie_> the first one is the lifecycle hook for scale-in action
13:05:19 <ruijie_> this means we will set a temporary status for the scale-in action
13:06:21 <ruijie_> and a greenthread will wait for the change of the status
13:06:40 <ruijie_> so that the server or user has time to terminate their service on the server before we destroy it
13:09:09 <Qiming> spawn_after means creating a new greenthread after given time interval?
13:09:46 <ruijie_> emm, I think he wants to create a thread in the policy to wait until the TIMEOUT happens
13:10:09 <Qiming> he meant to invoke spawn_after
13:10:33 <ruijie_> this thread will go ahead to destroy the server if we receive a message/request directly, or we wait for the TIMEOUT and then destroy it
13:10:39 <Qiming> I'm not sure a thread is actually created, but ... anyway
13:10:54 <Qiming> it is a confirmation
13:11:02 <ruijie_> yes Qiming
13:11:09 <Qiming> WAITING_LIFECYCLE_COMPLETION
13:11:23 <Qiming> it makes sense
13:11:31 <Qiming> but there are some corner cases to check
13:11:53 <Qiming> suppose we have a cluster-scale-in request arriving, with timeout set to 180 seconds
13:12:17 <Qiming> then we cannot wait for the new confirmation for more than 30 seconds
13:12:55 <Qiming> also, in current implementation, policies are checked DURING action execution, not BEFORE action execution
13:13:33 <Qiming> when CLUSTER_SCALE_IN action is executed, senlin will check the policies attached to the cluster
13:13:44 <Qiming> that action will lock the cluster for operation
13:14:21 <Qiming> for cluster status to be consistent, we cannot release that lock
13:14:44 <Qiming> we have to wait for the WAITING_LIFECYCLE_COMPLETION signal to arrive, while holding the lock
13:15:11 <ruijie_> yes Qiming, the cluster should be locked during this period
13:15:50 <elynn> That might block the whole cluster operation?
13:15:54 <ruijie_> maybe pointing at node actions makes sense: cluster scale-in --> node-delete(pre_check the policy, create threadGroup/thread to process, and release current thread)
13:16:01 <Qiming> yes, elynn
13:17:09 <Qiming> for this particular use case, I'm wondering if we can extend the deletion policy
13:17:30 <ruijie_> once we receive a message/request, we only update the node status
13:17:49 <Qiming> say add a property to the deletion policy:
13:17:50 <ruijie_> action's status actually
13:17:54 <Qiming> notify:
13:18:00 <Qiming> type: zaqar
13:18:18 <Qiming> sink: wherever zaqar is
13:18:44 <Qiming> need_confirm: true
13:18:51 <Qiming> timeout: 120
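The properties listed above could be written out as follows. This is only a sketch: the nesting of the keys under `notify`, the name `validate_notify`, and the unit of `timeout` are assumptions, since the discussion only names the keys and sample values.

```python
# Hypothetical layout of the proposed properties; the nesting under "notify"
# is an assumption, the discussion only lists the keys.
NOTIFY_EXAMPLE = {
    "notify": {
        "type": "zaqar",
        "sink": "wherever zaqar is",  # placeholder kept from the discussion
        "need_confirm": True,
        "timeout": 120,  # seconds (assumed unit)
    }
}


def validate_notify(spec):
    """Return a list of problems found in the hypothetical 'notify' section."""
    notify = spec.get("notify", {})
    problems = []
    if notify.get("type") != "zaqar":
        problems.append("unsupported notify type")
    if not isinstance(notify.get("need_confirm"), bool):
        problems.append("need_confirm must be a boolean")
    timeout = notify.get("timeout")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(timeout, int) or isinstance(timeout, bool) or timeout <= 0:
        problems.append("timeout must be a positive integer")
    return problems
```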
13:19:24 <Qiming> we continue the cluster-scale-in or cluster-resize or node-delete operation
13:20:05 <Qiming> just the node will become an orphan node
13:20:38 <ruijie_> or reject the deletion?
13:20:50 <Qiming> actually node DESTROY happens either 1) timeout waiting for confirm, or 2) a confirm message is received via the proposed new action
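The "destroy on confirm or on timeout" rule can be sketched with a plain `threading.Event`. This stands in for whatever greenthread mechanism Senlin would actually use; `wait_then_destroy` and its parameters are hypothetical names chosen for illustration.

```python
import threading


def wait_then_destroy(confirmed, timeout, destroy):
    """Block until a confirmation arrives or the timeout expires, then destroy.

    confirmed: threading.Event set when the confirm message is received
    timeout:   grace period in seconds
    destroy:   callable that actually destroys the node
    """
    # Event.wait returns True if the event was set, False if it timed out.
    reason = "confirmed" if confirmed.wait(timeout) else "timeout"
    # Either way the node is destroyed; only the trigger differs.
    destroy()
    return reason
```

In both branches the node ends up destroyed, which matches the point that the confirmation only shortens the wait and never cancels the deletion.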
13:21:00 <elynn> So this node was actually removed from a cluster instead of deleted?
13:21:22 <Qiming> yes ...
13:21:41 <elynn> How can senlin receive a confirm message?
13:21:41 <Qiming> no matter whether we get a notification or not, the node will be removed
13:21:45 <elynn> from receiver?
13:22:07 <Qiming> the use case is actually only about "postponing the node deletion operation"
13:22:27 <Qiming> kind of a new action as proposed in the spec
13:23:34 <elynn> Also we might need a new node status.
13:23:49 <Qiming> yes, that is fine
13:24:07 <Qiming> this is a very real use case
13:24:30 <Qiming> just trying to identify some mismatches between the proposed approach and senlin's current workflow
13:25:05 <Qiming> we cannot do policy check before locking a cluster
13:25:11 <Qiming> there are race conditions
13:25:37 <ruijie_> the advantage for a new policy is that we can extend it to support cluster-scale-out, to ask permission from supervisors?
13:26:05 <Qiming> say senlin noticed a "lifecycle" policy attached, so it decides to do something, then before senlin actually started doing the job, the policy may get detached ...
13:26:36 <Qiming> that is true, ruijie_
13:26:55 <Qiming> still, we have to check policy with cluster locked
13:28:06 <Qiming> and it will introduce more concurrency issues if we lock a cluster twice: one lock for policy checking, another for actual cluster operation
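The race described here is a check-then-act problem. A minimal sketch, assuming an in-process `threading.Lock` in place of Senlin's real cluster lock and a hypothetical `Cluster` class, shows the policy check and the operation sharing one lock acquisition:

```python
import threading


class Cluster:
    """Hypothetical stand-in for a cluster with attached policies."""

    def __init__(self):
        self.lock = threading.Lock()
        self.policies = set()


def detach(cluster, name):
    # Detaching takes the same lock, so it cannot interleave with scale_in.
    with cluster.lock:
        cluster.policies.discard(name)


def scale_in(cluster, on_lifecycle):
    # The policy check and the operation happen under a single lock
    # acquisition, so a detach cannot slip in between check and action.
    with cluster.lock:
        if "lifecycle" in cluster.policies:
            on_lifecycle()
            return "scaled in with hook"
        return "scaled in"
```

Checking the policy in one locked section and acting in a second one would reopen the window in which the policy can be detached.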
13:29:04 <ruijie_> emm, how about: cluster-scale-in --> lock the cluster --> node actions --> check the policy --> hold here .. we only focus on the node level
13:29:38 <Qiming> yes, for scaling in operation, that is what I was talking about
13:30:05 <Qiming> to be more specific, the policy is a new deletion policy
13:30:44 <Qiming> that policy will grant users sufficient time to drain a node
13:31:17 <Qiming> the question is ... when are we releasing the cluster lock
13:32:05 <Qiming> once the required number of nodes are removed from the cluster, we treat the cluster scale-in operation as completed
13:32:34 <Qiming> however, the deletion policy leaves a thread waiting for further confirmation before destroying the node
13:32:55 <Qiming> and it is just an extension of the grace period
13:33:02 <Qiming> make sense?
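The flow just described (remove the node under the cluster lock, release the lock, then let a background thread wait out the grace period before destroying the node) might look like the sketch below. It uses `threading` rather than Senlin's greenthreads, and every name in it is hypothetical.

```python
import threading


def scale_in(cluster_lock, members, node, confirmed, timeout, destroyed):
    """Remove `node` from `members` under the lock, then destroy it later."""
    with cluster_lock:
        # The cluster-level operation is complete once the node is removed.
        members.remove(node)

    # The lock is released here; the grace period runs in the background.
    def drain_then_destroy():
        confirmed.wait(timeout)  # returns early if a confirmation arrives
        destroyed.append(node)   # stands in for the real node destroy

    worker = threading.Thread(target=drain_then_destroy)
    worker.start()
    return worker
```

The cluster lock is held only while membership changes, so other cluster operations are not blocked while the node drains.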
13:34:07 <ruijie_> looks so.. rejection might be reasonable?
13:34:22 <Qiming> rejection?
13:34:41 <ruijie_> you are not able to destroy the node currently, please try again
13:35:04 <Qiming> no, cluster-scale-in operation was successful
13:35:21 <Qiming> number of active/working nodes reduced
13:35:35 <Qiming> just there are some nodes not yet cleanly destroyed
13:36:08 <Qiming> leaving nodes there untouched has been treated as a valid situation before
13:36:26 <Qiming> we have "destroy_after_deletion" property in deletion policy 1.0
13:37:09 <Qiming> what we really need is some new fields that prolong the waiting period and wait for a confirmation before finally destroying the node
13:38:03 <Qiming> I'm also a little bit cautious about introducing new policies
13:38:43 <Qiming> one principle we had in senlin is that all policies are independent from each other
13:39:08 <Qiming> they can be used separately, and they can be combined freely
13:39:31 <Qiming> all policies have builtin priorities to make sure all possible combinations make sense
13:40:19 <Qiming> the new lifecycle policy, if permitted, will conflict with deletion policy, I'm afraid
13:40:49 <Qiming> although, I do see how careful the author has been when drafting the spec
13:41:30 <Qiming> he/she has considered actions to handle, alternatives to the proposed approach etc etc
13:41:36 <ruijie_> exactly, it looks like an extension of the deletion policy, but useful for some cases
13:41:38 <Qiming> it is a great spec
13:42:49 <Qiming> yet another possibility ... if I failed to convince you that an extension to the deletion policy is okay for the use case
13:43:01 <Qiming> we don't call this a policy
13:43:13 <Qiming> a lifecycle hook is a hook
13:43:38 <Qiming> it is a precheck and postcheck for cluster level operations
13:43:55 <Qiming> we don't lock the cluster for these hooks
13:44:28 <Qiming> we clearly document what can happen when user application receives a notification from senlin
13:46:43 <ruijie_> but other actions could be triggered during this period if we do not lock the cluster/node
13:47:02 <Qiming> right, that is the dilemma
13:47:42 <Qiming> in fact, we don't have a lot of choices
13:48:15 <Qiming> one option is a synchronous call, senlin doesn't move forward until we get a response
13:48:34 <Qiming> another option is asynchronous notification
13:49:05 <Qiming> sync calls to outside world, with cluster locks held, is dangerous, imo
13:49:51 <ruijie_> but it meets our action flow..
13:51:15 <Qiming> exactly
13:52:15 <XueFeng> A little time for another two topics:)
13:52:28 <ruijie_> I see, that is a problem .., sync == hold cluster lock, async == release lock for current implementation
13:53:06 <Qiming> ya, don't do async calls with locks held ... ;)
13:53:49 <ruijie_> ha sorry, we do not have enough time for the rest of the topics, so please leave comments for this patch :)
13:54:18 <Qiming> yep. please jump in and leave comments
13:54:38 <Qiming> btw, ruijie_ , we are about to release q-2 this week
13:54:42 <ruijie_> let's move to senlin channel? we only have 6 minutes left
13:54:46 <ruijie_> yes Qiming
13:54:51 <ruijie_> I am working on it today
13:54:54 <Qiming> release team is not sending notifications?
13:55:00 <Qiming> great!
13:55:51 <XueFeng> great man, ruijie_
13:55:58 <ruijie_> I am going to release the channel, we have another 2 topics to discuss in #senlin
13:56:09 <Qiming> ok
13:56:17 <ruijie_> thanks for joining :)
13:56:28 <ruijie_> #endmeeting