13:03:09 <Qiming> #startmeeting senlin
13:03:09 <openstack> Meeting started Tue Jan 26 13:03:09 2016 UTC and is due to finish in 60 minutes.  The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:03:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:03:13 <openstack> The meeting name has been set to 'senlin'
13:03:23 <cschulz> Hi
13:03:31 <Qiming> seems they forgot to endmeeting
13:03:37 <Qiming> hi, cschulz
13:03:54 <Qiming> pls review agenda here
13:03:56 <Qiming> #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:03:59 <elynn> o/
13:04:21 <Qiming> oops, two open discussions, we've got plenty of time?
13:04:38 <yanyanhu> :)
13:04:55 <Qiming> anyway, let's get started with the etherpad
13:05:01 <Qiming> #topic mitaka work items
13:05:05 <Qiming> #link https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:05:21 <Qiming> no progress on the API work as far as I know
13:05:42 <Qiming> heat specs proposed for node/policy/receiver
13:05:58 <Qiming> https://review.openstack.org/271949
13:05:58 <elynn> yes, waiting for review
13:06:27 <Qiming> not aware of any on-going work about testing
13:06:47 <Qiming> we got some bug reports and guys have helped work on them
13:07:09 <Qiming> health management, we have a dedicated timeslot for that
13:07:17 <Qiming> lixinhui_, it's yours
13:07:17 <lixinhui_> oh
13:07:24 <lixinhui_> https://blueprints.launchpad.net/senlin/+spec/support-health-management-customization
13:07:42 <Qiming> documentation, em, no progress either
13:08:12 <Qiming> Sample scenarios/stories, I'll get on to it
13:08:23 <lixinhui_> cool
13:08:47 <lixinhui_> just let us know if you need help
13:08:53 <Qiming> my etherpad connection broke
13:09:10 <yanyanhu> network problem?
13:09:31 <elynn> That's normal, get used to it :)
13:09:44 <lixinhui_> :)
13:09:54 <Qiming> yes
13:10:42 <Qiming> does it work for any of you?
13:10:45 <cschulz> I'm putting together an example scalable stack using a Senlin cluster.
13:10:53 <Qiming> please drive the flow
13:10:55 <yanyanhu> it works for me
13:11:08 <yanyanhu> next one is profile
13:11:20 <Qiming> okay, now I'm on it
13:11:25 <yanyanhu> I plan to work on heat stack update support first
13:11:30 <Qiming> need a proxy again
13:11:34 <lixinhui_> who is cschulz
13:11:36 <Qiming> great
13:11:58 <lixinhui_> could you introduce yourself?
13:12:02 <Qiming> cschulz, that would be great
13:12:03 <cschulz> I'm Chuck Schulz from IBM Research in NY USA.
13:12:10 <yanyanhu> about disk property update support for the nova server profile, I will give it lower priority
13:12:23 <lixinhui_> Nice to meet you, cschulz
13:12:23 <yanyanhu> cschulz, welcome :)
13:12:24 <elynn> Hi Chuck, welcome.
13:12:49 <yanyanhu> looks like disk updating is not common
13:13:00 <Qiming> agree
13:13:10 <Qiming> just leave it there
13:13:21 <yanyanhu> yes, that is my plan
13:13:26 <Qiming> I'm experimenting with a cluster of containers
13:13:28 <cschulz> elynn, We've been exchanging email.  Thanks for the NotFound patch.
13:13:30 <Qiming> using coreos
13:13:56 <Qiming> it is an attempt to find the functionality gaps
13:14:05 <elynn> cschulz: yes, feel free to reach me if you have any problems in using senlin resources :)
13:14:26 <Qiming> policies
13:14:42 <yanyanhu> most of the work has been done I think
13:14:48 <yanyanhu> in the last week
13:14:49 <Qiming> we have released fixes to placement and deletion policies
13:15:09 <Qiming> \o/
13:15:17 <yanyanhu> lb policy also works now
13:15:27 <lixinhui_> great
13:15:29 <Qiming> all history now
13:15:33 <yanyanhu> yea
13:15:34 <elynn> That's great, we are on the way.
13:15:54 <Qiming> receiver functional test
13:16:10 <yanyanhu> will start this work after finishing heat stack update support
13:16:12 <Qiming> we have deleted the webhook test, but not yet added one for receiver
13:16:17 <Qiming> okay
13:16:36 <Qiming> engine status list is done
13:16:40 <elynn> yes
13:16:44 <Qiming> that bp closed?
13:16:50 <elynn> not yet
13:16:52 <yanyanhu> hi, elynn, about the problem I met before
13:16:54 <elynn> not yet approved..
13:17:02 <yanyanhu> has it been addressed?
13:17:15 <yanyanhu> you mentioned there could be a race condition there
13:17:20 <elynn> yanyanhu: I still don't have any idea about how to fix it.
13:17:40 <yanyanhu> ok, it didn't happen every time I restarted senlin engine
13:18:09 <elynn> trapped in meetings these two days; I think it's better to open a bug to track it.
13:18:10 <Qiming> #action Qiming to update bp status - service list
13:18:27 <Qiming> yep, filing a bug sounds like a plan
13:18:47 <elynn> Seems multiple engines will hit a potential race condition when deleting the same record.
13:19:06 <yanyanhu> yes, that could be the reason
13:19:08 <Qiming> we can allocate some time for a discussion about event log generation in next meetings
13:19:12 <Qiming> not a high priority
13:19:20 <yanyanhu> Qiming, ok
13:19:23 <elynn> It's not a blocking issue I think
13:19:59 <Qiming> okay, anything missed in the etherpad?
13:20:09 <elynn> lock breaker?
13:20:22 <Qiming> yes? any progress?
13:20:34 <elynn> I'm done coding and have done some tests in my local env
13:20:40 <elynn> it's based on engine status.
13:20:47 <Qiming> okay
13:20:53 <elynn> Will steal the lock after retries
13:20:56 <Qiming> cool
13:21:14 <Qiming> yes, just be careful on the critical path
13:21:30 <yanyanhu> elynn, looking forward to your patch  :)
13:21:49 <lixinhui_> cool, elynn!
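
A rough Python sketch of the lock-breaker idea elynn describes above: retry the lock, then steal it if the engine holding it is no longer alive, judging from the engine status records. The db_api helper names are assumptions for illustration, not Senlin's actual API:

    # Hypothetical sketch only: steal a cluster lock after retries when the
    # engine that holds it is no longer alive.  db_api helper names are assumed.
    import time

    RETRY_LIMIT = 3
    RETRY_INTERVAL = 10  # seconds

    def acquire_cluster_lock(db_api, cluster_id, action_id):
        holder = None
        for _ in range(RETRY_LIMIT):
            holder = db_api.cluster_lock_acquire(cluster_id, action_id)
            if holder == action_id:
                return True                      # we now own the lock
            time.sleep(RETRY_INTERVAL)

        # Retries exhausted: consult the engine status records; if the engine
        # owning the blocking action is dead, steal the lock.
        engine_id = db_api.action_owner(holder)
        if not db_api.is_engine_alive(engine_id):
            return db_api.cluster_lock_steal(cluster_id, action_id)
        return False
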
13:21:53 <Qiming> #topic cleanse desired_capacity
13:22:06 <Qiming> this one is a little bit dirty today
13:22:12 <yanyanhu> yep
13:22:22 <yanyanhu> saw xujun's patch and your comment
13:22:29 <Qiming> we don't have a clear definition of desired_capacity, so we are not handling it in a consistent way
13:22:48 <Qiming> yep, xujun talked to me 30 mins ago
13:23:09 <Qiming> seems a concise definition is needed, to guide the code implementation
13:23:36 <Qiming> desired_capacity, as the name suggests, is about the 'desired' size of a cluster
13:23:49 <Qiming> it may or may not be the current size of a cluster
13:24:09 <elynn> So how do we get the current_size?
13:24:30 <Qiming> do a node_get_all
13:25:07 <Qiming> we are not maintaining the current size as a cluster property
13:25:20 <Qiming> it is always an instantaneous concept
13:25:29 <cschulz> I don't think it should be.
13:25:51 <cschulz> Current capacity is a bit of data that may change at any moment.
13:25:59 <Qiming> exactly
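
To make the distinction concrete, a minimal sketch: the current size is derived on demand from the node records (node_get_all is the call mentioned above; the other names are assumptions), while desired_capacity is the stored target.

    # Illustrative only: current size is an instantaneous value computed from
    # the node table, not a property maintained on the cluster object.
    def current_size(db_api, context, cluster_id):
        nodes = db_api.node_get_all(context, cluster_id=cluster_id)
        return len(nodes)

    def needs_convergence(db_api, context, cluster):
        # the cluster "wants" desired_capacity members; its actual membership
        # may differ at any moment
        actual = current_size(db_api, context, cluster.id)
        return actual != cluster.desired_capacity
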
13:26:13 <lixinhui_> ...
13:26:14 <Qiming> so the question is when we are gonna change the desired_capacity
13:26:15 <elynn> ok, I might need to revise the cluster resource in heat.
13:26:20 <yanyanhu> Qiming, about that patch, actually I agree with xujun that the desired_capacity should be changed once the node object is added into the DB
13:26:44 <Qiming> yanyanhu, I agree with that patch too
13:26:51 <cschulz> As I see it desired capacity only changes based on an Action.
13:27:07 <Qiming> but I dislike fixing it piece by piece
13:27:29 <Qiming> say we get a request for cluster-scale-in by 1
13:27:31 <yanyanhu> Qiming, yes, me neither
13:27:57 <Qiming> that is an intent to shrink the cluster size by 1, an intent to decrease desired_capacity
13:28:32 <cschulz> Shouldn't that scale in actually change the desired capacity?
13:28:42 <Qiming> when we remove a node from a cluster (no matter whether we destroy it afterwards or not), the intent is to decrease the desired_capacity as well
13:29:18 <Qiming> cschulz, that is the expectation, but I'm not 100% sure we are consistently enforcing this change everywhere
13:29:20 <yanyanhu> cschulz, desired_capacity is expected to be changed during that action
13:29:36 <yanyanhu> Qiming, yes, that's right
13:29:41 <cschulz> Yes I agree.
13:29:45 <Qiming> so ... I'm hoping we do a cleanse of desired_capacity handling
13:29:58 <yanyanhu> from DB level?
13:30:04 <cschulz> So current capacity is always trying to match desired capacity.
13:30:15 <Qiming> the first thing to do when processing an action is to change its desired_capacity
13:30:24 <Qiming> no matter whether the action succeeds or not
13:31:00 <yanyanhu> Qiming, agree
13:31:20 <Qiming> certainly, we will do some sanity checking before changing desired_capacity, e.g. whether it is going beyond the max_size etc.
13:31:30 <cschulz> Ah ... So here is the rub.  Should there be a daemon which is always looking at the current capacity and the desired capacity and triggering an action if they are different?
13:31:50 <Qiming> cschulz, you are running ahead of us, :)
13:31:53 <yanyanhu> cschulz, that's an important thing we want to do :)
13:32:01 <yanyanhu> although maybe in future
13:32:17 <Qiming> let's reach a consensus here, about desired_capacity
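
A sketch of the rule being proposed: validate and update desired_capacity first when a size-changing action starts, whether or not the physical operation later succeeds. It assumes min_size/max_size properties and treats a negative max_size as "unlimited"; the helper itself is illustrative:

    # Hypothetical sketch: adjust desired_capacity up-front with sanity checks,
    # before any node is actually created or deleted.
    def adjust_desired_capacity(cluster, delta):
        new_desired = cluster.desired_capacity + delta

        # sanity checks against the cluster's size constraints
        if new_desired < cluster.min_size:
            raise ValueError("request would shrink the cluster below min_size")
        if cluster.max_size >= 0 and new_desired > cluster.max_size:
            raise ValueError("request would grow the cluster beyond max_size")

        # record the new intent first; the current size then converges to it
        cluster.desired_capacity = new_desired
        cluster.store()
        return new_desired
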
13:32:36 <Qiming> #topic rework NODE_CREATE, NODE_DELETE actions
13:32:55 <yanyanhu> Qiming, that's the point
13:32:56 <Qiming> these two actions are currently off the radar for policy checking
13:33:05 <Qiming> this has to be fixed
13:33:12 <yanyanhu> and maybe we also need to recheck the current implementation of cluster resize/scale actions
13:33:18 <yanyanhu> to ensure their logic is right
13:33:42 <Qiming> adding NODE_CREATE/DELETE into policy checking is gonna further complicate the policy checking logic
13:33:54 <Qiming> so my proposal is like this:
13:34:08 <Qiming> when we get a node-create request, with cluster_id specified
13:34:29 <Qiming> we create a NODE_CREATE action followed by an internal CLUSTER_ADD_NODE action
13:35:03 <Qiming> the second action will trigger the cluster policies
13:35:09 <yanyanhu> you mean cluster_add_node action depends on node_create action?
13:35:18 <Qiming> yes, a single dependency
13:35:23 <yanyanhu> ok
13:35:35 <Qiming> it is a little bit tricky to get it done
13:35:42 <yanyanhu> sounds good :)
13:35:58 <Qiming> we should still lock the cluster first before locking the node
13:36:28 <Qiming> if we do these 2 steps in the reverse order, we will risk introducing some dead locks
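
A sketch of the proposed flow, with assumed helper names (action_create, dependency_add, notify): node-create with a cluster_id yields two actions joined by a single dependency, and the dependent, internal CLUSTER_ADD_NODE action is the one that goes through cluster policy checking.

    # Hypothetical sketch of "node-create with cluster_id specified":
    # NODE_CREATE plus a dependent, internal CLUSTER_ADD_NODE action.
    def handle_node_create(context, db_api, node, cluster_id=None):
        create_id = db_api.action_create(context, target=node.id,
                                         action='NODE_CREATE')
        if cluster_id:
            add_id = db_api.action_create(context, target=cluster_id,
                                          action='CLUSTER_ADD_NODE',
                                          inputs={'nodes': [node.id]})
            # CLUSTER_ADD_NODE becomes ready only after NODE_CREATE succeeds;
            # it is the action that triggers the cluster policy checks.
            db_api.dependency_add(context, depended=create_id, dependent=add_id)

        # dispatch NODE_CREATE to an engine worker
        db_api.notify(context, create_id)

    # When CLUSTER_ADD_NODE runs: lock the cluster first, then the node, so
    # every code path takes locks in the same order and deadlock is avoided.
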
13:36:33 <cschulz> Might that cause the cluster to be locked for a long time?
13:36:41 <yanyanhu> cluster should be locked only when cluster_add_node action is executed, right?
13:36:56 <yanyanhu> cschulz, have the same concern
13:37:05 <Qiming> so .. a rule of thumb: always lock the cluster first, then lock the node you want to operate on
13:37:27 <Qiming> hopefully, it won't be that long
13:37:42 <yanyanhu> Qiming, yes, but the first node_create action has no relationship with the cluster I think
13:37:44 <cschulz> OK, but don't lock cluster before performing the create-node
13:37:50 <Qiming> it is much shorter than cluster-scale-out count=2
13:38:01 <yanyanhu> until cluster_add_node action started
13:38:34 <Qiming> there is a window where we have created a node db object and are still creating the physical node
13:38:56 <Qiming> at the same time, another request comes in, it wants a scaling down
13:39:15 <yanyanhu> hmm. that's a problem
13:40:01 <Qiming> anyway, just a rough idea, will need to revisit the workflow and see if it fits in seamlessly
13:40:22 <yanyanhu> or maybe we split creating a node with a target cluster specified into two individual actions
13:40:26 <yanyanhu> ok
13:40:31 <yanyanhu> need further discussion
13:40:46 <Qiming> #action Qiming to add NODE_CREATE/DELETE rework into etherpad for tracking
13:41:21 <Qiming> yanyanhu, that is possible too
13:41:41 <Qiming> we will introduce some waiting logic in the engine service before adding the node to the cluster
13:41:55 <yanyanhu> Qiming, that will be nice
13:42:19 <Qiming> will dig more
13:42:27 <cschulz> I don't like waits.  You never know how long is enough.
13:42:28 <Qiming> #topic health management
13:42:42 <Qiming> yes, cschulz, agreed
13:42:56 <lixinhui_> I registered a new BP: https://blueprints.launchpad.net/senlin/+spec/support-health-management-customization
13:43:26 <lixinhui_> need help from all of you to review it
13:43:52 <lixinhui_> in Mitaka-2, we have implemented the check and recover functions
13:43:58 <Qiming> good, but we still need some more details
13:44:17 <lixinhui_> we need to discuss how to enable them
13:44:34 <Qiming> right, those are the primitives to check if a node is 'healthy' or not and to recover a node by recreating it
13:45:09 <Qiming> I think we will need to expose those operations as APIs
13:45:24 <Qiming> triggered by some external node monitoring software
13:45:27 <yanyanhu> xinhui, I agree with what you described in whiteboard
13:45:52 <lixinhui_> Yes
13:45:56 <Qiming> there is a subtle question here
13:46:07 <yanyanhu> maybe the first step is exposing those two actions through API
13:46:11 <Qiming> checking logic is okay, we only check the nodes we know of
13:46:17 <cschulz> So is it possible for metric information to be part of the health check test?
13:46:42 <yanyanhu> cschulz, that depends on the profile implementation
13:46:49 <cschulz> OK
13:47:09 <Qiming> cschulz, the expectation is that we will build some internal health check polling as a starting point
13:47:18 <lixinhui_> yanyanhu, you mean the first one: manual
13:47:24 <yanyanhu> if you need it, you can implement it in the do_check routine of the profile
13:47:36 <Qiming> that polling can be disabled by deployer easily, if they have a better alternative for health monitoring
13:47:51 <yanyanhu> lixinhui_, yes
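
For reference, a minimal sketch of what a profile-level check primitive might look like (only the do_check name comes from the discussion above; the driver call is an assumption, not Senlin's actual compute driver API):

    # Illustrative only: a profile whose do_check polls the backend for the
    # node's real status; metric-based checks could be plugged in here too.
    class ServerProfile(object):

        def __init__(self, compute_driver):
            # any object exposing server_get(server_id); the name is assumed
            self.compute = compute_driver

        def do_check(self, node):
            try:
                server = self.compute.server_get(node.physical_id)
            except Exception:
                return False
            return server is not None and server.status == 'ACTIVE'
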
13:47:53 <Qiming> back to the 'recover' action
13:48:20 <Qiming> we have primitives for node recover, which default to recreate or nova server rebuild; this is fine too
13:48:33 <Qiming> but for cluster recover, we may need to consider the desired_capacity
13:48:49 <Qiming> this is leading us back to the question cschulz asked before
13:49:02 <yanyanhu> yes
13:49:20 <Qiming> if we find a cluster only has 2 nodes ACTIVE and 1 node ERROR, and desired_capacity is 4
13:49:46 <Qiming> the expected recover action should be to rebuild one plus create one
13:49:49 <lixinhui_> Qiming and Yanyan, I agree this is a problem, but it is about the concept of the limits of the "resource pool"
13:50:08 <lixinhui_> I do not think recover needs to consider this
13:50:14 <Qiming> that's my understanding of recover
13:50:24 <cschulz> I think the test is as Qiming stated, plus no actions currently running against that cluster.
13:50:49 <lixinhui_> since the error node comes from detecting an inconsistency between the node status stored by senlin and by nova
13:50:50 <cschulz> i.e. nothing is trying to create that 4th node.
13:50:51 <Qiming> or at least we should provide such an option to users
13:51:08 <Qiming> node level recovery is fine
13:51:36 <lixinhui_> cluster recover just forks node recover nowadays
13:51:40 <Qiming> but for recovering a cluster, does it mean you are taking full responsibility for getting it back to its desired_capacity?
13:52:03 <cschulz> Yes, I think that is the correct approach
13:52:28 <Qiming> or else, we will have to introduce another operation to do this job
13:52:38 <Qiming> it would be a headache
13:52:46 <yanyanhu> definitely
13:52:55 <lixinhui_> I think there should be another service to handle limits of capacity
13:53:05 <lixinhui_> not only for recover
13:53:18 <yanyanhu> very difficult to say which way is better before thinking about this problem more deeply
13:53:47 <Qiming> once these mechanisms are all in place, we can proceed to think about making this whole process semi-automatic and/or fully-automatic
13:53:59 <Qiming> again, just a rough idea
13:54:14 <Qiming> we can think of it offline
13:54:19 <yanyanhu> sure
13:54:21 <lixinhui_> okay
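
One way to express the cluster-level recover semantics just discussed (rebuild the ERROR nodes, then create enough new nodes to get back to desired_capacity); a sketch with assumed callables, not the current implementation:

    # Hypothetical sketch: in the example above (2 ACTIVE + 1 ERROR nodes,
    # desired_capacity = 4), this recovers one node and creates one more.
    def cluster_recover(cluster, recover_node, create_node):
        healthy = [n for n in cluster.nodes if n.status == 'ACTIVE']
        broken = [n for n in cluster.nodes if n.status != 'ACTIVE']

        # step 1: recover (recreate / nova rebuild) the unhealthy nodes
        for node in broken:
            recover_node(node)

        # step 2: top up to the desired size if members are missing entirely
        missing = cluster.desired_capacity - len(healthy) - len(broken)
        for _ in range(max(0, missing)):
            create_node(cluster)
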
13:54:29 <Qiming> #topic open discussions
13:54:49 <yanyanhu> I think elynn has proposed a topic
13:54:56 <yanyanhu> for the Austin summit
13:54:58 <Qiming> shoot
13:55:08 <Qiming> ... that 'meeting'
13:55:09 <yanyanhu> ?
13:55:15 <elynn> Do you have any topic to propose?
13:55:32 <yanyanhu> elynn, yours is the only one :)
13:55:33 <Qiming> I want to propose a solution for container clustering
13:55:50 <elynn> More is better :)
13:55:53 <Qiming> hopefully we can get a POC done by the summit
13:56:11 <yanyanhu> Qiming, you mean a container cluster managed by senlin?
13:56:12 <Qiming> I'm collecting all the SUR work from last year
13:56:20 <lixinhui_> :)
13:56:21 <yanyanhu> not other container services
13:56:34 <Qiming> yes, openstack native container clustering
13:56:39 <yanyanhu> cool!
13:56:45 <lixinhui_> cool!
13:57:22 <Qiming> proposal deadline is approaching, need to work hard on this
13:57:53 <Qiming> or another candidate would be the health scenario
13:58:15 <lixinhui_> how do we show this health scenario?
13:58:18 <Qiming> as far as I know, HA has been a long-wanted feature, not provided by any service
13:58:41 <Qiming> lixinhui_, you tell me, ;)
13:58:44 <yanyanhu> hmm, I think we need a more specific use case for building the health topic
13:58:51 <cschulz> As you think about HA, also consider DR.
13:59:04 <yanyanhu> otherwise, it's difficult to describe it clearly
13:59:05 <elynn> maybe we should think of an HA scenario :)
13:59:07 <Qiming> cschulz, good idea
13:59:25 <lixinhui_> :)
13:59:27 <Qiming> time's up, let's switch back to #senlin
13:59:39 <Qiming> thanks for joining the meeting, guys
13:59:43 <Qiming> talk to you next week
13:59:46 <yanyanhu> thanks
13:59:47 <Qiming> #endmeeting