#openstack-meeting log

13:01:14 <Qiming> #startmeeting senlin
13:01:15 <openstack> Meeting started Tue Oct 20 13:01:14 2015 UTC and is due to finish in 60 minutes.  The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:01:20 <openstack> The meeting name has been set to 'senlin'
13:01:27 <Qiming> hello guys
13:01:30 <yanyanhu> hi
13:01:37 <jruano> hello
13:01:41 <Qiming> #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Weekly_Senlin_.28Clustering.29_meeting
13:01:51 <yanyanhu> I guess haiwei is on the way back to his hometown
13:01:52 <Qiming> please revise agenda if you have things to add
13:02:00 <Qiming> yes, heard that
13:02:05 <Qiming> feel sorry for him
13:02:20 <yanyanhu> yes
13:02:46 <Qiming> don't know if lixinhui will join today
13:02:58 <Qiming> anyway, let's get started
13:03:03 <yanyanhu> ok
13:03:15 <Qiming> #topic health policy and its relationship to scaling policy
13:03:51 <Qiming> so lixinhui talked to me this afternoon, she is looking into that policy
13:03:57 <yanyanhu> hi, elynn
13:04:14 <yanyanhu> great
13:04:15 <elynn> Hi
13:04:31 <Qiming> it is a requirement from some production teams as well
13:04:58 <Qiming> I have got quite some complaints that resourcegroup cannot handle resource failures
13:05:16 <yanyanhu> yes, I think their thought about 'desired_capacity' and 'current size' is understandable
13:05:25 <Qiming> we had a demo last year at the Paris summit, showing how VM HA can be done
13:06:08 <Qiming> the prototype was done using Heat, but the solution doesn't fall into Heat's scope
13:06:59 <Qiming> so we have had a sample policy here: http://git.openstack.org/cgit/openstack/senlin/tree/examples/policies/health_policy_poll.yaml
13:07:16 <Qiming> and a WIP code here:  http://git.openstack.org/cgit/openstack/senlin/tree/senlin/policies/health_policy.py
13:08:04 <Qiming> my colleague helped write that code but didn't finish it before switching to other jobs
13:08:15 <Qiming> glad lixinhui is picking this up
13:08:27 <yanyanhu> yes
13:08:42 <yanyanhu> this work item has been there for about half a year I think :)
13:08:45 <Qiming> definitely, she will need help
13:09:01 <yanyanhu> yes
13:09:05 <Qiming> one thing is about how to trigger recovery actions
13:10:05 <yanyanhu> yes, I think how to perform the health check for each node is the key issue we need to figure out
13:10:08 <Qiming> maybe we need to define a do_recover() method for the base Profile class for an implementation to decide how to recover itself
13:10:32 <yanyanhu> I found that a 'periodic_task' decorator has been added
13:10:42 <yanyanhu> agree
13:10:49 <Qiming> well, failure detection is not a great challenge
13:11:16 <Qiming> we have three options at least: 1) poll the node status periodically for clusters attached with a health policy
13:11:37 <Qiming> 2) poll the LB health monitor if a cluster has a lb policy attached
13:11:52 <Qiming> 3) listen to lifecycle events on message queue
13:12:28 <Qiming> the first, very naive, very generic solution is to have the health_manager call profile.do_check()
13:12:43 <Qiming> so we can know if the object (stack, server) is still active
13:12:59 <Qiming> that could serve as the baseline
13:13:57 <Qiming> I am proposing it to be the "baseline" because it introduces no dependencies on other services
13:14:43 <yanyanhu> yes, these three choices should be available
13:14:43 <yanyanhu> yes
13:15:18 <yanyanhu> but I guess we may still need to rely on some APIs provided by other services to help decide the status of a node
13:15:24 <Qiming> we can figure out better/advanced solutions/strategies when we gain some experience from this naive version
13:15:54 <Qiming> I believe most objects exposed have a 'status' property
13:17:26 <Qiming> we may need to figure out how to handle objects that do not support a status check
13:17:56 <yanyanhu> or sometimes, the status may not reflect the real availability of the node
13:17:56 <Qiming> but ... if they don't have a status for checking, we cannot be sure we can "recover" it, right?
13:18:09 <Qiming> yanyanhu, that is very true
13:18:15 <yanyanhu> yea, this is the problem...
13:18:43 <Qiming> so one of the key design points is to have a profile implement a do_recover() method
13:18:58 <Qiming> and somehow make that a builtin action to support
13:19:03 <yanyanhu> yes
13:19:04 <Qiming> then expose it with a webhook
13:19:19 <Qiming> we throw the question back to users
13:19:38 <jruano> that sounds doable as a first attempt
13:20:17 <yanyanhu> I think we can start from the simplest case
13:20:22 <Qiming> I'll talk to lixinhui about this
13:20:39 <yanyanhu> great
13:20:39 <Qiming> yes, simple case can help us learn a lot
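The baseline discussed above (health manager calls profile.do_check(), and each profile decides how to recover through do_recover()) might look roughly like the sketch below. It is only an illustration: the class names, the dict-based node, and the "mark as ACTIVE" recovery strategy are assumptions, not Senlin's actual code.

```python
# Hypothetical sketch; none of these classes are Senlin's real implementation.
class Profile:
    """Base profile; concrete profiles override do_check/do_recover."""

    def do_check(self, node):
        # Return True if the physical object behind the node looks healthy.
        raise NotImplementedError

    def do_recover(self, node):
        # Default: this profile does not know how to recover its object.
        return False


class FakeServerProfile(Profile):
    """Stand-in for a server profile; 'status' mimics a backend status."""

    def do_check(self, node):
        return node.get("status") == "ACTIVE"

    def do_recover(self, node):
        # Assumed recovery strategy, for illustration only.
        node["status"] = "ACTIVE"
        return True


def poll_cluster(profile, nodes):
    """One polling pass: check every node and try to recover failed ones.

    Returns the names of nodes that were successfully recovered.
    """
    recovered = []
    for node in nodes:
        if profile.do_check(node):
            continue
        if profile.do_recover(node):
            recovered.append(node["name"])
    return recovered
```

A periodic task in the health manager would call something like poll_cluster() for each cluster that has a health policy attached; objects without a usable status would simply keep the default do_recover() and be left alone.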
13:20:54 <Qiming> another question asked is
13:21:14 <Qiming> how do we deal with CLUSTER_DEL_NODES/CLUSTER_SCALE_IN actions
13:21:31 <Qiming> if the actions are not changing the desired_capacity
13:22:24 <yanyanhu> I think that depends on our definition of cluster_del_nodes and cluster_scale_in
13:22:43 <Qiming> maybe a better question to ask is how do we deal with the "conflict" between health policy and scaling policy
13:22:59 <yanyanhu> maybe we just define them as actions which will reduce the desired_capacity of cluster :)
13:23:12 <yanyanhu> oh
13:23:41 <Qiming> previously, when I was studying the amazon design
13:23:59 <Qiming> somewhere they have an option to temporarily disable the health policy
13:24:22 <Qiming> I believe that is a feature proposed out of hard lessons learned
13:24:40 <yanyanhu> that means health policy is the most powerful one once it is enabled :)
13:24:46 <Qiming> yes
13:24:51 <yanyanhu> I think this is reasonable
13:25:14 <yanyanhu> since availability should have higher priority than performance
13:25:25 <yanyanhu> or cost
13:25:25 <Qiming> temporarily disabling the health policy was considered when we designed the cluster_policy table
13:25:41 <Qiming> the 'enabled' field in that table is for this very purpose
13:25:52 <yanyanhu> yes
13:26:04 <Qiming> we don't know what operators want to do
13:26:25 <Qiming> they may have a good reason to do this, for a manual scaling or something
13:26:46 <Qiming> so .. refer to this code: https://github.com/openstack/senlin/blob/master/senlin/engine/actions/cluster_action.py#L672-L699
13:27:20 <Qiming> it is the logic of CLUSTER_UPDATE_POLICY action
13:27:40 <Qiming> if we move this code to be a method of a cluster
13:27:50 <Qiming> it can be shared
13:28:16 <yanyanhu> agree
13:29:08 <Qiming> in policy checking, we can always temporarily disable the health policy (if any) in the pre_op call and re-enable it in the post_op call
13:30:01 <Qiming> a byproduct of this is that the action code is greatly simplified
13:30:58 <yanyanhu> hmm, that's true
13:31:03 <Qiming> let's think about it
13:31:12 <yanyanhu> ok
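One way to picture the idea of suspending the health policy around a scaling action is the sketch below. It assumes a binding object with an 'enabled' flag similar to the cluster_policy field mentioned above; the class, the "health" type string, and the context-manager shape are hypothetical and stand in for the pre_op/post_op calls.

```python
from contextlib import contextmanager


class PolicyBinding:
    """Hypothetical stand-in for one row of the cluster_policy table."""

    def __init__(self, policy_type, enabled=True):
        self.policy_type = policy_type
        self.enabled = enabled


@contextmanager
def health_policy_suspended(bindings):
    """Disable enabled health policies for the duration of an action."""
    suspended = [b for b in bindings
                 if b.policy_type == "health" and b.enabled]
    for binding in suspended:
        binding.enabled = False  # what a pre_op step would do
    try:
        yield suspended
    finally:
        for binding in suspended:
            binding.enabled = True  # what a post_op step would do


def scale_in(bindings, nodes, count):
    """Remove 'count' nodes while the health policy cannot interfere."""
    with health_policy_suspended(bindings):
        return nodes[:len(nodes) - count]
```

Only bindings that were enabled beforehand are re-enabled afterwards, so a policy that an operator disabled by hand stays disabled.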
13:31:32 <Qiming> #topic heat resource type support
13:32:17 <Qiming> oh, elynn dropped
13:32:27 <Qiming> two links on the agenda
13:32:28 <yanyanhu> let's wait for him for a while
13:32:43 <yanyanhu> I think the current spec looks good
13:32:55 <Qiming> first is the spec proposal in heat
13:33:03 <Qiming> yes, already +2'ed
13:33:06 <yanyanhu> only possible question is about the spec property of profile
13:33:16 <Qiming> however, we need to think beyond that
13:33:24 <Qiming> yes?
13:33:30 <yanyanhu> since it is really different from current design about template
13:33:54 <Qiming> yes, I will try my best to convince heat team that we are doing the right thing
13:33:57 <yanyanhu> it could take a while for other people to understand it
13:34:12 <Qiming> I think heat cores can grasp the idea quickly
13:34:31 <yanyanhu> sure, what we need to do is describe it clearly :)
13:34:40 <Qiming> it's a lesson learned when the convergence proposal was discussed
13:34:40 <yanyanhu> and correctly
13:34:56 <yanyanhu> yes
13:35:19 <Qiming> basically, you have 'template', 'environment', 'parameters' and 'files' together to determine the 'desired_state'
13:35:59 <Qiming> all these four inputs should be consolidated into a single "spec" for defining a stack
13:36:40 <yanyanhu> yea, this is what spec is for
13:37:21 <Qiming> or else we cannot know what a stack should look like
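The consolidation described above, folding 'template', 'environment', 'parameters' and 'files' into one "spec", could be sketched as follows. The function name and the exact dict layout are assumptions for illustration; the real profile spec format may differ.

```python
def build_stack_spec(template, environment=None, parameters=None, files=None):
    """Bundle the four stack inputs into a single spec dict.

    Hypothetical helper: the key names follow the four inputs named in
    the discussion, but the layout is an assumed one.
    """
    return {
        "template": template,
        "environment": dict(environment or {}),
        "parameters": dict(parameters or {}),
        "files": dict(files or {}),
    }


spec = build_stack_spec(
    template={"heat_template_version": "2015-04-30", "resources": {}},
    parameters={"flavor": "m1.small"},
)
```

With everything in one object, the desired state of a stack can be stored, compared, and updated as a unit rather than reassembled from four separate inputs.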
13:37:46 <yanyanhu> Another possible issue is about senlinclient plugin for heat
13:38:05 <Qiming> yes, senlinclient has some big issues
13:38:21 <yanyanhu> do we need to further refactor the implementation of python-senlinclient to make this work easier?
13:38:22 <Qiming> one thing is about the interfacing with openstacksdk
13:38:51 <Qiming> yes, that is something we need to do
13:39:00 <Qiming> especially for profile create
13:39:14 <Qiming> there is some preprocessing logic there
13:39:22 <Qiming> for parsing of the get_file function
13:39:27 <yanyanhu> yes, I also think so
13:39:55 <Qiming> when invoked from other services (including senlin-dashboard), that logic is skipped
13:40:40 <elynn> yes, and also needed by heat resource properties.
13:41:07 <elynn> Just lost my connection on my way home...
13:41:09 <Qiming> we will need to fix the problem on our side first
13:41:13 <yanyanhu> so it seems that these get_file operations need to be done in senlinclient
13:41:18 <yanyanhu> on behalf of heatclient
13:41:46 <Qiming> right, we have a dependency on heatclient because of that extra parsing logic
13:42:22 <Qiming> #action file a bug on profile_create skipping template parsing
13:42:53 <yanyanhu> I think we need to fix this issue before elynn can start his work :)
13:43:09 <Qiming> yes, it is high priority
13:43:20 <Qiming> regarding interaction with Heat, we have more questions to answer
13:43:29 <Qiming> #link http://sched.co/4QbQ
13:44:17 <Qiming> it would be a good and difficult discussion
13:44:48 <yanyanhu> yes, really hope this session won't conflict with the demo at the booth
13:44:51 <elynn> yes, what are you going to talk about at the summit?
13:45:18 <Qiming> elynn, question to whom?
13:45:21 <yanyanhu> or maybe I can ask someone else to help me stay at the booth if so
13:45:36 <elynn> questions about ASG in heat
13:45:55 <Qiming> it is a design summit working session
13:46:09 <Qiming> people sit around a table for discussion
13:46:32 <yanyanhu> elynn, you really should join the design session :)
13:46:45 <elynn> I did hope..
13:46:52 <yanyanhu> I joined once at the 2013 Hong Kong summit, very interesting
13:46:56 <Qiming> elynn, don't worry, https://etherpad.openstack.org/p/mitaka-heat-autoscaling
13:47:03 <Qiming> this will be the etherpad
13:47:52 <Qiming> we need to think beyond current spec
13:48:06 <Qiming> how autoscaling would be supported once cluster is in
13:48:07 <elynn> I tried to implement ResourceGroup and found that it's not very easy to do; so many properties and situations need to be considered, like index_var, rolling_update, removal_policy.
13:48:23 <Qiming> right
13:48:58 <elynn> All these questions might be raised at that session. Need to be careful...
13:49:05 <Qiming> rolling_update and removal_policy can be modeled as senlin policies
13:49:09 <Qiming> just to make heat user's life easier
13:49:26 <Qiming> index_var is difficult
13:49:44 <Qiming> it is completely a heat thing
13:50:09 <Qiming> so I think it can still be implemented in heat
13:50:48 <yanyanhu> what is index_var for?
13:50:50 <elynn> Why are you saying that index_var is difficult? Since senlin can set the names of nodes in cluster, is it possible to change that logic to support index_var?
13:51:00 <yanyanhu> oh, I see
13:51:24 <Qiming> elynn, I mean the logic should live in heat, not passed to senlin for parsing
13:51:25 <yanyanhu> I recall we used it before
13:51:44 <Qiming> it is mainly used for naming things for easier reference later
13:52:35 <Qiming> the key is to translate the 'resource_def' property into a profile
13:52:49 <yanyanhu> Qiming, I think I understand elynn's concern about index_var
13:52:52 <yanyanhu> yes
13:53:04 <Qiming> and make future updates behave as users expect
13:53:14 <elynn> If that logic lives in heat and I delete a node from the cluster manually, heat might not be aware of that, and it will cause a NotFound error or something else...
13:53:44 <Qiming> if the cluster is created from a heat stack
13:53:50 <Qiming> heat is responsible for that
13:54:11 <Qiming> just like when you delete a nova server secretly, :)
13:54:52 <Qiming> elynn, feel free to call a f2f discussion this week if you have questions
13:55:20 <Qiming> #topic summit meetup planning
13:55:36 <elynn> yes, I will, after I investigate more deeply.
13:55:42 <Qiming> I haven't looked at the logistics yet
13:55:48 <Qiming> room is very limited
13:56:08 <Qiming> need to spend some time on this
13:56:23 <Qiming> so please watch the mailing list
13:56:30 <yanyanhu> hi, elynn, I think a node should not be deleted manually using the senlin interface in that case
13:56:33 <yanyanhu> ok
13:56:50 <Qiming> will announce time/location to meet when we have a plan
13:57:15 <Qiming> anything else?
13:57:35 <yanyanhu> nope
13:57:43 <jruano> nothing from me
13:57:54 <elynn> not from me
13:57:59 <Qiming> I need to talk to sdk guys for code review
13:58:04 <Qiming> it is stagnating
13:58:13 <Qiming> need to know what's the plan next
13:58:31 <yanyanhu> yes, we have dependency on it
13:58:33 <Qiming> it is blocking a lot of things from us
13:59:05 <Qiming> btw, next weekly meeting will be cancelled
13:59:23 <Qiming> talk to you in two weeks, :)
13:59:23 <Qiming> thanks for joining
13:59:27 <Qiming> #endmeeting