#openstack-meeting log

13:00:10 <Qiming> #startmeeting senlin
13:00:10 <openstack> Meeting started Tue Jun  7 13:00:10 2016 UTC and is due to finish in 60 minutes.  The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:13 <openstack> The meeting name has been set to 'senlin'
13:00:26 <Qiming> #topic roll call
13:00:37 <yanyanhu> hello
13:00:58 <Qiming> hi
13:01:02 <xuhaiwei__> hi
13:01:08 <Qiming> hi, haiwei
13:01:09 <lixinhui_> hi
13:01:16 <yanyanhu> network connection at home is really unstable
13:01:17 <elynn> o/
13:01:33 <Qiming> #topic newton work items
13:01:45 <Qiming> testing, where are we ?
13:02:03 <elynn> enable tempest api on gate
13:02:08 <yanyanhu> 50% I think
13:02:15 <yanyanhu> about negative test cases
13:02:30 <elynn> Saw some patches submitted by yanyanhu, many thanks!
13:02:40 <Qiming> okay, so we did find some inconsistencies in apis
13:02:43 <yanyanhu> elynn, my pleasure :)
13:02:57 <yanyanhu> also found some issues about our API implementation when writing the test
13:03:00 <Qiming> elynn, posted some comments to your latest patches
13:03:02 <yanyanhu> Qiming, yes
13:03:06 <yanyanhu> that is valuable
13:03:36 <elynn> Qiming: will check :)
13:03:50 <Qiming> tempest dsvm gate is not very slow, right?
13:03:56 <yanyanhu> yes
13:04:17 <Qiming> great
13:04:35 <Qiming> rally side
13:04:47 <Qiming> patch 318453 was in
13:04:49 <yanyanhu> gate job is finally ready
13:05:15 <Qiming> you mean gate job at senlin side?
13:05:18 <yanyanhu> and the 301522 works well now
13:05:21 <yanyanhu> both sides
13:05:25 <yanyanhu> in rally and senlin
13:05:41 <yanyanhu> just need to address the rally team's question about that patch
13:05:58 <yanyanhu> but no critical obstacle I think
13:05:58 <Qiming> we don't rely on 301522 to run gate at senlin side, right?
13:06:07 <yanyanhu> yes
13:06:15 <yanyanhu> that is for rally repo
13:06:24 <Qiming> my question is about the gate failures we saw when doing 'check experimental' at senlin side
13:06:56 <yanyanhu> actually 307170 works as well
13:07:03 <yanyanhu> you mean this one? https://review.openstack.org/307170
13:07:23 <Qiming> oh?
13:07:43 <Qiming> that is the first time we have gate-rally-dsvm-senlin-senlin working!!
13:08:01 <yanyanhu> sorry, I was dropped
13:08:06 <yanyanhu> Qiming, yes :)
13:08:08 <elynn> This name is a little weird...
13:08:26 <yanyanhu> elynn, yes if you mean senlin-senlin.yaml :)
13:08:40 <Qiming> yes, let's get it in and fix it later?
13:08:58 <yanyanhu> that is because we try to match gate-dsvm-rally-senlin-{name} job template
13:09:29 <yanyanhu> Qiming, ok, I think I may need to do some cleanup before that patch becomes ready
13:09:42 <Qiming> okay
13:09:48 <yanyanhu> will remove [WIP] when it's ok
13:09:53 <elynn> double senlin, now we have amazon :P
13:10:04 <yanyanhu> :P
13:10:36 <yanyanhu> and will discuss with eldon about their test based on rally
13:10:36 <Qiming> better rename that... it is strange, indeed
13:10:42 <yanyanhu> hope we can provide some help for them
13:11:08 <yanyanhu> Qiming, yes, I think maybe we can propose another job template in future
13:12:03 <Qiming> sure, we can work together on stress tests
13:12:18 <elynn> Anyway, now we have many eyes on gates :)
13:12:34 <yanyanhu> yes
13:12:48 <Qiming> moving on
13:12:55 <Qiming> health management
13:13:15 <Qiming> I saw xinhui has taken over that lbaas bug
13:13:25 <Qiming> patch proposed now
13:13:35 <yanyanhu> cool
13:13:47 <lixinhui_> yes
13:13:52 <lixinhui_> I am doing that
13:13:59 <Qiming> many thanks
13:14:07 <lixinhui_> my pleasure
13:14:19 <lixinhui_> will contribute more in next weeks
13:14:29 <Qiming> there are still some gate failures
13:14:52 <lixinhui_> en
13:15:08 <Qiming> I have added health monitoring by listening to vm lifecycle events
13:15:23 <Qiming> it took me quite some time to understand the filtering logic
13:15:45 <lixinhui_> congrats!
13:15:49 <lixinhui_> you got it
13:15:56 <Qiming> there are still things unstable inside oslo.messaging, complaining about some regex matching failures
13:15:56 <lixinhui_> will learn from you
13:16:23 <Qiming> anyway, we can get notified when vm status was modified (reboot, start, stop, ...)
13:16:32 <Qiming> next thing is to trigger some actions
13:16:41 <lixinhui_> is that reliable?
13:16:42 <Qiming> will dive into that
13:16:53 <lixinhui_> I mean the listening
13:17:18 <Qiming> listening is reliable, just some initialization wasn't complete I guess
13:17:28 <lixinhui_> ...
13:17:37 <Qiming> when I restart the engine, the listeners are created, but not receiving events
13:17:54 <Qiming> maybe need to do some fuzzy delay
13:18:02 <lixinhui_> ok
13:19:09 <Qiming> as for health threshold
13:19:21 <Qiming> I'm thinking of using desired_capacity
13:19:44 <lixinhui_> could you explain more?
13:20:06 <lixinhui_> if time permits
13:20:37 <Qiming> make desired_capacity the health threshold
13:21:04 <Qiming> a cluster is treated healthy if the number of active nodes >= desired_capacity
13:22:15 <Qiming> make sense?
13:22:16 <lixinhui_> okay
13:22:21 <lixinhui_> too harsh?
13:22:37 <yanyanhu> Qiming, if so, the node number could go beyond desired_capacity?
13:22:52 <Qiming> yes, between max_size and desired_capacity
13:23:26 <xuhaiwei__> in what case would the node number be bigger than desired_capacity?
13:23:43 <yanyanhu> hmm, sounds a little different from our discussion in summit
13:24:10 <Qiming> when you do cluster check, there are some nodes not responding
13:24:25 <Qiming> when you do some operations later, those nodes come back to life
13:25:17 <Qiming> did we have any discussion about the health threshold during summit?
13:25:28 <xuhaiwei__> so the desired_capacity only contains the alive nodes?
13:25:34 <yanyanhu> yes, so will the total number of healthy nodes finally match the desired_capacity
13:25:44 <Qiming> desired is always the 'desired'
13:25:47 <yanyanhu> Qiming, no, it's not about health threshold
13:25:55 <yanyanhu> about the scaling basement
13:26:16 <Qiming> it is not the number of actually active nodes, we can never assume so
13:26:50 <yanyanhu> Qiming, yes, that's what I mean. I think the case where the total number of nodes goes beyond desired_capacity is kind of a transient status?
13:27:11 <Qiming> if you don't do something, those nodes will be there
13:27:23 <yanyanhu> finally, the number of healthy nodes will be desired_capacity
13:27:30 <Qiming> there are transient states where some nodes are not active when you are checking them
13:28:05 <Qiming> the question is why are we maintaining the number of healthy nodes?
13:28:45 <Qiming> we are already not so sure about the number of active nodes, considering that there are transient problems
13:29:14 <Qiming> what we do care about is "whether the cluster is healthy"
13:29:30 <Qiming> which means there are enough nodes to share workloads
13:29:51 <Qiming> by "enough" here, we mean that number of active nodes >= desired capacity
13:30:07 <Qiming> yes, it is a bit harsh
13:30:21 <yanyanhu> yes. I think the user specifies the desired_capacity, which is the number of healthy nodes they want to have, so we should try to make the actual number of active nodes match it?
13:30:31 <Qiming> but is there any way to maintain another statistic?
13:31:08 <Qiming> assume you are the user, when you are specifying desired_capacity, what are you thinking?
13:31:27 <yanyanhu> hmm, I want a cluster with this number of nodes being active
13:31:32 <yanyanhu> healthy
13:32:25 <Qiming> there could be a case where a user wants to create a cluster of 10 nodes, but 5 nodes is okay for him/her
13:32:44 <yanyanhu> yes, that is possible
13:32:54 <Qiming> why is he doing that?
13:33:19 <Qiming> if 5 is okay, then 5 is the min_size, right?
13:34:05 <yanyanhu> hmm, I think that means in any cases, the cluster size should not be less than 5
13:34:22 <Qiming> yes, then 5 is actually the min_size
13:34:24 <yanyanhu> that is not directly related to health management
13:34:31 <yanyanhu> yes, 5 is the min_size
13:34:44 <Qiming> if the cluster is dropping below that level, the cluster is in error status
13:35:01 <yanyanhu> I think that's two cases
13:35:03 <Qiming> if cluster size is between min_size and desired_capacity, we can treat it as warning
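The status rule proposed at this point in the discussion can be sketched as below. This is only an illustration of the proposal, not senlin code: the function name and the status strings are hypothetical, and the meeting did not settle on this definition.

```python
def cluster_health(active_nodes, min_size, desired_capacity):
    """Classify a cluster by its number of active nodes.

    Hypothetical helper, not part of senlin. It restates the proposal:
    below min_size is an error, between min_size and desired_capacity
    is a warning, and at or above desired_capacity is healthy.
    """
    if active_nodes < min_size:
        return "ERROR"
    if active_nodes < desired_capacity:
        return "WARNING"
    return "HEALTHY"


# Example: a cluster created with min_size=5 and desired_capacity=10.
for count in (4, 7, 10, 12):
    print(count, cluster_health(count, min_size=5, desired_capacity=10))
```

Under this rule no extra threshold is needed: the two numbers a user already supplies decide the status, which is the point Qiming argues for.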
13:35:25 <yanyanhu> if cluster size is less than 5, that means internal error happened in senlin side
13:35:25 <Qiming> it is all about how we define the status of a cluster
13:35:41 <Qiming> users don't care what happened
13:35:56 <yanyanhu> yes, understand what you mean, just feel we shouldn't mix health management case and scaling case
13:36:01 <Qiming> maybe some nova nodes crashed
13:36:11 <Qiming> you cannot say it is senlin's fault
13:36:16 <yanyanhu> IMHO, min_size/max_size/desired is about scaling cases
13:36:41 <Qiming> they are properties you specify when you create a cluster
13:36:52 <Qiming> no matter whether you will scale that cluster or not
13:37:07 <yanyanhu> yes, that's the hard limit of cluster size
13:37:19 <yanyanhu> no matter whether the cluster has HA management support or not
13:37:28 <Qiming> exactly
13:38:07 <Qiming> so ... I'm wondering if we do want to introduce another threshold into senlin at the moment
13:38:14 <yanyanhu> so I think desired_capacity is something related to HA since it's user desired
13:38:39 <Qiming> and we can never make sure it matches the reality
13:38:43 <yanyanhu> Qiming, I agree we consider desired_capacity a health related property
13:38:51 <yanyanhu> but min_size, max_size could not be
13:39:21 <Qiming> okay, agree to disagree
13:39:30 <Qiming> let's think about it offline
13:39:38 <yanyanhu> ok
13:39:51 <yanyanhu> we can have a further discussion tomorrow :)
13:39:56 <Qiming> I'd suggest we forget all the actions/policies we have in senlin
13:39:59 <yanyanhu> really need more thinking about it
13:40:16 <Qiming> just think from a user's perspective, what makes better sense for them
13:40:26 <yanyanhu> agree with this
13:40:40 <lixinhui_> jealous
13:40:46 <yanyanhu> a clear definition from user perspective is the most important
13:40:54 <lixinhui_> you can discuss face to face
13:41:07 <Qiming> we will discuss on irc
13:41:11 <lixinhui_> :)
13:41:15 <yanyanhu> lixinhui_, you can come here, someone will buy you coffee :P
13:41:24 <lixinhui_> cool!
13:41:31 <lixinhui_> or you can come to VMware
13:41:37 <lixinhui_> tomorrow we have happy hour
13:41:41 <yanyanhu> for free coffee :)
13:41:59 <lixinhui_> :P
13:42:01 <Qiming> we can define min_size, health_watermark, desired_capacity and max_size
13:42:13 <Qiming> try if you can explain all these four numbers to users
13:42:30 <yanyanhu> hmm, need more thinking on it
13:42:43 <Qiming> okay, let's move on
13:43:06 <Qiming> any news from you lixinhui_ on health management?
13:43:18 <lixinhui_> Sorry, Qiming
13:43:30 <lixinhui_> I will try to contribute more in the following weeks
13:43:32 <yanyanhu> last sentence from me about this issue: maybe we should reconsider why users define min_size/max_size and whether and when they really need it :)
13:43:39 <lixinhui_> too distracted
13:43:49 <Qiming> no worry, just ask questions, in case we have been moving too fast
13:44:21 <lixinhui_> :)
13:44:28 <Qiming> no update on documentation from me
13:44:38 <Qiming> container support
13:44:54 <xuhaiwei__> I submitted a patch
13:45:06 <xuhaiwei__> initialize docker driver
13:45:38 <Qiming> saw some patches from haiwei, I think we have been mixing things in a strange way
13:46:05 <Qiming> will have a closer look at the patch
13:46:31 <xuhaiwei__> yes, please comment on it
13:46:44 <Qiming> notification/event side, some basics are there
13:47:05 <Qiming> need some serializers and an example to encapsulate a notification into an object
13:47:20 <Qiming> then dispatch that object to oslo.messaging or db
13:47:29 <Qiming> will continue work on that
13:47:41 <Qiming> zaqar work is stalled
13:48:06 <Qiming> that's all from the etherpad
13:48:15 <Qiming> things to add?
13:48:39 <yanyanhu> nope, really lots of work items
13:49:03 <Qiming> #topic senlin cluster-do operation
13:49:27 <Qiming> here is the patch: https://review.openstack.org/#/c/326208/
13:49:51 <Qiming> we are adding OPERATIONS to a profile definition
13:50:00 <Qiming> it is not exposed to users for customization
13:50:30 <Qiming> but implementation wise, we are modeling operation parameters using schemas
13:50:46 <yanyanhu> oh, it's for this purpose
13:50:54 <yanyanhu> I didn't get it when I first saw it
13:51:04 <Qiming> so an operation can be easily verified when we get a JSON containing the operation requested
13:51:10 <yanyanhu> will check it
13:51:15 <yanyanhu> yep
13:51:27 <yanyanhu> that's a nice wrap
13:51:31 <Qiming> an operation request could be {"reboot": {"type": "HARD"}}
13:51:55 <Qiming> when users input senlin cluster-do help cluster1
13:52:26 <Qiming> we can iterate through the OPERATIONS dict and return a help text --- here are operations you can try
13:52:49 <Qiming> just like when you do senlin profile-type-show <a-profile-type>
13:52:51 <yanyanhu> nice
13:53:23 <Qiming> in the case of a nova server cluster, you can do 'senlin cluster-do reboot --type=HARD cluster1'
13:53:57 <Qiming> command wise, we can add more parameters so that users can reboot nodes with specific roles, but those can be added later
13:54:26 <Qiming> parameters are checked just like profile/policy properties, they have data types
13:54:34 <lixinhui_> :)
13:54:36 <Qiming> the only difference is that they are not 'updatable'
13:54:55 <Qiming> that is why I revised the common schema module
13:55:11 <yanyanhu> yes, saw that patch
13:55:17 <Qiming> in future there could be some extensions to Operation schema, today it is only just a Map
13:55:32 <Qiming> that's some background about that patch
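A minimal sketch of the idea described above: operations declared in a table, a JSON request checked against it, and help text generated by walking the same table. It deliberately uses plain dicts rather than senlin's schema module, so every name here (OPERATIONS layout, validate_operation, operations_help) and the "SOFT" value and default are assumptions for illustration only.

```python
# Hypothetical stand-in for a profile's OPERATIONS table. The real patch
# models parameters with senlin's schema module, which is not used here.
OPERATIONS = {
    "reboot": {
        "description": "Reboot the nodes of the cluster.",
        "params": {
            # "SOFT" and the default are assumed; the log only shows "HARD".
            "type": {"allowed": ("SOFT", "HARD"), "default": "SOFT"},
        },
    },
}


def validate_operation(request):
    """Check a request such as {"reboot": {"type": "HARD"}}.

    Returns (operation name, parameters with defaults filled in),
    or raises ValueError if the request does not match the table.
    """
    if not isinstance(request, dict) or len(request) != 1:
        raise ValueError("request must name exactly one operation")
    (name, params), = request.items()
    if name not in OPERATIONS:
        raise ValueError("unknown operation: %s" % name)
    params = params or {}
    if not isinstance(params, dict):
        raise ValueError("operation parameters must be a mapping")
    spec = OPERATIONS[name]["params"]
    for key, value in params.items():
        if key not in spec:
            raise ValueError("unknown parameter: %s" % key)
        if value not in spec[key]["allowed"]:
            raise ValueError("invalid value for %s: %r" % (key, value))
    # Fill in defaults for parameters the caller left out.
    filled = {key: params.get(key, rule["default"])
              for key, rule in spec.items()}
    return name, filled


def operations_help():
    """Build a help text by iterating through the OPERATIONS table."""
    lines = []
    for name in sorted(OPERATIONS):
        op = OPERATIONS[name]
        lines.append("%s: %s" % (name, op["description"]))
        for key in sorted(op["params"]):
            allowed = ",".join(op["params"][key]["allowed"])
            lines.append("  --%s {%s}" % (key, allowed))
    return "\n".join(lines)


print(validate_operation({"reboot": {"type": "HARD"}}))
print(operations_help())
```

Keeping the declaration and the validation in one table is what lets the same data drive both request checking and the help output.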
13:55:39 <Qiming> #topic open discussions
13:55:56 <Qiming> I think we have covered the health management part
13:56:16 <lixinhui_> yes
13:56:18 <Qiming> and ... 1 hour is definitely not enough for discussion
13:56:29 <lixinhui_> nod
13:56:30 <yanyanhu> yes :)
13:56:38 <Qiming> need some homework before we discuss it again
13:56:57 <Qiming> pls think from user's perspective
13:56:58 <Qiming> :)
13:57:00 <yanyanhu> will think about it as well
13:57:06 <lixinhui_> :)
13:57:06 <yanyanhu> yes
13:57:15 <xuhaiwei__> ok
13:58:29 <Qiming> oh, don't know if you have noticed it
13:58:43 <Qiming> we have had senlin 2.0.0.0b1 released last Friday
13:58:53 <Qiming> senlinclient 0.5.0 released today
13:59:08 <xuhaiwei__> saw it
13:59:11 <Qiming> senlinclient version jump was based on release team's suggestion
13:59:22 <Qiming> leave some version numbers for back-port
13:59:30 <Qiming> em.
13:59:35 <yanyanhu> in global requirement?
13:59:45 <Qiming> not yet proposed to global requirements
13:59:50 <yanyanhu> I see
13:59:59 <Qiming> feel free to do so
14:00:03 <Qiming> #endmeeting