13:00:32 <Qiming> #startmeeting senlin
13:00:33 <openstack> Meeting started Tue Aug 16 13:00:32 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:36 <openstack> The meeting name has been set to 'senlin'
13:01:02 <Qiming> hello
13:01:09 <elynn> Hi
13:01:12 <yanyan> hi
13:01:15 <guoshan> hey, yo
13:01:21 <lixinhui_> hi
13:01:32 <Qiming> I don't have extra items to put into the agenda
13:01:42 <Qiming> if you have got one or two, please update it here:
13:01:51 <Qiming> https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Weekly_Senlin_.28Clustering.29_meeting
13:01:59 <Qiming> evening everyone
13:02:18 <Qiming> #topic newton work items
13:02:30 <Qiming> #link https://etherpad.openstack.org/p/senlin-newton-workitems
13:02:44 <Qiming> let's start with the etherpad, as usual
13:03:15 <yanyan> hi, about rally support
13:03:44 <yanyan> I'm working on contexts for senlin resources, like profile or cluster, per roman's suggestion
13:04:04 <yanyan> the profile context patch has been proposed and roman has left some comments there
13:04:12 <yanyan> will keep working on it in the coming week
13:04:49 <yanyan> after that will be the cluster context, which will be useful for test cases like cluster scaling and resizing
13:04:56 <Qiming> any plan on how many test cases we want to land in rally?
13:05:38 <yanyan> currently, I plan for cluster creating, resizing, scaling, deleting, node creating, deleting
13:05:43 <yanyan> and also lb policy related ones
13:05:58 <yanyan> since those operations could have concurrency issues
13:06:05 <Qiming> are lb related test cases relevant to performance?
13:06:22 <yanyan> em, actually it's more about lbaas's performance I feel
13:06:51 <Qiming> exactly, I'm a little bit concerned about lbaas/octavia's stability
13:07:00 <yanyan> but since there is still a chance that a senlin problem causes concurrency issues, maybe we need related test cases
13:07:11 <yanyan> Qiming, yes, that could be a problem
13:07:19 <Qiming> if it is not stable, then we are propagating that problem into rally
13:07:30 <Qiming> it seems to me the rally gate is already fragile
13:08:02 <yanyan> Qiming, yes. But users can also run tests locally using those test scenarios
13:08:04 <Qiming> rally is for performance tests, maybe we can use tempest for concurrency tests?
13:08:08 <yanyan> which could be more practical
13:08:16 <Qiming> yes
13:08:22 <yanyan> Qiming, yes, sorry my expression was not accurate
13:08:27 <Qiming> that makes more sense to me
13:08:47 <yanyan> so, that is my current plan
13:09:04 <Qiming> sorry I haven't got cycles to spend on that patch 346656
13:09:14 <Qiming> great
13:09:17 <yanyan> will also listen to users' voices, like the voice from cmcc guys :)
13:09:31 <yanyan> Qiming, no problem
13:09:40 <yanyan> it is now blocked on context support
13:09:53 <yanyan> will revise it later after profile/cluster context is ready
13:10:12 <Qiming> okay, that is a technical problem, it can be fixed eventually when we work harder on it
13:10:23 <yanyan> yes
13:10:25 <Qiming> not so worried about it
13:10:37 <yanyan> :)
13:10:39 <Qiming> next thing is integration test
13:10:56 <yanyan> yes, it still doesn't work correctly...
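
(The profile context yanyan describes above would, in rough shape, be a Rally context plugin along the lines below. This is a minimal sketch only: the plugin name senlin_profiles, the senlin_client() helper and the context keys are assumptions, not the patch actually under review.)

    from rally.task import context

    # Hypothetical helper, standing in for however the real patch obtains
    # an authenticated senlin client from a credential.
    def senlin_client(credential):
        raise NotImplementedError("client wiring is omitted in this sketch")


    @context.configure(name="senlin_profiles", order=450)
    class SenlinProfileContext(context.Context):
        """Pre-creates a senlin profile and exposes its id to scenarios."""

        CONFIG_SCHEMA = {
            "type": "object",
            "properties": {"spec": {"type": "object"}},
            "additionalProperties": False,
        }

        def setup(self):
            # The "admin"/"credential" layout follows the usual rally
            # context dict; adjust to the deployment under test.
            client = senlin_client(self.context["admin"]["credential"])
            profile = client.create_profile(name="rally-senlin-profile",
                                            spec=self.config.get("spec", {}))
            # Cluster create/scale/resize scenarios read this id later.
            self.context["senlin_profile_id"] = profile.id

        def cleanup(self):
            client = senlin_client(self.context["admin"]["credential"])
            client.delete_profile(self.context["senlin_profile_id"])

(A task file would then list "senlin_profiles" in its context section next to the cluster scenarios, so every run gets a profile created in setup() and removed in cleanup().)
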
13:10:59 <yanyan> need further fixes
13:11:01 <Qiming> I haven't checked any integration test log recently, because they are not voting ;)
13:11:05 <yanyan> I have proposed the patch
13:11:16 <yanyan> Qiming, it is ok I think
13:11:24 <yanyan> we can make it vote after it works correctly
13:11:44 <Qiming> sure
13:11:46 <yanyan> https://review.openstack.org/354566
13:12:06 <yanyan> also added zaqar support in this patch for the coming test of message support
13:12:45 <Qiming> okay
13:13:07 <Qiming> moving on
13:13:13 <Qiming> health management
13:13:28 <Qiming> I'm working on fixing the health manager
13:13:54 <Qiming> this one is critical: https://review.openstack.org/355743
13:14:20 <Qiming> and it is the reason why sometimes I cannot get any nova notifications from listeners
13:14:39 <Qiming> this one is an optimization: https://review.openstack.org/355751
13:15:01 <Qiming> it is about postponing hm initialization until after service table cleansing
13:15:15 <Qiming> the latest one is this: https://review.openstack.org/355790
13:15:25 <yanyan> so the dead service blocks those notifications
13:15:48 <Qiming> it is about dynamically disabling/enabling listening to a specific cluster
13:16:01 <Qiming> yes, a dead service may still have db records left
13:16:10 <yanyan> I see
13:16:27 <yanyan> yes, those services will be cleaned when a new service is initialized
13:16:35 <Qiming> when hm is trying to pick a cluster for listening, it will skip all those that have a service record in db
13:16:52 <Qiming> yes, any new engine restart will cleanse the table
13:17:05 <yanyan> I see
13:17:38 <Qiming> dynamic enabling/disabling is useful when we want to fully support the health policy (thus an HA usage scenario)
13:17:48 <Qiming> when I'm doing a cluster-scale-in
13:18:00 <Qiming> some nodes are supposed to be deleted from the cluster
13:18:18 <yanyan> sure, it is very useful I think
13:18:29 <Qiming> and those nodes (assuming they are nova servers) will generate notifications when they are killed
13:18:52 <Qiming> the health policy has to know that these events are generated intentionally
13:19:48 <Qiming> so ... eventually, the health policy may have to get itself hooked into those actions that are shrinking a cluster
13:20:28 <Qiming> before such actions happen, it will dynamically disable itself, and after such actions are completed, it dynamically restores health maintenance
13:20:29 <yanyan> yes, so health policy should take effect when this kind of operation is performed
13:20:38 <yanyan> s/when/before
13:20:58 <Qiming> or else, you will get a cluster with nodes you cannot delete
13:21:07 <Qiming> it is always recovered by the health policy
13:21:20 <yanyan> or hm ignores those events?
13:21:47 <Qiming> hm "suspends" itself when those actions are performed and resumes after those actions are completed
13:22:42 <yanyan> Qiming, yes. Just in case node failure happens during cluster scaling, those failure events will be omitted
13:23:13 <Qiming> due to the fact we have at least two types of "checkers", it is hard to inject this info directly into the checkers
13:23:13 <yanyan> I mean those "unexpected" failure events
13:23:43 <Qiming> em ... health policy is not designed to handle senlin action failures
13:23:45 <yanyan> checkers, you mean?
13:24:17 <Qiming> one checker is the poller (for vm status), another type of checker is the event queue listener
13:24:25 <yanyan> Qiming, sorry, I didn't express it clearly. I mean a nova server is down during cluster scaling
13:24:33 <Qiming> maybe in future we can add a 'pinger' ...
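
(The "dynamically disable/enable listening to a specific cluster" idea in 355790, combined with the event-queue-listener checker mentioned above, could be organized roughly as below. The class names, the notification event type and the executor are illustrative assumptions; this is not the code in those reviews.)

    import oslo_messaging as messaging
    from oslo_config import cfg


    class NovaEventEndpoint(object):
        """Handles nova notifications on behalf of one monitored cluster."""

        def __init__(self, cluster_id, recover_func):
            self.cluster_id = cluster_id
            self.recover_func = recover_func

        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Which event types to react to, and how to tell intentional
            # deletions from real failures, is the open question discussed
            # in the meeting; the filter here is only an example.
            if event_type == 'compute.instance.delete.end':
                self.recover_func(self.cluster_id, payload)


    class ClusterListenerRegistry(object):
        """Keeps one notification listener per cluster, so listening can be
        suspended for a specific cluster (e.g. while it is being scaled in)
        and resumed afterwards."""

        def __init__(self):
            self.transport = messaging.get_notification_transport(cfg.CONF)
            self.listeners = {}  # cluster_id -> running listener

        def enable(self, cluster_id, recover_func):
            if cluster_id in self.listeners:
                return
            targets = [messaging.Target(topic='notifications',
                                        exchange='nova')]
            endpoints = [NovaEventEndpoint(cluster_id, recover_func)]
            listener = messaging.get_notification_listener(
                self.transport, targets, endpoints, executor='threading')
            listener.start()
            self.listeners[cluster_id] = listener

        def disable(self, cluster_id):
            listener = self.listeners.pop(cluster_id, None)
            if listener is not None:
                listener.stop()
                listener.wait()
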
13:24:40 <Qiming> right
13:24:44 <yanyan> and that down event is expected to be caught by hm
13:24:53 <Qiming> exactly
13:25:12 <yanyan> Qiming, I see
13:25:17 <Qiming> hm has to know whether a vm is down due to a request, or due to exceptions/errors
13:25:49 <Qiming> it should do recovery only for the latter case, but it is currently receiving all VM down events
13:26:14 <yanyan> Qiming, yes, unless we have a "list" of VMs we need to monitor
13:26:51 <Qiming> even with that
13:27:14 <Qiming> when you are getting an event about VM1 being shut down
13:27:27 <Qiming> that cluster is already monitored
13:27:36 <Qiming> how would you follow up?
13:28:00 <Qiming> the key is that hm should be able to deduce the reason behind a vm "failure"
13:28:02 <yanyan> hmm, so the monitoring may be done with node granularity?
13:28:07 <yanyan> yes
13:28:33 <Qiming> the events are per-node (vm)
13:28:41 <Qiming> you have got to filter them
13:28:48 <yanyan> yes
13:28:54 <yanyan> filtering is needed
13:29:21 <Qiming> it is not about whether you CAN get a vm notification
13:29:36 <Qiming> it is about WHAT you should do after getting such a notification
13:30:12 <yanyan> I see. You mean hm collects all events, then decides which of them should be reacted to
13:30:28 <Qiming> so the easier path to attack this is to suspend health checking during cluster shrinkage
13:30:39 <Qiming> hm cannot be that smart
13:30:48 <Qiming> it is a separate rpc server, ...
13:31:13 <yanyan> got it
13:31:29 <Qiming> it needs too much information to make a call
13:31:51 <Qiming> so the flow would be something like this
13:32:52 <Qiming> scale_in request -> scale-in action -> cluster locked -> hm disables itself (if enabled) -> scale in -> hm re-enables itself -> cluster unlocked -> action complete
13:33:40 <yanyan> Qiming, sounds reasonable
13:33:52 <Qiming> VM failures during action execution cannot be handled, it is a pity
13:34:14 <yanyan> just one thing may need attention: the hm could be disabled for a long while if the scaling action hangs for some reason
13:34:32 <yanyan> Qiming, yes
13:34:37 <Qiming> anyway, that is the current design, will continue trying things out and see if the loop can be closed soon
13:34:50 <yanyan> ok
13:35:01 <yanyan> looks great
13:35:08 <Qiming> yes, during that period, you are not supposed to enable health recovery
13:35:37 <Qiming> cluster status is in transition
13:36:08 <Qiming> okay,
13:36:23 <Qiming> we have got some fencing code proposed, please help review
13:36:35 <Qiming> and xinhui is pushing more to gerrit
13:36:41 <yanyan> yes, will check it
13:36:58 <elynn> will check it later
13:37:08 <Qiming> profile/policy version control, again, sorry, didn't get time to look into it
13:37:29 <yanyan> no problem :)
13:37:41 <Qiming> container profile fixes ...
13:37:42 <yanyan> it's not that urgent
13:38:00 <Qiming> mostly look okay
13:38:29 <Qiming> receiver
13:38:39 <Qiming> sdk side blocked
13:38:54 <yanyan> yes, have posted a patch to add claim support
13:39:01 <Qiming> both patches look good now
13:39:06 <yanyan> but need to add test cases
13:39:14 <Qiming> just need another +2
13:39:38 <Qiming> events/notifications .... I am not aware of any activities therein
13:39:42 <yanyan> yes, after this support is done, will try to work on the message based receiver
13:39:53 <Qiming> so, that's the etherpad
13:40:15 <yanyan> Qiming, another thing I want to mention is that senlinclient is broken by the latest sdk change, I think...
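
(The scale-in flow Qiming lays out at 13:32:52 maps onto the policy hook points roughly as sketched below. The health_manager.disable()/enable() helpers are assumed here, standing in for whatever interface the design finally exposes, and the action-name tuple is illustrative; this is not the real health policy code.)

    from senlin.engine import health_manager  # disable()/enable() assumed
    from senlin.policies import base

    # Actions that shrink a cluster and therefore delete nodes on purpose.
    SHRINK_ACTIONS = ('CLUSTER_SCALE_IN', 'CLUSTER_DEL_NODES',
                      'CLUSTER_RESIZE')


    class HealthPolicySketch(base.Policy):
        """Illustrates the suspend/resume hooks only."""

        def pre_op(self, cluster_id, action):
            # The cluster is already locked by the action at this point;
            # stop health checking before nodes are deleted intentionally,
            # so their "VM down" notifications do not trigger recovery.
            if action.action in SHRINK_ACTIONS:
                health_manager.disable(cluster_id)

        def post_op(self, cluster_id, action):
            # The shrink has completed; resume checking before the cluster
            # lock is released and the action reports completion.
            if action.action in SHRINK_ACTIONS:
                health_manager.enable(cluster_id)

(As noted in the meeting, VM failures that happen while checking is suspended would go unhandled; that is the trade-off of this approach.)
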
13:40:31 <yanyan> it works correctly with sdk 0.9.2
13:40:34 <Qiming> last week I was reworking the exception catching logic for profiles, now it's all settled, please test
13:40:39 <yanyan> but broken using master
13:40:48 <Qiming> yes?
13:40:52 <yanyan> Qiming, sure, saw that huge work :)
13:40:54 <yanyan> yes
13:41:02 <yanyan> elynn and I tested it this afternoon
13:41:26 <Qiming> okay, I checked the bouncer ...
13:41:35 <yanyan> the error just happened at the client side
13:41:53 <Qiming> the client side has some patches not reviewed
13:42:03 <yanyan> ok, will check them
13:42:15 <Qiming> oh, all self approved I think
13:42:53 <Qiming> are you using the latest client with the latest sdk?
13:42:59 <yanyan> yes
13:43:05 <yanyan> both are latest code
13:43:07 <Qiming> the error message?
13:43:16 <yanyan> I can't remember it, I
13:43:24 <yanyan> I'm now using my own laptop :)
13:43:44 <yanyan> something like "unexpected header-*** attr"
13:43:53 <yanyan> can't recall it exactly
13:44:20 <yanyan> hi, elynn, do you have the log of the error?
13:44:22 <Qiming> the additional_headers problem could be caused by the keystoneauth version
13:44:39 <yanyan> oh
13:45:04 <Qiming> in global requirements, it has been bumped to "keystoneauth1>=2.10.0"
13:45:18 <Qiming> don't confuse keystoneauth with keystoneauth1
13:45:23 <elynn> yanyan, no, I'm using another computer now.
13:45:31 <Qiming> the latter one is the correct package
13:45:34 <yanyan> elynn, I see. thanks
13:45:48 <yanyan> yes
13:46:29 <yanyan> so is there any fix we need to apply on the client?
13:46:35 <Qiming> no
13:46:59 <Qiming> check this: https://review.openstack.org/#/c/343992/6/openstack/session.py
13:47:31 <Qiming> line 96 was using the additional_headers parameter added to the latest version of keystoneauth1
13:47:49 <Qiming> that is the way we pass the additional header "openstack-api-version: clustering 1.2" to the server
13:48:02 <Qiming> thus we are getting the cluster-collect operation to work properly
13:48:10 <yanyan> ok
13:48:44 <Qiming> if you are seeing errors about set_api_version, that means your sdk version is too old
13:48:56 <yanyan> hmm, that is weird
13:49:02 <yanyan> will test it again tomorrow
13:49:13 <Qiming> it is also added in this patch: https://review.openstack.org/#/c/343992/6/openstack/profile.py
13:49:25 <yanyan> I have cleaned my sdk installation and reinstalled it
13:49:55 <Qiming> senlinclient now will automatically add a 'clustering 1.2' header value to indicate it knows the cluster-collect call
13:50:14 <yanyan> I see. Will have a try tomorrow
13:50:27 <Qiming> okay
13:52:12 <Qiming> anything else?
13:52:22 <yanyan> nothing from me
13:52:34 <Qiming> guoshan, thanks for joining
13:52:45 <Qiming> are you based in China?
13:53:00 <guoshan> yes
13:53:13 <yanyan> guoshan, welcome :)
13:53:15 <Qiming> I see, so it is 21:53 for you too
13:53:26 <guoshan> yep :)
13:53:30 <yanyan> maybe some self introduction?
13:53:32 <yanyan> :)
13:53:59 <Qiming> I'm not single, but always available, period.
13:54:11 <yanyan> haha
13:54:49 <elynn> zan
13:55:32 <guoshan> I work at Awcloud, based in Wuhan. Newcomer to senlin.
13:56:07 <guoshan> And I am single and available :)
13:56:11 <yanyan> welcome, any questions, please feel free to ping us in the senlin channel :)
13:56:18 <Qiming> yep
13:56:22 <yanyan> :P
13:56:31 <elynn> haha, welcome guoshan
13:56:55 <Qiming> 4 mins left
13:57:10 <elynn> btw, I'm working on profile/policy validation, please help to review those patches if you have time.
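
(To reproduce the microversion behaviour Qiming describes above outside the sdk, the header can be sent directly through a keystoneauth1 (>= 2.10.0) session using the additional_headers parameter mentioned in the discussion. The credentials, auth URL and request path below are placeholders.)

    from keystoneauth1 import loading, session

    # Placeholder credentials and URL; adjust to your cloud.
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://keystone.example.com:5000/v3',
        username='demo', password='secret', project_name='demo',
        user_domain_id='default', project_domain_id='default')

    # additional_headers is the keystoneauth1 >= 2.10.0 parameter the sdk
    # patch relies on; every request via this session carries the header.
    sess = session.Session(
        auth=auth,
        additional_headers={'openstack-api-version': 'clustering 1.2'})

    # The path is a placeholder; point it at the clustering endpoint to
    # check whether microversioned calls such as cluster-collect are
    # accepted by the server.
    resp = sess.get('/v1/clusters',
                    endpoint_filter={'service_type': 'clustering'})
    print(resp.status_code)
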
13:57:41 <yanyan> elynn, sure, great work
13:58:08 <yanyan> very glad you have time to work on it recently :)
13:58:25 <Qiming> thanks, elynn
13:58:34 <elynn> Finally I got some spare time.
13:58:50 <yanyan> :)
13:58:52 <elynn> np Qiming
13:59:33 <Qiming> if you don't have other things to talk about, we can release the channel now
13:59:40 <Qiming> thanks for joining
13:59:45 <Qiming> #endmeeting