13:00:32 <Qiming> #startmeeting senlin
13:00:33 <openstack> Meeting started Tue Aug 16 13:00:32 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:36 <openstack> The meeting name has been set to 'senlin'
13:01:02 <Qiming> hello
13:01:09 <elynn> Hi
13:01:12 <yanyan> hi
13:01:15 <guoshan> hey, yo
13:01:21 <lixinhui_> hi
13:01:32 <Qiming> I don't have extra items to put into the agenda
13:01:42 <Qiming> if you have got one or two, please update it here:
13:01:51 <Qiming> https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Weekly_Senlin_.28Clustering.29_meeting
13:01:59 <Qiming> evening everyone
13:02:18 <Qiming> #topic newton work items
13:02:30 <Qiming> #link https://etherpad.openstack.org/p/senlin-newton-workitems
13:02:44 <Qiming> let's start with the etherpad, as usual
13:03:15 <yanyan> hi, about rally support
13:03:44 <yanyan> I'm working on contexts for senlin resources, like profile or cluster, per roman's suggestion
13:04:04 <yanyan> the profile context patch has been proposed and roman has left some comments there
13:04:12 <yanyan> will keep working on it in the coming week
13:04:49 <yanyan> after that will be the cluster context, which will be useful for test cases like cluster scaling and resizing
13:04:56 <Qiming> any plan on how many test cases we want to land in rally?
13:05:38 <yanyan> currently, I plan for cluster creating, resizing, scaling, deleting, node creating, deleting
13:05:43 <yanyan> and also lb policy related ones
13:05:58 <yanyan> since those operations could have concurrency issues
13:06:05 <Qiming> are lb related test cases relevant to performance?
13:06:22 <yanyan> em, actually it's more about lbaas's performance I feel
13:06:51 <Qiming> exactly, I'm a little bit concerned about lbaas/octavia's stability
13:07:00 <yanyan> but since there is still a chance that a senlin problem causes concurrency issues, maybe we need related test cases
13:07:11 <yanyan> Qiming, yes, that could be a problem
13:07:19 <Qiming> if it is not stable, then we are propagating that problem into rally
13:07:30 <Qiming> it seems to me the rally gate is already fragile
13:08:02 <yanyan> Qiming, yes. But users can also run tests locally using those test scenarios
13:08:04 <Qiming> rally is for performance tests, maybe we can use tempest for concurrency tests?
13:08:08 <yanyan> which could be more practical
13:08:16 <Qiming> yes
13:08:22 <yanyan> Qiming, yes, sorry my expression was not accurate
13:08:27 <Qiming> that makes more sense to me
13:08:47 <yanyan> so, that is my current plan
13:09:04 <Qiming> sorry I haven't got cycles to spend on that patch 346656
13:09:14 <Qiming> great
13:09:17 <yanyan> will also listen to users' voices, like the voice from cmcc guys :)
13:09:31 <yanyan> Qiming, no problem
13:09:40 <yanyan> it is now blocked on context support
13:09:53 <yanyan> will revise it later after profile/cluster context is ready
13:10:12 <Qiming> okay, that is a technical problem, it can be fixed eventually when we work harder on it
13:10:23 <yanyan> yes
13:10:25 <Qiming> not so worried about it
13:10:37 <yanyan> :)
13:10:39 <Qiming> next thing is integration test
13:10:56 <yanyan> yes, it still doesn't work correctly...
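
(The profile context yanyan describes above would, in rough shape, be a Rally context plugin along the lines below. This is a minimal sketch only: the plugin name senlin_profiles, the senlin_client() helper and the context keys are assumptions, not the patch actually under review.)

    from rally.task import context

    # Hypothetical helper, standing in for however the real patch obtains
    # an authenticated senlin client from a credential.
    def senlin_client(credential):
        raise NotImplementedError("client wiring is omitted in this sketch")


    @context.configure(name="senlin_profiles", order=450)
    class SenlinProfileContext(context.Context):
        """Pre-creates a senlin profile and exposes its id to scenarios."""

        CONFIG_SCHEMA = {
            "type": "object",
            "properties": {"spec": {"type": "object"}},
            "additionalProperties": False,
        }

        def setup(self):
            # The "admin"/"credential" layout follows the usual rally
            # context dict; adjust to the deployment under test.
            client = senlin_client(self.context["admin"]["credential"])
            profile = client.create_profile(name="rally-senlin-profile",
                                            spec=self.config.get("spec", {}))
            # Cluster create/scale/resize scenarios read this id later.
            self.context["senlin_profile_id"] = profile.id

        def cleanup(self):
            client = senlin_client(self.context["admin"]["credential"])
            client.delete_profile(self.context["senlin_profile_id"])

(A task file would then list "senlin_profiles" in its context section next to the cluster scenarios, so every run gets a profile created in setup() and removed in cleanup().)
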
13:10:59 <yanyan> need further fixes
13:11:01 <Qiming> I haven't checked any integration test log recently, because they are not voting ;)
13:11:05 <yanyan> I have proposed the patch
13:11:16 <yanyan> Qiming, it is ok I think
13:11:24 <yanyan> we can make it vote after it works correctly
13:11:44 <Qiming> sure
13:11:46 <yanyan> https://review.openstack.org/354566
13:12:06 <yanyan> also added zaqar support in this patch for the coming test of message support
13:12:45 <Qiming> okay
13:13:07 <Qiming> moving on
13:13:13 <Qiming> health management
13:13:28 <Qiming> I'm working on fixing the health manager
13:13:54 <Qiming> this one is critical: https://review.openstack.org/355743
13:14:20 <Qiming> and it is the reason why sometimes I cannot get any nova notifications from listeners
13:14:39 <Qiming> this one is an optimization: https://review.openstack.org/355751
13:15:01 <Qiming> it is about postponing hm initialization until after service table cleansing
13:15:15 <Qiming> the latest one is this: https://review.openstack.org/355790
13:15:25 <yanyan> so the dead service blocks those notifications
13:15:48 <Qiming> it is about dynamically disabling/enabling listening to a specific cluster
13:16:01 <Qiming> yes, a dead service may still have db records left
13:16:10 <yanyan> I see
13:16:27 <yanyan> yes, those services will be cleaned when a new service is initialized
13:16:35 <Qiming> when hm is trying to pick a cluster for listening, it will skip all those that have a service record in db
13:16:52 <Qiming> yes, any new engine restart will cleanse the table
13:17:05 <yanyan> I see
13:17:38 <Qiming> dynamic enabling/disabling is useful when we want to fully support the health policy (thus an HA usage scenario)
13:17:48 <Qiming> when I'm doing a cluster-scale-in
13:18:00 <Qiming> some nodes are supposed to be deleted from the cluster
13:18:18 <yanyan> sure, it is very useful I think
13:18:29 <Qiming> and those nodes (assuming they are nova servers) will generate notifications when they are killed
13:18:52 <Qiming> the health policy has to know that these events are generated intentionally
13:19:48 <Qiming> so ... eventually, the health policy may have to get itself hooked into those actions that are shrinking a cluster
13:20:28 <Qiming> before such actions happen, it will dynamically disable itself, and after such actions are completed, it dynamically restores health maintenance
13:20:29 <yanyan> yes, so health policy should take effect when this kind of operation is performed
13:20:38 <yanyan> s/when/before
13:20:58 <Qiming> or else, you will get a cluster with nodes you cannot delete
13:21:07 <Qiming> it is always recovered by the health policy
13:21:20 <yanyan> or hm ignores those events?
13:21:47 <Qiming> hm "suspends" itself when those actions are performed and resumes after those actions are completed
13:22:42 <yanyan> Qiming, yes. Just in case node failure happens during cluster scaling, those failure events will be omitted
13:23:13 <Qiming> due to the fact we have at least two types of "checkers", it is hard to inject this info directly into the checkers
13:23:13 <yanyan> I mean those "unexpected" failure events
13:23:43 <Qiming> em ... health policy is not designed to handle senlin action failures
13:23:45 <yanyan> checkers, you mean?
13:24:17 <Qiming> one checker is the poller (for vm status), another type of checker is the event queue listener
13:24:25 <yanyan> Qiming, sorry, I didn't express it clearly. I mean a nova server is down during cluster scaling
13:24:33 <Qiming> maybe in future we can add a 'pinger' ...
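
(The "dynamically disable/enable listening to a specific cluster" idea in 355790, combined with the event-queue-listener checker mentioned above, could be organized roughly as below. The class names, the notification event type and the executor are illustrative assumptions; this is not the code in those reviews.)

    import oslo_messaging as messaging
    from oslo_config import cfg


    class NovaEventEndpoint(object):
        """Handles nova notifications on behalf of one monitored cluster."""

        def __init__(self, cluster_id, recover_func):
            self.cluster_id = cluster_id
            self.recover_func = recover_func

        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Which event types to react to, and how to tell intentional
            # deletions from real failures, is the open question discussed
            # in the meeting; the filter here is only an example.
            if event_type == 'compute.instance.delete.end':
                self.recover_func(self.cluster_id, payload)


    class ClusterListenerRegistry(object):
        """Keeps one notification listener per cluster, so listening can be
        suspended for a specific cluster (e.g. while it is being scaled in)
        and resumed afterwards."""

        def __init__(self):
            self.transport = messaging.get_notification_transport(cfg.CONF)
            self.listeners = {}  # cluster_id -> running listener

        def enable(self, cluster_id, recover_func):
            if cluster_id in self.listeners:
                return
            targets = [messaging.Target(topic='notifications',
                                        exchange='nova')]
            endpoints = [NovaEventEndpoint(cluster_id, recover_func)]
            listener = messaging.get_notification_listener(
                self.transport, targets, endpoints, executor='threading')
            listener.start()
            self.listeners[cluster_id] = listener

        def disable(self, cluster_id):
            listener = self.listeners.pop(cluster_id, None)
            if listener is not None:
                listener.stop()
                listener.wait()
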
13:24:40 <Qiming> right
13:24:44 <yanyan> and that down event is expected to be caught by hm
13:24:53 <Qiming> exactly
13:25:12 <yanyan> Qiming, I see
13:25:17 <Qiming> hm has to know whether a vm is down due to a request, or due to exceptions/errors
13:25:49 <Qiming> it should do recovery only for the latter case, but it is currently receiving all VM down events
13:26:14 <yanyan> Qiming, yes, unless we have a "list" of VMs we need to monitor
13:26:51 <Qiming> even with that
13:27:14 <Qiming> when you are getting an event about VM1 being shut down
13:27:27 <Qiming> that cluster is already monitored
13:27:36 <Qiming> how would you follow up?
13:28:00 <Qiming> the key is that hm should be able to deduce the reason behind a vm "failure"
13:28:02 <yanyan> hmm, so the monitoring may be done with node granularity?
13:28:07 <yanyan> yes
13:28:33 <Qiming> the events are per-node (vm)
13:28:41 <Qiming> you have got to filter them
13:28:48 <yanyan> yes
13:28:54 <yanyan> filtering is needed
13:29:21 <Qiming> it is not about whether you CAN get a vm notification
13:29:36 <Qiming> it is about WHAT you should do after getting such a notification
13:30:12 <yanyan> I see. You mean hm collects all events, then decides which of them should be reacted to
13:30:28 <Qiming> so the easier path to attack this is to suspend health checking during cluster shrinkage
13:30:39 <Qiming> hm cannot be that smart
13:30:48 <Qiming> it is a separate rpc server, ...
13:31:13 <yanyan> got it
13:31:29 <Qiming> it needs too much information to make a call
13:31:51 <Qiming> so the flow would be something like this
13:32:52 <Qiming> scale_in request -> scale-in action -> cluster locked -> hm disables itself (if enabled) -> scale in -> hm re-enables itself -> cluster unlocked -> action complete
13:33:40 <yanyan> Qiming, sounds reasonable
13:33:52 <Qiming> VM failures during action execution cannot be handled, it is a pity
13:34:14 <yanyan> just one thing may need attention: the hm could be disabled for a long while if the scaling action hangs for some reason
13:34:32 <yanyan> Qiming, yes
13:34:37 <Qiming> anyway, that is the current design, will continue trying things out and see if the loop can be closed soon
13:34:50 <yanyan> ok
13:35:01 <yanyan> looks great
13:35:08 <Qiming> yes, during that period, you are not supposed to enable health recovery
13:35:37 <Qiming> cluster status is in transition
13:36:08 <Qiming> okay,
13:36:23 <Qiming> we have got some fencing code proposed, please help review
13:36:35 <Qiming> and xinhui is pushing more to gerrit
13:36:41 <yanyan> yes, will check it
13:36:58 <elynn> will check it later
13:37:08 <Qiming> profile/policy version control, again, sorry, didn't get time to look into it
13:37:29 <yanyan> no problem :)
13:37:41 <Qiming> container profile fixes ...
13:37:42 <yanyan> it's not that urgent
13:38:00 <Qiming> mostly look okay
13:38:29 <Qiming> receiver
13:38:39 <Qiming> sdk side blocked
13:38:54 <yanyan> yes, have posted a patch to add claim support
13:39:01 <Qiming> both patches look good now
13:39:06 <yanyan> but need to add test cases
13:39:14 <Qiming> just need another +2
13:39:38 <Qiming> events/notifications .... I am not aware of any activities therein
13:39:42 <yanyan> yes, after this support is done, will try to work on the message based receiver
13:39:53 <Qiming> so, that's the etherpad
13:40:15 <yanyan> Qiming, another thing I want to mention is that senlinclient is broken by the latest sdk change, I think...
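
(The scale-in flow Qiming lays out at 13:32:52 maps onto the policy hook points roughly as sketched below. The health_manager.disable()/enable() helpers are assumed here, standing in for whatever interface the design finally exposes, and the action-name tuple is illustrative; this is not the real health policy code.)

    from senlin.engine import health_manager  # disable()/enable() assumed
    from senlin.policies import base

    # Actions that shrink a cluster and therefore delete nodes on purpose.
    SHRINK_ACTIONS = ('CLUSTER_SCALE_IN', 'CLUSTER_DEL_NODES',
                      'CLUSTER_RESIZE')


    class HealthPolicySketch(base.Policy):
        """Illustrates the suspend/resume hooks only."""

        def pre_op(self, cluster_id, action):
            # The cluster is already locked by the action at this point;
            # stop health checking before nodes are deleted intentionally,
            # so their "VM down" notifications do not trigger recovery.
            if action.action in SHRINK_ACTIONS:
                health_manager.disable(cluster_id)

        def post_op(self, cluster_id, action):
            # The shrink has completed; resume checking before the cluster
            # lock is released and the action reports completion.
            if action.action in SHRINK_ACTIONS:
                health_manager.enable(cluster_id)

(As noted in the meeting, VM failures that happen while checking is suspended would go unhandled; that is the trade-off of this approach.)
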
13:40:31 <yanyan> it works correctly with sdk 0.9.2
13:40:34 <Qiming> last week I was reworking the exception catching logic for profiles, now it's all settled, please test
13:40:39 <yanyan> but broken using master
13:40:48 <Qiming> yes?
13:40:52 <yanyan> Qiming, sure, saw that huge work :)
13:40:54 <yanyan> yes
13:41:02 <yanyan> elynn and I tested it this afternoon
13:41:26 <Qiming> okay, I checked the bouncer ...
13:41:35 <yanyan> the error just happened at the client side
13:41:53 <Qiming> the client side has some patches not reviewed
13:42:03 <yanyan> ok, will check them
13:42:15 <Qiming> oh, all self approved I think
13:42:53 <Qiming> are you using the latest client with the latest sdk?
13:42:59 <yanyan> yes
13:43:05 <yanyan> both are latest code
13:43:07 <Qiming> the error message?
13:43:16 <yanyan> I can't remember it, I
13:43:24 <yanyan> I'm now using my own laptop :)
13:43:44 <yanyan> something like "unexpected header-*** attr"
13:43:53 <yanyan> can't recall it exactly
13:44:20 <yanyan> hi, elynn, do you have the log of the error?
13:44:22 <Qiming> the additional_headers problem could be caused by the keystoneauth version
13:44:39 <yanyan> oh
13:45:04 <Qiming> in global requirements, it has been bumped to "keystoneauth1>=2.10.0"
13:45:18 <Qiming> don't confuse keystoneauth with keystoneauth1
13:45:23 <elynn> yanyan, no, I'm using another computer now.
13:45:31 <Qiming> the latter one is the correct package
13:45:34 <yanyan> elynn, I see. thanks
13:45:48 <yanyan> yes
13:46:29 <yanyan> so is there any fix we need to apply on the client?
13:46:35 <Qiming> no
13:46:59 <Qiming> check this: https://review.openstack.org/#/c/343992/6/openstack/session.py
13:47:31 <Qiming> line 96 was using the additional_headers parameter added to the latest version of keystoneauth1
13:47:49 <Qiming> that is the way we pass the additional header "openstack-api-version: clustering 1.2" to the server
13:48:02 <Qiming> thus we are getting the cluster-collect operation to work properly
13:48:10 <yanyan> ok
13:48:44 <Qiming> if you are seeing errors about set_api_version, that means your sdk version is too old
13:48:56 <yanyan> hmm, that is weird
13:49:02 <yanyan> will test it again tomorrow
13:49:13 <Qiming> it is also added in this patch: https://review.openstack.org/#/c/343992/6/openstack/profile.py
13:49:25 <yanyan> I have cleaned my sdk installation and reinstalled it
13:49:55 <Qiming> senlinclient now will automatically add a 'clustering 1.2' header value to indicate it knows the cluster-collect call
13:50:14 <yanyan> I see. Will have a try tomorrow
13:50:27 <Qiming> okay
13:52:12 <Qiming> anything else?
13:52:22 <yanyan> nothing from me
13:52:34 <Qiming> guoshan, thanks for joining
13:52:45 <Qiming> are you based in China?
13:53:00 <guoshan> yes
13:53:13 <yanyan> guoshan, welcome :)
13:53:15 <Qiming> I see, so it is 21:53 for you too
13:53:26 <guoshan> yep :)
13:53:30 <yanyan> maybe some self introduction?
13:53:32 <yanyan> :)
13:53:59 <Qiming> I'm not single, but always available, period.
13:54:11 <yanyan> haha
13:54:49 <elynn> zan
13:55:32 <guoshan> I work at Awcloud, based in Wuhan. Newcomer to senlin.
13:56:07 <guoshan> And I am single and available :)
13:56:11 <yanyan> welcome, any questions, please feel free to ping us in the senlin channel :)
13:56:18 <Qiming> yep
13:56:22 <yanyan> :P
13:56:31 <elynn> haha, welcome guoshan
13:56:55 <Qiming> 4 mins left
13:57:10 <elynn> btw, I'm working on profile/policy validation, please help to review those patches if you have time.
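
(To reproduce the microversion behaviour Qiming describes above outside the sdk, the header can be sent directly through a keystoneauth1 (>= 2.10.0) session using the additional_headers parameter mentioned in the discussion. The credentials, auth URL and request path below are placeholders.)

    from keystoneauth1 import loading, session

    # Placeholder credentials and URL; adjust to your cloud.
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://keystone.example.com:5000/v3',
        username='demo', password='secret', project_name='demo',
        user_domain_id='default', project_domain_id='default')

    # additional_headers is the keystoneauth1 >= 2.10.0 parameter the sdk
    # patch relies on; every request via this session carries the header.
    sess = session.Session(
        auth=auth,
        additional_headers={'openstack-api-version': 'clustering 1.2'})

    # The path is a placeholder; point it at the clustering endpoint to
    # check whether microversioned calls such as cluster-collect are
    # accepted by the server.
    resp = sess.get('/v1/clusters',
                    endpoint_filter={'service_type': 'clustering'})
    print(resp.status_code)
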
13:57:41 <yanyan> elynn, sure, great work
13:58:08 <yanyan> very glad you have time to work on it recently :)
13:58:25 <Qiming> thanks, elynn
13:58:34 <elynn> Finally I got some spare time.
13:58:50 <yanyan> :)
13:58:52 <elynn> np Qiming
13:59:33 <Qiming> if you don't have other things to talk about, we can release the channel now
13:59:40 <Qiming> thanks for joining
13:59:45 <Qiming> #endmeeting