13:00:32 #startmeeting senlin
13:00:33 Meeting started Tue Aug 16 13:00:32 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:36 The meeting name has been set to 'senlin'
13:01:02 hello
13:01:09 Hi
13:01:12 hi
13:01:15 hey, yo
13:01:21 hi
13:01:32 I don't have extra items to put into the agenda
13:01:42 if you have got one or two, please update it here:
13:01:51 https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Weekly_Senlin_.28Clustering.29_meeting
13:01:59 evening everyone
13:02:18 #topic newton work items
13:02:30 #link https://etherpad.openstack.org/p/senlin-newton-workitems
13:02:44 let's start with the etherpad, as usual
13:03:15 hi, about rally support
13:03:44 I'm working on contexts for senlin resources, like profile or cluster, per roman's suggestion
13:04:04 the profile context patch has been proposed and roman has left some comments there
13:04:12 will keep working on it in the coming week
13:04:49 after that will be the cluster context, which will be useful for test cases like cluster scaling and resizing
13:04:56 any plan on how many test cases we want to land in rally?
13:05:38 currently, I plan for cluster creating, resizing, scaling, deleting, node creating, deleting
13:05:43 and also lb policy related ones
13:05:58 since those operations could have concurrency issues
13:06:05 are lb related test cases relevant to performance?
13:06:22 em, actually it's more about lbaas's performance I feel
13:06:51 exactly, I'm a little bit concerned about lbaas/octavia's stability
13:07:00 but since there is still a chance that a senlin problem causes a concurrency issue, maybe we need related test cases
13:07:11 Qiming, yes, that could be a problem
13:07:19 if it is not stable, then we are propagating that problem into rally
13:07:30 it seems to me the rally gate is already fragile
13:08:02 Qiming, yes. But users can also run tests locally using those test scenarios
13:08:04 rally is for performance tests, maybe we can use tempest for concurrency tests?
13:08:08 that could be more practical
13:08:16 yes
13:08:22 Qiming, yes, sorry my expression is not accurate
13:08:27 that makes more sense to me
13:08:47 so, that is my current plan
13:09:04 sorry I haven't got cycles to spend on that patch 346656
13:09:14 great
13:09:17 will also listen to users' voices, like voices from the cmcc guys :)
13:09:31 Qiming, no problem
13:09:40 it is now blocked on context support
13:09:53 will revise it later after profile/cluster context is ready
13:10:12 okay, that is a technical problem, it can be fixed eventually when we work harder on it
13:10:23 yes
13:10:25 not so worried about it
13:10:37 :)
13:10:39 next thing is integration test
13:10:56 yes, it still doesn't work correctly...
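[Editor's note on the rally context work discussed above (13:03:44 onwards): a context plugin's job is to create the Senlin resources a scenario needs before the benchmark iterations start and to tear them down afterwards. The sketch below only illustrates that lifecycle; it does not use the real Rally plugin API, and every class, method, and client name in it is a hypothetical placeholder, not the patch under review.]

```python
# Illustrative-only sketch of the setup/cleanup lifecycle a Rally context
# provides around a scenario (create a Senlin profile first, delete it after).
# This is NOT the real Rally plugin API; all names here are hypothetical.


class SenlinProfileContext:
    """Creates a Senlin profile before scenarios run and deletes it afterwards."""

    def __init__(self, senlin_client, profile_spec):
        self.client = senlin_client        # e.g. a python-senlinclient handle
        self.profile_spec = profile_spec   # profile spec the scenario will use
        self.profile_id = None

    def setup(self):
        # Runs once before the benchmark iterations start.
        profile = self.client.create_profile(**self.profile_spec)
        self.profile_id = profile["id"]

    def cleanup(self):
        # Runs once after all iterations, even if some of them failed.
        if self.profile_id:
            self.client.delete_profile(self.profile_id)

    # Context-manager sugar so the lifecycle reads naturally in a test script.
    def __enter__(self):
        self.setup()
        return self

    def __exit__(self, *exc_info):
        self.cleanup()
        return False
```

[A cluster context would wrap the same lifecycle around cluster creation, which is what makes the scaling/resizing scenarios mentioned above possible without each scenario re-creating its own cluster.]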
13:10:59 need further fix
13:11:01 I haven't checked any integration test logs recently, because they are not voting, ;)
13:11:05 I have proposed the patch
13:11:16 Qiming, it is ok I think
13:11:24 we can make it vote after it works correctly
13:11:44 sure
13:11:46 https://review.openstack.org/354566
13:12:06 also added zaqar support in this patch for the coming test for message support
13:12:45 okay
13:13:07 moving on
13:13:13 health management
13:13:28 I'm working on fixing the health manager
13:13:54 this one is critical: https://review.openstack.org/355743
13:14:20 and it is the reason why sometimes I cannot get any nova notifications from listeners
13:14:39 this one is an optimization: https://review.openstack.org/355751
13:15:01 it is about postponing hm initialization until after service table cleansing
13:15:15 the latest one is this: https://review.openstack.org/355790
13:15:25 so the dead service blocks those notifications
13:15:48 it is about dynamically disabling/enabling listening to a specific cluster
13:16:01 yes, a dead service may still have db records left
13:16:10 I see
13:16:27 yes, those services will be cleaned when a new service is initialized
13:16:35 when hm is trying to pick a cluster for listening, it will skip all those that have a service record in db
13:16:52 yes, any new engine restart will cleanse the table
13:17:05 I see
13:17:38 dynamic enabling/disabling is useful when we want to fully support health policy (thus an HA usage scenario)
13:17:48 when I'm doing a cluster-scale-in
13:18:00 some nodes are supposed to be deleted from the cluster
13:18:18 sure, it is very useful I think
13:18:29 and those nodes (assuming they are nova servers) will generate notifications when they are killed
13:18:52 the health policy has to know that these events are generated intentionally
13:19:48 so ... eventually, the health policy may have to get itself hooked into those actions that are shrinking a cluster
13:20:28 before such actions happen, it will dynamically disable itself, and after such actions are completed, it dynamically restores health maintenance
13:20:29 yes, so health policy should take effect when this kind of operation is performed
13:20:38 s/when/before
13:20:58 or else, you will get a cluster with nodes you cannot delete
13:21:07 it is always recovered by the health policy
13:21:20 or hm ignores those events?
13:21:47 hm "suspends" itself when those actions are performed and resumes after those actions are completed
13:22:42 Qiming, yes. Just in case node failure happens during cluster scaling, those failure events will be omitted
13:23:13 due to the fact we have at least two types of "checkers", it is hard to inject this info directly into the checkers
13:23:13 I mean those "unexpected" failure events
13:23:43 em ... health policy is not designed to handle senlin action failures
13:23:45 checkers, you mean?
13:24:17 one checker is the poller (for vm status), another type of checker is the event queue listener
13:24:25 Qiming, sorry, I didn't express it clearly. I mean a nova server is down during cluster scaling
13:24:33 maybe in future we can add a 'pinger' ...
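[Editor's note on the cluster-claiming behaviour described above (13:16:35): a health manager only starts listening on clusters that are not already held by a live engine, and registry rows left behind by dead engines get taken over. A rough sketch of that logic follows; the db_api calls and field names are hypothetical, not Senlin's actual internal API.]

```python
# Hedged sketch of health-manager cluster claiming. All db_api methods and
# record fields below are assumptions made for illustration only.
import time

SERVICE_TIMEOUT = 60  # seconds without a heartbeat before an engine counts as dead


def pick_clusters_to_monitor(db_api, my_engine_id):
    now = time.time()
    live_engines = {
        svc.engine_id
        for svc in db_api.service_get_all()
        if now - svc.updated_at < SERVICE_TIMEOUT  # updated_at assumed to be epoch seconds
    }

    claimed = []
    for entry in db_api.health_registry_get_all():
        if entry.engine_id in live_engines:
            # Another live engine is already listening on this cluster; skip it.
            continue
        # Either unclaimed, or claimed by a dead engine: take it over.
        db_api.health_registry_claim(entry.cluster_id, my_engine_id)
        claimed.append(entry.cluster_id)
    return claimed
```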
13:24:40 right
13:24:44 and that down is expected to be caught by hm
13:24:53 exactly
13:25:12 Qiming, I see
13:25:17 hm has to know whether a vm is down due to a request, or due to exceptions/errors
13:25:49 it should do recovery only for the latter cases, but it is currently receiving all VM down events
13:26:14 Qiming, yes, unless we have a "list" of VMs we need to monitor
13:26:51 even with that
13:27:14 when you are getting an event about VM1 being shut down
13:27:27 that cluster is already monitored
13:27:36 how would you follow up?
13:28:00 the key is that hm should be able to deduce the reason behind a vm "failure"
13:28:02 hmm, so the monitoring may be done with node granularity?
13:28:07 yes
13:28:33 the events are per-node (vm)
13:28:41 you got to filter them
13:28:48 yes
13:28:54 filtering is needed
13:29:21 it is not about whether you CAN get a vm notification
13:29:36 it is about WHAT you should do after getting such a notification
13:30:12 I see. You mean hm collects all events, then decides which of them should be reacted to
13:30:28 so the easier path to attack this is that we suspend health checks during cluster shrinkage
13:30:39 hm cannot be that smart
13:30:48 it is a separate rpc server, ...
13:31:13 got it
13:31:29 it needs too much information to make a call
13:31:51 so the flow would be something like this
13:32:52 scale_in request -> scale-in action -> cluster locked -> hm disables itself (if enabled) -> scale in -> hm re-enables itself -> cluster unlocked -> action complete
13:33:40 Qiming, sounds reasonable
13:33:52 VM failures during action execution cannot be handled, it is a pity
13:34:14 just one thing may need attention: the hm could be disabled for a long while if the scaling action hangs for some reason
13:34:32 Qiming, yes
13:34:37 anyway, that is the current design, will continue trying things out and see if the loop can be closed soon
13:34:50 ok
13:35:01 looks great
13:35:08 yes, during that period, you are not supposed to enable health recovery
13:35:37 cluster status is in transition
13:36:08 okay,
13:36:23 we have got some fencing code proposed, please help review
13:36:35 and xinhui is pushing more to gerrit
13:36:41 yes, will check it
13:36:58 will check it later
13:37:08 profile/policy version control, again, sorry, didn't get time to look into it
13:37:29 no problem :)
13:37:41 container profile fixes ...
13:37:42 it's not that urgent
13:38:00 mostly looks okay
13:38:29 receiver
13:38:39 sdk side blocked
13:38:54 yes, have posted a patch to add claim support
13:39:01 both patches look good now
13:39:06 but need to add test cases
13:39:14 just need another +2
13:39:38 events/notifications .... I am not aware of any activity therein
13:39:42 yes, after this support is done, will try to work on the message based receiver
13:39:53 so, that's the etherpad
13:40:15 Qiming, another thing I want to mention is that senlinclient is broken by the latest sdk change I think...
13:40:31 it works correctly with sdk 0.9.2
13:40:34 last week I was reworking the exception catching logic for profiles, now it's all settled, please test
13:40:39 but broken using master
13:40:48 yes?
13:40:52 Qiming, sure, saw that huge work :)
13:40:54 yes
13:41:02 elynn and I tested this afternoon
13:41:26 okay, I checked the bouncer ...
13:41:35 the error just happened at the client side
13:41:53 the client side has some patches not reviewed
13:42:03 ok, will check them
13:42:15 oh, all self-approved I think
13:42:53 are you using the latest client with the latest sdk?
13:42:59 yes
13:43:05 both are latest code
13:43:07 the error message?
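[Editor's aside on the flow outlined above at 13:32:52: the suspend/resume dance around a scale-in maps naturally onto a context manager wrapped around the node removal step. The sketch below is illustrative only; the health_manager and cluster objects and their methods are placeholders, not Senlin's actual internals.]

```python
# Hedged sketch of suspending health checks while a cluster is being shrunk.
# All object and method names are hypothetical placeholders.
from contextlib import contextmanager


@contextmanager
def health_check_suspended(health_manager, cluster_id):
    """Temporarily stop reacting to VM-down events for one cluster."""
    was_enabled = health_manager.is_enabled(cluster_id)
    if was_enabled:
        health_manager.disable(cluster_id)
    try:
        yield
    finally:
        if was_enabled:
            health_manager.enable(cluster_id)


def do_scale_in(cluster, health_manager, count):
    cluster.lock()
    try:
        # Node deletions below will emit nova "VM down" notifications; the
        # health manager must not treat them as failures to recover from.
        with health_check_suspended(health_manager, cluster.id):
            cluster.remove_nodes(count)
    finally:
        cluster.unlock()
```

[As noted in the discussion, the trade-off is that genuine VM failures occurring inside this window go unhandled, and a hung scaling action keeps the health manager disabled until the action completes.]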
13:43:16 I can't remember it, I
13:43:24 I'm now using my own laptop :)
13:43:44 something like "unexpected header-*** attr"
13:43:53 can't recall it exactly
13:44:20 hi, elynn, do you have the log of the error?
13:44:22 the additional_headers problem could be caused by the keystoneauth version
13:44:39 oh
13:45:04 in global requirements, it has been bumped to "keystoneauth1>=2.10.0"
13:45:18 don't confuse keystoneauth with keystoneauth1
13:45:23 yanyan, no, I'm using another computer now.
13:45:31 the latter one is the correct package
13:45:34 elynn, I see. thanks
13:45:48 yes
13:46:29 so any fix we need to apply on the client?
13:46:35 no
13:46:59 check this: https://review.openstack.org/#/c/343992/6/openstack/session.py
13:47:31 line 96 was using the additional_headers parameter added in the latest version of keystoneauth1
13:47:49 that is the way we pass the additional header "openstack-api-version: clustering 1.2" to the server
13:48:02 thus we are getting the cluster-collect operation to work properly
13:48:10 ok
13:48:44 if you are seeing errors about set_api_version, that means your sdk version is too old
13:48:56 hmm, that is weird
13:49:02 will test it again tomorrow
13:49:13 it is also added in this patch: https://review.openstack.org/#/c/343992/6/openstack/profile.py
13:49:25 I have cleaned my sdk installation and reinstalled it
13:49:55 senlinclient will now automatically add a 'clustering 1.2' header value to indicate it knows the cluster-collect call
13:50:14 I see. Will have a try tomorrow
13:50:27 okay
13:52:12 anything else?
13:52:22 no from me
13:52:34 guoshan, thanks for joining
13:52:45 are you based in China?
13:53:00 yes
13:53:13 guoshan, welcome :)
13:53:15 I see, so it is 21:53 for you too
13:53:26 yep:)
13:53:30 maybe some self introduction?
13:53:32 :)
13:53:59 I'm not single, but always available, period.
13:54:11 haha
13:54:49 zan
13:55:32 I work at Awcloud, based in Wuhan. Newcomer to senlin.
13:56:07 And I am single and available :)
13:56:11 welcome, any questions, plz feel free to ping us in the senlin channel :)
13:56:18 yep
13:56:22 :P
13:56:31 haha, welcome guoshan
13:56:55 4 mins left
13:57:10 btw, I'm working on profile/policy validation, please help to review those patches if you have time.
13:57:41 elynn, sure, great work
13:58:08 very glad you have time to work on it recently :)
13:58:25 thanks, elynn
13:58:34 Finally I got some spare time.
13:58:50 )
13:58:52 np Qiming
13:59:33 if you don't have other things to talk about, we can release the channel now
13:59:40 thanks for joining
13:59:45 #endmeeting
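[Editor's aside on the header discussion at 13:47:49: keystoneauth1 (>= 2.10.0) lets a session attach extra headers to every request via its additional_headers parameter, which is how the "openstack-api-version: clustering 1.2" microversion reaches the server. A minimal sketch with placeholder credentials follows; it shows the mechanism only, not the actual sdk patch (343992).]

```python
# Minimal sketch: attach the clustering microversion header to every request
# through a keystoneauth1 session. Endpoint and credentials are placeholders.
from keystoneauth1 import loading, session

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="http://keystone.example.com/v3",   # placeholder endpoint
    username="demo",
    password="secret",
    project_name="demo",
    user_domain_name="Default",
    project_domain_name="Default",
)

sess = session.Session(
    auth=auth,
    # Added in keystoneauth1 2.10.0 per the discussion above; older versions
    # reject this keyword, which matches the "unexpected header" style error.
    additional_headers={"openstack-api-version": "clustering 1.2"},
)

# Every request made through this session (e.g. by the SDK's clustering proxy)
# now carries the microversion header, so operations that need API version 1.2
# such as cluster-collect are accepted by the server.
```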