13:00:22 #startmeeting senlin
13:00:23 Meeting started Tue Aug 15 13:00:22 2017 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:27 The meeting name has been set to 'senlin'
13:00:49 evening
13:01:21 evening Qiming
13:02:12 hi
13:02:55 network connection is no good today
13:03:20 my computer went down.. need to reinstall OS..
13:03:30 now?
13:03:38 ;)
13:03:42 later :)
13:04:30 I'm looking at the etherpad
13:04:43 checking if there are critical things to be completed by this relesae
13:04:51 s/relesae/release
13:05:27 liyi has been a shy hero recently
13:05:27 em, yes Qiming, I just proposed a commit about the scheduler and actions
13:06:56 it is a little bit complex
13:07:02 need some time to digest
13:07:24 sure
13:08:10 can you elaborate on the points of the change made?
13:08:40 yes Qiming, basically I want to use one backend thread to process all the actions that belong to the engine service
13:09:17 and the dispatcher just tells the scheduler to queue the actions whose status is READY
13:09:17 one "backend" thread?
13:10:06 so when we are creating a cluster of 10 nodes
13:10:30 the 10 NODE_CREATE actions will now get executed sequentially?
13:10:32 currently we use dispatcher.start_action() to notify the scheduler to work, and each time the request will enter the scheduler, so there might be a lot of requests that want to grad the action AND threads to process
13:10:51 grad/grab
13:11:55 in fact, the start_action() method will not process the action it gets from DB, it will be queued, and the backend thread will grab a thread from the thread pool to process it
13:13:02 but there is a single backend thread now
13:13:24 yes Qiming..
13:15:04 although there is no real multi-threading in Python, this patch doesn't look like an improvement to me
13:17:04 em, may reduce thread conflicts when a lot of requests get in :)
13:17:33 but it will hurt performance badly
13:18:05 yes, one thread will be used.. okay
13:18:14 please think again
13:18:42 and another thing is that we grab actions randomly
13:19:07 yes, that is a not-so-good scheduling
13:19:15 can be improved for sure
13:19:20 if, in some case, some actions may not be processed all the time
13:19:53 for example, fix that db call to be acquire_first_ready()
13:20:23 by "first_ready", I mean we sort READY actions in DB by start_time or created_at etc
13:20:59 Qiming, you mean :acquire_random_in(latest=20) or acquire_the_first_one()?
13:21:28 acquire_the_first_one
13:21:50 to ensure no action will be starving
13:21:57 but this will increase the risk of deadlock?
13:22:10 all the schedulers want to get the first one
13:22:15 no ... we only acquire READY actions
13:22:49 if there are action dependencies, the dependents won't be selected because they are not READY
13:23:08 but we broadcast the request to all dispatchers
13:23:17 that is fine
13:23:27 all threads are assumed to be dummy workers
13:23:44 they don't (shouldn't) know the semantics or dependencies
13:23:57 they just grab a ready action and do it
13:24:22 okay, that makes sense
13:24:35 we pushed all synchronization problems to the db layer
13:24:52 instead of handling them at different layers --- a common source of deadlocks
13:25:28 if there are concurrency problems, we look into the db records, we blame and fix sqlalchemy calls
13:26:06 one problem we met is: cluster action timeout, but the depends and dependents are still there..
13:26:41 that means one or two db calls are not thread safe
13:27:25 yes Qiming, we want to delete all the records when an action times out, but the node actions are still executing..
13:27:31 that is a known problem?
13:27:43 I spent a lot of time looking into sqlalchemy doc, trying to learn some best practices
13:28:18 node action is depended by cluster action
13:28:39 if node action is still running, you are not supposed to delete the cluster action, right?
13:29:30 yes Qiming
13:29:32 we have a signal call before
13:30:08 "action_signal"
13:30:22 it was designed for this purpose
13:31:04 the intent is to have an action occasionally check if it has received a signal ...
13:31:13 if it does, it will abort
13:32:01 it looks like we have not yet used that weapon
13:32:22 not yet, cluster-resize 100 --> cluster action timeout --> mark_timeout(delete all dependents) --> return : node actions left, locks, dependents
13:33:11 yes, we are talking about the same thing
13:33:32 I was proposing to add "action_signal" call in the "mark_timeout" logic
13:33:55 so mark_timeout can kill the depended node actions
13:35:34 http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n260
13:35:39 it's hard to tell whether they are blocked or just processing slowly
13:36:05 http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n366
13:37:01 such a logic may be pushed down to db layer as well
13:40:54 by the way, I hope team has noticed liyi's contribution
13:41:23 he has proposed quite a few high quality patches
13:41:41 yes, liyi is doing great jobs
13:41:57 he doesn't show up on IRC
13:42:19 but that won't be a blocking factor for hiring him/here
13:42:25 s/here/her
13:42:26 yes
13:42:53 XueFeng, is liyi from your team?
13:43:24 He comes from another company
13:43:50 okay
13:44:22 I'll contact him/her and see if there is interest to become one of us
13:44:48 It's good
13:44:57 +1
13:45:10 #vote
13:45:34 I contacted him before, and he works well on the API job
13:47:00 he knows sdk
13:47:38 it would be great to have more eyes on gating the contributions
13:48:05 last thing in my mind for today is about high priority bugs
13:48:27 liyi has proposed several patches related to node adoption
13:48:43 Yes, we should make our team and meeting run smoothly
13:48:47 most of them are about some cases we haven't thought about
13:49:15 Will review these patches
13:49:25 great
13:49:51 senlin is already in rdo?
13:49:55 About bugs, I think there are no high priority bugs
13:50:14 senlinclient and dashboard are not there yet?
13:50:27 Yes
13:50:48 Senlin server (API and Engine) is in rdo
13:50:52 there are hands and eyes on it?
13:50:57 on them?
13:51:05 Senlinclient is in process
13:51:06 https://bugs.launchpad.net/senlin/+bug/1710834
13:51:07 Launchpad bug 1710834 in senlin "physical id should be None when creation process failed" [Undecided,In progress] - Assigned to RUIJIE YUAN (cnjie0616)
13:52:09 ... "UNKNOWN" .. what's that for?
13:52:14 for heat stack there is no such problem
13:52:22 for the exception message ..
13:52:27 can we just remove it
13:52:51 cannot recall why we set it that way
13:52:58 there must be a reaon
13:53:00 reason
13:53:13 we raised ResourceException when creation failed..
13:53:14 but I'm fine with setting it to None
13:53:39 and the resource_id was set to 'UNKNOWN'
13:54:14 I cannot recall why we explicitly set it to 'UNKNOWN'
13:54:38 but I believe there was a reason, not just for exception message
13:54:59 compute.create() may not return a server and we may not have server.id, then we are not supposed to show the exception message with server_id ..
13:55:12 can we just remove it from the Exception class
13:55:34 the problem is ...
13:55:51 it may return a server, but that server won't get active
13:56:18 yup ..
13:56:31 if it is not returning a server record, we should set it to None certainly
13:57:13 okay Qiming, will do it in profile layer
13:57:22 thx
13:57:53 np :)
13:58:41 alright, time is up
13:58:47 OK
13:58:57 Good night
13:59:09 thanks for joining!
13:59:13 #endmeeting
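[editor's note] The timeout discussion earlier in the meeting proposed calling "action_signal" from the "mark_timeout" logic, so that node actions a timed-out cluster action depends on can notice the signal and abort instead of being left behind. A minimal sketch of that cooperative pattern follows. Everything in it is an assumption made for illustration: the in-memory `ActionStore`, the `SIG_CANCEL` constant, the status strings and the function signatures are hypothetical and are not Senlin's actual classes or db API.

```python
SIG_CANCEL = "CANCEL"  # hypothetical signal name


class ActionStore:
    """In-memory stand-in for the action records (hypothetical)."""

    def __init__(self):
        self.status = {}      # action id -> status string
        self.signal = {}      # action id -> pending signal or None
        self.depends_on = {}  # action id -> ids of actions it waits for

    def add(self, action_id, depends_on=()):
        self.status[action_id] = "RUNNING"
        self.signal[action_id] = None
        self.depends_on[action_id] = list(depends_on)


def mark_timeout(store, action_id):
    """Fail the timed-out action and signal the actions it depends on."""
    store.status[action_id] = "FAILED"
    for child in store.depends_on[action_id]:
        # Only actions that are still running need to be told to stop.
        if store.status[child] == "RUNNING":
            store.signal[child] = SIG_CANCEL


def run_node_action(store, action_id, steps):
    """Run up to `steps` units of work, checking for a signal before each."""
    done = 0
    for _ in range(steps):
        if store.signal[action_id] == SIG_CANCEL:
            store.status[action_id] = "CANCELLED"
            return done
        done += 1  # placeholder for one unit of real work
    store.status[action_id] = "SUCCEEDED"
    return done


store = ActionStore()
store.add("node-1")
store.add("node-2")
store.add("cluster-resize", depends_on=["node-1", "node-2"])

finished = run_node_action(store, "node-1", 3)   # completes before the timeout
mark_timeout(store, "cluster-resize")            # cluster action times out
aborted = run_node_action(store, "node-2", 3)    # sees the signal, stops early
```

The sketch does not settle the point raised in the meeting that a slow action is hard to tell apart from a blocked one: a signalled action only stops at its next check, so cleanup of locks and dependency records would still have to wait for that.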