13:00:22 <Qiming> #startmeeting senlin
13:00:23 <openstack> Meeting started Tue Aug 15 13:00:22 2017 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:27 <openstack> The meeting name has been set to 'senlin'
13:00:49 <Qiming> evening
13:01:21 <ruijie> evening Qiming
13:02:12 <Qiming> hi
13:02:55 <Qiming> network connection is no good today
13:03:20 <ruijie> my computer went down.. need to reinstall OS..
13:03:30 <Qiming> now?
13:03:38 <Qiming> ;)
13:03:42 <ruijie> later :)
13:04:30 <Qiming> I'm looking at the etherpad
13:04:43 <Qiming> checking if there are critical things to be completed by this relesae
13:04:51 <Qiming> s/relesae/release
13:05:27 <Qiming> liyi has been a shy hero recently
13:05:27 <ruijie> em, yes Qiming, I just proposed a commit about the scheduler and actions
13:06:56 <Qiming> it is a little bit complex
13:07:02 <Qiming> need some time to digest
13:07:24 <ruijie> sure
13:08:10 <Qiming> can you elaborate the points of the change made?
13:08:40 <ruijie> yes Qiming, basically I want to use one backend thread to process all the actions belonging to the engine service
13:09:17 <ruijie> and the dispatcher just tells the scheduler to queue the actions whose status is READY
13:09:17 <Qiming> one "backend" thread?
13:10:06 <Qiming> so when we are creating a cluster of 10 nodes
13:10:30 <Qiming> the 10 NODE_CREATE actions will now get executed sequentially?
13:10:32 <ruijie> currently we use dispatcher.start_action() to notify the scheduler to work, and each time the request will enter the scheduler, so there might be a lot of requests wanting to grad the action AND threads to process
13:10:51 <ruijie> grad/grab
13:11:55 <ruijie> in fact, the start_action() method will not process the action it gets from the DB; it will be queued, and the backend thread will grab a thread from the thread pool to process it
13:13:02 <Qiming> but there is a single backend thread now
13:13:24 <ruijie> yes Qiming..
13:15:04 <Qiming> although there is no real multi-threading in Python, this patch doesn't look like an improvement to me
13:17:04 <ruijie> em, it may reduce thread conflicts when a lot of requests get in :)
13:17:33 <Qiming> but it will hurt performance badly
13:18:05 <ruijie> yes, one thread will be used.. okay
13:18:14 <Qiming> please think again
13:18:42 <ruijie> and another thing is that we grab actions randomly
13:19:07 <Qiming> yes, that is not-so-good scheduling
13:19:15 <Qiming> can be improved for sure
13:19:20 <ruijie> if so, in some cases, some actions may never get processed
13:19:53 <Qiming> for example, fix that db call to be acquire_first_ready()
13:20:23 <Qiming> by "first_ready", I mean we sort READY actions in the DB by start_time or created_at etc
13:20:59 <ruijie> Qiming, you mean acquire_random_in(latest=20) or acquire_the_first_one()?
13:21:28 <Qiming> acquire_the_first_one
13:21:50 <Qiming> to ensure no action will be starving
13:21:57 <ruijie> but will this increase the risk of deadlock?
13:22:10 <ruijie> all the schedulers want to get the first one
13:22:15 <Qiming> no ... we only acquire READY actions
13:22:49 <Qiming> if there are action dependencies, the dependents won't be selected because they are not READY
13:23:08 <ruijie> but we broadcast the request to all dispatchers
13:23:17 <Qiming> that is fine
13:23:27 <Qiming> all threads are assumed to be dummy workers
13:23:44 <Qiming> they don't (shouldn't) know the semantics or dependencies
13:23:57 <Qiming> they just grab a ready action and do it
13:24:22 <ruijie> okay, that makes sense
13:24:35 <Qiming> we pushed all synchronization problems to the db layer
13:24:52 <Qiming> instead of handling them at different layers --- a common source of deadlocks
13:25:28 <Qiming> if there is a concurrency problem, we look into the db records, and we blame and fix sqlalchemy calls
13:26:06 <ruijie> one problem we met is: cluster action timeout, but the depends and dependents are still there..
13:26:41 <Qiming> that means one or two db calls are not thread safe
13:27:25 <ruijie> yes Qiming, we want to delete all the records when the action times out, but the node actions are still executing..
13:27:31 <ruijie> is that a known problem?
13:27:43 <Qiming> I spent a lot of time looking into the sqlalchemy doc, trying to learn some best practices
13:28:18 <Qiming> a node action is depended on by its cluster action
13:28:39 <Qiming> if a node action is still running, you are not supposed to delete the cluster action, right?
13:29:30 <ruijie> yes Qiming
13:29:32 <Qiming> we had a signal call before
13:30:08 <Qiming> "action_signal"
13:30:22 <Qiming> it was designed for this purpose
13:31:04 <Qiming> the intent is to have an action occasionally check if it has received a signal ...
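[Editor's note] The acquire_the_first_one idea discussed above — claim the oldest READY action atomically at the db layer so that no action starves and workers stay dumb — can be sketched roughly as below. This is a minimal illustration on sqlite3 with an invented `action` table (`id`, `status`, `created_at`, `owner`); Senlin's real schema and sqlalchemy calls differ.

```python
import sqlite3

def acquire_first_ready(conn, worker_id):
    """Atomically claim the oldest READY action; return its id or None.

    Hypothetical sketch: table and column names are illustrative,
    not Senlin's actual schema.
    """
    with conn:  # one transaction: select-and-claim cannot be torn apart
        row = conn.execute(
            "SELECT id FROM action WHERE status = 'READY' "
            "ORDER BY created_at LIMIT 1").fetchone()
        if row is None:
            return None
        # Re-check status in the UPDATE so a concurrent claimer loses cleanly.
        claimed = conn.execute(
            "UPDATE action SET status = 'RUNNING', owner = ? "
            "WHERE id = ? AND status = 'READY'",
            (worker_id, row[0])).rowcount
        return row[0] if claimed else None
```

The point of sorting by `created_at` (rather than picking randomly) is exactly the no-starvation guarantee mentioned above: the oldest READY action is always the next one claimed, and dependent actions are never visible because they are not READY.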
13:31:13 <Qiming> if it does, it will abort
13:32:01 <Qiming> it looks like we have not yet used that weapon
13:32:22 <ruijie> not yet; cluster-resize 100 --> cluster action timeout --> mark_timeout (delete all dependents) --> result: node actions left, locks, dependents
13:33:11 <Qiming> yes, we are talking about the same thing
13:33:32 <Qiming> I was proposing to add an "action_signal" call in the "mark_timeout" logic
13:33:55 <Qiming> so mark_timeout can kill the depended node actions
13:35:34 <Qiming> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n260
13:35:39 <ruijie> it's hard to tell whether they are blocked or just processing slowly
13:36:05 <Qiming> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n366
13:37:01 <Qiming> such logic may be pushed down to the db layer as well
13:40:54 <Qiming> by the way, I hope the team has noticed liyi's contributions
13:41:23 <Qiming> he has proposed quite a few high quality patches
13:41:41 <ruijie> yes, liyi is doing great work
13:41:57 <Qiming> he doesn't show up on IRC
13:42:19 <Qiming> but that won't be a blocking factor for hiring him/here
13:42:25 <Qiming> s/here/her
13:42:26 <XueFeng> yes
13:42:53 <Qiming> XueFeng, is liyi from your team?
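[Editor's note] The mark_timeout + action_signal idea above — the cluster action signals its dependent node actions on timeout, and each action occasionally checks for a signal and aborts — might look like this in outline. All names here (`Action`, `mark_timeout`, `SIG_CANCEL`) are illustrative stand-ins, not the actual code in the base.py links above.

```python
import threading

SIG_CANCEL = 'CANCEL'

class Action:
    """Hypothetical sketch of an action that polls for a cancel signal."""

    def __init__(self, name):
        self.name = name
        self._signal = None
        self._lock = threading.Lock()

    def signal(self, sig):
        """Deliver a signal; the action notices it at its next check."""
        with self._lock:
            self._signal = sig

    def is_cancelled(self):
        with self._lock:
            return self._signal == SIG_CANCEL

    def execute(self, steps):
        done = 0
        for _ in range(steps):
            if self.is_cancelled():   # the occasional check discussed above
                return 'CANCELLED', done
            done += 1                 # one unit of real work
        return 'OK', done

def mark_timeout(children):
    """On cluster-action timeout, signal all dependent node actions
    instead of deleting their records while they are still running."""
    for action in children:
        action.signal(SIG_CANCEL)
```

The design point is that cancellation is cooperative: mark_timeout never forcibly kills a worker, so a node action always reaches a consistent stopping point before its locks and dependency records are cleaned up.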
13:43:24 <XueFeng> He comes from another company
13:43:50 <Qiming> okay
13:44:22 <Qiming> I'll contact him/her and see if there is interest to become one of us
13:44:48 <XueFeng> It's good
13:44:57 <XueFeng> +1
13:45:10 <ruijie> #vote
13:45:34 <XueFeng> I contacted him before; he did good work on the API job
13:47:00 <Qiming> he knows sdk
13:47:38 <Qiming> it would be great to have more eyes on gating the contributions
13:48:05 <Qiming> last thing on my mind for today is high priority bugs
13:48:27 <Qiming> liyi has proposed several patches related to node adoption
13:48:43 <XueFeng> Yes, we should make our team and meetings run smoothly
13:48:47 <Qiming> most of them are about some cases we haven't thought about
13:49:15 <XueFeng> Will review these patches
13:49:25 <Qiming> great
13:49:51 <Qiming> senlin is already in rdo?
13:49:55 <XueFeng> About bugs, I think there are no high priority bugs
13:50:14 <Qiming> senlinclient and dashboard are not there yet?
13:50:27 <XueFeng> Yes
13:50:48 <XueFeng> Senlin server (API and Engine) is in rdo
13:50:52 <Qiming> there are hands and eyes on it?
13:50:57 <Qiming> on them?
13:51:05 <XueFeng> Senlinclient is in process
13:51:06 <ruijie> https://bugs.launchpad.net/senlin/+bug/1710834
13:51:07 <openstack> Launchpad bug 1710834 in senlin "physical id should be None when creation process failed" [Undecided,In progress] - Assigned to RUIJIE YUAN (cnjie0616)
13:52:09 <Qiming> ... "UNKOWN" .. what's that for?
13:52:14 <ruijie> for heat stacks there is no such problem
13:52:22 <ruijie> for the exception message ..
13:52:27 <ruijie> can we just remove it?
13:52:51 <Qiming> cannot recall why we set it that way
13:52:58 <Qiming> there must be a reaon
13:53:00 <Qiming> reason
13:53:13 <ruijie> we raised ResourceException when creation failed..
13:53:14 <Qiming> but I'm fine with setting it to None
13:53:39 <ruijie> and the resource_id was set to 'UNKNOWN'
13:54:14 <Qiming> I cannot recall why we explicitly set it to 'UNKNOWN'
13:54:38 <Qiming> but I believe there was a reason, not just for the exception message
13:54:59 <ruijie> compute.create() may not return a server and we may not have server.id; then we are not supposed to show the exception message with server_id ..
13:55:12 <ruijie> can we just remove it from the Exception class?
13:55:34 <Qiming> the problem is ...
13:55:51 <Qiming> it may return a server, but that server won't get active
13:56:18 <ruijie> yup ..
13:56:31 <Qiming> if it is not returning a server record, we should certainly set it to None
13:57:13 <ruijie> okay Qiming, will do it in the profile layer
13:57:22 <Qiming> thx
13:57:53 <ruijie> np :)
13:58:41 <Qiming> alright, time is up
13:58:47 <XueFeng> OK
13:58:57 <XueFeng> Good night
13:59:09 <Qiming> thanks for joining!
13:59:13 <Qiming> #endmeeting
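[Editor's note] The fix agreed above for bug 1710834 — set the node's physical id to None when compute.create() returns no server record, but keep the real id when a server comes back even if it never goes ACTIVE — could be sketched like this at the profile layer. Every name below (`do_create`, `ResourceCreationError`, `compute.wait_for_active`) is a hypothetical stand-in, not Senlin's actual API.

```python
class ResourceCreationError(Exception):
    """Raised when a physical resource cannot be created.

    resource_id is None (not 'UNKNOWN') when no server record exists.
    """

    def __init__(self, message, resource_id=None):
        super().__init__(message)
        self.resource_id = resource_id

def do_create(node, compute):
    """Hypothetical profile-layer create, per the discussion above."""
    try:
        server = compute.create()
    except Exception as ex:
        node.physical_id = None          # nothing was created: None, not 'UNKNOWN'
        raise ResourceCreationError(str(ex)) from ex
    node.physical_id = server.id         # a record exists, even if not yet ACTIVE
    if not compute.wait_for_active(server):
        raise ResourceCreationError('server failed to become ACTIVE',
                                    resource_id=server.id)
    return server
```

This split matches Qiming's distinction: a failed call yields no id at all (None), while a server that "won't get active" still has a real id worth recording for cleanup.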