16:05:18 <ddeja> #startmeeting mistral
16:05:18 <openstack> Meeting started Mon Sep 19 16:05:18 2016 UTC and is due to finish in 60 minutes. The chair is ddeja. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:05:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:05:21 <openstack> The meeting name has been set to 'mistral'
16:05:34 <ddeja> hello
16:06:26 <rakhmerov> hi
16:06:28 <rakhmerov> I was finally able to join
16:06:30 <rakhmerov> sorry
16:06:36 <ddeja> oh, cool
16:06:36 <d0ugal> Hey
16:06:40 <rakhmerov> ddeja: still here?
16:06:43 <rakhmerov> d0ugal: hi hi )
16:07:02 <ddeja> rakhmerov: I've just read your mail and I've just started the meeting
16:07:26 <rakhmerov> ok, good
16:07:38 <rakhmerov> ddeja: please keep in mind that you'll have to finish it
16:07:41 <rakhmerov> because you started it
16:08:00 <mgershen> hi
16:08:06 <rakhmerov> so, let's sync up quickly
16:08:10 <rakhmerov> mgershen: hi!
16:08:14 <ddeja> rakhmerov: yes, I know. Not my first time chairing ;)
16:08:27 <ddeja> #topic Review action items
16:08:39 <rakhmerov> ok :)
16:08:43 <rakhmerov> thanks a lot
16:08:53 <rakhmerov> you saved my ...
16:09:22 <rakhmerov> ddeja: I'm not sure if you have any AIs
16:09:34 <rakhmerov> we skipped the last 2 meetings I guess
16:09:38 <ddeja> oh, ok
16:09:41 <d0ugal> Yeah, probably not because of that
16:09:41 <rakhmerov> yeah
16:09:50 <ddeja> #topic Current status (progress, issues, roadblocks, further plans)
16:10:07 <rakhmerov> sorry for that, I've been extremely busy the last couple of months, and I've been travelling for 2 weeks by now
16:10:37 <d0ugal> rakhmerov: No problem, maybe we should move the meeting to a time that is easier for you? but that is a different discussion :)
16:10:56 <rakhmerov> yeah, we were supposed to do that a long time ago :)
16:11:03 <rakhmerov> it's not convenient for many people
16:11:15 <rakhmerov> it's my debt
16:11:22 <d0ugal> :)
16:11:41 <rakhmerov> my status: still working on stability and performance improvements, last week made some great changes, they made Mistral work much faster on large workflows
16:11:45 <mgershen> status: I have some code on review in rally (yes still...), but internal things take most of my time.
16:12:15 <rakhmerov> now optimizing processing of workflow context
16:12:24 <d0ugal> TripleO integration is taking most of my time, so not much to report (other than the bug I added to the agenda :) )
16:12:36 <rakhmerov> mgershen: can you please add us as reviewers?
16:12:56 <d0ugal> mgershen: or just link it?
16:13:00 <rakhmerov> d0ugal: ok, this is the main thing we probably need to discuss today
16:13:04 <ddeja> my status: mostly testing, found one bug and a root cause for another; besides that, a little bit of reviewing (but too little!)
16:13:17 <mgershen> sure, I'll find the link
16:13:26 <rakhmerov> ok
16:13:52 <mgershen> I have changes to do, hopefully I will have time soon... https://review.openstack.org/#/c/358352
16:14:01 <rakhmerov> ok, thanks
16:14:15 <rakhmerov> so, just before we move forward
16:14:25 <rakhmerov> please keep in mind that RC1 is released
16:14:39 <rakhmerov> and master is now open for developing new features
16:14:56 <d0ugal> rakhmerov: There is no newton branch yet
16:15:05 <d0ugal> rakhmerov: so I don't think master should be open?
16:15:08 <rakhmerov> from now on we'll be backporting only bug fixes into stable/newton
16:15:28 <rakhmerov> d0ugal: it should be created, I saw an email from Doug
16:15:31 <rakhmerov> let me check
16:15:36 <d0ugal> I asked in #openstack-release earlier, they said it should be done today
16:15:39 <d0ugal> but I still can't see it
16:16:00 <ddeja> rakhmerov: yup, there is no stable/newton branch
16:16:00 <rakhmerov> yeah, true
16:16:13 <rakhmerov> yes, hm.. it's kinda weird
16:16:20 <d0ugal> I agree :)
16:16:23 <rakhmerov> maybe something was broken in their toolkit
16:16:32 <rakhmerov> for making releases
16:16:36 <rakhmerov> ok, anyway
16:16:50 <rakhmerov> ddeja: let's move on?
16:17:40 <ddeja> rakhmerov: yup
16:17:58 <ddeja> #topic (d0ugal) MessagingTimeout when executing mistral actions https://bugs.launchpad.net/mistral/+bug/1624284
16:18:00 <openstack> Launchpad bug 1624284 in Mistral "MessagingTimeout when executing mistral actions" [Critical,Confirmed] - Assigned to Dawid Deja (dawid-deja-0)
16:18:32 * rakhmerov Renat is reading again..
16:18:36 <d0ugal> Okays, so for anyone unfamiliar, the last comment on that bug from ddeja is a good summary
16:19:08 <rakhmerov> d0ugal: yes
16:19:46 <rakhmerov> ddeja: does it help if engine and executor are running in separate processes?
16:19:56 <ddeja> rakhmerov: no
16:20:00 <rakhmerov> ok
16:20:04 <rakhmerov> just for my info
16:20:08 <ddeja> rakhmerov: I have such a configuration on my devstack
16:20:41 <ddeja> and it doesn't matter
16:20:52 <rakhmerov> ok
16:21:12 <rakhmerov> ok, I'm reading these 4 steps that you pointed out
16:21:29 <rakhmerov> and I'm not sure that I understand the problem 100%
16:21:37 <rakhmerov> so, again
16:21:52 <rakhmerov> engine sends a request to run "std.sleep"
16:21:59 <rakhmerov> executor sleeps for 30 sec
16:22:14 <ddeja> rakhmerov: yes, but the request is a workflow (it's important)
16:22:30 <rakhmerov> which one?
16:22:36 <ddeja> the first one
16:22:42 <ddeja> std.sleep is an action in a workflow
16:22:55 <rakhmerov> ooh, ok
16:23:17 <rakhmerov> reading again...
16:23:35 <rakhmerov> I don't understand #4
16:23:45 <rakhmerov> "Executor sends *sync* request: I woke up!"
16:24:02 <rakhmerov> ddeja: can you explain it?
16:24:17 <rakhmerov> what did you mean by "I woke up!"?
16:24:48 <ddeja> rakhmerov: Oh, that can be misleading
16:24:54 <ddeja> it is just sending the action results
16:25:07 <rakhmerov> for run-action ?
16:25:11 <ddeja> no
16:25:12 <rakhmerov> ooh, I got it
16:25:22 <rakhmerov> but for what?
16:25:44 <ddeja> it is for the action run as task t1 from the 'sleep' workflow
16:26:07 <rakhmerov> ok
16:26:19 <rakhmerov> and why do we have a deadlock?
16:26:25 <ddeja> so
16:26:55 <ddeja> engine sends a request to executor: 'run action std.sleep'. Since this action is a part of a workflow, the request is async
16:27:05 <ddeja> which means that we send a message via RPC and move on
16:27:05 <rakhmerov> yes
16:27:06 <rakhmerov> ok
16:27:10 <rakhmerov> yes
16:27:15 <rakhmerov> on engine side
16:27:48 <ddeja> on engine side, nothing is happening right now. On executor side, it goes to sleep (which simulates any long running task)
16:28:03 <rakhmerov> yes
16:28:22 <ddeja> while the executor is doing the 'long running task', API sends engine another request, to run action std.noop
16:28:35 <rakhmerov> ok
16:28:56 <ddeja> engine accepts the request, and since this is a 'run-action', not a part of a workflow, it sends a request to executor in sync manner
16:29:05 <rakhmerov> yep
16:29:13 <ddeja> but executor is doing its previous job
16:29:17 <ddeja> so, engine waits
16:29:26 <rakhmerov> yes
16:29:31 <ddeja> after some time, executor ends its first job
16:29:39 <rakhmerov> so essentially it's not a real deadlock
16:29:41 <ddeja> and wants to send the result back to engine
16:29:50 <ddeja> and it does it in sync manner
16:29:52 <rakhmerov> it's just run-action fails with timeout, right?
16:30:20 <rakhmerov> ooh, no
16:30:26 <rakhmerov> ok, it's a real deadlock
16:30:28 <rakhmerov> now I see
16:30:29 <ddeja> so it waits for engine to reply to its message but at the same time, engine is waiting for executor to answer its message
16:30:32 <ddeja> yup
16:30:37 <rakhmerov> yes, gotcha
16:30:50 <rakhmerov> it can't even send a result for 'sleep'
16:30:56 <ddeja> yes
16:31:00 <rakhmerov> because RPC subsystem is busy
16:31:05 <rakhmerov> so
16:31:20 <ddeja> well, it sends it at last, because the first message times out, and engine starts to operate again
16:31:35 <rakhmerov> yes
16:31:38 <rakhmerov> what about configuring RPC server differently for engine and executor?
16:31:54 <ddeja> it should work
16:32:04 <rakhmerov> will it help if executor won't be waiting to send results
16:32:22 <rakhmerov> it's one thing that we can do
16:32:36 <d0ugal> Configuring them differently where?
16:32:57 <rakhmerov> when we are initializing them
16:33:01 <rakhmerov> in launch.py
16:33:06 <ddeja> another thing - in mistral there are a lot of places where we use sync calls, but we are not doing anything with the results
16:33:25 <rakhmerov> ddeja: yes, right, we need to fix that too
16:33:37 <ddeja> it would improve performance
16:33:43 <rakhmerov> agree
16:34:26 <rakhmerov> I hope that pretty soon we'll get it back to 'eventlet' for engine too once I solve that stupid problem with green threads
16:34:34 <rakhmerov> I'll be working on it later this week
16:34:54 <ddeja> So, we want to change the executor so it uses eventlet?
16:35:10 <ddeja> or we want to use it async for returning messages?
16:35:18 <ddeja> returning results*
16:35:22 <rakhmerov> we need to do both
16:35:26 <ddeja> OK
16:35:36 <rakhmerov> starting with the simplest and more obvious change
16:35:58 <d0ugal> which one is that? :)
16:36:11 <rakhmerov> it seems like enabling 'eventlet' for executor should be pretty simple
16:36:29 <d0ugal> Right
16:36:50 <rakhmerov> we just need to add one more parameter into the function that creates an RPC server for us and pass a different value when initializing engine and executor in launch.py
16:36:55 <rakhmerov> ddeja: sounds about right?
16:37:48 <d0ugal> Sounds easy.
16:37:54 <ddeja> rakhmerov: yup.
16:37:56 <rakhmerov> yes
16:38:02 <rakhmerov> ok :)
16:38:06 <d0ugal> I'd be happy to help in any way I can.
16:38:09 <ddeja> but the kombu driver would still be broken
16:38:20 <rakhmerov> yeah, that's what I thought too
16:38:37 <rakhmerov> but, you know, for Kombu we can just ignore this parameter for now
16:38:48 <ddeja> no, that is not a problem
16:38:59 <d0ugal> rakhmerov: If you plan to land performance fixes, can we switch back to eventlet and take the performance hit for a week or so?
16:39:07 <ddeja> the problem is that this deadlock bug will still be happening if one is using the kombu driver instead of oslo
16:39:11 <rakhmerov> we can give it some abstract name like 'rpc_processing_method' and ignore it for Kombu
16:39:30 <rakhmerov> ddeja: true, but we'll have time to fix it soon
16:39:52 <ddeja> I'll check tomorrow if it is safe to change from sync to async in executor
16:40:13 <rakhmerov> ddeja: yes, please take it if you can
16:40:37 <rakhmerov> ddeja: btw, awesome job on investigating this
16:40:48 <d0ugal> ++
16:40:58 <ddeja> #action ddeja will check if it is safe to change from sync to async in default executor while returning action results
16:41:05 <ddeja> thanks :)
16:41:21 <rakhmerov> d0ugal: what did you mean by "performance hit"? :)
16:41:27 <rakhmerov> sorry, didn't get your question
16:41:35 <d0ugal> rakhmerov: don't worry, I think the plan you have sounds good
16:42:02 <rakhmerov> ooh, the performance fixes I made last week are in RC1 already
16:42:06 <rakhmerov> they are merged
16:42:15 <d0ugal> rakhmerov: I just got a bit confused with the switch from eventlet to blocking and then you said you want to go back to eventlet?
16:42:28 <rakhmerov> as for what I'm working on, it will be finished tomorrow (one test is failing)
16:42:48 <rakhmerov> d0ugal: yes, but only for executor
16:42:57 <d0ugal> I see, thanks.
16:43:05 <rakhmerov> by design, it's safe to use 'eventlet' for executor
16:43:17 <rakhmerov> but not safe for engine (problem with green threads)
16:43:30 <rakhmerov> d0ugal: at least ddeja and I believe so :)
16:43:38 <rakhmerov> hopefully we're right
16:43:41 <d0ugal> haha, I trust you :)
16:43:53 <d0ugal> Hopefully I can find time to learn this part of Mistral more soon.
16:44:01 <rbrady> +1
16:44:19 <rakhmerov> d0ugal: sure, it's pretty complicated but I can explain everything
16:44:29 <ddeja> rakhmerov: it should be totally safe as long as actions do not try to communicate with DB
16:44:53 <rakhmerov> yes
16:44:56 <rakhmerov> right
16:45:16 <rakhmerov> ok, seems like we have a plan
16:45:26 <rakhmerov> let's move on
16:45:35 <rakhmerov> any other topics?
16:45:42 <ddeja> #topic Open discussion
16:46:00 <rakhmerov> btw, just FYI
16:46:15 <rbrady> ddeja: actions should not communicate with db? directly, or by calling something that does communicate with db? e.g. fetching mistral environment
16:46:22 <rakhmerov> what we did last week makes mistral ~5 times faster
16:46:24 <rakhmerov> :)
16:46:37 <ddeja> rbrady: directly
16:46:43 <rakhmerov> I found some huge huge problems that I was able to remove
16:46:44 <rbrady> ddeja: ack. thanks
16:47:33 <rakhmerov> rbrady: yeah, the problem occurs only when we use green threads (eventlet's) and they do some blocking external calls
16:47:36 <rakhmerov> potentially blocking
16:47:43 <d0ugal> rakhmerov: Nice!
16:47:50 <rakhmerov> like acquiring a lock in DB
16:47:56 <rakhmerov> yeah :)
16:48:34 <d0ugal> rakhmerov: Do you have any benchmarks you can share? It would be a good thing to show off for Newton.
16:48:53 <rakhmerov> rbrady: in this case the green thread dispatcher doesn't switch threads as expected (although my understanding was different before I got this problem)
16:49:19 <rakhmerov> d0ugal: well, I can provide some numbers, yes
16:49:35 <rakhmerov> for some test workflows that I use
16:49:42 <d0ugal> rakhmerov: That would be cool, but not urgent at all :)
16:49:47 <rakhmerov> ok )
16:50:05 <rakhmerov> alright
16:50:06 <d0ugal> Okay, sorry but I need to leave a bit early
16:50:12 <rakhmerov> me too!
16:50:26 <rakhmerov> rbrady, mgershen, ddeja: how about you?
16:50:33 <rakhmerov> ok to close the meeting?
16:50:39 <rbrady> ok for me
16:50:39 <d0ugal> Thanks rakhmerov and ddeja for your discussion, that was very useful and please let me know if I can help at all.
16:50:41 <ddeja> yes
16:50:42 <mgershen> sure
16:50:48 <rakhmerov> d0ugal: sure
16:50:53 <rakhmerov> thanks everyone
16:51:00 <ddeja> ok, thank you all and see you next week
16:51:06 <rbrady> bye
16:51:08 <rakhmerov> ddeja: thanks twice! For investigation and for driving the meeting :)
16:51:12 <mgershen> bye
16:51:16 <d0ugal> Bye :)
16:51:17 <rakhmerov> see ya
16:51:21 <ddeja> rakhmerov: no problem, bye
16:51:26 <ddeja> #endmeeting
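
Note on the MessagingTimeout topic: the sequence ddeja walks through (async std.sleep from a workflow, then a sync run-action for std.noop, then the executor trying to return the sleep result with a sync call) can be reproduced with a toy model. The sketch below is not Mistral or oslo.messaging code. It assumes each side is a single-threaded ("blocking") RPC server, and every name in it (Server, cast, call, RpcTimeout, the handler names, the short timings) is hypothetical and chosen only for illustration. It uses nothing beyond the Python standard library.

```python
import queue
import threading
import time


class RpcTimeout(Exception):
    """Stand-in for a messaging timeout (not the oslo.messaging class)."""


class Server:
    """Toy RPC server: one worker thread, one message handled at a time."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        self.handlers = {}
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            method, arg, reply_q = self.inbox.get()
            result = self.handlers[method](arg)
            if reply_q is not None:  # only sync callers wait for a reply
                reply_q.put(result)

    def cast(self, method, arg=None):
        """Async request: enqueue the message and move on."""
        self.inbox.put((method, arg, None))

    def call(self, method, arg=None, timeout=0.5):
        """Sync request: enqueue the message and block until a reply arrives."""
        reply_q = queue.Queue()
        self.inbox.put((method, arg, reply_q))
        try:
            return reply_q.get(timeout=timeout)
        except queue.Empty:
            raise RpcTimeout(f"no reply from {self.name} to {method!r}")


def simulate(sync_results, sleep_for=0.2, rpc_timeout=0.5):
    """Return what the API caller sees for 'run-action std.noop'."""
    engine, executor = Server("engine"), Server("executor")

    def run_workflow_action(name):  # executor: action that belongs to a workflow
        time.sleep(sleep_for)  # the long running task
        if sync_results:
            try:
                # Blocks the executor's only worker until the engine answers.
                engine.call("on_action_complete", name, timeout=rpc_timeout)
            except RpcTimeout:
                pass
        else:
            engine.cast("on_action_complete", name)

    def start_action(name):  # engine: run-action request coming from the API
        try:
            # Blocks the engine's only worker until the executor answers.
            return executor.call("run_action", name, timeout=rpc_timeout)
        except RpcTimeout:
            return "MessagingTimeout"

    executor.handlers["run_workflow_action"] = run_workflow_action
    executor.handlers["run_action"] = lambda name: name + " done"
    engine.handlers["on_action_complete"] = lambda name: "ack"
    engine.handlers["start_action"] = start_action

    executor.cast("run_workflow_action", "std.sleep")  # async, as in a workflow
    time.sleep(sleep_for / 4)  # the executor is now busy sleeping
    return engine.call("start_action", "std.noop", timeout=4 * rpc_timeout)


if __name__ == "__main__":
    print("sync results: ", simulate(sync_results=True))
    print("async results:", simulate(sync_results=False))
```

With sync results, each worker is stuck waiting for the other until the engine's call times out, which matches the behaviour described in the log. With async results the executor moves straight on to std.noop, which is the change recorded in the #action item. Running the executor's RPC server on eventlet, the other option discussed, is not modelled here.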