08:02:23 <rakhmerov> #startmeeting Mistral
08:02:24 <openstack> Meeting started Wed May 29 08:02:23 2019 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:02:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:02:27 <openstack> The meeting name has been set to 'mistral'
08:02:36 <apetrich> Morning
08:03:14 <rakhmerov> morning
08:03:48 <rakhmerov> apetrich: btw, just a reminder: still waiting for you to update those backports )
08:03:54 <rakhmerov> no rush but please don't forget
08:04:15 <apetrich> rakhmerov, I know. Thanks for understanding.
08:04:32 <rakhmerov> no problem
08:04:35 <rakhmerov> so
08:04:36 <vgvoleg_> oh I did not have time to write a blueprint about the fail-on policy
08:04:41 <vgvoleg_> sorry
08:04:48 <rakhmerov> :)
08:04:55 <rakhmerov> please do
08:05:20 <rakhmerov> vgvoleg_: I know you wanted to share some concerns during the office hour
08:05:27 <rakhmerov> you can go ahead and do that
08:06:00 <vgvoleg_> yes
08:06:30 <vgvoleg_> we are currently testing Mistral with a huge workflow
08:07:00 <vgvoleg_> it has about 600 nested workflows and a very big context
08:07:21 <vgvoleg_> we found 3 problems
08:08:07 <vgvoleg_> 1) There are some memory leaks in the engine
08:08:15 <rakhmerov> ok
08:09:23 <rakhmerov> ok
08:09:30 <vgvoleg_> 2) Mistral sometimes gets stuck because of DB deadlocks in the action execution reporter
08:09:52 <rakhmerov> Oleg, how do you know that you're observing memory leaks?
08:10:19 <rakhmerov> maybe it's just a large memory footprint (which is totally OK with that big a workflow)
08:10:20 <vgvoleg_> we see lots of active sessions with 'update state error, info: heartbeat wasn't received'
08:10:31 <rakhmerov> ok
08:10:50 <rakhmerov> maybe you need to increase the timeouts?
08:10:59 <vgvoleg_> nonono
08:11:09 <vgvoleg_> it's ok for the action to fail
08:11:22 <vgvoleg_> it is not ok for Mistral to get stuck :D
08:12:00 <vgvoleg_> and from that point the engines can't do anything: they lose the connection to Rabbit and never return to a working state
08:12:39 <vgvoleg_> btw we see a lot of sessions in the 'idle in transaction' state, tbh I don't know what that means
08:13:15 <rakhmerov> ok
08:13:17 <rakhmerov> I see
08:13:32 <vgvoleg_> about the memory leaks: we use monitoring to see the current state of Mistral's pods
08:13:41 <rakhmerov> Oleg, we've recently found one issue with RabbitMQ
08:14:00 <rakhmerov> if you're using the latest code you've probably hit it as well
08:14:13 <rakhmerov> the thing is that oslo.messaging recently removed some deprecated options
08:14:18 <vgvoleg_> and we see that memory usage increases after a complete run
08:14:34 <rakhmerov> and the configuration option responsible for retries is now zero by default
08:14:41 <rakhmerov> so it never tries to reconnect
08:14:56 <rakhmerov> it's easy to solve just by reconfiguring the connection a little bit
08:15:22 <vgvoleg_> if we run the flow once again, it adds some more memory; in our case this is about 2GB per pod
08:15:43 <vgvoleg_> 2GB that we don't know where it comes from
08:15:55 <vgvoleg_> even if we turn off all caches
08:16:50 <vgvoleg_> Oh, great news about Rabbit, ty! We'll try to research it
08:17:48 <rakhmerov> vgvoleg_: yes, I can share the details on Rabbit with you separately
08:18:24 <rakhmerov> as for the leaks, ok, understood. We used to observe them, but all of them have been fixed
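The RabbitMQ reconnect issue rakhmerov describes above (a retry-related option that now defaults to zero after oslo.messaging dropped some deprecated options) is not named in the log, so the exact setting is left unspecified here. For orientation only, these are standard reconnect-related knobs in the [oslo_messaging_rabbit] section of mistral.conf; the values shown are illustrative defaults, not the specific fix referred to in the meeting:

    [oslo_messaging_rabbit]
    # Delay (seconds) before Kombu tries to re-establish a dropped connection.
    kombu_reconnect_delay = 1.0
    # Initial retry interval and backoff for reconnect attempts, capped at
    # rabbit_interval_max seconds.
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2
    rabbit_interval_max = 30
    # AMQP-level heartbeat so dead connections are detected promptly.
    heartbeat_timeout_threshold = 60
    heartbeat_rate = 2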
08:18:46 <rakhmerov> we haven't observed any leaks for at least a year of constant Mistral use
08:18:58 <rakhmerov> although the workflows are also huge
08:19:15 <rakhmerov> but ok, I assume you may be hitting some corner case or something
08:19:38 <rakhmerov> I can also advise you on how to find memory leaks
08:19:44 <rakhmerov> I can recommend some tools for that
08:19:48 <vgvoleg_> can anyone help me and explain how to detect where they come from?
08:19:58 <vgvoleg_> oh :)
08:20:00 <vgvoleg_> ok
08:20:09 <rakhmerov> basically you need to see which object types dominate the Python heap
08:20:29 <rakhmerov> yes, I'll let you know later here in the channel
08:20:34 <rakhmerov> so that others can see as well
08:20:44 <rakhmerov> I just need to find all the relevant links
08:20:56 <vgvoleg_> I've tried to do something like this, and some 'dict: 9999KB' is not helping at all
08:22:36 <rakhmerov> yeah-yeah, I know
08:22:39 <rakhmerov> it's not that simple
08:23:03 <vgvoleg_> The third issue I'd like to talk about is load balancing between engine instances
08:23:10 <rakhmerov> you need to trace a chain of references from these primitive types to our user types
08:23:20 <rakhmerov> vgvoleg_:
08:23:21 <rakhmerov> ok
08:24:03 <vgvoleg_> there are cases where, e.g., one task has an on-success clause with lots of other tasks
08:24:17 <rakhmerov> yep
08:24:43 <vgvoleg_> starting and executing all these tasks is one indivisible operation
08:25:03 <rakhmerov> yes
08:25:23 <vgvoleg_> so it doesn't get split between engines, and we can see one engine using 99% CPU and the others 2-4%
08:25:52 <rakhmerov> as we discussed, I'd propose making the option we recently added (start_subworkflows_via_rpc) more universal
08:26:17 <vgvoleg_> we use one very dirty hack to solve this problem: we add 'join: all' to all these tasks, and it helps to balance the load
08:26:23 <rakhmerov> so that it works not only during the start but at any point during the execution lifetime
08:27:28 <rakhmerov> vgvoleg_: yes
08:27:42 <rakhmerov> what do you think about my suggestion? Do you think it will help you?
08:28:50 <vgvoleg_> I think creating tasks in the WAITING state and starting them via RPC is the only correct solution
08:29:13 <rakhmerov> yes, makes sense
08:29:33 <vgvoleg_> this is good because it could solve one more problem
08:29:33 <rakhmerov> can you come up with some short spec or blueprint for this please?
08:29:58 <rakhmerov> we need to understand what else it will affect
08:30:22 <rakhmerov> basically, here we're going to change how tasks change their state
08:30:27 <rakhmerov> and this is a serious change
08:31:03 <vgvoleg_> if all execution steps are atomic and RPC-based, we can use priorities to make old executions finish faster than new ones
08:31:45 <rakhmerov> vgvoleg_: we definitely need to write up a description of this somewhere :)
08:31:54 <rakhmerov> with all the details and consequences of this change
08:32:01 <rakhmerov> I'd propose to make a blueprint for now
08:32:06 <rakhmerov> can you do this please?
08:32:09 <vgvoleg_> ok, I'll try
08:32:26 <akovi> hi
08:33:00 <rakhmerov> #action vgvoleg_: file a blueprint for processing big on-success|on-error|on-complete clauses using WAITING state and RPC
08:33:07 <akovi> the action execution reporting may fail because of custom actions, unfortunately
08:33:08 <rakhmerov> akovi: hi! how's it going?
08:33:18 <rakhmerov> custom actions?
08:33:21 <rakhmerov> ad-hoc, you mean?
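The heap-inspection approach rakhmerov outlines above (see which object types dominate the Python heap, then trace references from the primitive types back to user-level types) can be done with several tools; below is a minimal sketch using the third-party objgraph library, which is an assumption here rather than the specific tool rakhmerov later recommended:

    # Minimal heap-inspection sketch (assumes `pip install objgraph`).
    import objgraph

    # 1) See which object types dominate the heap right now.
    objgraph.show_most_common_types(limit=20)

    # 2) Set a baseline, run one workflow execution, then print only the
    #    types whose instance counts grew.
    objgraph.show_growth(limit=20)
    # ... run one workflow execution here ...
    objgraph.show_growth(limit=20)

    # 3) Trace references from a suspicious primitive (e.g. one of the big
    #    dicts) back to the user-level object that keeps it alive.
    big_dicts = objgraph.by_type('dict')
    if big_dicts:
        objgraph.show_backrefs(big_dicts[-1], max_depth=5,
                               filename='backrefs.png')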
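The load-balancing problem vgvoleg_ describes above (one task fanning out to many tasks via on-success, processed as a single indivisible operation on one engine) and the 'join: all' workaround look roughly like this in the Mistral v2 DSL; the workflow and task names are made up for illustration:

    version: '2.0'
    fan_out_example:
      tasks:
        prepare:
          action: std.noop
          on-success:
            # Processing this whole clause is one indivisible operation,
            # so a single engine ends up starting every listed task.
            - handle_1
            - handle_2
        handle_1:
          # The "dirty hack": 'join: all' makes the task wait and be
          # scheduled through a separate delayed call, which another
          # engine can pick up, spreading the load.
          join: all
          action: std.echo output="1"
        handle_2:
          join: all
          action: std.echo output="2"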
08:33:44 <akovi> if the reporter thread (green) does not get time to run, timeouts will happen in the engine and the execution will be closed
08:34:11 <vgvoleg_> we use only std.noop and std.echo in our test case
08:34:21 <akovi> we found such an error in one of our custom actions that listened to the output of a forked process
08:35:40 <rakhmerov> ok
08:36:33 <vgvoleg_> btw we have tried to use random_delay in this job, but deadlocks still appear quite often
08:38:12 <rakhmerov> vgvoleg_: it's not going to help much
08:38:19 <rakhmerov> we've already proven that
08:38:44 <rakhmerov> I have to admit that it was a stupid attempt to help mitigate a bad scheduler architecture
08:38:58 <rakhmerov> that whole thing is going to change pretty soon
08:39:08 <akovi> a large context can also cause issues in RPC processing
08:39:33 <rakhmerov> yes
08:39:45 <vgvoleg_> yes, but the action execution reporter works with the DB
08:40:25 <akovi> I tried compressing the data (with lzo and then lz4) but the results were not conclusive
08:40:29 <akovi> no
08:40:44 <akovi> the reporter has a list of the running actions
08:41:04 <akovi> this list is sent to the engine over RPC at given intervals
08:41:15 <akovi> the DB update happens in the engine
08:41:31 <vgvoleg_> oh, I get it
08:41:40 <akovi> it could be moved to the executor (this happened accidentally once)
08:41:59 <akovi> but that would mean the executor has to have direct DB access
08:42:06 <akovi> which was not required earlier
08:42:25 <vgvoleg_> I'll try to research it (the problem appeared yesterday)
08:42:26 <rakhmerov> yeah, we've always tried to avoid that for several reasons
08:42:40 <akovi> if the RPC channels are overloaded and messages pile up, that can cause heartbeat misses too
08:44:47 <akovi> action heartbeating should probably be a last resort actually
08:44:56 <akovi> to close an execution
08:45:13 <akovi> giving a fair amount of time for processing is good practice
08:45:50 <akovi> it may cause crashes to be processed later, but the overall stability improves significantly
08:46:42 <akovi> 30 sec reporting and 10 misses should be ok
08:47:00 <rakhmerov> akovi: yes, it's been working pretty well for us so far
08:47:16 <rakhmerov> no complaints
08:49:17 <rakhmerov> vgvoleg_: so, please complete those two action items (creating the blueprints)
08:49:43 <rakhmerov> I'll provide you with the details on Rabbit connectivity and detecting leaks in about an hour
08:50:23 <vgvoleg_> thank you :)
08:50:34 <rakhmerov> sure thing
08:52:56 <rakhmerov> ok, I guess we've discussed what we had
08:53:15 <rakhmerov> I'll close the meeting now (the logs will be available online)
08:53:18 <akovi> (thumbs up)
08:53:20 <rakhmerov> #endmeeting
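For reference, the "30 sec reporting and 10 misses" setting akovi suggests maps onto Mistral's action heartbeat configuration; a sketch of what that could look like in mistral.conf, with option names assumed from the action_heartbeat config group and worth verifying against the release in use:

    [action_heartbeat]
    # Interval (seconds) between heartbeat reports/checks (assumed semantics).
    check_interval = 30
    # Number of missed heartbeats before an action execution is failed,
    # i.e. roughly five minutes of silence with the values above.
    max_missed_heartbeats = 10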