08:02:23 <rakhmerov> #startmeeting Mistral
08:02:24 <openstack> Meeting started Wed May 29 08:02:23 2019 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:02:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:02:27 <openstack> The meeting name has been set to 'mistral'
08:02:36 <apetrich> Morning
08:03:14 <rakhmerov> morning
08:03:48 <rakhmerov> apetrich: btw, just a reminder: still waiting for you to update those backports )
08:03:54 <rakhmerov> no rush but please don't forget
08:04:15 <apetrich> rakhmerov, I know. Thanks for understanding.
08:04:32 <rakhmerov> no problem
08:04:35 <rakhmerov> so
08:04:36 <vgvoleg_> oh I did not have time to write a blueprint about the fail-on policy
08:04:41 <vgvoleg_> sorry
08:04:48 <rakhmerov> :)
08:04:55 <rakhmerov> please do
08:05:20 <rakhmerov> vgvoleg_: I know you wanted to share some concerns during the office hour
08:05:27 <rakhmerov> you can go ahead and do that
08:06:00 <vgvoleg_> yes
08:06:30 <vgvoleg_> we are currently testing Mistral with a huge workflow
08:07:00 <vgvoleg_> it has about 600 nested workflows and a very big context
08:07:21 <vgvoleg_> we found 3 problems
08:08:07 <vgvoleg_> 1) There are some memory leaks in the engine
08:08:15 <rakhmerov> ok
08:09:23 <rakhmerov> ok
08:09:30 <vgvoleg_> 2) Mistral sometimes gets stuck because of DB deadlocks in the action execution reporter
08:09:52 <rakhmerov> Oleg, how do you know that you're observing memory leaks?
08:10:19 <rakhmerov> maybe it's just a large memory footprint (which is totally OK with that big a workflow)
08:10:20 <vgvoleg_> we see lots of active sessions with 'update state error, info: heartbeat wasn't received'
08:10:31 <rakhmerov> ok
08:10:50 <rakhmerov> maybe you need to increase the timeouts?
08:10:59 <vgvoleg_> nonono
08:11:09 <vgvoleg_> it's ok for the action to fail
08:11:22 <vgvoleg_> it is not ok for Mistral to get stuck :D
08:12:00 <vgvoleg_> and from that point the engines can't do anything: they lose the connection to Rabbit and never return to a working state
08:12:39 <vgvoleg_> btw we see a lot of sessions in the 'idle in transaction' state, tbh I don't know what that means
08:13:15 <rakhmerov> ok
08:13:17 <rakhmerov> I see
08:13:32 <vgvoleg_> about the memory leaks: we use monitoring to see the current state of Mistral's pods
08:13:41 <rakhmerov> Oleg, we've recently found one issue with RabbitMQ
08:14:00 <rakhmerov> if you're using the latest code you've probably hit it as well
08:14:13 <rakhmerov> the thing is that oslo.messaging recently removed some deprecated options
08:14:18 <vgvoleg_> and we see that memory usage increases after a complete run
08:14:34 <rakhmerov> and the configuration option responsible for retries is now zero by default
08:14:41 <rakhmerov> so it never tries to reconnect
08:14:56 <rakhmerov> it's easy to solve just by reconfiguring the connection a little bit
08:15:22 <vgvoleg_> if we run the flow once again, it adds some more memory; in our case this is about 2GB per pod
08:15:43 <vgvoleg_> 2GB that we don't know where it comes from
08:15:55 <vgvoleg_> even if we turn off all caches
08:16:50 <vgvoleg_> Oh, great news about Rabbit, ty! We'll try to research it
08:17:48 <rakhmerov> vgvoleg_: yes, I can share the details on Rabbit with you separately
08:18:24 <rakhmerov> as for the leaks, ok, understood. We used to observe them, but all of them have been fixed
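The RabbitMQ reconnect issue rakhmerov describes above (a retry-related option that now defaults to zero after oslo.messaging dropped some deprecated options) is not named in the log, so the exact setting is left unspecified here. For orientation only, these are standard reconnect-related knobs in the [oslo_messaging_rabbit] section of mistral.conf; the values shown are illustrative defaults, not the specific fix referred to in the meeting:

    [oslo_messaging_rabbit]
    # Delay (seconds) before Kombu tries to re-establish a dropped connection.
    kombu_reconnect_delay = 1.0
    # Initial retry interval and backoff for reconnect attempts, capped at
    # rabbit_interval_max seconds.
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2
    rabbit_interval_max = 30
    # AMQP-level heartbeat so dead connections are detected promptly.
    heartbeat_timeout_threshold = 60
    heartbeat_rate = 2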
08:18:46 <rakhmerov> we haven't observed any leaks for at least a year of constant Mistral use
08:18:58 <rakhmerov> although the workflows are also huge
08:19:15 <rakhmerov> but ok, I assume you may be hitting some corner case or something
08:19:38 <rakhmerov> I can also advise you on how to find memory leaks
08:19:44 <rakhmerov> I can recommend some tools for that
08:19:48 <vgvoleg_> can anyone help me and explain how to detect where they come from?
08:19:58 <vgvoleg_> oh :)
08:20:00 <vgvoleg_> ok
08:20:09 <rakhmerov> basically you need to see which object types dominate the Python heap
08:20:29 <rakhmerov> yes, I'll let you know later here in the channel
08:20:34 <rakhmerov> so that others can see as well
08:20:44 <rakhmerov> I just need to find all the relevant links
08:20:56 <vgvoleg_> I've tried to do something like this, and some 'dict: 9999KB' is not helping at all
08:22:36 <rakhmerov> yeah-yeah, I know
08:22:39 <rakhmerov> it's not that simple
08:23:03 <vgvoleg_> The third issue I'd like to talk about is load balancing between engine instances
08:23:10 <rakhmerov> you need to trace a chain of references from these primitive types to our user types
08:23:20 <rakhmerov> vgvoleg_:
08:23:21 <rakhmerov> ok
08:24:03 <vgvoleg_> there are cases where, e.g., one task has an on-success clause with lots of other tasks
08:24:17 <rakhmerov> yep
08:24:43 <vgvoleg_> starting and executing all these tasks is one indivisible operation
08:25:03 <rakhmerov> yes
08:25:23 <vgvoleg_> so it doesn't get split between engines, and we can see one engine using 99% CPU and the others 2-4%
08:25:52 <rakhmerov> as we discussed, I'd propose making the option we recently added (start_subworkflows_via_rpc) more universal
08:26:17 <vgvoleg_> we use one very dirty hack to solve this problem: we add 'join: all' to all these tasks, and it helps to balance the load
08:26:23 <rakhmerov> so that it works not only during the start but at any point during the execution lifetime
08:27:28 <rakhmerov> vgvoleg_: yes
08:27:42 <rakhmerov> what do you think about my suggestion? Do you think it will help you?
08:28:50 <vgvoleg_> I think creating tasks in the WAITING state and starting them via RPC is the only correct solution
08:29:13 <rakhmerov> yes, makes sense
08:29:33 <vgvoleg_> this is good because it could solve one more problem
08:29:33 <rakhmerov> can you come up with some short spec or blueprint for this please?
08:29:58 <rakhmerov> we need to understand what else it will affect
08:30:22 <rakhmerov> basically, here we're going to change how tasks change their state
08:30:27 <rakhmerov> and this is a serious change
08:31:03 <vgvoleg_> if all execution steps are atomic and RPC-based, we can use priorities to make old executions finish faster than new ones
08:31:45 <rakhmerov> vgvoleg_: we definitely need to write up a description of this somewhere :)
08:31:54 <rakhmerov> with all the details and consequences of this change
08:32:01 <rakhmerov> I'd propose to make a blueprint for now
08:32:06 <rakhmerov> can you do this please?
08:32:09 <vgvoleg_> ok, I'll try
08:32:26 <akovi> hi
08:33:00 <rakhmerov> #action vgvoleg_: file a blueprint for processing big on-success|on-error|on-complete clauses using WAITING state and RPC
08:33:07 <akovi> the action execution reporting may fail because of custom actions, unfortunately
08:33:08 <rakhmerov> akovi: hi! how's it going?
08:33:18 <rakhmerov> custom actions?
08:33:21 <rakhmerov> ad-hoc, you mean?
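The heap-inspection approach rakhmerov outlines above (see which object types dominate the Python heap, then trace references from the primitive types back to user-level types) can be done with several tools; below is a minimal sketch using the third-party objgraph library, which is an assumption here rather than the specific tool rakhmerov later recommended:

    # Minimal heap-inspection sketch (assumes `pip install objgraph`).
    import objgraph

    # 1) See which object types dominate the heap right now.
    objgraph.show_most_common_types(limit=20)

    # 2) Set a baseline, run one workflow execution, then print only the
    #    types whose instance counts grew.
    objgraph.show_growth(limit=20)
    # ... run one workflow execution here ...
    objgraph.show_growth(limit=20)

    # 3) Trace references from a suspicious primitive (e.g. one of the big
    #    dicts) back to the user-level object that keeps it alive.
    big_dicts = objgraph.by_type('dict')
    if big_dicts:
        objgraph.show_backrefs(big_dicts[-1], max_depth=5,
                               filename='backrefs.png')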
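The load-balancing problem vgvoleg_ describes above (one task fanning out to many tasks via on-success, processed as a single indivisible operation on one engine) and the 'join: all' workaround look roughly like this in the Mistral v2 DSL; the workflow and task names are made up for illustration:

    version: '2.0'
    fan_out_example:
      tasks:
        prepare:
          action: std.noop
          on-success:
            # Processing this whole clause is one indivisible operation,
            # so a single engine ends up starting every listed task.
            - handle_1
            - handle_2
        handle_1:
          # The "dirty hack": 'join: all' makes the task wait and be
          # scheduled through a separate delayed call, which another
          # engine can pick up, spreading the load.
          join: all
          action: std.echo output="1"
        handle_2:
          join: all
          action: std.echo output="2"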
08:33:44 <akovi> if the reporter thread (green) does not get time to run, timeouts will happen in the engine and the execution will be closed
08:34:11 <vgvoleg_> we use only std.noop and std.echo in our test case
08:34:21 <akovi> we found such an error in one of our custom actions that listened to the output of a forked process
08:35:40 <rakhmerov> ok
08:36:33 <vgvoleg_> btw we have tried to use random_delay in this job, but deadlocks still appear quite often
08:38:12 <rakhmerov> vgvoleg_: it's not going to help much
08:38:19 <rakhmerov> we've already proven that
08:38:44 <rakhmerov> I have to admit that it was a stupid attempt to help mitigate a bad scheduler architecture
08:38:58 <rakhmerov> that whole thing is going to change pretty soon
08:39:08 <akovi> a large context can also cause issues in RPC processing
08:39:33 <rakhmerov> yes
08:39:45 <vgvoleg_> yes, but the action execution reporter works with the DB
08:40:25 <akovi> I tried compressing the data (with lzo and then lz4) but the results were not conclusive
08:40:29 <akovi> no
08:40:44 <akovi> the reporter has a list of the running actions
08:41:04 <akovi> this list is sent to the engine over RPC at given intervals
08:41:15 <akovi> the DB update happens in the engine
08:41:31 <vgvoleg_> oh, I get it
08:41:40 <akovi> it could be moved to the executor (this happened accidentally once)
08:41:59 <akovi> but that would mean the executor has to have direct DB access
08:42:06 <akovi> which was not required earlier
08:42:25 <vgvoleg_> I'll try to research it (the problem appeared yesterday)
08:42:26 <rakhmerov> yeah, we've always tried to avoid that for several reasons
08:42:40 <akovi> if the RPC channels are overloaded and messages pile up, that can cause heartbeat misses too
08:44:47 <akovi> action heartbeating should probably be a last resort actually
08:44:56 <akovi> to close an execution
08:45:13 <akovi> giving a fair amount of time for processing is good practice
08:45:50 <akovi> it may cause crashes to be processed later, but the overall stability improves significantly
08:46:42 <akovi> 30 sec reporting and 10 misses should be ok
08:47:00 <rakhmerov> akovi: yes, it's been working pretty well for us so far
08:47:16 <rakhmerov> no complaints
08:49:17 <rakhmerov> vgvoleg_: so, please complete those two action items (creating the blueprints)
08:49:43 <rakhmerov> I'll provide you with the details on Rabbit connectivity and detecting leaks in about an hour
08:50:23 <vgvoleg_> thank you :)
08:50:34 <rakhmerov> sure thing
08:52:56 <rakhmerov> ok, I guess we've discussed what we had
08:53:15 <rakhmerov> I'll close the meeting now (the logs will be available online)
08:53:18 <akovi> (thumbs up)
08:53:20 <rakhmerov> #endmeeting
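For reference, the "30 sec reporting and 10 misses" setting akovi suggests maps onto Mistral's action heartbeat configuration; a sketch of what that could look like in mistral.conf, with option names assumed from the action_heartbeat config group and worth verifying against the release in use:

    [action_heartbeat]
    # Interval (seconds) between heartbeat reports/checks (assumed semantics).
    check_interval = 30
    # Number of missed heartbeats before an action execution is failed,
    # i.e. roughly five minutes of silence with the values above.
    max_missed_heartbeats = 10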