10:16:18 <rakhmerov> #startmeeting Mistral Bug Review
10:16:19 <openstack> Meeting started Wed Nov 11 10:16:18 2015 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:16:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
10:16:23 <openstack> The meeting name has been set to 'mistral_bug_review'
10:16:31 <rakhmerov> 1. Mistral stops responding after a few days that we haven't investigated / opened yet
10:16:50 <rakhmerov> ok
10:17:04 <rakhmerov> then let's at least file a bug
10:17:04 <[1]melisha> We need to investigate further.
10:17:28 <rakhmerov> #action melisha: File a bug for "Mistral stops responding after a few days that we haven't investigated / opened yet"
10:17:34 <[1]melisha> Cool
10:17:36 <rakhmerov> ok
10:17:56 <rakhmerov> 2. task stuck in RUNNING state when all action executions are finished - https://bugs.launchpad.net/mistral/+bug/1513456
10:17:56 <openstack> Launchpad bug 1513456 in Mistral "task stuck in RUNNING state when all action executions are finished" [Critical,Triaged]
10:18:39 <rakhmerov> On this one, we came across it a number of times
10:18:47 <nkoffman> we get this a lot, specifically when running in HA mode, we see that all action-executions were successful but the task doesn't en
10:18:48 <nkoffman> end
10:19:24 <nmakhotkin> yes, I investigated that a little bit
10:19:28 <rakhmerov> have you done any investigation? Any assumption where the problem is?
10:19:42 <nastya_> we also got it
10:19:49 <rakhmerov> yes
10:19:58 <nmakhotkin> the potential problem is our transactions
10:20:04 <rakhmerov> I assume that the issue is in transactions
10:20:04 <rakhmerov> yes
10:20:23 <rakhmerov> Winson also observed this behavior but in a different context
10:20:42 <rakhmerov> nkoffman: do you know a reliable way of reproducing it?
10:21:04 <rakhmerov> or at least conditions that increase the chances of reproducing it
10:21:09 <nkoffman> I saw it using a workflow with a task using with_items on HA,
10:21:22 <rakhmerov> ok
10:21:28 <nkoffman> I can try to reproduce on our node, haven't seen it on devstack though
10:21:38 <rakhmerov> please try to fill all info you have in bugs' comments
10:21:42 <rakhmerov> ok
10:21:48 <nkoffman> ok
10:21:51 <nastya_> nkoffman: I saw it in devstack installation without any ha
10:22:06 <rakhmerov> I can try to dig this task myself since I have a couple of thoughts how to track it down
10:22:14 <rakhmerov> ok
10:22:55 <nkoffman> nastya_: I assume the HA might only bring it up more often
10:23:07 <rakhmerov> yes, I guess so
10:23:11 <nastya_> nkoffman: yeah, agree
10:23:27 <rakhmerov> ok, I assigned it to myself
10:23:36 <rakhmerov> will try to fix it
10:23:38 <rakhmerov> soon
10:23:45 <rakhmerov> let's continue
10:24:07 <rakhmerov> 2. Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel - https://bugs.launchpad.net/mistral/+bug/1508379
10:24:07 <openstack> Launchpad bug 1508379 in Mistral "Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel" [Medium,In progress] - Assigned to Tomer Shtilman (tomer-shtilman)
10:24:08 <nastya_> rakhmerov: you can use my env to debug where this problem occurred
10:24:16 <rakhmerov> ooh, sorry, it was #3
10:24:36 <rakhmerov> nastya_: ok, will talk to you once I get to working on it, thanks
10:25:44 <nmakhotkin> this one is being fixed in https://review.openstack.org/#/c/240705/
10:25:54 <rakhmerov> [1]melisha, LimorStotland, nkoffman: so this happens if you need to reinstall one of Mistral instances?
10:26:02 <[1]melisha> rakhmerov: We all know the reason for this and Tomer is working on a fix with very responsive reviews from you all so that's OK
10:26:04 <nmakhotkin> but I'm not sure on 100%
10:26:58 <rakhmerov> nmakhotkin: yes, this seems to be the right patch
10:27:12 <[1]melisha> rakhmerov: On production setups, there is a puppet agent that always makes sure that the VM is up-to-date
10:27:28 <rakhmerov> #action: rakhmerov, nmakhotkin: review https://review.openstack.org/#/c/240705/
10:27:35 <rakhmerov> [1]melisha: ok
10:27:39 <[1]melisha> This puppet agent runs every X minutes and compares conf files, etc. and also runs mistral sync db
10:27:50 <rakhmerov> I see
10:28:21 <rakhmerov> I'm just wondering.. Maybe we should change the whole algorithm of updating actions in DB
10:28:27 <rakhmerov> w/o deleting them
10:28:43 <rakhmerov> but on the other hand, if we use transactions properly it should fix the problem
10:28:52 <rakhmerov> ok, let's move on
10:28:57 <[1]melisha> It will fix the problem
10:29:09 <rakhmerov> 4. Workflow executed more than once when using cron-trigger with multiple engines - https://bugs.launchpad.net/mistral/+bug/1513548
10:29:09 <openstack> Launchpad bug 1513548 in Mistral "Workflow executed more than once when using cron-trigger with multiple engines" [High,In progress] - Assigned to Moshe Elisha (melisha)
10:29:22 <rakhmerov> this is being worked on
10:29:34 <rakhmerov> [1]melisha: I still owe you a review, sorry
10:29:34 <[1]melisha> Yes
10:29:41 <[1]melisha> np
10:30:04 <rakhmerov> #action: rakhmerov: Review https://review.openstack.org/243234 ASAP
10:30:58 <rakhmerov> ok, guys, btw, just for the same of time saving I'm not tagging these tickets with the new tag
10:31:06 <rakhmerov> I'll do it once we finish the meeting
10:31:16 <rakhmerov> .. for the sake ...
10:31:36 <rakhmerov> the next one
10:31:41 <rakhmerov> 5. Some DB queries are reported slow as no indices are used - https://bugs.launchpad.net/mistral/+bug/1505664
10:31:41 <openstack> Launchpad bug 1505664 in Mistral "Some DB queries are reported slow as no indices are used" [High,Confirmed] - Assigned to Winson Chan (winson-c-chan)
10:31:48 <rakhmerov> this one is assigned to Winson
10:32:05 <rakhmerov> I'll help him fix that, it's pretty straightforward thing to do
10:32:31 <[1]melisha> Cool. Do you have an easy way to know the indexes that are needed?
10:32:39 <rakhmerov> #action rakhmerov: tag all needed bugs with liberty-backport-potential
10:32:42 <[1]melisha> or the queries that are executed?
10:33:01 <rakhmerov> [1]melisha: yes, it's mostly in my head )
10:33:32 <rakhmerov> If I look at DB model I'll say exactly what should be indexed and what should not
10:33:45 <[1]melisha> Great
10:33:56 <rakhmerov> of course, this doesn't cancel the need of some testing
10:34:22 <[1]melisha> :-) Sure. We will help with that
10:34:43 <rakhmerov> #action rakhmerov: put info into https://bugs.launchpad.net/mistral/+bug/1505664 about what exact indexes need to be created
10:34:43 <openstack> Launchpad bug 1505664 in Mistral "Some DB queries are reported slow as no indices are used" [High,Confirmed] - Assigned to Winson Chan (winson-c-chan)
10:35:10 <rakhmerov> 6. WF execution is not created if input preparation of initial task fails - https://bugs.launchpad.net/mistral/+bug/1506470
10:35:10 <openstack> Launchpad bug 1506470 in Mistral "WF execution is not created if input preparation of initial task fails" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:35:30 <rakhmerov> so here my question is: is this a bug at all?
10:36:01 <rakhmerov> opinions?
10:36:11 <nmakhotkin> the fix is already committed
10:36:22 <nmakhotkin> IMO, yes, it is a bug
10:36:25 <[1]melisha> I think it is a bug. As I see it an execution should always be created
10:36:34 <nkoffman> I agree
10:36:51 <LimorStotland> me 2
10:37:35 <rakhmerov> already committed? or merged?
10:37:43 <rakhmerov> can you please help me to find it?
10:37:55 <nastya_> merged
10:38:07 <nmakhotkin> fix committed in LP means that it is merged :)
10:38:16 <nastya_> rakhmerov: https://review.openstack.org/#/c/239638/
10:38:29 <rakhmerov> ok, I can find it via the ticket
10:38:35 <rakhmerov> yep, thanks
10:39:09 <rakhmerov> ok, great!
10:39:45 <rakhmerov> #action rakhmerov: take a look at https://review.openstack.org/#/c/239638/ and backport it
10:40:06 <rakhmerov> 7. HTTP connection issues on simple load testing - https://bugs.launchpad.net/mistral/+bug/1423054
10:40:07 <openstack> Launchpad bug 1423054 in Mistral mitaka "HTTP connection issues on simple load testing" [High,Triaged]
10:40:38 <[1]melisha> If the bug description is true - this will surely be an issue for our customers
10:41:17 <rakhmerov> ok, I'll just share what I know quickly
10:41:43 <rakhmerov> we discussed it a lot with StackStorm about 8 months ago and particularly with Winson
10:42:04 <rakhmerov> note that latest comment was made on 2015-02-18
10:42:41 <rakhmerov> so, I'm almost sure this is not really a bug if we just consider Mistral codebase
10:43:07 <_gryf> 1
10:43:10 <rakhmerov> Winson told me that once they put Mistral behind Apache server or Nginx this issue stopped appearing completely
10:43:50 <rakhmerov> the thing is that if we use just an http server provided out of the box it's mostly intended to be used for development, not for production
10:44:19 <rakhmerov> in other words, it can't really serve a lot of parallel requests well and dies under even modest load
10:44:40 <[1]melisha> OK. I see
10:44:59 <rakhmerov> Apache or Nginx help exactly with a big number of requests coming in in parallel
10:45:15 <[1]melisha> so no need to backport
10:45:25 <rakhmerov> just in case, I'd suggest we talk to Winson again and clarify this information
10:46:17 <rakhmerov> #action rakhmerov, [1]melisha: talk to Winson about https://bugs.launchpad.net/mistral/+bug/1423054 and confirm that this can be solved with putting Apache or Nginx in front of Mistral API server
10:46:17 <openstack> Launchpad bug 1423054 in Mistral mitaka "HTTP connection issues on simple load testing" [High,Triaged]
10:47:34 <rakhmerov> 8. execution-get truncates "State info" - https://bugs.launchpad.net/mistral/+bug/1509456
10:47:34 <openstack> Launchpad bug 1509456 in Mistral "execution-get truncates "State info"" [Medium,Confirmed] - Assigned to hardik (hardik-parekh047)
10:48:13 <rakhmerov> [1]melisha: can you confirm this bug too?
10:48:25 <rakhmerov> or anyone else?
10:48:33 <rakhmerov> I didn't see it myself
10:48:35 <nkoffman> we didn
10:48:38 <nkoffman> 't
10:49:07 <nkoffman> see it either, but based on the description, it looks like it could be an issue, if it does happen
10:49:31 <rakhmerov> I'm ready to bet that Mistral server doesn't truncate anything. If the problem exists it might be something on a client side
10:49:38 <nkoffman> nmakhotkin:I see it is confirmed by
10:50:18 <rakhmerov> ooh, yes, nmakhotkin confirmed it
10:50:20 <nmakhotkin> yep, I confirmed that
10:50:44 <rakhmerov> ok, then it should be something simple to fix
10:50:44 <nmakhotkin> state_info is really truncated
10:50:57 <rakhmerov> let's not spend time on that now, we just need to fix it
10:51:16 <rakhmerov> 9. wait-before and retry policies directly call task_handler.run_existing_task() method via RPC - https://bugs.launchpad.net/mistral/+bug/1484521
10:51:16 <openstack> Launchpad bug 1484521 in Mistral "wait-before and retry policies directly call task_handler.run_existing_task() method via RPC" [High,In progress] - Assigned to Renat Akhmerov (rakhmerov)
10:51:40 <rakhmerov> Yes, this is definitely a bug but it's more like an architectural bug
10:52:05 <nkoffman> what are the consequences of this bug to users?
10:52:06 <rakhmerov> we've discovered it with Limor together while improving Scheduler
10:52:37 <rakhmerov> no consequences I'd be able to tell about actually
10:52:51 <rakhmerov> it's rather an ugly design
10:53:07 <rakhmerov> and it requires some serious refactoring in engine and policies
10:53:28 <rakhmerov> not sure we need to backport it actually
10:53:34 <LimorStotland> yes, we have a bp on improving Scheduler:https://blueprints.launchpad.net/mistral/+spec/fallback-mechanism-for-scheduler
10:53:34 <nkoffman> ok, so in that case, probably unnecessary to backport
10:54:03 <rakhmerov> yes, I think we need to make a design improvement for Mitaka
10:54:27 <rakhmerov> it'll require I think a couple of weeks for me to fix it properly
10:54:41 <nkoffman> ok
10:54:49 <LimorStotland> I think if it doesn't have any effect on the user and it's risky we shouldn't backport
10:55:08 <rakhmerov> I assigned to myself to M-2 for now
10:55:41 <rakhmerov> yes, it is risky a little bit because, as I said, it's not just a simple change, it's rather a refactoring
10:55:45 <rakhmerov> of engine and policies
10:56:06 <rakhmerov> which I'd love to do personally but it's pretty time consuming
10:56:23 <rakhmerov> 10. Wrong execution state with conditional transitions - https://bugs.launchpad.net/mistral/+bug/1510936
10:56:23 <openstack> Launchpad bug 1510936 in Mistral "Wrong execution state with conditional transitions" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:57:12 <nmakhotkin> this one is fixed already
10:57:40 <rakhmerov> ok, cool
10:57:40 <nkoffman> we haven't tried to reproduce it on our set up yet, but again based on description, we will need the fix
10:57:46 <rakhmerov> needs to be backported I think
10:58:05 <nmakhotkin> one thing, this bug is not completely fixed (case of on-complete is uncovered)
10:58:09 <rakhmerov> yes, it was definitely a bug, we discussed it with Nikolay before
10:58:26 <nkoffman> ok, so we need to backport it
10:58:52 <rakhmerov> #action nmakhotkin: fix https://bugs.launchpad.net/mistral/+bug/1510936 completely and backport all related patches
10:58:52 <openstack> Launchpad bug 1510936 in Mistral "Wrong execution state with conditional transitions" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:58:52 <rakhmerov> :)
10:59:41 <rakhmerov> 11. create pagination for the mistral client (This should be treated like a bug) - https://blueprints.launchpad.net/python-mistralclient/+spec/pagination-execution-mitralclient (This is mandatory as after a few days of work there is no way to get the execution list anymore).
11:00:28 <LimorStotland> i am missing one more +2 : https://review.openstack.org/#/c/242996/
11:01:04 <rakhmerov> #action rakhmerov: review https://review.openstack.org/#/c/242996/ and backport it into stable/liberty
11:01:22 <rakhmerov> no questions on that, this is really a bad thing
11:01:36 <LimorStotland> and then i think we need to backport it because users with cron-triggers can't use execution-list without it
11:01:38 <rakhmerov> I wish we could spend more time polishing such things
11:01:50 <rakhmerov> LimorStotland: sure, agree on 100%
11:02:05 <LimorStotland> cool :-)
11:02:48 <rakhmerov> 12. Add ceilometer apis as mistral actions - https://blueprints.launchpad.net/mistral/+spec/mistral-ceilometer-actions (This is not mandatory but we have some use cases that require this).
11:03:15 <rakhmerov> as we discussed at the team meeting this is pretty easy to implement
11:03:23 <rakhmerov> nmakhotkin can do it in 10 mins ;)
11:03:29 <nkoffman> yes, I did it on our installation as a POC,
11:03:35 <nmakhotkin> :D
11:03:41 <[1]melisha> in 11 mins
11:03:50 <nkoffman> :)
11:03:50 <rakhmerov> :)
11:03:52 <[1]melisha> or was it 9 mins?
11:04:23 <rakhmerov> ok, a serious question: who will be working on it? nmakhotkin or nkoffman?
11:04:40 <nkoffman> I can take it
11:04:57 <rakhmerov> ok
11:05:00 <nkoffman> is it ok for backporting?
11:05:03 <nmakhotkin> ok
11:05:16 <rakhmerov> then I'll tag it properly as well to backport it
11:05:30 <rakhmerov> alright, we ran out of time already actually
11:05:33 <nkoffman> great :)
11:05:42 <nastya_> do we really need to backport it? it is not so critical bug
11:05:47 <rakhmerov> very productive meeting I think
11:06:01 <rakhmerov> nastya_: good question
11:06:10 <rakhmerov> [1]melisha, nkoffman: what do you think guys?
11:06:15 <nastya_> actually it is a new feature
11:06:16 <[1]melisha> nastya_: You are right. It is not even a bug
11:06:35 <nkoffman> this is indeed not critical, however since this is a low risk, and useful for our customers, it would be helpful if backported
11:06:37 <[1]melisha> Yes. But so easy impl will make some really cool use cases possible
11:06:51 <LimorStotland> I don't think it mandatory for backport but it can be nice
11:07:00 <[1]melisha> But up to you to decide
11:07:07 <rakhmerov> my perspective: it's pretty easy to backport and if it brings some comfort for your customers then let's do this
11:07:25 <rakhmerov> [1]melisha: completely agree
11:07:33 <rakhmerov> customers' happiness first
11:07:42 <[1]melisha> :-)
11:07:48 <LimorStotland> it have no risk and it can be very useful no way not?
11:07:56 <rakhmerov> yes
11:08:07 <rakhmerov> ok, are we good now?
11:08:10 <nastya_> ok, if it is not risky, then let's do it
11:08:14 <nkoffman> yes
11:08:15 <rakhmerov> any other questions?
11:08:18 <LimorStotland> yep
11:08:28 <nkoffman> I have a request,
11:08:43 <nkoffman> not regarding the meeting though..
11:09:09 <rakhmerov> let me end the meeting then
11:09:15 <rakhmerov> and we'll continue to talk
11:09:19 <rakhmerov> #endmeeting