16:02:20 <rakhmerov> #startmeeting Mistral
16:02:20 <openstack> Meeting started Mon Oct 20 16:02:20 2014 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:24 <openstack> The meeting name has been set to 'mistral'
16:02:26 <rakhmerov> sorry guys
16:02:27 <rakhmerov> :)
16:02:33 <rakhmerov> hi
16:02:35 <ttx> rakhmerov: be my guest :)
16:02:56 <rakhmerov> thanks!
16:02:58 <dzimine> hi here
16:03:04 <rakhmerov> hi Dmitri
16:03:08 <rakhmerov> how are you?
16:03:09 <bhavenst> hi
16:03:16 <bhavenst> long time no type
16:03:30 <dzimine> hi bhavenst
16:04:13 <rakhmerov> hey bryan
16:04:18 <rakhmerov> how have you been?
16:04:44 <bhavenst> doing fine, just busy @ work. But letting up so have been starting work on blueprints..
16:04:46 <nikolaym> hi !
16:05:08 <rakhmerov> ooh, very cool
16:05:15 <rakhmerov> ok, let's start
16:05:27 <dzimine> I was talking about Mistral at an OpenStack automation meetup, and got strong feedback about the need for Ceilometer integration, exactly along the lines of Bryan's blueprint.
16:05:35 <akuznetsova_> Hi, sorry, I am late
16:05:42 <rakhmerov> sorry for not sending out an agenda, i've been really really busy these days
16:06:01 <rakhmerov> yeaah, that is cool
16:06:07 <rakhmerov> hi Nastya :)
16:06:13 <rakhmerov> ok
16:06:34 <rakhmerov> we didn't have any AIs from last meetings, they were really short
16:06:48 <rakhmerov> some folks were on vacations or busy with something
16:07:10 <rakhmerov> so let's go straight to the current status
16:07:22 <rakhmerov> #topic Current Status (by team members)
16:08:25 <rakhmerov> my status for the last couple of weeks is: I've been working mostly on bugs (both client and server), working on the examples, and preparing presentations
16:08:26 <nikolaym> Almost all last week I worked on for-each, it just works fine now
16:08:38 <akuznetsova_> I've added simple positive and negative tests for cron-triggers (API and CLI integration tests)
16:08:42 <bhavenst> Starting thinking about metrics blueprints, sent questions to Ceilometer person, started an etherpad..
16:08:44 <rakhmerov> I still need to review it
16:09:00 <nikolaym> And today I found bug with auth in std.http
16:09:15 <rakhmerov> bhavenst, could you please send it out via openstack-dev?
16:09:31 <rakhmerov> nikolaym, ok, I saw your patch
16:09:34 <bhavenst> Yeah, can do that when it's a bit more refined.
:)
16:09:44 <rakhmerov> sure
16:10:20 <rakhmerov> as far as for-each
16:10:36 <rakhmerov> what Nikolay did looks ok
16:10:56 <rakhmerov> however, looks like we have some serious design issue
16:11:18 <rakhmerov> basically we have race condition between some transactions
16:11:21 <rakhmerov> in our engine
16:11:42 <rakhmerov> and in case of for-each it gets clearly revealed
16:12:37 <rakhmerov> the point is that when engine starts the workflow it creates all the tasks in DB (and execution) and currently we start tasks from within the same transaction
16:13:01 <rakhmerov> and there's a reason for this although it's considered anti-pattern
16:13:22 <rakhmerov> I mean to call any external things from DB transactions like rabbit mq
16:14:06 <rakhmerov> so it may happen that task is finished and its result comes back to engine before that first transaction completes
16:14:20 <rakhmerov> its on_task_result() method won't find a task in DB
16:15:06 <dzimine> oh, nice.
16:15:11 <rakhmerov> I thought in case of READ_COMMITTED transactions there shouldn't be race conditions because the second transaction should block on the same object that is not committed yet
16:15:47 <rakhmerov> but either 1) I was wrong 2) or we are doing something improperly somewhere else
16:16:04 <rakhmerov> e.g. configuring transaction isolation level
16:16:09 <bhavenst> I hit something similar many times during testing of the failed workflow bug I worked on, but that was before the refactoring so not sure if it applies.
16:16:15 <rakhmerov> so it's something that we need to test more carefully
16:16:30 <rakhmerov> it might
16:16:35 <rakhmerov> yes
16:17:05 <bhavenst> Multiple tasks doing things like echos, which I guess are fast enough to cause such an issue
16:17:23 <rakhmerov> well, first of all, if you run mistral for something serious (not unit tests) then forget about sqlite
16:17:36 <dzimine> foreach exacerbates the problem indeed.
Now we have many (way too many) calls to rabbit within transactional scope.
16:17:39 <rakhmerov> yes, bhavenst, exactly!
16:17:49 <rakhmerov> yes
16:17:54 <rakhmerov> 100% right
16:17:56 <bhavenst> My solution was to add sleeps. :)
16:18:19 <rakhmerov> yeah, that's what Nikolay did I guess to make it work
16:18:49 <rakhmerov> so the two obvious options (ooh god, we discussed it already so many times):
16:19:25 <rakhmerov> 1) run tasks after transaction completes
16:20:09 <rakhmerov> 2) leave as is and use something to do proper synchronization (even though it's not clear to me)
16:20:39 <rakhmerov> option 1 has a problem of being vulnerable to failures
16:21:17 <rakhmerov> so if engine fails right after transaction and before pushing tasks into rabbit then the system will end up in an inconsistent state
16:21:40 <rakhmerov> and there will be no way to figure out if some tasks have already been put into rabbit
16:22:02 <dzimine> this is deja vu. I need to recall all the details on the arguments we did...
16:22:06 <rakhmerov> so in other words, our DB state won't correspond to the state of the MQ
16:22:13 <rakhmerov> yeah
16:22:40 <rakhmerov> I think it's kinda challenge to discuss it in IRC for it being too complicated problem
16:22:56 <rakhmerov> but I'm just asking you to think about it if you have a chance
16:22:58 <dzimine> I recall we discussed "QUEUING" status for a task..
16:23:10 <rakhmerov> you may come up with some ideas
16:23:12 <dzimine> suggest we set up a time to brainstorm it.
16:23:17 <rakhmerov> yes
16:23:21 <bhavenst> Are you guys going to be @ Paris summit?
16:23:26 <rakhmerov> so I'm just letting you know...
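[Editor's sketch, not part of the meeting log] The race described above can be reproduced in miniature. The snippet below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with two connections purely to stay self-contained (the log itself says sqlite is not meant for serious use), and the table name `tasks`, the function `demo_race` and the connection names are hypothetical, not Mistral code. It shows that a second connection does not block on an uncommitted row; it simply does not see it yet, which matches the symptom of the result handler not finding the task.

```python
import os
import sqlite3
import tempfile


def demo_race():
    """Return what a second connection sees before and after the first one commits."""
    path = os.path.join(tempfile.mkdtemp(), "race_demo.db")
    # Hypothetical stand-ins: one connection plays the engine's "start workflow"
    # transaction, the other plays the handler of a task result coming back.
    engine_tx = sqlite3.connect(path)
    result_handler = sqlite3.connect(path)

    engine_tx.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT)")
    engine_tx.commit()

    # The engine creates the task row; the transaction is still open here.
    engine_tx.execute("INSERT INTO tasks (name) VALUES ('echo')")
    # In the scenario from the log, the task would already be sent to the
    # message queue at this point, from inside the open transaction.

    query = "SELECT name FROM tasks WHERE name = 'echo'"
    # A fast task "finishes" and its result handler looks the task up.
    seen_before_commit = result_handler.execute(query).fetchone()

    engine_tx.commit()
    seen_after_commit = result_handler.execute(query).fetchone()

    engine_tx.close()
    result_handler.close()
    return seen_before_commit, seen_after_commit


if __name__ == "__main__":
    print(demo_race())
```

The lookup before the commit returns nothing, and the same lookup after the commit finds the row. How a production database behaves depends on its isolation settings, so this is only a model of the symptom, not a diagnosis.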
16:23:31 <rakhmerov> yes
16:23:35 <rakhmerov> we are
16:23:42 <bhavenst> ok cool, I'll be there too
16:23:43 <dzimine> outside of this meeting (or if we have time left)
16:23:46 <rakhmerov> it may be a good time to get back to that problem
16:23:52 <rakhmerov> ooh, nice
16:24:08 <rakhmerov> I don't think we can fix it before the summit anyway
16:24:33 <dzimine> how will it work (or rather "not work") in between?
16:24:37 <rakhmerov> there's just a fundamental problem of keeping two systems (DB and MQ) in a consistent state
16:24:42 <dzimine> fast tasks will fail?
16:24:49 <rakhmerov> yes
16:25:10 <rakhmerov> surprisingly, it mostly works unless we use something like 'for-each'
16:25:32 <rakhmerov> I think the reason is that we always run tasks via oslo
16:25:37 <rakhmerov> even echo :)
16:26:00 <rakhmerov> again, I'm still hoping that we just need to configure mysql properly
16:26:06 <rakhmerov> but
16:26:13 <rakhmerov> it may not really be helpful
16:26:30 <rakhmerov> so, the general problem is keeping two systems in sync
16:26:33 <dzimine> the direction I will be thinking is "to rely on one source of truth", not DB and MQ. Use DB as a source of truth.
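[Editor's sketch, not part of the meeting log] One way to read the "DB as a source of truth" idea, combined with the "QUEUING" status recalled earlier, is to record tasks in a queued state inside the transaction and publish them only afterwards, based on what the DB says. The code below is a minimal sketch under that assumption, not the design the team chose. The names `make_db`, `start_workflow`, `dispatch_queued`, the `QUEUED`/`RUNNING` states and the `publish` callback are all hypothetical, and sqlite3 is used only to keep the example runnable.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a hypothetical tasks table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
    )
    conn.commit()
    return conn


def start_workflow(conn, task_names):
    """Persist all tasks as QUEUED in one transaction; nothing is published here."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            "INSERT INTO tasks (name, state) VALUES (?, 'QUEUED')",
            [(name,) for name in task_names],
        )


def dispatch_queued(conn, publish):
    """Publish every QUEUED task, then mark it RUNNING.

    If the process dies between publish() and the UPDATE, the task stays
    QUEUED and is published again on the next call, so delivery is
    at-least-once and the receiver has to tolerate duplicates.
    """
    rows = conn.execute(
        "SELECT id, name FROM tasks WHERE state = 'QUEUED' ORDER BY id"
    ).fetchall()
    for task_id, name in rows:
        publish(name)  # stand-in for sending the task to the message queue
        with conn:
            conn.execute(
                "UPDATE tasks SET state = 'RUNNING' WHERE id = ?", (task_id,)
            )
    return [name for _, name in rows]


if __name__ == "__main__":
    db = make_db()
    start_workflow(db, ["task1", "task2"])
    sent = []
    print(dispatch_queued(db, sent.append))
```

Because tasks are published only after their rows are committed, a returning result can always find its task, and after a crash the QUEUED rows show which tasks still need to be sent. The trade-off is possible duplicate delivery.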
16:26:48 <rakhmerov> usually people use something like XA transactions for this
16:27:01 <rakhmerov> which are not available at this point for us
16:27:16 <rakhmerov> may be
16:27:50 <rakhmerov> we could even pass it by somehow if say executors could access DB
16:28:06 <rakhmerov> but not really likely
16:28:14 <rakhmerov> ok
16:28:28 <rakhmerov> let's move on now, just asking you to think about it again
16:28:53 <rakhmerov> w/o it the system won't actually work on any more or less serious load
16:29:20 <rakhmerov> another thing I was planning to discuss really quick is our planned release 0.2
16:29:44 <rakhmerov> #topic Release 0.2
16:29:47 <rakhmerov> https://launchpad.net/mistral/+milestone/0.2
16:30:11 <rakhmerov> the thing is that looks like we're seriously behind the schedule with it
16:30:29 <rakhmerov> basically we have just 9 business days left
16:30:54 <bhavenst> If there is anything relatively simple I don't mind taking it, since blueprints are not at all pressing.
16:31:08 <rakhmerov> and our resources turned out to be not enough because Nikolay and I got buried with a lot of unplanned stuff
16:31:26 <rakhmerov> yeah
16:31:28 <dzimine> winson has done #1380873 locally, expect it on review today/tomorrow.
16:31:40 <rakhmerov> ok, that's good
16:31:52 <rakhmerov> bhavenst, let me see what we have
16:31:55 <bhavenst> sure
16:32:20 <dzimine> but he and I won't have time to do events mistral-event-listeners-http
16:32:32 <rakhmerov> but generally the situation is that of 9 days I have effectively 4-5 days, the rest I'll have to spend on summit preps and other activities
16:32:37 <rakhmerov> Nikolay too
16:32:46 <rakhmerov> yeah, I see
16:32:53 <rakhmerov> so two options again:
16:33:17 <rakhmerov> 1) we shrink the scope of 0.2 and push it on Oct 31 as planned
16:33:48 <rakhmerov> 2) we move the due date 2-3 weeks further
16:34:02 <rakhmerov> what do you think?
16:34:20 <rakhmerov> I guess what we could do is:
16:34:32 <rakhmerov> (by "do" I mean complete)
16:34:51 <rakhmerov> 1. https://blueprints.launchpad.net/mistral/+spec/mistral-direct-workflow-join-control
16:35:32 <rakhmerov> 2. https://blueprints.launchpad.net/mistral/+spec/mistral-pause-before-policy (btw, this one should be pretty easy and I could assign it to Bryan)
16:35:42 <akuznetsova_> There will be holiday in Russia and Paris summit, so one week will be out of scope
16:36:00 <rakhmerov> 3. https://blueprints.launchpad.net/mistral/+spec/mistral-dataflow-collections It's mostly done except it's not gonna be usable with that race condition
16:36:11 <rakhmerov> yes
16:36:14 <rakhmerov> good concern
16:36:44 <rakhmerov> 4. https://blueprints.launchpad.net/mistral/+spec/mistral-workflow-resume - Likely we could knock this down as well
16:37:36 <nikolaym> I thought that n.2 connected with n.4
16:37:46 <rakhmerov> so we definitely won't be able to tackle HA (testing etc.), I guess HTTP listeners and I have doubts about workflow resume too
16:37:58 <nikolaym> Resume and pause-before
16:38:45 <rakhmerov> well, logically yes. But strictly speaking they're separate things both needed for "manual checkpoints"
16:38:59 <rakhmerov> they could be done separately
16:39:22 <tsufiev> hi there! seems I missed the beginning of meeting. Do you have open discussion section :)?
16:39:50 <rakhmerov> hi Timur
16:39:53 <rakhmerov> not yet )
16:40:00 <rakhmerov> but soon
16:40:14 <akuznetsova_> Hi Timur
16:40:19 <rakhmerov> so what do you guys think about release date ?
16:40:27 <rakhmerov> let me put it this way...
16:40:41 <dzimine> IMO move out.
16:40:55 <rakhmerov> do you think it makes a lot of sense to push it before the summit whatever it takes?
16:41:03 <dzimine> but still do few things by Paris.
16:41:18 <tsufiev> rakhmerov, okay. I have a little update about Merlin Workbook Builder for Mistral
16:41:24 <rakhmerov> without any official announcements?
16:41:38 <rakhmerov> tsufiev, sure, a couple of mins pls
16:41:54 <dzimine> "without any official announcements?" how do u mean?
16:42:00 <tsufiev> rakhmerov, np
16:42:11 <rakhmerov> my opinion: nobody will really get familiar with the release if we push it two days before the summit
16:42:49 <rakhmerov> dzimine, I mean "We're pleased to announce Mistral 0.2, here's the link to the new capabilities etc. etc."
16:43:29 <rakhmerov> so my suggestion is move it out but yes, implement most important things
16:43:46 <rakhmerov> for example, join
16:43:53 <rakhmerov> and try to fix that race condition
16:44:10 <rakhmerov> thoughts?
16:44:19 <rakhmerov> let's vote :)
16:44:20 <dzimine> the three big areas to me are 1) resuming workflow 2) work under load (including this transaction problem we discussed) and 3) refine REST API
16:44:21 <bhavenst> sounds reasonable
16:44:57 <dzimine> and fixing race condition, etc, need to take as needed.
16:45:06 <rakhmerov> dzimine, I agree but don't see chances to fully address all this before Nov
16:45:12 <dzimine> there's no point of "pause before" till we learn to resume :)
16:45:24 <rakhmerov> race condition for me is actually the #1 problem
16:45:27 <dzimine> that's why I am for moving the date out.
16:45:35 <rakhmerov> ok, I see
16:45:47 <rakhmerov> ok, any objections?
16:45:50 <rakhmerov> Nikolay?
16:45:54 <rakhmerov> Nastya?
16:46:10 <dzimine> and I agree race condition is #1 prio.
16:46:13 <akuznetsova_> I guess that we need to move release
16:46:19 <rakhmerov> ok
16:46:24 <rakhmerov> nikolaym?
16:46:44 <nikolaym> Yes, move out the release
16:46:46 <rakhmerov> basically we already have all estimates for the BPs so I could take some time and suggest a new date
16:47:02 <rakhmerov> like I said I guess it should be at least 2 weeks later
16:47:41 <rakhmerov> we just need to look at everyone's schedule and make a conscious decision
16:47:46 <rakhmerov> ok, decided
16:47:58 <rakhmerov> tsufiev, please speak :)
16:48:25 <rakhmerov> #action Race condition in engine is the #1 problem to fix
16:48:49 <tsufiev> rakhmerov, so, good news: there is chance that we'll get UI for Workbook Builder in Merlin done by designer, not me :)
16:48:51 <rakhmerov> #action Suggest a new date for 0.2 release
16:49:00 <rakhmerov> ooh
16:49:01 <rakhmerov> cool
16:49:19 <rakhmerov> who is it going to be? Already known to us?
16:49:48 <tsufiev> so if you have more feedback to share about the current state of Workbook Builder, you are strongly encouraged to share it - so it will be taken into account
16:50:01 <tsufiev> rakhmerov, I've spoken with Bogdan Dudko
16:50:09 <rakhmerov> ok
16:50:21 <tsufiev> he is from Mirantis Fuel team, and may have some free cycles to help Merlin
16:50:36 <rakhmerov> tsufiev, ooh, that is awesome
16:51:15 <rakhmerov> so, remember I sent you a list of suggestions..
16:51:37 <tsufiev> rakhmerov, yep, it will be the primary input for Bogdan
16:51:46 <rakhmerov> do you think they all could be done using this JS framework?
16:51:58 <rakhmerov> or how is it going to be done?
16:52:36 <tsufiev> I'd like to keep as much as possible interactions on client-side to make Merlin more responsive (less calls to server)
16:52:37 <rakhmerov> I mean I am not really sure what depends on barricade JS and the designer skills :)
16:52:46 <rakhmerov> ok
16:52:55 <rakhmerov> let's see
16:53:10 <rakhmerov> and btw, we need to sync up on DSL changes again
16:53:11 <tsufiev> well, the project definitely needs at least 2 people generating some ideas )
16:53:32 <rakhmerov> I looked at Merlin about 3 days ago and there're some discrepancies
16:53:36 <tsufiev> because I have some problems with simultaneous creating new design and implementing it
16:53:52 <rakhmerov> yup, totally understandable
16:54:17 <tsufiev> rakhmerov, could you write about them to ML?
16:54:34 <tsufiev> there is a thread already known to you...
16:55:25 <rakhmerov> I think some of us will be able to contribute after the summit when the dust settles
16:55:35 <rakhmerov> ok, I'll do that
16:55:44 <tsufiev> rakhmerov, thanks!
16:55:52 <akuznetsova_> Only after our release )
16:56:22 <rakhmerov> #action Write about DSL discrepancies in Merlin to ML
16:56:31 <rakhmerov> :))
16:56:37 <tsufiev> I hope to find some contributors at summit or at least make some advertising from Merlin :)
16:56:46 <tsufiev> s/from/for/
16:56:51 <rakhmerov> will you be there too?
16:56:54 <tsufiev> yes
16:56:58 <rakhmerov> cool
16:57:12 <rakhmerov> ok
16:57:17 <rakhmerov> guys, anything else?
16:57:41 <rakhmerov> I discussed the most important things that I wanted (race condition and 0.2 release date)
16:57:58 <rakhmerov> so let's then close the meeting
16:58:17 <tsufiev> bye!
16:58:24 <akuznetsova_> Bye
16:58:32 <nikolaym> bye!
16:58:52 <rakhmerov> bhavenst, I would suggest you try 'pause-before' BP and we're waiting for the news about Ceilometer integration
16:59:08 <rakhmerov> I'll assign it to you
16:59:15 <rakhmerov> :)
16:59:15 <bhavenst> OK, will give it a shot
16:59:21 <rakhmerov> ok, cool
16:59:24 <rakhmerov> thanks guys
16:59:27 <rakhmerov> bye-bye
16:59:29 <bhavenst> bye
16:59:34 <rakhmerov> #endmeeting