16:02:20 <rakhmerov> #startmeeting Mistral
16:02:20 <openstack> Meeting started Mon Oct 20 16:02:20 2014 UTC and is due to finish in 60 minutes.  The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:24 <openstack> The meeting name has been set to 'mistral'
16:02:26 <rakhmerov> sorry guys
16:02:27 <rakhmerov> :)
16:02:33 <rakhmerov> hi
16:02:35 <ttx> rakhmerov: be my guest :)
16:02:56 <rakhmerov> thanks!
16:02:58 <dzimine> hi here
16:03:04 <rakhmerov> hi Dmitri
16:03:08 <rakhmerov> how are you?
16:03:09 <bhavenst> hi
16:03:16 <bhavenst> long time no type
16:03:30 <dzimine> hi bhavenst
16:04:13 <rakhmerov> hey bryan
16:04:18 <rakhmerov> how have you been?
16:04:44 <bhavenst> doing fine, just busy @ work.  But letting up so have been starting work on blueprints..
16:04:46 <nikolaym> hi !
16:05:08 <rakhmerov> ooh, very cool
16:05:15 <rakhmerov> ok, let's start
16:05:27 <dzimine> I was talking about Mistral at an OpenStack automation meetup, and got strong feedback about the need for Ceilometer integration, exactly along the lines of Bryan's blueprint.
16:05:35 <akuznetsova_> Hi, sorry, I am late
16:05:42 <rakhmerov> sorry for not sending out an agenda, i've been really really busy these days
16:06:01 <rakhmerov> yeaah, that is cool
16:06:07 <rakhmerov> hi Nastya :)
16:06:13 <rakhmerov> ok
16:06:34 <rakhmerov> we didn't have any AIs from last meetings, they were really short
16:06:48 <rakhmerov> some folks were on vacations or busy with something
16:07:10 <rakhmerov> so let's go straight to the current status
16:07:22 <rakhmerov> #topic Current Status (by team members)
16:08:25 <rakhmerov> my status for the last couple of weeks is: I've been working mostly on bugs (both client and server), working on the examples, and preparing presentations
16:08:26 <nikolaym> Almost all last week I worked on for-each, it just works fine now
16:08:38 <akuznetsova_> I've added simple positive and negative tests for cron-triggers (API and CLI integration tests)
16:08:42 <bhavenst> Starting thinking about metrics blueprints, sent questions to Ceilometer person, started an etherpad..
16:08:44 <rakhmerov> I still need to review it
16:09:00 <nikolaym> And today I found bug with auth in std.http
16:09:15 <rakhmerov> bhavenst, could you please send it out via openstack-dev?
16:09:31 <rakhmerov> nikolaym, ok, I saw your patch
16:09:34 <bhavenst> Yeah, can do that when it's a bit more refined. :)
16:09:44 <rakhmerov> sure
16:10:20 <rakhmerov> as far as for-each
16:10:36 <rakhmerov> what Nikolay did looks ok
16:10:56 <rakhmerov> however, looks like we have some serious design issue
16:11:18 <rakhmerov> basically we have race condition between some transactions
16:11:21 <rakhmerov> in our engine
16:11:42 <rakhmerov> and in case of for-each it gets clearly revealed
16:12:37 <rakhmerov> the point is that when engine starts the workflow it creates all the tasks in DB (and execution) and currently we start tasks from within the same transaction
16:13:01 <rakhmerov> and there's a reason for this although it's considered anti-pattern
16:13:22 <rakhmerov> I mean to call any external things from DB transactions like rabbit mq
16:14:06 <rakhmerov> so it may happen that task is finished and its result comes back to engine before that first transaction completes
16:14:20 <rakhmerov> and the on_task_result() method won't find the task in DB
16:15:06 <dzimine> oh, nice.
16:15:11 <rakhmerov> I thought in case of READ_COMMITTED transactions there shouldn't be race conditions because the second transaction should block on the same object that is not committed yet
16:15:47 <rakhmerov> but either 1) I was wrong or 2) we are doing something improperly somewhere else
16:16:04 <rakhmerov> e.g. configuring transaction isolation level
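The race described above can be reproduced in a minimal, deterministic sketch. Everything here is hypothetical (ToyDB, start_inside_tx, start_after_commit are illustration names, not Mistral code), and it assumes that a plain read under READ COMMITTED does not block on another transaction's uncommitted insert but simply does not see the row, which is how nonlocking reads behave in MySQL/InnoDB.

```python
class ToyDB:
    """Toy store: rows become visible to other transactions only on commit."""

    def __init__(self):
        self.committed = {}  # rows visible to every transaction
        self.pending = {}    # rows written by the still-open transaction

    def insert(self, task_id, state):
        self.pending[task_id] = state

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def read_committed(self, task_id):
        # Another transaction sees only committed rows and does not block.
        return self.committed.get(task_id)


def on_task_result(db, task_id):
    """Stand-in for the engine callback; True if the task row was found."""
    return db.read_committed(task_id) is not None


def start_inside_tx(db, task_ids):
    """Current behaviour: tasks are started before the transaction commits."""
    outcomes = []
    for task_id in task_ids:
        db.insert(task_id, "RUNNING")
        # A very fast task (e.g. echo) reports back immediately.
        outcomes.append(on_task_result(db, task_id))
    db.commit()
    return outcomes


def start_after_commit(db, task_ids):
    """Option 1: tasks are started only after the transaction commits."""
    for task_id in task_ids:
        db.insert(task_id, "RUNNING")
    db.commit()
    return [on_task_result(db, task_id) for task_id in task_ids]


inside = start_inside_tx(ToyDB(), ["t1", "t2"])
after = start_after_commit(ToyDB(), ["t1", "t2"])
print(inside)  # [False, False]
print(after)   # [True, True]
```

The sketch only shows why fast tasks lose the race; it says nothing about what happens if the engine dies between commit and dispatch.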
16:16:09 <bhavenst> I hit something similar many times during testing of the failed workflow bug I worked on, but that was before the refactoring so not sure if it applies.
16:16:15 <rakhmerov> so it's something that we need to test more carefully
16:16:30 <rakhmerov> it might
16:16:35 <rakhmerov> yes
16:17:05 <bhavenst> Multiple tasks doing things like echos, which I guess are fast enough to cause such an issue
16:17:23 <rakhmerov> well, first of all, if you run mistral for something serious (not unit tests) then forget about sqlite
16:17:36 <dzimine> foreach exacerbates the problem indeed. Now we have many (way too many) calls to rabbit within transactional scope.
16:17:39 <rakhmerov> yes, bhavenst, exactly!
16:17:49 <rakhmerov> yes
16:17:54 <rakhmerov> 100% right
16:17:56 <bhavenst> My solution was to add sleeps. :)
16:18:19 <rakhmerov> yeah, that's what Nikolay did I guess to make it work
16:18:49 <rakhmerov> so the two obvious options (ooh god, we discussed it already so many times):
16:19:25 <rakhmerov> 1) run tasks after transaction completes
16:20:09 <rakhmerov> 2) leave as is and use something to do proper synchronization (even though it's not clear to me)
16:20:39 <rakhmerov> option 1 has the problem of being vulnerable to failures
16:21:17 <rakhmerov> so if engine fails right after transaction and before pushing tasks into rabbit then the system will end up in an inconsistent state
16:21:40 <rakhmerov> and there will be no way to figure out if some tasks have already been put into rabbit
16:22:02 <dzimine> this is déjà vu. I need to recall all the details of the arguments we had...
16:22:06 <rakhmerov> so in other words, our DB state won't correspond to the state of the MQ
16:22:13 <rakhmerov> yeah
16:22:40 <rakhmerov> I think it's kind of a challenge to discuss it in IRC because the problem is too complicated
16:22:56 <rakhmerov> but I'm just asking you to think about it if you have a chance
16:22:58 <dzimine> I recall we discussed a "QUEUEING" status for a task..
16:23:10 <rakhmerov> you may come up with some ideas
16:23:12 <dzimine> suggest we set up a time to brainstorm it.
16:23:17 <rakhmerov> yes
16:23:21 <bhavenst> Are you guys going to be @ Paris summit?
16:23:26 <rakhmerov> so I'm just letting you know...
16:23:31 <rakhmerov> yes
16:23:35 <rakhmerov> we are
16:23:42 <bhavenst> ok cool, I'll be there too
16:23:43 <dzimine> outside of this meeting (or if we have time left)
16:23:46 <rakhmerov> it may be a good time to get back to that problem
16:23:52 <rakhmerov> ooh, nice
16:24:08 <rakhmerov> I don't think we can fix it before the summit anyway
16:24:33 <dzimine> how will it work (or rather "not work") in between?
16:24:37 <rakhmerov> there's just a fundamental problem of keeping two systems (DB and MQ) in a consistent state
16:24:42 <dzimine> fast tasks will fail?
16:24:49 <rakhmerov> yes
16:25:10 <rakhmerov> surprisingly, it mostly works unless we use something like 'for-each'
16:25:32 <rakhmerov> I think the reason is that we always run tasks via oslo
16:25:37 <rakhmerov> even echo :)
16:26:00 <rakhmerov> again, I'm still hoping that we just need to configure mysql properly
16:26:06 <rakhmerov> but
16:26:13 <rakhmerov> it may not really be helpful
16:26:30 <rakhmerov> so, the general problem is keeping two systems in sync
16:26:33 <dzimine> the direction I will be thinking is "to rely on one source of truth", not DB and MQ. Use DB as a source of truth.
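One way to read "use DB as a source of truth", combined with the queueing status mentioned earlier, is the pattern often called a transactional outbox. The sketch below is an assumption about how that could look, not something the team agreed on: the tasks table, the 'QUEUED' and 'RUNNING' state names, and the create_tasks, dispatch_queued and publish helpers are all hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
import sqlite3


def create_tasks(conn, names):
    """Persist tasks as QUEUED in one transaction; nothing is sent to the MQ here."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO tasks (name, state) VALUES (?, 'QUEUED')",
            [(name,) for name in names],
        )


def dispatch_queued(conn, publish):
    """Send every task the DB still marks as QUEUED, then mark it RUNNING."""
    rows = conn.execute(
        "SELECT id, name FROM tasks WHERE state = 'QUEUED' ORDER BY id"
    ).fetchall()
    for task_id, name in rows:
        publish(task_id, name)  # the engine may die here; the row stays QUEUED
        with conn:
            conn.execute(
                "UPDATE tasks SET state = 'RUNNING' WHERE id = ?", (task_id,)
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
create_tasks(conn, ["a", "b", "c"])

sent = []
crash = {"armed": True}


def publish(task_id, name):
    """Hypothetical MQ publish that fails once to simulate an engine crash."""
    if name == "b" and crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("engine died before publishing task b")
    sent.append(name)


try:
    dispatch_queued(conn, publish)   # sends "a", then "crashes" on "b"
except RuntimeError:
    pass
dispatch_queued(conn, publish)       # after restart: resumes from DB state
print(sent)  # ['a', 'b', 'c']
```

Because the DB records which tasks have not been sent yet, a restarted engine can tell what is still missing from the MQ. The cost is at-least-once delivery: a crash after publish but before the UPDATE would send the same task twice, so executors would need to tolerate duplicates.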
16:26:48 <rakhmerov> usually people use something like XA transactions for this
16:27:01 <rakhmerov> which are not available at this point for us
16:27:16 <rakhmerov> may be
16:27:50 <rakhmerov> we could even bypass it somehow if, say, executors could access DB
16:28:06 <rakhmerov> but not really likely
16:28:14 <rakhmerov> ok
16:28:28 <rakhmerov> let's move on now, just asking you to think about it again
16:28:53 <rakhmerov> w/o it the system won't actually work on any more or less serious load
16:29:20 <rakhmerov> another thing I was planning to discuss really quick is our planned release 0.2
16:29:44 <rakhmerov> #topic Release 0.2
16:29:47 <rakhmerov> https://launchpad.net/mistral/+milestone/0.2
16:30:11 <rakhmerov> the thing is that looks like we're seriously behind the schedule with it
16:30:29 <rakhmerov> basically we have just 9 business days left
16:30:54 <bhavenst> If there is anything relatively simple I don't mind taking it, since blueprints are not at all pressing.
16:31:08 <rakhmerov> and our resources turned out to be insufficient because Nikolay and I got buried with a lot of unplanned stuff
16:31:26 <rakhmerov> yeah
16:31:28 <dzimine> winson has done #1380873 locally, expect it on review today/tomorrow.
16:31:40 <rakhmerov> ok, that's good
16:31:52 <rakhmerov> bhavenst, let me see what we have
16:31:55 <bhavenst> sure
16:32:20 <dzimine> but he and I won't have time to do events mistral-event-listeners-http
16:32:32 <rakhmerov> but generally the situation is that of 9 days I have effectively 4-5 days, the rest I'll have to spend on summit preps and other activities
16:32:37 <rakhmerov> Nikolay too
16:32:46 <rakhmerov> yeah, I see
16:32:53 <rakhmerov> so two options again:
16:33:17 <rakhmerov> 1) we shrink the scope of 0.2 and push it on Oct 31 as planned
16:33:48 <rakhmerov> 2) we move the due date 2-3 weeks further
16:34:02 <rakhmerov> what do you think?
16:34:20 <rakhmerov> I guess what we could do is:
16:34:32 <rakhmerov> (by "do" I mean complete)
16:34:51 <rakhmerov> 1. https://blueprints.launchpad.net/mistral/+spec/mistral-direct-workflow-join-control
16:35:32 <rakhmerov> 2. https://blueprints.launchpad.net/mistral/+spec/mistral-pause-before-policy (btw, this one should be pretty easy and I could assign it to Bryan)
16:35:42 <akuznetsova_> There will be a holiday in Russia and the Paris summit, so one week will be out of scope
16:36:00 <rakhmerov> 3. https://blueprints.launchpad.net/mistral/+spec/mistral-dataflow-collections It's mostly done except it's not gonna be usable with that race condition
16:36:11 <rakhmerov> yes
16:36:14 <rakhmerov> good concern
16:36:44 <rakhmerov> 4. https://blueprints.launchpad.net/mistral/+spec/mistral-workflow-resume - Likely we could knock this down as well
16:37:36 <nikolaym> I thought that n.2 connected with n.4
16:37:46 <rakhmerov> so we definitely won't be able to tackle HA (testing etc.), I guess HTTP listeners and I have doubts about workflow resume too
16:37:58 <nikolaym> Resume and pause-before
16:38:45 <rakhmerov> well, logically yes. But strictly speaking they're separate things both needed for "manual checkpoints"
16:38:59 <rakhmerov> they could be done separately
16:39:22 <tsufiev> hi there! seems I missed the beginning of meeting. Do you have open discussion section :)?
16:39:50 <rakhmerov> hi Timur
16:39:53 <rakhmerov> not yet )
16:40:00 <rakhmerov> but soon
16:40:14 <akuznetsova_> Hi Timur
16:40:19 <rakhmerov> so what do you guys think about the release date?
16:40:27 <rakhmerov> let me put it this way...
16:40:41 <dzimine> IMO move out.
16:40:55 <rakhmerov> do you think it makes a lot of sense to push it before the summit whatever it takes?
16:41:03 <dzimine> but still do few things by Paris.
16:41:18 <tsufiev> rakhmerov, okay. I have a little update about Merlin Workbook Builder for Mistral
16:41:24 <rakhmerov> without any official announcements?
16:41:38 <rakhmerov> tsufiev, sure, a couple of mins pls
16:41:54 <dzimine> "without any official announcements?" what do you mean?
16:42:00 <tsufiev> rakhmerov, np
16:42:11 <rakhmerov> my opinion: nobody will really get familiar with the release if we push it two days before the summit
16:42:49 <rakhmerov> dzimine, I mean "We're pleased to announce Mistral 0.2, here's the link to the new capabilities etc. etc."
16:43:29 <rakhmerov> so my suggestion is move it out but yes, implement most important things
16:43:46 <rakhmerov> for example, join
16:43:53 <rakhmerov> and try to fix that race condition
16:44:10 <rakhmerov> thoughts?
16:44:19 <rakhmerov> let's vote :)
16:44:20 <dzimine> the three big areas to me are 1) resuming workflow 2) work under load (including this transaction problem we discussed) and 3) refine REST API
16:44:21 <bhavenst> sounds reasonable
16:44:57 <dzimine> and fixing the race condition, etc., we need to take on as needed.
16:45:06 <rakhmerov> dzimine, I agree but don't see chances to fully address all this before Nov
16:45:12 <dzimine> there's no point of "pause before" till we learn to resume :)
16:45:24 <rakhmerov> race condition for me is actually the #1 problem
16:45:27 <dzimine> that's why I am for moving the date out.
16:45:35 <rakhmerov> ok, I see
16:45:47 <rakhmerov> ok, any objections?
16:45:50 <rakhmerov> Nikolay?
16:45:54 <rakhmerov> Nastya?
16:46:10 <dzimine> and I agree race condition is #1 prio.
16:46:13 <akuznetsova_> I guess that we need to move release
16:46:19 <rakhmerov> ok
16:46:24 <rakhmerov> nikolaym?
16:46:44 <nikolaym> Yes, move out the release
16:46:46 <rakhmerov> basically we already have all estimates for the BPs so I could take some time and suggest a new date
16:47:02 <rakhmerov> like I said I guess it should be at least 2 weeks later
16:47:41 <rakhmerov> we just need to look at everyone's schedule and make a conscious decision
16:47:46 <rakhmerov> ok, decided
16:47:58 <rakhmerov> tsufiev, please speak :)
16:48:25 <rakhmerov> #action Race condition in engine is the #1 problem to fix
16:48:49 <tsufiev> rakhmerov,  so, good news: there is chance that we'll get UI for Workbook Builder in Merlin done by designer, not me :)
16:48:51 <rakhmerov> #action Suggest a new date for 0.2 release
16:49:00 <rakhmerov> ooh
16:49:01 <rakhmerov> cool
16:49:19 <rakhmerov> who is it going to be? Already known to us?
16:49:48 <tsufiev> so if you have more feedback to share about the current state of Workbook Builder, you are strongly encouraged to share it - so it will be taken into account
16:50:01 <tsufiev> rakhmerov, I've spoken with Bogdan Dudko
16:50:09 <rakhmerov> ok
16:50:21 <tsufiev> he is from Mirantis Fuel team, and may have some free cycles to help Merlin
16:50:36 <rakhmerov> tsufiev, ooh, that is awesome
16:51:15 <rakhmerov> so, remember I sent you a list of suggestions..
16:51:37 <tsufiev> rakhmerov, yep, it will be the primary input for Bogdan
16:51:46 <rakhmerov> do you think they all could be done using this JS framework?
16:51:58 <rakhmerov> or how is it going to be done?
16:52:36 <tsufiev> I'd like to keep as many interactions as possible on the client side to make Merlin more responsive (fewer calls to server)
16:52:37 <rakhmerov> I mean I am not really sure what depends on barricade JS and the designer skills :)
16:52:46 <rakhmerov> ok
16:52:55 <rakhmerov> let's see
16:53:10 <rakhmerov> and btw, we need to sync up on DSL changes again
16:53:11 <tsufiev> well, the project definitely needs at least 2 people generating some ideas )
16:53:32 <rakhmerov> I looked at Merlin about 3 days ago and there're some discrepancies
16:53:36 <tsufiev> because I have some problems with simultaneously creating a new design and implementing it
16:53:52 <rakhmerov> yup, totally understandable
16:54:17 <tsufiev> rakhmerov, could you write about them to ML?
16:54:34 <tsufiev> there is a thread already known to you...
16:55:25 <rakhmerov> I think some of us will be able to contribute after the summit when the dust settles
16:55:35 <rakhmerov> ok, I'll do that
16:55:44 <tsufiev> rakhmerov, thanks!
16:55:52 <akuznetsova_> Only after our release )
16:56:22 <rakhmerov> #action Write about DSL discrepancies in Merlin to ML
16:56:31 <rakhmerov> :))
16:56:37 <tsufiev> I hope to find some contributors at summit or at least make some advertising from Merlin :)
16:56:46 <tsufiev> s/from/for/
16:56:51 <rakhmerov> will you be there too?
16:56:54 <tsufiev> yes
16:56:58 <rakhmerov> cool
16:57:12 <rakhmerov> ok
16:57:17 <rakhmerov> guys, anything else?
16:57:41 <rakhmerov> I discussed the most important things that I wanted (race condition and 0.2 release date)
16:57:58 <rakhmerov> so let's then close the meeting
16:58:17 <tsufiev> bye!
16:58:24 <akuznetsova_> Bye
16:58:32 <nikolaym> bye!
16:58:52 <rakhmerov> bhavenst, I would suggest you try 'pause-before' BP and we're waiting for the news about Ceilometer integration
16:59:08 <rakhmerov> I'll assign it to you
16:59:15 <rakhmerov> :)
16:59:15 <bhavenst> OK, will give it a shot
16:59:21 <rakhmerov> ok, cool
16:59:24 <rakhmerov> thanks guys
16:59:27 <rakhmerov> bye-bye
16:59:29 <bhavenst> bye
16:59:34 <rakhmerov> #endmeeting