17:00:52 <sandywalsh> #startmeeting
17:00:53 <openstack> Meeting started Tue Nov 1 17:00:52 2011 UTC. The chair is sandywalsh. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:12 <sandywalsh> #topic plead forgiveness
17:01:18 <sandywalsh> :)
17:01:29 <sandywalsh> #link http://wiki.openstack.org/Meetings/Orchestration
17:01:59 <maoy> Hello
17:02:06 <sandywalsh> o/
17:02:51 <sandywalsh> this will be a short meeting I think
17:03:07 <sandywalsh> so, I haven't had a lot of time to get any prep done
17:03:27 <sandywalsh> (specifically the video)
17:03:44 <maoy> that's fine by me, since I watched your talk..
17:03:51 <sandywalsh> but regardless there are two issues I think (and please jump in to correct me)
17:04:00 <sandywalsh> 1. the tactical issues
17:04:18 <sandywalsh> a. how to get events back to the orchestration layer from the services
17:04:31 <sandywalsh> b. where the orchestration service lives (in scheduler?)
17:04:49 <sandywalsh> and 2. what is the strategic approach to orchestration
17:05:02 <sandywalsh> a. a trivial state machine
17:05:10 <sandywalsh> b. a more complex state machine (petri)
17:05:30 <sandywalsh> c. another service (some pre-existing library)
17:05:33 <sandywalsh> d. other?
17:05:42 <sandywalsh> maoy, I think this is where your paper comes in
17:06:02 <sandywalsh> (which I have to apologize for, I haven't read yet, but it's at the top of my stack)
17:06:29 <sandywalsh> In the link I have a list of what I think are tactical items
17:06:47 <maoy> great. i was about to say that the paper is quite relevant here.
17:06:47 <sandywalsh> which I think are applicable regardless of the strategic approach
17:07:25 <sandywalsh> good ... I'm keen to read it. I'll try to get some meaningful feedback on it by next meeting
17:07:40 <maoy> I think that the orchestration might make more sense to be below the scheduler.
17:08:23 <maoy> sandy, i'm also looking at petri net.
17:08:24 <sandywalsh> ok, so scheduler talks to orchestration and steps out of the way?
17:09:02 <sandywalsh> I sort of envisioned orchestration talking to scheduler, but you suggest the other way around?
17:10:03 <maoy> to me it depends on how we define what exactly the orchestration layer does
17:10:30 <sandywalsh> well, maoy if you can perhaps create a wiki page to summarize your idea (nothing fancy), we can comment on it there?
17:10:46 <maoy> sure
17:10:49 <maoy> i'll work on that
17:10:51 <sandywalsh> excellent
17:11:22 <maoy> i have a question on the petri net
17:11:22 <sandywalsh> we've started considering what it would take to do simple retry, so hopefully that will give us a little bit of the tactical stuff we need
17:11:26 <sandywalsh> sure
17:11:44 <vladimir3p> Hi All, Sorry for being late and you probably already discussed it, but I guess we need to divide it into several parts, where one of them - return status over AMQP is kind of related to Orchestrator, but not really the orchestrator
17:12:09 <vladimir3p> to me it seems like the orchestrator should be the one who requests from scheduler what to do ...
17:12:10 <maoy> petrinet is a great way to model concurrent processes. i'm just curious after the modeling what could we do with it
17:12:33 <sandywalsh> vladimir3p, yes, I outlined one suggestion in the agenda: http://wiki.openstack.org/Meetings/Orchestration
17:12:51 <sandywalsh> maoy, can you give an example?
17:13:38 <sandywalsh> vladimir3p, I think Orch should ask of the scheduler, but maoy is going to propose an alternative approach.
17:14:11 <vladimir3p> ah, sorry. I was definitely late for this meeting :-)
17:14:28 <maoy> i'm completely unfamiliar with celery and the other tool you mentioned in the talk, but I am wondering what benefit we have with the modeling effort
17:14:33 <sandywalsh> vladimir3p, np
17:14:56 <sandywalsh> maoy, from the feedback I got, we don't want to use celery tasks.
17:15:18 <maoy> sandy and vlad, I think we might have a similar idea, but use different understanding in the terminologies, esp on "orchestration"
17:15:31 <sandywalsh> quite likely
17:15:41 <vladimir3p> yep
17:15:52 <maoy> ok.
17:16:07 <sandywalsh> still, write up your suggestion and we'll make sure we're on the same page
17:16:23 <maoy> absolutely
17:16:39 <sandywalsh> #action maoy to write up his suggestions for how the orch service works with the scheduler (and other services)
17:17:27 <maoy> do we deal with high availability issues here?
17:17:35 <sandywalsh> maoy, my idea for using petri net was simply to be a "better state machine". There were no other immediate plans from there. Just generic hooks to the outside world
17:17:39 <maoy> e.g. the orchestrator crashes.
17:17:51 <maoy> got it
17:18:11 <sandywalsh> maoy, that's a big issue ... we're running into that now with the scheduler. How do we synchronize state when there are many concurrent workers?
17:18:42 <sandywalsh> Master-Slave works great for these problems since there's only one decision maker. But it's a single point of failure
17:18:59 <maoy> in the paper, we use ZooKeeper, which provides a quorum-based highly available storage and coordination service
17:19:25 <sandywalsh> Workers are great for scalability, but only when the tasks can be idempotent and can be done in parallel. Scheduling/State-management doesn't seem to be one of those problems.
17:19:42 <maoy> agreed.
17:19:51 <sandywalsh> #action sandy to learn about ZooKeeper
17:20:42 <vladimir3p> vlad
17:20:49 <vladimir3p> oops :-) sorry
17:20:51 <sandywalsh> ok ... I think those are two good starts. Ideally for next meeting we should be in some agreement how to tackle the concurrency problem.
17:21:17 <maoy> in the paper, we addressed 4 problems:
17:21:29 <sandywalsh> let's keep the discussion going on the mailing list. If zookeeper looks promising perhaps we work it into the tactical parts?
17:21:36 <sandywalsh> maoy, carry on ...
17:22:01 <maoy> concurrency, high availability, unexpected errors during worker execution, and imposing policies to prevent mis-operations
17:22:12 <maoy> we can probably ignore the 4th one
17:22:50 <sandywalsh> great ... that's the stuff we need to nail down.
17:22:51 <maoy> and see if the ideas in the others can be applied in nova in a non-disruptive way
17:23:03 <sandywalsh> #action give maoy some good feedback on his paper
17:23:19 <vladimir3p> a quick question - do you plan to apply same principles of "operation" orchestration not only for between scheduler-compute/volume nodes, but between API nodes - scheduler?
17:23:20 <sandywalsh> #link http://dl.dropbox.com/u/166877/CloudTransaction.pdf
17:23:49 <sandywalsh> vladimir3p, can you give an example?
17:24:09 <vladimir3p> when you create a bunch of instances the call goes to scheduler
17:24:22 <vladimir3p> but if it was not accepted/received you probably want to retry it
17:24:33 <vladimir3p> especially if we have multiple schedulers
17:24:49 <vladimir3p> actually, it applies to any operation performed over AMQP
17:25:06 <maoy> is AMQP lossy?
17:25:17 <maoy> i'm not very familiar with it.. sorry
17:25:19 <vladimir3p> it may get stuck there
17:26:24 <maoy> this is undesirable..
17:26:30 <vladimir3p> I was thinking of case when particular scheduler accepted request but crashed...
17:26:38 <vladimir3p> (as an example)
17:28:03 <maoy> seems like either we can retry and make scheduling job idempotent, or to fix amqp..
17:28:37 <sandywalsh> my assumption was the first step was to create the workflow and that would get picked up by orch layer and worked on from there.
17:29:02 <vladimir3p> ok, np
17:29:43 <sandywalsh> ok, well guys I think we have a good start here. Let's keep the discussion going on the ML once we review all the materials.
17:29:52 <maoy> great
17:29:54 <sandywalsh> cool?
17:30:02 <vladimir3p> fine
17:30:09 <maoy> i'll put up a wiki
17:30:17 <sandywalsh> excellent
17:30:21 <sandywalsh> ... thanks for your time guys
17:30:23 <maoy> my idea is still quite rough since I don't know nova that well
17:30:34 <vladimir3p> sandy, just to make sure - the same error/reply logic we could make "generic"
17:30:36 <maoy> but you guys will help me. :)
17:30:49 <vladimir3p> and try to apply it for API-sched communications
17:31:13 <vladimir3p> and it will be kind of an essential part of orch, but not really the orch
17:31:46 <sandywalsh> yes, makes sense
17:32:36 <sandywalsh> #endmeeting