17:00:52 #startmeeting
17:00:53 Meeting started Tue Nov 1 17:00:52 2011 UTC. The chair is sandywalsh. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:54 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:12 #topic plead forgiveness
17:01:18 :)
17:01:29 #link http://wiki.openstack.org/Meetings/Orchestration
17:01:59 Hello
17:02:06 o/
17:02:51 this will be a short meeting I think
17:03:07 so, I haven't had a lot of time to get any prep done
17:03:27 (specifically the video)
17:03:44 that's fine by me, since I watched your talk..
17:03:51 but regardless there are two issues I think (and please jump in to correct me)
17:04:00 1. the tactical issues
17:04:18 a. how to get events back to the orchestration layer from the services
17:04:31 b. where the orchestration service lives (in scheduler?)
17:04:49 and 2. what is the strategic approach to orchestration
17:05:02 a. a trivial state machine
17:05:10 b. a more complex state machine (petri)
17:05:30 c. another service (some pre-existing library)
17:05:33 d. other?
17:05:42 maoy, I think this is where your paper comes in
17:06:02 (which I have to apologize for, I haven't read yet, but it's at the top of my stack)
17:06:29 In the link I have a list of what I think are tactical items
17:06:47 great. i was about to say that the paper is quite relevant here.
17:06:47 which I think are applicable regardless of the strategic approach
17:07:25 good ... I'm keen to read it. I'll try to get some meaningful feedback on it by next meeting
17:07:40 I think that the orchestration might make more sense to be below the scheduler.
17:08:23 sandy, i'm also looking at petri net.
17:08:24 ok, so scheduler talks to orchestration and steps out of the way?
17:09:02 I sort of envisioned orchestration talking to scheduler, but you suggest the other way around?
17:10:03 to me it depends on how we define what exactly the orchestration layer does
17:10:30 well, maoy if you can perhaps create a wiki page to summarize your idea (nothing fancy), we can comment on it there?
17:10:46 sure
17:10:49 i'll work on that
17:10:51 excellent
17:11:22 i have a question on the petri net
17:11:22 we've started considering what it would take to do simple retry, so hopefully that will give us a little bit of the tactical stuff we need
17:11:26 sure
17:11:44 Hi All, Sorry for being late and you probably already discussed it, but I guess we need to divide it into several parts, where one of them - return status over AMQP - is kind of related to Orchestrator, but not really the orchestrator
17:12:09 to me it seems like the orchestrator should be the one who requests from scheduler what to do ...
17:12:10 petrinet is a great way to model concurrent processes. i'm just curious what we could do with it after the modeling
17:12:33 vladimir3p, yes, I outlined one suggestion in the agenda: http://wiki.openstack.org/Meetings/Orchestration
17:12:51 maoy, can you give an example?
17:13:38 vladimir3p, I think Orch should ask of the scheduler, but maoy is going to propose an alternative approach.
17:14:11 ah, sorry. I was definitely late for this meeting :-)
17:14:28 i'm completely unfamiliar with celery and the other tool you mentioned in the talk, but I am wondering what benefit we have with the modeling effort
17:14:33 vladimir3p, np
17:14:56 maoy, from the feedback I got, we don't want to use celery tasks.
17:15:18 sandy and vlad, I think we might have a similar idea, but use a different understanding of the terminology, esp. on "orchestration"
17:15:31 quite likely
17:15:41 yep
17:15:52 ok.
17:16:07 still, write up your suggestion and we'll make sure we're on the same page
17:16:23 absolutely
17:16:39 #action maoy to write up his suggestions for how the orch service works with the scheduler (and other services)
17:17:27 do we deal with high availability issues here?
17:17:35 maoy, my idea for using petri net was simply to be a "better state machine". There were no other immediate plans from there. Just generic hooks to the outside world
17:17:39 e.g. the orchestrator crashes.
17:17:51 got it
17:18:11 maoy, that's a big issue ... we're running into that now with the scheduler. How do we synchronize state when there are many concurrent workers
17:18:42 Master-Slave works great for these problems since there's only one decision maker. But it's a single point of failure
17:18:59 in the paper, we use ZooKeeper, which provides a quorum-based highly available storage and coordination service
17:19:25 Workers are great for scalability, but only when the tasks can be idempotent and can be done in parallel. Scheduling/State-management doesn't seem to be one of those problems.
17:19:42 agreed.
17:19:51 #action sandy to learn about ZooKeeper
17:20:42 vlad
17:20:49 oops :-) sorry
17:20:51 ok ... I think those are two good starts. Ideally for next meeting we should be in some agreement how to tackle the concurrency problem.
17:21:17 in the paper, we addressed 4 problems:
17:21:29 let's keep the discussion going on the mailing list. If zookeeper looks promising perhaps we work it into the tactical parts?
17:21:36 maoy, carry on ...
17:22:01 concurrency, high availability, unexpected errors during worker execution, and imposing policies to prevent mis-operations
17:22:12 we can probably ignore the 4th one
17:22:50 great ... that's the stuff we need to nail down.
17:22:51 and see if the ideas in the others can be applied in nova in a non-disruptive way
17:23:03 #action give maoy some good feedback on his paper
17:23:19 a quick question - do you plan to apply the same principles of "operation" orchestration not only between scheduler and compute/volume nodes, but also between API nodes and scheduler?
17:23:20 #link http://dl.dropbox.com/u/166877/CloudTransaction.pdf
17:23:49 vladimir3p, can you give an example?
17:24:09 when you create a bunch of instances the call goes to scheduler
17:24:22 but if it was not accepted/received you probably want to retry it
17:24:33 especially if we have multiple schedulers
17:24:49 actually, it applies to any operation performed over AMQP
17:25:06 is AMQP lossy?
17:25:17 i'm not very familiar with it.. sorry
17:25:19 it may get stuck there
17:26:24 this is undesirable..
17:26:30 I was thinking of the case when a particular scheduler accepted the request but crashed...
17:26:38 (as an example)
17:28:03 seems like either we can retry and make the scheduling job idempotent, or fix amqp..
17:28:37 my assumption was the first step was to create the workflow and that would get picked up by orch layer and worked on from there.
17:29:02 ok, np
17:29:43 ok, well guys I think we have a good start here. Let's keep the discussion going on the ML once we review all the materials.
17:29:52 great
17:29:54 cool?
17:30:02 fine
17:30:09 i'll put up a wiki
17:30:17 excellent
17:30:21 ... thanks for your time guys
17:30:23 my idea is still quite rough since I don't know nova that well
17:30:34 sandy, just to make sure - the same error/reply logic we could make "generic"
17:30:36 but you guys will help me. :)
17:30:49 and try to apply it for API-sched communications
17:31:13 and it will be kind of an essential part of orch, but not really the orch
17:31:46 yes, makes sense
17:32:36 #endmeeting
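
A rough sketch of the "retry and make the scheduling job idempotent" option raised at 17:28:03, for the case where a scheduler accepted a request but the caller never saw the reply. Nothing here is nova or AMQP code: `LossyChannel`, `IdempotentScheduler`, `call_with_retry` and the `host-N` placement are hypothetical names made up for illustration, and the in-memory dict stands in for whatever durable store a real service would need. The only idea it tries to show is that tagging each request with an id lets the caller retry blindly without scheduling twice.

```python
import uuid


class LossyChannel:
    """Hypothetical stand-in for a message bus that delivers the request
    but loses the first `lose_replies` replies."""

    def __init__(self, handler, lose_replies=0):
        self.handler = handler
        self.lose_replies = lose_replies
        self.lost = 0

    def call(self, request_id, payload):
        # The request always reaches the handler; only the reply may vanish.
        reply = self.handler(request_id, payload)
        if self.lost < self.lose_replies:
            self.lost += 1
            raise TimeoutError("reply lost")
        return reply


class IdempotentScheduler:
    """Hypothetical scheduler that remembers results by request id, so a
    retried request returns the earlier decision instead of a new one."""

    def __init__(self):
        self.results = {}   # assumption: a real service would persist this
        self.scheduled = 0  # number of placements actually made

    def handle(self, request_id, payload):
        if request_id in self.results:
            return self.results[request_id]
        self.scheduled += 1
        result = {"instance": payload["name"], "host": "host-%d" % self.scheduled}
        self.results[request_id] = result
        return result


def call_with_retry(channel, payload, max_attempts=3, request_id=None):
    """Retry the same request id until a reply arrives or attempts run out."""
    request_id = request_id or str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            return channel.call(request_id, payload)
        except TimeoutError as exc:
            last_error = exc
    raise last_error


if __name__ == "__main__":
    scheduler = IdempotentScheduler()
    channel = LossyChannel(scheduler.handle, lose_replies=2)
    print(call_with_retry(channel, {"name": "vm-1"}))
    print("placements made:", scheduler.scheduled)
```

Because the request id is generated once by the caller and reused on every attempt, the two lost replies in the usage example cost extra round trips but only one placement is made.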