20:01:08 <sandywalsh> #startmeeting
20:01:09 <openstack> Meeting started Thu Dec 1 20:01:08 2011 UTC. The chair is sandywalsh. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic.
20:01:20 <sandywalsh> #link http://wiki.openstack.org/Meetings/Orchestration
20:01:36 <sandywalsh> #info Orchestration
20:01:55 <maoy> I'll put another name up there: Task Management
20:02:08 <sandywalsh> I just threw together that agenda since I've been out for the last couple of weeks
20:02:34 <sandywalsh> maoy, k, let's tackle yours first ... can you explain a little about Task Management?
20:02:43 <sandywalsh> #topic Task Management
20:02:54 <maoy> ah.
20:03:01 <maoy> i was referring to the new naming effort
20:03:09 <sandywalsh> ah, heh ... gotcha
20:03:17 <sandywalsh> ok, let's tackle naming
20:03:20 <sandywalsh> #topic naming
20:03:44 <maoy> i like transaction management, but it does sound like db lingo
20:03:53 <sandywalsh> So, orchestration is causing trouble for people
20:04:02 <maoy> so task might be a more general alternative
20:04:18 <n0ano> not for me, I'm beginning to like orchestration, do we really need to change?
20:04:22 <sandywalsh> I think it's ultimately about distributed state machine management
20:04:51 <sandywalsh> n0ano, I agree, but people think of it in the larger BPM / Workflow sense
20:05:02 <sandywalsh> and we're more tactical than that
20:05:27 <n0ano> so we just need to make sure others know that it is tactical and not workflow
20:05:42 <sandywalsh> Tactical Orchestration?
20:05:48 * beekhof doesn't have a problem with Orchestration
20:05:48 <mikeyp> to add some background, 'Orchestration' is being confused with BPM, and larger cloud management frameworks.
20:06:03 <beekhof> sandywalsh: sounds like a nuclear option :)
20:06:08 <sandywalsh> heh
20:06:17 <n0ano> and scheduling is too specific
20:06:17 <maoy> lo
20:06:18 <sandywalsh> naming issues suck :)
20:06:42 <maoy> if we don't have a consensus we might just stay where we are..
20:06:48 <sandywalsh> that's why I like State Management ... little room for interpretation
20:07:12 <maoy> according to wikipedia.. Orchestration describes the automated arrangement, coordination, and management of complex computer systems, middleware, and services.
20:07:29 <beekhof> wfm
20:07:36 <sandywalsh> certainly sticking with orchestration is the easiest
20:07:59 <sandywalsh> I'll add a blurb to the working group description to highlight the distinction
20:08:10 <sandywalsh> vote to stick with orchestration?
20:08:15 * n0ano aye
20:08:24 <sandywalsh> +1
20:08:29 <beekhof> +1
20:08:31 <maoy> like
20:08:38 <mikeyp> -1 - too many large implications
20:08:52 <mikeyp> +1 for state management
20:09:06 <sandywalsh> votes for state management?
20:09:22 <sandywalsh> votes for other choices?
20:09:44 <sandywalsh> #action stick with "orchestration" until we can come up with a better name :)
20:09:48 * n0ano abstain for other choices
20:10:01 <sandywalsh> #topic tactical scheduler changes
20:10:17 <sandywalsh> #link http://wiki.openstack.org/EssexSchedulerImprovements
20:10:24 <sandywalsh> So, I emailed this out this morning
20:10:51 <sandywalsh> I know it's not directly related to Orchestration per se, but hopefully we can see a pattern here for how we can process events and get them back to the Scheduler
20:11:13 <maoy> i think it's closely related..
20:11:13 <sandywalsh> I think if we get this mechanism in place, we can start to add the State Machine stuff
20:11:36 <sandywalsh> if you haven't read it, please do and offer feedback
20:11:47 <maoy> for distributed scheduler, you mean the zone stuff, right?
20:12:08 <sandywalsh> well, initially we need it for single zone, but multi zone will follow behind shortly
20:12:09 <mikeyp> The idea of capacity cache as a summary table makes sense.
20:12:39 <mikeyp> who would 'own' that table in terms of updates?
20:13:03 <n0ano> I like the idea of putting the data in the DB, I'm concerned that using pre-defined columns would limit the extensibility of the scheduler
20:13:10 <sandywalsh> the scheduler would update for new rows
20:13:21 <sandywalsh> and the compute nodes would update for changes to instance status
20:13:27 <sandywalsh> (including delete)
20:13:34 <maoy> the challenge I see for "orchestration" is to maintain that table up to date in the event of node crashes and errors.
20:14:10 <beekhof> i only just got up, haven't read it yet
20:14:34 <sandywalsh> n0ano, good point ... something we need to consider ... how to extend what's in the table
20:14:37 <maoy> agree on extensibility. scheduler is probably the most likely customized module in a deployment imho..
20:14:57 <n0ano> +1 on customizable module
20:14:59 <mikeyp> the table should have an updated timestamp for each row.
20:15:21 <sandywalsh> maoy, ideally the ComputeNode table will let us know about host failures
20:15:21 <mikeyp> could then decide when info was stale
20:15:37 <n0ano> timestamp is standard for all the current rows in the DB, that shouldn't change
20:15:49 <sandywalsh> mikeyp, all Nova tables have those common fields
20:16:20 <sandywalsh> mikeyp, nova.db.sqlalchemy.models.NovaBase
20:17:06 <sandywalsh> #action give some consideration to extending CapacityCache table
20:17:37 <sandywalsh> One other thing that came up from our meeting last week was the need to keep Essex stable
20:17:42 <sandywalsh> no radical changes
20:18:01 <sandywalsh> so we may need to do our orchestration stuff alongside the current compute.run_instance() code
20:18:07 <n0ano> I actually created a proof of concept scheduler based on cactus that put metrics in the DB and made decisions based on that, I can send an email to describe it in more detail
20:18:09 <maoy> if a compute node can't finish provisioning a vm, we need to find another node. is this included in the new scheduler?
20:18:11 <sandywalsh> perhaps like we did with /zones/boot initially
20:18:24 <sandywalsh> n0ano, that would be great
20:18:42 <sandywalsh> #action n0ano to outline his previous scheduler efforts on ML
20:19:08 <sandywalsh> maoy, not yet, no retries yet ... first it's just getting reliable instance state info
20:19:35 <n0ano> if the goal is essex stability that would imply minimal changes to the current code, right?
20:19:52 <sandywalsh> as I was saying, we may need to do something like POST /orchestration/boot or something in the meanwhile to not upset nova while we get our state management in place
20:19:57 <maoy> sandywalsh, can you elaborate on what you did with /zones/boot initially?
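A rough sketch of the capacity-cache summary table idea discussed above, assuming a SQLAlchemy model that mirrors the common Nova bookkeeping columns (created_at, updated_at, deleted_at, deleted) along the lines of nova.db.sqlalchemy.models.NovaBase; the capacity columns themselves (host, free_ram_mb, etc.) are hypothetical placeholders, not the schema proposed on the EssexSchedulerImprovements page:

```python
# Hypothetical sketch of a CapacityCache summary table. The bookkeeping
# columns approximate Nova's common NovaBase fields; the capacity columns
# are illustrative only.
from datetime import datetime

from sqlalchemy import Boolean, Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class CapacityCache(Base):
    __tablename__ = 'capacity_cache'

    id = Column(Integer, primary_key=True)

    # Common Nova-style bookkeeping columns (approximating NovaBase).
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, onupdate=datetime.utcnow)
    deleted_at = Column(DateTime)
    deleted = Column(Boolean, default=False)

    # Illustrative capacity summary fields, one row per compute host.
    # The scheduler would insert new rows; compute nodes would update them
    # as instance status changes (including delete).
    host = Column(String(255), nullable=False, unique=True)
    free_ram_mb = Column(Integer, default=0)
    free_disk_gb = Column(Integer, default=0)
    running_vms = Column(Integer, default=0)


if __name__ == '__main__':
    engine = create_engine('sqlite:///:memory:')
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()
    session.add(CapacityCache(host='compute-1', free_ram_mb=4096,
                              free_disk_gb=80, running_vms=2))
    session.commit()
    # A consumer could treat rows whose updated_at is older than some
    # threshold as stale, per the "decide when info was stale" point above.
```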
20:20:03 <sandywalsh> likely a simple state machine (not petri net) in the short term
20:20:08 <n0ano> so something like moving to a DB base would have to wait for the F release
20:20:36 <sandywalsh> so, when we started with the scheduler we had a new POST /servers method that worked across zones
20:20:48 <sandywalsh> but it had a different signature than POST /servers
20:21:07 <sandywalsh> so we created POST /zones/boot to use the alternate approach
20:21:19 <sandywalsh> and later, we integrated the two back into POST /servers
20:21:28 <sandywalsh> and ditched /zones/boot
20:21:44 <sandywalsh> we may need to do the same thing with state-machine based boot
20:21:59 <sandywalsh> can't upset the Essex apple cart
20:22:01 <maoy> about petri nets. from our previous email, it looks like most things are sequential so a state machine is probably good enough..
20:22:30 <sandywalsh> maoy, agreed ... won't really be an issue until we get into multiple concurrent instance provisioning requests
20:22:42 <beekhof> agreed. petri net looked cool but quite possibly overkill
20:22:50 <sandywalsh> that's why this event handling stuff is important now
20:23:20 <sandywalsh> #action stick with a simple state machine for now ... revisit petri nets later when concurrency is required
20:23:32 <maoy> one more q:
20:23:44 <maoy> are all the steps in a job executed on the same node (compute?)
20:24:09 <maoy> i guess some might be on network nodes or storage nodes?
20:24:18 <sandywalsh> maoy, yes for run_instance() ... resize or migrate may be different (need to verify)
20:24:33 <sandywalsh> and yes, there are things that need to be done on the network/volume nodes
20:24:45 <sandywalsh> but it's usually still all serial
20:24:52 <sandywalsh> (sequential)
20:24:58 <maoy> got it
20:25:43 <sandywalsh> beekhof, when you get a chance to read that wiki page, I'd be curious to know if the event notification stuff will work ok with pacemaker
20:25:52 <maoy> in this case, the automatic rollback with predefined undo functions during failure seems to make sense
20:26:00 <sandywalsh> hopefully it should fit well?
20:26:10 <beekhof> i think it would
20:26:15 <sandywalsh> cool
20:26:28 <sandywalsh> maoy, do you mean hard-coded rewind functions?
20:26:31 <beekhof> would simplify a lot if there was a single table to go to
20:26:43 <sandywalsh> right
20:27:24 <maoy> i mean for each step, the developer specifies an undo step in case things later blow up
20:27:38 <sandywalsh> yes, right ... I think so too
20:27:49 <sandywalsh> k, so anything else anyone cares to bring up?
20:28:03 <sandywalsh> I need to review some of the communications from the last two weeks
20:28:05 <maoy> this would prevent bugs like forgetting to un-allocate an IP if the VM doesn't boot
20:28:13 <sandywalsh> correct
20:28:19 <mikeyp> the VM state management changes are still in review
20:28:42 <beekhof> there was my compute-cluster idea, dunno if it's worth discussing that
20:28:48 <mikeyp> how / does that impact this work?
20:29:19 <sandywalsh> mikeyp, I think they need to come to agreement on the VM states before we can really do much
20:29:38 <mikeyp> that's what I thought
20:30:13 <sandywalsh> beekhof, I need to re-read it, but perhaps others are ready?
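A minimal sketch of the per-step undo idea raised above: each provisioning step registers a rollback function, and a failure unwinds the completed steps in reverse order, e.g. releasing an allocated IP when the boot step blows up. All step and function names here are made up for illustration; this is not the Nova run_instance code.

```python
# Minimal sketch of sequential steps with per-step undo (rollback on failure).
# Step names and behavior are illustrative only.


def run_steps(steps, context):
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse."""
    done = []
    try:
        for do, undo in steps:
            do(context)
            done.append(undo)
    except Exception:
        for undo in reversed(done):
            try:
                undo(context)
            except Exception:
                # Keep unwinding; a failed undo shouldn't stop the rollback.
                pass
        raise


# --- Illustrative steps for an instance boot ---
def allocate_ip(ctx):
    ctx['ip'] = '10.0.0.5'          # pretend we grabbed an address
    print('allocated', ctx['ip'])


def release_ip(ctx):
    print('released', ctx.pop('ip', None))


def boot_vm(ctx):
    raise RuntimeError('hypervisor failure')   # simulate the boot blowing up


def destroy_vm(ctx):
    print('destroyed partially-built vm')


if __name__ == '__main__':
    try:
        run_steps([(allocate_ip, release_ip), (boot_vm, destroy_vm)], {})
    except RuntimeError:
        print('boot failed; the IP was released by the rollback')
```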
20:30:34 <beekhof> we can do next week
20:30:51 <sandywalsh> #action discuss beekhok compute-cluster idea next meeting
20:30:58 <beekhof> i've put it to one side since RH is going to take a diff approach anyway
20:31:03 <beekhof> but i still think its kinda neat :)
20:31:03 <sandywalsh> *beekhof
20:31:23 <sandywalsh> :) can you elaborate on the RH approach?
20:32:08 <beekhof> now or next week?
20:32:23 <sandywalsh> perhaps in an email and we can touch on it next week?
20:32:41 <sandywalsh> (or did your last email already mention it?)
20:32:42 <beekhof> sure
20:32:50 <maoy> agreed. email has higher goodput
20:32:55 <beekhof> 10000-ft view...
20:32:58 <sandywalsh> k
20:33:05 <beekhof> its a layered design
20:33:16 <beekhof> sits on top of openstack instead of being a part of it
20:33:28 <sandywalsh> ah, that'll be good to hear about
20:33:37 <sandywalsh> well, let's wrap this one up and we'll see you on the lists!
20:33:45 <beekhof> that way it can also manage other stacks
20:33:47 <beekhof> ok
20:33:52 <sandywalsh> cool
20:33:58 <sandywalsh> #endmeeting