20:01:08 <sandywalsh> #startmeeting
20:01:09 <openstack> Meeting started Thu Dec  1 20:01:08 2011 UTC.  The chair is sandywalsh. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic.
20:01:20 <sandywalsh> #link http://wiki.openstack.org/Meetings/Orchestration
20:01:36 <sandywalsh> #info Orchestration
20:01:55 <maoy> I'll put another name up there: Task Management
20:02:08 <sandywalsh> I just threw together that agenda since I've been out for the last couple of weeks
20:02:34 <sandywalsh> maoy, k, let's tackle yours first ... can you explain a little about Task Management?
20:02:43 <sandywalsh> #topic Task Management
20:02:54 <maoy> ah.
20:03:01 <maoy> i was referring to the new naming effort
20:03:09 <sandywalsh> ah, heh ... gotcha
20:03:17 <sandywalsh> ok, let's tackle naming
20:03:20 <sandywalsh> #topic naming
20:03:44 <maoy> i like transaction management, but it does sound like db lingo
20:03:53 <sandywalsh> So, orchestration is causing trouble for people
20:04:02 <maoy> so task might be a more general alternative
20:04:18 <n0ano> not for me, I'm beginning to like orchestration, do we really need to change?
20:04:22 <sandywalsh> I think it's ultimately about distributed state machine management
20:04:51 <sandywalsh> n0ano, I agree, but people think of it in the larger BPM / Workflow sense
20:05:02 <sandywalsh> and we're more tactical than that
20:05:27 <n0ano> so we just need to make sure others know that it is tactical and not workflow
20:05:42 <sandywalsh> Tactical Orchestration?
20:05:48 * beekhof doesnt have a problem with Orchestration
20:05:48 <mikeyp> to add some background, 'Orchestration' is being confused with BPM, and larger cloud management frameworks.
20:06:03 <beekhof> sandywalsh: sounds like a nuclear option :)
20:06:08 <sandywalsh> heh
20:06:17 <n0ano> and scheduling is too specific
20:06:17 <maoy> lo
20:06:18 <sandywalsh> naming issues suck :)
20:06:42 <maoy> if we don't have a consensus we might just stay where we are..
20:06:48 <sandywalsh> that's why I like State Management ... little room for interpretation
20:07:12 <maoy> according to wikipedia..Orchestration describes the automated arrangement, coordination, and management of complex computer systems, middleware, and services.
20:07:29 <beekhof> wfm
20:07:36 <sandywalsh> certainly sticking with orchestration is the easiest
20:07:59 <sandywalsh> I'll add a blurb to the working group description to highlight the distinction
20:08:10 <sandywalsh> vote to stick with orchestration?
20:08:15 * n0ano aye
20:08:24 <sandywalsh> +1
20:08:29 <beekhof> +1
20:08:31 <maoy> like
20:08:38 <mikeyp> -1 - too many large implications
20:08:52 <mikeyp> +1 for state management
20:09:06 <sandywalsh> votes for state management?
20:09:22 <sandywalsh> votes for other choices?
20:09:44 <sandywalsh> #action stick with "orchestration" until we can come up with a better name :)
20:09:48 * n0ano abstain for other choices
20:10:01 <sandywalsh> #topic tactical scheduler changes
20:10:17 <sandywalsh> #link http://wiki.openstack.org/EssexSchedulerImprovements
20:10:24 <sandywalsh> So, I emailed this out this morning
20:10:51 <sandywalsh> I know it's not directly related to Orchestration per se, but hopefully we can see a pattern here for how we can process events and get them back to the Scheduler
20:11:13 <maoy> i think it's closely related..
20:11:13 <sandywalsh> I think if we get this mechanism in place, we can start to add the State Machine stuff
20:11:36 <sandywalsh> if you haven't read it, please do and offer feedback
20:11:47 <maoy> for distributed scheduler, you mean the zone stuff, right?
20:12:08 <sandywalsh> well, initially we need it for single zone, but multi zone will follow behind shortly
20:12:09 <mikeyp> The idea of capacity cache as a summary table makes sense.
20:12:39 <mikeyp> who would 'own' that table in terms of updates?
20:13:03 <n0ano> I like the idea of putting the data in the DB, I'm concerned that using pre-defined columns would limit the extensibility of the scheduler
20:13:10 <sandywalsh> the scheduler would update for new rows
20:13:21 <sandywalsh> and the compute nodes would update for changes to instance status
20:13:27 <sandywalsh> (including delete)
20:13:34 <maoy> the challenge I see for "orchestration" is keeping that table up to date in the event of node crashes and errors.
20:14:10 <beekhof> i only just got up, havent read it yet
20:14:34 <sandywalsh> n0ano, good point ... something we need to consider ... how to extend what's in the table
20:14:37 <maoy> agree on extensibility. the scheduler is probably the module most likely to be customized in deployments, imho..
20:14:57 <n0ano> +1 on customizable module
20:14:59 <mikeyp> the table should have an updated timestamp for each row.
20:15:21 <sandywalsh> maoy, ideally the ComputeNode table will let us know about host failures
20:15:21 <mikeyp> could then decide when info was stale
20:15:37 <n0ano> timestamp is standard for all the current rows in the DB, that shouldn't change
20:15:49 <sandywalsh> mikeyp, all Nova tables have those common fields
20:16:20 <sandywalsh> mikeyp, nova.db.sqlalchemy.models.NovaBase
20:17:06 <sandywalsh> #action give some consideration to extending CapacityCache table
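A minimal sketch of what the CapacityCache table discussed above might look like, assuming the declarative BASE and NovaBase mixin in nova.db.sqlalchemy.models (which supply the common created_at/updated_at/deleted_at fields mentioned); the column names are illustrative, not taken from the EssexSchedulerImprovements spec:

    # Hypothetical sketch only -- column names are illustrative.
    from sqlalchemy import Column, Integer, String
    from nova.db.sqlalchemy.models import BASE, NovaBase  # NovaBase adds created_at/updated_at/deleted_at

    class CapacityCache(BASE, NovaBase):
        """Per-host capacity summary: the scheduler adds new rows and the
        compute nodes update them as instance status changes (including delete)."""
        __tablename__ = 'capacity_cache'
        id = Column(Integer, primary_key=True)
        host = Column(String(255))       # compute host this row summarizes
        free_ram_mb = Column(Integer)    # example capacity metric
        free_disk_gb = Column(Integer)   # example capacity metric
        running_vms = Column(Integer)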
20:17:37 <sandywalsh> One other thing that came up from our meeting last week was the need to keep Essex stable
20:17:42 <sandywalsh> no radical changes
20:18:01 <sandywalsh> so we may need to do our orchestration stuff alongside the current compute.run_instance() code
20:18:07 <n0ano> I actually created a proof-of-concept scheduler based on Cactus that put metrics in the DB and made decisions based on that, I can send an email to describe it in more detail
20:18:09 <maoy> if a compute node can't finish provisioning a VM, we need to find another node. is this included in the new scheduler?
20:18:11 <sandywalsh> perhaps like we did with /zones/boot initially
20:18:24 <sandywalsh> n0ano, that would be great
20:18:42 <sandywalsh> #action n0ano to outline his previous scheduler efforts on ML
20:19:08 <sandywalsh> maoy, not yet, no retries yet ... first it's just getting reliable instance state info
20:19:35 <n0ano> if the goal is essex stability that would imply minimal changes to the current code, right?
20:19:52 <sandywalsh> as I was saying, we may need to do something like POST /orchestration/boot or something in the meanwhile to not upset nova while we get our state management in place
20:19:57 <maoy> sandywalsh, can you elaborate on what you did with /zones/boot initially?
20:20:03 <sandywalsh> likely a simple state machine (not petri net) in the short term
20:20:08 <n0ano> so something like moving to a DB-based approach would have to wait for the F release
20:20:36 <sandywalsh> so, when we started with the distributed scheduler we had a new boot method that worked across zones
20:20:48 <sandywalsh> but it had a different signature than POST /servers
20:21:07 <sandywalsh> so we created POST /zones/boot to use the alternate approach
20:21:19 <sandywalsh> and later, we integrated the two back into POST /servers
20:21:28 <sandywalsh> and ditched /zone/boot
20:21:44 <sandywalsh> we may need to do the same thing with state-machine based boot
20:21:59 <sandywalsh> can't upset the Essex apple cart
20:22:01 <maoy> about petri nets: from our previous email, it looks like most things are sequential, so a state machine is probably good enough..
20:22:30 <sandywalsh> maoy, agreed ... won't really be an issue until we get into multiple concurrent instance provisioning requests
20:22:42 <beekhof> agreed. petri nets looked cool but quite possibly overkill
20:22:50 <sandywalsh> that's why this event handling stuff is important now
20:23:20 <sandywalsh> #action stick with a simple state machine for now ... revisit petrinets later when concurrency is required
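As a rough illustration of the "simple state machine" approach agreed above, a table-driven transition check like the following would cover the mostly sequential run_instance flow; the state names are placeholders, not the VM states still under review:

    # Illustrative sketch only -- state names are placeholders, not the agreed VM states.
    ALLOWED_TRANSITIONS = {
        'scheduling': ('networking', 'error'),
        'networking': ('block_device_mapping', 'error'),
        'block_device_mapping': ('spawning', 'error'),
        'spawning': ('active', 'error'),
    }

    def advance(current_state, new_state):
        """Permit a transition only if the table allows it."""
        if new_state not in ALLOWED_TRANSITIONS.get(current_state, ()):
            raise ValueError('illegal transition %s -> %s' % (current_state, new_state))
        return new_state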
20:23:32 <maoy> one more q:
20:23:44 <maoy> are all the steps in a job executed on the same node (compute)?
20:24:09 <maoy> i guess some might be on network nodes or storage nodes?
20:24:18 <sandywalsh> maoy, yes for run_instance() ... resize or migrate may be different (need to verify)
20:24:33 <sandywalsh> and yes, there are things that need to be done on the network/volume nodes
20:24:45 <sandywalsh> but it's usually still all serial
20:24:52 <sandywalsh> (sequential)
20:24:58 <maoy> got it
20:25:43 <sandywalsh> beekhof, when you get a chance to read that wiki page, I'd be curious to know if the event notification stuff will work ok with pacemaker
20:25:52 <maoy> in this case, the automatic rollback with predefined undo functions during failure seems to make sense
20:26:00 <sandywalsh> hopefully it should fit well?
20:26:10 <beekhof> i think it would
20:26:15 <sandywalsh> cool
20:26:28 <sandywalsh> maoy, do you mean hard-coded rewind functions?
20:26:31 <beekhof> would simplify a lot if there was a single table to go to
20:26:43 <sandywalsh> right
20:27:24 <maoy> i mean for each step, the developer specifies an undo step in case things later blow up
20:27:38 <sandywalsh> yes, right ... I think so too
20:27:49 <sandywalsh> k, so anything else anyone cares to bring up?
20:28:03 <sandywalsh> I need to review some of the communications from the last two weeks
20:28:05 <maoy> this would prevent bugs like forgetting to un-allocate an IP if the VM doesn't boot
20:28:13 <sandywalsh> correct
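A minimal sketch of the step/undo pattern being described here, where each step registers a compensating action that is run in reverse order on failure; the function names in the usage example are hypothetical:

    # Hypothetical sketch: pair each step with an undo so a later failure
    # unwinds earlier work (e.g. releasing an IP if the VM never boots).
    def run_steps(steps):
        """steps is a list of (do, undo) callables; run the do functions in
        order and call the completed undo functions in reverse on failure."""
        completed = []
        try:
            for do, undo in steps:
                do()
                completed.append(undo)
        except Exception:
            for undo in reversed(completed):
                undo()
            raise

    # e.g. run_steps([(allocate_ip, deallocate_ip), (spawn_vm, destroy_vm)])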
20:28:19 <mikeyp> the VM state management changes are still in review
20:28:42 <beekhof> there was my compute-cluster idea, dunno if its worth discussing that
20:28:48 <mikeyp> how does that impact this work, if at all?
20:29:19 <sandywalsh> mikeyp, I think they need to come to agreement on the VM states before we can really do much
20:29:38 <mikeyp> thats what I thought
20:30:13 <sandywalsh> beekhof, I need to re-read it, but perhaps others are ready?
20:30:34 <beekhof> we can do next week
20:30:51 <sandywalsh> #action discuss beekhof's compute-cluster idea next meeting
20:30:58 <beekhof> i've put it to one side since RH is going to take a diff approach anyway
20:31:03 <beekhof> but i still think its kinda neat :)
20:31:23 <sandywalsh> :) can you elaborate on the RH approach?
20:32:08 <beekhof> now or next week?
20:32:23 <sandywalsh> perhaps in an email and we can touch on it next week?
20:32:41 <sandywalsh> (or did your last email already mention it?)
20:32:42 <beekhof> sure
20:32:50 <maoy> agreed. email has higher goodput
20:32:55 <beekhof> 10000-ft view...
20:32:58 <sandywalsh> k
20:33:05 <beekhof> its a layered design
20:33:16 <beekhof> sits on top of openstack instead of being a part of it
20:33:28 <sandywalsh> ah, that'll be good to hear about
20:33:37 <sandywalsh> well, let's wrap this one up and we'll see you on the lists!
20:33:45 <beekhof> that way it can also manage other stacks
20:33:47 <beekhof> ok
20:33:52 <sandywalsh> cool
20:33:58 <sandywalsh> #endmeeting