20:01:08 #startmeeting
20:01:09 Meeting started Thu Dec 1 20:01:08 2011 UTC. The chair is sandywalsh. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic.
20:01:20 #link http://wiki.openstack.org/Meetings/Orchestration
20:01:36 #info Orchestration
20:01:55 I'll put another name up there: Task Management
20:02:08 I just threw together that agenda since I've been out for the last couple of weeks
20:02:34 maoy, k, let's tackle yours first ... can you explain a little about Task Management?
20:02:43 #topic Task Management
20:02:54 ah.
20:03:01 i was referring to the new naming effort
20:03:09 ah, heh ... gotcha
20:03:17 ok, let's tackle naming
20:03:20 #topic naming
20:03:44 i like transaction management, but it does sound like db lingo
20:03:53 So, orchestration is causing trouble for people
20:04:02 so task might be a more general alternative
20:04:18 not for me, I'm beginning to like orchestration, do we really need to change?
20:04:22 I think it's ultimately about distributed state machine management
20:04:51 n0ano, I agree, but people think of it in the larger BPM / Workflow sense
20:05:02 and we're more tactical than that
20:05:27 so we just need to make sure others know that it is tactical and not workflow
20:05:42 Tactical Orchestration?
20:05:48 * beekhof doesn't have a problem with Orchestration
20:05:48 to add some background, 'Orchestration' is being confused with BPM and larger cloud management frameworks.
20:06:03 sandywalsh: sounds like a nuclear option :)
20:06:08 heh
20:06:17 and scheduling is too specific
20:06:17 lo
20:06:18 naming issues suck :)
20:06:42 if we don't have a consensus we might just stay where we are..
20:06:48 that's why I like State Management ... little room for interpretation
20:07:12 according to wikipedia.. Orchestration describes the automated arrangement, coordination, and management of complex computer systems, middleware, and services.
20:07:29 wfm
20:07:36 certainly sticking with orchestration is the easiest
20:07:59 I'll add a blurb to the working group description to highlight the distinction
20:08:10 vote to stick with orchestration?
20:08:15 * n0ano aye
20:08:24 +1
20:08:29 +1
20:08:31 like
20:08:38 -1 - too many large implications
20:08:52 +1 for state management
20:09:06 votes for state management?
20:09:22 votes for other choices?
20:09:44 #action stick with "orchestration" until we can come up with a better name :)
20:09:48 * n0ano abstains for other choices
20:10:01 #topic tactical scheduler changes
20:10:17 #link http://wiki.openstack.org/EssexSchedulerImprovements
20:10:24 So, I emailed this out this morning
20:10:51 I know it's not directly related to Orchestration per se, but hopefully we can see a pattern here for how we can process events and get them back to the Scheduler
20:11:13 i think it's closely related..
20:11:13 I think if we get this mechanism in place, we can start to add the State Machine stuff
20:11:36 if you haven't read it, please do and offer feedback
20:11:47 for distributed scheduler, you mean the zone stuff, right?
20:12:08 well, initially we need it for single zone, but multi zone will follow behind shortly
20:12:09 The idea of a capacity cache as a summary table makes sense.
20:12:39 who would 'own' that table in terms of updates?
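A rough sketch of the kind of capacity summary table being discussed here. The CapacityCache name, columns, and defaults are illustrative assumptions for the sake of the conversation, not the schema from the EssexSchedulerImprovements page:

    # Illustrative sketch only: a standalone approximation of a per-host
    # capacity summary table.  Table and column names are assumptions.
    from datetime import datetime

    from sqlalchemy import Boolean, Column, DateTime, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()


    class CapacityCache(Base):
        """One row per compute host, summarizing free capacity for the scheduler."""
        __tablename__ = 'capacity_cache'

        id = Column(Integer, primary_key=True)
        host = Column(String(255), nullable=False)   # compute host name
        free_ram_mb = Column(Integer, default=0)     # remaining RAM
        free_disk_gb = Column(Integer, default=0)    # remaining disk
        running_vms = Column(Integer, default=0)     # instances on the host

        # Nova's NovaBase adds audit columns like these to every table;
        # duplicated here so the sketch runs on its own.
        created_at = Column(DateTime, default=datetime.utcnow)
        updated_at = Column(DateTime, onupdate=datetime.utcnow)
        deleted_at = Column(DateTime)
        deleted = Column(Boolean, default=False)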
20:13:03 I like the idea of putting the data in the DB, I'm concerned that using pre-defined columns would limit the extensibility of the scheduler
20:13:10 the scheduler would update for new rows
20:13:21 and the compute nodes would update for changes to instance status
20:13:27 (including delete)
20:13:34 the challenge I see for "orchestration" is to keep that table up to date in the event of node crashes and errors.
20:14:10 i only just got up, haven't read it yet
20:14:34 n0ano, good point ... something we need to consider ... how to extend what's in the table
20:14:37 agree on extensibility. the scheduler is probably the most likely module to be customized in a deployment imho..
20:14:57 +1 on customizable module
20:14:59 the table should have an updated timestamp for each row.
20:15:21 maoy, ideally the ComputeNode table will let us know about host failures
20:15:21 could then decide when info was stale
20:15:37 timestamp is standard for all the current rows in the DB, that shouldn't change
20:15:49 mikeyp, all Nova tables have those common fields
20:16:20 mikeyp, nova.db.sqlalchemy.models.NovaBase
20:17:06 #action give some consideration to extending the CapacityCache table
20:17:37 One other thing that came up from our meeting last week was the need to keep Essex stable
20:17:42 no radical changes
20:18:01 so we may need to do our orchestration stuff alongside the current compute.run_instance() code
20:18:07 I actually created a proof of concept scheduler based on cactus that put metrics in the DB and made decisions based on that, I can send an email to describe it in more detail
20:18:09 if a compute node can't finish provisioning a vm, we need to find another node. is this included in the new scheduler?
20:18:11 perhaps like we did with /zones/boot initially
20:18:24 n0ano, that would be great
20:18:42 #action n0ano to outline his previous scheduler efforts on the ML
20:19:08 maoy, not yet, no retries yet ... first it's just getting reliable instance state info
20:19:35 if the goal is essex stability that would imply minimal changes to the current code, right?
20:19:52 as I was saying, we may need to do something like POST /orchestration/boot in the meanwhile so we don't upset nova while we get our state management in place
20:19:57 sandywalsh, can you elaborate on what you did with /zones/boot initially?
20:20:03 likely a simple state machine (not petri net) in the short term
20:20:08 so something like moving to a DB-based approach would have to wait for the F release
20:20:36 so, when we started with the scheduler we had a new POST /servers method that worked across zones
20:20:48 but it had a different signature than POST /servers
20:21:07 so we created POST /zones/boot to use the alternate approach
20:21:19 and later, we integrated the two back into POST /servers
20:21:28 and ditched /zones/boot
20:21:44 we may need to do the same thing with a state-machine based boot
20:21:59 can't upset the Essex apple cart
20:22:01 about petri nets: from our previous email, it looks like most things are sequential so a state machine is probably good enough..
20:22:30 maoy, agreed ... won't really be an issue until we get into multiple concurrent instance provisioning requests
20:22:42 agreed. petri nets looked cool but quite possibly overkill
20:22:50 that's why this event handling stuff is important now
20:23:20 #action stick with a simple state machine for now ... revisit petri nets later when concurrency is required
20:23:32 one more q:
20:23:44 are all the steps in a job executed on the same node (compute?)
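Following on from the staleness point above, a small sketch of how the per-row updated timestamp could gate scheduling decisions. It reuses the CapacityCache sketch earlier; the query shape and the 60-second window are assumptions, not anything agreed in the meeting:

    # Illustrative only: ignore capacity rows whose updated_at is older than a
    # cutoff before the scheduler makes a placement decision.
    from datetime import datetime, timedelta


    def fresh_capacity_rows(session, max_age_seconds=60):
        """Return CapacityCache rows updated recently enough to trust."""
        cutoff = datetime.utcnow() - timedelta(seconds=max_age_seconds)
        return (session.query(CapacityCache)
                       .filter(CapacityCache.deleted == False)  # noqa: E712
                       .filter(CapacityCache.updated_at >= cutoff)
                       .all())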
20:24:09 i guess some might be on network nodes or storage nodes?
20:24:18 maoy, yes for run_instance() ... resize or migrate may be different (need to verify)
20:24:33 and yes, there are things that need to be done on the network/volume nodes
20:24:45 but it's usually still all serial
20:24:52 (sequential)
20:24:58 got it
20:25:43 beekhof, when you get a chance to read that wiki page, I'd be curious to know if the event notification stuff will work ok with pacemaker
20:25:52 in this case, the automatic rollback with predefined undo functions during failure seems to make sense
20:26:00 hopefully it should fit well?
20:26:10 i think it would
20:26:15 cool
20:26:28 maoy, do you mean hard-coded rewind functions?
20:26:31 would simplify a lot if there was a single table to go to
20:26:43 right
20:27:24 i mean for each step, the developer specifies an undo step in case things later blow up
20:27:38 yes, right ... I think so too
20:27:49 k, so anything else anyone cares to bring up?
20:28:03 I need to review some of the communications from the last two weeks
20:28:05 this would prevent bugs like forgetting to un-allocate the IP if the VM doesn't boot
20:28:13 correct
20:28:19 the VM state management changes are still in review
20:28:42 there was my compute-cluster idea, dunno if it's worth discussing that
20:28:48 how / does that impact this work?
20:29:19 mikeyp, I think they need to come to agreement on the VM states before we can really do much
20:29:38 that's what I thought
20:30:13 beekhof, I need to re-read it, but perhaps others are ready?
20:30:34 we can do it next week
20:30:51 #action discuss beekhok compute-cluster idea next meeting
20:30:58 i've put it to one side since RH is going to take a diff approach anyway
20:31:03 but i still think it's kinda neat :)
20:31:03 *beekhof
20:31:23 :) can you elaborate on the RH approach?
20:32:08 now or next week?
20:32:23 perhaps in an email and we can touch on it next week?
20:32:41 (or did your last email already mention it?)
20:32:42 sure
20:32:50 agreed. email has higher goodput
20:32:55 10000-ft view...
20:32:58 k
20:33:05 it's a layered design
20:33:16 sits on top of openstack instead of being a part of it
20:33:28 ah, that'll be good to hear about
20:33:37 well, let's wrap this one up and we'll see you on the lists!
20:33:45 that way it can also manage other stacks
20:33:47 ok
20:33:52 cool
20:33:58 #endmeeting
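A minimal sketch of the per-step undo idea raised in the discussion above: each step in a provisioning job carries an optional undo callable, and a failure partway through rewinds the completed steps in reverse order. The function and step names are hypothetical, not code from the scheduler proposal:

    # Illustrative only: sequential steps with developer-supplied undo steps.
    def run_with_rollback(steps):
        """steps is a list of (do, undo) callables; undo may be None."""
        done = []
        try:
            for do, undo in steps:
                do()
                done.append(undo)
        except Exception:
            # Something blew up: rewind the steps that already succeeded.
            for undo in reversed(done):
                if undo is not None:
                    undo()
            raise


    # The "forgot to un-allocate the IP when the VM failed to boot" case:
    def allocate_ip():
        print("allocate ip")

    def deallocate_ip():
        print("deallocate ip")

    def spawn_vm():
        raise RuntimeError("vm failed to boot")

    try:
        run_with_rollback([(allocate_ip, deallocate_ip), (spawn_vm, None)])
    except RuntimeError as exc:
        print("boot failed, IP was released:", exc)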