16:01:58 <adrian_otto> #startmeeting containers
16:01:59 <openstack> Meeting started Wed Oct  1 16:01:58 2014 UTC and is due to finish in 60 minutes.  The chair is adrian_otto. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:02 <openstack> The meeting name has been set to 'containers'
16:02:18 <adrian_otto> #topic Gantt and Magnum - Scheduling for Containers in OpenStack
16:02:26 <adrian_otto> welcome bauzas
16:02:32 <bauzas> adrian_otto: thanks
16:02:45 <bauzas> n0ano just joined us too, he's also working on Gantt
16:03:15 <adrian_otto> thanks for agreeing to meet with us. The purpose of today's discussion is to put some additional details about how Magnum will approach the subject of scheduling
16:03:30 <bauzas> for sure, I would love knowing your requirements
16:03:34 <adrian_otto> resource scheduling is something that Nova already does, and has a modular interface to adapt how it's done.
16:03:39 <bauzas> right
16:03:56 <adrian_otto> we understand that you plan to pull out the scheduling piece as a standalone project named Gantt
16:04:03 <bauzas> right too
16:04:11 <adrian_otto> once that happens, other projects could tap into that capability, which is what we would like to do
16:04:19 <bauzas> that's the idea
16:04:24 <adrian_otto> also, last time we met, we covered our initial intent for Magnum
16:04:24 <n0ano> +1
16:04:35 <bauzas> because we're convinced that scheduling needs to be holistic
16:04:47 <adrian_otto> to support two placement strategies for containers within Nova instances (through the help of an in-guest agent)
16:05:03 <adrian_otto> 1) To place a container on a specific instance id
16:05:09 <bauzas> right, could you please just restate here for n0ano very briefly ?
16:05:44 <adrian_otto> 2) To repeatedly fill an instance with containers until no more will fit, and then we will create another instance, fill that, etc. We have been referring to this mode as "simple sequential fill".
16:06:08 <bauzas> right
16:06:14 <adrian_otto> as containers are removed, and Nova instances are left completely vacant, the instances may be automatically deleted by Magnum
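The "simple sequential fill" mode described above can be sketched as follows. This is purely illustrative Python: the class, the `nova` client methods, and the per-instance capacity model are all hypothetical, not Magnum's real API.

```python
# Hypothetical sketch of "simple sequential fill": fill one Nova instance
# with containers until full, create another, fill that, and so on; delete
# instances that become completely vacant. Names are illustrative only.

class FakeNova:
    """Tiny stand-in for a Nova client, for demonstration only."""
    def __init__(self):
        self._next = 0
        self.deleted = []
    def create_instance(self):
        self._next += 1
        return "inst-%d" % self._next
    def delete_instance(self, instance_id):
        self.deleted.append(instance_id)

class SimpleSequentialFill:
    def __init__(self, nova, capacity_per_instance):
        self.nova = nova                       # client used to create/delete instances
        self.capacity = capacity_per_instance  # max containers per instance (simplified)
        self.instances = []                    # list of (instance_id, container_count)

    def place(self, container):
        # Reuse the first instance that still has room.
        for i, (instance_id, count) in enumerate(self.instances):
            if count < self.capacity:
                self.instances[i] = (instance_id, count + 1)
                return instance_id
        # Otherwise create a fresh instance and place the container there.
        instance_id = self.nova.create_instance()
        self.instances.append((instance_id, 1))
        return instance_id

    def remove(self, instance_id):
        # Drop one container; delete the instance once it is completely vacant.
        for i, (iid, count) in enumerate(self.instances):
            if iid == instance_id and count > 0:
                count -= 1
                if count == 0:
                    self.instances.pop(i)
                    self.nova.delete_instance(iid)
                else:
                    self.instances[i] = (iid, count)
                return
```

A real implementation would track actual resource usage (CPU, memory) rather than a flat container count, but the fill-then-overflow shape is the same.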
16:06:36 <adrian_otto> so clearly those two modes are a first step, and we can get there independently from Gantt, giving us some runway in terms of time
16:06:52 <adrian_otto> we would like to be able to do more sophisticated placement in the future
16:07:06 <adrian_otto> example use cases are instance affinity (think host affinity)
16:07:09 <bauzas> that makes sense
16:07:12 <adrian_otto> and instance anti-affinity
16:07:18 <bauzas> right
16:07:42 <adrian_otto> we may also want other concepts such as zone filters (to represent network performance boundaries, etc.)
16:07:43 <n0ano> I think `everybody` wants to do more sophisticated scheduling, hence the desire to split out gantt to be able to satisfy more projects than just nova
16:07:51 <adrian_otto> indeed!
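The affinity and anti-affinity use cases mentioned above could look roughly like the following, loosely modeled on the style of Nova's scheduler filters. These classes are hypothetical sketches for instance-level placement of containers, not existing Magnum or Gantt code.

```python
# Illustrative instance-level (anti-)affinity filters for container placement.
# Loosely modeled on Nova's filter style; class and method names are invented.

class InstanceAffinityFilter:
    """Pass only instances already hosting containers from the same group."""

    def instance_passes(self, instance_state, request):
        group = set(request.get("group_containers", []))
        return bool(group & set(instance_state.get("containers", [])))

class InstanceAntiAffinityFilter:
    """Pass only instances hosting no containers from the same group."""

    def instance_passes(self, instance_state, request):
        group = set(request.get("group_containers", []))
        return not (group & set(instance_state.get("containers", [])))
```

A scheduler would run each candidate instance through the configured filter chain and place the container only on instances that pass every filter.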
16:07:56 <adrian_otto> so that's the starting point.
16:08:07 <n0ano> violent agreement so far
16:08:21 <adrian_otto> we hope to leverage as much of what's already there as possible, and have a sensible way of extending it
16:08:22 <bauzas> n0ano: here there is no need for a cross-project scheduler, just for something that Nova doesn't provide yet
16:09:01 <adrian_otto> yes, we don't care if the scheduler is provided by nova or some other service
16:09:02 <n0ano> which should make the containers usage a closer fit, fewer changes needed
16:09:07 <adrian_otto> we are after the functionality
16:09:24 <bauzas> here, the idea is that you have X as a set of capacities, and you want to spawn Y children based on a best-effort occupancy of X
16:09:54 <adrian_otto> so a great outcome of today's chat will be a shared vision of where we are headed, and a rough tactical outline for how to begin advancing toward that vision together
16:10:25 <bauzas> X can be hosts/instances/routers/cinder-volumes and Y can be instances/containers/networks/volumes
16:10:26 <adrian_otto> yes, that's right
16:10:49 <bauzas> the placement logic and the efficiency of that logic still has to be the same
16:10:51 <adrian_otto> agreed.
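The generic X/Y model bauzas describes, where the placement logic is the same whether X is a host, an instance, or a volume backend, and Y an instance, a container, or a volume, can be sketched as a single resource-agnostic routine. This is an assumption-laden illustration, not Gantt's design.

```python
# Sketch of a placement routine that is agnostic to resource type: X is any
# pool with free capacity, Y is anything that consumes some of it. Purely
# illustrative; Gantt's real interfaces were not yet agreed at this point.

def schedule(capacities, demand, filters=(), weigher=None):
    """Pick the best X (by id) able to accept a Y of the given demand.

    capacities: {x_id: free_units}; demand: units required by the new Y.
    """
    candidates = [x for x, free in capacities.items() if free >= demand]
    for f in filters:
        candidates = [x for x in candidates if f(x, capacities[x], demand)]
    if not candidates:
        return None
    # Default weigher: tightest fit first (best-effort occupancy / packing).
    weigher = weigher or (lambda x: capacities[x])
    return min(candidates, key=weigher)
```

Swapping the weigher (e.g. most free capacity first) changes the strategy from packing to spreading without touching the rest of the logic, which is the point of keeping the placement engine generic.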
16:11:08 <n0ano> I think there's general consensus on the desired outcome, we've been bogged down by the mechanics of getting there so far
16:11:31 <bauzas> n0ano: right, and that's why we're moving to using objects
16:11:53 <adrian_otto> I do have a specific question about fault recovery, and whether we conceptualize that as scheduling, or orchestration
16:11:57 <bauzas> adrian_otto: the current nova scheduler model is accepting non validated and non-typed dictionaries of values
16:12:10 <adrian_otto> say I have provisioned resource Y on instance X
16:12:17 <bauzas> adrian_otto: sure go ahead
16:12:19 <adrian_otto> and Y dies.
16:12:21 <bauzas> yep
16:12:33 <adrian_otto> I need a way to have that restarted.
16:12:40 <adrian_otto> does Nova already have a concept of this?
16:12:50 <bauzas> that's not automatic
16:12:55 <n0ano> adrian_otto, I've always interpreted that as a heat issue, not a nova scheduling issue
16:12:56 <bauzas> but you can evacuate
16:13:02 <bauzas> n0ano: strong +1
16:13:12 <adrian_otto> ok, so the user would use a delete command in the api and then another post to create a new one?
16:13:20 <bauzas> IIRC, there is a heat effort about that
16:13:29 <bauzas> adrian_otto: nope
16:13:49 <bauzas> I heard about Heat 'convergence'
16:13:52 <n0ano> nova (at least currently) is kind of a fire and forget, once you've launched the instance nova scheduling is done (this is an area that might need to change in future)
16:13:53 <adrian_otto> yes, I'm pretty sure this is within the scope of Heat, but I wanted your perspective on that
16:13:57 <bauzas> still need to figure out what that is exactly
16:14:14 <bauzas> adrian_otto: Gantt will be scheduling things and provide SLAs
16:14:36 <bauzas> adrian_otto: if something goes weird, Gantt has to be updated to make good decisions
16:14:44 <bauzas> adrian_otto: by design, Gantt can be racy
16:15:01 <n0ano> I think fault recover is a heat issue, gantt will have to be involved but the core logic comes from heat
16:15:06 <bauzas> adrian_otto: ie. if the updates are stale, Gantt will provide wrong decisions
16:15:17 <adrian_otto> well, you do need to know the current utilization against capacity, so if a sub-resource is deleted, the scheduler may need to learn about that. Is this assumption correct?
16:15:34 <bauzas> adrian_otto: right, Gantt will have a statistics API
16:15:40 <adrian_otto> or will it poll each instance at decision time?
16:15:44 <adrian_otto> ok, I see
16:15:44 <bauzas> adrian_otto: so projects using it will update its views
16:15:55 <bauzas> adrian_otto: nope, no polling from Gantt
16:16:05 <n0ano> adrian_otto, right one of the changes we're actively doing is to put all resource tracking inside the scheduler
16:16:14 <bauzas> adrian_otto: projects provide a view to Gantt, it's the projects' responsibility to ensure consistency
16:16:15 <adrian_otto> so there will need to be some way to inform the stats through that statistics API
16:16:26 <bauzas> adrian_otto: exactly
16:16:41 <adrian_otto> we will expect Heat to do that upon triggering a resource restart (heal) event
16:16:43 <bauzas> n0ano: to be precise, that will be the claiming process
16:16:45 <adrian_otto> got it, ok
16:16:56 <n0ano> what bauzas said
16:17:05 <adrian_otto> excellent.
16:17:12 <bauzas> n0ano: for making sure that concurrent schedulers can do optimistic scheduling
16:17:25 <bauzas> without lock mechanism
16:17:34 <bauzas> but that's internal to Gantt
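The optimistic, lock-free claiming bauzas alludes to (concurrent schedulers deciding against possibly stale views, with stale decisions rejected rather than locked out) can be sketched with a generation-based compare-and-swap. This is an illustration of the general technique, not Gantt's internal design, which was still being worked out.

```python
# Sketch of optimistic claiming: a scheduler snapshots the resource view,
# makes a decision, then commits it with a compare-and-swap. If another
# scheduler raced in, the generation no longer matches and we retry instead
# of placing against stale data. Illustrative only, not Gantt internals.

class ResourceView:
    def __init__(self, free):
        self.free = free
        self.generation = 0   # bumped on every successful claim

    def claim(self, amount, seen_generation):
        # Compare-and-swap: succeed only if nobody raced us since we looked
        # and the capacity is really still there.
        if seen_generation != self.generation or amount > self.free:
            return False
        self.free -= amount
        self.generation += 1
        return True

def schedule_with_retry(view, amount, max_tries=3):
    for _ in range(max_tries):
        gen = view.generation        # snapshot the view
        if view.claim(amount, gen):  # try to commit the decision
            return True
    return False                     # out of capacity or too much contention
```

In a real distributed scheduler the generation check and decrement would have to be one atomic operation (e.g. a conditional database update); here they are a single method for clarity.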
16:18:01 <adrian_otto> ok
16:18:07 <bauzas> projects will have to put stats to Gantt using an API endpoint, and will consume decisions from another API endpoint
16:18:35 <adrian_otto> ok, understood.
16:18:48 <bauzas> so that requires a formal datatype for updating the stats
16:18:56 <adrian_otto> this will be some form of a feed interface like an RPC queue?
16:18:56 <bauzas> and here is what we're currently working on
16:19:07 <n0ano> note that, currently, all stats are kind of nova instance related, for cross project (cinder, neutron) usage we'll have to expand those stats
16:19:13 <bauzas> adrian_otto: the internals are not yet agreed
16:19:17 <adrian_otto> kk
16:19:28 <bauzas> adrian_otto: I was seeing something like a WSME datatype
16:19:37 <bauzas> adrian_otto: or a Nova object currently
16:20:03 <bauzas> adrian_otto: my personal opinion is that Nova objects are good candidates for becoming WSME datatypes
16:20:13 <adrian_otto> :-)
16:20:16 <bauzas> adrian_otto: but that's my personal opinion :)
16:20:41 <bauzas> so the API will have precise types
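The contrast bauzas draws, non-validated, non-typed dictionaries today versus precise API types tomorrow, can be illustrated with a minimal validated datatype. This stdlib-only sketch is in the spirit of a WSME-style type; the field names are hypothetical, not an agreed Gantt stats schema.

```python
# Minimal sketch of a "formal datatype" for stats updates, as opposed to the
# untyped dicts the Nova scheduler accepted at the time. Field names are
# invented for illustration; a real version would be a WSME type or Nova object.

class StatsUpdate:
    FIELDS = {"resource_id": str, "resource_type": str,
              "capacity": int, "used": int}

    def __init__(self, **kwargs):
        for name, ftype in self.FIELDS.items():
            if name not in kwargs:
                raise ValueError("missing field: %s" % name)
            value = kwargs[name]
            if not isinstance(value, ftype):
                raise TypeError("%s must be %s" % (name, ftype.__name__))
            setattr(self, name, value)
        # Unlike a raw dict, unknown keys are rejected up front.
        unknown = set(kwargs) - set(self.FIELDS)
        if unknown:
            raise ValueError("unknown fields: %s" % sorted(unknown))
```

The payoff is that a malformed update fails loudly at the API boundary instead of silently steering the scheduler toward wrong decisions.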
16:20:48 <adrian_otto> ok, so I think I distracted you from your description of your current focus of work
16:20:57 <bauzas> adrian_otto: not exactly
16:21:23 <bauzas> adrian_otto: I just wanted to emphasize that as the current Nova scheduler is not validating what we send to it, we need to do that work
16:21:23 <n0ano> focus = clean up claims, split out into gantt
16:21:46 <adrian_otto> ok
16:22:05 <bauzas> so the basic plan for Kilo is what we call "fix the tech debt, ie. provide clean interfaces for updating stats and consuming requests"
16:22:34 <bauzas> and eventually move the claiming process to Gantt for doing concurrent non-locking decisions
16:22:45 <bauzas> but that's not impacting your work
16:22:52 <n0ano> there is so much that people want to do that is predicated on splitting out gantt that I `really` want to focus on that
16:24:03 <adrian_otto> our favorite question from product managers is "when will XYZ be ready". We have been pretty clear about the sequencing of new features. What we could tighten up is an expectation of when we might clear various milestones. What do you think the best way is to address these questions?
16:24:32 <bauzas> adrian_otto: what we can promise is to deliver a Nova library by Kilo
16:25:11 <bauzas> adrian_otto: ie. something detached from the Nova repo, residing in its own library but still pointing to the same DB
16:25:19 <adrian_otto> what is a reasonable expectation for what features will be present in that library? Is that a feature parity with what is in nova today with tech debt subtracted out?
16:25:21 <n0ano> adrian_otto, when you get a good answer to that let me know, I get the same questions internally and `when it's done' is never sufficient
16:25:39 <adrian_otto> n0ano: hear, hear!
16:25:40 <bauzas> adrian_otto: feature parity is a must have
16:25:53 <bauzas> adrian_otto: no split without feature parity
16:26:02 <adrian_otto> ok, that makes sense
16:26:06 <n0ano> bauzas, +1
16:26:30 <bauzas> adrian_otto: so, based on our progress, that requires providing some way to update the scheduler with more than just "compute node" stats
16:26:35 <adrian_otto> maybe if we thought about it like "in what release of OpenStack could Magnum and Gantt be offering affinity and anti-affinity scheduling capability"?
16:26:41 <bauzas> and the target is for Kilo
16:26:48 <adrian_otto> that might be a more concrete question that we could chew on together
16:27:20 <bauzas> adrian_otto: ask Nova cores the priority they will put for the Scheduler effort
16:27:32 <adrian_otto> !! :-)
16:27:32 <openstack> adrian_otto: Error: "!" is not a valid command.
16:27:43 <bauzas> adrian_otto: raise the criticality if you want so
16:27:55 <adrian_otto> ok so assuming this is in a new repo, it could have its own review queue
16:27:56 <bauzas> adrian_otto: tbh, we're not suffering from a contributor-bandwidth problem
16:28:01 <adrian_otto> and a focused reviewer team
16:28:20 <n0ano> adrian_otto, +1
16:28:20 <adrian_otto> anyone who is a true stakeholder could opt into that
16:28:32 <bauzas> adrian_otto: right, but before doing that, we need to get the necessary bits merged for fixing the tech debt and providing stats with other Nova objects
16:28:49 <adrian_otto> ok, that's sensible too
16:28:53 <bauzas> and here comes the velocity problem
16:29:29 <bauzas> as we need feature parity *before* the split, we need to carry our changes using the Nova workflow
16:29:46 <bauzas> adrian_otto: hope you understand what I mean
16:29:50 <adrian_otto> I do.
16:29:58 <adrian_otto> I'm thinking on it
16:30:01 <bauzas> adrian_otto: in particular wrt Nova runways/slots proposal
16:30:23 <bauzas> adrian_otto: that's envisaged for kilo-3, not before but still
16:30:46 <n0ano> bauzas, I'm not `too` concerned about the runways proposal, I hope we'll be in before we have to worry about that (I don't like it but that's a different issue)
16:31:14 <bauzas> n0ano: think about it, and how we can merge changes before kilo-3
16:31:30 <bauzas> n0ano: and you'll see that isolate-scheduler-db will probably be handled by kilo-3
16:31:56 <bauzas> and winter is coming (no jokes)
16:32:29 <n0ano> yeah but we get more accomplished in winter (ignoring the holidays)
16:32:33 <adrian_otto> the challenge here is that we can't just produce more Nova core reviewers by mustering them up.
16:32:49 <adrian_otto> that is a scarce resource that grows organically, and at a slow rate.
16:33:01 <n0ano> adrian_otto, nope, but this is a generic problem that everyone is facing, we just have to deal with it
16:33:19 <adrian_otto> so as you mentioned, we would need to cause the existing reviewers to perceive the Gantt related work as a priority.
16:33:45 <bauzas> adrian_otto: that's the thing
16:34:30 <bauzas> in particular, Design Summits are a good opportunity for raising concerns about prorities
16:34:39 <adrian_otto> it probably only takes 3 reviewers with a commitment of 5 hours a week or less to succeed at this, right?
16:34:50 <bauzas> even less
16:34:59 <adrian_otto> probably half that time
16:35:15 <bauzas> patches are quite small, except a big one I just proposed
16:35:16 <adrian_otto> let's just call it 2 hours a week for sake of discussion
16:35:27 <n0ano> we can get reviewers, it's 2 core reviewers that are the problem.
16:35:34 <adrian_otto> that seems to be something that we could influence using a 1:1 campaign
16:35:37 <bauzas> n0ano: +1000
16:36:41 <bauzas> adrian_otto: as usual, pings and f2f during Summit
16:36:46 <adrian_otto> and I have a feeling that not just any cores will do, it's a subset of the cores who can rationalize and criticize contributions to this part of the system architecture.
16:37:14 <bauzas> adrian_otto: tbh, we already have support from a set of them
16:37:30 <adrian_otto> so what we need is an earmarked time commitment from them
16:37:35 <n0ano> the reality is we've thrashed the issues enough that I think we know what needs to be done, now it's just a matter of doing it and getting it reviewed
16:37:36 <bauzas> adrian_otto: we just need to emphasize the priority
16:37:49 <n0ano> bauzas, +1
16:38:05 <adrian_otto> and what if we approached each of them with a "join me for two 1-hour meetings each week to review patches for this subject matter" ask?
16:38:31 <adrian_otto> that's a commitment that's likely to succeed, and may be just enough lubrication to help you get that work through
16:38:50 <adrian_otto> or even three 30 minute meetings, something along those lines
16:39:00 <adrian_otto> ideally you'd have them review together interactively.
16:39:20 <adrian_otto> so they can debate disputes on the spot to keep cycle times between patch revisions shorter.
16:39:54 <n0ano> that'd be nice, whether we can get that kind of commitment is the issue
16:40:17 <adrian_otto> how would you feel about working in this way for a limited time, until a particular milestone is reached (Gantt in its own code base with its own reviewer team)?
16:40:56 <adrian_otto> so the committer, and ideally three reviewers show up like that on a regular schedule as a tiger team
16:41:16 <n0ano> I'll try anything, if you think that can help I'd do it
16:41:17 <adrian_otto> and make revisions on the fly to the extent possible
16:41:30 <bauzas> adrian_otto: well, that's quite close to what Nova calls 'slot'
16:42:02 <adrian_otto> ok, I'd like to help pitch that approach, or any variation on this theme that you think might resonate with the stakeholders and get the throughput up (velocity++).
16:42:05 <bauzas> adrian_otto: I mean, asking specific people to do specific reviews on a specific time is FWIW what we call a runway or a slot
16:42:16 <adrian_otto> tell me more about 'slot' please.
16:42:37 <adrian_otto> ok, mikal proposed that about a month back
16:42:42 <bauzas> adrian_otto: the proposal is about having a certain number of blueprints covered at the same time
16:42:56 <adrian_otto> has that approach resumed, and is it effective?
16:42:58 <bauzas> the current threshold is set to 10
16:43:08 <bauzas> adrian_otto: not yet, planned for k3
16:43:11 <adrian_otto> oh, so like an "on approach" strategy
16:43:13 <n0ano> adrian_otto, only a proposal, hasn't been done yet
16:43:22 <adrian_otto> where you cherry pick feature topics that must land by a deadline?
16:43:30 <n0ano> basically, it's a way to prioritize important BPs
16:43:39 <adrian_otto> I see
16:43:48 <bauzas> adrian_otto: that's just giving reviews attention until it gets merged
16:43:58 <bauzas> s/reviews/reviewers
16:44:04 <n0ano> or, another way of saying it, the current review system doesn't work :-)
16:44:15 <adrian_otto> I refer to this as the "bird dog" approach
16:44:19 <bauzas> adrian_otto: if a blueprint goes stale in the middle of a vote, it loses its slot
16:44:29 <adrian_otto> you have a team of reviewers bird dog a given patch topic
16:45:29 <adrian_otto> ok, that's good food for thought. Please let me know how I can help with this.
16:45:48 <adrian_otto> I'll plan to help with the persuasive pitch delivery.
16:45:55 <bauzas> adrian_otto: well, the best way is to emphasize your needs during the Summit
16:46:05 <n0ano> adrian_otto, tnx, much appreciated
16:46:06 <adrian_otto> I'll be there. :-)
16:46:10 <bauzas> we have some proposals for talks at the Summit
16:46:34 <adrian_otto> the Design summit schedule is still unreleased, correct?
16:47:15 <n0ano> a couple of proposed launchpads so far
16:47:31 <n0ano> https://etherpad.openstack.org/p/kilo-nova-summit-topics
16:47:38 <n0ano> https://etherpad.openstack.org/p/kilo-crossproject-summit-topics
16:48:08 <bauzas> n0ano: s/launchpad/etherpad
16:48:50 <n0ano> at least I got the pad part right :-)
16:48:57 <adrian_otto> :-)
16:49:15 <adrian_otto> ok, I will review those and make remarks on them
16:49:37 <adrian_otto> thanks for your time today bauzas and n0ano.
16:49:48 <n0ano> NP
16:49:51 <bauzas> np
16:50:10 <adrian_otto> anything more before we wrap up?
16:50:16 <bauzas> good to chat with you
16:50:22 <bauzas> nope
16:50:27 <bauzas> we got your requirements
16:50:31 <n0ano> looking forward to Paris
16:50:32 <adrian_otto> ok, cool. I'll see you in Paris!
16:50:38 <adrian_otto> #endmeeting