15:00:05 #startmeeting gantt
15:00:06 Meeting started Tue Aug 19 15:00:05 2014 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:09 The meeting name has been set to 'gantt'
15:00:27 hi, who's here for discussing nova scheduler efforts ?
15:00:36 me
15:01:05 n0ano is having a meeting conflict so I'll be chairing today
15:02:42 ok, still waiting one more min
15:04:01 o/
15:04:05 hi folks
15:04:07 sorry late
15:04:20 no problem, we haven't yet started
15:04:34 only mspreitz and me seem to be present today :)
15:04:44 and Jay
15:04:57 was talking to jaypipes :)
15:05:03 ok, I guess we can start then
15:05:07 * jaypipes asked ndipanov to hop in here.
15:05:11 little agenda, but still :)
15:05:20 jaypipes: cool thanks
15:05:23 * ndipanov lands
15:05:24 is PaulMurray still on holiday?
15:05:32 and where's Mr. Moustache?
15:05:33 jaypipes: seems so
15:05:46 jaypipes: my sources tell me that Paul is somewhere in France
15:05:53 darn Europeans with all your vacation :P
15:05:58 haha
15:06:14 jaypipes: so I suspect that he liked the country so much that he won't be there for years
15:06:25 hehe
15:06:33 n0ano has administrative tasks IIUC
15:06:34 well, he'll be there in November :)
15:06:49 anyway, let's start
15:06:49 hi
15:06:54 Yathi: \o
15:07:12 #topic Forklift Status
15:07:23 so much fun here
15:07:36 so, basically, a quick status, as usual
15:08:13 https://review.openstack.org/82778 and https://review.openstack.org/104556 are identified as priorities for J-3 reviews
15:08:26 still waiting for approvals tho
15:08:38 both of them are related to bp/scheduler-lib stuff
15:09:03 another bp is on-going
15:09:14 with the spec to be validated
15:09:22 https://review.openstack.org/89893
15:09:59 changes have been proposed, https://review.openstack.org/#/q/status:open+topic:bp/isolate-scheduler-db,n,z
15:10:16 bauzas: I have reservations about some of the code in those two patches. will review with comments today.
15:10:24 jaypipes: sure, please do
15:10:49 about isolate-scheduler-db, the main concern is about its usage of ERT (Extensible Resource Tracker)
15:10:51 I do as well but mine are well known :)
15:11:00 ndipanov: :)
15:11:11 although not sure it's the 2 patches I am referring to
15:11:13 bauzas: my concerns also revolve around ERT.
15:11:34 bauzas: actually, let me restate...
15:11:42 ndipanov: it's not the same bp
15:12:00 then nothing... carry on
15:12:00 ndipanov: the former two are creating a new client
15:12:09 haven't looked at those
15:12:13 jaypipes: sure, please do
15:12:41 bauzas: my concern around the isolate-scheduler-db patches is that the fundamentals of the API -- the API structure and the parameters passed between conductor/api and the scheduler -- need to be cleaned up before creating a client lib. And the ERT stuff made the interfaces worse than they already were.
15:12:50 I also opened a thread on the -dev ML for discussion: http://lists.openstack.org/pipermail/openstack-dev/2014-August/043466.html
15:13:55 jaypipes: by creating a client lib, you also refer to bp/scheduler-lib ?
15:14:01 s/creating/saying
15:14:13 I could not possibly agree more with jaypipes
15:14:23 yes, I will try to reply to the ML. I am stepping over some toes with my comments, though, and am treading a tightrope between comments and antagonism.
15:14:47 bauzas: creating client lib == https://review.openstack.org/#/c/82778/
15:14:54 sooooo
15:15:10 sounds like we opened Pandora's box
15:15:23 bauzas: well, technically ERT opened it. :P
15:15:39 jaypipes, I really don't want to be antagonistic either - but if we get to stall for one cycle and get things right(er) based on real feedback - it's a net win for gantt imho
15:16:06 ndipanov: I agree (that's actually what I said repeatedly in Oregon as well).
15:16:18 jaypipes: hence the plan we discussed
15:16:30 jaypipes: I mean, we identified some work to do
15:16:35 bauzas: yes, agreed.
15:17:00 bauzas: for instance, I am 100% supportive of the removal of the direct DB and objects calls from the nova/scheduler/ code
15:17:14 btw, nice blog post from mikal here http://www.stillhq.com/openstack/juno/000012.html summarizing what was discussed about the scheduler at the nova meetup
15:17:34 jaypipes: glad to hear I have sponsors :)
15:17:40 bauzas: the issue I have is that the calling structures for the API calls (currently internal, but they will become external once the split goes forward) are awkward and not future-proof.
15:18:04 that's why we identified the need to iterate on that
15:18:21 but if we take that strategy, the work for Kilo is about creating a python lib
15:18:23 bauzas: I believe the ERT work is an "anti-iteration" of that, though.
15:18:38 so that means that the external API will be very loosy
15:19:00 you mean lossy, bad?
15:19:12 mspreitz: uh, my bad
15:19:26 mspreitz: I mean, it will be very lightweight
15:20:05 so, my concerns are about the alternatives
15:20:10 I'm not fully pro-ERT :)
15:20:43 jaypipes: ndipanov: what are your thoughts on what should be done first ? (I'm not saying "rewrite the RT and deliver it in the Scheduler" :D )
15:20:53 well
15:20:57 this is how I see it
15:21:07 if we are going to stick with "optimistic scheduling"
15:21:08 provided we identified that the scheduler needs a clear way to get info from other nova bits
15:21:10 and even if we are not
15:21:22 ndipanov: could you please link your paper again ?
15:21:39 ndipanov: it was pretty interesting, btw. :)
15:21:51 we need a way to agree on what data goes to the scheduler, and after that to the compute nodes
15:21:59 ndipanov: agreed
15:22:01 and what data goes from the compute nodes to the scheduler
15:22:22 ndipanov: I think the statement of what's required in the Nova filters has been done in the sped
15:22:25 spec
15:22:47 and we need to make sure that this data can be retrieved in an efficient manner
15:22:56 ndipanov: agreed too
15:23:06 ndipanov: lemme give you the spec rst file
15:23:11 k
15:23:16 ndipanov: so you'll see all the deps
15:23:28 once you go down the road of agreeing on data
15:23:36 (at least the ones I identified, I'm not bugproof :) )
15:23:54 I think you will see that the RT will likely need to live in the scheduler even though it will be called from the computes
15:23:54 have you considered keeping the data completely outside of the scheduler, as an external db service?
15:23:56 https://review.openstack.org/#/c/89893/11/specs/juno/isolate-scheduler-db.rst
15:24:00 in case we do optimistic scheduling
15:24:19 which is what we do now - no locking in the sched, but we may retry on the computes
15:24:25 ndipanov: I was seeing the RT as a client for updating the Scheduler
15:24:52 ie. the RT and the Scheduler need to have the same view
15:25:03 and the RT pushes updates to the Scheduler
15:25:19 so, even if the Scheduler goes stale, it goes back to the RT for claiming with the correct values
15:25:57 is johnthetubaguy around ?
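The "RT pushes updates to the scheduler" flow described above (the resource tracker and the scheduler keep the same view, with claims still checked against the RT's own numbers) could look roughly like the following. This is only a minimal sketch: FakeSchedulerClient, update_host_state() and the dict layout are illustrative names made up here, not existing Nova or Gantt interfaces.

```python
class FakeSchedulerClient(object):
    """Stands in for the RPC/REST client a split-out scheduler would expose."""

    def __init__(self):
        self.host_states = {}

    def update_host_state(self, host, state):
        # The real thing would be a call over the wire into the scheduler
        # service; here we just record the pushed view.
        self.host_states[host] = dict(state)


class ResourceTracker(object):
    """Compute-side tracker that owns the authoritative usage numbers."""

    def __init__(self, host, scheduler_client, vcpus, memory_mb):
        self.host = host
        self.scheduler_client = scheduler_client
        self.total = {'vcpus': vcpus, 'memory_mb': memory_mb}
        self.used = {'vcpus': 0, 'memory_mb': 0}

    def claim(self, flavor):
        """Claim locally, then push the updated view to the scheduler."""
        for res in ('vcpus', 'memory_mb'):
            if self.used[res] + flavor[res] > self.total[res]:
                raise Exception('%s exhausted on %s' % (res, self.host))
        for res in ('vcpus', 'memory_mb'):
            self.used[res] += flavor[res]
        # Even if the scheduler's copy goes stale, the claim above was checked
        # against the RT's own numbers, so it stays correct; the push just
        # keeps the two views converging.
        self.scheduler_client.update_host_state(self.host, self.used)


# Usage: the RT claims for a 2 vCPU / 4 GB flavor and reports the new usage.
client = FakeSchedulerClient()
rt = ResourceTracker('node1', client, vcpus=16, memory_mb=65536)
rt.claim({'vcpus': 2, 'memory_mb': 4096})
print(client.host_states['node1'])  # {'vcpus': 2, 'memory_mb': 4096}
```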
15:26:08 what jaypipes was proposing with one of his POCs is to not do claims and retries on the host
15:26:20 ndipanov: yeah, I know
15:26:49 ndipanov: I was just mentioning another approach, which is to keep claims (and the RT) in Compute
15:27:00 bauzas: I am semi-available but behind on my email
15:27:24 ndipanov: well, it was proposing to do a final retry/check on the host, but do claims (and return those claims over the Scheduler API) in the scheduler itself.
15:27:29 johnthetubaguy: cool, we're just debating how the RT/Scheduler thing is articulated
15:27:46 jaypipes, even better
15:27:59 that this would totally be an implementation detail of the resource tracker
15:28:00 ndipanov: so we do a tight loop on the scheduler side, with optimistic locking on the compute node resources, and then just do retry/exception logic on the compute node itself.
15:28:46 jaypipes: this sounds like what we badly called two-phase commit before, or did I misunderstand the proposal?
15:29:10 jaypipes: ah, your loop is inside the scheduler, not in the conductor
15:29:13 #link http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
15:29:18 johnthetubaguy: no, no 2-phase commit at all.
15:29:35 jaypipes, do link that patch here :)
15:29:37 johnthetubaguy: right. the retry claim loop is entirely in the scheduler.
15:29:41 jaypipes: honestly what we called it wasn't two-phase commit either
15:29:52 https://review.openstack.org/#/c/103598/
15:30:00 johnthetubaguy: yes, understood :)
15:30:38 anyway, the above PoC code was just that.. for demo purposes. It includes a bunch of code that shows how to model resources properly without ERT too.
15:31:13 ah, I even starred it
15:31:50 it really should be broken down into two parts:
15:32:21 a) changing the scheduler APIs to use resource models and a real class (no nested dicts) for modeling requested resources, launch policies, and conditions
15:32:32 b) having the scheduler do the claim process, not the compute node
15:33:19 the way I see it - you guys need to do a), and b) can come later
15:33:30 but a) is something that needs to be done or we will regret it
15:33:36 jaypipes: one alternative is to move all claims to the conductor, then see how they fit into the scheduler, where possible? but maybe that's more work than we need
15:33:46 right, that's what I've been saying. and doing a) after a split is just gonna be painful
15:33:56 jaypipes: just thinking of resize claims vs boot claims, but maybe that split is silly
15:34:24 johnthetubaguy, you need all the data that the sched had to do either
15:34:30 yeah
15:34:32 yeah, ignore me I think, straight to the scheduler makes more sense, it's a given slot you are trying to reserve either way around
15:35:43 jaypipes: correct me if I'm wrong, but a) is just about stopping sending blobs ?
15:35:49 well, look, what I've been saying is that we need to get these resource models and resource/launch request models done first, then work on claim stuff. The problem with ERT is that it throws away good resource modeling in favor of yet more nested dicts of stuff.
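For reference, a minimal sketch of the "tight claim loop on the scheduler side" described above: filter against the scheduler's own view, apply the claim under an optimistic generation check, and retry if another worker raced in. The in-memory host table, the generation counter and the function names are assumptions made for illustration; in the linked PoC the state lives behind the scheduler's API/DB, and the compute node only does a final check that may raise a retry.

```python
class NoValidHost(Exception):
    pass


class ClaimRetriesExceeded(Exception):
    pass


# Scheduler-side view of the hosts, each row carrying a generation counter
# used for optimistic locking (assumed layout, for illustration only).
hosts = {
    'node1': {'generation': 0, 'free_vcpus': 4, 'free_ram_mb': 8192},
    'node2': {'generation': 0, 'free_vcpus': 8, 'free_ram_mb': 4096},
}


def claim_in_scheduler(request, max_retries=3):
    """Filter, pick a host and claim it, retrying on concurrent updates."""
    for _ in range(max_retries):
        # Work on a snapshot of the current view while filtering/weighing.
        snapshot = {h: dict(s) for h, s in hosts.items()}
        candidates = [h for h, s in snapshot.items()
                      if s['free_vcpus'] >= request['vcpus']
                      and s['free_ram_mb'] >= request['ram_mb']]
        if not candidates:
            raise NoValidHost()
        host = max(candidates, key=lambda h: snapshot[h]['free_ram_mb'])
        # Optimistic lock: only apply the claim if nothing changed the row
        # since the snapshot was taken (with a DB backend this would be an
        # UPDATE ... WHERE generation = :snapshot_generation).
        if hosts[host]['generation'] == snapshot[host]['generation']:
            hosts[host]['free_vcpus'] -= request['vcpus']
            hosts[host]['free_ram_mb'] -= request['ram_mb']
            hosts[host]['generation'] += 1
            return host
        # Another scheduler worker won the race; loop with fresh data.
    raise ClaimRetriesExceeded()


print(claim_in_scheduler({'vcpus': 2, 'ram_mb': 2048}))  # picks 'node1'
```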
15:35:53 ndipanov: so in the back of my head, it's keeping the single scheduler running fast, so giving it less work to do, but really we need to make multiple schedulers work, and frankly I had bet on this claim process on the compute being the locking mechanism to fix that, so this fits those two things together nicely
15:37:07 jaypipes, and it also does not provide any way to go from: this was requested by the user -> this is the data we all see
15:37:14 yup
15:37:16 jaypipes: so I see the current ERT as a small step towards refactoring the existing code, not a finished thing, we need to split up the big blob of code into smaller chunks where it is clear what you do to add new resources, and agreed we need something better than random dicts
15:37:28 johnthetubaguy: +1
15:37:34 johnthetubaguy: sorry, I disagree. I see it as a step backwards.
15:37:39 johnthetubaguy, as for the claims being on compute - I don't think it's a bad design
15:38:24 but what is bad design - is that there is no clear way to do the same thing in the sched and in the claim
15:38:27 jaypipes: I mean, I can understand that nested dicts are evil, but why not consider ERT updating resource models ?
15:38:30 Those of us who want to make joint decisions will need a way to make joint claims
15:38:36 without doing select * from join...join...
15:39:00 ndipanov: hence the idea that the RT and the Scheduler should have the same model
15:39:05 ndipanov: I do. the placement engine needs to have a holistic view of the system's resources, and having claims handled on the compute node means the placement engine has out-of-date info and cannot make quick decisions (it must rely on retry exceptions being raised from the compute)
15:39:33 jaypipes: I think I agree about your issues with the interface, I just saw that as something we need to improve and evolve into a strict versioned system, there are certainly safer ways down that path for the same kind of code split
15:39:37 bauzas: because the thing that is extensible about ERT does not need to be extensible? :)
15:40:21 jaypipes, that is fair but that is a design decision, and one we can walk away from for a better design (fewer trade-offs) - and orthogonal to the idea of data modeling (and querying)
15:40:23 bauzas: resources don't need to be extensible. they need to be properly modeled.
15:40:23 jaypipes: extensible is just another word for on-demand
15:40:29 bauzas: no...
15:40:42 jaypipes: I want to be able to reduce the reported resources to a bare minimum for my filters, but let's not go there
15:40:51 bauzas: extensible, in the case of ERT, means resources are classes that are loaded as plugins in stevedore, and that is totally useless IMO
15:41:28 bauzas: instead, we need to properly model the resources that we know are used in Nova: cpus, memory, NUMA placement, disk, etc
15:41:41 and aggregates, flavors, instances...
15:42:05 bauzas: and making those resources "plugins" does not make those things suddenly proper models. in fact, it makes them even more loosely defined and non-standardized/inconsistently-applied
15:42:35 jaypipes: ok, let's ban the word "plugins" and replace it with "classes"
15:42:38 jaypipes: yeah, I wish that bit wasn't implemented already, it's the ability to reduce the traffic to a minimum that I would like, and perhaps only reporting the deltas does that anyway, but maybe let's not go there right now, I love the claims discussion
15:42:56 NUMA placement is not a resource... it is the observation that you cannot factor a node's resources into orthogonal sets
15:43:11 orthogonal dimensions
15:43:19 johnthetubaguy: I'm not really following this closely, but I too worry about it being *more* work to go from ERT to versioned/stable data than it would be from what we had before
15:43:27 mspreitz: no, that is not correct. if an instance consumes a certain socket/core/thread, it is consumed as a whole.
15:44:12 dansmith: yeah, that's a good point, it seemed easier in my head, but happy to bow to the consensus on that
15:44:36 mspreitz, not sure I follow...
15:44:38 so should we consider making use of what we already version ?
15:44:49 ie. update the Scheduler with objects ?
15:44:53 johnthetubaguy: I went from "on board" when it used objects, to "meh" when it didn't, and made the slow slide to -1 over the course of watching it
15:45:05 never mind, I was thinking of a more general kind of NUMA, I guess
15:45:49 bauzas: yes, definitely, but there are a number of things that are not objects -- for example, a "LaunchRequest" and a "Resource" (and subclasses) are not objects yet.
15:45:49 in the sense where it defines access times/bandwidth that you could then somehow schedule on?
15:45:56 mspreitz, ^
15:46:18 jaypipes: what do you mean by LaunchRequest ?
15:46:28 and a ComputeNode is a resource
15:46:34 bauzas, I assume all the data we need but don't have in a single place right now
15:46:39 bauzas: the thing currently called "request_spec" in the scheduler APIs.
15:46:45 like filter_specs and request_spec
15:46:48 yes that
15:46:50 bauzas: but made into a real class, not a random set of nested dicts.
15:46:59 jaypipes, +1000
15:47:05 jaypipes: +1
15:47:12 jaypipes: I was expecting/fearing this answer...
15:47:16 ok +1
15:47:18 https://review.openstack.org/#/c/103598/4/nova/placement/__init__.py <-- see the PlacementRequest class.
15:47:39 ndipanov: I was thinking of NUMA as referring to non-uniform access to main memory; I think the discussion here is focusing only on cache, which is bound to core
15:47:40 bauzas: don't fear the reaper.
15:47:49 lol
15:47:53 ok, time is running fast (for the reaper too)
15:48:06 bauzas: don't fear the reaper.
15:48:09 gah...
15:48:16 up key
15:48:21 we need to somewhat conclude on that topic, even if the last topic is open discussion
15:48:31 so
15:48:39 wrt what has been discussed
15:48:40 needs more cowbells
15:49:01 bauzas, that's what I've been trying to say all along - for me, without this (modeling data first) - we are just postponing the pain
15:49:08 mspreitz: ++ :)
15:49:19 jaypipes: do you agree to find some time to discuss a real *change* with me ?
15:49:23 bauzas: jaypipes: I like the retry loop for gaining claims living inside the scheduler, for the record
15:49:28 jaypipes: of course, it will deserve a spec...
15:49:39 bauzas: absolutely. that's why I keep showing up here :)
15:49:41 I do have one thing for opens, or the ML
15:49:51 ok so
15:50:08 bauzas: I just think ERT makes it harder to get to where we need to be.
15:50:12 I'd like to tighten up the arguments around the smart or solver scheduler
15:50:30 #action bauzas and jaypipes to propose a resource model for scheduler
15:50:35 bing.
15:50:40 jaypipes: dansmith: I guess that's the point of contention, is ERT a step backwards or forwards
15:50:52 * ndipanov supports that and will even help if it's after thursday
15:51:04 ++
15:51:18 ok, happy us, we have an action
15:51:21 johnthetubaguy: ndipanov and I are pretty strong on the backwards side.
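To make the "real class, not a random set of nested dicts" point concrete, here is a minimal sketch in the spirit of the PlacementRequest class in the PoC review linked above. Every class and field name below is illustrative only; the actual model is exactly what the #action asks bauzas and jaypipes to propose.

```python
class ResourceRequest(object):
    """Typed amounts of the resources the scheduler reasons about."""

    def __init__(self, vcpus, memory_mb, root_gb, ephemeral_gb=0):
        self.vcpus = vcpus
        self.memory_mb = memory_mb
        self.root_gb = root_gb
        self.ephemeral_gb = ephemeral_gb


class PlacementRequest(object):
    """What the API/conductor would hand the scheduler instead of a blob."""

    def __init__(self, resources, num_instances=1, group_policy=None,
                 scheduler_hints=None):
        self.resources = resources            # a ResourceRequest
        self.num_instances = num_instances
        self.group_policy = group_policy      # e.g. 'affinity' / 'anti-affinity'
        self.scheduler_hints = scheduler_hints or {}


# Roughly what gets passed today: loosely-defined nested dicts whose keys are
# only conventions shared between the callers and the filters.
legacy_request_spec = {
    'instance_properties': {'vcpus': 2, 'memory_mb': 4096, 'root_gb': 20},
    'num_instances': 1,
}

# The same request as a versionable, documentable object.
request = PlacementRequest(ResourceRequest(vcpus=2, memory_mb=4096, root_gb=20))
```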
15:51:42 I'm proposing to discuss ERT when PaulMurray is back
15:51:54 yeah - I think it opens us up to more pain later for skimping on proper design now
15:52:00 yep
15:52:01 at least not reverting his change until he can somewhat discuss it
15:52:03 My opinion is that the central logic can be very generic: for each resource you have capacity and demands. Could be handled with dicts.
15:52:17 bauzas: yes, I agree completely with waiting for Paul to be back.
15:52:19 jaypipes: dansmith: right, I kinda thought it was a baby step forward, albeit with some unfortunate baggage, but happy to go with the majority on this, I agree there are other much easier routes forward, it's just that this one already had effort on it
15:52:32 johnthetubaguy: cool
15:52:45 ... well, at least until the Frogs free an Englishman from jail
15:53:00 ok, next topic then
15:53:03 7 mins left
15:53:07 #topic open discussion
15:53:11 mspreitz: ?
15:53:15 ok...
15:53:44 I wonder if we can separate the issue of more sophisticated placement criteria from the issue of simultaneous vs. sequential
15:54:05 I also wonder if Yathi has a runtime complexity argument in favor of simultaneous
15:54:13 bauzas, as for the paper that I linked you the other day, just google "google omega paper"
15:54:29 ndipanov: I gave the link in that discussion
15:54:30 :)
15:54:36 ndipanov: see ^
15:54:40 That is, I think we can do sophisticated placement criteria with scheduler hints, if we are willing to accept sequential solving.
15:55:01 The argument about simultaneous vs. sequential solving is a possibly separable thing
15:55:23 Yathi: are you still here?
15:55:24 mspreitz, sequential as in - we have a queue and some kind of a lock on all resources
15:55:26 mspreitz: I think that's a good question which deserves a rediscussion of the Solver Scheduler bp
15:55:42 hi yes
15:55:55 by sequential I mean how we do it now, with no attempt to gather a bunch of things together for a joint placement decision
15:56:03 ndipanov: hmm, interesting, thanks
15:56:34 simultaneous gives you a way to cover a unified view which could be lost when done sequentially
15:56:38 johnthetubaguy, of course not all of that applies here - but it does make a nice taxonomy of different sched designs and their tradeoffs
15:56:45 I want to be precise about the loss
15:56:58 one loss is this: you risk picking a poorer solution
15:57:05 Yathi: simultaneous would possibly require a locking mechanism
15:57:06 ndipanov: right, that's always handy, shared terminology
15:57:07 another possible loss is this: you spend more time solving
15:57:15 I want to understand if the second is so
15:57:22 I especially liked the 2-level approach where you have the resource master and the schedulers that see a subset of resources
15:57:44 that the master lets them see
15:57:45 mspreitz: once you do additions and subtractions, you end up having to do both I guess, so maybe we do sequential first?
15:58:08 ndipanov: that's what cells does today (albeit badly)
15:58:19 johnthetubaguy: correct.
15:58:31 johnthetubaguy: sharding vs. unified view.
15:58:31 anything that will not result in a loss can be handled sequentially, I agree
15:58:36 johnthetubaguy: I could easily see a roadmap that starts with more sophisticated placement criteria and switches from sequential to simultaneous later
15:58:54 but these will not increase solving time when done simultaneously either
15:58:57 mspreitz: IMHO, simultaneous needs to be covered outside of Nova
15:59:23 mspreitz: because I'm seeing it as something asking for a "lease", and I think you know what I'm seeing
15:59:28 I am kinda lost in the cross conversation
15:59:33 can we pls follow up on the ML?
15:59:41 mspreitz: sure, open a thread
15:59:51 ok
15:59:51 sure
15:59:53 ++
16:00:07 folks, thanks for your help, much appreciated
16:00:12 ty bauzas :)
16:00:12 bauzas, np
16:00:25 see you next week, and jaypipes don't plan to take vacations soon :)
16:00:26 thank you for caring about this
16:00:32 jaypipes: the cells sharding is more useful for "soft" people reasons, like isolating infrastructure into like-typed failure zones as you add capacity, so you can spot failure patterns more easily, etc
16:00:37 bye all
16:00:39 #endmeeting
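As a toy illustration of the capacity/demands framing from the open discussion, and of the first "loss" mentioned there: placing requests one at a time against per-host capacities can fail (or pick a poorer solution) where a simultaneous decision over the same demands succeeds. The hosts, numbers and function names are made up for illustration; only vcpus are modeled, but the same works with a dict of capacities and demands per resource type.

```python
from itertools import product


def place_sequential(demands, capacity):
    """Place each demand on the host with the most free capacity (spread)."""
    free = dict(capacity)
    placement = {}
    for i, need in enumerate(demands):
        host = max(free, key=free.get)
        if free[host] < need:
            return None  # no valid host for this request
        free[host] -= need
        placement[i] = host
    return placement


def place_jointly(demands, capacity):
    """Brute-force a joint placement of all demands at once."""
    hosts = list(capacity)
    for assignment in product(hosts, repeat=len(demands)):
        used = dict.fromkeys(hosts, 0)
        for need, host in zip(demands, assignment):
            used[host] += need
        if all(used[h] <= capacity[h] for h in hosts):
            return dict(enumerate(assignment))
    return None


capacity = {'node1': 4, 'node2': 4}
demands = [2, 2, 4]
print(place_sequential(demands, capacity))  # None: the 4-vcpu request no longer fits
print(place_jointly(demands, capacity))     # e.g. {0: 'node1', 1: 'node1', 2: 'node2'}
```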