15:00:40 #startmeeting gantt
15:00:41 armax: thanks
15:00:41 Meeting started Tue Feb 10 15:00:40 2015 UTC and is due to finish in 60 minutes. The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:42 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:44 otherwiseguy, hm, ok
15:00:45 The meeting name has been set to 'gantt'
15:00:55 Anyone here to talk about the scheduler?
15:00:56 o/
15:00:57 * jaypipes around... barely.
15:01:09 o/
15:01:51 hi
15:02:21 let's get started then ( bauzas should join soon )
15:02:33 #topic Remove direct nova DB/API access by Scheduler Filters - https://review.openstack.org/138444/
15:02:49 edleafe, I see ndipanov has some issues with your spec (sigh), were you able to work it out?
15:02:51 OK, so after last week's meeting I pushed to get eyes on the spec after integrating all of the spec cores' suggestions.
15:02:54 \o
15:02:56 * bauzas waves late
15:02:59 I got all +1s from the Gantt team members, but then Nikola brought up some strong objections (he wasn't at the midcycle).
15:03:07 Dan Smith then objected, suggesting that we change the approach to the one that jaypipes rejected earlier, namely sending just the changes.
15:03:15 After speaking with Dan, jaypipes then changed his mind, too, and since he was the driving force behind this approach, we were back at the starting line.
15:03:23 There was no chance of adding differential notification to the spec and getting all of that approved and coded in time for Kilo.
15:03:35 So the only chance for anything in Kilo was to greatly simplify the approach, so I re-wrote the specs so that the only change is additional instance info is returned with the host info.
15:03:43 No feedback whatsoever in the past 4 days, though.
15:03:53 edleafe, yeah sorry about the no-feedback bit
15:03:55 o/
15:03:59 was completely swamped
15:04:04 So I'm pretty sure that there is very little chance of this making it through
15:04:13 edleafe: I think you're right
15:04:16 Someone convince me otherwise, please! :)
15:04:24 edleafe: I mean, about the deliverables
15:04:39 Pardon my naivete - is a FFE implausible?
15:04:51 edleafe, it's still open to exception approval (note johnthetubaguy's mail), the question is creating the code patches
15:04:51 Do you mean it's very unlikely even with a FFE?
15:04:51 lxsli: the spec is not merged yet
15:05:02 lxsli: we don't even have an approved spec
15:05:12 lxsli: so we're not even talking about patches
15:05:15 and the one that was close to approved has been rejected
15:05:34 So we're starting over now
15:05:36 edleafe: my spec has been approved, so I can go thru it
15:05:51 edleafe, but you simplified the approach, that should help - yes?
15:05:52 bauzas: but yours doesn't cover this stuff
15:05:57 edleafe: in the meantime, maybe some people will jump in and discuss the change in the game
15:06:12 n0ano: that was my hope
15:06:13 edleafe: yeah, but it will just shed the light on
15:06:19 but time keeps ticking...
15:06:51 edleafe: I seriously doubt that your spec could be approved
15:06:58 edleafe: so keep it for L
15:07:03 I'd prefer to keep pushing while realizing it is very risky, we have to get some cores to review your patch
15:07:05 bauzas: and with your changes, the spec I wrote would need further change
15:07:12 (I based it on existing code)
15:07:39 edleafe: hence my thoughts: you did great work but you need to consider it for L
15:08:00 edleafe: which is not that far away, btw.
15:08:15 bauzas: no, only 8 months
15:08:16 :)
15:08:38 edleafe: the branching should happen in 1 month or so
15:08:48 edleafe: I mean, Lxxx open for merging
15:08:52 bauzas: I know the timeline
15:09:11 edleafe: anything we can help with?
15:09:12 bauzas, but spec approval won't even start until after Vancouver
15:09:13 bauzas: but it means that a separate scheduler now looks like an M thing at the earliest
15:09:28 edleafe: I'm not seeing this as sequential
15:09:35 alex_xu_: no, not really. Unless you have a time machine
15:10:09 edleafe, I hope that's not the case, if we can get you approved early in L we will still have time for the split
15:10:14 emm....that's sad
15:10:20 n0ano: +1
15:10:25 n0ano: that's my thoughts
15:10:45 n0ano: we can work on splitting things while edleafe is working on his spec
15:10:56 we know there's going to be lots of work for the split, this will just be another part of it
15:11:00 edleafe: at least we can try to get the spec in this release, then we can save some spec review time in L
15:11:10 n0ano: smart sentence
15:11:10 bauzas: the split can happen without this
15:11:27 bauzas: this was just an intermediate step
15:11:48 we can separate the data without having to first go through this
15:12:01 edleafe: agreed, so what's the purpose of you saying 'it won't happen for L' ?
15:12:02 :)
15:12:25 edleafe, we should keep pushing for approval on your spec to increase the odds that it'll be the first spec approved for L
15:12:42 n0ano: you're just picking my words
15:12:55 bauzas: my point is that if we insist on this spec first, it will push everything off
15:13:02 bauzas, clarifying, that's all :-)
15:13:20 edleafe: but you said that you would have to rewrite your spec based on my work, right ?
15:13:28 * alex_xu_ hopes to have a car like in the movie 'Back to the Future'
15:13:41 edleafe: so why not consider that my work is just a dependency of your stuff ?
15:13:48 I say we need to just focus on cleaning the interfaces for data, instead of trying to force them into these intermediate half-changes
15:14:22 edleafe: right, so I think that your spec needs to be a final one, and not something that you're planning to land for Kilo
15:14:29 that's just your words - intermediate
15:14:31 bauzas: then I don't understand why it was ever supposed to be a separate spec
15:15:04 bauzas: if it's just dependent on your changes
15:15:26 edleafe: sounds like there is a confusion there
15:15:35 bauzas: intermediate in the sense that this isn't how we want a separate scheduler to work in the long run
15:16:01 we want it to own the data, not just ask for it from nova in different ways
15:16:27 edleafe: I was just referring to
15:16:29 (16:03:35) edleafe: So the only chance for anything in Kilo was to greatly simplify the approach, so I re-wrote the specs so that the only change is additional instance info is returned with the host info.
15:16:52 edleafe: forget Kilo and provide the right approach for L
15:17:10 bauzas: that was last week when I re-wrote the spec
15:17:34 the fact that nobody has given feedback since then is what made me think that it cannot make it into K
15:18:12 edleafe, well, personally, I was waiting for ndipanov to comment, he had the original objection and should have responded to your change
15:18:32 edleafe: agreed, so that's why we still need to help you by reviewing your approach so we keep momentum on your spec
15:18:50 n0ano: Understood, but it's not like the cores don't have a zillion other things to focus on
15:19:32 edleafe, don't get me started, I don't accept that as an excuse, we all have many things to do
15:19:47 anyway, I think we have a consensus: we need to review your spec and comment on it, whatever the outcome will be
15:20:03 bauzas, +1
15:20:19 bauzas: +1
15:20:21 n0ano: not an excuse; just reality
15:20:32 #action all to review the spec `today`
15:20:46 * n0ano wonders why his actions never take
15:20:50 edleafe: and as I said to you, your spec deserves some time before reviewing, because that one is quite tricky
15:20:51 #link https://review.openstack.org/#/c/138444/
15:20:52 it'd be great if ndipanov, bauzas + edleafe could agree the bones of the ideal solution right now rather than volleying review comments
15:21:11 sure
15:21:16 my point is basically this
15:21:27 n0ano: actions take; the bot just doesn't echo them here
15:21:34 we should not disable hitting the db as such
15:21:35 n0ano: check the minutes afterwards
15:21:49 we should make it go through a well defined scheduler (gantt) interface
15:22:11 that we need to define for this
15:22:17 as this is a common use case for people
15:22:34 and I feel it is orthogonal to any performance improvements like caching
15:22:53 ndipanov: yes, this is not about performance per se
15:22:59 That does sound like it could be a smaller, more incremental change
15:23:01 but having said that I did not read dansmith's follow-up comments on my comment
15:23:14 so I'd be happy with just solidifying that interface
15:23:24 and also I am not sure it should be about instances as such
15:23:34 ndipanov: dansmith wanted to send diffs of the instance list instead of the full list
15:23:39 but I could be flexible on this maybe
15:23:53 ndipanov: sending the full list was something jaypipes had pushed for
15:23:59 edleafe: no, I wanted to send just the instance on which the action is being taken
15:23:59 edleafe, to me this is also an optimization
15:24:21 dansmith: well, isn't that the same thing?
15:24:26 edleafe: no?
15:24:57 dansmith: ok, a node has instances. Something changes on one. We send that change.
15:25:14 correct
15:25:22 dansmith: to the scheduler, that's a diff: take the instance that changed, and update your local view of it
15:25:37 and then send a list of just uuids periodically, which allows the scheduler to patch up in the case of a missed one
15:25:54 edleafe: diffs of the list imply potentially multiple instances updated in a single go to me
15:26:03 dansmith: sure, but that's a separate concern
15:26:10 edleafe: what we should be doing is saying "this instance changed" not "here's a patch against your cache"
15:26:14 I think we validated a 2-step approach
15:26:30 dansmith: sure, definitely not a patch
15:26:31 1. update the list of instances per host when the Scheduler is starting
15:26:45 2. provide updates to the scheduler when a state changes
15:26:55 dansmith, I think we are talking about 2 different things here
15:26:59 dansmith: we used the term 'diff' in earlier discussions; contrasted with sending the whole list
15:27:04 2. can be a full list of instances per host or only a diff, but that's still 2.
15:27:15 the difference is only optimization
15:27:34 what you guys are saying is that we should make instance updates come from the host and then debate how to do this in a reasonable manner
15:27:38 edleafe: that's why I dislike your new PS
15:27:54 what I am trying to say is that we should design that API to not be only about instances
15:27:59 the critical difference to me is the semantic meaning of the interface
15:28:00 edleafe: because you are no longer mentioning those 2 steps
15:28:09 dansmith, go on
15:28:27 bauzas: the new version doesn't have the scheduler keeping a list of instances
15:28:39 ndipanov: we only need instances today - we can add other info later, maybe after the split is further progressed + the scheduler is truly owning instance data
15:28:43 ndipanov: I don't know what you mean about "making it not just about instances"
15:28:55 ndipanov: FYI that's been approved for Aggregates http://specs.openstack.org/openstack/nova-specs/specs/kilo/approved/isolate-scheduler-db-aggregates.html
15:29:06 the only thing that needs to be communicated immediately is about instances, AIUI
15:29:19 dansmith: correct
15:29:22 dansmith, ok
15:29:31 dansmith: and only two attributes of the instances
15:29:48 edleafe: eh? what two attributes?
15:30:06 dansmith: uuid and instance_type_id
15:30:16 I don't think I agree with that
15:30:18 dansmith: those are the only two things that the filters need
15:30:27 that makes the API far too specific to be generally useful going forward
15:30:34 dansmith, agreed
15:30:35 edleafe: yeah, but that just makes it useless for anything else
15:30:37 edleafe: I think we're all happy to store all instance attrs rather than splitting it down so fine though
15:30:44 I see what dansmith is saying
15:30:53 edleafe: we need to just pass the whole instance object
15:31:08 and then have a "protocol" on how to pass updates
15:31:08 dansmith: that, we agreed
15:31:14 edleafe: the scheduler can glean whatever it needs from that information, regardless of what it is looking for -- the compute node doesn't and shouldn't care
15:31:50 ndipanov: that's what we're trying to provide, ie. a new Scheduler RPC API method for updating the Scheduler's view
15:32:02 dansmith: agreed - that was my original logic
15:32:05 bauzas, that's what I lobbied for in that spec too
15:32:21 jaypipes objected to the individual updates
15:32:34 he pushed for sending the full InstanceList
15:32:34 ndipanov: so, I think it was unclear in edleafe's spec then, because it was explicit in my own spec for aggregates
15:32:52 so that there would be no issues involving sync
15:33:10 ndipanov: ie. we have a warmup period for learning the world, and then we provide updates to the scheduler
15:33:11 IMHO, syncing the entire list any time anything happens is faaaar too heavy
15:33:33 dansmith: hence your point on only sending the diff
15:33:43 bauzas: stop saying diff :)
15:33:45 the diff being the full Instance object
15:33:48 bauzas: "affected instance"
15:33:49 heh
15:33:52 bauzas, would that work for aggregates too?
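To make the compute-side idea above concrete, here is a minimal standalone sketch of the behaviour being discussed: when an instance changes, push only that affected Instance object to the scheduler instead of re-sending the whole per-host list. This is illustrative pseudocode with made-up names, not actual Nova/Gantt code.

```python
# Illustrative sketch only -- not actual Nova code. Models "send just the
# affected instance" as discussed above; all names are hypothetical.

class FakeSchedulerClient:
    """Stands in for an RPC client; a real one would cast over the message bus."""
    def update_instance_info(self, host, instance):
        print("scheduler notified: host=%s instance=%s vcpus=%s"
              % (host, instance["uuid"], instance["vcpus"]))

class ComputeHost:
    def __init__(self, name, scheduler_client):
        self.name = name
        self.scheduler = scheduler_client
        self.instances = {}  # uuid -> full instance data for this host

    def instance_changed(self, instance):
        """Called whenever an instance on this host is created or updated.

        Only the single changed instance is pushed to the scheduler --
        the 'affected instance', not a patch or a full list.
        """
        self.instances[instance["uuid"]] = instance
        self.scheduler.update_instance_info(self.name, instance)

host = ComputeHost("compute-1", FakeSchedulerClient())
host.instance_changed({"uuid": "aaa-111", "vcpus": 2, "memory_mb": 4096})
```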
15:33:55 There was a suggestion to send the whole InstanceList on startup; just the changed Instance on change; and the instance UUIDs from each host periodically to prevent desync
15:34:01 you typed too fast
15:34:05 ndipanov: see the spec http://specs.openstack.org/openstack/nova-specs/specs/kilo/approved/isolate-scheduler-db-aggregates.html
15:34:47 ndipanov: the problem with instances is that it doesn't scale the same as for aggregates
15:34:58 bauzas, right...
15:35:00 ndipanov: that's far bigger than just a list of aggs
15:35:16 so this is a scale problem, not a design problem
15:35:39 because we agreed on providing the updates thru RPC to the scheduler
15:35:50 so what dansmith is saying is - design the API so that it can scale if it has to (I think)
15:35:53 fanout for the moment, incremental updates in a later cycle
15:36:05 edleafe: I only objected because I thought it was an optimization that could be done over time...
15:36:07 ndipanov: that, I strongly agree
15:36:09 bauzas, hmm
15:36:19 ndipanov: eh, multiple schedulers...
15:36:25 jaypipes: exactly
15:36:31 ndipanov: scale reasonably, so as not to be too specific to be irrelevant for new things, and not too heavy to be ... unscalable
15:36:37 jaypipes: but it seems that it needs to be done upfront
15:36:40 dansmith, exactly!
15:36:50 dansmith: smart words
15:36:50 jaypipes, I'd say it's part of the semantics really
15:36:57 I mean
15:37:04 edleafe: yes, after chatting with dansmith yesterday, I think I agree the API should be corrected now, not later.
15:37:27 jaypipes: I think I agree with you
15:37:42 ok, so let me see if I can summarize
15:37:45 so wouldn't the API for instances and aggregates then look super similar?
15:37:52 jaypipes: we only need to send the instance that is being impacted, period
15:37:59 - go back to previous design (discussed at midcycle)
15:38:01 ndipanov: that's my thought
15:38:14 There was a suggestion to send the whole InstanceList on startup; just the changed Instance on change; and the instance UUIDs from each host periodically to prevent desync
15:38:18 ndipanov: if we send an object, then we could have one update method and send any object the scheduler should know about down that tube, is that what you mean?
15:38:22 - change it to add an incremental update for changed instances
15:38:28 dansmith, well maybe
15:38:40 not sure that it's relevant enough to block all work on
15:38:54 but it seems like: here's the initial state of this thing
15:38:56 - add a regular update of all UUIDs that scheduler can use to check for sync problems
15:38:56 here are updates
15:39:11 is the pattern not tied to the object at hand?
15:39:21 - add a way for scheduler to ask for updates to missing/extra instances
15:39:25 ndipanov: +1
15:39:28 Is that about right?
15:39:31 edleafe: +1
15:40:02 so not sure UUIDs are what we want here
15:40:07 as they are kind of instance specific
15:40:17 ndipanov: instances are all we need for now
15:40:23 hmmm
15:40:24 we discussed that
15:40:26 right
15:40:27 ndipanov: we can just send over the wire the whole Instance object
15:40:42 bauzas, I'd prefer to use the facilities provided by objects
15:40:50 bauzas: this is for preventing desync - a UUID list is lighter than an InstanceList
15:40:57 even if it means extending them with an optional global ID that is not the DB id
15:41:03 dansmith, ? ^
15:41:15 huh? why do we need that?
15:41:21 ndipanov: um, isn't that what a uuid is?
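A small standalone sketch of the desync check lxsli describes above, assuming the scheduler caches instance data per host: the periodic message carries only UUIDs, and a full refresh is requested only when that set disagrees with the cache. Names and structures are illustrative, not the real implementation.

```python
# Illustrative sketch only. The periodic message carries just UUIDs; a full
# re-send of instance data is requested only when the sets disagree.

def check_host_sync(cached_instances, reported_uuids):
    """cached_instances: dict uuid -> instance data held by the scheduler.
    reported_uuids: iterable of UUIDs the host says it currently runs."""
    cached = set(cached_instances)
    reported = set(reported_uuids)
    missing = reported - cached   # host has them, scheduler missed the update
    stale = cached - reported     # scheduler still holds deleted instances
    return missing, stale

cache = {"aaa-111": {"vcpus": 2}, "bbb-222": {"vcpus": 4}}
missing, stale = check_host_sync(cache, ["aaa-111", "ccc-333"])
if missing or stale:
    # In the proposal the scheduler would now ask the host for a full
    # InstanceList instead of trusting its cache.
    print("out of sync, requesting refresh:", missing, stale)
```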
15:41:22 well not sure we do
15:41:33 edleafe, it is but won't work for aggregates
15:41:34 lxsli: I thought we discussed only sending *one* item thru RPC, ie. update_instance() (with the arg being something related to *one* instance)
15:41:53 Yes, for "change it to add an incremental update for changed instances"
15:42:05 so that's why I would avoid sticking to instance semantics - and rather use Object semantics
15:42:11 if that makes sense
15:42:24 bauzas, but for "add a regular update of all UUIDs that scheduler can use to check for sync problems" we'll use UUIDs not an InstanceList
15:42:35 ndipanov: that sounds like quite a different spec, though
15:42:37 ndipanov: I think the scheduler still needs to have an isinstance(thing_being_updated, objects.Instance) logic
15:42:45 ndipanov: again, http://specs.openstack.org/openstack/nova-specs/specs/kilo/approved/isolate-scheduler-db-aggregates.html is referring to passing over the wire an Aggregate object as parameter
15:42:45 ndipanov: but the interface can be "here's an object I just changed"
15:42:55 ndipanov: adding a general interface for any service to update the scheduler about any type of object
15:43:32 ndipanov: I can see that as the next step, but it seems a bit overreaching for now
15:43:36 if I understand, we pass the entire instance object and this code would only extract the UUID, other code would extract other parts
15:43:43 edleafe: atm, we need dedicated interfaces for each type of resource, we will factorize this later
15:43:49 edleafe, well how do you go from this to general
15:44:03 there is no way other than adding a completely separate new API
15:44:26 and iiuc this whole effort correctly
15:44:35 it's about making the gantt interfaces solid
15:44:41 (not necessarily perfect)
15:44:43 but workable
15:44:44 no, I don't think so
15:44:50 ok
15:44:55 I think that this effort is actually about solving this one problem in a vacuum
15:45:16 dansmith: yes
15:45:18 what we want to avoid, however, is pinning ourselves into an interface that isn't supportable as we make the next step to migrate to gantt
15:45:19 yes
15:45:49 well in that case - use UUIDs and a new client method and all other things you guys said
15:45:58 and call it an Instance something
15:46:13 update_instances_on_host_or_whatever
15:46:17 that sounds excellent
15:46:42 it may not actually be such a bad idea
15:46:44 ndipanov: 'update_instance_info()' - it's in the spec :)
15:47:04 though I would like to see it more general
15:47:06 then, just to be clear, do we all agree on edleafe's proposed steps:
15:47:08 - go back to previous design (discussed at midcycle)
15:47:08 - change it to add an incremental update for changed instances
15:47:08 - add a regular update of all UUIDs that scheduler can use to check for sync problems
15:47:08 - add a way for scheduler to ask for updates to missing/extra instances
15:47:16 +1
15:47:20 +1
15:47:26 I guess
15:47:30 +1
15:47:34 that sounds reasonable
15:47:35 +1
15:47:47 +1
15:48:00 we can really look into making it more general later on
15:48:02 ndipanov: I understand your frustration. This is the sort of thing I was hoping we would have addressed more clearly earlier on.
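Putting the four agreed steps together, the scheduler-side surface could look roughly like the following standalone sketch. Only update_instance_info() is named in the spec; every other name and signature here is an assumption for illustration, not actual Nova/Gantt code.

```python
# Illustrative sketch of the agreed protocol; method names other than
# update_instance_info() are hypothetical.

class SchedulerInstanceView:
    def __init__(self):
        self.instances_by_host = {}   # host -> {uuid: instance dict}

    def set_host_instances(self, host, instance_list):
        """Step 1: full per-host list, sent when the scheduler (re)starts."""
        self.instances_by_host[host] = {i["uuid"]: i for i in instance_list}

    def update_instance_info(self, host, instance):
        """Step 2: incremental update for a single changed instance."""
        self.instances_by_host.setdefault(host, {})[instance["uuid"]] = instance

    def sync_instance_info(self, host, uuids):
        """Step 3: periodic UUID list; returns True if a refresh is needed.

        Step 4 would be the caller asking the host for the full list again
        whenever this returns True.
        """
        known = set(self.instances_by_host.get(host, {}))
        return known != set(uuids)

view = SchedulerInstanceView()
view.set_host_instances("compute-1", [{"uuid": "aaa-111", "vcpus": 2}])
view.update_instance_info("compute-1", {"uuid": "bbb-222", "vcpus": 4})
print(view.sync_instance_info("compute-1", ["aaa-111", "bbb-222"]))  # False: in sync
```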
15:48:24 ndipanov: yeah, that's something we needed to work out after all of this
15:48:31 edleafe, I think we have a plan, let's all try and review the updated spec as soon as it comes out
15:48:38 edleafe, yeah - I mean - it is about striking a balance between moving forward and not having a horrible API
15:48:41 ndipanov: agreed, I'm looking forward to edleafe's plans for "making gantt own it completely rather than asking nova-compute in a different way"
15:48:45 hence my desire to discuss the general scheduler interface at the midcycle
15:49:18 edleafe: I was thinking there was a consensus on this...
15:49:22 not sure what "making gantt own it completely rather than asking nova-compute in a different way" exactly means
15:49:34 bauzas: agreed
15:49:43 edleafe: again, one spec has been approved, the other one is only different because of a scale problem
15:50:13 ndipanov: it means that the scheduler needs to persist information it needs to make scheduling decisions.
15:50:24 right
15:50:25 We're nowhere near that now
15:50:36 well we are persisting it in the DB :)
15:50:42 but yes I get what you mean
15:50:44 ndipanov, what I was going to say
15:50:45 edleafe: that probably requires another scheduler worker
15:50:53 edleafe: but I prefer to leave it for Gantt
15:51:26 Ten minutes, did we have a 2nd agenda item n0ano ?
15:51:26 edleafe: because of the scaling problem mentioned by johnthetubaguy
15:51:31 bauzas, I also don't think it's bad to have two different ways to update 1) gimme all 2) gimme updates
15:51:36 guys, we're running out of time...
15:51:42 #topic opens
15:51:45 so ur spec is cool too
15:51:57 edleafe, anyway ping me when the spec is updated pls
15:52:04 any opens for today (not enough time to talk about patches)
15:52:04 ndipanov: will do
15:52:05 open discussion?
15:52:21 dims__, yes, right nwo
15:52:25 edleafe: do ping me about the spec too, once you folks are happy with it
15:52:30 s/nwo/now
15:52:40 johnthetubaguy: you got it
15:52:48 :)
15:53:01 * n0ano johnthetubaguy just asked to be bothered, works for me :-)
15:53:10 eh eh
15:53:14 crickets for opens ?
15:53:19 yup, saves me attempting to remember and failing
15:53:21 if there's nothing else
15:53:22 ok, nova cores, there's a well curated quick hit list of bugs that are ready for review (updated a few times every day) https://etherpad.openstack.org/p/kilo-nova-priorities-tracking
15:53:36 dansmith ndipanov: do you agree on the way forward for https://review.openstack.org/#/c/152689/ please?
15:53:47 dims__: is that the right place to discuss this ? :D
15:53:48 dims__, yeah, but that is more appropriate for the nova meeting on Thurs
15:53:58 dims__: see /topic
15:54:02 wow!
15:54:15 whoops. sorry :)
15:54:21 dims__, NP
15:54:27 lxsli: I rarely agree to anything
15:54:28 ah that thing
15:54:34 well I think we do in general
15:54:37 dansmith: noted...
15:54:44 OK, I think we're done for today
15:54:51 I thought we did
15:54:54 tnx everyone, be back here next week
15:54:59 #endmeeting