14:01:21 <edleafe> #startmeeting nova_scheduler
14:01:22 <openstack> Meeting started Mon Apr 17 14:01:21 2017 UTC and is due to finish in 60 minutes. The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:25 <openstack> The meeting name has been set to 'nova_scheduler'
14:01:32 <mriedem> o/
14:01:32 <cdent> I'll take that as a yes
14:01:35 <edleafe> Sorry, I was making coffee :)
14:01:38 <alex_xu> o/
14:01:54 <jaypipes> o/
14:02:06 * jaypipes also making coffee...
14:02:11 <lei-zh> o/
14:03:13 <edleafe> #topic Specs & Reviews
14:03:37 <edleafe> Good to see that the Traits stuff has merged \o/
14:03:44 <edleafe> Still one bit left:
14:03:46 <edleafe> #link os-traits sync to DB https://review.openstack.org/#/c/450125/
14:03:47 <jaypipes> edleafe: I think the claims one is the biggie.
14:03:48 <alex_xu> \o/
14:03:59 <jaypipes> edleafe: as far as specs go.
14:04:12 <mriedem> claims is the only scheduler related spec outstanding that i care about at this point
14:04:14 <edleafe> jaypipes: yep, getting to that
14:04:53 <edleafe> Also some work on adding to os-traits, and re-organizing os-traits
14:04:54 <edleafe> #link os-traits reorg: https://review.openstack.org/#/c/448282/
14:05:44 <edleafe> If there's nothing about any of those, we can move on to the claims stuff
14:06:16 <edleafe> ok then
14:06:17 <edleafe> #link WIP placement doing claims: https://review.openstack.org/437424
14:06:20 <edleafe> #link Forum session proposed for claims in the scheduler: http://forumtopics.openstack.org/cfp/details/63
14:06:23 <edleafe> So let
14:06:25 <edleafe> ugh
14:06:30 <edleafe> let's talk about claims
14:08:20 <jaypipes> worst 80s song ever.
14:08:23 <mriedem> i haven't looked at the latest since our hangout
14:08:49 <mriedem> jaypipes: salt n pepa were pressured by their management to do it anyway
14:08:59 <jaypipes> mriedem: cdent has concerns about sending the allocation list object over RPC. to summarize, he believes it is better/simpler to just have the compute node do GET /allocations
14:09:31 <mriedem> and of course i can't find the etherpad from the hangout session now
14:09:46 <edleafe> For me it's not a concern about load or reliability as much as it seems to go against the design
14:10:34 <mriedem> so conductor would PUT the allocations, compute would GET the allocations and if they don't exist yet, compute does the old style thing, right?
14:10:36 <mriedem> is the idea
14:10:43 <jaypipes> mriedem: right.
14:10:49 <mriedem> whereas,
14:11:01 <mriedem> if conductor passed the allocations to compute, compute can just key off that to know if it needs to do the old style thing
14:11:04 <jaypipes> mriedem: and frankly, I agree with edleafe and cdent that the PUT then GET is cleaner.
14:11:35 <jaypipes> mriedem: yes, but passing allocations to compute introduces significantly more complexity to the code.
14:11:38 <cdent> mriedem: the "can just key off" part is the part that's concerning/misleading. the behavior is not complex, but the code changes and commitments are
14:11:43 <mriedem> yeah i agree it's cleaner, it's a bit redundant, and it takes some control away from the controller layer, which is something dansmith mentioned when we talked about it
14:11:48 <jaypipes> mriedem: mostly because we will now need to use versioned objects for the communication
14:12:05 <mriedem> jaypipes: why?
14:12:15 <mriedem> we said we weren't going to stuff the allocations in the request spec
14:12:41 <jaypipes> mriedem: we still would send the AllocationList as a param over RPC, no?
14:12:45 <mriedem> yes
14:13:00 <jaypipes> mriedem: so therefore that param would need to be an ovo, no?
14:13:05 <mriedem> yes
14:13:15 <mriedem> well, doesn't need to be, but it would be
14:13:27 <dansmith> and those ovo definitions are "owned" by placement,
14:13:28 <mriedem> to link back to the resource provider object
14:13:32 <dansmith> which is to be kicked out
14:13:32 <jaypipes> mriedem: and we don't use versioned objects for placement/resource providers stuff on the compute node.
14:14:00 <mriedem> dansmith: so you've come around to the idea of pulling the allocations from the compute rather than pushing them there?
14:14:22 <dansmith> mriedem: I don't need to come around, I was saying there are good attributes about both approaches
14:14:31 <mriedem> to be clear, i'm fine with the compute pulling them
14:14:44 <mriedem> i think we mostly picked a side during the hangout because we were ratholing
14:14:47 <mriedem> and needed to move on
14:14:55 <jaypipes> true
14:15:14 <mriedem> but yeah, keeping the placement objects out of nova would be great
14:15:22 <mriedem> and then the compute is just another client working with the dict responses
14:16:10 <mriedem> anyone have that etherpad link handy?
14:16:48 <jaypipes> https://etherpad.openstack.org/p/nova-pike-claims-in-scheduler
14:17:05 * edleafe was too slow
14:17:57 <jaypipes> mriedem: see line 37 :)
14:17:58 <mriedem> one question in there was, "The compute could fail the HTTP call to Placement and then what do we do?"
14:18:27 <cdent> isn't that the same as "the RPC call could fail and then what?"
14:18:36 <jaypipes> mriedem: if the compute can reach the conductor, then set the instance to ERROR, I would say.
14:18:51 <mriedem> jaypipes: it can't in multi-cell with super conductor
14:18:53 <mriedem> cdent: good point
14:19:12 <jaypipes> mriedem: why not?
14:19:17 <mriedem> cdent: in that case, we delete the allocations when we delete the instance, or whatever hand-wavey thing we said we'd do about cleaning up allocations
14:19:26 <mriedem> jaypipes: that's the same as the retry issue,
14:19:40 <mriedem> with super-conductor the compute can't rpc cast back to conductor to retry a build
14:20:01 <jaypipes> mriedem: wouldn't it call Instance.save(), which would communicate with the local conductor and set the instance's vm_state to ERROR... we could have the audit job look for ERROR'd instances and delete allocations from placement.
14:20:02 <mriedem> so once we claim and pick a host, it builds on that host or fails
14:20:29 <mriedem> i was thinking we put the instance in ERROR state, yeah
14:20:31 <mriedem> but,
14:20:56 <mriedem> how do we know the GET /allocations failure is a new style thing and not the old style fallback? maybe that doesn't matter
14:21:08 <mriedem> i.e. how does the compute know allocations have already been made for it?
14:21:17 <mriedem> so it should error if it can't get them
14:22:09 <jaypipes> mriedem: well, there's a difference between not getting allocations from the placement API and not being able to communicate with the placement API...
14:22:34 <jaypipes> mriedem: in the former case, the compute would just proceed with its local claim process as normal. in the latter, it would set the instance to ERROR.
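
A minimal sketch of the compute-side check jaypipes describes just above, in Python. The `placement` client, `do_local_claim`, `build_instance`, and `set_instance_error` names are hypothetical stand-ins, not Nova code; the point is only to show the three outcomes (allocations present, allocations empty, placement unreachable):

    # Illustrative sketch only; `placement`, `do_local_claim`, `build_instance`
    # and `set_instance_error` are hypothetical stand-ins, not Nova APIs.
    import requests


    def claim_or_fallback(placement, instance, do_local_claim, build_instance,
                          set_instance_error):
        try:
            # GET /allocations/{consumer_uuid}; an empty result means nothing
            # has been claimed for this instance yet.
            allocations = placement.get_allocations(instance.uuid)
        except requests.exceptions.ConnectionError:
            # Could not reach the placement API at all: surface the failure
            # (e.g. via Instance.save() through the local conductor).
            set_instance_error(instance)
            return
        if allocations:
            # Conductor already claimed on our behalf; just build.
            build_instance(instance)
        else:
            # Empty result: the control plane is not claiming yet, so fall
            # back to the old local claim path (kept for a release or two).
            do_local_claim(instance)
            build_instance(instance)
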
14:22:42 <mriedem> might be overthinking this, of course. if we assume the control plane is upgraded before computes and it's making allocations for the compute, then we wouldn't have any fallback in the compute if GET /allocations fails
14:22:47 <jaypipes> mriedem: assuming it can contact the local conductor of course.
14:22:48 <mriedem> *fails or is an empty response
14:23:11 <mriedem> jaypipes: but i'm not sure it should proceed with the local claim process
14:23:13 <mriedem> is what i'm getting at
14:23:21 <jaypipes> mriedem: well, a 404 is perfectly valid, right?
14:23:30 <mriedem> because we could then be double-claiming
14:23:33 <mriedem> right?
14:23:55 <jaypipes> mriedem: if it gets a 404, that means "hey, not all compute nodes are upgraded to understand the new claims in placement yet, so I should proceed as per the old way"
14:24:30 <edleafe> GET /allocations/{instance_id} should return an empty list if there aren't any
14:24:34 <mriedem> yes, that would be true if you have ocata computes and conductor isn't claiming yet
14:24:35 <edleafe> Not 404
14:25:03 <jaypipes> edleafe: if there have been no allocations for an instance, it returns an empty list?
14:25:35 <edleafe> That's what I thought. cdent?
14:25:41 <mriedem> i'm asking about the case where all computes are pike, conductor is claiming, and the compute does a GET /allocations/{instance_id} and it's empty or 404. then what? i'm thinking we just put the instance into error state - but how to know if we should do that is tricky
14:25:59 * edleafe is rebuilding his laptop and is currently without his dev environment
14:26:07 <mriedem> granted, it shouldn't be empty or 404
14:26:13 <jaypipes> mriedem: why would you put the instance into error state instead of just trying the local claim?
14:26:24 <mriedem> jaypipes: because then you are double-claiming
14:26:35 <edleafe> mriedem: how is that possible?
14:26:50 <cdent> sorry, my network is being very lame
14:26:56 <edleafe> if there are allocations against that instance, how could it return an empty list?
14:27:00 <mriedem> would the local claim just overwrite the existing allocations that conductor created?
14:27:01 <cdent> get allocations for consumer returns an empty result, not a 404
14:27:13 <jaypipes> cdent: ok
14:27:34 <edleafe> And if there are allocations, but the compute gets back an empty list, we have much bigger problems
14:27:58 <mriedem> edleafe: i agree with that
14:27:59 <jaypipes> mriedem: I think for at least one release, we should have the compute node try the old local claim process.
14:28:08 <jaypipes> mriedem: and then remove all that code in, say, "R".
14:28:20 <edleafe> jaypipes: that sounds reasonable
14:28:26 <jaypipes> mriedem: after removing that code, the default behaviour would be to put the instance in error state.
14:28:58 <mriedem> does anyone know if the local claim would overwrite the existing allocations? or is there a generation id or something that would make that a conflict?
14:29:22 <jaypipes> mriedem: checking, one sec.
14:29:32 <jaypipes> mriedem: pretty sure there is a check, but hold up.
14:29:47 <edleafe> it's a PUT, right?
14:30:12 * edleafe feels extra dumb without code available
14:30:45 <jaypipes> mriedem: the AllocationList.set_allocations() would fail with a ConcurrentUpdateDetected and return a 409 to the client.
14:30:46 <cdent> iirc: when we set allocations we don't check any generation that is provided, but we do check the generation internally
14:31:09 <cdent> jaypipes: you looking at set_allocations in the handler code?
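
For reference on the overwrite question, a rough sketch of the PUT /allocations/{consumer_uuid} exchange being discussed. The payload shape is approximate (a list of allocations, each naming a resource provider), the resource values are made-up examples, and `session` is a hypothetical placement-aware HTTP client; the generation check happens server-side (ConcurrentUpdateDetected, surfaced as a 409), not via anything in this payload:

    # Rough sketch of the PUT /allocations/{consumer_uuid} call; payload shape
    # is approximate and `session` is a hypothetical placement client.
    EXAMPLE_PAYLOAD = {
        "allocations": [
            {
                "resource_provider": {"uuid": "<compute-node-rp-uuid>"},
                "resources": {"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 20},
            },
        ],
    }


    def put_allocations(session, consumer_uuid, payload=EXAMPLE_PAYLOAD):
        resp = session.put('/allocations/%s' % consumer_uuid, json=payload)
        if resp.status_code == 409:
            # The server-side generation check rejected the write because the
            # provider's usage changed underneath us (concurrent update).
            raise RuntimeError('concurrent update, allocations not written')
        resp.raise_for_status()
        return resp
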
14:31:15 <jaypipes> cdent: the rp.generation(s) are passed to the placement REST API.
14:31:26 * cdent re-looks
14:31:48 <cdent> jaypipes: look at the gabbits
14:32:06 <jaypipes> sigh
14:32:22 <cdent> this is something we talked about way back then, and for some reason decided it wasn't necessary
14:32:34 <alex_xu> I thought the scheduler will upgrade first, and the compute nodes upgrade after that, so why does the upgraded compute node need to fall back to the old claim?
14:32:53 <mriedem> alex_xu: conductor will only do the new style claim if all computes are upgraded
14:33:08 <mriedem> because ocata computes would always be doing a local claim
14:33:11 <alex_xu> mriedem: ah, yea, right
14:33:51 <jaypipes> why on earth didn't we include rp.generation in the request payload... :(
14:34:05 <cdent> for reference
14:34:11 <cdent> #link allocations with no gen: https://github.com/openstack/nova/blob/05a25d7e2a6aa9d5eb7327c91509178f95186e17/nova/tests/functional/api/openstack/placement/gabbits/allocations.yaml#L136
14:34:11 <mriedem> jaypipes: cdent: so i think it's reasonable to assume that if conductor makes the allocations, compute should be able to get them, and if for whatever reason it doesn't, then it does the local claim, which could try and do its own PUT /allocations - if that failed with a 409, then idk, error?
14:34:11 <jaypipes> that really should be added.
14:34:24 <jaypipes> mriedem: yes.
14:34:55 <jaypipes> cdent: that's really a bug. I'll log one and fix it.
14:35:02 <mriedem> we'll sort out weirdo edge cases once we start rolling out code i guess
14:35:16 <mriedem> so to summarize:
14:35:16 <jaypipes> k
14:35:17 <cdent> jaypipes: okay, but I wish I could remember what you told me when I questioned it in the first place, because that would be useful context
14:35:22 <mriedem> 1. conductor won't do anything until all computes are upgraded,
14:35:37 <mriedem> 2. once all computes are upgraded, conductor makes the claim (creates allocations)
14:35:58 <jaypipes> cdent: all I remember is rejecting the original request payload that had the allocations structured as a dict keyed by rp.uuid, saying that we may need other fields in the resource provider object in the future.
14:36:04 <mriedem> 3. conductor doesn't pass anything to compute during build; the compute will look for allocations and if they exist, it doesn't do a local claim, otherwise it attempts to (for a release or two)
14:36:38 <jaypipes> mriedem: yes, ++
14:37:02 <mriedem> 4. when we hit some crazy edge case we didn't anticipate with ironic, we'll hack a fix as usual
14:37:12 <jaypipes> cdent: more than likely we said "well, generation doesn't matter until we're doing claims in the scheduler" and left it out.
14:37:21 <edleafe> Can anyone remind me why we have conductor doing the claim and not the scheduler?
14:37:34 <cdent> mriedem: step 4 is like steps 4-math.inf
14:37:42 <jaypipes> cdent: since right now, the compute node holds a semaphore on the compute node record and therefore there's no danger of concurrent updates.
14:38:00 <cdent> jaypipes: perhaps. I guess we just fix it and that's fine.
14:38:09 <jaypipes> ya
14:38:20 <mriedem> edleafe: i can't remember exactly
14:38:37 <edleafe> It would seem that if there was a race and the claim failed, the scheduler could move to its next choice and try claiming that
14:38:54 <mriedem> edleafe: as could conductor
14:39:13 <mriedem> dansmith: do you remember why we said conductor would create the allocations rather than scheduler?
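
A compressed sketch of the three-step upgrade flow mriedem summarizes above. All of the helpers (`all_computes_upgraded`, `put_allocations`, `cast_build_to_compute`) and the attribute names are hypothetical stand-ins for Nova's service-version and RPC plumbing, not real APIs:

    # Sketch of the upgrade-aware conductor flow summarized above; every
    # helper and attribute name here is a hypothetical stand-in.
    def schedule_and_maybe_claim(context, instance, selected_host,
                                 all_computes_upgraded, put_allocations,
                                 cast_build_to_compute):
        if all_computes_upgraded(context):
            # Step 2: the control plane does the claim by writing allocations
            # against the selected host's resource provider.
            put_allocations(consumer_uuid=instance.uuid,
                            rp_uuid=selected_host.rp_uuid,
                            resources=instance.flavor_resources)
        # Steps 1 and 3: if not all computes are upgraded, nothing is written
        # here, the compute's GET comes back empty, and it runs its old local
        # claim. Either way, no allocation data is passed over RPC.
        cast_build_to_compute(context, instance, selected_host)
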
14:39:34 <dansmith> mriedem: because that's where we do that kind of thing
14:39:36 <edleafe> mriedem: conductor doesn't have the list of filtered/weighed hosts
14:39:44 <mriedem> edleafe: yes it does
14:40:02 <mriedem> but i'll check for sure
14:40:05 <edleafe> mriedem: if the conductor claim fails, it has to re-do the whole scheduler filter/weigh process all over again
14:40:22 <mriedem> https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L914
14:40:33 <dansmith> mriedem: not sure why we'd want the scheduler doing that
14:41:30 <mriedem> edleafe: if i'm reading that correctly, conductor has a list of filtered and weighed hosts
14:41:32 <jaypipes> #link https://bugs.launchpad.net/nova/+bug/1683377
14:41:34 <openstack> Launchpad bug 1683377 in OpenStack Compute (nova) "[placement] PUT /allocations should include resource provider generations" [Medium,Triaged] - Assigned to Jay Pipes (jaypipes)
14:42:17 <jaypipes> mriedem: the reason we said conductor instead of scheduler was because conductor handles retries.
14:42:22 <edleafe> mriedem: no, the scheduler gets that list from placement, runs its additional filters, sorts, and returns its one selected host
14:42:44 <mriedem> edleafe: so you're saying the "hosts" variable here is a single entry? https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L638
14:43:11 <dansmith> mriedem: it is if we have subset =1
14:43:12 <dansmith> er, subset_size=1
14:43:12 <edleafe> mriedem: it can be more than one if we are building several VMs at once
14:43:13 <dansmith> IIRC
14:43:19 <edleafe> mriedem: https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L102
14:44:00 <edleafe> dansmith: isn't it spec_obj.num_instances?
14:44:27 <dansmith> edleafe: I think subset_size also means we get back more than one, right? maybe that is stripped off before that..
14:44:38 <dansmith> edleafe: we need at least num_instances there, yes
14:44:45 <edleafe> dansmith: that's a trick we use to reduce races
14:45:12 <edleafe> we pick a random host from the top N, where N == subset_size
14:45:49 <edleafe> that could go away if the scheduler could claim directly
14:47:03 <mriedem> ok, so i thought conductor had the list of possible hosts to build a single instance, i guess that was wrong
14:47:37 <mriedem> if the claim in conductor fails, is it worth it to redo the host filtering in the scheduler anyway?
14:48:10 <mriedem> it could be expensive if nothing else changed but that one host you hit
14:48:26 <edleafe> mriedem: yeah, it would be expensive
14:48:43 <edleafe> getting the list of hosts is better now that we do some weeding out in the placement call
14:48:55 <edleafe> but it's still wasteful to repeat if not needed
14:49:49 <edleafe> If the scheduler were to make the claim, the flow when a race happens would look like:
14:49:53 <edleafe> Scheduler has a list of acceptable hosts. Makes the claim - if it fails, move on to #2, then #3.
14:50:06 <edleafe> If it succeeds... PROFIT!
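
A minimal sketch of the claim-then-retry loop edleafe describes, assuming a hypothetical `try_claim(host, instance)` helper that PUTs the allocations and returns False on a 409 conflict; this is illustrative, not the proposed implementation:

    # Sketch of claiming against the already filtered/weighed host list,
    # moving to the next host on a conflict; `try_claim` is hypothetical.
    def claim_first_available(weighed_hosts, instance, try_claim):
        for host in weighed_hosts:      # best-first order from the scheduler
            if try_claim(host, instance):
                return host             # claim succeeded; no re-filtering
            # The claim raced with another request on this host; fall through
            # and try the next-best host instead of redoing filter/weigh.
        raise RuntimeError('no host could be claimed for %s' % instance)
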
14:50:20 <mriedem> yes, that's what i originally expected, but i thought conductor could do the same, but i guess not
14:51:24 <edleafe> The original idea was to have the scheduler do it
14:51:37 <edleafe> The spec's commit message even says that
14:51:41 <edleafe> "Proposal to have scheduler posting allocations instead of computes"
14:52:02 <edleafe> Somewhere it got changed, and I couldn't remember the reason
14:52:04 <mriedem> doing claims in the conductor makes things work for the caching scheduler too... but we want to deprecate that after claims in the controller, and even if only the filter scheduler did the claim, the computes would have the fallback
14:52:14 <mriedem> i think the thinking changed during the hangout
14:52:24 <mriedem> when dansmith said it would be conductor making the claim
14:52:33 <mriedem> but as noted, i thought conductor had the list of filtered hosts
14:52:48 <dansmith> edleafe: speaking for me at least, I always meant "claims in the scheduling phase" when we said that
14:52:57 <edleafe> FYI: 8 minutes left
14:53:26 <edleafe> dansmith: ok, but is there a reason why it needs to be in the conductor?
14:53:29 <mriedem> i'd like to not have to get the hosts again and filter them if the claim on the first one fails
14:53:50 <mriedem> especially since that could happen with packing, right?
14:53:55 <dansmith> edleafe: well, I'm not sure it makes as much sense if we made the scheduler a library, but maybe
14:54:11 <mriedem> today we deal with packing conflicts with retries, which we won't have with super-conductor and claims in the scheduler/conductor
14:54:13 <dansmith> edleafe: however, it also puts a blocking http call in the middle of a single-threaded thing
14:54:56 <dansmith> mriedem: the claim would fail because something has changed anyway, right? why wouldn't we reschedule to get a freshened idea of the world?
14:55:17 <mriedem> dansmith: that's what i said above,
14:55:23 <dansmith> yeah
14:55:27 <mriedem> i.e. is it a good idea to redo the scheduling anyway
14:55:34 <jaypipes> mriedem: dansmith is correct. we don't eliminate retries. we eliminate retries *happening from the compute nodes*.
14:55:42 <dansmith> right
14:55:44 <edleafe> dansmith: in a typical race where there are two requests for essentially the same VM, the only thing that changed is that the first one used up some of the host
14:55:45 <dansmith> much faster turnaround
14:56:20 <mriedem> edleafe: but if that host still has some inventory, the 2nd claim might still fit, right?
14:56:32 <mriedem> and if it doesn't, then that host is out of the running for that vm and we need to find another host
14:56:35 <edleafe> Sure, in which case it would succeed
14:56:43 <edleafe> We're talking about where the second one fails
14:56:58 <edleafe> In that case, it's best to pick host #2 from its weighed list
14:56:58 <alex_xu> jaypipes: we will move those retries inside placement in the future, right? when we can claim directly
14:57:35 <jaypipes> alex_xu: possibly, yes. but we're a ways away from the placement service having all the required information in order to make that full determination.
14:57:48 <edleafe> I'm doing a terrible job of time boxing
14:57:57 <edleafe> But this is an important discussion
14:58:12 <edleafe> We can continue it in -nova in 2 minutes
14:58:16 <edleafe> #topic Opens
14:58:19 <mriedem> really, without some PoC code both ways with performance testing of collisions, it's hard to make a decision on this
14:58:29 <edleafe> Anyone want to add something before we adjourn?
14:58:53 <mriedem> e.g. with 1000 nodes,
14:59:05 * edleafe feels bad for anyone who hung around waiting for Open discussion
14:59:06 <mriedem> how long does it take to get the filtered / weighed list of hosts?
14:59:34 <mriedem> because that's how long it's going to take per retry if we fail to claim in conductor
15:00:00 <edleafe> OK, we have to move this to -nova now
15:00:04 <edleafe> #endmeeting
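
As a footnote to mriedem's closing question, a trivial sketch of how that per-retry cost could be measured; `get_filtered_weighed_hosts(spec)` is a hypothetical wrapper around the scheduler's host-selection call, not an existing API:

    # Rough timing harness for the closing question above; the host-selection
    # callable is a hypothetical stand-in.
    import time


    def time_host_selection(get_filtered_weighed_hosts, spec, runs=10):
        samples = []
        for _ in range(runs):
            start = time.monotonic()
            get_filtered_weighed_hosts(spec)
            samples.append(time.monotonic() - start)
        # Each claim-failure retry in conductor would pay roughly this cost.
        return sum(samples) / len(samples)
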