14:00:10 <edleafe> #startmeeting nova_scheduler
14:00:11 <openstack> Meeting started Mon Jan 30 14:00:10 2017 UTC and is due to finish in 60 minutes.  The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:16 <jaypipes> o/
14:00:18 <_gryf> o/
14:00:24 <edleafe> Good UGT morning everyone!
14:00:29 <jroll> \o
14:00:30 <diga> o/
14:00:37 <macsz> \o
14:02:48 <johnthetubaguy> o/
14:03:00 <edleafe> OK, let's get started
14:03:08 <edleafe> #topic Specs & Reviews
14:03:17 <edleafe> The first item is: Prevent compute crash on discovery failure https://review.openstack.org/#/c/422780
14:03:25 <edleafe> But that's already merged, so...
14:03:42 <_gryf> oh
14:03:53 <_gryf> it was put on scheduler meeting last time
14:03:58 <_gryf> *on agenda
14:04:06 <edleafe> let's discuss the patch for updating the resource tracker for Ironic
14:04:09 <edleafe> #link https://review.openstack.org/#/c/404472/
14:04:27 <edleafe> _gryf: no worries
14:04:28 <bauzas> \o
14:04:41 <edleafe> Here is the basic problem we found:
14:04:45 <bauzas> I just posted a question about ^
14:04:58 <bauzas> about why we stop providing DB modifications
14:05:05 <edleafe> This patch changes how the RT reports Ironic inventory
14:05:12 <jaypipes> the flavor doesn't contain the node.resource_class and therefore NoValidHosts is returned for any request?
14:05:36 <cdent> o/ sorry for late, stuck in traffic
14:05:36 <edleafe> We stop reporting the old VCPU-style values, and instead report 1 of the particular class of Ironic hardware
14:05:43 <jroll> bauzas: ++
14:06:04 <edleafe> But the placement API cannot select based on custom resource classes (yet)
14:06:32 <jroll> yeah, that's problematic
14:06:36 <edleafe> So once an ironic node is reporting new-style inventory, it cannot be selected. It's essentially invisible to placement
14:06:43 <jaypipes> edleafe: well, there's nothing about custom resource classes that the placement API cannot select on. It's just that there is no mapping between flavor and ironic custom resource class yet.
14:07:03 <edleafe> jaypipes: yes, that's another way to say the same thing
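[Editor's note: for readers following along, a minimal sketch of the two inventory shapes under discussion. The field names follow the placement inventory records; the CUSTOM_BAREMETAL_GOLD class name and the node sizes are made-up examples.]

    # Old style: the Ironic node reported as if it were a hypervisor,
    # which is what flavors could actually match against in Ocata.
    old_inventory = {
        'VCPU': {'total': 8, 'reserved': 0, 'min_unit': 1, 'max_unit': 8,
                 'step_size': 1, 'allocation_ratio': 1.0},
        'MEMORY_MB': {'total': 32768, 'reserved': 0, 'min_unit': 1,
                      'max_unit': 32768, 'step_size': 1,
                      'allocation_ratio': 1.0},
        'DISK_GB': {'total': 500, 'reserved': 0, 'min_unit': 1,
                    'max_unit': 500, 'step_size': 1,
                    'allocation_ratio': 1.0},
    }

    # New style: exactly one unit of a custom class derived from
    # node.resource_class. With no flavor able to request this class
    # yet, a node reporting only this is invisible to scheduling.
    new_inventory = {
        'CUSTOM_BAREMETAL_GOLD': {'total': 1, 'reserved': 0, 'min_unit': 1,
                                  'max_unit': 1, 'step_size': 1,
                                  'allocation_ratio': 1.0},
    }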
14:07:06 <bauzas> lemme clarify my question
14:07:16 <bauzas> if we stop reporting old-way for new Ironic nodes
14:07:35 <jaypipes> edleafe: cdent told me about the idea to just have BOTH the new custom resource class AND the old VCPU/MEMORY_MB/DISK_GB inventory records for a time period. I think that's a good idea.
14:07:40 <bauzas> then how could we possibly have ComputeCapabilitiesFilter using the HostState ?
14:07:57 <bauzas> which is one of the main filters ironic operators use
14:08:03 <jroll> jaypipes: that's what I've been saying since the beginning, and I thought that's what this patch did in the beginning
14:08:12 <bauzas> (at least until we have traits)
14:08:16 <edleafe> jaypipes: that was our temporary fix for now
14:08:25 <jaypipes> bauzas: totally different. one is a qualitative filter (ComputeCapabilitiesFilter) and the other is a quantitative filter.
14:08:40 <bauzas> jaypipes: fair to me
14:08:53 <jaypipes> jroll: yes, I know you've been saying from beginning :( it slipped my mind. :(
14:08:54 <bauzas> jaypipes: but ComputeCapabilitiesFilter uses the HostState to know capabilities
14:08:57 <johnthetubaguy> so there are two things here (1) keep current stuff working (2) replace all existing features
14:09:10 <johnthetubaguy> I think (1) is the first step, which the report old and new helps with
14:09:19 <bauzas> johnthetubaguy: +1
14:09:40 <jroll> ++
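[Editor's note: a rough sketch of the transitional dual reporting being agreed on here. The helper name, node fields, and the class-name normalization are hypothetical, not the actual resource tracker code.]

    def build_transitional_inventory(node):
        # Keep the legacy classes so existing scheduling keeps working
        # in Ocata.
        inventory = {
            'VCPU': {'total': node['cpus'], 'reserved': 0, 'min_unit': 1,
                     'max_unit': node['cpus'], 'step_size': 1,
                     'allocation_ratio': 1.0},
            'MEMORY_MB': {'total': node['memory_mb'], 'reserved': 0,
                          'min_unit': 1, 'max_unit': node['memory_mb'],
                          'step_size': 1, 'allocation_ratio': 1.0},
            'DISK_GB': {'total': node['local_gb'], 'reserved': 0,
                        'min_unit': 1, 'max_unit': node['local_gb'],
                        'step_size': 1, 'allocation_ratio': 1.0},
        }
        # Additionally (never instead), report one unit of the custom
        # class when the operator has set node.resource_class.
        if node.get('resource_class'):
            name = 'CUSTOM_' + node['resource_class'].upper().replace('-', '_')
            inventory[name] = {'total': 1, 'reserved': 0, 'min_unit': 1,
                               'max_unit': 1, 'step_size': 1,
                               'allocation_ratio': 1.0}
        return inventory

    # e.g. build_transitional_inventory(
    #     {'cpus': 8, 'memory_mb': 32768, 'local_gb': 500,
    #      'resource_class': 'baremetal-gold'})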
14:09:43 <bauzas> I think we should still provide the current DB modifications even if we have new Ironic nodes
14:10:03 <johnthetubaguy> bauzas: that's the current path, I believe
14:10:07 <jaypipes> bauzas: I don't know what you mean by "current DB modifications"?
14:10:13 <jroll> bauzas: yes, we have to, we can't schedule on the new-style stuff
14:10:41 <jroll> jaypipes: compute_nodes table is how I understood that
14:10:49 <jaypipes> ah
14:10:52 <johnthetubaguy> existing compute_node.save()?
14:10:56 <bauzas> johnthetubaguy: unless I missed that, I think we stop using it
14:11:07 <bauzas> https://review.openstack.org/#/c/404472/26/nova/compute/resource_tracker.py@540
14:11:29 <johnthetubaguy> bauzas: good point, that is bypassed, I missed that
14:11:51 <bauzas> ok, if that's just a mistake, I'm fine then
14:11:57 <jaypipes> bauzas: right, I believe the suggestion was to continue calling update_resource_stats() *in addition to* doing the new custom resource class inventory. correct, johnthetubaguy, edleafe and cdent?
14:12:13 <bauzas> I thought it was rather a design point
14:12:32 <bauzas> saying that we should stop reporting the old way for new nodes, which I disagree with
14:12:52 <johnthetubaguy> we need all the old things to keep happening
14:12:59 <johnthetubaguy> we can decide what that looks like in the code
14:13:17 <bauzas> sure my question wasn't an implementation detail
14:13:20 <edleafe> We *eventually* want to stop reporting the old way
14:13:25 <edleafe> We just aren't there yet
14:13:31 <bauzas> rather a discussion to make sure we all agree that we need to support old-way
14:13:45 <bauzas> okay, seems to me we're all violently agreeing
14:13:46 <jroll> +1 for both, that's what we were doing way back in patchset 4, not sure where it got lost
14:13:53 <jroll> bauzas: +1
14:13:55 <johnthetubaguy> bauzas has spotted a bigger issue, which I think jaypipes is touching on: we still need to call compute_node.save()
14:13:58 <johnthetubaguy> bauzas: yeah
14:14:00 <bauzas> it's just an implementation mistake, period.
14:14:10 <jaypipes> gotcha
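[Editor's note: the fix being agreed on, in code form: the patch linked above bypassed the legacy compute node update, and the plan is to restore it. The method and attribute names below are approximate stand-ins, not the real ResourceTracker internals.]

    def _update_ironic_node(self, compute_node, node):
        # Keep the existing compute_nodes table write and legacy stats
        # reporting, so nothing that reads HostState breaks.
        compute_node.save()
        self.scheduler_client.update_resource_stats(compute_node)
        # Do the new-style placement inventory reporting in addition
        # to the above, never instead of it.
        self._report_placement_inventory(node)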
14:14:54 <bauzas> that said
14:15:01 <bauzas> it raises the question for me of how we cover that
14:15:04 <bauzas> I mean
14:15:16 <bauzas> it seems to me Jenkins was fine with that
14:15:31 <johnthetubaguy> so there is another piece, and thats what do we *need* in ocata
14:15:36 <bauzas> but, if we were merging it, then it could be a problem with operators
14:15:47 <bauzas> I just want to make sure we are testing our path
14:16:01 <bauzas> johnthetubaguy: right
14:16:05 <johnthetubaguy> bauzas: jenkins failed on the last version, not sure about this updated one
14:16:06 <bauzas> and RC1 is in 3 days
14:16:45 <johnthetubaguy> so in pike, if we fixed the whole world... it would be good if the new resources were already there
14:16:53 <bauzas> yeah
14:17:01 <bauzas> because it's also a compute modification
14:17:09 <johnthetubaguy> but don't we also need the instance claims to be present for the new resources, else we still can't schedule on just the new resources?
14:17:19 <johnthetubaguy> s/claims/allocations/
14:17:27 <bauzas> meaning that we need to also support N-1 computes not reporting ironic resources new-way
14:17:47 <bauzas> which would defer the placement API being able to schedule something for Ironic until Queens
14:17:50 <jroll> bauzas: remember current CI job does not set the resource_class on ironic nodes, so doesn't hit the new code path, that's what that test patch and my WIP experimental job are for
14:17:54 <jroll> and yep, +1
14:18:26 <jroll> (also because the flavors for this won't work until pike)
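[Editor's note: the flavor-side gap jroll mentions is that there was not yet a way for a flavor to request a custom resource class. The resources:* extra-spec convention sketched below is roughly what later work settled on, shown here only to illustrate the missing piece.]

    # Hypothetical Pike-era baremetal flavor extra specs: request one
    # unit of the custom class and explicitly zero out the legacy ones.
    baremetal_flavor_extra_specs = {
        'resources:CUSTOM_BAREMETAL_GOLD': '1',
        'resources:VCPU': '0',
        'resources:MEMORY_MB': '0',
        'resources:DISK_GB': '0',
    }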
14:18:30 <edleafe> So is there any point of disagreement about the problem?
14:18:44 <bauzas> so here I'm thinking of some way to not change everything, but just report what's needed
14:18:45 <edleafe> If not, we can start discussing the path forward to fix it
14:18:51 <johnthetubaguy> well, if new resource present means new allocations are present, we might be able to transition inside one release
14:18:56 <bauzas> a very small and non-invasive patch that wouldn't stop sending things
14:19:02 <johnthetubaguy> if we report things incorrectly today, that will make it worse
14:19:09 <bauzas> right
14:19:34 <bauzas> that said, we can still fix things in a point release
14:19:47 <bauzas> and ask operators to deploy the latest point release on computes before upgrading
14:20:24 <bauzas> here, I want to begin reporting Ironic things in Ocata, without really changing too much
14:20:36 <johnthetubaguy> so... my question is really about this bit: https://github.com/openstack/nova/blob/master/nova/scheduler/client/report.py#L128
14:20:59 <johnthetubaguy> it feels like for ironic, what we want to do, is claim all resources for the chosen node, regardless of what the flavor says
14:21:13 <bauzas> johnthetubaguy: FWIW, there is a bug with https://github.com/openstack/nova/blob/master/nova/scheduler/client/report.py#L138
14:21:20 <bauzas> because swap is miscounted
14:21:28 <bauzas> but meh
14:21:28 <jaypipes> johnthetubaguy, bauzas: hold up, I think you're overthinking this.
14:21:42 <johnthetubaguy> quite possibly
14:21:49 * bauzas is my mark of fabric
14:22:19 <bauzas> I mean my trademark
14:22:42 <jaypipes> bauzas, johnthetubaguy: if we simply keep the call to update_resource_stats() and in addition to that we just add allocation record for the custom resource class (if present in ironic) then we should be fine.
14:22:55 <jaypipes> sec, finding link.
14:23:28 <bauzas> jaypipes: that's what I'm asking for :)
14:23:31 <jaypipes> line 695 here: https://review.openstack.org/#/c/404472/26/nova/compute/resource_tracker.py
14:23:34 <johnthetubaguy> but, when we come to try and place things next cycle, all the ironic nodes will be showing free resources
14:23:56 <jaypipes> bauzas, johnthetubaguy: so instead of deleting old allocations, we simply create the provider allocations if ironic.node_class is present.
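[Editor's note: a sketch of the allocation handling jaypipes describes, alongside the swap point bauzas raised above. All names here are illustrative, not the real report.py code.]

    from types import SimpleNamespace

    def allocations_for_instance(instance, node_resource_class=None):
        alloc = {
            'VCPU': instance.vcpus,
            'MEMORY_MB': instance.memory_mb,
            # the bug bauzas mentions: flavor swap is in MB, so adding
            # it straight into a GB total over-counts disk
            'DISK_GB': instance.root_gb + instance.ephemeral_gb,
        }
        # Create, rather than substitute for, the custom-class
        # allocation when the ironic node has a resource class set.
        if node_resource_class:
            name = 'CUSTOM_' + node_resource_class.upper().replace('-', '_')
            alloc[name] = 1
        return alloc

    inst = SimpleNamespace(vcpus=8, memory_mb=32768, root_gb=500,
                           ephemeral_gb=0)
    allocations_for_instance(inst, 'baremetal-gold')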
14:23:56 <bauzas> I just want the less invasive thing that would just start reporting things in Ocata
14:24:39 <bauzas> I need to disappear in literally 2 mins :(
14:25:12 <jaypipes> how about I just fix this up and push a change.
14:25:14 <johnthetubaguy> jaypipes: we totally can't delete old allocations, we are agreed there
14:25:15 <jaypipes> gimme about an hour.
14:25:23 <johnthetubaguy> so there is still a problem here, for next cycle
14:25:38 <johnthetubaguy> although not sure its as bad as I first thought
14:25:46 <jaypipes> johnthetubaguy: if we can get allocations and inventory (both old and new) being written in Ocata for Ironic, I'd be happy.
14:26:05 <johnthetubaguy> yeah, me too
14:26:11 <edleafe> me three
14:26:13 <jaypipes> ok, gimme an hour.
14:26:13 <jroll> +1
14:26:16 <johnthetubaguy> if we do allocations and inventory
14:26:17 * jaypipes codes
14:26:28 <jaypipes> gotta love PTO.
14:26:42 <edleafe> jaypipes: you could have timed it better
14:26:44 <johnthetubaguy> can't do half of it, that's my main concern
14:26:48 <jaypipes> heh
14:27:00 <bauzas> I need to bail out, folks \o
14:27:12 <jaypipes> ciao
14:27:21 <edleafe> thanks bauzas
14:27:46 <bauzas> jaypipes: ping me once you're done, it's now my top-prio review patch
14:27:53 <jaypipes> bauzas: will do.
14:28:21 <edleafe> Looks like we have a plan
14:28:56 <edleafe> One other thing that I'd like to mention is that we could have avoided a lot of this if we had a true functional test in place ahead of time
14:29:25 <edleafe> IOW, something that would create an ironic node, have it report resources, and then have it selected by the scheduler
14:29:43 <edleafe> At the very least it would have identified the holes that needed filling in
14:29:56 <edleafe> such as the flavor extra-specs stuff
14:30:11 <cdent> +many
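[Editor's note: the kind of end-to-end functional test edleafe describes might look something like the sketch below; every class and helper here is a made-up stand-in for real test fixtures.]

    class IronicSchedulingTest(FunctionalTestBase):
        def test_new_style_ironic_node_is_schedulable(self):
            # Create a fake ironic node with a custom resource class,
            # let the resource tracker report its inventory, then check
            # that the scheduler can actually pick it.
            node = self.create_ironic_node(resource_class='baremetal-gold')
            self.run_resource_tracker_periodic()
            flavor = self.create_flavor_for_class('baremetal-gold')
            server = self.boot_server(flavor)
            self.assertEqual(node.uuid, self.get_host_of(server))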
14:32:20 <edleafe> Let's move on
14:32:24 <edleafe> #topic Bugs
14:32:28 <edleafe> Nothing on the agenda
14:32:40 <cdent> I got a couple of bug related fixes that would be nice to get in
14:32:40 <edleafe> Anyone have anything to point out, bug-wise?
14:32:49 <cdent> https://review.openstack.org/#/c/414230/
14:32:56 <cdent> (and the one above it)
14:33:04 <cdent> nothing super serious but useful for debugging
14:34:05 <edleafe> Yes, let's get that in.
14:34:08 <edleafe> Anything else
14:34:10 <edleafe> ?
14:34:19 <cdent> don't think so
14:34:37 <edleafe> Seems like everyone left after the big discussion :)
14:34:42 <edleafe> #topic Opens
14:34:54 <edleafe> So.... what's on your mind?
14:34:57 <cdent> couple things
14:34:57 <edleafe> :)
14:35:14 <cdent> CORS didn't make it into the eyes of cores: https://review.openstack.org/#/c/392891/
14:35:28 <cdent> but we agreed (in various places) that we would prefer to have it in ocata
14:35:32 <cdent> but I guess it's too late now
14:35:55 <cdent> the other thing is docs: I'm going to need help on the api-ref: https://review.openstack.org/#/c/409340/
14:36:11 <cdent> not just writing the docs but also creating a gate job to draft and publish the docs
14:36:52 <edleafe> #link https://review.openstack.org/#/c/392891/
14:36:55 * cdent watches everyone leaping at once
14:36:56 <edleafe> #link https://review.openstack.org/#/c/409340/
14:37:45 <edleafe> cdent: Of course - everyone loves working on doc infrastructure!
14:38:25 <cdent> ikr
14:38:55 <edleafe> OK, then I think since everyone's gone, let's get back to work
14:38:58 <edleafe> #endmeeting