14:00:10 <edleafe> #startmeeting nova_scheduler
14:00:11 <openstack> Meeting started Mon Jan 30 14:00:10 2017 UTC and is due to finish in 60 minutes. The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:16 <jaypipes> o/
14:00:18 <_gryf> o/
14:00:24 <edleafe> Good UGT morning everyone!
14:00:29 <jroll> \o
14:00:30 <diga> o/
14:00:37 <macsz> \o
14:02:48 <johnthetubaguy> o/
14:03:00 <edleafe> OK, let's get started
14:03:08 <edleafe> #topic Specs & Reviews
14:03:17 <edleafe> The first item is: Prevent compute crash on discovery failure https://review.openstack.org/#/c/422780
14:03:25 <edleafe> But that's already merged, so...
14:03:42 <_gryf> oh
14:03:53 <_gryf> it was put on scheduler meeting last time
14:03:58 <_gryf> *on agenda
14:04:06 <edleafe> let's discuss the patch for updating the resource tracker for Ironic
14:04:09 <edleafe> #link https://review.openstack.org/#/c/404472/
14:04:27 <edleafe> _gryf: no worries
14:04:28 <bauzas> \o
14:04:41 <edleafe> Here is the basic problem we found:
14:04:45 <bauzas> I just posted a question about ^
14:04:58 <bauzas> about why we stop providing DB modifications
14:05:05 <edleafe> This patch changes how the RT reports Ironic inventory
14:05:12 <jaypipes> the flavor doesn't contain the node.resource_class and therefore NoValidHosts is returned for any request?
14:05:36 <cdent> o/ sorry for being late, stuck in traffic
14:05:36 <edleafe> We stop reporting the old VCPU-style values, and instead report 1 of the particular class of Ironic hardware
14:05:43 <jroll> bauzas: ++
14:06:04 <edleafe> But the placement API cannot select based on custom resource classes (yet)
14:06:32 <jroll> yeah, that's problematic
14:06:36 <edleafe> So once an ironic node is reporting new-style inventory, it cannot be selected. It's essentially invisible to placement
14:06:43 <jaypipes> edleafe: well, there's nothing about custom resource classes that the placement API cannot select on. It's just that there is no mapping between flavor and ironic custom resource class yet.
14:07:03 <edleafe> jaypipes: yes, that's another way to say the same thing
14:07:06 <bauzas> lemme clarify my question
14:07:16 <bauzas> if we stop reporting the old way for new Ironic nodes
14:07:35 <jaypipes> edleafe: cdent told me about the idea to just have BOTH the new custom resource class AND the old VCPU/MEMORY_MB/DISK_GB inventory records for a time period. I think that's a good idea.
14:07:40 <bauzas> then how could we possibly have ComputeCapabilitiesFilter using the HostState ?
14:07:57 <bauzas> which is one of the main filters ironic operators use
14:08:03 <jroll> jaypipes: that's what I've been saying since the beginning, and I thought that's what this patch did in the beginning
14:08:12 <bauzas> (at least until we have traits)
14:08:16 <edleafe> jaypipes: that was our temporary fix for now
14:08:25 <jaypipes> bauzas: totally different. one is a qualitative filter (ComputeCapabilitiesFilter) and the other is a quantitative filter.
14:08:40 <bauzas> jaypipes: fair to me
14:08:53 <jaypipes> jroll: yes, I know you've been saying that from the beginning :( it slipped my mind. :(
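[Editor's sketch, for context on the inventory change under discussion: the two reporting shapes being compared. The dict layout mirrors the placement API inventory format; the node sizes and the CUSTOM_BAREMETAL_GOLD class name are assumed example values, not values from the patch.]

```python
# Illustrative only: hypothetical inventory payloads for one Ironic node.

# Old style: the node reported as if it were a VM host, so existing
# flavors (here 8 vCPUs, 16 GB RAM, 500 GB disk) can match it.
old_inventory = {
    'VCPU': {'total': 8, 'reserved': 0, 'min_unit': 1, 'max_unit': 8,
             'step_size': 1, 'allocation_ratio': 1.0},
    'MEMORY_MB': {'total': 16384, 'reserved': 0, 'min_unit': 1,
                  'max_unit': 16384, 'step_size': 1, 'allocation_ratio': 1.0},
    'DISK_GB': {'total': 500, 'reserved': 0, 'min_unit': 1, 'max_unit': 500,
                'step_size': 1, 'allocation_ratio': 1.0},
}

# New style: exactly one unit of the node's custom resource class
# (ironic's node.resource_class, normalized to a CUSTOM_ name). Until a
# flavor can request this class, a node reporting only this inventory
# cannot be selected: the "invisible to placement" problem above.
new_inventory = {
    'CUSTOM_BAREMETAL_GOLD': {'total': 1, 'reserved': 0, 'min_unit': 1,
                              'max_unit': 1, 'step_size': 1,
                              'allocation_ratio': 1.0},
}
```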
14:08:54 <bauzas> jaypipes: but ComputeCapabilitiesFilter uses the HostState to know capabilities
14:08:57 <johnthetubaguy> so there are two things here: (1) keep current stuff working (2) replace all existing features
14:09:10 <johnthetubaguy> I think (1) is the first step, which reporting old and new helps with
14:09:19 <bauzas> johnthetubaguy: +1
14:09:40 <jroll> ++
14:09:43 <bauzas> I think we should still provide the current DB modifications even if we have new Ironic nodes
14:10:03 <johnthetubaguy> bauzas: that's the current path, I believe
14:10:07 <jaypipes> bauzas: I don't know what you mean by "current DB modifications"?
14:10:13 <jroll> bauzas: yes, we have to, we can't schedule on the new-style stuff
14:10:41 <jroll> jaypipes: the compute_nodes table is how I understood that
14:10:49 <jaypipes> ah
14:10:52 <johnthetubaguy> the existing compute_node.save()?
14:10:56 <bauzas> johnthetubaguy: unless I missed that, I think we stop using it
14:11:07 <bauzas> https://review.openstack.org/#/c/404472/26/nova/compute/resource_tracker.py@540
14:11:18 <johnthetubaguy> bauzas: g
14:11:29 <johnthetubaguy> bauzas: good point, that is bypassed, I missed that
14:11:51 <bauzas> ok, if that's just a mistake, I'm fine then
14:11:57 <jaypipes> bauzas: right, I believe the suggestion was to continue calling update_resource_stats() *in addition to* doing the new custom resource class inventory. correct, johnthetubaguy, edleafe and cdent?
14:12:13 <bauzas> I thought it was rather a design point
14:12:32 <bauzas> saying that we should stop reporting the old way for new nodes, which I disagree with
14:12:52 <johnthetubaguy> we need all the old things to keep happening
14:12:59 <johnthetubaguy> we can decide what that looks like in the code
14:13:17 <bauzas> sure, my question wasn't an implementation detail
14:13:20 <edleafe> We *eventually* want to stop reporting the old way
14:13:25 <edleafe> We just aren't there yet
14:13:31 <bauzas> rather a discussion to make sure we all agree that we need to support the old way
14:13:45 <bauzas> okay, seems to me we're all violently agreeing
14:13:46 <jroll> +1 for both, that's what we were doing way back in patchset 4, not sure where it got lost
14:13:53 <jroll> bauzas: +1
14:13:55 <johnthetubaguy> bauzas has spotted a bigger issue, which I think jaypipes is touching on: we still need to call compute_node.save()
14:13:58 <johnthetubaguy> bauzas: yeah
14:14:00 <bauzas> it's just an implementation mistake, period.
14:14:10 <jaypipes> gotcha
14:14:54 <bauzas> that said
14:15:01 <bauzas> it raises the question of how we test that
14:15:04 <bauzas> I mean
14:15:16 <bauzas> it seems to me Jenkins was fine with that
14:15:31 <johnthetubaguy> so there is another piece, and that's what we *need* in ocata
14:15:36 <bauzas> but, if we had merged it, it could have been a problem for operators
14:15:47 <bauzas> I just want to make sure we are testing our path
14:16:01 <bauzas> johnthetubaguy: right
14:16:05 <johnthetubaguy> bauzas: jenkins failed on the last version, not sure about this updated one
14:16:06 <bauzas> and RC1 is in 3 days
14:16:45 <johnthetubaguy> so in pike, if we fixed the whole world... it would be good if the new resources were already there
14:16:53 <bauzas> yeah
14:17:01 <bauzas> because it's also a compute modification
14:17:09 <johnthetubaguy> but don't we also need the instance claims to be present for the new resources, else we still can't schedule on just the new resources?
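[Editor's sketch of the "report both" plan the group converges on here. Only compute_node.save() and update_resource_stats() are real calls named in the discussion; the function and the _make_custom_inventory/_report_inventory helpers are hypothetical, not the actual resource tracker code.]

```python
def update_ironic_node(scheduler_client, compute_node, node):
    """Hedged sketch: report old-style and new-style data for one node."""
    # (1) Keep the existing path: persist the compute_nodes record and
    # report old-style VCPU/MEMORY_MB/DISK_GB inventory, so filters that
    # read HostState (e.g. ComputeCapabilitiesFilter) and N-1 computes
    # keep working. This is the bit the patch accidentally bypassed.
    compute_node.save()
    scheduler_client.update_resource_stats(compute_node)

    # (2) Additionally report the custom resource class inventory, so
    # the data is already in placement once Pike can schedule on it.
    # Both helpers below are assumed names, for illustration only.
    if node.resource_class:
        inventory = scheduler_client._make_custom_inventory(
            node.resource_class)
        scheduler_client._report_inventory(compute_node.uuid, inventory)
```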
14:17:19 <johnthetubaguy> s/claims/allocations/
14:17:27 <bauzas> meaning that we need to also support N-1 computes not reporting ironic resources the new way
14:17:47 <bauzas> which would defer the placement API being able to schedule something for Ironic until Queens
14:17:50 <jroll> bauzas: remember the current CI job does not set the resource_class on ironic nodes, so it doesn't hit the new code path; that's what that test patch and my WIP experimental job are for
14:17:54 <jroll> and yep, +1
14:18:26 <jroll> (also because the flavors for this won't work until pike)
14:18:30 <edleafe> So is there any point of disagreement about the problem?
14:18:44 <bauzas> so here I'm thinking of some way to not change everything, but just report what's needed
14:18:45 <edleafe> If not, we can start discussing the path forward to fix it
14:18:51 <johnthetubaguy> well, if new resources being present means new allocations are present, we might be able to transition inside one release
14:18:56 <bauzas> a very small and non-invasive patch that would just start sending things
14:19:02 <johnthetubaguy> if we report things incorrectly today, that will make it worse
14:19:09 <bauzas> right
14:19:34 <bauzas> that said, we can still fix things in a point release
14:19:47 <bauzas> and ask operators to deploy the latest point release on computes before upgrading
14:20:24 <bauzas> here, I want to begin reporting Ironic things in Ocata, without really changing too much
14:20:36 <johnthetubaguy> so... my question is really about this bit: https://github.com/openstack/nova/blob/master/nova/scheduler/client/report.py#L128
14:20:59 <johnthetubaguy> it feels like for ironic, what we want to do is claim all resources for the chosen node, regardless of what the flavor says
14:21:13 <bauzas> johnthetubaguy: FWIW, there is a bug with https://github.com/openstack/nova/blob/master/nova/scheduler/client/report.py#L138
14:21:20 <bauzas> because swap is miscounted
14:21:28 <bauzas> but meh
14:21:28 <jaypipes> johnthetubaguy, bauzas: hold up, I think you're overthinking this.
14:21:42 <johnthetubaguy> quite possibly
14:21:49 * bauzas is my mark of fabric
14:22:19 <bauzas> I mean my trademark
14:22:42 <jaypipes> bauzas, johnthetubaguy: if we simply keep the call to update_resource_stats() and in addition to that we just add an allocation record for the custom resource class (if present in ironic) then we should be fine.
14:22:55 <jaypipes> sec, finding link.
14:23:28 <bauzas> jaypipes: that's what I'm asking for :)
14:23:31 <jaypipes> line 695 here: https://review.openstack.org/#/c/404472/26/nova/compute/resource_tracker.py
14:23:34 <johnthetubaguy> but, when we come to try and place things next cycle, all the ironic nodes will be showing free resources
14:23:56 <jaypipes> bauzas, johnthetubaguy: so instead of deleting old allocations, we simply create the provider allocations if ironic.node_class is present.
14:23:56 <bauzas> I just want the least invasive thing that would just start reporting things in Ocata
14:24:39 <bauzas> I need to disappear in literally 2 mins :(
14:25:12 <jaypipes> how about I just fix this up and push a change.
14:25:14 <johnthetubaguy> jaypipes: we totally can't delete old allocations, we are agreed there
14:25:15 <jaypipes> gimme about an hour.
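[Editor's sketch of the allocation side of jaypipes's plan: keep the old-style allocations derived from the flavor and add a record for the custom class rather than deleting anything. The structure mirrors the placement allocations format; the class name and the swap handling are assumptions for illustration, not the merged code.]

```python
from collections import namedtuple

# Stand-in for the Nova flavor fields used below; example values only.
Flavor = namedtuple('Flavor', 'vcpus memory_mb root_gb ephemeral_gb swap')
flavor = Flavor(vcpus=8, memory_mb=16384, root_gb=500, ephemeral_gb=0, swap=0)

# Hypothetical combined allocation body for one Ironic instance.
allocations = {
    'resources': {
        # Old-style quantities stay, so existing accounting keeps working.
        'VCPU': flavor.vcpus,
        'MEMORY_MB': flavor.memory_mb,
        # bauzas's bug note: flavor.swap is in MB, so it must be
        # converted before being folded into a GB total, or it is
        # miscounted.
        'DISK_GB': (flavor.root_gb + flavor.ephemeral_gb
                    + flavor.swap // 1024),
        # New-style: also claim the whole node via its custom class.
        'CUSTOM_BAREMETAL_GOLD': 1,  # assumed example class name
    },
}
```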
14:25:23 <johnthetubaguy> so there is still a problem here, for next cycle
14:25:38 <johnthetubaguy> although not sure it's as bad as I first thought
14:25:46 <jaypipes> johnthetubaguy: if we can get allocations and inventory (both old and new) being written in Ocata for Ironic, I'd be happy.
14:26:05 <johnthetubaguy> yeah, me too
14:26:11 <edleafe> me three
14:26:13 <jaypipes> ok, gimme an hour.
14:26:13 <jroll> +1
14:26:16 <johnthetubaguy> if we do allocations and inventory
14:26:17 * jaypipes codes
14:26:28 <jaypipes> gotta love PTO.
14:26:42 <edleafe> jaypipes: you could have timed it better
14:26:44 <johnthetubaguy> can't do half of it, that's my main concern
14:26:48 <jaypipes> heh
14:27:00 <bauzas> I need to bail out, folks \o
14:27:12 <jaypipes> ciao
14:27:21 <edleafe> thanks bauzas
14:27:46 <bauzas> jaypipes: ping me once you're done, it's now my top-prio review patch
14:27:53 <jaypipes> bauzas: will do.
14:28:21 <edleafe> Looks like we have a plan
14:28:56 <edleafe> One other thing that I'd like to mention is that we could have avoided a lot of this if we had a true functional test in place ahead of time
14:29:25 <edleafe> IOW, something that would create an ironic node, have it report resources, and then have it selected by the scheduler
14:29:43 <edleafe> At the very least it would have identified the holes that needed filling in
14:29:56 <edleafe> such as the flavor extra-specs stuff
14:30:11 <cdent> +many
14:32:20 <edleafe> Let's move on
14:32:24 <edleafe> #topic Bugs
14:32:28 <edleafe> Nothing on the agenda
14:32:40 <cdent> I've got a couple of bug-related fixes that would be nice to get in
14:32:40 <edleafe> Anyone have anything to point out, bug-wise?
14:32:49 <cdent> https://review.openstack.org/#/c/414230/
14:32:56 <cdent> (and the one above it)
14:33:04 <cdent> nothing super serious but useful for debugging
14:34:05 <edleafe> Yes, let's get that in.
14:34:08 <edleafe> Anything else
14:34:10 <edleafe> ?
14:34:19 <cdent> don't think so
14:34:37 <edleafe> Seems like everyone left after the big discussion :)
14:34:42 <edleafe> #topic Opens
14:34:54 <edleafe> So.... what's on your mind?
14:34:57 <cdent> couple things
14:34:57 <edleafe> :)
14:35:14 <cdent> cors didn't make it into ocata: https://review.openstack.org/#/c/392891/
14:35:28 <cdent> but we agreed (in various places) that we would prefer to have it in ocata
14:35:32 <cdent> but I guess it's too late now
14:35:55 <cdent> the other thing is docs: I'm going to need help on the api-ref: https://review.openstack.org/#/c/409340/
14:36:11 <cdent> not just writing the docs but also creating a gate job to draft and publish the docs
14:36:52 <edleafe> #link https://review.openstack.org/#/c/392891/
14:36:55 * cdent watches everyone leaping at once
14:36:56 <edleafe> #link https://review.openstack.org/#/c/409340/
14:37:45 <edleafe> cdent: Of course - everyone loves working on doc infrastructure!
14:38:25 <cdent> ikr
14:38:55 <edleafe> OK, then I think since everyone's gone, let's get back to work
14:38:58 <edleafe> #endmeeting
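[Editor's sketch, on the "flavor extra-specs stuff" edleafe lists as a missing hole: the gap is the mapping from a flavor to the node's custom resource class. The resources:-prefixed extra-spec convention below is an assumed illustration of what that mapping could look like; nothing like it exists in Ocata.]

```python
def resources_from_flavor(extra_specs, vcpus, memory_mb, root_gb,
                          ephemeral_gb):
    """Sketch of the missing flavor-to-resource-class mapping."""
    resources = {}
    for key, value in extra_specs.items():
        # 'resources:' is an assumed prefix convention, for illustration.
        if key.startswith('resources:'):
            resources[key[len('resources:'):]] = int(value)
    if resources:
        # e.g. {'CUSTOM_BAREMETAL_GOLD': 1}: request the whole node.
        return resources
    # Fall back to old-style quantities derived from the flavor.
    return {'VCPU': vcpus, 'MEMORY_MB': memory_mb,
            'DISK_GB': root_gb + ephemeral_gb}

# Example: a baremetal flavor requesting one unit of its custom class.
print(resources_from_flavor({'resources:CUSTOM_BAREMETAL_GOLD': '1'},
                            vcpus=8, memory_mb=16384, root_gb=500,
                            ephemeral_gb=0))
```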