14:00:11 #startmeeting nova_scheduler
14:00:16 Meeting started Mon Oct 16 14:00:11 2017 UTC and is due to finish in 60 minutes. The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:19 The meeting name has been set to 'nova_scheduler'
14:00:23 o/
14:00:24 Who's around?
14:00:26 o/
14:00:28 o/
14:00:32 o/
14:00:42 efried: is that a man bun on your head?
14:00:48 Aw hail no.
14:01:51 #topic Specs
14:01:59 #link Granular Resource Request Syntax https://review.openstack.org/510244
14:02:02 Still some debate on this. efried, want to comment?
14:02:21 I don't think there's a lot of debate, really.
14:02:29 * alex_xu waves late
14:02:39 the solution appears to be agreed upon, what's not clear is whether the problem is.
14:02:53 A couple of reviewers had a question about whether numeric suffixes are The Right Thing or if we should do something fancier.
14:03:10 Me, I'm not sure if anything fancier adds any value.
14:03:16 ditto
14:03:41 cdent: You're saying the problem is not well defined?
14:04:22 efried: no you did a good job, it just felt like a lot of the commentary was on the details of the explanation, rather than on the actual solution
14:05:19 Yeah, I get myself in trouble expressing things other than in real python or API syntax.
14:05:27 efried: in a roundabout way I'm saying: we can probably move this one along as the solution applies regardless of the details
14:05:45 Perhaps getting a few more eyes on that will help speed things up
14:06:04 Okay. I posted a new rev half an hour ago addressing comments from stephenfin, dansmith, and alex_xu.
14:06:17 I only had one other spec on the agenda:
14:06:17 #link Add spec for symmetric GET and PUT of allocations https://review.openstack.org/#/c/508164/
14:06:20 Need some +2s on this
14:06:30 But looks good
14:06:52 Any other specs to discuss? Spec freeze is this coming Thursday
14:07:35 #topic Reviews
14:07:38 #link Nested RP series starting with: https://review.openstack.org/#/c/470575/
14:07:45 Jay is AWOL
14:08:11 Does anyone feel confident enough to take this on while he's on holiday?
14:08:51 The first three have no -1s
14:09:16 So not necessarily needing further deltas.
14:09:18 what's going on with the de-orm changes at the end of the series?
14:09:22 https://review.openstack.org/#/c/509908/3
14:09:27 I personally haven't looked over them in a while as Jay was re-working some things
14:10:21 mriedem: Looks to me like a fork in the series.
14:10:23 mriedem: I think there's a bit of weird rebasing going on: I think jay injected some of the nested stuff into the middle of the de-orm stack
14:10:25 The de-orm stuff looked straightforward enough
14:10:39 or that
14:10:43 It was supposed to be before the NRP stuff
14:11:00 looks like some was moved before, and the rest left at thened
14:11:04 yeah
14:11:05 the end
14:11:22 ok the 4 at the end are all de-orm for traits stuff,
14:11:33 which i guess doesn't have an immediate impact on the NRP stuff correct?
14:11:34 I think those aren't required for RPs either,
14:11:46 ah, ok
14:11:49 he hooked in the NRP set mid-way through the de-ORMing where it could be done
14:11:57 ok
14:12:05 * edleafe keeps reading that as "de-worming"
14:12:21 So core reviews on the first three NRP patches?
14:12:34 And see if Jay's back by then? :)
14:13:08 I will volunteer to try fixups as necessary.
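For reference, a minimal sketch of the numbered-suffix ("granular") request style discussed above for GET /allocation_candidates. The parameter spellings and resource classes here are illustrative assumptions based on the spec review, not final syntax:

    # Illustrative only: builds a GET /allocation_candidates query string using
    # the numbered-suffix ("granular") groups discussed in the spec review.
    # Parameter names and resource classes are assumptions, not final syntax.
    from urllib.parse import urlencode

    params = {
        # un-numbered group: resources that may come from any provider
        'resources': 'VCPU:2,MEMORY_MB:2048',
        # numbered group 1: resources/traits that must be satisfied together
        # by a single provider
        'resources1': 'SRIOV_NET_VF:1',
        'required1': 'CUSTOM_PHYSNET_PUBLIC',
    }

    print('GET /allocation_candidates?%s' % urlencode(params))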
14:13:24 sounds like a plan
14:13:28 I'd like to at least point out that it is a fragile strategy to have a critical part of the development of a project in the brain of just one developer
14:13:29 Though willingness:capability ratio may be lopsided.
14:13:33 #link https://blog.leafe.com/pair-development/
14:13:47 This has happened to us before, and is happening again here
14:14:10 I don't think we're in a place where nobody else *could* pick it up.
14:14:21 efried: I'm not saying that
14:14:30 then what are you saying?
14:14:33 wait for jay to get back?
14:14:37 Any of us could
14:14:40 i couldn't
14:15:08 I believe edleafe is saying that it would be better if we shared brains sooner in the process?
14:15:15 As far as the design goes, agree that shouldn't be exclusively in Jay's head - it should be in the spec. And if it's not, we should examine our spec process.
14:15:18 NRP has been around since newton yes?
14:15:22 It's just that a) it would take time to get familiar with the code and b) we will not understand why it is written the way it is, causing more problems
14:15:24 so the brain sharing has been going on for over a year
14:15:54 mriedem: and yet no one could just pick it up and run with it
14:16:05 jay's had patches up for a year yes?
14:16:11 so what is there to run with?
14:16:21 eric said he'd rebase it and handle changes while jay is out
14:16:31 so i don't see why this is an opportunity to talk about how this is all just jay
14:16:33 when it's not
14:16:42 mriedem: that's not what's being said
14:17:08 what's being said is that nrp is complicated and hard to pick up. no one is ascribing blame here
14:17:28 just that ideally it would be better, in every situation, if all of us were able to pick things up, more easily
14:17:48 if any of me, ed, or eric pick this up, it will be an interesting challenge
14:17:55 that's okay, but not optimal
14:18:14 ok, well, hard things are hard
14:18:19 i hear what you guys are saying,
14:18:31 Obviously there's going to be a point person on a given piece, and usually that person will understand things the best. But here I think there's enough understanding spread around the team that we're not in a serious bind.
14:18:49 that ^
14:18:54 efried: no one's panicking
14:19:08 efried: just pointing out that there are better ways of doing things
14:19:31 do we know when jay comes back on scene?
14:19:39 Can you give an example of a way we could [do|have done] things better in this case?
14:19:46 I've managed a ton of software projects, and always strive to have more than one person with their hands in any part of code
14:20:56 efried: if for nothing else than the bus factor https://en.wikipedia.org/wiki/Bus_factor
14:21:08 nothing is stopping anyone from having their hands in any of this code
14:21:17 it's not the end of the world
14:21:21 it's just not ideal
14:22:26 ok ideal utopias aside,
14:22:27 Okay, so are we good with the plan for right here right now?
14:22:29 what's next?
14:22:40 efried: i am yes
14:23:04 mriedem: no need to be condescending
14:23:08 #link Add traits to GET /allocation_candidates https://review.openstack.org/479776
14:23:11 #link Add traits to get RPs with shared https://review.openstack.org/478464/
14:23:14 Not much activity in the past week
14:23:32 I rewrote the implementation last week
14:24:00 I took a look at it, but there's a bunch I don't understand about how aggregates are supposed to work with traits.
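A rough sketch of the request shape the traits reviews above are aiming for, assuming a required-style query parameter as proposed in the traits spec under review; the parameter spelling and trait names are illustrative, not confirmed API:

    # Illustrative sketch: asking placement for allocation candidates that
    # also expose a given trait. The 'required' parameter spelling is taken
    # from the spec under review and may change.
    from urllib.parse import urlencode

    query = urlencode({
        'resources': 'VCPU:1,MEMORY_MB:512,DISK_GB:10',
        'required': 'HW_CPU_X86_AVX2',
    })
    print('GET /allocation_candidates?%s' % query)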
14:24:15 basically this is the new version https://review.openstack.org/479766 instead of the SQL implementation https://review.openstack.org/489206
14:24:20 alex_xu: where are you in the rewrite? Is it ready for review?
14:25:13 edleafe: 479755 is the one, last week I expected jay to take a look at whether that is the direction he wants, but ...
14:25:37 sorry, s/479755/479766
14:25:46 I still have some reading to do, but I'm concerned that the expected semantics may not be expressed in a spec anywhere.
14:26:17 efried: which semantics?
14:26:39 How a request with traits maps to a compute host RP plus aggregates.
14:27:11 I'd actually like to understand the extent to which we intend to support aggregate stuff in Q.
14:27:43 we aren't doing sharing providers in queens
14:27:47 we said that at the ptg
14:27:52 I think the spec describes the shared rp case
14:27:56 #link https://review.openstack.org/#/c/497713/10/specs/queens/approved/add-trait-support-in-allocation-candidates.rst@83
14:28:06 shared rp is a distraction at this point
14:28:18 mriedem: but the current implementation of allocation candidates works for the shared rp
14:28:28 except for some bugs
14:28:31 alex_xu: the API does yes,
14:28:35 but using them in the rest of nova doesn't
14:28:43 like for resource tracking
14:29:12 So my question is whether alex_xu's review noted above should even be trying to implement aggregate semantics.
14:29:45 mriedem: so the trait support shouldn't break the API?
14:30:13 alex_xu: you're talking about https://review.openstack.org/#/c/478464/ specifically?
14:31:10 mriedem: sorry, that is the old one, the new one is https://review.openstack.org/479766, I just kept the old one so people can reference the two implementations
14:31:11 so the question is just whether or not we should apply traits filtering also to shared RPs?
14:31:52 if at the top GET /allocation_candidates route we're going to support filtering by traits, then i think it's fine to also include filtering traits on shared RPs
14:32:07 otherwise we'd have to have another microversion in rocky for that, which would be weird
14:33:02 mriedem: ok
14:33:13 Cool. And my concern could very well be addressed by words in the spec - so disregard for now. I'll go away and read, and bring it up again if I'm still worried.
14:33:26 Next up are a couple from cdent:
14:33:26 #link Allow _set_allocations to delete allocations https://review.openstack.org/#/c/501051/
14:33:28 i don't think there is any wording in the spec about shared RPs
14:33:30 Got a +2, needs rebase
14:33:32 #link POST /allocations for >1 consumer https://review.openstack.org/#/c/500073/
14:33:35 Also needs rebase
14:33:42 whole stack just pushed with rebase
14:33:57 but I put a -W back in the middle to block until I resolve an earlier comment
14:33:57 cdent: ah, I compiled this list an hour ago
14:34:38 #link Use ksa adapter for placement https://review.openstack.org/#/c/492247/
14:34:41 Looks solid; anything to add, efried?
14:34:49 nope
14:34:55 may I ask one more question about the traits patch?
14:35:05 alex_xu: of course
14:35:09 edleafe: thanks
14:36:02 so.. should I continue making the new implementation patch, or just wait for some feedback on the two implementations first?
14:37:26 should I continue to improve the new implementation patch, or just wait for some feedback first?
14:37:39 what was the reason for the new implementation?
14:37:51 As a relative noob, I have a preference for the new implementation. This is a pretty hard concept to understand, but the older one is *really* hard to follow.
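As context for the SQL-versus-Python discussion here, a minimal sketch, assuming hypothetical pre-fetched provider data and not the actual patch, of filtering candidate providers by required traits in Python rather than in one large SQL query:

    # Minimal sketch of the "debuggable python instead of complex SQL" idea:
    # fetch the candidate providers first, then filter by required traits.
    REQUIRED_TRAITS = {'HW_CPU_X86_AVX2'}

    # hypothetical pre-fetched data: provider uuid -> set of trait names
    provider_traits = {
        'rp-compute-1': {'HW_CPU_X86_AVX2', 'HW_NIC_SRIOV'},
        'rp-compute-2': {'HW_NIC_SRIOV'},
    }

    def filter_by_traits(providers, required):
        """Return only the providers that expose every required trait."""
        return [rp for rp, traits in providers.items() if required <= traits]

    print(filter_by_traits(provider_traits, REQUIRED_TRAITS))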
14:38:35 mriedem: jay has a comment at https://review.openstack.org/#/c/489206/, basically he wants to remove the complex SQL and implement it in debuggable python code
14:39:21 but in the python code, we need to take care of the mapping of root rp and sharing rp stuff instead of doing it in sql
14:39:38 was that before or after jay decided to de-orm everything? :)
14:39:43 maybe it doesn't matter
14:39:50 but yeah i skimmed his comment on the old impl
14:40:23 i guess go with the new one until someone pushes back on it
14:40:40 ok
14:40:44 alex_xu: yeah, readability and debuggability count
14:40:59 yea, the huge sql isn't debuggable
14:41:14 ok, moving on
14:41:15 #link Alternate hosts: series starting with https://review.openstack.org/#/c/486215/
14:41:19 There are concerns about removing limits in the Selection object, as has been suggested previously. I've started working on adding them back; should have the revised Selection object and code by this afternoon.
14:41:30 Also, can we please have a better name for the RPC flag noted here: https://review.openstack.org/#/c/510159/
14:41:33 I chose an intentionally dumb name with the expectation that someone would jump in with a better suggestion.
14:42:00 so "modern_flag"?
14:42:13 how about just return_alternates?
14:42:15 the flag should imply "new return type"
14:42:16 yes
14:42:21 problem solved
14:42:27 The series is otherwise complete, with the details of the claiming of alternates still to be worked out
14:44:26 #topic Bugs
14:44:27 #link https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:44:59 Several new bugs, most reported by cdent
14:45:23 mostly notes to self, already addressed
14:45:33 ok, that's what it looked like
14:45:39 Anything more for bugs?
14:46:29 ok then
14:46:31 #topic Open discussion
14:46:32 Two items on the agenda
14:46:39 preserving/returning auto-healing behavior in automated clients (cdent)
14:46:46 the floor is yours, cdent
14:47:23 Yeah, so I'm unclear on where we landed on the extent to which a compute-node is going to be capable of healing itself. Say the placement DB got wiped and then a compute-node is rebooted:
14:47:44 in our long term plan will the compute node come up, report its inventory and allocations and life is groovy?
14:48:07 the computes won't report allocations
14:48:11 once they are at least all at pike
14:48:13 s/rebooted/the service is restarted, but physical hardware happy/
14:48:23 computes report inventory in the periodic
14:48:26 but not allocations
14:48:39 Will the RT be responsible for rebuilding the allocations?
14:48:45 no
14:48:57 we're moving away from the RT being responsible for allocations
14:49:05 Can you recall why we decided that the allocations would not be reported? And should we consider having them back since it makes managing some aspects of state a bit cleaner?
14:49:07 because of how mixed version RTs trample on each other
14:49:16 because of said trampling
14:49:20 which we found late in pike
14:49:30 Okay, so who's responsible for rebuilding the allocations in the scenario cdent mentions?
14:49:32 remember ocata computes were overwriting allocations
14:49:35 because multiple things managing allocations when both think they're right is fragile
14:49:38 yeah, what efried said
14:49:48 if the placement db is wiped,
14:49:50 efried: "the db got wiped" is not a reasonable scenario, IMHO
14:49:51 you're kind of fucked?
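For reference on the "computes report inventory in the periodic, but not allocations" point above, a minimal sketch of the shape of an inventory update a compute node pushes to placement; the provider UUID, numbers, and choice of fields are illustrative, not taken from the resource tracker code:

    # Sketch of a periodic inventory report to placement
    # (PUT /resource_providers/{uuid}/inventories). Note there are no
    # allocations here -- per the discussion above, computes stopped
    # reporting those.
    import json

    compute_rp_uuid = 'b6b069ec-7b3d-49f3-9c41-d1d1b7a4a4a4'  # hypothetical

    inventory_payload = {
        'resource_provider_generation': 4,
        'inventories': {
            'VCPU': {'total': 16, 'allocation_ratio': 16.0, 'reserved': 0},
            'MEMORY_MB': {'total': 65536, 'allocation_ratio': 1.5, 'reserved': 512},
            'DISK_GB': {'total': 1024, 'allocation_ratio': 1.0, 'reserved': 0},
        },
    }

    print('PUT /resource_providers/%s/inventories' % compute_rp_uuid)
    print(json.dumps(inventory_payload, indent=2))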
14:49:57 if we lose the nova db, we don't have a plan
14:50:12 we probably need some fsck-like behavior for nova-manage,
14:50:16 and thus nova is not enterprise grade
14:50:18 I'm thinking fairly long term here, where placement is no longer a part of the nova db
14:50:25 to patch things up or at least identify places where we dropped something or lost some accounting,
14:50:29 and _if_ nodes were self healing, you'd have some options
14:50:34 but a full rebuild scenario is not realistic, IMHO
14:51:14 Assuming that the consumer of an allocation will not always be an instance, I think I agree.
14:51:15 my gut feeling is you're going to introduce more bugs and issues supporting self healing
14:51:43 because we support mixed version computes
14:52:13 I think live self healing is probably bad news, but a kind of recovery mode might be interesting to think about (so I'll think about it, nothing to do at this stage, just wanted info/input)
14:52:45 sure we can cram stuff in nova-manage if we want
14:52:49 hence nova-manage fsck
14:53:46 The other item is sort of moving in the opposite direction:
14:53:47 Getting allocations into virt (e.g. new param to spawn). Some discussion here:
14:53:50 #link http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2017-10-04.log.html#t2017-10-04T13:49:18-2
14:53:57 efried: all yours
14:54:05 how is that moving in the opposite direction?
14:54:09 Oh, this is still on here?
14:54:16 I'm working on this change.
14:54:50 dansmith: getting placement info down to compute
14:54:56 this is request info
14:54:58 not placement info
14:55:08 think of it like passing network info and bdms to driver.spawn
14:55:09 https://review.openstack.org/#/c/511879/
14:55:32 I thought there was the aspect of a particular PF that placement selected
14:55:33 edleafe: getting placement info _to_ compute is fine and good, getting compute to change placement is not
14:55:44 this is about providing the virt driver with more information
14:55:48 Yeah, so an alternative that was discussed for ^ but dismissed was adding the allocation blob to the instance object.
14:55:55 this is providing an allocation that was made during scheduling, yes?
14:55:58 dansmith: ...hence "opposite"
14:56:09 um.
14:56:10 like providing a port id or volume id
14:56:35 mriedem: yes
14:56:39 anyway, sounds like efried is still working on it
14:56:43 anything to discuss about it here?
14:56:47 with 4 minutes left?
14:56:50 yep, I took a pass through it already
14:57:42 OK, thanks everyone!
14:57:43 #endmeeting
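As context for the "new param to spawn" item above, a minimal sketch of what passing the scheduler-made allocations down to the virt driver might look like; the parameter name, its position in the signature, and the data shape are assumptions, not the contents of the patch under review at 511879:

    # Illustrative only: a fake virt driver whose spawn() accepts the
    # allocations made during scheduling, keyed by resource provider uuid.
    class FakeVirtDriver(object):
        def spawn(self, context, instance, image_meta, injected_files,
                  admin_password, allocations=None, network_info=None,
                  block_device_info=None):
            # 'allocations' carries the per-provider resource amounts the
            # scheduler claimed for this instance.
            for rp_uuid, alloc in (allocations or {}).items():
                print('provider %s -> %s' % (rp_uuid, alloc['resources']))

    driver = FakeVirtDriver()
    driver.spawn(None, None, None, [], 'secret', allocations={
        'rp-compute-1': {'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
    })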