14:00:19 #startmeeting nova_scheduler
14:00:21 Meeting started Mon Aug 7 14:00:19 2017 UTC and is due to finish in 60 minutes. The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:25 The meeting name has been set to 'nova_scheduler'
14:00:36 o/
14:00:46 Anyone around?
14:00:58 no
14:01:14 jaypipes: good!
14:02:10 nobody at all
14:02:10 \o
14:02:12 Hmmm... should we have a meeting, or just keep arguing in -nova?
14:02:25 let's get on with it
14:02:42 #topic Specs & Reviews
14:02:50 edleafe: just a note that we have an open critical bug
14:03:05 bauzas: Yes, the bugs topic is next
14:03:17 #link Correct allocations for moves https://review.openstack.org/#/c/488510/
14:03:19 * alex_xu waves late
14:03:21 jaypipes, dansmith - comments on that series?
14:04:04 edleafe: slow slogging but making progress.
14:04:18 edleafe: agree with alex_xu's comments about lack of testing of rebuild/evacuate
14:04:27 edleafe: will be focusing on this today
14:04:42 rebuild avoids the scheduler so should be ok
14:04:44 it's a noop claim
14:04:45 jaypipes: ok. Is there anything we can help with (other than reviews)?
14:05:01 mriedem: but it will not hit the "auto-heal" thing in Pike computes..
14:05:09 yea, rebuild is ok
14:05:11 mriedem: in any case, we can discuss on #openstack-nova
14:05:28 oh that's a concern for the patch
14:05:31 nvm me then
14:05:41 #link Amend spec for Custom Resource Classes in Flavors https://review.openstack.org/#/c/481748/
14:05:44 Has a +2; needs another nova-specs core
14:05:51 #link Ironic Flavor Migration https://review.openstack.org/#/c/487954/
14:05:54 Finally has approval of Ironic people. Should be good to go.
14:06:29 #link Improve handling of ironicclient failures https://review.openstack.org/#/c/487925/
14:06:32 Keeps failing a few tests in Jenkins, but never locally.
14:06:35 At a loss figuring out the problem.
14:06:54 Additional eyeballs on that might provide the solution
14:07:16 * dtantsur was not aware of that, will check
14:07:23 thx
14:07:34 #link Devstack to use resource classes by default https://review.openstack.org/#/c/476968/
14:07:37 dtantsur is still working out some issues with this.
14:07:59 yeah, not much progress. I don't really understand why it fails.
14:08:12 on the bright side, a person told us that roughly the same thing worked locally for them
14:08:23 sooo.. maybe missing 'sleep 10' somewhere? ;)
14:08:29 heh
14:08:50 "did you try unplugging it and then plugging it back in?"
14:08:56 right :D
14:09:08 The next few traits patches are on hold until Queens
14:09:08 #link Traits support in the Allocation Candidates https://review.openstack.org/478464/
14:09:11 #link Add traits to the ResourceProviders filters https://review.openstack.org/#/c/474602/
14:09:14 #link Refactor of trait filtering SQLA https://review.openstack.org/#/c/489206/
14:09:17 #link Correct resource provider set_traits() https://review.openstack.org/#/c/489205/
14:09:28 #link Nested Resource Providers series starting with https://review.openstack.org/#/c/470575/
14:09:31 Also on hold until Queens
14:09:36 #link Placement api-ref docs
14:09:36 https://review.openstack.org/#/q/topic:cd/placement-api-ref+status:open
14:09:39 Looking better and better!
14:09:53 Anything else for specs/reviews?
14:10:25 there's lots of minor bug fixes from the most recent rp/update that still need attention
14:10:27 i've also got a thing started
14:10:36 even though that rp/update is more than a week old
14:10:42 #link shared storage functional resize test WIP https://review.openstack.org/#/c/490733/
14:11:01 ^ has flushed out some other issues beyond just that bug it's trying to test
14:11:19 the main issue being we don't sum/max shared storage allocation in the scheduler
14:11:29 so if you resize disk up, we don't actually allocate that during the move in the scheduler
14:11:33 mriedem: functional tests tend to do that :)
14:11:45 * cdent edleafe burn
14:11:59 * cdent sizzles
14:12:10 * edleafe hates trusting unit tests
14:12:21 anyway,
14:12:35 i don't think it's something we were focusing on until after jay and dan's changes for the move stuff anyway
14:12:46 the fixes, at least for me locally, get messy
14:13:01 so shared storage is probably going to be a known issue in the release notes, as in we know it doesn't work
14:13:36 messy how? a simple sum of everything not workable?
14:13:45 yeah, it would be nice to map out all the use cases to make sure we have them covered
14:14:13 the merge/sum in the scheduler isn't too bad,
14:14:25 it's the report client that gets bad https://review.openstack.org/#/c/491098/1
14:15:03 starting around https://review.openstack.org/#/c/491098/1/nova/scheduler/client/report.py@1019
14:15:16 the scheduler report client has to figure out if we're using shared storage for the instance already,
14:15:26 and if that shared storage is related to the compute node that's updating the allocations,
14:15:36 so it avoids overwriting the disk using the local compute node
14:15:51 but that doesn't handle shared storage associated with the dest node
14:16:07 seems like too many parts need to have knowledge about other parts
14:16:12 yeah
14:16:29 i've been thinking the _allocate_for_instance code should just add/update allocations for the local compute node unless told otherwise via delete_allocations_for_instance
14:16:42 but anyway, like i said, it gets messy
14:16:52 can we instead: reverse the allocation structure so it is keyed by resource class with value of rp uuid and quantity
14:17:02 and then loop across both
14:17:07 and it will stay messy until the dansmith/jaypipes patches settle
14:17:34 if source rp uuid == target rp uuid sum
14:17:42 (into a new reversed alloc structure)
14:18:03 sorry, I'm still back on the claim, not the unwind
14:18:12 but I still think changing the data structure would help
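(A rough sketch of the reversed-allocation idea pitched above, for illustration only: this is not Nova code, and merge_move_allocations and the provider names are invented. Allocations are keyed by resource class, each value maps a resource provider UUID to a quantity, and looping over the source and destination sides sums the amounts whenever both sides land on the same provider, e.g. a resize to the same host or DISK_GB supplied by a shared storage provider.)

```python
from collections import defaultdict


def merge_move_allocations(source_allocs, dest_allocs):
    """Merge two {resource_class: {rp_uuid: amount}} allocation dicts."""
    merged = defaultdict(lambda: defaultdict(int))
    for allocs in (source_allocs, dest_allocs):
        for rc, providers in allocs.items():
            for rp_uuid, amount in providers.items():
                # Same provider on both sides (resize to same host, shared
                # storage) -> the amounts add up; a different provider just
                # gets its own entry.
                merged[rc][rp_uuid] += amount
    return {rc: dict(providers) for rc, providers in merged.items()}


# Resize up on the same host, with DISK_GB coming from a shared provider.
source = {'VCPU': {'compute-rp': 1}, 'DISK_GB': {'shared-rp': 10}}
dest = {'VCPU': {'compute-rp': 2}, 'DISK_GB': {'shared-rp': 20}}
print(merge_move_allocations(source, dest))
# {'VCPU': {'compute-rp': 3}, 'DISK_GB': {'shared-rp': 30}}
```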
14:18:20 cdent: it seems like it would be cleaner to have multiple allocations for the same RP/consumer
14:18:33 i.e., one for source, one for dest
14:18:37 edleafe: how would you differentiate them?
14:18:46 so resize to same host would effectively sum
14:18:47 edleafe: yes, but I think we declared that off limits until after pike
14:18:55 jaypipes: why would you need to know that
14:19:09 cdent: sure, I'm just throwing that out there
14:19:16 it's a good pitch
14:19:17 edleafe: to know which one to "free" on revert or confirm
14:19:44 jaypipes: from old flavor/new flavor?
14:19:58 edleafe: dansmith had a good idea of allocating the resources against a consumer ID of the migration UUID (once migrations get a UUID)
14:20:22 jaypipes: yeah, we talked about that a while ago
14:20:23 just a side note that not all migrations have a migration object
14:20:30 eh?
14:20:32 they better
14:20:32 until we realized that migrations don't have a uuid :)
14:20:41 heh, yeah
14:20:45 dansmith: not live-migrations AFAIK
14:20:50 * cdent blinks
14:20:53 yeah, I added it
14:20:58 oh you did?
14:21:01 a while ago
14:21:04 excellent, missed that
14:21:11 * edleafe wipes his brow
14:21:30 i also think tracking the migration's allocations via the migration uuid will help sort things out
14:21:48 seriously, the long-standing bugfix from nikola confused me
14:21:53 I agree
14:22:02 so we need to limp along in pike until we have that i guess
14:22:17 Ready to move on to bugs?
14:22:28 * cdent nods
14:22:33 #topic Bugs
14:22:34 #link Placement bugs https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:22:35 I seriously didn't push for that idea because I thought we weren't creating migration objects for live-mig
14:22:36 Several new bugs as we start to pound on the code harder.
14:22:42 * bauzas facepalms
14:22:43 One critical bug: migration of single instance from multi-instance request spec fails with IndexError https://bugs.launchpad.net/nova/+bug/1708961
14:22:43 Launchpad bug 1708961 in OpenStack Compute (nova) "migration of single instance from multi-instance request spec fails with IndexError" [Critical,In progress] - Assigned to Sylvain Bauza (sylvain-bauza)
14:22:45 bauzas has a fix in the works:
14:22:48 #link https://review.openstack.org/#/c/491439/
14:23:02 bauzas: comment on that fix?
14:23:17 I marked it critical as I considered it a serious regression
14:23:27 nothing but what's in the commit msg
14:23:45 and the fact I missed that when I reviewed, so I blame myself
14:24:21 ok, we'll follow up on that after the meeting
14:24:26 gah, yet another poop accident just had to clean up... ffs, today sucks.
14:24:42 from a dog right?
14:24:48 jaypipes: maybe you need a cage
14:24:57 jaypipes: you've made me not care about the flash floods waiting to carry my house away
14:25:12 mriedem: no, from Julie.
14:25:21 jaypipes: I sincerely hope you don't have a Roomba
14:25:28 bauzas: lol, no.
14:25:43 * bauzas checked.
14:25:52 jaypipes: in sickness and in health...
14:25:59 alright, moving on.
14:26:44 jaypipes: http://imgur.com/a/6ympr
14:27:02 yikes
14:27:14 * edleafe is waiting for the electricity to go out any minute...
14:27:36 So before I lose internet, anything else for bugs?
14:27:49 I'm going to make some more
14:27:55 I hope everyone else will too
14:28:07 cdent: doing what you do best
14:28:21 #topic Open Discussion
14:28:32 I have an ironic question :)
14:28:37 go for it
14:28:47 when should we recommend people to 1. populate resource_class in ironic, 2. switch flavors?
14:28:52 I suspect the latter is for Queens
14:29:09 correct
14:29:10 so my question is more about the former. should we already put in our upgrade notes that resource_class has to be populated
14:29:11 dtantsur: populate resource_class now
14:29:19 now = before upgrade to Pike or after?
14:29:29 dtantsur: I would say now.
14:29:49 dtantsur: in other words, make resource class required *in Pike*
14:30:15 dtantsur: it should work if they add during Pike, but if you say that, they'll probably trip on some edge cases
14:30:24 okay, so I put something like "Before upgrade to Pike, you have to populate node.resource_class", right?
14:30:35 ...for all nodes
14:30:40 from an ironic perspective, I'd say yup
14:31:10 mriedem, cdent, edleafe, dansmith: so, I've been thinking... with all the latent issues we're finding in the resize/move space, would it be good for me to write a dev-ref doc specifically about what happens with placement, scheduler and compute nodes in the various resize operations?
14:31:24 and the flavors should be updated during Pike, before Queens, right?
14:31:40 dtantsur: yes, updated flavor extra_specs will work
14:31:40 jaypipes: yes, because it seems likely that in the act of writing it, it will make us/you think of gaps
14:31:50 k
14:31:53 awesome, thanks all
14:31:58 Pike supports both ways of requesting ironic resources
14:31:59 * dtantsur will ping someone to review the docs change
14:32:06 jaypipes: yeah agree, docs are always good
14:32:30 jaypipes: that might also serve as a guideline for the functional tests that will be needed
14:33:08 ya
14:33:50 ++
14:34:01 ok, then:
14:34:03 #action jaypipes to write a dev-ref doc specifically about what happens with placement, scheduler and compute nodes in the various resize operations
14:34:55 Anything else for open discussion? How about talking about how much we love the ChanceScheduler?
14:35:15 * cdent writes EdScheduler
14:35:15 oh, you've got a random failure scheduling filter? nice!
14:35:17 :D
14:35:32 def select_destinations(…): raise NoValidHost()
14:35:45 * edleafe remembers when ChanceScheduler was the *only* scheduler in Nova
14:35:58 if random() > 0.5: raise NoValidHost() <-- this is cloud, c'mon
14:36:41 I think we're done here. Thanks everyone!
14:36:44 #endmeeting
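(A note on the resource_class and flavor guidance from the open discussion, for illustration only: the Pike-era convention is that an ironic node's resource_class maps to a custom placement resource class, uppercased, with anything other than letters, digits and underscores replaced by underscores, and prefixed with CUSTOM_, and a flavor then requests one unit of that class while zeroing out the standard VCPU/MEMORY_MB/DISK_GB resources. The helpers below are a minimal sketch of that convention, not Nova or ironic code.)

```python
import re


def custom_resource_class(ironic_resource_class):
    """Map an ironic node.resource_class to a placement custom class name."""
    # Uppercase, replace anything that is not A-Z/0-9/_ with '_', add CUSTOM_.
    normalized = re.sub(r'[^A-Z0-9_]', '_', ironic_resource_class.upper())
    return 'CUSTOM_' + normalized


def baremetal_flavor_extra_specs(ironic_resource_class):
    """Extra specs for a flavor that schedules solely on the custom class."""
    return {
        'resources:%s' % custom_resource_class(ironic_resource_class): '1',
        # Zero out the standard classes so the flavor's vcpus/ram/disk
        # values are not also claimed against the bare metal node.
        'resources:VCPU': '0',
        'resources:MEMORY_MB': '0',
        'resources:DISK_GB': '0',
    }


print(custom_resource_class('baremetal.gold'))        # CUSTOM_BAREMETAL_GOLD
print(baremetal_flavor_extra_specs('baremetal.gold'))
```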