14:00:19 <edleafe> #startmeeting nova_scheduler
14:00:21 <openstack> Meeting started Mon Aug 7 14:00:19 2017 UTC and is due to finish in 60 minutes. The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:25 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:36 <mriedem> o/
14:00:46 <edleafe> Anyone around?
14:00:58 <jaypipes> no
14:01:14 <edleafe> jaypipes: good!
14:02:10 <dtantsur> nobody at all
14:02:10 <bauzas> \o
14:02:12 <edleafe> Hmmm... should we have a meeting, or just keep arguing in -nova?
14:02:25 <mriedem> let's get on with it
14:02:42 <edleafe> #topic Specs & Reviews
14:02:50 <bauzas> edleafe: just a note that we have an open critical bug
14:03:05 <edleafe> bauzas: Yes, the bugs topic is next
14:03:17 <edleafe> #link Correct allocations for moves https://review.openstack.org/#/c/488510/
14:03:19 * alex_xu waves late
14:03:21 <edleafe> jaypipes, dansmith - comments on that series?
14:04:04 <jaypipes> edleafe: slow slogging but making progress.
14:04:18 <jaypipes> edleafe: agree with alex_xu's comments about lack of testing of rebuild/evacuate
14:04:27 <jaypipes> edleafe: will be focusing on this today
14:04:42 <mriedem> rebuild avoids the scheduler so should be ok
14:04:44 <mriedem> it's a noop claim
14:04:45 <edleafe> jaypipes: ok. Is there anything we can help with (other than reviews)?
14:05:01 <jaypipes> mriedem: but it will not hit the "auto-heal" thing in Pike computes..
14:05:09 <alex_xu> yea, rebuild is ok
14:05:11 <jaypipes> mriedem: in any case, we can discuss on #openstack-nova
14:05:28 <mriedem> oh that's a concern for the patch
14:05:31 <mriedem> nvm me then
14:05:41 <edleafe> #link Amend spec for Custom Resource Classes in Flavors https://review.openstack.org/#/c/481748/
14:05:44 <edleafe> Has a +2; needs another nova-specs core
14:05:51 <edleafe> #link Ironic Flavor Migration https://review.openstack.org/#/c/487954/
14:05:54 <edleafe> Finally has approval of Ironic people. Should be good to go.
14:06:29 <edleafe> #link Improve handling of ironicclient failures https://review.openstack.org/#/c/487925/
14:06:32 <edleafe> Keeps failing a few tests in Jenkins, but never locally.
14:06:35 <edleafe> At a loss figuring out the problem.
14:06:54 <edleafe> Additional eyeballs on that might provide the solution
14:07:16 * dtantsur was not aware of that, will check
14:07:23 <edleafe> thx
14:07:34 <edleafe> #link Devstack to use resource classes by default https://review.openstack.org/#/c/476968/
14:07:37 <edleafe> dtantsur is still working out some issues with this.
14:07:59 <dtantsur> yeah, not much progress. I don't really understand why it fails.
14:08:12 <dtantsur> on the bright side, a person told us that roughly the same thing worked locally for them
14:08:23 <dtantsur> sooo.. maybe missing 'sleep 10' somewhere? ;)
14:08:29 <edleafe> heh
14:08:50 <edleafe> "did you try unplugging it and then plugging it back in?"
14:08:56 <dtantsur> right :D
14:09:08 <edleafe> The next few traits patches are on hold until Queens
14:09:08 <edleafe> #link Traits support in the Allocation Candidates https://review.openstack.org/478464/
14:09:11 <edleafe> #link Add traits to the ResourceProviders filters https://review.openstack.org/#/c/474602/
14:09:14 <edleafe> #link Refactor of trait filtering SQLA https://review.openstack.org/#/c/489206/
14:09:17 <edleafe> #link Correct resource provider set_traits() https://review.openstack.org/#/c/489205/
14:09:28 <edleafe> #link Nested Resource Providers series starting with https://review.openstack.org/#/c/470575/
14:09:31 <edleafe> Also on hold until Queens
14:09:36 <edleafe> #link Placement api-ref docs
14:09:36 <edleafe> https://review.openstack.org/#/q/topic:cd/placement-api-ref+status:open
14:09:39 <edleafe> Looking better and better!
14:09:53 <edleafe> Anything else for specs/reviews?
14:10:25 <cdent> there’s lots of minor bug fixes from the most recent rp/update that still need attention
14:10:27 <mriedem> i've also got a thing started
14:10:36 <cdent> even though that rp/update is more than a week old
14:10:42 <mriedem> #link shared storage functional resize test WIP https://review.openstack.org/#/c/490733/
14:11:01 <mriedem> ^ has flushed out some other issues beyond just that bug it's trying to test
14:11:19 <mriedem> the main issue being we don't sum/max shared storage allocation in the scheduler
14:11:29 <mriedem> so if you resize disk up, we don't actually allocate that during the move in the scheduler
14:11:33 <edleafe> mriedem: functional tests tend to do that :)
14:11:45 * cdent edleafe burn
14:11:59 * cdent sizzles
14:12:10 * edleafe hates trusting unit tests
14:12:21 <mriedem> anyway,
14:12:35 <mriedem> i don't think it's something we were focusing on until after jay and dan's changes for the move stuff anyway
14:12:46 <mriedem> the fixes, at least for me locally, get messy
14:13:01 <mriedem> so shared storage is probably going to be a known issue in the release notes, as in we know it doesn't work
14:13:36 <cdent> messy how? a simple sum of everything not workable?
14:13:45 <edleafe> yeah, it would be nice to map out all the use cases to make sure we have them covered
14:14:13 <mriedem> the merge/sum in the scheduler isn't too bad,
14:14:25 <mriedem> it's the report client that gets bad https://review.openstack.org/#/c/491098/1
14:15:03 <mriedem> starting around https://review.openstack.org/#/c/491098/1/nova/scheduler/client/report.py@1019
14:15:16 <mriedem> the scheduler report client has to figure out if we're using shared storage for the instance already,
14:15:26 <mriedem> and if that shared storage is related to the compute node that's updating the allocations,
14:15:36 <mriedem> so it avoids overwriting the disk using the local compute node
14:15:51 <mriedem> but that doesn't handle shared storage associated with the dest node
14:16:07 <edleafe> seems like too many parts need to have knowledge about other parts
14:16:12 <cdent> yeah
14:16:29 <mriedem> i've been thinking the _allocate_for_instance code should just add/update allocations for the local compute node unless told otherwise via delete_allocations_for_instance
14:16:42 <mriedem> but anyway, like i said, it gets messy
14:16:52 <cdent> can we instead: reverse the allocation structure so it is keyed by resource class with value of rp uuid and quantity
14:17:02 <cdent> and then loop across both
14:17:07 <edleafe> and it will stay messy until the dansmith/jaypipes patches settle
14:17:34 <cdent> if source rp uuid == target rp uuid sum
14:17:42 <cdent> (into a new reversed alloc structure)
14:18:03 <cdent> sorry, I’m still back on the claim, not the unwind
14:18:12 <cdent> but I still think changing the data structure would help
14:18:20 <edleafe> cdent: it seems like it would be cleaner to have multiple allocations for the same RP/consumer
14:18:33 <edleafe> i.e., one for source, one for dest
14:18:37 <jaypipes> edleafe: how would you differentiate them?
14:18:46 <edleafe> so resize to same host would effectively sum
14:18:47 <cdent> edleafe: yes, but I think we declared that off limits until after pike
14:18:55 <edleafe> jaypipes: why would you need to know that
14:19:09 <edleafe> cdent: sure, I'm just throwing that out there
14:19:16 <cdent> it’s a good pitch
14:19:17 <jaypipes> edleafe: to know which one to "free" on revert or confirm
14:19:44 <edleafe> jaypipes: from old flavor/new flavor?
14:19:58 <jaypipes> edleafe: dansmith had a good idea of allocating the resources against a consumer ID of the migration UUID (once migrations get a UUID)
14:20:22 <edleafe> jaypipes: yeah, we talked about that a while ago
14:20:23 <bauzas> just a side note that not all migrations have a migration object
14:20:30 <dansmith> eh?
14:20:32 <dansmith> they better
14:20:32 <edleafe> until we realized that migrations don't have a uuid :)
14:20:41 <jaypipes> heh, yeah
14:20:45 <bauzas> dansmith: not live-migrations AFAIK
14:20:50 * cdent blinks
14:20:53 <dansmith> yeah, I added it
14:20:58 <bauzas> oh you did ?
14:21:01 <dansmith> a while ago
14:21:04 <bauzas> excellent, missed that
14:21:11 * edleafe wipes his brow
14:21:30 <mriedem> i also think tracking the migrations allocations via the migration uuid will also help sort things out
14:21:48 <bauzas> seriously, the long-standing bugfix from nikola confused me
14:21:53 <bauzas> I agree
14:22:02 <mriedem> so we need to limp along in pike until we have that i guess
14:22:17 <edleafe> Ready to move on to bugs?
14:22:28 * cdent nods
14:22:33 <edleafe> #topic Bugs
14:22:34 <edleafe> #link Placement bugs https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:22:35 <bauzas> I seriously didn't push for that idea because I thought we weren't creating migration objects for live-mig
14:22:36 <edleafe> Several new bugs as we start to pound on the code harder.
14:22:42 * bauzas facepalms
14:22:43 <edleafe> One critical bug: migration of single instance from multi-instance request spec fails with IndexError https://bugs.launchpad.net/nova/+bug/1708961
14:22:43 <openstack> Launchpad bug 1708961 in OpenStack Compute (nova) "migration of single instance from multi-instance request spec fails with IndexError" [Critical,In progress] - Assigned to Sylvain Bauza (sylvain-bauza)
14:22:45 <edleafe> bauzas has a fix in the works:
14:22:48 <edleafe> #link https://review.openstack.org/#/c/491439/
14:23:02 <edleafe> bauzas: comment on that fix?
14:23:17 <bauzas> I marked it critical as I considered it a serious regression
14:23:27 <bauzas> nothing but what's in the commit msg
14:23:45 <bauzas> and the fact I missed that when I reviewed, so I blame myself
14:24:21 <edleafe> ok, we'll follow up on that after the meeting
14:24:26 <jaypipes> gah, yet another poop accident just had to clean up... ffs, today sucks.
14:24:42 <mriedem> from a dog right?
14:24:48 <cdent> jaypipes: maybe you need a cage
14:24:57 <edleafe> jaypipes: you've made me not care about the flash floods waiting to carry my house away
14:25:12 <jaypipes> mriedem: no, from Julie.
14:25:21 <bauzas> jaypipes: I sincerely hope you don't have a Roomba
14:25:28 <jaypipes> bauzas: lol, no.
14:25:43 * bauzas checked.
14:25:52 <mriedem> jaypipes: in sickness and in health...
14:25:59 <jaypipes> alright, moving on.
14:26:44 <edleafe> jaypipes: http://imgur.com/a/6ympr
14:27:02 <jaypipes> yikes
14:27:14 * edleafe is waiting for the electricity to go out any minute...
14:27:36 <edleafe> So before I lose internet, anything else for bugs?
14:27:49 <cdent> I’m going to make some more
14:27:55 <cdent> I hope everyone else will too
14:28:07 <edleafe> cdent: doing what you do best
14:28:21 <edleafe> #topic Open Discussion
14:28:32 <dtantsur> I have an ironic question :)
14:28:37 <edleafe> go for it
14:28:47 <dtantsur> when should we recommend people to 1. populate resource_class in ironic, 2. switch flavors?
14:28:52 <dtantsur> I suspect the latter is for Queens
14:29:09 <mriedem> correct
14:29:10 <dtantsur> so my question is more about the former. should we already put in our upgrade notes that resource_class has to be populated
14:29:11 <edleafe> dtantsur: populate resource_class now
14:29:19 <dtantsur> now = before upgrade to Pike or after?
14:29:29 <jaypipes> dtantsur: I would say now.
14:29:49 <jaypipes> dtantsur: in other words, make resource class required *in Pike*
14:30:15 <edleafe> dtantsur: it should work if they add during Pike, but if you say that, they'll probably trip on some edge cases
14:30:24 <dtantsur> okay, so I put something like "Before upgrade to Pike, you have to populate node.resource_class", right?
14:30:35 <edleafe> ...for all nodes
14:30:40 <bauzas> from an ironic perspective, I'd say yup
14:31:10 <jaypipes> mriedem, cdent, edleafe, dansmith: so, I've been thinking... with all the latent issues we're finding in the resize/move space, would it be good for me to write a dev-ref doc specifically about what happens with placement, scheduler and compute nodes in the various resize operations?
14:31:24 <dtantsur> and the flavors should be updated during Pike, before Queens, right?
14:31:40 <edleafe> dtantsur: yes, updated flavor extra_specs will work
14:31:40 <cdent> jaypipes: yes, because it seems likely that in the act of writing it, it will make us/you think of gaps
14:31:50 <jaypipes> k
14:31:53 <dtantsur> awesome, thanks all
14:31:58 <edleafe> Pike supports both ways of requesting ironic resources
14:31:59 * dtantsur will ping someone to review the docs change
14:32:06 <mriedem> jaypipes: yeah agree, docs are always good
14:32:30 <edleafe> jaypipes: that might also serve as a guideline for the functional tests that will be needed
14:33:08 <jaypipes> ya
14:33:50 <bauzas> ++
14:34:01 <edleafe> ok, then:
14:34:03 <edleafe> #action jaypipes to write a dev-ref doc specifically about what happens with placement, scheduler and compute nodes in the various resize operations
14:34:55 <edleafe> Anything else for open discussion? How about talking about how much we love the ChanceScheduler?
14:35:15 * cdent writes EdScheduler
14:35:15 <dtantsur> oh, you've got a random failure scheduling filter? nice!
14:35:17 <dtantsur> :D
14:35:32 <cdent> def select_destinations(…): raise NoValidHost()
14:35:45 * edleafe remembers when ChanceScheduler was the *only* scheduler in Nova
14:35:58 <dtantsur> if random() > 0.5: raise NoValidHost() <-- this is cloud, c'mon
14:36:41 <edleafe> I think we're done here. Thanks everyone!
14:36:44 <edleafe> #endmeeting
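
Editor's note (not part of the meeting log): cdent's suggestion at 14:16:52 to "reverse the allocation structure so it is keyed by resource class" and sum where the source and target resource provider UUIDs match can be sketched roughly as below. This is only an illustration of the idea as stated in the log, not the code in nova or placement; the dictionary shapes, the provider names and the helper names `reverse_allocations` and `merge_for_move` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical data for illustration only; these are not real placement payloads.
# Assumed starting shape: {rp_uuid: {resource_class: amount}}.
source_allocs = {
    "rp-compute-1": {"VCPU": 2, "MEMORY_MB": 2048},
    "rp-shared-disk": {"DISK_GB": 20},
}
target_allocs = {
    "rp-compute-2": {"VCPU": 4, "MEMORY_MB": 4096},
    "rp-shared-disk": {"DISK_GB": 40},
}


def reverse_allocations(allocs):
    """Re-key {rp_uuid: {resource_class: amount}} by resource class.

    Returns {resource_class: {rp_uuid: amount}}.
    """
    reversed_allocs = defaultdict(dict)
    for rp_uuid, resources in allocs.items():
        for resource_class, amount in resources.items():
            reversed_allocs[resource_class][rp_uuid] = amount
    return dict(reversed_allocs)


def merge_for_move(source, target):
    """Loop across both reversed structures and build a new one.

    Amounts are summed when the same provider appears on both sides
    for a resource class (shared storage, or a resize to the same host).
    """
    merged = defaultdict(dict)
    for reversed_allocs in (reverse_allocations(source), reverse_allocations(target)):
        for resource_class, per_provider in reversed_allocs.items():
            for rp_uuid, amount in per_provider.items():
                previous = merged[resource_class].get(rp_uuid, 0)
                merged[resource_class][rp_uuid] = previous + amount
    return {rc: dict(per_provider) for rc, per_provider in merged.items()}


merged = merge_for_move(source_allocs, target_allocs)
same_host = merge_for_move({"rp-1": {"VCPU": 2}}, {"rp-1": {"VCPU": 4}})
```

In this sketch the shared disk provider ends up holding the sum of the old and new disk amounts while the two compute providers keep their own entries, which matches the "resize to same host would effectively sum" behaviour edleafe describes. As jaypipes points out at 14:19:17, a summed structure like this does not record which portion to free on revert or confirm, which is the gap the migration-UUID consumer idea is meant to close.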