14:00:19 <edleafe> #startmeeting nova_scheduler
14:00:21 <openstack> Meeting started Mon Aug  7 14:00:19 2017 UTC and is due to finish in 60 minutes.  The chair is edleafe. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:25 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:36 <mriedem> o/
14:00:46 <edleafe> Anyone around?
14:00:58 <jaypipes> no
14:01:14 <edleafe> jaypipes: good!
14:02:10 <dtantsur> nobody at all
14:02:10 <bauzas> \o
14:02:12 <edleafe> Hmmm... should we have a meeting, or just keep arguing in -nova?
14:02:25 <mriedem> let's get on with it
14:02:42 <edleafe> #topic Specs & Reviews
14:02:50 <bauzas> edleafe: just a note that we have an open critical bug
14:03:05 <edleafe> bauzas: Yes, the bugs topic is next
14:03:17 <edleafe> #link Correct allocations for moves https://review.openstack.org/#/c/488510/
14:03:19 * alex_xu waves late
14:03:21 <edleafe> jaypipes, dansmith - comments on that series?
14:04:04 <jaypipes> edleafe: slow slogging but making progress.
14:04:18 <jaypipes> edleafe: agree with alex_xu's comments about lack of testing of rebuild/evacuate
14:04:27 <jaypipes> edleafe: will be focusing on this today
14:04:42 <mriedem> rebuild avoids the scheduler so should be ok
14:04:44 <mriedem> it's a noop claim
14:04:45 <edleafe> jaypipes: ok. Is there anything we can help with (other than reviews)?
14:05:01 <jaypipes> mriedem: but it will not hit the "auto-heal" thing in Pike computes..
14:05:09 <alex_xu> yea, rebuild is ok
14:05:11 <jaypipes> mriedem: in any case, we can discuss on #openstack-nova
14:05:28 <mriedem> oh that's a concern for the patch
14:05:31 <mriedem> nvm me then
14:05:41 <edleafe> #link Amend spec for Custom Resource Classes in Flavors https://review.openstack.org/#/c/481748/
14:05:44 <edleafe> Has a +2; needs another nova-specs core
14:05:51 <edleafe> #link Ironic Flavor Migration https://review.openstack.org/#/c/487954/
14:05:54 <edleafe> Finally has approval of Ironic people. Should be good to go.
14:06:29 <edleafe> #link Improve handling of ironicclient failures https://review.openstack.org/#/c/487925/
14:06:32 <edleafe> Keeps failing a few tests in Jenkins, but never locally.
14:06:35 <edleafe> At a loss figuring out the problem.
14:06:54 <edleafe> Additional eyeballs on that might provide the solution
14:07:16 * dtantsur was not aware of that, will check
14:07:23 <edleafe> thx
14:07:34 <edleafe> #link Devstack to use resource classes by default https://review.openstack.org/#/c/476968/
14:07:37 <edleafe> dtantsur is still working out some issues with this.
14:07:59 <dtantsur> yeah, not much progress. I don't really understand why it fails.
14:08:12 <dtantsur> on the bright side, a person told us that roughly the same thing worked locally for them
14:08:23 <dtantsur> sooo.. maybe missing 'sleep 10' somewhere? ;)
14:08:29 <edleafe> heh
14:08:50 <edleafe> "did you try unplugging it and then plugging it back in?"
14:08:56 <dtantsur> right :D
14:09:08 <edleafe> The next few traits patches are on hold until Queens
14:09:08 <edleafe> #link Traits support in the Allocation Candidates https://review.openstack.org/478464/
14:09:11 <edleafe> #link Add traits to the ResourceProviders filters https://review.openstack.org/#/c/474602/
14:09:14 <edleafe> #link Refactor of trait filtering SQLA https://review.openstack.org/#/c/489206/
14:09:17 <edleafe> #link Correct resource provider set_traits() https://review.openstack.org/#/c/489205/
14:09:28 <edleafe> #link Nested Resource Providers series starting with https://review.openstack.org/#/c/470575/
14:09:31 <edleafe> Also on hold until Queens
14:09:36 <edleafe> #link Placement api-ref docs https://review.openstack.org/#/q/topic:cd/placement-api-ref+status:open
14:09:39 <edleafe> Looking better and better!
14:09:53 <edleafe> Anything else for specs/reviews?
14:10:25 <cdent> there’s lots of minor bug fixes from the most recent rp/update that still need attention
14:10:27 <mriedem> i've also got a thing started
14:10:36 <cdent> even though that rp/update is more than a week old
14:10:42 <mriedem> #link shared storage functional resize test WIP https://review.openstack.org/#/c/490733/
14:11:01 <mriedem> ^ has flushed out some other issues beyond just that bug it's trying to test
14:11:19 <mriedem> the main issue being we don't sum/max shared storage allocation in the scheduler
14:11:29 <mriedem> so if you resize disk up, we don't actually allocate that during the move in the scheduler
14:11:33 <edleafe> mriedem: functional tests tend to do that :)
14:11:45 * cdent edleafe burn
14:11:59 * cdent sizzles
14:12:10 * edleafe hates trusting unit tests
14:12:21 <mriedem> anyway,
14:12:35 <mriedem> i don't think it's something we were focusing on until after jay and dan's changes for the move stuff anyway
14:12:46 <mriedem> the fixes, at least for me locally, get messy
14:13:01 <mriedem> so shared storage is probably going to be a known issue in the release notes, as in we know it doesn't work
14:13:36 <cdent> messy how? a simple sum of everything not workable?
14:13:45 <edleafe> yeah, it would be nice to map out all the use cases to make sure we have them covered
14:14:13 <mriedem> the merge/sum in the scheduler isn't too bad,
14:14:25 <mriedem> it's the report client that gets bad https://review.openstack.org/#/c/491098/1
14:15:03 <mriedem> starting around https://review.openstack.org/#/c/491098/1/nova/scheduler/client/report.py@1019
14:15:16 <mriedem> the scheduler report client has to figure out if we're using shared storage for the instance already,
14:15:26 <mriedem> and if that shared storage is related to the compute node that's updating the allocations,
14:15:36 <mriedem> so it avoids overwriting the disk using the local compute node
14:15:51 <mriedem> but that doesn't handle shared storage associated with the dest node
14:16:07 <edleafe> seems like too many parts need to have knowledge about other parts
14:16:12 <cdent> yeah
14:16:29 <mriedem> i've been thinking the _allocate_for_instance code should just add/update allocations for the local compute node unless told otherwise via delete_allocations_for_instance
14:16:42 <mriedem> but anyway, like i said, it gets messy
14:16:52 <cdent> can we instead: reverse the allocation structure so it is keyed by resource class with value of rp uuid and quantity
14:17:02 <cdent> and then loop across both
14:17:07 <edleafe> and it will stay messy until the dansmith/jaypipes patches settle
14:17:34 <cdent> if source rp uuid == target rp uuid sum
14:17:42 <cdent> (into a new reversed alloc structure)
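[editor's note: a minimal sketch of the merge cdent describes above — re-key allocations by resource class, walk source and dest together, and sum when the same provider appears on both sides (the resize-to-same-host case). The dict shapes and function names are illustrative, not Nova's actual report client API.]

```python
# Hypothetical sketch of the "reversed allocation structure" idea.
# Input shape assumed: {rp_uuid: {resource_class: amount}}.

def reverse_allocations(allocations):
    """Re-key {rp_uuid: {rc: amount}} as {rc: [(rp_uuid, amount)]}."""
    by_rc = {}
    for rp_uuid, resources in allocations.items():
        for rc, amount in resources.items():
            by_rc.setdefault(rc, []).append((rp_uuid, amount))
    return by_rc


def merge_move_allocations(source, dest):
    """Merge source and dest allocations for a move; amounts against the
    same (resource class, provider) pair are summed, so a resize to the
    same host naturally doubles up instead of overwriting."""
    merged = {}
    for allocs in (reverse_allocations(source), reverse_allocations(dest)):
        for rc, entries in allocs.items():
            for rp_uuid, amount in entries:
                key = (rc, rp_uuid)
                merged[key] = merged.get(key, 0) + amount
    # Re-key back into the {rp_uuid: {rc: amount}} shape for placement.
    result = {}
    for (rc, rp_uuid), amount in merged.items():
        result.setdefault(rp_uuid, {})[rc] = amount
    return result
```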
14:18:03 <cdent> sorry, I’m still back on the claim, not the unwind
14:18:12 <cdent> but I still think changing the data structure would help
14:18:20 <edleafe> cdent: it seems like it would be cleaner to have multiple allocations for the same RP/consumer
14:18:33 <edleafe> i.e., one for source, one for dest
14:18:37 <jaypipes> edleafe: how would you differentiate them?
14:18:46 <edleafe> so resize to same host would effectively sum
14:18:47 <cdent> edleafe: yes, but I think we declared that off limits until after pike
14:18:55 <edleafe> jaypipes: why would you need to know that
14:19:09 <edleafe> cdent: sure, I'm just throwing that out there
14:19:16 <cdent> it’s a good pitch
14:19:17 <jaypipes> edleafe: to know which one to "free" on revert or confirm
14:19:44 <edleafe> jaypipes: from old flavor/new flavor?
14:19:58 <jaypipes> edleafe: dansmith had a good idea of allocating the resources against a consumer ID of the migration UUID (once migrations get a UUID)
14:20:22 <edleafe> jaypipes: yeah, we talked about that a while ago
14:20:23 <bauzas> just a side note that not all migrations have a migration object
14:20:30 <dansmith> eh?
14:20:32 <dansmith> they better
14:20:32 <edleafe> until we realized that migrations don't have a uuid :)
14:20:41 <jaypipes> heh, yeah
14:20:45 <bauzas> dansmith: not live-migrations AFAIK
14:20:50 * cdent blinks
14:20:53 <dansmith> yeah, I added it
14:20:58 <bauzas> oh you did ?
14:21:01 <dansmith> a while ago
14:21:04 <bauzas> excellent, missed that
14:21:11 * edleafe wipes his brow
14:21:30 <mriedem> i also think tracking the migrations allocations via the migration uuid will also help sort things out
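[editor's note: a toy illustration of the migration-UUID-as-consumer idea discussed above — hold the source-node allocation against the migration's UUID and the dest-node allocation against the instance's UUID, so confirm/revert know exactly which consumer's allocation to free. The function names and the flat consumer-keyed dict are invented for illustration; this is not Nova code.]

```python
# allocations shape assumed: {consumer_uuid: {rp_uuid: {rc: amount}}}

def start_move(allocations, instance_uuid, migration_uuid,
               source_alloc, dest_alloc):
    # Re-home the existing allocation onto the migration consumer and
    # claim the destination against the instance consumer.
    allocations[migration_uuid] = source_alloc
    allocations[instance_uuid] = dest_alloc


def confirm_move(allocations, migration_uuid):
    # Confirm: the instance keeps the dest allocation; free the source
    # by deleting the migration consumer's allocation.
    allocations.pop(migration_uuid, None)


def revert_move(allocations, instance_uuid, migration_uuid):
    # Revert: move the source allocation back onto the instance consumer,
    # dropping the dest allocation it was holding.
    allocations[instance_uuid] = allocations.pop(migration_uuid)
```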
14:21:48 <bauzas> seriously, the long-standing bugfix from nikola confused me
14:21:53 <bauzas> I agree
14:22:02 <mriedem> so we need to limp along in pike until we have that i guess
14:22:17 <edleafe> Ready to move on to bugs?
14:22:28 * cdent nods
14:22:33 <edleafe> #topic Bugs
14:22:34 <edleafe> #link Placement bugs https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:22:35 <bauzas> I seriously didn't push for that idea because I thought we weren't creating migration objects for live-mig
14:22:36 <edleafe> Several new bugs as we start to pound on the code harder.
14:22:42 * bauzas facepalms
14:22:43 <edleafe> One critical bug: migration of single instance from multi-instance request spec fails with IndexError https://bugs.launchpad.net/nova/+bug/1708961
14:22:43 <openstack> Launchpad bug 1708961 in OpenStack Compute (nova) "migration of single instance from multi-instance request spec fails with IndexError" [Critical,In progress] - Assigned to Sylvain Bauza (sylvain-bauza)
14:22:45 <edleafe> bauzas has a fix in the works:
14:22:48 <edleafe> #link https://review.openstack.org/#/c/491439/
14:23:02 <edleafe> bauzas: comment on that fix?
14:23:17 <bauzas> I marked it critical as I considered it a serious regression
14:23:27 <bauzas> nothing but what's in the commit msg
14:23:45 <bauzas> and the fact that I missed it when I reviewed, so I blame myself
14:24:21 <edleafe> ok, we'll follow up on that after the meeting
14:24:26 <jaypipes> gah, yet another poop accident just had to clean up... ffs, today sucks.
14:24:42 <mriedem> from a dog right?
14:24:48 <cdent> jaypipes: maybe you need a cage
14:24:57 <edleafe> jaypipes: you've made me not care about the flash floods waiting to carry my house away
14:25:12 <jaypipes> mriedem: no, from Julie.
14:25:21 <bauzas> jaypipes: I sincerely hope you don't have a Roomba
14:25:28 <jaypipes> bauzas: lol, no.
14:25:43 * bauzas checked.
14:25:52 <mriedem> jaypipes: in sickness and in health...
14:25:59 <jaypipes> alright, moving on.
14:26:44 <edleafe> jaypipes: http://imgur.com/a/6ympr
14:27:02 <jaypipes> yikes
14:27:14 * edleafe is waiting for the electricity to go out any minute...
14:27:36 <edleafe> So before I lose internet, anything else for bugs?
14:27:49 <cdent> I’m going to make some more
14:27:55 <cdent> I hope everyone else will too
14:28:07 <edleafe> cdent: doing what you do best
14:28:21 <edleafe> #topic Open Discussion
14:28:32 <dtantsur> I have an ironic question :)
14:28:37 <edleafe> go for it
14:28:47 <dtantsur> when should we recommend that people 1. populate resource_class in ironic, 2. switch flavors?
14:28:52 <dtantsur> I suspect the latter is for Queens
14:29:09 <mriedem> correct
14:29:10 <dtantsur> so my question is more about the former. should we already put in our upgrade notes that resource_class has to be populated
14:29:11 <edleafe> dtantsur: populate resource_class now
14:29:19 <dtantsur> now = before upgrade to Pike or after?
14:29:29 <jaypipes> dtantsur: I would say now.
14:29:49 <jaypipes> dtantsur: in other words, make resource class required *in Pike*
14:30:15 <edleafe> dtantsur: it should work if they add during Pike, but if you say that, they'll probably trip on some edge cases
14:30:24 <dtantsur> okay, so I put something like "Before upgrade to Pike, you have to populate node.resource_class", right?
14:30:35 <edleafe> ...for all nodes
14:30:40 <bauzas> from an ironic perspective, I'd say yup
14:31:10 <jaypipes> mriedem, cdent, edleafe, dansmith: so, I've been thinking... with all the latent issues we're finding in the resize/move space, would it be good for me to write a dev-ref doc specifically about what happens with placement, scheduler and compute nodes in the various resize operations?
14:31:24 <dtantsur> and the flavors should be updated during Pike, before Queens, right?
14:31:40 <edleafe> dtantsur: yes, updated flavor extra_specs will work
14:31:40 <cdent> jaypipes: yes, because it seems likely that in the act of writing it, it will make us/you think of gaps
14:31:50 <jaypipes> k
14:31:53 <dtantsur> awesome, thanks all
14:31:58 <edleafe> Pike supports both ways of requesting ironic resources
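[editor's note: a sketch of the mapping discussed above — an ironic node's resource_class becomes a CUSTOM_* placement resource class, which a flavor then requests via an extra spec while zeroing out the standard VCPU/RAM/disk requests. The normalization rule (upper-case, non-alphanumerics to underscores, CUSTOM_ prefix) follows the documented convention, but the helper names here are illustrative; check Nova's actual utilities before relying on them.]

```python
import re


def to_custom_rc(ironic_resource_class):
    """Normalize an ironic node.resource_class (e.g. 'baremetal-gold')
    into a placement custom resource class name."""
    norm = re.sub(r'[^A-Z0-9_]', '_', ironic_resource_class.upper())
    return 'CUSTOM_' + norm


def flavor_extra_specs(ironic_resource_class):
    """Extra specs for a flavor that requests exactly one node of the
    given class, overriding the legacy VCPU/RAM/disk amounts to zero so
    only the custom resource class drives scheduling."""
    return {
        'resources:%s' % to_custom_rc(ironic_resource_class): '1',
        'resources:VCPU': '0',
        'resources:MEMORY_MB': '0',
        'resources:DISK_GB': '0',
    }
```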
14:31:59 * dtantsur will ping someone to review the docs change
14:32:06 <mriedem> jaypipes: yeah agree, docs are always good
14:32:30 <edleafe> jaypipes: that might also serve as a guideline for the functional tests that will be needed
14:33:08 <jaypipes> ya
14:33:50 <bauzas> ++
14:34:01 <edleafe> ok, then:
14:34:03 <edleafe> #action jaypipes to write a dev-ref doc specifically about what happens with placement, scheduler and compute nodes in the various resize operations
14:34:55 <edleafe> Anything else for open discussion? How about talking about how much we love the ChanceScheduler?
14:35:15 * cdent writes EdScheduler
14:35:15 <dtantsur> oh, you've got a random failure scheduling filter? nice!
14:35:17 <dtantsur> :D
14:35:32 <cdent> def select_destinations(…): raise NoValidHost()
14:35:45 * edleafe remembers when ChanceScheduler was the *only* scheduler in Nova
14:35:58 <dtantsur> if random() > 0.5: raise NoValidHost()  <-- this is cloud, c'mon
14:36:41 <edleafe> I think we're done here. Thanks everyone!
14:36:44 <edleafe> #endmeeting