13:59:53 <n0ano> #startmeeting nova-scheduler
13:59:54 <openstack> Meeting started Mon Jan 11 13:59:53 2016 UTC and is due to finish in 60 minutes.  The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:59:55 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:59:57 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:02 <n0ano> anyone here to talk about the scheduler
14:00:41 <lxsli> o/
14:00:46 <tobasco> yes
14:00:51 <jaypipes> morning all
14:00:59 <carl_baldwin> o/
14:02:06 <lxsli> staff meeting...
14:02:26 <n0ano> lxsli, so, I'm in two meetings right now myself :-)
14:02:27 <bauzas> heya
14:02:31 <bauzas> \o
14:02:35 <jaypipes> n0ano: me too :)
14:02:45 <edleafe> o/
14:02:46 <n0ano> looks like we have quorum so let's go
14:02:57 <n0ano> #topic specs, BPs, patches
14:03:03 * bauzas attending his first '16 sched meeting \o/
14:03:19 * johnthetubaguy lurks
14:03:24 <n0ano> the big one here belongs to you jaypipes, what's happening with your resource providers BP?
14:04:25 <jaypipes> n0ano: I have split it into resource-classes, resource-providers, generic-resource-pools, and compute-node-inventory blueprints.
14:04:39 <jaypipes> n0ano: still working on compute-node-allocations blueprint, which is final one in series.
14:04:40 <bauzas> and some of them are being implemented, right?
14:05:06 <jaypipes> bauzas: yes, cdent is working on resource-classes (patches pushed already I believe) and we are working on others slowly
14:05:16 <bauzas> coolness
14:05:18 <cdent> oh, hi, I'm here
14:05:26 <jaypipes> I apologize for not being able to attend these meetings for a couple months :(
14:05:28 <cdent> yeah: I pushed some resource-classes stuff
14:05:33 <jaypipes> and for being so slow on the spec stuff :(
14:05:41 <cdent> but have held off on the next step waiting on the specs to progress
14:05:43 <bauzas> I'd advocate for using the etherpad of doom then :)
14:05:59 <n0ano> jaypipes, NP, my only concern is we're beyond feature freeze, do we need to get exceptions for these 5 BPs?
14:06:02 <bauzas> ie. https://etherpad.openstack.org/p/mitaka-nova-priorities-tracking
14:06:06 <bauzas> n0ano: no
14:06:17 <bauzas> n0ano: no exceptions are possible now
14:07:05 <n0ano> bauzas, then what's the implication for Mitaka?
14:07:39 <bauzas> n0ano: we can continue to work, but no merges
14:07:45 <jaypipes> n0ano: need to chat with johnthetubaguy about it. I was under the impression that the resource-providers work was a bit of an exception to the rule.
14:07:57 <n0ano> jaypipes, +1
14:08:26 <bauzas> cdent: jaypipes: like I said above, would it be possible to mention all the specs + implems in https://etherpad.openstack.org/p/mitaka-nova-priorities-tracking ?
14:08:30 <johnthetubaguy> if we get it reviewed, and folks to work on the follow-up bits, that's OK I think
14:08:41 * bauzas was pretty dotted-line during the past weeks
14:08:48 <jaypipes> bauzas: yes. I will do that ASAP.
14:09:04 <johnthetubaguy> yeah, in that etherpad sounds like a good plan
14:09:12 <n0ano> johnthetubaguy, do we have a hard cutoff date for getting the BPs approved?
14:09:29 <johnthetubaguy> it was a few months back
14:09:36 <cdent> :)
14:10:01 <johnthetubaguy> but the process is here to help, so if we need exceptions, and everyone wants it, lets do what we can
14:10:15 <n0ano> so no hard date, we'll just aim for ASAP
14:10:38 <bauzas> honestly, I'm not that concerned by the Mitaka freeze
14:10:43 <johnthetubaguy> if it were not a priority item that's blocking so many things, it would be a hard no, but I'm willing to see what other folks think for this
14:11:07 <n0ano> so, if jaypipes  can update the epad with links to all the BPs we'll try and get them reviewed and approved as soon as possible
14:11:10 <bauzas> ie. we should be able to accept specs for Nxxx sooner or later
14:11:41 <bauzas> and we could just make sure that if we get enough stuff stable, it could be iterated very quickly for landing in N before Summit
14:11:45 <bauzas> like we did for ReqSpec
14:12:10 <bauzas> no precise need for bypassing what was agreed
14:12:54 <n0ano> I don't think we're bypassing so much as interpreting guidelines :-)
14:13:08 <n0ano> anyway, let's see if we can get it approved as soon as possible.
14:14:06 <n0ano> I think that's it for BPs, all others have either been approved or deferred to N, are there any specific patches people want to discuss today?
14:14:26 <cdent> I pushed a bug fix that needs some review: https://review.openstack.org/#/c/264349/
14:14:41 <cdent> was rated as "high"
14:14:57 <n0ano> cdent, excellent, let's all try to review it
14:16:09 <carl_baldwin> I pushed a backlog spec here that we talked about last week:  https://review.openstack.org/#/c/263898/
14:16:14 <bauzas> cdent: in my pipe
14:17:06 <n0ano> carl_baldwin, cool, you should also update the mitaka topics page at https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:17:55 <carl_baldwin> n0ano: See L73
14:18:05 <johnthetubaguy> carl_baldwin: I was talking with armax about making sure we find time for the neutron related things, its on my mental list as well
14:18:20 <n0ano> carl_baldwin, I'm blind, I didn't scroll down :-)
14:18:25 <carl_baldwin> johnthetubaguy: many thanks
14:18:33 <carl_baldwin> n0ano: No worries, it is a busy agenda.
14:19:00 <n0ano> kind of segues into the next topic
14:19:06 <n0ano> #topic mid-cycle meetup
14:20:01 <n0ano> given that the neutron scheduling is on one day and a general scheduler topic is on day one, I think we're covered unless anyone disagrees
14:20:13 <bauzas> I feel so
14:20:55 <n0ano> I should note that scheduler testing is a suggested topic but not on a specific day yet, hopefully that won't be forgotten
14:21:54 <johnthetubaguy> should be good I think
14:22:16 <n0ano> OK, moving on
14:22:22 <n0ano> #topic bugs
14:22:41 <cdent> I was looking into working on this: https://bugs.launchpad.net/nova/+bug/1431291
14:22:42 <openstack> Launchpad bug 1431291 in OpenStack Compute (nova) "Scheduler Failures are no longer logged with enough detail for a site admin to do problem determination" [High,Confirmed] - Assigned to Pranav Salunke (dguitarbite)
14:22:48 <n0ano> I note that the current list is still at 38 (37 once we push cdent's fix)
14:23:02 <cdent> but it feels like reality and the bug have gone in different directions since the bug was created
14:23:24 <cdent> so I was wondering if someone with a bit more experience could make some statement about the current desired functionality (on the bug)?
14:23:33 <bauzas> cdent: I feel it would be worth discussing that at the midcycle
14:23:51 <bauzas> cdent: because it's more a placeholder bug saying "meh, logs suck"
14:23:59 <cdent> Yeah, it kinda felt that way.
14:24:08 <cdent> Is there something more tractable I could/should work on in the meantime?
14:24:10 <n0ano> personally I don't have a problem with closing bugs out with `NOTABUG'
14:24:27 <bauzas> cdent: I guess you saw johnthetubaguy's mentions of what has been done in the past with that bug ?
14:24:30 <tobasco> I have a bug I would like some insight into, I recently marked mine as a duplicate and posted a comment in this one https://bugs.launchpad.net/nova/+bug/1469179 can't really tell if it's the correct room but I was directed here by johnthetubaguy
14:24:31 <openstack> Launchpad bug 1469179 in OpenStack Compute (nova) "instance.root_gb should be 0 for volume-backed instances" [Undecided,In progress] - Assigned to Feodor Tersin (ftersin)
14:24:38 <cdent> bauzas: yes
14:25:09 <bauzas> n0ano: agreed, I feel we could close that one by mentioning what has been done already and ask to reopen if something more specific is needed
14:25:22 <n0ano> bauzas, +1
14:25:27 <johnthetubaguy> bauzas: +1
14:25:35 <johnthetubaguy> feels out of date post last release
14:25:41 <bauzas> okay, I can do that
14:28:02 <n0ano> tobasco, johnthetubaguy I'm not seeing where this bug is scheduler related, I don't know if anyone here has any insight on this
14:28:54 <johnthetubaguy> ah, sorry
14:28:59 <johnthetubaguy> looking again
14:29:13 <johnthetubaguy> this is more resource tracker
14:29:28 <bauzas> that's even a driver-specific problem, nope ?
14:29:37 <johnthetubaguy> I was thinking the resource providers work, so we track shared storage better, would improve things
14:29:39 <bauzas> because the RT is dumb
14:29:47 <n0ano> johnthetubaguy, which we are effectively re-writing so maybe this bug might become irrelevant
14:29:54 <johnthetubaguy> I think this is cinder volumes confusing local storage space
14:29:55 <bauzas> it only persists what the driver is providing to it
14:30:10 <bauzas> n0ano: not exactly
14:30:31 <n0ano> bauzas, but sounds like this might be cinder related then
14:30:31 <johnthetubaguy> so I think most folks doing cinder only would disable the diskfilter and not spot the bug
14:30:32 <bauzas> n0ano: IIUC, the problem is about what's provided as disk space when you have a volume-backed instance
14:30:40 <tobasco> bauzas: Nova is reading root_gb for cinder volumes so effectively this is the resource tracker which creates the wrong stats
14:30:44 <johnthetubaguy> but if you mix and match, it affects you, I guess
14:31:08 <bauzas> so, to be clear, possibly something has to be done on the RT classes
14:31:29 <tobasco> At the same time this affects the scheduler since the DiskFilter combined with the wrong stats creates bad scheduling
14:31:49 <bauzas> unless I'm wrong, the DiskFilter is not a default filter
14:31:53 <tobasco> Only solution so far is to exclude DiskFilter or set the disk_allocation_ratio config value
14:32:04 <tobasco> bauzas: The DiskFilter was introduced as a default in liberty iirc
14:32:08 <johnthetubaguy> bauzas: it got added, I think
14:32:21 <johnthetubaguy> it might have been dropped since then, lol
14:32:22 <n0ano> tobasco, affects the scheduler but it's still doing the right thing; give it the wrong data and it will correctly make the wrong decision
14:32:26 <bauzas> oh snap, it is indeed
14:32:37 <tobasco> n0ano: yes
14:32:52 <bauzas> so, yeah, the scheduler only schedules based on what the RT provides
14:32:52 <johnthetubaguy> so my thought was about resource providers
14:32:57 <bauzas> ++
14:33:02 <tobasco> I just want to shine some more light on this since running a completely Cinder-backed Nova today gives quite a headache
14:33:07 <johnthetubaguy> part of that is deciding the cinder vs local resources
14:33:24 <bauzas> zactly, we need jaypipes and cdent at work
14:33:35 <n0ano> should tobasco maybe have a discussion with some of the cinder people about this?
14:33:36 <cdent> totes
14:33:44 <johnthetubaguy> tobasco: you probably should just disable the disk filter if you are doing that mind, or is there something missing?
14:33:59 <johnthetubaguy> not sure we need cinder folk for this, seems a very Nova issue here
14:34:00 <bauzas> a possible workaround could be docs
14:34:15 <bauzas> ie. specify that DiskFilter should be disabled when using Cinder-backed volumes
14:34:23 <bauzas> as a known limitation
14:34:28 <johnthetubaguy> if there are no local disks in the BDM, we should ignore the flavor disk_gb in the resource tracker claim, and the scheduler stuff
14:34:28 <tobasco> johnthetubaguy: In our production environment I have disabled the DiskFilter, however this still gives me the wrong stats and should be fixed imo since it's wrong, or at least have a look at it.
14:34:49 <johnthetubaguy> tobasco: ok, true, it's more that it changes the priority
14:35:24 <bauzas> well, I guess the resource-providers epic already has lots of attention
14:35:32 <tobasco> For example, we run our hypervisors with very little local disk, and having the wrong statistics on what's actually on local disk is quite scary for people who don't know about this.
14:35:43 <bauzas> I feel it's a real bug
14:35:47 <bauzas> with a real problem
14:36:01 <bauzas> and with a possible solution that could be resource-providers
14:36:07 <n0ano> bauzas, yeah but it sounds like this is not really a RT problem, it's being given the wrong data
14:36:25 <bauzas> n0ano: if so, it's not nova
14:36:56 <tobasco> I agree with n0ano, if only instances processed by the resource tracker and in the database had root_gb set to 0, neither the scheduler nor the stats would have any issues
14:37:17 <tobasco> root_gb = 0 for Cinder-backed Nova instances, that is to say, for the root disk.
14:37:30 <n0ano> tobasco, but who should be setting it to 0, nova or cinder?
14:37:46 <bauzas> I remember some conditional in the RT about root_gb, hold on
14:38:51 <bauzas> meh, nevermind
14:38:58 <tobasco> n0ano: I saw some comment about the scheduler using the compute API is_volume_backed_instance function to check and set root_gb to zero, however I assume this is only theoretical and not tested. I can't really be of very much help; I can troubleshoot code but I'm not yet familiar enough with the OpenStack concepts and codebase to help out.
14:39:49 <tobasco> I would be more than happy to help out and discuss this issue further with other people if needed. I have seen a lot of people having this issue, including myself, so I would like to shine some light on it and get it resolved, that's all :)
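A minimal sketch of the idea tobasco describes: zero out the flavor's root disk when the root device is a Cinder volume, so neither the resource tracker nor the scheduler counts it as local disk. Only is_volume_backed_instance() comes from the discussion; the helper around it is hypothetical, not the actual Nova code path.

```python
# Illustrative only -- not actual Nova code. The idea from the bug
# discussion: a volume-backed instance should contribute 0 GB of local
# root disk to the resource tracker stats and to scheduling decisions.

def effective_root_gb(compute_api, context, instance, bdms=None):
    """Local root disk (GB) an instance really consumes."""
    # nova.compute.api.API exposes is_volume_backed_instance(); the rest
    # of this helper is a sketch.
    if compute_api.is_volume_backed_instance(context, instance, bdms):
        return 0  # root disk lives on a Cinder volume, no local space used
    return instance.root_gb
```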
14:39:54 <johnthetubaguy> bauzas: so it might be the spec object population code
14:39:57 <johnthetubaguy> bauzas: https://github.com/openstack/nova/blob/master/nova/scheduler/filters/disk_filter.py#L36
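The linked check derives the disk request from the flavor, which is why a volume-backed instance with a large root_gb can skew placement. Below is a rough paraphrase of that logic (not a verbatim copy of the linked code), with the operator workarounds tobasco mentioned noted in comments.

```python
# Rough paraphrase of the DiskFilter check linked above, for illustration
# only; see the linked source for the real code.
def host_passes(host_state, spec_obj, disk_allocation_ratio=1.0):
    # The request is taken from the flavor, so a Cinder-backed root disk
    # is still counted as local disk -- the bug under discussion.
    requested_mb = (1024 * (spec_obj.root_gb + spec_obj.ephemeral_gb)
                    + spec_obj.swap)

    total_mb = host_state.total_usable_disk_gb * 1024
    used_mb = total_mb - host_state.free_disk_mb
    # Workarounds from the discussion: remove DiskFilter from
    # scheduler_default_filters, or raise disk_allocation_ratio so this
    # limit stops biting.
    limit_mb = total_mb * disk_allocation_ratio
    return limit_mb - used_mb >= requested_mb
```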
14:40:17 <johnthetubaguy> but yeah, lets take this into the openstack-nova channel I guess
14:40:31 <bauzas> +1
14:40:34 <n0ano> tobasco, having the scheduler do this check seems wrong, I'd like to see what cinder says about this
14:40:42 <n0ano> johnthetubaguy, +1 to the nova channel
14:41:19 <bauzas> yeah, I'd be very against any change in the scheduler codebase
14:41:30 <n0ano> tobasco, tnx for bringing this up but I think you'll get a better answer on #nova
14:41:34 <bauzas> IMHO, it's a resource issue, not a placement decision issue
14:41:38 <johnthetubaguy> n0ano: its the spec object population, rather than the scheduler, but yeah, lets talk about that over the other side
14:41:39 <tobasco> Ok, I'm satisfied. johnthetubaguy, can you please help me take this further, either by working on it or giving me some hints on how to proceed? I would be glad to help out.
14:42:06 <bauzas> moving on ?
14:42:08 <johnthetubaguy> tobasco: yup, sounds like bauzas might be able to help us too ;-)
14:42:13 <n0ano> bauzas, indeed
14:42:15 <johnthetubaguy> aye
14:42:17 <n0ano> #topic opens
14:42:22 <tobasco> +1
14:42:27 <n0ano> that's all from me, anything new?
14:43:15 <cdent> I have a bit of a blue sky question: Has anybody done any experimentation with using pandas DataFrames as the scheduler's data structure?
14:43:22 * cdent is looking for prior art
14:43:30 * cdent doesn't have any immediate plans, just playing
14:43:52 <n0ano> cdent, ed leafe looked at using Cassandra but we dropped that work, talking to him might be good
14:44:22 <n0ano> cdent, there wasn't really much interest in changing the back end
14:44:28 <johnthetubaguy> the selection loop didn't seem that slow, compared to overall time in the scheduler, so I stopped looking at that bit of the scheduler
14:44:34 * cdent nods
14:44:47 <bauzas> what johnthetubaguy said
14:45:02 <bauzas> we want to have a scalable scheduler from zero to beyond infinite
14:45:02 <cdent> It's more of an exercise in understanding the conceptual stuff, not really coming up with a new solution
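Purely as an illustration of cdent's DataFrame idea (nothing here is Nova code, and the columns are made up), filtering and weighing hosts with pandas might look like this sketch:

```python
import pandas as pd

# Toy host table with made-up columns; a real experiment would build it
# from the scheduler's HostState objects.
hosts = pd.DataFrame([
    {"host": "node1", "free_ram_mb": 4096, "free_disk_gb": 80, "vcpus_free": 4},
    {"host": "node2", "free_ram_mb": 1024, "free_disk_gb": 10, "vcpus_free": 2},
    {"host": "node3", "free_ram_mb": 8192, "free_disk_gb": 40, "vcpus_free": 8},
])

request = {"ram_mb": 2048, "disk_gb": 20, "vcpus": 2}

# A "filter" becomes a vectorized boolean mask instead of a per-host loop.
candidates = hosts[(hosts.free_ram_mb >= request["ram_mb"]) &
                   (hosts.free_disk_gb >= request["disk_gb"]) &
                   (hosts.vcpus_free >= request["vcpus"])]

# "Weighing" is then just a sort on one or more columns.
ranked = candidates.sort_values("free_ram_mb", ascending=False)
print(ranked.host.tolist())
```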
14:45:13 <johnthetubaguy> there is a fun unit test
14:45:31 <bauzas> cdent: luckily, any out-of-tree implementation can be done
14:45:46 <bauzas> cdent: you just need to implement the interfaces and rock on
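A minimal sketch of what "implement the interfaces" means here, assuming the Mitaka-era driver contract where the class named by the scheduler_driver option provides select_destinations(); the exact base class and signature should be checked against nova.scheduler.driver in the release you target.

```python
# Hypothetical out-of-tree driver; the parent class and method signature
# are assumptions based on the Mitaka-era interface, not verified here.
from nova.scheduler import driver


class DataFrameScheduler(driver.Scheduler):
    """Toy driver that would keep its own view of hosts (e.g. a DataFrame)."""

    def select_destinations(self, context, spec_obj):
        # Refresh the host view, apply the request carried by spec_obj,
        # and return the chosen hosts in whatever format the scheduler
        # manager expects for this release.
        raise NotImplementedError("sketch only")
```

Pointing nova.conf's scheduler_driver option at such a class (a full import path in older releases, an entry point name in later ones) is how an out-of-tree driver would typically get loaded.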
14:45:50 * cdent nods
14:46:15 <n0ano> cdent, but expect issues if you want to merge such a thing back in
14:46:18 <edleafe> cdent: the main advantage of the Cassandra design was that the scheduler claimed the resources
14:46:31 * bauzas could joke on numpy dependency tho
14:46:34 <edleafe> cdent: it eliminated the raciness
14:47:21 <johnthetubaguy> cdent: https://github.com/openstack/nova/blob/master/nova/tests/unit/scheduler/test_caching_scheduler.py#L205
14:47:53 <cdent> thanks johnthetubaguy
14:48:02 <n0ano> anything else?
14:49:17 <n0ano> hearing more crickets
14:49:45 <cdent> I think we're done
14:49:51 <n0ano> tnx everyone, we'll talk again next week
14:49:54 <n0ano> #endmeeting