14:00:34 <n0ano> #startmeeting nova-scheduler
14:00:34 <openstack> Meeting started Mon Jan  4 14:00:34 2016 UTC and is due to finish in 60 minutes.  The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:37 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:43 <n0ano> anyone here to talk about the scheduler?
14:00:53 <edleafe> \o
14:00:57 <macsz> yup :)
14:01:03 <cdent> hello
14:01:04 <Yingxin> o/
14:01:18 <_gryf> \o
14:02:03 <carl_baldwin> o/
14:02:29 <n0ano> multiple new faces, welcome all
14:02:46 <n0ano> let's get started
14:02:52 <n0ano> #topic Spec & BPs & Patches - https://etherpad.openstack.org/p/mitaka-nova-spec-review-tracking
14:03:39 <n0ano> I was going over the scheduler priorities and noticed that some of the specs are struck out but still not approved
14:04:21 <n0ano> one of which is yours, edleafe, the configurable tracker - is that on track?
14:04:54 <edleafe> n0ano: no, that one has been a victim of a catch-22
14:05:15 <edleafe> can't make it configurable until there is more than one in-tree option
14:05:29 <edleafe> can't add an option in-tree unless it can be used
14:05:32 <edleafe> :)
14:06:11 <n0ano> well, that's kind of unfortunate, so we just leave that in limbo or should we officially drop it?
14:06:16 <edleafe> that's why it's crossed out
14:06:27 <edleafe> drop it
14:06:37 <edleafe> it's dead
14:07:17 <n0ano> OK, sad but so be it, I guess we can come back to the idea if it's necessary in the future
14:08:09 <edleafe> the reality is that it *could* be added, but it would need a majority supporting the idea of radical changes to the scheduler
14:08:19 <edleafe> and that's not going to happen
14:08:52 <n0ano> edleafe, is this something to discuss at the next summit or do you think it's just not possible now?
14:09:31 <edleafe> n0ano: I'd be happy to discuss it with anyone, but the general mood is "asked and answered".
14:10:19 <n0ano> hmm, hitting your head against a wall is so much fun, oh well
14:11:15 <n0ano> OK, the other spec I thought I'd mention is jay's resource providers, https://review.openstack.org/#/c/225546/3
14:11:52 <n0ano> he updated the spec late last week and we need reviews on it, so just a gentle reminder to review this spec if you have time
14:11:52 <cdent> That changed a lot in the past week
14:12:04 <n0ano> cdent, indeed
14:12:39 <edleafe> I have it in my queue. My very, very long queue...
14:13:02 <n0ano> understood, just want to make sure everyone is aware of it.
14:14:26 <n0ano> those were the ones that popped out to me, any other specs/patches people want to talk about?
14:14:50 <Yingxin> I've implemented the bp host-state-level-locking
14:14:58 <cdent> Yingxin++
14:15:02 <cdent> some good work in that stuff
14:15:09 <Yingxin> there's a lot of changes in merging the claim logic
14:15:19 <n0ano> Yingxin, do you have a link to that?
14:15:21 <Yingxin> hoping I'm doing it the right way
14:15:30 <_gryf> n0ano, from my side, I was just wondering about the status on resource objects that jay is working on, but i assume it's all about the reviews…
14:15:30 <Yingxin> https://review.openstack.org/#/q/topic:bp/host-state-level-locking
14:15:43 <n0ano> _gryf, +1
14:16:52 <n0ano> Yingxin, so you need all 10 of those specs approved & merged, right?
14:17:17 <Yingxin> yes
14:17:34 <n0ano> looks like the review queue just got larger :-)
14:18:23 <Yingxin> :) I'll help do more reviews
14:19:15 <n0ano> OK, let's move on
14:19:22 <n0ano> #topic Mid-cycle meetup
14:19:23 <johnthetubaguy> on the spec front, we are technically closed for Mitaka specs right now
14:19:53 <johnthetubaguy> I do have an eye towards Jay's specs though, assuming we have the bandwidth for those
14:19:58 <n0ano> johnthetubaguy, I assumed that was the case but I thought we'd get an exception on Jay's
14:20:06 <n0ano> s/get/try to get
14:20:48 <johnthetubaguy> yeah, if there is consensus and bandwidth, it seems something we should consider, but we feel very oversubscribed right now
14:21:11 <johnthetubaguy> anyways, sounds like thats moving, which is all good no matter which release we can fit it into
14:21:29 <n0ano> let's hope we have the BW, it is rather important to the scheduler improvements
14:21:55 <n0ano> anyway, mid-cycle
14:22:34 <n0ano> it's coming up soon, end of this month, in beautiful, sunny England - wondering who is planning to attend (I won't make it this time)
14:22:53 <cdent> I'll be there.
14:23:05 <_gryf> i'll be there also
14:23:16 <johnthetubaguy> I will be there
14:23:22 <cdent> If it continues raining like it has been, Bristol won't be.
14:23:22 <n0ano> johnthetubaguy, :-)
14:23:50 <johnthetubaguy> cdent: totally possible
14:24:12 <n0ano> good, sounds like we will have reasonable representation, we should think about topics we'd like discussed
14:24:29 <n0ano> Jay's resource tracker would be high on my list, anything else?
14:24:38 <carl_baldwin> I will be there hoping to discuss scheduling and Neutron
14:24:39 <edleafe> I won't be able to attend
14:24:56 <n0ano> edleafe, bummer
14:25:22 <n0ano> carl_baldwin, good subject, do you have some thoughts on that (don't need them now but we should get prepared in the next week or so)
14:25:24 <cdent> I'd like to get scheduler testing onto the midcycle agenda, at least some brain sharing
14:25:24 <johnthetubaguy> carl_baldwin: ah, thats a good topic to get started, so we have concrete things for the summit
14:26:02 <carl_baldwin> n0ano: waking up from a long vacation this morning but I can share thoughts on that a bit later
14:26:22 <n0ano> carl_baldwin, sure, maybe next week would be perfect
14:26:30 <carl_baldwin> n0ano: johnthetubaguy what is the best way to share them?
14:26:45 <johnthetubaguy> a backlog spec would work for me, a short one
14:27:09 <johnthetubaguy> http://specs.openstack.org/openstack/nova-specs/specs/backlog/index.html
14:27:09 <carl_baldwin> n0ano: johnthetubaguy sounds good, I'll prepare that this week
14:27:10 <n0ano> carl_baldwin, or a message to the mailing list works too
14:27:17 <Yingxin> cdent++ it's difficult to test scheduler performance or reproduce race conditions using current tests
14:27:42 * cdent nods
14:28:01 <johnthetubaguy> I have proposed a new scheduler via specs here: https://review.openstack.org/#/c/256323/
14:28:14 <n0ano> cdent, getting on the agenda is relatively easy, I assume johnthetubaguy will set up a whiteboard the first day, just want to make sure we know what we want to talk about going in
14:28:30 <johnthetubaguy> there is an etherpad up already, let me find the link
14:28:47 <johnthetubaguy> #link https://wiki.openstack.org/wiki/Sprints/NovaMitakaSprint
14:28:55 <johnthetubaguy> #link https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:28:59 <cdent> Yingxin, you want to add something there ^ or shall I?
14:29:40 <carl_baldwin> n0ano: thank you
14:29:55 <Yingxin> cdent: ok, I will
14:29:56 <johnthetubaguy> cdent: Yingxin: we have large ops and the nova functional test job that help with some of that; there are some other folks thinking about a third-party performance gate test, which is also related - good to cover that stuff
14:30:16 <cdent> Thanks Yingxin
14:31:23 <n0ano> sounds good, let's talk more about the mid-cycle next week
14:31:39 <n0ano> #topic https://review.openstack.org/#/c/256323/
14:32:13 <n0ano> I just checked the scheduler bug list and it hasn't gone up noticeably (but it hasn't gone down either)
14:32:46 <n0ano> I'm hoping the Intel San Antonio group will start working on the list but no progress yet
14:33:17 <n0ano> we all should check the list and try and work on a few bugs in our copious amounts of free time :-)
14:34:09 * cdent will try
14:34:27 <n0ano> cdent, tnx, that's all we're asking
14:34:32 <n0ano> anyway
14:34:36 <n0ano> #topic opens
14:34:46 <n0ano> Anyone have anything new for today?
14:35:02 <Yingxin> I've tried hard to reproduce a high-priority bug https://bugs.launchpad.net/nova/+bug/1341420 with detailed logs :-)
14:35:03 <openstack> Launchpad bug 1341420 in OpenStack Compute (nova) "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,In progress] - Assigned to Yingxin (cyx1231st)
14:36:37 <Yingxin> anyone interested in solving race conditions can have a look at it
14:37:19 <n0ano> races are fun, it might just take a long time testing to hit the bug, how long did you try?
14:37:52 <Yingxin> 4 days setting up the environment and trying many, many times
14:38:14 <Yingxin> that's why we need functional tests :)
14:38:19 <edleafe> Yingxin: I think that the attitude that the scheduler is "optimistic" is going to make focusing on fixing race conditions more difficult
14:39:07 <johnthetubaguy> there are two schedulers required here, I suspect
14:39:08 <n0ano> I guess the issue is what's the effect of the race, if it's just a failure that is handled by the retry mechanism then there's no problem
14:39:27 <johnthetubaguy> the current scheduler assumes there is a fair chunk of free space at all times
14:39:36 <Yingxin> Yeah, but it's still weird that even with only one scheduler the race problem still exists
14:40:02 <Yingxin> sorry about grammar issues
14:40:24 <n0ano> Yingxin, the race is between two components, compute & scheduler, there can still be races there
14:40:30 <johnthetubaguy> Yingxin: have you tried the caching_scheduler?
14:40:47 <johnthetubaguy> that reduces the window *a lot* when you have only one scheduler
14:41:15 <johnthetubaguy> enough to fix the issues we were hitting in production before we started using that, at least
14:41:40 <johnthetubaguy> n0ano you do make a very good point though, what are the errors the retry does fix
14:41:58 <johnthetubaguy> oops, s/does fix/does not fix/
14:42:48 <johnthetubaguy> Yingxin: to be more explicit, I am talking about this one: https://github.com/openstack/nova/blob/master/nova/scheduler/caching_scheduler.py
14:43:33 <n0ano> johnthetubaguy, we'd have to look at that, I know it handles 'here's a node, use it - then the use fails', not sure about cases where the scheduler thinks it can't find a node
14:43:44 <Yingxin> johnthetubaguy: yes, it will indeed be more efficient in the one-scheduler scenario.
14:44:25 <johnthetubaguy> Yingxin: the efficiency is not all it does - it means the decisions are stored in local memory between select_destination calls, to reduce the racing
14:44:33 <Yingxin> but the problem still exists according to my log analysis
14:45:03 <Yingxin> https://bugs.launchpad.net/nova/+bug/1341420/comments/24
14:45:03 <openstack> Launchpad bug 1341420 in OpenStack Compute (nova) "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,In progress] - Assigned to Yingxin (cyx1231st)
14:45:04 <johnthetubaguy> Yingxin: it's basically helped because of this call here: https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L145
14:45:49 <johnthetubaguy> the problem is still there, it's just much less of a problem, and the retry should fix up the remaining issues
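The pattern johnthetubaguy is describing can be sketched roughly as below, reusing the hypothetical HostState sketch earlier in the log: host states persist in the scheduler process between requests and have resources consumed from them immediately on selection (analogous to the consume_from_request call in filter_scheduler.py), shrinking but not eliminating the window before the real claim on the compute node. This is an illustration, not the actual caching_scheduler code.

    class CachingSchedulerSketch:
        """Hypothetical illustration of the caching scheduler pattern."""

        def __init__(self, host_states):
            # host_states persist between select_destination calls instead of
            # being rebuilt from the database for every request (refresh
            # logic omitted for brevity).
            self.host_states = host_states

        def select_destination(self, ram_mb, disk_gb, vcpus):
            for host in self.host_states:
                # Consume from the in-memory copy right away so later
                # requests handled by this scheduler see the updated totals.
                if host.claim(ram_mb, disk_gb, vcpus):
                    return host.host
            raise RuntimeError("No valid host found")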
14:45:59 <n0ano> looking at the bug I would think retry would apply, is this a bug or a comment that the current process is inefficient?
14:46:09 <johnthetubaguy> that goes back to n0ano's question, when do you have so many problems that the retry doesn't work?
14:47:16 <Yingxin> the retry would work, but it's inefficient to some extent
14:47:23 <johnthetubaguy> to be clear, the current approach is not a good idea if your cloud is often full, it's the wrong trade-off, and we should create a different scheduler for that - moving claims to the scheduler does something that is much closer to what you are wanting
14:47:32 <n0ano> I would also point out that it seems this is probably not a high bug, I'd drop it to medium if not low
14:48:03 <johnthetubaguy> Yingxin: only lots of testing will show that
14:48:06 <johnthetubaguy> n0ano: +1
14:48:12 <edleafe> Yingxin: inefficiency is by design :)
14:48:20 <johnthetubaguy> Yingxin: and it depends very much on your cloud
14:48:25 <n0ano> edleafe, +1 (unfortunately)
14:48:26 <cdent> edleafe: is that what "optimistic" means?
14:48:29 <johnthetubaguy> edleafe: yes, it is basically
14:48:35 <edleafe> cdent: pretty much
14:48:39 <cdent> gotchya
14:48:49 <edleafe> cdent: we expect it to fail, and handle it when it does
14:49:18 <cdent> And is the latency that is caused significant enough to be a real bear, or just a teddy bear?
14:49:28 <johnthetubaguy> depends how many retries you see
14:49:31 <n0ano> edleafe, I would substitute s/expect/tolerate/ but basically yes
14:49:39 <edleafe> cdent: as johnthetubaguy said, it depends on your cloud and the load placed on it
14:50:23 <johnthetubaguy> personally for us, it was well under 1% of builds that needed a retry, last time I checked, but yeah, "it depends" is the correct answer
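The "optimistic" select-then-claim-then-retry flow being discussed looks roughly like this; the function and its arguments are only an illustration of the pattern under the assumptions above, not the actual conductor or compute code paths.

    def build_with_retries(scheduler, compute_claim, request, max_attempts=3):
        # scheduler.select_destination picks a host from possibly stale data;
        # compute_claim(host, **request) stands in for the authoritative
        # claim on the compute node, which may fail if the host filled up
        # in the meantime.
        for _ in range(max_attempts):
            host = scheduler.select_destination(**request)
            if compute_claim(host, **request):
                return host
            # Lost the race on that host: reschedule with a fresh pick.
        raise RuntimeError("Exceeded max scheduling attempts")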
14:50:50 <johnthetubaguy> so the extra speed and simplicity I see as a win
14:51:15 <johnthetubaguy> actually I am attempting to make it simpler to understand with this patch: https://review.openstack.org/#/c/256323/
14:51:28 <johnthetubaguy> those weights are a pain in the back to understand properly
14:52:02 <johnthetubaguy> s/patch/suggestion/
14:53:40 <n0ano> OK, good discussion but we're approaching the top of the hour, any last minute points?
14:54:33 <n0ano> then I'll thank everyone and we'll talk next week
14:54:45 <n0ano> #endmeeting