14:00:34 <n0ano> #startmeeting nova-scheduler
14:00:34 <openstack> Meeting started Mon Jan 4 14:00:34 2016 UTC and is due to finish in 60 minutes. The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:37 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:43 <n0ano> anyone here to talk about the scheduler?
14:00:53 <edleafe> \o
14:00:57 <macsz> yup :)
14:01:03 <cdent> hello
14:01:04 <Yingxin> o/
14:01:18 <_gryf> \o
14:02:03 <carl_baldwin> o/
14:02:29 <n0ano> multiple new faces, welcome all
14:02:46 <n0ano> let's get started
14:02:52 <n0ano> #topic Specs & BPs & Patches - https://etherpad.openstack.org/p/mitaka-nova-spec-review-tracking
14:03:39 <n0ano> I was going over the scheduler priorities and noticed that some of the specs are struck out but still not approved
14:04:21 <n0ano> one of which is yours edleafe, the configurable tracker, is that on track?
14:04:54 <edleafe> n0ano: no, that one has been a victim of a catch-22
14:05:15 <edleafe> can't make it configurable until there is more than one in-tree option
14:05:29 <edleafe> can't add an option in-tree unless it can be used
14:05:32 <edleafe> :)
14:06:11 <n0ano> well, that's kind of unfortunate, so do we just leave that in limbo or should we officially drop it?
14:06:16 <edleafe> that's why it's crossed out
14:06:27 <edleafe> drop it
14:06:37 <edleafe> it's dead
14:07:17 <n0ano> OK, sad but so be it, I guess we can revisit the idea if it becomes necessary in the future
14:08:09 <edleafe> the reality is that it *could* be added, but it would need a majority supporting the idea of radical changes to the scheduler
14:08:19 <edleafe> and that's not going to happen
14:08:52 <n0ano> edleafe, is this something to discuss at the next summit or do you think it's just not possible now?
14:09:31 <edleafe> n0ano: I'd be happy to discuss it with anyone, but the general mood is "asked and answered".
14:10:19 <n0ano> hmm, hitting my head against a wall is so much fun, oh well
14:11:15 <n0ano> OK, the other spec I thought I'd mention is Jay's resource providers, https://review.openstack.org/#/c/225546/3
14:11:52 <n0ano> he updated the spec late last week and we need reviews on it, so just a gentle reminder to review this spec if you have time
14:11:52 <cdent> That changed a lot in the past week
14:12:04 <n0ano> cdent, indeed
14:12:39 <edleafe> I have it in my queue. My very, very long queue...
14:13:02 <n0ano> understood, just want to make sure everyone is aware of it.
14:14:26 <n0ano> those were the ones that popped out at me, any other specs/patches people want to talk about?
14:14:50 <Yingxin> I've implemented the bp host-state-level-locking
14:14:58 <cdent> Yingxin++
14:15:02 <cdent> some good work in that stuff
14:15:09 <Yingxin> there are a lot of changes involved in merging the claim logic
14:15:19 <n0ano> Yingxin, do you have a link to that?
14:15:21 <Yingxin> hoping I'm doing it the right way
14:15:30 <_gryf> n0ano, from my side, I was just wondering about the status of the resource objects that Jay is working on, but I assume it's all about the reviews…
14:15:30 <Yingxin> https://review.openstack.org/#/q/topic:bp/host-state-level-locking
14:15:43 <n0ano> _gryf, +1
14:16:52 <n0ano> Yingxin, so you need all 10 of those specs approved & merged, right?
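For readers following the host-state-level-locking blueprint Yingxin links above, here is a minimal sketch of the core idea: give each host state its own lock so that concurrent scheduling threads cannot both consume a host's last free slot. The class and method names below are illustrative assumptions, not Nova's actual code.

    # Minimal sketch of per-host-state locking, assuming a simplified
    # HostState that only tracks free RAM. Illustrative names only.
    import threading


    class HostState:
        def __init__(self, host, free_ram_mb):
            self.host = host
            self.free_ram_mb = free_ram_mb
            self.lock = threading.Lock()  # one lock per host state

        def claim(self, ram_mb):
            # Atomic check-and-consume: either the request fits and is
            # deducted, or the caller moves on to the next host.
            with self.lock:
                if ram_mb > self.free_ram_mb:
                    return False  # lost the race to another thread
                self.free_ram_mb -= ram_mb
                return True


    def schedule(host_states, ram_mb):
        # Try hosts in (already filtered and weighed) order; the per-host
        # lock makes the final fit check atomic within this process.
        for host_state in host_states:
            if host_state.claim(ram_mb):
                return host_state
        return None  # no valid host

Within a single scheduler process this removes the internal races; the scheduler-versus-compute race discussed later in the meeting still needs the compute-side claim, or claims moved into the scheduler as the blueprint proposes.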
14:17:17 <Yingxin> yes
14:17:34 <n0ano> looks like the review queue just got larger :-)
14:18:23 <Yingxin> :) I'll help do more reviews
14:19:15 <n0ano> OK, let's move on
14:19:22 <n0ano> #topic Mid-cycle meetup
14:19:23 <johnthetubaguy> on the spec front, we are technically closed for Mitaka specs right now
14:19:53 <johnthetubaguy> I do have an eye towards Jay's specs though, assuming we have the bandwidth for those
14:19:58 <n0ano> johnthetubaguy, I assumed that was the case but I thought we'd get an exception on Jay's
14:20:06 <n0ano> s/get/try to get
14:20:48 <johnthetubaguy> yeah, if there is consensus and bandwidth, it seems like something we should consider, but we feel very oversubscribed right now
14:21:11 <johnthetubaguy> anyways, sounds like that's moving, which is all good no matter which release we can fit it into
14:21:29 <n0ano> let's hope we have the BW, it is rather important to the scheduler improvements
14:21:55 <n0ano> anyway, mid-cycle
14:22:34 <n0ano> it's coming up soon, at the end of this month, in beautiful, sunny England; wondering who is planning to attend (I won't make it this time)
14:22:53 <cdent> I'll be there.
14:23:05 <_gryf> I'll be there also
14:23:16 <johnthetubaguy> I will be there
14:23:22 <cdent> If it continues raining like it has been, Bristol won't be.
14:23:22 <n0ano> johnthetubaguy, :-)
14:23:50 <johnthetubaguy> cdent: totally possible
14:24:12 <n0ano> good, sounds like we will have reasonable representation, we should think about topics we'd like discussed
14:24:29 <n0ano> Jay's resource tracker would be high on my list, anything else?
14:24:38 <carl_baldwin> I will be there hoping to discuss scheduling and Neutron
14:24:39 <edleafe> I won't be able to attend
14:24:56 <n0ano> edleafe, bummer
14:25:22 <n0ano> carl_baldwin, good subject, do you have some thoughts on that? (don't need them now but we should get prepared in the next week or so)
14:25:24 <cdent> I'd like to get scheduler testing onto the midcycle agenda, at least some brain sharing
14:25:24 <johnthetubaguy> carl_baldwin: ah, that's a good topic to get started on, so we have concrete things for the summit
14:26:02 <carl_baldwin> n0ano: waking up from a long vacation this morning but I can share thoughts on that a bit later
14:26:22 <n0ano> carl_baldwin, sure, maybe next week would be perfect
14:26:30 <carl_baldwin> n0ano: johnthetubaguy what is the best way to share them?
14:26:45 <johnthetubaguy> a backlog spec would work for me, a short one
14:27:09 <johnthetubaguy> http://specs.openstack.org/openstack/nova-specs/specs/backlog/index.html
14:27:09 <carl_baldwin> n0ano: johnthetubaguy sounds good, I'll prepare that this week
14:27:10 <n0ano> carl_baldwin, or a message to the mailing list works too
14:27:17 <Yingxin> cdent++ it's difficult to test scheduler performance or reproduce race conditions using the current tests
14:27:42 * cdent nods
14:28:01 <johnthetubaguy> I have proposed a new scheduler via specs here: https://review.openstack.org/#/c/256323/
14:28:14 <n0ano> cdent, getting on the agenda is relatively easy, I assume johnthetubaguy will set up a whiteboard the first day, just want to make sure we know what we want to talk about going in
14:28:30 <johnthetubaguy> there is an etherpad up already, let me find the link
14:28:47 <johnthetubaguy> #link https://wiki.openstack.org/wiki/Sprints/NovaMitakaSprint
14:28:55 <johnthetubaguy> #link https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:28:59 <cdent> Yingxin, you want to add something there ^ or shall I?
14:29:40 <carl_baldwin> n0ano: thank you
14:29:55 <Yingxin> cdent: ok, I will
14:29:56 <johnthetubaguy> cdent: Yingxin: we have large-ops and the nova functional test job that help with some of that; there are some other folks thinking about a third-party performance gate test, which is also related, good to cover that stuff
14:30:16 <cdent> Thanks Yingxin
14:31:23 <n0ano> sounds good, let's talk more about the mid-cycle next week
14:31:39 <n0ano> #topic Bugs
14:32:13 <n0ano> I just checked the scheduler bug list and it hasn't gone up noticeably (but it hasn't gone down either)
14:32:46 <n0ano> I'm hoping the Intel San Antonio group will start working on the list but no progress yet
14:33:17 <n0ano> we all should check the list and try to work on a few bugs in our copious amounts of free time :-)
14:34:09 * cdent will try
14:34:27 <n0ano> cdent, tnx, that's all we're asking
14:34:32 <n0ano> anyway
14:34:36 <n0ano> #topic opens
14:34:46 <n0ano> Anyone have anything new for today?
14:35:02 <Yingxin> I've managed to reproduce a High-priority bug, https://bugs.launchpad.net/nova/+bug/1341420, with detailed logs :-)
14:35:03 <openstack> Launchpad bug 1341420 in OpenStack Compute (nova) "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,In progress] - Assigned to Yingxin (cyx1231st)
14:36:37 <Yingxin> anyone interested in solving race conditions can have a look at it
14:37:19 <n0ano> races are fun, it might just take a long time testing to hit the bug, how long did you try?
14:37:52 <Yingxin> 4 days setting up the environment, and I tried many, many times
14:38:14 <Yingxin> that's why we need functional tests :)
14:38:19 <edleafe> Yingxin: I think that the attitude that the scheduler is "optimistic" is going to make focusing on fixing race conditions more difficult
14:39:07 <johnthetubaguy> there are two schedulers required here, I suspect
14:39:08 <n0ano> I guess the issue is what's the effect of the race; if it's just a failure that is handled by the retry mechanism then there's no problem
14:39:27 <johnthetubaguy> the current scheduler assumes there is a fair chunk of free space at all times
14:39:36 <Yingxin> Yeah, but it's still weird that even a single scheduler will still have the race problem
14:40:02 <Yingxin> sorry about grammar issues
14:40:24 <n0ano> Yingxin, the race is between two components, compute & scheduler, so there can still be races there
14:40:30 <johnthetubaguy> Yingxin: have you tried the caching_scheduler?
14:40:47 <johnthetubaguy> that reduces the window *a lot* when you have only one scheduler
14:41:15 <johnthetubaguy> enough to fix the issues we were hitting in production before we started using that, at least
14:41:40 <johnthetubaguy> n0ano you do make a very good point though, what are the errors the retry does fix
14:41:58 <johnthetubaguy> oops, s/does fix/does not fix/
14:42:48 <johnthetubaguy> Yingxin: to be more explicit, I am talking about this one: https://github.com/openstack/nova/blob/master/nova/scheduler/caching_scheduler.py
14:43:33 <n0ano> johnthetubaguy, we'd have to look at that, I know it handles `here's a node, use it - then the use fails', not sure about cases where the scheduler thinks it can't find a node
14:43:44 <Yingxin> johnthetubaguy: yes, it will indeed be more efficient in the one-scheduler scenario.
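To make the bug under discussion concrete, here is a small sketch of the race under simplified assumptions: the scheduler selects against a periodically refreshed snapshot of compute state, while the authoritative claim happens later on the compute node, so two requests can both pick the last slot on the same host. All names here are illustrative, not Nova's real code.

    # Sketch of the selection-vs-claim gap in bug 1341420, with the
    # scheduler's view and the compute node's real state held separately.
    snapshot = {"node1": {"free_ram_mb": 1024}}  # scheduler's (stale) view
    actual = {"node1": {"free_ram_mb": 1024}}    # real state on the compute node


    def select_destination(ram_mb):
        # Scheduler side: a read-only decision against the snapshot.
        for node, state in snapshot.items():
            if state["free_ram_mb"] >= ram_mb:
                return node
        raise Exception("NoValidHost")


    def claim(node, ram_mb):
        # Compute side: the authoritative check, made later.
        if actual[node]["free_ram_mb"] < ram_mb:
            raise Exception("ComputeResourcesUnavailable")
        actual[node]["free_ram_mb"] -= ram_mb


    # Two concurrent requests, each needing the last 1024 MB:
    a = select_destination(1024)  # picks node1
    b = select_destination(1024)  # also picks node1 -- snapshot not yet updated
    claim(a, 1024)                # succeeds
    try:
        claim(b, 1024)            # spurious failure: the host no longer fits
    except Exception as exc:
        print(exc)                # the retry machinery reschedules from here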
14:44:25 <johnthetubaguy> Yingxin: the efficiency is not all it does; it means the decisions are stored in local memory between select_destination calls, to reduce the racing
14:44:33 <Yingxin> but the problem still exists according to my log analysis
14:45:03 <Yingxin> https://bugs.launchpad.net/nova/+bug/1341420/comments/24
14:45:03 <openstack> Launchpad bug 1341420 in OpenStack Compute (nova) "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,In progress] - Assigned to Yingxin (cyx1231st)
14:45:04 <johnthetubaguy> Yingxin: it's basically helped because of this call here: https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L145
14:45:49 <johnthetubaguy> the problem is still there, it's just much less of a problem, and the retry should fix up the remaining issues
14:45:59 <n0ano> looking at the bug I would think retry would apply; is this a bug or a comment that the current process is inefficient?
14:46:09 <johnthetubaguy> that goes back to n0ano's question, when do you have so many problems that the retry doesn't work?
14:47:16 <Yingxin> the retry would work, but it's inefficient to some extent
14:47:23 <johnthetubaguy> to be clear, the current approach is not a good idea if your cloud is often full, it's the wrong trade-off; we should create a different scheduler for that, and moving claims to the scheduler does something that is much closer to what you're wanting
14:47:32 <n0ano> I would also point out that it seems this is probably not a High bug, I'd drop it to Medium if not Low
14:48:03 <johnthetubaguy> Yingxin: only lots of testing will show that
14:48:06 <johnthetubaguy> n0ano: +1
14:48:12 <edleafe> Yingxin: inefficiency is by design :)
14:48:20 <johnthetubaguy> Yingxin: and it depends very much on your cloud
14:48:25 <n0ano> edleafe, +1 (unfortunately)
14:48:26 <cdent> edleafe: is that what "optimistic" means?
14:48:29 <johnthetubaguy> edleafe: yes, it is basically
14:48:35 <edleafe> cdent: pretty much
14:48:39 <cdent> gotcha
14:48:49 <edleafe> cdent: we expect it to fail, and handle it when it does
14:49:18 <cdent> And is the latency that is caused significant enough to be a real bear, or just a teddy bear?
14:49:28 <johnthetubaguy> depends how many retries you see
14:49:31 <n0ano> edleafe, I would substitute s/expect/tolerate/ but basically yes
14:49:39 <edleafe> cdent: as johnthetubaguy said, it depends on your cloud and the load placed on it
14:50:23 <johnthetubaguy> personally for us, it was well under 1% of builds that needed a retry, last time I checked, but yeah, "it depends" is the correct answer
14:50:50 <johnthetubaguy> so the extra speed and simplicity I see as a win
14:51:15 <johnthetubaguy> actually I am attempting to make it simpler to understand with this patch: https://review.openstack.org/#/c/256323/
14:51:28 <johnthetubaguy> those weights are a pain in the back to understand properly
14:52:02 <johnthetubaguy> s/patch/suggestion/
14:53:40 <n0ano> OK, good discussion but we're approaching the top of the hour, any last-minute points?
14:54:33 <n0ano> then I'll thank everyone and we'll talk next week
14:54:45 <n0ano> #endmeeting
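As a closing illustration of johnthetubaguy's point about the consume call in filter_scheduler.py: a hedged sketch of how a filter-style scheduler can deduct each placement from its own in-memory host state as soon as a host is chosen, so later requests handled by the same process see the updated view; the caching scheduler additionally keeps that state alive between select_destinations calls instead of rebuilding it from the database every time. Names here are simplified assumptions, not Nova's exact API.

    # Hedged sketch of local consumption shrinking the race window.
    class HostState:
        def __init__(self, host, free_ram_mb):
            self.host = host
            self.free_ram_mb = free_ram_mb

        def consume_from_request(self, ram_mb):
            # Local bookkeeping only; the authoritative claim still
            # happens later on the compute node.
            self.free_ram_mb -= ram_mb


    class CachingScheduler:
        def __init__(self, host_states):
            # Kept in memory between calls, refreshed periodically.
            self.host_states = host_states

        def select_destinations(self, ram_mb):
            fits = [h for h in self.host_states if h.free_ram_mb >= ram_mb]
            if not fits:
                raise Exception("NoValidHost")
            chosen = max(fits, key=lambda h: h.free_ram_mb)  # trivial weighing
            chosen.consume_from_request(ram_mb)  # later requests see this
            return chosen.host


    scheduler = CachingScheduler([HostState("node1", 2048)])
    print(scheduler.select_destinations(1024))  # node1; cached free RAM now 1024
    print(scheduler.select_destinations(1024))  # node1 again; now 0

This is why a single caching scheduler races far less: the only remaining window is between the local consume and the compute node's claim, which the retry mechanism covers.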