14:00:34 #startmeeting nova-scheduler
14:00:34 Meeting started Mon Jan 4 14:00:34 2016 UTC and is due to finish in 60 minutes. The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:35 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:37 The meeting name has been set to 'nova_scheduler'
14:00:43 anyone here to talk about the scheduler?
14:00:53 \o
14:00:57 yup :)
14:01:03 hello
14:01:04 o/
14:01:18 <_gryf> \o
14:02:03 o/
14:02:29 multiple new faces, welcome all
14:02:46 let's get started
14:02:52 #topic Spec & BPs & Patches - https://etherpad.openstack.org/p/mitaka-nova-spec-review-tracking
14:03:39 I was going over the scheduler priorities and noticed that some of the specs are struck out but still not approved
14:04:21 one of which is yours edleafe, the configurable tracker, is that on track?
14:04:54 n0ano: no, that one has been a victim of catch-22
14:05:15 can't make it configurable until there is more than one in-tree option
14:05:29 can't add an option in-tree unless it can be used
14:05:32 :)
14:06:11 well, that's kind of unfortunate, so do we just leave that in limbo or should we officially drop it?
14:06:16 that's why it's crossed out
14:06:27 drop it
14:06:37 it's dead
14:07:17 OK, sad but so be it, I guess we can remember the idea if it's necessary in the future
14:08:09 the reality is that it *could* be added, but it would need a majority supporting the idea of radical changes to the scheduler
14:08:19 and that's not going to happen
14:08:52 edleafe, is this something to discuss at the next summit or do you think it's just not possible now?
14:09:31 n0ano: I'd be happy to discuss it with anyone, but the general mood is "asked and answered".
14:10:19 hmm, hitting my head against a wall is so much fun, oh well
14:11:15 OK, the other spec I thought I'd mention is Jay's resource providers, https://review.openstack.org/#/c/225546/3
14:11:52 he updated the spec late last week and we need reviews on it, so just a gentle reminder to review this spec if you have time
14:11:52 That changed a lot in the past week
14:12:04 cdent, indeed
14:12:39 I have it in my queue. My very, very long queue...
14:13:02 understood, just want to make sure everyone is aware of it.
14:14:26 those were the ones that popped out to me, any other specs/patches people want to talk about?
14:14:50 I've implemented the bp host-state-level-locking
14:14:58 Yingxin++
14:15:02 some good work in that stuff
14:15:09 there's a lot of changes in merging the claim logic
14:15:19 Yingxin, do you have a link to that?
14:15:21 hoping I'm doing it the right way
14:15:30 <_gryf> n0ano, from my side, I was just wondering about the status on resource objects that jay is working on, but i assume it's all about the reviews…
14:15:30 https://review.openstack.org/#/q/topic:bp/host-state-level-locking
14:15:43 _gryf, +1
14:16:52 Yingxin, so you need all 10 of those specs approved & merged, right?
14:17:17 yes
14:17:34 looks like the review queue just got larger :-)
14:18:23 :) I'll help do more reviews
14:19:15 OK, let's move on
14:19:22 #topic Mid-cycle meetup
14:19:23 on the spec front, we are technically closed for Mitaka specs right now
14:19:53 I do have an eye towards Jay's specs though, assuming we have the bandwidth for those
14:19:58 johnthetubaguy, I assumed that was the case but I thought we'd get an exception on Jay's
14:20:06 s/get/try to get
14:20:48 yeah, if there is consensus and bandwidth, it seems something we should consider, but we feel very oversubscribed right now
14:21:11 anyways, sounds like that's moving, which is all good no matter which release we can fit it into
14:21:29 let's hope we have the BW, it is rather important to the scheduler improvements
14:21:55 anyway, mid-cycle
14:22:34 it's coming up soon, end of this month, in beautiful, sunny England, wondering who is planning to attend (I won't make it this time)
14:22:53 I'll be there.
14:23:05 <_gryf> i'll be there also
14:23:16 I will be there
14:23:22 If it continues raining like it has been, Bristol won't be.
14:23:22 johnthetubaguy, :-)
14:23:50 cdent: totally possible
14:24:12 good, sounds like we will have reasonable representation, we should think about topics we'd like discussed
14:24:29 Jay's resource tracker would be high on my list, anything else?
14:24:38 I will be there hoping to discuss scheduling and Neutron
14:24:39 I won't be able to attend
14:24:56 edleafe, bummer
14:25:22 carl_baldwin, good subject, do you have some thoughts on that (don't need them now but we should get prepared in the next week or so)
14:25:24 I'd like to get scheduler testing onto the midcycle agenda, at least some brain sharing
14:25:24 carl_baldwin: ah, that's a good topic to get started, so we have concrete things for the summit
14:26:02 n0ano: waking up from a long vacation this morning but I can share thoughts on that a bit later
14:26:22 carl_baldwin, sure, maybe next week would be perfect
14:26:30 n0ano: johnthetubaguy what is the best way to share them?
14:26:45 a backlog spec would work for me, a short one
14:27:09 http://specs.openstack.org/openstack/nova-specs/specs/backlog/index.html
14:27:09 n0ano: johnthetubaguy sounds good, I'll prepare that this week
14:27:10 carl_baldwin, or a message to the mailing list works too
14:27:17 cdent++ it's difficult to test scheduler performance or reproduce race conditions using the current tests
14:27:42 * cdent nods
14:28:01 I have proposed a new scheduler via specs here: https://review.openstack.org/#/c/256323/
14:28:14 cdent, getting on the agenda is relatively easy, I assume johnthetubaguy will set up a whiteboard the first day, just want to make sure we know what we want to talk about going in
14:28:30 there is an etherpad up already, let me find the link
14:28:47 #link https://wiki.openstack.org/wiki/Sprints/NovaMitakaSprint
14:28:55 #link https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:28:59 Yingxin, you want to add something there ^ or shall I?
14:29:40 n0ano: thank you
14:29:55 cdent: ok, I will
14:29:56 cdent: Yingxin: we have large ops and the nova functional test job that help with some of that, there are some other folks thinking about a third-party performance gate test, which is also related, good to cover that stuff
14:30:16 Thanks Yingxin
14:31:23 sounds good, let's talk more about the mid-cycle next week
14:31:39 #topic https://review.openstack.org/#/c/256323/
14:32:13 I just checked the scheduler bug list and it hasn't gone up noticeably (but it hasn't gone down either)
14:32:46 I'm hoping the Intel San Antonio group will start working on the list but no progress yet
14:33:17 we all should check the list and try and work on a few bugs in our copious amounts of free time :-)
14:34:09 * cdent will try
14:34:27 cdent, tnx, that's all we're asking
14:34:32 anyway
14:34:36 #topic opens
14:34:46 Anyone have anything new for today?
14:35:02 I've tried hard to reproduce a high-priority bug https://bugs.launchpad.net/nova/+bug/1341420 with detailed logs :-)
14:35:03 Launchpad bug 1341420 in OpenStack Compute (nova) "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,In progress] - Assigned to Yingxin (cyx1231st)
14:36:37 anyone interested in solving race conditions can have a look at it
14:37:19 races are fun, it might just take a long time testing to hit the bug, how long did you try?
14:37:52 4 days setting up the environment and trying many, many times
14:38:14 that's why we need functional tests :)
14:38:19 Yingxin: I think that the attitude that the scheduler is "optimistic" is going to make focusing on fixing race conditions more difficult
14:39:07 there are two schedulers required here, I suspect
14:39:08 I guess the issue is what's the effect of the race, if it's just a failure that is handled by the retry mechanism then there's no problem
14:39:27 the current scheduler assumes there is a fair chunk of free space at all times
14:39:36 Yeah, but it's still weird that even just one scheduler will still have the race problem
14:40:02 sorry about grammar issues
14:40:24 Yingxin, the race is between two components, compute & scheduler, there can still be races there
14:40:30 Yingxin: have you tried the caching_scheduler?
14:40:47 that reduces the window *a lot* when you have only one scheduler
14:41:15 enough to fix the issues we were hitting in production before we started using that, at least
14:41:40 n0ano you do make a very good point though, what are the errors the retry does fix
14:41:58 oops, s/does fix/does not fix/
14:42:48 Yingxin: to be more explicit, I am talking about this one: https://github.com/openstack/nova/blob/master/nova/scheduler/caching_scheduler.py
14:43:33 johnthetubaguy, we'd have to look at that, I know it handles `here's a node, use it - the use fails', not sure about cases where the scheduler thinks it can't find a node
14:43:44 johnthetubaguy: yes, it will indeed be more efficient in a one-scheduler scenario.
14:44:25 Yingxin: the efficiency is not all it does; it means the decisions are stored in local memory between select_destination calls, to reduce the racing
14:44:33 but the problem still exists according to my log analysis
14:45:03 https://bugs.launchpad.net/nova/+bug/1341420/comments/24
14:45:03 Launchpad bug 1341420 in OpenStack Compute (nova) "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,In progress] - Assigned to Yingxin (cyx1231st)
14:45:04 Yingxin: it's basically helped because of this call here: https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L145
14:45:49 the problem is still there, it's just much less of a problem, and the retry should fix up the remaining issues
14:45:59 looking at the bug I would think retry would apply, is this a bug or a comment that the current process is inefficient?
14:46:09 that goes back to n0ano's question, when do you have so many problems that the retry doesn't work?
14:47:16 the retry would work, but it's inefficient to some extent
14:47:23 to be clear, the current approach is not a good idea if your cloud is often full, it's the wrong trade-off, and we should create a different scheduler for that, and moving claims to the scheduler does something that is much closer to what you're wanting
14:47:32 I would also point out that it seems this is probably not a high bug, I'd drop it to medium if not low
14:48:03 Yingxin: only lots of testing will show that
14:48:06 n0ano: +1
14:48:12 Yingxin: inefficiency is by design :)
14:48:20 Yingxin: and it depends very much on your cloud
14:48:25 edleafe, +1 (unfortunately)
14:48:26 edleafe: is that what "optimistic" means?
14:48:29 edleafe: yes, it is basically
14:48:35 cdent: pretty much
14:48:39 gotcha
14:48:49 cdent: we expect it to fail, and handle it when it does
14:49:18 And is the latency that is caused significant enough to be a real bear, or just a teddy bear?
14:49:28 depends how many retries you see
14:49:31 edleafe, I would substitute s/expect/tolerate/ but basically yes
14:49:39 cdent: as johnthetubaguy said, it depends on your cloud and the load placed on it
14:50:23 personally for us, it was well under 1% of builds that needed a retry, last time I checked, but yeah, "it depends" is the correct answer
14:50:50 so the extra speed and simplicity I see as a win
14:51:15 actually I am attempting to make it simpler to understand with this patch: https://review.openstack.org/#/c/256323/
14:51:28 those weights are a pain in the back to understand properly
14:52:02 s/patch/suggestion/
14:53:40 OK, good discussion but we're approaching the top of the hour, any last-minute points?
14:54:33 then I'll thank everyone and we'll talk next week
14:54:45 #endmeeting
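Editor's note: the race discussed above (bug 1341420) comes from the optimistic "select a host, then claim on the compute node, retry on failure" flow. The sketch below is a minimal, hypothetical illustration of that pattern, not Nova code; all class and function names are made up. The `consume()` step mirrors the idea behind the filter scheduler consuming from its cached host state right after selection, which narrows (but does not close) the window when a single scheduler handles many requests.

```python
# Illustrative sketch only -- not Nova's implementation; names are invented.
import random


class HostState:
    """Scheduler-side cached view of a compute node's free resources."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def consume(self, ram_mb):
        # Deduct from the cached view immediately after selection so the
        # next request in this scheduler process sees less free space.
        self.free_ram_mb -= ram_mb


class ComputeNode:
    """Authoritative resource tracker on the compute host."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def claim(self, ram_mb):
        # The real claim happens here, after the scheduler has decided.
        # If another instance landed in the meantime, the claim fails and
        # the request is rescheduled (the "spurious failure" in the bug).
        if self.free_ram_mb < ram_mb:
            return False
        self.free_ram_mb -= ram_mb
        return True


def select_destination(host_states, ram_mb):
    # Optimistic selection from a possibly stale cache of host states.
    candidates = [h for h in host_states if h.free_ram_mb >= ram_mb]
    if not candidates:
        raise RuntimeError("No valid host found")
    chosen = random.choice(candidates)
    chosen.consume(ram_mb)
    return chosen


def boot_with_retries(host_states, compute_nodes, ram_mb, max_retries=3):
    # "Expect it to fail and handle it": schedule, attempt the claim on the
    # chosen node, and retry a bounded number of times if the claim fails.
    for attempt in range(1, max_retries + 1):
        chosen = select_destination(host_states, ram_mb)
        if compute_nodes[chosen.name].claim(ram_mb):
            return chosen.name, attempt
    raise RuntimeError("Exceeded max scheduling retries")


if __name__ == "__main__":
    hosts = [HostState("node1", 4096), HostState("node2", 2048)]
    nodes = {"node1": ComputeNode("node1", 4096),
             "node2": ComputeNode("node2", 2048)}
    print(boot_with_retries(hosts, nodes, ram_mb=2048))
```

As discussed in the meeting, how often the retry path is taken depends on the cloud: with plenty of free space retries are rare (well under 1% of builds in johnthetubaguy's experience), while a nearly full cloud hits the selection/claim gap far more often, which is the case the bug describes.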