17:03:29 <renuka> #startmeeting
17:03:30 <openstack> Meeting started Thu Oct 20 17:03:29 2011 UTC. The chair is renuka. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:03:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:04:17 <renuka> vladimir3p: would you like to begin with the scheduler discussion
17:04:27 <renuka> #topic volume scheduler
17:04:42 <vladimir3p> no prob. So, there is a blueprint spec you've probably seen
17:04:56 <renuka> yes
17:04:57 <vladimir3p> we have a couple of options for implementing it
17:05:19 <vladimir3p> 1. create a single "generic" scheduler that will perform some exemplary scheduling
17:05:39 <vladimir3p> 2. or do almost nothing there and just find an appropriate sub-scheduler
17:06:13 <vladimir3p> another change that we may need - providing some extra data back to the default scheduler (for sending it to the manager host)
17:06:30 <vladimir3p> let's first of all discuss what we would like the scheduler to look like:
17:06:39 <renuka> ok
17:06:40 <vladimir3p> either we put some basic logic there
17:06:52 <vladimir3p> or just rely on every vendor to supply their own
17:07:01 <vladimir3p> I prefer the 1st option
17:07:06 <vladimir3p> any ideas?
17:07:21 <rnirmal> the only problem I see with the sub-scheduler is that it's going to become too much too soon
17:07:23 <clayg> I think a generic type scheduler could get you pretty far
17:07:45 <rnirmal> so starting with a generic scheduler would be nice
17:07:48 <DuncanT> Assuming it is generic enough, I can't see a problem with a good generic scheduler
17:07:59 <clayg> I'm not sure how zones and host aggregates play in at the top-level scheduler
17:08:04 <timr1> vladimir3p: am I right in understanding that the generic scheduler will be able to schedule on opaque keys
17:08:06 <renuka> I think the vendors can plug in their logic in report capabilities, correct? So why not let the scheduler decide which backend is appropriate and, based on a mapping which shows which volume workers can reach which backend, have the scheduler select a node
17:09:01 <vladimir3p> timr1: yes, at the beginning the generic scheduler could just match volume_type keys with reported keys
17:09:16 <vladimir3p> I was thinking that it must have some logic for quantities as well
17:09:24 <DuncanT> Since the HP backend can support many volume types in a single instance, it is also important that required capabilities get passed through to the create. Is that covered in the design already?
17:09:32 <vladimir3p> (instead of going to the DB for all available vols on that host)
17:09:59 <vladimir3p> DuncanT: can you please clarify?
17:10:20 <vladimir3p> DuncanT: I was thinking that this is the essential requirement for it
17:10:49 <vladimir3p> renuka: on the volume-driver level, yes, they will report whatever keys they want
17:10:49 <renuka> DuncanT: isn't the volume type already passed in the create call (with the extensions added)
17:11:21 <DuncanT> vladimir3p: A single HP backend could support, say, (spindle speed == 4800, spindle speed == 7200, spindle speed == 15000), so if the user asked for spindle speed > 7200 we'd need that detail passed through to create
17:11:23 <vladimir3p> DuncanT: yes, volume type is used during volume creation and might be retrieved by the scheduler
17:11:29 <renuka> vladimir3p: correct, so those keys should be sufficient at this point to plug in vendor logic, correct?
17:12:03 <timr1> that sounds ok
17:12:10 <renuka> DuncanT: I think the user need only specify the volume type. The admin can decide if a certain type means spindle speed = x
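For illustration, here is roughly what "the admin decides what a type means" could look like as data: admin-defined volume types carrying opaque extra-spec key/value pairs that the scheduler later matches against reported capabilities. The layout and the keys shown (spindle_speed, protocol, qos_tier) are assumptions for this sketch, not part of the blueprint.

    # Hypothetical admin-defined volume types; the extra-spec keys shown here
    # (spindle_speed, protocol, qos_tier) are illustrative, not a fixed schema.
    volume_types = {
        "sata": {"extra_specs": {"spindle_speed": "7200", "protocol": "iscsi"}},
        "sas":  {"extra_specs": {"spindle_speed": "15000", "protocol": "iscsi",
                                 "qos_tier": "gold"}},
    }

    # A user creating a volume only names the type; the admin-chosen extra
    # specs travel with it to the scheduler.
    create_request = {"size_gb": 100, "volume_type": "sas"}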
17:12:16 <vladimir3p> renuka: yes, but without a special scheduler driver they will be useless (the generic one will only match volume type vs these reported capabilities)
17:13:12 <vladimir3p> DuncanT, renuka: yes, I was thinking that the admin will create separate volume types. Every node will report exactly what it sees and the scheduler will perform a matching between them
17:13:14 <renuka> vladimir3p: what is the problem with that?
17:13:46 <renuka> vladimir3p: every node should report for every backend (array) that it can reach
17:14:03 <renuka> the duplicates can be filtered later or appropriately at some point
17:14:05 <DuncanT> vladimir3p: That makes sense, thanks
17:14:54 <vladimir3p> renuka: I mean drivers will report whatever data they think is appropriate (some opaque key/value pairs), but the generic scheduler will only look at the ones it recognizes (and those are the ones supplied in volume types)
17:15:22 <vladimir3p> if a vendor would like to perform some special logic based on this opaque data - special schedulers will be required
17:15:36 <timr1> I agree that we want to deal with quantities as well
17:15:42 <rnirmal> can we do something like match volume_type: if the type supports sub_criteria, match those too (gleaned from the opaque key/value pairs)
17:16:01 <DuncanT> vladimir3p: It should be possible to do ==, !=, <, > generically on any key, shouldn't it?
17:16:19 <renuka> rnirmal: I agree
17:16:32 <timr1> don't think we need special schedulers to support additional vendor logic in the backend
17:16:44 <vladimir3p> DuncanT: == and != yes, not sure about >, <
17:17:01 <renuka> yeah, we should be able to add at least some basic logic via report capabilities
17:17:42 <vladimir3p> rnirmal: it would be great to have sub_criteria, but how will the scheduler recognize that a particular key/value pair is not a single criterion but a complex one?
17:17:45 <renuka> #idea what if we add "specify rules" at the driver level and keep "report capabilities" at the node/backend level
17:18:28 <rnirmal> vladimir3p: it wouldn't know that, without some rules
17:18:36 <vladimir3p> folks, how about we start with something basic and improve it over time (we could discuss these improvements right now, but let's agree on the basics)
17:18:52 <vladimir3p> so, meanwhile we agreed that:
17:19:21 <vladimir3p> #agreed drivers will report capabilities in key/value pairs
17:19:22 <renuka> vladimir3p: opaque rules should be simple enough to add, while keeping it generic
17:19:55 <renuka> correction, nodes will report capabilities in key/value pairs per backend
17:19:57 <vladimir3p> #agreed basic scheduler will have logic to match volume_type's key/value pairs vs reported ones
17:20:30 <vladimir3p> renuka: the driver will decide how many types it would like to report
17:20:41 <clayg> I think the abstract_scheduler for compute would be a good base for the generic type/quantity scheduler
17:20:41 <vladimir3p> each type has key/value pairs
17:20:59 <vladimir3p> clayg: yes, I was thinking exactly the same
17:21:12 <vladimir3p> clayg: there is already some logic for matching this stuff
17:21:19 <clayg> least_cost allows you to register cost functions
17:21:42 <clayg> oh right...
17:21:52 <clayg> in the vsa scheduler
17:21:53 <clayg> ?
17:22:13 <vladimir3p> we will need to rework the vsa scheduler, but it has some basic logic for that
17:22:50 <vladimir3p> how about quantities: I suppose this is important. do we all agree to add reserved keywords?
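A minimal sketch of the matching just agreed on: exact key/value matching of a volume type's extra specs against per-backend capabilities reported by each node, plus one reserved quantity keyword. Everything named here (reported_capabilities, free_capacity_gb, filter_backends) is an illustrative assumption, not existing code.

    # What nodes might report: key/value pairs per backend they can reach.
    reported_capabilities = [
        {"host": "volume1", "backend": "array-a",
         "spindle_speed": "15000", "protocol": "iscsi", "free_capacity_gb": 500},
        {"host": "volume2", "backend": "array-b",
         "spindle_speed": "7200", "protocol": "iscsi", "free_capacity_gb": 2000},
    ]

    def filter_backends(extra_specs, capabilities, size_gb):
        """Return the reported backends whose capabilities satisfy the type."""
        matches = []
        for cap in capabilities:
            # Generic part: every extra-spec key must be present and equal.
            if all(str(cap.get(key)) == str(value)
                   for key, value in extra_specs.items()):
                # Quantity part: a reserved keyword the scheduler understands.
                if cap.get("free_capacity_gb", 0) >= size_gb:
                    matches.append(cap)
        return matches

    sas_specs = {"spindle_speed": "15000", "protocol": "iscsi"}
    print(filter_backends(sas_specs, reported_capabilities, size_gb=100))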
17:23:08 <renuka> clayg: so do you think it will be straightforward enough to go with registering cost functions at this point?
17:24:07 <DuncanT> vladimir3p: I think the set needs to be small, since novel backends may not have many of the same concepts as other designs
17:24:13 <timr1> vladimir3p: yes - IOPS, bandwidth, capacity, etc.?
17:24:49 <clayg> renuka: I'm not sure... something like that may work, but the cost idea is more about selecting a best match from a group of basically identically capable hosts - but with different loads
17:25:34 <vladimir3p> the cost in our case might be total capacity or the number of drives, etc...
17:25:57 <renuka> I think there needs to be a way to call some driver-specific method. It will keep things generic and simplify a lot of stuff
17:26:05 <vladimir3p> that's why I suppose reporting something that the scheduler will look at is important
17:26:06 <clayg> perhaps better would be to just subclass and override "filter_capable_nodes" to look at type... and also some vendor-specific opaque k/v pairs.
17:26:36 <clayg> the quantities stuff would more closely match the cost concepts
17:26:42 <vladimir3p> clayg: yes, a subclass will work for the single-vendor scenario
17:27:30 <clayg> vladimir3p: I mean you could call the super method and ignore stuff that's "other-vendor" stuff
17:28:10 <renuka> vladimir3p: how about letting the scheduler do the basic scheduling as you said, and then call a driver method passing the backends it has filtered?
17:28:22 <clayg> meh, I think we start with types and quantities - knowing that we'll have to add something more later
17:28:40 <rnirmal> regarding filtering, is anyone considering locating the volume near the vm, for scenarios with top-of-rack storage
17:28:42 <clayg> s/we/get out of the way and let Vlad
17:28:50 <renuka> rnirmal: yes
17:28:54 <vladimir3p> :-)
17:28:57 <clayg> ;)
17:29:18 <rnirmal> renuka: ok
17:29:33 <vladimir3p> renuka: not sure if the generic scheduler should call the driver... probably, as clayg mentioned, every provider could declare a subclass that will do whatever is required
17:29:34 <timr1> we are including it
17:29:58 <vladimir3p> in this case we will not need to implement volume_type-to-driver translation tables
17:30:18 <vladimir3p> however, there is a disadvantage with this - only a single vendor per scheduler ...
17:30:24 <timr1> we (HP) are agnostic at this point as our driver is generic and we re-schedule in the backend
17:30:50 <vladimir3p> timr1: do you really want to reschedule at the volume driver level?
17:31:07 <vladimir3p> timr1: will it be easier if the scheduler has all the logic
17:31:17 <renuka> vladimir3p: you mean single driver... SM for example supports multiple vendors' backend types
17:31:20 <timr1> vladimir3p: we already do it this way - legacy
17:31:23 <vladimir3p> timr1: and just use the volume node as a gateway
17:31:37 <vladimir3p> ok
17:31:42 <timr1> that is correct - we just use it as a gateway
17:31:55 <vladimir3p> renuka: yes, single driver
17:32:35 <vladimir3p> so, can we claim that we agreed that there will be no volume_type-to-driver translation meanwhile?
17:33:01 <renuka> vladimir3p: i still think matching simply on type is a bit too little. If a driver does report capabilities per backend, there should be a way to distinguish between two backends
17:33:30 <renuka> for example, 2 netapp servers do not make 2 different volume types. They map to the same one, but the scheduler should be able to choose one
17:33:31 <clayg> two backends of the same type?
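A rough sketch of the subclassing idea clayg describes: the vendor subclass calls the generic filtering first and then narrows the result using its own opaque key/value pairs. The base class GenericVolumeScheduler, its filter_capable_nodes() hook, and the "acme:tier" key are hypothetical stand-ins loosely modeled on the compute abstract_scheduler, not existing code.

    class GenericVolumeScheduler(object):
        def filter_capable_nodes(self, nodes, extra_specs, size_gb):
            # Generic part: type match plus a capacity check (as agreed above).
            return [n for n in nodes
                    if all(str(n.get(k)) == str(v)
                           for k, v in extra_specs.items())
                    and n.get("free_capacity_gb", 0) >= size_gb]

    class AcmeScheduler(GenericVolumeScheduler):
        """Hypothetical single-vendor subclass."""

        def filter_capable_nodes(self, nodes, extra_specs, size_gb):
            candidates = super(AcmeScheduler, self).filter_capable_nodes(
                nodes, extra_specs, size_gb)
            # Vendor-specific part: prefer backends reporting an opaque key
            # that only this vendor's driver emits; otherwise keep the
            # generic result unchanged.
            preferred = [n for n in candidates if n.get("acme:tier") == "fast"]
            return preferred or candidates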
17:34:15 <vladimir3p> how about we go over renuka's particular scenario and see if we could do it with a basic scheduler
17:34:34 <renuka> similarly, as rnirmal pointed out, top-of-rack type restrictions cannot be applied
17:34:56 <clayg> renuka: how do you suggest the scheduler choose between the two if they both have equal "quantities"
17:35:41 <renuka> that is where i think there should be a volume-driver-specific call
17:35:45 <renuka> or a similar way
17:35:48 <clayg> i don't really know how to support top-of-rack restrictions... unless you know where the volume is going to be attached when you create it
17:36:26 <renuka> clayg: that will have to be filtered based on some reachability which is reported
17:36:44 <rnirmal> clayg: yeah, it's a little more complicated case, where it would require combining both vm and volume placement, so it might be out of scope for the basic scheduler
17:37:06 <renuka> clayg: we can look at user/project info and make a decision. I expect that is how it is done anyway
17:37:29 <vladimir3p> renuka: I suppose the generic scheduler could not really determine which one should be used, but you could do it in the sub-class. So you will override some methods and pick based on opaque data
17:38:21 <renuka> so by deciding to subclass, we have essentially made it so that, at least for a while, we can have only 1 volume driver in the zone
17:38:22 <vladimir3p> in this case the scheduler will call the driver "indirectly", because the driver will override it
17:38:39 <renuka> can everyone here live with that for now?
17:38:58 <clayg> isn't there like an "agree" or "vote" option?
17:38:59 <DuncanT> One issue I've just thought of with volume-types:
17:39:05 <clayg> I'm totally +1'ing the simple thing first
17:39:51 <vladimir3p> #agreed simple things first :-) there will be no volume_type-to-driver translation meanwhile (sub-class will override whatever is necessary)
17:40:03 <timr1> :)
17:40:06 <rnirmal> +1
17:40:40 <DuncanT> agreed
17:41:09 <vladimir3p> so, renuka, back to your scenario. I suppose it will be possible to register volume types like: SATA volumes, SAS volumes, etc... probably with some RPM, QoS, etc. extra data
17:41:23 <renuka> ok
17:41:33 <DuncanT> Is it possible to specify volume affinity or anti-affinity with volume types?
17:41:56 <vladimir3p> on the volume driver side, each driver will go to all its underlying arrays and will collect things like what type of storage each one of them supports.
17:42:36 <DuncanT> I don't think there is a way to express that with volume types?
17:42:50 <vladimir3p> I think after that it could rearrange it in a form like... [{type: "SATA", arrays: [{array1, access path1}, {array2, access path2}, ...]}, ...]
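A sketch of the per-driver report vladimir3p describes: the driver queries each underlying array it can reach and rearranges the results grouped by storage type, keeping the access path alongside each array. The function and field names (report_capabilities, storage_type, access_path) are illustrative assumptions only.

    def report_capabilities(arrays):
        """Group reachable arrays by the storage type they support."""
        by_type = {}
        for array in arrays:
            entry = {"array": array["name"],
                     "access_path": array["access_path"]}
            by_type.setdefault(array["storage_type"], []).append(entry)
        return [{"type": t, "arrays": members}
                for t, members in by_type.items()]

    discovered = [
        {"name": "array1", "storage_type": "SATA", "access_path": "iqn.array1"},
        {"name": "array2", "storage_type": "SATA", "access_path": "iqn.array2"},
        {"name": "array3", "storage_type": "SAS",  "access_path": "iqn.array3"},
    ]
    # e.g. [{'type': 'SATA', 'arrays': [... 2 entries ...]}, {'type': 'SAS', ...}]
    print(report_capabilities(discovered))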
17:42:59 <renuka> vladimir3p: as long as we can subclass appropriately, the SM case should be fine, since we use one driver (multiple instances) in a zone
17:43:31 <vladimir3p> yeah, the sub-class at the scheduler level will be able to perform all the filtering and will be able to recognize such data
17:43:37 <clayg> DuncanT: no I don't think so, not yet
17:43:40 <timr1> DuncanT: I don't think volume types can do this - but it is something customers want
17:43:43 <vladimir3p> the only thing - it will need to report back what exactly should be done
17:43:50 <renuka> DuncanT: yes, that is why I am stressing reachability
17:45:04 <DuncanT> renuka: I see, yes, it is a reachability question too - we were looking at it for performance reasons but the API is the same
17:45:45 <vladimir3p> actually, at the volume driver level, we could at least understand the preferred path to the array
17:45:48 <renuka> DuncanT: yes, we could give it a better name ;)
17:46:58 <vladimir3p> renuka: do you think that what I described above will work for you?
17:47:06 <DuncanT> renuka: Generically it is affinity - we don't care about reachability from a scheduler point of view but obviously other technologies might, but we need a way for it to be passed to the driver...
17:47:37 <renuka> vladimir3p: yes
17:47:44 <vladimir3p> renuka: great
17:47:51 <vladimir3p> DuncanT: yes, today the scheduler returns only the host name
17:48:26 <vladimir3p> it will probably be required to return a tuple of the host and some capabilities that were used for this decision
17:48:36 <renuka> vladimir3p: could we ensure that when a request is passed to the host, it has enough information about which backend array to use
17:48:47 <DuncanT> vladimir3p: I'm more worried about the user-facing API - we need a user to be able to specify both a volume_type and a volume (or list of volumes) to have affinity or anti-affinity to
17:49:24 <renuka> #topic admin APIs
17:49:41 <vladimir3p> DuncanT: affinity of user to type?
17:49:56 <renuka> DuncanT: that was in fact the next thing on the agenda, what are the admin APIs that everyone requires
17:50:04 <DuncanT> vladimir3p: No, affinity of the new volume to an existing volume
17:50:13 <DuncanT> vladimir3p: On a per-create basis
17:50:44 <vladimir3p> DuncanT: yes, we have the same requirement
17:50:44 <timr1> i don't think DuncanT is talking about an admin API, it is the ability of a user to say create my volume near (or far from) volume-abs
17:50:50 <renuka> DuncanT: could that be restated as ensuring a user/project's data is kept together
17:51:09 <timr1> renuka: also want anti-ainity for availability
17:51:19 <timr1> anti-affinity
17:51:22 <renuka> timr1: isn't that too much power in the hands of the user. Will they even know such details?
17:52:22 <timr1> renuka: we have users who want to do it, I don't see it as power. They say make volume X anywhere. Make volume Y in a different place from volume X
17:52:29 <renuka> timr1: yes, in general some rules about placing user/project data. Am I correct in assuming that anti-affinity is used while creating redundant volumes?
17:52:30 <rnirmal> renuka: I think it's something similar to the hostId, which is an opaque id to the user, on a per user/project basis and maps on the backend to a host location
17:52:31 <vladimir3p> DuncanT: I guess affinity could be solved at the sub-scheduler level. Actually the specialized scheduler could first retrieve what was already created for the user/project and based on that perform filtering
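One way the sub-scheduler-level filtering vladimir3p suggests could look: look up where the project's existing volumes already live and filter candidate hosts accordingly. The hint keys ("affinity_volume_id"/"anti_affinity_volume_id") and the existing_volumes lookup are assumptions for illustration, not the actual API being proposed.

    def filter_by_affinity(candidate_hosts, hints, existing_volumes):
        """existing_volumes maps volume_id -> host it was placed on."""
        affine_to = hints.get("affinity_volume_id")
        anti_to = hints.get("anti_affinity_volume_id")
        hosts = list(candidate_hosts)
        if affine_to in existing_volumes:
            hosts = [h for h in hosts if h == existing_volumes[affine_to]]
        if anti_to in existing_volumes:
            hosts = [h for h in hosts if h != existing_volumes[anti_to]]
        return hosts

    existing = {"vol-x": "volume1"}
    print(filter_by_affinity(["volume1", "volume2", "volume3"],
                             {"anti_affinity_volume_id": "vol-x"}, existing))
    # -> ['volume2', 'volume3']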
17:53:32 <renuka> vladimir3p: yes, but we still have to have a way of saying that
17:53:50 <rnirmal> volumes may need something like the hostId for vms, to be able to tackle affinity first
17:53:51 <timr1> yes, it is for users who want to create their own tier of availability. agree it could be solved in the sub-scheduler - but the user API needs to be expanded
17:53:57 <clayg> vladimir3p: that would probably work, when they create the volume they can send in meta info, a custom scheduler could look for affinity and anti-affinity keys
17:54:07 <DuncanT> vladimir3p: What we'd like to make sure of, if possible, is that the user can specify which specific existing volume they want to be affine to (or anti-) - e.g. if they are creating several mirrored pairs, they might want to say that each pair is 'far' from the other to enhance the survivability of that data
17:54:49 <DuncanT> clayg: That should work fine, yes
17:55:39 <rnirmal> DuncanT: is this for inclusion in the basic scheduler?
17:55:48 <vladimir3p> DuncanT: I see ... in our case we create volumes and build RAID on top of them. So, it is quite important for us to be able to schedule volumes on different nodes and to know about that
17:56:29 <DuncanT> rnirmal: The API field, yes. I'm happy for the scheduler to ignore it, just pass it to the driver
17:56:30 <renuka_> vladimir3p: I was disconnected, has that disrupted anything?
17:56:53 <vladimir3p> renuka: not really, still discussing options for affinity
17:57:13 <DuncanT> Tim and I can write up a more detailed explanation off-line and email it round if that helps with progress?
17:57:20 <clayg> the scheduler will send the volume_ref, any metadata hanging off of that will be available to the driver
17:57:43 <renuka_> #action DuncanT will write up a detailed explanation for affinity
17:58:01 <DuncanT> :-)
17:58:10 <vladimir3p> DuncanT: thanks, it will be very helpful. seems like we have some common areas there
17:58:20 <renuka_> vladimir3p: what is the next step? We have 2 minutes
17:58:21 <clayg> OH: someone on ozone is working on adding os-volumes support to novaclient
17:58:39 <vladimir3p> clayg: seems like we will need to change how the message is sent from the scheduler to the volume manager
17:59:10 <rnirmal> I suppose the admin api topic is for the next meeting
17:59:19 <vladimir3p> renuka: I guess we will need to find a volunteer
17:59:20 <renuka_> rnirmal: yes
17:59:22 <vladimir3p> :-)
17:59:49 <clayg> vladimir3p: I don't see that, the manager can get whatever it needs from the volume_id
17:59:57 <vladimir3p> nope
18:00:13 <DuncanT> OT: I've put in a blueprint for a snap/backup API for some discussion
18:00:16 <clayg> you're so mysterious...
18:00:25 <vladimir3p> clayg: the scheduler will need to pass the volume not only to the host, but "for a particular backend array"
18:00:25 <renuka_> any volunteers for the simple scheduler work in the room? we can have people take this up on the mailing list, since we are out of time
18:01:25 <clayg> hrmmm.... if the manager/driver is responsible for multiple backends that support the same type... I don't see why it would let the scheduler make that decision
18:01:25 <renuka_> alright, i am going to end the meeting at this point
18:01:37 <renuka_> is that ok?
18:01:46 <timr1> ok so
18:01:54 <vladimir3p> clayg: do you have some free time to continue?
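A sketch of the message-shape change being debated: instead of the scheduler returning only a host, it returns the host plus the backend/capability detail it based the decision on, and that detail rides along in the create call to the volume manager. The dict layout and the cast_to_volume_host helper are illustrative assumptions, not the existing RPC interface.

    def schedule_create_volume(volume_id, filtered_backends):
        # Pick the first acceptable backend (a real scheduler would weigh them).
        chosen = filtered_backends[0]
        return chosen["host"], {"backend_array": chosen["backend"],
                                "access_path": chosen.get("access_path")}

    def cast_to_volume_host(host, method, **kwargs):
        """Stand-in for the RPC cast to the volume manager on `host`."""
        print("cast to %s: %s(%r)" % (host, method, kwargs))

    host, placement = schedule_create_volume(
        "vol-123", [{"host": "volume2", "backend": "array-b",
                     "access_path": "iqn.array2"}])
    cast_to_volume_host(host, "create_volume",
                        volume_id="vol-123", scheduler_hints=placement)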
18:01:59 <clayg> it's going to aggregate that info in capabilities and send it to the scheduler, but by the time create volume comes down - it'll know better than the scheduler node which array is the best place for the volume
18:01:59 <vladimir3p> renuka: fine with me
18:02:13 <renuka_> #endmeeting
18:02:57 <renuka_> oh wow, that didn't work because I got disconnected before
18:03:03 <vladimir3p> clayg: it is not really relevant for my company, but I suppose folks managing multiple arrays would prefer to have the scheduling in one place
18:03:05 <clayg> #endmetting
18:03:54 <renuka> #endmeeting