16:00:02 #startmeeting cinder
16:00:02 Meeting started Wed Nov 26 16:00:02 2014 UTC and is due to finish in 60 minutes. The chair is thingee. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:06 The meeting name has been set to 'cinder'
16:00:09 hi all
16:00:12 hi
16:00:12 o/
16:00:16 Hello
16:00:23 Hi
16:00:28 Hi
16:00:31 hiya
16:00:32 hey
16:00:33 hi
16:00:33 hi
16:00:35 agenda today is small, yay!
16:00:37 yoyoyo
16:00:38 #link https://wiki.openstack.org/wiki/CinderMeetings
16:01:05 just a reminder because it has been coming up a couple of times, k-1 is the only time to get your new driver in
16:01:05 Hey
16:01:12 o/
16:01:16 hi
16:01:35 you can read more details here http://lists.openstack.org/pipermail/openstack-dev/2014-October/049512.html
16:01:41 hey
16:02:13 I will be sending out a reminder to all potential new driver maintainers that are already targeted for k-1 that they're making slow progress and that we're aiming to merge December 18
16:02:18 OK, let's begin
16:02:32 The blueprint is https://blueprints.launchpad.net/cinder/+spec/volume-status-polling.
16:02:43 #topic volume status poll
16:02:48 TobiasE: you're up
16:02:51 #link https://review.openstack.org/#/c/132225/
16:02:57 for the cinder spec
16:02:58 thanks
16:03:09 and for the nova spec:
16:03:11 #link https://review.openstack.org/#/c/135367/
16:03:24 I'm assuming this is why I see a jaypipes in the audience this morning
16:03:26 :)
16:03:30 indeed :)
16:03:49 TobiasE: go ahead
16:04:08 We see some problems when running e.g. 100 attaches
16:04:38 The idea is to implement async between nova and cinder
16:04:44 TobiasE: 100 simultaneous attach operations to VMs?
16:05:02 yes, high load scenarios
16:05:24 OK
16:05:26 But it is not the VM attachment that is failing; it is the communication towards the storage backend
16:06:02 We have timeouts on the RPC or HA-proxy side
16:06:10 Since terminate and initialize_connection are calls, not casts, we face the RPC timeout here
16:06:42 And then inconsistencies between the backend and the Cinder DB
16:07:16 ok, so jgriffith raised the point that he would not want to see the timeouts raised. Which makes sense
16:07:52 DuncanT would rather see timeouts raised
16:08:08 API load is already an issue, so I'm not sure I'd like every attach polling. Maybe a blocking call that ends with 'Now poll' if it times out?
16:08:49 I'm concerned you're going to see performance problems if you have a cinder volume polling, even with green threads
16:09:15 why poll? no mechanism for callback?
16:09:16 the only difference is you won't have something that times out
16:09:19 i.e. time out the API call slightly faster than the RPC timeout
16:09:21 is there a way to query the backend storage array for how many ongoing attach requests are underway, and place the request for a new attachment into a queue and then return a polling URI to the caller?
16:09:22 DuncanT: Catching the exception and then starting to poll?
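For context on the timeout TobiasE describes: in oslo.messaging an RPC call() blocks the caller until the server replies and raises MessagingTimeout if the reply is late, while cast() is fire-and-forget. A minimal sketch of that distinction, with transport/target setup elided and the method names illustrative rather than taken from the Cinder tree:

```python
# Illustrative sketch only -- not the actual Cinder volume RPC API code.
import oslo_messaging as messaging  # assumed available


def initialize_connection_call(client, ctxt, volume_id, connector):
    """Blocking RPC: the caller waits for the reply and can hit
    MessagingTimeout if the backend is slow (the problem discussed above)."""
    cctxt = client.prepare(timeout=60)  # analogue of rpc_response_timeout
    return cctxt.call(ctxt, 'initialize_connection',
                      volume_id=volume_id, connector=connector)


def initialize_connection_cast(client, ctxt, volume_id, connector):
    """Fire-and-forget RPC: returns immediately, so the result must come
    back later via polling or a callback/notification."""
    client.cast(ctxt, 'initialize_connection',
                volume_id=volume_id, connector=connector)
```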
16:09:35 TobiasE: If we can
16:09:50 jaypipes: that's what I was thinking, but no we don't have something like that
16:10:18 jaypipes: That sounds plausible, but is a fairly big change
16:10:23 a callback would be the cleanest solution, I think
16:10:24 * jaypipes personally has no problem with not having a timeout as long as there is some reliable way of knowing if the backend is actively working on (or has queued) the attachment request
16:10:31 a worker queue in the cinder-volume manager is also a good idea
16:10:49 but does not solve the issue completely
16:10:50 The load a backend can cope with varies massively though....
16:10:51 enikher: there's a patch with that, and state problems with that.
16:11:04 DuncanT: yes, I would suspect that.
16:11:16 flip214: A callback into nova?
16:11:30 thingee: could you paste the patch?
16:11:38 DuncanT, flip214: that is what neutron does for NIC attachment.
16:11:58 enikher: https://review.openstack.org/#/c/135795/
16:12:04 DuncanT: basically, "wherever you like" ... simply putting a URL into the request that is being called upon completion.
16:12:27 flip214, jaypipes, TobiasE, DuncanT: makes sense to me
16:12:39 polling is soooo four years ago
16:12:48 yes
16:12:48 or, perhaps, return OK if possible within one second; else say "later, will call URL"
16:13:14 flip214: one code path is hard enough i think
16:13:15 there are HTTP codes for that, I believe.
16:13:16 flip214: I like that idea, but it makes the API harder to use through a firewall, because the callback could get blocked
16:13:19 flip214: yes, that is what neutron does.
16:13:27 202 Accepted.
16:13:43 bswartz: This is nova talking to cinder, so if you have a firewall there you have bigger problems
16:13:47 bswartz: in that case, the call has to be idempotent - and gets called until cinder says "done", every few seconds.
16:13:50 DuncanT: +1
16:13:50 DuncanT: right :)
16:13:54 with all the disadvantages.
16:13:58 nova isn't the only thing that does cinder volume attaches
16:14:10 cinder is consumed by other clients too
16:14:10 after cinder is done with initialize_connection, the nova side still needs to discover the LUN, etc., so the attach is not complete just after cinder completes initialize_connection
16:14:34 "discover"? isn't that passed from cinder to nova?
16:14:38 bswartz: Fair enough. I'd like to hear those use cases if there are specific ones, please?
16:15:09 flip214: the nova side has some work to complete after cinder returns
16:15:22 flip214: that's why it waits currently
16:15:32 So the callback would cause that step to happen
16:15:37 xyang1: yes, but I thought that cinder passes information like IP, LUN, etc. back to nova.
16:15:42 block storage as a service? that's not an obvious enough use case? some people like cinder but have problems with nova and choose to use something else, or they have something preexisting that they choose to use instead of noca
16:15:45 so, bottom line, is that the cinder+nova contributor communities need to settle on either supporting long polling or supporting push-based notifications a la Neutron's NIC attachment APIs.
16:15:47 You still need a timeout there though, to deal with stuck backends etc.
16:15:48 s/noca/nova/
16:15:51 flip214: nova needs to make sure the LUN is visible to the host
16:16:24 TobiasE: can we see the spec redone with callback in mind?
16:16:45 Might need some help with that
16:16:47 bswartz: I know the principle, if there are any concrete cases you know of then I'd like to hear about them. I want to write a bare metal attach; if somebody has already done it then it might save me making mistakes they've already avoided
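The "202 Accepted plus callback URL" flow being debated above could look roughly like the following. This is a hypothetical sketch only; none of these endpoint names or payload fields exist in Cinder's real API, and the callback transport (plain HTTP POST here) is one option among several discussed.

```python
# Hypothetical sketch of the "return 202, call back on completion" idea.
import threading

import requests  # assumed available for the example


def handle_initialize_connection(req_body, do_backend_work):
    """Accept the request, do the slow backend work in the background, and
    notify the caller via its supplied callback URL instead of blocking."""
    callback_url = req_body.get('callback_url')

    def worker():
        connection_info = do_backend_work(req_body)  # slow backend call
        if callback_url:
            # Push the result back to the caller (Nova or any other client)
            # rather than making it poll or sit on a blocking RPC.
            requests.post(callback_url, json=connection_info, timeout=10)

    threading.Thread(target=worker, daemon=True).start()
    # Tell the caller the request was accepted but is not finished yet.
    return 202, {'status': 'accepted'}
```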
16:17:02 TobiasE: seems like flip214 has some knowledge to help. :)
16:17:05 xyang1: it's the same problem domain as NIC attachment. Neutron sends information (the "port_binding" dict) to Nova when the NIC has been created inside the Neutron drivers, and Nova then uses the port_binding information to plug the VIF locally on the nova-compute node
16:17:27 TobiasE: I would like to see things expanded with functional tests though.
16:17:35 jaypipes: ok, then we should take a look at that implementation
16:17:38 I'm not at liberty to speak about the case I'm aware of. In any case, I agree the firewall problem is an unlikely one -- just wanted to make sure it was considered.
16:17:56 bswartz: cinder already has calls to nova, so i don't think this adds new requirements
16:18:04 TobiasE: also bring the warnings that bswartz raised into the spec.
16:18:07 thingee: Testing is essential here
16:18:08 xyang1: yes, that is what I recommend as well, but it relies on the cinder+nova contrib communities getting aligned on that direction. thus, I'm here :)
16:18:30 bswartz: Fair enough
16:18:31 TobiasE: absolutely, i just meant the current state of the spec doesn't explain the testing plan well.
16:18:36 jaypipes: ok thanks
16:19:08 #agreed cinder to do callback for attachments for clients to consume
16:19:23 #action TobiasE to update current spec with cinder doing callbacks for attachments
16:19:28 anything else?
16:19:41 Are we going to allow polling as well?
16:19:55 i.e. what does nova get if it calls attach a second time?
16:20:00 hi!
16:20:01 nova will timeout in live-migration
16:20:26 DuncanT: it still can be successful if it is already attached on the array
16:20:31 since initialize_connection then takes longer than expected
16:20:32 DuncanT: the second call should return a 409 Conflict, IMO.
16:21:10 jaypipes: That makes things tricky if nova-compute got restarted or something....
16:21:11 DuncanT: because you don't want to create two callbacks inside Nova. only one.
16:21:26 enikher, DuncanT: these details can be worked out on the mailing list http://lists.openstack.org/pipermail/openstack-dev/2014-November/049756.html
16:21:27 What will happen if nova does not call back?
16:21:35 I just wanted TobiasE to be able to move forward
16:21:44 OK
16:21:58 thanks for the help jaypipes!
16:22:02 still we have the same timeout problem
16:22:04 nova still needs a timeout in case the callback never comes
16:22:08 DuncanT: if nova-compute gets restarted, the callback will have either been consumed from the MQ or not. If not, then the status of the volume attachment should be ERROR, and the user should be able to resubmit an attachment request.
16:22:35 jaypipes: But cinder will never get told that nova didn't successfully attach
16:22:38 we need that live-sign to see that the backend is still working
16:22:47 bswartz: cinder should be able to copy the same retry/timeout code from the neutron work.
16:23:32 DuncanT: I don't see how that is a problem? Isn't the storage backend the source of truth for that type of information?
16:23:34 enikher: then just specify that the callback takes a parameter "in_progress="
16:23:39 not a neutron expert, could you paste the commit?
16:24:02 jaypipes: No. There's the cinder db, and the backend, and currently they can get out of sync
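The "in_progress=" callback parameter and the "live-sign" that enikher asks for amount to a heartbeat from cinder while the backend is still working. A rough sketch of that idea follows; the payload fields and helper names are invented for illustration and do not exist in Cinder.

```python
# Illustrative sketch of the "in_progress" heartbeat / live-sign idea.
import threading

import requests  # assumed available for the example


def attach_with_heartbeat(callback_url, do_backend_attach, heartbeat=30):
    """Run a slow backend attach while periodically telling the caller the
    work is still in progress, then send the final result."""
    done = threading.Event()

    def heartbeats():
        while not done.wait(heartbeat):
            # Liveness signal: the backend is still working, so the caller
            # does not need to time out yet.
            requests.post(callback_url, json={'in_progress': True}, timeout=10)

    threading.Thread(target=heartbeats, daemon=True).start()
    try:
        result = do_backend_attach()
    finally:
        done.set()  # stop heartbeats even if the attach failed
    requests.post(callback_url,
                  json={'in_progress': False, 'connection_info': result},
                  timeout=10)
```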
16:24:10 jaypipes: currently the state of the volume will be changed back to 'available' if a timeout happens
16:24:40 well, I believe that the DB should always be the "should" state.
16:24:41 even though the array may have finished the attach operation
16:24:43 jaypipes: does that involve cinder storing more state? typically cinder backends try to be stateless
16:24:47 jaypipes: We can solve the current problem by making cinder queue a detach if it times out
16:24:48 and the reality has to match what's there.
16:25:04 so in case of conflict the attach operation needs to be re-doable on nova.
16:25:05 DuncanT: right, but how is that Nova's problem?
16:25:46 jaypipes: If nova thinks the attach is still happening, you get in a mess. The nova BDM is yet another piece of state that gets out of sync
16:26:51 jaypipes: Haven't you worked on cleaning BDM up?
16:27:00 One option rather than polling or callback is just for cinder to clean up better on timeout
16:27:16 And leave nova to clean itself up
16:27:22 DuncanT: ++
16:27:27 then the user has to try again?
16:27:33 BDM shouldn't be considered as a place to store volume attaching state, 'cos you can attach a volume to a 'stopped' instance, the BDM still exists, and the volume is not attached.
16:27:33 that is not good I think
16:27:38 This would mean that some attach calls fail and have to be retried, but that is the cloud way sometimes
16:27:47 DuncanT: if nova can call cinder to get volume metadata, and considers cinder's response authoritative, that would be ideal.
16:28:43 DuncanT: that means that on high load requests will fail, get requeued, and make still bigger load.
16:29:00 enikher: Sometimes it is better for the user to retry than to over-complicate code - instance startup and volume create can fail and need retrying too
16:29:29 flip214: Hopefully the load spike will have passed by then
16:29:32 flip214: there's a point at which any system will be overwhelmed -- at that point the caller must throttle his requests or expect failure
16:29:37 I'd prefer to go the callback route, without looking at timeouts for now. that solves the high-load issue.
16:29:52 yes but the user does not get enough information to know that the backend is just overloaded
16:29:59 But the callbacks massively increase the odds of a 'stuck' system
16:30:07 Which can only be fixed by an admin
16:30:09 with very slow backends we had problems attaching 10 volumes
16:30:15 Rather than something the user can simply retry
16:30:26 so that would mean the user has to do a lot of retries
16:30:26 enikher: Then buy a better backend IMO
16:30:40 "simply retry" - based upon what information? she won't know whether the load spike is done.
16:30:49 flip214: Not having any timeouts is a total non-starter
16:31:10 flip214: If the backend is broken then you end up with the user in an unfixable mess
16:31:11 Actually, what do you think about estimating the timeout?
16:31:19 if the user can see the number of queued requests, and in which position some specific one is, she can see progress - or not.
16:31:24 DuncanT: if this is just to fetch the initialize_connection, how do we end up in a stuck state? If nova never sets the volume to in-use, can't cinder just roll it back?
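DuncanT's "make cinder queue a detach if it times out" and thingee's "can't cinder just roll it back?" both describe a server-side cleanup of stale attach work. A sketch of what such a periodic cleanup could look like; every function and field name here is invented for illustration, not taken from the Cinder code base.

```python
# Sketch of the "clean up / queue a detach on timeout" idea; names invented.
import time


def rollback_stale_attachments(db, driver, deadline_seconds=600):
    """Periodic task: any volume stuck in 'attaching' longer than the
    deadline, with no confirmation from the caller, gets rolled back so
    the Cinder DB and the backend agree again."""
    now = time.time()
    for vol in db.volumes_in_state('attaching'):
        # Assumes vol.updated_at is a unix timestamp of the last transition.
        if now - vol.updated_at > deadline_seconds:
            # Undo whatever export/initialize work the backend already did.
            driver.terminate_connection(vol, vol.last_connector)
            db.update_volume(vol.id, status='available',
                             attach_status='detached')
```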
16:31:38 if you want to have a short ha-proxy timeout, the timeout will often occur
16:31:39 flip214: That is way more complex than just saying 'retry some time'
16:32:14 thingee: Once you've done the prep work in order to return the connection info, how do you know when to roll back?
16:32:15 I agree -- you always need a timeout to deal with the case where cinder had a critical hardware failure and was forced to restart
16:32:24 hopefully being able to scale out cinder-volume will also help with higher loads, and yes, backends need to be sane - i would open a bug for a backend that was so inefficient (if it was something the driver could fix)
16:32:28 DuncanT: that would be up to cinder.
16:32:32 http://ferd.ca/queues-don-t-fix-overload.html - queues are good, but don't fix overload.
16:32:33 flip214: the backend is still working or not?
16:32:44 but at least they can be used to _show_ whether there's progress.
16:32:44 bswartz: Or when rabbit lost a message due to restarting, or a cinder service got restarted, or whatever
16:33:11 having retries means (still) higher load, so I don't think the user should retry.
16:33:17 flip214: Exposing queues to a cloud user is a nightmare
16:33:41 so here's where I'm going with this. If nova does an initialize_connection request, cinder does a callback. If nova disappears, cinder should roll it back, unless nova later does something with the callback and tells cinder to set the volume to in-use
16:33:52 flip214: Rate limit the retries just like you have to rate limit everything else to avoid DoS
16:33:58 flip214: estimate the timeouts? so that the polling is not done so frequently?
16:34:28 thingee: If cinder has rolled it back, then the work needs to be done again, which means another callback... ad infinitum for a slow backend
16:35:07 DuncanT: there was no mention of another callback
16:35:18 enikher: estimating based upon what? that's another guessing game.
16:35:23 update status is an API call, which just passes to the cinder DB
16:35:28 thingee: wait, what happens when a user attaches a volume to a stopped instance? initialize_connection() can be called, but the actual attach operation can be long after.
16:35:31 I agree with DuncanT - the spike should be over very quickly and the user can retry. if the spike lasts a long time then it's not a spike and the cloud sucks
16:35:40 winston-d_: exactly
16:35:40 avishay++
16:36:34 thingee: what i meant was in that cinder, cinder shouldn't rely on callback.
16:36:46 s/in that cinder/in that case/
16:36:51 So our number one support call for LVM is 'my volume is stuck in attaching/detaching' - this sounds like it will make that way worse
16:37:18 Honestly I think if the backend is under load, cinder should rely on the callback. if it's not, it should continue behavior as normal
16:37:18 guitarzan: what do you think?
16:38:02 majority of us apparently won't even notice different behavior because, according to the feedback in the spec, we're all using super fast solutions that don't bog down under load.
16:38:08 Two paths means more testing, more bugs and more failures
16:38:28 DuncanT: I know, extra work is a bummer.
16:38:30 If most people don't use a code path, it /will/ end up weirdly broken over time
16:39:27 My proposal went unanswered, so I'm going with that until someone responds. I'll raise it on the mailing list again though
16:39:35 So the callback does not seem to be helpful, or?
16:39:35 this topic is dead imo for this current meeting.
16:39:49 enikher: read my proposal I said twice in this meeting.
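The "rate limit the retries" and "the user can retry once the spike passes" suggestions above are the classic client-side backoff pattern. A generic sketch, not code from Nova or python-cinderclient:

```python
# Generic client-side retry with exponential backoff and jitter, as
# suggested in the discussion; illustrative only.
import random
import time


def attach_with_backoff(attach_fn, max_attempts=5, base_delay=2.0):
    """Retry a failed attach a bounded number of times, backing off
    exponentially so retries do not amplify a load spike."""
    for attempt in range(max_attempts):
        try:
            return attach_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Jitter keeps retries from arriving in synchronized waves
            # while the backend is already overloaded.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```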
16:39:57 #topic Over-Subscription Alternative
16:40:01 bswartz: you're up
16:40:05 hey guys
16:40:10 #link https://review.openstack.org/#/c/129342/
16:40:19 I'm the lone voice of negativity here
16:40:37 bswartz: not in this meeting :)
16:40:53 but I proposed an alternative to xyang's oversubscription proposal
16:41:02 basically I have 2 concerns
16:41:16 thingee: everyone said we were all in agreement on this topic before the summit :)
16:41:34 1) I don't think it's the right UI for administrators to put oversubscription ratios in the cinder.conf file -- I think it's better to make them an aspect of the volume_type
16:42:08 bswartz: why?
16:42:27 2) I think it's a bad idea to implement oversubscription by having the backends effectively lie to the scheduler about how much free space they have, I think the scheduler should know the true values and implement oversubscription itself
16:43:11 bswartz: I think you can only calculate an over-subscription ratio for a backend or a pool, not a volume type though
16:43:12 avishay: thick provisioning (as opposed to thin) may be a value-add option you want to sell to users who are willing to pay more
16:43:22 bswartz: re 2) I sort of agree
16:43:43 because a volume type can be associated with multiple backends, a single backend, or a backend can support multiple volume types
16:43:57 re 1) though, the problem is that if a backend fills up with bronze volumes then that doesn't help the gold volumes at all
16:44:04 xyang1: I attempted to answer that question in my followup comment on the review
16:44:08 I just can't see how the formula will work to compare capacities from a backend with a ratio of a type
16:44:19 \o
16:44:21 bswartz: my understanding is xyang1's proposal doesn't 'lie' to the scheduler; it's a fact that right now the scheduler doesn't know the oversubscription ratio of backends, even though it knows actual/virtual provisioned capacity.
16:44:24 xyang1, TobiasE, enikher: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4453-L4526 <-- the relevant code for how NIC plugging events are handled.
16:44:41 jaypipes: thanks! we'll take a look
16:44:41 DuncanT: so mixing gold and bronze on the same backend only works if the backend is able to enforce that the gold volumes get all the space they're promised
16:44:52 bswartz: hasn't the backend been able to lie to the cinder scheduler for a while?
16:44:54 xyang1, TobiasE, enikher: it's not pretty, but it works...
16:44:56 bswartz: infinite
16:45:01 I proposed that backends also report a "space_reservation" capability
16:45:02 bswartz: Quite. I don't think most can do that
16:45:30 jaypipes: thanks
16:45:40 DuncanT: NetApp can do that -- I'm pretty sure EMC can do that -- LVM can do it
16:46:04 jaypipes: thanks
16:46:17 bswartz: WRT #2, hasn't the backend been able to lie to the scheduler for a while now? by saying "infinite"
16:46:29 bswartz: LVM can? Ok. objection withdrawn
16:46:32 IMO it's pretty dumb to implement thin provisioning if you don't have a way to exempt some things from the thin provisioning
16:46:33 bswartz: we have had 'reserved_percentage' since the filter_scheduler was introduced, but it's never used. what's the difference between this and 'space_reservation'?
16:46:56 otherwise you're asking for disaster
16:47:02 bswartz: is space_reservation actually a thick lun?
16:47:14 xyang1: yes -- it would be thick luns
16:47:32 winston-d: it's important that the reservation only applies to volumes for which the "gold" promise was made
16:47:43 bswartz: you can still create thick luns. this proposal doesn't prevent that
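The Nova code jaypipes links implements a wait-for-external-event pattern: register interest in an event, trigger the action, then block until Neutron calls back or a deadline passes. The sketch below paraphrases that pattern from memory rather than copying the linked code; the wrapper function and the exact signature of wait_for_instance_event are assumptions for illustration.

```python
# Paraphrased sketch of Nova's wait-for-external-event pattern; the wrapper
# is hypothetical and the virtapi signature is assumed, not verbatim.
def plug_vifs_and_wait(virtapi, instance, network_info, plug_vifs,
                       deadline=300):
    # Each expected event is (event_name, tag); here, one per VIF.
    events = [('network-vif-plugged', vif['id']) for vif in network_info]
    # Register interest in the events *before* triggering the action that
    # causes them, so a fast callback cannot be missed.
    with virtapi.wait_for_instance_event(instance, events,
                                         deadline=deadline):
        plug_vifs(instance, network_info)
    # Leaving the context manager means the events arrived; a missed
    # deadline raises instead, and the caller decides how to recover.
```

The same shape would apply to a Cinder attachment callback: Nova registers for a "volume-connection-ready" style event, asks Cinder to initialize the connection, and waits with a deadline instead of blocking on a long RPC call.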
16:47:48 a space reserve % is a blanker reserve across the whole backend
16:47:52 blanket*
16:48:16 xyang1, TobiasE, enikher: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L532 <-- the wait_for_instance_event() main method.
16:48:26 bswartz: there's already a used ratio that controls how much is really used
16:48:44 jaypipes: thank you.
16:48:48 jaypipes: thanks!
16:48:51 no problemo :)
16:49:09 anyways I'm not married to my proposal -- I think some middle ground can be found between what xyang proposed and what I proposed
16:49:27 I just wanted to raise those 2 concerns and suggest some ways to work around them
16:49:43 bswartz: the used ratio xyang1 mentioned doesn't help?
16:49:49 If you really can do per-volume reservations and get accurate answers out of the backend summing them then that sounds preferable
16:49:55 bswartz: so I think your driver can still calculate free space the way you described and send it back to the scheduler
16:50:11 the current proposal makes it impossible to mix "gold" and "bronze" storage on the same backend, assuming the admin wants gold to be thick and bronze to be thin
16:50:51 I think I'm starting to agree with Ben here
16:51:16 xyang1: do you have any plans to mix the different types?
16:51:18 bswartz: it's possible. you only have to do one thin pool per thin LV ;)
16:51:44 flip214: are you referring to cinder pools?
16:51:53 bswartz: there are unresolved issues with Ben's proposal, so I'm still not sure
16:52:03 in principle, a backend could have 2 pools, each with different oversubscription amounts
16:52:13 but how would the admin enter that info into cinder.conf?
16:52:37 xyang1: what's what?
16:52:39 thingee: the over-subscription ratio should really be calculated for a pool or backend
16:53:01 xyang1: did you read my followup comment that answers your question?
16:53:04 it is the ratio of virtual capacity over total capacity
16:53:19 I can write up a whole spec if I need to to spell out all the details
16:53:23 bswartz: did you just update?
16:53:27 but I'd rather just adjust the existing spec
16:53:33 xyang1: like 2 hours ago
16:53:42 bswartz: xyang1 already replied back to you
16:53:45 * jaypipes goes off hunting for lunch...
16:53:48 bswartz: the driver should expose some per-pool level config options to solve that problem - per-pool overcommit amounts
16:54:31 winston-d_: Tricky with dynamic pools
16:54:45 DuncanT: that's what I was thinking
16:55:02 DuncanT: unfortunately yes.
16:55:06 if we can make new pools dynamically, then the driver will have to make up some value or use a default under the existing spec
16:55:20 5 minute warning
16:55:54 I just have a feeling that the scheduler is in a better position than the backends to make decisions about where and when to oversubscribe, and if the scheduler needs more data from the backends to do so, then let's implement that
16:55:55 * DuncanT would like to read some more details of bswartz's approach, if that is doable?
16:56:08 bswartz: +1
16:56:26 It feels like it better answers the questions like dynamic pools and mixing types
16:56:27 so the problem is the scheduler gets free capacity from the backend
16:56:44 the scheduler doesn't calculate free capacity for the backend
16:56:52 bswartz: i agree with your last statement, but i thought that was what xyang1's proposal planned to do?
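xyang1's "ratio of virtual capacity over total capacity" and the reserved_percentage discussion can be written down as simple capacity math. The sketch below is one plausible way to express the check being debated; the parameter names are illustrative and not necessarily those used in the spec under review.

```python
# Illustrative capacity math for thin-provisioning over-subscription.
def over_subscription_ratio(provisioned_gb, total_gb):
    """Ratio of virtual (provisioned) capacity over total physical capacity."""
    return provisioned_gb / float(total_gb)


def fits_on_backend(request_gb, total_gb, provisioned_gb,
                    max_over_subscription_ratio=1.0, reserved_percentage=0):
    """Scheduler-side check: does the new volume fit once thin provisioning
    (up to the allowed ratio) and reserved headroom are accounted for?"""
    usable_gb = total_gb * (1 - reserved_percentage / 100.0)
    virtual_limit_gb = usable_gb * max_over_subscription_ratio
    return provisioned_gb + request_gb <= virtual_limit_gb


# Worked example: 100 GB pool, 10% reserved, 2.0x over-subscription allowed,
# 150 GB already provisioned -> a new 20 GB volume still fits (170 <= 180).
assert fits_on_backend(20, 100, 150, 2.0, 10)
```

Whether this arithmetic runs in the driver (reporting an adjusted free space) or in the scheduler (receiving raw numbers plus the ratio) is exactly the point of disagreement between the two proposals.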
16:57:03 to do what bswartz suggested, it seems it would be a big overhaul of the scheduler
16:57:42 winston-d: it's a step in the right direction, but I think we can do better
16:57:46 bswartz: so the formula you provided to calculate free space, that should be executed by the driver and reported back in the stats
16:57:57 the scheduler doesn't calculate that
16:58:05 just to be clear -- I'm not completely opposed to xyang's approach -- it's an improvement
16:58:11 available_capacity
16:58:20 I'm just worried that we're committing ourselves to an interface that can't be improved on later
16:58:54 DuncanT: I can write a whole new spec
16:58:58 or a wiki or something
16:59:03 make a wall of text on the ML?
16:59:11 s/make/maybe/
16:59:16 bswartz: thx, that'll be very helpful.
16:59:20 bswartz: It sounds like picking any one of those would be a good idea. A spec seems the most logical
16:59:20 bswartz: so that is a problem because the scheduler doesn't calculate available capacity for a backend
16:59:39 xyang1: +1
17:00:00 xyang: looks like you and I will have to meet up again offline
17:00:08 #endmeeting