15:00:10 #startmeeting manila
15:00:12 Meeting started Thu Jun 25 15:00:10 2015 UTC and is due to finish in 60 minutes. The chair is bswartz. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:15 The meeting name has been set to 'manila'
15:00:19 Hi
15:00:23 hello
15:00:24 hi
15:00:25 hello everyone
15:00:28 hi
15:00:29 Hi
15:00:32 hi
15:00:38 hi
15:01:01 #agenda https://wiki.openstack.org/wiki/Manila/Meetings
15:01:02 hi
15:02:02 so you've probably seen the announcements that Liberty-1 is out
15:02:21 #link https://launchpad.net/manila/liberty/liberty-1
15:02:56 lots of bugfixes and the share extend/shrink APIs went in
15:03:14 #topic liberty-2
15:03:20 hi
15:03:24 #link https://launchpad.net/manila/+milestone/liberty-2
15:03:51 right now there's not much targeted at liberty-2 but I expect that to change
15:04:06 bswartz: can you target this for L-2? https://blueprints.launchpad.net/manila/+spec/oversubscription-in-thin-provisioning
15:04:07 the migration work has started interesting discussions about availability zones and share instances
15:04:22 xyang: sure!
15:04:37 bswartz: thanks
15:04:49 what I wanted to tell everyone is that if you have work targeted for L-2 that's not targeted on launchpad, please let me know so I can fix that
15:05:22 reviewers use launchpad to prioritize code reviews (at least I do)
15:06:05 * lpabon here
15:06:17 and release management relies on Launchpad being up to date to know what's in and what's out of a release
15:07:08 if anyone else has blueprints that need to be targeted, please ping me in IRC or send an email
15:07:13 #topic CI status update
15:07:26 #link http://lists.openstack.org/pipermail/openstack-dev/2015-May/064086.html
15:07:40 so today is the first of the deadlines set out in the original CI plan for Liberty
15:08:07 by now all of the drivers should have maintainers signed up, and maintainers should at least have a good idea of when/how they will implement CI
15:08:14 #link https://etherpad.openstack.org/p/manila-driver-maintainers
15:08:47 we have maintainers for all of the drivers except GPFS, and I'm talking to IBM about getting a commitment
15:09:43 later today I'm going to send out a new ML post with a reminder about the next deadline, and I'll be checking in with all of the driver maintainers to make sure things are on track
15:09:50 don't be afraid to ask for help
15:10:49 it's better to ask for help now if you're not exactly sure how to get CI working, because it _will_ take significant time
15:11:08 we have people willing to help but they can't help if you don't ask
15:11:37 btw there are 2 CI systems already reporting
15:11:41 #link http://ec2-54-67-102-119.us-west-1.compute.amazonaws.com:5000/?project=openstack%2Fmanila&user=&timeframe=72&start=&end=&page_size=500
15:11:55 I'd like to thank both HP and NetApp for setting a good example
15:12:52 cknight: that link doesn't render in my browser :-(
15:13:01 oh there it finally did -- just took about 60 seconds
15:13:40 looks like HP CI may have hit a snag in the last day or 2
15:13:48 markstur: ^
15:13:49 cknight: I can't open it either. Is that the tracking diagram?
15:13:50 yep
15:14:02 xyang: yes
15:14:14 I'm on it. Took a little while for me to figure out that it was consistent.
15:14:55 any questions on CI before we move on?
15:15:18 bswartz: are you satisfied with the CI progress for L-1? Has everyone registered plans?
15:16:12 What is meant by 'registered'?
15:16:14 cknight: overall, yes. there are some vendors that are definitely behind, but they know they're behind and they're working on addressing it
15:16:38 cknight, kaisers_0: I haven't actually asked anyone to write up a formal plan
15:17:04 Ok. :)
15:17:11 I'm willing to take people's word for it that they have a plan. It will become evident very shortly if they don't
15:17:32 bswartz: good enough :-)
15:18:12 there's only 5 weeks until the next deadline and 10 weeks until we expect CI systems to be running
15:18:33 I'll spell all this out in the upcoming ML post
15:19:03 now that I have contact information I'll be able to start working with individuals on meeting the next deadline before it happens
15:19:44 okay I think that's enough on CI
15:19:49 #topic DriverLog
15:20:00 The DriverLog project on Stackalytics is where you can advertise drivers.
15:20:02 #link http://stackalytics.com/report/driverlog?project_id=openstack%2Fmanila
15:20:09 We just added the Manila project there, along with the NetApp driver.
15:20:15 Driver maintainers: please follow the wiki instructions to register your drivers.
15:20:19 yeah I'm not sure who owns driverlog but it's a cool project
15:20:22 #link https://wiki.openstack.org/wiki/DriverLog
15:20:43 bswartz: Anything else on this topic?
15:20:43 cknight: thanks for the reminder
15:21:14 cknight: is driverlog a required thing for drivers in other projects or is it optional?
15:21:33 bswartz: Don't know, but it's trivial to add.
15:21:41 should we be requiring all drivers for manila to put something there, or should they be motivated by their own self-interest?
15:22:20 bswartz: It seems self-interest would be sufficient, but it's your call. It isn't a big ask.
15:22:32 Don't see the need to force this.
15:22:43 bswartz: It's a more formal location to register the driver maintainer.
15:22:48 Our change is waiting on my HD
15:23:09 okay, I *suggest* that all the driver maintainers add an entry in driverlog for their manila drivers
15:23:24 for committing, just wanted to wait for this meeting...
15:23:46 the driverlog project serves as useful documentation for end users so it's worth doing
15:24:04 I think that wraps up this topic
15:24:07 bswartz: +1 it includes a link to vendor driver documentation
15:24:30 #topic open discussion
15:24:50 okay I didn't have anything else for the meeting today
15:24:57 * bswartz is still recovering from new baby
15:25:09 Gratz!!
15:25:31 I have a question about extend share errors
15:25:53 If extending a share fails, the share status changes to 'extending_error'. The share can no longer be used.
15:26:01 for those that wonder what I've been up to: I'm revisiting the share replication design proposed in Vancouver, based on changes related to availability zones and share instances -- I don't have anything to share yet, but expect a revised proposal for share replication sometime soon
15:26:09 kaisers_0: ty
15:26:23 zhongjun2: yes
15:26:30 I don't think this behavior is very appropriate. Maybe a failed extend should not affect use of the share. We could notify the user that the extend failed without changing the share status, so the share status stays 'available'.
15:26:44 zhongjun2: depending on what happened to cause the failure, the administrator may need to take action to make the share usable again
15:27:24 bswartz: +1
15:27:51 I suppose if the driver can ensure that the share is in the exact same state it was in before the extend was attempted, then we could put the state back to available -- but that could lead to confusion
15:27:59 How do we make the share usable again?
15:28:20 admin API reset-state
15:28:29 zhongjun2: usually shares in error state must be fixed by an administrator -- so in a typical cloud it would involve filing a trouble ticket
15:29:51 zhongjun2: i think that some systems may be able to keep the share usable even if it failed to be extended
15:30:21 like a DB transaction.. some storage systems may be able to go back to a good state automatically on a failure to extend
15:31:00 lpabon: how would users find out that their extend failed though?
15:31:14 they might see that the share is available, and assume that it succeeded
15:31:19 bswartz: I think we could combine the mount automation notifications framework with error notifications
15:31:26 I think that the extension api should fail, but probably the state of the share should be up to the driver
15:31:34 do we expect users to check the size of the share to validate that the extend worked?
15:32:07 u_glide: +1 There has to be a way to notify users of share actions without necessarily changing the share status.
15:32:17 I agree that it's a bad user experience to fail the extend and leave the share in a state where you can't mount it at all
15:32:31 bswartz: let me make sure we are talking about the same thing. The tenant is the one that extends the share through the API, not the users
15:32:39 I think that the extension api should fail, but probably the state of the share should be up to the driver. I agree with that
15:33:16 lpabon: tenant=user
15:33:51 Ok, so when the API request comes in, the resize will fail, and the user will know because the API returned the error status
15:34:42 lpabon: like most operations, resize is asynchronous, so the API will succeed
15:34:50 O.o
15:35:13 Being asynchronous, it should still be polled for success
15:35:14 At least I think it is -- anyone know if I'm wrong?
15:35:36 the client (horizon or whatever) should poll the status of the request
15:35:42 It is the manager who usually updates the status later, after the operation is complete
15:36:07 submitting the request is different from the request status, and I think it is up to the client/server relation to check request submission status
15:36:08 xyang: yes, but is it usually or is it always?
15:36:18 bswartz: always
15:36:46 lpabon: it's a well-known problem in openstack
15:37:01 API usually sets status to extending, creating, etc.
15:37:02 * lpabon cries a little bit ...
15:37:07 it could make sense to return a flag from the driver to the manager indicating whether the request (1) succeeded, (2) failed with a clean rollback, or (3) failed and needs cleanup
15:37:09 manager sets the final status
15:37:23 u_glide: It's consistent with the design tenets, but it brings issues with it that have yet to be solved cleanly. https://wiki.openstack.org/wiki/BasicDesignTenets
15:37:56 bummer, well, at least you get the idea of what I mean
15:38:10 +1 for BasicDesignTenets
15:38:23 lpabon: yep. notifications are being worked on by various projects.
15:38:45 lpabon: heat is planning to use zaqar for async events, and ameade wants to do the same in Manila.
15:39:22 cool, so _somehow_ the status of the request (share extension in this example) would go back to the tenant/user, notifying them that it failed
15:39:40 lpabon: yes
15:39:45 even with better notifications in the future... seems like we could still use that flag for 2 kinds of failure
15:40:06 but.. the status of the share should be up to the driver, depending on what the storage system is capable of
15:40:27 so maybe the share is still capable of being used even though it failed to extend
15:40:34 I'd like to point out that our goal should be to make things not fail...
15:40:42 bswartz: +10
15:40:50 bswartz: +10 :-)
15:40:53 if backends never fail operations like extend, then how we handle the failures becomes a lot less important
15:41:54 if an extend failure is a truly exceptional case, then it's no big deal if it requires an administrator to fix it, because it implies something really unusual happened
15:41:54 never-never :)
15:42:04 "Error: something really bad happened. Please try again." ?
15:42:57 This is how extend is supposed to work:
15:42:58 markstur: probably: "Error: something really bad happened. Abort, Retry, Fail?"
15:43:20 drivers report their free space to the scheduler -- free space should be usable for extending existing shares
15:43:47 lpabon: :-)
15:43:53 when a user requests to extend a share, the scheduler should check for free space before invoking the driver
15:44:08 bswartz: but it doesn't today
15:44:19 if there was a quota problem or a space problem, then the request should get rejected before the driver is ever invoked and the share should remain in available state
15:44:47 once the driver is invoked, the resize really should succeed unless something we didn't plan for happens
15:44:52 cknight: which part?
15:45:13 bswartz: iirc, the scheduler isn't consulted during extend
15:45:22 well crap
15:45:23 bswartz: Hmmm, i'm not sure.. The process of extending storage is quite involved
15:45:29 that seems like something we should fix
15:45:32 u_glide: is that right?
15:45:48 I think the notification efforts will help by classifying errors in terms that users need to know vs admin details
15:46:00 if we can determine ahead of time that there isn't enough space to extend a share, we should not invoke the driver's extend method
15:46:01 cknight: right
15:46:08 we should have some errors that can suggest a retry (like a timeout)
15:46:21 A 'thin' share is not limited by the free space.
15:46:23 bswartz: it gets more complicated in thin-provisioning cases, need that to merge first
15:46:25 and others that indicate that an admin is needed
15:46:25 without working tenant notifications, it's just idle talk :)
15:46:32 lpabon: I'm sure it depends on the backend, but I could see some backends having a very involved process
15:46:52 bswartz: netapp ;-)
15:47:25 lpabon: -1 the netapp backend just works :-)
15:47:25 Well, we could still add that flag that Ben mentioned and have 2 error states for this if it appears necessary.
15:47:35 lpabon: on netapp a share resize is as easy as changing a number (which is effectively a quota value)
15:47:35 cknight: lol
15:48:14 in glusterfs a storage component must be created and then added to a volume. The volume must accept the new component and increase the overall size.
15:48:20 extend failed (rolled back) and extend failed (needs cleanup)
15:48:40 markstur: even if we do what I suggested, that doesn't help the user find out whether their extend was successful or not
15:48:58 markstur: the user would have to inspect the share size in that case
15:49:05 iow, determining that storage space is available is a simple way to detect an early failure, but the execution of the extension is still up to the storage system, and it could still fail
15:49:30 How about: Extending... Available. Doesn't that mean it succeeded?
15:49:45 not if it failed and rolled back to available
15:49:57 bswartz: an alternative is to add a user-facing "heal" API, which tells the driver to attempt a cleanup from most any failed state and recover to available if possible
15:50:09 bswartz: it's friendlier than reset-state
15:50:09 cknight: -1
15:50:29 if users need a heal API it means we're already doing something wrong
15:50:50 Let me see if i can summarize:
15:50:55 it's more of a clear notification then, isn't it
15:51:10 the system should heal itself to the extent possible, and when it can't it should ask for help (which the admin must provide)
15:51:33 Tenant wants to resize, scheduler checks for availability to resize and returns failure if there is no room.. If there is, driver executes resize...
15:51:52 ... Async call goes back to the client...
15:52:02 ... Driver does resize...
15:52:13 lpabon: yes, and the size check could be synchronous, so a failed size check could actually fail the API similar to a failed quota check
15:52:16 ... Success or Failure state is returned..
15:52:23 lpabon: A 'thin' share is not limited by the free space.
15:52:42 We should still have some ways to recover from failure. I don't think we can completely avoid that
15:52:51 .. driver could decide to continue in a usable state or fail the entire share..
15:53:46 xyang: recovering from failures is what the admin does, and we give the admin APIs to do that
15:53:50 like reset-state
15:53:56 xyang: i'm not sure if Manila should do that
15:54:05 bswartz: ya, that is what I meant
15:54:09 it's probably a storage management issue
15:54:26 lpabon: we need to be able to sync up the manila db and backend status
15:54:45 okay 5 minute warning
15:54:53 xyang: yeah, i agree, that should be up to the driver, imho
15:55:11 this was an interesting topic, thanks for bringing it up zhongjun2
15:55:16 lpabon: an admin api that consults the driver
15:55:21 Gotta leave, thanks & bye
15:55:22 are we in rough agreement about what to do here?
15:55:34 bswartz: :-) not sure
15:55:45 okay
15:55:48 bswartz: :-)
15:56:27 I propose that we add a free-space check to extend requests so we don't attempt to extend when there is no space
15:56:49 but once the extend request goes to the driver, if it fails then the share goes into error state
15:57:21 it already checks capacity. You mean check with more live stats?
15:57:41 bswartz: -1, sorry, but I do not think the entire share should go into error... unless the error state still allows usability
15:57:42 markstur: I heard cknight and u_glide say we don't check capacity -- we only check quota
15:57:42 https://github.com/openstack/manila/blob/9d78f9ca36e4fc25a4ebe01c78781c126f7ea055/manila/api/contrib/share_actions.py#L145-L150
15:58:00 markstur: extend share doesn't go through the scheduler, so it doesn't check capacity
15:58:05 bswartz: might want to wait for the "Xing-style" thin provisioning patch to merge
15:58:06 oh
15:58:30 xyang: (that's what we've been calling it) ;-)
15:58:40 :)
15:58:56 if we don't plan to check capacity, or we can't check capacity, then there's a much better case for allowing the driver to fail extends cleanly and roll back to the old size
15:59:15 bswartz: yeah
15:59:26 but in that case we need to communicate to the end user that the extend failed somehow
15:59:30 another possibility is to have every request go to the scheduler, but winston abandoned that idea for some reason in cinder. I'll check with him on that
15:59:36 bswartz: probably better to just pass it to the driver and let it do the magic
15:59:55 okay we're at time
15:59:57 thanks everyone
16:00:04 good talk! thanks everyone
16:00:06 #endmeeting
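
Editor's note: the following is a minimal sketch, not Manila code, of the extend flow discussed above: a synchronous free-space check before the driver is invoked (so a space problem fails the API like a quota failure and the share stays 'available'), plus a driver result flag distinguishing a clean rollback from a failure that needs admin cleanup. All names here (ExtendResult, check_free_space, FakeDriver, extend_share) are illustrative assumptions, not actual Manila APIs.

    # Hypothetical sketch of the proposed extend flow; not actual Manila code.
    import enum


    class ExtendResult(enum.Enum):
        SUCCEEDED = "succeeded"
        FAILED_ROLLED_BACK = "failed_rolled_back"      # share still usable
        FAILED_NEEDS_CLEANUP = "failed_needs_cleanup"  # admin must intervene


    def check_free_space(pool_stats, current_size_gb, new_size_gb):
        """Reject the request up front if the pool cannot hold the size delta."""
        delta = new_size_gb - current_size_gb
        return pool_stats.get("free_capacity_gb", 0) >= delta


    def extend_share(driver, share, new_size_gb, pool_stats):
        """Synchronous size check, then the (normally asynchronous) driver call."""
        if not check_free_space(pool_stats, share["size"], new_size_gb):
            # Fail the API call immediately, like a failed quota check;
            # the share never leaves the 'available' state.
            raise ValueError("not enough free space to extend share")

        share["status"] = "extending"
        result = driver.extend_share(share, new_size_gb)

        if result is ExtendResult.SUCCEEDED:
            share["size"] = new_size_gb
            share["status"] = "available"
        elif result is ExtendResult.FAILED_ROLLED_BACK:
            # Backend restored its original state; keep the share usable and
            # surface the failure through a user notification instead.
            share["status"] = "available"
        else:
            # Unexpected failure; an administrator must reset-state the share.
            share["status"] = "extending_error"
        return share


    class FakeDriver(object):
        """Toy backend that always rolls back failed extends cleanly."""

        def extend_share(self, share, new_size_gb):
            if new_size_gb > 100:  # pretend the backend has a hard size limit
                return ExtendResult.FAILED_ROLLED_BACK
            return ExtendResult.SUCCEEDED


    if __name__ == "__main__":
        share = {"size": 10, "status": "available"}
        pool = {"free_capacity_gb": 500}
        print(extend_share(FakeDriver(), share, 50, pool))   # succeeds
        print(extend_share(FakeDriver(), share, 200, pool))  # rolls back, stays available

As noted in the discussion, leaving a cleanly rolled-back share 'available' only works if a tenant-facing notification path (such as the zaqar-based async events mentioned by cknight) exists to tell the user the extend actually failed; otherwise the user must compare the share size to detect it.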