15:00:10 <bswartz> #startmeeting manila
15:00:12 <openstack> Meeting started Thu Jun 25 15:00:10 2015 UTC and is due to finish in 60 minutes. The chair is bswartz. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:15 <openstack> The meeting name has been set to 'manila'
15:00:19 <cknight> Hi
15:00:23 <markstur> hello
15:00:24 <rraja> hi
15:00:25 <bswartz> hello everyone
15:00:28 <toabctl> hi
15:00:29 <kaisers_> Hi
15:00:32 <xyang> hi
15:00:38 <u_glide> hi
15:01:01 <bswartz> #agenda https://wiki.openstack.org/wiki/Manila/Meetings
15:01:02 <zhongjun2> hi
15:02:02 <bswartz> so you've probably seen the announcements that Liberty-1 is out
15:02:21 <bswartz> #link https://launchpad.net/manila/liberty/liberty-1
15:02:56 <bswartz> lots of bugfixes, and the share extend/shrink APIs went in
15:03:14 <bswartz> #topic liberty-2
15:03:20 <tbarron> hi
15:03:24 <bswartz> #link https://launchpad.net/manila/+milestone/liberty-2
15:03:51 <bswartz> right now there's not much targeted at liberty-2 but I expect that to change
15:04:06 <xyang> bswartz: can you target this for L-2? https://blueprints.launchpad.net/manila/+spec/oversubscription-in-thin-provisioning
15:04:07 <bswartz> the migration work has started interesting discussions about availability zones and share instances
15:04:22 <bswartz> xyang: sure!
15:04:37 <xyang> bswartz: thanks
15:04:49 <bswartz> what I wanted to tell everyone is that if you have work targeted for L-2 that's not targeted on launchpad, please let me know so I can fix that
15:05:22 <bswartz> reviewers use launchpad to prioritize code reviews (at least I do)
15:06:05 * lpabon here
15:06:17 <bswartz> and release management relies on Launchpad being up to date to know what's in and what's out of a release
15:07:08 <bswartz> if anyone else has blueprints that need to be targeted, please ping me in IRC or send an email
15:07:13 <bswartz> #topic CI status update
15:07:26 <bswartz> #link http://lists.openstack.org/pipermail/openstack-dev/2015-May/064086.html
15:07:40 <bswartz> so today is the first of the deadlines set out in the original CI plan for Liberty
15:08:07 <bswartz> by now all of the drivers should have maintainers signed up, and maintainers should at least have a good idea of when/how they will implement CI
15:08:14 <bswartz> #link https://etherpad.openstack.org/p/manila-driver-maintainers
15:08:47 <bswartz> we have maintainers for all of the drivers except GPFS, and I'm talking to IBM about getting a commitment
15:09:43 <bswartz> later today I'm going to send out a new ML post with a reminder about the next deadline, and I'll be checking in with all of the driver maintainers to make sure things are on track
15:09:50 <bswartz> don't be afraid to ask for help
15:10:49 <bswartz> it's better to ask for help now if you're not exactly sure how to get CI working because it _will_ take significant time
15:11:08 <bswartz> we have people willing to help but they can't help if you don't ask
15:11:37 <bswartz> btw there are 2 CI systems already reporting
15:11:41 <cknight> #link: http://ec2-54-67-102-119.us-west-1.compute.amazonaws.com:5000/?project=openstack%2Fmanila&user=&timeframe=72&start=&end=&page_size=500
15:11:55 <bswartz> I'd like to thank both HP and NetApp for setting a good example
15:12:52 <bswartz> cknight: that link doesn't render in my browser :-(
15:13:01 <bswartz> oh there it finally did -- just took about 60 seconds
15:13:40 <bswartz> looks like HP CI may have hit a snag in the last day or 2
15:13:48 <bswartz> markstur: ^
15:13:49 <xyang> cknight: I can't open it either. Is that the tracking diagram?
15:13:50 <markstur> yep
15:14:02 <cknight> xyang: yes
15:14:14 <markstur> I'm on it. Took a little while for me to figure out that it was consistent.
15:14:55 <bswartz> any questions on CI before we move on?
15:15:18 <cknight> bswartz: are you satisfied with the CI progress for L-1? Has everyone registered plans?
15:16:12 <kaisers_0> What is meant by 'registered'?
15:16:14 <bswartz> cknight: overall, yes. there are some vendors that are definitely behind but they know they're behind and they're working on addressing it
15:16:38 <bswartz> cknight, kaisers_0: I haven't actually asked anyone to write up a formal plan
15:17:04 <kaisers_0> Ok. :)
15:17:11 <bswartz> I'm willing to take people's word for it that they have a plan. It will become evident very shortly if they don't
15:17:32 <cknight> bswartz: good enough :-)
15:18:12 <bswartz> there are only 5 weeks until the next deadline and 10 weeks until we expect CI systems to be running
15:18:33 <bswartz> I'll spell all this out in the upcoming ML post
15:19:03 <bswartz> now that I have contact information I'll be able to start working with individuals on meeting the next deadline before it happens
15:19:44 <bswartz> okay I think that's enough on CI
15:19:49 <bswartz> #topic DriverLog
15:20:00 <cknight> The DriverLog project on Stackalytics is where you can advertise drivers.
15:20:02 <bswartz> #link http://stackalytics.com/report/driverlog?project_id=openstack%2Fmanila
15:20:09 <cknight> We just added the Manila project there, along with the NetApp driver.
15:20:15 <cknight> Driver maintainers: please follow the wiki instructions to register your drivers.
15:20:19 <bswartz> yeah I'm not sure who owns driverlog but it's a cool project
15:20:22 <cknight> #link https://wiki.openstack.org/wiki/DriverLog
15:20:43 <cknight> bswartz: Anything else on this topic?
15:20:43 <xyang> cknight: thanks for the reminder
15:21:14 <bswartz> cknight: is driverlog a required thing for drivers in other projects or is it optional?
15:21:33 <cknight> bswartz: Don't know, but it's trivial to add.
15:21:41 <bswartz> should we be requiring all drivers for manila to put something there or should they be motivated by their own self-interest?
15:22:20 <cknight> bswartz: It seems self-interest would be sufficient, but it's your call. It isn't a big ask.
15:22:32 <kaisers_0> Don't see the need to force this.
15:22:43 <cknight> bswartz: It's a more formal location to register the driver maintainer.
15:22:48 <kaisers_0> Our change is waiting on my HD
15:23:09 <bswartz> okay, I *suggest* that all the driver maintainers add an entry in driverlog for their manila drivers
15:23:24 <kaisers_0> for committing, just wanted to wait for this meeting...
15:23:46 <bswartz> the driverlog project serves as useful documentation for end users so it's worth doing
15:24:04 <bswartz> I think that wraps up this topic
15:24:07 <cknight> bswartz: +1 it includes a link to vendor driver documentation
15:24:30 <bswartz> #topic open discussion
15:24:50 <bswartz> okay I didn't have anything else for the meeting today
15:24:57 * bswartz is still recovering from new baby
15:25:09 <kaisers_0> Gratz!!
15:25:31 <zhongjun2> I have a question about extend share error
15:25:53 <zhongjun2> If extend share fails, the share status changes to 'extending_error', and this share can no longer be used.
15:26:01 <bswartz> for those that wonder what I've been up to: I'm revisiting the share replication design proposed in vancouver, based on changes related to availability zones and share instances -- I don't have anything to share yet, but expect a revised proposal for share replication sometime soon
15:26:09 <bswartz> kaisers_0: ty
15:26:23 <bswartz> zhongjun2: yes
15:26:30 <zhongjun2> I don't think this behavior is very appropriate. Maybe a failed extend should not prevent the share from being used. We could notify the user that the extend failed without changing the share status, so the status stays 'available'.
15:26:44 <bswartz> zhongjun2: depending on what happened to cause the failure, the administrator may need to take action to make the share usable again
15:27:24 <u_glide> bswartz: +1
15:27:51 <bswartz> I suppose if the driver can ensure that the share is in the exact same state it was in before the extend was attempted, then we could put the state back to available -- but that could lead to confusion
15:27:59 <zhongjun2> How would the share be made usable again?
15:28:20 <u_glide> admin API reset-state
15:28:29 <bswartz> zhongjun2: usually shares in error state must be fixed by an administrator -- so in a typical cloud it would involve filing a trouble ticket
15:29:51 <lpabon> zhongjun2: i think that some systems may be able to keep the share usable even if it failed to be extended
15:30:21 <lpabon> like a DB transaction.. some storage systems may be able to go back to a good state automatically on failure to extend
15:31:00 <bswartz> lpabon: how would users find out that their extend failed though?
15:31:14 <bswartz> they might see that the share is available, and assume that it succeeded
15:31:19 <u_glide> bswartz: I think we could combine the mount automation notifications framework with error notifications
15:31:26 <lpabon> I think that the extension api should fail, but probably the state of the share should be up to the driver
15:31:34 <bswartz> do we expect users to check the size of the share to validate that the extend worked?
15:32:07 <cknight> u_glide: +1 There has to be a way to notify users of share actions without necessarily changing the share status.
15:32:17 <bswartz> I agree that it's a bad user experience to fail the extend and leave the share in a state where you can't mount it at all
15:32:31 <lpabon> bswartz: let me make sure we are talking about the same thing. The tenant is the one that extends the share through the API, not the users
15:32:39 <zhongjun2> I think that the extension api should fail, but probably the state of the share should be up to the driver. I agree with that
15:33:16 <bswartz> lpabon: tenant=user
15:33:51 <lpabon> Ok, so when the API request comes in, the resize will fail, the user will know because the API returned the error status
15:34:42 <bswartz> lpabon: like most operations resize is asynchronous, so the API will succeed
15:34:50 <lpabon> O.o
15:35:13 <lpabon> Being asynchronous, it should still be pinged for success
15:35:14 <bswartz> At least I think it is -- anyone know if I'm wrong?
15:35:36 <lpabon> the client (horizon or whatever) should ping the status of the request
15:35:42 <xyang> It is the manager who usually updates the status later after the operation is complete
15:36:07 <lpabon> submitting the request is different from requesting its status, and I think it is up to the client/server relationship to request the submission status
15:36:08 <bswartz> xyang: yes, but is it usually or is it always?
15:36:18 <xyang> bswartz: always
15:36:46 <u_glide> lpabon: it's a well-known problem in openstack
15:37:01 <xyang> the API usually sets the status to extending, creating, etc.
15:37:02 * lpabon cries a little bit ...
15:37:07 <bswartz> it could make sense to return a flag from the driver to the manager indicating whether the request: (1) succeeded, (2) failed with a clean rollback, or (3) failed and needs cleanup
15:37:09 <xyang> the manager sets the final status
15:37:23 <cknight> u_glide: It's consistent with the design tenets, but it brings issues with it that have yet to be solved cleanly. https://wiki.openstack.org/wiki/BasicDesignTenets
15:37:56 <lpabon> bummer, well, at least you get the idea of what I mean
15:38:10 <bswartz> +1 for BasicDesignTenets
15:38:23 <cknight> lpabon: yep. notifications are being worked on by various projects.
15:38:45 <cknight> lpabon: heat is planning to use zaqar for async events, and ameade wants to do the same in Manila.
15:39:22 <lpabon> cool, so _somehow_ the status of the request (share extension in this example) would go back to the tenant/user, notifying them that it failed
15:39:40 <cknight> lpabon: yes
15:39:45 <markstur> even with future better notification... seems like we could still use that flag for 2 kinds of failure
15:40:06 <lpabon> but.. the status of the share should be up to the driver, depending on what the storage system is capable of
15:40:27 <lpabon> so maybe the share is still capable of being used even though it failed to extend
15:40:34 <bswartz> I'd like to point out that our goal should be to make things not fail...
15:40:42 <lpabon> bswartz: +10
15:40:50 <cknight> bswartz: +10 :-)
15:40:53 <bswartz> if backends never fail operations like extend, then how we handle the failures becomes a lot less important
15:41:54 <bswartz> if an extend failure is a truly exceptional case, then it's no big deal if it requires an administrator to fix it, because it implies something really unusual happened
15:41:54 <u_glide> never-never :)
15:42:04 <markstur> "Error: something really bad happened. Please try again." ?
15:42:57 <bswartz> This is how extend is supposed to work:
15:42:58 <lpabon> markstur: probably: "Error: something really bad happened. Abort, Retry, Fail?"
15:43:20 <bswartz> drivers report their free space to the scheduler -- free space should be usable for extending existing shares
15:43:47 <cknight> lpabon: :-)
15:43:53 <bswartz> when a user requests to extend a share, the scheduler should check for free space before invoking the driver
15:44:08 <cknight> bswartz: but it doesn't today
15:44:19 <bswartz> if there was a quota problem or a space problem, then the request should get rejected before the driver is ever invoked and the share should remain in available state
15:44:47 <bswartz> once the driver is invoked, the resize really should succeed unless something we didn't plan for happens
15:44:52 <bswartz> cknight: which part?
15:45:13 <cknight> bswartz: iirc, the scheduler isn't consulted during extend
15:45:22 <bswartz> well crap
15:45:23 <lpabon> bswartz: Hmmm, i'm not sure.. The process of extending storage is quite involved
15:45:29 <bswartz> that seems like something we should fix
15:45:32 <cknight> u_glide: is that right?
15:45:48 <markstur> I think the notification efforts will help by classifying errors into what users need to know vs admin details
15:46:00 <bswartz> if we can determine ahead of time that there isn't enough space to extend a share we should not invoke the driver's extend method
15:46:01 <u_glide> cknight: right
15:46:08 <markstur> we should have some errors that can suggest retry (like a timeout)
15:46:21 <zhongjun2> A 'thin' share is not limited by the free space.
15:46:23 <cknight> bswartz: it gets more complicated in thin-provisioning cases, need that to merge first
15:46:25 <markstur> and others that indicate an admin is needed
15:46:25 <u_glide> without working tenant notifications, it's just idle talk :)
15:46:32 <bswartz> lpabon: I'm sure it depends on the backend, but I could see some backends having a very involved process
15:46:52 <lpabon> bswartz: <cough> netapp</cough> ;-)
15:47:25 <cknight> lpabon: -1 the netapp backend just works :-)
15:47:25 <markstur> Well we could still add that flag that Ben mentioned and have 2 error states for this if it appears necessary.
15:47:35 <bswartz> lpabon: on netapp a share resize is as easy as changing a number (which is effectively a quota value)
15:47:35 <lpabon> cknight: lol
15:48:14 <lpabon> in glusterfs a storage component must be created and then added to a volume. The volume must accept the volume component and increase the overall size.
15:48:20 <markstur> extend failed (rolled back) and extend failed (needs cleanup)
15:48:40 <bswartz> markstur: even if we do what I suggested, that doesn't help the user find out whether his extend was successful or not
15:48:58 <bswartz> markstur: the user would have to inspect the share size in that case
15:49:05 <lpabon> iow, determining that the storage size is available is a simple way to detect an early failure, but the execution of the extension is still up to the storage system, and it could still fail
15:49:30 <markstur> How about: Extending... Available. Doesn't that mean it succeeded?
15:49:45 <bswartz> not if it failed and rolled back to available
15:49:57 <cknight> bswartz: an alternative is to add a user-facing "heal" API, which tells the driver to attempt a cleanup from most any failed state and recover to available if possible
15:50:09 <cknight> bswartz: it's friendlier than reset-state
15:50:09 <bswartz> cknight: -1
15:50:29 <bswartz> if users need a heal API it means we're already doing something wrong
15:50:50 <lpabon> Let me see if i can summarize:
15:50:55 <markstur> it's more of a clear notification then, isn't it
15:51:10 <bswartz> the system should heal itself to the extent possible and when it can't it should ask for help (which the admin must provide)
15:51:33 <lpabon> Tenant wants to resize, scheduler checks for availability to resize and returns failure if there is no room.. If there is, driver executes resize...
15:51:52 <lpabon> ... Async call goes back to the client...
15:52:02 <lpabon> ... Driver does resize...
15:52:13 <bswartz> lpabon: yes, and the size check could be synchronous so a failed size check could actually fail the API, similar to a failed quota check
15:52:16 <lpabon> ... Success or Failure state is returned..
15:52:23 <zhongjun2> lpabon: A 'thin' share is not limited by the free space.
15:52:42 <xyang> We should still have some ways to recover from failure. I don't think we can completely avoid that
15:52:51 <lpabon> .. the driver could decide to keep the share usable or fail the entire share..
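(A minimal sketch of the flow lpabon summarized above and the result flag bswartz and markstur discussed: a synchronous free-space check before the driver is invoked, and distinct outcomes for a clean rollback versus a failure that needs admin cleanup. This is not Manila code; all names such as ExtendResult, has_free_space, and the scheduler/driver/db objects are hypothetical.)

```python
# Hypothetical sketch of the extend flow discussed in the meeting (not Manila code).
import enum


class ExtendResult(enum.Enum):
    SUCCEEDED = "succeeded"          # share is now at the requested size
    ROLLED_BACK = "rolled_back"      # extend failed, share unchanged and still usable
    NEEDS_CLEANUP = "needs_cleanup"  # extend failed, administrator action needed


def extend_share(share, new_size, scheduler, driver, db):
    """Hypothetical manager-side handling of an extend request."""
    # Synchronous pre-check: reject the request before the share ever leaves
    # the 'available' state (analogous to the existing quota check).
    if not scheduler.has_free_space(share, new_size - share["size"]):
        raise ValueError("not enough free space to extend share")

    db.update(share["id"], {"status": "extending"})

    # In Manila this step is asynchronous; shown inline here for brevity.
    result = driver.extend_share(share, new_size)

    if result is ExtendResult.SUCCEEDED:
        db.update(share["id"], {"status": "available", "size": new_size})
    elif result is ExtendResult.ROLLED_BACK:
        # Share stays usable at the old size; the user would still need some
        # notification (e.g. the async events mentioned above).
        db.update(share["id"], {"status": "available"})
    else:
        # Something unexpected happened; an admin must repair or reset-state.
        db.update(share["id"], {"status": "extending_error"})
```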
15:53:46 <bswartz> xyang: recovering from failures is what the admin does, and we give the admin APIs to do that
15:53:50 <bswartz> like reset-state
15:53:56 <lpabon> xyang: i'm not sure if Manila should do that
15:54:05 <xyang> bswartz: ya, that is what I meant
15:54:09 <lpabon> it's probably a storage management issue
15:54:26 <xyang> lpabon: we need to be able to sync up the manila db and backend status
15:54:45 <bswartz> okay 5 minute warning
15:54:53 <lpabon> xyang: yeah, i agree, that should be up to the driver, imho
15:55:11 <bswartz> this was an interesting topic, thanks for bringing it up zhongjun2
15:55:16 <xyang> lpabon: an admin api that consults the driver
15:55:21 <kaisers_0> Gotta leave, thanks & bye
15:55:22 <bswartz> are we in rough agreement about what to do here?
15:55:34 <lpabon> bswartz: :-) not sure
15:55:45 <bswartz> okay
15:55:48 <zhongjun2> bswartz: :-)
15:56:27 <bswartz> I propose that we add a free-space check to extend requests so we don't attempt to extend when there is no space
15:56:49 <bswartz> but once the extend request goes to the driver, if it fails then the share goes into error state
15:57:21 <markstur> it already checks capacity. You mean check with more live stats?
15:57:41 <lpabon> bswartz: -1, sorry, but I do not think the entire share should go into error.. Unless the error state still allows usability
15:57:42 <bswartz> markstur: I heard cknight and u_glide say we don't check capacity -- we only check quota
15:57:42 <u_glide> https://github.com/openstack/manila/blob/9d78f9ca36e4fc25a4ebe01c78781c126f7ea055/manila/api/contrib/share_actions.py#L145-L150
15:58:00 <xyang> markstur: extend share doesn't go through the scheduler, so it doesn't check capacity
15:58:05 <cknight> bswartz: might want to wait for the "Xing-style" thin provisioning patch to merge
15:58:06 <markstur> oh
15:58:30 <cknight> xyang: (that's what we've been calling it) ;-)
15:58:40 <xyang> :)
15:58:56 <bswartz> if we don't plan to check capacity or we can't check capacity then there's a much better case for allowing the driver to fail extends cleanly and roll back to the old size
15:59:15 <lpabon> bswartz: yeah
15:59:26 <bswartz> but in that case we need to somehow communicate to the end user that the extend failed
15:59:30 <xyang> another possibility is to have every request go to the scheduler, but winston abandoned that idea for some reason in cinder. I'll check with him on that
15:59:36 <lpabon> bswartz: it would probably be better to just pass it to the driver and let it do the magic
15:59:55 <bswartz> okay we're at time
15:59:57 <bswartz> thanks everyone
16:00:04 <lpabon> good talk! thanks everyone
16:00:06 <bswartz> #endmeeting