16:00:07 <jungleboyj> #startmeeting Cinder
16:00:08 <openstack> Meeting started Wed Dec 12 16:00:07 2018 UTC and is due to finish in 60 minutes. The chair is jungleboyj. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:11 <openstack> The meeting name has been set to 'cinder'
16:00:13 <whoami-rajat> Hi
16:00:36 <LiangFang> hi
16:00:40 <rosmaita> o/
16:00:40 <jungleboyj> courtesy ping:
16:00:40 <jungleboyj> jungleboyj diablo_rojo, diablo_rojo_phon, rajinir tbarron xyang xyang1 e0ne gouthamr thingee erlon tpsilva ganso patrickeast tommylikehu eharney geguileo smcginnis lhx_ lhx__ aspiers jgriffith moshele hwalsh felipemonteiro lpetrut lseki _alastor_ whoami-rajat yikun rosmaita enriquetaso
16:00:43 <walshh_> Hi
16:00:49 <lixiaoy1> HI
16:00:51 <geguileo> hi! o/
16:00:52 <eharney> hi
16:00:54 <erlon> hey!
16:00:57 <tbarron> hi
16:01:01 <jungleboyj> @!
16:01:02 <_pewp_> jungleboyj (^-^*)/
16:01:11 <carlos_silva> hi
16:01:27 <ganso> hello
16:01:38 <rajinir> hi
16:02:10 <jungleboyj> Nice that we have been having good attendance lately.
16:02:13 <jungleboyj> Thank you!
16:02:36 <jungleboyj> Ok, we have a lot of topics this week so we should get started.
16:02:45 <jungleboyj> I don't have any announcements.
16:03:00 <e0ne> hi
16:03:03 <jungleboyj> I am still planning on meeting next week. Will people be able to attend?
16:03:16 <geguileo> I will
16:03:24 <whoami-rajat> Yes.
16:03:25 <rosmaita> yes
16:03:26 <erlon> jungleboyj, why wouldn't we?
16:03:28 <e0ne> I'll probably skip it
16:03:32 <ganso> yes
16:03:40 <walshh_> yes
16:03:41 <carlos_silva> yes
16:03:45 <jungleboyj> erlon: Getting close to people's Christmas vacations.
16:03:50 <erlon> jungleboyj, hmm
16:04:32 <jungleboyj> Ok. Looks like we will have a good number of people there, so I will keep the meeting and then we will be off for a week or two after that.
16:04:44 <yikun> hello
16:04:59 <jungleboyj> I think smcginnis will be back next week, so it will be good to have a meeting with him before the holidays.
16:05:23 <jungleboyj> So, on to the topics:
16:05:42 <jungleboyj> #topic Close launchpad bug "cinder-backup - CLI 'VolumeBackupsRestore' object is not iterable"
16:05:50 <jungleboyj> enriquetaso:
16:06:11 <jungleboyj> #link https://bugs.launchpad.net/python-openstackclient/+bug/1733315
16:06:13 <openstack> Launchpad bug 1733315 in python-openstackclient "cinder-backup - CLI 'VolumeBackupsRestore' object is not iterable" [Undecided,Confirmed]
16:07:20 <jungleboyj> So, it looks like enriquetaso thought it was fixed, but smcginnis thought it needed to be tested with the openstack CLI.
16:07:26 <jungleboyj> No response after that.
16:08:07 <jungleboyj> enriquetaso: Are you here?
16:08:58 <erlon> jungleboyj, it seems like we need some more testing before closing it
16:09:07 <jungleboyj> Yeah, sounds like it.
16:09:13 <whoami-rajat> I think eharney and smcginnis agreed that this should be closed, but they said they don't have the appropriate permissions to do so.
16:09:45 <erlon> whoami-rajat, so, they tested with the openstack client?
16:09:47 <eharney> well it may have been reproduced incorrectly, need to find that out first
16:10:44 <whoami-rajat> erlon: i'm not sure.
16:10:46 <jungleboyj> Ok. Seems we don't have enough info here to make a decision.
16:10:50 <erlon> eharney, the behaviour seemed very straightforward to reproduce
16:11:41 <jungleboyj> erlon: Yeah. It appears so.
16:12:00 <eharney> i'm just reading the comments in the bug
16:12:03 <jungleboyj> I don't have an environment up to test it at the moment though.
16:12:19 <eharney> there's nothing to do until someone reports back on whether it still fails or not
16:12:36 <jungleboyj> I will update the bug accordingly then.
16:12:58 <lixiaoy1> do we need to verify the openstack client CLI?
16:13:32 <lixiaoy1> seems that Sofia only checked it with the cinder CLI
16:14:25 <erlon> lixiaoy1, that is the point smcginnis made
16:14:52 <jungleboyj> lixiaoy1: Yep, and I just reiterated that.
16:15:28 <jungleboyj> #topic Driver reinitialization after failure
16:15:37 <jungleboyj> lixiaoy1: All yours
16:15:42 <lixiaoy1> thank you
16:15:52 <jungleboyj> #link https://review.openstack.org/#/c/622697
16:16:08 <jungleboyj> #link https://review.openstack.org/#/c/618702/
16:16:22 <lixiaoy1> we agreed at the PTG that the driver can be reinitialized when it hits recoverable errors
16:16:44 <jungleboyj> Right.
16:16:46 <eharney> i'm not sure the spec for this makes a very solid case for why we should do this
16:16:48 <lixiaoy1> currently Alan gave a comment about why we should differentiate recoverable and unrecoverable errors
16:17:09 <enriquetaso> lixiaoy1: I will check it with the openstack CLI
16:17:10 <jungleboyj> eharney: We talked about this at the PTG.
16:17:19 <abishop> should _not_ distinguish :-/
16:17:20 <eharney> and said what..
16:17:33 <jungleboyj> It is to keep systems hitting unrecoverable errors from spinning endlessly, making it harder to see the real failure.
16:17:41 <lixiaoy1> abishop: yes, should not
16:17:55 <jungleboyj> enriquetaso: Thank you.
16:18:43 <abishop> jungleboyj: not sure how encountering the same unrecoverable error will make it harder to discover the root cause
16:18:48 <erlon> eharney, sometimes storage backends take longer to be available than the cinder volume services
16:19:31 <geguileo> erlon: didn't we have some retries for that in the initialization?
16:19:54 <jungleboyj> abishop: Because right now the driver fails to initialize but the volume driver instance keeps going, and all you get are uninitialized driver errors. No one goes back far enough to see why it wasn't initialized.
16:19:59 <eharney> the whole premise of this feature is that it's better to stop trying to initialize the driver when something happens than it is to retry initializing it, but the longer i look at this, it's not clear why that would be the case
16:20:03 <erlon> geguileo, if we do, nobody has mentioned it so far
16:20:14 <geguileo> erlon: maybe it's just my imagination XD
16:20:20 <erlon> not that I heard of, but I think not
16:21:05 <lixiaoy1> geguileo: erlon I have tested the driver failure case, there is no reinitialization.
16:21:10 <jungleboyj> geguileo: I am not aware of anything.
16:21:39 <geguileo> now I remember, we used to have an infinite loop
16:21:49 <geguileo> then we removed it and left it with NO repetition
16:22:10 <jungleboyj> geguileo: That sounds more right. Seems like we may have swung too far.
16:22:14 <geguileo> probably an exponential backoff with 3 or 4 tries under specific circumstances would have been good
16:22:24 <geguileo> jungleboyj: +1
16:22:29 <erlon> geguileo, +1
16:22:32 <jungleboyj> geguileo: ++
16:22:51 <ganso> at the PTG we discussed that there is precedent for retrying to initialize a driver under certain situations. What lixiaoy1 is proposing is to not re-initialize when a certain exception is thrown, because it would be pointless. I believe the behavior today is already to retry multiple times
16:23:34 <lixiaoy1> I am going to add driver reinitialization, and there is a config option to set the retry number
16:23:48 <jungleboyj> lixiaoy1: ++ I think that is good to do.
16:23:53 <lixiaoy1> https://review.openstack.org/#/c/618702/11
16:23:54 <eharney> would it not be better to just always retry? how do we decide, in code, when we shouldn't retry?
16:24:06 <erlon> lixiaoy1, +
16:24:07 <erlon> 1
16:24:30 <jungleboyj> eharney: Well, if the driver reports a config error. There is the question of whether they do that right.
16:24:52 <lixiaoy1> in fact, if we always retry, that will be simpler, as I found that in the drivers the exceptions are not uniform for import errors.
16:24:53 <eharney> jungleboyj: config error isn't what the patches talked about, they also mention missing deps and some other cases
16:25:11 <jungleboyj> eharney: Right.
16:25:16 <eharney> but that's the issue
16:25:21 <erlon> eharney, config errors, missing libs, etc. are errors that should not be recoverable
16:25:26 <geguileo> eharney: I think it will retry by default unless the new exception is raised
16:25:32 <geguileo> erlon: the new exception is here https://review.openstack.org/#/c/622697
16:25:32 <eharney> but missing libs are recoverable -- you install them, and things work
16:25:48 <ganso> eharney: I'm confused, aren't we already retrying like 2 or 3 times?
16:25:54 <geguileo> eharney: and the retry here https://review.openstack.org/#/c/618702
16:25:54 <jungleboyj> eharney: and leave the driver spinning during that?
16:26:00 <erlon> a central discussion there is whether we should always retry except for some errors, or never retry except for some errors
16:26:02 <eharney> my point is
16:26:08 <geguileo> ganso: that's what I thought, but they told me that's not the case
16:26:11 <eharney> we're trying to draw a line between recoverable and non-recoverable errors
16:26:19 <eharney> and i don't see how that line is being defined in any clear way yet
16:26:30 <geguileo> eharney: I think we will retry always, but let drivers decide when they don't want to retry
16:26:42 <geguileo> It would be a driver decision
16:27:06 <eharney> with no guidelines defined..?
16:27:08 <abishop> I still don't see why retrying an unrecoverable error is an issue
16:27:34 <abishop> can the driver really be certain the error is truly unrecoverable?
16:27:35 <erlon> abishop, agreed, that shouldn't be a problem at all
16:27:36 <eharney> there's a patch up that makes the rbd driver not retry if python-rados is missing. i don't see how that change actually helps any real-world user
16:27:51 <erlon> the driver will still not work, just that, and that is not a lot of resources
16:28:18 <lixiaoy1> so how about we create an exception to indicate unrecoverable errors, and let drivers decide whether to retry or not
16:28:18 <jungleboyj> Ok.
16:28:33 <abishop> erlon: yes! in which case I see no reason for drivers to judge whether the error is recoverable
16:28:35 <eharney> leaving this up to people implementing drivers is just going to leave admins confused about when retries would happen and when they wouldn't
16:28:37 <jungleboyj> lixiaoy1: What is the goal here?
16:28:46 <jungleboyj> eharney: ++
16:29:06 <lixiaoy1> jungleboyj: you mean the new exception?
16:29:12 <erlon> eharney, your answer falls in the case of the cloud revover, where after a few retries, without human intervention, the cloud will be back
16:29:39 <jungleboyj> lixiaoy1: Well, the overall goal of this change. I feel like that has been lost here. What is the use case?
16:29:51 <erlon> s/revover/recover/g
16:30:27 <lixiaoy1> the use case is that the cinder service may be ready ahead of the backend storage services
16:31:00 <eharney> the implementation and spec so far are way broader than that case
16:31:16 <lixiaoy1> and the cinder service fails to initialize the volume driver, even if the storage service is ready soon after
16:31:45 <jungleboyj> So, basically, you want to be able to keep retrying if it can't contact the backend.
16:31:55 <lixiaoy1> yes.
16:32:02 <abishop> seems like the value is doing *some* retry, but distinguishing between recoverable and unrecoverable errors seems a distraction
16:32:05 <erlon> I would consider the approach mentioned by abishop, just always retry, no matter the error. It would make things, and discussions, a lot easier
16:32:14 <lixiaoy1> so that the admin doesn't need to restart the cinder service manually
16:32:33 <abishop> erlon: +1
16:32:40 <erlon> and we don't need to rely on driver changes, which would never happen (for most backends) anyway
16:32:42 <jungleboyj> erlon: Which is what we used to do. But maybe that is what we want.
16:33:29 <jungleboyj> lixiaoy1: What are your thoughts on trying something like that?
16:33:29 <erlon> jungleboyj, maybe just because there isn't any benefit in differentiating both cases
16:34:46 <lixiaoy1> always retrying is much simpler; for me, there is no need to update driver code
16:35:05 <LiangFang> retry seems reasonable, but the retry interval may need to be longer if a retry fails
16:35:21 <lixiaoy1> after retrying the specified number of times, it will not go on
16:35:23 <jungleboyj> That would actually be somewhat better, as we would see the actual init failure in the logs rather than just the periodic job failing.
16:36:25 <jungleboyj> Do we have any disagreement with the proposal?
16:36:32 <erlon> LiangFang, the exponential backoff would solve that
16:36:41 <jungleboyj> erlon: ++
16:37:19 <lixiaoy1> jungleboyj: the proposal, you mean always retry?
16:37:27 <jungleboyj> lixiaoy1: Yes.
16:37:36 <jungleboyj> With an exponential backoff.
16:38:00 <lixiaoy1> what do you mean by exponential backoff?
16:38:19 <jungleboyj> Each time it fails it waits longer before trying again.
16:38:33 <abishop> maybe with an upper bound
16:38:42 <geguileo> https://en.wikipedia.org/wiki/Exponential_backoff
16:38:44 <jungleboyj> Waits 2 seconds, then 4, 8, 16, etc.
16:38:52 <yikun> does expo backoff mean [0, 1, 2, 4, 8] or [0, 0~2, 0~4, 0~8]?
16:39:28 <lixiaoy1> currently I would like to retry at the interval of reporting volume stats to the scheduler
16:39:32 <jungleboyj> Let's go forward with that proposal as it seems to have less contention.
16:40:22 <jungleboyj> I suppose that could be an approach.
16:40:38 <jungleboyj> At this point I think we need to propose an update to the spec and work out the details there?
16:40:47 <lixiaoy1> ok
16:41:15 <jungleboyj> lixiaoy1: Thanks. eharney and abishop, can you please make sure to put your input there.
16:41:21 <eharney> sure
16:41:28 <jungleboyj> Cool.
16:41:33 <lixiaoy1> thank you all
16:41:43 <jungleboyj> #action lixiaoy1 to update the spec based on this discussion.
16:41:56 <jungleboyj> #action eharney abishop to review the spec based on their concerns.
16:42:02 <lixiaoy1> got it
16:42:14 <abishop> sure
16:43:01 <jungleboyj> Cool. Thank you all.
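(For readers following along: the retry-with-exponential-backoff idea agreed on above could look roughly like the Python sketch below. This is only an illustration of the concept, not the proposal in https://review.openstack.org/#/c/618702; the function name and parameters are made up for the example, and do_setup()/check_for_setup_error() are used here as the usual Cinder volume driver setup hooks.)

    import time

    def init_driver_with_backoff(driver, context, max_retries=4,
                                 base_delay=2, max_delay=60):
        """Retry driver setup, waiting 2s, 4s, 8s, ... between attempts."""
        for attempt in range(max_retries + 1):
            try:
                driver.do_setup(context)
                driver.check_for_setup_error()
                return True
            except Exception:
                if attempt == max_retries:
                    # Give up after the configured number of tries; the
                    # service keeps running and the failure stays in the logs.
                    return False
                # Exponential backoff with an upper bound, as suggested above.
                time.sleep(min(base_delay * 2 ** attempt, max_delay))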
16:43:30 <jungleboyj> Ok, so we have a launchpad bug with no owner here: https://bugs.launchpad.net/cinder/+bug/1799381
16:43:32 <openstack> Launchpad bug 1799381 in Cinder "Cinder Netapp NFS driver has different host and provider_location" [Undecided,New]
16:44:34 <erlon> jungleboyj, what do you mean no owner?
16:44:35 <jungleboyj> Based on the comments it looks like it can be closed.
16:44:35 <whoami-rajat> Based on the comments it looks like it's fixed in master, but i didn't have the environment to reproduce it, so i thought to bring it up in the cinder meeting.
16:44:45 <jungleboyj> No owner on the etherpad for the discussion.
16:44:58 <whoami-rajat> jungleboyj: sorry, i forgot to add my name there.
16:45:05 <jungleboyj> whoami-rajat: Ah, gotcha.
16:45:07 <erlon> jungleboyj, oh, I see
16:45:54 <jungleboyj> whoami-rajat: I will take a closer look at the bug and close it if it looks fixed.
16:45:55 <erlon> me and tpsilva can work with you on that, rajat
16:46:10 <jungleboyj> erlon: Or that would be even better. :-)
16:46:27 <whoami-rajat> erlon: Thanks. that will be helpful.
16:46:45 <erlon> though we have tested and weren't able to reproduce it in newer releases
16:47:00 <tpsilva> is the original reporter in the meeting?
16:47:08 <erlon> whoami-rajat, we will double check and then close it if the fix is confirmed
16:47:16 <tpsilva> we couldn't reproduce this issue
16:47:23 <tpsilva> but we'll take another look at that
16:47:48 <jungleboyj> tpsilva: Sounds like a good plan.
16:47:51 <whoami-rajat> erlon: ok. sounds good.
16:48:02 <jungleboyj> whoami-rajat: Anything more there?
16:48:26 <whoami-rajat> jungleboyj: no, that's all.
16:48:35 <jungleboyj> Ok, cool.
16:48:49 <jungleboyj> #topic Fix LVM thinpool overallocation
16:48:56 <jungleboyj> LiangFang:
16:49:01 <LiangFang> hi
16:49:10 <jungleboyj> https://review.openstack.org/#/c/623639/
16:49:14 <jungleboyj> #link https://review.openstack.org/#/c/623639/
16:49:31 <LiangFang> when creating a volume, the scheduler knows the volume size and will consider this
16:50:07 <LiangFang> but the lvm driver itself still reports the volume stats without the size of the volume being created
16:50:29 <erlon> LiangFang, you mean as the pool information?
16:50:39 <LiangFang> yes
16:50:52 <jungleboyj> Is this just a request for reviews?
16:50:53 <erlon> the used size, right?
16:50:56 <LiangFang> so the scheduler gets a wrong pool size
16:50:57 <eharney> what's the bug exactly? that stats from the driver are behind for a bit when volumes are created?
16:51:24 <LiangFang> jungleboyj: yes
16:51:35 <jungleboyj> Ok.
16:51:50 <jungleboyj> Can we just handle questions through the review then?
16:51:51 <erlon> LiangFang, just for a bit, or does it stay like that always?
16:52:25 <LiangFang> before finishing creating the volume
16:52:40 <LiangFang> it always reports a wrong stat
16:53:01 <erlon> LiangFang, I'll check that out later today
16:53:03 <eharney> isn't that normal behavior with how our scheduler and volume service work?
16:53:14 <jungleboyj> eharney: That was what I was thinking.
16:53:25 <jungleboyj> If the volume hasn't been created yet ... then the data hasn't changed.
16:53:36 <eharney> i don't see any reason the LVM driver needs to be adding code to keep track of things in flight like this
16:54:08 <LiangFang> eharney: but the scheduler will consider there's enough free space there, and will let more volumes be created
16:54:22 <eharney> LiangFang: yes, that's a limitation of our scheduling system that can happen with any backend though
16:55:10 <LiangFang> but if the backend can report exact stats, I think it would be good
16:55:34 <eharney> i don't think it's worth adding extra accounting and locks in the driver for this
16:55:43 <jungleboyj> eharney: ++
16:55:54 <jungleboyj> I am concerned with changing our base driver like that.
16:55:59 <eharney> i'm also not sure it actually resolves the problem anyway
16:56:15 <eharney> because you still can have requests in flight between the scheduler and the volume service, etc., and have the same problem
16:56:30 <jungleboyj> Ok. Let's move on as we are short on time. Please add your comments in the review.
16:56:36 <eharney> and the new data reported in your patch won't be provided until an update stats call is performed anyway
16:56:47 <erlon> It's not clear to me whether this is just for a short window of time or if the difference stays forever
16:57:00 <geguileo> erlon: it's a short window of time
16:57:04 <jungleboyj> erlon: I think it is a short window of time.
16:57:17 <erlon> yeap, so, not a big deal
16:57:22 <jungleboyj> erlon: ++
16:57:35 <jungleboyj> LiangFang: Is the next topic also a request for reviews?
16:57:48 <LiangFang> yes
16:58:00 <jungleboyj> Ok. We will take a look at that and make comments there as well.
16:58:09 <jungleboyj> #topic cinderlib
16:58:10 <LiangFang> thanks
16:58:19 <jungleboyj> geguileo: Floor is yours for 2 min.
16:58:50 <jungleboyj> geguileo: ?
16:59:14 <geguileo> thanks
16:59:26 <geguileo> just wanted to say that the patches are up for review
16:59:26 <jungleboyj> Yes.
16:59:30 <geguileo> there are 3 patches
16:59:34 <jungleboyj> Yeah, I saw that.
16:59:42 <jungleboyj> Thanks for sharing. I will try to do a review soon.
16:59:43 <geguileo> one for the code and unit tests
16:59:47 <geguileo> another for the docs
17:00:05 <geguileo> and the last one for the functional tests and the change to run them in the LIO barbican gate
17:00:06 <jungleboyj> #link https://review.openstack.org/#/c/620669
17:00:19 <jungleboyj> #link https://review.openstack.org/#/c/620670
17:00:30 <jungleboyj> #link https://review.openstack.org/#/c/620671
17:00:42 <jungleboyj> So, team, please work on taking a look at those tests.
17:00:43 <geguileo> so I would appreciate reviews :-)
17:00:48 <jungleboyj> I mean patches.
17:00:53 <geguileo> lol
17:01:00 <jungleboyj> :-)
17:01:07 <jungleboyj> Ok. Thanks for meeting today, team!
17:01:13 <jungleboyj> See you all next week!
17:01:15 <geguileo> thanks everyone!
17:01:17 <jungleboyj> #endmeeting
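(Post-meeting note on the LVM thinpool discussion: the scheduling limitation eharney described can be pictured with the tiny Python sketch below. The scheduler filters requests against the last reported stats, so requests that arrive between stats updates can collectively overcommit the pool no matter how precisely the driver reports. The names reported_stats and passes_capacity_filter are made up for the illustration; free_capacity_gb is the usual key in the reported stats.)

    # Last stats report from the backend (illustrative values).
    reported_stats = {'free_capacity_gb': 10}

    def passes_capacity_filter(request_size_gb, stats=reported_stats):
        # Each request is checked against the same stale snapshot, so a
        # 7 GB and an 8 GB request can both pass before the next stats
        # update, even though together they exceed the 10 GB reported free.
        return request_size_gb <= stats['free_capacity_gb']

    print(passes_capacity_filter(7))  # True
    print(passes_capacity_filter(8))  # True, despite 7 + 8 > 10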