16:00:07 #startmeeting Cinder
16:00:08 Meeting started Wed Dec 12 16:00:07 2018 UTC and is due to finish in 60 minutes. The chair is jungleboyj. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:09 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:11 The meeting name has been set to 'cinder'
16:00:13 Hi
16:00:36 hi
16:00:40 o/
16:00:40 courtesy ping:
16:00:40 jungleboyj diablo_rojo, diablo_rojo_phon, rajinir tbarron xyang xyang1 e0ne gouthamr thingee erlon tpsilva ganso patrickeast tommylikehu eharney geguileo smcginnis lhx_ lhx__ aspiers jgriffith moshele hwalsh felipemonteiro lpetrut lseki _alastor_ whoami-rajat yikun rosmaita enriquetaso
16:00:43 Hi
16:00:49 HI
16:00:51 hi! o/
16:00:52 hi
16:00:54 hey!
16:00:57 hi
16:01:01 @!
16:01:02 <_pewp_> jungleboyj (^-^*)/
16:01:11 hi
16:01:27 hello
16:01:38 hi
16:02:10 Nice that we have been having good attendance lately.
16:02:13 Thank you!
16:02:36 Ok, we have a lot of topics this week so we should get started.
16:02:45 I don't have any announcements.
16:03:00 hi
16:03:03 I am still planning on meeting next week. Will people be able to attend?
16:03:16 I will
16:03:24 Yes.
16:03:25 yes
16:03:26 jungleboyj, why wouldn't we?
16:03:28 I'll probably skip it
16:03:32 yes
16:03:40 yes
16:03:41 yes
16:03:45 erlon: Getting close to people's Christmas vacations.
16:03:50 jungleboyj, hmm
16:04:32 Ok. Looks like we will have a good number of people there, so I will keep the meeting and then we will be off for a week or two after that.
16:04:44 hello
16:04:59 I think smcginnis will be back next week, so it will be good to have a meeting with him before the holidays.
16:05:23 So, on to the topics:
16:05:42 #topic Close launchpad bug "cinder-backup - CLI 'VolumeBackupsRestore' object is not iterable"
16:05:50 enriquetaso:
16:06:11 #link https://bugs.launchpad.net/python-openstackclient/+bug/1733315
16:06:13 Launchpad bug 1733315 in python-openstackclient "cinder-backup - CLI 'VolumeBackupsRestore' object is not iterable" [Undecided,Confirmed]
16:07:20 So, it looks like enriquetaso thought it was fixed, but smcginnis thought it needed to be tested with the openstack CLI.
16:07:26 No response after that.
16:08:07 enriquetaso: Are you here?
16:08:58 jungleboyj, it seems like we need some more testing before closing it
16:09:07 Yeah, sounds like it.
16:09:13 I think eharney and smcginnis agreed that this should be closed, but they said they don't have the appropriate permissions to do so.
16:09:45 whoami-rajat, so, they tested with the openstack client?
16:09:47 well it may have been reproduced incorrectly, need to find that out first
16:10:44 erlon: i'm not sure.
16:10:46 Ok. Seems we don't have enough info here to make a decision.
16:10:50 eharney, the behaviour seemed very straightforward to reproduce
16:11:41 erlon: Yeah. It appears so.
16:12:00 i'm just reading the comments in the bug
16:12:03 I don't have an environment up to test it at the moment though.
16:12:19 there's nothing to do until someone reports back on whether it still fails or not
16:12:36 I will update the bug accordingly then.
16:12:58 do we need to verify the openstack client CLI?
16:13:32 seems that Sofia only checked it with the cinder CLI
16:14:25 lixiaoy1, that is the point smcginnis made
16:14:52 lixiaoy1: Yep, and I just reiterated that.
16:15:28 #topic Driver reinitialization after failure
16:15:37 lixiaoy1: All yours
16:15:42 thank you
16:15:52 #link https://review.openstack.org/#/c/622697
16:16:08 #link https://review.openstack.org/#/c/618702/
16:16:22 we agreed during the PTG that the driver can be reinitialized when it hits recoverable errors
16:16:44 Right.
16:16:46 i'm not sure the spec for this makes a very solid case for why we should do this
16:16:48 currently Alan gave comments about why we should differentiate recoverable and unrecoverable errors
16:17:09 lixiaoy1: I will check it with the openstack CLI
16:17:10 eharney: We talked about this at the PTG.
16:17:19 should _not_ distinguish :-/
16:17:20 and said what..
16:17:33 It is to keep systems hitting unrecoverable errors from spinning endlessly, making it harder to see the real failure.
16:17:41 abishop: yes, should not
16:17:55 enriquetaso: Thank you.
16:18:43 jungleboyj: not sure how encountering the same unrecoverable error will make it harder to discover the root cause
16:18:48 eharney, sometimes storage backends take longer to be available than the cinder volume services
16:19:31 erlon: didn't we have some retries for that in the initialization?
16:19:54 abishop: Because right now the driver fails to initialize but the volume driver instance keeps going and all you get are uninitialized-driver errors. No one goes back far enough to see why it wasn't initialized.
16:19:59 the whole premise of this feature is that it's better to stop trying to initialize the driver when something happens than it is to retry initializing it, but the longer i look at this, it's not clear why that would be the case
16:20:03 geguileo, if we do, nobody has mentioned it so far
16:20:14 erlon: maybe it's just my imagination XD
16:20:20 not that I heard of, but I think no
16:21:05 geguileo: erlon I have tested the driver failure case, there is no reinitialization.
16:21:10 geguileo: I am not aware of anything.
16:21:39 now I remember, we used to have an infinite loop
16:21:49 then we removed it and left it with NO repetition
16:22:10 geguileo: That sounds more right. Seems like we may have swung too far.
16:22:14 probably an exponential backoff with 3 or 4 tries under specific circumstances would have been good
16:22:24 jungleboyj: +1
16:22:29 geguileo, +1
16:22:32 geguileo: ++
16:22:51 at the PTG we discussed that there is precedent for retrying to initialize a driver under certain situations. What lixiaoy1 is proposing is to not re-initialize when a certain exception is thrown, because it would be pointless. I believe the behavior today is already to retry multiple times
16:23:34 I am going to add driver reinitialization, and there is a config option to set the retry number
16:23:48 lixiaoy1: ++ I think that is good to do.
16:23:53 https://review.openstack.org/#/c/618702/11
16:23:54 would it not be better to just always retry? how do we decide, in code, when we shouldn't retry?
16:24:06 lixiaoy1, +
16:24:07 1
16:24:30 eharney: Well, if the driver reports a config error. There is the question of whether they do that right.
16:24:52 in fact, if we always retry, that will be simpler, as I found that in the drivers the exceptions are not uniform for import errors.
16:24:53 jungleboyj: config error isn't what the patches talked about, they also mention missing deps and some other cases
16:25:11 eharney: Right.
16:25:16 but that's the issue
16:25:21 eharney, config errors, missing libs etc. are errors that should not be recoverable
16:25:26 eharney: I think it will retry by default unless the new exception is raised
16:25:32 erlon: the new exception is here https://review.openstack.org/#/c/622697
16:25:32 but missing libs are recoverable -- you install them, and things work
16:25:48 eharney: I'm confused, aren't we already retrying like 2 or 3 times?
16:25:54 eharney: and the retry here https://review.openstack.org/#/c/618702
16:25:54 eharney: and leave the driver spinning during that?
16:26:00 a central discussion there is whether we should always retry except for some errors, or never retry except for some errors
16:26:02 my point is
16:26:08 ganso: that's what I thought, but they told me that's not the case
16:26:11 we're trying to draw a line between recoverable and non-recoverable errors
16:26:19 and i don't see how that line is being defined in any clear way yet
16:26:30 eharney: I think we will always retry, but let drivers decide when they don't want to retry
16:26:42 It would be a driver decision
16:27:06 with no guidelines defined..?
16:27:08 I still don't see why retrying an unrecoverable error is an issue
16:27:34 can the driver really be certain the error is truly unrecoverable?
16:27:35 abishop, agreed, that shouldn't be a problem at all
16:27:36 there's a patch up that makes the rbd driver not retry if python-rados is missing. i don't see how that change actually helps any real-world user
16:27:51 the driver will still not work, that's all, and that is not a lot of resources
16:28:18 so how about we create an exception to indicate unrecoverable errors, and let drivers decide whether to retry or not
16:28:18 Ok.
16:28:33 erlon: yes! in which case I see no reason for drivers to judge whether the error is recoverable
16:28:35 leaving this up to people implementing drivers is just going to leave admins confused about when retries would happen and when they wouldn't
16:28:37 lixiaoy1: What is the goal here?
16:28:46 eharney: ++
16:29:06 jungleboyj: you mean the new exception?
16:29:12 eharney, your answer falls in the case of the cloud revover, where after a few retries, without human intervention, the cloud will be back
16:29:39 lixiaoy1: Well, the overall goal of this change. I feel like that has been lost here. What is the use case?
16:29:51 s/revover/recover/g
16:30:27 the use case is that the cinder service may be ready ahead of the backend storage services
16:31:00 the implementation and spec so far are way broader than that case
16:31:16 and the cinder service fails to initialize the volume driver, even if the storage service becomes ready soon after
16:31:45 So, basically, you want to be able to keep retrying if it can't contact the backend.
16:31:55 yes.
16:32:02 seems like the value is doing *some* retry, but distinguishing between recoverable and unrecoverable errors seems a distraction
16:32:05 I would consider the approach mentioned by abishop, just always retry, no matter the error. It would make things, and the discussions, a lot easier
16:32:14 so that the admin doesn't need to restart the cinder service manually
16:32:33 erlon: +1
16:32:40 and we don't need to rely on driver changes, which would never happen (for most backends) anyway
16:32:42 erlon: Which is what we used to do. But maybe that is what we want.
16:33:29 lixiaoy1: What are your thoughts on trying something like that?
16:33:29 jungleboyj, maybe just because there isn't any benefit in differentiating both cases
16:34:46 always retry is much simpler, for me, no need to update driver code
16:35:05 retry seems reasonable, but the retry interval may need to get longer after each failed attempt
16:35:21 after retrying the specified count, it will not go on
16:35:23 That would actually be somewhat better as we would see the actual init failure in the logs rather than just the periodic job failing.
16:36:25 Do we have any disagreement with the proposal?
16:36:32 LiangFang, the exponential backoff would solve that
16:36:41 erlon: ++
16:37:19 jungleboyj: by the proposal you mean always retry?
16:37:27 lixiaoy1: Yes.
16:37:36 With an exponential backoff.
16:38:00 what do you mean by exponential backoff?
16:38:19 Each time it fails it waits longer before trying again.
16:38:33 maybe with an upper bound
16:38:42 https://en.wikipedia.org/wiki/Exponential_backoff
16:38:44 Waits 2 seconds, then 4, 8, 16, etc.
16:38:52 expo backoff means [0, 1, 2, 4, 8] or [0, 0~2, 0~4, 0~8]?
16:39:28 currently I would like to retry with the interval of reporting volume stats to the scheduler
16:39:32 Let's go forward with that proposal as it seems to have less contention.
16:40:22 I suppose that could be an approach.
16:40:38 At this point I think we need to propose an update to the spec and work out the details there?
16:40:47 ok
16:41:15 lixiaoy1: Thanks. eharney and abishop, can you please make sure to put your input there.
16:41:21 sure
16:41:28 Cool.
16:41:33 thank you all
16:41:43 #action lixiaoy1 to update the spec based on this discussion.
16:41:56 #action eharney abishop to review the spec based on their concerns.
16:42:02 got it
16:42:14 sure
16:43:01 Cool. Thank you all.
16:43:30 Ok, so we have a launchpad bug with no owner here: https://bugs.launchpad.net/cinder/+bug/1799381
16:43:32 Launchpad bug 1799381 in Cinder "Cinder Netapp NFS driver has different host and provider_location" [Undecided,New]
16:44:34 jungleboyj, what do you mean no owner?
16:44:35 Based on the comments it looks like it can be closed.
16:44:35 Based on the comments it looks like it's fixed in master, but I didn't have the environment to reproduce it, so I thought to bring it up in the cinder meeting.
16:44:45 No owner on the etherpad for the discussion.
16:44:58 jungleboyj: sorry, i forgot to add my name there.
16:45:05 whoami-rajat: Ah, gotcha.
16:45:07 jungleboyj, ow, I see
16:45:54 whoami-rajat: I will take a closer look at the bug and close it if it looks fixed.
16:45:55 tpsilva and I can work with you on that, rajat
16:46:10 erlon: Or that would be even better. :-)
16:46:27 erlon: Thanks. that will be helpful.
16:46:45 though when we tested we weren't able to reproduce it in newer releases
16:47:00 is the original reporter in the meeting?
16:47:08 whoami-rajat, we will double check and then close it if the fix is confirmed
16:47:16 we couldn't reproduce this issue
16:47:23 but we'll take another look at that
16:47:48 tpsilva: Sounds like a good plan.
16:47:51 erlon: ok. sounds good.
16:48:02 whoami-rajat: Anything more there?
16:48:26 jungleboyj: no, that's all.
16:48:35 Ok, cool.
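For reference on the driver reinitialization discussion above, here is a minimal sketch of the always-retry-with-exponential-backoff idea that the team converged on. This is not the code from the proposed spec or patches; the helper name, the init_fn callable, and the defaults below are hypothetical, and in Cinder the retry count and interval would come from configuration as discussed.

    import time


    def initialize_driver_with_backoff(init_fn, max_retries=5,
                                       base_interval=2, max_interval=64):
        # Hypothetical helper: retry on any failure, with no recoverable vs.
        # unrecoverable distinction. Wait 2s, 4s, 8s, ... (capped at
        # max_interval) between attempts, and give up after max_retries.
        for attempt in range(1, max_retries + 1):
            try:
                init_fn()  # e.g. the driver's setup/initialization call
                return
            except Exception:
                if attempt == max_retries:
                    raise
                time.sleep(min(base_interval * 2 ** (attempt - 1), max_interval))

The backoff keeps a slow backend from being hammered while still recovering without an admin restarting the cinder-volume service, and the bounded retry count means a genuinely broken configuration still surfaces as a clear failure in the logs.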
16:48:49 #topic Fix LVM thinpool overallocation
16:48:56 LiangFang:
16:49:01 hi
16:49:10 https://review.openstack.org/#/c/623639/
16:49:14 #link https://review.openstack.org/#/c/623639/
16:49:31 when creating a volume, the scheduler knows the volume size and will consider this
16:50:07 but the lvm driver itself still reports the volume stats without the size that is being created
16:50:29 LiangFang, you mean as the pool information?
16:50:39 yes
16:50:52 Is this just a request for reviews?
16:50:53 the used size, right?
16:50:56 so the scheduler gets a wrong pool size
16:50:57 what's the bug exactly? that stats from the driver are behind for a bit when volumes are created?
16:51:24 jungleboyj: yes
16:51:35 Ok.
16:51:50 Can we just handle questions through the review then?
16:51:51 LiangFang, just for a bit, or does it keep like that always?
16:52:25 before finishing creating the volume
16:52:40 it always reports a wrong stat
16:53:01 LiangFang, I'll check that out later today
16:53:03 isn't that normal behavior with how our scheduler and volume service work?
16:53:14 eharney: That was what I was thinking.
16:53:25 If the volume hasn't been created yet ... then the data hasn't changed.
16:53:36 i don't see any reason the LVM driver needs to be adding code to keep track of things in flight like this
16:54:08 eharney: but the scheduler will consider there's enough free space there, and will let more volumes be created
16:54:22 LiangFang: yes, that's a limitation of our scheduling system that can happen with any backend though
16:55:10 but if the backend can report exact stats, I think it would be good
16:55:34 i don't think it's worth adding extra accounting and locks in the driver for this
16:55:43 eharney: ++
16:55:54 I am concerned with changing our base driver like that.
16:55:59 i'm also not sure it actually resolves the problem anyway
16:56:15 because you can still have requests in flight between the scheduler and the volume service, etc., and have the same problem
16:56:30 Ok. Let's move on as we are short on time. Please add your comments in the review.
16:56:36 and the new data reported in your patch won't be provided until an update stats call is performed anyway
16:56:47 It's not clear to me whether this is just for a lapse of time or if the difference stays forever
16:57:00 erlon: it's a lapse of time
16:57:04 erlon: I think it is a lapse of time.
16:57:17 yeap, so, not a big deal
16:57:22 erlon: ++
16:57:35 LiangFang: Is the next topic also a request for reviews?
16:57:48 yes
16:58:00 Ok. We will take a look at that and make comments there as well.
16:58:09 #topic cinderlib
16:58:10 thanks
16:58:19 geguileo: Floor is yours for 2 min.
16:58:50 geguileo: ?
16:59:14 thanks
16:59:26 just wanted to say that the patches are up for review
16:59:26 Yes.
16:59:30 there are 3 patches
16:59:34 Yeah, I saw that.
16:59:42 Thanks for sharing. I will try to do a review soon.
16:59:43 one for the code and unit tests
16:59:47 another for the docs
17:00:05 and the last one for the functional tests and the change to run them in the LIO barbican gate
17:00:06 #link https://review.openstack.org/#/c/620669
17:00:19 #link https://review.openstack.org/#/c/620670
17:00:30 #link https://review.openstack.org/#/c/620671
17:00:42 So, team, please work on taking a look at those tests.
17:00:43 so I would appreciate reviews :-)
17:00:48 I mean patches.
17:00:53 lol
17:01:00 :-)
17:01:07 Ok. Thanks for meeting today, team!
17:01:13 See you all next week!
17:01:15 thanks everyone!
17:01:17 #endmeeting
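As a footnote to the LVM thin-pool topic above, a simplified illustration of the stats-lag window eharney described; the names and numbers are made up and this is not how the Cinder scheduler is actually implemented. The point is that a stats report collected before the volume exists can make the scheduler's view optimistic again, so per-driver accounting of in-flight creates narrows, but does not close, the window.

    # Hypothetical sketch of the capacity race; not Cinder code.
    driver_reported_free_gb = 100    # last periodic stats report from the backend
    scheduler_free_gb = 100          # scheduler's working view of free space


    def place_create(size_gb):
        # Scheduler-side accounting: deduct as soon as the request is placed,
        # before the volume actually exists on the backend.
        global scheduler_free_gb
        assert scheduler_free_gb >= size_gb
        scheduler_free_gb -= size_gb


    place_create(60)                             # backend still reports 100 GB free
    scheduler_free_gb = driver_reported_free_gb  # a stale stats refresh arrives
    place_create(60)                             # accepted, yet only 40 GB will
                                                 # really remain once the first
                                                 # create finishes -> overcommit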