14:00:40 #startmeeting cinder
14:00:41 Meeting started Wed Mar 3 14:00:40 2021 UTC and is due to finish in 60 minutes. The chair is rosmaita. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:42 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:45 The meeting name has been set to 'cinder'
14:00:51 o/
14:00:53 hi
14:00:53 hi! o/
14:00:53 #topic roll call
14:00:55 Hi
14:00:58 hi
14:01:00 o7
14:01:01 hi
14:01:02 hi
14:01:47 looks like a good turnout, let's get started
14:02:00 #link https://etherpad.openstack.org/p/cinder-wallaby-meetings
14:02:04 #topic announcements
14:02:18 wallaby os-brick release this week (that is, tomorrow)
14:02:28 cinderclient release next week
14:02:39 wallaby feature freeze next week also
14:02:48 so, review priorities are:
14:02:53 today: os-brick
14:03:02 then, cinderclient and features
14:03:13 use launchpad to find features to review
14:03:40 #link https://blueprints.launchpad.net/cinder/wallaby
14:03:55 if you have a feature that's not in there, let me know immediately
14:04:25 note to driver developers/maintainers: even if you are not a core, you can review other people's drivers
14:04:48 we have helpful info in the cinder docs if you need suggestions on what to look for
14:05:02 though, you probably know since you are working on your own driver
14:05:43 o/
14:05:43 #topic wallaby os-brick release
14:06:11 ok, some patches we need to get in (and that obviously need reviews)
14:06:20 fix nvmeof connector regression
14:06:29 #link https://review.opendev.org/c/openstack/os-brick/+/777086
14:06:53 it's got a -1 from me, but just for minor stuff ... it looks otherwise ok to me
14:07:10 it has passed zuul
14:07:29 and passed the mellanox SPDK CI, which consumes the connector in a "legacy" way
14:07:42 and it has passed the kioxia CI, which consumes the connector in the new way
14:09:19 anyway, that's the #1 priority because it corrects a regression
14:09:38 other stuff that is uncontroversial and needs review:
14:09:54 remove six - https://review.opendev.org/c/openstack/os-brick/+/754598/
14:10:12 the nvmeof change doesn't use six at all, so it's not impacted
14:10:31 but i seem to have been reviewing this patch for months and would like it out of my life
14:10:43 change min version of tox - https://review.opendev.org/c/openstack/os-brick/+/771568
14:11:10 i verified that tox is smart enough to update itself when it builds testenvs, so it's not a problem
14:11:23 RBD: catch read exceptions prior to modifying offset
14:11:36 #link https://review.opendev.org/c/openstack/os-brick/+/763867
14:11:57 would be good, please look; if it requires work, give it a -1
14:12:09 Avoid unhandled exceptions during connecting to iSCSI portals
14:12:17 #link https://review.opendev.org/c/openstack/os-brick/+/775545
14:12:44 there's a question about ^^, namely whether it catches too broad an exception
14:13:14 so while it would be nice to have, i don't insist on it
14:13:29 so if you have a particular interest in this, please review
14:13:42 change hacking to 4.0.0
14:13:57 #link https://review.opendev.org/c/openstack/os-brick/+/774883
14:14:21 so, we have already done it for cinder, and the one to cinderclient will probably merge before release
14:15:08 now that i think about it, we could always merge later and backport to stable/wallaby to keep the code consistent
14:15:41 so, let's hold off unless someone here has a strong opinion?
14:15:55 i think just updating to 4.0.0 is pretty easy?
not sure what you mean
14:16:15 well, i don't know if it will impact any of the other patches
14:16:36 ah
14:17:20 i will rebase it after the critical stuff merges and we can see what happens
14:17:49 ok, i may put up a patch to update the requirements and lower-constraints
14:18:11 it may not be necessary because they were updated in january
14:18:48 anyway, if there is a change, i will individually bug people to look at the patch
14:19:34 ok, to summarize
14:19:49 the highest priority for brick is the nvmeof patch
14:20:08 #link https://review.opendev.org/c/openstack/os-brick/+/777086
14:21:57 need quick feedback on this issue: https://review.opendev.org/c/openstack/os-brick/+/777086/8/os_brick/initiator/connectors/nvmeof.py#829
14:22:41 ok, if memory serves, e0ne and hemna volunteered to review this?
14:23:12 rosmaita: will do it. last time I +1'ed it, but there was an issue with naming
14:23:13 i am right on the verge of +2, but want to make sure we have people who can look at it today
14:23:22 Definitely need hemna to look at it.
14:23:24 e0ne: cool, i think that has been addressed
14:24:03 i think hemna looked at an earlier version, so it shouldn't be too bad to re-review
14:24:14 ++
14:24:23 he may be on the west coast this week, i will look for him later
14:25:12 #topic cinderclient
14:25:13 I'll poke at it
14:25:18 hemna: ty!
14:25:41 https://review.opendev.org/q/project:openstack/python-cinderclient+branch:master+status:open
14:25:58 major things are to get the MV updated
14:26:12 there's a patch for 3.63
14:26:20 there will be a patch for 3.64
14:26:35 but the cinder-side code needs to merge first
14:27:13 #link https://review.opendev.org/q/project:openstack/python-cinderclient+branch:master+status:open
14:27:54 bad paste there, sorry
14:28:02 #link https://review.opendev.org/c/openstack/cinder/+/771081
14:28:29 ^^ is a wallaby feature so is a priority review ... also, it's nicely written and has good tests, and one +2
14:28:33 * jungleboyj has so many tabs open. :-)
14:28:38 :D
14:28:57 ok, i am still working on the v2 support removal patch for cinderclient
14:29:14 should have it ready in the next day or so
14:29:26 ++
14:30:25 so note to reviewers: there are a bunch of small patches for cinderclient, please address them
14:30:44 #topic Request from TC to investigate frequent Gate failures
14:30:50 jungleboyj: that's you
14:30:58 Hello. :-)
14:31:31 So, as many have hopefully seen, there has been an effort to make our gate jobs more efficient, as we have fewer resources for running check/gate.
14:31:52 There have also been issues with how long patches are taking to get through.
14:32:08 Neutron, Nova, etc. have made things more efficient.
14:32:31 There is still need for improvement.
14:32:46 Now they are looking at how often gate/check runs fail.
14:33:10 #link https://termbin.com/4guht
14:33:29 interestingly enough, the pastebin you linked refers to tempest-integrated-storage failures; I thought we had many more in the lvm-lio-barbican job
14:33:32 dansmith: Had highlighted two examples above where he was seeing frequent failures for Cinder.
14:33:32 does dansmith have an elasticsearch query up for these?
14:33:55 the lio-barbican job is failing a lot, but i think it doesn't show up much where other projects would see it
14:34:02 rosmaita: I don't think he had put that together yet. This was just a quick grab.
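
The elasticsearch queries being discussed here are normally filed in the openstack/elastic-recheck repo, one YAML file per Launchpad bug. A minimal sketch of that file format, assuming the usual logstash fields; the bug number, message string, and job name below are hypothetical, not an actual filed query:

    # queries/1234567.yaml  (filename is the Launchpad bug number; hypothetical here)
    query: >-
      message:"Volume failed to reach available status"
      AND build_name:"tempest-integrated-storage"

Once such a query merges, the elastic-recheck bot can annotate matching gate failures with the bug link instead of leaving reviewers to recheck blindly.
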
14:34:03 nope, haven't done that yet
14:34:03 what eharney said
14:34:14 dansmith: if you have time, that would be great
14:34:15 we have kind of discussed those a bit in the past; my feeling is that something weird happens in the limited environment
14:34:26 definitely not barbican-related jobs that I've seen
14:34:44 and eharney has started a bit of investigation, I don't remember if there were any findings
14:35:33 tosky: i think there's actually a bug in the target code hitting the lio-barbican job, but i guess the questions today are about other issues
14:35:36 i think if we can get an elasticsearch query up, that will make it easier to find fresh logs when someone has time to look
14:35:46 eharney: I see
14:36:00 So, partially, this is a reminder to everyone that we should try to look at failures and not just blindly recheck. Second, it's to see if anyone has been looking at failures.
14:36:21 the big lvm-lio-barbican job failures i have been seeing are related to backup tests and the test-boot-pattern tests
14:36:25 eharney: It is more of a general call. Those were just examples. :-)
14:36:40 rosmaita: Hasn't that been the case for like 3 years now?
14:36:47 (backup tests which were removed from the main gates because of the failure rates, probably resources)
14:37:49 rosmaita: Do we have elasticsearch queries for those?
14:38:02 i don't think so
14:38:18 So, that would probably be good to add.
14:38:30 someone could be a Hero of Cinder and put them together
14:38:44 :-)
14:38:44 and put the links on the meeting etherpad
14:39:04 * jungleboyj looks at our bug hero enriquetaso
14:39:35 sure
14:39:46 maybe a driver maintainer who's waiting for reviews?
14:39:47 not really sure how, but sure
14:40:07 Add to our weekly bug review an eye on where our gate/check failures are at.
14:40:15 rosmaita: That is a good idea too.
14:40:52 enriquetaso: you can look by hand, but i think the preferred method is to file a bug and check out a repo and push a patch
14:40:56 makes it more stable
14:41:09 we can talk offline, i have notes somewhere
14:41:27 or maybe dansmith can volunteer to walk you through the process?
14:42:05 ok, so the conclusion is that volume-related gate failures suck and someone needs to do something about it
14:42:08 tbh, I haven't actually done that in a while
14:42:18 #topic Wallaby R-6 Bug Review
14:42:33 hi
14:42:47 We have 6 bugs reported this week. I'd like to show 3 today because I'm not sure about the severity of them.
14:42:50 dansmith: that's fine, it just may slow things down a bit; as you can see, enriquetaso is already kind of busy
14:43:11 #link https://etherpad.opendev.org/p/cinder-wallaby-r6-bug-review
14:43:18 bug_1: Backup create failed: RBD volume flatten too long causing mq to timed out.
14:43:19 rosmaita: ++ Thank you.
14:43:27 #link https://bugs.launchpad.net/cinder/+bug/1916843
14:43:28 Launchpad bug 1916843 in Cinder "Backup create failed: RBD volume flatten too long causing mq to timed out." [Medium,New]
14:43:35 #link https://bugs.launchpad.net/cinder/+bug/1916843
14:43:40 In the process of creating a backup using a snapshot, there is an operation to create a temporary volume from a snapshot, which requires cinder-backup and cinder-volume to perform rpc interaction. The default configuration for "rpc_response_timeout" is 60s.
14:44:19 whoami-rajat: is this related to any of the flatten work you have done?
14:44:27 i can't remember
14:45:03 can't recall anything related, will take a thorough look after the meeting
14:45:25 Thanks!
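
The 60s rpc_response_timeout quoted in the bug description is the knob in play here. A minimal sketch of how a caller could combine a longer overall timeout with oslo.messaging's heartbeating long-call support (call_monitor_timeout, which dansmith brings up just below); the topic and method names are illustrative, not Cinder's actual RPC API:

    import oslo_messaging as messaging
    from oslo_config import cfg

    # Build an RPC client the usual oslo.messaging way.
    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='cinder-volume', version='3.0')  # illustrative
    client = messaging.RPCClient(transport, target)

    # With call_monitor_timeout set, the server heartbeats the caller while
    # a long operation (e.g. an RBD flatten) runs, so the overall timeout
    # can safely exceed the 60s default without the MQ call timing out.
    cctxt = client.prepare(timeout=3600, call_monitor_timeout=60)
    # cctxt.call(ctxt, 'create_temp_volume_from_snapshot', ...)  # hypothetical method

This only helps if the service can actually answer the heartbeats, which is why the eventlet/librbd blocking question below matters.
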
14:45:29 i think we should ask for more info first
14:45:41 something about the environment
14:45:47 the main point in that bug is that the slow call shouldn't tie the service up for RPC calls, AFAIK
14:45:58 the fact that it's slow isn't really the interesting part
14:46:42 yeah, but it sounds like we would need a major redesign to do that right
14:46:58 i'm not so sure, i need to look around in that area again
14:47:01 is it just that the operation takes longer than the RPC timeout?
14:47:18 like, it works for small volumes, but a large one will always take longer than 60s and the RPC call times out?
14:47:31 not sure, it doesn't say
14:48:00 i'm thinking more in the direction of librbd not behaving the way most code does with eventlet
14:48:13 we had a lot of those sorts of things in nova, so I implemented heartbeating long-call support in oslo.messaging a while back which lets you make much longer RPC calls
14:48:14 but i don't have enough to make it a useful discussion for now
14:48:35 okay, if you're blocked in a C lib call and not answering RPC, that would also be bad
14:48:43 right
14:49:29 OK, looks like there may be something there
14:49:41 next
14:49:45 bug_2: Cinder sends old db object when deleting an attachment
14:49:50 #link https://bugs.launchpad.net/cinder/+bug/1916980
14:49:51 Launchpad bug 1916980 in Cinder "cinder sends old db object when delete a attachment" [Undecided,Incomplete] - Assigned to wu.chunyang (wuchunyang)
14:50:01 When we delete an attachment, Cinder sends a notification with the old db object. We should call the refresh() function to refresh the object before sending the notification; after running the refresh function, the status changes to detached.
14:50:01 I left a comment on the bug report: "Does this problem occur on all backends or on a specific one?"
14:50:20 enriquetaso: I have fixed that one in my local repo
14:50:29 enriquetaso: I will submit the patch later today
14:50:32 cool
14:50:34 thanks!
14:50:47 next
14:50:49 bug_3: Scheduling is not even among multiple thin provisioning pools which have different sizes
14:50:57 #link https://bugs.launchpad.net/cinder/+bug/1917293
14:50:58 Launchpad bug 1917293 in Cinder "Scheduling is not even among multiple thin provisioning pools which have different sizes" [Undecided,New]
14:51:05 Scheduling is not even among multiple thin provisioning pools which have different sizes. For example, there are two thin provisioning pools: Pool0 has 10T capacity, Pool1 has 30T capacity, and the max_over_subscription_ratio of both is 20. We assume that the provisioned_capacity_gb of Pool1 is 250T and the provisioned_capacity_gb of Pool0 is 0. According to the formula in the cinder source code, the free capacity of Pool0 is 10*20-0=200T and the free capacity of Pool1 is 30*20-250=350T. So it is clear that a newly created volume is scheduled to Pool1 instead of Pool0; in effect, Pool1 is scheduled because it has the bigger real capacity. In a word, the scheduler tends to schedule to the pool that has the bigger size.
14:51:55 enriquetaso: I haven't looked at the bug, but were they testing on devstack or with a deployment that has 3 schedulers?
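
To make the arithmetic in the bug description concrete, here is a worked version of the virtual-free-capacity formula it quotes; a sketch of the calculation only, not the actual Cinder CapacityWeigher code:

    # free_virtual = total_capacity * max_over_subscription_ratio - provisioned_capacity
    def virtual_free_tb(total_tb, mos_ratio, provisioned_tb):
        return total_tb * mos_ratio - provisioned_tb

    pool0 = virtual_free_tb(10, 20, 0)     # 10*20 - 0   = 200
    pool1 = virtual_free_tb(30, 20, 250)   # 30*20 - 250 = 350
    # The capacity weigher prefers the larger value, so Pool1 wins even
    # though nothing is provisioned on Pool0 yet.
    print(pool0, pool1)  # 200 350
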
14:52:12 i'd need to look into this one again, but i'm not sure it's a bug; it may be that the reporter's expectations aren't quite right and he's expecting a scheduling algorithm that's smarter than what we actually try to do
14:52:34 or that the requests are going to different schedulers
14:52:45 and each scheduler has different in-memory data at the time of receiving the request...
14:52:57 geguileo, i need to ask because it doesn't say
14:53:48 enriquetaso: anything that happens with multiple schedulers but doesn't with a single one can be attributed to different in-memory data
14:54:30 geguileo, good point, so we need more info
14:54:39 this will also depend on how you configure the capacity weigher
14:55:27 OK
14:55:39 that's the last one :)
14:55:50 thanks sofia
14:55:59 #topic stable branch update
14:56:07 i will provide a quick update
14:56:16 proposed stable victoria release
14:56:22 #link https://review.opendev.org/c/openstack/releases/+/778231
14:56:31 3 patches remaining in ussuri (cinder and os-brick)
14:56:42 4 remaining in train
14:56:54 thanks everyone for the reviews this week
14:57:17 that's all from my side
14:57:31 thanks Rajat
14:57:44 #topic conf option workaround
14:57:58 Again me
14:58:17 so hemna asked a question about the default value of the conf option in the rbd clone patch
14:58:31 #link https://review.opendev.org/c/openstack/cinder/+/754397
14:59:10 IIRC, this is defaulted off mostly because we don't want to introduce extra risk and complexity, right?
14:59:19 yes
14:59:37 that's not really obvious in the patch, but that plus the fact that there are other solutions for people now makes a decent case for defaulting it off, i think
14:59:48 ok, we are out of time ... let's continue this in openstack-cinder, or respond on the patch
14:59:57 thanks everyone
15:00:02 #endmeeting
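
As a footnote to the last topic: a minimal sketch of the "default off" pattern being discussed, using oslo.config; the option name and help text here are hypothetical, not the ones in the rbd clone patch:

    from oslo_config import cfg

    rbd_opts = [
        cfg.BoolOpt('rbd_use_clone_optimization',  # hypothetical name
                    default=False,
                    help='Opt-in behavior: off by default so existing '
                         'deployments take on no extra risk or complexity.'),
    ]
    cfg.CONF.register_opts(rbd_opts, group='backend_defaults')

Defaulting a new behavior to False keeps upgrades inert: operators who want the optimization flip the flag consciously, and everyone else sees no change.
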