15:03:55 #startmeeting manila
15:03:56 Meeting started Thu Dec 6 15:03:55 2018 UTC and is due to finish in 60 minutes. The chair is tbarron. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:57 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:04:00 The meeting name has been set to 'manila'
15:04:20 for the record, I set the meeting name wrong and just restarted
15:04:24 .o/
15:04:27 please say hi again for the record
15:04:34 hello
15:04:42 restarted
15:04:52 o/ :)
15:05:23 * tbarron notes that xyang said hi
15:05:43 :)
15:05:53 Hello
15:06:05 bswartz: I don't think the bot will kick me off for pinging unicasts
15:06:10 carthaca: hello!
15:06:34 we have the guy who fixes more netapp bugs than netapp here :)
15:06:43 Welcome Carthaca.
15:06:48 >_<
15:07:08 OK, we don't have big attendance but let's get started.
15:07:16 gouthamr kinda invited me in https://bugs.launchpad.net/manila/+bug/1804659 :)
15:07:17 Launchpad bug 1804659 in Manila "speed up list storage pools" [High,In progress] - Assigned to Maurice Schreiber (maurice-schreiber)
15:07:33 carthaca: k, we'll get to that pretty soon
15:07:47 Agenda: https://wiki.openstack.org/wiki/Manila/Meetings#Next_meeting
15:07:56 #topic Announcements
15:08:25 I just want to note that the end-of-year holidays for the "western world" are coming up and that
15:08:37 M2 milestone will be soon after
15:08:48 M2 is the week of Jan. 7
15:09:00 We have bugs targeted for M2.
15:09:10 And it's the new driver deadline.
15:09:35 I think the only outstanding new driver is Pure, from last cycle.
15:09:42 Anyone know others?
15:10:16 tbarron: deadline only for brand new drivers, correct?
15:10:32 tbarron: driver changes could be proposed up to FPF, correct?
15:10:33 Well we've got about 30 back ends now so I dunno that we're desperate for more :)
15:10:42 ganso: correct
15:11:00 ok, any more announcements?
15:11:14 hearing none, ....
15:11:22 #topic new bug czar
15:11:28 gouthamr: your name is on this one
15:11:43 yep... ty tbarron
15:12:16 so, our last bug czar left Red Hat, and in the process has little to no time to dedicate to upstream development... :(
15:12:44 so we need a new bug czar, or a bug subteam even :)
15:13:26 do we have any volunteers?
15:13:43 gouthamr: what would a subteam look like?
15:13:55 I will volunteer
15:14:03 jgrosso++
15:14:12 terrific
15:14:36 tbarron: we'd probably rotate from week to week or split up responsibility between projects
15:15:12 i was combing through bugs and right now we have a lot of cruft
15:15:38 i suspect we'd need a lot of initial workload of weeding out things we've already fixed
15:15:50 s/need/have
15:15:59 gouthamr: ok, well we'll let jgrosso decide whether he wants a subteam to whom he can delegate some of the work and return to the topic in another week or two, sound good?
15:16:56 tbarron: sure thing.. i can work with jgrosso to bring him up to speed
15:17:03 gouthamr: you can get that "weeding" info to jgrosso and maybe spend a couple hours together cleaning out the garden
15:17:18 ok, let's move on for now
15:17:34 we'd need more than a couple of hours on manila's LP
15:17:39 #topic Our Work for this cycle
15:17:40 :)
15:17:59 * tbarron is being ridiculously optimistic, PTL prerogative
15:18:23 #link https://review.openstack.org/#/c/572283/
15:18:39 This is the main priority review for access rules
15:19:00 We carried it over from the last cycle and it has recently been refreshed.
15:19:38 I wonder if we should set up a call to work on this one together and see if we can get it advanced.
15:20:15 so zhengzhenyu hasn't proposed any driver changes
15:20:19 Is it feature complete?
15:20:43 ganso do you plan on fixing up the netapp driver?
15:20:57 gouthamr: as per what we've agreed, it doesn't need fixing
15:21:38 gouthamr: actually, just a minor change to stop reordering when there are no priority conflicts
15:21:40 instinct says it does, because the driver reorders rules
15:21:53 gouthamr: we've agreed that drivers can still reorder if they want when there is a priority conflict between rules
15:22:37 ganso: oh, sure - that's what i meant
15:22:40 but the driver needs to change to *not* re-order otherwise, i.e. to respect the order set by the manager
15:22:53 gouthamr: oh, then yes, I thought you implied that our driver would *break*. I meant that it will not break
15:23:19 tbarron: yes
15:23:37 so bswartz I think this core patch is feature complete, we still would need driver work, client work, ui work, tempest-plugin work
15:23:38 tbarron: minor change though
15:24:28 ganso: i think we should move that reordering into the share manager for rules that still have the same priority
15:24:40 ganso: so we can do it consistently across drivers
15:24:47 OK I don't want to get stuck, please review this one by next week.
15:25:11 gouthamr: I suggested that in the past, because I am personally against "indeterminate behavior"
15:25:14 gouthamr: you are disagreeing with the result of a previous discussion. I'm not saying we can't revisit it, just am observing
15:25:20 gouthamr: but we've agreed not to, after a lot of discussion
15:25:42 okay i am probably forgetting the previous discussion
15:26:09 previous discussion said let's maintain backwards compatibility when the specified priority still leaves "ties"
15:26:46 Please review this one by next week and have an opinion on whether we can safely merge it while the other work is pending.
15:27:00 ack
15:27:02 like if we expose the changes via the client last
15:27:18 We need to make progress and not get totally stuck on this one.
15:27:57 #topic testing global change to devstack for py3 first by default
15:28:05 #link https://review.openstack.org/#/c/623061/
15:28:29 gouthamr: anything we need to do on this?
15:28:58 tbarron: yes, the test failed.. i'll investigate and propose any fixes
15:28:59 recheck? fix?
15:29:07 gouthamr: ack
15:29:19 I put this one in mostly for awareness.
15:29:31 gouthamr: ask others to help on this work :)
15:29:43 http://logs.openstack.org/61/623061/2/check/manila-tempest-minimal-dsvm-lvm/d190cc1/logs/devstacklog.txt.gz#_2018-12-05_21_36_42_366
15:29:57 * gouthamr :P manila server throws a 500, no big deal
15:30:28 ok, stuff to figure out
15:30:36 good that we're doing this test
15:30:51 #topic Bugs
15:31:10 #link https://etherpad.openstack.org/p/manila-bug-triage-pad
15:31:28 let's take the bug carthaca mentioned first
15:32:07 #LINK https://bugs.launchpad.net/manila/+bug/1804659
15:32:08 Launchpad bug 1804659 in Manila "speed up list storage pools" [High,In progress] - Assigned to Maurice Schreiber (maurice-schreiber)
15:32:23 thanks for being here carthaca!
15:32:30 +1000
15:33:41 Note that with the "Edge" infra work discussed in the last couple of Summits and PTGs customers are contemplating
15:33:53 i believe this is a regression because of the way the host map is constructed post https://review.openstack.org/#/c/351034/
15:33:55 running with ~100 back ends
15:34:07 ^ that too
15:34:46 i believe carthaca's direction is correct, but wanted to discuss it here first..
15:34:57 So we need to flush out scale/performance issues in the scheduler and api services when they interact with lots of back ends
15:35:01 #LINK https://review.openstack.org/#/c/619576/
15:35:23 * tbarron notes, also races ...
15:35:34 I already observed the pool detail list to take over 5 minutes with just 1000 shares in total and 5 netapp backends
15:36:37 my cache really helps, but ideally I would like to find the source of why it takes so long :(
15:37:51 wow
15:38:14 Storage pool list is the one that's not served from the DB right?
15:38:49 carthaca: i think this is your culprit: https://github.com/openstack/manila/blob/fb17422/manila/scheduler/host_manager.py#L398
15:39:33 bswartz: yes, we extract it from the scheduler
15:39:53 but the scheduler is counting up the sizes of the shares to calculate the provisioned_capacity_gb
15:40:48 gouthamr: but the share sizes are in the db, why would that be slow?
15:41:19 tbarron: if you have thousands of shares, that get-all-by-host call can be slowing things down
15:42:32 Maybe this is the right time to wish for osprofiler https://blueprints.launchpad.net/manila/+spec/manila-os-profiler ? :)
15:44:33 carthaca: huh, dunno how that's used in nova, glance, cinder, but is worth investigating
15:46:17 i'll test carthaca's code today and review, can others please take a look too?
15:46:21 carthaca: thanks for working on this one, it seems important.
15:46:43 gouthamr +1
15:47:39 Any more on this one now?
15:47:58 gouthamr: thanks
15:48:19 ok, I promised that we'd talk about carlos_silva's generic driver ssh bug
15:48:40 #link https://bugs.launchpad.net/manila/+bug/1807126
15:48:40 Launchpad bug 1807126 in Manila "Cannot create shares using generic driver" [Undecided,New]
15:48:58 carlos_silva: are you around by any chance?
15:49:07 Who is currently looking into this one?
15:49:19 bswartz: it was just raised yesterday
15:49:50 It seems like the classic SSH problem we've seen for years in different flavors
15:49:50 I noticed that the same error that carlos_silva was reporting with xenial is showing up in gate with bionic
15:49:55 tbarron: yeah, i'm here
15:50:12 gouthamr asked carlos_silva to try the ssh connection manually and
15:50:17 hi carlos_silva !!~
15:50:24 One workaround is to use SSH with password instead of keys
15:50:40 carlos_silva: did you try that? ^^
15:50:55 The underlying problem is very hard to solve
15:51:19 bswartz: yeah, i tried
15:51:38 but it didn't work
15:51:42 bswartz: I think we're using p/w rather than keys in CI and we still are hitting the issue
15:52:02 tbarron: could be another issue then
15:52:07 carlos_silva: didn't you report that ssh works manually, but only after a very long time?
15:52:23 I think paramiko times out after 15s
15:52:36 The network tricks we play with the generic driver seem to interfere with nova doing the stuff it's supposed to do
15:52:38 manually
15:53:01 Could be we need to revisit the networking approach used by generic
15:53:14 Too bad vponomaryov isn't still here
15:53:24 He was the only one with a deep understanding of that code
15:53:26 yeah
15:53:48 carlos_silva: so as you see we no longer have experts in this area and
15:54:03 tbarron: yes
15:54:07 so we have a config opt: "ssh_conn_timeout"
15:54:16 part of the problem is that it's complicated and no one is paid to work on that driver
15:54:32 It doesn't take a huge amount of skill to debug the problem
15:54:36 But it does take a huge amount of time
15:55:02 but university professors like it and we still document it as a reference because conceptually it is nice :)
15:55:02 Because you have to read all the code for generic and linux networking, and painstakingly step through the code in a debugger to see what's going on
15:55:26 I found horrifying things in paramiko by doing this ^
15:55:35 carlos_silva: is your interest in this back end about using it in production?
15:55:40 carlos_silva: maybe it would be useful to include in the bug report that ssh still works manually after a very long time
15:55:58 ganso: +1
15:56:03 carlos_silva: it's weird that the ssh client takes a long time - does that point to some issue with the underlying network?
15:56:05 https://github.com/openstack/manila/blob/0fd1b8f9fa40bdd504c9402dd5c43e86387671bd/manila/share/driver.py#L139
15:56:27 ^ if you toggle that timeout, manila should allow waiting until the timeout
15:56:32 tbarron: no
15:56:33 gouthamr: it could be a network issue, or a metadata server issue, or a paramiko issue
15:56:43 ganso: ok, i'll do it
15:56:54 gouthamr: how do you know it's the network rather than something in the service VM that makes it slow?
15:57:06 That's not even counting possible new problems introduced by new versions of nova/neutron
15:57:36 we haven't changed the service image over a couple of releases
15:59:02 gouthamr: ack but we've had lots of nondeterministic failures with timeouts and the log replaying the boot-up sequence, suggesting to me that we were close to the threshold
15:59:28 * tbarron admits he doesn't know this area well though
15:59:30 tbarron: I don't think increasing the timeout is a good alternative for our upstream CI
15:59:40 tbarron: this kind of thing shouldn't take a very long time
15:59:54 tbarron: last time I used it, it was instantenously
16:00:14 anecdotes :)
16:00:29 there are lots of timeouts in gate jobs, for a long time
16:00:43 ok, we're out of time
16:00:47 tbarron: s/instantenously/instantaneously
16:00:57 "Error reading SSH protocol banner" is a weird paramiko response - iirc it could happen for any number of reasons, a recent one that we saw was because of a wrong ssh key encryption algorithm
16:00:58 see you in #openstack-manila
16:01:09 Thanks everyone!
16:01:13 #endmeeting
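
As a reference for the osprofiler wish at 15:42:32: in services that already integrate osprofiler (cinder, glance, heat), enabling it amounts to a small [profiler] section in the service config. Manila support is only a blueprint at this point, so the snippet below is a sketch of what landing it would presumably look like; the HMAC key and trace store values are illustrative, not anything manila reads today.

```ini
# Sketch only - manila does not read these options until the
# manila-os-profiler blueprint lands; the option names come from the
# osprofiler library as consumed by other OpenStack services.
[profiler]
enabled = True
trace_sqlalchemy = True                        # also time DB queries, useful for this bug
hmac_keys = SECRET_KEY                         # illustrative key shared with the client
connection_string = redis://127.0.0.1:6379     # illustrative trace store
```

In the services that support it, a client call made with the matching key (e.g. `cinder --profile SECRET_KEY list`) returns a trace id, and something like `osprofiler trace show --html <trace_id> --connection-string redis://127.0.0.1:6379` renders a per-call timing breakdown - the kind of report that would show where the 5-minute pool detail listing is spending its time.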
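
On the generic driver ssh bug, gouthamr's suggestion at 15:50:12 was to try the ssh connection to the service VM by hand. A minimal sketch of doing that with paramiko itself is below; the address and credentials are placeholders, not values from the bug report. paramiko's banner timeout is short by default (the 15s mentioned at 15:52:23 matches it), which is one way to end up with the "Error reading SSH protocol banner" message quoted at 16:00:57 when the service VM is slow to answer.

```python
# Sketch of a manual paramiko check against the service VM; the host, user
# and password below are placeholders.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    '10.254.0.3',        # placeholder service VM address
    username='manila',   # placeholder credentials
    password='manila',
    timeout=60,          # TCP connect timeout
    banner_timeout=60,   # widen the window for the SSH protocol banner
)
_, stdout, _ = client.exec_command('uptime')
print(stdout.read().decode())
client.close()
```

If this only succeeds with a generous banner_timeout, that supports the theory that the VM (or the path to it) is slow rather than unreachable.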
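
The "ssh_conn_timeout" workaround ganso pointed at (15:54:07, 15:56:05) is a driver option defined in manila/share/driver.py; raising it in the generic back end's section of manila.conf lets manila wait longer for the service VM before giving up. The section name and value below are illustrative only:

```ini
# Illustrative back-end section; ssh_conn_timeout is the only point here.
[generic]
share_driver = manila.share.drivers.generic.GenericShareDriver
ssh_conn_timeout = 120   # seconds manila waits for SSH to the service VM
```

tbarron's caveat at 15:59:30 still applies: bumping the timeout may paper over the problem locally, but it is not a fix for upstream CI.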