15:03:55 <tbarron> #startmeeting manila
15:03:56 <openstack> Meeting started Thu Dec 6 15:03:55 2018 UTC and is due to finish in 60 minutes. The chair is tbarron. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:04:00 <openstack> The meeting name has been set to 'manila'
15:04:20 <tbarron> for the record, I set the meeting name wrong and just restarted
15:04:24 <bswartz> .o/
15:04:27 <tbarron> please say hi again for the record
15:04:34 <ganso> hello
15:04:42 <tbarron> restarted
15:04:52 <gouthamr> o/ :)
15:05:23 * tbarron notes that xyang said hi
15:05:43 <xyang> :)
15:05:53 <carthaca> Hello
15:06:05 <tbarron> bswartz: I don't think the bot will kick me off for pinging unicasts
15:06:10 <tbarron> carthaca: hello!
15:06:34 <tbarron> we have the guy who fixes more netapp bugs than netapp here :)
15:06:43 <tbarron> Welcome Carthaca.
15:06:48 <bswartz> >_<
15:07:08 <tbarron> OK, we don't have big attendance but let's get started.
15:07:16 <carthaca> gouthamr kinda invited me in https://bugs.launchpad.net/manila/+bug/1804659 :)
15:07:17 <openstack> Launchpad bug 1804659 in Manila "speed up list storage pools" [High,In progress] - Assigned to Maurice Schreiber (maurice-schreiber)
15:07:33 <tbarron> carthaca: k, we'll get to that pretty soon
15:07:47 <tbarron> Agenda: https://wiki.openstack.org/wiki/Manila/Meetings#Next_meeting
15:07:56 <tbarron> #topic Announcements
15:08:25 <tbarron> I just want to note that the end-of-year holidays for the "western world" are coming up and that
15:08:37 <tbarron> M2 milestone will be soon after
15:08:48 <tbarron> M2 is the week of Jan. 7
15:09:00 <tbarron> We have bugs targeted for M2.
15:09:10 <tbarron> And it's the new driver deadline.
15:09:35 <tbarron> I think the only outstanding new driver is Pure, from last cycle.
15:09:42 <tbarron> Anyone know others?
15:10:16 <ganso> tbarron: deadline only for brand new drivers, correct?
15:10:32 <ganso> tbarron: driver changes could be proposed up to FPF, correct?
15:10:33 <tbarron> Well we've got about 30 back ends now so I dunno that we're desperate for more :)
15:10:42 <tbarron> ganso: correct
15:11:00 <tbarron> ok, any more announcements?
15:11:14 <tbarron> hearing none, ....
15:11:22 <tbarron> #topic new bug czar
15:11:28 <tbarron> gouthamr: your name is on this one
15:11:43 <gouthamr> yep... ty tbarron
15:12:16 <gouthamr> so, our last bug czar left Red Hat, and in the process has little to no time to dedicate to upstream development... :(
15:12:44 <gouthamr> so we need a new bug czar, or a bug subteam even :)
15:13:26 <gouthamr> do we have any volunteers?
15:13:43 <tbarron> gouthamr: what would a subteam look like?
15:13:55 <jgrosso> I will volunteer
15:14:03 <gouthamr> jgrosso++
15:14:12 <tbarron> terrific
15:14:36 <gouthamr> tbarron: we'd probably rotate from week to week or split up responsibility between projects
15:15:12 <gouthamr> i was combing through bugs and right now we have a lot of cruft
15:15:38 <gouthamr> i suspect we'd need a lot of initial workload of weeding out things we've already fixed
15:15:50 <gouthamr> s/need/have
15:15:59 <tbarron> gouthamr: ok, well we'll let jgrosso decide whether he wants a subteam to whom he can delegate some of the work and return to the topic in another week or two, sound good?
15:16:56 <gouthamr> tbarron: sure thing.. i can work with jgrosso to bring him up to speed
15:17:03 <tbarron> gouthamr: you can get that "weeding" info to jgrosso and maybe spend a couple hours together cleaning out the garden
15:17:18 <tbarron> ok, let's move on for now
15:17:34 <gouthamr> we'd need more than a couple of hours on manila's LP
15:17:39 <tbarron> #topic Our Work for this cycle
15:17:40 <jgrosso> :)
15:17:59 * tbarron is being ridiculously optimistic, PTL prerogative
15:18:23 <tbarron> #link https://review.openstack.org/#/c/572283/
15:18:39 <tbarron> This is the main priority for access rules review
15:19:00 <tbarron> We carried it over from the last cycle and it has recently been refreshed.
15:19:38 <tbarron> I wonder if we should set up a call to work on this one together and see if we can get it advanced.
15:20:15 <gouthamr> so zhengzhenyu hasn't proposed any driver changes
15:20:19 <bswartz> Is it feature complete?
15:20:43 <gouthamr> ganso do you plan on fixing up the netapp driver?
15:20:57 <ganso> gouthamr: as we've agreed, it doesn't need fixing
15:21:38 <ganso> gouthamr: actually, just a minor change to stop reordering when there are no priority conflicts
15:21:40 <gouthamr> instinct says it does, because the driver reorders rules
15:21:53 <ganso> gouthamr: we've agreed that drivers can still reorder if they want when there is a priority conflict between rules
15:22:37 <gouthamr> ganso: oh, sure - that's what i meant
15:22:40 <tbarron> but the driver needs to change to *not* re-order otherwise, i.e. to respect the order set by the manager
15:22:53 <ganso> gouthamr: oh, then yes, I thought you implied that our driver would *break*. I meant that it will not break
15:23:19 <ganso> tbarron: yes
15:23:37 <tbarron> so bswartz I think this core patch is feature complete, we still would need driver work, client work, ui work, tempest-plugin work
15:23:38 <ganso> tbarron: minor change though
15:24:28 <gouthamr> ganso: i think we should move that reordering into the share manager for rules that still have the same priority
15:24:40 <gouthamr> ganso: so we can do it consistently across drivers
15:24:47 <tbarron> OK I don't want to get stuck, please review this one by next week.
15:25:11 <ganso> gouthamr: I suggested that in the past, because I am personally against "indeterminate behavior"
15:25:14 <tbarron> gouthamr: you are disagreeing with the result of a previous discussion. I'm not saying we can't revisit it, just am observing
15:25:20 <ganso> gouthamr: but we've agreed not to, after a lot of discussion
15:25:42 <gouthamr> okay i am probably forgetting the previous discussion
15:26:09 <tbarron> previous discussion said let's maintain backwards compatibility when the specified priority still leaves "ties"
15:26:46 <tbarron> Please review this one by next week and have an opinion on whether we can safely merge it while the other work is pending.
15:27:00 <gouthamr> ack
15:27:02 <tbarron> like if we expose the changes via the client last
15:27:18 <tbarron> We need to make progress and not get totally stuck on this one.
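For reference, a minimal sketch of the ordering behavior discussed above, under the assumption that rules carry a numeric priority field (the field name and helper below are illustrative, not taken from the patch): a stable sort lets a driver honor priorities while leaving the manager-supplied order untouched for rules whose priorities tie.

    # Illustrative only: 'priority' is an assumed field name and
    # order_access_rules() is not part of the manila codebase.
    def order_access_rules(rules):
        """Sort rules by priority (lowest value first, assumed).

        Python's sort is stable, so rules that share a priority keep
        the order the share manager passed them in.
        """
        return sorted(rules, key=lambda rule: rule.get('priority', 0))

    # Example: the two priority-0 rules keep their original relative order.
    rules = [
        {'access_to': '10.0.0.0/24', 'access_level': 'rw', 'priority': 0},
        {'access_to': '10.0.1.5', 'access_level': 'ro', 'priority': 1},
        {'access_to': '10.0.2.0/24', 'access_level': 'rw', 'priority': 0},
    ]
    print(order_access_rules(rules))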
15:27:57 <tbarron> #topic testing global change to devstack for py3 first by default
15:28:05 <tbarron> #link https://review.openstack.org/#/c/623061/
15:28:29 <tbarron> gouthamr: anything we need to do on this?
15:28:58 <gouthamr> tbarron: yes, the test failed.. i'll investigate and propose any fixes
15:28:59 <tbarron> recheck? fix?
15:29:07 <tbarron> gouthamr: ack
15:29:19 <tbarron> I put this one in mostly for awareness.
15:29:31 <tbarron> gouthamr: ask others to help on this work :)
15:29:43 <gouthamr> http://logs.openstack.org/61/623061/2/check/manila-tempest-minimal-dsvm-lvm/d190cc1/logs/devstacklog.txt.gz#_2018-12-05_21_36_42_366
15:29:57 * gouthamr :P manila server throws a 500, no big deal
15:30:28 <tbarron> ok, stuff to figure out
15:30:36 <tbarron> good that we're doing this test
15:30:51 <tbarron> #topic Bugs
15:31:10 <tbarron> #link https://etherpad.openstack.org/p/manila-bug-triage-pad
15:31:28 <tbarron> let's take the bug carthaca mentioned first
15:32:07 <gouthamr> #LINK https://bugs.launchpad.net/manila/+bug/1804659
15:32:08 <openstack> Launchpad bug 1804659 in Manila "speed up list storage pools" [High,In progress] - Assigned to Maurice Schreiber (maurice-schreiber)
15:32:23 <gouthamr> thanks for being here carthaca!
15:32:30 <tbarron> +1000
15:33:41 <tbarron> Note that with the "Edge" infra work discussed in the last couple of Summits and PTGs customers are contemplating
15:33:53 <gouthamr> i believe this is a regression because of the way the host map is constructed post https://review.openstack.org/#/c/351034/
15:33:55 <tbarron> running with ~100 back ends
15:34:07 <gouthamr> ^ that too
15:34:46 <gouthamr> i believe carthaca's direction is correct, but wanted to discuss it here first..
15:34:57 <tbarron> So we need to flush out scale/performance issues in the scheduler and api services when they interact with lots of back ends
15:35:01 <gouthamr> #LINK https://review.openstack.org/#/c/619576/
15:35:23 * tbarron notes, also races ...
15:35:34 <carthaca> I already observed the pool detail list to take over 5 minutes with just 1000 shares in total and 5 netapp backends
15:36:37 <carthaca> my cache really helps, but ideally I would like to find the source of why it takes so long :(
15:37:51 <bswartz> wow
15:38:14 <bswartz> Storage pool list is the one that's not served from the DB right?
15:38:49 <gouthamr> carthaca: i think this is your culprit: https://github.com/openstack/manila/blob/fb17422/manila/scheduler/host_manager.py#L398
15:39:33 <gouthamr> bswartz: yes, we extract it from the scheduler
15:39:53 <gouthamr> but the scheduler is counting up the sizes of the shares to calculate the provisioned_capacity_gb
15:40:48 <tbarron> gouthamr: but the share sizes are in the db, why would that be slow?
15:41:19 <gouthamr> tbarron: if you have thousands of shares, that get-all-by-host call can be slowing things down
15:42:32 <carthaca> Maybe this is the right time to wish for osprofiler https://blueprints.launchpad.net/manila/+spec/manila-os-profiler ? :)
15:44:33 <gouthamr> carthaca: huh, dunno how that's used in nova, glance, cinder, but it's worth investigating
15:46:17 <gouthamr> i'll test carthaca's code today and review, can others please take a look too?
15:46:21 <tbarron> carthaca: thanks for working on this one, it seems important.
15:46:43 <tbarron> gouthamr +1
15:47:39 <tbarron> Any more on this one now?
15:47:58 <carthaca> gouthamr: thanks
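As an aside on the provisioned_capacity_gb calculation blamed above: one way to avoid issuing a separate "get all shares by host" lookup per pool is to total the sizes in a single pass. This is only a sketch of the idea, not what carthaca's patch (https://review.openstack.org/#/c/619576/) or the host_manager actually does; the data structures and field names are assumptions.

    # Illustrative only: share dicts and field names are assumptions, not
    # the real host_manager structures.
    from collections import defaultdict

    def provisioned_gb_by_pool(shares):
        """Total share sizes per pool in one sweep over the share list,
        instead of a separate by-host query for every pool."""
        totals = defaultdict(int)
        for share in shares:
            # 'host' is expected to look like 'host@backend#pool'.
            totals[share['host']] += share['size']
        return dict(totals)

    shares = [
        {'host': 'ubuntu@netapp1#pool_a', 'size': 10},
        {'host': 'ubuntu@netapp1#pool_a', 'size': 5},
        {'host': 'ubuntu@netapp2#pool_b', 'size': 20},
    ]
    print(provisioned_gb_by_pool(shares))  # provisioned GiB keyed by pool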
15:48:19 <tbarron> ok, I promised that we'd talk about carlos_silva's generic driver ssh bug
15:48:40 <tbarron> #link https://bugs.launchpad.net/manila/+bug/1807126
15:48:40 <openstack> Launchpad bug 1807126 in Manila "Cannot create shares using generic driver" [Undecided,New]
15:48:58 <tbarron> carlos_silva: are you around by any chance?
15:49:07 <bswartz> Who is currently looking into this one?
15:49:19 <tbarron> bswartz: it was just raised yesterday
15:49:50 <bswartz> It seems like the classic SSH problem we've seen for years in different flavors
15:49:50 <tbarron> I noticed that the same error that carlos_silva was reporting with xenial is showing up in gate with bionic
15:49:55 <carlos_silva> tbarron: yeah, i'm here
15:50:12 <tbarron> gouthamr asked carlos_silva to try the ssh connection manually and
15:50:17 <tbarron> hi carlos_silva !!~
15:50:24 <bswartz> One workaround is to use SSH with password instead of keys
15:50:40 <tbarron> carlos_silva: did you try that? ^^
15:50:55 <bswartz> The underlying problem is very hard to solve
15:51:19 <carlos_silva> bswartz: yeah, i tried
15:51:38 <carlos_silva> but it didn't work
15:51:42 <tbarron> bswartz: I think we're using p/w rather than keys in CI and we still are hitting the issue
15:52:02 <bswartz> tbarron: could be another issue then
15:52:07 <tbarron> carlos_silva: didn't you report that ssh works manually, but only after a very long time?
15:52:23 <tbarron> I think paramiko times out after 15s
15:52:36 <bswartz> The network tricks we play with the generic driver seem to interfere with nova doing the stuff it's supposed to do
15:52:38 <tbarron> manually
15:53:01 <bswartz> Could be we need to revisit the networking approach used by generic
15:53:14 <bswartz> Too bad vponomaryov isn't still here
15:53:24 <bswartz> He was the only one with a deep understanding of that code
15:53:26 <tbarron> yeah
15:53:48 <tbarron> carlos_silva: so as you see we no longer have experts in this area and
15:54:03 <carlos_silva> tbarron: yes
15:54:07 <gouthamr> so we have a config opt: "ssh_conn_timeout"
15:54:16 <tbarron> part of the problem is that it's complicated and no one is paid to work on that driver
15:54:32 <bswartz> It doesn't take a huge amount of skill to debug the problem
15:54:36 <bswartz> But it does take a huge amount of time
15:55:02 <tbarron> but University professors like it and we still document it as a reference because conceptually it is nice :)
15:55:02 <bswartz> Because you have to read all the code for generic and linux networking, and painstakingly step through the code in a debugger to see what's going on
15:55:26 <bswartz> I found horrifying things in paramiko by doing this ^
15:55:35 <tbarron> carlos_silva: is your interest in this back end using it for production?
15:55:40 <ganso> carlos_silva: maybe it would be useful to include in the bug report that ssh still works manually after a very long time
15:55:58 <tbarron> ganso: +1
15:56:03 <gouthamr> carlos_silva: it's weird that the ssh client takes a long time, and it suggests some issue with the underlying network?
15:56:05 <gouthamr> https://github.com/openstack/manila/blob/0fd1b8f9fa40bdd504c9402dd5c43e86387671bd/manila/share/driver.py#L139
15:56:27 <gouthamr> ^ if you toggle that timeout, manila should allow waiting until timeout
15:56:32 <carlos_silva> tbarron: no
15:56:33 <bswartz> gouthamr: it could be a network issue, or a metadata server issue, or a paramiko issue
15:56:43 <carlos_silva> ganso: ok, i'll do it
15:56:54 <tbarron> gouthamr: how do you know it's the network rather than something in the service VM that makes it slow?
15:57:06 <bswartz> That's not even counting possible new problems introduced by new versions of nova/neutron
15:57:36 <gouthamr> we haven't changed the service image over a couple of releases
15:59:02 <tbarron> gouthamr: ack but we've had lots of nondeterministic failures with timeouts and the log replaying the boot up sequence, suggesting to me that we were close to the threshold
15:59:28 * tbarron admits he doesn't know this area well though
15:59:30 <ganso> tbarron: I don't think increasing the timeout is a good alternative for our upstream CI
15:59:40 <ganso> tbarron: this kind of thing shouldn't take a very long time
15:59:54 <ganso> tbarron: last time I used it, it was instantenously
16:00:14 <tbarron> anecdotes :)
16:00:29 <tbarron> there are lots of timeouts in gate jobs, for a long time
16:00:43 <tbarron> ok, we're out of time
16:00:47 <ganso> tbarron: s/instantenously/instantaneously
16:00:57 <gouthamr> "Error reading SSH protocol banner" is a weird paramiko response - iirc it could happen for any number of reasons, a recent one that we saw was because of a wrong ssh key encryption algorithm
16:00:58 <tbarron> see you in #openstack-manila
16:01:09 <tbarron> Thanks everyone!
16:01:13 <tbarron> #endmeeting
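For anyone following up on the generic-driver SSH bug above: raising ssh_conn_timeout in manila.conf is the driver-side knob gouthamr pointed at, while the snippet below is a minimal sketch of the manual check he asked carlos_silva to run, with a generous banner timeout since the reported symptom is paramiko's "Error reading SSH protocol banner". The address and credentials are placeholders, not values from the bug report.

    # Illustrative only: the address and credentials are placeholders for a
    # generic-driver service VM; adjust them to match your deployment.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # banner_timeout matters here: "Error reading SSH protocol banner" is
    # raised when the server does not send its banner within the timeout.
    client.connect('10.254.0.3', username='manila', password='manila',
                   timeout=60, banner_timeout=60)
    stdin, stdout, stderr = client.exec_command('uptime')
    print(stdout.read().decode())
    client.close()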