15:03:55 <tbarron> #startmeeting manila
15:03:56 <openstack> Meeting started Thu Dec  6 15:03:55 2018 UTC and is due to finish in 60 minutes.  The chair is tbarron. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:04:00 <openstack> The meeting name has been set to 'manila'
15:04:20 <tbarron> for the record, I set the meeting name wrong and just restarted
15:04:24 <bswartz> .o/
15:04:27 <tbarron> please say hi again for the record
15:04:34 <ganso> hello
15:04:42 <tbarron> restarted
15:04:52 <gouthamr> o/ :)
15:05:23 * tbarron notes that xyang said hi
15:05:43 <xyang> :)
15:05:53 <carthaca> Hello
15:06:05 <tbarron> bswartz: I don't think the bot will kick me off for pinging unicasts
15:06:10 <tbarron> carthaca: hello!
15:06:34 <tbarron> we have the guy who fixes more netapp bugs than netapp here :)
15:06:43 <tbarron> Welcome Carthaca.
15:06:48 <bswartz> >_<
15:07:08 <tbarron> OK, we don't have big attendance but let's get started.
15:07:16 <carthaca> gouthamr kinda invited me in https://bugs.launchpad.net/manila/+bug/1804659 :)
15:07:17 <openstack> Launchpad bug 1804659 in Manila "speed up list storage pools" [High,In progress] - Assigned to Maurice Schreiber (maurice-schreiber)
15:07:33 <tbarron> carthaca: k, we'll get to that pretty soon
15:07:47 <tbarron> Agenda: https://wiki.openstack.org/wiki/Manila/Meetings#Next_meeting
15:07:56 <tbarron> #topic Announcements
15:08:25 <tbarron> I just want to note that the end-of-year holidays for the "western world" are coming up and that
15:08:37 <tbarron> M2 milestone will be soon after
15:08:48 <tbarron> M2 is the week of Jan. 7
15:09:00 <tbarron> We have bugs targeted for M2.
15:09:10 <tbarron> And it's the new driver deadline.
15:09:35 <tbarron> I think the only outstanding new driver is Pure, from last cycle.
15:09:42 <tbarron> Anyone know others?
15:10:16 <ganso> tbarron: deadline only for brand new drivers, correct?
15:10:32 <ganso> tbarron: driver changes could be proposed up to FPF, correct?
15:10:33 <tbarron> Well we've got about 30 back ends now so I dunno that we're desperate for more :)
15:10:42 <tbarron> ganso: correct
15:11:00 <tbarron> ok, any more announcements?
15:11:14 <tbarron> hearing none, ....
15:11:22 <tbarron> #topic new bug czar
15:11:28 <tbarron> gouthamr: your name is on this one
15:11:43 <gouthamr> yep... ty tbarron
15:12:16 <gouthamr> so, our last bug czar left Red Hat, and now has little to no time to dedicate to upstream development... :(
15:12:44 <gouthamr> so we need a new bug czar, or a bug subteam even :)
15:13:26 <gouthamr> do we have any volunteers?
15:13:43 <tbarron> gouthamr: what would a subteam look like?
15:13:55 <jgrosso> I will volunteer
15:14:03 <gouthamr> jgrosso++
15:14:12 <tbarron> terrific
15:14:36 <gouthamr> tbarron: we'd probably rotate from week to week or split up responsibility between projects
15:15:12 <gouthamr> i was combing through bugs and right now we have a lot of cruft
15:15:38 <gouthamr> i suspect we'll have a lot of initial work weeding out things we've already fixed
15:15:59 <tbarron> gouthamr: ok, well we'll let jgrosso decide whether he wants a subteam to whom he can delegate some of the work and return to the topic in another week or two, sound good?
15:16:56 <gouthamr> tbarron: sure thing.. i can work with jgrosso to bring him up to speed
15:17:03 <tbarron> gouthamr: you can get that "weeding" info to jgrosso and maybe spend a couple hours together cleaning out the garden
15:17:18 <tbarron> ok, let's move on for now
15:17:34 <gouthamr> we'd need more than a couple of hours on manila's LP
15:17:39 <tbarron> #topic Our Work for this cycle
15:17:40 <jgrosso> :)
15:17:59 * tbarron is being ridiculously optimistic, PTL prerogative
15:18:23 <tbarron> #link https://review.openstack.org/#/c/572283/
15:18:39 <tbarron> This is the main priority for access rules review
15:19:00 <tbarron> We carried it over from the last cycle and it has recently been refreshed.
15:19:38 <tbarron> I wonder if we should set up a call to work on this one together and see if we can get it advanced.
15:20:15 <gouthamr> so zhengzhenyu hasn't proposed any driver changes
15:20:19 <bswartz> Is it feature complete?
15:20:43 <gouthamr> ganso do you plan on fixing up the netapp driver?
15:20:57 <ganso> gouthamr: as we've agreed, it doesn't need fixing
15:21:38 <ganso> gouthamr: actually, just a minor change to stop reordering when there are no priority conflicts
15:21:40 <gouthamr> instinct says it does, because the driver reorders rules
15:21:53 <ganso> gouthamr: we've agreed that drivers can still reorder if they want when there is a priority conflict between rules
15:22:37 <gouthamr> ganso: oh, sure - that's what i meant
15:22:40 <tbarron> but the driver needs to change to *not* re-order otherwise, i.e. to respect the order set by the manager
15:22:53 <ganso> gouthamr: oh, then yes, I thought you implied that our driver would *break*. I meant that it will not break
15:23:19 <ganso> tbarron: yes
15:23:37 <tbarron> so bswartz I think this core patch is feature complete, we still would need driver work, client work, ui work, tempest-plugin work
15:23:38 <ganso> tbarron: minor change though
15:24:28 <gouthamr> ganso: i think we should move that reordering into the share manager for rules that still have the same priority
15:24:40 <gouthamr> ganso: so we can do it consistently across drivers
15:24:47 <tbarron> OK I don't want to get stuck, please review this one by next week.
15:25:11 <ganso> gouthamr: I suggested that in the past, because I am personally against "indeterminate behavior"
15:25:14 <tbarron> gouthamr: you are disagreeing with the result of a previous discussion.  I'm not saying we can't revisit it, just observing
15:25:20 <ganso> gouthamr: but we've agreed to not to, after a lot of discussion
15:25:42 <gouthamr> okay i am probably forgetting the previous discussion
15:26:09 <tbarron> previous discussion said let's maintain backwards compatibility when the specified priority still leaves "ties"
15:26:46 <tbarron> Please review this one by next week and have an opinion on whether we can safely merge it while the other work is pending.
15:27:00 <gouthamr> ack
15:27:02 <tbarron> like if we expose the changes via the client last
15:27:18 <tbarron> We need to make progress and not get totally stuck on this one.
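A minimal sketch of the tie handling discussed above, assuming a hypothetical 'priority' field on each rule (this is not the actual patch): a stable sort keeps the manager-supplied order whenever priorities tie, which is the backwards-compatible behavior ganso and tbarron refer to.

    # Python's sorted() is stable, so ordering by priority preserves the
    # manager-supplied order for rules whose priorities tie.
    def order_access_rules(rules):
        # Assume lower number means higher priority; ties keep original order.
        return sorted(rules, key=lambda rule: rule.get('priority', 0))

    rules = [
        {'access_to': '10.0.0.1/32', 'priority': 1},
        {'access_to': '0.0.0.0/0', 'priority': 0},
        {'access_to': '10.0.0.2/32', 'priority': 1},  # stays after 10.0.0.1/32
    ]
    print(order_access_rules(rules))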
15:27:57 <tbarron> #topic testing global change to devstack for py3 first by default
15:28:05 <tbarron> #link https://review.openstack.org/#/c/623061/
15:28:29 <tbarron> gouthamr: anything we need to do on this?
15:28:58 <gouthamr> tbarron: yes, the test failed.. i'll investigate and propose any fixes
15:28:59 <tbarron> recheck?  fix?
15:29:07 <tbarron> gouthamr: ack
15:29:19 <tbarron> I put this one in mostly for awareness.
15:29:31 <tbarron> gouthamr: ask others to help on this work :)
15:29:43 <gouthamr> http://logs.openstack.org/61/623061/2/check/manila-tempest-minimal-dsvm-lvm/d190cc1/logs/devstacklog.txt.gz#_2018-12-05_21_36_42_366
15:29:57 * gouthamr :P manila server throws a 500, no big deal
15:30:28 <tbarron> ok, stuff to figure out
15:30:36 <tbarron> good that we're doing this test
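For anyone reproducing this locally: devstack's Python 3 behavior is driven by USE_PYTHON3, so a minimal local.conf sketch looks like the following (the review above changes the upstream default rather than a per-deployment setting).

    [[local|localrc]]
    USE_PYTHON3=True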
15:30:51 <tbarron> #topic  Bugs
15:31:10 <tbarron> #link https://etherpad.openstack.org/p/manila-bug-triage-pad
15:31:28 <tbarron> let's take the bug carthaca mentioned first
15:32:07 <gouthamr> #LINK https://bugs.launchpad.net/manila/+bug/1804659
15:32:08 <openstack> Launchpad bug 1804659 in Manila "speed up list storage pools" [High,In progress] - Assigned to Maurice Schreiber (maurice-schreiber)
15:32:23 <gouthamr> thanks for being here carthaca!
15:32:30 <tbarron> +1000
15:33:41 <tbarron> Note that with the "Edge" infra work discussed in the last couple of Summits and PTGs customers are contemplating
15:33:53 <gouthamr> i believe this is a regression because of the way the host map is constructed post https://review.openstack.org/#/c/351034/
15:33:55 <tbarron> running with ~100 back ends
15:34:07 <gouthamr> ^ that too
15:34:46 <gouthamr> i believe carthaca's direction is correct, but wanted to discuss it here first..
15:34:57 <tbarron> So we need to flush out scale/performance issues in the scheduler and api services when they interact with lots of back ends
15:35:01 <gouthamr> #LINK https://review.openstack.org/#/c/619576/
15:35:23 * tbarron notes, also races ...
15:35:34 <carthaca> I already observed the pool detail list to take over 5 minutes with just 1000 shares in total and 5 netapp backends
15:36:37 <carthaca> my cache really helps, but ideally I would like to find the source of why it takes so long :(
15:37:51 <bswartz> wow
15:38:14 <bswartz> Storage pool list is the one that's not served from the DB right?
15:38:49 <gouthamr> carthaca: i think this is your culprit: https://github.com/openstack/manila/blob/fb17422/manila/scheduler/host_manager.py#L398
15:39:33 <gouthamr> bswartz: yes, we extract it from the scheduler
15:39:53 <gouthamr> but the scheduler is counting up the sizes of the shares to calculate the provisioned_capacity_gb
15:40:48 <tbarron> gouthamr: but the share sizes are in the db, why would that be slow?
15:41:19 <gouthamr> tbarron: if you have thousands of shares, that get-all-by-host call can be slowing things down
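An illustrative sketch of why that call is expensive (this is not manila's actual DB API; the model and column names are assumptions): summing share sizes in Python fetches every row for the host, whereas a single SQL aggregate pushes the work to the database and returns one scalar.

    from sqlalchemy import func

    def provisioned_gb_in_python(session, Share, host):
        # Fetches every share row for the host and sums sizes in the
        # scheduler process -- cost grows with the number of shares.
        shares = session.query(Share).filter(Share.host.like(host + '%')).all()
        return sum(share.size for share in shares)

    def provisioned_gb_in_sql(session, Share, host):
        # Pushes the summation to the database; only one scalar is returned.
        total = session.query(func.sum(Share.size)).filter(
            Share.host.like(host + '%')).scalar()
        return total or 0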
15:42:32 <carthaca> Maybe this is the right time to wish for osprofiler https://blueprints.launchpad.net/manila/+spec/manila-os-profiler ? :)
15:44:33 <gouthamr> carthaca: huh, dunno how that's used in nova, glance, cinder, but it's worth investigating
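For reference, projects that already integrate osprofiler (cinder, glance) typically enable it with a [profiler] section along these lines; whether manila would expose the same options depends on how the blueprint lands, so treat this as a sketch with a placeholder key.

    [profiler]
    enabled = true
    trace_sqlalchemy = true
    hmac_keys = SECRET_KEY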
15:46:17 <gouthamr> i'll test carthaca's code today and review, can others please take a look too?
15:46:21 <tbarron> carthaca: thanks for working on this one, it seems important.
15:46:43 <tbarron> gouthamr +1
15:47:39 <tbarron> Any more on this one now?
15:47:58 <carthaca> gouthamr: thanks
15:48:19 <tbarron> ok, I promised that we'd talk about carlos_silva's generic driver ssh bug
15:48:40 <tbarron> #link https://bugs.launchpad.net/manila/+bug/1807126
15:48:40 <openstack> Launchpad bug 1807126 in Manila "Cannot create shares using generic driver" [Undecided,New]
15:48:58 <tbarron> carlos_silva: are you around by any chance?
15:49:07 <bswartz> Who is currently looking into this one?
15:49:19 <tbarron> bswartz: it was just raised yesterday
15:49:50 <bswartz> It seems like the classic SSH problem we've seen for years in different flavors
15:49:50 <tbarron> I noticed that the same error that carlos_silva was reporting with xenial is showing up in gate with bionic
15:49:55 <carlos_silva> tbarron: yeah, i'm here
15:50:12 <tbarron> gouthamr asked carlos_silva to try the ssh connection manually and
15:50:17 <tbarron> hi carlos_silva !!~
15:50:24 <bswartz> One workaround is to use SSH with password instead of keys
15:50:40 <tbarron> carlos_silva: did you try that? ^^
15:50:55 <bswartz> The underlying problem is very hard to solve
15:51:19 <carlos_silva> bswartz: yeah, i tried
15:51:38 <carlos_silva> but it didn't work
15:51:42 <tbarron> bswartz: I think we're using p/w rather than keys in CI and we still are hitting the issue
15:52:02 <bswartz> tbarron: could be another issue then
15:52:07 <tbarron> carlos_silva: didn't you report that ssh works manually, but only after a very long time?
15:52:23 <tbarron> I think paramiko times out after 15s
15:52:36 <bswartz> The network tricks we play with the generic driver seem to interfere with nova doing the stuff it's supposed to do
15:52:38 <tbarron> manually
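A rough sketch of the kind of manual check gouthamr asked for, talking to the service VM with paramiko directly; the address, credentials, and timeout values below are placeholders.

    import time

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    start = time.time()
    # banner_timeout governs how long paramiko waits while reading the SSH
    # banner, which is where "Error reading SSH protocol banner" comes from.
    client.connect('10.254.0.10', username='manila', password='manila',
                   timeout=60, banner_timeout=60)
    print('connected in %.1fs' % (time.time() - start))
    stdin, stdout, stderr = client.exec_command('uptime')
    print(stdout.read())
    client.close()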
15:53:01 <bswartz> Could be we need to revisit the networking approach used by generic
15:53:14 <bswartz> Too bad vponomaryov isn't still here
15:53:24 <bswartz> He was the only one with a deep understanding of that code
15:53:26 <tbarron> yeah
15:53:48 <tbarron> carlos_silva: so as you see we no longer have experts in this area and
15:54:03 <carlos_silva> tbarron: yes
15:54:07 <gouthamr> so we have a config opt: "ssh_conn_timeout"
15:54:16 <tbarron> part of the problem is that it's complicated and no one is paid to work on that driver
15:54:32 <bswartz> It doesn't take a huge amount of skill to debug the problem
15:54:36 <bswartz> But it does take a huge amount of time
15:55:02 <tbarron> but University professors like it and we still document it as a reference because conceptually it is nice :)
15:55:02 <bswartz> Because you have to read all the code for generic and linux networking, and painstakingly step through the code in a debugger to see what's going on
15:55:26 <bswartz> I found horrifying things in paramiko by doing this ^
15:55:35 <tbarron> carlos_silva: is your interest in this back end using it for production?
15:55:40 <ganso> carlos_silva: maybe it would be useful to include in the bug report that ssh still works manually after a very long time
15:55:58 <tbarron> ganso: +1
15:56:03 <gouthamr> carlos_silva: it's weird that the ssh client takes a long time - does that point to some issue with the underlying network?
15:56:05 <gouthamr> https://github.com/openstack/manila/blob/0fd1b8f9fa40bdd504c9402dd5c43e86387671bd/manila/share/driver.py#L139
15:56:27 <gouthamr> ^ if you toggle that timeout, manila should allow waiting until timeout
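A sketch of where those knobs live for anyone experimenting with the generic driver; the backend section name and the values are placeholders, not recommendations.

    [generic1]
    share_driver = manila.share.drivers.generic.GenericShareDriver
    # Seconds to wait for the SSH connection (placeholder value).
    ssh_conn_timeout = 120
    # bswartz's workaround above: authenticate to the service VM with a
    # password instead of keys.
    service_instance_user = manila
    service_instance_password = manila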
15:56:32 <carlos_silva> tbarron: no
15:56:33 <bswartz> gouthamr: it could be a network issue, or a metadata server issue, or a paramiko issue
15:56:43 <carlos_silva> ganso: ok, i'll do it
15:56:54 <tbarron> gouthamr: how do you know it's the network rather than something in the service VM that makes it slow?
15:57:06 <bswartz> That's not even counting possible new problems introduced by new version of nova/neutron
15:57:36 <gouthamr> we haven't changed the service image over a couple of releases
15:59:02 <tbarron> gouthamr: ack but we've had lots of nondeterministic failures with timeouts and logs showing the boot up sequence replaying, suggesting to me that we were close to the threshold
15:59:28 * tbarron admits he doesn't know this area well though
15:59:30 <ganso> tbarron: I don't think increasing the timeout is a good alternative for our upstream CI
15:59:40 <ganso> tbarron: this kind of thing shouldn't take a very long time
16:00:14 <ganso> tbarron: last time I used it, it was instantaneous
16:00:14 <tbarron> anecdotes :)
16:00:29 <tbarron> there have been lots of timeouts in gate jobs for a long time
16:00:43 <tbarron> ok, we're out of time
16:00:57 <gouthamr> "Error reading SSH protocol banner" is a weird paramiko response - iirc it could happen for any number of reasons, a recent one that we saw was because of a wrong ssh key encryption algorithm
16:00:58 <tbarron> see you in #openstack-manila
16:01:09 <tbarron> Thanks everyone!
16:01:13 <tbarron> #endmeeting