15:01:15 <tbarron> #startmeeting manila
15:01:16 <openstack> Meeting started Thu Jan  3 15:01:15 2019 UTC and is due to finish in 60 minutes.  The chair is tbarron. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:20 <openstack> The meeting name has been set to 'manila'
15:02:03 <tbarron> ping bswartz
15:02:05 <ganso> hello
15:02:13 <gouthamr> o/
15:02:18 <tbarron> ping ganso
15:02:21 <tbarron> oh hi
15:02:26 <bswartz> .o/
15:02:32 <tbarron> ping zhongjun
15:02:36 <tbarron> ping zyang
15:02:41 <tbarron> ping toabctl
15:02:45 <tbarron> ping erlon
15:02:48 <toabctl> hey
15:02:49 <tbarron> ping tpsiva
15:02:53 <tbarron> ping amito
15:02:57 <tbarron> ping vkmc
15:03:25 <tbarron> that's the ping list, add yourself at https://wiki.openstack.org/wiki/Manila/Meetings if you want
15:03:41 * tbarron waits a couple minutes
15:03:52 <bswartz> Happy New Year!
15:04:22 <tbarron> ok, Hi all, and happy new year!
15:04:26 <gouthamr> Happy New Year, everyone! :)
15:04:47 <tbarron> #topic announcements
15:04:59 <tbarron> oh, our agenda is here
15:05:07 <ganso> happy new year!
15:05:16 <tbarron> #link https://wiki.openstack.org/wiki/Manila/Meetings
15:05:38 <tbarron> gouthamr: do you have any important football announcements?
15:05:47 <erlon> hey
15:05:57 <tbarron> hi erlon
15:06:03 <gouthamr> :D hehe, my new year started on a good note this year tbarron
15:06:20 <bswartz> I'm not thrilled with the football situation
15:06:35 * tbarron tries to start a fight when he can
15:06:50 <gouthamr> bswartz: wut, thought you guys won rose bowl
15:07:22 <bswartz> We did but that's not good enough for me
15:07:37 <gouthamr> :D then beat Purdue next time
15:07:40 <tbarron> but on a different note, milestone 2 is next week
15:07:41 * gouthamr ducks
15:07:56 <bswartz> >_<
15:07:59 <tbarron> That's the new driver submission deadline but I don't think we have any.
15:08:39 <tbarron> Otherwise, I plan to cut some intermediary releases so if there's anything important you want included let me know.
15:08:53 <xyang> hi
15:09:03 <tbarron> It would be nice to be python3 ready by then but we'll see ...
15:09:14 <tbarron> hi xyang
15:09:27 <bswartz> Why releases plural?
15:09:27 <tbarron> Any other announcements?
15:09:37 <bswartz> Isn't it just 1 release for the milestone?
15:09:42 <tbarron> bswartz: manila-ui, client, etc.
15:09:49 <bswartz> Okay library releases
15:09:51 <tbarron> they're not required at milestone
15:10:14 <tbarron> but it's a convenient time and we're being encouraged not to just wait till the end of the cycle.
15:11:05 <tbarron> Any other announcements?
15:11:19 <tbarron> I guess we can welcome toabctl back :)
15:11:42 <toabctl> thx :) but still part-time :)
15:11:44 <amito> (sorry, was in another meeting that ran too long)
15:11:46 <tbarron> Happy new year, and thanks for your cleanup patches and reviews again.
15:11:58 <tbarron> amito: hey, happy new year.
15:12:07 <amito> tbarron: thanks, happy new year!
15:12:32 <tbarron> #topic new user-developer experience
15:12:57 <tbarron> special thanks to gouthamr for https://review.openstack.org/#/c/627020/
15:13:34 <tbarron> We've had a fair number of folks trying to set up devstack and not getting anywhere good
15:14:13 <tbarron> But gouthamr has written up new instructions, including a gentler intro than just dropping them into full DHSS=True complexity at the start.
15:14:27 <tbarron> I think it will be a great help.
15:15:09 <tbarron> We still don't have great, easy to follow instructions for running tempest locally, so if anyone wants to
15:15:18 <tbarron> contribute on that front it would be great.
15:15:57 <tbarron> #topic gate issues
15:16:02 <tbarron> Somewhat related
15:16:23 <tbarron> we've made some progress cleaning up all the red non-voting first-party jobs.
15:16:31 <tbarron> but not enough.
15:16:54 <tbarron> The generic driver jobs were failing with SSH header protocol exceptions
15:17:05 <gouthamr> ++, a relief to see https://review.openstack.org/#/c/627854/ passing
15:17:23 <tbarron> And we had new users trying to use the generic driver locally and they were failing
15:17:27 <tbarron> at the same point.
15:17:41 <bswartz> gouthamr: what was the trick here?
15:17:51 <tbarron> I've submitted a series of patches that *mitigate* the issue and that will help us debug further I hope.
15:17:56 <bswartz> Or do I need to go read all 5 of those patches?
15:18:01 <tbarron> bswartz: crude, up the timeout.
15:18:12 <bswartz> which timeout
15:18:21 <gouthamr> was going to ask tbarron: a combination of adjusting the SSH banner timeout, setting the group?
15:18:24 <tbarron> and remove an old keepalive hack we had in the paramiko code that no one else has anymore.
15:18:33 <gouthamr> or this aswell? https://review.openstack.org/#/c/627797
15:18:37 <tbarron> there's a specific banner timeout that we need to bump
15:18:38 <gouthamr> ah
15:18:42 <tbarron> as a workaround
15:19:07 * gouthamr notes this must have taken much of tbarron's holiday
15:19:17 <tbarron> that enables it to work most of the time just the way we see openssh client work if you wait long enough
15:19:21 <bswartz> So we suspect something is slow (nova or neutron) and just waiting a little longer works around the slowness?
15:19:58 <tbarron> then https://review.openstack.org/#/c/627020/ seems to fix an intermittent issue even with the long timeouts
15:20:24 <bswartz> tbarron: wrong link?
15:20:25 <tbarron> gouthamr: not really (holiday) - mostly I was away from other work concerns so I could think a little bit
15:20:27 <ganso> tbarron:  ^wrong patch?
15:20:57 <tbarron> *link https://review.openstack.org/#/c/627797
15:21:04 <tbarron> sorry bout that
15:21:11 <bswartz> Oh that one
15:21:22 <bswartz> I never investigated this keepalive stuff so I'll look more closely
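A minimal sketch of the workaround being discussed here, assuming a paramiko-based connection like the one the generic driver uses: connect() takes a separate banner_timeout that covers the initial SSH protocol-version exchange, which is the phase that was timing out. The address, key path, and 120-second value below are illustrative assumptions, not manila's actual settings.

```python
# Hedged sketch, not manila's actual code: bump paramiko's banner_timeout so
# the client waits longer for the server's SSH protocol-version header.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    "203.0.113.10",                          # service VM address (example)
    username="manila",
    key_filename="/etc/manila/ssh/id_rsa",   # hypothetical key path
    timeout=30,                              # TCP connect timeout
    banner_timeout=120,                      # wait longer for the SSH banner
)
_, stdout, _ = client.exec_command("uname -a")
print(stdout.read().decode())
client.close()
```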
15:21:31 <tbarron> Now there's still an issue that will probably turn into a bug for neutron/ovs
15:22:09 <tbarron> If you install tcpdump on the SVM and run it when connecting to it from the openssh client
15:22:31 <tbarron> you see "spurious retransmissions" of the SSH version header from the client.
15:22:53 <tbarron> the client keeps sending it over and over, with exponential backoff
15:23:13 <tbarron> And you see the server responding with acks (you see that on the server)
15:23:13 <bswartz> To the correct IP?
15:23:19 <tbarron> bswartz: yes
15:23:30 <bswartz> So it's not a DHCP issue
15:23:42 <tbarron> if you run tcpdump on the client you don't see the ACKs from the server
15:23:45 <bswartz> Probably the service VM is stuck halfway booted
15:23:51 <tbarron> which explains the retransmissions
15:24:03 <tbarron> no, I'm on the svm running tcpdump
15:24:04 <bswartz> Have you ever collected the kernel logs from a service VM where this issue occurred?
15:24:34 <tbarron> bswartz: nova console log shows it booted and I'm sshed into it,
15:24:44 <bswartz> There's a recurring (and extremely annoying) problem in the linux kernel where bootstrapping the random number generator can take 90+ seconds in virtual environments
15:24:50 <tbarron> running a second ssh login
15:25:33 <bswartz> So you're able to SSH in shortly after boot? And it's just manila that can't connect?
15:25:56 <tbarron> bswartz: manila can connect too if you give it long timeouts
15:25:57 <toabctl> bswartz, you might need haveged installed to have entropy
15:26:24 <tbarron> I'm debugging the slow connection, not complete failure
15:26:30 <bswartz> toabctl: yeah there are several workarounds to the issue, install userspace tools is one of them
15:26:38 <tbarron> You can connect after about 110s.
15:27:06 <tbarron> So connect, fix dns, install tcpdump.  Run tcpdump.
15:27:08 <bswartz> Well what I've seen in the past is that sshd gets stuck while starting up because it's waiting for the kernel to be able to supply random numbers
15:27:15 <tbarron> in another window connect again.
15:27:28 <bswartz> After the kernel finishes RNG initialization, sshd is able to fully start
15:27:35 <tbarron> bswartz: what i'm seeing is no lack of response on the server, but
15:27:41 <ganso> bswartz: after it connects, if you disconnect and try again, it still takes 110 seconds
15:27:46 <bswartz> In unfortunate situations that can take ~90 seconds
15:27:48 <tbarron> packets from server not getting back to the client
15:27:58 <bswartz> Hmm okay
15:28:08 <bswartz> Perhaps a red herring
15:28:23 <gouthamr> tbarron: is this limited to the initial connection?
15:28:24 <tbarron> i've been playing with mtu reduction, etc.
15:28:45 <tbarron> gouthamr: seems to be limited to the ssh header transmission
15:28:48 <gouthamr> tbarron: i mean, would this happen after share creation as well, i.e, when updating exports
15:29:07 <tbarron> the server is piggybacking its ssh header with an ack for the tcp segment from the client that
15:29:16 <tbarron> contains the client's ssh header
15:29:29 <tbarron> these are not extremely long though and later
15:29:34 <tbarron> after it unstalls
15:29:46 <tbarron> there's a key exchange with much bigger payload
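For anyone trying to reproduce the stall without a full manila run, a small stdlib-only probe along these lines can time how long the server's SSH version banner takes to arrive after the client sends its own header. The service VM address is an illustrative assumption, and the generous 180-second timeout just keeps the probe from giving up before the roughly 110-second stall resolves.

```python
# Hedged sketch: time the SSH version-header exchange against a service VM.
import socket
import time

HOST, PORT = "203.0.113.10", 22  # service VM address (example)

start = time.monotonic()
with socket.create_connection((HOST, PORT), timeout=180) as sock:
    connected = time.monotonic() - start
    sock.sendall(b"SSH-2.0-bannerprobe_0.1\r\n")  # client protocol header
    sock.settimeout(180)
    banner = sock.recv(256)                       # server protocol header
    total = time.monotonic() - start

print("TCP connect took %.1fs" % connected)
print("server banner %r arrived after %.1fs" % (banner, total))
```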
15:31:18 <tbarron> anyways I'd like to get this issue resolved as well, not just longer timeouts, etc. as a workaround.
15:31:40 <tbarron> But with the workaround, we see that the main issues remaining with the
15:31:44 <tbarron> generic jobs are:
15:31:49 <tbarron> 1) migration failures
15:32:00 <tbarron> 2) timeouts on the scenario job
15:32:37 <tbarron> inspecting the scenario job, the test cases that run for tens of minutes are migration cases
15:33:03 <tbarron> so I'm wondering what people would think of splitting the host-assisted migration tests into their own job
15:33:11 <ganso> I've seen a lot of lvm failures lately. Since it is voting, it is requiring a lot of rechecks
15:33:34 <bswartz> I'm confused about the 2 issues
15:33:38 <tbarron> That way maybe the other jobs will be green most of the time and we can focus on the red stuff separately.
15:33:46 <bswartz> Are the migrations causing the timeouts, or are the migrations failing outright?
15:33:51 <ganso> tbarron: would we be splitting the host-assisted migration tests for all drivers or just generic in DHSS=True?
15:34:00 <tbarron> bswartz: both
15:34:09 <bswartz> Gah
15:34:11 <tbarron> ganso: not sure, what do you think?
15:34:23 <bswartz> Splitting would help with timeouts but not the outright failures
15:35:01 <tbarron> bswartz: it won't make those tests pass but it will allow us  to see when non-migration stuff fails in the generic job
15:35:02 <ganso> tbarron, bswartz: yep
15:35:13 <tbarron> by looking at the normally-green job
15:35:19 <tbarron> without having to dive into the logs
15:35:31 <tbarron> and the actual failures are intermittent
15:35:43 <tbarron> and probably cascade
15:35:49 <bswartz> I see
15:36:26 <tbarron> so my idea is that we limit the scope of the actual problem cases to speed things up, on the one hand, and
15:36:33 <tbarron> limit collateral damage on the other
15:36:51 <bswartz> Would the split be permanent though?
15:36:56 <bswartz> Or just until we sort out the problem?
15:36:59 <tbarron> ganso: you are right that there are other non-migration intermittent issues
15:37:23 <tbarron> like lvm job snapshot races
15:37:41 <tbarron> bswartz: I haven't really thought it through that far
15:37:58 <bswartz> Do we need locks in the LVM driver?
15:38:11 <tbarron> but if we can stabilize everything and get it to run fast enough then maybe we should consolidate again
15:38:15 <tbarron> or have that as a goal
15:38:46 <tbarron> bswartz: maybe.  I don't understand the issue well enough to say.
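If the snapshot races turn out to be concurrent operations stepping on each other, an external lock in the driver is one option; below is a hedged sketch using oslo.concurrency, where the function and lock names are hypothetical rather than manila's actual LVM driver code, and the lock itself is not a confirmed fix.

```python
# Hedged sketch: serialize snapshot operations per share with file locks.
from oslo_concurrency import lockutils

lockutils.set_defaults(lock_path="/tmp")  # real services set this via oslo.config


def _do_lvm_snapshot(share_id, snapshot_id):
    # Placeholder for the real lvcreate/mount work.
    print("snapshotting %s as %s" % (share_id, snapshot_id))


def create_snapshot(share_id, snapshot_id):
    # external=True serializes across processes (multiple share-service
    # workers), not just across greenthreads in one process.
    @lockutils.synchronized("manila-lvm-%s" % share_id, external=True)
    def _locked():
        _do_lvm_snapshot(share_id, snapshot_id)
    return _locked()


create_snapshot("share-123", "snap-456")
```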
15:39:23 <tbarron> we also have some intermittent failures in access-rule tests.
15:39:58 <tbarron> Anyways, I'm trying to sort these out a bit, make sure we have bugs, and try to get out of the mode where we just ignore failures.
15:40:16 <bswartz> +1
15:40:26 <tbarron> #topic our work for stein
15:40:49 <tbarron> priority for access rules https://review.openstack.org/#/c/572283/
15:41:04 <ganso> the revert to snapshot feature also has races, the dummy driver fails from time to time on that test
15:41:25 <tbarron> ganso: agree, do you know if we have a bug for that?
15:41:50 <ganso> tbarron: I am not sure, would need to look it up. Since it is failing for a long time someone might have already opened one
15:42:15 <tbarron> I don't see an active champion/driver for the priority for access rules work and fear that it won't get done this cycle either.
15:42:49 <tbarron> That review has had a -1 on it  since Dec. 13 with no update.
15:43:09 <tbarron> The client side review in the same situation, only longer.
15:43:42 <tbarron> At this point I'm inclined to just indicate that it's at risk for not getting done two cycles in a row and
15:44:01 <toabctl> tbarron, I can remove my -1 if others think that splitting commits is useless
15:44:03 <bswartz> is zhongjun here?
15:44:14 <tbarron> try to figure out the gate failures we have currently with access rules.
15:44:30 <bswartz> No updates suggests that she's not working on it
15:44:49 <ganso> tbarron: I don't think splitting commits is that helpful
15:45:02 <ganso> tbarron, toabctl: oops, wrong Thomas
15:45:05 <ganso> lol
15:45:07 <tbarron> toabctl: well your -1 shouldn't be a blocker, if the champion for the feature disagrees then they should say so and keep pushing
15:45:08 <ganso> toabctl: ^
15:45:41 <toabctl> ok
15:46:44 <tbarron> toabctl: theoretically I agree with you but practically I would be willing to just push ahead if we can be confident that this one is safe and that the rest of the work has momentum
15:47:25 <tbarron> My issue is that there is no champion for  the feature, there's a lot of work to do besides just this patch, including regression testing and
15:47:33 <toabctl> tbarron, sure. I just think that it takes *more* time if you have a huge commit. but fine for me to leave it as it is
15:47:39 <tbarron> we already have some intermittent access rule test issues.
15:48:29 <tbarron> Anyways, I'm going to work on getting gate more stable -- including access rule tests -- and ignore this feature unless someone else drives it.
15:48:57 <tbarron> Moving along.
15:49:01 <tbarron> Python3.
15:49:24 <tbarron> I want to get this done more or less around milestone 2.
15:50:06 <tbarron> It looks like we don't set the actual python3 version variable (python3.6) so the lvm job
15:50:21 <tbarron> currently fails to start up the api service when running under py3.
15:50:32 <gouthamr> >
15:50:35 <tbarron> Not sure why at the moment, it's set for cinder jobs.
15:50:40 <gouthamr> we do set it
15:50:44 <tbarron> But we'll sort that out.
15:50:44 * gouthamr checks
15:50:58 <tbarron> gouthamr: I think cinder jobs learn it.
15:51:02 <gouthamr> tbarron: https://review.openstack.org/#/c/623061/3/playbooks/legacy/manila-tempest-minimal-dsvm-lvm/run.yaml
15:51:31 <tbarron> gouthamr: not the boolean, the actual version
15:51:34 <gouthamr> oh, wait, you mean python"3.6"
15:51:55 <tbarron> it's a different variable, cinder jobs learn it, and hellman's patch uses it
15:52:04 <tbarron> so we'll sort that out
15:52:06 <gouthamr> ah, i see
15:52:23 <gouthamr> like this one here: https://review.openstack.org/#/c/607379/2/devstack-vm-gate-wrap.sh
15:52:37 <tbarron> And we need to get the current centos jobs running under bionic so they can be py3 too
15:52:52 <tbarron> Hopefully we can get these out of the way in the coming week.
15:53:24 <tbarron> And declare the job done except potentially for moving from native eventlet wsgi to uwsgi.
15:53:46 <tbarron> The last may or may not be needed since in production everyone uses
15:54:09 <tbarron> httpd or something in front of the api
15:54:26 <tbarron> and since eventlet/py3 issues may be getting sorted anyways.
15:55:02 <tbarron> #topic bugs
15:55:26 <tbarron> I don't have any hot ones other than the gate issues we discussed.
15:55:35 <tbarron> Some of which are also hitting users.
15:56:03 <tbarron> And some of which need new bugs filed, or existing bugs tracked down and prioritized.
15:56:15 <tbarron> Anyone else have particular bugs to talk about today?
15:56:38 <tbarron> #topic open discussion
15:56:55 <tbarron> ganso?
15:57:34 <gouthamr> i'd like to discuss "public" shares, but that might be a longer topic
15:57:39 <gouthamr> case in point: https://bugs.launchpad.net/manila/+bug/1801763
15:57:40 <openstack> Launchpad bug 1801763 in Manila "public flag should be controllable via policy" [Medium,Confirmed]
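For context, the bug asks that setting a share's public flag be gated by policy instead of being open to any share owner. A hedged sketch of what that could look like with oslo.policy follows; the rule name and the admin-only default are assumptions for illustration, not what manila actually ships.

```python
# Hedged sketch: a dedicated policy rule controlling who may make shares public.
from oslo_config import cfg
from oslo_policy import policy

cfg.CONF([], project="example")  # stand-alone init; real services configure this
ENFORCER = policy.Enforcer(cfg.CONF)
ENFORCER.register_default(
    policy.RuleDefault(
        name="share:create_public_share",    # hypothetical rule name
        check_str="role:admin",              # assumed admin-only default
        description="Create or update a share visible to all projects.",
    )
)


def can_make_share_public(creds):
    # creds would normally come from the request context,
    # e.g. {"roles": ["admin"], "project_id": "..."}.
    return ENFORCER.enforce("share:create_public_share", target={}, creds=creds)


print(can_make_share_public({"roles": ["admin"]}))   # True
print(can_make_share_public({"roles": ["member"]}))  # False
```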
15:57:57 <gouthamr> i can add it to the agenda for next week
15:57:57 <tbarron> gouthamr: that's a good one :)
15:58:08 <tbarron> gouthamr: ok
15:58:31 <tbarron> you might get people to talk about it in #openstack-manila in the mean time as well
15:58:51 <gouthamr> sure
15:59:21 <gouthamr> also we have someone in #openstack-horizon asking about manila-ui
16:00:05 <gouthamr> i tried helping, but i am out of ideas on why it isn't working for the guy
16:00:05 <tbarron> gouthamr: does he have vkmc's latest rdo fix?
16:00:15 <gouthamr> tbarron: he does
16:00:23 <tbarron> k, we're out of time
16:00:28 <tbarron> Thanks everyone!
16:00:35 <tbarron> #endmeeting