15:01:15 <tbarron> #startmeeting manila
15:01:16 <openstack> Meeting started Thu Jan 3 15:01:15 2019 UTC and is due to finish in 60 minutes. The chair is tbarron. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:20 <openstack> The meeting name has been set to 'manila'
15:02:03 <tbarron> ping bswartz
15:02:05 <ganso> hello
15:02:13 <gouthamr> o/
15:02:18 <tbarron> ping ganso
15:02:21 <tbarron> oh hi
15:02:26 <bswartz> .o/
15:02:32 <tbarron> ping zhongjun
15:02:36 <tbarron> ping zyang
15:02:41 <tbarron> ping toabctl
15:02:45 <tbarron> ping erlon
15:02:48 <toabctl> hey
15:02:49 <tbarron> ping tpsiva
15:02:53 <tbarron> ping amito
15:02:57 <tbarron> ping vkmc
15:03:25 <tbarron> that's the ping list, add yourself at https://wiki.openstack.org/wiki/Manila/Meetings if you want
15:03:41 * tbarron waits a couple minutes
15:03:52 <bswartz> Happy New Year!
15:04:22 <tbarron> ok, Hi all, and happy new year!
15:04:26 <gouthamr> Happy New Year, everyone! :)
15:04:47 <tbarron> #topic announcements
15:04:59 <tbarron> oh, our agenda is here
15:05:07 <ganso> happy new year!
15:05:16 <tbarron> #link https://wiki.openstack.org/wiki/Manila/Meetings
15:05:38 <tbarron> gouthamr: do you have any important football announcements?
15:05:47 <erlon> hey
15:05:57 <tbarron> hi erlon
15:06:03 <gouthamr> :D hehe, my new year started on a good note this year tbarron
15:06:20 <bswartz> I'm not thrilled with the football situation
15:06:35 * tbarron tries to start a fight when he can
15:06:50 <gouthamr> bswartz: wut, thought you guys won rose bowl
15:07:22 <bswartz> We did but that's not good enough for me
15:07:37 <gouthamr> :D then beat Purdue next time
15:07:40 <tbarron> but on a different note, milestone 2 is next week
15:07:41 * gouthamr ducks
15:07:56 <bswartz> >_<
15:07:59 <tbarron> That's the new driver submission deadline but I don't think we have any.
15:08:39 <tbarron> Otherwise, I plan to cut some intermediary releases so if there's anything important you want included let me know.
15:08:53 <xyang> hi
15:09:03 <tbarron> It would be nice to be python3 ready by then but we'll see ...
15:09:14 <tbarron> hi xyang
15:09:27 <bswartz> Why releases plural?
15:09:27 <tbarron> Any other announcements?
15:09:37 <bswartz> Isn't it just 1 release for the milestone?
15:09:42 <tbarron> bswartz: manila-ui, client, etc.
15:09:49 <bswartz> Okay library releases
15:09:51 <tbarron> they're not required at milestone
15:10:14 <tbarron> but it's a convenient time and we're being encouraged not to just wait till the end of the cycle.
15:11:05 <tbarron> Any other announcements?
15:11:19 <tbarron> I guess we can welcome toabctl back :)
15:11:42 <toabctl> thx :) but still part-time :)
15:11:44 <amito> (sorry, was in another meeting that ran too long)
15:11:46 <tbarron> Happy new year, and thanks for your cleanup patches and reviews again.
15:11:58 <tbarron> amito: hey, happy new year.
15:12:07 <amito> tbarron: thanks, happy new year!
15:12:32 <tbarron> #topic new user-developer experience
15:12:57 <tbarron> special thanks to gouthamr for https://review.openstack.org/#/c/627020/
15:13:34 <tbarron> We've had a fair number of folks trying to set up devstack and not getting anywhere good
15:14:13 <tbarron> But gouthamr has written up new instructions, including a gentler intro than just dropping them into full DHSS=True complexity at the start.
15:14:27 <tbarron> I think it will be a great help.
15:15:09 <tbarron> We still don't have great, easy to follow instructions for running tempest locally, so if anyone wants to
15:15:18 <tbarron> contribute on that front it would be great.
15:15:57 <tbarron> #topic gate issues
15:16:02 <tbarron> Somewhat related
15:16:23 <tbarron> we've made some progress cleaning up all the red non-voting first-party jobs.
15:16:31 <tbarron> but not enough.
15:16:54 <tbarron> The generic driver jobs were failing with SSH header protocol exceptions
15:17:05 <gouthamr> ++, a relief to see https://review.openstack.org/#/c/627854/ passing
15:17:23 <tbarron> And we had new users trying to use the generic driver locally and they were failing
15:17:27 <tbarron> at the same point.
15:17:41 <bswartz> gouthamr: what was the trick here?
15:17:51 <tbarron> I've submitted a series of patches that *mitigate* the issue and that will help us debug further I hope.
15:17:56 <bswartz> Or do I need to go read all 5 of those patches?
15:18:01 <tbarron> bswartz: crude, up the timeout.
15:18:12 <bswartz> which timeout
15:18:21 <gouthamr> was going to ask tbarron: a combination of adjusting the SSH banner timeout, setting the group?
15:18:24 <tbarron> and remove an old keepalive hack we had in the paramiko code that no one else has anymore.
15:18:33 <gouthamr> or this as well? https://review.openstack.org/#/c/627797
15:18:37 <tbarron> there's a specific banner timeout that we need to bump
15:18:38 <gouthamr> ah
15:18:42 <tbarron> as a workaround
15:19:07 * gouthamr notes this must have taken much of tbarron's holiday
15:19:17 <tbarron> that enables it to work most of the time just the way we see the openssh client work if you wait long enough
15:19:21 <bswartz> So we suspect something is slow (nova or neutron) and just waiting a little longer works around the slowness?
15:19:58 <tbarron> then https://review.openstack.org/#/c/627020/ seems to fix an intermittent issue even with the long timeouts
15:20:24 <bswartz> tbarron: wrong link?
15:20:25 <tbarron> gouthamr: not really (holiday) - mostly I was away from other work concerns so could think a little bit
15:20:27 <ganso> tbarron: ^wrong patch?
15:20:57 <tbarron> *link https://review.openstack.org/#/c/627797
15:21:04 <tbarron> sorry bout that
15:21:11 <bswartz> Oh that one
15:21:22 <bswartz> I never investigated this keepalive stuff so I'll look more closely
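(Annotation: the banner-timeout workaround discussed above maps onto paramiko's banner_timeout argument to SSHClient.connect(), which controls how long the client waits for the server's SSH protocol banner. The sketch below only illustrates that knob; the host, credentials, key path, and timeout values are hypothetical, and this is not the actual patch linked above.)

    # Illustrative sketch only -- not manila's code. Shows paramiko's
    # banner_timeout knob, which sets how long the client waits for the
    # server's SSH protocol banner before raising an SSHException.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        hostname="10.254.0.5",                  # hypothetical service VM address
        username="manila",                      # hypothetical login
        key_filename="/etc/manila/ssh/id_rsa",  # hypothetical key path
        timeout=30,                             # TCP connect timeout, seconds
        banner_timeout=120,                     # outlast the ~110s stall seen in the gate
    )
    _, stdout, _ = client.exec_command("uptime")
    print(stdout.read().decode())
    client.close()

(Raising this one timeout is the crude mitigation described above; the retransmission problem discussed next still needs a real fix.)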
15:21:31 <tbarron> Now there's still an issue that will probably turn into a bug for neutron/ovs
15:22:09 <tbarron> If you install tcpdump on the SVM and run it when connecting to it from the openssh client
15:22:31 <tbarron> you see "spurious retransmissions" of the SSH version header from the client.
15:22:53 <tbarron> the client keeps sending it over and over, with exponential backoff
15:23:13 <tbarron> And you see the server responding with acks (you see that on the server)
15:23:13 <bswartz> To the correct IP?
15:23:19 <tbarron> bswartz: yes
15:23:30 <bswartz> So it's not a DHCP issue
15:23:42 <tbarron> if you run tcpdump on the client you don't see the ACKs from the server
15:23:45 <bswartz> Probably the service VM is stuck halfway booted
15:23:51 <tbarron> which explains the retransmissions
15:24:03 <tbarron> no, I'm on the svm running tcpdump
15:24:04 <bswartz> Have you ever collected the kernel logs from a service VM where this issue occurred?
15:24:34 <tbarron> bswartz: nova console log shows it booted and I'm sshed into it,
15:24:44 <bswartz> There's a recurring (and extremely annoying) problem in the linux kernel where bootstrapping the random number generator can take 90+ seconds in virtual environments
15:24:50 <tbarron> running a second ssh login issue
15:25:09 <tbarron> running a second ssh login
15:25:33 <bswartz> So you're able to SSH in shortly after boot? And it's just manila that can't connect?
15:25:56 <tbarron> bswartz: manila can connect too if you give it long timeouts
15:25:57 <toabctl> bswartz, you might need haveged installed to have entropy
15:26:24 <tbarron> I'm debugging the slow connection, not complete failure
15:26:30 <bswartz> toabctl: yeah there are several workarounds to the issue, installing userspace tools is one of them
15:26:38 <tbarron> You can connect after about 110s.
15:27:06 <tbarron> So connect, fix dns, install tcpdump. Run tcpdump.
15:27:08 <bswartz> Well what I've seen in the past is that sshd gets stuck while starting up because it's waiting for the kernel to be able to supply random numbers
15:27:15 <tbarron> in another window connect again.
15:27:28 <bswartz> After the kernel finishes RNG initialization, sshd is able to fully start
15:27:35 <tbarron> bswartz: what i'm seeing is no lack of response on the server, but
15:27:41 <ganso> bswartz: after it connects, if you disconnect and try again, it still takes 110 seconds
15:27:46 <bswartz> In unfortunate situations that can take ~90 seconds
15:27:48 <tbarron> packets from server not getting back to the client
15:27:58 <bswartz> Hmm okay
15:28:08 <bswartz> Perhaps a red herring
15:28:23 <gouthamr> tbarron: is this limited to the initial connection?
15:28:24 <tbarron> i've been playing with mtu reduction, etc.
15:28:45 <tbarron> gouthamr: seems to be limited to the ssh header transmission
15:28:48 <gouthamr> tbarron: i mean, would this happen after share creation as well, i.e., when updating exports
15:29:07 <tbarron> the server is piggybacking its ssh header with an ack for the tcp segment from the client that
15:29:16 <tbarron> contains the client's ssh header
15:29:29 <tbarron> these are not extremely long though and later
15:29:34 <tbarron> after it unstalls
15:29:46 <tbarron> there's a key exchange with much bigger payload
15:31:18 <tbarron> anyways I'd like to get this issue resolved as well, not just longer timeouts, etc. as a workaround.
15:31:40 <tbarron> But with the workaround, we see that the main issues remaining with the
15:31:44 <tbarron> generic jobs are:
15:31:49 <tbarron> 1) migration failures
15:32:00 <tbarron> 2) timeouts on the scenario job
15:32:37 <tbarron> inspecting the scenario job, the test cases that run for tens of minutes are migration cases
15:33:03 <tbarron> so I'm wondering what people would think of splitting the host-assisted migration tests into their own job
15:33:11 <ganso> I've seen a lot of lvm failures lately. Since it is voting, it is requiring a lot of rechecks
15:33:34 <bswartz> I'm confused about the 2 issues
15:33:38 <tbarron> That way maybe the other jobs will be green most of the time and we can focus on the red stuff separately.
15:33:46 <bswartz> Are the migrations causing the timeouts, or are the migrations failing outright?
15:33:51 <ganso> tbarron: would we be splitting the host-assisted migration tests for all drivers or just generic in DHSS=True?
15:34:00 <tbarron> bswartz: both
15:34:09 <bswartz> Gah
15:34:11 <tbarron> ganso: not sure, what do you think?
15:34:23 <bswartz> Splitting would help with timeouts but not the outright failures
15:35:01 <tbarron> bswartz: it won't make those tests pass but it will allow us to see when non-migration stuff fails in the generic job
15:35:02 <ganso> tbarron, bswartz: yep
15:35:13 <tbarron> by looking at the normally-green job
15:35:19 <tbarron> without having to dive into the logs
15:35:31 <tbarron> and the actual failures are intermittent
15:35:43 <tbarron> and probably cascade
15:35:49 <bswartz> I see
15:36:26 <tbarron> so my idea is that we limit the scope of the actual problem cases to speed things up, on the one hand, and
15:36:33 <tbarron> limit collateral damage on the other
15:36:51 <bswartz> Would the split be permanent though?
15:36:56 <bswartz> Or just until we sort out the problem?
15:36:59 <tbarron> ganso: you are right that there are other non-migration intermittent issues
15:37:23 <tbarron> like lvm job snapshot races
15:37:41 <tbarron> bswartz: I haven't really thought it through that far
15:37:58 <bswartz> Do we need locks in the LVM driver?
15:38:11 <tbarron> but if we can stabilize everything and get it to run fast enough then maybe we should consolidate again
15:38:15 <tbarron> or have that as a goal
15:38:46 <tbarron> bswartz: maybe. I don't understand the issue well enough to say.
15:39:23 <tbarron> we also have some intermittent failures in access-rule tests.
15:39:58 <tbarron> Anyways, I'm trying to sort these out a bit, make sure we have bugs, and try to get out of the mode where we just ignore failures.
15:40:16 <bswartz> +1
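(Annotation: regarding the question above about whether the LVM driver needs locks for the snapshot races -- the usual OpenStack pattern is oslo.concurrency's synchronized decorator. The sketch below only illustrates that pattern under assumed names; the class, methods, lock names, and lock path are hypothetical, and whether locking is the right fix is exactly the open question in the discussion.)

    # Hedged illustration of per-share locking with oslo.concurrency.
    # Not manila's actual LVM driver; all names here are made up.
    from oslo_concurrency import lockutils

    # Lock path would normally come from config; set a default for the sketch.
    lockutils.set_defaults("/tmp/manila-locks")
    synchronized = lockutils.synchronized_with_prefix("manila-")


    class ExampleLVMDriver(object):
        def create_snapshot(self, share_id, snapshot_id):
            # One external (file-based) lock per share: concurrent snapshot
            # requests for the same share are serialized, while operations on
            # different shares still run in parallel.
            @synchronized("lvm-share-%s" % share_id, external=True)
            def _locked():
                return self._lvcreate_snapshot(share_id, snapshot_id)

            return _locked()

        def _lvcreate_snapshot(self, share_id, snapshot_id):
            # Placeholder for the real lvcreate/export logic.
            print("snapshotting %s as %s" % (share_id, snapshot_id))

(Whether a per-share lock, a coarser lock, or a fix elsewhere in the snapshot flow is appropriate is the part nobody claims to understand well enough yet.)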
15:40:26 <tbarron> #topic our work for stein
15:40:49 <tbarron> priority for access rules https://review.openstack.org/#/c/572283/
15:41:04 <ganso> the revert to snapshot feature also has races, the dummy driver fails from time to time on that test
15:41:25 <tbarron> ganso: agree, do you know if we have a bug for that?
15:41:50 <ganso> tbarron: I am not sure, would need to look it up. Since it has been failing for a long time someone might have already opened one
15:42:15 <tbarron> I don't see an active champion/driver for the priority for access rules work and fear that it won't get done this cycle either.
15:42:49 <tbarron> That review has had a -1 on it since Dec. 13 with no update.
15:43:09 <tbarron> The client side review is in the same situation, only longer.
15:43:42 <tbarron> At this point I'm inclined to just indicate that it's at risk of not getting done two cycles in a row and
15:44:01 <toabctl> tbarron, I can remove my -1 if others think that splitting commits is useless
15:44:03 <bswartz> is zhongjun here?
15:44:14 <tbarron> try to figure out the gate failures we have currently with access rules.
15:44:30 <bswartz> No updates suggests that she's not working on it
15:44:49 <ganso> tbarron: I don't think splitting commits is that helpful
15:45:02 <ganso> tbarron, toabctl: oops, wrong Thomas
15:45:05 <ganso> lol
15:45:07 <tbarron> toabctl: well your -1 shouldn't be a blocker, if the champion for the feature disagrees then they should say so and keep pushing
15:45:08 <ganso> toabctl: ^
15:45:41 <toabctl> ok
15:46:44 <tbarron> toabctl: theoretically I agree with you but practically I would be willing to just push ahead if we can be confident that this one is safe and that the rest of the work has momentum
15:47:25 <tbarron> My issue is that there is no champion for the feature, there's a lot of work to do besides just this patch, including regression testing and
15:47:33 <toabctl> tbarron, sure. I just think that it takes *more* time if you have a huge commit. But fine for me to leave it as it is
15:47:39 <tbarron> we already have some intermittent access rule test issues.
15:48:29 <tbarron> Anyways, I'm going to work on getting the gate more stable -- including access rule tests -- and ignore this feature unless someone else drives it.
15:48:57 <tbarron> Moving along.
15:49:01 <tbarron> Python3.
15:49:24 <tbarron> I want to get this done more or less around milestone 2.
15:50:06 <tbarron> It looks like we don't set the actual python3 version variable (python3.6) so the lvm job
15:50:21 <tbarron> currently fails to start up the api service when running under py3.
15:50:32 <gouthamr> >
15:50:35 <tbarron> Not sure why at the moment, it's set for cinder jobs.
15:50:40 <gouthamr> we do set it
15:50:44 <tbarron> But we'll sort that out.
15:50:44 * gouthamr checks
15:50:58 <tbarron> gouthamr: I think cinder jobs learn it.
15:51:02 <gouthamr> tbarron: https://review.openstack.org/#/c/623061/3/playbooks/legacy/manila-tempest-minimal-dsvm-lvm/run.yaml
15:51:31 <tbarron> gouthamr: not the boolean, the actual version
15:51:34 <gouthamr> oh, wait, you mean python"3.6"
15:51:55 <tbarron> it's a different variable, cinder jobs learn it, and hellmann's patch uses it
15:52:04 <tbarron> so we'll sort that out
15:52:06 <gouthamr> ah, i see
15:52:23 <gouthamr> like this one here: https://review.openstack.org/#/c/607379/2/devstack-vm-gate-wrap.sh
15:52:37 <tbarron> And we need to get the current centos jobs running under bionic so they can be py3 too
15:52:52 <tbarron> Hopefully we can get these out of the way in the coming week.
15:53:24 <tbarron> And declare the job done except potentially for moving from native eventlet wsgi to uwsgi.
15:53:46 <tbarron> The last may or may not be needed since in production everyone uses
15:54:09 <tbarron> httpd or something in front of the api
15:54:26 <tbarron> and since eventlet/py3 issues may be getting sorted anyways.
15:55:02 <tbarron> #topic bugs
15:55:26 <tbarron> I don't have any hot ones other than the gate issues we discussed.
15:55:35 <tbarron> Some of which are also hitting users.
15:56:03 <tbarron> And some which need new bugs filed or old bugs discovered and prioritized.
15:56:15 <tbarron> Anyone else have particular bugs to talk about today?
15:56:38 <tbarron> #topic open discussion
15:56:55 <tbarron> ganso?
15:57:34 <gouthamr> i'd like to discuss "public" shares, but that might be a longer topic
15:57:39 <gouthamr> case in point: https://bugs.launchpad.net/manila/+bug/1801763
15:57:40 <openstack> Launchpad bug 1801763 in Manila "public flag should be controllable via policy" [Medium,Confirmed]
15:57:57 <gouthamr> i can add it to the agenda for next week
15:57:57 <tbarron> gouthamr: that's a good one :)
15:58:08 <tbarron> gouthamr: ok
15:58:31 <tbarron> you might get people to talk about it in #openstack-manila in the meantime as well
15:58:51 <gouthamr> sure
15:59:21 <gouthamr> also we have someone in #openstack-horizon asking about manila-ui
16:00:05 <gouthamr> i tried helping, but i am out of ideas on why it isn't working for the guy
16:00:05 <tbarron> gouthamr: does he have vkmc's latest rdo fix?
16:00:15 <gouthamr> tbarron: he does
16:00:23 <tbarron> k, we're out of time
16:00:28 <tbarron> Thanks everyone!
16:00:35 <tbarron> #endmeeting
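(Annotation: for bug 1801763, "public flag should be controllable via policy", queued above for next week's agenda -- the usual OpenStack approach is an oslo.policy rule checked by the API before honoring the attribute. The sketch below only illustrates that pattern; the rule name, default check string, and helper function are hypothetical and are not manila's actual policy definitions.)

    # Hedged sketch of gating a request attribute behind an oslo.policy rule.
    # Rule name and defaults are hypothetical, not manila's real policy.
    from oslo_config import cfg
    from oslo_policy import policy

    CONF = cfg.CONF
    enforcer = policy.Enforcer(CONF)
    enforcer.register_default(
        policy.RuleDefault(
            name="example:share:set_public",   # hypothetical rule name
            check_str="role:admin",            # e.g. admin-only by default
            description="Allow making a share publicly visible.",
        )
    )


    def maybe_set_public(creds, share, requested_public):
        # Only honor is_public=True when policy allows it for this caller;
        # do_raise makes a denied check raise PolicyNotAuthorized.
        if requested_public:
            enforcer.authorize(
                "example:share:set_public",
                target={"project_id": share.get("project_id")},
                creds=creds,
                do_raise=True,
            )
        share["is_public"] = requested_public
        return share

(With a default like this, a caller whose creds carry the admin role passes the check, and operators could loosen or tighten the rule in their policy configuration rather than it being hard-coded behavior.)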