14:00:35 <mlavalle> #startmeeting neutron_drivers
14:00:36 <openstack> Meeting started Fri Mar  8 14:00:35 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:40 <openstack> The meeting name has been set to 'neutron_drivers'
14:00:50 <haleyb> hi
14:00:51 <yamamoto> hi
14:01:04 <doreilly> hi
14:01:13 <mlavalle> Good evening / good morning
14:01:46 <mlavalle> doreilly: do you have anything to discuss in this meeting?
14:02:05 <slaweq> hi
14:02:08 <doreilly> yeah, the inactivity_probe thing
14:03:12 <njohnston_> o/
14:03:26 <mlavalle> doreilly: welcome. I have that rfe scheduled to be discussed today. I just didn't want to waste your time if your item wasn't scheduled
14:03:38 <doreilly> ok good
14:04:04 <mlavalle> let's wait 1 more minute for amotoki
14:05:32 <mlavalle> ok, let's get going
14:05:39 <mlavalle> #topic RFEs
14:06:16 <mlavalle> as mentioned in the exchange with doreilly, the first RFE to discuss today is https://bugs.launchpad.net/neutron/+bug/1817022
14:06:18 <openstack> Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:06:37 <doreilly> I have a customer who says that bumping it to 15000ms helps with vif timeout issues on a heavily loaded system
14:06:41 <mlavalle> doreilly: let's give the team a few minutes to read the RFE and then questions will come ;-)
14:08:49 <slaweq> ok, what "issue" can we have if we set a much longer time for this?
14:09:29 <doreilly> slaweq: not sure, maybe longer to reconnect
14:10:28 <doreilly> the default is 5 seconds. It is up to people if they want to increase it using this RFE
14:10:55 <slaweq> I'm generally ok to change this to some bigger value if that is reasonable
14:11:22 <doreilly> ok. Like to 10sec?
14:11:33 <mlavalle> well, the proposal is to enable users and deployers to change it: https://review.openstack.org/#/c/641681/1/neutron/conf/plugins/ml2/drivers/ovs_conf.py
14:11:38 <slaweq> but my concern is do we really need to add config options for it?
14:12:26 <mlavalle> if we don't have a strong position on what those values should be, I think they should be a config option
14:13:33 <slaweq> sure, I'm just thinking out loud now :)
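For reference, the kind of agent-side options the linked review proposes could be sketched with oslo.config roughly as follows; the option names, defaults and group here are illustrative assumptions, not copied from the review.

from oslo_config import cfg

# Hypothetical options mirroring the RFE; names and defaults are
# illustrative only.
ovs_controller_opts = [
    cfg.IntOpt('of_inactivity_probe', default=10,
               help='Seconds of inactivity before the OVS bridge controller '
                    'connection sends an inactivity probe.'),
    cfg.IntOpt('of_connect_max_backoff', default=8,
               help='Maximum seconds to wait between reconnection attempts '
                    'to the OVS bridge controller.'),
]

cfg.CONF.register_opts(ovs_controller_opts, group='OVS')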
14:13:44 <yamamoto> doreilly: what's "vif timeout issues"?
14:14:09 <doreilly> could not really get specific details from them
14:14:18 <haleyb> and how does the new inactivity_probe config interact with the existing one based on ovsdb_timeout for the manager?
14:14:22 <slaweq> You said in the bug report that it may happen "under the heavy load in neutron-ovs-agent" - what exactly is this high load here? Many ports to configure? Many ports already configured and existing in the ovs bridge?
14:14:23 <doreilly> but they really need to bump of_request_timeout first
14:15:00 <doreilly> haleyb, it does not interact.
14:15:37 <haleyb> doreilly: should it override that value?
14:15:58 <doreilly> slaweq, about 2500 iptables rules, I think that is the problem
14:16:42 <doreilly> slaweq: a lot of change on the system. Many vms created/deleted together
14:17:06 <slaweq> ok, thx doreilly
14:17:11 <doreilly> slaweq: security group rules that reference other security groups
14:18:08 <slaweq> one more question: is it maybe somehow possible (and worth doing) to change it dynamically? e.g. many port updates in a loop, increase this value; all idle, decrease it, or something like that.
14:19:11 <haleyb> slaweq: like increase if we see timeouts?
14:19:33 <slaweq> haleyb: for example, that could work imho
14:19:55 <doreilly> slaweq, probably, but it would be a good deal more complex
14:19:59 <slaweq> I'm trying to think about some smarter way than adding another knob to our config
14:20:30 <mlavalle> well, the bug description states that the parameters can be set manually after an ovs agent restart
14:20:41 <mlavalle> the commands are provided there
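The manual commands the bug refers to would be roughly of the following shape, shown here as a small Python sketch; the br-int bridge name and the 15000/30000 ms values are assumptions for illustration (both Controller table columns take milliseconds).

import subprocess

# Apply the controller tuning by hand; values are examples only and, as
# noted above, they need to be re-applied after an ovs agent restart.
subprocess.check_call(
    ['ovs-vsctl', 'set', 'controller', 'br-int', 'inactivity_probe=15000'])
subprocess.check_call(
    ['ovs-vsctl', 'set', 'controller', 'br-int', 'max_backoff=30000'])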
14:20:43 <njohnston_> an exponential backoff algorithm would perhaps work, so that if we see a vif timeout we can retry but the timeout is higher, and then after things quiesce the timeouts go back to normal
14:21:09 <mlavalle> so is it really feasible to do it automatically?
14:21:10 <slaweq> I think we already have something like that with rpc connections in the ovs agent
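A rough sketch of the adaptive behaviour njohnston_ describes, assuming the agent could notice controller timeouts and re-apply the probe value; the bounds and helper names below are hypothetical.

# Hypothetical exponential backoff for the probe interval: double it when a
# timeout is observed, decay back toward the minimum once things quiesce.
MIN_PROBE_MS = 5000
MAX_PROBE_MS = 60000


def probe_after_timeout(current_ms):
    return min(current_ms * 2, MAX_PROBE_MS)


def probe_when_quiet(current_ms):
    return max(current_ms // 2, MIN_PROBE_MS)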
14:22:57 <yamamoto> i suspect it isn't just a "busy" agent. it's more like something prevents ryu send/receive threads from being scheduled i guess.
14:23:14 <doreilly> yamamoto, I agree
14:24:08 <doreilly> I think the agent is bogged down with iptables, and cannot reply to the echo-request
14:24:42 <yamamoto> like, something blocking the whole process, bypassing eventlet
14:25:16 <doreilly> yamamoto: I think so
14:27:52 <yamamoto> i'm not against the proposal. it might be useful for some situations. but i don't think it's a real fix.
14:28:28 <mlavalle> yamamoto: how could we go about finding the root cause and real fix?
14:28:51 <mlavalle> doreilly: we are basing this in one deployer experience / feedback, right?
14:29:07 <doreilly> yes
14:29:34 <mlavalle> would that deployer be open to dig deeper?
14:29:36 <njohnston_> Are we OK with having some vif timeouts and reacting to that, or is it important to try to avoid them before they happen?
14:30:26 <doreilly> njohnston_, I think it is better to avoid them, because we want to avoid a full resync
14:30:54 <mlavalle> yeah, especially in a heavily loaded system
14:30:57 <njohnston_> yes that makes sense
14:31:10 <slaweq> yes, that's true
14:31:41 <slaweq> what if we set some big (hardcoded) number?
14:32:18 <mlavalle> yamamoto: in the bug they point to an irc conversation you had with kevinbenton. do you recall it? is it relevant to this discussion?
14:32:31 <yamamoto> i guess it's difficult to find the root cause. having a canary thread sensitive to scheduling latency might be useful for debugging. eventlet itself might have that kind of debug mechanism.
14:33:06 <yamamoto> i remember it. but i'm not sure if the root cause is the same.
14:33:59 <yamamoto> the issue discussed in the irc conversation was on our CI env.  it might be a vcpu hiccup or something like that.
14:35:32 <yamamoto> we observed the ovs agent not producing any log messages for a very long period when the reconnect happened.
14:35:32 <mlavalle> yamamoto: would you be willing to provide some guidance in that debugging process?
14:36:48 <mlavalle> feel free to just say no. don't feel pressured ;-)
14:37:15 <yamamoto> https://github.com/eventlet/eventlet/blob/master/eventlet/debug.py#L153
14:38:19 <yamamoto> this seems relevant as far as i read the comment
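For anyone trying this in a lab, the eventlet hook linked above can be turned on roughly like this; the 2-second resolution is an arbitrary choice, and per eventlet's own warning it is meant for debugging, not production.

from eventlet import debug

# Eventlet arms an alarm around hub switches, so any code that blocks the
# whole process for longer than `resolution` seconds without yielding is
# interrupted with a traceback pointing at the culprit.
debug.hub_blocking_detection(state=True, resolution=2)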
14:39:17 <yamamoto> it depends on what kind of guidance they need
14:39:51 <mlavalle> You set the limit
14:40:21 <mlavalle> doreilly: would you and your client be open to further investigation?
14:40:49 <doreilly> not really, it is a production system, so cannot do mods like that
14:40:58 <mlavalle> ok
14:41:09 <mlavalle> yamamoto: never mind. thanks for the openness
14:41:22 <liuyulong_> doreilly, maybe you can also take look at this bug: https://bugs.launchpad.net/neutron/+bug/1813703
14:41:23 <openstack> Launchpad bug 1813703 in neutron "[L2] [summary] ovs-agent issues at large scale" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:41:52 <doreilly> I cannot reproduce disconnect/reconnect messages in my lab
14:42:28 <doreilly> but ovs-vswitchd reconnects and everything is okay. I don't see errors in the ovs-agent.
14:42:36 <liuyulong_> doreilly, you can take this method to reproduce the timeout http://paste.openstack.org/show/745685/
14:43:38 <doreilly> liuyulong_, yep that's basically what I do. I launch 100 debug probes, with lots of sg rules
14:44:33 <yamamoto> doreilly: have you tried to make inactivity_probe shorter than default?
14:44:55 <doreilly> I will try the eventlet debug in the lab, to try to find root cause
14:45:21 <doreilly> yamamoto, min is 5000ms
14:45:54 <doreilly> yamamoto, I don't think it's possible to go lower. Hardcoded in OVS
14:45:54 <yamamoto> you can modify ovs then
14:46:12 <doreilly> yamamoto, I could
14:47:53 <mlavalle> doreilly: so, if you are open to doing some digging in your lab, let's go ahead and do that. You will be at the top of our agenda next time, when you are ready. Would that work?
14:48:13 <doreilly> ok thanks for the time
14:48:40 <mlavalle> doreilly: on the contrary, thanks for your time and efforts
14:49:58 <mlavalle> doreilly: just add a note to the RFE when you are ready. I monitor them every week, before the meeting (Thursday evening US central time zone)
14:50:12 <doreilly> mlavalle, okay will do
14:50:45 <mlavalle> doreilly: and if you add whatever partial results / findings to the RFE, we can discuss them there also
14:51:57 <mlavalle> ok, let's move on
14:52:14 <mlavalle> Next one is https://bugs.launchpad.net/neutron/+bug/1817881
14:52:15 <openstack> Launchpad bug 1817881 in neutron " [RFE] L3 IPs monitor/metering via current QoS functionality (tc filters)" [Wishlist,New]
14:52:42 <mlavalle> We might not have time to finish the discussion today, but at least we can get acquainted with the proposal
14:52:55 <mlavalle> and take advantage that liuyulong_ is here
14:54:58 <liuyulong_> OK, it is for L3 IPs metering.
14:55:50 <liuyulong_> L3 IPs all have QoS bandwidth limits now, based on Linux TC filters.
14:56:17 <liuyulong_> This is an example:
14:56:18 <liuyulong_> # ip netns exec snat-867e1473-4495-4513-8759-dee4cb1b9cef tc -s -d -p filter show dev qg-91293cf7-64
14:56:23 <liuyulong_> Sent 17640 bytes 180 pkts (dropped 0, overlimits 0)
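A rough idea of how metering could consume output like that, assuming the stats-line format shown above; the function name, regex and subprocess call here are illustrative assumptions, not the proposed implementation.

import re
import subprocess

# Pull the per-filter byte/packet counters for a router gateway device out
# of `tc -s -d -p filter show` run inside the router namespace.
STATS_RE = re.compile(r'Sent (?P<bytes>\d+) bytes (?P<pkts>\d+) pkts?')


def tc_filter_counters(namespace, device):
    out = subprocess.check_output(
        ['ip', 'netns', 'exec', namespace,
         'tc', '-s', '-d', '-p', 'filter', 'show', 'dev', device],
        universal_newlines=True)
    return [(int(m.group('bytes')), int(m.group('pkts')))
            for m in STATS_RE.finditer(out)]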
14:56:30 <slaweq> liuyulong_: but will it work only for FIPs with QoS associated?
14:56:34 <slaweq> or for all?
14:56:57 <liuyulong_> For floating IPs with a bound QoS policy only.
14:57:39 <mlavalle> so, it's not a total replacement for the current metering approach
14:57:42 <liuyulong_> This is a very strong demand from different cloud users.
14:58:09 <liuyulong_> No, it is a simple alternative.
14:58:24 <mlavalle> understood
14:59:09 <mlavalle> if we move ahead with this, we would need to be very clear with the use case / cases this is intended to cover
14:59:34 <mlavalle> and with that, let's continue here next week
14:59:43 <mlavalle> Thanks for attending
14:59:45 <liuyulong_> Floating IPs should always have bandwidth constraints. We cannot open all the bandwidth to all the users all the time.
14:59:47 <liuyulong_> OK
14:59:59 <mlavalle> have a great weekend!
15:00:06 <mlavalle> #endmeeting