14:00:35 #startmeeting neutron_drivers
14:00:36 Meeting started Fri Mar 8 14:00:35 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:37 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:40 The meeting name has been set to 'neutron_drivers'
14:00:50 hi
14:00:51 hi
14:01:04 hi
14:01:13 Good evening / good morning
14:01:46 doreilly: do you have anything to discuss in this meeting?
14:02:05 hi
14:02:08 yeah, the inactivity_probe thing
14:03:12 o/
14:03:26 doreilly: welcome. I have that RFE scheduled to be discussed today. I just didn't want to waste your time if your item wasn't scheduled
14:03:38 ok good
14:04:04 let's wait 1 more minute for amotoki
14:05:32 ok, let's get going
14:05:39 #topic RFEs
14:06:16 as mentioned in the exchange with doreilly, the first RFE to discuss today is https://bugs.launchpad.net/neutron/+bug/1817022
14:06:18 Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:06:37 I have a customer who says that bumping it to 15000ms helps with vif timeout issues on a heavily loaded system
14:06:41 doreilly: let's give the team a few minutes to read the RFE and then questions will come ;-)
14:08:49 ok, what "issue" can we have if we set a much longer time for this?
14:09:29 slaweq: not sure, maybe longer to reconnect
14:10:28 the default is 5 seconds. It is up to people if they want to increase it using this RFE
14:10:55 I'm generally ok with changing this to some bigger value if that is reasonable
14:11:22 ok. Like to 10 sec?
14:11:33 well, the proposal is to enable users and deployers to change it: https://review.openstack.org/#/c/641681/1/neutron/conf/plugins/ml2/drivers/ovs_conf.py
14:11:38 but my concern is: do we really need to add a config option for it?
14:12:26 if we don't have a strong position on what those values should be, I think they should be a config option
14:13:33 sure, I'm just thinking out loud now :)
14:13:44 doreilly: what are "vif timeout issues"?
14:14:09 could not really get specific details from them
14:14:18 and how does the new inactivity_probe config interact with the existing one based on ovsdb_timeout for the manager?
14:14:22 You said in the bug report that it may happen "under the heavy load in neutron-ovs-agent" - what exactly is this high load? Many ports to configure? Many ports already configured and existing in the ovs bridge?
14:14:23 but they really need to bump of_request_timeout first
14:15:00 haleyb, it does not interact.
14:15:37 doreilly: should it override that value?
14:15:58 slaweq, about 2500 iptables rules, I think, is the problem
14:16:42 slaweq: a lot of change on the system. Many vms created/deleted together
14:17:06 ok, thx doreilly
14:17:11 slaweq: security group rules that reference other security groups
14:18:08 one more question: is it maybe somehow possible (and worth doing) to change it dynamically? e.g. many port updates in a loop, increase this value; everything idle, decrease it, or something like that.
14:19:11 slaweq: like increase if we see timeouts?
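For reference, the review linked above would make these values tunable through the OVS agent configuration. A minimal sketch of what such an oslo.config option could look like; the option name, default, and group here are illustrative placeholders, not the actual patch content:

    from oslo_config import cfg

    # Illustrative sketch only: the real option name, default and group
    # live in neutron/conf/plugins/ml2/drivers/ovs_conf.py in the review.
    ovs_opts = [
        cfg.IntOpt('of_inactivity_probe', default=10, min=0,
                   help='Interval in seconds after which the OVS bridge '
                        'controller connection is probed for liveness; '
                        '0 disables probing.'),
    ]

    def register_opts(conf=cfg.CONF):
        conf.register_opts(ovs_opts, group='OVS')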
14:19:33 haleyb: for example, that could work imho
14:19:55 slaweq, probably, but it would be a good deal more complex
14:19:59 I'm trying to think of some smarter way than adding another knob to our config
14:20:30 well, the bug description states that the parameters can be set manually after an ovs agent restart
14:20:41 the commands are provided there
14:20:43 an exponential backoff algorithm would perhaps work, so that if we see a vif timeout we can retry but the timeout is higher, and then after things quiesce the timeouts go back to normal
14:21:09 so is it really feasible to do it automatically?
14:21:10 I think we already have something like that with rpc connections in the ovs agent
14:22:57 i suspect it isn't just a "busy" agent. it's more like something prevents ryu send/receive threads from being scheduled, i guess.
14:23:14 yamamoto, I agree
14:24:08 I think the agent is bogged down with iptables, and cannot reply to the echo-request
14:24:42 like, something blocking the whole process, bypassing eventlet
14:25:16 yamamoto: I think so
14:27:52 i'm not against the proposal. it might be useful for some situations. but i don't think it's a real fix.
14:28:28 yamamoto: how could we go about finding the root cause and the real fix?
14:28:51 doreilly: we are basing this on one deployer's experience / feedback, right?
14:29:07 yes
14:29:34 would that deployer be open to digging deeper?
14:29:36 Are we OK with having some vif timeouts and reacting to them, or is it important to try to avoid them before they happen?
14:30:26 njohnston_, I think it is better to avoid them, because we want to avoid a full resync
14:30:54 yeah, especially in a heavily loaded system
14:30:57 yes that makes sense
14:31:10 yes, that's true
14:31:41 what if we set some big number (hardcoded)?
14:32:18 yamamoto: in the bug they point to an irc conversation you had with kevinbenton. do you recall it? is it relevant to this discussion?
14:32:31 i guess it's difficult to find the root cause. having a canary thread sensitive to scheduling latency might be useful for debugging. eventlet itself might have that kind of debug mechanism.
14:33:06 i remember it. but i'm not sure if the root cause is the same.
14:33:59 the issue discussed in the irc conversation was on our CI env. it might be a vcpu hiccup or something like that.
14:35:32 we observed the ovs agent not producing any log messages for a very long period when the reconnect happened.
14:35:32 yamamoto: would you be willing to provide some guidance in that debugging process?
14:36:48 feel free to just say no. don't feel pressured ;-)
14:37:15 https://github.com/eventlet/eventlet/blob/master/eventlet/debug.py#L153
14:38:19 this seems relevant as far as i read the comment
14:39:17 it depends on what kind of guidance they need
14:39:51 You set the limit
14:40:21 doreilly: would you and your client be open to further investigation?
14:40:49 not really, it is a production system, so cannot do mods like that
14:40:58 ok
14:41:09 yamamoto: never mind. thanks for the openness
14:41:22 doreilly, maybe you can also take a look at this bug: https://bugs.launchpad.net/neutron/+bug/1813703
14:41:23 Launchpad bug 1813703 in neutron "[L2] [summary] ovs-agent issues at large scale" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:41:52 I cannot reproduce the disconnect/reconnect messages in my lab
14:42:28 but ovs-vswitchd reconnects and everything is okay. I don't see errors in the ovs-agent.
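The eventlet helper linked above can flag a greenthread that monopolizes the hub, which is one way to confirm the "something blocking the whole process, bypassing eventlet" theory. A minimal sketch of enabling it for debugging; the 2-second threshold is an assumed example value:

    import eventlet.debug

    # Debugging aid only: print a traceback whenever one greenthread keeps
    # the eventlet hub busy for longer than 'resolution' seconds.
    eventlet.debug.hub_blocking_detection(state=True, resolution=2)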
14:42:36 doreilly, you can use this method to reproduce the timeout http://paste.openstack.org/show/745685/
14:43:38 liuyulong_, yep that's basically what I do. I launch 100 debug probes, with lots of sg rules
14:44:33 doreilly: have you tried to make inactivity_probe shorter than the default?
14:44:55 I will try the eventlet debug in the lab, to try to find the root cause
14:45:21 yamamoto, the min is 5000ms
14:45:54 yamamoto, I don't think it's possible to go lower. Hardcoded in OVS
14:45:54 you can modify ovs then
14:46:12 yamamoto, I could
14:47:53 doreilly: so, if you are open to doing some digging in your lab, let's go ahead and do that. You will be at the top of our agenda next time when you are ready. Would that work?
14:48:13 ok thanks for the time
14:48:40 doreilly: on the contrary, thanks for your time and efforts
14:49:58 doreilly: just add a note to the RFE when you are ready. I monitor them every week, before the meeting (Thursday evening US central time zone)
14:50:12 mlavalle, okay will do
14:50:45 doreilly: and if you add whatever partial results / findings to the RFE, we can discuss them there also
14:51:57 ok, let's move on
14:52:14 Next one is https://bugs.launchpad.net/neutron/+bug/1817881
14:52:15 Launchpad bug 1817881 in neutron "[RFE] L3 IPs monitor/metering via current QoS functionality (tc filters)" [Wishlist,New]
14:52:42 We might not have time to finish the discussion today, but at least we can get acquainted with the proposal
14:52:55 and take advantage of the fact that liuyulong_ is here
14:54:58 OK, it is for L3 IPs metering.
14:55:50 L3 IPs all have QoS bandwidth limits now, based on linux TC filters.
14:56:17 This is an example:
14:56:18 # ip netns exec snat-867e1473-4495-4513-8759-dee4cb1b9cef tc -s -d -p filter show dev qg-91293cf7-64
14:56:23 Sent 17640 bytes 180 pkts (dropped 0, overlimits 0)
14:56:30 liuyulong_: but will it work only for FIPs with QoS associated?
14:56:34 or for all?
14:56:57 For floating IPs with a QoS policy bound only.
14:57:39 so, it's not a total replacement for the current metering approach
14:57:42 This is a very strong demand from different cloud users.
14:58:09 No, it is a simple alternative.
14:58:24 understood
14:59:09 if we move ahead with this, we would need to be very clear about the use case / cases this is intended to cover
14:59:34 and with that, let's continue here next week
14:59:43 Thanks for attending
14:59:45 Floating IPs should always have bandwidth constraints. We cannot open all the bandwidth to all the users all the time.
14:59:47 OK
14:59:59 have a great weekend!
15:00:06 #endmeeting
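The metering idea discussed above builds on the byte/packet counters that tc already keeps for the QoS filters on the router gateway device, as in the "tc -s -d -p filter show" output quoted at 14:56:18. A rough sketch of how those counters could be scraped; the namespace, device, and helper name are illustrative, and a real implementation would attribute counters to individual floating IP filters rather than summing them:

    import re
    import subprocess

    # Matches the per-filter statistics line, e.g.
    # "Sent 17640 bytes 180 pkts (dropped 0, overlimits 0)"
    SENT_RE = re.compile(r'Sent (\d+) bytes (\d+) pkts')

    def gateway_filter_totals(namespace, device):
        # Read tc filter statistics from inside the router namespace.
        out = subprocess.check_output(
            ['ip', 'netns', 'exec', namespace,
             'tc', '-s', '-d', '-p', 'filter', 'show', 'dev', device],
            universal_newlines=True)
        total_bytes = total_pkts = 0
        for sent_bytes, sent_pkts in SENT_RE.findall(out):
            total_bytes += int(sent_bytes)
            total_pkts += int(sent_pkts)
        return total_bytes, total_pkts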