14:00:47 #startmeeting neutron_drivers
14:00:48 Meeting started Fri Mar 29 14:00:47 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:49 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:51 The meeting name has been set to 'neutron_drivers'
14:01:00 hi
14:01:08 hi
14:01:15 hey slaweq
14:01:18 hi
14:01:22 you were not supposed to be here
14:01:42 hi
14:01:44 but I have some time now so I can be :)
14:02:15 we know haleyb won't be here
14:02:16 and as haleyb is on PTO and yamamoto wrote that he can't attend, I'm necessary to have quorum :P
14:03:12 ok, so we have minimum quorum today
14:03:22 let's get going
14:03:23 it depends on how much time slaweq has :)
14:03:27 #topic RFEs
14:03:42 amotoki: I have the whole hour :)
14:03:45 slaweq: are you good for the entire meeting?
14:03:49 cool
14:03:50 yep
14:03:54 nice
14:04:14 Thanks for taking the time, slaweq. Much appreciated!
14:04:47 I see doreilly is here
14:04:55 so let's re-take https://bugs.launchpad.net/neutron/+bug/1817022
14:04:56 Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:04:56 I am
14:05:22 yamamoto mentioned in his email that he didn't have time to look at the RFE
14:05:34 I'm afraid I did not have much time to dig deeper into the error os_ken is throwing
14:06:07 I just added a line to do a LOG.exception() - https://bugs.launchpad.net/neutron/+bug/1817022/comments/8
14:06:08 Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:07:22 is this the exception that os-ken is throwing as a consequence of your testing?
14:07:40 yes
14:07:52 yeah, I see it now
14:08:24 Datapath Invalid 104209607453507
14:08:37 did you check the bridges' datapaths?
14:08:45 but when you look at those places in os_ken, it is very generic
14:08:53 I mean: you should not have your bridges with the same datapath
14:09:24 I think I did and it looked normal. I will check again
14:10:25 doreilly: when does this happen? only when ovs-agent is restarted?
14:10:39 or does it happen during normal operations?
14:10:57 and I understand that this testing is with master or close to master
14:11:22 when restarted
14:11:36 yes, close to master
14:11:47 doreilly, yes, when restarted and of_inactivity_probe=1
14:11:58 this timeout is too short
14:12:17 liuyulong is trying to split security group rule operations and flow processing into pieces https://review.openstack.org/#/c/638642/
14:12:20 and as you pointed out in the comment, you have disconnections in the ovs logs
14:12:22 it might help your case
14:12:38 I might be wrong though..
14:12:55 OVS sets the inactivity probe to 5 sec if you don't specify
14:13:13 doreilly, that's the point
14:13:16 doreilly: 2019-03-11T16:35:25.258Z|25950|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe after 1 seconds, disconnecting
14:13:21 yes, you can try that
14:13:33 the OF manager is not connected
14:13:48 and this https://review.openstack.org/#/c/638647/
14:14:27 right. That's what the bug is about. Allowing you to increase the inactivity probe from 5 to something higher
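As background for the RFE being discussed, both settings live in the OVSDB Controller table and are expressed in milliseconds. A minimal sketch of raising them by hand, assuming ovs-vsctl is available and that the bridge (br-tun here) has a single controller so its Controller record can be addressed by the bridge name; the function name and values are only examples, not what the proposed patch uses:

    import subprocess

    def set_controller_timeouts(bridge, inactivity_probe_ms=30000, max_backoff_ms=5000):
        # Both columns live in the OVSDB Controller table and take milliseconds;
        # for a bridge with a single controller the record can be referred to by
        # the bridge name.
        subprocess.check_call([
            'ovs-vsctl', 'set', 'controller', bridge,
            'inactivity_probe=%d' % inactivity_probe_ms,
            'max_backoff=%d' % max_backoff_ms,
        ])

    # e.g. give the openflow connection 30 seconds of idle time instead of the
    # 5-second OVS default (or the 1 second used above to reproduce the bug)
    set_controller_timeouts('br-tun')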
14:15:34 ralonsoh: oh, I forgot to mention, I changed OVS to default to 1 sec, to make this easier to recreate
14:15:42 doreilly: in the log which you pointed to in the comment, I see that the exception was raised in the clean_stale_flows method
14:15:51 recently we merged liuyulong's patch https://review.openstack.org/#/c/638647/14
14:15:51 doreilly, oook, now I understand
14:16:01 would that maybe help with this issue?
14:16:45 I guess this is caused by a high load on OVS and/or the openflow controller, so it can be mitigated by averaging/splitting the OVS load.
14:17:06 638647 can change the situation.
14:17:13 agree
14:17:32 slaweq: maybe. I can retest on fresh master
14:18:00 doreilly: ok, thx
14:18:02 would be nice :)
14:18:24 the other thing is that since liuyulong is making all these improvements, maybe you two should work together
14:18:35 ???
14:19:00 and see if those improvements fix your case
14:19:02 mlavalle: that is a very good point, we should not do the same work twice :)
14:19:21 +1000
14:19:28 mlavalle: sure. I will retest. And ping liuyulong if I need anything
14:19:54 cool, let's leave it at that on this issue this week
14:20:07 https://launchpad.net/bugs/1813703 all my work can be found here.
14:20:09 Launchpad bug 1813703 in neutron "[L2] [summary] ovs-agent issues at large scale" [High,Fix released] - Assigned to LIU Yulong (dragon889)
14:20:39 Including the test method: https://bugs.launchpad.net/neutron/+bug/1813703/comments/12
14:20:40 and https://review.openstack.org/#/q/topic:bug/1813703+(status:open+OR+status:merged)
14:21:08 yeah, I'm sure the "ovs-agent issues at large scale" are likely to apply to your case
14:21:41 doreilly: thanks for pushing the ball forward
14:22:05 mlavalle: np, thx for helping
14:22:06 this is good stuff! getting to the nitty gritty
14:22:48 ok, moving to the next RFE: https://bugs.launchpad.net/neutron/+bug/1817881
14:22:49 Launchpad bug 1817881 in neutron " [RFE] L3 IPs monitor/metering via current QoS functionality (tc filters)" [Wishlist,Confirmed]
14:22:58 speaking of liuyulong
14:24:20 I will not repeat the description, but will give you an example. We have implemented this and report L3 IP statistic data to zabbix.
14:25:32 YAMAMOTO had mentioned the REST API; IMO, it's a good idea.
14:25:50 liuyulong: regarding the exchange you had with yamamoto, you are not proposing API changes now?
14:26:21 liuyulong: what about FIPs without QoS enabled? Will this metering work for them too?
14:26:39 slaweq, yes, you get the point.
14:27:02 mlavalle, I will not add the API for now, we can remove the qos policy binding to disable the monitoring.
14:27:46 mlavalle, but it can be a next step.
14:27:58 liuyulong: I really don't like the idea of enabling/disabling metering by adding/removing a qos policy to a fip - those are 2 different things
14:28:14 yeah, that was what I originally understood, but then the later comments confused me
14:28:50 totally agree with slaweq
14:29:37 if we would like to control monitoring targets, it is better to specify them via the API as the current metering API does. (it is a bit different though)
14:29:50 amotoki: +1000
14:29:56 so, is the mechanism ok but we would like a different approach from the API point of view?
14:30:04 so you guys think an API is necessary?
14:30:15 and also, if we want to have such metering, it should be possible to enable it for all fips, not only those with qos
14:30:49 slaweq, yes, it can be done, but tc now does not accept '0' as an input value.
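Since tc refuses a rate of 0, counting traffic for a FIP that has no QoS policy would mean installing a policer whose rate is simply too high to ever drop anything. A rough sketch of such a counter-only filter, reusing the namespace, device and address from the tc paste further down; the helper name and the rate/burst values are illustrative, and it assumes the qdisc with handle 1: that the tc-based FIP QoS code already creates on the gateway device:

    import subprocess

    def add_counting_filter(namespace, device, fip_address):
        # Illustrative only: a u32 match on the FIP with a policer whose rate
        # is high enough never to drop, so the filter exists purely for its
        # byte/packet counters.
        subprocess.check_call([
            'ip', 'netns', 'exec', namespace,
            'tc', 'filter', 'add', 'dev', device,
            'parent', '1:', 'protocol', 'ip', 'prio', '1',
            'u32', 'match', 'ip', 'src', '%s/32' % fip_address,
            'police', 'rate', '10gbit', 'burst', '10m', 'mtu', '64kb',
            'drop', 'flowid', ':1',
        ])

    # hypothetical invocation, matching the shape of the paste shown later
    add_counting_filter('snat-867e1473-4495-4513-8759-dee4cb1b9cef',
                        'qg-91293cf7-64', '172.16.100.10')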
14:31:25 slaweq, if there are no tc filter rules, no statistic data can be collected
14:31:33 liuyulong: so we can "work around" this by e.g. configuring tc with some very high limit for each fip
14:31:33 liuyulong: API is not a MUST, but the next question is back to why we need to set QoS rules for monitoring. they are different things.
14:31:49 amotoki: exactly :)
14:32:27 OK, back to the original question.
14:32:59 How do we monitor L3 IPs for now?
14:34:11 It seems a lot of work needs to be done: start a new metering agent, add metering rules, and iptables gets larger and larger
14:34:39 # ip netns exec snat-867e1473-4495-4513-8759-dee4cb1b9cef tc -s -d -p filter show dev qg-91293cf7-64
14:34:39 filter parent 1: protocol ip pref 1 u32
14:34:39 filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1
14:34:40 filter parent 1: protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid :1 not_in_hw (rule hit 180 success 180)
14:34:41 match IP src 172.16.100.10/32 (success 180 )
14:34:43 police 0x2 rate 1024Kbit burst 128Kb mtu 64Kb action drop overhead 0b linklayer ethernet
14:34:45 ref 1 bind 1 installed 86737 sec used 439 sec
14:34:47 Sent 17640 bytes 180 pkts (dropped 0, overlimits 0)
14:35:28 liuyulong: can your question be rephrased to "we would like to meter traffic per L3 IP" (as you mentioned in the bug description)
14:36:04 "how" is the next thing. the first thing we need is what we want.
14:36:37 ok, makes sense
14:36:57 These statistics can easily be collected by the l3 agent itself.
14:37:32 No extra agent involved; the L3 agent can totally handle it.
14:38:00 therefore we see an improvement in the feature performance, right?
14:38:35 mlavalle, indeed
14:38:56 up to this point, this is what I understood originally of this RFE
14:39:21 liuyulong: if we add it to L3, we will duplicate the functionality of the metering service plugin in L3, right?
14:39:22 and the proposal is to use tc as the mechanism in an L3 extension, right?
14:39:59 can't we e.g. add a "tc driver" to the metering service plugin?
14:40:03 Yes, just read tc filter rules' statistic data.
14:41:50 like we did with the firewall. we went from iptables to ovs / openflow
14:43:15 I like the 'tc driver' idea too
14:43:47 OK, I think I need to add the API for this.
14:44:08 metering has two portions: one is where we collect stats and the other is where we report stats.
14:44:31 for the first one, we discussed iptables or tc stats.
14:44:48 and liuyulong wants to change the first part (how to collect stats)
14:44:50 for the latter, the current metering plugin sends notifications to ceilometer.
14:45:01 we can improve the latter if we want to.
14:45:13 yeap
14:45:38 Then we need to start the metering agent again?
14:46:05 liuyulong: what is wrong with starting the agent?
14:46:15 the metering agent is one choice. the other way is to handle metering in the l3-agent.
14:47:21 I am not sure which one is better and/or easier right now.
14:48:33 overall, this doesn't seem to be a burning issue in the community (although it might be hot for liuyulong's employer). so I would like to shoot for the better approach
14:48:33 amotoki: but IMO we should finally have only one way to do it. There is no point in duplicating the metering agent's functionality in the L3 agent and keeping both of them in the code
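Whichever home is chosen (a tc driver in the metering plugin or an l3 agent extension), "just read tc filter rules' statistic data" boils down to parsing output like the paste above. A rough sketch of that collection step; the function name and regexes are assumptions, not anything merged in neutron:

    import re
    import subprocess

    MATCH_RE = re.compile(r'match IP src (?P<ip>\S+)/32')
    SENT_RE = re.compile(r'Sent (?P<bytes>\d+) bytes (?P<pkts>\d+) pkts')

    def read_fip_counters(namespace, device):
        """Return {fip: (bytes, packets)} read from tc filter statistics."""
        out = subprocess.check_output([
            'ip', 'netns', 'exec', namespace,
            'tc', '-s', '-d', '-p', 'filter', 'show', 'dev', device,
        ]).decode()
        counters = {}
        current_ip = None
        for line in out.splitlines():
            m = MATCH_RE.search(line)
            if m:
                # remember which FIP the following "Sent ..." line belongs to
                current_ip = m.group('ip')
                continue
            s = SENT_RE.search(line)
            if s and current_ip:
                counters[current_ip] = (int(s.group('bytes')),
                                        int(s.group('pkts')))
                current_ip = None
        return counters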
14:48:58 yeah, but we can gradually transition
14:49:00 slaweq: makes sense
14:49:25 again, the firewall comes to mind
14:49:28 mlavalle: I agree that it can be done smoothly, I just want to say that we should probably keep it in mind
14:49:42 we are saying the same thing
14:49:45 ++
14:49:56 :)
14:50:00 yes, makes sense, I can try that
14:50:58 two agents can potentially touch the same thing and it can be a problem, but I don't know whether it is a problem or not now, so I commented as such above.
14:51:08 amotoki, how to report data is not the big problem; it is well handled, both udp and notifications can be used.
14:51:33 liuyulong: yeah, I am not worried about it either :)
14:51:45 Yes, for that driver approach, I once had the same idea, but I find it is a bit difficult to handle the router update/floating IP update, removal, and
14:52:14 L3 agent restart, tc rule removal, tc rule update, etc.
14:53:10 All these things are handled by the l3 agent itself, but another agent would rely on that.
14:53:33 liuyulong is explaining what I think in more detail :)
14:54:41 even more, HA router failover
14:55:19 And for DVR local host floating IP monitoring, we would need to start the metering agent on all compute hosts too.
14:55:45 The message queue may not be happy, IMO.
14:57:42 and all this is avoided with an L3 extension, right?
14:58:49 mlavalle, our local implementation is a wrapper for the l3 agent.
14:59:08 We call it l3_agent_with_metering.
14:59:30 I don't see a problem with the l3 agent extension approach. let's refine the RFE with the API suggestions
14:59:39 and we'll retake it next week
14:59:48 is that a good summary of next steps?
14:59:56 * mlavalle looks at the clock
15:00:01 +1
15:00:04 mlavalle: +1
15:00:07 OK
15:00:08 I agree that it can be done as another extension to the L3 agent, in the same way as qos is done, for example
15:00:18 cool
15:00:24 #endmeeting
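For completeness, the l3 agent extension shape agreed on at the end might look roughly like the sketch below. Every name in it is a hypothetical placeholder (it is not the real neutron_lib L3 extension interface nor liuyulong's l3_agent_with_metering wrapper); it only illustrates wiring the counter reading from the earlier sketch to a periodic report, with router lifecycle hooks analogous to how the qos extension is driven by the l3 agent.

    import time

    class FipMeteringExtension(object):
        """Hypothetical periodic reporter of per-FIP tc counters."""

        def __init__(self, read_counters, report, interval=60):
            # read_counters: e.g. the read_fip_counters sketch above
            # report: whatever transport is chosen (notification to
            # ceilometer, UDP to zabbix, ...)
            self.read_counters = read_counters
            self.report = report
            self.interval = interval
            self.routers = {}  # router_id -> (namespace, gateway device)

        # lifecycle hooks the l3 agent would call on router/FIP changes;
        # the names here are illustrative only
        def add_router(self, router_id, namespace, gw_device):
            self.routers[router_id] = (namespace, gw_device)

        def delete_router(self, router_id):
            self.routers.pop(router_id, None)

        def run_once(self):
            for router_id, (ns, dev) in self.routers.items():
                self.report(router_id, self.read_counters(ns, dev))

        def run_forever(self):
            # in a real extension this would be a periodic task, not a loop
            while True:
                self.run_once()
                time.sleep(self.interval)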