14:00:47 <mlavalle> #startmeeting neutron_drivers
14:00:48 <openstack> Meeting started Fri Mar 29 14:00:47 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:51 <openstack> The meeting name has been set to 'neutron_drivers'
14:01:00 <ralonsoh> hi
14:01:08 <slaweq> hi
14:01:15 <mlavalle> hey slaweq
14:01:18 <doreilly> hi
14:01:22 <mlavalle> you were not supposed to be here
14:01:42 <amotoki> hi
14:01:44 <slaweq> but I have some time now so I can be :)
14:02:15 <mlavalle> we know haleyb won't be here
14:02:16 <slaweq> and since haleyb is on PTO and yamamoto wrote that he can't attend, I'm necessary to have quorum :P
14:03:12 <mlavalle> ok, so we have minimum quorum today
14:03:22 <mlavalle> let's get going
14:03:23 <amotoki> it depends how much time slaweq has :)
14:03:27 <mlavalle> #topic RFEs
14:03:42 <slaweq> amotoki: I have whole hour :)
14:03:45 <mlavalle> slaweq: are you good for the entire meeting?
14:03:49 <mlavalle> cool
14:03:50 <slaweq> yep
14:03:54 <amotoki> nice
14:04:14 <mlavalle> Thanks for taking the time slaweq. Much appreciated!
14:04:47 <mlavalle> I see doreilly is here
14:04:55 <mlavalle> so let's revisit https://bugs.launchpad.net/neutron/+bug/1817022
14:04:56 <openstack> Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:04:56 <doreilly> I am
14:05:22 <mlavalle> yamamoto mentioned in his email that he didn't have time to look at the RFE
14:05:34 <doreilly> I'm afraid I did not have much time to dig deeper into the error os_ken is throwing
14:06:07 <doreilly> I just added a line to do a LOG.exception() - https://bugs.launchpad.net/neutron/+bug/1817022/comments/8
14:06:08 <openstack> Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:07:22 <mlavalle> is this the exception that os-ken is throwing as a consequence of your testing?
14:07:40 <doreilly> yes
14:07:52 <mlavalle> yeah, I see it now
14:08:24 <ralonsoh> Datapath Invalid 104209607453507
14:08:37 <ralonsoh> did you check the bridges datapaths?
14:08:45 <doreilly> but when you look at those places in os_ken, it is very generic
14:08:53 <ralonsoh> I mean: you should not have your bridges with the same datapath
14:09:24 <doreilly> I think I did and it looked normal. I will check again
14:10:25 <amotoki> doreilly: when does this happen? only when ovs-agent is restarted?
14:10:39 <amotoki> or does it happen during normal operations?
14:10:57 <mlavalle> and I understand that this testing is with master or close to master
14:11:22 <doreilly> when restarted
14:11:36 <doreilly> yes close to master
14:11:47 <ralonsoh> doreilly, yes, when restarted and of_inactivity_probe=1
14:11:58 <ralonsoh> this timeout is too short
14:12:17 <amotoki> liuyulong is trying to split security group rule operations and flow processing into pieces https://review.openstack.org/#/c/638642/
14:12:20 <ralonsoh> and as you pointed out in the comment, you have disconnections in the ovs logs
14:12:22 <amotoki> it might help your case
14:12:38 <amotoki> I might be wrong though..
14:12:55 <doreilly> OVS sets the inactivity probe to 5 sec if you don't specify
14:13:13 <ralonsoh> doreilly, that's the point
14:13:16 <ralonsoh> doreilly: 2019-03-11T16:35:25.258Z|25950|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe after 1 seconds, disconnecting
14:13:21 <liuyulong> yes, you can try that
14:13:33 <ralonsoh> the OF manager is not connected
14:13:48 <liuyulong> and this https://review.openstack.org/#/c/638647/
14:14:27 <doreilly> right. That's what the bug is about. Allowing you to increase the inactivity probe from 5 to something higher
14:15:34 <doreilly> ralonsoh: oh, I forgot to mention, I changed OVS to default to 1 sec to make this easier to recreate
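The two knobs named in the RFE map to the inactivity_probe and max_backoff columns of the OVSDB Controller record (both in milliseconds), so a rough illustration of what the proposed options would effectively tune looks like the following; the bridge name and values are examples only:

    # inspect the controller record attached to a bridge
    ovs-vsctl list controller br-tun
    # raise the idle probe interval and the reconnect backoff (values in ms)
    ovs-vsctl set controller br-tun inactivity_probe=30000 max_backoff=8000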
14:15:42 <slaweq> doreilly: in the log which You pointed to in the comment I see that the exception was raised in the clean_stale_flows method
14:15:51 <slaweq> recently we merged liuyulong's patch https://review.openstack.org/#/c/638647/14
14:15:51 <ralonsoh> doreilly, oook, now I understand
14:16:01 <slaweq> would that maybe help with this issue?
14:16:45 <amotoki> I guess this is caused by a high load on OVS and/or the OpenFlow controller, so it can be mitigated by spreading/splitting the OVS load.
14:17:06 <amotoki> 638647 can change the situation.
14:17:13 <ralonsoh> agree
14:17:32 <doreilly> slaweq: maybe. I can retest on fresh master
14:18:00 <slaweq> doreilly: ok, thx
14:18:02 <slaweq> would be nice :)
14:18:24 <mlavalle> the other thing is that since liuyulong is making all these improvements, maybe you two should work together
14:18:35 <mlavalle> ???
14:19:00 <mlavalle> and see if those improvements fix your case
14:19:02 <slaweq> mlavalle: that is a very good point, we should not do the same work twice :)
14:19:21 <amotoki> +1000
14:19:28 <doreilly> mlavalle: sure. I will retest. And ping liuyulong if I need anything
14:19:54 <mlavalle> cool, let's leave it at that on this issue this week
14:20:07 <liuyulong> https://launchpad.net/bugs/1813703 all my work can be found here.
14:20:09 <openstack> Launchpad bug 1813703 in neutron "[L2] [summary] ovs-agent issues at large scale" [High,Fix released] - Assigned to LIU Yulong (dragon889)
14:20:39 <liuyulong> Including the test method: https://bugs.launchpad.net/neutron/+bug/1813703/comments/12
14:20:40 <amotoki> and https://review.openstack.org/#/q/topic:bug/1813703+(status:open+OR+status:merged)
14:21:08 <mlavalle> yeah, I'm sure "ovs-agent issues at large scale" are likely to apply to your case
14:21:41 <mlavalle> doreilly: thanks for continuing to push the ball forward
14:22:05 <doreilly> mlavalle: np thx for helping
14:22:06 <mlavalle> this is good stuff! getting to the nitty gritty
14:22:48 <mlavalle> ok, moving to the next RFE: https://bugs.launchpad.net/neutron/+bug/1817881
14:22:49 <openstack> Launchpad bug 1817881 in neutron " [RFE] L3 IPs monitor/metering via current QoS functionality (tc filters)" [Wishlist,Confirmed]
14:22:58 <mlavalle> speaking of liuyulong
14:24:20 <liuyulong> I will not repeat the description, but let me give you an example. We have implemented this and report L3 IP statistics to Zabbix.
14:25:32 <liuyulong> YAMAMOTO had mentioned the REST API; IMO it's a good idea.
14:25:50 <mlavalle> liuyulong: regarding the exchange you had with yamamoto, you are not proposing API changes now?
14:26:21 <slaweq> liuyulong: what about FIPs without QoS enabled? Will this metering work for them too?
14:26:39 <liuyulong> slaweq, yes, you get the point.
14:27:02 <liuyulong> mlavalle, I will not add the API for now, we can remove the qos policy binding to disable the monitoring.
14:27:46 <liuyulong> mlavalle, but it can be follow-up work as a next step.
14:27:58 <slaweq> liuyulong: I really don't like the idea of enabling/disabling metering by adding/removing a qos policy on a fip - those are 2 different things
14:28:14 <mlavalle> yeah, that was what I originally understood, but then the later comments confused me
14:28:50 <amotoki> totally agree with slaweq
14:29:37 <amotoki> if we would like to control monitoring targets, it is better to specify it via API as the current metering API does. (it is a bit different though)
14:29:50 <slaweq> amotoki: +1000
14:29:56 <mlavalle> so, is the mechanism ok but we would like a different approach from the API point of view?
14:30:04 <liuyulong> so you guys think API is necessary?
14:30:15 <slaweq> and also if we want to have such metering it should be possible to enable it for all fips, not only those with qos
14:30:49 <liuyulong> slaweq, yes, it could be done, but tc currently does not accept '0' as an input value.
14:31:25 <liuyulong> slaweq, if there are no tc filter rules, no statistics can be collected
14:31:33 <slaweq> liuyulong: so we can "work around" this by e.g. configuring tc with some very high limit for each fip
14:31:33 <amotoki> liuyulong: an API is not a MUST, but the next question goes back to why we need to set QoS rules for monitoring. they are different things.
14:31:49 <slaweq> amotoki: exactly :)
14:32:27 <liuyulong> OK, back to the original question. : _
14:32:59 <liuyulong> How do we monitor L3 IPs for now?
14:34:11 <liuyulong> Seems like a lot of work needs to be done: start a new metering agent, add metering rules, and iptables gets larger and larger
14:34:39 <liuyulong> # ip netns exec snat-867e1473-4495-4513-8759-dee4cb1b9cef tc -s -d -p filter show dev qg-91293cf7-64
14:34:39 <liuyulong> filter parent 1: protocol ip pref 1 u32
14:34:39 <liuyulong> filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1
14:34:40 <liuyulong> filter parent 1: protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid :1 not_in_hw (rule hit 180 success 180)
14:34:41 <liuyulong> match IP src 172.16.100.10/32 (success 180 )
14:34:43 <liuyulong> police 0x2 rate 1024Kbit burst 128Kb mtu 64Kb action drop overhead 0b linklayer ethernet
14:34:45 <liuyulong> ref 1 bind 1 installed 86737 sec used 439 sec
14:34:47 <liuyulong> Sent 17640 bytes 180 pkts (dropped 0, overlimits 0)
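The counters liuyulong pastes above come from plain u32 filters with a police action on the qg- device. A sketch of the kind of rule that would produce them, reusing the namespace, device, source IP and rate from the paste (everything else is illustrative, and it assumes a root qdisc with handle 1: already exists on the device); with a deliberately generous rate, as slaweq suggested earlier, such a filter only counts and never actually drops:

    ip netns exec snat-867e1473-4495-4513-8759-dee4cb1b9cef \
        tc filter add dev qg-91293cf7-64 parent 1: protocol ip prio 1 u32 \
        match ip src 172.16.100.10/32 \
        police rate 1024kbit burst 128k drop flowid :1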
14:35:28 <amotoki> liuyulong: can your question be rephrased to "we would like to meter traffic per L3 IP" (as you mentioned in the bug description)
14:36:04 <amotoki> "how" is the next thing. the first thing we need to agree on is what we want.
14:36:37 <liuyulong> ok, make sense
14:36:57 <liuyulong> These statistics can easily be collected by the l3 agent itself.
14:37:32 <liuyulong> No extra agent involved, the L3 agent can totally handle it.
14:38:00 <mlavalle> therefore we see an improvement in the feature's performance, right?
14:38:35 <liuyulong> mlavalle, indeed
14:38:56 <mlavalle> up to this point, this is what I understood originally of this RFE
14:39:21 <slaweq> liuyulong: if we add it to L3 we will duplicate the metering service plugin's functionality in the L3 agent, right?
14:39:22 <mlavalle> and the proposal is to use tc as the mechanism in an L3 extension, right?
14:39:59 <slaweq> can't we e.g. add a "tc driver" to the metering service plugin?
14:40:03 <liuyulong> Yes, just read the tc filter rules' statistics.
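If the collection ends up living in the L3 agent, the raw data is just what tc already exposes in the router namespace; a minimal sketch of reading it in machine-readable form (the -j JSON flag needs a reasonably recent iproute2, and the field layout should be checked against the installed version; names are taken from liuyulong's earlier paste):

    ip netns exec snat-867e1473-4495-4513-8759-dee4cb1b9cef \
        tc -s -j filter show dev qg-91293cf7-64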
14:41:50 <mlavalle> like we did with the firewall. we went from iptables to ovs / openflow
14:43:15 <amotoki> I like 'tc driver' idea too
14:43:47 <liuyulong> OK, I think I need to add the API for this.
14:44:08 <amotoki> metering has two portions: the one is where we collect stats and the other is where we report stats.
14:44:31 <amotoki> for the first one, we discussed iptables or tc stat.
14:44:48 <slaweq> and liuyulong wants to change first part (how to collect stats)
14:44:50 <amotoki> for the latter, the current metering plugin sends notifications to ceilometer.
14:45:01 <amotoki> we can improve the latter if we want too.
14:45:13 <mlavalle> yeap
14:45:38 <liuyulong> Then we need to start metering agent again?
14:46:05 <slaweq> liuyulong: what is wrong with starting agent?
14:46:15 <amotoki> metering agent is one choice. the other way is to handle metering in l3-agent.
14:47:21 <amotoki> I am not sure which one is better and/or easier right now.
14:48:33 <mlavalle> overall, this doesn't seem to be a burning issue in the community (although it might be hot at liuyulong's employer). so I would like to shoot for better
14:48:33 <slaweq> amotoki: but IMO we should finally have only one way to do it. There is no point in duplicating the metering-agent's functionality in the L3 agent as well and keeping both of them in the code
14:48:58 <mlavalle> yeah, but we can gradually transition
14:49:00 <amotoki> slaweq: makes sense
14:49:25 <mlavalle> again, the firewall comes to mind
14:49:28 <slaweq> mlavalle: I agree that it can be done smoothly, just want to say that we should keep it in mind probably
14:49:42 <mlavalle> we are saying the same thing
14:49:45 <mlavalle> ++
14:49:56 <slaweq> :)
14:50:00 <liuyulong> yes, makes sense, I can try that
14:50:58 <amotoki> two agents can potentially touch the same thing and it can be a problem, but I don't know whether it is a problem or not right now, so I commented as such above.
14:51:08 <liuyulong> amotoki, how to report the data is not the big problem, it is well handled; both UDP and notifications can be used.
14:51:33 <amotoki> liuyulong: yeah, I am not worried about it either :)
14:51:45 <liuyulong> Yes, I once had the same idea for that driver approach, but I find it a bit difficult to handle router updates, floating IP updates and removals, and
14:52:14 <liuyulong> L3 agent restarts, tc rule removal, tc rule updates etc.
14:53:10 <liuyulong> All these things are handled by the l3 agent itself, but another agent would have to rely on that.
14:53:33 <amotoki> liuyulong explains what I think in more detail :)
14:54:41 <liuyulong> even more, HA router failover
14:55:19 <liuyulong> And for DVR local host floating IP monitoring, we would need to start metering agents on all compute hosts too.
14:55:45 <liuyulong> Message queue may not be happy, IMO.
14:57:42 <mlavalle> and all this is avoided with an L3 extension, right?
14:58:49 <liuyulong> mlavalle, our local implementation is a wrapper for l3 agent.
14:59:08 <liuyulong> We called that l3_agent_with_metering.
14:59:30 <mlavalle> I don't see a problem with the l3 agent extension approach. let's refine the RFE with the API suggestions
14:59:39 <mlavalle> and we'll revisit it next week
14:59:48 <mlavalle> is that a good summary of next steps?
14:59:56 * mlavalle looks at the clock
15:00:01 <amotoki> +1
15:00:04 <njohnston_> mlavalle: +1
15:00:07 <liuyulong> OK
15:00:08 <slaweq> I agree that it can be done as another extension to the L3 agent, in the same way as qos is done for example
15:00:18 <mlavalle> cool
15:00:24 <mlavalle> #endmeeting