13:59:32 <liuyulong> #startmeeting neutron_l3
13:59:33 <openstack> Meeting started Wed Jun 26 13:59:32 2019 UTC and is due to finish in 60 minutes.  The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:59:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:59:37 <openstack> The meeting name has been set to 'neutron_l3'
13:59:58 <mlavalle> o/
14:00:07 <liuyulong> #chair mlavalle
14:00:08 <openstack> Current chairs: liuyulong mlavalle
14:00:21 <ralonsoh> hi
14:00:34 <liuyulong> Let's wait a minute for more attendees. We will officially start the topics in 60 seconds. : )
14:00:40 <liuyulong> hi
14:01:06 <mlavalle> nice!
14:01:43 <wwriverrat> o/
14:01:52 <liuyulong> #topic Announcements
14:02:06 <liuyulong> Important things were highlighted in the neutron team meeting yesterday, so I will not repeat them here.
14:02:45 <liuyulong> So no announcements from me. Does anyone have any others?
14:03:37 <slaweq> hi
14:03:41 <mlavalle> none from me
14:03:48 <liuyulong> slaweq, hi
14:03:58 <liuyulong> OK, let's move on.
14:04:03 <liuyulong> #topic Bugs
14:04:13 <liuyulong> yamamoto was our bug deputy last week.
14:04:19 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007290.html
14:04:42 <liuyulong> Looks like a busy week, : )
14:04:44 <njohnston> o/
14:04:58 <mlavalle> yeah, it was kind of busy
14:05:16 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1833717
14:05:17 <openstack> Launchpad bug 1833717 in neutron "Functional tests: error during namespace creation" [High,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:05:26 <liuyulong> The fix: https://review.opendev.org/#/c/666845/. It's done now.
14:05:34 <mlavalle> thanks ralonsoh
14:05:41 <ralonsoh> no problem
14:06:56 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834257
14:06:57 <openstack> Launchpad bug 1834257 in neutron "dhcp-agent can overwhelm neutron server with dhcp_ready_on_ports RPC calls" [High,In progress] - Assigned to Sebastian Lohff (sebageek)
14:07:19 <liuyulong> A WIP fix is here: https://review.opendev.org/#/c/667472/.  And it's based on this: https://review.opendev.org/#/c/659274/.
14:07:47 <liuyulong> The code basically looks good to me and it makes sense, but my concern is how this will influence the port DHCP provisioning block during the instance boot procedure.
14:08:28 <mlavalle> are you referring to the first or the second patch?
14:08:31 <liuyulong> Say I have a port list split into chunks like [0, 63] [64, 127] ... [...]. Since dhcp_ready_on_ports is an RPC `call` (wait and block until return) method, the later sets of DHCP-ready ports will wait until the former ones have completed.
14:09:16 <liuyulong> So the longer a port stays in the waiting queue, the more likely the VM fails to start due to the provisioning block timeout.
14:09:32 <liuyulong> mlavalle, the base https://review.opendev.org/#/c/659274/.
14:11:07 <liuyulong> any comments from that perspective?
14:11:37 <mlavalle> I see your concern but I also see that others have other opinions
14:11:45 <mlavalle> like slaweq
14:11:48 <liuyulong> Or am I just over-thinking?
14:12:10 <mlavalle> I'll go over the comments later today and add my 2 cents
14:12:33 <liuyulong> mlavalle, cool
14:12:34 <slaweq> liuyulong: basically You're right IMO
14:12:35 <slaweq> but
14:13:20 <slaweq> in the current approach, if You have e.g. 127 ports they will be sent in one rpc message, and then a new port will wait a long time until it is sent
14:13:53 <mlavalle> processed, you meant, right?
14:14:17 <slaweq> and IMHO if You send those messages in smaller chunks, they will be spread between many rpc workers on the server side, so overall it shouldn't be a big problem
14:14:23 <slaweq> mlavalle: right
14:14:48 <slaweq> and also, the issue here is mostly during full sync (e.g. after an agent restart)
14:15:18 <ralonsoh> also we should not forget the main problem addressed in the patch: to solve the RPC timeouts when sending big sets of ports
14:15:19 <mlavalle> in normal processing it shouldn't be a big issue
14:15:21 <slaweq> in such a case IMHO it's better to process everything without timeouts and retries and then start "normal" work
14:15:24 <ralonsoh> and this problem is solved there
14:15:50 <liuyulong> that _dhcp_ready_ports_loop runs in a single thread, no?
14:15:59 <slaweq> during the agent's normal work this shouldn't be a problem, as usually there are not so many ports to send at once
14:16:10 <slaweq> liuyulong: yes, it's a single eventlet worker now
14:17:58 <liuyulong> so, the neutron-server will always process these ports in one worker for each such RPC call from one DHCP agent.
14:18:19 <liuyulong> no matter the chunk size
14:18:39 <liuyulong> So why is it slow on the neutron-server side?
14:19:19 <liuyulong> This should be the root cause IMO. I will raise an L3 DB slow query bug later.
14:19:33 <liuyulong> maybe something similar
14:20:26 <slaweq> liuyulong: it is slow because it iterates over all those ports and tries to remove the provisioning block for each port
14:20:37 <slaweq> and yes, it is slow
14:20:52 <slaweq> we should think about optimizing it later but IMHO patch for dhcp agent is good
14:23:04 <liuyulong> Yes, the code is fine to me. Actually, I do not think nova or neutron will have to handle 64+ VMs booting concurrently on a single node very often.
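
[For reference, a minimal sketch of the chunking approach slaweq describes above, assuming a hypothetical chunk size and a plugin_rpc proxy exposing the existing dhcp_ready_on_ports() call; the real change is in https://review.opendev.org/#/c/659274/ and may differ.]

    # Illustrative sketch only; MAX_PORTS_PER_CALL and plugin_rpc are
    # placeholders, not the actual dhcp-agent attributes.
    MAX_PORTS_PER_CALL = 64  # hypothetical chunk size


    def notify_ready_ports(plugin_rpc, ready_ports):
        """Send DHCP-ready ports to the server in small chunks.

        Each chunk is a separate blocking RPC call, so large port sets no
        longer hit the RPC timeout and the load can be spread over several
        RPC workers on the server side.  The trade-off liuyulong points out
        still holds: later chunks wait for earlier ones to complete.
        """
        ports = list(ready_ports)
        for i in range(0, len(ports), MAX_PORTS_PER_CALL):
            chunk = ports[i:i + MAX_PORTS_PER_CALL]
            # dhcp_ready_on_ports() is an RPC 'call', i.e. it blocks until
            # the server has processed this chunk.
            plugin_rpc.dhcp_ready_on_ports(chunk)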
14:23:24 <mlavalle> so, let's move on then and let it go forward
14:23:35 <liuyulong> Next
14:23:36 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1833653
14:23:38 <openstack> Launchpad bug 1833653 in neutron "We should cleanup ipv4 address if keepalived is dead" [Medium,In progress] - Assigned to Yang Li (yang-li)
14:23:50 <liuyulong> I'm not quite sure if this is a problem that occurs spontaneously in the system.
14:24:06 <liuyulong> But according to the comments in the fix: https://review.opendev.org/#/c/667071/, Yang Li says `restart the l3-agent too many time`, so it looks like an artificial extreme scenario.
14:26:03 <liuyulong> IMO, maybe we should add a FLAG for a successful agent restart, and then remind the user in the DOC not to trigger a restart repeatedly when no such flag is raised, or to wait some time and then restart again. But I'm still not quite sure why one would restart again and again?
14:27:02 <liuyulong> But as he/she said in the gerrit, https://bugs.launchpad.net/neutron/+bug/1602320
14:27:04 <openstack> Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Medium,Fix released] - Assigned to He Qing (tsinghe-7)
14:27:14 <liuyulong> maybe we should take a look at this bug again.
14:27:31 <liuyulong> Next
14:27:32 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1732067
14:27:34 <openstack> Launchpad bug 1732067 in neutron "openvswitch firewall flows cause flooding on integration bridge" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:27:52 <liuyulong> This is a L2 bug IMO
14:28:00 <liuyulong> We already have a patch here: https://review.opendev.org/#/c/639009/, but it breaks the dvr east-west traffic.
14:28:21 <liuyulong> So, I raised it here again.
14:28:55 <liuyulong> The fix may also face the same issue I mentioned last week.
14:29:15 <liuyulong> Traffic in every L3 scenario will be affected; we cannot test all of those scenarios in upstream neutron CI, or test them all manually.
14:29:55 * haleyb wanders in late
14:30:02 <liuyulong> So a minimum change is required.
14:30:56 <liuyulong> As a consequence, based on the openflow firewall design, I have an alternative fix: https://review.opendev.org/#/c/666991/. When the bridge tries to flood the packets, we use the dest MAC to direct the traffic to the right OF port, since neutron has full knowledge of every port.
14:31:41 <ralonsoh> we should use the MAC and the VLAN tag
14:32:14 <ralonsoh> we can have the same mac in multiple VLANs
14:32:46 <liuyulong> ralonsoh, good to know, I will refactor the patch for this point.
14:33:19 <mlavalle> thanks for proposing it
14:33:19 <mlavalle> let's review it
14:33:52 <liuyulong> So, from that perspective, we do not touch too many tables of the OF firewall.
14:34:20 <liuyulong> Only one table now, and only for accepted egress traffic.
14:35:42 <liuyulong> I tested DVR, legacy, dvr+ha, dvr_no_external, east-west and floating IPs; all work fine now, with no flooding in the bridge.
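
[For reference, a rough illustration of the idea behind https://review.opendev.org/#/c/666991/ combined with ralonsoh's VLAN suggestion: instead of flooding, install one flow per known local port keyed on (local VLAN, destination MAC). The table number, priority, strip_vlan action and helper name below are assumptions for this example, not the actual patch.]

    # Illustrative only; the real fix modifies the OVS firewall driver's
    # accepted-egress handling rather than shelling out to ovs-ofctl.
    import subprocess

    ACCEPTED_EGRESS_TABLE = 94  # hypothetical table id


    def direct_flow_for_port(bridge, local_vlan, mac, ofport):
        """Send traffic for a known (VLAN, MAC) pair straight to its OF port."""
        flow = ("table=%(table)d,priority=12,dl_vlan=%(vlan)d,dl_dst=%(mac)s,"
                "actions=strip_vlan,output:%(ofport)d" % {
                    'table': ACCEPTED_EGRESS_TABLE,
                    'vlan': local_vlan,
                    'mac': mac,
                    'ofport': ofport,
                })
        subprocess.check_call(['ovs-ofctl', 'add-flow', bridge, flow])


    # e.g. a port with MAC fa:16:3e:aa:bb:cc on local VLAN 5, plugged as ofport 42:
    # direct_flow_for_port('br-int', 5, 'fa:16:3e:aa:bb:cc', 42)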
14:36:17 <liuyulong> OK, last one from me.
14:36:22 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834308
14:36:23 <openstack> Launchpad bug 1834308 in neutron "[DVR][DB] too many slow query during agent restart" [Medium,Confirmed]
14:36:38 <liuyulong> Recently, during our local testing, we met this issue.
14:36:59 <liuyulong> The restart time is too long. With a large concurrency number, the situation is even worse.
14:37:32 <liuyulong> The database CPU utilization is almost 100% for every used core.
14:38:00 <liuyulong> And we counted the slow query logs; some of them are triggered 200K+ times.
14:38:52 <ralonsoh> did you try to profile those DB queries?
14:39:00 <liuyulong> Have you guys met this issue locally?
14:39:11 <liuyulong> ralonsoh, yes, I have a fix locally, it basically works.
14:39:26 <ralonsoh> ok
14:39:30 <haleyb> the bug is very generic, it seems as if there might be more than one issue here to fix?
14:40:25 <ralonsoh> btw, I have one more bug
14:40:32 <liuyulong> We tested restarting the dhcp, metadata and l3 agents on 80 nodes concurrently.
14:41:12 <liuyulong> There were no 0.5s+ slow DB queries during such a restart.
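
[As an aside, one generic way to find such slow queries on the neutron-server side (not necessarily what was done here) is to time every statement with SQLAlchemy cursor events and log anything over a threshold, e.g. the 0.5 s mentioned above; a minimal sketch follows.]

    import logging
    import time

    from sqlalchemy import event
    from sqlalchemy.engine import Engine

    LOG = logging.getLogger(__name__)
    SLOW_QUERY_THRESHOLD = 0.5  # seconds; matches the threshold mentioned above


    @event.listens_for(Engine, "before_cursor_execute")
    def _start_timer(conn, cursor, statement, parameters, context, executemany):
        conn.info.setdefault('query_start_time', []).append(time.time())


    @event.listens_for(Engine, "after_cursor_execute")
    def _log_slow_query(conn, cursor, statement, parameters, context, executemany):
        total = time.time() - conn.info['query_start_time'].pop()
        if total > SLOW_QUERY_THRESHOLD:
            LOG.warning("Slow query (%.2fs): %s", total, statement)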
14:41:27 <liuyulong> OK
14:41:33 <liuyulong> ralonsoh, please, go ahead
14:41:37 <ralonsoh> thanks
14:41:40 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1732458
14:41:41 <openstack> Launchpad bug 1732458 in neutron "deleted_ports memory leak in dhcp agent" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:41:56 <ralonsoh> patch: #link https://review.opendev.org/#/c/521035/
14:42:05 <ralonsoh> a bit old
14:42:32 <ralonsoh> I talked to the submitter and he agreed to try another strategy
14:42:44 <ralonsoh> so I submitted PS6
14:43:18 <ralonsoh> I'm using FixedIntervalLoopingCall to clean up the deleted_ports variable
14:43:27 <ralonsoh> that's all!
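
[For reference, a minimal, self-contained sketch of that strategy using oslo.service's FixedIntervalLoopingCall; the cache layout, interval and retention values are assumptions for illustration, not the values in https://review.opendev.org/#/c/521035/.]

    import time

    from oslo_service import loopingcall

    CLEANUP_INTERVAL = 60   # seconds between cleanup runs (assumed)
    RETENTION = 60          # how long a deleted port id is remembered (assumed)

    # port_id -> timestamp of when the deletion was seen
    deleted_ports = {}


    def _cleanup_deleted_ports():
        """Drop entries older than RETENTION so the cache cannot grow forever."""
        cutoff = time.time() - RETENTION
        for port_id, seen_at in list(deleted_ports.items()):
            if seen_at < cutoff:
                del deleted_ports[port_id]


    # Start the periodic task; oslo.service runs _cleanup_deleted_ports
    # every CLEANUP_INTERVAL seconds in a green thread.
    cleanup_loop = loopingcall.FixedIntervalLoopingCall(_cleanup_deleted_ports)
    cleanup_loop.start(interval=CLEANUP_INTERVAL)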
14:44:02 <mlavalle> the description seems too broad. Are you planning to add finer detail to the bug? I know we have slow DB queries....
14:44:02 <mlavalle> or how do you propose to move forward with it?
14:44:23 <liuyulong> ralonsoh, thanks for bringing this up.
14:44:37 <liuyulong> mlavalle, as I added [DVR] to the bug title.
14:45:25 <mlavalle> ok
14:45:26 <liuyulong> mlavalle, most of them are basically related to DVR DB queries.
14:46:49 <liuyulong> When the L2 agent (ovs-agent) is restarted concurrently on 50-80 nodes, the openvswitch_dvr_agent will also do some DVR-related DB queries.
14:47:36 <mlavalle> let's add these details to the bug filing. it will be easier to understand and also to review the fix when it is proposed
14:47:53 <liuyulong> Some of them are really time consuming, and some of them will scan the ports table entirely.
14:48:22 <liuyulong> That's all the bugs from me. So, are there any other bugs that need the team's attention?
14:48:24 <haleyb> mlavalle: +1 to that, i'd rather have more bugs that are specific than just one
14:49:04 <liuyulong> Let's move on.
14:49:07 <liuyulong> #topic Routed Networks
14:49:30 <liuyulong> mlavalle, tidwellr, wwriverrat: your turn now.
14:49:58 <mlavalle> on my part I have the patch deployed in my setup
14:50:05 <mlavalle> I have to conduct testing
14:50:11 <mlavalle> this is OVS centered
14:50:26 <mlavalle> I got a bit sidetracked over the past few days
14:50:35 <mlavalle> but I am coming back to it
14:50:50 <mlavalle> I am talking about multiple segments per host
14:51:08 <wwriverrat> As for multi-segment work, I've also been sidetracked with my evil job ;-).
14:51:25 <mlavalle> wwriverrat: oh, you have one of those too?
14:51:28 <wwriverrat> I have been digesting all of the comments in both the spec as well as the WIP
14:52:10 <wwriverrat> It looks like there may be new code that also raises an exception for a network with more than one segment
14:52:25 <wwriverrat> trying to figure out how to get around that
14:52:57 <wwriverrat> https://review.opendev.org/#/c/633165/20/neutron/plugins/ml2/plugin.py
14:53:17 <wwriverrat> Not sure if routed networks will pass through that code or not
14:54:01 <wwriverrat> Longer story shortened: I've been freed up a little to get more work done here.
14:54:12 <mlavalle> I'll test that part
14:54:17 <mlavalle> in my setup
14:55:43 <wwriverrat> I'm hoping to have something around the WIP updated ideally by Monday. An update to the spec hopefully before end of today.
14:56:27 <mlavalle> thanks!
14:56:54 <liuyulong> Routed networks are a little new to me, and we haven't used them in our environment locally. But they're worth a try if needed. : )
14:57:43 <liuyulong> We are running out of time.
14:58:09 <liuyulong> #topic On demand agenda
14:58:14 <liuyulong> Anything else to discuss today?
14:58:50 <liuyulong> OK, happy coding, see you guys online.
14:58:53 <mlavalle> not from me
14:58:53 <liuyulong> #endmeeting