13:59:32 <liuyulong> #startmeeting neutron_l3
13:59:33 <openstack> Meeting started Wed Jun 26 13:59:32 2019 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:59:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:59:37 <openstack> The meeting name has been set to 'neutron_l3'
13:59:58 <mlavalle> o/
14:00:07 <liuyulong> #chair mlavalle
14:00:08 <openstack> Current chairs: liuyulong mlavalle
14:00:21 <ralonsoh> hi
14:00:34 <liuyulong> Let's wait a minute for more attendees. We will officially start the topics in 60 seconds. : )
14:00:40 <liuyulong> hi
14:01:06 <mlavalle> nice!
14:01:43 <wwriverrat> o/
14:01:52 <liuyulong> #topic Announcements
14:02:06 <liuyulong> Important things were highlighted in the neutron team meeting yesterday, so I will not repeat them.
14:02:45 <liuyulong> So no announcements from me now. Does anyone have any others?
14:03:37 <slaweq> hi
14:03:41 <mlavalle> none from me
14:03:48 <liuyulong> slaweq, hi
14:03:58 <liuyulong> OK, let's move on.
14:04:03 <liuyulong> #topic Bugs
14:04:13 <liuyulong> yamamoto was our bug deputy last week.
14:04:19 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007290.html
14:04:42 <liuyulong> Looks like a busy week, : )
14:04:44 <njohnston> o/
14:04:58 <mlavalle> yeah, it was kind of busy
14:05:16 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1833717
14:05:17 <openstack> Launchpad bug 1833717 in neutron "Functional tests: error during namespace creation" [High,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:05:26 <liuyulong> The fix: https://review.opendev.org/#/c/666845/. It's done now.
14:05:34 <mlavalle> thanks ralonsoh
14:05:41 <ralonsoh> no problem
14:06:56 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834257
14:06:57 <openstack> Launchpad bug 1834257 in neutron "dhcp-agent can overwhelm neutron server with dhcp_ready_on_ports RPC calls" [High,In progress] - Assigned to Sebastian Lohff (sebageek)
14:07:19 <liuyulong> A WIP fix is here: https://review.opendev.org/#/c/667472/. And it's based on this: https://review.opendev.org/#/c/659274/.
14:07:47 <liuyulong> The code basically looks good to me, it makes sense, but my concern is how this will influence the port DHCP provisioning block during the instance boot procedure.
14:08:28 <mlavalle> are you referring to the first or second patch?
14:08:31 <liuyulong> If I have some port lists like this: [0, 63] [64, 127] ... [...]. dhcp_ready_on_ports is an RPC `call` (wait and block until return) method, so the following sets of DHCP ready ports will wait until the former ones have completed.
14:09:16 <liuyulong> So the longer a port stays in the waiting queue, the more likely the VM fails to start due to the provisioning block timeout.
14:09:32 <liuyulong> mlavalle, the base https://review.opendev.org/#/c/659274/.
14:11:07 <liuyulong> any comments from this perspective?
14:11:37 <mlavalle> I see your concern but I also see that others have other opinions
14:11:45 <mlavalle> like slaweq
14:11:48 <liuyulong> Or am I just over-thinking?
14:12:10 <mlavalle> I'll go over the comments later today and add my 2 cents
14:12:33 <liuyulong> mlavalle, cool
14:12:34 <slaweq> liuyulong: basically You're right IMO
14:12:35 <slaweq> but
14:13:20 <slaweq> in the current approach if You have e.g. 127 ports they will be sent in one rpc message and then a new port will wait a long time until it is sent
14:13:53 <mlavalle> processed, you meant, right?
14:14:17 <slaweq> and IMHO if You send those messages in smaller chunks, they will be spread between many rpc workers on the server side, so overall it shouldn't be a big problem
14:14:23 <slaweq> mlavalle: right
14:14:48 <slaweq> and also, the issue here is mostly during full sync (e.g. after an agent restart)
14:15:18 <ralonsoh> also we should not forget the main problem addressed in the patch: to solve the RPC timeouts when sending big sets of ports
14:15:19 <mlavalle> in normal processing it shouldn't be a big issue
14:15:21 <slaweq> in such a case IMHO it's better to process everything without timeouts and repeats and then start "normal" work
14:15:24 <ralonsoh> and this problem is solved there
14:15:50 <liuyulong> _dhcp_ready_ports_loop runs in a single thread, doesn't it?
14:15:59 <slaweq> during the agent's normal work, this shouldn't be a problem as usually there are not so many ports to send at once
14:16:10 <slaweq> liuyulong: yes, it's a single eventlet worker now
14:17:58 <liuyulong> so, the neutron-server will always process these ports in one worker for one such RPC call of one DHCP agent.
14:18:19 <liuyulong> no matter the chunk size
14:18:39 <liuyulong> So why is it slow on the neutron-server side?
14:19:19 <liuyulong> This should be the root cause IMO. I will raise a L3 DB slow query bug later.
14:19:33 <liuyulong> maybe something similar
14:20:26 <slaweq> liuyulong: it is slow because it iterates over all those ports and tries to remove the provisioning block for each port
14:20:37 <slaweq> and yes, it is slow
14:20:52 <slaweq> we should think about optimizing it later but IMHO the patch for the dhcp agent is good
14:23:04 <liuyulong> Yes, the code is fine to me. Actually I do not think nova or neutron will often have to handle 64+ VMs booting concurrently on one single node.
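Note (illustrative, not part of the log): the batching discussed for bug 1834257 can be sketched roughly as below. This is a minimal sketch, not the code under review. The batch size of 64 is an assumption taken from the [0, 63] [64, 127] example above, and `chunk_ports`, `report_ready_ports` and `send_batch` are hypothetical names, not neutron APIs. `send_batch` stands in for the blocking RPC `call`.

```python
from itertools import islice


def chunk_ports(port_ids, batch_size=64):
    """Yield successive lists of at most batch_size port IDs."""
    iterator = iter(port_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def report_ready_ports(port_ids, send_batch, batch_size=64):
    """Send ready ports in batches; return the IDs whose batch failed.

    send_batch is a hypothetical callable standing in for the blocking
    RPC call. Batches are sent one after another, so a later batch waits
    for every earlier one, which is the concern raised in the meeting.
    """
    failed = []
    for batch in chunk_ports(port_ids, batch_size):
        try:
            send_batch(batch)
        except Exception:
            # Keep the failed IDs so a caller could retry them later.
            failed.extend(batch)
    return failed
```

With 130 ports and the assumed batch size, this produces three calls of 64, 64 and 2 ports. Smaller batches keep each call short enough to avoid the RPC timeout, at the cost of the last batch waiting behind the earlier ones.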
14:23:24 <mlavalle> so, let's move on then and let it go forward
14:23:35 <liuyulong> Next
14:23:36 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1833653
14:23:38 <openstack> Launchpad bug 1833653 in neutron "We should cleanup ipv4 address if keepalived is dead" [Medium,In progress] - Assigned to Yang Li (yang-li)
14:23:50 <liuyulong> I'm not quite sure whether this problem occurs spontaneously in the system.
14:24:06 <liuyulong> But according to the comments in the fix: https://review.opendev.org/#/c/667071/. Yang Li says `restart the l3-agent too many time`, so it looks like an artificial extreme scenario.
14:26:03 <liuyulong> IMO, maybe we should add a FLAG for a successful agent restart, and then remind the user in the DOC not to trigger a restart many times while no such flag is raised, or to wait some time and then restart again. But I'm still not quite sure why anyone would restart again and again.
14:27:02 <liuyulong> But as he/she said in the gerrit, https://bugs.launchpad.net/neutron/+bug/1602320
14:27:04 <openstack> Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Medium,Fix released] - Assigned to He Qing (tsinghe-7)
14:27:14 <liuyulong> maybe we should take a look at this bug again.
14:27:31 <liuyulong> Next
14:27:32 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1732067
14:27:34 <openstack> Launchpad bug 1732067 in neutron "openvswitch firewall flows cause flooding on integration bridge" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:27:52 <liuyulong> This is an L2 bug IMO
14:28:00 <liuyulong> We already have a patch here: https://review.opendev.org/#/c/639009/, but it breaks the dvr east-west traffic.
14:28:21 <liuyulong> So, I raised it here again.
14:28:55 <liuyulong> The fix may also face the same issue I mentioned last week.
14:29:15 <liuyulong> Every L3 traffic scenario will be affected, and we cannot test all the scenarios in upstream neutron CI or manually.
14:29:55 * haleyb wanders in late
14:30:02 <liuyulong> So a minimum change is required.
14:30:56 <liuyulong> As a consequence, based on the openflow firewall design, I have an alternative fix: https://review.opendev.org/#/c/666991/. When the bridge tries to flood the packets, we use the dest MAC to direct the traffic to the right OF port, since neutron has full knowledge of every port.
14:31:41 <ralonsoh> we should use the MAC and the VLAN tag
14:32:14 <ralonsoh> we can have the same mac in multiple VLANs
14:32:46 <liuyulong> ralonsoh, good to know, I will refactor the patch for this point.
14:33:19 <mlavalle> thanks for proposing it
14:33:19 <mlavalle> let's review it
14:33:52 <liuyulong> So, from this perspective, we do not touch too many tables of the OF firewall.
14:34:20 <liuyulong> Only one table now, and only for accepted egress traffic.
14:35:42 <liuyulong> I tested DVR, legacy, dvr+ha, dvr_no_external, east-west, floating IPs; all work fine now, no flood in the bridge.
14:36:17 <liuyulong> OK, last one from me.
14:36:22 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834308
14:36:23 <openstack> Launchpad bug 1834308 in neutron "[DVR][DB] too many slow query during agent restart" [Medium,Confirmed]
14:36:38 <liuyulong> Recently, during our local testing we met such an issue.
14:36:59 <liuyulong> The restart time is too long. With a large concurrency number, the situation is even worse.
14:37:32 <liuyulong> The database CPU utilization is almost 100% for every used core.
14:38:00 <liuyulong> And we counted the slow query logs; some of them are triggered 200K+ times.
14:38:52 <ralonsoh> did you try to profile those DB queries?
14:39:00 <liuyulong> Have you guys met such an issue locally?
14:39:11 <liuyulong> ralonsoh, yes, I have a fix locally, it basically works.
14:39:26 <ralonsoh> ok
14:39:30 <haleyb> the bug is very generic, it seems as if there might be more than one issue here to fix?
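Note (illustrative, not part of the log): ralonsoh's point at 14:31-14:32, that the same MAC can exist in multiple VLANs, can be sketched as a lookup keyed on both values. This is a plain-Python sketch of the idea only, not OpenFlow rules and not the proposed patch; `build_forwarding_table` and `lookup_ofport` are hypothetical names.

```python
def build_forwarding_table(ports):
    """Map (vlan_tag, mac) to an OpenFlow port number.

    ports is an iterable of (vlan_tag, mac, ofport) tuples. Keying on the
    MAC alone would let one entry overwrite another when the same MAC is
    present in two VLANs.
    """
    table = {}
    for vlan_tag, mac, ofport in ports:
        table[(vlan_tag, mac.lower())] = ofport
    return table


def lookup_ofport(table, vlan_tag, dest_mac):
    """Return the OF port for a destination, or None if it is unknown."""
    return table.get((vlan_tag, dest_mac.lower()))


# Two ports share a MAC but live in different VLANs.
table = build_forwarding_table([
    (1, "fa:16:3e:00:00:01", 10),
    (2, "fa:16:3e:00:00:01", 20),
])
```

Here `lookup_ofport(table, 1, "FA:16:3E:00:00:01")` and `lookup_ofport(table, 2, "fa:16:3e:00:00:01")` resolve to different ports, while an unknown destination returns `None` and would fall back to whatever the bridge does today.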
14:40:25 <ralonsoh> btw, I have one more bug
14:40:32 <liuyulong> We tested restarting the dhcp, metadata and l3 agents on 80 nodes concurrently.
14:41:12 <liuyulong> No 0.5s+ slow DB queries during such a restart.
14:41:27 <liuyulong> OK
14:41:33 <liuyulong> ralonsoh, please, go ahead
14:41:37 <ralonsoh> thanks
14:41:40 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1732458
14:41:41 <openstack> Launchpad bug 1732458 in neutron "deleted_ports memory leak in dhcp agent" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:41:56 <ralonsoh> patch: #link https://review.opendev.org/#/c/521035/
14:42:05 <ralonsoh> a bit old
14:42:32 <ralonsoh> I talked to the submitter and he agreed to have another strategy
14:42:44 <ralonsoh> so I submitted PS6
14:43:18 <ralonsoh> I'm using FixedIntervalLoopingCall to clean up the deleted_ports variable
14:43:27 <ralonsoh> that's all!
14:44:02 <mlavalle> the description seems too broad. Are you planning to add finer detail to the bug? I know we have slow DB queries....
14:44:02 <mlavalle> or how do you propose to move forward with it?
14:44:23 <liuyulong> ralonsoh, thanks for bringing this up.
14:44:37 <liuyulong> mlavalle, as I added [DVR] in the bug title.
14:45:25 <mlavalle> ok
14:45:26 <liuyulong> mlavalle, most of them are basically related to DVR DB queries.
14:46:49 <liuyulong> When restarting the L2 agent (ovs-agent) concurrently on 50-80 nodes, the openvswitch_dvr_agent will also do some DVR related DB queries.
14:47:36 <mlavalle> let's add these details to the bug filing. it will be easier to understand and also to review the fix when it is proposed
14:47:53 <liuyulong> Some of them are really time consuming, and some of them will scan the ports table entirely.
14:48:22 <liuyulong> That's all the bugs from me. So, are there any other bugs that need the team's attention?
14:48:24 <haleyb> mlavalle: +1 to that, i'd rather have more bugs that are specific than just one
14:49:04 <liuyulong> Let's move on.
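Note (illustrative, not part of the log): the periodic cleanup ralonsoh describes for bug 1732458 could look roughly like the sketch below. It is an assumption-laden sketch, not the content of PS6: `DeletedPortsCache`, `max_age` and `prune` are hypothetical names, and the age-based expiry rule is assumed. In the agent, a periodic task such as oslo.service's `FixedIntervalLoopingCall` would be what invokes `prune`; here it is called directly so the example needs only the standard library.

```python
import time


class DeletedPortsCache:
    """Remember recently deleted port IDs and forget them after max_age."""

    def __init__(self, max_age=86400.0, clock=time.monotonic):
        self._max_age = max_age
        self._clock = clock  # injectable so the expiry can be tested
        self._deleted_at = {}  # port_id -> time the deletion was recorded

    def add(self, port_id):
        self._deleted_at[port_id] = self._clock()

    def __contains__(self, port_id):
        return port_id in self._deleted_at

    def __len__(self):
        return len(self._deleted_at)

    def prune(self):
        """Drop entries older than max_age; meant to run periodically.

        Returns the number of entries removed.
        """
        cutoff = self._clock() - self._max_age
        expired = [p for p, t in self._deleted_at.items() if t <= cutoff]
        for port_id in expired:
            del self._deleted_at[port_id]
        return len(expired)
```

Bounding the lifetime of each entry is what stops the set from growing without limit; the trade-off is that a very late event for an already forgotten port would no longer be recognised as stale.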
14:49:07 <liuyulong> #topic Routed Networks
14:49:30 <liuyulong> mlavalle, tidwellr, wwriverrat: your turn now.
14:49:58 <mlavalle> on my part I have the patch deployed in my setup
14:50:05 <mlavalle> I have to conduct testing
14:50:11 <mlavalle> this is OVS centered
14:50:26 <mlavalle> I got a bit sidetracked over the past few days
14:50:35 <mlavalle> but I am coming back to it
14:50:50 <mlavalle> I am talking about multiple segments per host
14:51:08 <wwriverrat> As for multi-segment work, I've also been sidetracked with my evil job ;-).
14:51:25 <mlavalle> wwriverrat: oh, you have one of those too?
14:51:28 <wwriverrat> I have been digesting all of the comments in both the spec as well as the WIP
14:52:10 <wwriverrat> It looks like there may be new code that also raises an exception for a network with more than one segment
14:52:25 <wwriverrat> trying to figure out how to get around that
14:52:57 <wwriverrat> https://review.opendev.org/#/c/633165/20/neutron/plugins/ml2/plugin.py
14:53:17 <wwriverrat> Not sure if routed networks will pass through that code or not
14:54:01 <wwriverrat> Longer story shortened: I've been freed up a little to get more work done here.
14:54:12 <mlavalle> I'll test that part
14:54:17 <mlavalle> in my setup
14:55:43 <wwriverrat> I'm hoping to have something around the WIP updated ideally by Monday. An update to the spec hopefully before end of today.
14:56:27 <mlavalle> thanks!
14:56:54 <liuyulong> Routed Networks is a little new to me. And we haven't used it in our environment locally. But it's worth a try if needed. : )
14:57:43 <liuyulong> We are running out of time.
14:58:09 <liuyulong> #topic On demand agenda
14:58:14 <liuyulong> Anything else to discuss today?
14:58:50 <liuyulong> OK, happy coding, see you guys online.
14:58:53 <mlavalle> not from me
14:58:53 <liuyulong> #endmeeting