14:00:59 <liuyulong> #startmeeting neutron_l3
14:01:00 <openstack> Meeting started Wed Jul 31 14:00:59 2019 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:01 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:03 <openstack> The meeting name has been set to 'neutron_l3'
14:01:23 <liuyulong> Hello
14:02:33 <slaweq> hi
14:02:41 <liuyulong> Today will be a quick, short meeting, IMO.
14:02:44 <njohnston> o/
14:02:53 <liuyulong> #topic Announcements
14:03:20 <haleyb> hi
14:03:21 <liuyulong> #link https://etherpad.openstack.org/p/Shanghai-Neutron-Planning
14:03:31 <liuyulong> Just added my name here ^
14:03:50 <slaweq> great, we will finally meet in person liuyulong :)
14:03:51 <liuyulong> #chair haleyb
14:03:52 <openstack> Current chairs: haleyb liuyulong
14:05:18 <liuyulong> I have no more announcements. IMO, we already covered all of them yesterday in the team meeting.
14:06:04 <liuyulong> OK, let's move on.
14:06:13 <liuyulong> #topic Bugs
14:06:24 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/008089.html
14:06:30 <liuyulong> Boden Russell was our bug deputy last week, thanks.
14:07:24 <liuyulong> And again, I will skip all the bugs that are already fixed or whose related patches are getting merged now.
14:07:37 <liuyulong> First one:
14:07:39 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1837635
14:07:40 <openstack> Launchpad bug 1837635 in neutron "HA router state change from "standby" to "master" should be delayed" [Undecided,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:08:16 <liuyulong> The fix looks good to me, IMO, but I'd like to see more in-depth test results.
14:08:42 <liuyulong> I just added two scenarios here: https://review.opendev.org/#/c/672533/
14:10:06 <liuyulong> Sometimes, the actual results of the running program may be different from what you expected.
14:10:32 <liuyulong> Next:
14:10:37 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834308
14:10:39 <openstack> Launchpad bug 1834308 in neutron "[DVR][DB] too many slow query during agent restart" [Medium,In progress] - Assigned to LIU Yulong (dragon889)
14:11:06 <liuyulong> I submitted the patch set yesterday.
14:11:16 <liuyulong> It is here: https://review.opendev.org/#/c/673557/
14:11:30 <liuyulong> A pep8 failure...
14:12:49 <liuyulong> The DB queries in this patch are the ones called most frequently when restarting ovs-agent.
14:13:58 <haleyb> do you have numbers on the improvement?
14:14:00 <liuyulong> And they are time-consuming; as your 'ports' table gets larger and larger, these queries get worse.
14:14:52 <liuyulong> Restarting ovs-agent on 40 nodes calls these DB queries about 300K+ times.
14:17:02 <liuyulong> And these queries cost about 0.1s+, as logged by our mariadb cluster.
14:17:28 <haleyb> .1s per query?
14:18:18 <liuyulong> Yes, one of them is about 0.5s+. Let me link it in gerrit.
14:18:50 <liuyulong> https://review.opendev.org/#/c/673557/1/neutron/db/dvr_mac_db.py@145
14:18:56 <liuyulong> get_ports_on_host_by_subnet
14:19:02 <liuyulong> This one.
14:20:01 <liuyulong> haleyb, those results are with about 10-20K records in the ports table.
14:23:09 <haleyb> so _get_ports_query() is really slow
14:25:05 <liuyulong> The resource scale is about: 17000+ VMs, 3000+ DVR routers, 3000+ networks, 3000+ subnets and 3000+ security groups; 40 security group rules for each security group.
14:26:08 <liuyulong> After this change, the ovs-agent restart time improves very significantly: from about 40-50 mins down to 15 mins.
14:26:29 <njohnston> I wonder if it would be further optimized by adding an index specifically on the Port.device_owner field. I'll comment on the change.
14:26:50 <tidwellr> interesting
14:27:08 <tidwellr> hi, btw
14:27:22 <haleyb> liuyulong: that's quite an improvement, even if 15mins is still a long time :)
14:27:44 <liuyulong> 40-50 mins. I couldn't believe it at first, but it really is.
14:28:23 <liuyulong> In rpc_loop 1 it scans the ports and processes them. 40-50mins.......
14:30:24 <haleyb> liuyulong: it looks like you have lots of reviewers now
14:31:03 <liuyulong> More detail about our test deployment: 3 neutron-servers (about 172 workers), with a 3-node DB and a 3-node MQ, all on dedicated servers.
14:31:10 <liuyulong> Yes, neutron has its own DB and MQ.
14:35:30 <liuyulong> Last one:
14:35:38 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1838431
14:35:39 <openstack> Launchpad bug 1838431 in neutron "[scale issue] ovs-agent port processing time increases linearly and eventually timeouts" [Undecided,New]
14:35:51 <liuyulong> More like an L2 issue...
14:37:05 <slaweq> liuyulong: this one looks related to the already known problem with "remote_security_group" rules in SG
14:37:19 <slaweq> there was a bug reported for it already IIRC
14:37:19 <liuyulong> The test has not succeeded yet.
14:38:12 <liuyulong> I will test it again today.
14:40:19 <liuyulong> One more interesting thing is that we disabled DHCP for this test. No DHCP agent in this test. I can imagine that if DHCP is enabled, the vif-plug-timeout may get more...
14:41:13 <liuyulong> That's all the bugs from me.
14:41:24 <liuyulong> Any other bugs that need the team's attention?
14:41:37 <haleyb> there was one miguel filed yesterday
14:41:39 <slaweq> liuyulong: I can't find any bug reported for Your last issue but please check https://etherpad.openstack.org/p/openstack-networking-train-ptg in L347
14:41:48 <slaweq> njohnston raised this problem at the last PTG
14:41:58 <liuyulong> slaweq, OK, great
14:42:00 <haleyb> https://bugs.launchpad.net/neutron/+bug/1838449
14:42:01 <openstack> Launchpad bug 1838449 in neutron "Router migrations failing in the gate" [Medium,Confirmed] - Assigned to Miguel Lavalle (minsel)
14:42:23 <slaweq> liuyulong: please run Your test without security group rules that reference remote_group_id
14:42:32 <slaweq> then it should be much, much faster
14:44:20 <haleyb> liuyulong: that was the only other bug i had, was going to try and reproduce locally today for miguel
14:45:06 <slaweq> haleyb: yes, this one hurts us quite a lot in CI jobs
14:45:24 <liuyulong> slaweq, actually I refactored my test to use 27 tenants yesterday. It looks better now.
14:46:09 <liuyulong> haleyb, thanks for bringing this up; it seems Miguel has found the problematic code.
14:46:12 <slaweq> liuyulong: because if You have more tenants, there are probably fewer IPs (ports) using the same security group, and thus it's faster
14:47:16 <liuyulong> slaweq, yes, and I'm trying to add more security groups for each tenant, or network.
14:48:19 <liuyulong> For one tenant and one default security group, it is a disaster.
14:49:12 <liuyulong> IMO, everyone who tries to test this will very easily run into this problem.
14:49:41 <liuyulong> njohnston's PTG summary looks very similar to this.
14:49:50 <slaweq> liuyulong: yes, we had this issue too
14:51:12 <liuyulong> And maybe some security group DB queries also need some optimization work.
14:51:28 <liuyulong> OK, next topic
14:51:36 <liuyulong> #topic Routed Networks
14:51:44 <liuyulong> mlavalle, tidwellr, wwriverrat: your turn now.
14:53:30 <tidwellr> if mlavalle and wwriverrat don't have anything, we can talk briefly about floating IPs for routed networks
14:53:55 <tidwellr> https://review.opendev.org/#/c/486450/ and the POC code https://review.opendev.org/#/c/669395/
14:55:04 <njohnston> I don't see mlavalle online
14:55:36 <tidwellr> if it isn't obvious by my nagging folks to take a look at these, this has turned into my pet project :)
14:57:50 <tidwellr> I've spun up a little lab where I've tested the POC code, it seems to work nicely and it's not terribly invasive. What I'm interested in is feedback about the approach in the spec
15:00:05 <liuyulong> tidwellr, thank you for replying to my question in the patch sets.
15:00:12 <liuyulong> OK, let's end the meeting.
15:00:17 <liuyulong> #endmeeting