15:00:54 <haleyb> #startmeeting neutron_l3
15:00:55 <openstack> Meeting started Thu Dec 13 15:00:54 2018 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 <openstack> The meeting name has been set to 'neutron_l3'
15:01:00 <njohnston> o/
15:01:34 <slaweq> hi
15:02:11 <haleyb> hi everyone
15:02:16 <davidsha> Hi
15:02:35 <haleyb> #topic Announcements
15:03:08 <haleyb> Just a reminder Stein-2 is Jan 7th
15:04:13 <haleyb> And we will have a meeting next week, but not the week after
15:04:53 <haleyb> any other announcements?
15:05:28 <haleyb> #topic Bugs
15:05:37 <tidwellr> hi
15:05:55 <haleyb> tidwellr: hi
15:06:07 <haleyb> i don't see Swami so will go through bugs
15:06:46 <haleyb> https://bugs.launchpad.net/neutron/+bug/1774459
15:06:47 <openstack> Launchpad bug 1774459 in neutron "Update permanent ARP entries for allowed_address_pair IPs in DVR Routers" [High,Confirmed] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:06:52 <haleyb> https://review.openstack.org/#/c/601336/
15:07:24 <haleyb> I have not finished my review of this, still working on it
15:08:05 <haleyb> seems zuul didn't finish its review either
15:08:51 <haleyb> next is https://bugs.launchpad.net/neutron/+bug/1802006
15:08:52 <openstack> Launchpad bug 1802006 in neutron "Floating IP attach/detach fails for non-admin user and unbound port with router in different tenant" [Medium,In progress] - Assigned to Arjun Baindur (abaindur)
15:09:04 <haleyb> https://review.openstack.org/#/c/622623/
15:09:48 <haleyb> i just rebased that, looked good, simple change
15:10:18 <haleyb> https://bugs.launchpad.net/neutron/+bug/1804327
15:10:19 <openstack> Launchpad bug 1804327 in neutron "occasional connection reset on SNATed after tcp retries" [Medium,In progress] - Assigned to Dirk Mueller (dmllr)
15:10:30 <haleyb> tidwellr: you were taking this one over
15:10:35 <haleyb> https://review.openstack.org/#/c/618208/
15:11:00 <tidwellr> no updates on that one today
15:11:26 <haleyb> ack
15:11:43 <haleyb> next is https://bugs.launchpad.net/neutron/+bug/1805456
15:11:44 <openstack> Launchpad bug 1805456 in neutron "[DVR] Neutron doesn't configure multiple external subnets for one network properly" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:11:50 <haleyb> https://review.openstack.org/#/c/622449/
15:12:01 <haleyb> that's just stuck in recheck
15:12:36 <haleyb> thanks for fixing that ralonsoh
15:12:43 <ralonsoh> np
15:13:51 <haleyb> https://bugs.launchpad.net/neutron/+bug/1794991
15:13:52 <openstack> Launchpad bug 1794991 in neutron "Inconsistent flows with DVR l2pop VxLAN on br-tun" [Undecided,New]
15:14:42 <haleyb> Swami did some further debugging here, so made some progress
15:15:32 <haleyb> at least the missing flow has been identified
15:15:58 <haleyb> next is https://bugs.launchpad.net/neutron/+bug/1806770
15:15:59 <openstack> Launchpad bug 1806770 in neutron "DHCP Agent should not release DHCP lease when client ID is not set on port" [Medium,In progress] - Assigned to Arjun Baindur (abaindur)
15:16:37 <haleyb> https://review.openstack.org/#/c/623066/ proposed
15:16:54 <haleyb> i don't see Arjun here, big timezone difference
15:18:05 <haleyb> this is probably the 4th time we've had to tweak the dhcp release code, so any additional eyes would be helpful. In this case it was due to differences in Windows clients
15:19:24 <haleyb> crickets :)
15:20:10 <haleyb> there were also 2 metering agent bugs filed last week, i will triage them but a fix is posted
15:20:16 <haleyb> https://bugs.launchpad.net/neutron/+bug/1807153
15:20:17 <openstack> Launchpad bug 1807153 in neutron "Race condition in metering agent when creating iptable managers for router namespaces" [Undecided,New]
15:20:51 <haleyb> https://bugs.launchpad.net/neutron/+bug/1807157
15:20:53 <openstack> Launchpad bug 1807157 in neutron "Metering doesn't work for DVR routers on compute nodes" [Undecided,New]
15:21:11 <haleyb> https://review.openstack.org/#/c/621165/
15:21:41 <haleyb> ^^ fixes both, needs reviews
15:22:34 <haleyb> any other bugs someone wants to talk about?
15:23:01 <slaweq> I want to raise one
15:23:03 <slaweq> https://bugs.launchpad.net/neutron/+bug/1798475
15:23:04 <openstack> Launchpad bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed]
15:23:23 <slaweq> it's a fullstack tests issue but related to L3 HA
15:23:34 <liuyulong> yes, I got your email.
15:23:43 <liuyulong> Looking at it now
15:23:53 <slaweq> in the last comment I described the exact sequence of events there, and I need someone else to take a look at it :)
15:23:59 <slaweq> yes, thx liuyulong :)
15:24:43 <haleyb> thanks liuyulong, let me know if you need some help
15:25:14 <liuyulong> I have a clue, maybe the create router 'router_update' notification was re-consumed again.
15:25:52 <liuyulong> Because the l3 agent is restarted too fast.
15:26:14 <slaweq> it's more than 1 minute after the router is created
15:26:31 <liuyulong> I can not see any packet loss in my local env during l3 agent restart, if the router was created a long time ago.
15:26:33 <slaweq> because when the router is created, the backup agent is restarted and ping is checked for 1 minute
15:28:03 <haleyb> slaweq: so it's the restart of the master - when it comes back up it becomes master again instead of staying backup?
15:28:30 <liuyulong> http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-38-50-946978.txt.gz#_2018-11-30_03_39_03_184
15:28:46 <slaweq> finally it becomes master again but it may be because the other agent was already removed
15:28:48 <liuyulong> this line is the related LOG.
15:29:20 <liuyulong> the l3 agent restarted, but it got a notification
15:31:29 <liuyulong> The request id was first seen here: http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-05-693500.txt.gz
15:31:36 <liuyulong> during the create
15:31:44 <liuyulong> req-9ce3d0cb-3fbf-421d-a59c-8ca6efda1c58
15:33:10 <liuyulong> This is a known issue, I'm not quite sure if it is related to the test failing.
15:33:35 <liuyulong> IMO, such behavior does not influence the data plane.
15:34:07 <slaweq> but the test clearly shows that the dataplane is impacted in this case
15:34:20 <slaweq> as it fails because of some packet loss during agent restart
15:34:40 <haleyb> liuyulong: is there another bug for this issue?
15:35:08 <liuyulong> haleyb, no bug for it
15:36:03 <liuyulong> slaweq, yes, we need more investigation
15:36:04 <slaweq> liuyulong: haleyb: but can processing of such an rpc message by the agent cause a switch of the VIP address in keepalived?
15:36:27 <slaweq> I thought that the agent is not the one who decides if it's the master or backup node
15:36:31 <liuyulong> slaweq, IMO, it should not
15:36:36 <slaweq> but keepalived does that
15:36:56 <haleyb> yes, keepalived should do it
15:37:13 <slaweq> so IMO there is some issue here that causes some vrrp packets from one "host" to another to be missed
15:37:29 <slaweq> but I don't know what could do that :/
15:37:39 <haleyb> slaweq: and this is an intermittent gate failure?
15:37:40 <liuyulong> besides this re-consume issue, a race condition between routers_updated and fullsync may also need attention.
15:38:01 <slaweq> haleyb: yes, it happens from time to time
15:38:22 <slaweq> some time ago this test was restarting all agents at once and it happened more often then, IIRC
15:38:44 <slaweq> but then I changed it to restart only backup agents and you added a restart of active agents to it
15:39:40 <liuyulong> L2 agent is not restarted, so the l2 data plane may not be an issue. But if the re-consumed ha router notification was causing the ha device to be re-installed, this may cause packet loss.
15:40:07 <liuyulong> I'm not quite sure about this, it needs some code digging
15:40:22 <haleyb> liuyulong: do you have the time to look?
15:40:57 <liuyulong> haleyb, yes, I'll check it.
15:41:38 <slaweq> liuyulong: thx a lot
15:42:38 <haleyb> liuyulong: thanks, i'll assign it to you (although launchpad is not cooperating right now)
15:42:49 <liuyulong> OK
15:43:14 <haleyb> liuyulong: dragon889 is your launchpad id ?
15:43:18 <liuyulong> yes
15:43:41 <haleyb> thanks
15:44:22 <haleyb> any other bugs to discuss?
15:45:22 <haleyb> #topic Check/gate failures
15:45:33 <haleyb> http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
15:45:55 <haleyb> obviously the iptable-hybrid job is causing grief...
15:46:14 <ralonsoh> my bad.... sorry
15:46:42 <ralonsoh> the patch is merged in os-vif, waiting for the next os-vif version release
15:47:17 <haleyb> ralonsoh: np, it wasn't your fault
15:47:42 <haleyb> i had an action item from last week to check the neutron-tempest-plugin-dvr-multinode-scenario job failure rate
15:48:07 <haleyb> i have failed that task, so re-added it to my queue
15:49:43 <haleyb> i do wonder if some of this will clear up with fressi's changes that are in-flight
15:50:28 <haleyb> but most are because of being unable to ssh to the floating IP
15:50:47 <haleyb> eg http://logs.openstack.org/23/622623/3/check/neutron-tempest-plugin-dvr-multinode-scenario/604b3c0/testr_results.html.gz
15:52:20 <haleyb> i will take a further look
15:52:31 <haleyb> and file a bug
15:53:33 <haleyb> ooh, and i found another error now in the qos extension
15:53:42 <haleyb> http://logs.openstack.org/23/622623/3/check/neutron-tempest-plugin-dvr-multinode-scenario/604b3c0/controller/logs/screen-q-l3.txt.gz?level=WARNING
15:53:57 <haleyb> Error while deleting router a9ac83ee-c93e-4aaf-b1c5-bcdf22ff8b13: TypeError: string indices must be integers
15:54:02 <haleyb> anyone seen that before?
15:54:36 <slaweq> no, but that should IMO be easy to reproduce and fix :)
15:54:44 <njohnston> has this job been switched to python 3 yet?
15:55:27 <haleyb> i don't know, and it's non-voting so goes unnoticed
15:55:51 <slaweq> njohnston: yes, all neutron-tempest-plugin jobs are on py3 now
15:56:01 <slaweq> have been for some time already
15:56:25 <njohnston> ok, so that looks like an issue I have seen with some py3 conversions
15:57:10 <haleyb> USE_PYTHON3: true
15:57:39 <haleyb> running out of time...
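For context on the master/backup exchange above: a minimal sketch of the kind of VRRP instance configuration the L3 agent hands to keepalived for an HA router. The interface name, virtual_router_id, priority, and VIP below are illustrative assumptions, not values taken from this bug; the point of the discussion is that keepalived's VRRP election on the ha- port, not the agent itself, decides which router instance holds the VIP, so a long enough gap in advertisements makes a backup promote itself.

```
# Hypothetical keepalived.conf fragment for one HA router (illustrative values only)
vrrp_instance VR_1 {
    state BACKUP                  # every instance starts as backup; VRRP elects the master
    interface ha-1234abcd         # HA port inside the router namespace (assumed name)
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt                     # a recovered node does not take master back on its own
    advert_int 2                  # roughly 3 missed advertisements trigger a failover
    virtual_ipaddress {
        169.254.0.1/24 dev ha-1234abcd
    }
}
```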
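On the "string indices must be integers" error above: a small, hypothetical Python sketch of the most common way that TypeError surfaces, i.e. indexing with a string key when the variable actually holds a str instead of the expected dict. This is only a generic illustration of the error pattern; the meeting did not identify the actual cause in the qos extension.

```python
# Hypothetical illustration only -- not the neutron code that failed in the gate job.
import json

expected = {"gw_ip": "198.51.100.10"}   # what the caller expects: a dict
actual = json.dumps(expected)           # what it actually got: the same data, still a str

print(expected["gw_ip"])                # fine: dict lookup by key

try:
    print(actual["gw_ip"])              # a str indexed with a str key
except TypeError as exc:
    print(exc)                          # "string indices must be integers" (wording varies by Python version)
```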
15:57:51 <haleyb> #topic Open discussion
15:58:00 <haleyb> any quick topics to discuss?
15:59:15 <haleyb> ok, thanks for attending, i've got some bugs to file
15:59:20 <davidsha> Thanks!
15:59:25 <haleyb> #endmeeting