15:00:54 <haleyb> #startmeeting neutron_l3
15:00:55 <openstack> Meeting started Thu Dec 13 15:00:54 2018 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 <openstack> The meeting name has been set to 'neutron_l3'
15:01:00 <njohnston> o/
15:01:34 <slaweq> hi
15:02:11 <haleyb> hi everyone
15:02:16 <davidsha> Hi
15:02:35 <haleyb> #topic Announcements
15:03:08 <haleyb> Just a reminder Stein-2 is Jan 7th
15:04:13 <haleyb> And we will have a meeting next week, but not the week after
15:04:53 <haleyb> any other announcements?
15:05:28 <haleyb> #topic Bugs
15:05:37 <tidwellr> hi
15:05:55 <haleyb> tidwellr: hi
15:06:07 <haleyb> i don't see Swami so will go through bugs
15:06:46 <haleyb> https://bugs.launchpad.net/neutron/+bug/1774459
15:06:47 <openstack> Launchpad bug 1774459 in neutron "Update permanent ARP entries for allowed_address_pair IPs in DVR Routers" [High,Confirmed] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:06:52 <haleyb> https://review.openstack.org/#/c/601336/
15:07:24 <haleyb> I have not finished my review of this, still working on it
15:08:05 <haleyb> seems zuul didn't finish its review either
15:08:51 <haleyb> next is https://bugs.launchpad.net/neutron/+bug/1802006
15:08:52 <openstack> Launchpad bug 1802006 in neutron "Floating IP attach/detach fails for non-admin user and unbound port with router in different tenant" [Medium,In progress] - Assigned to Arjun Baindur (abaindur)
15:09:04 <haleyb> https://review.openstack.org/#/c/622623/
15:09:48 <haleyb> i just rebased that, looked good, simple change
15:10:18 <haleyb> https://bugs.launchpad.net/neutron/+bug/1804327
15:10:19 <openstack> Launchpad bug 1804327 in neutron "occasional connection reset on SNATed after tcp retries" [Medium,In progress] - Assigned to Dirk Mueller (dmllr)
15:10:30 <haleyb> tidwellr: you were taking this one over
15:10:35 <haleyb> https://review.openstack.org/#/c/618208/
15:11:00 <tidwellr> no updates on that one today
15:11:26 <haleyb> ack
15:11:43 <haleyb> next is https://bugs.launchpad.net/neutron/+bug/1805456
15:11:44 <openstack> Launchpad bug 1805456 in neutron "[DVR] Neutron doesn't configure multiple external subnets for one network properly" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:11:50 <haleyb> https://review.openstack.org/#/c/622449/
15:12:01 <haleyb> that's just stuck in recheck
15:12:36 <haleyb> thanks for fixing that ralonsoh
15:12:43 <ralonsoh> np
15:13:51 <haleyb> https://bugs.launchpad.net/neutron/+bug/1794991
15:13:52 <openstack> Launchpad bug 1794991 in neutron "Inconsistent flows with DVR l2pop VxLAN on br-tun" [Undecided,New]
15:14:42 <haleyb> Swami did some further debugging here, so made some progress
15:15:32 <haleyb> at least the missing flow has been identified
15:15:58 <haleyb> next is https://bugs.launchpad.net/neutron/+bug/1806770
15:15:59 <openstack> Launchpad bug 1806770 in neutron "DHCP Agent should not release DHCP lease when client ID is not set on port" [Medium,In progress] - Assigned to Arjun Baindur (abaindur)
15:16:37 <haleyb> https://review.openstack.org/#/c/623066/ proposed
15:16:54 <haleyb> i don't see Arjun here, big timezone difference
15:18:05 <haleyb> this is probably the 4th time we've had to tweak the dhcp release code, so any additional eyes would be helpful. In this case it was due to differences in Windows clients
15:19:24 <haleyb> crickets :)
15:20:10 <haleyb> there were also 2 metering agent bugs filed last week, i will triage them but a fix is posted
15:20:16 <haleyb> https://bugs.launchpad.net/neutron/+bug/1807153
15:20:17 <openstack> Launchpad bug 1807153 in neutron "Race condition in metering agent when creating iptable managers for router namespaces" [Undecided,New]
15:20:51 <haleyb> https://bugs.launchpad.net/neutron/+bug/1807157
15:20:53 <openstack> Launchpad bug 1807157 in neutron "Metering doesn't work for DVR routers on compute nodes" [Undecided,New]
15:21:11 <haleyb> https://review.openstack.org/#/c/621165/
15:21:41 <haleyb> ^^ fixes both, needs reviews
15:22:34 <haleyb> any other bugs someone wants to talk about?
15:23:01 <slaweq> I want to raise one
15:23:03 <slaweq> https://bugs.launchpad.net/neutron/+bug/1798475
15:23:04 <openstack> Launchpad bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed]
15:23:23 <slaweq> it's a fullstack tests issue but related to L3 HA
15:23:34 <liuyulong> yes, I got your email.
15:23:43 <liuyulong> Looking at it now
15:23:53 <slaweq> in the last comment I described the exact sequence of events there, and I need someone else to take a look at it :)
15:23:59 <slaweq> yes, thx liuyulong :)
15:24:43 <haleyb> thanks liuyulong, let me know if you need some help
15:25:14 <liuyulong> I have a clue, maybe the create router 'router_update' notification was re-consumed again.
15:25:52 <liuyulong> Because the l3 agent is restarted too fast.
15:26:14 <slaweq> it's more than 1 minute after the router is created
15:26:31 <liuyulong> I can not see any packet loss in my local env during l3 agent restart, if the router was created a long time ago.
15:26:33 <slaweq> because when the router is created, the backup agent is restarted and ping is checked for 1 minute
15:28:03 <haleyb> slaweq: so it's the restart of the master - when it comes back up it becomes master again instead of staying backup?
15:28:30 <liuyulong> http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-38-50-946978.txt.gz#_2018-11-30_03_39_03_184
15:28:46 <slaweq> finally it becomes master again but it may be because the other agent was already removed
15:28:48 <liuyulong> this line is the related LOG.
15:29:20 <liuyulong> the l3 agent restarted, but it got a notification
15:31:29 <liuyulong> The request id was first seen here: http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-05-693500.txt.gz
15:31:36 <liuyulong> during the create
15:31:44 <liuyulong> req-9ce3d0cb-3fbf-421d-a59c-8ca6efda1c58
15:33:10 <liuyulong> This is a known issue, I'm not quite sure if it is related to the test failing.
15:33:35 <liuyulong> IMO, such behavior does not influence the data plane.
15:34:07 <slaweq> but the test clearly shows that the dataplane is impacted in this case
15:34:20 <slaweq> as it fails because of some packet loss during agent restart
15:34:40 <haleyb> liuyulong: is there another bug for this issue?
15:35:08 <liuyulong> haleyb, no bug for it
15:36:03 <liuyulong> slaweq, yes, we need more investigation
15:36:04 <slaweq> liuyulong: haleyb: but can processing of such an rpc message by the agent cause a switch of the VIP address in keepalived?
15:36:27 <slaweq> I thought that the agent is not the one who decides if it's the master or backup node
15:36:31 <liuyulong> slaweq, IMO, it should not
15:36:36 <slaweq> but keepalived does that
15:36:56 <haleyb> yes, keepalived should do it
15:37:13 <slaweq> so IMO there is some issue here that causes some vrrp packets from one "host" to another to be missed
15:37:29 <slaweq> but I don't know what could do that :/
15:37:39 <haleyb> slaweq: and this is an intermittent gate failure?
15:37:40 <liuyulong> besides this re-consume issue, a race condition between routers_updated and fullsync may also need attention.
15:38:01 <slaweq> haleyb: yes, it happens from time to time
15:38:22 <slaweq> some time ago this test was restarting all agents at once and it happened more often then, IIRC
15:38:44 <slaweq> but then I changed it to restart only backup agents and you added a restart of active agents to it
15:39:40 <liuyulong> L2 agent is not restarted, so the l2 data plane may not be an issue. But if the re-consumed ha router notification was causing the ha device to be re-installed, this may cause packet loss.
15:40:07 <liuyulong> I'm not quite sure about this, it needs some code digging
15:40:22 <haleyb> liuyulong: do you have the time to look?
15:40:57 <liuyulong> haleyb, yes, I'll check it.
15:41:38 <slaweq> liuyulong: thx a lot
15:42:38 <haleyb> liuyulong: thanks, i'll assign it to you (although launchpad is not cooperating right now)
15:42:49 <liuyulong> OK
15:43:14 <haleyb> liuyulong: dragon889 is your launchpad id ?
15:43:18 <liuyulong> yes
15:43:41 <haleyb> thanks
15:44:22 <haleyb> any other bugs to discuss?
15:45:22 <haleyb> #topic Check/gate failures
15:45:33 <haleyb> http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
15:45:55 <haleyb> obviously the iptable-hybrid job is causing grief...
15:46:14 <ralonsoh> my bad.... sorry
15:46:42 <ralonsoh> the patch is merged in os-vif, waiting for the next os-vif version release
15:47:17 <haleyb> ralonsoh: np, it wasn't your fault
15:47:42 <haleyb> i had an action item from last week to check the neutron-tempest-plugin-dvr-multinode-scenario job failure rate
15:48:07 <haleyb> i have failed that task, so re-added it to my queue
15:49:43 <haleyb> i do wonder if some of this will clear up with fressi's changes that are in-flight
15:50:28 <haleyb> but most are because of being unable to ssh to the floating IP
15:50:47 <haleyb> eg http://logs.openstack.org/23/622623/3/check/neutron-tempest-plugin-dvr-multinode-scenario/604b3c0/testr_results.html.gz
15:52:20 <haleyb> i will take a further look
15:52:31 <haleyb> and file a bug
15:53:33 <haleyb> ooh, and i found another error now in the qos extension
15:53:42 <haleyb> http://logs.openstack.org/23/622623/3/check/neutron-tempest-plugin-dvr-multinode-scenario/604b3c0/controller/logs/screen-q-l3.txt.gz?level=WARNING
15:53:57 <haleyb> Error while deleting router a9ac83ee-c93e-4aaf-b1c5-bcdf22ff8b13: TypeError: string indices must be integers
15:54:02 <haleyb> anyone seen that before?
15:54:36 <slaweq> no, but that should IMO be easy to reproduce and fix :)
15:54:44 <njohnston> has this job been switched to python 3 yet?
15:55:27 <haleyb> i don't know, and it's non-voting so goes unnoticed
15:55:51 <slaweq> njohnston: yes, all neutron-tempest-plugin jobs are on py3 now
15:56:01 <slaweq> have been for some time already
15:56:25 <njohnston> ok, so that looks like an issue I have seen with some py3 conversions
15:57:10 <haleyb> USE_PYTHON3: true
15:57:39 <haleyb> running out of time...
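For context on the master/backup exchange above: a minimal sketch of the kind of VRRP instance configuration the L3 agent hands to keepalived for an HA router. The interface name, virtual_router_id, priority, and VIP below are illustrative assumptions, not values taken from this bug; the point of the discussion is that keepalived's VRRP election on the ha- port, not the agent itself, decides which router instance holds the VIP, so a long enough gap in advertisements makes a backup promote itself.

```
# Hypothetical keepalived.conf fragment for one HA router (illustrative values only)
vrrp_instance VR_1 {
    state BACKUP                  # every instance starts as backup; VRRP elects the master
    interface ha-1234abcd         # HA port inside the router namespace (assumed name)
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt                     # a recovered node does not take master back on its own
    advert_int 2                  # roughly 3 missed advertisements trigger a failover
    virtual_ipaddress {
        169.254.0.1/24 dev ha-1234abcd
    }
}
```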
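On the "string indices must be integers" error above: a small, hypothetical Python sketch of the most common way that TypeError surfaces, i.e. indexing with a string key when the variable actually holds a str instead of the expected dict. This is only a generic illustration of the error pattern; the meeting did not identify the actual cause in the qos extension.

```python
# Hypothetical illustration only -- not the neutron code that failed in the gate job.
import json

expected = {"gw_ip": "198.51.100.10"}   # what the caller expects: a dict
actual = json.dumps(expected)           # what it actually got: the same data, still a str

print(expected["gw_ip"])                # fine: dict lookup by key

try:
    print(actual["gw_ip"])              # a str indexed with a str key
except TypeError as exc:
    print(exc)                          # "string indices must be integers" (wording varies by Python version)
```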
15:57:51 <haleyb> #topic Open discussion
15:58:00 <haleyb> any quick topics to discuss?
15:59:15 <haleyb> ok, thanks for attending, i've got some bugs to file
15:59:20 <davidsha> Thanks!
15:59:25 <haleyb> #endmeeting