14:00:54 <mlavalle> #startmeeting neutron_l3
14:00:55 <openstack> Meeting started Wed May 15 14:00:54 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:58 <openstack> The meeting name has been set to 'neutron_l3'
14:01:13 <tidwellr> o/
14:01:20 <haleyb> o/
14:01:41 <panda> o/ (partially - in another meeting)
14:02:39 <mlavalle> hey panda, I was wondering about you yesterday. Nice to see you here :-)
14:02:57 <mlavalle> good morning haleyb, tidwellr
14:03:07 <ralonsoh> hi
14:03:14 <mlavalle> hey ralonsoh
14:03:28 <tidwellr> mlavalle: happy wednesday!
14:05:16 <mlavalle> ok, let's get going
14:05:42 <mlavalle> #topic Announcements
14:06:11 <mlavalle> We are on our way to the T-1 milestone, June 3 - 7
14:06:19 <mlavalle> #link https://releases.openstack.org/train/schedule.html
14:07:41 <mlavalle> These are our photos from the recent PTG:
14:07:43 <njohnston> o/
14:07:46 <mlavalle> #link https://www.dropbox.com/sh/fydqjehy9h5y728/AAC1gIc5bJwwNd5JkcQ6Pqtra/Neutron?dl=0&subfolder_nav_tracking=1
14:08:28 <mlavalle> everybody handsome as usual
14:08:45 <mlavalle> especially in the one with the props
14:09:33 <mlavalle> any other announcements from the team?
14:10:08 <mlavalle> ok, let's move on
14:10:40 <mlavalle> #topic Bugs
14:11:27 <mlavalle> First, we have a critical issue https://bugs.launchpad.net/neutron/+bug/1824571
14:11:28 <openstack> Launchpad bug 1824571 in neutron "l3agent can't create router if there are multiple external networks" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
14:11:52 <mlavalle> it was recently promoted to critical by slaweq
14:12:07 <mlavalle> so I better hurry up with a fix for this
14:12:52 <mlavalle> I have an environment ready to reproduce the issue
14:14:26 <mlavalle> Next bug is https://bugs.launchpad.net/neutron/+bug/1774459
14:14:28 <openstack> Launchpad bug 1774459 in neutron "Update permanent ARP entries for allowed_address_pair IPs in DVR Routers" [High,Confirmed] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
14:15:33 <mlavalle> We didn't have a chance to discuss this issue in Denver
14:15:51 <mlavalle> couldn't reach Swami over google hangouts
14:16:05 <mlavalle> but it seems he is making progress: https://review.opendev.org/#/c/651905/
14:16:09 <tidwellr> before he left Denver he wanted me to just mention he needs reviews
14:17:05 <mlavalle> in the commit message he indicates it is related to the bug
14:17:13 <mlavalle> are there more patches coming?
14:17:15 <tidwellr> there's also https://review.opendev.org/#/c/616272/ that I think is also related
14:18:26 <tidwellr> and maybe this one too https://review.opendev.org/#/c/601336/
14:19:22 <mlavalle> yes, these two also indicate in their commit messages that they are related
14:19:37 * mlavalle leaving a note in the bug pointing to these 2 patches ^^^^
14:21:42 <mlavalle> Last one I have today is https://bugs.launchpad.net/neutron/+bug/1823038
14:21:43 <openstack> Launchpad bug 1823038 in neutron "Neutron-keepalived-state-change fails to check initial router state" [High,Confirmed]
14:22:13 <mlavalle> which seems to have been fixed already
14:22:55 <ralonsoh> not yet
14:23:03 <ralonsoh> I'm going to propose a patch for it
14:23:16 <ralonsoh> the agent is now run under neutron-rootwrap
14:23:21 <ralonsoh> and privsep is failing
14:23:46 <ralonsoh> so I'm removing this newly added code and keeping only the privsep initialization
14:24:09 <mlavalle> can I assign it to you?
14:24:19 <ralonsoh> I'm just helping Slawek
14:24:29 <ralonsoh> he knows the status of this patch
14:24:48 <mlavalle> ah ok
14:25:06 <ralonsoh> that's all
14:25:10 <mlavalle> thanks for the update :)
14:25:39 <mlavalle> any other bugs we should discuss today?
14:26:24 <liuyulong> One more https://bugs.launchpad.net/neutron/+bug/1821912
14:26:25 <openstack> Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:26:56 <liuyulong> Seems we hit this more and more frequently
14:27:54 <mlavalle> is it a workaround?
14:27:58 <liuyulong> I have two possible directions for a fix
14:28:20 <liuyulong> one: https://review.opendev.org/#/c/659009/ wait until the floating IP is active
14:28:38 <mlavalle> is this a workaround? ^^^^
14:29:27 <liuyulong> It can be a workaround
14:29:42 <liuyulong> the other is that we cannot rely on the nova DB instance status
14:30:03 <liuyulong> every time, the guest OS is still booting when the test case tries to log in
14:30:52 <liuyulong> So I wonder if we can ping the fixed IP first, then try to log in to it
14:31:36 <liuyulong> But it seems tempest does not allow that now
14:31:41 <mlavalle> so what you are saying is that we don't have an underlying connectivity / authentication problem, but rather a testing problem?
14:32:06 <liuyulong> I have tried both in https://review.opendev.org/659009, but have now reverted it back to only "waiting for floating IP status"
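As a rough illustration of that first direction, a tempest-side wait could look something like the sketch below; the client object and its show_floatingip() call are assumptions modeled on the tempest network floating-IPs client, not the code in the patch itself:

```python
import subprocess
import time


def wait_for_fip_active(client, fip_id, timeout=120, interval=5):
    """Poll Neutron until the floating IP is reported ACTIVE.

    `client` is assumed to expose show_floatingip(), as the tempest
    network floating-IPs client does; adapt to your client wrapper.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        fip = client.show_floatingip(fip_id)['floatingip']
        if fip['status'] == 'ACTIVE':
            return fip
        time.sleep(interval)
    raise RuntimeError('Floating IP %s never became ACTIVE' % fip_id)


def ping_before_ssh(ip_address, attempts=10):
    """Cheap reachability probe before attempting the SSH login."""
    for _ in range(attempts):
        rc = subprocess.call(['ping', '-c', '1', '-W', '2', ip_address],
                             stdout=subprocess.DEVNULL)
        if rc == 0:
            return True
        time.sleep(2)
    return False
```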
14:33:15 <liuyulong> The recently merged patches all needed a lot of "recheck"s; maybe we should raise this bug's priority too.
14:33:48 <mlavalle> critical?
14:34:57 <liuyulong> Not entirely; Slawek mentioned that nova metadata may have something wrong as well.
14:35:17 <mlavalle> ok, a combination of causes
14:35:47 <tidwellr> liuyulong: just thinking out loud and maybe it's crazy, but I wonder if there's a way to set a static route on the host that would allow us to reach the fixed IP
14:37:51 <liuyulong> mlavalle, Slawek and I are now aiming in different directions
14:38:29 <liuyulong> I also noticed that l3-agent may have a really long router processing time, 40s+ in some cases.
14:38:32 <mlavalle> liuyulong: sounds good to me. I am also adding a 3rd direction: tcpdump in the namespace
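For that tcpdump direction, a minimal helper along the following lines could capture SSH and ICMP traffic inside the router namespace while the scenario test runs; the qrouter-<uuid> naming follows the usual L3 agent convention, and the output path and filter are only illustrative:

```python
import subprocess


def capture_in_router_namespace(router_id, seconds=60,
                                pcap_path='/tmp/router-capture.pcap'):
    """Capture SSH and ICMP traffic inside a router namespace.

    Assumes the standard qrouter-<uuid> namespace naming and that
    tcpdump is installed on the network node; run as root.
    """
    namespace = 'qrouter-%s' % router_id
    cmd = ['ip', 'netns', 'exec', namespace,
           'timeout', str(seconds),
           'tcpdump', '-ni', 'any', '-w', pcap_path,
           'icmp', 'or', 'tcp', 'port', '22']
    return subprocess.call(cmd)
```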
14:39:13 <liuyulong> tidwellr, I'm not quite sure, but if tempest can only reach the API, the route may not work.
14:39:55 <tidwellr> yep, it's just a thought
14:40:58 <liuyulong> This is really a tough one...
14:41:05 <mlavalle> yes it is
14:41:47 <mlavalle> anything else on this bug?
14:42:01 <liuyulong> not from me
14:42:18 <mlavalle> thanks for the update :-)
14:42:23 <mlavalle> any other bugs?
14:43:19 <haleyb> mlavalle: there's one in the open agenda section, but if we're in a buggy mood we can discuss now
14:43:30 <mlavalle> shoot
14:43:40 <haleyb> https://bugs.launchpad.net/neutron/+bug/1818824
14:43:42 <openstack> Launchpad bug 1818824 in neutron "When a fip is added to a vm with dvr, previous connections loss the connectivity" [Low,In progress] - Assigned to Gabriele Cerami (gcerami)
14:44:03 <tidwellr> I saw the chatter about this in IRC yesterday
14:44:27 <haleyb> In short, there is a difference between DVR/centralized here
14:44:52 <panda> I tried to lay out some solutions, but a behavioural decision has to be made first
14:45:31 <haleyb> if an instance is using the default snat IP and a floating is associated, should we be deleting the conntrack entries for the existing connections
14:46:32 <haleyb> i'm inclined to think we should always be cleaning them, since the instance should start using the floating IP
14:46:34 <tidwellr> is there a concrete example of a workload in a VM that is affected by this?
14:47:50 <haleyb> i don't think so.  it's an edge case since in order to trigger something in an instance you need a floating IP to get in first (or log in from another instance on the private network)
14:48:15 <liuyulong> haleyb, +1, yes, it should stop the previous connection to save the SNAT node bandwidth.
14:48:42 <mlavalle> liuyulong always bringing the operator perspective
14:48:47 <mlavalle> nice
14:48:51 <tidwellr> I'm inclined to agree, once the FIP is associated force all traffic to use it
14:49:56 <haleyb> liuyulong: it doesn't happen with centralized routing today, you can have a connection continue to use the snat IP until it closes.  DVR "breaks" it simply because it forces everything into the fip namespace where it dies
14:51:24 <mlavalle> I agree with haleyb and tidwellr
14:51:30 <haleyb> so it seems we agree the conntrack entries should have been cleaned.  i think if we make that change soon-ish we'll be able to get some feedback if we break something during the T cycle
14:51:45 <mlavalle> yes!
14:51:50 <mlavalle> the sooner the better
14:51:53 <tidwellr> +1
14:51:55 <panda> for both DVR and non-DVR scenarios?
14:52:08 <haleyb> and i don't think we documented the behavior, so we should do that too
14:52:32 <panda> in DVR currently the connections just starve, they are not closed
14:52:49 <haleyb> panda: yes, both.  with dvr it's essentially cleaned by the routing change, right?
14:53:14 <haleyb> as you say starved since the connection is broken
14:53:45 <panda> haleyb: it's not cleaned at all; the packets try to follow the new route but they just die somewhere, so the connection only clears after the timeout
14:53:58 <panda> I'm trying to understand if they need to be explicitly closed instead
14:54:12 <liuyulong> Could be a bug for centralized routers too, since we never test that.
14:55:09 <mlavalle> I'd say explicitly close it
14:55:16 <liuyulong> For DVR with centralized floating IPs, what's the behavior now?
14:55:16 <haleyb> panda: right, but removing the stale conntrack entries would make the connection fail quickly and not timeout slowly
14:55:47 <mlavalle> good point
14:55:54 <haleyb> liuyulong: that's a good question, don't know
14:56:03 <liuyulong> previous connections may stay, IMO
14:58:15 <panda> liuyulong: and have a different behaviour for the two scenarios? I think the idea here was to look for consistency
14:59:29 <panda> my personal preference is to try and maintain the old connection, but just because I found it a good entry point to experiment and learn the code :)
14:59:46 <haleyb> if we had a floating IP assigned and it got removed, conntrack gets cleaned-up, i think we should treat the default snat IP similarly - the (dis)association event flips which is used
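As a rough sketch of what "cleaning the conntrack entries" on floating IP association could look like, the helper below uses the conntrack-tools CLI inside the router namespace; it is an illustrative assumption, not the actual agent code path, and the namespace naming and arguments may differ in the real fix:

```python
import subprocess


def flush_snat_conntrack(router_id, fixed_ip):
    """Drop conntrack entries for connections the instance opened via
    the default SNAT address, so that traffic re-evaluates the NAT
    rules and starts using the newly associated floating IP.

    Illustrative only: assumes the qrouter-<uuid> namespace naming
    and the conntrack-tools CLI on the network node.
    """
    namespace = 'qrouter-%s' % router_id
    for direction in ('-s', '-d'):  # entries keyed on either side
        subprocess.call(['ip', 'netns', 'exec', namespace,
                         'conntrack', '-D', direction, fixed_ip])
```

Deleting the stale entries in both directions makes existing flows fail fast and re-establish over the floating IP, rather than hanging until the conntrack timeout expires.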
14:59:47 <mlavalle> we are running out of time
15:00:00 <mlavalle> I lean towards consistency of behavior
15:00:13 <mlavalle> #endmeeting