14:00:14 <liuyulong> #startmeeting neutron_l3
14:00:15 <openstack> Meeting started Wed Mar 4 14:00:14 2020 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:19 <openstack> The meeting name has been set to 'neutron_l3'
14:02:32 <liuyulong> Hi there
14:02:40 <liuyulong> #topic Announcements
14:03:13 <slaweq> hi
14:03:56 <liuyulong> #link https://www.openstack.org/events/opendev-ptg-2020/
14:04:59 <liuyulong> I hope I can get to Vancouver.
14:05:28 <liuyulong> I need a visa.
14:05:51 <liuyulong> I will try the community travel support.
14:07:08 <slaweq> for now we also don't know how it will go, mostly due to the coronavirus :/
14:07:37 <liuyulong> #link https://etherpad.openstack.org/p/neutron-victoria-ptg
14:09:15 <liuyulong> slaweq, maybe, but summer is coming.
14:09:27 <liuyulong> Topics are wanted! ^^
14:11:00 <liuyulong> OK, no more announcements from me.
14:11:05 <liuyulong> Let's move on.
14:11:08 <liuyulong> #topic Bugs
14:11:21 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-February/012766.html
14:11:27 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-March/012926.html
14:11:52 <liuyulong> Because I was not here last week, we have two lists now.
14:12:08 <liuyulong> First one:
14:12:09 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1864963
14:12:10 <openstack> Launchpad bug 1864963 in neutron "loosing connectivity to instance with FloatingIP randomly" [Undecided,New]
14:12:52 <liuyulong> I have left some questions about the reporter's deployment that could help us find the real problem.
14:13:35 <liuyulong> These questions are mostly based on our local deployment; we have met some issues in those areas.
14:15:47 <slaweq> thx for taking care of this
14:16:17 <liuyulong> slaweq, np
14:16:31 <liuyulong> Next one
14:16:33 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865061
14:16:34 <openstack> Launchpad bug 1865061 in neutron "When neutron does a switch-over between router 1 and router2, the router1 conntrack flows shoud be deleted" [Low,Confirmed]
14:17:10 <slaweq> that is something our QE team found during testing
14:17:36 <slaweq> but it can be a problem only if the router fails over twice in a short period of time
14:17:57 <slaweq> and that's why it's set to Low importance
14:18:02 <liuyulong> Yes, that is my question: how could that "twice" happen in the real world?
14:18:10 <liuyulong> https://bugs.launchpad.net/neutron/+bug/1865061/comments/1
14:18:11 <openstack> Launchpad bug 1865061 in neutron "When neutron does a switch-over between router 1 and router2, the router1 conntrack flows shoud be deleted" [Low,Confirmed]
14:18:30 <liuyulong> We have "non-preemptive" settings for the HA router's keepalived.
14:19:09 <liuyulong> So typically the "new master" should keep working then.
14:19:28 <liuyulong> The connections on the original host should all be broken.
14:19:47 <slaweq> exactly, so I reported it there "just for the record" that such an issue can theoretically happen
14:20:08 <slaweq> but in fact that probably shouldn't be an issue in the real world
14:23:04 <liuyulong> An extreme case is when the HA network is not stable. That could cause the HA router state to change rapidly. For deployments running HA routers on hypervisors, a bad connection state could be a potential cause.
14:24:06 <liuyulong> That could be another story.
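[editor's note: for context on bug 1865061, the issue discussed above is that conntrack entries left on the old master are not cleaned up after a switch-over. The sketch below only illustrates that idea -- flushing the router namespace's conntrack table when the node transitions to backup -- and is not the actual Neutron code; the helper name and the direct subprocess call are assumptions.]

    import subprocess

    def flush_router_conntrack(router_id):
        """Flush conntrack entries in an HA router's namespace (illustrative only).

        Intended to be called when keepalived reports the transition to the
        BACKUP state, so stale flows from the old master do not linger.
        """
        namespace = 'qrouter-%s' % router_id  # standard neutron router namespace name
        # 'conntrack -D' with no filter deletes every tracked connection in the
        # namespace; it exits non-zero when there was nothing to delete, so do
        # not treat that as an error.
        subprocess.run(
            ['ip', 'netns', 'exec', namespace, 'conntrack', '-D'],
            check=False)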
14:24:22 <liuyulong> OK, next one
14:24:33 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865891
14:24:34 <openstack> Launchpad bug 1865891 in neutron "Race condition during removal of subnet from the router and removal of subnet" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
14:24:48 <slaweq> yes, that's the one I'm working on now
14:25:27 <slaweq> it seems that sometimes if You plug a subnet into the router and remove the subnet in parallel, Your router port will end up as a port without fixed_ips
14:25:31 <liuyulong> Alright
14:25:36 <liuyulong> see my comment here:
14:25:36 <liuyulong> https://bugs.launchpad.net/neutron/+bug/1865891/comments/2
14:25:38 <openstack> Launchpad bug 1865891 in neutron "Race condition during removal of subnet from the router and removal of subnet" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
14:25:57 <liuyulong> I can imagine another one: add a port as a router interface and concurrently delete the port.
14:26:23 <slaweq> I agree that maybe we will need to close it as "wontfix"
14:26:51 <slaweq> but first I want to dig a bit more and see what can be done there
14:28:11 <liuyulong> yes, it is indeed an issue. We just need to find a balance. : )
14:28:42 <liuyulong> OK, next one
14:28:43 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865173
14:28:44 <openstack> Launchpad bug 1865173 in neutron "Revision number not bumped after update of router's description" [Low,Confirmed]
14:29:33 <liuyulong> I tested on stable/queens; it is not reproducible there.
14:29:55 <slaweq> I was testing this on the master branch
14:31:58 <liuyulong> Alright, so a regression in the router revision number.
14:32:08 <slaweq> probably
14:32:25 <slaweq> but I saw it only when I tried to update the router's description
14:32:59 <liuyulong> Interesting...
14:33:00 <slaweq> anyway, it's nothing really critical so I think it can stay in our backlog until someone has some time to take a look at it
14:33:24 <liuyulong> np, makes sense to me
14:33:38 <liuyulong> Next one:
14:33:40 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865557
14:33:41 <openstack> Launchpad bug 1865557 in neutron "Error reading log file from 'neutron-keepalived-state-change' in 'test_read_queue_send_garp'" [Low,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:33:59 <ralonsoh> Just a logging problem
14:34:08 <liuyulong> The fix is simple; it just fails the test case rather than raising an exception.
14:34:08 <ralonsoh> I found the problem only once, in a test
14:34:15 <ralonsoh> as commented in the bug
14:34:25 <ralonsoh> no no, we need to raise the exception
14:34:26 <liuyulong> So I've +2ed that.
14:34:57 <liuyulong> https://review.opendev.org/#/c/710850/1/neutron/tests/functional/agent/l3/test_keepalived_state_change.py
14:34:58 <ralonsoh> ok, not an exception but a fail (the same effect)
14:35:03 <ralonsoh> yes, I know
14:35:22 <ralonsoh> because we are executing a test, it's better to use self.fail
14:35:31 <ralonsoh> but the core of this patch is the extra logging
14:35:38 <liuyulong> OK, maybe I was not clear here.
14:36:07 <liuyulong> The fix just fails the test case instead of raising an exception.
14:36:19 <ralonsoh> the effect is the same
14:36:28 <liuyulong> Yes
14:36:34 <ralonsoh> the point is to increase the log info
14:36:44 <ralonsoh> now we have the device list with the IP addresses
14:36:50 <ralonsoh> inside the test namespace
14:37:36 <liuyulong> ralonsoh, great, thanks for working on this.
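[editor's note: a minimal sketch of the kind of change discussed for bug 1865557, assuming the behaviour ralonsoh describes: when the 'neutron-keepalived-state-change' log file cannot be read, dump the devices and IP addresses present in the test namespace and call self.fail instead of raising. The class name and the self.log_file / self.namespace attributes are placeholders, not the exact code in review 710850.]

    import unittest

    from neutron.agent.linux import ip_lib

    class KeepalivedStateChangeLogTest(unittest.TestCase):
        # Illustrative stand-in for the real functional test in
        # neutron/tests/functional/agent/l3/test_keepalived_state_change.py;
        # self.log_file and self.namespace would be set up by the test fixture.

        def _read_state_change_log(self):
            try:
                with open(self.log_file) as f:
                    return f.read()
            except OSError:
                # Collect the devices and their IP addresses inside the test
                # namespace so the job log has enough context to debug, then
                # fail the test instead of letting the exception bubble up.
                devices = ip_lib.IPWrapper(
                    namespace=self.namespace).get_devices()
                info = {dev.name: dev.addr.list() for dev in devices}
                self.fail('Could not read %s; devices in namespace %s: %s' %
                          (self.log_file, self.namespace, info))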
14:37:43 <ralonsoh> yw
14:38:34 <liuyulong> Alright, that's all the bugs from me today.
14:38:41 <slaweq> I would like to talk about one more as well
14:38:43 <slaweq> https://bugs.launchpad.net/neutron/+bug/1859832
14:38:44 <openstack> Launchpad bug 1859832 in neutron "L3 HA connectivity to GW port can be broken after reboot of backup node" [Medium,In progress] - Assigned to LIU Yulong (dragon889)
14:39:02 <liuyulong> OK
14:39:03 <slaweq> and those 2 alternative solutions proposed by me and liuyulong for it
14:39:50 <slaweq> liuyulong: generally, in Your approach I'm afraid of those errors about failing to send GARPs during failover
14:40:43 <slaweq> and the second potential issue IMO is whether we will increase downtime during failover, as the neutron-l3-agent has to be notified that the failover happened and then bring the gateway up
14:40:58 <slaweq> so 2 questions:
14:41:18 <slaweq> 1. do You know if there is any way to delay sending of the first GARP, to avoid those errors from keepalived?
14:41:53 <slaweq> 2. You said that You tested it in Your cloud; how long is the downtime during failover with and without this patch?
14:42:41 <liuyulong> I replied to the comments in the patch set. Allow me to quote it here:
14:42:45 <liuyulong> We have run such code for a few months, and no issue was found in the related logs. Keepalived will send GARPs after a 60s delay by default [1]; by then the L3 agent should have done the qg-dev link-up action. One more detail: during the first phase of keepalived GARPs, we should not send GARPs with no interval; they could have a 1 second delay (vrrp_garp_interval [2]).
14:42:45 <liuyulong> [1] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py#L165
14:42:45 <liuyulong> [2] https://www.keepalived.org/manpage.html
14:43:47 <liuyulong> The answer to your first question could be: vrrp_garp_interval.
14:44:18 <liuyulong> The link-up action is really quick; we have not seen any side effects from it.
14:44:49 <slaweq> it's quick, but if the router has many other things to do, isn't it queued to be processed like other events?
14:45:00 <liuyulong> Another point is that the outside world also has ARP.
14:45:04 <slaweq> e.g. if there were many routers failing over at the same time
14:45:49 <liuyulong> The HA state change does not have a queue.
14:46:01 <liuyulong> It's not like the L3-agent main processing loop.
14:46:32 <slaweq> ok, but can we maybe move this "set device up" action to the neutron-keepalived-state-change monitor process?
14:46:53 <slaweq> so it would be done just after keepalived configures the VIP in the namespace
14:47:21 <liuyulong> That "enqueue_state_change" actually does not have a "queue"; it's just a list of functions.
14:48:26 <slaweq> yes, but how about doing it here: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/keepalived_state_change.py#L89
14:48:37 <ralonsoh> slaweq, are we going to add networking capabilities to the neutron-keepalived-state-change agent??
14:48:45 <ralonsoh> slaweq, I do not recommend it
14:48:56 <ralonsoh> this should be just a monitoring process
14:49:39 <slaweq> ralonsoh: look at the comment in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L166
14:49:52 <slaweq> according to it, there were already such plans some time ago :)
14:49:56 <liuyulong> That could be a heavy change.
14:50:10 <ralonsoh> I still don't recommend it
14:50:39 <ralonsoh> we'll have another service changing the net devices
14:50:51 <ralonsoh> this should be done in only one process: the L3 agent
14:51:04 <liuyulong> We would need to pass router info from the l3-agent process to the other monitor process.
14:51:13 <slaweq> we already have keepalived, which is also changing those interfaces
14:51:22 <ralonsoh> yes
14:51:39 <ralonsoh> but that is an external process, not managed/programmed by us
14:52:13 <slaweq> anyway, I really need to move forward with one of those potential fixes for this issue :)
14:52:20 <ralonsoh> I know
14:52:31 <slaweq> so first we should decide which one and then continue working on it
14:53:33 <liuyulong> I prefer one fix for all drivers.
14:53:50 <slaweq> liuyulong: yes, that's an advantage of Your approach for sure
14:53:56 <ralonsoh> I still don't have a clear idea
14:54:05 <ralonsoh> sorry
14:54:18 <slaweq> what I'm afraid of is that this may cause a longer failover time
14:54:48 <slaweq> but apart from that, I think liuyulong's idea may really be better as it's more generic
14:54:57 <liuyulong> And an L3 issue should be handled in its own scope by default.
14:55:17 <liuyulong> slaweq, I guess you have a QA team, as you mentioned earlier in this meeting. : )
14:55:33 <slaweq> so ralonsoh, what do You think about continuing with liuyulong's patch?
14:56:05 <liuyulong> We also have a QA team; I will try to make sure they fully test the failover time.
14:56:15 <ralonsoh> I still need to check both again
14:56:34 <slaweq> ralonsoh: ok, thx
14:56:38 <slaweq> please check them
14:56:47 <liuyulong> Another thing: I will try to add that "vrrp_garp_interval" to the VRRP config of the HA router.
14:57:09 <slaweq> liuyulong: and one more comment on this: can You remove the config option from it? I don't think we really need such a config option there
14:57:18 <liuyulong> It will be an independent change.
14:57:35 <slaweq> IMO this is an internal implementation detail of HA routers and it shouldn't be configurable
14:57:36 <liuyulong> slaweq, sure
14:58:05 <slaweq> ok, liuyulong, please ping me when You add this vrrp_garp_interval option
14:58:12 <slaweq> I will test it again in my env
14:58:18 <slaweq> and thx for working on this
14:58:24 <liuyulong> slaweq, the config option is for our local cloud; our operators would like to know about the code changes in our cloud.
14:58:37 <liuyulong> slaweq, np
14:58:46 <slaweq> ok, that's all from my side
14:58:49 <slaweq> thx
14:59:02 <liuyulong> All right, we are out of time.
14:59:12 <liuyulong> Let's end here.
14:59:23 <liuyulong> Thank you guys for attending.
14:59:25 <liuyulong> Bye
14:59:27 <ralonsoh> bye
14:59:31 <liuyulong> #endmeeting
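[editor's note: for readers unfamiliar with the keepalived settings discussed for bug 1859832, here is a minimal sketch of the kind of VRRP configuration involved: 'nopreempt' is the non-preemptive setting liuyulong mentions, 'garp_master_delay 60' is the 60-second GARP delay referenced at keepalived.py#L165, and the global 'vrrp_garp_interval' is the knob liuyulong plans to add to space out GARPs. The builder function, priority value and exact layout are illustrative assumptions, not Neutron's actual config template.]

    # A minimal sketch with simplified inputs; not neutron's real keepalived
    # config generator (neutron/agent/linux/keepalived.py).
    def render_ha_keepalived_config(ha_device, vr_id, vips, garp_interval=1.0):
        lines = [
            'global_defs {',
            # Interval (seconds) between gratuitous ARP messages on an interface.
            '    vrrp_garp_interval %s' % garp_interval,
            '}',
            'vrrp_instance VR_%s {' % vr_id,
            '    state BACKUP',
            '    interface %s' % ha_device,
            '    virtual_router_id %s' % vr_id,
            '    priority 50',
            # Delay before the second burst of gratuitous ARPs after becoming MASTER.
            '    garp_master_delay 60',
            # Non-preemptive: a recovered higher-priority node stays BACKUP.
            '    nopreempt',
            '    virtual_ipaddress {',
        ]
        lines += ['        %s' % vip for vip in vips]
        lines += ['    }', '}']
        return '\n'.join(lines)

    # Example (hypothetical device/VIP names):
    # print(render_ha_keepalived_config('ha-1234', 1, ['192.168.0.10/24 dev qg-5678']))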