14:00:14 #startmeeting neutron_drivers
14:00:14 Meeting started Fri Mar 25 14:00:14 2022 UTC and is due to finish in 60 minutes. The chair is lajoskatona1. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 The meeting name has been set to 'neutron_drivers'
14:00:17 o/
14:00:18 o/
14:00:35 hi
14:00:47 hello
14:00:54 o/
14:00:57 Hello
14:01:17 o/
14:01:35 hi!
14:02:09 So first topic for today (again :-))):
14:02:10 hi
14:02:18 Gratuitous ARPs are not sent during master transition: (#link https://bugs.launchpad.net/neutron/+bug/1952907)
14:02:32 We have some patches (which I know about):
14:02:36 #link https://review.opendev.org/c/openstack/neutron/+/821434
14:02:40 #link https://review.opendev.org/c/openstack/neutron/+/821433
14:02:44 #link https://review.opendev.org/c/openstack/neutron/+/712474
14:03:43 last time we discussed it, damiandabrowski[m] and trident promised that they would try to test the issue more
14:04:36 yes, i spent some time on this
14:04:52 in my bug report i mentioned that this bug may have already been fixed by keepalived: https://github.com/acassen/keepalived/commit/72c4e54e9f2a9b5081814e549c47c7fc58b82df0
14:05:07 but AFAIK we don't use vmac interfaces so it's not relevant
14:07:04 and what's most important: i think keepalived has nothing to do with it, as slaweq's bug report (the reason why we keep the qg- interface down on backup routers) describes the problem pretty well
14:07:04 To clarify, we originally thought that the keepalived fix actually remedied https://review.opendev.org/c/openstack/neutron/+/707406 - which is what causes https://bugs.launchpad.net/neutron/+bug/1952907
14:07:36 damiandabrowski[m]: ok, so it seems that we need some code change on the Neutron side?
14:08:35 i think yes, https://bugs.launchpad.net/neutron/+bug/1859832 is slaweq's bug report; the problem starts when neutron flushes the link-local address on the qg- interface, then the interface sends a multicast unsubscribe
14:08:42 So as damiandabrowski[m] says, just reverting https://review.opendev.org/c/openstack/neutron/+/707406 won't be a solution as we initially thought.
14:09:18 i'm not sure how we can fix this issue, but maybe we can do something else rather than just "flush link-local ip"
14:09:45 or maybe we can prevent sending multicast unsubscribe packets, maybe with some sysctl parameters?
14:10:11 i just think we need to think about a better solution for this bug: https://bugs.launchpad.net/neutron/+bug/1859832
14:11:00 as the current solution (keeping qg- interfaces down on backup routers) solved one problem but created another one
14:11:12 there was a different solution for that bug: https://review.opendev.org/c/openstack/neutron/+/702856/
14:11:27 but we decided to go with liu's patch which was a simpler approach (only in the L3 agent)
14:12:09 just asking, none of the proposed patches/strategies are valid for you?
14:12:58 it may be worth considering https://review.opendev.org/c/openstack/neutron/+/702856/ as well
14:13:13 solutions 2. and 3. from my bug report https://bugs.launchpad.net/neutron/+bug/1952907 should work fine (at least for ipv4, not sure about ipv6)
14:13:35 in keepalived there is a setting vrrp_garp_master_repeat - how about setting it to e.g. 2 or 3 seconds?
14:13:37 but ideally, we would want to drop keepalived's dependency on the neutron control plane during failovers
14:13:50 so keepalived would try a bit later to send a second set of GARPs
14:14:14 it won't help, if keepalived won't send another GARP if the first one fails
14:14:32 i tried using different keepalived options but couldn't get it working this way
14:14:56 *it won't help, keepalived won't send more GARPs if the first set fails
14:15:24 To be honest I don't really like https://review.opendev.org/c/openstack/neutron/+/707406 ... It suddenly moves neutron into the failover scenario, which now requires neutron agents to be fully working and may take some additional time during failover, as states have to be detected and ports brought up by neutron after a failover. If 500 routers fail over at the same time, that may actually be significant time compared to just a VRRP failover.
14:15:38 in the past we had such code in keepalived-state-change-monitor to send garps there
14:15:44 but we removed it with https://review.opendev.org/c/openstack/neutron/+/752360
14:15:59 ^^ could be an option
14:16:27 yes, one of my proposed solutions is to bring this feature back, but it's not ideal (i agree with trident): https://review.opendev.org/c/openstack/neutron/+/821433
14:16:58 right
14:17:09 we removed that because we saw an issue where basically many garps sent by many routers were killing openvswitch
14:17:18 ralonsoh probably remembers that issue well :)
14:17:28 yes, with 200 virtual routers in one compute
14:17:30 one controller
14:18:07 I wonder if https://bugs.launchpad.net/neutron/+bug/1859832 could be solved by just filtering MLDv2, or if we could even disable IPv6 on the interface initially so we don't get a link-local address before keepalived is started etc...
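For reference on the keepalived tuning idea floated above: these are keepalived's documented GARP knobs, with illustrative values only (the discussion concluded this path does not help, since keepalived will not retry if the first set of GARPs is lost). Note that vrrp_garp_master_repeat is a count of GARPs, not a time; the time-based settings are vrrp_garp_master_delay and vrrp_garp_master_refresh.

```conf
# Sketch of keepalived GARP tuning (global_defs), illustrative values only.
global_defs {
    vrrp_garp_master_repeat 3    # number of GARPs sent per interface on transition to MASTER
    vrrp_garp_master_delay 2     # seconds after becoming MASTER before sending a second set
    vrrp_garp_master_refresh 5   # keep re-sending GARPs every 5 seconds while MASTER (0 = off)
}
```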
14:18:10 so in this scenario we need to keep neutron involved in the failover business
14:18:44 if there were a way to disable sending garps by keepalived and do it only from neutron-keepalived-state-change or the l3 agent, that would IMO be the best solution
14:19:31 trident, I agree with investigating an option to just filter the "wrong" traffic. In this case MLDv2 packets
14:19:40 trident: IIRC that MLDv2 issue couldn't be solved by disabling IPv6 on the interface, as when You enabled/disabled IPv6 those packets were sent immediately
14:19:41 and reverting the patch to set the interfaces down
14:19:44 slaweq: true, that is not dependent on 3rd party version
14:20:01 but filtering the traffic?
14:20:29 so the only option was to bring interfaces down or block them e.g. in OVS (what patch https://review.opendev.org/c/openstack/neutron/+/702856/ proposed)
14:21:19 i only wonder how keepalived (used outside openstack) solves this issue, as basically it may be affected by the same issue when rebooting a backup keepalived (but i don't have an answer for this ATM)
14:21:19 crazy idea... in damiandabrowski[m]'s original report he mentions "When I add random port to this router to trigger keepalived's reload, then all GARPs are sent properly". Can we force this in Neutron?
14:22:01 I mean the reload
14:22:12 mlavalle: I don't see a reason why; it would be easier to revert https://review.opendev.org/c/openstack/neutron/+/752360 and send additional garps from neutron directly
14:22:33 yeah I fear that was considered at one time as a bug in keepalived....
14:23:04 I said it was a crazy idea... brainstorming
14:23:18 mlavalle: +1
14:23:29 I do have a slight memory of seeing a proposed fix (that I think was abandoned) for that issue, or some other similar issue, which actually filtered the problematic traffic, quite some time ago. I have however not been able to locate it. Does it ring any bells for anyone?
14:23:30 :)
14:24:13 trident: You mean https://review.opendev.org/c/openstack/neutron/+/702856/ ?
14:24:42 slaweq: Nope, I think it was iptables or ebtables...
14:25:18 trident: but we can't filter that traffic completely as it may break some IPv6 functionality in the "master" router, no?
14:26:07 slaweq: But instead of taking the interface down completely as we do today, we could apply filters, and then instead of bringing the whole interface up, we could remove the filters.
14:27:03 trident: You mean apply iptables filters when the router is going to be "backup" and then remove them when the router is going to be "primary"?
14:27:31 instead of doing "ip link set down" and "ip link set up" like we have now
14:27:47 slaweq: Yes. Note that I have not done any testing on that idea though, I just thought of it an hour or so before this meeting.
14:28:09 trident: I think that can work
14:28:26 TBH I don't remember now why I was trying to block it in OVS and not iptables
14:28:46 maybe there was some reason for that, or maybe it was just that I was somehow biased then :)
14:29:13 IMO worth a try
14:29:58 so we have two working hypotheses: 1) have neutron do the garps 2) apply and remove filters
14:30:05 is that a good summary?
14:30:25 mlavalle: thanks, that's my understanding too
14:30:26 mlavalle: for 1) I would be ok with it only if we can somehow disable garps in keepalived
14:30:47 as we had issues with ovs when there were too many garps sent
14:30:52 so we should avoid that IMO
14:30:53 slaweq: yeah, that was my assumption. I just wanted to be brief in the summary
14:31:00 ++
14:31:39 2) can also solve frequent problems we have, at least in the CI, with race conditions when failing over
14:32:15 ralonsoh: yeah, hopefully :)
14:32:18 so if we have to pick the most promising alternative, is 2) the one?
14:32:29 so I would really give solution 2) a try first
14:33:02 sounds like a good alternative
14:33:48 trident, damiandabrowski[m]: are you ok with this plan? will you propose a patch?
14:34:33 or 3) do not engage neutron in keepalived's failover process at all? it's probably the best solution, but also the hardest one
14:34:53 as if i understand correctly, 1) and 2) are still dependent on neutron
14:35:12 yes, it's true
14:35:16 damiandabrowski: I'm not sure if I understand correctly, can You elaborate on that one?
14:35:50 I am afraid I might not be familiar enough with the code base to propose such a patch within a reasonable time :/ I would however be most willing to help with testing!
14:36:22 separation of concerns is always ideal... life is messy, though ;-)
14:37:33 slaweq: because currently neutron is responsible for keeping interfaces up/down; with 1) neutron will send garps, with 2) neutron will be responsible for updating iptables rules
14:37:48 noonedeadpunk: Do you also have some thoughts about the listed approaches?
14:38:01 it makes keepalived's failover process fully dependent on neutron, which is not an ideal situation
14:38:31 because technically, keepalived should be able to failover even when neutron is not functioning properly
14:38:46 I'm not a huge expert in the code, but updating iptables sounds like an idea. Especially if that helps out with race conditions in CI...
14:39:00 damiandabrowski[m], you are right on this, but then keepalived should be able to block this traffic
14:39:07 noonedeadpunk: thanks
14:39:30 so what you say is: once the router is set up, it should function on its own regardless of what happens to the neutron control plane
14:39:31 ralonsoh: Which it probably _could_ not, as this, if I understand it correctly, happens before keepalived is even started?
14:39:59 ralonsoh: When neutron flushes the link-local address from the interface and adds it to the keepalived config.
14:40:05 but the issue happens on router setup indeed?
14:40:33 which kind of makes sense, at least to me, that neutron is responsible for the setup of l3
14:40:49 but I guess then keepalived manages itself anyway?
14:41:17 not exactly, the original issue (reported by slaweq) happens when You reboot a backup keepalived
14:41:31 ouch
14:41:33 ok
14:41:56 Hm, perhaps we could make the filters apply only between "router startup" and when keepalived has been started? As long as the link-local address flush, the addition to the keepalived config and the start of keepalived all happen while the filter is applied, it would be fine I guess.
14:42:29 so block traffic only until keepalived is in charge
14:43:04 ah, i think You're right
14:44:57 so yeah, swapping https://review.opendev.org/c/openstack/neutron/+/707406/ for some 'iptables filtering solution' sounds like an idea to me
14:44:58 I think that is exactly what patch https://review.opendev.org/c/openstack/neutron/+/702856 was doing, but with OVS instead of iptables rules
14:45:14 slaweq: +1
14:45:15 but iptables filtering may be better here
14:46:43 slaweq: With the difference that it will filter all traffic.
14:47:31 trident: but only when the router is created, not later during failover, right?
14:48:04 yes, afterwards we would let keepalived do its job
14:48:15 ++
14:48:21 slaweq: Oh, that might be true actually. In that case it might actually be a great solution!
14:49:20 ralonsoh put it very clearly: block traffic only until keepalived is in charge
14:50:01 trident, damiandabrowski[m]: do you think you can work on it? Or just testing?
14:50:49 i can help with testing but i don't feel confident to write a patch :/
14:51:39 let me talk to my manager to see if we can spend time on this patch
14:52:26 ralonsoh: thanks, I don't know how often we see effects of this in the CI
14:52:44 so we can also have a few minutes during the PTG if you think so
14:53:08 for sure (btw, I have good feedback so I'll take care of it)
14:53:39 ++
14:53:43 ralonsoh++ thx
14:53:48 ok, as I see we have a plan, is that ok for everybody?
14:53:48 thanks
14:53:53 +1
14:53:57 +1
14:53:58 Great ralonsoh, thanks! I'll also be available for testing and review!
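The agreed approach ("block traffic only until keepalived is in charge") could look roughly like the sketch below. This is a hand-wavy illustration, not the eventual patch: the router namespace and qg- port names are placeholders, and whether plain iptables/ip6tables, ebtables or OVS rules end up being used was left open in the meeting.

```shell
# Hypothetical sketch: instead of "ip link set down" on the backup router's
# qg- interface, drop its egress traffic while neutron sets the router up,
# then remove the filter once keepalived owns the interface.
# "qrouter-X" and "qg-X" are placeholders for the real namespace/port names.

# 1) before configuring the port: block everything it would send, including
#    the MLDv2 listener reports (ICMPv6 type 143) that leak when neutron
#    flushes the link-local address
ip netns exec qrouter-X ip6tables -A OUTPUT -o qg-X -j DROP
ip netns exec qrouter-X iptables -A OUTPUT -o qg-X -j DROP

# 2) flush the link-local address and write the keepalived config
#    (done by the L3 agent today)

# 3) once keepalived has started and is in charge, remove the filters
ip netns exec qrouter-X ip6tables -D OUTPUT -o qg-X -j DROP
ip netns exec qrouter-X iptables -D OUTPUT -o qg-X -j DROP
```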
14:54:07 +1
14:54:17 awesome, thanks for Your engagement everyone!
14:54:36 yeah thanks everybody :-)
14:54:55 Ok next topic:
14:54:57 #topic On Demand Agenda
14:55:02 actually it's been a very enlightening conversation
14:55:10 (ralonsoh): #link https://bugs.launchpad.net/neutron/+bug/1964575
14:55:17 mlavalle: I agree
14:55:31 very quick: this is related to the sqlalchemy 2.0 migration
14:55:48 the session transaction, in 2.0, will be created and never finished
14:56:09 another side effect is that we don't have implicit transaction creation, commit and deletion
14:56:32 this is why it is a MUST, whenever we do a DB operation, to always open a context (reader/writer)
14:56:40 this is, in a nutshell, the summary
14:56:50 ralonsoh: thanks
14:57:10 that's all, thanks
14:57:18 this is the patch you mentioned on the wiki: https://review.opendev.org/c/openstack/neutron/+/833247
14:57:25 exactly
14:57:39 and then we'll have the n-lib patch to set autocommit=False
14:57:51 (that will enable the 2.0 feature)
14:57:56 cool
14:58:00 thank you
14:58:02 ralonsoh: so another "migration to new engine facade"-like story for a couple of cycles? :)
14:58:10 pfffff kind of !!!
14:58:23 :-)
14:58:45 ok so we will have a topic for the next cycles then :P
14:58:54 lajoskatona1: yeah :)
14:59:01 lol
14:59:08 I think so... we'll need to keep an eye on this
14:59:21 If there is nothing more on this, I have 2 small topics
14:59:30 shouldn't this be a community wide goal then?
14:59:37 or is only neutron affected by that change?
15:00:12 community-wide (but this should be raised by the oslo.db cores in the TC)
15:00:20 we'll have time for this
15:00:22 lajoskatona1, please
15:00:24 slaweq: I think I saw nova changes as well
15:00:33 ralonsoh: ok, thanks
15:00:42 PTG etherpad: #link https://etherpad.opendev.org/p/neutron-zed-ptg
15:00:49 please add your topics :-)
15:01:12 * mlavalle will be on PTO next week, so won't attend this meeting
15:01:49 mlavalle: the PTG?
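ralonsoh's SQLAlchemy 2.0 point above — that implicit transaction creation/commit goes away, so every DB operation must open an explicit reader/writer context — can be illustrated with a stdlib-only toy. The names writer_ctx and the sqlite3 backend are illustrative, not neutron's actual API (neutron uses oslo.db's enginefacade, e.g. the CONTEXT_READER/CONTEXT_WRITER contexts from neutron-lib); this sketch only mimics the "no work outside an explicit context" pattern.

```python
import sqlite3
from contextlib import contextmanager

# Toy illustration of the "always open an explicit reader/writer context"
# rule discussed for the SQLAlchemy 2.0 migration. isolation_level=None
# disables sqlite3's implicit transaction handling, mimicking 2.0's
# removal of library-level autocommit.
conn = sqlite3.connect(":memory:", isolation_level=None)

@contextmanager
def writer_ctx():
    # Explicit BEGIN/COMMIT: nothing is persisted outside a declared
    # transaction, and the context decides commit vs rollback.
    conn.execute("BEGIN")
    try:
        yield conn
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

with writer_ctx() as db:
    db.execute("CREATE TABLE routers (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO routers (name) VALUES ('r1')")

rows = conn.execute("SELECT name FROM routers").fetchall()
print(rows)  # [('r1',)]
```

The design point is the same one ralonsoh makes: code that previously relied on an implicit commit must be wrapped in a context that owns the transaction's lifetime.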
15:02:24 The other one is that there will be an openinfra episode next week for cycle highlights
15:02:38 there is a slide set: #link https://docs.google.com/presentation/d/1PhTzrIUo6kfCzdoPO5V3xgQCt6ktqGP3Q1y6BXMryzg/edit#slide=id.g11ee73f15e5_1_4
15:02:53 if you have time please check it, I added a few sentences for Neutron
15:02:55 I will attend the PTG on the week of April 4 - 8
15:03:02 mlavalle: ok
15:03:07 I will just be on PTO this coming week
15:03:28 so if there is a drivers meeting on the 1st, I won't attend
15:03:29 For next week I plan to cancel both the team meeting and the drivers meeting, as the week after is the PTG
15:03:38 great
15:03:41 then nvm
15:03:42 mlavalle: ok, thanks
15:03:58 and that's it from me, thanks for attending
15:04:05 thanks!
15:04:06 o/
15:04:08 #endmeeting