14:00:14 <lajoskatona1> #startmeeting neutron_drivers
14:00:14 <opendevmeet> Meeting started Fri Mar 25 14:00:14 2022 UTC and is due to finish in 60 minutes.  The chair is lajoskatona1. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:14 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 <opendevmeet> The meeting name has been set to 'neutron_drivers'
14:00:17 <mlavalle> o/
14:00:18 <lajoskatona1> o/
14:00:35 <yamamoto> hi
14:00:47 <ralonsoh> hello
14:00:54 <noonedeadpunk> o/
14:00:57 <trident> Hello
14:01:17 <slaweq> o/
14:01:35 <damiandabrowski[m]> hi!
14:02:09 <lajoskatona1> So first topic for today (again :-))):
14:02:10 <haleyb> hi
14:02:18 <lajoskatona1> Gratuitous ARPs are not sent during master transition: (#link https://bugs.launchpad.net/neutron/+bug/1952907)
14:02:32 <lajoskatona1> We have some patches (which I know about):
14:02:36 <lajoskatona1> #link https://review.opendev.org/c/openstack/neutron/+/821434
14:02:40 <lajoskatona1> #link https://review.opendev.org/c/openstack/neutron/+/821433
14:02:44 <lajoskatona1> #link https://review.opendev.org/c/openstack/neutron/+/712474
14:03:43 <lajoskatona1> when we discussed this last time, damiandabrowski[m] and trident promised to try to test the issue some more
14:04:36 <damiandabrowski[m]> yes, i spent some time on this
14:04:52 <damiandabrowski[m]> in my bug report i mentioned that this bug may be already fixed by keepalived: https://github.com/acassen/keepalived/commit/72c4e54e9f2a9b5081814e549c47c7fc58b82df0
14:05:07 <damiandabrowski[m]> but AFAIK we don't use vmac interfaces so it's not relevant
14:07:04 <damiandabrowski[m]> and what's most important: i think keepalived has nothing to do with it, as slaweq's bug report (the reason why we keep the qg- interface down on backup routers) describes the problem pretty well
14:07:04 <trident> To clarify, we originally thought that the keepalived fix actually remedied the issue behind https://review.opendev.org/c/openstack/neutron/+/707406 - which is what causes https://bugs.launchpad.net/neutron/+bug/1952907
14:07:36 <lajoskatona1> damiandabrowski[m]: ok, so it seems that we need some code change on Neutron side?
14:08:35 <damiandabrowski[m]> i think yes, https://bugs.launchpad.net/neutron/+bug/1859832 is slaweq's bug report; the problem starts when neutron flushes the link-local address on the qg- interface, then the interface sends a multicast unsubscribe
14:08:42 <trident> So as damiandabrowski[m] says, just reverting https://review.opendev.org/c/openstack/neutron/+/707406 won't be a solution as we initially thought.
14:09:18 <damiandabrowski[m]> i'm not sure how we can fix this issue, but maybe we can do something else rather than just "flush link-local ip"
14:09:45 <damiandabrowski[m]> or maybe we can prevent sending multicast unsubscribe packets, maybe with some sysctl parameters?
14:10:11 <damiandabrowski[m]> i just think we need to think about a better solution for this bug: https://bugs.launchpad.net/neutron/+bug/1859832
14:11:00 <damiandabrowski[m]> as the current solution (keeping qg- interfaces down on backup routers) solved one problem but created another one
14:11:12 <slaweq> there was different solution for that bug https://review.opendev.org/c/openstack/neutron/+/702856/
14:11:27 <slaweq> but we decided to go with liu's patch, which was a simpler approach (only in the L3 agent)
14:12:09 <ralonsoh> just asking, none of the proposed patches/strategies are valid for you?
14:12:58 <damiandabrowski[m]> it may be worth considering https://review.opendev.org/c/openstack/neutron/+/702856/ as well
14:13:13 <damiandabrowski[m]> solutions 2. and 3. from my bug report https://bugs.launchpad.net/neutron/+bug/1952907 should work fine (at least for ipv4, not sure about ipv6)
14:13:35 <slaweq> in keepalived there is a setting vrrp_garp_master_repeat - how about setting it to e.g. 2 or 3 seconds?
14:13:37 <damiandabrowski[m]> but ideally, we would want to drop keepalived's dependency on the neutron control plane during failovers
14:13:50 <slaweq> so keepalived would try a bit later to send second set of GARPs
14:14:14 <damiandabrowski[m]> it won't help, keepalived won't send another set of GARPs if the first one fails
14:14:32 <damiandabrowski[m]> i tried using different keepalived options but couldn't get it working this way
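For reference, the knobs being discussed live in keepalived's global_defs section. A minimal sketch, assuming a reasonably recent keepalived (the option names are real keepalived options; the values are purely illustrative):

```
# keepalived.conf (global_defs) - illustrative values only
global_defs {
    vrrp_garp_master_repeat 3      # GARPs sent per batch on becoming MASTER
    vrrp_garp_master_delay 3       # seconds before a second batch is sent
    vrrp_garp_master_refresh 60    # re-send GARPs periodically while MASTER
}
```

As damiandabrowski[m] notes, these only tune how often GARP batches are sent; none of them makes keepalived retry when a batch fails to go out.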
14:15:24 <trident> To be honest I don't really like https://review.opendev.org/c/openstack/neutron/+/707406 ... It suddenly moves neutron into the failover path: failover now requires fully working neutron agents and may take additional time, since states have to be detected and ports brought up by neutron after a failover. If 500 routers fail over at the same time, that may actually be significant time compared to just a VRRP failover.
14:15:38 <slaweq> in the past we had such code in keepalived-state-change-monitor to send garps there
14:15:44 <slaweq> but we removed it with https://review.opendev.org/c/openstack/neutron/+/752360
14:15:59 <ralonsoh> ^^ could be an option
14:16:27 <damiandabrowski[m]> yes, one of my proposed solutions is to bring this feature back, but it's not ideal (i agree with trident): https://review.opendev.org/c/openstack/neutron/+/821433
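The removed code slaweq refers to used neutron's arping helper. A minimal sketch of what sending GARPs from neutron-keepalived-state-change again could look like, assuming ip_lib.send_ip_addr_adv_notif still has roughly this shape (the surrounding function and its arguments are hypothetical):

```python
# Hypothetical sketch: re-advertise addresses from the state-change monitor
# when a router transitions to primary (roughly what 752360 removed).
from neutron.agent.linux import ip_lib

def advertise_addresses(ns_name, gw_device, addresses):
    for addr in addresses:
        # send_ip_addr_adv_notif() spawns arping inside the router
        # namespace; the exact signature may differ between branches.
        ip_lib.send_ip_addr_adv_notif(ns_name, gw_device, addr)
```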
14:16:58 <ralonsoh> right
14:17:09 <slaweq> we removed that because we saw an issue where basically many GARPs sent by many routers were killing openvswitch
14:17:18 <slaweq> ralonsoh: probably remembers that issue well :)
14:17:28 <ralonsoh> yes, with 200 virtual routers in one compute
14:17:30 <ralonsoh> one controller
14:18:07 <trident> I wonder if https://bugs.launchpad.net/neutron/+bug/1859832 could be solved by just filtering MLDv2 or if we could even disable IPv6 on the interface initially to not get a link local address before keepalived is started etc...
14:18:10 <mlavalle> so in this scenario we need to keep neutron involved in the failover business
14:18:44 <slaweq> if there was a way to disable sending GARPs from keepalived and do it only from neutron-keepalived-state-change or the l3 agent, that would IMO be the best solution
14:19:31 <ralonsoh> trident, I agree with investigating an option to just filter the "wrong" traffic. In this case MLDv2 packets
14:19:40 <slaweq> trident: IIRC that MLDv2 issue couldn't be solved by disabling IPv6 on the interface, as when You enabled/disabled IPv6 those packets were sent immediately
14:19:41 <ralonsoh> and reverting the patch to set the interfaces down
14:19:44 <lajoskatona1> slaweq: true, that is not dependent on 3rd party version
14:20:01 <ralonsoh> but filtering the traffic?
14:20:29 <slaweq> so the only option was to bring interfaces down or block them e.g in OVS (what patch https://review.opendev.org/c/openstack/neutron/+/702856/ proposed)
14:21:19 <damiandabrowski[m]> i only wonder how keepalived (used outside of openstack) solves this issue, as basically it may be affected by the same issue when rebooting the backup keepalived (but i don't have an answer for this ATM)
14:21:19 <mlavalle> crazy idea... in damiandabrowski[m]'s original report he mentions "When I add random port to this router to trigger keepalived's reload, then all GARPs are sent properly". Can we force this in Neutron?
14:22:01 <mlavalle> I mean the reload
14:22:12 <slaweq> mlavalle: I don't see reason why, easier would be to revert https://review.opendev.org/c/openstack/neutron/+/752360 and send additional garps from neutron directly
14:22:33 <lajoskatona1> yeah, I fear that was considered a bug in keepalived at the time....
14:23:04 <mlavalle> I said it was crazy idea... brainstorming
14:23:18 <lajoskatona1> mlavalle: +1
14:23:29 <trident> I do have a slight memory of seeing a proposed fix (that I think was abandoned) to that issue or some other similar issue which actually filtered the problematic traffic quite some time ago. I have however not been able to locate it. Does it ring any bells for anyone?
14:23:30 <slaweq> :)
14:24:13 <slaweq> trident: You mean https://review.opendev.org/c/openstack/neutron/+/702856/ ?
14:24:42 <trident> slaweq: Nope, I think it was iptables or ebtables...
14:25:18 <slaweq> trident: but we can't filter that traffic completely as it may break some IPv6 functionality in the "master" router, no?
14:26:07 <trident> slaweq: But instead of taking the interface down completely as we do today, we could apply filters, and then instead of bringing the whole interface up, we could remove the filters.
14:27:03 <slaweq> trident: You mean apply iptables filters when the router is going to be "backup" and then remove them when the router is going to be "primary"?
14:27:31 <slaweq> instead of doing "ip link set down" and "ip link set up" like we have now
14:27:47 <trident> slaweq: Yes. Note that I have not done any testing on that idea though, I just thought of it an hour or so before this meeting.
14:28:09 <slaweq> trident: I think that can work
14:28:26 <slaweq> TBH I don't remember now why I was trying to block it in OVS not iptables
14:28:46 <slaweq> maybe there was some reason for that or maybe it was just that I was somehow biased then :)
14:29:13 <slaweq> IMO worth a try
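A minimal sketch of what such a filter could look like, going by damiandabrowski[m]'s description of bug 1859832 above. The ICMPv6 type numbers are standard (143 is the MLDv2 Multicast Listener Report, 132 the MLDv1 "Done"/unsubscribe); everything else — the helper names, and the use of raw subprocess instead of neutron's iptables_manager and rootwrap — is a hypothetical simplification:

```python
# Hypothetical sketch: block outgoing MLD on the qg- device while the
# router is BACKUP, instead of setting the device down (what 707406 does).
import subprocess

def _ip6tables(ns, *rule_args):
    # Run ip6tables inside the router namespace (requires root); real
    # neutron code would go through iptables_manager, not subprocess.
    subprocess.run(["ip", "netns", "exec", ns, "ip6tables", *rule_args],
                   check=True)

def block_mld(ns, device):
    for icmp_type in ("143", "132"):   # MLDv2 report, MLDv1 done
        _ip6tables(ns, "-A", "OUTPUT", "-o", device,
                   "-p", "icmpv6", "--icmpv6-type", icmp_type, "-j", "DROP")

def unblock_mld(ns, device):
    for icmp_type in ("143", "132"):
        _ip6tables(ns, "-D", "OUTPUT", "-o", device,
                   "-p", "icmpv6", "--icmpv6-type", icmp_type, "-j", "DROP")
```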
14:29:58 <mlavalle> so we have two working hypotheses: 1) have neutron do the garps 2) apply and remove filters
14:30:05 <mlavalle> is that a good summary?
14:30:25 <lajoskatona1> mlavalle: thanks, that's my understanding too
14:30:26 <slaweq> mlavalle: for 1) I would be ok only if we can somehow disable garps in keepalived
14:30:47 <slaweq> as we had issues with ovs when there were too many garps sent
14:30:52 <slaweq> so we should avoid that IMO
14:30:53 <mlavalle> slaweq: yeah, that was my assumption. I just wanted to be brief in the summary
14:31:00 <slaweq> ++
14:31:39 <ralonsoh> 2) can also solve frequent problems we have, at least in the CI, with race conditions when failing over
14:32:15 <slaweq> ralonsoh: yeah, hopefully :)
14:32:18 <mlavalle> so if we have to pick the most promising alternative, is 2) the one?
14:32:29 <slaweq> so I would really give a try to the solution 2) first
14:33:02 <lajoskatona1> sounds like a good alternative
14:33:48 <mlavalle> trident, damiandabrowski[m]: you ok with this plan? will you propose a patch?
14:34:33 <damiandabrowski[m]> or 3) do not engage neutron in keepalived's failover process at all? it's probably the best solution, but also the hardest one
14:34:53 <damiandabrowski[m]> as if i understand correctly, 1) and 2) still depend on neutron
14:35:12 <lajoskatona1> yes, it's true
14:35:16 <slaweq> damiandabrowski: I'm not sure if I understand correctly, can You elaborate on that one?
14:35:50 <trident> I am afraid I might not be familiar enough with the code base to propose such a patch within a reasonable time :/ I would however be most willing to help with testing!
14:36:22 <mlavalle> separation of concerns is always ideal... life is messy, though ;-)
14:37:33 <damiandabrowski[m]> slaweq: because currently neutron is responsible for keeping interfaces up/down, with 1) neutron will send garps, with 2) neutron will be responsible for updating iptables rules
14:37:48 <lajoskatona1> noonedeadpunk: Do you also have some ideas about the listed approaches?
14:38:01 <damiandabrowski[m]> it makes the keepalived failover process fully dependent on neutron, which is not an ideal situation
14:38:31 <damiandabrowski[m]> because technically, keepalived should be able to failover even when neutron is not functioning properly
14:38:46 <noonedeadpunk> I'm not a huge expert in the code, but updating iptables sounds like an idea, especially if that helps out with the race conditions in CI...
14:39:00 <ralonsoh> damiandabrowski[m], you are right on this, but then keepalived should be able to block this traffic
14:39:07 <lajoskatona1> noonedeadpunk: thanks
14:39:30 <mlavalle> so what you say is once the router is set up, it should function on its own regardless of what happens to the neutron control plane
14:39:31 <trident> ralonsoh: Which it probably _could_ not, as this, if I understand it correctly, happens before keepalived is even started?
14:39:59 <trident> ralonsoh: When neutron flushes the link-local address from the interface and adds it to the keepalived config.
14:40:05 <noonedeadpunk> but the issue happens on router setup indeed?
14:40:33 <noonedeadpunk> which kind of makes sense at least to me that neutron is responsible for setup of l3
14:40:49 <noonedeadpunk> but I guess then keepalived manages itself anyway?
14:41:17 <damiandabrowski[m]> not exactly, the original issue (reported by slaweq) happens when You reboot the backup keepalived
14:41:31 <noonedeadpunk> ouch
14:41:33 <noonedeadpunk> ok
14:41:56 <trident> Hm, perhaps we could make the filters apply only from "router startup" until keepalived has been started? As long as the link-local address flush, the addition to the keepalived config and the start of keepalived happen while the filter is applied, it would be fine I guess.
14:42:29 <ralonsoh> so block traffic only until keepalived is in charge
14:43:04 <damiandabrowski[m]> ah, i think You're right
14:44:57 <damiandabrowski[m]> so yeah, swapping https://review.opendev.org/c/openstack/neutron/+/707406/ for some 'iptables filtering solution' sounds like a good idea to me
14:44:58 <slaweq> I think that it is exactly what patch https://review.opendev.org/c/openstack/neutron/+/702856 was doing but with OVS instead of iptables rules
14:45:14 <damiandabrowski[m]> slaweq:  +1
14:45:15 <slaweq> but iptables filtering may be better here
14:46:43 <trident> slaweq: With the difference that it will filter all traffic.
14:47:31 <slaweq> trident: but only when the router is created, not later during failover, right?
14:48:04 <mlavalle> yes, afterwards we would let keepalioved do its job
14:48:15 <slaweq> ++
14:48:21 <trident> slaweq: Oh, that might be true actually. In that case it might be a great solution!
14:49:20 <mlavalle> ralonsoh put it very clearly: block traffic only until keepalived is in charge
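Putting the pieces together, the window ralonsoh describes could look roughly like the sketch below. Every helper name here is hypothetical; block_mld/unblock_mld are the assumed helpers from the earlier sketch, and the other steps just name what the chat says neutron already does today:

```python
# Hypothetical ordering sketch: the filter only needs to cover the window
# between plugging the qg- device and keepalived taking ownership of it.
def init_ha_router_gateway(ns, device):
    block_mld(ns, device)                # assumed helper from the sketch above
    flush_link_local(ns, device)         # hypothetical: the flush from bug 1859832
    add_vips_to_keepalived_conf(device)  # hypothetical: incl. the link-local addr
    start_keepalived(ns)                 # keepalived now owns the VIPs and GARPs
    unblock_mld(ns, device)              # later failovers need no neutron at all
```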
14:50:01 <lajoskatona1> trident, damiandabrowski[m]: do you think that you can work on it? Or just testing?
14:50:49 <damiandabrowski[m]> i can help with testing but i don't feel confident enough to write a patch :/
14:51:39 <ralonsoh> let me talk to my manager to see if we can spend time on this patch
14:52:26 <lajoskatona1> ralonsoh: thanks, I don't know how often we see the effects of this in the CI
14:52:44 <lajoskatona1> so we can also have a few minutes for it during the PTG if you think it's needed
14:53:08 <ralonsoh> for sure (btw, I have good feedback so I'll take care of it)
14:53:39 <mlavalle> ++
14:53:43 <slaweq> ralonsoh++ thx
14:53:48 <lajoskatona1> ok, as I see we have a plan, is that ok for everybody?
14:53:48 <mlavalle> thanks
14:53:53 <mlavalle> +1
14:53:57 <ralonsoh> +1
14:53:58 <trident> Great ralonsoh, thanks! I'll also be available for testing and review!
14:54:07 <trident> +1
14:54:17 <damiandabrowski[m]> awesome, thanks for Your engagement everyone!
14:54:36 <lajoskatona1> yeah thanks everybody :-)
14:54:55 <lajoskatona1> Ok next topic:
14:54:57 <lajoskatona1> #topic On Demand Agenda
14:55:02 <mlavalle> actually it's been a very enlightening conversation
14:55:10 <lajoskatona1> (ralonsoh): #link https://bugs.launchpad.net/neutron/+bug/1964575
14:55:17 <lajoskatona1> mlavalle: I agree
14:55:31 <ralonsoh> very quick: this is related to the sqlalchemy 2.0 migration
14:55:48 <ralonsoh> the session transaction, in 2.0, will be created and never finished
14:56:09 <ralonsoh> another side effect is that we won't have the implicit transaction creation, commit and deletion
14:56:32 <ralonsoh> this is why it is a MUST, whenever we create a DB operation, to open a context (reader/writer)
14:56:40 <ralonsoh> this is, in a nutshell, the summary
14:56:50 <lajoskatona1> ralonsoh: thanks
14:57:10 <ralonsoh> that's all, thanks
14:57:18 <lajoskatona1> this is the patch you mentioned on the wiki: https://review.opendev.org/c/openstack/neutron/+/833247
14:57:25 <ralonsoh> exactly
14:57:39 <ralonsoh> and then we'll have the n-lib patch to set autocommit=False
14:57:51 <ralonsoh> (that will enable the 2.0 feature)
14:57:56 <mlavalle> cool
14:58:00 <mlavalle> thank you
14:58:02 <slaweq> ralonsoh: so another "migration to new engine facade" like story for couple of cycles? :)
14:58:10 <ralonsoh> pfffff kind of !!!
14:58:23 <lajoskatona1> :-)
14:58:45 <lajoskatona1> ok, so we will have a topic for the next cycles then :P
14:58:54 <slaweq> lajoskatona1: yeah :)
14:59:01 <mlavalle> lol
14:59:08 <ralonsoh> I think so... we'll need to keep an eye on this
14:59:21 <lajoskatona1> If nothing more for this I have 2 small topics
14:59:30 <slaweq> shouldn't this be a community wide goal then?
14:59:37 <slaweq> or only neutron is affected by that change?
15:00:12 <ralonsoh> community-wide (but this should be raised by oslo.db cores in the TC)
15:00:20 <ralonsoh> we'll have time for this
15:00:22 <ralonsoh> lajoskatona1, please
15:00:24 <lajoskatona1> slaweq: I think I saw nova changes for this
15:00:33 <lajoskatona1> ralonsoh: ok ,thanks
15:00:42 <lajoskatona1> PTG etherpad: #link https://etherpad.opendev.org/p/neutron-zed-ptg
15:00:49 <lajoskatona1> please add your topics :-)
15:01:12 * mlavalle will be on PTO next week, so won't attend this meeting
15:01:49 <lajoskatona1> mlavalle: the PTG?
15:02:24 <lajoskatona1> The other one is that there will be openinfra episode next week for cycle highlights
15:02:38 <lajoskatona1> there is a slideset: #link https://docs.google.com/presentation/d/1PhTzrIUo6kfCzdoPO5V3xgQCt6ktqGP3Q1y6BXMryzg/edit#slide=id.g11ee73f15e5_1_4
15:02:53 <lajoskatona1> if you have time please check it, I added a few sentences for Neutron
15:02:55 <mlavalle> I will attend the PTG on the week of april 4 - 8
15:03:02 <lajoskatona1> mlavalle: ok
15:03:07 <mlavalle> \I will just be on PTO this coming week
15:03:28 <mlavalle> so if there is drivers meeting on 1st, I won't attend
15:03:29 <lajoskatona1> For next week I plan to cancel both the team meeting and the drivers meeting, as the week after is the PTG
15:03:38 <mlavalle> great
15:03:41 <mlavalle> then nvm
15:03:42 <lajoskatona1> mlavalle: ok, thanks
15:03:58 <lajoskatona1> and that's it from me, thanks for attending
15:04:05 <ralonsoh> thanks!
15:04:06 <mlavalle> o/
15:04:08 <lajoskatona1> #endmeeting