14:00:58 <haleyb> #startmeeting neutron_drivers
14:00:58 <opendevmeet> Meeting started Fri Dec  6 14:00:58 2024 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:58 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:58 <opendevmeet> The meeting name has been set to 'neutron_drivers'
14:01:01 <haleyb> Ping list: ykarel, mlavalle, mtomaska, slaweq, obondarev, tobias-urdin, lajoskatona, amotoki, haleyb, ralonsoh
14:01:06 <mlavalle> \o
14:01:11 <ralonsoh> hello
14:01:17 <slaweq> o/
14:01:19 <s3rj1k> hi all
14:02:11 <haleyb> alright, I see 4 items in the agenda, let's get started
14:02:18 <haleyb> ralonsoh: the first two are yours
14:02:28 <ralonsoh> I'm opening the agenda
14:02:38 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/2084782
14:02:43 <ralonsoh> [OVN] VLAN/flat LRP should be updated when the router networks are updated
14:03:24 <lajoskatona> o/
14:03:26 <ralonsoh> summarizing: the problem here is that the functionality right now doesn't work if we update the networks attached to a router
14:03:51 <ralonsoh> if we have mixed type networks (tunnelled and not tunnelled), the flat/vlan ones need to be centralized
14:04:04 <ralonsoh> due to issues in the MTU
14:04:26 <ralonsoh> so the point is that we have been fixing/reverting this functionality several times
14:04:32 <ralonsoh> my suggestion:
14:04:43 <ralonsoh> 1) always make the vlan/flat network traffic centralized
14:05:07 <ralonsoh> 2) propose a new feature to allow distributed vlan/flat traffic for routers with only this kind of networks
14:05:17 <ralonsoh> (a kind of network type segregation)
14:05:28 <ralonsoh> --> https://bugs.launchpad.net/neutron/+bug/2084782/comments/1
14:05:31 <obondarev> late o/
14:06:01 <ralonsoh> for (1) --> https://review.opendev.org/c/openstack/neutron/+/935652
14:06:25 <ralonsoh> so opinions here?
14:07:12 <lajoskatona> what does centralized mean in this case with OVN?
14:07:24 <ralonsoh> ovn allows distributed FIPs
14:07:39 <ralonsoh> that won't work for flat/vlan networks connected to a router
14:07:55 <ralonsoh> all traffic will cross the GW logical router port
14:08:05 <lajoskatona> ahh, ok ,thanks
14:08:53 <slaweq> so in fact we would make the config option "enable_distributed_floating_ips" have no effect on the deployment, right?
14:09:05 <slaweq> for the tenant networks of type vlan/flat
14:09:09 <ralonsoh> for vlan/flat tenant networks
14:09:12 <ralonsoh> yes
14:09:15 <slaweq> for geneve networks it will still be distributed
14:09:18 <ralonsoh> yes
14:09:55 <slaweq> so basically it is a revert of all the attempts to make vlan tenant networks distributed
14:09:58 <ralonsoh> yes
14:10:15 <ralonsoh> and (2) make this possible but only for these networks connected to a router
14:10:21 <ralonsoh> not router type mixing
14:10:28 <ralonsoh> right now this is a problem in OVN
14:10:41 <ralonsoh> not network type mixing*
14:10:48 <slaweq> and you know my opinion - I am ok with this as I know there is no other option and basically now we have this functionality broken in some cases
14:10:58 <mlavalle> ah ok, so we are not ruling out distributed fips with vlan networks
14:11:00 <slaweq> and there's no way to fix it for all use cases
14:11:08 <opendevreview> Stanislav Dmitriev proposed openstack/neutron master: HA VRRP health check parameters  https://review.opendev.org/c/openstack/neutron/+/932716
14:11:10 <mlavalle> it's just the mixing that we want to limit
14:11:32 <ralonsoh> yes, we still allow that, but if we mix them
14:11:32 <lajoskatona> if I understand the 2 options well, there will be some limitation either way, so we have to choose which one :-)
14:11:40 <ralonsoh> 1) tunnelled will have DVR
14:11:45 <ralonsoh> 2) flat/vlan won't
14:11:49 <mlavalle> yes, there's a limit in the proposal
14:12:07 <ralonsoh> so we are not going to prevent mixing networks
14:12:16 <ralonsoh> just disabling dvr for vlan
14:12:27 <mlavalle> right, but...
14:12:53 <mlavalle> if there are no tunneled networks, 2 will allow dvr for vlans, right?
14:13:01 <slaweq> lajoskatona: yes, I see it as a choice between: 1) mix network types in routers as you want, but traffic to/from vlan tenant networks will go through the chassis gw, or 2) vlan tenant network traffic will be distributed, but you can't mix networks of different types in the same router
14:13:12 <ralonsoh> mlavalle, no
14:13:19 <ralonsoh> this is proposal (2)
14:13:24 <slaweq> depending on the use case, the operator will be able to choose what they need
14:13:29 <mlavalle> yes, that's what I meant
14:13:33 <slaweq> at least that's my understanding of the proposal
14:13:41 <ralonsoh> and this is because if in the router you add a 3rd network, that is tunnelled, the vlan traffic will be broken
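A minimal sketch of the routing decision as summarized above, assuming a simplified model where each router port is either distributed or centralized through the gateway LRP. The function name, the `router_allows_vlan_dvr` per-router switch for proposal (2), and the set of tunnelled types are hypothetical illustrations, not Neutron code; the real behaviour is defined by the review and the upcoming spec.

```python
# Hypothetical, simplified model of the two proposals; not Neutron's code.
TUNNELLED_TYPES = {"geneve", "vxlan", "gre"}  # assumption for illustration
PROVIDER_TYPES = {"vlan", "flat"}


def lrp_is_distributed(network_type, router_network_types,
                       enable_distributed_floating_ips,
                       router_allows_vlan_dvr=False):
    """Return True if traffic for this router port stays distributed.

    network_type: type of the network behind this logical router port.
    router_network_types: types of all networks attached to the router.
    router_allows_vlan_dvr: hypothetical per-router switch for proposal (2),
        disabled by default so existing (possibly mixed) routers keep working.
    """
    if not enable_distributed_floating_ips:
        return False
    if network_type in TUNNELLED_TYPES:
        # Tunnelled (e.g. geneve) networks stay distributed, as today.
        return True
    # Proposal (1): vlan/flat traffic is centralized through the gateway LRP.
    # Proposal (2): unless the router opted in and has only vlan/flat networks.
    return router_allows_vlan_dvr and all(
        t in PROVIDER_TYPES for t in router_network_types)
```

With a router that has both geneve and vlan networks, only the geneve port stays distributed; adding a tunnelled network to a vlan-only router that opted in would turn its vlan port back to centralized, which is the breakage ralonsoh describes.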
14:15:34 <ralonsoh> mlavalle, right, I didn't read it correctly
14:16:02 <ralonsoh> maybe, with future OVN improvements, we'll be able to avoid these limitations
14:17:22 <opendevreview> Merged x/whitebox-neutron-tempest-plugin master: Bump hacking  https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/936943
14:17:33 <mlavalle> is there a way we can fix this at the OVN level instead of at neutron's?
14:17:58 <ralonsoh> yes but there are other limitations, for example with QoS
14:18:00 <mlavalle> would a conversation with the ovn core team help?
14:18:29 <ralonsoh> there is some kind of fix but that removes the qos capability for vlan networks
14:18:37 <ralonsoh> so the solution introduces another issue
14:18:43 <haleyb> so what will be the impact to a tenant (besides it actually working correctly), just a possible performance penalty for centralizing provider traffic?
14:19:06 <ralonsoh> right now we have this limitation, with mixed networks
14:19:28 <ralonsoh> if we also implement (2) for vlan networks, this limitation will be gone
14:19:44 <ralonsoh> (but, of course, using only vlan networks in the router, not mixing)
14:20:04 <ralonsoh> right now we can't have everything
14:20:44 <haleyb> right, but the use case of having both is probably pretty small, so we can live with it
14:20:57 <ralonsoh> yes, that's the point
14:21:06 <ralonsoh> usually tenant networks are geneve
14:21:13 <ralonsoh> (which work much better in ovn)
14:21:32 <mlavalle> that's a valid consideration
14:22:09 <ralonsoh> if this is accepted, of course I'll document everything crystal clear
14:22:55 <lajoskatona> that is necessary for sure
14:23:07 <haleyb> with pictures or ascii art :) j/k doc is the key
14:23:24 <obondarev> "but you can't mix networks of different types in the same router" what does "you can't" mean? An error raised on an attempt to connect a network of a different type to a router?
14:24:00 <ralonsoh> if (2) is implemented and you add vlan networks to a router
14:24:10 <ralonsoh> that will be configurable, in order to have DVR with vlan
14:24:24 <ralonsoh> but that will imply that you won't be able to mix network types
14:24:39 <ralonsoh> but this will be configurable (same as DVR, for example)
14:24:51 <mlavalle> would that be for the entire deployment or per router?
14:24:54 <haleyb> true, is that an API-type change? or ?
14:24:56 <ralonsoh> per router
14:25:12 <ralonsoh> haleyb, I don't think we can implement that at API level
14:25:12 <obondarev> yeah I'm just thinking on upgrade scenario, when user already has mixed typed nets in a router
14:25:18 <mlavalle> seems less restrictive then
14:25:20 <ralonsoh> this is on the server, at runtime
14:25:40 <ralonsoh> obondarev, by default, this config option will be disabled so the upgrade will be possible
14:25:53 <obondarev> ack
14:26:21 <ralonsoh> folks, I'm going to propose a spec
14:26:26 <haleyb> ralonsoh: oh, per-router to me implied you can change it via the API
14:26:28 <ralonsoh> this kind of change deserves it
14:26:42 <slaweq> obondarev we can add some upgrade check to warn the user about it during upgrade
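A minimal sketch of the kind of upgrade check slaweq suggests, assuming the router data is already available as a plain mapping of router ID to attached network types. The function names and the message text are hypothetical; Neutron's real upgrade checks and DB access are not shown here.

```python
# Hypothetical upgrade-check helper; router data is a plain dict, not
# Neutron's DB API, and the message wording is illustrative only.
TUNNELLED_TYPES = {"geneve", "vxlan", "gre"}  # assumption for illustration
PROVIDER_TYPES = {"vlan", "flat"}


def routers_with_mixed_network_types(routers):
    """Return sorted IDs of routers attached to both kinds of networks.

    routers: mapping of router ID -> iterable of attached network types.
    """
    mixed = []
    for router_id, net_types in routers.items():
        types = set(net_types)
        if types & TUNNELLED_TYPES and types & PROVIDER_TYPES:
            mixed.append(router_id)
    return sorted(mixed)


def check_router_network_types(routers):
    """Return a warning string, or None when no router mixes network types."""
    mixed = routers_with_mixed_network_types(routers)
    if not mixed:
        return None
    return ("Routers mixing tunnelled and vlan/flat networks cannot use "
            "distributed vlan/flat traffic: " + ", ".join(mixed))
```

Because the per-router option would be disabled by default, such a check would only warn, leaving existing mixed routers working with centralized vlan/flat traffic.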
14:26:43 <ralonsoh> it will be much better to describe everything there
14:26:56 <lajoskatona> +1 for spec
14:27:00 <ralonsoh> including upgrades or any other consideration
14:27:06 <ralonsoh> I'll add the RFE tag to the bug
14:27:18 <obondarev> +1
14:27:22 <haleyb> anyways, we should vote, and i was going to mention the RFE tag
14:27:34 <haleyb> +1
14:28:05 <mlavalle> +1
14:28:08 <lajoskatona> +1 for the whole idea and for proposing a solution for it
14:28:27 <mlavalle> yes, my +1 assumes we implement (2)
14:28:33 <ralonsoh> thanks folks, I'll propose this spec next week
14:28:57 <haleyb> alright, great i marked it approved
14:29:03 <haleyb> you also had the next item
14:29:09 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/2032817
14:29:27 <ralonsoh> issue: in a router, we handle ext MTU < int MTU
14:29:39 <ralonsoh> but not the other way: ext MTU > int MTU
14:29:49 <ralonsoh> this last case is rare but possible (there we have the bug)
14:30:11 <ralonsoh> we have a flag in the GW LRP: options.gateway_mtu, that is set in the first case
14:30:31 <ralonsoh> this sends a frag message to the VM to fragment the traffic, and works fine
14:30:53 <ralonsoh> the proposal: to ALWAYS set this flag, regardless of the MTU relation
14:31:19 <ralonsoh> that means, in a router with several networks and MTUs, the minimum MTU will define this value
14:31:30 <ralonsoh> that makes sense if we want all networks to communicate
14:31:39 <ralonsoh> this is, of course, a performance limitation
14:31:57 <ralonsoh> but if you connect a network to a router you'll need to take this into consideration
14:32:34 <ralonsoh> and, something else to be considered, this is configurable: there is a neutron configuration flag that controls whether this flag is written
14:32:46 <ralonsoh> ovn_emit_need_to_frag
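A minimal sketch of the proposed rule, assuming the MTUs of all networks connected to the router (including the external gateway network) are passed in. The function name is hypothetical; the POC patch linked in the log is the authoritative change.

```python
# Hypothetical helper illustrating the proposal: always set gateway_mtu on
# the gateway LRP to the smallest MTU of the router's networks, but only
# when ovn_emit_need_to_frag is enabled. Not the actual Neutron code.

def gateway_lrp_options(network_mtus, ovn_emit_need_to_frag=True):
    """Return the extra OVN options for a router's gateway LRP.

    network_mtus: MTUs of all networks connected to the router, assumed to
        include the external (gateway) network.
    """
    if not ovn_emit_need_to_frag or not network_mtus:
        return {}
    # OVN NB "options" columns hold string key/value pairs.
    return {"gateway_mtu": str(min(network_mtus))}
```

For example, a router with a 1500 external network, a 1442 geneve network and a 9000 vlan network gets `gateway_mtu` 1442, whichever side is larger, so both the ext < int and ext > int cases are covered.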
14:33:09 <ralonsoh> better than this, the POC patch
14:33:16 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/937026
14:33:27 <ralonsoh> (code speaks better than me)
14:33:59 <haleyb> ok, so we will only set the mtu if the config option is set
14:34:11 <ralonsoh> yes (this is True by default)
14:34:24 <ralonsoh> and makes sense if we don't want issues with different MTUs
14:34:35 <ralonsoh> this is just a heads-up because it is an important change in the GW LRP
14:34:47 <ralonsoh> so any comment can be done in the patch ^^^
14:34:59 <ralonsoh> thanks for listening (there are other topics in the agenda)
14:35:33 <lajoskatona> thanks ralonsoh, makes the review easier
14:35:38 <ralonsoh> thanks!
14:35:40 <mlavalle> +1
14:35:42 <haleyb> ralonsoh: i agree with fixing, might need to clean-up gaps document where we added this, will put in review
14:37:06 <haleyb> not sure we have to vote since it's technically a bug fix, but i'm +1 obviously
14:37:14 <ralonsoh> no no, just a heads-up
14:37:20 <ralonsoh> any review in the patch, thanks!!
14:37:35 <opendevreview> Dmitriy Rabotyagov proposed openstack/neutron master: [OVN] Do not supply gateway_port if it's not binded to chassis  https://review.opendev.org/c/openstack/neutron/+/931495
14:37:39 <mlavalle> I just typed +1 to agree that the clarification makes the review easier
14:37:52 <haleyb> ok, we have two more items
14:38:03 <haleyb> s3rj1k: since i noticed you here, you can go next
14:38:30 <haleyb> #link https://review.opendev.org/c/openstack/neutron-specs/+/935724
14:38:51 <s3rj1k> yea, so this is a continuation of the previous meeting
14:39:10 <mlavalle> I thought so
14:39:11 <s3rj1k> same toping, just prepared a spec for it
14:39:16 <s3rj1k> *topic
14:39:39 <ralonsoh> I'll check it next week, for sure
14:39:51 <ralonsoh> I think we already voted for this one
14:39:56 <mlavalle> we did
14:40:17 <s3rj1k> yea, just highlighting this as needs review
14:40:24 <ralonsoh> for sure, thanks!
14:40:27 <mlavalle> thanks for the reminder
14:40:30 <s3rj1k> we can continue on next items
14:41:52 <haleyb> ok, next one was from Amir, not sure of his nick
14:41:57 <ralonsoh> amnik,
14:42:18 <amnik> Yes I'm here
14:42:25 <haleyb> #link https://bugs.launchpad.net/neutron/+bug/2089726
14:42:58 <amnik> I got your points in the review; I want to share with you what I think now.
14:43:45 <amnik> I also agree that it is not reasonable to implement parallelism in ovsdbapp
14:44:06 <amnik> Because it would affect all other projects that use ovsdbapp
14:44:33 <amnik> but in OVN agent I think order of events does not matter
14:44:51 <opendevreview> Serhii Ivanov proposed openstack/neutron-specs master: Proposes `Agent Startup State Tracking`  https://review.opendev.org/c/openstack/neutron-specs/+/935724
14:45:10 <amnik> so maybe we can implement parallelism in the OVN agent
14:45:37 <amnik> in Ovn agent we always call provision_datapath in run function
14:45:44 <noonedeadpunk> but comment for port DOWN/UP was a good example which was applicable for OVN
14:45:56 <noonedeadpunk> in which state will the port end up if it's done in parallel?
14:46:42 <ralonsoh> noonedeadpunk, you can receive, as we are doing now in the CI, the port up and down events almost at the same time
14:46:46 <amnik> and in provision_datapath we get the latest state of the database, independent of events
14:47:21 <noonedeadpunk> ralonsoh: yes, but you would expect order matter, right? so you can't process these in parallel I assume?
14:47:29 <ralonsoh> amnik, I'm totally ok with any improvement in the datapath provisioning. We had problems in the past because of this with beefy servers
14:47:44 <ralonsoh> noonedeadpunk, that was my point in the comment in the patch
14:47:48 <noonedeadpunk> yeah
14:47:51 <ralonsoh> amnik, but you need to consider:
14:47:57 <noonedeadpunk> I was just +1 it effectively here :)
14:48:14 <ralonsoh> 1) the datapath will be dependent on a network
14:48:39 <ralonsoh> 2) you can't process provisioning of a port in a datapath if other call is also doing the same for the same datapath
14:49:00 <ralonsoh> 3) creating threads in python to execute just python code doesn't improve performance
14:49:25 <ralonsoh> that's all
14:50:25 <lajoskatona> if we add parallelism for these low level operations do we gain anything or just have the extra headache with the complexity?
14:50:47 <amnik> can you please explain the third one more? If we provision datapaths in multiple threads, does it not improve performance?
14:51:06 <slaweq> lajoskatona: and new set of problems to debug in CI jobs 🙂
14:51:09 <ralonsoh> amnik, no, you are executing python code
14:51:34 <lajoskatona> and debug customers :-)
14:51:40 <lajoskatona> I mean their deployments
14:51:52 <ralonsoh> lajoskatona, the only parts of this code that can maybe be executed in parallel are the pyroute calls to create/set the devices
14:52:17 <noonedeadpunk> from what I saw, the main performance issue in neutron replies was actually the way policies are being processed...
14:52:30 <ralonsoh> if we really want to improve this, we should refactor the code and create a dispatcher for several workers (processes) attending one datapath at the same time
14:52:40 <ralonsoh> so you can receive events and send the operations to these workers
14:53:04 <amnik> ralonsoh: I think creating the ovnmeta namespaces in parallel could help
14:53:09 <ralonsoh> to prevent issues with parallel executions in the same datapath, only one worker should attend each datapath
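A minimal sketch of the dispatcher idea ralonsoh describes, showing only the routing part: each event is assigned to a worker by its datapath, so a given datapath is always handled by the same worker and its events (for example a port up followed quickly by a port down) keep their arrival order. The names `worker_for` and `dispatch` are hypothetical; in the design discussed, each per-worker queue would be drained by a separate process, which is not shown here.

```python
# Hypothetical routing layer for a per-datapath dispatcher; not OVN agent code.
import zlib
from collections import defaultdict


def worker_for(datapath_id, n_workers):
    """Pick a worker index with a stable hash.

    zlib.crc32 gives the same result in every process, unlike the built-in
    hash() for str, which is salted per interpreter run.
    """
    return zlib.crc32(datapath_id.encode()) % n_workers


def dispatch(events, n_workers):
    """Split (datapath_id, action) events into per-worker ordered queues."""
    queues = defaultdict(list)
    for datapath_id, action in events:
        queues[worker_for(datapath_id, n_workers)].append((datapath_id, action))
    return dict(queues)
```

Different datapaths can then be provisioned concurrently, while the ordering problem raised for port DOWN/UP events cannot occur within one datapath, because only one worker ever sees it.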
14:53:19 <ralonsoh> amnik, that takes 0.1 seconds
14:54:01 <amnik> ralonsoh: I understand.
14:54:56 <haleyb> ralonsoh: that's what the l3-agent does with its work queues, indexed by network if my memory is good, but the GIL gets in the way
14:55:05 <ralonsoh> so, again, I'll help and actively participate in a good improvement design for the OVN agent
14:55:21 <ralonsoh> haleyb, kind of but the l3 agent and dhcp agent use threads
14:55:29 <ralonsoh> and right now I lowered the working threads to 1
14:55:47 <ralonsoh> and there is no performance impact (and much better debugging and less issues)
14:56:07 <ralonsoh> so if we want to do this, we need to create a server and spawn processes that will attend the requests
14:56:24 <haleyb> yes, understood
14:56:25 <ralonsoh> that is a HUGE refactor but I'll actively collaborate on this
14:57:39 <amnik> ralonsoh: thanks, I will review the ideas from this meeting and propose an RFE if I can help.
14:57:48 <ralonsoh> perfect!
14:58:52 <lajoskatona> don't we need a spec for such a refactor?
14:59:04 <haleyb> ok, so based on that i will reject this current RFE
14:59:07 <ralonsoh> we need first an idea hehehe
14:59:38 <lajoskatona> ack
14:59:40 <haleyb> and we need data showing it helps
15:00:14 <haleyb> ok, we are out of time
15:00:23 <slaweq> +1 for having some data which will show what is actually improved and by how much
15:00:39 <amnik> Thanks all. I'll look into it more and contribute if I can help.
15:00:51 <haleyb> thanks everyone for attending
15:00:54 <haleyb> #endmeeting