14:04:06 <ralonsoh> #startmeeting neutron_drivers
14:04:06 <opendevmeet> Meeting started Fri Nov 15 14:04:06 2024 UTC and is due to finish in 60 minutes.  The chair is ralonsoh. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:04:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:04:06 <opendevmeet> The meeting name has been set to 'neutron_drivers'
14:04:16 <ralonsoh> Ping list: ykarel, mlavalle, mtomaska, slaweq, obondarev, tobias-urdin, lajoskatona, amotoki, haleyb, ralonsoh
14:04:19 <mlavalle> \o
14:04:19 <ralonsoh> hello all
14:04:22 <lajoskatona> o/
14:04:24 <obondarev> o/
14:04:31 <slaweq> o/
14:04:41 <s3rj1k> hi all
14:04:50 <ralonsoh> we have many RFEs to discuss today
14:05:01 <ralonsoh> and we have quorum, so I think we can start
14:05:09 <amnik> hello all
14:05:13 <ralonsoh> First one: [RFE] Configurable Agent Termination on OVS Restart
14:05:16 <ralonsoh> https://bugs.launchpad.net/neutron/+bug/2086776
14:05:34 <ralonsoh> s3rj1k, please, present the RFE
14:06:50 <s3rj1k> ralonsoh: Hi, yeah, so basically the idea is to automate the manual fixes that are currently done by a human for the agent in k8s envs
14:07:26 <ralonsoh> s3rj1k, my question, same as Liu's in the bug, is: why don't we fix what is broken in the OVS agent code?
14:07:34 <ralonsoh> what exact flows are not recovered?
14:07:43 <s3rj1k> to do that, the idea is to introduce opt-in functionality to terminate the agent instead of doing flow recovery
14:08:34 <s3rj1k> fixing is a separate thing; this is about making sure that prod envs keep working without human interaction
14:09:13 <ralonsoh> the OVS agent is capable of detecting when the vswitch has been restarted and tries to restore the flows
14:09:13 <opendevreview> Sebastian Lohff proposed openstack/neutron master: docs: Fix typo in openstack create subnet command  https://review.opendev.org/c/openstack/neutron/+/935182
14:09:29 <s3rj1k> Liu also mentioned that this is only related to containerized envs; for systemd setups similar restarts already happen, as I understand it
14:09:35 <ralonsoh> if we fix what is broken, that will solve your problem (and for any other one using it)
14:10:04 <ralonsoh> Liu is stating, and he is right, that restarting the OVS agent has a high cost in the Neutron API
14:10:14 <s3rj1k> to fix that we need to repro the bug; there is no time to do that on envs that have client workloads
14:10:47 <s3rj1k> >  restarting the OVS agent has a high cost - agree, hence opt-in
14:11:07 <slaweq> I personally feel like this proposal is more like a workaround for some bug rather than a fix for it
14:11:12 <ralonsoh> exactly
14:11:23 <s3rj1k> agree, workaround
14:11:32 <ralonsoh> s3rj1k, can you share with the community what is broken when the OVS restores the flows?
14:11:41 <ralonsoh> because this is the critical point here but is not documented
14:11:46 <ralonsoh> this is what needs to be fixed
14:12:20 <s3rj1k> I have no specific repro right now; the details that I have are in the ticket: one of the flows was not recovered
14:12:58 <ralonsoh> s3rj1k, so please, add this information in the LP bug and how to reproduce it locally, if possible
14:12:59 <s3rj1k> when I have a repro I will create a bug for it, but this does not mean that the workaround is invalid
14:13:04 <ralonsoh> with this information we can try to fix it
14:13:07 <lajoskatona> Could you please check the suggestion from Liu (https://bugs.launchpad.net/neutron/+bug/2086776/comments/5 ) that seems reasonable and in sync with the current design?
14:13:32 <ralonsoh> that is a lightweight workaround too
14:13:41 <ralonsoh> no API involved in this case
14:14:52 <ralonsoh> but maybe it could be a solution: instead of the current code that restores the flows when OVS is restarted, restart the OVS agent monitoring mechanism
14:14:59 <lajoskatona> anyway, if we have a list of the missing things, as ralonsoh suggested, that can help to find the places to fix
14:15:32 <lajoskatona> yeah perhaps
14:16:14 <ralonsoh> s3rj1k, so please, before proceeding on any decision, we need to know what is actually broken. Then we can propose a solution (fix the restore methods, lightweight restart or full restart)
14:16:54 <s3rj1k> are we sure we want to block a general workaround on a specific bug repro?
14:17:05 <s3rj1k> things can be done async
14:17:05 <ralonsoh> sorry what does it mean?
14:17:51 <s3rj1k> continue with terminate workaround and as a separate bug try to find repro for that one specific flow
14:18:31 <s3rj1k> I am pretty sure that there are more cases where flow recovery fails, but I have no info on that, only a report on one of them
14:18:50 <ralonsoh> if you ask me to implement a workaround for a bug I don't understand, I will say no
14:18:57 <ralonsoh> I need to know what is broken
14:19:03 <slaweq> personally I don't like the idea of such a workaround because it doesn't seem like a professional thing to me - it's a bit like implementing a service restart to fix a memory leak :) So I would personally vote for fixing the real bugs and handling the OVS restart properly instead of this RFE
14:19:03 <s3rj1k> as basically the fix is a manual restart; nobody is triaging bugs in prod
14:20:08 <ralonsoh> nobody is triaging bugs in prod --> I do this for a living
14:20:16 <s3rj1k> slaweq: so is it better to ignore that and keep line-1 ops just restarting all the things?
14:20:23 <slaweq> also, if you need a workaround like that, you can probably implement some small script which will do what you need - it doesn't need to be in the agent itself
14:20:46 <slaweq> s3rj1k no, IMO it's better to try to reproduce the issue on dev env and try to fix it
14:20:47 <s3rj1k> sure it can be done as some script
14:21:52 <obondarev> no need to reproduce it intentionally; just the next time manual intervention (an agent restart) is needed, dump the flows before and after the restart - that could be enough for investigation
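A minimal sketch of the capture obondarev suggests, assuming the integration bridge is br-int and ovs-ofctl is available on the node; the two dumps can then be diffed during investigation:

    import subprocess

    def dump_flows(bridge="br-int"):
        # Capture the current OpenFlow table of the bridge as text.
        out = subprocess.run(["ovs-ofctl", "dump-flows", bridge],
                             capture_output=True, text=True, check=True)
        return out.stdout

    # Before the manual agent restart:
    with open("/tmp/flows-before.txt", "w") as f:
        f.write(dump_flows())
    # ... restart neutron-openvswitch-agent manually ...
    # After the restart:
    with open("/tmp/flows-after.txt", "w") as f:
        f.write(dump_flows())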
14:21:57 <greatgatsby_> was just coming in here to post about an outage we've had during 2 of our last 3 upgrade tests.  Seems directly related to the OVS conversation you're having here now
14:22:29 <lajoskatona> obondarev: +1, good that you added here
14:22:47 <s3rj1k> as I said, when some details come in, I will create a separate bug for that
14:23:05 <greatgatsby_> we've had a 5 minute outage: kolla-ansible restarts the OVS containers (which triggers 2 separate flow re-creations) and we don't get FIP connectivity back until the neutron role runs, some 5 minutes later
14:23:24 <s3rj1k> this is more about whether we want some workaround in place or not
14:24:00 <greatgatsby_> in our testing, both the restart of openvswitch_db and openvswitch-vswitchd trigger the neutron agent to recreate the flows; this happens within about 20 seconds of each other
14:25:23 <slaweq> IMHO it is better to have a script for that, e.g. in https://opendev.org/openstack/osops, rather than implementing it in neutron
14:25:37 <obondarev> +1 to slaweq
14:25:46 <ralonsoh> s3rj1k, we want to fix this bug, for sure. We don't want to implement a workaround for something we don't understand. In order to push code, we need to know what is happening
14:25:46 <greatgatsby_> for some reason, but not always, the neutron OVS agent doesn't seem to like this and connectivity is lost until the agent is restarted during the neutron role
14:26:17 <greatgatsby_> sorry for just injecting my recent experience, just excited that this is being discussed in here, hoping for some kind of mitigation
14:26:20 <ralonsoh> so, for now, we are not going to vote for this RFE until we have a better description of the current problem
14:26:39 <ralonsoh> s3rj1k, please, do not open a separate bug, provide this info in the current one
14:27:01 <ralonsoh> ok, we have more RFEs and just 35 mins left
14:27:05 <lajoskatona> greatgatsby_: do you think your issue is related to https://bugs.launchpad.net/neutron/+bug/2086776 ?
14:27:14 <s3rj1k> greatgatsby_: can you please add more info to RFE if possible
14:28:06 <ralonsoh> ok, next RFE: [RFE] L3 Agent Readiness Status for HA Routers
14:28:07 <s3rj1k> ralonsoh: ok, if I have more details I'll add them to this RFE
14:28:12 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/2086794
14:28:27 <ralonsoh> we had a similar bug https://bugs.launchpad.net/neutron/+bug/2011422
14:28:39 <ralonsoh> that was also discussed in a PTG (I don't remember which one)
14:28:44 <s3rj1k> it can also be discussed together with 2086799
14:28:47 <lajoskatona> I have a general comment for this and next one ([RFE] OVS Agent Synchronization Status for Container Environments )
14:29:09 <ralonsoh> first let's wait for the RFE proposal
14:29:14 <ralonsoh> s3rj1k, please, go on
14:29:41 <lajoskatona> Are these about monitoring our processes / agents etc.? Nova discussed something similar: https://etherpad.opendev.org/p/nova-2025.1-ptg#L255 - perhaps worth checking before reinventing it
14:29:51 <s3rj1k> both RFEs are about extending available data for k8s probes
14:30:28 <greatgatsby_> it sounds very similar.  We've only just identified the trigger, and since it's not 100% reproducible, it takes me a couple of days to get back to a fresh environment.  In my next test I will log the flows throughout the upgrade.  For now, we've just identified that it's caused by the 2 OVS container restarts; the neutron agent doesn't recover properly somehow, and connectivity to our VMs is lost until minutes later
14:30:28 <greatgatsby_> when the agent is restarted as part of the neutron role in kolla-ansible
14:30:41 <s3rj1k> so that pod management can do monitoring and life-cycling of containerized agents
14:31:16 <greatgatsby_> we're using DVR and VLANs, if that matters
14:31:23 <ralonsoh> s3rj1k, but what is the information required?
14:31:31 <ralonsoh> what do you exactly need to monitor?
14:31:38 <slaweq> I think it is generally a good idea for all the agents to report in some file what they are doing, like "full_sync in progress", "sync finished", etc.
14:31:47 <s3rj1k> OVS agent's synchronization status (started/ready/dead states)
14:32:02 <ralonsoh> slaweq, why a file? we have the Neutron API and heartbeats
14:32:04 <slaweq> as sometimes such a full sync, e.g. after an agent start, can indeed take a very long time
14:32:10 <s3rj1k> and for the second one, HA router status
14:32:29 <slaweq> ralonsoh IIUC it's about checking it locally
14:32:42 <slaweq> by e.g. readiness probe in k8s
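A minimal sketch of what such a k8s readiness probe could look like, assuming a hypothetical local state file written by the agent (the path and field names are illustrative, not an existing Neutron interface):

    import json
    import sys

    STATE_FILE = "/var/lib/neutron/ovs_agent_state.json"  # assumed location

    def probe():
        try:
            with open(STATE_FILE) as f:
                state = json.load(f)
        except (OSError, ValueError):
            return 1  # not ready: file missing or unreadable
        # Report ready only once the initial full sync has finished.
        return 0 if state.get("status") == "ready" else 1

    if __name__ == "__main__":
        sys.exit(probe())

The pod spec would then run this as an exec readiness probe, so kubelet marks the agent container ready only after the sync completes.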
14:32:42 <s3rj1k> I can extend both RFEs with details if this sounds good in general
14:32:48 <lajoskatona> agree about the neutron API; we already have some healthcheck API that should be extended
14:33:02 <greatgatsby_> we were surprised that both OVS container restarts each trigger flow re-creation in rapid succession
14:33:10 <lajoskatona> Nova is also moving in that direction, see the spec: https://specs.openstack.org/openstack/nova-specs/specs/2024.2/approved/per-process-healthchecks.html
14:33:14 <greatgatsby_> https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/openvswitch/handlers/main.yml
14:33:15 <ralonsoh> we have a fantastic Neutron API that can be fed with anything we want
14:33:36 <slaweq> and for such a purpose, reporting this to the neutron db and checking each agent through the API will not scale at all
14:33:42 <ralonsoh> lajoskatona, exactly: this is a matter of improving the agent information mechanism
14:34:01 <s3rj1k> for k8s probes it is best to use something local to pod
14:34:27 <slaweq> we can send such info in the heartbeat too, but saving it in a file locally shouldn't hurt anyone and can help in some cases
14:35:41 <slaweq> but we should probably have some standardized list of possible states and reuse them in all agents
14:35:48 <ralonsoh> I'm against storing local info if we don't push that to the API too
14:35:55 <s3rj1k> probes in general tend to run frequently, so anything that is network related can have a perf impact on the cluster; better to have local data
14:35:59 <slaweq> so that we don't end up with each agent reporting something slightly different
14:36:16 <ralonsoh> slaweq, yes, that was the PTG proposal, having a set of states per agent that could be configurable
14:36:32 <ralonsoh> DHCP: network status (provisioning), L3: router status, etc
14:36:45 <ralonsoh> that could be defined in each agent independently
14:37:11 <ralonsoh> but again, this info should go to the API. If we want to store it locally as an option, perfect
14:37:19 <mlavalle> like a state machine
14:37:40 <ralonsoh> mlavalle, yes, and not only a global one per agent, but also per resource (network, router, etc)
14:37:53 <slaweq> do we need them per agent? Shouldn't just a few standard ones be enough? Like e.g. "sync in progress", "operational" or something like that
14:37:56 <s3rj1k> ralonsoh: works for my case if both local and API
14:38:00 <slaweq> what else would be needed there?
14:38:20 <ralonsoh> slaweq, do you mean having standard states?
14:38:25 <ralonsoh> regardless of the resource/agent
14:38:42 <slaweq> ralonsoh yes, IMHO it can work that way
14:38:47 <slaweq> but maybe I am missing something
14:38:50 <ralonsoh> I think that makes sense
14:39:12 <ralonsoh> slaweq, we can always add new states (that don't need to be used by all resources/agents)
14:39:19 <slaweq> true
14:39:32 <mlavalle> maybe there is a common set of states and then each agent has a few of its own
14:39:34 <ralonsoh> but yes, initially I would propose a set of states that could match with the state of the common agents
14:39:44 <slaweq> but that way we can have them defined in neutron-lib so that everyone will be able to rely on them pretty easily
14:39:53 <ralonsoh> yes to both
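A purely illustrative sketch of such a shared set of states (the names are assumptions; nothing like this exists in neutron-lib today):

    # Common states reusable by every agent; individual agents could extend
    # this set with their own resource-specific states if needed.
    AGENT_STATE_STARTING = "starting"
    AGENT_STATE_SYNC_IN_PROGRESS = "sync_in_progress"
    AGENT_STATE_OPERATIONAL = "operational"
    AGENT_STATE_DEAD = "dead"

    COMMON_AGENT_STATES = (
        AGENT_STATE_STARTING,
        AGENT_STATE_SYNC_IN_PROGRESS,
        AGENT_STATE_OPERATIONAL,
        AGENT_STATE_DEAD,
    )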
14:39:57 <lajoskatona> so let's now have some framework with basic common information, locally and in the API?
14:40:35 <slaweq> on the API side it is very easy, as you can add whatever you want to the dict sent in the heartbeat IIRC
14:40:41 <ralonsoh> yes, some information that will be provided to the API and, via config, stored locally if needed
14:40:54 <slaweq> so yes, I would say - have them in both places - api and locally
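A rough sketch of the heartbeat idea slaweq describes: the agent's periodic state report already carries a 'configurations' dict, so a sync-status field could ride along (the 'sync_status' key is an assumption, not an existing field):

    # Illustrative agent_state payload; the real dict built by each agent
    # differs, this only shows where the extra field would go.
    agent_state = {
        "binary": "neutron-openvswitch-agent",
        "host": "compute-0",
        "agent_type": "Open vSwitch agent",
        "configurations": {
            "bridge_mappings": {"physnet1": "br-ex"},
            "sync_status": "sync_in_progress",  # proposed addition
        },
    }
    # state_rpc.report_state(context, agent_state) would then carry it to the
    # server, and the same value could be mirrored to a local file for probes.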
14:41:21 <ralonsoh> I'm ok with this proposal, but of course we need a spec for sure
14:41:25 <slaweq> locally it would be similar to the HA state of the router stored in a file
14:41:32 <ralonsoh> ^ yes
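For reference, a small sketch of reading that existing per-router HA state file; the path assumes the default state_path and ha_confs layout, and the stored value is typically "primary" or "backup":

    import os

    HA_CONFS_PATH = "/var/lib/neutron/ha_confs"  # default, deployment-specific

    def router_ha_state(router_id):
        state_file = os.path.join(HA_CONFS_PATH, router_id, "state")
        try:
            with open(state_file) as f:
                return f.read().strip()
        except OSError:
            return "unknown"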
14:41:52 <lajoskatona> +1
14:41:56 <slaweq> yes, a spec would be good. It doesn't need to be long, but we should define there what states we want to have
14:42:13 <ralonsoh> so +1 with spec
14:42:19 <obondarev> +1
14:42:24 <slaweq> +1 from me
14:42:26 <mlavalle> +1
14:42:26 <s3rj1k> should we start proposing states in RFE comments?
14:42:29 <lajoskatona> +1
14:42:43 <ralonsoh> s3rj1k, it would be better to describe it in the spec
14:42:45 <mlavalle> s3rj1k: write a short spec
14:43:08 <s3rj1k> ok, will do
14:43:10 <lajoskatona> Like these: https://opendev.org/openstack/neutron-specs/src/branch/master/specs/2024.2 but under the 2025.1 folder
14:43:12 <mlavalle> s3rj1k: do you know how to propose a spec?
14:43:47 <ralonsoh> s3rj1k, are you going to combine the RFEs in one single spec?
14:43:56 <s3rj1k> I'll walk through the docs; in case of questions I'll just ask them here on a regular day, so no problem with that, thanks
14:44:24 <s3rj1k> >  in one single spec? - that can make sense yes
14:44:41 <ralonsoh> perfect, I'll comment that in the LP bugs after this meeting
14:45:02 <ralonsoh> so we can skip the 3rd one and go for the last one
14:45:10 <ralonsoh> [RFE] Add OVN Ganesha Agent to setup path for VM connectivity to NFS-Ganesha
14:45:16 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/2087541
14:45:38 <ralonsoh> Is Amir here now?
14:45:52 <amnik> Yes I'm here
14:46:03 <ralonsoh> perfect, can you explain it a bit?
14:46:12 <amnik> sure
14:46:26 <amnik> According to the Manila documentation, there is a need for network connectivity between the NFS-Ganesha service and the client mounting the Manila share.
14:46:41 <amnik> Currently, no existing solution addresses this requirement. This RFE proposes to solve it using Neutron and OVN.
14:47:28 <slaweq> first thing from me - instead of a new agent we can probably do this as an extension to the neutron-ovn-agent
14:47:59 <ralonsoh> I couldn't agree more (note: during this release I need to make the OVN agent the default one...)
14:48:14 <mlavalle> +1
14:48:20 <ralonsoh> we already have a generic agent for OVN, with a pluggable interface
14:48:29 <mlavalle> that was the whole point of it
14:48:34 <ralonsoh> it's just a matter of implementing the needed plugin and enabling it
14:49:09 <ralonsoh> amnik, what is actually needed? What does the agent need to do?
14:49:17 <ralonsoh> what will trigger these actions?
14:49:43 <amnik> this solution is inspired by the OVN metadata agent
14:50:13 <amnik> it detects ports for Ganesha connectivity and plugs them on the compute nodes
14:50:45 <amnik> these ports are distributed localports, like the metadata port
14:51:12 <slaweq> I don't know about Ganesha at all; can you maybe briefly explain what this port is needed for on compute nodes, how VMs will use it, etc.?
14:52:29 <amnik> Ganesha mediates traffic to CephFS. You can think of it as an NFS server between the VMs and CephFS in the Ceph cluster
14:53:23 <amnik> after we plug the port on the compute nodes, we add some iptables rules to DNAT traffic to Ganesha
14:54:00 <amnik> private_ip_port:2049 -> ganesha_ip:2049
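A minimal sketch of the DNAT rule amnik describes, with placeholder addresses (in a real agent this would presumably go through Neutron's iptables management code rather than a direct call):

    import subprocess

    PRIVATE_PORT_IP = "10.0.0.10"   # assumed IP of the per-share local port
    GANESHA_IP = "192.168.1.50"     # assumed NFS-Ganesha service IP
    NFS_PORT = "2049"

    # DNAT NFS traffic hitting the local port to the real NFS-Ganesha endpoint.
    subprocess.run(
        ["iptables", "-t", "nat", "-A", "PREROUTING",
         "-d", PRIVATE_PORT_IP, "-p", "tcp", "--dport", NFS_PORT,
         "-j", "DNAT", "--to-destination", f"{GANESHA_IP}:{NFS_PORT}"],
        check=True,
    )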
14:54:16 <ralonsoh> is this port a VM port?
14:54:58 <slaweq> and it is one virtual port per network? So all VMs connected to that network will use the same port?
14:55:23 <amnik> It's like the metadata port. Not bound to any chassis.
14:55:36 <ralonsoh> but who is creating this port?
14:55:39 <ralonsoh> is it in the OVN database?
14:55:47 <slaweq> and other question: ganesha_ip is from the same private network or not?
14:57:12 <ralonsoh> apart from the technical questions, I'm a bit reluctant to add something so specific in the Neutron repository
14:57:19 <amnik> slaweq: I think it's better to create a port for each share that VMs want to connect to, so we can manage it per share (storage on CephFS)
14:58:26 <amnik> ralonsoh: the port will be created via the API and after that the Ganesha agent detects it
14:59:03 <amnik> ralonsoh: yes, in the OVN database; the agent can detect it with an event.
14:59:09 <ralonsoh> how? this port won't be bound to a chassis
14:59:17 <ralonsoh> we use the SB and the local OVS database
14:59:37 <ralonsoh> same as an ovn-controller
15:00:05 <slaweq> could it be that the neutron-ovn-agent extension responsible for this would actually live in Manila and just be loaded by the neutron-ovn-agent if needed? We can accept a new device type if needed, so that you can create a port with this new device type in the neutron API, but other things may actually belong in Manila, as it seems that they are the team of experts on this topic, not us
15:00:39 <amnik> slaweq: No, Ganesha is a service that is deployed, for example, on our controller servers and has an IP on that server's network.
15:02:10 <amnik> ralonsoh: the agent can detect it by a device_id convention like ovnganesh- and the port type in the Port_Binding table
15:02:27 <ralonsoh> slaweq, yes, that could work
15:02:45 <ralonsoh> amnik, you said this port is not bound
15:03:04 <ralonsoh> who is physically creating this port?
15:03:37 <slaweq> if that had to land in the neutron repository, we would probably need a spec, but I agree with ralonsoh that this may be a bit outside our expertise here, so maybe it would be better to keep as much as possible in Manila and just add the necessary bits in neutron (neutron-lib) to handle that new type of port correctly
15:04:26 <ralonsoh> agree with this
15:04:28 <amnik> ralonsoh: This is a separate API call; the user creates the port (openstack port create ...)
15:04:41 <ralonsoh> amnik, no
15:04:49 <ralonsoh> who is creating the layer one port
15:04:51 <ralonsoh> ?
15:05:13 <ralonsoh> anyway, we are running out of time
15:05:19 <ralonsoh> we can continue after this meeting
15:05:32 <ralonsoh> what is the recommendation for this RFE?
15:05:53 <ralonsoh> I agree with slaweq's comment
15:06:27 <mlavalle> me too
15:06:33 <slaweq> I would like to see a detailed spec with an exact description of how this is going to work in both the API and the backend
15:06:36 <amnik> ralonsoh: This is the responsibility of the Ganesha agent. The user creates the port; the agent will plug it on the compute node with veth devices, like the metadata port
15:07:00 <ralonsoh> but for now this RFE is not approved, right?
15:07:09 <slaweq> Then we can discuss there what has to be in neutron and maybe what can be somewhere else
15:07:28 <obondarev> I think the RFE needs to be updated at least; a new agent seems like overkill
15:07:30 <slaweq> ralonsoh IMO not yet, it's too early
15:07:43 <ralonsoh> ok, I'll comment that in the LP bug
15:07:49 <ralonsoh> I'm closing this meeting now
15:07:54 <ralonsoh> #endmeeting