14:04:06 <ralonsoh> #startmeeting neutron_drivers
14:04:06 <opendevmeet> Meeting started Fri Nov 15 14:04:06 2024 UTC and is due to finish in 60 minutes. The chair is ralonsoh. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:04:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:04:06 <opendevmeet> The meeting name has been set to 'neutron_drivers'
14:04:16 <ralonsoh> Ping list: ykarel, mlavalle, mtomaska, slaweq, obondarev, tobias-urdin, lajoskatona, amotoki, haleyb, ralonsoh
14:04:19 <mlavalle> \o
14:04:19 <ralonsoh> hello all
14:04:22 <lajoskatona> o/
14:04:24 <obondarev> o/
14:04:31 <slaweq> o/
14:04:41 <s3rj1k> hi all
14:04:50 <ralonsoh> we have many RFEs to discuss today
14:05:01 <ralonsoh> and we have quorum, so I think we can start
14:05:09 <amnik> hello all
14:05:13 <ralonsoh> First one: [RFE] Configurable Agent Termination on OVS Restart
14:05:16 <ralonsoh> https://bugs.launchpad.net/neutron/+bug/2086776
14:05:34 <ralonsoh> s3rj1k, please, present the RFE
14:06:50 <s3rj1k> ralonsoh: Hi, yes, basically the idea is to automate the manual fixes that are currently done by a human to the agent in a k8s environment
14:07:26 <ralonsoh> s3rj1k, my question, same as Liu's in the bug, is: why don't we fix what is broken in the OVS agent code?
14:07:34 <ralonsoh> what exact flows are not recovered?
14:07:43 <s3rj1k> instead of that, the idea is to introduce opt-in functionality to terminate the agent instead of doing flow recovery
14:08:34 <s3rj1k> fixing is a separate thing; this is about making sure that prod envs keep working without human interaction
14:09:13 <ralonsoh> the OVS agent is capable of detecting when the vswitch has been restarted and tries to restore the flows
14:09:13 <opendevreview> Sebastian Lohff proposed openstack/neutron master: docs: Fix typo in openstack create subnet command https://review.opendev.org/c/openstack/neutron/+/935182
14:09:29 <s3rj1k> Liu also mentioned that this is only relevant to containerized envs; for systemd setups similar restarts already happen, as I understand it
14:09:35 <ralonsoh> if we fix what is broken, that will solve your problem (and for anyone else hitting it)
14:10:04 <ralonsoh> Liu is stating, and he is right, that restarting the OVS agent has a high cost on the Neutron API
14:10:14 <s3rj1k> to fix that we need to repro the bug, and there is no time to do that on envs that have client workloads
14:10:47 <s3rj1k> > restarting the OVS agent has a high cost - agree, hence opt-in
14:11:07 <slaweq> I personally feel like this proposal is more of a workaround for some bug rather than a fix for it
14:11:12 <ralonsoh> exactly
14:11:23 <s3rj1k> agree, workaround
14:11:32 <ralonsoh> s3rj1k, can you share with the community what is broken when the OVS agent restores the flows?
14:11:41 <ralonsoh> because this is the critical point here but it is not documented
14:11:46 <ralonsoh> this is what needs to be fixed
14:12:20 <s3rj1k> I have no specific repro right now; the details I have are in the ticket - one of the flows was not recovered
14:12:58 <ralonsoh> s3rj1k, so please, add this information to the LP bug, and how to reproduce it locally, if possible
14:12:59 <s3rj1k> when I have a repro I will create a bug for it, but this does not mean that the workaround is invalid
14:13:04 <ralonsoh> with this information we can try to fix it
14:13:07 <lajoskatona> Could you please check the suggestion from Liu (https://bugs.launchpad.net/neutron/+bug/2086776/comments/5)? It seems reasonable and in sync with the current design
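For context, the opt-in behaviour proposed in the RFE would amount to something like the minimal sketch below. The option name and hook point are assumptions for illustration only, not existing neutron-openvswitch-agent code:

```python
# Illustrative sketch of the RFE's opt-in "terminate instead of resync" idea.
# The option and the hook are hypothetical, not current Neutron code.
import sys

EXIT_ON_OVS_RESTART = True  # would come from a new, opt-in agent config option


def on_ovs_restart_detected(agent):
    """Called from the agent's polling loop once an ovs-vswitchd restart
    has been detected (today the agent rebuilds bridges and flows here)."""
    if EXIT_ON_OVS_RESTART:
        # Exit non-zero so the pod / systemd unit restarts the agent and it
        # comes back with a clean full sync, instead of repairing flows in place.
        sys.exit(1)
    # Default (current) behaviour: restore bridges and flows in place.
    agent.restore_flows_after_ovs_restart()  # placeholder for the existing logic
```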
14:13:32 <ralonsoh> that is a lightweight workaround too
14:13:41 <ralonsoh> no API involved in this case
14:14:52 <ralonsoh> but maybe it could be a solution: instead of the current code that restores the flows when OVS is restarted, restart the OVS agent monitoring mechanism
14:14:59 <lajoskatona> anyway, if we have a list of missing things, as ralonsoh suggested, that can help us find the places to fix
14:15:32 <lajoskatona> yeah, perhaps
14:16:14 <ralonsoh> s3rj1k, so please, before proceeding with any decision, we need to know what is actually broken. Then we can propose a solution (fix the restore methods, lightweight restart or full restart)
14:16:54 <s3rj1k> are we sure we want to block a general workaround on a specific bug repro?
14:17:05 <s3rj1k> things can be done async
14:17:05 <ralonsoh> sorry, what does that mean?
14:17:51 <s3rj1k> continue with the terminate workaround and, as a separate bug, try to find a repro for that one specific flow
14:18:31 <s3rj1k> I am pretty sure there are more cases of broken flow recovery, but I have no info on that, only a vague report on one of them
14:18:50 <ralonsoh> if you ask me to implement a workaround for a bug I don't understand, I will say no
14:18:57 <ralonsoh> I need to know what is broken
14:19:03 <slaweq> personally I don't like the idea of this kind of workaround because it doesn't seem professional to me - it's a bit like implementing a restart of the service to fix a memory leak :) So I would personally vote for fixing the real bugs and handling the OVS restart properly instead of this RFE
14:19:03 <s3rj1k> the fix today is basically a manual restart; nobody is triaging bugs in prod
14:20:08 <ralonsoh> nobody is triaging bugs in prod --> I do this for a living
14:20:16 <s3rj1k> slaweq: so it is better to ignore that and keep first-line ops just restarting all the things?
14:20:23 <slaweq> also, if you need a workaround like that, you can probably implement some small script which will do what you need - it doesn't need to be in the agent itself
14:20:46 <slaweq> s3rj1k no, IMO it's better to try to reproduce the issue in a dev env and try to fix it
14:20:47 <s3rj1k> sure, it can be done as some script
14:21:52 <obondarev> no need to reproduce it intentionally; just next time manual intervention (an agent restart) is needed, dump the flows before and after the restart - that could be enough for investigation
14:21:57 <greatgatsby_> was just coming in here to post about the outage we've had during 2 of our last 3 upgrade tests. Seems directly related to the OVS conversation you're having here now
14:22:29 <lajoskatona> obondarev: +1, good that you added that here
14:22:47 <s3rj1k> as I said, when some details come in, I will create a separate bug for that
14:23:05 <greatgatsby_> we've had a 5 minute outage between when kolla-ansible restarts the OVS containers (which triggers 2 separate flow re-creations) and we don't get FIP connectivity back until the neutron role runs, some 5 minutes later
14:23:24 <s3rj1k> this is more about whether we want some workaround in place or not
14:24:00 <greatgatsby_> in our testing, both the restart of openvswitch_db and openvswitch-vswitchd trigger the neutron agent to recreate the flows; this happens within about 20 seconds of each other
14:25:23 <slaweq> IMHO it is better to have a script, e.g. in https://opendev.org/openstack/osops, for that rather than implement it in neutron
14:25:37 <obondarev> +1 to slaweq
14:25:46 <ralonsoh> s3rj1k, we want to fix this bug, for sure. We don't want to implement a workaround for something we don't understand. In order to push code, we need to know what is happening
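A rough sketch of the kind of out-of-tree script slaweq is suggesting, also folding in obondarev's idea of dumping the flows around the restart as evidence for the LP bug. The container name, bridge list and file locations are deployment-specific assumptions:

```python
#!/usr/bin/env python3
# Out-of-tree watchdog sketch: detect an ovs-vswitchd restart, capture the
# flows for later debugging, and restart the OVS agent. The restart command,
# bridge list and dump locations are assumptions for illustration.
import subprocess
import time

BRIDGES = ["br-int"]
AGENT_RESTART_CMD = ["docker", "restart", "neutron_openvswitch_agent"]


def vswitchd_pid():
    out = subprocess.run(["pgrep", "-o", "ovs-vswitchd"],
                         capture_output=True, text=True)
    return out.stdout.strip()


def dump_flows(tag):
    # Overwrites previous dumps; good enough for a one-off investigation.
    for br in BRIDGES:
        flows = subprocess.run(["ovs-ofctl", "dump-flows", br],
                               capture_output=True, text=True).stdout
        with open(f"/var/log/ovs-flows-{br}-{tag}.txt", "w") as f:
            f.write(flows)


def main():
    last_pid = vswitchd_pid()
    while True:
        time.sleep(10)
        pid = vswitchd_pid()
        if pid and pid != last_pid:
            dump_flows("after-ovs-restart")    # evidence for the LP bug
            subprocess.run(AGENT_RESTART_CMD)  # brute-force recovery
            dump_flows("after-agent-restart")
            last_pid = pid


if __name__ == "__main__":
    main()
```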
14:25:46 <greatgatsby_> for some reason, but not always, the neutron OVS agent doesn't seem to like this and connectivity is lost until the agent is restarted during the neutron role
14:26:17 <greatgatsby_> sorry for just injecting my recent experience, just excited that this is being discussed in here, hoping for some kind of mitigation
14:26:20 <ralonsoh> so, for now, we are not going to vote on this RFE until we have a better description of the current problem
14:26:39 <ralonsoh> s3rj1k, please, do not open a separate bug, provide this info in the current one
14:27:01 <ralonsoh> ok, we have more RFEs and just 35 mins left
14:27:05 <lajoskatona> greatgatsby_: do you think your issue is related to https://bugs.launchpad.net/neutron/+bug/2086776 ?
14:27:14 <s3rj1k> greatgatsby_: can you please add more info to the RFE if possible
14:28:06 <ralonsoh> ok, next RFE: [RFE] L3 Agent Readiness Status for HA Routers
14:28:07 <s3rj1k> ralonsoh: ok, if I have more details I'll add them to this RFE
14:28:12 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/2086794
14:28:27 <ralonsoh> we had a similar bug https://bugs.launchpad.net/neutron/+bug/2011422
14:28:39 <ralonsoh> that was also discussed at a PTG (I don't remember which one)
14:28:44 <s3rj1k> it can also be discussed together with 2086799
14:28:47 <lajoskatona> I have a general comment for this and the next one ([RFE] OVS Agent Synchronization Status for Container Environments)
14:29:09 <ralonsoh> first let's wait for the RFE proposal
14:29:14 <ralonsoh> s3rj1k, please, go on
14:29:41 <lajoskatona> Are these about monitoring our processes / agents etc.? Nova discussed something similar: https://etherpad.opendev.org/p/nova-2025.1-ptg#L255 - perhaps worth checking before reinventing it
14:29:51 <s3rj1k> both RFEs are about extending the data available to k8s probes
14:30:28 <greatgatsby_> it sounds very similar. We've only just identified the trigger, and since it is not 100% reproducible, it takes me a couple of days to get back to a fresh environment. In my next test I will log the flows during the whole upgrade. For now, we've just identified that it's caused by the 2 OVS container restarts; the neutron agent somehow doesn't recover properly, and connectivity to our VMs is lost until minutes later, when the agent is restarted as part of the neutron role in kolla-ansible
14:30:41 <s3rj1k> so that pod management can do monitoring and life-cycling of containerized agents
14:31:16 <greatgatsby_> we're using DVR and VLANs, if that matters
14:31:23 <ralonsoh> s3rj1k, but what is the information required?
14:31:31 <ralonsoh> what exactly do you need to monitor?
14:31:38 <slaweq> I think it is generally a good idea for all the agents to report in some file what they are doing, like "full_sync in progress", "sync finished", etc.
14:31:47 <s3rj1k> the OVS agent's synchronization status (started/ready/dead states)
14:32:02 <ralonsoh> slaweq, why a file? we have the Neutron API and heartbeats
14:32:04 <slaweq> as such a full sync, e.g. after an agent start, can indeed take a very long time
14:32:10 <s3rj1k> and for the second one, the HA router status
14:32:29 <slaweq> ralonsoh IIUC it's about checking it locally
14:32:42 <slaweq> by e.g. a readiness probe in k8s
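To illustrate the k8s side of this, a readiness probe could simply exec a small check against whatever local state the agent exposes. The state file path and state names below are assumptions, since nothing like this exists in Neutron yet:

```python
#!/usr/bin/env python3
# Hypothetical readiness check a k8s exec probe could run inside the agent pod.
# The state file and its contents are assumptions; the actual location and
# format would be defined in the spec discussed below.
import json
import sys

STATE_FILE = "/var/lib/neutron/ovs_agent_state.json"  # assumed path

try:
    with open(STATE_FILE) as f:
        state = json.load(f).get("state")
except (OSError, ValueError):
    sys.exit(1)  # no state reported yet -> not ready

# Exit 0 only once the agent reports it has finished its sync.
sys.exit(0 if state == "operational" else 1)
```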
14:32:42 <s3rj1k> I can extend both RFEs with details if this sounds good in general
14:32:48 <lajoskatona> agree with using the Neutron API, we already have some healthcheck API, that should be extended
14:33:02 <greatgatsby_> we were surprised that both OVS container restarts each trigger flow re-creation in rapid succession
14:33:10 <lajoskatona> Nova is also moving in that direction, see the spec: https://specs.openstack.org/openstack/nova-specs/specs/2024.2/approved/per-process-healthchecks.html
14:33:14 <greatgatsby_> https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/openvswitch/handlers/main.yml
14:33:15 <ralonsoh> we have a fantastic Neutron API that can be fed with anything we want
14:33:36 <slaweq> and for such a purpose, reporting this to the neutron db and checking each agent through the api will not scale at all
14:33:42 <ralonsoh> lajoskatona, exactly: this is a matter of improving the agent information mechanism
14:34:01 <s3rj1k> for k8s probes it is best to use something local to the pod
14:34:27 <slaweq> we can send such info in the heartbeat too, but saving it in a file locally shouldn't hurt anyone and can help in some cases
14:35:41 <slaweq> but we should probably have some standardized list of possible states and reuse them in all agents
14:35:48 <ralonsoh> I'm against storing local info if we don't push it to the API too
14:35:55 <s3rj1k> probes in general tend to run frequently, so anything network-related can have a perf impact on the cluster; better to have local data
14:35:59 <slaweq> so we don't end up with each agent reporting something slightly different
14:36:16 <ralonsoh> slaweq, yes, that was the PTG proposal: having a set of states per agent, which could be configurable
14:36:32 <ralonsoh> DHCP: network status (provisioning), L3: router status, etc.
14:36:45 <ralonsoh> that could be defined in each agent independently
14:37:11 <ralonsoh> but again, this info should go to the API. If we want to store it locally as an option, perfect
14:37:19 <mlavalle> like a state machine
14:37:40 <ralonsoh> mlavalle, yes, and not only a global one per agent, but also per resource (network, router, etc.)
14:37:53 <slaweq> do we need them per agent? shouldn't just a few standard ones be enough? Like e.g. "sync in progress", "operational" or something like that
14:37:56 <s3rj1k> ralonsoh: works for my case if both local and API
14:38:00 <slaweq> what else would be needed there?
14:38:20 <ralonsoh> slaweq, do you mean having standard states?
14:38:25 <ralonsoh> regardless of the resource/agent
14:38:42 <slaweq> ralonsoh yes, IMHO it can work that way
14:38:47 <slaweq> but maybe I am missing something
14:38:50 <ralonsoh> I think that makes sense
14:39:12 <ralonsoh> slaweq, we can always add new states (that don't need to be used by all resources/agents)
14:39:19 <slaweq> true
14:39:32 <mlavalle> maybe there is a common set of states and then each agent has a few of its own
14:39:34 <ralonsoh> but yes, initially I would propose a set of states that could match the state of the common agents
14:39:44 <slaweq> and that way we can have them defined in neutron-lib so that everyone will be able to rely on them pretty easily
14:39:53 <ralonsoh> yes to both
14:39:57 <lajoskatona> so let's have some framework now with basic common information, locally and on the API?
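Pulling this discussion together, the agent-side reporting could look roughly like the sketch below. The constant names, file path and configuration key are illustrative assumptions that the spec would need to pin down:

```python
# Illustrative sketch of common agent states reported both locally and via
# the existing heartbeat; all names here are assumptions, not neutron-lib code.
import json

# Hypothetical common states (to be defined, e.g. in neutron-lib, by the spec).
STATE_STARTING = "starting"
STATE_SYNC_IN_PROGRESS = "sync_in_progress"
STATE_OPERATIONAL = "operational"

LOCAL_STATE_FILE = "/var/lib/neutron/ovs_agent_state.json"  # assumed path


def report_state(agent_state_dict, new_state, write_local=True):
    """Record the agent state in the heartbeat payload and, optionally, in a
    local file that a k8s readiness/liveness probe can read."""
    # The heartbeat already carries per-agent data (e.g. a 'configurations'
    # dict), so the state could ride along with it and reach the Neutron API.
    agent_state_dict.setdefault("configurations", {})["sync_state"] = new_state
    if write_local:
        with open(LOCAL_STATE_FILE, "w") as f:
            json.dump({"state": new_state}, f)
```

A readiness probe like the one sketched earlier would then only need to agree with the agent on the file location and the state names.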
14:40:35 <slaweq> on the api side it is very easy, as you can add whatever you want to the dict sent in the heartbeat IIRC
14:40:41 <ralonsoh> yes, some information that will be provided to the API and, via config, stored locally if needed
14:40:54 <slaweq> so yes, I would say - have it in both places, api and locally
14:41:21 <ralonsoh> I'm ok with this proposal, but of course we need a spec for sure
14:41:25 <slaweq> locally it would be similar to the ha state of the router stored in a file
14:41:32 <ralonsoh> ^ yes
14:41:52 <lajoskatona> +1
14:41:56 <slaweq> yes, a spec would be good. It doesn't need to be long, but we should define there what states we want to have
14:42:13 <ralonsoh> so +1 with a spec
14:42:19 <obondarev> +1
14:42:24 <slaweq> +1 from me
14:42:26 <mlavalle> +1
14:42:26 <s3rj1k> should we start proposing states in the RFE comments?
14:42:29 <lajoskatona> +1
14:42:43 <ralonsoh> s3rj1k, it would be better to describe it in the spec
14:42:45 <mlavalle> s3rj1k: write a short spec
14:43:08 <s3rj1k> ok, will do
14:43:10 <lajoskatona> Like these: https://opendev.org/openstack/neutron-specs/src/branch/master/specs/2024.2 - but under the 2025.1 folder
14:43:12 <mlavalle> s3rj1k: do you know how to propose a spec?
14:43:47 <ralonsoh> s3rj1k, are you going to combine the RFEs in one single spec?
14:43:56 <s3rj1k> I'll walk through the docs; in case of questions I'll just ask them here on a regular day, so no problem with that, thanks
14:44:24 <s3rj1k> > in one single spec? - yes, that can make sense
14:44:41 <ralonsoh> perfect, I'll comment on that in the LP bugs after this meeting
14:45:02 <ralonsoh> so we can skip the 3rd one and go for the last one
14:45:10 <ralonsoh> [RFE] Add OVN Ganesha Agent to setup path for VM connectivity to NFS-Ganesha
14:45:16 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/2087541
14:45:38 <ralonsoh> Is Amir here now?
14:45:52 <amnik> Yes, I'm here
14:46:03 <ralonsoh> perfect, can you explain it a bit?
14:46:12 <amnik> sure
14:46:26 <amnik> According to the Manila documentation, there is a need for network connectivity between the NFS-Ganesha service and the client mounting the Manila share.
14:46:41 <amnik> Currently, no existing solution addresses this requirement. This RFE proposes to solve it using Neutron and OVN.
14:47:28 <slaweq> first thing from me - instead of a new agent we can probably do this as an extension to the neutron-ovn-agent
14:47:59 <ralonsoh> I couldn't agree more (note: during this release I need to make the OVN agent the default one...)
14:48:14 <mlavalle> +1
14:48:20 <ralonsoh> we already have a generic agent for OVN, with a pluggable interface
14:48:29 <mlavalle> that was the whole point of it
14:48:34 <ralonsoh> it is just a matter of implementing the needed plugin and enabling it
14:49:09 <ralonsoh> amnik, what is actually needed? what does the agent need to do?
14:49:17 <ralonsoh> what will trigger these actions?
14:49:43 <amnik> this solution is inspired by the OVN metadata agent
14:50:13 <amnik> it detects ports for Ganesha connectivity and plugs them on the compute nodes
14:50:45 <amnik> these ports are distributed localports, like the metadata port
14:51:12 <slaweq> I don't know anything about Ganesha; can you maybe briefly explain what this port is needed for on the compute nodes, how the VMs will use it, etc.?
14:52:29 <amnik> Ganesha mediates traffic to CephFS. You can think of it as an NFS server between the VMs and CephFS in the Ceph cluster
14:53:23 <amnik> after we plug the port on the compute nodes we add some iptables rules to DNAT the traffic to Ganesha
14:54:00 <amnik> private_ip_port:2049 -> ganesha_ip:2049
14:54:16 <ralonsoh> is this port a VM port?
14:54:58 <slaweq> and is it one virtual port per network? So all VMs connected to that network will use the same port?
14:55:23 <amnik> It is like the metadata port. Not bound to any chassis.
14:55:36 <ralonsoh> but who is creating this port?
14:55:39 <ralonsoh> is it in the OVN database?
14:55:47 <slaweq> and another question: is ganesha_ip from the same private network or not?
14:57:12 <ralonsoh> apart from the technical questions, I'm a bit reluctant to add something so specific to the Neutron repository
14:57:19 <amnik> slaweq: I think it is better to create a port for each share the VMs want to connect to. So we can manage it per share (storage on CephFS)
14:58:26 <amnik> ralonsoh: the port will be created with the API and after that the Ganesha agent detects it
14:59:03 <amnik> ralonsoh: yes, it is in the OVN database, and the agent can detect it with an event
14:59:09 <ralonsoh> how? this port won't be bound to a chassis
14:59:17 <ralonsoh> we use the SB and the local OVS database
14:59:37 <ralonsoh> same as an ovn-controller
15:00:05 <slaweq> could it be that the neutron-ovn-agent extension which would be responsible for this would actually live in Manila and just be loaded by the neutron-ovn-agent if needed? We can accept a new device type if needed, so that you can create a port with this new device type in the neutron api, but other things may actually belong in Manila, as it seems that that is the team of experts on this topic, not us
15:00:39 <amnik> slaweq: No, Ganesha is a service deployed, for example, on our controller servers, and it has an IP from that server's network.
15:02:10 <amnik> ralonsoh: the agent can detect it by a device_id convention like "ovnganesh-" and the port type in the Port_Binding table
15:02:27 <ralonsoh> slaweq, yes, that could work
15:02:45 <ralonsoh> amnik, you said this port is not bound
15:03:04 <ralonsoh> who is physically creating this port?
15:03:37 <slaweq> if that had to land in the neutron repository, we would probably need a spec, but I agree with ralonsoh that this may be a bit outside our expertise here, so maybe it would be better to keep as much as possible in Manila and just add the necessary bits in neutron (neutron-lib) to handle that new type of port correctly
15:04:26 <ralonsoh> agree with this
15:04:28 <amnik> ralonsoh: This is a separate API call; the user should create the port (openstack port create ...)
15:04:41 <ralonsoh> amnik, no
15:04:49 <ralonsoh> who is creating the layer one port?
15:05:13 <ralonsoh> anyway, we are running out of time
15:05:19 <ralonsoh> we can continue after this meeting
15:05:32 <ralonsoh> what is the recommendation for this RFE?
15:05:53 <ralonsoh> I agree with slaweq's comment
15:06:27 <mlavalle> me too
15:06:33 <slaweq> I would like to see a detailed spec with an exact description of how this is going to work, in both the API and the backend
15:06:36 <amnik> ralonsoh: This is the responsibility of the Ganesha agent. The user creates the port; the agent will plug it on the compute node with veth devices, like the metadata port
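To summarize the mechanism amnik is describing, an extension along the lines slaweq suggested could, very roughly, watch for the Ganesha localports and wire them up on each compute node. Everything below (function shapes, helper names, the "ovnganesh-" handling) is an illustrative sketch, not the actual neutron-ovn-agent extension API:

```python
# Rough sketch of what a neutron-ovn-agent extension for this could do.
# Helper functions and the event wiring are hypothetical placeholders; a real
# extension would use the OVN SB IDL events and Neutron's interface drivers.
import subprocess

GANESHA_DEVICE_ID_PREFIX = "ovnganesh-"  # device_id convention from the RFE
NFS_PORT = "2049"


def plug_localport(port_id):
    """Hypothetical placeholder: create a veth pair, plug one end into br-int
    for the localport (as the metadata agent does), return host-side ifname."""
    raise NotImplementedError("implementation detail for the spec")


def on_ganesha_port_detected(port_id, device_id, private_ip, ganesha_ip):
    """Would be triggered by an OVN SB Port_Binding event for the localport."""
    if not device_id.startswith(GANESHA_DEVICE_ID_PREFIX):
        return
    ifname = plug_localport(port_id)
    # DNAT NFS traffic addressed to the share's private IP towards Ganesha,
    # i.e. private_ip:2049 -> ganesha_ip:2049 as amnik described above.
    subprocess.run([
        "iptables", "-t", "nat", "-A", "PREROUTING",
        "-i", ifname, "-p", "tcp",
        "-d", private_ip, "--dport", NFS_PORT,
        "-j", "DNAT", "--to-destination", f"{ganesha_ip}:{NFS_PORT}",
    ], check=True)
```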
15:07:00 <ralonsoh> but for now this RFE is not approved, right?
15:07:09 <slaweq> Then we can discuss there what has to be in neutron and maybe what can be somewhere else
15:07:28 <obondarev> I think the RFE needs to be updated at least; a new agent seems like overkill
15:07:30 <slaweq> ralonsoh IMO not yet, it's too early
15:07:43 <ralonsoh> ok, I'll comment on that in the LP bug
15:07:49 <ralonsoh> I'm closing this meeting now
15:07:54 <ralonsoh> #endmeeting