14:00:17 #startmeeting neutron_drivers
14:00:17 Meeting started Fri Jan 31 14:00:17 2025 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:17 The meeting name has been set to 'neutron_drivers'
14:00:19 Ping list: ykarel, mlavalle, mtomaska, slaweq, obondarev, tobias-urdin, lajoskatona, amotoki, haleyb, ralonsoh
14:00:20 \o
14:00:26 hello
14:00:28 o/
14:01:33 will wait another minute for quorum
14:01:40 o/
14:02:10 i had pinged daniel about the one item on our agenda, just noticed he said he might be late
14:02:15 o/ I'm here
14:02:20 oh hi
14:02:31 Hi! literally just got my laptop open in time hehe
14:02:32 we can get started then
14:02:44 #link https://bugs.launchpad.net/neutron/+bug/2096704
14:02:46 [RFE] Allow rebalancing of OVN LRPs
14:03:27 Shoot, how can I help clarify the RFE :)
14:03:50 so i think you had talked to ralonsoh last week about this? it does seem like a gap to me
14:04:10 well, not a gap
14:04:21 but a useful tool when using OVN L3
14:04:52 seems a sensible tool
14:04:59 :)
14:05:02 Tuesday ralonsoh said it would be a tool, not an API, am I right about it?
14:05:36 yes, that should be a tool
14:05:58 but f0o can explain how it should be and the goal
14:06:22 For us it would scratch the itch that Nova already has covered with migrations: we can always rebalance our compute nodes with some maintenance window(s), but the network nodes seem destined to just overflow and give in, leaving little room for horizontal scaling, as we cannot reassign the LRP priorities to the new nodes
14:08:27 yes, but this is the use case
14:08:33 how should this tool work?
14:08:38 why is it needed?
14:09:07 are there other options? etc
14:10:23 f0o: are you still here?
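The imbalance f0o describes can be made concrete: each OVN LRP carries a list of gateway chassis with priorities, and the chassis holding the highest priority hosts that router's traffic. A minimal sketch (hypothetical chassis and LRP names, not data from the meeting) that tallies how many LRPs each chassis is primary for:

```python
from collections import Counter


def primary_chassis_counts(lrp_priorities):
    """Count how many LRPs each chassis is primary (highest-prio) for.

    lrp_priorities maps LRP name -> {chassis_name: priority}; the
    chassis with the highest priority on an LRP hosts its traffic.
    """
    counts = Counter()
    for lrp, prios in lrp_priorities.items():
        primary = max(prios, key=prios.get)
        counts[primary] += 1
    return counts


# Hypothetical deployment in the state discussed below:
# rt1 ended up primary for every LRP, rt2 for none.
lrps = {
    "lrp-a": {"rt1": 2, "rt2": 1},
    "lrp-b": {"rt1": 2, "rt2": 1},
    "lrp-c": {"rt1": 2, "rt2": 1},
}
print(primary_chassis_counts(lrps))  # Counter({'rt1': 3})
```

A report like this is the read-only half of the proposed tool; the rebalancing half would rewrite the priority lists.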
14:10:29 I can only speak for us; we are happy with a one-shot command that we can cron or issue on-demand when we see capacity imbalances. But if there's a better option akin to a feedback loop, where neutron would check on chassis events whether it needs to rebalance or not, that can also work. However, I'm unsure how invasive changing the LRP priorities is
14:11:08 I dug a bit and saw that I could just YOLO-change the priorities of the lrp<>chassis bindings, but I have no idea how that affects neutron if I were to do that with a bash loop
14:11:52 The situation we are in now is a deadlocked one. The only way I can see to get the priorities shifted is to remove all routers and re-add them, which breaks FIPs and causes big outages
14:12:23 how did you end up in this situation?
14:12:35 I'm not even sure how one GW ended up being the prio for all lrps - but that's out of scope, because even if that's fixed in $future, the bad situation won't be fixed retrospectively
14:12:35 did you remove GW chassis?
14:12:59 we had no notable downtime of GW chassis; we had a few maintenances that caused a few failovers, but nothing longer than an hour
14:13:18 during which we didn't accept public neutron endpoint requests, to avoid customers creating routers
14:13:19 but you removed GW chassis from the deployment
14:13:46 we did not, we just quit ovs, performed the maintenance and rebooted
14:14:14 did you stop GW chassis?
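The one-shot rebalance f0o asks for could work roughly as follows: compute a target assignment that spreads primary ownership evenly, then apply it. A sketch of only the planning step, in Python with hypothetical names; it touches nothing in OVN, which is the invasive part discussed below:

```python
def plan_rebalance(lrps, chassis):
    """Spread primary ownership of LRPs evenly across gateway chassis.

    Returns {lrp: [chassis, ...]} ordered highest-priority first.
    This is only a plan; writing it back to the OVN NB database is a
    separate step and is what may need to involve neutron.
    """
    plan = {}
    for i, lrp in enumerate(sorted(lrps)):
        primary = chassis[i % len(chassis)]  # round-robin the primary
        # Primary first, remaining chassis as lower-prio fallbacks.
        plan[lrp] = [primary] + [c for c in chassis if c != primary]
    return plan


plan = plan_rebalance(["lrp-a", "lrp-b", "lrp-c", "lrp-d"], ["rt1", "rt2"])
# Alternates primaries: lrp-a -> rt1, lrp-b -> rt2, lrp-c -> rt1, ...
```

A feedback-loop variant would run the same planner on chassis events and apply only the diff against the current assignment.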
14:14:36 yeah, via systemctl stop ovs-vswitchd
14:15:01 so this is why OVN rebalanced the LRP assigned chassis
14:15:45 so you manually performed an action that triggered the rebalance of the assigned chassis
14:15:55 likely; it was our understanding that the LRP priorities would go back to the previously assigned chassis
14:16:10 eventually, when the high-prio chassis is restored, the assigned chassis should be changed back, yes
14:16:32 but if that didn't happen, then you need to check your OVN chassis list
14:16:55 in any case, about this tool, it should be an admin-only tool, of course
14:17:00 what I saw when we spoke is that the rt1 chassis has prio 2 and the rt2 chassis has prio 1 on all lrps
14:17:18 so the chassis is there, but it's low prio on all lrps, instead of regaining prio 2 on some of them
14:17:36 yes, but you didn't tell me that you rebooted systems
14:17:42 that was missing in this conversation
14:17:45 in any case
14:18:01 first of all, you should also open a bug related to the OVN L3 scheduler
14:18:15 how should the maintenance have been performed to avoid this in the future? because some actions require reboots
14:18:26 if with 2 GW chassis the balance is not correctly done, this could be an algorithm issue
14:18:51 f0o, as commented, when you restart a GW chassis, OVN restores the high-prio chassis assignment
14:18:59 if I were to add a 3rd GW chassis now, how would the existing LRPs be redistributed so that the new GW wouldn't be empty until a new LRP is created?
14:19:27 f0o, by default, the underlying architecture is usually static in a deployment
14:19:45 you are talking about upgrade/maintenance operations
14:19:53 and this is fine, but this is not normal operation
14:20:05 do not expect the OVN L3 scheduler to re-balance
14:20:44 if the L3 scheduler won't rebalance, then I need to be able to do so, no?
14:21:20 or at least have the option to do so if I wanted to. Otherwise I will permanently have a dying node and no way to avoid it, since it just happens to have more LRPs historically
14:22:04 again, I think this tool is something nice to have and useful **after cluster modifications**
14:22:35 +1
14:22:54 f0o: would you implement it?
14:24:26 mlavalle: I have no domain knowledge of how external changes to the OVN LRP prios would affect neutron internals. But I can create a tool that just scans all LRPs and all chassis and redistributes the prios to get a balanced sheet. If neutron needs to be informed of this, and somebody can point me to _how_ I could inform neutron of this change so the DB (I guess?) is in sync, then I can do that as well
14:25:36 this is where we are halted now as well; if we are able to freely reassign prios without neutron going haywire from it (AKA no notifications needed), then I can make a PoC very quickly
14:27:22 I would need to check, but this is only OVN NB config: there is a list of gateway_chassis registers per LRP and they are assigned to the LRP.gateway_chassis list
14:27:37 Neutron DB doesn't play a role here
14:28:18 so then issuing "ovn-nbctl lrp-set-gateway-chassis lrp-123 chassis-with-prio-1 900" on the northd hosts should cause no problems, right?
14:29:02 hold on, if you are creating a neutron script, you should use the mechanisms provided by Neutron
14:29:12 we are not talking about a bash script here
14:29:25 check any other Neutron script
14:30:14 yeah, like remove_duplicated_port_bindings.py
14:30:23 https://opendev.org/openstack/neutron/src/branch/master/neutron/cmd/remove_duplicated_port_bindings.py
14:30:32 that's a good example
14:31:29 ok, I'll have a look at the OVN NB codebase in neutron and see what I can access/use there
14:32:14 +1
14:32:21 maybe neutron_ovn_db_sync_util, which has access to the OVN DB, could be more useful
14:32:31 do we need a spec for this RFE?
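To make the apply step above concrete: a rebalance plan (LRP -> chassis list, highest priority first) maps onto `ovn-nbctl lrp-set-gateway-chassis` invocations like the one f0o quotes. A hedged sketch that only generates the command strings; the function name, priority scheme, and example data are hypothetical, and as ralonsoh notes, a real neutron tool would use Neutron's OVN NB client rather than shelling out:

```python
def nbctl_commands(plan, base_priority=100):
    """Translate a rebalance plan into ovn-nbctl command strings.

    plan maps LRP name -> chassis list, highest-priority first.  Each
    chassis gets a descending priority so OVN schedules the first one
    as primary.  A production tool would perform these writes through
    Neutron's OVN NB API, in the style of
    neutron/cmd/remove_duplicated_port_bindings.py.
    """
    cmds = []
    for lrp, chassis_order in sorted(plan.items()):
        for i, chassis in enumerate(chassis_order):
            prio = base_priority - i
            cmds.append(
                f"ovn-nbctl lrp-set-gateway-chassis {lrp} {chassis} {prio}")
    return cmds


for cmd in nbctl_commands({"lrp-123": ["rt2", "rt1"]}):
    print(cmd)
# ovn-nbctl lrp-set-gateway-chassis lrp-123 rt2 100
# ovn-nbctl lrp-set-gateway-chassis lrp-123 rt1 99
```

Since per the discussion the Neutron DB does not track these priorities, generating and reviewing the commands before applying them would also make a cautious dry-run mode cheap to add.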
14:32:34 I don't think so
14:32:50 but it should be properly documented
14:33:26 if it's small in scope i would agree we don't need a spec
14:34:34 i think it should be separate from neutron-ovn-db-sync to be clear
14:34:50 yes, this is just an example
14:34:59 so should we vote then?
14:35:07 +1
14:35:14 +1
14:35:16 +1
14:35:25 +1, with good documentation of what the user can expect when using the tool
14:35:35 +1
14:35:43 +1 (not even sure if mine counts ;)
14:36:12 lol, it doesn't, but it is welcome
14:36:15 your development will count more :-)
14:36:28 f0o: ok, can you write up the summary here and add it to the bug?
14:36:38 i will approve it
14:37:01 haleyb: will do
14:37:35 we can then just comment in any patch
14:37:56 f0o: thanks for bringing up the issue and thanks for working on it
14:38:03 Happy to help :)
14:38:41 is there anything else people want to discuss?
14:39:10 ok, then thanks for attending, and have a good weekend!
14:39:19 #endmeeting