f0o | Good Morning - Here some preliminary findings on the potential OVS route leakage: https://paste.opendev.org/show/bluBQiReoTozIfNPXnxU/ | 07:50 |
---|---|---|
f0o | my immediate guess is that OVS utilizes the default VRF for everything which is why it can ping br-vxlan/br-mgmt/br-OSA* devices | 07:51 |
f0o | which means that this issue might not be caused by my specific setup | 07:52 |
f0o | bumpy road, I should not have my IRC client in a test VM on the OpenStack whose routing I'm fiddling with | 08:12 |
f0o | but on the other hand it shows results immediately lol | 08:12 |
jrosser | f0o: the only thing i can suggest is that someone else with an OVN setup (i don't have one) can try to reproduce pinging something on the mgmt net from a vm | 08:35 |
f0o | jrosser: I've done more tests in the meantime | 08:37 |
f0o | and the conclusion is that OVS/OVN entirely ignore VRFs and just put the packet onto the interface that has a matching route, bypassing any routing policies | 08:37 |
f0o | I've "resolved" it by moving _ALL_ interfaces away from the default VRF, which made OVN and OVS crash, so I added single host-routes back into the default routing table (for all northd hosts and the VTEPs of the hypervisors) | 08:38 |
f0o | This is not a solution, but it removes the bulk of the management stuff away. I can still ping the VTEPs and Northd IPs from VMs which is bad enough | 08:38 |
f0o | I think the only correct solution to this is applying iptables/nftables rules to forbid forwarding if the exit-interface is anything that isn't part of the provider-networks | 08:40 |
f0o | but that can have some side-effects too since it's quite the jack-hammer method | 08:41 |
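For reference, a minimal sketch of the nftables variant of that idea, assuming the interface names used elsewhere in this log (br-mgmt, br-vxlan, lxcbr0) and that nothing legitimate needs to be forwarded into them from the provider side:

```shell
# Hedged sketch: drop anything the gateway node would forward out of the
# management-side interfaces; provider-network traffic is unaffected because
# it never leaves via these devices. Interface names are from this discussion.
nft add table inet ovsguard
nft add chain inet ovsguard forward '{ type filter hook forward priority 0; policy accept; }'
nft add rule inet ovsguard forward oifname "br-mgmt" drop
nft add rule inet ovsguard forward oifname "br-vxlan" drop
nft add rule inet ovsguard forward oifname "lxcbr0" drop
```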
jrosser | but surely this is so fundamental that there must be a strategy from the neutron team for dealing with it | 08:41 |
f0o | for your quick test, you should always be able to ping 10.0.3.1 in OSA AIO since that's always fixed and present | 08:41 |
jrosser | i don't have anything with ovs/ovn so can't reproduce, outside an all-in-one | 08:42 |
f0o | I just got a full nmap port-scan against 10.20.1.126, which is a northd host for which I had to place a host-only route in the OVN Gateway Nodes' default table | 08:42 |
f0o | this is scary | 08:42 |
f0o | as an adversary I can quite quickly scan RFC ranges for northd ports (those are fixed ports) and attempt a DoS with some fuzzed packets to get northd to crash or worse | 08:43 |
f0o | jrosser: the AIO one should have the same issue, if a VM there can ping 10.0.3.1 you know you're vulnerable | 08:44 |
f0o | because that's the fixed IP for lxcbr0 | 08:44 |
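A quick way to run that check on an AIO, assuming you have a test VM with a floating IP and shell access to the host:

```shell
# On the AIO host: watch lxcbr0 for ICMP leaking out of the overlay.
tcpdump -ni lxcbr0 icmp

# Inside the test VM: 10.0.3.1 is the fixed lxcbr0 address on an OSA AIO.
ping -c 3 10.0.3.1
```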
f0o | I'm just now installing virtualbox to get an AIO setup on my box | 08:44 |
noonedeadpunk | f0o: are you using ovn bgp agent in a setup? | 08:46 |
f0o | nope | 08:46 |
noonedeadpunk | well, then we haven't tried vrfs in our setup in fact.... we've tried with the bgp agent, but we quickly found that it would require a patch that's not merged yet | 08:48 |
f0o | I don't think this is related to VRFs | 08:49 |
noonedeadpunk | https://review.opendev.org/c/openstack/ovn-bgp-agent/+/906505 | 08:49 |
f0o | just installing ubuntu into vbox right now to get this replicated into AIO | 08:49 |
jrosser | noonedeadpunk: i think it's more a question of whether you can just get at things you would not expect to, from the vm | 08:49 |
jrosser | this is not to do with bgp at all | 08:49 |
noonedeadpunk | but eventually, I do see vms traffic on aio on br-vlan | 08:50 |
noonedeadpunk | but I'd need to wrap my head around this issue I guess, as I'm quite slow at understanding networking | 08:51 |
jrosser | f0o: i think that it's confusing as you've got a lot of terms we don't normally use at all on the openstack nodes, vrf for example | 08:51 |
noonedeadpunk | wait | 08:51 |
f0o | I wrote a wall of text yesterday (my) evening with what I suspect is happening | 08:52 |
noonedeadpunk | so you can ping your mgmt network from vm o_O | 08:52 |
jrosser | these are usually things on switches/routers | 08:52 |
f0o | yes I can ping br-mgmt/br-vlan/br-vxlan/lxcbr0 from within a VM | 08:52 |
noonedeadpunk | this I can test really quickly.... | 08:52 |
jrosser | f0o: yes i understand, but also you are totally familiar with your setup - so bear with us whilst we try to understand | 08:52 |
f0o | no worries, I'll try to get this done in AIO so I can share the configs for one-click replication | 08:52 |
noonedeadpunk | I have quite big ovn sandbox handy | 08:53 |
noonedeadpunk | like 7 computes, 3 control planes, standalone net nodes, etc | 08:53 |
f0o | noonedeadpunk: try pinging 10.0.3.1 from a VM on geneve then | 08:53 |
noonedeadpunk | some of them are broken with the bgp agent though... but that shouldn't be an issue | 08:53 |
jrosser | noonedeadpunk: see line 36 of this https://paste.opendev.org/show/bluBQiReoTozIfNPXnxU/ | 08:54 |
jrosser | thats the interesting part | 08:54 |
f0o | for me the OVN GatewayNodes are super happy to just copy the packet into lxcbr0 | 08:54 |
noonedeadpunk | well, I don't have gateway nodes there where I have lxcbr0 | 08:55 |
jrosser | so a difference would be that noonedeadpunk has standalone gateway nodes so i expect these are not running lxc | 08:55 |
f0o | noonedeadpunk: then use your br-mgmt ;) | 08:55 |
f0o | point is it puts packets where it shouldn't | 08:55 |
noonedeadpunk | and eventually, the assumed choice is to have gateway nodes either standalone or on computes with OVN | 08:55 |
jrosser | but they must have an interface/ip on the openstack mgmt network | 08:55 |
noonedeadpunk | on the contrary to OVS | 08:55 |
noonedeadpunk | they do, sure | 08:55 |
f0o | ping those then | 08:55 |
noonedeadpunk | yeah, sec | 08:56 |
f0o | because I can ping OSA management nodes from my VM | 08:56 |
f0o | run a tcpdump on your gateway nodes to see if it pushes packets. there may or may not be a return path but that does not mean it's not moving the packets into that device | 08:56 |
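The capture being suggested is roughly the following (a recent tcpdump prints the egress device and direction when capturing on the `any` pseudo-interface; 185.243.23.0/24 is the FIP range used in this log, substitute your own):

```shell
# Capture on all interfaces and note which device packets leave on; filter
# on the VM/FIP source range so only leaked traffic shows up.
tcpdump -ni any 'icmp and src net 185.243.23.0/24'
```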
f0o | because when I ping the VTEPs (10.20.8.0/22) I don't see a reply on the VM but the tcpdump shows that the packet was pushed out br-vxlan | 08:59 |
f0o | 08:58:14.965375 br-vxlan Out ifindex 20 86:ec:48:c5:67:54 ethertype IPv4 (0x0800), length 104: 185.243.23.86 > 10.20.8.11: ICMP echo request, id 41, seq 1, length 64 | 08:59 |
f0o | there just isn't a return path for the VTEP back into the FIP range | 08:59 |
f0o | just getting that ubuntu vm started now for the AIO replication | 08:59 |
* noonedeadpunk needs to find a genuine working net node first :D | 09:00 | |
f0o | :D | 09:00 |
jrosser | the AIO has some iptables/nat stuff just to make it more complicated | 09:00 |
f0o | nice | 09:00 |
f0o | :D | 09:01 |
f0o | it's an obscure issue because I bet that in 99% of cases those RFC networks don't have a return path for the packet, so it seems like it doesn't leak. In my case the Gateway Nodes are the actual ToR routers, so all hypervisors have a default route to them for connectivity. So there is always a return path (other than for br-vxlan, because that is local only) | 09:02 |
noonedeadpunk | well, actually on AIO I for sure saw plain traffic on br-vlan from VMs | 09:03 |
noonedeadpunk | So basically IF I just add a gateway ip of the "public" network of the AIO to the br-vlan interface there - it gets happily src-NATed by the node | 09:04 |
noonedeadpunk | and I'd assume that nothing would really stop such an aio vm from pinging networks local to the node | 09:05 |
f0o | nice, virtualbox doesn't run for me -.- | 09:11 |
f0o | wget gets a segfault on a stock ubuntu 22.04 | 09:11 |
noonedeadpunk | so I see that https://paste.opendev.org/show/bVemrDPKgYSjWu5uxWSW/ | 09:14 |
noonedeadpunk | 10.21.11.52 is a mgmt IP of a gateway node I'm trying to ping from VM with floating IP | 09:14 |
f0o | and I'm guessing bond0 is the br-ext? | 09:15 |
noonedeadpunk | yep, it's part of br-ext ovs bridge | 09:15 |
noonedeadpunk | oh, well... there's no fip... src nat only... Let me assign a fip :D | 09:15 |
noonedeadpunk | (but I guess it should be same) | 09:16 |
f0o | should be the same I think | 09:16 |
f0o | for me vlan3012 holds the gateway IP of the FIP range because the router can deliver the packet itself - so it is its own next-hop | 09:17 |
f0o | genev_sys_6081 > vlan3012 (with IP of the gw of neutron's external network) > funky stuff happens here | 09:17 |
f0o | if I fabricate a packet and send it over the wire to vlan3012 everything seems to be working as expected, the packet gets discarded because no route is found. | 09:18 |
noonedeadpunk | yeah, so I guess indeed once you add IP on the interface - things can go in a weird direction | 09:18 |
f0o | if the packet comes from genev_sys_6081 then funky stuff happens | 09:18 |
f0o | it's like packets from OVS don't obey routing tables | 09:19 |
noonedeadpunk | I guess I would create a separate routing table and forward all traffic from the public nets to it explicitly | 09:19 |
noonedeadpunk | as I don't think you can use vrfs in fact | 09:19 |
f0o | that's what VRFs do | 09:19 |
f0o | VRFs in linux are just separate routing tables | 09:19 |
noonedeadpunk | or well, at least we didn't manage to get them working nicely with the bgp setup | 09:20 |
f0o | no, neither did I with ovn-bgp - which is why I skipped it | 09:20 |
noonedeadpunk | well, yes, but I guess you need to put an interface into the vrf? | 09:20 |
noonedeadpunk | and in this case the bridge interfaces that are leaking are created by ovs/ovn? | 09:21 |
f0o | you can slave interfaces into vrf but it's not a requirement, you can also just next-hop from one table into the next | 09:21 |
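For readers less familiar with Linux VRFs, a small sketch of the two approaches f0o mentions (device names and table numbers are illustrative):

```shell
# (a) Enslave an interface into a VRF: the VRF is just a device bound to its
# own routing table (table 100 here), consulted via the l3mdev policy rule.
ip link add mgmt-vrf type vrf table 100
ip link set mgmt-vrf up
ip link set br-mgmt master mgmt-vrf
ip route show vrf mgmt-vrf

# (b) No enslaving: steer traffic from a given source network into another
# table with a policy rule; that table only contains the routes you put there.
ip rule add from 185.243.23.0/24 lookup 20 priority 999
ip route show table 20
```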
noonedeadpunk | yeah, exactly | 09:21 |
f0o | but yes in this case the leaking (and slaved) interfaces are created by OVS/OVN | 09:21 |
noonedeadpunk | we did that for bgp... | 09:21 |
noonedeadpunk | so I guess that's what I meant by not being able to use vrf | 09:22 |
noonedeadpunk | but jumping from table should work | 09:22 |
noonedeadpunk | sorry, my networking is not really great, so I don't always use the proper language to explain myself | 09:22 |
f0o | no worries I get what you say | 09:23 |
f0o | I'm attempting to try it out now | 09:23 |
f0o | obviously the solution to this is "don't be your own next-hop", because then OVS can't do funky stuff and packets _have to_ obey the VRFs - but that's hardly a solution, more of a workaround to me | 09:24 |
noonedeadpunk | yeah | 09:24 |
f0o | like you said earlier, have the gateway nodes on the compute nodes | 09:24 |
f0o | at the cost of dragging the whole external vlan to all hosts | 09:25 |
noonedeadpunk | well, for this specific matter we have standalone hosts designated for gateways | 09:25 |
noonedeadpunk | as we don't wanna have that on controllers nor on computes | 09:26 |
noonedeadpunk | but it's $$ | 09:26 |
f0o | but those designated gateways are just dumb routers then, because you can't have them actually route anything. They just translate vxlan to vlan and dump it on the wire. very very expensive vteps | 09:26 |
jrosser | ^ same here, i have "separated and simplified" rather than collapse any of these functions together | 09:26 |
noonedeadpunk | geneve to vlan, but yes :) | 09:27 |
f0o | because the moment those designated gateways could push the packets on L3 you end up in this issue that they will happily push it into br-mgmt | 09:27 |
noonedeadpunk | and well. we also do have vpnaas, which does spawn network namespaces there | 09:27 |
noonedeadpunk | well, unless you inject a rule to use a different routing table with high prio, so that the main routing table is never consulted | 09:28 |
noonedeadpunk | and some iptables on top :) | 09:28 |
f0o | noonedeadpunk: but that's exactly what I do | 09:28 |
f0o | minus iptables | 09:28 |
noonedeadpunk | hm | 09:29 |
f0o | noonedeadpunk: https://paste.opendev.org/show/buOWbgSHxagRB00GurO0/ | 09:29 |
noonedeadpunk | but then if the route to the mgmt net is not present in that vrf - how does it get to it | 09:29 |
f0o | I'm running that setup now (https://paste.opendev.org/show/buOWbgSHxagRB00GurO0/) and with this I can no longer ping br-mgmt but I can ping br-vxlan and those host-routes of northd | 09:30 |
f0o | so it's better than before, but it still happily pushes packets along any routes in the default table | 09:30 |
f0o | the 3 host-routes on br-mgmt are northd just as clarification | 09:31 |
f0o | https://paste.opendev.org/show/bciHUaK5qBGTiEX0ng1j/ << the ICMP test showing that northd is pingable but a different host on br-mgmt is not | 09:32 |
f0o | this is "as good" as I can get it right now without employing nftables | 09:32 |
noonedeadpunk | we played with smth like that https://paste.openstack.org/show/bGvFW15cfKdirA6KDkCY/ | 09:36 |
noonedeadpunk | where 203.0.113.0/24 is our public network | 09:37 |
noonedeadpunk | but yeah | 09:37 |
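The paste itself isn't reproduced in this log, but from the description it would be along these lines (table number, gateway address, and device are placeholders):

```shell
# Send everything sourced from the public network 203.0.113.0/24 through a
# dedicated table before the main table is ever consulted; that table holds
# only provider-side routes, so nothing on the host's mgmt networks matches.
ip rule add from 203.0.113.0/24 lookup 200 priority 999
ip route add default via 203.0.113.1 dev bond0 table 200
```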
f0o | and that worked? | 09:41 |
f0o | because VRFs in linux are part of the rule 1000 (l3mdev-table) | 09:42 |
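That refers to the policy-rule order the kernel sets up once a VRF device exists; it looks roughly like this (the pref 999 rule is the one added by hand, as above):

```shell
ip rule show
# 0:     from all lookup local
# 999:   from 185.243.23.0/24 lookup 20
# 1000:  from all lookup [l3mdev-table]   <- l3mdev (VRF) rule, added when the first VRF is created
# 32766: from all lookup main
# 32767: from all lookup default
```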
noonedeadpunk | well, I'm not sure we had exactly the same usecase frankly speaking | 09:42 |
noonedeadpunk | as there was no vrf per se | 09:42 |
noonedeadpunk | but that pretty much ensured that vms can't reach pretty much anything that's on the host | 09:43 |
f0o | let me see that i can replicate that | 09:43 |
f0o | so by shifting br-mgmt into the MGMT vrf I nuked my entire setup | 10:03 |
f0o | and I didnt notice | 10:03 |
f0o | remember when I wanted HAProxy to bind on a specific interface? that's because I wanted br-mgmt in a VRF and HAProxy wouldn't utilize it without that. So yeah, I killed HAProxy | 10:04 |
f0o | totally spaced that | 10:04 |
f0o | but alright that fire is put out, back to policies | 10:05 |
f0o | noonedeadpunk: I ran `ip rule add from 185.243.23.0/24 lookup 20` on both routers (table 20 is the IBGP full table) but the problem persists, my VM can still ping br-mgmt etc | 10:51 |
f0o | basically I replicated your last paste, including the 999 rule prio | 10:51 |
f0o | https://paste.opendev.org/show/bqkxpnVt3JZQcrjHM8iJ/ | 10:52 |
noonedeadpunk | but how in the world is it getting there? | 10:56 |
noonedeadpunk | unless you have a route to br-mgmt in 20 | 10:57 |
noonedeadpunk | or a "default route" in that table covers it | 10:57 |
f0o | Nope | 10:59 |
f0o | `ip route show to match 10.20.0.11/32 table 20` returns empty | 10:59 |
f0o | `ip r sh table 20 | egrep -e default -e "^10.20.0." -e br-mgmt` also returns empty | 11:00 |
f0o | the whole box has not a single default route | 11:00 |
f0o | odd disconnect... I was saying that the whole box has no def routes anywhere and that grepping for br-mgmt/default/10.20.0 in table 20 would return nothing | 11:01 |
f0o | but this is why I believe that OVS just entirely ignores the kernel routing tables and just moves packets into the interfaces itself | 11:02 |
f0o | I cannot explain it otherwise | 11:03 |
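One way to confirm that from the kernel's side, using the addresses mentioned above (if the kernel itself has no usable route it refuses the lookup, yet the packet still shows up on br-mgmt):

```shell
# Ask the kernel which route it would pick for a *forwarded* packet from the
# FIP range towards the mgmt address; "iif" makes this a forwarding lookup.
ip route get 10.20.0.11 from 185.243.23.86 iif vlan3012
# With no matching route in any consulted table this typically fails with
# "RTNETLINK answers: Network is unreachable" - yet OVS still emits the packet.
```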
noonedeadpunk | Actually... I think you might be right here. As in order for OVS to respect kernel routing, you need to explicitly "eject" traffic by defining a flow in OVS | 11:07 |
noonedeadpunk | I think this is actually what ovn-bgp-agent is doing to make things work fwiw | 11:07 |
noonedeadpunk | So I guess I also had a flow (in `ovs-ofctl dump-flows br-ext`) like `cookie=0x3e7, duration=489002.621s, table=0, n_packets=166709, n_bytes=10220943, priority=900,ip,in_port="patch-provnet-0" actions=mod_dl_dst:1a:f8:a1:5c:d0:43,NORMAL` | 11:09 |
noonedeadpunk | where `1a:f8:a1:5c:d0:43` was mac of br-ext bridge in kernel space | 11:10 |
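For context, a flow like that can be installed by hand; based on the dump above it would look roughly like this (bridge name, port name, and MAC are the ones from noonedeadpunk's example, and this mirrors what ovn-bgp-agent is said to program):

```shell
# Rewrite the destination MAC of IP traffic coming in from the provider patch
# port to the kernel-side bridge MAC, so the packet is delivered to the local
# port and the kernel routing tables (VRFs, policy rules) actually apply.
ovs-ofctl add-flow br-ext \
  "priority=900,ip,in_port=patch-provnet-0,actions=mod_dl_dst:1a:f8:a1:5c:d0:43,NORMAL"
```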
noonedeadpunk | but yeah | 11:10 |
noonedeadpunk | this is already a mess | 11:10 |
noonedeadpunk | I'm not sure this is a good way to go anyway | 11:10 |
f0o_ | this sure is a mess | 11:25 |
f0o_ | but it might make sense to force OVS to flush onto the wire so that the kernel can take over | 11:26 |
f0o_ | let's see if I can employ nftables without getting too much of a throughput penalty | 11:27 |
noonedeadpunk | I believe kernel routing is deliberately not used, to enable path acceleration (like dpdk) and to avoid the penalty/limitations of the kernel | 11:40 |
f0o_ | yeah I think so too | 11:40 |
noonedeadpunk | but potentially it's better to talk to neutron folks about that, as they know way more about ovn | 11:40 |
f0o_ | well `iptables -t filter -A FORWARD -o br-mgmt -j DROP` works | 11:40 |
f0o_ | doesn't work for lxcbr0 though because OSA adds a `-o lxcbr0 -j ACCEPT` | 11:41 |
f0o_ | but I can likely inject some -s check above that accept and drop it early | 11:41 |
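A sketch of that ordering, with the FIP range from earlier in the log; the point is the placement relative to OSA's ACCEPT, the exact position index is an assumption:

```shell
# Block forwarding into the management-side bridges outright...
iptables -t filter -A FORWARD -o br-mgmt -j DROP
iptables -t filter -A FORWARD -o br-vxlan -j DROP
# ...and since OSA appends "-o lxcbr0 -j ACCEPT", insert a more specific drop
# *above* it so traffic sourced from the FIP range never reaches that ACCEPT.
iptables -t filter -I FORWARD 1 -s 185.243.23.0/24 -o lxcbr0 -j DROP
```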
f0o_ | wondering if those DROPs should be something that OSA wants to add as a safeguard | 11:42 |
noonedeadpunk | potentially - yes, but then you can also set `lxc_net_manage_iptables: false` and then osa won't inject any iptables rules | 11:44 |
f0o_ | ngl life was easier with linuxbridges xD | 11:44 |
noonedeadpunk | it really was... | 11:46 |
noonedeadpunk | fwiw, iptables rules are added here: https://opendev.org/openstack/openstack-ansible-lxc_hosts/src/branch/master/templates/lxc-system-manage.j2#L93-L110 | 11:46 |
f0o_ | ok idk what I did but this connection is very unstable right now | 11:54 |
f0o_ | let me revert everything | 11:54 |
f0o_ | I think this issue that I'm having is just an oversight from the neutron team in implementing OVS. Coincidentally, OVS never supported full tables because it would crash (https://github.com/openvswitch/ovs-issues/issues/185) - but I talked to one of the OVS devs and got a patch that made OVS work with full tables by only listening for route changes in the default table. | 11:59 |
f0o_ | so I can see how neutron just dismissed the option that an OVS gateway node could be a full-fledged router | 12:00 |
f0o_ | and only focused on VTEP<>L2 bridging for GatewayNodes | 12:00 |
f0o_ | just theorycrafting | 12:00 |
noonedeadpunk | yeah, I guess that's pretty much true. as potentially they left this usecase for ovn-bgp-agent | 12:01 |
noonedeadpunk | like if you want the gateway to be a router - you want bgp | 12:01 |
noonedeadpunk | or will use bgp anyway | 12:01 |
f0o_ | that is pretty much true, unfortunately I simply cannot get ovn-bgp-agent to run | 12:02 |
f0o_ | I've broken my head over it for a week and it just didn't work well | 12:02 |
noonedeadpunk | Well, I got it working | 12:06 |
noonedeadpunk | and in a relatively reliable way, though it was a /o\ experience overall I guess :D | 12:07 |
f0o_ | did you use your public ASN all the way (IBGP all the way) or did you use private-ASN for bgp-agent<>routers ? | 12:10 |
noonedeadpunk | but we scrapped that, as the only way for FIPs to work in a non-DVR scenario is through the SB DB driver | 12:10 |
noonedeadpunk | I think we used public asn on /28 subnets | 12:11 |
f0o_ | because I kept running into an issue where IBGP along the whole path would create split-brain issues, and attempting to rectify it with route-reflectors was a huge pain, to the point where I just dropped it | 12:11 |
noonedeadpunk | As the NB DB driver simply does not announce FIPs on the gateway nodes, but rather on the computes where the VMs are running, which is wrong | 12:12 |
f0o_ | and adding a private ASN in the middle of the path would create odd advertisements to our transits, which is technically correct as it would show "MYASN 65001" in the path and become a bogon | 12:12 |
noonedeadpunk | and I've submitted a bunch of bug reports | 12:12 |
f0o_ | :D | 12:12 |
f0o_ | I also ran into that but assumed that's how it's supposed to be | 12:12 |
noonedeadpunk | and then there was also a bug in the nb db driver where it did not withdraw announcements from frr | 12:12 |
f0o_ | Compute -> Router | 12:12 |
f0o_ | haha | 12:13 |
noonedeadpunk | nah, sb driver does that properly | 12:13 |
f0o_ | so alright, ovn-bgp-agent is still not a solution for me | 12:13 |
f0o_ | feels like I'm running in circles here | 12:13 |
f0o_ | I think I will just do iptables and run some throughput tests and that will be it | 12:13 |
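A rough before/after throughput check for that can be as simple as the following (iperf3 on both ends; the server address is a placeholder):

```shell
# On a host beyond the gateway path:
iperf3 -s
# From a VM with a floating IP, run before and after adding the iptables rules:
iperf3 -c <server-ip> -P 4 -t 30
```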
noonedeadpunk | So eventually we used public ASN for announcements and ebgp-multihop | 12:14 |
noonedeadpunk | and the public net for tenants was obviously a different one than the one used for announcements | 12:14 |
noonedeadpunk | yeah, anyway | 12:15 |
f0o_ | as much as I would love full-path bgp it does sound like more pain than it's worth at the current state | 12:15 |
noonedeadpunk | we still picked just stupid l2 after all, and then do magic on leafs | 12:15 |
noonedeadpunk | it really is, imo | 12:15 |
noonedeadpunk | and then there's still no support for multiple VRFs | 12:15 |
noonedeadpunk | but when there is for NB, it's back to the point that you get FIPs announced from the computes | 12:16 |
f0o_ | my brain is spinning but I guess that's the fever more than anything else | 12:17 |
f0o_ | I got iptables to work now, easy peasy. | 12:18 |
f0o_ | verified it as working too | 12:18 |
f0o_ | not an ideal solution but it does work, until maybe ovn-bgp-agent matures | 12:18 |
f0o_ | soooo iptables don't work actually | 12:56 |
f0o_ | haha | 12:56 |
f0o_ | nvmd typo | 12:58 |
f0o_ | I'm literally not seeing the forest for the trees now | 12:58 |
opendevreview | Merged openstack/openstack-ansible-rabbitmq_server master: Add support for the apply_to parameter for policies https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/910712 | 18:55 |