Brace | Ok, after doing those fixes recommended a few days ago, all the Neutron ports have now built, yet, we still can't connect to any instances. | 07:43 |
---|---|---|
Brace | https://paste.openstack.org/show/bY9wtajyDwd12D5j6M0z/ - this is the error message we're seeing, anyone got any ideas? | 07:43 |
Brace | Also the privsep-helper process for L3 has 1070+ 'pipe' file descriptors attached | 07:44 |
jrosser | Brace: what version of openstack are you using - i only see really really old bugs related to that | 08:33 |
noonedeadpunk | admin1: currently in Poland | 08:36 |
noonedeadpunk | But I've tended to travel _a lot_ during the last month, as I'm originally from Ukraine | 08:38 |
Brace | jrosser: we're on Victoria 22.3.3 | 08:39 |
noonedeadpunk | Brace: and you were increasing the open file limit, right? | 08:42 |
noonedeadpunk | or what were the recommended fixes?:) | 08:43 |
Brace | noonedeadpunk: yeah you recommended https://paste.openstack.org/show/bY235whPe5LKkFFzo6pn/ | 08:44 |
Brace | so we increased the LimitNOFILE and ended up rebooting the whole cluster | 08:44 |
Brace | however we have 1000 projects and that means we end up with 5000 ports in build which is why it's taken me a few days to come back | 08:45 |
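(For reference, a rough sketch of the kind of open-file-limit bump being discussed — the unit name, drop-in path and value here are assumptions, not taken from the linked paste:)

```bash
# Hypothetical systemd drop-in raising LimitNOFILE for the L3 agent
# (unit name and limit value are assumptions -- match them to your deployment):
mkdir -p /etc/systemd/system/neutron-l3-agent.service.d
cat > /etc/systemd/system/neutron-l3-agent.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart neutron-l3-agent
```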
noonedeadpunk | um... that change should never be that breaking, and should only apply to the l3-agents, more or less | 08:49 |
Brace | every time we do something with Neutron, it ends up with many thousands of ports in build | 08:52 |
jrosser | Brace: this might be a question more for #openstack-neutron as i think there have been improvements how neutron handles lots of networks | 09:15 |
jrosser | but i couldn't say which release any of that was done in | 09:15 |
Brace | jrosser: ok, thanks for that, I'll ask in there | 09:17 |
jrosser | there may also be tuning you can do around the number of workers but i've not really any experience there | 09:19 |
noonedeadpunk | that said - we never saw such behaviour with ports going into the build state.... | 10:13 |
noonedeadpunk | I guess the top problem for us was always l3 being rebuilt or misbehaving.... | 10:14 |
jrosser | 1000 projects really is a lot though | 10:14 |
noonedeadpunk | I bet we have 10000+ ports in some regions... | 10:18 |
noonedeadpunk | never checked amount of projects though.... | 10:22 |
*** arxcruz is now known as arxcruz|ruck | 10:23 | |
*** dviroel_ is now known as dviroel | 11:08 | |
Brace | jrosser: yeah, I'm vaguely aware that 1000 projects is bad, but we're using it to manage resources | 11:22 |
Brace | we don't seem to have much more than about 6000 ports at the moment | 11:22 |
*** prometheanfire is now known as Guest0 | 11:48 | |
*** ChanServ changes topic to "Launchpad: https://launchpad.net/openstack-ansible || Weekly Meetings: https://wiki.openstack.org/wiki/Meetings/openstack-ansible || Review Dashboard: http://bit.ly/osa-review-board-v4_1" | 11:52 | |
noonedeadpunk | so what I'm saying is that it's not _that_ many ports to cause issues IMO... | 12:08 |
noonedeadpunk | Brace: do you use ovs or lxb? | 12:08 |
Brace | is lxb - linuxbridge? | 12:13 |
mgariepy | yes | 12:15 |
Brace | yes, we use linuxbridge then | 12:15 |
noonedeadpunk | oh, ok, we use ovs at that scale... | 12:16 |
mgariepy | ovs or ovn would perform a lot better i guess | 12:17 |
noonedeadpunk | well, it's arguable I'd say... | 12:18 |
noonedeadpunk | I was running a big enough deployment with lxb, but that was quite some time ago | 12:18 |
mgariepy | lxb with 500 routers is not good... | 12:18 |
mgariepy | if it runs smoothly it's ok but if you need to migrate the load across servers it takes forever to sync. | 12:19 |
noonedeadpunk | is ovs somewhat different lol | 12:19 |
mgariepy | lol | 12:20 |
Brace | heh | 12:20 |
mgariepy | lxb had some nasty issue ( getting sysctl stuff takes a lot of time when you have a lot of ports) | 12:20 |
noonedeadpunk | I mean - to cleanly move 500 l3 routers from net node with ovs in serial takes like 30-45mins? | 12:20 |
mgariepy | 30 min is fast.. compared to lxb.. | 12:21 |
mgariepy | with lxb i saw a couple hours. | 12:21 |
noonedeadpunk | with lxb I never had to do that :p as l3-agent restart was recovering properly. and with ovs you _always_ have ~10 routers that can't recover on their own, and overall router recovery can take time as well... | 12:22 |
Brace | well, when we did a config change a while back it took 4 days to rebuild all the ports | 12:22 |
noonedeadpunk | but dunno... I haven't used lxb for a while now, so it can have nasty stuff today as well... | 12:22 |
noonedeadpunk | (and we never-ever were rebuilding ports) | 12:23 |
mgariepy | i don't think lxb has a lot of work going into it these days | 12:23 |
Brace | we've had a couple of times when networking has just failed (and we can't figure out why) so we just restart all the networking components | 12:23 |
mgariepy | i had to when a network node crashed. | 12:23 |
Brace | and after a few days it starts working again | 12:23 |
Brace | but this time, we've done that and it's still dead | 12:24 |
mgariepy | network part is the most flaky imo :) | 12:24 |
noonedeadpunk | so you mean ports that are part of l3? | 12:24 |
mgariepy | brace are you using ha-router ? | 12:25 |
Brace | mgariepy: so I'm learning | 12:25 |
Brace | mgariepy: you mean haproxy? yup, that's in there | 12:26 |
mgariepy | no. vrrp for the routers | 12:26 |
Brace | I *think* so, there's lots of lines talking about vrrp in the logs | 12:27 |
Brace | Mar 22 12:25:38 controller-3 Keepalived_vrrp[873858]: VRRP_Instance(VR_9) removing protocol Virtual Routes | 12:27 |
Brace | tbh, we used openstack-ansible a number of years ago to deploy the cluster, but normally it broadly works so my knowledge of the exact config is pretty poor | 12:27 |
mgariepy | openstack router show <router_name_here> -c ha -f value | 12:27 |
mgariepy | if you use ha-router it takes more ports. | 12:29 |
Brace | that command came back with True | 12:29 |
mgariepy | what version of openstack are you running ? | 12:30 |
Brace | 22.3.3 | 12:30 |
mgariepy | ok | 12:31 |
jamesdenton | so are all of your routers down right now? networking is busted? | 12:31 |
Brace | nope, the routers are ACTIVE and UP, however we can't connect to any of the servers | 12:32 |
Brace | if we build new servers, they're equally broken | 12:32 |
jamesdenton | can you reach the servers from the respective dhcp namespaces? | 12:32 |
jamesdenton | that would be L2 adjacent. | 12:32 |
jamesdenton | are you using vxlan or vlan for those tenant networks? | 12:33 |
Brace | vxlan, I'm fairly sure | 12:33 |
mgariepy | back in kilo, with over 50 or 100 routers in ha mode, it was taking longer to recover than with non-ha ones using the scheduler script. | 12:33 |
Brace | I sort of understand what you're saying there, but no idea how to actually test reaching the servers via l2 | 12:34 |
jamesdenton | "back in kilo" :D | 12:34 |
mgariepy | lol :) yep. | 12:34 |
mgariepy | i know it's a long time ago.. lol | 12:34 |
jamesdenton | if you have access to the controllers, there should be a qdhcp namespace for each network. "ip netns list" will list them all | 12:34 |
jamesdenton | you will correlate the network uuid of your VM to the qdhcp namespace name - the ID will be the same. You should then be able to ssh or ping from that namespace to the VM. It's essentially avoiding the routers. | 12:35 |
Brace | ok, so I have stuff like this - qrouter-00eb4cab-d5d5-4d99-b6d1-bb2efd043903 | 12:35 |
Brace | so I just ssh that? | 12:36 |
mgariepy | maybe you can tweak : https://docs.openstack.org/neutron/ussuri/configuration/neutron.html#DEFAULT.max_l3_agents_per_router | 12:36 |
admin1 | Brace, how many network nodes do you actually have vs networks ? | 12:36 |
jamesdenton | yes, that's for each router. and that would work, too. if you can find the router that matches the one in front of your tenant network. the command would be something like "ip netns exec qrouter-00eb...3903 ping x.x.x.x" | 12:36 |
jamesdenton | or "ip netns exec qdhcp-3b2e...0697 ssh ubuntu@x.x.x.x" | 12:37 |
mgariepy | also `dhcp_agents_per_network` is a good candidate to remove some ports. | 12:37 |
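(Both of those options live in neutron.conf on the controllers. A sketch of setting them — values are illustrative only, and in an openstack-ansible deployment you would normally apply them through the neutron role's config overrides rather than by hand:)

```bash
# Illustrative values only -- lowering these reduces the number of HA/DHCP
# ports created per router/network, at the cost of redundancy.
crudini --set /etc/neutron/neutron.conf DEFAULT max_l3_agents_per_router 2
crudini --set /etc/neutron/neutron.conf DEFAULT dhcp_agents_per_network 2
systemctl restart neutron-server
```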
Brace | admin1: we have three controllers, we don't have dedicated network nodes | 12:38 |
lowercase | One issue i've had in the past is that when neutron goes down and all the interfaces on the routers attempt to come up at the same time, rabbitmq gets overwhelmed and the network doesn't come up. Sorry if this isn't relevant. I just popped in for a sec. | 12:39 |
Brace | lowercase: nope, it's useful information | 12:39 |
jamesdenton | if HA is enabled (which it appears to be) you will have a qrouter namespace on EACH controller/network node, but only one will be active. They use VRRP across a dedicated (vxlan likely) network to determine master | 12:40 |
jamesdenton | so, all three will have an 'ha' interface, but only one should have a qg and qr interface | 12:40 |
jamesdenton | that's the one you'd want to test from | 12:40 |
Brace | ok | 12:42 |
mgariepy | with: openstack network agent list --router <router_name> | 12:42 |
Brace | so I need to connect to dhcp-b27..... and not the qrouter | 12:42 |
jamesdenton | both would be a good test | 12:43 |
jamesdenton | and yes, that openstack network agent list command should tell you which one it thinks is active. | 12:43 |
Brace | ok, so sshing to the qrouter, just gives a 'No route to host' | 12:43 |
jamesdenton | can you post the command? | 12:44 |
Brace | ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ssh ubuntu@10.72.103.62 | 12:44 |
jamesdenton | ok, try: ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip addr | 12:44 |
jamesdenton | what interfaces do you see and can you pastebin that | 12:45 |
Brace | https://paste.openstack.org/show/bPLAlfK6Wuex68WnZ1Pw/ | 12:45 |
jamesdenton | ok, try the 192.168.5 addr of the VM instead | 12:46 |
jamesdenton | the fixed IP vs the floating IP | 12:46 |
jamesdenton | 10.72.103.62 may not be a floating IP or this may not be the right router, hard to tell | 12:46 |
Brace | that's a connection refused | 12:46 |
jamesdenton | can you post the 'openstack server show' output for the VM? | 12:47 |
Brace | https://paste.openstack.org/show/bFiL16SeU7ROmH1J3amH/ | 12:49 |
jamesdenton | ok, so ssh to 192.168.5.108 is connection refused? | 12:49 |
Brace | no, I was sshing to 192.168.5.1 | 12:50 |
jamesdenton | ahh ok, try 192.168.5.108 | 12:50 |
Brace | .108 works, just tried it now | 12:50 |
jamesdenton | ok good. well, one problem, then, is that 10.72.103.62 is the floating IP but does not appear on the qg interface | 12:50 |
jamesdenton | as a /32 | 12:50 |
Brace | ok | 12:51 |
jamesdenton | trying to think of the easiest way to rebuild that router without an agent restart. might be able to set admin state down for the router, watch the namespaces disappear, then admin up | 12:51 |
jamesdenton | can you check the other two controllers and do that same "ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip addr" and post? | 12:52 |
jamesdenton | also, curious to know the specs of those controller nodes. cpu/ram/etc | 12:52 |
Brace | 16 core Xeons, 192gb ram, 5x363gb hdds for ceph | 12:53 |
Brace | 3 controller nodes, with 17 compute nodes | 12:54 |
jamesdenton | ok, all identical specs then? | 12:54 |
jamesdenton | thanks | 12:54 |
Brace | the compute nodes have a lower spec, but the controller nodes are all the spec above | 12:56 |
Brace | https://paste.openstack.org/show/bCLxmVkZPwM6KhTExrTR/ | 12:58 |
jamesdenton | ok, thanks | 12:59 |
jamesdenton | just making sure there wasn't some sort of split-brain issue | 12:59 |
Brace | ah ok | 13:01 |
jamesdenton | ok, so you can try disabling the router with this command: openstack router set --disable b27d7882-d19f-4b8a-a34f-11e7d7821ba3. That should tear down the namespaces for that router only. then wait a min, and issue: openstack router set --enable b27d7882-d19f-4b8a-a34f-11e7d7821ba3. That should implement the namespaces and build out the floating IPs. | 13:01 |
jamesdenton | you might watch the l3 agent logs simultaneously across three controllers to see if there are any errors | 13:01 |
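(Condensed into commands, that disable/enable cycle looks something like the following — the router UUID is taken from the conversation, and the log location may differ per deployment:)

```bash
ROUTER=b27d7882-d19f-4b8a-a34f-11e7d7821ba3
openstack router set --disable "$ROUTER"
# on each controller the qrouter namespace should disappear:
ip netns list | grep "$ROUTER"
sleep 60
openstack router set --enable "$ROUTER"
# watch the L3 agent logs on all three controllers while it rebuilds:
tail -f /var/log/neutron/neutron-l3-agent.log
```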
Brace | ok | 13:09 |
Brace | nope, can't connect to the server on the 10. ip | 13:13 |
jamesdenton | can you post that ip netns output again from all 3? | 13:14 |
jamesdenton | and also, can you find the port that corresponds to 10.72.103.62 and post the 'port show' output? | 13:14 |
Brace | https://paste.openstack.org/show/b6dbUxyrRWMpX0sdQuEc/ | 13:21 |
jamesdenton | ok, so if you see c2 and c3, they both have interfaces for this router. Only C3 has the correct one by the looks of it | 13:23 |
jamesdenton | so there is something up with C2 - i'm wondering if the namespace never got torn down. | 13:23 |
jrosser | is it odd that 10.72.103.133/32 is on both C2 and C3 as well | 13:24 |
jamesdenton | you might try the disable command again, wait a min, and check to see if the namespaces disappear from all 3 controllers. if C2 is still there, then the agent may not be processing that properly | 13:24 |
jamesdenton | jrosser i think that namespace never went away | 13:24 |
jrosser | right | 13:25 |
jamesdenton | Brace if you perform that check, and the namespace on C2 is still there, you can try to destroy it with "ip netns delete qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3". then re-enable the router. the logs on C2 may help indicate what's happening, and/or rabbitmq may reveal stuck messages or something | 13:25 |
jamesdenton | restarting the L3 agent *only* on C2 could help, too | 13:26 |
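(If the C2 namespace does survive the disable, the cleanup described above would look roughly like this on C2 — a sketch, not a tested procedure:)

```bash
# run on C2 while the router is still disabled
ip netns delete qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3   # remove the stale namespace
systemctl restart neutron-l3-agent                             # restart the agent on C2 only
openstack router set --enable b27d7882-d19f-4b8a-a34f-11e7d7821ba3
```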
Brace | ok, I'll give that a bash, but I'm going to grab some lunch first | 13:26 |
Brace | but here is the port show pastebin = https://paste.openstack.org/show/b5RuQSovWmz0K6pSvRri/ | 13:26 |
jamesdenton | good idea :D | 13:26 |
jamesdenton | cool, thanks. is this a new router with only 2 floating IPs? | 13:27 |
Brace | thanks for all the help so far, much appreciated! | 13:27 |
jamesdenton | just want to make sure | 13:27 |
Brace | yes, that's the router of the instance that I've been doing all the troubleshooting on so far | 13:27 |
Brace | port even | 13:27 |
jamesdenton | cool, thanks | 13:27 |
Brace | openstack port show d04ed8a1-3313-4a14-bbd3-8a6057bc52e8 | 13:27 |
Brace | that was the command I used | 13:28 |
jamesdenton | perfect. i'm glad to see it turned up in the C3 namespace. | 13:28 |
Brace | jamesdenton: so we have restarted the l3 agent on C2 several times (and rebooted C2) so I'm not sure it'll do much tbh | 14:22 |
Brace | I'm going to try the disable command again | 14:27 |
Brace | so I deleted the router namespace as per ip netns delete qrouter-b27.... | 14:31 |
Brace | but a qrouter-b27d7.... is still there on C3 | 14:32 |
jamesdenton | sure - is the router re-enabled? | 14:41 |
Brace | no I didn't re-enable it | 14:41 |
spatel | noonedeadpunk hey! around | 14:41 |
jamesdenton | ok - so you disabled the router and the namespace didn't disappear from C3? weird. Can you check to see if there are any agent queues in rabbit that have messages sitting there? | 14:44 |
noonedeadpunk | semi-around | 14:44 |
Brace | nope according to haproxy nothing much happening | 14:45 |
Brace | I guess the best way to look would be to go into the rabbit container and look at queues in there? | 14:45 |
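(One way to check from inside the rabbitmq container, assuming rabbitmqctl is available there — the queue-name patterns are guesses; messages piling up with zero consumers is the usual red flag:)

```bash
rabbitmqctl list_queues name messages consumers | grep -iE 'l3|dhcp|q-agent|neutron'
```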
spatel | noonedeadpunk do you have any example k8s yaml file to create a hello-world with an octavia lb to expose a port? my yaml isn't spinning up an LB :( | 14:46 |
noonedeadpunk | nope, I don't :( | 14:47 |
noonedeadpunk | trying to avoid messing with k8s as much as I can :) | 14:48 |
Brace | nope looking at the neutron rabbit queues, they're all empty | 14:48 |
spatel | what do you mean by avoiding messing with k8s? | 14:51 |
NeilHanlon | spatel: can you share what you have so far? | 14:51 |
jamesdenton | Brace ok cool. Very odd the namespace didn't get destroyed. you can try to delete it on there, then turn up the router again and see if it gets built across all 3 and if only 1 becomes master. | 14:51 |
spatel | NeilHanlon i have this example code to spin up an app. I am trying to expose my app's port via Octavia but it doesn't work. I mean i am not seeing k8s create any LB to expose the port - https://paste.opendev.org/show/b7jCTenIZdMX3VFw5Rv2/ | 14:53 |
spatel | Just trying to understand how does k8s integrate with octavia | 14:53 |
NeilHanlon | have you deployed the octavia ingress controller? | 14:55 |
jamesdenton | ^^ | 14:55 |
NeilHanlon | https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/octavia-ingress-controller/using-octavia-ingress-controller.md | 14:55 |
spatel | NeilHanlon deploy ingress controller??? | 14:58 |
spatel | This is my first time playing with k8s with octavia, so sorry if i missed something or don't understand the steps | 14:59 |
NeilHanlon | spatel: check out that github link for the octavia ingress controller. Octavia doesn't natively interact with Kubernetes | 14:59 |
NeilHanlon | you need the octavia ingress controller running in your kubernetes cluster to facilitate using Octavia to load balance for kubernetes | 15:00 |
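(For what it's worth, once an OpenStack integration is running in the cluster — the octavia-ingress-controller linked above, or the openstack-cloud-controller-manager — a Service of type LoadBalancer is typically what triggers the Octavia LB. A minimal sketch with placeholder names and ports:)

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  type: LoadBalancer        # asks the cloud provider integration for an Octavia LB
  selector:
    app: hello-world
  ports:
    - port: 80
      targetPort: 8080
EOF
kubectl get service hello-world -w   # EXTERNAL-IP should populate once the LB is built
```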
noonedeadpunk | #startmeeting openstack_ansible_meeting | 15:01 |
opendevmeet | Meeting started Tue Mar 22 15:01:04 2022 UTC and is due to finish in 60 minutes. The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:01 |
opendevmeet | The meeting name has been set to 'openstack_ansible_meeting' | 15:01 |
noonedeadpunk | #topic rollcall | 15:01 |
noonedeadpunk | hey everyone | 15:01 |
noonedeadpunk | I'm half-there as have a meeting | 15:01 |
damiandabrowski[m] | hi! | 15:04 |
noonedeadpunk | #topic office hours | 15:08 |
noonedeadpunk | well, that said, I don't have many topics | 15:08 |
noonedeadpunk | we're super close to PTL and openstack release | 15:08 |
noonedeadpunk | * s/PTL/PTG/ | 15:09 |
noonedeadpunk | So please register to PTG and let's fill in topics for discussion in etherpad | 15:10 |
noonedeadpunk | #link https://etherpad.opendev.org/p/osa-Z-ptg | 15:10 |
NeilHanlon | heya | 15:10 |
NeilHanlon | TY for the note about PTG! | 15:12 |
Brace | jamesdenton: nope, it doesn't get rebuilt on C2 and takes a disable/enable before it comes up on the other two, so I guess that leans towards rabbitmq being the issue? | 15:14 |
jamesdenton | well, or somehow the router is no longer scheduled against C2. can you try openstack network agent list --router <id> | 15:17 |
noonedeadpunk | other than that I didn't have time for anything else, since the week was quite tough with internal things going on | 15:18 |
Brace | jamesdenton: it's alive and up on all controllers according to that command | 15:19 |
jamesdenton | ok. it may be that the other routers are experiencing a similar condition | 15:21 |
jamesdenton | not sure if only C2 is to blame, but seems problematic anyway. | 15:21 |
Brace | yup | 15:25 |
jrosser | o/ hello | 15:25 |
opendevreview | Jonathan Rosser proposed openstack/ansible-role-pki master: Refactor conditional generation of CA and certificates https://review.opendev.org/c/openstack/ansible-role-pki/+/830794 | 15:27 |
*** dviroel is now known as dviroel|lunch | 15:50 | |
Brace | jamesdenton: I'm going to try restarting l3-agent on C2 and see what that does, can't make things worse can it! | 15:50 |
jamesdenton | i guess not :D | 15:50 |
jamesdenton | tail the log after you restart it and see if there's anything unusual reported | 15:50 |
jamesdenton | main loop exceeding timeout or anything along those lines | 15:51 |
Brace | jamesdenton: I wouldn't know what's unusual tbh, but there's a note about finding leftover processes, which indicates an unclean termination of a previous run | 15:57 |
jrosser | oh well now that is a thing with systemd isn't it? | 15:57 |
Brace | not sure tbh, but it otherwise seems ok | 15:58 |
jrosser | this https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/772538 | 15:59 |
Brace | but the router still doesn't come back on c2 | 15:59 |
jrosser | and https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/781109 | 15:59 |
jrosser | so restarting the neutron agent does not necessarily remove all the things that it created | 16:00 |
jamesdenton | Brace you might try unscheduling the router from the agent, and rescheduling to see if it will clean up and recreate | 16:00 |
Brace | ah, I just tried disabling and enabling the port | 16:01 |
Brace | router even | 16:01 |
jamesdenton | right, ok. | 16:01 |
jamesdenton | try to unschedule and reschedule. i cannot remember offhand the command | 16:01 |
jamesdenton | i'm headed OOO for the day | 16:01 |
Brace | jamesdenton: thank you for all your help, much appreciated | 16:02 |
Brace | I'll try the unschedule and reschedule | 16:02 |
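(The unschedule/reschedule step would look roughly like this; the exact client syntax varies between releases, so treat it as a sketch and check `openstack network agent add router --help` first. The agent ID is a placeholder:)

```bash
ROUTER=b27d7882-d19f-4b8a-a34f-11e7d7821ba3
openstack network agent list --router "$ROUTER"                  # note the C2 L3 agent ID
openstack network agent remove router --l3 <c2-l3-agent-id> "$ROUTER"
openstack network agent add router --l3 <c2-l3-agent-id> "$ROUTER"
```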
noonedeadpunk | #endmeeting | 16:04 |
opendevmeet | Meeting ended Tue Mar 22 16:04:40 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:04 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.html | 16:04 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.txt | 16:04 |
opendevmeet | Log: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.log.html | 16:04 |
*** Guest0 is now known as prometheanfire | 16:33 | |
admin1 | has anyone seen an issue with nova snapshots when glance and cinder are on ceph but nova is local storage? | 16:56 |
admin1 | and if my glance backend is only rbd, do I need enabled_backends = rbd:rbd,http:http,cinder:cinder -- all of these ? | 16:56 |
admin1 | i am getting an error that snapshots do not work .. and research points to multiple backends being in glance | 16:57 |
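(For context, a minimal glance-api.conf multistore fragment with only the rbd backend enabled looks roughly like the following; whether the extra http/cinder entries are actually required isn't settled here, and the pool/user values are assumptions:)

```bash
cat <<'EOF'   # shape of the config only, shown for illustration
[DEFAULT]
enabled_backends = rbd:rbd

[glance_store]
default_backend = rbd

[rbd]
rbd_store_pool = images
rbd_store_user = glance
rbd_store_ceph_conf = /etc/ceph/ceph.conf
EOF
```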
*** tosky is now known as Guest38 | 17:04 | |
*** tosky_ is now known as tosky | 17:04 | |
opendevreview | Merged openstack/openstack-ansible-os_neutron master: Set fail_mode explicitly https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/834436 | 18:30 |
admin1 | if that bug is really what i think i am hitting, how do I override glance in haproxy from mode http to tcp ? | 18:33 |
admin1 | so i changed the mode in haproxy to tcp from http and it's fixed .. now I want to know how to make this fix permanent | 20:20 |
*** dviroel is now known as dviroel|brb | 20:21 | |
*** dviroel|brb is now known as dviroel | 23:24 | |
*** dviroel is now known as dviroel|out | 23:31 |