Brace | Ok, after doing those fixes recommended a few days ago, all the Neutron ports have now built, yet, we still can't connect to any instances. | 07:43 |
---|---|---|
Brace | https://paste.openstack.org/show/bY9wtajyDwd12D5j6M0z/ - this is the error message we're seeing, anyone got any ideas? | 07:43 |
Brace | Also the privsep-helper process for L3 has 1070+ 'pipe' file descriptors attached | 07:44 |
jrosser | Brace: what version of openstack are you using - i only see really really old bugs related to that | 08:33 |
noonedeadpunk | admin1: currently in Poland | 08:36 |
noonedeadpunk | But I've tended to travel _a lot_ during the last month, as I'm originally from Ukraine | 08:38 |
Brace | jrosser: we're on Victoria 22.3.3 | 08:39 |
noonedeadpunk | Brace: and you were increasing the open file limit, right? | 08:42 |
noonedeadpunk | or what were the recommended fixes?:) | 08:43 |
Brace | noonedeadpunk: yeah you recommended https://paste.openstack.org/show/bY235whPe5LKkFFzo6pn/ | 08:44 |
Brace | so we increased the LimitNOFILE and ended up rebooting the whole cluster | 08:44 |
Brace | however we have 1000 projects and that means we end up with 5000 ports in build which is why it's taken me a few days to come back | 08:45 |
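(For reference, a rough sketch of the kind of open-file-limit bump being discussed — the unit name, drop-in path and value here are assumptions, not taken from the linked paste:)

```bash
# Hypothetical systemd drop-in raising LimitNOFILE for the L3 agent
# (unit name and limit value are assumptions -- match them to your deployment):
mkdir -p /etc/systemd/system/neutron-l3-agent.service.d
cat > /etc/systemd/system/neutron-l3-agent.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart neutron-l3-agent
```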
noonedeadpunk | um... that change should never be that breaking, and should only apply to the l3-agents, more or less | 08:49 |
Brace | every time we do something with Neutron, it ends up with many thousands of ports in build | 08:52 |
jrosser | Brace: this might be a question more for #openstack-neutron as i think there have been improvements how neutron handles lots of networks | 09:15 |
jrosser | but i couldn't say which release any of that was done in | 09:15 |
Brace | jrosser: ok, thanks for that, I'll ask in there | 09:17 |
jrosser | there may also be tuning you can do around the number of workers but i've not really any experience there | 09:19 |
noonedeadpunk | that said - we never saw such behaviour with ports going into the build state.... | 10:13 |
noonedeadpunk | I guess the top problem for us was always l3 being rebuilt or misbehaving.... | 10:14 |
jrosser | 1000 projects really is a lot though | 10:14 |
noonedeadpunk | I bet we have 10000+ ports in some regions... | 10:18 |
noonedeadpunk | never checked amount of projects though.... | 10:22 |
*** arxcruz is now known as arxcruz|ruck | 10:23 | |
*** dviroel_ is now known as dviroel | 11:08 | |
Brace | jrosser: yeah, I'm vaguely aware that 1000 projects is bad, but we're using it to manage resources | 11:22 |
Brace | we don't seem to have much more than about 6000 ports at the moment | 11:22 |
*** prometheanfire is now known as Guest0 | 11:48 | |
*** ChanServ changes topic to "Launchpad: https://launchpad.net/openstack-ansible || Weekly Meetings: https://wiki.openstack.org/wiki/Meetings/openstack-ansible || Review Dashboard: http://bit.ly/osa-review-board-v4_1" | 11:52 | |
noonedeadpunk | so what I'm saying is that it's not _that_ many ports to cause issues IMO... | 12:08 |
noonedeadpunk | Brace: do you use ovs or lxb? | 12:08 |
Brace | is lxb - linuxbridge? | 12:13 |
mgariepy | yes | 12:15 |
Brace | yes, we use linuxbridge then | 12:15 |
noonedeadpunk | oh, ok, we use ovs at that scale... | 12:16 |
mgariepy | ovs or ovn would perform a lot better i guess | 12:17 |
noonedeadpunk | well, it's arguable I'd say... | 12:18 |
noonedeadpunk | I was running a big enough deployment with lxb, but that was quite some time ago | 12:18 |
mgariepy | lxb with 500 routers is not good... | 12:18 |
mgariepy | if it runs smoothly it's ok but if you need to migrate the load across servers it takes forever to sync. | 12:19 |
noonedeadpunk | is ovs somewhat different lol | 12:19 |
mgariepy | lol | 12:20 |
Brace | heh | 12:20 |
mgariepy | lxb had some nasty issue ( getting sysctl stuff takes a lot of time when you have a lot of ports) | 12:20 |
noonedeadpunk | I mean - to cleanly move 500 l3 routers from net node with ovs in serial takes like 30-45mins? | 12:20 |
mgariepy | 30 min is fast.. compared to lxb.. | 12:21 |
mgariepy | with lxb i saw a couple hours. | 12:21 |
noonedeadpunk | with lxb I never had to do that :p as l3-agent restart was recovering properly. and with ovs you _always_ have ~10 routers that can't recover on their own, and overall router recovery can take time as well... | 12:22 |
Brace | well, when we did a config change a while back it took 4 days to rebuild all the ports | 12:22 |
noonedeadpunk | but dunno... I haven't used lxb for a while now, so it can have nasty stuff today as well... | 12:22 |
noonedeadpunk | (and we never-ever were rebuilding ports) | 12:23 |
mgariepy | i don't think lxb has a lot of work going into it these days | 12:23 |
Brace | we've had a couple of times when networking has just failed (and we can't figure out why) so we just restart all the networking components | 12:23 |
mgariepy | i had to when a network node crashed. | 12:23 |
Brace | and after a few days it starts working again | 12:23 |
Brace | but this time, we've done that and it's still dead | 12:24 |
mgariepy | network part is the most flaky imo :) | 12:24 |
noonedeadpunk | so you mean ports that are part of l3? | 12:24 |
mgariepy | brace are you using ha-router ? | 12:25 |
Brace | mgariepy: so I'm learning | 12:25 |
Brace | mgariepy: you mean haproxy? yup, that's in there | 12:26 |
mgariepy | no. vrrp for the routers | 12:26 |
Brace | I *think* so, there's lots of lines talking about vrrp in the logs | 12:27 |
Brace | Mar 22 12:25:38 controller-3 Keepalived_vrrp[873858]: VRRP_Instance(VR_9) removing protocol Virtual Routes | 12:27 |
Brace | tbh, we used openstack-ansible a number of years ago to deploy the cluster, but normally it broadly works so my knowledge of the exact config is pretty poor | 12:27 |
mgariepy | openstack router show <router_name_here> -c ha -f value | 12:27 |
mgariepy | if you use ha-router it takes more ports. | 12:29 |
Brace | that command came back with True | 12:29 |
mgariepy | what version of openstack are you running ? | 12:30 |
Brace | 22.3.3 | 12:30 |
mgariepy | ok | 12:31 |
jamesdenton | so are all of your routers down right now? networking is busted? | 12:31 |
Brace | nope, the routers are ACTIVE and UP, however we can't connect to any of the servers | 12:32 |
Brace | if we build new servers, they're equally broken | 12:32 |
jamesdenton | can you reach the servers from the respective dhcp namespaces? | 12:32 |
jamesdenton | that would be L2 adjacent. | 12:32 |
jamesdenton | are you using vxlan or vlan for those tenant networks? | 12:33 |
Brace | vxlan, I'm fairly sure | 12:33 |
mgariepy | back in kilo, with over 50 or 100 routers in ha mode, it was taking longer to recover than with non-ha ones using the scheduler script. | 12:33 |
Brace | I sort of understand what you're saying there, but no idea how to actually test reaching the servers via l2 | 12:34 |
jamesdenton | "back in kilo" :D | 12:34 |
mgariepy | lol :) yep. | 12:34 |
mgariepy | i know it's a long time ago.. lol | 12:34 |
jamesdenton | if you have access to the controllers, there should be a qdhcp namespace for each network. "ip netns list" will list them all | 12:34 |
jamesdenton | you will correlate the network uuid of your VM to the qdhcp namespace name - the ID will be the same. You should then be able to ssh or ping from that namespace to the VM. It's essentially avoiding the routers. | 12:35 |
Brace | ok, so I have stuff like this - qrouter-00eb4cab-d5d5-4d99-b6d1-bb2efd043903 | 12:35 |
Brace | so I just ssh that? | 12:36 |
mgariepy | maybe you can tweak : https://docs.openstack.org/neutron/ussuri/configuration/neutron.html#DEFAULT.max_l3_agents_per_router | 12:36 |
admin1 | Brace, how many network nodes do you actually have vs networks ? | 12:36 |
jamesdenton | yes, that's for each router. and that would work, too. if you can find the router that matches the one in front of your tenant network. the command would be something like "ip netns exec qrouter-00eb...3903 ping x.x.x.x" | 12:36 |
jamesdenton | or "ip netns exec qdhcp-3b2e...0697 ssh ubuntu@x.x.x.x" | 12:37 |
mgariepy | also `dhcp_agents_per_network` is a good candidate to remove some ports. | 12:37 |
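(Both of those options live in neutron.conf on the controllers. A sketch of setting them — values are illustrative only, and in an openstack-ansible deployment you would normally apply them through the neutron role's config overrides rather than by hand:)

```bash
# Illustrative values only -- lowering these reduces the number of HA/DHCP
# ports created per router/network, at the cost of redundancy.
crudini --set /etc/neutron/neutron.conf DEFAULT max_l3_agents_per_router 2
crudini --set /etc/neutron/neutron.conf DEFAULT dhcp_agents_per_network 2
systemctl restart neutron-server
```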
Brace | admin1: we have three controllers, we don't have dedicated network nodes | 12:38 |
lowercase | One issue i've had in the past is that when neutron goes down and all the interfaces on the routers attempt to come up at the same time, rabbitmq gets overwhelmed and the network doesn't come up. Sorry if this isn't relevant. I just popped in for a sec. | 12:39 |
Brace | lowercase: nope, it's useful information | 12:39 |
jamesdenton | if HA is enabled (which it appears to be) you will have a qrouter namespace on EACH controller/network node, but only one will be active. They use VRRP across a dedicated (vxlan likely) network to determine master | 12:40 |
jamesdenton | so, all three will have an 'ha' interface, but only one should have a qg and qr interface | 12:40 |
jamesdenton | that's the one you'd want to test from | 12:40 |
Brace | ok | 12:42 |
mgariepy | with: openstack network agent list --router <router_name> | 12:42 |
Brace | so I need to connect to dhcp-b27..... and not the qrouter | 12:42 |
jamesdenton | both would be a good test | 12:43 |
jamesdenton | and yes, that openstack network agent list command should tell you which one it thinks is active. | 12:43 |
Brace | ok, so sshing to the qrouter, just gives a 'No route to host' | 12:43 |
jamesdenton | can you post the command? | 12:44 |
Brace | ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ssh ubuntu@10.72.103.62 | 12:44 |
jamesdenton | ok, try: ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip addr | 12:44 |
jamesdenton | what interfaces do you see and can you pastebin that | 12:45 |
Brace | https://paste.openstack.org/show/bPLAlfK6Wuex68WnZ1Pw/ | 12:45 |
jamesdenton | ok, try the 192.168.5 addr of the VM instead | 12:46 |
jamesdenton | the fixed IP vs the floating IP | 12:46 |
jamesdenton | 10.72.103.62 may not be a floating IP or this may not be the right router, hard to tell | 12:46 |
Brace | that's a connection refused | 12:46 |
jamesdenton | can you post the 'openstack server show' output for the VM? | 12:47 |
Brace | https://paste.openstack.org/show/bFiL16SeU7ROmH1J3amH/ | 12:49 |
jamesdenton | ok, so ssh to 192.168.5.108 is connection refused? | 12:49 |
Brace | no, I was sshing to 192.168.5.1 | 12:50 |
jamesdenton | ahh ok, try 192.168.5.108 | 12:50 |
Brace | .108 works, just tried it now | 12:50 |
jamesdenton | ok good. well, one problem, then, is that 10.72.103.62 is the floating IP but does not appear on the qg interface | 12:50 |
jamesdenton | as a /32 | 12:50 |
Brace | ok | 12:51 |
jamesdenton | trying to think of the easiest way to rebuild that router without an agent restart. might be able to set admin state down for the router, watch the namespaces disappear, then admin up | 12:51 |
jamesdenton | can you check the other two controllers and do that same "ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip addr" and post? | 12:52 |
jamesdenton | also, curious to know the specs of those controller nodes. cpu/ram/etc | 12:52 |
Brace | 16 core Xeons, 192gb ram, 5x363gb hdds for ceph | 12:53 |
Brace | 3 controller nodes, with 17 compute nodes | 12:54 |
jamesdenton | ok, all identical specs then? | 12:54 |
jamesdenton | thanks | 12:54 |
Brace | the compute nodes have a lower spec, but the controller nodes are all the spec above | 12:56 |
Brace | https://paste.openstack.org/show/bCLxmVkZPwM6KhTExrTR/ | 12:58 |
jamesdenton | ok, thanks | 12:59 |
jamesdenton | just making sure there wasn't some sort of split-brain issue | 12:59 |
Brace | ah ok | 13:01 |
jamesdenton | ok, so you can try disabling the router with this command: openstack router set --disable b27d7882-d19f-4b8a-a34f-11e7d7821ba3. That should tear down the namespaces for that router only. then wait a min, and issue: openstack router set --enable b27d7882-d19f-4b8a-a34f-11e7d7821ba3. That should implement the namespaces and build out the floating IPs. | 13:01 |
jamesdenton | you might watch the l3 agent logs simultaneously across three controllers to see if there are any errors | 13:01 |
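(Condensed into commands, that disable/enable cycle looks something like the following — the router UUID is taken from the conversation, and the log location may differ per deployment:)

```bash
ROUTER=b27d7882-d19f-4b8a-a34f-11e7d7821ba3
openstack router set --disable "$ROUTER"
# on each controller the qrouter namespace should disappear:
ip netns list | grep "$ROUTER"
sleep 60
openstack router set --enable "$ROUTER"
# watch the L3 agent logs on all three controllers while it rebuilds:
tail -f /var/log/neutron/neutron-l3-agent.log
```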
Brace | ok | 13:09 |
Brace | nope, can't connect to the server on the 10. ip | 13:13 |
jamesdenton | can you post that ip netns output again from all 3? | 13:14 |
jamesdenton | and also, can you find the port that corresponds to 10.72.103.62 and post the 'port show' output? | 13:14 |
Brace | https://paste.openstack.org/show/b6dbUxyrRWMpX0sdQuEc/ | 13:21 |
jamesdenton | ok, so if you see c2 and c3, they both have interfaces for this router. Only C3 has the correct one by the looks of it | 13:23 |
jamesdenton | so there is something up with C2 - i'm wondering if the namespace never got torn down. | 13:23 |
jrosser | is it odd that 10.72.103.133/32 is on both C2 and C3 as well | 13:24 |
jamesdenton | you might try the disable command again, wait a min, and check to see if the namespaces disappear from all 3 controllers. if C2 is still there, then the agent may not be processing that properly | 13:24 |
jamesdenton | jrosser i think that namespace never went away | 13:24 |
jrosser | right | 13:25 |
jamesdenton | Brace if you perform that check, and the namespace on C2 is still there, you can try to destroy it with "ip netns delete qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3". then re-enable the router. the logs on C2 may help indicate what's happening, and/or rabbitmq may reveal stuck messages or something | 13:25 |
jamesdenton | restarting the L3 agent *only* on C2 could help, too | 13:26 |
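(If the C2 namespace does survive the disable, the cleanup described above would look roughly like this on C2 — a sketch, not a tested procedure:)

```bash
# run on C2 while the router is still disabled
ip netns delete qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3   # remove the stale namespace
systemctl restart neutron-l3-agent                             # restart the agent on C2 only
openstack router set --enable b27d7882-d19f-4b8a-a34f-11e7d7821ba3
```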
Brace | ok, I'll give that a bash, but I'm going to grab some lunch first | 13:26 |
Brace | but here is the port show pastebin = https://paste.openstack.org/show/b5RuQSovWmz0K6pSvRri/ | 13:26 |
jamesdenton | good idea :D | 13:26 |
jamesdenton | cool, thanks. is this a new router with only 2 floating IPs? | 13:27 |
Brace | thanks for all the help so far, much appreciated! | 13:27 |
jamesdenton | just want to make sure | 13:27 |
Brace | yes, that's the router of the instance that I've been doing all the troubleshooting on so far | 13:27 |
Brace | port even | 13:27 |
jamesdenton | cool, thanks | 13:27 |
Brace | openstack port show d04ed8a1-3313-4a14-bbd3-8a6057bc52e8 | 13:27 |
Brace | that was the command I used | 13:28 |
jamesdenton | perfect. i'm glad to see it turned up in the C3 namespace. | 13:28 |
Brace | jamesdenton: so we have restarted the l3 agent on C2 several times (and rebooted C2) so I'm not sure it'll do much tbh | 14:22 |
Brace | I'm going to try the disable command again | 14:27 |
Brace | so I deleted the router namespace as per ip netns delete qrouter-b27.... | 14:31 |
Brace | but a qrouter-b27d7.... is still there on C3 | 14:32 |
jamesdenton | sure - is the router re-enabled? | 14:41 |
Brace | no I didn't re-enable it | 14:41 |
spatel | noonedeadpunk hey! around | 14:41 |
jamesdenton | ok - so you disabled the router and the namespace didn't disappear from C3? weird. Can you check to see if there are any agent queues in rabbit that have messages sitting there? | 14:44 |
noonedeadpunk | semi-around | 14:44 |
Brace | nope according to haproxy nothing much happening | 14:45 |
Brace | I guess the best way to look would be to go into the rabbit container and look at queues in there? | 14:45 |
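(One way to check from inside the rabbitmq container, assuming rabbitmqctl is available there — the queue-name patterns are guesses; messages piling up with zero consumers is the usual red flag:)

```bash
rabbitmqctl list_queues name messages consumers | grep -iE 'l3|dhcp|q-agent|neutron'
```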
spatel | noonedeadpunk do you have any example k8s yaml file to create a hello-world with an octavia lb to expose a port? my yaml isn't spinning up an LB :( | 14:46 |
noonedeadpunk | nope, I don't :( | 14:47 |
noonedeadpunk | trying to avoid messing with k8s as much as I can :) | 14:48 |
Brace | nope looking at the neutron rabbit queues, they're all empty | 14:48 |
spatel | what do you mean by avoiding messing with k8s? | 14:51 |
NeilHanlon | spatel: can you share what you have so far? | 14:51 |
jamesdenton | Brace ok cool. Very odd the namespace didn't get destroyed. you can try to delete it on there, then turn up the router again and see if it gets built across all 3 and if only 1 becomes master. | 14:51 |
spatel | NeilHanlon i have this example code to spin up an app. I am trying to expose my app's port via Octavia but it doesn't work. I mean i am not seeing k8s create any LB to expose the port - https://paste.opendev.org/show/b7jCTenIZdMX3VFw5Rv2/ | 14:53 |
spatel | Just trying to understand how does k8s integrate with octavia | 14:53 |
NeilHanlon | have you deployed the octavia ingress controller? | 14:55 |
jamesdenton | ^^ | 14:55 |
NeilHanlon | https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/octavia-ingress-controller/using-octavia-ingress-controller.md | 14:55 |
spatel | NeilHanlon deploy ingress controller??? | 14:58 |
spatel | This is my first time playing with k8s with octavia, so sorry if i missed something or don't understand the steps | 14:59 |
NeilHanlon | spatel: check out that github link for the octavia ingress controller. Octavia doesn't natively interact with Kubernetes | 14:59 |
NeilHanlon | you need the octavia ingress controller running in your kubernetes cluster to facilitate using Octavia to load balance for kubernetes | 15:00 |
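(For what it's worth, once an OpenStack integration is running in the cluster — the octavia-ingress-controller linked above, or the openstack-cloud-controller-manager — a Service of type LoadBalancer is typically what triggers the Octavia LB. A minimal sketch with placeholder names and ports:)

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  type: LoadBalancer        # asks the cloud provider integration for an Octavia LB
  selector:
    app: hello-world
  ports:
    - port: 80
      targetPort: 8080
EOF
kubectl get service hello-world -w   # EXTERNAL-IP should populate once the LB is built
```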
noonedeadpunk | #startmeeting openstack_ansible_meeting | 15:01 |
opendevmeet | Meeting started Tue Mar 22 15:01:04 2022 UTC and is due to finish in 60 minutes. The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:01 |
opendevmeet | The meeting name has been set to 'openstack_ansible_meeting' | 15:01 |
noonedeadpunk | #topic rollcall | 15:01 |
noonedeadpunk | hey everyone | 15:01 |
noonedeadpunk | I'm half-there as have a meeting | 15:01 |
damiandabrowski[m] | hi! | 15:04 |
noonedeadpunk | #topic office hours | 15:08 |
noonedeadpunk | well, that said, I don't have many topics | 15:08 |
noonedeadpunk | we're super close to PTL and openstack release | 15:08 |
noonedeadpunk | * s/PTL/PTG/ | 15:09 |
noonedeadpunk | So please register to PTG and let's fill in topics for discussion in etherpad | 15:10 |
noonedeadpunk | #link https://etherpad.opendev.org/p/osa-Z-ptg | 15:10 |
NeilHanlon | heya | 15:10 |
NeilHanlon | TY for the note about PTG! | 15:12 |
Brace | jamesdenton: nope, it doesn't get rebuilt on C2 and takes a disable/enable before it comes up on the other two, so I guess that leans towards rabbitmq being the issue? | 15:14 |
jamesdenton | well, or somehow the router is no longer scheduled against C2. can you try openstack network agent list --router <id> | 15:17 |
noonedeadpunk | other than that I didn't have time for anything else, since the week was quite tough with internal things going on | 15:18 |
Brace | jamesdenton: it's alive and up on all controllers according to that command | 15:19 |
jamesdenton | ok. it may be that the other routers are experiencing a similar condition | 15:21 |
jamesdenton | not sure if only C2 is to blame, but seems problematic anyway. | 15:21 |
Brace | yup | 15:25 |
jrosser | o/ hello | 15:25 |
opendevreview | Jonathan Rosser proposed openstack/ansible-role-pki master: Refactor conditional generation of CA and certificates https://review.opendev.org/c/openstack/ansible-role-pki/+/830794 | 15:27 |
*** dviroel is now known as dviroel|lunch | 15:50 | |
Brace | jamesdenton: I'm going to try restarting l3-agent on C2 and see what that does, can't make things worse can it! | 15:50 |
jamesdenton | i guess not :D | 15:50 |
jamesdenton | tail the log after you restart it and see if there's anything unusual reported | 15:50 |
jamesdenton | main loop exceeding timeout or anything along those lines | 15:51 |
Brace | jamesdenton: I wouldn't know what's unusual tbh, but there's a note about finding leftover processes, which indicates an unclean termination of a previous run | 15:57 |
jrosser | oh well now that is a thing with systemd isn't it? | 15:57 |
Brace | not sure tbh, but it otherwise seems ok | 15:58 |
jrosser | this https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/772538 | 15:59 |
Brace | but the router still doesn't come back on c2 | 15:59 |
jrosser | and https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/781109 | 15:59 |
jrosser | so restarting the neutron agent does not necessarily remove all the things that it created | 16:00 |
jamesdenton | Brace you might try unscheduling the router from the agent, and rescheduling to see if it will clean up and recreate | 16:00 |
Brace | ah, I just tried disabling and enabling the port | 16:01 |
Brace | router even | 16:01 |
jamesdenton | right, ok. | 16:01 |
jamesdenton | try to unschedule and reschedule. i cannot remember offhand the command | 16:01 |
jamesdenton | i'm headed OOO for the day | 16:01 |
Brace | jamesdenton: thank you for all your help, much appreciated | 16:02 |
Brace | I'll try the unschedule and reschedule | 16:02 |
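(The unschedule/reschedule step would look roughly like this; the exact client syntax varies between releases, so treat it as a sketch and check `openstack network agent add router --help` first. The agent ID is a placeholder:)

```bash
ROUTER=b27d7882-d19f-4b8a-a34f-11e7d7821ba3
openstack network agent list --router "$ROUTER"                  # note the C2 L3 agent ID
openstack network agent remove router --l3 <c2-l3-agent-id> "$ROUTER"
openstack network agent add router --l3 <c2-l3-agent-id> "$ROUTER"
```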
noonedeadpunk | #endmeeting | 16:04 |
opendevmeet | Meeting ended Tue Mar 22 16:04:40 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:04 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.html | 16:04 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.txt | 16:04 |
opendevmeet | Log: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.log.html | 16:04 |
*** Guest0 is now known as prometheanfire | 16:33 | |
admin1 | has anyone seen an issue with nova snapshots when glance and cinder are on ceph but nova is local storage? | 16:56 |
admin1 | and if my glance backend is only rbd, do I need enabled_backends = rbd:rbd,http:http,cinder:cinder -- all of these ? | 16:56 |
admin1 | i am getting an error that snapshots do not work .. and research points to multiple backends being in glance | 16:57 |
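(For context, a minimal glance-api.conf multistore fragment with only the rbd backend enabled looks roughly like the following; whether the extra http/cinder entries are actually required isn't settled here, and the pool/user values are assumptions:)

```bash
cat <<'EOF'   # shape of the config only, shown for illustration
[DEFAULT]
enabled_backends = rbd:rbd

[glance_store]
default_backend = rbd

[rbd]
rbd_store_pool = images
rbd_store_user = glance
rbd_store_ceph_conf = /etc/ceph/ceph.conf
EOF
```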
*** tosky is now known as Guest38 | 17:04 | |
*** tosky_ is now known as tosky | 17:04 | |
opendevreview | Merged openstack/openstack-ansible-os_neutron master: Set fail_mode explicitly https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/834436 | 18:30 |
admin1 | if that bug is really what i think i am hitting, how do I override glance in haproxy from mode http to tcp ? | 18:33 |
admin1 | so i changed the mode in haproxy to tcp from http and it's fixed .. now I want to know how to make this fix permanent | 20:20 |
*** dviroel is now known as dviroel|brb | 20:21 | |
*** dviroel|brb is now known as dviroel | 23:24 | |
*** dviroel is now known as dviroel|out | 23:31 |