Wednesday, 2021-09-22

mrf3	hi, deploying wallaby ansible in a lab for test, do you use default HAProxy settings for galera?	06:16
mrf3	with the default balance params during deployment you can get timeouts	06:16
jrosser	the defaults should be "sensible"	07:05
jrosser	mrf3: can you give a specific example? do you mean something times out during the deployment and it fails?	07:06
mrf3	haproxy by default for galera is configured so that just node1 gets all traffic during deployment	07:12
mrf3	due to the "massive" queries / connections for setup, in the lab of 3 nodes galera got timeouts	07:13
mrf3	does this not happen in your tests?	07:13
mrf3	or do people just test using AIO?	07:14
jrosser	the galera setup is deliberately active/standby/standby	07:27
jrosser	are you exceeding the maximum number of connections?	07:28
*** rpittau|afk is now known as rpittau		07:28
mrf3	yes just deploying, 3 controllers, 3 networks, 1 compute	07:32
mrf3	yes we got " backend galera-back has no server available!"	07:32
mrf3	galera has 3 containers ofc	07:32
jrosser	well, there is a kind of corner case, that if you exceed the maximum number of database connections, then the haproxy healthcheck itself cannot connect to the database	07:40
jrosser	so the loadbalancer thinks that the db is down	07:40
jrosser	so i think it's worth trying to understand why in your case the loadbalancer thinks there is no backend	07:41
noonedeadpunk	mornings	07:50
jrosser	morning	07:54
mrf3	i'm surprised that no one hit this corner case, or did they just solve it themselves without reporting it?	08:12
mrf3	running out of mysql connections is not a "corner" case, it is a future production problem	08:12
mrf3	because haproxy by default only enables a single node for the full infrastructure	08:12
noonedeadpunk	yeah and we have plans to move away from balancing with haproxy to proxysql	08:13
noonedeadpunk	eventually right now we pushed max connection limit but it's not really a long term solution for sure	08:14
noonedeadpunk	but we also need to reduce the amount of sleeping connections	08:14
noonedeadpunk	we have etherpad with some ideas https://etherpad.opendev.org/p/db_pool_calculations but wasn't able to return back to it :(	08:15
anskiy	I've seen this same issue actually in lab. Fixed it with this override: https://paste.opendev.org/show/809487/. Another problem of the way it works right now is that queries are not distributed across cluster (galera one) nodes, so all the load essentially hits the first node :(	08:18
noonedeadpunk	well, eventually, galera can consume writes only on single node. But reads could be distributed across many	08:22
noonedeadpunk	However haproxy can't really distinguish read/write here because it balances at l4 (TCP), while l7 is required for this kind of balancing	08:23
mrf3	anskiy +1 same happened to me	08:23
mrf3	i was having a nightmare thinking i had set up something wrong at the network level, and then i saw the haproxy galera backend down, which caused the playbooks to fail	08:24
jrosser	some specific tuning per deployment is probably needed	08:32
jrosser	depending how many services you deploy / how many CPUs your controllers have / blah blah this all affects the number of database connections	08:32
noonedeadpunk	we set smth like `galera_max_connections: 5000` in user_vars	08:35
noonedeadpunk	because 90% of all connections are sleeping ones atm	08:35
mrf3	sleeping connections are the problem	08:40
noonedeadpunk	well, it depends. Because it speeds up reaching maria by using already established connections	09:10
noonedeadpunk	Also you shouldn't get more connections to DB now except from the already spawned pool	09:10
jrosser	the etherpad explains it quite will tbh	09:11
jrosser	*well	09:11
mrf3	galera_wait_timeout is the timeout for idle connections?	09:59
jrosser	mrf3: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/templates/my.cnf.j2#L66	10:34
mrf3	ty jrosser i setup to default timeout of 120 secs	10:54
mrf3	3600 secs idle for a sql connection, it's crazy	10:55
jrosser	this is not an OSA thing btw	10:55
jrosser	it is how oslo.db middleware works, and by extension all of the services	10:55
jrosser	you can tune the settings on the db connection pool based on your use case	10:56
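
[Editor's note] The point about services, CPU counts and pool settings can be made concrete with a rough upper bound. This is a minimal sketch, assuming each worker process keeps its own SQLAlchemy pool limited by the oslo.db options `max_pool_size` and `max_overflow`. The function name and every number below are illustrative assumptions, not values taken from this log or from OSA defaults.

```python
# Rough upper bound on database connections opened through oslo.db pools.
# All numbers are illustrative assumptions, not OSA or oslo.db defaults.


def worst_case_connections(controllers, worker_processes_per_controller,
                           max_pool_size, max_overflow):
    """Return the maximum number of connections the pools could open.

    Assumes every worker process owns one pool, and that a pool can hold
    at most max_pool_size + max_overflow connections.
    """
    per_process = max_pool_size + max_overflow
    return controllers * worker_processes_per_controller * per_process


estimate = worst_case_connections(
    controllers=3,
    # hypothetical: total workers of all services on one controller
    worker_processes_per_controller=40,
    max_pool_size=5,
    max_overflow=50,
)
print(estimate)  # 3 * 40 * (5 + 50) = 6600
```

With these made-up inputs the bound exceeds the `galera_max_connections: 5000` mentioned earlier, which is why either the limit or the pool settings usually need tuning per deployment.
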
spatel	jamesdenton morning	13:14
*** arxcruz is now known as arxcruz|ruck		13:14
jamesdenton	good morning	13:14
spatel	I found solution :)	13:14
jamesdenton	mice :)	13:15
jamesdenton	nice, too!	13:15
spatel	did you?	13:15
jamesdenton	tell me, was it obvious?	13:15
spatel	i saw your post related that error	13:15
jamesdenton	no, i've been messing with it	13:15
spatel	solution is Disable -> HP Shared Memory Features	13:15
spatel	In BIOS	13:15
jamesdenton	hmm, ok	13:16
spatel	First you have to upgrade NIC firmware it will give you option in BIOS to disable/enable HP shared Memory	13:16
jamesdenton	ok, let me reboot this thing and see	13:16
spatel	I am very surprised there is no simple answer in internet, every post just clueless...	13:17
spatel	https://ibb.co/YW1HTdG	13:17
jamesdenton	it all varies so much	13:17
spatel	In Device level configuration > disable Shared memory and DPDK will work like charm	13:17
spatel	Make sure you have the latest firmware for the HP NIC, otherwise you won't be able to find that option in BIOS	13:18
jamesdenton	i think i do, but i'll double check	13:18
jamesdenton	i need to compare this config to another G9 i have tested DPDK with. It's possible i disabled it there, i dunno	13:19
jamesdenton	that was 2 yrs ago	13:19
spatel	:) HP just making this harder for everyone..	13:19
spatel	now i am testing on different server to just check this is correct process and i would like to blog it out so other folks don't need to struggle	13:20
spatel	learnt one more thing: in a blog never reference HP kb articles, because they started restricting some of them like redhat does, only paying customers are allowed to read them	13:21
jamesdenton	maybe a screenshot would help	13:21
spatel	yep.. i was reading your blog post and same thing happened most of the links are broken :(	13:22
jamesdenton	which post was that?	13:22
spatel	let me find it.. you have very nice blog related this issue	13:22
spatel	jamesdenton https://www.jimmdenton.com/proliant-intel-dpdk/	13:24
spatel	may be that solve your issue :)	13:24
noonedeadpunk	haha lol	13:25
jamesdenton	Well, i think the IOMMU stuff was <= G8, i'm not seeing the same thing on this G9	13:25
jamesdenton	but the shared memory stuff sound plausible	13:25
spatel	I have Gen9 servers and i am seeing same issue	13:25
spatel	Maybe HP later added that feature to disable/enable shared memory to handle RMRR (using NIC firmware)	13:27
spatel	jamesdenton i am planning to patch ovn to support dpdk. its broken at present and only supported in ml2.ovs plugin	13:31
jamesdenton	by all means	13:31
spatel	jamesdenton did you try AF_XDP ?	13:33
spatel	look like its next big thing in OVS, if it workout then it will overtake DPDK	13:34
jamesdenton	if it's easier to implement it just might do it	13:35
spatel	https://docs.openvswitch.org/en/latest/intro/install/afxdp/	13:35
spatel	did you play with it? i am going to try out	13:35
jamesdenton	i have not	13:40
spatel	jamesdenton i am reading this - https://docs.openstack.org/networking-ovn/queens/admin/dpdk.html	13:44
spatel	help me to understand, why they are saying just set datapath_type=netdev	13:44
spatel	for br-int ?	13:45
spatel	what about br-provider, we don't need it there?	13:45
jamesdenton	yes	13:45
jamesdenton	"and all other bridges if connected to the integration bridge via patch ports"	13:45
spatel	br-provider is external network so that should be connect to dpdk port right?	13:46
spatel	so we don't need command like this ? - ovs-vsctl add-port br-int dpdk-1 -- set Interface dpdk-1 type=dpdk options:dpdk-devargs=0000:06:00.1	13:47
jamesdenton	i would expect to need that	13:49
jamesdenton	i don't think those instructions are complete	13:49
spatel	that is what i thought something is missing	13:51
spatel	let me ask openvswitch channel	13:51
strattao	we need to change our cinder_service_password. I see warnings all over the documentation that says changing the password and running the playbooks will break the region. So... how would I go about changing the cinder_service_password? (or any other passwords that might need to be changed for that matter...)	14:01
strattao	(sorry to hijack this fascinating conversation spatel & jamesdenton)	14:02
spatel	strattao no worry, i mostly go with whatever secret password generated by OSA, why do you want to change that ?	14:03
spatel	you are not going to use them in daily basis so why bother	14:04
strattao	our password complexity requirements changed and the ones that were originally autogenerated no longer match the new requirements.	14:04
spatel	but why just cinder_service_password only?	14:05
noonedeadpunk	strattao: Actually I think you can just update password and re-run the role.	14:06
strattao	looks like that's the only one that doesn't meet the new requirements	14:07
noonedeadpunk	Service will be really broken, but only until the role execution ends	14:07
noonedeadpunk	Because you need services to be restarted after password is changed, and it's changed somewhere in the middle of role execution	14:07
noonedeadpunk	if you use manila - you should also re-run it	14:08
strattao	hmmm, well I was testing this out in one of our dev regions and it failed re-running os-cinder. The error was with keystone, so I re-ran os-keystone, then os-cinder, to no avail. Now I have setup-everything running in the background now, but figured I'd ask here if anyone has ever had to update a password since all the documentation says that this is bad and I don't want to break in production.	14:10
noonedeadpunk	And what was the issue with keystone?	14:14
noonedeadpunk	Have you tried to comment out no_log there?	14:14
noonedeadpunk	because we should have force to be set for quite some time	14:14
noonedeadpunk	Which means that role will attempt to change password	14:14
noonedeadpunk	Unless you've also changed keystone admin password, then things are really bad	14:15
strattao	Yeah, you're right - the keystone error I was seeing was that the cinder-api is not responsive. So, I'll try to restart cinder services in the api container now that the password has been reset and then re-run the os-cinder playbook and ..... ??? profit?!?	14:23
strattao	Nope, that didn't work :(	14:26
strattao	Okay, well I'll keep plugging along, thx	14:26
spotz	strattao: Depending on the error it might also be a good idea to ask in Cinder assuming it's not an OSA issue	14:37
noonedeadpunk	I guess it's osa actually....	15:06
noonedeadpunk	strattao: but some output would be useful	15:06
opendevreview	Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Add serial execution to all playbooks https://review.opendev.org/c/openstack/openstack-ansible/+/805188	15:17
*** rpittau is now known as rpittau|afk		15:45
spatel	jamesdenton, i have tested my method on 3 servers and it just works after disabling the HP shared memory feature in BIOS	15:57
jamesdenton	nice! so far, that has not worked for me	16:15
*** mgoddard- is now known as mgoddard		17:13
-opendevstatus- NOTICE: Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page		18:35
mrf3	mm guys how much memory is needed for a controller deployed with ansible? 16GB is not enough	22:25
