felixhuettner[m] | stalking is always nice :) | 06:50 |
felixhuettner[m] | we run individual Rabbit clusters for each openstack service | 06:50 |
felixhuettner[m] | we are honestly not sure what is causing all the messages | 06:50 |
felixhuettner[m] | but we mostly see issues when we restart the rabbit nodes one by one | 06:51 |
felixhuettner[m] | the setup is a little "non-standard": we run rabbit in k8s and use tls for all connections | 06:51 |
felixhuettner[m] | replication policy is set according to the sig recommendation | 06:52 |
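For reference, the SIG-style mirroring recommendation is usually applied with a rabbitmqctl policy along these lines; the vhost name, queue pattern and ha-mode shown here are illustrative and not quoted from this discussion:

    # Mirror all non-internal queues in the nova vhost to every cluster node,
    # sync new mirrors automatically, and place new queue masters on the node
    # that currently holds the fewest masters.
    rabbitmqctl set_policy -p /nova ha-all '^(?!amq\.).*' \
      '{"ha-mode":"all","ha-sync-mode":"automatic","queue-master-locator":"min-masters"}' \
      --apply-to queues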
felixhuettner[m] | but we try to do a rolling restart each week | 06:53 |
felixhuettner[m] | and during this time our nova and neutron rabbit clusters break (when we restart the last of the 3 nodes) | 06:53 |
felixhuettner[m] | the patches you found have definitely improved the behaviour (the nova cluster now only dies half the time) | 06:54 |
felixhuettner[m] | but a colleague of mine is now working on getting this to a better state | 06:54 |
amorin | ok! | 07:07 |
amorin | we noticed on our side that each time we restart a node, the connections from agents (neutron mostly) are dispatched to the other nodes, so the load is split across 2 nodes instead of 3 | 07:08 |
amorin | and the load is never spread back across the 3 nodes automatically | 07:08 |
amorin | the only way to rebalance is to restart the agents | 07:09 |
felixhuettner[m] | oooh, that is also interesting | 07:09 |
amorin | so if for some reason, 2 nodes over 3 are down, the last node will handle all the load | 07:09 |
amorin | and it will stay like that forever | 07:10 |
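Side note: oslo.messaging will not rebalance established connections on its own, but the rabbit driver does have a knob that at least randomizes which broker an agent reconnects to. A minimal sketch, with made-up host names and a three-node transport_url:

    [DEFAULT]
    transport_url = rabbit://openstack:secret@rabbit-0,openstack:secret@rabbit-1,openstack:secret@rabbit-2/neutron

    [oslo_messaging_rabbit]
    # Pick the next broker at random on reconnect instead of always walking
    # transport_url in order, so agents don't all pile onto the same node.
    kombu_failover_strategy = shuffle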
felixhuettner[m] | the issue we see is with the failover of mirrored queues (that might also fit your issue) | 07:10 |
felixhuettner[m] | it seems like, when failing over, these queues select the next master based on the "oldest" node | 07:10 |
amorin | have you tried the quorum queues? | 07:10 |
felixhuettner[m] | so if you have restarted 2 out of 3 nodes then the last node will have all the mirrored queues since it is the oldest | 07:10 |
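Side note: on RabbitMQ 3.8 and later, queue masters/leaders can also be redistributed after a rolling restart without touching the clients, e.g.:

    # Run on any cluster node once the rolling restart is done; spreads
    # classic queue masters and quorum queue leaders back across the nodes.
    rabbitmq-queues rebalance all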
felixhuettner[m] | not yet | 07:11 |
felixhuettner[m] | we are still a little stuck on queens :D | 07:11 |
felixhuettner[m] | do you have some experiences there? | 07:11 |
amorin | not yet either :( | 07:11 |
amorin | but we are running our own patched version of oslo.messaging, so we might be able to upgrade it to a recent version that enables quorum queues | 07:12 |
amorin | that's something we will try for sure | 07:12 |
felixhuettner[m] | we are currently planning a new cluster on yoga and will use quorum queues there. So maybe I can share something there in a few months | 07:12 |
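For the record: in recent oslo.messaging releases (roughly Xena/Yoga onwards) quorum queues are opt-in per service configuration. A minimal sketch, assuming the stock rabbit driver:

    [oslo_messaging_rabbit]
    # Declare the queues created by the driver as quorum queues
    # instead of classic mirrored queues.
    rabbit_quorum_queue = true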
amorin | that would be amazing! | 07:12 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 07:42 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 08:03 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 08:42 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 10:14 |
opendevreview | Merged openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 15:43 |