felixhuettner[m] | stalking is always nice :) | 06:50 |
felixhuettner[m] | we run individual Rabbit clusters for each openstack service | 06:50 |
felixhuettner[m] | we are honestly not sure what is causing all the messages | 06:50 |
felixhuettner[m] | but we mostly see issues when we restart the rabbit nodes one by one | 06:51 |
felixhuettner[m] | the setup is a little "non-standard": we run rabbit in k8s and use tls for all connections | 06:51 |
felixhuettner[m] | replication policy is set according to the sig recommendation | 06:52 |
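For reference, the SIG-style mirroring recommendation is usually applied with a rabbitmqctl policy along these lines; the vhost name, queue pattern and ha-mode shown here are illustrative and not quoted from this discussion:

    # Mirror all non-internal queues in the nova vhost to every cluster node,
    # sync new mirrors automatically, and place new queue masters on the node
    # that currently holds the fewest masters.
    rabbitmqctl set_policy -p /nova ha-all '^(?!amq\.).*' \
      '{"ha-mode":"all","ha-sync-mode":"automatic","queue-master-locator":"min-masters"}' \
      --apply-to queues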
felixhuettner[m] | but we try to do a rolling restart each week | 06:53 |
felixhuettner[m] | and during this time our nova and neutron rabbit clusters break (when we restart the last of the 3 nodes) | 06:53 |
felixhuettner[m] | the patches you found have definitely improved the behaviour (the nova cluster now only dies half the time) | 06:54 |
felixhuettner[m] | but a colleague of mine is now working on getting this to a better state | 06:54 |
amorin | ok! | 07:07 |
amorin | we noticed on our side that each time we restart a node, the connections from agents (neutron mostly) are dispatched to the other nodes, so the load is split across 2 nodes instead of 3 | 07:08 |
amorin | and the load is never spread back across the 3 nodes automatically | 07:08 |
amorin | the only way to rebalance is to restart the agents | 07:09 |
felixhuettner[m] | oooh, that is also interesting | 07:09 |
amorin | so if for some reason, 2 nodes over 3 are down, the last node will handle all the load | 07:09 |
amorin | and it will stay like that forever | 07:10 |
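Side note: oslo.messaging will not rebalance established connections on its own, but the rabbit driver does have a knob that at least randomizes which broker an agent reconnects to. A minimal sketch, with made-up host names and a three-node transport_url:

    [DEFAULT]
    transport_url = rabbit://openstack:secret@rabbit-0,openstack:secret@rabbit-1,openstack:secret@rabbit-2/neutron

    [oslo_messaging_rabbit]
    # Pick the next broker at random on reconnect instead of always walking
    # transport_url in order, so agents don't all pile onto the same node.
    kombu_failover_strategy = shuffle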
felixhuettner[m] | the issue we see is with the failover of mirrored queues (that might also fit your issue) | 07:10 |
felixhuettner[m] | it seems like, when failing over, these queues select the next master based on the "oldest" node | 07:10 |
amorin | have you tried the quorum queues? | 07:10 |
felixhuettner[m] | so if you have restarted 2 out of 3 nodes then the last node will have all the mirrored queues since it is the oldest | 07:10 |
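Side note: on RabbitMQ 3.8 and later, queue masters/leaders can also be redistributed after a rolling restart without touching the clients, e.g.:

    # Run on any cluster node once the rolling restart is done; spreads
    # classic queue masters and quorum queue leaders back across the nodes.
    rabbitmq-queues rebalance all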
felixhuettner[m] | not yet | 07:11 |
felixhuettner[m] | we are still a little stuck on queens :D | 07:11 |
felixhuettner[m] | do you have some experiences there? | 07:11 |
amorin | not yet either :( | 07:11 |
amorin | but we are running our own patched version of oslo.messaging, so we might be able to upgrade it to a recent version that enables quorum queues | 07:12 |
amorin | that's something we will try for sure | 07:12 |
felixhuettner[m] | we are currently planning a new cluster on yoga and will use quorum queues there. So maybe I can share something there in a few months | 07:12 |
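For the record: in recent oslo.messaging releases (roughly Xena/Yoga onwards) quorum queues are opt-in per service configuration. A minimal sketch, assuming the stock rabbit driver:

    [oslo_messaging_rabbit]
    # Declare the queues created by the driver as quorum queues
    # instead of classic mirrored queues.
    rabbit_quorum_queue = true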
amorin | that would be amazing! | 07:12 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 07:42 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 08:03 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 08:42 |
opendevreview | Ramona Rautenberg proposed openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 10:14 |
opendevreview | Merged openstack/large-scale master: Transfer configure page to rst format https://review.opendev.org/c/openstack/large-scale/+/847535 | 15:43 |