*** rlandy is now known as rlandy|out | 01:25 | |
*** mhen_ is now known as mhen | 01:54 | |
*** yadnesh|away is now known as yadnesh | 04:33 | |
*** luigi is now known as luigi-out | 05:42 | |
*** yadnesh is now known as yadnesh|afk | 08:03 | |
*** yadnesh|afk is now known as yadnesh | 09:05 | |
*** dasTor_ is now known as dasTor | 09:34 | |
*** rlandy|out is now known as rlandy | 10:30 | |
zbig | Hey folks, I originally posted this message in the #openstack-kolla channel but it fits better here. I am looking for a bit of help with an HA setup. | 12:21 |
zbig | My cluster consists of 8 machines, with 3 in the controller role (inventory file: https://paste.opendev.org/show/blFqF5uQtjDvfj6Q1E6f/ ). I am deploying OpenStack Yoga with kolla-ansible using the following settings (globals.yml: https://paste.opendev.org/show/bPhVP7i2es2NEsLxRwMu/ ). | 12:21 |
zbig | I was testing the HA capabilities by doing a hard power-down of one of the controller machines and observing how OpenStack behaves. | 12:21 |
zbig | I noticed the following: Horizon was unusable, failing with a "something went wrong" page. After debugging, I figured out that changing this value here https://github.com/openstack/kolla-ansible/blob/stable/yoga/ansible/roles/horizon/templates/local_settings.j2#L13 to False made Horizon load properly. | 12:21 |
zbig | But even then, creating a new instance was failing and Horizon was very slow. | 12:21 |
zbig | At the same time I noticed that the openstack CLI client was working fine for read operations (I didn't try write operations). Grafana and Kibana were both fine. | 12:21 |
zbig | So my question is: is there anything specific to an HA setup I should consider? I expected that taking out one controller node might result in some issues with network or compute, but that Horizon and scheduling new workloads would keep working. | 12:21 |
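(For context, a kolla-ansible multinode inventory with 3 controllers typically looks something like the sketch below; the hostnames are placeholders and zbig's actual inventory is in the paste above.)

```ini
# Sketch of a kolla-ansible multinode inventory with 3 controllers
# and 5 computes (8 machines total); hostnames are placeholders
[control]
controller01
controller02
controller03

[network:children]
control

[compute]
compute01
compute02
compute03
compute04
compute05
```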
lowercase | look at your rabbitmq logs during the ha test and report back | 13:25 |
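(In a kolla-ansible deployment the RabbitMQ logs and cluster state can be checked on each controller roughly as below; `rabbitmq` is kolla's default container name, and the exact log file name includes the node's hostname.)

```shell
# kolla-ansible writes container logs under /var/log/kolla on each host
tail -f /var/log/kolla/rabbitmq/rabbit@$(hostname -s).log

# cluster membership and any partitions, from inside the rabbitmq container
docker exec rabbitmq rabbitmqctl cluster_status
```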
*** blarnath is now known as d34dh0r53 | 13:50 | |
zbig | lowercase: didn't see anything unusual in rabbitmq, but I peeked into the nova-scheduler logs and it seems like it still tries to connect to the AMQP server on the node that went offline. It is using the machine's IP, which is a bit strange; I would have thought it would use the internal VIP | 14:25 |
zbig | the exact error is "2022-10-12 14:23:12.898 20 ERROR oslo.messaging._drivers.impl_rabbit [-] [d2ec1d06-bcc3-4bf8-8952-7e8c4b41cdec] AMQP server on 172.20.20.11:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>", where that IP, 172.20.20.11, belonged to the offline node | 14:26 |
lowercase | rabbitmq shouldn't be behind a vip. | 14:49 |
lowercase | nova, for example, should have all three rabbitmq servers listed individually. my config lists them as username:password@rabbit1, username:password@rabbit2, username:password@rabbit3 | 14:51 |
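(For reference, the oslo.messaging transport_url that kolla-ansible generates lists every broker in one comma-separated URL, along the lines of the sketch below; the user, password, and IPs are placeholders.)

```ini
# nova.conf, [DEFAULT] section; credentials and IPs are placeholders
transport_url = rabbit://openstack:secret@172.20.20.11:5672,openstack:secret@172.20.20.12:5672,openstack:secret@172.20.20.13:5672//
```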
zbig | hm, I checked nova.conf and transport_url does indeed have all 3 nodes listed, so maybe that error message from nova-scheduler is just a red herring | 14:56 |
*** yadnesh is now known as yadnesh|afk | 15:11 | |
*** yadnesh|afk is now known as yadnesh|away | 15:11 | |
zbig | looking at the logs of nova and cinder-volume, they are full of the same AMQP errors. What is not clear to me is whether these services (nova and cinder) are merely logging the errors and silently connecting to the other 2 rabbitmq servers, or whether they error on connecting to the offline one and don't try to connect to the others | 15:15 |
zbig | there is some activity in the other rabbitmq logs but I would expect way more... I suspect all the issues are indeed related to the offline controller host being listed 1st in the connection URLs the services use | 15:21 |
zbig | I will later run an experiment turning off the controller node that is listed last in that connection string, to see if the cluster behaves differently | 15:22 |
zbig | more generally, is there anything to keep in mind with OpenStack HA? I just followed the kolla-ansible quick start and made sure to have 3 controller nodes and a VIP; I assumed it would work out of the box | 15:23 |
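(The HA-relevant pieces of a kolla-ansible globals.yml are roughly the settings below; the VIP address is a placeholder and zbig's actual settings are in the paste above.)

```yaml
# HA-relevant globals.yml settings (values are placeholders)
kolla_internal_vip_address: "172.20.20.250"  # VIP held by keepalived, fronting haproxy
enable_haproxy: "yes"
enable_keepalived: "yes"
```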
lowercase | Just to clarify: rabbitmq and galera mysql should both not be behind a vip. Every other service will/could/should have a vip. I would catalog the errors and review the quantity of errors for each ip. rabbit2 and rabbit3 could be unreachable, have improper authentication, or be getting partitioned during a failover; each of these has a very different solution and will require more troubleshooting time. | 16:46 |
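(A quick sketch of that cataloging, assuming kolla's log layout under /var/log/kolla:)

```shell
# Tally AMQP connection errors per broker IP across nova and cinder logs
grep -hoE 'AMQP server on [0-9.]+:5672' /var/log/kolla/{nova,cinder}/*.log \
  | sort | uniq -c | sort -rn
```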
lowercase | actually, I don't recall if my mysql database is behind a vip. I'm not on the VPN to check atm. | 16:54 |
*** timburke_ is now known as timburke | 20:59 | |
*** rlandy is now known as rlandy|bbl | 22:04 | |
*** rlandy|bbl is now known as rlandy | 23:50 |