Wednesday, 2022-10-12

*** rlandy is now known as rlandy|out 01:25
*** mhen_ is now known as mhen 01:54
*** yadnesh|away is now known as yadnesh 04:33
*** luigi is now known as luigi-out 05:42
*** yadnesh is now known as yadnesh|afk 08:03
*** yadnesh|afk is now known as yadnesh 09:05
*** dasTor_ is now known as dasTor 09:34
*** rlandy|out is now known as rlandy 10:30
zbig Hey folks, I originally posted this message in the #openstack-kolla channel, but it fits better here. I am looking for a bit of help with an HA setup. 12:21
zbig My cluster consists of 8 machines, 3 of them in the controller role (inventory file: https://paste.opendev.org/show/blFqF5uQtjDvfj6Q1E6f/ ). I am deploying OpenStack Yoga with kolla-ansible with the following settings (globals.yaml: https://paste.opendev.org/show/bPhVP7i2es2NEsLxRwMu/ ). 12:21
zbig I was testing HA capabilities by doing a hard power-down of one of the controller machines and observing how OpenStack behaves. 12:21
zbig I noticed the following: Horizon was not usable, failing with a “something went wrong” page. After debugging I figured out that changing the value at https://github.com/openstack/kolla-ansible/blob/stable/yoga/ansible/roles/horizon/templates/local_settings.j2#L13 to False makes Horizon load properly. 12:21
zbig But even then, creating a new instance was failing and Horizon was super slow. 12:21
zbig At the same time I noticed that the openstack CLI client was working OK for read operations (didn’t try write operations). Grafana and Kibana were both fine. 12:21
zbig So my question is: is there anything specific to an HA setup that I should consider? I was hoping that taking out one controller node might cause some issues with network or compute, but that Horizon and scheduling new workloads would keep working fine. 12:21
lowercase look at your rabbitmq logs during the ha test and report back 13:25
*** blarnath is now known as d34dh0r53 13:50
zbig lowercase: didn't see anything unusual in rabbitmq, but i peeked into the nova-scheduler logs and it seems it still tries to connect to the AMQP server on the node that went offline. it is using a machine IP, which is a bit strange; i would have thought it would use the internal VIP 14:25
zbig the exact error is "2022-10-12 14:23:12.898 20 ERROR oslo.messaging._drivers.impl_rabbit [-] [d2ec1d06-bcc3-4bf8-8952-7e8c4b41cdec] AMQP server on 172.20.20.11:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>" where the IP 172.20.20.11 belongs to the offline node 14:26
lowercase rabbitmq shouldn't be behind a vip. 14:49
lowercase nova, for example, should have all three rabbitmq servers listed individually. my config looks like username:password@rabbit1, username:password@rabbit2, username:password@rabbit3 14:51
zbig hm, i checked nova.conf and transport_url does indeed have all 3 nodes listed, so maybe that error message from nova-scheduler is just a red herring 14:56
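The check described above (confirming that transport_url lists every RabbitMQ node individually) could be scripted roughly as follows. This is only a sketch: the function name `rabbit_hosts`, the credentials, and the addresses 172.20.20.12 and 172.20.20.13 are made up for illustration; only 172.20.20.11 appears in the log. It assumes the usual comma-separated `user:password@host:port` layout followed by a virtual host.

```python
def rabbit_hosts(transport_url: str) -> list[str]:
    """Return the host:port entries listed in a transport_url string."""
    # Drop the scheme ("rabbit://") and everything from the first "/" on
    # (the virtual host), leaving only the comma-separated server entries.
    _, _, rest = transport_url.partition("://")
    netloc = rest.split("/", 1)[0]

    hosts = []
    for entry in netloc.split(","):
        # Each entry looks like user:password@host:port; credentials are
        # optional, so keep whatever follows the last "@".
        hosts.append(entry.rsplit("@", 1)[-1])
    return hosts


# Hypothetical value; credentials and the last two IPs are placeholders.
EXAMPLE_URL = (
    "rabbit://openstack:secret@172.20.20.11:5672,"
    "openstack:secret@172.20.20.12:5672,"
    "openstack:secret@172.20.20.13:5672//"
)

if __name__ == "__main__":
    for host in rabbit_hosts(EXAMPLE_URL):
        print(host)
```

If fewer than three entries come back for a three-controller cluster, the service has no alternative broker to fail over to.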
*** yadnesh is now known as yadnesh|afk 15:11
*** yadnesh|afk is now known as yadnesh|away 15:11
zbig looking at the nova and cinder-volume logs, they are full of the same AMQP errors. what is not clear to me is whether these services (nova and cinder) only log the errors and silently connect to the other 2 rabbitmq servers, or whether they fail on the offline one and don't try the other servers 15:15
zbig there is some activity in the other rabbitmq logs but i would expect way more... i suspect all the issues are indeed related to the offline controller host being 1st in the connection URLs of the services 15:21
zbig i will later run an experiment, turning off the controller node that is listed last in that connection string, to see if the cluster behaves differently 15:22
zbig more generally, is there anything to keep in mind with openstack HA? i just followed the kolla-ansible quick start and made sure to have 3 controller nodes and a VIP; i assumed it would work out of the box 15:23
lowercase Just to clarify, rabbitmq and galera mysql should both not have a vip. Every other service will/could/should have a vip. I would catalog the errors and count them per ip. Then check whether rabbit2 and rabbit3 are unreachable, have improper authentication, or are becoming partitioned during a failover. Each of these has a very different solution that will require more troubleshooting time. 16:46
lowercase actually, i don't recall if my mysql database is behind a vip. I'm not on the vpn to check atm. 16:54
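The suggestion above to catalog the errors per IP could be sketched like this. It is a rough illustration, not an established tool: the function name `count_unreachable` is made up, and the regular expression is modeled only on the single oslo.messaging error line quoted earlier in the log, so other error variants would need their own patterns.

```python
import re
from collections import Counter

# Modeled on the quoted log line:
#   "AMQP server on 172.20.20.11:5672 is unreachable: ..."
UNREACHABLE = re.compile(r"AMQP server on (\S+?):(\d+) is unreachable")


def count_unreachable(lines):
    """Count 'AMQP server ... is unreachable' errors per server address."""
    counts = Counter()
    for line in lines:
        match = UNREACHABLE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the error quoted in the log (shortened) plus an unrelated line.
SAMPLE_LINES = [
    "2022-10-12 14:23:12.898 20 ERROR oslo.messaging._drivers.impl_rabbit [-] "
    "AMQP server on 172.20.20.11:5672 is unreachable: "
    "<RecoverableConnectionError: unknown error>. Trying again in 1 seconds.",
    "2022-10-12 14:23:13.000 20 INFO some unrelated message",
]

if __name__ == "__main__":
    for ip, n in count_unreachable(SAMPLE_LINES).most_common():
        print(ip, n)
```

To run it against a real service log, pass an open file object to `count_unreachable`. If only the powered-off controller's address shows up, the surviving brokers are at least reachable; errors against the other addresses would point toward the authentication or partition cases instead.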
*** timburke_ is now known as timburke 20:59
*** rlandy is now known as rlandy|bbl 22:04
*** rlandy|bbl is now known as rlandy 23:50
