*** rlandy is now known as rlandy|out | 01:25 | |
*** mhen_ is now known as mhen | 01:54 | |
*** yadnesh|away is now known as yadnesh | 04:33 | |
*** luigi is now known as luigi-out | 05:42 | |
*** yadnesh is now known as yadnesh|afk | 08:03 | |
*** yadnesh|afk is now known as yadnesh | 09:05 | |
*** dasTor_ is now known as dasTor | 09:34 | |
*** rlandy|out is now known as rlandy | 10:30 | |
zbig | Hey folks, I originally posted this message in the #openstack-kolla channel but it fits better here. I am looking for a bit of help with an HA setup. | 12:21 |
zbig | My cluster consists of 8 machines, with 3 in the controller role (inventory file: https://paste.opendev.org/show/blFqF5uQtjDvfj6Q1E6f/ ). I am deploying OpenStack Yoga with kolla-ansible using the following settings (globals.yml: https://paste.opendev.org/show/bPhVP7i2es2NEsLxRwMu/ ). | 12:21 |
zbig | I was testing the HA capabilities by doing a hard power-down of one of the controller machines and observing how OpenStack behaves. | 12:21 |
zbig | I noticed the following: Horizon was unusable, failing with a "something went wrong" page. After debugging, I figured out that changing this value here https://github.com/openstack/kolla-ansible/blob/stable/yoga/ansible/roles/horizon/templates/local_settings.j2#L13 to False made Horizon load properly. | 12:21 |
zbig | But even then, creating a new instance was failing and Horizon was very slow. | 12:21 |
zbig | At the same time I noticed that the openstack CLI client was working fine for read operations (I didn't try write operations). Grafana and Kibana were both fine. | 12:21 |
zbig | So my question is: is there anything specific to an HA setup I should consider? I expected that taking out one controller node might result in some issues with network or compute, but that Horizon and scheduling new workloads would keep working. | 12:21 |
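(For context, a kolla-ansible multinode inventory with 3 controllers typically looks something like the sketch below; the hostnames are placeholders and zbig's actual inventory is in the paste above.)

```ini
# Sketch of a kolla-ansible multinode inventory with 3 controllers
# and 5 computes (8 machines total); hostnames are placeholders
[control]
controller01
controller02
controller03

[network:children]
control

[compute]
compute01
compute02
compute03
compute04
compute05
```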
lowercase | look at your rabbitmq logs during the ha test and report back | 13:25 |
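(In a kolla-ansible deployment the RabbitMQ logs and cluster state can be checked on each controller roughly as below; `rabbitmq` is kolla's default container name, and the exact log file name includes the node's hostname.)

```shell
# kolla-ansible writes container logs under /var/log/kolla on each host
tail -f /var/log/kolla/rabbitmq/rabbit@$(hostname -s).log

# cluster membership and any partitions, from inside the rabbitmq container
docker exec rabbitmq rabbitmqctl cluster_status
```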
*** blarnath is now known as d34dh0r53 | 13:50 | |
zbig | lowercase: didn't see anything unusual in rabbitmq, but I peeked into the nova-scheduler logs and it seems like it still tries to connect to the AMQP server on the node that went offline. It is using the machine's IP, which is a bit strange; I would have thought it would use the internal VIP | 14:25 |
zbig | the exact error is "2022-10-12 14:23:12.898 20 ERROR oslo.messaging._drivers.impl_rabbit [-] [d2ec1d06-bcc3-4bf8-8952-7e8c4b41cdec] AMQP server on 172.20.20.11:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>", where that IP, 172.20.20.11, belonged to the offline node | 14:26 |
lowercase | rabbitmq shouldn't be behind a vip. | 14:49 |
lowercase | nova, for example, should have all three rabbitmq servers listed individually. my config lists them as username:password@rabbit1, username:password@rabbit2, username:password@rabbit3 | 14:51 |
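(For reference, the oslo.messaging transport_url that kolla-ansible generates lists every broker in one comma-separated URL, along the lines of the sketch below; the user, password, and IPs are placeholders.)

```ini
# nova.conf, [DEFAULT] section; credentials and IPs are placeholders
transport_url = rabbit://openstack:secret@172.20.20.11:5672,openstack:secret@172.20.20.12:5672,openstack:secret@172.20.20.13:5672//
```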
zbig | hm, I checked nova.conf and transport_url does indeed have all 3 nodes listed, so maybe that error message from nova-scheduler is just a red herring | 14:56 |
*** yadnesh is now known as yadnesh|afk | 15:11 | |
*** yadnesh|afk is now known as yadnesh|away | 15:11 | |
zbig | looking at the logs of nova and cinder-volume, they are full of the same AMQP errors. What is not clear to me is whether these services (nova and cinder) are merely logging the errors and silently connecting to the other 2 rabbitmq servers, or whether they error on connecting to the offline one and don't try to connect to the others | 15:15 |
zbig | there is some activity in the other rabbitmq logs but I would expect way more... I suspect all the issues are indeed related to the offline controller host being listed 1st in the connection URLs the services use | 15:21 |
zbig | I will later run an experiment turning off the controller node that is listed last in that connection string, to see if the cluster behaves differently | 15:22 |
zbig | more generally, is there anything to keep in mind with OpenStack HA? I just followed the kolla-ansible quick start and made sure to have 3 controller nodes and a VIP; I assumed it would work out of the box | 15:23 |
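(The HA-relevant pieces of a kolla-ansible globals.yml are roughly the settings below; the VIP address is a placeholder and zbig's actual settings are in the paste above.)

```yaml
# HA-relevant globals.yml settings (values are placeholders)
kolla_internal_vip_address: "172.20.20.250"  # VIP held by keepalived, fronting haproxy
enable_haproxy: "yes"
enable_keepalived: "yes"
```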
lowercase | Just to clarify: rabbitmq and galera mysql should both not be behind a vip. Every other service will/could/should have a vip. I would catalog the errors and review the quantity of errors for each ip. rabbit2 and rabbit3 could be unreachable, have improper authentication, or be getting partitioned during a failover; each of these has a very different solution and will require more troubleshooting time. | 16:46 |
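(A quick sketch of that cataloging, assuming kolla's log layout under /var/log/kolla:)

```shell
# Tally AMQP connection errors per broker IP across nova and cinder logs
grep -hoE 'AMQP server on [0-9.]+:5672' /var/log/kolla/{nova,cinder}/*.log \
  | sort | uniq -c | sort -rn
```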
lowercase | actually, I don't recall if my mysql database is behind a vip. I'm not on the VPN to check atm. | 16:54 |
*** timburke_ is now known as timburke | 20:59 | |
*** rlandy is now known as rlandy|bbl | 22:04 | |
*** rlandy|bbl is now known as rlandy | 23:50 |