*** goldyfruit has quit IRC | 04:49 | |
*** goldyfruit has joined #openstack-masakari | 13:18 | |
*** goldyfruit has quit IRC | 14:38 | |
*** goldyfruit has joined #openstack-masakari | 14:38 | |
*** vishalmanchanda has joined #openstack-masakari | 14:51 | |
*** gmann is now known as gmann_afk | 17:20 | |
*** gmann_afk is now known as gmann | 18:49 | |
lile | Hi goldyfruit | 19:34 |
---|---|---|
lile | I've tried out the patch from Change 675734 which allows me to configure a segment by either fqdn or hostname | 19:35 |
lile | However it doesn't appear to help matters for me, masakari still doesn't attempt to migrate VMs from a failed hypervisor | 19:35 |
lile | It looks like everything regarding hostnames should match, although I got a bit cute with the segments | 19:36 |
lile | This cluster was deployed with kolla-ansible (train) | 19:36 |
lile | The domain name has been redacted with "example.com" | 19:38 |
lile | https://www.irccloud.com/pastebin/ZOZOsiov/ | 19:38 |
*** vishalmanchanda has quit IRC | 19:43 | |
goldyfruit | What about the masakari logs ? | 20:08 |
goldyfruit | on controller nodes and computes | 20:08 |
goldyfruit | lile, | 20:08 |
goldyfruit | I see that you are using kolla | 20:09 |
goldyfruit | Apply this change: https://review.opendev.org/#/c/697712/ | 20:09 |
goldyfruit | to your configuration | 20:09 |
lile | Is there a specific masakari log you're interested in? masakari-engine? | 20:09 |
goldyfruit | engine, and the hostmonitor log on the compute | 20:10 |
lile | ls | 20:10 |
goldyfruit | Did you set up pacemaker and corosync? | 20:10 |
goldyfruit | Because kolla-ansible doesn't do that | 20:11 |
lile | ok, that's probably what I've missed | 20:11 |
goldyfruit | The only support right now in Kolla is instancemonitor | 20:11 |
lile | Can you point me at the correct documentation for that? | 20:11 |
goldyfruit | That is a tricky question :p | 20:11 |
lile | lol, yeah, I've looked all over for docs and ended up very confused... | 20:11 |
goldyfruit | But basically, you will need to set up pacemaker and corosync on the controller nodes | 20:12 |
goldyfruit | And only pacemaker-remote on the compute nodes | 20:13 |
lile | The hv node masakari logs are basically empty | 20:13 |
lile | cat /var/log/kolla/masakari/masakari-instancemonitor.log | 20:13 |
lile | 2020-02-11 11:58:38.497 6 INFO masakarimonitors.service [-] Starting masakarimonitors-instancemonitor | 20:13 |
lile | Likewise for engine on the controllers | 20:14 |
lile | [root@odt-rsd-ost-ctr-b1 masakari]# cat masakari-engine.log | 20:14 |
lile | 2020-02-11 11:57:02.448 6 INFO masakari.engine.driver [-] Loading masakari notification driver 'taskflow_driver' | 20:14 |
lile | 2020-02-11 11:57:06.525 6 INFO masakari.service [-] Starting ha_engine (version 8.0.0) | 20:14 |
lile | The wsgi server log just complains when it sees an HV node go down, regarding rabbitmq | 20:15 |
goldyfruit | So at least to test masakari-engine and masakari-instancemonitor, you can kill an instance on the compute and see if the instance is coming back | 20:15 |
lile | 2020-02-11 13:06:12.596 20 INFO masakari.compute.nova [req-26c23ee2-b6b9-4cfe-8af4-16f5a4567c22 nova - - - -] Call hypervisor search command to get list of matching hypervisor name 'hv-10-011.example.com' | 20:15 |
lile | 2020-02-11 13:06:13.588 20 ERROR oslo.messaging._drivers.impl_rabbit [req-26c23ee2-b6b9-4cfe-8af4-16f5a4567c22 nova - - - -] [1c1fecd1-495e-41a9-acc4-3c955f14fba4] AMQP server on 10.232.194.237:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: IOError: Server unexpectedly closed connection | 20:15 |
lile | 2020-02-11 13:06:14.616 20 INFO oslo.messaging._drivers.impl_rabbit [req-26c23ee2-b6b9-4cfe-8af4-16f5a4567c22 nova - - - -] [1c1fecd1-495e-41a9-acc4-3c955f14fba4] Reconnected to AMQP server on 10.232.194.237:5672 via [amqp] client with port 41096. | 20:15 |
lile | 2020-02-11 13:06:48.848 18 WARNING oslo.messaging._drivers.impl_rabbit [req-0bf6856f-8bbd-4a4a-8354-8d4c9fd70812 2bc8474c1633412fa53fab73b98714cc 21500607bf2747068ee65d3fe1b51874 - - -] Unexpected error during heartbeat thread processing, retrying...: IOError: Server unexpectedly closed connection | 20:15 |
lile | 2020-02-11 14:04:29.846 20 WARNING oslo.messaging._drivers.impl_rabbit [req-26c23ee2-b6b9-4cfe-8af4-16f5a4567c22 nova - - - -] Unexpected error during heartbeat thread processing, retrying...: IOError: Server unexpectedly closed connection | 20:15 |
lile | 2020-02-11 14:04:30.941 20 ERROR oslo.messaging._drivers.impl_rabbit [req-0311e219-1083-4ad4-961d-3d629449f4c5 2bc8474c1633412fa53fab73b98714cc 21500607bf2747068ee65d3fe1b51874 - - -] [1c1fecd1-495e-41a9-acc4-3c955f14fba4] AMQP server on 10.232.194.237:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: IOError: Server unexpectedly closed connection | 20:15 |
lile | 2020-02-11 14:04:31.974 20 INFO oslo.messaging._drivers.impl_rabbit [req-0311e219-1083-4ad4-961d-3d629449f4c5 2bc8474c1633412fa53fab73b98714cc 21500607bf2747068ee65d3fe1b51874 - - -] [1c1fecd1-495e-41a9-acc4-3c955f14fba4] Reconnected to AMQP server on 10.232.194.237:5672 via [amqp] client with port 35934. | 20:15 |
lile | 2020-02-11 14:05:56.914 18 INFO masakari.compute.nova [req-9e44e63e-f1de-4faa-86fa-3357f1a58ec0 nova - - - -] Call hypervisor search command to get list of matching hypervisor name 'hv-10-012' | 20:16 |
lile | 2020-02-11 14:05:57.835 18 ERROR oslo.messaging._drivers.impl_rabbit [req-9e44e63e-f1de-4faa-86fa-3357f1a58ec0 nova - - - -] [60c60034-bd54-4af0-81ce-7e53d152a714] AMQP server on 10.232.194.236:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: IOError: Server unexpectedly closed connection | 20:16 |
lile | 2020-02-11 14:05:58.861 18 INFO oslo.messaging._drivers.impl_rabbit [req-9e44e63e-f1de-4faa-86fa-3357f1a58ec0 nova - - - -] [60c60034-bd54-4af0-81ce-7e53d152a714] Reconnected to AMQP server on 10.232.194.236:5672 via [amqp] client with port 59904. | 20:16 |
lile | 2020-02-11 14:42:37.087 18 WARNING oslo.messaging._drivers.impl_rabbit [req-9e44e63e-f1de-4faa-86fa-3357f1a58ec0 nova - - - -] Unexpected error during heartbeat thread processing, retrying...: IOError: Server unexpectedly closed connection | 20:16 |
goldyfruit | lile, don't paste the log here please | 20:16 |
lile | apologies | 20:16 |
lile | pastebin? | 20:16 |
goldyfruit | yes | 20:17 |
lile | For 697712, I could just push the "[api] api_interface = internal" into masakari-monitors.conf with a kolla-ansible config merge, correct? (as a quick test) | 20:18 |
goldyfruit | yep | 20:19 |
lile | For instance HA, set --property HA_Enabled=True on the instance, correct? | 20:23 |
goldyfruit | yes | 20:23 |
lile | So masakari noticed the instance going down but didn't restart the instance | 20:55 |
lile | https://www.irccloud.com/pastebin/X3IqLeG9/ | 20:56 |
goldyfruit | did you kill the instance ? | 20:57 |
lile | Yes, from the hv using an os level kill | 20:57 |
goldyfruit | kill -9 pid ? | 20:58 |
goldyfruit | Try to enable debug on masakari api, engine and instancemonitor | 21:04 |
lile | just a kill, but I can hit it harder next time ;-) | 21:06 |
lile | I did find an interesting message in engine on the controller | 21:09 |
lile | https://www.irccloud.com/pastebin/rFdFahA8/ | 21:09 |
goldyfruit | try the -9 | 21:11 |
goldyfruit | And turn on debug | 21:11 |
lile | Saw that while I was enabling debug | 21:11 |
lile | ok, with kill -9 it did restart | 21:15 |
goldyfruit | great | 21:15 |
lile | would that recover instances if the compute node failed though? | 21:16 |
goldyfruit | nop | 21:17 |
goldyfruit | You need to set up the hostmonitor | 21:17 |
lile | ok, any pointers on how to get pacemaker/corosync or hostmonitor running alongside a kolla-ansible deployment? | 21:17 |
goldyfruit | I didn't have the time to continue the implementation | 21:18 |
goldyfruit | But the images have been merged into kolla | 21:18 |
goldyfruit | they are under hacluster | 21:19 |
lile | ok, I'll dig into that. | 21:19 |
goldyfruit | About the Ansible role | 21:19 |
goldyfruit | https://review.opendev.org/#/c/670104/ | 21:19 |
goldyfruit | By reading the role you should be able to see how to install pacemaker/corosync | 21:20 |
lile | excellent | 21:20 |
lile | Is there anything specific missing from the role or did it just not get merged back to Train? | 21:23 |
goldyfruit | The role was working for train/master | 21:24 |
goldyfruit | Just needed to clean some stuff but never had the time since I left the company | 21:24 |
lile | Ok, I can probably just figure out how to invoke it in my existing train setup | 21:24 |
lile | we're building our own kolla images, when necessary, and merging them in at kolla-ansible through docker tags | 21:25 |
lile | Thank you so much for your help today! I'll dig into https://review.opendev.org/#/c/670104/ and see what I come up with. | 21:26 |
goldyfruit | Good luck :) | 21:26 |
lile | If the clean ups aren't beyond my capability, I may be able to submit them back to you or possibly back to the change | 21:27 |
goldyfruit | Cool | 21:28 |
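
The wsgi log lines pasted at 20:15 and 20:16 show masakari searching Nova for one hypervisor by FQDN and for another by short hostname, which is the same naming question Change 675734 addresses. As a rough sketch only, the Python below pulls those searched names out of log text so they can be compared against the segment configuration. The function names and the sample variable are hypothetical helpers for illustration, not part of masakari, and the regular expression assumes the message wording stays exactly as quoted in this log.

```python
import re

# Matches the masakari.compute.nova INFO message quoted in the chat above.
# Assumption: the wording of this message is stable across the log file.
SEARCH_RE = re.compile(
    r"Call hypervisor search command to get list of matching "
    r"hypervisor name '([^']+)'"
)


def searched_hypervisors(log_text):
    """Return the hypervisor names masakari searched for, in log order."""
    return SEARCH_RE.findall(log_text)


def split_by_form(names):
    """Split names into FQDNs (contain a dot) and short hostnames."""
    fqdns = [name for name in names if "." in name]
    short = [name for name in names if "." not in name]
    return fqdns, short


# Two lines copied from the log pasted in the channel.
sample = """\
2020-02-11 13:06:12.596 20 INFO masakari.compute.nova [req-26c23ee2-b6b9-4cfe-8af4-16f5a4567c22 nova - - - -] Call hypervisor search command to get list of matching hypervisor name 'hv-10-011.example.com'
2020-02-11 14:05:56.914 18 INFO masakari.compute.nova [req-9e44e63e-f1de-4faa-86fa-3357f1a58ec0 nova - - - -] Call hypervisor search command to get list of matching hypervisor name 'hv-10-012'
"""

if __name__ == "__main__":
    fqdns, short = split_by_form(searched_hypervisors(sample))
    print("searched by FQDN:", fqdns)
    print("searched by short hostname:", short)
```

Running the same helper over the full masakari wsgi log would list every name form the API was asked to look up, which may help when checking that segment hosts and Nova hypervisor names agree.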