__ministry | Hello, I'm using senlin in production and ran into a problem where the senlin health manager makes many calls to keystone and health-checks a cluster more often than expected. I raised this problem before. After monitoring for a long time, I found that it starts when registering the health manager for a cluster times out, with a log line like: "Registering health manager for cluster 9af41150-f44c-4d0f-a905-0492bfcfbc9c timed out." | 02:36 |
---|---|---|
__ministry | Has anyone else run into this problem? | 02:36 |
eandersson | __ministry this bug, right? https://bugs.launchpad.net/senlin/+bug/1975440 | 02:49 |
eandersson | Are you running the version that has that fix? 14.0 | 02:50 |
eandersson | Zed | 02:51 |
__ministry | Yup, I'm running code that includes this commit, but the problem still happens. | 02:56 |
eandersson | Can you add the new information to the bug too so that we don’t lose track of the info | 02:56 |
eandersson | Need to figure out a way to reproduce this | 02:56 |
__ministry | I can't reproduce this bug, but I've collected some information: | 03:01 |
__ministry | 1. The bug starts when there is a log line "Registering health manager for cluster timed out." | 03:01 |
__ministry | 2. The health manager keeps health-checking these clusters even though they are not found. Example: "Cluster (0dd4f97a-effb-4223-9863-7d7932960ad8) is not found." | 03:01 |
__ministry | 3. When I restart the health manager, the problem goes away. | 03:01 |
eandersson | Does that cluster id exist? | 03:26 |
eandersson | I wonder if we should just handle this different here. | 03:33 |
eandersson | https://github.com/openstack/senlin/blob/master/senlin/engine/health_manager.py#L390 | 03:33 |
eandersson | Maybe if this fails just stop trying, because I don't know if there is any point in continuing if the cluster does not exist. What do you think dtruong? | 03:35 |
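The idea eandersson proposes above could be sketched roughly like this. Note this is a minimal, hypothetical illustration: `ClusterNotFound`, `HealthRegistry`, and `run_health_check` are names invented here and are not senlin's actual internals; the real implementation lives in `senlin/engine/health_manager.py`.

```python
# Hypothetical sketch: on each health-check tick, look the cluster up first
# and unregister the check when the cluster no longer exists, instead of
# retrying forever against a deleted cluster.

class ClusterNotFound(Exception):
    pass


class HealthRegistry:
    """Minimal in-memory stand-in for a health-check registry."""

    def __init__(self, clusters):
        self._clusters = set(clusters)   # cluster ids that still exist
        self.registered = set()          # cluster ids with active checks

    def register(self, cluster_id):
        self.registered.add(cluster_id)

    def unregister(self, cluster_id):
        self.registered.discard(cluster_id)

    def get_cluster(self, cluster_id):
        if cluster_id not in self._clusters:
            raise ClusterNotFound(cluster_id)
        return cluster_id


def run_health_check(registry, cluster_id):
    """One timer tick; returns False once the check has removed itself."""
    try:
        registry.get_cluster(cluster_id)
    except ClusterNotFound:
        # Cluster was deleted: drop the health check so it stops polling
        # (and stops hammering keystone) for a cluster that is gone.
        registry.unregister(cluster_id)
        return False
    # ... perform the actual node health checks here ...
    return True
```

The key design choice is that the check tears itself down on "not found" rather than logging and rescheduling, which matches the "just stop trying" suggestion.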
eandersson | __ministry: Do you see a uuid in the logs with this log line? | 03:39 |
eandersson | > Unregistering health manager for cluster <uuid> timed out. | 03:39 |
eandersson | It really feels like this is a RabbitMQ issue. | 03:41 |
eandersson | A few (or many) lost messages. | 03:41 |
eandersson | But it could also be that the health manager isn't responsive, maybe stuck processing some call | 03:43 |
__ministry | Those clusters were deleted and are not found in the database, but the health manager still health-checks them. | 04:02 |
__ministry | I've also seen logs in the format above, e.g. "Registering health manager for cluster 9af41150-f44c-4d0f-a905-0492bfcfbc9c timed out." | 04:03 |
__ministry | But I've never seen a log line in the format "Unregistering health manager for cluster <uuid> timed out." | 04:05 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Check for orphaned health checks https://review.opendev.org/c/openstack/senlin/+/876275 | 05:40 |
eandersson | No clue if the above works, but might have to do something like this to ensure that we don't have unwanted health checks hanging around. | 05:41 |
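The cleanup approach in the patch above could look roughly like the following sketch. These function names are hypothetical, not the code in the review; the point is just the technique: periodically diff the registered health checks against the clusters that still exist in the database, and drop the orphans left behind by lost register/unregister messages.

```python
# Hypothetical sketch of a periodic "orphaned health check" cleanup pass.

def find_orphaned_checks(registered, existing):
    """Health-check cluster ids whose cluster no longer exists."""
    return set(registered) - set(existing)


def cleanup_orphans(registry, db_cluster_ids):
    """One cleanup pass: unregister checks for clusters deleted from the DB.

    `registry` is a mutable set of cluster ids with active health checks;
    returns the ids that were removed.
    """
    orphans = find_orphaned_checks(registry, db_cluster_ids)
    for cluster_id in orphans:
        registry.discard(cluster_id)   # stop the timer for this cluster
    return orphans
```

Running a pass like this on an interval would act as a safety net even if the root cause (lost RabbitMQ messages, as speculated above) is never fully pinned down.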
__ministry | Thank you very much. Let me try your code. | 07:27 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Add cleanup step for orphaned health checks https://review.opendev.org/c/openstack/senlin/+/876275 | 07:30 |
eandersson | Sounds good thanks __ministry. I updated the code a bit to not be as aggressive, but the previous version should do what we want. | 07:31 |
eandersson | I still don't know how we are getting into this state though. | 07:32 |
__ministry | Let me trace this and collect more information. | 07:34 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Add cleanup step for orphaned health checks https://review.opendev.org/c/openstack/senlin/+/876275 | 19:24 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!