__ministry | Hello, I'm using senlin in production and ran into a problem where the senlin health manager makes many calls to keystone and health-checks a cluster more often than expected. I raised this problem before. After monitoring for a long time, I found that it starts when registering the health manager for a cluster times out, with a log line like: "Registering health manager for cluster 9af41150-f44c-4d0f-a905-0492bfcfbc9c timed out." | 02:36 |
---|---|---|
__ministry | Has anyone else run into this problem? | 02:36 |
eandersson | __ministry this bug, right? https://bugs.launchpad.net/senlin/+bug/1975440 | 02:49 |
eandersson | Are you running the version that has that fix? 14.0 | 02:50 |
eandersson | Zed | 02:51 |
__ministry | Yup, I'm running code that includes this commit, but the problem still happens. | 02:56 |
eandersson | Can you add the new information to the bug too so that we don’t lose track of the info | 02:56 |
eandersson | Need to figure out a way to reproduce this | 02:56 |
__ministry | I can't reproduce this bug, but I've collected some information: | 03:01 |
__ministry | 1. The bug starts when there is a log line "Registering health manager for cluster timed out." | 03:01 |
__ministry | 2. The health manager keeps health-checking these clusters even though they are not found. Example: "Cluster (0dd4f97a-effb-4223-9863-7d7932960ad8) is not found." | 03:01 |
__ministry | 3. When I restart the health manager, the problem goes away. | 03:01 |
eandersson | Does that cluster id exist? | 03:26 |
eandersson | I wonder if we should just handle this different here. | 03:33 |
eandersson | https://github.com/openstack/senlin/blob/master/senlin/engine/health_manager.py#L390 | 03:33 |
eandersson | Maybe if this fails just stop trying, because I don't know if there is any point in continuing if the cluster does not exist. What do you think dtruong? | 03:35 |
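The idea eandersson proposes above could be sketched roughly like this. Note this is a minimal, hypothetical illustration: `ClusterNotFound`, `HealthRegistry`, and `run_health_check` are names invented here and are not senlin's actual internals; the real implementation lives in `senlin/engine/health_manager.py`.

```python
# Hypothetical sketch: on each health-check tick, look the cluster up first
# and unregister the check when the cluster no longer exists, instead of
# retrying forever against a deleted cluster.

class ClusterNotFound(Exception):
    pass


class HealthRegistry:
    """Minimal in-memory stand-in for a health-check registry."""

    def __init__(self, clusters):
        self._clusters = set(clusters)   # cluster ids that still exist
        self.registered = set()          # cluster ids with active checks

    def register(self, cluster_id):
        self.registered.add(cluster_id)

    def unregister(self, cluster_id):
        self.registered.discard(cluster_id)

    def get_cluster(self, cluster_id):
        if cluster_id not in self._clusters:
            raise ClusterNotFound(cluster_id)
        return cluster_id


def run_health_check(registry, cluster_id):
    """One timer tick; returns False once the check has removed itself."""
    try:
        registry.get_cluster(cluster_id)
    except ClusterNotFound:
        # Cluster was deleted: drop the health check so it stops polling
        # (and stops hammering keystone) for a cluster that is gone.
        registry.unregister(cluster_id)
        return False
    # ... perform the actual node health checks here ...
    return True
```

The key design choice is that the check tears itself down on "not found" rather than logging and rescheduling, which matches the "just stop trying" suggestion.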
eandersson | __ministry: Do you see a uuid in the logs with this log line? | 03:39 |
eandersson | > Unregistering health manager for cluster <uuid> timed out. | 03:39 |
eandersson | It really feels like this is a RabbitMQ issue. | 03:41 |
eandersson | A few (or many) lost messages. | 03:41 |
eandersson | But it could also be that the health manager isn't responsive, maybe stuck processing some call | 03:43 |
__ministry | Those clusters were deleted and are not found in the database, but the health manager still health-checks them. | 04:02 |
__ministry | I've also seen logs in the format above, e.g. "Registering health manager for cluster 9af41150-f44c-4d0f-a905-0492bfcfbc9c timed out." | 04:03 |
__ministry | But I've never seen a log line in the format "Unregistering health manager for cluster <uuid> timed out." | 04:05 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Check for orphaned health checks https://review.opendev.org/c/openstack/senlin/+/876275 | 05:40 |
eandersson | No clue if the above works, but might have to do something like this to ensure that we don't have unwanted health checks hanging around. | 05:41 |
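The cleanup approach in the patch above could look roughly like the following sketch. These function names are hypothetical, not the code in the review; the point is just the technique: periodically diff the registered health checks against the clusters that still exist in the database, and drop the orphans left behind by lost register/unregister messages.

```python
# Hypothetical sketch of a periodic "orphaned health check" cleanup pass.

def find_orphaned_checks(registered, existing):
    """Health-check cluster ids whose cluster no longer exists."""
    return set(registered) - set(existing)


def cleanup_orphans(registry, db_cluster_ids):
    """One cleanup pass: unregister checks for clusters deleted from the DB.

    `registry` is a mutable set of cluster ids with active health checks;
    returns the ids that were removed.
    """
    orphans = find_orphaned_checks(registry, db_cluster_ids)
    for cluster_id in orphans:
        registry.discard(cluster_id)   # stop the timer for this cluster
    return orphans
```

Running a pass like this on an interval would act as a safety net even if the root cause (lost RabbitMQ messages, as speculated above) is never fully pinned down.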
__ministry | Thank you very much. Let me try your code. | 07:27 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Add cleanup step for orphaned health checks https://review.opendev.org/c/openstack/senlin/+/876275 | 07:30 |
eandersson | Sounds good thanks __ministry. I updated the code a bit to not be as aggressive, but the previous version should do what we want. | 07:31 |
eandersson | I still don't know how we are getting into this state though. | 07:32 |
__ministry | Let me trace this and collect more information. | 07:34 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Add cleanup step for orphaned health checks https://review.opendev.org/c/openstack/senlin/+/876275 | 19:24 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!