Sunday, 2020-12-20

Alex_Gaynor	I'm seeing arm64 builds not starting again :-/	01:50
Alex_Gaynor	And two just started, maybe it's just very slow.	01:52
*** zbr\|rover1 has joined #opendev		06:47
*** zbr\|rover has quit IRC		06:49
*** zbr\|rover1 is now known as zbr\|rover		06:49
*** sboyron has joined #opendev		08:26
*** sboyron has quit IRC		08:26
*** whoami-rajat__ has joined #opendev		08:41
*** zbr\|rover is now known as zbr		09:43
*** hamalq has joined #opendev		10:39
*** whoami-rajat__ has quit IRC		11:21
*** tosky has joined #opendev		11:48
*** DSpider has joined #opendev		11:49
*** ysandeep\|away is now known as ysandeep		12:03
*** hamalq has quit IRC		12:39
*** hamalq has joined #opendev		13:15
*** hamalq has quit IRC		13:20
fungi	roughly how long of a delay were you seeing? (was it time from trigger to build starting? or elapsed time from trigger to build completing?)	14:14
Alex_Gaynor	fungi: from triggering it to the build starting; it was >30 minutes	14:17
fungi	ouch, yeah that doesn't sound good. i'll trace one in the logs and see if i can come up with a timeline	14:22
fungi	as a point of corroboration, the ttr graph at https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1&from=now-24h&to=now does seem to show nodes often taking up to 30 minutes to boot and become reachable in that cloud	14:42
Alex_Gaynor	Does the "available" state refer to nodes that are booted and can be assigned jobs without delay?	14:44
fungi	yes, but that environment is configured to only keep one node pre-booted for each configured image type so as to not eat into the quota there too much	14:44
fungi	i also did an initial pass over the logs from around that timeframe to sample the time ranges for each node request from when the launcher first took the lock on it until the node reported ready: http://paste.openstack.org/show/801198	14:45
fungi	the rows with only one timestamp i think represent nodes which did not become reachable or perhaps failed to boot entirely	14:45
fungi	going to look at a few in closer detail and figure out where the delay is being incurred	14:46
fungi	i think 3 minutes may be the launcher's wait timeout too... if so that explains why we don't see any nodes taking longer than 30 minutes to reach a ready state but we do see some which have failed to become ready	14:47
fungi	er, 30 minutes	14:47
fungi	aha, so looking at a sample failure, it's a 10-minute timeout with three retries	14:49
fungi	boot request is initiated to the provider, two minutes later the provider indicates the node is in a ready state, but by 10 minutes it's still not reachable via ssh and so the launcher drops it and starts another attempt. after three times of the same it gives up trying to boot that and unlocks the request (so that another launcher could try it, except this is the only provider we have for the relevant node	14:51
fungi	types so it just winds up locking it again and starting over)	14:52
*** slaweq has joined #opendev		14:53
fungi	spot checking a few more of the failures, that seems to be a consistent pattern	14:54
fungi	and yeah, looking at some of the nodes which took around 30 minutes the cause seems to have been a combination of taking up to 5 minutes for the cloud to indicate the node was active and then timing out 10 minutes waiting to be able to reach it over ssh and retrying	14:58
fungi	so i'm going to say it looks like the reason node allocations are taking to long to be satisfied there is that many boot calls are creating servers which take longer than 10 minutes to become reachable over the network (or perhaps are never going to become reachable)	14:59
fungi	kevinz: ^ if you're around, do you have any insight into why that might be happening?	15:00
fungi	in the case of frequent wait timeouts, our "error node launch attempts" graph is misleading, because it represents node requests where the launcher tried three times consecutively before giving up	15:03
*** hamalq has joined #opendev		15:09
*** slaweq has quit IRC		15:09
*** hamalq has quit IRC		15:14
*** hamalq has joined #opendev		16:25
*** hamalq has quit IRC		16:29
*** hamalq has joined #opendev		17:45
*** mrunge_ has joined #opendev		18:23
*** mrunge has quit IRC		18:24
*** mrunge has joined #opendev		18:27
*** mrunge_ has quit IRC		18:28
*** mrunge_ has joined #opendev		18:32
*** mrunge has quit IRC		18:33
*** hamalq has quit IRC		18:49
*** slaweq has joined #opendev		19:48
*** slaweq has quit IRC		21:05
*** slaweq has joined #opendev		21:08
*** slaweq has quit IRC		21:41

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!