Alex_Gaynor | I'm seeing arm64 builds not starting again :-/ | 01:50 |
---|---|---|
Alex_Gaynor | And two just started, maybe it's just very slow. | 01:52 |
*** zbr|rover1 has joined #opendev | 06:47 | |
*** zbr|rover has quit IRC | 06:49 | |
*** zbr|rover1 is now known as zbr|rover | 06:49 | |
*** sboyron has joined #opendev | 08:26 | |
*** sboyron has quit IRC | 08:26 | |
*** whoami-rajat__ has joined #opendev | 08:41 | |
*** zbr|rover is now known as zbr | 09:43 | |
*** hamalq has joined #opendev | 10:39 | |
*** whoami-rajat__ has quit IRC | 11:21 | |
*** tosky has joined #opendev | 11:48 | |
*** DSpider has joined #opendev | 11:49 | |
*** ysandeep|away is now known as ysandeep | 12:03 | |
*** hamalq has quit IRC | 12:39 | |
*** hamalq has joined #opendev | 13:15 | |
*** hamalq has quit IRC | 13:20 | |
fungi | roughly how long of a delay were you seeing? (was it time from trigger to build starting? or elapsed time from trigger to build completing?) | 14:14 |
Alex_Gaynor | fungi: from triggering it to the build starting; it was >30 minutes | 14:17 |
fungi | ouch, yeah that doesn't sound good. i'll trace one in the logs and see if i can come up with a timeline | 14:22 |
fungi | as a point of corroboration, the ttr graph at https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1&from=now-24h&to=now does seem to show nodes often taking up to 30 minutes to boot and become reachable in that cloud | 14:42 |
Alex_Gaynor | Does the "available" state refer to nodes that are booted and can be assigned jobs without delay? | 14:44 |
fungi | yes, but that environment is configured to only keep one node pre-booted for each configured image type so as to not eat into the quota there too much | 14:44 |
fungi | i also did an initial pass over the logs from around that timeframe to sample the time ranges for each node request from when the launcher first took the lock on it until the node reported ready: http://paste.openstack.org/show/801198 | 14:45 |
fungi | the rows with only one timestamp i think represent nodes which did not become reachable or perhaps failed to boot entirely | 14:45 |
fungi | going to look at a few in closer detail and figure out where the delay is being incurred | 14:46 |
fungi | i think 3 minutes may be the launcher's wait timeout too... if so that explains why we don't see any nodes taking longer than 30 minutes to reach a ready state but we do see some which have failed to become ready | 14:47 |
fungi | er, 30 minutes | 14:47 |
fungi | aha, so looking at a sample failure, it's a 10-minute timeout with three retries | 14:49 |
fungi | boot request is initiated to the provider, two minutes later the provider indicates the node is in a ready state, but by 10 minutes it's still not reachable via ssh and so the launcher drops it and starts another attempt. after three times of the same it gives up trying to boot that and unlocks the request (so that another launcher could try it, except this is the only provider we have for the relevant node | 14:51 |
fungi | types so it just winds up locking it again and starting over) | 14:52 |
*** slaweq has joined #opendev | 14:53 | |
fungi | spot checking a few more of the failures, that seems to be a consistent pattern | 14:54 |
fungi | and yeah, looking at some of the nodes which took around 30 minutes the cause seems to have been a combination of taking up to 5 minutes for the cloud to indicate the node was active and then timing out 10 minutes waiting to be able to reach it over ssh and retrying | 14:58 |
fungi | so i'm going to say it looks like the reason node allocations are taking to long to be satisfied there is that many boot calls are creating servers which take longer than 10 minutes to become reachable over the network (or perhaps are never going to become reachable) | 14:59 |
fungi | kevinz: ^ if you're around, do you have any insight into why that might be happening? | 15:00 |
fungi | in the case of frequent wait timeouts, our "error node launch attempts" graph is misleading, because it represents node requests where the launcher tried three times consecutively before giving up | 15:03 |
*** hamalq has joined #opendev | 15:09 | |
*** slaweq has quit IRC | 15:09 | |
*** hamalq has quit IRC | 15:14 | |
*** hamalq has joined #opendev | 16:25 | |
*** hamalq has quit IRC | 16:29 | |
*** hamalq has joined #opendev | 17:45 | |
*** mrunge_ has joined #opendev | 18:23 | |
*** mrunge has quit IRC | 18:24 | |
*** mrunge has joined #opendev | 18:27 | |
*** mrunge_ has quit IRC | 18:28 | |
*** mrunge_ has joined #opendev | 18:32 | |
*** mrunge has quit IRC | 18:33 | |
*** hamalq has quit IRC | 18:49 | |
*** slaweq has joined #opendev | 19:48 | |
*** slaweq has quit IRC | 21:05 | |
*** slaweq has joined #opendev | 21:08 | |
*** slaweq has quit IRC | 21:41 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!