opendevreview | James E. Blair proposed zuul/zuul-jobs master: Ignore errors when deleting tags from dockerhub https://review.opendev.org/c/zuul/zuul-jobs/+/799338 | 00:00 |
corvus | fungi, clarkb: apparently that xenial node is failing to boot in bhs1 | 00:25 |
clarkb | oh hrm | 00:25 |
corvus | it's on attempt #2 | 00:25 |
fungi | timeouts? | 00:26 |
clarkb | corvus: do you have the node id so we can see the traceback? | 00:26 |
corvus | yeah | 00:26 |
corvus | 2021-07-03 00:17:16,437 INFO nodepool.NodeLauncher: [e: 56c26ac0e5cf4a35a42ab59289ef7fcf] [node_request: 200-0014656534] [node: 0025380180] Node 6aae1b30-8d51-4c9f-95bb-ee96be8bd1e3 scheduled for cleanup | 00:26 |
corvus | from nl04 | 00:26 |
corvus | that was attempt #1 | 00:26 |
fungi | we unfortunately see frequent timeouts in a number of our providers, i wonder if we need to tune the launchers to wait longer | 00:26 |
clarkb | openstack.exceptions.ResourceTimeout: Timeout waiting for the server to come up. | 00:27 |
corvus | fungi: i'm not sure i want to wait >10 minutes for a node to boot | 00:27 |
corvus | though tbh, i don't know why we wait 10 minutes 3 times | 00:27 |
corvus | i would like it to just fail immediately so we can actually fall back on another provider | 00:28 |
clarkb | we retry 3 times in the code regardless of failure type, and a timeout counts as a failure | 00:28 |
clarkb | but ya in the case of timeouts falling back to another provider may be a good idea if you have >1 able to fulfill the request | 00:28 |
corvus | yeah, i just think we've had a lot of push-pull over the years about failure conditions | 00:28 |
fungi | i agree treating timeout failures differently makes sense | 00:28 |
clarkb | you do want to retry if it is your only provider remaining I think | 00:29 |
corvus | it doesn't make sense to set a timeout then wait 3 times, because it would (as fungi suggested initially) be better to wait 30m once than 10m 3 times :) | 00:29 |
fungi | though it does probably increase the chances that we return node_failure if we don't have a given label in many providers | 00:29 |
clarkb | ok I need to check on dinner plans and related activities. I'll watch the zuul status | 00:29 |
corvus | 2021-07-03 00:27:18,229 ERROR nodepool.NodeLauncher: [e: 56c26ac0e5cf4a35a42ab59289ef7fcf] [node_request: 200-0014656534] [node: 0025380180] Launch attempt 2/3 failed: | 00:30 |
corvus | so only another 8 minutes of waiting until we can move on | 00:30 |
fungi | another possibility is that something has happened with our xenial images and they're going to timeout everywhere | 00:30 |
corvus | yeah, or xenial on bhs | 00:30 |
fungi | i guess we'll find out when it switches to another provider/region | 00:30 |
corvus | oh hey it's running | 00:31 |
corvus | attempt #3 worked | 00:31 |
fungi | third try's a charm? ;) | 00:31 |
fungi | yeesh | 00:31 |
corvus | maybe 3x 10m timeout isn't as terrible as i thought? i dunno | 00:31 |
corvus | i'm happy to be proven wrong if it means the job starts running ;) | 00:32 |
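The trade-off debated above can be sketched as a minimal illustration; this is not Nodepool's actual launcher code, and names like `launch_server`, `BOOT_TIMEOUT`, and `LAUNCH_RETRIES` are hypothetical stand-ins for the provider's boot-timeout and launch-retries settings. It just shows why three 10-minute attempts can hold a request for up to 30 minutes before another provider can pick it up.

```python
# Minimal sketch of the retry behaviour discussed above -- not Nodepool's
# real launcher. All names here are hypothetical.
import time

BOOT_TIMEOUT = 600   # seconds each attempt waits for the server to come up
LAUNCH_RETRIES = 3   # attempts before the request can fail over elsewhere


class BootTimeout(Exception):
    """Raised when a server does not become active within BOOT_TIMEOUT."""


def launch_with_retries(launch_server):
    """Try up to LAUNCH_RETRIES times, waiting BOOT_TIMEOUT per attempt.

    Worst case this blocks for LAUNCH_RETRIES * BOOT_TIMEOUT seconds
    (30 minutes with the values above) -- the "10m three times" versus
    "30m once" question raised in the discussion.
    """
    for attempt in range(1, LAUNCH_RETRIES + 1):
        try:
            return launch_server(timeout=BOOT_TIMEOUT)
        except BootTimeout:
            if attempt == LAUNCH_RETRIES:
                raise  # give up so another provider can take the request
            time.sleep(1)  # brief pause before the next attempt
```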
opendevreview | Merged zuul/zuul-jobs master: Ignore errors when deleting tags from dockerhub https://review.opendev.org/c/zuul/zuul-jobs/+/799338 | 00:35 |
corvus | i'm restarting all of zuul again to pick up today's bugfixes | 01:06 |
corvus | #status log restarted all of zuul on commit 10966948d723ea75ca845f77d22b8623cb44eba4 to pick up stats and zk watch bugfixes | 01:09 |
opendevstatus | corvus: finished logging | 01:09 |
corvus | restoring queues | 01:16 |
clarkb | corvus: a few jobs error'd in the openstack periodic queue | 01:19 |
clarkb | but check seems happy | 01:19 |
corvus | clarkb: i'm guessing that's an artifact of the re-enqueue? | 01:19 |
fungi | are we going to need to clean up any leaked znodes from before? | 01:19 |
clarkb | corvus: ya I'm wondering if periodic jobs don't reenqueue cleanly | 01:20 |
corvus | fungi: no znodes leaked afaik, only watches, which have already disappeared due to closing the connections they were on | 01:20 |
corvus | i think i saw that from the last re-enqueue | 01:20 |
fungi | oh, got it, so watches aren't represented by their own znodes, they're a separate structure? | 01:21 |
clarkb | ya a watch is a mechanism on top of the znodes (that may or may not be there) | 01:21 |
corvus | yep, it's entirely an in-memory construct on a single zk server and associated with a single client connection; as soon as that connection is broken, it's gone | 01:22 |
fungi | right, for some reason i had it in my head that watches were also znodes, pointing at znode paths which may or may not exist | 01:22 |
fungi | makes more sense now | 01:22 |
fungi | so restarting zuul would have also temporarily cleaned up the leaked watches | 01:23 |
clarkb | yes | 01:23 |
fungi | even before the fix | 01:23 |
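For illustration, a small sketch using kazoo (the ZooKeeper client library Zuul uses) of the point corvus makes above: a watch is held in server memory for one client connection rather than stored as a znode, so closing the connection discards it. The host and znode path below are placeholders.

```python
# Sketch only: demonstrates that a watch is registered per-connection,
# not written into the znode tree. Host and path are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk04.example.org:2181')
zk.start()

def on_change(event):
    # Fires at most once when the watched znode changes; watches are
    # one-shot and must be re-registered to keep listening.
    print('watch fired:', event)

# get() with a watch asks the server to remember this watch for our
# connection; nothing new appears in the znode tree.
data, stat = zk.get('/zuul/example', watch=on_change)

# Closing the connection discards any watches the server was holding for
# us -- which is why restarting Zuul cleared the leaked watches even
# before the fix landed.
zk.stop()
zk.close()
```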
clarkb | zk04 went from 720 to 715 watches according to graphite | 01:24 |
clarkb | that is promising | 01:24 |
corvus | also, i don't know how much a watch really "costs"; we may have been able to handle considerable growth before it became a problem | 01:27 |
corvus | i'm just not in the mood to find out that way :) | 01:27 |
clarkb | ya the admin docs warn against large numbers of watches (but don't specify a scale against hardware) when running the wchp command, but the command returned instantly for us in the 10-20k range without issue | 01:28 |
clarkb | I suspect "large numbers" in this case means much larger than that. ++ to not finding out the hard way though | 01:28 |
corvus | none of zk04's cacti graphs show any kind of linear growth during the day, nor does the response time or anything like that, so they're probably relatively cheap | 01:29 |
corvus | clarkb: yeah, seems like the 10k order of magnitude is experimentally okay :) | 01:29 |
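A hedged sketch of checking watch counts the way clarkb describes, using ZooKeeper's four-letter-word admin commands over the client port ('wchs' for a summary, 'wchp' for the per-path listing the docs warn about). This assumes the commands are whitelisted via `4lw.commands.whitelist` on the server; the hostname is a placeholder.

```python
# Send a ZooKeeper four-letter-word command and print the raw response.
import socket

def four_letter_word(host, command, port=2181, timeout=5):
    """Return the server's response to a four-letter-word command."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode('ascii'))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode('utf-8', errors='replace')

# 'wchs' gives a one-line summary of watch counts; 'wchp' lists them by path.
print(four_letter_word('zk04.example.org', 'wchs'))
```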
opendevreview | Gonéri Le Bouder proposed openstack/diskimage-builder master: fedora: defined DIB_FEDORA_SUBRELEASE for f3{3,4} https://review.opendev.org/c/openstack/diskimage-builder/+/799339 | 01:54 |
*** odyssey4me is now known as Guest1321 | 02:24 |
opendevreview | Gonéri Le Bouder proposed openstack/diskimage-builder master: fedora: reuse DIB_FEDORA_SUBRELEASE if set https://review.opendev.org/c/openstack/diskimage-builder/+/799340 | 02:27 |
opendevreview | Gonéri Le Bouder proposed openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34 https://review.opendev.org/c/openstack/diskimage-builder/+/799341 | 02:27 |
*** cloudnull7 is now known as cloudnull | 23:44 |