| @clarkb:matrix.org | corvus: it's weird too because it seems like things were good and then got sad? | 00:20 |
|---|---|---|
| @clarkb:matrix.org | which is part of why I wonder if it could be related to the zuul-providers ready node config. Though that merged early enough that it should've been in place while things were happy too | 00:20 |
| @jim:acmegating.com | yeah, worth considering, but i'd be surprised | 00:21 |
| @clarkb:matrix.org | corvus: the launcher debug log on zl02 is 14GB, which is making it difficult to grep for the random request that went to node failure that I'm looking at | 00:24 |
| @clarkb:matrix.org | interestingly zl01 doesn't have this problem | 00:24 |
| @jim:acmegating.com | yeah, log rotation has happened on zl01 but not zl02 | 00:25 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957294: Fix race with node provisioning https://review.opendev.org/c/zuul/zuul/+/957294 | 00:53 |
| @jim:acmegating.com | Clark: ^ that passed the test_launcher tests locally; running the full suite now | 00:54 |
| @clarkb:matrix.org | I'm not sure I understand the comment there about holding the lock for a few more instructions | 00:56 |
| @clarkb:matrix.org | the with block is only for the one call | 00:56 |
| @jim:acmegating.com | look for releaseLock | 00:57 |
| @jim:acmegating.com | it's near one of the call sites, the other one is up one level in the call stack, but also just a few instructions later | 00:57 |
| @jim:acmegating.com | the with block doesn't hold the lock -- the with block operates as long as the existing lock continues to be held (ie, not lost due to a zk disconnect) | 00:57 |
| @jim:acmegating.com | (i mean, the mechanics of the with block are more complicated than that, but that's a very brief description of what the intent is there) | 00:58 |
| @clarkb:matrix.org | oh I see this was the case before | 00:59 |
| @clarkb:matrix.org | this is just adding a bit more info. The comment isn't speaking to something that changes as part of the update | 00:59 |
| @jim:acmegating.com | yep; just highlighting a potential pitfall in case we end up looking at this code again | 00:59 |
| @jim:acmegating.com | i think it's fine now (ie, i don't think we need to update it to send the event after the lock is released) | 00:59 |
| @clarkb:matrix.org | you're saying that even though we hold the lock the schedulers and executors can still race us | 01:00 |
| @clarkb:matrix.org | so the order matters | 01:00 |
| @jim:acmegating.com | yep. because the schedulers and executors don't care about the lock. | 01:01 |
| @jim:acmegating.com | (if they did, they might wait for us to release it before reading, and we wouldn't have had this problem. but we don't want them to care, so this is still the best fix) | 01:01 |
| @clarkb:matrix.org | +2 from me | 01:01 |
| @jim:acmegating.com | now, i know why it was written that way -- with this change we are adding a critical section where if the launcher crashes between setting the state and the event being sent, the request will be stuck forever. | 01:02 |
| @jim:acmegating.com | we should probably fix that, but that seems less bad than the status quo. | 01:02 |
| @clarkb:matrix.org | I guess that makes docker compose down invocations potentially impactful? | 01:03 |
| @jim:acmegating.com | that should exit gracefully, but if there's a hard crash, then yeah | 01:03 |
| @clarkb:matrix.org | got it | 01:03 |
| @jim:acmegating.com | Clark: left a comment on that change about it | 01:15 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957295: Launcher: improve missing artifact log message https://review.opendev.org/c/zuul/zuul/+/957295 | 02:00 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957296: Launcher: retry loading providers on failure https://review.opendev.org/c/zuul/zuul/+/957296 | 02:16 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957296: Launcher: retry loading providers on failure https://review.opendev.org/c/zuul/zuul/+/957296 | 02:37 |
| -@gerrit:opendev.org- | Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 957171: Fix race in web stats https://review.opendev.org/c/zuul/zuul/+/957171 | 03:12 |
| -@gerrit:opendev.org- | Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 957184: Avoid relative priority updates on completed requests https://review.opendev.org/c/zuul/zuul/+/957184 | 03:20 |
| -@gerrit:opendev.org- | Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956507: AWS: implement image copy https://review.opendev.org/c/zuul/zuul/+/956507 | 05:12 |
| -@gerrit:opendev.org- | Simon Westphahl proposed: [zuul/zuul] 957304: Always send nodes provisioned event before unlock https://review.opendev.org/c/zuul/zuul/+/957304 | 05:58 |
| -@gerrit:opendev.org- | Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 933899: Slow processing of failed nodescan connections https://review.opendev.org/c/zuul/nodepool/+/933899 | 07:15 |
| -@gerrit:opendev.org- | Damian Fajfer proposed wip: [zuul/zuul] 956152: Update swagger-ui to 4.19.1 https://review.opendev.org/c/zuul/zuul/+/956152 | 08:50 |
| -@gerrit:opendev.org- | Damian Fajfer proposed wip: [zuul/zuul] 956167: Update React version to 17.0.2 https://review.opendev.org/c/zuul/zuul/+/956167 | 08:51 |
| -@gerrit:opendev.org- | Damian Fajfer proposed wip: [zuul/zuul] 956167: Update React version to 17.0.2 https://review.opendev.org/c/zuul/zuul/+/956167 | 08:52 |
| -@gerrit:opendev.org- | Damian Fajfer proposed wip: [zuul/zuul] 956152: Update swagger-ui to 4.19.1 https://review.opendev.org/c/zuul/zuul/+/956152 | 08:54 |
| @bennetefx:matrix.org | Hi Zuul community, | 09:22 |
| | When connecting Zuul to Gerrit, the scheduler uses an SSH key whose public key is added to a Gerrit account for event streaming. | |
| | If there are multiple scheduler instances on different machines, is there a recommended way to centrally manage this key so all instances share the same pair, rather than generating separate ones? | |
| @jangutter:matrix.org | bennetefx: openinfra tooling uses playbooks to update their zuul scheduler configuration. https://opendev.org/opendev/system-config/src/commit/c28655848247b9eda07235c53032e3c5bef17d0d/playbooks/service-zuul.yaml | 09:33 |
| -@gerrit:opendev.org- | Jan Gutter proposed: [zuul/zuul-jobs] 957218: Raise connection pool for boto3 in s3 upload role https://review.opendev.org/c/zuul/zuul-jobs/+/957218 | 12:42 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957456: Include requested nodes in quota percentage calculation https://review.opendev.org/c/zuul/zuul/+/957456 | 17:43 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957457: Sort nodes by request time https://review.opendev.org/c/zuul/zuul/+/957457 | 18:13 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957457: Sort nodes by request time https://review.opendev.org/c/zuul/zuul/+/957457 | 19:16 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 22:16 |
| | - [zuul/zuul] 957456: Include requested nodes in quota percentage calculation https://review.opendev.org/c/zuul/zuul/+/957456 | |
| | - [zuul/zuul] 957457: Sort nodes by request time https://review.opendev.org/c/zuul/zuul/+/957457 | |
| | - [zuul/zuul] 957294: Fix race with node provisioning https://review.opendev.org/c/zuul/zuul/+/957294 | |
| | - [zuul/zuul] 957464: Ensure node request event is sent after crash https://review.opendev.org/c/zuul/zuul/+/957464 | |
| | - [zuul/zuul] 957295: Launcher: improve missing artifact log message https://review.opendev.org/c/zuul/zuul/+/957295 | |
| | - [zuul/zuul] 957296: Launcher: retry loading providers on failure https://review.opendev.org/c/zuul/zuul/+/957296 | |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl: [zuul/zuul] 957304: Always send nodes provisioned event before unlock https://review.opendev.org/c/zuul/zuul/+/957304 | 22:16 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 957469: Omnibus change to build image https://review.opendev.org/c/zuul/zuul/+/957469 | 22:26 |
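
Note on the 00:56 to 01:03 exchange about change 957294: the point made there is that readers (schedulers and executors) ignore the launcher's lock, so holding it does not protect them from seeing stale state; only the order of "write state" and "send event" does. The following is a minimal sketch of that idea and is not Zuul code. Every name in it (`read_on_event`, `fulfill`, the dict used as a store, the `"requested"` and `"fulfilled"` strings) is a hypothetical stand-in, and the reader is called inline to model the worst-case interleaving deterministically.

```python
import queue
import threading


def read_on_event(store, events):
    """Scheduler/executor stand-in: reacts to an event, never takes the lock."""
    request_id = events.get_nowait()
    return store.get(request_id, "requested")


def fulfill(store, events, lock, request_id, state_first):
    """Launcher stand-in: update the request and notify while holding the lock.

    The reader runs right after the event is queued, while the lock is
    still held, which is the interleaving the chat describes.
    """
    with lock:
        if state_first:
            store[request_id] = "fulfilled"  # write the state...
            events.put(request_id)           # ...then announce it
            seen = read_on_event(store, events)
        else:
            events.put(request_id)           # announce first (racy)
            seen = read_on_event(store, events)
            store[request_id] = "fulfilled"  # state lands too late
    return seen


def demo():
    lock = threading.Lock()
    racy = fulfill({}, queue.Queue(), lock, "req-1", state_first=False)
    safe = fulfill({}, queue.Queue(), lock, "req-1", state_first=True)
    return racy, safe


if __name__ == "__main__":
    print(demo())
```

In the state-first branch, a crash between the store write and `events.put` would leave the request updated with no event ever sent, which corresponds to the stuck-request window raised at 01:02.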