-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956206: Launcher: ignore failed nodes for request locality https://review.opendev.org/c/zuul/zuul/+/956206 | 00:10 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956206: Launcher: ignore failed nodes for request locality https://review.opendev.org/c/zuul/zuul/+/956206 | 00:43 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956207: Add additional debugging to provider selection https://review.opendev.org/c/zuul/zuul/+/956207 | 00:47 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956196: Fix test race in launcher test_jobs_executed https://review.opendev.org/c/zuul/zuul/+/956196 | 01:49 | |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul-jobs] 956219: Add default for build diskimage image name https://review.opendev.org/c/zuul/zuul-jobs/+/956219 | 08:44 | |
-@gerrit:opendev.org- Szymon Datko proposed: [zuul/zuul] 956230: Implement reject filter in GitLab driver https://review.opendev.org/c/zuul/zuul/+/956230 | 12:16 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956205: Web: handle null provider name on nodes https://review.opendev.org/c/zuul/zuul/+/956205 | 13:32 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956260: Launcher: handle delete state machine errors https://review.opendev.org/c/zuul/zuul/+/956260 | 15:46 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956263: Openstack: don't log tracebacks for quota exceptions https://review.opendev.org/c/zuul/zuul/+/956263 | 15:57 | |
@clarkb:matrix.org | corvus: I know we're focused on zuul-launcher this morning so no rush. But the python3.13 job timed out several times on rax classic nodes and then failed two other runs for a success rate of 50%. Each of the failed test runs was for tracebacks being detected in a single unittest per run. One for dictionary changing side during periodic stats reporting (looks like while walking zookeeper nodes) and the other for trying to schedule a future after the threadpool has shutdown? I suspect that neither of those are directly python3.13 related but python31.3 could possibly be more susceptible due to timing changes? | 16:01 |
---|---|---|
@clarkb:matrix.org | I've gone ahead and rechecked the change to gather more data too | 16:03 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 952727: Replace Ansible 8 with 11 https://review.opendev.org/c/zuul/zuul/+/952727 | 16:03 | |
@jim:acmegating.com | Clark: agreed; bugs like that shake out every time we do something that alters the timing | 16:04 |
@clarkb:matrix.org | I'll look into those a bit deeper a bit later today | 16:04 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 956267: Fix test shutdown race in test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/956267 | 16:35 | |
@clarkb:matrix.org | I think that may be one of the errors | 16:35 |
@clarkb:matrix.org | assuming its safe to manually shutdown the launcher there I suspect that will address the problem | 16:36 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 956281: Iterate static dict values in estimateDataSize https://review.opendev.org/c/zuul/zuul/+/956281 | 17:30 | |
@clarkb:matrix.org | and maybe that for the other error | 17:30 |
@jim:acmegating.com | Clark: which test failed that? | 17:39 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956206: Launcher: ignore failed nodes for request locality https://review.opendev.org/c/zuul/zuul/+/956206 | 17:42 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 17:42 | |
- [zuul/zuul] 956207: Add additional debugging to provider selection https://review.opendev.org/c/zuul/zuul/+/956207 | ||
- [zuul/zuul] 956260: Launcher: handle delete state machine errors https://review.opendev.org/c/zuul/zuul/+/956260 | ||
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956263: Openstack: don't log tracebacks for quota exceptions https://review.opendev.org/c/zuul/zuul/+/956263 | 17:42 | |
@clarkb:matrix.org | corvus: test_message_too_long_failed_job | 17:43 |
@clarkb:matrix.org | corvus: https://42cdd24dbdd776667c4b-f807507fa9fed87678c45abc175a86d4.ssl.cf5.rackcdn.com/zuul/2b69ca50fd2e4f599d4a8a9a9a9767c3/testr_results.html is the log | 17:43 |
@jim:acmegating.com | thanks. that makes sense. | 17:43 |
@clarkb:matrix.org | the first change I wrote is failing, but in a bunch of unrelated test cases so I think it may be bad luck. But I'll check the results before I recheck to ensure the test case I modified isn't also failing | 17:52 |
@jim:acmegating.com | Clark: i don't think that approach works; i don't think you can shut it down like that. | 18:02 |
@clarkb:matrix.org | corvus: hrm. I thought about waitForSettled but we already run that and it seems like it can decide later to keep booting things | 18:04 |
@clarkb:matrix.org | and grabbing a lock won't help as we're going to release that as the test ends | 18:05 |
@jim:acmegating.com | yeah, i don't immediately have an easy solution. but one thing that may work is writing a conditional to figure out if we're in the 3 or 5 case, then waiting some more. | 18:05 |
@jim:acmegating.com | or, something similar we do for timer tess: | 18:05 |
@jim:acmegating.com | * or, something similar we do for timer tests: | 18:06 |
@jim:acmegating.com | we load a new configuration at the end without the problematic config items, so it's more stable | 18:06 |
@clarkb:matrix.org | ack I'll look and see if it is easy to determine what the wait for nodes count should be | 18:06 |
@clarkb:matrix.org | corvus: re the "randomness" there it seems like debian-normal has min-ready 2 and two providers that can provide that label. debian-emea has min-ready: 1 and one provider. Is the problem that one or both of the debian-normal providers may provide the minready value (2 total or 2+2 total) then we have the debian-emea node adding 1? So we're either 3 total or 5 total? | 18:11 |
@clarkb:matrix.org | The bit about why we'd have 5 total is a bit fuzzy to me but I'm assuming that is because min-ready may be fulfilled by both providers? | 18:11 |
@jim:acmegating.com | yeah see the comment on the test above which explains the sources. | 18:12 |
@clarkb:matrix.org | I suspect the other test actually needs a fix along these lines too | 18:18 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/956289 | 18:38 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/956289 | 18:39 | |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 956267: Fix test shutdown race in test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/956267 | 18:43 | |
@clarkb:matrix.org | corvus: something like ^ maybe? | 18:43 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/956289 | 19:46 | |
@jim:acmegating.com | Clark: that might help for max_ready_age, but i'm not sure it will do anything for min_ready | 19:49 |
@clarkb:matrix.org | did the matrix gerritbot die? ps3 didn't get a message here | 19:52 |
@jim:acmegating.com | Clark: i don't see a ps3 of 267 | 20:01 |
@clarkb:matrix.org | corvus: ps3 of 956289, it finally reported just very late | 20:04 |
@clarkb:matrix.org | corvus: do you mean the test cases? Both of them stop with a nebulous amount of servers which I think means we may continue trying to boot new servers during shutdown? Removing the min-ready value from both tests and waiting for the config to settle should ensure that we don't try and boot anything new because there is no longer the min-ready demand? | 20:05 |
@jim:acmegating.com | i think the damage is done with min_ready, so it may not be an effective fix. i was only thinking of suggesting that for max_ready_age because that still has things going on after the main part of the test. | 20:10 |
@jim:acmegating.com | Clark: do you have a link to a test failure for max_ready_age handy? | 20:16 |
@jim:acmegating.com | nm found it | 20:18 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956305: Fix test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/956305 | 20:25 | |
@jim:acmegating.com | Clark: ^ i think we may want something like that | 20:26 |
@clarkb:matrix.org | oh I see it was jumping ahead | 20:53 |
@clarkb:matrix.org | I +2'd your chnage and can abandon mine | 20:53 |
@clarkb:matrix.org | corvus: there were a couple more failures that aren't timeouts: https://0740c1f4685ccde5d553-ea8eedd91454e7f5ed51aea65091aac2.ssl.cf5.rackcdn.com/zuul/ca32bae4e01d4f0eae1890e306ea9caf/testr_results.html and https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_04f/zuul/04f68102507449139b212e11faffecb5/testr_results.html from https://zuul.opendev.org/t/zuul/build/ca32bae4e01d4f0eae1890e306ea9caf/logs and https://zuul.opendev.org/t/zuul/build/04f68102507449139b212e11faffecb5/logs respecitvely | 20:55 |
@clarkb:matrix.org | in that first one I think the underlying zookeeper record is being removed before we fetch it back again. I guess due to running on a quicker node? | 20:56 |
@clarkb:matrix.org | do you think it is appropriate to catch that exception and consider that case to also be a successful output of the test? | 20:56 |
@clarkb:matrix.org | in that second one it seems like we're trying to upload an image and then wait for the image artifact situation to settle but the upload always fails? that one seems more complicated | 20:58 |
@jim:acmegating.com | lookin | 21:01 |
@clarkb:matrix.org | hrm I think those upload failures may be a distraction. The test case marks that exception as an ok traceback | 21:08 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Clark Boylan: [zuul/zuul] 945521: Add python3.13 unittest job https://review.opendev.org/c/zuul/zuul/+/945521 | 21:11 | |
@clarkb:matrix.org | The logs indicate that it is deleting image build artifacts with no upload. I'm beginning to suspect that we're in a race to get the data before it gets deleted? | 21:11 |
@jim:acmegating.com | Clark: i don't like that first error with the zk cache. i know of no reason for it to fail like that. i updated 945521 with some extra debugging. | 21:11 |
@clarkb:matrix.org | but if we never see when the data is there before it gets deleted we can run until the timeout and fail? | 21:11 |
@clarkb:matrix.org | I see the extra debugging | 21:12 |
@clarkb:matrix.org | looks like the deletion of the IBA occurs directly from the launcher against zk but the query for the artifacts in the test is happening against the image build registry which is a zkcache of iba objects | 21:16 |
@clarkb:matrix.org | ya I think its possible for that cache to never have the data in it when we look due to it updating when things delete | 21:19 |
@clarkb:matrix.org | corvus: maybe we need to mock out the iba deletion step in that test? | 21:20 |
@jim:acmegating.com | Clark: i think it's simpler; 1 sec | 21:30 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956309: Fix test_launcher_image_expire_failed_upload https://review.opendev.org/c/zuul/zuul/+/956309 | 21:39 | |
@jim:acmegating.com | Clark: ^ described in commit msg | 21:39 |
@clarkb:matrix.org | corvus: so this only ever passed if things went fast enough to not see the new IBAs? But if things slowed down so that the IBA count incremented then we would fail? | 21:41 |
@clarkb:matrix.org | it isn't clear to me how the IBAs getting deleted from zookeeper (per the build log) is able to reconcile with the cache state having the data in it deterministically | 21:42 |
@jim:acmegating.com | Clark: yep, surprising the turtle regularly won this race. | 21:44 |
@jim:acmegating.com | i don't think i understand your last sentence | 21:44 |
@clarkb:matrix.org | corvus: if you look at the logs for the failing test in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_04f/zuul/04f68102507449139b212e11faffecb5/testr_results.html there are messages like `Deleting image build artifact with no uploads:` that seems to be logged when the launcher deletes the IBA records from zookeeper. The test's waitForBuildArtifacts() method is looking at a zookeeper object cache that should be udpated when zookeeper udpates I think | 21:45 |
@clarkb:matrix.org | corvus: so basically if we're deleting the IBAs because the uploads failed then wouldn't we still haev a race with the image build registry cache and looking to see if the artifacts are there? | 21:46 |
@jim:acmegating.com | the deleting is synchronous, the cache is delayed | 21:46 |
@jim:acmegating.com | yes, that's why we have to wait for the correct number of ibas to be in the cache | 21:46 |
@jim:acmegating.com | (and why we can't wait for 3, specifically, because we can't guarantee that state) | 21:46 |
@jim:acmegating.com | _waitForArtifacts is: busy-loop until the cache updates to have the expected number of ibas (the expected number is the second argument) | 21:47 |
@clarkb:matrix.org | but couldn't it sometimes be 0 if we look at the wrong time? (Fwiw I see that async worker with an event queue is the default for the caches so that is why it is delayed) | 21:50 |
@jim:acmegating.com | if it's zero, we will busy loop until it's 2 | 21:50 |
@clarkb:matrix.org | right but if we're deleting the zk records isn't it possible that in some test runs every time we look at the cache the values won't be there? | 21:51 |
@clarkb:matrix.org | it would require the async cache updates to occur after the deletions and before the queries? | 21:52 |
@jim:acmegating.com | the only possible states for the number of artifacts for the ubuntu-local image are: 0, 1, 2, 3, 2 | 21:53 |
@jim:acmegating.com | in that order | 21:53 |
@jim:acmegating.com | at no point, once the number reaches 2, does it ever decrease | 21:53 |
@jim:acmegating.com | (you will note, this is the long-standing nodepool logic for dealing with old images) | 21:54 |
@clarkb:matrix.org | corvus: in that job log there were 4 `Deleting image build artifact with no uploads` messages total. 3 for ubuntu-local and one for debian-local | 21:54 |
@jim:acmegating.com | correction, i think 0,1,0,1,2,3,2 is possible | 21:55 |
@clarkb:matrix.org | which I think corresponds to build one of each and delete both. then build ubuntu-local twice and delete both | 21:55 |
@clarkb:matrix.org | I guess I'm confused as to why if we are deleting image 4 images build in zk and there are 4 image builds total it isn't possible for the cache to see those deletions if run in the right sequence | 21:55 |
@clarkb:matrix.org | * I guess I'm confused as to why if we are deleting image 4 IBAs in zk and there are 4 image builds total it isn't possible for the cache to see those deletions if run in the right sequence | 21:56 |
@clarkb:matrix.org | it seems like we're counting on the cache not updating whcih is maybe happening and I'm not understanding that part? | 21:57 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/956289 | 22:06 | |
@jim:acmegating.com | Clark: some of those deletions are because there's a build job that returns an image format that is unused. i think you're right that could cause a race as well. let me see if it makes sense to remove the extra format. | 22:07 |
@clarkb:matrix.org | oh yes I guess a second set of uploads for another format would offset my counts and then they wouldn't be accurate in my assertions above | 22:07 |
@clarkb:matrix.org | but maybe if we look at just the right time we'd see 3 due to an extra upload and have to wait until it gets deleted and shrinks back to 2? | 22:08 |
@clarkb:matrix.org | if that is what is happening then maybe its ok since it shuold shrink back down to the numbers you've got listed in the fixup change? | 22:10 |
@jim:acmegating.com | the thing i'm worried about is getting to the expected number of artifacts too soon -- because we do create 2 ubuntu-local artifacts, then immediately delete 1. we don't want any of the subsequent checks to be based on the wrong one (like the uuid check i just added) | 22:13 |
@clarkb:matrix.org | makes sense | 22:13 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956309: Fix test_launcher_image_expire_failed_upload https://review.opendev.org/c/zuul/zuul/+/956309 | 22:17 | |
@jim:acmegating.com | Clark: ^ went with filtering rather than changing the test fixtures (plus, we're checking an extra behavior which is good) | 22:18 |
@jim:acmegating.com | with that, it should be a strict sequence of 0,1,2,3,2 for the raw ubuntu-local artifact count | 22:19 |
@clarkb:matrix.org | +2 thanks. Not sure if we want to wait for additional reviewers on those | 22:22 |
@clarkb:matrix.org | and I'm glad my confusion wasn't just a time sink and led to a better test in the end | 22:23 |
@fungicide:matrix.org | looking | 22:23 |
@fungicide:matrix.org | both lgtm | 22:24 |
@jim:acmegating.com | yep thanks! | 22:26 |
@clarkb:matrix.org | fungi: there is also https://review.opendev.org/c/zuul/zuul/+/956281 but that one changes actual zuul code and not just tests. I think its pretty safe through | 22:27 |
@clarkb:matrix.org | * fungi: there is also https://review.opendev.org/c/zuul/zuul/+/956281 but that one changes actual zuul code and not just tests. I think its pretty safe though | 22:27 |
@fungicide:matrix.org | yeah, gc complications from passing mutable types, as always | 22:30 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956305: Fix test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/956305 | 23:52 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!