Thursday, 2025-07-31

-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956206: Launcher: ignore failed nodes for request locality https://review.opendev.org/c/zuul/zuul/+/95620600:10
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956206: Launcher: ignore failed nodes for request locality https://review.opendev.org/c/zuul/zuul/+/95620600:43
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956207: Add additional debugging to provider selection https://review.opendev.org/c/zuul/zuul/+/95620700:47
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956196: Fix test race in launcher test_jobs_executed https://review.opendev.org/c/zuul/zuul/+/95619601:49
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul-jobs] 956219: Add default for build diskimage image name https://review.opendev.org/c/zuul/zuul-jobs/+/95621908:44
-@gerrit:opendev.org- Szymon Datko proposed: [zuul/zuul] 956230: Implement reject filter in GitLab driver https://review.opendev.org/c/zuul/zuul/+/95623012:16
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956205: Web: handle null provider name on nodes https://review.opendev.org/c/zuul/zuul/+/95620513:32
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956260: Launcher: handle delete state machine errors https://review.opendev.org/c/zuul/zuul/+/95626015:46
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956263: Openstack: don't log tracebacks for quota exceptions https://review.opendev.org/c/zuul/zuul/+/95626315:57
@clarkb:matrix.orgcorvus: I know we're focused on zuul-launcher this morning so no rush. But the python3.13 job timed out several times on rax classic nodes and then failed two other runs for a success rate of 50%. Each of the failed test runs was for tracebacks being detected in a single unittest per run. One for dictionary changing side during periodic stats reporting (looks like while walking zookeeper nodes) and the other for trying to schedule a future after the threadpool has shutdown? I suspect that neither of those are directly python3.13 related but python31.3 could possibly be more susceptible due to timing changes?16:01
@clarkb:matrix.orgI've gone ahead and rechecked the change to gather more data too16:03
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 952727: Replace Ansible 8 with 11 https://review.opendev.org/c/zuul/zuul/+/95272716:03
@jim:acmegating.comClark: agreed; bugs like that shake out every time we do something that alters the timing16:04
@clarkb:matrix.orgI'll look into those a bit deeper a bit later today16:04
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 956267: Fix test shutdown race in test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/95626716:35
@clarkb:matrix.orgI think that may be one of the errors16:35
@clarkb:matrix.orgassuming its safe to manually shutdown the launcher there I suspect that will address the problem16:36
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 956281: Iterate static dict values in estimateDataSize https://review.opendev.org/c/zuul/zuul/+/95628117:30
@clarkb:matrix.organd maybe that for the other error17:30
@jim:acmegating.comClark: which test failed that?17:39
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956206: Launcher: ignore failed nodes for request locality https://review.opendev.org/c/zuul/zuul/+/95620617:42
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com:17:42
- [zuul/zuul] 956207: Add additional debugging to provider selection https://review.opendev.org/c/zuul/zuul/+/956207
- [zuul/zuul] 956260: Launcher: handle delete state machine errors https://review.opendev.org/c/zuul/zuul/+/956260
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956263: Openstack: don't log tracebacks for quota exceptions https://review.opendev.org/c/zuul/zuul/+/95626317:42
@clarkb:matrix.orgcorvus: test_message_too_long_failed_job17:43
@clarkb:matrix.orgcorvus: https://42cdd24dbdd776667c4b-f807507fa9fed87678c45abc175a86d4.ssl.cf5.rackcdn.com/zuul/2b69ca50fd2e4f599d4a8a9a9a9767c3/testr_results.html is the log17:43
@jim:acmegating.comthanks.  that makes sense.17:43
@clarkb:matrix.orgthe first change I wrote is failing, but in a bunch of unrelated test cases so I think it may be bad luck. But I'll check the results before I recheck to ensure the test case I modified isn't also failing17:52
@jim:acmegating.comClark: i don't think that approach works; i don't think you can shut it down like that.18:02
@clarkb:matrix.orgcorvus: hrm. I thought about waitForSettled but we already run that and it seems like it can decide later to keep booting things18:04
@clarkb:matrix.organd grabbing a lock won't help as we're going to release that as the test ends18:05
@jim:acmegating.comyeah, i don't immediately have an easy solution.  but one thing that may work is writing a conditional to figure out if we're in the 3 or 5 case, then waiting some more.18:05
@jim:acmegating.comor, something similar we do for timer tess:18:05
@jim:acmegating.com * or, something similar we do for timer tests:18:06
@jim:acmegating.comwe load a new configuration at the end without the problematic config items, so it's more stable18:06
@clarkb:matrix.orgack I'll look and see if it is easy to determine what the wait for nodes count should be18:06
@clarkb:matrix.orgcorvus: re the "randomness" there it seems like debian-normal has min-ready 2 and two providers that can provide that label. debian-emea has min-ready: 1 and one provider. Is the problem that one or both of the debian-normal providers may provide the minready value (2 total or 2+2 total) then we have the debian-emea node adding 1? So we're either 3 total or 5 total?18:11
@clarkb:matrix.orgThe bit about why we'd have 5 total is a bit fuzzy to me but I'm assuming that is because min-ready may be fulfilled by both providers?18:11
@jim:acmegating.comyeah see the comment on the test above which explains the sources.18:12
@clarkb:matrix.orgI suspect the other test actually needs a fix along these lines too18:18
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/95628918:38
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/95628918:39
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 956267: Fix test shutdown race in test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/95626718:43
@clarkb:matrix.orgcorvus: something like ^ maybe?18:43
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/95628919:46
@jim:acmegating.comClark: that might help for max_ready_age, but i'm not sure it will do anything for min_ready19:49
@clarkb:matrix.orgdid the matrix gerritbot die? ps3 didn't get a message here19:52
@jim:acmegating.comClark: i don't see a ps3 of 26720:01
@clarkb:matrix.orgcorvus: ps3 of 956289, it finally reported just very late20:04
@clarkb:matrix.orgcorvus: do you mean the test cases? Both of them stop with a nebulous amount of servers which I think means we may continue trying to boot new servers during shutdown? Removing the min-ready value from both tests and waiting for the config to settle should ensure that we don't try and boot anything new because there is no longer the min-ready demand?20:05
@jim:acmegating.comi think the damage is done with min_ready, so it may not be an effective fix.  i was only thinking of suggesting that for max_ready_age because that still has things going on after the main part of the test.20:10
@jim:acmegating.comClark: do you have a link to a test failure for max_ready_age handy?20:16
@jim:acmegating.comnm found it20:18
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956305: Fix test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/95630520:25
@jim:acmegating.comClark: ^ i think we may want something like that20:26
@clarkb:matrix.orgoh I see it was jumping ahead20:53
@clarkb:matrix.orgI +2'd your chnage and can abandon mine20:53
@clarkb:matrix.orgcorvus: there were a couple more failures that aren't timeouts: https://0740c1f4685ccde5d553-ea8eedd91454e7f5ed51aea65091aac2.ssl.cf5.rackcdn.com/zuul/ca32bae4e01d4f0eae1890e306ea9caf/testr_results.html and https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_04f/zuul/04f68102507449139b212e11faffecb5/testr_results.html from https://zuul.opendev.org/t/zuul/build/ca32bae4e01d4f0eae1890e306ea9caf/logs and https://zuul.opendev.org/t/zuul/build/04f68102507449139b212e11faffecb5/logs respecitvely20:55
@clarkb:matrix.orgin that first one I think the underlying zookeeper record is being removed before we fetch it back again. I guess due to running on a quicker node?20:56
@clarkb:matrix.orgdo you think it is appropriate to catch that exception and consider that case to also be a successful output of the test?20:56
@clarkb:matrix.orgin that second one it seems like we're trying to upload an image and then wait for the image artifact situation to settle but the upload always fails? that one seems more complicated20:58
@jim:acmegating.comlookin21:01
@clarkb:matrix.orghrm I think those upload failures may be a distraction. The test case marks that exception as an ok traceback21:08
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Clark Boylan: [zuul/zuul] 945521: Add python3.13 unittest job https://review.opendev.org/c/zuul/zuul/+/94552121:11
@clarkb:matrix.orgThe logs indicate that it is deleting image build artifacts with no upload. I'm beginning to suspect that we're in a race to get the data before it gets deleted?21:11
@jim:acmegating.comClark: i don't like that first error with the zk cache.  i know of no reason for it to fail like that.  i updated 945521 with some extra debugging.21:11
@clarkb:matrix.orgbut if we never see when the data is there before it gets deleted we can run until the timeout and fail?21:11
@clarkb:matrix.orgI see the extra debugging21:12
@clarkb:matrix.orglooks like the deletion of the IBA occurs directly from the launcher against zk but the query for the artifacts in the test is happening against the image build registry which is a zkcache of iba objects21:16
@clarkb:matrix.orgya I think its possible for that cache to never have the data in it when we look due to it updating when things delete21:19
@clarkb:matrix.orgcorvus: maybe we need to mock out the iba deletion step in that test?21:20
@jim:acmegating.comClark: i think it's simpler; 1 sec21:30
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956309: Fix test_launcher_image_expire_failed_upload https://review.opendev.org/c/zuul/zuul/+/95630921:39
@jim:acmegating.comClark: ^ described in commit msg21:39
@clarkb:matrix.orgcorvus: so this only ever passed if things went fast enough to not see the new IBAs? But if things slowed down so that the IBA count incremented then we would fail?21:41
@clarkb:matrix.orgit isn't clear to me how the IBAs getting deleted from zookeeper (per the build log) is able to reconcile with the cache state having the data in it deterministically21:42
@jim:acmegating.comClark: yep, surprising the turtle regularly won this race.21:44
@jim:acmegating.comi don't think i understand your last sentence21:44
@clarkb:matrix.orgcorvus: if you look at the logs for the failing test in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_04f/zuul/04f68102507449139b212e11faffecb5/testr_results.html there are messages like `Deleting image build artifact with no uploads:` that seems to be logged when the launcher deletes the IBA records from zookeeper. The test's waitForBuildArtifacts() method is looking at a zookeeper object cache that should be udpated when zookeeper udpates I think21:45
@clarkb:matrix.orgcorvus: so basically if we're deleting the IBAs because the uploads failed then wouldn't we still haev a race with the image build registry cache and looking to see if the artifacts are there?21:46
@jim:acmegating.comthe deleting is synchronous, the cache is delayed21:46
@jim:acmegating.comyes, that's why we have to wait for the correct number of ibas to be in the cache21:46
@jim:acmegating.com(and why we can't wait for 3, specifically, because we can't guarantee that state)21:46
@jim:acmegating.com_waitForArtifacts is: busy-loop until the cache updates to have the expected number of ibas (the expected number is the second argument)21:47
@clarkb:matrix.orgbut couldn't it sometimes be 0 if we look at the wrong time? (Fwiw I see that async worker with an event queue is the default for the caches so that is why it is delayed)21:50
@jim:acmegating.comif it's zero, we will busy loop until it's 221:50
@clarkb:matrix.orgright but if we're deleting the zk records isn't it possible that in some test runs every time we look at the cache the values won't be there?21:51
@clarkb:matrix.orgit would require the async cache updates to occur after the deletions and before the queries?21:52
@jim:acmegating.comthe only possible states for the number of artifacts for the ubuntu-local image are: 0, 1, 2, 3, 221:53
@jim:acmegating.comin that order21:53
@jim:acmegating.comat no point, once the number reaches 2, does it ever decrease21:53
@jim:acmegating.com(you will note, this is the long-standing nodepool logic for dealing with old images)21:54
@clarkb:matrix.orgcorvus: in that job log there were 4 `Deleting image build artifact with no uploads` messages total. 3 for ubuntu-local and one for debian-local21:54
@jim:acmegating.comcorrection, i think 0,1,0,1,2,3,2 is possible21:55
@clarkb:matrix.orgwhich I think corresponds to build one of each and delete both. then build ubuntu-local twice and delete both21:55
@clarkb:matrix.orgI guess I'm confused as to why if we are deleting image 4 images build in zk and there are 4 image builds total it isn't possible for the cache to see those deletions if run in the right sequence21:55
@clarkb:matrix.org* I guess I'm confused as to why if we are deleting image 4 IBAs in zk and there are 4 image builds total it isn't possible for the cache to see those deletions if run in the right sequence21:56
@clarkb:matrix.orgit seems like we're counting on the cache not updating whcih is maybe happening and I'm not understanding that part?21:57
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956289: Add unit tests (and fixes) for selectProviders https://review.opendev.org/c/zuul/zuul/+/95628922:06
@jim:acmegating.comClark: some of those deletions are because there's a build job that returns an image format that is unused.  i think you're right that could cause a race as well.  let me see if it makes sense to remove the extra format.22:07
@clarkb:matrix.orgoh yes I guess a second set of uploads for another format would offset my counts and then they wouldn't be accurate in my assertions above22:07
@clarkb:matrix.orgbut maybe if we look at just the right time we'd see 3 due to an extra upload and have to wait until it gets deleted and shrinks back to 2?22:08
@clarkb:matrix.orgif that is what is happening then maybe its ok since it shuold shrink back down to the numbers you've got listed in the fixup change?22:10
@jim:acmegating.comthe thing i'm worried about is getting to the expected number of artifacts too soon -- because we do create 2 ubuntu-local artifacts, then immediately delete 1.  we don't want any of the subsequent checks to be based on the wrong one (like the uuid check i just added)22:13
@clarkb:matrix.orgmakes sense22:13
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 956309: Fix test_launcher_image_expire_failed_upload https://review.opendev.org/c/zuul/zuul/+/95630922:17
@jim:acmegating.comClark: ^ went with filtering rather than changing the test fixtures (plus, we're checking an extra behavior which is good)22:18
@jim:acmegating.comwith that, it should be a strict sequence of 0,1,2,3,2 for the raw ubuntu-local artifact count22:19
@clarkb:matrix.org+2 thanks. Not sure if we want to wait for additional reviewers on those22:22
@clarkb:matrix.organd I'm glad my confusion wasn't just a time sink and led to a better test in the end22:23
@fungicide:matrix.orglooking22:23
@fungicide:matrix.orgboth lgtm22:24
@jim:acmegating.comyep thanks!22:26
@clarkb:matrix.orgfungi: there is also https://review.opendev.org/c/zuul/zuul/+/956281 but that one changes actual zuul code and not just tests. I think its pretty safe through22:27
@clarkb:matrix.org* fungi: there is also https://review.opendev.org/c/zuul/zuul/+/956281 but that one changes actual zuul code and not just tests. I think its pretty safe though22:27
@fungicide:matrix.orgyeah, gc complications from passing mutable types, as always22:30
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 956305: Fix test_max_ready_age https://review.opendev.org/c/zuul/zuul/+/95630523:52

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!