| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 00:18 | |
| - [zuul/zuul] 962146: Correct provider default inheritance https://review.opendev.org/c/zuul/zuul/+/962146 | ||
| - [zuul/zuul] 962147: Fix provider "flavors" check https://review.opendev.org/c/zuul/zuul/+/962147 | ||
| -@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 962177: web: Upgrade re-ansi dependency to latest 0.7.3 https://review.opendev.org/c/zuul/zuul/+/962177 | 11:30 | |
| -@gerrit:opendev.org- yatin proposed: [zuul/zuul-jobs] 961208: [WIP] Make fips setup compatible to 10-stream https://review.opendev.org/c/zuul/zuul-jobs/+/961208 | 11:58 | |
| -@gerrit:opendev.org- yatin proposed: [zuul/zuul-jobs] 961208: [WIP] Make fips setup compatible to 10-stream https://review.opendev.org/c/zuul/zuul-jobs/+/961208 | 12:48 | |
| -@gerrit:opendev.org- yatin proposed: [zuul/zuul-jobs] 961208: [WIP] Make fips setup compatible to 10-stream https://review.opendev.org/c/zuul/zuul-jobs/+/961208 | 14:05 | |
| @clarkb:matrix.org | corvus: the inheritance fix timed out its unittests. This py311 job, https://zuul.opendev.org/t/zuul/build/d8f6cbf46b404712aa1adf018a6b3d1b/log/job-output.txt, ran on rax flex sjc3 which I would typically expect to run the jobs in a reasonable amount of time. Looks like the parent change also timed out. maybe something about the parent change is causing tests to slow or never complete? | 14:46 |
| @jim:acmegating.com | Clark: i don't think so. i'm currently collecting stats from various builds and providers. | 14:51 |
| @jim:acmegating.com | https://zuul.opendev.org/t/zuul/runtime?job_name=zuul-nox-py312&project=zuul/zuul&branch=master&pipeline=gate | 14:51 |
| @jim:acmegating.com | that is a very large spread | 14:51 |
| @clarkb:matrix.org | huh maybe something else snuck in (via dependency updates perhaps) that is causing it | 14:54 |
| -@gerrit:opendev.org- yatin proposed: [zuul/zuul-jobs] 961208: [WIP] Make fips setup compatible to 10-stream https://review.opendev.org/c/zuul/zuul-jobs/+/961208 | 15:23 | |
| -@gerrit:opendev.org- Jan Gutter proposed: [zuul/zuul-jobs] 962194: Fix up some EL10 compatibility https://review.opendev.org/c/zuul/zuul-jobs/+/962194 | 15:30 | |
| @jim:acmegating.com | Clark: looking at recent timeouts, it looks like the test runner is going "idle" for a long period at the end, suggesting that some unit test or the test runner is getting stuck. (in other words, it's not just that the tests or node are slow). | 15:36 |
| @jim:acmegating.com | Clark: i ran some benchmark tests locally and they were fine -- timings were normal. but that is in an old venv. i recreated the venv and now it looks like my local run is hung. so i think we should pull on your "dependency" thread. | 15:37 |
| @clarkb:matrix.org | corvus: yup I think I've reached the same conclusion. https://zuul.opendev.org/t/zuul/build/d8f6cbf46b404712aa1adf018a6b3d1b/log/job-output.txt ran 1930 test cases too which matches the total number of test cases in the last merged change for zuul | 15:37 |
| @jim:acmegating.com | i saved my old venv so we can compare versions, (and of course we have the build logs). | 15:37 |
| @clarkb:matrix.org | not sure if that new change or its parent adds some test cases which could bump up that count and indicate a stuck test case or two. But we may also simply not be exiting the test runner for some reason. Possibly because we've started a non-daemon thread that has not shut down? | 15:37 |
| @jim:acmegating.com | Clark: on my old (good) venv, i ran current master and the end of that stack of 3 changes, and the times were the same | 15:38 |
| @jim:acmegating.com | Ran: 1931 tests in 636.4226 sec. | 15:38 |
| Ran: 1932 tests in 634.6860 sec. | ||
| @clarkb:matrix.org | that certainly seems to point at some dependency change if the old dep set can run things consistently | 15:38 |
| @clarkb:matrix.org | hrm pyparsing and wcwidth are the only two dep changes between a recent working run and the timed out runs in the CI system | 15:47 |
| @clarkb:matrix.org | pyparsing 3.2.4 -> 3.2.5 and wcwidth 0.2.13 -> 0.2.14 | 15:47 |
| @jim:acmegating.com | based on the output, the test runs were either finished or almost finished (there are >=1920 "... ok" instances). i wonder if anything in the testr stack changed? | 15:50 |
| @clarkb:matrix.org | my dependency diff between jobs says stestr, subunit, etc haven't changed. Only pyparsing and wcwidth did | 15:53 |
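
As an aside, the comparison described here can be scripted. The sketch below is not what was actually run (vimdiff comes up later in the log), and the file names are placeholders; it just reports packages whose pins changed between two `pip freeze` outputs.

```python
def parse_freeze(path):
    """Read a `pip freeze` output file into a {package: version} dict."""
    pins = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if "==" in line:
                name, version = line.split("==", 1)
                pins[name.lower()] = version
    return pins


def diff_freezes(old_path, new_path):
    """Print every package whose pinned version differs between runs."""
    old, new = parse_freeze(old_path), parse_freeze(new_path)
    for name in sorted(old.keys() | new.keys()):
        if old.get(name) != new.get(name):
            print(f"{name}: {old.get(name, 'absent')} -> {new.get(name, 'absent')}")


if __name__ == "__main__":
    # Example file names; point these at the freeze logs from a good
    # run and a timed-out run.
    diff_freezes("freeze-good-run.txt", "freeze-timeout-run.txt")
```
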
| @jim:acmegating.com | ack. i'll try revving those down when my current run finishes | 15:54 |
| @jim:acmegating.com | my old venv had pyparsing 3.2.3 (and wcwidth 0.2.13) | 15:55 |
| @clarkb:matrix.org | corvus: if you catch it in a sad state I wonder if you can use the sigusr2 thread dump feature while tests are running | 15:58 |
| @clarkb:matrix.org | to see if there are non daemon threads unexpectedly running. There is probably another way to accomplish that if sigusr2 doesn't work in this context | 15:58 |
| @jim:acmegating.com | i'm doubtful the output would go anywhere useful... like it may end up in an internal buffer that won't be flushed when i terminate the proc | 15:58 |
| @jim:acmegating.com | i may be able to try with different testr capture settings though | 15:59 |
| @jim:acmegating.com | oh i don't think usr2 gets generally enabled in the unit tests | 16:01 |
| @clarkb:matrix.org | oh ya I bet it's only done as part of daemon setup | 16:01 |
| @clarkb:matrix.org | or command setup and we don't go through that in the tests | 16:02 |
| @jim:acmegating.com | i could add that and have it print to stdout | 16:02 |
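
For reference, a minimal sketch of the kind of handler being discussed: it is registered for SIGUSR2 and prints every live thread's stack to stdout. The function name and the registration point are illustrative, not Zuul's actual stack-dump code.

```python
import signal
import sys
import threading
import traceback


def dump_thread_stacks(signum, frame):
    """Print a stack trace for every live thread to stdout."""
    frames = sys._current_frames()
    for thread in threading.enumerate():
        print(f"Thread: {thread.name} (daemon={thread.daemon})")
        thread_frame = frames.get(thread.ident)
        if thread_frame is not None:
            traceback.print_stack(thread_frame, file=sys.stdout)
    sys.stdout.flush()


# In a test fixture this could be registered once per process, e.g.:
# signal.signal(signal.SIGUSR2, dump_thread_stacks)
```
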
| @clarkb:matrix.org | another idea would be to run one test at a time in a new process and see if that allows some/most tests to complete successfully and eventually narrow it down to whatever trips the issue. That may be slow though | 16:06 |
| @jim:acmegating.com | i'm currently having stestr run some subsets of tests (like "test_scheduler") to try to narrow it down, so far no joy. | 16:07 |
| @jim:acmegating.com | that at least tells us that it's not *any* test, but either some specific test or some combination of tests. | 16:07 |
| @jim:acmegating.com | worth considering ansible versions too; those won't show up in the pip freeze | 16:21 |
| @clarkb:matrix.org | ++ | 16:29 |
| @clarkb:matrix.org | I don't see new ansible or ansible-core releases on pypi in the time range we're looking at. However, it could be one of the deps there too | 16:33 |
| @clarkb:matrix.org | doesn't look like the `manage-ansible` command logs the versions in a way that the job logs pick up. Just lists the input side of the installation | 16:35 |
| @jim:acmegating.com | okay a bit more data (but no information): i did the USR2 thing, and about a third of the processes produced output (the others did nothing -- they didn't even print out the start/stop lines). of that third, all of them included non-daemon threads for cheroot (the web server). and half of them also included threads involving the nodepool event watcher waiting for a shutdown event. (this is classic nodepool, not niz). | 16:39 |
| @jim:acmegating.com | in another session, i have downgraded pyparsing to 3.2.4, no joy | 16:40 |
| @jim:acmegating.com | cheroot 11.0.0 was released sep 21 | 16:42 |
| @jim:acmegating.com | i would have put the recent bunch of timeouts as starting sept 17... but maybe those were "normal" | 16:43 |
| @jim:acmegating.com | (maybe there's a cheroot issue, but that sort of got lost in the noise introduced by the opendev provider churn last week) | 16:44 |
| @jim:acmegating.com | https://github.com/cherrypy/cheroot/issues/769 | 16:45 |
| @clarkb:matrix.org | oh my initial diff of the dep sets was incomplete I guess | 16:45 |
| @clarkb:matrix.org | (I used vimdiff, not diff) and yes cheroot has also updated | 16:45 |
| @clarkb:matrix.org | 10.0.1 to 11.0.0 | 16:45 |
| @jim:acmegating.com | that's the thread name that shows up in the sigusr2 output | 16:45 |
| @clarkb:matrix.org | looks like someone else is having the same exact issue just with pytest test cases | 16:46 |
| @clarkb:matrix.org | boto3, botocore, cheroot, coverage, google-api-python-client, and moto have updates in addition to pyparsing and wcwidth | 16:48 |
| @clarkb:matrix.org | of those this cheroot thing seems like the most likely culprit given that issue | 16:48 |
| @jim:acmegating.com | test of that is in progress | 16:49 |
| @clarkb:matrix.org | corvus: https://github.com/cherrypy/cheroot/blob/4f040de1d729cf02c9673411ebf819fe460c337b/cheroot/server.py#L1901-L1932 it does seem to indicate here that if we call stop() on the http server it should end up killing that thread | 16:51 |
| @clarkb:matrix.org | Looks like the ZuulWebFixture does call stop() via a cleanup | 16:52 |
| @clarkb:matrix.org | so maybe stop() is buggy or we're not using ZuulWebFixture somewhere we should be | 16:52 |
| @jim:acmegating.com | https://github.com/cherrypy/cheroot/blob/4f040de1d729cf02c9673411ebf819fe460c337b/cheroot/server.py#L119 | 16:54 |
| @jim:acmegating.com | if thread-A enters the get call on line 128 and the queue is not shut down, it will call a blocking wait on line 131. if thread-B then shuts down the queue, thread-A would need to re-enter the method to perform the shutdown check | 16:56 |
| @clarkb:matrix.org | huh is it just spinning in that loop forever? | 16:56 |
| @clarkb:matrix.org | oh I see it's a deadlock not a spin | 16:57 |
| @jim:acmegating.com | i think it needs to put a bunch of null items in the queue to force those methods to return | 16:57 |
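
A stripped-down illustration of the pattern being described; this is not cheroot's actual code, just the shape of the bug and of the sentinel-based fix suggested above.

```python
import queue


class WorkQueue:
    """Toy version of the pattern in the linked cheroot code: the
    shutdown flag is only checked on entry to get(), so a worker
    already blocked inside get() never notices a later shutdown."""

    _SENTINEL = None

    def __init__(self):
        self._queue = queue.Queue()
        self._shutdown = False

    def get(self):
        if self._shutdown:
            return self._SENTINEL      # check happens only on entry...
        return self._queue.get()       # ...then we block indefinitely

    def shutdown(self, num_workers):
        self._shutdown = True
        # Waking the blocked workers requires feeding one sentinel per
        # worker so each pending get() returns and the thread can exit.
        for _ in range(num_workers):
            self._queue.put(self._SENTINEL)
```
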
| @clarkb:matrix.org | agreed. And it's possible this just works on python3.13 because the implementation is different there | 16:58 |
| @jim:acmegating.com | where does it work with 3.13? | 16:58 |
| @jim:acmegating.com | i have confirmed that the pyparsing+wcwidth downgrades do not fix the issue. and i have confirmed that cheroot downgrade does. i'll propose a change to pin. | 16:59 |
| @clarkb:matrix.org | corvus: the queue implementation is from stdlib. I don't think we have any test runs of this with zuul and python3.13. I could recheck my change to add python3.13 test jobs if we want to try and gather that (but that change hasn't been reliable for other reasons, likely timing issues with the updated runtime) | 16:59 |
| @clarkb:matrix.org | I'm just thinking it's possible that this works under python3.13 which may have made it harder to detect the issue before they released (or after) | 16:59 |
| @jim:acmegating.com | oh i see, you're hypothesizing they may not have noticed the issue if they tested with 3.13. | 17:00 |
| @jim:acmegating.com | it's also possible their tests don't spawn enough threads. | 17:00 |
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 962204: Disallow cheroot 11.0.0 https://review.opendev.org/c/zuul/zuul/+/962204 | 17:03 | |
| @clarkb:matrix.org | +2 from me. Though it's probably worth fast-approving that as everything is stuck behind it | 17:04 |
| @jim:acmegating.com | done and promoted; the opendev fixes are behind it. | 17:06 |
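
The pin in 962204 presumably amounts to a requirements exclusion along these lines; the exact syntax and the comment are a guess, not the actual change.

```
# cheroot 11.0.0 can leave worker threads blocked on shutdown, hanging
# the test runner: https://github.com/cherrypy/cheroot/issues/769
cheroot!=11.0.0
```
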
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 962220: Allow section inheritance between projects https://review.opendev.org/c/zuul/zuul/+/962220 | 18:38 | |
| @clarkb:matrix.org | hrm the py312 job timed out | 19:29 |
| @clarkb:matrix.org | this time there isn't a gap of time where the tests have finished but processes haven't exited so I think it was just slow | 19:30 |
| @clarkb:matrix.org | I'm going to manually reenqueue it to the gate | 19:30 |
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 21:09 | |
| - [zuul/zuul] 960924: Always associate nodes with providers https://review.opendev.org/c/zuul/zuul/+/960924 | ||
| - [zuul/zuul] 960927: Launcher: add max-age https://review.opendev.org/c/zuul/zuul/+/960927 | ||
| - [zuul/zuul] 961292: Launcher: handle reused node failure https://review.opendev.org/c/zuul/zuul/+/961292 | ||
| - [zuul/zuul] 961557: Assign unassigned building nodes to requests https://review.opendev.org/c/zuul/zuul/+/961557 | ||
| - [zuul/zuul] 962145: Use a subnode for request assignment https://review.opendev.org/c/zuul/zuul/+/962145 | ||
| -@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 962204: Disallow cheroot 11.0.0 https://review.opendev.org/c/zuul/zuul/+/962204 | 21:30 | |
| @clarkb:matrix.org | yay now other changes have a chance | 21:40 |
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 962232: Speed up git cherry-pick operations on large changes https://review.opendev.org/c/zuul/zuul/+/962232 | 22:10 | |
| -@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 22:23 | |
| - [zuul/zuul] 961534: Allow base sections with no connection https://review.opendev.org/c/zuul/zuul/+/961534 | ||
| - [zuul/zuul] 962146: Correct provider default inheritance https://review.opendev.org/c/zuul/zuul/+/962146 | ||
| @gordonmessmer:fedora.im | I apologize for asking a dumb question, but how does one build the container set in a local checkout of the zuul git repo, in order to test changes interactively? | 22:55 |
| @clarkb:matrix.org | Gordon Messmer: the Dockerfile has a number of targets in it. But other than that it should just be `docker build ./ --target zuul-executor` or whichever target you need | 23:00 |
| @clarkb:matrix.org | I think the last time I did this I ended up running docker build several times to get each of the targets built, and docker used cached data so only the first one was really slow | 23:01 |
| @clarkb:matrix.org | you can optionally set `-t zuul-executor:my-local-testing` or whatever to tag each of the builds too and keep track of what they contain | 23:01 |
| @gordonmessmer:fedora.im | thanks | 23:02 |
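
Pulling the suggestions above together, a local build session might look roughly like this; target names beyond zuul-executor and the tag values are examples, so check the Dockerfile for the real target list.

```shell
# Build the images you need from a local checkout; after the first
# build, later targets mostly reuse cached layers.
docker build . --target zuul-executor -t zuul-executor:my-local-testing
docker build . --target zuul-scheduler -t zuul-scheduler:my-local-testing
docker build . --target zuul-web -t zuul-web:my-local-testing
```
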