*** yadnesh|away is now known as yadnesh | 04:00 | |
*** bhagyashris_ is now known as bhagyashris | 04:28 | |
*** jpena|off is now known as jpena | 08:42 | |
*** gibi_pto is now known as gibi | 08:47 | |
*** yadnesh is now known as yadnesh|away | 13:38 | |
dansmith | gmann: do you have any tips for debugging these time-out tests? I'm looking at the inflight subunit for one of them, and I can figure out which worker was running _a_ test at the time, and what test it just finished, but not what it's running when the timeout happens | 15:10 |
dansmith | right before that it was running a stable rescue test which took 45s | 15:11 |
dansmith | so I kinda wonder if we're really just bumping up against what we can do in the job timeout interval | 15:11 |
dansmith | it was only 18s between the previous test finishing and the job timeout, so it's not like it was stuck for a long time and then we timed out | 15:12 |
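(Aside: a minimal sketch of how one could parse a saved in-flight subunit stream to list tests that started but never reported a final status, which is essentially what dansmith is doing by hand here. It assumes python-subunit and testtools are installed; the filename is hypothetical.)

```python
# Hedged sketch: list tests with an 'inprogress' event but no terminal status,
# i.e. the tests that were still running when the job timed out.
# Assumes python-subunit and testtools are available and that the in-flight
# subunit v2 stream was saved as 'worker-0.subunit' (hypothetical filename).
import subunit
import testtools


class InFlightTracker(testtools.StreamResult):
    TERMINAL = {'success', 'fail', 'skip', 'xfail', 'uxsuccess', 'exists'}

    def __init__(self):
        super().__init__()
        self.started = {}    # test_id -> (timestamp, route_code of the worker)
        self.finished = set()

    def status(self, test_id=None, test_status=None, timestamp=None,
               route_code=None, **kwargs):
        if test_id is None:
            return
        if test_status == 'inprogress':
            self.started[test_id] = (timestamp, route_code)
        elif test_status in self.TERMINAL:
            self.finished.add(test_id)


result = InFlightTracker()
with open('worker-0.subunit', 'rb') as stream:
    case = subunit.ByteStreamToStreamResult(stream, non_subunit_name='stdout')
    result.startTestRun()
    case.run(result)
    result.stopTestRun()

for test_id, (ts, route) in result.started.items():
    if test_id not in result.finished:
        print('still running at cutoff: %s (worker %s, started %s)'
              % (test_id, route, ts))
```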
fungi | doesn't it run multiple tests in parallel? so just because one worker finished a test doesn't mean there wasn't another running for longer | 15:13 |
dansmith | fungi: yeah, and that may be happening, but all the other workers have an equal number of "started" and "finished" markers except one | 15:20 |
dansmith | IIRC, the test distribution is roughly "sort alphabetical and then slice by number of workers" which may also be part of the problem, if one worker ends up getting penalized by a huge batch of slow tests, such that we stop being very parallel at some point | 15:21 |
dansmith | two of the four workers are the only ones active for the final ten minutes of the test | 15:22 |
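(Aside: a toy illustration, not stestr's actual scheduler, of why slicing an alphabetically sorted test list into contiguous chunks can leave one worker holding a cluster of slow, similarly named tests, while a round-robin deal spreads them out. The test names and timings below are invented.)

```python
# Illustrative toy only (not the real scheduler): compare slicing a sorted
# test list into contiguous chunks vs. dealing tests out round-robin.
tests = {
    'test_rescue_ide': 85, 'test_rescue_scsi': 60, 'test_rescue_virtio': 45,
    'test_rescue_virtio_with_volume': 50,
    'test_list_servers': 5, 'test_show_server': 4, 'test_update_server': 6,
    'test_delete_server': 7,
}
workers = 4
names = sorted(tests)

# contiguous slices of the sorted list
size = -(-len(names) // workers)  # ceiling division
sliced = [names[i * size:(i + 1) * size] for i in range(workers)]

# round-robin deal
dealt = [names[i::workers] for i in range(workers)]

for label, groups in (('sliced', sliced), ('round-robin', dealt)):
    runtimes = [sum(tests[n] for n in g) for g in groups]
    # the slowest worker bounds the wall-clock time of the whole run
    print(label, 'per-worker seconds:', runtimes, 'job bound:', max(runtimes))
```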
fungi | in theory we could make a variant of the job with a very long timeout and set it to serialized, then capture timing information for each test (but if the timeouts are related to concurrency and the wrong tests conflicting at the right times then that won't tell us anything) | 15:22 |
dansmith | I'm seeing this on lots of different job flavors lately.. I think we all are | 15:23 |
dansmith | I think unless we have a way to affect the distribution amongst workers, that would be interesting, but I'm not sure what we'd do with it | 15:24 |
dansmith | even still, I think the test timeout used to be high enough that things had to go way off the rails to hit it.. if we're only within ten minutes of it and a slightly non-uniform distribution is pushing us over the edge, we probably need to do something different anyway | 15:25 |
fungi | agreed, likely that this is a symptom of the jobs just running longer than they used to. more tests? slower software? additional resource consumption? our providers' systems got slower or more heavily loaded? | 15:27 |
dansmith | I think.. yes. :) | 15:28 |
dansmith | obviously more tests over time, no doubt | 15:28 |
fungi | we could probably collectively do a better job of cleaning up redundant tests or combining related tests in order to not have so much of a net increase over time | 15:45 |
dansmith | yeah, I was just thinking that some of these stable rescue tests that take a good long time maybe aren't all critical to run every time | 15:46 |
fungi | adding new tests is comparatively easy, but leads to tech debt if that's all we do | 15:46 |
dansmith | it's also desirable to just write new "start fresh" test cases for each thing because it's easier to grok the current state of an instance after fresh boot, but that's pretty expensive | 15:47 |
fungi | we probably also have a ton of tests with large overlap in functionality and comparatively small additions to coverage | 15:47 |
fungi | which results in a very small percentage of the job's runtime exercising unique parts of the codebase and the vast majority of the job redundantly retreading the same ground | 15:49 |
dansmith | yeah, these rescue tests for example, | 15:50 |
dansmith | there's a different test for each disk bus | 15:50 |
dansmith | and there are two virtio tests, one with a volume (non-root) attached and one without | 15:50 |
dansmith | the former is probably good enough for most runs | 15:50 |
fungi | if we were to be scientific about it, a given test takes x amount of time to cover y amount of otherwise untested functionality with z risk that a bug gets introduced into it by a new change, and there's some threshold between those factors that makes the test useful to run on each proposed change rather than periodically post-merge | 15:56 |
dansmith | right, well, I was also looking.. we do a fair amount of duplicating our tests for things like ide, | 15:57 |
dansmith | which AIUI is maybe not even supported on some enterprise linux distros anymore | 15:57 |
fungi | or restated, if the test is quick, covers a relatively large amount of untested code, or covers some particularly risk-prone area of the software then it should run on every change | 15:57 |
dansmith | so we could probably make that a post-merge periodic sort of deal | 15:57 |
dansmith | yeah | 15:57 |
dansmith | and I wonder if nova could stomach testing things for old ancient microversion compatibility less often | 15:59 |
opendevreview | Merged openstack/tempest master: Fix default values for variables in run-tempest role https://review.opendev.org/c/openstack/tempest/+/869440 | 17:24 |
opendevreview | Merged openstack/devstack stable/xena: Use proper sed separator for paths https://review.opendev.org/c/openstack/devstack/+/865842 | 17:24 |
*** jpena is now known as jpena|off | 17:36 | |
gmann | I do not think we have many duplicate tests. also it is difficult to say which ones are duplicates, especially considering the individual API operation tests required for interop capability checks | 19:05 |
gmann | dansmith: to minimize the job timeout, as you know we do mark long-running tests as slow and run them in a separate job | 19:07 |
gmann | we used to monitor those tests and mark them slow | 19:07 |
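(Aside: for context, tagging a test as slow in tempest is normally just an attribute decorator that the regular jobs exclude and the dedicated slow job selects. A minimal sketch, with an invented test class and method:)

```python
# Hedged sketch of how a tempest test is typically tagged as slow so the
# regular jobs can exclude it and a dedicated slow job can pick it up.
# The test class and method below are invented for illustration.
from tempest.lib import decorators
from tempest.scenario import manager


class TestExampleScenario(manager.ScenarioTest):

    @decorators.attr(type='slow')
    @decorators.idempotent_id('11111111-2222-3333-4444-555555555555')
    def test_long_running_scenario(self):
        # ...the actual test body would go here...
        pass
```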
dansmith | gmann: yeah I don't mean *really* slow tests, | 19:07 |
dansmith | I just mean lots of tests that take 45s - 1m add up over time | 19:07 |
gmann | dansmith: usually we consider a test running >200 sec as slow, but that is for scenario tests. an API test exercising a single operation that takes 60 sec is still slow | 19:08 |
dansmith | yeah, I know | 19:08 |
dansmith | a hundred 45s tests is over an hour of work.. | 19:09 |
dansmith | gmann: anyway, I'm just saying, I've looked at a few test timeout jobs today that look to me like nothing is particularly stuck, which makes me think we might just be hitting the point where we're testing almost as much as we can in the allotted time | 19:10 |
gmann | dansmith: yeah and seeing more timeouts on Ubuntu 22.04 can be another factor. | 19:11 |
gmann | I will say let's mark slow tests as slow, and if it is still happening then we can think of increasing the job timeout | 19:11 |
gmann | it is 7200s now | 19:12 |
gmann | which should be enough | 19:12 |
dansmith | yeah, I'm not saying we should raise the timeout, | 19:13 |
dansmith | I'm saying maybe we need to consider more buckets of tests | 19:13 |
dansmith | and maybe say "it's okay for us to test this periodic post-merge" | 19:13 |
dansmith | like I don't think we need a "rescue virtio" and "rescue virtio with data volume" test pre-merge.. the volume might fail when the root disk does not, but running both every time is probably too much work for the gain we get | 19:14 |
clarkb | one thing I've found when I have had time to debug slow tests is that we do suffer a lot from the thousand cuts problem. No one thing is slow, but we do lots of things and if they aren't quick it adds up. Also ya there is probably enough overlap in tests that you can bucket them better | 19:15 |
gmann | yeah, or we can split those tests in another job | 19:15 |
gmann | with periodic it seems we might not be able to catch some breaking changes, and that can create issues during release time | 19:16 |
dansmith | clarkb: right | 19:16 |
dansmith | gmann: right, which is why I'm saying we'd need to make tactical decisions about those things | 19:16 |
dansmith | like "X is nice but it's almost certainly covered by Y" | 19:17 |
gmann | yeah that can be better | 19:18 |
dansmith | test_resize_server_revert_with_volume_attached [255.241544s] | 19:19 |
dansmith | I mean, that's a long test man :) | 19:19 |
dansmith | test_resize_server_with_multiattached_volume [352.641738s] | 19:19 |
dansmith | those are probably fairly close in terms of coverage, or maybe could be made to be | 19:19 |
dansmith | that's ten minutes of one worker for two probably very similar tests | 19:20 |
dansmith | test_list_get_volume_attachments_multiattach [153.243739s] | 19:20 |
dansmith | and I bet we could just do a list in the multiattach resize case and cover this one ^ to get three minutes back | 19:21 |
dansmith | it's much cleaner to cover GET, PUT, POST, DELETE in separate tests, but when they take minutes, we probably need to be more tactical | 19:21 |
gmann | I agree and that is why we stopped accepting those single API operation tests which are/can be covered by scenario tests or functional tests on the project side | 19:22 |
dansmith | ack, I didn't know, but good to hear | 19:22 |
gmann | but for existing one, interop is the reason we need to keep those. | 19:22 |
dansmith | we probably could use some review | 19:23 |
dansmith | yeah | 19:23 |
dansmith | we could have a class of "just needed for interop" maybe and run those not pre-merge? | 19:23 |
gmann | even some negative tests that create a resource and check other operations in a negative way | 19:23 |
gmann | sure, not all are needed by interop. surely not this one: test_resize_server_revert_with_volume_attached | 19:24 |
dansmith | I should know, but is there a way to tell which of these are codified as interop-required? | 19:25 |
dansmith | this is one I was skeptical about earlier: test_stable_device_rescue_cdrom_ide [85.087746s] | 19:26 |
gmann | dansmith: or we can run all admin API tests in one separate job, and even more of the API tests | 19:26 |
dansmith | I *think* that people using q35 can't even use ide for cdrom | 19:26 |
dansmith | gmann: yeah, that might be good | 19:27 |
dansmith | gmann: maybe also not run a bunch of generic nova ones in the jobs that are specific to nova/cinder interactions | 19:27 |
dansmith | purely sharding though is just running *more* in parallel, which is good until we stop scaling that direction too | 19:27 |
dansmith | but it's certainly something we can do in the short term | 19:28 |
gmann | yeah | 19:28 |
gmann | let me push that and we can see how it looks | 19:29 |
dansmith | okay | 19:30 |
*** mtreinish_ is now known as mtreinish | 21:51 | |
*** noonedeadpunk_ is now known as noonedeadpunk | 21:51 | |
kopecmartin | dansmith: if you wanna see if a particular test is part of interop, just grep add-ons and guidelines dirs for that test in https://opendev.org/openinfra/interop | 23:24 |
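(Aside: a hedged Python equivalent of that grep, assuming a local clone of the interop repo at ./interop; the test name is just one of the examples from the discussion above.)

```python
# Hedged sketch: search a local clone of the interop repo for a test name,
# roughly equivalent to grepping the guidelines/ and add-ons/ directories.
# The clone path and the test name are assumptions for illustration.
import os

repo = './interop'
needle = 'test_list_get_volume_attachments_multiattach'

for subdir in ('guidelines', 'add-ons'):
    for root, _dirs, files in os.walk(os.path.join(repo, subdir)):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding='utf-8', errors='ignore') as f:
                    if needle in f.read():
                        print('referenced in', path)
            except OSError:
                pass  # skip unreadable files
```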