Monday, 2023-02-06

04:00 *** yadnesh|away is now known as yadnesh
04:28 *** bhagyashris_ is now known as bhagyashris
08:42 *** jpena|off is now known as jpena
08:47 *** gibi_pto is now known as gibi
13:38 *** yadnesh is now known as yadnesh|away
15:10 <dansmith> gmann: do you have any tips for debugging these time-out tests? I'm looking at the inflight subunit for one of them, and I can figure out which worker was running _a_ test at the time, and what test it just finished, but not what it's running when the timeout happens
15:11 <dansmith> right before that it was running a stable rescue test which took 45s
15:11 <dansmith> so I kinda wonder if we're really just bumping up against what we can do in the job timeout interval
15:12 <dansmith> it was only 18s between the previous test finishing and the job timeout, so it's not like it was stuck for a long time and then we timed out
15:13 <fungi> doesn't it run multiple tests in parallel? so just because one worker finished a test doesn't mean there wasn't another running for longer
15:20 <dansmith> fungi: yeah, and that may be happening, but all the other workers have an equal number of "started" and "finished" markers except one
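[An aside for anyone retracing this kind of debugging: the per-worker "started"/"finished" bookkeeping dansmith describes can be scripted. The sketch below assumes the in-flight subunit stream has already been rendered to plain text and that each event line carries a worker tag and a status; that line shape is an assumption, not the real subunit wire format, so adjust the regex to whatever your dump actually looks like.]

    #!/usr/bin/env python3
    # Rough sketch: list tests that started but never finished, per worker,
    # from a text rendering of an in-flight subunit stream. The expected line
    # shape (worker tag, test id, "status: ...") is assumed, not guaranteed.
    import re
    import sys
    from collections import defaultdict

    EVENT = re.compile(r'(?P<worker>worker-\d+)\s+(?P<test>\S+).*status:\s*(?P<status>\w+)')

    def unfinished_tests(lines):
        inflight = defaultdict(set)      # worker -> tests started but not finished
        for line in lines:
            m = EVENT.search(line)
            if not m:
                continue
            worker, test, status = m.group('worker', 'test', 'status')
            if status == 'inprogress':
                inflight[worker].add(test)
            else:                        # success/fail/skip all count as finished
                inflight[worker].discard(test)
        return inflight

    if __name__ == '__main__':
        for worker, tests in sorted(unfinished_tests(sys.stdin).items()):
            print(worker, 'still running:', ', '.join(sorted(tests)) or '(nothing)')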
15:21 <dansmith> IIRC, the test distribution is roughly "sort alphabetical and then slice by number of workers" which may also be part of the problem, if one worker ends up getting penalized by a huge batch of slow tests, such that we stop being very parallel at some point
15:22 <dansmith> two of the four workers are the only ones active for the final ten minutes of the test
15:22 <fungi> in theory we could make a variant of the job with a very long timeout and set it to serialized, then capture timing information for each test (but if the timeouts are related to concurrency and the wrong tests conflicting at the right times then that won't tell us anything)
15:23 <dansmith> I'm seeing this on lots of different job flavors lately.. I think we all are
15:24 <dansmith> I think unless we have a way to affect the distribution amongst workers, that would be interesting, but I'm not sure what we'd do with it
15:25 <dansmith> even still, I think the test timeout used to be high enough that things had to go way off the rails to hit it.. if we're only within ten minutes of it and a slightly non-uniform distribution is pushing us over the edge, we probably need to do something different anyway
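[To make the alphabetical-slicing concern concrete, here is a toy comparison of contiguous slicing versus a timing-aware greedy assignment. The test names and durations below are invented, and the real stestr scheduler may well behave differently than dansmith's "roughly" recollection; this only illustrates why a cluster of slow, alphabetically adjacent tests can leave one worker as the long pole.]

    # Toy illustration only: hypothetical per-test runtimes, 2 workers.
    durations = {
        'test_rescue_ide': 85, 'test_rescue_sata': 80, 'test_rescue_virtio': 95,
        'test_show_flavors': 5, 'test_show_images': 6, 'test_show_server': 4,
    }
    workers = 2

    def makespan(buckets):
        # wall-clock time is set by the busiest worker
        return max(sum(durations[t] for t in b) for b in buckets)

    # "sort alphabetically, then slice into contiguous chunks per worker"
    ordered = sorted(durations)
    chunk = -(-len(ordered) // workers)          # ceil division
    sliced = [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]

    # timing-aware alternative: longest-processing-time greedy assignment
    buckets = [[] for _ in range(workers)]
    for test in sorted(durations, key=durations.get, reverse=True):
        min(buckets, key=lambda b: sum(durations[t] for t in b)).append(test)

    print('alphabetical slicing makespan:', makespan(sliced))   # all rescue tests on one worker -> 260s
    print('timing-aware makespan:        ', makespan(buckets))  # 165s with the same tests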
15:27 <fungi> agreed, likely that this is a symptom of the jobs just running longer than they used to. more tests? slower software? additional resource consumption? our providers' systems got slower or more heavily loaded?
15:28 <dansmith> I think.. yes. :)
15:28 <dansmith> obviously more tests over time, no doubt
15:45 <fungi> we could probably collectively do a better job of cleaning up redundant tests or combining related tests in order to not have so much of a net increase over time
15:46 <dansmith> yeah, I was just thinking that some of these stable rescue tests that take a good long time maybe aren't all critical to run every time
15:46 <fungi> adding new tests is comparatively easy, but leads to tech debt if that's all we do
15:47 <dansmith> it's also desirable to just write new "start fresh" test cases for each thing because it's easier to grok the current state of an instance after fresh boot, but that's pretty expensive
15:47 <fungi> we probably also have a ton of tests with large overlap in functionality and comparatively small additions to coverage
15:49 <fungi> which results in a very small percentage of the job's runtime exercising unique parts of the codebase and the vast majority of the job redundantly retreading the same ground
15:50 <dansmith> yeah, these rescue tests for example,
15:50 <dansmith> there's a different test for each disk bus
15:50 <dansmith> and there are two virtio tests, one with a volume (non-root) attached and one without
15:50 <dansmith> the former is probably good enough for most runs
15:56 <fungi> if we were to be scientific about it, a given test takes x amount of time to cover y amount of otherwise-untested functionality, with z risk that a new change introduces a bug into that area, and there's some threshold between those factors that makes the test worth running on each proposed change rather than periodically post-merge
15:57 <dansmith> right, well, I was also looking.. we do a fair amount of duplicating our tests for things like ide,
15:57 <dansmith> which AIUI is maybe not even supported on some enterprise linux distros anymore
15:57 <fungi> or restated, if the test is quick, covers a relatively large amount of untested code, or covers some particularly risk-prone area of the software, then it should run on every change
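[fungi's cost/benefit framing could be roughed out as a score per test. Everything in the sketch below, the weighting, the example coverage/risk numbers, and the threshold, is invented purely for illustration; real inputs would have to come from measured runtimes and human judgement about coverage and risk.]

    # Back-of-the-envelope version of the cost/benefit idea above. All numbers
    # and the threshold are made up for illustration.
    from dataclasses import dataclass

    @dataclass
    class TestProfile:
        name: str
        runtime_s: float        # x: how long the test takes
        unique_coverage: float  # y: 0..1, rough share of otherwise-untested code
        regression_risk: float  # z: 0..1, how likely a new change breaks this area

    def value_per_second(t: TestProfile) -> float:
        # cheap tests covering unique, risky code score high; long tests that
        # mostly retread already-covered ground score low
        return (t.unique_coverage * t.regression_risk) / t.runtime_s

    PRE_MERGE_THRESHOLD = 0.001   # arbitrary cut-off for this toy example

    tests = [
        TestProfile('test_rescue_virtio_with_volume', 95, 0.6, 0.5),
        TestProfile('test_stable_device_rescue_cdrom_ide', 85, 0.1, 0.1),
    ]
    for t in tests:
        bucket = 'pre-merge' if value_per_second(t) >= PRE_MERGE_THRESHOLD else 'periodic'
        print(f'{t.name}: {value_per_second(t):.4f} -> {bucket}')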
15:57 <dansmith> so we could probably make that a post-merge periodic sort of deal
15:57 <dansmith> yeah
15:59 <dansmith> and I wonder if nova could stomach testing things for old ancient microversion compatibility less often
17:24 <opendevreview> Merged openstack/tempest master: Fix default values for variables in run-tempest role  https://review.opendev.org/c/openstack/tempest/+/869440
17:24 <opendevreview> Merged openstack/devstack stable/xena: Use proper sed separator for paths  https://review.opendev.org/c/openstack/devstack/+/865842
17:36 *** jpena is now known as jpena|off
19:05 <gmann> I do not think we have many duplicate tests. also it is difficult to say which ones are duplicates, especially considering the individual API operation tests required for interop capability checks
19:07 <gmann> dansmith: to keep the job runtime under the timeout, as you know we do mark long-running tests as slow and run them in a separate job
19:07 <gmann> we used to monitor those tests and mark them slow
19:07 <dansmith> gmann: yeah I don't mean *really* slow tests,
19:07 <dansmith> I just mean lots of tests that take 45s - 1m add up over time
19:08 <gmann> dansmith: usually we consider tests running >200 sec as slow, but that is for scenario tests. an API test exercising a single operation and taking 60 sec is still slow
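[For readers outside the Tempest world: "marking a test slow" means tagging it with an attribute decorator so jobs can include or exclude it by regex (the full jobs filter on a pattern roughly like \[.*\bslow\b.*\], while a dedicated slow job runs the tagged tests). A minimal sketch of what the tagging looks like; the class name, test body, and UUID below are placeholders, while the decorators are the real ones from tempest.lib.]

    # Minimal sketch of how a Tempest test gets tagged as slow. The class and
    # test body are placeholders, not an actual Tempest test.
    from tempest.lib import decorators
    from tempest.scenario import manager


    class StableRescueExample(manager.ScenarioTest):

        @decorators.idempotent_id('00000000-0000-0000-0000-000000000000')  # placeholder UUID
        @decorators.attr(type='slow')
        def test_stable_device_rescue_example(self):
            # a real test would boot a server, rescue it, and verify the devices
            pass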
19:08 <dansmith> yeah, I know
19:09 <dansmith> a hundred 45s tests is over an hour of work..
19:10 <dansmith> gmann: anyway, I'm just saying, I've looked at a few test timeout jobs today that look to me like nothing is particularly stuck, which makes me think we might just be hitting the point where we're testing almost as much as we can in the allotted time
19:11 <gmann> dansmith: yeah, and seeing more timeouts on Ubuntu 22.04 can be another factor.
19:11 <gmann> I will say let's mark slow tests as slow, and if it is still happening then we can think of increasing the job timeout
19:12 <gmann> it is 7200s now
19:12 <gmann> which should be enough
19:13 <dansmith> yeah, I'm not saying we should raise the timeout,
19:13 <dansmith> I'm saying maybe we need to consider more buckets of tests
19:13 <dansmith> and maybe say "it's okay for us to test this periodic post-merge"
19:14 <dansmith> like I don't think we need a "rescue virtio" and "rescue virtio with data volume" test pre-merge.. the volume might fail when the root disk does not, but running both every time is probably too much work for the gain we get
19:15 <clarkb> one thing I've found when I have had time to debug slow tests is that we suffer a lot from the thousand-cuts problem. no one thing is slow, but we do lots of things, and if they aren't quick it adds up. also, yeah, there is probably enough overlap in tests that you can bucket them better
19:15 <gmann> yeah, or we can split those tests into another job
19:16 <gmann> with periodic it seems we might not catch some breaking changes, and that can create issues around release time
19:16 <dansmith> clarkb: right
19:16 <dansmith> gmann: right, which is why I'm saying we'd need to make tactical decisions about those things
19:17 <dansmith> like "X is nice but it's almost certainly covered by Y"
19:18 <gmann> yeah that can be better
19:19 <dansmith> test_resize_server_revert_with_volume_attached [255.241544s]
19:19 <dansmith> I mean, that's a long test man :)
19:19 <dansmith> test_resize_server_with_multiattached_volume [352.641738s]
19:19 <dansmith> those are probably fairly close in terms of coverage, or maybe could be made to be
19:20 <dansmith> that's ten minutes of one worker for two probably very similar tests
19:20 <dansmith> test_list_get_volume_attachments_multiattach [153.243739s]
19:21 <dansmith> and I bet we could just do a list in the multiattach resize case and cover this one ^ to get three minutes back
19:21 <dansmith> it's much cleaner to cover GET, PUT, POST, DELETE in separate tests, but when they take minutes, we probably need to be more tactical
19:22 <gmann> I agree, and that is why we stopped accepting those single-API-operation tests which are/can be covered by scenario tests or functional tests on the project side
19:22 <dansmith> ack, I didn't know, but good to hear
19:22 <gmann> but for the existing ones, interop is the reason we need to keep them.
19:23 <dansmith> we probably could use some review
19:23 <dansmith> yeah
19:23 <dansmith> we could have a class of "just needed for interop" maybe and run those not pre-merge?
19:23 <gmann> even some negative tests which create a resource and check other operations in a negative way
19:24 <gmann> sure, not all of them are needed by interop. surely not this one: test_resize_server_revert_with_volume_attached
19:25 <dansmith> I should know, but is there a way to tell which of these are codified as interop-required?
19:26 <dansmith> this is one I was skeptical about earlier: test_stable_device_rescue_cdrom_ide [85.087746s]
19:26 <gmann> dansmith: or we can run all admin API tests, and even more API tests, in a separate job
19:26 <dansmith> I *think* that people using q35 can't even use ide for cdrom
19:27 <dansmith> gmann: yeah, that might be good
19:27 <dansmith> gmann: maybe also not run a bunch of generic nova ones in the jobs that are specific to nova/cinder interactions
19:27 <dansmith> purely sharding though is just running *more* in parallel, which is good until we stop scaling that direction too
19:28 <dansmith> but it's certainly something we can do in the short term
19:28 <gmann> yeah
19:29 <gmann> let me push that and we can see how it looks
19:30 <dansmith> okay
21:51 *** mtreinish_ is now known as mtreinish
21:51 *** noonedeadpunk_ is now known as noonedeadpunk
23:24 <kopecmartin> dansmith: if you wanna see if a particular test is part of interop, just grep add-ons and guidelines dirs for that test in https://opendev.org/openinfra/interop
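[A small helper along the lines of kopecmartin's suggestion, assuming you have a local clone of https://opendev.org/openinfra/interop; the "guidelines" and "add-ons" directory names come straight from that suggestion, and the script is just a convenience wrapper around a recursive grep.]

    #!/usr/bin/env python3
    # usage (hypothetical filename): python3 find_interop.py /path/to/interop <test_name>
    import pathlib
    import sys

    def find_in_interop(repo_root, test_name):
        root = pathlib.Path(repo_root)
        for subdir in ('guidelines', 'add-ons'):
            for path in (root / subdir).rglob('*'):
                if path.is_file():
                    try:
                        text = path.read_text(errors='ignore')
                    except OSError:
                        continue
                    if test_name in text:
                        yield path

    if __name__ == '__main__':
        repo, test = sys.argv[1], sys.argv[2]
        hits = list(find_in_interop(repo, test))
        print('\n'.join(str(h) for h in hits) or f'{test}: not found in interop guidelines')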
