*** yadnesh|away is now known as yadnesh | 04:00 | |
*** bhagyashris_ is now known as bhagyashris | 04:28 | |
*** jpena|off is now known as jpena | 08:42 | |
*** gibi_pto is now known as gibi | 08:47 | |
*** yadnesh is now known as yadnesh|away | 13:38 | |
dansmith | gmann: do you have any tips for debugging these time-out tests? I'm looking at the inflight subunit for one of them, and I can figure out which worker was running _a_ test at the time, and what test it just finished, but not what it's running when the timeout happens | 15:10 |
dansmith | right before that it was running a stable rescue test which took 45s | 15:11 |
dansmith | so I kinda wonder if we're really just bumping up against what we can do in the job timeout interval | 15:11 |
dansmith | it was only 18s between the previous test finishing and the job timeout, so it's not like it was stuck for a long time and then we timed out | 15:12 |
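(Aside: a minimal sketch of how one could parse a saved in-flight subunit stream to list tests that started but never reported a final status, which is essentially what dansmith is doing by hand here. It assumes python-subunit and testtools are installed; the filename is hypothetical.)

```python
# Hedged sketch: list tests with an 'inprogress' event but no terminal status,
# i.e. the tests that were still running when the job timed out.
# Assumes python-subunit and testtools are available and that the in-flight
# subunit v2 stream was saved as 'worker-0.subunit' (hypothetical filename).
import subunit
import testtools


class InFlightTracker(testtools.StreamResult):
    TERMINAL = {'success', 'fail', 'skip', 'xfail', 'uxsuccess', 'exists'}

    def __init__(self):
        super().__init__()
        self.started = {}    # test_id -> (timestamp, route_code of the worker)
        self.finished = set()

    def status(self, test_id=None, test_status=None, timestamp=None,
               route_code=None, **kwargs):
        if test_id is None:
            return
        if test_status == 'inprogress':
            self.started[test_id] = (timestamp, route_code)
        elif test_status in self.TERMINAL:
            self.finished.add(test_id)


result = InFlightTracker()
with open('worker-0.subunit', 'rb') as stream:
    case = subunit.ByteStreamToStreamResult(stream, non_subunit_name='stdout')
    result.startTestRun()
    case.run(result)
    result.stopTestRun()

for test_id, (ts, route) in result.started.items():
    if test_id not in result.finished:
        print('still running at cutoff: %s (worker %s, started %s)'
              % (test_id, route, ts))
```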
fungi | doesn't it run multiple tests in parallel? so just because one worker finished a test doesn't mean there wasn't another running for longer | 15:13 |
dansmith | fungi: yeah, and that may be happening, but all the other workers have an equal number of "started" and "finished" markers except one | 15:20 |
dansmith | IIRC, the test distribution is roughly "sort alphabetical and then slice by number of workers" which may also be part of the problem, if one worker ends up getting penalized by a huge batch of slow tests, such that we stop being very parallel at some point | 15:21 |
dansmith | two of the four workers are the only ones active for the final ten minutes of the test | 15:22 |
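(Aside: a toy illustration, not stestr's actual scheduler, of why slicing an alphabetically sorted test list into contiguous chunks can leave one worker holding a cluster of slow, similarly named tests, while a round-robin deal spreads them out. The test names and timings below are invented.)

```python
# Illustrative toy only (not the real scheduler): compare slicing a sorted
# test list into contiguous chunks vs. dealing tests out round-robin.
tests = {
    'test_rescue_ide': 85, 'test_rescue_scsi': 60, 'test_rescue_virtio': 45,
    'test_rescue_virtio_with_volume': 50,
    'test_list_servers': 5, 'test_show_server': 4, 'test_update_server': 6,
    'test_delete_server': 7,
}
workers = 4
names = sorted(tests)

# contiguous slices of the sorted list
size = -(-len(names) // workers)  # ceiling division
sliced = [names[i * size:(i + 1) * size] for i in range(workers)]

# round-robin deal
dealt = [names[i::workers] for i in range(workers)]

for label, groups in (('sliced', sliced), ('round-robin', dealt)):
    runtimes = [sum(tests[n] for n in g) for g in groups]
    # the slowest worker bounds the wall-clock time of the whole run
    print(label, 'per-worker seconds:', runtimes, 'job bound:', max(runtimes))
```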
fungi | in theory we could make a variant of the job with a very long timeout and set it to serialized, then capture timing information for each test (but if the timeouts are related to concurrency and the wrong tests conflicting at the right times then that won't tell us anything) | 15:22 |
dansmith | I'm seeing this on lots of different job flavors lately.. I think we all are | 15:23 |
dansmith | I think unless we have a way to affect the distribution amongst workers, that would be interesting, but I'm not sure what we'd do with it | 15:24 |
dansmith | even still, I think the test timeout used to be high enough that things had to go way off the rails to hit it.. if we're only within ten minutes of it and a slightly non-uniform distribution is pushing us over the edge, we probably need to do something different anyway | 15:25 |
fungi | agreed, likely that this is a symptom of the jobs just running longer than they used to. more tests? slower software? additional resource consumption? our providers' systems got slower or more heavily loaded? | 15:27 |
dansmith | I think.. yes. :) | 15:28 |
dansmith | obviously more tests over time, no doubt | 15:28 |
fungi | we could probably collectively do a better job of cleaning up redundant tests or combining related tests in order to not have so much of a net increase over time | 15:45 |
dansmith | yeah, I was just thinking that some of these stable rescue tests that take a good long time maybe aren't all critical to run every time | 15:46 |
fungi | adding new tests is comparatively easy, but leads to tech debt if that's all we do | 15:46 |
dansmith | it's also desirable to just write new "start fresh" test cases for each thing because it's easier to grok the current state of an instance after fresh boot, but that's pretty expensive | 15:47 |
fungi | we probably also have a ton of tests with large overlap in functionality and comparatively small additions to coverage | 15:47 |
fungi | which results in a very small percentage of the job's runtime exercising unique parts of the codebase and the vast majority of the job redundantly retreading the same ground | 15:49 |
dansmith | yeah, these rescue tests for example, | 15:50 |
dansmith | there's a different test for each disk bus | 15:50 |
dansmith | and there are two virtio tests, one with a volume (non-root) attached and one without | 15:50 |
dansmith | the former is probably good enough for most runs | 15:50 |
fungi | if we were to be scientific about it, a given test takes x amount of time to cover y amount of otherwise untested functionality with z risk that a bug gets introduced into it by a new change, and there's some threshold between those factors that makes the test useful to run on each proposed change rather than periodically post-merge | 15:56 |
dansmith | right, well, I was also looking.. we do a fair amount of duplicating our tests for things like ide, | 15:57 |
dansmith | which AIUI is maybe not even supported on some enterprise linux distros anymore | 15:57 |
fungi | or restated, if the test is quick, covers a relatively large amount of untested code, or covers some particularly risk-prone area of the software then it should run on every change | 15:57 |
dansmith | so we could probably make that a post-merge periodic sort of deal | 15:57 |
dansmith | yeah | 15:57 |
dansmith | and I wonder if nova could stomach testing things for old ancient microversion compatibility less often | 15:59 |
opendevreview | Merged openstack/tempest master: Fix default values for variables in run-tempest role https://review.opendev.org/c/openstack/tempest/+/869440 | 17:24 |
opendevreview | Merged openstack/devstack stable/xena: Use proper sed separator for paths https://review.opendev.org/c/openstack/devstack/+/865842 | 17:24 |
*** jpena is now known as jpena|off | 17:36 | |
gmann | I do not think we have many duplicate tests. also it is difficult to say which ones are duplicates, especially considering the individual API operation tests required for interop capability checks | 19:05 |
gmann | dansmith: to minimize the job timeout, as you know we do mark long-running tests as slow and run them in a separate job | 19:07 |
gmann | we used to monitor those tests and mark them slow | 19:07 |
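(Aside: for context, tagging a test as slow in tempest is normally just an attribute decorator that the regular jobs exclude and the dedicated slow job selects. A minimal sketch, with an invented test class and method:)

```python
# Hedged sketch of how a tempest test is typically tagged as slow so the
# regular jobs can exclude it and a dedicated slow job can pick it up.
# The test class and method below are invented for illustration.
from tempest.lib import decorators
from tempest.scenario import manager


class TestExampleScenario(manager.ScenarioTest):

    @decorators.attr(type='slow')
    @decorators.idempotent_id('11111111-2222-3333-4444-555555555555')
    def test_long_running_scenario(self):
        # ...the actual test body would go here...
        pass
```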
dansmith | gmann: yeah I don't mean *really* slow tests, | 19:07 |
dansmith | I just mean lots of tests that take 45s - 1m add up over time | 19:07 |
gmann | dansmith: usually we consider a test running >200 sec as slow, but that is for scenario tests. an API test exercising a single operation that takes 60 sec is still slow | 19:08 |
dansmith | yeah, I know | 19:08 |
dansmith | a hundred 45s tests is over an hour of work.. | 19:09 |
dansmith | gmann: anyway, I'm just saying, I've looked at a few test timeout jobs today that look to me like nothing is particularly stuck, which makes me think we might just be hitting the point where we're testing almost as much as we can in the allotted time | 19:10 |
gmann | dansmith: yeah and seeing more timeouts on Ubuntu 22.04 can be another factor. | 19:11 |
gmann | I will say let's mark slow tests as slow, and if it is still happening then we can think of increasing the job timeout | 19:11 |
gmann | it is 7200s now | 19:12 |
gmann | which should be enough | 19:12 |
dansmith | yeah, I'm not saying we should raise the timeout, | 19:13 |
dansmith | I'm saying maybe we need to consider more buckets of tests | 19:13 |
dansmith | and maybe say "it's okay for us to test this periodic post-merge" | 19:13 |
dansmith | like I don't think we need a "rescue virtio" and "rescue virtio with data volume" test pre-merge.. the volume might fail when the root disk does not, but running both every time is probably too much work for the gain we get | 19:14 |
clarkb | one thing I've found when I have had time to debug slow tests is that we do suffer a lot from the thousand cuts problem. No one thing is slow, but we do lots of things and if they aren't quick it adds up. Also ya there is probably enough overlap in tests that you can bucket them better | 19:15 |
gmann | yeah, or we can split those tests in another job | 19:15 |
gmann | with periodic it seems we might not be able to catch some breaking changes, and that can create issues during release time | 19:16 |
dansmith | clarkb: right | 19:16 |
dansmith | gmann: right, which is why I'm saying we'd need to make tactical decisions about those things | 19:16 |
dansmith | like "X is nice but it's almost certainly covered by Y" | 19:17 |
gmann | yeah that can be better | 19:18 |
dansmith | test_resize_server_revert_with_volume_attached [255.241544s] | 19:19 |
dansmith | I mean, that's a long test man :) | 19:19 |
dansmith | test_resize_server_with_multiattached_volume [352.641738s] | 19:19 |
dansmith | those are probably fairly close in terms of coverage, or maybe could be made to be | 19:19 |
dansmith | that's ten minutes of one worker for two probably very similar tests | 19:20 |
dansmith | test_list_get_volume_attachments_multiattach [153.243739s] | 19:20 |
dansmith | and I bet we could just do a list in the multiattach resize case and cover this one ^ to get three minutes back | 19:21 |
dansmith | it's much cleaner to cover GET, PUT, POST, DELETE in separate tests, but when they take minutes, we probably need to be more tactical | 19:21 |
gmann | I agree and that is why we stopped accepting those single API operation tests which are/can be covered by scenario tests or functional tests on the project side | 19:22 |
dansmith | ack, I didn't know, but good to hear | 19:22 |
gmann | but for existing one, interop is the reason we need to keep those. | 19:22 |
dansmith | we probably could use some review | 19:23 |
dansmith | yeah | 19:23 |
dansmith | we could have a class of "just needed for interop" maybe and run those not pre-merge? | 19:23 |
gmann | even some negative tests that create a resource and check other operations in a negative way | 19:23 |
gmann | sure, not all are needed by interop. surely not this one: test_resize_server_revert_with_volume_attached | 19:24 |
dansmith | I should know, but is there a way to tell which of these are codified as interop-required? | 19:25 |
dansmith | this is one I was skeptical about earlier: test_stable_device_rescue_cdrom_ide [85.087746s] | 19:26 |
gmann | dansmith: or we can run all admin API tests in one separate job, and even more of the API tests | 19:26 |
dansmith | I *think* that people using q35 can't even use ide for cdrom | 19:26 |
dansmith | gmann: yeah, that might be good | 19:27 |
dansmith | gmann: maybe also not run a bunch of generic nova ones in the jobs that are specific to nova/cinder interactions | 19:27 |
dansmith | purely sharding though is just running *more* in parallel, which is good until we stop scaling that direction too | 19:27 |
dansmith | but it's certainly something we can do in the short term | 19:28 |
gmann | yeah | 19:28 |
gmann | let me push that and we can see how it looks | 19:29 |
dansmith | okay | 19:30 |
*** mtreinish_ is now known as mtreinish | 21:51 | |
*** noonedeadpunk_ is now known as noonedeadpunk | 21:51 | |
kopecmartin | dansmith: if you wanna see if a particular test is part of interop, just grep add-ons and guidelines dirs for that test in https://opendev.org/openinfra/interop | 23:24 |
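(Aside: a hedged Python equivalent of that grep, assuming a local clone of the interop repo at ./interop; the test name is just one of the examples from the discussion above.)

```python
# Hedged sketch: search a local clone of the interop repo for a test name,
# roughly equivalent to grepping the guidelines/ and add-ons/ directories.
# The clone path and the test name are assumptions for illustration.
import os

repo = './interop'
needle = 'test_list_get_volume_attachments_multiattach'

for subdir in ('guidelines', 'add-ons'):
    for root, _dirs, files in os.walk(os.path.join(repo, subdir)):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding='utf-8', errors='ignore') as f:
                    if needle in f.read():
                        print('referenced in', path)
            except OSError:
                pass  # skip unreadable files
```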