opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Change provided directory path in remove_old_dir function https://review.opendev.org/c/openstack/ci-log-processing/+/885340 | 07:11 |
opendevreview | Merged openstack/ci-log-processing master: Change provided directory path in remove_old_dir function https://review.opendev.org/c/openstack/ci-log-processing/+/885340 | 07:42 |
iurygregory | Hi all, anyone aware of py3 jobs reaching timed_out? | 15:16 |
iurygregory | I noticed in some ironic patches, and they are failing randomly .-. | 15:17 |
clarkb | iurygregory: I always encourage people to open up their job logs and see where the time is going before anything else | 15:17 |
clarkb | the job logs should have good timestamp information to see where it is going | 15:17 |
clarkb | it could be pip's dep solver being slow or unittests taking longer than expected or dkms compilation of deps etc etc | 15:18 |
fungi | or some dependency lacking prebuilt wheels for a recent release | 15:18 |
iurygregory | yeah we are looking at the logs trying to understand what is happening, because it seems random (on some patches it fails py38, others py39, others py310) | 15:22 |
fungi | basically look at timestamps and see if there are any large gaps in time | 15:24 |
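A minimal sketch of the timestamp-gap check fungi describes, assuming the `YYYY-MM-DD HH:MM:SS.ffffff | ...` line prefix that the job-output.txt excerpts later in this log use (the script, its filename argument, and the 60-second threshold are illustrative, not part of any existing tooling):

```python
#!/usr/bin/env python3
"""Report large gaps between consecutive timestamps in a Zuul job-output.txt."""
import sys
from datetime import datetime

THRESHOLD_SECONDS = 60  # flag any gap longer than a minute


def parse_ts(line):
    # Lines look like "2023-06-06 12:45:28.590470 | ubuntu-jammy | ..."
    try:
        return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


previous = None
with open(sys.argv[1]) as log:
    for line in log:
        current = parse_ts(line)
        if current is None:
            continue
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > THRESHOLD_SECONDS:
                print(f"gap of {gap:.0f}s before: {line.rstrip()}")
        previous = current
```

Run against the openstack-tox-cover log linked below, a check like this would flag the roughly 37-minute gap between the alembic migration line and the RESULT_TIMED_OUT line that fungi quotes at 15:39.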
clarkb | also py310 and newer are much quicker than older pythons at least for zuul | 15:24 |
clarkb | py310/py311 are almost half the runtime of the py38 jobs | 15:24 |
clarkb | so there may be some tuning of timeout values that needs happening if you see a similar spread | 15:25 |
rpittau | clarkb: the issue is happening at random points during the tests and with different python versions, py3.8 and py3.10 | 15:37 |
rpittau | the tests just stop at one point and then the job reaches timeout after 40 minutes | 15:37 |
rpittau | for example here https://c7cc7615691360b259ca-b4831d08abece0047714b6befcdb357a.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-cover/8dfaf99/job-output.txt | 15:37 |
fungi | 2023-06-06 12:45:28.590470 | ubuntu-jammy | INFO [alembic.runtime.migration] Running upgrade c0455649680c -> ac00b586ab95, Adds indexes to important and commonly matched columns. | 15:39 |
fungi | 2023-06-06 13:22:31.946089 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master] | 15:39 |
rpittau | yep | 15:39 |
fungi | so looks like that one stuck during a db migration? | 15:39 |
rpittau | it is not always the same place, for example here it's just after a random test https://c81b6237185e127ea93f-fe98a86d691c6300a22f04f47f232829.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-py38/889a3ab/job-output.txt | 15:39 |
fungi | any commonality in the node type? like using nested-virt or something? | 15:40 |
rpittau | for now they all look to be from ovh | 15:41 |
rpittau | 2023-06-06 12:32:23.021846 | localhost | Provider: ovh-bhs1 | 15:41 |
rpittau | heh nvm I found a different one Provider: inmotion-iad3 | 15:42 |
fungi | stable branches, master, both? any idea when it started/got worse? | 15:43 |
rpittau | at the moment it looks like only master, it started today, we haven't run much today on stable branches | 15:45 |
clarkb | given that evidence I'd start looking at the database | 15:46 |
clarkb | in particular newer sqla and databases are pickier about transactions and locks | 15:46 |
rpittau | clarkb: which evidence? :) | 15:46 |
clarkb | rpittau: that things get stuck performing database operations | 15:46 |
rpittau | that's just one | 15:46 |
clarkb | yes I understand | 15:46 |
rpittau | the others fail at different points | 15:46 |
clarkb | but it's the one you've shared so it's what I can work off of | 15:46 |
clarkb | it is entirely possible there are 30 different issues at play but we have to start somewhere | 15:47 |
clarkb | and sqla locking and transactions are known to have changed | 15:47 |
clarkb | I personally debugged and fixed a database thing in the zuul unittests that resulted in similar issues with tests hanging because locks were held preventing new schemas for additional tests from being created | 15:47 |
clarkb | basically start with what we do know and work from there | 15:48 |
clarkb | I've also seen mariadb be flakier than mysql for whatever reason. Haven't been able to run that down so zuul testing still uses mysql not mariadb | 15:49 |
fungi | maybe it's time to resurrect drizzle | 15:50 |
rpittau | clarkb: from what I can see the database migration error is just one out of at least 4, the others are different, I'm not sure that's the problem, it seems the time taken in those tests hasn't changed | 15:51 |
clarkb | rpittau: ok something is deadlocking and not making further progress. I don't know what. I'm just making suggestions based on what was shared. You will need to figure out where and why things are stopping in your unittests. Another approach may be to run things locally in a loop to see if you can reproduce. Or add additional debugging etc etc | 15:52 |
clarkb | looking at https://c81b6237185e127ea93f-fe98a86d691c6300a22f04f47f232829.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-py38/889a3ab/tmpgr5suvjz worker 7 seems to run fewer tests than the other 7 threads. The tests should be roughly balanced equally so that might be a clue (worker-7 is getting stuck and not completing) | 15:54 |
clarkb | do your unittests set an individual per-test timeout? That can help break things forcefully, giving you a clue about where they are getting stuck | 15:54 |
fungi | rpittau: also, in an effort to not duplicate work already going on, you might take a look at the #openstack-qa logs from friday where dansmith brought some similar concerns | 15:55 |
clarkb | I do not see a timeout in ironic/tests/base.py so that may be the first thing to do | 15:55 |
clarkb | many projects do this for this very reason. If your tests lock up and the job dies you get very little info. If you set a timeout like 1-5 minutes long to forcefully break the test case then you get a traceback of where things are stuck | 15:56 |
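As a rough illustration of the pattern clarkb is describing (not ironic's actual code; the class name and timeout value here are hypothetical), a per-test timeout in a testtools-based suite is typically a fixtures.Timeout added in the base class's setUp, so a hung test fails with a traceback showing where it was stuck:

```python
import fixtures
import testtools


class TimeLimitedTestCase(testtools.TestCase):
    """Hypothetical base class showing the per-test timeout pattern."""

    TEST_TIMEOUT = 120  # seconds; illustrative value

    def setUp(self):
        super().setUp()
        # If a test exceeds the limit, fixtures.Timeout (gentle mode) raises
        # an exception inside the test via SIGALRM, so the run fails with a
        # traceback of where it was hung instead of sitting silently until
        # the job-level timeout kills the whole build.
        self.useFixture(fixtures.Timeout(self.TEST_TIMEOUT, gentle=True))
```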
rpittau | fungi, clarkb, thanks for the advice, let's see if we can come up with something | 15:56 |
fungi | rpittau: the timeouts raised in the qa channel were related to tempest jobs, but if something has gotten systematically slower in openstack the impact could extend to unit testing as well | 15:58 |
rpittau | fungi: yeah, I'm reading through it, I'll try to raise up the issue there, thanks | 15:59 |
clarkb | 04eaebe701fa70bc3d483fe9281a2252c78ea14c says the oslotest base test classes may do the timeout for you | 16:00 |
clarkb | I would double check that assumption | 16:00 |
clarkb | ya that assumption is bad: you have to set OS_TEST_TIMEOUT (which doesn't appear to be set in tox.ini), and if you don't then it uses DEFAULT_TIMEOUT, which is set to 0 unless you override it in your class, which means no timeout | 16:02 |
clarkb | so ya update ironic/tests/base.py to set DEFAULT_TIMEOUT or set OS_TEST_TIMEOUT in tox.ini to something like 60 or 120 seconds and see what happens | 16:03 |
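A sketch of the first option clarkb suggests, assuming the ironic base test case inherits from oslotest's BaseTestCase and relying on the fallback behaviour described in the lines above (the class name and the 120-second value are illustrative, not ironic's actual code):

```python
# ironic/tests/base.py (sketch): per the discussion above, oslotest falls
# back to DEFAULT_TIMEOUT when OS_TEST_TIMEOUT is unset, and a value of 0
# disables the per-test timeout entirely.
from oslotest import base


class TestCase(base.BaseTestCase):
    # Forcefully fail any single test case that runs longer than 2 minutes.
    DEFAULT_TIMEOUT = 120
```

The other option mentioned is setting the environment variable from tox instead, e.g. a `setenv = OS_TEST_TIMEOUT=120` entry in the relevant tox testenv section, which affects every test class that honours that variable.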
rpittau | would that value be for all the unit tests? | 16:03 |
clarkb | it's based on your test class tree | 16:04 |
rpittau | oh right :/ | 16:04 |
rpittau | ok, thanks, I'll check that | 16:04 |