opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Change provided directory path in remove_old_dir function https://review.opendev.org/c/openstack/ci-log-processing/+/885340 | 07:11 |
opendevreview | Merged openstack/ci-log-processing master: Change provided directory path in remove_old_dir function https://review.opendev.org/c/openstack/ci-log-processing/+/885340 | 07:42 |
iurygregory | Hi all, anyone aware of py3 jobs reaching timed_out? | 15:16 |
iurygregory | I noticed in some ironic patches, and they are failing randomly .-. | 15:17 |
clarkb | iurygregory: I always encourage people to open up their job logs and see where the time is going before anything else | 15:17 |
clarkb | the job logs should have good timestamp information to see where it is going | 15:17 |
clarkb | it could be pip's dep solver being slow or unittests taking longer than expected or dkms compilation of deps etc etc | 15:18 |
fungi | or some dependency lacking prebuilt wheels for a recent release | 15:18 |
iurygregory | yeah we are looking at the logs trying to understand what is happening, because it seems random (on some patches it fails py38, others py39, others py310) | 15:22 |
fungi | basically look at timestamps and see if there are any large gaps in time | 15:24 |
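A minimal sketch of the timestamp-gap check fungi describes, assuming the `YYYY-MM-DD HH:MM:SS.ffffff | ...` line prefix that the job-output.txt excerpts later in this log use (the script, its filename argument, and the 60-second threshold are illustrative, not part of any existing tooling):

```python
#!/usr/bin/env python3
"""Report large gaps between consecutive timestamps in a Zuul job-output.txt."""
import sys
from datetime import datetime

THRESHOLD_SECONDS = 60  # flag any gap longer than a minute


def parse_ts(line):
    # Lines look like "2023-06-06 12:45:28.590470 | ubuntu-jammy | ..."
    try:
        return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


previous = None
with open(sys.argv[1]) as log:
    for line in log:
        current = parse_ts(line)
        if current is None:
            continue
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > THRESHOLD_SECONDS:
                print(f"gap of {gap:.0f}s before: {line.rstrip()}")
        previous = current
```

Run against the openstack-tox-cover log linked below, a check like this would flag the roughly 37-minute gap between the alembic migration line and the RESULT_TIMED_OUT line that fungi quotes at 15:39.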
clarkb | also py310 and newer are much quicker than older pythons at least for zuul | 15:24 |
clarkb | py310/py311 are almost half the runtime of the py38 jobs | 15:24 |
clarkb | so there may be some tuning of timeout values that needs happening if you see a similar spread | 15:25 |
rpittau | clarkb: the issue is happening at random points during the tests and with different python versions, py3.8 and py3.10 | 15:37 |
rpittau | the tests just stop at one point and then the job reaches timeout after 40 minutes | 15:37 |
rpittau | for example here https://c7cc7615691360b259ca-b4831d08abece0047714b6befcdb357a.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-cover/8dfaf99/job-output.txt | 15:37 |
fungi | 2023-06-06 12:45:28.590470 | ubuntu-jammy | INFO [alembic.runtime.migration] Running upgrade c0455649680c -> ac00b586ab95, Adds indexes to important and commonly matched columns. | 15:39 |
fungi | 2023-06-06 13:22:31.946089 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master] | 15:39 |
rpittau | yep | 15:39 |
fungi | so looks like that one stuck during a db migration? | 15:39 |
rpittau | it is not always the same place, for example here it's just after a random test https://c81b6237185e127ea93f-fe98a86d691c6300a22f04f47f232829.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-py38/889a3ab/job-output.txt | 15:39 |
fungi | any commonality in the node type? like using nested-virt or something? | 15:40 |
rpittau | for now they all look to be from ovh | 15:41 |
rpittau | 2023-06-06 12:32:23.021846 | localhost | Provider: ovh-bhs1 | 15:41 |
rpittau | heh nvm I found a different one Provider: inmotion-iad3 | 15:42 |
fungi | stable branches, master, both? any idea when it started/got worse? | 15:43 |
rpittau | at the moment it looks like only master, it started today, we haven't run much today on stable branches | 15:45 |
clarkb | given that evidence I'd start looking at the database | 15:46 |
clarkb | in particular newer sqla and databases are pickier about transactions and locks | 15:46 |
rpittau | clarkb: which evidence? :) | 15:46 |
clarkb | rpittau: that things get stuck performing database operations | 15:46 |
rpittau | that's just one | 15:46 |
clarkb | yes I understand | 15:46 |
rpittau | the others fail at different points | 15:46 |
clarkb | but it's the one you've shared so it's what I can work off of | 15:46 |
clarkb | it is entirely possible there are 30 different issues at play but we have to start somewhere | 15:47 |
clarkb | and sqla locking and transactions are known to have changed | 15:47 |
clarkb | I personally debugged and fixed a database thing in the zuul unittests that resulted in similar issues with tests hanging because locks were held preventing new schemas for additional tests from being created | 15:47 |
clarkb | basically start with what we do know and work from there | 15:48 |
clarkb | I've also seen mariadb be flakier than mysql for whatever reason. Haven't been able to run that down so zuul testing still uses mysql not mariadb | 15:49 |
fungi | maybe it's time to resurrect drizzle | 15:50 |
rpittau | clarkb: from what I can see the database migration error is just one out of at least 4, the others are different, I'm not sure that's the problem, it seems the time taken in those tests hasn't changed | 15:51 |
clarkb | rpittau: ok something is deadlocking and not making further progress. I don't know what. I'm just making suggestions based on what was shared. You will need to figure out where and why things are stopping in your unittests. Another approach may be to run things locally in a loop to see if you can reproduce. Or add additional debugging etc etc | 15:52 |
clarkb | looking at https://c81b6237185e127ea93f-fe98a86d691c6300a22f04f47f232829.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-py38/889a3ab/tmpgr5suvjz worker 7 seems to run fewer tests than the other 7 threads. The tests should be roughly balanced equally so that might be a clue (worker-7 is getting stuck and not completing) | 15:54 |
clarkb | do your unittests set an individual per-test timeout? That can help break things forcefully, giving you a clue about where they are getting stuck | 15:54 |
fungi | rpittau: also, in an effort to not duplicate work already going on, you might take a look at the #openstack-qa logs from friday where dansmith brought some similar concerns | 15:55 |
clarkb | I do not see a timeout in ironic/tests/base.py so that may be the first thing to do | 15:55 |
clarkb | many projects do this for this very reason. If your tests lock up and the job dies you get very little info. If you set a timeout like 1-5 minutes long to forcefully break the test case then you get a traceback of where things are stuck | 15:56 |
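As a rough illustration of the pattern clarkb is describing (not ironic's actual code; the class name and timeout value here are hypothetical), a per-test timeout in a testtools-based suite is typically a fixtures.Timeout added in the base class's setUp, so a hung test fails with a traceback showing where it was stuck:

```python
import fixtures
import testtools


class TimeLimitedTestCase(testtools.TestCase):
    """Hypothetical base class showing the per-test timeout pattern."""

    TEST_TIMEOUT = 120  # seconds; illustrative value

    def setUp(self):
        super().setUp()
        # If a test exceeds the limit, fixtures.Timeout (gentle mode) raises
        # an exception inside the test via SIGALRM, so the run fails with a
        # traceback of where it was hung instead of sitting silently until
        # the job-level timeout kills the whole build.
        self.useFixture(fixtures.Timeout(self.TEST_TIMEOUT, gentle=True))
```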
rpittau | fungi, clarkb, thanks for the advice, let's see if we can come up with something | 15:56 |
fungi | rpittau: the timeouts raised in the qa channel were related to tempest jobs, but if something has gotten systematically slower in openstack the impact could extend to unit testing as well | 15:58 |
rpittau | fungi: yeah, I'm reading through it, I'll try to raise up the issue there, thanks | 15:59 |
clarkb | 04eaebe701fa70bc3d483fe9281a2252c78ea14c says the oslotest base test classes may do the timeout for you | 16:00 |
clarkb | I would double check that assumption | 16:00 |
clarkb | ya that assumption is bad: you have to set OS_TEST_TIMEOUT (which doesn't appear to be set in tox.ini), and if you don't then it uses DEFAULT_TIMEOUT, which is set to 0 unless you override it in your class, which means no timeout | 16:02 |
clarkb | so ya update ironic/tests/base.py to set DEFAULT_TIMEOUT or set OS_TEST_TIMEOUT in tox.ini to something like 60 or 120 seconds and see what happens | 16:03 |
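A sketch of the first option clarkb suggests, assuming the ironic base test case inherits from oslotest's BaseTestCase and relying on the fallback behaviour described in the lines above (the class name and the 120-second value are illustrative, not ironic's actual code):

```python
# ironic/tests/base.py (sketch): per the discussion above, oslotest falls
# back to DEFAULT_TIMEOUT when OS_TEST_TIMEOUT is unset, and a value of 0
# disables the per-test timeout entirely.
from oslotest import base


class TestCase(base.BaseTestCase):
    # Forcefully fail any single test case that runs longer than 2 minutes.
    DEFAULT_TIMEOUT = 120
```

The other option mentioned is setting the environment variable from tox instead, e.g. a `setenv = OS_TEST_TIMEOUT=120` entry in the relevant tox testenv section, which affects every test class that honours that variable.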
rpittau | would that value be for all the unit tests? | 16:03 |
clarkb | it's based on your test class tree | 16:04 |
rpittau | oh right :/ | 16:04 |
rpittau | ok, thanks, I'll check that | 16:04 |