iurygregory | not sure if is magic or what, but https://review.opendev.org/c/openstack/ironic/+/885372 seems to help on my patch ^ | 00:13 |
---|---|---|
iurygregory | :D | 00:13 |
iurygregory | or if is just CI that is back to normal | 00:13 |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: Add DB API for Firmware and Object https://review.opendev.org/c/openstack/ironic/+/883062 | 01:54 |
iurygregory | removing Depends-On to see | 01:54 |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: Firmware Interface https://review.opendev.org/c/openstack/ironic/+/885276 | 02:00 |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: [WIP] RedfishFirmware Interface https://review.opendev.org/c/openstack/ironic/+/885425 | 02:52 |
opendevreview | Merged openstack/ironic master: execute on child node support https://review.opendev.org/c/openstack/ironic/+/880545 | 04:04 |
rpittau | good morning ironic! o/ | 06:27 |
rpittau | so no fails there after 2 runs, maybe it was a transient hiccup? zuul drunk? tox had a heavy lunch? | 06:28 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 06:38 |
rpittau | reduced the time to 3 minutes ^ | 06:38 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 06:48 |
rpittau | ah that failed finally! | 07:42 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 07:53 |
rpittau | looks like the timeout didn't work, or tests are not actually timing out but something else is | 07:54 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Migrate the inspector's /continue API https://review.opendev.org/c/openstack/ironic/+/875944 | 08:57 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Add the initial skeleton of the agent inspect interface https://review.opendev.org/c/openstack/ironic/+/877814 | 09:04 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [WIP] Very basic in-band inspection with the "agent" interface https://review.opendev.org/c/openstack/ironic/+/885450 | 09:08 |
dtantsur | rpittau: I wonder if we have any tests that run for too long | 09:17 |
* dtantsur tries locally | 09:17 | |
dtantsur | We could also consider running the bazillion of RBAC tests in a separate job only. | 09:19 |
dtantsur | ironic.tests.unit.drivers.modules.test_agent_client.TestAgentClient.test__command_poll 6.075 | 09:22 |
dtantsur | ironic.tests.unit.drivers.modules.ibmc.test_utils.IBMCUtilsTestCase.test_handle_ibmc_exception_retry 4.117 | 09:22 |
dtantsur | ironic.tests.unit.drivers.modules.ilo.test_power.IloPowerInternalMethodsTestCase.test__set_power_state_soft_reboot_fail_to_on 4.093 | 09:22 |
dtantsur | ironic.tests.unit.conductor.test_manager.UpdateNodeTestCase.test_update_node_interface_in_allowed_state 4.039 | 09:22 |
dtantsur | ironic.tests.unit.conductor.test_manager.DoNodeTearDownTestCase.test__do_node_tear_down_from_valid_states 3.231 | 09:22 |
dtantsur | ironic.tests.unit.drivers.modules.ilo.test_power.IloPowerInternalMethodsTestCase.test__set_power_state_soft_power_off_timeout 3.189 | 09:22 |
dtantsur | ironic.tests.unit.drivers.modules.ilo.test_power.IloPowerInternalMethodsTestCase.test__set_power_state_soft_reboot_timeout 3.118 | 09:22 |
dtantsur | ironic.tests.unit.drivers.modules.drac.test_power.DracPowerTestCase.test_set_power_state 2.620 | 09:22 |
dtantsur | ironic.tests.unit.drivers.modules.drac.test_power.DracPowerTestCase.test_set_power_state_timeout 2.605 | 09:22 |
dtantsur | ironic.tests.unit.api.controllers.v1.test_node.TestPost.test_create_node_specify_interfaces 2.350 | 09:22 |
dtantsur | Fixing these will shave half a minute off the run | 09:23 |
rpittau | dtantsur: we could definitely reduce the total time, but I'm not worried about that, but the fact that I don't actually see any test timing out and the limit is set to 15 secs | 09:25 |
rpittau | so we don't know the root cause of the random timeout | 09:26 |
dtantsur | ouch | 09:27 |
dtantsur | understood | 09:28 |
rpittau | I think OS_TEST_TIMEOUT is not working as it should, the limit should be 15s per test but ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_walk_versions took 43 secs to finish in py39 | 09:36 |
dtantsur | 43 is quite slow, although not impossibly slow | 09:54 |
rpittau | yeah, but it should fail since the timeout is 15 secs :) | 09:54 |
dtantsur | well, it's good that it does not :D | 09:54 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 10:08 |
opendevreview | Mahnoor Asghar proposed openstack/ironic master: Handle duplicate node inventory entries per node https://review.opendev.org/c/openstack/ironic/+/884608 | 10:28 |
rpittau | besides the fact that OS_TEST_TIMEOUT doesn't seem to work, I've noticed that ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_walk_versions has very weird runtimes that vary from 10-15 secs to over a minute in some cases | 10:59 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 11:05 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 11:07 |
iurygregory | good morning Ironic | 11:45 |
iurygregory | omg a lot of messages | 11:45 |
iurygregory | rpittau, I think you reached the timeout at least according to https://9dd8202499839c857d32-766877f31b96866d435c7c3ea23bc324.ssl.cf1.rackcdn.com/885372/7/check/openstack-tox-cover/a7d221d/testr_results.html | 11:55 |
rpittau | iurygregory: yeah, but not using OS_TEST_TIMEOUT | 11:56 |
iurygregory | yeah | 11:56 |
iurygregory | I was reading all the conversation you had and looking at the patch | 11:57 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 12:00 |
iurygregory | lol `gentle=True` | 12:00 |
opendevreview | Mahnoor Asghar proposed openstack/ironic master: Handle duplicate node inventory entries per node https://review.opendev.org/c/openstack/ironic/+/884608 | 12:09 |
iurygregory | rpittau, I'm trying to understand why we could have invalid timeout .-. (tox would do something wrong when setting OS_TEST_TIMEOUT? | 12:14 |
dtantsur | I'd be surprised if tox was aware of OS-specific variables | 12:15 |
iurygregory | we are using setenv so OS_TEST_TIMEOUT would be set (at least would be expected lol) | 12:16 |
dtantsur | iLO unit tests relying on precise timing of sleep() are :chef's kiss: | 12:54 |
rpittau | I'm trying with passenv this time | 12:57 |
rpittau | that should pass the value correctly | 12:58 |
rpittau | "should" | 12:58 |
iurygregory | dtantsur, wow | 12:58 |
rpittau | it doesn't lol | 12:58 |
dtantsur | \o/ | 13:01 |
dtantsur | So it's a cursed computing day today? | 13:02 |
rpittau | I don't see what's wrong, the OS_TEST_TIMEOUT value is passed when running tox so should be taken into account by TestCase | 13:02 |
dtantsur | rpittau: can some of our dependencies mess with it? like oslotest? | 13:02 |
rpittau | oslotest is the one that should actually make use of it :/ | 13:03 |
iurygregory | .-. | 13:03 |
rpittau | https://github.com/openstack/oslotest/blob/master/oslotest/base.py#L35-L45 | 13:04 |
iurygregory | (╯°□°)╯︵ ┻━┻ | 13:04 |
dtantsur | Hmm, but where is it handled? I don't see it mentioned in the code. | 13:06 |
dtantsur | ah, https://github.com/openstack/oslotest/blob/master/oslotest/timeout.py#L36 | 13:06 |
dtantsur | maybe it's too gentle then? :D | 13:06 |
rpittau | lol | 13:07 |
iurygregory | gentle=False :D | 13:07 |
* TheJulia wonders if sleeping in is the best choice | 13:07 | |
rpittau | the gentle means that it will actually throw an exception | 13:07 |
iurygregory | LOL | 13:07 |
rpittau | TheJulia: it is today :D | 13:07 |
dtantsur | absolutely | 13:07 |
iurygregory | yeah, let's go back to bed | 13:07 |
* TheJulia grabs cat and closes eyes | 13:07 | |
rpittau | by the tentacles of Cthulhu! locally it's working............ | 13:08 |
iurygregory | rpittau, I'm not surprised :D | 13:08 |
* rpittau runs away screaming | 13:08 | |
iurygregory | its the same scenario as "it works in devstack" :D | 13:09 |
dtantsur | I also absolutely love silently ignoring values like in https://github.com/openstack/oslotest/blob/master/oslotest/timeout.py#L39-L41 | 13:09 |
rpittau | yeah I'm going to remove that from my implementation (aka copy-paste) | 13:13 |
TheJulia | InsufficientCatGravityException :( | 13:13 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 13:17 |
* dtantsur wants to set oslo_service.loopingcall on fire | 13:28 | |
* iurygregory is afraid to ask why | 13:28 | |
dtantsur | iurygregory: it's pretty implicit and horribly overbloated for our case | 13:28 |
dtantsur | you see, it's supposed to be used asynchronously, hence its usage of green threads | 13:29 |
dtantsur | but we do things like LoopingCall().start().wait() | 13:29 |
dtantsur | which turns it into a complicated for loop with some eventlet inside | 13:29 |
iurygregory | jesus | 13:29 |
iurygregory | .-. | 13:29 |
dtantsur | yep. and I've just found that we use initial_delay=1 when waiting for power state. which in our case is just sleep(1) inserted into all power drivers. | 13:30 |
dtantsur | I don't remember if we did that on purpose... | 13:30 |
TheJulia | it sounds like something we might have done intentionally | 13:32 |
dtantsur | If so, I'd rather have us use time.sleep explicitly. At least the intention would be clear. | 13:32 |
dtantsur | (And don't let me start on using eventlet itself... I wish async Python was there when OpenStack started) | 13:33 |
TheJulia | dtantsur: same | 13:37 |
* TheJulia awaits dtantsur to pour gasoline on eventlet | 13:37 | |
dtantsur | I seriously wonder at which point the person behind eventlet decides to move on | 13:38 |
dtantsur | I suspect it's becoming an increasingly thankless job | 13:38 |
iurygregory | rpittau, seems like 1 works like a charm :D | 13:39 |
iurygregory | https://11f91ceea8fc4a8c24f4-01466d86d1bf7d9494b28327d57fdcc3.ssl.cf5.rackcdn.com/885372/9/check/openstack-tox-py310/e99f0e5/testr_results.html | 13:39 |
iurygregory | TypeError: 'str' object cannot be interpreted as an integer XD | 13:40 |
* TheJulia facepalms | 13:40 | |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 13:50 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 13:57 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Mock sleep in unit tests that rely on it https://review.opendev.org/c/openstack/ironic/+/885497 | 14:10 |
rpittau | well that's perfect | 14:10 |
rpittau | now I know what's happening | 14:11 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 14:12 |
iurygregory | rpittau, https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4d8/885372/10/check/openstack-tox-py38/4d8415e/testr_results.html | 14:25 |
iurygregory | it works \o/ | 14:25 |
rpittau | should work without the base changes | 14:25 |
rpittau | anyway, let's see the results | 14:26 |
iurygregory | py38 and py310 failed | 14:26 |
iurygregory | 39 still running | 14:26 |
rpittau | which is absurd | 14:26 |
rpittau | ah no, just slower to fail | 14:27 |
iurygregory | still has 4min remaning | 14:27 |
iurygregory | I do see some fixtures._fixtures.timeout.TimeoutException in the console | 14:29 |
rpittau | oooook so one last test and then I can remove the WIP | 14:38 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 14:39 |
iurygregory | failed | 14:48 |
rpittau | \o/ | 14:55 |
iurygregory | I'm wondering how we will choose the time | 14:56 |
rpittau | 45 secs | 14:56 |
rpittau | the highest I saw was 42 | 14:56 |
rpittau | and it's a lot | 14:56 |
iurygregory | yeah | 14:56 |
rpittau | anyway, first ice cream, then remove WIP | 14:56 |
iurygregory | ice cream ++ | 14:56 |
masghar | I should probably wait for rpittau to merge before re-checking my patch, right? https://review.opendev.org/c/openstack/ironic/+/884608 | 15:00 |
iurygregory | masghar, yeah | 15:01 |
masghar | Although my timeouts were 40 and 50 minutes respectively :O | 15:01 |
iurygregory | this is the timeout for the job itself | 15:01 |
iurygregory | we are trying to add timeouts to each unittest | 15:02 |
masghar | I see | 15:02 |
masghar | Thanks! | 15:02 |
iurygregory | at least if it fails we will get an idea of which test is taking too long so we can check | 15:02 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 15:03 |
masghar | Makes sense. But just out of curiosity, mine probably failed because the CI was busy, right? Because nothing timeout related has been merged to master yet? | 15:04 |
TheJulia | gah, our devstack plugin has been driving me crazy the last few days | 15:04 |
TheJulia | so I'm afraid to ask, but did a new eventlet drop or soemthing? | 15:05 |
iurygregory | masghar, CI is unhappy in py3 jobs and is hitting timed out without much explanation, so with rpittau patch hopefully we will be able to identify tests that are taking too long | 15:05 |
masghar | iurygregory: ooh okay. Makes more sense! | 15:06 |
iurygregory | TheJulia, are you having trouble with devstack in CI or locally? | 15:08 |
TheJulia | uhh, I've been so slammed with presentations, people messaging me with stuff, and people seeking ad-hoc help on issues, I've not done too much code wise this week | 15:09 |
TheJulia | I did run tests yesterday but had zero issues | 15:09 |
iurygregory | running locally the tests are working fine | 15:10 |
iurygregory | but this issue started to show since Jun 05 the timed_out one | 15:10 |
TheJulia | ... last eventlet was january... | 15:11 |
TheJulia | oh well | 15:11 |
TheJulia | CI makes sense if new package release because we may have older copies of $things locally | 15:11 |
TheJulia | I +2'ed dmitry's sleep change | 15:13 |
rpittau | mmmm wondering why we don't import the same env variables in cover tests | 15:26 |
rpittau | well not important now, I'll add that to a different patch | 15:27 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 15:28 |
rpittau | ^ should be ready | 15:28 |
iurygregory | ack | 15:28 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add test timout to tox config https://review.opendev.org/c/openstack/ironic/+/885372 | 15:30 |
rpittau | now :P | 15:30 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Use tox env variables in coverage tests https://review.opendev.org/c/openstack/ironic/+/885507 | 15:32 |
dtantsur | Long weekend starts here, see you on Monday and safe travels for those going to the Summit! | 15:35 |
rpittau | dtantsur: enjoy! :) | 15:35 |
iurygregory | enjoy dtantsur o/ | 15:35 |
opendevreview | Merged openstack/ironic master: Mock sleep in unit tests that rely on it https://review.opendev.org/c/openstack/ironic/+/885497 | 15:37 |
rpittau | see ya tomorrow ironicers! o/ | 15:38 |
iurygregory | bye rpittau o/ | 15:41 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 15:45 |
TheJulia | i like nova advertised their etherpad | 15:47 |
TheJulia | for the ptg | 15:47 |
JayF | we don't really have one other than the ironic-openinfra-2023 which I already linked on the ML | 15:48 |
JayF | oooh, like for the forum | 15:48 |
JayF | that's a good idea | 15:48 |
opendevreview | Julia Kreger proposed openstack/ironic stable/train: Fix Cinder Integration fallout from CVE-2023-2088 https://review.opendev.org/c/openstack/ironic/+/885065 | 16:15 |
JayF | TheJulia: so I found out something interesting the other day: apparently T/U/V/W fixes for Cinder and Nova were all done *downstream* | 16:20 |
JayF | TheJulia: which sounded weird given we're also experiencing breakage back further than that | 16:20 |
iurygregory | oh wow | 16:23 |
JayF | because that's some of the impetus for Cinder retiring old EM branches now | 16:23 |
JayF | because if they couldn't patch them for this nasty vuln, why do they exist | 16:23 |
iurygregory | *magic* | 16:24 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 16:28 |
iurygregory | I need to say, I love ZUUL! | 16:28 |
iurygregory | https://review.opendev.org/c/openstack/ironic/+/885507 timed_out | 16:29 |
JayF | Are you fishing for karma? | 16:29 |
JayF | nope, sarcasm | 16:29 |
iurygregory | <crying> | 16:29 |
JayF | it's listening to you Iury, you have to genuinely love Zuul for it to set you free ;) | 16:29 |
iurygregory | yes! | 16:29 |
JayF | TheJulia: iurygregory: FWIW I checked openstack/releases and openstack/requirements, looks like only recent things added, library wise, in CI should be oslo.messaging and tooz | 16:31 |
TheJulia | wow, that is... saddening | 16:32 |
JayF | I've said at least two sad things in the last ten minutes | 16:32 |
iurygregory | yeah I checked that also =( | 16:32 |
JayF | which one has saddened you? | 16:32 |
TheJulia | yes | 16:32 |
* TheJulia nods emphatically | 16:33 | |
clarkb | re the timeouts I note that oslo test only does a gentle timeout. You may need the non gentle version. Also, other approaches that can be taken: grab the subunit file for timed out jobs and see which tests haven't reported back. In theory one of those is the problem. You can also run the test suite in a loop locally to see if you can reproduce the error. Though it may be | 16:44 |
clarkb | timing specific | 16:44 |
iurygregory | clarkb, looking at the job with timed_out I don't see the subunit file there .-. | 16:53 |
iurygregory | https://zuul.opendev.org/t/openstack/build/568021ced3a64caaa6bc0a235d4f31c3/logs | 16:53 |
iurygregory | or is this tmp_69f9zxx? | 16:53 |
clarkb | yes that file | 16:53 |
iurygregory | ack, ty | 16:54 |
clarkb | when the test is running the subunit content is written to a temp file then when it completes it gets moved to its final destination. When the test suite crashes or is forcefully stopped that temp file isn't moved | 16:54 |
iurygregory | gotcha | 16:54 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 17:40 |
opendevreview | Julia Kreger proposed openstack/ironic master: Permit Ironic to notify IPA it can support MD5 https://review.opendev.org/c/openstack/ironic/+/882168 | 18:06 |
opendevreview | Merged openstack/ironic-tempest-plugin master: rbac - Fix vif_attach expected return values https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/883032 | 18:11 |
iurygregory | if anyone have time for review I appreciate feedback in https://review.opendev.org/c/openstack/ironic/+/883062 and https://review.opendev.org/c/openstack/ironic/+/885276/ o/ | 18:13 |
JayF | ah, I skipped yesterday because of V-1 | 18:14 |
JayF | but I need to stop filtering on that when our CI is going batty | 18:14 |
iurygregory | yeah =( this timed_out are a pain | 18:15 |
iurygregory | leaving for my physiotherapy session, will continue working at night | 18:17 |
TheJulia | is it just the tests with sleep or is it more tests? | 18:17 |
iurygregory | I saw it getting stuck in different tests | 18:19 |
iurygregory | https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py39&project=openstack%2Fironic&result=TIMED_OUT&skip=0 | 18:21 |
iurygregory | https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py38&project=openstack%2Fironic&result=TIMED_OUT&skip=0 | 18:21 |
iurygregory | https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py310&project=openstack%2Fironic&result=TIMED_OUT&skip=0 | 18:21 |
JayF | do we run unit tests in parallel by default? | 18:21 |
JayF | If so, I wonder if going single-threaded would expose anything | 18:21 |
iurygregory | yeah, I think by default we run in parallel | 18:23 |
ashinclouds[m] | 8 runners with parallel execution inside of them | 18:33 |
JayF | would be interesting to see if these repro serially | 18:36 |
TheJulia | I doubt they world, tbh. | 18:48 |
JayF | yeah, same | 18:49 |
JayF | I think I'm going to just start running tox in a loop on my computer, see if I can get one to hang lol | 18:49 |
TheJulia | I wonder if it would need to be inside a VM… | 18:50 |
TheJulia | Specifically because addressable memory cache behaves differently in a VM | 18:51 |
TheJulia | Which cascades out in some fun cases | 18:51 |
JayF | that would be a touch more difficult for my current setup, but I could make it happen | 18:53 |
* JayF is very close to setting up a dedicated VM host in his house | 18:53 | |
JayF | I wonder if we should just ask them to hold a node, and we can try to repro on that specific node | 19:03 |
JayF | instead of chasing environments | 19:03 |
JayF | idk | 19:03 |
TheJulia | ++++++++ | 19:05 |
JayF | maybe I'll get "lucky", but I just kicked off a run of 50 in a row locally since I'm going to go get some lunch lol | 19:09 |
JayF | 20 clean runs :| | 19:30 |
yde | hi again | 19:54 |
yde | i'm hitting an issue i don't know how to fix: when i power up unknown nodes, the dnsmasq ironic services ignores the mac address | 19:55 |
yde | and i can't enter "discovery" mode with auto inspection. i know from previous deployment this it totally feasible, but i can't remember which is the setup i have to make so that dnsmasq doesn't ignores unknown mac addresses | 19:56 |
TheJulia | Greetings | 19:57 |
yde | does it ring a bell to someone? i would really appreciate it cause i'm quite lost... i even asked chatgpt;) | 20:01 |
clarkb | I don't mind holding a node but I consider it a pretty big bug that ironic removed test case timeouts | 20:05 |
clarkb | I'm like 99% sure that test case timeouts would identify the problem quickly but they were removed and now you timeout the entire job | 20:05 |
clarkb | I added those timeouts across openstack almost a decade ago because they are pretty important in a CI environment like ours | 20:05 |
clarkb | basically lets not forego fixing systematic issues if we identify the specific cause more quickly | 20:06 |
yde | there is this file which must be setup by the conductor or inspector dhcp-hostsdir/unknown_hosts_filter | 20:11 |
yde | *:*:*:*:*:*,ignore | 20:11 |
yde | i don't know how to disable it | 20:12 |
yde | oups. found my mistake... i overrided conf in kolla-ansible in the ironic.conf file, not the inspector.conf. | 20:19 |
JayF | clarkb: none of us working on Ironic now, for the most part, were here a decade ago fwiw. I don't know why those were removed, but I'm happy to go down the path to check out why they were removed+maybe readd them. | 20:26 |
* JayF is getting close, but late 2004 is still only 8.5y | 20:27 | |
clarkb | JayF: I found the change that did it. It replaced it with the oslo test stuff but didn't enable it. | 20:27 |
clarkb | Or at least that appears to be what happened. I don't think it was entirely intentional more just that we shouldn't continue to ignore it since it is a very useful tool when you have automated testing | 20:27 |
clarkb | the oslo test class may also be flawed in that it only does a graceful timeout as well... | 20:28 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Add additional logging on iLO power failure https://review.opendev.org/c/openstack/ironic/+/885549 | 20:28 |
clarkb | I think the graceful timeout cannot trigger if eventlet has gone out to lunch ofr example | 20:29 |
clarkb | (becaue nothing ever handles the signal handler? I seem to recall this was a problem at one time) | 20:29 |
JayF | I am split attention right now so I can't look in depth, but would you mind putting a bug in launchpad about this and assigning it to me? | 20:31 |
JayF | if you can't now, I'll try to remember | 20:31 |
clarkb | a bug to reenable timeouts? rpittau already has a change for it. I was hoping it would get merged as the first step in debugging this | 20:32 |
JayF | that change did not appear to effectively enable timeouts, I don't think | 20:33 |
clarkb | I think it did? there was a whole set of failed jobs when the value was set to 1 | 20:33 |
clarkb | ya patchset 11 | 20:36 |
clarkb | patchset 10 will be worth revisting if the gentle timeout is insufficient for this specific issue (because you can flip it to gentle=False) | 20:37 |
clarkb | patchset 4 had a job timeout (the only one?) and the 15 second test case timeout doesn't seem to have worked there so ya gentle=False may be required | 20:44 |
clarkb | these extra tests ran in the patchset 4 py39 job but did not run before the py310 job timed out https://paste.opendev.org/show/bPbFfzlyhZfXS1SPe4dJ/ | 21:09 |
clarkb | I sorted them to make diffing easier so don't read into the order there. But I do notice that db tests are included (I suspected them when rpittau first asked questions in the infra channel) | 21:09 |
clarkb | another approach may be to have stestr record the tests in order for each worker before beginning the tests (I don't know if it can do that) | 21:10 |
clarkb | but then you'd be able to see which test was supposed to run next | 21:10 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DO NOT MERGE: See if Unit tests fail more interestingly without wal https://review.opendev.org/c/openstack/ironic/+/885550 | 21:13 |
JayF | dtantsur: iurygregory: ^ Just doing some science, seeing if disabling WAL will give us a more meaningful error | 21:13 |
TheJulia | blah... I've been so busy I've barely looked at IRC today | 21:14 |
TheJulia | yde: yeah, that all gets set as inspector configuration | 21:14 |
opendevreview | Clark Boylan proposed openstack/ironic master: DNM terrible unittest bisecting https://review.opendev.org/c/openstack/ironic/+/885552 | 21:50 |
clarkb | JayF: ^ thats a thing that is a mega hack | 21:51 |
clarkb | but illustrates how to use the CI system and speculative testing to your advantage | 21:51 |
clarkb | heh I think that shows your db tests cannot run in isolation? | 21:59 |
clarkb | 10 of them anyway | 21:59 |
TheJulia | We can't disjoint them if that is what your saying, the db layer needs to be loaded | 22:01 |
clarkb | TheJulia: I did `stestr run --slowest ironic.tests.unit.db` which I would expect to be fine | 22:02 |
clarkb | even if something needs to be loaded each test should be responsible for laoding its dependencies so they can run in isolation | 22:03 |
TheJulia | that is true | 22:03 |
TheJulia | JayF: I can look at unit testing with you tomorrow | 22:03 |
JayF | clarkb: thanks for the examples, this is going to be my focus tomorrow unless someone fixes it while I'm not here :D | 22:04 |
JayF | (those are my favorite kind of fixes, the kind that appear overnight while the europeans trounce about in the stack :D ) | 22:04 |
clarkb | basically I'm trying to go through the list of tests that didn't run in a timed out job: https://paste.opendev.org/show/bPbFfzlyhZfXS1SPe4dJ/ and see if any of those result in a timeout but its a bit of a non starter if the tests don't even run in isolation | 22:04 |
clarkb | I'm extra suspicuous of the db tests now because they seem to require external side effects to run at all | 22:05 |
JayF | https://github.com/openstack/ironic/blob/master/tools/test-setup.sh is your method somehwo skipping over running this? | 22:07 |
JayF | Hmm. That's just a helper for us, not run in CI. | 22:08 |
JayF | I assume we just get that as part of CI setup then? | 22:08 |
opendevreview | Clark Boylan proposed openstack/ironic master: DNM terrible unittest bisecting https://review.opendev.org/c/openstack/ironic/+/885552 | 22:08 |
clarkb | JayF: that is run in CI | 22:09 |
JayF | O dpm | 22:09 |
JayF | **I don't see it referenced in ironic in git | 22:09 |
JayF | or at least I should say; `git grep test-setup.sh` does not bring up anything in our test setup | 22:09 |
clarkb | https://zuul.opendev.org/t/openstack/build/996a1102472d423396355a121889c839/log/job-output.txt#478 you can see it run there | 22:09 |
clarkb | ya its part of the parent job setup that is common to openstack (maybe all of the python jobs?) | 22:09 |
JayF | that's what I assumed, glad to have that confirmed then | 22:10 |
clarkb | the migration tests should only run if a mysql/postgres/mariadb is present so I think it is detecting hte DB is there but then failing on something else | 22:10 |
JayF | yeah, ack | 22:11 |
JayF | I need to look at this with a full brain, not the leftovers on the dangling end of my day | 22:11 |
JayF | I'm going to pick this up 7am my time tomorrow :) (about 16.75 hours from now :P) | 22:11 |
clarkb | ok I've gone ahead and started bisecting the list of tests that didn't run further. So newer patchsets won't have the db stuff in it | 22:12 |
JayF | fwiw that job testing disabling WAL did not impact the shape of the failure; not that surprising | 22:19 |
TheJulia | interesting, this aiui started on the 5th | 22:26 |
TheJulia | and we basically only merged a warning addition on the 5th :\ | 22:28 |
TheJulia | i guess I'm lamenting there is no "silver bullet" | 22:28 |
JayF | TheJulia: if you have an early failing example, and we have a late success example, we can compare pip freeze output | 22:31 |
clarkb | the db fixture stuff uses testresources to avoid recreating a new db schema for each test that needs one. My hunch is that some other test class is providing necessary info to make this happen and then db migrations are able to piggy back off of that but unable to bootstrap things themselves | 22:32 |
JayF | oh no | 22:36 |
JayF | https://pypi.org/project/tox/#history | 22:36 |
opendevreview | Clark Boylan proposed openstack/ironic master: DNM terrible unittest bisecting https://review.opendev.org/c/openstack/ironic/+/885552 | 22:37 |
JayF | clarkb: I'll note ^ tox release just happened to coincide with our failures beginning | 22:38 |
* JayF wonders if clarkb has /ignore *tox* at this point :P | 22:38 | |
JayF | tests are using 4.6.0 | 22:39 |
JayF | it's impossible to even test this in our CI, isn't it? because tox will just yolo-install newer tox | 22:39 |
JayF | although tox 4.6.0 on python 3.11 does have tests passing for me locally, this is still highly suspect | 22:40 |
JayF | the diff makes me think this is a red herring, but the timing is still perfect https://github.com/tox-dev/tox/compare/4.5.2...4.6.0 | 22:42 |
clarkb | JayF: I mean I use nox | 22:46 |
clarkb | tox yolo installing a newer tox happens if you have tox setup requires or whatever it is called | 22:46 |
clarkb | I don't see that in ironic's tox.ini | 22:46 |
JayF | ensure-tox in zuul is installing 4.6.0 in your tests | 22:47 |
JayF | per my spot check like, 10 minutes ago | 22:47 |
clarkb | yes you'd have to override that somewhere | 22:47 |
JayF | looking at the diff, I feel pretty safe thinking it's not tox | 22:47 |
JayF | or at least, not something unique enough that it wouldn't blow the whole stack instead of just ironic :) | 22:47 |
clarkb | JayF: are you looking at something other than the job timeouts? | 22:47 |
clarkb | because the job timeouts occur in the unittests not tox I think (though I guess it could be tox) | 22:48 |
JayF | I'm looking from a different angle for right now; basically trying to answer the "What changed June 5?" question | 22:48 |
clarkb | ah | 22:48 |
clarkb | I think it is unlikely for tox to be the problem here bceause fewer tests are running in the timeout case its not like all tests finish and then tox fails to exit appropriately. But I guess I can't rule it out completely since tox has been known to do weird hings to the runtime | 22:48 |
JayF | same; but I was at the end of that thread and thought "what the hell" and it lined up | 22:49 |
clarkb | My brain is melting a bit against the oslo db test fixture stuff. the complete lack of logging for magical behaviors is less than ideal | 22:49 |
JayF | after checking things more likely to be to blame, I mean | 22:49 |
clarkb | it should be checking for a mysql server and trying to connect to it and if that works adding it to the list of available resources. Similar with postgres | 22:49 |
JayF | I'm going to approach it more similarly to you tomorrow, but bluntly, my brain is burned and I'm trying to squeeze usefulness out of it until my EOD :) | 22:50 |
clarkb | but somewhere along the line that is breaking and there is no logging and I'm not going to spin this up locally so that I can pdb it or hack in some logging | 22:50 |
JayF | I don't think any human has repro'd the job failures locally at all | 22:50 |
JayF | I certainly haven't, not for lack of trying | 22:50 |
JayF | I really think we'll find it's something inherent to the CI setup, but until I find a smoking gun that's just me trying to blame someone not-me for my problems LOL | 22:50 |
clarkb | ya I mean where I'm ending up is that the test suite isn't really built for humans to run it locally or in smaller subsets of tests right now unfortunately :/ | 22:51 |
clarkb | which I guess shouldn't be surprising because we only enforce that the ntire thing runs with the gate | 22:52 |
JayF | sometimes "if it's not tested in CI, it's broken" is as much a self-fulfilling prophecy as it is a valid adage | 22:53 |
opendevreview | Clark Boylan proposed openstack/ironic master: DNM terrible unittest bisecting https://review.opendev.org/c/openstack/ironic/+/885552 | 22:54 |
clarkb | I think https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/provision.py#L255 is buggy because _no_engine_reason is only set immediately before an exception is raised previously so we can only hit that line of code without that value being set which resuls in AttributeError: 'Backend' object has no attribute '_no_engine_reason' | 23:03 |
clarkb | ya that should have a catch all Exception handler not just for Backend Not available... | 23:04 |
clarkb | Whatever the underlying issue is is completely hidden from us gg | 23:04 |
opendevreview | Clark Boylan proposed openstack/ironic master: DNM terrible unittest bisecting https://review.opendev.org/c/openstack/ironic/+/885552 | 23:13 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!