rpittau | good morning ironic! o/ | 08:19 |
---|---|---|
rpittau | JayF, clarkb, TheJulia, about the timeout/failures in unit tests, I'm more and more oriented towards some DB shenanigans, as suggested by clarkb | 10:17 |
rpittau | if you look at my patch (besides the weird failure) the migration tests have very high runtimes, which are absolutely unjustified | 10:17 |
rpittau | I've observed the migration tests runtimes going from 12-13 secs to 42 secs completely randomly and I suspect at some point the DB hangs and the timeout happens | 10:17 |
rpittau | btw I think we should merge https://review.opendev.org/c/openstack/ironic/+/885372?usp=dashboard anyway :) | 10:24 |
rpittau | JayF: if you want to test with an older version of tox I think you can use the requires option in tox.ini in the tox section | 10:28 |
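As a hedged aside on the line above: the `requires` option in the `[tox]` section lets tox provision itself into a pinned version, so an older tox can be tried without touching the globally installed one. The version pin below is purely an example, not something stated in the log.

```ini
[tox]
# hypothetical pin for testing with an older tox; adjust or remove as needed
requires =
    tox<4.6
```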
opendevreview | Merged openstack/python-ironicclient master: Add support for node inventory https://review.opendev.org/c/openstack/python-ironicclient/+/883833 | 11:35 |
JayF | rpittau: tox was completely a guess that mostly has shaken out as a wrong guess | 14:02 |
rpittau | good news everyone! I was able to reproduce the timeout locally | 14:03 |
rpittau | "good" is relative in this case | 14:04 |
JayF | it's worthy of celebration for sure | 14:04 |
JayF | did you do anything magical to repro it? | 14:05 |
rpittau | nothing special, I tried the KISS approach and it worked | 14:06 |
rpittau | created a VM and reproduced the workflow | 14:06 |
rpittau | second attempt and tox just stops at some point | 14:06 |
rpittau | no error message, nothing | 14:06 |
rpittau | just defunct | 14:06 |
rpittau | I don't really see anything wrong so far in the VM, it's not stuck. only the terminal with tox is not moving anymore | 14:07 |
rpittau | lots of zombie processes | 14:09 |
rpittau | was there a stestr upgrade recently? | 14:09 |
JayF | no | 14:13 |
JayF | I'd suggest patching oslo.db in that venv as clarkb suggested yesterday | 14:13 |
JayF | since we know that could cause things to stop | 14:13 |
JayF | no new releases or updates to requirements that map to the timeline of unit tests beginning to fail | 14:14 |
JayF | tox release was on the same day but I think we've mostly ruled that out | 14:14 |
rpittau | ok, let me do another test first | 14:15 |
rpittau | I can reproduce it pretty much every time now | 14:23 |
opendevreview | OpenStack Release Bot proposed openstack/ironic bugfix/22.0: Update .gitreview for bugfix/22.0 https://review.opendev.org/c/openstack/ironic/+/885588 | 14:25 |
clarkb | cool having a reproduction case is very helpful | 15:24 |
clarkb | rpittau: what I would do is grab the list of tests fed to the stestr workers. Then see what tests ran and which ones are up next | 15:24 |
clarkb | rpittau: the tests that are up next would be highly suspicious since in theory they are/were the ones running (depending on the state things end up in) | 15:25 |
rpittau | JayF, clarkb, I patched oslo.db, ran the tests twice and it didn't timeout, but the database tests took over a minute to complete | 15:25 |
rpittau | each of them | 15:25 |
clarkb | ya I mean they can be slow that doesn't necessarily mean they are the problem causing the entire job to timeout (since you have a lot of headroom on the job timeout) | 15:26 |
clarkb | rpittau: what patch did you apply? | 15:26 |
clarkb | extra logging? | 15:26 |
rpittau | considering the number of db tests we have I think the jobs time out because they effectively take too long, so they exceed the 40-minute timeout value | 15:27 |
clarkb | being able to reproduce it is the main thing. Hopefully you're able to track it down pretty quickly now | 15:27 |
clarkb | rpittau: I don't think that is the case looking at the job logs | 15:27 |
clarkb | rpittau: if you look at the job logs you see that the last test case completes like half an hour before the job is timed out | 15:27 |
clarkb | something in the test suite is getting stuck and not progressing | 15:27 |
clarkb | it may still be that it is the db tests getting stuck but I don't think it is that they take too long generally. More like a deadlock etc | 15:28 |
rpittau | mmm I'm seeing tox stuck again, I'll leave it be for a while | 15:29 |
clarkb | rpittau: I would definitely try to determine which specific test cases are running when it gets stuck | 15:29 |
clarkb | and then you can sort out why they are stuck etc | 15:30 |
rpittau | yeah, I'm trying to understand that | 15:31 |
clarkb | rpittau: if you look at ps -elf | grep stestr or similar there should be stestr processes with --load-list arguments. I think each of those load-list arguments is a file containing a set of tests in the order to run them | 15:35 |
rpittau | yep, looking at stestr options to help with that | 15:36 |
clarkb | rpittau: if you then look in .stestr/tmp* for the current run you should see the log showing what tests have been run for each worker and you can cross check against the load list files to see what is currently running | 15:36 |
JayF | I setup a local mysql, trying to get in on the reproduction game | 15:36 |
clarkb | It might also be possible to issue a sigterm/^C to cause the test run to raise an exception wherever it is stuck as well and then you'd see in the traceback where it was at | 15:36 |
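A rough sketch of the worker inspection clarkb outlines above (exact paths and flags vary by stestr version; the load-list path below is a placeholder):

```shell
# Find the stestr worker processes and note any --load-list <file> arguments.
ps -elf | grep stestr
# Each load-list file holds the ordered tests assigned to one worker (placeholder path).
cat /tmp/tmpXXXXXX/worker-0-load-list
# .stestr/ keeps per-run state; the tmp* files record the tests that already finished.
ls .stestr/
# Compare the finished tests against the load list: the next unfinished entry is the
# test that is (or was) running when the run got stuck.
```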
JayF | rpittau: are you doing anything other than setting up mysql? also using postgresql? | 15:39 |
JayF | rpittau: what python version | 15:39 |
rpittau | JayF: just installing bindep deps, then running tools/test-setup.sh, using py 3.10 | 15:40 |
JayF | ack; so you have a running postgres and mysql server? | 15:40 |
JayF | I will setup postgres too | 15:40 |
yde | hi, I'm trying to make ironic+neutron work with a networking generic switch driver to put servers into neutron managed networks (vlans) but there's something I really don't get in the workflow | 15:40 |
rpittau | JayF: both yeah | 15:40 |
yde | which "dhcp+tftp" service is supposed to answer PXE requests? | 15:41 |
yde | during enrollment/inspection, the ironic dnsmasq + tftp servers do the job, boot the IPA and the inspector reports what it found | 15:42 |
yde | but then it auto adds all the MACs to a deny list | 15:42 |
yde | and when I boot an instance, neutron+ironic create a port in the neutron-managed "provisioning" network | 15:42 |
yde | which i also created with a dhcp agent | 15:43 |
yde | but this dnsmasq doesn't have all the necessary stuff to answer PXE requests, right? | 15:43 |
rpittau | clarkb: I don't see any --load-list in the stestr process | 15:45 |
rpittau | it's using the tests dir directly | 15:46 |
TheJulia | yde: you need to configure/establish a specific/dedicated endpoint for tftp | 15:47 |
yde | somehow, the enrolled+available baremetal host is properly discovered and inspected. then when I boot an instance through nova, neutron binds it a port in the admin network (which has a dhcp agent running in its ip netns), the ports on the switch are properly set up to the correct vlan id | 15:47 |
yde | and the server reboots, sends its DHCP requests, and nobody replies | 15:48 |
JayF | rpittau: you using mariadb, not mysql, yeah? | 15:48 |
* JayF trying to get this to reproduce and looking for differences | 15:48 |
rpittau | I'm using mysql | 15:48 |
JayF | dang, so am I | 15:48 |
rpittau | just using what's in CI | 15:48 |
JayF | is it happening every time for you? | 15:48 |
rpittau | in bindep | 15:48 |
rpittau | yes | 15:48 |
rpittau | so now it failed and I think we're onto something | 15:49 |
opendevreview | Merged openstack/ironic bugfix/22.0: Update .gitreview for bugfix/22.0 https://review.opendev.org/c/openstack/ironic/+/885588 | 15:49 |
yde | TheJulia: thanks. i have one, the one setup by kolla-ansible. but where do i have to setup the ipxe script ? | 15:49 |
JayF | if there's a way I can help other than by working independently to try and get a thing to fail, let me know | 15:49 |
rpittau | JayF: posting the error I found | 15:50 |
yde | I mean, if it's not the ironic-dnsmasq agent (which has an inspector.ipxe script setup), where is the ipxe script for deployment? and who shall serve it | 15:50 |
rpittau | JayF, clarkb : https://paste.openstack.org/show/bn06iH0TXNfw4bAB78ns/ | 15:50 |
yde | ? | 15:50 |
JayF | those are served from the conductor server, generally | 15:50 |
yde | which dnsmasq? the one in the qdhcp namespace of the provisioning network? or the one of the ironic-dnsmasq agent? | 15:51 |
JayF | rpittau: ...I feel like that error is disappointingly less useful than I hoped | 15:55 |
JayF | rpittau: did you ID anything interesting in there I might have missed? | 15:56 |
rpittau | mmm no, but I have the processes with the lists now | 15:57 |
clarkb | rpittau: JayF it is in eventlet | 16:03 |
TheJulia | yde: the dnsmasq launched in the neutron namespace can’t take tftp. The config for ironic-conductor should include a tftp_address which it puts in the dhcp response payload | 16:03 |
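A minimal sketch of the conductor-side configuration TheJulia refers to, assuming the standard `[pxe]` options in ironic.conf (the address and path below are placeholders):

```ini
[pxe]
# address put into the DHCP response so nodes know where to fetch boot files from
tftp_server = 192.0.2.10
tftp_root = /tftpboot
```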
clarkb | waiting to read bytes. I think if there is a lock on the db side the test side might be waiting for the lock to release | 16:04 |
clarkb | rpittau: maybe next is to check the db state when it is stuck and see if there are locks preventing work from happening in the db | 16:04 |
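One hedged way to follow that suggestion and inspect MySQL lock state while the run is hung (assumes a local MySQL 5.7+/8.x with the sys schema available; credentials are placeholders):

```shell
# Sessions that are sleeping or stuck in a long-running statement
mysql -u root -e "SHOW PROCESSLIST;"
# InnoDB lock waits and the most recent detected deadlock, if any
mysql -u root -e "SHOW ENGINE INNODB STATUS\G"
# Which transaction is blocking which (requires the sys schema)
mysql -u root -e "SELECT * FROM sys.innodb_lock_waits\G"
```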
rpittau | clarkb: I can see it's very slow at the moment | 16:04 |
JayF | what OS do we use in the gate? | 16:04 |
JayF | jammy, yeah | 16:05 |
rpittau | ubuntu focal and jammy | 16:05 |
JayF | the thing I keep rebounding back to | 16:07 |
JayF | is why now | 16:08 |
JayF | why did it break 6/5 | 16:08 |
JayF | https://github.com/eventlet/eventlet/issues/798 | 16:09 |
JayF | I think that's not impacting us here, because the switches complete successfully in your traceback | 16:09 |
rpittau | so it was stuck for a while then it started again briefly and it looks like it's running a migration | 16:16 |
rpittau | I think it's stuck on ironic.tests.unit.db.sqlalchemy.test_migrations.TestWalkVersions.test_walk_versions_all_false | 16:16 |
rpittau | no nvm that's completed | 16:16 |
rpittau | that's the last migration test completed | 16:17 |
rpittau | I'll leave it running for the rest of the evening | 16:19 |
rpittau | need to split now | 16:19 |
rpittau | good night! o/ | 16:19 |
JayF | o/ I'm going to keep trying to get a local reproducer :( | 16:20 |
rpittau | I think I found the culprit {1} ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_upgrade_twice [1792.906012s] ... ok | 16:25 |
rpittau | this takes 30 minutes to complete | 16:25 |
rpittau | in the end it completes, but of course it goes into timeout in CI | 16:26 |
* rpittau runs away screaming in SQL | 16:26 |
yde | TheJulia: thanks a lot. I already have a tftp server up and running and set up in the [pxe] section of the conductor | 16:27 |
clarkb | in https://zuul.opendev.org/t/openstack/build/00d4e534d78448a99b7a284baf6a869a/log/job-output.txt#57860 it only took 16s | 16:27 |
clarkb | rpittau: ^ so maybe it is only slow sometimes? But that would certainly do it | 16:27 |
yde | TheJulia: what I don't get is how the tftp info is pushed to the server, as it doesn't seem to be set up in the dnsmasq conf of the qdhcp neutron network | 16:29 |
yde | TheJulia: whereas it is, in the ironic-dnsmasq service | 16:29 |
TheJulia | yde: when a port attachment occurs in the lifecycle, the neutron port gets updated with the parameters | 16:30 |
TheJulia | So dnsmasq gets told by neutron | 16:30 |
TheJulia | And it is dhcp record-lease specific as a result | 16:30 |
yde | TheJulia: ok thanks | 16:31 |
yde | ill check that | 16:31 |
JayF | rpittau: clarkb: I'm going to propose a change to temporarily skip that test, and look and see if we can make it behave better. | 16:46 |
JayF | that test looks bad on the face of it | 16:53 |
JayF | I'm going to see if it's a simple fix | 16:53 |
JayF | https://bugs.launchpad.net/ironic/+bug/2023316 to track this, since I'm going to propose disabling a test | 17:00 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DNM: Testing to see if disabling this test lets things pass https://review.opendev.org/c/openstack/ironic/+/885600 | 17:02 |
JayF | pushed it up with integration tests disabled because I'm still sus, if it passes I'll make it mergeable | 17:02 |
JayF | it's passed | 17:17 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Disabling test_upgrade_twice temporarily for CI fix https://review.opendev.org/c/openstack/ironic/+/885600 | 17:19 |
JayF | TheJulia: rpittau: others: ^ workaround for unit test failures | 17:19 |
JayF | after my afternoon meetings (and lunch) I'll start digging to see if there is a real fix, but it only causes us harm to have CI broken | 17:19 |
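For reference, a minimal sketch (not the actual 885600 change) of how a single unit test can be skipped temporarily while the underlying issue is investigated:

```python
import unittest


class TestMigrationsMySQL(unittest.TestCase):
    # Illustrative only; the real ironic test class has different base classes and a real body.
    @unittest.skip("temporarily disabled while the CI hang is investigated; see bug 2023316")
    def test_upgrade_twice(self):
        self.fail("would normally walk the migrations twice")


if __name__ == "__main__":
    unittest.main()
```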
opendevreview | Jay Faulkner proposed openstack/ironic master: Disabling test_upgrade_twice temporarily for CI fix https://review.opendev.org/c/openstack/ironic/+/885600 | 17:21 |
JayF | https://review.opendev.org/c/openstack/ironic/+/885600 is V+1 and seemed to run a lot quicker, too | 17:57 |
JayF | rpittau: saw your comment. That the tests ran so fast is heartening, I think | 21:09 |
JayF | Heads up: when looking at the unit test failures, and digging thru some issues, I saw something that's likely to break oslo.service's SSL support in eventlet for py 3.12 | 21:35 |
JayF | so we'll have to be on the lookout for that when 3.12 comes down the pipe | 21:35 |
TheJulia | Fun :( | 21:36 |
JayF | if we're lucky, it'll just be differential eventlet versions depending on py version | 21:37 |
JayF | I also think we only use that in IPA, not Ironic | 21:37 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DNM: Ensure upgrade twice test uses two engines https://review.opendev.org/c/openstack/ironic/+/885661 | 21:45 |
TheJulia | yde: is dnsmasq logging anything? | 21:50 |
TheJulia | yde: nevermind, i opened an old irccloud window which I guess had out of date logs | 21:50 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Ensure upgrade twice test uses two engines https://review.opendev.org/c/openstack/ironic/+/885661 | 21:59 |
JayF | the most wishy washy unsure commit message I've ever applied | 21:59 |
JayF | but it passed unit tests the first time | 21:59 |
clarkb | oh interesting fwiw that wouldn't surprise me if reusing things keeps some stale state around that can create problems | 22:08 |
JayF | The thing I can't quite figure out is what is being reused, exactly. | 22:13 |
JayF | The engine passed in that context manager is an engine-getter | 22:14 |
JayF | so I wonder if this fixing things would still indicate we have a place where a connection/session is being left dangling | 22:14 |
JayF | because I don't think they should step on each other | 22:14 |
JayF | but it was an obvious thing to try given the symptoms | 22:14 |
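A rough, hypothetical illustration (not the actual 885661 patch) of the "two engines" idea being discussed: give each upgrade pass its own engine and dispose of it afterwards, so pooled connections or stale state from the first walk cannot block the second. Both `upgrade_twice` and `run_upgrade` are made-up names for the sketch.

```python
import sqlalchemy


def upgrade_twice(db_url, run_upgrade):
    """Run the migration walk twice, each pass against a freshly created engine."""
    for _ in range(2):
        engine = sqlalchemy.create_engine(db_url)  # new engine per pass
        try:
            run_upgrade(engine)  # hypothetical callable that walks the migrations
        finally:
            engine.dispose()     # drop pooled connections before the next pass
```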
JayF | that thing has already passed | 22:15 |
JayF | I now have a voice in the back of my head suggesting that it disappeared as quickly as it appeared, and I should re-validate the original bug... | 22:15 |
JayF | but I guess rpittau had it repro'd locally, he can test if 885661 fixes it for him, too | 22:16 |
JayF | rpittau: pointing my brain towards something else; so trying to document status for you in the morning: https://review.opendev.org/c/openstack/ironic/+/885661 may be a real fix. If it doesn't survive additional testing (or your local testing) we should land https://review.opendev.org/c/openstack/ironic/+/885600 to unblock the gate | 22:22 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/885661/ fixing it makes me wonder about our last migration | 22:51 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP: service steps https://review.opendev.org/c/openstack/ironic/+/884783 | 22:52 |
TheJulia | eh, looks fine | 22:58 |
TheJulia | still funky | 22:58 |
TheJulia | JayF: so.... w/r/t hold steps, I suspect they could be combined into one handler | 23:03 |
TheJulia | Just with two distinctly different approaches | 23:04 |
TheJulia | one being a "translation lookup, or 'reserved name' set of logic | 23:04 |
TheJulia | " | 23:04 |
TheJulia | likely reserved names then the translation lookup. | 23:04 |
* TheJulia steps away | 23:04 |