rpittau | good morning ironic! o/ | 08:19 |
---|---|---|
rpittau | JayF, clarkb, TheJulia, about the timeout/failures in unit tests, I'm more and more oriented towards some DB shenanigans, as suggested by clarkb | 10:17 |
rpittau | if you look at my patch (besides the weird failure) the migration tests have very high runtimes, which are absolutely unjustified | 10:17 |
rpittau | I've observed the migration tests runtimes going from 12-13 secs to 42 secs completely randomly and I suspect at some point the DB hangs and the timeout happens | 10:17 |
rpittau | btw I think we should merge https://review.opendev.org/c/openstack/ironic/+/885372?usp=dashboard anyway :) | 10:24 |
rpittau | JayF: if you want to test with an older version of tox I think you can use the requires option in tox.ini in the tox section | 10:28 |
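As a hedged aside on the line above: the `requires` option in the `[tox]` section lets tox provision itself into a pinned version, so an older tox can be tried without touching the globally installed one. The version pin below is purely an example, not something stated in the log.

```ini
[tox]
# hypothetical pin for testing with an older tox; adjust or remove as needed
requires =
    tox<4.6
```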
opendevreview | Merged openstack/python-ironicclient master: Add support for node inventory https://review.opendev.org/c/openstack/python-ironicclient/+/883833 | 11:35 |
JayF | rpittau: tox was completely a guess that mostly has shaken out as a wrong guess | 14:02 |
rpittau | good news everyone! I was able to reproduce the timeout locally | 14:03 |
rpittau | "good" is relative in this case | 14:04 |
JayF | it's worthy of celebration for sure | 14:04 |
JayF | did you do anything magical to repro it? | 14:05 |
rpittau | nothing special, I tried the KISS approach and it worked | 14:06 |
rpittau | created a VM and reproduced the workflow | 14:06 |
rpittau | second attempt and tox just stops at some point | 14:06 |
rpittau | no error message, nothing | 14:06 |
rpittau | just defunct | 14:06 |
rpittau | I don't really see anything wrong so far in the VM, it's not stuck. only the terminal with tox is not moving anymore | 14:07 |
rpittau | lots of zombie processes | 14:09 |
rpittau | was there a stestr upgrade recently? | 14:09 |
JayF | no | 14:13 |
JayF | I'd suggest patching oslo.db in that venv as clarkb suggested yesterday | 14:13 |
JayF | since we know that could cause things to stop | 14:13 |
JayF | no new releases or updates to requirements that map to the timeline of unit tests beginning to fail | 14:14 |
JayF | tox release was on the same day but I think we've mostly ruled that out | 14:14 |
rpittau | ok, let me do another test first | 14:15 |
rpittau | I can reproduce it pretty much every time now | 14:23 |
opendevreview | OpenStack Release Bot proposed openstack/ironic bugfix/22.0: Update .gitreview for bugfix/22.0 https://review.opendev.org/c/openstack/ironic/+/885588 | 14:25 |
clarkb | cool having a reproduction case is very helpful | 15:24 |
clarkb | rpittau: what I would do is grab the list of tests fed to the stestr workers. Then see what tests ran and which ones are up next | 15:24 |
clarkb | rpittau: the tests that are up next would be highly suspicious since in theory they are/were the ones running (depending on the state things end up in) | 15:25 |
rpittau | JayF, clarkb, I patched oslo.db, ran the tests twice and it didn't timeout, but the database tests took over a minute to complete | 15:25 |
rpittau | each of them | 15:25 |
clarkb | ya I mean they can be slow that doesn't necessarily mean they are the problem causing the entire job to timeout (since you have a lot of headroom on the job timeout) | 15:26 |
clarkb | rpittau: what patch did you apply? | 15:26 |
clarkb | extra logging? | 15:26 |
rpittau | considering the number of db tests we have I think the jobs time out because they effectively take too long, so they exceed the 40-minute timeout value | 15:27 |
clarkb | being able to reproduce it is the main thing. Hopefully you're able to track it down pretty quickly now | 15:27 |
clarkb | rpittau: I don't think that is the case looking at the job logs | 15:27 |
clarkb | rpittau: if you look at the job logs you see that the last test case completes like half an hour before the job is timed out | 15:27 |
clarkb | something in the test suite is getting stuck and not progressing | 15:27 |
clarkb | it may still be that it is the db tests getting stuck but I don't think it is that they take too long generally. More like a deadlock etc | 15:28 |
rpittau | mmm I'm seeing tox stuck again, I'll leave it be for a while | 15:29 |
clarkb | rpittau: I would definitely try to determine which specific test cases are running when it gets stuck | 15:29 |
clarkb | and then you can sort out why they are stuck etc | 15:30 |
rpittau | yeah, I'm trying to understand that | 15:31 |
clarkb | rpittau: if you look at ps -elf | grep stestr or similar there should be stestr processes with --load-list arguments. I think each of those load-list arguments is a file containing a set of tests in the order to run them | 15:35 |
rpittau | yep, looking at stestr options to help with that | 15:36 |
clarkb | rpittau: if you then look in .stestr/tmp* for the current run you should see the log showing what tests have been run for each worker and you can cross check against the load list files to see what is currently running | 15:36 |
JayF | I setup a local mysql, trying to get in on the reproduction game | 15:36 |
clarkb | It might also be possible to issue a sigterm/^C to cause the test run to raise an exception wherever it is stuck as well and then you'd see in the traceback where it was at | 15:36 |
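A rough sketch of the worker inspection clarkb outlines above (exact paths and flags vary by stestr version; the load-list path below is a placeholder):

```shell
# Find the stestr worker processes and note any --load-list <file> arguments.
ps -elf | grep stestr
# Each load-list file holds the ordered tests assigned to one worker (placeholder path).
cat /tmp/tmpXXXXXX/worker-0-load-list
# .stestr/ keeps per-run state; the tmp* files record the tests that already finished.
ls .stestr/
# Compare the finished tests against the load list: the next unfinished entry is the
# test that is (or was) running when the run got stuck.
```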
JayF | rpittau: are you doing anything other than setting up mysql? also using postgresql? | 15:39 |
JayF | rpittau: what python version | 15:39 |
rpittau | JayF: just installing bindep deps, then running tools/test-setup.sh, using py 3.10 | 15:40 |
JayF | ack; so you have a running postgres and mysql server? | 15:40 |
JayF | I will setup postgres too | 15:40 |
yde | hi, I'm trying to make ironic+neutron work with a networking generic switch driver to put servers into neutron managed networks (vlans) but there's something I really don't get in the workflow | 15:40 |
rpittau | JayF: both yeah | 15:40 |
yde | which "dhcp+tftp" service is supposed to answer PXE requests? | 15:41 |
yde | during enrollment/inspection, the ironic dnsmasq + tftp servers do the job, boot the IPA and the inspector reports what it found | 15:42 |
yde | but then it auto adds all the MACs to a deny list | 15:42 |
yde | and when I boot an instance, neutron+ironic create a port in the neutron-managed "provisioning" network | 15:42 |
yde | which i also created with a dhcp agent | 15:43 |
yde | but this dnsmasq doesn't have all the necessary stuff to answer PXE requests, right? | 15:43 |
rpittau | clarkb: I don't see any --load-list in the stestr process | 15:45 |
rpittau | it's using the tests dir directly | 15:46 |
TheJulia | yde: you need to configure/establish a specific/dedicated endpoint for tftp | 15:47 |
yde | somehow, the enrolled+available baremetal host is properly discovered and inspected. then when I boot an instance through nova, neutron binds it a port in the admin network (which has a dhcp agent running in its ip netns), the ports on the switch are properly set up to the correct vlan id | 15:47 |
yde | and the server reboots, sends its DHCP requests, and nobody replies | 15:48 |
JayF | rpittau: you using mariadb, not mysql, yeah? | 15:48 |
* JayF trying to get this to reproduce and looking for differences | 15:48 |
rpittau | I'm using mysql | 15:48 |
JayF | dang, so am I | 15:48 |
rpittau | just using what's in CI | 15:48 |
JayF | is it happening every time for you? | 15:48 |
rpittau | in bindep | 15:48 |
rpittau | yes | 15:48 |
rpittau | so now it failed and I think we're onto something | 15:49 |
opendevreview | Merged openstack/ironic bugfix/22.0: Update .gitreview for bugfix/22.0 https://review.opendev.org/c/openstack/ironic/+/885588 | 15:49 |
yde | TheJulia: thanks. i have one, the one setup by kolla-ansible. but where do i have to setup the ipxe script ? | 15:49 |
JayF | if there's a way I can help other than by working independently to try and get a thing to fail, let me know | 15:49 |
rpittau | JayF: posting the error I found | 15:50 |
yde | I mean, if it's not the ironic-dnsmasq agent (which has an inspector.ipxe script setup), where is the ipxe script for deployment? and who shall serve it | 15:50 |
rpittau | JayF, clarkb : https://paste.openstack.org/show/bn06iH0TXNfw4bAB78ns/ | 15:50 |
yde | ? | 15:50 |
JayF | those are served from the conductor server, generally | 15:50 |
yde | which dnsmasq? the one in the qdhcp namespace of the provisioning network? or the one of the ironic-dnsmasq agent? | 15:51 |
JayF | rpittau: ...I feel like that error is disappointingly less useful than I hoped | 15:55 |
JayF | rpittau: did you ID anything interesting in there I might have missed? | 15:56 |
rpittau | mmm no, but I have the processes with the lists now | 15:57 |
clarkb | rpittau: JayF it is in eventlet | 16:03 |
TheJulia | yde: the dnsmasq launched in the neutron namespace can’t take tftp. The config for ironic-conductor should include a tftp_address which it puts in the dhcp response payload | 16:03 |
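A minimal sketch of the conductor-side configuration TheJulia refers to, assuming the standard `[pxe]` options in ironic.conf (the address and path below are placeholders):

```ini
[pxe]
# address put into the DHCP response so nodes know where to fetch boot files from
tftp_server = 192.0.2.10
tftp_root = /tftpboot
```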
clarkb | waiting to read bytes. I think if there is a lock on the db side the test side might be waiting for the lock to release | 16:04 |
clarkb | rpittau: maybe next is to check the db state when it is stuck and see if there are locks preventing work from happening in the db | 16:04 |
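One hedged way to follow that suggestion and inspect MySQL lock state while the run is hung (assumes a local MySQL 5.7+/8.x with the sys schema available; credentials are placeholders):

```shell
# Sessions that are sleeping or stuck in a long-running statement
mysql -u root -e "SHOW PROCESSLIST;"
# InnoDB lock waits and the most recent detected deadlock, if any
mysql -u root -e "SHOW ENGINE INNODB STATUS\G"
# Which transaction is blocking which (requires the sys schema)
mysql -u root -e "SELECT * FROM sys.innodb_lock_waits\G"
```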
rpittau | clarkb: I can see it's very slow at the moment | 16:04 |
JayF | what OS do we use in the gate? | 16:04 |
JayF | jammy, yeah | 16:05 |
rpittau | ubuntu focal and jammy | 16:05 |
JayF | the thing I keep rebounding back to | 16:07 |
JayF | is why now | 16:08 |
JayF | why did it break 6/5 | 16:08 |
JayF | https://github.com/eventlet/eventlet/issues/798 | 16:09 |
JayF | I think that's not impacting us here, because the switches complete successfully in your traceback | 16:09 |
rpittau | so it was stuck for a while then it started again briefly and it looks like it's running a migration | 16:16 |
rpittau | I think it's stuck on ironic.tests.unit.db.sqlalchemy.test_migrations.TestWalkVersions.test_walk_versions_all_false | 16:16 |
rpittau | no nvm that's completed | 16:16 |
rpittau | that's the last migration test completed | 16:17 |
rpittau | I'll leave it running for the rest of the evening | 16:19 |
rpittau | need to split now | 16:19 |
rpittau | good night! o/ | 16:19 |
JayF | o/ I'm going to keep trying to get a local reproducer :( | 16:20 |
rpittau | I think I found the culprit {1} ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_upgrade_twice [1792.906012s] ... ok | 16:25 |
rpittau | this takes 30 minutes to complete | 16:25 |
rpittau | in the end it completes, but of course it goes into timeout in CI | 16:26 |
* rpittau runs away screaming in SQL | 16:26 |
yde | TheJulia: thanks a lot. I already have a tftp server up and running and set up in the [pxe] section of the conductor | 16:27 |
clarkb | in https://zuul.opendev.org/t/openstack/build/00d4e534d78448a99b7a284baf6a869a/log/job-output.txt#57860 it only took 16s | 16:27 |
clarkb | rpittau: ^ so maybe it is only slow sometimes? But that would certainly do it | 16:27 |
yde | TheJulia: what I don't get is how the tftp info is pushed to the server, as it doesn't seem to be set up in the dnsmasq conf of the qdhcp neutron network | 16:29 |
yde | TheJulia: whereas it is, in the ironic-dnsmasq service | 16:29 |
TheJulia | yde: when a port attachment occurs in the lifecycle, the neutron port gets updated with the parameters | 16:30 |
TheJulia | So dnsmasq gets told by neutron | 16:30 |
TheJulia | And it is dhcp record-lease specific as a result | 16:30 |
yde | TheJulia: ok thanks | 16:31 |
yde | ill check that | 16:31 |
JayF | rpittau: clarkb: I'm going to propose a change to temporarily skip that test, and look and see if we can make it behave better. | 16:46 |
JayF | that test looks bad on the face of it | 16:53 |
JayF | I'm going to see if it's a simple fix | 16:53 |
JayF | https://bugs.launchpad.net/ironic/+bug/2023316 to track this, since I'm going to propose disabling a test | 17:00 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DNM: Testing to see if disabling this test lets things pass https://review.opendev.org/c/openstack/ironic/+/885600 | 17:02 |
JayF | pushed it up with integration tests disabled because I'm still sus, if it passes I'll make it mergeable | 17:02 |
JayF | it's passed | 17:17 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Disabling test_upgrade_twice temporarily for CI fix https://review.opendev.org/c/openstack/ironic/+/885600 | 17:19 |
JayF | TheJulia: rpittau: others: ^ workaround for unit test failures | 17:19 |
JayF | after my afternoon meetings (and lunch) I'll start digging to see if there is a real fix, but it only causes us harm to have CI broken | 17:19 |
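For reference, a minimal sketch (not the actual 885600 change) of how a single unit test can be skipped temporarily while the underlying issue is investigated:

```python
import unittest


class TestMigrationsMySQL(unittest.TestCase):
    # Illustrative only; the real ironic test class has different base classes and a real body.
    @unittest.skip("temporarily disabled while the CI hang is investigated; see bug 2023316")
    def test_upgrade_twice(self):
        self.fail("would normally walk the migrations twice")


if __name__ == "__main__":
    unittest.main()
```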
opendevreview | Jay Faulkner proposed openstack/ironic master: Disabling test_upgrade_twice temporarily for CI fix https://review.opendev.org/c/openstack/ironic/+/885600 | 17:21 |
JayF | https://review.opendev.org/c/openstack/ironic/+/885600 is V+1 and seemed to run a lot quicker, too | 17:57 |
JayF | rpittau: saw your comment. That the tests ran so fast is heartening, I think | 21:09 |
JayF | Heads up: when looking at the unit test failures, and digging thru some issues, I saw something that's likely to break oslo.service's SSL support in eventlet for py 3.12 | 21:35 |
JayF | so we'll have to be on the lookout for that when 3.12 comes down the pipe | 21:35 |
TheJulia | Fun :( | 21:36 |
JayF | if we're lucky, it'll just be differential eventlet versions depending on py version | 21:37 |
JayF | I also think we only use that in IPA, not Ironic | 21:37 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DNM: Ensure upgrade twice test uses two engines https://review.opendev.org/c/openstack/ironic/+/885661 | 21:45 |
TheJulia | yde: is dnsmasq logging anything? | 21:50 |
TheJulia | yde: nevermind, i opened an old irccloud window which I guess had out of date logs | 21:50 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Ensure upgrade twice test uses two engines https://review.opendev.org/c/openstack/ironic/+/885661 | 21:59 |
JayF | the most wishy washy unsure commit message I've ever applied | 21:59 |
JayF | but it passed unit tests the first time | 21:59 |
clarkb | oh interesting fwiw that wouldn't surprise me if reusing things keeps some stale state around that can create problems | 22:08 |
JayF | The thing I can't quite figure out is what is being reused, exactly. | 22:13 |
JayF | The engine passed in that context manager is an engine-getter | 22:14 |
JayF | so I wonder if this fixing things would still indicate we have a place where a connection/session is being left dangling | 22:14 |
JayF | because I don't think they should step on each other | 22:14 |
JayF | but it was an obvious thing to try given the symptoms | 22:14 |
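A rough, hypothetical illustration (not the actual 885661 patch) of the "two engines" idea being discussed: give each upgrade pass its own engine and dispose of it afterwards, so pooled connections or stale state from the first walk cannot block the second. Both `upgrade_twice` and `run_upgrade` are made-up names for the sketch.

```python
import sqlalchemy


def upgrade_twice(db_url, run_upgrade):
    """Run the migration walk twice, each pass against a freshly created engine."""
    for _ in range(2):
        engine = sqlalchemy.create_engine(db_url)  # new engine per pass
        try:
            run_upgrade(engine)  # hypothetical callable that walks the migrations
        finally:
            engine.dispose()     # drop pooled connections before the next pass
```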
JayF | that thing has already passed | 22:15 |
JayF | I now have a voice in the back of my head suggesting that it disappeared as quickly as it appeared, and I should re-validate the original bug... | 22:15 |
JayF | but I guess rpittau had it repro'd locally, he can test if 885661 fixes it for him, too | 22:16 |
JayF | rpittau: pointing my brain towards something else; so trying to document status for you in the morning: https://review.opendev.org/c/openstack/ironic/+/885661 may be a real fix. If it doesn't survive additional testing (or your local testing) we should land https://review.opendev.org/c/openstack/ironic/+/885600 to unblock the gate | 22:22 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/885661/ fixing it makes me wonder about our last migration | 22:51 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP: service steps https://review.opendev.org/c/openstack/ironic/+/884783 | 22:52 |
TheJulia | eh, looks fine | 22:58 |
TheJulia | still funky | 22:58 |
TheJulia | JayF: so.... w/r/t hold steps, I suspect they could be combined into one handler | 23:03 |
TheJulia | Just with two distinctly different approaches | 23:04 |
TheJulia | one being a "translation lookup, or 'reserved name' set of logic | 23:04 |
TheJulia | " | 23:04 |
TheJulia | likely reserved names then the translation lookup. | 23:04 |
* TheJulia steps away | 23:04 |