Friday, 2023-08-04

00:19 <opendevreview> Ghanshyam proposed openstack/tempest master: Remove nova-network tests  https://review.opendev.org/c/openstack/tempest/+/890471
01:17 <opendevreview> Ghanshyam proposed openstack/tempest master: Remove nova-network tests  https://review.opendev.org/c/openstack/tempest/+/890471
01:34 <opendevreview> Ghanshyam proposed openstack/tempest master: Remove nova-network tests  https://review.opendev.org/c/openstack/tempest/+/890471
05:21 <opendevreview> melanie witt proposed openstack/tempest master: DNM test rbd retype  https://review.opendev.org/c/openstack/tempest/+/890360
10:39 <opendevreview> Arnau Verdaguer proposed openstack/tempest master: [WIP][DNM] Test fix for LP#2027605  https://review.opendev.org/c/openstack/tempest/+/888770
10:42 <opendevreview> Arnau Verdaguer proposed openstack/tempest master: [WIP][DNM] Test fix for LP#2027605  https://review.opendev.org/c/openstack/tempest/+/888770
13:42 <opendevreview> Dan Smith proposed openstack/devstack master: Set acquire=0 on ProxyPass  https://review.opendev.org/c/openstack/devstack/+/890526
13:43 <opendevreview> Dan Smith proposed openstack/devstack master: Set acquire=0 on ProxyPass  https://review.opendev.org/c/openstack/devstack/+/890526
14:17 <opendevreview> Dan Smith proposed openstack/devstack master: Disable waiting forever for connpool workers  https://review.opendev.org/c/openstack/devstack/+/890526
15:11 <dansmith> gmann: so some data.. in the last two weeks we've had no ooms on the nova-ceph-multistore job
15:11 <dansmith> that's limited to 3x concurrency
15:16 <dansmith> I was going to say, none on nova-live-migration-ceph either, but that's a very limited set of tests, so not very useful
15:16 <dansmith> anyway, I'm wondering if the random failures through apache are also related to an increase in the number of things we have flying around now
15:24 <opendevreview> Dan Smith proposed openstack/tempest master: Revert "Increase the default concurrency for tempest run"  https://review.opendev.org/c/openstack/tempest/+/890535
15:25 <dansmith> just so I can recheck it a bunch of times ^
17:15 <gmann> dansmith: ack, as you know ceph jobs run very limited scenario tests (only 3). let's check with this revert and other jobs too whether more concurrency is causing the oom
17:16 <dansmith> gmann: yeah, but I think the scenarios probably aren't the ones causing the ooms
17:16 <dansmith> but yep, I just want to recheck it a bunch
17:16 <dansmith> it's about to pass the first attempt, only one real failure, the mkfs stamp_pattern thing
17:17 <dansmith> gmann: also, not sure if you saw, but what do you think of this? https://review.opendev.org/c/openstack/tempest/+/890350/1
17:17 <dansmith> it's a bit of a stab in the dark, but might be worth trying for the stamp_pattern thing
17:18 <gmann> dansmith: ok, have not checked it yet.
17:19 <gmann> dansmith: agree, at least it doesn't change anything from the test scenario perspective, so if it helps that's a bonus
17:20 <dansmith> gmann: cool, I'll move that out on its own so maybe we can try to get that in soon
17:21 <opendevreview> Dan Smith proposed openstack/tempest master: Use vfat for timestamp  https://review.opendev.org/c/openstack/tempest/+/890350
17:21 <gmann> dansmith: cool
17:22 <dansmith> that concurrency revert finished in under 2h, so wall time isn't even bad
17:23 <dansmith> (logs uploading from the last job now)
18:02 <dansmith> the skip-if-no-cinder fix got one of the uber-slow workers where it takes almost 2h to finish devstack, and is now failing lots of tests as expected
18:03 <dansmith> it's a n-v job though so it should still go to gate (to probably fail)
18:05 <dansmith> gmann: on the concurrency thing, one thing we could try is some sort of semaphore, where a test declares how many servers it's going to create,
18:05 <dansmith> and we only run enough tests in parallel to create that many servers
18:05 <dansmith> most will be one, but that would potentially allow us some higher concurrency for api tests and things, but still only 4 (or whatever) servers at any given point
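[The weighted-semaphore idea dansmith floats above could look roughly like this. This is a hypothetical sketch, not anything in tempest: `ServerSlotPool`, its method names, and the cap of 4 are all illustrative, and a real implementation would have to work across stestr worker processes rather than threads.]

```python
import threading

class ServerSlotPool:
    """Cap the total number of servers being booted across all workers,
    while letting lighter (e.g. API-only) tests run at higher concurrency.
    Hypothetical sketch of the idea from the discussion, not tempest code."""

    def __init__(self, max_servers=4):
        self._cond = threading.Condition()
        self._free = max_servers

    def acquire(self, servers=1):
        # Block until enough server "slots" are free for this test.
        with self._cond:
            while self._free < servers:
                self._cond.wait()
            self._free -= servers

    def release(self, servers=1):
        # Return this test's slots and wake any waiting tests.
        with self._cond:
            self._free += servers
            self._cond.notify_all()

pool = ServerSlotPool(max_servers=4)

def run_test(declared_servers):
    # A test declares up-front how many servers it will create.
    pool.acquire(declared_servers)
    try:
        pass  # boot servers, run the test...
    finally:
        pool.release(declared_servers)
```

Most tests would declare one server, so many could run in parallel; a multi-server scenario test would hold several slots and temporarily throttle the rest.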
18:20 <opendevreview> Ghanshyam proposed openstack/tempest master: Skip test early to improve memory footprint and time  https://review.opendev.org/c/openstack/tempest/+/890469
18:20 <gmann> dansmith: not sure how much benefit it will give, oom was an issue even before with 4 concurrency too
18:21 <dansmith> gmann: I think much less after my mysql tweaks..
18:21 <dansmith> it is happening a *lot* now
18:21 <gmann> but increasing by two more workers helped with the job timeout issue
18:22 <dansmith> it manifests in a lot of different ways.. sometimes a bunch of tests fail because it takes so long to respawn, but sometimes it's a single test failure and if you go look in syslog, there's a mysql oom
18:22 <dansmith> I know, but other things we did also helped with the timeout
18:22 <gmann> dansmith: as we know the mysql performance issue, should we run them in a few of the jobs like periodic instead of all by default?
18:23 <dansmith> which "them"?
18:23 <gmann> yeah, all the others also helped
18:23 <dansmith> anyway, I'm just saying, we're still failing a ton of things, fewer timeouts, but more OOMs
18:24 <dansmith> the ooms I've been looking at have all had six or seven qemu processes running, most using more than mysql
18:26 <opendevreview> Ghanshyam proposed openstack/tempest master: Add test for assisted volume snapshot  https://review.opendev.org/c/openstack/tempest/+/864839
18:27 <gmann> ok
18:27 <dansmith> anyway, let's recheck this revert several times and see
18:27 <gmann> dansmith: let's do more rechecks on the revert of concurrency and see if we see oom
18:27 <gmann> yeah
18:27 <dansmith> the way things were going before, we should get plenty of timeouts if we recheck a bunch of times, and the way things are now, we should get a few ooms at least if that's not the problem :)
18:28 <gmann> yeah, solving both is a little hard
18:28 <dansmith> yeah, and in the panic to improve we changed a lot of things at once
18:29 <dansmith> skip-if-not-cinder back in the gate for another go
18:31 <dansmith> gmann: unrelated to ooms, I'm wondering if this will help us diagnose the issues where we seem to run out of apache workers: https://review.opendev.org/c/openstack/devstack/+/890526
18:32 <gmann> +1, hoping we get this in at least
18:32 <dansmith> I could be wrong, but some of our "timeout talking to $service" things seem almost like we never make it past apache and then timeout
18:32 <dansmith> and I think that tweak will make us fail fast with something we can identify as "apache is out of workers"
18:32 <dansmith> right now, as I understand it, a connection will wait forever for a thread pool worker to make the proxy call,
18:33 <dansmith> which can cause us to timeout on the client side if it waits too long, even though eventually it will get a worker thread and make the call we originally asked for
18:33 <dansmith> so if we see some 503s popping up from apache, we can maybe increase the apache workers to help
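[The mod_proxy behavior being discussed can be sketched as a vhost fragment like the one below. The route and backend socket path are illustrative, not the actual devstack config from the change under review; `acquire` and `retry` are real ProxyPass connection-pool parameters.]

```apache
# Hypothetical reverse-proxy fragment in the style devstack uses for
# uwsgi-backed services; the path and backend name are made up.
#
# retry=0:   do not hold the backend in an error state after a failure;
#            retry it immediately on the next request.
# acquire=1: wait at most 1 ms for a free slot in the proxy connection
#            pool; if none is free, return 503 to the client instead of
#            waiting forever.
ProxyPass "/identity" "unix:/var/run/uwsgi/keystone-api.socket|uwsgi://uwsgi-uds-keystone" retry=0 acquire=1
```

With the default (no `acquire`), a request that arrives while the pool is exhausted just queues, which from the client's side looks like the service timing out.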
18:33 <gmann> you mean the case of identity error 500 we see?
18:34 <gmann> I think that is the same case here
18:34 <dansmith> no, the identity 500s are mysql OOMs ;)
18:34 <dansmith> here's one someone just rechecked: https://zuul.opendev.org/t/openstack/build/bbc550d6c8d54aee91a6a5d9166f45b8/log/controller/logs/syslog.txt#7417
18:34 <gmann> this one right?
18:34 <gmann> tempest.lib.exceptions.IdentityError: Got identity error
18:34 <gmann> Details: Unexpected status code 500
18:35 <dansmith> yep
18:35 <gmann> ohk
18:36 <dansmith> with that acquire=1 flag we should get a 503 from apache (not 500) if it has no pool workers to take the request after 1ms
18:38 <dansmith> if we still had elastic recheck I'd say we should add a check for mysql oom (or really any oom) and have an automatic comment to help mark them
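[An elastic-recheck-style matcher for the OOM signature dansmith describes could be as small as this. A hypothetical sketch, not an actual elastic-recheck query: the function name and regex are illustrative, and the exact kernel OOM-killer message varies slightly across kernel versions.]

```python
import re

# Match the common Linux OOM-killer syslog line, e.g.:
#   kernel: Out of memory: Killed process 1234 (mysqld) total-vm:...
# Capture the name of the process that was killed.
OOM_RE = re.compile(
    r"Out of memory: Kill(?:ed)? process \d+ \((?P<proc>[^)]+)\)"
)

def find_ooms(syslog_text):
    """Return the names of processes killed by the OOM killer,
    in the order they appear in the log."""
    return [m.group("proc") for m in OOM_RE.finditer(syslog_text)]
```

A bot could run this over a failed job's syslog and leave an automatic "this failure had a mysqld OOM" comment on the review.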
18:39 <gmann> but it passes retry=0, so at least it will not retry, right?
18:39 <dansmith> yeah, but they're different stages of the request
18:40 <dansmith> retry= means "the remote side didn't reply to me", but that's a case where we did get a worker, made the request, and got no answer
18:40 <dansmith> acquire is for "how long to wait for our own threadpool to give us a worker"
18:40 <dansmith> without a worker, we never even make the request
18:41 <gmann> yeah, I mean with that no-retry, I think failing early makes sense
18:41 <dansmith> okay
18:41 <gmann> 1ms is enough also
18:41 <gmann> I think I agree with the idea, it will help in debugging
18:42 <dansmith> 1ms is the lowest value you can use.. without anything it will wait forever
18:42 <dansmith> right, this doesn't fix anything, it just helps us identify if we end up exhausting apache
18:42 <dansmith> the only thing it might actually help,
18:43 <dansmith> is if apache says "sorry no workers" and some client waits and retries, and then makes it through
18:43 <dansmith> but the debugging is the goal
18:44 <gmann> yeah
19:16 <opendevreview> Ghanshyam proposed openstack/tempest master: Skip scenario tests early to avoid unnecessary setup  https://review.opendev.org/c/openstack/tempest/+/890573
19:32 * dansmith scores one green passing run :) https://review.opendev.org/c/openstack/tempest/+/890535
19:32 <dansmith> oops
19:38 <opendevreview> Ghanshyam proposed openstack/tempest master: Skip scenario tests early to avoid unnecessary setup  https://review.opendev.org/c/openstack/tempest/+/890573
19:42 <dansmith> gmann: all the scenario tests require cinder?
19:42 <dansmith> oh nm,
19:42 <dansmith> I misread the first change
19:42 <gmann> not all
19:43 <gmann> the neutron ones especially do not require it
19:43 <dansmith> yeah I misread the first change and was thinking that was in the base test
19:44 <gmann> ohk
21:45 <dansmith> skip-if-not-cinder oomed again in check

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!