Friday, 2023-08-04

00:19 <opendevreview> Ghanshyam proposed openstack/tempest master: Remove nova-network tests  https://review.opendev.org/c/openstack/tempest/+/890471
01:17 <opendevreview> Ghanshyam proposed openstack/tempest master: Remove nova-network tests  https://review.opendev.org/c/openstack/tempest/+/890471
01:34 <opendevreview> Ghanshyam proposed openstack/tempest master: Remove nova-network tests  https://review.opendev.org/c/openstack/tempest/+/890471
05:21 <opendevreview> melanie witt proposed openstack/tempest master: DNM test rbd retype  https://review.opendev.org/c/openstack/tempest/+/890360
10:39 <opendevreview> Arnau Verdaguer proposed openstack/tempest master: [WIP][DNM] Test fix for LP#2027605  https://review.opendev.org/c/openstack/tempest/+/888770
10:42 <opendevreview> Arnau Verdaguer proposed openstack/tempest master: [WIP][DNM] Test fix for LP#2027605  https://review.opendev.org/c/openstack/tempest/+/888770
13:42 <opendevreview> Dan Smith proposed openstack/devstack master: Set acquire=0 on ProxyPass  https://review.opendev.org/c/openstack/devstack/+/890526
13:43 <opendevreview> Dan Smith proposed openstack/devstack master: Set acquire=0 on ProxyPass  https://review.opendev.org/c/openstack/devstack/+/890526
14:17 <opendevreview> Dan Smith proposed openstack/devstack master: Disable waiting forever for connpool workers  https://review.opendev.org/c/openstack/devstack/+/890526
15:11 <dansmith> gmann: so some data.. in the last two weeks we've had no ooms on the nova-ceph-multistore job
15:11 <dansmith> that's limited to 3x concurrency
15:16 <dansmith> I was going to say, none on nova-live-migration-ceph either, but that's a very limited set of tests, so not very useful
15:16 <dansmith> anyway, I'm wondering if the random failures through apache are also related to an increase in the number of things we have flying around now
15:24 <opendevreview> Dan Smith proposed openstack/tempest master: Revert "Increase the default concurrency for tempest run"  https://review.opendev.org/c/openstack/tempest/+/890535
15:25 <dansmith> just so I can recheck it a bunch of times ^
17:15 <gmann> dansmith: ack, as you know ceph jobs run very limited scenario tests (only 3). let's check with this revert and other jobs too whether more concurrency is causing the oom
17:16 <dansmith> gmann: yeah, but I think the scenarios probably aren't the ones causing the ooms
17:16 <dansmith> but yep, I just want to recheck it a bunch
17:16 <dansmith> it's about to pass the first attempt, only one real failure, the mkfs stamp_pattern thing
17:17 <dansmith> gmann: also, not sure if you saw, but what do you think of this? https://review.opendev.org/c/openstack/tempest/+/890350/1
17:17 <dansmith> it's a bit of a stab in the dark, but might be worth trying for the stamp_pattern thing
17:18 <gmann> dansmith: ok, have not checked it yet.
17:19 <gmann> dansmith: agree, at least it doesn't change anything from the test scenario perspective, so if it helps that's a bonus
17:20 <dansmith> gmann: cool, I'll move that out on its own so maybe we can try to get that in soon
17:21 <opendevreview> Dan Smith proposed openstack/tempest master: Use vfat for timestamp  https://review.opendev.org/c/openstack/tempest/+/890350
17:21 <gmann> dansmith: cool
17:22 <dansmith> that concurrency revert finished in under 2h, so wall time isn't even bad
17:23 <dansmith> (logs uploading from the last job now)
18:02 <dansmith> the skip-if-no-cinder fix got one of the uber-slow workers where it takes almost 2h to finish devstack, and is now failing lots of tests as expected
18:03 <dansmith> it's a n-v job though so it should still go to gate (to probably fail)
18:05 <dansmith> gmann: on the concurrency thing, one thing we could try is some sort of semaphore, where a test declares how many servers it's going to create,
18:05 <dansmith> and we only run enough tests in parallel to create that many servers
18:05 <dansmith> most will be one, but that would potentially allow us some higher concurrency for api tests and things, but still only 4 (or whatever) servers at any given point
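[The weighted-semaphore idea dansmith floats above could look roughly like this. This is a hypothetical sketch, not anything in tempest: `ServerSlotPool`, its method names, and the cap of 4 are all illustrative, and a real implementation would have to work across stestr worker processes rather than threads.]

```python
import threading

class ServerSlotPool:
    """Cap the total number of servers being booted across all workers,
    while letting lighter (e.g. API-only) tests run at higher concurrency.
    Hypothetical sketch of the idea from the discussion, not tempest code."""

    def __init__(self, max_servers=4):
        self._cond = threading.Condition()
        self._free = max_servers

    def acquire(self, servers=1):
        # Block until enough server "slots" are free for this test.
        with self._cond:
            while self._free < servers:
                self._cond.wait()
            self._free -= servers

    def release(self, servers=1):
        # Return this test's slots and wake any waiting tests.
        with self._cond:
            self._free += servers
            self._cond.notify_all()

pool = ServerSlotPool(max_servers=4)

def run_test(declared_servers):
    # A test declares up-front how many servers it will create.
    pool.acquire(declared_servers)
    try:
        pass  # boot servers, run the test...
    finally:
        pool.release(declared_servers)
```

Most tests would declare one server, so many could run in parallel; a multi-server scenario test would hold several slots and temporarily throttle the rest.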
18:20 <opendevreview> Ghanshyam proposed openstack/tempest master: Skip test early to improve memory footprint and time  https://review.opendev.org/c/openstack/tempest/+/890469
18:20 <gmann> dansmith: not sure how much benefit it will give, oom was an issue even before with 4 concurrency too
18:21 <dansmith> gmann: I think much less after my mysql tweaks..
18:21 <dansmith> it is happening a *lot* now
18:21 <gmann> but increasing by two more workers helped with the job timeout issue
18:22 <dansmith> it manifests in a lot of different ways.. sometimes a bunch of tests fail because it takes so long to respawn, but sometimes it's a single test failure and if you go look in syslog, there's a mysql oom
18:22 <dansmith> I know, but other things we did also helped with the timeout
18:22 <gmann> dansmith: as we know the mysql performance issue, should we run them in a few of the jobs like periodic instead of all by default?
18:23 <dansmith> which "them"?
18:23 <gmann> yeah, all the others also helped
18:23 <dansmith> anyway, I'm just saying, we're still failing a ton of things, fewer timeouts, but more OOMs
18:24 <dansmith> the ooms I've been looking at have all had six or seven qemu processes running, most using more than mysql
18:26 <opendevreview> Ghanshyam proposed openstack/tempest master: Add test for assisted volume snapshot  https://review.opendev.org/c/openstack/tempest/+/864839
18:27 <gmann> ok
18:27 <dansmith> anyway, let's recheck this revert several times and see
18:27 <gmann> dansmith: let's do more rechecks on the revert of concurrency and see if we see oom
18:27 <gmann> yeah
18:27 <dansmith> the way things were going before, we should get plenty of timeouts if we recheck a bunch of times, and the way things are now, we should get a few ooms at least if that's not the problem :)
18:28 <gmann> yeah, solving both is a little hard
18:28 <dansmith> yeah, and in the panic to improve we changed a lot of things at once
18:29 <dansmith> skip-if-not-cinder back in the gate for another go
18:31 <dansmith> gmann: unrelated to ooms, I'm wondering if this will help us diagnose the issues where we seem to run out of apache workers: https://review.opendev.org/c/openstack/devstack/+/890526
18:32 <gmann> +1, hoping we get this in at least
18:32 <dansmith> I could be wrong, but some of our "timeout talking to $service" things seem almost like we never make it past apache and then timeout
18:32 <dansmith> and I think that tweak will make us fail fast with something we can identify as "apache is out of workers"
18:32 <dansmith> right now, as I understand it, a connection will wait forever for a thread pool worker to make the proxy call,
18:33 <dansmith> which can cause us to timeout on the client side if it waits too long, even though eventually it will get a worker thread and make the call we originally asked for
18:33 <dansmith> so if we see some 503s popping up from apache, we can maybe increase the apache workers to help
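[The mod_proxy behavior being discussed can be sketched as a vhost fragment like the one below. The route and backend socket path are illustrative, not the actual devstack config from the change under review; `acquire` and `retry` are real ProxyPass connection-pool parameters.]

```apache
# Hypothetical reverse-proxy fragment in the style devstack uses for
# uwsgi-backed services; the path and backend name are made up.
#
# retry=0:   do not hold the backend in an error state after a failure;
#            retry it immediately on the next request.
# acquire=1: wait at most 1 ms for a free slot in the proxy connection
#            pool; if none is free, return 503 to the client instead of
#            waiting forever.
ProxyPass "/identity" "unix:/var/run/uwsgi/keystone-api.socket|uwsgi://uwsgi-uds-keystone" retry=0 acquire=1
```

With the default (no `acquire`), a request that arrives while the pool is exhausted just queues, which from the client's side looks like the service timing out.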
18:33 <gmann> you mean the case of identity error 500 we see?
18:34 <gmann> I think that is the same case here
18:34 <dansmith> no, the identity 500s are mysql OOMs ;)
18:34 <dansmith> here's one someone just rechecked: https://zuul.opendev.org/t/openstack/build/bbc550d6c8d54aee91a6a5d9166f45b8/log/controller/logs/syslog.txt#7417
18:34 <gmann> this one right?
18:34 <gmann> tempest.lib.exceptions.IdentityError: Got identity error
18:34 <gmann> Details: Unexpected status code 500
18:35 <dansmith> yep
18:35 <gmann> ohk
18:36 <dansmith> with that acquire=1 flag we should get a 503 from apache (not 500) if it has no pool workers to take the request after 1ms
18:38 <dansmith> if we still had elastic recheck I'd say we should add a check for mysql oom (or really any oom) and have an automatic comment to help mark them
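[An elastic-recheck-style matcher for the OOM signature dansmith describes could be as small as this. A hypothetical sketch, not an actual elastic-recheck query: the function name and regex are illustrative, and the exact kernel OOM-killer message varies slightly across kernel versions.]

```python
import re

# Match the common Linux OOM-killer syslog line, e.g.:
#   kernel: Out of memory: Killed process 1234 (mysqld) total-vm:...
# Capture the name of the process that was killed.
OOM_RE = re.compile(
    r"Out of memory: Kill(?:ed)? process \d+ \((?P<proc>[^)]+)\)"
)

def find_ooms(syslog_text):
    """Return the names of processes killed by the OOM killer,
    in the order they appear in the log."""
    return [m.group("proc") for m in OOM_RE.finditer(syslog_text)]
```

A bot could run this over a failed job's syslog and leave an automatic "this failure had a mysqld OOM" comment on the review.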
18:39 <gmann> but it passes retry=0, so at least it will not retry, right?
18:39 <dansmith> yeah, but they're different stages of the request
18:40 <dansmith> retry= means "the remote side didn't reply to me", but that's a case where we did get a worker, made the request, and got no answer
18:40 <dansmith> acquire is for "how long to wait for our own threadpool to give us a worker"
18:40 <dansmith> without a worker, we never even make the request
18:41 <gmann> yeah, I mean with that no-retry, I think failing early makes sense
18:41 <dansmith> okay
18:41 <gmann> 1ms is enough also
18:41 <gmann> I think I agree with the idea, it will help in debugging
18:42 <dansmith> 1ms is the lowest value you can use.. without anything it will wait forever
18:42 <dansmith> right, this doesn't fix anything, it just helps us identify if we end up exhausting apache
18:42 <dansmith> the only thing it might actually help,
18:43 <dansmith> is if apache says "sorry no workers" and some client waits and retries, and then makes it through
18:43 <dansmith> but the debugging is the goal
18:44 <gmann> yeah
19:16 <opendevreview> Ghanshyam proposed openstack/tempest master: Skip scenario tests early to avoid unnecessary setup  https://review.opendev.org/c/openstack/tempest/+/890573
19:32 * dansmith scores one green passing run :) https://review.opendev.org/c/openstack/tempest/+/890535
19:32 <dansmith> oops
19:38 <opendevreview> Ghanshyam proposed openstack/tempest master: Skip scenario tests early to avoid unnecessary setup  https://review.opendev.org/c/openstack/tempest/+/890573
19:42 <dansmith> gmann: all the scenario tests require cinder?
19:42 <dansmith> oh nm,
19:42 <dansmith> I misread the first change
19:42 <gmann> not all
19:43 <gmann> the neutron ones especially do not require it
19:43 <dansmith> yeah I misread the first change and was thinking that was in the base test
19:44 <gmann> ohk
21:45 <dansmith> skip-if-not-cinder oomed again in check

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!