Thursday, 2023-02-09

02:22 <gmann> kopecmartin: once you start work tomorrow https://review.opendev.org/c/openstack/tempest/+/872982
04:04 *** yadnesh|away is now known as yadnesh
06:10 <opendevreview> yatin proposed openstack/grenade stable/train: [train-only] Add ensure-rust to grenade jobs  https://review.opendev.org/c/openstack/grenade/+/872999
06:13 <ykarel> gmann, can you rework flow https://review.opendev.org/c/openstack/devstack/+/872902
08:27 *** jpena|off is now known as jpena
08:41 <opendevreview> Martin Kopec proposed openstack/tempest master: Mark tempest-multinode-full-py3 as n-v  https://review.opendev.org/c/openstack/tempest/+/873228
09:46 <opendevreview> Merged openstack/grenade stable/ussuri: [ussuri-only] Add ensure-rust to grenade jobs  https://review.opendev.org/c/openstack/grenade/+/872969
10:16 <opendevreview> Merged openstack/tempest master: Retry glance add/update locations on BadRequest  https://review.opendev.org/c/openstack/tempest/+/872982
12:58 <opendevreview> Merged openstack/devstack stable/train: [Train Only] Run ensure-rust role  https://review.opendev.org/c/openstack/devstack/+/872902
13:25 <opendevreview> Merged openstack/tempest master: Allow SSH connection callers to not permit agent usage  https://review.opendev.org/c/openstack/tempest/+/872566
14:58 *** yadnesh is now known as yadnesh|away
15:11 <dansmith> gmann: kopecmartin so, opensearch tells me we're hitting *hundreds* of OOMs in gate jobs each day
15:11 <dansmith> I went to look after my chunked patch failed in the gate for the 300th time due to a single job failure
15:12 <dansmith> given that number is high, I'm thinking we should look at some sweeping change(s)
15:12 <dansmith> I wonder if we could/should start with upping some swap?
15:17 <dansmith> we have a number of jobs that already override swap to 8G, but the one I saw OOM this morning was a 4G job.. is there any reason not to up that to 8G mostly across the board?
15:20 <dansmith> clarkb: fungi ^
15:34 <fungi> your timeouts may be related to increased swap use. as soon as you start getting into swap thrash from memory exhaustion, *everything* you do that involves memory access or disk access is going to slow down tremendously
15:35 <fungi> i'd start by looking at what's been eating more memory and seeing if that can be fixed/reverted
15:36 <fungi> ideally those nodes would use very little swap, mostly to page out infrequently-accessed segments in order to make more room for performance-boosting uses like fs caching
15:45 <clarkb> fwiw the swap amount was reduced to 1gb from 8gb in devstack jobs due to being unable to sparse allocate the backing file on ext4
15:45 <clarkb> so we have done more swap in the past, but the kernel made changes that broke userspace (boo) and we had to accommodate that
15:46 <clarkb> we could try sparse allocation again though. Maybe jammy fixed it, but I think the change was semi intentional and not expected to be fixed
15:52 <dansmith> fungi: yeah I definitely understand that.. it depends if the problem is a bunch of stuff we load and never use, or if it's in the active set.. given how inflated things get with python it helps more often than I usually expect
15:53 <dansmith> clarkb: last I knew the internals, swap on file was actually problematic in general, but I *think* that got better by them making swap access effectively O_DIRECT.. I assume that too is resolved?
15:53 <clarkb> dansmith: we have had no issues with swap on a file that I know of
15:54 <clarkb> dansmith: the mkswap or something manpage documented creating sparse allocations on ext (other fses didn't support it) but then they broke that. Switching to a fully allocated file works fine
15:54 <dansmith> fungi: and fwiw, the OOMs I've seen are all choosing mysqld to kill because it's the biggest offender.. I know, oom_score and what not, but...
15:54 <clarkb> the main drawbacks to using a fully allocated file are the loss of the disk space whether you use the swap space or not and the time it takes to write all those zeros
15:55 <clarkb> I would expect the time to be somewhat linear too? allocating 8gb is 8x longer than 1gb?
15:55 <dansmith> clarkb: yeah, so the original problem was that the kernel would allocate buffer space to write to the filesystem in a way that it didn't have to do for writing to block,
15:55 <dansmith> and that could exacerbate low-memory situations
15:55 <dansmith> it's been a long time
15:55 <dansmith> so maybe that's resolved
15:55 <fungi> yeah, in many cases i looked at in the past, it was tons of separate uwsgi processes consuming most of the memory, but mysql got the axe because it was a large monolithic target
15:56 <dansmith> fungi: yeah, and I think we lowered mysql's oom_score to make it less likely
15:56 <dansmith> but the more services we run, each needing at least a reasonable number of workers, the more uwsgi processes we're going to have
15:57 <dansmith> I've addressed it in certain jobs by reducing scope, like taking c-bak and other things out of jobs that don't need to test that stuff
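The oom_score knob dansmith mentions is the kernel's per-process "badness" adjustment. A minimal sketch of the mechanism (Linux-only; the helper names are mine, not devstack's):

```python
# Sketch of the oom_score mechanism mentioned above. Writing a negative
# oom_score_adj makes the OOM killer less likely to pick that process;
# the kernel accepts values in [-1000, 1000].
from pathlib import Path


def set_oom_score_adj(pid, adj):
    if not -1000 <= adj <= 1000:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    # needs root (or ownership) to lower another process's score
    Path(f"/proc/{pid}/oom_score_adj").write_text(str(adj))


def get_oom_score(pid):
    # the kernel's current badness for this process; the highest score
    # is what gets killed, which is why big mysqld tends to lose
    return int(Path(f"/proc/{pid}/oom_score").read_text())
```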
15:58 <dansmith> clarkb: is creating multiple files a way around that allocation issue?
15:59 <clarkb> dansmith: I don't think so. I expect the time cost to be similar due to how clouds limit iops
15:59 <dansmith> oh, the problem is just actually fully allocating 8gb of disk, and that's why we needed sparse and reduced the size of what we did instead?
15:59 <dansmith> I see
16:00 <clarkb> right, it's a time issue
16:00 <clarkb> fallocate is almost immediate. dd'ing a bunch of zeros is slow
16:01 <clarkb> looks like my current swapon manpage no longer documents using fallocate on anything but xfs
16:01 <dansmith> do we not get a chunk of disk we can partition and use without having to reboot?
16:01 <clarkb> so they fixed the docs at least
16:01 <clarkb> dansmith: only on rax (so rax jobs do repartition)
16:01 <dansmith> okay
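To illustrate clarkb's point about allocation cost, a rough timing sketch; paths and sizes are illustrative, not what the jobs actually use. posix_fallocate only reserves extents so it returns almost immediately, while writing real zeros (what dd does, and what ext4 swap files require) costs time roughly linear in size on iops-limited cloud nodes:

```python
# Compare extent reservation vs actually writing zeros for a swap file.
import os
import time

GiB = 1024 ** 3
CHUNK = 4 * 1024 ** 2  # write zeros in 4MiB chunks


def fallocate_file(path, size=1 * GiB):
    start = time.monotonic()
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # reserves extents, writes no data
    finally:
        os.close(fd)
    return time.monotonic() - start


def zero_fill_file(path, size=1 * GiB):
    start = time.monotonic()
    zeros = bytes(CHUNK)
    with open(path, "wb") as f:
        for _ in range(size // CHUNK):
            f.write(zeros)
        f.flush()
        os.fsync(f.fileno())  # include the actual disk writes in the timing
    return time.monotonic() - start


if __name__ == "__main__":
    print("fallocate: %.3fs" % fallocate_file("/var/tmp/swap-fallocate"))
    print("zero fill: %.3fs" % zero_fill_file("/var/tmp/swap-zeros"))
```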
16:03 <ykarel> ^ are the affected jobs jammy based and running tempest tests with higher concurrency, so multiple guest vms concurrently?
16:03 <ykarel> maybe it's already known that it's due to qemu 6
16:05 <ykarel> jfr https://bugs.launchpad.net/nova/+bug/1949606
16:05 <clarkb> (coincidentally I just ran into errors on my local machine due to a full /. It's full of docker images...)
16:07 <dansmith> are we dumping the rabbit memory diagnostics anywhere?
16:08 <dansmith> rabbit definitely has a bunch of stuff swapped out on the one I'm looking at.. only 15MB resident but almost 850MB virt
16:10 <fungi> presumably rabbit has some way to tune its message retention, but i really know very little about this problem domain
16:10 <clarkb> the two places I know are the copilot dstat replacement (which is off by default iirc) and the memory dump at the end of the job.
16:10 <dansmith> well, it has a diag dump to see where it's using memory, so I'm just wondering if we're already dumping that
16:10 <fungi> though it may also leak allocations or just not bother to garbage collect
16:10 <clarkb> I don't think we do ^
16:11 <dansmith> clarkb: where should I add that so it runs after a job even if it fails?
16:11 <clarkb> dansmith: the post-run portion of the job should run after a failure, including timeouts. It has its own timeout to enable that sort of thing
16:11 <fungi> post-run phase normally
16:12 <dansmith> devstack/playbooks/post.yaml?
16:12 <fungi> there's also a cleanup phase, but it shouldn't be necessary for this case, i wouldn't think
16:12 <clarkb> ya that looks like the right file
16:12 <dansmith> ah yeah, that's where I did the perf dump
16:17 <opendevreview> Dan Smith proposed openstack/devstack master: Grab rabbitmq memory diagnostics dump in post  https://review.opendev.org/c/openstack/devstack/+/873299
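The change above collects rabbitmq's memory diagnostics in the job's post-run phase. A hedged sketch of what such a collection step could run; the actual patch is an Ansible task in devstack's post playbook, and this helper and its output path are assumptions, though `rabbitmq-diagnostics memory_breakdown` is the usual CLI for the per-category dump dansmith describes and `rabbitmqctl status` gives a coarser summary:

```python
# Collect rabbitmq memory diagnostics into a log file without ever
# failing, since post-run must succeed even when the job itself failed.
import subprocess

COMMANDS = [
    ["rabbitmq-diagnostics", "memory_breakdown"],
    ["rabbitmqctl", "status"],
]


def dump_rabbit_memory(outfile="rabbitmq-memory.txt"):
    with open(outfile, "w") as out:
        for cmd in COMMANDS:
            out.write("$ %s\n" % " ".join(cmd))
            try:
                r = subprocess.run(cmd, capture_output=True, text=True,
                                   timeout=60)
                out.write(r.stdout + r.stderr + "\n")
            except (OSError, subprocess.TimeoutExpired) as exc:
                # never let the collection step itself blow up
                out.write("collection failed: %s\n" % exc)
```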
16:22 <dansmith> also doesn't seem like we're doing much tweaking of mysql to reduce its memory usage, so maybe there's something we can do there
16:26 <opendevreview> Dan Smith proposed openstack/tempest master: Fix retry_bad_request() context manager  https://review.opendev.org/c/openstack/tempest/+/873300
16:26 <dansmith> gmann: kopecmartin ^
16:26 <clarkb> I want to say that mordred and aeva did mysql tuning way way back when, but maybe that has been lost or we were tuning to different needs
16:26 <dansmith> that patch we landed wasn't actually getting run in the gate, so now anyone running those tests is 100% broken because I SUCK
16:30 <dansmith> gmann: kopecmartin testing with https://review.opendev.org/c/openstack/nova/+/873302 because apparently we don't get ceph-multistore coverage in the tempest gate
16:30 <dansmith> clarkb: yeah I just don't see any in devstack at a quick glance
16:53 <dansmith> gmann: kopecmartin I reconfigured my local devstack to be able to test that fix and it works, so I think it should be okay.. hopefully it merges as quickly as the original :/
17:33 *** jpena is now known as jpena|off
17:35 <opendevreview> Dan Smith proposed openstack/devstack master: Grab rabbitmq memory diagnostics dump in post  https://review.opendev.org/c/openstack/devstack/+/873299
17:45 <clarkb> https://mariadb.com/kb/en/mariadb-memory-allocation/ may be useful if anyone digs into the db tuning
17:46 <clarkb> a bit old, but with pointers to stuff
17:55 <dansmith> clarkb: yeah I've successfully reduced the footprint on some personal memory-constrained cloud instances in the past, so I can add some tweaks
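For anyone picking up the db-tuning thread: an illustrative set of low-memory MariaDB overrides in the spirit of the article clarkb linked. Every value here is an assumption for a memory-constrained CI node, not devstack's actual configuration:

```python
# Write a drop-in conf.d override with low-memory MariaDB settings;
# restart mysqld/mariadbd to apply.
LOW_MEMORY_CNF = """\
[mysqld]
# biggest single knob: InnoDB's cache of table/index pages
innodb_buffer_pool_size = 128M
# each open connection holds per-thread sort/join/read buffers
max_connections = 64
# instrumentation tables can cost tens of MB
performance_schema = OFF
# cap in-memory temp tables so big queries spill to disk instead of RAM
tmp_table_size = 16M
max_heap_table_size = 16M
"""


def write_overrides(path="/etc/mysql/conf.d/99-low-memory.cnf"):
    with open(path, "w") as f:
        f.write(LOW_MEMORY_CNF)
```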
18:52 <dansmith> man, tons of post_failure with no logs this morning
18:54 <clarkb> chances are one of the swift regions is having trouble. I'll take a look
18:54 <dansmith> seems like a lot of them were too fast to have run and failed only during upload, but not all of them
18:59 <clarkb> the one I'm looking at was a swift upload failure. Going to look at a couple more just to get more data
19:04 <dansmith> https://zuul.opendev.org/t/openstack/build/b367c9b4a4a54eab8964676f4c6f43ec
19:04 <dansmith> https://zuul.opendev.org/t/openstack/build/5eb69a31232d444db48b0ec90ccb5fee
19:04 <dansmith> https://zuul.opendev.org/t/openstack/build/c21483fb7491468fa893d8df1a4a0368
19:04 <dansmith> just some more that are in my view right now (that's not all of them, just a few)
19:05 <dansmith> the ceph multistore job I'm eagerly waiting to finish is failing like crazy on tons of tempest tests, so I'm really interested to see what's going on there
19:05 <gmann> ykarel: ok, i think kopecmartin did that
19:05 <dansmith> perhaps another oom
19:06 <gmann> dansmith: +A on 873300
19:06 <dansmith> gmann: thanks, tested locally
19:10 <gmann> in the ceph job in the gate it seems other server tests are failing on timeout
19:11 <dansmith> gmann: I'm watching one fail like crazy, you see other ceph jobs failing too?
19:14 <gmann> this patch only
19:19 <dansmith> okay
19:29 <dansmith> looks like rbd failures on that one
19:29 <dansmith> lots of glance image 404s and some "rbd is busy" errors
19:31 <gmann> is this the same? 'RuntimeError: generator didn't stop' https://zuul.opendev.org/t/openstack/build/1566c99f09984540bd8754a22342e486/log/job-output.txt#80958
19:34 <dansmith> oh no, I didn't see that
19:34 <dansmith> wtf, I don't see that locally
19:34 <dansmith> {0} tempest.api.image.v2.test_images.ImageLocationsTest.test_set_location [2.530617s] ... ok
19:35 <dansmith> oh, I wonder if that's because even after the retries it still failed, and since I raised the real errors instead of StopIteration?
19:35 <gmann> it's happening in test_replace_location also
19:36 <gmann> maybe
19:36 <dansmith> oh you know, this isn't going to work at all the second time, because we're not iterating over the result
19:36 <dansmith> man, I screwed this one up
19:37 <dansmith> without a failure to test against, it's hard to even know
19:37 <dansmith> damn
19:38 <dansmith> it was going to fail anyway, it just didn't retry
19:38 <gmann> ohk, let me remove the W from the tempest one
19:38 <dansmith> let me try to simulate one failure to validate what I think we need
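For context on the 'generator didn't stop' error: a @contextlib.contextmanager generator must yield exactly once, so a retry loop that can reach the yield a second time blows up in __exit__ instead of retrying, matching dansmith's "it just didn't retry". A minimal sketch of the failure mode plus one safe shape; this is not the actual tempest code, and the plain wrapper shown is an alternative pattern, while the real fix in 873300 kept the context-manager interface:

```python
# Demonstrate why a retrying @contextmanager raises
# RuntimeError("generator didn't stop"), and a shape that does retry.
import contextlib


class BadRequest(Exception):
    pass


@contextlib.contextmanager
def retry_bad_request_buggy(fn, retries=3):
    for _ in range(retries):
        try:
            result = fn()
        except BadRequest:
            continue
        # BUG: no break/return after the with-body finishes, so on success
        # the loop comes back around and yields a second time
        yield result


def retry_bad_request_wrapper(fn, retries=3):
    # a plain wrapper owns the loop, so it can genuinely re-invoke the
    # operation and re-raise the real error on the last attempt
    for attempt in range(retries):
        try:
            return fn()
        except BadRequest:
            if attempt == retries - 1:
                raise


if __name__ == "__main__":
    try:
        with retry_bad_request_buggy(lambda: "ok") as r:
            print(r)
    except RuntimeError as exc:
        print(exc)  # -> generator didn't stop

    calls = []

    def flaky():
        calls.append(1)
        if len(calls) < 2:
            raise BadRequest()
        return "ok on retry"

    print(retry_bad_request_wrapper(flaky))  # -> ok on retry
```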
19:40 <gmann> dansmith: feel free to add nova-ceph-multistore in the tempest experimental queue to check it in the same patch
19:40 <gmann> that will help in the future too
19:41 <dansmith> ack
19:50 <opendevreview> Dan Smith proposed openstack/tempest master: Fix retry_bad_request() context manager  https://review.opendev.org/c/openstack/tempest/+/873300
19:51 <dansmith> gmann: made this ^ fail on the first iteration with BadRequest to confirm it tries again
19:51 <dansmith> please review *carefully* because I cannot be trusted
19:52 <gmann> dansmith: ack, do not worry. it's this cycle's things, not you :)
19:53 <dansmith> thanks, but it's definitely mostly me :)
19:53 <gmann> dansmith: you remember we were talking about how last cycle was smooth in the gate. now it's all coming in this cycle
19:55 <gmann> those job timeouts are leading to our timeouts also
19:59 <dansmith> yeah :(
20:20 <gmann> ralonsoh: this is a possible fix for neutron train/ussuri/victoria, but I am not sure if constraints will be compatible with neutron-tempest-plugin 1.8.0   https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873325
20:21 <gmann> testing here https://review.opendev.org/q/topic:bug%252F2006763
20:22 <gmann> if constraints do not work then it seems we need to call out testing on t-u-v if the 813195 fix is a must to unblock
20:23 <gmann> ralonsoh: I am wondering why they started failing on 'guestmount' now?
20:24 <gmann> ralonsoh: or is it after this that they started failing  https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873059
21:34 <opendevreview> Merged openstack/tempest master: Mark tempest-multinode-full-py3 as n-v  https://review.opendev.org/c/openstack/tempest/+/873228
21:45 <dansmith> gmann: from the multistore job about to finish: Failed: 0
22:12 <gmann> dansmith: yeah, +A on patch
