gmann | kopecmartin: once you start work tomorrow https://review.opendev.org/c/openstack/tempest/+/872982 | 02:22 |
*** yadnesh|away is now known as yadnesh | 04:04 |
opendevreview | yatin proposed openstack/grenade stable/train: [train-only] Add ensure-rust to grenade jobs https://review.opendev.org/c/openstack/grenade/+/872999 | 06:10 |
ykarel | gmann, can you re-workflow https://review.opendev.org/c/openstack/devstack/+/872902 | 06:13 |
*** jpena|off is now known as jpena | 08:27 |
opendevreview | Martin Kopec proposed openstack/tempest master: Mark tempest-multinode-full-py3 as n-v https://review.opendev.org/c/openstack/tempest/+/873228 | 08:41 |
opendevreview | Merged openstack/grenade stable/ussuri: [ussuri-only] Add ensure-rust to grenade jobs https://review.opendev.org/c/openstack/grenade/+/872969 | 09:46 |
opendevreview | Merged openstack/tempest master: Retry glance add/update locations on BadRequest https://review.opendev.org/c/openstack/tempest/+/872982 | 10:16 |
opendevreview | Merged openstack/devstack stable/train: [Train Only] Run ensure-rust role https://review.opendev.org/c/openstack/devstack/+/872902 | 12:58 |
opendevreview | Merged openstack/tempest master: Allow SSH connection callers to not permit agent usage https://review.opendev.org/c/openstack/tempest/+/872566 | 13:25 |
*** yadnesh is now known as yadnesh|away | 14:58 |
dansmith | gmann: kopecmartin so, opensearch tells me we're hitting *hundreds* of OOMs in gate jobs each day | 15:11 |
dansmith | I went to look after my chunked patch failed in the gate for the 300th time due to a single job failure | 15:11 |
dansmith | given that number is high, I'm thinking we should look at some sweeping change(s) | 15:12 |
dansmith | I wonder if we could/should start with upping some swap? | 15:12 |
dansmith | we have a number of jobs that already override swap to 8G, but the one I saw OOM this morning was a 4G job.. is there any reason not to up that to 8G mostly across the board? | 15:17 |
dansmith | clarkb: fungi ^ | 15:20 |
fungi | your timeouts may be related to increased swap use. as soon as you start getting into swap thrash from memory exhaustion, *everything* you do that involves memory access or disk access is going to slow down tremendously | 15:34 |
fungi | i'd start by looking at what's been eating more memory and seeing if that can be fixed/reverted | 15:35 |
fungi | ideally those nodes would use very little swap, mostly to page out infrequently-accessed segments in order to make more room for performance-boosting uses like fs caching | 15:36 |
clarkb | fwiw the swap amount was reduced to 1gb from 8gb in devstack jobs due to being unable to sparse allocate the backing file on ext4 | 15:45 |
clarkb | so we have done more swap in the past, but the kernel made changes that broke userspace (boo) and we had to accommodate that | 15:45 |
clarkb | we could try sparse allocation again though. Maybe jammy fixed it, but I think the change was semi-intentional and not expected to be fixed | 15:46 |
dansmith | fungi: yeah I definitely understand that.. it depends if the problem is a bunch of stuff we load and never use, or if it's in the active set.. given how inflated things get with python it helps more often than I usually expect | 15:52 |
dansmith | clarkb: last I knew the internals, swap on file was actually problematic in general, but I *think* that got better by them making swap access effectively O_DIRECT.. I assume that too is resolved? | 15:53 |
clarkb | dansmith: we have had no issues with swap on a file that I know of | 15:53 |
clarkb | dansmith: the mkswapfs or something manpage documented creating sparse allocations on ext (other fses didn't support it) but then they broke that. Switching to a fully allocated file works fine | 15:54 |
dansmith | fungi: and fwiw, the OOMs I've seen are all choosing mysqld to kill because it's the biggest offender.. I know, oom_score and whatnot, but... | 15:54 |
clarkb | the main drawbacks to using a fully allocated file are the loss of the disk space whether you use the swap space or not and the time it takes to write all those zeros | 15:54 |
clarkb | I would expect the time to be somewhat linear too? allocating 8gb is 8x longer than 1gb? | 15:55 |
dansmith | clarkb: yeah, so the original problem was that the kernel would allocate buffer space to write to the filesystem in a way that it didn't have to do for writing to block, | 15:55 |
dansmith | and that could exacerbate low-memory situations | 15:55 |
dansmith | it's been a long time | 15:55 |
dansmith | so maybe that's resolved | 15:55 |
fungi | yeah, in many cases i looked at in the past, it was tons of separate uwsgi processes consuming most of the memory, but mysql got the axe because it was a large monolithic target | 15:55 |
dansmith | fungi: yeah, and I think we lowered mysql's oom_score to make it less likely | 15:56 |
dansmith | but the more services we run, each needing at least a reasonable number of workers, the more uwsgi processes we're going to have | 15:56 |
dansmith | I've addressed it in certain jobs by reducing scope, like taking c-bak and other things out of jobs that don't need to test that stuff | 15:57 |
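As a rough illustration of the oom_score tweak mentioned above, here is a minimal sketch of how a setup script could lower mysqld's OOM-kill priority, either directly via /proc or through a systemd drop-in. The adjustment value, paths, and the mysql.service unit name are assumptions for illustration, not the actual devstack change.

```bash
#!/bin/bash
# Minimal sketch (not the actual devstack change): make the OOM killer
# less likely to pick mysqld. Values, paths and the unit name are assumptions.

# One-off adjustment for a running daemon; range is -1000..1000 and
# lower means less likely to be killed.
pid=$(pidof mysqld) && echo -500 | sudo tee "/proc/${pid}/oom_score_adj"

# Persistent variant via a systemd drop-in, assuming the service is mysql.service.
sudo mkdir -p /etc/systemd/system/mysql.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/mysql.service.d/oom.conf
[Service]
OOMScoreAdjust=-500
EOF
sudo systemctl daemon-reload
sudo systemctl restart mysql
```

Lower (more negative) oom_score_adj values make the OOM killer less likely to pick that process, which is why a large monolithic daemon like mysqld otherwise tends to be chosen first.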
dansmith | clarkb: is creating multiple files a way around that allocation issue? | 15:58 |
clarkb | dansmith: I don't think so. I expect time cost to be similar due to how clouds limit iops | 15:59 |
dansmith | oh, the problem is just actually fully allocating 8gb of disk, and that's why we needed sparse allocation and reduced the size instead? | 15:59 |
dansmith | I see | 15:59 |
clarkb | right, it's a time issue | 16:00 |
clarkb | fallocate is almost immediate. dd'ing a bunch of zeros is slow | 16:00 |
clarkb | looks like my current swapon manpage no longer documents using fallocate on anything but xfs | 16:01 |
dansmith | do we not get a chunk of disk we can partition and use without having to reboot? | 16:01 |
clarkb | so they fixed the docs at least | 16:01 |
clarkb | dansmith: only on rax (so rax jobs do repartition) | 16:01 |
dansmith | okay | 16:01 |
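To make the fallocate-vs-dd tradeoff above concrete, here is a hedged sketch of the two ways a job could build a fully backed swap file. The /swapfile path and 8G size are placeholder assumptions, and the fallback mirrors the behaviour described above rather than devstack's actual logic.

```bash
#!/bin/bash
# Sketch only: the /swapfile path and 8G size are placeholder assumptions,
# not what devstack actually configures.
set -e
SWAPFILE=/swapfile
SIZE_GB=8

create_with_fallocate() {
    # Nearly instant, but swap files must not be sparse, and on ext4 the
    # kernel at one point stopped accepting fallocate-created swap files.
    sudo fallocate -l "${SIZE_GB}G" "$SWAPFILE"
}

create_with_dd() {
    # Slow but reliable: fully write the file with zeros. The time is
    # roughly linear in size and bounded by the cloud's disk throughput.
    sudo dd if=/dev/zero of="$SWAPFILE" bs=1M count=$((SIZE_GB * 1024))
}

activate() {
    sudo chmod 600 "$SWAPFILE"
    sudo mkswap "$SWAPFILE"
    sudo swapon "$SWAPFILE"
}

if ! { create_with_fallocate && activate; }; then
    # Fall back to full allocation if swapon rejects the fallocated file.
    sudo swapoff "$SWAPFILE" 2>/dev/null || true
    sudo rm -f "$SWAPFILE"
    create_with_dd
    activate
fi
```

On providers that hand out an extra block device (the rax case mentioned above), partitioning that device for swap avoids the allocation cost entirely.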
ykarel | ^ are the affected jobs jammy-based and running tempest tests with higher concurrency, so multiple guest vms run concurrently? | 16:03 |
ykarel | it may be the already-known issue due to qemu 6 | 16:03 |
ykarel | jfr https://bugs.launchpad.net/nova/+bug/1949606 | 16:05 |
clarkb | (coincidentally I just ran into errors on my local machine due to a full /. It's full of docker images...) | 16:05 |
dansmith | are we dumping the rabbit memory diagnostics anywhere? | 16:07 |
dansmith | rabbit definitely has a bunch of stuff swapped out on the one I'm looking at.. only 15MB resident but almost 850MB virt | 16:08 |
fungi | presumably rabbit has some way to tune its message retention, but i really know very little about this problem domain | 16:10 |
clarkb | the two places I know are the Performance Co-Pilot dstat replacement (which is off by default iirc) and the memory dump at the end of the job. | 16:10 |
dansmith | well, it has a diag dump to see where it's using memory, so I'm just wondering if we're already dumping that | 16:10 |
fungi | though it may also leak allocations or just not bother to garbage collect | 16:10 |
clarkb | I don't think we do ^ | 16:10 |
dansmith | clarkb: where should I add that so it runs after a job even if it fails? | 16:11 |
clarkb | dansmith: the post-run portion of the job should run after a failure including timeouts. It has its own timeout to enable that sort of thing | 16:11 |
fungi | post-run phase normally | 16:11 |
dansmith | devstack/playbooks/post.yaml? | 16:12 |
fungi | there's also a cleanup phase, but shouldn't be necessary for this case i wouldn't think | 16:12 |
clarkb | ya that looks like the right file | 16:12 |
dansmith | ah yeah, that's where I did the perf dump | 16:12 |
opendevreview | Dan Smith proposed openstack/devstack master: Grab rabbitmq memory diagnostics dump in post https://review.opendev.org/c/openstack/devstack/+/873299 | 16:17 |
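For context on what a post-run memory dump like the patch above can collect, here is a rough sketch of the relevant rabbitmq commands. The output location is an assumption and the memory_breakdown subcommand only exists on newer RabbitMQ releases, so this is illustrative rather than the content of the change.

```bash
#!/bin/bash
# Illustrative only -- not the contents of the devstack patch above.
# The output path is an assumption.
OUT=${OUT:-/opt/stack/logs/rabbitmq-memory.txt}

{
    echo "=== rabbitmqctl status (includes a memory section) ==="
    sudo rabbitmqctl status || true

    echo "=== per-category memory breakdown (newer RabbitMQ only) ==="
    sudo rabbitmq-diagnostics memory_breakdown || true

    echo "=== per-queue memory usage ==="
    sudo rabbitmqctl list_queues name messages memory || true
} | sudo tee "$OUT" > /dev/null
```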
dansmith | also doesn't seem like we're doing much tweaking of mysql to reduce its memory usage, so maybe there's something we can do there | 16:22 |
opendevreview | Dan Smith proposed openstack/tempest master: Fix retry_bad_request() context manager https://review.opendev.org/c/openstack/tempest/+/873300 | 16:26 |
dansmith | gmann: kopecmartin ^ | 16:26 |
clarkb | I want to say that mordred and aeva did mysql tuning way way back when, but maybe that has been lost or we were tuning to different needs | 16:26 |
dansmith | that patch we landed wasn't actually getting run in the gate, so now anyone running those tests is 100% broken because I SUCK | 16:26 |
dansmith | gmann: kopecmartin testing with https://review.opendev.org/c/openstack/nova/+/873302 because apparently we don't get ceph-multistore coverage in the tempest gate | 16:30 |
dansmith | clarkb: yeah I just don't see any in devstack at a quick glance | 16:30 |
dansmith | gmann: kopecmartin I reconfigured my local devstack to be able to test that fix and it works, so I think it should be okay.. hopefully it merges as quick as the original :/ | 16:53 |
*** jpena is now known as jpena|off | 17:33 |
opendevreview | Dan Smith proposed openstack/devstack master: Grab rabbitmq memory diagnostics dump in post https://review.opendev.org/c/openstack/devstack/+/873299 | 17:35 |
clarkb | https://mariadb.com/kb/en/mariadb-memory-allocation/ may be useful if anyone digs into the db tuning | 17:45 |
clarkb | bit old, but with pointers to stuff | 17:46 |
dansmith | clarkb: yeah I've reduced the footprint on some personal memory-constrained cloud instances successfully in the past so I can add some tweaks | 17:55 |
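As a sketch of the kind of knobs the MariaDB article above points at, a memory-lean drop-in might look like the following. The values, the drop-in path, and the service name are assumptions for illustration, not tested gate settings.

```bash
#!/bin/bash
# Sketch only: values, drop-in path and service name are assumptions,
# not tested gate settings.
cat <<'EOF' | sudo tee /etc/mysql/mariadb.conf.d/99-low-memory.cnf
[mysqld]
# The single biggest knob: the InnoDB buffer pool.
innodb_buffer_pool_size = 128M
# Every connection carries per-session buffers, so cap the count.
max_connections = 100
# performance_schema bookkeeping can cost hundreds of MB.
performance_schema = OFF
# Trim per-thread and temp-table buffers.
sort_buffer_size = 1M
read_buffer_size = 256K
tmp_table_size = 16M
max_heap_table_size = 16M
EOF
sudo systemctl restart mariadb
```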
dansmith | man, tons of post_failure with no logs this morning | 18:52 |
clarkb | chances are one of the swift regions is having trouble. I'll take a look | 18:54 |
dansmith | seems like a lot of them were too fast to have run and fail only during upload, but not all of them | 18:54 |
clarkb | the one I'm looking at was a swift upload failure. Going to look at a couple more just to get more data | 18:59 |
dansmith | https://zuul.opendev.org/t/openstack/build/b367c9b4a4a54eab8964676f4c6f43ec | 19:04 |
dansmith | https://zuul.opendev.org/t/openstack/build/5eb69a31232d444db48b0ec90ccb5fee | 19:04 |
dansmith | https://zuul.opendev.org/t/openstack/build/c21483fb7491468fa893d8df1a4a0368 | 19:04 |
dansmith | just some more that are in my view right now (that's not all of them just a few) | 19:04 |
dansmith | the ceph multistore job I'm eagerly waiting to finish is failing like crazy on tons of tempest tests, so I'm really interested to see what's going on there | 19:05 |
gmann | ykarel: ok, i think kopecmartin did that | 19:05 |
dansmith | perhaps another oom | 19:05 |
gmann | dansmith: +A on 873300 | 19:06 |
dansmith | gmann: thanks, tested locally | 19:06 |
gmann | in the ceph job in the gate it seems other server tests are failing on timeout | 19:10 |
dansmith | gmann: I'm watching one fail like crazy, you see other ceph jobs failing too? | 19:11 |
gmann | this patch only | 19:14 |
dansmith | okay | 19:19 |
dansmith | looks like rbd failures on that one | 19:29 |
dansmith | lots of glance image 404s and some "rbd is busy" errors | 19:29 |
gmann | is this the same? 'RuntimeError: generator didn't stop' https://zuul.opendev.org/t/openstack/build/1566c99f09984540bd8754a22342e486/log/job-output.txt#80958 | 19:31 |
dansmith | oh no, I didn't see that | 19:34 |
dansmith | wtf, I don't see that locally | 19:34 |
dansmith | {0} tempest.api.image.v2.test_images.ImageLocationsTest.test_set_location [2.530617s] ... ok | 19:34 |
dansmith | oh, I wonder if that's because even after the retries it still failed, and I raised the real errors instead of StopIteration? | 19:35 |
gmann | it's happening in test_replace_location also | 19:35 |
gmann | maybe | 19:36 |
dansmith | oh you know, this isn't going to work at all the second time, because we're not iterating over the result | 19:36 |
dansmith | man I screwed this one up | 19:36 |
dansmith | without a failure to test against, it's hard to even know | 19:37 |
dansmith | damn | 19:37 |
dansmith | it was going to fail anyway, it just didn't retry | 19:38 |
gmann | ohk, let me remove the W from the tempest one | 19:38 |
dansmith | let me try to simulate one failure to validate what I think we need | 19:38 |
gmann | dansmith: feel free to add nova-ceph-multistore in the tempest experimental queue to check it in the same patch | 19:40 |
gmann | that will help in the future too | 19:40 |
dansmith | ack | 19:41 |
opendevreview | Dan Smith proposed openstack/tempest master: Fix retry_bad_request() context manager https://review.opendev.org/c/openstack/tempest/+/873300 | 19:50 |
dansmith | gmann: made this ^ fail on the first iteration with BadRequest to confirm it tries again | 19:51 |
dansmith | please review *carefully* because I cannot be trusted | 19:51 |
gmann | dansmith: ack, do not worry. it's this cycle's things, not you :). | 19:52 |
dansmith | thanks, but it's definitely mostly me :) | 19:53 |
gmann | dansmith: you remember we were talking about how last cycle was smooth in the gate. now it is all coming in this cycle | 19:53 |
gmann | those job timeouts are leading to our timeouts also | 19:55 |
dansmith | yeah :( | 19:59 |
gmann | ralonsoh: this is a possible fix for neutron train/ussuri/victoria. but I am not sure if constraints will be compatible with neutron-tempest-plugin 1.8.0 https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873325 | 20:20 |
gmann | testing here https://review.opendev.org/q/topic:bug%252F2006763 | 20:21 |
gmann | if constraints do not work then it seems we need to call out testing on t-u-v if the 813195 fix is a must to unblock | 20:22 |
gmann | ralonsoh: I am wondering why they started failing on 'guestmount' now? | 20:23 |
gmann | ralonsoh: or is it after this that they started failing https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873059 | 20:24 |
opendevreview | Merged openstack/tempest master: Mark tempest-multinode-full-py3 as n-v https://review.opendev.org/c/openstack/tempest/+/873228 | 21:34 |
dansmith | gmann: from the multistore job about to finish: Failed: 0 | 21:45 |
gmann | dansmith: yeah, +A on patch | 22:12 |