*** soniya29 is now known as soniya29|ruck | 04:37 | |
*** soniya29 is now known as soniya29|ruck | 05:29 | |
*** soniya is now known as soniya|ruck | 06:45 | |
*** elodilles is now known as elodilles_pto | 07:04 | |
*** jpena|off is now known as jpena | 07:42 | |
opendevreview | Martin Kopec proposed openstack/patrole master: Drop py3.6 and py3.7 from Patrole https://review.opendev.org/c/openstack/patrole/+/840457 | 08:48 |
opendevreview | Martin Kopec proposed openstack/coverage2sql master: Update python testing as per zed cycle testing runtime https://review.opendev.org/c/openstack/coverage2sql/+/841002 | 09:00 |
opendevreview | Martin Kopec proposed openstack/os-performance-tools master: DNM CI jobs health check https://review.opendev.org/c/openstack/os-performance-tools/+/846136 | 09:02 |
opendevreview | Martin Kopec proposed openstack/devstack-tools master: DNM CI jobs health check https://review.opendev.org/c/openstack/devstack-tools/+/846138 | 09:06 |
opendevreview | Merged openstack/grenade master: docs: fix typo in README https://review.opendev.org/c/openstack/grenade/+/839247 | 09:08 |
opendevreview | Martin Kopec proposed openstack/tempest-stress master: DNM CI jobs health check https://review.opendev.org/c/openstack/tempest-stress/+/846139 | 09:08 |
opendevreview | Merged openstack/bashate master: Do not run pre-commit verbose by default https://review.opendev.org/c/openstack/bashate/+/836142 | 09:13 |
opendevreview | Martin Kopec proposed openstack/devstack-tools master: Drop py3.6 and py3.7 from devstack-tools https://review.opendev.org/c/openstack/devstack-tools/+/846138 | 10:09 |
*** soniya is now known as soniya29|ruck | 11:00 | |
*** soniya29 is now known as soniya29|ruck | 11:47 | |
*** soniya is now known as soniya29|ruck | 12:05 | |
*** soniya is now known as soniya29|ruck | 12:39 | |
opendevreview | Merged openstack/tempest master: Make test_server_actions.resource_setup() wait for SSHABLE https://review.opendev.org/c/openstack/tempest/+/843155 | 13:31 |
*** soniya is now known as soniya|ruck | 13:58 | |
*** soniya29 is now known as soniya29|ruck | 14:38 | |
opendevreview | Merged openstack/coverage2sql master: Update python testing as per zed cycle testing runtime https://review.opendev.org/c/openstack/coverage2sql/+/841002 | 14:45 |
*** soniya29|ruck is now known as soniya29|out | 15:18 | |
dansmith | clarkb: so to query individual values from the perf json in the web ui, do I need to select some other index? | 15:57 |
dansmith | seems like logstash-logs is all I can see | 15:58 |
clarkb | I'm not sure, dpawlik may know? | 15:59 |
dansmith | I figured, but he's not around | 15:59 |
dansmith | we've got maybe a regression to go hunt, but I'm not sure how | 15:59 |
clarkb | tristanC may know too | 15:59 |
opendevreview | Dan Smith proposed openstack/devstack master: Add perftop.py tool for examining performance https://review.opendev.org/c/openstack/devstack/+/846198 | 16:19 |
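(The perftop.py tool itself is in the linked review; purely as an illustration of the idea, a minimal sketch that lists the biggest processes recorded in a DevStack performance.json might look like the following. The field names `processes`, `cmd`, and `rss`, and the unit of `rss`, are assumptions, not taken from the actual file schema or from the tool under review.)

```python
# Hypothetical sketch only: print the largest processes captured in a
# performance.json, biggest resident set first. The keys "processes",
# "cmd" and "rss" (assumed to be bytes) are guesses at the layout.
import json
import sys


def top_rss(path, limit=10):
    with open(path) as f:
        data = json.load(f)
    procs = sorted(data.get("processes", []),
                   key=lambda p: p.get("rss", 0), reverse=True)
    for proc in procs[:limit]:
        print(f"{proc.get('rss', 0) / (1024 * 1024):10.1f} MiB  {proc.get('cmd', '?')}")


if __name__ == "__main__":
    top_rss(sys.argv[1] if len(sys.argv) > 1 else "performance.json")
```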
dansmith | sean-k-mooney: geguileo was looking at why c-bak was getting so big, which led us (him) to notice that n-cpu did as well in certain circumstances | 16:20 |
dansmith | moving the convo here since it's not cinder-specific anymore | 16:20 |
dansmith | sean-k-mooney: so my memory stat is rss not virt (like I thought it was), so your swap thing may be relevant, | 16:21 |
dansmith | although as you noted, they're all the same worker | 16:21 |
dansmith | but perhaps a pressure-induced behavior difference | 16:21 |
dansmith | still seems like a big difference to me which makes me wonder if we're doing something dumb like reading a big file into memory | 16:22 |
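(For context on the rss-vs-virt distinction above: on Linux both figures are exposed in /proc, and the sketch below, which is not part of the job tooling, shows how to read them for a given process.)

```python
# Read resident (VmRSS) vs. virtual (VmSize) memory for a process from
# /proc/<pid>/status; the kernel reports both values in kB.
def mem_kib(pid="self"):
    stats = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmRSS:", "VmSize:")):
                key, value = line.split(":", 1)
                stats[key] = int(value.split()[0])
    return stats


if __name__ == "__main__":
    print(mem_kib())  # e.g. {'VmSize': 24520, 'VmRSS': 9800}
```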
geguileo | I think it's glibc not releasing memory back to the system | 16:22 |
geguileo | due to the memory pressure | 16:23 |
geguileo | because the compute node is the one that goes >1GB while the controller node is < 500MB | 16:23 |
dansmith | you mean when pressure is low, it avoids releasing as much? | 16:23 |
geguileo | yes | 16:24 |
dansmith | okay, so you're thinking that pressure is lower in the multinode case? I suppose that might be true if we're spinning up the same number of instances but across two nodes | 16:25 |
geguileo | controller nodes are using >5GB in total, whereas the computes only use 2GB | 16:25 |
dansmith | but in the multinode case, we're running one controller+compute in the same way as the single node job, only the "other compute" is emptier | 16:26 |
geguileo | yes, and since it's emptier there is less pressure for glibc to release to the system, right? | 16:26 |
dansmith | right, but the perf.json is only from the controller+compute node | 16:27 |
geguileo | maybe I'm missing something... | 16:27 |
geguileo | aren't those 2 different nodes? | 16:27 |
geguileo | one is controller and another is compute? | 16:28 |
sean-k-mooney | i think we have swap enabled by default | 16:28 |
dansmith | there are two nodes total.. one with all the control services *and* a compute, per normal, and then one other node with just compute | 16:28 |
sean-k-mooney | so i was wondering if it could be related to memcache | 16:28 |
sean-k-mooney | and the fact the subnode won't use it and will use a dict cache | 16:28 |
geguileo | dansmith: correct, and in that one on the controller node n-cpu uses 500MB (out of the 5GB that performance.json has) whereas in the compute n-cpu consumes 1.xGB out of 2GB | 16:29 |
geguileo | so if the nodes have the same amount of RAM and the instances are evenly distributed | 16:30 |
dansmith | geguileo: not based on my info.. the only perf.json I'm looking at is from the combined node, which is where it has inflated to 1.5G | 16:30 |
geguileo | then there is less pressure on the compute node | 16:30 |
dansmith | right, but that's not where I'm seeing n-cpu be larger :) | 16:30 |
geguileo | dansmith: I'm looking only at multinode ones... | 16:30 |
geguileo | compute 2.4GB https://f6314bbe689272b182bf-704d2e5cde896695f5c12544f01f1d12.ssl.cf1.rackcdn.com/845806/2/check/tempest-slow-py3/363a876/compute1/logs/performance.json | 16:31 |
geguileo | controller 580MB https://f6314bbe689272b182bf-704d2e5cde896695f5c12544f01f1d12.ssl.cf1.rackcdn.com/845806/2/check/tempest-slow-py3/363a876/controller/logs/performance.json | 16:31 |
dansmith | ack, yep, I had no stored data from the compute-only one, but looking at a recent job I see the same | 16:31 |
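(The two figures geguileo quotes can be checked directly from the performance.json links above; a rough helper for that comparison might look like this, with the same assumed `processes`/`rss` keys as in the earlier sketch.)

```python
# Hypothetical comparison of the controller vs. compute1 performance.json
# files linked above; key names are assumptions about the file layout.
import json
import urllib.request

BASE = ("https://f6314bbe689272b182bf-704d2e5cde896695f5c12544f01f1d12"
        ".ssl.cf1.rackcdn.com/845806/2/check/tempest-slow-py3/363a876")

for node in ("controller", "compute1"):
    url = f"{BASE}/{node}/logs/performance.json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    total = sum(p.get("rss", 0) for p in data.get("processes", []))
    print(f"{node}: total rss (assumed bytes) = {total}")
```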
dansmith | the thing I was trying to say is: | 16:32 |
geguileo | if it's a pressure thing it could be the reason why suddenly when c-bak consumes less memory n-cpu seems to consume more | 16:32 |
dansmith | n-cpu on the controller+compute node is *larger* on the multinode job than a controller+compute on a single-node job, by like 3x | 16:32 |
dansmith | BUT, presumably that's because we're stressing the controller node less on the multinode job because half the instances are spun up there | 16:33 |
dansmith | I see 4.2G on one compute-only job I just examined | 16:33 |
dansmith | that's still way too big, even if we can hand that back to the system on demand I think.. like I dunno what's making us be that big | 16:34 |
geguileo | for c-bak it was the fragmentation caused by glibc's per-thread malloc arenas | 16:35 |
geguileo | during my c-bak investigation I checked leaked objects, python memory management system stats, and even forced glibc malloc to free the memory in the service | 16:37 |
geguileo | and in the end for c-bak it was sufficient to set those env variables | 16:37 |
dansmith | okay, but were those chunks of memory that glibc would have given up if it was under pressure or not? | 16:37 |
geguileo | maybe for nova we need to either force glibc to free the memory or we need to configure the memory pressure | 16:37 |
geguileo | dansmith: let me see if I tried that specific combo | 16:38 |
geguileo | I didn't try that combo :-( | 16:38 |
geguileo | I only tried it with a single malloc arena | 16:39 |
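(Two standard glibc levers map onto what geguileo describes: `MALLOC_ARENA_MAX` limits the per-thread malloc arenas behind the fragmentation, and `malloc_trim()` asks glibc to hand freed memory back to the kernel. Whether these are the exact variables and calls used in the c-bak investigation isn't stated here, so the sketch below shows only the general technique.)

```python
# Sketch of the generic glibc approach (not necessarily what was used for
# c-bak): MALLOC_ARENA_MAX=1 has to be set in the service's environment
# before it starts (e.g. an Environment= line in its systemd unit), while
# malloc_trim(0) can be called from a running process via ctypes to ask
# glibc to return freed heap pages to the OS.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))


def trim_glibc_heap():
    """Release free glibc heap memory back to the kernel (Linux/glibc only)."""
    return libc.malloc_trim(0)  # 1 if memory was released, 0 otherwise


if __name__ == "__main__":
    print("malloc_trim returned:", trim_glibc_heap())
```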
dansmith | just wondering if the same thing applies to n-cpu or if it's something else, | 16:39 |
dansmith | because based on these numbers, I would have expected n-cpu to always be more than c-bak in any of the dumps, which I never saw | 16:39 |
geguileo | I've seen CI failures caused by OOM c-bak kills, but I don't think there have been any for n-cpu, right? | 16:41 |
dansmith | never that I've seen yeah | 16:41 |
*** jpena is now known as jpena|off | 16:44 | |
geguileo | dansmith: is the nova-grenade-multinode the right job to look at for the compute & controller memory usage? | 16:50 |
dansmith | I'd look at one of the more standard ones, like nova-multi-cell | 16:50 |
dansmith | the grenade one runs both more and less because it's 1.5 tempest runs, 1.5 devstack runs, etc | 16:51 |
geguileo | ok, so multi-cell has compute & controller nodes | 16:51 |
geguileo | I'll do a couple of tests with that one | 16:52 |