*** soniya29 is now known as soniya29|ruck | 04:37 | |
*** soniya29 is now known as soniya29|ruck | 05:29 | |
*** soniya is now known as soniya|ruck | 06:45 | |
*** elodilles is now known as elodilles_pto | 07:04 | |
*** jpena|off is now known as jpena | 07:42 | |
opendevreview | Martin Kopec proposed openstack/patrole master: Drop py3.6 and py3.7 from Patrole https://review.opendev.org/c/openstack/patrole/+/840457 | 08:48 |
opendevreview | Martin Kopec proposed openstack/coverage2sql master: Update python testing as per zed cycle testing runtime https://review.opendev.org/c/openstack/coverage2sql/+/841002 | 09:00 |
opendevreview | Martin Kopec proposed openstack/os-performance-tools master: DNM CI jobs health check https://review.opendev.org/c/openstack/os-performance-tools/+/846136 | 09:02 |
opendevreview | Martin Kopec proposed openstack/devstack-tools master: DNM CI jobs health check https://review.opendev.org/c/openstack/devstack-tools/+/846138 | 09:06 |
opendevreview | Merged openstack/grenade master: docs: fix typo in README https://review.opendev.org/c/openstack/grenade/+/839247 | 09:08 |
opendevreview | Martin Kopec proposed openstack/tempest-stress master: DNM CI jobs health check https://review.opendev.org/c/openstack/tempest-stress/+/846139 | 09:08 |
opendevreview | Merged openstack/bashate master: Do not run pre-commit verbose by default https://review.opendev.org/c/openstack/bashate/+/836142 | 09:13 |
opendevreview | Martin Kopec proposed openstack/devstack-tools master: Drop py3.6 and py3.7 from devstack-tools https://review.opendev.org/c/openstack/devstack-tools/+/846138 | 10:09 |
*** soniya is now known as soniya29|ruck | 11:00 | |
*** soniya29 is now known as soniya29|ruck | 11:47 | |
*** soniya is now known as soniya29|ruck | 12:05 | |
*** soniya is now known as soniya29|ruck | 12:39 | |
opendevreview | Merged openstack/tempest master: Make test_server_actions.resource_setup() wait for SSHABLE https://review.opendev.org/c/openstack/tempest/+/843155 | 13:31 |
*** soniya is now known as soniya|ruck | 13:58 | |
*** soniya29 is now known as soniya29|ruck | 14:38 | |
opendevreview | Merged openstack/coverage2sql master: Update python testing as per zed cycle testing runtime https://review.opendev.org/c/openstack/coverage2sql/+/841002 | 14:45 |
*** soniya29|ruck is now known as soniya29|out | 15:18 | |
dansmith | clarkb: so to query individual values from the perf json in the web ui, do I need to select some other index? | 15:57 |
dansmith | seems like logstash-logs is all I can see | 15:58 |
clarkb | I'm not sure, dpawlik may know? | 15:59 |
dansmith | I figured, but he's not around | 15:59 |
dansmith | we've got maybe a regression to go hunt, but I'm not sure how | 15:59 |
clarkb | tristanC may know too | 15:59 |
opendevreview | Dan Smith proposed openstack/devstack master: Add perftop.py tool for examining performance https://review.opendev.org/c/openstack/devstack/+/846198 | 16:19 |
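(The perftop.py tool itself is in the linked review; purely as an illustration of the idea, a minimal sketch that lists the biggest processes recorded in a DevStack performance.json might look like the following. The field names `processes`, `cmd`, and `rss`, and the unit of `rss`, are assumptions, not taken from the actual file schema or from the tool under review.)

```python
# Hypothetical sketch only: print the largest processes captured in a
# performance.json, biggest resident set first. The keys "processes",
# "cmd" and "rss" (assumed to be bytes) are guesses at the layout.
import json
import sys


def top_rss(path, limit=10):
    with open(path) as f:
        data = json.load(f)
    procs = sorted(data.get("processes", []),
                   key=lambda p: p.get("rss", 0), reverse=True)
    for proc in procs[:limit]:
        print(f"{proc.get('rss', 0) / (1024 * 1024):10.1f} MiB  {proc.get('cmd', '?')}")


if __name__ == "__main__":
    top_rss(sys.argv[1] if len(sys.argv) > 1 else "performance.json")
```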
dansmith | sean-k-mooney: geguileo was looking at why c-bak was getting so big, which led us (him) to notice that n-cpu did as well in certain circumstances | 16:20 |
dansmith | moving the convo here since it's not cinder-specific anymore | 16:20 |
dansmith | sean-k-mooney: so my memory stat is rss not virt (like I thought it was), so your swap thing may be relevant, | 16:21 |
dansmith | although as you noted, they're all the same worker | 16:21 |
dansmith | but perhaps a pressure-induced behavior difference | 16:21 |
dansmith | still seems like a big difference to me which makes me wonder if we're doing something dumb like reading a big file into memory | 16:22 |
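(For context on the rss-vs-virt distinction above: on Linux both figures are exposed in /proc, and the sketch below, which is not part of the job tooling, shows how to read them for a given process.)

```python
# Read resident (VmRSS) vs. virtual (VmSize) memory for a process from
# /proc/<pid>/status; the kernel reports both values in kB.
def mem_kib(pid="self"):
    stats = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmRSS:", "VmSize:")):
                key, value = line.split(":", 1)
                stats[key] = int(value.split()[0])
    return stats


if __name__ == "__main__":
    print(mem_kib())  # e.g. {'VmSize': 24520, 'VmRSS': 9800}
```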
geguileo | I think it's glibc not releasing memory back to the system | 16:22 |
geguileo | due to the memory pressure | 16:23 |
geguileo | because the compute node is the one that goes >1GB while the controller node is < 500MB | 16:23 |
dansmith | you mean when pressure is low, it avoids releasing as much? | 16:23 |
geguileo | yes | 16:24 |
dansmith | okay, so you're thinking that pressure is lower in the multinode case? I suppose that might be true if we're spinning up the same number of instances but across two nodes | 16:25 |
geguileo | controller nodes are using >5GB in total, whereas the computes only use 2GB | 16:25 |
dansmith | but in the multinode case, we're running one controller+compute in the same way as the single node job, only the "other compute" is emptier | 16:26 |
geguileo | yes, and since it's emptier there is less pressure for glibc to release to the system, right? | 16:26 |
dansmith | right, but the perf.json is only from the controller+compute node | 16:27 |
geguileo | maybe I'm missing something... | 16:27 |
geguileo | aren't those 2 different nodes? | 16:27 |
geguileo | one is controller and another is compute? | 16:28 |
sean-k-mooney | i think we have swap enabled by default | 16:28 |
dansmith | there are two nodes total.. one with all the control services *and* a compute, per normal, and then one other node with just compute | 16:28 |
sean-k-mooney | so i was wondering if it could be related to memcache | 16:28 |
sean-k-mooney | and the fact the subnode won't use it and will use a dict cache | 16:28 |
geguileo | dansmith: correct, and in that one on the controller node n-cpu uses 500MB (out of the 5GB that performance.json has) whereas in the compute n-cpu consumes 1.xGB out of 2GB | 16:29 |
geguileo | so if the nodes have the same amount of RAM and the instances are evenly distributed | 16:30 |
dansmith | geguileo: not based on my info.. the only perf.json I'm looking at is from the combined node, which is where it has inflated to 1.5G | 16:30 |
geguileo | then there is less pressure on the compute node | 16:30 |
dansmith | right, but that's not where I'm seeing n-cpu be larger :) | 16:30 |
geguileo | dansmith: I'm looking only at multinode ones... | 16:30 |
geguileo | compute 2.4GB https://f6314bbe689272b182bf-704d2e5cde896695f5c12544f01f1d12.ssl.cf1.rackcdn.com/845806/2/check/tempest-slow-py3/363a876/compute1/logs/performance.json | 16:31 |
geguileo | controller 580MB https://f6314bbe689272b182bf-704d2e5cde896695f5c12544f01f1d12.ssl.cf1.rackcdn.com/845806/2/check/tempest-slow-py3/363a876/controller/logs/performance.json | 16:31 |
dansmith | ack, yep, I had no stored data from the compute-only one, but looking at a recent job I see the same | 16:31 |
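(The two figures geguileo quotes can be checked directly from the performance.json links above; a rough helper for that comparison might look like this, with the same assumed `processes`/`rss` keys as in the earlier sketch.)

```python
# Hypothetical comparison of the controller vs. compute1 performance.json
# files linked above; key names are assumptions about the file layout.
import json
import urllib.request

BASE = ("https://f6314bbe689272b182bf-704d2e5cde896695f5c12544f01f1d12"
        ".ssl.cf1.rackcdn.com/845806/2/check/tempest-slow-py3/363a876")

for node in ("controller", "compute1"):
    url = f"{BASE}/{node}/logs/performance.json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    total = sum(p.get("rss", 0) for p in data.get("processes", []))
    print(f"{node}: total rss (assumed bytes) = {total}")
```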
dansmith | the thing I was trying to say is: | 16:32 |
geguileo | if it's a pressure thing it could be the reason why suddenly when c-bak consumes less memory n-cpu seems to consume more | 16:32 |
dansmith | n-cpu on the controller+compute node is *larger* on the multinode job than a controller+compute on a single-node job, by like 3x | 16:32 |
dansmith | BUT, presumably that's because we're stressing the controller node less on the multinode job because half the instances are spun up there | 16:33 |
dansmith | I see 4.2G on one compute-only job I just examined | 16:33 |
dansmith | that's still way too big, even if we can hand that back to the system on demand I think.. like I dunno what's making us be that big | 16:34 |
geguileo | for c-bak it was the fragmentation caused by glibc's per-thread malloc arenas | 16:35 |
geguileo | during my c-bak investigation I checked leaked objects, python memory management system stats, and even forced glibc malloc to free the memory in the service | 16:37 |
geguileo | and in the end for c-bak it was sufficient to set those env variables | 16:37 |
dansmith | okay, but were those chunks of memory that glibc would have given up if it was under pressure or not? | 16:37 |
geguileo | maybe for nova we need to either force glibc to free the memory or we need to configure the memory pressure | 16:37 |
geguileo | dansmith: let me see if I tried that specific combo | 16:38 |
geguileo | I didn't try that combo :-( | 16:38 |
geguileo | I only tried it with a single malloc arena | 16:39 |
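(Two standard glibc levers map onto what geguileo describes: `MALLOC_ARENA_MAX` limits the per-thread malloc arenas behind the fragmentation, and `malloc_trim()` asks glibc to hand freed memory back to the kernel. Whether these are the exact variables and calls used in the c-bak investigation isn't stated here, so the sketch below shows only the general technique.)

```python
# Sketch of the generic glibc approach (not necessarily what was used for
# c-bak): MALLOC_ARENA_MAX=1 has to be set in the service's environment
# before it starts (e.g. an Environment= line in its systemd unit), while
# malloc_trim(0) can be called from a running process via ctypes to ask
# glibc to return freed heap pages to the OS.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))


def trim_glibc_heap():
    """Release free glibc heap memory back to the kernel (Linux/glibc only)."""
    return libc.malloc_trim(0)  # 1 if memory was released, 0 otherwise


if __name__ == "__main__":
    print("malloc_trim returned:", trim_glibc_heap())
```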
dansmith | just wondering if the same thing applies to n-cpu or if it's something else, | 16:39 |
dansmith | because based on these numbers, I would have expected n-cpu to always be more than c-bak in any of the dumps, which I never saw | 16:39 |
geguileo | I've seen CI failures caused by OOM c-bak kills, but I don't think there have been any for n-cpu, right? | 16:41 |
dansmith | never that I've seen yeah | 16:41 |
*** jpena is now known as jpena|off | 16:44 | |
geguileo | dansmith: is the nova-grenade-multinode the right job to look at for the compute & controller memory usage? | 16:50 |
dansmith | I'd look at one of the more standard ones, like nova-multi-cell | 16:50 |
dansmith | the grenade one runs both more and less because it's 1.5 tempest runs, 1.5 devstack runs, etc | 16:51 |
geguileo | ok, so multi-cell has compute & controller nodes | 16:51 |
geguileo | I'll do a couple of tests with that one | 16:52 |