gmann | kopecmartin: once you start work tomorrow https://review.opendev.org/c/openstack/tempest/+/872982 | 02:22 |
*** yadnesh|away is now known as yadnesh | 04:04 |
opendevreview | yatin proposed openstack/grenade stable/train: [train-only] Add ensure-rust to grenade jobs https://review.opendev.org/c/openstack/grenade/+/872999 | 06:10 |
ykarel | gmann, can you re-workflow https://review.opendev.org/c/openstack/devstack/+/872902 | 06:13 |
*** jpena|off is now known as jpena | 08:27 |
opendevreview | Martin Kopec proposed openstack/tempest master: Mark tempest-multinode-full-py3 as n-v https://review.opendev.org/c/openstack/tempest/+/873228 | 08:41 |
opendevreview | Merged openstack/grenade stable/ussuri: [ussuri-only] Add ensure-rust to grenade jobs https://review.opendev.org/c/openstack/grenade/+/872969 | 09:46 |
opendevreview | Merged openstack/tempest master: Retry glance add/update locations on BadRequest https://review.opendev.org/c/openstack/tempest/+/872982 | 10:16 |
opendevreview | Merged openstack/devstack stable/train: [Train Only] Run ensure-rust role https://review.opendev.org/c/openstack/devstack/+/872902 | 12:58 |
opendevreview | Merged openstack/tempest master: Allow SSH connection callers to not permit agent usage https://review.opendev.org/c/openstack/tempest/+/872566 | 13:25 |
*** yadnesh is now known as yadnesh|away | 14:58 |
dansmith | gmann: kopecmartin so, opensearch tells me we're hitting *hundreds* of OOMs in gate jobs each day | 15:11 |
dansmith | I went to look after my chunked patch failed in the gate for the 300th time due to a single job failure | 15:11 |
dansmith | given that number is high, I'm thinking we should look at some sweeping change(s) | 15:12 |
dansmith | I wonder if we could/should start with upping some swap? | 15:12 |
dansmith | we have a number of jobs that already override swap to 8G, but the one I saw OOM this morning was a 4G job.. is there any reason not to up that to 8G mostly across the board? | 15:17 |
dansmith | clarkb: fungi ^ | 15:20 |
fungi | your timeouts may be related to increased swap use. as soon as you start getting into swap thrash from memory exhaustion, *everything* you do that involves memory access or disk access is going to slow down tremendously | 15:34 |
fungi | i'd start by looking at what's been eating more memory and seeing if that can be fixed/reverted | 15:35 |
fungi | ideally those nodes would use very little swap, mostly to page out infrequently-accessed segments in order to make more room for performance-boosting uses like fs caching | 15:36 |
clarkb | fwiw the swap amount was reduced to 1gb from 8gb in devstack jobs due to being unable to sparse allocate the backing file on ext4 | 15:45 |
clarkb | so we have done more swap in the past, but the kernel made changes that broke userspace (boo) and we had to accommodate that | 15:45 |
clarkb | we could try sparse allocation again though. Maybe jammy fixed it, but I think the change was semi-intentional and not expected to be fixed | 15:46 |
dansmith | fungi: yeah I definitely understand that.. it depends if the problem is a bunch of stuff we load and never use, or if it's in the active set.. given how inflated things get with python it helps more often than I usually expect | 15:52 |
dansmith | clarkb: last I knew the internals, swap on file was actually problematic in general, but I *think* that got better by them making swap access effectively O_DIRECT.. I assume that too is resolved? | 15:53 |
clarkb | dansmith: we have had no issues with swap on a file that I know of | 15:53 |
clarkb | dansmith: the mkswapfs or something manpage documented creating sparse allocations on ext (other fses didn't support it) but then they broke that. Switching to a fully allocated file works fine | 15:54 |
dansmith | fungi: and fwiw, the OOMs I've seen are all choosing mysqld to kill because it's the biggest offender.. I know, oom_score and whatnot, but... | 15:54 |
clarkb | the main drawbacks to using a fully allocated file are the loss of the disk space whether you use the swap space or not and the time it takes to write all those zeros | 15:54 |
clarkb | I would expect the time to be somewhat linear too? allocating 8gb is 8x longer than 1gb? | 15:55 |
dansmith | clarkb: yeah, so the original problem was that the kernel would allocate buffer space to write to the filesystem in a way that it didn't have to do for writing to block, | 15:55 |
dansmith | and that could exacerbate low-memory situations | 15:55 |
dansmith | it's been a long time | 15:55 |
dansmith | so maybe that's resolved | 15:55 |
fungi | yeah, in many cases i looked at in the past, it was tons of separate uwsgi processes consuming most of the memory, but mysql got the axe because it was a large monolithic target | 15:55 |
dansmith | fungi: yeah, and I think we lowered mysql's oom_score to make it less likely | 15:56 |
dansmith | but the more services we run, each needing at least a reasonable number of workers, the more uwsgi processes we're going to have | 15:56 |
dansmith | I've addressed it in certain jobs by reducing scope, like taking c-bak and other things out of jobs that don't need to test that stuff | 15:57 |
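As a rough illustration of the oom_score tweak mentioned above, here is a minimal sketch of how a setup script could lower mysqld's OOM-kill priority, either directly via /proc or through a systemd drop-in. The adjustment value, paths, and the mysql.service unit name are assumptions for illustration, not the actual devstack change.

```bash
#!/bin/bash
# Minimal sketch (not the actual devstack change): make the OOM killer
# less likely to pick mysqld. Values, paths and the unit name are assumptions.

# One-off adjustment for a running daemon; range is -1000..1000 and
# lower means less likely to be killed.
pid=$(pidof mysqld) && echo -500 | sudo tee "/proc/${pid}/oom_score_adj"

# Persistent variant via a systemd drop-in, assuming the service is mysql.service.
sudo mkdir -p /etc/systemd/system/mysql.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/mysql.service.d/oom.conf
[Service]
OOMScoreAdjust=-500
EOF
sudo systemctl daemon-reload
sudo systemctl restart mysql
```

Lower (more negative) oom_score_adj values make the OOM killer less likely to pick that process, which is why a large monolithic daemon like mysqld otherwise tends to be chosen first.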
dansmith | clarkb: is creating multiple files a way around that allocation issue? | 15:58 |
clarkb | dansmith: I don't think so. I expect time cost to be similar due to how clouds limit iops | 15:59 |
dansmith | oh, the problem is just actually fully allocating 8gb of disk, and that's why we needed sparse allocation and reduced the size instead? | 15:59 |
dansmith | I see | 15:59 |
clarkb | right, it's a time issue | 16:00 |
clarkb | fallocate is almost immediate. dd'ing a bunch of zeros is slow | 16:00 |
clarkb | looks like my current swapon manpage no longer documents using fallocate on anything but xfs | 16:01 |
dansmith | do we not get a chunk of disk we can partition and use without having to reboot? | 16:01 |
clarkb | so they fixed the docs at least | 16:01 |
clarkb | dansmith: only on rax (so rax jobs do repartition) | 16:01 |
dansmith | okay | 16:01 |
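To make the fallocate-vs-dd tradeoff above concrete, here is a hedged sketch of the two ways a job could build a fully backed swap file. The /swapfile path and 8G size are placeholder assumptions, and the fallback mirrors the behaviour described above rather than devstack's actual logic.

```bash
#!/bin/bash
# Sketch only: the /swapfile path and 8G size are placeholder assumptions,
# not what devstack actually configures.
set -e
SWAPFILE=/swapfile
SIZE_GB=8

create_with_fallocate() {
    # Nearly instant, but swap files must not be sparse, and on ext4 the
    # kernel at one point stopped accepting fallocate-created swap files.
    sudo fallocate -l "${SIZE_GB}G" "$SWAPFILE"
}

create_with_dd() {
    # Slow but reliable: fully write the file with zeros. The time is
    # roughly linear in size and bounded by the cloud's disk throughput.
    sudo dd if=/dev/zero of="$SWAPFILE" bs=1M count=$((SIZE_GB * 1024))
}

activate() {
    sudo chmod 600 "$SWAPFILE"
    sudo mkswap "$SWAPFILE"
    sudo swapon "$SWAPFILE"
}

if ! { create_with_fallocate && activate; }; then
    # Fall back to full allocation if swapon rejects the fallocated file.
    sudo swapoff "$SWAPFILE" 2>/dev/null || true
    sudo rm -f "$SWAPFILE"
    create_with_dd
    activate
fi
```

On providers that hand out an extra block device (the rax case mentioned above), partitioning that device for swap avoids the allocation cost entirely.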
ykarel | ^ are the affected jobs jammy-based and running tempest tests with higher concurrency, so multiple guest vms run concurrently? | 16:03 |
ykarel | it may be the already-known issue due to qemu 6 | 16:03 |
ykarel | jfr https://bugs.launchpad.net/nova/+bug/1949606 | 16:05 |
clarkb | (coincidentally I just ran into errors on my local machine due to a full /. It's full of docker images...) | 16:05 |
dansmith | are we dumping the rabbit memory diagnostics anywhere? | 16:07 |
dansmith | rabbit definitely has a bunch of stuff swapped out on the one I'm looking at.. only 15MB resident but almost 850MB virt | 16:08 |
fungi | presumably rabbit has some way to tune its message retention, but i really know very little about this problem domain | 16:10 |
clarkb | the two places I know are the Performance Co-Pilot dstat replacement (which is off by default iirc) and the memory dump at the end of the job. | 16:10 |
dansmith | well, it has a diag dump to see where it's using memory, so I'm just wondering if we're already dumping that | 16:10 |
fungi | though it may also leak allocations or just not bother to garbage collect | 16:10 |
clarkb | I don't think we do ^ | 16:10 |
dansmith | clarkb: where should I add that so it runs after a job even if it fails? | 16:11 |
clarkb | dansmith: the post-run portion of the job should run after a failure including timeouts. It has its own timeout to enable that sort of thing | 16:11 |
fungi | post-run phase normally | 16:11 |
dansmith | devstack/playbooks/post.yaml? | 16:12 |
fungi | there's also a cleanup phase, but shouldn't be necessary for this case i wouldn't think | 16:12 |
clarkb | ya that looks like the right file | 16:12 |
dansmith | ah yeah, that's where I did the perf dump | 16:12 |
opendevreview | Dan Smith proposed openstack/devstack master: Grab rabbitmq memory diagnostics dump in post https://review.opendev.org/c/openstack/devstack/+/873299 | 16:17 |
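For context on what a post-run memory dump like the patch above can collect, here is a rough sketch of the relevant rabbitmq commands. The output location is an assumption and the memory_breakdown subcommand only exists on newer RabbitMQ releases, so this is illustrative rather than the content of the change.

```bash
#!/bin/bash
# Illustrative only -- not the contents of the devstack patch above.
# The output path is an assumption.
OUT=${OUT:-/opt/stack/logs/rabbitmq-memory.txt}

{
    echo "=== rabbitmqctl status (includes a memory section) ==="
    sudo rabbitmqctl status || true

    echo "=== per-category memory breakdown (newer RabbitMQ only) ==="
    sudo rabbitmq-diagnostics memory_breakdown || true

    echo "=== per-queue memory usage ==="
    sudo rabbitmqctl list_queues name messages memory || true
} | sudo tee "$OUT" > /dev/null
```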
dansmith | also doesn't seem like we're doing much tweaking of mysql to reduce its memory usage, so maybe there's something we can do there | 16:22 |
opendevreview | Dan Smith proposed openstack/tempest master: Fix retry_bad_request() context manager https://review.opendev.org/c/openstack/tempest/+/873300 | 16:26 |
dansmith | gmann: kopecmartin ^ | 16:26 |
clarkb | I want to say that mordred and aeva did mysql tuning way way back when, but maybe that has been lost or we were tuning to different needs | 16:26 |
dansmith | that patch we landed wasn't actually getting run in the gate, so now anyone running those tests is 100% broken because I SUCK | 16:26 |
dansmith | gmann: kopecmartin testing with https://review.opendev.org/c/openstack/nova/+/873302 because apparently we don't get ceph-multistore coverage in the tempest gate | 16:30 |
dansmith | clarkb: yeah I just don't see any in devstack at a quick glance | 16:30 |
dansmith | gmann: kopecmartin I reconfigured my local devstack to be able to test that fix and it works, so I think it should be okay.. hopefully it merges as quick as the original :/ | 16:53 |
*** jpena is now known as jpena|off | 17:33 |
opendevreview | Dan Smith proposed openstack/devstack master: Grab rabbitmq memory diagnostics dump in post https://review.opendev.org/c/openstack/devstack/+/873299 | 17:35 |
clarkb | https://mariadb.com/kb/en/mariadb-memory-allocation/ may be useful if anyone digs into the db tuning | 17:45 |
clarkb | bit old, but with pointers to stuff | 17:46 |
dansmith | clarkb: yeah I've reduced the footprint on some personal memory-constrained cloud instances successfully in the past so I can add some tweaks | 17:55 |
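As a sketch of the kind of knobs the MariaDB article above points at, a memory-lean drop-in might look like the following. The values, the drop-in path, and the service name are assumptions for illustration, not tested gate settings.

```bash
#!/bin/bash
# Sketch only: values, drop-in path and service name are assumptions,
# not tested gate settings.
cat <<'EOF' | sudo tee /etc/mysql/mariadb.conf.d/99-low-memory.cnf
[mysqld]
# The single biggest knob: the InnoDB buffer pool.
innodb_buffer_pool_size = 128M
# Every connection carries per-session buffers, so cap the count.
max_connections = 100
# performance_schema bookkeeping can cost hundreds of MB.
performance_schema = OFF
# Trim per-thread and temp-table buffers.
sort_buffer_size = 1M
read_buffer_size = 256K
tmp_table_size = 16M
max_heap_table_size = 16M
EOF
sudo systemctl restart mariadb
```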
dansmith | man, tons of post_failure with no logs this morning | 18:52 |
clarkb | chances are one of the swift regions is having trouble. I'll take a look | 18:54 |
dansmith | seems like a lot of them were too fast to have run and fail only during upload, but not all of them | 18:54 |
clarkb | the one I'm looking at was a swift upload failure. Going to look at a couple more just to get more data | 18:59 |
dansmith | https://zuul.opendev.org/t/openstack/build/b367c9b4a4a54eab8964676f4c6f43ec | 19:04 |
dansmith | https://zuul.opendev.org/t/openstack/build/5eb69a31232d444db48b0ec90ccb5fee | 19:04 |
dansmith | https://zuul.opendev.org/t/openstack/build/c21483fb7491468fa893d8df1a4a0368 | 19:04 |
dansmith | just some more that are in my view right now (that's not all of them just a few) | 19:04 |
dansmith | the ceph multistore job I'm eagerly waiting to finish is failing like crazy on tons of tempest tests, so I'm really interested to see what's going on there | 19:05 |
gmann | ykarel: ok, i think kopecmartin did that | 19:05 |
dansmith | perhaps another oom | 19:05 |
gmann | dansmith: +A on 873300 | 19:06 |
dansmith | gmann: thanks, tested locally | 19:06 |
gmann | in the ceph job in the gate it seems other server tests are failing on timeout | 19:10 |
dansmith | gmann: I'm watching one fail like crazy, you see other ceph jobs failing too? | 19:11 |
gmann | this patch only | 19:14 |
dansmith | okay | 19:19 |
dansmith | looks like rbd failures on that one | 19:29 |
dansmith | lots of glance image 404s and some "rbd is busy" errors | 19:29 |
gmann | is this the same? 'RuntimeError: generator didn't stop' https://zuul.opendev.org/t/openstack/build/1566c99f09984540bd8754a22342e486/log/job-output.txt#80958 | 19:31 |
dansmith | oh no, I didn't see that | 19:34 |
dansmith | wtf, I don't see that locally | 19:34 |
dansmith | {0} tempest.api.image.v2.test_images.ImageLocationsTest.test_set_location [2.530617s] ... ok | 19:34 |
dansmith | oh, I wonder if that's because even after the retries it still failed, and I raised the real errors instead of StopIteration? | 19:35 |
gmann | it's happening in test_replace_location also | 19:35 |
gmann | maybe | 19:36 |
dansmith | oh you know, this isn't going to work at all the second time, because we're not iterating over the result | 19:36 |
dansmith | man I screwed this one up | 19:36 |
dansmith | without a failure to test against, it's hard to even know | 19:37 |
dansmith | damn | 19:37 |
dansmith | it was going to fail anyway, it just didn't retry | 19:38 |
gmann | ohk, let me remove the W from the tempest one | 19:38 |
dansmith | let me try to simulate one failure to validate what I think we need | 19:38 |
gmann | dansmith: feel free to add nova-ceph-multistore in the tempest experimental queue to check it in the same patch | 19:40 |
gmann | that will help in the future too | 19:40 |
dansmith | ack | 19:41 |
opendevreview | Dan Smith proposed openstack/tempest master: Fix retry_bad_request() context manager https://review.opendev.org/c/openstack/tempest/+/873300 | 19:50 |
dansmith | gmann: made this ^ fail on the first iteration with BadRequest to confirm it tries again | 19:51 |
dansmith | please review *carefully* because I cannot be trusted | 19:51 |
gmann | dansmith: ack, do not worry. it's this cycle's things, not you :). | 19:52 |
dansmith | thanks, but it's definitely mostly me :) | 19:53 |
gmann | dansmith: you remember we were talking about how last cycle was smooth in the gate. now it is all coming in this cycle | 19:53 |
gmann | those job timeouts are leading to our timeouts also | 19:55 |
dansmith | yeah :( | 19:59 |
gmann | ralonsoh: this is a possible fix for neutron train/ussuri/victoria. but I am not sure if constraints will be compatible with neutron-tempest-plugin 1.8.0 https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873325 | 20:20 |
gmann | testing here https://review.opendev.org/q/topic:bug%252F2006763 | 20:21 |
gmann | if constraints do not work then it seems we need to call out testing on t-u-v if the 813195 fix is a must to unblock | 20:22 |
gmann | ralonsoh: I am wondering why they started failing on 'guestmount' now? | 20:23 |
gmann | ralonsoh: or is it after this that they started failing https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873059 | 20:24 |
opendevreview | Merged openstack/tempest master: Mark tempest-multinode-full-py3 as n-v https://review.opendev.org/c/openstack/tempest/+/873228 | 21:34 |
dansmith | gmann: from the multistore job about to finish: Failed: 0 | 21:45 |
gmann | dansmith: yeah, +A on patch | 22:12 |