opendevreview | Ke Niu proposed openstack/hacking master: Use py3 as the default runtime for tox https://review.opendev.org/c/openstack/hacking/+/889836 | 03:31 |
opendevreview | Merged openstack/devstack master: git: git checkout for a commit hash combinated with depth argument https://review.opendev.org/c/openstack/devstack/+/889012 | 11:29 |
opendevreview | Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs. https://review.opendev.org/c/openstack/tempest/+/888755 | 11:44 |
opendevreview | Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs. https://review.opendev.org/c/openstack/tempest/+/888755 | 11:45 |
dansmith | gmann: I've seen multiple mysql OOMs this morning.. your concurrency patch shouldn't have increased the actual concurrency on jobs already running 4-wide right? | 13:54 |
opendevreview | Dan Smith proposed openstack/devstack master: Reduce the flush interval of dbcounter plugin https://review.opendev.org/c/openstack/devstack/+/890136 | 14:06 |
opendevreview | Dan Smith proposed openstack/devstack master: Reduce the flush frequency of dbcounter plugin https://review.opendev.org/c/openstack/devstack/+/890136 | 14:20 |
dansmith | gmann: dbcounter was flushing multiple times per second in keystone and neutron because they do so...many...DB ops.. Hence ^ | 14:21 |
dansmith | I think "after ten seconds of inactivity or 60s total" is more than often enough for that | 14:21 |
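[Editor's note: a minimal sketch of the flush policy dansmith describes, purely illustrative and not the actual dbcounter plugin code. The class and variable names here are invented; only the two thresholds (10s of inactivity, 60s hard cap) come from the log.]

```python
import time
from collections import Counter

IDLE_FLUSH = 10    # seconds with no new ops before flushing
MAX_INTERVAL = 60  # hard cap between flushes under constant load

class BatchedCounter:
    """Buffer (db, op) counts in memory; write them out only when
    traffic has gone quiet, or at most once per MAX_INTERVAL."""

    def __init__(self, writer):
        self.writer = writer            # callable that persists the counts
        self.pending = Counter()
        self.last_op = time.monotonic()
        self.last_flush = time.monotonic()

    def record(self, db, op):
        self.pending[(db, op)] += 1
        self.last_op = time.monotonic()

    def maybe_flush(self):
        now = time.monotonic()
        idle = now - self.last_op >= IDLE_FLUSH
        overdue = now - self.last_flush >= MAX_INTERVAL
        if self.pending and (idle or overdue):
            self.writer(dict(self.pending))
            self.pending.clear()
            self.last_flush = now
```

Under this scheme a service doing hundreds of queries per second still generates at most one counter write per minute, instead of a write every 100 ops landing exactly when DB load is highest.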
dansmith | right now we're generating writes for lots of reads in keystone | 14:21 |
dansmith | sometimes ten per second in both keystone and neutron actually | 14:32 |
dansmith | and if those have multiple ops per flush, then multiply that by the number of ops | 14:32 |
dansmith | yeah, 24 actual inserts in one second of one neutron log I found | 14:33 |
dansmith | I have no idea how or why they're doing so many DB queries, but alas | 14:34 |
dansmith | gmann: that patch took the neutron log from 4986 DB flushes to 635 for a tempest-full run | 15:50 |
dansmith | and from 2k to ~100 in keystone | 15:51 |
frickler | sounds still like a lot. maybe we can run all of devstack with eatmydata? we don't really care about persistence | 15:54 |
dansmith | frickler: the 635 for keystone is because it flushes at least once per minute, which doesn't seem unreasonable, especially given they're doing a half million select calls over the course of the same test run :) | 15:57 |
dansmith | er, for neutron I mean | 15:57 |
dansmith | the 100 op limit before was causing us to do most of those writes at times when DB traffic was already high, so these 635 are only when DB traffic is low (10s with no ops) or once per minute when load is high, | 16:07 |
dansmith | so even though it may not look like much of a drop, it's the timing of those that we're still doing that probably matters the most | 16:07 |
opendevreview | Merged openstack/tempest master: Reorder device rescue with volume for overlap https://review.opendev.org/c/openstack/tempest/+/889198 | 16:49 |
gmann | dansmith: hi, reading.. | 17:29 |
dansmith | gmann: it passed all its tests first time, and a nova patch I made depends-on this is about to pass all of its jobs as well | 17:31 |
gmann | dansmith: the concurrency change will increase concurrency for most of the jobs unless it is set explicitly in the job definition | 17:31 |
dansmith | gmann: ack, I saw a job still running with 4x threads, but also one with 8x after I asked | 17:31 |
gmann | dansmith: it seems huge improvement 2k to ~100 | 17:31 |
dansmith | gmann: yes, I think we should get this in ASAP | 17:31 |
gmann | dansmith: concurrency is changed to number of CPUs - 2, so 6 for most of the jobs, but yes there might be some jobs setting it to 4, which I think we can unset and run with the default concurrency | 17:32 |
dansmith | gmann: it's more than just 2k->100, but those 100 are mostly at times of low DB traffic, instead of the 2k which are mostly at times of high db traffic | 17:32 |
gmann | 'number of cpu - 2' | 17:33 |
dansmith | gmann: ah, I meant 6 above, and ack | 17:33 |
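[Editor's note: gmann's default-concurrency rule, sketched for clarity. The function name is illustrative, not the actual devstack/tempest variable; only the "number of CPUs minus 2" rule and the resulting 6-wide default on typical 8-vCPU CI nodes come from the log.]

```python
import os

def default_concurrency(ncpu=None):
    """Default tempest worker count: CPUs minus 2, floored at 1."""
    ncpu = ncpu if ncpu is not None else os.cpu_count()
    return max(1, ncpu - 2)
```

So an 8-vCPU node runs 6-wide by default, while jobs that pinned concurrency to 4 in their definition keep 4 until that override is removed.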
dansmith | gmann: all our efforts to pack the workers tighter increase the penalty that the dbcounter imposes of course.. more concurrency means more db traffic at the same time, which means more dbcounter traffic at the worst possible time too | 17:33 |
gmann | yeah... | 17:34 |
JayF | Ironic might have been a canary for this somewhat, which is why we had to disable it. | 17:35 |
JayF | We're already pretty aggressive on I/O for our jobs so it would've had an outsize impact | 17:35 |
dansmith | JayF: I surveyed some ironic jobs that still have it enabled and it didn't seem like you were generating enough DB traffic to run afoul of this particular problem | 17:35 |
dansmith | I didn't do any math, but just kinda looked, because I remember you disabled it | 17:36 |
JayF | I'm thinking more that it impacted keystone which made it impact us | 17:36 |
dansmith | but maybe you have other jobs that are exponentially worse | 17:36 |
JayF | not ironic-the-service but ironic-the-ci-use-case | 17:36 |
JayF | because our jobs are extremely I/O sensitive | 17:36 |
JayF | just a throwaway thought though | 17:36 |
dansmith | ah, could be.. keystone and neutron both are hammering the crap out of the DB, and this amplifies it | 17:36 |
JayF | and we use neutron a lot, too :) | 17:36 |
dansmith | ack | 17:36 |
dansmith | the dbcounter is how I came to know that both of those services do crazy amounts of DB traffic, so I want to keep it enabled, we just need to make sure it doesn't worsen things | 17:37 |
dansmith | neutron makes literally 500k select calls over the course of a tempest run | 17:37 |
dansmith | compared to like 10k for all of the nova services or something | 17:38 |
gmann | we do set keystone creds and neutron resources per test class but that should not lead to that many calls | 17:42 |
dansmith | yeah they make crazy numbers of select calls for each operation | 17:43 |
dansmith | there are like nearly 100k select calls by keystone at the end of *just* the devstack setup phase | 17:43 |
dansmith | https://zuul.opendev.org/t/openstack/build/14cce086f31141c892a543189426b4da/log/job-output.txt#22525 | 17:45 |
dansmith | and over 100k just during a tempest run: | 17:46 |
dansmith | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_14c/890136/2/check/tempest-full-py3/14cce08/controller/logs/performance.json | 17:47 |
dansmith | but neutron dwarfs everyone during an actual tempest run | 17:47 |
dansmith | "db": "neutron","op": "SELECT","count": 474710 | 17:47 |
gmann | yeah | 17:48 |
dansmith | gmann: dependent nova patch passed everything (except for the mkfs known failure) on the first attempt as well | 18:18 |
gmann | dansmith: ack, | 18:18 |
gmann | dansmith: +2 | 18:24 |
dansmith | gmann: thanks, can we get frickler or kopecmartin to hit that soon? | 18:25 |
dansmith | frickler expressed an opinion, but I don't think it was contrary to "this is better than current at least" | 18:25 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!