Monday, 2023-07-31

<opendevreview> Ke Niu proposed openstack/hacking master: Use py3 as the default runtime for tox  https://review.opendev.org/c/openstack/hacking/+/889836 [03:31]
<opendevreview> Merged openstack/devstack master: git: git checkout for a commit hash combinated with depth argument  https://review.opendev.org/c/openstack/devstack/+/889012 [11:29]
<opendevreview> Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs.  https://review.opendev.org/c/openstack/tempest/+/888755 [11:44]
<opendevreview> Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs.  https://review.opendev.org/c/openstack/tempest/+/888755 [11:45]
<dansmith> gmann: I've seen multiple mysql OOMs this morning.. your concurrency patch shouldn't have increased the actual concurrency on jobs already running 4-wide right? [13:54]
<opendevreview> Dan Smith proposed openstack/devstack master: Reduce the flush interval of dbcounter plugin  https://review.opendev.org/c/openstack/devstack/+/890136 [14:06]
<opendevreview> Dan Smith proposed openstack/devstack master: Reduce the flush frequency of dbcounter plugin  https://review.opendev.org/c/openstack/devstack/+/890136 [14:20]
<dansmith> gmann: dbcounter was flushing multiple times per second in keystone and neutron because they do so...many...DB ops.. Hence ^ [14:21]
<dansmith> I think "after ten seconds of inactivity or 60s total" is more than often enough for that [14:21]
<dansmith> right now we're generating writes for lots of reads in keystone [14:21]
<dansmith> sometimes ten per second in both keystone and neutron actually [14:32]
<dansmith> and if those have multiple ops per flush, then multiply that by the number of ops [14:32]
<dansmith> yeah, 24 actual inserts in one second of one neutron log I found [14:33]
<dansmith> I have no idea how or why they're doing so many DB queries, but alas [14:34]
<dansmith> gmann: that patch took the neutron log from 4986 DB flushes to 635 for a tempest-full run [15:50]
<dansmith> and from 2k to ~100 in keystone [15:51]
<frickler> sounds still like a lot. maybe we can run all of devstack with eatmydata? we don't really care about persistence [15:54]
<dansmith> frickler: the 635 for keystone is because it flushes at least once per minute, which doesn't seem unreasonable, especially given they're doing a half million select calls over the course of the same test run :) [15:57]
<dansmith> er, for neutron I mean [15:57]
<dansmith> the 100 op limit before was causing us to do most of those writes at times when DB traffic was already high, so these 635 are only when DB traffic is low (10s with no ops) or once per minute when load is high, [16:07]
<dansmith> so even though it may not look like much of a drop, it's the timing of those that we're still doing that probably matters the most [16:07]
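
Editor's sketch of the flush rule described above ("after ten seconds of inactivity or 60s total"). This is not the dbcounter plugin's code: the class name, method names, and the reading of "60s total" as "60 seconds since the oldest unflushed op" are all assumptions made for illustration.

```python
class FlushPolicy:
    """Hypothetical model of an idle-or-deadline flush rule (not dbcounter itself)."""

    def __init__(self, idle_secs=10.0, max_secs=60.0):
        self.idle_secs = idle_secs    # flush after this long with no new ops
        self.max_secs = max_secs      # assumed: flush once the oldest pending op is this old
        self.pending = 0              # ops counted but not yet written out
        self.last_op = None           # time of the most recent op
        self.first_pending = None     # time of the oldest unflushed op

    def record_op(self, now):
        """Count one DB operation observed at time `now` (seconds)."""
        if self.pending == 0:
            self.first_pending = now
        self.pending += 1
        self.last_op = now

    def should_flush(self, now):
        """True when the service has gone quiet or the deadline has passed."""
        if self.pending == 0:
            return False
        idle = now - self.last_op >= self.idle_secs
        overdue = now - self.first_pending >= self.max_secs
        return idle or overdue

    def flush(self):
        """Return the number of ops to write out and reset the counters."""
        count, self.pending = self.pending, 0
        return count


# Simulate two minutes of one op per second, polling the policy each second.
policy = FlushPolicy()
flushes = 0
for second in range(120):
    policy.record_op(second)
    if policy.should_flush(second):
        policy.flush()
        flushes += 1
```

Under this reading, a service that never goes idle writes about once per minute, and a quiet service writes only after its traffic has stopped, which is the timing argument made above.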
<opendevreview> Merged openstack/tempest master: Reorder device rescue with volume for overlap  https://review.opendev.org/c/openstack/tempest/+/889198 [16:49]
<gmann> dansmith: hi, reading.. [17:29]
<dansmith> gmann: it passed all its tests first time, and a nova patch I made depends-on this is about to pass all of its jobs as well [17:31]
<gmann> dansmith: concurrency change will increase concurrency for most of the jobs unless that is set in job definition [17:31]
<dansmith> gmann: ack, I saw a job still running with 4x threads, but also one with 8x after I asked [17:31]
<gmann> dansmith: it seems huge improvement  2k to ~100 [17:31]
<dansmith> gmann: yes, I think we should get this in ASAP [17:31]
<gmann> dansmith: concurrency  is changed to number of cpu -2 so 6 is for most of the jobs but yes there might be some job setting it to 4 which I think we can unset and run with default concurrency [17:32]
<dansmith> gmann: it's more than just 2k->100, but those 100 are mostly at times of low DB traffic, instead of the 2k which are mostly at times of high db traffic [17:32]
<gmann> 'number of cpu - 2' [17:33]
<dansmith> gmann: ah, I meant 6 above, and ack [17:33]
<dansmith> gmann: all our efforts to pack the workers tighter increase the penalty that the dbcounter imposes of course.. more concurrency means more db traffic at the same time, which means more dbcounter traffic at the worst possible time too [17:33]
<gmann> yeah... [17:34]
<JayF> Ironic might have been a canary for this somewhat, which is why we had to disable it. [17:35]
<JayF> We're already pretty aggressive on I/O for our jobs so it would've had an outsize impact [17:35]
<dansmith> JayF: I surveyed some ironic jobs that still have it enabled and it didn't seem like you were generating enough DB traffic to run afoul of this particular problem [17:35]
<dansmith> I didn't do any math, but just kinda looked, because I remember you disabled it [17:36]
<JayF> I'm thinking more that it impacted keystone which made it impact us [17:36]
<dansmith> but maybe you have other jobs that are exponentially worse [17:36]
<JayF> not ironic-the-service but ironic-the-ci-use-case [17:36]
<JayF> because our jobs are extremely I/O sensitive [17:36]
<JayF> just a throwaway thought though [17:36]
<dansmith> ah, could be.. keystone and neutron both are hammering the crap out of the DB, and this amplifies it [17:36]
<JayF> and we use neutron a lot, too :) [17:36]
<dansmith> ack [17:36]
<dansmith> the dbcounter is how I came to know that both of those services do crazy amounts of DB traffic, so I want to keep it enabled, we just need to make sure it doesn't worsen things [17:37]
<dansmith> neutron makes literally 500k select calls over the course of a tempest run [17:37]
<dansmith> compared to like 10k for all of the nova services or something [17:38]
<gmann> we do set keystone creds and neutron resources per test class but that should not lead to that many call [17:42]
<dansmith> yeah they make crazy numbers of select calls for each operation [17:43]
<dansmith> there are like nearly 100k select calls by keystone at the end of *just* the devstack setup phase [17:43]
<dansmith> https://zuul.opendev.org/t/openstack/build/14cce086f31141c892a543189426b4da/log/job-output.txt#22525 [17:45]
<dansmith> and over 100k just during a tempest run: [17:46]
<dansmith> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_14c/890136/2/check/tempest-full-py3/14cce08/controller/logs/performance.json [17:47]
<dansmith> but neutron dwarfs everyone during an actual tempest run [17:47]
<dansmith> "db": "neutron","op": "SELECT","count": 474710 [17:47]
<gmann> yeah [17:48]
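
Editor's sketch of how per-database SELECT counts like the one quoted above could be totalled. Only the record shape (`db`, `op`, `count`) comes from the log. That the records sit in a list under a top-level "db" key of performance.json is an assumption, and the function name and the second sample record are hypothetical.

```python
import json
from collections import Counter


def op_counts_by_db(records, op="SELECT"):
    """Sum the `count` field per database for one operation type."""
    totals = Counter()
    for rec in records:
        if rec.get("op") == op:
            totals[rec["db"]] += rec["count"]
    return totals


# The first record is the one quoted in the log; the second is a made-up
# placeholder that only shows non-matching ops being skipped.
sample = json.loads("""
{"db": [
  {"db": "neutron", "op": "SELECT", "count": 474710},
  {"db": "exampledb", "op": "INSERT", "count": 1}
]}
""")

totals = op_counts_by_db(sample["db"])
for name, count in totals.most_common():
    print(f"{name}: {count} SELECTs")
```

With a downloaded performance.json, the same function could be called on the parsed file in place of `sample`, provided the assumed layout holds.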
<dansmith> gmann: dependent nova patched passed everything (except for the mkfs known failure) on the first attempt as well [18:18]
<gmann> dansmith: ack, [18:18]
<gmann> dansmith: +2 [18:24]
<dansmith> gmann: thanks, can we get frickler or kopecmartin to hit that soon? [18:25]
<dansmith> frickler expressed an opinion, but I don't think it was contrary to "this is better than current at least" [18:25]
