opendevreview | Ke Niu proposed openstack/hacking master: Use py3 as the default runtime for tox https://review.opendev.org/c/openstack/hacking/+/889836 | 03:31 |
opendevreview | Merged openstack/devstack master: git: git checkout for a commit hash combinated with depth argument https://review.opendev.org/c/openstack/devstack/+/889012 | 11:29 |
opendevreview | Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs. https://review.opendev.org/c/openstack/tempest/+/888755 | 11:44 |
opendevreview | Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs. https://review.opendev.org/c/openstack/tempest/+/888755 | 11:45 |
dansmith | gmann: I've seen multiple mysql OOMs this morning.. your concurrency patch shouldn't have increased the actual concurrency on jobs already running 4-wide right? | 13:54 |
opendevreview | Dan Smith proposed openstack/devstack master: Reduce the flush interval of dbcounter plugin https://review.opendev.org/c/openstack/devstack/+/890136 | 14:06 |
opendevreview | Dan Smith proposed openstack/devstack master: Reduce the flush frequency of dbcounter plugin https://review.opendev.org/c/openstack/devstack/+/890136 | 14:20 |
dansmith | gmann: dbcounter was flushing multiple times per second in keystone and neutron because they do so...many...DB ops.. Hence ^ | 14:21 |
dansmith | I think "after ten seconds of inactivity or 60s total" is more than often enough for that | 14:21 |
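[Editor's note: a minimal sketch of the flush policy dansmith describes, purely illustrative and not the actual dbcounter plugin code. The class and variable names here are invented; only the two thresholds (10s of inactivity, 60s hard cap) come from the log.]

```python
import time
from collections import Counter

IDLE_FLUSH = 10    # seconds with no new ops before flushing
MAX_INTERVAL = 60  # hard cap between flushes under constant load

class BatchedCounter:
    """Buffer (db, op) counts in memory; write them out only when
    traffic has gone quiet, or at most once per MAX_INTERVAL."""

    def __init__(self, writer):
        self.writer = writer            # callable that persists the counts
        self.pending = Counter()
        self.last_op = time.monotonic()
        self.last_flush = time.monotonic()

    def record(self, db, op):
        self.pending[(db, op)] += 1
        self.last_op = time.monotonic()

    def maybe_flush(self):
        now = time.monotonic()
        idle = now - self.last_op >= IDLE_FLUSH
        overdue = now - self.last_flush >= MAX_INTERVAL
        if self.pending and (idle or overdue):
            self.writer(dict(self.pending))
            self.pending.clear()
            self.last_flush = now
```

Under this scheme a service doing hundreds of queries per second still generates at most one counter write per minute, instead of a write every 100 ops landing exactly when DB load is highest.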
dansmith | right now we're generating writes for lots of reads in keystone | 14:21 |
dansmith | sometimes ten per second in both keystone and neutron actually | 14:32 |
dansmith | and if those have multiple ops per flush, then multiply that by the number of ops | 14:32 |
dansmith | yeah, 24 actual inserts in one second of one neutron log I found | 14:33 |
dansmith | I have no idea how or why they're doing so many DB queries, but alas | 14:34 |
dansmith | gmann: that patch took the neutron log from 4986 DB flushes to 635 for a tempest-full run | 15:50 |
dansmith | and from 2k to ~100 in keystone | 15:51 |
frickler | sounds still like a lot. maybe we can run all of devstack with eatmydata? we don't really care about persistence | 15:54 |
dansmith | frickler: the 635 for keystone is because it flushes at least once per minute, which doesn't seem unreasonable, especially given they're doing a half million select calls over the course of the same test run :) | 15:57 |
dansmith | er, for neutron I mean | 15:57 |
dansmith | the 100 op limit before was causing us to do most of those writes at times when DB traffic was already high, so these 635 are only when DB traffic is low (10s with no ops) or once per minute when load is high, | 16:07 |
dansmith | so even though it may not look like much of a drop, it's the timing of those that we're still doing that probably matters the most | 16:07 |
opendevreview | Merged openstack/tempest master: Reorder device rescue with volume for overlap https://review.opendev.org/c/openstack/tempest/+/889198 | 16:49 |
gmann | dansmith: hi, reading.. | 17:29 |
dansmith | gmann: it passed all its tests first time, and a nova patch I made depends-on this is about to pass all of its jobs as well | 17:31 |
gmann | dansmith: the concurrency change will increase concurrency for most of the jobs unless it is set explicitly in the job definition | 17:31 |
dansmith | gmann: ack, I saw a job still running with 4x threads, but also one with 8x after I asked | 17:31 |
gmann | dansmith: it seems huge improvement 2k to ~100 | 17:31 |
dansmith | gmann: yes, I think we should get this in ASAP | 17:31 |
gmann | dansmith: concurrency is changed to number of CPUs - 2, so 6 for most of the jobs, but yes there might be some jobs setting it to 4, which I think we can unset and run with the default concurrency | 17:32 |
dansmith | gmann: it's more than just 2k->100, but those 100 are mostly at times of low DB traffic, instead of the 2k which are mostly at times of high db traffic | 17:32 |
gmann | 'number of cpu - 2' | 17:33 |
dansmith | gmann: ah, I meant 6 above, and ack | 17:33 |
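[Editor's note: gmann's default-concurrency rule, sketched for clarity. The function name is illustrative, not the actual devstack/tempest variable; only the "number of CPUs minus 2" rule and the resulting 6-wide default on typical 8-vCPU CI nodes come from the log.]

```python
import os

def default_concurrency(ncpu=None):
    """Default tempest worker count: CPUs minus 2, floored at 1."""
    ncpu = ncpu if ncpu is not None else os.cpu_count()
    return max(1, ncpu - 2)
```

So an 8-vCPU node runs 6-wide by default, while jobs that pinned concurrency to 4 in their definition keep 4 until that override is removed.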
dansmith | gmann: all our efforts to pack the workers tighter increase the penalty that the dbcounter imposes of course.. more concurrency means more db traffic at the same time, which means more dbcounter traffic at the worst possible time too | 17:33 |
gmann | yeah... | 17:34 |
JayF | Ironic might have been a canary for this somewhat, which is why we had to disable it. | 17:35 |
JayF | We're already pretty aggressive on I/O for our jobs so it would've had an outsize impact | 17:35 |
dansmith | JayF: I surveyed some ironic jobs that still have it enabled and it didn't seem like you were generating enough DB traffic to run afoul of this particular problem | 17:35 |
dansmith | I didn't do any math, but just kinda looked, because I remember you disabled it | 17:36 |
JayF | I'm thinking more that it impacted keystone which made it impact us | 17:36 |
dansmith | but maybe you have other jobs that are exponentially worse | 17:36 |
JayF | not ironic-the-service but ironic-the-ci-use-case | 17:36 |
JayF | because our jobs are extremely I/O sensitive | 17:36 |
JayF | just a throwaway thought though | 17:36 |
dansmith | ah, could be.. keystone and neutron both are hammering the crap out of the DB, and this amplifies it | 17:36 |
JayF | and we use neutron a lot, too :) | 17:36 |
dansmith | ack | 17:36 |
dansmith | the dbcounter is how I came to know that both of those services do crazy amounts of DB traffic, so I want to keep it enabled, we just need to make sure it doesn't worsen things | 17:37 |
dansmith | neutron makes literally 500k select calls over the course of a tempest run | 17:37 |
dansmith | compared to like 10k for all of the nova services or something | 17:38 |
gmann | we do set keystone creds and neutron resources per test class but that should not lead to that many calls | 17:42 |
dansmith | yeah they make crazy numbers of select calls for each operation | 17:43 |
dansmith | there are like nearly 100k select calls by keystone at the end of *just* the devstack setup phase | 17:43 |
dansmith | https://zuul.opendev.org/t/openstack/build/14cce086f31141c892a543189426b4da/log/job-output.txt#22525 | 17:45 |
dansmith | and over 100k just during a tempest run: | 17:46 |
dansmith | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_14c/890136/2/check/tempest-full-py3/14cce08/controller/logs/performance.json | 17:47 |
dansmith | but neutron dwarfs everyone during an actual tempest run | 17:47 |
dansmith | "db": "neutron","op": "SELECT","count": 474710 | 17:47 |
gmann | yeah | 17:48 |
dansmith | gmann: dependent nova patch passed everything (except for the mkfs known failure) on the first attempt as well | 18:18 |
gmann | dansmith: ack, | 18:18 |
gmann | dansmith: +2 | 18:24 |
dansmith | gmann: thanks, can we get frickler or kopecmartin to hit that soon? | 18:25 |
dansmith | frickler expressed an opinion, but I don't think it was contrary to "this is better than current at least" | 18:25 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!