frickler | kopecmartin: what's your plan on https://review.opendev.org/c/openstack/devstack/+/558930 now? | 06:42 |
*** dmellado819 is now known as dmellado81 | 06:55 | |
opendevreview | Jakub Skunda proposed openstack/tempest master: Add option to specify source and destination host https://review.opendev.org/c/openstack/tempest/+/891123 | 10:39 |
opendevreview | Merged openstack/hacking master: Improve H212 failure message https://review.opendev.org/c/openstack/hacking/+/882137 | 11:38 |
opendevreview | Lukas Piwowarski proposed openstack/tempest master: Fix cleanup for volume backup tests https://review.opendev.org/c/openstack/tempest/+/890798 | 12:14 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph master: Remote Ceph with cephadm https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/876747 | 13:38 |
kopecmartin | frickler: hm, is it better to merge it now or rather wait till Monday and merge then? | 14:08 |
frickler | kopecmartin: go for it, I'll be around to watch things a bit | 14:20 |
opendevreview | Lukas Piwowarski proposed openstack/tempest master: Fix cleanup for volume backup tests https://review.opendev.org/c/openstack/tempest/+/890798 | 14:28 |
dansmith | gmann: debugging this grenade failure: https://zuul.opendev.org/t/openstack/build/bb553b6605984dcda6b73eeef91c5084 | 15:29 |
dansmith | that is still running at concurrency=6 and the problem seems to be that the secgroup call tempest->nova->neutron takes too long.. it's still happening, as I can trace the path in the logs, | 15:30 |
dansmith | but to do the rule create, nova fetches the secgroup and other things, and that takes more than 60s, even though the actual calls to neutron show sub-second timers | 15:30 |
dansmith | that makes me think we're exhausting available workers in our uwsgi services, and adding more workers will increase the memory footprint | 15:31 |
*** ykarel is now known as ykarel|away | 15:59 | |
ykarel|away | dansmith, wrt ^ not sure if you are aware of https://bugs.launchpad.net/neutron/+bug/2015065 where some investigation was done in the past | 16:06
ykarel|away | and that time iirc concurrency was not 6, some data from that might be helpful | 16:07 |
dansmith | ykarel|away: ack, interesting | 16:10 |
dansmith | ykarel|away: the dbcounter issues should be very much relieved lately, and the bug doesn't mention disabling that having fixed it, right? | 16:10 |
dansmith | I think this sort of proxy to neutron is always going to suffer from concurrency-strained resources, be it API worker threads or apache connections, etc, etc | 16:11 |
ykarel|away | dansmith, even disabling dbcounter the issue was reproduced, but i noticed it was less frequent after disabling it, maybe it was just a coincidence | 16:12
dansmith | the rule_create tests are worse than the group_create because they make multiple calls to neutron to validate the group, etc so they're more likely to hit it | 16:12 |
dansmith | ykarel|away: ack, well, dbcounter no longer forces a flush after 100 operations like it did before, so it should run only during idle periods or once per minute max | 16:13 |
dansmith | neutron does a *ton* of DB, so it was getting hit there for sure, but that should be much less now | 16:13 |
ykarel|away | yeap i heard slaweq is inspecting those large db calls, we might be able to reduce those, but /me don't know if those are related to the above failures | 16:15 |
dansmith | no, I know.. the dbcounter behavior was amplifying the DB that you were doing, but shouldn't be anymore, that was my only point | 16:16 |
dansmith | however, lots of DB can be consuming eventlet connection pool workers, and we can run out of those, especially at higher concurrency | 16:16 |
ykarel|away | ack i see | 16:20 |
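For context on the dbcounter behaviour dansmith describes above (batching counts in memory and flushing only when idle or at most once per minute, rather than after every 100 operations), here is a minimal illustrative sketch; this is not the real plugin, and the names and numbers are made up:

```python
import threading
import time
from collections import Counter


class BatchedCounter:
    """Toy stand-in for a flush-on-idle DB operation counter."""

    FLUSH_INTERVAL = 60  # seconds: "once per minute max"

    def __init__(self, flush_fn):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn          # e.g. a write to a stats table
        self._last_flush = time.monotonic()

    def record(self, op_name):
        # Cheap in-memory increment on every DB operation, no flush here.
        with self._lock:
            self._counts[op_name] += 1

    def maybe_flush(self):
        # Called from an idle/background hook instead of on every operation.
        now = time.monotonic()
        with self._lock:
            if not self._counts or now - self._last_flush < self.FLUSH_INTERVAL:
                return
            pending, self._counts = dict(self._counts), Counter()
            self._last_flush = now
        self._flush_fn(pending)


counter = BatchedCounter(flush_fn=print)
counter.record("SELECT")
counter.maybe_flush()  # nothing printed: less than a minute since startup
```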
opendevreview | Dan Smith proposed openstack/devstack master: Log slow uwsgi requests https://review.opendev.org/c/openstack/devstack/+/891213 | 16:29 |
opendevreview | Dan Smith proposed openstack/devstack master: Set uwsgi threads per worker https://review.opendev.org/c/openstack/devstack/+/891214 | 16:29 |
ykarel|away | i recall default threads was 1 and i seem to remember there was some constraint from nova to not set it higher | 16:33 |
ykarel|away | but ok to try these changes and see the impact | 16:34 |
dansmith | ykarel|away: I don't know why that would be, but we'll see if something obvious breaks :) | 16:34 |
dansmith | if threads=1 by default and we're only running 2 workers then I think we're definitely in trouble for concurrency=6 :) | 16:35 |
dansmith | and if it's only trouble for nova and not others, we could/should probably bump it for the others at least | 16:35 |
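A back-of-the-envelope check of the concurrency math being discussed, assuming the 2 uwsgi workers mentioned above and the presumed (still unconfirmed at this point) default of 1 thread per worker:

```python
workers = 2               # uwsgi processes per API service, per the discussion above
threads_per_worker = 1    # assumed uwsgi default, yet to be confirmed
tempest_concurrency = 6   # tempest workers driving the APIs

request_slots = workers * threads_per_worker
print(request_slots)       # 2 simultaneous API requests per service
# With 6 test workers funnelling into 2 slots, one request that blocks while
# nova proxies several calls to neutron can queue the rest past the 60s timeout.
```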
ykarel|away | dansmith, https://docs.openstack.org/releasenotes/nova/stein.html#relnotes-19-0-3-stable-stein | 16:36 |
ykarel|away | don't know if that is still valid though | 16:36 |
dansmith | ah, yeah, I don't think that's still valid, but I remember that now | 16:37 |
ykarel|away | yeap /me agrees with the tunings wherever possible, thx for looking into it | 16:37
dansmith | we're not setting threads=1 in devstack now, and I'm not sure it really defaults to 1 in uwsgi, but I don't know | 16:38 |
dansmith | maybe melwitt knows and I know sean-k-mooney was involved in that, so maybe next week | 16:39 |
ykarel|away | i recall seeing it defaults to 1, but would be good to confirm | 16:39 |
* dansmith nods | 16:39 | |
ykarel|away | i'm out for the next week, will follow up on the results once back | 16:39
dansmith | ack, thanks for the pointers | 16:39 |
gmann | yeah, I have seen that SG failure in the past, before the concurrency change or any of the other improvements we did | 16:41
dansmith | okay, well, I've only seen it once so I'm not sure it's getting worse with concurrency=6, but the root of why it's happening is the same | 16:43 |
dansmith | with the log-slow change above, hopefully we'd see two slow requests using both workers in neutron in/around the time where this happens | 16:44 |
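A hedged sketch of how one might pull slow requests out of a uwsgi request log once the log-slow change lands; it assumes the stock uwsgi request-log format ("... generated N bytes in M msecs ..."), and the threshold and invocation are illustrative:

```python
import re
import sys

SLOW_MS = 60000  # flag anything slower than the 60s timeout seen in the failure
MSECS_RE = re.compile(r"\bin (\d+) msecs\b")


def slow_requests(log_path):
    with open(log_path) as f:
        for line in f:
            m = MSECS_RE.search(line)
            if m and int(m.group(1)) >= SLOW_MS:
                yield line.rstrip()


if __name__ == "__main__":
    # usage: python slow_requests.py <uwsgi-request-log>
    for line in slow_requests(sys.argv[1]):
        print(line)
```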
frickler | kopecmartin: https://review.opendev.org/c/openstack/devstack/+/887547 maybe too? tc is about to start discussing the pti for the next cycle | 16:53 |
frickler | gmann: ^^ | 16:57 |
gmann | frickler: done, lgtm | 16:58 |
frickler | ty | 16:59 |
dansmith | gmann: would you be okay if we had a post action to send SIGUSR1 to nova-compute before we finish a job so that we can get a guru-meditation report? | 17:55 |
dansmith | melwitt and I (mostly melwitt) were pondering a seeming lockup of n-cpu where it just seems to stop doing anything part way through a job | 17:56 |
dansmith | and I'm wondering if a gmr would show us something interesting | 17:56 |
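For reference, a minimal sketch of how a service wires up Guru Meditation Reports via oslo.reports, the mechanism behind the report dansmith wants to trigger; the version object here is a stand-in, and which signal the handler listens on depends on the oslo.reports defaults and configuration:

```python
from oslo_config import cfg
from oslo_reports import guru_meditation_report as gmr
from oslo_reports import opts as gmr_opts

CONF = cfg.CONF


class FakeVersion:
    """Stand-in for the version module a real service (e.g. nova) passes in."""

    def vendor_string(self):
        return "example"

    def product_string(self):
        return "example-service"

    def version_string_with_package(self):
        return "0.0.1"


gmr_opts.set_defaults(CONF)
# Registers a signal handler; when the process receives the signal, a report
# with thread stacks, greenthreads, config and package info is dumped.
gmr.TextGuruMeditation.setup_autorun(FakeVersion(), conf=CONF)
```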
dansmith | clarkb: fungi: I think we've discussed in the past trying to back things like our test lvm pool with an actual block device from ephemeral storage, right? and we can't reliably do that because not all workers have a second disk or something? | 17:59 |
clarkb | dansmith: correct. only rax provides flavors with a second disk | 18:03 |
dansmith | ack okay | 18:03 |
clarkb | most cloud providers give you a single device and want you to use volumes to add extra devices. The problem with that is even just boot from volume results in a giant mess of leaked volumes preventing image deletion and so on | 18:03 |
opendevreview | Merged openstack/devstack master: Add option to install everything in global venvs https://review.opendev.org/c/openstack/devstack/+/558930 | 18:04 |
clarkb | so while we theoretically could bfv everywhere instead we'd end up with everything grinding to a halt because cinder and glance and nova can't get along | 18:04
dansmith | yeah, we experience the same while *testing* volumes :/ | 18:04 |
dansmith | clarkb: you don't have to tell me | 18:04 |
dansmith | I'm asking because testing nova+cinder is such a sh*tshow lately and someone suggested not backing lvm with a loop device to try to help :) | 18:04 |
clarkb | you could repartition on rax nodes I suppose and collect data from that subset | 18:07 |
dansmith | ykarel|away: https://zuul.opendev.org/t/openstack/build/09ffeb805da0425a8ab8a1ce47499251/log/controller/logs/screen-n-api.txt#1876 | 18:08 |
clarkb | the jobs will have already partitioned swap and /opt onto that device but you should be able to unmount /opt temporarily, shrink the ext4 fs there and add another partition | 18:08 |
dansmith | clarkb: I'd rather have some good reason before going to that much work, because AIUI loop has gotten better recently in regards to the double-cache footprint problem | 18:08 |
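For readers following along, a rough sketch of what "backing the test LVM pool with a loop device" means in practice (approximately what devstack does; the file path, size and VG name are illustrative rather than devstack's actual defaults):

```python
import subprocess


def create_loop_backed_vg(backing_file="/opt/stack/data/volumes-backing.img",
                          vg_name="stack-volumes", size="24G"):
    # A sparse file stands in for a disk: this is the loop-device approach
    # being questioned above, versus using a real second block device.
    subprocess.run(["truncate", "-s", size, backing_file], check=True)
    # Attach the file to the first free /dev/loopN and capture its path.
    loopdev = subprocess.run(["losetup", "--show", "-f", backing_file],
                             check=True, capture_output=True,
                             text=True).stdout.strip()
    # Initialise the loop device for LVM and build the volume group that
    # cinder's LVM driver carves volumes out of.
    subprocess.run(["pvcreate", loopdev], check=True)
    subprocess.run(["vgcreate", vg_name, loopdev], check=True)
    return loopdev
```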
gmann | dansmith: I am ok with that, do not think it will add large time in the job especially by seeing our normal job running time and job timeout duration | 18:17 |
dansmith | okay | 18:18 |
opendevreview | Dan Smith proposed openstack/devstack master: Set uwsgi threads per worker https://review.opendev.org/c/openstack/devstack/+/891214 | 18:21 |
opendevreview | Dan Smith proposed openstack/devstack master: Set uwsgi threads per worker https://review.opendev.org/c/openstack/devstack/+/891214 | 19:09 |
fungi | yeah, the other alternative is extending nodepool to also manage cinder volumes and then attach additional volumes to test nodes as part of the nodeset or something. that's not functionality that exists today. theoretically possible, logistically... challenging | 19:11 |
clarkb | fungi: right it also has the problem of leaking volumes that we see with bfv | 19:13 |
clarkb | in addition to the extra state tracking it is very likely we'll end up with a bunch of volumes that we cannot delete | 19:13 |
dansmith | I am in *no* way suggesting we start using volumes, to be clear.. I was just asking about ephemeral :) | 19:17 |
opendevreview | Dan Smith proposed openstack/devstack master: Set uwsgi threads per worker https://review.opendev.org/c/openstack/devstack/+/891214 | 19:36 |
melwitt | dansmith: I thought wsgi threads defaults to 1 as well but i can't remember where/how I knew that. wrt the oslo.messaging reconnect needing threads=1, was there some kind of fix in oslo.messaging such that it should work now? | 19:48
dansmith | it does (I now know) and it's still broken for >1, but seemingly only in nova | 19:48 |
dansmith | I'm trying the disable-monkeypatch thing along with it | 19:48
dansmith | basically, we're limited to two concurrent operations in the api right now which I'm *sure* is pretty limiting, especially with concurrency=6 | 19:49 |
dansmith | if we're not running in eventlet mode, we really shouldn't be monkeypatching anything | 19:49 |
melwitt | ok. yeah I think nova-api is the only one combining wsgi + eventlet, that's why it's only nova I think | 19:49 |
dansmith | so I suspect threads>1 trying to switch greenthreads is the signal telling us that we're doing it wrong | 19:49
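A hedged sketch of the "don't monkeypatch when not running under eventlet" idea being discussed; the environment variable here is purely illustrative (nova's actual guard lives in its own monkey_patch module and uses its own switches):

```python
import os


def maybe_monkey_patch():
    # Illustrative switch only; real services decide this from their own
    # config or the way the WSGI entry point is loaded.
    if os.environ.get("DISABLE_EVENTLET_PATCHING", "").lower() in ("1", "true", "yes"):
        # Running under uwsgi/mod_wsgi with native threads: leave the
        # stdlib alone so threads>1 behaves normally.
        return
    import eventlet
    eventlet.monkey_patch()
```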
melwitt | a long time ago I proposed a patch to enable us to configure native threads vs eventlet (we use futurist already) but at the time it was nacked | 19:50 |
opendevreview | Dan Smith proposed openstack/devstack master: Set uwsgi threads per worker https://review.opendev.org/c/openstack/devstack/+/891214 | 19:50 |
dansmith | melwitt: I landed something for glance a couple years ago, fwiw | 19:51 |
melwitt | (for the scatter gather) | 19:51 |
dansmith | melwitt: https://review.opendev.org/c/openstack/glance/+/742065 | 19:52 |
dansmith | yeah | 19:52 |
melwitt | dansmith: ok. I'll try resurrecting mine. iirc the issue was a concern about not being able to "cancel" native threads whereas in eventlet you can | 19:52 |
opendevreview | Dr. Jens Harbott proposed openstack/devstack master: GLOBAL_VENV: add nova to linked binaries https://review.opendev.org/c/openstack/devstack/+/891230 | 19:52 |
dansmith | melwitt: ah, sounds vaguely familiar... we probably just need to have something that can handle that case by cleaning them up when they finish or something | 19:52 |
dansmith | technically I don't think we can cancel eventlet pool-based work either, but it gets cleaned up on its own | 19:53 |
melwitt | dansmith: it was this https://review.opendev.org/c/openstack/nova/+/650172 | 19:53 |
dansmith | melwitt: the meat is here: https://review.opendev.org/c/openstack/glance/+/742065/9/glance/async_/__init__.py | 19:53 |
dansmith | whoa the pipes | 19:53 |
dansmith | melwitt: yeah my explanation there was kinda "we need to do something but, ugh" | 19:55 |
dansmith | which is still the case, but maybe more "ugh if we don't" :P | 19:55 |
frickler | gmann: kopecmartin: dansmith: fix for the first nova gate failure I saw due to global venv in https://review.opendev.org/c/openstack/devstack/+/891230 | 19:56 |
melwitt | dansmith: yeah. that's what I thought (and still think). no perfect solution but better than eventlet given the wsgi issue | 19:57
dansmith | melwitt: well, we're not even really using eventlet in wsgi mode.. certainly not properly.. tbh I'm surprised it works as it is, but I imagine it's because we're in the main thread when we do the spawning, which is what makes the others fail when they're not | 19:58 |
dansmith | like, the monkey-patched stuff ends up doing what the hub does because we're in thread zero or some such | 19:59 |
dansmith | so our reno says you can disable monkeypatching *or* set threads=1, but now that we've discussed it I'm not sure the former would actually work and if it does, I dunno how/why | 20:00 |
dansmith | oh you said we're using futurist, so maybe that's why | 20:01 |
dansmith | I didn't think we were, but my cache on that stuff is preeety stale | 20:01 |
melwitt | dansmith: oh, sorry.. that was me misremembering. it's currently not futurist but futurist would be a drop-in replacement bc it does the same stuff underneath (back when I worked on this anyway) | 20:02 |
melwitt | (when configured with eventlet, I mean) | 20:03 |
dansmith | ack, okay, that's what I did in the glance stuff referenced above | 20:03 |
melwitt | yeah, that's what my patch was, replace with futurist so it can be swapped how we want | 20:05 |
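A minimal sketch of the futurist-based swap being described, where the same executor interface can be backed by either green threads or native threads (not the actual nova or glance code; the config flag is made up):

```python
import futurist

USE_EVENTLET = False  # in a real service this would come from configuration


def get_executor(max_workers=10):
    if USE_EVENTLET:
        # Green threads: cheap, but they require monkeypatching to be useful.
        return futurist.GreenThreadPoolExecutor(max_workers=max_workers)
    # Native threads: work under uwsgi without monkeypatching, but can't be
    # cancelled once running, which was the concern raised against the patch.
    return futurist.ThreadPoolExecutor(max_workers=max_workers)


executor = get_executor()
future = executor.submit(sum, range(10))
print(future.result())  # 45
executor.shutdown()
```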
melwitt | dansmith: I dunno if you saw in the nova channel (while everyone was away) I posted that I found ceph health WARN on those cases where n-cpu gets stuck, logging about "slow ops" and read latency. I couldn't find any place to see generic metrics or anything though, the dstat devstack loop doesn't run by default .. so having something like a GMR I think could help shed some light given the complete lack of current light. I was going to | 21:18 |
melwitt | upload a DNM patch with dstat on to see what it shows, if anything | 21:18 |
melwitt | i.e. to get a clue what we can or should do to prevent ceph from getting into that state. bc I'm not sure how to find out if or what we can do | 21:19 |
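A small hedged helper along the lines of what a DNM patch might collect: poll `ceph health detail` from the job and surface the "slow ops" warnings melwitt mentions (the ceph subcommand is the standard CLI; everything else is illustrative):

```python
import subprocess


def ceph_slow_ops():
    out = subprocess.run(["ceph", "health", "detail"],
                         capture_output=True, text=True, check=False).stdout
    return [line for line in out.splitlines() if "slow ops" in line.lower()]


if __name__ == "__main__":
    warnings = ceph_slow_ops()
    print("\n".join(warnings) if warnings else "no slow ops reported")
```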
opendevreview | Merged openstack/devstack master: Add debian-bookworm job https://review.opendev.org/c/openstack/devstack/+/887547 | 21:46 |
dansmith | melwitt: ah I did not, but cool for sure, we can discuss more on monday | 22:40 |