Friday, 2023-08-11

fricklerkopecmartin: what's your plan on https://review.opendev.org/c/openstack/devstack/+/558930 now?06:42
*** dmellado819 is now known as dmellado8106:55
opendevreviewJakub Skunda proposed openstack/tempest master: Add option to specify source and destination host  https://review.opendev.org/c/openstack/tempest/+/89112310:39
opendevreviewMerged openstack/hacking master: Improve H212 failure message  https://review.opendev.org/c/openstack/hacking/+/88213711:38
opendevreviewLukas Piwowarski proposed openstack/tempest master: Fix cleanup for volume backup tests  https://review.opendev.org/c/openstack/tempest/+/89079812:14
opendevreviewAshley Rodriguez proposed openstack/devstack-plugin-ceph master: Remote Ceph with cephadm  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/87674713:38
kopecmartinfrickler: hm, is it better to merge it now or rather wait till Monday and merge then?14:08
fricklerkopecmartin: go for it, I'll be around to watch things a bit14:20
opendevreviewLukas Piwowarski proposed openstack/tempest master: Fix cleanup for volume backup tests  https://review.opendev.org/c/openstack/tempest/+/89079814:28
dansmithgmann: debugging this grenade failure: https://zuul.opendev.org/t/openstack/build/bb553b6605984dcda6b73eeef91c508415:29
dansmiththat is still running at concurrency=6 and the problem seems to be that the secgroup call tempest->nova->neutron takes too long.. it's still happening, as I can trace the path in the logs,15:30
dansmithbut to do the rule create, nova fetches the secgroup and other things, and that takes more than 60s, even though the actual calls to neutron show sub-second timers15:30
dansmiththat makes me think we're exhausting available workers in our uwsgi services, and adding more workers will increase the memory footprint15:31
*** ykarel is now known as ykarel|away15:59
ykarel|awaydansmith, wrt ^ not sure if you are aware of https://bugs.launchpad.net/neutron/+bug/2015065 where some investigation was done in past16:06
ykarel|awayand that time iirc concurrency was not 6, some data from that might be helpful16:07
dansmithykarel|away: ack, interesting16:10
dansmithykarel|away: the dbcounter issues should be very much relieved lately, and the bug doesn't mention disabling that having fixed it, right?16:10
dansmithI think this sort of proxy to neutron is always going to suffer from concurrency-strained resources, be it API worker threads or apache connections, etc, etc16:11
ykarel|awaydansmith, even with dbcounter disabled the issue was reproduced, but i noticed it was less frequent after disabling it, maybe it was just a coincidence16:12
dansmiththe rule_create tests are worse than the group_create because they make multiple calls to neutron to validate the group, etc so they're more likely to hit it16:12
dansmithykarel|away: ack, well, dbcounter no longer forces a flush after 100 operations like it did before, so it should run only during idle periods or once per minute max16:13
dansmithneutron does a *ton* of DB, so it was getting hit there for sure, but that should be much less now16:13
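
A minimal sketch, assuming nothing about the real dbcounter code, of the time-based flush behaviour dansmith describes: counts accumulate in memory and are written out at most once per interval, so heavy DB traffic no longer triggers extra writes of its own. Class and method names here are illustrative.

    import time
    from collections import Counter

    FLUSH_INTERVAL = 60  # seconds; "once per minute max"

    class HitCounter:
        """Hypothetical per-process DB operation counter with timed flushing."""

        def __init__(self):
            self._counts = Counter()
            self._last_flush = time.monotonic()

        def record(self, db_name, op_name):
            # Recording is a pure in-memory increment; no write per operation.
            self._counts[(db_name, op_name)] += 1

        def maybe_flush(self, write_rows):
            # Called from an idle/periodic hook; at most one batched write
            # per interval, regardless of how much DB traffic there was.
            now = time.monotonic()
            if self._counts and now - self._last_flush >= FLUSH_INTERVAL:
                write_rows(dict(self._counts))
                self._counts.clear()
                self._last_flush = now
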
ykarel|awayyeap i heard slaweq is inspecting those large db calls, we might be able to reduce those, but /me don't know if those are related to the above failures16:15
dansmithno, I know.. the dbcounter behavior was amplifying the DB that you were doing, but shouldn't be anymore, that was my only point16:16
dansmithhowever, lots of DB can be consuming eventlet connection pool workers, and we can run out of those, especially at higher concurrency16:16
ykarel|awayack i see16:20
opendevreviewDan Smith proposed openstack/devstack master: Log slow uwsgi requests  https://review.opendev.org/c/openstack/devstack/+/89121316:29
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121416:29
ykarel|awayi recall default threads was 1 and i seem to remember there was some constraint from nova to not set it higher16:33
ykarel|awaybut ok to try these changes and see the impact16:34
dansmithykarel|away: I don't know why that would be, but we'll see if something obvious breaks :)16:34
dansmithif threads=1 by default and we're only running 2 workers then I think we're definitely in trouble for concurrency=6 :)16:35
dansmithand if it's only trouble for nova and not others, we could/should probably bump it for the others at least16:35
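
A rough sketch of the capacity arithmetic behind this concern; the worker and thread counts are the values assumed in the discussion, not measured ones.

    # Each uwsgi worker process can handle at most `threads` requests at once,
    # so a service's total in-flight capacity is workers * threads.
    workers = 2              # devstack API worker count being discussed
    threads = 1              # suspected uwsgi default thread count per worker
    tempest_concurrency = 6  # tempest runner concurrency in the job

    capacity = workers * threads
    print(f"in-flight capacity: {capacity}, test concurrency: {tempest_concurrency}")
    # With capacity=2, a proxying call like tempest -> nova -> neutron can sit
    # queued behind other requests, so the end-to-end time can exceed 60s even
    # though each individual neutron call completes in under a second.
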
ykarel|awaydansmith, https://docs.openstack.org/releasenotes/nova/stein.html#relnotes-19-0-3-stable-stein16:36
ykarel|awaydon't know if that is still valid though16:36
dansmithah, yeah, I don't think that's still valid, but I remember that now16:37
ykarel|awayyeap /me agrees with the tunings wherever possible, thx for looking into it16:37
dansmithwe're not setting threads=1 in devstack now, and I'm not sure it really defaults to 1 in uwsgi, but I don't know16:38
dansmithmaybe melwitt knows and I know sean-k-mooney was involved in that, so maybe next week16:39
ykarel|awayi recall seeing it defaults to 1, but would be good to confirm16:39
* dansmith nods16:39
ykarel|awayme out for the next week, will follow up on the results once back16:39
dansmithack, thanks for the pointers16:39
gmannyeah, I have seen that SG failure in the past, before the concurrency change or any of the other improvements we did16:41
dansmithokay, well, I've only seen it once so I'm not sure it's getting worse with concurrency=6, but the root of why it's happening is the same16:43
dansmithwith the log-slow change above, hopefully we'd see two slow requests using both workers in neutron in/around the time where this happens16:44
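
The devstack change above presumably enables slow-request logging at the uwsgi level; purely as an illustration of the idea, a hypothetical WSGI middleware doing the same thing could look like this (the threshold value and names are assumptions):

    import logging
    import time

    LOG = logging.getLogger(__name__)
    SLOW_THRESHOLD = 10.0  # seconds; illustrative, not the patch's setting

    class LogSlowRequests:
        """Hypothetical WSGI middleware logging requests slower than a threshold."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            start = time.monotonic()
            try:
                return self.app(environ, start_response)
            finally:
                elapsed = time.monotonic() - start
                if elapsed > SLOW_THRESHOLD:
                    LOG.warning("slow request: %s %s took %.1fs",
                                environ.get('REQUEST_METHOD'),
                                environ.get('PATH_INFO'), elapsed)
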
fricklerkopecmartin: https://review.opendev.org/c/openstack/devstack/+/887547 maybe too? tc is about to start discussing the pti for the next cycle16:53
fricklergmann: ^^16:57
gmannfrickler: done, lgtm16:58
fricklerty16:59
dansmithgmann: would you be okay if we had a post action to send SIGUSR1 to nova-compute before we finish a job so that we can get a guru-meditation report?17:55
dansmithmelwitt and I (mostly melwitt) were pondering a seeming lockup of n-cpu where it just seems to stop doing anything part way through a job17:56
dansmithand I'm wondering if a gmr would show us something interesting17:56
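
For context, a minimal sketch of how a service typically wires up the guru meditation report via oslo.reports; nova-compute already does the equivalent at startup, so the job change would only need to deliver the signal. The fake version object and the signal shown in the comment are assumptions for illustration.

    from oslo_reports import guru_meditation_report as gmr

    class FakeVersion:
        """Stand-in for the service's version module; illustration only."""

        @staticmethod
        def vendor_string():
            return "example"

        @staticmethod
        def product_string():
            return "example-service"

        @staticmethod
        def version_string_with_package():
            return "0.0.0"

    # Registers a signal handler that, when triggered, dumps thread and
    # greenthread stacks plus configuration to the service log.
    gmr.TextGuruMeditation.setup_autorun(FakeVersion)

    # A post-run job step could then do something along the lines of:
    #   kill -USR1 $(pgrep -f nova-compute)
    # (whether the trigger is SIGUSR1 or SIGUSR2 depends on how oslo.reports
    # is configured in the release under test.)
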
dansmithclarkb: fungi: I think we've discussed in the past trying to back things like our test lvm pool with an actual block device from ephemeral storage, right? and we can't reliably do that because not all workers have a second disk or something?17:59
clarkbdansmith: correct. only rax provides flavors with a second disk18:03
dansmithack okay18:03
clarkbmost cloud providers give you a single device and want you to use volumes to add extra devices. The problem with that is even just boot from volume results in a giant mess of leaked volumes preventing image deletion and so on18:03
opendevreviewMerged openstack/devstack master: Add option to install everything in global venvs  https://review.opendev.org/c/openstack/devstack/+/55893018:04
clarkbso while we theoretically could bfv everywhere instead we'd end up with everything grinding to a halt because cinder and glance and nova can't get along18:04
dansmithyeah, we experience the same while *testing* volumes :/18:04
dansmithclarkb: you don't have to tell me18:04
dansmithI'm asking because testing nova+cinder is such a sh*tshow lately and someone suggested not backing lvm with a loop device to try to help :)18:04
clarkbyou could repartition on rax nodes I suppose and collect data from that subset18:07
dansmithykarel|away: https://zuul.opendev.org/t/openstack/build/09ffeb805da0425a8ab8a1ce47499251/log/controller/logs/screen-n-api.txt#187618:08
clarkbthe jobs will have already partitioned swap and /opt onto that device but you should be able to unmount /opt temporarily, shrink the ext4 fs there and add another partition18:08
dansmithclarkb: I'd rather have some good reason before going to that much work, because AIUI loop has gotten better recently in regards to the double-cache footprint problem18:08
gmanndansmith: I am ok with that, do not think it will add large time in the job especially by seeing our normal job running time and job timeout duration18:17
dansmithokay18:18
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121418:21
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121419:09
fungiyeah, the other alternative is extending nodepool to also manage cinder volumes and then attach additional volumes to test nodes as part of the nodeset or something. that's not functionality that exists today. theoretically possible, logistically... challenging19:11
clarkbfungi: right it also has the problem of leaking volumes that we see with bfv19:13
clarkbin addition to the extra state tracking it is very likely we'll end up with a bunch of volumes that we cannot delete19:13
dansmithI am in *no* way suggesting we start using volumes, to be clear.. I was just asking about ephemeral :)19:17
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121419:36
melwittdansmith: I thought wsgi threads default to 1 as well but i can't remember where/how I knew that. wrt the oslo.messaging reconnect needing threads=1, was there some kind of fix in oslo.messaging such that it should work now?19:48
dansmithit does (I now know) and it's still broken for >1, but seemingly only in nova19:48
dansmithI'm trying the disable-monkeypatch thing along with it19:48
dansmithbasically, we're limited to two concurrent operations in the api right now which I'm *sure* is pretty limiting, especially with concurrency=619:49
dansmithif we're not running in eventlet mode, we really shouldn't be monkeypatching anything19:49
melwittok. yeah I think nova-api is the only one combining wsgi + eventlet, that's why it's only nova I think19:49
dansmithso I suspect threads>1 trying to switch greenthreads is the signal telling us that we're doing it wrong19:49
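
A hypothetical sketch of the "don't monkeypatch when not running under eventlet" guard being discussed; the environment variable name and detection logic are made up for illustration and are not nova's actual mechanism.

    import os

    def maybe_monkey_patch():
        # Under uwsgi with native threads we skip eventlet entirely, so the
        # worker threads behave like real OS threads instead of fighting the
        # greenthread hub.
        if os.environ.get('DISABLE_EVENTLET_PATCHING', '').lower() in ('1', 'true', 'yes'):
            return
        import eventlet
        eventlet.monkey_patch()

    maybe_monkey_patch()
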
melwitta long time ago I proposed a patch to enable us to configure native threads vs eventlet (we use futurist already) but at the time it was nacked19:50
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121419:50
dansmithmelwitt: I landed something for glance a couple years ago, fwiw19:51
melwitt(for the scatter gather)19:51
dansmithmelwitt: https://review.opendev.org/c/openstack/glance/+/74206519:52
dansmithyeah19:52
melwittdansmith: ok. I'll try resurrecting mine. iirc the issue was a concern about not being able to "cancel" native threads whereas in eventlet you can19:52
opendevreviewDr. Jens Harbott proposed openstack/devstack master: GLOBAL_VENV: add nova to linked binaries  https://review.opendev.org/c/openstack/devstack/+/89123019:52
dansmithmelwitt: ah, sounds vaguely familiar... we probably just need to have something that can handle that case by cleaning them up when they finish or something19:52
dansmithtechnically I don't think we can cancel eventlet pool-based work either, but it gets cleaned up on its own19:53
melwittdansmith: it was this https://review.opendev.org/c/openstack/nova/+/65017219:53
dansmithmelwitt: the meat is here: https://review.opendev.org/c/openstack/glance/+/742065/9/glance/async_/__init__.py19:53
dansmithwhoa the pipes19:53
dansmithmelwitt: yeah my explanation there was kinda "we need to do something but, ugh"19:55
dansmithwhich is still the case, but maybe more "ugh if we don't" :P19:55
fricklergmann: kopecmartin: dansmith: fix for the first nova gate failure I saw due to global venv in https://review.opendev.org/c/openstack/devstack/+/89123019:56
melwittdansmith: yeah. that's what I thought (and still think). no perfect solution but better than eventlet given the wsgi issue19:57
dansmithmelwitt: well, we're not even really using eventlet in wsgi mode.. certainly not properly.. tbh I'm surprised it works as it is, but I imagine it's because we're in the main thread when we do the spawning, which is what makes the others fail when they're not19:58
dansmithlike, the monkey-patched stuff ends up doing what the hub does because we're in thread zero or some such19:59
dansmithso our reno says you can disable monkeypatching *or* set threads=1, but now that we've discussed it I'm not sure the former would actually work and if it does, I dunno how/why20:00
dansmithoh you said we're using futurist, so maybe that's why20:01
dansmithI didn't think we were, but my cache on that stuff is preeety stale20:01
melwittdansmith: oh, sorry.. that was me misremembering. it's currently not futurist but futurist would be a drop-in replacement bc it does the same stuff underneath (back when I worked on this anyway)20:02
melwitt(when configured with eventlet, I mean)20:03
dansmithack, okay, that's what I did in the glance stuff referenced above20:03
melwittyeah, that's what my patch was, replace with futurist so it can be swapped how we want20:05
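
A minimal sketch of the "swap the executor" idea, assuming futurist: the scatter/gather code talks to a generic executor and configuration decides whether that is green threads or native threads. Function names are illustrative, and the cancellation question raised above is left open.

    import futurist

    def make_executor(use_eventlet, max_workers=10):
        # The same scatter/gather code can run on green threads or native
        # threads depending on configuration.
        if use_eventlet:
            return futurist.GreenThreadPoolExecutor(max_workers=max_workers)
        return futurist.ThreadPoolExecutor(max_workers=max_workers)

    def scatter_gather(executor, funcs):
        # Submit everything up front, then collect results. Cancelling
        # in-flight native threads is the open question from the review
        # discussion; here we simply wait for every result.
        futures = [executor.submit(fn) for fn in funcs]
        return [f.result() for f in futures]

    executor = make_executor(use_eventlet=False)
    try:
        print(scatter_gather(executor, [lambda: 1, lambda: 2]))
    finally:
        executor.shutdown()
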
melwittdansmith: I dunno if you saw in the nova channel (while everyone was away) I posted that I found ceph health WARN on those cases where n-cpu gets stuck, logging about "slow ops" and read latency. I couldn't find any place to see generic metrics or anything though, the dstat devstack loop doesn't run by default .. so having something like a GMR I think could help shed some light given the complete lack of current light. I was going to21:18
melwitt upload a DNM patch with dstat on to see what it shows, if anything21:18
melwitti.e. to get a clue what we can or should do to prevent ceph from getting into that state. bc I'm not sure how to find out if or what we can do21:19
opendevreviewMerged openstack/devstack master: Add debian-bookworm job  https://review.opendev.org/c/openstack/devstack/+/88754721:46
dansmithmelwitt: ah I did not, but cool for sure, we can discuss more on monday22:40
