Friday, 2023-08-11

fricklerkopecmartin: what's your plan on https://review.opendev.org/c/openstack/devstack/+/558930 now?06:42
*** dmellado819 is now known as dmellado8106:55
opendevreviewJakub Skunda proposed openstack/tempest master: Add option to specify source and destination host  https://review.opendev.org/c/openstack/tempest/+/89112310:39
opendevreviewMerged openstack/hacking master: Improve H212 failure message  https://review.opendev.org/c/openstack/hacking/+/88213711:38
opendevreviewLukas Piwowarski proposed openstack/tempest master: Fix cleanup for volume backup tests  https://review.opendev.org/c/openstack/tempest/+/89079812:14
opendevreviewAshley Rodriguez proposed openstack/devstack-plugin-ceph master: Remote Ceph with cephadm  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/87674713:38
kopecmartinfrickler: hm, is it better to merge it now or rather wait till Monday and merge then?14:08
fricklerkopecmartin: go for it, I'll be around to watch things a bit14:20
opendevreviewLukas Piwowarski proposed openstack/tempest master: Fix cleanup for volume backup tests  https://review.opendev.org/c/openstack/tempest/+/89079814:28
dansmithgmann: debugging this grenade failure: https://zuul.opendev.org/t/openstack/build/bb553b6605984dcda6b73eeef91c508415:29
dansmiththat is still running at concurrency=6 and the problem seems to be that the secgroup call tempest->nova->neutron takes too long.. it's still happening, as I can trace the path in the logs,15:30
dansmithbut to do the rule create, nova fetches the secgroup and other things, and that takes more than 60s, even though the actual calls to neutron show sub-second timers15:30
dansmiththat makes me think we're exhausting available workers in our uwsgi services, and adding more workers will increase the memory footprint15:31
*** ykarel is now known as ykarel|away15:59
ykarel|awaydansmith, wrt ^ not sure if you are aware of https://bugs.launchpad.net/neutron/+bug/2015065 where some investigation was done in past16:06
ykarel|awayand that time iirc concurrency was not 6, some data from that might be helpful16:07
dansmithykarel|away: ack, interesting16:10
dansmithykarel|away: the dbcounter issues should be very much relieved lately, and the bug doesn't mention disabling that having fixed it, right?16:10
dansmithI think this sort of proxy to neutron is always going to suffer from concurrency-strained resources, be it API worker threads or apache connections, etc, etc16:11
ykarel|awaydansmith, even with dbcounter disabled the issue was reproduced, but i noticed it was less frequent after disabling it, maybe it was just a coincidence16:12
dansmiththe rule_create tests are worse than the group_create because they make multiple calls to neutron to validate the group, etc so they're more likely to hit it16:12
dansmithykarel|away: ack, well, dbcounter no longer forces a flush after 100 operations like it did before, so it should run only during idle periods or once per minute max16:13
dansmithneutron does a *ton* of DB, so it was getting hit there for sure, but that should be much less now16:13
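
A minimal sketch, assuming nothing about the real dbcounter code, of the time-based flush behaviour dansmith describes: counts accumulate in memory and are written out at most once per interval, so heavy DB traffic no longer triggers extra writes of its own. Class and method names here are illustrative.

    import time
    from collections import Counter

    FLUSH_INTERVAL = 60  # seconds; "once per minute max"

    class HitCounter:
        """Hypothetical per-process DB operation counter with timed flushing."""

        def __init__(self):
            self._counts = Counter()
            self._last_flush = time.monotonic()

        def record(self, db_name, op_name):
            # Recording is a pure in-memory increment; no write per operation.
            self._counts[(db_name, op_name)] += 1

        def maybe_flush(self, write_rows):
            # Called from an idle/periodic hook; at most one batched write
            # per interval, regardless of how much DB traffic there was.
            now = time.monotonic()
            if self._counts and now - self._last_flush >= FLUSH_INTERVAL:
                write_rows(dict(self._counts))
                self._counts.clear()
                self._last_flush = now
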
ykarel|awayyeap i heard slaweq is inspecting those large db calls, we might be able to reduce those, but /me don't know if those are related to the above failures16:15
dansmithno, I know.. the dbcounter behavior was amplifying the DB that you were doing, but shouldn't be anymore, that was my only point16:16
dansmithhowever, lots of DB can be consuming eventlet connection pool workers, and we can run out of those, especially at higher concurrency16:16
ykarel|awayack i see16:20
opendevreviewDan Smith proposed openstack/devstack master: Log slow uwsgi requests  https://review.opendev.org/c/openstack/devstack/+/89121316:29
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121416:29
ykarel|awayi recall default threads was 1 and i seem to remember there was some constraint from nova to not set it higher16:33
ykarel|awaybut ok to try these changes and see the impact16:34
dansmithykarel|away: I don't know why that would be, but we'll see if something obvious breaks :)16:34
dansmithif threads=1 by default and we're only running 2 workers then I think we're definitely in trouble for concurrency=6 :)16:35
dansmithand if it's only trouble for nova and not others, we could/should probably bump it for the others at least16:35
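
A rough sketch of the capacity arithmetic behind this concern; the worker and thread counts are the values assumed in the discussion, not measured ones.

    # Each uwsgi worker process can handle at most `threads` requests at once,
    # so a service's total in-flight capacity is workers * threads.
    workers = 2              # devstack API worker count being discussed
    threads = 1              # suspected uwsgi default thread count per worker
    tempest_concurrency = 6  # tempest runner concurrency in the job

    capacity = workers * threads
    print(f"in-flight capacity: {capacity}, test concurrency: {tempest_concurrency}")
    # With capacity=2, a proxying call like tempest -> nova -> neutron can sit
    # queued behind other requests, so the end-to-end time can exceed 60s even
    # though each individual neutron call completes in under a second.
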
ykarel|awaydansmith, https://docs.openstack.org/releasenotes/nova/stein.html#relnotes-19-0-3-stable-stein16:36
ykarel|awaydon't know if that is still valid though16:36
dansmithah, yeah, I don't think that's still valid, but I remember that now16:37
ykarel|awayyeap /me agrees with the tunings wherever possible, thx for looking into it16:37
dansmithwe're not setting threads=1 in devstack now, and I'm not sure it really defaults to 1 in uwsgi, but I don't know16:38
dansmithmaybe melwitt knows and I know sean-k-mooney was involved in that, so maybe next week16:39
ykarel|awayi recall seeing it defaults to 1, but would be good to confirm16:39
* dansmith nods16:39
ykarel|awayme out for the next week, will follow up on the results once back16:39
dansmithack, thanks for the pointers16:39
gmannyeah, I have seen that SG failure in the past, before the concurrency change or any of the other improvements we did16:41
dansmithokay, well, I've only seen it once so I'm not sure it's getting worse with concurrency=6, but the root of why it's happening is the same16:43
dansmithwith the log-slow change above, hopefully we'd see two slow requests using both workers in neutron in/around the time where this happens16:44
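
The devstack change above presumably enables slow-request logging at the uwsgi level; purely as an illustration of the idea, a hypothetical WSGI middleware doing the same thing could look like this (the threshold value and names are assumptions):

    import logging
    import time

    LOG = logging.getLogger(__name__)
    SLOW_THRESHOLD = 10.0  # seconds; illustrative, not the patch's setting

    class LogSlowRequests:
        """Hypothetical WSGI middleware logging requests slower than a threshold."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            start = time.monotonic()
            try:
                return self.app(environ, start_response)
            finally:
                elapsed = time.monotonic() - start
                if elapsed > SLOW_THRESHOLD:
                    LOG.warning("slow request: %s %s took %.1fs",
                                environ.get('REQUEST_METHOD'),
                                environ.get('PATH_INFO'), elapsed)
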
fricklerkopecmartin: https://review.opendev.org/c/openstack/devstack/+/887547 maybe too? tc is about to start discussing the pti for the next cycle16:53
fricklergmann: ^^16:57
gmannfrickler: done, lgtm16:58
fricklerty16:59
dansmithgmann: would you be okay if we had a post action to send SIGUSR1 to nova-compute before we finish a job so that we can get a guru-meditation report?17:55
dansmithmelwitt and I (mostly melwitt) were pondering a seeming lockup of n-cpu where it just seems to stop doing anything part way through a job17:56
dansmithand I'm wondering if a gmr would show us something interesting17:56
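
For context, a minimal sketch of how a service typically wires up the guru meditation report via oslo.reports; nova-compute already does the equivalent at startup, so the job change would only need to deliver the signal. The fake version object and the signal shown in the comment are assumptions for illustration.

    from oslo_reports import guru_meditation_report as gmr

    class FakeVersion:
        """Stand-in for the service's version module; illustration only."""

        @staticmethod
        def vendor_string():
            return "example"

        @staticmethod
        def product_string():
            return "example-service"

        @staticmethod
        def version_string_with_package():
            return "0.0.0"

    # Registers a signal handler that, when triggered, dumps thread and
    # greenthread stacks plus configuration to the service log.
    gmr.TextGuruMeditation.setup_autorun(FakeVersion)

    # A post-run job step could then do something along the lines of:
    #   kill -USR1 $(pgrep -f nova-compute)
    # (whether the trigger is SIGUSR1 or SIGUSR2 depends on how oslo.reports
    # is configured in the release under test.)
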
dansmithclarkb: fungi: I think we've discussed in the past trying to back things like our test lvm pool with an actual block device from ephemeral storage, right? and we can't reliably do that because not all workers have a second disk or something?17:59
clarkbdansmith: correct. only rax provides flavors with a second disk18:03
dansmithack okay18:03
clarkbmost cloud providers give you a single device and want you to use volumes to add extra devices. The problem with that is even just boot from volume results in a giant mess of leaked volumes preventing image deletion and so on18:03
opendevreviewMerged openstack/devstack master: Add option to install everything in global venvs  https://review.opendev.org/c/openstack/devstack/+/55893018:04
clarkbso while we theoretically could bfv everywhere instead we'd end up with everything grinding to a halt because cinder and glance and nova can't get along18:04
dansmithyeah, we experience the same while *testing* volumes :/18:04
dansmithclarkb: you don't have to tell me18:04
dansmithI'm asking because testing nova+cinder is such a sh*tshow lately and someone suggested not backing lvm with a loop device to try to help :)18:04
clarkbyou could repartition on rax nodes I suppose and collect data from that subset18:07
dansmithykarel|away: https://zuul.opendev.org/t/openstack/build/09ffeb805da0425a8ab8a1ce47499251/log/controller/logs/screen-n-api.txt#187618:08
clarkbthe jobs will have already partitioned swap and /opt onto that device but you should be able to unmount /opt temporarily, shrink the ext4 fs there and add another partition18:08
dansmithclarkb: I'd rather have some good reason before going to that much work, because AIUI loop has gotten better recently in regards to the double-cache footprint problem18:08
gmanndansmith: I am ok with that, do not think it will add large time in the job especially by seeing our normal job running time and job timeout duration18:17
dansmithokay18:18
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121418:21
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121419:09
fungiyeah, the other alternative is extending nodepool to also manage cinder volumes and then attach additional volumes to test nodes as part of the nodeset or something. that's not functionality that exists today. theoretically possible, logistically... challenging19:11
clarkbfungi: right it also has the problem of leaking volumes that we see with bfv19:13
clarkbin addition to the extra state tracking it is very likely we'll end up with a bunch of volumes that we cannot delete19:13
dansmithI am in *no* way suggesting we start using volumes, to be clear.. I was just asking about ephemeral :)19:17
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121419:36
melwittdansmith: I thought wsgi threads default to 1 as well but i can't remember where/how I knew that. wrt the oslo.messaging reconnect needing threads=1, was there some kind of fix in oslo.messaging such that it should work now?19:48
dansmithit does (I now know) and it's still broken for >1, but seemingly only in nova19:48
dansmithI'm trying the disable-monkeypatch thing along with it19:48
dansmithbasically, we're limited to two concurrent operations in the api right now which I'm *sure* is pretty limiting, especially with concurrency=619:49
dansmithif we're not running in eventlet mode, we really shouldn't be monkeypatching anything19:49
melwittok. yeah I think nova-api is the only one combining wsgi + eventlet, that's why it's only nova I think19:49
dansmithso I suspect threads>1 trying to switch greenthreads is the signal telling us that we're doing it wrong19:49
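
A hypothetical sketch of the "don't monkeypatch when not running under eventlet" guard being discussed; the environment variable name and detection logic are made up for illustration and are not nova's actual mechanism.

    import os

    def maybe_monkey_patch():
        # Under uwsgi with native threads we skip eventlet entirely, so the
        # worker threads behave like real OS threads instead of fighting the
        # greenthread hub.
        if os.environ.get('DISABLE_EVENTLET_PATCHING', '').lower() in ('1', 'true', 'yes'):
            return
        import eventlet
        eventlet.monkey_patch()

    maybe_monkey_patch()
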
melwitta long time ago I proposed a patch to enable us to configure native threads vs eventlet (we use futurist already) but at the time it was nacked19:50
opendevreviewDan Smith proposed openstack/devstack master: Set uwsgi threads per worker  https://review.opendev.org/c/openstack/devstack/+/89121419:50
dansmithmelwitt: I landed something for glance a couple years ago, fwiw19:51
melwitt(for the scatter gather)19:51
dansmithmelwitt: https://review.opendev.org/c/openstack/glance/+/74206519:52
dansmithyeah19:52
melwittdansmith: ok. I'll try resurrecting mine. iirc the issue was a concern about not being able to "cancel" native threads whereas in eventlet you can19:52
opendevreviewDr. Jens Harbott proposed openstack/devstack master: GLOBAL_VENV: add nova to linked binaries  https://review.opendev.org/c/openstack/devstack/+/89123019:52
dansmithmelwitt: ah, sounds vaguely familiar... we probably just need to have something that can handle that case by cleaning them up when they finish or something19:52
dansmithtechnically I don't think we can cancel eventlet pool-based work either, but it gets cleaned up on its own19:53
melwittdansmith: it was this https://review.opendev.org/c/openstack/nova/+/65017219:53
dansmithmelwitt: the meat is here: https://review.opendev.org/c/openstack/glance/+/742065/9/glance/async_/__init__.py19:53
dansmithwhoa the pipes19:53
dansmithmelwitt: yeah my explanation there was kinda "we need to do something but, ugh"19:55
dansmithwhich is still the case, but maybe more "ugh if we don't" :P19:55
fricklergmann: kopecmartin: dansmith: fix for the first nova gate failure I saw due to global venv in https://review.opendev.org/c/openstack/devstack/+/89123019:56
melwittdansmith: yeah. that's what I thought (and still think). no perfect solution but better than eventlet given the wsgi issue19:57
dansmithmelwitt: well, we're not even really using eventlet in wsgi mode.. certainly not properly.. tbh I'm surprised it works as it is, but I imagine it's because we're in the main thread when we do the spawning, which is what makes the others fail when they're not19:58
dansmithlike, the monkey-patched stuff ends up doing what the hub does because we're in thread zero or some such19:59
dansmithso our reno says you can disable monkeypatching *or* set threads=1, but now that we've discussed it I'm not sure the former would actually work and if it does, I dunno how/why20:00
dansmithoh you said we're using futurist, so maybe that's why20:01
dansmithI didn't think we were, but my cache on that stuff is preeety stale20:01
melwittdansmith: oh, sorry.. that was me misremembering. it's currently not futurist but futurist would be a drop-in replacement bc it does the same stuff underneath (back when I worked on this anyway)20:02
melwitt(when configured with eventlet, I mean)20:03
dansmithack, okay, that's what I did in the glance stuff referenced above20:03
melwittyeah, that's what my patch was, replace with futurist so it can be swapped how we want20:05
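
A minimal sketch of the "swap the executor" idea, assuming futurist: the scatter/gather code talks to a generic executor and configuration decides whether that is green threads or native threads. Function names are illustrative, and the cancellation question raised above is left open.

    import futurist

    def make_executor(use_eventlet, max_workers=10):
        # The same scatter/gather code can run on green threads or native
        # threads depending on configuration.
        if use_eventlet:
            return futurist.GreenThreadPoolExecutor(max_workers=max_workers)
        return futurist.ThreadPoolExecutor(max_workers=max_workers)

    def scatter_gather(executor, funcs):
        # Submit everything up front, then collect results. Cancelling
        # in-flight native threads is the open question from the review
        # discussion; here we simply wait for every result.
        futures = [executor.submit(fn) for fn in funcs]
        return [f.result() for f in futures]

    executor = make_executor(use_eventlet=False)
    try:
        print(scatter_gather(executor, [lambda: 1, lambda: 2]))
    finally:
        executor.shutdown()
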
melwittdansmith: I dunno if you saw in the nova channel (while everyone was away) I posted that I found ceph health WARN on those cases where n-cpu gets stuck, logging about "slow ops" and read latency. I couldn't find any place to see generic metrics or anything though, the dstat devstack loop doesn't run by default .. so having something like a GMR I think could help shed some light given the complete lack of current light. I was going to21:18
melwitt upload a DNM patch with dstat on to see what it shows, if anything21:18
melwitti.e. to get a clue what we can or should do to prevent ceph from getting into that state. bc I'm not sure how to find out if or what we can do21:19
opendevreviewMerged openstack/devstack master: Add debian-bookworm job  https://review.opendev.org/c/openstack/devstack/+/88754721:46
dansmithmelwitt: ah I did not, but cool for sure, we can discuss more on monday22:40
