Monday, 2024-06-10

09:06 <jrosser> morning
09:13 <noonedeadpunk> o/
15:41 <jrosser> noonedeadpunk: any ideas here https://zuul.opendev.org/t/openstack/build/4a71747815a94bccb281b95a0ef151b5/log/logs/openstack/aio1-magnum-container-a300d6c5/magnum-api.service.journal-12-58-34.log.txt#198
15:41 <jrosser> like references to rabbitmq / shm in the stack trace, which makes me think quorum queues
15:42 <noonedeadpunk> yeah
15:42 <noonedeadpunk> I think this is required now? https://opendev.org/openstack/openstack-ansible-os_glance/src/branch/master/templates/glance-api.conf.j2#L156-L157
15:43 <noonedeadpunk> as `raise cfg.RequiredOptError('lock_path', 'oslo_concurrency')` would be the result of missing this option
15:44 <noonedeadpunk> on L241
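A minimal sketch of the failure mode being discussed, assuming the quoted error comes from oslo.concurrency's external file locking: with no [oslo_concurrency]/lock_path configured, acquiring an external lock raises that RequiredOptError, and setting the option (here programmatically, normally via the service config as in the glance template linked above) makes the same call succeed. The lock name 'qmanager-demo' and the /tmp path are made up for illustration.

    from oslo_concurrency import lockutils

    @lockutils.synchronized('qmanager-demo', external=True)
    def bump():
        # the body never runs while lock_path is unset - acquiring the lock fails first
        pass

    try:
        bump()
    except Exception as exc:
        # prints the RequiredOptError about lock_path in group oslo_concurrency
        print(exc)

    # equivalent of setting [oslo_concurrency]/lock_path in the service config file
    lockutils.set_defaults('/tmp/demo-locks')
    bump()  # now succeeds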
15:44 <jrosser> did we have to patch all the roles for that
15:44 <noonedeadpunk> It slightly feels so
15:44 <noonedeadpunk> But I'm not sure what/why specifically things started failing on that
15:45 <jrosser> oh maybe we realised this at some point and then included it in the other quorum queue patches
15:45 <noonedeadpunk> but probably due to changes in oslo.messaging...
15:45 <noonedeadpunk> no, that was the qmanager
15:46 <noonedeadpunk> https://opendev.org/openstack/openstack-ansible-os_glance/src/branch/master/templates/glance-api.conf.j2#L86 - this param, I realized
15:46 <jrosser> i only see we patched glance and blazar with standalone changes like that
15:46 <noonedeadpunk> yeah
15:46 <noonedeadpunk> others were not failing...
15:47 <noonedeadpunk> but probably we'd need this everywhere
15:47 <jrosser> oh is it where there are two coincident things on the same host/container using the same queue
15:48 <jrosser> as for magnum we have api/conductor in the same place
15:48 <noonedeadpunk> oh
15:49 <noonedeadpunk> you're probably right
15:49 <noonedeadpunk> oslo_concurrency gets asked when the qmanager asks for shm
15:49 <noonedeadpunk> `reply_q = 'reply_' + self._q_manager.get()` in the trace
15:49 <noonedeadpunk> and read_from_shm asks for a lock then...
15:51 <jrosser> maybe we just need to put this everywhere
15:51 * jrosser wonders how the metal jobs work at all
15:51 <noonedeadpunk> yeah....
15:51 <noonedeadpunk> we totally need to
15:51 <noonedeadpunk> https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/amqpdriver.py#L72
15:52 <noonedeadpunk> this decorator is what expects to see the lockutils config in place
15:52 <noonedeadpunk> crap
15:54 <noonedeadpunk> so yeah, we need that pretty much everywhere
15:55 <jrosser> so the shm path can be the same for all processes accessing the same queue
15:55 * jrosser forgets the detail
15:56 <noonedeadpunk> it is expected to be the same for a specific service
15:56 <noonedeadpunk> and then inside SHM there's an iterator for threads
15:57 <noonedeadpunk> but the path should be the same only for the same type
15:57 <noonedeadpunk> ie nova-conductor should be different from nova-scheduler
15:59 <noonedeadpunk> but then it seems that on top of that, you'd need lockutils to handle concurrency among process threads accessing shm
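A rough sketch of the mechanism as described here (not the actual oslo.messaging code): a counter kept in a file under /dev/shm that threads/processes of the same service bump under an oslo.concurrency external lock, which is why [oslo_concurrency]/lock_path has to be set for the qmanager. The file path and lock name below are hypothetical.

    from oslo_concurrency import lockutils

    SHM_FILE = '/dev/shm/demo-qmanager-counter'  # hypothetical, for illustration only

    def next_increment():
        # the external (file-based) lock is what requires [oslo_concurrency]/lock_path
        with lockutils.lock('demo-qmanager', external=True):
            try:
                with open(SHM_FILE) as f:
                    value = int(f.read() or 0)
            except FileNotFoundError:
                value = 0
            value += 1
            with open(SHM_FILE, 'w') as f:
                f.write(str(value))
            return value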
15:59 <jrosser> errr
15:59 <noonedeadpunk> so yes - we need to patch all the roles
15:59 <noonedeadpunk> If I read things correctly now...
15:59 <jrosser> https://github.com/openstack/openstack-ansible-os_blazar/blob/master/templates/blazar.conf.j2#L23
16:00 <jrosser> this does not differentiate between blazar-api and blazar-manager
16:00 <noonedeadpunk> crap.
16:00 <noonedeadpunk> probably worth just disabling the qmanager again.....
16:01 <jrosser> i wonder how we actually template this out for more complicated services
16:01 <noonedeadpunk> as in fact... we can't make them unique while having the same config file
16:01 <jrosser> ^ this
16:01 <jrosser> but still - i would expect the metal jobs to be utterly broken
16:01 <noonedeadpunk> so, for the SHM path itself - we do make the hostname unique
16:01 <jrosser> and they are not
16:02 <noonedeadpunk> and then also jamesden- did an upgrade and it works for them....
16:02 <jrosser> unless this is just luck and there's not enough messaging going on to hit something bad
16:02 <noonedeadpunk> (also metal though)
16:05 <opendevreview> Jonathan Rosser proposed openstack/openstack-ansible-os_magnum master: Define lock directory for oslo_concurrency  https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/921690
16:11 <noonedeadpunk> so eventually, this SHM thingy is used only when spawning/restarting the service
16:11 <jrosser> that's not really how my stack trace looks
16:11 <jrosser> it comes after many api calls to magnum
16:12 <noonedeadpunk> which is quite weird...
16:12 <noonedeadpunk> but dunno
16:12 <jrosser> it's almost the last thing, cluster template is defined / cluster booted
16:12 <jrosser> failed on retrieving the k8s creds for it
16:13 <noonedeadpunk> I'm kinda more and more inclined to wonder if the qmanager is worth it....
16:13 <noonedeadpunk> sec, let me find the stats from ovh...
16:13 <jrosser> maybe we need to ask the ovh people again
16:13 <jrosser> yeah
16:16 <noonedeadpunk> so, rabbit_stream_fanout dropped the number of queues for them from 55k to 24k
16:16 <noonedeadpunk> they didn't give stats for the qmanager though
16:18 <noonedeadpunk> the qmanager they described as something to avoid load on rabbit during service restarts specifically
16:18 <noonedeadpunk> as then, instead of spawning plenty of new queues, services will re-use what they already have
16:18 <jrosser> kind of like multiplexing connections?
16:18 <noonedeadpunk> and yeah, that's pretty much for reply queues....
16:19 <noonedeadpunk> no, it's much more stupid
16:19 <noonedeadpunk> "manager" is confusing there
16:19 <noonedeadpunk> as the tech involved is very trivial
16:19 <noonedeadpunk> so, previously queues were named like `reply_4170059e-e2d2-4291-94ce-2c765437b3b0`
16:20 <noonedeadpunk> what they did was replace the uuid with some "expected" values
16:20 <noonedeadpunk> which consist of: "hostname:service_name:increment"
16:20 <noonedeadpunk> and shm is used to store the increments
16:21 <noonedeadpunk> and oslo.lock is used to prevent threads from getting their increment at the same time
16:21 <noonedeadpunk> so, you'll have `reply_aio1:nova-conductor:5` instead of a uuid
16:22 <noonedeadpunk> and once you restart the service, it does not need to create a whole bunch of new queues anymore
16:22 <noonedeadpunk> but it can keep using the already existing ones
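Putting that together, a hedged illustration of just the naming change described here (the real reply-queue handling inside oslo.messaging's amqpdriver is more involved than this):

    import socket
    import uuid

    def reply_queue_old():
        # pre-qmanager: a brand new queue name on every service (re)start
        return 'reply_' + str(uuid.uuid4())

    def reply_queue_qmanager(service_name, increment):
        # qmanager-style: a deterministic "hostname:service:increment" name, so a
        # restarted service ends up with the same names and reuses existing queues
        return 'reply_%s:%s:%d' % (socket.gethostname(), service_name, increment)

    print(reply_queue_qmanager('nova-conductor', 5))  # e.g. reply_aio1:nova-conductor:5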
16:25 <noonedeadpunk> so the lock path potentially doesn't matter much... but dunno
16:25 <noonedeadpunk> I'd need to check the code more to understand the consequences
16:28 <noonedeadpunk> fwiw, there's their presentation: https://www.arnaudmorin.fr/p17/
16:29 <noonedeadpunk> some things were not mainstreamed yet though
16:47 <noonedeadpunk> but if I'm not mistaken, what they said is that usage of the qmanager helped them reduce downtime/load on rabbitmq when restarting l3 agents, which creates quite a storm for rabbit
16:50 <jrosser> oh yes this is suuuuper slow
16:53 <noonedeadpunk> but there was another thing they did not push yet...
19:23 <jrosser> magnum trouble could be rooted in a db error https://zuul.opendev.org/t/openstack/build/893c3205310d4275a3aa2141d2123763/log/logs/openstack/aio1-magnum-container-8af236e8/magnum-conductor.service.journal-18-11-56.log.txt#2026
19:29 <noonedeadpunk> the mariadb log is full of connection errors
19:30 <jrosser> yeah
19:31 <jrosser> also in a totally unrelated job https://zuul.opendev.org/t/openstack/build/afad1275e8394e739866e3c124922685/log/logs/host/mariadb.service.journal-23-44-32.log.txt
19:32 <noonedeadpunk> well, that frankly looks like connection pooling timeouts to me
19:32 <noonedeadpunk> though "during query" should result in something different...
19:33 <noonedeadpunk> ah, well
19:34 <noonedeadpunk> these 2 are slightly different: https://zuul.opendev.org/t/openstack/build/893c3205310d4275a3aa2141d2123763/log/logs/openstack/aio1-galera-container-a238e451/mariadb.service.journal-18-11-56.log.txt#338-339
19:34 <noonedeadpunk> wonder what timeout vs error means...
19:34 <jrosser> maybe we don't have enough threads on the magnum side
19:34 <noonedeadpunk> might be...
19:35 <jrosser> because it's very limited to the capi driver
21:04 <hamburgler> hmm, odd - with the Caracal upgrade some of the fanout streams (cinder, nova, manila) keep having ready messages pile up slowly and not being consumed
21:08 <jrosser> hamburgler: you've switched to quorum queues during the upgrade?
21:08 <hamburgler> jrosser: Had been running them without the enhancements for quite some time now
21:09 <hamburgler> this is just lab testing, I even destroyed the LXC containers and re-provisioned them and all queues from scratch, still the same thing with those few streams
21:09 <hamburgler> all rabbit lxc*
21:12 <hamburgler> still haven't finished all testing post upgrade - thought that was interesting to note so far
21:16 <jamesden-> interesting Heat bug... 'openstack stack list' fails due to missing [oslo_concurrency] lock_path config in heat.conf, but i can't find any evidence it needed to exist
21:22 <jrosser> jamesden-: read back today's irc :)
21:22 <jamesden-> lol
21:23 <jrosser> i expect you need this but for heat https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/921690
21:23 <jamesden-> thanks - i will work on that
21:24 <jrosser> it is quite likely that it's needed in all the services, or some variant of it
21:24 <jrosser> there is detail / potential confusion that we discussed earlier
21:25 <jamesden-> oh boy
21:38 <jrosser> hamburgler: it is possible you are affected by this too ^^^
21:39 <jrosser> patches / testing totally welcome
21:56 <hamburgler> jrosser: have not looked much yet, a quick `openstack stack list` though gives me ERROR: Internal Error, but I don't see much in the logs that reflects the output above. Magnum has not yet been deployed with this lab env.
21:56 <hamburgler> Will poke at things more this week
