Monday, 2024-06-10

09:06 <jrosser> morning
09:13 <noonedeadpunk> o/
15:41 <jrosser> noonedeadpunk: any ideas here https://zuul.opendev.org/t/openstack/build/4a71747815a94bccb281b95a0ef151b5/log/logs/openstack/aio1-magnum-container-a300d6c5/magnum-api.service.journal-12-58-34.log.txt#198
15:41 <jrosser> like references to rabbitmq / shm in the stack trace, which makes me think quorum queues
15:42 <noonedeadpunk> yeah
15:42 <noonedeadpunk> I think this is required now? https://opendev.org/openstack/openstack-ansible-os_glance/src/branch/master/templates/glance-api.conf.j2#L156-L157
15:43 <noonedeadpunk> as `raise cfg.RequiredOptError('lock_path', 'oslo_concurrency')` would be the result of missing this option
15:44 <noonedeadpunk> on L241
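A minimal sketch of the failure mode being discussed, assuming the quoted error comes from oslo.concurrency's external file locking: with no [oslo_concurrency]/lock_path configured, acquiring an external lock raises that RequiredOptError, and setting the option (here programmatically, normally via the service config as in the glance template linked above) makes the same call succeed. The lock name 'qmanager-demo' and the /tmp path are made up for illustration.

    from oslo_concurrency import lockutils

    @lockutils.synchronized('qmanager-demo', external=True)
    def bump():
        # the body never runs while lock_path is unset - acquiring the lock fails first
        pass

    try:
        bump()
    except Exception as exc:
        # prints the RequiredOptError about lock_path in group oslo_concurrency
        print(exc)

    # equivalent of setting [oslo_concurrency]/lock_path in the service config file
    lockutils.set_defaults('/tmp/demo-locks')
    bump()  # now succeeds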
15:44 <jrosser> did we have to patch all the roles for that
15:44 <noonedeadpunk> It slightly feels so
15:44 <noonedeadpunk> But I'm not sure what/why specifically things started failing on that
15:45 <jrosser> oh maybe we realised this at some point and then included it in the other quorum queue patches
15:45 <noonedeadpunk> but probably due to changes in oslo.messaging...
15:45 <noonedeadpunk> no, that was the qmanager
15:46 <noonedeadpunk> https://opendev.org/openstack/openstack-ansible-os_glance/src/branch/master/templates/glance-api.conf.j2#L86 - this param, I realized
15:46 <jrosser> i only see we patched glance and blazar with standalone changes like that
15:46 <noonedeadpunk> yeah
15:46 <noonedeadpunk> others were not failing...
15:47 <noonedeadpunk> but probably we'd need this everywhere
15:47 <jrosser> oh is it where there are two coincident things on the same host/container using the same queue
15:48 <jrosser> as for magnum we have api/conductor in the same place
15:48 <noonedeadpunk> oh
15:49 <noonedeadpunk> you're probably right
15:49 <noonedeadpunk> oslo_concurrency gets asked when the qmanager asks for shm
15:49 <noonedeadpunk> `reply_q = 'reply_' + self._q_manager.get()` in the trace
15:49 <noonedeadpunk> and read_from_shm asks for a lock then...
15:51 <jrosser> maybe we just need to put this everywhere
15:51 * jrosser wonders how the metal jobs work at all
15:51 <noonedeadpunk> yeah....
15:51 <noonedeadpunk> we totally need to
15:51 <noonedeadpunk> https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/amqpdriver.py#L72
15:52 <noonedeadpunk> this decorator is what expects to see the lockutils config in place
15:52 <noonedeadpunk> crap
15:54 <noonedeadpunk> so yeah, we need that pretty much everywhere
15:55 <jrosser> so the shm path can be the same for all processes accessing the same queue
15:55 * jrosser forgets the detail
15:56 <noonedeadpunk> it is expected to be the same for a specific service
15:56 <noonedeadpunk> and then inside SHM there's an iterator for threads
15:57 <noonedeadpunk> but the path should be the same only for the same type
15:57 <noonedeadpunk> ie nova-conductor should be different from nova-scheduler
15:59 <noonedeadpunk> but then it seems that on top of that, you'd need lockutils to handle concurrency among process threads accessing shm
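A rough sketch of the mechanism as described here (not the actual oslo.messaging code): a counter kept in a file under /dev/shm that threads/processes of the same service bump under an oslo.concurrency external lock, which is why [oslo_concurrency]/lock_path has to be set for the qmanager. The file path and lock name below are hypothetical.

    from oslo_concurrency import lockutils

    SHM_FILE = '/dev/shm/demo-qmanager-counter'  # hypothetical, for illustration only

    def next_increment():
        # the external (file-based) lock is what requires [oslo_concurrency]/lock_path
        with lockutils.lock('demo-qmanager', external=True):
            try:
                with open(SHM_FILE) as f:
                    value = int(f.read() or 0)
            except FileNotFoundError:
                value = 0
            value += 1
            with open(SHM_FILE, 'w') as f:
                f.write(str(value))
            return value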
15:59 <jrosser> errr
15:59 <noonedeadpunk> so yes - we need to patch all the roles
15:59 <noonedeadpunk> If I read things correctly now...
15:59 <jrosser> https://github.com/openstack/openstack-ansible-os_blazar/blob/master/templates/blazar.conf.j2#L23
16:00 <jrosser> this does not differentiate between blazar-api and blazar-manager
16:00 <noonedeadpunk> crap.
16:00 <noonedeadpunk> probably worth just disabling the qmanager again.....
16:01 <jrosser> i wonder how we actually template this out for more complicated services
16:01 <noonedeadpunk> as in fact... we can't make them unique while having the same config file
16:01 <jrosser> ^ this
16:01 <jrosser> but still - i would expect the metal jobs to be utterly broken
16:01 <noonedeadpunk> so, for the SHM path itself - we do make the hostname unique
16:01 <jrosser> and they are not
16:02 <noonedeadpunk> and then also jamesden- did an upgrade and it works for them....
16:02 <jrosser> unless this is just luck and there's not enough messaging going on to hit something bad
16:02 <noonedeadpunk> (also metal though)
16:05 <opendevreview> Jonathan Rosser proposed openstack/openstack-ansible-os_magnum master: Define lock directory for oslo_concurrency  https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/921690
16:11 <noonedeadpunk> so eventually, this SHM thingy is used only when spawning/restarting the service
16:11 <jrosser> that's not really how my stack trace looks
16:11 <jrosser> it comes after many api calls to magnum
16:12 <noonedeadpunk> which is quite weird...
16:12 <noonedeadpunk> but dunno
16:12 <jrosser> it's almost the last thing, cluster template is defined / cluster booted
16:12 <jrosser> failed on retrieving the k8s creds for it
16:13 <noonedeadpunk> I'm kinda more and more inclined to wonder if the qmanager is worth it....
16:13 <noonedeadpunk> sec, let me find the stats from ovh...
16:13 <jrosser> maybe we need to ask the ovh people again
16:13 <jrosser> yeah
16:16 <noonedeadpunk> so, rabbit_stream_fanout dropped the number of queues for them from 55k to 24k
16:16 <noonedeadpunk> they didn't give stats for the qmanager though
16:18 <noonedeadpunk> the qmanager they described as something to avoid load on rabbit during service restarts specifically
16:18 <noonedeadpunk> as then, instead of spawning plenty of new queues, services will re-use what they already have
16:18 <jrosser> kind of like multiplexing connections?
16:18 <noonedeadpunk> and yeah, that's pretty much for reply queues....
16:19 <noonedeadpunk> no, it's much more stupid
16:19 <noonedeadpunk> "manager" is confusing there
16:19 <noonedeadpunk> as the tech involved is very trivial
16:19 <noonedeadpunk> so, previously queues were named like `reply_4170059e-e2d2-4291-94ce-2c765437b3b0`
16:20 <noonedeadpunk> what they did was replace the uuid with some "expected" values
16:20 <noonedeadpunk> which consist of: "hostname:service_name:increment"
16:20 <noonedeadpunk> and shm is used to store the increments
16:21 <noonedeadpunk> and oslo.lock is used to prevent threads from getting their increment at the same time
16:21 <noonedeadpunk> so, you'll have `reply_aio1:nova-conductor:5` instead of a uuid
16:22 <noonedeadpunk> and once you restart the service, it does not need to create a whole bunch of new queues anymore
16:22 <noonedeadpunk> but it can keep using the already existing ones
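Putting that together, a hedged illustration of just the naming change described here (the real reply-queue handling inside oslo.messaging's amqpdriver is more involved than this):

    import socket
    import uuid

    def reply_queue_old():
        # pre-qmanager: a brand new queue name on every service (re)start
        return 'reply_' + str(uuid.uuid4())

    def reply_queue_qmanager(service_name, increment):
        # qmanager-style: a deterministic "hostname:service:increment" name, so a
        # restarted service ends up with the same names and reuses existing queues
        return 'reply_%s:%s:%d' % (socket.gethostname(), service_name, increment)

    print(reply_queue_qmanager('nova-conductor', 5))  # e.g. reply_aio1:nova-conductor:5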
16:25 <noonedeadpunk> so the lock path potentially doesn't matter much... but dunno
16:25 <noonedeadpunk> I'd need to check the code more to understand the consequences
16:28 <noonedeadpunk> fwiw, there's their presentation: https://www.arnaudmorin.fr/p17/
16:29 <noonedeadpunk> some things were not mainstreamed yet though
16:47 <noonedeadpunk> but if I'm not mistaken, what they said is that usage of the qmanager helped them reduce downtime/load on rabbitmq when restarting l3 agents, which creates quite a storm for rabbit
16:50 <jrosser> oh yes this is suuuuper slow
16:53 <noonedeadpunk> but there was another thing they did not push yet...
19:23 <jrosser> magnum trouble could be rooted in a db error https://zuul.opendev.org/t/openstack/build/893c3205310d4275a3aa2141d2123763/log/logs/openstack/aio1-magnum-container-8af236e8/magnum-conductor.service.journal-18-11-56.log.txt#2026
19:29 <noonedeadpunk> the mariadb log is full of connection errors
19:30 <jrosser> yeah
19:31 <jrosser> also in a totally unrelated job https://zuul.opendev.org/t/openstack/build/afad1275e8394e739866e3c124922685/log/logs/host/mariadb.service.journal-23-44-32.log.txt
19:32 <noonedeadpunk> well, that frankly looks like connection pooling timeouts to me
19:32 <noonedeadpunk> though "during query" should result in something different...
19:33 <noonedeadpunk> ah, well
19:34 <noonedeadpunk> these 2 are slightly different: https://zuul.opendev.org/t/openstack/build/893c3205310d4275a3aa2141d2123763/log/logs/openstack/aio1-galera-container-a238e451/mariadb.service.journal-18-11-56.log.txt#338-339
19:34 <noonedeadpunk> wonder what timeout vs error means...
19:34 <jrosser> maybe we don't have enough threads on the magnum side
19:34 <noonedeadpunk> might be...
19:35 <jrosser> because it's very limited to the capi driver
21:04 <hamburgler> hmm, odd - with the Caracal upgrade some of the fanout streams (cinder, nova, manila) keep having ready messages pile up slowly and not being consumed
21:08 <jrosser> hamburgler: you've switched to quorum queues during the upgrade?
21:08 <hamburgler> jrosser: Had been running them without the enhancements for quite some time now
21:09 <hamburgler> this is just lab testing, I even destroyed the LXC containers and re-provisioned them and all queues from scratch, still the same thing with those few streams
21:09 <hamburgler> all rabbit lxc*
21:12 <hamburgler> still haven't finished all testing post upgrade - thought that was interesting to note so far
21:16 <jamesden-> interesting Heat bug... 'openstack stack list' fails due to missing [oslo_concurrency] lock_path config in heat.conf, but i can't find any evidence it needed to exist
21:22 <jrosser> jamesden-: read back today's irc :)
21:22 <jamesden-> lol
21:23 <jrosser> i expect you need this but for heat https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/921690
21:23 <jamesden-> thanks - i will work on that
21:24 <jrosser> it is quite likely that it's needed in all the services, or some variant of it
21:24 <jrosser> there is detail / potential confusion that we discussed earlier
21:25 <jamesden-> oh boy
21:38 <jrosser> hamburgler: it is possible you are affected by this too ^^^
21:39 <jrosser> patches / testing totally welcome
21:56 <hamburgler> jrosser: have not looked much yet, a quick `openstack stack list` though gives me ERROR: Internal Error, but I don't see much in the logs that reflects the output above. Magnum has not yet been deployed with this lab env.
21:56 <hamburgler> Will poke at things more this week
