jrosser | morning | 09:06 |
---|---|---|
noonedeadpunk | o/ | 09:13 |
jrosser | noonedeadpunk: any ideas here https://zuul.opendev.org/t/openstack/build/4a71747815a94bccb281b95a0ef151b5/log/logs/openstack/aio1-magnum-container-a300d6c5/magnum-api.service.journal-12-58-34.log.txt#198 | 15:41 |
jrosser | like references to rabbitmq / shm in the stack trace which makes me think quorum queue | 15:41 |
noonedeadpunk | yeah | 15:42 |
noonedeadpunk | I think this is required now? https://opendev.org/openstack/openstack-ansible-os_glance/src/branch/master/templates/glance-api.conf.j2#L156-L157 | 15:42 |
noonedeadpunk | as `raise cfg.RequiredOptError('lock_path', 'oslo_concurrency')` - would be result of missing this option | 15:43 |
noonedeadpunk | on L241 | 15:44 |
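To make the failure concrete: the os_glance template lines linked above render an `[oslo_concurrency]` section, and the absence of the same section in other services' configs is what makes `cfg.RequiredOptError('lock_path', 'oslo_concurrency')` fire. A hedged sketch of what the rendered config looks like (the directory shown is illustrative; the roles template the actual path per service):

```ini
[oslo_concurrency]
# Directory for oslo.concurrency external (file-based) locks.
# The actual value is deployment/role specific.
lock_path = /var/lock/glance
```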
jrosser | did we have to patch all the roles for that | 15:44 |
noonedeadpunk | It slightly feels so | 15:44 |
noonedeadpunk | But I'm not sure what/why specifically things started failing on that | 15:44 |
jrosser | oh maybe we realised this at some point and then included it in the other quorum queue patches | 15:45 |
noonedeadpunk | but probably due to changes in oslo.messaging... | 15:45 |
noonedeadpunk | no, that was qmanager | 15:45 |
noonedeadpunk | https://opendev.org/openstack/openstack-ansible-os_glance/src/branch/master/templates/glance-api.conf.j2#L86 this param, I realized | 15:46 |
jrosser | i only see we patched glance and blazar with standalone changes like that | 15:46 |
noonedeadpunk | yeah | 15:46 |
noonedeadpunk | others were not failing... | 15:46 |
noonedeadpunk | but probably we'd need this everywhere | 15:47 |
jrosser | oh is it where there are two co-incident things on the same host/container using the same queue | 15:47 |
jrosser | as for magnum we have api/conductor in the same place | 15:48 |
noonedeadpunk | oh | 15:48 |
noonedeadpunk | you're probably right | 15:49 |
noonedeadpunk | oslo_concurrency gets asked when qmanager asks for shm | 15:49 |
noonedeadpunk | `reply_q = 'reply_' + self._q_manager.get()` in a trace | 15:49 |
noonedeadpunk | and read_from_shm asks for lock then... | 15:49 |
jrosser | maybe we just need to put this everywhere | 15:51 |
* jrosser wonders how the metal jobs work at all | 15:51 |
noonedeadpunk | yeah.... | 15:51 |
noonedeadpunk | we totally need | 15:51 |
noonedeadpunk | https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/amqpdriver.py#L72 | 15:51 |
noonedeadpunk | this decorator is what expects to see the lockutils config in place | 15:52 |
noonedeadpunk | crap | 15:52 |
noonedeadpunk | so yeah, we need that pretty much everywhere | 15:54 |
jrosser | so the shm path can be the same for all processes accessing the same queue | 15:55 |
* jrosser forgets the detail | 15:55 |
noonedeadpunk | it is expected to be the same for a specific service | 15:56 |
noonedeadpunk | and then inside SHM there's an iterator for threads | 15:56 |
noonedeadpunk | but the path should be the same only for the same type | 15:57 |
noonedeadpunk | ie nova-conductor should be different from nova-scheduler | 15:57 |
noonedeadpunk | but then it seems on top of that, you'd need lockutils to handle concurrency among process threads for accessing shm | 15:59 |
jrosser | errr | 15:59 |
noonedeadpunk | so yes - we need to patch all roles | 15:59 |
noonedeadpunk | If I read things correctly now... | 15:59 |
jrosser | https://github.com/openstack/openstack-ansible-os_blazar/blob/master/templates/blazar.conf.j2#L23 | 15:59 |
jrosser | this does not differentiate between blazar-api and blazar-manager | 16:00 |
noonedeadpunk | crap. | 16:00 |
noonedeadpunk | probably worth just disabling the qmanager again..... | 16:00 |
jrosser | i wonder how we actually template this out for more complicated services | 16:01 |
noonedeadpunk | as in fact... we can't make them unique with the same config file | 16:01 |
jrosser | ^ this | 16:01 |
jrosser | but still - i would expect the metal jobs to be utterly broken | 16:01 |
noonedeadpunk | so, for SHM path itself - we do make unique hostname | 16:01 |
jrosser | and they are not | 16:01 |
noonedeadpunk | and then also jamesden- did an upgrade and it works for them.... | 16:02 |
jrosser | unless this is just luck and there's not enough messaging going on to hit something bad | 16:02 |
noonedeadpunk | (also metal though) | 16:02 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-os_magnum master: Define lock directory for oslo_concurrency https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/921690 | 16:05 |
noonedeadpunk | so eventually, this SHM thingy is used only when spawning/restarting the service | 16:11 |
jrosser | that's not really how my stack trace is | 16:11 |
jrosser | it comes after many api calls to magnum | 16:11 |
noonedeadpunk | which is quite weird... | 16:12 |
noonedeadpunk | but dunno | 16:12 |
jrosser | it's almost the last thing, cluster template is defined / cluster booted | 16:12 |
jrosser | failed on retrieving the k8s creds for it | 16:12 |
noonedeadpunk | I'm kinda more and more inclined to think if qmanager worth it.... | 16:13 |
noonedeadpunk | sec, let me find stats from ovh... | 16:13 |
jrosser | maybe we need to ask the ovh people again | 16:13 |
jrosser | yeah | 16:13 |
noonedeadpunk | so, rabbit_stream_fanout dropped the number of queues for them from 55k to 24k | 16:16 |
noonedeadpunk | they didn't say stats for qmanager though | 16:16 |
noonedeadpunk | qmanager they described as smth to avoid load on rabbit during service restart specifically | 16:18 |
noonedeadpunk | as then instead of spawning plenty of new queues services will re-use what they already have | 16:18 |
jrosser | kind of like multiplexing connections? | 16:18 |
noonedeadpunk | and yeah, that's pretty much for reply queues.... | 16:18 |
noonedeadpunk | no, it's much more stupid than that | 16:19 |
noonedeadpunk | "manager" is a confusing name there | 16:19 |
noonedeadpunk | as the tech involved is very trivial | 16:19 |
noonedeadpunk | so, previously queues were named as `reply_4170059e-e2d2-4291-94ce-2c765437b3b0` | 16:19 |
noonedeadpunk | what they did was replace the uuid with some "expected" values | 16:20 |
noonedeadpunk | which consist of: "hostname:service_name:increment" | 16:20 |
noonedeadpunk | and shm is used to store the increments | 16:20 |
noonedeadpunk | and oslo.lock is used to prevent threads from getting their increment at the same time | 16:21 |
noonedeadpunk | so, you'll have `reply_aio1:nova-conductor:5` instead of a uuid | 16:21 |
noonedeadpunk | and once you restart service, it does not need to create whole bunch of new queues anymore | 16:22 |
noonedeadpunk | but it can keep using already existing ones | 16:22 |
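A stdlib-only sketch of the naming scheme described above (the real implementation lives in oslo.messaging; the function name, file layout, and use of a plain temp directory instead of /dev/shm are all illustrative): the reply-queue name is `reply_<host>:<service>:<n>`, where `<n>` is a per-service counter in a shared file, guarded by an external file lock so concurrent threads/processes don't grab the same number — which is exactly why a lock_path has to exist.

```python
import fcntl
import os
import socket

def get_reply_queue_name(service_name: str, state_dir: str) -> str:
    """Build a predictable reply queue name 'reply_<host>:<service>:<n>'.

    The counter lives in a shared file (the real code uses shm) and is
    protected by an exclusive file lock, standing in for oslo's
    lockutils external lock.
    """
    counter_path = os.path.join(state_dir, f"{service_name}.counter")
    lock_path = os.path.join(state_dir, f"{service_name}.lock")
    with open(lock_path, "w") as lock_file:
        # Two threads/processes must not read-modify-write the counter
        # at the same time.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            try:
                with open(counter_path) as f:
                    counter = int(f.read() or 0)
            except FileNotFoundError:
                counter = 0
            counter += 1
            with open(counter_path, "w") as f:
                f.write(str(counter))
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return f"reply_{socket.gethostname()}:{service_name}:{counter}"
```

Because the names are deterministic, a restarted service asks for increments 1..N again and lands back on its existing queues instead of declaring fresh uuid-named ones.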
noonedeadpunk | so lock path potentially doesn't matter much... but dunno | 16:25 |
noonedeadpunk | I'd need to check code more to understand consequences | 16:25 |
noonedeadpunk | fwiw, there's their presentation: https://www.arnaudmorin.fr/p17/ | 16:28 |
noonedeadpunk | some things were not mainstreamed yet though | 16:29 |
noonedeadpunk | but if I'm not mistaken, what they said is that using qmanager helped them reduce downtime/load on rabbitmq when restarting l3 agents, which create quite a storm for rabbit | 16:47 |
jrosser | oh yes this is suuuuper slow | 16:50 |
noonedeadpunk | but there was another thing they did not push yet... | 16:53 |
jrosser | magnum trouble could be rooted in a db error https://zuul.opendev.org/t/openstack/build/893c3205310d4275a3aa2141d2123763/log/logs/openstack/aio1-magnum-container-8af236e8/magnum-conductor.service.journal-18-11-56.log.txt#2026 | 19:23 |
noonedeadpunk | mariadb log is full of connection errors | 19:29 |
jrosser | yeah | 19:30 |
jrosser | also in a totally unrelated job https://zuul.opendev.org/t/openstack/build/afad1275e8394e739866e3c124922685/log/logs/host/mariadb.service.journal-23-44-32.log.txt | 19:31 |
noonedeadpunk | well, that frankly looks like pooling timeouts to me | 19:32 |
noonedeadpunk | though "during query" should result in smth different... | 19:32 |
noonedeadpunk | ah, well | 19:33 |
noonedeadpunk | these 2 are slightly different: https://zuul.opendev.org/t/openstack/build/893c3205310d4275a3aa2141d2123763/log/logs/openstack/aio1-galera-container-a238e451/mariadb.service.journal-18-11-56.log.txt#338-339 | 19:34 |
noonedeadpunk | wonder what timeout vs error means... | 19:34 |
jrosser | maybe we dont have enough threads on the magnum side | 19:34 |
noonedeadpunk | might be... | 19:34 |
jrosser | because it's very limited to the capi driver | 19:35 |
hamburgler | hmm odd w/ Caracal upgrade some of the fanout streams, cinder, nova, manila keep having ready messages pile up slowly and not being consumed | 21:04 |
jrosser | hamburgler: you've switched to quorum queues during the upgrade? | 21:08 |
hamburgler | jrosser: Had been running them without the enhancements for quite some time now | 21:08 |
hamburgler | this is just lab testing, I even destroyed LXC all over and re-provisioned them and all queues from scratch still the same thing with those few streams | 21:09 |
hamburgler | all rabbit lxc* | 21:09 |
hamburgler | still haven't finished all testing post upgrade - thought that was interesting to note so far | 21:12 |
jamesden- | interesting Heat bug... 'openstack stack list' fails due to missing [oslo_concurrency] lock_path config in heat.conf, but i can't find any evidence it needed to exist | 21:16 |
jrosser | jamesden-: read back todays irc :) | 21:22 |
jamesden- | lol | 21:22 |
jrosser | i expect you need this but for heat https://review.opendev.org/c/openstack/openstack-ansible-os_magnum/+/921690 | 21:23 |
jamesden- | thanks - i will work on that | 21:23 |
jrosser | it is quite likely that it's needed in all the services, or some variant of it | 21:24 |
jrosser | there is detail / potential confusion that we discussed earlier | 21:24 |
jamesden- | oh boy | 21:25 |
jrosser | hamburgler: it is possible you are affected by this too ^^^ | 21:38 |
jrosser | patches / testing totally welcome | 21:39 |
hamburgler | jrosser: have not looked much yet, quick cli command with openstack stack list though gives me ERROR: Internal Error, but I don't see much in the logs that reflects the output above. Magnum have not yet deployed with this lab env. | 21:56 |
hamburgler | Will poke at things more this week | 21:56 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!