ttx | o/ | 09:00 |
---|---|---|
amorin | o/ | 09:01 |
ttx | #startmeeting large_scale_sig | 09:01 |
opendevmeet | Meeting started Wed Apr 24 09:01:18 2024 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot. | 09:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 09:01 |
opendevmeet | The meeting name has been set to 'large_scale_sig' | 09:01 |
ttx | #topic Rollcall | 09:01 |
ttx | amorin: o/ | 09:01 |
amorin | hey there | 09:01 |
ttx | Anyone else here for the large scale sig meeting? | 09:01 |
amorin | maybe felixhuettner[m] ? | 09:02 |
ttx | Our agenda for today's meeting is | 09:02 |
ttx | #link https://etherpad.opendev.org/p/large-scale-sig-meeting | 09:02 |
amorin | I added info on this etherpad about our recent rabbitmq improvements | 09:04 |
ttx | OK I'll read | 09:04 |
ttx | Pinging felix.huettner | 09:04 |
amorin | we can talk about it during open discussion I think | 09:05 |
ttx | #topic Ops Deep Dive proposals review | 09:05 |
ttx | So we did not get anything, which makes me wonder if we should change our strategy here | 09:05 |
felixhuettner[m] | o/ sorry | 09:05 |
amorin | I agree | 09:05 |
ttx | Maybe plan to do one ourselves in the future to keep it hot and running | 09:05 |
felixhuettner[m] | we can do that, it's just a question of topics | 09:06 |
ttx | Would be great to hit something trendy... like a GPU-oriented cloud, or VMware transitioning | 09:06 |
ttx | Maybe we can think about topics between now and next meeting and come back with proposals ? | 09:06 |
felixhuettner[m] | we have a lot of people asking about confidential VMs/GPUs | 09:07 |
felixhuettner[m] | might also be an interesting topic | 09:07 |
felixhuettner[m] | although it is quite snakeoily | 09:07 |
amorin | ok, I can try to reach our internal AI/GPU team | 09:07 |
ttx | yeah, would be well aligned with the Foundation's 2024 messaging | 09:07 |
amorin | #action amorin reach OVHcloud AI team about openstack GPU integration | 09:08 |
ttx | #action everyone to think about a hot theme for our next episode, to be discussed at May meeting | 09:08 |
ttx | feel free to email me if you have something urgent earlier | 09:09 |
ttx | Any other idea before we move on? | 09:09 |
amorin | nop | 09:10 |
ttx | #topic Large scale doc | 09:10 |
ttx | No open reviews... | 09:10 |
ttx | amorin: do you want to discuss your rabbitMQ points ? Not sure large scale doc is the right outlet | 09:11 |
amorin | I can do it here | 09:11 |
ttx | Wondering if a technical blog post would not be better for it | 09:11 |
amorin | it's not only doc | 09:11 |
amorin | a blog post where? | 09:12 |
ttx | Could be on an OVH platform that we would promote... or maybe directly on https://www.openstack.org/blog/ | 09:13 |
amorin | yes | 09:14 |
amorin | I'll talk about it internally then and let you know which one is best | 09:14 |
amorin | we will talk about it on OpenInfra Day France also | 09:14 |
ttx | OK, in the meantime I will ask marketing here if they would be interested in featuring it on the openstack blog | 09:14 |
ttx | yeah, that's great (OID France talk) | 09:15 |
amorin | the fact is that all of the stuff is not yet upstream / merged | 09:15 |
ttx | could be on superuser instead | 09:15 |
ttx | openstack blog is more about things that landed, but superuser can be more opinionated | 09:16 |
amorin | and for part of them, I'd like to switch the default behavior on the oslo messaging side, but I don't know how to proceed, maybe the large-scale group would be able to push it with me | 09:16 |
ttx | I'll ask and let you know | 09:16 |
amorin | ok! | 09:16 |
ttx | I think it's a good story, between what you already contributed and what you would like to change in the future | 09:17 |
ttx | in addition to being good large scale advice | 09:17 |
amorin | yup | 09:17 |
amorin | Should I expose some of the points in this discussion now for the record? | 09:17 |
ttx | yes please | 09:17 |
amorin | Let's go, stop me if needed :) | 09:18 |
amorin | Both nova and neutron heavily rely on rabbitmq for internal communication (between agents running on computes and the API running on the control plane). | 09:18 |
amorin | RabbitMQ clustering is a must-have to let operators manage the lifecycle of rabbitMQ | 09:18 |
amorin | Some recent improvements have been made to oslo.messaging to allow better scaling and management of rabbitmq queues | 09:18 |
amorin | Here is a list of what we did on OVH side to achieve better stability at large scale | 09:18 |
amorin | - Better eventlet / green thread management | 09:18 |
amorin | On that part, OVH basically finished the work to keep green threads from failing to send heartbeats | 09:19 |
amorin | patches are merged upstream and available by default. | 09:19 |
amorin | - Replace classic HA with quorum | 09:19 |
amorin | Quorum queues are the future for rabbitmq, replacing classic HA | 09:19 |
amorin | OVH did a patch to finish this implementation | 09:20 |
amorin | Using quorum queues is not yet the default and we would like to enable this by default, we may need help from large-scale about this | 09:20 |
amorin | - Consistent queue naming | 09:20 |
amorin | oslo.messaging was relying on random queue naming. | 09:20 |
amorin | While this does not seem to be a problem on small deployments, it has two bad side effects: | 09:20 |
amorin | - it's harder to figure out which service created a specific queue | 09:20 |
amorin | - as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in rabbitmq | 09:20 |
amorin | These side effects are highly visible at large scale, and even more visible when using quorum queues. | 09:20 |
amorin | We did a patch on oslo.messaging to stop using random names. | 09:20 |
amorin | This is now merged upstream, but disabled by default. | 09:20 |
amorin | We would like to enable this by default in the future. | 09:21 |
ttx | question | 09:21 |
amorin | This is very important for us, and it helped a lot identifying which compute is affected by a rabbit issue | 09:21 |
amorin | yes? | 09:21 |
ttx | Should we document the setting that enables quorum queue in the large scale doc first? | 09:21 |
ttx | as a step toward making it default | 09:22 |
amorin | yes, we should | 09:22 |
amorin | the bad thing is that it's not easy to switch | 09:22 |
amorin | because it needs all queues to be destroyed and recreated | 09:22 |
amorin | oslo.messaging is not able to do it | 09:22 |
ttx | For all the settings that you suggest should be default we should first document them in large-scale doc | 09:23 |
amorin | what we did on our side: we stopped all openstack services, destroyed the queues, switched the flag, and started everything back up | 09:23 |
ttx | maybe there is a way to make it default for new deployments only | 09:23 |
amorin | I agree about pushing it on doc | 09:23 |
ttx | (not affecting existing ones) | 09:23 |
amorin | #action push documentation about enabling quorum queues and other rabbitmq improvement settings for large scale | 09:24 |
amorin | ok, let's continue | 09:25 |
amorin | - Reduce the number of queues | 09:25 |
amorin | RabbitMQ is a message broker, not a queue broker. | 09:25 |
amorin | With a high number of queues, rabbitmq does not work correctly (timeouts / cpu usage / network usage / etc.). | 09:25 |
amorin | OVH did some patches to reduce the number of queues created by neutron by patching oslo.messaging and neutron code (we divided neutron's number of queues by 5). | 09:25 |
amorin | This is not yet upstream, we would like to push it | 09:25 |
amorin | Without those patches, our biggest regions (2k computes) were consuming a lot of rabbitmq resources and it was hard to keep it working correctly | 09:26 |
amorin | Next thing is: | 09:26 |
amorin | - Replace classic fanouts with streams | 09:26 |
amorin | OVH did a patch to rely on "stream" queues to replace classic fanouts. | 09:27 |
amorin | this reduces the number of queues (a lot) and also the number of messages transiting through rabbit | 09:27 |
amorin | rabbit is now in charge of deduplicating the identical messages to all computes | 09:28 |
amorin | Those patches are merged upstream but disabled by default | 09:28 |
amorin | We would like to enable this by default. | 09:28 |
amorin | (will do doc about it) | 09:28 |
amorin | last point is: | 09:28 |
amorin | - Get rid of 'transient' queues | 09:28 |
amorin | oslo.messaging distinguishes 'transient' queues from other queues but it makes no sense | 09:28 |
amorin | neutron and nova do not rely on transient queues | 09:28 |
amorin | E.G. a reply_xyz queue is transient in oslo, but it needs to be fully replicated for nova/neutron to work correctly | 09:29 |
amorin | in the past, such queues were not HA, so if you had to switch off one server of your rabbitmq cluster, you would affect part of your deployment | 09:30 |
amorin | losing messages, leading to inconsistency, etc. | 09:30 |
amorin | by making those queues HA, you can operate your rabbitmq cluster correctly | 09:30 |
amorin | TLDR, transient queues are nonsense | 09:31 |
ttx | That one might be the trickiest to convince people out of, due to history | 09:31 |
amorin | yup | 09:31 |
ttx | but an article can be the first step | 09:31 |
amorin | but removing this concept from oslo would simplify the code and configuration a lot, and make it easier to understand what is at stake | 09:31 |
amorin | that's all! thank you for reading :) | 09:32 |
ttx | and together with other actions show that you did your homework and speak from a good expertise position | 09:32 |
amorin | I am proud of the team on that part, we gained a lot of experience on the rabbit side recently | 09:32 |
ttx | agreed! It's why it's good material for "superuser" which is user-oriented | 09:33 |
ttx | but I'll defer to Allison on where is the best | 09:33 |
amorin | thanks! | 09:33 |
ttx | anything else on this topic? | 09:33 |
amorin | I am done | 09:33 |
ttx | #topic Next meeting(s) | 09:34 |
ttx | Next meeting is May 22, which conflicts with OID France | 09:34 |
ttx | So I was thinking of moving the meeting to the 3rd Wednesday instead, which would be May 15, June 19... | 09:34 |
ttx | rather than skipping it | 09:35 |
ttx | thoughts on that? | 09:35 |
amorin | that works for me | 09:35 |
ttx | felix.huettner: ? | 09:35 |
felixhuettner[m] | i'm not there on the 15th May | 09:36 |
felixhuettner[m] | but otherwise that is fine with me | 09:36 |
ttx | Hmm, alternatively we could move it back one week exceptionally in May (May 29, June 19), but that is trickier/impossible to specify in the booking system | 09:38 |
ttx | Err, that would be May 29, June 26 | 09:38 |
ttx | Also I generally have conflicts on the last week of months so would not mind moving up | 09:39 |
ttx | what should we do? | 09:40 |
felixhuettner[m] | but we can just move it to the 3rd Wednesday | 09:41 |
ttx | ok | 09:41 |
ttx | #action ttx to move meeting to 3rd Wednesday every month | 09:41 |
felixhuettner[m] | then i'm just not there for one session | 09:41 |
amorin | yes | 09:41 |
ttx | OK, if you have episode ideas in the meantime, let us know by email :) | 09:42 |
ttx | #topic Open discussion | 09:42 |
ttx | Anything else, anyone? | 09:42 |
amorin | nop | 09:42 |
ttx | alright then | 09:43 |
ttx | #endmeeting | 09:43 |
opendevmeet | Meeting ended Wed Apr 24 09:43:28 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 09:43 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/large_scale_sig/2024/large_scale_sig.2024-04-24-09.01.html | 09:43 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/large_scale_sig/2024/large_scale_sig.2024-04-24-09.01.txt | 09:43 |
opendevmeet | Log: https://meetings.opendev.org/meetings/large_scale_sig/2024/large_scale_sig.2024-04-24-09.01.log.html | 09:43 |
felixhuettner[m] | <amorin> "what we did on our side: we..." <- we did this with a little less downtime: | 09:43 |
felixhuettner[m] | 1. patch oslo.messaging/kombu to not care, when connecting to an existing queue, if it is of a different type, and roll this out to all machines | 09:44 |
felixhuettner[m] | 2. kill all queues one-by-one | 09:44 |
felixhuettner[m] | 3. punch a few openstack services that did not recreate queues correctly | 09:44 |
amorin | That's nice | 09:46 |
amorin | Maybe we could patch Oslo to have a config param for this | 09:46 |
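
To illustrate the "Consistent queue naming" point from the log above, here is a minimal Python sketch of why random queue names leave orphaned queues behind after restarts. It is not oslo.messaging code: the function names and the `reply_<hostname>:<processname>:<index>` format are hypothetical, and the simulation assumes nothing on the broker expires or deletes old queues.

```python
import uuid


def random_reply_queue():
    """Random naming: a fresh random suffix on every process start."""
    return f"reply_{uuid.uuid4().hex}"


def stable_reply_queue(hostname, processname, index=0):
    """Hypothetical stable naming: derived only from who owns the queue."""
    return f"reply_{hostname}:{processname}:{index}"


def queues_after_restarts(make_name, restarts):
    """Return the distinct queue names declared over a number of restarts.

    Assumes old queues are never removed, so every distinct name is a queue
    that still exists on the broker.
    """
    return {make_name() for _ in range(restarts)}


if __name__ == "__main__":
    leftover_random = queues_after_restarts(random_reply_queue, 5)
    leftover_stable = queues_after_restarts(
        lambda: stable_reply_queue("compute-1", "nova-compute"), 5
    )
    print(len(leftover_random), "queues with random names")
    print(len(leftover_stable), "queue with a stable name")
```

With a stable name, the restarted service reattaches to the same queue, and the name itself tells the operator which host and process own it.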
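
As a starting point for the documentation action item in the log above, the following Python sketch renders an `[oslo_messaging_rabbit]` configuration section with the features discussed turned on. The option names are assumptions based on my understanding of recent oslo.messaging releases and should be checked against the configuration reference for the release actually deployed. As noted in the discussion, existing queues must be destroyed and recreated when switching queue types, so this is not a drop-in change for a running deployment.

```python
import configparser
import io

# Assumed option names (verify against your oslo.messaging release):
#   rabbit_quorum_queue           -> quorum queues instead of classic HA
#   rabbit_transient_quorum_queue -> quorum queues for 'transient' queues too
#   use_queue_manager             -> consistent (non-random) queue naming
#   rabbit_stream_fanout          -> stream queues instead of classic fanouts
LARGE_SCALE_RABBIT_OPTIONS = {
    "rabbit_quorum_queue": "true",
    "rabbit_transient_quorum_queue": "true",
    "use_queue_manager": "true",
    "rabbit_stream_fanout": "true",
}


def render_rabbit_section(options=None):
    """Render the options as an INI section suitable for a service config file."""
    parser = configparser.ConfigParser()
    parser["oslo_messaging_rabbit"] = dict(options or LARGE_SCALE_RABBIT_OPTIONS)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_rabbit_section())
```

The rendered section would be merged into each service's configuration file (for example nova or neutron) as part of a planned migration.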