09:01:18 <ttx> #startmeeting large_scale_sig
09:01:18 <opendevmeet> Meeting started Wed Apr 24 09:01:18 2024 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:01:18 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:01:18 <opendevmeet> The meeting name has been set to 'large_scale_sig'
09:01:30 <ttx> #topic Rollcall
09:01:41 <ttx> amorin: o/
09:01:43 <amorin> hey there
09:01:53 <ttx> Anyone else here for the large scale sig meeting?
09:02:26 <amorin> maybe felixhuettner[m] ?
09:02:31 <ttx> Our agenda for today's meeting is
09:02:34 <ttx> #link https://etherpad.opendev.org/p/large-scale-sig-meeting
09:04:05 <amorin> I added info on this etherpad about our recent rabbitmq improvements
09:04:21 <ttx> OK I'll read
09:04:36 <ttx> Pinging felix.huettner
09:05:13 <amorin> we can talk about it during open discussion I think
09:05:14 <ttx> #topic Ops Deep Dive proposals review
09:05:30 <ttx> So we did not get anything, which makes me wonder if we should change our strategy here
09:05:38 <felixhuettner[m]> o/ sorry
09:05:39 <amorin> I agree
09:05:57 <ttx> Maybe plan to do one ourselves in the future to keep it hot and running
09:06:20 <felixhuettner[m]> we can do that, it's just a question of topics
09:06:36 <ttx> Would be great to hit something trendy... like a GPU-oriented cloud, or VMware transitioning
09:06:59 <ttx> Maybe we can think about topics between now and next meeting and come back with proposals ?
09:07:20 <felixhuettner[m]> we have a lot of people asking about confidential VMs/GPUs
09:07:29 <felixhuettner[m]> might also be an interesting topic
09:07:49 <felixhuettner[m]> although it is quite snakeoily
09:07:49 <amorin> ok, I can try to reach our internal AI/GPU team
09:07:51 <ttx> yeah, would be well aligned with the Foundation's 2024 messaging
09:08:28 <amorin> #action amorin reach OVHcloud AI team about openstack GPU integration
09:08:50 <ttx> #action everyone to think about a hot theme for our next episode, to be discussed at May meeting
09:09:18 <ttx> feel free to email me if you have something urgent earlier
09:09:44 <ttx> Any other idea before we move on?
09:10:40 <amorin> nop
09:10:41 <ttx> #topic Large scale doc
09:10:50 <ttx> No open reviews...
09:11:13 <ttx> amorin: do you want to discuss your rabbitMQ points ? Not sure large scale doc is the right outlet
09:11:30 <amorin> I can do it here
09:11:42 <ttx> Wondering if a technical blog post would not be better for it
09:11:47 <amorin> it's not only doc
09:12:07 <amorin> a blog post where?
09:13:11 <ttx> Could be on an OVH platform that we would promote... or maybe directly on https://www.openstack.org/blog/
09:14:16 <amorin> yes
09:14:32 <amorin> I'll talk about it internally then and let you know which one is best
09:14:46 <amorin> we will talk about it on OpenInfra Day France also
09:14:53 <ttx> OK, in the meantime I will ask marketing here if they would be interested in featuring it on the openstack blog
09:15:10 <ttx> yeah, that's great (OID France talk)
09:15:14 <amorin> the fact is that not all of the stuff is upstream / merged yet
09:15:35 <ttx> could be on superuser instead
09:16:01 <ttx> openstack blog is more about things that landed, but superuser can be more opinionated
09:16:14 <amorin> and for some of them, I'd like to switch the default behavior on the oslo.messaging side, but I don't know how to proceed, maybe the large-scale group would be able to push it with me
09:16:18 <ttx> I'll ask and let you know
09:16:22 <amorin> ok!
09:17:03 <ttx> I think it's a good story, between what you already contributed and what you would like to change in the future
09:17:14 <ttx> in addition to being good large scale advice
09:17:23 <amorin> yup
09:17:30 <amorin> Should I expose some of the points in this discussion now for the record?
09:17:41 <ttx> yes please
09:18:02 <amorin> Let's go, stop me if needed :)
09:18:13 <amorin> Both nova and neutron heavily rely on rabbitmq for intra communication (between agents running on computes and API running on control plane).
09:18:23 <amorin> RabbitMQ clustering is a must have to let operators manage the lifecycle of rabbitMQ
09:18:30 <amorin> Some recent improvements have been made to oslo.messaging to allow better scaling and management of rabbitmq queues
09:18:34 <amorin> Here is a list of what we did on OVH side to achieve better stability at large scale
09:18:39 <amorin> - Better eventlet / green thread management
09:19:13 <amorin> On that part, OVH basically finished the work to prevent green threads from failing to send heartbeats
09:19:23 <amorin> patches are merged upstream and available by default.
09:19:33 <amorin> - Replace classic HA with quorum
09:19:52 <amorin> Quorum queues are the future for rabbitmq, replacing classic HA
09:20:01 <amorin> OVH did a patch to finish this implementation
09:20:18 <amorin> Using quorum queues is not yet the default and we would like to enable this by default, we may need help from large-scale about this
09:20:24 <amorin> - Consistent queue naming
09:20:31 <amorin> oslo.messaging was relying on random queue naming.
09:20:42 <amorin> While this does not seem to be a problem on small deployments, it has two bad side effects:
09:20:43 <amorin> - it's harder to figure out which service created a specific queue
09:20:45 <amorin> - as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in rabbitmq
09:20:47 <amorin> These side effects are highly visible at large scale, and even more visible when using quorum queues.
09:20:55 <amorin> We did a patch on oslo.messaging to stop using random names.
09:20:58 <amorin> This is now merged upstream, but disabled by default.
09:21:05 <amorin> We would like to enable this by default in the future.
09:21:31 <ttx> question
09:21:32 <amorin> This is very important for us, and it helped a lot in identifying which compute is affected by a rabbit issue
09:21:34 <amorin> yes?
09:21:59 <ttx> Should we document the setting that enables quorum queue in the large scale doc first?
09:22:08 <ttx> as a step toward making it default
09:22:15 <amorin> yes, we should
09:22:32 <amorin> the bad thing is that it's not easy to switch
09:22:43 <amorin> because it needs all queues to be destroyed and recreated
09:22:58 <amorin> oslo.messaging is not able to do it
09:23:08 <ttx> For all the settings that you suggest should be default we should first document them in large-scale doc
09:23:25 <amorin> what we did on our side: we stopped all openstack services, destroyed the queues, switched the flag, and started everything back up
09:23:41 <ttx> maybe there is a way to make it default for new deployments only
09:23:44 <amorin> I agree about pushing it on doc
09:23:50 <ttx> (not affecting existing ones)
09:24:18 <amorin> #action push documentation about enabling quorum queues and other rabbitmq improvement settings for large scale
09:25:06 <amorin> ok, let's continue
09:25:22 <amorin> - Reduce the number of queues
09:25:30 <amorin> RabbitMQ is a message broker, not a queue broker.
09:25:35 <amorin> With a high number of queues, rabbitmq does not work correctly (timeouts / cpu usage / network usage / etc.).
09:25:45 <amorin> OVH did some patches to reduce the number of queues created by neutron by patching oslo.messaging and neutron code (we divided the number of neutron queues by 5).
09:25:57 <amorin> This is not yet upstream, we would like to push it
09:26:37 <amorin> Without those patches, our biggest regions (2k computes) were consuming a lot of rabbitmq resources and it was hard to keep it working correctly
09:26:57 <amorin> Next thing is:
09:26:59 <amorin> - Replace classic fanouts with streams
09:27:14 <amorin> OVH did a patch to rely on "stream" queues to replace classic fanouts.
09:27:44 <amorin> this reduces the number of queues (a lot) and also the number of messages transiting through rabbit
09:28:05 <amorin> rabbit is now in charge of deduplicating the identical messages to all computes
09:28:09 <amorin> Those patches are merged upstream but disabled by default
09:28:13 <amorin> We would like to enable this by default.
09:28:20 <amorin> (will do doc about it)
09:28:28 <amorin> last point is:
09:28:30 <amorin> - Get rid of 'transient' queues
09:28:37 <amorin> oslo.messaging distinguishes 'transient' queues from other queues but it makes no sense
09:28:46 <amorin> neutron and nova do not rely on transient queues
09:29:30 <amorin> E.G. a reply_xyz queue is transient in oslo, but it needs to be fully replicated for nova/neutron to work correctly
09:30:11 <amorin> in the past, such queues were not HA, so if you had to switch off one server of your rabbitmq cluster, you would affect part of your deployment
09:30:23 <amorin> losing messages, leading to inconsistency, etc.
09:30:56 <amorin> by making those queues HA, you can operate your rabbitmq cluster correctly
09:31:17 <amorin> TLDR, transient queues are nonsense
09:31:23 <ttx> That one might be the trickiest to convince people out of, due to history
09:31:30 <amorin> yup
09:31:45 <ttx> but an article can be the first step
09:31:51 <amorin> but removing such a concept from oslo would greatly simplify the code, the configuration, and understanding what is at stake
09:32:14 <amorin> that's all! thank you for reading :)
09:32:16 <ttx> and together with other actions show that you did your homework and speak from a good expertise position
09:32:50 <amorin> I am proud of the team on that part, we gained a lot of experience on the rabbit side recently
09:33:13 <ttx> agreed! It's why it's good material for "superuser" which is user-oriented
09:33:25 <ttx> but I'll defer to Allison on where the best place is
09:33:33 <amorin> thanks!
09:33:41 <ttx> anything else on this topic?
09:33:56 <amorin> I am done
09:34:08 <ttx> #topic Next meeting(s)
09:34:23 <ttx> Next meeting is May 22, which conflicts with OID France
09:34:56 <ttx> So I was thinking of moving the meeting to the 3rd Wednesday instead, which would be May 15, June 19...
09:35:29 <ttx> rather than skipping it
09:35:34 <ttx> thoughts on that?
09:35:35 <amorin> that works for me
09:35:58 <ttx> felix.huettner: ?
09:36:38 <felixhuettner[m]> i'm not there on the 15th May
09:36:42 <felixhuettner[m]> but otherwise that is fine with me
09:38:04 <ttx> Hmm, alternatively we could move it back one week exceptionally in May (May 29, June 19), but that is trickier/impossible to specify in the booking system
09:38:45 <ttx> Err, that would be May 29, June 26
09:39:40 <ttx> Also I generally have conflicts on the last week of months so would not mind moving up
09:40:58 <ttx> what should we do?
09:41:24 <felixhuettner[m]> but we can just move it to the 3rd wednesday
09:41:35 <ttx> ok
09:41:36 <ttx> #action ttx to move meeting to 3rd Wednesday every month
09:41:40 <felixhuettner[m]> then i'm just not there for one session
09:41:56 <amorin> yes
09:42:07 <ttx> OK, if you have episode ideas in the meantime, let us know by email :)
09:42:16 <ttx> #topic Open discussion
09:42:23 <ttx> Anything else, anyone?
09:42:57 <amorin> nop
09:43:24 <ttx> alright then
09:43:28 <ttx> #endmeeting