09:01:18 <ttx> #startmeeting large_scale_sig
09:01:18 <opendevmeet> Meeting started Wed Apr 24 09:01:18 2024 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:01:18 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:01:18 <opendevmeet> The meeting name has been set to 'large_scale_sig'
09:01:30 <ttx> #topic Rollcall
09:01:41 <ttx> amorin: o/
09:01:43 <amorin> hey there
09:01:53 <ttx> Anyone else here for the large scale sig meeting?
09:02:26 <amorin> maybe felixhuettner[m] ?
09:02:31 <ttx> Our agenda for today's meeting is
09:02:34 <ttx> #link https://etherpad.opendev.org/p/large-scale-sig-meeting
09:04:05 <amorin> I added info on this etherpad about our recent rabbitmq improvement
09:04:21 <ttx> OK I'll read
09:04:36 <ttx> Pinging felix.huettner
09:05:13 <amorin> we can talk about it during open discussion I think
09:05:14 <ttx> #topic Ops Deep Dive proposals review
09:05:30 <ttx> So we did not get anything, which makes me wonder if we should change our strategy here
09:05:38 <felixhuettner[m]> o/ sorry
09:05:39 <amorin> I agree
09:05:57 <ttx> Maybe plan to do one ourselves in the future to keep it hot and running
09:06:20 <felixhuettner[m]> we can do that, it's just a question of topics
09:06:36 <ttx> Would be great to hit something trendy... like a GPU-oriented cloud, or VMware transitioning
09:06:59 <ttx> Maybe we can think about topics between now and next meeting and come back with proposals ?
09:07:20 <felixhuettner[m]> we have a lot of people asking about confidential VMs/GPUs
09:07:29 <felixhuettner[m]> might also be an interesting topic
09:07:49 <felixhuettner[m]> although it is quite snakeoily
09:07:49 <amorin> ok, I can try to reach internal IA/GPU team
09:07:51 <ttx> yeah, would be well aligned with the Foundation's 2024 messaging
09:08:28 <amorin> #action amorin reach OVHcloud IA team about openstack GPU integration
09:08:50 <ttx> #action everyone to think about a hot theme for our next episode, to be discussed at May meeting
09:09:18 <ttx> feel free to email me if you have something urgent earlier
09:09:44 <ttx> Any other idea before we move on?
09:10:40 <amorin> nop
09:10:41 <ttx> #topic Large scale doc
09:10:50 <ttx> No open reviews...
09:11:13 <ttx> amorin: do you want to discuss your rabbitMQ points ? Not sure large scale doc is the right outlet
09:11:30 <amorin> I can do it here
09:11:42 <ttx> Wondering if a technical blog post would not be better for it
09:11:47 <amorin> it's not only doc
09:12:07 <amorin> a blog post where?
09:13:11 <ttx> Could be on an OVH platform that we would promote... or maybe directly on https://www.openstack.org/blog/
09:14:16 <amorin> yes
09:14:32 <amorin> I'll talk about it internally then and let you know which one is best
09:14:46 <amorin> we will talk about it at OpenInfra Day France also
09:14:53 <ttx> OK, in the meantime I will ask marketing here if they would be interested in featuring it on openstack blog
09:15:10 <ttx> yeah, that's great (OID France talk)
09:15:14 <amorin> the fact is that not all of the stuff is upstream / merged yet
09:15:35 <ttx> could be on superuser instead
09:16:01 <ttx> openstack blog is more about things that landed, but superuser can be more opinionated
09:16:14 <amorin> and for part of them, I'd like to switch default behavior on oslo messaging part, but I don't know how to proceed, maybe large-scale group would be able to push it with me
09:16:18 <ttx> I'll ask and let you know
09:16:22 <amorin> ok!
09:17:03 <ttx> I think it's a good story, between what you already contributed and what you would like to change in the future
09:17:14 <ttx> in addition to being good large scale advice
09:17:23 <amorin> yup
09:17:30 <amorin> Should I expose some of the points in this discussion now for the record?
09:17:41 <ttx> yes please
09:18:02 <amorin> Let's go, stop me if needed :)
09:18:13 <amorin> Both nova and neutron heavily rely on rabbitmq for intra communication (between agents running on computes and API running on control plane).
09:18:23 <amorin> RabbitMQ clustering is a must have to let operators manage the lifecycle of rabbitMQ
09:18:30 <amorin> Some recent improvements have been made to oslo.messaging to allow better scaling and management of rabbitmq queues
09:18:34 <amorin> Here is a list of what we did on OVH side to achieve better stability at large scale
09:18:39 <amorin> - Better eventlet / green thread management
09:19:13 <amorin> On that part, OVH basically finished the work to keep green threads from failing to send heartbeats
09:19:23 <amorin> patches are merged upstream and available by default.
09:19:33 <amorin> - Replace classic HA with quorum
09:19:52 <amorin> Quorum queues are the future for rabbitmq, replacing classic HA
09:20:01 <amorin> OVH did a patch to finish this implementation
09:20:18 <amorin> Using quorum queues is not yet the default and we would like to enable this by default, we may need help from large-scale about this
09:20:24 <amorin> - Consistent queue naming
09:20:31 <amorin> oslo.messaging was relying on random queue naming.
09:20:42 <amorin> While this does not seem to be a problem on small deployments, it has two bad side effects :
09:20:43 <amorin> - it's harder to figure out which service created a specific queue
09:20:45 <amorin> - as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in rabbitmq
09:20:47 <amorin> These side effects are highly visible at large scale, and even more visible when using quorum queues.
09:20:55 <amorin> We did a patch on oslo.messaging to stop using random names.
09:20:58 <amorin> This is now merged upstream, but disabled by default.
09:21:05 <amorin> We would like to enable this by default in the future.
09:21:31 <ttx> question
09:21:32 <amorin> This is very important for us, and it helped a lot identifying which compute is affected by a rabbit issue
09:21:34 <amorin> yes?
09:21:59 <ttx> Should we document the setting that enables quorum queue in the large scale doc first?
09:22:08 <ttx> as a step toward making it default
09:22:15 <amorin> yes, we should
09:22:32 <amorin> the bad thing is that it's not easy to switch
09:22:43 <amorin> because it needs all queues to be destroyed and recreated
09:22:58 <amorin> oslo.messaging is not able to do it
09:23:08 <ttx> For all the settings that you suggest should be default we should first document them in large-scale doc
09:23:25 <amorin> what we did on our side: we stopped all openstack services, destroyed the queues, switched the flag, started everything back
09:23:41 <ttx> maybe there is a way to make it default for new deployments only
09:23:44 <amorin> I agree about pushing it on doc
09:23:50 <ttx> (not affecting existing ones)
09:24:18 <amorin> #action push documentation about enabling quorum queues and other rabbitmq improvement settings for large scale
09:25:06 <amorin> ok, let's continue
09:25:22 <amorin> - Reduce the number of queues
09:25:30 <amorin> RabbitMQ is a message broker, not a queue broker.
09:25:35 <amorin> With a high number of queues, rabbitmq does not work correctly (timeouts / cpu usage / network usage / etc.).
09:25:45 <amorin> OVH did some patches to reduce the number of queues created by neutron by patching oslo.messaging and neutron code (we divide neutron number of queues by 5).
09:25:57 <amorin> This is not yet upstream, we would like to push it
09:26:37 <amorin> Without those patches, our biggest regions (2k computes) were consuming a lot of rabbitmq resources and it was hard to keep it working correctly
09:26:57 <amorin> Next thing is:
09:26:59 <amorin> - Replace classic fanouts with streams
09:27:14 <amorin> OVH did a patch to rely on "stream" queues to replace classic fanouts.
09:27:44 <amorin> this is reducing the number of queues (a lot) and also the number of messages transiting through rabbit
09:28:05 <amorin> rabbit is now in charge of deduplicating the identical messages to all computes
09:28:09 <amorin> Those patches are merged upstream but disabled by default
09:28:13 <amorin> We would like to enable this by default.
09:28:20 <amorin> (will do doc about it)
09:28:28 <amorin> last point is:
09:28:30 <amorin> - Get rid of 'transient' queues
09:28:37 <amorin> oslo.messaging is distinguishing 'transient' queues from other queues but it makes no sense
09:28:46 <amorin> neutron and nova do not rely on transient queues
09:29:30 <amorin> E.g. a reply_xyz queue is transient in oslo, but it needs to be fully replicated for nova/neutron to work correctly
09:30:11 <amorin> in the past, such queues were not HA, so if you had to switch off one server of your rabbitmq cluster, you would affect part of your deployment
09:30:23 <amorin> losing messages, leading to inconsistency, etc.
09:30:56 <amorin> by making those queues HA, you can operate your rabbitmq cluster correctly
09:31:17 <amorin> TLDR, transient queues are nonsense
09:31:23 <ttx> That one might be the trickiest to convince people out of, due to history
09:31:30 <amorin> yup
09:31:45 <ttx> but an article can be the first step
09:31:51 <amorin> but removing such a concept in oslo would simplify the code and the configuration a lot, and make the stakes easier to understand
09:32:14 <amorin> that's all! thank you for reading :)
09:32:16 <ttx> and together with other actions show that you did your homework and speak from a good expertise position
09:32:50 <amorin> I am proud of the team on that part, we gained a lot of experience on rabbit side recently
09:33:13 <ttx> agreed! It's why it's good material for "superuser" which is user-oriented
09:33:25 <ttx> but I'll defer to Allison on where is the best
09:33:33 <amorin> thanks!
09:33:41 <ttx> anything else on this topic?
09:33:56 <amorin> I am done
09:34:08 <ttx> #topic Next meeting(s)
09:34:23 <ttx> Next meeting is May 22, which conflicts with OID France
09:34:56 <ttx> So I was thinking moving the meeting to the 3rd wednesday instead, which would be May 15, June 19...
09:35:29 <ttx> rather than skipping it
09:35:34 <ttx> thoughts on that?
09:35:35 <amorin> that works for me
09:35:58 <ttx> felix.huettner: ?
09:36:38 <felixhuettner[m]> i'm not there on the 15th May
09:36:42 <felixhuettner[m]> but otherwise that is fine with me
09:38:04 <ttx> Hmm, alternatively we could move it back one week exceptionally in May (May 29, June 19), but that is trickier/impossible to specify in the booking system
09:38:45 <ttx> Err, that would be May 29, June 26
09:39:40 <ttx> Also I generally have conflicts on the last week of months so would not mind moving up
09:40:58 <ttx> what should we do?
09:41:24 <felixhuettner[m]> but we can just move it to the 3rd wednesday
09:41:35 <ttx> ok
09:41:36 <ttx> #action ttx to move meeting to 3rd wednesday every month
09:41:40 <felixhuettner[m]> then i'm just not there for one session
09:41:56 <amorin> yes
09:42:07 <ttx> OK, if you have episode ideas in the meantime, let us know by email :)
09:42:16 <ttx> #topic Open discussion
09:42:23 <ttx> Anything else, anyone?
09:42:57 <amorin> nop
09:43:24 <ttx> alright then
09:43:28 <ttx> #endmeeting
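
The "consistent queue naming" point raised during the RabbitMQ discussion can be illustrated with a small Python sketch. This is not oslo.messaging code: the function names, the name format, and the host and process values below are all hypothetical, chosen only to show why a random suffix leaves one extra queue behind per restart while a name built from stable identifiers is reused.

```python
import uuid


def random_queue_name(topic):
    """Hypothetical random naming: a fresh suffix on every service start."""
    return f"{topic}_{uuid.uuid4().hex}"


def stable_queue_name(topic, host, process):
    """Hypothetical stable naming: built only from identifiers that
    survive a restart, so the same queue is found again."""
    return f"{topic}_{host}_{process}"


def simulate_restarts(make_name, restarts):
    """Collect every queue name the broker would have seen after
    `restarts` service starts (old queues are never cleaned up here)."""
    seen = set()
    for _ in range(restarts):
        seen.add(make_name())
    return seen


# Illustrative values only; "compute-42" and "nova-compute" are made up.
random_queues = simulate_restarts(lambda: random_queue_name("reply"), 5)
stable_queues = simulate_restarts(
    lambda: stable_queue_name("reply", "compute-42", "nova-compute"), 5
)

print(len(random_queues), "queues left with random names")
print(len(stable_queues), "queue left with stable names")
```

With stable names the queue name also says which host and process own it, which matches the remark in the log about identifying the compute affected by a rabbit issue.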