09:01:18 #startmeeting large_scale_sig
09:01:18 Meeting started Wed Apr 24 09:01:18 2024 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:01:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:01:18 The meeting name has been set to 'large_scale_sig'
09:01:30 #topic Rollcall
09:01:41 amorin: o/
09:01:43 hey there
09:01:53 Anyone else here for the large scale sig meeting?
09:02:26 maybe felixhuettner[m] ?
09:02:31 Our agenda for today's meeting is
09:02:34 #link https://etherpad.opendev.org/p/large-scale-sig-meeting
09:04:05 I added info on this etherpad about our recent rabbitmq improvements
09:04:21 OK I'll read
09:04:36 Pinging felix.huettner
09:05:13 we can talk about it during open discussion I think
09:05:14 #topic Ops Deep Dive proposals review
09:05:30 So we did not get anything, which makes me wonder if we should change our strategy here
09:05:38 o/ sorry
09:05:39 I agree
09:05:57 Maybe plan to do one ourselves in the future to keep it hot and running
09:06:20 we can do that, it's just a question of topics
09:06:36 Would be great to hit something trendy... like a GPU-oriented cloud, or transitioning off VMware
09:06:59 Maybe we can think about topics between now and the next meeting and come back with proposals?
09:07:20 we have a lot of people asking about confidential VMs/GPUs
09:07:29 might also be an interesting topic
09:07:49 although it is quite snake-oily
09:07:49 ok, I can try to reach our internal AI/GPU team
09:07:51 yeah, would be well aligned with the Foundation's 2024 messaging
09:08:28 #action amorin to reach the OVHcloud AI team about OpenStack GPU integration
09:08:50 #action everyone to think about a hot theme for our next episode, to be discussed at the May meeting
09:09:18 feel free to email me if you have something urgent earlier
09:09:44 Any other ideas before we move on?
09:10:40 nope
09:10:41 #topic Large scale doc
09:10:50 No open reviews...
09:11:13 amorin: do you want to discuss your RabbitMQ points? Not sure the large scale doc is the right outlet
09:11:30 I can do it here
09:11:42 Wondering if a technical blog post would not be better for it
09:11:47 it's not only doc
09:12:07 a blog post where?
09:13:11 Could be on an OVH platform that we would promote... or maybe directly on https://www.openstack.org/blog/
09:14:16 yes
09:14:32 I'll talk about it internally then and let you know which one is best
09:14:46 we will also talk about it at OpenInfra Day France
09:14:53 OK, in the meantime I will ask marketing here if they would be interested in featuring it on the OpenStack blog
09:15:10 yeah, that's great (the OID France talk)
09:15:14 the fact is that not all of the stuff is upstream / merged yet
09:15:35 could be on Superuser instead
09:16:01 the OpenStack blog is more about things that landed, but Superuser can be more opinionated
09:16:14 and for part of them, I'd like to switch the default behavior on the oslo.messaging side, but I don't know how to proceed; maybe the large-scale group would be able to push it with me
09:16:18 I'll ask and let you know
09:16:22 ok!
09:17:03 I think it's a good story, between what you already contributed and what you would like to change in the future
09:17:14 in addition to being good large scale advice
09:17:23 yup
09:17:30 Should I expose some of the points in this discussion now, for the record?
09:17:41 yes please
09:18:02 Let's go, stop me if needed :)
09:18:13 Both nova and neutron rely heavily on RabbitMQ for internal communication (between agents running on computes and the APIs running on the control plane).
09:18:23 RabbitMQ clustering is a must-have to let operators manage the lifecycle of RabbitMQ
09:18:30 Some recent improvements have been made in oslo.messaging to allow better scaling and management of RabbitMQ queues
09:18:34 Here is a list of what we did on the OVH side to achieve better stability at large scale
09:18:39 - Better eventlet / green thread management
09:19:13 On that part, OVH basically finished the work to prevent green threads from failing to send heartbeats
09:19:23 patches are merged upstream and enabled by default.
09:19:33 - Replace classic HA with quorum queues
09:19:52 Quorum queues are the future for RabbitMQ, replacing classic HA
09:20:01 OVH did a patch to finish this implementation
09:20:18 Using quorum queues is not yet the default and we would like to enable it by default; we may need help from large-scale on this
09:20:24 - Consistent queue naming
09:20:31 oslo.messaging was relying on random queue naming.
09:20:42 While this seems harmless on small deployments, it has two bad side effects:
09:20:43 - it's harder to figure out which service created a specific queue
09:20:45 - as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in RabbitMQ
09:20:47 These side effects are highly visible at large scale, and even more visible when using quorum queues.
09:20:55 We did a patch on oslo.messaging to stop using random names.
09:20:58 This is now merged upstream, but disabled by default.
09:21:05 We would like to enable this by default in the future.
09:21:31 question
09:21:32 This is very important for us, and it helped a lot in identifying which compute is affected by a rabbit issue
09:21:34 yes?
09:21:59 Should we document the setting that enables quorum queues in the large scale doc first?
09:22:08 as a step toward making it the default
09:22:15 yes, we should
09:22:32 the bad thing is that it's not easy to switch
09:22:43 because it needs all queues to be destroyed and recreated
09:22:58 oslo.messaging is not able to do that itself
09:23:08 For all the settings that you suggest should be defaults, we should first document them in the large-scale doc
09:23:25 what we did on our side: we stopped all OpenStack services, destroyed the queues, switched the flag, and started everything back up
09:23:41 maybe there is a way to make it the default for new deployments only
09:23:44 I agree about pushing it to the doc
09:23:50 (not affecting existing ones)
09:24:18 #action push documentation about enabling quorum queues and other rabbitmq improvement settings for large scale
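(For reference, a sketch of the stop/destroy/switch/restart procedure described above. This is illustrative only, not an official runbook: the service unit names and vhost are placeholders to adapt to your deployment, rabbitmqadmin is just one way to delete queues, and the rabbit_quorum_queue option name is as it appears in recent oslo.messaging releases; verify against your installed version.)

  # 1. Stop every OpenStack service that uses the message bus
  systemctl stop nova-api nova-conductor neutron-server   # placeholder unit names

  # 2. Delete the existing classic queues so they can be recreated
  rabbitmqadmin --vhost=/ list queues name
  rabbitmqadmin --vhost=/ delete queue name=<queue_name>  # repeat for each queue

  # 3. Switch the flag in each service's configuration:
  #      [oslo_messaging_rabbit]
  #      rabbit_quorum_queue = true

  # 4. Start the services back up; the queues are recreated as quorum queues
  systemctl start nova-api nova-conductor neutron-server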
09:25:06 ok, let's continue
09:25:22 - Reduce the number of queues
09:25:30 RabbitMQ is a message broker, not a queue broker.
09:25:35 With a high number of queues, RabbitMQ does not work correctly (timeouts / CPU usage / network usage / etc.).
09:25:45 OVH did some patches to reduce the number of queues created by neutron, by patching oslo.messaging and neutron code (we divided neutron's number of queues by 5).
09:25:57 This is not yet upstream; we would like to push it
09:26:37 Without those patches, our biggest regions (2k computes) were consuming a lot of RabbitMQ resources and it was hard to keep things working correctly
09:26:57 Next thing is:
09:26:59 - Replace classic fanouts with streams
09:27:14 OVH did a patch to rely on "stream" queues to replace classic fanouts.
09:27:44 this reduces the number of queues (a lot) and also the number of messages transiting through rabbit
09:28:05 rabbit is now in charge of deduplicating the identical messages sent to all computes
09:28:09 Those patches are merged upstream but disabled by default
09:28:13 We would like to enable this by default.
09:28:20 (will write doc about it)
09:28:28 last point is:
09:28:30 - Get rid of 'transient' queues
09:28:37 oslo.messaging distinguishes 'transient' queues from other queues, but it makes no sense
09:28:46 neutron and nova do not rely on transient queues
09:29:30 E.g. a reply_xyz queue is transient in oslo, but it needs to be fully replicated for nova/neutron to work correctly
09:30:11 in the past, such queues were not HA, so if you had to switch off one server of your RabbitMQ cluster, you would affect part of your deployment
09:30:23 losing messages, leading to inconsistency, etc.
09:30:56 by making those queues HA, you can operate your RabbitMQ cluster correctly
09:31:17 TL;DR, transient queues are nonsense
09:31:23 That one might be the trickiest to convince people out of, due to history
09:31:30 yup
09:31:45 but an article can be the first step
09:31:51 but removing that concept from oslo would simplify a lot of code, configuration and understanding
09:32:14 that's all! thank you for reading :)
09:32:16 and together with the other actions it would show that you did your homework and speak from a position of real expertise
09:32:50 I am proud of the team on that part, we gained a lot of experience on the rabbit side recently
09:33:13 agreed! That's why it's good material for Superuser, which is user-oriented
09:33:25 but I'll defer to Allison on where is best
09:33:33 thanks!
09:33:41 anything else on this topic?
09:33:56 I am done
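(For the record, a consolidated sketch of the opt-in settings discussed in this walkthrough, which live in the [oslo_messaging_rabbit] section of each service's configuration. The option names are those found in recent oslo.messaging releases; availability and defaults vary by release, so treat this as a starting point to verify, not authoritative guidance.)

  [oslo_messaging_rabbit]
  # Quorum queues instead of classic mirrored (HA) queues
  rabbit_quorum_queue = true
  # Deterministic per-service queue names instead of random ones
  use_queue_manager = true
  # Stream queues replacing classic fanout queues
  rabbit_stream_fanout = true
  # Give 'transient' queues (e.g. reply_*) the same durability as the rest
  rabbit_transient_quorum_queue = true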
09:34:08 #topic Next meeting(s)
09:34:23 Next meeting is May 22, which conflicts with OID France
09:34:56 So I was thinking of moving the meeting to the 3rd Wednesday instead, which would be May 15, June 19...
09:35:29 rather than skipping it
09:35:34 thoughts on that?
09:35:35 that works for me
09:35:58 felix.huettner: ?
09:36:38 I'm not there on May 15th
09:36:42 but otherwise that is fine with me
09:38:04 Hmm, alternatively we could exceptionally move it back one week in May (May 29, June 19), but that is trickier/impossible to specify in the booking system
09:38:45 Err, that would be May 29, June 26
09:39:40 Also I generally have conflicts in the last week of the month, so I would not mind moving up
09:40:58 what should we do?
09:41:24 we can just move it to the 3rd Wednesday
09:41:35 ok
09:41:36 #action ttx to move the meeting to the 3rd Wednesday of every month
09:41:40 then I'm just not there for one session
09:41:56 yes
09:42:07 OK, if you have episode ideas in the meantime, let us know by email :)
09:42:16 #topic Open discussion
09:42:23 Anything else, anyone?
09:42:57 nope
09:43:24 alright then
09:43:28 #endmeeting
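(A short verification sketch for operators following along with the settings discussed above: the effect can be checked on the broker itself. This assumes rabbitmqctl on a RabbitMQ 3.8+ node; adjust the vhost with -p if needed.)

  # Show each queue with its type (classic / quorum / stream)
  rabbitmqctl list_queues name type messages consumers

  # Rough gauge of broker load: the total number of queues
  rabbitmqctl list_queues name | wc -l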