08:00:19 <ttx> #startmeeting large_scale_sig
08:00:19 <openstack> Meeting started Wed Sep 23 08:00:19 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:22 <openstack> The meeting name has been set to 'large_scale_sig'
08:00:23 <ttx> #topic Rollcall
08:00:29 <ttx> Who is here for the Large Scale SIG meeting?
08:00:34 <mdelavergne> Hi
08:00:56 <ttx> mdelavergne: hi!
08:01:29 <ttx> pinging amorin
08:01:46 <amorin> hello!
08:01:58 <amorin> thanks for the ping, completely forgot :(
08:02:20 <ttx> belmoreira seems offline
08:03:02 <ttx> masahito might be around
08:04:41 <ttx> While we wait and are between French people -- I had a nice chat with the ops at Societe Generale yesterday. They are in the first stages of scaling (think 100->1000) and are a bit in the unknown, which reinforces the need for documenting what to expect when scaling up/out
08:05:16 <ttx> I told then as they transition to 100->1000 they will likely hit the need to scale out at some point
08:05:22 <ttx> them*
08:05:40 <mdelavergne> Nice!
08:05:45 <ttx> But it's unclear what to watch for, and what will fail first
08:06:00 <mdelavergne> They might have some surprises :(
08:06:17 <ttx> (they are already sharding RabbitMQ a bit so I expect that won't be the first thing)
08:06:27 <ttx> masahito: hi!
08:06:50 <ttx> alright, let's get started
08:06:52 <ttx> Our agenda for today is at:
08:06:56 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
08:07:10 <ttx> #topic Meaningful monitoring
08:07:19 <masahito> hi! sorry I'm late. We're hitting a RabbitMQ outage right now :-(
08:07:30 <ttx> Two weeks ago we had our second EU-US meeting, and we actually had new participants joining!
08:07:58 <ttx> masahito: we can catch up later, you should focus on the urgent things :)
08:08:13 <ttx> One of the things the newcomers were interested in pushing was the idea of "meaningful monitoring"
08:08:24 <ttx> which I think could be a third workstream for the group
08:08:32 <ttx> as it's a pretty common challenge across large scale deployments
08:08:56 <ttx> and also matches the "how do you do metrics/billing in your deployment" question
08:09:08 <ttx> Thoughts on that?
08:10:46 <ttx> Is that something interesting to work on, or is it basically a solved problem?
08:11:40 <mdelavergne> It's definitely a good point, but where can we begin tackling this problem?
08:12:25 <ttx> I'd start by defining the need, and discuss whether the current tooling available is covering it or not
08:12:43 <belmoreira> sorry for joining late
08:12:47 <mdelavergne> hi!
08:12:50 <ttx> Also collecting stories
08:12:54 <ttx> belmoreira: hi!
08:13:08 <ttx> We were discussing "meaningful monitoring" as a potential new workstream for the group
08:13:30 <ttx> The new US participants (James Penick and Erik from Blizzard) were interested in that
08:14:05 <ttx> belmoreira: does that sound like a valuable thing for the group to explore?
08:14:18 <belmoreira> +1 from me. With all the work on oslo.metrics, it looks like a good next step
08:14:44 <ttx> OK I'll start documenting that and come up with a plan we can bootstrap from the US+EU side of the meeting
08:15:08 <belmoreira> was there any discussion regarding notifications?
08:15:14 <ttx> #action ttx to draft a plan to tackle "meaningful monitoring" as a new SIG workstream
08:15:27 <ttx> belmoreira: no, but I'd say that's part of "meaningful"
08:15:35 <ttx> like... "actionable monitoring"
08:16:15 <belmoreira> there are projects and produce a lot of notifications but no easy/interesting way to get "meaningful" from them
08:16:28 <belmoreira> s/and/that
08:16:58 <ttx> ack
08:17:40 <ttx> I think that's what they meant. There is data being produced, but it's difficult to derive operational status from it
08:18:26 <ttx> moving on to our other workstreams...
08:18:29 <ttx> #topic Progress on "Documenting large scale operations" goal
08:18:31 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
08:18:38 <ttx> On the OSops side, the repository was just created:
08:18:40 <ttx> #link https://opendev.org/openstack/osops
08:18:50 <ttx> It's managed under the "Ops Docs and Tooling" SIG, which is obviously closely related to ours
08:19:12 <ttx> but probably a better host for tools that are not specific to large scale
08:19:20 <ttx> amorin: you should reach out to smcginnis (or the SIG) to see where you could propose OSarchiver
08:19:28 <ttx> Personally I'd say it should have its own directory under tools/ as it is something you intend to maintain...
08:20:52 <belmoreira> I think this is a great initiative and for me this should be the "transition" path for the projects to adopt/integrate these tools
08:21:21 <belmoreira> this shows that ops still need something that the projects don't offer
08:21:24 <ttx> ok, I think we lost amorin, but he will pick up the message later I expect
08:21:30 <ttx> The other open task on that workstream is to collect more metrics/billing stories.
08:21:45 <ttx> A task we could just move to the new "Meaningful monitoring" workstream, as part of the "current status" collection
08:22:05 <ttx> Nothing new has been contributed there, so I will push back the action there
08:22:41 <ttx> #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
08:22:50 <ttx> Anything else on that topic?
08:23:26 <ttx> #topic Progress on "Scaling within one cluster" goal
08:23:30 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
08:23:41 <ttx> The work on middleware ping has landed, so we are left with oslo.metrics work...
08:23:50 <ttx> I did push a basic test framework for oslo.metrics at
08:23:53 <ttx> #link https://review.opendev.org/#/c/752262/
08:24:02 <ttx> I'll continue by trying to add meaningful tests
08:24:12 <ttx> (but need to dive deeper into the code first)
08:24:17 <ttx> masahito: any progress on posting your latest changes?
08:24:46 <ttx> amorin: did you discuss testing it internally at OVH?
08:24:49 <masahito> sorry.... nothing now.
08:25:06 <ttx> That's ok, I'll just push back the action item
08:25:07 <amorin> ttx yes
08:25:10 <ttx> #action masahito to push latest patches to oslo.metrics
08:25:11 <amorin> I talked about it
08:25:22 <amorin> we have not tested it yet
08:25:30 <amorin> but we are considering doing it
08:25:30 <ttx> ok, great! Keep us posted
08:25:34 <amorin> yup
08:25:53 <ttx> Our second subtopic is around scaling stories on:
08:25:57 <ttx> #link https://etherpad.openstack.org/p/scaling-stories
08:26:04 <ttx> Nothing new posted there... our next action is the forum session at the summit in a month.
08:26:18 <ttx> Anything else on this topic?
08:26:39 <belmoreira> I can contribute to the forum session
08:26:45 <belmoreira> if there is space
08:27:14 <ttx> belmoreira: there is! The goal will be to create a lively conversation space to encourage others to share their own journey scaling up/out
08:27:37 <ttx> So it's good to have 2-3 people signed up to bootstrap the discussion
08:27:55 <belmoreira> great, a good topic would be moving from cells to regions because of scalability issues
08:27:56 <ttx> We have James Penick (or someone from his team) + Arnaud that have volunteered to help too
08:28:09 <amorin> belmoreira: nice topic indeed!
08:28:26 <amorin> very curious about that
08:28:34 <amorin> is it because of neutron?
08:28:58 <belmoreira> amorin of course :)
08:29:24 <ttx> Maybe we can organize the discussion along different stages. Like "first thing that fails as you add more nodes" (scaling up woes), then moving from cells to regions (scaling out woes)
08:30:10 <belmoreira> ttx yes, that makes sense
08:30:44 <ttx> I'm personally very interested in getting an order of magnitude of number of nodes/activity for a single cluster, as it's the top question I'm being asked by new companies starting to scale up their deployments
08:31:10 <ttx> like "what should we expect? What should we watch?"
08:31:26 <amorin> we have input at OVH about that
08:31:43 <ttx> And I'm always like... "well, somewhere between 100 and 1000 physical servers, something will fail and force you to scale out"
08:31:48 <ttx> which is not super-helpful :)
08:32:12 <ttx> also I'm not even sure that 100-1000 range is current in Ussuri
08:32:16 <amorin> there are plenty of different things that could break
08:32:32 <amorin> and it depends also on what kind of usage you are doing with openstack
08:32:45 <amorin> that's why it's hard to answer I think
08:33:02 <ttx> amorin: right, type of activity will change the numbers completely
08:33:02 <belmoreira> and the risk that you are willing to take
08:33:34 <ttx> Also it's normal that "plenty of things can fail", since if there was only one thing, we would have handled that low hanging fruit already
08:34:03 <amorin> yup
08:34:32 <ttx> but still having guidance would be super helpful for the ops at that stage of their openstack life
08:34:39 <ttx> it can be fuzzy
08:34:58 <ttx> but just knowing that others have gone through that and survived... is good
08:35:44 <ttx> #topic PTG/Summit plans update
08:35:58 <ttx> So I did file a Forum session on scaling stories, which was just accepted/scheduled
08:36:05 * ttx checks when
08:36:37 <ttx> https://www.openstack.org/summit/2020/summit-schedule/events/24746/share-your-openstack-scaling-story
08:36:52 <ttx> Tuesday, October 20, 7:30am-8:15am CT (2:30pm - 3:15pm UTC)
08:36:53 <amorin> great
08:37:12 <ttx> hmm not CT, PT
08:37:39 <belmoreira> friendly time slot
08:37:51 <ttx> #info Forum session Tuesday, October 20, 7:30am-8:15am PT (2:30pm - 3:15pm UTC)
08:37:57 <ttx> yes, it's good for us
08:38:50 <ttx> Let me start an etherpad about it
08:38:58 <ttx> #link https://etherpad.opendev.org/p/w-forum-scaling-stories
08:41:25 <ttx> I dumped the base themes, feel free to add to that
08:41:54 <ttx> we only have 45 minutes so I would keep it to 4-5 questions max
08:42:50 <amorin> ack
08:43:31 <belmoreira> +1
08:43:44 <ttx> Alright let's refine that doc between now and the event
08:43:54 <ttx> Otherwise for PTG week we will have two one-hour sessions:
08:44:00 <ttx> #info PTG meeting Wednesday Oct 28 7UTC-8UTC and 16UTC-17UTC
08:44:08 <ttx> I suspect those will be approved but will let you know when I know more
08:44:31 <ttx> The idea will be to encourage people from the Forum session to follow up at the PTG hours
08:44:46 <ttx> (and then to sign them up to join the SIG meeting)
08:45:27 <ttx> We'll likely use both to introduce the various SIG workstreams and ask for input/help
08:45:42 <ttx> (both PTG hours)
08:45:57 <ttx> Any question on that
08:45:59 <ttx> ?
08:46:33 <ttx> ok then
08:46:34 <ttx> #topic Next meeting
08:46:43 <ttx> Next meeting will be US-EU-friendly on Oct 7, 16utc.
08:46:56 <ttx> Then we'll have summit and PTG weeks, and then go back to our regular cadence in November.
08:47:06 <ttx> #info next meeting: Oct 7, 16:00UTC
08:47:14 <ttx> Anything else before we close?
08:48:09 <ttx> Alright! Thanks everyone!
08:48:16 <ttx> #endmeeting