#openstack-meeting-3 log

08:00:19 <ttx> #startmeeting large_scale_sig
08:00:19 <openstack> Meeting started Wed Sep 23 08:00:19 2020 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:22 <openstack> The meeting name has been set to 'large_scale_sig'
08:00:23 <ttx> #topic Rollcall
08:00:29 <ttx> Who is here for the Large Scale SIG meeting ?
08:00:34 <mdelavergne> Hi
08:00:56 <ttx> mdelavergne: hi!
08:01:29 <ttx> pinging amorin
08:01:46 <amorin> hello!
08:01:58 <amorin> thanks for the ping, completely forgot :(
08:02:20 <ttx> belmoreira seems offline
08:03:02 <ttx> masahito might be around
08:04:41 <ttx> While we wait and are between French people -- I had a nice chat with the ops at Société Générale yesterday. They are in the first stages of scaling (think 100->1000) and are a bit in the unknown, which reinforces the need for documenting what to expect when scaling up/out
08:05:16 <ttx> I told them that as they transition from 100 to 1000 they will likely hit the need to scale out at some point
08:05:40 <mdelavergne> Nice!
08:05:45 <ttx> But it's unclear what to watch for, and what will fail first
08:06:00 <mdelavergne> They might have some surprises :(
08:06:17 <ttx> (they are already sharding RabbitMQ a bit so I expect that won't be the first thing)
08:06:27 <ttx> masahito: hi!
08:06:50 <ttx> alright, let's get started
08:06:52 <ttx> Our agenda for today is at:
08:06:56 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
08:07:10 <ttx> #topic Meaningful monitoring
08:07:19 <masahito> hi! sorry I'm late. We're hitting a RabbitMQ outage right now :-(
08:07:30 <ttx> Two weeks ago we had our second EU-US meeting, and we actually had new participants joining!
08:07:58 <ttx> masahito: we can catch up later, you should focus on the urgent things :)
08:08:13 <ttx> One of the things the newcomers were interested in pushing was the idea of "meaningful monitoring"
08:08:24 <ttx> which I think could be a third workstream for the group
08:08:32 <ttx> as it's a pretty common challenge across large scale deployments
08:08:56 <ttx> and also matches the "how do you do metrics/billing in your deployment" question
08:09:08 <ttx> Thoughts on that?
08:10:46 <ttx> Is that something interesting to work on, or is it basically a solved problem?
08:11:40 <mdelavergne> It's definitely a good point, but where could we begin tackling this problem?
08:12:25 <ttx> I'd start by defining the need, and discuss whether the current tooling available is covering it or not
08:12:43 <belmoreira> sorry for joining late
08:12:47 <mdelavergne> hi!
08:12:50 <ttx> Also collecting stories
08:12:54 <ttx> belmoreira: hi!
08:13:08 <ttx> We were discussing "meaningful monitoring" as a potential new workstream for the group
08:13:30 <ttx> The new US participants (James Penick and Erik from Blizzard) were interested in that
08:14:05 <ttx> belmoreira: does that sound like a valuable thing for the group to explore?
08:14:18 <belmoreira> +1 from me. With all the work on oslo.metrics, it looks like a good next step
08:14:44 <ttx> OK I'll start documenting that and come up with a plan we can bootstrap from the US+EU side of the meeting
08:15:08 <belmoreira> was there any discussion regarding notifications?
08:15:14 <ttx> #action ttx to draft a plan to tackle "meaningful monitoring" as a new SIG workstream
08:15:27 <ttx> belmoreira: no, but I'd say that's part of "meaningful"
08:15:35 <ttx> like... "actionable monitoring"
08:16:15 <belmoreira> there are projects that produce a lot of notifications, but there is no easy/interesting way to get "meaningful" information from them
08:16:58 <ttx> ack
08:17:40 <ttx> I think that's what they meant. There is data being produced, but it's difficult to derive operational status from it
08:18:26 <ttx> moving on to our other workstreams...
08:18:29 <ttx> #topic Progress on "Documenting large scale operations" goal
08:18:31 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
08:18:38 <ttx> On the OSops side, the repository was just created:
08:18:40 <ttx> #link https://opendev.org/openstack/osops
08:18:50 <ttx> It's managed under the "Ops Docs and Tooling" SIG, which is obviously closely related to ours
08:19:12 <ttx> but probably a better host for tools that are not specific to large scale
08:19:20 <ttx> amorin: you should reach out to smcginnis (or the SIG) to see where you could propose OSarchiver
08:19:28 <ttx> Personally I'd say it should have its own directory under tools/ as it is something you intend to maintain...
08:20:52 <belmoreira> I think this is a great initiative and for me this should be the "transition" path for the projects to adopt/integrate these tools
08:21:21 <belmoreira> this shows that ops still need something that the projects don't offer
08:21:24 <ttx> ok, I think we lost amorin, but he will pick up the message later I expect
08:21:30 <ttx> The other open task on that workstream is to collect more metrics/billing stories.
08:21:45 <ttx> A task we could just move to the new "Meaningful monitoring" workstream, as part of the "current status" collection
08:22:05 <ttx> Nothing new has been contributed there, so I will push back the action there
08:22:41 <ttx> #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
08:22:50 <ttx> Anything else on that topic?
08:23:26 <ttx> #topic Progress on "Scaling within one cluster" goal
08:23:30 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
08:23:41 <ttx> The work on middleware ping has landed, so we are left with oslo.metrics work...
08:23:50 <ttx> I did push a basic test framework for oslo.metrics at
08:23:53 <ttx> #link https://review.opendev.org/#/c/752262/
08:24:02 <ttx> I'll continue by trying to add meaningful tests
08:24:12 <ttx> (but need to dive deeper into the code first)
08:24:17 <ttx> masahito: any progress on posting your latest changes ?
08:24:46 <ttx> amorin: did you discuss testing it internally at OVH?
08:24:49 <masahito> sorry.... nothing now.
08:25:06 <ttx> That's ok, I'll just push back the action item
08:25:07 <amorin> ttx yes
08:25:10 <ttx> #action masahito to push latest patches to oslo.metrics
08:25:11 <amorin> I talked about it
08:25:22 <amorin> we have not tested it yet
08:25:30 <amorin> but we are considering doing it
08:25:30 <ttx> ok, great! Keep us posted
08:25:34 <amorin> yup
08:25:53 <ttx> Our second subtopic is around scaling stories on:
08:25:57 <ttx> #link https://etherpad.openstack.org/p/scaling-stories
08:26:04 <ttx> Nothing new posted there... our next action is the forum session at the summit in a month.
08:26:18 <ttx> Anything else on this topic?
08:26:39 <belmoreira> I can contribute to the forum session
08:26:45 <belmoreira> if there is space
08:27:14 <ttx> belmoreira: there is! The goal will be to create a lively conversation space to encourage others to share their own journey scaling up/out
08:27:37 <ttx> So it's good to have 2-3 people signed up to bootstrap the discussion
08:27:55 <belmoreira> great, a good topic would be moving from cells to regions because of scalability issues
08:27:56 <ttx> We have James Penick (or someone from his team) + Arnaud that have volunteered to help too
08:28:09 <amorin> belmoreira: nice topic indeed!
08:28:26 <amorin> very curious about that
08:28:34 <amorin> is it because of neutron?
08:28:58 <belmoreira> amorin of course :)
08:29:24 <ttx> Maybe we can organize the discussion along different stages. Like "first thing that fails as you add more nodes" (scaling up woes), then moving from cells to regions (scaling out woes)
08:30:10 <belmoreira> ttx yes, that makes sense
08:30:44 <ttx> I'm personally very interested in getting an order of magnitude for the number of nodes/activity in a single cluster, as it's the top question I'm being asked by new companies starting to scale up their deployments
08:31:10 <ttx> like "what should we expect? What should we watch?"
08:31:26 <amorin> we have input at OVH about that
08:31:43 <ttx> And I'm always like... "well, somewhere between 100 and 1000 physical servers, something will fail and force you to scale out"
08:31:48 <ttx> which is not super-helpful :)
08:32:12 <ttx> also I'm not even sure that 100-1000 range is current in Ussuri
08:32:16 <amorin> there are plenty of different things that could break
08:32:32 <amorin> and it depends also on what kind of usage you are doing with openstack
08:32:45 <amorin> that's why it's hard to answer I think
08:33:02 <ttx> amorin: right, type of activity will change the numbers completely
08:33:02 <belmoreira> and the risk that you are willing to take
08:33:34 <ttx> Also it's normal that "plenty of things can fail", since if there was only one thing, we would have handled that low hanging fruit already
08:34:03 <amorin> yup
08:34:32 <ttx> but still having guidance would be super helpful for the ops at that stage of their openstack life
08:34:39 <ttx> it can be fuzzy
08:34:58 <ttx> but just knowing that others have gone through that and survived... is good
08:35:44 <ttx> #topic PTG/Summit plans update
08:35:58 <ttx> So I did file a Forum session on scaling stories, which was just accepted/scheduled
08:36:05 * ttx checks when
08:36:37 <ttx> https://www.openstack.org/summit/2020/summit-schedule/events/24746/share-your-openstack-scaling-story
08:36:52 <ttx> Tuesday, October 20, 7:30am-8:15am CT (2:30pm - 3:15pm UTC)
08:36:53 <amorin> great
08:37:12 <ttx> hmm not CT, PT
08:37:39 <belmoreira> friendly time slot
08:37:51 <ttx> #info Forum session Tuesday, October 20, 7:30am-8:15am PT (2:30pm - 3:15pm UTC)
08:37:57 <ttx> yes, it's good for us
08:38:50 <ttx> Let me start an etherpad about it
08:38:58 <ttx> #link https://etherpad.opendev.org/p/w-forum-scaling-stories
08:41:25 <ttx> I dumped the base themes, feel free to add to that
08:41:54 <ttx> we only have 45 minutes so I would keep it to 4-5 questions max
08:42:50 <amorin> ack
08:43:31 <belmoreira> +1
08:43:44 <ttx> Alright let's refine that doc between now and the event
08:43:54 <ttx> Otherwise for PTG week we will have two one-hour sessions:
08:44:00 <ttx> #info PTG meeting Wednesday Oct 28 7UTC-8UTC and 16UTC-17UTC
08:44:08 <ttx> I suspect those will be approved but will let you know when I know more
08:44:31 <ttx> The idea will be to encourage people from the Forum session to follow up at the PTG hours
08:44:46 <ttx> (and then to sign them up to join the SIG meeting)
08:45:27 <ttx> We'll likely use both to introduce the various SIG workstreams and ask for input/help
08:45:42 <ttx> (both PTG hours)
08:45:57 <ttx> Any questions on that?
08:46:33 <ttx> ok then
08:46:34 <ttx> #topic Next meeting
08:46:43 <ttx> Next meeting will be US-EU-friendly on Oct 7, 16:00 UTC.
08:46:56 <ttx> Then we'll have summit and PTG weeks, and then go back to our regular cadence in November.
08:47:06 <ttx> #info next meeting: Oct 7, 16:00UTC
08:47:14 <ttx> Anything else before we close?
08:48:09 <ttx> Alright! Thanks everyone!
08:48:16 <ttx> #endmeeting