16:00:07 <ttx> #startmeeting large_scale_sig
16:00:11 <openstack> Meeting started Wed Sep  9 16:00:07 2020 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:12 <ttx> #topic Rollcall
16:00:14 <openstack> The meeting name has been set to 'large_scale_sig'
16:00:22 <ttx> Who is here for the Large Scale SIG meeting ?
16:00:29 <mdelavergne> Hi o/
16:01:05 <ttx> I see penick is in the channel list
16:01:26 <amorin> hey!
16:01:40 <ttx> and amorin
16:01:43 * penick waves
16:02:14 <ttx> and eandersson
16:02:21 <eandersson> o/
16:02:30 <ttx> Alright, let's get started
16:02:39 <ttx> Our agenda for today is at:
16:02:41 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
16:02:46 <ttx> #topic Welcome newcomers
16:03:00 <ttx> Following the Opendev event on Large scale deployments, we had several people expressing interest in joining the SIG
16:03:07 <ttx> Several of them in the US and not interested in our original meeting time at 8utc
16:03:22 <ttx> So we decided some time ago to rotate between US+EU / APAC+EU times, as the majority of the group is from EU
16:03:32 <ttx> This is our second US+EU meet... but the first was not really successful at attracting new participants
16:03:44 <ttx> so I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in
16:03:56 <ttx> so we can shape the direction of the SIG accordingly
16:04:07 <ttx> penick: not sure you need any intro, but go
16:04:40 <penick> Heh, hey folks. I work for Verizon Media and act as the director/architect for our private cloud.
16:05:02 <ttx> Anything specific you're interested in within this SIG, beyond sharing your experience?
16:05:30 <penick> I'm here to offer some perspective and feedback on our needs as a large scale deployer, and learn other use cases. Ideally i'd like to align our upstream development efforts to help make the product better for large and small deployers
16:05:54 <ttx> great! You're in the right place
16:06:05 <ttx> eandersson: care to quickly introduce yourself?
16:06:12 <amorin> welcome
16:06:23 <eandersson> Sure!
16:07:46 <eandersson> I work at Blizzard Entertainment as a Technical Lead and I am responsible for the Private Cloud here.
16:07:54 <ttx> Anything specific you're interested in within this SIG, beyond sharing your experience?
16:08:05 <ttx> (yeah, i copy pasted that)
16:08:31 <mdelavergne> Welcome to the both of you :)
16:08:34 <eandersson> Very similar as penick, here to share perspective and provide feedback.
16:09:10 <ttx> Perfect. I'll let amorin and mdelavergne quickly introduce themselves... I'm working for the OSF, facilitating this group operations
16:09:34 <ttx> Not much of a first-hand experience on large scale deployments, lots of second hand accounts though :)
16:10:25 <ttx> Other regular SIG members include belmoreira of CERN fame, and masahito from LINE, whoi opefully is sleeping at this hour.
16:10:38 <ttx> hopefully*
16:10:40 <amorin> hey, iwork for ovh, i am mostly involved in deploying openstack for our public cloud offer
16:11:05 <mdelavergne> Hi, I'm a PhD student working on a way to geo-distribute applications, so I mostly do experiments on large-scale
16:11:07 <amorin> and sorry, i am on my phone :( not easy to type here
16:11:18 <ttx> heh
16:11:22 <penick> i'm glad to meet you all :)
16:11:50 <ttx> Alright, let's jump right in in our current workstreams... don't hesitate to interrupt me to ask questions
16:11:56 <ttx> #topic Progress on "Documenting large scale operations" goal
16:11:58 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
16:12:11 <ttx> So this is one of our current goals for the SIG - produce better documentation to help operators setting up large deployments.
16:12:23 <ttx> One work stream is around collecting articles, howto and tips & tricks around large scale published over the years
16:12:34 <ttx> You can add any you know of on that etherpad I just linked to
16:12:54 <ttx> Another workstream is around documenting better configuration values when you start hitting issues with default values
16:13:07 <ttx> amorin is leading that work at https://wiki.openstack.org/wiki/Large_Scale_Configuration_Guidelines but could use help
16:13:34 <ttx> Basically the default settings only carry you so far
16:13:54 <ttx> and if we can agree on commonly-tweaked settings at scale, we should find a way to document them
16:13:56 <amorin> yes i'd love to be able to push that forward, but i lack some time mostly
16:14:22 <ttx> We had another work item on collecting metrics/billing stories
16:14:28 <ttx> That points to one critical activity of this SIG:
16:14:38 <ttx> It's all about sharing your experience operating large scale deployments of OpenStack
16:14:44 <ttx> so that we can derive best practices and/or fix common issues
16:15:06 <ttx> Only amorin contributed the story for OVH on the etherpad, so please add to that if you have time
16:15:34 <ttx> Maybe there is no common pattern there, but my guess is there is, and it's not necessarily aligned with what upstream provides
16:15:54 <ttx> So I'll push again an action to contribute to that:
16:15:58 <ttx> #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
16:16:10 <ttx> And finally, we have a workstream on tooling, prompted by OVH's interest in pushing osarchiver upstream
16:16:25 <ttx> The current status there is that the "OSOps" effort is being revived, under the 'Operation Docs and Tooling' SIG
16:16:44 <ttx> belmoreira and I signed up to make sure that was moving forward, but smcginnis has been active leading it
16:16:55 <ttx> The setup is still in progress at https://review.opendev.org/#/c/749834/
16:16:59 <penick> I can ask my team to see if they can distill some of the useful configuration changes we've made. One trick is there's so many ways to deploy OpenStack that there are a lot of different contexts for when to use some settings vs others. For example, we focus on fewer,larger openstack deployments. With thousands of hypervisors per VM cluster and tens of thousands of baremetal nodes per baremetal cluster. So some things
16:16:59 <penick> we've done will not really apply to a large scale deployer who might prefer many smaller clusters.
16:17:48 <ttx> penick: I think the info will be useful anyway. You're right that in some cases there won't be a common practice and it will be all over the map
16:18:11 <amorin> agree, but still useful to know how people are doing scale
16:18:19 <ttx> But from some early discussions at the SIG it was apparent taht some of the pain points are still common
16:18:55 <ttx> To finish on osarchiver, once OSOps repo is set up we'll start working on pushing osarchiver in it
16:19:26 <ttx> Anything else on this topic? Any other direction you'd like this broad goal to take?
16:19:30 <amorin> great
16:19:56 <ttx> (the SIG mostly goes where its members push it, so if you're interested in something specific, please let the group know)
16:20:09 <ttx> Otherwise I'll keep on exposing current status
16:20:26 <ttx> #topic Progress on "Scaling within one cluster" goal
16:20:30 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
16:20:44 <ttx> This is the other initial goal of the SIG - identify, measure and push back common scaling limitations within one cluster
16:20:55 <ttx> Basically, as you add nodes to a single OpenStack cluster, at one point you reach scaling limits.
16:21:01 <ttx> How do we push that limit back, and reduce the need to create multiple clusters ?
16:21:19 <ttx> Again there may not be a common pattern, so...
16:21:26 <ttx> First task here is to collect those scaling stories. You have a large scale deployment, what happens if you add too many nodes ?
16:21:44 <ttx> We try to collect those stories at https://etherpad.openstack.org/p/scaling-stories
16:21:54 <ttx> and then move them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories once edited
16:22:00 <ttx> So please add to that if you have a story to share! (can be old)
16:22:02 <eandersson> Neutron has been a big pain point for us, with memory usage been excessive, to the point where we ended up running Neutron in VMs.
16:22:32 <ttx> Neutron failed first? Usually people point to Rabbit first, but maybe you optimized that already
16:22:51 <eandersson> Rabbit has been good to us, besides recovering after a crash.
16:23:31 <penick> eandersson: interesting, was neutron running in containers prior to moving to VMs?
16:23:51 <eandersson> Yea - we containeraized all of our deployments when moving to Rocky about ~2 years ago.
16:24:47 <ttx> amorin: you're not (yet) running components containerized, right?
16:25:47 <amorin> nop, not yet
16:25:53 <amorin> we are working on it
16:25:59 <penick> eandersson good to know, thanks! We're working on containerizing all of our stuff now.. I'll make sure my team is  aware
16:26:35 <ttx> So as I said, one common issue when adding nodes is around RabbitMQ falling down, even if I expect most people in this group to have moved past that
16:26:42 <ttx> And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them
16:26:46 <amorin> is your database and rabbit cluster running in containers as well?
16:26:54 <ttx> Based on what LINE used internally to solve that
16:27:09 <eandersson> Not yet.
16:27:18 <ttx> This resulted in https://opendev.org/openstack/oslo.metrics
16:27:32 <ttx> Next step there is to add basic tests, so that we are reasonably confident we do not introduce regressions
16:27:40 <ttx> I had an action item I did not finish to evaluate that, let me push it back
16:27:44 <ttx> #action ttx to look into a basic test framework for oslo.metrics
16:27:56 <ttx> Also masahito was planning to push the latest patches to it, but I haven't seen anything posted yet
16:28:00 <ttx> #action masahito to push latest patches to oslo.metrics
16:28:08 <ttx> amorin: did you check about applicability of oslo.metrics within OVH?
16:28:22 <amorin> not yet, still in my todo
16:28:30 <ttx> alright pushing that back too
16:28:33 <ttx> #action amorin to see if oslo.metrics could be tested at OVH
16:28:59 <ttx> Finally we recently helped with OVH's patch to add a basic ping to oslo.middleware
16:29:06 <ttx> That can be useful to monitor a rabbitMQ setup and detect weird failure cases
16:29:13 <ttx> (there were threads on the ML about it)
16:29:22 <ttx> Happy to report the patch finally landed at https://review.opendev.org/#/c/749834/
16:29:28 <ttx> And shipped in oslo.messaging 12.4.0!
16:29:34 <amorin> yay!
16:29:41 <mdelavergne> congrats!
16:30:20 <eandersson> I believe a lot of the critical recovery issues with RabbitMQ are fixed in Ussuri. Especially the excessive number of exchanges created by RPC calls that I believe caused all our recovery issues.
16:31:28 <ttx> eandersson: yes I think there were lots of improvements there
16:31:32 <ttx> Does that sound like a good goal for you penick and eandersson, or do you think we'll also fail to extract common patterns and low-hanging fruit improvements to raise scale
16:32:39 <penick> Sorry, not understanding.. Is what a good goal?
16:32:56 <ttx> oh the "Scaling within one cluster" goal
16:33:11 <ttx> Trying to push back when you need to scale out to multiple clusters
16:33:49 <eandersson> Yea - that is a good goal.
16:34:10 <penick> Ah, yeah. That's good for me.
16:34:26 <ttx> ok, moving on
16:34:32 <ttx> #topic PTG/Summit plans
16:34:56 <ttx> For the PTG we decided last meeting to ask for a PTG room around our usual meeting times, to serve as our regular meeting while recruiting potential new members
16:35:05 <ttx> So I requested Wednesday 7UTC-8UTC and 16UTC-17UTC (Oct 28)
16:35:44 <ttx> (first one is a bit early but there was no slot scheduled at our normal time)
16:35:54 <ttx> For the summit we discussed proposing one Forum session around collecting scaling stories, which I still have to file
16:36:21 <ttx> The idea is to get people to talk and we can document their story afterwards
16:36:31 <ttx> One learning from Opendev is that to get a virtual discussion going, it's good to prime the pump by having 2-3 people signed up to discuss the topic
16:36:45 <ttx> So... anyone interested in sharing their scaling story in the context of a Forum session?
16:37:26 <ttx> I just fear unless we start talking nobody will dare expose their case
16:37:36 <ttx> I can moderate but I don't have any first-hand scaling story to share
16:37:55 <amorin> i can maybe do a small one
16:38:04 <penick> I can share something, or ask someone from my team to
16:38:31 <ttx> great, thanks! I think that will help
16:39:24 <ttx> We'll try to actively promote that session in the openstack ops community, so hopefully it will work, both as a recruiting mechanism and a way to collect data
16:39:48 <amorin> ok
16:40:05 <ttx> #action ttx to file Scaling Stories forum session, with amorin and someone from penick's team to help get it off the ground
16:40:37 <ttx> #topic Next meeting
16:40:44 <ttx> If we continue on the same rhythm:
16:40:50 <ttx> Next meeting will be EU-APAC on Sept 23, 8utc.
16:40:56 <ttx> Then next US-EU meeting will be Oct 7, 16utc.
16:41:00 <ttx> How does that sound?
16:41:05 <ttx> Any objection to continuing to hold those on IRC?
16:41:07 <amorin> good
16:41:53 <eandersson> Sounds good to me.
16:41:56 <mdelavergne> it's fine
16:42:06 <ttx> Feel free to invite others from your teams to join if they can help, or fellow openstack ops you accidentally discuss with
16:42:10 <ttx> #info next meetings: Sep 23, 8:00UTC; Oct 7, 16:00UTC
16:42:34 <ttx> That is all I had, opening the floor for comments, questions, desires
16:42:41 <ttx> #topic Open discussion
16:42:53 <penick> 16utc is fine for me, i'm ok with either IRC or video chat
16:43:30 <ttx> Anything else this SIG should tackle? We are trying to have reasonable goals as we all have a lot of work besides the SIG
16:43:42 <ttx> and not sweat it if things go slow
16:44:24 <ttx> but still expose and share knowledge and tools that is distributed in the large deployment ops community
16:44:31 <eandersson> Meaningful monitoring of the control plane has always been difficult to me.
16:44:37 <penick> I'd be interested to learn how many large scale deployers have dedicated time for upstream contributions, and if there's any interest in collaborating on things.
16:44:58 <penick> For example, producing reference deployment documentation, similar to what we did for the edge deployments.
16:45:16 <eandersson> We focus heavily on anything that we think is critical and is currently lacking the support (e.g. our contributions to Senlin)
16:45:18 <penick> ah, meaningful monitoring is a good one
16:46:17 <eandersson> We don't necessarily have the time to dedicate someone to a project like Nova, Neutron just because of the sheer complexity of learning that. So we focus where we can make the most impact.
16:46:39 <eandersson> But we do contribute up smaller bug fixes to all projects (or report them at the very least) the moment we find them.
16:47:06 <ttx> I like the idea of meaningful monitoring. It really is in a sorry state
16:48:16 <ttx> I think the quest for exact metrics (usable for billing) has killed the "meaningful" out of our monitoring
16:48:47 <eandersson> Yep!
16:49:23 <eandersson> It's difficult to find a good balance.
16:49:57 <ttx> Maybe we should collect experiences on how people currently do monitoring, not even talking about metrics/billing
16:50:01 <eandersson> I would also like to see if someone has successfully enabled tracing.
16:51:01 <penick> splunk monitoring/search examples for troubleshooting problems might also be useful
16:51:34 <ttx> I'll give some thought on how to drive that, maybe we should just add a goal on "meaningful monitoring" (I like that termbut it might be a bit too ambitious for the group right now
16:51:37 <eandersson> I would like to know what monitoring is the first to alert you when something goes wrong (e.g. rabbit goes down, control node goes down)
16:51:49 <penick> also, relevant, i'm on two meetings at once here, but the other meeting is what we're going to do to get off of Ceilometer and Gnocchi to move to something active and supported
16:52:06 <ttx> relevant indeed
16:52:08 <eandersson> We have a lot synthetic monitoring scenarios running continuously
16:52:34 <eandersson> or scenario monitoring (e.g. a VM is created and tested every 5 minute)
16:52:38 <ttx> penick: would love to know where you end up
16:53:26 <ttx> OK that was a great first meeting y'all, but I need to run
16:53:28 <penick> I'll share what we come up with
16:53:34 <penick> cool, thanks y'all!
16:54:12 <ttx> So thanks everyone, let's continue the discussion next time you can drop by. I'll add a section to continue discussing meaningful monitoring in the EU+APAC meeting in two weeks
16:54:45 <ttx> #endmeeting