08:00:03 <ttx> #startmeeting large_scale_sig
08:00:04 <openstack> Meeting started Wed Jul 22 08:00:03 2020 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:07 <openstack> The meeting name has been set to 'large_scale_sig'
08:00:09 <ttx> #topic Rollcall
08:00:21 <ttx> Hi everyone! Who is here for the Large Scale SIG meeting ?
08:00:37 <ttx> Our agenda for today is at:
08:00:43 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
08:00:52 <ttx> (please feel free to add to it)
08:02:33 <ttx> amorin, dparkes: hi!
08:02:47 <t0rrant> good morning!
08:02:48 <dparkes> Hi, hello
08:02:55 <ttx> t0rrant: welcome
08:03:02 * t0rrant thanks
08:03:13 <ttx> waiting a minute for more people to join
08:04:11 <ttx> halali_: hi!
08:04:45 <ttx> OK let's start, I'm not sure we'll have more people, this is summer time
08:04:52 <ttx> #topic Welcome newcomers
08:05:09 <ttx> A few weeks ago we had a great Opendev virtual event around large scale deployments of open infrastructure
08:05:19 <ttx> As a result of that event we have a number of new people joining us today
08:05:31 <ttx> So I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in
08:05:51 <ttx> t0rrant: care to introduce yourself and what you're interested in?
08:06:00 <t0rrant> sure
08:07:49 <ttx> mdelavergne: hi!
08:07:56 <t0rrant> My name is Manuel Torrinha and I work with INESC-ID in Portugal. I collaborate with the IT services in a university setting which have a multi-region Openstack deployment. That said I'm here today to learn from your larger-scale examples overall and am specially interested in discussing different control plane architectures in order to improve our own
08:07:59 <mdelavergne> hi, sorry for being late
08:08:51 <ttx> t0rrant: welcome. I'm Thierry Carrez, VP of Engineering for the OpenStack Foundation, helping with running the SIG
08:09:16 <ttx> My interest is in getting operators of large scale deployments to collaborate together and share practices and tools
08:09:43 <ttx> dparkes: how about you?
08:09:49 <dparkes> yes, sure
08:11:49 <ttx> other regular members of the SIG are mdelavergne, amorin, masahito, belmoreira... but most of them are not around today
08:12:40 <ttx> dparkes: could you quickly introduce yourself?
08:12:47 <dparkes> Hi, I'm Daniel Parkes and work for services at Red Hat, I work with OSP, and have several large deployments that I work with were we fight daily with performance and operational issues, so I'm here to share experiences on what things I see break how we fix them so contribute to user stories, and share also your knowledge and discuss about all these topics
08:12:59 <ttx> nice!
08:13:50 <ttx> So when the SIG started, we tried to avoid boiling the ocean, and came up with two short-term objectives
08:14:08 <ttx> trying to keep it reasonable, as inflated expectations can quickly kill a group like this
08:14:26 <ttx> We'll now review progress on those two goals
08:14:50 <ttx> But obviously the SIG just goes where its members push it, so if there is a workstream you'd like to pursue, we are open
08:15:03 <ttx> halali_: are you around?
08:15:34 <ttx> I guess not, we can come back to intros if he shows up later
08:15:37 <ttx> #topic Progress on "Documenting large scale operations" goal
08:15:42 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
08:15:53 <ttx> So this is one of our current goals for the SIG - produce better documentation to help operators setting up large deployments.
08:15:59 <halali_> Hello, Sorry for that, was away from the desk, discuss something with team, will join later
08:16:07 <ttx> sure no pb :)
08:16:17 <ttx> In particular, this goal is around documenting better configuration values when you start hitting issues with default values
08:16:56 <ttx> If you look at https://etherpad.openstack.org/p/large-scale-sig-documentation you can see the various things we are pushing in that direction
08:17:07 <ttx> I'll push back amorin's TODO from last meeting since he is in vacation those days
08:17:13 <ttx> #action amorin to add some meat to the wiki page before we push the Nova doc patch further
08:17:25 <ttx> We had another work item on collecting metrics/billing stories
08:17:35 <ttx> That points to one critical activity of this SIG:
08:17:46 <ttx> It's all about sharing your experience operating large scale deployments of OpenStack
08:17:55 <ttx> so that we can derive best practices and/or fix common issues
08:18:30 <ttx> Only amorin contributed the story for OVH on the etherpad so far (line 34+), so please add to that
08:18:38 <ttx> if you have any experience with such setups
08:18:50 <ttx> I'll log an action item to everyone on that
08:18:59 <ttx> #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
08:19:10 <ttx> Finally, we had an action on discussing how to upstream osarchiver, OVH's internal tool for database cleanup
08:19:19 <ttx> amorin raised a thread about it:
08:19:22 <ttx> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015970.html
08:19:31 <ttx> I did reply as planned to discuss how to best land it:
08:19:37 <ttx> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015978.html
08:19:46 <ttx> Not much replies on that yet... so I'll escalate it to get a response
08:19:53 <ttx> #action ttx to escalate OSops revival thread for osarchiver hosting
08:20:11 <dparkes> Yes I have had interest in the osarchiver in the past, and did some testing, it would be a great addition
08:20:52 <ttx> That's it for status updates on this goal... did you have comments on this goal, does that sound like a good thing to pursue, any additional action you'd like to suggest in that area ?
08:21:26 <ttx> (the trick being, it's hard to push for best practices until we reach a critical mass of experience feedback)
08:21:56 <t0rrant> I wasn't aware of osarchiver but it looks like a very useful tool for large scale and even small scale deployments
08:22:34 <ttx> yes, we are trying to revive the "OSops" concept and land it there. OSops was an operator-led collection of small tools, with low bar to entry
08:22:37 <t0rrant> we don't use mistral for example, but I guess it goes through all the services
08:23:10 <ttx> ok, moving on to the other SIG goal...
08:23:16 <ttx> #topic Progress on "Scaling within one cluster" goal
08:23:20 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
08:23:34 <ttx> This is the other early goal of the SIG - identify, measure and push back common scaling limitations within one cluster
08:24:03 <ttx> To that effect we collect "scaling stories", like what happened when you started adding up nodes, what breaks first
08:24:28 <ttx> We collect the stories in https://etherpad.opendev.org/p/scaling-stories
08:24:50 <ttx> and then publish them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories for long-term storage
08:25:01 <ttx> One common issue when adding nodes is around RabbitMQ falling down
08:25:14 <ttx> And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them
08:25:39 <ttx> This resulted in https://opendev.org/openstack/oslo.metrics -- a PoC that we hope to turn into a full-fledged oslo library to do taht instrumentation
08:25:55 <ttx> Next step is to add basic tests, so that we are reasonably confident we do not introduce regressions
08:26:11 <ttx> Let me know if you are interested in looking into that
08:26:28 <t0rrant> one thing we are seeing in our deployment is timeouts on the DB side, and probably MQ bottlenecks. We have still metrics collection to set up to be sure, but I can make it fail with a not so large rally test
08:26:39 <t0rrant> that tool would be very helpful
08:27:09 <ttx> yes, and the idea is to expand it beyond oslo.messaging, to oslo,db, which would allow to capture those DB timeouts
08:27:18 <t0rrant> +1
08:27:45 <ttx> LINE used the feedback from that tool to efficiently shard its RabbitMQ setup
08:28:07 <ttx> allowing to push back the size of an individual cluster and not create too many of those
08:28:12 <dparkes> Because osprofiler doesn't have much traction these days?
08:28:35 <ttx> less traction, but also more scope
08:28:45 <ttx> this is more targeted
08:28:58 <dparkes> we sometimes find with the need of tracing to see where the issue is, so somethis like oslo.metrics would be great
08:29:03 <ttx> for me osprofiler is a dev tool, while this is an operational tool
08:29:24 <ttx> obviously there is overlap
08:29:45 <dparkes> ttx yes, something light, easy for ops, but that can give you and idea of where your spending your time
08:30:49 <ttx> dparkes: interested in your scaling (horror) stories. Not sure how much you can share from RH OSP customers, but even anonymized repotrs would be super-useful
08:32:07 <ttx> The idea being to identify what breaks first and focus on that, and gradually raise the number of hosts we can have in a given cluster
08:32:30 <ttx> OK, anything else on that topic?
08:33:05 <ttx> dparkes, t0rrant: do those two goals sound good to you? Or was there something completely different you were interested in pursuing?
08:34:15 <t0rrant> those goals seem reasonable to me, one thing I would like to discuss is advice on control plane architecture,m although I don't know if this meeting is the most appropriate :P
08:34:47 <dparkes> ttx yes I will go through the notes and try to add to the user stories, things that break ,etc
08:34:57 <ttx> dparkes: cool, thanks
08:35:15 <ttx> t0rrant: we rae not really at that level of detail yet, but we could come to it
08:35:27 <ttx> ok, moving on...
08:35:29 <t0rrant> sure thing
08:35:31 <ttx> #topic Discuss a US-friendly meeting time
08:35:38 <ttx> Following the Opendev event we have a couple more people interested in joining
08:35:45 <ttx> But they are based on US Pacific TZ, so our current meeting time is not working for them :)
08:35:53 <ttx> (1am)
08:36:15 <ttx> The SIG members are currently mostly in EU, with a couple in APAC
08:36:19 <ttx> Given the SIG membership, I was thinking we could alternate between a APAC-EU time and a EU-US time.
08:36:33 <ttx> For example, have next meeting in two weeks at 16utc, then in four weeks, back to 8utc
08:36:45 <ttx> Would that work for you all? Obviously just attend the meetings you can attend :)
08:36:58 <dparkes> yes, sounds fair
08:37:05 <t0rrant> looks like a good compromise yes
08:37:29 <ttx> Since the goal of the SIG is really to collect and share experiences, I feel like we'll maximize input that way
08:37:33 <mdelavergne> fine by me, but 16utc every time is also fine
08:37:53 <ttx> even if that will make my work communicating each meeting output a bit more critical :)
08:38:15 <ttx> I'll confirm with our US-based prospects that 16utc every 4 weeks on wednesdays is ok, and update the meeting info.
08:38:23 <ttx> #action ttx to set alternating US-EU / EU-APAC meetings
08:38:29 <ttx> #topic Next meeting
08:38:37 <ttx> So... I'll be taking time off in two weeks, at when would be our first US-EU meeting... So I'd rather move it off one week
08:38:48 <ttx> Like first US-EU meeting on August 12, 16utc, then next EU-APAC meeting on August 26, 8utc.
08:38:55 <ttx> How does that sound?
08:39:02 <t0rrant> sounds good to me
08:39:12 <ttx> We have reduced attendance in summer months anyway
08:39:14 <mdelavergne> yep, sounds good
08:39:40 <ttx> OK, I'll keep you posted on openstack-discuss mailing-list as always
08:39:57 <mdelavergne> thanks :)
08:40:03 <ttx> we are using the [largescale-sig] prefix for all things SIG-related
08:40:29 <ttx> I announce the meetings, and post summaries after the fact there
08:40:44 <ttx> #info next meetings: Aug 12, 16:00UTC, Aug 26, 8:00UTC
08:40:57 <ttx> #topic Open discussion
08:41:07 <ttx> Anything else you'd like to discuss today?
08:41:48 <ttx> Thanks again for joining the SIG and helping making OpenStack better!
08:42:04 <t0rrant> Thank you!
08:42:35 <ttx> Alright, let's close this. Thanks everyone for attending today
08:42:43 <mdelavergne> Welcome to the newcomers, and thanks everyone
08:42:54 <ttx> #endmeeting