08:00:03 <ttx> #startmeeting large_scale_sig
08:00:04 <openstack> Meeting started Wed Jul 22 08:00:03 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:07 <openstack> The meeting name has been set to 'large_scale_sig'
08:00:09 <ttx> #topic Rollcall
08:00:21 <ttx> Hi everyone! Who is here for the Large Scale SIG meeting?
08:00:37 <ttx> Our agenda for today is at:
08:00:43 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
08:00:52 <ttx> (please feel free to add to it)
08:02:33 <ttx> amorin, dparkes: hi!
08:02:47 <t0rrant> good morning!
08:02:48 <dparkes> Hi, hello
08:02:55 <ttx> t0rrant: welcome
08:03:02 * t0rrant thanks
08:03:13 <ttx> waiting a minute for more people to join
08:04:11 <ttx> halali_: hi!
08:04:45 <ttx> OK let's start, I'm not sure we'll have more people, this is summer time
08:04:52 <ttx> #topic Welcome newcomers
08:05:09 <ttx> A few weeks ago we had a great OpenDev virtual event around large scale deployments of open infrastructure
08:05:19 <ttx> As a result of that event we have a number of new people joining us today
08:05:31 <ttx> So I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in
08:05:51 <ttx> t0rrant: care to introduce yourself and what you're interested in?
08:06:00 <t0rrant> sure
08:07:49 <ttx> mdelavergne: hi!
08:07:56 <t0rrant> My name is Manuel Torrinha and I work with INESC-ID in Portugal. I collaborate with the IT services in a university setting which has a multi-region OpenStack deployment. That said, I'm here today to learn from your larger-scale examples overall, and am especially interested in discussing different control plane architectures in order to improve our own
08:07:59 <mdelavergne> hi, sorry for being late
08:08:51 <ttx> t0rrant: welcome.
I'm Thierry Carrez, VP of Engineering for the OpenStack Foundation, helping with running the SIG
08:09:16 <ttx> My interest is in getting operators of large scale deployments to collaborate together and share practices and tools
08:09:43 <ttx> dparkes: how about you?
08:09:49 <dparkes> yes, sure
08:11:49 <ttx> other regular members of the SIG are mdelavergne, amorin, masahito, belmoreira... but most of them are not around today
08:12:40 <ttx> dparkes: could you quickly introduce yourself?
08:12:47 <dparkes> Hi, I'm Daniel Parkes and I work in services at Red Hat. I work with OSP and have several large deployments where we fight daily with performance and operational issues, so I'm here to share experiences on what I see break and how we fix it, to contribute to user stories, and also to learn from your knowledge and discuss all these topics
08:12:59 <ttx> nice!
08:13:50 <ttx> So when the SIG started, we tried to avoid boiling the ocean, and came up with two short-term objectives
08:14:08 <ttx> trying to keep it reasonable, as inflated expectations can quickly kill a group like this
08:14:26 <ttx> We'll now review progress on those two goals
08:14:50 <ttx> But obviously the SIG just goes where its members push it, so if there is a workstream you'd like to pursue, we are open
08:15:03 <ttx> halali_: are you around?
08:15:34 <ttx> I guess not, we can come back to intros if he shows up later
08:15:37 <ttx> #topic Progress on "Documenting large scale operations" goal
08:15:42 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
08:15:53 <ttx> So this is one of our current goals for the SIG - produce better documentation to help operators setting up large deployments.
08:15:59 <halali_> Hello, sorry for that, I was away from the desk discussing something with the team, will join later
08:16:07 <ttx> sure, no pb :)
08:16:17 <ttx> In particular, this goal is around documenting better configuration values for when you start hitting issues with default values
08:16:56 <ttx> If you look at https://etherpad.openstack.org/p/large-scale-sig-documentation you can see the various things we are pushing in that direction
08:17:07 <ttx> I'll push back amorin's TODO from last meeting since he is on vacation these days
08:17:13 <ttx> #action amorin to add some meat to the wiki page before we push the Nova doc patch further
08:17:25 <ttx> We had another work item on collecting metrics/billing stories
08:17:35 <ttx> That points to one critical activity of this SIG:
08:17:46 <ttx> It's all about sharing your experience operating large scale deployments of OpenStack
08:17:55 <ttx> so that we can derive best practices and/or fix common issues
08:18:30 <ttx> Only amorin contributed the story for OVH on the etherpad so far (line 34+), so please add to that
08:18:38 <ttx> if you have any experience with such setups
08:18:50 <ttx> I'll log an action item to everyone on that
08:18:59 <ttx> #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
08:19:10 <ttx> Finally, we had an action on discussing how to upstream osarchiver, OVH's internal tool for database cleanup
08:19:19 <ttx> amorin raised a thread about it:
08:19:22 <ttx> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015970.html
08:19:31 <ttx> I did reply as planned to discuss how to best land it:
08:19:37 <ttx> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015978.html
08:19:46 <ttx> Not many replies on that yet...
so I'll escalate it to get a response
08:19:53 <ttx> #action ttx to escalate OSops revival thread for osarchiver hosting
08:20:11 <dparkes> Yes, I have had interest in osarchiver in the past and did some testing, it would be a great addition
08:20:52 <ttx> That's it for status updates on this goal... did you have comments on this goal, does that sound like a good thing to pursue, any additional action you'd like to suggest in that area?
08:21:26 <ttx> (the trick being, it's hard to push for best practices until we reach a critical mass of experience feedback)
08:21:56 <t0rrant> I wasn't aware of osarchiver but it looks like a very useful tool for large scale and even small scale deployments
08:22:34 <ttx> yes, we are trying to revive the "OSops" concept and land it there. OSops was an operator-led collection of small tools, with a low bar to entry
08:22:37 <t0rrant> we don't use mistral for example, but I guess it goes through all the services
08:23:10 <ttx> ok, moving on to the other SIG goal...
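[Editor's note: for readers unfamiliar with osarchiver, the general pattern it addresses is cleaning up soft-deleted rows that accumulate in OpenStack service databases (similar in spirit to `nova-manage db archive_deleted_rows`). The sketch below is a hypothetical illustration of that pattern only — the table names, schema, and retention logic are illustrative assumptions, not osarchiver's actual code.]

```python
import sqlite3
from datetime import datetime, timedelta

# Toy schema standing in for an OpenStack service database: soft-deleted
# rows keep a non-zero `deleted` flag and a `deleted_at` timestamp.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE instances (id INTEGER, deleted INTEGER, deleted_at TEXT)")
db.execute("CREATE TABLE shadow_instances (id INTEGER, deleted INTEGER, deleted_at TEXT)")

now = datetime(2020, 7, 22)
db.executemany("INSERT INTO instances VALUES (?, ?, ?)", [
    (1, 0, None),                                    # live row, must be kept
    (2, 2, (now - timedelta(days=90)).isoformat()),  # old soft-deleted row
    (3, 3, (now - timedelta(days=5)).isoformat()),   # recent soft-deleted row
])

def archive_soft_deleted(db, table, retention_days, now):
    """Copy soft-deleted rows older than the retention window into a
    shadow table, then remove them from the live table."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    where = "deleted != 0 AND deleted_at < ?"
    db.execute(f"INSERT INTO shadow_{table} SELECT * FROM {table} WHERE {where}", (cutoff,))
    db.execute(f"DELETE FROM {table} WHERE {where}", (cutoff,))
    db.commit()

archive_soft_deleted(db, "instances", retention_days=30, now=now)
print(db.execute("SELECT id FROM instances ORDER BY id").fetchall())  # [(1,), (3,)]
print(db.execute("SELECT id FROM shadow_instances").fetchall())       # [(2,)]
```

Archiving to a shadow table before deleting keeps the live tables small (and queries fast) without discarding the audit trail outright.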
08:23:16 <ttx> #topic Progress on "Scaling within one cluster" goal
08:23:20 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
08:23:34 <ttx> This is the other early goal of the SIG - identify, measure and push back common scaling limitations within one cluster
08:24:03 <ttx> To that effect we collect "scaling stories", like what happened when you started adding nodes, what breaks first
08:24:28 <ttx> We collect the stories in https://etherpad.opendev.org/p/scaling-stories
08:24:50 <ttx> and then publish them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories for long-term storage
08:25:01 <ttx> One common issue when adding nodes is RabbitMQ falling down
08:25:14 <ttx> And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them
08:25:39 <ttx> This resulted in https://opendev.org/openstack/oslo.metrics -- a PoC that we hope to turn into a full-fledged oslo library to do that instrumentation
08:25:55 <ttx> Next step is to add basic tests, so that we are reasonably confident we do not introduce regressions
08:26:11 <ttx> Let me know if you are interested in looking into that
08:26:28 <t0rrant> one thing we are seeing in our deployment is timeouts on the DB side, and probably MQ bottlenecks. We still have metrics collection to set up to be sure, but I can make it fail with a not-so-large rally test
08:26:39 <t0rrant> that tool would be very helpful
08:27:09 <ttx> yes, and the idea is to expand it beyond oslo.messaging to oslo.db, which would allow us to capture those DB timeouts
08:27:18 <t0rrant> +1
08:27:45 <ttx> LINE used the feedback from that tool to efficiently shard its RabbitMQ setup
08:28:07 <ttx> allowing them to push back the size of an individual cluster and not create too many of those
08:28:12 <dparkes> Because osprofiler doesn't have much traction these days?
08:28:35 <ttx> less traction, but also more scope
08:28:45 <ttx> this is more targeted
08:28:58 <dparkes> we sometimes find ourselves needing tracing to see where the issue is, so something like oslo.metrics would be great
08:29:03 <ttx> for me osprofiler is a dev tool, while this is an operational tool
08:29:24 <ttx> obviously there is overlap
08:29:45 <dparkes> ttx yes, something light, easy for ops, but that can give you an idea of where you're spending your time
08:30:49 <ttx> dparkes: interested in your scaling (horror) stories. Not sure how much you can share from RH OSP customers, but even anonymized reports would be super-useful
08:32:07 <ttx> The idea being to identify what breaks first and focus on that, and gradually raise the number of hosts we can have in a given cluster
08:32:30 <ttx> OK, anything else on that topic?
08:33:05 <ttx> dparkes, t0rrant: do those two goals sound good to you? Or was there something completely different you were interested in pursuing?
08:34:15 <t0rrant> those goals seem reasonable to me, one thing I would like to discuss is advice on control plane architecture, although I don't know if this meeting is the most appropriate :P
08:34:47 <dparkes> ttx yes I will go through the notes and try to add to the user stories, things that break, etc
08:34:57 <ttx> dparkes: cool, thanks
08:35:15 <ttx> t0rrant: we are not really at that level of detail yet, but we could come to it
08:35:27 <ttx> ok, moving on...
08:35:29 <t0rrant> sure thing
08:35:31 <ttx> #topic Discuss a US-friendly meeting time
08:35:38 <ttx> Following the OpenDev event we have a couple more people interested in joining
08:35:45 <ttx> But they are based in US Pacific TZ, so our current meeting time is not working for them :)
08:35:53 <ttx> (1am)
08:36:15 <ttx> The SIG members are currently mostly in EU, with a couple in APAC
08:36:19 <ttx> Given the SIG membership, I was thinking we could alternate between an APAC-EU time and an EU-US time.
08:36:33 <ttx> For example, have the next meeting in two weeks at 16utc, then in four weeks, back to 8utc
08:36:45 <ttx> Would that work for you all? Obviously just attend the meetings you can attend :)
08:36:58 <dparkes> yes, sounds fair
08:37:05 <t0rrant> looks like a good compromise, yes
08:37:29 <ttx> Since the goal of the SIG is really to collect and share experiences, I feel like we'll maximize input that way
08:37:33 <mdelavergne> fine by me, but 16utc every time is also fine
08:37:53 <ttx> even if that will make my work communicating each meeting's output a bit more critical :)
08:38:15 <ttx> I'll confirm with our US-based prospects that 16utc every 4 weeks on Wednesdays is ok, and update the meeting info.
08:38:23 <ttx> #action ttx to set alternating US-EU / EU-APAC meetings
08:38:29 <ttx> #topic Next meeting
08:38:37 <ttx> So... I'll be taking time off in two weeks, which would be our first US-EU meeting... so I'd rather move it off by one week
08:38:48 <ttx> Like first US-EU meeting on August 12, 16utc, then next EU-APAC meeting on August 26, 8utc.
08:38:55 <ttx> How does that sound?
08:39:02 <t0rrant> sounds good to me
08:39:12 <ttx> We have reduced attendance in summer months anyway
08:39:14 <mdelavergne> yep, sounds good
08:39:40 <ttx> OK, I'll keep you posted on the openstack-discuss mailing list as always
08:39:57 <mdelavergne> thanks :)
08:40:03 <ttx> we are using the [largescale-sig] prefix for all things SIG-related
08:40:29 <ttx> I announce the meetings, and post summaries after the fact there
08:40:44 <ttx> #info next meetings: Aug 12, 16:00UTC; Aug 26, 8:00UTC
08:40:57 <ttx> #topic Open discussion
08:41:07 <ttx> Anything else you'd like to discuss today?
08:41:48 <ttx> Thanks again for joining the SIG and helping make OpenStack better!
08:42:04 <t0rrant> Thank you!
08:42:35 <ttx> Alright, let's close this. Thanks everyone for attending today
08:42:43 <mdelavergne> Welcome to the newcomers, and thanks everyone
08:42:54 <ttx> #endmeeting
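[Editor's note: the oslo.messaging instrumentation discussed under the "Scaling within one cluster" topic boils down to wrapping each RPC call so that call counts and latencies can be aggregated and exported to a monitoring system. The snippet below is a hypothetical illustration of that general idea only — the class and function names are invented for the example and do not reflect oslo.metrics' actual API.]

```python
import time
from collections import defaultdict

class RpcMetrics:
    """Record a call count and cumulative latency per RPC method,
    so an exporter process could later expose the aggregates."""
    def __init__(self):
        self.calls = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def timed_call(self, method_name, func, *args, **kwargs):
        # Wrap the call; update counters even if the RPC raises.
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            self.calls[method_name] += 1
            self.total_seconds[method_name] += time.monotonic() - start

metrics = RpcMetrics()

def fake_rpc_build_instance(instance_id):
    # Hypothetical stand-in for a real oslo.messaging RPC client call.
    return f"built {instance_id}"

result = metrics.timed_call("build_instance", fake_rpc_build_instance, 42)
print(result)                           # built 42
print(metrics.calls["build_instance"])  # 1
```

Per-method aggregates like these are what make it possible to see which services and calls load the message bus — the kind of feedback LINE reportedly used to decide how to shard its RabbitMQ setup.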