08:00:03 #startmeeting large_scale_sig
08:00:04 Meeting started Wed Jul 22 08:00:03 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:07 The meeting name has been set to 'large_scale_sig'
08:00:09 #topic Rollcall
08:00:21 Hi everyone! Who is here for the Large Scale SIG meeting?
08:00:37 Our agenda for today is at:
08:00:43 #link https://etherpad.openstack.org/p/large-scale-sig-meeting
08:00:52 (please feel free to add to it)
08:02:33 amorin, dparkes: hi!
08:02:47 good morning!
08:02:48 Hi, hello
08:02:55 t0rrant: welcome
08:03:02 * t0rrant thanks
08:03:13 waiting a minute for more people to join
08:04:11 halali_: hi!
08:04:45 OK let's start, I'm not sure we'll have more people, this is summer time
08:04:52 #topic Welcome newcomers
08:05:09 A few weeks ago we had a great OpenDev virtual event around large scale deployments of open infrastructure
08:05:19 As a result of that event we have a number of new people joining us today
08:05:31 So I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in
08:05:51 t0rrant: care to introduce yourself and what you're interested in?
08:06:00 sure
08:07:49 mdelavergne: hi!
08:07:56 My name is Manuel Torrinha and I work with INESC-ID in Portugal. I collaborate with the IT services of a university which has a multi-region OpenStack deployment. I'm here today to learn from your larger-scale examples overall, and I'm especially interested in discussing different control plane architectures in order to improve our own
08:07:59 hi, sorry for being late
08:08:51 t0rrant: welcome. I'm Thierry Carrez, VP of Engineering for the OpenStack Foundation, helping with running the SIG
08:09:16 My interest is in getting operators of large scale deployments to collaborate together and share practices and tools
08:09:43 dparkes: how about you?
08:09:49 yes, sure
08:11:49 other regular members of the SIG are mdelavergne, amorin, masahito, belmoreira... but most of them are not around today
08:12:40 dparkes: could you quickly introduce yourself?
08:12:47 Hi, I'm Daniel Parkes and I work in services at Red Hat. I work with OSP and several large deployments where we fight daily with performance and operational issues, so I'm here to share experiences on what I see break and how we fix it, contribute to user stories, and also learn from your knowledge and discuss all these topics
08:12:59 nice!
08:13:50 So when the SIG started, we tried to avoid boiling the ocean, and came up with two short-term objectives
08:14:08 trying to keep it reasonable, as inflated expectations can quickly kill a group like this
08:14:26 We'll now review progress on those two goals
08:14:50 But obviously the SIG just goes where its members push it, so if there is a workstream you'd like to pursue, we are open
08:15:03 halali_: are you around?
08:15:34 I guess not, we can come back to intros if he shows up later
08:15:37 #topic Progress on "Documenting large scale operations" goal
08:15:42 #link https://etherpad.openstack.org/p/large-scale-sig-documentation
08:15:53 So this is one of our current goals for the SIG - produce better documentation to help operators setting up large deployments.
08:15:59 Hello, sorry for that, I was away from the desk discussing something with the team, will join later
08:16:07 sure no pb :)
08:16:17 In particular, this goal is around documenting better configuration values for when you start hitting issues with the default values
08:16:56 If you look at https://etherpad.openstack.org/p/large-scale-sig-documentation you can see the various things we are pushing in that direction
08:17:07 I'll push back amorin's TODO from last meeting since he is on vacation these days
08:17:13 #action amorin to add some meat to the wiki page before we push the Nova doc patch further
08:17:25 We had another work item on collecting metrics/billing stories
08:17:35 That points to one critical activity of this SIG:
08:17:46 It's all about sharing your experience operating large scale deployments of OpenStack
08:17:55 so that we can derive best practices and/or fix common issues
08:18:30 Only amorin contributed the story for OVH on the etherpad so far (line 34+), so please add to that
08:18:38 if you have any experience with such setups
08:18:50 I'll log an action item to everyone on that
08:18:59 #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
08:19:10 Finally, we had an action on discussing how to upstream osarchiver, OVH's internal tool for database cleanup
08:19:19 amorin raised a thread about it:
08:19:22 #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015970.html
08:19:31 I did reply as planned to discuss how to best land it:
08:19:37 #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015978.html
08:19:46 Not many replies on that yet... so I'll escalate it to get a response
08:19:53 #action ttx to escalate OSops revival thread for osarchiver hosting
08:20:11 Yes, I have had interest in osarchiver in the past and did some testing, it would be a great addition
08:20:52 That's it for status updates on this goal... do you have comments on it, does it sound like a good thing to pursue, any additional action you'd like to suggest in that area?
08:21:26 (the trick being, it's hard to push for best practices until we reach a critical mass of experience feedback)
08:21:56 I wasn't aware of osarchiver but it looks like a very useful tool for large scale and even small scale deployments
08:22:34 yes, we are trying to revive the "OSops" concept and land it there. OSops was an operator-led collection of small tools, with a low bar to entry
08:22:37 we don't use Mistral for example, but I guess it goes through all the services
08:23:10 ok, moving on to the other SIG goal...
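[Editor's note: for readers unfamiliar with database cleanup tools like the osarchiver mentioned above, the following is a minimal sketch of the general idea, moving old soft-deleted rows out of the live tables, and is NOT osarchiver's actual code or CLI. The column names (deleted, deleted_at, id) and the shadow_* destination table follow a common OpenStack soft-delete convention but are assumptions here; many OpenStack services only soft-delete rows, so they accumulate over time.]

# A hypothetical illustration, not osarchiver itself.
import datetime

import sqlalchemy as sa

def archive_soft_deleted(engine, table_name, older_than_days=30, batch_size=1000):
    """Copy a batch of old soft-deleted rows to a shadow table, then delete them."""
    meta = sa.MetaData()
    live = sa.Table(table_name, meta, autoload_with=engine)
    shadow = sa.Table("shadow_" + table_name, meta, autoload_with=engine)
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=older_than_days)

    with engine.begin() as conn:
        # Select a bounded batch of rows soft-deleted before the cutoff.
        rows = conn.execute(
            sa.select(live)
            .where(live.c.deleted != 0, live.c.deleted_at < cutoff)
            .limit(batch_size)
        ).mappings().all()
        if not rows:
            return 0
        # Copy to the shadow table, then remove from the live table by primary key.
        conn.execute(shadow.insert(), [dict(r) for r in rows])
        ids = [r["id"] for r in rows]
        conn.execute(live.delete().where(live.c.id.in_(ids)))
    return len(rows)

[The actual tool and its upstreaming are discussed in the mailing-list thread linked above; this sketch only illustrates why such cleanup matters at scale.]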
08:23:16 #topic Progress on "Scaling within one cluster" goal
08:23:20 #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
08:23:34 This is the other early goal of the SIG - identify, measure and push back common scaling limitations within one cluster
08:24:03 To that effect we collect "scaling stories", like what happened when you started adding nodes, what breaks first
08:24:28 We collect the stories in https://etherpad.opendev.org/p/scaling-stories
08:24:50 and then publish them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories for long-term storage
08:25:01 One common issue when adding nodes is RabbitMQ falling down
08:25:14 And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them
08:25:39 This resulted in https://opendev.org/openstack/oslo.metrics -- a PoC that we hope to turn into a full-fledged oslo library to do that instrumentation
08:25:55 Next step is to add basic tests, so that we are reasonably confident we do not introduce regressions
08:26:11 Let me know if you are interested in looking into that
08:26:28 one thing we are seeing in our deployment is timeouts on the DB side, and probably MQ bottlenecks. We still have metrics collection to set up to be sure, but I can make it fail with a not-so-large Rally test
08:26:39 that tool would be very helpful
08:27:09 yes, and the idea is to expand it beyond oslo.messaging, to oslo.db, which would allow capturing those DB timeouts
08:27:18 +1
08:27:45 LINE used the feedback from that tool to efficiently shard its RabbitMQ setup
08:28:07 allowing them to push back the size limit of an individual cluster and not create too many of them
08:28:12 Because osprofiler doesn't have much traction these days?
08:28:35 less traction, but also more scope
08:28:45 this is more targeted
08:28:58 we sometimes find ourselves needing tracing to see where the issue is, so something like oslo.metrics would be great
08:29:03 for me osprofiler is a dev tool, while this is an operational tool
08:29:24 obviously there is overlap
08:29:45 ttx: yes, something light, easy for ops, but that can give you an idea of where you're spending your time
08:30:49 dparkes: interested in your scaling (horror) stories. Not sure how much you can share from RH OSP customers, but even anonymized reports would be super-useful
08:32:07 The idea being to identify what breaks first and focus on that, and gradually raise the number of hosts we can have in a given cluster
08:32:30 OK, anything else on that topic?
08:33:05 dparkes, t0rrant: do those two goals sound good to you? Or was there something completely different you were interested in pursuing?
08:34:15 those goals seem reasonable to me, one thing I would like to discuss is advice on control plane architecture, although I don't know if this meeting is the most appropriate :P
08:34:47 ttx: yes I will go through the notes and try to add to the user stories, things that break, etc.
08:34:57 dparkes: cool, thanks
08:35:15 t0rrant: we are not really at that level of detail yet, but we could come to it
08:35:27 ok, moving on...
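[Editor's note: the following is a minimal illustration of the kind of operational instrumentation discussed above, timing RPC-style calls and exposing them as scrapeable metrics. It is NOT the actual oslo.metrics API; the metric and label names are invented for the example, and prometheus_client is assumed to be available. The real oslo.metrics work instruments oslo.messaging itself rather than relying on per-call decorators like this sketch.]

# A hypothetical illustration, not oslo.metrics itself.
import functools
import time

from prometheus_client import Histogram, start_http_server

RPC_LATENCY = Histogram(
    "rpc_client_call_seconds",
    "Latency of RPC client calls, by target method",
    ["method"],
)

def timed_rpc(method_name):
    """Decorator that records the duration of an RPC-style call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                RPC_LATENCY.labels(method=method_name).observe(
                    time.monotonic() - start)
        return wrapper
    return decorator

if __name__ == "__main__":
    # Expose metrics on :8000/metrics so an operator can scrape them.
    start_http_server(8000)

    @timed_rpc("dummy_call")
    def dummy_call():
        time.sleep(0.05)  # stand-in for a real messaging round-trip

    for _ in range(10):
        dummy_call()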
08:35:29 sure thing
08:35:31 #topic Discuss a US-friendly meeting time
08:35:38 Following the OpenDev event we have a couple more people interested in joining
08:35:45 But they are based in the US Pacific TZ, so our current meeting time is not working for them :)
08:35:53 (1am)
08:36:15 The SIG members are currently mostly in the EU, with a couple in APAC
08:36:19 Given the SIG membership, I was thinking we could alternate between an APAC-EU time and an EU-US time.
08:36:33 For example, have the next meeting in two weeks at 16utc, then in four weeks, back to 8utc
08:36:45 Would that work for you all? Obviously just attend the meetings you can attend :)
08:36:58 yes, sounds fair
08:37:05 looks like a good compromise yes
08:37:29 Since the goal of the SIG is really to collect and share experiences, I feel like we'll maximize input that way
08:37:33 fine by me, but 16utc every time is also fine
08:37:53 even if that will make my work communicating each meeting's output a bit more critical :)
08:38:15 I'll confirm with our US-based prospects that 16utc every 4 weeks on Wednesdays is OK, and update the meeting info.
08:38:23 #action ttx to set alternating US-EU / EU-APAC meetings
08:38:29 #topic Next meeting
08:38:37 So... I'll be taking time off in two weeks, which would be our first US-EU meeting... so I'd rather push it back one week
08:38:48 Like first US-EU meeting on August 12, 16utc, then next EU-APAC meeting on August 26, 8utc.
08:38:55 How does that sound?
08:39:02 sounds good to me
08:39:12 We have reduced attendance in summer months anyway
08:39:14 yep, sounds good
08:39:40 OK, I'll keep you posted on the openstack-discuss mailing-list as always
08:39:57 thanks :)
08:40:03 we are using the [largescale-sig] prefix for all things SIG-related
08:40:29 I announce the meetings, and post summaries after the fact there
08:40:44 #info next meetings: Aug 12, 16:00UTC, Aug 26, 8:00UTC
08:40:57 #topic Open discussion
08:41:07 Anything else you'd like to discuss today?
08:41:48 Thanks again for joining the SIG and helping make OpenStack better!
08:42:04 Thank you!
08:42:35 Alright, let's close this. Thanks everyone for attending today
08:42:43 Welcome to the newcomers, and thanks everyone
08:42:54 #endmeeting