16:00:07 <ttx> #startmeeting large_scale_sig
16:00:11 <openstack> Meeting started Wed Sep 9 16:00:07 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:12 <ttx> #topic Rollcall
16:00:14 <openstack> The meeting name has been set to 'large_scale_sig'
16:00:22 <ttx> Who is here for the Large Scale SIG meeting?
16:00:29 <mdelavergne> Hi o/
16:01:05 <ttx> I see penick is in the channel list
16:01:26 <amorin> hey!
16:01:40 <ttx> and amorin
16:01:43 * penick waves
16:02:14 <ttx> and eandersson
16:02:21 <eandersson> o/
16:02:30 <ttx> Alright, let's get started
16:02:39 <ttx> Our agenda for today is at:
16:02:41 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
16:02:46 <ttx> #topic Welcome newcomers
16:03:00 <ttx> Following the OpenDev event on large scale deployments, we had several people expressing interest in joining the SIG
16:03:07 <ttx> Several of them are in the US and not interested in our original meeting time at 8 UTC
16:03:22 <ttx> So we decided some time ago to rotate between US+EU / APAC+EU times, as the majority of the group is from the EU
16:03:32 <ttx> This is our second US+EU meeting... but the first was not really successful at attracting new participants
16:03:44 <ttx> so I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in
16:03:56 <ttx> so we can shape the direction of the SIG accordingly
16:04:07 <ttx> penick: not sure you need any intro, but go
16:04:40 <penick> Heh, hey folks. I work for Verizon Media and act as the director/architect for our private cloud.
16:05:02 <ttx> Anything specific you're interested in within this SIG, beyond sharing your experience?
16:05:30 <penick> I'm here to offer some perspective and feedback on our needs as a large scale deployer, and to learn about other use cases. Ideally I'd like to align our upstream development efforts to help make the product better for large and small deployers
16:05:54 <ttx> great! You're in the right place
16:06:05 <ttx> eandersson: care to quickly introduce yourself?
16:06:12 <amorin> welcome
16:06:23 <eandersson> Sure!
16:07:46 <eandersson> I work at Blizzard Entertainment as a Technical Lead and I am responsible for the private cloud here.
16:07:54 <ttx> Anything specific you're interested in within this SIG, beyond sharing your experience?
16:08:05 <ttx> (yeah, I copy-pasted that)
16:08:31 <mdelavergne> Welcome to the both of you :)
16:08:34 <eandersson> Very similar to penick, here to share perspective and provide feedback.
16:09:10 <ttx> Perfect. I'll let amorin and mdelavergne quickly introduce themselves... I'm working for the OSF, facilitating this group's operations
16:09:34 <ttx> Not much first-hand experience with large scale deployments, lots of second-hand accounts though :)
16:10:25 <ttx> Other regular SIG members include belmoreira of CERN fame, and masahito from LINE, who hopefully is sleeping at this hour.
16:10:40 <amorin> hey, I work for OVH, I am mostly involved in deploying OpenStack for our public cloud offering
16:11:05 <mdelavergne> Hi, I'm a PhD student working on a way to geo-distribute applications, so I mostly do experiments at large scale
16:11:07 <amorin> and sorry, I am on my phone :( not easy to type here
16:11:18 <ttx> heh
16:11:22 <penick> I'm glad to meet you all :)
16:11:50 <ttx> Alright, let's jump right into our current workstreams... don't hesitate to interrupt me to ask questions
16:11:56 <ttx> #topic Progress on "Documenting large scale operations" goal
16:11:58 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
16:12:11 <ttx> So this is one of our current goals for the SIG - produce better documentation to help operators set up large deployments.
16:12:23 <ttx> One workstream is around collecting articles, howtos and tips & tricks around large scale published over the years
16:12:34 <ttx> You can add any you know of on that etherpad I just linked to
16:12:54 <ttx> Another workstream is around documenting better configuration values for when you start hitting issues with the default values
16:13:07 <ttx> amorin is leading that work at https://wiki.openstack.org/wiki/Large_Scale_Configuration_Guidelines but could use help
16:13:34 <ttx> Basically the default settings only carry you so far
16:13:54 <ttx> and if we can agree on commonly-tweaked settings at scale, we should find a way to document them
16:13:56 <amorin> yes, I'd love to be able to push that forward, but I mostly lack the time
16:14:22 <ttx> We had another work item on collecting metrics/billing stories
16:14:28 <ttx> That points to one critical activity of this SIG:
16:14:38 <ttx> It's all about sharing your experience operating large scale deployments of OpenStack
16:14:44 <ttx> so that we can derive best practices and/or fix common issues
16:15:06 <ttx> Only amorin contributed a story, for OVH, on the etherpad, so please add to that if you have time
16:15:34 <ttx> Maybe there is no common pattern there, but my guess is there is, and it's not necessarily aligned with what upstream provides
16:15:54 <ttx> So I'll push again an action to contribute to that:
16:15:58 <ttx> #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation
16:16:10 <ttx> And finally, we have a workstream on tooling, prompted by OVH's interest in pushing osarchiver upstream
16:16:25 <ttx> The current status there is that the "OSOps" effort is being revived, under the 'Operation Docs and Tooling' SIG
16:16:44 <ttx> belmoreira and I signed up to make sure that was moving forward, but smcginnis has been active leading it
16:16:55 <ttx> The setup is still in progress at https://review.opendev.org/#/c/749834/
16:16:59 <penick> I can ask my team to see if they can distill some of the useful configuration changes we've made. One trick is that there are so many ways to deploy OpenStack that there are a lot of different contexts for when to use some settings vs others. For example, we focus on fewer, larger OpenStack deployments, with thousands of hypervisors per VM cluster and tens of thousands of baremetal nodes per baremetal cluster. So some things
16:16:59 <penick> we've done will not really apply to a large scale deployer who might prefer many smaller clusters.
16:17:48 <ttx> penick: I think the info will be useful anyway. You're right that in some cases there won't be a common practice and it will be all over the map
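
For illustration, the kind of commonly tweaked options the configuration guidelines aim to collect might look like the sketch below. The option names are existing Nova/Neutron settings, but the values are illustrative assumptions for a larger deployment, not recommendations:

    # nova.conf -- illustrative values only
    [DEFAULT]
    # the default 60s RPC timeout is often too short on busy control planes
    rpc_response_timeout = 180

    [database]
    # allow a larger DB connection pool on busy API/conductor nodes
    max_pool_size = 50

    # neutron.conf -- illustrative values only
    [DEFAULT]
    # scale API/RPC workers with the controller's cores
    api_workers = 16
    rpc_workers = 16
    # tolerate slower agent heartbeats at scale
    agent_down_time = 150

    [agent]
    # must stay well below agent_down_time
    report_interval = 50
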
16:18:11 <amorin> agree, but still useful to know how people are doing scale
16:18:19 <ttx> But from some early discussions at the SIG it was apparent that some of the pain points are still common
16:18:55 <ttx> To finish on osarchiver, once the OSOps repo is set up we'll start working on pushing osarchiver into it
16:19:26 <ttx> Anything else on this topic? Any other direction you'd like this broad goal to take?
16:19:30 <amorin> great
16:19:56 <ttx> (the SIG mostly goes where its members push it, so if you're interested in something specific, please let the group know)
16:20:09 <ttx> Otherwise I'll keep on exposing current status
16:20:26 <ttx> #topic Progress on "Scaling within one cluster" goal
16:20:30 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
16:20:44 <ttx> This is the other initial goal of the SIG - identify, measure and push back common scaling limitations within one cluster
16:20:55 <ttx> Basically, as you add nodes to a single OpenStack cluster, at some point you reach scaling limits.
16:21:01 <ttx> How do we push that limit back, and reduce the need to create multiple clusters?
16:21:19 <ttx> Again there may not be a common pattern, so...
16:21:26 <ttx> First task here is to collect those scaling stories. You have a large scale deployment, what happens if you add too many nodes?
16:21:44 <ttx> We try to collect those stories at https://etherpad.openstack.org/p/scaling-stories
16:21:54 <ttx> and then move them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories once edited
16:22:00 <ttx> So please add to that if you have a story to share! (can be old)
16:22:02 <eandersson> Neutron has been a big pain point for us, with memory usage being excessive, to the point where we ended up running Neutron in VMs.
16:22:32 <ttx> Neutron failed first? Usually people point to Rabbit first, but maybe you optimized that already
16:22:51 <eandersson> Rabbit has been good to us, besides recovering after a crash.
16:23:31 <penick> eandersson: interesting, was neutron running in containers prior to moving to VMs?
16:23:51 <eandersson> Yea - we containerized all of our deployments when moving to Rocky about 2 years ago.
16:24:47 <ttx> amorin: you're not (yet) running components containerized, right?
16:25:47 <amorin> nope, not yet
16:25:53 <amorin> we are working on it
16:25:59 <penick> eandersson: good to know, thanks! We're working on containerizing all of our stuff now.. I'll make sure my team is aware
16:26:35 <ttx> So as I said, one common issue when adding nodes is around RabbitMQ falling down, even if I expect most people in this group to have moved past that
16:26:42 <ttx> And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them
16:26:46 <amorin> is your database and rabbit cluster running in containers as well?
16:26:54 <ttx> Based on what LINE used internally to solve that
16:27:09 <eandersson> Not yet.
16:27:18 <ttx> This resulted in https://opendev.org/openstack/oslo.metrics
16:27:32 <ttx> Next step there is to add basic tests, so that we are reasonably confident we do not introduce regressions
16:27:40 <ttx> I had an action item I did not finish to evaluate that, let me push it back
16:27:44 <ttx> #action ttx to look into a basic test framework for oslo.metrics
16:27:56 <ttx> Also, masahito was planning to push the latest patches to it, but I haven't seen anything posted yet
16:28:00 <ttx> #action masahito to push latest patches to oslo.metrics
16:28:08 <ttx> amorin: did you check about the applicability of oslo.metrics within OVH?
16:28:22 <amorin> not yet, still in my todo
16:28:30 <ttx> alright, pushing that back too
16:28:33 <ttx> #action amorin to see if oslo.metrics could be tested at OVH
16:28:59 <ttx> Finally, we recently helped with OVH's patch to add a basic ping to oslo.middleware
16:29:06 <ttx> That can be useful to monitor a RabbitMQ setup and detect weird failure cases
16:29:13 <ttx> (there were threads on the ML about it)
16:29:22 <ttx> Happy to report the patch finally landed at https://review.opendev.org/#/c/749834/
16:29:28 <ttx> And shipped in oslo.messaging 12.4.0!
16:29:34 <amorin> yay!
16:29:41 <mdelavergne> congrats!
16:30:20 <eandersson> I believe a lot of the critical recovery issues with RabbitMQ are fixed in Ussuri. Especially the excessive number of exchanges created by RPC calls that I believe caused all our recovery issues.
16:31:28 <ttx> eandersson: yes, I think there were lots of improvements there
16:31:32 <ttx> Does that sound like a good goal for you, penick and eandersson, or do you think we'll also fail to extract common patterns and low-hanging fruit improvements to raise scale?
16:32:39 <penick> Sorry, not understanding.. Is what a good goal?
16:32:56 <ttx> oh, the "Scaling within one cluster" goal
16:33:11 <ttx> Trying to push back the point when you need to scale out to multiple clusters
16:33:49 <eandersson> Yea - that is a good goal.
16:34:10 <penick> Ah, yeah. That's good for me.
16:34:26 <ttx> ok, moving on
16:34:32 <ttx> #topic PTG/Summit plans
16:34:56 <ttx> For the PTG we decided last meeting to ask for a PTG room around our usual meeting times, to serve as our regular meeting while recruiting potential new members
16:35:05 <ttx> So I requested Wednesday 7UTC-8UTC and 16UTC-17UTC (Oct 28)
16:35:44 <ttx> (the first one is a bit early, but there was no slot scheduled at our normal time)
16:35:54 <ttx> For the summit we discussed proposing one Forum session around collecting scaling stories, which I still have to file
16:36:21 <ttx> The idea is to get people to talk and we can document their story afterwards
16:36:31 <ttx> One learning from OpenDev is that to get a virtual discussion going, it's good to prime the pump by having 2-3 people signed up to discuss the topic
16:36:45 <ttx> So... anyone interested in sharing their scaling story in the context of a Forum session?
16:37:26 <ttx> I just fear that unless we start talking, nobody will dare expose their case
16:37:36 <ttx> I can moderate, but I don't have any first-hand scaling story to share
16:37:55 <amorin> I can maybe do a small one
16:38:04 <penick> I can share something, or ask someone from my team to
16:38:31 <ttx> great, thanks! I think that will help
16:39:24 <ttx> We'll try to actively promote that session in the OpenStack ops community, so hopefully it will work, both as a recruiting mechanism and a way to collect data
16:39:48 <amorin> ok
16:40:05 <ttx> #action ttx to file Scaling Stories forum session, with amorin and someone from penick's team to help get it off the ground
16:40:37 <ttx> #topic Next meeting
16:40:44 <ttx> If we continue at the same rhythm:
16:40:50 <ttx> Next meeting will be EU-APAC on Sept 23, 8 UTC.
16:40:56 <ttx> Then the next US-EU meeting will be Oct 7, 16 UTC.
16:41:00 <ttx> How does that sound?
16:41:05 <ttx> Any objection to continuing to hold those on IRC?
16:41:07 <amorin> good
16:41:53 <eandersson> Sounds good to me.
16:41:56 <mdelavergne> it's fine
16:42:06 <ttx> Feel free to invite others from your teams to join if they can help, or fellow OpenStack ops you happen to discuss with
16:42:10 <ttx> #info next meetings: Sep 23, 8:00 UTC; Oct 7, 16:00 UTC
16:42:34 <ttx> That is all I had, opening the floor for comments, questions, desires
16:42:41 <ttx> #topic Open discussion
16:42:53 <penick> 16 UTC is fine for me, I'm ok with either IRC or video chat
16:43:30 <ttx> Anything else this SIG should tackle? We are trying to have reasonable goals, as we all have a lot of work besides the SIG
16:43:42 <ttx> and not sweat it if things go slow
16:44:24 <ttx> but still expose and share the knowledge and tools that are distributed across the large deployment ops community
16:44:31 <eandersson> Meaningful monitoring of the control plane has always been difficult for me.
16:44:37 <penick> I'd be interested to learn how many large scale deployers have dedicated time for upstream contributions, and if there's any interest in collaborating on things.
16:44:58 <penick> For example, producing reference deployment documentation, similar to what we did for the edge deployments.
16:45:16 <eandersson> We focus heavily on anything that we think is critical and is currently lacking support (e.g. our contributions to Senlin)
16:45:18 <penick> ah, meaningful monitoring is a good one
16:46:17 <eandersson> We don't necessarily have the time to dedicate someone to a project like Nova or Neutron, just because of the sheer complexity of learning it. So we focus where we can make the most impact.
16:46:39 <eandersson> But we do contribute smaller bug fixes upstream to all projects (or report them at the very least) the moment we find them.
16:47:06 <ttx> I like the idea of meaningful monitoring. It really is in a sorry state
16:48:16 <ttx> I think the quest for exact metrics (usable for billing) has taken the "meaningful" out of our monitoring
16:48:47 <eandersson> Yep!
16:49:23 <eandersson> It's difficult to find a good balance.
16:49:57 <ttx> Maybe we should collect experiences on how people currently do monitoring, not even talking about metrics/billing
16:50:01 <eandersson> I would also like to see if someone has successfully enabled tracing.
16:51:01 <penick> splunk monitoring/search examples for troubleshooting problems might also be useful
16:51:34 <ttx> I'll give some thought to how to drive that, maybe we should just add a goal on "meaningful monitoring" (I like that term), but it might be a bit too ambitious for the group right now
16:51:37 <eandersson> I would like to know what monitoring is the first to alert you when something goes wrong (e.g. rabbit goes down, control node goes down)
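
A minimal sketch of the kind of first-alert probe eandersson describes, for the "rabbit goes down" case, assuming the RabbitMQ management plugin is enabled; the host, credentials and exit-code convention below are illustrative assumptions, not part of any particular deployment:

    # Control-plane liveness probe: alert as soon as RabbitMQ stops answering
    # the management plugin's aliveness test (publishes and consumes a test
    # message in the given vhost). Host and credentials are placeholders.
    import sys
    import requests

    RABBIT_API = "http://rabbit.example.com:15672/api/aliveness-test/%2F"

    def rabbit_alive(timeout=5):
        try:
            resp = requests.get(RABBIT_API, auth=("monitor", "secret"), timeout=timeout)
            return resp.status_code == 200 and resp.json().get("status") == "ok"
        except requests.RequestException:
            return False

    if __name__ == "__main__":
        if not rabbit_alive():
            print("CRITICAL: RabbitMQ aliveness test failed")
            sys.exit(2)  # Nagios-style critical exit code
        print("OK: RabbitMQ is answering")
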
16:51:49 <penick> also relevant: I'm in two meetings at once here, but the other meeting is about what we're going to do to get off of Ceilometer and Gnocchi and move to something active and supported
16:52:06 <ttx> relevant indeed
16:52:08 <eandersson> We have a lot of synthetic monitoring scenarios running continuously
16:52:34 <eandersson> or scenario monitoring (e.g. a VM is created and tested every 5 minutes)
16:52:38 <ttx> penick: would love to know where you end up
16:53:26 <ttx> OK, that was a great first meeting y'all, but I need to run
16:53:28 <penick> I'll share what we come up with
16:53:34 <penick> cool, thanks y'all!
16:54:12 <ttx> So thanks everyone, let's continue the discussion next time you can drop by. I'll add a section to continue discussing meaningful monitoring in the EU+APAC meeting in two weeks
16:54:45 <ttx> #endmeeting
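
For reference, the scenario monitoring eandersson mentioned (a VM created and tested every few minutes) could look roughly like the sketch below, using openstacksdk; the cloud name, image, flavor and network are illustrative assumptions:

    # Synthetic "canary" scenario: boot a small VM, wait for ACTIVE, then
    # delete it. Run from cron/systemd-timer every few minutes and alert on
    # failures or on wall-clock time exceeding a threshold.
    # Cloud name, image, flavor and network below are placeholders.
    import time
    import openstack

    def run_canary():
        conn = openstack.connect(cloud="monitoring")
        start = time.time()
        server = conn.compute.create_server(
            name="canary-check",
            image_id=conn.compute.find_image("cirros").id,
            flavor_id=conn.compute.find_flavor("m1.tiny").id,
            networks=[{"uuid": conn.network.find_network("canary-net").id}],
        )
        try:
            conn.compute.wait_for_server(server, status="ACTIVE", wait=300)
            return time.time() - start  # seconds to ACTIVE, useful as a trend metric
        finally:
            conn.compute.delete_server(server, ignore_missing=True)

    if __name__ == "__main__":
        print("VM became ACTIVE in %.1fs" % run_canary())

The time-to-ACTIVE value is also a useful trend metric to graph alongside plain pass/fail alerting.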