15:00:06 <ttx> #startmeeting large_scale_sig
15:00:06 <openstack> Meeting started Wed Dec 16 15:00:06 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:09 <openstack> The meeting name has been set to 'large_scale_sig'
15:00:11 <ttx> #topic Rollcall
15:00:14 <ttx> Who is here for the Large Scale SIG meeting?
15:00:16 <mdelavergne> Hi!
15:00:19 <genekuo> o/
15:00:23 <ttx> mdelavergne: hi!
15:00:25 <jpward> o/
15:00:58 <liuyulong> Hi
15:01:15 <ttx> pinging amorin
15:01:27 <ttx> I don't see belmiro in channel
15:01:52 <ttx> pinging imtiazc too
15:02:03 <imtiazc> I'm here
15:02:09 <belmoreira> o/
15:02:23 <ttx> if we say belmoreira 3 times he appears
15:02:36 <ttx> Our agenda for today is at:
15:02:39 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
15:02:56 <ttx> #topic Review previous meetings action items
15:03:05 <ttx> "ttx to add 5th stage around upgrade and maintain scaled out systems in operation"
15:03:13 <ttx> that's done at:
15:03:17 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:03:27 <ttx> we can have a look when we review those pages later
15:03:33 <ttx> "ttx to make sure oslo.metrics 0.1 is released"
15:03:37 <ttx> That was done through https://review.opendev.org/c/openstack/releases/+/764631 and now oslo.metrics is available at:
15:03:41 <ttx> #link https://pypi.org/project/oslo.metrics/
15:03:57 <ttx> It was also added to OpenStack global requirements by genekuo:
15:04:01 <ttx> #link https://review.opendev.org/c/openstack/requirements/+/766662
15:04:12 <ttx> So it's now ready to consume and will be included in OpenStack Wallaby.
15:04:18 <genekuo> ttx, can you ping me when CI is fixed?
15:04:26 <ttx> It's now up to us to better explain how to enable and use it
15:04:40 <ttx> I'll ask around on what is still blocked, yes
15:05:00 <ttx> "all to help in filling out https://etherpad.opendev.org/p/large-scale-sig-scaling-videos"
15:05:05 <ttx> Thanks everyone for the help there!
15:05:11 <imtiazc> Here
15:05:15 <ttx> As a reminder, we did look up those videos for two reasons:
15:05:23 <ttx> - we can link to them on wiki pages as a good resource to watch (if relevant to any stage)
15:05:33 <ttx> - we could reach out to specific presenters so that they share a bit more about their scaling story
15:05:45 <ttx> So if you watch them and find them very relevant for any of our stages, please add them to the wiki pages
15:05:57 <ttx> And if a specific use case looks very interesting but lacks details, we could reach out to the presenters with more questions
15:06:12 <ttx> Especially presenters who are not already on the SIG, like China Mobile, Los Alamos, ATT, Reliance Jio...
15:06:21 <ttx> Questions on that?
15:07:00 <mdelavergne> seems straightforward
15:07:09 <ttx> "ttx to check out Ops meetups future plans"
15:07:27 <ttx> I did ask and there is no event planned yet, so we can't piggyback on that for now for our "scaling story collection" work
15:08:07 <ttx> we'll see what event(s) are being organized in 2021, we should see more clearly in January
15:08:16 <ttx> "all to review pages under https://wiki.openstack.org/wiki/Large_Scale_SIG in preparation for next meeting"
15:08:27 <ttx> We'll discuss that now in more detail in the next topic
15:08:45 <genekuo> I've put some short answers on some of the questions listed there
15:08:46 <ttx> Any question or comment on those action items? Anything to add?
15:08:51 <genekuo> at least the things I know
15:08:56 <ttx> oh yes I saw it
15:09:39 <ttx> that's the idea, feel free to add things to those pages. We'll review them now and see if there is anything we should prioritize adding
15:09:47 <ttx> #topic Reviewing all scaling stages, and identifying simple tasks to do a first pass at improving those pages
15:09:59 <ttx> So... the first one is...
15:10:02 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Configure
15:10:17 <ttx> At this stage I don't think there are any easy tasks...
15:10:42 <ttx> amorin did lead the curation of at-scale configuration defaults
15:10:48 <ttx> But it's still work in progress,
15:11:03 <ttx> so I don't think we have a final answer for the "Which parameters should I adjust before tackling scale?" question
15:11:23 <ttx> Are there other common questions that we should list for that stage?
15:12:17 <ttx> maybe something around choosing the right drivers/backends at install time
15:12:29 <mdelavergne> maybe "how not to"?
15:12:31 <ttx> like which are the ones that actually CAN scale?
15:12:52 <imtiazc> RabbitMQ configuration. We had to tweak a few things there.
15:13:21 <ttx> maybe we can split the question into OpenStack parameters and RabbitMQ parameters
15:13:52 <ttx> I'll do that now
15:14:36 <genekuo> I agree with listing out the drivers and backends people are using at large scale
15:15:31 <ttx> ok, I added those two as questions
15:16:19 <ttx> Any other easy things to add at that stage?
15:16:20 <imtiazc> Apart from DB and RMQ tuning, we had to add memcached. Memcached used to be an optional deployment option, but it makes a big difference in performance
15:17:07 <ttx> imtiazc: how about we add a "should I use memcached?" question
15:17:21 <ttx> then you can answer it
15:17:23 <ttx> :)
15:17:46 <imtiazc> Sure
15:18:07 <ttx> I like it, that's a good one
15:18:27 <ttx> OK, moving on to next stage...
15:18:32 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:18:38 <jpward> I don't know exactly how to ask the question, but what about determining the number of controller nodes and associated services?
15:18:49 <ttx> jpward: that would be for step 3
15:18:55 <ttx> we'll be back to it
15:19:13 <ttx> For the "Monitor" stage I feel like we should redirect people more aggressively to oslo.metrics
15:19:37 <ttx> oh I see that genekuo already added those
15:20:10 <genekuo> yeah
15:20:17 <ttx> genekuo: the next step will be to write good docs for oslo.metrics
15:20:19 <mdelavergne> yep, seems that oslo.metrics is currently everywhere :D
15:20:37 <ttx> so that we can redirect people to it and there they will find all answers
15:20:38 <genekuo> I think there is some other stuff worth monitoring, like queued messages in RabbitMQ
15:21:05 <genekuo> I will try to add some docs once the oslo.messaging code is done
15:22:29 <ttx> Anything else to add?
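
On genekuo's point above about watching queued messages in RabbitMQ: the sketch below shows one minimal way to do that in Python, using the RabbitMQ management HTTP API. It assumes the rabbitmq_management plugin is enabled on its default port; the host, credentials and threshold are placeholders to adapt to a given deployment, not anything agreed in the meeting.

    # Minimal sketch: list RabbitMQ queue depths via the management HTTP API.
    # Assumes the rabbitmq_management plugin is enabled; host, credentials and
    # the threshold below are placeholders for your deployment.
    import requests

    RABBIT_API = "http://rabbitmq.example.com:15672/api/queues"  # placeholder host
    AUTH = ("monitoring", "secret")                              # placeholder credentials
    THRESHOLD = 1000  # messages; pick whatever "piling up" means at your scale

    def queue_depths():
        """Return {queue_name: current message count} for every queue."""
        resp = requests.get(RABBIT_API, auth=AUTH, timeout=10)
        resp.raise_for_status()
        return {q["name"]: q.get("messages", 0) for q in resp.json()}

    if __name__ == "__main__":
        for name, depth in sorted(queue_depths().items(), key=lambda kv: -kv[1]):
            flag = "  <-- piling up?" if depth > THRESHOLD else ""
            print(f"{depth:8d}  {name}{flag}")
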
15:22:45 <ttx> I was tempted to add questions around "how do I track latency issues", "how do I track traffic issues", "how do I track error rates", "how do I track saturation issues", but I'm not sure we would have good answers for those anytime soon
15:22:50 <imtiazc> Is oslo.metrics supposed to help with distributed tracing?
15:23:14 <genekuo> I'm not sure how the question should be worded, but we do monitor queued messages in RabbitMQ
15:23:33 <genekuo> if they keep piling up, it may indicate that there aren't enough workers
15:24:16 <ttx> imtiazc: I'd say that oslo.metrics is more around targeted monitoring of external data sources (database, queue) from an OpenStack perspective
15:24:35 <genekuo> imtiazc, what do you mean by distributed tracing? can you give an example of that?
15:24:53 <ttx> like tracing a user call through all components?
15:24:54 <genekuo> thanks ttx for the explanation
15:24:55 <imtiazc> ttx: Those are good questions. I think the answers will vary from one operator to another.
15:25:32 <genekuo> I agree it would be good to add those questions
15:26:45 <ttx> ok, I'll add them now
15:26:20 <imtiazc> genekuo: An example would be how much time each component of OpenStack takes to create a VM. It can be traced using a common request ID
15:27:27 <ttx> imtiazc: OSProfiler is supposed to help there
15:28:31 <ttx> example https://docs.openstack.org/ironic/pike/_images/sample_trace.svg
15:28:48 <imtiazc> Thanks, haven't tried that out yet. We were considering hooking up with OpenTracing or something like Jaeger
15:29:03 <ttx> I haven't looked at it in a while, so not sure how usable it is
15:29:03 <genekuo> for oslo.metrics, I think what you can get is how much time scheduling RPC calls takes over a certain period,
15:29:13 <genekuo> but not for a specific request
15:29:29 <ttx> right, it's a different goal
15:30:03 <mdelavergne> OSProfiler worked fine when we used it
15:30:07 <ttx> imtiazc: if you have a fresh look at it, I'm sure the group will be interested in learning what you thought of it
15:30:31 <ttx> ok, anything else to add to the Monitoring stage at this point?
15:31:06 <genekuo> LGTM
15:31:22 <imtiazc> Is there a plan for the community to develop all the monitoring checks (e.g. Prometheus checks)?
15:32:15 <ttx> imtiazc: there has been a Technical Committee discussion on how to develop something for monitoring that's more sustainable than Ceilometer
15:32:56 <ttx> including building something around Prometheus
15:33:31 <ttx> the discussion died down as people did not take immediate interest in working on it
15:33:41 <ttx> that does not mean it's not important
15:33:56 <ttx> we might need to revive that discussion after the holidays in one way or another
15:34:27 <ttx> moving on to stage 3
15:34:32 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleUp
15:35:03 <ttx> so this is where we should give guidance on number of nodes
15:35:23 <ttx> jpward: ^
15:35:57 <ttx> some of the resources listed here might be a better fit in the Configure stage
15:36:09 <genekuo> For the third question, "RabbitMQ is clearly my bottleneck, is there a way to reconfigure that part to handle more scale?"
15:36:14 <ttx> Like it's a bit late in the journey to select a Neutron backend
15:36:16 <genekuo> should we put this in step 1?
15:36:18 <imtiazc> That's a good topic :) The answer, however, depends a lot on the network provider selection.
15:36:59 <imtiazc> We often wondered about what tools other operators use, e.g. what network provider they are using, what is used for monitoring and logging, how others provision their hosts (before even deploying OpenStack), and also the deployment tool - Puppet, Ansible, etc. Do you think we can come up with a matrix/table for this?
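
For the OSProfiler option ttx points to above (tracing a single request, such as a VM create, across components under a common trace ID), here is a minimal sketch of its Python trace points. The HMAC key and trace point names are placeholders, and the real wiring in OpenStack services goes through the [profiler] configuration section and the client-side --os-profile flag; treat the details as assumptions to verify against the OSProfiler documentation.

    # Minimal sketch of OSProfiler trace points, not the full OpenStack wiring.
    # In real services profiling is enabled via the [profiler] config section,
    # and requests are traced by passing --os-profile <hmac_key> on the client
    # side; the key and trace point names below are placeholders.
    from osprofiler import profiler

    # The HMAC key must match what the services are configured with; init()
    # starts a new trace and its id is propagated to nested trace points.
    profiler.init(hmac_key="SECRET_KEY")

    @profiler.trace("build-instance")
    def build_instance(flavor):
        # Each decorated function or Trace block becomes a span in the trace
        # tree that tools such as `osprofiler trace show --html` can render.
        with profiler.Trace("allocate-network", info={"flavor": flavor}):
            pass  # placeholder for the actual work

    build_instance("m1.small")
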
15:37:03 <ttx> genekuo: yeah, I think we should delete that question from this stage. I already added a question on RabbitMQ configuration
15:37:21 <ttx> done
15:37:51 <jpward> imtiazc, I have wondered the same thing, I would like to see that as well
15:38:09 <ttx> I'll move the Neutron backends comparison to stage 1 too
15:39:15 <ttx> ok, done
15:40:08 <genekuo> imtiazc, I think there's a lot of feedback about what tools ops use in the ops forum during the summit
15:40:32 <jpward> should there also be a planning stage? Like determining the type of hardware, networking configurations, etc?
15:40:32 <ttx> yeah, the trick is to reduce all that feedback into common best practices
15:40:58 <ttx> jpward: currently we use stage 1 (Configure) for that
15:41:17 <ttx> It's like initial decisions (stage 1) and later decisions (stage 3)
15:41:27 <jpward> ok
15:41:31 <ttx> Picking a Neutron backend would be an initial decision
15:42:01 <ttx> deciding on a control plane / data plane number of nodes mix is more stage 3
15:42:37 <ttx> (bad example, it's the case where the answer is the most "it depends")
15:42:59 <ttx> maybe we should rename to the "It Depends SIG"
15:43:11 <jpward> lol
15:43:32 <genekuo> lol
15:43:51 <ttx> Seriously though, there is a reason why there is no "Scaling guide" yet... It's just hard to extract common guidance
15:43:59 <genekuo> we determine the number of control plane processes by looking into the RabbitMQ queues
15:44:09 <ttx> yet we need to, because this journey is super scary
15:44:18 <genekuo> if the number of messages keeps queueing up, it probably means that you need to add more workers
15:44:33 <ttx> So any answer or reassurance we can give, we should.
15:45:23 <ttx> genekuo: would you mind adding a question around that? Like "how do you decide to add a new node for the control plane" maybe
15:45:24 <imtiazc> Yes, the guidance is somewhat dependent on monitoring your queues and other services. But I think we can vouch for the max number of computes given our architecture.
15:45:41 <genekuo> ttx, let me add it
15:46:00 <ttx> Frankly, we should set the bar pretty low. Any information is better than the current void
15:46:14 <ttx> which is why I see this as a no-pressure exercise
15:46:39 <ttx> It is a complex system and every use case is different
15:47:03 <ttx> If optimizing was easy we'd encode it in the software
15:47:41 <ttx> So even if the answer is always "it depends", at least we can say "it depends on..."
15:47:51 <genekuo> done
15:47:54 <ttx> and provide tools to help determine the best path
15:47:58 <ttx> genekuo: thx
15:48:08 <imtiazc> We had some rough ideas on how much we could scale based on feedback from other operators like CERN, Salesforce, PayPal etc.
15:48:18 <ttx> Anything else to add to ScaleUp?
15:48:50 <ttx> imtiazc: the best way is, indeed, to listen and discuss with others and mentally apply what they say to your use case
15:49:55 <ttx> Maybe one pro tip we should give is to attend events, watch presentations, engage with fellow operators
15:50:09 <ttx> once it's possible to socialize again :)
15:50:17 <genekuo> sounds good
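
genekuo's rule of thumb above (if messages keep queueing up, you probably need more control-plane workers) can be turned into a rough check. This toy sketch samples the cluster-wide backlog via the RabbitMQ management API and flags sustained growth; host, credentials, sample count and interval are placeholders, and the "strictly growing" test only illustrates the idea rather than a tuned policy.

    # Toy version of the heuristic above: sample the total RabbitMQ backlog a
    # few times and flag sustained growth. Assumes the management API, as in
    # the earlier snippet; host, credentials, sample count and interval are
    # placeholders.
    import time
    import requests

    RABBIT_API = "http://rabbitmq.example.com:15672/api/overview"  # placeholder
    AUTH = ("monitoring", "secret")                                # placeholder

    def total_queued():
        """Cluster-wide count of messages currently sitting in queues."""
        data = requests.get(RABBIT_API, auth=AUTH, timeout=10).json()
        return data["queue_totals"]["messages"]

    def backlog_keeps_growing(samples=5, interval=60):
        """True if the backlog grew between every pair of consecutive samples."""
        depths = []
        for i in range(samples):
            depths.append(total_queued())
            if i < samples - 1:
                time.sleep(interval)
        return all(later > earlier for earlier, later in zip(depths, depths[1:]))

    if __name__ == "__main__":
        if backlog_keeps_growing():
            print("RabbitMQ backlog keeps growing - consider adding control plane workers")
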
15:50:24 <ttx> ok, moving on to the next stage
15:50:27 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleOut
15:50:52 <ttx> So here I think it would be great to have a few models
15:51:19 <ttx> I can't lead that as I don't have practical experience doing it
15:51:45 <genekuo> me too, we currently only split regions for DR purposes
15:51:50 <ttx> If someone is interested in listing the various ways you can scale out to multiple clusters/zones/regions/cells...
15:52:12 <ttx> genekuo: independent clusters is still one model
15:52:52 <ttx> So we won't solve that one today, but if you're interested in helping there, let me know
15:53:09 <ttx> Last stage is the one I just added
15:53:15 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:53:23 <ttx> (based on input from last meeting)
15:53:31 <imtiazc> We are also following a cookie-cutter model. Once we have determined a max size we are comfortable with, we just replicate. I do like what CERN has done there
15:54:14 <ttx> imtiazc: that's good input. If you can formalize it as a question/answer, I think it would be a great addition
15:54:53 <ttx> So again, I don't think there is easy low-hanging fruit in this stage we could pick up
15:55:27 <ttx> Also wondering how much that stage depends on the distribution you picked at stage 1
15:56:43 <ttx> could be an interesting question to add -- which OpenStack distribution model is well-suited for large scale
15:56:50 <ttx> (stage 1 probably)
15:56:57 <ttx> I'll add it
15:57:21 <ttx> Any last comment before we switch to discussing the next meeting date?
15:57:34 <genekuo> nope :)
15:57:46 <imtiazc> By distribution, do you mean Ubuntu, RedHat, SuSe etc.?
15:58:23 <ttx> or openstack-ansible etc.
15:58:38 <ttx> Like how do you install OpenStack
15:58:58 <imtiazc> ok, thanks. I don't have anything else for today
15:59:03 <ttx> So not really Ubuntu, but Ubuntu debs vs. Juju vs...
15:59:11 <ttx> #topic Next meeting
15:59:15 <ttx> As discussed last meeting, we'll skip the meeting over the end-of-year holidays
15:59:23 <ttx> So our next meeting will be January 13.
15:59:35 <ttx> I don't think we'll have a specific item to discuss in-depth, we'll just focus on restarting the Large Scale SIG engine in the new year
15:59:37 <imtiazc> Happy holidays everyone!
15:59:45 <ttx> Super, we made it to the end of the meeting without logging any TODOs! We'll be able to take a clean break over the holidays
15:59:53 <ttx> Thanks everyone
16:00:03 <ttx> #endmeeting