15:00:06 <ttx> #startmeeting large_scale_sig
15:00:06 <openstack> Meeting started Wed Dec 16 15:00:06 2020 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:09 <openstack> The meeting name has been set to 'large_scale_sig'
15:00:11 <ttx> #topic Rollcall
15:00:14 <ttx> Who is here for the Large Scale SIG meeting?
15:00:16 <mdelavergne> Hi!
15:00:19 <genekuo> o/
15:00:23 <ttx> mdelavergne: hi!
15:00:25 <jpward> o/
15:00:58 <liuyulong> HI
15:01:15 <ttx> pinging amorin
15:01:27 <ttx> I don't see belmiro in channel
15:01:52 <ttx> pinging imtiazc too
15:02:03 <imtiazc> I'm here
15:02:09 <belmoreira> o/
15:02:23 <ttx> if we say belmoreira 3 times he appears
15:02:36 <ttx> Our agenda for today is at:
15:02:39 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
15:02:56 <ttx> #topic Review previous meetings action items
15:03:05 <ttx> "ttx to add 5th stage around upgrade and maintain scaled out systems in operation"
15:03:13 <ttx> that's done at:
15:03:17 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:03:27 <ttx> we can have a look when we'll review those pages later
15:03:33 <ttx> "ttx to make sure oslo.metrics 0.1 is released"
15:03:37 <ttx> That was done through https://review.opendev.org/c/openstack/releases/+/764631 and now oslo.metrics is available at:
15:03:41 <ttx> #link https://pypi.org/project/oslo.metrics/
15:03:57 <ttx> It was also added to OpenStack global requirements by genekuo:
15:04:01 <ttx> #link https://review.opendev.org/c/openstack/requirements/+/766662
15:04:12 <ttx> So it's now ready to consume and will be included in OpenStack Wallaby.
15:04:18 <genekuo> ttx, can you ping me when CI is fixed?
15:04:26 <ttx> It's up to us to now better explain how to enable and use it
15:04:40 <ttx> I'll ask around on what is still blocked, yes
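A rough sketch of what consuming oslo.metrics is expected to look like once the documentation is written: the package is installed from PyPI, and a small exporter process collects the data emitted by oslo.messaging and exposes it for Prometheus. The config group, option names and socket path below are assumptions based on the work in progress, not documented settings; verify them against the oslo.metrics docs once they land.

  # ASSUMPTION: option names mirror the in-progress oslo.messaging
  # integration and may change; check the oslo.metrics docs before use.
  # In the service configuration (e.g. nova.conf):
  [oslo_messaging_metrics]
  metrics_enabled = True
  metrics_socket_file = /var/tmp/metrics_socket

  # Install and run the collector that reads from that socket and
  # exposes the metrics for Prometheus to scrape:
  pip install oslo.metrics
  oslo-metrics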
15:05:00 <ttx> "all to help in filling out https://etherpad.opendev.org/p/large-scale-sig-scaling-videos"
15:05:05 <ttx> Thanks everyone for the help there!
15:05:11 <imtiazc> Here
15:05:15 <ttx> As a reminder we did look up those videos for two reasons:
15:05:23 <ttx> - we can link to them on wiki pages as a good resource to watch (if relevant to any stage)
15:05:33 <ttx> - we could reach out to specific presenters so that they share a bit more about their scaling story
15:05:45 <ttx> So if you watch them and find them very relevant for any of our stages, please add them to the wiki pages
15:05:57 <ttx> And if a specific use case looks very interesting but lacks details, we could reach out to the presenters with more questions
15:06:12 <ttx> Especially from presenters who are not already on the SIG, like China Mobile, Los Alamos, ATT, Reliance Jio...
15:06:21 <ttx> Questions on that?
15:07:00 <mdelavergne> seems straightforward
15:07:09 <ttx> "ttx to check out Ops meetups future plans"
15:07:27 <ttx> I did ask and there is no event planned yet, so we can't piggyback on that for now for our "scaling story collection" work
15:08:07 <ttx> we'll see what event(s) are being organized in 2021, we should see more clearly in January
15:08:16 <ttx> "all to review pages under https://wiki.openstack.org/wiki/Large_Scale_SIG in preparation for next meeting"
15:08:27 <ttx> We'll discuss that in more detail in the next topic
15:08:45 <genekuo> I've put some short answers to some of the questions listed there
15:08:46 <ttx> Any question or comment on those action items? Anything to add?
15:08:51 <genekuo> at least the things I know
15:08:56 <ttx> oh yes I saw it
15:09:39 <ttx> that's the idea, feel free to add things to those pages. We'll review them now and see if there is anything we should prioritize adding
15:09:47 <ttx> #topic Reviewing all scaling stages, and identifying simple tasks to do a first pass at improving those pages
15:09:59 <ttx> So.. the first one is...
15:10:02 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Configure
15:10:17 <ttx> At this stage I don't think there are any easy tasks...
15:10:42 <ttx> amorin did lead the curation of at-scale configuration defaults
15:10:48 <ttx> But it's still work in progress,
15:11:03 <ttx> so I don't think we have a final answer to the "Which parameters should I adjust before tackling scale?" question
15:11:23 <ttx> Are there other common questions that we should list for that stage?
15:12:17 <ttx> maybe something around choosing the right drivers/backends at install time
15:12:29 <mdelavergne> maybe "how not" ?
15:12:31 <ttx> like which are the ones that actually CAN scale?
15:12:52 <imtiazc> Rabbit configuration. We had to tweak a few things there.
15:13:21 <ttx> maybe we can split the question into openstack parameters and rabbit parameters
15:13:52 <ttx> I'll do that now
15:14:36 <genekuo> I agree with listing out the drivers and backends people are using in large scale
15:15:31 <ttx> ok I added those two as questions
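As a starting point for the new RabbitMQ-parameters question, a purely illustrative sketch of the oslo.messaging options operators commonly revisit at scale; the values are placeholders for discussion, not recommendations:

  [DEFAULT]
  # give slow calls more time before the RPC is declared failed
  rpc_response_timeout = 180
  # more worker threads / connections to drain RPC traffic faster
  executor_thread_pool_size = 128
  rpc_conn_pool_size = 60

  [oslo_messaging_rabbit]
  # detect dead broker connections faster
  heartbeat_timeout_threshold = 60
  kombu_reconnect_delay = 1.0
  # mirrored queues trade throughput for resilience -- measure both
  rabbit_ha_queues = true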
15:16:19 <ttx> Any other easy things to add at that stage?
15:16:20 <imtiazc> Apart from DB and RMQ tuning, we had to add memcached. Memcached used to be an optional part of the deployment, but it makes a big difference in performance
15:17:07 <ttx> imtiazc: how about we add "should I use memcached?" question
15:17:21 <ttx> then you can answer it
15:17:23 <ttx> :)
15:17:46 <imtiazc> Sure
15:18:07 <ttx> I like it, that's a good one
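For context, most OpenStack services wire memcached in through oslo.cache; a minimal sketch of what enabling it looks like (the server list is an example, and each service documents which of its caches honour this section):

  # in e.g. keystone.conf or nova.conf
  [cache]
  enabled = True
  backend = oslo_cache.memcache_pool
  memcache_servers = 192.0.2.10:11211,192.0.2.11:11211

  # token validation caching in API services also benefits
  [keystone_authtoken]
  memcached_servers = 192.0.2.10:11211,192.0.2.11:11211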
15:18:27 <ttx> OK, moving on to next stage...
15:18:32 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:18:38 <jpward> I don't know exactly how to ask the question, but what about determining the number of controller nodes and associated services?
15:18:49 <ttx> jpward: that would be for step 3
15:18:55 <ttx> we'll be back to it
15:19:13 <ttx> For the "Monitor" stage I feel like we should redirect people more aggressively to oslo.metrics
15:19:37 <ttx> oh I see that genekuo already added those
15:20:10 <genekuo> yeah
15:20:17 <ttx> genekuo: the next step will be to write good doc for oslo.metrics
15:20:19 <mdelavergne> yep, seems that oslo.metrics is currently everywhere :D
15:20:37 <ttx> so that we can redirect people to it and there they will find all answers
15:20:38 <genekuo> I think there is some other stuff worth monitoring, like queued messages in RabbitMQ
15:21:05 <genekuo> I will try to add some docs once the oslo.messaging code is done
15:22:29 <ttx> Anything else to add? I was tempted to add questions around "how do I track latency issues", "how do I track traffic issues", "how do I track error rates", "how do I track saturation issues"
15:22:45 <ttx> but I'm not sure we would have good answers for those anytime soon
15:22:50 <imtiazc> Is oslo.metrics supposed to help with distributed tracing?
15:23:14 <genekuo> I'm not sure how the question should be phrased, but we do monitor queued messages in RabbitMQ
15:23:33 <genekuo> if they keep piling up, it may indicate that there aren't enough workers
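A quick way to check that with standard RabbitMQ tooling (queue names and credentials will obviously differ per deployment):

  # list queues with their backlog and consumer counts; a "messages"
  # column that keeps growing while consumers are present usually
  # means the workers cannot keep up
  rabbitmqctl list_queues name messages consumers

  # same data over the management API, if that plugin is enabled
  curl -s -u guest:guest http://localhost:15672/api/queues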
15:24:16 <ttx> imtiazc: I'd say that oslo.metrics is more around targeted monitoring of external data sources (database, queue) from openstack perspective
15:24:35 <genekuo> imtiazc, what do you mean by distributed tracing? can you give an example on that?
15:24:53 <ttx> like tracing a user call through all components?
15:24:54 <genekuo> thanks ttx for the explanation
15:24:55 <imtiazc> ttx: Those are good questions. I think the answers will vary from one operator to another.
15:25:32 <genekuo> I agree it will be good to add those questions
15:25:45 <ttx> ok, I'll add them now
15:26:20 <imtiazc> genekuo: An example would be how much time each component of OpenStack takes to create a VM. It can be traced using a common request ID
15:27:27 <ttx> imtiazc: OSProfiler is supposed to help there
15:28:31 <ttx> example https://docs.openstack.org/ironic/pike/_images/sample_trace.svg
15:28:48 <imtiazc> Thanks, haven't tried that out yet. We were considering hooking up with OpenTracing or something like Jaeger
15:29:03 <ttx> I haven't looked at it in a while, so I'm not sure how usable it is
15:29:03 <genekuo> for oslo.metrics, I think what you can get is how much time it takes for scheduling rpc calls in a certain period.
15:29:13 <genekuo> but not for a specific request
15:29:29 <ttx> right, it's different goal
15:30:03 <mdelavergne> Osprofiler worked fine when we used it
15:30:07 <ttx> imtiazc: if you have a fresh look at it, I'm sure the group will be interested in learning what you thought of it
15:30:31 <ttx> ok, anything else to add to the Monitoring stage at this point?
15:31:06 <genekuo> LGTM
15:31:22 <imtiazc> Is there a plan for the community to develop all the monitoring checks (e.g. prometheus checks)?
15:32:15 <ttx> imtiazc: there has been a Technical Committee discussion on how to develop something for monitoring that's more sustainable than Ceilometer
15:32:56 <ttx> including building something around prometheus
15:33:31 <ttx> discussion died down as people did not take immediate interest in working on it
15:33:41 <ttx> that does not mean it's not important
15:33:56 <ttx> we might need to revive that discussion after the holidays in one way or another
15:34:27 <ttx> moving on to stage 3
15:34:32 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleUp
15:35:03 <ttx> so this is where we should give guidance on number of nodes
15:35:23 <ttx> jpward: ^
15:35:57 <ttx> some of the resources listed here might be a better fit in the Configure stage
15:36:09 <genekuo> For the third question, "RabbitMQ is clearly my bottleneck, is there a way to reconfigure that part to handle more scale?"
15:36:14 <ttx> Like it's a bit late in the journey to select a Neutron backend
15:36:16 <genekuo> should we put this in step 1?
15:36:18 <imtiazc> That's a good topic :) The answer, however, depends a lot on the network provider selection.
15:36:59 <imtiazc> We often wondered what tools other operators use. For example: what network provider are they using, what is used for monitoring and logging, how do others provision their hosts (before even deploying OpenStack), and which deployment tool - Puppet, Ansible, etc. Do you think we can come up with a matrix/table for this?
15:37:03 <ttx> genekuo: yeah, I think we should delete that question from this stage. I already added a question on RabbitMQ configuration
15:37:21 <ttx> done
15:37:51 <jpward> imtiazc, I have wondered the same thing, I would like to see that as well
15:38:09 <ttx> I'll move the Neutron backends comparison to stage 1 too
15:39:15 <ttx> ok done
15:40:08 <genekuo> imtiazc, I think there's a lot of feedback about what tools operators use in the ops forum sessions during the Summit
15:40:32 <jpward> should there also be a planning stage? Like determining the type of hardware, networking configurations, etc?
15:40:32 <ttx> yeah, the trick is to reduce all that feedback into common best practices
15:40:58 <ttx> jpward: currently we use stage 1 (Configure) for that
15:41:17 <ttx> It's like initial decisions (stage 1) and later decisions (stage 3)
15:41:27 <jpward> ok
15:41:31 <ttx> Picking a neutron backend would be an initial decision
15:42:01 <ttx> deciding on the control plane / data plane node mix is more stage 3
15:42:37 <ttx> (bad example, it's the kind of question where the answer is the most "it depends")
15:42:59 <ttx> maybe we should rename to the "It Depends SIG"
15:43:11 <jpward> lol
15:43:32 <genekuo> lol
15:43:51 <ttx> Seriously though, there is a reason why there is no "Scaling guide" yet... It's just hard to extract common guidance
15:43:59 <genekuo> we determine the number of control plane processes by looking at the RabbitMQ queues
15:44:09 <ttx> yet we need to, because this journey is superscary
15:44:18 <genekuo> if the number of messages keeps piling up, it probably means that you need to add more workers
15:44:33 <ttx> So any answer or reassurance we can give, we should.
15:45:23 <ttx> genekuo: would you mind adding a question around that? Like "how do you decide to add a new node for control plane" maybe
15:45:24 <imtiazc> Yes, the guidance is somewhat dependent on monitoring your queues and other services. But I think we can vouch for the max number of computes given our architecture.
15:45:41 <genekuo> ttx, let me add it
15:46:00 <ttx> Frankly, we should set the bar pretty low. Any information is better than the current void
15:46:14 <ttx> which is why I see this as a no pressure exercise
15:46:39 <ttx> It is a complex system and every use case is different
15:47:03 <ttx> If optimizing was easy we'd encode it in the software
15:47:41 <ttx> So even if the answer is always "it depends", at least we can say "it depends on..."
15:47:51 <genekuo> done
15:47:54 <ttx> and provide tools to help determining the best path
15:47:58 <ttx> genekuo: thx
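If RabbitMQ metrics are already scraped by Prometheus, that rule of thumb can be expressed as an alert; a sketch assuming a per-queue rabbitmq_queue_messages_ready metric (the exact metric name and a sensible threshold depend on the exporter and the deployment):

  groups:
  - name: openstack-control-plane
    rules:
    - alert: RpcQueueBacklogGrowing
      # ready messages have been piling up for 15 minutes: consider
      # adding workers / control plane capacity for that service
      expr: sum by (queue) (rabbitmq_queue_messages_ready) > 500
      for: 15m
      labels:
        severity: warning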
15:48:08 <imtiazc> We had some rough ideas on how much we could scale based on feedback from other operators like CERN, Salesforce, PayPal, etc.
15:48:18 <ttx> Anything else to add to Scaleup?
15:48:50 <ttx> imtiazc: the best way is, indeed, to listen and discuss with others and mentally apply what they say to your use case
15:49:55 <ttx> Maybe one pro tip we should give is to attend events, watch presentations, engage with fellow operators
15:50:09 <ttx> once it's possible to socialize again :)
15:50:17 <genekuo> sounds good
15:50:24 <ttx> ok, moving on to the next stage
15:50:27 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleOut
15:50:52 <ttx> So here I think it would be great to have a few models
15:51:19 <ttx> I can't lead that as I don't have practical experience doing it
15:51:45 <genekuo> me neither, we currently only split regions for DR purposes
15:51:50 <ttx> If someone is interested in listing the various ways you can scale out to multiple clusters/zones/regions/cells...
15:52:12 <ttx> genekuo: independent clusters is still one model
15:52:52 <ttx> So we won't solve that one today, but if you're interested in helping there, let me know
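As one concrete model for that page, Nova cells v2 is a common way to scale out the compute control plane; a brief sketch of what registering an extra cell looks like (names, credentials and URLs are placeholders, see the Nova cells documentation for the full workflow):

  # register an additional cell with its own message queue and database
  nova-manage cell_v2 create_cell \
      --name cell2 \
      --transport-url rabbit://nova:PASS@rabbit-cell2:5672/ \
      --database_connection mysql+pymysql://nova:PASS@db-cell2/nova_cell2

  # verify the cell mappings
  nova-manage cell_v2 list_cells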
15:53:09 <ttx> Last stage is the one I just added
15:53:15 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:53:23 <ttx> (based on input from last meeting)
15:53:31 <imtiazc> We are also following a cookie cutter model. Once we have determined a max size we are comfortable with, we just replicate. I do like what CERN has done there
15:54:14 <ttx> imtiazc: that's good input. If you can formalize it as a question/answer, I think it would be a great addition
15:54:53 <ttx> So again, I don't think there is easy low-hanging fruit in this stage we could pick up
15:55:27 <ttx> Also wondering how much that stage depends on the distribution you picked at stage 1
15:56:43 <ttx> could be an interesting question to add -- which OpenStack distribution model is well-suited for large scale
15:56:50 <ttx> (stage 1 probably)
15:56:57 <ttx> I'll add it
15:57:21 <ttx> Any last comment before we switch to discussing next meeting date?
15:57:34 <genekuo> nope :)
15:57:46 <imtiazc> By distribution, do you mean Ubuntu, RedHat, SuSe etc?
15:58:23 <ttx> or openstackansible etc
15:58:38 <ttx> Like how do you install openstack
15:58:58 <imtiazc> ok. thanks. I don't have anything else for today
15:59:03 <ttx> So not really Ubuntu, but Ubuntu debs vs. Juju vs...
15:59:11 <ttx> #topic Next meeting
15:59:15 <ttx> As discussed last meeting, we'll skip the meeting over the end-of-year holidays
15:59:23 <ttx> So our next meeting will be January 13.
15:59:35 <ttx> I don't think we'll have a specific item to discuss in depth, we'll just focus on restarting the Large Scale SIG engine in the new year
15:59:37 <imtiazc> Happy holidays everyone!
15:59:45 <ttx> Super, we made it to the end of the meeting without logging any TODOs! We'll be able to take a clean break over the holidays
15:59:53 <ttx> Thanks everyone
16:00:03 <ttx> #endmeeting