Wednesday, 2020-12-16

00:04 *** tosky has quit IRC
00:58 *** macz_ has quit IRC
01:30 *** mlavalle has quit IRC
01:30 *** _mlavalle_1 has joined #openstack-meeting-3
02:12 *** artom has quit IRC
02:26 *** hemanth_n has joined #openstack-meeting-3
02:35 *** benj_- has joined #openstack-meeting-3
02:35 *** benj_ has quit IRC
02:35 *** benj_- is now known as benj_
02:55 *** macz_ has joined #openstack-meeting-3
03:00 *** macz_ has quit IRC
03:46 *** macz_ has joined #openstack-meeting-3
03:50 *** macz_ has quit IRC
05:59 *** ricolin has joined #openstack-meeting-3
06:48 *** yamamoto has quit IRC
07:25 *** lkoranda has joined #openstack-meeting-3
07:29 *** yamamoto has joined #openstack-meeting-3
07:35 *** lkoranda has quit IRC
07:35 *** eolivare has joined #openstack-meeting-3
07:39 *** yamamoto has quit IRC
08:00 *** slaweq has joined #openstack-meeting-3
08:33 *** tosky has joined #openstack-meeting-3
08:54 *** e0ne has joined #openstack-meeting-3
09:24 *** aarents has quit IRC
09:47 *** tosky_ has joined #openstack-meeting-3
09:49 *** tosky is now known as Guest24372
09:49 *** tosky_ is now known as tosky
09:50 *** Guest24372 has quit IRC
09:57 *** lpetrut has joined #openstack-meeting-3
10:12 *** yamamoto has joined #openstack-meeting-3
10:18 *** baojg has quit IRC
10:20 *** baojg has joined #openstack-meeting-3
11:13 *** artom has joined #openstack-meeting-3
11:18 *** macz_ has joined #openstack-meeting-3
11:23 *** macz_ has quit IRC
11:32 *** yamamoto has quit IRC
11:53 *** raildo has joined #openstack-meeting-3
11:55 *** yamamoto has joined #openstack-meeting-3
12:00 *** yamamoto has quit IRC
12:03 *** eolivare_ has joined #openstack-meeting-3
12:05 *** eolivare has quit IRC
12:09 *** baojg has quit IRC
12:09 *** baojg has joined #openstack-meeting-3
12:20 *** yamamoto has joined #openstack-meeting-3
12:25 *** eolivare_ has quit IRC
12:42 *** baojg has quit IRC
12:43 *** eolivare_ has joined #openstack-meeting-3
12:44 *** baojg has joined #openstack-meeting-3
13:03 *** hemanth_n has quit IRC
13:59 *** liuyulong has joined #openstack-meeting-3
14:51 *** mdelavergne has joined #openstack-meeting-3
15:00 <ttx> #startmeeting large_scale_sig
15:00 <openstack> Meeting started Wed Dec 16 15:00:06 2020 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00 *** genekuo has joined #openstack-meeting-3
15:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00 *** openstack changes topic to " (Meeting topic: large_scale_sig)"
15:00 <openstack> The meeting name has been set to 'large_scale_sig'
15:00 <ttx> #topic Rollcall
15:00 *** openstack changes topic to "Rollcall (Meeting topic: large_scale_sig)"
15:00 <ttx> Who is here for the Large Scale SIG meeting?
15:00 <mdelavergne> Hi!
15:00 <genekuo> o/
15:00 <ttx> mdelavergne: hi!
15:00 <jpward> o/
15:00 <liuyulong> Hi
15:01 <ttx> pinging amorin
15:01 <ttx> I don't see belmiro in channel
15:01 *** belmoreira has joined #openstack-meeting-3
15:01 <ttx> pinging imtiazc too
15:02 <imtiazc> I'm here
15:02 <belmoreira> o/
15:02 <ttx> if we say belmoreira 3 times he appears
15:02 <ttx> Our agenda for today is at:
15:02 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
15:02 <ttx> #topic Review previous meetings action items
15:02 *** openstack changes topic to "Review previous meetings action items (Meeting topic: large_scale_sig)"
15:03 <ttx> "ttx to add 5th stage around upgrade and maintain scaled out systems in operation"
15:03 <ttx> that's done at:
15:03 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:03 <ttx> we can have a look when we review those pages later
15:03 <ttx> "ttx to make sure oslo.metrics 0.1 is released"
15:03 <ttx> That was done through https://review.opendev.org/c/openstack/releases/+/764631 and now oslo.metrics is available at:
15:03 <ttx> #link https://pypi.org/project/oslo.metrics/
15:03 <ttx> It was also added to OpenStack global requirements by genekuo:
15:04 <ttx> #link https://review.opendev.org/c/openstack/requirements/+/766662
15:04 <ttx> So it's now ready to consume and will be included in OpenStack Wallaby.
15:04 <genekuo> ttx, can you ping me when CI is fixed?
15:04 <ttx> It's now up to us to better explain how to enable and use it
15:04 <ttx> I'll ask around on what is still blocked, yes
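
For context on "how to enable and use it": oslo.metrics works by having oslo.messaging emit metrics data that a separate oslo-metrics process collects and exposes for Prometheus to scrape. A minimal sketch, assuming the [oslo_messaging_metrics] section and metrics_enabled option described in the oslo.metrics project documentation (verify the option names against the release you deploy):

    # nova.conf (or any service whose RPC traffic you want to measure)
    # Assumed option name from the oslo.metrics docs -- verify per release.
    [oslo_messaging_metrics]
    metrics_enabled = True

The oslo-metrics process then runs alongside the service and exposes the collected data on an HTTP endpoint for a Prometheus scrape job.
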
ttx"all to help in filling out https://etherpad.opendev.org/p/large-scale-sig-scaling-videos"15:04
ttxThanks everyone for the help there!15:05
imtiazcHere15:05
ttxAs a reminder we did look up those videos for two reasons:15:05
ttx- we can link to them on wiki pages as a good resource to watch (if relevant to any stage)15:05
ttx- we could reach out to specific presenters so that they share a bit more about their scaling story15:05
ttxSo if you watch them and find them very relevant for any of our stages, please add them to the wiki pages15:05
ttxAnd if a specific use case looks very interesting but lacks details, we could reach our to the presenters with more questions15:05
ttxEspecially from presenters who are not already on the SIG, like China Mobile, Los Alamos, ATT, Reliance Jio...15:06
ttxQuestions on that?15:06
mdelavergneseems straightforward15:07
ttx"ttx to check out Ops meetups future plans"15:07
ttxI did ask and there is no event planned yet, so we can't piggyback on that for now for our "scaling story collection" work15:07
ttxwe'll see what event(s) are being organized in 2021. should see more clearly in January15:08
ttx"all to review pages under https://wiki.openstack.org/wiki/Large_Scale_SIG in preparation for next meeting"15:08
ttxWe'll discuss that now in more details in the next topic15:08
genekuoI've put some short answer on some of the question listed there15:08
ttxAny question or comment on those action items? Anything to add?15:08
genekuoat least the thing I know15:08
ttxoh yes I saw it15:08
ttxthat's the idea, feel free to add things to those pages. We'll review them now and see if there is anything we should prioritize adding15:09
15:09 <ttx> #topic Reviewing all scaling stages, and identifying simple tasks to do a first pass at improving those pages
15:09 *** openstack changes topic to "Reviewing all scaling stages, and identifying simple tasks to do a first pass at improving those pages (Meeting topic: large_scale_sig)"
15:09 <ttx> So.. the first one is...
15:10 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Configure
15:10 <ttx> At this stage I don't think there are any easy tasks...
15:10 <ttx> amorin did lead the curation of at-scale configuration defaults
15:10 <ttx> But it's still work in progress,
15:11 <ttx> so I don't think we have a final answer for the "Which parameters should I adjust before tackling scale?" question
15:11 <ttx> Are there other common questions that we should list for that stage?
15:12 <ttx> maybe something around choosing the right drivers/backends at install time
15:12 <mdelavergne> maybe "how not"?
15:12 <ttx> like which are the ones that actually CAN scale?
15:12 <imtiazc> RabbitMQ configuration. We had to tweak a few things there.
15:13 <ttx> maybe we can split the question into OpenStack parameters and RabbitMQ parameters
15:13 <ttx> I'll do that now
15:14 <genekuo> I agree with listing out the drivers and backends people are using at large scale
15:15 *** ralonsoh has quit IRC
15:15 <ttx> ok, I added those two as questions
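
For reference on the "OpenStack parameters" side of that split, the options that typically come up are API/RPC worker counts and database pool sizing. A hedged sketch of the kind of knobs involved (the option names exist in nova, neutron, and oslo.db; the values are illustrative assumptions, not recommendations):

    # nova.conf -- values are placeholders, size them to your hardware
    [DEFAULT]
    osapi_compute_workers = 8    # nova-api worker processes
    metadata_workers = 8         # metadata API workers

    [database]
    max_pool_size = 50           # oslo.db connection pool size
    max_overflow = 100           # extra connections allowed under burst

    # neutron.conf
    [DEFAULT]
    api_workers = 8              # neutron API workers
    rpc_workers = 8              # RPC workers consuming RabbitMQ queues
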
15:15 *** ralonsoh has joined #openstack-meeting-3
15:15 *** ricolin_ has joined #openstack-meeting-3
15:16 <ttx> Any other easy things to add at that stage?
15:16 <imtiazc> Apart from DB and RMQ tuning, we had to add memcached. Memcached used to be an optional deployment component, but it makes a big difference in performance
15:17 <ttx> imtiazc: how about we add a "should I use memcached?" question
15:17 <ttx> then you can answer it
15:17 <ttx> :)
15:17 <imtiazc> Sure
15:18 <ttx> I like it, that's a good one
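
For context, "should I use memcached?" mostly amounts to a configuration change in the services; a minimal sketch using oslo.cache (section and backend names as documented for oslo.cache; the server addresses are placeholders):

    # keystone.conf -- other services expose the same [cache] section via oslo.cache
    [cache]
    enabled = True
    backend = oslo_cache.memcache_pool                    # pooled memcached backend
    memcache_servers = 192.0.2.10:11211,192.0.2.11:11211  # placeholder hosts
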
15:18 <ttx> OK, moving on to next stage...
15:18 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:18 <jpward> I don't know exactly how to ask the question, but what about determining the number of controller nodes and associated services?
15:18 <ttx> jpward: that would be for step 3
15:18 <ttx> we'll be back to it
15:19 <ttx> For the "Monitor" stage I feel like we should redirect people more aggressively to oslo.metrics
15:19 <ttx> oh I see that genekuo already added those
15:20 <genekuo> yeah
15:20 <ttx> genekuo: the next step will be to write good docs for oslo.metrics
15:20 <mdelavergne> yep, seems that oslo.metrics is currently everywhere :D
15:20 <ttx> so that we can redirect people to it and there they will find all answers
15:20 <genekuo> I think there is some other stuff worth monitoring, like queued messages in RabbitMQ
15:21 <genekuo> I will try to add some docs once the oslo.messaging code is done
15:22 <ttx> Anything else to add? I was tempted to add questions around "how do I track latency issues", "how do I track traffic issues", "how do I track error rates", "how do I track saturation issues"
15:22 <ttx> but I'm not sure we would have good answers for those anytime soon
15:22 <imtiazc> Is oslo.metrics supposed to help with distributed tracing?
15:23 *** lpetrut has quit IRC
15:23 <genekuo> I'm not sure how the question should be phrased, but we do monitor queued messages in RabbitMQ
15:23 <genekuo> if they keep piling up, it may indicate that there aren't enough workers
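
For illustration, queue depths can be pulled from the RabbitMQ management API; a minimal sketch (the /api/queues endpoint and "messages" field come from the management plugin's HTTP API; the host, credentials, and alert threshold are assumptions):

    # Minimal sketch: flag RabbitMQ queues whose backlog exceeds a threshold.
    # Assumes the management plugin is enabled; host/credentials/threshold
    # are placeholders for illustration.
    import requests

    RABBIT_API = "http://rabbit.example.com:15672/api/queues"
    THRESHOLD = 1000  # arbitrary example backlog

    def queues_over_threshold(user="guest", password="guest"):
        resp = requests.get(RABBIT_API, auth=(user, password), timeout=10)
        resp.raise_for_status()
        # Each entry includes "name" and "messages" (ready + unacknowledged)
        return [(q["name"], q["messages"])
                for q in resp.json()
                if q.get("messages", 0) > THRESHOLD]

    if __name__ == "__main__":
        for name, depth in queues_over_threshold():
            print(f"{name}: {depth} queued messages -- consider adding workers")
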
15:24 <ttx> imtiazc: I'd say that oslo.metrics is more about targeted monitoring of external data sources (database, queue) from the OpenStack perspective
15:24 <genekuo> imtiazc, what do you mean by distributed tracing? can you give an example of that?
15:24 <ttx> like tracing a user call through all components?
15:24 <genekuo> thanks ttx for the explanation
15:24 <imtiazc> ttx: Those are good questions. I think the answers will vary from one operator to another.
15:25 <genekuo> I agree it would be good to add those questions
15:25 <ttx> ok, I'll add them now
15:26 <imtiazc> genekuo: An example would be how much time each component of OpenStack takes to create a VM. It can be traced using a common request ID
15:27 <ttx> imtiazc: OSProfiler is supposed to help there
15:28 <ttx> example: https://docs.openstack.org/ironic/pike/_images/sample_trace.svg
15:28 <imtiazc> Thanks, haven't tried that out yet. We were considering hooking up with OpenTracing or something like Jaeger
15:29 <ttx> I haven't looked at it in a while, so not sure how usable it is
15:29 <genekuo> for oslo.metrics, I think what you can get is how much time scheduling RPC calls takes over a certain period,
15:29 <genekuo> but not for a specific request
15:29 <ttx> right, it's a different goal
15:30 <mdelavergne> OSProfiler worked fine when we used it
15:30 <ttx> imtiazc: if you take a fresh look at it, I'm sure the group will be interested in hearing what you thought of it
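
For reference, OSProfiler is enabled per service through a [profiler] config section; a minimal sketch (option names follow the OSProfiler documentation; the HMAC key and Redis store are placeholder assumptions):

    # nova.conf -- repeat in each service you want included in the trace
    [profiler]
    enabled = True
    trace_sqlalchemy = True                     # include DB calls in traces
    hmac_keys = SECRET_KEY                      # placeholder shared secret
    connection_string = redis://127.0.0.1:6379  # placeholder trace store

A request is then traced by passing the same key on the client side (e.g. openstack --os-profile SECRET_KEY server list), which prints a trace ID that osprofiler trace show --html <trace-id> can render into a report like the Ironic sample linked above.
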
15:30 <ttx> ok, anything else to add to the Monitoring stage at this point?
15:31 <genekuo> LGTM
15:31 <imtiazc> Is there a plan for the community to develop all the monitoring checks (e.g. Prometheus checks)?
15:32 <ttx> imtiazc: there has been a Technical Committee discussion on how to develop something for monitoring that's more sustainable than Ceilometer
15:32 <ttx> including building something around Prometheus
15:33 <ttx> discussion died down as people did not take immediate interest in working on it
15:33 <ttx> that does not mean it's not important
15:33 <ttx> we might need to revive that discussion after the holidays in one way or another
15:34 <ttx> moving on to stage 3
15:34 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleUp
15:35 <ttx> so this is where we should give guidance on number of nodes
15:35 <ttx> jpward: ^
15:35 <ttx> some of the resources listed here might be a better fit in the Configure stage
15:36 <genekuo> For the third question, "RabbitMQ is clearly my bottleneck, is there a way to reconfigure that part to handle more scale?"
15:36 <ttx> Like it's a bit late in the journey to select a Neutron backend
15:36 <genekuo> should we put this in step 1?
15:36 <imtiazc> That's a good topic :) The answer, however, depends a lot on the network provider selection.
15:36 <imtiazc> We often wondered about what tools other operators use. For example, what network provider are they using, what is used for monitoring and logging, how do others provision their hosts (before even deploying OpenStack), and also the deployment tool - Puppet, Ansible, etc. Do you think we can come up with a matrix/table for this?
15:37 <ttx> genekuo: yeah, I think we should delete that question from this stage. I already added a question on RabbitMQ configuration
15:37 <ttx> done
15:37 <jpward> imtiazc, I have wondered the same thing, I would like to see that as well
15:38 <ttx> I'll move the Neutron backends comparison to stage 1 too
15:39 <ttx> ok, done
15:40 <genekuo> imtiazc, I think there's a lot of feedback about what tools ops use in the ops forums during summits
15:40 <jpward> should there also be a planning stage? Like determining the type of hardware, networking configurations, etc?
15:40 <ttx> yeah, the trick is to reduce all that feedback into common best practices
15:40 <ttx> jpward: currently we use stage 1 (Configure) for that
15:41 <ttx> It's like initial decisions (stage 1) and later decisions (stage 3)
15:41 <jpward> ok
15:41 <ttx> Picking a Neutron backend would be an initial decision
15:42 <ttx> deciding on a control plane / data plane mix of nodes is more stage 3
15:42 *** liuyulong has quit IRC
15:42 <ttx> (bad example, it's the kind of question where the answer is the most "it depends")
15:42 <ttx> maybe we should rename to the "It Depends SIG"
15:43 <jpward> lol
15:43 <genekuo> lol
15:43 <ttx> Seriously though, there is a reason why there is no "Scaling guide" yet... It's just hard to extract common guidance
15:43 <genekuo> we determine the number of control plane processes by looking at the RabbitMQ queues
15:44 <ttx> yet we need to, because this journey is super scary
15:44 <genekuo> if the number of messages keeps growing, it probably means that you need to add more workers
15:44 <ttx> So any answer or reassurance we can give, we should.
15:45 <ttx> genekuo: would you mind adding a question around that? Like "how do you decide to add a new node for the control plane", maybe
15:45 <imtiazc> Yes, the guidance is somewhat dependent on monitoring your queues and other services. But I think we can vouch for a max number of computes given our architecture.
15:45 <genekuo> ttx, let me add it
15:46 <ttx> Frankly, we should set the bar pretty low. Any information is better than the current void
15:46 <ttx> which is why I see this as a no-pressure exercise
15:46 *** _mlavalle_1 has quit IRC
15:46 <ttx> It is a complex system and every use case is different
15:47 <ttx> If optimizing was easy, we'd encode it in the software
15:47 <ttx> So even if the answer is always "it depends", at least we can say "it depends on..."
15:47 <genekuo> done
15:47 <ttx> and provide tools to help determine the best path
15:47 <ttx> genekuo: thx
15:48 <imtiazc> We had some rough ideas on how much we could scale based on feedback from other operators like CERN, Salesforce, PayPal etc.
15:48 <ttx> Anything else to add to ScaleUp?
15:48 <ttx> imtiazc: the best way is, indeed, to listen and discuss with others and mentally apply what they say to your use case
15:49 <ttx> Maybe one pro tip we should give is to attend events, watch presentations, and engage with fellow operators
15:50 <ttx> once it's possible to socialize again :)
15:50 <genekuo> sounds good
15:50 <ttx> ok, moving on to the next stage
15:50 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleOut
15:50 <ttx> So here I think it would be great to have a few models
15:51 <ttx> I can't lead that, as I don't have practical experience doing it
15:51 <genekuo> me neither, we currently only split regions for DR purposes
15:51 <ttx> If someone is interested in listing the various ways you can scale out to multiple clusters/zones/regions/cells...
15:52 <ttx> genekuo: independent clusters is still one model
15:52 <ttx> So we won't solve that one today, but if you're interested in helping there, let me know
15:53 <ttx> The last stage is the one I just added
15:53 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:53 <ttx> (based on input from last meeting)
15:53 <imtiazc> We are also following a cookie-cutter model. Once we have determined a max size we are comfortable with, we just replicate. I do like what CERN has done there
15:54 <ttx> imtiazc: that's good input. If you can formalize it as a question/answer, I think it would be a great addition
15:54 <ttx> So again, I don't think there is easy low-hanging fruit in this stage we could pick up
15:55 <ttx> Also wondering how much that stage depends on the distribution you picked at stage 1
15:56 <ttx> could be an interesting question to add -- which OpenStack distribution model is well-suited for large scale
15:56 <ttx> (stage 1 probably)
15:56 <ttx> I'll add it
15:57 <ttx> Any last comment before we switch to discussing the next meeting date?
15:57 <genekuo> nope :)
15:57 <imtiazc> By distribution, do you mean Ubuntu, RedHat, SuSE etc?
15:58 <ttx> or openstack-ansible etc
15:58 <ttx> Like how you install OpenStack
15:58 <imtiazc> ok, thanks. I don't have anything else for today
15:59 <ttx> So not really Ubuntu, but Ubuntu debs vs. Juju vs...
15:59 <ttx> #topic Next meeting
15:59 *** openstack changes topic to "Next meeting (Meeting topic: large_scale_sig)"
15:59 <ttx> As discussed last meeting, we'll skip the meeting over the end-of-year holidays
15:59 <ttx> So our next meeting will be January 13.
15:59 <ttx> I don't think we'll have a specific item to discuss in depth, we'll just focus on restarting the Large Scale SIG engine in the new year
15:59 <imtiazc> Happy holidays everyone!
15:59 <ttx> Super, we made it to the end of the meeting without logging any TODOs! We'll be able to take a clean break over the holidays
15:59 <ttx> Thanks everyone
16:00 <ttx> #endmeeting
16:00 *** openstack changes topic to "OpenStack Meetings || https://wiki.openstack.org/wiki/Meetings/"
16:00 <openstack> Meeting ended Wed Dec 16 16:00:03 2020 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
16:00 <mdelavergne> Happy holidays, see you next year, and thanks :)
16:00 <openstack> Minutes:        http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-12-16-15.00.html
16:00 <openstack> Minutes (text): http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-12-16-15.00.txt
16:00 <openstack> Log:            http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-12-16-15.00.log.html
16:00 <ttx> right on time
16:00 <ttx> that was close
16:00 <genekuo> thanks all, see you next year
16:00 <ttx> genekuo: thanks!
16:00 *** imtiazc has left #openstack-meeting-3
16:00 *** mdelavergne has quit IRC
16:04 *** macz_ has joined #openstack-meeting-3
16:16 *** eolivare_ has quit IRC
16:16 *** mlavalle has joined #openstack-meeting-3
16:31 *** ricolin_ has quit IRC
17:00 *** ralonsoh is now known as ralonsoh|afk
17:34 *** belmoreira has quit IRC
19:52 *** e0ne has quit IRC
21:15 *** baojg has quit IRC
21:17 *** baojg has joined #openstack-meeting-3
21:19 *** baojg has quit IRC
21:21 *** baojg has joined #openstack-meeting-3
22:09 *** ralonsoh|afk has quit IRC
22:30 *** raildo has quit IRC
22:42 *** slaweq has quit IRC
23:18 *** haleyb is now known as haleyb|away
23:25 *** baojg has quit IRC
23:26 *** baojg has joined #openstack-meeting-3
