15:00:12 <ttx> #startmeeting large_scale_sig
15:00:13 <openstack> Meeting started Wed Jan 27 15:00:12 2021 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:18 <openstack> The meeting name has been set to 'large_scale_sig'
15:00:19 <ttx> #topic Rollcall
15:00:24 <belmoreira> o/
15:00:26 <ttx> Who is here for the Large Scale SIG meeting ?
15:00:30 <ttx> belmoreira: hi!
15:00:37 <belmoreira> hi ttx
15:00:38 <genekuo> hi
15:00:54 <ttx> Hi genekuo !
15:01:19 <ttx> pinging amorin mdelavergne
15:01:31 <ttx> and anyone else interested
15:01:34 <mdelavergne> hi o/
15:01:49 <ttx> jpward: maybe
15:02:14 <ttx> alright let's get started
15:02:19 <ttx> Our agenda for today is at:
15:02:21 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
15:02:33 <ttx> First, let's review progress and blockers on each of the stages of the Scaling journey
15:02:40 <ttx> #topic Stage 1 - Configure
15:02:44 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Configure
15:03:02 <ttx> I was counting on amorin for an update, but we can push back to next meeting
15:03:09 <amorin> hello!
15:03:17 <ttx> ah!
15:03:20 <amorin> sorry  I am late
15:03:29 <ttx> I see you added a few bits to that page
15:03:39 <amorin> yes I did!
15:03:39 <ttx> Anything we can help you with? Any blocker?
15:03:53 <amorin> based on a discussion we had a few months ago on the mailing list
15:04:00 <amorin> and based on what we did at OVH
15:04:12 <amorin> I initiated the documentation about how to configure rabbit
15:04:16 <amorin> at least in a cluster way
15:04:33 <amorin> I tried to explain things based on my experience and what we do, that's far from being perfect IMO
15:04:44 <ttx> Given the event-less nature of 2021, I'd say that starting threads on the mailing-list is probably a good way to extract knowledge and derive best practices
15:04:50 <amorin> anyway, feel free to comment and update the part that sounds weird or not good
15:05:00 <ttx> Like asking a specific question
15:05:04 <amorin> yes
15:05:12 <belmoreira> great, thanks
15:05:13 <ttx> and then compile the answer
15:05:37 <amorin> I was very happy with the thread on rabbit
15:05:49 <ttx> so whenever you have doubts or when you want to gut-check some response, I'd recommend you raise a new thread
15:05:56 <amorin> so I wanted to write that down to avoid losing it in internet archives
15:06:18 <ttx> Sometimes it just won't catch on, but sometimes it will
15:06:40 <ttx> it's like throwing fishing lines
15:06:50 <ttx> perseverance is key
15:07:03 <ttx> OK, anything else on that topic?
15:07:12 <amorin> nop
15:07:21 <ttx> #topic Stage 2 - Monitor (genekuo)
15:07:27 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:07:45 <genekuo> haven't got time to update it
15:07:45 <ttx> genekuo: There was some progress on the oslo.metrics front
15:08:00 <genekuo> yeah, I was pushing the patches
15:08:09 <ttx> we are lacking reviewers, but at this stage in development it's probably normal (not a lot of users yet)
15:08:25 <ttx> so I think single core +2A is probably good enough at this pre-1.0 stage
15:08:38 <genekuo> I think it's ok, I've got plenty of work to do on functional tests for the oslo.messaging part
15:08:44 <ttx> which is why I ended up approving what we had
15:08:53 <ttx> yeah, reviewers there will be a lot more cautious
15:09:21 <ttx> The goal here is still to get to something in the Wallaby release
15:09:22 <genekuo> we don't have that part internally yet so I'll have to work from scratch
15:09:50 <ttx> I think it's good to have anyway, making sure we don't break it with further code changes
15:09:55 <genekuo> yeah
15:10:09 <ttx> OK, anything else on that stage?
15:10:22 <genekuo> nope, we have a project internally working on some monitoring stuff
15:10:35 <genekuo> We will have people contributing to the docs once we have progress on it
15:10:39 <ttx> ok
15:10:45 <ttx> #topic Stage 3 - Scale Up (ttx)
15:10:48 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleUp
15:11:08 <ttx> Not much progress to report, still looking for someone with experience of that stage to help drive
15:11:47 <belmoreira> I can help
15:11:50 <ttx> I might pick up the efforts to upstream OSarchiver since our last try (make it part of OSops) was a bit stalled
15:11:56 <amorin> scale up is adding more computes to a region, right?
15:12:04 <ttx> yes, until it breaks
15:12:11 <amorin> ok
15:12:29 <ttx> belmoreira: you already lead the last two stages :)
15:12:30 <belmoreira> this is also related to the work that amorin is doing with configuration
15:12:39 <ttx> everything is related (and overlapping)
15:13:13 <belmoreira> ttx I have been there :) with things breaking when scaling up
15:13:19 <ttx> #action ttx to revive the OSarchiver upstreaming effort
15:13:24 <amorin> Maybe we can ask on the mailing list how many computes the users are running?
15:13:48 <amorin> and try to start a discussion on how they managed to get there
15:13:57 <ttx> yeah, some kind of a poll... good idea
15:13:57 <amorin> if we have huge numbers
15:14:16 <ttx> would be good to capture the general guideline that people use
15:14:21 <amorin> yup
15:14:21 <ttx> OK I'll do that
15:14:23 <mdelavergne> I'd like to say I had to tackle scaling up, but my experience was breaking at around 10 at first, and around 50, so I don't think we can call it "scaling" :D
15:14:34 <amorin> :)
15:14:39 <belmoreira> lol
15:14:56 <amorin> belmoreira I think you had a huge number of computes at CERN, right?
15:15:01 <amorin> like 4k for a region, no?
15:15:20 <belmoreira> amorin, yes, but we use cells
15:15:28 <amorin> cheating :)
15:15:34 <belmoreira> but Neutron keeps the 4k load
15:15:43 <mdelavergne> is it cheating if it works?
15:15:43 <ttx> So I'll start a thread, and you can all contribute a short answer, that should bring some interest to that thread and encourage others to chime in
15:15:49 <amorin> using vxlan?
15:15:57 <amorin> ttx, ok
15:16:04 <belmoreira> amorin linux bridge
15:16:19 <ttx> #action ttx to start a "how many compute nodes in your typical cluster" discussion on the ML
15:16:20 <belmoreira> amorin cheating again? :)
15:16:25 <amorin> lol :)
15:16:44 <ttx> even if the numbers are all over the map that's interesting data
15:17:02 <ttx> the user survey does not really give that "per cluster" granularity
15:17:02 <belmoreira> yes, I guess we will have all flavors
15:17:15 <ttx> ok, great idea
15:17:23 <ttx> anything else on that stage?
15:17:38 <genekuo> nope
15:17:43 <ttx> #topic Stage 4/5 - Scale Out, upgrade & maintain (belmoreira)
15:17:49 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleOut
15:17:51 <ttx> #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:17:53 <belmoreira> I was working on it
15:18:00 <belmoreira> https://docs.google.com/document/d/1IKH6odpQ5vjTcBG1P_-MQbAClRD9orgyeMyTQ2P7de0/edit?usp=sharing
15:18:00 <ttx> perfect
15:18:15 <ttx> wow that's great
15:18:16 <belmoreira> I'm using a google doc for my initial draft
15:18:51 <belmoreira> let me know if this is what's expected
15:18:57 <ttx> feel free to drop it in the wiki early, it's ok if it's not perfect or final
15:19:15 <ttx> At first glance it strikes the right tone
15:19:15 <genekuo> cool I'll take a look at it
15:19:31 <belmoreira> will do it after the meeting, I was just waiting for your review first
15:19:31 <ttx> maybe we can all review it and discuss it next meeting?
15:19:43 <ttx> or we quickly read it
15:19:49 <ttx> short enough
15:20:10 <belmoreira> sounds good for me either way.
15:20:24 <belmoreira> it's still WIP
15:20:33 <ttx> the first question could be moved to the ScaleUp stage, since it's about when to move to the next stage
15:20:48 <ttx> (when scaling up is no longer enough)
15:21:21 <amorin> sounds great
15:21:25 <mdelavergne> maybe there should be a link to region and cells doc? but overall it's really great
15:21:41 <belmoreira> mdelavergne good point
15:22:09 <belmoreira> I will try to come up with other questions
15:22:27 <belmoreira> let me know if you have any that can be added
15:22:30 <ttx> #action belmoreira to post first draft of the ScaleOut FAQ
15:23:01 <ttx> Alright, anything else on those late stages?
15:23:37 <ttx> moving on then
15:23:39 <ttx> #topic Our next "share your scaling story" event
15:23:50 <belmoreira> not in the upgrades, I hope to have something similar for the next meeting
15:24:02 <ttx> +1
15:24:06 <ttx> As we discussed over the past months, we should try to regularly organize events where operators can share their scaling story experience
15:24:20 <ttx> This primarily serves as an information gathering / consensus building exercise to help us document the Scaling Journey
15:24:28 <ttx> But also doubles as a potential recruitment platform
15:24:36 <ttx> It looks more and more like 2021 will be mostly virtual again
15:24:48 <ttx> I'm keeping an eye on Open Infrastructure Foundation's own events, so that we can leverage those
15:24:56 <ttx> But the 2021 plan is not final for those yet
15:25:11 <ttx> Is there any other event we could piggyback on in the near future?
15:25:23 <ttx> If nothing gets organized that works for us, I was thinking we could organize our own thing on Zoom, like around March
15:25:31 <belmoreira> I was thinking about this...
15:25:40 <genekuo> I'm ok with it
15:25:52 <ttx> but maybe for now we should wait to see when the OIF events will land
15:26:00 <belmoreira> and I'm stealing this idea from the baremetal SIG
15:26:27 <ttx> there is no reason it has to be in an established event... that just helps in visibility
15:27:01 <belmoreira> instead of having a "big" event, how about we have a small presentation (user story, configuration experience, ...) every SIG meeting?
15:27:19 <belmoreira> just a 10 min presentation that would set the pace for the meeting
15:27:45 <ttx> you mean during the IRC meeting?
15:27:49 <belmoreira> if we have interesting user stories it probably will catch the eye of several people that are interested in the subject
15:27:55 <belmoreira> yes
15:28:06 <belmoreira> and no
15:28:15 <amorin> do you think we can find enough people willing to do that every 15days?
15:28:19 <belmoreira> we use the slot to have presentations over zoom
15:28:27 <genekuo> 15 days probably hard
15:28:44 <genekuo> I think at least a month will be better
15:28:50 <belmoreira> I think so... it's only a 10 min presentation, so very little preparation is required
15:28:55 <belmoreira> also it's informal
15:29:11 <ttx> Yes in my experience it's been hard to get people to share things, so committing to a schedule with things every two weeks might be a bit much
15:29:25 <belmoreira> maybe it won't work every meeting... but having this kind of presentation may increase the interest in the SIG
15:29:33 <ttx> true.
15:29:36 <belmoreira> instead of a "big" isolated event
15:30:10 <ttx> the "big" event would probably just be a 90-min zoom call, but I see what you mean
15:30:47 <ttx> trying to see how we can organize that, and handle the Zoom/IRC transition bit
15:30:58 <belmoreira> we can find the initial volunteers inside the SIG :) and announce these events/discussions in the ML
15:31:38 <genekuo> Can Foundation help with announcing events on social media?
15:31:38 <amorin> you will show us the path :)
15:31:56 <ttx> and maybe post the smallish presentations part onto the OIF Youtube channel
15:31:59 <ttx> genekuo: sure
15:32:35 <belmoreira> ttx yes, it would be easy to record those sessions
15:32:40 <ttx> belmoreira: so we would move the meeting to Zoom completely, or just the presentation part?
15:33:01 <reedip> o/
15:33:19 <genekuo> Hi reedip
15:33:25 <ttx> reedip: o/
15:33:31 <reedip> Hi @genekuo
15:33:45 <reedip> Hi ttx
15:33:56 <belmoreira> ttx I don't have a strong opinion, but if it's a presentation meeting we probably do all the discussion there, and if there are other topics to discuss we may move to IRC? or just stop recording
15:34:57 <ttx> i think the simplest is to discuss each meeting if we have content for a "presentation meeting" in two weeks, and if we do, switch the whole meeting to zoom
15:35:13 <ttx> 2 weeks is plenty enough to promote it
15:35:23 <genekuo> I think we can start the presentation first, once it ends, we probably still have time for other discussion then we can move to irc
15:35:38 <genekuo> ttx's idea sounds good
15:35:43 <ttx> I'm afraid that would create friction and we'd lose everyone
15:35:50 <ttx> (switching to IRC mid-meeting)
15:36:10 <belmoreira> I agree with ttx
15:36:25 <ttx> I guess that leads us to next topic
15:36:28 <ttx> #topic Next meeting
15:36:39 <ttx> Our next meeting is in two weeks, February 10
15:36:55 <ttx> Do we have someone interested in doing a 5-10 minute presentation?
15:36:59 <mdelavergne> I vote for no presentations next meeting, but it's only selfish because I won't be able to attend next meeting
15:37:03 <ttx> or should we do IRC
15:37:09 <ttx> lol
15:37:29 <genekuo> lol
15:37:43 <belmoreira> so, inside the group, if we can think what we can present (10 min) and discuss/schedule next meeting it would be great
15:37:47 <mdelavergne> but I'm afraid if we wait we will never begin presentations, so I don't know
15:38:15 <genekuo> I couldn't present next meeting because I already have a presentation planned for the local community next month
15:38:18 <ttx> OK, let's keep the next meeting on IRC, but gather presentation ideas so that we can schedule a Zoom meeting for end of February
15:38:33 <amorin> ack
15:38:38 <belmoreira> great
15:38:42 <genekuo> ok
15:38:51 <reedip> ack
15:38:55 <mdelavergne> yep
15:39:02 <ttx> #action all to think about 5-10min presentations to use in a video version of our SIG meeting
15:39:17 <ttx> #info next meeting: Feb 10 on IRC usual time
15:39:41 <ttx> Anything else, anyone?
15:40:04 <reedip> I do have a question for the team ... What factors do we monitor to assess a cluster's health?
15:40:35 <belmoreira> cluster being?
15:40:52 <ttx> I'd generally answer... latency, traffic, errors, saturation
15:41:03 <ttx> but that's a very general SRE answer
15:41:07 <reedip> Let's say a part of a DC or a section of rack
15:41:29 <amorin> so cluster is made of servers ?
15:41:37 <amorin> not a rabbit cluster, or db cluster?
15:41:39 <ttx> no it's made OF PEOPLE
15:41:48 <reedip> Nah, not the rabbit/db
15:42:09 <reedip> But an OpenStack cluster, sorry for not being explicit in my query
15:42:22 <belmoreira> ttx :)
15:42:28 <mdelavergne> "soylent cluster"
15:42:35 <reedip> Lol :D
15:43:19 <amorin> you can monitor API response time, error rates in logs, messages in queues
15:43:21 <amorin> stuff like that
15:43:29 <belmoreira> for compute nodes we measure CPU utilisation, network traffic, steal time...
15:43:44 <amorin> number of ports in 'build'
15:43:56 <amorin> number of hypervisors online
15:44:18 <amorin> number of instances ending in error
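Editor's sketch: the signals listed above (API response time, error rates in logs, messages in queues, ports in 'build', hypervisors online, instances ending in error) could be checked together along the following lines. This is only an illustration under stated assumptions: the Snapshot class, the field names, the LIMITS values and the MIN_HYPERVISOR_RATIO value are all hypothetical and do not come from the meeting or from any OpenStack API. How the numbers are collected is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One sample of the signals mentioned in the discussion (hypothetical names)."""
    api_response_seconds: float
    log_errors_per_minute: float
    queued_messages: int
    ports_in_build: int
    hypervisors_online: int
    hypervisors_total: int
    instances_in_error: int


# Hypothetical upper limits, for illustration only; tune per deployment.
LIMITS = {
    "api_response_seconds": 2.0,
    "log_errors_per_minute": 10.0,
    "queued_messages": 1000,
    "ports_in_build": 50,
    "instances_in_error": 20,
}
# Hypothetical minimum share of hypervisors that must be online.
MIN_HYPERVISOR_RATIO = 0.95


def find_problems(s: Snapshot) -> List[str]:
    """Return the names of the signals that are outside their limit."""
    problems = [name for name, limit in LIMITS.items() if getattr(s, name) > limit]
    if s.hypervisors_total > 0:
        ratio = s.hypervisors_online / s.hypervisors_total
        if ratio < MIN_HYPERVISOR_RATIO:
            problems.append("hypervisors_online")
    return problems


if __name__ == "__main__":
    sample = Snapshot(0.4, 3.0, 120, 2, 398, 400, 1)
    print(find_problems(sample) or "no problems found")
```

A flat list of named signals with one limit each keeps the check easy to extend as more items get documented; real deployments would likely want per-service limits and trends rather than single thresholds.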
15:44:36 <reedip> Though this is generic, and maybe we don't have a proper differentiation, do we intend to make a list of these for large scale systems (a kind of general table of things to consider monitoring)?
15:44:54 <reedip> I mean, is there a plan/scope for this in LargeScale SIG?
15:45:24 <genekuo> This stuff should be added to this page https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:46:14 <ttx> yes!
15:46:19 <reedip> Ok, so there is a scope for it ..
15:46:24 <reedip> Thnx genekuo
15:46:54 <reedip> Thnx amorin, ttx, belmoreira :)
15:46:59 <ttx> alright, if nothing else, we can close for today
15:47:07 <ttx> Thanks everyone!
15:47:18 <genekuo> thanks!
15:47:33 <ttx> #endmeeting