15:00:12 #startmeeting large_scale_sig
15:00:13 Meeting started Wed Jan 27 15:00:12 2021 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:18 The meeting name has been set to 'large_scale_sig'
15:00:19 #topic Rollcall
15:00:24 o/
15:00:26 Who is here for the Large Scale SIG meeting?
15:00:30 belmoreira: hi!
15:00:37 hi ttx
15:00:38 hi
15:00:54 Hi genekuo!
15:01:19 pinging amorin mdelavergne
15:01:31 and anyone else interested
15:01:34 hi o/
15:01:49 jpward: maybe
15:02:14 alright let's get started
15:02:19 Our agenda for today is at:
15:02:21 #link https://etherpad.openstack.org/p/large-scale-sig-meeting
15:02:33 First, let's review progress and blockers on each of the stages of the Scaling journey
15:02:40 #topic Stage 1 - Configure
15:02:44 #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Configure
15:03:02 I was counting on amorin for an update, but we can push back to next meeting
15:03:09 hello!
15:03:17 ah!
15:03:20 sorry I am late
15:03:29 I see you added a few bits to that page
15:03:39 yes I did!
15:03:39 Anything we can help you with? Any blocker?
15:03:53 based on a discussion we had a few months ago on the mailing list
15:04:00 and based on what we did at OVH
15:04:12 I initiated the documentation about how to configure rabbit
15:04:16 at least in a cluster way
15:04:33 I tried to explain things based on my experience and what we do, that's far from being perfect IMO
15:04:44 Given the event-less nature of 2021, I'd say that starting threads on the mailing-list is probably a good way to extract knowledge and derive best practices
15:04:50 anyway, feel free to comment and update the part that sounds weird or not good
15:05:00 Like asking a specific question
15:05:04 yes
15:05:12 great, thanks
15:05:13 and then compile the answer
15:05:37 I was very happy with the thread on rabbit
15:05:49 so whenever you have doubts or when you want to gut-check some response, I'd recommend you raise a new thread
15:05:56 so I wanted to write that down to avoid losing it in internet archives
15:06:18 Sometimes it just won't catch on, but sometimes it will
15:06:40 it's like throwing fishing lines
15:06:50 perseverance is key
15:07:03 OK, anything else on that topic?
15:07:12 nop
15:07:21 #topic Stage 2 - Monitor (genekuo)
15:07:27 #link https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:07:45 haven't got time to update it
15:07:45 genekuo: There was some progress on the oslo.metrics front
15:08:00 yeah, I was pushing the patches
15:08:09 we are lacking reviewers, but at this stage in development it's probably normal (not a lot of users yet)
15:08:25 so I think single core +2A is probably good enough at this pre-1.0 stage
15:08:38 I think it's ok, I've got plenty of work to do on functional tests for the oslo.messaging part
15:08:44 which is why I ended up approving what we had
15:08:53 yeah, reviewers there will be a lot more cautious
15:09:21 The goal here is still to get to something in the Wallaby release
15:09:22 we don't have that part internally yet so I'll have to work from scratch
15:09:50 I think it's good to have anyway, making sure we don't break it with further code changes
15:09:55 yeah
15:10:09 OK, anything else on that stage?
15:10:22 nope, we have a project internally working on some monitoring stuff
15:10:35 We will have people contributing to the docs once we have progress on it
15:10:39 ok
15:10:45 #topic Stage 3 - Scale Up (ttx)
15:10:48 #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleUp
15:11:08 Not much progress to report, still looking for someone with experience of that stage to help drive
15:11:47 I can help
15:11:50 I might pick up the efforts to upstream OSarchiver since our last try (make it part of OSops) was a bit stalled
15:11:56 scale up is adding more computes to a region, right?
15:12:04 yes, until it breaks
15:12:11 ok
15:12:29 belmoreira: you already lead the last two stages :)
15:12:30 this is also related to the work that amorin is doing with configuration
15:12:39 everything is related (and overlapping)
15:13:13 ttx I have been there :) with things breaking when scaling up
15:13:19 #action ttx to revive the OSarchiver upstreaming effort
15:13:24 Maybe we can ask on the mailing list how many computes the users are running?
15:13:48 and try to start a discussion on how they managed to get there
15:13:57 yeah, some kind of a poll... good idea
15:13:57 if we have huge numbers
15:14:16 would be good to capture the general guideline that people use
15:14:21 yup
15:14:21 OK I'll do that
15:14:23 I'd like to say I had to tackle scaling up, but my experience was breaking at around 10 at first, and around 50, so I don't think we can call it "scaling" :D
15:14:34 :)
15:14:39 lol
15:14:56 belmoreira I think you were having a huge number of computes at CERN, right?
15:15:01 like 4k for a region, no?
15:15:20 amorin, yes, but we use cells
15:15:28 cheating :)
15:15:34 but Neutron keeps the 4k load
15:15:43 is it cheating if it works?
15:15:43 So I'll start a thread, and you can all contribute a short answer, that should bring some interest to that thread and encourage others to chime in
15:15:49 using vxlan?
15:15:57 ttx, ok
15:16:04 amorin linux bridge
15:16:19 #action ttx to start a "how many compute nodes in your typical cluster" discussion on the ML
15:16:20 amorin cheating again? :)
15:16:25 lol :)
15:16:44 even if the numbers are all over the map that's interesting data
15:17:02 the user survey does not really give that "per cluster" granularity
15:17:02 yes, I guess we will have all flavors
15:17:15 ok, great idea
15:17:23 anything else on that stage?
15:17:38 nope
15:17:43 #topic Stage 4/5 - Scale Out, upgrade & maintain (belmoreira)
15:17:49 #link https://wiki.openstack.org/wiki/Large_Scale_SIG/ScaleOut
15:17:51 #link https://wiki.openstack.org/wiki/Large_Scale_SIG/UpgradeAndMaintain
15:17:53 I was working on it
15:18:00 https://docs.google.com/document/d/1IKH6odpQ5vjTcBG1P_-MQbAClRD9orgyeMyTQ2P7de0/edit?usp=sharing
15:18:00 perfect
15:18:15 wow that's great
15:18:16 I'm using a google doc for my initial draft
15:18:51 let me know if this is what's expected
15:18:57 feel free to drop it in the wiki early, it's ok if it's not perfect or final
15:19:15 At first glance it strikes the right tone
15:19:15 cool I'll take a look at it
15:19:31 will do it after the meeting, I was just waiting for your review first
15:19:31 maybe we can all review it and discuss it next meeting?
15:19:43 or we quickly read it
15:19:49 short enough
15:20:10 sounds good to me either way.
15:20:24 it's still WIP
15:20:33 the first question could be moved to the ScaleUp stage, since it's about when to move to the next stage
15:20:48 (when scaling up is not longer enough)
15:20:53 no*
15:21:21 sounds great
15:21:25 maybe there should be a link to region and cells doc? but overall it's really great
15:21:41 mdelavergne good point
15:22:09 I will try to come up with other questions
15:22:27 let me know if you have any that can be added
15:22:30 #action belmoreira to post first draft of the ScaleOut FAQ
15:23:01 Alright, anything else on those late stages?
15:23:37 moving on then
15:23:39 #topic Our next "share your scaling story" event
15:23:50 not in the upgrades, I hope to have something similar for the next meeting
15:24:02 +1
15:24:06 As we discussed over the past months, we should try to regularly organize events where operators can share their scaling story experience
15:24:20 This primarily serves as an information gathering / consensus building exercise to help us document the Scaling Journey
15:24:28 But also doubles as a potential recruitment platform
15:24:36 It looks more and more like 2021 will be mostly virtual again
15:24:48 I'm keeping an eye on Open Infrastructure Foundation's own events, so that we can leverage those
15:24:56 But the 2021 plan is not final for those yet
15:25:11 Is there any other event we could piggyback on in the near future?
15:25:23 If nothing gets organized that works for us, I was thinking we could organize our own thing on Zoom, like around March
15:25:31 I was thinking about this...
15:25:40 I'm ok with it
15:25:52 but maybe for now we should wait to see when the OIF events will land
15:26:00 and I'm stealing this idea from the baremetal SIG
15:26:27 there is no reason it has to be in an established event... that just helps in visibility
15:27:01 instead of having a "big" event, how about we have a small presentation (user story, configuration experience, ...) every SIG meeting?
15:27:19 just a 10 min presentation that would set the pace for the meeting
15:27:45 you mean during the IRC meeting?
15:27:49 if we have interesting user stories it probably will catch the eye of several people that are interested in the subject
15:27:55 yes
15:28:06 and no
15:28:15 do you think we can find enough people willing to do that every 15 days?
15:28:19 we use the slot to have presentations over zoom
15:28:27 15 days probably hard
15:28:44 I think at least a month will be better
15:28:50 I think so...
it's only a 10 min presentation, so very little preparation is required
15:28:55 also it's informal
15:29:11 Yes in my experience it's been hard to get people to share things, so committing to a schedule with things every two weeks might be a bit much
15:29:25 maybe it won't work every meeting... but having this kind of presentation may increase the interest in the SIG
15:29:33 true.
15:29:36 instead of a "big" isolated event
15:30:10 the "big" event would probably just be a 90-min zoom call, but I see what you mean
15:30:47 trying to see how we can organize that, and handle the Zoom/IRC transition bit
15:30:58 we can find the initial volunteers inside the SIG :) and announce these events/discussions on the ML
15:31:38 Can the Foundation help with announcing events on social media?
15:31:38 you will show us the path :)
15:31:56 and maybe post the smallish presentations part onto the OIF Youtube channel
15:31:59 genekuo: sure
15:32:35 ttx yes, it would be easy to record those sessions
15:32:40 belmoreira: so we would move the meeting to Slack completely, or just the presentation part?
15:32:57 err. Zoom
15:33:01 not Slack
15:33:01 o/
15:33:19 Hi reedip
15:33:25 reedip: o/
15:33:31 Hi @genekuo
15:33:45 Hi ttx
15:33:56 ttx I don't have a strong opinion, but if it's a presentation meeting probably we do all the discussion there and if there are other topics to discuss we may move to IRC?
or just stop recording
15:34:57 i think the simplest is to discuss each meeting if we have content for a "presentation meeting" in two weeks, and if we do, switch the whole meeting to zoom
15:35:13 2 weeks is plenty enough to promote it
15:35:23 I think we can start the presentation first, once it ends, we probably still have time for other discussion then we can move to irc
15:35:38 ttx idea sounds good
15:35:43 I'm afraid that would create friction and we'd lose everyone
15:35:50 (switching to IRC mid-meeting)
15:36:10 I agree with ttx
15:36:25 I guess that leads us to next topic
15:36:28 #topic Next meeting
15:36:39 Our next meeting is in two weeks, February 10
15:36:55 Do we have someone interested in doing a 5-10 minute presentation?
15:36:59 I vote for no presentations next meeting, but it's only selfish because I won't be able to attend next meeting
15:37:03 or should we do IRC
15:37:09 lol
15:37:29 lol
15:37:43 so, inside the group, if we can think what we can present (10 min) and discuss/schedule next meeting it would be great
15:37:47 but I'm afraid if we wait we will never begin presentations, so I don't know
15:38:15 I couldn't present next meeting because I already have a presentation planned for local community next month
15:38:18 OK, let's keep the next meeting on IRC, but gather presentation ideas so that we can schedule a Zoom meeting for end of February
15:38:33 ack
15:38:38 great
15:38:42 ok
15:38:51 ack
15:38:55 yep
15:39:02 #action all to think about 5-10min presentations to use in a video version of our SIG meeting
15:39:17 #info next meeting: Feb 10 on IRC usual time
15:39:41 Anything else, anyone?
15:40:04 I do have a question for the team ... How or what factors do we monitor to justify a cluster's health?
15:40:35 cluster being?
15:40:52 I'd generally answer... latency, traffic, errors, saturation
15:41:03 but that's a very general SRE answer
15:41:07 Let's say a part of a DC or a section of rack
15:41:29 so cluster is made of servers?
15:41:37 not a rabbit cluster, or db cluster?
15:41:39 no it's made OF PEOPLE
15:41:48 Nah, not the rabbit/db
15:42:09 But an OpenStack cluster, sorry for not being explicit in my query
15:42:22 ttx :)
15:42:28 "soylent cluster"
15:42:35 Lol :D
15:43:19 you can monitor API response time, error rates in logs, messages in queues
15:43:21 stuff like that
15:43:29 for compute nodes we measure CPU utilisation, network traffic, steal time...
15:43:44 number of ports in 'build'
15:43:56 number of hypervisors online
15:44:18 number of instances ending in error
15:44:36 Though being generic, maybe if we don't have a proper differentiation, do we intend to make a list of these, for large scale systems (kind of like a general table for systems to consider, to monitor?)
15:44:54 I mean, is there a plan/scope for this in LargeScale SIG?
15:45:24 This stuff should be added to this page https://wiki.openstack.org/wiki/Large_Scale_SIG/Monitor
15:46:14 yes!
15:46:19 Ok, so there is a scope for it ..
15:46:24 Thnx genekuo
15:46:54 Thnx amorin, ttx, belmoreira :)
15:46:59 alright, if nothing else, we can close for today
15:47:07 Thanks everyone!
15:47:18 thanks!
15:47:33 #endmeeting
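
The health signals listed in the closing discussion (API response time, error rates in logs, messages in queues, CPU utilisation, steal time, ports in 'build', hypervisors online, instances ending in error) could be gathered into a simple threshold table. What follows is only a rough sketch of that idea, not something agreed in the meeting: the metric names, the `Check` and `evaluate` helpers, and every threshold value are hypothetical and would need tuning for a real deployment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Check:
    """One health signal plus the rule that flags it as unhealthy."""
    name: str
    is_unhealthy: Callable[[float], bool]
    description: str


# Hypothetical metric names and thresholds, for illustration only.
CHECKS: List[Check] = [
    Check("api_p95_latency_s", lambda v: v > 2.0, "API response time"),
    Check("log_error_rate_per_min", lambda v: v > 50, "error rate in logs"),
    Check("queue_depth", lambda v: v > 1000, "messages waiting in queues"),
    Check("compute_cpu_util_pct", lambda v: v > 90, "compute CPU utilisation"),
    Check("compute_steal_time_pct", lambda v: v > 5, "compute steal time"),
    Check("ports_in_build", lambda v: v > 20, "ports stuck in 'build'"),
    Check("hypervisors_online_ratio", lambda v: v < 0.95, "hypervisors online"),
    Check("instances_in_error", lambda v: v > 10, "instances ending in error"),
]


def evaluate(samples: Dict[str, float]) -> List[str]:
    """Return the names of checks that fail or have no sample at all.

    A missing sample is treated as a failure, on the assumption that
    not being able to measure a signal is itself worth flagging.
    """
    failing = []
    for check in CHECKS:
        value = samples.get(check.name)
        if value is None or check.is_unhealthy(value):
            failing.append(check.name)
    return failing


if __name__ == "__main__":
    samples = {
        "api_p95_latency_s": 0.4,
        "log_error_rate_per_min": 3,
        "queue_depth": 12,
        "compute_cpu_util_pct": 55,
        "compute_steal_time_pct": 12,  # above the hypothetical 5% limit
        "ports_in_build": 0,
        "hypervisors_online_ratio": 0.99,
        "instances_in_error": 1,
    }
    print(evaluate(samples))
```

Keeping the rules in a flat list makes it easy to mirror whatever table ends up on the Monitor wiki page; how the samples are actually collected is left out here on purpose.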