Wednesday, 2020-09-09

*** tsmith2 has quit IRC00:33
*** ricolin_ has joined #openstack-meeting-301:31
*** dconde has joined #openstack-meeting-302:34
*** dconde has quit IRC02:35
*** njohnston has quit IRC04:22
*** belmoreira has joined #openstack-meeting-304:23
*** ralonsoh has joined #openstack-meeting-306:34
*** e0ne has joined #openstack-meeting-306:35
*** e0ne has quit IRC06:39
*** slaweq has joined #openstack-meeting-306:51
*** tosky has joined #openstack-meeting-307:57
*** slaweq has quit IRC08:20
*** apetrich has joined #openstack-meeting-308:26
*** slaweq has joined #openstack-meeting-308:29
*** apetrich has quit IRC08:39
*** ianychoi__ has quit IRC08:40
*** apetrich has joined #openstack-meeting-308:44
*** slaweq has quit IRC08:49
*** e0ne has joined #openstack-meeting-309:32
*** ricolin_ has quit IRC09:38
*** ralonsoh has quit IRC10:01
*** ralonsoh has joined #openstack-meeting-310:02
*** raildo has joined #openstack-meeting-311:30
*** lkoranda has joined #openstack-meeting-311:36
*** lkoranda has quit IRC11:52
*** lkoranda has joined #openstack-meeting-311:52
*** njohnston has joined #openstack-meeting-312:08
*** slaweq has joined #openstack-meeting-314:03
*** lkoranda has quit IRC14:08
*** ricolin_ has joined #openstack-meeting-314:11
*** ricolin_ has quit IRC14:13
*** lkoranda has joined #openstack-meeting-315:03
*** slaweq has quit IRC15:04
*** tsmith2 has joined #openstack-meeting-315:11
*** lpetrut has joined #openstack-meeting-315:12
*** mlavalle has joined #openstack-meeting-315:17
*** tsmith2 has quit IRC15:20
*** tsmith2 has joined #openstack-meeting-315:22
*** tsmith_ has joined #openstack-meeting-315:25
*** tsmith2 has quit IRC15:27
*** tsmith_ is now known as tsmith215:27
*** belmoreira has quit IRC15:40
*** mdelavergne has joined #openstack-meeting-315:51
*** penick has joined #openstack-meeting-315:51
*** mdelavergne has quit IRC15:53
*** mdelavergne has joined #openstack-meeting-315:54
*** lpetrut has quit IRC15:59
ttx#startmeeting large_scale_sig16:00
openstackMeeting started Wed Sep  9 16:00:07 2020 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.16:00
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.16:00
ttx#topic Rollcall16:00
*** openstack changes topic to " (Meeting topic: large_scale_sig)"16:00
openstackThe meeting name has been set to 'large_scale_sig'16:00
*** openstack changes topic to "Rollcall (Meeting topic: large_scale_sig)"16:00
ttxWho is here for the Large Scale SIG meeting ?16:00
mdelavergneHi o/16:00
*** eandersson has joined #openstack-meeting-316:01
ttxI see penick is in the channel list16:01
amorinhey!16:01
ttxand amorin16:01
* penick waves16:01
ttxand eandersson16:02
eanderssono/16:02
ttxAlright, let's get started16:02
ttxOur agenda for today is at:16:02
ttx#link https://etherpad.openstack.org/p/large-scale-sig-meeting16:02
ttx#topic Welcome newcomers16:02
*** openstack changes topic to "Welcome newcomers (Meeting topic: large_scale_sig)"16:02
ttxFollowing the Opendev event on Large scale deployments, we had several people expressing interest in joining the SIG16:03
ttxSeveral of them are in the US and not interested in our original meeting time at 8utc16:03
*** priteau has joined #openstack-meeting-316:03
ttxSo we decided some time ago to rotate between US+EU / APAC+EU times, as the majority of the group is from EU16:03
ttxThis is our second US+EU meet... but the first was not really successful at attracting new participants16:03
ttxso I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in16:03
ttxso we can shape the direction of the SIG accordingly16:03
ttxpenick: not sure you need any intro, but go16:04
penickHeh, hey folks. I work for Verizon Media and act as the director/architect for our private cloud.16:04
ttxAnything specific you're interested in within this SIG, beyond sharing your experience?16:05
penickI'm here to offer some perspective and feedback on our needs as a large scale deployer, and learn other use cases. Ideally i'd like to align our upstream development efforts to help make the product better for large and small deployers16:05
ttxgreat! You're in the right place16:05
ttxeandersson: care to quickly introduce yourself?16:06
amorinwelcome16:06
eanderssonSure!16:06
eanderssonI work at Blizzard Entertainment as a Technical Lead and I am responsible for the Private Cloud here.16:07
ttxAnything specific you're interested in within this SIG, beyond sharing your experience?16:07
ttx(yeah, i copy pasted that)16:08
mdelavergneWelcome to the both of you :)16:08
eanderssonVery similar to penick, here to share perspective and provide feedback.16:08
ttxPerfect. I'll let amorin and mdelavergne quickly introduce themselves... I'm working for the OSF, facilitating this group's operations16:09
ttxNot much of a first-hand experience on large scale deployments, lots of second hand accounts though :)16:09
ttxOther regular SIG members include belmoreira of CERN fame, and masahito from LINE, who hopefully is sleeping at this hour.16:10
amorinhey, i work for ovh, i am mostly involved in deploying openstack for our public cloud offering16:10
mdelavergneHi, I'm a PhD student working on a way to geo-distribute applications, so I mostly do experiments on large-scale16:11
amorinand sorry, i am on my phone :( not easy to type here16:11
ttxheh16:11
penicki'm glad to meet you all :)16:11
ttxAlright, let's jump right in in our current workstreams... don't hesitate to interrupt me to ask questions16:11
ttx#topic Progress on "Documenting large scale operations" goal16:11
*** openstack changes topic to "Progress on "Documenting large scale operations" goal (Meeting topic: large_scale_sig)"16:11
ttx#link https://etherpad.openstack.org/p/large-scale-sig-documentation16:11
ttxSo this is one of our current goals for the SIG - produce better documentation to help operators set up large deployments.16:12
ttxOne work stream is around collecting articles, howto and tips & tricks around large scale published over the years16:12
ttxYou can add any you know of on that etherpad I just linked to16:12
ttxAnother workstream is around documenting better configuration values when you start hitting issues with default values16:12
ttxamorin is leading that work at https://wiki.openstack.org/wiki/Large_Scale_Configuration_Guidelines but could use help16:13
ttxBasically the default settings only carry you so far16:13
ttxand if we can agree on commonly-tweaked settings at scale, we should find a way to document them16:13
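(As an illustration of the kind of settings this workstream wants to document — a minimal, hedged sketch. The option names below are real nova/neutron options, but the values are purely illustrative assumptions for a large controller, not agreed SIG guidance.)

    # nova.conf -- values shown are illustrative only
    [DEFAULT]
    osapi_compute_workers = 16     # more API workers on large controllers
    max_concurrent_builds = 20     # concurrent builds per compute (default 10)
    rpc_response_timeout = 120     # raised when RPC calls time out under load
    [database]
    max_pool_size = 50             # bigger DB connection pool for busy services

    # neutron.conf -- values shown are illustrative only
    [DEFAULT]
    api_workers = 16
    rpc_workers = 8
    agent_down_time = 150          # raised so busy agents are not marked dead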
amorinyes i'd love to be able to push that forward, but i lack some time mostly16:13
ttxWe had another work item on collecting metrics/billing stories16:14
ttxThat points to one critical activity of this SIG:16:14
ttxIt's all about sharing your experience operating large scale deployments of OpenStack16:14
ttxso that we can derive best practices and/or fix common issues16:14
ttxOnly amorin contributed the story for OVH on the etherpad, so please add to that if you have time16:15
ttxMaybe there is no common pattern there, but my guess is there is, and it's not necessarily aligned with what upstream provides16:15
ttxSo I'll push again an action to contribute to that:16:15
ttx#action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation16:15
ttxAnd finally, we have a workstream on tooling, prompted by OVH's interest in pushing osarchiver upstream16:16
ttxThe current status there is that the "OSOps" effort is being revived, under the 'Operation Docs and Tooling' SIG16:16
ttxbelmoreira and I signed up to make sure that was moving forward, but smcginnis has been active leading it16:16
ttxThe setup is still in progress at https://review.opendev.org/#/c/749834/16:16
penickI can ask my team to see if they can distill some of the useful configuration changes we've made. One tricky thing is there are so many ways to deploy OpenStack that there are a lot of different contexts for when to use some settings vs others. For example, we focus on fewer, larger openstack deployments, with thousands of hypervisors per VM cluster and tens of thousands of baremetal nodes per baremetal cluster. So some things16:16
penick we've done will not really apply to a large scale deployer who might prefer many smaller clusters.16:16
ttxpenick: I think the info will be useful anyway. You're right that in some cases there won't be a common practice and it will be all over the map16:17
amorinagree, but still useful to know how people are doing scale16:18
ttxBut from some early discussions at the SIG it was apparent that some of the pain points are still common16:18
ttxTo finish on osarchiver, once OSOps repo is set up we'll start working on pushing osarchiver in it16:18
ttxAnything else on this topic? Any other direction you'd like this broad goal to take?16:19
amoringreat16:19
ttx(the SIG mostly goes where its members push it, so if you're interested in something specific, please let the group know)16:19
ttxOtherwise I'll keep on exposing current status16:20
ttx#topic Progress on "Scaling within one cluster" goal16:20
*** openstack changes topic to "Progress on "Scaling within one cluster" goal (Meeting topic: large_scale_sig)"16:20
ttx#link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling16:20
ttxThis is the other initial goal of the SIG - identify, measure and push back common scaling limitations within one cluster16:20
ttxBasically, as you add nodes to a single OpenStack cluster, at one point you reach scaling limits.16:20
ttxHow do we push that limit back, and reduce the need to create multiple clusters ?16:21
ttxAgain there may not be a common pattern, so...16:21
ttxFirst task here is to collect those scaling stories. You have a large scale deployment, what happens if you add too many nodes ?16:21
ttxWe try to collect those stories at https://etherpad.openstack.org/p/scaling-stories16:21
ttxand then move them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories once edited16:21
ttxSo please add to that if you have a story to share! (can be old)16:22
eanderssonNeutron has been a big pain point for us, with memory usage being excessive, to the point where we ended up running Neutron in VMs.16:22
ttxNeutron failed first? Usually people point to Rabbit first, but maybe you optimized that already16:22
eanderssonRabbit has been good to us, besides recovering after a crash.16:22
penickeandersson: interesting, was neutron running in containers prior to moving to VMs?16:23
eanderssonYea - we containerized all of our deployments when moving to Rocky about 2 years ago.16:23
ttxamorin: you're not (yet) running components containerized, right?16:24
amorinnop, not yet16:25
amorinwe are working on it16:25
penickeandersson good to know, thanks! We're working on containerizing all of our stuff now.. I'll make sure my team is  aware16:25
ttxSo as I said, one common issue when adding nodes is around RabbitMQ falling down, even if I expect most people in this group to have moved past that16:26
ttxAnd so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them16:26
amorinis your database and rabbit cluster running in containers as well?16:26
ttxBased on what LINE used internally to solve that16:26
eanderssonNot yet.16:27
ttxThis resulted in https://opendev.org/openstack/oslo.metrics16:27
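(For readers new to the idea, a rough conceptual sketch of what instrumenting RPC calls looks like — timing each oslo.messaging client call and exporting counters and latency. This is not the actual oslo.metrics API; the metric names and the wrapper are made up for illustration, see the repository above for the real implementation.)

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metric names, for illustration only.
    RPC_CALLS = Counter('rpc_client_calls_total', 'RPC calls issued',
                        ['method', 'status'])
    RPC_LATENCY = Histogram('rpc_client_call_seconds', 'RPC call latency',
                            ['method'])

    def timed_rpc_call(client, ctxt, method, **kwargs):
        """Wrap an oslo.messaging RPCClient.call() and record metrics."""
        start = time.monotonic()
        try:
            result = client.call(ctxt, method, **kwargs)
            RPC_CALLS.labels(method, 'ok').inc()
            return result
        except Exception:
            RPC_CALLS.labels(method, 'error').inc()
            raise
        finally:
            RPC_LATENCY.labels(method).observe(time.monotonic() - start)

    # Expose the metrics for scraping, e.g. on port 8000.
    start_http_server(8000)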
ttxNext step there is to add basic tests, so that we are reasonably confident we do not introduce regressions16:27
ttxI had an action item I did not finish to evaluate that, let me push it back16:27
ttx#action ttx to look into a basic test framework for oslo.metrics16:27
ttxAlso masahito was planning to push the latest patches to it, but I haven't seen anything posted yet16:27
ttx#action masahito to push latest patches to oslo.metrics16:28
ttxamorin: did you check about applicability of oslo.metrics within OVH?16:28
amorinnot yet, still in my todo16:28
ttxalright pushing that back too16:28
ttx#action amorin to see if oslo.metrics could be tested at OVH16:28
ttxFinally we recently helped with OVH's patch to add a basic ping to oslo.middleware16:28
ttxThat can be useful to monitor a rabbitMQ setup and detect weird failure cases16:29
ttx(there were threads on the ML about it)16:29
ttxHappy to report the patch finally landed at https://review.opendev.org/#/c/749834/16:29
ttxAnd shipped in oslo.messaging 12.4.0!16:29
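(A minimal sketch of how an RPC ping can be used for monitoring, assuming the server side runs oslo.messaging >= 12.4.0 with the new ping option enabled. The option name rpc_ping_enabled and the special method name oslo_rpc_server_ping are quoted from memory of that feature and should be verified against the release notes; the topic and server values are placeholders.)

    import oslo_messaging
    from oslo_config import cfg

    conf = cfg.CONF
    conf(['--config-file', '/etc/nova/nova.conf'])  # reuse transport_url etc.

    transport = oslo_messaging.get_rpc_transport(conf)
    # Placeholder target: the compute RPC server on one hypervisor.
    target = oslo_messaging.Target(topic='compute', server='compute-0001')
    client = oslo_messaging.RPCClient(transport, target, timeout=10)

    # Needs rpc_ping_enabled=True on the server side; method name assumed.
    client.call({}, 'oslo_rpc_server_ping')
    print('RPC server answered the ping')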
amorinyay!16:29
mdelavergnecongrats!16:29
eanderssonI believe a lot of the critical recovery issues with RabbitMQ are fixed in Ussuri. Especially the excessive number of exchanges created by RPC calls that I believe caused all our recovery issues.16:30
ttxeandersson: yes I think there were lots of improvements there16:31
ttxDoes that sound like a good goal for you, penick and eandersson, or do you think we'll also fail to extract common patterns and low-hanging-fruit improvements to raise the scaling limit?16:31
penickSorry, not understanding.. Is what a good goal?16:32
ttxoh the "Scaling within one cluster" goal16:32
ttxTrying to push back when you need to scale out to multiple clusters16:33
eanderssonYea - that is a good goal.16:33
penickAh, yeah. That's good for me.16:34
ttxok, moving on16:34
ttx#topic PTG/Summit plans16:34
*** openstack changes topic to "PTG/Summit plans (Meeting topic: large_scale_sig)"16:34
ttxFor the PTG we decided last meeting to ask for a PTG room around our usual meeting times, to serve as our regular meeting while recruiting potential new members16:34
ttxSo I requested Wednesday 7UTC-8UTC and 16UTC-17UTC (Oct 28)16:35
ttx(first one is a bit early but there was no slot scheduled at our normal time)16:35
ttxFor the summit we discussed proposing one Forum session around collecting scaling stories, which I still have to file16:35
ttxThe idea is to get people to talk and we can document their story afterwards16:36
ttxOne learning from Opendev is that to get a virtual discussion going, it's good to prime the pump by having 2-3 people signed up to discuss the topic16:36
ttxSo... anyone interested in sharing their scaling story in the context of a Forum session?16:36
ttxI just fear unless we start talking nobody will dare expose their case16:37
ttxI can moderate but I don't have any first-hand scaling story to share16:37
amorini can maybe do a small one16:37
penickI can share something, or ask someone from my team to16:38
ttxgreat, thanks! I think that will help16:38
ttxWe'll try to actively promote that session in the openstack ops community, so hopefully it will work, both as a recruiting mechanism and a way to collect data16:39
amorinok16:39
ttx#action ttx to file Scaling Stories forum session, with amorin and someone from penick's team to help get it off the ground16:40
ttx#topic Next meeting16:40
*** openstack changes topic to "Next meeting (Meeting topic: large_scale_sig)"16:40
ttxIf we continue on the same rhythm:16:40
ttxNext meeting will be EU-APAC on Sept 23, 8utc.16:40
ttxThen next US-EU meeting will be Oct 7, 16utc.16:40
ttxHow does that sound?16:41
ttxAny objection to continuing to hold those on IRC?16:41
amoringood16:41
eanderssonSounds good to me.16:41
mdelavergneit's fine16:41
ttxFeel free to invite others from your teams to join if they can help, or fellow openstack ops you happen to discuss with16:42
ttx#info next meetings: Sep 23, 8:00UTC; Oct 7, 16:00UTC16:42
ttxThat is all I had, opening the floor for comments, questions, desires16:42
ttx#topic Open discussion16:42
*** openstack changes topic to "Open discussion (Meeting topic: large_scale_sig)"16:42
penick16utc is fine for me, i'm ok with either IRC or video chat16:42
ttxAnything else this SIG should tackle? We are trying to have reasonable goals as we all have a lot of work besides the SIG16:43
ttxand not sweat it if things go slow16:43
ttxbut still expose and share knowledge and tools that are distributed across the large deployment ops community16:44
eanderssonMeaningful monitoring of the control plane has always been difficult for me.16:44
penickI'd be interested to learn how many large scale deployers have dedicated time for upstream contributions, and if there's any interest in collaborating on things.16:44
penickFor example, producing reference deployment documentation, similar to what we did for the edge deployments.16:44
eanderssonWe focus heavily on anything that we think is critical and is currently lacking the support (e.g. our contributions to Senlin)16:45
penickah, meaningful monitoring is a good one16:45
eanderssonWe don't necessarily have the time to dedicate someone to a project like Nova or Neutron, just because of the sheer complexity of learning it. So we focus where we can make the most impact.16:46
eanderssonBut we do contribute smaller bug fixes upstream to all projects (or at the very least report them) the moment we find them.16:46
ttxI like the idea of meaningful monitoring. It really is in a sorry state16:47
ttxI think the quest for exact metrics (usable for billing) has killed the "meaningful" out of our monitoring16:48
eanderssonYep!16:48
eanderssonIt's difficult to find a good balance.16:49
ttxMaybe we should collect experiences on how people currently do monitoring, not even talking about metrics/billing16:49
eanderssonI would also like to see if someone has successfully enabled tracing.16:50
penicksplunk monitoring/search examples for troubleshooting problems might also be useful16:51
ttxI'll give some thought to how to drive that, maybe we should just add a goal on "meaningful monitoring" (I like that term) but it might be a bit too ambitious for the group right now16:51
eanderssonI would like to know what monitoring is the first to alert you when something goes wrong (e.g. rabbit goes down, control node goes down)16:51
penickalso, relevant, i'm on two meetings at once here, but the other meeting is about what we're going to do to get off Ceilometer and Gnocchi and move to something active and supported16:51
*** dosaboy has quit IRC16:51
ttxrelevant indeed16:52
eanderssonWe have a lot of synthetic monitoring scenarios running continuously16:52
eanderssonor scenario monitoring (e.g. a VM is created and tested every 5 minutes)16:52
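(A minimal sketch of such a scenario check using openstacksdk, assuming a clouds.yaml entry; the cloud, image, flavor and network names are placeholders. Run it from cron every few minutes and alert on failures or slow boots.)

    import time
    import openstack

    def canary_boot_check(cloud='mycloud', image='cirros', flavor='m1.tiny',
                          network='private', timeout=300):
        """Boot a throwaway VM, wait for it to become ACTIVE, then clean up."""
        conn = openstack.connect(cloud=cloud)
        name = 'canary-%d' % int(time.time())
        start = time.monotonic()
        try:
            server = conn.create_server(name, image=image, flavor=flavor,
                                        network=network, auto_ip=False,
                                        wait=True, timeout=timeout)
            elapsed = time.monotonic() - start
            print('canary %s ACTIVE in %.1fs' % (server.name, elapsed))
            return elapsed
        finally:
            # delete_server() simply returns False if the create never got that far
            conn.delete_server(name, wait=True)

    if __name__ == '__main__':
        canary_boot_check()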
ttxpenick: would love to know where you end up16:52
ttxOK that was a great first meeting y'all, but I need to run16:53
penickI'll share what we come up with16:53
penickcool, thanks y'all!16:53
ttxSo thanks everyone, let's continue the discussion next time you can drop by. I'll add a section to continue discussing meaningful monitoring in the EU+APAC meeting in two weeks16:54
ttx#endmeeting16:54
*** openstack changes topic to "OpenStack Meetings || https://wiki.openstack.org/wiki/Meetings/"16:54
openstackMeeting ended Wed Sep  9 16:54:45 2020 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)16:54
openstackMinutes:        http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.html16:54
openstackMinutes (text): http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.txt16:54
openstackLog:            http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.log.html16:54
*** eandersson has left #openstack-meeting-316:54
amorinbye16:55
ttxI'll post summary of the meeting, but tomorrow since I need to run16:55
mdelavergneThanks everyone, see you!16:55
*** mdelavergne has quit IRC16:55
*** slaweq has joined #openstack-meeting-316:57
*** penick has quit IRC17:00
*** slaweq has quit IRC17:12
*** belmoreira has joined #openstack-meeting-317:35
*** e0ne has quit IRC17:37
*** ralonsoh has quit IRC18:05
*** priteau has quit IRC18:57
*** apetrich has quit IRC19:11
*** slaweq has joined #openstack-meeting-319:17
*** tosky has quit IRC19:37
*** slaweq has quit IRC20:51
*** slaweq has joined #openstack-meeting-320:55
*** slaweq has quit IRC20:59
*** slaweq has joined #openstack-meeting-321:26
*** slaweq has quit IRC21:31
*** raildo has quit IRC22:22
*** dosaboy has joined #openstack-meeting-323:28

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!