09:00:39 #startmeeting large_scale_sig
09:00:44 Meeting started Wed Feb 12 09:00:39 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:45 #topic Rollcall
09:00:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:48 The meeting name has been set to 'large_scale_sig'
09:00:55 hello
09:00:59 Hi!
09:01:02 Hello hello everyone
09:01:16 Agenda at:
09:01:17 good morning
09:01:21 #link https://etherpad.openstack.org/p/large-scale-sig-meeting
09:01:31 Let's wait for our Asian friends to join
09:01:43 o/
09:02:02 o/
09:02:08 o/
09:02:43 OK, let's get started and maybe others will join
09:02:54 #topic Show and tell
09:03:03 Thanks ttx
09:03:04 We'll start with oneswig presenting
09:03:08 oneswig: floor is yours
09:03:25 I put a link in the etherpad but here's a kind-of presentation... http://www.stackhpc.com/resources/HAproxy-telemetry.pdf
09:03:50 just a few slides as talking points. We do this in the Scientific SIG sometimes and it mostly works
09:03:50 * ttx clicks
09:04:14 So I wanted to share some work my colleague dougsz has been doing :-)
09:04:34 We noticed that HAproxy had sprouted some new telemetry data and wanted to use it.
09:04:58 It's published in the prometheus protocol.
09:05:48 Doug has done some useful work recently with Monasca-Agent (our monitoring infra) to extend support for drawing data from prometheus sources
09:06:29 this is desirable because there's a lot of new work coming in that format - Ceph, HAproxy, etc.
09:07:01 and oslo.metric
09:07:14 ttx: indeed :-) It works nicely for us because we get the innovation in prometheus coupled with the advantages of Monasca
09:08:10 OK, so on slide 2 we have a diagram of Monasca services and how the agent pulls data from prometheus endpoints, does some preliminary sanitisation and pushes it to the Monasca API
09:09:29 Slide 4 is an example end-to-end. That's the API response latencies, as measured by haproxy, sampled and averaged, collated by API endpoint
09:09:40 It's interesting - to a point.
09:10:31 One problem with the prometheus data as it stands is that it doesn't appear to split latency out by (e.g.) response code or http operation
09:10:39 that glance_api_external does not seem to be on a good trend
09:11:00 the big red ascending saw-tooth looks bad. To make it do that I was repeatedly creating and deleting images.
09:11:25 (this is only a single-node control plane, it's a dev environment)
09:12:00 oneswig: so success latency is mixed with error latency?
09:12:02 What it doesn't show is that this is a combination of POST, PUT, GET, DELETE
09:12:14 ttx: exactly, which isn't perfect.
09:12:26 yes because I read the golden signals chapter
09:12:27 and brings us to the next part of the work - slide 5
09:12:32 :-)
09:13:10 Doug noticed that haproxy can log copious amounts of useful data, and the log messages appear designed to be parsed into semi-structured data
09:13:35 Forgot to mention at the top - we use Kolla-Ansible, which standardises on a fluentd log aggregator container on each node
09:14:22 After parsing the logs with some epic regex, we get JSON-structured data about each API request
09:15:15 This goes into Monasca's log pipeline. The box in the bottom right corner can be used to convert log messages into dimensioned time series telemetry (monasca log metrics)
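(A rough, illustrative sketch of the log-parsing step described above: turning one HAProxy HTTP log line into structured, JSON-ready data. The pattern, sample line, and field names here are simplified assumptions, not the actual fluentd regex used in the Kolla-Ansible/StackHPC pipeline.)

```python
import json
import re
from typing import Optional

# Simplified pattern for HAProxy's default HTTP log format:
# client:port [accept_date] frontend backend/server Tq/Tw/Tc/Tr/Tt status bytes ... "METHOD path PROTO"
HAPROXY_HTTP_LOG = re.compile(
    r'(?P<client>[\d.]+):\d+ '
    r'\[(?P<accept_date>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/]+)/(?P<server>\S+) '
    r'(?P<tq>-?\d+)/(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tr>-?\d+)/(?P<tt>\d+) '
    r'(?P<status>\d{3}) (?P<bytes>\d+) .* '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)'
)

def parse_haproxy_line(line: str) -> Optional[dict]:
    """Return a dict of dimensions/values for one request, or None if unparsable."""
    match = HAPROXY_HTTP_LOG.search(line)
    if not match:
        return None
    record = match.groupdict()
    # Convert timers, status and byte counts to numbers so they can become metrics.
    for key in ("tq", "tw", "tc", "tr", "tt", "status", "bytes"):
        record[key] = int(record[key])
    return record

if __name__ == "__main__":
    # Hypothetical log line, loosely modelled on HAProxy's HTTP log format.
    sample = ('10.0.0.1:51152 [12/Feb/2020:09:14:21.913] glance_api_external '
              'glance_api_external/ctl0 0/0/1/238/239 201 312 - - ---- '
              '2/2/0/1/0 0/0 "POST /v2/images HTTP/1.1"')
    print(json.dumps(parse_haproxy_line(sample), indent=2))
```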
09:15:45 That gives us individual samples which can be aggregated at various points in the pipeline into histograms
09:16:08 currently we do this in grafana, but it could be pushed further upstream, even to fluentd
09:17:10 The second grafana screenshot shows api responses for one service (keystone in this case, because the data's prettiest :-)
09:18:16 Hmm... y is response time, but what is x in this graph?
09:18:19 But what we can see is data by operation, interface and response
09:18:28 frequency density, iirc
09:18:44 it's binned into 10ms intervals
09:18:58 ah ok
09:19:24 Unfortunately there weren't errors for this service but if there were, we might see 500 response codes on the graph too.
09:19:37 Keystone never fails
09:20:05 One shortcoming I guess is that if the number of errors is small, or the timing of them scattered, they'd be lost in the noise along the x-axis
09:20:46 yeah for errors you might want to aggregate over longer periods of time
09:21:02 (or for other rare events)
09:21:33 and create a separate dashboard for errors only
09:21:34 Re. the axis labelling, unfortunately the histogram plugin doesn't support it
09:22:02 dougsz: as long as the reader knows what's on them, I guess it does not matter that much
09:22:06 a good sample would certainly be more useful. We also have such detailed data here, we could potentially trace every datapoint to a single api request
09:22:28 dougsz, oneswig: That's interesting integration work! Thanks for sharing
09:22:33 BTW all this work is going / has gone upstream
09:23:04 It should be useful at various levels - even the general principles for folks not using any of these components
09:23:18 It shows that once you start reducing the parameters (like say you start from Kolla-Ansible so you can rely on fluentd/monasca being present), it simplifies things a lot
09:23:40 (in the telemetry gathering space)
09:24:06 I think so. One thing missing is a nice starting point in terms of dashboards. It's pretty basic what you get as standard
09:24:14 questions on this presentation?
09:24:43 doug did I miss anything?
09:24:52 Indeed, we have a goal to pre-configure monitoring out of the box in Kolla-Ansible
09:25:06 I think you captured it nicely, thanks oneswig
09:25:30 oneswig: do you mind if I include a link to your slides in the meeting summary sent to the ML?
09:25:35 It might be worth mentioning that Prometheus users can take advantage of the log parsing via the Fluentd Prometheus plugin in Kolla
09:25:36 BTW the glance API - I ran that test overnight, it reached a steady state around 3.5s :-)
09:26:22 OK, if no more questions we'll switch to our next topic...
09:26:36 #topic Progress on "Documenting large scale operations" goal
09:26:41 amorin: around?
09:26:45 #link https://etherpad.openstack.org/p/large-scale-sig-documentation
09:27:46 No amorin today, so we'll hear how things are going on the config default documentation effort in a future meeting
09:28:04 We also had a bunch of links added to the etherpad, which is great. I briefly scanned them and feel like they all belong.
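(To make the histogram aggregation discussed earlier concrete - per-request latency samples grouped by dimensions and binned into 10ms intervals - here is a minimal sketch under assumed field names. It is not the actual Monasca or Grafana implementation, just the general idea.)

```python
from collections import Counter, defaultdict

BIN_MS = 10  # 10 ms intervals, as in the dashboards shown in the slides

def bin_latencies(samples):
    """samples: iterable of dicts like {"method": "GET", "status": 200, "tt": 37},
    where "tt" is the total request time in milliseconds (field names assumed).
    Returns {(method, status): Counter({bucket_start_ms: count})}."""
    histograms = defaultdict(Counter)
    for s in samples:
        bucket = (s["tt"] // BIN_MS) * BIN_MS   # e.g. 37 ms falls in the 30-40 ms bucket
        histograms[(s["method"], s["status"])][bucket] += 1
    return histograms

if __name__ == "__main__":
    samples = [
        {"method": "GET", "status": 200, "tt": 12},
        {"method": "GET", "status": 200, "tt": 18},
        {"method": "POST", "status": 201, "tt": 243},
        {"method": "GET", "status": 500, "tt": 7},
    ]
    for dims, hist in bin_latencies(samples).items():
        print(dims, dict(sorted(hist.items())))
```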
09:28:24 #topic Progress on "Scaling within one cluster" goal
09:28:38 #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
09:28:49 The oslo.metric blueprint got its first round of reviews, mostly from outside the SIG
09:28:56 #link https://review.opendev.org/#/c/704733/
09:29:10 Would be great to get more feedback from the SIG participants themselves, before masahito does another round.
09:29:41 So I propose to add a TODO for the group to review that spec ASAP
09:29:50 (as soon as possible)
09:30:08 I don't have much of an update this week. I'll update the patch this week, without adding the actual schema.
09:30:29 s/patch/draft/
09:30:34 #action masahito to update spec based on initial feedback
09:30:49 #action everyone to review and comment on https://review.opendev.org/#/c/704733/
09:31:03 Regarding golden signals, did anyone do the recommended reading?
09:31:10 #link https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
09:31:23 Basically the summary is that any monitoring system should watch latency, traffic, errors, and saturation as early detectors of impending failure
09:31:44 So making sure this group has a common understanding of those terms is useful.
09:32:16 Please read it (it's short) if you haven't had the chance yet
09:32:24 On the scaling stories front, thanks to Albert Braden who posted to the doc recently
09:32:30 #link https://etherpad.openstack.org/p/scaling-stories
09:33:04 I feel like some of it should find its way into amorin's documentation efforts
09:33:58 comments on that?
09:34:42 Sounds like a no
09:34:44 #topic Other topics
09:34:56 So... we have an "Opendev" event coming up in Vancouver, June 8-11
09:35:15 I wanted to talk briefly about it and how this group can get involved
09:35:27 For those not familiar with the format, it's a workshop-style event where we deep-dive into a number of predefined topics, with lots of flexibility in the exact event format
09:35:46 One of the 4 themes selected for the event is "Large-scale Usage of Open Source Infrastructure Software"
09:35:54 Which is directly aligned with this SIG
09:36:18 So I think we should definitely participate in the programming committee and help select the content that will be discussed there
09:36:39 Like make sure we have some part of the event discussing our "scaling one cluster" and "documenting large scale use defaults" goals
09:36:49 And also take the opportunity to recruit more people to this SIG.
09:37:08 I saw that Belmiro applied to be on the programming committee, I'll make sure to support that application. I myself will be focused on other tracks at the same event
09:37:43 So if others (masahito? oneswig? amorin? jiaopingju?) are interested in driving the content that will be discussed there, please let me know
09:38:44 Thanks. I plan to go to Opendev, but I haven't decided yet.
09:38:57 I might be able to help, but I am not sure I will physically go there
09:39:09 If I can go, I'm okay to be a moderator.
09:39:17 That would be great
09:39:52 OK, we still have some tie, but I know the event organizers want to select the programming committees ASAP
09:39:58 some time*
09:40:19 Is there room for talk submissions for the event now?
09:41:06 masahito: It depends on the choice of each programming committee. The general idea is to have a limited number of formal presentations, so in most cases the programming committee will just invite a specific speaker
09:41:24 But some tracks might do a more classic CFP
09:41:37 Got it. Thanks.
09:41:40 and some others might do more of a short-presentations session
09:42:08 (this is why it's important to have members of this SIG in the programming committee)
09:42:42 masahito: in case you can confirm your presence soon, I'd be happy to pass your name to be a part of the committee
09:43:05 if you'll know later, we can always have you moderate some topic
09:43:45 like a workshop on gathering metrics on bottlenecks in your OpenStack clusters
09:43:50 okay.
09:44:05 Other questions on this event?
09:44:13 Other topics to discuss?
09:44:32 Not from myself right now
09:44:40 #topic Next meeting
09:44:41 Nothing from my side.
09:44:48 Our next meeting should be Feb 26, 9 UTC
09:44:55 Hopefully we'll have more people around
09:45:43 Anything else before we close the meeting?
09:45:56 If not, thank you for attending!
09:46:12 Thanks everyone!
09:46:18 #endmeeting
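(As a footnote to the "golden signals" reading recommended under the cluster-scaling topic above - latency, traffic, errors and saturation as early detectors of impending failure - here is a tiny hypothetical sketch of what computing those four signals over a window of request records might look like. The record fields and the utilisation input are assumptions for illustration only.)

```python
from statistics import quantiles

def golden_signals(requests, window_seconds, utilisation):
    """requests: list of dicts like {"latency_ms": 42, "status": 200} (fields assumed)."""
    latencies = [r["latency_ms"] for r in requests]
    errors = [r for r in requests if r["status"] >= 500]
    return {
        # Latency: 95th percentile response time over the window.
        "latency_p95_ms": quantiles(latencies, n=20)[-1] if len(latencies) > 1 else None,
        # Traffic: request rate over the window.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: fraction of requests that failed with a server error.
        "error_ratio": len(errors) / len(requests) if requests else 0.0,
        # Saturation: usually a separate resource metric (e.g. worker pool usage, 0.0-1.0).
        "saturation": utilisation,
    }

if __name__ == "__main__":
    reqs = [{"latency_ms": 20 + i, "status": 500 if i == 9 else 200} for i in range(10)]
    print(golden_signals(reqs, window_seconds=60, utilisation=0.4))
```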