09:00:39 <ttx> #startmeeting large_scale_sig
09:00:44 <openstack> Meeting started Wed Feb 12 09:00:39 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:45 <ttx> #topic Rollcall
09:00:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:48 <openstack> The meeting name has been set to 'large_scale_sig'
09:00:55 <oneswig> hello
09:00:59 <mdelavergne> Hi!
09:01:02 <ttx> Hello hello everyone
09:01:16 <ttx> Agenda at:
09:01:17 <witek> good morning
09:01:21 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
09:01:31 <ttx> Let's wait for our Asian friends to join
09:01:43 <dougsz> o/
09:02:02 <masahito> o/
09:02:08 <etp> o/
09:02:43 <ttx> OK, let's get started and maybe others will join
09:02:54 <ttx> #topic Show and tell
09:03:03 <oneswig> Thanks ttx
09:03:04 <ttx> We'll start with oneswig presenting
09:03:08 <ttx> oneswig: floor is yours
09:03:25 <oneswig> I put a link in the etherpad but here's a kind-of presentation... http://www.stackhpc.com/resources/HAproxy-telemetry.pdf
09:03:50 <oneswig> just a few slides as talking points. We do this in the Scientific SIG sometimes and it mostly works
09:03:50 * ttx clicks
09:04:14 <oneswig> So I wanted to share some work my colleague dougsz has been doing :-)
09:04:34 <oneswig> We noticed that HAproxy had sprouted some new telemetry data and wanted to use it.
09:04:58 <oneswig> It's published in the prometheus protocol.
09:05:48 <oneswig> Doug has done some useful work recently with Monasca-Agent (our monitoring infra) to extend support for drawing data from prometheus sources
09:06:29 <oneswig> this is desirable because there's a lot of new work coming in that format - Ceph, HAproxy, etc.
09:07:01 <ttx> and oslo.metric
09:07:14 <oneswig> ttx: indeed :-) It works nicely for us because we get the innovation in prometheus coupled with the advantages of Monasca
09:08:10 <oneswig> OK, so on slide 2 we have a diagram of Monasca services and how the agent pulls data from prometheus endpoints, does some preliminary sanitisation and pushes it to the Monasca API
09:09:29 <oneswig> Slide 4 is an example end-to-end. That's the API response latencies, as measured by haproxy, sampled and averaged, collated by API endpoint
09:09:40 <oneswig> It's interesting - to a point.
09:10:31 <oneswig> One problem with the prometheus data as it stands is that it doesn't appear to split latency out by (e.g.) response code or http operation
09:10:39 <ttx> that glance_api_external does not seem to be on a good trend
09:11:00 <oneswig> the big red ascending saw-tooth looks bad. To make it do that I was repeatedly creating and deleting images.
09:11:25 <oneswig> (this is only a single-node control plane, it's a dev environment)
09:12:00 <ttx> oneswig: so success latency is mixed with error latency?
09:12:02 <oneswig> What it doesn't show is that this is a combination of POST, PUT, GET, DELETE
09:12:14 <oneswig> ttx: exactly, which isn't perfect.
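[Editor's note: the pull-and-forward pattern oneswig describes (an agent scrapes a Prometheus endpoint, sanitises the samples, and pushes them to the Monasca API) can be illustrated with the short sketch below. This is not the actual Monasca-Agent plugin code; the endpoint URL and the forward_to_monasca() helper are hypothetical placeholders.]

```python
# Illustrative sketch only: scrape a Prometheus exposition endpoint, do minimal
# sanitisation, and hand each sample to a (hypothetical) Monasca forwarder.
import math

import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed URL for HAProxy's built-in Prometheus exporter; adjust for your deployment.
PROM_ENDPOINT = "http://127.0.0.1:8404/metrics"


def forward_to_monasca(measurement):
    """Stand-in for the real Monasca API call; here we just print the sample."""
    print(measurement)


def scrape_and_forward():
    raw = requests.get(PROM_ENDPOINT, timeout=5).text
    for family in text_string_to_metric_families(raw):
        for sample in family.samples:
            # Basic sanitisation: drop NaN values before forwarding.
            if math.isnan(sample.value):
                continue
            forward_to_monasca({
                "name": sample.name,
                # Prometheus labels map naturally onto Monasca dimensions.
                "dimensions": dict(sample.labels),
                "value": sample.value,
            })


if __name__ == "__main__":
    scrape_and_forward()
```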
09:12:26 <ttx> yes because I read the golden signals chapter
09:12:27 <oneswig> and that brings us to the next part of the work - slide 5
09:12:32 <oneswig> :-)
09:13:10 <oneswig> Doug noticed that haproxy can log copious amounts of useful data, and the log messages appear designed to be parsed into semi-structured data
09:13:35 <oneswig> Forgot to mention at the top - we use Kolla-Ansible, which standardises on a fluentd log aggregator container on each node
09:14:22 <oneswig> After parsing the logs with some epic regex, we get json structured data about each api request
09:15:15 <oneswig> This goes into monasca's log pipeline. The box in the bottom right corner can be used to convert log messages into dimensioned time series telemetry (monasca log metrics)
09:15:45 <oneswig> That gives us individual samples which can be aggregated at various points in the pipeline into histograms
09:16:08 <oneswig> currently we do this in grafana, but it could be pushed further upstream, even to fluentd
09:17:10 <oneswig> The second grafana screenshot shows api responses for one service (keystone in this case, because the data's prettiest :-)
09:18:16 <ttx> Hmm... y is response time, but what is x in this graph?
09:18:19 <oneswig> But what we can see is data by operation, interface and response
09:18:28 <oneswig> frequency density, iirc
09:18:44 <oneswig> it's binned into 10ms intervals
09:18:58 <ttx> ah ok
09:19:24 <oneswig> Unfortunately there weren't errors for this service, but if there were, we might see 500 response codes on the graph too.
09:19:37 <ttx> Keystone never fails
09:20:05 <oneswig> One shortcoming I guess is that if the number of errors is small, or the timing of them scattered, they'd be lost in the noise along the x-axis
09:20:46 <ttx> yeah for errors you might want to aggregate over longer periods of time
09:21:02 <ttx> (or for other rare events)
09:21:33 <witek> and create a separate dashboard for errors only
09:21:34 <dougsz> Re. the axis labelling, unfortunately the histogram plugin doesn't support it
09:22:02 <ttx> dougsz: as long as the reader knows what's on them, I guess it does not matter that much
09:22:06 <oneswig> a good sample would certainly be more useful. We also have such detailed data here, we could potentially trace every datapoint to a single api request
09:22:28 <ttx> dougsz, oneswig: That's interesting integration work! Thanks for sharing
09:22:33 <oneswig> BTW all this work is going / has gone upstream
09:23:04 <oneswig> It should be useful at various levels - even the general principles for folks not using any of these components
09:23:18 <ttx> It shows that once you start reducing the parameters (like say you start from Kolla-Ansible so you can rely on fluentd/monasca being present), it simplifies things a lot
09:23:40 <ttx> (in the telemetry gathering space)
09:24:06 <oneswig> I think so. One thing missing is a nice starting point in terms of dashboards. What you get as standard is pretty basic
09:24:14 <ttx> questions on this presentation?
09:24:43 <oneswig> doug did I miss anything?
09:24:52 <dougsz> Indeed, we have a goal to pre-configure monitoring out of the box in Kolla-Ansible
09:25:06 <dougsz> I think you captured it nicely, thanks oneswig
09:25:30 <ttx> oneswig: do you mind if I include a link to your slides in the meeting summary sent to the ML?
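[Editor's note: the log-to-metric step dougsz and oneswig describe above can be sketched as follows: parse an HAProxy HTTP log line with a regex and emit a dimensioned sample keyed by method, status and frontend. The regex and field names are assumptions based on HAProxy's default HTTP log format; the real pipeline uses fluentd parsers and Monasca log metrics, not this standalone script.]

```python
# Minimal sketch of turning HAProxy HTTP log lines into dimensioned samples.
import re

# Matches the frontend/backend, timer, status and request-line fields of an
# HAProxy HTTP log record (assuming the default HTTP log format).
HAPROXY_HTTP_RE = re.compile(
    r'(?P<frontend>\S+) (?P<backend>\S+)/(?P<server>\S+) '
    r'(?P<tq>-?\d+)/(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tr>-?\d+)/(?P<tt>-?\d+) '
    r'(?P<status>\d{3}) (?P<bytes>\d+) .* '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)'
)


def log_to_metric(line):
    """Turn one HAProxy log line into a dimensioned sample, or None if it doesn't parse."""
    m = HAPROXY_HTTP_RE.search(line)
    if not m:
        return None
    return {
        "name": "haproxy.request_time_ms",
        "value": int(m.group("tt")),  # total request time in ms (HAProxy's Tt field)
        "dimensions": {
            # These dimensions are what make per-operation/per-status histograms possible.
            "frontend": m.group("frontend"),
            "status": m.group("status"),
            "method": m.group("method"),
        },
    }


sample_line = ('haproxy[1234]: 10.0.0.1:51234 [12/Feb/2020:09:14:22.123] '
               'keystone_external keystone_external/ctrl1 0/0/1/12/13 200 1420 '
               '- - ---- 1/1/0/0/0 0/0 "GET /v3/auth/tokens HTTP/1.1"')
print(log_to_metric(sample_line))
```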
09:25:35 <dougsz> It might be worth mentioning that Prometheus users can take advantage of the log parsing via the Fluentd Prometheus plugin in Kolla
09:25:36 <oneswig> BTW the glance API - I ran that test overnight, it reached a steady state around 3.5s :-)
09:26:22 <ttx> OK, if there are no more questions we'll switch to our next topic...
09:26:36 <ttx> #topic Progress on "Documenting large scale operations" goal
09:26:41 <ttx> amorin: around?
09:26:45 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
09:27:46 <ttx> No amorin today, so we'll get an update on how the config default documentation effort is going in a future meeting
09:28:04 <ttx> We also had a bunch of links added to the etherpad, which is great. I briefly scanned them and feel like they all belong.
09:28:24 <ttx> #topic Progress on "Scaling within one cluster" goal
09:28:38 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
09:28:49 <ttx> The oslo.metric blueprint got its first round of reviews, mostly from outside the SIG
09:28:56 <ttx> #link https://review.opendev.org/#/c/704733/
09:29:10 <ttx> Would be great to get more feedback from the SIG participants themselves, before masahito does another round.
09:29:41 <ttx> So I propose to add a TODO for the group to review that spec ASAP
09:29:50 <ttx> (as soon as possible)
09:30:08 <masahito> I don't have much of an update this week. I'll update the patch this week, without adding the actual schema yet.
09:30:29 <masahito> s/patch/draft/
09:30:34 <ttx> #action masahito to update spec based on initial feedback
09:30:49 <ttx> #action everyone to review and comment on https://review.opendev.org/#/c/704733/
09:31:03 <ttx> Regarding golden signals, did anyone read the recommended read?
09:31:10 <ttx> #link https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
09:31:23 <ttx> Basically the summary is that any monitoring system should watch latency, traffic, errors, and saturation as early detectors of impending failure
09:31:44 <ttx> So making sure this group has a common understanding of those terms is useful.
09:32:16 <ttx> Please read it (it's short) if you haven't had the chance yet
09:32:24 <ttx> On the scaling stories front, thanks to Albert Braden who posted to the doc recently
09:32:30 <ttx> #link https://etherpad.openstack.org/p/scaling-stories
09:33:04 <ttx> I feel like some of it should find its way into amorin's documentation efforts
09:33:58 <ttx> comments on that?
09:34:42 <ttx> Sounds like a no
09:34:44 <ttx> #topic Other topics
09:34:56 <ttx> So... we have an "Opendev" event coming up in Vancouver, June 8-11
09:35:15 <ttx> I wanted to talk briefly about it and how this group can get involved
09:35:27 <ttx> For those not familiar with the format, it's a workshop-style event where we deep-dive into a number of predefined topics, with lots of flexibility in the exact event format
09:35:46 <ttx> One of the 4 themes selected for the event is "Large-scale Usage of Open Source Infrastructure Software"
09:35:54 <ttx> Which is directly aligned with this SIG
09:36:18 <ttx> So I think we should definitely participate in the programming committee and help select the content that will be discussed there
09:36:39 <ttx> Like make sure we have some part of the event discussing our "scaling one cluster" and "documenting large scale use defaults" goals
09:36:49 <ttx> And also take the opportunity to recruit more people to this SIG.
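[Editor's note: as a quick reference for the golden signals mentioned above (latency, traffic, errors, saturation), here is a hedged sketch of how they could be aggregated from per-request samples like those produced by the earlier log_to_metric example. The window length and the saturation input are arbitrary assumptions for illustration, not something the SIG or the SRE book prescribes.]

```python
# Illustrative aggregation of per-request samples into the four golden signals.
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_ms: float   # mean request latency over the window
    traffic_rps: float  # requests per second
    error_rate: float   # fraction of 5xx responses
    saturation: float   # e.g. queue depth vs. limit; needs a separate data source


def summarise(samples, window_seconds=60, saturation=0.0):
    """Aggregate samples shaped like log_to_metric() output into golden signals."""
    if not samples:
        return GoldenSignals(0.0, 0.0, 0.0, saturation)
    latencies = [s["value"] for s in samples]
    errors = sum(1 for s in samples if s["dimensions"]["status"].startswith("5"))
    return GoldenSignals(
        latency_ms=sum(latencies) / len(latencies),
        traffic_rps=len(samples) / window_seconds,
        error_rate=errors / len(samples),
        saturation=saturation,
    )
```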
09:37:08 <ttx> I saw that Belmiro applied to be on the programming committee, I'll make sure to support that application. I myself will be focused on other tracks at the same event
09:37:43 <ttx> So if others (masahito? oneswig? amorin? jiaopingju?) are interested in driving the content that will be discussed there, please let me know
09:38:44 <masahito> Thanks. I plan to go to the Opendev event, but it's not decided yet.
09:38:57 <mdelavergne> I might be able to help, but I am not sure I will physically go there
09:39:09 <masahito> If I can go, I'm okay to be a moderator.
09:39:17 <ttx> That would be great
09:39:52 <ttx> OK, we still have some tie, but I know the event organizers want to select the programming committees ASAP
09:39:58 <ttx> some time*
09:40:19 <masahito> Is there a room or talk submission process for the event now?
09:41:06 <ttx> masahito: It depends on the choice of each programming committee. The general idea is to have a limited number of formal presentations, so in most cases the programming committee will just invite a specific speaker
09:41:24 <ttx> But some tracks might do a more classic CFP
09:41:37 <masahito> Got it. Thanks.
09:41:40 <ttx> and some others might do more of a short-presentations session
09:42:08 <ttx> (this is why it's important to have members of this SIG on the programming committee)
09:42:42 <ttx> masahito: in case you can confirm your presence soon, I'd be happy to pass your name along to be part of the committee
09:43:05 <ttx> if you'll know later, we can always have you moderate some topic
09:43:45 <ttx> like a workshop on gathering metrics on bottlenecks in your OpenStack clusters
09:43:50 <masahito> okay.
09:44:05 <ttx> Other questions on this event?
09:44:13 <ttx> Other topics to discuss?
09:44:32 <mdelavergne> Not from myself right now
09:44:40 <ttx> #topic Next meeting
09:44:41 <masahito> Nothing from my side.
09:44:48 <ttx> Our next meeting should be Feb 26, 9 UTC
09:44:55 <ttx> Hopefully we'll have more people around
09:45:43 <ttx> Anything else before we close the meeting?
09:45:56 <ttx> If not, thank you for attending!
09:46:12 <mdelavergne> Thanks everyone!
09:46:18 <ttx> #endmeeting