09:00:39 #startmeeting large_scale_sig
09:00:44 Meeting started Wed Feb 12 09:00:39 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:45 #topic Rollcall
09:00:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:48 The meeting name has been set to 'large_scale_sig'
09:00:55 hello
09:00:59 Hi!
09:01:02 Hello hello everyone
09:01:16 Agenda at:
09:01:17 good morning
09:01:21 #link https://etherpad.openstack.org/p/large-scale-sig-meeting
09:01:31 Let's wait for our Asian friends to join
09:01:43 o/
09:02:02 o/
09:02:08 o/
09:02:43 OK, let's get started and maybe others will join
09:02:54 #topic Show and tell
09:03:03 Thanks ttx
09:03:04 We'll start with oneswig presenting
09:03:08 oneswig: floor is yours
09:03:25 I put a link in the etherpad but here's a kind-of presentation... http://www.stackhpc.com/resources/HAproxy-telemetry.pdf
09:03:50 just a few slides as talking points. We do this in the Scientific SIG sometimes and it mostly works
09:03:50 * ttx clicks
09:04:14 So I wanted to share some work my colleague dougsz has been doing :-)
09:04:34 We noticed that HAproxy had sprouted some new telemetry data and wanted to use it.
09:04:58 It's published in the prometheus protocol.
09:05:48 Doug has done some useful work recently with Monasca-Agent (our monitoring infra) to extend support for drawing data from prometheus sources
09:06:29 this is desirable because there's a lot of new work coming in that format - Ceph, HAproxy, etc.
09:07:01 and oslo.metric
09:07:14 ttx: indeed :-) It works nicely for us because we get the innovation in prometheus coupled with the advantages of Monasca
09:08:10 OK, so on slide 2 we have a diagram of Monasca services and how the agent pulls data from prometheus endpoints, does some preliminary sanitisation and pushes it to the Monasca API
09:09:29 Slide 4 is an example end-to-end. That's the API response latencies, as measured by haproxy, sampled and averaged, collated by API endpoint
09:09:40 It's interesting - to a point.
09:10:31 One problem with the prometheus data as it stands is that it doesn't appear to split latency out by (e.g.) response code or http operation
09:10:39 that glance_api_external does not seem to be on a good trend
09:11:00 the big red ascending saw-tooth looks bad. To make it do that I was repeatedly creating and deleting images.
09:11:25 (this is only a single-node control plane, it's a dev environment)
09:12:00 oneswig: so success latency is mixed with error latency?
09:12:02 What it doesn't show is that this is a combination of POST, PUT, GET, DELETE
09:12:14 ttx: exactly, which isn't perfect.
09:12:26 yes because I read the golden signals chapter
09:12:27 and brings us to the next part of the work - slide 5
09:12:32 :-)
09:13:10 Doug noticed that haproxy can log copious amounts of useful data, and the log messages appear designed to be parsed into semi-structured data
09:13:35 Forgot to mention at the top - we use Kolla-Ansible, which standardises on a fluentd log aggregator container on each node
09:14:22 After parsing the logs with some epic regex, we get JSON-structured data about each API request
09:15:15 This goes into Monasca's log pipeline. The box in the bottom right corner can be used to convert log messages into dimensioned time series telemetry (monasca log metrics)
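(A rough, illustrative sketch of the log-parsing step described above: turning one HAProxy HTTP log line into structured, JSON-ready data. The pattern, sample line, and field names here are simplified assumptions, not the actual fluentd regex used in the Kolla-Ansible/StackHPC pipeline.)

```python
import json
import re
from typing import Optional

# Simplified pattern for HAProxy's default HTTP log format:
# client:port [accept_date] frontend backend/server Tq/Tw/Tc/Tr/Tt status bytes ... "METHOD path PROTO"
HAPROXY_HTTP_LOG = re.compile(
    r'(?P<client>[\d.]+):\d+ '
    r'\[(?P<accept_date>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/]+)/(?P<server>\S+) '
    r'(?P<tq>-?\d+)/(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tr>-?\d+)/(?P<tt>\d+) '
    r'(?P<status>\d{3}) (?P<bytes>\d+) .* '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)'
)

def parse_haproxy_line(line: str) -> Optional[dict]:
    """Return a dict of dimensions/values for one request, or None if unparsable."""
    match = HAPROXY_HTTP_LOG.search(line)
    if not match:
        return None
    record = match.groupdict()
    # Convert timers, status and byte counts to numbers so they can become metrics.
    for key in ("tq", "tw", "tc", "tr", "tt", "status", "bytes"):
        record[key] = int(record[key])
    return record

if __name__ == "__main__":
    # Hypothetical log line, loosely modelled on HAProxy's HTTP log format.
    sample = ('10.0.0.1:51152 [12/Feb/2020:09:14:21.913] glance_api_external '
              'glance_api_external/ctl0 0/0/1/238/239 201 312 - - ---- '
              '2/2/0/1/0 0/0 "POST /v2/images HTTP/1.1"')
    print(json.dumps(parse_haproxy_line(sample), indent=2))
```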
09:15:45 That gives us individual samples which can be aggregated at various points in the pipeline into histograms
09:16:08 currently we do this in grafana, but it could be pushed further upstream, even to fluentd
09:17:10 The second grafana screenshot shows api responses for one service (keystone in this case, because the data's prettiest :-)
09:18:16 Hmm... y is response time, but what is x in this graph?
09:18:19 But what we can see is data by operation, interface and response
09:18:28 frequency density, iirc
09:18:44 it's binned into 10ms intervals
09:18:58 ah ok
09:19:24 Unfortunately there weren't errors for this service but if there were, we might see 500 response codes on the graph too.
09:19:37 Keystone never fails
09:20:05 One shortcoming I guess is that if the number of errors is small, or the timing of them scattered, they'd be lost in the noise along the x-axis
09:20:46 yeah for errors you might want to aggregate over longer periods of time
09:21:02 (or for other rare events)
09:21:33 and create a separate dashboard for errors only
09:21:34 Re. the axis labelling, unfortunately the histogram plugin doesn't support it
09:22:02 dougsz: as long as the reader knows what's on them, I guess it does not matter that much
09:22:06 a good sample would certainly be more useful. We also have such detailed data here, we could potentially trace every datapoint to a single api request
09:22:28 dougsz, oneswig: That's interesting integration work! Thanks for sharing
09:22:33 BTW all this work is going / has gone upstream
09:23:04 It should be useful at various levels - even the general principles for folks not using any of these components
09:23:18 It shows that once you start reducing the parameters (like say you start from Kolla-Ansible so you can rely on fluentd/monasca being present), it simplifies things a lot
09:23:40 (in the telemetry gathering space)
09:24:06 I think so. One thing missing is a nice starting point in terms of dashboards. It's pretty basic what you get as standard
09:24:14 questions on this presentation?
09:24:43 doug did I miss anything?
09:24:52 Indeed, we have a goal to pre-configure monitoring out of the box in Kolla-Ansible
09:25:06 I think you captured it nicely, thanks oneswig
09:25:30 oneswig: do you mind if I include a link to your slides in the meeting summary sent to the ML?
09:25:35 It might be worth mentioning that Prometheus users can take advantage of the log parsing via the Fluentd Prometheus plugin in Kolla
09:25:36 BTW the glance API - I ran that test overnight, it reached a steady state around 3.5s :-)
09:26:22 OK, if no more questions we'll switch to our next topic...
09:26:36 #topic Progress on "Documenting large scale operations" goal
09:26:41 amorin: around?
09:26:45 #link https://etherpad.openstack.org/p/large-scale-sig-documentation
09:27:46 No amorin today, so we'll hear how things are going on the config default documentation effort in a future meeting
09:28:04 We also had a bunch of links added to the etherpad, which is great. I briefly scanned them and feel like they all belong.
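(To make the histogram aggregation discussed earlier concrete - per-request latency samples grouped by dimensions and binned into 10ms intervals - here is a minimal sketch under assumed field names. It is not the actual Monasca or Grafana implementation, just the general idea.)

```python
from collections import Counter, defaultdict

BIN_MS = 10  # 10 ms intervals, as in the dashboards shown in the slides

def bin_latencies(samples):
    """samples: iterable of dicts like {"method": "GET", "status": 200, "tt": 37},
    where "tt" is the total request time in milliseconds (field names assumed).
    Returns {(method, status): Counter({bucket_start_ms: count})}."""
    histograms = defaultdict(Counter)
    for s in samples:
        bucket = (s["tt"] // BIN_MS) * BIN_MS   # e.g. 37 ms falls in the 30-40 ms bucket
        histograms[(s["method"], s["status"])][bucket] += 1
    return histograms

if __name__ == "__main__":
    samples = [
        {"method": "GET", "status": 200, "tt": 12},
        {"method": "GET", "status": 200, "tt": 18},
        {"method": "POST", "status": 201, "tt": 243},
        {"method": "GET", "status": 500, "tt": 7},
    ]
    for dims, hist in bin_latencies(samples).items():
        print(dims, dict(sorted(hist.items())))
```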
09:28:24 #topic Progress on "Scaling within one cluster" goal
09:28:38 #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling
09:28:49 The oslo.metric blueprint got its first round of reviews, mostly from outside the SIG
09:28:56 #link https://review.opendev.org/#/c/704733/
09:29:10 Would be great to get more feedback from the SIG participants themselves, before masahito does another round.
09:29:41 So I propose to add a TODO for the group to review that spec ASAP
09:29:50 (as soon as possible)
09:30:08 I don't have much of an update this week. I'll update the patch this week, without adding the actual schema.
09:30:29 s/patch/draft/
09:30:34 #action masahito to update spec based on initial feedback
09:30:49 #action everyone to review and comment on https://review.opendev.org/#/c/704733/
09:31:03 Regarding golden signals, did anyone do the recommended reading?
09:31:10 #link https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
09:31:23 Basically the summary is that any monitoring system should watch latency, traffic, errors, and saturation as early detectors of impending failure
09:31:44 So making sure this group has a common understanding of those terms is useful.
09:32:16 Please read it (it's short) if you haven't had the chance yet
09:32:24 On the scaling stories front, thanks to Albert Braden who posted to the doc recently
09:32:30 #link https://etherpad.openstack.org/p/scaling-stories
09:33:04 I feel like some of it should find its way into amorin's documentation efforts
09:33:58 comments on that?
09:34:42 Sounds like a no
09:34:44 #topic Other topics
09:34:56 So... we have an "Opendev" event coming up in Vancouver, June 8-11
09:35:15 I wanted to talk briefly about it and how this group can get involved
09:35:27 For those not familiar with the format, it's a workshop-style event where we deep-dive into a number of predefined topics, with lots of flexibility in the exact event format
09:35:46 One of the 4 themes selected for the event is "Large-scale Usage of Open Source Infrastructure Software"
09:35:54 Which is directly aligned with this SIG
09:36:18 So I think we should definitely participate in the programming committee and help select the content that will be discussed there
09:36:39 Like make sure we have some part of the event discussing our "scaling one cluster" and "documenting large scale use defaults" goals
09:36:49 And also take the opportunity to recruit more people to this SIG.
09:37:08 I saw that Belmiro applied to be on the programming committee, I'll make sure to support that application. I myself will be focused on other tracks at the same event
09:37:43 So if others (masahito? oneswig? amorin? jiaopingju?) are interested in driving the content that will be discussed there, please let me know
09:38:44 Thanks. I plan to go to Opendev, but I haven't decided yet.
09:38:57 I might be able to help, but I am not sure I will physically go there
09:39:09 If I can go, I'm okay to be a moderator.
09:39:17 That would be great
09:39:52 OK, we still have some tie, but I know the event organizers want to select the programming committees ASAP
09:39:58 some time*
09:40:19 Is there room for talk submissions for the event now?
09:41:06 masahito: It depends on the choice of each programming committee. The general idea is to have a limited number of formal presentations, so in most cases the programming committee will just invite a specific speaker
09:41:24 But some tracks might do a more classic CFP
09:41:37 Got it. Thanks.
09:41:40 and some others might do more of a short-presentations session
09:42:08 (this is why it's important to have members of this SIG in the programming committee)
09:42:42 masahito: in case you can confirm your presence soon, I'd be happy to pass your name to be a part of the committee
09:43:05 if you'll know later, we can always have you moderate some topic
09:43:45 like a workshop on gathering metrics on bottlenecks in your OpenStack clusters
09:43:50 okay.
09:44:05 Other questions on this event?
09:44:13 Other topics to discuss?
09:44:32 Not from myself right now
09:44:40 #topic Next meeting
09:44:41 Nothing from my side.
09:44:48 Our next meeting should be Feb 26, 9 UTC
09:44:55 Hopefully we'll have more people around
09:45:43 Anything else before we close the meeting?
09:45:56 If not, thank you for attending!
09:46:12 Thanks everyone!
09:46:18 #endmeeting
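(As a footnote to the "golden signals" reading recommended under the cluster-scaling topic above - latency, traffic, errors and saturation as early detectors of impending failure - here is a tiny hypothetical sketch of what computing those four signals over a window of request records might look like. The record fields and the utilisation input are assumptions for illustration only.)

```python
from statistics import quantiles

def golden_signals(requests, window_seconds, utilisation):
    """requests: list of dicts like {"latency_ms": 42, "status": 200} (fields assumed)."""
    latencies = [r["latency_ms"] for r in requests]
    errors = [r for r in requests if r["status"] >= 500]
    return {
        # Latency: 95th percentile response time over the window.
        "latency_p95_ms": quantiles(latencies, n=20)[-1] if len(latencies) > 1 else None,
        # Traffic: request rate over the window.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: fraction of requests that failed with a server error.
        "error_ratio": len(errors) / len(requests) if requests else 0.0,
        # Saturation: usually a separate resource metric (e.g. worker pool usage, 0.0-1.0).
        "saturation": utilisation,
    }

if __name__ == "__main__":
    reqs = [{"latency_ms": 20 + i, "status": 500 if i == 9 else 200} for i in range(10)]
    print(golden_signals(reqs, window_seconds=60, utilisation=0.4))
```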