15:00:52 <witek> #startmeeting monasca
15:00:52 <dougsz> hello all
15:00:53 <openstack> Meeting started Wed Apr 10 15:00:52 2019 UTC and is due to finish in 60 minutes. The chair is witek. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:57 <openstack> The meeting name has been set to 'monasca'
15:00:58 <witek> hi dougsz
15:01:17 <chaconpiza> Hi
15:01:23 <witek> hi chaconpiza
15:01:36 <bandorf> hi, everybody
15:01:53 <witek> hi
15:02:00 <witek> agenda for today:
15:02:03 <witek> https://etherpad.openstack.org/p/monasca-team-meeting-agenda
15:02:24 <witek> I don
15:02:35 <witek> sorry, let's start
15:02:47 <witek> #topic monasca-thresh replacement
15:03:08 <witek> I started thinking about how we can replace monasca-thresh
15:03:21 <witek> as we urgently need to replace it
15:04:08 <witek> so I looked at how Prometheus and Aodh do this
15:04:31 <witek> and they both don't work on streams but query from the DB
15:04:45 <witek> which is much easier to implement
15:05:18 <witek> and then I thought we could actually try to use what Prometheus offers
15:05:29 <witek> and came up with this document
15:05:35 <witek> https://docs.google.com/presentation/d/1tvllnWaridOG-t-qj9D2brddeQXsYNyZwoYUfby_3Ns/edit?usp=sharing
15:05:51 <witek> I've seen your first comments, thanks a lot for that
15:06:41 <witek> I'd like to start the discussion: what do you think of that approach? is it plausible?
15:07:33 <bandorf> maybe we can discuss smaller topics first? and then conclude whether it's plausible?
15:08:21 <witek> right, do we have to discuss if monasca-thresh should be replaced?
15:09:03 <chaconpiza> What about the upgrade from the current solution to the new one using Prometheus for current clients?
15:09:07 <Dobroslaw> hi
15:09:37 <joadavis> hi Dobroslaw
15:09:57 <witek> chaconpiza: you mean, what an operator would have to do to upgrade from one Monasca version to another?
15:10:05 <chaconpiza> yes
15:10:47 <bandorf> I propose to discuss this (migration) later, when a decision has been taken
15:11:13 <witek> the measurement schema would change, so although the data is saved in InfluxDB, some data migration would have to happen if the new functionality is required
15:11:15 <joadavis> well, if we keep the monasca api and just use prometheus for the thresholding and alarming, it might not be much change for a current client
15:11:22 <bandorf> Regarding your problem statement, Witek: I agree with topics 1, 2 and 5.
15:11:35 <bandorf> 4 (complex cluster): I can't really judge
15:12:22 <bandorf> topic 3: High resource consumption: This is certainly true. However, I'm not sure if this is caused by monasca itself or by storm
15:12:38 <Dobroslaw> I'm not sure if Prometheus actually will be lighter than storm...
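Illustrative sketch (not part of the log): a minimal Python example of the query-based evaluation model witek describes above, where a threshold is checked by periodically querying the TSDB instead of processing the metric stream. The Prometheus URL, the metric name cpu_idle_perc, the threshold and the interval are assumptions for illustration; only the /api/v1/query instant-query endpoint is standard Prometheus HTTP API.

    import time
    import requests

    # Assumed values for illustration only.
    PROMETHEUS_URL = "http://localhost:9090"
    RULE_EXPR = 'avg_over_time(cpu_idle_perc[10m]) < 10'  # example threshold expression
    EVAL_INTERVAL = 60  # seconds between evaluations

    def evaluate_once():
        # /api/v1/query is Prometheus' instant-query endpoint; a non-empty
        # result vector means the alarm condition currently holds.
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": RULE_EXPR}, timeout=10)
        resp.raise_for_status()
        for series in resp.json()["data"]["result"]:
            print("ALARM:", series["metric"], "value:", series["value"][1])

    while True:
        evaluate_once()
        time.sleep(EVAL_INTERVAL)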
15:13:33 <joadavis> yes, would definitely want to qualify performance
15:13:37 <joadavis> and footprint
15:13:38 <Dobroslaw> it would be nice if we could find someone using prometheus in production who could tell us how many resources it uses, on average and with data spikes
15:14:09 <Dobroslaw> I found quite a few people complaining about memory usage
15:14:22 <dougsz> Dobroslaw: We're using it, I haven't benchmarked it yet, but I've frequently seen it at the top of `top`
15:15:00 <bandorf> Using "remote read" from influx causes some further overhead - don't know to what extent
15:15:11 <Dobroslaw> and it doesn't have built-in max memory tuning options
15:15:52 <dougsz> In addition to extending the alarm expression language (#2) we also have a requirement to include metadata with alarms
15:15:59 <Dobroslaw> I think I linked to a discussion about it using 10x more memory per measurement...
15:17:04 <witek> dougsz: where does the metadata come from, and can that requirement be addressed with Prometheus?
15:17:11 <joadavis> I've talked to a few people who have the impression that Prometheus has a smaller footprint than Monasca, but I suspect that is relative to their install (or just marketing speak)
15:18:36 <dougsz> witek: For example, we want to create a Jira ticket for every log error message. The metadata would include a snippet of the error message. Not sure if it can be done with Prometheus either. I think the approach would be to use something like mtail to make logs scrapable.
15:18:55 <Dobroslaw> it's an invasive change, HA will need to be handled differently, and I'm not sure how to test it quickly with monasca
15:20:47 <witek> Dobroslaw: what would be an alternative?
15:21:45 <Dobroslaw> unfortunately I don't have an alternative, just bringing up an important point: monasca would most likely be installed on the same machine as prometheus
15:21:58 <Dobroslaw> and sharing resources with it
15:22:38 <joadavis> we may need a POC to show it can be done...
15:22:59 <witek> remote read is for sure an important aspect, Prometheus normally makes use of built-in aggregations, and in the proposed setup the calculation would have to be done on the complete dataset
15:24:12 <witek> complete dataset for a given alerting rule only, of course, normally the last 10 minutes of data or so
15:25:36 <witek> dougsz: how do you use Prometheus, do you have many alerting rules? how much data?
15:27:27 <dougsz> We aren't using it at scale yet and we don't have a large number of alerting rules.
15:27:53 <dougsz> We've combined it with mtail to generate metrics from log messages
15:28:44 <dougsz> Currently we use Prometheus as the TSDB, no Influx yet
15:30:02 <dougsz> We use kolla-ansible for the deployment - there are quite a few exporters included in that out of the box
15:31:33 <witek> yes, for the collector part we should advertise the monasca-agent Prometheus plugin better
15:32:07 <witek> thanks dougsz
15:32:29 <dougsz> +1 - I think that's a big win - Prometheus exporters are generally pretty up-to-date and it's great we can take advantage of them from the Monasca Agent.
15:32:51 <witek> bandorf has commented on the delay until the alarm gets triggered
15:32:56 <witek> is that an issue?
15:33:15 <dougsz> I think it's a good point.
15:33:56 <witek> is it a requirement for anyone?
15:34:29 <bandorf> I had a brief discussion with Cristiano (Product Management) about this. His opinion was: In a typical OpenStack environment, it should be OK. In other scenarios (IoT-demo-fire alarm) it is not.
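Illustrative sketch (not part of the log): the "remote read" bandorf and witek discuss above would be configured in prometheus.yml; rendered here in Python with PyYAML to show the shape of the stanza. The InfluxDB host, port and database name are made up; the /api/v1/prom/read endpoint is the Prometheus-compatible read API documented for InfluxDB 1.x, and read_recent is a standard remote_read option.

    import yaml

    # Hypothetical remote_read stanza pointing Prometheus at InfluxDB's
    # Prometheus-compatible read endpoint (InfluxDB 1.x).
    remote_read_config = {
        "remote_read": [
            {
                "url": "http://influxdb:8086/api/v1/prom/read?db=monasca",
                "read_recent": True,  # also query the remote store for recent samples
            }
        ]
    }

    print(yaml.safe_dump(remote_read_config, sort_keys=False))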
15:34:38 <dougsz> Generally we haven't used the buffering capabilities of Kafka too much, but it's slightly concerning that alarms could stop working if there was a large burst of metrics.
15:34:53 <joadavis> may depend on use case. Some of the auto-scaling/self-healing may want faster alarming
15:36:14 <joadavis> to reduce downtimes and interruptions
15:37:51 <witek> I think a streaming-based implementation would be much more complicated, requiring knowledge of Kafka Streams or Apache Storm
15:38:09 <witek> or not scalable, like monasca-aggregator
15:39:35 <witek> the only way to scale the aggregator is to shard the data and consume from different Kafka topics
15:39:52 <witek> which is also a valid approach after all
15:40:52 <witek> I have one other concern about a Prometheus-based setup
15:41:24 <witek> Prometheus defines all its alerting rules and notifications via config files
15:41:38 <witek> there is no API for setting them
15:41:58 <witek> only a query API to get the current configuration
15:42:12 <joadavis> yeah, that is a concern, especially if we do an HA setup (keeping the config files in sync)
15:42:55 <joadavis> does changing a rule then require restarting the Prometheus service?
15:43:03 <witek> reloading
15:44:32 <witek> ok, let's sum up what we have on advantages:
15:45:08 <witek> * great community eco-system with many integrations
15:45:23 <witek> * very flexible alerting rules
15:45:47 <witek> * and a query language for visualisations
15:46:57 <witek> * easy deployment
15:47:31 <witek> anything else?
15:48:02 <witek> disadvantages:
15:48:03 <dougsz> * could also monitor the monasca components directly? e.g. alert if influxdb goes down
15:49:21 <witek> yes, I'm not sure if that's Prometheus specific
15:49:23 <joadavis> disadvantage: * potentially large footprint and resource usage
15:50:26 <joadavis> disadvantage: * no guaranteed delivery of metrics (requirement for billing systems, not as much a concern for alerting)
15:50:41 <witek> * remote read requires getting complete data chunks from InfluxDB for every evaluation
15:51:02 <joadavis> disadvantage: * no native HA support, requires work to design
15:51:04 <dougsz> disadvantage: * HA model for Prometheus server isn't totally clear (to me at least)
15:51:20 <witek> joadavis: well, with Kafka and InfluxDB we do get guaranteed delivery
15:52:04 <dougsz> disadvantage: * alerting chain is even more complex, e.g. Monasca API -> Kafka -> Persister -> Influx -> Prometheus -> Alertmanager
15:52:11 <bandorf> disadvantage: * longer latency until the alarm gets fired
15:53:25 <bandorf> unknown: * impact of 'remote read to influxdb'
15:53:35 <witek> I would also argue with the HA point, it's the same model as for InfluxDB, and we can use the API and Kafka to help make it better
15:54:25 <witek> disadvantage: no API for alerting rules and notifications, config-based operation
15:54:43 <joadavis> I have a question about whether this puts Cassandra out of our design, but we are short on time so we can save that for another day
15:55:46 <witek> for this setup, we could not use Cassandra, it does not support remote read
15:56:14 <witek> OK, let's cut it here for now
15:56:28 <witek> let's quickly go through the other topics:
15:56:47 <witek> #topic Retirement of Openstack Ansible Monasca roles
15:56:55 <witek> http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004610.html
15:57:04 <witek> guimaluf: are you around?
15:57:33 <witek> unfortunately I don't know anyone using OSA
15:58:17 <witek> #topic Telemetry discussion
15:58:23 <witek> http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004851.html
15:58:52 <witek> there was a quick meeting of the Telemetry project yesterday
15:58:59 <witek> with the new PTL
15:59:26 <witek> after nobody had stepped up for the PTL role in Train
16:00:13 <witek> anyway, they have considered whether they should continue to rely on Gnocchi or search for alternatives
16:00:32 <joadavis> I want us to have a good response for that
16:00:51 <joadavis> I need to write a thoughtful email back and recommend monasca-ceilometer :)
16:01:12 <witek> as Mark has written in his email, it would be good to maintain just one monitoring project in OpenStack
16:01:26 <dougsz> was just thinking about ceilosca
16:01:33 <joadavis> but we could also have larger discussions about where the monasca agent and ceilometer agent overlap and how to make mon-agent cover all
16:03:15 <witek> joadavis: do we want to sync about the answer to the mailing list?
16:03:35 <joadavis> sure. I can write a draft and send it to you, or you can
16:03:56 <witek> OK, I'll ping you offline
16:04:02 <joadavis> with these kinds of questions I start thinking in pictures, but that is hard to do in text emails
16:04:11 <witek> #topic PTG
16:04:32 <witek> we have a conflict with the self-healing session on the first day, Thursday
16:04:54 <witek> should we start our sessions on Friday?
16:05:02 <witek> and free the slot?
16:05:32 <dougsz> sounds sensible
16:05:53 <chaconpiza> +1
16:05:54 <witek> joadavis: chaconpiza ?
16:05:55 <Dobroslaw> I'm not sure if chaconpiza will be returning on Friday
16:06:22 <chaconpiza> I will come back on Saturday, I found a good connecting flight :)
16:06:29 <Dobroslaw> oh, great
16:06:30 <joadavis> I'm ok with that. I think one of our goals for this PTG should be working with other projects and SIGs
16:07:00 <witek> OK, thanks for joining today
16:07:05 <witek> and for the good discussion
16:07:26 <witek> next week I'm on vacation
16:07:46 <witek> so could someone else please start the meeting
16:08:12 <witek> all from me, bye
16:08:17 <dougsz> Thanks all, and have a good vacation
16:08:22 <Dobroslaw> bye
16:08:26 <joadavis> bye
16:08:29 <haru5ny> thank you, bye.
16:08:36 <chaconpiza> Ok, enjoy the vacation. Bye.
16:08:38 <witek> #endmeeting
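Illustrative sketch (appended after the log): witek's point in the monasca-thresh discussion was that Prometheus alerting rules live in config files, with only a query API to read them back and a reload to apply changes. A minimal Python sketch of that workflow follows; the Prometheus URL is hypothetical, /api/v1/rules is the standard rules query endpoint, and /-/reload only works when Prometheus is started with --web.enable-lifecycle (otherwise a SIGHUP is needed).

    import requests

    PROMETHEUS_URL = "http://localhost:9090"  # assumed endpoint

    # Read back the currently loaded alerting/recording rules (read-only API).
    rules = requests.get(f"{PROMETHEUS_URL}/api/v1/rules", timeout=10).json()
    for group in rules["data"]["groups"]:
        for rule in group["rules"]:
            print(group["name"], "-", rule["name"])

    # After editing the rule files on disk, trigger a config reload
    # (requires --web.enable-lifecycle; no full restart needed).
    requests.post(f"{PROMETHEUS_URL}/-/reload", timeout=10).raise_for_status()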