15:00:52 <witek> #startmeeting monasca
15:00:52 <dougsz> hello all
15:00:53 <openstack> Meeting started Wed Apr 10 15:00:52 2019 UTC and is due to finish in 60 minutes. The chair is witek. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:57 <openstack> The meeting name has been set to 'monasca'
15:00:58 <witek> hi dougsz
15:01:17 <chaconpiza> Hi
15:01:23 <witek> hi chaconpiza
15:01:36 <bandorf> hi, everybody
15:01:53 <witek> hi
15:02:00 <witek> agenda for today:
15:02:03 <witek> https://etherpad.openstack.org/p/monasca-team-meeting-agenda
15:02:24 <witek> I don
15:02:35 <witek> sorry, let's start
15:02:47 <witek> #topic monasca-thresh replacement
15:03:08 <witek> I started thinking about how we can replace monasca-thresh
15:03:21 <witek> as we urgently need to replace it
15:04:08 <witek> so I looked at how Prometheus and Aodh do this
15:04:31 <witek> and they both don't work on streams but query from the DB
15:04:45 <witek> which is much easier to implement
15:05:18 <witek> and then I thought we could actually try to use what Prometheus offers
15:05:29 <witek> and came up with this document
15:05:35 <witek> https://docs.google.com/presentation/d/1tvllnWaridOG-t-qj9D2brddeQXsYNyZwoYUfby_3Ns/edit?usp=sharing
15:05:51 <witek> I've seen your first comments, thanks a lot for that
15:06:41 <witek> I'd like to start the discussion: what do you think of that approach? is it plausible?
15:07:33 <bandorf> maybe we can discuss smaller topics first? and then conclude whether it's plausible?
15:08:21 <witek> right, do we have to discuss if monasca-thresh should be replaced?
15:09:03 <chaconpiza> What about the upgrade from the current solution to the new one using Prometheus for current clients?
15:09:07 <Dobroslaw> hi
15:09:37 <joadavis> hi Dobroslaw
15:09:57 <witek> chaconpiza: you mean, what an operator would have to do to upgrade from one Monasca version to another?
15:10:05 <chaconpiza> yes
15:10:47 <bandorf> I propose to discuss this (migration) later, when a decision has been taken
15:11:13 <witek> the measurement schema would change, so although the data is saved in InfluxDB, some data migration would have to happen if the new functionality is required
15:11:15 <joadavis> well, if we keep the monasca api and just use prometheus for the thresholding and alarming, it might not be much change for a current client
15:11:22 <bandorf> Regarding your problem statement, Witek: I agree with topics 1, 2 and 5.
15:11:35 <bandorf> 4 (complex cluster): I can't really judge
15:12:22 <bandorf> topic 3: High resource consumption: This is certainly true. However, I'm not sure if this is caused by monasca itself or by storm
15:12:38 <Dobroslaw> I'm not sure if Prometheus actually will be lighter than storm...
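Illustrative sketch (not part of the log): a minimal Python example of the query-based evaluation model witek describes above, where a threshold is checked by periodically querying the TSDB instead of processing the metric stream. The Prometheus URL, the metric name cpu_idle_perc, the threshold and the interval are assumptions for illustration; only the /api/v1/query instant-query endpoint is standard Prometheus HTTP API.

    import time
    import requests

    # Assumed values for illustration only.
    PROMETHEUS_URL = "http://localhost:9090"
    RULE_EXPR = 'avg_over_time(cpu_idle_perc[10m]) < 10'  # example threshold expression
    EVAL_INTERVAL = 60  # seconds between evaluations

    def evaluate_once():
        # /api/v1/query is Prometheus' instant-query endpoint; a non-empty
        # result vector means the alarm condition currently holds.
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": RULE_EXPR}, timeout=10)
        resp.raise_for_status()
        for series in resp.json()["data"]["result"]:
            print("ALARM:", series["metric"], "value:", series["value"][1])

    while True:
        evaluate_once()
        time.sleep(EVAL_INTERVAL)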
15:13:33 <joadavis> yes, would definitely want to qualify performance
15:13:37 <joadavis> and footprint
15:13:38 <Dobroslaw> it would be nice if we could find someone using prometheus in production who could tell us how many resources it uses, on average and with data spikes
15:14:09 <Dobroslaw> I found quite a few people complaining about memory usage
15:14:22 <dougsz> Dobroslaw: We're using it, I haven't benchmarked it yet, but I've frequently seen it at the top of `top`
15:15:00 <bandorf> Using "remote read" from influx causes some further overhead - don't know to what extent
15:15:11 <Dobroslaw> and it doesn't have built-in max memory tuning options
15:15:52 <dougsz> In addition to extending the alarm expression language (#2) we also have a requirement to include metadata with alarms
15:15:59 <Dobroslaw> I think I linked to a discussion about it using 10x more memory per measurement...
15:17:04 <witek> dougsz: where does the metadata come from, and can that requirement be addressed with Prometheus?
15:17:11 <joadavis> I've talked to a few people who have the impression that Prometheus has a smaller footprint than Monasca, but I suspect that is relative to their install (or just marketing speak)
15:18:36 <dougsz> witek: For example, we want to create a Jira ticket for every log error message. The metadata would include a snippet of the error message. Not sure if it can be done with Prometheus either. I think the approach would be to use something like mtail to make logs scrapable.
15:18:55 <Dobroslaw> it's an invasive change, HA will need to be handled differently, and I'm not sure how to test it quickly with monasca
15:20:47 <witek> Dobroslaw: what would be an alternative?
15:21:45 <Dobroslaw> unfortunately I don't have an alternative, just bringing up an important point: monasca would most likely be installed on the same machine as prometheus
15:21:58 <Dobroslaw> and sharing resources with it
15:22:38 <joadavis> we may need a POC to show it can be done...
15:22:59 <witek> remote read is for sure an important aspect, Prometheus normally makes use of built-in aggregations, and in the proposed setup the calculation would have to be done on the complete dataset
15:24:12 <witek> complete dataset for a given alerting rule only, of course, normally the last 10 minutes of data or so
15:25:36 <witek> dougsz: how do you use Prometheus, do you have many alerting rules? how much data?
15:27:27 <dougsz> We aren't using it at scale yet and we don't have a large number of alerting rules.
15:27:53 <dougsz> We've combined it with mtail to generate metrics from log messages
15:28:44 <dougsz> Currently we use Prometheus as the TSDB, no Influx yet
15:30:02 <dougsz> We use kolla-ansible for the deployment - there are quite a few exporters included in that out of the box
15:31:33 <witek> yes, for the collector part we should advertise the monasca-agent Prometheus plugin better
15:32:07 <witek> thanks dougsz
15:32:29 <dougsz> +1 - I think that's a big win - Prometheus exporters are generally pretty up-to-date and it's great we can take advantage of them from the Monasca Agent.
15:32:51 <witek> bandorf has commented on the delay until the alarm gets triggered
15:32:56 <witek> is that an issue?
15:33:15 <dougsz> I think it's a good point.
15:33:56 <witek> is it a requirement for anyone?
15:34:29 <bandorf> I had a brief discussion with Cristiano (Product Management) about this. His opinion was: In a typical OpenStack environment, it should be OK. In other scenarios (IoT-demo-fire alarm) it is not.
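Illustrative sketch (not part of the log): the "remote read" bandorf and witek discuss above would be configured in prometheus.yml; rendered here in Python with PyYAML to show the shape of the stanza. The InfluxDB host, port and database name are made up; the /api/v1/prom/read endpoint is the Prometheus-compatible read API documented for InfluxDB 1.x, and read_recent is a standard remote_read option.

    import yaml

    # Hypothetical remote_read stanza pointing Prometheus at InfluxDB's
    # Prometheus-compatible read endpoint (InfluxDB 1.x).
    remote_read_config = {
        "remote_read": [
            {
                "url": "http://influxdb:8086/api/v1/prom/read?db=monasca",
                "read_recent": True,  # also query the remote store for recent samples
            }
        ]
    }

    print(yaml.safe_dump(remote_read_config, sort_keys=False))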
15:34:38 <dougsz> Generally we haven't used the buffering capabilities of Kafka too much, but it's slightly concerning that alarms could stop working if there was a large burst of metrics.
15:34:53 <joadavis> may depend on use case. Some of the auto-scaling/self-healing may want faster alarming
15:36:14 <joadavis> to reduce downtimes and interruptions
15:37:51 <witek> I think a streaming-based implementation would be much more complicated, requiring knowledge of Kafka Streams or Apache Storm
15:38:09 <witek> or not scalable, like monasca-aggregator
15:39:35 <witek> the only way to scale the aggregator is to shard the data and consume from different Kafka topics
15:39:52 <witek> which is also a valid approach after all
15:40:52 <witek> I have one other concern about a Prometheus-based setup
15:41:24 <witek> Prometheus defines all its alerting rules and notifications via config files
15:41:38 <witek> there is no API for setting them
15:41:58 <witek> only a query API to get the current configuration
15:42:12 <joadavis> yeah, that is a concern, especially if we do an HA setup (keeping the config files in sync)
15:42:55 <joadavis> does changing a rule then require restarting the Prometheus service?
15:43:03 <witek> reloading
15:44:32 <witek> ok, let's sum up what we have on advantages:
15:45:08 <witek> * great community eco-system with many integrations
15:45:23 <witek> * very flexible alerting rules
15:45:47 <witek> * and a query language for visualisations
15:46:57 <witek> * easy deployment
15:47:31 <witek> anything else?
15:48:02 <witek> disadvantages:
15:48:03 <dougsz> * could also monitor the monasca components directly? e.g. alert if influxdb goes down
15:49:21 <witek> yes, I'm not sure if that's Prometheus specific
15:49:23 <joadavis> disadvantage: * potentially large footprint and resource usage
15:50:26 <joadavis> disadvantage: * no guaranteed delivery of metrics (requirement for billing systems, not as much a concern for alerting)
15:50:41 <witek> * remote read requires getting complete data chunks from InfluxDB for every evaluation
15:51:02 <joadavis> disadvantage: * no native HA support, requires work to design
15:51:04 <dougsz> disadvantage: * HA model for Prometheus server isn't totally clear (to me at least)
15:51:20 <witek> joadavis: well, with Kafka and InfluxDB we do get guaranteed delivery
15:52:04 <dougsz> disadvantage: * alerting chain is even more complex, e.g. Monasca API -> Kafka -> Persister -> Influx -> Prometheus -> Alertmanager
15:52:11 <bandorf> disadvantage: * longer latency until the alarm gets fired
15:53:25 <bandorf> unknown: * impact of 'remote read to influxdb'
15:53:35 <witek> I would also argue with the HA point, it's the same model as for InfluxDB, and we can use the API and Kafka to help make it better
15:54:25 <witek> disadvantage: no API for alerting rules and notifications, config-based operation
15:54:43 <joadavis> I have a question about whether this puts Cassandra out of our design, but we are short on time so we can save that for another day
15:55:46 <witek> for this setup, we could not use Cassandra, it does not support remote read
15:56:14 <witek> OK, let's cut it here for now
15:56:28 <witek> let's quickly go through the other topics:
15:56:47 <witek> #topic Retirement of Openstack Ansible Monasca roles
15:56:55 <witek> http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004610.html
15:57:04 <witek> guimaluf: are you around?
15:57:33 <witek> unfortunately I don't know anyone using OSA
15:58:17 <witek> #topic Telemetry discussion
15:58:23 <witek> http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004851.html
15:58:52 <witek> there was a quick meeting of the Telemetry project yesterday
15:58:59 <witek> with the new PTL
15:59:26 <witek> after nobody had stepped up for the PTL role in Train
16:00:13 <witek> anyway, they have considered whether they should continue to rely on Gnocchi or search for alternatives
16:00:32 <joadavis> I want us to have a good response for that
16:00:51 <joadavis> I need to write a thoughtful email back and recommend monasca-ceilometer :)
16:01:12 <witek> as Mark has written in his email, it would be good to maintain just one monitoring project in OpenStack
16:01:26 <dougsz> was just thinking about ceilosca
16:01:33 <joadavis> but we could also have larger discussions about where the monasca agent and ceilometer agent overlap and how to make mon-agent cover all
16:03:15 <witek> joadavis: do we want to sync about the answer to the mailing list?
16:03:35 <joadavis> sure. I can write a draft and send it to you, or you can
16:03:56 <witek> OK, I'll ping you offline
16:04:02 <joadavis> with these kinds of questions I start thinking in pictures, but that is hard to do in text emails
16:04:11 <witek> #topic PTG
16:04:32 <witek> we have a conflict with the self-healing session on the first day, Thursday
16:04:54 <witek> should we start our sessions on Friday?
16:05:02 <witek> and free the slot?
16:05:32 <dougsz> sounds sensible
16:05:53 <chaconpiza> +1
16:05:54 <witek> joadavis: chaconpiza ?
16:05:55 <Dobroslaw> I'm not sure if chaconpiza will be returning on Friday
16:06:22 <chaconpiza> I will come back on Saturday, I found a good connecting flight :)
16:06:29 <Dobroslaw> oh, great
16:06:30 <joadavis> I'm ok with that. I think one of our goals for this PTG should be working with other projects and SIGs
16:07:00 <witek> OK, thanks for joining today
16:07:05 <witek> and for the good discussion
16:07:26 <witek> next week I'm on vacation
16:07:46 <witek> so could someone else please start the meeting
16:08:12 <witek> all from me, bye
16:08:17 <dougsz> Thanks all, and have a good vacation
16:08:22 <Dobroslaw> bye
16:08:26 <joadavis> bye
16:08:29 <haru5ny> thank you, bye.
16:08:36 <chaconpiza> Ok, enjoy the vacation. Bye.
16:08:38 <witek> #endmeeting
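Illustrative sketch (appended after the log): witek's point in the monasca-thresh discussion was that Prometheus alerting rules live in config files, with only a query API to read them back and a reload to apply changes. A minimal Python sketch of that workflow follows; the Prometheus URL is hypothetical, /api/v1/rules is the standard rules query endpoint, and /-/reload only works when Prometheus is started with --web.enable-lifecycle (otherwise a SIGHUP is needed).

    import requests

    PROMETHEUS_URL = "http://localhost:9090"  # assumed endpoint

    # Read back the currently loaded alerting/recording rules (read-only API).
    rules = requests.get(f"{PROMETHEUS_URL}/api/v1/rules", timeout=10).json()
    for group in rules["data"]["groups"]:
        for rule in group["rules"]:
            print(group["name"], "-", rule["name"])

    # After editing the rule files on disk, trigger a config reload
    # (requires --web.enable-lifecycle; no full restart needed).
    requests.post(f"{PROMETHEUS_URL}/-/reload", timeout=10).raise_for_status()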