14:05:47 <tobberydberg> #startmeeting publiccloud_wg
14:05:48 <openstack> Meeting started Tue May 28 14:05:47 2019 UTC and is due to finish in 60 minutes.  The chair is tobberydberg. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:05:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:05:53 <openstack> The meeting name has been set to 'publiccloud_wg'
14:06:45 <ncastele> so what do we want to achieve by the end of this hour?
14:06:50 <tobberydberg> So, continue the discussions from last week
14:07:08 <tobberydberg> Wrap up from last week
14:07:37 <tobberydberg> we agreed pretty much on that a first good step (phase 1) would be to focus on collecting data and storing the raw data
14:07:56 <tobberydberg> #link https://etherpad.openstack.org/p/publiccloud-sig-billing-implementation-proposal
14:08:05 <ncastele> yep, then we discuss about prometheus as a possible technical solution
14:08:16 <tobberydberg> Yes
14:08:41 <tobberydberg> I still like that idea .... have extremely limited experience though
14:09:05 <ncastele> that's my concern, I'm not familiar enough with prometheus to feel comfortable about this solution
14:09:39 <ncastele> we should probably go deeper into our needs for collecting and storing data, and challenge those needs with someone who has a better overview/understanding of prometheus
14:09:42 <gtema> I personally think this is kind of a misuse of an existing solution for a different purpose
14:09:56 <tobberydberg> I would say, we should definitely first of all find the measurements that we need, and then choose technology that can solve that
14:09:57 <witek> I think we can split the discussion into two parts, collection and storage
14:10:13 <tobias-urdin> prometheus doesn't consider itself reliable storage for data used for e.g. billing, according to their docs iirc
14:10:21 <gtema> ++
14:10:25 <tobberydberg> agree with that witek
14:10:37 <ncastele> +1
14:10:57 <tobberydberg> tobias-urdin exporting the data to another storage backend?
14:11:22 <tobberydberg> #link https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
14:11:28 <tobias-urdin> you mean using prometheus for just the scraping of data and then storing it some other place, might be an idea
14:11:39 <tobberydberg> yes
14:11:44 <gtema> tobberydberg - nope, the purpose is totally different, where losing a few measurements is not a big deal
14:12:05 <tobias-urdin> personally I don't like scraping, first because you have to specify what to scrape, and also during issues there is no queue-like behavior where you can retrieve data that you missed
14:12:14 <tobberydberg> (TBH ... ceilometer never gave us that reliability either ;-) )
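A minimal sketch of the "use Prometheus only for scraping, keep billing data elsewhere" idea discussed above: periodically pull the raw samples out over Prometheus's query_range HTTP API and write them to a durable store. The Prometheus URL, the metric name and the store_raw backend are assumptions for illustration only.

    import time
    import requests

    PROM_URL = "http://prometheus:9090"          # assumed Prometheus endpoint
    QUERY = "libvirt_cpu_time"                   # assumed exporter metric name

    def pull_last_hour(now=None):
        """Fetch one hour of raw samples for QUERY via /api/v1/query_range."""
        now = now or time.time()
        resp = requests.get(
            f"{PROM_URL}/api/v1/query_range",
            params={
                "query": QUERY,
                "start": now - 3600,
                "end": now,
                "step": "60s",
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["data"]["result"]     # list of {metric, values}

    def store_raw(series):
        """Placeholder: write raw samples to whatever backend holds billing data."""
        for s in series:
            for ts, value in s["values"]:
                print(s["metric"], ts, value)    # replace with a real write

    if __name__ == "__main__":
        store_raw(pull_last_hour())

The alternative is Prometheus's built-in remote_write integration (see the remote endpoints and storage link above), which streams samples to a supported backend instead of polling them back out.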
14:13:24 <gtema> I prefer exporting data directly to any TSDB
14:13:29 <tobias-urdin> +1
14:13:52 <tobias-urdin> on that, i like the approach of having a tested solution do the collecting part and just writing integrations for openstack
14:14:12 <tobberydberg> you mean more the setup of ceilometer but with another storage?
14:14:19 <tobias-urdin> maybe for hypervisor based data, scraping is optimal, for the reasons mnaser said about scaling
14:14:45 <tobias-urdin> which is pretty much what ceilometer does today
14:14:54 <tobias-urdin> central polling and agent polling on compute nodes
14:15:05 <tobberydberg> right
14:15:37 <ncastele> can we challenge the usage of a TSDB for our purpose a bit? I know that it seems obvious, but we are already using a TSDB on our side and it has some limitations
14:16:20 <witek> ncastele: what do you mean?
14:16:48 <tobias-urdin> what are you running? imo billing and tsdb doesn't necessarily need to be related other than we aggregate and save metrics to billing
14:17:44 <ncastele> we are using a TSDB to push heartbeats of our instances for billing purposes, and as we need a lot of information for each point (instance id, status, flavor, tenant, ...), with the volume of instances we are handling, it's hard to handle the load
14:17:45 <tobberydberg> "aggregate and save metrics to billing" that is what is important imo
14:18:09 <tobias-urdin> all depends on how much you need to "prove" i.e how much do you want to aggregate
14:18:35 <witek> ncastele: a common solution to that is to push measurements to a StatsD daemon which does aggregations and saves to the TSDB
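A minimal sketch of the StatsD approach witek mentions, assuming the Python "statsd" package; host, port and metric names are made up for illustration. The service pushes raw measurements to a local StatsD daemon, which aggregates before writing to the TSDB.

    import statsd

    client = statsd.StatsClient("localhost", 8125, prefix="billing")

    # e.g. one heartbeat per running instance, flavor encoded in the metric name
    client.incr("instance.heartbeat.m1_small")
    # or a gauge for instances currently running in a given project
    client.gauge("instances.running.project_abc", 42)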
14:19:09 <ncastele> when we are talking about metrics, which kind of metrics are you talking about? Just events (like creation, deletion, upsize, etc.), or more something like a heartbeat (a complete overview of everything that is running in the infrastructure)?
14:19:26 <gtema> and coming from billing in telecoms: storing only aggregates is a very bad approach
14:20:14 <tobberydberg> I mean, there are those two categories of metrics ... some that are event driven only and those that are more "scraping/heartbeat" driven
14:20:16 <gtema> always the raw values need to be stored, and later additionally aggregates for "invoicing purposes"
14:21:29 <tobias-urdin> fwiw metrics when i speak is volume-based so nothing with events, we pretty much only care about usage/volume and not events
14:22:51 <ncastele> raw data for usage metrics could be pretty huge given the number of instances we are handling, tbh, even in a tsdb storage
14:22:57 <witek> gtema: that's a different type of aggregate
14:25:21 <tobberydberg> for example tobias-urdin ... how would you collect data for a virtual router? Usage? Or on event basis?
14:26:32 <tobias-urdin> do you want data from the virtual router (bandwidth for example) or just the amount of virtual routers (or interfaces etc)?
14:26:36 <ncastele> it depends on your business model around virtual routers: do you plan to charge for the virtual router resource, or for their traffic?
14:26:37 <tobias-urdin> we go by the second one
14:27:34 <ncastele> bandwidth should always be billed (so stored/aggregated) by usage (number of GiB)
14:27:43 <ncastele> for instances, we can just bill the number of hours
14:28:13 <witek> btw. here the list of metrics provided by OVS Monasca plugin
14:28:18 <witek> https://opendev.org/openstack/monasca-agent/src/branch/master/docs/Ovs.md
14:28:49 <ncastele> imo those are two different ways of collecting, we don't have to use the same algorithm/backend to store and collect hours of instances as GiB of bandwidth
14:28:51 <tobberydberg> both, bandwidth is of course usage ... but the existence of a router is something else
14:29:47 <tobias-urdin> ncastele: +1 i'd prefer a very good distinction between them both because they can easily rack up a lot of space, especially if you want raw values
14:30:58 <ncastele> hours of instances/routers/etc. should not take a lot of space imo: we "just" need to store creation and deletion date
14:31:00 <tobberydberg> +1
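A rough sketch of the "just store creation and deletion date" idea above: billable instance (or router) hours can be derived from two timestamps clipped to the invoicing period. All names and values below are illustrative.

    from datetime import datetime, timedelta
    from math import ceil

    def billable_hours(created_at, deleted_at, period_start, period_end):
        """Hours the resource existed within the billing period (rounded up)."""
        start = max(created_at, period_start)
        end = min(deleted_at or period_end, period_end)
        if end <= start:
            return 0
        return ceil((end - start) / timedelta(hours=1))

    # Example: instance created mid-month, still running at the end of the period
    print(billable_hours(
        created_at=datetime(2019, 5, 12, 9, 30),
        deleted_at=None,
        period_start=datetime(2019, 5, 1),
        period_end=datetime(2019, 6, 1),
    ))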
14:31:45 <tobias-urdin> regarding metrics that we polled each minute, they easily racked up to like 1 TB in a short timespan before we limited our scope
14:31:58 <tobberydberg> I mean, router existence can be aggregated as usage as well, but the calculation of that can easily be done via events as well, with less raw data
14:31:59 <tobias-urdin> then we swapped to gnocchi and we use pretty much nothing now since we can aggregate on the interval we want
14:34:49 <tobberydberg> so you are using ceilometer for all the collection of data today tobias-urdin ?
14:35:05 <tobias-urdin> yes
14:35:43 <tobberydberg> with gnocchi then, and that works well for you for all kind of resources?
14:36:22 <ncastele> same on our side: ceilometer for data collection, then some mongo db for aggregating, and long term storage in a postgresql and a tsdb (and it's working, but we reached some limitations in this architecture)
14:37:49 <witek> ncastele: have you looked at Monasca?
14:38:19 <tobias-urdin> gnocchi is ok, but imo both ceilometer and gnocchi became more troublesome to use, could be simplified a little. our third-party billing engine also does some instance hour lookups against nova's simpleusage api
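For reference, a hedged sketch of the kind of instance hour lookup mentioned above, against Nova's os-simple-tenant-usage API via openstacksdk; the cloud name and project id are placeholders.

    import openstack

    conn = openstack.connect(cloud="mycloud")   # clouds.yaml entry (assumed)
    project_id = "PROJECT_ID"                   # placeholder

    # GET the per-instance usage hours for the billing period
    resp = conn.compute.get(
        f"/os-simple-tenant-usage/{project_id}"
        "?start=2019-05-01T00:00:00&end=2019-06-01T00:00:00"
    )
    usage = resp.json()["tenant_usage"]
    print(usage["total_hours"])
    for server in usage.get("server_usages", []):
        print(server["name"], server["flavor"], server["hours"])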
14:39:00 <ncastele> witek nope, not yet unfortunately :/
14:40:59 <tobberydberg> So we all have somewhat different setups and preferences when it comes to the collection AND storage parts. (just trying to get some kind of view of the situation here)
14:41:21 <ncastele> the approach we were thinking of before discovering this working group, to achieve per second billing, was just some dirty SQL queries on the nova database to collect usage. The main issue with this approach is that it needs a specific implementation for each data collection
14:42:44 <tobias-urdin> ncastele: what kind of usage do you want to pull from the nova db?
14:42:49 <tobberydberg> yea, that can probably "work" for instances, but definitely not for neutron resources since they are deleted from the database
14:43:14 <ncastele> tobberydberg yes. We should probably take time, for each of us, to define our needs regarding collecting so we will be able to address easier each of those needs with a solution
14:43:14 <tobberydberg> don't think that is the way to solve it though :-)
14:43:32 <ncastele> don't think either :)
14:43:35 <tobberydberg> I believe so too
14:44:12 <witek> in the long term, I think the services should instrument their code to provide application specific metrics, these could be scraped or pushed to the monitoring system depending on a given architecture
14:44:13 <ncastele> tobias-urdin: we wanted to pull instance id, flavor, start, stop, because that's 90% of what we need to bill instances
14:44:57 <witek> the code instrumentation is a flexible and future-proof approach, the monitoring will evolve together with the application
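A small sketch of the in-service instrumentation witek describes, using the prometheus_client library as one possible example; the metric names and values are made up.

    import time
    from prometheus_client import Counter, Gauge, start_http_server

    INSTANCES_RUNNING = Gauge(
        "compute_instances_running", "Instances currently running on this host")
    API_REQUESTS = Counter(
        "api_requests_total", "API requests served", ["method"])

    if __name__ == "__main__":
        start_http_server(8000)           # expose /metrics for scraping
        while True:
            INSTANCES_RUNNING.set(12)     # placeholder value
            API_REQUESTS.labels(method="GET").inc()
            time.sleep(15)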
14:45:43 <tobberydberg> My suggestion is that we get a summarised list of the metrics we need to measure, it is not all about instances, and how these can be measured (scraping or what not)
14:46:22 <tobberydberg> Do you think that is a potential way forward? I'm open for any suggestions
14:47:09 <ncastele> that's a good start. We will not cover all resources/services, but that's a good way to focus on those we need to go forward
14:47:13 <tobberydberg> witek Agree that would be the best solution, but I'm pretty sure that won't happen any time soon
14:48:08 <witek> tobberydberg: it could be the mission of this group to drive and coordinate, there are other groups interested as well, like e.g. self-healing
14:49:48 <tobberydberg> witek absolutely, might be that we come to that conclusion after all
14:50:17 <tobberydberg> so, added this section "Metrics we need" under the section "Limitation of scope"
14:50:21 <witek> sounds good
14:50:50 <ncastele> +1
14:51:04 <ncastele> can we plan to fill it for the next meeting ?
14:51:17 <tobberydberg> Would be good if all can try to identify and contribute to this section until next meeting, that will be next Thursday at the same time (1400 UTC) in this channel
14:51:34 <tobberydberg> you were quicker than me ncastele :-)
14:52:02 <ncastele> :)
14:52:24 <tobias-urdin> cool, sounds like good next step
14:55:03 <tobberydberg> added some examples suggesting how to structure that: resource, units, how to collect the data ... feel free to change and structure it in whatever way you feel works best
14:55:36 <tobberydberg> Anything else someone wants to raise before we close today's meeting?
14:56:31 <ncastele> not on my side :)
14:59:21 <witek> thanks tobberydberg
14:59:25 <witek> see you next week
14:59:29 <ncastele> thanks for this exceptional meeting
14:59:37 <ncastele> see u next week
14:59:40 <tobias-urdin> not really, we might want some heads up to see what the ceilometer-reboot comes up with
14:59:43 <tobias-urdin> thanks tobberydberg
14:59:46 <tobberydberg> thanks for today folks! Talk next week!
14:59:59 <tobberydberg> indeed
15:00:08 <tobberydberg> #endmeeting