14:05:47 <tobberydberg> #startmeeting publiccloud_wg
14:05:48 <openstack> Meeting started Tue May 28 14:05:47 2019 UTC and is due to finish in 60 minutes. The chair is tobberydberg. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:05:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:05:53 <openstack> The meeting name has been set to 'publiccloud_wg'
14:06:45 <ncastele> so what do we want to achieve by the end of this hour?
14:06:50 <tobberydberg> So, continue the discussions from last week
14:07:08 <tobberydberg> Wrap up from last week
14:07:37 <tobberydberg> we pretty much agreed that a first good step (phase 1) would be to focus on collecting data and storing the raw data
14:07:56 <tobberydberg> #link https://etherpad.openstack.org/p/publiccloud-sig-billing-implementation-proposal
14:08:05 <ncastele> yep, and then we discussed prometheus as a possible technical solution
14:08:16 <tobberydberg> Yes
14:08:41 <tobberydberg> I still like that idea .... have extremely limited experience though
14:09:05 <ncastele> that's my concern, I'm not deep enough into prometheus to feel comfortable about this solution
14:09:39 <ncastele> we should probably go deeper into our needs for collecting and storing data, and challenge those needs with someone who has a better overview/understanding of prometheus
14:09:42 <gtema> I personally think this is kind of a misuse of an existing solution for different purposes
14:09:56 <tobberydberg> I would say we should definitely first of all find the measurements that we need, and then choose a technology that can solve that
14:09:57 <witek> I think we can split the discussion into two parts, collection and storage
14:10:13 <tobias-urdin> prometheus doesn't consider itself reliable storage for data used for e.g. billing, according to their page iirc
14:10:21 <gtema> ++
14:10:25 <tobberydberg> agree with that witek
14:10:37 <ncastele> +1
14:10:57 <tobberydberg> tobias-urdin: exporting the data to another storage backend?
14:11:22 <tobberydberg> #link https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
14:11:28 <tobias-urdin> you mean using prometheus for just the scraping of data and then storing it some other place, might be an idea
14:11:39 <tobberydberg> yes
14:11:44 <gtema> tobberydberg: nope, the purpose is totally different, where losing a few measurements is not a big deal
14:12:05 <tobias-urdin> personally I don't like scraping first because you have to specify what to scrape, and during issues there is no queue-like behavior where you can retrieve data that you missed
14:12:14 <tobberydberg> (TBH ... ceilometer never gave us that reliability either ;-) )
14:13:24 <gtema> I prefer exporting data directly to any TSDB
14:13:29 <tobias-urdin> +1
14:13:52 <tobias-urdin> on that, i like the approach of having a tested solution do the collecting part and just writing integrations for openstack
14:14:12 <tobberydberg> you mean more the setup of ceilometer but with another storage?
14:14:19 <tobias-urdin> maybe for hypervisor-based data, scraping is optimal, for the reasons mnaser said about scaling
14:14:45 <tobias-urdin> which is pretty much what ceilometer does today
14:14:54 <tobias-urdin> central polling and agent polling on compute nodes
14:15:05 <tobberydberg> right
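A minimal sketch (not anything agreed above) of the "poll, then export raw points straight to a TSDB" pattern gtema and tobias-urdin describe, assuming openstacksdk for the compute polling; write_point(), the cloud name "mycloud", and the 60-second interval are hypothetical placeholders for whichever backend and cadence the group settles on.

    import time
    import openstack  # openstacksdk

    def write_point(measurement, tags, value, timestamp):
        # Hypothetical TSDB write; swap in InfluxDB, Monasca, Prometheus
        # remote write, or any other backend. Raw points are kept unaggregated.
        print(measurement, tags, value, timestamp)

    def poll_instances(conn):
        # Emit one raw heartbeat point per server across all projects.
        now = time.time()
        for server in conn.compute.servers(all_projects=True):
            write_point("instance.heartbeat",
                        {"instance_id": server.id,
                         "project_id": server.project_id,
                         "status": server.status},
                        1, now)

    if __name__ == "__main__":
        conn = openstack.connect(cloud="mycloud")  # cloud name is a placeholder
        while True:
            poll_instances(conn)
            time.sleep(60)  # polling interval chosen arbitrarily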
14:15:37 <ncastele> can we challenge a bit the usage of a TSDB for our purpose? I know that it seems obvious, but we are already using a TSDB on our side and it has some limitations
14:16:20 <witek> ncastele: what do you mean?
14:16:48 <tobias-urdin> what are you running? imo billing and a tsdb don't necessarily need to be related, other than that we aggregate and save metrics to billing
14:17:44 <ncastele> we are using a TSDB to push heartbeats of our instances for billing purposes, and as we need a lot of information for each point (instance id, status, flavor, tenant, ...), with the volume of instances we are handling it's hard to handle the load
14:17:45 <tobberydberg> "aggregate and save metrics to billing" that is what is important imo
14:18:09 <tobias-urdin> it all depends on how much you need to "prove", i.e. how much you want to aggregate
14:18:35 <witek> ncastele: a common solution to that is to push measurements to a StatsD daemon which does the aggregations and saves to the TSDB
14:19:09 <ncastele> when we are talking about metrics, which kind of metrics are you talking about? Just events (like creation, deletion, upsize, etc.), or more something like a heartbeat (a complete overview of everything that is running in the infrastructure)?
14:19:26 <gtema> and coming from billing in telecoms: storing only aggregates is a very bad approach
14:20:14 <tobberydberg> I mean, there are those two categories of metrics ... some that are event driven only and those that are more "scraping/heartbeat" driven
14:20:16 <gtema> the raw values always need to be stored, with aggregates added later for "invoicing purposes"
14:21:29 <tobias-urdin> fwiw, when i say metrics i mean volume-based, so nothing to do with events; we pretty much only care about usage/volume and not events
14:22:51 <ncastele> raw usage data could be pretty huge given the number of instances we are handling, tbh, even in tsdb storage
14:22:57 <witek> gtema: that's a different type of aggregate
14:25:21 <tobberydberg> for example tobias-urdin ... how would you collect data for a virtual router? Usage? Or on an event basis?
14:26:32 <tobias-urdin> do you want data from the virtual router (bandwidth for example) or just the number of virtual routers (or interfaces etc)?
14:26:36 <ncastele> it depends on your business model around virtual routers: do you plan to charge for the virtual router resource, or for their traffic?
14:26:37 <tobias-urdin> we go by the second one
14:27:34 <ncastele> bandwidth should always be billed (so stored/aggregated) by usage (number of GiB)
14:27:43 <ncastele> for instances, we can just bill the number of hours
14:28:13 <witek> btw, here is the list of metrics provided by the OVS Monasca plugin
14:28:18 <witek> https://opendev.org/openstack/monasca-agent/src/branch/master/docs/Ovs.md
14:28:49 <ncastele> imo those are two different ways of collecting, we don't have to use the same algorithm/backend to store and collect hours of instances as GiB of bandwidth
14:28:51 <tobberydberg> both, bandwidth is of course usage ... but the existence of a router is something else
14:29:47 <tobias-urdin> ncastele: +1 i'd prefer a very clear distinction between the two because they can easily rack up a lot of space, especially if you want raw values
14:30:58 <ncastele> hours of instances/routers/etc. should not take a lot of space imo: we "just" need to store creation and deletion dates
14:31:00 <tobberydberg> +1
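A rough illustration of the two billing shapes discussed above: hours derived from nothing more than stored creation/deletion timestamps (event driven), versus bandwidth billed by aggregating raw byte counts into GiB. All names and values are invented for the example.

    from datetime import datetime, timezone

    def billable_hours(created_at, deleted_at=None, now=None):
        # Instance/router hours from just two stored events (create/delete).
        end = deleted_at or now or datetime.now(timezone.utc)
        return max((end - created_at).total_seconds(), 0) / 3600.0

    def billable_gib(byte_samples):
        # Bandwidth billed by usage: sum raw byte counts and convert to GiB.
        return sum(byte_samples) / (1024 ** 3)

    # Made-up example: an instance created at 10:00 UTC and deleted at 13:30 UTC
    created = datetime(2019, 5, 28, 10, 0, tzinfo=timezone.utc)
    deleted = datetime(2019, 5, 28, 13, 30, tzinfo=timezone.utc)
    print(billable_hours(created, deleted))              # 3.5
    print(billable_gib([5 * 1024 ** 3, 2 * 1024 ** 3]))  # 7.0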
14:31:45 <tobias-urdin> regarding metrics that we polled each minute, they easily racked up to like 1 TB in a short timespan before we limited our scope
14:31:58 <tobberydberg> I mean, router existence can be aggregated as usage as well, but the calculation of that can also easily be done via events, with less raw data
14:31:59 <tobias-urdin> then we swapped to gnocchi and we use pretty much nothing now since we can aggregate on the interval we want
14:34:49 <tobberydberg> so you are using ceilometer for all the collection of data today tobias-urdin?
14:35:05 <tobias-urdin> yes
14:35:43 <tobberydberg> with gnocchi then, and that works well for you for all kinds of resources?
14:36:22 <ncastele> same on our side: ceilometer for data collection, then some mongo db for aggregating, and long-term storage in postgresql and a tsdb (and it's working, but we reached some limitations in this architecture)
14:37:49 <witek> ncastele: have you looked at Monasca?
14:38:19 <tobias-urdin> gnocchi is ok, but imo both ceilometer and gnocchi became more troublesome to use, could be simplified a little. our third-party billing engine also does some instance hour lookups against nova's simple usage API
14:39:00 <ncastele> witek: nope, not yet unfortunately :/
14:40:59 <tobberydberg> So we all have somewhat different setups and preferences when it comes to the collection AND storage parts. (just trying to get some kind of view of the situation here)
14:41:21 <ncastele> the approach we were thinking of before discovering this working group, to achieve per-second billing, was just some dirty SQL queries on the nova database to collect usage. The main issue with this approach is that it needs a specific implementation for each data collection
14:42:44 <tobias-urdin> ncastele: what kind of usage do you want to pull from the nova db?
14:42:49 <tobberydberg> yeah, that can probably "work" for instances, but definitely not for neutron resources since they are deleted from the database
14:43:14 <ncastele> tobberydberg: yes. We should probably take time, for each of us, to define our needs regarding collecting so we will be able to address each of those needs with a solution more easily
14:43:14 <tobberydberg> don't think that is the way to solve it though :-)
14:43:32 <ncastele> don't think so either :)
14:43:35 <tobberydberg> I believe so too
14:44:12 <witek> in the long term, I think the services should instrument their code to provide application-specific metrics, these could be scraped or pushed to the monitoring system depending on the given architecture
14:44:13 <ncastele> tobias-urdin: we wanted to pull instance id, flavor, start, stop, because that's 90% of what we need to bill instances
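A minimal sketch of the instance-hour lookup against nova's simple usage API (os-simple-tenant-usage) that tobias-urdin mentions, returning roughly the instance id / flavor / start / stop data ncastele wants; the endpoint URL, token, and project id are placeholders and error handling is omitted.

    import requests

    NOVA = "https://compute.example.com/v2.1"  # placeholder nova endpoint
    TOKEN = "<keystone-token>"                 # placeholder auth token
    PROJECT = "<project-id>"                   # placeholder project id

    resp = requests.get(
        f"{NOVA}/os-simple-tenant-usage/{PROJECT}",
        headers={"X-Auth-Token": TOKEN},
        params={"start": "2019-05-01T00:00:00", "end": "2019-06-01T00:00:00"},
    )
    resp.raise_for_status()
    usage = resp.json()["tenant_usage"]
    for srv in usage.get("server_usages", []):
        # each entry carries instance_id, flavor, hours, started_at, ended_at
        print(srv["instance_id"], srv["flavor"], srv["hours"])
    print("total hours:", usage["total_hours"])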
14:44:57 <witek> the code instrumentation is a flexible and future-proof approach, the monitoring will evolve together with the application
14:45:43 <tobberydberg> My suggestion is that we get a summarised list of the metrics we need to measure, it is not all about instances, and how these can be measured (scraping or what not)
14:46:22 <tobberydberg> Do you think that is a potential way forward? I'm open to any suggestions
14:47:09 <ncastele> that's a good start. We will not cover all resources/services, but it's a good way to focus on those we need to move forward
14:47:13 <tobberydberg> witek: Agree that would be the best solution, but I'm pretty sure that won't happen any time soon
14:48:08 <witek> tobberydberg: it could be the mission of this group to drive and coordinate, there are other groups interested as well, like e.g. self-healing
14:49:48 <tobberydberg> witek: absolutely, might be that we come to that conclusion after all
14:50:17 <tobberydberg> so, added this section "Metrics we need" under the section "Limitation of scope"
14:50:21 <witek> sounds good
14:50:50 <ncastele> +1
14:51:04 <ncastele> can we plan to fill it in for the next meeting?
14:51:17 <tobberydberg> Would be good if everyone can try to identify and contribute to this section before the next meeting, which will be next Thursday at the same time (1400 UTC) in this channel
14:51:34 <tobberydberg> you were quicker than me ncastele :-)
14:52:02 <ncastele> :)
14:52:24 <tobias-urdin> cool, sounds like a good next step
14:55:03 <tobberydberg> added some examples suggesting how to structure that: resource, units, how to collect the data ... feel free to change and structure it in whatever way you feel works best
14:55:36 <tobberydberg> Anything else someone wants to raise before we close today's meeting?
14:56:31 <ncastele> not on my side :)
14:59:21 <witek> thanks tobberydberg
14:59:25 <witek> see you next week
14:59:29 <ncastele> thanks for this exceptional meeting
14:59:37 <ncastele> see you next week
14:59:40 <tobias-urdin> not really, we might want a heads-up to see what the ceilometer-reboot comes up with
14:59:43 <tobias-urdin> thanks tobberydberg
14:59:46 <tobberydberg> thanks for today folks! Talk next week!
14:59:59 <tobberydberg> indeed
15:00:08 <tobberydberg> #endmeeting
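For reference, one possible shape for the "Metrics we need" etherpad section mentioned at 14:55:03 (resource, unit, how to collect); the rows are purely illustrative examples, not an agreed list.

    # Illustrative only: a possible structure for the "Metrics we need" section.
    metrics_we_need = [
        {"resource": "instance",  "unit": "hours",     "collect": "events (create/delete/resize)"},
        {"resource": "router",    "unit": "hours",     "collect": "events (create/delete)"},
        {"resource": "bandwidth", "unit": "GiB",       "collect": "polled counters, aggregated"},
        {"resource": "volume",    "unit": "GiB-hours", "collect": "periodic usage polling"},
    ]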