15:02:03 <witek> #startmeeting monasca
15:02:04 <openstack> Meeting started Wed Feb 14 15:02:03 2018 UTC and is due to finish in 60 minutes. The chair is witek. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:02:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:02:08 <openstack> The meeting name has been set to 'monasca'
15:02:10 <pilgrimstack> ok good to know
15:02:23 <witek> hello everyone
15:02:28 <amofakhar> Hello
15:02:29 <sgrasley1> Hello
15:02:33 <kjansson> hello
15:02:34 <aadams> o/
15:02:39 <haruki> hello
15:02:51 <jamesgu> Hello
15:02:52 <cbellucci> hello
15:02:59 <witek> nice attendance today
15:03:12 <witek> aadams: can we start with your change?
15:03:15 <aadams> sure
15:03:27 <witek> #topic reject old metrics
15:04:07 <witek> so there was some discussion in review about whether we should reject at all
15:04:17 <witek> https://review.openstack.org/#/c/543054
15:04:19 <aadams> yes
15:04:40 <aadams> I think we should reject
15:04:53 <witek> does the threshold engine have problems handling measurements out of sync?
15:05:01 <aadams> yes
15:05:08 <aadams> it evaluates alarms based on timestamps
15:05:24 <amofakhar> why not reject those in the threshold engine?
15:05:29 <aadams> and if you are receiving metrics with wrong timestamps then alarms will be bad too
15:05:53 <aadams> that is certainly an option, but in the past we have done it in the api
15:06:43 <witek> seems indeed like it could be improved in the threshold engine
15:06:44 <aadams> are there any other questions?
15:06:48 <amofakhar> because if it is done in the api then we will lose some data which can be useful for other parts
15:06:53 <witek> but that is definitely more effort
15:06:59 <amofakhar> and also it makes the api slower
15:07:11 <aadams> how does it make the api slower?
15:07:41 <nseyvet_> How does an old timestamp for a metric generate an alarm?
15:08:10 <nseyvet_> Doesn't the threshold engine use some time window and reject events outside of it?
15:08:14 <aadams> if a timestamp should be for now but there is an NTP problem, alarms might flip or evaluate wrong
15:08:28 <nseyvet_> I don't understand
15:08:43 <aadams> the main issue here is timestamps that are wrong
15:08:50 <nseyvet_> no, they are correct
15:09:01 <nseyvet_> from the agent perspective they are fine
15:09:07 <amofakhar> and in a case where metrics pass the evaluation and come into the queue, but thresh is stopped for some reason and then started again, it will have the same problem as before
15:09:07 <aadams> sure
15:09:15 <nseyvet_> there might be network congestion
15:09:35 <nseyvet_> or those metrics are generated in the future on purpose, by a prediction engine for example
15:09:41 <nseyvet_> so the timestamp is fine
15:09:52 <aadams> my issue is when they are not fine
15:09:58 <nseyvet_> I don't understand why this timestamp troubles the engine?
15:09:58 <aadams> and there is an NTP problem
15:10:11 <aadams> because alarms are evaluated based on timestamps
15:10:19 <nseyvet_> but how do you know if it is NTP or a network problem?
15:10:24 <aadams> exactly
15:10:45 <aadams> I don't know about your use cases, but future metrics point to an NTP problem
15:10:46 <nseyvet_> if alarms are based on timestamps I don't see the problem
15:10:50 <nseyvet_> no
15:11:01 <nseyvet_> I predict a metric 10 min ahead
15:11:08 <nseyvet_> I push it into the Monasca API
15:11:16 <nseyvet_> perfectly legit use case
15:11:27 <aadams> well then, I suppose this patch should not be turned on for you
15:11:31 <nseyvet_> I still don't understand the problem with timestamps and alarms?
15:11:42 <aadams> In my use case, I do not change the metric's created-at timestamp
15:11:45 <nseyvet_> I don't see much point in this patch at all IMO
15:11:52 <aadams> and do not expect future timestamps
15:12:03 <nseyvet_> because of what?
15:12:22 <aadams> I do not expect future timestamps because I don't create metrics in the future
15:12:53 <aadams> a future timestamp to me is an NTP problem
15:13:33 <aadams> the good news is that if you are expecting future timestamps, the default behaviour does not change with this patch
15:13:43 <aadams> so you will not be affected
15:13:44 <nseyvet_> There is a large misunderstanding here. If the problem is in the threshold engine it should be fixed there. At this point I don't understand the problem. And assuming time is off due to NTP is plain wrong
15:13:52 <kjansson> won't this also obfuscate NTP or similar problems on the agent side? the story says we should make these problems visible and debuggable?
15:14:25 <witek> I think the key question at the moment is to understand why wrong (out of sync) timestamps are causing problems in thresh
15:14:27 <nseyvet_> Also, I am wondering: when the agent receives a 422, does it re-transmit?
15:14:41 <nseyvet_> or queue?
15:15:07 <aadams> what would you rather throw?
15:15:19 <nseyvet_> nothing, since it is fine IMO
15:15:31 <aadams> hmm
15:15:38 <nseyvet_> question 1: what is this solving?
15:15:58 <aadams> metrics with future timestamps are invalid in my use case
15:16:13 <aadams> I should be able to configure the api to agree with that
15:16:44 <aadams> I understand, in your use case they are valid, so don't configure it to be on
15:16:55 <nseyvet_> Should I expand that to assume I may want to reject any metrics that have negative values?
15:17:15 <nseyvet_> It sounds to me like a filter function
15:17:30 <aadams> it is not
15:17:46 <aadams> it rejects only invalid metrics, as the user defines them
15:17:58 <nseyvet_> it filters metrics based on timestamp
15:18:02 <aadams> ok
15:18:50 <nseyvet_> invalid is, in your situation, a timestamp in the future or in the past by 2 weeks. yes?
15:18:57 <aadams> yes
15:20:00 <nseyvet_> so the problem to solve is how to define "invalid" for a general use case and add it as general functionality in the API, IMO
15:20:12 <nseyvet_> It sounds like a filter
15:20:19 <aadams> ok
15:20:28 <nseyvet_> so I may want to filter on negative values for example
15:20:30 <nseyvet_> or NaN
15:20:32 <nseyvet_> or etc
15:20:33 <aadams> ok
15:20:39 <nseyvet_> or the timestamp being old
15:20:47 <aadams> what's your point
15:21:01 <nseyvet_> I don't think the API is the proper location for that function
15:21:29 <aadams> that's valid, I didn't add general filtering though
15:21:42 <aadams> I wouldn't put general filtering there either
15:23:05 <kjansson> but specific filtering? and if some other use case requires another specific filter?
15:23:32 <aadams> well, I suppose the conversation should be what is a filter and what is simply invalid then?
15:24:51 <aadams> because I am hearing two very different points of view here
15:25:38 <witek> aadams: what behaviour do you have now with old metrics being sent?
15:25:50 <witek> in thresh
15:26:47 <aadams> old alarms aren't as problematic to me as future alarms, because caching is expected, but a metric that is 2 weeks old only because of an NTP problem is bad for my system
15:27:22 <witek> and for future measurements?
15:27:43 <aadams> again, alarms won't be evaluated correctly, and the NTP problem is hidden
15:28:10 <nseyvet_> Let me see if I understand, a timestamp 2 weeks in the past is an NTP problem?
15:28:38 <witek> because thresh evaluates on a window (now - 1 min -> now)?
15:29:18 <aadams> I am not exactly sure on the math thresh does, but yes, I assume so
15:29:47 <witek> so the alarm state would correspond to the actual alarm state in the past
15:30:26 <aadams> yes
15:30:31 <nseyvet_> The only way to detect that there is an NTP problem would be to compare the state of a specific agent vs others, calculate a deviation, etc.
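The check being debated above — reject a measurement whose timestamp falls outside a configurable window around the API's notion of "now" — can be sketched roughly as follows. This is an illustrative sketch only: the parameter names (`reject_old_seconds`, `reject_future_seconds`) are hypothetical and not the actual monasca-api configuration from the review; a tolerance of `None` preserves the default accept-everything behaviour that nseyvet_'s prediction-engine use case relies on.

```python
import time

def timestamp_is_acceptable(timestamp_ms,
                            reject_old_seconds=None,
                            reject_future_seconds=None,
                            now_ms=None):
    """Return True if a metric timestamp (milliseconds) should be accepted.

    Either tolerance set to None disables that side of the check, so
    deployments that legitimately send future-dated metrics (e.g. from a
    prediction engine) are unaffected unless they opt in.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Too old: timestamp earlier than now minus the allowed age.
    if reject_old_seconds is not None and \
            timestamp_ms < now_ms - reject_old_seconds * 1000:
        return False
    # Too far in the future: timestamp later than now plus the allowed skew.
    if reject_future_seconds is not None and \
            timestamp_ms > now_ms + reject_future_seconds * 1000:
        return False
    return True
```

With both limits unset, everything is accepted; with, say, a two-week past limit and a small future limit, an NTP-skewed agent would receive a 422 instead of polluting alarm evaluation.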
15:30:51 <nseyvet_> any network congestion can slow down packets for minutes
15:31:17 <aadams> sure
15:32:00 <sgrasley1> What is the NTP check used for if it does not tell you about NTP problems?
15:32:47 <aadams> That is a good question
15:34:39 <amofakhar> you said having it in thresh is an option, would you please tell us what the reason and benefits of choosing the API for it were?
15:34:52 <witek> yes, if it's all about NTP problems, it could be a better approach to monitor the time offset on the agent node
15:35:01 <aadams> Mostly time, and parity with the java implementation
15:35:28 <aadams> we already have this implemented, without configurability, in the java api
15:37:50 <witek> so, do we get to decisions?
15:38:38 <aadams> I'm not sure, did we decide that this needs to be implemented elsewhere?
15:39:43 <witek> no, we could think of having it in thresh, but that's definitely more work
15:39:57 <aadams> I have already spent too much time on this bug
15:40:12 <witek> aadams: is this feature urgent for you?
15:40:14 <aadams> I'm afraid I won't get permission to do a deeper fix
15:40:55 <aadams> We can work around it, but we have already seen this problem once in a customer environment and are a little afraid of detecting it in the future
15:41:48 <witek> I would suggest discussing it again during the PTG, if it can wait until then
15:41:59 <aadams> sure
15:42:24 <witek> nseyvet_: ?
15:42:48 <aadams> although I'm afraid I can't do any thresh work, so that fix might be just an open storyboard
15:42:53 <nseyvet_> sure
15:43:01 <amofakhar> Yes -> PTG
15:43:15 <aadams> Ok, thanks everybody!
15:43:25 <witek> ok, let's discuss it again then
15:43:29 <witek> thanks aadams
15:43:34 <nseyvet_> What is the documentation describing the threshold engine computations?
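witek's alternative above — monitor the clock offset on the agent node instead of rejecting skewed metrics in the API — could look roughly like this. The sketch assumes the third-party `ntplib` package and an illustrative 2-second threshold; it is not the actual monasca-agent NTP check, and the server name is just a placeholder.

```python
def offset_exceeds_threshold(offset_seconds, max_offset_seconds=2.0):
    # Pure decision helper, so the alerting logic is testable without a
    # network round-trip. A positive offset means the local clock is ahead.
    return abs(offset_seconds) > max_offset_seconds

def check_ntp_offset(server="pool.ntp.org", max_offset_seconds=2.0):
    # Query an NTP server and decide whether the local clock has drifted
    # beyond the threshold. Requires the third-party ntplib package.
    import ntplib
    response = ntplib.NTPClient().request(server, version=3)
    # response.offset is the estimated local-clock offset in seconds.
    return offset_exceeds_threshold(response.offset, max_offset_seconds)
```

An agent-side check like this keeps the NTP problem visible and debuggable (kjansson's concern) rather than silently hiding it behind rejected metrics.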
15:43:39 <amofakhar> thank you aadams
15:43:44 <nseyvet_> yes thanks aadams
15:43:44 <aadams> :)
15:44:46 <witek> nseyvet_: I think craigbr is the best to ask
15:44:58 <witek> #topic tempest tests
15:45:26 <witek> I was going through bugs related to tempest tests
15:45:42 <witek> wanted to ask about this one: https://storyboard.openstack.org/#!/story/2001309
15:46:01 <witek> haven't seen it for a while now
15:46:07 <witek> is it fixed?
15:46:38 <witek> jamesgu: it's yours
15:47:20 <jamesgu> I am having trouble opening the storyboard
15:47:36 <jamesgu> ah yes... let me close it
15:47:42 <witek> cool, thanks
15:48:20 <witek> thanks to craigbr's tip we have added some waits in two other tests
15:48:43 <witek> which seems to fix occasional failures in CI jobs
15:48:55 <witek> https://review.openstack.org/#/q/topic:2001533
15:49:19 <witek> I'm adding the Depends-On tag on changes which fail
15:49:35 <witek> to see if the fix helps
15:50:22 <witek> #topic Ceilosca update
15:50:56 <witek> Ashwin has provided a new patch for the monasca-ceilometer project
15:51:20 <witek> can someone report on the current state of the master branch?
15:51:48 <witek> joadavis is not around?
15:52:53 <witek> peschk_l has also pushed two small changes for Ceilosca
15:53:09 <witek> he works on the CloudKitty project
15:54:01 <witek> he proposed a presentation about integration with Monasca for the next Summit
15:54:16 <witek> with cbellucci
15:54:35 <witek> hi aagate
15:55:07 <aagate> hi witek
15:55:31 <witek> I've seen your recent change in monasca-ceilometer
15:55:51 <witek> could you give an update about the current state?
15:58:01 <witek> is it now in sync with current Ceilometer code?
15:58:15 <aagate> yes sure. We have made changes to monasca-ceilometer master so that it works with the latest ceilometer master. We had to remove the dependency within monasca-ceilometer on oslo config, that was the major change.
15:58:43 <witek> is there still some work remaining?
15:59:02 <aagate> Also, monasca-ceilometer stable/pike now also works with ceilometer stable/pike
15:59:51 <witek> thanks, I have to finish soon
16:00:00 <witek> one more announcement
16:00:11 <aagate> There is still some work to do to make sure the devstack plugin in monasca-ceilometer still functions as before. But as far as getting the code up to speed with ceilometer, it's done
16:00:18 <witek> https://review.openstack.org/#/q/topic:2001533
16:00:35 <witek> Monasca has gained the diverse-affiliation tag
16:00:36 <witek> :)
16:00:46 <witek> thanks everyone for your contribution
16:01:00 <witek> it's based on review stats
16:01:16 <witek> that's all
16:01:23 <witek> thank you for today
16:01:26 <witek> bye bye
16:01:33 <witek> #endmeeting