15:02:03 #startmeeting monasca
15:02:04 Meeting started Wed Feb 14 15:02:03 2018 UTC and is due to finish in 60 minutes. The chair is witek. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:02:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:02:08 The meeting name has been set to 'monasca'
15:02:10 ok good to know
15:02:23 hello everyone
15:02:28 Hello
15:02:29 Hello
15:02:33 hello
15:02:34 o/
15:02:39 hello
15:02:51 Hello
15:02:52 hello
15:02:59 nice attendance today
15:03:12 aadams: can we start with your change?
15:03:15 sure
15:03:27 #topic reject old metrics
15:04:07 so there was some discussion in review about whether we should reject at all
15:04:17 https://review.openstack.org/#/c/543054
15:04:19 yes
15:04:40 I think we should reject
15:04:53 does the threshold engine have problems with handling measurements out of sync?
15:05:01 yes
15:05:08 it evaluates alarms based on timestamps
15:05:24 why not reject those in the threshold engine?
15:05:29 and if you are receiving metrics with wrong timestamps then alarms will be bad too
15:05:53 that is certainly an option, but in the past we have done it in the api
15:06:43 seems indeed like it could be improved in the threshold engine
15:06:44 are there any other questions?
15:06:48 because if it is done in the api then we will lose some data which can be useful for other parts
15:06:53 but that is definitely more effort
15:06:59 and also it makes the api slower
15:07:11 how does it make the api slower?
15:07:41 How does an old timestamp for a metric generate an alarm?
15:08:10 Doesn't the threshold engine use some time window and reject events outside of that?
15:08:14 if a timestamp should be for now but there is an ntp problem, alarms might flip or evaluate wrong
15:08:28 I don't understand
15:08:43 the main issue here is timestamps that are wrong
15:08:50 no, they are correct
15:09:01 from the agent perspective they are fine
15:09:07 and in a case where metrics pass the evaluation and come into the queue, but threshold is stopped for some reason and started again, then it will have the same problem as before
15:09:07 sure
15:09:15 there might be network congestion
15:09:35 or those metrics are generated in the future on purpose, by a prediction engine for example
15:09:41 so the timestamp is fine
15:09:52 my issue is when they are not fine
15:09:58 I don't understand why this timestamp troubles the engine?
15:09:58 and there is an ntp problem
15:10:11 because alarms are evaluated based on timestamps
15:10:19 but how do you know if it is NTP or a network problem?
15:10:24 exactly
15:10:45 I don't know about your use cases, but future metrics point to an ntp problem
15:10:46 if alarms are based on timestamps I don't see the problem
15:10:50 no
15:11:01 I predict a metric 10 min ahead
15:11:08 I push it into the Monasca API
15:11:16 perfectly legit use case
15:11:27 well then, I suppose this patch should not be turned on for you
15:11:31 I still don't understand the problem with timestamps and alarms?
15:11:42 In my use case, I do not change the metric's created-at timestamp
15:11:45 I don't see much point for this patch at all IMO
15:11:52 and do not expect future timestamps
15:12:03 because of what?
15:12:22 I do not expect future timestamps because I don't create metrics in the future
15:12:53 a future timestamp to me is an ntp problem
15:13:33 the good news is that if you are expecting future timestamps, the default behaviour does not change with this patch
15:13:43 so you will not be affected
15:13:44 There is a large misunderstanding here. If the problem is in the threshold engine it should be fixed there. At this point I don't understand the problem. And assuming time is off due to NTP is plain wrong
15:13:52 won't this also obfuscate NTP or similar problems on the agent side? the story says we should make these problems visible and debuggable?
15:14:25 I think the key question at the moment is to understand why wrong (out of sync) timestamps are causing problems in threshold
15:14:27 Also, I am wondering: when the agent receives a 422, does it re-transmit?
15:14:41 or queue?
15:15:07 what would you rather throw?
15:15:19 nothing, since it is fine IMO
15:15:31 hmm
15:15:38 question 1: what is this solving?
15:15:58 metrics with future timestamps are invalid in my use case
15:16:13 I should be able to configure the api to agree with that
15:16:44 I understand, in your use case they are valid, so don't configure it to be on
15:16:55 Should I expand that to assume I may want to reject any metrics that have negative values?
15:17:15 It sounds to me like a filter function
15:17:30 it is not
15:17:46 it rejects only invalid metrics, as the user defines them
15:17:58 it filters metrics based on timestamp
15:18:02 ok
15:18:50 invalid in your situation is a timestamp in the future, or in the past by 2 weeks, yes?
15:18:57 yes
15:20:00 so the problem to solve is how to define "invalid" for a general use case and add it as general functionality in the API, IMO
15:20:12 It sounds like a filter
15:20:19 ok
15:20:28 so I may want to filter on negative values for example
15:20:30 or NaN
15:20:32 or etc
15:20:33 ok
15:20:39 or timestamp being old
15:20:47 what's your point
15:21:01 I don't think the API is the proper location for that function
15:21:29 that's valid, I didn't add general filtering though
15:21:42 I wouldn't put general filtering there either
15:23:05 but specific filtering? and if some other use case requires another specific filter?
15:23:32 well, I suppose the conversation should be about what is a filter and what is simply invalid then?
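The validity rule being debated above — a timestamp in the future, or more than two weeks in the past, is rejected — can be sketched as a small standalone check. This is an illustrative sketch only, not the code from review 543054; the constants, function name, and millisecond convention are assumptions.

```python
import time

# Assumed limits, mirroring the example given in the discussion:
# anything more than two weeks old, or any future timestamp, is invalid.
MAX_AGE_SECONDS = 14 * 24 * 3600   # two weeks in the past
MAX_FUTURE_SECONDS = 0             # no future timestamps allowed

def is_valid_timestamp(ts_ms, now=None):
    """Return True if a metric timestamp (ms since epoch) is acceptable."""
    now = time.time() if now is None else now
    ts = ts_ms / 1000.0
    if ts > now + MAX_FUTURE_SECONDS:
        return False  # future timestamp: e.g. a possible NTP problem
    if ts < now - MAX_AGE_SECONDS:
        return False  # too old to be useful for alarm evaluation
    return True
```

With this shape, a deployment that legitimately posts future measurements (the prediction-engine use case above) would leave the check disabled or raise `MAX_FUTURE_SECONDS`, matching the point that the patch's behaviour is opt-in and the default does not change.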
15:24:51 because I am hearing two very different points of view here
15:25:38 aadams: what behaviour do you have now with old metrics being sent?
15:25:50 in thresh
15:26:47 old alarms aren't as problematic to me as future alarms, because caching is expected, but a two-week-old metric that is only 2 weeks old because of an NTP problem is bad for my system
15:27:22 and for future measurements?
15:27:43 again, alarms won't be evaluated correctly, and the NTP problem is hidden
15:28:10 Let me see if I understand: a timestamp 2 weeks in the past is an NTP problem?
15:28:38 because thresh evaluates on a window (now - 1 min. -> now)?
15:29:18 I am not exactly sure on the math thresh does, but yes, I assume so
15:29:47 so the alarm state would correspond to the actual alarm state in the past
15:30:26 yes
15:30:31 The only way to detect that there is an NTP problem would be to compare the state of a specific agent vs others, calculate a deviation, etc.
15:30:51 any network congestion can slow down packets for minutes
15:31:17 sure
15:32:00 What is the NTP check used for if it does not tell you about NTP problems?
15:32:47 That is a good question
15:34:39 you said having it in thresh is an option; would you please tell us what the reason and benefits were of choosing the API for it?
15:34:52 yes, if it's all about NTP problems, it could be a better approach to monitor the time offset on the agent node
15:35:01 Mostly time, and parity with the java implementation
15:35:28 we already have this implemented, without configurability, in the java api
15:37:50 so, do we get to decisions?
15:38:38 I'm not sure, did we decide that this needs to be implemented elsewhere?
15:39:43 no, we could think of having it in thresh, but that's definitely more work
15:39:57 I have already spent too much time on this bug
15:40:12 aadams: is this feature urgent for you?
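The windowing behaviour speculated about above (thresh evaluating on roughly `now - 1 min -> now`) explains why skewed timestamps matter: a measurement whose clock is off simply falls outside the current window and never contributes to the alarm state. A minimal sketch of that idea, with an assumed one-minute window and invented helper names — not the actual monasca-thresh logic:

```python
import time

WINDOW_SECONDS = 60  # assumed evaluation window (now - 1 min -> now)

def measurements_in_window(measurements, now=None):
    """Select measurements whose timestamps fall in the current window.

    Each measurement is a (timestamp_seconds, value) pair.  A point with a
    stale or future timestamp (e.g. an NTP problem on the agent) silently
    drops out of the window.
    """
    now = time.time() if now is None else now
    return [(ts, v) for ts, v in measurements if now - WINDOW_SECONDS < ts <= now]

def evaluate_threshold(measurements, threshold, now=None):
    """Return True (alarm) if the in-window average exceeds the threshold."""
    window = measurements_in_window(measurements, now=now)
    if not window:
        return False  # no data in the window: the alarm cannot fire
    avg = sum(v for _, v in window) / len(window)
    return avg > threshold
```

Under this model a breaching value stamped two weeks in the past, or ten minutes in the future, is invisible to the evaluation, which is the "alarms won't be evaluated correctly, and the NTP problem is hidden" concern raised above.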
15:40:14 I'm afraid I won't get permission to do a deeper fix
15:40:55 We can work around it, but we have already seen this problem once in a customer environment and are a little afraid of detecting it in the future
15:41:48 I would suggest discussing it again during the PTG, if it can wait until then
15:41:59 sure
15:42:24 nseyvet_: ?
15:42:48 although I'm afraid I can't do any thresh work, so that fix might be just an open storyboard
15:42:53 sure
15:43:01 Yes -> PTG
15:43:15 Ok, thanks everybody!
15:43:25 ok, let's discuss it again then
15:43:29 thanks aadams
15:43:34 Where is the documentation describing the threshold engine computations?
15:43:39 thank you aadams
15:43:44 yes thanks aadams
15:43:44 :)
15:44:46 nseyvet_: I think craigbr is the best to ask
15:44:58 #topic tempest tests
15:45:26 I was going through bugs related to tempest tests
15:45:42 wanted to ask about this one: https://storyboard.openstack.org/#!/story/2001309
15:46:01 haven't seen it for a while now
15:46:07 is it fixed?
15:46:38 jamesgu: it's yours
15:47:20 I am having trouble opening the storyboard
15:47:36 ah yes... let me close it
15:47:42 cool, thanks
15:48:20 thanks to craigbr's tip we have added some waits in two other tests
15:48:43 which seems to fix occasional failures in CI jobs
15:48:55 https://review.openstack.org/#/q/topic:2001533
15:49:19 I'm adding the Depends-On tag on changes which fail
15:49:35 to see if the fix helps
15:50:22 #topic Ceilosca update
15:50:56 Ashwin has provided a new patch for the monasca-ceilometer project
15:51:20 can someone report on the current state of the master branch?
15:51:48 joadavis is not around?
15:52:53 peschk_l has also pushed two small changes for Ceilosca
15:53:09 he works on the CloudKitty project
15:54:01 he proposed a presentation about integration with Monasca for the next Summit
15:54:16 with cbellucci
15:54:35 hi aagate
15:55:07 hi witek
15:55:31 I've seen your recent change in monasca-ceilometer
15:55:51 could you give an update about the current state?
15:58:01 is it now in sync with current Ceilometer code?
15:58:15 yes, sure. We have made changes to monasca-ceilometer master so that it works with the latest ceilometer master. We had to remove the dependency within monasca-ceilometer on oslo config; that was the major change.
15:58:43 is there still some work remaining?
15:59:02 Also, monasca-ceilometer stable/pike now works with ceilometer stable/pike
15:59:51 thanks, I have to finish soon
16:00:00 one more announcement
16:00:11 There is still some work to do to make sure the devstack plugin in monasca-ceilometer still functions as before. But as far as getting the code up to speed with ceilometer, it's done
16:00:18 https://review.openstack.org/#/q/topic:2001533
16:00:35 Monasca has gained the diverse-affiliation tag
16:00:36 :)
16:00:46 thanks everyone for your contributions
16:01:00 it's based on review stats
16:01:16 that's all
16:01:23 thank you for today
16:01:26 bye bye
16:01:33 #endmeeting