15:02:03 <witek> #startmeeting monasca
15:02:04 <openstack> Meeting started Wed Feb 14 15:02:03 2018 UTC and is due to finish in 60 minutes. The chair is witek. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:02:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:02:08 <openstack> The meeting name has been set to 'monasca'
15:02:10 <pilgrimstack> ok good to know
15:02:23 <witek> hello everyone
15:02:28 <amofakhar> Hello
15:02:29 <sgrasley1> Hello
15:02:33 <kjansson> hello
15:02:34 <aadams> o/
15:02:39 <haruki> hello
15:02:51 <jamesgu> Hello
15:02:52 <cbellucci> hello
15:02:59 <witek> nice attendance today
15:03:12 <witek> aadams: can we start with your change?
15:03:15 <aadams> sure
15:03:27 <witek> #topic reject old metrics
15:04:07 <witek> so there was some discussion in review about whether we should reject at all
15:04:17 <witek> https://review.openstack.org/#/c/543054
15:04:19 <aadams> yes
15:04:40 <aadams> I think we should reject
15:04:53 <witek> does the threshold engine have problems handling measurements out of sync?
15:05:01 <aadams> yes
15:05:08 <aadams> it evaluates alarms based on timestamps
15:05:24 <amofakhar> why not reject those in the threshold engine?
15:05:29 <aadams> and if you are receiving metrics with wrong timestamps then alarms will be bad too
15:05:53 <aadams> that is certainly an option, but in the past we have done it in the api
15:06:43 <witek> seems indeed like it could be improved in the threshold engine
15:06:44 <aadams> are there any other questions?
15:06:48 <amofakhar> because if it is done in the api then we will lose some data which can be useful for other parts
15:06:53 <witek> but that is definitely more effort
15:06:59 <amofakhar> and also it makes the api slower
15:07:11 <aadams> how does it make the api slower?
15:07:41 <nseyvet_> How does an old timestamp for a metric generate an alarm?
15:08:10 <nseyvet_> Doesn't the threshold engine use some time window and reject events outside of it?
15:08:14 <aadams> if a timestamp should be for now but there is an NTP problem, alarms might flip or evaluate wrong
15:08:28 <nseyvet_> I don't understand
15:08:43 <aadams> the main issue here is timestamps that are wrong
15:08:50 <nseyvet_> no, they are correct
15:09:01 <nseyvet_> from the agent perspective they are fine
15:09:07 <amofakhar> and in a case where metrics pass the evaluation and come into the queue, but thresh is stopped for some reason and then started again, it will have the same problem as before
15:09:07 <aadams> sure
15:09:15 <nseyvet_> there might be network congestion
15:09:35 <nseyvet_> or those metrics are generated in the future on purpose, by a prediction engine for example
15:09:41 <nseyvet_> so the timestamp is fine
15:09:52 <aadams> my issue is when they are not fine
15:09:58 <nseyvet_> I don't understand why this timestamp troubles the engine?
15:09:58 <aadams> and there is an NTP problem
15:10:11 <aadams> because alarms are evaluated based on timestamps
15:10:19 <nseyvet_> but how do you know if it is NTP or a network problem?
15:10:24 <aadams> exactly
15:10:45 <aadams> I don't know about your use cases, but future metrics point to an NTP problem
15:10:46 <nseyvet_> if alarms are based on timestamps I don't see the problem
15:10:50 <nseyvet_> no
15:11:01 <nseyvet_> I predict a metric 10 min ahead
15:11:08 <nseyvet_> I push it into the Monasca API
15:11:16 <nseyvet_> perfectly legit use case
15:11:27 <aadams> well then, I suppose this patch should not be turned on for you
15:11:31 <nseyvet_> I still don't understand the problem with timestamps and alarms?
15:11:42 <aadams> In my use case, I do not change the metric's created-at timestamp
15:11:45 <nseyvet_> I don't see much point in this patch at all IMO
15:11:52 <aadams> and do not expect future timestamps
15:12:03 <nseyvet_> because of what?
15:12:22 <aadams> I do not expect future timestamps because I don't create metrics in the future
15:12:53 <aadams> a future timestamp to me is an NTP problem
15:13:33 <aadams> the good news is that if you are expecting future timestamps, the default behaviour does not change with this patch
15:13:43 <aadams> so you will not be affected
15:13:44 <nseyvet_> There is a large misunderstanding here. If the problem is in the threshold engine it should be fixed there. At this point I don't understand the problem. And assuming time is off due to NTP is plain wrong
15:13:52 <kjansson> won't this also obfuscate NTP or similar problems on the agent side? the story says we should make these problems visible and debuggable?
15:14:25 <witek> I think the key question at the moment is to understand why wrong (out of sync) timestamps are causing problems in thresh
15:14:27 <nseyvet_> Also, I am wondering: when the agent receives a 422, does it re-transmit?
15:14:41 <nseyvet_> or queue?
15:15:07 <aadams> what would you rather throw?
15:15:19 <nseyvet_> nothing, since it is fine IMO
15:15:31 <aadams> hmm
15:15:38 <nseyvet_> question 1: what is this solving?
15:15:58 <aadams> metrics with future timestamps are invalid in my use case
15:16:13 <aadams> I should be able to configure the api to agree with that
15:16:44 <aadams> I understand, in your use case they are valid, so don't configure it to be on
15:16:55 <nseyvet_> Should I expand that to assume I may want to reject any metrics that have negative values?
15:17:15 <nseyvet_> It sounds to me like a filter function
15:17:30 <aadams> it is not
15:17:46 <aadams> it rejects only invalid metrics, as the user defines them
15:17:58 <nseyvet_> it filters metrics based on timestamp
15:18:02 <aadams> ok
15:18:50 <nseyvet_> invalid is, in your situation, a timestamp in the future or in the past by 2 weeks. yes?
15:18:57 <aadams> yes
15:20:00 <nseyvet_> so the problem to solve is how to define "invalid" for a general use case and add it as general functionality in the API, IMO
15:20:12 <nseyvet_> It sounds like a filter
15:20:19 <aadams> ok
15:20:28 <nseyvet_> so I may want to filter on negative values for example
15:20:30 <nseyvet_> or NaN
15:20:32 <nseyvet_> or etc
15:20:33 <aadams> ok
15:20:39 <nseyvet_> or the timestamp being old
15:20:47 <aadams> what's your point
15:21:01 <nseyvet_> I don't think the API is the proper location for that function
15:21:29 <aadams> that's valid, I didn't add general filtering though
15:21:42 <aadams> I wouldn't put general filtering there either
15:23:05 <kjansson> but specific filtering? and if some other use case requires another specific filter?
15:23:32 <aadams> well, I suppose the conversation should be what is a filter and what is simply invalid then?
15:24:51 <aadams> because I am hearing two very different points of view here
15:25:38 <witek> aadams: what behaviour do you have now with old metrics being sent?
15:25:50 <witek> in thresh
15:26:47 <aadams> old alarms aren't as problematic to me as future alarms, because caching is expected, but a metric that is 2 weeks old only because of an NTP problem is bad for my system
15:27:22 <witek> and for future measurements?
15:27:43 <aadams> again, alarms won't be evaluated correctly, and the NTP problem is hidden
15:28:10 <nseyvet_> Let me see if I understand, a timestamp 2 weeks in the past is an NTP problem?
15:28:38 <witek> because thresh evaluates on a window (now - 1 min -> now)?
15:29:18 <aadams> I am not exactly sure on the math thresh does, but yes, I assume so
15:29:47 <witek> so the alarm state would correspond to the actual alarm state in the past
15:30:26 <aadams> yes
15:30:31 <nseyvet_> The only way to detect that there is an NTP problem would be to compare the state of a specific agent vs others, calculate a deviation, etc.
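The check being debated above — reject a measurement whose timestamp falls outside a configurable window around the API's notion of "now" — can be sketched roughly as follows. This is an illustrative sketch only: the parameter names (`reject_old_seconds`, `reject_future_seconds`) are hypothetical and not the actual monasca-api configuration from the review; a tolerance of `None` preserves the default accept-everything behaviour that nseyvet_'s prediction-engine use case relies on.

```python
import time

def timestamp_is_acceptable(timestamp_ms,
                            reject_old_seconds=None,
                            reject_future_seconds=None,
                            now_ms=None):
    """Return True if a metric timestamp (milliseconds) should be accepted.

    Either tolerance set to None disables that side of the check, so
    deployments that legitimately send future-dated metrics (e.g. from a
    prediction engine) are unaffected unless they opt in.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Too old: timestamp earlier than now minus the allowed age.
    if reject_old_seconds is not None and \
            timestamp_ms < now_ms - reject_old_seconds * 1000:
        return False
    # Too far in the future: timestamp later than now plus the allowed skew.
    if reject_future_seconds is not None and \
            timestamp_ms > now_ms + reject_future_seconds * 1000:
        return False
    return True
```

With both limits unset, everything is accepted; with, say, a two-week past limit and a small future limit, an NTP-skewed agent would receive a 422 instead of polluting alarm evaluation.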
15:30:51 <nseyvet_> any network congestion can slow down packets for minutes
15:31:17 <aadams> sure
15:32:00 <sgrasley1> What is the NTP check used for if it does not tell you about NTP problems?
15:32:47 <aadams> That is a good question
15:34:39 <amofakhar> you said having it in thresh is an option, would you please tell us what the reason and benefits of choosing the API for it were?
15:34:52 <witek> yes, if it's all about NTP problems, it could be a better approach to monitor the time offset on the agent node
15:35:01 <aadams> Mostly time, and parity with the java implementation
15:35:28 <aadams> we already have this implemented, without configurability, in the java api
15:37:50 <witek> so, do we get to decisions?
15:38:38 <aadams> I'm not sure, did we decide that this needs to be implemented elsewhere?
15:39:43 <witek> no, we could think of having it in thresh, but that's definitely more work
15:39:57 <aadams> I have already spent too much time on this bug
15:40:12 <witek> aadams: is this feature urgent for you?
15:40:14 <aadams> I'm afraid I won't get permission to do a deeper fix
15:40:55 <aadams> We can work around it, but we have already seen this problem once in a customer environment and are a little afraid of detecting it in the future
15:41:48 <witek> I would suggest discussing it again during the PTG, if it can wait until then
15:41:59 <aadams> sure
15:42:24 <witek> nseyvet_: ?
15:42:48 <aadams> although I'm afraid I can't do any thresh work, so that fix might be just an open storyboard
15:42:53 <nseyvet_> sure
15:43:01 <amofakhar> Yes -> PTG
15:43:15 <aadams> Ok, thanks everybody!
15:43:25 <witek> ok, let's discuss it again then
15:43:29 <witek> thanks aadams
15:43:34 <nseyvet_> What is the documentation describing the threshold engine computations?
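witek's alternative above — monitor the clock offset on the agent node instead of rejecting skewed metrics in the API — could look roughly like this. The sketch assumes the third-party `ntplib` package and an illustrative 2-second threshold; it is not the actual monasca-agent NTP check, and the server name is just a placeholder.

```python
def offset_exceeds_threshold(offset_seconds, max_offset_seconds=2.0):
    # Pure decision helper, so the alerting logic is testable without a
    # network round-trip. A positive offset means the local clock is ahead.
    return abs(offset_seconds) > max_offset_seconds

def check_ntp_offset(server="pool.ntp.org", max_offset_seconds=2.0):
    # Query an NTP server and decide whether the local clock has drifted
    # beyond the threshold. Requires the third-party ntplib package.
    import ntplib
    response = ntplib.NTPClient().request(server, version=3)
    # response.offset is the estimated local-clock offset in seconds.
    return offset_exceeds_threshold(response.offset, max_offset_seconds)
```

An agent-side check like this keeps the NTP problem visible and debuggable (kjansson's concern) rather than silently hiding it behind rejected metrics.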
15:43:39 <amofakhar> thank you aadams
15:43:44 <nseyvet_> yes thanks aadams
15:43:44 <aadams> :)
15:44:46 <witek> nseyvet_: I think craigbr is the best to ask
15:44:58 <witek> #topic tempest tests
15:45:26 <witek> I was going through bugs related to tempest tests
15:45:42 <witek> wanted to ask about this one: https://storyboard.openstack.org/#!/story/2001309
15:46:01 <witek> haven't seen it for a while now
15:46:07 <witek> is it fixed?
15:46:38 <witek> jamesgu: it's yours
15:47:20 <jamesgu> I am having trouble opening the storyboard
15:47:36 <jamesgu> ah yes... let me close it
15:47:42 <witek> cool, thanks
15:48:20 <witek> thanks to craigbr's tip we have added some waits in two other tests
15:48:43 <witek> which seems to fix occasional failures in CI jobs
15:48:55 <witek> https://review.openstack.org/#/q/topic:2001533
15:49:19 <witek> I'm adding the Depends-On tag on changes which fail
15:49:35 <witek> to see if the fix helps
15:50:22 <witek> #topic Ceilosca update
15:50:56 <witek> Ashwin has provided a new patch for the monasca-ceilometer project
15:51:20 <witek> can someone report on the current state of the master branch?
15:51:48 <witek> joadavis is not around?
15:52:53 <witek> peschk_l has also pushed two small changes for Ceilosca
15:53:09 <witek> he works on the CloudKitty project
15:54:01 <witek> he proposed a presentation about integration with Monasca for the next Summit
15:54:16 <witek> with cbellucci
15:54:35 <witek> hi aagate
15:55:07 <aagate> hi witek
15:55:31 <witek> I've seen your recent change in monasca-ceilometer
15:55:51 <witek> could you give an update about the current state?
15:58:01 <witek> is it now in sync with current Ceilometer code?
15:58:15 <aagate> yes sure. We have made changes to monasca-ceilometer master so that it works with the latest ceilometer master. We had to remove the dependency within monasca-ceilometer on oslo config, that was the major change.
15:58:43 <witek> is there still some work remaining?
15:59:02 <aagate> Also, monasca-ceilometer stable/pike now also works with ceilometer stable/pike
15:59:51 <witek> thanks, I have to finish soon
16:00:00 <witek> one more announcement
16:00:11 <aagate> There is still some work to do to make sure the devstack plugin in monasca-ceilometer still functions as before. But as far as getting the code up to speed with ceilometer, it's done
16:00:18 <witek> https://review.openstack.org/#/q/topic:2001533
16:00:35 <witek> Monasca has gained the diverse-affiliation tag
16:00:36 <witek> :)
16:00:46 <witek> thanks everyone for your contribution
16:01:00 <witek> it's based on review stats
16:01:16 <witek> that's all
16:01:23 <witek> thank you for today
16:01:26 <witek> bye bye
16:01:33 <witek> #endmeeting