Wednesday, 2016-04-20

*** Gen_ has joined #openstack-monasca00:01
*** Gen has quit IRC00:05
*** ddieterly[away] is now known as ddieterly00:07
*** ddieterly is now known as ddieterly[away]00:09
*** slogan has quit IRC00:14
*** ddieterly[away] is now known as ddieterly00:14
*** craigbr has joined #openstack-monasca00:39
*** craigbr has quit IRC00:42
*** ljxiash has quit IRC00:46
*** bobh has joined #openstack-monasca00:52
*** bobh has quit IRC00:57
*** ybathia has joined #openstack-monasca00:59
*** ljxiash has joined #openstack-monasca01:06
*** ybathia has quit IRC01:08
*** bobh has joined #openstack-monasca01:13
*** ddieterly has quit IRC01:18
*** ljxiash has quit IRC01:20
*** ljxiash has joined #openstack-monasca01:23
*** rohit_ has quit IRC01:37
*** ddieterly has joined #openstack-monasca01:50
*** ddieterly is now known as ddieterly[away]01:51
*** ducttape_ has joined #openstack-monasca01:59
*** ducttape_ has quit IRC02:09
*** bobh has quit IRC02:12
*** ddieterly[away] is now known as ddieterly02:25
*** ddieterly is now known as ddieterly[away]02:26
*** ljxiash has quit IRC02:27
*** ljxiash has joined #openstack-monasca02:32
*** ljxiash has quit IRC02:35
*** ljxiash has joined #openstack-monasca02:37
*** bobh has joined #openstack-monasca02:46
*** ducttape_ has joined #openstack-monasca02:49
*** Gen_ has quit IRC02:51
*** kse has quit IRC02:56
*** kse has joined #openstack-monasca02:57
*** ljxiash has quit IRC03:01
*** ljxiash has joined #openstack-monasca03:02
*** ljxiash has joined #openstack-monasca03:08
*** ljxiash has quit IRC03:10
*** ljxiash has joined #openstack-monasca03:10
*** ducttape_ has quit IRC03:24
*** ddieterly[away] has quit IRC03:27
*** bobh has quit IRC03:32
*** ljxiash has quit IRC04:00
*** ekarlso has quit IRC04:11
*** ljxiash has joined #openstack-monasca04:12
*** hosanai has quit IRC04:13
*** hosanai has joined #openstack-monasca04:14
*** ekarlso has joined #openstack-monasca04:25
*** ljxiash has quit IRC04:35
*** ljxiash has joined #openstack-monasca05:40
*** ericksonsantos has quit IRC06:13
*** nadya has joined #openstack-monasca06:26
*** ljxiash has quit IRC06:40
*** ljxiash has joined #openstack-monasca06:43
*** nadya has quit IRC07:19
*** ljxiash has quit IRC07:36
*** ljxiash has joined #openstack-monasca07:40
*** ljxiash_ has joined #openstack-monasca08:18
*** ljxiash has quit IRC08:18
*** ljxiash_ has quit IRC08:29
*** ljxiash has joined #openstack-monasca08:30
<openstackgerrit> Witold Bedyk proposed openstack/monasca-agent: Migrate from MySQLDB to pymysql  https://review.openstack.org/302660  08:59
*** kei_yama has quit IRC09:09
*** ljxiash has quit IRC09:12
*** ljxiash has joined #openstack-monasca09:12
*** kse has quit IRC09:30
*** nadya has joined #openstack-monasca09:50
*** hosanai has quit IRC10:18
*** ljxiash has quit IRC10:29
*** nadya has quit IRC11:02
*** ddieterly has joined #openstack-monasca11:06
*** nadya has joined #openstack-monasca11:24
*** ddieterly is now known as ddieterly[away]11:25
*** ddieterly[away] is now known as ddieterly11:29
*** bobh has joined #openstack-monasca11:33
*** ddieterly is now known as ddieterly[away]11:38
*** ducttape_ has joined #openstack-monasca11:45
*** ddieterly[away] is now known as ddieterly12:18
*** ddieterly has quit IRC12:18
*** ljxiash has joined #openstack-monasca12:22
*** ducttape_ has quit IRC12:24
*** bobh has quit IRC12:25
*** nadya has quit IRC12:26
*** iurygregory has joined #openstack-monasca12:37
*** ddieterly has joined #openstack-monasca12:46
*** ducttape_ has joined #openstack-monasca12:56
*** ducttape_ has quit IRC13:01
*** ducttape_ has joined #openstack-monasca13:02
*** ducttape_ has quit IRC13:02
*** ducttape_ has joined #openstack-monasca13:06
*** bobh has joined #openstack-monasca13:19
*** rhochmuth has joined #openstack-monasca13:27
*** rbak has joined #openstack-monasca13:37
*** craigbr has joined #openstack-monasca13:39
*** vishwanathj has joined #openstack-monasca13:58
*** 14WAATBIX has joined #openstack-monasca14:16
<openstackgerrit> Bradley Klein proposed openstack/monasca-agent: Add plugin for gathering ovs virtual router statistics  https://review.openstack.org/306621  14:43
*** ddieterly is now known as ddieterly[away]14:45
*** ddieterly[away] is now known as ddieterly14:46
*** slogan has joined #openstack-monasca14:53
*** bklei has joined #openstack-monasca15:00
*** bobh has quit IRC15:02
*** dschroeder has joined #openstack-monasca15:22
<rbak> 14WAATBIX: Do you want to talk immediately after the monasca meeting, or is there a time that works better for you?  15:35
<14WAATBIX> I can do immediately after, but I don't think jkeen will be in until later  15:36
*** 14WAATBIX has quit IRC15:38
*** rbrndt has joined #openstack-monasca15:38
<rbak> Any idea what time?  I can send out a meeting invite for later today.  15:38
<rbrndt> I'd guess about 10, 10:30 mountain time  15:39
<rbak> How's 11 mountain time work for you then?  15:40
<rbrndt> We've got a monasca team meeting at that time  15:40
<rbrndt> personally I've got 1-2 pm open  15:40
<rbrndt> oops, sorry wrong day  15:41
<rbrndt> yeah 11 works fine for me  15:41
<rbak> Alright, I'll send out an invite and we'll see who shows up  15:41
<rbrndt> sounds good  15:42
<rbak> I think we can just talk here, unless you prefer a bridge?  15:42
<rbrndt> I can do IRC  15:42
*** bobh has joined #openstack-monasca15:43
*** iurygregory has quit IRC15:45
*** slogan has quit IRC15:49
*** ljxiash has quit IRC15:50
*** iurygregory has joined #openstack-monasca15:55
*** rhochmuth has left #openstack-monasca16:02
*** ddieterly is now known as ddieterly[away]16:04
*** ddieterly[away] is now known as ddieterly16:11
*** ddieterly is now known as ddieterly[away]16:12
*** nadya has joined #openstack-monasca16:17
*** ddieterly[away] is now known as ddieterly16:33
*** bklei has quit IRC16:48
*** ljxiash has joined #openstack-monasca16:50
*** ddieterly is now known as ddieterly[away]16:52
*** ljxiash has quit IRC16:55
*** ddieterly[away] is now known as ddieterly16:58
<rbak> rbrndt: you there?  17:01
<rbrndt> yup  17:01
<rbrndt> getting jkeen online  17:01
<rbak> thanks  17:01
*** jkeen has joined #openstack-monasca17:02
*** mhoppal has joined #openstack-monasca17:02
<rbrndt> Ok, are we all here now?  17:03
<jkeen> I'm here  17:03
<mhoppal> here as well  17:03
<rbak> Awesome.  Thanks for taking the time to talk about this patch  17:04
<rbak> From what I've gathered, the concern is that with this patch, when the pool restarts it loses the data of any running checks.  17:04
<rbrndt> So, you had a good way of describing the problem in the weekly meeting, rbak  17:04
<rbak> Now I need to remember how I put it earlier.  17:05
<rbak> Basically the pool restart is triggered by a check taking too long.  17:05
<rbak> In the best case the check eventually returns, and no data is lost.  17:06
<rbak> But in the case we've hit repeatedly, the stuck check never returns, and so the thread pool hangs forever on the join  17:06
<rbak> My patch was intended to help the second case, and reduces the data loss to a minimum  17:07
<rbak> But from your perspective, you're addressing the first case, which had no data loss, and saying that my patch makes things worse.  17:07
<rbak> Does that make sense so far?  17:08
<rbrndt> I think we're almost there.  17:08
<jkeen> Yes  17:08
<rbrndt> The issue I was wondering about is actually in the second case  17:08
<mhoppal> makes sense to me  17:08
<rbrndt> when we do lose data, how much and which data is lost?  17:08
<rbrndt> I think we were looking at it, and it sounds like we could lose the whole set of instances for a check, if one of them fails  17:09
<jkeen> Given the way the current thread pool works, the lost data is going to be indeterminate.  It'll kill the pool at some point long past the point where a check got stuck.  17:09
<rbak> Currently, the thread pool hangs and takes the entire agent with it, so all data for the entire agent is lost until the agent is manually restarted.  17:09
<jkeen> We've been running Monasca at scale for several months now and we've never seen an agent hang.  Do you know what leads to the hang?  17:10
<rbak> Not really  17:10
<rbak> But we've seen it caused by both the nagios and http-check plugins  17:10
<jkeen> Or, rather, we've never seen it hang on an http check.  We've seen it hang communicating with the API, but we think we've fixed that.  17:10
<rbak> Yeah, that's a different issue  17:11
<jkeen> Ok, I don't think we run nagios, so that could explain it.  17:11
<rbak> If you don't see things hang, what causes pool restarts for you?  17:11
<jkeen> There are a couple of problems with the current implementation (not your fault, it's just how it was written originally) that make me concerned.  17:12
<jkeen> As far as we know, we've never seen a pool restart.  17:12
<rbak> jkeen: what are the current problems?  17:13
<jkeen> My main problem is the way it attempts to get data back from the pool.  It gets an instance, checks to see if there is any available data, and then places the instance in the pool.  17:13
<jkeen> This applies to the thread pool and the new process pool code.  17:13
<jkeen> It looks like it can never get all the data for a given check, almost like it's expecting checks to hang for a time.  17:14
<rbak> I'm not sure I follow that bit.  17:14
<jkeen> If the checks don't complete within a given time frame, 180 seconds by default, it kills the pool and we lose the data.  17:14
<rbak> True  17:15
<mhoppal> by that logic, shouldn't we never see it hang then, jkeen?  17:15
<jkeen> In the check function it runs self._process_results() and then does self.pool.apply_async()  17:15
<jkeen> I don't see it run self._process_results and wait for the data anywhere.  17:16
<rbak> But that's processing all results that have come in for the pool, not necessarily the instance that it's about to start  17:16
<jkeen> So it looks like if you run N checks you're going to get N-m the first time around and get the remaining unfinished checks the next collection cycle.  17:16
<jkeen> Right, but since it's always checking first, what happens when you reach the last item in the pool?  It checks for data before it ever tries to run the job.  17:17
<rbak> True, but even if you moved that statement afterwards there's no guarantee the check will be done.  17:18
<rbak> Also I think "last" is misleading here, since the asynchronous nature means they're all running at once.  17:18
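
A minimal sketch of the submit-then-drain pattern being discussed (illustrative names, not the actual monasca-agent code): each cycle harvests only the results that have already finished and then queues the instances again, so a slow check's data shows up a cycle late, or never if the pool is killed first.

    import multiprocessing

    def run_check(instance):
        # stand-in for a real plugin check; may take arbitrarily long
        return ('metric.for.%s' % instance, 1)

    class ParallelCheck(object):
        def __init__(self, size=4):
            self.pool = multiprocessing.Pool(processes=size)
            self.in_flight = []            # AsyncResults not yet harvested

        def _process_results(self):
            # collect only what has already finished; leave the rest in flight
            done, still_running = [], []
            for result in self.in_flight:
                (done if result.ready() else still_running).append(result)
            self.in_flight = still_running
            return [r.get() for r in done]

        def check(self, instances):
            # drain first, submit second: N submissions typically yield
            # N - m results this cycle and the stragglers on a later one
            metrics = self._process_results()
            for instance in instances:
                self.in_flight.append(
                    self.pool.apply_async(run_check, (instance,)))
            return metrics
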
*** rhochmuth has joined #openstack-monasca17:18
<rbak> The loop just sticks them all on the stack to run  17:18
<jkeen> Yes, and that's my main problem with how this works.  What I'd rather see is that we apply the instances to the pool and then read all the data.  There are few enough checks that we should be able to make them robust enough to time out and guarantee a return from the pool.  17:19
<rbak> That doesn't necessarily work though  17:20
<rbak> You have to deal with the results asynchronously as well  17:20
<mhoppal> I'm confused how it gets into a state where it hangs with the current implementation, though, since we run clean on each run, which stops and starts the pool if a job has been running for a configured time  17:20
<jkeen> rbak, they can't all be running at once because we're still sticking them in the pool one at a time in a higher-level loop.  They might eventually all be running at once, but even then we're not waiting for any data.  We return as soon as the result queue is empty.  17:21
<rbak> mhoppal: Because the pool stop doesn't work.  It waits for all running checks to return before stopping, and that's not necessarily going to happen.  17:21
<jkeen> You don't have to deal with the results asynchronously unless you want to.  You can use a blocking map with a timeout.  That has its own issues, but we'll know what we're dropping at that point.  17:23
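
A rough sketch of that blocking-map-with-a-timeout idea using the stock multiprocessing API (illustrative only, not monasca-agent code):

    import multiprocessing

    def run_check(instance):
        # stand-in for a real plugin check
        return ('metric.for.%s' % instance, 1)

    def collect(instances, timeout=30):
        pool = multiprocessing.Pool(processes=4)
        try:
            result = pool.map_async(run_check, instances)
            metrics = result.get(timeout=timeout)   # blocks, but never forever
            pool.close()
            pool.join()
            return metrics
        except multiprocessing.TimeoutError:
            # at least one check overran the cycle: kill the workers instead
            # of joining on them, and we know exactly which batch was dropped
            pool.terminate()
            pool.join()
            return []

The trade-off, as rbak points out next, is that the whole batch then waits on its slowest check.
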
<rbak> But that doesn't work for the nagios checks  17:23
<rbak> If nothing else  17:23
<rbak> I could have a check that takes 5 minutes to run, and another that runs every minute.  If I block on waiting for data, that limits everything to the rate of the longest-running check.  17:24
<jkeen> rbak, why doesn't that work with nagios?  I don't have any experience with those checks.  17:24
<rbak> I just gave you an example  17:25
<jkeen> You're having checks that run well outside the 30 second collection period?  17:25
<rbak> Basically the checks could run at different rates  17:25
<rbak> And yes, we have checks that only run once an hour, but take several minutes.  17:25
<rbak> That's an extreme example though  17:26
<rbrndt> hmm  17:26
*** nadya has quit IRC17:26
<rbak> But I think we're getting off track.  I don't really see how this impacts the thread pool restarts  17:27
<rbak> My patch boils down to this.  The current implementation assumes checks always return.  This isn't always the case.  So we have to handle the case where it's not true.  17:29
<rbak> It's impossible to tell the difference between a long-running check and one that will never return, so at some point we have to just cut everything off and restart.  17:30
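
For illustration, a rough sketch (not rbak's actual patch) of what cutting everything off and restarting can look like with a multiprocessing pool; unlike joining a hung thread pool, terminate() returns promptly, and only the still-running checks lose their data.

    import multiprocessing
    import time

    def restart_pool_if_stuck(pool, in_flight, max_age=180, size=4):
        """in_flight is a list of (AsyncResult, submit_time) pairs.  If any
        check has been running longer than max_age seconds, kill the pool
        and start a fresh one; finished results are kept for harvesting."""
        now = time.time()
        stuck = [r for r, started in in_flight
                 if not r.ready() and now - started > max_age]
        if not stuck:
            return pool, in_flight
        pool.terminate()          # unlike joining hung threads, this returns
        pool.join()
        finished = [(r, t) for r, t in in_flight if r.ready()]
        return multiprocessing.Pool(processes=size), finished
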
<rbrndt> I think it's something of a different use case to handle checks that take that long to return  17:30
<rbrndt> jkeen, mhoppal, and roland are conferring for a moment  17:30
<rbak> Worth noting, this only loses data on checks that are still running.  The results queue would be unaffected, and that data would be collected later.  17:31
<rbak> rbrndt: thanks for letting me know  17:31
<jkeen> rbak, the problem here is we do want that behaviour for a given collection cycle.  I was planning to put up a patch that made the collection of these parallelized pieces more reliable, but it sounds like it would break your use case entirely.  17:34
<jkeen> Is it only the nagios checks that are the long-running ones, or are there http checks that take a while to return?  17:34
<rbak> As far as I know it's just the nagios checks  17:35
<rbak> Everything else returns fairly quickly  17:35
<jkeen> If you want to make a new superclass for the nagios checks that implements this new behaviour, I'd be able to make our http and tcp checks more reliable without affecting your long-running checks.  17:35
<jkeen> Is that something you can do?  17:36
<rbak> Probably not until after the summit, but sure  17:36
<rbak> But this still won't address the pool restarts  17:36
<jkeen> If we do this, though, we'll have separate pools for the nagios and the other parallelized checks.  I can fix the problem I see for the other checks, but that's not a viable solution for the nagios checks.  17:37
<jkeen> There are other options there though.  17:38
<rbak> I don't follow  17:38
<jkeen> I'm suggesting that you make a new superclass specifically for nagios checks that implements the process pool you currently have up, so that the nagios checks can do long-running operations independent of the collection interval.  17:39
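
A hypothetical shape for such a superclass (all names invented here, not part of the agent): it keeps its own process pool and never blocks the collection cycle, resubmitting an instance only after its previous run has finished.

    import multiprocessing

    def run_nagios_command(instance):
        # placeholder for shelling out to the actual nagios plugin
        return (instance['name'], 0)

    class LongRunningCheck(object):
        def __init__(self, size=2):
            self.pool = multiprocessing.Pool(processes=size)
            self.in_flight = {}             # instance name -> AsyncResult

        def check(self, instance):
            name = instance['name']
            pending = self.in_flight.get(name)
            if pending is not None:
                if not pending.ready():
                    return []               # still running, nothing to report yet
                del self.in_flight[name]
                return [pending.get()]      # result from an earlier cycle
            self.in_flight[name] = self.pool.apply_async(
                run_nagios_command, (instance,))
            return []
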
<rbak> But the process pool has nothing to do with long running checks  17:39
<rbak> They're separate issues  17:40
<rbak> Let's ignore nagios checks for the moment, and say an http check hangs, which we've seen happen (not for a while, but we're not sure if that means the bug is fixed)  17:41
<jkeen> I don't see that.  The current thread pool, and the process pool patch, result in unreliable collection.  I want to make it reliable but that means that you can't have a check that takes longer than the collection cycle.  17:41
<rbak> If an http check hangs, how does your new collection mechanism fix it?  17:41
*** ybathia has joined #openstack-monasca17:42
<jkeen> For http checks my current plan is to replace the _process_results function with a process.map call that will time out if the checks take too long, along with modifying the http checks so that they're reliable.  17:42
<jkeen> Since they're running as subprocesses in the map, we can interrupt them and force a return easily enough.  17:42
<rbak> But I still don't think you can interrupt a single process, even with a map.  17:43
<rbak> And I don't see a timeout option.  17:44
<jkeen> Having the subprocess interrupt itself and return isn't a problem.  I was doing that in another part of Monasca before I found a cleaner way for that particular case.  17:45
<openstackgerrit> Michael Hoppal proposed openstack/monasca-api: Add periodic interval field to notification method  https://review.openstack.org/308502  17:45
<jkeen> If there isn't currently a timeout option, we'd add one.  17:46
<rbak> You mentioned process.map; are you talking about the multiprocessing module?  17:46
<rbak> Or are you still trying to use the thread pool?  17:47
<jkeen> Yes, we'd use the multiprocessing module and get rid of the thread pool library.  It just looks like a problem waiting to happen.  17:47
<rbak> At least we agree there  17:48
<rbak> I'm not sure how you would add a timeout option to the multiprocessing module though  17:48
<jkeen> There are several ways to time out a map operation.  You can use the results object it returns to time out, but in that case you get no data.  You can use a callback function and time out the parent process.  You can use an imap operation and time out there if a check takes too long, but you'll still get a partial set of data back.  17:50
<jkeen> What I'd look at first is using a signal handler in the subprocess to interrupt it and force the return of a failure.  That will let us get a result back so we can identify the failing check, rather than having a subset of the http checks go undetermined.  17:50
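
A sketch of that signal-handler idea, under the assumption the checks run on Unix (do_http_check is a stand-in): the worker interrupts itself, so a stuck check comes back as an identifiable failure instead of hanging the pool.

    import signal

    class CheckTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise CheckTimeout()

    def do_http_check(instance):
        # stand-in for the real HTTP poll, which could block indefinitely
        return 200

    def run_check_with_timeout(instance, timeout=25):
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(timeout)               # deliver SIGALRM after `timeout` seconds
        try:
            return ('ok', do_http_check(instance))
        except CheckTimeout:
            return ('timeout', instance)    # caller can see which check failed
        finally:
            signal.alarm(0)                 # always cancel any pending alarm
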
<rbak> Alright, you seem to have some idea of how that would work.  17:51
<rbak> I've never tried it.  17:52
<rbak> Ok, so that's fine for everything except the nagios module.  17:52
<rbak> Any timeline on this patch you're proposing?  17:52
<jkeen> Well, like I mentioned earlier, I don't think this idea works for the nagios checks.  That's why I'd like to see a new parent class for nagios that contains your current patch set.  17:55
*** mhoppal_ has joined #openstack-monasca17:55
*** craigbr has quit IRC17:55
<jkeen> Then when I can find some time, hopefully soon, I can do the proposed patch to the http and tcp checks.  17:55
*** ddieterly is now known as ddieterly[away]17:57
*** mhoppal has quit IRC17:57
<rbak> jkeen: Is there any reason not to just merge this patch and work from there?  It would fix our immediate issue; you never hit restarts, so you shouldn't have any problems with data loss.  It already rips out the thread pool and reformats the checks for use in a process pool.  17:59
<rbak> I'm happy to separate out nagios in the long run, but this is a pressing problem for us.  18:00
<rbrndt> except we did see data loss in our testing  18:00
<rbak> I thought you said you never saw restarts?  18:00
<rbak> What was causing the data loss?  18:00
<rbrndt> Didn't find the root cause as of yet  18:01
*** ybathia has quit IRC18:01
<jkeen> rbak, we never saw data loss with the thread pool, but we have seen data loss with the multiprocessing modifications.  18:02
<rbak> Out of curiosity, how do you know when data has been lost?  18:02
<rbrndt> In my test, I found an error in the collector log and saw missing metrics  18:02
<jkeen> For us it was on an http check, since I don't think we use any nagios checks at the moment.  18:03
<rbrndt> yeah, it was http  18:03
<rbak> Looks like that bug's still around then  18:03
<rbak> Alright, I'll apply this to just nagios checks, but let me know if you're not going to get around to your patch soon.  18:04
<jkeen> Ok, thanks.  18:05
*** ybathia has joined #openstack-monasca18:20
*** mhoppal_ has quit IRC18:23
*** craigbr has joined #openstack-monasca18:31
*** vishwanathj has quit IRC18:50
*** vishwanathj has joined #openstack-monasca18:50
*** ljxiash has joined #openstack-monasca18:52
*** ddieterly[away] is now known as ddieterly18:54
*** ljxiash has quit IRC18:56
*** ducttape_ has quit IRC19:20
*** ducttape_ has joined #openstack-monasca19:27
*** ybathia has quit IRC19:40
<openstackgerrit> Ryan Brandt proposed openstack/monasca-api: Fix metric-list limits  https://review.openstack.org/307963  19:43
*** ducttape_ has quit IRC19:47
*** ducttape_ has joined #openstack-monasca19:56
*** ddieterly is now known as ddieterly[away]20:01
*** ddieterly[away] is now known as ddieterly20:04
*** rbak has quit IRC20:04
*** ybathia has joined #openstack-monasca20:31
*** rbak has joined #openstack-monasca20:50
*** ljxiash has joined #openstack-monasca20:53
*** ljxiash has quit IRC20:58
*** ybathia has quit IRC20:59
*** ybathia has joined #openstack-monasca21:10
*** ybathia has quit IRC21:12
*** ybathia has joined #openstack-monasca21:12
<openstackgerrit> Michael Hoppal proposed openstack/monasca-agent: Add upper-constraints to our tox file  https://review.openstack.org/308591  21:13
<openstackgerrit> David Schroeder proposed openstack/monasca-agent: Refresh of Agent plugin documentation  https://review.openstack.org/308592  21:17
<openstackgerrit> Michael Hoppal proposed openstack/monasca-agent: Add upper-constraints to our tox file  https://review.openstack.org/308591  21:37
*** slogan has joined #openstack-monasca21:37
*** ddieterly is now known as ddieterly[away]21:37
*** ddieterly[away] is now known as ddieterly21:38
<slogan> rhochmuth: FYI https://github.com/openstack/broadview-ui  21:44
<slogan> I think that's it for the projects  21:44
<slogan> I can rest (a bit)  21:45
<slogan> :-)  21:45
<rhochmuth> cool  21:47
<rhochmuth> I still haven't installed into devstack  21:47
<slogan> rbak: I documented grafana, not sure if I shared that - until (and if) you decide to externalize some docs, maybe it will be useful to someone: https://github.com/openstack/broadview-collector/blob/master/doc/microburst_simulation.md  21:47
<rhochmuth> I'm basically just trying to cram all this work in  21:47
<slogan> devstack?  21:47
<rhochmuth> yeah  21:48
<slogan> my experience (other than the issues my patch addressed) is monasca and devstack work fine  21:48
<slogan> I do do vagrant - it's a bit too resource heavy  21:48
<slogan> what remains to be done?  21:48
<rbak> slogan: I looked at those the other day and they looked good.  I entirely forgot about putting out docs myself, but I'll get to that.  21:49
<slogan> s/do do/don't do/  21:49
<slogan> rbak: yup - I noted there the issue with keystone and the workaround  21:50
<slogan> assuming that is still a problem  21:50
<rbak> I'm not really sure what that problem is  21:50
<rbak> We've been running this in production for a while now with no problems  21:51
<slogan> nod  21:51
<slogan> I never dug into it; the workaround was reasonable  21:51
<rbak> Do you have any more information on that?  21:51
<rbak> On the problem with keystone auth, that is.  21:51
<slogan> nothing, no  21:51
<rbak> Try it again when you get the chance.  There have been some changes, so maybe it's fixed.  21:52
<rbak> If not, just let me know what sort of error you're seeing and I'll take a look  21:52
<slogan> I'll try today too, unless I get diverted  21:52
<slogan> the error was basically, I think, being unable to test the connection successfully  21:53
<openstackgerrit> Michael Hoppal proposed openstack/monasca-agent: Change tox file  https://review.openstack.org/308591  21:53
<slogan> so I generated a token, then it worked  21:53
<slogan> I should be able to give it another try; I'll do it now, in fact  21:53
<slogan> also, while I have your ears: I patched the devstack plugin in a very simple way to get around some issues, like mkdir failing because a directory already exists, and adduser failing because a user like mon-api exists. I am one of probably many users who do ./stack.sh, ./unstack.sh without a clean in the middle. I'm not allowed to contribute code, so what is the best way to get this to someone who can?  21:56
<openstackgerrit> Michael Hoppal proposed openstack/monasca-agent: Change tox file  https://review.openstack.org/308591  21:57
<openstackgerrit> Michael Hoppal proposed openstack/monasca-notification: Change tox file  https://review.openstack.org/308644  21:57
<openstackgerrit> Michael Hoppal proposed openstack/monasca-persister: Change tox file  https://review.openstack.org/308645  21:57
<openstackgerrit> Michael Hoppal proposed openstack/python-monascaclient: Change tox file  https://review.openstack.org/308646  21:57
<openstackgerrit> Michael Hoppal proposed openstack/monasca-api: Change tox file  https://review.openstack.org/308647  21:57
*** ddieterly is now known as ddieterly[away]22:02
*** ybathia has quit IRC22:08
*** bobh has quit IRC22:21
*** ddieterly[away] is now known as ddieterly22:25
*** ybathia has joined #openstack-monasca22:40
*** ducttape_ has quit IRC22:40
*** ducttape_ has joined #openstack-monasca22:41
<rhochmuth> slogan: If you want to send me the fixes, I can try and get them in  22:41
*** Gen has joined #openstack-monasca22:43
*** ddieterly has quit IRC22:43
*** rhochmuth has quit IRC22:45
*** jkeen has quit IRC22:49
<openstackgerrit> Michael Hoppal proposed openstack/monasca-agent: Change tox file  https://review.openstack.org/308591  22:51
*** krotscheck is now known as krotscheck_dcm22:57
<openstackgerrit> Michael Hoppal proposed openstack/monasca-agent: Change tox file  https://review.openstack.org/308591  23:02
*** rbrndt has quit IRC23:07
*** ddieterly has joined #openstack-monasca23:15
*** dschroeder has quit IRC23:22
*** ducttape_ has quit IRC23:22
*** ddieterly is now known as ddieterly[away]23:23
*** bobh has joined #openstack-monasca23:32
*** kse has joined #openstack-monasca23:32
*** kei_yama has joined #openstack-monasca23:36
*** bobh has quit IRC23:56
