18:11:52 #startmeeting cue
18:11:53 Meeting started Tue Nov 10 18:11:52 2015 UTC and is due to finish in 60 minutes. The chair is sputnik13. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:11:55 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:11:57 The meeting name has been set to 'cue'
18:12:05 roll call
18:12:07 o/
18:12:15 o/
18:12:44 hmm, no one else here
18:12:48 ?
18:13:20 alrighty then
18:15:05 #topic Action Items
18:15:15 the only action from last week is on me...
18:15:20 * sputnik13 kicks the can a little further
18:15:33 #action sputnik13 to fill in details for https://blueprints.launchpad.net/cue/+spec/kafka
18:15:39 maybe I can stop kicking the can now :)
18:16:10 haha, the can has been moving quite far :)
18:17:48 new bug
18:18:11 err, yeah, we're moving to bugs... I don't have anything in particular to discuss, and with most of the team not here...
18:18:21 #topic Bugs
18:18:26 #link https://bugs.launchpad.net/cue/+bug/1514559
18:18:26 Launchpad bug 1514559 in Cue "Zookeeper runs out of Heap space" [Undecided,New]
18:19:17 sputnik13: yes, added that recently
18:19:56 there are a couple of changes required to address the issues related to this bug
18:21:39 the first being where we persist all job-related data
18:21:52 the second, when we delete old job-related data
18:22:38 the issue is that the data we save to ZooKeeper is never deleted, so over time it runs out of heap space
18:22:49 o/
18:23:54 ok, so we probably need a couple of different strategies to address this
18:24:14 but this does sound like a critical bug, since Cue will stop functioning over time
18:25:24 I agree
18:25:51 the "where we persist" part seems easy enough; we just need to make sure it's well tested
18:26:09 the "when we delete" part seems like it'll be more involved
18:26:21 sounds like it
18:27:14 ok, marked critical and triaged
18:27:19 we'll have to get it in asap
18:27:35 let's discuss the details of implementation in the bug
18:28:22 in terms of the "where" aspect first
18:28:59 what of it?
18:29:49 where we save taskflow job-related data
18:30:26 mysql!
18:30:28 alright, let's capture that in the bug
18:30:37 as in, in the bug report
18:30:54 https://bugs.launchpad.net/cue/+bug/1514559
18:30:54 Launchpad bug 1514559 in Cue "Zookeeper runs out of Heap space" [Critical,Triaged]
18:31:06 any other bugs we need to cover?
18:31:15 however, that yields a similar result, just more slowly: if we're constantly pushing data to MySQL without periodic cleanup, it will fill up too
18:32:39 dkalleg: agreed, we need a more complete answer to this problem
18:32:59 the good news with MySQL, though, is that there is usually monitoring
18:33:05 is a simple size cap & destroy-old-records policy good enough?
18:33:56 to do that, Cue would have to be aware of how much space is being used
18:34:18 I think that's better accomplished via operator/infrastructure automation
18:34:24 wouldn't saving all job-related data to disk vs memory affect performance though?
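
For context on the MySQL suggestion above: taskflow selects its persistence backend from the connection URI, so moving job data out of ZooKeeper is largely a configuration change. A minimal sketch, assuming the taskflow persistence API; the URI, credentials, and database name are placeholders, not Cue's actual settings:

```python
# Sketch: pointing taskflow job persistence at MySQL instead of
# ZooKeeper. The connection URI below is a placeholder.
import contextlib

from taskflow.persistence import backends

# taskflow picks the backend implementation from the URI scheme, so a
# mysql:// connection routes persistence through SQLAlchemy/MySQL.
conf = {'connection': 'mysql://cue:secret@localhost/cue_taskflow'}

backend = backends.fetch(conf)
with contextlib.closing(backend.get_connection()) as conn:
    conn.upgrade()  # create the persistence tables if they don't exist yet
```

Job details written through this backend land in MySQL tables rather than ZooKeeper znodes, which is what stops the ZooKeeper heap from growing; as the discussion below notes, the records still need periodic cleanup or MySQL fills up in turn.
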
18:34:41 we can control what is kept and what's deleted
18:35:05 davideagnello: technically yes, but we're not dealing with ultra-high throughput, so that shouldn't be an issue
18:36:34 ok
18:37:20 I think another aspect could be a recommended ZooKeeper configuration/tuning for Cue
18:38:15 davideagnello: that's fine, but that kind of masks the problem rather than resolving it
18:38:40 we need some functions to allow operators to safely remove records
18:38:44 sputnik13: yup, it wouldn't replace the correct solution, it's more of an optimization
18:38:56 and maybe some background process that periodically cleans things up
18:39:36 hmm, maybe that could be integrated with the conductor process
18:39:37 the latter could even be offloaded to whatever automation already exists for monitoring disk/database usage levels
18:41:51 * sputnik13 hears crickets
18:42:44 * dkalleg hires exterminator
18:42:50 the only issue I see with having all the logic in the monitor process is that if it is ever down, the API (which keeps posting jobs) can bring Cue down over time
18:43:25 "whatever automation already exists for monitoring" != cue-monitor
18:43:45 I'm not saying to do the check in the API process either, since that would slow it down
18:44:01 sputnik13: ok
18:44:37 I mean some other monitoring suite that's being used to monitor the health of the box
18:44:42 monasca, for instance
18:44:54 possibly, when the API receives a request it could spawn a monitor process which would do the work in parallel and terminate when finished
18:45:43 I think that makes things too complicated; the smarter you try to make one thing, the more complex and error-prone it will get
18:46:09 it's a relatively simple problem: if the disk starts filling up, run a command to delete old records
18:46:12 simplicity in the design is key
18:46:26 pretty much
18:46:41 there will be monitoring to do things like this; it's a very common problem with databases
18:47:01 ok
18:47:17 what we need to do is provide a tool to safely delete records, and provide documentation on how to do this
18:47:35 if the operator wants to just always run it every 2 minutes, they can issue the command via cron as well
18:48:06 but if we try to make an intelligent service that runs this for them, it'll get complicated pretty fast
18:49:14 it would make sense to have such a process run on its own, in a scheduler
18:51:00 I think that's an operator decision that we don't need to make for them
18:51:24 whether it's a scheduler, some monitoring process, or manual shouldn't matter
18:51:44 ok, it should be documented then
18:51:49 yes
18:52:43 are we good with this bug? we're almost at the top of the hour
18:52:59 yup, it should be addressed soon
18:53:10 yup
18:55:15 cool, any other things to discuss?
18:55:54 that's all from me
18:56:05 alright, I'm calling it...
18:56:09 going once
18:56:12 going twice....
18:56:24 #endmeeting
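
Picking up the conclusion above (ship a safe record-deletion tool plus documentation, and leave scheduling to the operator), here is a rough sketch of what such a purge command might look like. The purge_old_logbooks helper, the seven-day cutoff, and the connection URI are illustrative assumptions, not Cue's actual tooling:

```python
# Illustrative sketch of an operator-run purge tool along the lines
# discussed in the meeting; names and defaults are assumptions.
import contextlib
import datetime

from taskflow.persistence import backends


def purge_old_logbooks(connection_uri, max_age_days=7):
    """Destroy taskflow logbooks not updated within the cutoff window."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=max_age_days)
    backend = backends.fetch({'connection': connection_uri})
    with contextlib.closing(backend.get_connection()) as conn:
        # Materialize the list first so we don't delete while iterating.
        for book in list(conn.get_logbooks()):
            updated = book.updated_at or book.created_at
            if updated is not None and updated < cutoff:
                conn.destroy_logbook(book.uuid)


if __name__ == '__main__':
    # Works against whichever backend Cue is configured with (the URI
    # here is a placeholder); an operator can run it by hand, from cron
    # every 2 minutes as suggested above, or from a monitoring hook.
    purge_old_logbooks('mysql://cue:secret@localhost/cue_taskflow')
```

Keeping this as a standalone command rather than a built-in scheduler matches the simplicity argument made in the meeting: when and how often it runs remains an operator decision.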