18:11:52 <sputnik13> #startmeeting cue
18:11:53 <openstack> Meeting started Tue Nov 10 18:11:52 2015 UTC and is due to finish in 60 minutes.  The chair is sputnik13. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:11:55 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:11:57 <openstack> The meeting name has been set to 'cue'
18:12:05 <sputnik13> roll call
18:12:07 <sputnik13> o/
18:12:15 <davideagnello> o/
18:12:44 <sputnik13> hmm, no one else here
18:12:48 <sputnik13> ?
18:13:20 <sputnik13> alrighty then
18:15:05 <sputnik13> #topic Action Items
18:15:15 <sputnik13> only action from last week is on me...
18:15:20 * sputnik13 kicks the can a little further
18:15:33 <sputnik13> #action sputnik13 to fill in details for https://blueprints.launchpad.net/cue/+spec/kafka
18:15:39 <sputnik13> maybe I can stop kicking the can now :)
18:16:10 <davideagnello> haha the can has been moving quite far :)
18:17:48 <sputnik13> new bug
18:18:11 <sputnik13> err yeah we're moving to bugs... I don't have anything in particular to discuss and with most of the team not here...
18:18:21 <sputnik13> #topic Bugs
18:18:26 <sputnik13> #link https://bugs.launchpad.net/cue/+bug/1514559
18:18:26 <openstack> Launchpad bug 1514559 in Cue "Zookeeper runs out of Heap space" [Undecided,New]
18:19:17 <davideagnello> sputnik13: yes, added that recently
18:19:56 <davideagnello> there are a couple of changes that are required in order to address the issues related to this bug
18:21:39 <davideagnello> the first being where we persist all job related data
18:21:52 <davideagnello> the second, when we delete old job related data
18:22:38 <davideagnello> the issue is that zookeeper runs out of heap space because the data we save to it is never deleted, so it accumulates over time
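(A minimal sketch of the retention logic being discussed; the job-record shape, the `RETENTION` window, and the `stale_jobs` helper are illustrative assumptions, not Cue's actual schema or API:)

```python
from datetime import datetime, timedelta

# Hypothetical retention window; in practice this would be operator-configurable.
RETENTION = timedelta(days=7)

def stale_jobs(jobs, now):
    """Return the ids of completed jobs older than the retention window.

    `jobs` is assumed to be an iterable of (job_id, completed_at) pairs,
    where completed_at is None for jobs still in flight.
    """
    return [job_id for job_id, completed_at in jobs
            if completed_at is not None and now - completed_at > RETENTION]
```

Selecting stale records this way (rather than deleting everything past a size cap) keeps in-flight jobs safe regardless of how full the store is.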
18:22:49 <dkalleg> o/
18:23:54 <sputnik13> ok, so we probably need a couple different strategies to address this
18:24:14 <sputnik13> but this does sound like a critical bug since cue will stop functioning over time
18:25:24 <davideagnello> I agree
18:25:51 <sputnik13> the "where we persist" thing seems easy enough, we just need to make sure that's well tested
18:26:09 <sputnik13> the "when we delete" seems like it'll be more involved
18:26:21 <davideagnello> sounds like it
18:27:14 <sputnik13> ok, marked critical and triaged
18:27:19 <sputnik13> we'll have to get it in asap
18:27:35 <sputnik13> let's discuss the details of implementation in the bug
18:28:22 <davideagnello> in terms of the where aspect first
18:28:59 <sputnik13> what of it?
18:29:49 <davideagnello> where we save task flow job related data
18:30:26 <dkalleg> mysql!
18:30:28 <sputnik13> alright, let's capture that in the bug
18:30:37 <sputnik13> as in, in the bug report
18:30:54 <sputnik13> https://bugs.launchpad.net/cue/+bug/1514559
18:30:54 <openstack> Launchpad bug 1514559 in Cue "Zookeeper runs out of Heap space" [Critical,Triaged]
18:31:06 <sputnik13> any other bugs we need to cover?
18:31:15 <dkalleg> however, that does yield a similar result, but slower.  If we're constantly pushing data to mysql without cleaning periodically, it will fill too
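(To illustrate dkalleg's point that MySQL fills up too without periodic pruning, here is a hedged sketch using an in-memory sqlite3 database as a stand-in for MySQL; the `jobs` table, its columns, and the `purge_completed` helper are assumptions for illustration only:)

```python
import sqlite3

# In-memory stand-in for the real MySQL job store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT, state TEXT, completed_at REAL)")

def purge_completed(conn, cutoff):
    """Delete completed job rows finished before `cutoff`; return count deleted."""
    cur = conn.execute(
        "DELETE FROM jobs WHERE state = 'COMPLETED' AND completed_at < ?",
        (cutoff,))
    conn.commit()
    return cur.rowcount
```

Without something like this running periodically, the table grows without bound, which is the same failure mode as the Zookeeper heap, just on a slower clock.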
18:32:39 <sputnik13> dkalleg agreed, we need a more complete answer to this problem
18:32:59 <sputnik13> good news with mysql though is there is usually monitoring
18:33:05 <dkalleg> Is a simple size cap & destroy old records good enough?
18:33:56 <sputnik13> to do that cue would have to be aware of how much space is being used
18:34:18 <sputnik13> I think that's better accomplished via operator/infrastructure automation
18:34:24 <davideagnello> wouldn't saving all job related data to disk vs memory affect performance though?
18:34:41 <sputnik13> we can affect what is kept and what's deleted
18:35:05 <sputnik13> davideagnello technically, but we're not dealing with ultra high throughput, so that shouldn't be an issue
18:36:34 <davideagnello> ok
18:37:20 <davideagnello> I think another aspect can be a recommended zookeeper configuration/tuning for Cue
18:38:15 <sputnik13> davideagnello that's fine but that kind of masks the problem rather than resolving it
18:38:40 <sputnik13> we need some functions to allow operators to safely remove records
18:38:44 <davideagnello> sputnik13: yup, it wouldn't be replacing the correct solution, it's more of an optimization
18:38:56 <sputnik13> and maybe some background process that periodically cleans things
18:39:36 <davideagnello> hmm maybe that can be integrated with the conductor process
18:39:37 <sputnik13> the latter could even be offloaded to whatever automation already exists for monitoring disk/database usage levels
18:41:51 * sputnik13 hears crickets
18:42:44 * dkalleg hires exterminator
18:42:50 <davideagnello> the only issue I see with having all the logic in the monitor process is that if it is ever down, the API (which keeps posting jobs) can bring Cue down over time
18:43:25 <sputnik13> "whatever automation already exists for monitoring" != cue-monitor
18:43:45 <davideagnello> not saying to do the check in the API process either, since that would slow it down
18:44:01 <davideagnello> sputnik13: ok
18:44:37 <sputnik13> I mean some other monitoring suite that's being used to monitor the health of the box
18:44:42 <sputnik13> monasca for instance
18:44:54 <davideagnello> possibly when the api receives a request, it can spawn a monitor process which would do the work in parallel and terminate when finished
18:45:43 <sputnik13> I think that makes things too complicated, the smarter you try to make one thing the more complex and error prone it will get
18:46:09 <sputnik13> it's a relatively simple problem, if disk starts filling up, run a command to delete old records
18:46:12 <davideagnello> simplicity in the design is key
18:46:26 <davideagnello> pretty much
18:46:41 <sputnik13> there will be monitoring to do things similar to this, it's a very common problem with databases
18:47:01 <davideagnello> ok
18:47:17 <sputnik13> what we need to do is provide a tool to safely delete records and provide documentation on how to do this
18:47:35 <sputnik13> if the operator wants to just always run it every 2 minutes they can issue the command via cron as well
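(A hedged sketch of the kind of operator command sputnik13 describes, suitable for running from cron; the tool name `cue-purge-jobs` and its flags are hypothetical, not an existing Cue command:)

```python
import argparse

def build_parser():
    """Build the CLI for a hypothetical record-cleanup tool."""
    parser = argparse.ArgumentParser(
        prog="cue-purge-jobs",
        description="Safely delete completed job records older than N days.")
    parser.add_argument("--retention-days", type=int, default=7,
                        help="keep completed records newer than this many days")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be deleted without deleting")
    return parser
```

An operator could then schedule it however they like, e.g. a crontab entry such as `*/2 * * * * cue-purge-jobs --retention-days 7`, which keeps the scheduling decision out of Cue itself.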
18:48:06 <sputnik13> but if we try to make an intelligent service that runs this for them it'll get complicated pretty fast
18:49:14 <davideagnello> it would make sense to have such a process run on its own, in a scheduler
18:51:00 <sputnik13> I think that's an operator decision that we don't need to make for them
18:51:24 <sputnik13> whether it's a scheduler or some monitoring process or manual shouldn't matter
18:51:44 <davideagnello> ok, it should be documented then
18:51:49 <sputnik13> yes
18:52:43 <sputnik13> are we good with this bug?  we're almost at the top of the hour
18:52:59 <davideagnello> yup, it should be addressed soon
18:53:10 <dkalleg> yup
18:55:15 <sputnik13> cool, any other things to discuss?
18:55:54 <davideagnello> that's all from me
18:56:05 <sputnik13> alright I'm calling it...
18:56:09 <sputnik13> going once
18:56:12 <sputnik13> going twice....
18:56:24 <sputnik13> #endmeeting