18:11:52 <sputnik13> #startmeeting cue
18:11:53 <openstack> Meeting started Tue Nov 10 18:11:52 2015 UTC and is due to finish in 60 minutes. The chair is sputnik13. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:11:55 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:11:57 <openstack> The meeting name has been set to 'cue'
18:12:05 <sputnik13> roll call
18:12:07 <sputnik13> o/
18:12:15 <davideagnello> o/
18:12:44 <sputnik13> hmm, no one else here
18:12:48 <sputnik13> ?
18:13:20 <sputnik13> alrighty then
18:15:05 <sputnik13> #topic Action Items
18:15:15 <sputnik13> only action from last week is on me...
18:15:20 * sputnik13 kicks the can a little further
18:15:33 <sputnik13> #action sputnik13 to fill in details for https://blueprints.launchpad.net/cue/+spec/kafka
18:15:39 <sputnik13> maybe I can stop kicking the can now :)
18:16:10 <davideagnello> haha the can has been moving quite far :)
18:17:48 <sputnik13> new bug
18:18:11 <sputnik13> err yeah we're moving to bugs... I don't have anything in particular to discuss, and most of the team isn't here...
18:18:21 <sputnik13> #topic Bugs
18:18:26 <sputnik13> #link https://bugs.launchpad.net/cue/+bug/1514559
18:18:26 <openstack> Launchpad bug 1514559 in Cue "Zookeeper runs out of Heap space" [Undecided,New]
18:19:17 <davideagnello> sputnik13: yes, added that recently
18:19:56 <davideagnello> there are a couple of changes required to address the issues related to this bug
18:21:39 <davideagnello> the first being where we persist all job-related data
18:21:52 <davideagnello> the second, when we delete old job-related data
18:22:38 <davideagnello> the issue is that ZooKeeper runs out of heap space because the data we save to it is never deleted, so over time it runs out of heap space
18:22:49 <dkalleg> o/
18:23:54 <sputnik13> ok, so we probably need a couple of different strategies to address this
18:24:14 <sputnik13> but this does sound like a critical bug since cue will stop functioning over time
18:25:24 <davideagnello> I agree
18:25:51 <sputnik13> the "where we persist" thing seems easy enough, we just need to make sure that's well tested
18:26:09 <sputnik13> the "when we delete" seems like it'll be more involved
18:26:21 <davideagnello> sounds like it
18:27:14 <sputnik13> ok, marked critical and triaged
18:27:19 <sputnik13> we'll have to get it in asap
18:27:35 <sputnik13> let's discuss the details of implementation in the bug
18:28:22 <davideagnello> in terms of the where aspect first
18:28:59 <sputnik13> what of it?
18:29:49 <davideagnello> where we save taskflow job-related data
18:30:26 <dkalleg> mysql!
18:30:28 <sputnik13> alright, let's capture that in the bug
18:30:37 <sputnik13> as in, in the bug report
18:30:54 <sputnik13> https://bugs.launchpad.net/cue/+bug/1514559
18:30:54 <openstack> Launchpad bug 1514559 in Cue "Zookeeper runs out of Heap space" [Critical,Triaged]
18:31:06 <sputnik13> any other bugs we need to cover?
18:31:15 <dkalleg> however, that does yield a similar result, just slower. If we're constantly pushing data to mysql without cleaning periodically, it will fill up too
18:32:39 <sputnik13> dkalleg agreed, we need a more complete answer to this problem
18:32:59 <sputnik13> good news with mysql though is there is usually monitoring
18:33:05 <dkalleg> Is a simple size cap & destroy old records good enough?
18:33:56 <sputnik13> to do that cue would have to be aware of how much space is being used
18:34:18 <sputnik13> I think that's better accomplished via operator/infrastructure automation
18:34:24 <davideagnello> wouldn't saving all job-related data to disk vs memory affect performance though?
18:34:41 <sputnik13> we can affect what is kept and what's deleted
18:35:05 <sputnik13> davideagnello technically, but we're not dealing with ultra high throughput, so that shouldn't be an issue
18:36:34 <davideagnello> ok
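For context on the "where we persist" change being discussed: taskflow supports pluggable persistence backends, so job data can be pointed at MySQL instead of ZooKeeper. A minimal sketch follows, assuming a SQLAlchemy-backed taskflow persistence backend; the connection URL is a made-up placeholder, not Cue's actual configuration:

    # Sketch: point taskflow's persistence at MySQL rather than ZooKeeper.
    # The connection URL below is a placeholder, not Cue's real config.
    import contextlib

    from taskflow.persistence import backends

    conf = {'connection': 'mysql+pymysql://cue:secret@localhost/cue_taskflow'}
    backend = backends.fetch(conf)

    # Create or upgrade the persistence schema before first use.
    with contextlib.closing(backend.get_connection()) as conn:
        conn.upgrade()

As noted in the discussion, writing job state to disk instead of memory trades some speed, but at Cue's throughput that cost should be negligible.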
18:37:20 <davideagnello> I think another aspect can be a recommended zookeeper configuration/tuning for Cue
18:38:15 <sputnik13> davideagnello that's fine but that kind of masks the problem rather than resolving it
18:38:40 <sputnik13> we need some functions to allow operators to safely remove records
18:38:44 <davideagnello> sputnik13: yup, it wouldn't be replacing the correct solution, it's more of an optimization
18:38:56 <sputnik13> and maybe some background process that periodically cleans things
18:39:36 <davideagnello> hmm maybe that can be integrated with the conductor process
18:39:37 <sputnik13> the latter could even be offloaded to whatever automation already exists for monitoring disk/database usage levels
18:41:51 * sputnik13 hears crickets
18:42:44 * dkalleg hires exterminator
18:42:50 <davideagnello> the only issue I see with having all the logic in the monitor process is that if it is ever down, the API (which posts jobs) can bring Cue down over time
18:43:25 <sputnik13> "whatever automation already exists for monitoring" != cue-monitor
18:43:45 <davideagnello> not saying to do the check in the API process either, since that would slow it down
18:44:01 <davideagnello> sputnik13: ok
18:44:37 <sputnik13> I mean some other monitoring suite that's being used to monitor the health of the box
18:44:42 <sputnik13> monasca for instance
18:44:54 <davideagnello> possibly when the api receives a request, it can spawn a monitor process which would do the work in parallel and terminate when finished
18:45:43 <sputnik13> I think that makes things too complicated, the smarter you try to make one thing the more complex and error-prone it will get
18:46:09 <sputnik13> it's a relatively simple problem: if the disk starts filling up, run a command to delete old records
18:46:12 <davideagnello> simplicity in the design is key
18:46:26 <davideagnello> pretty much
18:46:41 <sputnik13> there will be monitoring to do things similar to this, it's a very common problem with databases
18:47:01 <davideagnello> ok
18:47:17 <sputnik13> what we need to do is provide a tool to safely delete records and provide documentation on how to do this
18:47:35 <sputnik13> if the operator wants to just always run it every 2 minutes they can issue the command via cron as well
18:48:06 <sputnik13> but if we try to make an intelligent service that runs this for them it'll get complicated pretty fast
18:49:14 <davideagnello> it would make sense to have such a process run on its own, in a scheduler
18:51:00 <sputnik13> I think that's an operator decision that we don't need to make for them
18:51:24 <sputnik13> whether it's a scheduler or some monitoring process or manual shouldn't matter
18:51:44 <davideagnello> ok, it should be documented then
18:51:49 <sputnik13> yes
18:52:43 <sputnik13> are we good with this bug? we're almost at the top of the hour
18:52:59 <davideagnello> yup, it should be addressed soon
18:53:10 <dkalleg> yup
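The "tool to safely delete records" agreed on above could be as small as a script that drops job rows older than a cutoff, leaving the scheduling to the operator. A rough sketch under the assumption that job data lives in a SQL database; the logbooks table and updated_at column are hypothetical stand-ins for whatever schema the persistence layer actually creates:

    # Hypothetical purge tool: delete job records older than --days.
    # Table/column names here are illustrative, not Cue's real schema.
    import argparse
    import datetime

    import sqlalchemy as sa


    def purge_old_records(engine, days):
        """Delete job rows last updated more than `days` days ago."""
        cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=days)
        logbooks = sa.Table('logbooks', sa.MetaData(), autoload_with=engine)
        with engine.begin() as conn:
            result = conn.execute(
                sa.delete(logbooks).where(logbooks.c.updated_at < cutoff))
        return result.rowcount


    if __name__ == '__main__':
        parser = argparse.ArgumentParser(description='Purge old Cue job data')
        parser.add_argument('--connection', required=True)
        parser.add_argument('--days', type=int, default=7)
        args = parser.parse_args()
        engine = sa.create_engine(args.connection)
        print('%d records purged' % purge_old_records(engine, args.days))

An operator who wants sputnik13's run-it-every-2-minutes behavior could then just schedule it via cron (e.g. */2 * * * * cue-purge --connection ... --days 7, with cue-purge being a hypothetical command name) rather than Cue shipping its own scheduler.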
18:55:15 <sputnik13> cool, any other things to discuss?
18:55:54 <davideagnello> that's all from me
18:56:05 <sputnik13> alright I'm calling it...
18:56:09 <sputnik13> going once
18:56:12 <sputnik13> going twice....
18:56:24 <sputnik13> #endmeeting