21:00:21 #startmeeting ceilometer
21:00:21 Meeting started Wed Dec 18 21:00:21 2013 UTC and is due to finish in 60 minutes. The chair is jd__. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:24 The meeting name has been set to 'ceilometer'
21:00:36 o/
21:00:39 hello
21:00:44 hello everyone
21:00:51 #link https://wiki.openstack.org/wiki/Meetings/Ceilometer
21:00:54 o/
21:01:01 o/
21:01:08 o/
21:01:30 o/
21:01:33 o/
21:01:38 o/
21:02:28 o/
21:03:08 o/
21:03:15 o/
21:03:41 #topic Release python-ceilometerclient?
21:03:48 I think we're good on that
21:03:56 yep, no need AFAIK
21:04:55 +1
21:05:00 #topic housekeeping: meeting cancellation/re-scheduling over the holiday period (eglynn)
21:05:07 well, our next two scheduled meetings fall on public holidays here
21:05:14 (in Ireland at least)
21:05:23 i.e. Thurs Dec 26th and Wed Jan 1st
21:05:36 I think it's a good idea
21:05:37 so I'd propose we cancel the Dec 26th meeting and reschedule the Jan 1st meeting to Thurs Jan 2nd
21:05:39 same in Canada
21:05:41 yeah, I'll be on vacation those days, too
21:05:50 +1
21:05:53 +1
21:05:54 +1
21:05:55 dhellmann: Ditto.
21:06:00 +1
21:06:04 +1
21:06:09 eglynn: well Jan 2 is going to be vacation still for a lot of people including me
21:06:14 likely we can skip both?
21:06:16 +1
21:06:36 * dragondm will be driving home on Jan 2.
21:06:36 jd__: a-ha, I didn't realize that
21:06:52 I'll be out the 20th thru 6th in Ireland :P
21:06:52 so cancel both?
21:06:59 do we have anything pressing that we would need to talk about? I think it's safe to skip.
21:07:01 or move it to the third
21:07:08 I think it's safe to cancel both at this point
21:07:11 unless something big pops up
21:07:20 dhellmann, jd__: cool with me!
21:07:42 #agreed skip meetings during holidays!
21:07:46 :)
21:07:56 \o/
21:07:57 I'll update the wiki to indicate the next meeting
21:08:04 #topic use of the mongodb aggregation framework to serve out /v2/resources (eglynn)
21:08:25 yeah, so basically we're using an aggregation over the meter collection to construct the resources
21:08:36 #link https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_mongodb.py#L638
21:08:38 I think I wrote that
21:08:52 presumably because we need to adorn the resources with first & last timestamps, meter links etc.?
21:08:58 eglynn: yes
21:09:05 not that we do anything with those timestamps currently ...
21:09:09 but we will
21:09:19 I think we expect clients to use the values
21:09:20 once https://bugs.launchpad.net/ceilometer/+bug/1262335 is fixed
21:09:20 oh yeah
21:09:23 Launchpad bug 1262335 in ceilometer "first & last sample timestamps not included in the Resource representation" [Medium,Triaged]
21:09:29 oh, haha
21:09:32 ... but I digress
21:09:48 the main problem is that the mongo aggregation pipeline performs sorting in-memory
21:09:57 in this case operating over a potentially very large collection
21:10:01 the main problem is mongo? :p
21:10:10 #chair eglynn
21:10:11 LOL :)
21:10:12 Current chairs: eglynn jd__
21:10:41 (sorry I'm being kicked out, so I'm letting eglynn chair just in case I go MIA)
21:10:50 cool
21:11:07 ... so the data involved is potentially large, particularly if the GET /v2/resources isn't constrained with query params
21:11:08 Honestly, some of that schema in impl_sqlalchemy makes me sad too...
21:11:11 should we track the timestamp values as the data is recorded or something?
21:11:19 (e.g. to limit to a single tenant)
21:11:25 * jd__ pats dragondm
21:11:37 so in production we're seeing the following error ...
21:11:40 * dragondm feels better.
21:11:48 "terminating request: request heap use exceeded 10% of physical RAM"
21:12:04 ... sad face on the ops guy
21:12:17 turns out mongo is hardcoded to abort any sorts in an aggregation pipeline that'll take more than 10% of memory
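
For context, a rough Python sketch of the kind of pipeline under discussion, using pymongo against the meter collection; the stages are illustrative rather than the exact ones in impl_mongodb.py. The point of failure is the leading $sort: without a backing index, mongo of this era sorts the whole working set in memory, and an unconstrained GET /v2/resources hands it the entire collection.

    # Illustrative sketch only -- not the exact pipeline in impl_mongodb.py.
    import pymongo

    meter = pymongo.MongoClient().ceilometer.meter  # assumes a local mongod

    pipeline = [
        # A narrowing $match stage (project_id, timestamp range, ...) would
        # normally go first; an unconstrained request has none.
        # Default resource sort order quoted later in the discussion:
        # user_id, project_id, timestamp, all descending.
        {'$sort': {'user_id': -1, 'project_id': -1, 'timestamp': -1}},
        # Collapse the sorted samples into one document per resource.
        {'$group': {
            '_id': '$resource_id',
            'project_id': {'$first': '$project_id'},
            'user_id': {'$first': '$user_id'},
            'meter': {'$addToSet': '$counter_name'},
        }},
    ]
    # With no index to walk, the $sort happens in memory and mongo aborts
    # once it crosses the hardcoded RAM threshold described above.
    resources = list(meter.aggregate(pipeline))
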
21:12:26 hehe
21:12:30 oh, wow
21:12:31 ... I did not know that
21:12:34 Lovely.
21:12:44 though fair enough policy I guess
21:12:53 Tho doing a big sort in mem is no bueno to begin with.
21:12:58 so question is ...
21:13:01 dragondm: yep
21:13:26 does anyone recall if the justification for the use of the aggregation framework was just convenience?
21:13:29 or something more?
21:13:44 a quick look at the fossil record on gerrit kinda implies it may have been just convenience
21:13:52 #link https://review.openstack.org/35297
21:13:56 jd__ did that work, right?
21:14:19 if that was the case, /me thinking maybe we should just revert to the equivalent map-reduce for this?
21:14:29 coz in-memory sorting kinda gives me the shivers ;)
21:14:35 makes sense to me
21:14:59 i thought aggregate was a 'new feature' that newer mongo provided? i assume new features are supposed to be better
21:15:22 gordc: I think it was new in 2.0.1 or thereabouts
21:15:25 * dhellmann admires gordc's optimism
21:15:38 :)
21:15:44 apparently another alternative is to avoid the need for sorting in-memory for pipelines by ensuring sufficient indices exist
21:15:48 eglynn: I like the idea too, I was reading about this issue in my quick research for statistics, it did not look good
21:15:51 ... or so google tells me anyway
21:15:58 ildikov_: cool
21:16:08 eglynn: oh, if we can fix it with indices we should try that, too
21:16:34 I think the default sort order for resources is currently user_id, project_id, timestamp all descending
21:16:56 ... not sure if we want the cost of maintaining all those indices on db.meter?
21:17:16 those are the big query filters anyway, right?
21:17:21 maybe not user
21:17:30 I'm not sure why we sort on user, to be honest
21:17:40 yeah me neither
21:17:41 esp. for resources
21:17:50 samples maybe
21:18:21 samples by timestamp first seems more logical, or?
21:18:38 resource, timestamp
21:18:50 or vice versa
21:19:00 yeah
21:19:03 ^ one of those +
21:19:03 eglynn, for timelines, yes
21:19:11 I think we definitely want indexes on resource, though
21:19:24 anyhoo, seems there's broad agreement to ditch the aggregation pipeline unless we can make it work *without* in-memory sorting?
21:19:45 (by skirting round the issue with judicious use of indices ...)
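
A minimal sketch of the index-based workaround being floated: back the default sort order with a compound index, so that a $sort at the front of the pipeline can walk the index instead of sorting in memory (a pipeline $sort can generally only use an index when it precedes the $group/$project stages). The index name is made up for illustration; whether the write-side cost of maintaining it on db.meter is acceptable is exactly the open question raised above.

    import pymongo

    meter = pymongo.MongoClient().ceilometer.meter

    # Compound index matching the default resource sort order; dropping
    # user_id (as discussed above) would make it cheaper to maintain.
    meter.create_index(
        [('user_id', pymongo.DESCENDING),
         ('project_id', pymongo.DESCENDING),
         ('timestamp', pymongo.DESCENDING)],
        name='resource_list_sort_idx',  # hypothetical name
    )
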
21:19:45 does this same issue come up with the statistics API?
21:19:51 or is the dataset small enough there?
21:20:15 dhellmann: I guess there would almost always be narrowing constraints in that case
21:20:29 dhellmann: narrow timestamp, project_id matching etc.
21:20:35 yeah, I wasn't sure if they were narrow enough, but I suppose that depends on the TTL for the sample data
21:20:37 eglynn: out of curiosity, what's the rough number of records we're talking about before the in-memory stuff becomes an issue?
21:20:55 gordc: very good question, and I don't know the answer as yet
21:21:11 dhellmann: ttl stuff was what i was hinting at.
21:21:13 * eglynn doesn't have direct access yet to the deployment on which it came up
21:21:26 ... but I'm hoping to get access soon
21:21:31 dhellmann: if I remember right, statistics now works with map-reduce in mongo
21:21:47 (so I'll be able to do hands-on sizing etc.)
21:22:22 eglynn: I wonder if we could also store the first & last timestamp right in the resource document, to skip the map reduce?
21:22:29 ildikov_: a-ha, yes you're correct
21:22:29 ildikov_: ok, good
21:22:47 resources is the only case where we use an aggregation pipeline
21:23:29 dhellmann: and update the resource collection every time we add a sample to the meter collection?
21:23:43 eglynn: yeah, mongo has an update-in-place operation
21:23:51 I have no idea what that would do to efficiency
21:24:17 dhellmann: ... but we still need to adorn the resource with links to all the associated meters
21:24:35 we did that at one point, are we not doing that any more?
21:24:38 dhellmann: ... so I guess we'd need to store those too in the resource collection
21:24:38 eglynn: we do something similar for sql -- updating resource for every meter.... i notice deadlock errors because of it
21:24:42 maybe that's what this change you linked to was doing
21:24:58 dhellmann: ... I'm pretty sure we're still doing it
21:25:13 actually I think horizon depends on those links being present
21:25:28 lsmola: ^^^ am I recalling that correctly?
21:25:49 eglynn, thinking
21:26:01 eglynn, checking
21:26:02 i.e. horizon walks the resource representation at some point and grabs all the meter links
21:26:06 yeah, we have $addToSet for the meter with some metadata, but not the timestamps
21:26:24 eglynn: if i recall, we build those links on the api side... i don't think we actually associate them in the database (might be wrong here)
21:26:40 impl_mongodb.py line 411
21:27:25 eglynn, yes I use it to get all resources for some meter
21:27:36 eglynn, though that is not very optimal
21:27:54 lsmola: yep, here for example IIUC https://github.com/openstack/horizon/blob/master/openstack_dashboard/dashboards/admin/metering/views.py#L182
21:28:07 dhellmann: whoops, never mind then, guess we do store them in mongo.
21:28:11 but I guess the point is we have a couple of options and we don't need to pick the approach here
21:28:53 eglynn, yes
21:29:02 dhellmann: exactly ... so I'll proceed with some more experimentation
21:29:11 ... see if I can work around simply with indices
21:29:22 ... otherwise use one of the approaches discussed above
21:29:24 sounds good -- is there a bug filed already?
21:29:42 nope, I wanted to pick some brains on the background first
21:29:53 ok
21:30:06 (... the issue only came to light late yesterday)
21:30:46 * jd__ back
21:30:49 eglynn: it was convenience
21:30:50 #action eglynn file a bug to capture all the background on mongo aggregation pipeline memory issues
21:30:55 if you have any other methods go ahead
21:30:58 it was actually convenience AND correctness
21:31:05 [22:22:22] eglynn: I wonder if we could also store the first & last timestamp right in the resource document, to skip the map reduce?
21:31:09 jd__: cool
21:31:15 that won't work dhellmann, that's what it used to be
21:31:18 but as soon as you add a filter, that breaks
21:31:31 a-ha, yes, good point
21:31:39 jd__: ah, right, filtering on something in the metadata
21:31:40 there's no static first & last, right?
21:31:46 nop
21:31:57 got it
21:31:58 so I'm not sure there is an easy way to precompute those values
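
For the record, a sketch of the store-it-at-write-time idea dhellmann floats above: maintain the first/last timestamps and meter links on the resource document as each sample is recorded, instead of recomputing them per query. As jd__ points out, this only serves the unfiltered case; as soon as a query constrains the samples (e.g. on metadata), the precomputed values no longer apply. Field names are assumptions, and the $min/$max update operators need a newer mongo (2.6+) than was typical at the time; older servers would need a read-modify-write.

    import pymongo

    db = pymongo.MongoClient().ceilometer

    def record_resource(sample):
        # Upsert the resource document alongside each recorded sample.
        db.resource.update_one(
            {'_id': sample['resource_id']},
            {'$set': {'project_id': sample['project_id'],
                      'user_id': sample['user_id']},
             # Only valid for unfiltered reads, per the discussion above.
             '$min': {'first_sample_timestamp': sample['timestamp']},
             '$max': {'last_sample_timestamp': sample['timestamp']},
             # The meter links are already maintained with $addToSet;
             # the timestamps would be the new part.
             '$addToSet': {'meter': {'counter_name': sample['counter_name'],
                                     'counter_type': sample['counter_type']}}},
            upsert=True,
        )
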
21:32:17 maybe you could just add a test case that handles the fact MongoDB is going to fail for some queries
21:32:33 return a 542 Sorry MongoDB is there.
21:32:41 good point, we should make sure we're handling that anyway
21:32:41 Heh.
21:32:42 lol jd__
21:32:42 LOL :)
21:32:47 hehe
21:33:07 that links back to the #action eglynn has added anyway
21:33:10 Error 599 Mongo saaaddd..
21:33:12 hahaha
21:33:48 :-)
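
The 542/599 jokes aside, the underlying point stands: the API layer should catch the aborted-sort failure and surface a clean client error rather than an unhandled 500. A minimal sketch, assuming pymongo and a WSME-style ClientSideError similar to what the v2 API already uses; the function name and message are illustrative.

    import pymongo.errors
    from wsme.exc import ClientSideError

    def list_resources(meter, pipeline):
        try:
            return list(meter.aggregate(pipeline))
        except pymongo.errors.OperationFailure as err:
            # e.g. "terminating request: request heap use exceeded 10%
            # of physical RAM"
            raise ClientSideError('query too broad for the storage '
                                  'driver, please narrow it: %s' % err)
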
21:33:49 eglynn: all good on that?
21:34:01 for now I've pointed the ops guy to a work-around to partition the resource listing per-tenant in a bash for loop
21:34:11 so the sky is not going to fall in
21:34:24 ... i.e. we've time to consider this and come up with a well thought-out solution
21:34:39 yep, that's all I got on this, let's move on
21:34:55 #topic promote Sandy Walsh to core? (herndon)
21:35:27 I think we got the candidature but there were not enough votes on that
21:35:31 yeah, I just don't know what the procedure is here, I saw a couple of +1s come through on email, but what happens?
21:35:43 ah, ok then.
21:35:52 well it needs at least 5 +1s from ceilometer-core devs
21:36:03 I was curious about that, okay.
21:36:29 and AFAIK it only got 3
21:36:39 yep, that's what I saw as well.
21:36:49 I'd like to point out that I think sandywalsh_ would make an excellent core once he's increased his review count and patch-landing activity
21:37:06 eglynn: yeah, I think that was where I ended up, too
21:37:28 agreed
21:37:58 yep ... so I wouldn't like this to be a discouragement to him
21:38:32 Myself and thomasem have been working to tame the review queue as well. We could use some faster patch-landing (with appropriate reviews, of course)
21:38:47 undoubtedly
21:38:56 :)
21:38:59 I think we could re-visit the question soon-ish if and when his stats go in the right direction
21:39:22 +1
21:39:38 ok, thanks for the explanation :)
21:39:50 dragondm: thanks! I noticed that ... I've been a bit remiss with reviews myself lately due to the productization work on my plate
21:39:53 +1 on faster patch landing for events btw.
21:40:15 I feel guilty too, I'm trying to review a lot
21:40:30 For visibility, there has been a ton of work behind the scenes over here - we've been working on load testing the drivers and are in the process of trying to clean up the SQLAlchemy schema and what-not.
21:40:35 anyway don't give up, every contribution is appreciated :)
21:40:55 eglynn: thanks on those reviews on some of that event code, tho'.
21:41:00 https://etherpad.openstack.org/p/ceilometer-data-store-scale-testing
21:41:01 thomasem: yep, I suspected as much on the "behind the scenes" work
21:41:42 #topic Open discussion
21:41:54 Yup. and there's a ton more event work coming down the pike soon, too.
21:42:10 cool
21:42:19 yeah, like up to my eyeballs amount of work to do
21:42:20 dragondm: i'm hoping it's after the holidays
21:42:23 eglynn, thanks; just wanted to bring more visibility to it since it's been very time consuming.
21:42:52 I have just one quick question
21:43:13 gordc: yah. I will probably put up a WIP branch for the event pipelines soon, but I want to add more docs to it, which will probably be after the holidays.
21:43:19 Do we have an agreement on how to handle the NotImplementedError-s in the db drivers?
21:43:51 ildikov_: like we do currently, no?
21:44:02 ildikov_: re. David's comments on https://review.openstack.org/62157
21:44:06 dragondm: awesome. i really like the events stuff so feel free to ping me about them.
21:44:17 jd__ yes, I've seen the patch set from you
21:44:43 gordc: Will do. It's gonna be *cool*
21:44:43 eglynn: yes, the question is originating from there
21:44:51 #link https://review.openstack.org/#/c/62157/3/ceilometer/storage/base.py
21:45:01 I didn't check that review
21:45:33 eglynn: did you see https://review.openstack.org/#/c/62636/ then?
21:45:39 *BATTLE IS ON*
21:46:07 i think the abstract stuff is not a good idea now, because we have a lot of drivers doing different stuff, and we might even split them by feature
21:46:08 a-ha, I did not
21:46:18 jd__: I just wanted to be sure that it is the way that we are taking now
21:46:19 so this was a good idea in the early days, but it's getting weird now
21:46:32 ok, I see the context more clearly now
21:47:05 jd__: I agree with those changes
21:47:06 I'll review https://review.openstack.org/62636 tmrw
21:47:13 eglynn: thumbs up
21:48:25 jd__: I just wanted to make it clear before uploading the corrections
21:48:26 * dragondm wonders if we could just have a 'Not Implemented' decorator...
21:48:35 ildikov: and it was a good idea :)
21:48:39 jd__: the right approach there is to put the related methods in a base class, declaring the interface
21:48:53 Might make it easier to introspect which features which drivers support for doc purposes...
21:48:55 then the methods can still be abstract, but a subclass that doesn't inherit the interface doesn't have to implement them
21:48:58 jd__: thanks :)
21:49:22 dhellmann: I wouldn't see the point of having all that code duplication again
21:49:37 what is duplicated?
21:49:42 raise NotImplementedError is
21:49:58 well, the idea is the drivers ought to be implementing those methods :-)
21:50:13 well you have to implement them if it breaks ceilometer not to have them, I agree
21:50:25 but since most parts of Ceilometer handle those cases correctly, there's no need for that in most cases
21:50:43 I think record_metering_data is a good candidate to be abstract for example
21:50:45 ok, I guess I think the real issue is the lack of feature parity in the drivers rather than the implementation
21:50:54 because likely ceilometer will explode if your driver doesn't implement it
21:51:16 dhellmann: yeah that's another issue :)
21:51:16 but ok
21:51:26 if it is an issue, I think we're pretty good actually on that
21:51:35 I mean I think we deal with that the best we can
21:51:38 yeah
21:52:00 Just wondering if anyone has the same problem with IRC in this channel as me. Mine sits silent for several minutes, and all of a sudden dozens of messages pop up at the same time. :(
21:52:20 nope
21:52:23 jd__: would it make sense to have an events interface and then drivers would inherit that interface if they were to support events... if you support events you should probably support all the related methods.
21:52:25 llu-laptop: I've seen that on occasion.
21:52:35 llu-laptop: I get that sometimes, too
21:52:41 less today than other days
21:52:46 I think it's a freenode issue
21:52:46 llu-laptop: er, I usually blame znc
21:53:14 @jd__, now without these methods (throwing NotImplementedError), what happens when it gets called?
21:53:31 tongli: the base class implementation would raise the exception
21:53:56 gordc: yes I think it would ultimately, especially since we could add the ability to use different drivers for different features
21:53:59 @jd__, will the behavior be the same as before from the client perspective?
21:54:20 tongli: yes
21:54:32 jd__, gordc: right, that's what I meant by splitting the base class into different interface classes with related methods
21:54:40 jd__: yeah, i think that'd be cleaner... and then a grouping for alarms as well
21:54:53 @jd__, ok, in that case, it will be nice to have the changes in the patch.
21:54:55 dhellmann: yeah, that's what i thought you meant
21:54:55 dhellmann: ok I didn't get that, I'm all for it totally, with that perspective in mind
21:55:09 +1 for the splitting
21:55:15 gordc: we could also explore aggregation, and have a driver object have several subobjects with the real implementations
21:55:16 dhellmann: Ah... yah. +1
21:55:21 driver.alarms.some_method()
21:55:26 driver.events.some_other_method()
21:55:27 etc.
21:55:34 Also good.
21:55:43 dhellmann: +1
21:55:44 +1 for splitting
21:56:07 Avoids the db layer looking like the one in Nova with n hundred methods...
21:56:08 dhellmann: that sounds cool.
21:56:15 dragondm: exactly
21:56:34 yep, an anti-pattern if ever I saw one!
21:56:42 and it solves the duplicate implementation issue, because the default for the events attribute could be an instance of a single class with NotImplemented implementations
21:57:15 yep
21:57:21 Bingo. And it creates larger scale 'features' (i.e. 'events', 'alarms', etc)
21:57:36 conclusion http://persephonemagazine.com/wp-content/uploads/2013/02/himym-marshall-lily-high-five.gif
21:57:40 we'd have to figure out how much state would need to be shared between the different interface classes, so each driver might also need to pass a "state" object in or something
21:58:06 Have the parent object pass itself to the child's __init__
21:59:08 dragondm: I'd be worried about circular references with that, but that's the general idea
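
A sketch of the design converged on above: split the monolithic driver interface into per-feature classes, give each feature a default implementation whose methods raise NotImplementedError, and compose them as subobjects on the driver so that the duplicated raise disappears and drivers opt in per feature. All names are illustrative, not the ones in the patches under review; dhellmann's shared-state caveat is noted in the comments.

    class EventsInterface(object):
        """Default: events not supported by this driver."""

        def record_events(self, events):
            raise NotImplementedError('Events are not implemented.')

        def get_events(self, event_filter):
            raise NotImplementedError('Events are not implemented.')

    class AlarmsInterface(object):
        """Default: alarms not supported by this driver."""

        def get_alarms(self, **kwargs):
            raise NotImplementedError('Alarms are not implemented.')

    class MongoEvents(EventsInterface):
        def __init__(self, driver):
            # Passing the parent in gives access to shared state, at the
            # cost of the circular reference noted above; a narrower
            # "state" object would avoid that.
            self.driver = driver

        def record_events(self, events):
            # Assumes each event is already a mongo-ready dict.
            self.driver.db.event.insert_many(list(events))

    class Connection(object):
        """A driver composed of per-feature subobjects."""

        def __init__(self, db):
            self.db = db
            self.events = MongoEvents(self)   # supported
            self.alarms = AlarmsInterface()   # not supported here

    # conn.events.record_events(...) works; conn.alarms.get_alarms(...)
    # raises NotImplementedError for the API layer to handle as it does
    # today.
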
21:59:28 next meeting will be the 9th January FWIW
21:59:59 cool, call it a night so?
22:00:00 have a good break, everyone
22:00:13 Likewise! have fun! :)
22:00:18 cheers
22:00:20 yep have fun guys
22:00:28 yep happy not-coding days
22:00:29 and have some rest :)
22:00:32 Merry Christmas, & Happy New Year all!
22:00:35 have a great holiday guys
22:00:36 #endmeeting