15:00:50 #startmeeting ceilometer
15:00:50 Meeting started Thu Sep 11 15:00:50 2014 UTC and is due to finish in 60 minutes. The chair is eglynn. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:51 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:54 The meeting name has been set to 'ceilometer'
15:01:10 hey folks, who is around for the ceilo weekly?
15:01:13 o/
15:01:14 o/
15:01:16 o/
15:01:41 #topic Juno-3 status and FFEs
15:01:52 o/
15:01:55 people taking a rest after j-3 :)
15:01:55 so we've made some progress on the FFEs :)
15:02:09 o/
15:02:39 nealph's pass-event-format has landed with a bunch of +1s from the PaaS side of the house, nice work! :)
15:02:56 eglynn: thx!
15:03:06 o/
15:03:29 the main resource renormalization patch for bigger-data-sql is safely landed
15:03:44 so the outstanding one there is the sql-a ORM->core switch
15:04:00 +1
15:04:04 o/
15:04:17 that's great news
15:04:24 cdent has been doing some detailed measurements of the performance benefits of the ORM switchout using rally
15:04:45 thanks to all those who've helped test out the performance patches.
15:05:26 gordc: agreed :)
15:05:33 the stack of variables involved in any performance test makes it hard to be absolutely sure about the details of the improvement, but it's definitely an improvement
15:06:10 cdent: good to hear.
15:06:29 cdent, gordc: I'm thinking we should declare victory on it sooner rather than later, as the window of opportunity to land this may start closing soon
15:06:45 I think that's fair and safe.
15:06:48 o/
15:07:07 eglynn: you mean the sqla-core patch or the tc gap requirement?
15:07:19 ... so I think I'll go ahead and approve, and we can close out quantifying the performance gain in parallel
15:07:30 eglynn: gotcha... i think that's safe.
15:08:09 gordc: mainly the sqla-core patch in the short term
15:08:24 gordc: ... but in the longer term also the overall aggregated benefit of all the sql-a improvements made during Juno
15:08:34 eglynn: makes sense to me.
15:08:45 (as we're likely to have to make an end-of-cycle progress report to the TC)
15:08:47 cool
15:09:06 the last outstanding FFE is ipmi-support
15:09:17 sorry to shout out here, the ipmi patches https://review.openstack.org/#/q/project:openstack/ceilometer+topic:bp/ipmi-support,n,z would appreciate more reviews
15:09:18 still under active review
15:09:51 llu-laptop: yep, it feels like it's getting close, gordc has added a bunch of feedback on the latter patch
15:10:14 llu-laptop: i gave it a quick look... i'm not really familiar enough to approve it.
15:10:26 llu: I'll get to it after the meeting as well
15:10:36 cool, thanks folks
15:10:45 eglynn, gordc: thx. Sorry gzhai isn't around here now, will notify him tomorrow morning
15:10:56 nsaje: thanks for the review
15:11:02 llu-laptop: thank you!
15:11:25 so other than those 3 FFEs, we've also got a bunch of bugs still targeted at RC1
15:11:27 https://launchpad.net/ceilometer/+milestone/juno-rc1
15:12:02 I'm going to start bumping the non-critical ones that look like they won't make it
15:12:32 what's the status of bug https://bugs.launchpad.net/bugs/1337761?
15:12:35 Launchpad bug 1337761 in ceilometer "significant performance degradation when ceilometer middleware for swift proxy uses" [High,In progress]
15:12:42 it seems there's no patch for that?
15:12:45 so if you've any further fixes up your sleeves, please get them proposed soon :)
15:12:56 cdent: can you speak to that one? ^^^
15:13:30 i just got a patch in... 30hr merge time
15:13:37 ouch!
15:13:49 there was a patch in progress on that and it stalled out. nsaje and DinaBelova gave lots of good feedback that the author didn't appear to respond to
15:14:02 i'm just happy i didn't need to recheck
15:14:05 cdent: ok, it doesn't look like it's likely to get traction by RC1
15:14:11 no
15:14:19 sounds like a candidate for bumping, sadly
15:14:36 just to clarify the endgame for RC1 ...
15:14:50 I think I mis-spoke at this meeting last week
15:15:01 I think one of the reasons it stalled is because several people said "we should totally change that and not even have it in the ceilometer package"
15:15:25 until RC1, everything just needs to be landed on master, no need for backporting
15:15:43 then the proposed/juno branch will be cut, based off the juno-rc1 tag
15:16:12 from RC1 on, any patches for RCn (n>1) will need to be landed on master first then backported to proposed/juno
15:16:26 meanwhile master is opened for kilo development
15:16:31 good approach!
15:16:59 so no close window on master branch?
15:17:08 so expect a lot of scrutiny from the rel mgr for any post-RC1 patches, definitely in our interest to get as much as possible landed for RC1
15:17:12 is there a date for RC1 to be cut or is that at the whim of ttx
15:17:54 25th of Sep if I'm not mistaken
15:18:04 https://wiki.openstack.org/wiki/Juno_Release_Schedule
15:18:13 thanks
15:18:22 cdent: no, it's when the RC1 buglist is empty
15:18:27 cdent: not a set date, it's decided by agreement with PTLs, likely to be EoW next week
15:18:43 thanks ttx :)
15:18:49 ideally in the second half of September, yes
15:19:10 likely not if there's a 30hr merge time :)
15:20:01 is it my imagination, or has that merge time been slowly reducing since the juno-3 gate rush?
15:20:16 ... me wonders if 30hrs was an outlier
15:20:38 eglynn: you don't want to look at zuul
15:20:59 gordc: ... it generally makes my eyes bleed :(
15:21:17 * cdent loves to look at zuul
15:22:39 * eglynn hates the feeling of deflation when a nearly-complete verify is cancelled and all jobs for that patch are reset to "queued"
15:22:51 ... so close, and yet so far
15:23:40 OK, to sum up ... overall I think we're in relatively good shape on the FFE BPs, so we need to switch focus a bit to bugs over the coming week
15:23:47 shall we move on?
15:23:56 #topic TSDaaS/gnocchi status
15:23:56 so to clarify: master is not kilo until rc>=2?
15:24:20 is the pasta-chef around, I wonder?
15:25:01 cdent: no, master == kilo from RC1 onwards
15:25:16 thanks eglynn
15:25:29 I will start the work around the ceilometer dispatcher
15:25:30 eglynn: we have an internal meeting
15:25:45 cdent: there may not even be an RC2 if we don't need it (RCn with n>=2 is on-demand)
15:25:59 gordc: cool enough, let's punt gnocchi so
15:26:06 #topic Tempest status
15:26:42 this one may be short also, as DinaBelova is officially on vacation IIUC
15:26:58 i think we're going to need to start reenabling tests post juno given the state of the gate.
15:27:22 gordc: do you mean the nova notification test that was skipped?
15:27:43 #link https://review.openstack.org/115212
15:27:59 eglynn: yeah. seems like there are bigger issues... i'm not sure there's anything specifically wrong with our notification test though
15:28:13 doesn't seem like it based on memory/cpu usage of ceilometer services
15:28:45 gordc: ... yeah, /me notes Dina's comment "all the gate slaves eating kind of 30-40% of RAM on them (unfortunately, ZooKeeper among them :))"
15:28:52 #link http://lists.openstack.org/pipermail/openstack-dev/2014-September/045394.html
15:29:04 ^^ memory usage issues
15:29:42 ... that ZK usage mentioned is orthogonal to our central agent scale out, right?
15:30:29 the devstack stuff I'm working on right now is related to "larger issues" of processes not existing for random reasons, e.g. the notification agent being unavailable
15:30:30 not sure if that is related here
15:30:42 eglynn, there is the issue in the fact that on all the slaves ZK is installed and doing just nothing, eating the RAM - in production it's ok, in the gate - not so much
15:30:46 but there's a huge slew of changes related to memory and cpu usage in devstack queued up
15:31:08 that's not the central agent HA issue or anything like that - that's an infra issue
15:31:11 i'm not sure how decisions are made to pull in dependencies... technically the central agent scale out isn't being enabled in gates
15:31:12 the zk is apparently leakage from other images and is supposed to be fixed soonish
15:31:19 and ZK is not the only service there
15:31:27 (according to discussion Tuesday night)
15:31:37 gordc, ++
15:31:54 * DinaBelova found the laptop :)
15:32:01 DinaBelova: a-ha, so that ZK installation in the jenkins slaves would have pre-dated the ceilometer central agent's usage of ZK via tooz, right?
15:32:29 gordc: yeah that was my working assumption also
15:32:29 eglynn: central agent tests don't really use either tooz or ZK
15:32:36 eglynn, for the gate we can even use memcached
15:32:45 not such a resource-eating thing in the gate
15:32:48 nsaje ++
15:32:53 we're using tooz
15:32:59 not the ZK directly
15:33:06 DinaBelova, nsaje: cool, got it
15:33:20 DinaBelova: I'm guessing coordination_url is empty in devstack (as it's the default), no?
15:33:32 nsaje, I'm 100% sure yes
15:33:37 o/
15:33:46 no coordination url set there for sure
15:33:49 jd__ ++
15:33:51 oops
15:33:52 nsaje: I guess the longer term goal would be to enable central agent scale out in the gate also?
15:33:55 I meant hello :)
15:34:03 eglynn: definitely
15:34:08 nsaje: cool
15:34:19 eglynn, of course - but with memcached for the first time I guess
15:34:23 eglynn: but we could easily use memcached
15:34:25 eglynn: +1, be good to have it enabled and verified.
15:34:41 yeap, agreed
15:34:44 nsaje: yeah memcached might be better... it works fine locally for me too.
15:34:50 as we won't be doing functional testing with simulating agent crashes etc. in the gate
15:34:59 so no worry about backend performance
15:35:07 gordc, ZK works fine as well :) it's just not the gating thing :)
15:35:24 nsaje: still, if we're recommending tooz/zk in production, that's what we should have running in the gate also, right?
15:35:25 DinaBelova: agreed
15:35:47 eglynn, I propose periodic jobs in this case probably
15:35:48 (even if we don't explicitly test the scale in/out)
15:35:48 shouldn't be a problem to have ZK in the gate, it's already installed at least
15:35:56 eglynn: ok, good point
15:35:58 we use it for tooz testing
15:36:04 jd__, well... that's actually the problem
15:36:18 jd__ - did you see RAM overusage thread?
15:36:30 I saw it but didn't read it through
15:36:51 jd__: see Dina's comment on https://review.openstack.org/115212 ... "on all the gate slaves eating kind of 30-40% of RAM on them (unfortunately, ZooKeeper among them :))"
"on all the gate slaves eating kind of 30-40% of RAM on them (unfortunately, ZooKeeper among them :))" 15:36:54 jd__, it turned out that ZK is one of the things installed to all the gate slaves that eat too much RAM 15:37:13 too bad 15:37:16 it was a big noise around it previous week 15:37:27 jd__ - well, this overusage is not only becasue of ZK 15:37:37 but it's among of the 'bad' services 15:38:04 ok, so it sounds like the ZK issue is something we'll need to solve in order for the ceilo central agent to be run "realistically" in the gate 15:38:07 the problem is that currently it's installed to all the slaves no matter if we need it for the testing or not 15:38:15 eglynn, yes, indees 15:38:18 indeed* 15:38:45 jd__, to learn more about that RAM overusage research you may ping infra folks 15:38:52 there's some work in progress to narrow the number of workers that a lot of services are using which will make room for things like zk 15:38:59 DinaBelova: which openstack service is actually use ZK currently in gate? 15:39:11 llu-laptop, it is installed to test tooz 15:39:37 nova might be 15:39:53 service groups? 15:39:57 cdent, yes, I hope so as well 15:40:38 eglynn: yes 15:41:40 cdent: just fyi, i enabled multi workers for the collector to verify multi-worker functionality works... but kept it at half # of core 15:41:57 ok, so we're not going to solve the problem here & now ... but was there a sense on that discussion that ZK is gonna ne removed from the mix by the infra folks? 15:42:02 * cdent nods at gordc 15:42:06 is gonna *be 15:42:22 eglynn: somebody was looking into it to "fix the leakage" 15:42:26 so that it was only there where it was needed 15:42:29 ... i.e. if it's considered one of the 'bad' services 15:42:38 cdent: a-ha, ok 15:42:47 eglynn, it was the beginning of this discussion when I saw it... 15:42:54 so I'm not sure about the result 15:43:12 right-o, all good info ... seems like the discussion/investigation is still active 15:43:13 I guess cdent is closer to the truth here than I 15:43:33 I'm just reporting what I heard in os-qa, but there wasn't any timetable 15:43:46 eglynn, yeah... because nobody wants to remove innocent package there :) 15:44:13 DinaBelova: yeah, that would be unjust :) 15:44:37 cdent, I was lurking the infra channel (my irc client was ringing about words like ZK, etc.) but it was ~1am in my time zone that time so I just left it 15:45:06 Yeah, seems like all the action for those guys in the late or early hours 15:45:43 yeap, unfortunate timing from the European perspective 15:45:50 since jd__ is back can we get a gnocchi update before we run out? 15:46:00 jd__: BTW we punted on the gnocchi topic earlier, do you want to circle back to that? 15:46:06 cdent: snap! 
15:45:50 since jd__ is back can we get a gnocchi update before we run out?
15:46:00 jd__: BTW we punted on the gnocchi topic earlier, do you want to circle back to that?
15:46:06 cdent: snap! :)
15:46:07 jinx
15:46:12 sure, I don't have much to say though
15:46:28 I've resumed my work on the archive policy today
15:46:37 and I'm planning to continue and get this merged ASAP
15:46:41 eglynn: I'll need your review on that
15:46:49 I need to also continue to discuss with you about InfluxDB
15:46:51 jd__: I'll take another look at those patches before EoD today
15:47:12 the collector dispatcher job should be handled by sileht, fingers crossed :)
15:47:36 we also need to check some things about the result fetch part
15:47:46 I think there's something wrong with how we merge data from different timeseries
15:47:51 (we already discussed that but didn't make any change)
15:48:05 and once that's done we should be pretty good towards having a Juno release
15:48:19 (all what I just said I mean :)
15:48:20 sileht: can you sync up with ildikov on that dispatcher work
15:48:22 ?
15:49:52 jd__: "the result fetch part" is about the measurements we get from multiple entities, or?
15:50:12 nope, it's about what I'd call aggregation
15:50:30 it's that if we have results spanned over multiple archives for an entity (with different granularities), the current code merges them
15:50:35 and the results are kinda "weird" in the end
15:50:45 jd__: a-ha, got it, thanks
15:50:47 there's no technical challenge, just a correct design to pick
15:51:02 and change the code to reflect what we think is right
15:51:22 I'll probably run a mail thread for that when I get to it
15:51:28 sileht, also idegtiarov wanted to work on the dispatcher - he mentioned that ~2 weeks ago at the meeting - he was on vacation last week, so the plan was to start work on it this week or the next
15:51:30 (when I don't know I ask :-)
15:51:53 jd__: yeah, a thread on that would be good, thanks!
15:53:25 DinaBelova: yep, so sileht, idegtiarov & ildikov will need to coordinate their efforts on that to avoid any duplication of effort
15:53:35 eglynn, yeah, +1
15:54:56 #topic open discussion
15:55:15 jd__ has nominated two new ceilometer cores in the past couple days :)
15:55:29 #link http://lists.openstack.org/pipermail/openstack-dev/2014-September/045537.html
15:55:35 #link http://lists.openstack.org/pipermail/openstack-dev/2014-September/045734.html
15:55:53 applause :)
15:56:00 vrovachev: indeed :)
15:56:02 yeah, +1 on both of those
15:56:17 thanks for the nomination jd__ !
15:56:20 do I get a vote or am I only able to unofficially say "woot!"?
15:56:34 if you haven't done so yet, please consider chiming in on the ML by Monday next
15:57:01 cdent: all w00ts will be appreciated I'm sure :)
15:57:10 excellent
15:58:35 anyone got anything else they'd like to chat about?
15:58:49 nope
15:58:56 right-o, let's call this a wrap so
15:59:06 ... thanks all for a productive meeting! :)
15:59:10 #endmeeting ceilometer