21:01:45 <jd__> #startmeeting ceilometer
21:01:45 <openstack> Meeting started Wed Mar 26 21:01:45 2014 UTC and is due to finish in 60 minutes. The chair is jd__. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:49 <openstack> The meeting name has been set to 'ceilometer'
21:01:59 <gordc> o/
21:02:04 <ildikov_> o/
21:02:09 <lsmola> hello
21:02:09 <eglynn> o/
21:02:34 <_nadya_> o/
21:02:55 <jd__> #topic Milestone status icehouse-rc1
21:03:17 <jd__> #link https://launchpad.net/ceilometer/+milestone/icehouse-rc1
21:03:31 <jd__> we only have a few bugs left, so that looks good
21:03:34 <jd__> our target is Friday for rc1
21:03:38 <afarrell> thomas - just trying to find the window your message popped up in :-)
21:03:56 <ildikov_> jd__: currently there is a gate issue, I added a topic about it to the agenda
21:03:58 <gordc> resource-list patch should be ok for review: https://review.openstack.org/#/c/80343/
21:04:05 <jd__> ildikov_: ok
21:04:28 * eglynn would like to target https://bugs.launchpad.net/ceilometer/+bug/1297677 to RC1 also
21:04:29 <uvirtbot> Launchpad bug 1297677 in ceilometer "normal user create alarm with user-id or project-id specified will success instead of return 401" [High,Confirmed]
21:04:46 <gordc> deadlock issue might need to be pushed due to postgres issue... i think the oslo.db code doesn't work for multiple connections on postgres
21:04:47 <jd__> eglynn: targeted
21:04:50 <_nadya_> I have a question about rc1. We have merged the several-workers patch. And there are some problems with it
21:04:56 <eglynn> jd__: thank you sir!
21:05:14 <Alexei_987> gordc: why do you think it's an oslo problem?
21:05:23 <_nadya_> gordc: am I right that we need some more patches to be merged to make everything work right?
21:05:28 <gordc> _nadya_: i added a patch to default it to a single worker: https://review.openstack.org/#/c/83215/
21:05:38 <Alexei_987> gordc: it was working for me with tests on real backends and it surely uses multiple connections
21:06:07 <gordc> Alexei_987: does it work on postgres for you? with multiple connections? it doesn't on my machine.
21:06:29 <Alexei_987> gordc: it was working a month ago.. haven't tried recently
21:06:34 <gordc> and it doesn't work on gate
21:06:43 <Alexei_987> gordc it did :)
21:07:38 <gordc> :) there's something wonky happening. i think it's best to just default to a single worker for now... my patch to explicitly call dispose works but i have no real idea why.
21:07:38 <_nadya_> jd__: gordc: I think we need to determine the list of patches to be merged before rc1 or revert multi-spawning :(
21:08:18 <gordc> _nadya_: we don't necessarily need to revert. we can just default to a single worker so it behaves as before.
21:08:26 <Alexei_987> _nadya_ : I think we should revert in any case
21:08:37 <Alexei_987> workers are useless and don't solve any problem
21:09:22 <Alexei_987> workers are *only* useful in the case of a specific MySQL driver
21:09:37 <Alexei_987> there is a much simpler way to do without them
21:09:41 <gordc> Alexei_987: was just going to say that.... useless isn't exactly true
21:09:56 <jd__> it's not useless if you want to use several cores
21:10:02 <gordc> Alexei_987: what's the simpler way?
21:10:13 <Alexei_987> use the python driver for mysql
21:10:18 <Alexei_987> it works in async mode
21:10:19 <jd__> gordc: having multiple workers is in no way different from having several collectors running, right?
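(Context for the "default it to a single worker" patch mentioned above: the worker count is just a configuration option whose default changes, so deployments that set nothing keep the old single-process behaviour. A minimal sketch of that pattern with oslo.config follows; the option name collector_workers and its help text are illustrative guesses, not necessarily what the actual patch touches.)

    from oslo.config import cfg

    # Illustrative option only -- defaulting to 1 keeps the pre-multi-worker
    # behaviour for backends (notably PostgreSQL) that currently break when
    # several collector processes write concurrently.
    OPTS = [
        cfg.IntOpt('collector_workers',
                   default=1,
                   help='Number of collector worker processes to spawn.'),
    ]

    cfg.CONF.register_opts(OPTS)

    # A service launcher would then fork cfg.CONF.collector_workers
    # processes; operators on MySQL or MongoDB can raise the value, while
    # PostgreSQL deployments are documented to leave it at 1 for now.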
21:10:36 <Alexei_987> why would you use several cores? nginx works perfectly in 1 thread
21:10:41 <gordc> jd__: essentially yes. that's what it's doing.
21:10:44 <jd__> Alexei_987: nginx for the collector?
21:10:58 <jd__> gordc: so it's clearly just a visible effect of something that's broken more fundamentally :/
21:11:00 <Alexei_987> nah.. I mean that true async code doesn't need workers
21:11:22 <Alexei_987> it won't consume 100% of the cpu
21:11:26 <jd__> Alexei_987: so you run nginx on 1 core on your 8 core machines? awesome
21:11:31 <jd__> lol
21:11:33 <Alexei_987> why not?
21:11:45 <Alexei_987> it's enough to fully load the network
21:11:51 <Alexei_987> cpu is not the bottleneck
21:11:51 <jd__> and if your nginx is busy 100% of the time what do you do next?
21:12:09 <Alexei_987> again cpu is not the bottleneck
21:12:17 <Alexei_987> network or database
21:12:25 <jd__> good for you
21:12:26 <Alexei_987> the database will become the bottleneck in our case
21:12:35 <jd__> that doesn't solve our problem
21:12:53 <Alexei_987> our problem is poorly written code for mysql
21:13:11 <jd__> gordc: so I'm kind of worried, was it even working with Havana?
21:13:11 <gordc> Alexei_987: in oslo.db or in ceilometer?
21:13:16 <Alexei_987> ceilometer
21:13:31 <Alexei_987> it's already working at 400 req/s even when the cloud is idle
21:13:45 <gordc> Alexei_987: we're not writing any mysql-specific code.
21:13:55 <_nadya_> so we have 2 options: 1. merge https://review.openstack.org/#/c/83215/ ASAP 2. revert multiple workers. We have no other options. We have no time to repair SQL in 2 days
21:13:56 <gordc> jd__: i'm not sure. i've never tried, to be honest
21:14:00 <Alexei_987> ok sqlalchemy code
21:14:13 <jd__> _nadya_: 1. does not fix the problem, it just hides it
21:14:20 <jd__> and neither does 2
21:14:39 <_nadya_> jd__: yep :D and?
21:15:01 <jd__> if you start 2 ceilometer-collector processes – which has always been a valid use case – you will have the same errors
21:15:21 <jd__> because the SQL driver is probably doing something in the wrong way
21:15:48 <gordc> jd__: agreed. there's something wonky happening which doesn't let postgres work with multiple connections but lets mysql
21:15:54 <Alexei_987> how can the driver affect separate processes?
21:15:56 <jd__> so it needs to be fixed – gordc, what's the bug # if there's actually one?
21:16:15 <gordc> jd__: for the postgres issue? there is none right now
21:16:21 <jd__> I'm ok to try to target it at rc1 if somebody commits to fixing it
21:16:26 <jd__> otherwise it's pointless
21:16:38 <eglynn> jd__: how conservative do we have to be about RC1?
21:16:46 <eglynn> (i.e. treat it as effectively the final opportunity to get the above issues fixed?)
21:16:52 <gordc> jd__: i'm not sure we can fix it for rc1.
21:16:57 <_nadya_> I don't think it's possible to fix in 2 days
21:17:01 <jd__> eglynn: it should be final or that's not fair play
21:17:21 <jd__> ok
21:17:36 <Alexei_987> I propose to revert workers and I'll check this postgres issue :)
21:17:42 <jd__> so let's try to fix it someday, write a release note about that, and we'll backport the fix later
21:17:42 <eglynn> jd__: discussion above sounds more like early-cycle stuff than 11th-hour cusp-of-the-deadline hustle
21:18:26 <gordc> revert or default to single worker? it's essentially the same thing but allows the flexibility to have multiple workers for the mongo/mysql backends, which appear to work fine.
21:18:50 <Alexei_987> the mongo backend only consumes 10-20% of the cpu
21:18:50 <gordc> i'm ok with revert if we want... just presenting an alternative.
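(An aside on the "explicitly call dispose" patch gordc mentions: a classic cause of exactly this symptom is a SQLAlchemy engine created before the workers are forked, so the children share the connections inherited from the parent; PostgreSQL tends to surface the resulting protocol corruption first. Below is a minimal sketch of the dispose-after-fork pattern, on the assumption that this is roughly what the fix does -- the connection URL and structure are placeholders, not the actual collector code.)

    import multiprocessing

    from sqlalchemy import create_engine, text

    # Placeholder connection URL -- substitute a real one to try this out.
    engine = create_engine('postgresql://user:password@localhost/ceilometer')


    def worker():
        # Throw away any pooled connections inherited from the parent
        # process; the pool will lazily open fresh ones owned by this
        # child, so the two processes never share a socket.
        engine.dispose()
        with engine.connect() as conn:
            conn.execute(text('SELECT 1'))


    if __name__ == '__main__':
        procs = [multiprocessing.Process(target=worker) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()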
21:19:09 <gordc> Alexei_987: yeah. mysql is the real issue.
21:19:21 <jd__> and how is reverting going to make things work with 2 collectors running?
21:19:43 <Alexei_987> jd__: it won't, I just don't like adding things that we don't actually need
21:19:57 <Alexei_987> the deadlock issue needs to be fixed separately
21:20:32 <jd__> I don't really care about what you think when it's subjective and not technical
21:20:38 <gordc> the deadlock issue as in this patch: https://review.openstack.org/#/c/80461/
21:20:45 <jd__> I just want this bug fixed, so if reverting is not going to help, there's no point
21:21:11 <jd__> gordc: it depends on something outdated
21:21:28 <gordc> Alexei_987, jd__: i think that patch should be pushed to post rc-1 with the multiple-connection fix.
21:22:00 <jd__> I'm ok to postpone the fix if it's going to take a while; we can document that it's not working
21:22:08 <gordc> i'll red x my patches that should be moved to later.
21:22:08 <_nadya_> +1
21:22:14 <jd__> gordc: could you at least open a bug and document it?
21:22:29 <gordc> will do.
21:22:36 <jd__> I feel bad talking about something we don't even have a record of
21:23:03 <eglynn> so what exactly is *not* going to work with RC1? (under the proposed approach)
21:23:19 <eglynn> I'm a bit lost in these options to revert or default to a single worker etc.
21:23:22 <jd__> eglynn: using PostgreSQL with more than one collector
21:23:24 <eglynn> e.g. are we saying horizontal scaling of the collector is simply not possible currently with sqla?
21:23:30 <jd__> eglynn: yep
21:23:39 <jd__> s/sqla/sqla-using-postgresql/
21:23:48 <eglynn> jd__: a-ha, but just when psql is the backend, not mysql?
21:23:54 <eglynn> k, got it!
21:24:11 <jd__> so we can make postgresql work by setting the number of workers to 1 by default and documenting that postgresql users shouldn't change it or run another copy
21:24:28 <jd__> that's a poor solution, but probably the best we can do in the timeframe
21:24:43 <ildikov_> and we will stay with option 1 for now, right?
21:25:08 <jd__> option 1?
21:25:16 <eglynn> k, so that pretty much keeps sqla/postgres in the not-usable-for-prod category, right?
21:25:26 <jd__> eglynn: clearly
21:25:44 <jd__> *sad face*
21:25:56 <ildikov_> jd__: merge the default-to-1-worker patch
21:25:58 <_nadya_> is mongo still in the prod category?
21:26:19 <eglynn> _nadya_: as long as you're a licensing maverick ;)
21:26:23 <jd__> ildikov_: yes, as a temporary solution :(
21:26:38 <gordc> https://bugs.launchpad.net/ceilometer/+bug/1298073
21:26:39 <uvirtbot> Launchpad bug 1298073 in ceilometer "postgres does not work with multiple connections" [High,Triaged]
21:26:55 <ildikov_> jd__: ok :(
21:26:59 <jd__> thanks gordc
21:27:21 <jd__> #topic Tempest integration
21:27:27 <jd__> _nadya_: enlighten us
21:28:16 <_nadya_> what can I say? We have no Mongo. MySQL and Postgres fail on gating
21:28:49 <_nadya_> and even the notification service was broken by the infra team
21:28:58 <_nadya_> repaired now
21:29:22 <_nadya_> I don't see any way to add notification and pollster tests
21:29:27 <gordc> _nadya_: i assume that merged successfully because we have no tests against notifications?
21:29:40 <_nadya_> gordc: exactly
21:30:09 <_nadya_> so we have tests but we can't make them work
21:30:16 <_nadya_> so sad story :(
21:30:21 <jd__> what's blocking?
21:30:39 <jd__> MySQL shouldn't fail on gating :(
21:30:49 <_nadya_> Ceilometer takes 16 minutes to process a notification
21:30:56 <_nadya_> on mysql
21:31:04 <_nadya_> because of the high load
21:31:26 <jd__> ok
21:31:40 <_nadya_> the solution is to add a Mongo job
21:31:54 <jd__> mongo is not likely to happen soon indeed
21:31:55 <_nadya_> but everybody knows Sean's position
21:31:57 <gordc> _nadya_: or streamline the sql schema
21:32:23 <gordc> i think it's best to make adjustments to the schema but that is not an rc1 item.
21:32:35 <jd__> yes, I don't think we can do much right now
21:32:44 <jd__> let's try to think and prepare for Juno now :(
21:32:45 <eglynn> _nadya_: 16 mins from receiving the notification to persisting the sample?
21:33:01 <eglynn> _nadya_: or 16 mins from the notification being emitted by nova?
21:33:02 <_nadya_> and today I was playing with HBase. It shows the same results as Mongo (good perf) but I can't imagine how to push it on gating too
21:33:37 <_nadya_> eglynn: between nova creating the notification and Ceilometer starting to put it into the database
21:34:49 <eglynn> _nadya_: wow! ... are the Jenkins slaves chronically under-scaled/under-powered?
21:35:30 <_nadya_> jd__: I don't know, do we have any estimates from the TC about Tempest?
21:35:31 <eglynn> as in, are they a complete mismatch to the scale to be expected in a more realistic prod-style deployment
21:35:38 <jd__> _nadya_: not yet
21:36:20 <_nadya_> jd__: maybe we should think about smth like fake tests? I saw a lot of such tests for other projects
21:36:39 <jd__> _nadya_: I don't know, what would that be?
21:36:56 <_nadya_> jd__: that the API returns a list
21:37:15 <eglynn> _nadya_: would that cut the legs out from under the whole idea of Tempest?
21:37:26 <eglynn> _nadya_: (i.e. non-faked integration tests)
21:37:51 <_nadya_> eglynn: ah, say it again in a different way :D
21:38:20 <eglynn> _nadya_: would fake tests make Tempest kinda pointless?
21:38:47 <eglynn> _nadya_: (if the point of Tempest is to test integration without the kind of fakery we use in the unit tests)
21:39:07 <gordc> eglynn: i think it's the way tests are done. it spikes at certain points and from there it's just backlogged because mysql isn't writing fast enough.
21:39:28 <_nadya_> eglynn: yep. but I'm afraid about the TC only
21:39:43 <_nadya_> eglynn: I don't want to have such tests
21:39:55 <eglynn> _nadya_: TC == technical committee?
21:39:56 <_nadya_> eglynn: just thinking about options
21:40:00 <_nadya_> eglynn: yep
21:40:08 <eglynn> _nadya_: k, got it, thanks!
21:40:58 <eglynn> so could the underlying core problem here be simply that the Jenkins slaves on which Tempest runs are just way too under-resourced?
21:41:11 <_nadya_> eglynn: there are several mailing-list threads about Tempest+Ceilometer, I will discuss it with you tomorrow. Maybe I've missed smth
21:41:21 <eglynn> ... i.e. running in bigger instance types on faster h/w would resolve it?
21:41:28 <eglynn> ... /me prolly stating the obvious there
21:42:04 <jd__> eglynn: it seems to be a problem with the SQL driver again
21:42:05 <eglynn> _nadya_: ... actually it's prolly just me still being behind on my email backlog
21:42:08 <jd__> not with the power available
21:42:17 <eglynn> jd__: k
21:42:28 <jd__> let's move on for now
21:42:33 <jd__> #topic Release python-ceilometerclient?
21:42:40 <jd__> I think you're on that eglynn? :)
21:42:45 <eglynn> we'll be good to go with cutting 1.0.10 once the selectable aggregates patch lands
21:42:49 <asalkeld> o/
21:42:49 <eglynn> https://review.openstack.org/80499
21:42:54 <jd__> cool
21:42:58 <eglynn> currently just a test coverage objection
21:43:03 <eglynn> should be quick to resolve
21:43:05 <jd__> hi asalkeld
21:43:07 <eglynn> ... so that 1.0.10 can coincide with RC1
21:43:11 <eglynn> Angus hey!
21:43:14 <jd__> perfect timing :)
21:43:16 <asalkeld> howdy
21:43:41 <jd__> #topic MongoDB licensing issues raised by the failed Marconi graduation
21:43:45 <jd__> not a fun topic again
21:43:53 <eglynn> :)
21:44:09 <eglynn> so the withdrawal mail from the Marconi team put the problem thusly:
21:44:21 <eglynn> "The drivers currently supported by the project don't cover some important cases related to deploying it. One of them solves a licensing issue but introduces a scale issue whereas the other one solves the scale issue and introduces a licensing issue."
21:44:37 <eglynn> the scaling/encumbered DB is mongodb
21:44:43 <eglynn> the non-scaling/unencumbered one is sqlalchemy
21:44:50 <eglynn> #link http://lists.openstack.org/pipermail/openstack-dev/2014-March/030638.html
21:45:03 <eglynn> so the $64k question is ...
21:45:19 <eglynn> is this licensing palaver around mongo also a clear & present danger for our project?
21:45:46 <eglynn> ... given that prod deployment options seem to be narrowing back to mongo
21:46:30 <eglynn> fair enough that Marconi met with disapproval for various other reasons
21:46:40 <ildikov_> do we have a plan B, if Mongo is not an option because of licensing?
21:46:48 <_nadya_> HBase :)
21:46:50 <jd__> I think I don't want to answer that and would wait for the TC to discuss it
21:46:54 <jd__> IANAL
21:47:16 <eglynn> I hasten to translate: "I ain't a lawyer"
21:47:29 <jd__> that's it :)
21:47:35 <ildikov_> jd__: IANAL either, I just thought that it is better to ask this before than after :(
21:47:35 <eglynn> yeah fair enough
21:47:47 <jd__> yeah sure ildikov_ :)
21:47:59 <eglynn> if the concern was obvious FUD-throwing I wouldn't be worried
21:48:15 <ildikov_> jd__: what about the Cassandra BP of yours?
21:48:36 <jd__> ildikov_: it's not going to be more than what I posted for the time being
21:48:58 <_nadya_> Looks like I need to create a demo of HBase to show that it works...
21:49:18 <ildikov_> jd__: ok
21:49:28 <eglynn> jd__: is mongo licensing on the TC agenda in the near term, do you know?
21:49:34 <jd__> eglynn: no idea
21:49:42 <jd__> ttx: ^
21:50:07 <jd__> #topic Periodic checkupdate.sh failures on the gate
21:50:10 <ildikov_> _nadya_: I just tried to ask about every option, I'm not against HBase :)
21:50:16 <jd__> #link https://bugs.launchpad.net/ceilometer/+bug/1297999
21:50:17 <uvirtbot> Launchpad bug 1297999 in ceilometer "gate failures due to tools/config/check_uptodate.sh" [Undecided,In progress]
21:50:22 <jd__> #link https://bugs.launchpad.net/nova/+bug/1268614
21:50:25 <uvirtbot> Launchpad bug 1268614 in cinder "pep8 gating fails due to tools/config/check_uptodate.sh" [Critical,Fix released]
21:50:28 <ildikov_> I included the latest bug report and an earlier one
21:50:50 <_nadya_> ildikov_: :)
21:51:03 <jd__> the failure is due to our inclusion of keystoneclient
21:51:17 <ildikov_> I have a fix with two patch sets, the first fixes the ceilometer.conf.sample, the second one currently removes check_uptodate.sh from tox.ini
21:51:23 <jd__> the thing is that this config file should be built at compile/installation time and not checked into git
21:51:40 <ildikov_> the second one was gordc's suggestion
21:51:44 <jd__> I think it's better to have the gate fail than an out-of-date conf
21:52:00 <jd__> or find a way to stop putting this file into git
21:52:01 <ildikov_> nova just did the same thing and included generate_sample.sh in tox.ini
21:52:26 <jd__> ildikov_: so they're not putting the sample into git now?
21:52:49 <ildikov_> jd__: I meant that they removed check_uptodate.sh from tox
21:52:53 <jd__> ok
21:52:53 <eglynn> surely the gate failing stops everything in its tracks?
21:53:03 <jd__> eglynn: yes
21:53:04 <gordc> eglynn: yep. everything is blocked
21:53:19 <eglynn> would not regenerating until RC1 be acceptable if we know no *new* config is added by any remaining icehouse patches?
21:53:20 <jd__> so either the developers are blocked from time to time, or the users get crappy sample configuration files
21:53:35 <jd__> eglynn: it badly messes up the diff most of the time :(
21:53:36 <ildikov_> jd__: tox.ini is the following in the case of nova: #link https://github.com/openstack/nova/blob/master/tox.ini
21:53:57 <jd__> ildikov_: I see
21:54:09 <jd__> they went backward on that
21:54:24 <jd__> can't blame them here – I'm the one responsible for check_uptodate in the first place
21:54:24 <gordc> this happens more than i like. i like not checking the sample into git but doesn't that mean we're just going to remove it from tox.ini as well?
21:54:48 <jd__> the point was to make people stop putting these default files in git, but it seems it didn't help :D
21:55:29 <gordc> i'd be game for turning it off for rc1 but i guess we'll eventually need to sync it up.
21:55:40 <ildikov_> jd__: lol :)
21:55:57 <jd__> good with me
21:56:03 <eglynn> gordc: yep, /me also thinking a pragmatic short-term disable to drain the queue
21:56:05 <jd__> I vote for whatever works best
21:56:32 * gordc already +2'd the turning-it-off patch.
21:57:01 <jd__> #topic Open discussion
21:57:10 <jd__> we have 3 minutes left if anything
21:57:16 <ildikov_> gordc: cool, thanks :)
21:57:32 <_nadya_> oh
21:57:51 <_nadya_> any good news :)?
22:00:06 <jd__> doesn't look so :)
22:00:09 <jd__> not yet at least
22:00:10 <ildikov_> _nadya_: it seems that the good news is that there is no more bad news :)
22:00:23 <eglynn> _nadya_: ... well Ireland beat France at the rugby, does that count as good news? ;)
22:00:39 <_nadya_> eglynn: hehe :)
22:00:41 * jd__ is not sure
22:00:43 <ildikov_> eglynn: \o/ :)
22:01:14 <eglynn> ... /me had to get the dig in somehow ;)
22:01:26 <eglynn> ... 20 years of hurt and all that
22:01:42 <jd__> #endmeeting
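(Reference note for the checkupdate.sh topic above: in icehouse-era OpenStack projects the sample-config check was typically wired into the pep8 tox environment, so "turning it off" means deleting one line from tox.ini and regenerating ceilometer.conf.sample by hand when options change. Roughly, and from memory -- the exact contents of ceilometer's tox.ini may differ:)

    # Approximation of the pep8 environment under discussion; ildikov_'s
    # second patch set removes the check_uptodate.sh line so that an
    # out-of-date sample config no longer blocks the gate.
    [testenv:pep8]
    commands =
        flake8
        {toxinidir}/tools/config/check_uptodate.sh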