21:01:45 #startmeeting ceilometer
21:01:45 Meeting started Wed Mar 26 21:01:45 2014 UTC and is due to finish in 60 minutes. The chair is jd__. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:49 The meeting name has been set to 'ceilometer'
21:01:59 o/
21:02:04 o/
21:02:09 hello
21:02:09 o/
21:02:34 <_nadya_> o/
21:02:55 #topic Milestone status icehouse-rc1
21:03:17 #link https://launchpad.net/ceilometer/+milestone/icehouse-rc1
21:03:31 we only have a few bugs left, so that looks good
21:03:34 our target is Friday for rc1
21:03:38 thomas - just trying to find the window your message popped up in :-)
21:03:56 jd__: there is currently a gate issue, I added a topic about it to the agenda
21:03:58 the resource-list patch should be ok for review: https://review.openstack.org/#/c/80343/
21:04:05 ildikov_: ok
21:04:28 * eglynn would like to target https://bugs.launchpad.net/ceilometer/+bug/1297677 to RC1 also
21:04:29 Launchpad bug 1297677 in ceilometer "normal user create alarm with user-id or project-id specified will success instead of return 401" [High,Confirmed]
21:04:46 the deadlock issue might need to be pushed due to the postgres issue... i think the oslo.db code doesn't work for multiple connections on postgres
21:04:47 eglynn: targeted
21:04:50 <_nadya_> I have a question about rc1. We have merged the several-workers patch. And there are some problems with it
21:04:56 jd__: thank you sir!
21:05:14 gordc: why do you think it's an oslo problem?
21:05:23 <_nadya_> gordc: am I right that we need some more patches to be merged to make everything work right?
21:05:28 _nadya_: i added a patch to default it to a single worker: https://review.openstack.org/#/c/83215/
21:05:38 gordc: it was working for me with tests on real backends and it surely uses multiple connections
21:06:07 Alexei_987: does it work on postgres for you? with multiple connections? it doesn't on my machine.
21:06:29 gordc: it was working a month ago.. haven't tried recently
21:06:34 and it doesn't work on the gate
21:06:43 gordc it did :)
21:07:38 :) there's something wonky happening. i think it's best to just default to a single worker for now... my patch to explicitly call dispose works but i have no real idea why.
21:07:38 <_nadya_> jd__: gordc: I think we need to determine the list of patches to be merged before rc1 or revert multi-spawning :(
21:08:18 _nadya_: we don't necessarily need to revert. we can just default to a single worker so it behaves as before.
21:08:26 _nadya_ : I think we should revert in any case
21:08:37 workers are useless and don't solve any problem
21:09:22 workers are *only* useful in the case of a specific MySQL driver
21:09:37 there are much simpler ways to do without them
21:09:41 Alexei_987: was just going to say that.... useless isn't exactly true
21:09:56 it's not useless if you want to use several cores
21:10:02 Alexei_987: what's the simpler way?
21:10:13 use the python driver for mysql
21:10:18 it works in async mode
21:10:19 gordc: having multiple workers is in no way different from having several collectors running, right?
21:10:36 why would you use several cores? nginx works perfectly in 1 thread
21:10:41 jd__: essentially yes. that's what it's doing.
21:10:44 Alexei_987: nginx for the collector?
21:10:58 gordc: so it's clearly just a visible effect of something that's broken more fundamentally :/
21:11:00 nah.. I mean that true async code doesn't need workers
21:11:22 it won't consume 100% of the cpu
21:11:26 Alexei_987: so you run nginx on 1 core on your 8 core machines? awesome
21:11:31 lol
21:11:33 why not?
21:11:45 it's enough to fully load the network
21:11:51 cpu is not the bottleneck
21:11:51 and if your nginx is busy 100% of the time what do you do next?
21:12:09 again, cpu is not the bottleneck
21:12:17 network or database
21:12:25 good for you
21:12:26 the database will become the bottleneck in our case
21:12:35 that doesn't solve our problem
21:12:53 our problem is poorly written code for mysql
21:13:11 gordc: so I'm kind of worried, was it even working with Havana?
21:13:11 Alexei_987: in oslo.db or in ceilometer?
21:13:16 ceilometer
21:13:31 it's already handling 400 req/s even when the cloud is idle
21:13:45 Alexei_987: we're not writing any mysql-specific code.
21:13:55 <_nadya_> so we have 2 variants: 1. merge https://review.openstack.org/#/c/83215/ ASAP 2. revert multiple workers. We have no more variants. We have no time to repair SQL in 2 days
21:13:56 jd__: i'm not sure. i've never tried, to be honest
21:14:00 ok, sqlalchemy code
21:14:13 _nadya_: 1. does not fix the problem, it just hides it
21:14:20 and neither does 2
21:14:39 <_nadya_> jd__: yep :D and?
21:15:01 if you start 2 ceilometer-collector processes – which has always been a valid use case – you will have the same errors
21:15:21 because the SQL driver is probably doing something the wrong way
21:15:48 jd__: agreed. there's something wonky happening which doesn't let postgres work with multiple connections but lets mysql
21:15:54 how can the driver affect separate processes?
21:15:56 so it needs to be fixed – gordc what's the bug # if there actually is one?
21:16:15 jd__: for the postgres issue? there is none right now
21:16:21 I'm ok to try to target it at rc1 if somebody commits to fixing it
21:16:26 otherwise it's pointless
21:16:38 jd__: how conservative do we have to be about RC1?
21:16:46 (i.e. treat it as effectively the final opportunity to get the above issues fixed?)
21:16:52 jd__: i'm not sure we can fix it for rc1.
21:16:57 <_nadya_> I don't think it's possible to fix in 2 days
21:17:01 eglynn: it should be final or that's not fair play
21:17:21 ok
21:17:36 I propose to revert workers and I'll check this postgres issue :)
21:17:42 so let's try to fix it someday, write a release note about that, and we'll backport the fix later
21:17:42 jd__: the discussion above sounds more like early-cycle stuff than 11th-hour cusp-of-the-deadline hustle
21:18:26 revert or default to a single worker? it's essentially the same thing but allows flexibility to have multiple workers for the mongo/mysql backends, which appear to work fine.
21:18:50 the mongo backend only consumes 10-20% of the cpu
21:18:50 i'm ok with a revert if we want... just presenting an alternative.
21:19:09 Alexei_987: yeah. mysql is the real issue.
21:19:21 and how is reverting going to make things work with 2 collectors running?
21:19:43 jd__: it won't, I just don't like adding things that we don't actually need
21:19:57 the deadlock issue needs to be fixed separately
21:20:32 I don't really care about what you think when it's subjective and not technical
21:20:38 the deadlock issue as in this patch: https://review.openstack.org/#/c/80461/
21:20:45 I just want this bug fixed, so if reverting is not going to help, there's no point
21:21:11 gordc: it depends on something outdated
21:21:28 Alexei_987, jd__: i think that patch should be pushed to post-rc1 along with the multiple connection fix.
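The postgres failure gordc describes above matches a well-known SQLAlchemy pitfall: pooled connections created before forking end up shared between the parent and its worker children. The sketch below only illustrates why an explicit dispose() call can help; it assumes a hypothetical spawn_collector_workers() helper and is not the actual ceilometer or oslo.db code under review.

```python
# Illustrative sketch only (not the ceilometer/oslo.db implementation).
# When a service forks workers after creating an engine, each child inherits
# the parent's pooled sockets; two processes then talk over the same
# PostgreSQL connection and the protocol gets corrupted. Calling dispose()
# in the child forces SQLAlchemy to open fresh, per-process connections.
import os

from sqlalchemy import create_engine

engine = create_engine("postgresql://ceilometer:secret@localhost/ceilometer")


def spawn_collector_workers(num_workers, run_collector):
    """Fork num_workers children, each running run_collector(engine)."""
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:  # child process
            engine.dispose()  # drop inherited connections before first use
            run_collector(engine)
            os._exit(0)
```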
21:22:00 I'm ok to postpone the fix if it will take long; we can document that it's not working
21:22:08 i'll red-x my patches that should be moved to later.
21:22:08 <_nadya_> +1
21:22:14 gordc: could you at least open a bug and document it?
21:22:29 will do.
21:22:36 I feel bad talking about something we don't even have a record of
21:23:03 so what exactly is *not* going to work with RC1? (under the proposed approach)
21:23:19 I'm a bit lost in these options to revert or default to a single worker etc.
21:23:22 eglynn: using PostgreSQL with more than one collector
21:23:24 e.g. are we saying horizontal scaling of the collector is simply not possible currently with sqla?
21:23:30 eglynn: yep
21:23:39 s/sqla/sqla-using-postgresql/
21:23:48 jd__: a-ha, but just when psql is the backend, not mysql?
21:23:54 k, got it!
21:24:11 so we can make postgresql work by setting the number of workers to 1 by default and documenting that postgresql users shouldn't change that and shouldn't run another copy
21:24:28 that's a poor solution, but probably the best we can do in the timeframe
21:24:43 and we will stay with option 1 for now, right?
21:25:08 option 1 ?
21:25:16 k, so that pretty much keeps sqla/postgres in the not-usable-for-prod category, right?
21:25:26 eglynn: clearly
21:25:44 *sad face*
21:25:56 jd__: merge the default-to-1-worker patch
21:25:58 <_nadya_> is mongo still in the prod category?
21:26:19 _nadya_: as long as you're a licensing maverick ;)
21:26:23 ildikov_: yes, as a temporary solution :(
21:26:38 https://bugs.launchpad.net/ceilometer/+bug/1298073
21:26:39 Launchpad bug 1298073 in ceilometer "postgres does not work with multiple connections" [High,Triaged]
21:26:55 jd__: ok :(
21:26:59 thanks gordc
21:27:21 #topic Tempest integration
21:27:27 _nadya_: enlighten us
21:28:16 <_nadya_> what can I say? We have no Mongo. MySQL and Postgres fail on the gate
21:28:49 <_nadya_> and even the notification service was broken by the infra team
21:28:58 <_nadya_> repaired now
21:29:22 <_nadya_> I don't see any ability to add notification and pollster tests
21:29:27 _nadya_: i assume that merged successfully because we have no tests against notifications?
21:29:40 <_nadya_> gordc: exactly
21:30:09 <_nadya_> so we have tests but we can't make them work
21:30:16 <_nadya_> so, a sad story :(
21:30:21 what's blocking?
21:30:39 MySQL shouldn't fail on the gate :(
21:30:49 <_nadya_> Ceilometer takes 16 minutes to process a notification
21:30:56 <_nadya_> on mysql
21:31:04 <_nadya_> because of high load
21:31:26 ok
21:31:40 <_nadya_> the solution is to add a Mongo job
21:31:54 mongo is not likely to happen soon indeed
21:31:55 <_nadya_> but everybody knows Sean's position
21:31:57 _nadya_: or streamline the sql schema
21:32:23 i think it's best to make adjustments to the schema but that is not an rc1 item.
21:32:35 yes, I don't think we can do much right now
21:32:44 let's try to think and prepare for Juno now :(
21:32:45 _nadya_: 16 mins from receiving the notification to persisting the sample?
21:33:01 _nadya_: or 16 mins from the notification being emitted by nova?
21:33:02 <_nadya_> and today I was playing with HBase. It shows the same results as Mongo (good perf) but I can't imagine how to push it onto the gate too
21:33:37 <_nadya_> eglynn: between nova creating the notification and Ceilometer starting to put it into the database
21:34:49 _nadya_: wow! ... are the Jenkins slaves chronically under-scaled/under-powered?
21:35:30 <_nadya_> jd__: I don't know, do we have any estimates from the TC about Tempest?
21:35:31 as in, are they a complete mismatch to the scale to be expected in a more realistic prod-style deployment?
21:35:38 _nadya_: not yet
21:36:20 <_nadya_> jd__: maybe we should think about smth like fake tests? I saw a lot of such tests for other projects
21:36:39 _nadya_: I don't know, what would that be?
21:36:56 <_nadya_> jd__: tests that the API returns a list
21:37:15 _nadya_: would that cut the legs out from under the whole idea of Tempest?
21:37:26 _nadya_: (i.e. non-faked integration tests)
21:37:51 <_nadya_> eglynn: ah, say it again in a different way :D
21:38:20 _nadya_: would fake tests make Tempest kinda pointless?
21:38:47 _nadya_: (if the point of Tempest is to test integration without the kind of fakery we use in the unit tests)
21:39:07 eglynn: i think it's the way the tests are done. it spikes at certain points and from there it's just backlogged because mysql isn't writing fast enough.
21:39:28 <_nadya_> eglynn: yep. but I'm only worried about the TC
21:39:43 <_nadya_> eglynn: I don't want to have such tests
21:39:55 _nadya_: TC == technical committee?
21:39:56 <_nadya_> eglynn: just thinking about variants
21:40:00 <_nadya_> eglynn: yep
21:40:08 _nadya_: k, got it, thanks!
21:40:58 so could the underlying core problem here be simply that the Jenkins slaves on which Tempest runs are just way too under-resourced?
21:41:11 <_nadya_> eglynn: there are several mailing list threads about Tempest+Ceilometer, I will discuss with you tomorrow. Maybe I've missed smth
21:41:21 ... i.e. running in bigger instance types on faster h/w would resolve it?
21:41:28 ... /me prolly stating the obvious there
21:42:04 eglynn: it seems to be a problem with the SQL driver again
21:42:05 _nadya_: ... actually it's prolly just me still being behind on the email backlog
21:42:08 not with the power available
21:42:17 jd__: k
21:42:28 let's move on for now
21:42:33 #topic Release python-ceilometerclient?
21:42:40 I think you're on that eglynn? :)
21:42:45 we'll be good to go with cutting 1.0.10 once the selectable aggregates patch lands
21:42:49 o/
21:42:49 https://review.openstack.org/80499
21:42:54 cool
21:42:58 currently just a test coverage objection
21:43:03 should be quick to resolve
21:43:05 hi asalkeld
21:43:07 ... so that 1.0.10 can coincide with RC1
21:43:11 Angus hey!
21:43:14 perfect timing :)
21:43:16 howdy
21:43:41 #topic MongoDB licensing issues raised by the failed Marconi graduation
21:43:45 not a fun topic again
21:43:53 :)
21:44:09 so the withdrawal mail from the Marconi team put the problem thusly:
21:44:21 "The drivers currently supported by the project don't cover some important cases related to deploying it. One of them solves a licensing issue but introduces a scale issue whereas the other one solves the scale issue and introduces a licensing issue."
21:44:37 the scaling/encumbered DB is mongodb
21:44:43 the non-scaling/unencumbered one is sqlalchemy
21:44:50 #link http://lists.openstack.org/pipermail/openstack-dev/2014-March/030638.html
21:45:03 so the $64k question is ...
21:45:19 is this licensing palaver around mongo also a clear & present danger for our project?
21:45:46 ... given that prod deployment options seem to be narrowing back to mongo
21:46:30 fair enough that Marconi met with disapproval for various other reasons
21:46:40 do we have a plan B, if Mongo is not an option because of licensing?
21:46:48 <_nadya_> HBase :)
21:46:50 I think I don't want to answer that and would wait for the TC to discuss it
21:46:54 IANAL
21:47:16 I hasten to translate: "I ain't a lawyer"
21:47:29 that's it :)
21:47:35 jd__: IANAL either, I just thought that it is better to ask this before than after :(
21:47:35 yeah fair enough
21:47:47 yeah sure ildikov_ :)
21:47:59 if the concern was obvious FUD-throwing I wouldn't be worried
21:48:15 jd__: what about that Cassandra BP of yours?
21:48:36 ildikov_: it's not going to be more than what I posted for the time being
21:48:58 <_nadya_> Looks like I need to create a demo of HBase to show that it works...
21:49:18 jd__: ok
21:49:28 jd__: is mongo licensing on the TC agenda in the near term, do you know?
21:49:34 eglynn: no idea
21:49:42 ttx: ^
21:50:07 #topic Periodic checkupdate.sh failures on the gate
21:50:10 _nadya_: I just tried to ask about every option, I'm not against HBase :)
21:50:16 #link https://bugs.launchpad.net/ceilometer/+bug/1297999
21:50:17 Launchpad bug 1297999 in ceilometer "gate failures due to tools/config/check_uptodate.sh" [Undecided,In progress]
21:50:22 #link https://bugs.launchpad.net/nova/+bug/1268614
21:50:25 Launchpad bug 1268614 in cinder "pep8 gating fails due to tools/config/check_uptodate.sh" [Critical,Fix released]
21:50:28 I included the latest bug report and an earlier one
21:50:50 <_nadya_> ildikov_: :)
21:51:03 the failure is due to our inclusion of keystoneclient
21:51:17 I have a fix with two patch sets: the first fixes ceilometer.conf.sample, the second one currently deletes check_uptodate.sh from tox.ini
21:51:23 the thing is that this config file should be built at compile/installation time and not checked into git
21:51:40 the second one was gordc's suggestion
21:51:44 I think it's better to have the gate fail than an out-of-date conf
21:52:00 or find a way to stop putting this file into git
21:52:01 nova just did the same thing, including the generate-sample.sh in tox.ini
21:52:26 ildikov_: so they're not putting the sample into git now?
21:52:49 jd__: I meant that they removed check_uptodate.sh from tox
21:52:53 ok
21:52:53 surely the gate failing stops everything in its tracks?
21:53:03 eglynn: yes
21:53:04 eglynn: yep. everything is blocked
21:53:19 would not regenerating until RC1 be acceptable if we know no *new* config is added by any remaining icehouse patches?
21:53:20 so either the developers are blocked from time to time, or the users get crappy sample configuration files
21:53:35 eglynn: it badly messes up the diff most of the time :(
21:53:36 jd__: in nova's case tox.ini is the following: #link https://github.com/openstack/nova/blob/master/tox.ini
21:53:57 ildikov_: I see
21:54:09 they went backwards on that
21:54:24 can't blame anyone here – I'm the one responsible for the check_uptodate in the first place
21:54:24 this happens more than i like. i like not checking the sample into git but doesn't that mean we're just going to remove it from tox.ini as well?
21:54:48 the point was to make people stop putting these default files in git, but it seems it didn't help :D
21:55:29 i'd be game for turning it off for rc1 but i guess we'll eventually need to sync it up.
21:55:40 jd__: lol :)
21:55:57 good with me
21:56:03 gordc: yep, /me is also thinking a pragmatic short-term disable to drain the queue
21:56:05 I vote for whatever works best
21:56:32 * gordc already +2'd the turn-it-off patch.
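For reference, the change being +2'd above amounts to dropping the sample-config freshness check from the pep8 tox environment and regenerating ceilometer.conf.sample on demand instead. The snippet below is a rough sketch of that idea, assuming the oslo-incubator tools/config layout named in the bug titles; the exact environments and flags in the real tox.ini of the time may have differed.

```ini
# Sketch of the proposed direction, not the exact patch under review.
[testenv:pep8]
# check_uptodate.sh is no longer run here, so a stale ceilometer.conf.sample
# cannot block the gate on unrelated changes (e.g. a new keystoneclient release).
commands = flake8

[testenv:genconfig]
# Regenerate the sample config locally whenever options change.
commands = bash tools/config/generate_sample.sh -b . -p ceilometer -o etc/ceilometer
```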
21:57:01 #topic Open discussion
21:57:10 we have 3 minutes left if anything
21:57:16 gordc: cool, thanks :)
21:57:32 <_nadya_> oh
21:57:51 <_nadya_> any good news :)?
22:00:06 doesn't look so :)
22:00:09 not yet at least
22:00:10 _nadya_: it seems that the good news is there is no more bad one :)
22:00:23 _nadya_: ... well Ireland beat France at the rugby, does that count as good news? ;)
22:00:39 <_nadya_> eglynn: hehe :)
22:00:41 * jd__ is not sure
22:00:43 eglynn: \o/ :)
22:01:14 ... /me had to get the dig in somehow ;)
22:01:26 ... 20 years of hurt and all that
22:01:42 #endmeeting