21:01:45 <jd__> #startmeeting ceilometer
21:01:45 <openstack> Meeting started Wed Mar 26 21:01:45 2014 UTC and is due to finish in 60 minutes. The chair is jd__. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:49 <openstack> The meeting name has been set to 'ceilometer'
21:01:59 <gordc> o/
21:02:04 <ildikov_> o/
21:02:09 <lsmola> hello
21:02:09 <eglynn> o/
21:02:34 <_nadya_> o/
21:02:55 <jd__> #topic Milestone status icehouse-rc1
21:03:17 <jd__> #link https://launchpad.net/ceilometer/+milestone/icehouse-rc1
21:03:31 <jd__> we only have a few bugs left, so that looks good
21:03:34 <jd__> our target is Friday for rc1
21:03:38 <afarrell> thomas - just trying to find the window your message popped up in :-)
21:03:56 <ildikov_> jd__: currently there is a gate issue, I added a topic about it to the agenda
21:03:58 <gordc> resource-list patch should be ok for review: https://review.openstack.org/#/c/80343/
21:04:05 <jd__> ildikov_: ok
21:04:28 * eglynn would like to target https://bugs.launchpad.net/ceilometer/+bug/1297677 to RC1 also
21:04:29 <uvirtbot> Launchpad bug 1297677 in ceilometer "normal user create alarm with user-id or project-id specified will success instead of return 401" [High,Confirmed]
21:04:46 <gordc> deadlock issue might need to be pushed due to postgres issue... i think the oslo.db code doesn't work for multiple connections on postgres
21:04:47 <jd__> eglynn: targeted
21:04:50 <_nadya_> I have a question about rc1. We have merged the several-workers patch. And there are some problems with it
21:04:56 <eglynn> jd__: thank you sir!
21:05:14 <Alexei_987> gordc: why do you think it's an oslo problem?
21:05:23 <_nadya_> gordc: am I right that we need some more patches to be merged to make everything work right?
21:05:28 <gordc> _nadya_: i added a patch to default it to a single worker: https://review.openstack.org/#/c/83215/
21:05:38 <Alexei_987> gordc: it was working for me with tests on real backends and it surely uses multiple connections
21:06:07 <gordc> Alexei_987: does it work on postgres for you? with multiple connections? it doesn't on my machine.
21:06:29 <Alexei_987> gordc: it was working a month ago.. haven't tried recently
21:06:34 <gordc> and it doesn't work on gate
21:06:43 <Alexei_987> gordc it did :)
21:07:38 <gordc> :) there's something wonky happening. i think it's best to just default to a single worker for now... my patch to explicitly call dispose works but i have no real idea why.
21:07:38 <_nadya_> jd__: gordc: I think we need to determine the list of patches to be merged before rc1 or revert multi-spawning :(
21:08:18 <gordc> _nadya_: we don't necessarily need to revert. we can just default to a single worker so it behaves as before.
21:08:26 <Alexei_987> _nadya_ : I think we should revert in any case
21:08:37 <Alexei_987> workers are useless and don't solve any problem
21:09:22 <Alexei_987> workers are *only* useful in the case of a specific MySQL driver
21:09:37 <Alexei_987> there is a much simpler way to do without them
21:09:41 <gordc> Alexei_987: was just going to say that.... useless isn't exactly true
21:09:56 <jd__> it's not useless if you want to use several cores
21:10:02 <gordc> Alexei_987: what's the simpler way?
21:10:13 <Alexei_987> use the python driver for mysql
21:10:18 <Alexei_987> it works in async mode
21:10:19 <jd__> gordc: having multiple workers is in no way different from having several collectors running, right?
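(Context for the "default it to a single worker" patch mentioned above: the worker count is just a configuration option whose default changes, so deployments that set nothing keep the old single-process behaviour. A minimal sketch of that pattern with oslo.config follows; the option name collector_workers and its help text are illustrative guesses, not necessarily what the actual patch touches.)

    from oslo.config import cfg

    # Illustrative option only -- defaulting to 1 keeps the pre-multi-worker
    # behaviour for backends (notably PostgreSQL) that currently break when
    # several collector processes write concurrently.
    OPTS = [
        cfg.IntOpt('collector_workers',
                   default=1,
                   help='Number of collector worker processes to spawn.'),
    ]

    cfg.CONF.register_opts(OPTS)

    # A service launcher would then fork cfg.CONF.collector_workers
    # processes; operators on MySQL or MongoDB can raise the value, while
    # PostgreSQL deployments are documented to leave it at 1 for now.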
21:10:36 <Alexei_987> why would you use several cores? nginx works perfectly in 1 thread
21:10:41 <gordc> jd__: essentially yes. that's what it's doing.
21:10:44 <jd__> Alexei_987: nginx for the collector?
21:10:58 <jd__> gordc: so it's clearly just a visible effect of something that's broken more fundamentally :/
21:11:00 <Alexei_987> nah.. I mean that true async code doesn't need workers
21:11:22 <Alexei_987> it won't consume 100% of the cpu
21:11:26 <jd__> Alexei_987: so you run nginx on 1 core on your 8 core machines? awesome
21:11:31 <jd__> lol
21:11:33 <Alexei_987> why not?
21:11:45 <Alexei_987> it's enough to fully load the network
21:11:51 <Alexei_987> cpu is not the bottleneck
21:11:51 <jd__> and if your nginx is busy 100% of the time what do you do next?
21:12:09 <Alexei_987> again cpu is not the bottleneck
21:12:17 <Alexei_987> network or database
21:12:25 <jd__> good for you
21:12:26 <Alexei_987> the database will become the bottleneck in our case
21:12:35 <jd__> that doesn't solve our problem
21:12:53 <Alexei_987> our problem is poorly written code for mysql
21:13:11 <jd__> gordc: so I'm kind of worried, was it even working with Havana?
21:13:11 <gordc> Alexei_987: in oslo.db or in ceilometer?
21:13:16 <Alexei_987> ceilometer
21:13:31 <Alexei_987> it's already working at 400 req/s even when the cloud is idle
21:13:45 <gordc> Alexei_987: we're not writing any mysql-specific code.
21:13:55 <_nadya_> so we have 2 options: 1. merge https://review.openstack.org/#/c/83215/ ASAP 2. revert multiple workers. We have no other options. We have no time to repair SQL in 2 days
21:13:56 <gordc> jd__: i'm not sure. i've never tried, to be honest
21:14:00 <Alexei_987> ok sqlalchemy code
21:14:13 <jd__> _nadya_: 1. does not fix the problem, it just hides it
21:14:20 <jd__> and neither does 2
21:14:39 <_nadya_> jd__: yep :D and?
21:15:01 <jd__> if you start 2 ceilometer-collector processes – which has always been a valid use case – you will have the same errors
21:15:21 <jd__> because the SQL driver is probably doing something in the wrong way
21:15:48 <gordc> jd__: agreed. there's something wonky happening which doesn't let postgres work with multiple connections but lets mysql
21:15:54 <Alexei_987> how can the driver affect separate processes?
21:15:56 <jd__> so it needs to be fixed – gordc, what's the bug # if there's actually one?
21:16:15 <gordc> jd__: for the postgres issue? there is none right now
21:16:21 <jd__> I'm ok to try to target it at rc1 if somebody commits to fixing it
21:16:26 <jd__> otherwise it's pointless
21:16:38 <eglynn> jd__: how conservative do we have to be about RC1?
21:16:46 <eglynn> (i.e. treat it as effectively the final opportunity to get the above issues fixed?)
21:16:52 <gordc> jd__: i'm not sure we can fix it for rc1.
21:16:57 <_nadya_> I don't think it's possible to fix in 2 days
21:17:01 <jd__> eglynn: it should be final or that's not fair play
21:17:21 <jd__> ok
21:17:36 <Alexei_987> I propose to revert workers and I'll check this postgres issue :)
21:17:42 <jd__> so let's try to fix it someday, write a release note about that, and we'll backport the fix later
21:17:42 <eglynn> jd__: discussion above sounds more like early-cycle stuff than 11th-hour cusp-of-the-deadline hustle
21:18:26 <gordc> revert or default to single worker? it's essentially the same thing but allows the flexibility to have multiple workers for the mongo/mysql backends, which appear to work fine.
21:18:50 <Alexei_987> the mongo backend only consumes 10-20% of the cpu
21:18:50 <gordc> i'm ok with revert if we want... just presenting an alternative.
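(An aside on the "explicitly call dispose" patch gordc mentions: a classic cause of exactly this symptom is a SQLAlchemy engine created before the workers are forked, so the children share the connections inherited from the parent; PostgreSQL tends to surface the resulting protocol corruption first. Below is a minimal sketch of the dispose-after-fork pattern, on the assumption that this is roughly what the fix does -- the connection URL and structure are placeholders, not the actual collector code.)

    import multiprocessing

    from sqlalchemy import create_engine, text

    # Placeholder connection URL -- substitute a real one to try this out.
    engine = create_engine('postgresql://user:password@localhost/ceilometer')


    def worker():
        # Throw away any pooled connections inherited from the parent
        # process; the pool will lazily open fresh ones owned by this
        # child, so the two processes never share a socket.
        engine.dispose()
        with engine.connect() as conn:
            conn.execute(text('SELECT 1'))


    if __name__ == '__main__':
        procs = [multiprocessing.Process(target=worker) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()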
21:19:09 <gordc> Alexei_987: yeah. mysql is the real issue.
21:19:21 <jd__> and how is reverting going to make things work with 2 collectors running?
21:19:43 <Alexei_987> jd__: it won't, I just don't like adding things that we don't actually need
21:19:57 <Alexei_987> the deadlock issue needs to be fixed separately
21:20:32 <jd__> I don't really care about what you think when it's subjective and not technical
21:20:38 <gordc> the deadlock issue as in this patch: https://review.openstack.org/#/c/80461/
21:20:45 <jd__> I just want this bug fixed, so if reverting is not going to help, there's no point
21:21:11 <jd__> gordc: it depends on something outdated
21:21:28 <gordc> Alexei_987, jd__: i think that patch should be pushed to post rc-1 with the multiple-connection fix.
21:22:00 <jd__> I'm ok to postpone the fix if it's going to take a while; we can document that it's not working
21:22:08 <gordc> i'll red x my patches that should be moved to later.
21:22:08 <_nadya_> +1
21:22:14 <jd__> gordc: could you at least open a bug and document it?
21:22:29 <gordc> will do.
21:22:36 <jd__> I feel bad talking about something we don't even have a record of
21:23:03 <eglynn> so what exactly is *not* going to work with RC1? (under the proposed approach)
21:23:19 <eglynn> I'm a bit lost in these options to revert or default to a single worker etc.
21:23:22 <jd__> eglynn: using PostgreSQL with more than one collector
21:23:24 <eglynn> e.g. are we saying horizontal scaling of the collector is simply not possible currently with sqla?
21:23:30 <jd__> eglynn: yep
21:23:39 <jd__> s/sqla/sqla-using-postgresql/
21:23:48 <eglynn> jd__: a-ha, but just when psql is the backend, not mysql?
21:23:54 <eglynn> k, got it!
21:24:11 <jd__> so we can make postgresql work by setting the number of workers to 1 by default and documenting that postgresql users shouldn't change it or run another copy
21:24:28 <jd__> that's a poor solution, but probably the best we can do in the timeframe
21:24:43 <ildikov_> and we will stay with option 1 for now, right?
21:25:08 <jd__> option 1?
21:25:16 <eglynn> k, so that pretty much keeps sqla/postgres in the not-usable-for-prod category, right?
21:25:26 <jd__> eglynn: clearly
21:25:44 <jd__> *sad face*
21:25:56 <ildikov_> jd__: merge the default-to-1-worker patch
21:25:58 <_nadya_> is mongo still in the prod category?
21:26:19 <eglynn> _nadya_: as long as you're a licensing maverick ;)
21:26:23 <jd__> ildikov_: yes, as a temporary solution :(
21:26:38 <gordc> https://bugs.launchpad.net/ceilometer/+bug/1298073
21:26:39 <uvirtbot> Launchpad bug 1298073 in ceilometer "postgres does not work with multiple connections" [High,Triaged]
21:26:55 <ildikov_> jd__: ok :(
21:26:59 <jd__> thanks gordc
21:27:21 <jd__> #topic Tempest integration
21:27:27 <jd__> _nadya_: enlighten us
21:28:16 <_nadya_> what can I say? We have no Mongo. MySQL and Postgres fail on gating
21:28:49 <_nadya_> and even the notification service was broken by the infra team
21:28:58 <_nadya_> repaired now
21:29:22 <_nadya_> I don't see any way to add notification and pollster tests
21:29:27 <gordc> _nadya_: i assume that merged successfully because we have no tests against notifications?
21:29:40 <_nadya_> gordc: exactly
21:30:09 <_nadya_> so we have tests but we can't make them work
21:30:16 <_nadya_> so sad story :(
21:30:21 <jd__> what's blocking?
21:30:39 <jd__> MySQL shouldn't fail on gating :(
21:30:49 <_nadya_> Ceilometer takes 16 minutes to process a notification
21:30:56 <_nadya_> on mysql
21:31:04 <_nadya_> because of the high load
21:31:26 <jd__> ok
21:31:40 <_nadya_> the solution is to add a Mongo job
21:31:54 <jd__> mongo is not likely to happen soon indeed
21:31:55 <_nadya_> but everybody knows Sean's position
21:31:57 <gordc> _nadya_: or streamline the sql schema
21:32:23 <gordc> i think it's best to make adjustments to the schema but that is not an rc1 item.
21:32:35 <jd__> yes, I don't think we can do much right now
21:32:44 <jd__> let's try to think and prepare for Juno now :(
21:32:45 <eglynn> _nadya_: 16 mins from receiving the notification to persisting the sample?
21:33:01 <eglynn> _nadya_: or 16 mins from the notification being emitted by nova?
21:33:02 <_nadya_> and today I was playing with HBase. It shows the same results as Mongo (good perf) but I can't imagine how to push it on gating too
21:33:37 <_nadya_> eglynn: between nova creating the notification and Ceilometer starting to put it into the database
21:34:49 <eglynn> _nadya_: wow! ... are the Jenkins slaves chronically under-scaled/under-powered?
21:35:30 <_nadya_> jd__: I don't know, do we have any estimates from the TC about Tempest?
21:35:31 <eglynn> as in, are they a complete mismatch to the scale to be expected in a more realistic prod-style deployment
21:35:38 <jd__> _nadya_: not yet
21:36:20 <_nadya_> jd__: maybe we should think about smth like fake tests? I saw a lot of such tests for other projects
21:36:39 <jd__> _nadya_: I don't know, what would that be?
21:36:56 <_nadya_> jd__: that the API returns a list
21:37:15 <eglynn> _nadya_: would that cut the legs out from under the whole idea of Tempest?
21:37:26 <eglynn> _nadya_: (i.e. non-faked integration tests)
21:37:51 <_nadya_> eglynn: ah, say it again in a different way :D
21:38:20 <eglynn> _nadya_: would fake tests make Tempest kinda pointless?
21:38:47 <eglynn> _nadya_: (if the point of Tempest is to test integration without the kind of fakery we use in the unit tests)
21:39:07 <gordc> eglynn: i think it's the way tests are done. it spikes at certain points and from there it's just backlogged because mysql isn't writing fast enough.
21:39:28 <_nadya_> eglynn: yep. but I'm afraid about the TC only
21:39:43 <_nadya_> eglynn: I don't want to have such tests
21:39:55 <eglynn> _nadya_: TC == technical committee?
21:39:56 <_nadya_> eglynn: just thinking about options
21:40:00 <_nadya_> eglynn: yep
21:40:08 <eglynn> _nadya_: k, got it, thanks!
21:40:58 <eglynn> so could the underlying core problem here be simply that the Jenkins slaves on which Tempest runs are just way too under-resourced?
21:41:11 <_nadya_> eglynn: there are several mailing-list threads about Tempest+Ceilometer, I will discuss it with you tomorrow. Maybe I've missed smth
21:41:21 <eglynn> ... i.e. running in bigger instance types on faster h/w would resolve it?
21:41:28 <eglynn> ... /me prolly stating the obvious there
21:42:04 <jd__> eglynn: it seems to be a problem with the SQL driver again
21:42:05 <eglynn> _nadya_: ... actually it's prolly just me still being behind on my email backlog
21:42:08 <jd__> not with the power available
21:42:17 <eglynn> jd__: k
21:42:28 <jd__> let's move on for now
21:42:33 <jd__> #topic Release python-ceilometerclient?
21:42:40 <jd__> I think you're on that eglynn? :)
21:42:45 <eglynn> we'll be good to go with cutting 1.0.10 once the selectable aggregates patch lands
21:42:49 <asalkeld> o/
21:42:49 <eglynn> https://review.openstack.org/80499
21:42:54 <jd__> cool
21:42:58 <eglynn> currently just a test coverage objection
21:43:03 <eglynn> should be quick to resolve
21:43:05 <jd__> hi asalkeld
21:43:07 <eglynn> ... so that 1.0.10 can coincide with RC1
21:43:11 <eglynn> Angus hey!
21:43:14 <jd__> perfect timing :)
21:43:16 <asalkeld> howdy
21:43:41 <jd__> #topic MongoDB licensing issues raised by the failed Marconi graduation
21:43:45 <jd__> not a fun topic again
21:43:53 <eglynn> :)
21:44:09 <eglynn> so the withdrawal mail from the Marconi team put the problem thusly:
21:44:21 <eglynn> "The drivers currently supported by the project don't cover some important cases related to deploying it. One of them solves a licensing issue but introduces a scale issue whereas the other one solves the scale issue and introduces a licensing issue."
21:44:37 <eglynn> the scaling/encumbered DB is mongodb
21:44:43 <eglynn> the non-scaling/unencumbered one is sqlalchemy
21:44:50 <eglynn> #link http://lists.openstack.org/pipermail/openstack-dev/2014-March/030638.html
21:45:03 <eglynn> so the $64k question is ...
21:45:19 <eglynn> is this licensing palaver around mongo also a clear & present danger for our project?
21:45:46 <eglynn> ... given that prod deployment options seem to be narrowing back to mongo
21:46:30 <eglynn> fair enough that Marconi met with disapproval for various other reasons
21:46:40 <ildikov_> do we have a plan B, if Mongo is not an option because of licensing?
21:46:48 <_nadya_> HBase :)
21:46:50 <jd__> I think I don't want to answer that and would wait for the TC to discuss it
21:46:54 <jd__> IANAL
21:47:16 <eglynn> I hasten to translate: "I ain't a lawyer"
21:47:29 <jd__> that's it :)
21:47:35 <ildikov_> jd__: IANAL either, I just thought that it is better to ask this before than after :(
21:47:35 <eglynn> yeah fair enough
21:47:47 <jd__> yeah sure ildikov_ :)
21:47:59 <eglynn> if the concern was obvious FUD-throwing I wouldn't be worried
21:48:15 <ildikov_> jd__: what about the Cassandra BP of yours?
21:48:36 <jd__> ildikov_: it's not going to be more than what I posted for the time being
21:48:58 <_nadya_> Looks like I need to create a demo of HBase to show that it works...
21:49:18 <ildikov_> jd__: ok
21:49:28 <eglynn> jd__: is mongo licensing on the TC agenda in the near term, do you know?
21:49:34 <jd__> eglynn: no idea
21:49:42 <jd__> ttx: ^
21:50:07 <jd__> #topic Periodic checkupdate.sh failures on the gate
21:50:10 <ildikov_> _nadya_: I just tried to ask about every option, I'm not against HBase :)
21:50:16 <jd__> #link https://bugs.launchpad.net/ceilometer/+bug/1297999
21:50:17 <uvirtbot> Launchpad bug 1297999 in ceilometer "gate failures due to tools/config/check_uptodate.sh" [Undecided,In progress]
21:50:22 <jd__> #link https://bugs.launchpad.net/nova/+bug/1268614
21:50:25 <uvirtbot> Launchpad bug 1268614 in cinder "pep8 gating fails due to tools/config/check_uptodate.sh" [Critical,Fix released]
21:50:28 <ildikov_> I included the latest bug report and an earlier one
21:50:50 <_nadya_> ildikov_: :)
21:51:03 <jd__> the failure is due to our inclusion of keystoneclient
21:51:17 <ildikov_> I have a fix with two patch sets, the first fixes the ceilometer.conf.sample, the second one currently removes check_uptodate.sh from tox.ini
21:51:23 <jd__> the thing is that this config file should be built at compile/installation time and not checked into git
21:51:40 <ildikov_> the second one was gordc's suggestion
21:51:44 <jd__> I think it's better to have the gate fail than an out-of-date conf
21:52:00 <jd__> or find a way to stop putting this file into git
21:52:01 <ildikov_> nova just did the same thing and included generate_sample.sh in tox.ini
21:52:26 <jd__> ildikov_: so they're not putting the sample into git now?
21:52:49 <ildikov_> jd__: I meant that they removed check_uptodate.sh from tox
21:52:53 <jd__> ok
21:52:53 <eglynn> surely the gate failing stops everything in its tracks?
21:53:03 <jd__> eglynn: yes
21:53:04 <gordc> eglynn: yep. everything is blocked
21:53:19 <eglynn> would not regenerating until RC1 be acceptable if we know no *new* config is added by any remaining icehouse patches?
21:53:20 <jd__> so either the developers are blocked from time to time, or the users get crappy sample configuration files
21:53:35 <jd__> eglynn: it badly messes up the diff most of the time :(
21:53:36 <ildikov_> jd__: tox.ini is the following in the case of nova: #link https://github.com/openstack/nova/blob/master/tox.ini
21:53:57 <jd__> ildikov_: I see
21:54:09 <jd__> they went backward on that
21:54:24 <jd__> can't blame them here – I'm the one responsible for check_uptodate in the first place
21:54:24 <gordc> this happens more than i like. i like not checking the sample into git but doesn't that mean we're just going to remove it from tox.ini as well?
21:54:48 <jd__> the point was to make people stop putting these default files in git, but it seems it didn't help :D
21:55:29 <gordc> i'd be game for turning it off for rc1 but i guess we'll eventually need to sync it up.
21:55:40 <ildikov_> jd__: lol :)
21:55:57 <jd__> good with me
21:56:03 <eglynn> gordc: yep, /me also thinking a pragmatic short-term disable to drain the queue
21:56:05 <jd__> I vote for whatever works best
21:56:32 * gordc already +2'd the turning-it-off patch.
21:57:01 <jd__> #topic Open discussion
21:57:10 <jd__> we have 3 minutes left if anything
21:57:16 <ildikov_> gordc: cool, thanks :)
21:57:32 <_nadya_> oh
21:57:51 <_nadya_> any good news :)?
22:00:06 <jd__> doesn't look so :)
22:00:09 <jd__> not yet at least
22:00:10 <ildikov_> _nadya_: it seems that the good news is that there is no more bad news :)
22:00:23 <eglynn> _nadya_: ... well Ireland beat France at the rugby, does that count as good news? ;)
22:00:39 <_nadya_> eglynn: hehe :)
22:00:41 * jd__ is not sure
22:00:43 <ildikov_> eglynn: \o/ :)
22:01:14 <eglynn> ... /me had to get the dig in somehow ;)
22:01:26 <eglynn> ... 20 years of hurt and all that
22:01:42 <jd__> #endmeeting
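(Reference note for the checkupdate.sh topic above: in icehouse-era OpenStack projects the sample-config check was typically wired into the pep8 tox environment, so "turning it off" means deleting one line from tox.ini and regenerating ceilometer.conf.sample by hand when options change. Roughly, and from memory -- the exact contents of ceilometer's tox.ini may differ:)

    # Approximation of the pep8 environment under discussion; ildikov_'s
    # second patch set removes the check_uptodate.sh line so that an
    # out-of-date sample config no longer blocks the gate.
    [testenv:pep8]
    commands =
        flake8
        {toxinidir}/tools/config/check_uptodate.sh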