21:09:43 <dansmith> #startmeeting nova
21:09:44 <openstack> Meeting started Thu Feb  7 21:09:43 2013 UTC.  The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:09:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:09:47 <openstack> The meeting name has been set to 'nova'
21:09:48 <dansmith> he'll be here
21:09:54 <dansmith> I'll #chair him when he shows up
21:10:01 <sdague> cool
21:10:09 <jog0> o/
21:10:18 <sdague> o/
21:10:36 <dansmith> although there's no agenda apparently
21:11:06 <comstud> do not commit future versions of oslo
21:11:12 <dansmith> heh
21:11:26 <dansmith> comstud: any chance rackspace could throw resources at the tempest gate problem?
21:11:35 <comstud> not sure
21:11:36 <dansmith> 1.5 hours to get a read on any patch is pretty brutal :(
21:11:37 <comstud> will ask
21:11:57 <dansmith> and the way zuul works,
21:12:09 <dansmith> that can easily become six hours if you're unlucky and behind some bad patches
21:12:21 <sdague> dansmith: you can always see the zuul work in progress
21:12:30 <sdague> http://status.openstack.org/zuul
21:12:34 <dansmith> sdague: ...yeah of course
21:12:48 <dansmith> sdague: but that does nothing for getting a read on whether the full run is going to pass,
21:12:51 <dansmith> which is always the last one
21:13:23 <westmaas> dansmith: do we know where the problem is? spinning up envs? running the tests?
21:13:28 <westmaas> sorry, just catching up
21:13:43 <comstud> dansmith: look, i brought resources.
21:13:50 <dansmith> westmaas: I think sdague knows more, but I think it's mostly cores and memory, surprise :)
21:13:52 <sdague> it's just that we run a lot of tests now, and dansmith is impatient
21:13:58 <comstud> lol
21:14:00 <westmaas> haha
21:14:33 <sdague> tempest will get quicker once the testr conversion happens over the next couple weeks
21:14:34 <dansmith> if I'm the only one that thinks it's frustrating,
21:15:05 <sdague> but we'll also have to look at bumping devstack node sizes
21:15:05 <jog0> dansmith: I agree with you
21:15:11 <dansmith> then that's fine, but I go days sometimes without being able to get something to a mergeable state because I'm always waiting for runs
21:15:17 <westmaas> sdague: do you know what size you use now?
21:15:28 <sdague> whatever the 4G nodes are
21:15:30 <westmaas> and the time is spent on test running, right?
21:15:31 <dansmith> and it compounds with every change you have stacked on top of another, which is often where I'm at
21:15:39 <dansmith> it's not as big of a deal for individual patches
21:15:42 <westmaas> not on bringing up the nodes
21:16:08 <sdague> we get past devstack in about 9 minutes
21:16:56 <sdague> there is detailed timing in all the logs
21:17:12 <dansmith> sdague: devstack setup is not parallelized, right?
21:17:20 <sdague> nope
21:17:26 <dansmith> I wonder if some of it could be?
21:17:44 <sdague> realize, we spent a lot of time squeezing out time from those runs
21:17:57 <dansmith> I'm just asking
21:18:02 <sdague> totally happy for other people to look at it as well, but there's not much low hanging fruit in there :)
21:18:26 <dansmith> sdague: if the instances are 2 or 4 cores right now,
21:18:32 <westmaas> node size could be the biggest thing unless it's mostly about waiting idly for nodes to spin up
21:18:33 <dansmith> then the tempest testr change isn't going to help all that much,
21:18:36 <sdague> I think they are 2 cores
21:18:38 <dansmith> which is why I was asking about larger nodes
21:18:52 <sdague> westmaas: the tempest tests wait for guests to spin up
21:18:58 <sdague> that's where a lot of time is
21:19:03 <sdague> which is why testr will help
21:19:25 <dansmith> sdague: but chris was saying that the speedup is 50% of your core count, right?
21:19:40 <sdague> we'll see what 2 cores does
21:20:14 <westmaas> guests might spin up a little faster on larger nodes.  if someone wants to reach out to me offline about using more RS resources we can do that, would just want to understand how much we are using first.
21:20:19 <dansmith> okay, I thought he said that with 2 cores, four testr threads would actually go slower
21:20:34 <sdague> with 2 cores there won't be 4 testr threads
21:20:45 <dansmith> sdague: you can run it with as many threads as you want, of course
21:20:57 <sdague> dansmith: instead of speculating, why not wait for the actual testr review to hit tempest, then we'll have real data
21:21:01 <dansmith> sdague: he was trying to see if there was really any idle time to gain by overcommitting
21:21:35 <dansmith> yes, let's linearize the process!
21:21:45 <dansmith> anyway, I guess vishy isn't going to show up after all
21:21:58 <sdague> I mean the easy answer is just to turn a lot of tests back off, but the new tests have already blocked breakage from going in
21:22:10 <dansmith> sdague: has anyone suggested that? I sure haven't :)
21:22:35 <sdague> :)
21:23:16 <dansmith> okay, so do we bail? No agenda, no rustlebee, no vishy
21:23:21 <dansmith> (and no-db-compute, by the way)
21:23:31 <comstud> hehe
21:24:07 <jog0> one quick comment: I have my BP ready for review  https://blueprints.launchpad.net/nova/+spec/db-api-cleanup https://review.openstack.org/#q,topic:bp/db-api-cleanup,n,z
21:24:22 <dansmith> jog0: yeah, I've been meaning to look at it
21:24:35 <dansmith> jog0: I'd ask why so many of them are failing jenkins, but I know why :)
21:25:04 <jog0> one of the patches is blocked on getting code into oslo.
21:25:17 <jog0> and the new tempest tests have been very helpful
21:25:32 <jog0> it seems nova/api/openstack/compute/contrib/security_groups.py has very few unit tests
21:25:44 <dansmith> yes, they found the last item in quantum_api for me :)
21:27:11 <dansmith> jog0: anything else?
21:27:22 <jog0> dansmith:  nope
21:27:27 <dansmith> anyone?
21:27:30 <dansmith> bueller?
21:28:24 <comstud> just a comment maybe
21:29:18 <comstud> db pooling becomes important when moving to conductor... yet we have something broken with our use of sqlalchemy that prevents db pooling from working..  seems that it's leaving transactions open.. thus innodb locks held.
21:29:33 <comstud> queries end up timing out waiting on locks in mysql
21:29:47 <comstud> working on tracking it down.
21:30:01 <jog0> devananda: ^
21:30:09 <dansmith> comstud: why is this not showing up in devstack? does it take a while to crop up?
21:30:15 <comstud> we were having a discussion yesterday or the day before
21:30:19 <sdague> comstud: so how broken are things with the oslo sync that wasn't? It didn't even occur to me that markmc might have pulled in uncommitted oslo code :)
21:30:20 <comstud> does devstack have db pooling on?
21:30:30 <sdague> comstud: I don't think so
21:30:32 <comstud> sdague: most of the things out of sync have merged now it appears
21:30:46 <comstud> dansmith: it seems to only show when db pooling is on
21:30:51 <comstud> https://review.openstack.org/#/c/21461/
21:30:56 <dansmith> comstud: is that something we should enable? or would it be too hard to track down issues?
21:31:01 <comstud> ^^ this is an improvement that also needs to be committed to oslo and merged into nova
21:31:02 <uvirtbot> comstud: Error: "^" is not a valid command.
21:31:12 <comstud> otherwise you get connect timeouts under load, etc
21:31:31 <comstud> dansmith: it's certainly not ready yet
21:31:44 <comstud> i dunno if devstack provides enough parallelism and load to see the issue anyway
21:32:00 <dansmith> okay
21:32:08 <comstud> it's easy to reproduce in a test script that slams DB requests with pooling on
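
As a rough illustration of the kind of repro script comstud describes (not nova's actual test; the connection URL, pool sizes, and query are placeholders, and it assumes eventlet plus a plain SQLAlchemy engine):

    # hypothetical load script: many greenthreads slamming the DB at once
    import eventlet
    eventlet.monkey_patch()

    from sqlalchemy import create_engine, text

    # placeholder URL and pool sizes, not nova's real configuration
    engine = create_engine('mysql://nova:nova@localhost/nova',
                           pool_size=10, max_overflow=20)

    def hammer(i):
        conn = engine.connect()
        try:
            # a real repro would issue concurrent writes against the same
            # rows; held locks and open transactions then surface as
            # "lock wait timeout exceeded" errors
            conn.execute(text("SELECT 1"))
        finally:
            conn.close()

    pool = eventlet.GreenPool(100)
    for i in range(1000):
        pool.spawn_n(hammer, i)
    pool.waitall()
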
21:32:25 <dansmith> and conductor makes the problem worse because it takes the db hit for everyone and just hits the problem faster?
21:32:27 <sdague> from a gate perspective, I don't think there is enough db traffic to have it make a difference
21:32:27 <jog0> comstud: I wasn't able to see any difference in tempest or smoketests in devstack with db pool on
21:32:47 <sdague> yeh, I'm not surprised jog0
21:32:51 <comstud> dansmith: i have not run with nova-conductor yet, but I can tell you that yes it will.. because of that reason
21:32:58 <dansmith> comstud: okay
21:33:04 <comstud> dansmith: I see it with global nova-cells which takes DB updates from all child cells
21:33:21 <comstud> conductor and global nova-cells have nearly the same function
21:33:21 <dansmith> I'm embarrassed to say that I don't really know what this means.. is this avoiding connection setup and teardown for every query?
21:33:28 <sdague> the big timing issues in tempest are starting guests and cleaning them up, which we have to do carefully because it's easy to overrun memory in the gate
21:33:51 <comstud> jog0: ya, I'm not surprised.. I don't think there's probably enough parallelism in devstack tests
21:34:03 <comstud> so I think any perf issues you have are just something else
21:34:08 <comstud> Oh... one other thing to note :)
21:34:19 <comstud> if you ditch sqlalchemy and do raw mysql queries with MySQLdb...
21:34:25 <comstud> the problems go away and things are SIGNIFICANTLY faster.
21:34:30 <dansmith> heh
21:34:47 <comstud> even when doing serialized DB calls
21:34:50 <comstud> with no thread pooling.
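
For comparison, the raw-MySQLdb style being described looks roughly like this (a sketch only; the table, columns, and credentials are illustrative):

    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='nova',
                           passwd='nova', db='nova')
    cur = conn.cursor()
    # no ORM, no session: one explicit query, one explicit commit
    cur.execute("SELECT id, uuid, vm_state FROM instances WHERE deleted = 0")
    rows = cur.fetchall()
    conn.commit()  # nothing is left holding a transaction open afterwards
    cur.close()
    conn.close()
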
21:34:52 <dripton> comstud: what about if you only use low-level sqlalchemy and not the top ORM layer?
21:35:02 <comstud> dripton: I was going to try that next
21:35:10 <dripton> comstud: +1
21:35:18 <dansmith> we're getting closer to not even needing the ORM layer for much, right?
21:35:20 <comstud> although I'm focusing on just trying to find where sqlalchemy is holding a transaction open
21:35:22 <dansmith> with jog0's patches
21:35:52 <sdague> well you could at least write a non-sqlalchemy version of the api
21:35:55 <sdague> after those
21:36:00 <comstud> yea
21:36:01 <dripton> dansmith: we currently use the ORM layer all over, and I don't think jog0's patches change that
21:36:11 <comstud> dansmith: what dripton said
21:36:20 <dansmith> hrm..
21:36:29 <dansmith> I don't get that
21:36:43 <jog0> dripton: the patch prevents sqlalchemy  objects from being returned from nova.db.api functions
21:37:02 <comstud> In a way it gets you closer, I guess
21:37:07 <jog0> after that removing ORM should be easier
21:37:08 <dripton> jog0: right but we still use ORM inside the api functions
21:37:15 <jog0> dripton:  yeahhhh
21:37:24 <jog0> for now
21:37:24 <comstud> you need to rewrite almost all of sqlalchemy/api.py :)
21:37:25 <dripton> jog0: okay, we agree.  Not gone yet but maybe in H
21:37:25 <comstud> still
21:37:31 <sdague> dripton: yeh, but you now *could* write an alternate api implementation
21:37:38 <comstud> it might as well be an alternate api impl
21:37:41 <comstud> yes
21:37:43 <dripton> sdague: agreed, we're going the right way
21:37:56 <dripton> we still need to prove that *low-level* sqlalchemy doesn't have the performance problem
21:38:03 <dripton> hope not, because it buys us portability
21:38:14 <sdague> yeh, losing portability would really suck
21:38:16 <comstud> depends on how low level we need to go
21:38:46 <comstud> anyway, we'll see.  first thing is just trying to find what's broken
21:38:50 <comstud> performance after
21:38:59 <sdague> comstud: there are subtle typing things between even mysql and pg, so it's more than you might imagine
21:39:06 <comstud> nod
21:39:20 <dripton> my db-archiving patch (not done yet) uses only low-level sqlalchemy.  I was worried someone might complain about that but it looks like we're headed that way anyway.
21:39:23 <sdague> but maybe do custom things for just a few of the most expensive bits
21:39:55 <comstud> I was considering starting a mysql only api implementation
21:40:02 <comstud> which falls back to sqlalchemy for methods not implemented yet
21:40:10 <jog0> comstud: +1
21:40:13 <devananda> ++
21:40:16 <comstud> which gives you a path to migrate if you're using mysql
21:40:17 <dripton> comstud: before being that extreme please try with low-level sqlalchemy
21:40:18 <sdague> dripton: you mean just doing .insert().values().execute() kind of things?
21:40:23 <dripton> sdague: yes
21:40:29 <comstud> dripton: nod
21:40:41 <sdague> dripton: yeh, I was doing that in the data injection for migrations, because I can't use the models
21:40:45 <sdague> it's not too bad
21:40:56 <sdague> it would be interesting if that sped us up
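
The "low-level" style under discussion is SQLAlchemy's expression layer used without the ORM; a minimal sketch, assuming the implicit-execution idiom available in the SQLAlchemy releases of that era and using illustrative table and column names:

    from sqlalchemy import MetaData, Table, create_engine, select

    engine = create_engine('mysql://nova:nova@localhost/nova')  # placeholder
    meta = MetaData(bind=engine)
    # reflect the existing table instead of going through the ORM models
    instances = Table('instances', meta, autoload=True)

    # .insert().values().execute(): no session, no identity map
    instances.insert().values(uuid='fake-uuid', vm_state='building').execute()

    # the query side looks much the same
    rows = select([instances.c.uuid, instances.c.vm_state]).execute().fetchall()
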
21:41:15 <devananda> a chunk of the issues i've seen appear to come from the way sqlalchemy automagically creates transactions
21:41:49 <devananda> even when there's no explicit "with session.begin", there is still a transaction
21:42:29 <devananda> doing the lower level calls directly might be circumventing that?
21:42:51 <dripton> devananda: I *think* you can go transactionless with low-level sqlalchemy, but I need to test
21:43:08 <dripton> my code is using "with session.begin"
21:43:12 <comstud> devananda: yeah, that's part of the problem I'm positive
21:43:27 <comstud> i changed a DB call to remove session.begin()
21:43:42 <comstud> and I don't get any deadlocks anymore... but I still get lock wait timeouts.
21:43:50 <comstud> but that means sqlalchemy is not committing somewhere
21:43:52 <comstud> AFAIK
21:43:55 <devananda> right
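
To devananda's point, with a default (non-autocommit) SQLAlchemy session a transaction is opened implicitly by the first query and stays open until it is committed, rolled back, or closed; a minimal sketch, with the URL and query purely illustrative:

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import sessionmaker

    engine = create_engine('mysql://nova:nova@localhost/nova')  # placeholder
    Session = sessionmaker(bind=engine)  # default sessions are not autocommit

    session = Session()
    # no explicit session.begin(), but this query still starts a transaction
    session.execute(text("SELECT * FROM instances WHERE uuid = :u"),
                    {'u': 'fake-uuid'})
    # until commit(), rollback(), or close() runs, that transaction (and any
    # locks it picked up) stays open on the pooled connection
    session.close()
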
21:44:54 <devananda> comstud: the real fix to the stuff we were discussing is, IMNSHO, to add UNIQUE indexes
21:45:11 <devananda> using a mysql-specific extension (INSERT .. ON DUP KEY UPDATE) is then possible
21:45:21 <devananda> and pgsql has a similar capability which they call upserts
21:45:30 <devananda> it's just a bit trickier (have to use a trigger or such)
21:45:36 <devananda> sqlite is screwed tho :)
21:45:58 <dripton> I don't think anyone would really use sqlite in production
21:46:11 <devananda> trying to enforce uniqueness with locking outside of the database is prone to all kinds of problems (like these)
21:46:50 <comstud> devananda: unfortunately there's no portable way to do it in sqlalchemy
21:46:54 <comstud> the ON DUP KEY UPDATE
21:47:18 <sdague> can we extend sqlalchemy for it?
21:47:24 <comstud> but even without that deadlock problem.. I'm getting the lock wait timeouts
21:47:27 <sdague> at the end of the day it's just a meta compiler
21:47:42 <comstud> sdague: there was a way to compile it onto the end of a query
21:47:52 <sdague> seems like we could have an .insert_update method and backends
21:48:00 <comstud> but you'd need to figure out how to do it for pg, mysql, sqlite, etc :)
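
For concreteness, the MySQL-specific form devananda mentions can be issued through SQLAlchemy's text() construct; this is a sketch only, it assumes a UNIQUE key over the illustrative (uuid, mac) columns, and it works on MySQL alone:

    from sqlalchemy import create_engine, text

    engine = create_engine('mysql://nova:nova@localhost/nova')  # placeholder

    # a single statement, so there is no read-then-write race and no lock held
    # between a SELECT and a later UPDATE; requires a UNIQUE key on (uuid, mac)
    upsert = text(
        "INSERT INTO bw_usage_cache (uuid, mac, bw_in, bw_out) "
        "VALUES (:uuid, :mac, :bw_in, :bw_out) "
        "ON DUPLICATE KEY UPDATE bw_in = :bw_in, bw_out = :bw_out")

    engine.execute(upsert, uuid='fake-uuid', mac='aa:bb:cc:dd:ee:ff',
                   bw_in=1000, bw_out=2000)
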
21:48:04 <devananda> comstud: IIUC, the lock wait timeouts are coming from the same thing -- >1 thread trying to lock the same rows with an UPDATE statement, when it really just wants to INSERT.
21:48:30 <comstud> devananda: and innodb doesn't detect a deadlock?
21:48:31 <sdague> comstud: yeh, but if we need to figure that out anyway for openstack code, might as well monkey patch sqlalchemy for it
21:48:42 <dripton> I've done it with sqlalchemy the  manual way: inside a transaction, select, if found then update else insert.
21:48:51 <dripton> ugly but portable
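
Sketched out, the portable pattern dripton describes looks something like the following (the model, columns, and session style are hypothetical; it assumes an autocommit-style session where "with session.begin()" opens the transaction, as nova's sessions did):

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class BwUsage(Base):
        __tablename__ = 'bw_usage_cache'
        id = Column(Integer, primary_key=True)
        uuid = Column(String(36))
        mac = Column(String(17))
        bw_in = Column(Integer)
        bw_out = Column(Integer)

    def bw_usage_update(session, uuid, mac, bw_in, bw_out):
        # one transaction wrapping a read-then-write: portable, but under
        # concurrency two sessions can both miss the row and both insert,
        # or sit on row locks waiting for each other
        with session.begin():
            usage = session.query(BwUsage).\
                filter_by(uuid=uuid, mac=mac).first()
            if usage:
                usage.bw_in = bw_in
                usage.bw_out = bw_out
            else:
                session.add(BwUsage(uuid=uuid, mac=mac,
                                    bw_in=bw_in, bw_out=bw_out))
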
21:49:11 <devananda> comstud: if neither thread has the lock and both want it (and can't get it), then it deadlocks.
21:49:14 <sdague> dripton: yeh, if we could encapsulate it nicely, at least the ugly could live off in a method somewhere
21:49:30 <devananda> comstud: if one thread has the lock and just sits on it, other threads will timeout. so yes, that problem is sqlalchemy not committing properly
21:49:44 <devananda> but it's holding the lock because it did an UPDATE on 0 rows
21:49:44 <comstud> yeah, that's the issue it seems
21:49:47 <devananda> (i think)
21:49:49 <comstud> I'm trying to track down where it's sitting
21:49:50 <comstud> and why
21:50:13 <devananda> dripton: that "portable way" is precisely the cause of this problem :(
21:50:22 <dripton> sigh
21:50:57 <devananda> dripton: if the row doesn't exist, both trx will insert (and one should fail, but right now, won't because there's no UNIQUE key, leading to data inconsistency)
21:50:57 <dripton> stackoverflow says sqlalchemy has session.save_or_update, but that might be at the wrong level
21:51:42 <dripton> Can we add unique keys wherever we need them?
21:51:48 <sdague> dripton: the sqlalchemy code isn't that bad, I'd say just read the code :)
21:51:49 <devananda> yes :)
21:51:51 <comstud> i added them
21:51:55 <comstud> in the method i was testing
21:51:57 <comstud> (bw usage updates)
21:52:08 <comstud> and it didn't help
21:52:12 <devananda> so uniques are possible now that soft_deletes set deleted=<id>
21:52:36 <devananda> comstud: it should help if you change the SQL
21:52:53 <devananda> comstud: either use INSERT .. ON DUP KEY, or remove the SELECT and do this
21:52:55 <devananda> try:
21:52:56 <devananda> INSERT
21:52:57 <comstud> yeah, ok, I had to... and I did get deadlocks to go away
21:53:02 <devananda> except Duplicate:
21:53:05 <devananda> UPDATE
21:53:08 <devananda> with no transaction around them
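
Fleshed out, devananda's INSERT-first suggestion might look like this (a sketch under the same assumptions: a UNIQUE key over the illustrative (uuid, mac) columns, and SQLAlchemy's expression layer with bound metadata; names and URL are placeholders):

    from sqlalchemy import MetaData, Table, and_, create_engine
    from sqlalchemy.exc import IntegrityError

    engine = create_engine('mysql://nova:nova@localhost/nova')  # placeholder
    meta = MetaData(bind=engine)
    bw_usage = Table('bw_usage_cache', meta, autoload=True)

    def bw_usage_upsert(uuid, mac, bw_in, bw_out):
        try:
            # INSERT first; the UNIQUE key turns the duplicate case into a
            # clean IntegrityError instead of a silent second row
            bw_usage.insert().values(uuid=uuid, mac=mac,
                                     bw_in=bw_in, bw_out=bw_out).execute()
        except IntegrityError:
            # the row already exists, so UPDATE it; note there is no
            # transaction held open across the two statements
            bw_usage.update().where(
                and_(bw_usage.c.uuid == uuid,
                     bw_usage.c.mac == mac)).values(
                bw_in=bw_in, bw_out=bw_out).execute()
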
21:53:11 <comstud> however, I was left with this lock wait timeout issue still.. which is a sep problem and I'm sure something in sqlalchemy
21:53:17 <devananda> hmmm
21:53:17 <comstud> that only seems to happen with thread pooling on
21:53:20 <devananda> yea, that's odd then
21:53:38 <dansmith> I think we've veered off nova meeting type stuff, anyone not in favor of ending?
21:53:56 <comstud> ya, sorry, was supposed to be a quick FYI ;)
21:54:01 <comstud> good discussion though
21:54:11 <boris-42> Hi all
21:54:17 <dansmith> don't stop discussing, I just want to end so I can run off at some point :)
21:54:21 <comstud> nod
21:54:26 <dansmith> #endmeeting