21:09:43 #startmeeting nova
21:09:44 Meeting started Thu Feb 7 21:09:43 2013 UTC. The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:09:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:09:47 The meeting name has been set to 'nova'
21:09:48 he'll be here
21:09:54 I'll #chair him when he shows up
21:10:01 cool
21:10:09 o/
21:10:18 o/
21:10:36 although there's no agenda apparently
21:11:06 do not commit future versions of oslo
21:11:12 heh
21:11:26 comstud: any chance rackspace could throw resources at the tempest gate problem?
21:11:35 not sure
21:11:36 1.5 hours to get a read on any patch is pretty brutal :(
21:11:37 will ask
21:11:57 and the way zuul works,
21:12:09 that can easily become six hours if you're unlucky and behind some bad patches
21:12:21 dansmith: you can always see the zuul work in progress
21:12:30 http://status.openstack.org/zuul
21:12:34 sdague: ...yeah of course
21:12:48 sdague: but that does nothing for getting a read on whether the full run is going to pass,
21:12:51 which is always the last one
21:13:23 dansmith: do we know where the problem is? spinning up envs? running the tests?
21:13:28 sorry, just catching up
21:13:43 dansmith: look, i brought resources.
21:13:50 westmaas: I think sdague knows more, but I think it's mostly cores and memory, surprise :)
21:13:52 it's just that we run a lot of tests now, and dansmith is impatient
21:13:58 lol
21:14:00 haha
21:14:33 tempest will get quicker once the testr conversion happens over the next couple weeks
21:14:34 if I'm the only one that thinks it's frustrating,
21:15:05 but we'll also have to look at bumping devstack node sizes
21:15:05 dansmith: I agree with you
21:15:11 then that's fine, but I go days sometimes without being able to get something to a mergeable state because I'm always waiting for runs
21:15:17 sdague: do you know what size you use now?
21:15:28 whatever is the 4G nodes
21:15:30 and the time is spent on test running, right?
21:15:31 and it compounds with every change you have stacked on top of another, which is often where I'm at
21:15:39 it's not as big of a deal for individual patches
21:15:42 not on bringing up the nodes
21:16:08 we get past devstack in about 9 minutes
21:16:56 there is detailed timing in all the logs
21:17:12 sdague: devstack setup is not parallelized, right?
21:17:20 nope
21:17:26 I wonder if some of it could be?
21:17:44 realize, we spent a lot of time squeezing out time from those runs
21:17:57 I'm just asking
21:18:02 totally happy for other people to look at it as well, but there's not much low hanging fruit in there :)
21:18:26 sdague: if the instances are 2 or 4 cores right now,
21:18:32 node size could be the biggest thing unless it's mostly about waiting idly for nodes to spin up
21:18:33 then the tempest testr change isn't going to help all that much,
21:18:36 I think they are 2 cores
21:18:38 which is why I was asking about larger nodes
21:18:52 westmaas: the tempest tests wait for guests to spin up
21:18:58 that's where a lot of time is
21:19:03 which is why testr will help
21:19:25 sdague: but chris was saying that the speedup is 50% of your core count, right?
21:19:40 we'll see what 2 cores does
21:20:14 guests might spin up a little faster on larger nodes. if someone wants to reach out to me offline about using more RS resources we can do that, would just want to understand how much we are using first.
21:20:19 okay, I thought he said that with 2 cores, four testr threads would actually go slower
21:20:34 with 2 cores there won't be 4 testr threads
21:20:45 sdague: you can run it with as many threads as you want, of course
21:20:57 dansmith: instead of speculating, why not wait for the actual testr review to hit tempest, then we'll have real data
21:21:01 sdague: he was trying to see if there was really any idle time to gain by overcommitting
21:21:35 yes, let's linearize the process!
21:21:45 anyway, I guess vishy isn't going to show up after all
21:21:58 I mean the easy answer is just turn back off a lot of tests, but the new tests already blocked breaks going in
21:22:10 sdague: has anyone suggested that? I sure haven't :)
21:22:35 :)
21:23:16 okay, so do we bail? No agenda, no rustlebee, no vishy
21:23:21 (and no-db-compute, by the way)
21:23:31 hehe
21:24:07 one quick comment: I have my BP ready for review https://blueprints.launchpad.net/nova/+spec/db-api-cleanup https://review.openstack.org/#q,topic:bp/db-api-cleanup,n,z
21:24:22 jog0: yeah, I've been meaning to look at it
21:24:35 jog0: I'd ask why so many of them are failing jenkins, but I know why :)
21:25:04 one of the patches is blocked on getting code into oslo.
21:25:17 and the new tempest tests have been very helpful
21:25:32 it seems nova/api/openstack/compute/contrib/security_groups.py has very few unit tests
21:25:44 yes, they found the last item in quantum_api for me :)
21:27:11 jog0: anything else?
21:27:22 dansmith: nope
21:27:27 anyone?
21:27:30 bueller?
21:28:24 just a comment maybe
21:29:18 db pooling becomes important when moving to conductor... yet we have something broken with our use of sqlalchemy that prevents db pooling from working.. seems that it's leaving transactions open.. thus innodb locks held.
21:29:33 queries end up timing out waiting on locks in mysql
21:29:47 working on tracking it down.
21:30:01 devananda: ^
21:30:09 comstud: why is this not showing up in devstack? does it take a while to crop up?
21:30:15 we were having a discussion yesterday or the day before
21:30:19 comstud: so how broken are things with the oslo sync that wasn't? it didn't even occur to me that markmc might have inbounded non-committed oslo code :)
21:30:20 does devstack have db pooling on?
21:30:30 comstud: I don't think so
21:30:32 sdague: most of the things out of sync have merged now it appears
21:30:46 dansmith: it seems to only show when db pooling is on
21:30:51 https://review.openstack.org/#/c/21461/
21:30:56 comstud: is that something we should enable? or would it be too hard to track down issues?
21:31:01 ^^ this is an improvement needed also to be committed to oslo and merged into nova
21:31:02 comstud: Error: "^" is not a valid command.
21:31:12 otherwise you get connect timeouts under load, etc
21:31:31 dansmith: it's certainly not ready yet
21:31:44 i dunno if devstack provides enough parallelism and load to see the issue anyway
21:32:00 okay
21:32:08 it's easy to reproduce in a test script that slams DB requests with pooling on
21:32:25 and conductor makes the problem worse because it takes the db hit for everyone and just hits the problem faster?
21:32:27 from a gate perspective, I don't think there is enough db traffic to have it make a difference
21:32:27 comstud: I wasn't able to see any difference in tempest or smoketests in devstack with db pool on
21:32:47 yeh, I'm not surprised jog0
21:32:51 dansmith: i have not run with nova-conductor yet, but I can tell you that yes it will.. because of that reason
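
For readers trying to picture the kind of test script mentioned at 21:32:08, something along these lines would do it: many concurrent writers fighting over one row through a pooled SQLAlchemy engine, while watching the database for InnoDB lock wait timeouts. This is only a sketch; the DSN, table name, columns, and concurrency numbers are invented for illustration and are not nova code.

    import concurrent.futures

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class BwUsage(Base):
        # stand-in for a nova-style usage table, not the real schema
        __tablename__ = 'bw_usage_load_test'
        id = Column(Integer, primary_key=True)
        uuid = Column(String(36))
        bw_in = Column(Integer, default=0)

    # hypothetical DSN; pool_size approximates running with db pooling on
    engine = create_engine('mysql+pymysql://nova:secret@localhost/nova_load',
                           pool_size=20, max_overflow=0)
    Session = sessionmaker(bind=engine)
    Base.metadata.create_all(engine)

    def slam(uuid):
        # SELECT, then UPDATE-or-INSERT, committed as one transaction: the
        # pattern under suspicion, with every worker contending for one row
        session = Session()
        try:
            row = session.query(BwUsage).filter_by(uuid=uuid).first()
            if row:
                row.bw_in += 1
            else:
                session.add(BwUsage(uuid=uuid, bw_in=1))
            session.commit()
        finally:
            session.close()

    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        for _ in range(1000):
            pool.submit(slam, 'a-single-contended-uuid')
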
21:32:58 comstud: okay
21:33:04 dansmith: I see it with global nova-cells which takes DB updates from all child cells
21:33:21 conductor and global nova-cells have nearly the same function
21:33:21 I'm embarrassed to say that I don't really know what this means.. is this avoiding connection setup and teardown for every query?
21:33:28 the big timing issues in tempest are starting guests and cleaning them up, which we have to do carefully because it's easy to overrun memory in the gate
21:33:51 jog0: ya, I'm not surprised.. I don't think there's probably enough parallelism in devstack tests
21:34:03 so I think any perf issues you have are just something else
21:34:08 Oh... one other thing to note :)
21:34:19 if you ditch sqlalchemy and do raw mysql queries with MySQLdb...
21:34:25 the problems go away and things are SIGNIFICANTLY faster.
21:34:30 heh
21:34:47 even when doing serialized DB calls
21:34:50 with no thread pooling.
21:34:52 comstud: what about if you only use low-level sqlalchemy and not the top ORM layer?
21:35:02 dripton: I was going to try that next
21:35:10 comstud: +1
21:35:18 we're getting closer to not even needing the ORM layer for much, right?
21:35:20 although I'm focusing on just trying to find where sqlalchemy is holding transactions open
21:35:22 with jog0's patches
21:35:52 well you could at least write a non-sqlalchemy version of the api
21:35:55 after those
21:36:00 yea
21:36:01 dansmith: we currently use the ORM layer all over, and I don't think jog0's patches change that
21:36:11 dansmith: what dripton said
21:36:20 hrm..
21:36:29 I don't get that
21:36:43 dripton: the patch prevents sqlalchemy objects from being returned from nova.db.api functions
21:37:02 In a way it gets you closer, I guess
21:37:07 after that removing ORM should be easier
21:37:08 jog0: right but we still use ORM inside the api functions
21:37:15 dripton: yeahhhh
21:37:24 for now
21:37:24 you need to rewrite almost all of sqlalchemy/api.py :)
21:37:25 jog0: okay, we agree. Not gone yet but maybe in H
21:37:25 still
21:37:31 dripton: yeh, but you now *could* write an alternate api implementation
21:37:38 it might as well be an alternate api impl
21:37:41 yes
21:37:43 sdague: agreed, we're going the right way
21:37:56 we still need to prove that *low-level* sqlalchemy doesn't have the performance problem
21:38:03 hope not, because it buys us portability
21:38:14 yeh, losing portability would really suck
21:38:16 depends on how low level we need to go
21:38:46 anyway, we'll see. first thing is just trying to find what's broken
21:38:50 performance after
21:38:59 comstud: there are subtle typing things between even mysql and pg, so it's more than you might imagine
21:39:06 nod
21:39:20 my db-archiving patch (not done yet) uses only low-level sqlalchemy. I was worried someone might complain about that but it looks like we're headed that way anyway.
21:39:23 but maybe do custom things for just a few of the most expensive bits
21:39:55 I was considering starting a mysql-only api implementation
21:40:02 which falls back to sqlalchemy for methods not implemented yet
21:40:10 comstud: +1
21:40:13 ++
21:40:16 which gives you a path to migrate if you're using mysql
21:40:17 comstud: before being that extreme please try with low-level sqlalchemy
21:40:18 dripton: you mean just doing .insert().values().execute() kind of things?
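
As a rough illustration of the split being discussed above: the same write done through the ORM layer versus the lower-level .insert().values().execute() style dripton refers to. This is a sketch only; the table, columns, and sqlite backend are chosen to keep it self-contained and are not nova's actual models.

    from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    engine = create_engine('sqlite://')  # any backend works for the comparison

    # --- ORM style: mapped class, identity map, unit-of-work session ---
    Base = declarative_base()

    class Instance(Base):
        __tablename__ = 'instances_demo'
        id = Column(Integer, primary_key=True)
        uuid = Column(String(36))
        vm_state = Column(String(255))

    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)
    session = Session()
    session.add(Instance(uuid='fake-uuid', vm_state='building'))
    session.commit()

    # --- Core ("low-level") style: plain Table objects and SQL expressions ---
    metadata = MetaData()
    instances = Table('instances_demo', metadata, autoload_with=engine)

    with engine.begin() as conn:
        conn.execute(instances.update()
                     .where(instances.c.uuid == 'fake-uuid')
                     .values(vm_state='active'))
        row = conn.execute(instances.select()
                           .where(instances.c.uuid == 'fake-uuid')).first()
        print(row.vm_state)  # plain row; no mapped object tied to a session
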
21:40:23 sdague: yes
21:40:29 dripton: nod
21:40:41 dripton: yeh, I was doing that in the data injection for migrations, because I can't use the models
21:40:45 it's not too bad
21:40:56 it would be interesting if that sped us up
21:41:15 a chunk of the issues i've seen appear to come from the way sqlalchemy automagically creates transactions
21:41:49 even when there's no explicit "with session.begin", there is still a transaction
21:42:29 doing the lower level calls directly might be circumventing that?
21:42:51 devananda: I *think* you can go transactionless with low-level sqlalchemy, but I need to test
21:43:08 my code is using "with session.begin"
21:43:12 devananda: yeah, that's part of the problem I'm positive
21:43:27 i changed a DB call to remove session.begin()
21:43:42 and I don't get any deadlocks anymore... but I still get lock wait timeouts.
21:43:50 but that means sqlalchemy is not committing somewhere
21:43:52 AFAIK
21:43:55 right
21:44:54 comstud: the real fix to the stuff we were discussing is, IMNSHO, to add UNIQUE indexes
21:45:11 using a mysql-specific extension (INSERT .. ON DUP KEY UPDATE) is then possible
21:45:21 and pgsql has a similar capability which they call upserts
21:45:30 it's just a bit trickier (have to use a trigger or such)
21:45:36 sqlite is screwed tho :)
21:45:58 I don't think anyone would really use sqlite in production
21:46:11 trying to enforce uniqueness with locking outside of the database is prone to all kinds of problems (like these)
21:46:50 devananda: unfortunately there's no portable way to do it in sqlalchemy
21:46:54 the ON DUP KEY UPDATE
21:47:18 can we extend sqlalchemy for it?
21:47:24 but even without that deadlock problem.. I'm getting the lock wait timeouts
21:47:27 at the end of the day it's just a meta compiler
21:47:42 sdague: there was a way to compile it onto the end of a query
21:47:52 seems like we could have an .insert_update method and backends
21:48:00 but you'd need to figure out how to do it for pg, mysql, sqlite, etc :)
21:48:04 comstud: IIUC, the lock wait timeouts are coming from the same thing -- >1 thread trying to lock the same rows with an UPDATE statement, when it really just wants to INSERT.
21:48:30 devananda: and innodb doesn't detect a deadlock?
21:48:31 comstud: yeh, but if we need to figure that out anyway for openstack code, might as well monkey patch sqlalchemy for it
21:48:42 I've done it with sqlalchemy the manual way: inside a transaction, select, if found then update else insert.
21:48:51 ugly but portable
21:49:11 comstud: if neither thread has the lock and both want it (and can't get it), then it deadlocks.
21:49:14 dripton: yeh, if we could encapsulate it nicely, at least the ugly could live off in a method somewhere
21:49:30 comstud: if one thread has the lock and just sits on it, other threads will timeout. so yes, that problem is sqlalchemy not committing properly
21:49:44 but it's holding the lock because it did an UPDATE on 0 rows
21:49:44 yeah, that's the issue it seems
21:49:47 (i think)
21:49:49 I'm trying to track down where it's sitting
21:49:50 and why
21:50:13 dripton: that "portable way" is precisely the cause of this problem :(
21:50:22 sigh
21:50:57 dripton: if the row doesn't exist, both trx will insert (and one should fail, but right now, won't because there's no UNIQUE key, leading to data inconsistency)
21:50:57 stackoverflow says sqlalchemy has session.saveorupdate, but that might be at the wrong level
21:51:42 Can we add unique keys wherever we need them?
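
The "compile it onto the end of a query" idea being circled above might look roughly like the sketch below, using sqlalchemy.ext.compiler to append MySQL's ON DUPLICATE KEY UPDATE clause. This is only an illustration of the technique, not something the team adopted here (newer SQLAlchemy ships a native on_duplicate_key_update() on the MySQL dialect); the InsertOnDuplicateUpdate name and the bw_usage table in the usage comment are invented.

    from sqlalchemy.ext.compiler import compiles
    from sqlalchemy.sql.expression import Insert

    class InsertOnDuplicateUpdate(Insert):
        """Marker construct: an INSERT that should upsert on duplicate key."""

    @compiles(InsertOnDuplicateUpdate, 'mysql')
    def _append_on_duplicate_key(insert, compiler, **kw):
        # Render the normal INSERT, then tack the MySQL-only clause on the end.
        # Other dialects fall back to the standard Insert compilation.
        sql = compiler.visit_insert(insert, **kw)
        updates = ', '.join('%s = VALUES(%s)' % (c.name, c.name)
                            for c in insert.table.columns
                            if not c.primary_key)
        return '%s ON DUPLICATE KEY UPDATE %s' % (sql, updates)

    # usage sketch (bw_usage is a hypothetical Table with a UNIQUE key on uuid):
    #   conn.execute(InsertOnDuplicateUpdate(bw_usage).values(uuid=uuid, bw_in=bw_in))
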
21:51:48 dripton: the sqlalchemy code isn't that bad, I'd say just read the code :)
21:51:49 yes :)
21:51:51 i added them
21:51:55 in the method i was testing
21:51:57 (bw usage updates)
21:52:08 and it didn't help
21:52:12 so uniques are possible now that soft_deletes set deleted=
21:52:36 comstud: it should help if you change the SQL
21:52:53 comstud: either use INSERT .. ON DUP KEY, or remove the SELECT and do this
21:52:55 try:
21:52:56 INSERT
21:52:57 yeah, ok, I had to... and I did get deadlocks to go away
21:53:02 except Duplicate:
21:53:05 UPDATE
21:53:08 with no transaction around them
21:53:11 however, I was left with this lock wait timeout issue still.. which is a separate problem and I'm sure something in sqlalchemy
21:53:17 hmmm
21:53:17 that only seems to happen when thread pooling
21:53:20 yea, that's odd then
21:53:38 I think we've veered off nova meeting type stuff, anyone not in favor of ending?
21:53:56 ya, sorry, was supposed to be a quick FYI ;)
21:54:01 good discussion though
21:54:11 Hi all
21:54:17 don't stop discussing, I just want to end so I can run off at some point :)
21:54:21 nod
21:54:26 #endmeeting
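
For reference, the insert-first shape spelled out at 21:52:53 through 21:53:08 looks roughly like the sketch below. It relies on the UNIQUE key rejecting the second INSERT, and it never holds a transaction across both statements. The function name, table, and columns are invented for illustration rather than taken from nova's bw_usage code.

    from sqlalchemy.exc import IntegrityError

    def bw_usage_upsert(engine, bw_usage, uuid, bw_in):
        """Try the INSERT first; let the UNIQUE key reject duplicates, then UPDATE.

        No transaction spans the two statements, so no row lock is carried
        between them.
        """
        try:
            with engine.begin() as conn:             # short-lived transaction
                conn.execute(bw_usage.insert().values(uuid=uuid, bw_in=bw_in))
        except IntegrityError:                       # duplicate key: row exists
            with engine.begin() as conn:             # separate short transaction
                conn.execute(bw_usage.update()
                             .where(bw_usage.c.uuid == uuid)
                             .values(bw_in=bw_in))
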