16:02:05 <jgriffith> #startmeeting cinder
16:02:06 <openstack> Meeting started Wed Oct 2 16:02:05 2013 UTC and is due to finish in 60 minutes. The chair is jgriffith. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:09 <openstack> The meeting name has been set to 'cinder'
16:02:14 <jgriffith> Hey ho everyone
16:02:17 <med_> \o
16:02:20 <caitlin_56> hello
16:02:21 <zhiyan> hello
16:02:24 <jungleboyj> Heyoooo!
16:02:25 <DuncanT> hey
16:02:25 <jjacob512> hello
16:02:28 <kmartin> hey
16:02:29 <bpb> hey
16:02:39 <bill_az_> Hi all
16:02:42 <avishay> hello all
16:02:46 <eharney> hi
16:02:51 <xyang_> hi
16:03:03 <dosaboy> gooday
16:03:09 <bswartz> .o/
16:03:10 <thingee> o/
16:03:20 <jungleboyj> What a crowd.
16:03:29 <jgriffith> DuncanT: you've got a number of things on the agenda, you want to start?
16:04:03 <jgriffith> DuncanT: you about?
16:04:09 <DuncanT> jgriffith: Once I'd gone through them all, most of them ended up being fix committed. I can only find 2 taskflow bugs though
16:04:15 <DuncanT> And last week suggested 3
16:04:27 <jgriffith> did you log a bug/bugs?
16:04:52 <DuncanT> jgriffith: These are all from last week's summary
16:05:06 <jgriffith> DuncanT: which *These*
16:05:11 <jgriffith> You mean the white-list topic?
16:05:19 <jgriffith> #topic TaskFlow
16:05:20 <DuncanT> Yeah
16:05:24 <jgriffith> Ok..
16:05:31 <jgriffith> so we had two bugs that are in flight
16:05:37 <jgriffith> I've asked everybody to please review
16:05:53 <jgriffith> the white-list issue: a number of people objected to reversing that
16:06:00 <hemna_> which reviews ?
16:06:08 <eharney> i just put a -0 on 49103, but i think it's ok
16:06:31 <jgriffith> hemna_: go to https://launchpad.net/cinder/+milestone/havana-rc1
16:06:38 <hemna_> thnx
16:06:48 <jgriffith> hemna_: anything that's "In Progress" needs a review if it's not in flight
16:07:30 <DuncanT> All four seem to be in flight now
16:07:31 <jgriffith> hemna_: There's actually only like 3 patches that I'm waiting on, one of them is yours :)
16:07:46 <jgriffith> DuncanT: Oh yeah!!
16:07:47 <hemna_> I need your iscsi patch to land
16:07:51 <jgriffith> My cry for help worked
16:08:02 <jungleboyj> :-)
16:08:02 <med_> +1
16:08:21 <hemna_> then I'll refactor mine (iser) to remove the volumes_dir conf entry
16:08:27 <hemna_> as it's a dupe
16:08:32 <hemna_> in both our patches
16:08:39 <jgriffith> hemna_: k.. if you need to you can cherry pick and make a dep
16:08:48 <jgriffith> hemna_: but hopefully gates are moving along still this morning
16:08:55 <avishay> don't jinx it...
16:09:06 <jgriffith> eeesssh... yeah, sorry :(
16:09:25 * jungleboyj is knocking on wood.
16:09:28 <jgriffith> DuncanT: what else on TaskFlow did you have (think we got side-tracked)
16:09:50 <DuncanT> jgriffith: My only question is that last week's summary said 3 bugs, and I could only find 2
16:10:09 <DuncanT> If there are no more real bugs, I'll stop worrying
16:10:37 <jgriffith> DuncanT: well, for H I *hope* we're good
16:10:50 <jgriffith> DuncanT: For Icehouse I think we have some work to do
16:11:08 <jgriffith> ie white-list versus black-list debate :)
16:11:34 <DuncanT> Sure. Hopefully somebody can take that debate to the summit?
16:11:36 <avishay> jgriffith: i don't know if you want to discuss this now, but i was wondering what the policy would be for new features in Icehouse - taskflow only?
16:11:50 <jgriffith> #topic Icehouse
16:12:04 <jgriffith> avishay: not sure what you mean?
16:12:16 <jgriffith> I hope that taskflow isn't the only thing we work on in I :)
16:12:23 <hemna_> the policy for new features? we add them no?
16:12:24 <avishay> jgriffith: if i'm submitting retype for example, should it use taskflow?
16:12:30 <jgriffith> although that seems to be everybody's interest lately
16:12:36 <thingee> jgriffith: not me
16:12:38 <jgriffith> avishay: OHHHH... excellent question!
16:12:39 <thingee> api all the way
16:12:40 <hemna_> :P
16:12:43 <jgriffith> thingee: :)
16:12:58 <caitlin_56> I think that favoring new features via taskflow would be a great idea.
16:13:03 <jgriffith> avishay: TBH I'm not sure how I feel about that yet
16:13:21 <hemna_> avishay, so that kinda begs the question about taskflow, are we propagating it to all of the driver apis ?
16:13:22 <jungleboyj> jgriffith: The goal is to eventually get everything there, right?
16:13:23 <jgriffith> caitlin_56: perhaps, but perhaps not
16:13:27 <avishay> I hope I'll have time to convert migration and retype to use taskflow for Icehouse, but can't promise
16:13:42 <caitlin_56> We shouldn't force things to be taskflows that are not naturally.
16:13:49 <jgriffith> TBH I wanted to have some discussions about taskflow at the summit
16:14:00 <hemna_> jgriffith, ok cool, same here.
16:14:03 <jgriffith> I'd like to get a better picture of benefits etc and where it's going and when
16:14:05 <avishay> hemna: i think for something simple like extend volume we don't need it, but for more complex things it could be a good idea
16:14:27 <caitlin_56> summit discussions are good
16:14:27 <hemna_> avishay, well I think there could be a case made for even the simple ones.
16:14:31 <jgriffith> avishay: I think you're right, the trick is that "some here, some there" is a bit awkward
16:14:35 <avishay> anyway, something to think about until hong kong
16:14:40 <DuncanT> I'd certainly like a chance to discuss some of the weaknesses of the current taskflow implementation
16:14:50 <jgriffith> avishay: yeah, so long as you don't mind the wait
16:15:04 <hemna_> I was kind of hoping that taskflowing most things would lead to safe restart of cinder and all of its services.
16:15:04 <jgriffith> Ok, I think we all seem to agree here
16:15:16 <hemna_> a la safe shutdown/resume
16:15:17 <jgriffith> hemna_: I think it will, that's the point
16:15:27 <hemna_> coolio
16:15:34 <jgriffith> we need to get more educated and help harlow :)
16:15:38 <hemna_> yah
16:15:47 <jgriffith> I'd also like to find out more about community uptake
16:15:49 <jgriffith> anyway...
16:15:50 <avishay> yep
16:15:50 <kmartin> already a session for what's next in taskflow: http://summit.openstack.org/cfp/details/117
16:15:55 <hemna_> I already have a long list of my wants for I :P
16:16:02 <caitlin_56> I've been working with harlow already.
16:16:11 <jgriffith> I think we're still going that direction, we just need to organize. We don't want another Brick debacle :)
16:16:12 <avishay> kmartin: nice!
16:16:19 <hemna_> hey now
16:16:27 <jgriffith> hemna_: that was directed at ME
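For readers unfamiliar with the library under discussion, here is a minimal sketch of the TaskFlow pattern, assuming the task/linear_flow/engines API of that era; the two tasks and the store values are illustrative, not Cinder's actual create_volume flow. The revert() hook is what makes the rollback and safe shutdown/resume behavior hemna_ mentions possible.

    # Illustrative TaskFlow sketch; tasks and values are made up for this example.
    from taskflow import engines, task
    from taskflow.patterns import linear_flow

    class ReserveQuota(task.Task):
        def execute(self, size):
            print('reserving %s GB of quota' % size)

        def revert(self, size, **kwargs):
            # Called automatically if a later task in the flow fails,
            # giving each step a well-defined rollback path.
            print('releasing %s GB reservation' % size)

    class CreateVolume(task.Task):
        def execute(self, size):
            print('creating %s GB volume' % size)

    flow = linear_flow.Flow('create-volume')
    flow.add(ReserveQuota(), CreateVolume())
    # The engine pulls task arguments (here, 'size') from the store by name.
    engines.run(flow, store={'size': 10})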
16:16:51 <jgriffith> #topic quota-syncing
16:17:02 <jgriffith> DuncanT: you're correct, that's still hanging out there
16:17:25 <jgriffith> DuncanT: I looked at it a bit but quite frankly I ran away screaming
16:17:41 <DuncanT> jgriffith: It made my head hurt too
16:17:57 <jgriffith> I'd like to just drop quotas altogether :)
16:18:05 <bswartz> ha
16:18:08 <jgriffith> ;)
16:18:17 <guitarzan> quota syncing?
16:18:24 <jgriffith> guitarzan: yes
16:18:25 <avishay> guitarzan: https://bugs.launchpad.net/cinder/+bug/1202896
16:18:27 <uvirtbot> Launchpad bug 1202896 in nova "quota_usage data constantly out of sync" [High,Confirmed]
16:18:33 <guitarzan> ahh
16:19:02 <jgriffith> every time I mess with quotas I want to die, but...
16:19:16 <jgriffith> I also think that there are just fundamental issues with the design
16:19:20 <caitlin_56> No quotas are better than quotas enforced at the wrong locations.
16:19:29 <jgriffith> Might be something worth looking at for I???
16:19:54 <jgriffith> Don't all volunteer at once now!
16:20:02 <eharney> i would seriously consider the suggestion in that bug to replace the usage table w/ a view if possible
16:20:29 <guitarzan> that's an interesting idea, but it doesn't really tell you if the resource is being used or not
16:20:37 <guitarzan> especially in the error cases
16:20:45 <DuncanT> I'm not sure that scales with large numbers of volumes and users, unfortunately
16:20:50 <jgriffith> DuncanT: +1
16:20:58 <jgriffith> I think scale is the big concern with that
16:21:07 <caitlin_56> guitarzan: I agree. We need definitions that deal with real resource usage. Otherwise we're enforcing phony quotas.
16:21:08 <jgriffith> However I think we could do something creative there
16:21:13 <jgriffith> DB caching etc
16:21:36 <jgriffith> anyway... I don't think we're going to solve it here in the next 40 minutes :)
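For concreteness, eharney's "usage table as a view" suggestion amounts to deriving usage on demand instead of maintaining a counter that can drift. A minimal sketch follows, assuming SQLAlchemy and Havana-era Volume model/column names; volume_usage() is a hypothetical helper, not an existing Cinder API. DuncanT's scaling concern is visible here: every quota check becomes an aggregate query over the volumes table.

    # Hypothetical sketch of computing quota usage from the source of truth.
    # Model and column names are assumed to match Cinder's Havana-era schema.
    from sqlalchemy import func

    from cinder.db.sqlalchemy import models

    def volume_usage(session, project_id):
        """Count in-use volumes and gigabytes straight from the volumes table,
        so the numbers cannot get out of sync with reality."""
        count, gigabytes = session.query(
            func.count(models.Volume.id),
            func.coalesce(func.sum(models.Volume.size), 0),
        ).filter(
            models.Volume.project_id == project_id,
            models.Volume.deleted == False,  # noqa: E712
        ).one()
        return count, gigabytes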
16:21:40 <DuncanT> I attempted to write a tool that checked the current quota looked valid, and ran it periodically while doing many ops in a tight loop, but couldn't provoke the out-of-sync issue
16:22:04 <guitarzan> DuncanT: I think you just have to get something in an error state so you can delete it multiple times
16:22:23 <jgriffith> guitarzan: maybe we should focus on fixing that instead?
16:22:26 <DuncanT> guitarzan: Ah, ok, that I can provoke
16:22:28 <jgriffith> go abou tit the other way
16:22:31 <jgriffith> about
16:22:44 <guitarzan> jgriffith: I think that's totally fixable
16:22:50 <jgriffith> did somebody say State Machine (again)
16:22:57 <DuncanT> guitarzan: Is there a specific bug for that scenario? Sounds like low hanging fruit....
16:23:01 <guitarzan> I wasn't going to say that :)
16:23:10 <guitarzan> DuncanT: I don't know, I'm just reading the bug
16:23:17 <jgriffith> :)
16:23:34 <guitarzan> I have been able to mess up quotas before, negative
16:23:36 <jgriffith> DuncanT: there is not, and it's not as low hanging as one would hope IMO
16:23:46 <DuncanT> Bugger
16:23:59 <jgriffith> There's a number of little *holes* that we can run into
16:24:14 <jgriffith> anyway... quotas aside those are things that I'd really like to see us work on for I
16:24:28 <jgriffith> exceptions and exception handling falls in that category
16:24:36 <DuncanT> Hmmm, I'm wondering if a runtime fault injection framework might make reproducing these issues easier?
16:24:37 <jungleboyj> jgriffith: +1
16:24:38 <avishay> again, state machine
16:24:40 <jgriffith> having a better picture of what happened back up at the manager
16:24:43 <jgriffith> avishay: :)
16:24:49 <jungleboyj> I have seen several issues with deleting.
16:25:03 <jungleboyj> Also think the issue of exceptions goes along with the taskflow issue. :-)
16:25:06 <jgriffith> DuncanT: perhaps, but you can also just pick random points in a driver and raise some exception
16:25:10 <jgriffith> that works really well :)
16:25:16 <thingee> jgriffith> ended up just writing something to correct the quota that we use internally
16:25:22 <med_> DuncanT infectedmonkeypatch?
16:25:27 <jgriffith> thingee: Oh?
16:25:35 <thingee> jgriffith: that's just a bandaid fix though
16:25:37 <DuncanT> med: Sounds promising. I'll have a google
16:25:59 <jgriffith> thingee: might be something to pursue if DH is interested in sharing
16:26:02 * med_ made that up so google will likely fail miserably
16:26:07 <jgriffith> thingee: if nothing else experience
16:26:12 <DuncanT> jgriffith: I was pretty much thinking of formalising that approach so we can test it reproducibly
16:26:19 <jgriffith> the *experience* you guys have would be helpful
16:26:47 <jgriffith> DuncanT: Got ya.. if we just implement a State Machine it's covered :)
16:26:47 <thingee> jgriffith: I think it just wasn't put upstream because it was a bad hack. but yeah we can take ideas from that.
16:26:51 <jgriffith> Just sayin
16:27:02 * jgriffith promises to not say *State Machine* again
16:27:13 <jgriffith> thingee: coolness
16:27:49 <jgriffith> okie dokie
16:28:05 <jgriffith> DuncanT: what else you got for us?
16:28:13 <DuncanT> I'm all out I think
16:28:15 * jgriffith keeps putting DuncanT on the spot
16:28:34 <DuncanT> Most of my stuff is summit stuff now
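DuncanT's fault-injection idea could start as small as a decorator that makes driver calls fail at random, formalising jgriffith's "pick random points in a driver and raise some exception" approach. A hypothetical sketch, not an existing framework; all names here are illustrative.

    # Hypothetical fault-injection helper for reproducing quota/state leaks.
    import functools
    import random

    def inject_faults(failure_rate=0.1, exc=Exception):
        """Make the wrapped call fail some fraction of the time."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                # Randomly blow up instead of doing the real work.
                if random.random() < failure_rate:
                    raise exc('injected fault in %s' % fn.__name__)
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    # Usage: decorate a driver entry point, run create/delete in a tight loop,
    # and periodically compare recorded quota usage against actual resources.
    @inject_faults(failure_rate=0.2)
    def delete_volume(volume):
        pass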
16:28:41 <jgriffith> Ok, I just wanted to catch folks up on the gating disaster over the past few days
16:28:47 <jgriffith> #topic gating issues
16:28:50 <hemna_> ugh
16:28:57 <jgriffith> so I'm sure you all noticed jobs failing
16:29:02 <hemna_> jgriffith, jenkins puked on https://review.openstack.org/#/c/48528/
16:29:03 <jungleboyj> wheeee!
16:29:08 <jgriffith> but not sure how many people kept updated on what was going on
16:29:16 <jungleboyj> When were jobs failing?
16:29:34 <jgriffith> There were a number of intermittent failures that were in various projects
16:29:47 <dosaboy> was mainly broken neutron gate test no?
16:29:50 <jgriffith> I also think that some bugs in projects exposed bugs in other projects etc etc
16:30:00 <DuncanT> hemna: That looks like a straight merge failure, manual rebase should sort it
16:30:02 <jgriffith> dosaboy: no
16:30:08 <jgriffith> dosaboy: it was really a mixed bag
16:30:13 <dosaboy> ack
16:30:19 <jgriffith> Cinder, neutron, nova, keystone...
16:30:20 <jungleboyj> Apparently the one in Neutron was one that had been there for some time but it was a timing thing that was suddenly uncovered.
16:30:30 <jgriffith> jungleboyj: +1
16:30:40 <jgriffith> So anyway....
16:31:04 <jgriffith> things are stabilizing a bit, but here's the critical take away for right now
16:31:20 <jgriffith> the recheck bug xxx is CRITICAL to track this stuff
16:31:48 <jgriffith> and even though the elastic search recommendation that pops up is sometimes pretty good, other times it's wayyy off
16:32:08 <jgriffith> we really need to make sure we take a good look at the recheck bugs and see if something fits, and if not log a new bug
16:32:20 <jgriffith> if you don't know where to log it against, log it against tempest for now
16:32:39 <jgriffith> best way to create these is to use the failing test's *name* as the bug title
16:32:51 <jgriffith> this makes it easier for people that encounter it later to identify
16:32:55 <jungleboyj> jgriffith: +1
16:33:11 <jgriffith> so like "TestBlahBlah.test_some_stupid_thing Fails"
16:33:25 <avishay> also, if something is already approved, make sure to do 'reverify bug xxx' and not recheck
16:33:30 <jgriffith> include a link to the gate/log pages
16:33:37 <jgriffith> avishay: +1
16:33:57 <jungleboyj> Yeah, sorry for the ones I rechecked before I learned that tidbit.
16:34:00 <avishay> sucks when jenkins finally passes and need to send it through again :)
16:34:14 <jgriffith> also take a look at this: http://paste.openstack.org/show/47798
16:34:37 <jgriffith> particularly the last one
16:34:44 <jgriffith> Failed 385 times!!!
16:34:48 <jgriffith> that's crazy stuff
16:34:58 <med_> ouch.
16:34:59 <jgriffith> I wasn't even aware of it until it was at 300
16:35:03 <hemna_> doh
16:35:25 <jgriffith> BTW that wasn't the worst one :)
16:35:28 <jgriffith> anyway...
16:35:49 <jgriffith> I did some queries last night on those and updated when last seen etc
16:36:23 <jgriffith> that big one 1226337 pretty much died out a few days ago after the fix I put in (break out of the retry loop)
16:36:36 <jgriffith> but still hit occasionally
16:36:49 <jgriffith> It's an issue with tgtd IMO
16:36:57 <jgriffith> It's not as robust as one might like
16:37:12 <jgriffith> so the follow up is a recovery attempt to create the backing lun explicitly
16:37:15 <jgriffith> anyway...
16:37:26 <jgriffith> the other item: 1223469
16:37:47 <jgriffith> I wanted to point that out because I made a change that does a recovery but still logs the error message in the logs
16:38:12 <jgriffith> this seemed like a good idea at the time, but the query writers grabbed on to that and still query on it
16:38:28 <jgriffith> even though it recovers and doesn't fail it still gets dinged in the queries
16:38:49 <jgriffith> so I think I should change it to a warning and change the wording to throw them off the scent :)
16:39:10 <jgriffith> but I wanted to go through these to try and keep everybody informed of what was going on
16:39:30 <jgriffith> I spent most of the last 3 VERY long days monitoring gates and poring over logs
16:39:42 <avishay> cool, thanks for the update and the work!
16:40:07 <jgriffith> Hoping that if/when we hit this sort of thing again we'll have a whole team working on it :)
16:40:12 <jungleboyj> avishay: +2
16:40:14 <jgriffith> Ok, that's all I have...
16:40:18 <jgriffith> anybody else?
16:40:23 <jgriffith> #topic open-discussion
16:41:01 <dosaboy> drinks are on the house!
16:41:01 <jgriffith> going twice....
16:41:13 <jgriffith> dosaboy: Your house? Ok, I'm in :)
16:41:17 <dosaboy> :)
16:41:23 <jungleboyj> Yay!
16:41:25 <jgriffith> going three times...
16:41:31 <jgriffith> #endmeeting