16:02:05 <jgriffith> #startmeeting cinder
16:02:06 <openstack> Meeting started Wed Oct  2 16:02:05 2013 UTC and is due to finish in 60 minutes.  The chair is jgriffith. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:09 <openstack> The meeting name has been set to 'cinder'
16:02:14 <jgriffith> Hey ho everyone
16:02:17 <med_> \o
16:02:20 <caitlin_56> hello
16:02:21 <zhiyan> hello
16:02:24 <jungleboyj> Heyoooo!
16:02:25 <DuncanT> hey
16:02:25 <jjacob512> hello
16:02:28 <kmartin> hey
16:02:29 <bpb> hey
16:02:39 <bill_az_> Hi all
16:02:42 <avishay> hello all
16:02:46 <eharney> hi
16:02:51 <xyang_> hi
16:03:03 <dosaboy> gooday
16:03:09 <bswartz> .o/
16:03:10 <thingee> o/
16:03:20 <jungleboyj> What a crowd.
16:03:29 <jgriffith> DuncanT: you've got a number of things on the agenda, you want to start?
16:04:03 <jgriffith> DuncanT: you about?
16:04:09 <DuncanT> jgriffith: Once I'd gone through them all, most of them ended up being fix committed. I can only find 2 taskflow bugs though
16:04:15 <DuncanT> And last week suggested 3
16:04:27 <jgriffith> did you log a bug/bugs?
16:04:52 <DuncanT> jgriffith: These are all from last week's summary
16:05:06 <jgriffith> DuncanT: which *These*
16:05:11 <jgriffith> You mean the white-list topic?
16:05:19 <jgriffith> #topic TaskFlow
16:05:20 <DuncanT> Yeah
16:05:24 <jgriffith> Ok..
16:05:31 <jgriffith> so we had two bugs that are in flight
16:05:37 <jgriffith> I've asked everybody to please review
16:05:53 <jgriffith> on the white-list issue, a number of people objected to reversing that
16:06:00 <hemna_> which reviews ?
16:06:08 <eharney> i just put a -0 on 49103, but i think it's ok
16:06:31 <jgriffith> hemna_: go to https://launchpad.net/cinder/+milestone/havana-rc1
16:06:38 <hemna_> thnx
16:06:48 <jgriffith> hemna_: anything that's "In Progress" needs a review if it's not in flight
16:07:30 <DuncanT> All four seem to be in flight now
16:07:31 <jgriffith> hemna_: There's actually only like 3 patches that I'm waiting on, one of them is yours :)
16:07:46 <jgriffith> DuncanT: Oh yeah!!
16:07:47 <hemna_> I need your iscsi patch to land
16:07:51 <jgriffith> My cry for help worked
16:08:02 <jungleboyj> :-)
16:08:02 <med_> +1
16:08:21 <hemna_> then I'll refactor mine (iser) to remove the volumes_dir conf entry
16:08:27 <hemna_> as it's a dupe
16:08:32 <hemna_> in both our patches
16:08:39 <jgriffith> hemna_: k.. if you need to you can cherry pick and make a dep
16:08:48 <jgriffith> hemna_: but hopefully gates are moving along still this morning
16:08:55 <avishay> don't jinx it...
16:09:06 <jgriffith> eeesssh... yeah, sorry :(
16:09:25 * jungleboyj is knocking on wood.
16:09:28 <jgriffith> DuncanT: what else on TaskFlow did you have (think we got side-tracked)
16:09:50 <DuncanT> jgriffith: My only question is that last week's summary said 3 bugs, and I could only find 2
16:10:09 <DuncanT> If there are no more real bugs, I'll stop worrying
16:10:37 <jgriffith> DuncanT: well, for H I *hope* we're good
16:10:50 <jgriffith> DuncanT: For Icehouse I think we have some work to do
16:11:08 <jgriffith> ie white-list versus black-list debate :)
16:11:34 <DuncanT> Sure. Hopefully somebody can take that debate to the summit?
16:11:36 <avishay> jgriffith: i don't know if you want to discuss this now, but i was wondering what the policy would be for new features in Icehouse - taskflow only?
16:11:50 <jgriffith> #topic Icehouse
16:12:04 <jgriffith> avishay: not sure what you mean?
16:12:16 <jgriffith> I hope that taskflow isn't the only thing we work on in I :)
16:12:23 <hemna_> the policy for new features?  we add them no?
16:12:24 <avishay> jgriffith: if i'm submitting retype for example, should it use taskflow?
16:12:30 <jgriffith> although that seems to be everybody's interest lately
16:12:36 <thingee> jgriffith: not me
16:12:38 <jgriffith> avishay: OHHHH... excellent question!
16:12:39 <thingee> api all the way
16:12:40 <hemna_> :P
16:12:43 <jgriffith> thingee: :)
16:12:58 <caitlin_56> I think that favoring new features via taskflow would be a great idea.
16:13:03 <jgriffith> avishay: TBH I'm not sure how I feel about that yet
16:13:21 <hemna_> avishay, so that kinda begs the question about taskflow, are we propagating it to all of the driver apis ?
16:13:22 <jungleboyj> jgriffith: The goal is to eventually get everything there, right?
16:13:23 <jgriffith> caitlin_56: perhaps, but perhaps not
16:13:27 <avishay> I hope I'll have time to convert migration and retype to use taskflow for Icehouse, but can't promise
16:13:42 <caitlin_56> We shouldn't force things into taskflow that don't fit it naturally.
16:13:49 <jgriffith> TBH I wanted to have some discussions about taskflow at the summit
16:14:00 <hemna_> jgriffith, ok cool, same here.
16:14:03 <jgriffith> I'd like to get a better picture of benefits etc and where it's going and when
16:14:05 <avishay> hemna: i think for something simple like extend volume we don't need it, but for more complex things it could be a good idea
16:14:27 <caitlin_56> summit discussions are good
16:14:27 <hemna_> avishay, well I think there could be a case made for even the simple ones.
16:14:31 <jgriffith> avishay: I think you're right, the trick is that "some here, some there" is a bit awkward
16:14:35 <avishay> anyway, something to think about until hong kong
16:14:40 <DuncanT> I'd certainly like the chance to discuss some of the weaknesses of the current taskflow implementation
16:14:50 <jgriffith> avishay: yeah, so long as you don't mind the wait
16:15:04 <hemna_> I was kind of hoping that taskflowing most things would lead to safe restart of cinder and all of its services.
16:15:04 <jgriffith> Ok, I think we all seem to agree here
16:15:16 <hemna_> a la safe shutdown/resume
16:15:17 <jgriffith> hemna_: I think it will, that's the point
16:15:27 <hemna_> coolio
16:15:34 <jgriffith> we need to get more educated and help harlow :)
16:15:38 <hemna_> yah
16:15:47 <jgriffith> I'd also like to find out more about community uptake
16:15:49 <jgriffith> anyway...
16:15:50 <avishay> yep
16:15:50 <kmartin> already a session for what's next in taskflow: http://summit.openstack.org/cfp/details/117
16:15:55 <hemna_> I already have a long list of my wants for I :P
16:16:02 <caitlin_56> I've been working with harlow already.
16:16:11 <jgriffith> I think we're still going that direction, we just need to organize.  We don't want another Brick debacle :)
16:16:12 <avishay> kmartin: nice!
16:16:19 <hemna_> hey now
16:16:27 <jgriffith> hemna_: that was directed at ME
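(For reference, since the thread above debates using TaskFlow for new features: a minimal sketch of the execute/revert task pattern, with engine-driven rollback. Task and flow names here are illustrative, not Cinder's actual flows.)

    from taskflow import engines, task
    from taskflow.patterns import linear_flow

    class AllocateVolume(task.Task):
        def execute(self, size):
            # do the work; the return value is stored under 'provides'
            return 'vol-001'

        def revert(self, size, **kwargs):
            # called automatically if a later task in the flow fails
            print('rolling back allocation of %d GB' % size)

    flow = linear_flow.Flow('create-volume').add(
        AllocateVolume(provides='volume_id'))
    engines.run(flow, store={'size': 10})

(The safe shutdown/resume hemna_ mentions comes from the engine tracking task state, so an interrupted flow can be resumed or reverted rather than left half-done.)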
16:16:51 <jgriffith> #topic quota-syncing
16:17:02 <jgriffith> DuncanT: you're correct, that's still hanging out there
16:17:25 <jgriffith> DuncanT: I looked at it a bit but quite frankly I ran away screaming
16:17:41 <DuncanT> jgriffith: It made my head hurt too
16:17:57 <jgriffith> I'd like to just drop quotas altogether :)
16:18:05 <bswartz> ha
16:18:08 <jgriffith> ;)
16:18:17 <guitarzan> quota syncing?
16:18:24 <jgriffith> guitarzan: yes
16:18:25 <avishay> guitarzan: https://bugs.launchpad.net/cinder/+bug/1202896
16:18:27 <uvirtbot> Launchpad bug 1202896 in nova "quota_usage data constantly out of sync" [High,Confirmed]
16:18:33 <guitarzan> ahh
16:19:02 <jgriffith> every time I mess with quotas I want to die, but...
16:19:16 <jgriffith> I also think that there are just fundamental issues with the design
16:19:20 <caitlin_56> No quotas are better than quotas enforced at the wrong locations.
16:19:29 <jgriffith> Might be something worth looking at for I???
16:19:54 <jgriffith> Don't all volunteer at once now!
16:20:02 <eharney> i would seriously consider the suggestion in that bug to replace the usage table w/ a view if possible
16:20:29 <guitarzan> that's an interesting idea, but it doesn't really tell you if the resource is being used or not
16:20:37 <guitarzan> especially in the error cases
16:20:45 <DuncanT> I'm not sure that scales with large numbers of volumes and users, unfortunately
16:20:50 <jgriffith> DuncanT: +1
16:20:58 <jgriffith> I think scale is the big concern with that
16:21:07 <caitlin_56> guitarzan: I agree. We need definitions that deal with real resource usage. Otherwise we're enforcing phony quotas.
16:21:08 <jgriffith> However I think we could do something creative there
16:21:13 <jgriffith> DB caching etc
16:21:36 <jgriffith> anyway... I don't think we're going to solve it here in the next 40 minutes :)
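(A hypothetical sketch of the view idea eharney raises above: compute per-project usage live from the volumes table instead of trusting cached quota_usages rows. Model and session names are assumptions, not Cinder's exact code.)

    from sqlalchemy import func

    def live_usage(session, models):
        # aggregate real resource usage per project on demand
        return (session.query(
                    models.Volume.project_id,
                    func.count(models.Volume.id).label('volumes'),
                    func.coalesce(func.sum(models.Volume.size), 0)
                        .label('gigabytes'))
                .filter(models.Volume.deleted == False)  # noqa: E712
                .group_by(models.Volume.project_id)
                .all())

(DuncanT's scale concern above is that this aggregation would run on every quota check, which is why DB caching comes up as a complement.)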
16:21:40 <DuncanT> I attempted to write a tool that checked that the current quota looked valid, and ran it periodically while doing many ops in a tight loop, but couldn't provoke the out-of-sync issue
16:22:04 <guitarzan> DuncanT: I think you just have to get something in an error state so you can delete it multiple times
16:22:23 <jgriffith> guitarzan: maybe we should focus on fixing that instead?
16:22:26 <DuncanT> guitarzan: Ah, ok, that I can provoke
16:22:28 <jgriffith> go about it the other way
16:22:44 <guitarzan> jgriffith: I think that's totally fixable
16:22:50 <jgriffith> did somebody say State Machine (again)
16:22:57 <DuncanT> guitarzan: Is there a specific bug for that scenario? Sounds like low hanging fruit....
16:23:01 <guitarzan> I wasn't going to say that :)
16:23:10 <guitarzan> DuncanT: I don't know, I'm just reading the bug
16:23:17 <jgriffith> :)
16:23:34 <guitarzan> I have been able to mess up quotas before, negative
16:23:36 <jgriffith> DuncanT: there is not, and it's not as low hanging as one would hope IMO
16:23:46 <DuncanT> Bugger
16:23:59 <jgriffith> There's a number of little *holes* that we can run into
16:24:14 <jgriffith> anyway... quotas aside those are things that I'd really like to see us work on for I
16:24:28 <jgriffith> exceptions and exception handling falls in that category
16:24:36 <DuncanT> Hmmm, I'm wondering if a runtime fault injection framework might make reproducing these issues easier?
16:24:37 <jungleboyj> jgriffith: +1
16:24:38 <avishay> again, state machine
16:24:40 <jgriffith> having a better picture of what happened back up at the manager
16:24:43 <jgriffith> avishay: :)
16:24:49 <jungleboyj> I have seen several issues with deleting.
16:25:03 <jungleboyj> Also think the issue of exceptions goes along with the taskflow issue.  :-)
16:25:06 <jgriffith> DuncanT: perhaps, but you can also just pick random points in a driver and raise some exception
16:25:10 <jgriffith> that works really well :)
16:25:16 <thingee> jgriffith: ended up just writing something to correct the quota that we use internally
16:25:22 <med_> DuncanT infectedmonkeypatch?
16:25:27 <jgriffith> thingee: Oh?
16:25:35 <thingee> jgriffith: that's just a bandaid fix though
16:25:37 <DuncanT> med: Sounds promising. I'll have a google
16:25:59 <jgriffith> thingee: might be something to pursue if DH is interested in sharing
16:26:02 * med_ made that up so google will likely fail miserably
16:26:07 <jgriffith> thingee: if nothing else experience
16:26:12 <DuncanT> jgriffith: I was pretty much thinking of formalising that approach so we can test it reproducibly
16:26:19 <jgriffith> the *experience* you guys have would be helpful
16:26:47 <jgriffith> DuncanT: Got ya.. if we just implement a State Machine it's covered :)
16:26:47 <thingee> jgriffith: I think it just wasn't put upstream because it was a bad hack. but yeah we can take ideas from that.
16:26:51 <jgriffith> Just sayin
16:27:02 * jgriffith promises to not say *State Machine* again
16:27:13 <jgriffith> thingee: coolness
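(One informal way to do the fault injection DuncanT and jgriffith describe: wrap a driver method so it raises on the Nth call, making failures like "delete dies mid-way" reproducible. This is a hypothetical helper, not an existing Cinder or oslo utility.)

    import functools

    def inject_fault(method, fail_on_call=1, exc_class=RuntimeError):
        state = {'calls': 0}

        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            state['calls'] += 1
            if state['calls'] == fail_on_call:
                raise exc_class('injected fault in %s' % method.__name__)
            return method(*args, **kwargs)

        return wrapper

    # e.g. driver.delete_volume = inject_fault(driver.delete_volume)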
16:27:49 <jgriffith> okie dokie
16:28:05 <jgriffith> DuncanT: what else you got for us?
16:28:13 <DuncanT> I'm all out I think
16:28:15 * jgriffith keeps putting DuncanT on the spot
16:28:34 <DuncanT> Most of my stuff is summit stuff now
16:28:41 <jgriffith> Ok, I just wanted to catch folks up on the gating disaster over the past few days
16:28:47 <jgriffith> #topic gating issues
16:28:50 <hemna_> ugh
16:28:57 <jgriffith> so I'm sure you all noticed jobs failing
16:29:02 <hemna_> jgriffith, jenkins puked on https://review.openstack.org/#/c/48528/
16:29:03 <jungleboyj> wheeee!
16:29:08 <jgriffith> but not sure how many people kept updated on what was going on
16:29:16 <jungleboyj> When were jobs failing?
16:29:34 <jgriffith> There were a number of intermittent failures that were in various projects
16:29:47 <dosaboy> was mainly broken neutron gate test no?
16:29:50 <jgriffith> I also think that some bugs in projects exposed bugs in other projects etc etc
16:30:00 <DuncanT> hemna: That looks like a straight merge failure, manual rebase should sort it
16:30:02 <jgriffith> dosaboy: no
16:30:08 <jgriffith> dosaboy: it was really a mixed bag
16:30:13 <dosaboy> ack
16:30:19 <jgriffith> Cinder, neutron, nova, keystone...
16:30:20 <jungleboyj> Apparently the one in Neutron was one that had been there for some time but it was a timing thing that was suddenly uncovered.
16:30:30 <jgriffith> jungleboyj: +1
16:30:40 <jgriffith> So anyway....
16:31:04 <jgriffith> things are stabilizing a bit, but here's the critical takeaway for right now
16:31:20 <jgriffith> the 'recheck bug xxx' comment is CRITICAL for tracking this stuff
16:31:48 <jgriffith> and even though the elastic search recommendation that pops up is sometimes pretty good, other times it's wayyy off
16:32:08 <jgriffith> we really need to make sure we take a good look at the recheck bugs and see if something fits, and if not log a new bug
16:32:20 <jgriffith> if you don't know where to log it against, log it against tempest for now
16:32:39 <jgriffith> best way to create these is to use the failing test's *name* as the bug title
16:32:51 <jgriffith> this makes it easier for people that encounter it later to identify
16:32:55 <jungleboyj> jgriffith: +1
16:33:11 <jgriffith> so like "TestBlahBlah.test_some_stupid_thing Fails"
16:33:25 <avishay> also, if something is already approved, make sure to do 'reverify bug xxx' and not recheck
16:33:30 <jgriffith> include a link to the gate/log pages
16:33:37 <jgriffith> avishay: +1
16:33:57 <jungleboyj> Yeah, sorry for the ones I rechecked before I learned that tidbit.
16:34:00 <avishay> sucks when jenkins finally passes and need to send it through again :)
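(For anyone unfamiliar with the convention discussed above: these are plain comments left on a Gerrit review, where the bug number attributes the failure to a known gate bug — 'recheck' re-runs the check jobs, 'reverify' re-runs the gate jobs on an already-approved change. The bug number below is illustrative, taken from later in this log.)

    recheck bug 1226337
    reverify bug 1226337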
16:34:14 <jgriffith> also take a look at this: http://paste.openstack.org/show/47798
16:34:37 <jgriffith> particularly the last one
16:34:44 <jgriffith> Failed 385 times!!!
16:34:48 <jgriffith> that's crazy stuff
16:34:58 <med_> ouch.
16:34:59 <jgriffith> I wasn't even aware of it until it was at 300
16:35:03 <hemna_> doh
16:35:25 <jgriffith> BTW that wasn't the worst one :)
16:35:28 <jgriffith> anyway...
16:35:49 <jgriffith> I did some queries last night on those and updated when last seen etc
16:36:23 <jgriffith> that big one 1226337 pretty much died out a few days ago after the fix I put in (break out of the retry loop)
16:36:36 <jgriffith> but still hit occasionally
16:36:49 <jgriffith> It's an issue with tgtd IMO
16:36:57 <jgriffith> It's not as robust as one might like
16:37:12 <jgriffith> so the follow up is a recovery attempt to create the backing LUN explicitly
16:37:15 <jgriffith> anyway...
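(Roughly the shape of that follow-up recovery, as described: if tgtd reports the target exists but the backing LUN is missing, create it explicitly. This is an approximate tgtadm invocation, not the exact patch; tid, lun, and the backing path are illustrative.)

    tgtadm --lld iscsi --op new --mode logicalunit \
           --tid 1 --lun 1 -b /dev/cinder-volumes/volume-xyz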
16:37:26 <jgriffith> the other item:  1223469
16:37:47 <jgriffith> I wanted to point that out because I made a change that does a recovery but still logs the error message in the logs
16:38:12 <jgriffith> this seemed like a good idea at the time, but the query writers grabbed onto that and still query on it
16:38:28 <jgriffith> even though it recovers and doesn't fail it still gets dinged in the queries
16:38:49 <jgriffith> so I think I should change it to a warning and change the wording to throw them off the scent :)
16:39:10 <jgriffith> but I wanted to go through these to try and keep everybody informed of what was going on
16:39:30 <jgriffith> I spent most of the last 3 VERY long days monitoring gates and poring over logs
16:39:42 <avishay> cool, thanks for the update and the work!
16:40:07 <jgriffith> Hoping that if/when we hit this sort of thing again we'll have a whole team working on it :)
16:40:12 <jungleboyj> avishay: +2
16:40:14 <jgriffith> Ok, that's all I have...
16:40:18 <jgriffith> anybody else?
16:40:23 <jgriffith> #topic open-discussion
16:41:01 <dosaboy> drinks are on the house!
16:41:01 <jgriffith> going twice....
16:41:13 <jgriffith> dosaboy: Your house?  Ok, I'm in :)
16:41:17 <dosaboy> :)
16:41:23 <jungleboyj> Yay!
16:41:25 <jgriffith> going three times...
16:41:31 <jgriffith> #endmeeting