16:00:01 <smcginnis> #startmeeting Cinder
16:00:03 <openstack> Meeting started Wed Nov 18 16:00:01 2015 UTC and is due to finish in 60 minutes.  The chair is smcginnis. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:07 <openstack> The meeting name has been set to 'cinder'
16:00:16 <dulek> Hi!
16:00:18 <eharney> hi
16:00:19 <tbarron> hi
16:00:20 <geguileo> Hi
16:00:21 <scottda> hey
16:00:22 <diablo_rojo> Hello :)
16:00:27 <smcginnis> Hey everyone.
16:00:33 <kmartin> hi
16:00:39 <yusuke> hi
16:00:42 <jseiler> hi
16:01:05 <smcginnis> #topic Big Status
16:01:15 <smcginnis> #info 482 Cinder bugs
16:01:23 <smcginnis> #info 54 python-cinderclient bugs
16:01:36 <smcginnis> #info 13 os-brick bugs
16:02:00 <thangp> o/
16:02:02 <smcginnis> Just a summary of bug counts to make sure there is awareness.
16:02:07 <xyang1> hi
16:02:28 <smcginnis> Any kind of bug scrub or other attention would definitely be appreciated.
16:02:53 <rajinir> Any link that shows the list?
16:02:55 <smcginnis> scottda also points out there are 12 nova bugs tagged as volume.
16:03:02 <Swanson> hi
16:03:06 <smcginnis> #link https://bugs.launchpad.net/nova/+bugs?field.status:list=NEW&field.tag=volumes
16:03:13 <scottda> Those are just the "new" ones that need triage
16:03:26 <smcginnis> scottda: :/
16:03:41 <smcginnis> rajinir: Nothing else really except here:
16:03:43 <smcginnis> #link https://bugs.launchpad.net/cinder/+bugs
16:04:04 <smcginnis> Launchpad is the greatest interface for searching through, but they're all there.
16:04:23 <smcginnis> scottda: Do you know of any particularly interesting ones?
16:04:49 <smcginnis> s/is the greatest/isn't the greatest/
16:04:54 <scottda> There's 1 with NetApp in the title :)
16:05:02 <smcginnis> :)
16:05:17 <e0ne> hi
16:05:36 <smcginnis> I think in general I want to at least spend some time adding tags to the bug reports to make it easy to pull out by vendor or particular feature.
16:05:43 <scottda> Most Nova volume bugs look familiar...things get out of sync during attach/detach :(
16:06:16 <smcginnis> scottda: OK, at least it's consistent.
16:06:30 <smcginnis> There's also the spec etherpad from last week: https://etherpad.openstack.org/p/mitaka-cinder-spec-review-tracking
16:06:39 <hemna> mornin
16:06:41 <smcginnis> That to be honest I haven't had the time to do a thing with.
16:06:49 <tbarron> scottda: we've got an E-Series driver person looing at that one
16:07:05 <tbarron> looking
16:07:16 <smcginnis> But the format is used by nova with some success, so I think it's worth spending more time on it.
16:07:34 <smcginnis> I also haven't had a chance to get some good reports up like nova has.
16:07:48 <smcginnis> Mostly because I just haven't been able to sit down and hack away at it.
16:07:58 <smcginnis> But hopefully some helpful tools coming soon.
16:08:10 <smcginnis> OK, just want to keep awareness here.
16:08:14 <smcginnis> Let's move on to business.
16:08:25 <smcginnis> OSLO incubator going away (DuncanT)
16:08:32 <e0ne> great news!
16:08:35 <DuncanT> Hi
16:08:35 <smcginnis> #topic OSLO incubator going away
16:08:59 <DuncanT> We still use a few bits from oslo incubator, most noticeably the scheduler and image utils
16:09:21 <DuncanT> Image utils is trivial, but I wonder if it is worth doing a last sync on the scheduler
16:09:39 <DuncanT> They both need moving into a standard cinder namespace, and the unit tests pulling in from incubator
16:09:41 <e0ne> DuncanT: can we move imageutils to oslo.utils?
16:10:28 <DuncanT> e0ne: Maybe, yeah. There's some argument over how it should work though, and it isn't complex, so it might be easier to have our own copy
16:10:46 <e0ne> DuncanT: sounds reasonable
16:11:44 <smcginnis> At least a start I suppose.
16:11:59 <DuncanT> I've nothing else to say on the subject, it was just a heads up... if I get time I'll put up patches, but volunteers welcome
16:12:09 <smcginnis> DuncanT: Thanks!
16:12:23 <smcginnis> #topic Adding extra data to event notifications for searchlight to consume
16:12:33 <smcginnis> DuncanT: Still all yours. :)
16:12:34 <DuncanT> Me again
16:13:09 <DuncanT> So searchlight shoves openstack metadata into an elastic search engine, with just enough extra metadata that it can do RBAC with it
16:13:23 <DuncanT> I wrote a cinder plugin for it, modelled off the nova one
16:13:44 <DuncanT> Trouble is, like nova, it needs to hit the API after nearly every event to get extra info
16:13:55 <hemna> :(
16:13:58 <DuncanT> This sucks, and makes the API and DB work twice as hard
16:14:06 <e0ne> +1
16:14:16 <smcginnis> What additional data does it need?
16:14:19 <DuncanT> I propose stuffing more details into the events to avoid this
16:14:36 <DuncanT> Usually RBAC related fields, sometimes status
16:14:56 <e0ne> can we add cinder objects to events?
16:15:00 <DuncanT> I suggest we just JSON encode the DB record we have in our hand at the time, and attach that to the event
16:15:07 <DuncanT> Events are just JSON
16:16:04 <smcginnis> That does seem like it would make the event more useful to consumers.
16:16:29 <smcginnis> I'd be concerned about passing too much extra, especially for non-searchlight consumers.
16:16:43 <smcginnis> But I think the overhead would be fairly low at this point?
16:16:58 <DuncanT> In all cases I can find, it is less than a k of data unless people go mad with metadata
16:17:55 <DuncanT> Doesn't sound like anybody hates it so far, I'll post patches
16:18:03 <smcginnis> Any concern with large and active system deployments?
16:18:33 <smcginnis> Yeah, might be worth proposing some patches and see if we get any other feedback there.
16:18:55 <dulek> e0ne: Cinder objects are JSON serializable by definition, we can attach them.
16:19:12 <DuncanT> I can run it on a ~hundred node test system, see how that goes re extra load
16:19:18 <hemna> dulek, that is a lot of extra data floating around though.
16:19:22 <hemna> is it really worth this ?
16:19:22 <e0ne> dulek: sure, it's what I'm proposing
16:19:55 <DuncanT> I'll come up with the fields that are actually interesting, but I suspect it ends up being most of them
16:20:03 <smcginnis> DuncanT, hemna: That extra stream of data is my only concern.
16:20:04 <dulek> hemna: It is, but based on my understanding of how searchlight works, it needs that. Or just diffs?
16:20:20 <hemna> dulek, any urls on searchlight ?
16:20:36 <dulek> And if someone wants to run searchlight - notifications are still better than SL hitting APIs all the time for indexing. ;)
16:20:52 <hemna> smcginnis, maybe this is CONF enabled but disabled by default ?
16:21:10 <DuncanT> https://www.youtube.com/watch?v=0jYXsK4j26s http://docs.openstack.org/developer/searchlight/
16:21:11 <smcginnis> Definitely better when you're using searchlight. But what about when not?
16:21:14 <hemna> otherwise we are stuffing lots more data around for no use.
16:21:26 <dulek> hemna: I don't have anything more than Google shows me.
16:21:32 <dulek> BTW - TravT_away ^
16:21:38 <hemna> who is asking for this ?
16:21:43 <smcginnis> hemna: Yeah, that's what I was thinking. Just a flag to flip.
16:22:22 <DuncanT> hemna: Searchlight? Several interested parties, and several past attempts to add e.g. regex filtering to cinder
16:22:56 <xyang> hemna: http://stackalytics.com/?metric=commits&module=searchlight-group&release=liberty
16:23:21 <DuncanT> I'll have to write the searchlight plugin to do the API queries if the events don't contain enough info, since it needs to work against existing systems ideally, so a conf option is fine by me, but I think the overhead should be negligible
16:23:23 <smcginnis> Well, there's certainly one company fully behind this. :)
16:23:29 <hemna> heh
16:23:32 <dulek> smcginnis: :D
16:23:33 <xyang> :)
16:23:59 <hemna> so make it disabled by default and enable it in cinder.conf
16:24:19 <smcginnis> DuncanT: I think it would help if you are able to get some metrics. Then we know what we're dealing with.
16:24:22 <hemna> until we get more performance metrics out of the impact of adding it.
16:24:26 <DuncanT> hemna: Another conf option is a burden though.... I'll benchmark the difference
16:24:37 <hemna> if it's negligible, then enable it by default.
16:24:42 <hemna> a burden?  meh
16:25:11 <smcginnis> Once we have the data it will be easier to say if we should just have on by default.
16:25:15 <DuncanT> hemna: We have loads that aren't tested or documented already
16:25:31 <hemna> that's not an argument for not creating another one. :)
16:25:38 <smcginnis> I think having a switch to turn it off for those that aren't using it and want to optimize things a bit might not be bad.
16:25:51 <smcginnis> Or is that optimise? :)
16:26:21 <DuncanT> I'll benchmark it, we have some performance guys who're good at that sort of thing
16:26:27 <hemna> if we are talking about the entire volume object and all of its entries, that's quite a bit of data per event.
16:26:29 <smcginnis> #action DuncanT to get some data for overhead with sending object data in notifications.
16:26:36 <hemna> that's not negligible IMHO.
16:27:55 <smcginnis> Well, let's see how things look once we have real data.
16:28:02 <smcginnis> DuncanT: all for this?
16:28:04 <DuncanT> hemna: It's going via rabbit, so message size isn't actually a big overhead according to previous benchmark
16:28:09 <DuncanT> smcginnis: Yup
16:28:19 <hemna> coolio.
16:28:23 <smcginnis> OK, thanks.
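The proposal above (JSON-encode the record already in hand, attach it to the event, and gate it behind an option that is off by default) could look roughly like the sketch below. This is only an illustration under assumptions: `build_notification`, the `include_record` flag, and the field names are hypothetical and are not Cinder's actual notification code or config options.

```python
import json
from datetime import datetime, timezone


def build_notification(event_type, record, include_record=False):
    """Build a notification payload for a volume event.

    `include_record` stands in for a hypothetical cinder.conf flag that
    is disabled by default, as suggested in the discussion.
    """
    payload = {
        "event_type": event_type,
        "volume_id": record["id"],
        "status": record["status"],
    }
    if include_record:
        # Round-trip through JSON so the attached copy holds only
        # JSON-safe values; default=str turns datetimes into strings.
        payload["record"] = json.loads(json.dumps(record, default=str))
    return payload


# Hypothetical record, standing in for the DB row held at event time.
record = {
    "id": "vol-1",
    "status": "available",
    "project_id": "p-42",
    "created_at": datetime(2015, 11, 18, 16, 0, tzinfo=timezone.utc),
    "metadata": {"tier": "gold"},
}

slim = build_notification("volume.create.end", record)
full = build_notification("volume.create.end", record, include_record=True)

# Encoded message sizes in bytes, for a rough overhead comparison.
slim_size = len(json.dumps(slim))
full_size = len(json.dumps(full))
```

Comparing `slim_size` and `full_size` on realistic records is one cheap way to get a first number for the overhead question before running a full benchmark.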
16:28:31 <smcginnis> #topic Looping on failed driver init.
16:28:37 <smcginnis> It's the DuncanT show today. :)
16:29:20 <DuncanT> Ok, this is hopefully an easy one. If you power your array and cinder node on at the same time, and the array comes up second, c-vol is useless until you restart it. This sucks
16:29:36 <DuncanT> (at least for drivers that check the array in init, which many do)
16:29:42 <dulek> DuncanT: +1
16:29:53 <smcginnis> I agree.
16:29:57 <e0ne> DuncanT: +1
16:30:02 <kmartin> DuncanT, we had customers asking for this
16:30:14 <DuncanT> Ok, I'll post patches. Thanks
16:30:18 <e0ne> kmartin: not only you :)
16:30:33 <smcginnis> I think that's a pretty clear answer then. :)
16:30:34 <kmartin> DuncanT, +1
16:30:38 <rajinir> c-vol shows up successful in the services-list even if the driver failed
16:30:41 <DuncanT> (Where 'I' quite possibly means 'one of my colleagues')
16:30:48 <dulek> rajinir: That's not true.
16:30:53 <dulek> rajinir: At least in Liberty.
16:31:01 <smcginnis> Delegation - DuncanT must be a manager now. :)
16:31:05 <hemna> DuncanT, so are you proposing a retry N times thing ?
16:31:06 <DuncanT> dulek: It was true until very late in liberty
16:31:29 <rajinir> dulek: that's great, if some work was done to fix that
16:31:37 <DuncanT> hemna: Just retry every half minute or something for ever - there's nothing else the process can usefully be doing at that point
16:31:39 <hemna> so in the case of our driver, we puke if we can't hit the array, but we also puke if there are conf settings that don't verify on the array, so retry won't help.
16:31:50 <eharney> i put a comment in the bug, but i'm concerned about doing this for __init__ if that's what you mean
16:31:57 <hemna> that basically means the log will show exceptions every half minute :)
16:32:03 <hemna> continually failing
16:32:03 <e0ne> hemna, DuncanT: we can make retry interval configurable
16:32:13 <DuncanT> hemna: So my idea doesn't make it any worse, just doesn't fix all cases. I'm ok with that
16:32:20 <rajinir> Another config option:)
16:32:21 <smcginnis> Or maybe have a scaling backoff.
16:32:23 <tbarron> DuncanT: multiple backends?  maybe configurable bound on the number of retries.
16:32:31 <DuncanT> hemna: Currently you get a log full of failing periodic tasks
16:32:40 <hemna> it should only retry if the driver pukes because it can't talk to the array
16:32:43 <smcginnis> DuncanT: I agree. We still spam the logs on failures as it is.
16:32:43 <flip211> that's a problem that I can understand .... +1 ;)
16:32:51 <hemna> if it pukes because of misconfiguration then it shouldn't retry
16:32:56 <hemna> as it will fail forever
16:33:03 <smcginnis> hemna: So maybe new exception that everyone should implement.
16:33:05 <DuncanT> tbarron: Multi-backend is in different processes by this point
16:33:09 <hemna> smcginnis, +1
16:33:15 <tbarron> DuncanT: oh right
16:33:18 <smcginnis> RetryableCinderException or some such thing.
16:33:23 <hemna> yup
16:33:37 <smcginnis> That makes sense.
16:33:37 <DuncanT> hemna:smcginnis : Ok, I'll look at adding a new exception to say 'stop, this can't be retried'
16:33:44 <hemna> DuncanT, +1
16:33:47 <hemna> thanks!
16:33:53 <geguileo> DuncanT: +1
16:33:54 <eharney> DuncanT: what about __init__ failures due to missing requirements, syntax errors, etc?
16:33:57 <smcginnis> ComeBackLaterCinderException
16:34:06 <dulek> smcginnis: +1 ;)
16:34:12 <hemna> eharney, I think that new exception would cover those
16:34:18 <DuncanT> eharney: Retries are still fairly harmless IMO
16:34:22 <eharney> hemna: it wouldn't
16:34:41 <hemna> well, not syntax errors
16:34:47 <hemna> but missing conf entries ?  it should
16:34:56 <eharney> i don't think we want to retry if you have broken python, personally
16:34:56 <smcginnis> eharney: You don't think so?
16:35:08 <eharney> that's just a mess in the logs
16:35:11 <hemna> driver recognizes it doesn't have the needed conf entry and raises the DontRetryException
16:35:11 <kmartin> HavingABeerWillCheckLaterCinderException
16:35:37 <hemna> syntax errors in python shouldn't make it into the code :)
16:35:53 <DuncanT> eharney: Right now the logs are a mess - you get 'driver not initialised' messages and have to scroll back up to find the real problem
16:35:57 <smcginnis> Yeah, I think it could be explicit by the driver. I'm all good but I can't complete init.
16:36:02 <hemna> if those aren't being covered by py27 and py34 we have bigger problems
16:36:35 <rajinir> Is there a way to propagate the error to the services-list
16:36:36 <eharney> hemna: eh... there's a lot more to it than that
16:37:20 <DuncanT> rajinir: Not in the first patch
16:37:55 <eharney> upstream unit tests don't help with people deploying out of tree drivers, broken packages, etc
16:38:31 <DuncanT> eharney: I'll keep it in mind. The logs are already really unhelpful in that case though
16:38:45 <hemna> the service itself would puke out
16:38:55 <hemna> I'm not sure what you are trying to solve w/ that one.
16:38:57 <smcginnis> eharney: So if the driver has to raise an exception saying everything else is good but it can't talk to the array right now, wouldn't that cover it?
16:39:08 <smcginnis> eharney: Otherwise it blows up in a different way.
16:39:36 <hemna> smcginnis, I think he's trying to cover the case where someone installs an out of tree driver that's horribly broken (syntax errors).
16:39:41 <eharney> smcginnis: yes
16:39:54 <smcginnis> But then it wouldn't raise a ComeBackLaterException.
16:40:03 <smcginnis> It would be a ThisCodeIsCrapException.
16:40:14 <eharney> all i'm saying is, no point in retrying indefinitely for "your system is broken" kinds of errors
16:40:15 <smcginnis> And then things just fail.
16:40:27 <eharney> i'm not too concerned with the other details
16:40:31 <smcginnis> eharney: Yeah, definitely.
16:41:12 <hemna> eharney, I thought that's what I had raised earlier though?  and why we created the dontretryexception ?
16:41:14 <hemna> I'm confused.
16:41:25 <eharney> but that assumes you can raise a dontretryexception
16:41:46 <eharney> then you don't cover other assorted error exceptions that can be raised outside of your intent
16:41:50 <hemna> ok this is going in circles.  I give up.
16:42:14 <smcginnis> Oh, I think instead of a dontretryexception it needs to be DoRetryException.
16:42:34 <Swanson> Is this only for init?
16:42:38 <smcginnis> So the driver can let the upper layer know it's a condition that may resolve itself.
16:42:42 <smcginnis> Swanson: Yeah.
16:42:45 <eharney> smcginnis: that was the thought
16:42:50 <dulek> We can have that discussion on review probably…
16:42:55 <smcginnis> I agree.
16:43:02 <smcginnis> We don't need to design this here.
16:43:18 <smcginnis> Good enough for now I think. Let the details come out in the code and reviews.
16:43:31 <smcginnis> dulek: Keeping us on track. :)
16:43:34 <Swanson> Do I know the difference between a bad config and an array being down?
16:43:55 * dulek is doing 2 meetings at once, so time's valuable. ;)
16:44:01 <smcginnis> Swanson: You would know the difference between a missing config and not being able to contact the array.
16:44:11 <DuncanT> Swanson: If the config has a wrong password, probably not, but retrying is fairly cheap in that case
16:44:16 <smcginnis> Like DuncanT was pointing out, not any worse than how it is now.
16:44:21 <dulek> smcginnis: +1
16:44:32 <smcginnis> OK, let's move on. Especially since dulek is next. :)
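Where the discussion landed (retry driver init only when the driver explicitly signals a condition that may resolve itself, let everything else fail as it does today, and optionally scale the delay) can be sketched as below. This is a minimal sketch under assumptions, not the patch that was later proposed: `DriverInitRetryable`, `init_with_retry`, and the delay values are hypothetical names and numbers.

```python
import time


class DriverInitRetryable(Exception):
    """Hypothetical 'DoRetry' signal: init may succeed if tried later."""


def init_with_retry(init_fn, initial_delay=30.0, max_delay=300.0,
                    sleep=time.sleep):
    """Call init_fn until it succeeds; return the number of attempts.

    Only DriverInitRetryable triggers a retry. Any other exception
    (bad config, missing requirement, broken code) propagates at once.
    """
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            init_fn()
            return attempts
        except DriverInitRetryable:
            # Array not reachable yet: wait, then back off up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Demo with a fake driver whose array comes up on the third attempt.
calls = {"n": 0}


def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DriverInitRetryable("array not reachable yet")


waits = []  # records requested delays instead of really sleeping
attempts = init_with_retry(flaky_init, initial_delay=30.0, max_delay=45.0,
                           sleep=waits.append)
```

Making the retryable case opt-in means unexpected errors, such as an import failure in an out-of-tree driver, never enter the loop, which is the concern eharney raised.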
16:44:39 <smcginnis> #topic reno for release notes
16:44:42 <dulek> hi!
16:44:46 <smcginnis> dulek: Thanks for bringing this up by the way.
16:44:52 <dulek> And this is also just a heads up.
16:45:08 <dulek> We started to use reno for release notes management.
16:45:28 <dulek> #link http://lists.openstack.org/pipermail/openstack-dev/2015-November/078301.html
16:45:34 <dulek> #link http://docs.openstack.org/developer/keystone/developing.html#release-notes
16:45:41 <dulek> #link https://review.openstack.org/#/c/246455/
16:45:55 <dulek> These are the resources that explain this stuff.
16:46:25 <dulek> As you see Keystone already merged some guidelines on what requires a release note
16:46:31 <dulek> We probably should do the same.
16:47:01 <smcginnis> Thanks, I hadn't seen the keystone notes. That would be good for us to have somewhere in our devref.
16:47:03 <DuncanT> dulek: +1
16:47:08 <smcginnis> Probably on our wiki too.
16:47:11 <dulek> I can come up with a proposal (taking the last link into account - so cross-project guidelines) to get reviewed.
16:47:29 * DuncanT would like it if the git commit messages were good enough, but they probably aren't :-(
16:48:02 <smcginnis> Having to pull together the release notes for liberty, I can definitely ay they are not good enough.
16:48:07 <smcginnis> s/ay/say/
16:48:12 <e0ne> DuncanT: feel free to -1 on such commit messages :)
16:48:12 <dulek> Ah, and BTW - there's two last patches waiting for merging.
16:48:29 <dulek> One is unreleased notes page backported to Liberty
16:48:43 <dulek> And the second one - a job for release notes.
16:48:54 <smcginnis> dulek: Shoot, I thought we had gotten those through. Will look it up.
16:48:58 <dulek> CI job I mean.
16:49:10 <dulek> smcginnis: https://review.openstack.org/#/c/245431/
16:49:13 <dulek> That's the one.
16:49:33 <dulek> I believe project-config changes are out of our scope, but +1 from smcginnis would be helpful.
16:49:37 <smcginnis> dulek: Thanks!
16:49:43 <dulek> https://review.openstack.org/#/c/244764/
16:49:48 <dulek> That's CI job.
16:50:17 <dulek> So summing up - we all have another thing to -1 reviews for. :D
16:50:25 <smcginnis> :/
16:50:32 <smcginnis> I hope this pays off.
16:51:01 <dulek> I hope too, I'm not totally convinced, but let's see.
16:51:16 <smcginnis> It does look promising. I think we will just need to adjust so it's just normal.
16:51:44 <smcginnis> dulek: Well, again, thanks for bringing it up.
16:51:46 <dulek> So if anyone has a question on release notes stuff - I'm here to help. :)
16:51:53 <smcginnis> dulek: Anything else.
16:51:53 <dulek> That's all I've wanted to say. :)
16:52:02 <smcginnis> Thank you.
16:52:10 <smcginnis> #topic Open Discussion
16:52:20 <smcginnis> The floor is open. (for 8 minutes)
16:52:35 <dulek> Just a note - we're moving in right direction with upgrades stuff!
16:53:06 <dulek> There's a spec for making DB schema migrations in a live manner: https://review.openstack.org/#/c/245976/
16:53:10 <dulek> If someone is interested.
16:53:16 <hemna> dulek, nice
16:53:37 <dulek> Thanks everyone for reviewing and especially thangp for driving this so well.
16:53:49 <smcginnis> +1
16:54:27 <xyang> dulek: we ran into one issue with versionedobjects conversion
16:54:51 <xyang> dulek: volume_metadata in volume db is changed to metadata in volume object
16:55:00 <xyang> dulek: our CI failed because of that
16:55:06 <dulek> xyang: I've noticed that in logs recently.
16:55:14 <xyang> dulek: so other drivers could be affected
16:55:29 <dulek> xyang: We should probably add an alias to Volume object.
16:55:52 <xyang> dulek: so for VNX driver, the create_volume function needs to use metadata but migrate_volume needs volume_metadata
16:55:55 <xyang> dulek: +1
16:56:04 <dulek> xyang: I'll take a look on that. Thanks for reporting.
16:56:17 <xyang> dulek: thanks
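The alias idea mentioned above can be illustrated with a plain Python property, so that driver code written against either name keeps working. This is a hedged sketch only: the `Volume` class here is a stand-in and not Cinder's real versioned object (which is built on oslo.versionedobjects), and it assumes both names can expose the same dict, which may not match the shape of the DB-level `volume_metadata`.

```python
class Volume:
    """Stand-in object; not the real cinder.objects.Volume."""

    def __init__(self, metadata=None):
        self.metadata = dict(metadata or {})

    @property
    def volume_metadata(self):
        # Alias: the old DB-style name reads the same underlying dict.
        return self.metadata

    @volume_metadata.setter
    def volume_metadata(self, value):
        # Writes through the old name land in the new attribute.
        self.metadata = dict(value)


vol = Volume({"tier": "gold"})
```

With this in place, code paths that still read `vol.volume_metadata` and those that read `vol.metadata` see the same data.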
16:57:08 <hemna> so, FWIW, I created a dumping ground github repo for some of my brick tools I've been using here: https://github.com/WaltHP/hpe-openstack-tools
16:57:23 <hemna> if anyone wants to use/contribute to em
16:57:25 <smcginnis> Cool!
16:57:29 <hemna> I use those when testing stuffs
16:57:37 <hemna> that Cinder/Nova doesn't call just yet
16:58:04 <hemna> the brick_volume.py dumps out the existing volume paths on the system for an attached volume.
16:58:28 <hemna> http://paste.openstack.org/show/479295/
16:58:30 <hemna> looks like that
16:59:11 <smcginnis> OK, pretty much out of time. Thanks everyone.
16:59:19 <geguileo> Thanks
16:59:22 <smcginnis> #endmeeting