16:00:01 <thingee> #startmeeting cinder
16:00:02 <openstack> Meeting started Wed Apr  8 16:00:01 2015 UTC and is due to finish in 60 minutes.  The chair is thingee. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:05 <openstack> The meeting name has been set to 'cinder'
16:00:09 <kmartin> o/
16:00:11 <thingee> hi everyone!
16:00:12 <jungleboyj> o/
16:00:16 <rhe00> hi
16:00:16 <geguileo> Hi!
16:00:18 <scottda> hi
16:00:20 <cebruns> Hi
16:00:22 <smcginnis> hi
16:00:25 <thangp> o/
16:00:26 <mriedem> hi
16:00:27 <xyang2> hi
16:00:31 <bswartz> hi
16:00:38 <thingee> Cinder Kilo RC is wrapping up https://etherpad.openstack.org/p/cinder-kilo-rc-priorities
16:00:40 <Swanson> hello
16:00:42 <patrickeast> hi
16:00:48 <flip214> hi
16:00:51 <rajinir_r> hi
16:00:52 <jgravel> hello
16:00:59 <li_zhang> hi
16:01:01 <deepakcs> hello
16:01:02 <thingee> I will be doing a cut soon. We're just waiting on the owners of the patches that are marked not ready now
16:01:04 <jbernard> hi
16:01:09 <smcginnis> Woot
16:01:11 <thingee> thanks everyone for the reviews!
16:01:16 <thingee> and patches to bug fixes!
16:01:38 <thingee> alright I'm stuck in the airport lobby and need to check in so lets get started, short meeting!
16:01:47 <thingee> agenda for today:
16:01:50 <thingee> #link https://wiki.openstack.org/wiki/CinderMeetings#Next_meeting
16:02:02 <e0ne> hi
16:02:19 <thingee> #topic Proposal to make the (check|gate)-tempest-dsvm-full-ceph job voting
16:02:21 <asselin_> hi
16:02:24 <thingee> mriedem: hi
16:02:25 <avishay> hello
16:02:28 <mriedem> hey
16:02:29 <thingee> dansmith: hi
16:02:34 <mriedem> https://review.openstack.org/#/c/170913/
16:02:37 <thingee> #link https://review.openstack.org/#/c/170913/
16:02:39 <dansmith> hola
16:02:45 <mriedem> so we have the non-voting ceph job in infra on cinder/nova changes today
16:02:54 <mriedem> glance was left out of that for some reason, but it runs with an rbd store in glance
16:02:56 <thingee> #idea ceph is an infra-run CI and stable. Should we allow it to vote?
16:02:58 <dansmith> before we talk about the voting, should/can we talk about jbernard's patch?
16:03:10 <dansmith> because we can't make it voting until we fix the bug
16:03:16 <mriedem> dansmith: yeah we can
16:03:20 <mriedem> https://review.openstack.org/#/c/170903/
16:03:23 <dansmith> mriedem: did the skip land?
16:03:25 <mriedem> the infra patch depends on the tempest skip
16:03:30 <dansmith> oh, then nevermind :)
16:03:33 <mriedem> maybe once i bug sdage
16:03:36 <mriedem> sdague
16:03:49 <mriedem> anyway, test_volume_boot_pattern pukes on non-lvm backends
16:03:55 <mriedem> so we skip that until the bug is fixed
16:04:03 <mriedem> and make the ceph job voting for nova/cinder/glance
16:04:03 <thingee> mriedem: oh I was wondering about that :(
16:04:10 <jbernard> more precisely, it pukes whenever you use glance api v2
16:04:13 <mtreinish> mriedem: not all non lvm backends, just some
16:04:16 <mriedem> sounds like the bug also affects glusterfs/gpfs, maybe others
16:04:19 <e0ne> imo, we need to fix bug first
16:04:28 <deepakcs> mriedem, yes it affects glusterfs too
16:04:45 <jbernard> it would also affect lvm, if glance v2 is used
16:04:45 <mriedem> the thing is, the ceph job was passing for awhile, then we enabled test_volume_boot_pattern again, and it blew up the ceph job
16:04:46 <thingee> jbernard, mriedem what exactly is the bug testing?
16:04:55 <jungleboyj> mriedem:  It does affect GPFS for sure.
16:05:01 <mriedem> i'd like to see us not regress the ceph job while trying to get this bug fixed
16:05:08 <thingee> mriedem: it is hitting a lot of the cinder volume drivers.
16:05:12 <mtreinish> thingee: http://git.openstack.org/cgit/openstack/tempest/tree/tempest/scenario/test_volume_boot_pattern.py#n25
16:05:29 <mriedem> sure, so it's a nasty bug for shared storage drivers
16:05:31 <e0ne> #link  proposed fix https://review.openstack.org/#/c/171312/
16:05:32 <dansmith> the thing is, allowing ceph to vote gets a lot of coverage in nova, cinder, and glance of these rbd backends
16:05:46 <mriedem> and ceph is the #1 storage backend according to the user survey
16:05:47 <dansmith> which is very valuable, as evidenced by the above list of "yes, affects us too" :)
16:05:52 <dansmith> and we can test it in infra
16:05:55 <mriedem> so we should test production things probably, since lvm != production
16:05:57 <dansmith> unlike many of the others
16:06:08 <thingee> mtreinish, mriedem: seems kind of important to pass? why would we not fix the problem to begin with?
16:06:22 <thingee> dansmith: +1
16:06:23 <mriedem> thingee: we don't want to regress while fixing that bug
16:06:24 <mriedem> on other things
16:06:29 <dansmith> thingee: we have a patch up to fix it
16:06:35 <thingee> dansmith: link?
16:06:38 <dansmith> thingee: and when I say "we" I mean jbernard  :)
16:06:42 <thingee> #link http://git.openstack.org/cgit/openstack/tempest/tree/tempest/scenario/test_volume_boot_pattern.py#n25
16:06:46 <mriedem> so skip the test, make the job voting so we don't regress, then once the bug is fixed we unskip the test
16:06:54 <thingee> mriedem: got it
16:06:58 <dansmith> I'm fine with that procedure
16:07:02 <mriedem> https://review.openstack.org/#/c/171312/
16:07:08 <dansmith> I'd like to see the patch get into kilo if at all possible
16:07:09 <mriedem> that's the proposed fix
16:07:11 <dansmith> but I'd rather have regression protection (tm)
16:07:11 <thingee> #link https://review.openstack.org/#/c/171312/
16:07:32 <thingee> #idea don't fix volumebootpattern to avoid regress
16:07:43 <thingee> DuncanT: here?
16:07:45 <flip214> thingee: I'd like to ask for a few minutes after the meeting, thank you so much.
16:07:50 <mriedem> well, we want to fix test_volume_boot_pattern
16:07:52 <thingee> flip214: got it
16:08:02 <mriedem> we just don't want to regress more while fixing that
16:08:04 <thingee> #undo
16:08:05 <openstack> Removing item from minutes: <ircmeeting.items.Idea object at 0x95f9e50>
16:08:23 <thingee> #idea eventually fix volumebootpattern
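For context, the skip being discussed (review 170903) is a small tempest change; a minimal sketch of that style of test-level skip follows. The module path, base class and skip wording are illustrative assumptions, not copied from the actual patch.

    # Illustrative sketch only -- the real change is review 170903; the names
    # and decorator placement below are assumptions.
    import testtools

    from tempest.scenario import manager


    class TestVolumeBootPattern(manager.ScenarioTest):

        @testtools.skip("Blocked until the boot-from-volume failure seen on "
                        "ceph/glusterfs/gpfs (Glance v2 path) is fixed")
        def test_volume_boot_pattern(self):
            pass  # original scenario body unchanged in the real patch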
16:08:42 <mriedem> there was a -1 on the infra change b/c the glusterfs and sheepdog jobs are to remain non-voting in cinder
16:08:53 <thingee> #action jbernard and dansmith have a patch to fix VolumeBootPattern
16:09:08 <mriedem> i don't know the whole background on that in cinder land, but those seem specific to cinder whereas the ceph job touches multiple projects
16:09:16 <mriedem> and there is an rbd specific image backend module in libvirt in nova
16:09:18 <thingee> does anyone oppose to ceph not voting for now?
16:09:24 <thingee> I know DuncanT was opposed originally
16:09:35 <e0ne> thingee: actually, we've got two patches
16:09:44 <jungleboyj> thingee: Do you mean oppose to ceph voting ?
16:09:48 <e0ne> but the second one requires new python-glanceclient
16:10:01 <thingee> jungleboyj: yes sorry, I'm in airport..hectic right now
16:10:05 <avishay> thingee: holiday here, DuncanT may not be around
16:10:05 <avishay> I see his point, it puts drivers on unequal footing
16:10:08 <dansmith> mriedem: although this failing is not in the ceph driver in nova, it's common across all BFV scenarios in nova
16:10:08 * thingee is excited to finish this meeting
16:10:09 <mriedem> jungleboyj: yes, comment in here https://review.openstack.org/#/c/170913/
16:10:32 <thingee> to refresh everyone's memory, we decided against this back in oct 15
16:10:34 <jungleboyj> thingee: Ok, just wanted to clarify.  I do not oppose making it voting. I think it is a good improvement.
16:10:34 <thingee> #link http://eavesdrop.openstack.org/meetings/cinder/2014/cinder.2014-10-15-16.00.log.html
16:11:02 <thingee> I think Ceph is a unique situation. It's integrated in a variety of projects and maintained by infra.
16:11:14 <dansmith> and #1 in the user survey for production
16:11:18 <mriedem> right
16:11:34 <jungleboyj> Given that this is unique and looks like it is likely to give us much better test coverage, I think this is the right thing to do.
16:11:38 <dansmith> and we have involvement from people to fix bugs when they arise
16:11:42 <smcginnis> I would love to see ceph, sheepdog, etc actually moved out to a different test queue. Just my opinion.
16:11:53 <xyang2> thingee: then what about other CI's maintained by infra?
16:11:57 <dansmith> smcginnis: right now, infra doesn't support that
16:12:07 <thingee> xyang2: so so far sheepdog is not stable
16:12:08 <jungleboyj> I understand DuncanT 's concerns, but I don't think improving cross project coverage on a production back end is hurting anyone.
16:12:14 <smcginnis> dansmith: Yeah, just wishing. :)
16:12:16 <mriedem> ceph job was on the experimental queue for a long time
16:12:16 <thingee> xyang2: we're landing patches right now to fix that though
16:12:19 <mriedem> and it was broken for a long time
16:12:24 <winston-d> smcginnis: because?
16:12:30 <mriedem> jbernard: got the ceph job passing so we moved to check queue non-voting
16:12:43 <smcginnis> It's good these are tested since they are open source drivers.
16:12:49 <thingee> mriedem: so were others, that's not unique for cinder volume drivers.
16:12:54 <hemna> mornin
16:12:54 <smcginnis> But it slows down our Jenkins gating.
16:12:58 <thingee> mriedem: re being in gate for a while
16:13:02 <smcginnis> We want to know if they fail.
16:13:07 <clarkb> smcginnis: does it? that test shouldn't be in the gate
16:13:08 <jungleboyj> hemna: Is here, we can start now.  ;-)
16:13:09 <smcginnis> But it shouldn't block things.
16:13:11 <thingee> smcginnis: truth
16:13:15 <hemna> heh
16:13:23 <xyang2> thingee: are we going to allow all of them to vote eventually
16:13:24 <smcginnis> Sorry, check not gate.
16:13:37 <mriedem> right now ceph is non-voting in the cinder check queue
16:13:41 <patrickeast> iirc the biggest concern with having ci’s voting was that it would cause too much noise with false positives… if the ceph job is stable, why not let it vote then? at least for check
16:13:42 <jungleboyj> xyang2: I think we decided against that.
16:13:42 <thingee> xyang2: I think the idea is eventually CI's would vote. they're just not usually stable
16:13:56 <thingee> mriedem: correct
16:13:56 <dansmith> so right now, adding ceph gets us a ton more coverage. adding sheepdog after that doesn't give us the same gain
16:14:03 <DuncanT> Sorry, just got here
16:14:05 <jungleboyj> patrickeast: +1
16:14:06 <avishay> i have no problem with ceph being voting, assuming it won't break things too much when its infra goes down
16:14:08 <dansmith> it gets us some, but not exercising code that wasn't ever touched before
16:14:20 <mriedem> avishay: it's not 3rd party CI
16:14:20 <thingee> patrickeast: +1
16:14:26 <mriedem> avishay: community infra hosts the job
16:14:30 <smcginnis> It is a bit uneven, because them voting would result in a Jenkins -1 whereas other drivers are separate.
16:14:30 <e0ne> patrickeast: +1
16:14:38 <DuncanT> smcginnis: +1
16:14:43 <avishay> mriedem: ah ok, then definitely no problem for me
16:14:47 <DuncanT> That was my original concern
16:14:50 <mriedem> smcginnis: it already does, but no one cares b/c it's non-voting
16:14:58 <mriedem> so they don't pay attention
16:15:02 <DuncanT> Also worth noting ceph has been failing all day
16:15:11 <hemna> go ceph!
16:15:13 <thingee> smcginnis: Honestly I think we need to be better at checking non-voting ci's. I've been getting on people in reviews if people haven't noticed
16:15:16 <dansmith> DuncanT: it's failing because of a bug in cinder
16:15:18 <smcginnis> I'd like to see it stay non-voting or moved out where it doesn't look like a failure if there is some driver specific issue.
16:15:24 <smcginnis> thingee: True
16:15:25 <dansmith> DuncanT: ever since the boot from volume test was turned on recently
16:15:31 <dansmith> DuncanT: it had been green for months before that
16:15:34 <mriedem> DuncanT: ceph is failing until we skip test_volume_boot_pattern on it
16:15:35 <thingee> DuncanT: most ci's are failing VolumeBootPattern
16:15:38 <mriedem> which prompted this whole discussion
16:15:39 <thingee> DuncanT: it will be skipped
16:15:41 <e0ne> imo, any _stable_ ci must be voting
16:15:47 <DuncanT> dansmith: So we'd be unable to merge anything into cinder right now if it was voting
16:15:51 <smcginnis> CI's need to get to the point of being stable enough to stand out when theres a failure.
16:15:56 <jungleboyj> DuncanT: That is what has started this discussion.  :-)
16:16:05 <smcginnis> DuncanT: My concern exactly.
16:16:06 <thingee> DuncanT: I think the idea is to land the skip first and then watch ceph ci, then change to voting
16:16:12 <jbernard> DuncanT: there is a patch here: https://review.openstack.org/#/c/171312/
16:16:13 <e0ne> DuncanT: we've got 2 proposed fixes
16:16:14 <dansmith> DuncanT: no, you wouldn't have been able to re-enable the broken test until it worked :)
16:16:17 <jordanP> thingee, +1
16:16:23 <mriedem> remember that this isn't just a cinder issue
16:16:33 <mriedem> so we can't make it non-voting in cinder and voting in nova/glance
16:16:42 <mriedem> b/c a cinder merge can then break the gate for nova/glacne
16:16:44 <mriedem> so it's all or none
16:16:45 <thingee> mriedem: +1
16:16:51 <dansmith> DuncanT: and this is a legitimately broken thing in cinder, so it not letting you merge something is... what you want :)
16:16:57 <dansmith> mriedem: right
16:17:03 <dansmith> this affects three projects
16:17:25 <mriedem> and this is the only shared storage backend we have in the infra check queue today for nova
16:17:30 <mriedem> everything else is lvm
16:17:41 <DuncanT> lvm is the cinder reference implementation
16:17:46 * thingee waits for ceph to become the reference implementation :)
16:17:47 <jungleboyj> Remember, if the job had been voting we wouldn't have gotten in the state where it was failing for everyone.
16:17:51 <mriedem> DuncanT: yeah, but lvm isn't a production backend
16:17:54 <DuncanT> *Everything* else has been classed as third party
16:17:57 <mriedem> ceph is the most used in production
16:18:00 <erlon> what defines a stable CI?
16:18:03 <dansmith> DuncanT: but it doesn't cover a ton of code that is required for production
16:18:10 <DuncanT> mriedem: Plenty of people using it in production right now
16:18:17 <thingee> erlon: that's a good question.
16:18:39 <hemna> thingee, keep waiting
16:18:45 <thingee> dansmith, mriedem: do we have numbers from before the VolumeBootPattern issue on how successfully the ceph ci has been running?
16:18:45 <DuncanT> hemna++
16:18:54 <erlon> thingee: is there a way to count/measure this? or create parameters to have a success hardline?
16:19:06 <clarkb> thingee: yes all that data is in graphite.openstack.org and logstash.openstack.org
16:19:06 <mriedem> thingee: it was passing
16:19:09 <dansmith> thingee: we only have ten days of data I think, and it's been broken for several now
16:19:11 <mtreinish> thingee: it's probably worth pointing out that the only reason the boot from volume test was ever skipped was because of lvm config issues
16:19:14 <thingee> mriedem: it was passing is not numbers
16:19:18 <erlon> e.g. more than 95% in last 30 days is stable
16:19:19 <mtreinish> which jgriffith only recently sorted
16:19:22 <clarkb> dansmith: graphite.openstack.org has like a year
16:19:24 <smcginnis> Sounds like a good Liberty midcycle discussion.
16:19:29 <thingee> clarkb: thanks, can we gather that?
16:19:32 <xyang2> thingee: what is the criteria to determine that skipping a test is ok?
16:19:35 <dansmith> clarkb: oh, I was going on logstash data
16:19:48 <mtreinish> dansmith: well that's only 10 days :)
16:19:53 <thingee> xyang2: I think we just went over that because of the regression point
16:19:55 <erlon> thingee: will save a lot of discussions when adding/removing CIs from voting
16:20:13 <thingee> dansmith: ah ok
16:20:18 <xyang2> thingee: there are bugs in tests that affect some particular CI, but not others
16:20:23 <dansmith> clarkb: can you help me find what element to plot for this?
16:20:25 * mtreinish disappears
16:20:30 <DuncanT> erlon: 5 disagreements in a row with jenkins reference run was the metric being discussed at Paris
16:20:57 <dansmith> DuncanT: from experience that's way too tight
16:21:04 <dansmith> DuncanT: jenkins fails that a lot :)
16:21:22 <smcginnis> dansmith: I agree.
16:21:35 <thingee> So I'll say it again, I think the eventual goal with Cinder's CI was to allow voting, someone can correct me if I'm wrong on that. I think if we want to start getting coverage on a driver other than the reference implementation and something that's maintained by infra and a community, this could be a good start.
16:21:53 <patrickeast> thingee: +1
16:21:54 <DuncanT> dansmith: I'm only repeating what was said
16:22:04 <mriedem> http://graphite.openstack.org/render/?width=586&height=308&_salt=1428510122.331&target=stats_counts.nodepool.job.check-tempest-dsvm-full-ceph.master.devstack-trusty.builds&from=00%3A00_20150101&until=23%3A59_20150408
16:22:05 <patrickeast> it makes for a nice pilot program to see how we like having other backends voting
16:22:09 <dansmith> http://graphite.openstack.org/render/?width=586&height=308&_salt=1428510113.778&target=stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.SUCCESS&from=-90days
16:22:12 <mriedem> that's graphite on the ceph job going back to january
16:22:13 <thingee> mriedem: thanks!
16:22:22 <dansmith> or rather:
16:22:22 <dansmith> http://graphite.openstack.org/render/?width=586&height=308&_salt=1428510134.342&from=-90days&target=stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.SUCCESS&target=stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.FAILURE
16:22:24 <clarkb> dansmith: you should add the failures too
16:22:24 <jungleboyj> thingee:  +1
16:22:31 <dansmith> clarkb: yeah
16:22:33 <clarkb> dansmith: thats it
16:22:36 <DuncanT> thingee: I'm not totally opposed, as long as (a) the goal is to do this with other drivers too (b) we can turn off the voting easily if the job gets unstable
16:22:37 <smcginnis> I thought we had decided no voting. It would only make querying maybe a little easier.
16:22:37 <clarkb> dansmith: I was about to link something similar
16:22:42 <dansmith> you can see that it just started failing, if you smooth out the noise
16:22:45 <dansmith> that's 90 days
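Those render URLs can also be requested as JSON, which makes it straightforward to turn the SUCCESS/FAILURE counters into a pass rate when judging stability. A rough sketch, assuming only the standard Graphite render API and the series names linked above:

    # Rough sketch: pull the ceph check job's SUCCESS/FAILURE counters from
    # graphite and compute an approximate pass rate. Summing statsd counters
    # is coarse, but enough for a "has this been green" check.
    import requests

    BASE = "http://graphite.openstack.org/render"
    SERIES = "stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.%s"


    def total(result, window="-30days"):
        resp = requests.get(BASE, params={"target": SERIES % result,
                                          "from": window, "format": "json"})
        data = resp.json()
        if not data:
            return 0
        # each datapoint is [value, timestamp]; value is None when nothing ran
        return sum(v for v, _ in data[0]["datapoints"] if v)


    success, failure = total("SUCCESS"), total("FAILURE")
    if success + failure:
        print("pass rate over 30 days: %.1f%%" % (100.0 * success / (success + failure)))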
16:22:52 <Swanson> It would be nice if, when CI's vote, that recheck <ci> would not recheck jenkins.
16:23:20 <thingee> smcginnis: that was in oct 15 for now, is where we left off..until things are more stable
16:23:30 <mriedem> DuncanT: turning off voting is easy
16:23:38 <mriedem> it's a 1 line change to project-config
16:23:47 <mriedem> and if the job is breaking, there will be 3 projects looking to fix it
16:23:47 <DuncanT> thingee: If the plan is not to allow other stable CIs to also vote with similar criteria for stability (whatever we pick), then I'm very much opposed
16:23:47 <thingee> smcginnis: https://wiki.openstack.org/wiki/Cinder/tested-3rdParty-drivers#When_thirdparty_CI_voting_will_be_required.3F
16:23:53 <clarkb> DuncanT: iirc there are at least 2 other drivers in the process of doing this
16:23:58 <clarkb> sheepdog and gluster
16:24:07 <thingee> DuncanT: I think that would be the goal eventually to allow others. That's the point of this discussion
16:24:23 <thingee> DuncanT: it's just we talked about this on Oct 15th when we had maybe three CI's? https://wiki.openstack.org/wiki/Cinder/tested-3rdParty-drivers#When_thirdparty_CI_voting_will_be_required.3F
16:24:34 <thingee> DuncanT: now we have more and some are doing pretty ok IMO
16:24:35 <clarkb> where this is follwing along in ceph's testing footsteps
16:24:48 <thingee> not perfect yet.
16:25:03 <thingee> I just think we need to start somewhere
16:25:05 <jungleboyj> We have to try adding voting somewhere.
16:25:10 <jungleboyj> thingee: :-)
16:25:16 <thingee> and I think a community driven driver is a good start
16:25:37 <e0ne> thingee: +1
16:25:48 <thingee> but people have to see the value in CI's eventually voting.
16:25:56 <DuncanT> If the plan is to add more, I'm not so opposed...
16:26:07 <jungleboyj> This is being raised up as a good starting point with good benefits.  If we enable it and it screws us over, then we can disable it.
16:26:12 <thingee> So one thing making me wary of all drivers voting is that we have around 45 drivers (counting all supported protocols)
16:26:20 <thingee> there is going to be a lot of rechecks if everyone starts to vote
16:26:37 <avishay> thingee: good point
16:26:43 <smcginnis> Yikes
16:27:00 <dansmith> the only ones that can vote are the ones in infra anyway right?
16:27:01 <cebruns> With infra/comm supporting it - it seems like a good prototype/beginning.
16:27:11 <mriedem> the only ones that can gate are in infra too
16:27:13 <dansmith> most of those 45 are 3rd party, uncontrolled, unsupervised
16:27:16 <mriedem> third party ci can't gate
16:27:16 <jungleboyj> thingee: I think a better goal is to have voting from a stable representation of the backend types.
16:27:16 <thingee> dansmith: well that's what will start pissing people off here :)
16:27:33 <dansmith> thingee: it's a decree outside the scope of this group or conversation, I think
16:27:36 <DuncanT> jungleboyj: ++
16:27:39 <jungleboyj> We can't hold this up on whether everyone can vote.
16:27:49 <mriedem> infra isn't going to run a storwize backend
16:27:49 <dansmith> jungleboyj: agreed
16:27:56 <mriedem> so let's not wait for that
16:28:02 <thingee> mriedem: that's not the point
16:28:07 <avishay> jungleboyj: what does "stable representation of the backend types" mean?
16:28:08 <DuncanT> I believe infra are, in principle, happy to host any open-source solution CI?
16:28:08 <thingee> mriedem: storwize is hosted offsite
16:28:14 <thingee> mriedem: the point is if it can vote
16:28:23 <smcginnis> I guess as a pilot to see how things go Ceph would be a good starting point, especially given it's level of usage.
16:28:30 <dansmith> thingee: right, and infra has said it can't right?
16:28:33 <mriedem> thingee: yeah, i know, but i don't want people thinking it's unfair that infra doesn't host everything
16:28:34 <jungleboyj> We have block covered right now with LVM.  Add in a shared filesystem with ceph if that is what users are using.
16:28:41 <jungleboyj> Continue to readdress.
16:28:44 <thingee> dansmith: infra left it up to the cinder group
16:28:47 <dansmith> jungleboyj: right
16:28:50 <mriedem> avishay: i take that to mean we should have at least one backend testing shared storage in the gate
16:28:54 <mriedem> since nova doesn't have that today
16:28:55 <thingee> dansmith: re the convo from last year october
16:28:57 <mriedem> the ceph job is already there
16:29:00 <jungleboyj> mriedem: +1
16:29:09 <dansmith> thingee: for check you mean, right?
16:29:19 <thingee> dansmith: yes
16:29:19 <dansmith> thingee: I think they said, across the board, no non-infra things in gate, period
16:29:26 <mriedem> yes
16:29:28 <dansmith> gotcha, okay
16:29:30 <DuncanT> I'd be happy if we agreed that any of the infra-hosted (i.e. open-source) cinder jobs will be eligible to vote in future, once they've proved stability
16:29:30 <mriedem> clarkb: can confirm
16:29:39 <dansmith> DuncanT: +1
16:30:07 <DuncanT> And that we'll pull ceph (or any other voting project) if it proves to be a pain
16:30:09 <mriedem> DuncanT: i think that's the goal for all jobs that infra hosts which are non-voting today
16:30:09 <smcginnis> So much for a short meeting. :)
16:30:16 * thingee is kind of glad this and flip214's topic are the only things in this meeting :)
16:30:32 <jungleboyj> DuncanT: +1
16:30:34 <mriedem> e.g. cells is non-voting today, once we can get it passing we'll make it voting so we don't regress it
16:30:34 <thingee> DuncanT: +1
16:30:35 <dansmith> sounds like we're arriving at the decision
16:30:40 <avishay> +1
16:30:42 <thingee> dansmith: I hope so
16:30:44 <thingee> :)
16:30:55 <dansmith> but I was ready for so much arguing!
16:31:05 <thingee> dansmith: this is cinder :)
16:31:05 <dansmith> paint it red!
16:31:07 <hemna> why argue, when everyone can just say no.
16:31:14 <patrickeast> lol
16:31:28 <DuncanT> Ok. Somebody minute that and I'll stop objecting
16:31:38 <avishay> hemna: we argue even when everyone says yes
16:31:45 <hemna> avishay, yah this is true.
16:31:51 <jungleboyj> avishay: No we don't!
16:31:53 <jungleboyj> ;-)
16:31:59 <thingee> #agreed infra-hosted open-source cinder jobs will be eligible to vote in future once they've proved stable
16:32:06 <avishay> jungleboyj: :)
16:32:07 <DuncanT> Thanks
16:32:17 <avishay> i'm good with this decision
16:32:22 <thingee> #agreed we'll start with Ceph voting in Cinder/glance/nova
16:32:33 <erlon> thingee: so, the criteria will be the one DuncanT said?
16:32:33 <eikke> next up: define 'stable'
16:32:36 <thingee> dansmith mriedem thanks for your time
16:32:38 <hemna> yah
16:32:38 <thingee> clarkb: you too
16:32:42 <thingee> mtreinish: and you!
16:32:48 <hemna> I was going to say, we need agreed upon criteria for 'stable'
16:32:50 <hemna> bleh
16:32:50 <dansmith> thingee: glad to, thanks
16:33:01 <erlon> hemna: yep
16:33:03 * jungleboyj thinks this is good.  Thanks mriedem dane_leblanc
16:33:07 <hemna> doesn't break any more than jenkins?
16:33:16 <DuncanT> dansmith: Thanks for sticking with the discussion 'til it got somewhere
16:33:16 <thingee> #topic How to determine a CI is stable
16:33:20 <jungleboyj> Doh, dansmith was what I meant.
16:33:26 <dansmith> jungleboyj: I figured :)
16:33:30 <avishay> no need to define everything - when it's stable we'll know it
16:33:40 <thingee> mriedem, dansmith if you want to help with this topic too :)
16:33:42 <mriedem> if a job is consistently breaking the gate and no one is stepping up to fix it, then it's made non-voting
16:33:52 <mriedem> that's why we made cells non-voting the last time
16:33:55 <jungleboyj> DuncanT: Knew you would come around.  ;-)
16:34:03 <thingee> mriedem: but how do we eventually graduate a ci
16:34:06 <dansmith> thingee: so, I don't have the graphite fu to smooth that graph,
16:34:19 <dansmith> thingee: but it looks to me like it was stable for about 90 days before the recent breakage
16:34:38 <mriedem> thingee: to voting?
16:34:51 <DuncanT> 30 days of less than N jenkins disagreements in a row is a nice high bar
16:34:58 <thingee> mriedem: yes, infra-hosted CIs will be graduated to voting
16:35:08 <DuncanT> And measurable, which is important IMO
16:35:10 <mriedem> thingee: based on graphite data i suppose
16:35:23 <mriedem> there isn't really a standard for this somewhere that i know of
16:35:31 <thingee> DuncanT: I think we would want to review the negatives to see if they are real. might be a bit time consuming though
16:35:32 <mriedem> if we have a job that we care about working, we make sure it's working and votiing
16:35:55 <dansmith> thingee: DuncanT http://goo.gl/BUotuv
16:36:00 <thingee> mriedem: I guess the point of this conversation is how are we determining the ceph ci is stable enough. sounds like 90 days of being stable? :P
16:36:11 <dansmith> that ^ shows that it pretty much tracks jenkins I think
16:36:17 <avishay> i don't think we need to set rules in stone...when something is stable, it's pretty obvious...when it's more trouble than it's worth, that's obvious too
16:36:27 <mriedem> avishay: +1
16:36:28 <erlon> DuncanT: 3 days of disagreements would revoke it?
16:36:42 <mriedem> and yeah, from the graph that dansmith linked, you can see around 4/2 it goes nuts
16:36:43 <deepakcs> Is there a way for jenkins to notify the CI owner only if the job fails. For eg: if job was doing fine for 90 days and breaks later, it would be good to have an email notification?
16:36:50 <mriedem> which is when test_volume_boot_pattern was unskipped
16:36:54 <DuncanT> erlon: some number like that and somebody can at least nominate it for removal
16:37:02 <mriedem> deepakcs: this isn't third party
16:37:10 <erlon> DuncanT: +1
16:37:17 <mriedem> deepakcs: as in the community owns it
16:37:44 <deepakcs> mriedem, how is that related to sending email notification or not ?
16:37:51 <mriedem> who are you going to send it to?
16:37:55 <DuncanT> Same with nomination... once a CI meets the criteria, somebody can nominate it. The person(s) nominating it can filter out real failures if they are so motivated
16:37:57 <thingee> I'm fine with going to 30 days for now.
16:38:08 <thingee> just because no one else is strong on a number
16:38:13 <deepakcs> mriedem, for glusterfs say it would be me or BharatK , for ceph it would be jbernard
16:38:33 <DuncanT> The point being a highly motivated sponsor is a good thing initially
16:38:41 <mriedem> yeah idk, that seems like unnecessary noise - if it's pissing people off, bring it up in irc or meetings
16:38:45 <jungleboyj> DuncanT: +1
16:38:50 <dansmith> mriedem: +1
16:38:51 <mriedem> you can always push a project-config patch to make it non-voting
16:39:01 <dansmith> it's a constant job, not just reacting when there is an email notification
16:39:02 <mriedem> if we killed CI after 3 bad days, jenkins would be dead http://status.openstack.org/elastic-recheck/index.html
16:39:31 <deepakcs> dansmith, yes, but email would be good to filter failure from success, esp when there is 1 failure in many successes, i feel
16:39:31 <mriedem> and neutron wouldn't be in the gate until about 9 months ago
16:39:32 <mriedem> :)
16:39:35 <jungleboyj> mriedem: :-)
16:39:52 <mriedem> deepakcs: we're talking about thousands of jobs
16:39:56 <dansmith> deepakcs: well, you have access to the data to generate your own notifications
16:40:04 <thingee> #idea 30 days for infra hosted ci's to be stable - any failures will be reviewed
16:40:05 <dansmith> anyway, this is off-topic
16:40:10 <thingee> :)
16:40:13 <mriedem> deepakcs: we can track this all with elastic-recheck
16:40:20 <mriedem> if the top gate bug is for a specific job, e.g. ceph
16:40:21 <thingee> we can revise this later
16:40:23 <deepakcs> mriedem, dansmith sure
16:40:26 <mriedem> then we kick someone's ass or make it non-voting
16:40:29 <thingee> just want to get people to agree on something so we can move on
16:40:31 <DuncanT> mriedem: Disagreeing with jenkins has always been the interesting metric for me....
16:40:40 <patrickeast> thingee: +1 sounds like a good starting point to me
16:40:43 * thingee waits for people to oppose
16:40:46 <dansmith> DuncanT: sometimes that's useful, other times not :P
16:40:50 <smcginnis> +1 to start
16:40:57 <jungleboyj> +1 from me
16:41:05 <jbernard> +1 from me
16:41:06 <dansmith> thingee: so the above implies we're good with the ceph proposal?
16:41:07 <DuncanT> +1 for me too, see how it works out
16:41:14 <e0ne> DuncanT: we need to be careful on disagrees
16:41:15 <DuncanT> dansmith: I believe so
16:41:24 <thingee> dansmith: ceph ci seems fine given the situation. no need to wait another 30 days
16:41:26 <e0ne> +1
16:41:29 <dansmith> sweet
16:41:45 <DuncanT> e0ne: It's just a point where we start to poke, not where we pull it. An alert point
16:41:48 * dansmith drops the mic and walks out
16:41:49 <mriedem> the unofficial metric is sdague's rage gauge in the ML
16:41:50 <avishay> that's fine, but again, it might fail a little, then be fixed, etc. no rule will cover everything. we'll know when something sucks.
16:41:52 <mriedem> on gate breaks
16:42:01 <jungleboyj> agreed, I think the 30 days is a future item.  For Ceph we already see there is value here and past stability.
16:42:16 <thingee> #agree 30 days for infra hosted ci's to prove being stable - any failures will be reviewed
16:42:20 <jungleboyj> mriedem: Rage gauge?
16:42:20 <thingee> #agreed 30 days for infra hosted ci's to prove being stable - any failures will be reviewed
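The "disagreements with jenkins in a row" metric DuncanT mentioned (5 in a row was the number floated in Paris; dansmith considers that too tight) is easy to state precisely. A small pure-Python sketch; where the vote data comes from (Gerrit, logstash, graphite) is deliberately left open:

    # Sketch of the "no more than N disagreements with jenkins in a row" check.
    # Input: chronological (ci_result, jenkins_result) pairs for the same changes.


    def longest_disagreement_streak(results):
        longest = current = 0
        for ci_result, jenkins_result in results:
            if ci_result != jenkins_result:
                current += 1
                longest = max(longest, current)
            else:
                current = 0
        return longest


    def looks_stable(results, max_streak=5):
        # max_streak=5 mirrors the Paris number; treat it as a knob, not a rule
        return longest_disagreement_streak(results) < max_streak


    sample = [("SUCCESS", "SUCCESS"), ("FAILURE", "SUCCESS"),
              ("FAILURE", "FAILURE"), ("SUCCESS", "SUCCESS")]
    print(looks_stable(sample))  # True: longest disagreement streak is 1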
16:42:29 <mriedem> the patented sdague rage gauge (tm)
16:42:29 <thingee> mriedem, dansmith again thanks
16:42:35 <thingee> #topic open discussion
16:42:37 <thingee> flip214: hi
16:43:15 <smcginnis> * crickets * :)
16:43:25 <thingee> flip214: are
16:43:26 <thingee> flip214: you
16:43:28 <thingee> flip214: ready?
16:43:34 <thingee> :)
16:43:41 <flip214> thingee: yes
16:43:46 <jungleboyj> * chirp chirp chirp *
16:43:48 <thingee> whew ok go ahead
16:44:03 <flip214> as you can probably guess, it's about the DRBD driver...
16:44:11 <flip214> the driver got renamed, like so many others.
16:44:15 <flip214> because of no CI.
16:44:38 <flip214> http://lists.openstack.org/pipermail/openstack-dev/2015-February/057585.html is Clarks reply to my "CI via infra for the DRBD Cinder driver" ML post
16:44:46 <thingee> clarkb: ^
16:44:56 <flip214> one of the gerrit links he posted is https://review.openstack.org/#/c/121528/
16:45:13 <clarkb> flip214: thingee please read jeblairs followup though
16:45:14 <flip214> which still isn't merged.
16:45:28 <clarkb> flip214: yes it hasn't merged because it hasn't been a priority for the reasons in jeblair's followup
16:45:44 <clarkb> flip214: it is also not a blocker
16:45:46 <flip214> AFAIK these changes are *needed* to get a multi-node testing environment
16:45:51 <thingee> flip214: I might do a cut today and there hasn't been time to let the ci run to verify things.
16:45:51 <clarkb> flip214: no
16:45:54 <flip214> oh, it is not?
16:45:59 <clarkb> flip214: no...
16:46:02 <flip214> okay.
16:46:11 <clarkb> and even if it were I suggest getting a single node test running first
16:46:14 <clarkb> against your devstack plugin
16:46:17 <flip214> so, things like the LVM driver with iSCSI support can already be tested?
16:46:21 <clarkb> thats step 0 and aiui has not been started
16:46:34 <flip214> clarkb: yeah, that's the second point.
16:46:36 <thingee> flip214: I remember talking to clarkb that it was recommended to not do a multi-node setup for now.
16:46:51 <thingee> flip214: after we spoke when i was closing kilo-3
16:46:51 <clarkb> yes, please set up a single node test against a devstack plugin that configures your driver
16:47:03 <clarkb> then you can worry about multinode
16:47:13 <flip214> okay, that should be even easier.
16:47:17 <clarkb> and potential zuul changes for voting (though this was discussed earlier and I don't think its that urgent)
16:47:40 <flip214> I'll try to talk to #infra tomorrow about that (it's 18:47 here)
16:47:54 <thingee> flip214: keep in mind. I might do a cut today
16:47:58 <thingee> for RC.. that would be it.
16:48:03 <flip214> my second point is that I'm trying to make devstack work for our development needs...
16:48:27 <flip214> thingee: there's no real chance that I can talk #infra into doing that in the next 3 hours, is there?
16:48:36 <clarkb> flip214: doing what?
16:48:44 <flip214> and the 1-2 days I try to invest in devstack, every time something else breaks.
16:49:07 <flip214> currently I get "AttributeError: 'NoneType' object has no attribute 'url'" for the "python update.py /opt/stack/keystone" command...
16:49:25 <thingee> flip214: might need to do a fresh vm
16:49:29 <flip214> clarkb: sorry, which of my sentences is your question about?
16:49:30 <smcginnis> flip214: That's been fixed.
16:49:33 <thingee> flip214: I was getting that too
16:49:39 <clarkb> flip214: what are you looking for infra to do in the next 3 hours?
16:49:40 <eikke> flip214: that should be fixed
16:49:47 <flip214> thingee: I'm *always* running a fresh vm.
16:49:50 <flip214> that's the point.
16:49:51 <geguileo> flip214: I think that's because you have the clone set to no
16:50:06 <flip214> clarkb: running my devstack DRBD plugin to configure the DRBD driver.
16:50:12 <geguileo> Ok, then that's not it  :(
16:50:20 <clarkb> flip214: are those changes proposed to infra yet?
16:50:39 <flip214> clarkb: I talked to you, and then (yes, my fault) waited for the reviews to get merged.
16:50:53 <flip214> I've got a git tree in https://github.com/LINBIT/devstack
16:51:07 <flip214> but nothing pushed against gerrit yet.
16:51:12 <clarkb> yes we need it in gerrit
16:51:13 <flip214> although the change is minimal.
16:51:17 <clarkb> so likely won't happen in the next 3 hours
16:51:19 <thingee> ok I think some of this can be taken to #openstack-infra
16:51:35 <flip214> so reviewing should take only a few minutes
16:51:46 <flip214> thingee: okay, I'll move to there. time's up anyway.
16:52:21 <thingee> flip214: looking forward to hearing the story you mentioned when we meet up at the summit
16:52:33 <thingee> anyone else have a topic before I close the meeting?
16:52:49 <asselin_> this bug might be affecting others: https://bugs.launchpad.net/tempest/+bug/1440227
16:52:50 <openstack> Launchpad bug 1440227 in tempest "encrypted volume tests don't check if a volume is actually encrypted" [Undecided,New]
16:52:51 <flip214> thingee: thanks.
16:53:33 <thingee> oh and I just wanted to mention to folks here, don't post to the ML about your CI results. anteaya has had to talk to some of you.
16:53:52 * jungleboyj looks sheepish
16:54:00 <thingee> feel free to bug us on #openstack-cinder. this just doesn't work well with everyone's CI. it starts a trend and then it comes in masses
16:54:08 <jungleboyj> thingee: On that note, the DS8k CI has been fixed.
16:54:09 <asselin_> take a look and see if your driver is (incorrectly) passing that test
16:54:26 <thingee> I'll have something more formal in the wiki about this later. just didn't have time yet. apologies anteaya
16:54:53 <thingee> and yes please review the bug asselin_ mentioned https://bugs.launchpad.net/tempest/+bug/1440227
16:54:54 <openstack> Launchpad bug 1440227 in tempest "encrypted volume tests don't check if a volume is actually encrypted" [Undecided,New]
16:54:58 <thingee> with your driver
16:55:03 <thingee> mtreinish: ^
16:55:15 <thingee> anything else?
16:55:20 * thingee has to catch a plane
16:55:42 <thingee> annnnnnd we're done, I'll be at pycon the rest of this week and here and there doing reviews.
16:55:46 <thingee> #endmeeting