16:00:01 <thingee> #startmeeting cinder
16:00:02 <openstack> Meeting started Wed Apr 8 16:00:01 2015 UTC and is due to finish in 60 minutes. The chair is thingee. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:05 <openstack> The meeting name has been set to 'cinder'
16:00:09 <kmartin> o/
16:00:11 <thingee> hi everyone!
16:00:12 <jungleboyj> o/
16:00:16 <rhe00> hi
16:00:16 <geguileo> Hi!
16:00:18 <scottda> hi
16:00:20 <cebruns> Hi
16:00:22 <smcginnis> hi
16:00:25 <thangp> o/
16:00:26 <mriedem> hi
16:00:27 <xyang2> hi
16:00:31 <bswartz> hi
16:00:38 <thingee> Cinder Kilo RC is wrapping up https://etherpad.openstack.org/p/cinder-kilo-rc-priorities
16:00:40 <Swanson> hello
16:00:42 <patrickeast> hi
16:00:48 <flip214> hi
16:00:51 <rajinir_r> hi
16:00:52 <jgravel> hello
16:00:59 <li_zhang> hi
16:01:01 <deepakcs> hello
16:01:02 <thingee> I will be doing a cut soon. We're just waiting on the owners of the patches that are marked not ready now
16:01:04 <jbernard> hi
16:01:09 <smcginnis> Woot
16:01:11 <thingee> thanks everyone for the reviews!
16:01:16 <thingee> and patches to bug fixes!
16:01:38 <thingee> alright I'm stuck in the airport lobby and need to check in so lets get started, short meeting!
16:01:47 <thingee> agenda for today:
16:01:50 <thingee> #link https://wiki.openstack.org/wiki/CinderMeetings#Next_meeting
16:02:02 <e0ne> hi
16:02:19 <thingee> #topic Proposal to make the (check|gate)-tempest-dsvm-full-ceph job votin
16:02:21 <asselin_> hi
16:02:24 <thingee> mriedem: hi
16:02:25 <avishay> hello
16:02:28 <mriedem> hey
16:02:29 <thingee> dansmith: hi
16:02:34 <mriedem> https://review.openstack.org/#/c/170913/
16:02:37 <thingee> #link https://review.openstack.org/#/c/170913/
16:02:39 <dansmith> hola
16:02:45 <mriedem> so we have the non-voting ceph job in infra on cinder/nova changes today
16:02:54 <mriedem> glance was left out of that for some reason, but it runs with an rbd store in glance
16:02:56 <thingee> #idea ceph is a infra ran ci and stable. Should we allow it to vote?
16:02:58 <dansmith> before we talk about the voting, should/can we talk about jbernard's patch?
16:03:10 <dansmith> because we can't make it voting until we fix the bug
16:03:16 <mriedem> dansmith: yeah we can
16:03:20 <mriedem> https://review.openstack.org/#/c/170903/
16:03:23 <dansmith> mriedem: did the skip land?
16:03:25 <mriedem> the infra patch depends on the tempest skip
16:03:30 <dansmith> oh, then nevermind :)
16:03:33 <mriedem> maybe once i bug sdage
16:03:36 <mriedem> sdague
16:03:49 <mriedem> anyway, test_volume_boot_pattern pukes on non-lvm backends
16:03:55 <mriedem> so we skip that until the bug is fixed
16:04:03 <mriedem> and make the ceph job voting for nova/cinder/glance
16:04:03 <thingee> mriedem: oh I was wondering about that :(
16:04:10 <jbernard> more precisely, it pukes whenever you use glance api v2
16:04:13 <mtreinish> mriedem: not all non lvm backends, just some
16:04:16 <mriedem> sounds like the bug also affects glusterfs/gpfs, maybe others
16:04:19 <e0ne> imo, we need to fix bug first
16:04:28 <deepakcs> mriedem, yes it affects glusterfs too
16:04:45 <jbernard> it would also effect lvm, if glance v2 is used
16:04:45 <mriedem> the thing is, the ceph job was passing for awhile, then we enabled test_volume_boot_pattern again, and it blew up the ceph job
16:04:46 <thingee> jbernard, mriedem what exactly is the bug testing?
16:04:55 <jungleboyj> mriedem: It does affect GPFS for sure.
16:05:01 <mriedem> i'd like to see us not regress the ceph job while trying to get this bug fixed
16:05:08 <thingee> mriedem: it is hitting a lot of the cinder volume drivers.
16:05:12 <mtreinish> thingee: http://git.openstack.org/cgit/openstack/tempest/tree/tempest/scenario/test_volume_boot_pattern.py#n25
16:05:29 <mriedem> sure, so it's a nasty bug for shared storage drivers
16:05:31 <e0ne> #link proposed fix https://review.openstack.org/#/c/171312/
16:05:32 <dansmith> the thing is, allowing ceph to vote gets a lot of coverage in nova, cinder, and glance of these rbd backends
16:05:46 <mriedem> and ceph is the #1 storage backend according to the user survey
16:05:47 <dansmith> which is very valuable, as evidenced by the above list of "yes, affects us too" :)
16:05:52 <dansmith> and we can test it in infra
16:05:55 <mriedem> so we should test production things probably, since lvm != production
16:05:57 <dansmith> unlike many of the others
16:06:08 <thingee> mtreinish, mriedem: seems kind of important to pass? why would we not fix the problem to begin with?
16:06:22 <thingee> dansmith: +1
16:06:23 <mriedem> thingee: we don't want to regress while fixing that bug
16:06:24 <mriedem> on other things
16:06:29 <dansmith> thingee: we have a patch up to fix it
16:06:35 <thingee> dansmith: link?
16:06:38 <dansmith> thingee: and when I say "we" I mean jbernard :)
16:06:42 <thingee> #link http://git.openstack.org/cgit/openstack/tempest/tree/tempest/scenario/test_volume_boot_pattern.py#n25
16:06:46 <mriedem> so skip the test, make the job voting so we don't regress, then once the bug is fixed we unskip the test
16:06:54 <thingee> mriedem: got it
16:06:58 <dansmith> I'm fine with that procedure
16:07:02 <mriedem> https://review.openstack.org/#/c/171312/
16:07:08 <dansmith> I'd like to see the patch get into kilo if at all possible
16:07:09 <mriedem> that's the proposed fix
16:07:11 <dansmith> but I'd rather have regression protection (tm)
16:07:11 <thingee> #link https://review.openstack.org/#/c/171312/
16:07:32 <thingee> #idea don't fix volumebootpattern to avoid regress
16:07:43 <thingee> DuncanT: here?
16:07:45 <flip214> thingee: I'd like to ask for a few minutes after the meeting, thank you so much.
16:07:50 <mriedem> well, we want to fix test_volume_boot_pattern
16:07:52 <thingee> flip214: got it
16:08:02 <mriedem> we just don't want to regress more while fixing that
16:08:04 <thingee> #undo
16:08:05 <openstack> Removing item from minutes: <ircmeeting.items.Idea object at 0x95f9e50>
16:08:23 <thingee> #idea eventually fix volumebootpattern
16:08:42 <mriedem> there was a -1 on the infra change b/c the glusterfs and sheepdog jobs are to remain non-voting in cinder
16:08:53 <thingee> #action jbernard and dansmith have a patch to fix VolumeBootPattern
16:09:08 <mriedem> i don't know the whole background on that in cinder land, but those seem specific to cinder whereas the ceph job touches multiple projects
16:09:16 <mriedem> and there is an rbd specific image backend module in libvirt in nova
16:09:18 <thingee> does anyone oppose to ceph not voting for now?
16:09:24 <thingee> I know DuncanT was opposed originally
16:09:35 <e0ne> thingee: actually, we've got two patches
16:09:44 <jungleboyj> thingee: Do you mean oppose to ceph voting ?
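
For context on the plan above: the "skip" mriedem and dansmith refer to is a temporary skip of tempest's test_volume_boot_pattern (the test linked at 16:05:12), proposed at https://review.openstack.org/#/c/170903/. The snippet below is only a rough, self-contained sketch of what a bug-referenced skip of that test generally looks like; the class body, decorator choice, and wording are illustrative assumptions, not the actual tempest patch.

    import unittest

    class TestVolumeBootPattern(unittest.TestCase):
        # Temporarily skipped: boot-from-volume breaks with the glance v2
        # API on non-LVM backends (ceph, glusterfs, gpfs, ...).  Unskip
        # once the underlying bug is fixed; see the proposed fix at
        # https://review.openstack.org/#/c/171312/
        @unittest.skip("boot from volume fails with glance v2 on non-LVM "
                       "backends; skip until the bug is fixed")
        def test_volume_boot_pattern(self):
            pass
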
16:09:48 <e0ne> but the second one requires new python-glanceclient
16:10:01 <thingee> jungleboyj: yes sorry, I'm in airport..hectic right now
16:10:05 <avishay> thingee: holiday here, DuncanT may not be around
16:10:05 <avishay> I see his point, it puts drivers on unequal footing
16:10:08 <dansmith> mriedem: although this failing is not in the ceph driver in nova, it's common across all BFV scenarios in nova
16:10:08 * thingee is excited to finish this meeting
16:10:09 <mriedem> jungleboyj: yes, comment in here https://review.openstack.org/#/c/170913/
16:10:32 <thingee> to refresh everyone's memory, we decided against this back in oct 15
16:10:34 <jungleboyj> thingee: Ok, just wanted to clarify. I do not oppose making it voting. I think it is a good improvement.
16:10:34 <thingee> #link http://eavesdrop.openstack.org/meetings/cinder/2014/cinder.2014-10-15-16.00.log.html
16:11:02 <thingee> I think Ceph is unique situation. It's integrated in a variety of projects and maintained by infra.
16:11:14 <dansmith> and #1 in the user survey for production
16:11:18 <mriedem> right
16:11:34 <jungleboyj> Given that this is unique and looks like it is likely to give us much better test coverage, I think this is the right thing to do.
16:11:38 <dansmith> and we have involvement from people to fix bugs when they arise
16:11:42 <smcginnis> I would love to ceph, sheepdog, etc actually moved out to a different test queue. Just my opinion.
16:11:53 <xyang2> thingee: then what about other CI's maintained by infra?
16:11:57 <dansmith> smcginnis: right now, infra doesn't support that
16:12:07 <thingee> xyang2: so so far sheepdog is not stable
16:12:08 <jungleboyj> I understand DuncanT 's concerns, but I don't think improving cross project coverage on a production back end is hurting anyone.
16:12:14 <smcginnis> dansmith: Yeah, just wishing. :)
16:12:16 <mriedem> ceph job was on the experimental queue for a long time
16:12:16 <thingee> xyang2: we're landing patches right now to fix that though
16:12:19 <mriedem> and it was broken for a long time
16:12:24 <winston-d> smcginnis: because?
16:12:30 <mriedem> jbernard: got the ceph job passing so we moved to check queue non-voting
16:12:43 <smcginnis> It's good these are tested since they are open source drivers.
16:12:49 <thingee> mriedem: so were others, that's not unique for cinder volume drivers.
16:12:54 <hemna> mornin
16:12:54 <smcginnis> But it slows down our Jenkins gating.
16:12:58 <thingee> mriedem: re being in gate for a while
16:13:02 <smcginnis> We want to know if they fail.
16:13:07 <clarkb> smcginnis: does it? that test shouldn't be in the gate
16:13:08 <jungleboyj> hemna: Is here, we can start now. ;-)
16:13:09 <smcginnis> But it shouldn't block thigns.
16:13:11 <thingee> smcginnis: truth
16:13:15 <hemna> heh
16:13:23 <xyang2> thingee: are we going to allow all of them to vote eventually
16:13:24 <smcginnis> Sorry, check not gate.
16:13:37 <mriedem> right now ceph is non-voting in the cinder check queue
16:13:41 <patrickeast> iirc the biggest concern with having ci’s voting was that it would cause too much noise with false positives… if the ceph job is stable, why not let it vote then? at least for check
16:13:42 <jungleboyj> xyang2: I think we decided against that.
16:13:42 <thingee> xyang2: I think the idea is eventually CI's would vote. they're just not usually stable
16:13:56 <thingee> mriedem: correct
16:13:56 <dansmith> so right now, adding ceph gets us a ton more coverage. adding sheepdog after that doesn't give us the same gain
16:14:03 <DuncanT> Sorry, just got here
16:14:05 <jungleboyj> patrickeast: +1
16:14:06 <avishay> i have no problem with ceph being voting, assuming it won't break things too much when its infra goes down
16:14:08 <dansmith> it gets us some, but not exercising code that wasn't ever touched before
16:14:20 <mriedem> avishay: it's not 3rd party CI
16:14:20 <thingee> patrickeast: +1
16:14:26 <mriedem> avishay: community infra hosts the job
16:14:30 <smcginnis> It is a bit uneven, because them voting would result in a Jenkins -1 whereas other drivers are separate.
16:14:30 <e0ne> patrickeast: +1
16:14:38 <DuncanT> smcginnis: +1
16:14:43 <avishay> mriedem: ah ok, then definitely no problem for me
16:14:47 <DuncanT> That was my original concern
16:14:50 <mriedem> smcginnis: it already does, but no one cares b/c it's non-voting
16:14:58 <mriedem> so they don't pay attention
16:15:02 <DuncanT> Also worth noting ceph has been failing all day
16:15:11 <hemna> go ceph!
16:15:13 <thingee> smcginnis: Honestly I think we need to be better at checking non-voting ci's. I've been getting on people in reviews if people haven't noticed
16:15:16 <dansmith> DuncanT: it's failing because of a bug in cinder
16:15:18 <smcginnis> I'd like to see it stay non-voting or moved out where it doesn't look like a failure if there is some driver specific issue.
16:15:24 <smcginnis> thingee: True
16:15:25 <dansmith> DuncanT: ever since the boot from volume test was turned on recently
16:15:31 <dansmith> DuncanT: it had been green for months before that
16:15:34 <mriedem> DuncanT: ceph is failing until we skip test_volume_boot_pattern on it
16:15:35 <thingee> DuncanT: most ci's are failing VolumeBootPattern
16:15:38 <mriedem> which prompted this whole discussion
16:15:39 <thingee> DuncanT: it will be skipped
16:15:41 <e0ne> imo, any _stable_ ci must bt voting
16:15:47 <DuncanT> dansmith: So we'd be unable to merge anything into cinder right now if it was voting
16:15:51 <smcginnis> CI's need to get to the point of being stable enough to stand out when theres a failure.
16:15:56 <jungleboyj> DuncanT: That is what has started this discussion. :-)
16:16:05 <smcginnis> DuncanT: My concern exactly.
16:16:06 <thingee> DuncanT: I think the idea is to land the skip first and then watch ceph ci, then change to voting
16:16:12 <jbernard> DuncanT: there is a patch here: https://review.openstack.org/#/c/171312/
16:16:13 <e0ne> DuncanT: we've got 2 proposed fixes
16:16:14 <dansmith> DuncanT: no, you wouldn't have been able to re-enable the broken test until it worked :)
16:16:17 <jordanP> thingee, +1
16:16:23 <mriedem> remember that this isn't just a cinder issue
16:16:33 <mriedem> so we can't make it non-voting in cinder and voting in nova/glance
16:16:42 <mriedem> b/c a cinder merge can then break the gate for nova/glacne
16:16:44 <mriedem> so it's all or none
16:16:45 <thingee> mriedem: +1
16:16:51 <dansmith> DuncanT: and this is a legitimately broken thing in cinder, so it not letting you merge something is... what you want :)
16:16:57 <dansmith> mriedem: right
16:17:03 <dansmith> this affects three projects
16:17:25 <mriedem> and this is the only shared storage backend we have in the infra check queue today for nova
16:17:30 <mriedem> everything else is lvm
16:17:41 <DuncanT> lvm is the cinder reference implementation
16:17:46 * thingee waits for ceph to become the reference implementation :)
16:17:47 <jungleboyj> Remember, if the job had been voting we wouldn't have gotten in the state where it was failing for everyone.
16:17:51 <mriedem> DuncanT: yeah, but lvm isn't a production backend
16:17:54 <DuncanT> *Everything* else ahs been classed as third party
16:17:57 <mriedem> ceph is the most used in production
16:18:00 <erlon> what defines a stable CI?
16:18:03 <dansmith> DuncanT: but it doesn't cover a ton of code that is required for production
16:18:10 <DuncanT> mriedem: Plenty of people using it in production right now
16:18:17 <thingee> erlon: that's a good question.
16:18:39 <hemna> thingee, keep waiting
16:18:45 <thingee> dansmith, mriedem: do we have numbers before the VolumeBootPattern issue how successful the ceph ci has been running?
16:18:45 <DuncanT> hemna++
16:18:54 <erlon> thingee: is there a way to count/measure this? or create parameters to have a success hardline?
16:19:06 <clarkb> thingee: yes all that data is in graphite.openstack.org and logstash.openstack.org
16:19:06 <mriedem> thingee: it was passing
16:19:09 <dansmith> thingee: we only have ten days of data I think, and it's been broken for several now
16:19:11 <mtreinish> thingee: it's probably worth pointing out that the only reason the boot from volume test was ever skipped was because of lvm config issues
16:19:14 <thingee> mriedem: it was passing is not numbers
16:19:18 <erlon> e.g. more than 95% in last 30 days is stable
16:19:19 <mtreinish> which jgriffith only recently sorted
16:19:22 <clarkb> dansmith: graphite.openstack.org has like a year
16:19:24 <smcginnis> Sounds like a good Liberty midcycle discussion.
16:19:29 <thingee> clarkb: thanks, can we gather that?
16:19:32 <xyang2> thingee: what is the criteria to determine that skipping a test is ok?
16:19:35 <dansmith> clarkb: oh, I was going on logstash data
16:19:48 <mtreinish> dansmith: well that's only 10 days :)
16:19:53 <thingee> xyang2: I think we just over that because of the regress point
16:19:55 <erlon> thingee: will save a lot of discussions when adding removing CIs from voting
16:20:13 <thingee> dansmith: ah ok
16:20:18 <xyang2> thingee: there are bugs in tests that affect some particular CI, but not others
16:20:23 <dansmith> clarkb: can you help me find what element to plot for this?
16:20:25 * mtreinish disappears
16:20:30 <DuncanT> erlon: 5 disagreements in a row with jenkins reference run was the metric being discussed at Paris
16:20:57 <dansmith> DuncanT: from experience that's way too tight
16:21:04 <dansmith> DuncanT: jenkins fails that a lot :)
16:21:22 <smcginnis> dansmith: I agree.
16:21:35 <thingee> So I'll say it again, I think the eventual goal with Cinder's CI was to allow voting, someone can correct me if I'm wrong on that. I think if we want to start getting coverage on a driver other than the reference implementation and something that's maintained by infra and a community, this could be a good start.
16:21:53 <patrickeast> thingee: +1
16:21:54 <DuncanT> dansmith: I'm only repeating what was said
16:22:04 <mriedem> http://graphite.openstack.org/render/?width=586&height=308&_salt=1428510122.331&target=stats_counts.nodepool.job.check-tempest-dsvm-full-ceph.master.devstack-trusty.builds&from=00%3A00_20150101&until=23%3A59_20150408
16:22:05 <patrickeast> it makes for a nice pilot program to see how we like having other backends voting
16:22:09 <dansmith> http://graphite.openstack.org/render/?width=586&height=308&_salt=1428510113.778&target=stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.SUCCESS&from=-90days
16:22:12 <mriedem> that's graphite on the ceph job going back to january
16:22:13 <thingee> mriedem: thanks!
16:22:22 <dansmith> or rather:
16:22:22 <dansmith> http://graphite.openstack.org/render/?width=586&height=308&_salt=1428510134.342&from=-90days&target=stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.SUCCESS&target=stats.zuul.pipeline.check.job.check-tempest-dsvm-full-ceph.FAILURE
16:22:24 <clarkb> dansmith: you should add the failures too
16:22:24 <jungleboyj> thingee: +1
16:22:31 <dansmith> clarkb: yeah
16:22:33 <clarkb> dansmith: thats it
16:22:36 <DuncanT> thingee: I'm not totally opposed, as long as (a) the goal is to do this with other drivers too (b) we can turn off the voting easily if the job gets unstable
16:22:37 <smcginnis> I thought we had decided no voting. It would only make querying maybe a little easier.
16:22:37 <clarkb> dansmith: I was about to link something similar
16:22:42 <dansmith> you can see that it just started failing, if you smooth out the noise
16:22:45 <dansmith> that's 90 days
16:22:52 <Swanson> It would be nice if, when CI's vote, that recheck <ci> would not recheck jenkins.
16:23:20 <thingee> smcginnis: that was in oct 15 for now, is where we left off..until things are more stable
16:23:30 <mriedem> DuncanT: turning off voting is easy
16:23:38 <mriedem> it's a 1 line change to project-config
16:23:47 <mriedem> and if the job is breaking, there will be 3 projects looking to fix it
16:23:47 <DuncanT> thingee: If the plan is not to allow other stable CIs to also vote with similar criteria for stability (whatever we pick), then I'm very much opposed
16:23:47 <thingee> smcginnis: https://wiki.openstack.org/wiki/Cinder/tested-3rdParty-drivers#When_thirdparty_CI_voting_will_be_required.3F
16:23:53 <clarkb> DuncanT: iirc there are at least 2 other drivers in the process of doing this
16:23:58 <clarkb> sheepdog and gluster
16:24:07 <thingee> DuncanT: I think that would be the goal eventually to allow others. That's the point of this discussion
16:24:23 <thingee> DuncanT: it's just we talked about this on Oct 15th when we had maybe three CI's? https://wiki.openstack.org/wiki/Cinder/tested-3rdParty-drivers#When_thirdparty_CI_voting_will_be_required.3F
16:24:34 <thingee> DuncanT: now we have more and some are doing pretty ok IMO
16:24:35 <clarkb> where this is follwing along in ceph's testing footsteps
16:24:48 <thingee> not perfect yet.
16:25:03 <thingee> I just think we need to start somewhere
16:25:05 <jungleboyj> We have to try adding voting somewhere.
16:25:10 <jungleboyj> thingee: :-)
16:25:16 <thingee> and I think a community driven driver is a good start
16:25:37 <e0ne> thingee: +1
16:25:48 <thingee> but people have to see the value in CI's eventually voting.
16:25:56 <DuncanT> If the plan is to add more, I'm not so opposed...
16:26:07 <jungleboyj> This is being raised up as a good starting point with good benefits. If we enable it and it screws us over, then we can disable it.
16:26:12 <thingee> So one thing being opposed to drivers for me is we have around 45 drivers (counting all supported protocols)
16:26:20 <thingee> there is going to be a lot of rechecks if everyone starts to vote
16:26:37 <avishay> thingee: good point
16:26:43 <smcginnis> Yikes
16:27:00 <dansmith> the only ones that can vote are the ones in infra anyway right?
16:27:01 <cebruns> With infra/comm supporting it - it seems like a good prototype/beginning.
16:27:11 <mriedem> the only ones that can gate are in infra too
16:27:13 <dansmith> most of those 45 are 3rd party, uncontrolled, unsupervised
16:27:16 <mriedem> third party ci can't gate
16:27:16 <jungleboyj> thingee: I think a better goal is to have voting from a stable representation of the the backend types.
16:27:16 <thingee> dansmith: well that's what will start pissing people off here :)
16:27:33 <dansmith> thingee: it's a decree outside the scope of this group or conversation, I think
16:27:36 <DuncanT> jungleboyj: ++
16:27:39 <jungleboyj> We can't hold this up on whether everyone can vote.
16:27:49 <mriedem> infra isn't going to run a storwize backend
16:27:49 <dansmith> jungleboyj: agreed
16:27:56 <mriedem> so let's not wait for that
16:28:02 <thingee> mriedem: that's not the point
16:28:07 <avishay> jungleboyj: what does "stable representation of the the backend types" mean?
16:28:08 <DuncanT> I believe infra are, in principle, happy to host any open-source solution CI?
16:28:08 <thingee> mriedem: storwize is hosted offsite
16:28:14 <thingee> mriedem: the point is if it can vote
16:28:23 <smcginnis> I guess as a pilot to see how things go Ceph would be a good starting point, especially given it's level of usage.
16:28:30 <dansmith> thingee: right, and infra has said it can't right?
16:28:33 <mriedem> thingee: yeah, i know, but i don't want people thinking it's unfair that infra doesn't host everything
16:28:34 <jungleboyj> We have block covered right now with LVM. Add in a shared filesystem with ceph if that is what users are using.
16:28:41 <jungleboyj> Continue to readdress.
16:28:44 <thingee> dansmith: infra left it up to the cinder group
16:28:47 <dansmith> jungleboyj: right
16:28:50 <mriedem> avishay: i take that to mean we should have at least one backend testing shared storage in the gate
16:28:54 <mriedem> since nova doesn't have that today
16:28:55 <thingee> dansmith: re the convo from last year october
16:28:57 <mriedem> the ceph job is already there
16:29:00 <jungleboyj> mriedem: +1
16:29:09 <dansmith> thingee: for check you mean, right?
16:29:19 <thingee> dansmith: yes
16:29:19 <dansmith> thingee: I think they said, across the board, no non-infra things in gate, period
16:29:26 <mriedem> yes
16:29:28 <dansmith> gotcha, okay
16:29:30 <DuncanT> I'd be happy if we agreed that any of the infra-hosted (i.e. open-source) cinder jobs will be eligable to vote in future, once they've proved stability
16:29:30 <mriedem> clarkb: can confirm
16:29:39 <dansmith> DuncanT: +1
16:30:07 <DuncanT> And that we'll pull ceph (or any other voting project) if it proves to be a pain
16:30:09 <mriedem> DuncanT: i think that's the goal for all jobs that infra hosts which are non-voting today
16:30:09 <smcginnis> So much for a short meeting. :)
16:30:16 * thingee is kind of glad this and flip214's topic are the only things in this meeting :)
16:30:32 <jungleboyj> DuncanT: +1
16:30:34 <mriedem> e.g. cells is non-voting today, once we can get it passing we'll make it voting so we don't regress it
16:30:34 <thingee> DuncanT: +1
16:30:35 <dansmith> sounds like we're arriving at the decision
16:30:40 <avishay> +1
16:30:42 <thingee> dansmith: I hope so
16:30:44 <thingee> :)
16:30:55 <dansmith> but I was ready for so much arguing!
16:31:05 <thingee> dansmith: this is cinder :)
16:31:05 <dansmith> paint it red!
16:31:07 <hemna> why argue, when everyone can just say no.
16:31:14 <patrickeast> lol
16:31:28 <DuncanT> Ok. Somebody minute that and I'll stop objecting
16:31:38 <avishay> hemna: we argue even when everyone says yes
16:31:45 <hemna> avishay, yah this is true.
16:31:51 <jungleboyj> avishay: No we don't!
16:31:53 <jungleboyj> ;-)
16:31:59 <thingee> #agreed infra-hosted open-source cinder jobs will be eligable to vote in future once they've proved stable
16:32:06 <avishay> jungleboyj: :)
16:32:07 <DuncanT> Thanks
16:32:17 <avishay> i'm good with this decision
16:32:22 <thingee> #agreed we'll start with Ceph having voting in Cinder/glance/nova
16:32:33 <erlon> thingee: so, the criteria will be the one DuncanT said?
16:32:33 <eikke> next up:: define 'stable'
16:32:36 <thingee> dansmith mriedem thanks for your time
16:32:38 <hemna> yah
16:32:38 <thingee> clarkb: you too
16:32:42 <thingee> mtreinish: and you!
16:32:48 <hemna> I was going to say, we need agreed upon criteria for 'stable'
16:32:50 <hemna> bleh
16:32:50 <dansmith> thingee: glad to, thanks
16:33:01 <erlon> hemna: yep
16:33:03 * jungleboyj thinks this is good. Thanks mriedem dane_leblanc
16:33:07 <hemna> doesn't break any more than jenkins?
16:33:16 <DuncanT> dansmith: Thanks for sticking with the discussion 'til it got somewhere
16:33:16 <thingee> #topic How to determine a CI is stable
16:33:20 <jungleboyj> Doh, dansmith was what I meant.
16:33:26 <dansmith> jungleboyj: I figured :)
16:33:30 <avishay> no need to define everything - when it's stable we'll know it
16:33:40 <thingee> mriedem, dansmith if you want to help with this topic too :)
16:33:42 <mriedem> if a job is consistently breaking the gate and no one is stepping up to fix it, then it's made non-voting
16:33:52 <mriedem> that's why we made cells non-voting the last time
16:33:55 <jungleboyj> DuncanT: Knew you would come around. ;-)
16:34:03 <thingee> mriedem: but how do we eventually graduate a ci
16:34:06 <dansmith> thingee: so, I don't have the graphite fu to smooth that graph,
16:34:19 <dansmith> thingee: but it looks to me like it was stable for about 90 days before the recent breakage
16:34:38 <mriedem> thingee: to voting?
16:34:51 <DuncanT> 30 days of less than N jenkins disagreements in a row is a nice high bar
16:34:58 <thingee> mriedem: yes, we infra hosted ci's will be graduated to voting
16:35:08 <DuncanT> And measurable, which is important IMO
16:35:10 <mriedem> thingee: based on graphite data i suppose
16:35:23 <mriedem> there isn't really a standard for this somewhere that i know of
16:35:31 <thingee> DuncanT: I think we would want to review the negatives to see if they are real. might be a bit time consuming though
16:35:32 <mriedem> if we have a job that we care about working, we make sure it's working and votiing
16:35:55 <dansmith> thingee: DuncanT http://goo.gl/BUotuv
16:36:00 <thingee> mriedem: I guess the point of this conversation is how are we determining the ceph ci is stable enough. sounds like 90 days of being stable? :P
16:36:11 <dansmith> that ^ shows that it pretty much tracks jenkins I think
16:36:12 <DuncanT> And measurable, which is important IMO
16:36:17 <avishay> i don't think we need to set rules in stone...when something is stable, it's pretty obvious...when it's more trouble than it's worth, that's obvious too
16:36:27 <mriedem> avishay: +1
16:36:28 <erlon> DuncanT: 3 days of disagreements would revoke it?
16:36:42 <mriedem> and yeah, from the graph that dansmith linked, you can see around 4/2 it goes nuts
16:36:43 <deepakcs> Is there a way for jenkins to notify the CI owner only if the job fails. For eg: if job was doing fine for 90 days and breaks later, it would be good to have a email notifucation ?
16:36:50 <mriedem> which is when test_volume_boot_pattern was unskipped
16:36:54 <DuncanT> erlon: some number like that and somebody can at least nominate it for removal
16:37:02 <mriedem> deepakcs: this isn't third party
16:37:10 <erlon> DuncanT: +1
16:37:17 <mriedem> deepakcs: as in the community owns it
16:37:44 <deepakcs> mriedem, how is that related to sending email notification or not ?
16:37:51 <mriedem> who are you going to send it to?
16:37:55 <DuncanT> Same with nomination... once a CI meets the criteria, somebody can nominate it. The person(s) nominating it can filter out real failures if they are so motivated
16:37:57 <thingee> I'm fine with going to 30 days for now.
16:38:08 <thingee> just because no one else is strong on a number
16:38:13 <deepakcs> mriedem, for glusterfs say it would be me or BharatK , for ceph it would be jbernard
16:38:33 <DuncanT> The point being a highly motivated sponsor is a good thing initially
16:38:41 <mriedem> yeah idk, that seems like unnecessary noise - if it's pissing people off, bring it up in irc or meetings
16:38:45 <jungleboyj> DuncanT: +1
16:38:50 <dansmith> mriedem: +1
16:38:51 <mriedem> you can always push a project-config patch to make it non-voting
16:39:01 <dansmith> it's a constant job, not just reacting when there is an email notification
16:39:02 <mriedem> if we killed CI after 3 bad days, jenkins would be dead http://status.openstack.org/elastic-recheck/index.html
16:39:31 <deepakcs> dansmith, yes, but email would be good to filter failure from sucess, esp when there is 1 failure in many successess, i feel
16:39:31 <mriedem> and neutron wouldn't be in the gate until about 9 months ago
16:39:32 <mriedem> :)
16:39:35 <jungleboyj> mriedem: :-)
16:39:52 <mriedem> deepakcs: we're talking about thousands of jobs
16:39:56 <dansmith> deepakcs: well, you have access to the data to generate your own notifications
16:40:04 <thingee> #idea 30 days for infra hosted ci's to be stable - any falures will be reviewed
16:40:05 <dansmith> anyway, this is off-topic
16:40:10 <thingee> :)
16:40:13 <mriedem> deepakcs: we can track this all with elastic-recheck
16:40:20 <mriedem> if the top gate bug is for a specific job, e.g. ceph
16:40:21 <thingee> we can revise this later
16:40:23 <deepakcs> mriedem, dansmith sure
16:40:26 <mriedem> then we kick someone's ass or make it non-voting
16:40:29 <thingee> just want to get people to agree on something so we can move on
16:40:31 <DuncanT> mriedem: Disagreeing with jenkins has always been the interesting metric for me....
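
An aside on the "project-config patch" mriedem mentions above: at the time, whether a Jenkins job voted was controlled by the Zuul layout in openstack-infra/project-config, where jobs vote by default and a single attribute keeps one non-voting. The stanza below is a rough sketch of that mechanism from memory, not the exact file contents; the job name is the real one, the surrounding layout is approximate.

    # zuul/layout.yaml in openstack-infra/project-config (approximate)
    jobs:
      - name: check-tempest-dsvm-full-ceph
        voting: false    # deleting this line is the "1 line change"
                         # that makes the job voting again
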
16:40:40 <patrickeast> thingee: +1 sounds like a good starting point to me
16:40:43 * thingee waits for people to oppose
16:40:46 <dansmith> DuncanT: sometimes that's useful, other times not :P
16:40:50 <smcginnis> +1 to start
16:40:57 <jungleboyj> +1 from me
16:41:05 <jbernard> +1 from me
16:41:06 <dansmith> thingee: so the above implies we're good with the ceph proposal?
16:41:07 <DuncanT> +1 for me too, see how it works out
16:41:14 <e0ne> DuncanT: we need to be careful on disagrees
16:41:15 <DuncanT> dansmith: I believe so
16:41:24 <thingee> dansmith: ceph ci seems fine given the situation. no need to wait another 30 days
16:41:26 <e0ne> +1
16:41:29 <dansmith> sweet
16:41:45 <DuncanT> e0ne: It's just a point where we start to poke, not where we pull it. An alert point
16:41:48 * dansmith drops the mic and walks out
16:41:49 <mriedem> the unofficial metric is sdague's rage gauge in the ML
16:41:50 <avishay> that's fine, but again, it might fail a little, then be fixed, etc. no rule will cover everything. we'll know when something sucks.
16:41:52 <mriedem> on gate breaks
16:42:01 <jungleboyj> agreed, I think the 30 days is a future item. For Ceph we already see there is value here and past stability.
16:42:16 <thingee> #agree 30 days for infra hosted ci's to prove being stable - any failures will be reviewed
16:42:20 <jungleboyj> mriedem: Rage gauge?
16:42:20 <thingee> #agreed 30 days for infra hosted ci's to prove being stable - any failures will be reviewed
16:42:29 <mriedem> the patented sdague rage gauge (tm)
16:42:29 <thingee> mriedem, dansmith again thanks
16:42:35 <thingee> #topic open discussion
16:42:37 <thingee> flip214: hi
16:43:15 <smcginnis> * crickets * :)
16:43:25 <thingee> flip214: are
16:43:26 <thingee> flip214: you
16:43:28 <thingee> flip214: ready?
16:43:34 <thingee> :)
16:43:41 <flip214> thingee: yes
16:43:46 <jungleboyj> * chirp chirp chirp *
16:43:48 <thingee> whew ok go ahead
16:44:03 <flip214> as you can probably guess, it's about the DRBD driver...
16:44:11 <flip214> the driver got renamed, like so many others.
16:44:15 <flip214> because of no CI.
16:44:38 <flip214> http://lists.openstack.org/pipermail/openstack-dev/2015-February/057585.html is Clarks reply to my "CI via infra for the DRBD Cinder driver" ML post
16:44:46 <thingee> clarkb: ^
16:44:56 <flip214> one of the gerrit links he posted is https://review.openstack.org/#/c/121528/
16:45:13 <clarkb> flip214: thingee please read jeblairs followup though
16:45:14 <flip214> which still isn't merged.
16:45:28 <clarkb> flip214: yes it hasn't merged because it hasn't been ap priority for the reasons in jeblairs followup
16:45:44 <clarkb> flip214: it is also no a blocker
16:45:46 <flip214> AFAIK these changes are *needed* to get a multi-node testing environment
16:45:51 <thingee> flip214: I might do a cut today and there hasn't been time to let the ci run to verify things.
16:45:51 <clarkb> flip214: no
16:45:54 <flip214> oh, it is not?
16:45:59 <clarkb> flip214: no...
16:46:02 <flip214> okay.
16:46:11 <clarkb> and even if it were I suggest getting a single node test running first
16:46:14 <clarkb> against your devstack plugin
16:46:17 <flip214> so, things like the LVM driver with iSCSI support can already be tested?
16:46:21 <clarkb> thats step 0 and aiui has not been started
16:46:34 <flip214> clarkb: yeah, that's the second point.
16:46:36 <thingee> flip214: I remember talking to clarkb that it was recommended to not do a multi-node setup for now.
16:46:51 <thingee> flip214: after we spoke when i was closing kilo-3
16:46:51 <clarkb> yes, please set up a single node test against a devstack plugin that configures your driver
16:47:03 <clarkb> then you can worry about multinode
16:47:13 <flip214> okay, that should be even easier.
16:47:17 <clarkb> and potential zuul changes for voting (though this was discussed earlier and I don't think its that urgent)
16:47:40 <flip214> I'll try to talk to #infra tomorrow about that (it's 18:47 here)
16:47:54 <thingee> flip214: keep in mind. I might do a cut today
16:47:58 <thingee> for RC.. that would be it.
16:48:03 <flip214> my second point is that I'm trying to make devstack work for our development needs...
16:48:27 <flip214> thingee: there's no real chance that I can talk #infra into doing that in the next 3 hours, is there?
16:48:36 <clarkb> flip214: doing what?
16:48:44 <flip214> and the 1-2 days I try to invest in devstack, every time something else breaks.
16:49:07 <flip214> currently I get "AttributeError: 'NoneType' object has no attribute 'url'" for the "python update.py /opt/stack/keystone" command...
16:49:25 <thingee> flip214: might need to do a fresh vm
16:49:29 <flip214> clarkb: sorry, which of my sentences is your question about?
16:49:30 <smcginnis> flip214: That's been fixed.
16:49:33 <thingee> flip214: I was getting that too
16:49:39 <clarkb> flip214: what are you looking for infra to do in the next 3 hours?
16:49:40 <eikke> flip214: that should be fixed
16:49:47 <flip214> thingee: I'm *always* running a fresh vm.
16:49:50 <flip214> that's the point.
16:49:51 <geguileo> flip214: I think that's because you have the clone set to no
16:50:06 <flip214> clarkb: running my devstack DRBD plugin to configure the DRBD driver.
16:50:12 <geguileo> Ok, then that's not it :(
16:50:20 <clarkb> flip214: are those changes proposed to infra yet?
16:50:39 <flip214> clarkb: I talked to you, and then (yes, my fault) waited for the reviews to get merged.
16:50:53 <flip214> I've got a git tree in https://github.com/LINBIT/devstack
16:51:07 <flip214> but nothing pushed against gerrit yet.
16:51:12 <clarkb> yes we need it in gerrit
16:51:13 <flip214> although the change is minimal.
16:51:17 <clarkb> so likely won't happen in the next 3 hours
16:51:19 <thingee> ok I think some of this can be taken to #openstack-infra
16:51:35 <flip214> so reviewing should take only a few minutes
16:51:46 <flip214> thingee: okay, I'll move to there. time's up anyway.
16:52:21 <thingee> flip214: looking forward to hearing the story you mentioned when we meet up at the summit
16:52:33 <thingee> anyone else have a topic before I close the meeting?
16:52:49 <asselin_> this bug might be affecting others: https://bugs.launchpad.net/tempest/+bug/1440227
16:52:50 <openstack> Launchpad bug 1440227 in tempest "encrypted volume tests don't check if a volume is actually encrypted" [Undecided,New]
16:52:51 <flip214> thingee: thanks.
16:53:33 <thingee> oh and I just wanted to mention to folks here, don't post to the ML about your CI results. anteaya has had to talk to some of you.
16:53:52 * jungleboyj looks sheepish
16:54:00 <thingee> feel free to bug us on #openstack-cinder. this just doesn't work well with everyone's CI. it starts a tend and then it comes in masses
16:54:08 <jungleboyj> thingee: On that note, the DS8k CI has been fixed.
16:54:09 <asselin_> take a look and see if your driver is (incorrectly) passing that test
16:54:26 <thingee> I'll have something more formal in the wiki about this later. just didn't have time yet. apologies anteaya
16:54:53 <thingee> and yes please review the bug asselin_ mentioned https://bugs.launchpad.net/tempest/+bug/1440227
16:54:54 <openstack> Launchpad bug 1440227 in tempest "encrypted volume tests don't check if a volume is actually encrypted" [Undecided,New]
16:54:58 <thingee> with your driver
16:55:03 <thingee> mtreinish: ^
16:55:15 <thingee> anything else?
16:55:20 * thingee has to catch a plane
16:55:42 <thingee> annnnnnd we're done, I'll be at pycon the rest of this week and here and there doing reviews.
16:55:46 <thingee> #endmeeting
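
A footnote on clarkb's "step 0" for the DRBD driver (16:46-16:50 above): a single-node devstack run that enables a devstack plugin which configures the driver. The local.conf fragment below is a hypothetical sketch of that setup; the plugin name and repository URL are placeholders (the LINBIT tree flip214 mentions would still need to be packaged as a devstack plugin and pushed to gerrit), and only the enable_plugin syntax itself is standard devstack.

    [[local|localrc]]
    # enable_plugin <name> <git repo> [branch]
    # The repo's devstack/plugin.sh would then be responsible for wiring
    # the DRBD backend into cinder.conf during stack.sh.
    enable_plugin drbd-devstack https://git.example.org/drbd-devstack-plugin
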