15:00:37 #startmeeting manila
15:00:39 Meeting started Thu Feb 25 15:00:37 2016 UTC and is due to finish in 60 minutes. The chair is bswartz. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:40 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:42 The meeting name has been set to 'manila'
15:00:46 hi
15:00:47 Hi
15:00:47 hello all
15:00:48 hello o/
15:00:49 hi
15:00:49 hello
15:00:52 hello
15:00:54 hi
15:01:03 hi
15:01:06 \o
15:01:09 hi
15:01:19 someone added an agenda item without putting their name on it
15:01:21 hi
15:01:25 * bswartz uses wiki history....
15:01:29 hi
15:01:38 hi
15:01:44 ah it was cknight
15:01:54 hi
15:02:06 bswartz: I knew I couldn't hide.
15:02:17 bswartz, You didn't recognize his attention to detail?
15:02:21 bswartz: Just sounding the alarm.
15:02:27 markstur_: :-)
15:02:34 * bswartz had his suspicions
15:03:06 okay it's a short agenda today
15:03:19 last week of M-3 so we all know what we need to be doing
15:03:27 #topic OpenStack bug smash event
15:03:36 ok
15:03:51 toabctl and I just wanted to highlight that there will be a bug smash event
15:04:11 #link https://etherpad.openstack.org/p/OpenStack-Bug-Smash-Mitaka
15:04:12 oh this is the cross-project event?
15:04:21 for germany it will be at the suse office
15:04:32 bswartz: yes. cross project cross country
15:04:49 I do remember reading about this
15:05:25 what do we need to do to get attention on manila?
15:05:27 toabctl: we need to go through the bug list and mark the ones we want to work on
15:05:42 bswartz: toabctl and I will work on manila bugs
15:05:45 yes. and add the bug links to
15:05:48 and hopefully some more
15:05:48 #link https://etherpad.openstack.org/p/OpenStack-Bug-Smash-Mitaka-Bugs
15:06:06 we will be using LP to target and prioritize bugs
15:06:20 bswartz: ok that's fine
15:06:21 since that event happens between FF and RC1, we can use the mitaka-rc1 target
15:06:32 bswartz: ok cool
15:06:46 bswartz: fine for me.
I may just copy the link to mitaka-rc1 to the etherpad then
15:06:59 okay
15:07:02 and everyone is invited to join us :)
15:07:03 hi
15:07:11 I doubt there is anything targeted yet -- it should all be targeted at M-3 right now
15:07:29 bswartz: ok we will see
15:07:39 although we will probably retarget nearly all of them because people are too busy to fix bugs this week
15:08:04 thanks mkoderer, toabctl
15:08:11 can people participate remotely?
15:08:27 bswartz: I'll be remote, too (for private reasons)
15:08:27 like those of us in RTP -- is there a way we could join in on those days?
15:08:28 bswartz: toabctl will be remote for instance :)
15:09:07 so if we see that many people want to join we can set up telco bridges or coordinate over irc
15:09:13 let's just use the usual #openstack-manila channel for communication. and if needed we can set up a hangout. I don't expect too many participants outside of the usual group...
15:09:21 toabctl: +1
15:09:22 okay
15:09:40 anyone interested check out that etherpad and use the IRC channel on those days to get linked in
15:10:04 #topic Status of new drivers
15:10:06 mkoderer: Thanks for setting that up. Sounds interesting.
15:10:39 bswartz: You wanna lead this one? I think I said it all in the agenda.
15:10:47 The short version is that merge rates are low due to the rechecks.
15:10:52 So we switch to the new drivers as the voting jobs now, or we accept that many of our Mitaka priorities won't make the deadline.
15:10:55 well I'm a bit confused
15:11:20 the new drivers -- specifically LXD -- aren't ready to be voting
15:11:35 we could make LVM voting and take generic out of no-share-servers jobs
15:11:48 no-share-servers has been non-voting for ages
15:11:50 but it seems to have concurrency issues at the moment
15:11:54 errr
15:12:12 okay that's a problem for different reasons
15:12:23 Anything that can increase the probability of successful test runs helps.
15:12:24 we have 2 voting jobs currently
15:12:37 gate-manila-tempest-dsvm-neutron and gate-manila-tempest-dsvm-neutron-multibackend
15:12:54 I agree they're flaky and need replacement
15:13:05 Both of those use generic driver with DHSS = True?
15:13:10 why do we require gate-manila-tempest-dsvm-neutron to be voting if gate-manila-tempest-dsvm-neutron-multibackend runs more tests?
15:13:11 ZFS, vote for stability!
15:13:22 but nothing we have gives us the same coverage at higher reliability
15:13:27 cknight: yes
15:13:36 ganso: +1 Let's use one or the other.
15:13:54 vponomaryov: +1 for ZFS as well, if it's ready.
15:14:10 cknight: unit tests TBD
15:14:13 we could make the single backend one nonvoting and add the LVM no-share-servers job as voting
15:14:26 1/3 of my rechecks are because multibackend passes, but non-multibackend fails, then on recheck it's the opposite
15:14:31 bswartz: +1
15:14:50 so I'm in favor of making the new drivers voting, but only after they've been merged and have a track record of consistent results
15:14:57 bswartz: Sounds good. That actually improves our coverage.
15:15:04 there's simply no time for ZFS and LXD to do that in Mitaka
15:15:23 bswartz: from what I've seen, the generic-no-share-servers job has been more reliable than LVM
15:15:38 bswartz: Maybe not in M.3, but we could lessen the pain by using them during M.4.
15:15:39 ganso: that's my intuition too
15:15:39 ganso: generic driver more reliable?
15:15:49 vponomaryov: generic driver DHSS=False
15:15:59 ganso: LVM just has one bug
15:16:06 tbh I don't feel very comfortable with having zfs and lxd for gating because both are ubuntu-only technology afaics. but I guess that's another discussion ...
15:16:06 LVM definitely has at least one bug that triggers randomly with high concurrency
15:16:33 wishlist: ZFS driver be tested on multi-AZ, multi-backend environment in the gate
15:16:33 I think creating a scenario job with generic-no-share-servers is a good idea, I don't think we will be able to achieve a stable scenario using generic in DHSS=True
15:16:43 toabctl: devstack part is ubuntu-based
15:16:49 toabctl: +1, and yes another discussion
15:16:53 toabctl: ZFS is not tied to ubuntu
15:16:57 toabctl: you are free to add some other distro support
15:17:08 it's just a support question IMHO
15:17:52 I'm less sure about LXD --- I know it's a canonical created project, but I assumed it had cross-distro support
15:18:00 unblocking gate with these is fine but we ought to have truly generic solutions working by the time of release
15:18:10 re. zfs, whether it's possible for some other distro to support it doesn't change the fact that today there is only one distro that does.
15:18:13 tbarron: +1
15:18:13 lxd isn't supported in rhel, suse, helion
15:18:13 ZFS is DHSS=False, we may need at least one DHSS=True voting job
15:18:34 lxc yes
15:18:35 I think gating is fine but changing one of these to the reference driver implementation would be problematic
15:18:46 jcsp: actually, two
15:19:08 jcsp: debian and ubuntu
15:19:23 even debian doesn't have lxd. and zfs - still the license issue afaik. anyway. another discussion.
15:19:32 vponomaryov: helion guy told me yesterday debian doesn't have kernel support for zfs
15:19:51 tbarron: driver supports fuse too
15:19:52 and that they wouldn't ship it
15:20:05 also we should note that the ZFS driver itself is pure python and will run on any distro -- it just needs SSH access to something with ZFS
15:20:33 the question is what would be shipped as a complete solution
15:20:46 not whether the driver could connect to something external
15:20:57 yeah I agree
15:21:03 we're mixing together multiple issues now though
15:21:06 but this is a different topic than unblocking gate now
15:21:12 bswartz: +1
15:21:13 cknight is concerned about the emergency we have in the next week
15:21:24 ZFS is NOT the answer to that problem
15:21:28 neither is LXD
15:21:39 neither is generic drv
15:21:42 those we will look at using for newton
15:22:13 so vponomaryov what do you propose?
15:22:28 with this instability I am very concerned about every patch that we keep rechecking; if they end up adding more instability, we never know because we are always rechecking and merging when jenkins finally +1s
15:22:28 bswartz: fix LVM drv instability bug
15:22:34 generic has been good enough for the last 2 years -- it gives good coverage
15:22:47 I agree the flakiness is unacceptable
15:22:54 but I worry about lowering test coverage
15:23:03 Is "good enough" what we want for the future of Manila?
15:23:08 how do we ensure regressions don't slip in?
15:23:18 dustins: absolutely not
15:23:22 dustins: no, but the current situation isn't good enough
15:23:37 we spent most of Mitaka trying to fix this problem and we will continue until it's really fixed
15:23:39 Indeed, between a rock and a hard place :/
15:24:04 I mean one option is to turn off the gate entirely and merge whatever we feel like -- I don't think that's a reasonable approach
15:24:16 I think there are concurrency issues with our tests, but it's entirely possible that they are in Manila itself, in which case merging stuff by doing rechecks doesn't help anyone.
15:24:27 so where is the middle ground that gives us test coverage we're comfortable with, and reliability that's tolerable
15:25:04 bswartz: We have two generic jobs that have a lot of overlap. Why not make just one voting?
15:25:15 cknight: that's a good starting point
15:25:29 just one generic driver job is more likely to pass than 2
15:25:53 however only 1 voting tempest job seems light to me
15:26:07 bswartz: Exactly. At the same time, we can fix the LVM issue and make it voting ASAP. That would give us DHSS=False coverage.
15:26:08 in past releases we had 4 voting tempest jobs
15:26:30 cknight: which order do you propose?
15:26:40 make LVM voting immediately and fix the bug after?
15:26:42 bswartz: good question
15:26:48 vponomaryov: how long to get LVM solid?
15:26:59 we have 3 jobs with generic driver in DHSS=True mode and can make all of them non-voting and merge stuff only when some of them passed
15:26:59 so, one voting LVM job as DHSS=False
15:26:59 and for DHSS=True, make "one of three passed in check queue" the requirement
15:26:59 but first - fix the instability of LVM
15:26:59 * vponomaryov where are all? too quiet
15:26:59 besides that instability, I am facing access rules concurrency only on LVM.
The same code runs on generic-no-share-servers and does not have the problem
15:27:01 it's more likely to get fixed if it's a voting job
15:27:12 so there are 2 bugs
15:27:14 irc lag
15:27:17 whoa lots of irc lags
15:27:22 yeah IRC lag
15:27:41 vponomaryov: interesting. I didn't know voting could be an OR operation among multiple jobs.
15:27:58 I don't think vponomaryov has to be the one to fix the LVM bug
15:28:04 anyone can fix it
15:28:17 Any volunteers?
15:28:44 is the bug captured in LP yet?
15:28:58 people don't know what they're signing up for
15:29:10 it's a tricky concurrency bug
15:29:11 cknight: I didn't mean the technical side, I meant exactly us - people
15:29:22 vponomaryov: got it
15:29:26 cknight: 3 non-voting and press "workflow" only when we know it works
15:29:28 I am working on fixing one of the bugs, I don't know the other yet
15:29:51 I see plenty of lvm-group errors when shares are being created, don't know if that is expected or not
15:29:55 even on tests that pass
15:30:02 vponomaryov: Does the LVM bug still happen at concurrency = 1?
15:30:33 it's hard to tell currently because LVM jobs also fail due to the neutron+postgres bug
15:31:08 perhaps before it's made voting, also remove either neutron or postgres from that job
15:31:10 cknight: didn't dig into the concurrency case, but I know that it just fails to delete a share that refers to an absent volume
15:31:11 the bug I am working on does not happen at concurrency = 1, but it is a major bug, access rules get overwritten by other access rule calls
15:31:40 ganso: yep that's bad. I had to use locking around my new access rule code.
15:31:59 cknight: I already added locking, but it still is not solving the problem, this is worrying me
15:32:16 ganso: external locking?
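The failure mode ganso describes is a classic read-modify-write lost update: two concurrent access-rule calls each read the rule list, change their own copy, and write it back, so one write clobbers the other. The following stand-alone sketch (hypothetical names, not Manila's actual driver code) shows why a lock around the whole read-modify-write fixes it within one process:

```python
import threading

# Shared state standing in for a share's access-rule list (hypothetical).
rules = []
_rules_lock = threading.Lock()

def allow_access_unsafe(rule):
    # Read-modify-write without a lock: a concurrent caller may read the
    # same snapshot and then overwrite this caller's addition (lost update).
    snapshot = list(rules)
    snapshot.append(rule)
    rules[:] = snapshot

def allow_access_locked(rule):
    # Holding the lock makes the whole read-modify-write atomic.
    # Note: threading.Lock only serializes callers within one process; if
    # several manila-share workers touch the same backend, a file-based
    # ("external") lock such as oslo.concurrency's
    # lockutils.synchronized(..., external=True) would be needed, which is
    # presumably what vponomaryov's "external locking?" question is about.
    with _rules_lock:
        snapshot = list(rules)
        snapshot.append(rule)
        rules[:] = snapshot
```

With the locked variant, N concurrent callers always leave N rules in the list; the unsafe variant can silently drop some under contention, which matches the "rules get overwritten" symptom.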
15:32:20 alright let's save the LVM bug for another topic
15:32:25 I want to wrap up this topic
15:32:31 vponomaryov: that's the next thing I am going to try
15:32:58 is anyone opposed to simply decreasing our 2 voting tempest jobs to 1?
15:33:02 I am ok making only one voting job
15:33:09 for the moment
15:33:10 ok too
15:33:13 so removing gate-manila-tempest-dsvm-neutron ?
15:33:19 yes
15:33:21 yes
15:33:21 Don't like it, but it's temporary. +1
15:33:24 that will make forward progress, and we can sort out the mess related to other new drivers in the meantime
15:33:24 with less coverage one
15:33:26 temp. +1
15:33:31 +1
15:33:36 +1
15:33:37 For now, +1
15:33:45 There has to be a large focus on these test failures during M.4.
15:33:46 +1
15:33:57 cknight: +1
15:33:59 +1
15:34:04 so for the remainder of M-3 the only voting tempest job will be gate-manila-tempest-dsvm-neutron-multibackend
15:34:07 +1
15:34:08 in the best case we should have a "scenario" job voting ))
15:34:17 vponomaryov: +1 :)
15:34:29 vponomaryov: yeah
15:34:29 and we will spend M4 making everything bulletproof that we can
15:34:32 Thanks, Ben. And I would propose that no Newton features are merged at all until this is 100% behind us.
15:35:02 cknight: +10000000000000000000000
15:35:03 We should have the new drivers solid by then. And the test bugs fixed.
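For context, in the Zuul v2 project-config of that era, demoting a job from gating was a small edit to zuul/layout.yaml; a hypothetical fragment (surrounding entries and exact file path omitted) might look like:

```yaml
# Hypothetical zuul/layout.yaml fragment: the job keeps running in the
# check queue but no longer votes, so it cannot block merges.
jobs:
  - name: gate-manila-tempest-dsvm-neutron
    voting: false
```

This matches the decision above: the multibackend job stays voting, while the single-backend job still reports results for information.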
15:35:06 cknight: +1
15:35:12 cknight: wow, it could not happen at all ))
15:35:22 vponomaryov: so feature freeze forever
15:35:27 ))
15:35:29 it sounds like we have non-technical issues to address with LXD and ZFS
15:35:36 toabctl: feature winter
15:36:03 I'm confident we can find a set of drivers that give us good coverage and are reliable
15:36:52 we may need to spend significant time retooling our testing workflow, but the alternative is too horrible to contemplate
15:37:24 btw, don't know if any of you have noticed, but all current jobs in Zuul (manila ones at least) are stuck
15:37:37 ganso: yep
15:37:42 Anyone know why?
15:37:43 is anyone surprised? ))
15:37:52 vponomaryov: lol
15:37:52 I don't actually agree with a full feature freeze in the meantime
15:38:16 bswartz: what do you mean?
15:38:16 but I would propose a moratorium on significant features
15:38:42 another concern is: should we prioritize our current features?
15:38:42 ganso: solving this problem completely could easily take another 6 months
15:38:47 bswartz: yes
15:38:58 ganso: they are already prioritized on launchpad
15:39:08 ganso: that should dictate priority of reviews
15:39:12 we can't ALL be working on this problem, so others in the community should be able to make progress on smaller things
15:40:03 I really hope it doesn't take that long
15:40:07 bswartz: These are tricky bugs, and there are likely more than one. The more people working on them, the better.
15:40:24 but we were in this spot 6 months ago and I was optimistic we'd have it fixed before christmas
15:40:31 and here we are
15:40:39 bswartz: If we all focus on them in M.4, we have a chance to make things much better.
15:41:19 cknight: are you suggesting that we ignore the bugs now and just focus on features merging?
15:41:22 okay we have a plan for the next week at least
15:41:35 let's move on to the other topic that came up
15:41:40 ganso: no, features before the deadline. then big focus on test bugs.
15:41:42 #topic LVM concurrency bug
15:41:49 cknight: exactly
15:42:00 so we started discussing this
15:42:10 I agree it would be nice to make LVM voting asap
15:42:29 since it has a reasonably good record, except for at least one concurrency problem
15:42:43 missed that - do we have a bug report for "the lvm bug" ?
15:42:55 toabctl: that was my next question
15:43:03 is this problem reported on LP or is it just in our heads
15:43:27 personally I've observed it and spent a few hours poking around
15:43:39 ok we should really track the issues in LP
15:43:46 bswartz: So it doesn't occur at concurrency = 1?
15:43:52 otherwise every dev will start from the beginning
15:43:53 initially I was under the assumption the logic flaw was in our tempest plugin but I'm less sure of that now
15:44:18 cknight: personally I haven't observed it at concurrency=1
15:44:26 mkoderer: +100
15:44:41 but I agree with vponomaryov: that testing with concurrency=1 is cheating
15:44:57 bswartz: OK, I'm just wondering how hard it is to repro.
15:45:04 because real users don't single thread their usage of manila
15:45:12 cknight: use rally for LVM
15:45:15 bswartz: production environments are not concurrency=1
15:45:20 cknight: to perform load, that's all
15:45:41 vponomaryov: is it reproducible in rally?
15:45:51 cknight: OpenStack Rally has scenarios for Manila's DHSS=False driver mode
15:46:16 mkoderer: I suspect it, and sure for 99.9%
15:46:19 okay so I think we need to assume the bug isn't reported in LP yet
15:46:31 I need a volunteer to reproduce the bug and file a bug report
15:46:53 the one to fix it?
15:46:55 then we can make progress on fixing it with multiple eyes
15:47:17 vponomaryov: no, just to file the bug report with enough interesting details that someone can investigate
15:47:17 This is for the concurrency bug for the LVM driver, yeah?
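As a concrete example of the Rally approach vponomaryov suggests above, a task file along these lines would put concurrent create/delete load on a DHSS=False backend such as LVM (the scenario name and option layout follow Rally's Manila plugin conventions from memory and should be checked against the installed Rally version):

```yaml
# Hypothetical Rally task: 20 create-and-delete iterations, 4 running
# concurrently, to shake out concurrency bugs in the backend.
ManilaShares.create_and_delete_share:
  - args:
      share_proto: "nfs"
      size: 1
    runner:
      type: "constant"
      times: 20
      concurrency: 4
```

Raising `concurrency` above 1 is the point: the bug reportedly never shows at concurrency=1.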
15:47:40 dustins: in our case yes
15:47:50 I would volunteer myself but I've managed to overcommit myself in the short term
15:48:14 bswartz: If you can give me a quick overview of how to set it up, I'll volunteer
15:48:29 thanks dustins
15:48:48 dustins: you can look at what the LVM CI job does
15:48:56 #action dustins will file the bug and bswartz will triage with highest possible priority
15:49:04 vponomaryov: I'll do that, thanks for the suggestion!
15:49:12 yeah I'll get with dustins and show him where it occurs
15:49:17 in CI
15:49:24 Sounds like a plan
15:49:31 the trick is how to separate those failures from the postgres+neutron failures
15:49:40 okay
15:49:46 #topic open discussion
15:49:50 anything else for today
15:49:53 weren't we going to disable neutron for the LVM job?
15:49:56 bswartz: can't we switch to mysql for that job?
15:50:17 ganso: +1, bswartz?
15:50:24 cknight: or maybe create another job without neutron and use mysql
15:50:34 and make that one voting
15:50:37 without everything that is not used
15:50:40 cknight: switching to mysql is absolutely an option, but after neutron fixes their bug I'd like to have gate coverage for postgres
15:51:01 ganso: I want to but I suspect tempest depends on neutron currently
15:51:02 bswartz: good point
15:51:08 the LVM job requires only Keystone from OpenStack projects
15:51:28 -1 to only voting on mysql
15:51:38 has anyone attempted to run tempest without neutron?
15:51:50 it's on my todo list for this afternoon
15:51:52 bswartz: it should be disabled on the infra side
15:51:52 how can there be neutron failures if the lvm job just needs keystone?
15:51:58 bswartz: in the job config
15:52:06 bswartz: no but I can do that tomorrow
15:52:08 bswartz: it used to work with nova-network
15:52:16 toabctl: it is installed by devstack by default
15:52:21 vponomaryov: I know how to make jenkins do it -- I'm not sure it will work though
15:52:25 toabctl: and installation of the DB for it fails
15:52:31 I'd like to test it on my system first
15:52:41 does nobody run tempest on their own dev systems?
15:52:50 bswartz: sure I do
15:52:55 bswartz: flies with ZFS ))
15:53:03 bswartz: I do
15:53:07 okay good
15:53:14 do you all use neutron or not?
15:53:19 bswartz: my tea is still hot when the tempest run ends
15:53:22 bswartz: I always do :\
15:53:34 bswartz: yes with neutron
15:53:50 bswartz: just give me the action item. I will have a look tomorrow
15:53:51 bswartz: I have never tried running without neutron
15:53:55 okay so I'm still unconvinced that it's possible to run tempest without either neutron or n-net
15:54:09 and n-net is NOT an option
15:54:25 bswartz: if we disable dynamic creation of tenants - should be ok
15:54:32 bswartz: it is the quota tests in our case
15:54:51 vponomaryov: okay so how do you skip the quota tests
15:55:07 and can those tests be modified to work without neutron?
15:55:10 bswartz: https://review.openstack.org/#/c/281477/
15:55:25 awesome
15:55:28 ^ adds the possibility to just disable them
15:55:41 mkoderer: so you can try this out and let me know?
15:55:41 it has a jenkins -2
15:55:47 bswartz: yep
15:55:52 okay
15:56:00 markstur_: it is because CI was broken
15:56:05 bswartz: you'll get my feedback tomorrow EOB
15:56:16 alright
15:56:20 anything else for today?
15:56:31 bswartz: One more topic on the agenda.
15:56:38 vponomaryov, so we can't fix the gate until we fix the gate?
15:56:42 bswartz: Mid-cycle location.
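Returning to the neutron-less tempest idea discussed above: since the LVM job needs only Keystone from OpenStack, a pared-down devstack configuration could drop neutron (and nova, glance, cinder) entirely. A hypothetical local.conf sketch follows; the service names are from the devstack and manila-plugin conventions of that era and would need verification against the branch in use:

```shell
[[local|localrc]]
# Hypothetical minimal service list for an LVM-driver tempest run:
# keystone, a database and message queue, manila's api/scheduler/share
# services, and tempest -- no neutron, nova, glance, or cinder.
ENABLED_SERVICES=key,mysql,rabbit,tempest,m-api,m-sch,m-shr
MANILA_ENABLED_BACKENDS=lvm
```

Note this uses mysql as ganso suggests; bswartz's point stands that postgres coverage should return once the neutron+postgres bug is fixed.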
15:56:49 oh it was added late
15:56:51 markstur_: gate is partially fixed
15:56:54 markstur_: Exactly :-(
15:56:59 #topic Next Manila mid-cycle location
15:57:02 markstur_: only postgres patches are failing
15:57:12 markstur_: but at this moment gate is stuck
15:57:16 markstur_: not exactly broken
15:57:25 yes I asked everyone to find out if they could attend a european midcycle this summer
15:58:06 cknight and I can
15:58:16 anyone else have a definite yes or no?
15:58:20 Looking pretty unlikely on my end
15:58:28 definite no =(
15:58:32 unlikely
15:58:42 ganso: could you attend a US-based one?
15:58:48 bswartz: no as well
15:59:01 Fortunately there's always Hangouts, etc
15:59:05 if you're going to be remote no matter what then the location should matter less -- it's just a question of time zones then
15:59:25 bswartz: f2f mid-cycles are usually much more productive
15:59:38 but sure if nobody can make it ... it makes no sense
15:59:44 mkoderer: I agree
16:00:00 mkoderer: absolutely
16:00:04 but some employers are stingy about travel
16:00:16 bswartz: I know
16:00:27 therefore we have to find a way to work without face to face time
16:00:37 okay we're over our time
16:00:38 bswartz: some? most!
16:01:03 looks like you have a bit more information mkoderer -- you should still ask other cores directly for answers
16:01:16 #endmeeting