16:00:29 <jungleboyj> #startmeeting Cinder
16:00:30 <openstack> Meeting started Wed Oct 18 16:00:29 2017 UTC and is due to finish in 60 minutes. The chair is jungleboyj. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:33 <openstack> The meeting name has been set to 'cinder'
16:00:36 <Swanson> hello
16:00:40 <bswartz> .o/
16:00:47 <smcginnis> o/
16:00:52 <xyang1> Hi
16:00:57 <jungleboyj> Courtesy ping: jungleboyj DuncanT diablo_rojo, diablo_rojo_phon, rajinir tbarron xyang xyang1 e0ne gouthamr thingee erlon patrickeast tommylikehu eharney geguileo smcginnis lhx_ lhx__ aspiers
16:00:58 <eharney> hi
16:01:03 <jungleboyj> xyang1: Good to have you here!
16:01:09 <e0ne_> hi
16:01:14 <geguileo> hi! o/
16:01:21 <xyang1> jungleboyj: thanks:)
16:01:22 <jungleboyj> @!
16:01:25 <tommylikehu> hi
16:01:28 <tommylikehu> hi xyang1
16:01:33 * jungleboyj misses pewp bot!
16:01:35 <tbarron> hi
16:01:41 <jungleboyj> hemna: ^^^ :-)
16:02:01 <jungleboyj> xyang1: We going to be seeing more of you again?
16:02:14 <xyang1> jungleboyj: hope so
16:02:25 <jungleboyj> xyang1: Me too! You have been missed.
16:02:43 <jungleboyj> tbarron: Good to see you as well.
16:02:46 <patrickeast> o/
16:02:48 <xyang1> jungleboyj: thanks! I miss you guys too
16:02:54 <jungleboyj> :-)
16:02:56 <tbarron> jungleboyj: :)
16:03:53 <jungleboyj> Ok, let's get started. We have the usual suspects.
16:03:53 <lhx__> hi
16:04:07 <jungleboyj> #topic announcements
16:04:23 <jungleboyj> As always, please keep an eye on the review priorities:
16:04:34 <jungleboyj> #link https://etherpad.openstack.org/p/cinder-spec-review-tracking
16:05:03 <jungleboyj> The Hedvig driver should have CI running soon. I need to look through that patch. Would be good to get eyes on it once the CI is running.
16:05:29 <jungleboyj> Have seen some people pick up the review pace since last week's plea. Thank you. Appreciate any help that can be given.
16:05:39 <jungleboyj> ZuulV3 migration ....
16:05:42 <jungleboyj> Wheeeee
16:05:54 <jungleboyj> So, I have been seeing mostly successful runs.
16:06:09 <jungleboyj> Sean, want to give a quick update on what issues the release team is seeing?
16:06:17 <smcginnis> Inspur also looks good. At least as far as having a reporting CI.
16:06:17 <jungleboyj> smcginnis: ^^
16:06:22 <e0ne> jungleboyj: did we face any issues with the new zuul?
16:06:38 <jungleboyj> smcginnis: Ah, thank you. I will look at Inspur as well.
16:06:54 <smcginnis> Yeah, there are various changes with zuulv3 that are tripping up the post-release jobs. We should have them sorted out soon, but can't really release anything right now.
16:07:18 <smcginnis> So folks can still propose releases, we just might not be able to process them immediately.
16:08:03 <bswartz> smcginnis: boo
16:08:12 <bswartz> how will we do the milestone?
16:08:17 <jungleboyj> :-)
16:08:28 <jungleboyj> bswartz: Hey, way to segue.
16:08:29 <smcginnis> Should be OK by tomorrow. I hope.
16:08:40 * bswartz hopes too
16:08:51 <smcginnis> Actually, I think we have the final fix going through right now.
16:08:57 * smcginnis keeps his fingers crossed.
16:09:00 * jungleboyj crosses my fingers
16:09:17 <jungleboyj> So, tomorrow is Milestone 1.
16:09:35 <jungleboyj> I will be proposing our milestone branch later today and it will get merged whenever things are working.
16:09:43 <jungleboyj> Any concerns there?
16:10:17 <jungleboyj> Good.
16:10:26 <jungleboyj> So, that was all I had for announcements.
16:10:34 <jungleboyj> #topic Sydney
16:10:48 <jungleboyj> So, I have the etherpad to record Sydney info:
16:10:58 <jungleboyj> #link https://etherpad.openstack.org/p/cinder-sydney-information
16:11:18 <jungleboyj> Is it really just smcginnis, diablo_rojo, and me going?
16:11:56 <jungleboyj> If so, that would explain why the foundation was asking Lenovo if we were sure we didn't want more people to go.
16:12:06 <bswartz> there will be a few netapp ppl
16:12:13 <bswartz> not me though
16:12:18 <_alastor_> jungleboyj: I'll be there :)
16:12:38 <e0ne> I'll miss this Summit :(
16:12:42 <smcginnis> bswartz: Staying home for another baby? :)
16:12:42 <jungleboyj> bswartz: erlon or gonso?
16:12:51 <jungleboyj> e0ne: :-(
16:12:51 <lhx__> not me either :(
16:13:14 <jungleboyj> _alastor_: Great. Will be good to see you again.
16:13:37 <jungleboyj> Thanks for updating the etherpad.
16:13:40 <tommylikehu> small party then
16:13:48 <jungleboyj> tommylikehu: Not you either?
16:13:55 <Swanson> No way am I going down there to be attacked by a poison koala.
16:14:01 <smcginnis> Really too bad how many can't make it this time. :{
16:14:04 <tommylikehu> jungleboyj: I am not able to
16:14:10 * bswartz is afraid of the drop bears
16:14:11 <smcginnis> Swanson: Drop bears. ;)
16:14:12 * jungleboyj shakes my head
16:14:17 <smcginnis> bswartz: Hah!
16:14:17 <jungleboyj> bswartz: OMG!
16:14:35 <jungleboyj> Good thing I hear they only hate people from the east coast of the US.
16:14:37 <e0ne> It's pretty hard to get budget approved both for the PTG in the USA and the Summit in Sydney
16:14:55 <jungleboyj> e0ne: Understood. Well, just wanted to make sure.
16:15:03 <tommylikehu> e0ne: yes, very hard
16:15:05 <smcginnis> Next PTG is confirmed to be in Dublin the week of Feb 26.
16:15:14 <smcginnis> Should be a little cheaper/easier for some.
16:15:15 <bswartz> they don't attack people with australian accents, so start working on your fake aussie accent now
16:15:21 <lhx__> e0ne, +1
16:15:22 <e0ne> smcginnis: sounds good
16:15:28 <jungleboyj> We have a couple of forum sessions scheduled. I will add that info to the etherpad.
16:15:38 <jungleboyj> Still waiting for the on-boarding room to be scheduled.
16:15:52 <jungleboyj> bswartz: Gooday Mate
16:16:09 <jungleboyj> smcginnis: Where did you see that confirmed?
16:16:33 <Swanson> malonga gilderchuck
16:17:10 <smcginnis> jungleboyj: Thierry confirmed in the -dev channel this morning.
16:17:20 <jungleboyj> smcginnis: Nice!
16:17:30 <jungleboyj> That is going to be great. Hope everyone can make it.
16:18:05 <jungleboyj> So, I was going to ask about planning a Cinder event at the Summit but that seems unnecessary given the lack of people going.
16:18:38 <smcginnis> Maybe we can see who actually ends up showing up and try to arrange something informal via IRC.
16:19:23 <jungleboyj> Sure. We can just play it by ear.
16:19:43 <e0ne> it's only $340 for flight to Dublin and back - I like it
16:19:50 <_alastor_> jungleboyj: I'm already trying to figure out how many board games I can fit in my luggage. My goal is 10+ with at least 2 big ones
16:20:07 <lhx__> smcginnis, jungleboyj, maybe we can use Zoom for remote communication
16:20:08 * jungleboyj shakes my head
16:20:41 <jungleboyj> _alastor_: smcginnis keeps trying to get me to travel in one carryon. No big games for me.
16:20:59 <jungleboyj> lhx__: What is Zoom?
16:21:22 <smcginnis> jungleboyj: Webex that works.
16:21:35 <bswartz> given how much bag fees cost, it's cheaper to buy the board games on the other side and throw them away
16:22:00 <lhx__> jungleboyj, you can google "zoom" :)
16:22:13 <_alastor_> bswartz: You just have to get creative. Boxes are unnecessary most of the time
16:22:20 <bswartz> #link https://www.zoom.us/
16:22:33 <lhx__> bswartz, cool
16:22:38 <jungleboyj> lhx__: Interesting. I will have to take a look at that. I would like to be able to record/broadcast the on-boarding to get more people there/involved.
16:22:59 <jungleboyj> Anyway, we can talk more of those details as the summit approaches.
16:23:28 <bswartz> zoom works on Linux, which makes it INFINITY% better than webex in my opinion
16:23:36 <jungleboyj> bswartz: ++
16:23:39 * erlon sneaks in
16:23:44 <jungleboyj> erlon: !
16:23:51 <jungleboyj> So, anything else on the summit?
16:23:57 <lhx__> jungleboyj, I can apply for a paid Zoom account if necessary ;)
16:24:28 <jungleboyj> lhx__: Ok, we can talk over in the cinder channel about that.
16:24:53 <jungleboyj> Moving on then ...
16:25:09 <jungleboyj> #topic Self-fencing on active/active HA
16:25:14 <aspiers> o/
16:25:16 <jungleboyj> aspiers: You here?
16:25:18 <jungleboyj> Yay!
16:25:28 <jungleboyj> #link https://review.openstack.org/#/c/237076/
16:25:32 <jungleboyj> Take it away.
16:25:55 <aspiers> well there's been some discussion in that review about how to handle fencing
16:26:13 <aspiers> I think (and I think Duncan agrees?) it needs to be out-of-band
16:26:20 <aspiers> or at least a combination of in- and out-
16:26:33 <aspiers> and I had this crazy idea regarding using SBD, potentially without Pacemaker
16:26:45 <aspiers> so I just wanted to see if anyone had any initial reactions to that
16:27:24 <aspiers> we don't need to discuss now, but I wanted to say that I'm more than happy to brainstorm with anyone who's interested
16:27:34 <DuncanT> I need to learn more about SBD before I can comment. If it doesn't do STONITH then I don't understand how it can cover the corner cases, but that doesn't mean that somebody smarter than me hasn't figured it out
16:27:43 <jungleboyj> SBD?
16:27:46 <aspiers> I'm not a Cinder expert but have been very active on OpenStack HA in general
16:28:03 <aspiers> yes, Storage-Based Death - the links are in the review but I can repaste here
16:28:12 <bswartz> silent but deadly
16:28:13 <jungleboyj> aspiers: Oh, ok.
16:28:21 <aspiers> bswartz: X-)
16:28:22 <jungleboyj> I need to look at the review.
16:28:29 <jungleboyj> Sorry, haven't done that yet.
16:28:35 <aspiers> #link http://www.linux-ha.org/wiki/SBD_Fencing
16:28:42 <aspiers> #link https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
16:28:45 <jungleboyj> There are days where I feel like I am suffering from Storage Based Death
16:28:51 <aspiers> #link http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit
16:28:55 <DuncanT> The amount of state outside of cinder (including in the kernel for iSCSI and others) means that anything short of shooting is going to fail in corner cases I believe
16:28:55 <aspiers> jungleboyj: LOL, me too ;-)
16:29:07 <aspiers> DuncanT: exactly
16:29:24 <aspiers> and to me, SBD appears at first sight to be the perfect fit
16:29:37 <jungleboyj> aspiers: Interesting.
16:29:45 <aspiers> assuming it would be possible to reserve a tiny partition on the same shared storage which the cinder backend is using
16:30:01 <aspiers> then SBD uses that for heartbeating the connection to the storage
16:30:12 <aspiers> if a cinder-volume loses access to it, it self-fences
16:30:33 <aspiers> and a watchdog protects against kernel-level issues like blocking I/O, low memory etc.
16:30:57 <DuncanT> Watchdog as in hardware watchdog?
16:31:05 <aspiers> I'm not even sure we'd need to use the poison-pill functionality SBD provides
16:31:11 <aspiers> DuncanT: right
16:31:19 <DuncanT> Anything in kernel space can go wrong and will go wrong :-)
16:31:23 <aspiers> exactly
16:31:46 <DuncanT> aspiers: I will definitely read more
16:31:54 <aspiers> cool
16:32:27 <aspiers> my company (SUSE) has several SBD experts including the original author, and I confirmed with him that SBD should work fine without Pacemaker
16:32:57 <tbarron> aspiers: are you thinking of a 2-node cluster? (no quorum?)
16:33:01 <aspiers> I'm assuming we wouldn't want to pull in Pacemaker although TBH I haven't thought about that side of it much yet
16:33:15 <jungleboyj> aspiers: So, for things like iSCSI configuration, etc., how does SBD help?
16:33:30 <DuncanT> aspiers: This does require a real hardware watchdog device though, which I don't think is standard
16:33:40 <jungleboyj> aspiers: Also, is this only supported on SUSE or do other distros also support it?
16:33:57 <aspiers> jungleboyj: it's general Linux HA, part of the Clusterlabs project
16:34:05 <aspiers> I'm pretty sure Red Hat supports it
16:34:10 <jungleboyj> eharney: ?
16:34:41 <tbarron> aspiers: well you'd probably need to socialize it
16:34:48 <jungleboyj> He was here before.
16:34:49 <aspiers> tbarron: I was thinking there is no need for a cluster manager tracking quorum
16:35:03 <tbarron> lots of stuff upstream isn't officially supported downstream yet
16:35:20 <jungleboyj> :-)
16:35:21 <tbarron> not yet "tried, tested, trusted" certified, etc.
16:35:21 <aspiers> SBD is a bog-standard component in Pacemaker stacks (which RH definitely supports)
16:35:53 <DuncanT> Hmmm, it looks like reliably detecting that a node has shot itself is not possible with SBD, so we don't have a way to kick off the recovery processes cinder needs to do?
16:36:00 <tbarron> aspiers: not arguing that it's not a good path forwards or supportable though ...
16:36:03 <jungleboyj> Ok, just want to make sure it can be a general solution.
16:36:21 <DuncanT> In which case it doesn't help...
16:36:27 <jungleboyj> geguileo: Thoughts here as Mr. HA for Cinder?
16:36:38 * DuncanT speed reads docs badly
16:36:50 <jungleboyj> DuncanT: You aren't alone.
16:36:55 <tbarron> get DuncanT off the meth
16:36:58 <aspiers> haha
16:37:10 <jungleboyj> tbarron: There is no DuncanT without meth. ;-)
16:37:23 <jungleboyj> Oh wait, I mean beer.
16:37:28 <geguileo> jungleboyj: Sorry, was busy in other battles, and I don't know what we are talking about... So can't comment
16:37:30 <DuncanT> :-)
16:37:41 <jungleboyj> Gahhh!
16:37:51 <tbarron> geguileo: that doesn't stop anyone else
16:38:05 <geguileo> tbarron: but at least they are reading this chat ;-P
16:38:06 <smcginnis> tbarron: :)
16:38:21 <DuncanT> Part of the issue is that we need to do work once we know a node has definitely gone away... easy with STONITH since the shooter can do said work.... hard with self fencing
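For readers following along: the SBD mechanism aspiers describes above (heartbeat against a tiny reserved partition on the same shared storage the cinder backend uses; self-fence on loss of access; hardware watchdog as a backstop against a wedged kernel) boils down to a loop like the following minimal Python sketch. The device path, timeouts, and reboot action are illustrative assumptions, not the real sbd daemon's code or configuration.

    #!/usr/bin/env python3
    # Minimal sketch of SBD-style storage-based self-fencing, per the
    # discussion above. Requires root; all names/values here are assumed.
    import os
    import time

    DEVICE = "/dev/disk/by-id/shared-sbd-partition"  # hypothetical reserved partition
    HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeat writes (assumed)
    WATCHDOG_TIMEOUT = 5.0     # self-fence if no successful write for this long

    def write_heartbeat(fd):
        """Write a timestamped heartbeat to our area of the partition.
        (Real SBD uses direct, sector-aligned I/O; omitted here for brevity.)"""
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, b"%d\n" % int(time.time()))
        os.fsync(fd)

    def self_fence():
        """Reboot immediately via sysrq. In real SBD a hardware watchdog
        guarantees this even if the kernel itself is wedged (blocked I/O,
        low memory, etc.), which is the point DuncanT raises above."""
        os.system("echo b > /proc/sysrq-trigger")

    def main():
        fd = os.open(DEVICE, os.O_RDWR)
        last_success = time.monotonic()
        while True:
            try:
                write_heartbeat(fd)
                last_success = time.monotonic()
            except OSError:
                pass  # lost access to shared storage; keep trying until timeout
            if time.monotonic() - last_success > WATCHDOG_TIMEOUT:
                self_fence()
            time.sleep(HEARTBEAT_INTERVAL)

    if __name__ == "__main__":
        main()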
16:38:25 <aspiers> DuncanT: that's a good point, I suspect that sbd will write something to its slot on the partition before self-fencing though, in which case you would know when it's done
16:38:52 <DuncanT> aspiers: It can't do that if it has lost the connection to the storage, by definition
16:39:00 <aspiers> doh, oh yeah :)
16:39:17 <DuncanT> aspiers: Corner cases are the problem :-) Healthy, working nodes don't need to die
16:39:19 <aspiers> but if it's lost connection to the storage then by definition it's safe to kick off recovery
16:39:44 <DuncanT> aspiers: How do you know if it's a lost connection or a buggy kernel driver dropping some but not all I/O?
16:40:10 <aspiers> the distinction is about what happens if the connection is re-established
16:40:38 <aspiers> if a buggy kernel driver drops enough traffic, SBD will lose a heartbeat and self-fence
16:40:58 <aspiers> so I guess you are talking about that grey zone
16:41:04 <DuncanT> aspiers: Yes, but the node that is taking over needs to know that that has happened, with 100% certainty
16:41:29 <DuncanT> The grey zone is a really big, worrying zone in the H/A storage world
16:41:58 <aspiers> presumably we'd want to be able to handle *some* intermittent I/O drops though? or not?
16:42:13 <aspiers> if it's only for a few millisecs, do we still need fencing?
16:42:25 <DuncanT> aspiers: I'm not sure of the value of a 50% solution
16:42:45 <aspiers> well I'm asking how much resilience cinder and the storage have against that kind of condition
16:42:49 <jungleboyj> If a node self-fences, the other node can't STONITH?
16:43:16 <DuncanT> Self-fencing to trigger a faster STONITH definitely helps
16:43:18 <aspiers> jungleboyj: the other node could still write a poison pill to the shared partition
16:43:40 <DuncanT> STONITH is the only 100% solution I think...
16:43:52 <jungleboyj> aspiers: Ok ... So, it would seem that we want the poison pill if going this direction.
16:44:18 <aspiers> maybe involving Pacemaker is also a possibility. that way you get the state machine properties you are asking for
16:44:34 <aspiers> how scalable does a cinder-volume A/A cluster need to be?
16:44:50 <aspiers> Pacemaker currently only scales to 16-32 nodes, depending on who you ask for support
16:44:58 <aspiers> but maybe that's enough for this scenario?
16:45:33 <jungleboyj> aspiers: Depends on the end user's configuration. If they are managing SAN type boxes that is fine.
16:45:40 <jungleboyj> If they have a huge LVM cluster ...
16:46:06 <jungleboyj> Though, that is kind of a different discussion.
16:46:21 <jungleboyj> Forget I said that.
16:46:24 <aspiers> with Pacemaker there is the option for multiple types of fencing device, not just SBD but also IPMI and all kinds of other out-of-band mechanisms
16:46:44 <aspiers> you can even layer multiple devices for redundancy in the fencing process
16:46:55 <aspiers> with awareness of UPS topology etc.
16:47:41 <aspiers> well, we don't have to solve this in the remaining 12 minutes ;-)
16:47:48 <aspiers> but I'll be in Sydney and more than happy to discuss in more depth
16:48:08 <jungleboyj> The problem with HA is I feel like we are kind of paralyzed trying to get a perfect solution.
16:48:09 <aspiers> or even before then if you prefer
16:48:35 <jungleboyj> Yeah, we aren't going to solve this now.
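The poison-pill variant aspiers and jungleboyj converge on above works roughly as follows: each node owns a fixed slot on the shared partition; a peer that wants a node dead writes a pill into that node's slot, and every node checks its own slot each heartbeat cycle and self-fences if it finds one. A hedged sketch, with the slot layout and message format as assumptions; note that DuncanT's caveat still applies - a node that has lost the storage cannot read the pill, which is why real SBD pairs the pill with the watchdog timeout (the shooter waits out a configured message timeout and then assumes the target is dead either way).

    # Sketch of SBD-style poison-pill messaging over the shared partition.
    # Slot layout, offsets, and the pill message are illustrative assumptions.
    import os

    SLOT_SIZE = 512          # assumed one-sector slot per node
    PILL = b"reset"          # assumed pill message

    def slot_offset(node_id):
        return node_id * SLOT_SIZE

    def request_fence(fd, target_node_id):
        """Run on the surviving node: write a poison pill into the target's slot."""
        os.lseek(fd, slot_offset(target_node_id), os.SEEK_SET)
        os.write(fd, PILL.ljust(SLOT_SIZE, b"\0"))
        os.fsync(fd)

    def pill_in_own_slot(fd, my_node_id):
        """Run each heartbeat cycle on every node: True means self-fence now."""
        os.lseek(fd, slot_offset(my_node_id), os.SEEK_SET)
        return os.read(fd, SLOT_SIZE).startswith(PILL)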
16:48:41 <aspiers> jungleboyj: there is no perfect solution ;-) it's like security, you have to consider the relative risks of the various things which could go wrong and then make a judgement call
16:49:03 <aspiers> in HA it's a question of deciding what are the possible failure modes we care about
16:49:08 <aspiers> and making sure those are handled
16:49:26 <jungleboyj> So, I think the way forward is to get the team reviewing the spec:
16:49:27 <aspiers> and as DuncanT originally pointed out, I'm pretty sure that "kernel with low memory" is one of those failure modes
16:49:37 <jungleboyj> #action Team to review spec: https://review.openstack.org/#/c/237076/
16:49:46 <bswartz> it's also important to make sure that the cure isn't worse than the disease though
16:49:53 <aspiers> bswartz: totally agree ;-)
16:50:11 <jungleboyj> Then I think it would be good for those of us in Sydney to meet f2f so I can better understand this.
16:50:13 <bswartz> if a failover could cause corruption when HA is in use, whereas a failure would just cause loss of access without HA, then no HA is preferable
16:50:13 <geguileo> silly question, does it really make sense to fence Cinder because the node cannot access the storage over iSCSI, for example?
16:50:16 <jungleboyj> smcginnis: You up for that?
16:50:58 <smcginnis> jungleboyj: Sure!
16:51:01 <jungleboyj> geguileo: Uh, depends. ;-)
16:51:03 <aspiers> bswartz: in an active/active scenario what kind of failover are you thinking of?
16:51:08 <jungleboyj> smcginnis: Ok.
16:51:27 <jungleboyj> #action Cinder team and aspiers to get together in Sydney to discuss further.
16:51:33 <aspiers> cool
16:51:34 <geguileo> jungleboyj: it would only make sense if we have multiple nodes in the cluster and there were some capable of accessing the storage, right?
16:51:42 <jungleboyj> aspiers: I have office hours planned. If all else fails we can talk then.
16:51:51 <aspiers> jungleboyj: ACK
16:52:01 <geguileo> but if we only have 1 node or none of them can access the data layer we shouldn't
16:52:04 <bswartz> aspiers: the types of failures DuncanT was alluding to, mostly split brain syndromes
16:52:14 <jungleboyj> geguileo: Right.
16:52:22 <jungleboyj> geguileo: That was why I said it depends.
16:53:01 * geguileo needs to review the spec...
16:53:07 <aspiers> bswartz: SBD is kind of a bit like a quorum disk in that it can eliminate the split brain problem
16:53:17 <jungleboyj> geguileo: Exactly.
16:53:38 <geguileo> jungleboyj: because I set that to -1 on the workflow until we actually had something to work with
16:53:42 <jungleboyj> So, let's do that and those of us at the summit can assimilate that data and try to come back with more info for a future meeting.
16:54:06 <aspiers> bswartz: split brain happens when nodes are normally supposed to talk to each other through a cluster mesh and then get partitioned, but that's not the architecture I'm proposing here
16:55:01 <smcginnis> 5 minute warning.
16:55:19 <jungleboyj> aspiers: geguileo DuncanT Any concerns with the plan going forward?
16:55:28 <aspiers> nope sounds good to me!
16:55:34 <aspiers> thanks a lot everyone for this discussion
16:55:52 <jungleboyj> #action jungleboyj To find some time to get people together to talk about this in Sydney.
16:55:53 <geguileo> jungleboyj: you mean picking up the spec again, and those going to the summit talking about it?
16:56:01 <jungleboyj> geguileo: Correct
16:56:08 <geguileo> jungleboyj: sounds good to me
16:56:13 <jungleboyj> geguileo: Cool.
16:56:35 <jungleboyj> We have a plan. Thanks for bringing this up, aspiers.
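To make geguileo's point above concrete: self-fencing only buys anything when the cluster has surviving peers that can still reach the backend and take over; with a single node, or with no node able to reach the data layer, killing the node just adds an outage. A trivial sketch of that decision rule (the probe inputs are hypothetical):

    def should_self_fence(storage_reachable_by_me, peers_with_storage_access):
        """geguileo's rule: fence only if we lost the backend but at least
        one other cluster node can still reach it and take over the work."""
        return (not storage_reachable_by_me) and peers_with_storage_access > 0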
16:56:44 <aspiers> sure :)
16:56:48 <jungleboyj> #topic Open Discussion
16:57:04 <jungleboyj> Reminder to vote in the TC election if you haven't already done so.
16:57:39 <Swanson> I voted for everyone I've ever heard of.
16:57:52 <jungleboyj> Also, if you have thoughts on multi-attach functionality, we need to get this spec reviewed and merged ASAP: https://review.openstack.org/#/c/499777
16:59:02 <jungleboyj> #link https://review.openstack.org/#/c/499777
16:59:07 <jungleboyj> Anything else in the last minute?
16:59:32 <jungleboyj> Ok. Thanks all for the good discussion.
16:59:46 <jungleboyj> Have a great rest of the week. Thanks for working on Cinder!
16:59:53 <jungleboyj> See you next week.
16:59:57 <aspiers> thanks, and bye for now!
16:59:57 <_alastor_> o/
17:00:02 <jungleboyj> #endmeeting cinder