16:00:29 <jungleboyj> #startmeeting Cinder
16:00:30 <openstack> Meeting started Wed Oct 18 16:00:29 2017 UTC and is due to finish in 60 minutes. The chair is jungleboyj. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:33 <openstack> The meeting name has been set to 'cinder'
16:00:36 <Swanson> hello
16:00:40 <bswartz> .o/
16:00:47 <smcginnis> o/
16:00:52 <xyang1> Hi
16:00:57 <jungleboyj> Courtesy ping: jungleboyj DuncanT diablo_rojo, diablo_rojo_phon, rajinir tbarron xyang xyang1 e0ne gouthamr thingee erlon patrickeast tommylikehu eharney geguileo smcginnis lhx_ lhx__ aspiers
16:00:58 <eharney> hi
16:01:03 <jungleboyj> xyang1: Good to have you here!
16:01:09 <e0ne_> hi
16:01:14 <geguileo> hi! o/
16:01:21 <xyang1> jungleboyj: thanks:)
16:01:22 <jungleboyj> @!
16:01:25 <tommylikehu> hi
16:01:28 <tommylikehu> hi xyang1
16:01:33 * jungleboyj misses pewp bot!
16:01:35 <tbarron> hi
16:01:41 <jungleboyj> hemna: ^^^ :-)
16:02:01 <jungleboyj> xyang1: We going to be seeing more of you again?
16:02:14 <xyang1> jungleboyj: hope so
16:02:25 <jungleboyj> xyang1: Me too! You have been missed.
16:02:43 <jungleboyj> tbarron: Good to see you as well.
16:02:46 <patrickeast> o/
16:02:48 <xyang1> jungleboyj: thanks! I miss you guys too
16:02:54 <jungleboyj> :-)
16:02:56 <tbarron> jungleboyj: :)
16:03:53 <jungleboyj> Ok, let's get started. We have the usual suspects.
16:03:53 <lhx__> hi
16:04:07 <jungleboyj> #topic announcements
16:04:23 <jungleboyj> As always, please keep an eye on the review priorities:
16:04:34 <jungleboyj> #link https://etherpad.openstack.org/p/cinder-spec-review-tracking
16:05:03 <jungleboyj> The Hedvig driver should have CI running soon. I need to look through that patch. Would be good to get eyes on it once the CI is running.
16:05:29 <jungleboyj> Have seen some people pick up the review pace since last week's plea. Thank you. Appreciate any help that can be given.
16:05:39 <jungleboyj> ZuulV3 migration ....
16:05:42 <jungleboyj> Wheeeee
16:05:54 <jungleboyj> So, I have been seeing mostly successful runs.
16:06:09 <jungleboyj> Sean, want to give a quick update on what issues the release team is seeing?
16:06:17 <smcginnis> Inspur also looks good. At least as far as having a reporting CI.
16:06:17 <jungleboyj> smcginnis: ^^
16:06:22 <e0ne> jungleboyj: did we face any issues with the new zuul?
16:06:38 <jungleboyj> smcginnis: Ah, thank you. I will look at Inspur as well.
16:06:54 <smcginnis> Yeah, there are various changes with zuulv3 that are tripping up the post-release jobs. We should have them sorted out soon, but can't really release anything right now.
16:07:18 <smcginnis> So folks can still propose releases, we just might not be able to process them immediately.
16:08:03 <bswartz> smcginnis: boo
16:08:12 <bswartz> how will we do the milestone?
16:08:17 <jungleboyj> :-)
16:08:28 <jungleboyj> bswartz: Hey, way to segue.
16:08:29 <smcginnis> Should be OK by tomorrow. I hope.
16:08:40 * bswartz hopes too
16:08:51 <smcginnis> Actually, I think we have the final fix going through right now.
16:08:57 * smcginnis keeps his fingers crossed.
16:09:00 * jungleboyj crosses my fingers
16:09:17 <jungleboyj> So, tomorrow is Milestone 1.
16:09:35 <jungleboyj> I will be proposing our milestone branch later today and it will get merged whenever things are working.
16:09:43 <jungleboyj> Any concerns there?
16:10:17 <jungleboyj> Good.
16:10:26 <jungleboyj> So, that was all I had for announcements.
16:10:34 <jungleboyj> #topic Sydney
16:10:48 <jungleboyj> So, I have the etherpad to record Sydney info:
16:10:58 <jungleboyj> #link https://etherpad.openstack.org/p/cinder-sydney-information
16:11:18 <jungleboyj> Is it really just smcginnis, diablo_rojo, and me going?
16:11:56 <jungleboyj> If so, that would explain why the foundation was asking Lenovo if we were sure we didn't want more people to go.
16:12:06 <bswartz> there will be a few netapp ppl
16:12:13 <bswartz> not me though
16:12:18 <_alastor_> jungleboyj: I'll be there :)
16:12:38 <e0ne> I'll miss this Summit :(
16:12:42 <smcginnis> bswartz: Staying home for another baby? :)
16:12:42 <jungleboyj> bswartz: erlon or gonso?
16:12:51 <jungleboyj> e0ne: :-(
16:12:51 <lhx__> not me either :(
16:13:14 <jungleboyj> _alastor_: Great. Will be good to see you again.
16:13:37 <jungleboyj> Thanks for updating the etherpad.
16:13:40 <tommylikehu> small party then
16:13:48 <jungleboyj> tommylikehu: Not you either?
16:13:55 <Swanson> No way am I going down there to be attacked by a poison koala.
16:14:01 <smcginnis> Really too bad how many can't make it this time. :{
16:14:04 <tommylikehu> jungleboyj: I am not able to
16:14:10 * bswartz is afraid of the drop bears
16:14:11 <smcginnis> Swanson: Drop bears. ;)
16:14:12 * jungleboyj shakes my head
16:14:17 <smcginnis> bswartz: Hah!
16:14:17 <jungleboyj> bswartz: OMG!
16:14:35 <jungleboyj> Good thing I hear they only hate people from the east coast of the US.
16:14:37 <e0ne> It's pretty hard to get budget approved both for the PTG in the USA and the Summit in Sydney
16:14:55 <jungleboyj> e0ne: Understood. Well, just wanted to make sure.
16:15:03 <tommylikehu> e0ne: yes, very hard
16:15:05 <smcginnis> Next PTG is confirmed to be in Dublin the week of Feb 26.
16:15:14 <smcginnis> Should be a little cheaper/easier for some.
16:15:15 <bswartz> they don't attack people with australian accents, so start working on your fake aussie accent now
16:15:21 <lhx__> e0ne, +1
16:15:22 <e0ne> smcginnis: sounds good
16:15:28 <jungleboyj> We have a couple of forum sessions scheduled. I will add that info to the etherpad.
16:15:38 <jungleboyj> Still waiting for the on-boarding room to be scheduled.
16:15:52 <jungleboyj> bswartz: Gooday Mate
16:16:09 <jungleboyj> smcginnis: Where did you see that confirmed?
16:16:33 <Swanson> malonga gilderchuck
16:17:10 <smcginnis> jungleboyj: Thierry confirmed in the -dev channel this morning.
16:17:20 <jungleboyj> smcginnis: Nice!
16:17:30 <jungleboyj> That is going to be great. Hope everyone can make it.
16:18:05 <jungleboyj> So, I was going to ask about planning a Cinder event at the Summit but that seems unnecessary given the lack of people going.
16:18:38 <smcginnis> Maybe we can see who actually ends up showing up and try to arrange something informal via IRC.
16:19:23 <jungleboyj> Sure. We can just play it by ear.
16:19:43 <e0ne> it's only $340 for flight to Dublin and back - I like it
16:19:50 <_alastor_> jungleboyj: I'm already trying to figure out how many board games I can fit in my luggage. My goal is 10+ with at least 2 big ones
16:20:07 <lhx__> smcginnis, jungleboyj, maybe we can use Zoom for remote communication
16:20:08 * jungleboyj shakes my head
16:20:41 <jungleboyj> _alastor_: smcginnis keeps trying to get me to travel in one carryon. No big games for me.
16:20:59 <jungleboyj> lhx__: What is Zoom?
16:21:22 <smcginnis> jungleboyj: Webex that works.
16:21:35 <bswartz> given how much bag fees cost, it's cheaper to buy the board games on the other side and throw them away
16:22:00 <lhx__> jungleboyj, you can google "zoom" :)
16:22:13 <_alastor_> bswartz: You just have to get creative. Boxes are unnecessary most of the time
16:22:20 <bswartz> #link https://www.zoom.us/
16:22:33 <lhx__> bswartz, cool
16:22:38 <jungleboyj> lhx__: Interesting. I will have to take a look at that. I would like to be able to record/broadcast the on-boarding to get more people there/involved.
16:22:59 <jungleboyj> Anyway, we can talk more of those details as the summit approaches.
16:23:28 <bswartz> zoom works on Linux, which makes it INFINITY% better than webex in my opinion
16:23:36 <jungleboyj> bswartz: ++
16:23:39 * erlon sneaks in
16:23:44 <jungleboyj> erlon: !
16:23:51 <jungleboyj> So, anything else on the summit?
16:23:57 <lhx__> jungleboyj, I can apply for a paid Zoom account if necessary ;)
16:24:28 <jungleboyj> lhx__: Ok, we can talk over in the cinder channel about that.
16:24:53 <jungleboyj> Moving on then ...
16:25:09 <jungleboyj> #topic Self-fencing on active/active HA
16:25:14 <aspiers> o/
16:25:16 <jungleboyj> aspiers: You here?
16:25:18 <jungleboyj> Yay!
16:25:28 <jungleboyj> #link https://review.openstack.org/#/c/237076/
16:25:32 <jungleboyj> Take it away.
16:25:55 <aspiers> well there's been some discussion in that review about how to handle fencing
16:26:13 <aspiers> I think (and I think Duncan agrees?) it needs to be out-of-band
16:26:20 <aspiers> or at least a combination of in- and out-
16:26:33 <aspiers> and I had this crazy idea regarding using SBD, potentially without Pacemaker
16:26:45 <aspiers> so I just wanted to see if anyone had any initial reactions to that
16:27:24 <aspiers> we don't need to discuss now, but I wanted to say that I'm more than happy to brainstorm with anyone who's interested
16:27:34 <DuncanT> I need to learn more about SBD before I can comment. If it doesn't do STONITH then I don't understand how it can cover the corner cases, but that doesn't mean that somebody smarter than me hasn't figured it out
16:27:43 <jungleboyj> SBD?
16:27:46 <aspiers> I'm not a Cinder expert but have been very active on OpenStack HA in general
16:28:03 <aspiers> yes, Storage-Based Death - the links are in the review but I can repaste here
16:28:12 <bswartz> silent but deadly
16:28:13 <jungleboyj> aspiers: Oh, ok.
16:28:21 <aspiers> bswartz: X-)
16:28:22 <jungleboyj> I need to look at the review.
16:28:29 <jungleboyj> Sorry, haven't done that yet.
16:28:35 <aspiers> #link http://www.linux-ha.org/wiki/SBD_Fencing
16:28:42 <aspiers> #link https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
16:28:45 <jungleboyj> There are days where I feel like I am suffering from Storage Based Death
16:28:51 <aspiers> #link http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit
16:28:55 <DuncanT> The amount of state outside of cinder (including in the kernel for iSCSI and others) means that anything short of shooting is going to fail in corner cases I believe
16:28:55 <aspiers> jungleboyj: LOL, me too ;-)
16:29:07 <aspiers> DuncanT: exactly
16:29:24 <aspiers> and to me, SBD appears at first sight to be the perfect fit
16:29:37 <jungleboyj> aspiers: Interesting.
16:29:45 <aspiers> assuming it would be possible to reserve a tiny partition on the same shared storage which the cinder backend is using
16:30:01 <aspiers> then SBD uses that for heartbeating the connection to the storage
16:30:12 <aspiers> if a cinder-volume loses access to it, it self-fences
16:30:33 <aspiers> and a watchdog protects against kernel-level issues like blocking I/O, low memory etc.
16:30:57 <DuncanT> Watchdog as in hardware watchdog?
16:31:05 <aspiers> I'm not even sure we'd need to use the poison-pill functionality SBD provides
16:31:11 <aspiers> DuncanT: right
16:31:19 <DuncanT> Anything in kernel space can go wrong and will go wrong :-)
16:31:23 <aspiers> exactly
16:31:46 <DuncanT> aspiers: I will definitely read more
16:31:54 <aspiers> cool
16:32:27 <aspiers> my company (SUSE) has several SBD experts including the original author, and I confirmed with him that SBD should work fine without Pacemaker
16:32:57 <tbarron> aspiers: are you thinking of a 2-node cluster? (no quorum?)
16:33:01 <aspiers> I'm assuming we wouldn't want to pull in Pacemaker although TBH I haven't thought about that side of it much yet
16:33:15 <jungleboyj> aspiers: So, for things like iSCSI configuration, etc., how does SBD help?
16:33:30 <DuncanT> aspiers: This does require a real hardware watchdog device though, which I don't think is standard
16:33:40 <jungleboyj> aspiers: Also, is this only supported on SUSE or do other distros also support it?
16:33:57 <aspiers> jungleboyj: it's general Linux HA, part of the Clusterlabs project
16:34:05 <aspiers> I'm pretty sure Red Hat supports it
16:34:10 <jungleboyj> eharney: ?
16:34:41 <tbarron> aspiers: well you'd probably need to socialize it
16:34:48 <jungleboyj> He was here before.
16:34:49 <aspiers> tbarron: I was thinking there is no need for a cluster manager tracking quorum
16:35:03 <tbarron> lots of stuff upstream isn't officially supported downstream yet
16:35:20 <jungleboyj> :-)
16:35:21 <tbarron> not yet "tried, tested, trusted" certified, etc.
16:35:21 <aspiers> SBD is a bog-standard component in Pacemaker stacks (which RH definitely supports)
16:35:53 <DuncanT> Hmmm, it looks like reliably detecting that a node has shot itself is not possible with SBD, so we don't have a way to kick off the recovery processes cinder needs to do?
16:36:00 <tbarron> aspiers: not arguing that it's not a good path forwards or supportable though ...
16:36:03 <jungleboyj> Ok, just want to make sure it can be a general solution.
16:36:21 <DuncanT> In which case it doesn't help...
16:36:27 <jungleboyj> geguileo: Thoughts here as Mr. HA for Cinder?
16:36:38 * DuncanT speed reads docs badly
16:36:50 <jungleboyj> DuncanT: You aren't alone.
16:36:55 <tbarron> get DuncanT off the meth
16:36:58 <aspiers> haha
16:37:10 <jungleboyj> tbarron: There is no DuncanT without meth. ;-)
16:37:23 <jungleboyj> Oh wait, I mean beer.
16:37:28 <geguileo> jungleboyj: Sorry, was busy in other battles, and I don't know what we are talking about... So can't comment
16:37:30 <DuncanT> :-)
16:37:41 <jungleboyj> Gahhh!
16:37:51 <tbarron> geguileo: that doesn't stop anyone else
16:38:05 <geguileo> tbarron: but at least they are reading this chat ;-P
16:38:06 <smcginnis> tbarron: :)
16:38:21 <DuncanT> Part of the issue is that we need to do work once we know a node has definitely gone away... easy with STONITH since the shooter can do said work.... hard with self fencing
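For readers following along: the SBD mechanism aspiers describes above (heartbeat against a tiny reserved partition on the same shared storage the cinder backend uses; self-fence on loss of access; hardware watchdog as a backstop against a wedged kernel) boils down to a loop like the following minimal Python sketch. The device path, timeouts, and reboot action are illustrative assumptions, not the real sbd daemon's code or configuration.

    #!/usr/bin/env python3
    # Minimal sketch of SBD-style storage-based self-fencing, per the
    # discussion above. Requires root; all names/values here are assumed.
    import os
    import time

    DEVICE = "/dev/disk/by-id/shared-sbd-partition"  # hypothetical reserved partition
    HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeat writes (assumed)
    WATCHDOG_TIMEOUT = 5.0     # self-fence if no successful write for this long

    def write_heartbeat(fd):
        """Write a timestamped heartbeat to our area of the partition.
        (Real SBD uses direct, sector-aligned I/O; omitted here for brevity.)"""
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, b"%d\n" % int(time.time()))
        os.fsync(fd)

    def self_fence():
        """Reboot immediately via sysrq. In real SBD a hardware watchdog
        guarantees this even if the kernel itself is wedged (blocked I/O,
        low memory, etc.), which is the point DuncanT raises above."""
        os.system("echo b > /proc/sysrq-trigger")

    def main():
        fd = os.open(DEVICE, os.O_RDWR)
        last_success = time.monotonic()
        while True:
            try:
                write_heartbeat(fd)
                last_success = time.monotonic()
            except OSError:
                pass  # lost access to shared storage; keep trying until timeout
            if time.monotonic() - last_success > WATCHDOG_TIMEOUT:
                self_fence()
            time.sleep(HEARTBEAT_INTERVAL)

    if __name__ == "__main__":
        main()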
16:38:25 <aspiers> DuncanT: that's a good point, I suspect that sbd will write something to its slot on the partition before self-fencing though, in which case you would know when it's done
16:38:52 <DuncanT> aspiers: It can't do that if it has lost the connection to the storage, by definition
16:39:00 <aspiers> doh, oh yeah :)
16:39:17 <DuncanT> aspiers: Corner cases are the problem :-) Healthy, working nodes don't need to die
16:39:19 <aspiers> but if it's lost connection to the storage then by definition it's safe to kick off recovery
16:39:44 <DuncanT> aspiers: How do you know if it's a lost connection or a buggy kernel driver dropping some but not all I/O?
16:40:10 <aspiers> the distinction is about what happens if the connection is re-established
16:40:38 <aspiers> if a buggy kernel driver drops enough traffic, SBD will lose a heartbeat and self-fence
16:40:58 <aspiers> so I guess you are talking about that grey zone
16:41:04 <DuncanT> aspiers: Yes, but the node that is taking over needs to know that that has happened, with 100% certainty
16:41:29 <DuncanT> The grey zone is a really big, worrying zone in the H/A storage world
16:41:58 <aspiers> presumably we'd want to be able to handle *some* intermittent I/O drops though? or not?
16:42:13 <aspiers> if it's only for a few millisecs, do we still need fencing?
16:42:25 <DuncanT> aspiers: I'm not sure of the value of a 50% solution
16:42:45 <aspiers> well I'm asking how much resilience cinder and the storage have against that kind of condition
16:42:49 <jungleboyj> If a node self-fences, the other node can't STONITH?
16:43:16 <DuncanT> Self-fencing to trigger a faster STONITH definitely helps
16:43:18 <aspiers> jungleboyj: the other node could still write a poison pill to the shared partition
16:43:40 <DuncanT> STONITH is the only 100% solution I think...
16:43:52 <jungleboyj> aspiers: Ok ... So, it would seem that we want the poison pill if going this direction.
16:44:18 <aspiers> maybe involving Pacemaker is also a possibility. that way you get the state machine properties you are asking for
16:44:34 <aspiers> how scalable does a cinder-volume A/A cluster need to be?
16:44:50 <aspiers> Pacemaker currently only scales to 16-32 nodes, depending on who you ask for support
16:44:58 <aspiers> but maybe that's enough for this scenario?
16:45:33 <jungleboyj> aspiers: Depends on the end user's configuration. If they are managing SAN type boxes that is fine.
16:45:40 <jungleboyj> If they have a huge LVM cluster ...
16:46:06 <jungleboyj> Though, that is kind of a different discussion.
16:46:21 <jungleboyj> Forget I said that.
16:46:24 <aspiers> with Pacemaker there is the option for multiple types of fencing device, not just SBD but also IPMI and all kinds of other out-of-band mechanisms
16:46:44 <aspiers> you can even layer multiple devices for redundancy in the fencing process
16:46:55 <aspiers> with awareness of UPS topology etc.
16:47:41 <aspiers> well, we don't have to solve this in the remaining 12 minutes ;-)
16:47:48 <aspiers> but I'll be in Sydney and more than happy to discuss in more depth
16:48:08 <jungleboyj> The problem with HA is I feel like we are kind of paralyzed trying to get a perfect solution.
16:48:09 <aspiers> or even before then if you prefer
16:48:35 <jungleboyj> Yeah, we aren't going to solve this now.
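The poison-pill variant aspiers and jungleboyj converge on above works roughly as follows: each node owns a fixed slot on the shared partition; a peer that wants a node dead writes a pill into that node's slot, and every node checks its own slot each heartbeat cycle and self-fences if it finds one. A hedged sketch, with the slot layout and message format as assumptions; note that DuncanT's caveat still applies - a node that has lost the storage cannot read the pill, which is why real SBD pairs the pill with the watchdog timeout (the shooter waits out a configured message timeout and then assumes the target is dead either way).

    # Sketch of SBD-style poison-pill messaging over the shared partition.
    # Slot layout, offsets, and the pill message are illustrative assumptions.
    import os

    SLOT_SIZE = 512          # assumed one-sector slot per node
    PILL = b"reset"          # assumed pill message

    def slot_offset(node_id):
        return node_id * SLOT_SIZE

    def request_fence(fd, target_node_id):
        """Run on the surviving node: write a poison pill into the target's slot."""
        os.lseek(fd, slot_offset(target_node_id), os.SEEK_SET)
        os.write(fd, PILL.ljust(SLOT_SIZE, b"\0"))
        os.fsync(fd)

    def pill_in_own_slot(fd, my_node_id):
        """Run each heartbeat cycle on every node: True means self-fence now."""
        os.lseek(fd, slot_offset(my_node_id), os.SEEK_SET)
        return os.read(fd, SLOT_SIZE).startswith(PILL)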
16:48:41 <aspiers> jungleboyj: there is no perfect solution ;-) it's like security, you have to consider the relative risks of the various things which could go wrong and then make a judgement call
16:49:03 <aspiers> in HA it's a question of deciding what are the possible failure modes we care about
16:49:08 <aspiers> and making sure those are handled
16:49:26 <jungleboyj> So, I think the way forward is to get the team reviewing the spec:
16:49:27 <aspiers> and as DuncanT originally pointed out, I'm pretty sure that "kernel with low memory" is one of those failure modes
16:49:37 <jungleboyj> #action Team to review spec: https://review.openstack.org/#/c/237076/
16:49:46 <bswartz> it's also important to make sure that the cure isn't worse than the disease though
16:49:53 <aspiers> bswartz: totally agree ;-)
16:50:11 <jungleboyj> Then I think it would be good for those of us in Sydney to meet f2f so I can better understand this.
16:50:13 <bswartz> if a failover could cause corruption when HA is in use, whereas a failure would just cause loss of access without HA, then no HA is preferable
16:50:13 <geguileo> silly question, does it really make sense to fence Cinder because the node cannot access the storage over iSCSI, for example?
16:50:16 <jungleboyj> smcginnis: You up for that?
16:50:58 <smcginnis> jungleboyj: Sure!
16:51:01 <jungleboyj> geguileo: Uh, depends. ;-)
16:51:03 <aspiers> bswartz: in an active/active scenario what kind of failover are you thinking of?
16:51:08 <jungleboyj> smcginnis: Ok.
16:51:27 <jungleboyj> #action Cinder team and aspiers to get together in Sydney to discuss further.
16:51:33 <aspiers> cool
16:51:34 <geguileo> jungleboyj: it would only make sense if we have multiple nodes in the cluster and there were some capable of accessing the storage, right?
16:51:42 <jungleboyj> aspiers: I have office hours planned. If all else fails we can talk then.
16:51:51 <aspiers> jungleboyj: ACK
16:52:01 <geguileo> but if we only have 1 node or none of them can access the data layer we shouldn't
16:52:04 <bswartz> aspiers: the types of failures DuncanT was alluding to, mostly split brain syndromes
16:52:14 <jungleboyj> geguileo: Right.
16:52:22 <jungleboyj> geguileo: That was why I said it depends.
16:53:01 * geguileo needs to review the spec...
16:53:07 <aspiers> bswartz: SBD is kind of a bit like a quorum disk in that it can eliminate the split brain problem
16:53:17 <jungleboyj> geguileo: Exactly.
16:53:38 <geguileo> jungleboyj: because I set that to -1 on the workflow until we actually had something to work with
16:53:42 <jungleboyj> So, let's do that and those of us at the summit can assimilate that data and try to come back with more info for a future meeting.
16:54:06 <aspiers> bswartz: split brain happens when nodes are normally supposed to talk to each other through a cluster mesh and then get partitioned, but that's not the architecture I'm proposing here
16:55:01 <smcginnis> 5 minute warning.
16:55:19 <jungleboyj> aspiers: geguileo DuncanT Any concerns with the plan going forward?
16:55:28 <aspiers> nope sounds good to me!
16:55:34 <aspiers> thanks a lot everyone for this discussion
16:55:52 <jungleboyj> #action jungleboyj To find some time to get people together to talk about this in Sydney.
16:55:53 <geguileo> jungleboyj: you mean picking up the spec again, and those going to the summit talking about it?
16:56:01 <jungleboyj> geguileo: Correct
16:56:08 <geguileo> jungleboyj: sounds good to me
16:56:13 <jungleboyj> geguileo: Cool.
16:56:35 <jungleboyj> We have a plan. Thanks for bringing this up, aspiers.
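To make geguileo's point above concrete: self-fencing only buys anything when the cluster has surviving peers that can still reach the backend and take over; with a single node, or with no node able to reach the data layer, killing the node just adds an outage. A trivial sketch of that decision rule (the probe inputs are hypothetical):

    def should_self_fence(storage_reachable_by_me, peers_with_storage_access):
        """geguileo's rule: fence only if we lost the backend but at least
        one other cluster node can still reach it and take over the work."""
        return (not storage_reachable_by_me) and peers_with_storage_access > 0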
16:56:44 <aspiers> sure :)
16:56:48 <jungleboyj> #topic Open Discussion
16:57:04 <jungleboyj> Reminder to vote in the TC election if you haven't already done so.
16:57:39 <Swanson> I voted for everyone I've ever heard of.
16:57:52 <jungleboyj> Also, if you have thoughts on multi-attach functionality, we need to get this spec reviewed and merged ASAP: https://review.openstack.org/#/c/499777
16:59:02 <jungleboyj> #link https://review.openstack.org/#/c/499777
16:59:07 <jungleboyj> Anything else in the last minute?
16:59:32 <jungleboyj> Ok. Thanks all for the good discussion.
16:59:46 <jungleboyj> Have a great rest of the week. Thanks for working on Cinder!
16:59:53 <jungleboyj> See you next week.
16:59:57 <aspiers> thanks, and bye for now!
16:59:57 <_alastor_> o/
17:00:02 <jungleboyj> #endmeeting cinder