04:01:19 <samP> #startmeeting masakari
04:01:20 <openstack> Meeting started Tue Apr 25 04:01:19 2017 UTC and is due to finish in 60 minutes.  The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:21 <sagara> hi
04:01:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:24 <openstack> The meeting name has been set to 'masakari'
04:01:25 <rkmrHonjo> hi
04:01:27 <samP> hi all o/
04:01:41 <samP> sorry for last week..
04:01:48 <abhishekk> o/
04:02:06 <tpatil> NP
04:02:40 <samP> Had a super busy week, cause I became a father :)
04:02:58 <samP> Anyway..
04:03:12 <tpatil> Congratulations!!!
04:03:16 <Dinesh_Bhor> samP: congrats!!
04:03:21 <sagara> congrats!
04:03:25 <samP> thanks..
04:03:27 <rkmrHonjo> Congratulations..!
04:03:36 <samP> thank you all..
04:03:46 <samP> let's jump into the agenda
04:03:58 <samP> #topic critical bugs
04:04:14 <samP> any bugs to discuss?
04:05:54 <samP> If there are no bugs to discuss, let's move to Discussion. If any come up, we can address them later in AOB
04:06:14 <samP> #topic Discussion Points
04:06:41 <tpatil> #link: https://etherpad.openstack.org/p/masakari-recovery-method-customization
04:06:51 <samP> triggering crash dump in a server?
04:07:03 <tpatil> samP: Yes
04:07:23 <tpatil> samP: can you please explain a little bit about this use case?
04:07:30 <samP> tpatil: sure
04:08:43 <samP> This is for waiting for a core dump or crash dump before shutting down a server
04:09:23 <samP> When pacemaker STONITHs a server, it does not wait for the core dump.
04:10:03 <samP> In our user environment, servers have 264GB RAM and it takes 20-30 mins to do the core dump.
04:11:05 <samP> However, when a host fails, pacemaker does the STONITH and there is no time for the server to do the core dump
04:12:19 <samP> This feature is for isolating the server from the network, except for IPMI, and giving the server some time to do the core dump.
04:14:11 <tpatil> samP: can masakari-monitor receive the event to trigger core dump?
04:14:38 <samP> tpatil: not currently
04:15:09 <rkmrHonjo> Oh, I haven't recognised it.
04:15:56 <rkmrHonjo> Should I write a bp for the monitor to receive the core dump trigger?
04:16:44 <tpatil> samP: I think this job should be done by masakari-monitor instead of masakari
04:17:27 <samP> rkmrHonjo: currently I have no idea how to receive this trigger in masakari monitors
04:18:13 <samP> tpatil: IMHO, the job can be done in masakari monitors, but the recovery method should be defined in masakari.
04:18:52 <tpatil> samP: what action will massacre take after receiving notification to take core dump?
04:19:09 <tpatil> sorry, s/massacre/masakari
04:20:03 <rkmrHonjo> Massacre is scared...
04:20:05 <samP> tpatil: IMO, masakari will not get the notification for core dump.
04:20:28 <samP> It is one of the recovery actions
04:20:52 <tpatil> On my machine auto spell check is enabled, I will find out a way to disable this feature
04:21:17 <samP> Masakari only gets the node failure notification, and the recovery action would be isolate server (wait for core dump) -> evacuate
04:22:44 <tpatil> samP: Who will trigger core dump is complete?
04:23:24 <samP> tpatil: core dump will be automatically triggered by the kernel
04:24:22 <samP> HW failures, exceptions, kernel panics, etc. will trigger the core dump in the server
04:24:47 <samP> we just have wait for it to dump all the pages to file..
04:25:09 <tpatil> samP: Are you suggesting masakari should run recovery action which will trigger core dump and wait until kernel signals it's complete using IPMI protocol?
04:25:11 <samP> ^^ just have to wait
04:27:37 <samP> tpatil: No, that cannot be done. On the other hand, masakari does not have to wait for the core dump. It just needs to ask pacemaker to ifdown the networks and isolate the node.
04:28:21 <samP> for masakari, network isolation = node dead
04:28:54 <samP> So masakari can evacuate VMs in the normal way
04:29:51 <tpatil> samP: I'm trying to understand the end-to-end workflow when any node is down for some reason
04:30:47 <samP> tpatil: OK let me write down the simple flow
04:31:11 <tpatil> samP: thank you
04:31:15 <samP> (1) masakari monitor sends the node failure notification
04:32:13 <samP> (2) Masakari asks pacemaker to do NW isolation of the node (<- call the pacemaker cluster to do that)
04:32:53 <samP> (3) Masakari gets the reply from pacemaker "node isolation is done"
04:33:23 <samP> (4) Masakari triggers nova evacuate for VMs on that node
04:33:31 <samP> (5) done
04:33:38 <tpatil> samP: Masakari doesn't store any info about pacemaker cluster, need to figure out how to store this info when operator configures failover segment and hosts
04:34:19 <samP> tpatil: you are correct..
04:34:31 <tpatil> samP: understood the workflow. will check what information should be stored to isolate the node
04:35:10 <samP> tpatil: we have to configure the resources on the pacemaker side to do this..
04:35:46 <samP> most of the work will be done on the pacemaker side. So, pacemaker and corosync need to be configured correctly
04:36:41 <samP> I will write down more info related to this in etherpad..
04:37:07 <sagara> I think we need to avoid split brain for volume-booted VMs on the failed node
04:37:12 <tpatil> samP: thank you
04:37:40 <sagara> Do we need some confirmation? And we need to design this feature carefully so that those VMs are fenced enough.
04:37:48 <samP> sagara: does NW isolation avoid that?
04:39:04 <sagara> I don't know whether NW isolation is enough. In cases with many L2 switches, is it isolated enough?
04:40:12 <sagara> example, management NW, storage NW isolated environment case
04:40:32 <samP> sagara: NW isolation on the node means ifdown of the IFs, which will kill all the connections and sessions
04:41:24 <samP> sagara: Above, I mentioned the NW isolation. Ex: NW isolation = ifdown all the IFs, except IPMI
04:42:35 <samP> sagara: you may choose specific IFs to down, such as only Storage NW and Tenant NW, but not the Management NW.
04:43:15 <sagara> So do we need to clarify that if we are using FC-HBA hosts, we cannot fence VMs enough? Is that right?
04:43:35 <tpatil> samP: that will surely help us to figure out how to implement this recovery action
04:43:56 <samP> sagara: Ah..yes...In that case we need to do some special thins
04:44:11 <samP> sorry,s/thins/things
04:46:01 <sagara> FC-HBA environments may be rarer than iSCSI; firstly, do we take it forward without the FC case, or
04:46:26 <sagara> Do we consider some general design?
04:47:20 <samP> sagara: To my understanding, problem in FC case is how to disable the port which is highly HW dependent.
04:48:01 <samP> sagara: On the other hand, how to disable it is a pacemaker configuration problem and not a masakari problem
04:48:25 <sagara> I think there are two ways: one is disabling the FC port, the other is waiting long enough for the dump.
04:49:40 <samP> IMO, we can proceed without considering the FC case, because I cannot see what we can do on the masakari side for FCs
04:49:48 <samP> sagara: please correct me if I'm wrong
04:51:43 <sagara> operating FC ports will be a little difficult. Cinder already has an FC auto zoning feature, so some FC switches can be controlled with the cinder FC switch driver code.
04:52:08 <sagara> I agree to proceed without considering FC case.
04:52:38 <samP> sagara: are you proposing to cut off FC channels from the SW side?
04:53:22 <samP> sagara: I was only focused on the server side
04:54:40 <sagara> Yes, Sampath-san said "problem in FC case is how to disable the port which is highly HW dependent", so I understood that as controlling FC switches.
04:55:18 <samP> sagara: OK..
04:56:01 <sagara> I think controlling FC-HBA on server is also difficult
04:56:03 <samP> I was mentioning FC ports on the server side.
04:57:00 <samP> Anyway, I will write more details in the etherpad for this. So, we can discuss how to proceed with this.
04:57:28 <samP> we do not have much time... let's move to AOB
04:57:33 <samP> #topic AOB
04:58:06 <rkmrHonjo> In masakari-recovery-method-customization, "Send Alert/Mail to operator", how does it send "Alert"? Logs?
04:58:06 <sagara> I don't know well whether the FC-HBA's device path is still alive after the dump kernel starts to work
04:58:10 <samP> tpatil: sorry for the delay, I reply to your (and also Pooja-san's) mail about summit presentation
04:58:11 <sagara> ok
04:58:16 <rkmrHonjo> oh, sorry...
04:58:32 <tpatil> samP: Thanks
04:58:44 <samP> rkmrHonjo: operator may configure it
04:59:13 <samP> sagara: path will stay alive till you kill it
04:59:18 <rkmrHonjo> samP: Do operators configure drivers? e.g. logs, messaging?
05:00:13 <samP> rkmrHonjo: In my mind, this was a Mistral workflow. The operator does not configure the drivers
05:00:16 <sagara> samP: but the login/logout mechanism exists only for iSCSI. FC does not have it.
05:00:25 <samP> let's finish
05:00:27 <rkmrHonjo> samP: Thanks. I understand.
05:00:32 <samP> we are out of time
05:00:50 <samP> let's discuss this on ML or openstack-masakari
05:00:53 <Dinesh_Bhor> ok
05:00:55 <samP> thank you all......
05:00:59 <samP> #endmeeting
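
Editor's note: the five-step flow samP listed in the discussion (node failure notification, NW isolation through pacemaker, "node isolation is done" reply, nova evacuate, done) can be sketched roughly as below. This is only an illustration under stated assumptions, not masakari code: `recover_failed_node`, `isolate_node`, `list_vms` and `evacuate_vm` are hypothetical names, and the callables stand in for whatever pacemaker and nova integration is eventually designed. Skipping evacuation when isolation is not confirmed is also an assumption, motivated by sagara's split-brain concern.

```python
from typing import Callable, Iterable, List


def recover_failed_node(
    host: str,
    isolate_node: Callable[[str], bool],       # hypothetical: asks pacemaker to isolate the node
    list_vms: Callable[[str], Iterable[str]],  # hypothetical: returns VMs hosted on the node
    evacuate_vm: Callable[[str], None],        # hypothetical: triggers nova evacuate for one VM
) -> List[str]:
    """Run the isolate-then-evacuate flow and return a log of the steps taken."""
    # (1) the monitor has reported a node failure for `host`
    log = [f"node failure notification: {host}"]

    # (2) ask pacemaker for NW isolation, (3) wait for its reply
    if not isolate_node(host):
        # Assumption: without confirmed isolation, evacuating could cause split brain
        log.append("isolation not confirmed, evacuation skipped")
        return log
    log.append("node isolation is done")

    # (4) evacuate every VM that was running on the isolated node
    for vm in list_vms(host):
        evacuate_vm(vm)
        log.append(f"evacuate: {vm}")

    # (5) done
    log.append("done")
    return log
```

With stub callables (for example `lambda host: True` for `isolate_node`), the function can be exercised without any cluster; the core dump itself never appears in the flow because, as discussed, the kernel on the isolated node performs it on its own.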