#openstack-meeting log

04:00:23 <samP> #startmeeting masakari
04:00:24 <openstack> Meeting started Tue Jun 27 04:00:23 2017 UTC and is due to finish in 60 minutes.  The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:00:27 <openstack> The meeting name has been set to 'masakari'
04:00:40 <tpatil> Hi
04:00:43 <samP> hi
04:00:44 <sagara> hi
04:01:11 <samP> abhishekk: First of all, congratulations on becoming a glance core
04:01:22 <abhishekk> samP: thank you
04:01:39 <sagara> abhishekk: congratulations!
04:01:58 <abhishekk> sagara: thank you
04:02:27 <samP> OK, let's start
04:02:45 <samP> #topic Critical bugs
04:02:56 <samP> Any bugs to discuss?
04:03:46 <Dinesh_Bhor> #link: https://bugs.launchpad.net/masakari/+bug/1690995
04:03:48 <openstack> Launchpad bug 1690995 in masakari "If masakari recover "resized" instance that's state was "stopped" before resizing, it will be "active"." [Undecided,In progress] - Assigned to Dinesh Bhor (dinesh-bhor)
04:04:50 <Dinesh_Bhor> According to the bug report if the instance is in stopped state before resize then after recovery it should be stopped.
04:05:32 <samP> It looks like a nova bug to me..
04:06:37 <Dinesh_Bhor> In this particular patch we have decided to stop the instance after recovery if it is in vm_state other than active and stopped : https://review.openstack.org/#/c/469029/
04:07:04 <tpatil> samP: After the resize operation and before confirm_resize, if the compute host goes down, then after evacuation the vm_state should be stopped instead of active, as per the bug reporter
04:07:10 <samP> Dinesh_Bhor: that is one possible solution..
04:08:04 <samP> tpatil: yes..it should be on "stopped"
04:09:07 <tpatil> it should be stopped only if the previous action on the instance was a stop before the user called the resize operation
04:09:14 <samP> user stopped it for a reason, and making it active after evacuation could cause trouble.
04:09:45 <samP> tpatil: correct
04:09:53 <tpatil> samP:In that case, we will need to get instance_actions to take this decision
04:10:03 <rkmrHonjo> samP: thank you for explaining. Your explanation is right.
04:10:41 <tpatil> In future, if we decide to move the workflow to Mistral, we cannot call db_apis to get instance actions. We will need to add RESTful APIs to get instance actions
04:10:56 <samP> tpatil: I thought in previous discussion we agreed to use instance_actions to refer its previous status
04:11:20 <samP> tpatil: Ah...got it
04:12:33 <rkmrHonjo> tpatil: There is List Actions For Server API in nova.
04:12:34 <tpatil> samP: we have confirmed there is an existing nova api to get instance actions (instance_action_list)
04:12:36 <rkmrHonjo> #link https://developer.openstack.org/api-ref/compute/#servers-actions-servers-os-instance-actions
04:13:36 <samP> rkmrHonjo: I am looking at same
04:13:58 <tpatil> rkmrHonjo: Yes, the api is present
04:14:26 <samP> tpatil: can we use this API to get previous actions?
04:14:35 <tpatil> We may need to add this action in mistral to get the instance_actions for an instance
04:14:38 <tpatil> samP:Yes
04:15:31 <Dinesh_Bhor> This can be fixed in the same patch: https://review.openstack.org/#/c/469029/ I will fix it
04:15:31 <samP> tpatil: so, we do not need to call db_apis, right?
04:15:46 <tpatil> samP: correct
04:15:50 <samP> Dinesh_Bhor: thank you..
04:16:01 <samP> tpatil: OK, thanks
04:17:56 <samP> However, regarding these cases, my initial concern stays same..
04:18:44 <samP> which is, VMs are turned on instantly before we turn them off..
04:19:35 <samP> IMO, we have to fix these in nova
04:20:44 <sagara> samP: I agree. We need some internal state change (or something locking) API in Nova.
04:20:45 <samP> such as, in this case, evacuate should not turn the instance to active if its state before resize was stopped.
04:20:45 <tpatil> Ideally, user will expect the vm_state to be resized instead of stopped or active.
04:20:51 <abhishekk> samP: First we need to find out the reason why nova is not allowing evacuation if instance is in reset state
04:21:32 <abhishekk> samP: we need to convince them, otherwise nova will never accept this fix if there is any strong reason
04:21:45 <samP> abhishekk: agree..
04:21:46 <abhishekk> s/reset/resize
04:24:02 <samP> can someone please check why nova does not allow this?
04:24:34 <samP> Then we could start to talk with nova, if we have strong reason.
04:25:28 <sagara> I will do that, but if I cannot, I will ask someone to help me
04:25:45 <samP> sagara: sure..thank you..
04:25:54 <samP> sagara: lets use ML with [masakari]
04:26:12 <tpatil> samP: so what we want to confirm with nova community is after evacuation the instance should be set to active or stopped state based on the previous action before resize
04:26:16 <rkmrHonjo> sagara: thank you.
04:26:51 <sagara> #action: sagara find out the reason why nova is not allowing evacuation if instance is in resize state
04:26:56 <samP> tpatil: yes
04:27:08 <samP> sagara: thank you.
04:27:43 <samP> OK then, any other bugs
04:28:23 <samP> if not, lets move to pike work items
04:28:32 <samP> #topic pike work items
04:29:00 <tpatil> samP: nova is not allowing an instance in the resize vm_state to be evacuated before the instance files residing on the source compute node will not be accessible during evacuation
04:29:16 <tpatil> s/before/because
04:30:04 <samP> tpatil: but that is not true in the shared storage case, right?
04:30:12 <tpatil> but in case of shared_storage, it should be allowed
04:31:26 <samP> tpatil: in non shared storage cases, it does make sense
04:31:26 <tpatil> samP: Can nova community allow evacuation in resize vm_state based on the shared_storage/non-shared storage, is the main question
04:32:47 <tpatil> samP: Let's discuss this point on the nova ML
04:32:54 <samP> tpatil: might be hard to convince them, but let's give it a try
04:33:12 <samP> tpatil: sure
04:33:37 <tpatil> sagara: Do you want to send this mail or should I request Abhishek to send it
04:34:59 <sagara> tpatil: Please send a mail to Nova ML. thanks.
04:35:06 <tpatil> sagara: OK
04:35:42 <sagara> I will look up the reason from source or commit log.
04:35:55 <samP> Thanks..
04:36:29 <samP> sagara: could you please share those in ML with [masakari], then abhishekk could use those info
04:36:58 <sagara> samP: Yes, of course.
04:37:08 <samP> sagara: thanks
04:38:19 <samP> OK, about pike work items... need to do a lot of review work..
04:39:05 <abhishekk> For recovery method customization, to add mistral driver in masakari we have analyzed some issues
04:39:18 <abhishekk> 1. need to add some actions in mistral, like filter vms on the basis of metadata, custom evacuation in which we need to find out aggregate host and add reserved_host to that aggregate, confirmation of evacuation, reset instances state to stop if instance is in state other than active or stop etc.
04:39:34 <tpatil> #link Pike work items https://etherpad.openstack.org/p/masakari-pike-workitems
04:39:51 <abhishekk> 2. Mistral workflow executes in an asynchronous way, and to get the output we have to continuously hit the "mistral execution-get <execution-id>" API
04:40:16 <abhishekk> problem: If any workflow is executing and the mistral-engine service goes down suddenly during execution, the workflow remains in "RUNNING" state until the mistral-engine comes up again and processes the pending workflow (there must be some periodic task which is picking up the pending executions).
04:41:01 <abhishekk> 3. Assuming we have a mistral driver, then in case of recovery_method as "reserved_host" we are going to loop through the reserved_hosts and pass the reserved_hosts one by one to the workflow for execution. In this case, suppose the operator has configured a "send e-mail" action after completion of workflow execution.
04:41:01 <abhishekk> For partial workflow execution the operator will get multiple emails saying that the workflow failed and that it evacuated only some of the instances out of the total.
04:41:57 <samP> abhishekk: That is a lot of Mistral work there...
04:42:22 <abhishekk> For point 3 consider we have set reserved_host recovery action,
04:42:31 <abhishekk> samP: yes
04:42:45 <samP> Is it possible to ask Mistral team to help us with this..
04:43:05 <tpatil> samP: Right now, mistral has only one action servers_evacuate which takes instance id for evacuation
04:43:52 <tpatil> samP: but we will need to add a major action in mistral to evacuate instances from a failed compute host
04:43:52 <samP> tpatil: is that the evacuate workflow?
04:44:54 <abhishekk> samP: for reference https://github.com/gryf/mistral-evacuate
04:45:19 <samP> abhishekk: ah.. that is not a part of Mistral
04:45:31 <abhishekk> samP: yes
04:45:57 <tpatil> samP: yes, but we will need to work with Mistral community to add the required actions in mistral
04:46:20 <samP> we could define our own workflows for evacuation or start, stop etc..
04:46:26 <abhishekk> samP: we need to make major changes to this workflow to address our need, for example filter on the basis of metadata, add host to aggregate, reset instance state to error etc
04:46:41 <samP> abhishekk: yes
04:47:09 <tpatil> Apart from making all these changes, there are a few challenges
04:48:01 <samP> the problem is, does Mistral engine support those requirements? <- this is the part we should discuss with Mistral Team
04:48:27 <tpatil> Problem 1: As pointed above by Abhishek, if mistral_engine is down, the workflow will remain in running state forever
04:48:35 <tpatil> Problem 2
04:49:21 <tpatil> For reserved_host recovery_method, there is a possibility that not all instances can be evacuated to a reserved_host
04:49:58 <tpatil> so we will need to call workflow again from masakari passing the next reserved host
04:50:54 <tpatil> in this particular case, if operator has configured email action, these emails will be sent multiple times
04:51:59 <samP> tpatil: IMHO, problem1 is very important and can not be done without Mistral support
04:53:29 <samP> for Problem2, the Mistral workflow could return a value such as "not done yet, need another reserved host" and then masakari could call it again with another reserved host.
04:54:08 <tpatil> samP: Problem 1: Yes, mistral_api should check whether mistral_engine is up or down and accordingly return status to the caller
04:54:30 <samP> tpatil: agree
04:55:48 <tpatil> samP: problem 2: that's what is in my mind too, but then we may need to add some conditions to execute partial actions from the workflow on the mistral side for the subsequent evacuation of the remaining instances
04:56:12 <tpatil> samP: that's anyway ,we will need in masakari specs for discussion
04:56:44 <tpatil> samP: anyway, we will need to add that in masakari specs for discussion
04:57:00 <samP> tpatil: agree
04:57:50 <samP> I think we should use ML for this.
04:57:58 <samP> + masakari spec
04:58:23 <samP> we could get Mistral community feedback for this in ML
04:58:37 <tpatil> samP: First, we will upload a new PS in masakari-specs and then start this discussion on the ML if required
04:58:49 <samP> tpatil: sure..
04:59:20 <tpatil> samP: Thanks
04:59:27 <samP> sounds good!
04:59:35 <samP> we only have 1min left
04:59:57 <samP> please continue the discussion on #openstack-masakari or ML with [masakari]
05:00:08 <samP> thank you all....
05:00:15 <samP> #endmeeting