04:00:23 <samP> #startmeeting masakari
04:00:24 <openstack> Meeting started Tue Jun 27 04:00:23 2017 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:00:27 <openstack> The meeting name has been set to 'masakari'
04:00:40 <tpatil> Hi
04:00:43 <samP> hi
04:00:44 <sagara> hi
04:01:11 <samP> abhishekk: First of all, congratulations on becoming a glance core
04:01:22 <abhishekk> samP: thank you
04:01:39 <sagara> abhishekk: congratulations!
04:01:58 <abhishekk> sagara: thank you
04:02:27 <samP> OK, let's start
04:02:45 <samP> #topic Critical bugs
04:02:56 <samP> Any bugs to discuss?
04:03:46 <Dinesh_Bhor> #link: https://bugs.launchpad.net/masakari/+bug/1690995
04:03:48 <openstack> Launchpad bug 1690995 in masakari "If masakari recover "resized" instance that's state was "stopped" before resizing, it will be "active"." [Undecided,In progress] - Assigned to Dinesh Bhor (dinesh-bhor)
04:04:50 <Dinesh_Bhor> According to the bug report if the instance is in stopped state before resize then after recovery it should be stopped.
04:05:32 <samP> It looks like a nova bug to me..
04:06:37 <Dinesh_Bhor> In this particular patch we have decided to stop the instance after recovery if it is in vm_state other than active and stopped : https://review.openstack.org/#/c/469029/
04:07:04 <tpatil> samP: After resized operation and before confirm_resize, if the compute host goes down, in that case after evacuation the vm_state should be stopped instead of active as per bug reporter
04:07:10 <samP> Dinesh_Bhor: that is one possible solution..
04:08:04 <samP> tpatil: yes..it should be on "stopped"
04:09:07 <tpatil> it should be stopped only in case the previous action on the instance is stopped before user calling resize operation
04:09:14 <samP> user stopped it for a reason, and making it active after evacuate could cause trouble.
04:09:45 <samP> tpatil: correct
04:09:53 <tpatil> samP:In that case, we will need to get instance_actions to take this decision
04:10:03 <rkmrHonjo> samP: thank you for explaining. Your explanation is right.
04:10:41 <tpatil> In future, if we decide to move the workflow to Mistral, we cannot call db_apis to get instance actions. Will need to add RESTful apis to get instance actions
04:10:56 <samP> tpatil: I thought in previous discussion we agreed to use instance_actions to refer its previous status
04:11:20 <samP> tpatil: Ah...got it
04:12:33 <rkmrHonjo> tpatil: There is List Actions For Server API in nova.
04:12:34 <tpatil> samP: we have confirmed there is an existing nova api to get instance actions (instance_action_list)
04:12:36 <rkmrHonjo> #link https://developer.openstack.org/api-ref/compute/#servers-actions-servers-os-instance-actions
04:13:36 <samP> rkmrHonjo: I am looking at same
04:13:58 <tpatil> rkmrHonjo: Yes, the api is present
04:14:26 <samP> tpatil: can we use this API to get previous actions?
04:14:35 <tpatil> We may need to add this action in mistral to get the instance_actions for an instance
04:14:38 <tpatil> samP:Yes
04:15:31 <Dinesh_Bhor> This can be fixed in the same patch: https://review.openstack.org/#/c/469029/ I will fix it
04:15:31 <samP> tpatil: so, we do not need to call db_apis, right?
04:15:46 <tpatil> samP: correct
04:15:50 <samP> Dinesh_Bhor: thank you..
04:16:01 <samP> tpatil: OK, thanks
04:17:56 <samP> However, regarding these cases, my initial concern stays same..
04:18:44 <samP> which is, VMs are turned on instantly before we turn them off..
04:19:35 <samP> IMO, we have to fix these in nova
04:20:44 <sagara> samP: I agree. We need some internal state change (or something locking) API in Nova.
04:20:45 <samP> such as, in this case, evacuate does not turn the instance to active, if the state before resize is stopped.
04:20:45 <tpatil> Ideally, user will expect the vm_state to be resized instead of stopped or active.
04:20:51 <abhishekk> samP: First we need to find out the reason why nova is not allowing evacuation if instance is in reset state
04:21:32 <abhishekk> samP: we need to convince them, otherwise nova will never accept this fix if there is any strong reason
04:21:45 <samP> abhishekk: agree..
04:21:46 <abhishekk> s/reset/resize
04:24:02 <samP> can someone please check why nova does not allow this?
04:24:34 <samP> Then we could start to talk with nova, if we have strong reason.
04:25:28 <sagara> I will do that, but if I cannot, I will ask someone to help me
04:25:45 <samP> sagara: sure..thank you..
04:25:54 <samP> sagara: let's use ML with [masakari]
04:26:12 <tpatil> samP: so what we want to confirm with nova community is after evacuation the instance should be set to active or stopped state based on the previous action before resize
04:26:16 <rkmrHonjo> sagara: thank you.
04:26:51 <sagara> #action: sagara find out the reason why nova is not allowing evacuation if instance is in resize state
04:26:56 <samP> tpatil: yes
04:27:08 <samP> sagara: thank you.
04:27:43 <samP> OK then, any other bugs
04:28:23 <samP> if not, let's move to pike work items
04:28:32 <samP> #topic pike work items
04:29:00 <tpatil> samP: the nova is not allowing resize vm state to be evacuated before the instance files residing on the source compute node will not be accessible during evacuation
04:29:16 <tpatil> s/before/because
04:30:04 <samP> tpatil: but that is not true in shared storage case, right?
04:30:12 <tpatil> but in case of shared_storage, it should be allowed
04:31:26 <samP> tpatil: in non shared storage cases, it does make sense
04:31:26 <tpatil> samP: Can nova community allow evacuation in resize vm_state based on the shared_storage/non-shared storage, is the main question
04:32:47 <tpatil> samP: Let's discuss this point on the nova ML
04:32:54 <samP> tpatil: might be hard to convince them, but let's give it a try
04:33:12 <samP> tpatil: sure
04:33:37 <tpatil> sagara: Do you want to send this mail or should I request Abhishek to send it
04:34:59 <sagara> tpatil: Please send a mail to Nova ML. thanks.
04:35:06 <tpatil> sagara: OK
04:35:42 <sagara> I will look up the reason from source or commit log.
04:35:55 <samP> Thanks..
04:36:29 <samP> sagara: could you please share those in ML with [masakari], then abhishekk could use those info
04:36:58 <sagara> samP: Yes, of course.
04:37:08 <samP> sagara: thanks
04:38:19 <samP> OK, about pike work items... need to do lot of review work..
04:39:05 <abhishekk> For recovery method customization, to add mistral driver in masakari we have analyzed some issues
04:39:18 <abhishekk> 1. need to add some actions in mistral, like filter vms on the basis of metadata, custom evacuation in which we need to find out aggregate host and add reserved_host to that aggregate, confirmation of evacuation, reset instances state to stop if instance is in state other than active or stop etc.
04:39:34 <tpatil> #link Pike work items https://etherpad.openstack.org/p/masakari-pike-workitems
04:39:51 <abhishekk> 2. Mistral workflow executes in asynchronous way and to get the output we have to continuously hit the "mistral execution-get <execution-id>" API
04:40:16 <abhishekk> problem: If any workflow is executing and mistral-engine service goes down suddenly during execution, the workflow remains in "RUNNING" state until the mistral-engine comes up again and processes the pending workflow (there must be some periodic task which is picking up the pending executions).
04:41:01 <abhishekk> 3. Assuming we have a mistral driver then in case of recovery_method as a "reserved_host" we are going to loop through the reserved_hosts and will pass the reserved_host one by one to workflow for execution. In this case if operator has configured "send e-mail" action after completion of workflow execution,
04:41:01 <abhishekk> For partial workflow execution operator will get multiple emails that the workflow is failed and it has evacuated some instances out of total.
04:41:57 <samP> abhishekk: That is lot of Mistral work there...
04:42:22 <abhishekk> For point 3 consider we have set reserved_host recovery action,
04:42:31 <abhishekk> samP: yes
04:42:45 <samP> Is it possible to ask Mistral team to help us with this..
04:43:05 <tpatil> samP: Right now, mistral has only one action servers_evacuate which takes instance id for evacuation
04:43:52 <tpatil> samP: but we will need to add a major action in mistral to evacuate instances from a failed compute host
04:43:52 <samP> tpatil: is that the evacuate work flow?
04:44:54 <abhishekk> samP: for reference https://github.com/gryf/mistral-evacuate
04:45:19 <samP> abhishekk: ah.. that is not a part of Mistral
04:45:31 <abhishekk> samP: yes
04:45:57 <tpatil> samP: yes, but we will need to work with Mistral community to add the required actions in mistral
04:46:20 <samP> we could define our own workflows for evacuation or start, stop etc..
04:46:26 <abhishekk> samP: we need to make major changes to this workflow to address our need, for example filter on the basis of metadata, add host to aggregate, reset instance state to error etc
04:46:41 <samP> abhishekk: yes
04:47:09 <tpatil> Apart from making all these changes, there are few challenges
04:48:01 <samP> the problem is, does Mistral engine support those requirements? <- this is the part we should discuss with Mistral Team
04:48:27 <tpatil> Problem 1: As pointed above by Abhishek, if mistral_engine is down, the workflow will remain in running state forever
04:48:35 <tpatil> Problem 2
04:49:21 <tpatil> For reserved_host recovery_method, there is a possibility all instances might not be evacuated on a reserved_host
04:49:58 <tpatil> so we will need to call workflow again from masakari passing the next reserved host
04:50:54 <tpatil> in this particular case, if operator has configured email action, these emails will be sent multiple times
04:51:59 <samP> tpatil: IMHO, problem1 is very important and can not be done without Mistral support
04:53:29 <samP> for Problem2, Mistral work flow could return value with "not done yet, need another reserved host" and then masakari could call it again with another reserved host.
04:54:08 <tpatil> samP: Problem 1: Yes, mistral_api should check mistral_engine is up or down and accordingly return status to the caller
04:54:30 <samP> tpatil: agree
04:55:48 <tpatil> samP: problem 2: that's what is in my mind too, but then we may need to add some conditions to execute partial actions from the workflow on the mistral side for the subsequent evacuation of the remaining instances
04:56:12 <tpatil> samP: that's anyway ,we will need in masakari specs for discussion
04:56:44 <tpatil> samP: that's anyway ,we will need to add in masakari specs for discussion
04:57:00 <samP> tpatil: agree
04:57:50 <samP> I think we should use ML for this.
04:57:58 <samP> + masakari spec
04:58:23 <samP> we could get Mistral community feedback for this in ML
04:58:37 <tpatil> samP: First, we will upload a new PS in masakari-specs and then start this discussion on the ML if required
04:58:49 <samP> tpatil: sure..
04:59:20 <tpatil> samP: Thanks
04:59:27 <samP> sounds good!
04:59:35 <samP> we only have 1min left
04:59:57 <samP> please continue the discussion on #openstack-masakari or ML with [masakari]
05:00:08 <samP> thank you all....
05:00:15 <samP> #endmeeting
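
The decision discussed under "Critical bugs" (use the instance's action history to choose between "stopped" and "active" after evacuating a resized instance) could look roughly like the sketch below. It is a minimal illustration, not the code in the patch under review. It assumes the action list has already been fetched from nova's List Actions For Server API and that each entry carries the "action" and "start_time" fields. The function name state_before_resize and the fallback to "active" when no earlier stop or start is recorded are assumptions made for this example.

```python
from typing import Iterable, Mapping, Optional

# Action names assumed to match the "action" field in nova's instance
# action records; verify them against your deployment.
RESIZE = "resize"
STOP = "stop"
START = "start"


def state_before_resize(actions: Iterable[Mapping[str, str]]) -> Optional[str]:
    """Return "stopped" or "active" for the state before the latest resize.

    Returns None when the history contains no resize action.
    """
    # Sort oldest-first on start_time so the result does not depend on the
    # order in which the API returned the entries.
    ordered = sorted(actions, key=lambda a: a["start_time"])
    resize_positions = [i for i, a in enumerate(ordered) if a["action"] == RESIZE]
    if not resize_positions:
        return None
    # Walk backwards from the latest resize to the nearest stop or start.
    for entry in reversed(ordered[:resize_positions[-1]]):
        if entry["action"] == STOP:
            return "stopped"
        if entry["action"] == START:
            return "active"
    # Assumption: with no explicit stop or start recorded before the resize,
    # treat the instance as having been running.
    return "active"


if __name__ == "__main__":
    history = [
        {"action": "create", "start_time": "2017-06-26T01:00:00.000000"},
        {"action": "stop", "start_time": "2017-06-26T02:00:00.000000"},
        {"action": "resize", "start_time": "2017-06-26T03:00:00.000000"},
    ]
    print(state_before_resize(history))
```

A recovery flow could call this after evacuation and stop the instance only when the result is "stopped". Keeping the function free of any client or database call matches the point raised in the meeting that the data should come from the REST API rather than db_apis.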