04:01:10 <samP> #startmeeting masakari
04:01:11 <openstack> Meeting started Tue Feb 14 04:01:10 2017 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:15 <openstack> The meeting name has been set to 'masakari'
04:01:19 <samP> hi all..
04:01:25 <abhishekk> \o
04:01:31 <rkmrHonjo> Hi
04:01:40 <samP> #topic Bugs
04:01:53 <takashi> o/
04:02:04 <samP> rkmrHonjo has reported 2 bugs.
04:02:34 <samP> rkmrHonjo: could you please explain them?
04:02:41 <rkmrHonjo> OK.
04:02:55 <tpatil> # link : https://bugs.launchpad.net/masakari/+bug/1663513
04:02:55 <openstack> Launchpad bug 1663513 in masakari "Masakari failed to rescue PAUSED instances" [Undecided,New]
04:03:07 <tpatil> # link : https://bugs.launchpad.net/masakari/+bug/1663513
04:03:14 <rkmrHonjo> tpatil: thanks.
04:03:48 <rkmrHonjo> In this report, I found that masakari can't rescue PAUSED instance.
04:04:19 <takashi> rkmrHonjo: you mean, the instance paused by users, right?
04:05:07 <rkmrHonjo> Yes. Masakari try to call stop API to paused instance, but the return code of API is "409".
04:05:33 <tpatil> rkmrHonjo: when user pause instance, what libvirt event is sent in the notification?
04:06:31 <takashi> rkmrHonjo: I'd like to confirm which version you are using for your testing
04:06:39 <takashi> rkmrHonjo: Does it include this change? https://review.openstack.org/#/c/430121/
04:06:53 <rkmrHonjo> tpatil: Please wait, I check it.
04:07:08 <rkmrHonjo> s/tpatil/takashi/g
04:07:13 <takashi> I think that tpatil is thinking about the same thing as me...
04:07:40 <tpatil> takashi: yes
04:08:26 <rkmrHonjo> tpatil: Reproduce procedure is... 1. Pause instance. 2. Kill the instance process after paused. This is a simulation of instance failure.
04:08:26 <abhishekk> IMO if we do this, then user will never be able to pause the instance
04:09:34 <takashi> rkmrHonjo: so you are trying the case when some process failure happens 'after' user pauses their instance.
04:09:47 <takashi> s/some process/qemu process/g
04:09:54 <rkmrHonjo> takashi: Yes.
04:10:37 <rkmrHonjo> Pausing is not trigger. Paused instance failure is the trigger of this bug.
04:11:57 <rkmrHonjo> So https://review.openstack.org/#/c/430121/ doesn't resolve this issue.
04:12:02 <takashi> rkmrHonjo: ok. so let's back to tpatil's question. What kind of notification does masakari monitor send for that situation?
04:12:13 <takashi> LIFECYCLE STOPPED_FAILED?
04:12:39 <takashi> just for my confirmation
04:12:45 <rkmrHonjo> takashi: Ah...Please wait.
04:13:26 <abhishekk> in short, whenever masakari-engine receives notification, we need to reset it to error state?
04:14:04 <abhishekk> we are setting instance to error state if it is i resize (IMO this is for host failure)
04:14:12 <abhishekk> s/i/in
04:15:40 <rkmrHonjo> takashi: STOPPED_FAILED, yes.
04:15:53 <takashi> rkmrHonjo: thx
04:16:03 <takashi> abhishekk: you are talking about this? https://github.com/openstack/masakari/blob/master/masakari/engine/drivers/taskflow/instance_failure.py#L60-L62
04:16:31 <abhishekk> takashi: yes
04:16:50 <takashi> the one for host failure https://github.com/openstack/masakari/blob/master/masakari/engine/drivers/taskflow/host_failure.py#L95-L102
04:17:26 <tpatil> If user has paused instance, does libvirt generate STOPPED_FAILED event?
04:17:50 <abhishekk> IMO this way for each vm_state except (shelved_offloaded) we can trigger this failure, am I right?
04:18:27 <takashi> tpatil: STOPPED_FAILED is not caused just for pausing instance
04:19:06 <takashi> tpatil: even after we pause instance, qemu process for the paused instance remains on compute node, and ...
04:19:23 <rkmrHonjo> abhishekk, takashi: Maybe yes. I'm checking other statuses now. PAUSED is one of the examples.
04:19:33 <samP> Paused by the user right? Notification sent by the process monitor?
04:19:54 <takashi> tpatil: if we get some failure for the qemu process, we get notification with STOPPED_FAILED
04:20:10 <rkmrHonjo> samP: Instance monitor sent notification. Because this is a instance failure.
04:20:38 <abhishekk> He is talking if instance is paused and after oausing if it fails (killing qemu process) then
04:20:53 <abhishekk> s/ousing/pausing
04:20:53 <tpatil> takashi: and in this case, you are simulating to kill the qemu process and then it send STOPPED_FAILED event, correct?
04:22:05 <takashi> tpatil: right
04:23:31 <takashi> so the point here is, we can get STOPPED_FAILED event for active instance and paused instance, because for both cases qemu process remains and send that notification when the qemu process dies for some reasons
04:23:32 <tpatil> masakari-engine doesn't have enough information to take all these smart decision as it's unaware of what's going on the compute node.
04:23:39 <takashi> like OOM-killer
04:24:25 <takashi> s/send/masakari-monitor sends/
04:24:50 <rkmrHonjo> takashi: You're right. thanks.
04:25:25 <tpatil> rkmrHonjo: Do you think calling reset-state api, stop and start will solve this problem?
04:25:54 <rkmrHonjo> tpatil: Yes.
04:26:33 <takashi> If we can assume that STOPPED_FAILED is raised only when qemu process dies unexpectedly, it makes sense to me to reset the instance state to error for any cases
04:26:40 <tpatil> before calling reset-state, engine can check the vm state/task status and accordingly make some decisions and call reset-state api before calling stop and start apis.
04:27:08 <takashi> tpatil: that is what we are doing for resizing, right?
04:27:18 <tpatil> correct
04:27:37 <samP> IMO, if it is a user paused instance, in this case we change the state to paused -> active/stop
04:27:47 <samP> is acceptable?
04:27:57 <samP> is it acceptable?
04:28:16 <rkmrHonjo> samP: Yeah. I agree with tpatil's opinion.
04:30:08 <tpatil> rkmrHonjo: Let me confirm the problem again. user pauses instance, before it's paused gracefully qemu process dies and it generate STOPPED_FAILED event and send an notification to masakari. Masakari check the vm/task state and call reset-state api before calling stop and start apis.
04:30:28 <tpatil> so finally user will see the instance in active status and NOT in paused state.
04:31:46 <tpatil> provided qemu process is up again on the compute node.
04:32:18 <tpatil> rkmrHonjo: Please describe the reproduction steps on the LP bug in detail
04:32:28 <samP> tpatil: thanks, this method looks good to me
04:32:42 <rkmrHonjo> tpatil: OK, I describe the steps.
04:33:02 <samP> tpatil: so, user can pause it again
04:34:25 <samP> #action rkmrHonjo add reproduction steps on the LP bug #1663513 in detail
04:34:25 <openstack> Launchpad bug 1663513 in masakari "Masakari failed to rescue PAUSED instances" [Undecided,New] https://launchpad.net/bugs/1663513
04:34:38 <rkmrHonjo> tpatil, samP: I agree with your opinion.
04:34:59 <tpatil> samP: Ok
04:35:12 <samP> OK then, lets discuss details on the LP or gerrit for this
04:35:19 <takashi> +1
04:35:31 <sagara> +1
04:35:36 <rkmrHonjo> +1
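[Editor's note: below is a minimal sketch of the recovery flow agreed above for bug 1663513: check the instance's vm_state, reset it if needed, then stop and start the instance. It uses python-novaclient directly rather than masakari's internal nova helper, and the auth settings, the vm_state list, and the function name are illustrative assumptions, not the fix that was eventually merged. As tpatil notes, after this flow the instance ends up active, so the user has to pause it again.]

    # Hedged sketch only; masakari's real change lives in its taskflow driver.
    from keystoneauth1 import loading, session
    from novaclient import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://controller/identity/v3',   # assumed endpoint
        username='masakari', password='secret',     # assumed credentials
        project_name='service',
        user_domain_id='default', project_domain_id='default')
    nova = client.Client('2', session=session.Session(auth=auth))

    def rescue_failed_instance(instance_uuid):
        """Recover an instance reported as STOPPED_FAILED by the instance monitor."""
        server = nova.servers.get(instance_uuid)
        vm_state = getattr(server, 'OS-EXT-STS:vm_state')
        # A paused (or resized) instance rejects the stop API with 409,
        # so reset it to error first, as proposed in the meeting.
        if vm_state in ('paused', 'resized'):
            server.reset_state('error')
        nova.servers.stop(server)
        # A real workflow would wait here until the instance reaches the
        # 'stopped' vm_state before starting it again.
        nova.servers.start(server)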
04:35:43 <samP> Next bug,
04:35:50 <samP> #link https://bugs.launchpad.net/masakari/+bug/1664183
04:35:50 <openstack> Launchpad bug 1664183 in masakari "Failed to update a status of hosts." [Undecided,New]
04:36:17 <rkmrHonjo> OK. This report says... Masakari user can't update status of host-foo if there are "error" notifications in host-bar. (host-foo and bar belong to same segment.)
04:36:17 <rkmrHonjo> Is this correct?
04:36:53 <tpatil> you can only update on_maintenance and reserved properties of hosts
04:37:32 <tpatil> there is no status field for host
04:37:39 <Dinesh_Bhor> host doesn't have any property like status
04:37:40 <rkmrHonjo> Ah, sorry, "status" means on_maintenance.
04:37:58 <rkmrHonjo> I fix LP after that.
04:38:24 <takashi> if the error is temporal one, it will be resolved by periodic task. Is it right?
04:38:51 <takashi> In my understanding, if the fail over segment has notifications only in finished or failed status, we can update the hosts in it
04:39:22 <samP> takashi: same understanding
04:39:26 <tpatil> takashi: correct
04:39:38 <takashi> and periodic task should pick up notifications in error, and change it to finished or failed
04:39:49 <samP> so, we have to wait for periodic task to clear it
04:40:07 <tpatil> the notification in error status will be processed by the periodic tasks
04:40:39 <takashi> #link https://review.openstack.org/#/c/431335/
04:40:40 <takashi> this one
04:41:30 <tpatil> #link : https://review.openstack.org/#/c/427072/
04:42:06 <takashi> sorry, I pasted wrong url...
04:42:08 <takashi> :-(
04:42:21 <tpatil> after processing notification in error status, it will change the status either to finished or failed. After that, you should be able to change the on_maintenance flag
04:43:06 <rkmrHonjo> All: Thank you for explaining. In conclusion, this is correct as Masakari specs, right?
04:43:20 <tpatil> rkmrHonjo: yes
04:44:11 <samP> rkmrHonjo: yes, if there is error notifications in the segment, you have to wait till periodic task clear the state of the notifications
04:44:46 <rkmrHonjo> samP, tpatil: OK. I fix the description and change the status to "invalid" after that.
04:44:57 <samP> rkmrHonjo: thanks
04:45:11 <samP> is there any other bugs to discuss?
04:45:45 <tpatil> rkmrHonjo: the default interval of periodic task is 120 seconds
04:46:01 <takashi> samP: I think tpatil listed up some bugs in high priority etherpad
04:46:10 <takashi> so can we move to the topic?
04:46:12 <samP> takashi: yep..
04:46:23 <samP> that is the next topic
04:46:53 <samP> #topic High priority items for Ocata
04:47:12 <samP> lets discuss about tpatil's list
04:47:24 <samP> #link https://etherpad.openstack.org/p/ocata-priorities-masakari
04:48:11 <samP> I added the items for python-masakariclient and masakari-monitors.
04:48:40 <tpatil> Add periodic tasks to process notifications in error/new states
04:49:18 <tpatil> # link : https://review.openstack.org/#/c/431335/
04:49:28 <tpatil> above patch is pending for review
04:50:33 <samP> tpatil: I will review it
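[Editor's note: below is a rough sketch of the periodic-task pattern proposed in https://review.openstack.org/#/c/431335/, using oslo.service. The 120-second spacing matches the default interval tpatil mentioned earlier; get_stuck_notifications() and process_notification() are placeholders for masakari's real DB and engine calls, not actual APIs.]

    # Hedged sketch of the pattern only; see the review above for the real change.
    from oslo_config import cfg
    from oslo_service import periodic_task

    CONF = cfg.CONF

    def get_stuck_notifications(context, statuses):
        """Placeholder: would query masakari's DB for notifications left in
        the given statuses (e.g. 'error' or 'new')."""
        return []

    def process_notification(context, notification):
        """Placeholder: would re-run the matching recovery workflow, moving
        the notification to 'finished' or 'failed'."""

    class EnginePeriodicTasks(periodic_task.PeriodicTasks):
        def __init__(self):
            super(EnginePeriodicTasks, self).__init__(CONF)

        @periodic_task.periodic_task(spacing=120)
        def _process_unfinished_notifications(self, context):
            # Once every notification in a segment is 'finished' or 'failed',
            # hosts in that segment can be updated again (see bug 1664183).
            for notification in get_stuck_notifications(context,
                                                        statuses=('error', 'new')):
                process_notification(context, notification)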
04:50:39 <tpatil> 2. Implement reserve host recovery methods
04:50:53 <tpatil> #link : https://review.openstack.org/#/c/432314/
04:51:03 <tpatil> I have reviewed this patch and voted -1
04:51:12 <tpatil> Dinesh will address these review comments
04:51:37 <Dinesh_Bhor> ^^ I will address them today
04:51:41 <samP> tpatil: Dinesh_Bhor thanks
04:52:40 <takashi> will review the patch, and also its spec
04:54:01 <tpatil> Release notes build is failing
04:54:26 <tpatil> It requires major changes
04:55:11 <rkmrHonjo> tpatil: major changes?
04:55:17 <samP> tpatil: What changes are you proposing?
04:55:42 <tpatil> after you build the release notes using to command, it should show Newton Release Notes and Current Series Release Notes
04:55:51 <tpatil> s/to/tox
04:56:16 <tpatil> currently, it's not showing Newton Release notes at all.
04:57:20 <takashi> fyi: http://docs.openstack.org/releasenotes/nova/unreleased.html
04:57:42 <tpatil> Also, the release note formatting is incorrect at few other places which we are planning to fix in the same LP bug
04:58:03 <takashi> in this release note, we can find newton release note in the page also as current series
04:58:27 <samP> tpatil: takashi: got it.
04:58:49 <samP> tpatil: are you addressing this on same LP bug?
04:59:13 <tpatil> samP: yes
05:00:13 <samP> tpatil: OK, thanks. Me too take look into this, since masakari-monitors, and python-masakariclient need the same fix
05:00:17 <sagara> we have no time, do we need to change the room?
05:00:24 <samP> nop
05:00:55 <samP> OK, lets finish the discussion here and move to #openstack-masakari
05:01:30 <samP> thank you all..
05:02:25 <tpatil> Thanks
05:02:26 <samP> lets finish, please use #openstack-masakari or ML for further discussion
05:02:30 <samP> #endmeeting
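[Editor's note on the release notes item above: OpenStack projects build release notes with reno, normally via "tox -e releasenotes". The Newton page only shows up if a stub page like the one sketched below exists under releasenotes/source/ and is listed in that directory's index.rst toctree; the exact layout chosen for masakari's fix may differ.]

    Newton Series Release Notes
    ===========================

    .. release-notes::
       :branch: origin/stable/newton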