04:01:10 <samP> #startmeeting masakari
04:01:11 <openstack> Meeting started Tue Feb 14 04:01:10 2017 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:15 <openstack> The meeting name has been set to 'masakari'
04:01:19 <samP> hi all..
04:01:25 <abhishekk> \o
04:01:31 <rkmrHonjo> Hi
04:01:40 <samP> #topic Bugs
04:01:53 <takashi> o/
04:02:04 <samP> rkmrHonjo has reported 2 bugs.
04:02:34 <samP> rkmrHonjo: could you please explain them?
04:02:41 <rkmrHonjo> OK.
04:02:55 <tpatil> # link : https://bugs.launchpad.net/masakari/+bug/1663513
04:02:55 <openstack> Launchpad bug 1663513 in masakari "Masakari failed to rescue PAUSED instances" [Undecided,New]
04:03:07 <tpatil> # link : https://bugs.launchpad.net/masakari/+bug/1663513
04:03:14 <rkmrHonjo> tpatil: thanks.
04:03:48 <rkmrHonjo> In this report, I found that masakari can't rescue PAUSED instance.
04:04:19 <takashi> rkmrHonjo: you mean, the instance paused by users, right?
04:05:07 <rkmrHonjo> Yes. Masakari try to call stop API to paused instance, but the return code of API is "409".
04:05:33 <tpatil> rkmrHonjo: when user pause instance, what libvirt event is sent in the notification?
04:06:31 <takashi> rkmrHonjo: I'd like to confirm which version you are using for your testing
04:06:39 <takashi> rkmrHonjo: Does it include this change? https://review.openstack.org/#/c/430121/
04:06:53 <rkmrHonjo> tpatil: Please wait, I check it.
04:07:08 <rkmrHonjo> s/tpatil/takashi/g
04:07:13 <takashi> I think that tpatil is thinking about the same thing as me...
04:07:40 <tpatil> takashi: yes
04:08:26 <rkmrHonjo> tpatil: Reproduce procedure is... 1. Pause instance. 2. Kill the instance process after paused. This is a simulation of instance failure.
04:08:26 <abhishekk> IMO if we do this, then user will never be able to pause the instance
04:09:34 <takashi> rkmrHonjo: so you are trying the case when some process failure happens 'after' user pauses their instance.
04:09:47 <takashi> s/some process/qemu process/g
04:09:54 <rkmrHonjo> takashi: Yes.
04:10:37 <rkmrHonjo> Pausing is not trigger. Paused instance failure is the trigger of this bug.
04:11:57 <rkmrHonjo> So https://review.openstack.org/#/c/430121/ doesn't resolve this issue.
04:12:02 <takashi> rkmrHonjo: ok. so let's back to tpatil's question. What kind of notification does masakari monitor send for that situation?
04:12:13 <takashi> LIFECYCLE STOPPED_FAILED?
04:12:39 <takashi> just for my confirmation
04:12:45 <rkmrHonjo> takashi: Ah...Please wait.
04:13:26 <abhishekk> in short, whenever masakari-engine receives notification, we need to reset it to error state?
04:14:04 <abhishekk> we are setting instance to error state if it is i resize (IMO this is for host failure)
04:14:12 <abhishekk> s/i/in
04:15:40 <rkmrHonjo> takashi: STOPPED_FAILED, yes.
04:15:53 <takashi> rkmrHonjo: thx
04:16:03 <takashi> abhishekk: you are talking about this? https://github.com/openstack/masakari/blob/master/masakari/engine/drivers/taskflow/instance_failure.py#L60-L62
04:16:31 <abhishekk> takashi: yes
04:16:50 <takashi> the one for host failure https://github.com/openstack/masakari/blob/master/masakari/engine/drivers/taskflow/host_failure.py#L95-L102
04:17:26 <tpatil> If user has paused instance, does libvirt generate STOPPED_FAILED event?
04:17:50 <abhishekk> IMO this way for each vm_state except (shelved_offloaded) we can trigger this failure, am I right?
04:18:27 <takashi> tpatil: STOPPED_FAILED is not caused just for pausing instance
04:19:06 <takashi> tpatil: even after we pause instance, qemu process for the paused instance remains on compute node, and ...
04:19:23 <rkmrHonjo> abhishekk, takashi: Maybe yes. I'm checking other statuses now. PAUSED is one of the examples.
04:19:33 <samP> Paused by the user right? Notification sent by the process monitor?
04:19:54 <takashi> tpatil: if we get some failure for the qemu process, we get notification with STOPPED_FAILED
04:20:10 <rkmrHonjo> samP: Instance monitor sent notification. Because this is a instance failure.
04:20:38 <abhishekk> He is talking if instance is paused and after oausing if it fails (killing qemu process) then
04:20:53 <abhishekk> s/ousing/pausing
04:20:53 <tpatil> takashi: and in this case, you are simulating to kill the qemu process and then it send STOPPED_FAILED event, correct?
04:22:05 <takashi> tpatil: right
04:23:31 <takashi> so the point here is, we can get STOPPED_FAILED event for active instance and paused instance, because for both cases qemu process remains and send that notification when the qemu process dies for some reasons
04:23:32 <tpatil> masakari-engine doesn't have enough information to take all these smart decision as it's unaware of what's going on the compute node.
04:23:39 <takashi> like OOM-killer
04:24:25 <takashi> s/send/masakari-monitor sends/
04:24:50 <rkmrHonjo> takashi: You're right. thanks.
04:25:25 <tpatil> rkmrHonjo: Do you think calling reset-state api, stop and start will solve this problem?
04:25:54 <rkmrHonjo> tpatil: Yes.
04:26:33 <takashi> If we can assume that STOPPED_FAILED is raised only when qemu process dies unexpectedly, it makes sense to me to reset the instance state to error for any cases
04:26:40 <tpatil> before calling reset-state, engine can check the vm state/task status and accordingly make some decisions and call reset-state api before calling stop and start apis.
04:27:08 <takashi> tpatil: that is what we are doing for resizing, right?
04:27:18 <tpatil> correct
04:27:37 <samP> IMO, if it is a user paused instance, in this case we change the state to paused -> active/stop
04:27:47 <samP> is acceptable?
04:27:57 <samP> is it acceptable?
04:28:16 <rkmrHonjo> samP: Yeah. I agree with tpatil's opinion.
04:30:08 <tpatil> rkmrHonjo: Let me confirm the problem again. user pauses instance, before it's paused gracefully qemu process dies and it generate STOPPED_FAILED event and send an notification to masakari. Masakari check the vm/task state and call reset-state api before calling stop and start apis.
04:30:28 <tpatil> so finally user will see the instance in active status and NOT in paused state.
04:31:46 <tpatil> provided qemu process is up again on the compute node.
04:32:18 <tpatil> rkmrHonjo: Please describe the reproduction steps on the LP bug in detail
04:32:28 <samP> tpatil: thanks, this method looks good to me
04:32:42 <rkmrHonjo> tpatil: OK, I describe the steps.
04:33:02 <samP> tpatil: so, user can pause it again
04:34:25 <samP> #action rkmrHonjo add reproduction steps on the LP bug #1663513 in detail
04:34:25 <openstack> Launchpad bug 1663513 in masakari "Masakari failed to rescue PAUSED instances" [Undecided,New] https://launchpad.net/bugs/1663513
04:34:38 <rkmrHonjo> tpatil, samP: I agree with your opinion.
04:34:59 <tpatil> samP: Ok
04:35:12 <samP> OK then, lets discuss details on the LP or gerrit for this
04:35:19 <takashi> +1
04:35:31 <sagara> +1
04:35:36 <rkmrHonjo> +1
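[Editor's note: below is a minimal sketch of the recovery flow agreed above for bug 1663513: check the instance's vm_state, reset it if needed, then stop and start the instance. It uses python-novaclient directly rather than masakari's internal nova helper, and the auth settings, the vm_state list, and the function name are illustrative assumptions, not the fix that was eventually merged. As tpatil notes, after this flow the instance ends up active, so the user has to pause it again.]

    # Hedged sketch only; masakari's real change lives in its taskflow driver.
    from keystoneauth1 import loading, session
    from novaclient import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://controller/identity/v3',   # assumed endpoint
        username='masakari', password='secret',     # assumed credentials
        project_name='service',
        user_domain_id='default', project_domain_id='default')
    nova = client.Client('2', session=session.Session(auth=auth))

    def rescue_failed_instance(instance_uuid):
        """Recover an instance reported as STOPPED_FAILED by the instance monitor."""
        server = nova.servers.get(instance_uuid)
        vm_state = getattr(server, 'OS-EXT-STS:vm_state')
        # A paused (or resized) instance rejects the stop API with 409,
        # so reset it to error first, as proposed in the meeting.
        if vm_state in ('paused', 'resized'):
            server.reset_state('error')
        nova.servers.stop(server)
        # A real workflow would wait here until the instance reaches the
        # 'stopped' vm_state before starting it again.
        nova.servers.start(server)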
04:35:43 <samP> Next bug,
04:35:50 <samP> #link https://bugs.launchpad.net/masakari/+bug/1664183
04:35:50 <openstack> Launchpad bug 1664183 in masakari "Failed to update a status of hosts." [Undecided,New]
04:36:17 <rkmrHonjo> OK. This report says... Masakari user can't update status of host-foo if there are "error" notifications in host-bar. (host-foo and bar belong to same segment.)
04:36:17 <rkmrHonjo> Is this correct?
04:36:53 <tpatil> you can only update on_maintenance and reserved properties of hosts
04:37:32 <tpatil> there is no status field for host
04:37:39 <Dinesh_Bhor> host doesn't have any property like status
04:37:40 <rkmrHonjo> Ah, sorry, "status" means on_maintenance.
04:37:58 <rkmrHonjo> I fix LP after that.
04:38:24 <takashi> if the error is temporal one, it will be resolved by periodic task. Is it right?
04:38:51 <takashi> In my understanding, if the fail over segment has notifications only in finished or failed status, we can update the hosts in it
04:39:22 <samP> takashi: same understanding
04:39:26 <tpatil> takashi: correct
04:39:38 <takashi> and periodic task should pick up notifications in error, and change it to finished or failed
04:39:49 <samP> so, we have to wait for periodic task to clear it
04:40:07 <tpatil> the notification in error status will be processed by the periodic tasks
04:40:39 <takashi> #link https://review.openstack.org/#/c/431335/
04:40:40 <takashi> this one
04:41:30 <tpatil> #link : https://review.openstack.org/#/c/427072/
04:42:06 <takashi> sorry, I pasted wrong url...
04:42:08 <takashi> :-(
04:42:21 <tpatil> after processing notification in error status, it will change the status either to finished or failed. After that, you should be able to change the on_maintenance flag
04:43:06 <rkmrHonjo> All: Thank you for explaining. In conclusion, this is correct as Masakari specs, right?
04:43:20 <tpatil> rkmrHonjo: yes
04:44:11 <samP> rkmrHonjo: yes, if there is error notifications in the segment, you have to wait till periodic task clear the state of the notifications
04:44:46 <rkmrHonjo> samP, tpatil: OK. I fix the description and change the status to "invalid" after that.
04:44:57 <samP> rkmrHonjo: thanks
04:45:11 <samP> is there any other bugs to discuss?
04:45:45 <tpatil> rkmrHonjo: the default interval of periodic task is 120 seconds
04:46:01 <takashi> samP: I think tpatil listed up some bugs in high priority etherpad
04:46:10 <takashi> so can we move to the topic?
04:46:12 <samP> takashi: yep..
04:46:23 <samP> that is the next topic
04:46:53 <samP> #topic High priority items for Ocata
04:47:12 <samP> lets discuss about tpatil's list
04:47:24 <samP> #link https://etherpad.openstack.org/p/ocata-priorities-masakari
04:48:11 <samP> I added the items for python-masakariclient and masakari-monitors.
04:48:40 <tpatil> Add periodic tasks to process notifications in error/new states
04:49:18 <tpatil> # link : https://review.openstack.org/#/c/431335/
04:49:28 <tpatil> above patch is pending for review
04:50:33 <samP> tpatil: I will review it
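[Editor's note: below is a rough sketch of the periodic-task pattern proposed in https://review.openstack.org/#/c/431335/, using oslo.service. The 120-second spacing matches the default interval tpatil mentioned earlier; get_stuck_notifications() and process_notification() are placeholders for masakari's real DB and engine calls, not actual APIs.]

    # Hedged sketch of the pattern only; see the review above for the real change.
    from oslo_config import cfg
    from oslo_service import periodic_task

    CONF = cfg.CONF

    def get_stuck_notifications(context, statuses):
        """Placeholder: would query masakari's DB for notifications left in
        the given statuses (e.g. 'error' or 'new')."""
        return []

    def process_notification(context, notification):
        """Placeholder: would re-run the matching recovery workflow, moving
        the notification to 'finished' or 'failed'."""

    class EnginePeriodicTasks(periodic_task.PeriodicTasks):
        def __init__(self):
            super(EnginePeriodicTasks, self).__init__(CONF)

        @periodic_task.periodic_task(spacing=120)
        def _process_unfinished_notifications(self, context):
            # Once every notification in a segment is 'finished' or 'failed',
            # hosts in that segment can be updated again (see bug 1664183).
            for notification in get_stuck_notifications(context,
                                                        statuses=('error', 'new')):
                process_notification(context, notification)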
04:50:39 <tpatil> 2. Implement reserve host recovery methods
04:50:53 <tpatil> #link : https://review.openstack.org/#/c/432314/
04:51:03 <tpatil> I have reviewed this patch and voted -1
04:51:12 <tpatil> Dinesh will address these review comments
04:51:37 <Dinesh_Bhor> ^^ I will address them today
04:51:41 <samP> tpatil: Dinesh_Bhor thanks
04:52:40 <takashi> will review the patch, and also its spec
04:54:01 <tpatil> Release notes build is failing
04:54:26 <tpatil> It requires major changes
04:55:11 <rkmrHonjo> tpatil: major changes?
04:55:17 <samP> tpatil: What changes are you proposing?
04:55:42 <tpatil> after you build the release notes using to command, it should show Newton Release Notes and Current Series Release Notes
04:55:51 <tpatil> s/to/tox
04:56:16 <tpatil> currently, it's not showing Newton Release notes at all.
04:57:20 <takashi> fyi: http://docs.openstack.org/releasenotes/nova/unreleased.html
04:57:42 <tpatil> Also, the release note formatting is incorrect at few other places which we are planning to fix in the same LP bug
04:58:03 <takashi> in this release note, we can find newton release note in the page also as current series
04:58:27 <samP> tpatil: takashi: got it.
04:58:49 <samP> tpatil: are you addressing this on same LP bug?
04:59:13 <tpatil> samP: yes
05:00:13 <samP> tpatil: OK, thanks. Me too take look into this, since masakari-monitors, and python-masakariclient need the same fix
05:00:17 <sagara> we have no time, do we need to change the room?
05:00:24 <samP> nop
05:00:55 <samP> OK, lets finish the discussion here and move to #openstack-masakari
05:01:30 <samP> thank you all..
05:02:25 <tpatil> Thanks
05:02:26 <samP> lets finish, please use #openstack-masakari or ML for further discussion
05:02:30 <samP> #endmeeting
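[Editor's note on the release notes item above: OpenStack projects build release notes with reno, normally via "tox -e releasenotes". The Newton page only shows up if a stub page like the one sketched below exists under releasenotes/source/ and is listed in that directory's index.rst toctree; the exact layout chosen for masakari's fix may differ.]

    Newton Series Release Notes
    ===========================

    .. release-notes::
       :branch: origin/stable/newton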