04:01:10 #startmeeting masakari
04:01:11 Meeting started Tue Feb 14 04:01:10 2017 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:15 The meeting name has been set to 'masakari'
04:01:19 hi all..
04:01:25 \o
04:01:31 Hi
04:01:40 #topic Bugs
04:01:53 o/
04:02:04 rkmrHonjo has reported 2 bugs.
04:02:34 rkmrHonjo: could you please explain them?
04:02:41 OK.
04:02:55 # link : https://bugs.launchpad.net/masakari/+bug/1663513
04:02:55 Launchpad bug 1663513 in masakari "Masakari failed to rescue PAUSED instances" [Undecided,New]
04:03:07 # link : https://bugs.launchpad.net/masakari/+bug/1663513
04:03:14 tpatil: thanks.
04:03:48 In this report, I found that masakari can't rescue PAUSED instance.
04:04:19 rkmrHonjo: you mean, the instance paused by users, right?
04:05:07 Yes. Masakari try to call stop API to paused instance, but the return code of API is "409".
04:05:33 rkmrHonjo: when user pause instance, what libvirt event is sent in the notification?
04:06:31 rkmrHonjo: I'd like to confirm which version you are using for your testing
04:06:39 rkmrHonjo: Does it include this change? https://review.openstack.org/#/c/430121/
04:06:53 tpatil: Please wait, I check it.
04:07:08 s/tpatil/takashi/g
04:07:13 I think that tpatil is thinking about the smme thing as me...
04:07:40 takashi: yes
04:08:26 tpatil: Reproduce procedure is... 1. Pause instance. 2. Kill the instance process after paused. This is a simulation of instance failure.
04:08:26 IMO if we do this, then user will never be able to pause the instance
04:09:34 rkmrHonjo: so you are trying the case when some process failure happens 'after' user pauses their instance.
04:09:47 s/some process/qemu process/g
04:09:54 takashi: Yes.
04:10:37 Pausing is not trigger. Paused instance failure is the trigger of this bug.
04:11:57 So https://review.openstack.org/#/c/430121/ doesn't resolve this issue.
04:12:02 rkmrHonjo: ok. so let's back to tpatil's question. What kind of notification does masakari monitor send for that situation?
04:12:13 LIFECYCLE STOPPED_FAILED?
04:12:39 just for my confirmation
04:12:45 takashi: Ah...Please wait.
04:13:26 inshort wheneve masakari-engine receives notification, we need to reset it to error state?
04:14:04 we are setting instance to error sstate if it is i resize (IMO this is for host failure)
04:14:12 s/i/in
04:15:40 takashi: STOPPED_FAILED, yes.
04:15:53 rkmrHonjo: thx
04:16:03 abhishekk: your are talking about this? https://github.com/openstack/masakari/blob/master/masakari/engine/drivers/taskflow/instance_failure.py#L60-L62
04:16:31 takashi: yes
04:16:50 the one for host failure https://github.com/openstack/masakari/blob/master/masakari/engine/drivers/taskflow/host_failure.py#L95-L102
04:17:26 If user has paused instance, does libvirt generated STOPPED_FAILED event?
04:17:50 IMO this way for each vm_state except (shelved_offloaded) we can trigger this failure, am I right?
04:18:27 tpatil: STOPPED_FAILED is not caused just for pausing instance
04:19:06 tpatil: even after we pause instance, qemu process for the paused instance remains on compute node, and ...
04:19:23 abhishekk, takashi: Maybe yes. I'm checking other statuses now. PAUSED is one of the examples.
04:19:33 Pased by the user right? Notification sent by the process monitor?
04:19:54 tpatil: if we get some failure for the qemu process, we get notification with STOPPED_FAILED
04:20:10 samP: Instance monitor sent notification. Because this is a instance failure.
04:20:38 He is talking if instance is paused and after oausing if it fails (killing qemu process) then
04:20:53 s/ousing/pausing
04:20:53 takashi: and in this case, you are simulating to kill the qemu process and then it send STOPPED_FAILED event, correct?
04:22:05 tpatil: right
04:23:31 so the point here is, we can get STOPPED_FAILED event for active instance and poused instance, because for both cases qemu process remains and send that notification when the qemu process dies for some reasons
04:23:32 masakari-engine doesn't have enough information to take all these smart decision as it's unaware of what's going on the compute node.
04:23:39 like OOM-killer
04:24:25 s/send/masakari-monitor sends/
04:24:50 takashi: You're right. thanks.
04:25:25 rkmrHonjo: Do you think calling reset-state api, stop and start will solve this problem?
04:25:54 tpatil:Yes.
04:26:33 If we can assume that STOPPED_FAILED is raised only when qemu process dies unexpectedly, it makes sense to me to reset the instance state to error for any cases
04:26:40 before calling reset-state, engine can check the vm state/task status and accordingly make some decisions and call reset-state api before calling stop and start apis.
04:27:08 tpatil: that is what we are doing for resizing, right?
04:27:18 correct
04:27:37 IMO, if it is a user paused instance, in this case we change the state to paused -> active/stop
04:27:47 is acceptable?
04:27:57 is it acceptable?
04:28:16 samP: Yeah. I agree with tpatil's opinion.
04:30:08 rkmrHonjo: Let me confirm the problem again. user pauses instance, before it's paused gracefully qemu process dies and it generate STOPPED_FAILED event and send an notification to masakari. Masakari check the vm/task state and call reset-state api before calling stop and start apis.
04:30:28 so finally user will see the instance in active status and NOT in paused state.
04:31:46 provided qemu process is up again on the compute node.
04:32:18 rkmrHonjo: Please describe the reproduction steps on the LP bug in detail
04:32:28 tpatil: thanks, this method looks good to me
04:32:42 tpatil: OK, I describe the steps.
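[Editor's sketch] The recovery order proposed above (check the vm state, call reset-state if needed, then stop and start) might look roughly like the following. This is only an illustration, not masakari's actual code: the function name `plan_instance_recovery`, the action strings, and the `STOPPABLE_STATES` set are all hypothetical, and which vm states really accept a stop call is an assumption here.

```python
from typing import List

# ASSUMPTION (illustrative only): vm states from which a stop call is
# expected to succeed without a prior reset-state. The log only tells us
# that a stop on a PAUSED instance returned 409.
STOPPABLE_STATES = {"active", "error"}


def plan_instance_recovery(vm_state: str) -> List[str]:
    """Return the ordered actions for a STOPPED_FAILED notification.

    Hypothetical helper: if the instance is in a state that would reject
    a stop call (for example "paused"), reset it to error first, then
    stop and start it.
    """
    actions: List[str] = []
    if vm_state not in STOPPABLE_STATES:
        # Move the instance to a state that accepts stop.
        actions.append("reset-state")
    actions.extend(["stop", "start"])
    return actions


if __name__ == "__main__":
    print(plan_instance_recovery("paused"))
    print(plan_instance_recovery("active"))
```

With this ordering a paused instance whose qemu process died ends up active rather than paused, which matches the outcome described at 04:30:28.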
04:33:02 tpatil: so, user can paused it again
04:34:25 #action rkmrHonjo add reproduction steps on the LP bug #1663513 in detai
04:34:25 Launchpad bug 1663513 in masakari "Masakari failed to rescue PAUSED instances" [Undecided,New] https://launchpad.net/bugs/1663513
04:34:38 tpatil, samP: I agree with your opinion.
04:34:59 samP: Ok
04:35:12 OK then, lets discuss details on the LP or gerrit for this
04:35:19 +1
04:35:31 +1
04:35:36 +1
04:35:43 Next bug,
04:35:50 #link https://bugs.launchpad.net/masakari/+bug/1664183
04:35:50 Launchpad bug 1664183 in masakari "Failed to update a status of hosts." [Undecided,New]
04:36:17 OK. This report says... Masakari user can't update status of host-foo if there are "error" notifications in host-bar. (host-foo and bar belong to same segment.)
04:36:17 Is this correct?
04:36:53 you can only update on_maintenance and reserved properties of hosts
04:37:32 there is no status field for host
04:37:39 host doesn't have any property like status
04:37:40 Ah, sorry, "status" means on_maintenance.
04:37:58 I fix LP after that.
04:38:24 if the error is temporal one, it will be resolved by periodic task. Is it right?
04:38:51 In my understanding, if the fail over segment has notifications only in finished or failed status, we can update the hosts in it
04:39:22 takashi: same understanding
04:39:26 takashi: correct
04:39:38 and periodic task should pick up notifications in error, and change it to finished or failed
04:39:49 so, we have to wait for periodic task to clear it
04:40:07 the notification is error status will be processed by the periodic tasks
04:40:39 #link https://review.openstack.org/#/c/431335/
04:40:40 this one
04:41:30 #link :https://review.openstack.org/#/c/427072/
04:42:06 sorry, I pasted wrong url...
04:42:08 :-(
04:42:21 after processing notification in error status, it will change the status either to finished or failed.
After that, you should be able to change the on_maintenance flag
04:43:06 All: Thank you for explaining.In conclusion, this is correct as Masakari specs, right?
04:43:20 rkmrHonjo: yes
04:44:11 rkmrHonjo: yes, if there is error notifications in the segment, you have to wait till periodic task clear the state of the notifications
04:44:46 samP, tpatil:OK. I fix describe and change status to "invalid" after that.
04:44:57 rkmrHonjo: thanks
04:45:11 is there any other bugs to discuss?
04:45:45 rkmrHonjo: the default interval of periodic task is 120 seconds
04:46:01 samP: I think tpatil listed up some bugs in high priority etherpad
04:46:10 so can we move to the topic?
04:46:12 takashi: yep..
04:46:23 that is the next topic
04:46:53 #topic High priority items for Ocata
04:47:12 lest discuss about tpatil's list
04:47:24 #link https://etherpad.openstack.org/p/ocata-priorities-masakari
04:48:11 I added the items for python-masakariclinet and masakari-monitors.
04:48:40 Add periodic tasks to process notifications in error/new states
04:49:18 # link : https://review.openstack.org/#/c/431335/
04:49:28 above patch is pending for review
04:50:33 tpatil: I will review it
04:50:39 2. Implement reserve host recovery methods
04:50:53 #link : https://review.openstack.org/#/c/432314/
04:51:03 I have review this patch and voted -1
04:51:12 Dinesh will address these review comments
04:51:37 ^^ I will address them today
04:51:41 tpatil: Dinesh_Bhor thanks
04:52:40 will review the patch, and also its spec
04:54:01 Release notes build is failing
04:54:26 It requires major changes
04:55:11 tpatil: major changes?
04:55:17 tpatil: What changes are you proposing?
04:55:42 after you build the release notes using to command, it should show Newton Release Notes and Current Series Release Notes
04:55:51 s/to/tox
04:56:16 currently, it's not showing Newton Release notes at all.
04:57:20 fyi: http://docs.openstack.org/releasenotes/nova/unreleased.html
04:57:42 Also ,the release note formatting is incorrect at few other places which we are planning to fix in the same LP bug
04:58:03 in this release note, we can find newton release note in the page also as current seriese
04:58:27 tpatil: takashi: got it.
04:58:49 tpatil: are you addressing this on same LP bug?
04:59:13 samP: yes
05:00:13 tpatil: OK, thanks. Me too take look into this, sice masakari-monitors,and python-masakariclient need the same fix
05:00:17 we have no time, do we need to change the room?
05:00:24 nop
05:00:55 OK, lets finish the discussion here and move to #opesntack-masakari
05:01:30 thank you all..
05:02:25 Thanks
05:02:26 lets finish, please use #opesntack-masakari or ML for further discussion
05:02:30 #endmeeting