04:00:17 #startmeeting masakari
04:00:18 Meeting started Tue Jan 16 04:00:17 2018 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:00:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:00:21 Hi
04:00:21 The meeting name has been set to 'masakari'
04:00:27 tpatil: Hi
04:00:28 hi
04:00:36 sorry for the long absence
04:00:59 #topic High priority items
04:01:14 Any high priority items to discuss?
04:01:53 If any come up, please bring them up at any time. Proceeding to the next topic.
04:02:01 #topic Critical bugs
04:02:09 Any bugs to discuss?
04:02:22 #link https://review.openstack.org/#/c/531310/
04:02:53 I think this bug should be fixed in the Queens release.
04:02:59 I have voted -1.
04:03:31 tpatil: I heard that Takahara is addressing your comments now.
04:03:51 tpatil: thanks for the review.
04:03:56 rkmrHonjo: Ok
04:04:28 The comments are not critical; it's an easy fix.
04:04:48 tpatil: rkmrHonjo: let's merge this in Queens.
04:04:58 samP: Yes
04:05:08 samP: ok, I will tell Takahara.
04:05:15 Another patch: https://review.openstack.org/#/c/486576/
04:05:20 tpatil: rkmrHonjo: thanks
04:05:48 We should merge this patch, as the py35 tests are failing on all patches.
04:06:13 I have already voted +2; we need another +2.
04:06:26 tpatil: I will look into this.
04:06:40 samP: ok, thanks
04:06:45 samP: thanks.
04:06:53 tpatil: thanks for the review
04:07:00 rkmrHonjo: thanks for the fix
04:08:17 Dinesh suggested changing the py35 test to voting (the current py35 test is non-voting). I think we can change it after merging this patch.
04:09:04 rkmrHonjo: sure, let's see how things work after merging the above patch.
04:09:18 ok.
04:09:37 If there are no problems, then let's make py35 voting.
04:10:01 Any other bugs?
04:10:17 https://bugs.launchpad.net/masakari/+bug/1738340
04:10:18 Launchpad bug 1738340 in masakari "When no reserved_host available, nova-compute service on failed host remains enabled" [Undecided,In progress] - Assigned to takahara.kengo (takahara.kengo)
04:11:34 In the fix, a new config option is introduced, so I commented on the patch asking for a lite-spec to be written.
04:11:51 Is it ok to treat this issue as a bug fix, or should it be a feature?
04:12:50 tpatil: I think this patch doesn't add a new config option.
04:12:53 sorry, no new config option
04:13:58 tpatil: ok. And I think this is just a bug fix, because this patch doesn't add a new action.
04:14:51 In the current rh_workflow, the failed host stays enabled and VMs are not evacuated if no reserved host is available.
04:14:55 rkmrHonjo: tpatil: sorry for the delay, I was trying to understand the problem here.
04:15:33 The notification request will be marked as complete, which will give a false impression to the operator.
04:15:34 rkmrHonjo: this patch proposes to disable nova-compute on the failed host
04:15:49 after disabling the compute host
04:15:54 But the failed host will be disabled if this patch is merged. I think that is not a new action, and it is good from a safety point of view.
04:16:52 tpatil: Ah, thanks, I understand your opinion.
04:17:03 If the notification request is complete, it means the failed host was evacuated successfully; but if a reserved host is not available, the patch simply disables the compute host, which I think isn't sufficient.
04:17:40 tpatil: agree.
04:18:02 tpatil: What do you think of this idea: "Disable the host, but do not complete the notification"?
04:18:19 You may disable the compute node, but you should not complete the recovery.
04:18:26 samP: yes.
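For context, a minimal sketch (not Masakari's actual workflow code) of the idea agreed above: disable nova-compute on the failed host, but leave the recovery incomplete when no reserved host is available so the periodic task can retry. The auth endpoint, credentials, host names, and the local ReservedHostsUnavailable stand-in exception are illustrative assumptions; the Nova calls use python-novaclient.

    # Sketch only: disable the failed host's nova-compute service and keep
    # the notification incomplete when no reserved host exists.
    from keystoneauth1 import loading, session
    from novaclient import client


    class ReservedHostsUnavailable(Exception):
        """Stand-in for the Masakari exception of the same name."""


    def get_nova_client():
        loader = loading.get_plugin_loader('password')
        auth = loader.load_from_options(
            auth_url='http://controller:5000/v3',     # assumed endpoint
            username='masakari', password='secret',   # assumed credentials
            project_name='service',
            user_domain_name='Default', project_domain_name='Default')
        sess = session.Session(auth=auth)
        # Microversion 2.11 is the first that exposes the forced_down flag.
        return client.Client('2.11', session=sess)


    def handle_host_failure(failed_host, reserved_hosts):
        nova = get_nova_client()
        # (1) Disable the compute service first, so the scheduler stops
        #     placing new instances on the failed host.
        nova.services.disable(failed_host, 'nova-compute')
        if not reserved_hosts:
            # (2) No reserved host: do not mark the recovery complete, so
            #     the periodic task retries and eventually sets it failed.
            raise ReservedHostsUnavailable()
        # (3) Otherwise, continue with the evacuation workflow...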
04:19:12 samP: agree
04:19:52 ok, can I (and Takahara) create a new patch according to samP's idea?
04:19:55 In the periodic task, it will again try to execute the workflow, and finally it will give up and set it to failed.
04:20:12 tpatil: correct
04:21:46 rkmrHonjo: I will add this comment on the patch.
04:21:56 I think we should bring this step to the very beginning of the flow: (1) disable the compute node (2) run the evacuation.
04:22:08 I will add my comments too.
04:22:24 tpatil: thanks! In that case, should I write a lite-spec?
04:22:40 Just noticed that it is raising the ReservedHostsUnavailable exception.
04:23:23 So I think whatever we discussed is already taken care of. I will review the patch again.
04:24:06 tpatil: I got it.
04:24:52 rkmrHonjo: I will add a comment if a spec is required.
04:25:06 samP: ok.
04:25:32 I think a release note is required.
04:27:16 rkmrHonjo: it depends on how you implement this.
04:27:42 samP: Yeah. I will wait for your comments.
04:28:30 Let's do the review first and check what kind of changes are needed.
04:28:52 I got it.
04:28:56 Then we can discuss the spec and release notes.
04:29:29 rkmrHonjo: BTW, thanks for bringing up the release note point.
04:30:13 Any other bugs/patches to discuss?
04:30:31 sam
04:30:38 samP: not from my side
04:30:50 no.
04:31:02 thanks, let's move to the next topic
04:31:11 #topic Discussion Points
04:32:06 Sorry that I couldn't follow the work.
04:32:55 Please proceed if you have any updates on your work.
04:33:16 Regarding the Horizon dashboard:
04:33:31 Niraj has pushed the initial cookiecutter patch.
04:33:54 I have voted +2, need another +2.
04:34:25 tpatil: I will check.
04:34:33 oh, sorry, I couldn't review it last week...
04:34:43 tpatil: Niraj: Thanks
04:37:17 any other updates?
04:37:18 Can I talk about my update?
04:37:40 rkmrHonjo: sure, go ahead
04:37:48 samP: thanks.
04:37:50 Calling the force-down API when a host failure is notified:
04:38:03 tpatil: Takahara replied to you on Gerrit. Please check it.
04:38:21 #link https://review.openstack.org/#/c/526598/
04:39:55 rkmrHonjo: I have read his reply and I agree the evacuation will succeed. But the compute service might still be up and running, which could update instance states.
04:41:32 tpatil: I think that is prevented; there is the forced_down flag. I will re-confirm it and write it on Gerrit.
04:42:29 After the force-down API is called, is the compute service still running?
04:44:41 I will test this case and let you know the results.
04:45:37 tpatil: I think that is case by case. The force-down API doesn't kill the process. But basically, the operator will configure the crm to stop the node. And it is the same as the current implementation (waiting 3 minutes).
04:47:26 rkmrHonjo: If the operator is going to handle the force-down notification and kill the compute service on the failed node, then I don't see any problem.
04:48:58 rkmrHonjo: Not clear: does the operator configure the crm to catch the forced_down flag?
04:50:50 tpatil: No, the crm doesn't catch the forced_down flag. The operator will configure the crm to catch a host down. If Pacemaker catches the host down, masakari-monitor sends a notification (host down).
04:51:16 rkmrHonjo: got it.
04:51:36 I think the forced_down flag is referenced by Nova. If it is true, Nova doesn't change the service status back to up.
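For reference, a minimal sketch of the Nova force-down call discussed here, assuming the get_nova_client() helper from the earlier sketch. It only sets the forced_down flag on the nova-compute binary (so Nova will not flip the service back to "up"); it does not stop the process itself. The host name 'compute-1' is an illustrative assumption.

    # Sketch only: mark the failed host's nova-compute as forced down.
    nova = get_nova_client()
    nova.services.force_down('compute-1', 'nova-compute', True)

    # The flag can be checked afterwards; state stays "down" while
    # forced_down is True, even if the process is still reporting in.
    svc = nova.services.list(host='compute-1', binary='nova-compute')[0]
    print(svc.forced_down, svc.state)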
04:51:46 Your point is: when Masakari gets the HostDown notification, the host is already down for sure.
04:52:15 rkmrHonjo: I got it.
04:52:28 And it's totally safe to use the force-down API to bring down the binary and proceed with the evacuation workflow.
04:54:07 Correction: not bring down the binary, but just put the forced_down flag on that binary.
04:54:19 rkmrHonjo: understood
04:54:49 We only have 5 mins left.
04:57:47 I think if we make this change to Masakari, it means Pacemaker, or whoever is controlling the cluster, must make sure it kills the node before it sends the host-failed notification to Masakari.
04:58:18 I think we have already put this point in our docs (if not, we must).
04:58:37 any other updates?
04:58:51 No
04:59:05 no.
04:59:10 Thank you all...
04:59:17 thank you.
04:59:30 Please use #openstack-masakari or the ML with [masakari] for further discussion.
04:59:38 Thank you all.
04:59:45 #endmeeting