04:02:28 #startmeeting masakari
04:02:29 Meeting started Tue Jan 24 04:02:28 2017 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:02:30 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:02:32 The meeting name has been set to 'masakari'
04:02:39 Hi all
04:02:43 samP: hi
04:02:49 o/
04:03:02 hi
04:03:13 since we do not have critical bugs, let's jump into discussion
04:03:35 ok.
04:03:49 In the last meeting we discussed the signal handler issue
04:03:58 tpatil: yes
04:04:13 the signal issue is fixed in patch https://review.openstack.org/#/c/421767/
04:04:38 just saw you have approved that patch
04:04:46 tpatil: LGTM
04:05:07 tpatil: thanks.
04:05:13 rkmrHonjo: if it's ok, then +1 the workflow
04:05:15 needs to set W+ on that patch
04:05:46 samP, abhishekk: sure.
04:06:35 it does not affect other patches, so no need to rebase them, right?
04:07:03 rkmrHonjo: We will submit a separate patch to exit the child process gracefully
04:07:33 tpatil: Does the patch use ServiceLauncher?
04:08:05 rkmrHonjo: No, it uses the same code as before, i.e. ProcessLauncher
04:09:04 tpatil: OK, thanks.
04:10:20 tpatil: thanks..
04:10:28 sorry for the short disconnection, just came back from wifi trouble.
04:11:30 takashi: np
04:11:58 tpatil: when do you plan to submit the new patch?
04:13:29 samP: tushar san is facing an internet issue
04:13:33 sorry, I lost my internet connection
04:13:40 tpatil_: np
04:13:42 tpatil: when do you plan to submit the new patch?
04:13:57 samP: can you please suggest what the default interval for the new periodic task should be?
04:14:34 tpatil_: are you referring to https://review.openstack.org/#/c/423059/1
04:14:45 yes
04:16:58 the higher the probability of notification failure, the shorter the interval should be
04:18:06 Do you think a 2 min interval is appropriate for the new periodic task?
04:18:39 tpatil_: the current default is 300, right? feels a bit long, but 120 would be nice
04:18:55 samP: ok
04:19:17 I will update the lite-specs accordingly
04:19:50 tpatil_: in nova it is 60, but we are not just polling things, so in that case 120 would be OK
04:20:31 however, I'm wondering, can we recommend a minimum value for retry_notification_new_status_interval?
04:21:36 tpatil_: thanks, I put a minor comment on the minimum value for retry_notification_new_status_interval
04:22:08 samP: We can keep the default value of retry_notification_new_status_interval at 60
04:23:06 tpatil_: that means at generated_time+60s it will retry the new notifications
04:23:27 correct
04:24:37 tpatil_: to my understanding, it won't take long before the recovery flow gets the new notifications, way less than 60s.
04:25:27 so 60s would be fine. I will add these comments to the spec review.
04:25:57 samP: the execution time of each flow cannot be predicted as it goes through different services
04:27:59 tpatil_: ah, right. I will check the other config values and comment on the spec
04:28:09 samP: ok
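A minimal sketch of how the two intervals agreed above (120s for the new periodic task, 60s for retry_notification_new_status_interval) could be wired up with oslo.config and oslo.service's periodic_task module. Only retry_notification_new_status_interval is a name taken from the discussion; the other option name, the class, and the task body are illustrative, not the actual masakari code.

    from oslo_config import cfg
    from oslo_service import periodic_task

    CONF = cfg.CONF
    CONF.register_opts([
        # How often the new periodic task runs (the 120s agreed above).
        cfg.IntOpt('periodic_task_interval', default=120,
                   help='Interval in seconds between runs of the task that '
                        'retries unfinished notifications.'),
        # A notification in "new" status is only retried once it is older
        # than generated_time + this value (the 60s agreed above).
        cfg.IntOpt('retry_notification_new_status_interval', default=60,
                   help='Seconds after generated_time before a notification '
                        'stuck in "new" status is retried.'),
    ])


    class NotificationManager(periodic_task.PeriodicTasks):
        def __init__(self):
            super(NotificationManager, self).__init__(CONF)

        @periodic_task.periodic_task(spacing=CONF.periodic_task_interval)
        def _retry_unfinished_notifications(self, context):
            # Pick up notifications that ended in "error", plus "new" ones
            # older than retry_notification_new_status_interval, and push
            # them through the recovery workflow again.
            ...

(The loop that calls run_periodic_tasks() on this manager is elided; the point is that the retry cut-off and the task spacing are two independent knobs.)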
04:28:28 one other question: how does the operator know which flow is not retrying?
04:29:04 Abhishek: please explain the notification status flow
04:29:30 it seems both notifications end in the "failed" state
04:29:37 yes
04:29:57 so the 1st case is if the notification ends in the error state
04:30:25 then the periodic task will pick that notification up and the states will be error > running > success or failed
04:30:34 ah, got it
04:30:53 the 2nd one: if the notification was ignored, then it will go new > running > failed or success
04:30:54 if failed, the operator must look into it
04:30:59 yes
04:31:07 samP: correct
04:31:08 abhishekk: thanks, got it.
04:31:38 tpatil_: thanks
04:31:39 samP: All notifications whose status is Failed should be resolved by the operator
04:31:59 tpatil_: clear.. thanks
04:33:02 OK then, the overall flow is LGTM except for minor comments about default values. I will add my comments on them.
04:34:01 if there are no other comments or questions about the periodic task, shall we move to "RESERVED_HOST recovery action"?
04:34:09 Sure
04:34:21 here is the spec.
04:34:37 hmm... the link is not working
04:34:40 https://review.openstack.org/#/c/423072/1/specs/ocata/approved/implement-reserved-host-action.rst
04:35:33 are we going to set reserved=False at the end of the execution?
04:35:40 #link: https://review.openstack.org/#/c/423072
04:36:24 samP: as per the current approach, yes
04:36:55 it seems to me there is a possibility that multiple workflows could take the same reserved host...
04:37:58 samP: you are correct, the reserved host should be set to False immediately, even before evacuating VMs from the failed compute node.
04:38:51 samP: We will incorporate this use case in the current specs
04:38:51 Or do we need a lock on the reserved host?
04:39:40 as per the current design, we don't want to call the masakari db api from the workflow
04:39:40 tpatil_: thanks.. we may introduce a new flag/lock, but do reserved=false & nova-compute=disable have the same effect?
04:40:38 ah.. no.. they won't
04:40:45 IMO setting reserved=false means we need to enable the compute service for that host
04:41:47 abhishekk: yes, but we can't set reserved=false at the start and set nova-compute=enable at the end, in the same flow.
04:42:22 abhishekk: in the middle, we can set the error handling
04:43:35 samP: the flow should be: loop through the available reserved host list -> set reserved to False, enable the compute node of the reserved host -> evacuate VMs
04:43:36 we need to enable the compute service at the start only
04:44:42 we need to enable the nova-compute service before evacuation, right?
04:44:58 takashi: correct
04:45:09 tpatil_: before evacuate, correct
04:45:34 samP: yes
04:46:48 tpatil_: from the start to the end of the recovery flow, how do we prevent nova-scheduler from assigning new VMs to that host?
04:47:11 samP: That's not possible
04:47:53 tpatil_: can't we use on_maintenance? does it block the evacuation too?
04:48:15 samP: The evacuate API will fail if the compute host is out of resources, and then it will get a new reserved host from the list and continue the evacuation on the new compute node
04:49:43 samP: nova does not know about on_maintenance; once the compute service is enabled on the reserved host, nova can use that host to schedule new instances
04:51:04 samP: Abhishek is correct
04:51:31 sorry, probably my mistake about nova service-disable --reason maintenance
04:52:59 however, it seems that migrate can be performed even in service-disable mode, maybe that is what confused me
04:53:10 http://docs.openstack.org/ops-guide/ops-maintenance-compute.html
04:55:50 samP: Just some information: the software that masakari is based on didn't address this scheduling problem. (Of course, improving it is nice.)
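The recovery order tpatil_ spells out at 04:43:35 (claim the reserved host, enable its compute service, then evacuate, falling through to the next reserved host on failure) would look roughly like the sketch below. The helper functions and the OutOfResources exception are hypothetical stand-ins for masakari DB and nova API calls, not the project's real interfaces.

    # Hypothetical helpers standing in for masakari DB and nova API calls.
    def set_host_reserved(host, reserved):
        """Flip the host's reserved flag in the masakari database."""

    def enable_compute_service(host):
        """Enable the nova-compute service on the host."""

    def evacuate_instances(source_host, destination_host):
        """Evacuate all instances from source_host to destination_host."""

    class OutOfResources(Exception):
        """Destination host cannot fit the evacuated instances."""

    def recover_on_reserved_host(failed_host, reserved_hosts):
        for host in reserved_hosts:
            # Claim the host first, so a concurrent workflow cannot pick
            # the same reserved host (the race discussed above).
            set_host_reserved(host, reserved=False)
            # nova-compute must be enabled before evacuation; a disabled
            # destination cannot receive the RPC messages.
            enable_compute_service(host)
            try:
                evacuate_instances(failed_host, host)
                return host
            except OutOfResources:
                # Evacuation failed for lack of capacity: fall through to
                # the next reserved host from the list.
                continue
        raise RuntimeError('no reserved host could take %s' % failed_host)

Note the trade-off raised at 04:46:48: once the compute service is enabled, nothing in this flow stops nova-scheduler from placing new instances on the reserved host while the evacuation runs.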
04:56:12 IMO, the nova compute service should be enabled even in the live migration case, otherwise RPC messages won't work at all
04:57:01 rkmrHonjo: correct.
04:58:38 tpatil_: agree... I will check the current nova code and update myself...
04:59:11 ok then, it's almost time...
04:59:34 In my understanding, http://docs.openstack.org/ops-guide/ops-maintenance-compute.html says that the source compute node can be disabled, but the dest compute node should be enabled.
04:59:53 Let's keep the discussion on the specs. Would be glad if we can get some feedback from the previous masakari implementation
05:00:06 rkmrHonjo: right
05:00:26 Please put your comments and questions on the spec..
05:00:37 #topic AOB
05:00:44 I'll have a look at the specs and some remaining patches
05:00:52 takashi: sure, thanks
05:01:05 Should we move to #openstack-masakari?
05:01:45 takashi: sure, but I was thinking to link gerrit with it
05:02:22 Since infra said they don't have enough bots, I left it as a TODO
05:03:47 takashi: since we are out of time, let's discuss it at the next meeting..
05:04:42 any other questions or comments?
05:04:58 Can I talk about another topic? (Sorry, I forgot to write the topic on the wiki.)
05:05:08 rkmrHonjo: sure
05:05:15 samP: Thanks.
05:05:45 Masakari-monitors: Takahara & I re-thought "ProcessLauncher or ServiceLauncher" after last week's meeting.
05:06:07 As a result, we think the service launcher is better than the process launcher. There are 2 reasons.
05:06:53 1: Using the service launcher for a non-http server is the general way. (This reason was already given in last week's meeting.)
05:07:59 2: Launching the 3 monitors as workers of one parent is not useful. Some users won't wish to use all the monitors, and restarting/shutting off one monitor is not easy that way.
05:08:19 samP: What do you think about this?
05:09:24 rkmrHonjo: Are you recommending the service launcher instead of the current process launcher?
05:09:36 samP: Yes.
05:10:52 rkmrHonjo: From my POV, (1) is not that important, however (2) is important
05:10:59 rkmrHonjo: agree with you on (2)
05:12:15 samP: Thanks. We'll change the launcher after tpatil's patch is merged.
05:12:32 rkmrHonjo: if we are going to change this, then we have to discuss it with tpatil and abhishekk too, because they are working on the new patch
05:13:18 rkmrHonjo: can you create a doc on etherpad or a PoC for that patch?
05:14:40 samP: "That patch" means changing the launcher, right?
05:14:46 rkmrHonjo: yes
05:15:35 samP: OK. I will create a doc or PoC and notify you, tpatil and abhishekk.
05:16:10 rkmrHonjo: thank you very much..
05:16:19 samP: thanks.
05:17:09 OK then, we are 16 mins over the schedule.. let's wrap up
05:17:25 are there any other things to discuss?
05:18:07 if not, then let's finish the meeting. Use #openstack-masakari for further discussions
05:18:13 Thank you all
05:18:35 #endmeeting
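For reference, the ServiceLauncher arrangement rkmrHonjo argues for above (his point 2) would look roughly like the following sketch: one oslo.service Service per monitor, launched on its own, so a single monitor can be run, stopped, or restarted independently. ProcessMonitor and the 10s poll interval are illustrative stand-ins, not the actual masakari-monitors code.

    from oslo_config import cfg
    from oslo_service import service

    CONF = cfg.CONF


    class ProcessMonitor(service.Service):
        """Illustrative stand-in for one masakari-monitors monitor."""

        def start(self):
            super(ProcessMonitor, self).start()
            # Run the monitoring loop on the service's thread group.
            self.tg.add_timer(10, self._check_processes)

        def _check_processes(self):
            # Poll the monitored processes and notify the masakari API
            # when one of them has died.
            ...


    def main():
        # With workers=1, oslo.service hands back a ServiceLauncher rather
        # than the forking ProcessLauncher, so each monitor gets its own
        # entry point and can be managed independently.
        launcher = service.launch(CONF, ProcessMonitor(), workers=1)
        launcher.wait()


    if __name__ == '__main__':
        main()

Each monitor would ship its own main() like this, which is what makes running only a subset of the monitors, or restarting one of them, straightforward for operators.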