09:00:42 <masahito> #startmeeting blazar
09:00:42 <openstack> Meeting started Tue Dec 19 09:00:42 2017 UTC and is due to finish in 60 minutes. The chair is masahito. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:44 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:46 <openstack> The meeting name has been set to 'blazar'
09:00:54 <masahito> #topic RollCall
09:01:24 <priteau> o/
09:01:32 <bertys__> o/
09:01:51 <masahito> priteau, bertys__: hello
09:01:57 <masahito> Today's agenda is:
09:02:09 <masahito> 1. resource monitoring
09:02:12 <masahito> 2. next meeting
09:02:14 <masahito> 3. AOB
09:02:20 <masahito> anything else?
09:02:39 <masahito> hiro-kobayashi is out of town today
09:04:01 <masahito> #topic resource monitoring
09:04:47 <masahito> This topic was raised by hiro-kobayashi.
09:05:49 <masahito> As I commented in https://review.openstack.org/#/c/524054/, resource monitoring re-calculates ALL allocations on a failed host.
09:06:23 <masahito> My comment and hiro's reply are at https://review.openstack.org/#/c/524054/10/blazar/plugins/oshosts/host_plugin.py@690
09:07:06 <masahito> priteau: do you have a preference or an idea for this, since you're an operator of Blazar?
09:07:23 <priteau> I haven't seen these comments yet, just a minute
09:07:30 <masahito> got it.
09:07:36 <priteau> I had a day off yesterday
09:07:43 <masahito> np
09:07:49 <bertys__> masahito: this is related to what we discussed last week during the code review. The first challenge is that Masakari, Vitrage and Congress behave differently once a host failure has been detected
09:08:07 <bertys__> masahito: https://review.openstack.org/#/c/526598/1/masakari/engine/drivers/taskflow/host_failure.py
09:08:43 <bertys__> With Masakari, for instance, the compute node is first disabled
09:09:18 <priteau> It's a hard problem because Nova doesn't give information about how long a node might be disabled for
09:09:28 <bertys__> Whereas with Vitrage and Congress, the compute node is marked down
09:09:35 <masahito> The summary is that resource monitoring tries to re-allocate ALL reservations which use the failed host.
09:10:14 <masahito> but the question is: should Blazar re-allocate a reservation that starts a year from now?
09:10:16 <priteau> As operators, we often have to do a quick maintenance session on a node which has errors, but it could last only a few minutes
09:10:49 <masahito> bertys__: thanks for sharing the info.
09:11:51 <priteau> masahito: I think we could combine "time since the node has been down or disabled" plus "time until the lease start" to come up with something sensible
09:12:31 <priteau> e.g., if a node has been disabled only for 30 minutes, there is no urgency in reallocating leases that are a month away
09:12:33 <GeraldK> I agree. We could define a threshold of e.g. 1 day
09:13:50 <GeraldK> we could also introduce a background service that, 1 day ahead of the start time, verifies that resources are still available and tries to re-allocate otherwise
09:13:56 <masahito> GeraldK: meaning 1 day for detecting failure? or 1-day advance re-allocation?
09:14:25 <GeraldK> masahito: 1-day advance re-allocation
09:14:34 <masahito> GeraldK: got it.
09:15:07 <priteau> Should be configurable
09:15:11 <GeraldK> the operator can set the parameter according to his requirements and preferences
09:15:32 <masahito> yes, of course.
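The approach discussed here (only re-allocate leases that start within a configurable horizon, ignore failures shorter than a confirmation delay, and re-check long-term leases shortly before they start) can be sketched roughly as follows. This is only an illustration of the idea as discussed in the meeting; the names (HOST_DOWN_CONFIRMATION_DELAY, REALLOCATION_HORIZON, pending_checks, reallocate, host_still_down) are placeholders and not the actual Blazar manager or monitoring APIs.

    import datetime

    # Hypothetical operator-configurable parameters (names are illustrative only).
    HOST_DOWN_CONFIRMATION_DELAY = datetime.timedelta(minutes=5)  # minimum confirmed down time
    REALLOCATION_HORIZON = datetime.timedelta(days=1)             # 1-day threshold

    # Stand-in for Blazar's event table; the real manager would persist events
    # and process them in its periodic event loop.
    pending_checks = []


    def host_still_down(host_id):
        # Placeholder: in practice this would come from the monitoring plugin
        # (Nova service state, or Masakari/Vitrage/Congress notifications).
        return True


    def reallocate(lease, failed_host):
        # Placeholder for the actual re-allocation of the lease onto another host.
        print("re-allocating lease %s away from host %s" % (lease["id"], failed_host))


    def on_host_down(leases, failed_host, down_since, now):
        """Handle a host-down notification for all leases on the failed host."""
        # Ignore very recent failures: the host may just be in a short
        # maintenance session and come back within minutes.
        if now - down_since < HOST_DOWN_CONFIRMATION_DELAY:
            return
        for lease in leases:
            if lease["host"] != failed_host:
                continue
            if lease["start_time"] - now <= REALLOCATION_HORIZON:
                # Starts within the horizon (or is already running): re-allocate now.
                reallocate(lease, failed_host)
            else:
                # Starts later: remind ourselves to re-check one horizon before the
                # start time instead of re-allocating a lease that is months away.
                pending_checks.append((lease["start_time"] - REALLOCATION_HORIZON, lease))


    def process_pending_checks(now):
        """Run from the periodic event loop: re-allocate only if the host is still down."""
        for check_time, lease in list(pending_checks):
            if check_time <= now:
                if host_still_down(lease["host"]):
                    reallocate(lease, lease["host"])
                pending_checks.remove((check_time, lease))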
09:16:08 <masahito> Looks like approach #2 is better for this
09:16:35 <bertys__> Does this mean that we would implement a new event "before_start_date"?
09:17:25 <masahito> bertys__: I don't think it's needed.
09:18:44 <masahito> bertys__: I thought the approach we discussed was that Blazar re-allocates reservations which start within a configured time.
09:19:16 <masahito> so Blazar only checks the start_time of each lease.
09:20:08 <bertys__> ok it seems I have misunderstood GeraldK's intention
09:20:58 <GeraldK> we may have some misunderstanding here
09:21:17 <masahito> priteau's idea is to add a decision time frame to detect whether the host has really failed or not.
09:21:29 <GeraldK> let me try to summarize my proposal:
09:22:28 <GeraldK> host down event and reservation start time is more than 1 day (configurable) ahead -> no re-allocation
09:23:23 <GeraldK> as we don't know how long the host will be down, a background service/before-start-date event will check 1 day ahead of the start time whether the node is still down. if yes, try to re-allocate
09:23:57 <GeraldK> if host down and start time less than 1 day ahead -> immediate re-allocation
09:24:10 <GeraldK> does that make sense?
09:25:40 <priteau> We already have the event processing pool running every 10 seconds. The manager could create an event to remind itself to check whether the lease is ok
09:25:47 <masahito> GeraldK: Blazar doesn't re-allocate reservations more than 1 day ahead until the second check of the host status, right?
09:26:55 <masahito> GeraldK: meaning leases 3 days ahead at the time of the host failure are re-allocated 2 days later.
09:27:09 <priteau> That's my understanding of GeraldK's proposal
09:27:24 <masahito> me too.
09:27:37 <GeraldK> masahito: no, so far it does not. but if we want to omit re-allocation of reservations that are in the long future (>1 day ahead), wouldn't we need such an option?
09:29:07 <GeraldK> priteau's proposal to have the manager create an event to remind itself (to check periodically, e.g. every 24 h, or to check 1 day ahead of start time) sounds good to me
09:29:45 <priteau> And my additional proposal is to add to GeraldK's by requiring a minimum time to confirm that a host is down
09:29:49 <GeraldK> masahito: sorry, misread your message. yes, that is true.
09:29:52 <priteau> e.g. 30 seconds, or 5 minutes (configurable)
09:29:57 <masahito> Users can reserve resources that start in 1 month or 1 year, so there could be lots of re-allocations.
09:30:52 <masahito> priteau: it's good to have. the monitoring system has polling, so we can use the periodic task.
09:30:55 <GeraldK> masahito: that is why I proposed to re-allocate only if the node is still unavailable 1 day ahead of the start time
09:31:47 <masahito> GeraldK: np, my bad grammar may have misled you.
09:33:57 <masahito> okay, looks like we have a good idea for the problem.
09:34:50 <masahito> any comments on the topic?
09:36:31 <masahito> #topic next meeting
09:37:11 <masahito> Next week is the last week of this year.
09:37:52 <GeraldK> I will be out of the office for the next two weeks
09:38:04 <masahito> So many people could be on holiday.
09:38:37 <masahito> And I'm also out of the office for two weeks.
09:39:45 <masahito> If there are fewer people, we can skip the next two meetings.
09:40:06 <priteau> That's fine for me
09:40:15 <masahito> okay.
09:40:32 <masahito> Then the next meeting is on 9th January.
09:40:56 <masahito> I'll announce it on openstack-dev.
09:41:24 <masahito> #topic AOB
09:41:51 <masahito> Does anyone have something to share/discuss?
09:43:07 <masahito> FYI: Q-2 was released last Friday. https://review.openstack.org/#/c/526616/
09:43:48 <masahito> Thank you to everyone in the team! This milestone was good progress for the team.
09:44:12 <GeraldK> Congratulations to the team for this milestone.
09:44:21 <priteau> Are we going to release a new client soon?
09:44:40 <priteau> We now have the gate jobs to push to PyPI
09:44:44 <masahito> priteau: thanks for bringing this up.
09:44:51 <masahito> I'll push the patch soon.
09:46:08 <priteau> Thanks
09:50:16 <masahito> anything else?
09:50:33 <masahito> If there's nothing else, we can finish early today.
09:51:36 <priteau> Thanks everyone
09:51:46 <priteau> Enjoy the holidays
09:52:39 <GeraldK> thx. happy holidays.
09:53:33 <masahito> Thanks all. have a good holiday!
09:53:40 <masahito> #endmeeting