09:00:42 <masahito> #startmeeting blazar
09:00:42 <openstack> Meeting started Tue Dec 19 09:00:42 2017 UTC and is due to finish in 60 minutes.  The chair is masahito. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:44 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:46 <openstack> The meeting name has been set to 'blazar'
09:00:54 <masahito> #topic RollCall
09:01:24 <priteau> o/
09:01:32 <bertys__> o/
09:01:51 <masahito> priteau, bertys__: hello
09:01:57 <masahito> Today's agenda is
09:02:09 <masahito> 1. resource monitoring
09:02:12 <masahito> 2. next meeting
09:02:14 <masahito> 3. AOB
09:02:20 <masahito> anything else?
09:02:39 <masahito> hiro-kobayashi is out of town today
09:04:01 <masahito> #topic resource monitoring
09:04:47 <masahito> This topic is raised by hiro-kobayashi.
09:05:49 <masahito> As I commented in https://review.openstack.org/#/c/524054/, resource-monitoring re-calculates ALL allocations on a failed host.
09:06:23 <masahito> My comment and hiro's reply are at https://review.openstack.org/#/c/524054/10/blazar/plugins/oshosts/host_plugin.py@690
09:07:06 <masahito> priteau: do you have a preference or an idea for this, since you're an operator of Blazar?
09:07:23 <priteau> I haven't seen these comments yet, just a minute
09:07:30 <masahito> got it.
09:07:36 <priteau> I had a day off yesterday
09:07:43 <masahito> np
09:07:49 <bertys__> masahito: this is related to what we discussed last week during the code review. The first challenge is that Masakari, Vitrage and Congress behave differently once a host failure has been detected
09:08:07 <bertys__> masahito: https://review.openstack.org/#/c/526598/1/masakari/engine/drivers/taskflow/host_failure.py
09:08:43 <bertys__> For Masakari for instance, the compute node is first disabled
09:09:18 <priteau> It's a hard problem because Nova doesn't give information about how long a node might be disabled for
09:09:28 <bertys__> Whereas for Vitrage and Congress, the compute node is marked down
09:09:35 <masahito> The summary is that resource-monitoring tries to re-allocate ALL reservations which use the failed host.
09:10:14 <masahito> but the question is: should Blazar re-allocate a reservation which starts a year from now?
09:10:16 <priteau> As operators, we often have to do a quick maintenance session on a node which has errors, but it could last only a few minutes
09:10:49 <masahito> bertys__: thanks for sharing the info.
09:11:51 <priteau> masahito: I think we could combine "time since the node has been down or disabled" plus "time until the lease start" to come up with something sensible
09:12:31 <priteau> e.g., if a node has been disabled only for 30 minutes, there is no urgency in reallocating leases that are a month away
09:12:33 <GeraldK> I agree. We could define a threshold of e.g. 1 day
09:13:50 <GeraldK> we could also introduce a background service that, 1 day ahead of the start time, verifies that resources are still available and tries to re-allocate otherwise
09:13:56 <masahito> GeraldK: meaning 1 day for detecting failure? or 1 day advance re-allocation?
09:14:25 <GeraldK> masahito: 1-day advance re-allocation
09:14:34 <masahito> GeraldK: got it.
09:15:07 <priteau> Should be configurable
09:15:11 <GeraldK> the operator can set the parameter according to his requirements and preferences
09:15:32 <masahito> yes, of course.
09:16:08 <masahito> Looks like approach #2 is better for this
09:16:35 <bertys__> Does this mean that we would implement a new event "before_start_date"?
09:17:25 <masahito> bertys__: I don't think it's needed.
09:18:44 <masahito> bertys__: I thought the approach we discussed is that Blazar re-allocates reservations which start within a configured time.
09:19:16 <masahito> so Blazar checks only the start_time of each lease.
09:20:08 <bertys__> ok it seems I have misunderstood GeraldK's intention
09:20:58 <GeraldK> we may have some misunderstanding here
09:21:17 <masahito> priteau's idea is adding a decision time frame to detect whether the host has really failed or not.
09:21:29 <GeraldK> let me try to summarize my proposal:
09:22:28 <GeraldK> host down event and reservation start time is more than 1 day (configurable) ahead -> no re-allocation
09:23:23 <GeraldK> as we don't know how long the host will be down, a background service/before-start-date event will check 1 day ahead of the start time whether the node is still down. if yes, try to re-allocate
09:23:57 <GeraldK> if host down and start time less than 1 day ahead -> immediate re-allocation
09:24:10 <GeraldK> does that make sense?
09:25:40 <priteau> We already have the event processing pool running every 10 seconds. The manager could create an event to remind itself of checking whether the lease is ok
09:25:47 <masahito> GeraldK: Blazar doesn't re-allocate 1 day ahead reservations in second check for host status, right?
09:26:55 <masahito> GeraldK: meaning leases that are 3 days ahead at the time of the host failure are re-allocated 2 days later.
09:27:09 <priteau> That's my understanding of GeraldK's proposal
09:27:24 <masahito> me too.
09:27:37 <GeraldK> masahito: no, so far it does not. but if we want to omit re-allocation of reservations that are in the long future (>1 day ahead), wouldn't we need such option ?
09:29:07 <GeraldK> priteau's proposal to have the manager create an event to remind itself (to check periodically, e.g. every 24 h, or to check 1 day ahead of start time) sounds good to me
09:29:45 <priteau> And my additional proposal is to add to GeraldK's by having a minimum time required to confirm that a host is down
09:29:49 <GeraldK> masahito: sorry, misread your message. yes. that is true.
09:29:52 <priteau> e.g. 30 seconds, or 5 minutes (configurable)
09:29:57 <masahito> Users can reserve resources that start in 1 month or a year. So it could have lots of re-allocations.
09:30:52 <masahito> priteau: it's good to have. the monitor system has polling, so we can use the periodic task.
09:30:55 <GeraldK> masahito: that is why I proposed to re-allocate only in the case the node is still unavailable 1 day ahead of the start time
09:31:47 <masahito> GeraldK: np, my wrong grammar could have misled you.
09:33:57 <masahito> okay, looks like we got a good idea for the problem.
09:34:50 <masahito> any comments for the topic?
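Editor's note: the rules agreed in this topic can be sketched as below. This is only an illustration of the discussion, not Blazar code: the function name `decide`, the `Action` enum and the two parameters `confirm_after` and `realloc_window` are hypothetical names, and the defaults (5 minutes, 1 day) are just the example values mentioned in the meeting. It combines priteau's minimum time to confirm that a host is down with GeraldK's 1-day re-allocation threshold, and returns the time at which the manager would remind itself to check again.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    WAIT = "wait"              # outage not confirmed yet
    REALLOCATE = "reallocate"  # lease starts soon, act now
    DEFER = "defer"            # lease is far away, re-check later


def decide(now, down_since, lease_start,
           confirm_after=timedelta(minutes=5),   # hypothetical option
           realloc_window=timedelta(days=1)):    # hypothetical option
    """Return (action, recheck_at) for one lease on a host that is down."""
    # priteau: require a minimum downtime before treating the host as failed.
    if now - down_since < confirm_after:
        return Action.WAIT, down_since + confirm_after
    # GeraldK: lease starts within the window -> immediate re-allocation.
    if lease_start - now <= realloc_window:
        return Action.REALLOCATE, None
    # Otherwise check again one window ahead of the lease start.
    return Action.DEFER, lease_start - realloc_window
```

With these defaults, a lease that starts 3 days after the failure is deferred and re-checked 2 days later, which matches masahito's reading of the proposal. A lease starting sooner than `confirm_after` would still wait for confirmation here; whether that is desirable was not discussed.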
09:36:31 <masahito> #topic next meeting
09:37:11 <masahito> Next week is the last week of this year.
09:37:52 <GeraldK> I will be out of office in the next two weeks
09:38:04 <masahito> So lots of regions could be on holiday for weeks or days.
09:38:37 <masahito> And I'm also out of office for those two weeks.
09:39:45 <masahito> If there are fewer people we can skip the next two meetings.
09:40:06 <priteau> That's fine for me
09:40:15 <masahito> okay.
09:40:32 <masahito> Then next meeting is 9th January.
09:40:56 <masahito> I'll announce it in openstack-dev.
09:41:24 <masahito> #topic AOB
09:41:51 <masahito> Does someone have something to share/discuss?
09:43:07 <masahito> FYI: Q-2 was released last Friday. https://review.openstack.org/#/c/526616/
09:43:48 <masahito> Thank you everyone in the team! This milestone was good progress for the team.
09:44:12 <GeraldK> Congratulations to the team for this milestone.
09:44:21 <priteau> Are we going to release a new client soon?
09:44:40 <priteau> We now have the gate jobs to push to PyPI
09:44:44 <masahito> priteau: thanks for the heads-up on this.
09:44:51 <masahito> I'll push the patch soon.
09:46:08 <priteau> Thanks
09:50:16 <masahito> anything else?
09:50:33 <masahito> If nothing we can finish early today.
09:51:36 <priteau> Thanks everyone
09:51:46 <priteau> Enjoy the holidays
09:52:39 <GeraldK> thx. happy holidays.
09:53:33 <masahito> Thanks all. have a good holiday!
09:53:40 <masahito> #endmeeting