#openstack-meeting log

07:07:55 <aspiers> #startmeeting ha
07:07:56 <openstack> Meeting started Mon Jun 13 07:07:55 2016 UTC and is due to finish in 60 minutes.  The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
07:07:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
07:07:59 <openstack> The meeting name has been set to 'ha'
07:08:06 <aspiers> OK, let's start anyway
07:08:10 <samP> hi
07:08:18 <aspiers> oh hi samP! :)
07:08:25 <samP> sorry for the late response
07:08:28 <aspiers> no problem :)
07:08:30 <aspiers> I was late too :)
07:08:57 <samP> just came back after 1week+ vacation
07:09:00 <aspiers> nice :)
07:09:18 <aspiers> I guess the user story is a good place to start today
07:09:23 <aspiers> #topic HA VMs user story
07:09:38 <aspiers> so, pretty much all my reviews got merged I think
07:09:57 <aspiers> I also sent a mail to openstack-dev@ (cc product-wg@) about how to handle the extra usage scenarios
07:10:04 <aspiers> but no responses yet :-/
07:10:17 <samP> Thanks, I read the mail
07:10:48 <aspiers> I was thinking that we should just go ahead and rebase the existing review, and submit 4 extra user stories
07:11:11 <aspiers> so that we have the 4 scenarios covered by 5 user stories: one for each, plus also in the existing HA VMs story, like we discussed
07:11:20 <aspiers> does that sound OK?
07:11:33 <haukebruno> +1
07:11:58 <aspiers> in fact, that could all be done within the same existing review
07:12:13 <samP> as long as they are mentioned in the main story, I am OK with extra 4 stories
07:12:23 <aspiers> samP: exactly
07:12:40 <aspiers> putting them in the same review would make sense, because we would need to cross-link between the main story and the other 4
07:12:48 <aspiers> and also we want to minimise overlap
07:12:59 <aspiers> so that the main story focuses on the HA-specific side
07:13:32 <aspiers> I had another idea on this
07:13:46 <aspiers> unfortunately only after sending the mail I realised I should have cc'd openstack-operators :-/
07:13:55 <aspiers> since these user stories are very strongly focused on operators
07:14:00 <aspiers> maybe we will get more feedback that way
07:14:08 <aspiers> should I forward my original mail to that list?
07:14:43 <samP> aspiers: yes I think that would be a good idea
07:14:55 <aspiers> OK
07:14:58 <haukebruno> also +1, I'm on that list too and I guess more feedback (in general) is always better :p
07:15:06 <aspiers> #action aspiers to forward original user story email to openstack-operators
07:15:25 <aspiers> I have some more small changes to submit to the main story
07:15:32 <aspiers> I will try to submit them this morning
07:16:09 <aspiers> #topic specs
07:16:22 <aspiers> I'm sorry I didn't make much more progress on the specs recently
07:16:30 <aspiers> I still have a draft of the first one I am doing though
07:16:40 <aspiers> again I hope to submit one soon
07:16:49 <samP> me neither, I will try to complete them this week
07:16:57 <aspiers> I think it's pretty important to get these done soon
07:17:09 <aspiers> as they are required for agreeing on how to proceed with code
07:17:22 <aspiers> this week we have our team workshop in Germany, so I have limited time :-/
07:17:29 <aspiers> maybe in the evenings, I don't know
07:17:31 <samP> aspiers: yes +1
07:17:39 <aspiers> ok
07:17:58 <aspiers> #topic Pacemaker vs. systemd
07:18:16 <aspiers> I'm not sure if you saw beekhof's latest blog?
07:18:22 <aspiers> #link http://blog.clusterlabs.org/blog/2016/next-openstack-ha-arch
07:18:34 <aspiers> he makes some good points, but I do not agree with everything
07:18:50 <aspiers> beekhof knows this already ;-) as I have discussed this topic with him in great detail in the past
07:19:14 <aspiers> personally I think it's really important to have application-level monitoring for active/active services
07:19:29 <aspiers> otherwise for example keystone could hang, and no one would notice
07:19:49 <aspiers> my proposal (which I have mentioned before) is to change the OCF RAs so that they wrap service(8)
07:20:21 <aspiers> this avoids the divergence which he mentioned in the blog, and also avoids unnecessary duplication of service config data and start/stop/restart logic
07:20:31 <aspiers> whilst adding decent monitoring
07:20:41 <haukebruno> I don't know about the general discussion, but yes: +1 for application-level monitoring. a running service doesn't mean a healthy service in so many cases
07:20:52 <aspiers> I hope to write a spec on this also sometime soon, in my fictional free time
07:21:12 <aspiers> haukebruno: exactly
07:21:21 <samP> +1 for application-level monitoring
07:21:49 <aspiers> I don't quite understand why he thinks avoiding divergence of HA from non-HA cases is more important than proper app-level monitoring and recovery
07:22:27 <aspiers> he mentions nagios and sensu for monitoring, but AFAIK they won't do proper automatic recovery
07:22:48 <aspiers> anyway, I'm sure this debate is just getting started ;-)
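The application-level monitoring discussed above can be illustrated with a minimal sketch of a monitor check that probes the service's HTTP API instead of only checking that the process exists. This is not the OCF RA that aspiers proposed, which would wrap service(8) and is normally written in shell. The function name `monitor`, the health URL, and the timeout are assumptions for illustration; only the OCF exit codes 0 (success) and 1 (generic error) are standard.

```python
import http.client
import urllib.error
import urllib.request

OCF_SUCCESS = 0      # standard OCF exit code: resource is healthy
OCF_ERR_GENERIC = 1  # standard OCF exit code: generic failure

# Bypass any proxy settings from the environment: the check targets a local service.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def monitor(url, timeout_s=5.0):
    """Hypothetical application-level monitor.

    Returns OCF_SUCCESS only if the service answers an HTTP request within
    timeout_s. A hung service (process alive but not answering) times out
    and is reported as failed, which a plain "is it running" check would miss.
    """
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            healthy = 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers connection refused, timeouts, HTTP error statuses, bad URLs.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS if healthy else OCF_ERR_GENERIC
```

A call such as `monitor("http://127.0.0.1:5000/v3")` would be one way to check a local keystone, assuming that URL answers with a 2xx status; the endpoint is an assumption, not something stated in the meeting.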
07:23:12 <aspiers> #topic AOB
07:23:19 <aspiers> anything else anyone wants to discuss?
07:23:34 <aspiers> oh, I forgot one important thing!
07:23:45 <aspiers> #topic nova service-disable of failing nova-compute
07:24:04 <aspiers> samP: are you on the users@clusterlabs.org list?
07:25:56 <aspiers> beekhof proposed that we call nova service-disable on *every* stop, not just the final one when Pacemaker won't attempt any more restarts of the resource on that node:
07:25:58 <aspiers> #link
07:26:02 <aspiers> oops
07:26:09 <aspiers> #link http://clusterlabs.org/pipermail/users/2016-June/003218.html
07:26:29 <aspiers> oh that's unfortunate
07:26:55 <samP_> sorry, cutoff from the net ;)
07:27:00 <aspiers> no problem :)
07:27:05 <aspiers> did you see my question?
07:27:17 <samP_> sorry, no
07:27:32 <aspiers> beekhof proposed that we call nova service-disable on *every* stop, not just the final one when Pacemaker won't attempt any more restarts of the resource on that node
07:27:43 <aspiers> http://clusterlabs.org/pipermail/users/2016-June/003218.html
07:27:58 <aspiers> and it looks like the proposed new pacemaker feature won't make it into 1.1.15 now
07:28:03 <aspiers> that's my guess anyway
07:28:18 <aspiers> I think I'm OK with calling it every time, if you are
07:28:35 <aspiers> this approach came from masakari, so I wanted to get your thoughts
07:28:58 <aspiers> the service-disable would need to be called with a timeout, and any failures ignored
07:29:02 <samP_> does it mean, call nova service-disable before service stop at host (every time)?
07:29:06 <aspiers> otherwise we could get fencing from a failed stop
07:29:35 <aspiers> samP_: yes, but also when the service fails
07:29:43 <aspiers> samP_: since then Pacemaker will still call stop
07:29:55 <aspiers> IIRC
07:30:32 <aspiers> for nova-compute I think we could set migration-threshold=1 anyway
07:30:45 <aspiers> I'm not sure there is much benefit to trying to restart 2 or more times
07:30:58 <aspiers> this would avoid service flapping
07:31:00 <aspiers> what do you think?
07:32:06 <samP_> It looks OK. I think this would do no harm.
07:32:07 <haukebruno> some other flapping (network or rabbitmq maybe) could cause an "unneeded" migration, if migration-threshold=1
07:32:42 <aspiers> haukebruno: there is no migration in this case, because nova-compute is active/active
07:33:01 <aspiers> haukebruno: it just means that nova-compute on that host is dead and disabled
07:33:25 <haukebruno> can't remember the exact case now, but in the past we sometimes had a stopped nova-compute because of something (everything worked as expected, we just restarted nova-compute)
07:33:27 <aspiers> samP_: OK, we can try it
07:33:43 <haukebruno> ah aspiers sorry, with a/a I agree
07:33:51 <aspiers> haukebruno: yes, in that case we definitely want to try to restart once
07:33:58 <aspiers> but I think more than once is maybe unnecessary
07:34:08 <haukebruno> yes
07:34:17 <aspiers> if it takes >= 2 restart attempts to work then probably something is badly wrong anyway
07:34:28 <aspiers> in which case we maybe can't rely on the service even if it starts correctly
07:34:34 <aspiers> e.g. it might die again soon
07:35:08 <haukebruno> of course, if needed, 1 retry should be enough
07:35:13 <aspiers> samP_: is that a change you could easily try in masakari?
07:36:23 <samP_> aspiers: I can try
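The requirement from this topic, that service-disable is called with a timeout and that any failure is ignored so a failed stop cannot trigger fencing, can be sketched as follows. This is a minimal illustration, not masakari's or any resource agent's actual code. The helper names `run_best_effort`, `disable_cmd` and `stop_action` and the 10-second default are hypothetical; the only real interface assumed is the `nova service-disable <hostname> <binary>` command named in the discussion. Stopping the service itself is omitted.

```python
import subprocess
import sys

OCF_SUCCESS = 0  # standard OCF exit code


def run_best_effort(cmd, timeout_s):
    """Run cmd; return True on exit status 0, False on failure or timeout.

    Timeouts and launch errors are caught rather than propagated.
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        print("ignoring failure of %s: %s" % (cmd[0], exc), file=sys.stderr)
        return False
    return result.returncode == 0


def disable_cmd(host):
    """Build the CLI call that marks nova-compute on host as disabled."""
    return ["nova", "service-disable", host, "nova-compute"]


def stop_action(host, timeout_s=10.0, runner=run_best_effort):
    """Hypothetical stop hook: disable the compute service, best effort.

    The runner's result is deliberately ignored, so a slow or unreachable
    nova API cannot turn into a failed stop (and therefore fencing).
    """
    runner(disable_cmd(host), timeout_s)
    return OCF_SUCCESS
```

The `runner` parameter exists only so the behaviour can be exercised without a real nova API.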
07:38:54 <aspiers> so it seems I got the meeting time wrong ... AGAIN :-(
07:39:02 <aspiers> I think we are one hour early
07:39:20 <aspiers> sorry, I was confused since I am currently in Germany
07:39:53 <samP_> I think there's no harm to others, because there is no other meeting on Monday at 0700 UTC
07:40:01 <aspiers> right
07:40:04 <aspiers> luckily :)
07:40:17 <haukebruno> ah lol, from my calendar view the time was correct
07:40:24 <haukebruno> so WHAT exactly is the right time?
07:40:26 <aspiers> but maybe bad for ddeja etc. who are expecting it in the next hour
07:40:36 <samP_> it's 0800 UTC
07:40:37 <aspiers> haukebruno: we changed it to 8am UTC
07:41:16 <aspiers> https://review.openstack.org/#/c/307002/
07:41:19 <haukebruno> ah fucking summer-/winter time... every time the same confusion. Ok, I'll update my calendar too
07:41:24 <aspiers> hehe
07:41:30 <aspiers> #topic time of meeting
07:41:39 <aspiers> maybe we need to change it anyway
07:41:48 <aspiers> beekhof requested another time, since this time is impossible for him
07:42:44 <aspiers> I will try to work out a better time for everyone
07:42:44 <samP_> I think we should have this discussion after 0800 UTC so that ddeja and others can comment
07:42:56 <aspiers> agreed
07:43:02 <aspiers> any earlier is difficult for me too
07:43:41 <aspiers> #action aspiers to figure out a meeting time which works for everyone
07:43:59 <aspiers> haukebruno, samP_: please could you let me know which times of day/week work for you?
07:44:31 <samP_> our previous time was 0900 UTC, which is 1800 in Japan (JST); that was OK for me.
07:44:46 <haukebruno> would be ok for me too
07:45:14 <haukebruno> in general I am ok with anything from 0400 to 2000 UTC
07:45:42 <aspiers> OK thanks
07:45:50 <aspiers> I think 0900 UTC is probably also difficult for beekhof
07:46:13 <aspiers> maybe he can do later, but I will try to find out
07:47:28 <samP_> So, 0100UTC to 0900UTC would be totally OK for me.
07:48:54 <aspiers> samP_: OK great, thanks. I've just emailed beekhof
07:49:08 <aspiers> I'll announce if there are any changes
07:49:22 <aspiers> actually, I'd just add you all as reviewers on the gerrit review :)
07:49:31 <aspiers> then you can +1 / -1 the proposal
07:49:49 <aspiers> #topic AOB
07:50:22 <aspiers> alright, I suggest we close the meeting now - we could always restart in 10 minutes for a short discussion if e.g. ddeja appears :)
07:50:35 <aspiers> but I'm open to other suggestions too :)
07:50:49 <samP_> sure, I'll be there
07:50:55 <haukebruno> yep. me too \o/
07:51:40 <aspiers> cool :) thanks guys!
07:51:49 <aspiers> #endmeeting