07:07:55 <aspiers> #startmeeting ha
07:07:56 <openstack> Meeting started Mon Jun 13 07:07:55 2016 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
07:07:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
07:07:59 <openstack> The meeting name has been set to 'ha'
07:08:06 <aspiers> OK, let's start anyway
07:08:10 <samP> hi
07:08:18 <aspiers> oh hi samP! :)
07:08:25 <samP> sorry for the late response
07:08:28 <aspiers> no problem :)
07:08:30 <aspiers> I was late too :)
07:08:57 <samP> just came back after a 1-week+ vacation
07:09:00 <aspiers> nice :)
07:09:18 <aspiers> I guess the user story is a good place to start today
07:09:23 <aspiers> #topic HA VMs user story
07:09:38 <aspiers> so, pretty much all my reviews got merged I think
07:09:57 <aspiers> I also sent a mail to openstack-dev@ (cc product-wg@) about how to handle the extra usage scenarios
07:10:04 <aspiers> but no responses yet :-/
07:10:17 <samP> Thanks, I read the mail
07:10:48 <aspiers> I was thinking that we should just go ahead and rebase the existing review, and submit 4 extra user stories
07:11:11 <aspiers> so that we have the 4 scenarios covered by 5 user stories: one for each, plus also in the existing HA VMs story, like we discussed
07:11:20 <aspiers> does that sound OK?
07:11:33 <haukebruno> +1
07:11:58 <aspiers> in fact, that could all be done within the same existing review
07:12:13 <samP> as long as they are mentioned in the main story, I am OK with the extra 4 stories
07:12:23 <aspiers> samP: exactly
07:12:40 <aspiers> putting them in the same review would make sense, because we would need to cross-link between the main story and the other 4
07:12:48 <aspiers> and also we want to minimise overlap
07:12:59 <aspiers> so that the main story focuses on the HA-specific side
07:13:32 <aspiers> I had another idea on this
07:13:46 <aspiers> unfortunately only after sending the mail did I realise I should have cc'd openstack-operators :-/
07:13:55 <aspiers> since these user stories are very strongly focused on operators
07:14:00 <aspiers> maybe we will get more feedback that way
07:14:08 <aspiers> should I forward my original mail to that list?
07:14:43 <samP> aspiers: yes, I think that would be a good idea
07:14:55 <aspiers> OK
07:14:58 <haukebruno> also +1, I'm on that list too and I guess more feedback (in general) is always better :p
07:15:06 <aspiers> #action aspiers to forward original user story email to openstack-operators
07:15:25 <aspiers> I have some more small changes to submit to the main story
07:15:32 <aspiers> I will try to submit them this morning
07:16:09 <aspiers> #topic specs
07:16:22 <aspiers> I'm sorry I didn't make much more progress on the specs recently
07:16:30 <aspiers> I still have a draft of the first one I am doing though
07:16:40 <aspiers> again I hope to submit one soon
07:16:49 <samP> me neither, I will try to complete them this week
07:16:57 <aspiers> I think it's pretty important to get these done soon
07:17:09 <aspiers> as they are required for agreeing on how to proceed with code
07:17:22 <aspiers> this week we have our team workshop in Germany, so I have limited time :-/
07:17:29 <aspiers> maybe in the evenings, I don't know
07:17:31 <samP> aspiers: yes +1
07:17:39 <aspiers> ok
07:17:58 <aspiers> #topic Pacemaker vs. systemd
07:18:16 <aspiers> I'm not sure if you saw beekhof's latest blog?
07:18:22 <aspiers> #link http://blog.clusterlabs.org/blog/2016/next-openstack-ha-arch
07:18:34 <aspiers> he makes some good points, but I do not agree with everything
07:18:50 <aspiers> beekhof knows this already ;-) as I have discussed this topic with him in great detail in the past
07:19:14 <aspiers> personally I think it's really important to have application-level monitoring for active/active services
07:19:29 <aspiers> otherwise for example keystone could hang, and no one would notice
07:19:49 <aspiers> my proposal (which I have mentioned before) is to change the OCF RAs so that they wrap service(8)
07:20:21 <aspiers> this avoids the divergence which he mentioned in the blog, and also avoids unnecessary duplication of service config data and start/stop/restart logic
07:20:31 <aspiers> whilst adding decent monitoring
07:20:41 <haukebruno> I don't know about the general discussion, but yes: +1 for application-level monitoring. a running service doesn't mean a healthy service in so many cases
07:20:52 <aspiers> I hope to write a spec on this also sometime soon, in my fictional free time
07:21:12 <aspiers> haukebruno: exactly
07:21:21 <samP> +1 for application-level monitoring
07:21:49 <aspiers> I don't quite understand why he thinks avoiding divergence of HA from non-HA cases is more important than proper app-level monitoring and recovery
07:22:27 <aspiers> he mentions nagios and sensu for monitoring, but AFAIK they won't do proper automatic recovery
07:22:48 <aspiers> anyway, I'm sure this debate is just getting started ;-)
07:23:12 <aspiers> #topic AOB
07:23:19 <aspiers> anything else anyone wants to discuss?
07:23:34 <aspiers> oh, I forgot one important thing!
07:23:45 <aspiers> #topic nova service-disable of failing nova-compute
07:24:04 <aspiers> samP: are you on the users@clusterlabs.org list?
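The Pacemaker vs. systemd discussion above proposes OCF resource agents that delegate start/stop to service(8) while adding application-level monitoring. The Python sketch below illustrates that split under stated assumptions: real OCF agents are normally shell scripts, and the function names, the keystone URL and the choice of an HTTP probe are hypothetical illustrations, not the spec being planned. The OCF exit codes (0, 1, 7) and the LSB status code 3 ("not running") are the standard values.

```python
import subprocess
import urllib.request

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# LSB "status" exit code meaning "program is not running".
LSB_STATUS_NOT_RUNNING = 3


def service_cmd(name, action, run=subprocess.run):
    """Hypothetical helper: delegate start/stop/status to service(8), so the
    HA and non-HA cases share the same init logic and service config."""
    return run(["service", name, action]).returncode


def http_healthy(url, timeout=5.0):
    """Hypothetical application-level probe: True only on a 2xx/3xx answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # URLError, HTTPError and socket timeouts all subclass OSError
        return False


def monitor(name, url, status=None, probe=http_healthy):
    """OCF-style monitor: process check via service(8), then a health probe."""
    rc = status(name) if status is not None else service_cmd(name, "status")
    if rc == LSB_STATUS_NOT_RUNNING:
        return OCF_NOT_RUNNING
    if rc != 0:
        return OCF_ERR_GENERIC
    # The process is up; only report success if the application responds.
    return OCF_SUCCESS if probe(url) else OCF_ERR_GENERIC
```

For example, `monitor("keystone", "http://127.0.0.1:5000/v3")` (hypothetical endpoint) would return 1 (`OCF_ERR_GENERIC`) if the process is running but the API hangs past the 5-second probe timeout, which is the hung-keystone case that a plain process check would miss.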
07:25:56 <aspiers> beekhof proposed that we call nova service-disable on *every* stop, not just the final one when Pacemaker won't attempt any more restarts of the resource on that node:
07:25:58 <aspiers> #link
07:26:02 <aspiers> oops
07:26:09 <aspiers> #link http://clusterlabs.org/pipermail/users/2016-June/003218.html
07:26:29 <aspiers> oh that's unfortunate
07:26:55 <samP_> sorry, cut off from the net ;)
07:27:00 <aspiers> no problem :)
07:27:05 <aspiers> did you see my question?
07:27:17 <samP_> sorry, no
07:27:32 <aspiers> beekhof proposed that we call nova service-disable on *every* stop, not just the final one when Pacemaker won't attempt any more restarts of the resource on that node
07:27:43 <aspiers> http://clusterlabs.org/pipermail/users/2016-June/003218.html
07:27:58 <aspiers> and it looks like the proposed new pacemaker feature won't make it into 1.1.15 now
07:28:03 <aspiers> that's my guess anyway
07:28:18 <aspiers> I think I'm OK with calling it every time, if you are
07:28:35 <aspiers> this approach came from masakari, so I wanted to get your thoughts
07:28:58 <aspiers> the service-disable would need to be called with a timeout, and any failures ignored
07:29:02 <samP_> does it mean calling nova service-disable before the service stops on the host (every time)?
07:29:06 <aspiers> otherwise we could get fencing from a failed stop
07:29:35 <aspiers> samP_: yes, but also when the service fails
07:29:43 <aspiers> samP_: since then Pacemaker will still call stop
07:29:55 <aspiers> IIRC
07:30:32 <aspiers> for nova-compute I think we could set migration-threshold=1 anyway
07:30:45 <aspiers> I'm not sure there is much benefit to trying to restart 2 or more times
07:30:58 <aspiers> this would avoid service flapping
07:31:00 <aspiers> what do you think?
07:32:06 <samP_> It looks OK. I think this would do no harm.
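The stop behaviour discussed above (disable nova-compute in Nova on every stop, bounded by a timeout, with failures ignored so that a failed stop cannot lead to fencing) can be sketched in Python as below, together with a simplified model of `migration-threshold`. Assumptions: the `nova` CLI is installed with credentials in the environment, the helper names are hypothetical, and re-enabling the service after a successful restart is not shown. Under Pacemaker's documented rule a node becomes ineligible once the resource's fail count reaches `migration-threshold`, so in this model `migration-threshold=1` allows no local restart at all, whereas allowing exactly one restart corresponds to `migration-threshold=2`.

```python
import subprocess

# Standard OCF exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def disable_compute(host, timeout=10, run=subprocess.run):
    """Hypothetical helper: best-effort `nova service-disable <host> nova-compute`.

    Bounded by a timeout and never raises, so a slow or unreachable Nova API
    cannot turn the stop into a failed stop."""
    try:
        result = run(["nova", "service-disable", host, "nova-compute"], timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def stop(host, stop_service, run=subprocess.run):
    """Hypothetical stop action: disable in Nova first, then stop locally."""
    disable_compute(host, run=run)  # outcome deliberately ignored
    return OCF_SUCCESS if stop_service() == 0 else OCF_ERR_GENERIC


def on_failures(host, failures, migration_threshold, run=subprocess.run):
    """Simplified model: every failure triggers a stop (and thus a disable);
    once the fail count reaches migration-threshold the node is ineligible
    and nova-compute stays stopped there instead of being restarted."""
    actions = []
    for fail_count in range(1, failures + 1):
        stop(host, stop_service=lambda: 0, run=run)
        if fail_count >= migration_threshold:
            actions.append("ban")
            break
        actions.append("restart")
    return actions
```

Because `disable_compute` swallows timeouts and missing-binary errors, the result of `stop` depends only on whether the local service actually stopped, which keeps a Nova API outage from escalating into node fencing.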
07:32:07 <haukebruno> some other flapping (network or rabbitmq maybe) could cause an "unneeded" migration, if migration-threshold=1
07:32:42 <aspiers> haukebruno: there is no migration in this case, because nova-compute is active/active
07:33:01 <aspiers> haukebruno: it just means that nova-compute on that host is dead and disabled
07:33:25 <haukebruno> can't remember the exact case now, but in the past we sometimes had a stopped nova-compute for some reason (everything worked as expected, we just restarted nova-compute)
07:33:27 <aspiers> samP_: OK, we can try it
07:33:43 <haukebruno> ah aspiers sorry, with a/a I agree
07:33:51 <aspiers> haukebruno: yes, in that case we definitely want to try to restart once
07:33:58 <aspiers> but I think more than once is maybe unnecessary
07:34:08 <haukebruno> yes
07:34:17 <aspiers> if it takes >= 2 restart attempts to work then probably something is badly wrong anyway
07:34:28 <aspiers> in which case we maybe can't rely on the service even if it starts correctly
07:34:34 <aspiers> e.g. it might die again soon
07:35:08 <haukebruno> of course, if needed, 1 retry should be enough
07:35:13 <aspiers> samP_: is that a change you could easily try in masakari?
07:36:23 <samP_> aspiers: I can try
07:38:54 <aspiers> so it seems I got the meeting time wrong ... AGAIN :-(
07:39:02 <aspiers> I think we are one hour early
07:39:20 <aspiers> sorry, I was confused since I am currently in Germany
07:39:53 <samP_> I think it does no harm to others, because there's no other meeting on Monday at 0700 UTC
07:40:01 <aspiers> right
07:40:04 <aspiers> luckily :)
07:40:17 <haukebruno> ah lol, from my calendar view the time was correct
07:40:24 <haukebruno> so WHAT exactly is the right time?
07:40:26 <aspiers> but maybe bad for ddeja etc. who are expecting it in the next hour
07:40:36 <samP_> it's 0800 UTC
07:40:37 <aspiers> haukebruno: we changed it to 8am UTC
07:41:16 <aspiers> https://review.openstack.org/#/c/307002/
07:41:19 <haukebruno> ah fucking summer/winter time... every time the same confusion. OK, I'll update my calendar too
07:41:24 <aspiers> hehe
07:41:30 <aspiers> #topic time of meeting
07:41:39 <aspiers> maybe we need to change it anyway
07:41:48 <aspiers> beekhof requested another time, since this time is impossible for him
07:42:44 <aspiers> I will try to work out a better time for everyone
07:42:48 <samP_> I think we should have this discussion after 0800 UTC, so ddeja and others can comment
07:42:56 <aspiers> agreed
07:43:02 <aspiers> any earlier is difficult for me too
07:43:41 <aspiers> #action aspiers to figure out a meeting time which works for everyone
07:43:59 <aspiers> haukebruno, samP_: please could you let me know which times of day/week work for you?
07:44:31 <samP_> our previous time was 0900 UTC, which is 1800 in Japan (JST); that was OK for me.
07:44:46 <haukebruno> would be ok for me too
07:45:14 <haukebruno> in general I am ok with anything from 0400 to 2000 UTC
07:45:42 <aspiers> OK thanks
07:45:50 <aspiers> I think 0900 UTC is probably also difficult for beekhof
07:46:13 <aspiers> maybe he can do later, but I will try to find out
07:47:28 <samP_> So, 0100 UTC to 0900 UTC would be totally OK for me.
07:48:54 <aspiers> samP_: OK great, thanks. I've just emailed beekhof
07:49:08 <aspiers> I'll announce if there are any changes
07:49:22 <aspiers> actually, I'll just add you all as reviewers on the gerrit review :)
07:49:31 <aspiers> then you can +1 / -1 the proposal
07:49:49 <aspiers> #topic AOB
07:50:22 <aspiers> alright, I suggest we close the meeting now - we could always restart in 10 minutes for a short discussion if e.g. ddeja appears :)
07:50:35 <aspiers> but I'm open to other suggestions too :)
07:50:49 <samP_> sure, I'll be there
07:50:55 <haukebruno> yep. me too \o/
07:51:40 <aspiers> cool :) thanks guys!
07:51:49 <aspiers> #endmeeting
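The meeting-time discussion above comes down to simple time-zone arithmetic. The Python sketch below checks it, assuming fixed offsets that hold for June 2016 (JST is UTC+9 year-round; Germany is on summer time, CEST, UTC+2). The helper names are hypothetical, and the availability windows are the ones samP_ and haukebruno stated; beekhof's constraints are not known from the log.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for June 2016 (assumption: Germany on summer time, CEST).
JST = timezone(timedelta(hours=9), "JST")
CEST = timezone(timedelta(hours=2), "CEST")


def local_times(utc_hour, day=(2016, 6, 13)):
    """Hypothetical helper: render a UTC meeting slot in the zones from the log."""
    t = datetime(*day, utc_hour, tzinfo=timezone.utc)
    return {
        "UTC": t.strftime("%H:%M"),
        "JST": t.astimezone(JST).strftime("%H:%M"),
        "CEST": t.astimezone(CEST).strftime("%H:%M"),
    }


def common_window(*windows):
    """Intersect (start, end) UTC-hour windows; None if they do not overlap."""
    start = max(w[0] for w in windows)
    end = min(w[1] for w in windows)
    return (start, end) if start < end else None
```

`local_times(9)` gives 18:00 JST and 11:00 CEST, matching samP_'s figure, and `local_times(8)` shows that the agreed 0800 UTC slot is 10:00 in Germany rather than the 09:00 CEST at which this meeting actually started. `common_window((1, 9), (4, 20))` returns `(4, 9)`, i.e. 0400 to 0900 UTC suits both samP_ and haukebruno.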