07:07:55 #startmeeting ha
07:07:56 Meeting started Mon Jun 13 07:07:55 2016 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
07:07:57 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
07:07:59 The meeting name has been set to 'ha'
07:08:06 OK, let's start anyway
07:08:10 hi
07:08:18 oh hi samP! :)
07:08:25 sorry, late response
07:08:28 no problem :)
07:08:30 I was late too :)
07:08:57 just came back after 1week+ vacation
07:09:00 nice :)
07:09:18 I guess the user story is a good place to start today
07:09:23 #topic HA VMs user story
07:09:38 so, pretty much all my reviews got merged I think
07:09:57 I also sent a mail to openstack-dev@ (cc product-wg@) about how to handle the extra usage scenarios
07:10:04 but no responses yet :-/
07:10:17 Thanks, I read the mail
07:10:48 I was thinking that we should just go ahead and rebase the existing review, and submit 4 extra user stories
07:11:11 so that we have the 4 scenarios covered by 5 user stories: one for each, plus also in the existing HA VMs story, like we discussed
07:11:20 does that sound OK?
07:11:33 +1
07:11:58 in fact, that could all be done within the same existing review
07:12:13 as long as they are mentioned in the main story, I am OK with the extra 4 stories
07:12:23 samP: exactly
07:12:40 putting them in the same review would make sense, because we would need to cross-link between the main story and the other 4
07:12:48 and also we want to minimise overlap
07:12:59 so that the main story focuses on the HA-specific side
07:13:32 I had another idea on this
07:13:46 unfortunately I only realised after sending the mail that I should have cc'd openstack-operators :-/
07:13:55 since these user stories are very strongly focused on operators
07:14:00 maybe we will get more feedback that way
07:14:08 should I forward my original mail to that list?
07:14:43 aspiers: yes, I think that would be a good idea
07:14:55 OK
07:14:58 also +1, I'm on that list too and I guess more feedback (in general) is always better :p
07:15:06 #action aspiers to forward original user story email to openstack-operators
07:15:25 I have some more small changes to submit to the main story
07:15:32 I will try to submit them this morning
07:16:09 #topic specs
07:16:22 I'm sorry I didn't make much more progress on the specs recently
07:16:30 I still have a draft of the first one I am doing though
07:16:40 again I hope to submit one soon
07:16:49 me neither, I will try to complete them this week
07:16:57 I think it's pretty important to get these done soon
07:17:09 as they are required for agreeing on how to proceed with code
07:17:22 this week we have our team workshop in Germany, so I have limited time :-/
07:17:29 maybe in the evenings, I don't know
07:17:31 aspiers: yes +1
07:17:39 ok
07:17:58 #topic Pacemaker vs. systemd
07:18:16 I'm not sure if you saw beekhof's latest blog?
07:18:22 #link http://blog.clusterlabs.org/blog/2016/next-openstack-ha-arch
07:18:34 he makes some good points, but I do not agree with everything
07:18:50 beekhof knows this already ;-) as I have discussed this topic with him in great detail in the past
07:19:14 personally I think it's really important to have application-level monitoring for active/active services
07:19:29 otherwise for example keystone could hang, and no one would notice
07:19:49 my proposal (which I have mentioned before) is to change the OCF RAs so that they wrap service(8)
07:20:21 this avoids the divergence which he mentioned in the blog, and also avoids unnecessary duplication of service config data and start/stop/restart logic
07:20:31 whilst adding decent monitoring
07:20:41 I don't know about the general discussion, but yes: +1 for application-level monitoring. a running service doesn't mean a healthy service in so many cases
07:20:52 I hope to write a spec on this also sometime soon, in my fictional free time
07:21:12 haukebruno: exactly
07:21:21 +1 for application-level monitoring
07:21:49 I don't quite understand why he thinks avoiding divergence of HA from non-HA cases is more important than proper app-level monitoring and recovery
07:22:27 he mentions nagios and sensu for monitoring, but AFAIK they won't do proper automatic recovery
07:22:48 anyway, I'm sure this debate is just getting started ;-)
07:23:12 #topic AOB
07:23:19 anything else anyone wants to discuss?
07:23:34 oh, I forgot one important thing!
07:23:45 #topic nova service-disable of failing nova-compute
07:24:04 samP: are you on the users@clusterlabs.org list?
07:25:56 beekhof proposed that we call nova service-disable *every* stop, not just the final one when Pacemaker won't attempt any more restarts of the resource on that node:
07:25:58 #link
07:26:02 oops
07:26:09 #link http://clusterlabs.org/pipermail/users/2016-June/003218.html
07:26:29 oh that's unfortunate
07:26:55 sorry, cut off from the net ;)
07:27:00 no problem :)
07:27:05 did you see my question?
07:27:17 sorry, no
07:27:32 beekhof proposed that we call nova service-disable *every* stop, not just the final one when Pacemaker won't attempt any more restarts of the resource on that node
07:27:43 http://clusterlabs.org/pipermail/users/2016-June/003218.html
07:27:58 and it looks like the proposed new pacemaker feature won't make it into 1.1.15 now
07:28:03 that's my guess anyway
07:28:18 I think I'm OK with calling it every time, if you are
07:28:35 this approach came from masakari, so I wanted to get your thoughts
07:28:58 the service-disable would need to be called with a timeout, and any failures ignored
07:29:02 does it mean, call nova service-disable before the service stop at the host (every time)?
07:29:06 otherwise we could get fencing from a failed stop
07:29:35 samP_: yes, but also when the service fails
07:29:43 samP_: since then Pacemaker will still call stop
07:29:55 IIRC
07:30:32 for nova-compute I think we could set migration-threshold=1 anyway
07:30:45 I'm not sure there is much benefit to trying to restart 2 or more times
07:30:58 this would avoid service flapping
07:31:00 what do you think?
07:32:06 It looks OK. I think this would do no harm.
07:32:07 some other flapping (network or rabbitmq maybe) could result in "unneeded" migration, if migration-threshold=1
07:32:42 haukebruno: there is no migration in this case, because nova-compute is active/active
07:33:01 haukebruno: it just means that nova-compute on that host is dead and disabled
07:33:25 can't remember the exact case now, but in the past we sometimes had a stopped nova-compute because of something (everything worked as expected, we just restarted nova-compute)
07:33:27 samP_: OK, we can try it
07:33:43 ah aspiers sorry, with a/a I agree
07:33:51 haukebruno: yes, in that case we definitely want to try to restart once
07:33:58 but I think more than once is maybe unnecessary
07:34:08 yes
07:34:17 if it takes >= 2 restart attempts to work then probably something is badly wrong anyway
07:34:28 in which case we maybe can't rely on the service even if it starts correctly
07:34:34 e.g. it might die again soon
07:35:08 of course, if needed, 1 retry should be enough
07:35:13 samP_: is that a change you could easily try in masakari?
07:36:23 aspiers: I can try
07:38:54 so it seems I got the meeting time wrong ... AGAIN :-(
07:39:02 I think we are one hour early
07:39:20 sorry, I was confused since I am currently in Germany
07:39:53 Think it's no harm to others, 'cause there's no other meeting on Monday at 0700 UTC
07:40:01 right
07:40:04 luckily :)
07:40:17 ah lol, from my calendar view the time was correct
07:40:24 so WHAT exactly is the right time?
07:40:26 but maybe bad for ddeja etc. who are expecting it in the next hour
07:40:36 it's 0800 UTC
07:40:37 haukebruno: we changed it to 8am UTC
07:41:16 https://review.openstack.org/#/c/307002/
07:41:19 ah fucking summer-/winter time... every time the same confusion. OK, I'll update my calendar too
07:41:24 hehe
07:41:30 #topic time of meeting
07:41:39 maybe we need to change it anyway
07:41:48 beekhof requested another time, since this time is impossible for him
07:42:44 I will try to work out a better time for everyone
07:42:48 I think we should have this discussion after 0800 UTC, so ddeja and others can comment
07:42:56 agreed
07:43:02 any earlier is difficult for me too
07:43:41 #action aspiers to figure out a meeting time which works for everyone
07:43:59 haukebruno, samP_: please could you let me know which times of day/week work for you?
07:44:31 our previous time was 0900 UTC, which is 1800 in Japan (JST); that was OK for me.
07:44:46 would be ok for me too
07:45:14 in general I am ok with anything from 0400 to 2000 UTC
07:45:42 OK thanks
07:45:50 I think 0900 UTC is probably also difficult for beekhof
07:46:13 maybe he can do later, but I will try to find out
07:47:28 So, 0100 UTC to 0900 UTC would be totally OK for me.
07:48:54 samP_: OK great, thanks. I've just emailed beekhof
07:49:08 I'll announce if there are any changes
07:49:22 actually, I'd just add you all as reviewers on the gerrit review :)
07:49:31 then you can +1 / -1 the proposal
07:49:49 #topic AOB
07:50:22 alright, I suggest we close the meeting now - we could always restart in 10 minutes for a short discussion if e.g. ddeja appears :)
07:50:35 but I'm open to other suggestions too :)
07:50:49 sure, I'll be there
07:50:55 yep. me too \o/
07:51:40 cool :) thanks guys!
07:51:49 #endmeeting
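
[Editor's note: a minimal, hypothetical sketch of the "wrap service(8)" idea aspiers describes at 07:19:49 above: an OCF resource agent that delegates start/stop to the distro's own service definition but implements monitor as an application-level health check. The service name, port, and URL (keystone, the example from 07:19:29) are illustrative assumptions; this is not a real or agreed resource agent, and the required OCF metadata action is omitted.]

    #!/bin/sh
    # Hypothetical OCF RA wrapping service(8); keystone is only an example.
    : ${OCF_ROOT=/usr/lib/ocf}
    . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

    SERVICE=openstack-keystone   # assumed distro service name

    case "$__OCF_ACTION" in
        start|stop)
            # Delegate lifecycle to the existing service scripts, so HA and
            # non-HA deployments share config data and start/stop logic.
            service "$SERVICE" "$__OCF_ACTION" || exit $OCF_ERR_GENERIC
            exit $OCF_SUCCESS
            ;;
        monitor)
            # Application-level check: a running process is not necessarily
            # a healthy service, so probe the API itself (assumed port 5000).
            curl -sf -m 10 http://localhost:5000/ >/dev/null \
                && exit $OCF_SUCCESS
            exit $OCF_NOT_RUNNING
            ;;
        meta-data)
            exit $OCF_SUCCESS   # a real RA must print its OCF metadata XML here
            ;;
        *)
            exit $OCF_ERR_UNIMPLEMENTED
            ;;
    esac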
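[Editor's note: similarly, a sketch of beekhof's "service-disable on every stop" proposal (07:25:56) with the constraints aspiers mentions at 07:28:58 and 07:29:06: the call is bounded by a timeout and its failures are ignored, so a hung or unreachable nova API can never turn a stop into a stop failure, which would trigger fencing. The helper name, timeout value, and credential handling are assumptions, not the actual NovaCompute agent.]

    # Hypothetical fragment of the nova-compute RA's stop action. Assumes
    # nova CLI credentials (OS_* environment variables) are available.
    nova_compute_stop() {
        stop_nova_compute_process   # hypothetical: stop the local service

        # Best-effort, bounded disable on *every* stop, not just the final
        # one; failures and timeouts are logged but never fail the stop.
        if ! timeout 60 nova service-disable "$(hostname)" nova-compute; then
            ocf_log warn "nova service-disable failed or timed out; ignoring"
        fi

        return $OCF_SUCCESS         # the stop action itself still succeeds
    }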