09:05:04 #startmeeting ha
09:05:04 Meeting started Mon Jan 25 09:05:04 2016 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:05:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:05:09 The meeting name has been set to 'ha'
09:05:35 hi all (although it seems today "all" just refers to one or two people :)
09:05:47 Hi
09:05:53 (once again ;)
09:05:58 :)
09:06:06 let's do the normal update
09:06:07 hi :)
09:06:08 #topic Current status (progress, issues, roadblocks, further plans)
09:06:21 not too much from my side
09:06:27 just a couple of things
09:06:40 I'm still working on automating setup of compute node HA using Crowbar and Chef
09:07:08 and secondly, the private conversation with beekhof continued but I think we pretty much reached alignment by now
09:07:18 aspiers: coll
09:07:25 s/coll/cool
09:07:55 as a reminder, that was the conversation discussed in last week's meeting about potentially rearchitecting OCF RAs so that they wrap around service(8) (and hence typically systemd)
09:08:24 I'll give a few more details of that after the status round
09:08:30 ddeja: any updates from you?
09:08:50 #info aspiers still working on automating setup of compute node HA using Crowbar and Chef
09:08:56 aspiers: yes. I have submitted a bug to Mistral that is blocking my workflow https://bugs.launchpad.net/mistral/+bug/1535722
09:08:57 Launchpad bug 1535722 in Mistral "Mistral workflow including 'with-items' is not starting on-success task" [Undecided,New]
09:09:08 cool
09:09:24 unfortunately, it hasn't gotten much attention, so I'm working on it myself.
09:09:31 ah ;-)
09:09:37 I have a root cause and a fix, I'm working on unit tests
09:09:42 excellent!
09:10:04 I'm jealous - wish I had time to work on mistral ;-)
09:10:10 that's all from my side
09:10:13 hi guys, i'm sort of around.
just putting kids to bed
09:10:19 oh hey beekhof :)
09:10:25 beekhof: you wanna report any status update?
09:10:29 Hi beekhof
09:11:55 I guess kids get a higher priority on his scheduler ;-)
09:12:03 which is very understandable :)
09:12:24 heh, it looks so
09:12:28 haoli: are you working on HA, or just lurking?
09:13:07 im a green hand
09:13:09 #info ddeja figured out the root cause and a fix for the mistral bug discovered, and is working on unit tests
09:13:42 just lurking
09:13:51 ok, welcome :)
09:14:07 hi
09:14:16 hey bogdando :) any status updates from you?
09:14:35 IIRC, I saw that the new ha-guide meeting time is now decided?
09:15:31 nothing special
09:16:04 yes
09:16:16 I wrote an announcement to the openstack-docs ML
09:16:17 back modulo tantrums
09:17:09 http://lists.openstack.org/pipermail/openstack-docs/2016-January/008209.html
09:17:15 status is that we have the old instance HA stuff working. there is a nice bug in nova that needs to be fixed to make it worthwhile though
09:18:40 ack on the meeting. i won't be able to make it unfortunately
09:19:47 so how far away from a shootout are we?
09:19:47 cool
09:20:00 what kind of shootout? :)
09:20:18 #info new ha-guide meeting time is now decided
09:20:20 cage match, 3 implementations enter, 1 implementation leaves
09:20:31 hah
09:20:41 well you could suggest that for the next Ruler Of The Stack competition ;-)
09:21:23 beekhof: got a URL for that nova bug?
09:22:11 i'll try to dig it up
09:22:13 it's old though
09:22:28 thanks
09:24:01 beekhof: gentle reminder - there are a few open PRs in fence-agents which should be very quick and simple to review
09:24:23 https://bugs.launchpad.net/nova/+bug/1441950
09:24:24 Launchpad bug 1441950 in OpenStack Compute (nova) "instance on source host can not be cleaned after evacuating" [Undecided,Confirmed] - Assigned to Zhenyu Zheng (zhengzhenyu)
09:24:37 i nearly got into them today, much catchup was required
09:24:46 ok great, thanks!
09:25:17 oh, fence-agents...
not really my area. but i assume these are openstack related?
09:25:37 i thought you meant the gerrit thingies
09:26:07 no, I meant https://github.com/ClusterLabs/fence-agents/pulls
09:26:40 but you do have one remaining task on gerrit too ;-)
09:26:57 https://review.openstack.org/#/c/254515/
09:27:18 OK
09:27:20 #topic OCF RAs and systemd
09:27:39 just a very quick update on what beekhof and I have been chatting about privately
09:27:57 he made some very good points about challenges with OCF RAs wrapping systemd
09:28:07 one big one is how systemd handles node shutdown
09:28:22 it will want to stop services which Pacemaker is controlling
09:28:36 since Pacemaker started them via systemd, but systemd is too stupid to realise that it doesn't "own" those services
09:28:53 there may be a good solution but I haven't investigated yet
09:29:03 but currently this is a potential show stopper
09:29:31 #info aspiers and beekhof have continued discussing the idea of OCF RAs wrapping systemd
09:29:47 #info beekhof highlighted some issues which are not yet solved
09:29:56 another issue is around timeouts for start/stop
09:30:08 you can create the same override files that pacemaker does
09:30:19 it's not pretty, but it should work
09:30:27 which files are those?
09:31:19 look in systemd_unit_exec_with_unit()
09:31:25 just a note, we should describe this in the future sepc
09:31:26 spec
09:31:49 everything you want to do can be made to work, i just have doubts that it should :)
09:32:14 it's not something RH is likely to go in for, we get monitoring from elsewhere
09:33:06 beekhof: RH is likely to stop using these OCF RAs altogether, right?
09:33:14 I mean the ones for active/active OpenStack services
09:33:26 excluding NovaCompute
09:33:38 i.e. the "controller" ones
09:33:47 yes
09:34:05 OK, thanks for the info
09:34:17 BTW who are the main fence-agents maintainers?
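[Editor's sketch of the override files mentioned at 09:30:08 and 09:31:19: Pacemaker's systemd_unit_exec_with_unit() writes a drop-in under /run/systemd/system/<unit>.service.d/ for each systemd resource it manages. The snippet below reproduces that idea under an assumed unit name, writing to /tmp instead of /run so it is safe to run anywhere; the drop-in content is based on my reading of Pacemaker's systemd.c and may differ across versions.]

```shell
# Sketch: recreate the kind of drop-in Pacemaker writes for a cluster-managed
# systemd unit. "openstack-nova-api" is a made-up example unit name, and the
# real directory would be /run/systemd/system, not /tmp.
unit="openstack-nova-api"
dropin_dir="/tmp/demo-systemd/${unit}.service.d"

mkdir -p "$dropin_dir"
cat > "${dropin_dir}/50-pacemaker.conf" <<EOF
[Unit]
Description=Cluster Controlled ${unit}
Before=pacemaker.service
EOF

# Before=pacemaker.service makes systemd stop pacemaker *before* this unit at
# node shutdown, which is one way to address the "systemd doesn't know it
# doesn't own the service" problem raised at 09:28:07.
cat "${dropin_dir}/50-pacemaker.conf"
```

[A wrapping OCF RA would presumably need to create an equivalent drop-in itself, which is what beekhof's "you can create the same override files that pacemaker does" suggests.]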
09:34:38 marek
09:34:58 i can get in there if it's the openstack agent though
09:35:15 here is systemd_unit_exec_with_unit(): https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/systemd.c#L516
09:35:30 systemd wrapped to ocf may be done as configurable
09:35:37 with default off
09:35:46 so all will be happy
09:35:53 ahah, I see what beekhof means by overrides now
09:35:57 bogdando: RH is basically planning to kick all those services out of the cluster
09:36:12 out of their cluster anyway
09:36:23 others are welcome to keep them in
09:36:27 openstack stateless services like nova api?
09:36:31 we'll just hand them off to systemd
09:36:33 bogdando: yes
09:36:35 I see
09:36:50 I guess RH will offer monitoring via nagios or similar
09:37:14 usually customers have some sort of monitoring system in place - so i'm told
09:37:22 there are several nice things about monitoring via Pacemaker
09:37:41 it will attempt recovery much more intelligently than something like nagios
09:37:54 actually a lot of monitoring systems won't attempt recovery at all
09:38:01 yes, and it does distributed coordination way better for sure
09:38:11 you also get a unified UI to the services
09:38:21 but stateful services often just don't need this
09:38:40 but stateless services are not always so stateless in openstack ;)
09:38:47 i know, you don't have to sell me of all people on the benefits of pacemaker :)
09:38:54 hehe :)
09:38:59 bogdando: bingo
09:39:11 but I keep hearing how we're not needed
09:39:19 beekhof: come back to SUSE ;-) ;-)
09:39:29 * aspiers didn't just say that
09:39:34 moving swiftly on
09:40:05 :)
09:40:10 since systemctl start/stop is async non-blocking, any OCF RA which wraps it would need to poll for start/stop completion, with a timeout
09:40:13 not sure lmb would have me
09:40:52 and then the Pacemaker timeouts need to be configured to match systemd's Exec{Start,Stop}Timeout parameters
09:41:08 well, not necessarily match
09:41:12 but take into account
09:41:27
this is slightly ugly but shouldn't be a show stopper
09:41:38 "be higher than"
09:41:56 yeah
09:42:12 but the action is still on me to write a proposal spec and a PoC implementation
09:42:18 unfortunately this won't happen for a few weeks at least
09:42:25 we are coming up to our next major release
09:42:31 so I have a lot of other stuff to do
09:42:43 and additionally I'm taking 2 weeks holiday from Wed ;-)
09:43:00 #topic next meetings
09:43:03 how selfish
09:43:10 hehe I know, isn't it great ;-)
09:43:16 so I'll miss the next two meetings
09:43:19 ack
09:43:23 any volunteers to chair in my absence?
09:43:31 i appear to be too unreliable :)
09:43:36 haha fair enough
09:43:43 aspiers: I can do it
09:43:48 ddeja: great thanks!
09:43:59 #action ddeja will chair the next two meetings
09:44:15 #info aspiers will be away from Wed 27th for 2 weeks
09:44:15 about my earlier question... it was mostly serious. how far are we from picking a winner?
09:44:33 also, will anyone else be in austin?
09:44:33 #topic relative merits of different compute HA approaches
09:44:49 beekhof: I think we're still quite a way away
09:44:56 I'm personally very interested in the mistral approach
09:45:06 but I think masakari has some nice stuff too
09:45:13 so I'm interested to see that working with Pacemaker remote
09:45:22 ideally for me those two would converge
09:46:00 my vote is for the mistral approach. This would bring more love and care to the project as well
09:46:34 yeah, I think ultimately mistral+congress makes the most sense
09:46:39 and perhaps new bugs will stop looking like poor orphans
09:46:50 seems clear to me that each cloud will want to handle compute HA using different policies
09:46:58 and Congress driving Mistral is a really nice way to do that
09:47:27 so it seems that I need to speed up my work ;)
09:47:50 hehe
09:47:56 I would LOVE to help with that work
09:48:01 hopefully after our next release
09:49:09 #topic Austin
09:49:14 I am going, who else?
09:49:18 me
09:49:37 oh no^H^H^H^Hcool ;-)
09:49:48 :)
09:49:52 _gryf will be there, for me it's still a mystery ;)
09:49:57 haha
09:50:02 ddeja: I will cross my fingers for you
09:50:15 well ... I said I am going, but travel budget not approved yet
09:50:50 from my side there is the same issue - we do not have budget approved for summits
09:51:04 I was planning to submit a "State of the Nation" talk summarising the various compute node HA approaches in the wild, as captured in https://etherpad.openstack.org/p/automatic-evacuation
09:51:36 obviously I would include a section on the Mistral approach
09:51:42 ddeja: perhaps you or _gryf would like to join me to cover that section?
09:51:47 will we have a few minutes for open discussion?
09:51:52 bogdando: sure
09:52:14 aspiers: If I join you, my company will *have* to send me there ;)
09:52:20 ddeja: exactly ;-)
09:53:15 so I would like to join you on this
09:53:16 beekhof: you're the best heckler I know, so you would need to be in the audience asking awkward questions ;-)
09:53:28 ok
09:54:05 I also asked Florian if he's interested to co-present this talk
09:54:24 it would help keep it vendor-neutral, also he is well known and trusted in the OpenStack HA world
09:54:47 I'll see what he says
09:54:52 ok only 5 mins left
09:54:55 #topic open discussion
09:54:57 I'm just wondering if fault resilience of OpenStack services is in the scope of the HA project as well? If so, there are so many interesting things to put on the radar...
for future R&D perhaps
09:54:58 bogdando: your turn :)
09:55:16 bogdando: sure - we're not really an official project yet anyway
09:55:25 there is no team defined in the governance repo
09:55:28 so I'd like us to do some RTFM
09:55:29 so not Big Tent yet
09:55:43 http://conferences2.sigcomm.org/co-next/2015/img/papers/conext15-final156.pdf
09:55:46 https://kabru.eecs.umich.edu/papers/publications/2013/socc2013_ju.pdf
09:55:53 currently fault tolerance seems more the scope of specialist vendors
09:55:54 back
09:55:55 found in google scholar
09:55:58 * beekhof re-reads
09:56:15 but I guess it will become commoditized over time
09:56:37 aspiers: i can probably talk about RH if that helps
09:56:41 the latter one describes a very interesting approach for tracing flows in OpenStack and injecting failures to test resiliency
09:56:57 should really be adopted as a framework in the OpenStack community
09:57:11 so, that was my 2 cents
09:57:36 cool
09:58:10 if you find anything interesting in your research, please continue to keep us updated :)
09:58:18 it can certainly be a topic at future meetings
09:58:27 i'll be around friday morning this time too
09:58:39 #info bogdando interested in fault resilience and has started doing some research
09:58:56 alright, I guess it's time to end
09:59:21 thanks a lot all, don't enjoy the next two meetings without me too much, and I'll see you in 3 weeks!
09:59:30 bye :)
09:59:34 bye :)
10:00:31 #endmeeting
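[Editor's sketch of the polling discussed at 09:40:10 onwards: since `systemctl start` returns before the unit is actually up, a systemd-wrapping OCF RA's start action would need to poll until the unit is active or a deadline expires, and Pacemaker's op start timeout would need to "be higher than" that deadline (09:41:38). The unit name and timings below are made up, and the activity check is stubbed so the sketch runs without systemd; a real RA would call `systemctl start "$unit"` and `systemctl is-active --quiet "$unit"`.]

```shell
# Sketch of a poll-for-start-completion loop for a hypothetical
# systemd-wrapping OCF RA. Everything here is illustrative.
unit="openstack-nova-api.service"   # made-up example unit
deadline=$(( $(date +%s) + 90 ))    # must stay below Pacemaker's op start timeout
tries=0

is_active() {
    # stub: pretend the unit becomes active on the third poll;
    # a real RA would run: systemctl is-active --quiet "$unit"
    tries=$(( tries + 1 ))
    [ "$tries" -ge 3 ]
}

# a real RA would run: systemctl start "$unit"   (returns immediately)
while ! is_active; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "start of $unit timed out" >&2
        exit 1   # would map to OCF_ERR_GENERIC
    fi
    sleep 1
done
echo "$unit is active"
```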