#openstack-meeting log

09:05:04 <aspiers> #startmeeting ha
09:05:04 <openstack> Meeting started Mon Jan 25 09:05:04 2016 UTC and is due to finish in 60 minutes.  The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:05:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:05:09 <openstack> The meeting name has been set to 'ha'
09:05:35 <aspiers> hi all (although it seems today "all" just refers to one or two people :)
09:05:47 <ddeja> Hi
09:05:53 <ddeja> (once again ;)
09:05:58 <aspiers> :)
09:06:06 <aspiers> let's do the normal update
09:06:07 <haoli> hi  :)
09:06:08 <aspiers> #topic Current status (progress, issues, roadblocks, further plans)
09:06:21 <aspiers> not too much from my side
09:06:27 <aspiers> just a couple of things
09:06:40 <aspiers> I'm still working on automating setup of compute node HA using Crowbar and Chef
09:07:08 <aspiers> and secondly, the private conversation with beekhof continued but I think we pretty much reached alignment by now
09:07:18 <ddeja> aspiers: coll
09:07:25 <ddeja> s/coll/cool
09:07:55 <aspiers> as a reminder, that was the conversation discussed in last week's meeting about potentially rearchitecting OCF RAs so that they wrap around service(8) (and hence typically systemd)
09:08:24 <aspiers> I'll give a few more details of that after the status round
09:08:30 <aspiers> ddeja: any updates from you?
09:08:50 <aspiers> #info aspiers still working on automating setup of compute node HA using Crowbar and Chef
09:08:56 <ddeja> aspiers: yes. I have submitted bug to Mistral that is blocking my workflow https://bugs.launchpad.net/mistral/+bug/1535722
09:08:57 <openstack> Launchpad bug 1535722 in Mistral "Mistral workflow including 'with-items' is not starting on-success task" [Undecided,New]
09:09:08 <aspiers> cool
09:09:24 <ddeja> unfortunately, it hasn't got much attention, so I'm working on it myself.
09:09:31 <aspiers> ah ;-)
09:09:37 <ddeja> I have a root cause and a fix, I'm working on unit tests
09:09:42 <aspiers> excellent!
09:10:04 <aspiers> I'm jealous - wish I had time to work on mistral ;-)
09:10:10 <ddeja> that's all from my side
09:10:13 <beekhof> hi guys, i'm sort of around. just putting kids to bed
09:10:19 <aspiers> oh hey beekhof :)
09:10:25 <aspiers> beekhof: you wanna report any status update?
09:10:29 <ddeja> Hi beekhof
09:11:55 <aspiers> I guess kids get a higher priority on his scheduler ;-)
09:12:03 <aspiers> which is very understandable :)
09:12:24 <ddeja> heh, it looks so
09:12:28 <aspiers> haoli: are you working on HA, or just lurking?
09:13:07 <haoli> im a green hand
09:13:09 <aspiers> #info ddeja figured out root cause and a fix for the mistral bug discovered, and is working on unit tests
09:13:42 <haoli> just lurking
09:13:51 <aspiers> ok, welcome :)
09:14:07 <bogdando> hi
09:14:16 <aspiers> hey bogdando :) any status updates from you?
09:14:35 <aspiers> IIRC, I saw that the new ha-guide meeting time is now decided?
09:15:31 <bogdando> nothing special
09:16:04 <bogdando> yes
09:16:16 <bogdando> I sent an announcement to the openstack-docs ML
09:16:17 <beekhof> back modulo tantrums
09:17:09 <bogdando> http://lists.openstack.org/pipermail/openstack-docs/2016-January/008209.html
09:17:15 <beekhof> status is that we have the old instance HA stuff working. there is a nice bug in nova that needs to be fixed to make it worthwhile though
09:18:40 <beekhof> ack on the meeting. i won't be able to make it unfortunately
09:19:47 <beekhof> so how far away from a shootout are we?
09:19:47 <aspiers> cool
09:20:00 <aspiers> what kind of shootout? :)
09:20:18 <aspiers> #info new ha-guide meeting time is now decided
09:20:20 <beekhof> cage match, 3 implementations enter, 1 implementation leaves
09:20:31 <aspiers> hah
09:20:41 <aspiers> well you could suggest that for the next Ruler Of The Stack competition ;-)
09:21:23 <aspiers> beekhof: got a URL for that nova bug?
09:22:11 <beekhof> i'll try to dig it up
09:22:13 <beekhof> its old though
09:22:28 <aspiers> thanks
09:24:01 <aspiers> beekhof: gentle reminder - there are a few open PRs in fence-agents which should be very quick and simple to review
09:24:23 <beekhof> https://bugs.launchpad.net/nova/+bug/1441950
09:24:24 <openstack> Launchpad bug 1441950 in OpenStack Compute (nova) "instance on source host can not be cleaned after evacuating" [Undecided,Confirmed] - Assigned to Zhenyu Zheng (zhengzhenyu)
09:24:37 <beekhof> i nearly got into them today, much catchup was required
09:24:46 <aspiers> ok great, thanks!
09:25:17 <beekhof> oh, fence-agents... not really my area. but i assume these are openstack related?
09:25:37 <beekhof> i thought you meant the gerrit thingies
09:26:07 <aspiers> no, I meant https://github.com/ClusterLabs/fence-agents/pulls
09:26:40 <aspiers> but you do have one remaining task on gerrit too ;-)
09:26:57 <aspiers> https://review.openstack.org/#/c/254515/
09:27:18 <aspiers> OK
09:27:20 <aspiers> #topic OCF RAs and systemd
09:27:39 <aspiers> just a very quick update on what beekhof and I have been chatting about privately
09:27:57 <aspiers> he made some very good points about challenges with OCF RAs wrapping systemd
09:28:07 <aspiers> one big one is how systemd handles node shutdown
09:28:22 <aspiers> it will want to stop services which Pacemaker is controlling
09:28:36 <aspiers> since Pacemaker started them via systemd, but systemd is too stupid to realise that it doesn't "own" those services
09:28:53 <aspiers> there may be a good solution but I haven't investigated yet
09:29:03 <aspiers> but currently this is a potential show stopper
09:29:31 <aspiers> #info aspiers and beekhof have continued discussing the idea of OCF RAs wrapping systemd
09:29:47 <aspiers> #info beekhof highlighted some issues which are not yet solved
09:29:56 <aspiers> another issue is around timeouts for start/stop
09:30:08 <beekhof> you can create the same override files that pacemaker does
09:30:19 <beekhof> its not pretty, but it should work
09:30:27 <aspiers> which files are those?
09:31:19 <beekhof> look in systemd_unit_exec_with_unit()
09:31:25 <bogdando> just a note, we should describe this in the future sepc
09:31:26 <bogdando> spec
09:31:49 <beekhof> everything you want to do can be made to work, i just have doubts that it should :)
09:32:14 <beekhof> its not something RH is likely to go in for, we get monitoring from elsewhere
09:33:06 <aspiers> beekhof: RH is likely to stop using these OCF RAs altogether, right?
09:33:14 <aspiers> I mean the ones for active/active OpenStack services
09:33:26 <aspiers> excluding NovaCompute
09:33:38 <aspiers> i.e. the "controller" ones
09:33:47 <beekhof> yes
09:34:05 <aspiers> OK, thanks for the info
09:34:17 <aspiers> BTW who are the main fence-agents maintainers?
09:34:38 <beekhof> marek
09:34:58 <beekhof> i can get in there if its the openstack agent though
09:35:15 <aspiers> here is systemd_unit_exec_with_unit(): https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/systemd.c#L516
09:35:30 <bogdando> systemd wrapped to ocf may be done as configurable
09:35:37 <bogdando> with default off
09:35:46 <bogdando> so all will be happy
09:35:53 <aspiers> ahah, I see what beekhof means by overrides now
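A minimal sketch of the "override file" idea beekhof refers to, for readers who do not want to dig through systemd.c. The drop-in path and contents below are written from memory of the Pacemaker source and should be treated as assumptions to check against systemd_unit_exec_with_unit(); the helper name write_override and the example unit name are hypothetical. The point of `Before=pacemaker.service` is that systemd stops units in the reverse of their start order, so at node shutdown Pacemaker gets stopped (and can stop its resources cleanly) before systemd tries to stop the unit itself.

```python
import os

# Assumed drop-in contents; verify against the Pacemaker source before relying on them.
OVERRIDE_TEMPLATE = (
    "[Unit]\n"
    "Description=Cluster Controlled {unit}\n"
    "Before=pacemaker.service\n"
)


def write_override(unit, base_dir="/run/systemd/system"):
    """Write a runtime drop-in for `unit` and return the file's path.

    base_dir is a parameter so the function can be exercised against a
    scratch directory instead of the real /run/systemd/system.
    """
    dropin_dir = os.path.join(base_dir, unit + ".d")
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, "50-pacemaker.conf")
    with open(path, "w") as f:
        f.write(OVERRIDE_TEMPLATE.format(unit=unit))
    return path
```

An OCF RA wrapping systemd would call something like write_override("openstack-nova-api.service") before starting the unit, then ask systemd to reload its configuration (for example with `systemctl daemon-reload`) so the drop-in takes effect.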
09:35:57 <beekhof> bogdando: RH is basically planning to kick all those services out of the cluster
09:36:12 <beekhof> out of their cluster anyway
09:36:23 <beekhof> others are welcome to keep them in
09:36:27 <bogdando> openstack stateless services like nova api?
09:36:31 <beekhof> we'll just hand them off to systemd
09:36:33 <aspiers> bogdando: yes
09:36:35 <bogdando> I see
09:36:50 <aspiers> I guess RH will offer monitoring via nagios or similar
09:37:14 <beekhof> usually customers have some sort of monitoring system in place - so i'm told
09:37:22 <aspiers> there are several nice things about monitoring via Pacemaker
09:37:41 <aspiers> it will attempt recovery much more intelligently than something like nagios
09:37:54 <aspiers> actually a lot of monitoring systems won't attempt recovery at all
09:38:01 <bogdando> yes, and it does distributed coordination way better for sure
09:38:11 <aspiers> you also get a unified UI to the services
09:38:21 <bogdando> but stateful services often just don't need this
09:38:40 <bogdando> but stateless services are not always so stateless in openstack ;)
09:38:47 <beekhof> i know, you dont have to sell me of all people on the benefits of pacemaker :)
09:38:54 <aspiers> hehe :)
09:38:59 <beekhof> bogdando: bingo
09:39:11 <beekhof> but I keep hearing how we're not needed
09:39:19 <aspiers> beekhof: come back to SUSE ;-) ;-)
09:39:29 * aspiers didn't just say that
09:39:34 <aspiers> moving swiftly on
09:40:05 <beekhof> :)
09:40:10 <aspiers> since systemctl start/stop is async non-blocking, any OCF RA which wraps it would need to poll for start/stop completion, with a timeout
09:40:13 <beekhof> not sure lmb would have me
09:40:52 <aspiers> and then the Pacemaker timeouts need to be configured to match systemd's TimeoutStartSec/TimeoutStopSec parameters
09:41:08 <aspiers> well, not necessarily match
09:41:12 <aspiers> but take into account
09:41:27 <aspiers> this is slightly ugly but shouldn't be a show stopper
09:41:38 <beekhof> "be higher than"
09:41:56 <aspiers> yeah
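The polling described above could look roughly like the following. This is only a sketch under stated assumptions, not the PoC mentioned in the meeting: the function names (unit_state, wait_for_state, start_unit) are hypothetical, and it assumes the RA fires off the start with `systemctl --no-block start` and then polls `systemctl is-active` until the unit reports "active" or the deadline passes.

```python
import subprocess
import time


def unit_state(unit):
    """Return what `systemctl is-active UNIT` prints, e.g. "active" or "inactive"."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # is-active exits non-zero for non-active units, so the exit code is ignored
    # and only the printed state is used.
    return result.stdout.strip()


def wait_for_state(get_state, wanted, timeout, interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns `wanted`; give up after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def start_unit(unit, timeout):
    """Queue a start job without blocking, then poll until the unit is active."""
    subprocess.run(["systemctl", "--no-block", "start", unit], check=True)
    return wait_for_state(lambda: unit_state(unit), "active", timeout)
```

The clock and sleep parameters exist only so the loop can be tested without waiting. Whatever timeout the RA polls with, the Pacemaker operation timeout has to be higher than systemd's own start/stop timeouts (TimeoutStartSec and TimeoutStopSec in the unit file), otherwise Pacemaker would declare a failure while systemd is still legitimately working.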
09:42:12 <aspiers> but the action is still on me to write a proposal spec and a PoC implementation
09:42:18 <aspiers> unfortunately this won't happen for a few weeks at least
09:42:25 <aspiers> we are coming up to our next major release
09:42:31 <aspiers> so I have a lot of other stuff to do
09:42:43 <aspiers> and additionally I'm taking 2 weeks holiday from Wed ;-)
09:43:00 <aspiers> #topic next meetings
09:43:03 <beekhof> how selfish
09:43:10 <aspiers> hehe I know, isn't it great ;-)
09:43:16 <aspiers> so I'll miss the next two meetings
09:43:19 <beekhof> ack
09:43:23 <aspiers> any volunteers to chair in my absence?
09:43:31 <beekhof> i appear to be too unreliable :)
09:43:36 <aspiers> haha fair enough
09:43:43 <ddeja> aspiers: I can do it
09:43:48 <aspiers> ddeja: great thanks!
09:43:59 <aspiers> #action ddeja will chair the next two meetings
09:44:15 <aspiers> #info aspiers will be away from Wed 27th for 2 weeks
09:44:15 <beekhof> about my earlier question... it was mostly serious. how far are we from picking a winner?
09:44:33 <beekhof> also, will anyone else be in austin?
09:44:33 <aspiers> #topic relative merits of different compute HA approaches
09:44:49 <aspiers> beekhof: I think we're still quite a way away
09:44:56 <aspiers> I'm personally very interested in the mistral approach
09:45:06 <aspiers> but I think masakari has some nice stuff too
09:45:13 <aspiers> so I'm interested to see that working with Pacemaker remote
09:45:22 <aspiers> ideally for me those two would converge
09:46:00 <bogdando> my vote is for mistral approach. This would bring more love and care to the project as well
09:46:34 <aspiers> yeah, I think ultimately mistral+congress makes the most sense
09:46:39 <bogdando> and perhaps new bugs will stop looking like poor orphans
09:46:50 <aspiers> seems clear to me that each cloud will want to handle compute HA using different policies
09:46:58 <aspiers> and Congress driving Mistral is a really nice way to do that
09:47:27 <ddeja> so it seems that I need to speed up my work ;)
09:47:50 <aspiers> hehe
09:47:56 <aspiers> I would LOVE to help with that work
09:48:01 <aspiers> hopefully after our next release
09:49:09 <aspiers> #topic Austin
09:49:14 <aspiers> I am going, who else?
09:49:18 <beekhof> me
09:49:37 <aspiers> oh no^H^H^H^Hcool ;-)
09:49:48 <beekhof> :)
09:49:52 <ddeja> _gryf will be there, for me it's still a mystery ;)
09:49:57 <aspiers> haha
09:50:02 <aspiers> ddeja: I will cross my fingers for you
09:50:15 <aspiers> well ... I said I am going, but travel budget not approved yet
09:50:50 <ddeja> from my side there is the same issue - we do not have budget approved for summits
09:51:04 <aspiers> I was planning to submit a "State of the Nation" talk summarising the various compute node HA approaches in the wild, as captured in https://etherpad.openstack.org/p/automatic-evacuation
09:51:36 <aspiers> obviously I would include a section on the Mistral approach
09:51:42 <aspiers> ddeja: perhaps you or _gryf would like to join me to cover that section?
09:51:47 <bogdando> will we have few minutes for open discussion?
09:51:52 <aspiers> bogdando: sure
09:52:14 <ddeja> aspiers: If I join you, my company will *have* to send me there ;)
09:52:20 <aspiers> ddeja: exactly ;-)
09:53:15 <ddeja> so I would like to join you on this
09:53:16 <aspiers> beekhof: you're the best heckler I know, so you would need to be in the audience asking awkward questions ;-)
09:53:28 <beekhof> ok
09:54:05 <aspiers> I also asked Florian if he's interested to co-present this talk
09:54:24 <aspiers> it would help keep it vendor-neutral, also he is well known and trusted in the OpenStack HA world
09:54:47 <aspiers> I'll see what he says
09:54:52 <aspiers> ok only 5 mins left
09:54:55 <aspiers> #topic open discussion
09:54:57 <bogdando> I'm just wondering if fault resilience of OpenStack services is in the scope of the HA project as well? If so, there are so many interesting things to put on the radar... for future R&D perhaps
09:54:58 <aspiers> bogdando: your turn :)
09:55:16 <aspiers> bogdando: sure - we're not really an official project yet anyway
09:55:25 <aspiers> there is no team defined in the governance repo
09:55:28 <bogdando> so I'd like us to do some RTFM
09:55:29 <aspiers> so not Big Tent yet
09:55:43 <bogdando> http://conferences2.sigcomm.org/co-next/2015/img/papers/conext15-final156.pdf
09:55:46 <bogdando> https://kabru.eecs.umich.edu/papers/publications/2013/socc2013_ju.pdf
09:55:53 <aspiers> currently fault tolerance seems more the scope of specialist vendors
09:55:54 <beekhof> back
09:55:55 <bogdando> found on Google Scholar
09:55:58 * beekhof re-reads
09:56:15 <aspiers> but I guess it will become commoditized over time
09:56:37 <beekhof> aspiers: i can probably talk about RH if that helps
09:56:41 <bogdando> the latter one describes a very interesting approach for tracing flows in OpenStack and injecting failures to test resiliency
09:56:57 <bogdando> should be really adopted as a framework in OpenStack community
09:57:11 <bogdando> so, that was my 2 cents
09:57:36 <aspiers> cool
09:58:10 <aspiers> if you find anything interesting in your research, please continue to keep us updated :)
09:58:18 <aspiers> it can certainly be a topic at future meetings
09:58:27 <beekhof> i'll be around friday morning this time too
09:58:39 <aspiers> #info bogdando interested in fault resilience and has started doing some research
09:58:56 <aspiers> alright, I guess it's time to end
09:59:21 <aspiers> thanks a lot all, don't enjoy the next two meetings without me too much, and I'll see you in 3 weeks!
09:59:30 <ddeja> bye :)
09:59:34 <aspiers> bye :)
10:00:31 <aspiers> #endmeeting