09:00:52 <aspiers> #startmeeting ha
09:00:54 <openstack> Meeting started Mon Mar 7 09:00:52 2016 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:55 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:58 <openstack> The meeting name has been set to 'ha'
09:01:15 <aspiers> OK let's get started
09:01:28 <haukebruno> hi everyhone :)
09:01:29 <aspiers> we've had apologies from beekhof who is not feeling well
09:01:40 <aspiers> hi haukebruno
09:02:02 <aspiers> #topic introductions
09:02:14 <aspiers> haukebruno: since we are a small group, would you like to introduce yourself?
09:03:14 <aspiers> since I don't think we saw you in these meetings before
09:03:46 <haukebruno> sure. my name is hauke, 27 years old and from germany. I work in a small private cloud startup near frankfurt and we use openstack for 2.5 years now. before I used a lot of the HA stuff around pacemaker/corosync and haproxy, thats the main reason why I joined this meeting/channel
09:04:01 <aspiers> great! really glad to have you join :)
09:04:08 <haukebruno> ah, + I am poorly a 100% ops guy
09:04:12 <aspiers> haha
09:04:18 <masahito> haukebruno: welcome
09:04:24 <haukebruno> thanks masahito
09:04:25 <aspiers> I'm from SUSE, masahito is from NTT, and ddeja / _gryf are from Intel
09:04:37 <haukebruno> nice to meet you guys
09:04:50 <aspiers> we are all working on HA for our companies, and currently focusing quite a bit on compute node HA
09:05:04 <aspiers> #topic Current status (progress, issues, roadblocks, further plans)
09:05:13 <haukebruno> very nice, I guess compute node HA is the most wanted kind of HA these days
09:05:13 <aspiers> OK, quick update from me:
09:05:24 <aspiers> yes it seems in demand :)
09:05:53 <aspiers> I did some reviewing of changes from Norbert Illes on openstack-resource-agents
09:06:10 <aspiers> we have been tidying up the code and now it passes bashate 100%
09:06:17 <aspiers> next step will be to add some basic CI
09:06:52 <aspiers> #info https://bugs.launchpad.net/openstack-resource-agents/+bug/1550203 (bashate violations in OCF scripts) is now fixed
09:06:52 <openstack> Launchpad bug 1550203 in openstack-resource-agents "Bashate violations in OCF scripts" [Medium,Fix committed] - Assigned to Norbert Illes (nilles)
09:07:21 <aspiers> I think that's the only interesting news I have this week
09:08:07 * _gryf passes the baton to ddeja :)
09:08:11 <aspiers> :)
09:08:25 <ddeja> ok
09:08:59 <ddeja> so I was on a medical leave - today is my first day at office in March
09:09:09 <Qiming> so the basic assumption when we are talking about HA is still about pacemaker/corosync?
09:09:19 <aspiers> hi Qiming
09:09:33 <Qiming> hi, everyone
09:09:39 <Qiming> just sneaked in
09:09:40 <aspiers> Qiming: not necessarily, but at the moment all topics usually involve pacemaker
09:09:57 <Qiming> got it
09:09:57 <ddeja> I was thinking a little about demo of Tenants HA using Mistral, and I'm about to prepare some short film to present how it works
09:10:05 <ddeja> that's all from my side
09:10:12 <aspiers> ddeja: great, looking forward to seeing that :)
09:10:17 <aspiers> a short film would be really useful
09:10:19 <masahito> ddeja: sounds nice.
09:10:33 <ddeja> thanks guys :)
09:10:41 <aspiers> masahito: same for masakari ;-)
09:10:58 <aspiers> masahito: any updates from your side?
09:11:24 <masahito> I don't have any update to report the team.
09:11:28 <aspiers> ok
09:11:51 <aspiers> haukebruno / Qiming: anything you want to share under the current #topic ?
09:12:03 <aspiers> any info on current work or future plans?
09:12:09 <aspiers> no problem if not
09:12:11 * Qiming is wondering who pushed him into the room ...
09:12:17 <haukebruno> not from my site sadly
09:12:17 <aspiers> haha
09:13:02 <aspiers> ok
09:13:03 <Qiming> aspiers, seriously, I'm from Heat and Senlin team, we have some thoughts, design, prototype on VM/App HA
09:13:41 <aspiers> Qiming: oh, well we definitely want to hear about that then!
09:13:45 <haukebruno> maybe important for you: I am pretty new to the community site of openstack, I wanted to contribute as good as I can, but no idea about the 'how'
09:13:49 <Qiming> we are working with some NFV guys, soliciting their requirements on workload HA
09:14:19 <aspiers> #topic Vm/App HA work within Heat and Senlin teams
09:14:50 <Qiming> previously we tried to inject some HA mechanisms into Heat
09:14:57 <aspiers> Qiming: ok, hopefully we can work together with you on this? one of the big goals of these meetings is to try to converge efforts in the long term
09:15:13 <Qiming> but the proposal was rejected because it doesn't align well with Heat's mission, which is a pure orchestrator
09:15:24 <Qiming> aspiers, definitely
09:15:45 <Qiming> then later when we started the Senlin project (a clustering service), now an official project
09:16:02 <aspiers> Qiming: have you seen http://specs.openstack.org/openstack/openstack-user-stories/user-stories/draft/ha_vm.html ?
09:16:03 <Qiming> we tried to get HA designed into the service
09:16:22 <Qiming> no, aspiers, will read that offline
09:16:39 <aspiers> Qiming: also https://etherpad.openstack.org/p/automatic-evacuation
09:17:01 <Qiming> in senlin, our understanding is that behind any HA solution, you need redundancy, which is a cluster
09:17:16 <aspiers> right
09:17:34 <Qiming> and to do HA, you will need to think about three aspects (at least): detection, signaling and recovery
09:17:58 <Qiming> we have prototyped some policies that can be enforced on VM clusters or Heat stack clusters
09:18:02 <aspiers> and recovery requires fencing
09:18:28 <Qiming> aspiers, exactly, we were working with some IBMers from Haifa research lab on this
09:18:29 <aspiers> as well as election if the cluster is decentralized
09:18:55 <aspiers> Qiming: how far along is your prototype?
09:18:57 <Qiming> exactly, it is never a simple solution
09:19:12 <aspiers> Qiming: could you describe the architecture and/or what it achieves, or point us to a URL with docs?
09:20:05 <Qiming> so, back to senlin's prototype, we plan to failure detections in three ways: 1) periodically polling the VM states from Nova 2) listen to VM lifecycle events 3) inquire the load-balancer (health monitor) if the cluster does have a load-balancer
09:20:43 <Qiming> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/policies/health_policy.py
09:21:20 <Qiming> that is a skeleton, team is still debating on the details, as always, :)
09:21:47 <aspiers> ok
09:22:03 <aspiers> I was not aware of Senlin before, so it's great that you joined this meeting to tell us
09:22:11 <Qiming> my pleasure
09:22:26 <Qiming> but I'm not gonna hijacking this meeting for a senlin tutorial
09:22:40 <aspiers> Qiming: well, I'm not sure we have much else to discuss today
09:22:45 <aspiers> Qiming: it's a smaller group than usual
09:22:55 <Qiming> okay
09:22:58 <aspiers> Qiming: so I think it's a good use of the time
09:23:08 <Qiming> good to know that
09:23:10 <aspiers> although if anyone else has urgent issues to discuss, please let me know :)
09:23:27 <aspiers> Qiming: how would you describe the main differences to the existing approaches to HA?
09:23:52 <Qiming> it is more customizable, it is not tied to pacemaker/corosync
09:23:55 <aspiers> it seems that this is some kind of "HA as a service"
09:24:22 <Qiming> HA was treated as one of the policies that you can attach to a group of things
09:24:34 <Qiming> a pretty bold simplification
09:24:37 <aspiers> what are the "homogeneous objects" referred to?
09:25:07 <aspiers> the OpenStack infrastructure services, e.g. API endpoints?
09:25:09 <Qiming> a cluster can be a group of nova server, a group of heat stacks, for instance
09:25:25 <Qiming> but you are not supposed to have a cluster mixed of nova servers and heat stacks
09:25:49 <aspiers> is the idea that Senlin is only used by other OpenStack services? or also by OpenStack end users?
09:26:05 <Qiming> main target is end users
09:26:21 <aspiers> oh
09:26:26 <Qiming> it can be used by other projects as well because we have a REST API
09:26:40 <aspiers> can you give us an example use case?
09:26:46 <Qiming> some friends have helped implemented Heat resource types for Senlin
09:27:10 <Qiming> create a cluster of Nova server, get it load-balanced, make it auto-scale, and ensure HA for the instances
09:27:12 <aspiers> e.g. what would a cluster of nova servers (I assume you mean VMs) look like?
09:28:01 <aspiers> so it would need to have access inside each VM, e.g. to install/configure clustering software?
09:28:22 <aspiers> or would you only monitor from outside the VMs?
09:28:35 <Qiming> when adding new nodes (e.g VMs), you can decide where those VMs will be created (i.e. affinity or anti-affinity), when deleting existing nodes, you can have a say which nodes are preferred
09:29:05 <Qiming> senlin is not yet installing any other clustering software into the VMs
09:29:10 <aspiers> ok
09:29:22 <Qiming> but that has been considered as a usage scenario
09:29:27 <aspiers> so decisions on cluster management would be made centrally by Senlin server?
09:29:33 <Qiming> yep
09:29:43 <aspiers> ok, I understand now
09:29:46 <haukebruno> what will happen if one instance fails? spawning another one with the same kind of 'metadata' (you pointed out the affinity thing)?
09:30:21 <Qiming> in a health_policy attached to a cluster, you can specify the recover actions you want to try
09:30:32 <aspiers> #info Qiming gave an introduction to Senlin (clustering service for OpenStack end users)
09:30:37 <aspiers> #link https://github.com/openstack/senlin
09:30:39 <Qiming> for nova servers, it could be 'reboot', 'rebuild', 'evacuate', ... 'recreate
09:30:48 <_gryf> Qiming, you said, that the state of the vm you polling from nova
09:30:56 <haukebruno> ah, now I understand too, thanks ;)
09:31:31 <Qiming> all nova server clusters that have a health policy attached will be registered
09:31:43 <Qiming> then checked periodically (i.e. http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/health_manager.py#n76)
09:32:19 <Qiming> we haven't yet decided whether auto-recover is a good thing
09:32:30 <aspiers> Qiming: how will you do fencing?
09:32:42 <Qiming> there are other details to the cluster_recover function to be figured out
09:33:11 <haukebruno> sorry for kind of offtopic, but I am curious if there is anything inside openstack that someone could use as a fencing device for instances
09:33:12 <Qiming> aspiers, our friends from Israel lab helped developed those components back in 2014
09:33:27 <aspiers> haukebruno: there isn't, that's why we use Pacemaker
09:33:34 <_gryf> haukebruno, there is no such thing
09:33:36 <Qiming> IIRC, they remote operate the gateway
09:33:38 <aspiers> haukebruno: well, it is one of many reasons why we use Pacemaker
09:34:24 <haukebruno> i see, thanks
09:34:43 <Qiming> I heard some different opinions regarding using pacemaker to do HA for OpenStack controllers
09:35:05 <Qiming> our friends in New York lab is doing OpenStack controller HA without using pacemaker
09:35:17 <Qiming> they are using consul for monitoring, I believe
09:35:17 <haukebruno> Qiming, we also
09:35:36 <aspiers> Qiming: how does NY lab do fencing?
09:35:46 <Qiming> aspiers, have to check out
09:36:25 <Qiming> there is no fencing API as far as I know, so our prototype only works on certain type of network switch
09:37:31 <aspiers> Qiming: I heard different opinions on Pacemaker too, although I never heard any convincing arguments against Pacemaker
09:38:05 <aspiers> most (not all) of the arguments I heard against it were based on misunderstanding
09:38:26 <Qiming> okay, I asked my colleagues there when I heard this, their opinion is they hate switching between nova commands and pcs
09:38:28 <_gryf> Qiming, so your solution actually is dependent on some sort of things, like certain type of switch, otherwise it wouldn't be able to fence nodes, right?
09:39:11 <Qiming> and their resource agents are not always yielding a reliable result regarding whether glance-api is still alive
09:39:18 <aspiers> clustering is really difficult, and the Pacemaker code is based on 15-20 years of experience of writing clustering software
09:39:31 <Qiming> the only thing pacemaker knows for sure is that the PID is still there, :)
09:39:45 <Qiming> _gryf, correct
09:40:05 <aspiers> Qiming: Pacemaker knows about a lot more than the PID if you use the openstack-resource-agents project :)
09:40:34 <Qiming> aspiers, that is beyond my knowledge, :) Haven't been following that for a long time
09:40:51 <Qiming> good to know that things are improving
09:41:31 <aspiers> Qiming: I maintain that project. For a long time the OpenStack OCF RAs have been capable of monitoring the actual service, not just the pids
09:42:06 <Qiming> that's great
09:42:31 <aspiers> and the monitoring is direct. IIUC nova-server <-> nova-compute relies on the message bus
09:43:12 <aspiers> alright
09:43:19 <aspiers> that was a really useful intro to Senlin, thanks!
09:43:30 <Qiming> my pleasure
09:43:38 <aspiers> Qiming: please take a look at those links so you can understand what the rest of the community is doing
09:43:50 <Qiming> really very happy there are finally more people looking into this area
09:43:51 <aspiers> Qiming: also http://www.slideshare.net/adamspiers/compute-node-ha-current-upstream-development
09:43:54 <Qiming> \o/
09:44:07 <Qiming> will do, aspiers
09:44:19 <aspiers> Qiming: there is a cross-project IRC meeting in 35 hours from now which aims to cover this topci
09:44:21 <aspiers> topic
09:44:27 <aspiers> it would be great if you could join
09:44:41 <aspiers> https://wiki.openstack.org/wiki/Meetings/CrossProjectMeeting
09:45:11 <Qiming> it is 2100 UTC?
09:45:14 <aspiers> yes
09:45:24 <aspiers> it's a difficult time for some of us
09:45:24 <ddeja> aspiers: but it's not 100% that it would take place this week
09:45:28 <Qiming> 5 am here, :(
09:45:31 <aspiers> :(
09:45:42 <ddeja> like, it was canceled last week
09:45:57 <aspiers> ddeja: thingee sent an email in the last few days asking for someone to chair this week
09:46:11 <aspiers> ddeja: so I think it's probably 80% likely
09:46:28 <aspiers> I am not sure though
09:46:32 <ddeja> ok, but on the other hand there was this mail if some of us can cover the topic
09:46:42 <aspiers> #topic AOB (Any Other Business)
09:47:09 <aspiers> yeah, let's see what thingee says
09:47:27 <ddeja> I can talk with Renat from Mistral team if he can contact Timofey (the guy who originaly put the spec in review)
09:47:30 <aspiers> I think it's really important that at least 1 or 2 of us are there
09:47:44 <ddeja> they work with each other AFAICT
09:47:49 <aspiers> ok, thanks
09:48:17 <aspiers> ddeja: is that time OK for you? or do you think we should try to push for a different time?
09:48:29 <aspiers> masahito: I guess 2100 UTC is a bad time for you too?
09:48:36 <ddeja> aspiers: it's 10 P.M for me, but I can make it
09:48:45 <aspiers> ok
09:48:51 <masahito> aspiers: yap, it doesn't work for me.
09:48:53 <_gryf> aspiers, I'll try to participate
09:48:59 <masahito> 6am X(
09:49:01 <aspiers> _gryf: me too
09:49:04 <aspiers> masahito: :(
09:49:12 <ddeja> watching evening movie or joining meeting... same fun! ;)
09:49:16 <aspiers> I guess the challenge is that the meeting also needs to cover non-HA topics
09:49:19 <aspiers> haha
09:51:03 <aspiers> any other topics people want to discuss?
09:51:19 <ddeja> aspiers: only short question
09:51:40 <ddeja> didi you get some mail about presentation in Austin?
09:51:47 <aspiers> ddeja: not yet
09:51:56 <ddeja> aspiers: ok
09:52:52 <aspiers> ok 1 minute to raise any other topics
09:53:07 <aspiers> otherwise we can end the meeting slightly early
09:53:25 <haukebruno> is the meeting normally about 1 hour?
09:53:31 <_gryf> haukebruno, yup
09:53:54 <_gryf> haukebruno, but you can join #openstack-ha for further discussion anytime
09:53:59 <aspiers> yes
09:54:14 <haukebruno> _gryf, thanks, I am also in #openstack-ha ;)
09:54:20 <_gryf> haukebruno, k :)
09:54:26 <aspiers> Qiming: please join #openstack-ha too :)
09:54:33 <masahito> haukebruno: or use openstack-dev ML to send a mail with [HA] in title
09:54:34 <aspiers> Qiming: and encourage your colleagues to also join
09:54:45 <haukebruno> masahito, good to know, thanks
09:55:11 <aspiers> ok great, thanks everyone!
09:55:16 <aspiers> see you next week :)
09:55:22 <haukebruno> thanks too, was interesting :)
09:55:34 <ddeja> bye
09:55:43 <masahito> thanks, bye
09:55:44 <_gryf> cu
09:55:52 <aspiers> bye :)
09:55:54 <aspiers> #endmeeting