09:00:52 #startmeeting ha
09:00:54 Meeting started Mon Mar 7 09:00:52 2016 UTC and is due to finish in 60 minutes. The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:55 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:58 The meeting name has been set to 'ha'
09:01:15 OK let's get started
09:01:28 hi everyhone :)
09:01:29 we've had apologies from beekhof who is not feeling well
09:01:40 hi haukebruno
09:02:02 #topic introductions
09:02:14 haukebruno: since we are a small group, would you like to introduce yourself?
09:03:14 since I don't think we saw you in these meetings before
09:03:46 sure. my name is hauke, 27 years old and from germany. I work in a small private cloud startup near frankfurt and we use openstack for 2.5 years now. before I used a lot of the HA stuff around pacemaker/corosync and haproxy, thats the main reason why I joined this meeting/channel
09:04:01 great! really glad to have you join :)
09:04:08 ah, + I am poorly a 100% ops guy
09:04:12 haha
09:04:18 haukebruno: welcome
09:04:24 thanks masahito
09:04:25 I'm from SUSE, masahito is from NTT, and ddeja / _gryf are from Intel
09:04:37 nice to meet you guys
09:04:50 we are all working on HA for our companies, and currently focusing quite a bit on compute node HA
09:05:04 #topic Current status (progress, issues, roadblocks, further plans)
09:05:13 very nice, I guess compute node HA is the most wanted kind of HA these days
09:05:13 OK, quick update from me:
09:05:24 yes it seems in demand :)
09:05:53 I did some reviewing of changes from Norbert Illes on openstack-resource-agents
09:06:10 we have been tidying up the code and now it passes bashate 100%
09:06:17 next step will be to add some basic CI
09:06:52 #info https://bugs.launchpad.net/openstack-resource-agents/+bug/1550203 (bashate violations in OCF scripts) is now fixed
09:06:52 Launchpad bug 1550203 in openstack-resource-agents "Bashate violations in OCF scripts" [Medium,Fix committed] - Assigned to Norbert Illes (nilles)
09:07:21 I think that's the only interesting news I have this week
09:08:07 * _gryf passes the baton to ddeja :)
09:08:11 :)
09:08:25 ok
09:08:59 so I was on a medical leave - today is my first day at office in March
09:09:09 so the basic assumption when we are talking about HA is still about pacemaker/corosync?
09:09:19 hi Qiming
09:09:33 hi, everyone
09:09:39 just sneaked in
09:09:40 Qiming: not necessarily, but at the moment all topics usually involve pacemaker
09:09:57 got it
09:09:57 I was thinking a little about demo of Tenants HA using Mistral, and I'm about to prepare some short film to present how it works
09:10:05 that's all from my side
09:10:12 ddeja: great, looking forward to seeing that :)
09:10:17 a short film would be really useful
09:10:19 ddeja: sounds nice.
09:10:33 thanks guys :)
09:10:41 masahito: same for masakari ;-)
09:10:58 masahito: any updates from your side?
09:11:24 I don't have any update to report the team.
09:11:28 ok
09:11:51 haukebruno / Qiming: anything you want to share under the current #topic ?
09:12:03 any info on current work or future plans?
09:12:09 no problem if not
09:12:11 * Qiming is wondering who pushed him into the room ...
09:12:17 not from my site sadly
09:12:17 haha
09:13:02 ok
09:13:03 aspiers, seriously, I'm from Heat and Senlin team, we have some thoughts, design, prototype on VM/App HA
09:13:41 Qiming: oh, well we definitely want to hear about that then!
09:13:45 maybe important for you: I am pretty new to the community site of openstack, I wanted to contribute as good as I can, but no idea about the 'how'
09:13:49 we are working with some NFV guys, soliciting their requirements on workload HA
09:14:19 #topic Vm/App HA work within Heat and Senlin teams
09:14:50 previously we tried to inject some HA mechanisms into Heat
09:14:57 Qiming: ok, hopefully we can work together with you on this? one of the big goals of these meetings is to try to converge efforts in the long term
09:15:13 but the proposal was rejected because it doesn't align well with Heat's mission, which is a pure orchestrator
09:15:24 aspiers, definitely
09:15:45 then later when we started the Senlin project (a clustering service), now an official project
09:16:02 Qiming: have you seen http://specs.openstack.org/openstack/openstack-user-stories/user-stories/draft/ha_vm.html ?
09:16:03 we tried to get HA designed into the service
09:16:22 no, aspiers, will read that offline
09:16:39 Qiming: also https://etherpad.openstack.org/p/automatic-evacuation
09:17:01 in senlin, our understanding is that behind any HA solution, you need redundancy, which is a cluster
09:17:16 right
09:17:34 and to do HA, you will need to think about three aspects (at least): detection, signaling and recovery
09:17:58 we have prototyped some policies that can be enforced on VM clusters or Heat stack clusters
09:18:02 and recovery requires fencing
09:18:28 aspiers, exactly, we were working with some IBMers from Haifa research lab on this
09:18:29 as well as election if the cluster is decentralized
09:18:55 Qiming: how far along is your prototype?
09:18:57 exactly, it is never a simple solution
09:19:12 Qiming: could you describe the architecture and/or what it achieves, or point us to a URL with docs?
09:20:05 so, back to senlin's prototype, we plan to failure detections in three ways: 1) periodically polling the VM states from Nova 2) listen to VM lifecycle events 3) inquire the load-balancer (health monitor) if the cluster does have a load-balancer
09:20:43 http://git.openstack.org/cgit/openstack/senlin/tree/senlin/policies/health_policy.py
09:21:20 that is a skeleton, team is still debating on the details, as always, :)
09:21:47 ok
09:22:03 I was not aware of Senlin before, so it's great that you joined this meeting to tell us
09:22:11 my pleasure
09:22:26 but I'm not gonna hijacking this meeting for a senlin tutorial
09:22:40 Qiming: well, I'm not sure we have much else to discuss today
09:22:45 Qiming: it's a smaller group than usual
09:22:55 okay
09:22:58 Qiming: so I think it's a good use of the time
09:23:08 good to know that
09:23:10 although if anyone else has urgent issues to discuss, please let me know :)
09:23:27 Qiming: how would you describe the main differences to the existing approaches to HA?
09:23:52 it is more customizable, it is not tied to pacemaker/corosync
09:23:55 it seems that this is some kind of "HA as a service"
09:24:22 HA was treated as one of the policies that you can attach to a group of things
09:24:34 a pretty bold simplification
09:24:37 what are the "homogeneous objects" referred to?
09:25:07 the OpenStack infrastructure services, e.g. API endpoints?
09:25:09 a cluster can be a group of nova server, a group of heat stacks, for instance
09:25:25 but you are not supposed to have a cluster mixed of nova servers and heat stacks
09:25:49 is the idea that Senlin is only used by other OpenStack services? or also by OpenStack end users?
09:26:05 main target is end users
09:26:21 oh
09:26:26 it can be used by other projects as well because we have a REST API
09:26:40 can you give us an example use case?
09:26:46 some friends have helped implemented Heat resource types for Senlin
09:27:10 create a cluster of Nova server, get it load-balanced, make it auto-scale, and ensure HA for the instances
09:27:12 e.g. what would a cluster of nova servers (I assume you mean VMs) look like?
09:28:01 so it would need to have access inside each VM, e.g. to install/configure clustering software?
09:28:22 or would you only monitor from outside the VMs?
09:28:35 when adding new nodes (e.g VMs), you can decide where those VMs will be created (i.e. affinity or anti-affinity), when deleting existing nodes, you can have a say which nodes are preferred
09:29:05 senlin is not yet installing any other clustering software into the VMs
09:29:10 ok
09:29:22 but that has been considered as a usage scenario
09:29:27 so decisions on cluster management would be made centrally by Senlin server?
09:29:33 yep
09:29:43 ok, I understand now
09:29:46 what will happen if one instance fails? spawning another one with the same kind of 'metadata' (you pointed out the affinity thing)?
09:30:21 in a health_policy attached to a cluster, you can specify the recover actions you want to try
09:30:32 #info Qiming gave an introduction to Senlin (clustering service for OpenStack end users)
09:30:37 #link https://github.com/openstack/senlin
09:30:39 for nova servers, it could be 'reboot', 'rebuild', 'evacuate', ... 'recreate
09:30:48 <_gryf> Qiming, you said, that the state of the vm you polling from nova
09:30:56 ah, now I understand too, thanks ;)
09:31:31 all nova server clusters that have a health policy attached will be registered
09:31:43 then checked periodically (i.e. http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/health_manager.py#n76)
09:32:19 we haven't yet decided whether auto-recover is a good thing
09:32:30 Qiming: how will you do fencing?
09:32:42 there are other details to the cluster_recover function to be figured out
09:33:11 sorry for kind of offtopic, but I am curious if there is anything inside openstack that someone could use as a fencing device for instances
09:33:12 aspiers, our friends from Israel lab helped developed those components back in 2014
09:33:27 haukebruno: there isn't, that's why we use Pacemaker
09:33:34 <_gryf> haukebruno, there is no such thing
09:33:36 IIRC, they remote operate the gateway
09:33:38 haukebruno: well, it is one of many reasons why we use Pacemaker
09:34:24 i see, thanks
09:34:43 I heard some different opinions regarding using pacemaker to do HA for OpenStack controllers
09:35:05 our friends in New York lab is doing OpenStack controller HA without using pacemaker
09:35:17 they are using consul for monitoring, I believe
09:35:17 Qiming, we also
09:35:36 Qiming: how does NY lab do fencing?
09:35:46 aspiers, have to check out
09:36:25 there is no fencing API as far as I know, so our prototype only works on certain type of network switch
09:37:31 Qiming: I heard different opinions on Pacemaker too, although I never heard any convincing arguments against Pacemaker
09:38:05 most (not all) of the arguments I heard against it were based on misunderstanding
09:38:26 okay, I asked my colleagues there when I heard this, their opinion is they hate switching between nova commands and pcs
09:38:28 <_gryf> Qiming, so your solution actually is dependent on some sort of things, like certain type of switch, otherwise it wouldn't be able to fence nodes, right?
09:39:11 and their resource agents are not always yielding a reliable result regarding whether glance-api is still alive
09:39:18 clustering is really difficult, and the Pacemaker code is based on 15-20 years of experience of writing clustering software
09:39:31 the only thing pacemaker knows for sure is that the PID is still there, :)
09:39:45 _gryf, correct
09:40:05 Qiming: Pacemaker knows about a lot more than the PID if you use the openstack-resource-agents project :)
09:40:34 aspiers, that is beyond my knowledge, :) Haven't been following that for a long time
09:40:51 good to know that things are improving
09:41:31 Qiming: I maintain that project. For a long time the OpenStack OCF RAs have been capable of monitoring the actual service, not just the pids
09:42:06 that's great
09:42:31 and the monitoring is direct. IIUC nova-server <-> nova-compute relies on the message bus
09:43:12 alright
09:43:19 that was a really useful intro to Senlin, thanks!
09:43:30 my pleasure
09:43:38 Qiming: please take a look at those links so you can understand what the rest of the community is doing
09:43:50 really very happy there are finally more people looking into this area
09:43:51 Qiming: also http://www.slideshare.net/adamspiers/compute-node-ha-current-upstream-development
09:43:54 \o/
09:44:07 will do, aspiers
09:44:19 Qiming: there is a cross-project IRC meeting in 35 hours from now which aims to cover this topci
09:44:21 topic
09:44:27 it would be great if you could join
09:44:41 https://wiki.openstack.org/wiki/Meetings/CrossProjectMeeting
09:45:11 it is 2100 UTC?
09:45:14 yes
09:45:24 it's a difficult time for some of us
09:45:24 aspiers: but it's not 100% that it would take place this week
09:45:28 5 am here, :(
09:45:31 :(
09:45:42 like, it was canceled last week
09:45:57 ddeja: thingee sent an email in the last few days asking for someone to chair this week
09:46:11 ddeja: so I think it's probably 80% likely
09:46:28 I am not sure though
09:46:32 ok, but on the other hand there was this mail if some of us can cover the topic
09:46:42 #topic AOB (Any Other Business)
09:47:09 yeah, let's see what thingee says
09:47:27 I can talk with Renat from Mistral team if he can contact Timofey (the guy who originaly put the spec in review)
09:47:30 I think it's really important that at least 1 or 2 of us are there
09:47:44 they work with each other AFAICT
09:47:49 ok, thanks
09:48:17 ddeja: is that time OK for you? or do you think we should try to push for a different time?
09:48:29 masahito: I guess 2100 UTC is a bad time for you too?
09:48:36 aspiers: it's 10 P.M for me, but I can make it
09:48:45 ok
09:48:51 aspiers: yap, it doesn't work for me.
09:48:53 <_gryf> aspiers, I'll try to participate
09:48:59 6am X(
09:49:01 _gryf: me too
09:49:04 masahito: :(
09:49:12 watching evening movie or joining meeting... same fun! ;)
09:49:16 I guess the challenge is that the meeting also needs to cover non-HA topics
09:49:19 haha
09:51:03 any other topics people want to discuss?
09:51:19 aspiers: only short question
09:51:40 didi you get some mail about presentation in Austin?
09:51:47 ddeja: not yet
09:51:56 aspiers: ok
09:52:52 ok 1 minute to raise any other topics
09:53:07 otherwise we can end the meeting slightly early
09:53:25 is the meeting normally about 1 hour?
09:53:31 <_gryf> haukebruno, yup
09:53:54 <_gryf> haukebruno, but you can join #openstack-ha for further discussion anytime
09:53:59 yes
09:54:14 _gryf, thanks, I am also in #openstack-ha ;)
09:54:20 <_gryf> haukebruno, k :)
09:54:26 Qiming: please join #openstack-ha too :)
09:54:33 haukebruno: or use openstack-dev ML to send a mail with [HA] in title
09:54:34 Qiming: and encourage your colleagues to also join
09:54:45 masahito, good to know, thanks
09:55:11 ok great, thanks everyone!
09:55:16 see you next week :)
09:55:22 thanks too, was interesting :)
09:55:34 bye
09:55:43 thanks, bye
09:55:44 <_gryf> cu
09:55:52 bye :)
09:55:54 #endmeeting
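
The health-policy flow described at 09:20:05 and 09:30:21 (poll each node's state, then try the configured recover actions in order) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Senlin's implementation: `HealthPolicy`, `Cluster`, `check_cluster`, and the `get_status` / `recover` callables are hypothetical names, and the sketch leaves out the fencing step that the discussion says recovery requires.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class HealthPolicy:
    # Hypothetical container: recover actions to try, in order.
    # The action names are the ones mentioned in the meeting.
    recover_actions: List[str] = field(
        default_factory=lambda: ["reboot", "rebuild", "evacuate", "recreate"]
    )


@dataclass
class Cluster:
    # Hypothetical: a homogeneous group of nodes (e.g. Nova servers)
    # with one health policy attached.
    name: str
    nodes: List[str]
    policy: HealthPolicy


def check_cluster(
    cluster: Cluster,
    get_status: Callable[[str], Optional[str]],
    recover: Callable[[str, str], bool],
) -> Dict[str, Optional[str]]:
    """Run one polling pass over a cluster.

    get_status(node) is assumed to return a status string such as
    "ACTIVE"; recover(node, action) is assumed to return True when
    the action brought the node back. Both are supplied by the caller.

    Returns a mapping of each unhealthy node to the action that
    recovered it, or None if every configured action failed.
    NOTE: a real system would fence the failed node before recovery;
    that is deliberately not modelled here.
    """
    results: Dict[str, Optional[str]] = {}
    for node in cluster.nodes:
        if get_status(node) == "ACTIVE":
            continue  # healthy node, nothing to do
        results[node] = None
        for action in cluster.policy.recover_actions:
            if recover(node, action):
                results[node] = action
                break
    return results
```

A periodic checker would call `check_cluster` on a timer for every cluster that has a policy attached; whether recovery should run automatically was still an open question in the meeting.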