08:59:15 <ddeja> #startmeeting HA
08:59:16 <openstack> Meeting started Mon Feb 8 08:59:15 2016 UTC and is due to finish in 60 minutes. The chair is ddeja. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:59:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:59:19 <openstack> The meeting name has been set to 'ha'
08:59:26 <ddeja> Hello all
08:59:48 <masahito> Hi
09:00:32 <kazuIchikawa> hi
09:01:13 <bogdando> hi
09:01:27 <ddeja> Ok, let's start with a quick status
09:01:29 <_gryf> hi
09:01:42 <ddeja> #topic Quick status report
09:02:29 <ddeja> My status: I have prepared two workflows for instance HA, both can be seen on https://github.com/gryf/mistral-evacuate
09:02:57 <ddeja> For one of them, I have hit a bug https://bugs.launchpad.net/mistral/+bug/1535295
09:02:58 <openstack> Launchpad bug 1535295 in Mistral "Task with join runs more than once" [Undecided,New] - Assigned to Dawid Deja (dawid-deja-0)
09:03:36 <ddeja> That's all from my side
09:03:46 <ddeja> masahito: are you willing to give a report?
09:04:27 <masahito> The patch for enabling masakari to work on pacemaker-remote was merged.
09:04:43 <ddeja> That's great
09:04:56 <masahito> And some bug fixes were also merged.
09:05:02 <masahito> That's from my side.
09:05:04 <ddeja> #info patch for enabling masakari to work on pacemaker-remote was merged.
09:05:28 <ddeja> Ok, cool. Anyone else have something to report?
09:05:43 <bogdando> I have one announcement
09:06:12 <beekhof> hey guys
09:06:25 <masahito> beekhof: hi
09:06:26 <bogdando> new ha guide docs time was assigned. Next meeting is at Feb 10, 17:00 UTC, don't hesitate to participate and add agenda items https://wiki.openstack.org/wiki/Documentation/HA_Guide_Update#Next_Meeting
09:06:27 <ddeja> hi beekhof :)
09:06:29 <bogdando> hi
09:07:34 <ddeja> OK, thanks bogdando
09:08:19 <ddeja> I think we are done with status reports
09:08:36 <ddeja> #topic Mistral Workflow for instance HA
09:08:55 <bogdando> Should we open a nova-client bug related to the bug 1535295 as well? Like it shall not throw errors on the second evacuate request arrival?
09:08:56 <openstack> bug 1535295 in Mistral "Task with join runs more than once" [Undecided,New] https://launchpad.net/bugs/1535295 - Assigned to Dawid Deja (dawid-deja-0)
09:09:37 <ddeja> I think it's an expected behaviour
09:09:58 <ddeja> I was rather thinking about catching the exception
09:10:23 <ddeja> as a workaround
09:10:52 <beekhof> can i ask how this workflow... works :)
09:11:02 <ddeja> yup, sure
09:11:32 <beekhof> eg. when is filter_vm_action.py called?
09:11:54 <ddeja> beekhof: you are looking at code on master branch, aren't you?
09:12:00 <beekhof> yep
09:12:11 <ddeja> ok, so let me start with explaining this workflow
09:12:38 <ddeja> It starts with calling nova.servers.list() in nova python client
09:12:54 <ddeja> nova.servers_list is a wrapper for python client
09:13:01 <beekhof> oh, i see the yaml now
09:13:18 <ddeja> then it calls filter_vm_action
09:13:23 <beekhof> this is rather neat
09:13:39 <beekhof> i can see the attraction
09:13:55 <ddeja> yup, it's very simple
09:14:38 <ddeja> Right now I'm thinking about adding some action that would check if evacuation succeeded
09:15:19 <ddeja> also, if you change the branch, there is another workflow, that separates filtering and asking nova for flavors
09:16:01 <beekhof> interesting
09:16:35 <ddeja> I think it's better since in approach no 2, there would be fewer nova calls
09:16:36 <masahito> the branch looks like it changes the workflow depending on the extra_spec.
09:16:51 <masahito> looks interesting.
09:16:58 <beekhof> what's the advantage of splitting them up?
09:17:38 <ddeja> masahito: oh, asking for flavor_extra_spec should also happen in first approach, I'll fix that :)
09:17:51 <ddeja> beekhof: fewer calls to nova API
09:18:16 <masahito> I guess it allows admin to define extra_spec attached to evacuated VMs.
09:18:45 <ddeja> masahito: There are two ways of determining if given VM should be evacuated
09:19:09 <ddeja> 1. It has flavor with extra spec 'evacuation:evacuate' set to True
09:19:32 <ddeja> 2. VM itself has a metadata with 'evacuate' flag set to True
09:19:51 <ddeja> I should've written it in repo...
09:19:52 <bogdando> ddeja, could you please put that to the README as well?
09:20:11 <bogdando> yup, read my mind
09:20:20 <ddeja> bogdando: :)
09:20:50 <ddeja> I'll do it as soon as meeting ends
09:21:07 <masahito> ddeja: sounds nice!
09:21:20 <bogdando> this looks really simple and neat, indeed. So what do you think about the final solution?
09:21:29 <bogdando> like that PoC + fence agent?
09:22:18 <ddeja> So from my side the most simple scenario is: this workflow + really simple fence agent that only calls Mistral API
09:23:09 <bogdando> like to post the YAML?
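
[Editor's note] The two evacuation criteria ddeja lists at 09:18:45-09:19:32 can be sketched as a small predicate. This is only an illustration, not the code in the mistral-evacuate repository: the function name `should_evacuate` and the plain-dict inputs are assumptions, and only the two keys ('evacuation:evacuate' on the flavor, 'evacuate' on the VM) come from the discussion. Values are compared as strings because metadata values are typically stored as strings.

```python
def should_evacuate(vm_metadata, flavor_extra_specs):
    """Return True if the VM or its flavor opts in to evacuation.

    Hypothetical helper: both arguments are plain dicts standing in for
    the VM metadata and the flavor extra specs fetched from nova.
    """
    def is_true(value):
        # Accept True, "True", "true", etc.; anything else counts as False.
        return str(value).strip().lower() == "true"

    # Criterion 1: flavor extra spec 'evacuation:evacuate' set to True.
    by_flavor = is_true(flavor_extra_specs.get("evacuation:evacuate", False))
    # Criterion 2: VM metadata flag 'evacuate' set to True.
    by_metadata = is_true(vm_metadata.get("evacuate", False))
    return by_flavor or by_metadata
```

Checking the VM metadata first would let a workflow skip the flavor lookup for VMs that already opt in, which is in the spirit of the "fewer nova calls" argument above.
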
09:23:31 <ddeja> not really - yaml file should be loaded first and is in DB
09:23:36 <bogdando> ah
09:23:49 <ddeja> you only call it by its name and provide input argiments
09:23:59 <ddeja> arguments*
09:24:06 <bogdando> I like that idea
09:24:37 <ddeja> #action ddeja to write simple fence agent that will call mistral API
09:24:39 <bogdando> and this to be put into the fence topology perhaps
09:24:50 <ddeja> yup
09:25:02 <beekhof> i like it as the mechanism for performing the evacuation, but i can also see scope for some of the masakari pieces for deciding when to trigger it
09:25:32 <ddeja> beekhof: yes, Masakari may call the workflow instead of fence agent
09:25:33 <bogdando> we could use it as a "fallback fence level", if Mistral fails )
09:25:56 <beekhof> one thing though... i like the concept, but how well does it handle corner cases?
09:26:14 <bogdando> so it may be a topology like that: 1) first try Mistral flow, 2) then try the masakari, 3) fence the node
09:26:22 <beekhof> that's the key thing
09:26:42 <ddeja> bogdando: you need to have node fenced before you call evacuate
09:26:51 <ddeja> It's a post-mortem process, when node is dead
09:27:17 <bogdando> ddeja, I see, then it probably cannot fit the classic fence topos
09:27:20 <ddeja> and we can be sure that there won't be two VMs writing to same storage
09:27:57 <ddeja> bogdando: It can, we can have fencing topology like 1) fence node; evacuate VMs (call mistral)
09:28:11 <ddeja> beekhof: Which corner cases do you mean?
09:28:21 <bogdando> at the one level? yes, that should work
09:28:39 <beekhof> mistral nodes falling over while a workflow is in progress
09:28:40 <bogdando> though, I may have forgotten details
09:28:57 <beekhof> services APIs dropping in and out while a workflow is in progress
09:28:59 <beekhof> etc
09:29:16 <ddeja> beekhof: That's a problem with Mistral, unfortunately
09:29:32 <beekhof> nodes returning while a WFIIP
09:29:44 <beekhof> exactly
09:29:59 <ddeja> there is a blueprint for Mistral HA itself
09:30:04 <ddeja> https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
09:30:16 <masahito> ddeja: Does Mistral plan to resolve it in Mitaka?
09:30:17 <ddeja> that should be done for M3
09:30:20 <beekhof> the design is beautiful, but it comes down to the suitability of mistral itself
09:30:51 <ddeja> so for Mitaka there would be spec for what needs to be done
09:31:28 <ddeja> and for N release mistral team would be working on resolving it
09:32:21 <ddeja> beekhof: so answering your question: It is not fully reliable now
09:32:57 <bogdando> Do we have a spec for this initiative, to reflect alternatives and decisions made here?..
09:33:05 <beekhof> so the big question for me is: how long until it is, and can we wait that long?
09:34:08 <ddeja> bogdando: I'm not sure if I understand correctly. Do you mean if we have a spec for using Mistral Workflow for instance evacuate?
09:34:21 <bogdando> ddeja just a spec for instance evacuate
09:34:57 <ddeja> so there is this spec
09:35:00 <ddeja> #link https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
09:35:08 <ddeja> not this one...
09:35:13 <ddeja> #link https://review.openstack.org/#/c/257809/
09:35:16 <ddeja> sorry
09:35:18 <ddeja> this one
09:35:27 <bogdando> thank you
09:35:40 <ddeja> but I'm not the submitter
09:36:15 <ddeja> beekhof: I don't know if we can wait that long
09:36:47 <bogdando> we can make this configurable, how long to wait and which option to fall back to
09:37:01 <beekhof> ddeja: i somewhat suspect that too
09:37:15 <ddeja> current scenario is to have Mistral HA for N release, so it will be like 8 months
09:37:20 <bogdando> assuming 100% reliable things would be a design flaw :)
09:38:11 <bogdando> that is why they use fence topologies AFAICT
09:38:22 <beekhof> ddeja: TBH, i'd be surprised if 8 months is all that was needed
09:38:25 <bogdando> so perhaps we should do the like
09:38:58 <bogdando> and think of Mistral flow and masakari as just two fence agents
09:39:24 <bogdando> to co-exist and help each other to cover more fail modes
09:40:02 <ddeja> bogdando: The problem is that we can, for example, call mistral from fence_agent
09:40:12 <ddeja> and get OK as a response
09:40:40 <bogdando> let's make fence agent to verify results before returning its own OL
09:40:41 <ddeja> but it only means that Mistral accepted the request
09:40:42 <bogdando> OK
09:40:58 <ddeja> and in fence agents, we can't wait for the result
09:40:59 <bogdando> by a given timeout
09:41:15 <bogdando> Why can't we?
09:41:29 <bogdando> all agents have things like power timeout to wait for results
09:41:36 <bogdando> here is the same
09:41:36 <beekhof> you're blocking recovery
09:41:41 <ddeja> because fencing freezes the whole cloud, if I remember correctly
09:41:50 <bogdando> oh, why?..
09:41:54 <beekhof> which might be the bit that is needed for the workflow to complete
09:42:01 <beekhof> == deadlock
09:42:59 <bogdando> well, we could run all evacuate options in parallel in the hope of the idempotence and immutability
09:44:14 <bogdando> I mean using both Mistral and masakari based fence agents to ensure results... Makes sense?..
09:44:40 <bogdando> And leave operators options to decide which agent to go with
09:44:48 <bogdando> It is better to have more options
09:45:06 <ddeja> maybe it's the way to go...
09:45:30 <bogdando> I'm not sure about implementation details, but from design point I see no issues here
09:45:44 <ddeja> on the other hand, there was a discussion to use Masakari to call Mistral workflows
09:45:52 <bogdando> evacuation request must be idempotent and immutable for any agents acting
09:46:40 <ddeja> since Mistral is not HA, maybe Masakari can look if evacuation ended successfully, and if not call Mistral API again?
09:46:50 <ddeja> but that's just an idea
09:47:02 <bogdando> good point as well
09:47:02 <masahito> bogdando: you mean we make the notification interface from the monitoring process the same for the mistral workflow and masakari?
09:47:49 <bogdando> masahito, yes, monitoring should understand competing agents
09:48:19 <bogdando> or retries, if agents are kinda "nested" and masakari retries Mistral
09:48:40 <masahito> ddeja: just question. Can Mistral execute evacuate action in parallel?
09:49:05 <masahito> bogdando: got it.
09:49:27 <ddeja> masahito: yes. If we have like 5 VMs to evacuate, there would be 5 requests running in parallel
09:49:32 <bogdando> that was only my point though, we should think of it and suggest to the spec perhaps...
09:50:31 <masahito> ddeja: thanks.
09:51:15 <ddeja> masahito: but to be sure - you call mistral only once. The parallelism is done inside the mistral engine :)
09:53:18 <ddeja> OK, I guess with this topic, we have a few minutes for open discussion
09:53:24 <ddeja> #topic Open discussion
09:53:53 <masahito> I think we should discuss the details somewhere outside the IRC meeting.
09:54:20 <masahito> because TL will flow.
09:55:33 <ddeja> masahito: I agree, but it will be good to wait till aspiers comes back from vacation
09:55:57 <masahito> ddeja: right.
09:57:03 <kazuIchikawa> another topic, we are working on sqlalchemy support of masakari. Is there any database backend you have in mind to use other than MySQL?
09:57:28 <ddeja> kazuIchikawa: postgres
09:57:38 <_gryf> kazuIchikawa, postgresql is also a popular choice
09:57:43 <kazuIchikawa> ddeja: got it
09:57:50 <ddeja> most OpenStack deployments are on postgresql or mysql
09:59:23 <ddeja> ok, we are running out of time
09:59:37 <ddeja> thank you all for a productive meeting and see you next week :)
09:59:49 <masahito> bye
09:59:51 <ddeja> bye
09:59:58 <_gryf> cu
10:00:08 <kazuIchikawa> bye
10:00:20 <ddeja> #endmeeting
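
[Editor's note] The idea floated at 09:46:40, that a monitor such as Masakari could check whether evacuation succeeded and call the Mistral API again if not, might look roughly like the loop below. This is a sketch under stated assumptions and not Masakari or Mistral code: `start_workflow` and `is_evacuated` are hypothetical callables supplied by the caller (for example, one that starts the workflow execution and one that asks nova where the VMs now run). It relies on the evacuation request being idempotent, as bogdando requires, and it is meant to run outside the fence agent, since the discussion concludes that a fence agent must not block waiting for the result.

```python
import time


def evacuate_with_retry(start_workflow, is_evacuated, attempts=3,
                        polls_per_attempt=5, poll_interval=1.0,
                        sleep=time.sleep):
    """Trigger evacuation and verify it, retrying if it does not complete.

    start_workflow: hypothetical callable that asks Mistral to run the
        evacuation workflow; an "OK" from it only means "accepted".
    is_evacuated: hypothetical callable returning True once the VMs are
        confirmed to be running elsewhere.
    Returns the number of the attempt that succeeded, or None if every
    attempt timed out.
    """
    for attempt in range(1, attempts + 1):
        start_workflow()  # safe to repeat only if the request is idempotent
        for _ in range(polls_per_attempt):
            if is_evacuated():
                return attempt
            sleep(poll_interval)
    return None
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; the attempt and poll counts are arbitrary placeholders, not values discussed in the meeting.
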