#openstack-meeting log

08:59:15 <ddeja> #startmeeting HA
08:59:16 <openstack> Meeting started Mon Feb  8 08:59:15 2016 UTC and is due to finish in 60 minutes.  The chair is ddeja. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:59:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:59:19 <openstack> The meeting name has been set to 'ha'
08:59:26 <ddeja> Hello all
08:59:48 <masahito> Hi
09:00:32 <kazuIchikawa> hi
09:01:13 <bogdando> hi
09:01:27 <ddeja> Ok, let's start with a quick status
09:01:29 <_gryf> hi
09:01:42 <ddeja> #topic Quick status report
09:02:29 <ddeja> My status: I have prepared two workflows for instance HA, both can be seen on https://github.com/gryf/mistral-evacuate
09:02:57 <ddeja> For one of them, I have hit a bug https://bugs.launchpad.net/mistral/+bug/1535295
09:02:58 <openstack> Launchpad bug 1535295 in Mistral "Task with join runs more than once" [Undecided,New] - Assigned to Dawid Deja (dawid-deja-0)
09:03:36 <ddeja> That's all from my side
09:03:46 <ddeja> masahito: are you willing to give a report?
09:04:27 <masahito> The patch for enabling masakari to work on pacemaker-remote was merged.
09:04:43 <ddeja> That's great
09:04:56 <masahito> And some bug fixes were also merged.
09:05:02 <masahito> That's from my side.
09:05:04 <ddeja> #info patch for enabling masakari to work on pacemaker-remote was merged.
09:05:28 <ddeja> Ok, cool. Anyone else have something to report?
09:05:43 <bogdando> I have one announcement
09:06:12 <beekhof> hey guys
09:06:25 <masahito> beekhof: hi
09:06:26 <bogdando> a new time was assigned for the HA guide docs meeting. The next meeting is on Feb 10, 17:00 UTC; don't hesitate to participate and add agenda items https://wiki.openstack.org/wiki/Documentation/HA_Guide_Update#Next_Meeting
09:06:27 <ddeja> hi beekhof :)
09:06:29 <bogdando> hi
09:07:34 <ddeja> OK, thanks bogdando
09:08:19 <ddeja> I think we are done in status reports
09:08:36 <ddeja> #topic Mistral Workflow for instance HA
09:08:55 <bogdando> Should we open a nova-client bug related to bug 1535295 as well? E.g. that it should not throw errors when a second evacuate request arrives?
09:08:56 <openstack> bug 1535295 in Mistral "Task with join runs more than once" [Undecided,New] https://launchpad.net/bugs/1535295 - Assigned to Dawid Deja (dawid-deja-0)
09:09:37 <ddeja> I think it's an expected behaviour
09:09:58 <ddeja> I was rather thinking about catching the exception
09:10:23 <ddeja> as a workaround
09:10:52 <beekhof> can i ask how this workflow... works :)
09:11:02 <ddeja> yup, sure
09:11:32 <beekhof> eg. when is filter_vm_action.py called?
09:11:54 <ddeja> beekhof: you are looking at code on master branch, aren't you?
09:12:00 <beekhof> yep
09:12:11 <ddeja> ok, so let me start with explaining this workflow
09:12:38 <ddeja> It starts with calling nova.servers.list() in nova python client
09:12:54 <ddeja> nova.servers_list is a wrapper for python client
09:13:01 <beekhof> oh, i see the yaml now
09:13:18 <ddeja> then it calls filter_vm_action
09:13:23 <beekhof> this is rather neat
09:13:39 <beekhof> i can see the attraction
09:13:55 <ddeja> yup, it's very simple
09:14:38 <ddeja> Right now I'm thinking about adding some action that would check if evacuation succeeded
09:15:19 <ddeja> also, if you change the branch, there is another workflow that separates filtering from asking nova for flavors
09:16:01 <beekhof> interesting
09:16:35 <ddeja> I think it's better since in approach no. 2 there would be fewer nova calls
09:16:36 <masahito> the branch looks like it changes the workflow depending on the extra_spec.
09:16:51 <masahito> looks interesting.
09:16:58 <beekhof> what's the advantage of splitting them up?
09:17:38 <ddeja> masahito: oh, asking for flavor_extra_spec should also happen in first approach, I'll fix that :)
09:17:51 <ddeja> beekhof: fewer calls to nova API
09:18:16 <masahito> I guess it allows admin to define extra_spec attached to evacuated VMs.
09:18:45 <ddeja> masahito: There are two ways of determining if a given VM should be evacuated
09:19:09 <ddeja> 1. It has flavor with extra spec 'evacuation:evacuate' set to True
09:19:32 <ddeja> 2. The VM itself has metadata with the 'evacuate' flag set to True
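The two criteria above can be sketched as a plain Python predicate. This is only an illustration: the function name `should_evacuate`, the dict-shaped inputs, and the way "True" may be encoded (boolean or string) are assumptions, not the actual code of filter_vm_action in the repository.

```python
def _is_true(value):
    """Accept a boolean or a common string spelling of true (assumed encoding)."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")


def should_evacuate(flavor_extra_specs, vm_metadata):
    """Return True if the VM qualifies for evacuation under either criterion.

    Both arguments are plain dicts here (hypothetical shape for illustration).
    """
    # Criterion 1: the flavor has extra spec 'evacuation:evacuate' set to True.
    if _is_true(flavor_extra_specs.get("evacuation:evacuate", False)):
        return True
    # Criterion 2: the VM itself has metadata 'evacuate' set to True.
    return _is_true(vm_metadata.get("evacuate", False))
```

For example, `should_evacuate({}, {"evacuate": "True"})` qualifies through the metadata flag even though the flavor carries no extra spec.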
09:19:51 <ddeja> I should've written it in the repo...
09:19:52 <bogdando> ddeja, could you please put that to the README as well?
09:20:11 <bogdando> yup, you read my mind
09:20:20 <ddeja> bogdando: :)
09:20:50 <ddeja> I'll do it as soon as meeting ends
09:21:07 <masahito> ddeja: sounds nice!
09:21:20 <bogdando> this looks really simple and neat, indeed. So what do you think about the final solution?
09:21:29 <bogdando> like that PoC + fence agent?
09:22:18 <ddeja> So from my side the most simple scenario is: this workflow + really simple fence agent that only calls Mistral API
09:23:09 <bogdando> like to post the YAML?
09:23:31 <ddeja> not really - yaml file should be loaded first and is in DB
09:23:36 <bogdando> ah
09:23:49 <ddeja> you only call it by its name and provide input argiments
09:23:59 <ddeja> arguments*
09:24:06 <bogdando> I like that idea
09:24:37 <ddeja> #action ddeja to write simple fence agent that will call mistral API
09:24:39 <bogdando> and this to be put into the fence topology perhaps
09:24:50 <ddeja> yup
09:25:02 <beekhof> i like it as the mechanism for performing the evacuation, but i can also see scope for some of the masakari pieces for deciding when to trigger it
09:25:32 <ddeja> beekhof: yes, Masakari may call the workflow instead of fence agent
09:25:33 <bogdando> we could use it as a "fallback fence level", if Mistral fails )
09:25:56 <beekhof> one thing though... i like the concept, but how well does it handle corner cases?
09:26:14 <bogdando> so it may be a topology like that: 1) first try Mistral flow, 2) then try the masakari, 3) fence the node
09:26:22 <beekhof> thats the key thing
09:26:42 <ddeja> bogdando: you need to have node fenced before you call evacuate
09:26:51 <ddeja> It's a post-mortem process, when node is dead
09:27:17 <bogdando> ddeja, I see, then it probably cannot fit the classic fence topos
09:27:20 <ddeja> and we can be sure that there won't be two VMs writing to same storage
09:27:57 <ddeja> bogdando: It can, we can have fencing topology like 1) fence node; evacuate VMs (call mistral)
09:28:11 <ddeja> beekhof: Which corner cases do you mean?
09:28:21 <bogdando> at the one level? yes, that should work
09:28:39 <beekhof> mistral nodes falling over while a workflow is in progress
09:28:40 <bogdando> though I may have forgotten the details
09:28:57 <beekhof> services APIs dropping in and out while a workflow is in progress
09:28:59 <beekhof> etc
09:29:16 <ddeja> beekhof: That's a problem with Mistral, unfortunately
09:29:32 <beekhof> nodes returning while a workflow is in progress (WFIIP)
09:29:44 <beekhof> exactly
09:29:59 <ddeja> there is a blueprint for Mistral HA itself
09:30:04 <ddeja> https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
09:30:16 <masahito> ddeja: Does Mistral plan to resolve it in Mitaka?
09:30:17 <ddeja> that should be done for M3
09:30:20 <beekhof> the design is beautiful, but it comes down to the suitability of mistral itself
09:30:51 <ddeja> so for Mitaka there would be spec for what needs to be done
09:31:28 <ddeja> and for N release mistral team would be working on resolving it
09:32:21 <ddeja> beekhof: so answering your question: it is not fully reliable now
09:32:57 <bogdando> Do we have a spec for this initiative, to reflect alternatives and decisions made here?..
09:33:05 <beekhof> so the big question for me is: how long until it is, and can we wait that long?
09:34:08 <ddeja> bogdando: I'm not sure if I understand correctly. Do you mean if we have a spec for using Mistral Workflow for instance evacuate?
09:34:21 <bogdando> ddeja just a spec for instance evacuate
09:34:57 <ddeja> so there is this spec
09:35:00 <ddeja> #link https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
09:35:08 <ddeja> not this one...
09:35:13 <ddeja> #link https://review.openstack.org/#/c/257809/
09:35:16 <ddeja> sorry
09:35:18 <ddeja> this one
09:35:27 <bogdando> thank you
09:35:40 <ddeja> but I'm not the submitter
09:36:15 <ddeja> beekhof: I don't know if we can wait that long
09:36:47 <bogdando> we can make this configurable: how long to wait and which option to fall back to
09:37:01 <beekhof> ddeja: i somewhat suspect that too
09:37:15 <ddeja> current scenario is to have Mistral HA for N release, so it will be like 8 months
09:37:20 <bogdando> assuming 100% reliable things would be a design flaw :)
09:38:11 <bogdando> that is why they use fence topologies AFAICT
09:38:22 <beekhof> ddeja: TBH, i'd be surprised if 8 months is all that was needed
09:38:25 <bogdando> so perhaps we should do likewise
09:38:58 <bogdando> and think of Mistral flow and masakari as just two fence agents
09:39:24 <bogdando> to co-exist and help each other cover more failure modes
09:40:02 <ddeja> bogdando: The problem is that we can, for example, call mistral from fence_agent
09:40:12 <ddeja> and get OK as a response
09:40:40 <bogdando> let's make fence agent verify results before returning its own OL
09:40:41 <ddeja> but it only means that Mistral accepted the request
09:40:42 <bogdando> OK
09:40:58 <ddeja> and in fence agents, we can't wait for the result
09:40:59 <bogdando> by a given timeout
09:41:15 <bogdando> Why can't we?
09:41:29 <bogdando> all agents have things like power timeout to wait for results
09:41:36 <bogdando> here is the same
09:41:36 <beekhof> you're blocking recovery
09:41:41 <ddeja> because fencing freezes the whole cloud, if I remember correctly
09:41:50 <bogdando> oh, why?..
09:41:54 <beekhof> which might be the bit that is needed for the workflow to complete
09:42:01 <beekhof> == deadlock
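The "verify results by a given timeout" idea above can be sketched as a small polling helper. This is a hedged illustration only: `wait_for_result` and its parameters are hypothetical names, `get_state` stands in for whatever call would report the workflow execution state, and the terminal state strings "SUCCESS" and "ERROR" are assumptions about what that call returns.

```python
import time


def wait_for_result(get_state, timeout, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until a terminal state is seen or the timeout expires.

    Returns the terminal state string, or None if the timeout was reached.
    clock and sleep are injectable so the helper can be exercised without waiting.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        # Assumed terminal states; anything else is treated as "still running".
        if state in ("SUCCESS", "ERROR"):
            return state
        if clock() >= deadline:
            return None
        sleep(interval)
```

As the discussion points out, running such a wait inside the fence agent blocks recovery and can deadlock, so a check like this may fit better in a process outside the fencing path.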
09:42:59 <bogdando> well, we could run all evacuate options in parallel in the hope of the idempotence and immutability
09:44:14 <bogdando> I mean using both Mistral and masakari based fence agents to ensure results... Makes sense?..
09:44:40 <bogdando> And leave operators options to decide which agent to go with
09:44:48 <bogdando> It is better to have more options
09:45:06 <ddeja> maybe it's the way to go...
09:45:30 <bogdando> I'm not sure about implementation details, but from design point I see no issues here
09:45:44 <ddeja> on the other hand, there was a discussion to use Masakari to call Mistral workflows
09:45:52 <bogdando> evacuation request must be idempotent and immutable for any agents acting
09:46:40 <ddeja> since Mistral is not HA, maybe Masakari can look if evacuation ended successfully, and if not call Mistral API again?
09:46:50 <ddeja> but that's just an idea
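The retry idea above can be sketched as follows. All names are hypothetical: `trigger` stands in for the call that starts the Mistral workflow and `succeeded` for the check of whether the evacuation ended successfully. The sketch assumes the evacuation request is idempotent, as required earlier in the discussion, so repeating it is safe.

```python
def evacuate_with_retries(trigger, succeeded, max_attempts=3):
    """Start the evacuation and re-trigger it until it is verified as done.

    Returns the number of the attempt that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            trigger()
        except Exception:
            # The workflow service may be unreachable; move on to the next attempt.
            continue
        if succeeded():
            return attempt
    return 0
```

A return value of 0 would be the point where the caller falls back to another mechanism or alerts an operator.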
09:47:02 <bogdando> good point as well
09:47:02 <masahito> bogdando: you mean we make the notification interface from the monitoring process the same for the mistral workflow and masakari?
09:47:49 <bogdando> masahito, yes, monitoring should understand competing agents
09:48:19 <bogdando> or retries, if agents are kinda "nested" and masakari retries Mistral
09:48:40 <masahito> ddeja: just a question. Can Mistral execute evacuate actions in parallel?
09:49:05 <masahito> bogdando: got it.
09:49:27 <ddeja> masahito: yes. If we have like 5 VMs to evacuate, there would be 5 requests running in parallel
09:49:32 <bogdando> that was only my point though, we should think of it and suggest to the spec perhaps...
09:50:31 <masahito> ddeja: thanks.
09:51:15 <ddeja> masahito: but to be sure - you call mistral only once. The parallelism is done inside the mistral engine :)
09:53:18 <ddeja> OK, I guess we are done with this topic; we have a few minutes for open discussion
09:53:24 <ddeja> #topic Open discussion
09:53:53 <masahito> I think we should discuss the details somewhere outside the IRC meeting.
09:54:20 <masahito> because TL will flow.
09:55:33 <ddeja> masahito: I agree, but it will be good to wait till aspiers comes back from vacation
09:55:57 <masahito> ddeja: right.
09:57:03 <kazuIchikawa> another topic, we are working on sqlalchemy support for masakari. Is there any database backend you have in mind to use other than MySQL?
09:57:28 <ddeja> kazuIchikawa: postgres
09:57:38 <_gryf> kazuIchikawa, postgresql is also a popular choice
09:57:43 <kazuIchikawa> ddeja: got it
09:57:50 <ddeja> most OpenStack deployments are on postgresql or mysql
09:59:23 <ddeja> ok, we are running out of time
09:59:37 <ddeja> thank you all for a productive meeting and see you next week :)
09:59:49 <masahito> bye
09:59:51 <ddeja> bye
09:59:58 <_gryf> cu
10:00:08 <kazuIchikawa> bye
10:00:20 <ddeja> #endmeeting