#openstack-meeting log

09:00:00 <aspiers> #startmeeting HA
09:00:01 <openstack> Meeting started Mon Nov 23 09:00:00 2015 UTC and is due to finish in 60 minutes.  The chair is aspiers. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:04 <openstack> The meeting name has been set to 'ha'
09:00:13 <aspiers> Hi all and welcome to the 2nd ever official HA IRC meeting!
09:00:21 <aspiers> I have done a bit of research on how to run an IRC meeting better, and also a bit more advance preparation this time.
09:00:25 <_gryf> hi everyone :)
09:00:33 <bogdando> hi
09:00:35 <ddeja> Hello
09:00:35 <aspiers> so even though IMHO our previous meeting was a really great start, hopefully this one will be even smoother :)
09:00:49 <aspiers> The plan is to follow the agenda listed on https://wiki.openstack.org/wiki/Meetings/HATeamMeeting
09:00:59 <aspiers> This should make sure everyone gets a chance to raise topics
09:01:12 <aspiers> #topic Minutes and logs from previous meeting
09:01:24 <aspiers> Firstly I have to apologise for not starting the last meeting via "#startmeeting HA"
09:01:31 <aspiers> As a result, the logs of the previous meeting have not appeared at http://eavesdrop.openstack.org/meetings/ha/ but instead ...
09:01:41 <aspiers> #link minutes of previous meeting are at http://eavesdrop.openstack.org/meetings/ha__automated_recovery_from_hypervisor_failure/2015/ha__automated_recovery_from_hypervisor_failure.2015-11-16-09.00.html
09:01:53 <aspiers> which is a considerably uglier URL :-(
09:02:08 <aspiers> I have updated https://wiki.openstack.org/wiki/Meetings/HATeamMeeting accordingly to make them easier to find.
09:02:26 <aspiers> I've also asked in #openstack-infra about renaming them for consistency, and apparently that is possible, so I will get that done
09:02:39 <aspiers> BTW for the benefit of nice meeting minutes, I'm going to use #info to summarise topics :-)
09:02:49 <aspiers> #topic Actions from previous meeting
09:03:02 <aspiers> There were 3 actions from the last meeting which I think I can summarise quickly:
09:03:18 <aspiers> #info ddeja has shared the mistral PoC code on github
09:03:25 <aspiers> #link https://github.com/gryf/mistral-evacuate
09:03:35 <aspiers> then there was an action for _gryf which seems to have been done:
09:03:43 <aspiers> #info etherpad has been updated with more details of the mistral PoC
09:03:53 <aspiers> #link https://etherpad.openstack.org/p/automatic-evacuation
09:04:09 <aspiers> Finally, there was an action for masahito to investigate possibility of converging masakari with Mistral PoC, once details of the latter are published
09:04:18 <aspiers> However since they have only just been published, I would be really impressed if he had time to do this yet ;-)
09:04:30 <aspiers> masahito: I guess you can start looking at this in the next week or two?
09:05:05 <masahito> yep. I'll look at the details of the repo in the next week or two.
09:05:13 <aspiers> cool, thanks :)
09:05:41 <aspiers> #action masahito will review mistral code and start thinking about routes for convergence
09:05:53 <aspiers> so those were the 3 actions from last time
09:06:00 <aspiers> #topic Current status (progress, issues, roadblocks, further plans)
09:06:14 <aspiers> I thought it would be useful if we did a brief round of status updates, where we each say what HA work we did this week, what we're planning next, any issues etc.
09:06:21 <masahito> I'm not familiar with Mistral, so is the link in the etherpad enough to understand what Mistral is?
09:06:36 <aspiers> probably not, but you can also look here:
09:06:44 <_gryf> masahito, I'll put some more links
09:07:03 <aspiers> #link https://wiki.openstack.org/wiki/Mistral
09:07:07 <_gryf> just for convenience
09:07:09 <masahito> _gryf: got it. I'll check it too.
09:07:28 <aspiers> don't worry if you don't have anything to report, or simply don't want to report anything - this status report is optional :)
09:07:37 <aspiers> but it's a good opportunity to share news if you want to
09:07:43 <aspiers> Let's take turns so that we don't all speak at once :)
09:07:52 <_gryf> so let me be first
09:07:52 <aspiers> I'll go first
09:07:55 <aspiers> oh ok :)
09:07:57 <_gryf> :>
09:08:11 <aspiers> please go ahead :)
09:08:27 <_gryf> together with ddeja we were working on preparing the simplest use case implemented as a Mistral workflow
09:08:41 <_gryf> the result is on the github
09:09:01 <_gryf> (link provided in the etherpad)
09:09:01 <jklare> hi o/
09:09:08 <aspiers> hi jklare :)
09:09:31 <aspiers> _gryf: do you want to briefly summarise how it works, or just point people at the README / etherpad?
09:09:47 <_gryf> we bumped into a critical bug which ddeja is currently resolving
09:10:10 <_gryf> sure. maybe I'll elaborate on it here
09:10:45 <_gryf> there is an assumption that we have a control plane which is in HA using pacemaker
09:11:02 <aspiers> sounds good
09:11:13 <_gryf> and all the compute nodes are using pacemaker_remote
09:11:22 <aspiers> #info _gryf and ddeja are preparing the simplest evac use case implemented via Mistral https://github.com/gryf/mistral-evacuate
09:11:26 <_gryf> nothing different from the RH-provided solution
09:12:25 <aspiers> ok
09:12:33 <_gryf> if there is a problem with a compute node, then it would be (eventually) fenced, and right after that, the evacuation process would be triggered
09:12:40 <aspiers> I've had a quick look at the repo and it doesn't seem too hard to understand
09:13:14 <aspiers> so currently the evac is triggered by hostname?
09:13:21 <_gryf> right
09:13:22 <aspiers> according to that JSON file?
09:13:33 <_gryf> the json file can be prepared on the fly
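The idea of preparing the workflow's JSON input on the fly, keyed by the failed host's name, can be sketched roughly as follows. This is only an illustration: the key names `host` and `on_shared_storage` and the helper `build_evacuation_input` are hypothetical and are not taken from the mistral-evacuate repository, whose actual input format should be checked in its README.

```python
import json


def build_evacuation_input(hostname, on_shared_storage=False):
    """Return a JSON string naming the failed host for an evacuation workflow.

    The key names used here are hypothetical placeholders, not the real
    input schema of the mistral-evacuate PoC.
    """
    if not hostname:
        raise ValueError("hostname must be non-empty")
    payload = {
        "host": hostname,                        # hypothetical key
        "on_shared_storage": on_shared_storage,  # hypothetical key
    }
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # A fencing hook could generate this right after the node is fenced.
    print(build_evacuation_input("compute-1"))
```

Generating the input at fencing time avoids keeping a static file per compute node.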
09:13:45 <aspiers> but in the future we could have mistral constantly monitoring for hosts with the evacuate attribute set?
09:13:55 <aspiers> like NovaEvacuate currently does?
09:14:06 <aspiers> or we would keep NovaEvacuate to do that?
09:14:09 <_gryf> actually - no
09:14:24 <bogdando> folks, does evacuate require access to the fenced compute node?
09:14:47 <aspiers> bogdando: if you mean "nova evacuate", no
09:14:55 <_gryf> we can make mistral "monitor" the hosts, but it would be overkill IMO
09:15:05 <_gryf> bogdando, not at all
09:15:19 <bogdando> so it is a STONITH, for example, and then an evacuate, and it works
09:15:26 <bogdando> looks good then, thank you
09:15:37 <_gryf> mistral might be configured to perform some actions periodically
09:15:52 <aspiers> _gryf: ok. since this is just the status update part of the meeting I guess we don't have to spend too much time on details here :)
09:15:54 <_gryf> but in the manner of cron, rather than like a heartbeat
09:16:07 <_gryf> ok
09:16:08 <aspiers> we can delve deeper later in the meeting if we have time
09:16:26 <aspiers> anything else to report from your side?
09:16:32 <_gryf> nope :)
09:16:36 <aspiers> also ddeja?
09:17:29 <ddeja> as _gryf said, right now I'm working on a critical bug in Mistral
09:18:02 <aspiers> OK thanks, if nothing else to add then I'll give my status report
09:18:17 <ddeja> OK
09:18:28 <aspiers> #info ddeja working on a bug in the Mistral PoC
09:18:32 <aspiers> #info NovaCompute / NovaEvacuate RAs are now in openstack-resource-agents
09:18:48 <aspiers> #link https://git.openstack.org/cgit/openstack/openstack-resource-agents/
09:19:00 <aspiers> Question for all, but especially Red Hat / Intel / anyone else who currently uses the Nova{Compute,Evacuate} resource agents:
09:19:09 <aspiers> are you OK with me renaming them to nova-compute and nova-evacuate, for consistency with the other nova RAs?
09:19:19 <aspiers> too bad beekhof couldn't be here
09:19:33 <_gryf> aspiers, it's fine from our side
09:19:36 <aspiers> since I guess this mostly affects his work, I'll check with him later
09:19:37 <bogdando> aspiers, nova-compute will be confused with the classic nova-compute
09:19:44 <aspiers> bogdando: oh yeah, good point!
09:20:17 <aspiers> bogdando: so maybe we would need nova-compute-server and nova-compute-client or something
09:20:27 <aspiers> hmm
09:20:51 <aspiers> #info naming is inconsistent with existing nova-* RAs, hopefully we can rename for consistency
09:20:55 <bogdando> I mean the nova-compute service, it may be run by the pacemaker RA as well
09:21:07 <aspiers> bogdando: you're absolutely right
09:21:09 <bogdando> so nova-compute-server/client may still look confusing :)
09:21:18 <aspiers> I agree, not the best names ...
09:21:27 <aspiers> but even nova-compute-hypervisor is not right
09:21:39 <aspiers> since e.g. Docker nodes are not really hypervisor-based
09:21:48 <bogdando> what does it run exactly?
09:22:06 <aspiers> it's responsible for the nova-compute service on the compute nodes
09:22:27 <ddeja> bogdando: it runs nova-compute service BUT it's not using systemd
09:22:38 <aspiers> oh wait, there is no existing nova-compute OCF RA
09:22:40 <aspiers> so there is no conflict
09:22:41 <bogdando> well, then I understood it wrong
09:22:48 <bogdando> and the name should be nova-compute
09:22:57 <aspiers> ok cool, that makes it easy then :)
09:23:14 <aspiers> #action aspiers will talk to beekhof about the potential rename (or just submit a gerrit review)
09:23:23 <_gryf> the resource agents are separate from the services, esp. as there are others - like cinder-api, nova-network and so on
09:23:30 <aspiers> right
09:23:43 <aspiers> I'm also considering making the RAs a wrapper around service(8)
09:23:44 <_gryf> so i think it's fine to rename those accordingly
09:23:56 <aspiers> so that they reuse any existing systemd / SysV stuff
09:24:06 <aspiers> instead of reimplementing it in a way which is inconsistent with the underlying OS
09:24:10 <aspiers> and OpenStack packages
09:24:24 <aspiers> so the RAs would become thinner
09:24:30 <_gryf> aspiers, AFAIRC beekhof had some objections regarding systemd used as an RA
09:24:39 <aspiers> _gryf: I'm not proposing to use systemd as an RA
09:24:52 <aspiers> _gryf: I'm proposing for the RAs to wrap the systemd services
09:25:07 <aspiers> that way we can still do monitoring at the service level
09:25:12 <aspiers> which is the main thing systemd is missing
09:25:50 <aspiers> #idea change OCF RAs so that they invoke service(8), in order to wrap around existing systemd / SysV services
09:25:50 <_gryf> no no. I'm just saying that there were some issues with using systemd as a monitoring tool
09:26:00 <aspiers> agreed
09:26:03 <bogdando> aspiers, do you mean service foo status to call the OCF agent?
09:26:13 <aspiers> bogdando: no, the other way around
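The direction aspiers describes (the OCF RA calls service(8), not the reverse) can be sketched as a small mapping from OCF actions to service(8) invocations and back to OCF exit codes. This is a minimal sketch, not the design of any existing agent: the helper names and the service name `openstack-nova-compute` are assumptions, and a real RA would add its own service-level checks in `monitor`. The exit code values are the standard OCF and LSB ones.

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# LSB exit codes for the "status" action of an init script / service(8).
LSB_STATUS_RUNNING = 0
LSB_STATUS_NOT_RUNNING = 3

# OCF action -> service(8) verb (illustrative subset only).
_VERBS = {"start": "start", "stop": "stop", "monitor": "status"}


def service_command(service, ocf_action):
    """Return the service(8) argv for an OCF action, or None if unsupported."""
    verb = _VERBS.get(ocf_action)
    if verb is None:
        return None
    return ["service", service, verb]


def ocf_exit_code(ocf_action, service_rc):
    """Translate the wrapped command's exit code into an OCF exit code."""
    if ocf_action == "monitor":
        if service_rc == LSB_STATUS_RUNNING:
            return OCF_SUCCESS
        if service_rc == LSB_STATUS_NOT_RUNNING:
            return OCF_NOT_RUNNING
        return OCF_ERR_GENERIC
    return OCF_SUCCESS if service_rc == 0 else OCF_ERR_GENERIC


if __name__ == "__main__":
    # Hypothetical service name, shown only to illustrate the mapping.
    print(service_command("openstack-nova-compute", "monitor"))
    print(ocf_exit_code("monitor", LSB_STATUS_NOT_RUNNING))
```

Keeping the translation in one place is what would let the RA stay thin while Pacemaker still sees proper OCF semantics.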
09:27:20 <aspiers> anyway, this is something for further discussion and thought
09:27:32 <aspiers> we shouldn't aim to finalise the plan now ;-)
09:27:42 <aspiers> so continuing with my status
09:27:53 <aspiers> I'm also beginning to dripfeed my patches to this new location:
09:28:03 <aspiers> https://review.openstack.org/#/projects/openstack/openstack-resource-agents,dashboards/important-changes:review-inbox-dashboard
09:28:24 <aspiers> #info aspiers is feeding his patches "upstream" to the new RAs
09:28:43 <aspiers> on another topic ...
09:28:45 <aspiers> #info aspiers gave a presentation on compute node HA to OpenStack London Meetup
09:28:54 <aspiers> this was last Wednesday, and there were 50-60 people in the audience
09:29:02 <aspiers> #link http://www.slideshare.net/adamspiers/compute-node-ha-current-upstream-development
09:29:13 <aspiers> I didn't have time for Q&A but people seemed to appreciate the talk :)
09:29:32 <aspiers> #info new pacemaker_transaction Chef LWRP in progress
09:29:45 <aspiers> last week I was working on adding support for multi-object transactions into the Pacemaker Chef cookbook
09:29:52 <aspiers> #link https://github.com/crowbar/crowbar-ha/compare/master...aspiers:transactions
09:30:02 <aspiers> This will allow us to add multiple objects to the CIB in one go
09:30:14 <aspiers> This is useful for being able to safely add resources which only run on the compute nodes (or controller nodes)
09:30:35 <aspiers> since then it is possible to simultaneously add a primitive (e.g. libvirtd) and a location constraint which prevents Pacemaker from trying to run libvirtd on controller nodes
09:30:38 <bogdando> aspiers, great
09:30:47 <aspiers> It's possible to achieve the same thing in most cases by adding the primitive with target-role="Stopped", but this can lead to other problems
09:31:00 <aspiers> (it can even cause fencing in the DRBD case, don't ask ... :-/ )
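The multi-object transaction described above can be illustrated by building a primitive and its location constraints as one batch of crmsh configuration text, to be loaded in a single commit (for instance with something like `crm configure load update <file>`). This is a sketch under assumptions, not the Chef LWRP itself: the function name, constraint IDs, agent `systemd:libvirtd` and node names are all made up for the example.

```python
def build_transaction(resource, agent, excluded_nodes):
    """Return crmsh configuration text defining a primitive together with
    -inf location constraints that keep it off the given nodes.

    Loading the whole text in one commit means Pacemaker never sees the
    primitive without its constraints.
    """
    lines = [f"primitive {resource} {agent} op monitor interval=30s"]
    for node in excluded_nodes:
        # Constraint IDs are generated here purely for illustration.
        lines.append(
            f"location l-{resource}-avoid-{node} {resource} -inf: {node}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical controller node names.
    print(build_transaction("libvirtd", "systemd:libvirtd",
                            ["controller1", "controller2"]), end="")
```

Compared with adding the primitive as target-role="Stopped" and constraining it afterwards, the batch never exposes an intermediate state to the cluster.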
09:31:19 <aspiers> that's it for my status update
09:31:23 <aspiers> masahito: any news you want to report?
09:31:28 <bogdando> note, in fuel we developed a completely new pacemaker cs_resource provider to allow simultaneous CIB access. Do you plan to contribute yours upstream to community-corosync?
09:31:35 <masahito> yeah
09:31:48 <masahito> I checked whether Masakari can work on pacemaker-remote or not.
09:31:51 <aspiers> bogdando: sure, all my code is on github already
09:31:55 <bogdando> Our story is not so great, the pacemaker module still lives downstream :(
09:32:05 <aspiers> masahito: oh, great!
09:32:19 <aspiers> bogdando: I didn't realise you used Chef. We should definitely collaborate then
09:32:32 <bogdando> aspiers, am I right that this is about running CIB configuration in parallel?
09:32:47 <aspiers> bogdando: not the shadow CIB
09:33:03 <aspiers> bogdando: simply changing more than one CIB object in a single commit
09:33:10 <masahito> Masakari host-monitor just monitors Online[] and Offline[] in output of crm now.
09:33:15 <bogdando> no, ours is for Puppet
09:33:22 <aspiers> bogdando: ah right
09:33:26 <bogdando> let me share it, just for the info
09:33:37 <bogdando> it allows running crm configure in parallel from several nodes
09:33:45 <aspiers> masahito: what did it monitor before?
09:33:58 <masahito> So if we use the host-monitor with pacemaker-remote, we need some patch to use it.
09:34:12 <aspiers> masahito: oh I see what you mean
09:34:16 <bogdando> https://github.com/openstack/fuel-library/tree/master/deployment/puppet/pacemaker
09:34:22 <aspiers> masahito: how is it monitoring? via crmsh?
09:34:25 <masahito> Remote nodes run by pacemaker-remote would be shown as RemoteOnline, right?
09:34:34 <aspiers> masahito: it depends on how you monitor
09:35:01 <masahito> aspiers: I think the monitor currently calls crm_mon.
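To make the Online[] / Offline[] discussion concrete, here is a rough sketch of a parser that also picks up the separate remote-node lines. The sample text is illustrative rather than captured from a real cluster, the exact labels can differ between Pacemaker versions, and this is not Masakari's actual host-monitor code.

```python
import re

# Matches lines such as "Online: [ node1 node2 ]" or "RemoteOFFLINE: [ n3 ]".
_NODE_LINE = re.compile(
    r"^\s*(Online|OFFLINE|RemoteOnline|RemoteOFFLINE):\s*\[\s*(.*?)\s*\]\s*$"
)


def parse_node_status(text):
    """Return a dict mapping each status label to its list of node names."""
    status = {}
    for line in text.splitlines():
        match = _NODE_LINE.match(line)
        if match:
            label, names = match.groups()
            status.setdefault(label, []).extend(names.split())
    return status


# Illustrative sample only; real crm_mon output contains much more.
SAMPLE = """\
Online: [ controller1 controller2 ]
OFFLINE: [ controller3 ]
RemoteOnline: [ compute1 compute2 ]
RemoteOFFLINE: [ compute3 ]
"""

if __name__ == "__main__":
    for label, nodes in parse_node_status(SAMPLE).items():
        print(label, nodes)
```

A monitor that only looks at the first two labels would miss every compute node managed through pacemaker_remote, which is why a patch would be needed.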
09:35:12 <aspiers> masahito: ah I see. I can suggest better ways to monitor
09:36:00 <aspiers> masahito: it is now possible to use crm node list
09:36:39 <aspiers> #info there are various ways to monitor the status of pacemaker_remote nodes, e.g. https://github.com/ClusterLabs/crmsh/issues/103
09:36:59 <aspiers> also cibadmin -Q --xpath '/cib/status/node_state[@remote_node="true"]'
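The XPath query above can be tried offline against a saved CIB dump. The sketch below applies the same predicate with Python's standard library; the XML fragment is a heavily trimmed, made-up example (real node_state entries carry more attributes and child elements), and `remote_node_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal CIB fragment for illustration only.
SAMPLE_CIB = """\
<cib>
  <status>
    <node_state id="1" uname="controller1"/>
    <node_state id="compute1" uname="compute1" remote_node="true"/>
    <node_state id="compute2" uname="compute2" remote_node="true"/>
  </status>
</cib>
"""


def remote_node_names(cib_xml):
    """Return the uname of every node_state marked remote_node="true"."""
    root = ET.fromstring(cib_xml)  # root element is <cib>
    matches = root.findall("./status/node_state[@remote_node='true']")
    return [node.get("uname") for node in matches]


if __name__ == "__main__":
    print(remote_node_names(SAMPLE_CIB))
```

Querying the CIB XML directly avoids depending on the human-readable layout of crm_mon output.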
09:37:48 <aspiers> anyone else want to give a status report?
09:38:00 <aspiers> if not I propose we move onto deciding next steps
09:38:03 <masahito> aspiers: thanks. I'll try it.
09:38:32 <aspiers> bogdando: want to report anything?
09:38:42 <bogdando> aspiers, nope
09:38:48 <aspiers> no problem :)
09:38:50 <aspiers> OK
09:38:57 <aspiers> I think that's probably everyone
09:39:02 <aspiers> #topic Automatic VM resurrection - next steps
09:39:24 <aspiers> so I guess there's an action for everyone to look at the work of _gryf and ddeja
09:39:30 <aspiers> to understand it better
09:40:02 <_gryf> If there are any questions or doubts - don't hesitate to ask either me or ddeja
09:40:10 <ddeja> +1
09:40:11 <aspiers> #action everyone who's interested to look at the Mistral PoC code by _gryf and ddeja
09:40:12 <bogdando> if someone takes notes, we could end up putting them in the HA guide
09:40:21 <aspiers> thanks :)
09:40:30 <trinaths> can you share the PoC link - code if any
09:40:42 <bogdando> masakari + Mistral how-to , that's what I mean
09:40:42 <aspiers> trinaths: it's linked above but I'll repaste
09:40:51 <aspiers> https://github.com/gryf/mistral-evacuate
09:40:58 <trinaths> aspiers: I joined in the middle
09:41:01 <bogdando> setting up and running with a pacemaker, and so on
09:41:08 <aspiers> trinaths: np :) the meeting is also being recorded
09:41:21 <trinaths> aspiers: okay :)
09:41:31 <bogdando> this could be etherpad as well, and I could submit patches to HA guide later
09:41:35 <aspiers> bogdando: yeah, I think putting them in the HA guide is probably the final step when everything is working pretty well
09:41:50 <bogdando> for now, would be nice to have even drafts on review
09:41:52 <aspiers> bogdando: good idea - etherpad or wiki is a good place for new docs to be born :)
09:41:58 <bogdando> for other folks on the way
09:42:35 <aspiers> it would be really cool if we could have some collaborative whiteboard for co-designing architecture diagrams
09:42:49 <aspiers> like etherpad, but for diagrams. maybe google docs?
09:42:51 <_gryf> one more thing regarding the poc - it doesn't describe the complete solution, only the evacuation part
09:43:12 <jklare> @aspiers gliffy is an amazing tool for that (diagrams)
09:43:13 <_gryf> it's assumed the monitoring and triggering is on the cluster manager side
09:43:17 <aspiers> _gryf: right - that's exactly why I'm thinking about architecture diagrams, so we can figure out how it should integrate with the other pieces
09:43:32 <ddeja> For a better understanding of how Mistral workflows work, it's good to start by reading this: http://docs.openstack.org/developer/mistral/dsl/dsl_v2.html#direct-workflow
09:43:51 <aspiers> cool, thanks ddeja
09:44:13 <aspiers> #info mistral workflows are explained at http://docs.openstack.org/developer/mistral/dsl/dsl_v2.html#direct-workflow
09:44:26 <aspiers> jklare: thanks, I'll look into it
09:44:53 <aspiers> _gryf: there are other monitoring pieces which should maybe live outside Pacemaker, e.g. the libvirtd-level monitoring which masakari does (IIUC)
09:45:40 <aspiers> #idea start using some tool for shared collaboration on architecture diagrams?
09:45:43 <bogdando> so, will we make the how-to notes as a action item?.. :)
09:45:48 <bogdando> an&
09:46:03 <aspiers> bogdando: howto for which piece?
09:46:09 <_gryf> aspiers, I think that libvirt monitoring could also be placed as a pacemaker ra
09:46:19 <bogdando> evacuation PoC with pacemaker remote
09:46:31 <bogdando> masakari and Mistral based, IIUC
09:46:34 <aspiers> _gryf: yeah that could be possible I guess. masahito what do you think?
09:47:25 <_gryf> unless there are some obvious obstacles we can't see right now
09:47:26 <aspiers> bogdando: I thought beekhof maybe mentioned he was already working on docs for the existing approach, but I'm not sure
09:47:34 <bogdando> okay
09:47:39 <aspiers> bogdando: but I guess it's too early for a full howto on the mistral approach
09:47:44 <bogdando> let's keep it to be elaborated
09:47:46 <aspiers> since it still has critical bugs
09:47:49 <masahito> _gryf: does libvirt monitoring mean libvirt itself, or the server instances under the control of libvirt?
09:48:08 <trinaths> regarding HA (I'm a newbie, and this might be answered at the end of the meeting): will there be something like maintaining even the sessions and memory state of the VM?
09:48:11 <aspiers> masahito: by libvirtd-level monitoring I meant monitoring libvirtd logs for VMs which die
09:48:27 <bogdando> not a full one, as I said, if anyone would repeat it, he'd get the common guide
09:48:27 <aspiers> trinaths: no, that is fault-tolerance not HA
09:48:38 <_gryf> masahito, both :) first level is a vm monitoring through libvirt, second is libvirt process, isn't it?
09:48:45 <aspiers> trinaths: that requires special hypervisor support and a hot standby, so it's a totally different topic
09:48:50 <bogdando> and gliffy diagrams perhaps
09:48:54 <trinaths> aspiers: okay. got it
09:49:25 <aspiers> trinaths: btw welcome to our little community ;-)
09:50:04 <aspiers> #idea write an OCF RA which monitors libvirtd for VMs which fail (even when libvirtd/hypervisor are still healthy)
09:50:17 <aspiers> ok, we only have 10 mins left
09:50:30 <aspiers> so are there any more actions we should take for next steps?
09:51:09 <masahito> _gryf: aspiers: I'll write my idea to the doc. the chat is sometimes fast for me :)
09:51:13 <bogdando> did we have one for Poc?
09:51:31 <_gryf> masahito, ok
09:51:41 <aspiers> masahito: thanks :) sorry for the meeting not being in Japanese ;-)
09:52:02 <masahito> no problem ;-)
09:52:04 <trinaths> aspiers: thanks sir. I'm interested in HA, so I'm exploring things, and I was 15 mins late to the meeting
09:52:31 <aspiers> bogdando: you mean for the mistral PoC? I guess that's ongoing work so maybe no need for an action
09:52:43 <aspiers> I guess _gryf and ddeja will continue to provide updates on their progress
09:52:44 <bogdando> I mean masakari as well
09:52:49 <bogdando> something to show
09:52:53 <aspiers> oh right
09:53:09 <aspiers> masahito: do you have any plans for masakari this week?
09:53:11 <bogdando> with pacemaker_remote ofc
09:53:40 <aspiers> BTW I still have some patches to submit for the Nova* RAs
09:53:43 <masahito> aspiers: I plan to work with pacemaker-remote
09:53:54 <aspiers> masahito: awesome!
09:54:10 <masahito> or replace MySQLdb with SQLAlchemy
09:54:21 <aspiers> #action masahito will start experimenting with pacemaker_remote
09:54:32 <bogdando> great, thanks
09:54:38 <aspiers> OK
09:54:53 <aspiers> #topic Open discussion
09:55:03 <aspiers> anybody want to raise anything else in the last 5 mins?
09:55:04 <masahito> sorry, I forgot that we use MySQLdb in Masakari since some people aren't familiar with SQLAlchemy.
09:55:37 <aspiers> #action masahito to look at converting masakari from using MySQLdb to SQLAlchemy
09:56:02 <aspiers> masahito: that would be important for SUSE to be able to use masakari since we use PostgreSQL
09:56:14 <aspiers> but I guess it's also generally appreciated for being consistent with the rest of OpenStack
09:56:35 <aspiers> bogdando: you are doing quite a bit of work with the HA guide, right?
09:56:37 <masahito> I also think SQLAlchemy is better.
09:56:46 <bogdando> aspiers, yes
09:56:55 <aspiers> bogdando: anything you want to share on that?
09:57:17 <bogdando> for now, I only have activities around patches
09:57:22 <aspiers> ok
09:58:09 <aspiers> #info quite a bit of activity on the ha-guide currently
09:58:13 <aspiers> #link https://review.openstack.org/#/projects/openstack/ha-guide,dashboards/important-changes:review-inbox-dashboard
09:58:28 <aspiers> oh, I almost forgot
09:58:35 <aspiers> there's the work on mistral HA
09:59:00 <aspiers> #info Mistral team are starting to work on making Mistral HA
09:59:03 <aspiers> #link https://blueprints.launchpad.net/mistral/+spec/mistral-ha
09:59:11 <aspiers> I guess anyone is welcome to join those efforts
09:59:29 <aspiers> OK well that seems to be the perfect time to end the meeting
09:59:40 <aspiers> thanks a lot everyone for another great meeting, and see you next week!
09:59:47 <_gryf> thanks!
09:59:50 <aspiers> bye for now :)
09:59:58 <masahito> thanks, bye
10:00:06 <ddeja> bye
10:00:13 <trinaths> bye all
10:00:22 <aspiers> #endmeeting