#openstack-meeting-4 log

16:00:34 <Jeffrey4l> #startmeeting kolla
16:00:35 <openstack> Meeting started Wed Nov  1 16:00:34 2017 UTC and is due to finish in 60 minutes.  The chair is Jeffrey4l. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:40 <openstack> The meeting name has been set to 'kolla'
16:00:58 <Jeffrey4l> #topic roll-call
16:01:02 <duonghq> o/
16:01:08 <chason> o/
16:01:23 <Jeffrey4l> hi chason ;D
16:01:41 <chason> Jeffrey4l Hahah
16:01:43 <vhosakot> o/
16:01:47 <duonghq> hi vhosakot
16:01:50 <rfxn> \o/
16:01:51 <coolsvap> o/
16:01:59 <vhosakot> hi duonghq :)
16:02:37 <Jeffrey4l> let us wait another two minutes.
16:02:46 <Jeffrey4l> this will be a short meeting.
16:02:59 <duonghq> ya, I have 2 topics
16:03:08 <Jeffrey4l> duonghq, cool.
16:03:29 <Jeffrey4l> we have no scheduled topics today.
16:03:49 <duonghq> ya, I forgot to add them to our schedule
16:04:15 <Jeffrey4l> ok. let us start.
16:04:22 <Jeffrey4l> #topic Announcements
16:04:54 <Jeffrey4l> The Sydney summit will be held next week.
16:05:22 <spsurya__> o/
16:05:26 <duonghq> ohhh, time flies too fast
16:05:37 <Jeffrey4l> so next meeting will be canceled.
16:05:45 <Jeffrey4l> duonghq, yes.
16:05:57 <Jeffrey4l> any other announcement from community?
16:06:52 <Jeffrey4l> let us start the open discussion directly.
16:07:04 <Jeffrey4l> #topic open discuss
16:07:11 <Jeffrey4l> duonghq, your call.
16:07:18 <duonghq> thank you Jeffrey4l
16:07:27 <duonghq> first one is quite simple
16:07:39 <duonghq> I have hit this bug many times: https://bugs.launchpad.net/kolla-ansible/+bug/1729246
16:07:40 <openstack> Launchpad bug 1729246 in kolla-ansible "MariaDB cluster fails to start after upgrade" [Undecided,New]
16:07:58 <pbourke> o/
16:08:00 <duonghq> I run on 2 nodes with 6GB memory/node
16:08:03 <duonghq> hi pbourke
16:08:21 <duonghq> can somebody help me test this upgrade again?
16:08:42 <Jeffrey4l> duonghq, mariadb changed something recently.
16:09:05 <Jeffrey4l> our upgrade process (ansible roles) should fix the gap.
16:09:19 <duonghq> so, can you triage this bug?
16:09:28 <Jeffrey4l> the safe_to_bootstrap flag did not exist before.
16:09:29 <Jeffrey4l> sure.
16:09:39 <duonghq> I cannot find any bug related to this issue
16:09:44 <duonghq> thanks Jeffrey4l
16:10:09 <Jeffrey4l> curious that the upgrade fails.
16:10:25 <Jeffrey4l> anyway, i will check this.
16:10:36 <Jeffrey4l> thank you for pointing this out.
16:11:09 <duonghq> :)
16:11:40 <Jeffrey4l> btw, there is another old bug report about possible data loss during mariadb recovery.
16:11:52 <Jeffrey4l> https://bugs.launchpad.net/kolla-ansible/+bug/1682153
16:11:53 <openstack> Launchpad bug 1682153 in kolla-ansible "mariadb_recovery is prone to data loss" [Critical,Confirmed]
16:12:15 <duonghq> sure, I saw that, it seems that Sam proposed a fix for the bug
16:12:37 <rfxn> that bug has bitten me before, that articulate bug report helped me narrow it down and recover the cluster
16:12:37 <Jeffrey4l> no patch right now
16:12:58 <Jeffrey4l> but he proposed a possible solution in the description.
16:13:34 <duonghq> we should implement his proposal
16:13:39 <Jeffrey4l> yes.
16:13:43 <duonghq> then let it roll for a while
16:14:14 <Jeffrey4l> okay. please move on
16:14:30 <duonghq> sure,
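On the safe_to_bootstrap flag mentioned at 16:09:28: Galera records its last known state in a grastate.dat file, whose fields include seqno and (in newer releases) safe_to_bootstrap. The following is a minimal Python sketch of how a recovery step might choose which node to bootstrap from, assuming the file contents have already been collected from each node. The function names are hypothetical, and this is not kolla-ansible's actual mariadb_recovery logic.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a Galera grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# GALERA saved state" header comment.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(grastate_by_host):
    """Return the host that looks safest to bootstrap the cluster from.

    A node flagged 'safe_to_bootstrap: 1' wins outright. Otherwise the node
    with the highest seqno is chosen. Older Galera versions do not write the
    flag at all, so a missing key is treated as 0.
    """
    parsed = {host: parse_grastate(text) for host, text in grastate_by_host.items()}
    hosts = sorted(parsed)  # sorted for a deterministic result on ties
    for host in hosts:
        if parsed[host].get("safe_to_bootstrap", "0") == "1":
            return host
    return max(hosts, key=lambda h: int(parsed[h].get("seqno", "-1")))
```

After an unclean shutdown every node may report seqno -1; in that case the real position has to be recovered first (for example with mysqld --wsrep-recover), which this sketch does not attempt.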
16:14:45 <duonghq> so, my 2nd topic is about Kolla-ansible HA layer
16:15:13 <duonghq> does anybody know why we use haproxy/keepalived for the HA layer?
16:15:23 <duonghq> and not the pacemaker/corosync stack
16:15:41 <Jeffrey4l> duonghq, pacemaker/corosync is more complicated.
16:15:47 <rfxn> far more complicated
16:16:07 <Jeffrey4l> and iirc, pacemaker could not be containerized before (now it should work)
16:16:29 <duonghq> but it provides some mechanisms to react to failures, like data plane evacuation
16:17:03 <duonghq> it is an invaluable feature (IMO)
16:17:43 <Jeffrey4l> duonghq, yes. pacemaker is more powerful than keepalived. but what kind of issue are we facing by using keepalived?
16:18:08 <duonghq> it is just to add some functionality to our stack,
16:18:38 <duonghq> I'm thinking of implementing pacemaker in Kolla and letting users choose which HA stack they want to use
16:19:24 <Jeffrey4l> that will be cool. pacemaker can handle more than keepalived.
16:20:00 <duonghq> Jeffrey4l, so, I'll create a blueprint for this, is it ok?
16:20:24 <duonghq> and try to containerize pacemaker (again)
16:20:28 <Jeffrey4l> but i am just not sure what it will bring to Kolla. better health checks? or failover?
16:21:03 <Jeffrey4l> sure. a blueprint is necessary for others to evaluate the possibility.
16:21:29 <duonghq> about healthcheck, I'm not sure it alone can make Kolla better
16:21:48 <rfxn> pacemaker in kolla feels like a solution to a problem that doesn't yet exist; granted from a maturity standpoint eventually moving to pacemaker from keepalived is probably what needs to happen
16:21:49 <duonghq> but I'm certain about failover
16:22:01 <Jeffrey4l> please write what you think and the benefit.
16:22:22 <duonghq> sure
16:23:50 <Jeffrey4l> rfxn, tbh, i will current keepalived is enough. ;)
16:24:22 <Jeffrey4l> i think*
16:24:27 <Jeffrey4l> but who knows what duonghq will bring
16:24:37 <rfxn> current, ya i think keepalived is more than enough -- time could be better spent on other areas of HA and Disaster Recovery instead of pacemaker atm
16:24:48 <duonghq> I think we can let users choose which stack they like,
16:25:04 <duonghq> rfxn, can you suggest some areas?
16:25:37 <rfxn> MariaDB is treated as ephemeral right now, we put it behind a galera cluster and smile
16:25:45 <rfxn> in reality, if you lose mysql data, you're done
16:25:53 <rfxn> we need a reliable backup strategy for mariadb
16:26:05 <rfxn> clustering = HA, backup = DR
16:27:08 <Jeffrey4l> rfxn, do you mean using pacemaker as a DR solution?
16:27:23 <duonghq> I think it is slightly out of scope for Kolla
16:28:48 <rfxn> Jeffrey4l, no, im all for keepalived (now) and pacemaker (later, as a maturity point); duonghq asked me to suggest some areas, and my thought is that the lack of a backup solution for mariadb is the largest, most volatile gap currently
16:29:13 <Jeffrey4l> ah, got it.
16:29:45 <Jeffrey4l> db backup is really necessary.
16:30:39 <Jeffrey4l> so please register a bp for this duonghq, and let us discuss based on the bp.
16:30:40 <Jeffrey4l> thanks
16:31:00 <Jeffrey4l> then any other topics?
16:31:05 <duonghq> Jeffrey4l, sure
16:31:15 <rfxn> happy to help discuss and lay out options in that bp
16:31:22 <duonghq> thank you rfxn
16:32:04 <Jeffrey4l> guess no topic.
16:32:06 <rfxn> https://blueprints.launchpad.net/kolla/+spec/database-backup-recovery <- related
16:32:33 <duonghq> rfxn, ah, it should be in kolla-ansible
16:32:36 <duonghq> what do you think, Jeffrey4l?
16:32:44 <Jeffrey4l> sure.
16:32:51 <Jeffrey4l> and xtrabackup should be the best solution.
16:32:59 <Jeffrey4l> the backup should run periodically.
16:33:15 <rfxn> agreed xtrabackup/innobackupex with a few output options = win
16:33:56 <rfxn> two smaller items; updating haproxy from 1.5 to 1.7
16:33:57 <Jeffrey4l> and we can add these jobs to the cron container, and save the backups into a new docker volume.
16:34:38 <rfxn> and letsencrypt for automagick issuance of external_fqdn ssl certs
16:35:03 <Jeffrey4l> regarding backup, there are other things to back up, like the ceph pg map and crush rules.
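The periodic xtrabackup job into a dedicated volume, as proposed at 16:32:51 to 16:33:57, could look roughly like the Python sketch below. It only builds the command line and decides which old backups to prune; nothing is executed. The helper names, the timestamped directory layout, the `backup` database user and the retention count are all assumptions for illustration, and credentials are assumed to come from a client defaults file rather than the command line.

```python
import datetime


def build_backup_command(backup_root, now=None, user="backup"):
    """Return (target_dir, argv) for one full xtrabackup run.

    The target directory is named after the UTC timestamp so that a plain
    string sort orders backups from oldest to newest.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    target_dir = backup_root.rstrip("/") + "/" + now.strftime("%Y%m%d-%H%M%S")
    argv = [
        "xtrabackup",
        "--backup",
        "--target-dir=" + target_dir,
        "--user=" + user,
    ]
    return target_dir, argv


def backups_to_prune(existing_dirs, keep):
    """Given timestamp-named backup directories, return those beyond the newest `keep`."""
    ordered = sorted(existing_dirs)
    if keep <= 0:
        return ordered
    return ordered[:-keep]
```

A cron entry in the cron container would then call a small wrapper around these helpers, with the backup root mounted from the new docker volume.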
16:35:10 <Jeffrey4l> rfxn, what upgrade haproxy?
16:35:16 <Jeffrey4l> what/why
16:35:52 <rfxn> 1.7 offers better ssl termination, http2, more advanced acl features, far more performant
16:36:22 <Jeffrey4l> rfxn, letsencrypt may be hard, because signing new certs requires network connectivity and a public domain.
16:37:01 <Jeffrey4l> rfxn, basically, package versions in kolla are based on the linux distro repos.
16:37:27 <rfxn> i dont think those are blockers, most production deployments are going to be on an external fqdn  w/ internet access
16:37:52 <rfxn> i have a poc with haproxy that routes letsencrypt CA challenge/response to a dedicated listener on network group systems
16:38:00 <rfxn> so it never needs to touch the horizon container
16:38:41 <rfxn> letsencrypt is a nice to have,  maybe not a need to have :)
16:39:06 <Jeffrey4l> yep
16:39:08 <rfxn> but would minimize the barrier to entry on new deployments imo
16:40:14 <duonghq> rfxn, can you elaborate on your point about new deployments?
16:40:18 <Jeffrey4l> and we can implement this in the "kolla-ansible certificates" command.
16:42:11 <rfxn> SSL certificates, valid browser recognized CA certificates, are the norm on any production ready deployment. Right now, managing SSL certificates is a pain, prone to human error and nothing should ever be internet facing without a valid cert.
16:42:23 <rfxn> letsencrypt allows us to automate the process entirely, in a very trivial way
16:43:07 <Jeffrey4l> letsencrypt is a gread service ;p
16:43:11 <duonghq> rfxn, thank you for teaching me this
16:43:26 <Jeffrey4l> great*
16:43:47 <rfxn> not trying to preach, just speak aloud that imo automagick ssl certificates would set kolla apart and remove a human error prone component for the stack
16:43:59 <rfxn> e.g azure going down cause of forgetting to renew api connector certs :P
16:44:13 <Jeffrey4l> lol
16:44:43 <Jeffrey4l> rfxn, will you add such a feature to kolla?
16:45:56 <rfxn> i can bp it
16:46:00 <rfxn> and we go from there
16:46:10 <Jeffrey4l> cool. thanks.
16:46:26 <rfxn> and happy to share out my ugly code as poc :)
16:46:54 <Jeffrey4l> rfxn, thanks for sharing ;D
16:47:10 <Jeffrey4l> any other topics?
16:47:23 <duonghq> rfxn, a poc always has many ideas
16:48:03 <Jeffrey4l> guess no. let us end the meeting.
16:48:08 <Jeffrey4l> thanks all for coming.
16:48:13 <Jeffrey4l> #endmeeting