18:00:32 <daneyon> #startmeeting container-networking
18:00:32 <openstack> Meeting started Thu Oct 1 18:00:32 2015 UTC and is due to finish in 60 minutes. The chair is daneyon. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:00:37 <openstack> The meeting name has been set to 'container_networking'
18:00:39 <daneyon> Agenda
18:00:46 <daneyon> #link https://wiki.openstack.org/wiki/Meetings/Containers#Agenda
18:00:59 <daneyon> I'll wait a minute for everyone to review the agenda
18:01:08 <daneyon> It's a short one :-)
18:01:30 <daneyon> #topic roll call
18:01:32 <adrian_otto> might as well begin roll call
18:01:37 <adrian_otto> Adrian Otto
18:01:44 <dane_leblanc> o/
18:01:44 <suro-patz> Surojit Pathak
18:01:54 <vilobhmm111> o/
18:02:26 <daneyon> Thank you adrian_otto dane_leblanc suro-patz vilobhmm111 for attending the meeting.
18:02:33 <daneyon> #topic Review Swarm patch
18:02:37 <daneyon> #link https://review.openstack.org/#/c/224367/
18:02:40 <eghobo> o/
18:02:49 <daneyon> Not much has changed with the patch I posted last week
18:02:58 <daneyon> eghobo thanks for joining
18:03:20 <daneyon> I have a newer version of the patch locally that I'm still playing with.
18:03:39 <daneyon> I got a bit sidetracked fixing a few bugs.
18:04:22 <daneyon> Hopefully I can post an updated version of the patch later today that will address using None as the default net_driver for Swarm
18:05:00 <daneyon> I have removed the VIP and all the associated load-balancing config for the swarm API
18:05:11 <daneyon> it's not needed and has not worked since the TLS patch was merged.
18:05:51 <daneyon> since neutron LBaaS does not support TLS offload, we will need to figure out a plan for supporting TLS with load-balancing.
18:06:14 <daneyon> Is anyone familiar with project Octavia?
18:06:21 <adrian_otto> each node holds the cert, and we use layer 3 lb (TCP port forwarding)
18:06:36 <daneyon> #link https://wiki.openstack.org/wiki/Octavia
18:06:40 <adrian_otto> use a simple health check to drop dead nodes
18:07:24 <daneyon> adrian_otto that can be a near-term fix
18:08:15 <daneyon> long-term, it would be nice to perform L7 load-balancing by offloading the TLS session to the load-balancer and then re-encrypting on the backend from the lb -> the swarm managers
18:09:08 <daneyon> adrian_otto we will look at reimplementing the swarm mgr load-balancing when the bay type supports multiple swarm managers.
18:09:35 <daneyon> here is the guide that will be followed for implementing multiple managers:
18:09:37 <daneyon> #link https://docs.docker.com/swarm/multi-manager-setup/
18:09:49 <adrian_otto> daneyon, I don't understand the desire to offload ssl and then use encrypted back channels
18:09:59 <eghobo> daneyon: but I believe only one can be active
18:10:01 <adrian_otto> seems like more complexity than may be needed
18:10:16 <daneyon> as you can see from the guide, only 1 mgr is primary and the others are backups
18:10:22 <adrian_otto> is there some routing decision that involves layer 7?
18:10:39 <daneyon> I would expect Docker to better address swarm mgr ha/scale in a future release.
18:11:01 <eghobo> daneyon: mesos has the same model
18:11:15 <daneyon> adrian_otto I would expect that we may need to address different security use cases.
18:12:42 <adrian_otto> have we detailed the use cases anywhere?
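To make adrian_otto's layer 3/4 option above concrete: each manager node keeps its own cert, the balancer only forwards TCP, and a simple connect test decides which backends stay in rotation. The sketch below is just an illustration of that health check in Python; the manager addresses and the 2376 port are assumptions, not anything from the Magnum patch.

```python
import socket

MANAGERS = [("10.0.0.10", 2376), ("10.0.0.11", 2376)]  # hypothetical manager endpoints

def healthy(host, port, timeout=3.0):
    """Return True if a plain TCP connection to the manager endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Keep only the managers that answer; a dead node simply drops out of the pool.
live = [m for m in MANAGERS if healthy(*m)]
print("managers to keep in the pool:", live)
```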
18:12:51 <daneyon> From my experience, some users are OK with off-loading SSL to an SLB and running in the clear on the back-end. Others want end-to-end encryption. In that case, we can do simple L4 checks/load-balancing, but L7 is preferred as long as the hardware can handle it
18:13:31 <adrian_otto> if the client can do simple SRV lookups, and designate is present, there may be no need for load balancing
18:13:55 <daneyon> adrian_otto currently load-balancing the swarm mgrs is unneeded. It can be implemented, but any traffic to the replicas will be forwarded to the primary
18:14:05 <adrian_otto> just inform designate to update the SRV record when the service availability changes
18:14:11 <hongbin> o/
18:14:28 <eghobo> adrian_otto: +1
18:14:50 <eghobo> most clients can handle retries
18:15:09 <adrian_otto> because that sounds to me like a "Where do I find the active master" question, which is a service discovery issue, not a load balancing one
18:15:15 <Tango> joining late
18:15:58 <daneyon> we could set up load-balancing so the VIP always sends traffic to the primary, until the L3/4 health check fails and traffic then goes to 1 of the replicas. However, we may get into a situation where node-3 becomes the master and the SLB sends traffic to node-2; node-2 will redirect to node-3. ATM I don't see much value in load-balancing the swarm managers until Docker provides a better ha/scale story
18:16:58 <eghobo> daneyon: how do you know who is primary?
18:17:00 <daneyon> eghobo you are correct, kind of. The replicas simply redirect requests to the primary in the cluster.
18:17:19 <daneyon> eghobo good to know that mesos follows the same approach.
18:17:49 <daneyon> adrian_otto I have not detailed the swarm manager ha, scale, load-balancing, etc. use cases.
18:18:16 <eghobo> actually it's the other way around, swarm mimics Mesos ;)
18:18:30 <adrian_otto> I suggest we record the use cases first, and then consider design/implementation options based on those
18:19:00 <daneyon> ATM I think we simply table using a load-balancer for swarm managers until A. we implement swarm clustering (right now we only deploy a single swarm mgr) and B. Docker has a better ha/scale story for swarm.
18:19:17 <adrian_otto> fine with me
18:19:55 <daneyon> Tango thx for joining
18:20:00 <hongbin> sure. The swarm HA should be addressed in a dedicated blueprint
18:20:42 <eghobo> hongbin: +1, the same way as ha for kub and mesos
18:21:49 <Tango> Would it make sense for us to get involved in developing the ha/scale proposal for Docker, or at least follow it closely?
18:21:52 <daneyon> adrian_otto I think it's a bit of both, which is why I reference ha/scale. If we had a large swarm cluster, we would want all mgr nodes in the cluster active. In that scenario, we want to front-end the mgrs with a load-balancer. This is the typical ha/scale scenario that I see most users request. ATM this is a moot point since swarm scaling is not there.
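For the SRV/designate alternative adrian_otto raises, a client-side lookup could stand in for a load balancer: resolve a designate-managed SRV record and connect to whichever manager is currently published as primary. The sketch below assumes the dnspython package and a hypothetical record name; it is not existing Magnum or designate code.

```python
import dns.resolver  # third-party dnspython package (>= 2.0; older releases use dns.resolver.query)

def active_manager(record="_swarm._tcp.bay1.example.org"):
    """Resolve a designate-managed SRV record and return (host, port) of the primary."""
    answers = dns.resolver.resolve(record, "SRV")
    # SRV semantics: lowest priority wins, so the active primary could be published
    # with priority 0 and the standby replicas with a higher priority.
    best = min(answers, key=lambda rr: (rr.priority, -rr.weight))
    return str(best.target).rstrip("."), best.port

host, port = active_manager()
print("connect to the swarm primary at %s:%d" % (host, port))
```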
18:22:20 <daneyon> eghobo primary = 1st node in the cluster
18:22:28 <Tango> Especially if we have an opinion about how it should be done
18:22:58 <daneyon> hongbin agreed re: swarm ha bp
18:23:06 <daneyon> I believe I have already created one
18:23:54 <daneyon> Tango I think it's a good idea to get involved in any upstream projects that can have an effect on Magnum
18:24:41 <daneyon> here is the link to the swarm ha bp
18:24:45 <daneyon> #link https://blueprints.launchpad.net/magnum/+spec/swarm-high-availability
18:24:50 <daneyon> feel free to add to it
18:25:34 <daneyon> I have also created a bp for swarm scaling
18:25:36 <daneyon> #link https://blueprints.launchpad.net/magnum/+spec/swarm-scale-manager
18:26:04 <daneyon> it would be nice to eventually auto-scale swarm nodes
18:26:56 <daneyon> it would be great to see someone from the team tackle these bp's
18:27:20 <daneyon> If not, I am hoping that I can tackle them when I'm done with the net-driver implementation across all bay types
18:27:39 <eghobo> daneyon: I feel it's out of magnum's scope, it's a feature of the swarm scheduler
18:27:44 <Tango> There is a talk on autoscaling at the Summit, we can follow up with these BPs
18:27:57 <daneyon> eghobo what is?
18:28:14 <eghobo> scale-up
18:28:36 <hongbin> Here is the autoscale blueprint:
18:28:40 <hongbin> #link https://blueprints.launchpad.net/magnum/+spec/autoscale-bay
18:28:52 <daneyon> thanks hongbin
18:30:01 <daneyon> eghobo I am referring to adding new nodes to the bay. If I create a bay with master_count 1 and node_count 1, things work great, and now I need additional capacity. I need to scale out the node count
18:30:43 <daneyon> eghobo the swarm scheduler seems pretty decent, so I'm not talking about touching the swarm scheduler
18:31:02 <daneyon> swarm scheduler strategies
18:31:04 <daneyon> #link https://docs.docker.com/swarm/scheduler/strategy/
18:31:11 <eghobo> daneyon: i see, we definitely need it and it should work the same way for all COEs
18:31:12 <suro-patz> daneyon: Would you please elaborate on what we want to achieve in https://blueprints.launchpad.net/magnum/+spec/swarm-high-availability
18:31:16 <daneyon> swarm scheduler filters
18:31:19 <daneyon> #link https://docs.docker.com/swarm/scheduler/filter/
18:31:54 <eghobo> should we return to the networking topic ;)
18:32:12 <daneyon> eghobo agreed. Unfortunately, as adrian_otto has mentioned, we do not have feature parity across all bay types.
18:32:27 <daneyon> hopefully that will change going fwd
18:33:05 <suro-patz> daneyon: by incrementing the --master_count attribute, from Magnum's point of view we are just adding a node to the bay as one more control endpoint. Providing HA for the API/etcd should be out of Magnum's scope
18:33:12 <eghobo> add/delete nodes is common for all bays, isn't it?
18:33:22 <daneyon> suro-patz I am basically saying that in the bp we should implement ha for the swarm mgrs. Our only solution is from Docker's HA guide
18:33:29 <daneyon> #link https://docs.docker.com/swarm/multi-manager-setup/
18:33:46 <eghobo> daneyon: +1
18:34:02 <hongbin> eghobo: yes, currently users can manually add/remove nodes from a bay
18:34:15 <hongbin> eghobo: for all bay types
18:34:43 <hongbin> although removing a node doesn't work very well with swarm, due to the lack of a replication controller
18:35:17 <daneyon> suro-patz so, the swarm bay type needs to implement the master_count attr. The heat templates need to be updated to orchestrate multiple masters. When master_count is > 1, the --replication and --advertise flags should be added to the swarm manage command
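The --replication and --advertise flags come from Docker's multi-manager-setup guide linked above. Below is a rough sketch of the conditional daneyon describes, written as a plain Python helper for illustration only; the real change would live in the swarm bay heat templates, and the port, IP, and discovery URL values here are made-up examples.

```python
def swarm_manage_cmd(master_count, node_ip, discovery_url):
    """Build the swarm manage invocation the heat template would render for a manager node."""
    cmd = ["swarm", "manage", "-H", "tcp://0.0.0.0:2375"]
    if master_count > 1:
        # --replication puts this manager into the primary election;
        # --advertise tells the other managers how to reach it.
        cmd += ["--replication", "--advertise", "%s:2375" % node_ip]
    cmd.append(discovery_url)
    return cmd

# Example: a 3-manager bay using a hypothetical etcd discovery backend.
print(swarm_manage_cmd(3, "10.0.0.10", "etcd://10.0.0.5:2379/swarm"))
```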
18:35:49 <daneyon> I think it could be done pretty easily. I think this is really important to address Magnum's primary goal of being production ready
18:36:28 <daneyon> In the meantime, users would have to deploy multiple swarm bays and spread their containerized apps across the multiple bays to achieve HA
18:36:47 <daneyon> I think it would be nice to provide users with an option to have in-cluster ha
18:38:08 <eghobo> daneyon: +1
18:38:17 <daneyon> eghobo re: scaling. I was referring to having a future option to auto-scale nodes. For swarm mgrs I don't think auto-scaling is needed anytime soon. Instead we need to support multiple masters for HA purposes.
18:38:49 <suro-patz> daneyon: I see, this is to support HA of the control plane of swarm, and magnum should help set that up
18:38:53 <suro-patz> +1
18:39:08 <daneyon> suro-patz master_count adds swarm manager nodes, not swarm agent nodes.
18:39:49 <suro-patz> daneyon: correct, I meant swarm manager by 'control plane'
18:40:59 <daneyon> suro-patz the patch I'm working on removes the swarm agent from the swarm manager. This provides a clear separation between control/data planes. Swarm managers are strictly the control plane, while swarm agent nodes are the data plane. We will eventually want to separate the communication between swarm mgr/agent and standard container traffic, but that's a different topic.
18:41:32 <daneyon> suro-patz that is correct. We want HA in the control plane
18:42:06 <eghobo> daneyon: can we do without ha first and add it later?
18:42:19 <suro-patz> daneyon: I am still not clear on the original LB issue you raised, maybe we can spend some time on IRC after this meeting
18:42:20 <daneyon> we will leave it up to the swarm scheduler to provide ha to containers based on the scheduling strategy.
18:42:57 <daneyon> eghobo yes. None of my network-driver work depends on ha.
18:43:08 <eghobo> great
18:44:44 <daneyon> suro-patz sure. In summary, a load-balancer is not needed b/c A. We have not implemented multiple swarm managers and B. Swarm mgr clustering != all mgrs are active... only 1 active mgr (primary) and the others (replicas) are on standby.
18:45:06 <daneyon> #topic Review Action Items
18:45:13 * daneyon danehans to look into changing the default network-driver for swarm to none.
18:45:58 <daneyon> I have looked into it and am working through the changes to default swarm to network_driver None, with flannel as an option
18:46:12 <daneyon> dane_leblanc is working on a required patch to make this work too
18:46:33 <suro-patz> daneyon: if we are suggesting flannel for kub, why not for swarm too, as the default?
18:46:48 <daneyon> He and I implemented network-driver API validation... currently the validation only allows for network-driver=flannel
18:46:57 <daneyon> not good for the none type ;-)
18:47:24 <daneyon> this is the validation patch that was merged:
18:47:26 <daneyon> #link https://review.openstack.org/#/c/222337/
18:47:52 <daneyon> dane_leblanc is working on a patch to update the validation to include the "none" type
18:48:13 <daneyon> suro-patz we had a lengthy discussion on that topic during last week's meeting.
18:48:14 <dane_leblanc> Should have the validation up for review today
18:48:32 <suro-patz> daneyon: Will check the archive
18:48:38 <daneyon> pls review the meeting logs to come up to speed and ping me or others over IRC if you would like to discuss further.
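For context on the validation work discussed above: the merged patch only accepts network_driver=flannel, so dane_leblanc's follow-up needs to treat "none" as a valid value as well. The snippet below is a hedged sketch of what such a check could look like, with illustrative names; it is not the actual Magnum patch under review.

```python
SUPPORTED_NETWORK_DRIVERS = {'flannel', 'none'}   # illustrative; the merged patch only allowed 'flannel'

def validate_network_driver(driver):
    """Accept flannel or none at the API layer; reject anything else."""
    driver = driver or 'none'          # an unset driver falls back to plain docker networking
    if driver not in SUPPORTED_NETWORK_DRIVERS:
        raise ValueError("network_driver '%s' is not supported" % driver)
    return driver

print(validate_network_driver(None))       # -> 'none'
print(validate_network_driver('flannel'))  # -> 'flannel'
```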
18:48:57 * daneyon danehans to continue coordinating with gsagie on a combined kuryr/magnum design summit session.
18:49:05 <daneyon> I still have not had time to address this
18:49:19 <daneyon> I tried pinging gsagie today, but I did not see him on irc
18:49:24 <daneyon> I will carry this fwd
18:49:31 <daneyon> #action danehans to continue coordinating with gsagie on a combined kuryr/magnum design summit session.
18:49:41 <daneyon> #topic Open Discussion
18:50:13 <daneyon> We have a few minutes to discuss anything the group would like.
18:50:19 <eghobo> daneyon: are you testing swarm with atomic 3 or 5?
18:50:42 <daneyon> eghobo 3
18:50:46 <eghobo> thx
18:52:15 <daneyon> anyone see this article?
18:52:18 <daneyon> #link http://blog.kubernetes.io/2015/09/kubernetes-performance-measurements-and.html
18:52:39 <daneyon> I think it would be awesome if we could pull something like this off for Magnum
18:52:57 <daneyon> it would give users/operators a lot of confidence in using Magnum
18:53:39 <daneyon> I'll wait 1 minute before ending the meeting.
18:55:21 <daneyon> Alright then... thanks for joining.
18:55:26 <daneyon> #endmeeting