18:00:32 #startmeeting container-networking
18:00:32 Meeting started Thu Oct 1 18:00:32 2015 UTC and is due to finish in 60 minutes. The chair is daneyon. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:00:37 The meeting name has been set to 'container_networking'
18:00:39 Agenda
18:00:46 #link https://wiki.openstack.org/wiki/Meetings/Containers#Agenda
18:00:59 I'll wait a minute for everyone to review the agenda
18:01:08 It's a short one :-)
18:01:30 #topic roll call
18:01:32 might as well begin roll call
18:01:37 Adrian Otto
18:01:44 o/
18:01:44 Surojit Pathak
18:01:54 o/
18:02:26 Thank you adrian_otto dane_leblanc suro-patz vilobhmm111 for attending the meeting.
18:02:33 #topic Review Swarm patch.
18:02:37 #link https://review.openstack.org/#/c/224367/
18:02:40 o/
18:02:49 Not much has changed with the patch I posted last week
18:02:58 eghobo thanks for joining
18:03:20 I have a newer version of the patch locally that I'm still playing with.
18:03:39 I got a bit sidetracked fixing a few bugs.
18:04:22 Hopefully I can post an updated version of the patch later today that will address using None as the default net_driver for Swarm
18:05:00 I have removed using a VIP and all the associated load-balancing config for the swarm api
18:05:11 it's not needed and does not work since the tls patch was merged.
18:05:51 since neutron lbaas does not support tls offload, we will need to figure out a plan for supporting tls with load-balancing.
18:06:14 Is anyone familiar with project Octavia?
18:06:21 each node holds the cert, and use layer 3 lb (TCP port forwarding)
18:06:36 #link https://wiki.openstack.org/wiki/Octavia
18:06:40 use a simple health check to drop dead nodes
18:07:24 adrian_otto that can be a near-term fix
18:08:15 long-term, it would be nice to perform l7 load-balancing by offloading the session to the load-balancer and then re-encrypting on the backend from the lb -> the swarm managers
18:09:08 adrian_otto we will look at reimplementing the swarm mgr load-balancing when the bay type supports multiple swarm managers.
18:09:35 here is the guide that will be followed for implementing multiple managers:
18:09:37 #link https://docs.docker.com/swarm/multi-manager-setup/
18:09:49 daneyon, I don't understand the desire to offload ssl, and then use encrypted back channels
18:09:59 daneyon: but I believe only one can be active
18:10:01 seems like more complexity that may not be needed
18:10:16 as you can see from the guide, only 1 mgr is primary and the others are backups
18:10:22 is there some routing decision that involves layer 7?
18:10:39 I would expect that Docker better addresses swarm mgr ha/scale in a future release.
18:11:01 daneyon: mesos has the same model
18:11:15 adrian_otto I would expect that we may need to address different security use cases.
18:12:42 have we detailed the use cases anywhere?
18:12:51 From my experience, some users are OK with off-loading ssl to an slb and running in the clear on the back-end. Others want e-2-e encryption. In that case, we can do simple L4 checks/load-balancing, but L7 is preferred as long as the hw can handle it
18:13:31 if the client can do simple SRV lookups, and designate is present, there may be no need for load balancing
18:13:55 adrian_otto currently load-balancing the swarm mgrs is unneeded. It can be implemented, but any traffic to the replicas will be forwarded to the primary
18:14:05 just inform designate to update the SRV record when the service availability changes
18:14:11 o/
18:14:28 adrian_otto: +1
18:14:50 most clients can handle retries
18:15:09 because that sounds to me like a "Where do I find the active master" question, which is a service discovery issue, not a load balancing one
18:15:15 joining late
18:15:58 we could set up load-balancing so the vip always sends traffic to the primary until the L3/4 health check fails, and then fails over to one of the replicas. However, we may get into a situation where node-3 becomes the master and the slb sends traffic to node-2; node-2 will redirect to node-3. ATM I don't see much value in load-balancing the swarm managers until Docker provides a better ha/scale story
18:16:58 daneyon: how do you know who is primary?
18:17:00 eghobo you are correct, kind of. The replicas simply redirect requests to the primary in the cluster.
18:17:19 eghobo good to know that mesos follows the same approach.
18:17:49 adrian_otto I have not detailed the swarm manager ha, scale, load-balancing, etc. use cases.
18:18:16 actually it's the other way around, swarm mimics it from Mesos ;)
18:18:30 I suggest we record the use cases first, and then consider design/implementation options based on those
18:19:00 atm I think we simply table using a load-balancer for swarm managers until A. we implement swarm clustering (right now we only deploy a single swarm mgr) and B. Docker has a better ha/scale story for swarm.
18:19:17 fine with me
18:19:55 Tango thx for joining
18:20:00 sure. The swarm HA should be addressed in a dedicated blueprint
18:20:42 hongbin: +1, the same way as ha for kub and mesos
18:21:49 Would it make sense for us to get involved in developing the ha/scale proposal for Docker, or at least follow it closely?
18:21:52 adrian_otto I think it's a bit of both, and that's why I reference ha/scale. If we had a large swarm cluster, we would want all mgr nodes in the cluster active. In that scenario, we would want to front-end the mgrs with a load-balancer. This is the typical ha/scale scenario that I see most users request. ATM this is a moot point since swarm scaling is not there.
18:22:20 eghobo primary = 1st node in the cluster
18:22:28 Especially if we have an opinion about how it should be done
18:22:58 hongbin agreed re: swarm ha bp
18:23:06 I believe I have already created one
18:23:54 Tango I think it's a good idea to get involved in any upstream projects that can have an effect on Magnum
18:24:41 here is the link to the swarm ha bp
18:24:45 #link https://blueprints.launchpad.net/magnum/+spec/swarm-high-availability
18:24:50 feel free to add to it
18:25:34 I have also created a bp for swarm scaling
18:25:36 #link https://blueprints.launchpad.net/magnum/+spec/swarm-scale-manager
18:26:04 it would be nice to eventually auto scale swarm nodes
18:26:56 it would be great to see someone from the team tackle these bp's
18:27:20 If not, I am hoping that I can tackle them when I'm done with the net-driver implementation across all bay types
18:27:39 daneyon: I feel it's out of magnum's scope, it's a feature of the swarm scheduler
18:27:44 There is a talk on autoscaling at the Summit, we can follow up with these BPs
18:27:57 eghobo what is?
18:28:14 scale-up
18:28:36 Here is the autoscale blueprint:
18:28:40 #link https://blueprints.launchpad.net/magnum/+spec/autoscale-bay
18:28:52 thanks hongbin
18:30:01 eghobo I am referring to adding new nodes to the bay. If I create a bay with master_count 1 and node_count 1, things work great, and now I need add'l capacity. I need to scale out the node count
18:30:43 eghobo the swarm scheduler seems pretty decent, so I'm not talking about touching the swarm scheduler
18:31:02 swarm scheduler strategies
18:31:04 #link https://docs.docker.com/swarm/scheduler/strategy/
18:31:11 daneyon: I see, we definitely need it and it should work the same way for all coe
18:31:12 daneyon: Would you please elaborate on what we want to achieve with https://blueprints.launchpad.net/magnum/+spec/swarm-high-availability
18:31:16 swarm scheduler filters
18:31:19 #link https://docs.docker.com/swarm/scheduler/filter/
18:31:54 should we return to the networking topic ;)
18:32:12 eghobo agreed. Unfortunately, as adrian_otto has mentioned, we do not have feature parity across all bay types.
18:32:27 hopefully that will change going fwd
18:33:05 daneyon: by incrementing the --master_count attribute, from magnum's point of view we are just adding a node to the bay as one more control end-point. Providing HA for API/etcd should be out of magnum's scope
18:33:12 adding/deleting nodes is common for all bays, isn't it?
18:33:22 suro-patz I am basically saying in the bp we should implement ha for the swarm mgrs. Our only solution is from Docker's HA guide
18:33:29 #link https://docs.docker.com/swarm/multi-manager-setup/
18:33:46 daneyon: +1
18:34:02 eghobo: yes, currently users can manually add/remove nodes from a bay
18:34:15 eghobo: for all bay types
18:34:43 although removing a node doesn't work very well with swarm, due to the lack of a replication controller
18:35:17 suro-patz so, the swarm bay type needs to implement the master_count attr. The heat templates need to be updated to orchestrate multiple masters. When master_count is > 1, the --replication and --advertise flags should be added to the swarm manage command
18:35:49 I think it could be done pretty easily. I think this is really important to address Magnum's primary goal of being production ready
18:36:28 In the meantime, users would have to deploy multiple swarm bays and spread their containerized apps across the multiple bays to achieve HA
18:36:47 I think it would be nice to provide users with an option to have in-cluster ha
18:38:08 daneyon: +1
18:38:17 eghobo re: scaling. I was referring to having a future option to auto scale nodes. For swarm mgrs I don't think auto-scaling is needed anytime soon. Instead we need to support multiple masters for HA purposes.
18:38:49 daneyon: I see, this is to support HA of the control plane of swarm, and magnum should help set that up
18:38:53 +1
18:39:08 suro-patz master_count adds swarm manager nodes, not swarm agent nodes.
18:39:49 daneyon: correct, I meant swarm manager by 'control plane'
18:40:59 suro-patz the patch I'm working on removes the swarm agent from the swarm manager. This provides a clear separation between control/data planes. swarm managers are strictly the control plane while swarm agent nodes are the data plane. We will eventually want to separate the communication between swarm mgr/agent and standard container traffic, but that's a different topic.
18:41:32 suro-patz that is correct. We want HA in the control plane
18:42:06 daneyon: can we do without ha first and add it later?
18:42:19 daneyon: I am still not clear on the original LB issue you raised, maybe we can spend some time on IRC after this meeting
18:42:20 we will leave it up to the swarm scheduler to provide ha to containers based on the scheduling strategy.
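A minimal sketch of the swarm manage invocation behind the multi-manager guide linked at 18:33:29. The etcd discovery URL, bind port, and TLS paths below are illustrative assumptions, not values taken from Magnum's Heat templates:

    # Run on each manager node. --replication enrolls the manager in primary
    # election; --advertise is the address other managers and clients use to
    # reach this node. Only the elected primary serves requests; replicas
    # forward them to the primary, as noted in the discussion above.
    swarm manage \
        --tlsverify \
        --tlscacert=/etc/docker/ca.pem \
        --tlscert=/etc/docker/server.pem \
        --tlskey=/etc/docker/server-key.pem \
        --host tcp://0.0.0.0:2376 \
        --replication \
        --advertise <manager_ip>:2376 \
        etcd://<etcd_ip>:2379

With master_count > 1, the Heat templates would render one such command per manager node, each with its own --advertise address; that is the change the swarm-high-availability blueprint calls for.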
18:42:57 eghobo yes. None of my network-driver work depends on ha.
18:43:08 great
18:44:44 suro-patz sure. In summary, a load-balancer is not needed b/c A. we have not implemented multiple swarm managers and B. swarm mgr clustering != all mgrs are active... only 1 active mgr (primary) and the others (replicas) are on standby.
18:45:06 #topic Review Action Items
18:45:13 * daneyon danehans to look into changing the default network-driver for swarm to none.
18:45:58 I have looked into it and am working through the changes to default swarm to network_driver None, with flannel as an option
18:46:12 dane_leblanc is working on a required patch to make this work too
18:46:33 daneyon: if we are suggesting flannel for kub, why not for swarm too, as default?
18:46:48 He and I implemented network-driver api validation... currently the validation only allows for network-driver=flannel
18:46:57 not good for the none type ;-)
18:47:24 this is the validation patch that was merged:
18:47:26 #link https://review.openstack.org/#/c/222337/
18:47:52 dane_leblanc is working on a patch to update the validation to include the "none" type
18:48:13 suro-patz we had a lengthy discussion on that topic during last week's meeting.
18:48:14 Should have the validation up for review today
18:48:32 daneyon: Will check the archive
18:48:38 pls review the meeting logs to come up to speed and ping me or others over irc if you would like to discuss further.
18:48:57 * daneyon danehans to continue coordinating with gsagie on a combined kuryr/magnum design summit session.
18:49:05 I still have not had time to address this
18:49:19 I tried pinging gsagie today, but I did not see him on irc
18:49:24 I will carry this fwd
18:49:31 #action danehans to continue coordinating with gsagie on a combined kuryr/magnum design summit session.
18:49:41 #topic Open Discussion
18:50:13 We have a few minutes to discuss anything the group would like.
18:50:19 daneyon: are you testing swarm with atomic 3 or 5?
18:50:42 eghobo 3
18:50:46 thx
18:52:15 anyone see this article?
18:52:18 #link http://blog.kubernetes.io/2015/09/kubernetes-performance-measurements-and.html
18:52:39 I think it would be awesome if we could pull something like this off for Magnum
18:52:57 it would give users/operators a lot of confidence in using Magnum
18:53:39 I'll wait 1 minute before ending the meeting.
18:55:21 Alright then... thanks for joining.
18:55:26 #endmeeting
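As a hypothetical illustration of the network_driver defaults reviewed under the action items: assuming python-magnumclient exposes a --network-driver flag on baymodel-create (the flag name and the image, keypair, and flavor values here are assumptions, not taken from the merged patches), the two configurations would look roughly like this:

    # Swarm baymodel with no overlay driver; once the "none" validation patch
    # lands, omitting --network-driver would leave swarm at its None default.
    magnum baymodel-create --name swarm-baymodel \
        --coe swarm \
        --image-id fedora-21-atomic-3 \
        --keypair-id default \
        --external-network-id public \
        --flavor-id m1.small

    # Explicitly opting in to flannel, currently the only value the merged
    # validation patch accepts.
    magnum baymodel-create --name swarm-baymodel-flannel \
        --coe swarm \
        --image-id fedora-21-atomic-3 \
        --keypair-id default \
        --external-network-id public \
        --flavor-id m1.small \
        --network-driver flannel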