Bare Metal Container Network Design


dol...@etsy.com

Aug 12, 2016, 11:56:47 AM
to CoreOS User
I have mocked up the simplest possible bare metal container networking design on CoreOS stable (1068.8.0). Attached is the result. This design keeps broadcast domains small while allowing the container environment to appear as a single VLAN to the physical network. It also eliminates NAT, which is generally desirable. Load balancing could still be integrated, and I believe microsegmentation could still be achieved with iptables. Container-to-container communication is achieved via the shortest L2 path without static routes, a routing protocol, or transiting an upstream router. ARP table size is a consideration, but no more so than in VM networking. There is no VLAN segmentation down to the container, but perhaps multiple bridges could be used to emulate it. I am interested in feedback: Is this a good design? Why is this a bad design? Is this design incompatible with Kubernetes or Mesos?
Container Network.png
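
For concreteness, here is a minimal sketch of one way a host could be configured along these lines. The interface name, the addressing, and the use of proxy ARP are assumptions for illustration; the attached diagram may differ.

# Illustrative addressing only: assume the container VLAN is 10.20.0.0/16,
# this host's uplink address is 10.20.0.5/16, and this host owns
# 10.20.5.0/24 for its local container bridge.

# Uplink: the /16 mask makes every container address on the VLAN on-link,
# so remote containers are resolved by ARP rather than by static routes.
ip addr add 10.20.0.5/16 dev eth0        # eth0 = physical uplink (assumed name)

# Local container bridge with this host's slice of the VLAN.
ip link add name cbr0 type bridge
ip addr add 10.20.5.1/24 dev cbr0
ip link set cbr0 up

# Forward between bridge and uplink, and answer ARP on the uplink on behalf
# of the local containers (the host bridges to one side and routes to the other).
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.eth0.proxy_arp=1

# Point Docker at the bridge and disable masquerading, so container traffic
# leaves with its real, routable address -- no NAT.
dockerd --bridge=cbr0 --ip-masq=false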

Seán C. McCord

Aug 12, 2016, 12:16:52 PM
to dol...@etsy.com, CoreOS User
This is a perfectly acceptable design.  You are simply doing manually, and at a lower level, what other tools (e.g. flannel) would do automatically and at a higher (overlay) level.

I cannot speak for Mesos, but it should be fine for Kubernetes, so far as I can see.



--
Seán C McCord
CyCore Systems, Inc

c...@tigera.io

Aug 12, 2016, 5:50:42 PM
to CoreOS User
As Seán already noted, this will work.  The question is to what scale.  You are basically creating what was at one time (I'm dating myself here) called a brouter: to one side the network looks like a router, to the other a bridge.  In reality you are using ARP to distribute 'routes'.  As you note, you may run out of ARP table entries, since there is no aggregation of routes on the router.  It also means that there is a LOT of ARPing for destinations that are really behind a router hop.  You also might have issues in that ARP records don't necessarily time out right away, so you may end up forwarding to incorrect destinations (if a server goes down and its 'subnet' is moved somewhere else).  There are fixes for that as well (like gratuitous ARP), but now you have more moving parts.
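
If the neighbour cache does become the limiting factor, Linux exposes the relevant limits via sysctl.  A small sketch, with purely illustrative threshold values:

# Check how many IPv4 neighbour (ARP) entries are currently held.
ip -4 neigh show | wc -l

# Inspect the garbage-collection thresholds that cap the table.
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3

# Raise them if needed (values here are illustrative, not recommendations).
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384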

There are solutions to this, such as Canal (full disclosure: I'm the CTO of Tigera, the home of Canal and Calico and a co-maintainer of Flannel).  With Canal you can either route or 'switch' at the networking layer, and use Calico for policy/segmentation.  Either of those two configurations would meet your requirements (one VLAN, small broadcast domains, no NAT, and segmentation).  Best of all, they are all open source and widely adopted and deployed.



Dennis Olvany

Aug 12, 2016, 9:31:11 PM
to c...@tigera.io, CoreOS User
One thing that I have been considering is subnet size on the docker bridge. This bounds the number of containers that can run on a host. Clearly this is subjective, but what is a typical number of containers to run on one host? 10, 100, 1000?



--
Dennis Olvany
Senior Network Engineer @Etsy

Seán C. McCord

Aug 12, 2016, 10:54:31 PM
to Dennis Olvany, c...@tigera.io, CoreOS User
If you are using rkt or Kubernetes, consider your IP allocation per pod rather than per container.  The number of pods per node is highly dependent on your own installation.  The common default allocation of a /24 is usually more than sufficient, giving you 252 available pod IPs per node.
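
For example, if the per-node ranges are handed out by Kubernetes itself, you can check the slice each node received (this assumes the controller manager is allocating pod CIDRs):

# Show the pod CIDR allocated to each node, e.g. one /24 per node.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'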



Dennis Olvany

Aug 16, 2016, 11:26:37 AM
to Seán C. McCord, c...@tigera.io, CoreOS User
Seán,

If my understanding is correct, a pod is a grouping of one or more containers. Do the individual containers in a pod then cease to have their own IP addresses? I am a bit unclear on the mapping of IPs to pods.

Jeffrey Hulten

Aug 16, 2016, 12:25:06 PM
to Dennis Olvany, Seán C. McCord, c...@tigera.io, CoreOS User
If I remember correctly, all containers in a pod share a network namespace and therefore only use one IP.

Dennis Olvany

Aug 16, 2016, 12:36:47 PM
to Jeffrey Hulten, Seán C. McCord, c...@tigera.io, CoreOS User
So multiple containers use the same network stack? Interesting. Then contention could occur if two containers within the same pod wish to bind to the same port?

Jonathan Boulle

Aug 16, 2016, 12:39:40 PM
to Dennis Olvany, Jeffrey Hulten, Seán C. McCord, c...@tigera.io, CoreOS User

Correct, containers within a pod should be cooperative and hence not conflict on ports.
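
A quick way to see this with Docker alone is to join one container to another's network namespace; the names and images below are arbitrary:

# The first container owns the network namespace (and its single IP).
docker run -d --name web nginx

# A second container joins that namespace: same interfaces, same IP, and a
# second bind to port 80 inside it would fail with "address already in use".
docker run --rm --net=container:web busybox ip addr show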

c...@tigera.io

Aug 30, 2016, 12:17:19 PM
to CoreOS User, c...@tigera.io
Greetings Dennis,

That's a variable number.  Most folks are looking at the high tens to low hundreds per node (containers on the higher side, pods on the lower side of that).  I have had folks talk about 1k+, but those seem to be outliers so far.

Christopher