C10M


Rajiv Kurian

May 14, 2013, 1:37:38 PM
to mechanica...@googlegroups.com
Saw a presentation at http://www.youtube.com/watch?v=73XNtI0w7jA#! on techniques for supporting 10 million concurrent connections. Thought this group would find it interesting. Has anyone here had experience with building a TCP/IP stack in user space? On a side note, the PF_RING architecture looks a lot like the Disruptor's ring buffer.

Martin Thompson

May 14, 2013, 1:47:33 PM
to mechanica...@googlegroups.com
Open Onload is such a user space stack.  http://www.openonload.org/

I've benchmarked the Solarflare cards with this stack, and it has much lower latency and higher throughput than standard Intel cards with the Linux kernel stack.

BTW many of the ideas for the Disruptor come from the design of network devices :-)  These ideas have been around a very long time.

Luke Gorrie

May 14, 2013, 2:24:39 PM
to Rajiv Kurian, mechanica...@googlegroups.com
That's how we built the Teclo Sambal product (http://teclo.net/) actually. Homebrew TCP/IP in userspace running on a small userspace Intel 10G device driver like the one in Snabb Switch. The product does 10M sockets of proxying at 20Gbps of real internet traffic. Parallelism comes from running N totally independent processes with hardware dispatching ("RSS" in Intel NIC parlance) between them - no threads or locks anywhere.

I would say the performance side is pretty straightforward in that the hardware really wants to run fast and you only need to avoid getting in the way -- not too hard if you write the whole stack to match your application, but very hard if you depend on abstractions and misunderstand what's going on.

Building a solid TCP stack is real work though. In our case we wanted to do that anyway but more generally there seems to be a place for a good open source userspace one in the world. Maybe Snabb Switch will grow one one day...
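
To make the share-nothing layout concrete, here is a toy sketch in C (not our actual code - queue setup and the stack itself are elided, and the 4-queue count is made up):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    /* One worker process per RX queue, each pinned to its own core.  The
     * NIC's RSS hash spreads flows across queues, so workers never need to
     * share any state. */
    int main(void)
    {
        int nqueues = 4;                    /* assume 4 RSS queues configured */
        for (int q = 0; q < nqueues; q++) {
            if (fork() == 0) {
                cpu_set_t cpus;
                CPU_ZERO(&cpus);
                CPU_SET(q, &cpus);          /* worker q owns core q and queue q */
                sched_setaffinity(0, sizeof(cpus), &cpus);
                /* ... attach to RX queue q and run the per-process TCP
                 * stack: no threads, no locks, no shared memory ... */
                pause();
            }
        }
        pause();                            /* parent just supervises */
    }

The key property is that a packet's five-tuple always hashes to the same queue, so a given TCP flow is always handled by the same process.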



Rajiv Kurian

May 14, 2013, 2:31:16 PM
to mechanica...@googlegroups.com, Rajiv Kurian, lu...@snabb.co
Do you guys know of an open source implementation that one could use? My searching suggests there aren't any well-tested open source implementations.

Martin Thompson

May 14, 2013, 3:00:44 PM
to mechanica...@googlegroups.com
This talk has some good points and I think the subject is really interesting.  I would take the suggested approach with serious caution.  For starters, the Linux kernel is nowhere near as bad as it is made out to be.  Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning.  I've heard they have since taken this to over 5 million concurrent connections.
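
The per-process side of that tuning is simple enough to sketch (this is illustrative, not the client's actual configuration; the system-wide sysctls named in the comment are the usual suspects):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Raise this process's fd ceiling: a million sockets means a million fds.
     * The system-wide knobs (fs.file-max, net.core.somaxconn,
     * net.ipv4.ip_local_port_range, net.ipv4.tcp_mem, ...) are set via sysctl. */
    int main(void)
    {
        struct rlimit rl = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");   /* needs a matching hard limit or root */
            return 1;
        }
        return 0;
    }

The rest is mostly arithmetic on buffers and queues: a million connections times even a few KB of socket buffer each is several GB of kernel memory.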

BTW Open Onload is an open source implementation.  Writing a network stack is a serious undertaking.  In a previous life I wrote a network probe that had to reassemble TCP streams, and I kept getting tripped up by edge cases.  It is a great exercise in data structures and lock-free programming.  If you need very high-end performance I'd talk to the Solarflare or Mellanox guys before writing my own.

There are some errors and omissions in this talk.  For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache.  A big issue for me is that when he defined C10M he did not mention the TIME_WAIT issue with closing connections.  Creating and destroying 1 million connections per second is a major issue: with the usual 60-second TIME_WAIT on Linux, a server closing 1 million connections per second is holding on the order of 60 million TCBs in TIME_WAIT at any instant.  A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs, to ensure no older packet is delivered to a new socket connection.




Luke Gorrie

May 14, 2013, 3:28:32 PM
to Martin Thompson, mechanica...@googlegroups.com
On 14 May 2013 21:00, Martin Thompson <mjp...@gmail.com> wrote:
Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning.  I've heard they have since taken this to over 5 million concurrent connections.

That is very impressive! 

BTW Open Onload is an open source implementation.

and that is really interesting!

Cool mailing list :-)


Rajiv Kurian

May 14, 2013, 4:17:02 PM
to mechanica...@googlegroups.com
Martin,

It would be awesome if you could talk about the work you did to get such numbers using Java.

I just saw that Open Onload is an open-source implementation.

Kirk Pepperdine

May 15, 2013, 1:30:59 AM
to Rajiv Kurian, mechanica...@googlegroups.com
Interesting. I once did 1 VM serving 1,000,000 (socket) clients. The hardest part was the configuration.

-- Kirk


Luke Gorrie

May 15, 2013, 2:56:10 AM
to mechanica...@googlegroups.com, Michael Barker
On 14 May 2013 23:16, Michael Barker <mik...@gmail.com> wrote:
> Building a solid TCP stack is real work though. In our case we wanted to do
> that anyway but more generally there seems to be a place for a good open
> source userspace one in the world. Maybe Snabb Switch will grow one one
> day...

I'd be interested in hearing your approach here, especially around how
you present the stack to the application.  E.g. the OpenOnload
approach is to replace the standard C library so that you can stick to
your existing Linux/Posix API (select/poll/epoll/read/write), which is
quite nice for retrofitting OpenOnload into existing applications.

In this product there is actually no separate application - it's a packet-oriented network element like a firewall.

The device sits inline ("bump in the wire") on a mobile telecom backbone and gives TCP packets store-and-forward semantics, i.e. every TCP packet is buffered in the device until ACK'd and the device implements custom (non-RFC) algorithms for congestion control and retransmission that we designed ourselves. The design is for minimal impact on the traffic passing through -- leaving sequence numbers intact and so on -- so that you could power off the device abruptly and still let (most) connections continue.

So that was a very special case and we wouldn't have wanted to use an off-the-shelf TCP stack even had one been readily available.

Generally my mental model of packet-oriented devices is that people are losing a lot of sleep over kernel-based solutions in production networks and that moving from kernel space to userspace (e.g. DPDK) really is helping to make x86-based devices respectable again.

btw we recently implemented "vhost architecture" (http://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html) support in Snabb Switch. This is the latest Linux kernel interface for Ethernet I/O, designed for KVM. Haven't tried pushing performance on it yet though, more focused on hardware drivers for now.

An open source TCP/IP stack for the Intel 82599 would be a very cool
thing to have.

Almost sounds like the kind of thing that might one day spontaneously pop out of e.g. Google :-)


Luke Gorrie

May 15, 2013, 2:57:16 AM
to mechanica...@googlegroups.com, Michael Barker
On 14 May 2013 23:16, Michael Barker <mik...@gmail.com> wrote:
I'd be interested in hearing your approach here, especially around how
you present the stack to the application.  E.g. the OpenOnload
approach is to replace the standard C library so that you can stick to
your existing Linux/Posix API (select/poll/epoll/read/write), which is
quite nice for retrofitting OpenOnload into existing applications.

Actually we are looking at a related system interface issue in Snabb Switch right now.

Brief digression:

We want to move the bottom layer of the network stack - ethernet I/O with devices - from the Linux kernel into a userspace application. This is so that we can write high-level userspace code that does the same job as kernel-based extensions to iptables, ebtables, bridge, and so on. The goal is to make that kind of programming much quicker and easier, and exploit this to build cool applications.

The layer above us can be anything that operates at ethernet level: for example KVM, Xen, or the Linux kernel.

Back on topic:

So how do we interface towards these "applications"?

The Linux kernel is easy: we use /dev/net/tun to create a virtual 'tap' interface and /dev/vhost-net to set up a DMA-like I/O interface that cuts out the usual system calls and context switches. Now Linux could do all of its networking via tap and let snabbswitch worry about connecting that with a physical network.
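
For reference, the tap side is only a few lines of C (a minimal sketch; the function and interface names are made up):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>

    /* Create (or attach to) a tap interface and return an fd whose
     * read()/write() carry raw ethernet frames. */
    int tap_open(const char *name)
    {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;  /* L2 frames, no packet-info header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
            close(fd);
            return -1;
        }
        return fd;                            /* e.g. tap_open("snabb0") */
    }

Frames the kernel sends to the interface then appear on read() of this fd, and write()s are injected back into the kernel's stack.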

KVM is the problem on the table now. How do we make KVM talk to us instead of the Linux kernel? Today KVM uses the same /dev/net/tun and /dev/vhost-net that we are using, and the trouble is that this is designed as a user-kernel interface, not a user-user interface. It uses system calls like ioctl(), and I don't know any really easy way to make those calls go to the snabbswitch process instead of the Linux kernel.

If anybody has a bright idea here I would be really glad to hear it :-) this is genuinely a major open technical problem for us.

There are two ideas on the table now, both potentially viable:

1. Hack KVM. Generalize the code so that it can optionally use different system calls that are user-user friendly.
2. Emulate /dev/net/tun and /dev/vhost-net. This is the newer idea: use FUSE to create a user-space filesystem interface to snabbswitch with compatible clones of /dev/net/tun (to create virtual network interfaces) and /dev/vhost-net (to negotiate which memory to use for packet descriptor rings). The filesystem part doesn't necessarily have to be super-efficient because the heavy lifting is done via shared memory.
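
To make idea 2 concrete, a skeleton using libfuse's CUSE interface might look like this (an untested sketch; the device name is a placeholder and all the real ioctl work is elided):

    #define FUSE_USE_VERSION 29
    #include <string.h>
    #include <cuse_lowlevel.h>

    static void tun_open(fuse_req_t req, struct fuse_file_info *fi)
    {
        fuse_reply_open(req, fi);
    }

    static void tun_ioctl(fuse_req_t req, int cmd, void *arg,
                          struct fuse_file_info *fi, unsigned flags,
                          const void *in_buf, size_t in_bufsz, size_t out_bufsz)
    {
        /* handle TUNSETIFF, VHOST_* etc. here, backed by snabbswitch state */
        fuse_reply_ioctl(req, 0, NULL, 0);
    }

    static const struct cuse_lowlevel_ops tun_ops = {
        .open  = tun_open,
        .ioctl = tun_ioctl,
    };

    int main(int argc, char *argv[])
    {
        const char *dev_info_argv[] = { "DEVNAME=snabb-tun" };
        struct cuse_info ci;
        memset(&ci, 0, sizeof(ci));
        ci.dev_info_argc = 1;
        ci.dev_info_argv = dev_info_argv;
        ci.flags = CUSE_UNRESTRICTED_IOCTL;   /* pass ioctls through untyped */
        return cuse_lowlevel_main(argc, argv, &ci, &tun_ops, NULL);
    }

KVM would then open /dev/snabb-tun instead of /dev/net/tun, and the heavy lifting still happens over shared memory as noted above.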

It would be cool to find a third option that's simpler though :-).

Xen we haven't looked into beyond drinking beer with one of the developers...


Martin Thompson

May 15, 2013, 3:05:58 AM
to mechanica...@googlegroups.com
The majority of the work is Linux config for ports, buffers, queues, etc.  It would apply to any language.

From Java, once you are using non-blocking IO the big thing is the design of data structures to manage the state for all your concurrent connections efficiently.  When data sets get large this becomes a combined modelling, data structures, and mechanical sympathy challenge :-)
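
As an illustrative sketch only (in C rather than Java, and not the actual design): keep all per-connection state in one flat, pre-allocated array indexed by file descriptor, so the event loop's hot path does no lookups or allocation:

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    /* All per-connection state in one flat, pre-allocated array indexed by
     * fd: O(1) lookup, no allocation on the hot path, cache-friendly. */
    struct conn {
        uint8_t  state;        /* position in the protocol state machine */
        uint32_t bytes_read;   /* partial-read bookkeeping */
    };

    int main(void)
    {
        enum { MAX_CONNS = 1 << 20, BATCH = 1024 };
        struct conn *conns = calloc(MAX_CONNS, sizeof(*conns));
        int epfd = epoll_create1(0);
        struct epoll_event events[BATCH];

        /* sockets are registered with epoll_ctl(), storing the fd in
         * event.data.fd; readiness events are then serviced in batches */
        for (;;) {
            int n = epoll_wait(epfd, events, BATCH, -1);
            for (int i = 0; i < n; i++) {
                struct conn *c = &conns[events[i].data.fd];
                (void)c;       /* ... read, advance c->state, write ... */
            }
        }
    }

The interesting part is sizing and laying out the per-connection struct so the fields you touch together share cache lines.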

Peter Lawrey

May 15, 2013, 3:31:13 AM
to Martin Thompson, mechanica...@googlegroups.com
It is worth noting that the more connections you are supporting, the less of the server each connection gets.  For higher-value services, you might spend $1-10 per year per active user, and if you are spending $1K to $10K per server per year, you might want one server per 100 to 10K active connections.

Where you need a very high number of connections per application is low-value, highly concurrent functionality, e.g. DNS or firewalls.  These applications tend to be well serviced by existing solutions and companies.  Where you are likely to develop a new solution is in the higher-value services, where the cost of the hardware means you get diminishing returns by cramming more connections into a single box.

Where this discussion is particularly interesting is:
- how do I minimise the overhead of handling network requests on my system?
- how do I handle bursts of activity (or denial of service attacks)?
- how far can I stress test my application? (or is my network framework going to be the limiting factor?)

So while not directly useful for most applications, higher scalability is an interesting topic for what I see as secondary reasons.

My point is, you should consider the whole solution and the economics of your hardware and software.  Do you really want to spend more money on the software to save money on the hardware?  In many cases it is lower risk and easier to spend money on the hardware.  I do think this option is overused in many cases, and people should be smarter about their software.  There is always a trade-off.

Peter.





Tony Finch

May 15, 2013, 6:39:08 AM
to mechanica...@googlegroups.com
Also relevant to this discussion is Luigi Rizzo's "netmap" design for
userspace network stacks on FreeBSD.

https://www.usenix.org/conference/usenixfederatedconferencesweek/netmap-novel-framework-fast-packet-io
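
The heart of netmap is batched ring access with one syscall per batch; a minimal receive loop looks something like this (a sketch using the netmap_user.h helper functions of recent versions; error handling elided):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>

    int main(void)
    {
        /* map eth0's rings into userspace; the kernel stack steps aside */
        struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
        if (d == NULL)
            return 1;
        struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
        for (;;) {
            poll(&pfd, 1, -1);      /* one syscall per batch, not per packet */
            struct nm_pkthdr h;
            unsigned char *frame;
            while ((frame = nm_nextpkt(d, &h)) != NULL) {
                /* h.len bytes of raw ethernet frame at `frame` */
            }
        }
    }

The rings live in memory shared with the kernel, so the loop above never copies a packet.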

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/
Forties, Cromarty: East, veering southeast, 4 or 5, occasionally 6 at first.
Rough, becoming slight or moderate. Showers, rain at first. Moderate or good,
occasionally poor at first.

Santos Das

Apr 23, 2014, 2:57:43 AM
to mechanica...@googlegroups.com
Can Open Onload's userspace TCP stack be used on any platform running Linux? It looks like it is designed to work only with the Solarflare NICs. Am I missing anything?
If it is tightly coupled with their NIC, then it is not quite an open source one.
I am trying to figure out an open source user space TCP stack that works in a multi-process mode like Nginx. The rump kernel does not support that yet.



Martin Thompson

Apr 23, 2014, 3:20:43 AM
to mechanica...@googlegroups.com
To the best of my knowledge the Open Onload stack only works with the Solarflare NICs. I believe it is the same story with the Mellanox cards (VMA).

I've heard talk of NUSE but not seen anything further.


Has anyone seen more open work in this area? I've seen a fair bit of stuff to support working at a packet level but not a full TCP/UDP/IP user space stack.


Red Hat are supporting RoCE, but this is for RDMA.


Martin...



Michael Barker

Apr 23, 2014, 3:32:37 AM
to mechanica...@googlegroups.com
Hi,

Intel are doing some work to introduce "Busy Poll Sockets" to Linux that will shorten the path to the NIC for low-latency applications.  This is going to be in RHEL 7 (AFAIK).


I haven't done a deep dive at this point, but will be looking at it in the coming weeks/months.
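
From what I've read it is driven by a new socket option, something like this (a sketch only, assuming the SO_BUSY_POLL option that went into Linux around 3.11; not something I've benchmarked):

    #include <sys/socket.h>

    #ifndef SO_BUSY_POLL
    #define SO_BUSY_POLL 46     /* not yet present in older glibc headers */
    #endif

    /* Ask the kernel to busy-poll the device's RX queue for up to 50us on
     * blocking receives before falling back to the interrupt path. */
    int enable_busy_poll(int fd)
    {
        int usec = 50;
        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec));
    }

There are also net.core.busy_read / net.core.busy_poll sysctls for a system-wide default.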

Mike.




Santos Das

May 6, 2014, 9:23:39 AM
to mechanica...@googlegroups.com
Does anyone know of an open source user space TCP/IP stack that provides great performance? I was inclined towards netmap-rumptcp but was not able to get a good PPS.
There is another project called "Ministack" which is based on netmap (http://www0.cs.ucl.ac.uk/staff/f.huici/publications/ministack-osdi12poster.pdf), but I am not sure it is an open source project; I was not able to find the source code.

Does anyone have anything in stock?