
High performance IO on non-blocking sockets


Troels Walsted Hansen

Mar 14, 2003, 5:42:54 AM
I'm trying to do IO on non-blocking sockets (within the asyncore framework),
and it seems to me that Python lacks a few primitives that would make this
more efficient.

Let's begin with socket writes. Assume that I have a string called self.data
that I want to send on a non-blocking socket. This string can be anywhere
from 1 to hundreds of megabytes.

This is a classic approach, seen in many Python examples:

sent = self.socket.send(self.data)
self.data = self.data[sent:]

This approach is bad because self.data gets reallocated for every socket
send. Worst case, 1 byte is sent each time and the realloc+copy cost goes
through the roof.

send_size = 32*1024 # for example
send_buffer_end = self.offset + send_size
send_buffer = self.data[self.offset:send_buffer_end]
sent = self.socket.send(send_buffer)
self.offset += sent

This is the most efficient approach that I've been able to come up with. For
small strings, send_buffer is just a reference to self.data. For large
strings, send_size needs to be tuned to match socket buffers so we don't do
unnecessary copying.

It is still sub-optimal though. I think the optimal approach requires a
small extension of the socket module API, specifically an "offset" parameter
to socket.send().

sent = self.socket.send(self.data, self.offset)
self.offset += sent

This would be trivial to implement in socketmodule.c, and eliminate all
unnecessary slicing and copying on socket sends. I'll be glad to submit a
patch for this if the Python gods think this could be added to Python 2.3.

***

Now for recv operations on non-blocking sockets. Assume that I want to read
a known number of bytes (total_recv_size) from a socket and assemble the
result as a Python string called self.data (again, think anywhere from 1 to
hundreds of megabytes of data).

Approach #1 (list+string.join based):

self.data = []
...
# following code runs when socket is read-ready
recv_size = 64*1024 # for example
data = self.socket.recv(recv_size)
self.data.append(data)
...
self.data = ''.join(self.data)

Approach #2 (cStringIO):

self.data = cStringIO.StringIO()
...
# following code runs when socket is read-ready
recv_size = 64*1024 # for example
data = self.socket.recv(recv_size)
self.data.write(data)
...
self.data = self.data.getvalue()

Approach #3 (cStringIO with pre-allocation):

self.data = cStringIO.StringIO()
# make cStringIO allocate correctly sized buffer
self.data.seek(total_recv_size)
self.data.seek(0)
...
# following code runs when socket is read-ready
recv_size = 64*1024 # for example
data = self.socket.recv(recv_size)
self.data.write(data)
...
self.data = self.data.getvalue()

All three approaches have faults. #1 can fragment memory by allocating a
separate string for every recv() (in the unlikely worst case, recv() returns
1 byte at a time, producing total_recv_size tiny strings). #2 will
reallocate and copy the buffer every time it fills up (cStringIO doubles the
buffer on each realloc). #3 pre-allocates a sufficiently large buffer, but
cStringIO needs to initialize the whole buffer to 0 in order to support
random reads.

All approaches will use at least total_recv_size*2 bytes of memory at the
moment the final Python string is created.

Have I overlooked any better approaches?

I'm not sure what the best solution would be. Perhaps modifying cStringIO to
support an initial size-hint that pre-allocates a buffer without
initializing it until it is needed.
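
Something like this, hypothetically (size_hint is an imagined argument,
not part of the real cStringIO API):

# imagined API: reserve total_recv_size bytes up front, without zeroing them
self.data = cStringIO.StringIO(size_hint=total_recv_size)
...
# following code runs when socket is read-ready
data = self.socket.recv(recv_size)
self.data.write(data)  # lands in the reserved buffer, no realloc
...
self.data = self.data.getvalue()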

--
Troels Walsted Hansen


Richie Hindle

Mar 14, 2003, 8:42:05 AM

[Troels]

> sent = self.socket.send(self.data)
> self.data = self.data[sent:]
>
> This approach is bad [...]

From http://www.python.org/doc/current/lib/socket-objects.html :

------------------------------- snip snip -------------------------------

7.2.1 Socket Objects

[...]

sendall( string[, flags])

Send data to the socket. The socket must be connected to a remote socket.
The optional flags argument has the same meaning as for recv() above.
Unlike send(), this method continues to send data from string until either
all data has been sent or an error occurs. None is returned on success. On
error, an exception is raised, and there is no way to determine how much
data, if any, was successfully sent.

------------------------------- snip snip -------------------------------

Nothing springs to mind for the recv case (but the problems are lesser in
that case).

--
Richie Hindle
ric...@entrian.com

Troels Walsted Hansen

Mar 14, 2003, 9:22:01 AM
"Richie Hindle" <ric...@entrian.com> wrote in message
news:lgm37vsi6631fimch...@4ax.com...
> From http://www.python.org/doc/current/lib/socket-objects.html :
> 7.2.1 Socket Objects

> sendall( string[, flags])

I should have mentioned that in my original post... sendall() doesn't work
for non-blocking sockets. :(

The good news is that Kjetil Jacobsen came up with the perfect solution for
send().

send_buffer = buffer(self.data, self.offset)
sent = self.socket.send(send_buffer)
self.offset += sent

No copying, and all writes are as big as the socket buffer can hold. Cost of
allocating a buffer object should be very minor in the grand scheme of
things.
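
As a minimal sketch, an asyncore-style write handler built on this looks
something like (assuming self.data and self.offset are set when the send
is queued):

def handle_write(self):
    sent = self.socket.send(buffer(self.data, self.offset))
    self.offset += sent
    if self.offset == len(self.data):
        self.data, self.offset = '', 0  # everything sent; reset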

> Nothing springs to mind for the recv case (but the problems are lesser in
> that case).

2*total_recv_size memory consumption is 1*total_recv_size too much. :)

Working on this one...

--
Troels Walsted Hansen


Dave Brueck

Mar 14, 2003, 12:01:41 PM
On Fri, 14 Mar 2003, Richie Hindle wrote:

>
> [Troels]
> > sent = self.socket.send(self.data)
> > self.data = self.data[sent:]
> >
> > This approach is bad [...]
>
> From http://www.python.org/doc/current/lib/socket-objects.html :
>
> ------------------------------- snip snip -------------------------------
>
> 7.2.1 Socket Objects
>
> [...]
>
> sendall( string[, flags])
>
> Send data to the socket. The socket must be connected to a remote socket.
> The optional flags argument has the same meaning as for recv() above.
> Unlike send(), this method continues to send data from string until either
> all data has been sent or an error occurs. None is returned on success. On
> error, an exception is raised, and there is no way to determine how much
> data, if any, was successfully sent.
>
> ------------------------------- snip snip -------------------------------

Nope - the OP needs something for non-blocking sockets.

-Dave

Jeremy Hylton

Mar 14, 2003, 11:36:37 AM
On Fri, 2003-03-14 at 05:42, Troels Walsted Hansen wrote:
> I'm trying to do IO on non-blocking sockets (within the asyncore framework),
> and it seems to me that Python lacks a few primitives that would make this
> more efficient.
>

There's another source of inefficiency that has always bugged me about
the asyncore framework. There are many layers of Python code between
the send() method you call on an aysncore socket and the send() call in
C. A lot of these layers seem unnecessary.

There's a method on the socket like so:

def send(self, data):
    try:
        result = self.socket.send(data)
        return result
    except socket.error, why:
        if why[0] == EWOULDBLOCK:
            return 0
        else:
            raise socket.error, why
        return 0

It seems like it would be much simpler to have a non-blocking socket
implemented in C that took a single argument and returned a 2-tuple of
data and errno. The C implementation could use METH_O instead of
METH_VARARGS, and the caller wouldn't need to use a try/except.
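
Hypothetically, the calling code would then shrink to something like this
(nonblocking_send is an imagined method, not a real API):

sent, err = self.socket.nonblocking_send(data)  # imagined C-level method
if err and err != EWOULDBLOCK:
    raise socket.error, (err, os.strerror(err))  # assumes errno/os imports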

Jeremy

Dave Brueck

Mar 14, 2003, 12:45:32 PM
On Fri, 14 Mar 2003, Troels Walsted Hansen wrote:

> I'm trying to do IO on non-blocking sockets (within the asyncore framework),
> and it seems to me that Python lacks a few primitives that would make this
> more efficient.
>
> Let's begin with socket writes. Assume that I have a string called self.data
> that I want to send on a non-blocking socket. This string can be anywhere
> from 1 to hundreds of megabytes.

Yes, although with a little analysis you can generally know whether your
write is closer to 1 byte or hundreds of megabytes, and take different
approaches accordingly - see below.

> This is a classic approach, seen in many Python examples:
>
> sent = self.socket.send(self.data)
> self.data = self.data[sent:]
>
> This approach is bad because self.data gets reallocated for every socket
> send. Worst case, 1 byte is sent each time and the realloc+copy cost goes
> through the roof.

Yes - to achieve truly high performance (BTW - how high do you need?) you
need to pay attention to what sort of writing you're doing. Is it
HTTP-like traffic or a custom protocol? How many simultaneous connections
do you need to support? Are they likely to be LAN-speed connections,
DSL-speed, modem, or some mix?

At the company I work for we have several different custom HTTP servers,
and we saw huge performance gains when we started grouping the types of
I/O according to size and acting on them differently. For example, in the
hundreds of megabytes (or even half a megabyte) cases, it's likely that
the data you're writing is coming off the disk. Our servers primarily run
on Linux, so we created a tiny C extension module that calls the sendfile
API and in cases where there's a large chunk of data coming off disk we
call sendfile so that the data never even makes it to Python (or our
process memory space, for that matter). On platforms without a sendfile C
API the call gets routed to a simulated sendfile (all Python) instead.
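
(A rough sketch of that dispatch, assuming an os.sendfile-style call is
available; our real code is a custom C module, so treat every name here
as illustrative:)

import os

def send_big_chunk(sock, f, offset, count):
    try:
        # zero-copy path: the kernel moves file data straight to the socket
        return os.sendfile(sock.fileno(), f.fileno(), offset, count)
    except (AttributeError, OSError):
        # "simulated sendfile": plain read + send fallback
        f.seek(offset)
        return sock.send(f.read(min(count, 64*1024)))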

Anyway, with sendfile we hit some crazy performance levels - a PII (<500
MHz) easily sustains 300 Mbps throughput for example for hundreds of
simultaneous DSL-like connections, and a PIII (~900 MHz) has passed 1.5
Gbps over the loopback adapter.

For our work, it's quite unlikely that we _ever_ send out 1 byte of
anything, but we do see lots of cases (like building HTTP response
headers) where there's lots of little chunks. In those situations we build
up a list of little strings, ''.join() them, and send them out as one
chunk.

One idea we've considered but not pursued is using buffer() objects to
avoid the send-a-piece-then-copy-the-substring problem you identified. We
haven't gone down that path very far yet because sendfile has helped
immensely, and because we keep our outgoing queues as lists of strings
until right before sending, at which point we join just enough of them to
create a string large enough to fill the socket's output buffer, but
(hopefully) not too much more.
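
(Roughly, with queue as the outgoing list of strings and SNDBUF as the
socket's send-buffer size -- both names invented for the sketch:)

chunks, size = [], 0
while queue and size < SNDBUF:
    s = queue.pop(0)
    chunks.append(s)
    size += len(s)
out = ''.join(chunks)
sent = sock.send(out)
if sent < len(out):
    queue.insert(0, out[sent:])  # push the unsent tail back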

One other thing: we stopped using asyncore/asynchat early on because it
was too low level and too slow for what we needed (although it works fine
for many other uses). We rolled our own asynch. socket framework and in
the process got to make something that works particularly well for HTTP
traffic.

> Now for recv operations on non-blocking sockets.

The recv side of things has always been slow (relatively speaking) for us,
second only to proxying (which, of course, relies directly on our recv
code), so I'd be really interested in any insights you have here!

> Assume that I want to read a known number of bytes (total_recv_size)
> from a socket and assemble the result as a Python string called
> self.data (again, think anywhere from 1 to hundreds of megabytes of
> data).

Again, though, the approach to how you read the data can benefit if you
can give hints on what you'll do with it.

For example, when we're proxying between two sockets we leave the data in
a list of chunks because our sending code can use it in that form anyway.
When we're receiving an upload, we don't really want a buffer the size of
the entire upload in memory anyway because we're going to be tossing the
data to disk.

Still though, I do wish there was a better way to do the receives because
even with leaving the data in a chunked list our proxying is slow and the
primary bottleneck appears to be the recv side of things.

> self.data = []
> ...
> # following code runs when socket is read-ready
> recv_size = 64*1024 # for example
> data = self.socket.recv(recv_size)
> self.data.append(data)
> ...
> self.data = ''.join(self.data)

This is the approach we use, except that we never do the final ''.join
(well, our framework doesn't, the application might if it makes sense)
because as a list the data is in suitable form for writing to disk or
handing off to the send code.
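
(For example, the chunk list can go straight to disk without ever being
joined; upload_path is just an illustrative name:)

f = open(upload_path, 'ab')
f.writelines(self.data)  # each chunk is written in turn; no big join needed
f.close()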

> Have I overlooked any better approaches?

Please let me know when you discover the magic solution. I'd sure like to
know what it is. :)

-Dave

Jeremy Hylton

Mar 14, 2003, 11:38:15 AM
On Fri, 2003-03-14 at 05:42, Troels Walsted Hansen wrote:
> Now for recv operations on non-blocking sockets. Assume that I want to read
> a known number of bytes (total_recv_size) from a socket and assemble the
> result as a Python string called self.data (again, think anywhere from 1 to
> hundreds of megabytes of data).
>
> Approach #1 (list+string.join based):
>
> self.data = []
> ...
> # following code runs when socket is read-ready
> recv_size = 64*1024 # for example
> data = self.socket.recv(recv_size)
> self.data.append(data)
> ...
> self.data = ''.join(self.data)
>
> All three approaches have faults. #1 can fragment memory by allocating a
> separate string for every recv() (in the unlikely worst case, recv()
> returns 1 byte at a time).

What's the expected case?

Jeremy

Jp Calderone

Mar 14, 2003, 7:32:34 PM
On Fri, Mar 14, 2003 at 03:22:01PM +0100, Troels Walsted Hansen wrote:
> "Richie Hindle" <ric...@entrian.com> wrote in message
> news:lgm37vsi6631fimch...@4ax.com...
> > From http://www.python.org/doc/current/lib/socket-objects.html :
> > 7.2.1 Socket Objects
> > sendall( string[, flags])
>
> I should have mentioned that in my original post... sendall() doesn't work
> for non-blocking sockets. :(
>
> The good news is that Kjetil Jacobsen came up with the perfect solution for
> send().
>
> send_buffer = buffer(self.data, self.offset)
> sent = self.socket.send(send_buffer)
> self.offset += sent
>
> No copying, and all writes are as big as the socket buffer can hold. Cost of
> allocating a buffer object should be very minor in the grand scheme of
> things.

Have you timed this, vs the original, naive code? Slicing a string
shouldn't copy any bytes, only create a new string object with a modified
starting pointer and length. This seems as if it would be about as
expensive as creating a new buffer object, but has the advantage of running
more of the work in C, rather than Python (no name lookup for buffer, for
example).

I ask because I can't seem to squeeze any speedup out of Twisted by making
this change (in fact, I find a significant slowdown, 12309.8 KB/s using the
original "buf = buf[sent:]" code to 9408.2 KB/s using the buffer()
approach).

I'm hoping I've just screwed something up, of course, and would love to
hear that the buffer() approach is, in fact, much faster :)

Jp

--
Somewhere, something incredible is waiting to be known.
-- Carl Sagan
--
up 11 days, 15:59, 6 users, load average: 0.13, 0.50, 0.42

Troels Walsted Hansen

Mar 15, 2003, 6:33:57 AM
Jp Calderone wrote:
>> send_buffer = buffer(self.data, self.offset)
>> sent = self.socket.send(send_buffer)
>> self.offset += sent
> Have you timed this, vs the original, naive code?

I have to admit that I haven't timed it. I've only looked at the Python
source and based my statements on that.

> Slicing a string
> shouldn't copy any bytes, only create a new string object with a modified
> starting pointer and length.

I believe this is wrong, at least for Python 2.2.2 which I'm working
with. Only if the string slice is the same as the original string do you
get an optimized slice without any copying. See source snippet below.

The buffer object is the one that implements read-only slices in the
manner that you describe.

static PyObject *
string_slice(register PyStringObject *a, register int i, register int j)
{
    [...]
    if (i == 0 && j == a->ob_size && PyString_CheckExact(a)) {
        /* It's the same as a */
        Py_INCREF(a);
        return (PyObject *)a;
    }
    [...]
    return PyString_FromStringAndSize(a->ob_sval + i, (int) (j-i));
}

> This seems as if it would be about as
> expensive as creating a new buffer object, but has the advantage of running
> more of the work in C, rather than Python (no name lookup for buffer, for
> example).
>
> I ask because I can't seem to squeeze any speedup out of Twisted by making
> this change (in fact, I find a significant slowdown, 12309.8 KB/s using the
> original "buf = buf[sent:]" code to 9408.2 KB/s using the buffer()
> approach).
>
> I'm hoping I've just screwed something up, of course, and would love to
> hear that the buffer() approach is, in fact, much faster :)

Your numbers are surprising and very interesting. Would you care to test
a modified send loop with the Twisted framework for me?

if self.offset:
    sent = self.socket.send(buffer(self.data, self.offset))
else:
    sent = self.socket.send(self.data)
self.offset += sent

The idea here is to avoid the cost of creating a buffer object for short
sends that fit into the kernel's socket buffer.

How large is self.data in your test?

--
Troels Walsted Hansen

Troels Walsted Hansen

Mar 15, 2003, 7:53:16 AM

Jeremy Hylton wrote:
> What's the expected case?

Well... I guess that really depends on the speed of the network and the
CPU, as well as the interval between calls to recv().

--
Troels Walsted Hansen

Troels Walsted Hansen

Mar 15, 2003, 8:15:55 AM
Dave Brueck wrote:
> Yes - to achieve truly high performance (BTW - how high do you need?) you
> need to pay attention to what sort of writing you're doing. Is it
> HTTP-like traffic or a custom protocol? How many simultaneous connections
> do you need to support? Are they likely to be LAN-speed connections,
> DSL-speed, modem, or some mix?

XML-RPC over HTTP. You don't need to tell me that XML-RPC is unsuited
for large payloads, I'm painfully aware of that fact. :)

> At the company I work for we have several different custom HTTP servers,
> and we saw huge performance gains when we started grouping the types of
> I/O according to size and acting on them differently. For example, in the
> hundreds of megabytes (or even half a megabyte) cases, it's likely that
> the data you're writing is coming off the disk. Our servers primarily run
> on Linux, so we created a tiny C extension module that calls the sendfile
> API and in cases where there's a large chunk of data coming off disk we
> call sendfile so that the data never even makes it to Python (or our
> process memory space, for that matter). On platforms without a sendfile C
> API the call gets routed to a simulated sendfile (all Python) instead.
>
> Anyway, with sendfile we hit some crazy performance levels - a PII (<500
> MHz) easily sustains 300 Mbps throughput for example for hundreds of
> simultaneous DSL-like connections, and a PIII (~900 MHz) has passed 1.5
> Gbps over the loopback adapter.

sendfile() is indeed great if your data is coming off a disk.

> For our work, it's quite unlikely that we _ever_ send out 1 byte of
> anything, but we do see lots of cases (like building HTTP response
> headers) where there's lots of little chunks. In those situations we build
> up a list of little strings, ''.join() them, and send them out as one
> chunk.
>
> One idea we've considered but not pursued is using the buffer() objects to
> avoid the send-a-piece-then-copy-the-substring problem you identified.
> We haven't gone down that path too far yet because sendfile has helped
> immensely and we maintain our outgoing queues as lists of strings that we
> keep as a list until right before sending, at which time we combine enough
> of them to create a string large enough to fill the output buffer of the
> socket, but (hopefully) not too much more.

Another thing to watch out for is too many substrings on the list. You
can quite easily fragment the address space of the process and prevent
the malloc library from shrinking the address space when memory is
freed. Some OSs seem more sensitive to this than others...

> Again, though, the approach to how you read the data can benefit if you
> can give hints on what you'll do with it.
>
> For example, when we're proxying between two sockets we leave the data in
> a list of chunks because our sending code can use it in that form anyway.
> When we're receiving an upload, we don't really want a buffer the size of
> the entire upload in memory anyway because we're going to be tossing the
> data to disk.
>
> Still though, I do wish there was a better way to do the receives because
> even with leaving the data in a chunked list our proxying is slow and the
> primary bottleneck appears to be the recv side of things.

Yeah, there's quite a bit of copying going on in recv() unfortunately.

A more optimal approach might include a modified socket.recv() that
writes directly to a buffer object. The buffer object could have
cStringIO semantics with support for pre-allocation hints (for the times
when you know total_recv_size) and dynamic expansion through
reallocation (for the times when you don't know total_recv_size).
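
Sketched concretely (recv_into() and resizable_buffer() are assumed
primitives here, invented for illustration -- nothing in the current
socket module offers them):

self.buf = resizable_buffer(total_recv_size)  # assumed: reserved, not zeroed
self.offset = 0
...
# following code runs when socket is read-ready
n = self.socket.recv_into(self.buf, self.offset)  # assumed: fills buf in place
self.offset += n
...
self.data = str(self.buf)  # at most one copy, at the very end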

> This is the approach we use, except that we never do the final ''.join
> (well, our framework doesn't, the application might if it makes sense)
> because as a list the data is in suitable form for writing to disk or
> handing off to the send code.

Unfortunately I need the complete data in order to parse and decode the
XML-RPC payload. :(

--
Troels Walsted Hansen

Jp Calderone

Mar 15, 2003, 11:41:33 AM
On Sat, Mar 15, 2003 at 12:33:57PM +0100, Troels Walsted Hansen wrote:
> Jp Calderone wrote:
> >> send_buffer = buffer(self.data, self.offset)
> >> sent = self.socket.send(send_buffer)
> >> self.offset += sent
> > Have you timed this, vs the original, naive code?
>
> I have to admit that I haven't timed it. I've only looked at the Python
> source and based my statements on that.
>
> > Slicing a string shouldn't copy any bytes, only create a new string
> > object with a modified starting pointer and length.
>
> I believe this is wrong, at least for Python 2.2.2 which I'm working
> with. Only if the string slice is the same as the original string do you
> get an optimized slice without any copying. See source snippet below.
>
> The buffer object is the one that implements read-only slices in the
> manner that you describe.
>
> static PyObject *
> string_slice(register PyStringObject *a, register int i, register int j)
> {
>     [...]
>     if (i == 0 && j == a->ob_size && PyString_CheckExact(a)) {
>         /* It's the same as a */
>         Py_INCREF(a);
>         return (PyObject *)a;
>     }
>     [...]
>     return PyString_FromStringAndSize(a->ob_sval + i, (int) (j-i));
> }
>

Woops, I misinterpreted this code.

> > This seems as if it would be about as expensive as creating a new buffer
> > object, but has the advantage of running more of the work in C, rather
> > than Python (no name lookup for buffer, for example).
> >
> > I ask because I can't seem to squeeze any speedup out of Twisted by
> > making this change (in fact, I find a significant slowdown, 12309.8 KB/s
> > using the original "buf = buf[sent:]" code to 9408.2 KB/s using the
> > buffer() approach).
> >
> > I'm hoping I've just screwed something up, of course, and would love to
> > hear that the buffer() approach is, in fact, much faster :)
>
> Your numbers are surprising and very interesting. Would you care to test
> a modified send loop with the Twisted framework for me?
>
> if self.offset:
>     sent = self.socket.send(buffer(self.data, self.offset))
> else:
>     sent = self.socket.send(self.data)
> self.offset += sent
>
> The idea here is to avoid the cost of creating a buffer object for short
> sends that fit into the kernel's socket buffer.
>

Good call. This approach does show a speedup. With the same
settings as the previous numbers, the throughput rate rises to 12750 KB/s.

> How large is self.data in your test?

I tried using three different file sizes (0.5MB, 10MB, 100MB), but since I
used the standard Twisted.web server, these got chunked up into 65KB pieces.
I thought that this limited buffer size would reduce the effectiveness of
this optimization, so in addition to the change you suggested above, I
benchmarked the server with a ~4MB chunk size instead. This showed only a
very minor further speedup, 12778 KB/s, possibly within the margin of
error of the benchmark.

For my edification, what are common sizes for kernel socket buffers, or
does it vary too widely to answer for anything but specific systems?
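
(getsockopt will report it for any particular socket --

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)  # send buffer, bytes

-- but I'm curious what typical defaults look like.)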

Jp

--
"One World, one Web, one Program." - Microsoft(R) promotional ad
"Ein Volk, ein Reich, ein Fuhrer." - Adolf Hitler
--
up 12 days, 7:59, 8 users, load average: 0.12, 0.17, 0.24
