connection reset by peer


bham

Jan 13, 2009, 6:39:54 PM
to cogen
Hello -

I'm just learning cogen and am seeing 'connection reset by peer'
errors when many connections are accepted in a short period of time;
the server seems to drop existing connections.

At roughly 10,000 client connections the clients start getting
dropped. Why would this be?

I'm playing around by modifying the echoc.py and echoserver.py
examples from trunk. I added a simple time.sleep(0.001) before the
socket.connect(..) and the problem went away, but this made me curious.

My system is running Python 2.6.1 on Mac OS X 10.5. I've verified
that cogen is using the kqueue proactor (nice!). Also, I'm sure that
my env is set up correctly w.r.t. open file limits and the like --
launchctl limit, sysctl (kern.max*), and ulimit (-n) are all
configured correctly.
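
(For reference, the per-process descriptor limit can also be checked
from Python itself; resource is in the standard library:)

import resource
# (soft, hard) limit on open file descriptors for this process
print resource.getrlimit(resource.RLIMIT_NOFILE)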

Here are the slightly modified files I'm using:

# echoserver.py

from cogen.core import sockets, schedulers, proactors
from cogen.core.coroutines import coroutine
import sys, socket

port = 1200

@coroutine
def server():
    srv = sockets.Socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    addr = ('0.0.0.0', port)
    srv.bind(addr)
    srv.listen(64)
    print "Listening on", addr
    while 1:
        conn, addr = yield srv.accept()
        m.add(handler, args=(conn, addr))

client_count = 0

@coroutine
def handler(sock, addr):
    global client_count
    client_count += 1
    print "SERVER: [connect] clients=%d" % client_count
    fh = sock.makefile()
    yield fh.write("WELCOME TO (modified) ECHO SERVER !\r\n")
    yield fh.flush()
    try:
        while 1:
            line = yield fh.readline(1024)
            #print `line`
            if line.strip() == 'exit':
                yield fh.write("GOOD BYE")
                yield fh.close()
                raise sockets.ConnectionClosed('goodbye')
            yield fh.write(line)
            yield fh.flush()
    except sockets.ConnectionClosed:
        pass
    fh.close()
    sock.close()
    client_count -= 1
    print "SERVER: [disconnect] clients=%d" % client_count

m = schedulers.Scheduler()
m.add(server)
m.run()


# echoc.py

import sys, os, traceback, socket, time
from cogen.common import *
from cogen.core import sockets

port, conn_count = 1200, 10000
clients = 0

@coroutine
def client(num):
    sock = sockets.Socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    reader = None
    connected = False
    try:
        try:
            # remove this sleep and we start to see
            # 'connection reset by peer' errors
            time.sleep(0.001) # <------------------ REMOVE ME AND FAIL.
            yield sock.connect(("127.0.0.1", port))
        except Exception:
            print 'Error in client # ', num
            traceback.print_exc()
            return
        global clients
        clients += 1
        connected = True
        print "CLIENT #=%d [connect] clients=%d" % (num, clients)
        reader = sock.makefile('r')
        while 1:
            line = yield reader.readline(1024)
    except sockets.ConnectionClosed:
        pass
    except:
        print "CLIENT #=%d got some other error" % num
    finally:
        if reader: reader.close()
        sock.close()
        if connected:  # only count down clients that actually connected
            clients -= 1
            print "CLIENT #=%d [disconnect] clients=%d" % (num, clients)

m = Scheduler()
for i in range(0, conn_count):
    m.add(client, args=(i,))
m.run()

bham

Jan 14, 2009, 3:17:30 AM
to cogen
On the server side, adding

srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

gets me to ~16000 client connections without issue.
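
(In case the option set on the listening socket isn't inherited by
accepted sockets on a given platform, the per-connection equivalent
would go at the top of handler() in the echoserver.py above; a sketch,
assuming accepted cogen sockets expose setsockopt like the listener:)

# disable Nagle's algorithm on each accepted connection instead
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)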

It then dawned on me that at this point I'm running out of ephemeral
ports on my Mac OS X box.

From sysctl:

net.inet.ip.portrange.hifirst: 49152
net.inet.ip.portrange.hilast: 65535

I suppose then, all is well really. I'm all ears if someone can
explain to me why *not* using TCP_NODELAY causes connections to be
dropped.

Ionel Maries Cristian

Jan 14, 2009, 4:43:04 AM
to co...@googlegroups.com
Hey,

I suppose it's really some sort of performance issue due to the fact that TCP_NODELAY disables Nagle's algorithm on the TCP connection.

Here's a good read on TCP_CORK/NODELAY that might shed some light on the problem: http://www.baus.net/on-tcp_cork


PS: I noticed you've used time.sleep in coroutine code when you should really be using cogen.core.events.Sleep instead.
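
For example, a minimal sketch against the client above (assuming the trunk API, where Sleep takes a timeout in seconds):

from cogen.core import events

# inside client(num), replacing the blocking time.sleep(0.001):
#     yield events.Sleep(0.001)  # suspends only this coroutine
#     yield sock.connect(("127.0.0.1", port))

time.sleep blocks the whole scheduler thread, stalling every coroutine; events.Sleep suspends just the calling one.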
--
ionel

bham

Jan 14, 2009, 10:48:40 AM
to cogen
On Jan 14, 4:43 am, Ionel Maries Cristian <ionel...@gmail.com> wrote:
> PS: I noticed you've used time.sleep in coroutine code when you should
> really be using cogen.core.events.Sleep instead.

I did an 'ack' for 'sleep' in the cogen trunk and didn't see that.
'ack -i' :)

bham

Jan 14, 2009, 5:05:39 PM
to cogen
I was able to bring up a few IPs on my Mac OS X box by duplicating the
entries in System Preferences and manually assigning IPs to use. Then
in my client I manually bound the sockets to each port in (1024, 65535)
on each such IP. Thus, I ran the server (in one process) and 75K
clients (in one process) on the same machine! The server process
ended up using about 10% of the CPU and only 150 MB resident memory.
This is not a typical setup of course, but it was a good first load
test of cogen and it was good to be able to do this all from one box.
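
The binding logic was roughly the following (a sketch; the alias
addresses and the helper are made up for illustration):

# hypothetical: spread clients over the port range of each alias IP
LOCAL_IPS = ['192.168.2.2', '192.168.2.3']  # manually assigned aliases
PORTS_PER_IP = 65535 - 1024

def local_addr(num):
    ip = LOCAL_IPS[num // PORTS_PER_IP]
    return (ip, 1024 + num % PORTS_PER_IP)

# inside client(num), before connecting:
#     sock.bind(local_addr(num))
#     yield sock.connect(('127.0.0.1', port))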

I did notice that the more clients I tried to connect, the longer it
seemed to take to connect new clients...

Brian


bham

Jan 14, 2009, 5:07:49 PM
to cogen
I should also mention that in an attempt to connect 100K clients, the
process stopped accepting clients at about 77K clients already
connected. The system had 2GB of memory available, with the server
process using about 10% CPU and the client process about 90% CPU.

Anyone have any tips for tuning this as a general matter?

Ionel Maries Cristian

Jan 15, 2009, 3:31:14 AM
to co...@googlegroups.com
Here's a guy's attempt to open 1M connections:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1/

Though he lists some kernel tweaks for the TCP stack, they aren't for Mac OS X; maybe you can infer the right settings from this general tuning guide: http://www.psc.edu/networking/projects/tcptune/

Then again, it looks like you just don't have enough ports on the client side, so maybe add some more interfaces for the client.

Also, read the 3rd part: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3/
--
ionel

Brian Hammond

Jan 15, 2009, 9:44:39 AM
to co...@googlegroups.com
Hey, thanks, I read that the other day. I am already using a couple of IPs (duplicate the AirPort entry in System Prefs and manually assign it an IP; easy). Note that I am at 77K connections, which is more than is possible on one client IP.

I will look into the kernel params for the Mac. I will be deploying on Linux, though, so we'll see how motivated I will be to tweak OS X. :)

I have looked at libevent and made a simple echo server in about 100 LOC. I want to see how much overhead Python adds. I cannot imagine it is very much, but I am curious.

Ionel Maries Cristian

Jan 15, 2009, 10:22:42 AM
to co...@googlegroups.com
On Thu, Jan 15, 2009 at 16:44, Brian Hammond <or.else.it.g...@gmail.com> wrote:
> Note that I am at 77K connections, which is more than is possible on
> one client IP.

About that, it's odd. Maybe some connections are silently dropped?

You're using the trunk version, right?

--
ionel

bham

Jan 15, 2009, 11:09:15 AM
to cogen
Yes, trunk. Not sure about the connections being dropped. I would
expect that the client side would notice it.

On Jan 15, 10:22 am, Ionel Maries Cristian <ionel...@gmail.com> wrote:
> About that, it's odd. Maybe some connections are silently dropped?
>
> You're using the trunk version, right?

bham

Jan 16, 2009, 2:11:47 AM
to cogen
On Jan 15, 9:44 am, Brian Hammond <or.else.it.gets.the.h...@gmail.com> wrote:
> I have looked at libevent and made a simple echo server in about 100
> LOC. I want to see how much overhead Python adds. I cannot imagine it
> is very much, but I am curious.

Boy was I wrong! I redid the client-connection test server and
client creator in C using libevent. I connected 100K clients on the
same box as the server process, manually binding each client socket
to ports 1024-65535 on 127.0.0.1 and then on the aliased IPs. All
100K clients were connected in under 20 seconds. The server used
only 25MB of memory!

The next test is of performance, but at this point we're no longer
talking about cogen.

Ionel Maries Cristian

Jan 16, 2009, 5:18:15 AM
to co...@googlegroups.com
Interesting, post some code.

Mind you, you can't really compare cogen to libevent since they are different things and have different purposes.

What are you trying to build anyway?
--
ionel

Ionel Maries Cristian

Jan 16, 2009, 8:32:13 AM
to co...@googlegroups.com
Besides OS settings, you can tweak several things in cogen to make opening 100k connections faster:
- remove the handler coroutines, since you don't want your server or client overloaded with coroutines waiting for nothing to happen
- parallelize the connect requests, e.g. add 100k coroutines, each one attempting a connect; otherwise you'll have sequential code (well, it looks like you did that)
- make the acceptor socket's listen queue larger (see the sketch below)
- tweak params to the scheduler/proactor; I need to document them better, but see: http://groups.google.com/group/cogen/msg/de9c40f994c1bb30
 
I've tried something similar and it's doable in a decent amount of time; see: http://cogen.googlecode.com/svn/trunk/examples/c100k/
I get 100k connections in 32 secs user time on a lousy virtual machine; it's not that bad.
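
For the listen-queue item, the change to the echoserver.py earlier in the thread would just be:

# ask for a larger accept backlog; many kernels silently clamp this
# to SOMAXCONN (often 128) unless the matching sysctl is raised
srv.listen(1024)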
--
ionel

Ionel Maries Cristian

Jan 16, 2009, 10:00:35 AM
to co...@googlegroups.com
Errr... I meant 32s real, not user time.
--
ionel

bham

Jan 16, 2009, 10:35:06 AM
to cogen
Thanks for the info, Ionel!

> - remove the handler coroutines, since you don't want your server or
> client overloaded with coroutines waiting for nothing to happen

I'm not sure what the alternative is then. How would you handle each
client connection? That's not what you are doing in your C100K
example.

> - make the acceptor socket's listen queue larger

I think the max (SOMAXCONN) on most systems is 128 and that's what I'm
using. I heard the backlog is capped at 128 in some listen()
implementations, but that's all hand-waving; I'm not really sure.

> - tweak params to the scheduler/proactor; I need to document them better,
> but see: http://groups.google.com/group/cogen/msg/de9c40f994c1bb30

Ohh, I'll try to dig in more.

> I've tried something similar and it's doable in a decent amount of time;
> see: http://cogen.googlecode.com/svn/trunk/examples/c100k/
> I get 100k connections in 32 secs user time on a lousy virtual machine;
> it's not that bad.

No, that's not bad at all!

Thanks, Brian

bham

Jan 16, 2009, 1:02:37 PM
to cogen


On Jan 16, 5:18 am, Ionel Maries Cristian <ionel...@gmail.com> wrote:
> Interesting, post some code.

It is just a typical "echo server" and client, pretty similar to this one:
http://www.cppblog.com/tx7do/archive/2007/08/21/30483.html

> Mind you, you can't really compare cogen to libevent since they are
> different things and have different purposes.

Well, at a high level both enable people to write high-performance,
scalable network services. What I miss in vanilla libevent is the
nice linear path for handling a client connection that I get in cogen
and other coroutine-based frameworks.

> What are you trying to build anyway?

A multiplayer game server for turn-based games, somewhat similar to the
service that runs freechess.org (but not for chess).

Thanks.

Ionel Maries Cristian

Jan 17, 2009, 7:15:26 AM
to co...@googlegroups.com
On Fri, Jan 16, 2009 at 17:35, bham <or.else.it.g...@gmail.com> wrote:
> Thanks for the info, Ionel!
>
> > - remove the handler coroutines, since you don't want your server or
> > client overloaded with coroutines waiting for nothing to happen
>
> I'm not sure what the alternative is then. How would you handle each
> client connection? That's not what you are doing in your C100K
> example.

I'm saying remove the handlers to isolate the specific microbenchmark you attempted: you were wondering why things got slow compared to the pure libevent test, which didn't load the kqueue with notification requests after accepting a connection (well, I suppose). In other words, the other 77k notification requests in the kqueue were slowing down the single important one, the notification request for the acceptor socket.

Well, regardless, in the end you still want to pass some data over those 100k connections, so you still need to find ways to make the kqueue faster.
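
A sketch of what I mean (an accept-only variant of the server
coroutine from earlier in the thread, so the kqueue holds only the
acceptor's notification request):

conns = []

@coroutine
def server():
    srv = sockets.Socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('0.0.0.0', port))
    srv.listen(1024)
    while 1:
        conn, addr = yield srv.accept()
        conns.append(conn)  # just hold the socket open; no handler coroutine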


--
ionel
