Asynchronous appservers


Garrett Smith

Nov 30, 2008, 2:06:36 PM
to cogen
This is kind of a general question about using asynchronous concepts
to build application servers. It's long and should probably be a blog-
type posting, but it's geared toward this list, so I figured I'd send
it this way.

Async web servers like Nginx and Yaws have gotten a lot of attention
lately due to their performance and support for very high levels of
concurrency. The natural evolution is to use these servers to somehow
super-charge traditional web applications (e.g. the sort built with
Django, Pylons, etc.)

The obvious problem with this approach is that asynchronous servers
don't rely on preemptive multitasking for concurrency. This means that
request handlers must voluntarily yield processing time to avoid
blocking other requests. Since a typical web app doesn't do this,
hooking it up to an asynchronous front end will end up serializing all
requests. This is clearly not a good formula for supporting high
levels of concurrency.
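
To make that concrete, here's a minimal, framework-agnostic sketch of
the two styles. It's purely illustrative -- the sleeps just stand in
for real per-item work:

import time

def ordinary_handler(n):
    # The usual style: does all its work in one go. Behind an async
    # front end, every other request waits until this returns.
    total = 0
    for i in range(n):
        time.sleep(0.001)   # stand-in for per-item work (db row, render step...)
        total += i
    return total

def cooperative_handler(n):
    # Same work, but control returns to the scheduler between slices,
    # so other requests can make progress in the meantime.
    total = 0
    for i in range(n):
        time.sleep(0.001)
        total += i
        yield               # the voluntary yield the async model depends on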

Why can't we magically make our applications faster and more scalable
using the goodness of this new breed of server?

Servers like Nginx that perform a well-defined set of tasks (e.g.
static files, redirects, proxying, etc.) can be optimized to handle
huge loads by explicitly coordinating their work across requests. They
do this by making sure that they only do a very small amount of work
at a time and then religiously yielding to a scheduler.

To build a similarly killer application server, the application
_itself_ has to work this way. It can't blindly proceed doing its
thing. It must carefully and frequently yield so that others can
perform work.
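
For reference, the scheduler side of that contract is almost
embarrassingly simple -- a toy round-robin loop over generators along
these lines:

from collections import deque

def run(tasks):
    # Every task does one small slice of work per step and then yields;
    # the scheduler just round-robins over the live tasks.
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # one small unit of work
        except StopIteration:
            continue            # finished; drop it
        queue.append(task)      # still alive; back of the line

def handler(name, steps):
    for i in range(steps):
        print('%s handling step %d' % (name, i))
        yield                   # religiously hand control back

run([handler('req-1', 3), handler('req-2', 3)])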

But this is hard, because the entire application stack has to be
refactored to use this approach. That means addressing potential
blocking bottlenecks in database access, template parsing/processing,
controller logic, and calls to external systems.

So here's what I'm getting to. First, have I missed something above?
Is it not necessary to refactor the _entire_ application stack to
really benefit from the asynchronous model? At a minimum, we need to
ensure that no single operation can take a long time.

Second, cogen (and similar servers like twisted, etc.) are often
billed as best suited for applications that demand huge levels of
concurrency with low activity per connection (e.g. like chat
applications, etc.). However, given an application stack that smartly
yields and leverages events where appropriate, could not cogen be used
as a killer platform for traditional web applications?

Here's my motivation:

* Avoid threads - these scale poorly and wreak havoc on the stability
of an application at even modest levels of load.

* Avoid forking (i.e. process-level concurrency) - this consumes
ridiculous amounts of memory as the number of concurrent requests
builds -- and that means buying/running more servers.

* More easily avoid contention for shared resources (memory, disk,
network IO, etc.) - it's not a panacea, but cooperative multitasking
is a hell of a lot easier to manage than preemptive multitasking when
it comes to accessing shared resources.

The net result is that applications should behave themselves at
extremely high levels of load. Rather than getting errors, users will
simply wait longer for a response. The problem of scale then becomes a
function of response time, not reliability. This is a huge win, as
typical web applications experience load only at specific points in
time. If the result is merely that users have to wait longer during
those peaks (as opposed to getting 500 errors), web shops could get by
with far less hardware.

As it is today, most apps run at 3x to 5x the capacity they need just
to make sure their bases are covered. If apps were better behaved,
they could cut way back on opex, assuming occasional wait times are
acceptable (and in many cases, they are). That's real money, given
that a typical production server costs $800+ per month to host!

Practically, if I'm not smoking crack, the implications for this
project could be along these lines:

* cogen could use some examples that illustrate how to handle a
series of typical web app operations such as database reads/writes,
template processing, etc. These could serve as a baseline for building
a real web application.

* cogen could be stress tested to prove the benefits discussed above.
(E.g. I've used curl-loader to test cogen, nginx, apache, yaws, and
lighttpd with 10K concurrent requests over several minutes, but this
was only for serving trivial content.)

* There might be a need for some async-enabled modules that "play
nice" in cogen or perhaps a generalized breed of async servers.

Okay, with all that...thoughts?

Ionel Maries Cristian

Nov 30, 2008, 2:53:33 PM
to co...@googlegroups.com
Very good writeup - blogpost worthy.

Speaking of typical examples - well, cogen is not for your typical web app. Most DB connectors do not have an async interface, so I didn't bother making a solution like twisted.enterprise.adbapi (thread-pooled wrappers around blocking DB interfaces). However, psycopg2, for example, has an async interface ( http://initd.org/svn/psycopg/psycopg2/trunk/doc/async.txt ) - something could be done in cogen.
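
For illustration, driving psycopg2's async mode by hand looks roughly like this. (The keyword and POLL_* constants shown are from later psycopg2 releases, so treat this as an approximation of the interface in that link, not as its exact API.)

import select
import psycopg2
import psycopg2.extensions

def wait(conn):
    # Loop until psycopg2 reports the pending operation is complete,
    # blocking only in select() on the connection's own fd. Under a
    # coroutine scheduler, this select() would instead be a yield.
    while True:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            return
        elif state == psycopg2.extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        elif state == psycopg2.extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])

conn = psycopg2.connect('dbname=test', async_=1)  # spelled async=1 in older releases
wait(conn)                     # even connecting completes asynchronously
cur = conn.cursor()
cur.execute('SELECT 42')       # returns immediately; the query runs in the background
wait(conn)
print(cur.fetchone())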

Templates should not usually be a problem - unless you're outputting huge amounts of content. Using a template engine that streams its output should solve this - though I have yet to get a chance to investigate.
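
The idea, as a hand-rolled sketch (not any particular engine's API):

def render_rows(title, rows):
    # Emit the page piece by piece instead of building one huge string;
    # an async server regains control between every chunk it writes.
    yield '<html><body><h1>%s</h1><table>' % title
    for row in rows:
        yield '<tr><td>%s</td></tr>' % row
    yield '</table></body></html>'

def app(environ, start_response):
    # WSGI already allows returning any iterable, so the generator
    # streams straight through the server.
    start_response('200 OK', [('Content-Type', 'text/html')])
    return render_rows('big listing', range(100000))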

On the other hand, there are some typical async app examples in cogen (the webchat and the web IRC client - done in the Pylons WSGI stack), found in the examples dir. Hopefully in the future there will be more elaborate examples - or, heck, even real apps.

Speaking of stress tests - do you have any results? It would be very interesting to see what you found out :)

There has been some discussion about making async apps work better (or at least in a unified way) in a WSGI stack: http://mail.python.org/pipermail/web-sig/2008-May/thread.html#3401 - the discussion died off at some point, but the ideas are there.
--
ionel

Erik G

Dec 2, 2008, 2:37:54 AM
to cogen
Hi Garrett,

You created a fine summary of the advantages of async I/O and
related generator stuff in Python. I agree with what you say, and I
have been pursuing the same goals for years now. I created something
very much like cogen, which I called Weightless (http://weightless.io).
I use it in production software: search engines, data repositories,
and portals. Here are a few things I learned:

The performance is indeed fantastic: we run search engines on small
VMs that outperform large installations running traditional software.

Although the Reactor/Scheduler serializes all requests and calls to
coroutines, these coroutines just use blocking open() and read().
With siege applying a realistic load taken from the logs, we were not
able to make the server go into 'wait' state. It seems very, very hard
to do any better than buffered I/O, on Linux at least. (The recent
trunk of Weightless contains non-blocking I/O for file descriptors
other than the main one, but that code is not yet in production.)

When dealing with portals, a delay of a few seconds goes unnoticed
amid internet network latency. We ran a server with heavy uploading,
storing, and indexing going on, but browsing remained fast and
natural. There really were lots of 1-3 second blocking operations
going on.

So we are quite happy to use async, coroutine-based I/O without being
too strict about non-blocking operations. It delivers a huge
performance upgrade anyway.

But as for your main question about using async servers to super-charge
traditional applications: I did not discover a way to do that. It is
what Twisted tries to do, but there every write() to the socket is
first buffered, and it is common knowledge, I believe, that well-tuned
servers are often limited precisely because they are busy copying data
over and over again. This is where the generators come in: the yield
expression allows for rescheduling the handler. Also, a generator
that yields must be called properly, so that the yield propagates
through the call stack until it reaches the reactor/scheduler. So for
generators/coroutines to work, your whole application must be
decomposed into generators. This has quite a fundamental impact. See
http://weightless.io/compose for how I believe program decomposition
with generators is possible.
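
The essence of that composition, as a minimal illustration -- this is
not Weightless' actual compose, just the idea of a trampoline that
lets yields travel up a stack of generators to the scheduler:

def compose(generator):
    # Walk a stack of generators, descending into any generator that a
    # generator yields, and passing plain values out to the caller.
    stack = [generator]
    to_send = None
    while stack:
        try:
            result = stack[-1].send(to_send)
        except StopIteration:
            stack.pop()
            to_send = None
            continue
        if hasattr(result, 'send'):   # a nested generator: descend into it
            stack.append(result)
            to_send = None
        else:                         # a plain value: pass it up to the scheduler
            to_send = yield result

def inner():
    yield 'one'
    yield 'two'

def outer():
    yield inner()                     # delegate without losing the yields
    yield 'three'

print(list(compose(outer())))         # prints ['one', 'two', 'three']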

But there is hope! Many initiatives like Cogen and Weightless are
showing up, and more and more people become aware of it!

Best regards
Erik


Garrett Smith

Dec 4, 2008, 12:01:51 AM
to cogen
Anecdotally, here's what I saw:

* cogen had the lowest relative number of page errors (i.e. dropped
connections or blank pages) of any of the servers
* cogen's response times were quite variable (high variation) relative
to the others
* nginx was amazingly stable and fast, but accomplished this by
dropping a huge percentage of its connections (so, high error rates)
* apache and yaws had huge error rates and very poor throughput

The tests attempted to run 10K concurrent requests against each
server, so it was a pounding. The servers only had to serve static
content; cogen used WSGI to output a short snippet of HTML via code.

There's obviously a ton more to qualify and report here. To do it
justice, I've started a Google Code project (called cloudrunner) that
can be used to a) report test results like these and b) provide the
underlying code so the tests can be easily replicated.

The goal of cloudrunner is to provide testing packages that can be
easily tweaked and run on Amazon's EC2. The point is that Amazon is
excellent at spinning up servers for short periods at very little
cost. This would, at least theoretically, support a huge array of test
scenarios that anyone with an EC2 account could execute for very
little money.

I'll work on getting my current load tests out there for public
consumption, replication, tweaking, etc. It might prove to be an
interesting project.

Back to the async stuff...

I don't think it's reasonable to try to shoe-horn existing libraries
wholesale into an async framework. I think we'd need to look at a new
set of lightweight components that could plug into cogen or similar
servers. It'd be nice to have a generalized, Pythonic approach that
wasn't tied to a particular async implementation, though that might
also be borderline impossible. (I will read through the WSGI async
thread cited.)

I'd like to work on a small app server that thoughtfully leverages
cogen (or another coroutine-centered server like Weightless) and see
how it scales/performs relative to traditional WSGI and f/cgi
approaches.

Would anyone be interested in lending a hand on this? I'd like to
showcase the ability to break an HTTP response into tiny bits,
providing library interfaces for:

* HTML generation (e.g. Cheetah, Genshi, etc.)
* Async database access

For the database stuff, would it not be possible to use events/
messages for hand-offs to database operations? This would avoid the
need for specialized low-level async database drivers, which of
course no one wants to write and maintain.
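
Something along these lines, as a rough sketch -- run_blocking_query
is just a stand-in for an ordinary blocking driver call, and the names
are illustrative only:

import threading
import time
import Queue  # spelled 'queue' on Python 3

results = Queue.Queue()

def run_blocking_query(query):
    time.sleep(0.05)        # stand-in for an ordinary blocking DB driver call
    return [('row for', query)]

def query_in_thread(query, on_done):
    # Hand the blocking call to a worker thread; the result comes back
    # as a message, so the main loop never blocks on the database.
    def worker():
        results.put((on_done, run_blocking_query(query)))
    threading.Thread(target=worker).start()

def main_loop_tick():
    # The scheduler would call this each pass, resuming whichever
    # coroutine was parked waiting for its rows.
    while not results.empty():
        on_done, rows = results.get()
        on_done(rows)

def show(rows):
    print(rows)

query_in_thread('SELECT 1', show)
time.sleep(0.1)    # give the worker time to finish
main_loop_tick()   # delivers the result back on the main loop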

For the HTML stuff, I wouldn't assume that page rendering is so fast
that it doesn't need to yield along the way. In a lot of Python apps
I've worked with, the two long poles in response processing are
database access and page rendering.

It should be pretty easy in some templating libraries to hack into the
parsing and page rendering stages to introduce appropriate yields.
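
For example, even without touching an engine's internals, the rendered
output could be re-sliced into scheduler-friendly pieces (a rough
sketch):

def chunked(rendered, size=4096):
    # Slice one big rendered page into pieces; an async WSGI server
    # like cogen regains control between each piece it writes, so a
    # huge page can't monopolize the loop.
    # e.g. return chunked(template.render(**context)) from the WSGI app
    for i in range(0, len(rendered), size):
        yield rendered[i:i + size]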

Garrett


Garrett Smith

Dec 4, 2008, 12:14:43 AM
to cogen
Very interesting observations from the trenches. Thank you.

One of the reasons I'm so interested in measuring is that I suspect
there's a real payoff to this approach. Nginx proves that a single-
threaded app can provide huge throughput with none of the penalties
associated with forking or threads. It just needs to be purpose-built.

It'd be nice to have a web framework that essentially supported
purpose-built web apps that fully leveraged the async model.

But, of course, no one's going to pay any attention unless you can
definitively prove the hardware/opex savings you could realize.

Garrett Smith

Dec 4, 2008, 12:45:03 AM
to co...@googlegroups.com
I uploaded the benchmark source files and results for a particular test.

http://code.google.com/p/cloudrunner/downloads/list

The interesting stuff is in the benchmarks directory. The *.conf files are used with curl-loader (http://curl-loader.sourceforge.net).

The benchmarks/results dir contains results for the ec2-1x.conf file run against apache2, lighttpd, yaws, cogen, and nginx. To understand the metrics, refer to the curl-loader FAQ. If you're new to curl-loader, it will take a little time to get into the numbers, but they're quite interesting.

The cogen app run is benchmarks/cogen_wsgi. For the other servers, I just served static HTML.

Obviously, these tests aren't fair by any stretch, as I'm not publishing the server configurations and no doubt got some things wrong. E.g. nginx is clearly throttled at 1K clients and is very, very steady at serving that number without fail - very efficiently, I would add. There's no doubt something I'm missing to raise that threshold, as nginx wasn't remotely breaking a sweat during the tests.

The cloudrunner project is something I just threw together. When I get some time I'll build out something more generally useful for building and running benchmarks like these. E.g. I'd love to provide a packaging/process that would make it easy for someone to grab some tests, tweak them, execute them, and upload the results.

Garrett

Ionel Maries Cristian

Dec 5, 2008, 7:54:24 PM
to co...@googlegroups.com
I'm a bit intrigued that cogen had such unstable response times, and I was wondering which revision of cogen you benchmarked (since some stuff has changed over time - see below).

cogen has a number of configuration options for the wsgi server that can improve/worsen the throughput/latency for certain client profiles.

Well, these options aren't in the trunk docs - the docs are a bit outdated due to some problems I have right now with my API docs generator (I have platform-specific code and can't build all the docs at once). Oh well, things will be fixed. (I'm thinking of switching to Sphinx and writing some better guides, intros, etc.)
So, for server_factory, these are available as keyword arguments (see the usage sketch after the list):
 - proactor - I suppose you've used epoll or kqueue (cogen tries to pick the best one by default)
 - sched_default_priority - from 0 to 3 (I have to change this to take the constant name in cogen.core.util.priority). This changes how greedy operations are by default - for example, 3 will make the sched/proactor/ops run a coroutine until it absolutely needs to wait, obviously at the expense of the other coroutines waiting in the sched's queue.
 - sched_default_timeout - default timeout for coroutine operations - but the ops in the wsgi server all have explicit timeouts, so it doesn't matter (well, it may matter if you are running other coroutines besides the wsgi server, but that is another story)
 - proactor_resolution - the amount of time the proactor's multiplexing call waits - this one doesn't usually matter, as the multiplexing call returns sooner if there are results
 - proactor_multiplex_first - makes the unixey proactors (select/(e)poll/kqueue) attempt the recv/send/accept before checking for socket readiness
 - proactor_greedy - setting this to false makes the scheduler delay running the proactor until its coroutine queue is empty
 - ops_greedy - makes network-awoken coroutines run until they need to sleep (in contrast to the default, where they are just passed the result and added to the sched's queue)
 - request_queue_size - this is passed to listen(), so it's pretty obvious
 - sockoper_timeout - the timeout for network ops; disabling this improves perf a bit, but eh, you may have hanging connections (not that bad, since the cost of a connection is small in cogen)
 - sendfile_timeout - a separate timeout for sendfile ops, if you want such a thing
 - sockaccept_greedy - makes the acceptor coroutine a connection guzzler - you'll end up with very small connection times but longer response times for some unlucky coroutines :)
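
Roughly, usage looks like this - a sketch only, so check the source for the exact import path and signature rather than copy-pasting:

from cogen.web import wsgi  # import path assumed

def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['hello world']

# server_factory is paste-style: configuration first, then the options
# discussed above as keyword arguments.
server = wsgi.server_factory(
    {}, '0.0.0.0', '8070',
    sched_default_priority=0,   # schedule coroutines evenly rather than greedily
    proactor_greedy=False,      # drain the coroutine queue before polling again
    sockoper_timeout=15,        # give up on a network op after 15 seconds
    request_queue_size=2048,    # backlog handed straight to listen()
)
server(app)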

So the idea is that if there is a latency spike, it's because some unlucky coroutine keeps sitting in the sched's queue instead of serving its request, due to these dodgy options.
Then again, setting these options to run the coros more evenly would use the sched's queue more, and that queue has some overhead (it impacts perf if you abuse it too much).

The wsgi server uses one coroutine for the acceptor and one coroutine for each connection.
These options have changed over time; I've tried to infer some acceptable defaults from benchmarks. That cloudrunner could prove very useful for testing how these options (and things to come) affect performance in a more reliable and easier way - looking forward to seeing it. I've tested with ab/httperf and, sigh, didn't have the hardware I wanted at my disposal.

Well, I think cogen should get some form of async IPC/RPC, a threadpool for the ugly blocking stuff (like blocking DB wrappers), an async DB wrapper for psycopg2, and more unit-test coverage in order to be ready for the general public.

Still - I don't like the idea of implementing a threadpool at all :)

Oh well, all in all, I need to think a bit more about this - I would certainly like to hear any thoughts on what cogen should have, or what's wrong with it (besides the docs).

=]
--
ionel

Erik Groeneveld

Dec 10, 2008, 3:12:03 AM
to co...@googlegroups.com
Since the beginning of Weightless, I have done many performance tests
as well. The net result is that I threw away much code, since it had
no effect. Now I am back to the simplest Reactor you could imagine:
it just processes all active sockets in each loop.

Responses were very stable; only when the system became very heavily
loaded did connections start dropping, and this was under really big
load. Also, the response times were very close together, nicely
distributed. (This is not to promote Weightless, just to confirm to
you that a simple non-threaded select-based reactor is the best thing
to have.)

As for edge-based I/O triggers such as epoll and kqueue: I did not
use them, mainly because tests showed there were never more than, say,
5 active sockets, even with a load of 4500 qps and clients perceiving
a concurrency in the hundreds. With that number of sockets, select is
good enough.

There is a timer, by the way, and a way to set a lower priority on a
port. I really hope it never becomes more complicated than that.

As for the threadpool: Weightless had one, but I removed it. It turns
out that all the overhead involved in thread communication and syncing
(using a pipe as a mutex in Python makes no sense) is beaten by
Linux's buffered I/O. We have search engines running in which a query
(Lucene - a sort of database - might block when it needs to read an
index) returns ten XML files concatenated together. Doing the query
and reading the ten files from the file system can all block.
However, no matter how much load we apply, the system hardly gets
into wait state, and disk wait is certainly not the bottleneck.

So we must not try to be more catholic than the pope: even without
your application doing pure async I/O, schedulers like cogen and
Weightless already yield a big improvement!

In the case where an I/O handler must read data not from a file but
from another socket on another host (bad, but it happens every now and
then), you do need async I/O for that other socket. This was
actually one of the most complicated things to achieve, but I now have
this solution:

def handler(socket):
    with socket:
        request = yield
        with open('http://otherhost/'):
            yield request
            response = yield
        yield response

This is pseudocode that demonstrates how the context of the
generators is switched. One of the main reasons I want to do it this
way is to make the generators more independent of the schedulers, so
they become easier to test.
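
For instance, a test can play the role of the scheduler and drive the
generator by hand - here on a toy version of the handler above,
without the socket contexts:

def handler():
    request = yield             # the scheduler sends the parsed request in
    response = yield request    # it forwards the request, then sends the response back
    yield response              # and finally writes the response to the client

h = handler()
next(h)                             # run to the first yield
assert h.send('GET /') == 'GET /'
assert h.send('200 OK') == '200 OK'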

I am really curious what you think about combining blocking I/O (on
Linux, that is) with async I/O, and why you need the other polling
functions like epoll.

Best regards,
Erik
--
E.J. Groeneveld
Seek You Too
twitter, skype: ejgroene
mobiel: 0624 584 029

Ionel Maries Cristian

Dec 10, 2008, 6:12:35 PM
to co...@googlegroups.com
poll already improves a lot over select in terms of performance, but
it still sends the list of all fds on each call. epoll stores the list
of fds in kernel space, so each epoll_wait call takes far less time.
That's what makes epoll nice for high concurrency (like 10k
connections). Other features like edge triggering aren't that useful
in a generic reactor pattern that needs to abstract over several APIs.
It also has a ONESHOT option for events, which I find useful.
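
A bare-bones illustration of the epoll pattern - a naive echo server,
with error handling and partial-send buffering omitted:

import select
import socket

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('', 8070))
listener.listen(1024)
listener.setblocking(False)

ep = select.epoll()
ep.register(listener.fileno(), select.EPOLLIN)  # registered once; the kernel keeps it
connections = {}

while True:
    # epoll_wait returns only the ready fds, no matter how many
    # thousands are registered - unlike select/poll, which re-scan all.
    for fd, events in ep.poll():
        if fd == listener.fileno():
            conn, addr = listener.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            connections[conn.fileno()] = conn
        else:
            data = connections[fd].recv(4096)
            if data:
                connections[fd].send(data)  # naive echo; real code buffers partial sends
            else:
                ep.unregister(fd)
                connections.pop(fd).close()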

Speaking of high concurrency - I think the sweet spot of cogen and
most async frameworks in Python is "high concurrency, not very high
throughput". You can hardly achieve high throughput due to various
overheads in the Python VM. That's also why you'll never feel any
improvement from async disk I/O - whatever gains you get will be
nulled, relative to regular sync I/O, by the additional system calls
and handling code. Then again, I have yet to see a Python library
wrapping the Linux aio syscalls.

Disk I/O is fast enough, and if you need it faster you can just move
the storage to RAM, tweak the OS, or get better hardware - there are
so many ways to improve it. In comparison, there is very little you
can do to improve network I/O. Latency is always going to be there,
and you have to deal with it.

Curiously, if, say, one were to write a framework in C as a Python
library and run those generators from C code, you could supposedly get
more throughput, but the app code is still going to be slow - or at
least the bottleneck. No one has been insane enough to attempt such a
thing - one would have to do nasty stuff to make it (easily)
extendable. Probably because if you really want throughput, you would
not care about whatever productivity Python offers for app code.
--
ionel

Erik Groeneveld

Dec 12, 2008, 2:43:28 AM
to co...@googlegroups.com
On Wed, Dec 10, 2008 at 11:12 PM, Ionel Maries Cristian
<ione...@gmail.com> wrote:
[ ]
> Speaking of high concurrency - i think the sweetspot of cogen and most
> async frameworks in python is "high concurrency, not very high throughput".
> You can hardly achieve high throughput due to various overheads over the
> python vm. That's why you'll never ever feel any improvements over async
> disc i/o - whatever gains you get from that will be nulled by overhead over
> the regular sync i/o due to additional system calls and additional handling
> code. Then again, i have yet to see a python library wrapping aio linux
> syscalls.
Were you able to test with a network connection faster than 1 Gb/s? I
wasn't. What I saw in the test on a 1 Gb/s network was 4500 requests
per second, each request about 100 bytes and each response about 5 kB.
The network was fully saturated; the CPU was 60% busy. The CPU was a
1 GHz Pentium, 6 years old. I do not remember the exact concurrency,
but I do remember that there were never more than about 5 sockets in
the reactor.

[ ]
> Curiously, if say, one would write a framework in C as a python library,
> and run those generators from C code you could supposedly have more
> throughput, but the app code is still going to be slow - or at least the
> bottleneck. No one was insane enough to attempt such a thing - one
> would have to do nasty stuff in order to make it extendable (easily
> extendable). Probably because if you really want throughput you would not
> care over whatever productivity python code would offer you for app code.
I wonder if this would yield much. The Python VM is horribly slow, but
since network latencies are usually worth millions of machine
instructions, surely all the Python overhead could not hinder that
much?

Erik

Ionel Maries Cristian

Dec 12, 2008, 8:12:07 PM
to co...@googlegroups.com
On Fri, Dec 12, 2008 at 09:43, Erik Groeneveld <ejgr...@gmail.com> wrote:

> On Wed, Dec 10, 2008 at 11:12 PM, Ionel Maries Cristian
> [ ]
>> Speaking of high concurrency - i think the sweetspot of cogen and most
>> async frameworks in python is "high concurrency, not very high throughput".
>> You can hardly achieve high throughput due to various overheads over the
>> python vm. That's why you'll never ever feel any improvements over async
>> disc i/o - whatever gains you get from that will be nulled by overhead over
>> the regular sync i/o due to additional system calls and additional handling
>> code. Then again, i have yet to see a python library wrapping aio linux
>> syscalls.
> Were you able to test with a network connection > 1 GB?  I wasn't.
> What I saw on the test with 1 GB network, was that there were 4500
> requests per second, each request about 100 bytes and each response
> about 5 kB.  The network was fully saturated, the CPU was 60% busy.
> The CPU was a 1 GHz Pentium, 6 years old.  I do not remember the exact
> concurrency, but I do remember that there were never more than about 5
> sockets in the reactor.

Sending large chunks of data via sendfile or from memory usually
doesn't reflect how fast your server can process requests - it yields
unhelpful statistics, because you aren't testing the server so much as
the operating system's capability to move chunks of data around (or
the Python VM's, for that matter).

I usually test with a very small response body (like a dozen bytes),
and that hardly saturates a regular ethernet connection. You have to
be very careful when testing, and have an adequate setup and
configuration - otherwise you are testing anything but your server.


> [ ]
>> Curiously, if say, one would write a framework in C as a python library,
>> and run those generators from C code you could supposedly have more
>> throughput, but the app code is still going to be slow - or at least the
>> bottleneck. No one was insane enough to attempt such a thing - one
>> would have to do nasty stuff in order to make it extendable (easily
>> extendable). Probably because if you really want throughput you would not
>> care over whatever productivity python code would offer you for app code.
> I wonder if this would yield much, the Python VM is horribly slow, but
> since times in network latency are usually worth millions of machine
> instructions, all the Python overhead could not hinder that much?

The closest thing resembling that (that I know of) would be the yield
server ( http://code.google.com/p/yield/ ) - an HTTP server in C
acting as a gateway for Python greenlet stack-switched WSGI apps. (I
hope that is an accurate description.)
Well, I suppose you could plug in eventlet, since it uses greenlets
as an async framework for Python code, if you are determined enough.

There's also FAPWS, using the built-in HTTP handling from libevent.

You can get insanely high response rates with these kinds of servers,
but the app usually still is the bottleneck.

--
ionel
