Writing a scalable server

hantheman

unread,

Dec 21, 2003, 6:10:58 AM12/21/03

to

Dear c.p.t. experts,

I'm writing a server application that needs to serve about 15000
concurrent users (six companies access the same server). That's the
peek period. Usually, about 2500 concurrent users. Users submit
requests that are queried from a database. The returned result is on
average 500 KB in our current system, but I may implement compression
to reduce network load. The input queries < 1K. I know that creating a
new thread per client doesn't scale, so this is my idea on how to
solve it:

1. From the socket, read the query (I call this a work list). Each
worklist is placed in a fifo queue. The socket listener thread is
immediatly ready for another request since the request is now in the
fifo queue.

2. Another thread reads from the fifo_queue. Call this the FIFO
manager. Its responsibility is to pop off work lists and start a new
thread FROM A THREAD POOL. The queries live on the fifo list until
some worker thread is ready to handle it. Related question... how big
a thread pool for an X CPU system? Any heuristics available?

3. The worker threads queries the database and write replies to the
proper client. I guess info about the client end of the socket
connection needs to be part of the work list so the worker thread
knows where to send things...

4. Each worker threads needs to query a database file. This has
special requirements so I need to write a simple DBMS (for spatial
queries). How does one best implement concurrency when several threads
are access the same database file? I guess two or more threads cannot
read/write the same file simultaniously. Will simply allow one thread
at a time do? (while the other threads are either waiting or sending
query results to the respective client)

5. We are currently using Windows 2000, but are going to move to
Solaris or Linux later on. Thus, the solution must be reasonably
portable.

Does this sound like a good approach, or just plain stupid? I also
read about IOCP, but I don't quite understand it - also it is not
portable(?)

Any thoughts?

/han

Michael Fuhr

unread,

Dec 21, 2003, 1:33:30 PM12/21/03

to

hanth...@hotmail.com (hantheman) writes:

> I'm writing a server application that needs to serve about 15000
> concurrent users (six companies access the same server). That's the
> peek period. Usually, about 2500 concurrent users. Users submit
> requests that are queried from a database. The returned result is on
> average 500 KB in our current system, but I may implement compression
> to reduce network load. The input queries < 1K. I know that creating a
> new thread per client doesn't scale, so this is my idea on how to
> solve it:

You might find the article "The C10K problem" interesting reading:

http://www.kegel.com/c10k.html

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

SenderX

unread,

Dec 21, 2003, 9:37:22 PM12/21/03

to

> 1. From the socket, read the query (I call this a work list). Each
> worklist is placed in a fifo queue. The socket listener thread is
> immediatly ready for another request since the request is now in the
> fifo queue.

This is ok.

> 2. Another thread reads from the fifo_queue. Call this the FIFO
> manager. Its responsibility is to pop off work lists and start a new
> thread FROM A THREAD POOL. The queries live on the fifo list until
> some worker thread is ready to handle it. Related question... how big
> a thread pool for an X CPU system? Any heuristics available?

2 threads per. cpu is good for worker thread pool that doesn't make any
blocking calls...

If your database access will block a worker thread, I would separate the
database access logic from the network layer. You can do this by organizing
your server into stages, more on that later...

The database stage would be driven by a larger pool of threads, that way it
can handle more concurrent blocking operations at a time, and ease the
bottleneck.

http://www.jetbyte.com/portfolio-showarticle.asp?articleId=38&catId=1&subcatId=2

> 5. We are currently using Windows 2000, but are going to move to
> Solaris or Linux later on. Thus, the solution must be reasonably
> portable.

You should create a single interface for the server stages, and back the
interface by select on non-windows server os's.

This is possible.

>
> Does this sound like a good approach, or just plain stupid? I also
> read about IOCP, but I don't quite understand it - also it is not
> portable(?)
>
> Any thoughts?

Not portable at all, but very scaleable:

http://www.sysinternals.com/ntw2k/info/comport.shtml

IOCP + nice computer == tens of thousands of clients...

;)

P.S.

Your server can exist as a number of separate stages...

1. Network stage

- Passes new connections to stage 2

- Handles socket io

2. Parsing stage

- Accepts connections from stage 1

- Keep requesting recv's from stage 1 until a query is parsed

- Reference the socket

- Pass parsed messages to stage 3, and repeat

3. Query stage

- Accepts queries from stage 2

- Access database

- Pass received data to the response stage

4. Response stage

- Accepts responses from stage 3

- Prepares the response for the client

- Keep requesting sends from stage 1 until response is sent

- Decrement references

Markus Elfring

unread,

Dec 22, 2003, 7:53:47 AM12/22/03

to

Can the links from the discussion "C or C++ multithreaded server code"
show you any useful design alternatives and possibilities?

Eric Sosman

unread,

Dec 22, 2003, 12:00:05 PM12/22/03

to

hantheman wrote:
>
> [...]

>
> 1. From the socket, read the query (I call this a work list). Each
> worklist is placed in a fifo queue. The socket listener thread is
> immediatly ready for another request since the request is now in the
> fifo queue.
>
> 2. Another thread reads from the fifo_queue. Call this the FIFO
> manager. Its responsibility is to pop off work lists and start a new
> thread FROM A THREAD POOL. The queries live on the fifo list until
> some worker thread is ready to handle it. Related question... how big
> a thread pool for an X CPU system? Any heuristics available?

What is the purpose of this "FIFO manager" thread? Why
can't the worker threads themselves pull the requests from the
queue? Why use three context switches per request instead of
just two?

In fact, why use two context switches when one might suffice?
The "socket listener" thread doesn't really advance the state of
affairs (as far as I can see): all it does is read the incoming
requests and put them in a queue where they can be ignored for
a while if all the workers are busy. (This is called the "Hurry
up and wait" design pattern.) Perhaps you could do away with this
thread altogether, and have the workers read the socket for
themselves. In effect, you discard the application-level work
queue and instead use whatever queuing the network layer offers.

There are reasons you might *not* want to do this: perhaps
the network's queuing doesn't preserve temporal ordering of
requests arriving on different sockets, or perhaps you want to
time-stamp the requests so you can keep track of transit times,
or ... However, you should evaluate whether you actually need
an explicit producer/consumer model, or whether you're just
falling into it out of habit.

--
Eric....@sun.com

hantheman

unread,

Dec 22, 2003, 12:27:48 PM12/22/03

to

//The designer of the experimental, SMP and HyperThread friendly, AppCore
//library.
//http://AppCore.home.comcast.net

Looks great. Any chance you'll write a Linux/Unix version soon?

;-)

Bob Summers

unread,

Dec 22, 2003, 4:17:11 PM12/22/03

to

What he said.

I spent the last three years or so trying to make a server
similar to the design you, hantheman, suggested. In our server,
however, the workers directly took the requests off of the queue.

As tuning progessed, the queue became a bottleneck. I ended up
implementing multiple queues between the frontend and the backend
and speeding up the queue implementation to reduce contention.

I did have a design that I thought would be better. I didn't
get a chance to implement it because I got laid off shortly
after I cleaned things up enough to make implementing my design possible.

My plan was to have a single listener thread that would assign
the accepted socket to a reader thread. There might be a pool of
reader threads; I planned to run experiments to see if
multiple reader threads were required on any of the platforms
we supported.

The reader thread would choose a worker thread and copy the
data from the socket into a structure (a mailbox) for the worker
along with any other info the reader needed to pass to the
worker.

When the mailbox had enough of the request (in my situation,
that was the complete request) it would signal the worker to
start work.

If there was no worker available, the first implementation
would have just waited for one. There's no point in spending
resources copying data off of the socket when you already
have enough work to do. Some experiments would have shown
whether we needed multiple worker pools and/or some way to
pop more threads when the pool(s) ran dry.

The buffer portion of each mailbox entry would be enlarged as
necessary and never deallocated to avoid heap allocations and
deallocations which were a major performance problem for the
product.

I had separate reader thread(s) in my design for a couple of
reasons. I wasn't sure that all the platforms I had to
support (including those that would need to be supported in
the future) would gracefully handle lots of threads waiting
on a socket. I was also expecting that the reader thread(s)
could settle onto a CPU with nice warm caches for the OS's
networking code. I've read that a gigabit network can
saturate a single CPU. Also, having threads that do nothing but
read from some number of sockets means that you can have
separate worker pools for each reader. Which means that
choosing a worker thread is less likely to ever become
a bottleneck. Not that I would expect choosing a worker
to be a bottleneck, it's just that I've been suprised before. :-)

Of course we used persistent connections to avoid the
setup/teardown overhead.

The number of threads is best determined by experiment. However,
Little's Law can help you get in the ballpark if you can make
some reasonable guesses.

Little's Law is N = XR where N is the population of the system
(i.e. busy threads), R is how long (say in seconds) the average
request takes and X is the thruput (i.e. the number of requests
completed per second or TPS).

So, for a 3,000 TPS server with the average request taking
20 milliseconds the number of worker threads required is:
N = 3,000 * 0.020 = 60 threads.

At some point, the overhead of having "too many" threads,
starts to have a negative effect on thruput. In my experience,
thruput depends weakly on the number of worker threads,
once you have enough of them. IOW, it's a broad peak.
If the optimal number of threads is 60, the difference
between 60 threads and 70 threads (or even 55 and 60) probably
wouldn't be very big (a few percent). You'd probably have
to go over 100 threads before they started having much of a
negative effect.

Of course, YMMV .

Bob S

ig...@notformail.paco.net

unread,

Dec 23, 2003, 1:31:32 PM12/23/03

to

SenderX <x...@xxx.xxx> wrote:

>> 2. Another thread reads from the fifo_queue. Call this the FIFO
>> manager. Its responsibility is to pop off work lists and start a new
>> thread FROM A THREAD POOL. The queries live on the fifo list until
>> some worker thread is ready to handle it. Related question... how big
>> a thread pool for an X CPU system? Any heuristics available?
>
> 2 threads per. cpu is good for worker thread pool that doesn't make any
> blocking calls...
>

Why single thread per CPU is not enough (leave any disk I/O aside)?

--
Igor Khasilev |
PACO Links, igor at paco dot net |

Bob Summers

unread,

Dec 23, 2003, 1:56:20 PM12/23/03

to

Actually, any resource (lock) contention can be a problem as is
any IO, including OS paging. Multiple threads can also
drive the utilization of a contended resource higher (i.e.
reduce the amount of time a contended resource is idle) which
can improve system thruput.

Bob S

contention can also

Sean Kelly

unread,

Dec 23, 2003, 7:01:21 PM12/23/03

to

Unfortunately, the optimal way to handle a large number of socket
connections is fairly platform-dependent, though the design does
generally involve some sort of thread pool. Under Windows, IOCP is
indeed the best. Under Linux I believe it's epoll (an optimization of
poll), and I'm not sure about Solaris. BSD-style sockets are
definately the most portable but probably the least optimal overall.
Were I doing the project I'd likely bite the bullet and write
platform-specific code to handle the socket layer and use more generic
code for the rest. There is almost no documentation for IOCP under
Windows, but there is some (slightly aged) sample code here:
http://www.mvps.org/win32/network/sockhim.html. You should also read
the MSDN info on thread pooling and on the function
BindIoCompletionCallback().

One thing to be aware of is that BindIoCompletionCallback() uses
CreateThread, which last I heard leaked memory with apps that used
LIBC. If you're using code with any of the older C calls, it may be
worth rolling your own version of this function using _beginthreadex.

Sean

David Schwartz

unread,

Dec 23, 2003, 7:28:21 PM12/23/03

to

<ig...@notformail.paco.net> wrote in message
news:bsa1m4$co0$1...@snark.paco.net...
> SenderX <x...@xxx.xxx> wrote:

Ambush, say due to a page fault (or anything else you can't anticipate).

DS

Michael B Allen

unread,

Dec 27, 2003, 5:51:24 PM12/27/03

to

On Sun, 21 Dec 2003 06:10:58 -0500, hantheman wrote:

> Any thoughts?

In addition to what other folks have said if you are really going to
service 15,000 concurrent users you are undoubtedly going to run into
a few issues. I'll just try to mention some things that others haven't:

Number of File Descriptors

Consider that each client socket will require a file
descriptor. Additionally your database client (say Oracle ProC) will use a
descriptor for it's database connection as well. So it will be necessary
to either increase the number of file descriptors permitted per process
or fork new processes periodically or in response to some condition.

Client State

The design of a server is highly influenced by were the state of a client
must reside. In the easiest case all client state is self contained in
which case a thread or process can be initialized with it and left alone
to run to completion. If clients send multiple, possibly interleaved
requests, over the same socket session it will be necessary serialize
responses. If a "boss thread" is required to be notified when some
worker task has been completed it will be necessary to use a pipe to
recieve those notifications. If you need different workers to seteuid
and impersonate another user or switch locales it will be necessary to
fork worker processes which is complicated when compounded with other
scenarios mentioned above. Fortunately it sounds like your scenario is
a litter easier.

Idle VS Active Client Lists

There is one design that I do not believe has been mentioned. That is
using poll for both socket clients and database clients assuming the
database client has asyncronous notification like select(2). In this
case you could manage a list of active clients and use poll to monitor
their activity. This technique may not use threads at all and could
arguably scale higher than any other implementation because it would not
use memory for extra threads and select(2)/poll(2) type notification is
more efficient than the implicit wait queue created by multiple threads.

hantheman

unread,

Dec 28, 2003, 11:23:15 AM12/28/03

to

Michael B Allen <mba...@ioplex.com> wrote in message news:<pan.2003.12.27.17....@ioplex.com>...

> On Sun, 21 Dec 2003 06:10:58 -0500, hantheman wrote:
>

> Number of File Descriptors

Just two files are going to be open ;-)

>
> Client State

No state required.

> Idle VS Active Client Lists

[...]

> In this case you could manage a list of active clients and use poll to monitor
> their activity. This technique may not use threads at all and could

> arguably scale higher [...]

Quite interesting, but can you please elaborate a bit more? Not sure I
understand how this would work.

Regards,
han

hantheman

unread,

Dec 28, 2003, 11:29:17 AM12/28/03

to

Eric Sosman <Eric....@sun.com> wrote in message news:<3FE72315...@sun.com>...

>However, you should evaluate whether you actually need
>an explicit producer/consumer model, or whether you're just
>falling into it out of habit.

Not sure - that's why I posted my suggestion since I'm too big a
socket/threading newbie to have any habits to fall into ;-)

However, I *am* going to timestamp the requests so that's one of the
reasons I find the fifo queue useful.

However, I like your idea of not using a fifo-manager at all; just a
pool of worker threads poping worklists at will.

One context switch saved, thank you ;-)

Michael B Allen

unread,

Dec 28, 2003, 2:06:07 PM12/28/03

to

On Sun, 28 Dec 2003 11:23:15 -0500, hantheman wrote:

>> Number of File Descriptors
>
> Just two files are going to be open ;-)

Each socket also utilizes an entry in the process file table. Most
systems have relatively small limits on how many files a process can have
open (e.g. on Linux it is 1024 by default).

>> Client State
>
> No state required.

So if you pass the client socket to a thread/process the passer never
needs to communicate with the thread/process again? The thread/process
will take care to shutdown/close the socket when it's done?

>
>> Idle VS Active Client Lists
> [...]
>> In this case you could manage a list of active clients and use poll to
>> monitor their activity. This technique may not use threads at all and
>> could arguably scale higher [...]
>
> Quite interesting, but can you please elaborate a bit more? Not sure I
> understand how this would work.

In the UNIX/Linux environment read the man page about select(2) or
poll(2). On Windows there are functions like WaitForMultipleObjects. There
are others specific to sockets on Windows if I recall. The principle is
basically the same. You indicate which descriptors you are interested in
an then call select(2) or poll(2) or whatever. This will block until
something of interest has occured at which point you can check to
determine which descriptors can be read or written without blocking.

Mike

SenderX

unread,

Dec 28, 2003, 7:43:23 PM12/28/03

to

> However, I like your idea of not using a fifo-manager at all; just a
> pool of worker threads poping worklists at will.

That's not an ideal design, as it totally destroys data locality. Really...

Again:

http://research.microsoft.com/users/larus/Papers/usenix02_cohort.pdf
(read whole paper please)

Good data locality will be cache friendly and outperform a server who passes
its requests to random threads, without any regard to data locality. For
sure!

P.S.

Yes, a Linux version of AppCore is coming along.

--

The designer of the experimental, SMP and HyperThread friendly, AppCore

library.

http://AppCore.home.comcast.net

SenderX

unread,

Dec 28, 2003, 7:45:39 PM12/28/03

to

> In the UNIX/Linux environment read the man page about select(2) or
> poll(2). On Windows there are functions like WaitForMultipleObjects. There
> are others specific to sockets on Windows if I recall.

On Windows, for loaded servers, use IOCP for files and sockets.

SenderX

unread,

Dec 28, 2003, 8:01:14 PM12/28/03

to

> Not sure - that's why I posted my suggestion since I'm too big a
> socket/threading newbie to have any habits to fall into ;-)

:O I missed that...

Are you really ready to code a server that handles spikes of 15,000 clients?

Do you have to meet a deadline?

Frank Cusack

unread,

Dec 29, 2003, 12:47:18 AM12/29/03

to

On Sat, 27 Dec 2003 17:51:24 -0500 Michael B Allen <mba...@ioplex.com> wrote:
> select(2)/poll(2) type notification is
> more efficient than the implicit wait queue created by multiple threads.

Really? Can you post a benchmark?

thanks
/fc

ig...@notformail.paco.net

unread,

Dec 29, 2003, 1:28:08 AM12/29/03

to

Michael B Allen <mba...@ioplex.com> wrote:

> Idle VS Active Client Lists
>
> There is one design that I do not believe has been mentioned. That is
> using poll for both socket clients and database clients assuming the
> database client has asyncronous notification like select(2). In this
> case you could manage a list of active clients and use poll to monitor
> their activity. This technique may not use threads at all and could
> arguably scale higher than any other implementation because it would not
> use memory for extra threads and select(2)/poll(2) type notification is
> more efficient than the implicit wait queue created by multiple threads.

There are more efficient polling mechanisms for that under different UNIX
versions (alas, not standard): /dev/poll(solaris), epoll(linux),
kqueue(freebsd). There is no reasons to separate active and idle connections
with this poll drivers.

Michael B Allen

unread,

Dec 29, 2003, 2:40:28 AM12/29/03

to

On Sun, 28 Dec 2003 19:43:23 -0500, SenderX wrote:

> http://research.microsoft.com/users/larus/Papers/usenix02_cohort.pdf
> (read whole paper please)

This is a tough sell. It is enlightening in terms of the importance
of code locality but the requirement of asynchronous operations and
schedualling autonomy is optamistic and would be a major constraint in a
general purpose application. It also sounds like a good way to get into
a deadlock.

Still it is interesting to think about simply adding a 'stage' variable
to objects for building cohorts. A select(2) loop is an implicit state
machine in this regard meaning it might be advantagous to loop over
descriptors twice; first process reads then process writes.

Code locality is also a good reason to *not* use locking like this:

while (stuff) {
lock(m);
do_stuff();
unlock(m);
}

but instead using something like the following:

lock(m);
while (stuff) {
/* maybe throw in a condvar here */
do_stuff();
}
unlock(m);

Using processor specific tables to increase paralellism is interesting too
but how can processor affinity of data table be ensured such that explicit
syncronization isn't necessary?

Also, I think they cheated a little on their benchmark by using a
website with 1,000,000 pages. That is if their SSWS server used the
staged computation for disk caching as they described (I sort of gave
up at that point :) whereas the threaded version was forced to use the
kernel cache.

Mike

hantheman

unread,

Dec 29, 2003, 9:30:29 AM12/29/03

to

"SenderX" <x...@xxx.xxx> wrote in message news:<u1LHb.205$xX.1269@attbi_s02>...

> > Not sure - that's why I posted my suggestion since I'm too big a
> > socket/threading newbie to have any habits to fall into ;-)
>
> :O I missed that...
>
> Are you really ready to code a server that handles spikes of 15,000 clients?

Probably not :-( There's going to be a though learning curve, but I'm
not going to give up before giving it a shot.

> Do you have to meet a deadline?

Yes, we got approx. eight months calendar time (two persons), two
months of which is design.

hantheman

unread,

Dec 29, 2003, 9:35:55 AM12/29/03

to

"SenderX" <x...@xxx.xxx> wrote in message news:<LMKHb.675669$HS4.4797316@attbi_s01>...

> > However, I like your idea of not using a fifo-manager at all; just a
> > pool of worker threads poping worklists at will.
>
> That's not an ideal design, as it totally destroys data locality. Really...

Why? The queue is actually a shared array (which waits when the array
is full; we'll have to adjust its size after maximum load).

The array should provide good cache locality, shouldn't it?

> Again:
> http://research.microsoft.com/users/larus/Papers/usenix02_cohort.pdf
> (read whole paper please)

Reading it again.

> Good data locality will be cache friendly and outperform a server who passes
> its requests to random threads, without any regard to data locality. For
> sure!

I thought data locality was less important since this application is
heavily IO bound. Isn't disk / socket utilization and parallellization
way more important than L1 and L2 cache misses in such an application?

Thanks.

Markus Elfring

unread,

Dec 29, 2003, 2:42:06 PM12/29/03

to

> There are more efficient polling mechanisms for that under different UNIX
> versions (alas, not standard): /dev/poll(solaris), epoll(linux),
> kqueue(freebsd). There is no reasons to separate active and idle connections
> with this poll drivers.

The following pages contain useful tests.
- event notification library
http://monkey.org/~provos/libevent/

- Microbenchmark comparing poll, kqueue, and /dev/poll
http://www.kegel.com/dkftpbench/Poller_bench.html

- A scalable and explicit event delivery mechanism for UNIX
http://citeseer.org/64790.html

Markus Elfring

unread,

Dec 29, 2003, 3:32:31 PM12/29/03

to

> There are more efficient polling mechanisms for that under different UNIX
> versions (alas, not standard): /dev/poll(solaris), epoll(linux),
> kqueue(freebsd). There is no reasons to separate active and idle connections
> with this poll drivers.

Did anybody try a benchmark with the programming interface "http://liboop.org/why"?

Michael B Allen

unread,

Dec 29, 2003, 3:42:12 PM12/29/03

to

On Mon, 29 Dec 2003 09:35:55 -0500, hantheman wrote:

> "SenderX" <x...@xxx.xxx> wrote in message
> news:<LMKHb.675669$HS4.4797316@attbi_s01>...
>> > However, I like your idea of not using a fifo-manager at all; just a
>> > pool of worker threads poping worklists at will.
>>
>> That's not an ideal design, as it totally destroys data locality.
>> Really...
>
> Why? The queue is actually a shared array (which waits when the array is
> full; we'll have to adjust its size after maximum load).
>
> The array should provide good cache locality, shouldn't it?

If each thread does four things;

1) compose an SQL statement,
2) execute the SQL statement,
3) format the response and
4) write it to a socket

it is more efficient to coalesce each of these "stages" into "cohorts"
and execute the stages to gether. For example 11134422233 would have
better data and code locality than 14423123213.

In your case if you are really going to have a peak of 15000 connects
you are going to have to think of this and more. I would suggest an
asyncronous state machine with separate active/inactive lists for use
with poll and take advantage of the code and data locatily opportunities
inherent in the state machine.

>> Good data locality will be cache friendly and outperform a server who
>> passes its requests to random threads, without any regard to data
>> locality. For sure!
>
> I thought data locality was less important since this application is
> heavily IO bound. Isn't disk / socket utilization and parallellization
> way more important than L1 and L2 cache misses in such an application?

Sound reasoning but I wouldn't rule out "staged computation" if it
doesn't make any difference.

The biggest factor for you is the number of clients. You have some
significant barriers to hurdle which is why I think the poll(2) state
machine is the best choice.

Mike

David Schwartz

unread,

Dec 29, 2003, 5:12:36 PM12/29/03

to

"hantheman" <hanth...@hotmail.com> wrote in message
news:580fae16.03122...@posting.google.com...

> I thought data locality was less important since this application is
> heavily IO bound. Isn't disk / socket utilization and parallellization
> way more important than L1 and L2 cache misses in such an application?

SenderX likes to latch onto some obscure, barely relevant point and
claim that your design is bad because it doesn't address that point.

DS

SenderX

unread,

Dec 29, 2003, 8:21:32 PM12/29/03

to

> SenderX likes to latch onto some obscure, barely relevant point and
> claim that your design is bad because it doesn't address that point.

How is cohort-scheduling not relevant to high-user server applications?

Anyway:

<1>

> 1. From the socket, read the query (I call this a work list). Each
> worklist is placed in a fifo queue. The socket listener thread is
> immediatly ready for another request since the request is now in the
> fifo queue.

(SX): This is ok.
</1>

<2>

> However, I like your idea of not using a fifo-manager at all; just a
> pool of worker threads poping worklists at will.

(SX): That's not an ideal design, as it totally destroys data locality.
Really...
</2>

I said OP's ideas are ok(1), but not ideal(2)...

The simple task of staging his server will help him assemble cohorts, and it
actually will improve his performance... 15,000 client server should
probably do all it can to improve throughput? I also believe a staged design
should actually simplify his code and design...

P.S.

Also, I did not mention lock-free algos directly to the OP either...

;)

SenderX

unread,

Dec 29, 2003, 8:47:11 PM12/29/03

to

> I thought data locality was less important since this application is
> heavily IO bound. Isn't disk / socket utilization and parallellization
> way more important than L1 and L2 cache misses in such an application?

Staging a server will still maintain good: "disk / socket utilization and
parallellization". Plus, its a simple design. Also, by having a processor
execute batches of very similar requests, it should be executing down the
same code paths over and over again. So the chances of cache miss is
decreased...

There is more than one way to create a stage server:

http://citeseer.nj.nec.com/welsh01seda.html

SenderX

unread,

Dec 29, 2003, 8:50:58 PM12/29/03

to

> > Do you have to meet a deadline?
>
> Yes, we got approx. eight months calendar time (two persons), two
> months of which is design.

Do you have to code a windows version, and a unix version?

Michael B Allen

unread,

Dec 30, 2003, 2:44:03 AM12/30/03

to

Actually the more I think about this the key for him is going to be the
fact that his tasks (database work) sound self contained. To get around
the various limits he could just exec a new server every 1000 clients and
pass the server socket file descriptor over a unix domain socket. Does
anyone see a problem with that? I've never done it. Then maybe he can
abstract that and IOCP but I don't know the limits with I/O completion
ports. Sometimes it's better to KISS.

Mike

Michael B Allen

unread,

Dec 30, 2003, 2:56:39 AM12/30/03

to

On Mon, 29 Dec 2003 14:42:06 -0500, Markus Elfring wrote:

>> There are more efficient polling mechanisms for that under different
>> UNIX versions (alas, not standard): /dev/poll(solaris), epoll(linux),
>> kqueue(freebsd). There is no reasons to separate active and idle
>> connections with this poll drivers.
>

<snip>

> - A scalable and explicit event delivery mechanism for UNIX
> http://citeseer.org/64790.html

This is interesting. They suspect poll(2) would in practice be slower
than select(2) because it ultimately copies 64bits vs 3bits per file
descriptor into the kernel with each call. I wonder if mmap-ing the
struct pollfs * or fdsets would improve performance. I beleive cp gets
huge gains by mmap-ing and using memcpy.

So is the performance gain here dependent on using event-based (aka
event-tiggered) notification? Is this what epoll is based on? So if you
use state-based (aka level-triggered) you have to use non-blocking IO?
I'm not convinced non-blocking I/O is better. Why is everything I'm
reading hooked on non-blocking I/O?

Also, this API doesn't have a context
object which is a flaw IMHO. Fortunately epoll has a context (the epoll
descriptor).

What are the chances of the epoll patch making it into the
official 2.4? I suppose their poor.

Mike

Bjørn Augestad

unread,

Dec 30, 2003, 4:13:00 AM12/30/03

to

hantheman wrote:

> Dear c.p.t. experts,
Hi.

First of all, I'm by no means an expert, but I've written a server
framework which is very similar to what you describe below. I therefore
just wanted to make some comments and have a few questions as well. It's
a very interesting task/challenge you have here, btw :-)

>
> I'm writing a server application that needs to serve about 15000
> concurrent users (six companies access the same server).

OK, but why do you want all those users from 6 different companies on
the same server? One thing is scalability, but what about SLA's,
security, and other issues?

For how long is a user connected to the server? Will you use persistent
connections or will a user connect and disconnect for each request?

Will you use plain TCP/IP or will you use encryption, e.g. ssl?

That's the
> peek period. Usually, about 2500 concurrent users. Users submit
> requests that are queried from a database. The returned result is on
> average 500 KB in our current system, but I may implement compression
> to reduce network load.

500KB * 15000 = 7.500MB = 7.3GB. You need a fast network ;-)

How many requests are made per second? What's the max response time?

The input queries < 1K. I know that creating a
> new thread per client doesn't scale, so this is my idea on how to
> solve it:

>
> 1. From the socket, read the query (I call this a work list). Each
> worklist is placed in a fifo queue. The socket listener thread is
> immediatly ready for another request since the request is now in the
> fifo queue.

No, let the worker thread read the query. Just accept the connection and
place the socket in the queue. Then continue accepting new connection
requests.

If one client for some reason fails to send you the request or does so
slowly, other clients cannot connect.

>
> 2. Another thread reads from the fifo_queue. Call this the FIFO
> manager. Its responsibility is to pop off work lists and start a new
> thread FROM A THREAD POOL. The queries live on the fifo list until
> some worker thread is ready to handle it. Related question... how big
> a thread pool for an X CPU system? Any heuristics available?

Assuming that the network will be the bottleneck, I'd say quite a large
thread pool and (fifo) queue. Make it configurable and test it :-)

>
> 3. The worker threads queries the database and write replies to the
> proper client. I guess info about the client end of the socket
> connection needs to be part of the work list so the worker thread
> knows where to send things...

Good idea :-)

>
> 4. Each worker threads needs to query a database file. This has
> special requirements so I need to write a simple DBMS (for spatial
> queries). How does one best implement concurrency when several threads
> are access the same database file? I guess two or more threads cannot
> read/write the same file simultaniously. Will simply allow one thread
> at a time do? (while the other threads are either waiting or sending
> query results to the respective client)

IMHO, it all depends on how fast and when the data in the db changes.
Can you tell us more about this part?

>
> 5. We are currently using Windows 2000, but are going to move to
> Solaris or Linux later on. Thus, the solution must be reasonably
> portable.

If the deployment platform is Solaris or Linux, I'd make that my
development platform as well, and as soon as possible.

>
> Does this sound like a good approach, or just plain stupid? I also
> read about IOCP, but I don't quite understand it - also it is not
> portable(?)

>
> Any thoughts?

I really don't know enough about your project, but my gut feeling tells
me that I'd use one server per company, possibly with a failover server
or two per company. Then I'd use a common database server if all
companies share the same data, or a db server per company if the data is
unique per company. Other alternatives exist as well, but The "One
server" alternative seems both quite expensive and risky. All IMHO, of
course :-)

Below is a link to the framework I mentioned earlier. It is basically a
HTTP server, but is implemented in a layered way, so it also contains a
standalone tcp_server which does all the TCP/IP stuff for you
(connect/disconnect/queueing/buffering/thread handling and more). Check
out the file src/samples/echoserver.c for an example of how to use the
tcp_server ADT (Abstract Data Type).

The framework contains lots of other stuff as well, you may want to look
into the process ADT which handles daemonizing, security, shutdowns and
other minor issues which may be important to you. A fullblown example is
available in the highlander/src/hws directory.

Note that this is a UNIX only framework, it runs on most POSIX compliant
platforms. It is cabable of handling lots of requests very, very fast,
but I would not use it for 15.000 persistent connections. Non-persistent
would work OK, AFAIK. ;-)

If you decide to have a look at the framework and run into any problems
or have any questions or comments, don't hesitate to email me. I know
that the documentation isn't the best, but I'm working on it. (yeah,
right. Too much fun coding to write doc :-))

HTH,
Bjørn

--
The worlds fastest web server is now available
at http://highlander.metasystems.no:2000. Enjoy!

SenderX

unread,

Dec 30, 2003, 7:29:04 AM12/30/03

to

> Then maybe he can
> abstract that and IOCP but I don't know the limits with I/O completion
> ports. Sometimes it's better to KISS.

IOCP can handle 45,000+ concurrent socket connections with a nice computer
( memory ):

http://www.microsoft.com/mspress/books/sampchap/5726a.asp#130

http://msdn.microsoft.com/msdnmag/issues/1000/winsock/default.aspx

In order to get tens-of-thousands of parallel clients, you have to design a
good resource management layer. This layer can store overflowed requests in
waitlists to "throttle" client spikes.

A super simple IOCP throttle that would just affect the socket io stage:

Stage 1: Socket IO

CPerSocket pSck;

for (;;)
{
// Check total
if ( m_Iocp.iPending < 45,000 )
{
int i = 100;

while ( ( --i ) && ( pSck = m_Waits.Pop() ) )
{
// Check total
if ( m_Iocp.iPending > 50,000 )
{
// Queue in local waits
m_Waits.Push( pSck );
}

else
{
// Handle the io
ProcessIo( pSck );
}
}
}

// Dequeue an io completion
pSck = m_Iocp.Pop();

// Check total
if ( m_Iocp.iPending > 50,000 )
{
// Queue in local waits
m_Waits.Push( pSck );
}

else
{
// Handle the io
ProcessIo( pSck );
}
}

You can tweak each stage just how you like it...

;)

For select I guess you could:

1. just create a lot of threads each waiting on their own pipe using
select...

2. listener thread would accept sockets and dispatch the accpted
connections...

Dispatch method (Part of Socket IO Stage):

- Listener thread can build batches of connections by accpeting one, then
poll for another with select.

- Listener would queue the socket state(s) onto the target threads queue.

- Listener would write a single byte to the target threads pipe

Select thread (Part of Socket IO Stage):

- Wakes from select because the pipe was written to

- Checks its local queue for any new connections

- Empties the entire queue, and adds sockets state(s) to local fd set

- Run through the fd sets, passes completed socket io to the correct stages,
and wait on select again.

Markus Elfring

unread,

Dec 30, 2003, 5:23:28 PM12/30/03

to

> So is the performance gain here dependent on using event-based (aka
> event-tiggered) notification?

It seems so.
The authors (Gaurav Banga, Jeffrey C. Mogul, Peter Druschel)
implemented a new and improved event-oriented API.

I assume that more can be found about it with the topic "System
Support for Scalable Network Servers"
(http://www.cs.rice.edu/CS/Systems/ScalaServer/).

Markus Elfring

unread,

Dec 30, 2003, 5:42:35 PM12/30/03

to

> OK, but why do you want all those users from 6 different companies on
> the same server? One thing is scalability, but what about SLA's,
> security, and other issues?

Will the following topics be considered?
- service configurator
http://www.cs.wustl.edu/~schmidt/ACE-papers.html
http://www.cs.wustl.edu/~jxh/research/research.html

- fault tolerance with CORBA
http://citeseer.org/moser97eternal.html
http://citeseer.org/felber97replicating.html

Markus Elfring

unread,

Dec 30, 2003, 8:04:42 PM12/30/03

to

> I'm writing a server application that needs to serve about 15000

> concurrent users (six companies access the same server). That's the

> peek period. Usually, about 2500 concurrent users. Users submit
> requests that are queried from a database. The returned result is on
> average 500 KB in our current system, but I may implement compression

> to reduce network load. The input queries < 1K. I know that creating a

> new thread per client doesn't scale, so this is my idea on how to
> solve it:

> ...

Are you going to start your development from scratch?
Can you leave any low leven details like "multithreading" to an object
request broker?

How do you think about to reuse available software components like the
following?
1. asymmetric multiprocess event-driven architecture (AMPED)
http://citeseer.nj.nec.com/pai99flash.html

2. Real-time CORBA with TAO(TM) (The ACE ORB)
http://www.cs.wustl.edu/~schmidt/TAO.html

3. Real Time CORBA for RTSJ Java
http://www.zen.uci.edu/

4. The Java library for parallel, distributed, and concurrent
computing with mobility and security
http://proactive.objectweb.org/
Would you like to move your objects around your computer systems by
"drag and drop"?

Casper H.S. Dik

unread,

Jan 1, 2004, 6:41:21 AM1/1/04

to

Michael B Allen <mba...@ioplex.com> writes:

>Each socket also utilizes an entry in the process file table. Most
>systems have relatively small limits on how many files a process can have
>open (e.g. on Linux it is 1024 by default).

Solaris does not; while there's a default hard limit, there's no
"hard" hard limit. (64K is supported in S9)

Allocating a new fd in current Solaris takes O(log n) where n is the number
of the fd returned. Many other OSes use O(n). (The algorithm is expensive
because POSIX requires that the lowest available fd is returned)

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Markus Elfring

unread,

Jan 4, 2004, 1:52:09 PM1/4/04

to

> I'm writing a server application that needs to serve about 15000
> concurrent users (six companies access the same server). That's the
> peek period. Usually, about 2500 concurrent users. Users submit
> requests that are queried from a database. The returned result is on
> average 500 KB in our current system, but I may implement compression
> to reduce network load. The input queries < 1K. I know that creating a
> new thread per client doesn't scale, so this is my idea on how to
> solve it:
...

How do you think about to use technologies that are related to the Globus Toolkit?
http://en.wikipedia.org/wiki/Grid_computing

Randy Howard

unread,

Jan 6, 2004, 5:07:43 AM1/6/04

to

> On Sun, 21 Dec 2003 06:10:58 -0500, hantheman wrote:
>
> > Any thoughts?

The latest Dr. Dobb's Journal has an interesting article on
this very topic. I recommend it as at least background material.

--
Randy Howard
2reply remove FOOBAR

Markus Elfring

unread,

Jan 7, 2004, 4:42:01 AM1/7/04

to

The edition "http://www.ddj.com/articles/2001/0109/" may be interesting, too.

> The latest Dr. Dobb's Journal has an interesting article on
> this very topic. I recommend it as at least background material.

Which article do you mean from the edition "http://ddj.com/articles/2004/0401/"?

Randy Howard

unread,

Jan 7, 2004, 5:10:54 AM1/7/04

to

In article <40ed1d8f.04010...@posting.google.com>,
Markus....@web.de says...

That's not the latest. I am referring to the February 04 issue, and the
article by Ian Barile entitled "I/O Multiplexing & Scalable Socket
Servers".

It's not on DDJ yet as far as I can tell, though I do note this from their
site:

"All articles are first published in our print magazine. For the most up-
to-date information, you should subscribe. Print copies are shipped to
newstands and subscribers about one month prior to the cover date."