Many threads in Linux

Scott Gifford

Dec 10, 2007, 12:35:26 AM

Hello,

I have a straightforward multi-threaded server I am working on. Each
client has a socket and two threads: one thread blocks reading the
socket, then parses any lines it reads and sends them on for
processing, and another blocks on a message queue and sends any
messages out over the TCP connection as lines of text. Each connection
is fairly low-traffic, and threads are used primarily to allow
blocking I/O with C++ iostreams, which simplifies the code for our
application a great deal.

I am programming in C++ with g++ 4.1.2 using pthreads via
boost::thread 1.33.1, and am currently running on Ubuntu 7.04 ("feisty
fawn") with kernel 2.6.22.1 and glibc 2.5.0.

I'm exploring the scalability of this configuration, to see how it
would scale to 100,000 clients. The first limitation I have run into
is the number of threads I can create. At first I was only able to
create a few hundred threads. By shrinking the stack size I increased
this quite a bit, but seem to be now hitting a limit of around 32,000
threads system-wide; if I try to create more I get EAGAIN every time I
try to create an additional thread, even in a different process. I
have adjusted /proc/sys/kernel/threads-max to 64000, but still seem to
be hitting a limit around 32000. My process is only using about 500MB
of virtual memory, on a system with 1GB of RAM and several GB of swap,
and anyways when I was running out of memory I got ENOMEM. To test, I
am just calling pthread_create in a loop.
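
A test of the kind described above might look like the following minimal
sketch. It is not the actual test program: the names Gate, idle_worker,
SpawnResult and spawn_idle_threads are made up for illustration, and it is
written for a current compiler (C++11 or later) rather than g++ 4.1.2. Each
thread simply parks on a condition variable, so the only cost being
measured is the thread itself and its stack.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Shared gate: worker threads sleep here until the test is over.
struct Gate {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool open;
    Gate() : open(false) {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&cond, nullptr);
    }
    ~Gate() {
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&mutex);
    }
};

static void* idle_worker(void* arg) {
    Gate* gate = static_cast<Gate*>(arg);
    pthread_mutex_lock(&gate->mutex);
    while (!gate->open)
        pthread_cond_wait(&gate->cond, &gate->mutex);
    pthread_mutex_unlock(&gate->mutex);
    return nullptr;
}

struct SpawnResult {
    std::size_t created;  // threads successfully started
    int error;            // 0, or the error code returned (EAGAIN, ENOMEM, ...)
};

// Try to start `limit` idle threads with `stack_bytes` of stack each.
// Stops at the first failure, then releases and joins every thread.
SpawnResult spawn_idle_threads(std::size_t limit, std::size_t stack_bytes) {
    assert(limit > 0);
    Gate gate;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    SpawnResult result = {0, 0};
    // A stack size below the platform minimum is rejected here.
    result.error = pthread_attr_setstacksize(&attr, stack_bytes);

    std::vector<pthread_t> threads;
    while (result.error == 0 && threads.size() < limit) {
        pthread_t tid;
        result.error = pthread_create(&tid, &attr, idle_worker, &gate);
        if (result.error == 0) threads.push_back(tid);
    }
    pthread_attr_destroy(&attr);

    // Open the gate so every parked thread can exit, then reap them all.
    pthread_mutex_lock(&gate.mutex);
    gate.open = true;
    pthread_cond_broadcast(&gate.cond);
    pthread_mutex_unlock(&gate.mutex);
    for (pthread_t tid : threads) pthread_join(tid, nullptr);

    result.created = threads.size();
    return result;
}
```

Calling spawn_idle_threads(100000, 64 * 1024) and printing result.created
and result.error shows where creation stops and whether the failure is
EAGAIN or ENOMEM. Build with -pthread.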

Here are my questions:

* Where could this limit of 32000 threads be coming from, and how
can I increase it? Several articles on modern Linux threading
(NPTL) mention benchmarks with creating 100,000 threads, so it
seems like more should be possible.

* When I was searching for information about this, I read several
posts along the line of "if you need to create this many threads,
your design is broken". Is there an inherent reason why it's
inefficient or otherwise bad to use a large number of mostly idle
threads blocking on I/O? Or is this opinion based on the
limitations of thread libraries, and if so do those limitations
still apply?

Thanks!

---Scott.

Markus Elfring

Dec 10, 2007, 3:48:15 AM

> * When I was searching for information about this, I read several
> posts along the line of "if you need to create this many threads,
> your design is broken". Is there an inherent reason why it's
> inefficient or otherwise bad to use a large number of mostly idle
> threads blocking on I/O?

See article "The C10K problem" ...
http://kegel.com/c10k.html

Eric Sosman

Dec 10, 2007, 12:06:40 PM

Scott Gifford wrote:
> Hello,
>
> I have a straightforward multi-threaded server I am working on. Each
> client has a socket and two threads: [...]
>
> I'm exploring the scalability of this configuration, to see how it
> would scale to 100,000 clients. [...]

Hence 200000 threads. My understanding is that a 32-bit
Linux process has about 2 GB of address space to play with,
so each thread can have at most 2 GB / 200000 ~= 10.5 KB
for its stack (assuming code and other data use no space
at all). My further understanding is that the Linux page
size is 4 KB and that stack size is allocated in whole pages,
so in fact each thread gets at most an 8 KB stack. Some
operating systems also reserve a "guard page" at the active
end of the stack; I don't know whether Linux does so, but if
it does you're now down to 4 KB of stack per thread.

In short, it really doesn't matter what limit you are
running into at the moment; you're not going to get 200,000
threads running anyhow. Not in a 32-bit process. Moving to
64-bit and expanding RAM to about 16 GB would let the threads
use a fairly stingy 64 KB stack allocation, after which the
other limits (whatever they are) might merit further study.

Warning: Some of my "understandings" may actually be
misunderstandings, so you should check this stuff for
yourself.
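
The estimate above can be reproduced with a few lines of integer
arithmetic. This is only a sketch of the back-of-the-envelope calculation:
the function name stack_budget and its parameters are illustrative, and the
2 GB address space, 4 KB page size and optional guard page are the
assumptions stated in this post, not measured values.

```cpp
#include <cassert>
#include <cstdint>

// Largest whole-page stack that fits when `address_space` bytes are split
// evenly among `threads` stacks, with `guard_pages` pages of each share
// given up to guard pages.
std::uint64_t stack_budget(std::uint64_t address_space, std::uint64_t threads,
                           std::uint64_t page_size, std::uint64_t guard_pages) {
    assert(threads > 0 && page_size > 0);
    std::uint64_t pages = (address_space / threads) / page_size;
    return pages > guard_pages ? (pages - guard_pages) * page_size : 0;
}
```

With a 2 GB address space (2ULL << 30), 200000 threads and 4096-byte pages,
this gives 8192 bytes per stack without a guard page and 4096 bytes with
one.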

--
Eric....@sun.com

Marcel Müller

Dec 10, 2007, 3:17:04 PM

Hi,

Scott Gifford wrote:
[Threads O(n) with many clients]

> I'm exploring the scalability of this configuration, to see how it
> would scale to 100,000 clients.

Forget it.

You will have to take threads from a pool and assign them to a client
for the duration of a request, the way web servers, .NET, SAP and other
scalable applications solve this problem. Between requests the client
context is usually held in a database.

> * When I was searching for information about this, I read several
> posts along the line of "if you need to create this many threads,
> your design is broken".

Absolutely true.

> Is there an inherent reason why it's
> inefficient or otherwise bad to use a large number of mostly idle
> threads blocking on I/O?

Threads use resources in the scheduler and for the stack even if they are
not running. On many platforms some of the kernel resources are not
dynamic, for performance reasons.

The only platform I know where threads consume only a few resources
is the old Inmos transputer. It has a hardware scheduler, and all
you need to wake up a thread is its stack pointer. It can efficiently
deal with several thousand threads as long as each takes only a few
hundred bytes of stack space.

> Or is this opinion based on the
> limitations of thread libraries, and if so do those limitations
> still apply?

Most C runtime libraries use plenty of thread-local storage to
work properly. This adds to the private memory load of your process.
Other libraries may do the same.

Marcel

Scott Gifford

Dec 10, 2007, 3:52:11 PM

Eric Sosman <Eric....@sun.com> writes:

[...]

> Hence 200000 threads.

Right.

> My understanding is that a 32-bit Linux process has about 2 GB of
> address space to play with, so each thread can have at most 2 GB /
> 200000 ~= 10.5 KB for its stack (assuming code and other data use no
> space at all). My further understanding is that the Linux page size
> is 4 KB and that stack size is allocated in whole pages, so in fact
> each thread gets at most an 8 KB stack.

I think that's more or less right, and the minimum allowable stack
size for Linux's thread implementation is 16KB, which limits things
even further. That affects the number of threads that can run in a
single *process*, but the limitation I am running into appears to be
system-wide. For our application I believe it will be straightforward
to divide the work among multiple processes, each of which has its own
address space and thread pool, which should provide an effective way
to work around the memory problem. Or else...

[...]

> Moving to 64-bit and expanding RAM to about 16 GB would let the
> threads use a fairly stingy 64 KB stack allocation, after which the
> other limits (whatever they are) might merit further study.

....using a 64-bit platform, which is definitely a possibility. I
currently have the threads working in a 64KB stack without problems.
But before I can try this, I need to figure out whether it's possible
to overcome these limits without running into other limits that may be
much harder to overcome.

Thanks,

----Scott.

Eric Sosman

Dec 10, 2007, 4:43:49 PM

Scott Gifford wrote:
> Eric Sosman <Eric....@sun.com> writes:
>
> [...]
>
>> Hence 200000 threads.
>
> Right.
>
>> My understanding is that a 32-bit Linux process has about 2 GB of
>> address space to play with, so each thread can have at most 2 GB /
>> 200000 ~= 10.5 KB for its stack (assuming code and other data use no
>> space at all). My further understanding is that the Linux page size
>> is 4 KB and that stack size is allocated in whole pages, so in fact
>> each thread gets at most an 8 KB stack.
>
> I think that's more or less right, and the minimum allowable stack
> size for Linux's thread implementation is 16KB, which limits things
> even further. That affects the number of threads that can run in a
> single *process*, but the limitation I am running into appears to be
> system-wide. For our application I believe it will be straightforward
> to divide the work among multiple processes, each of which has its own
> address space and thread pool, which should provide an effective way
> to work around the memory problem. Or else...

More back-of-the-envelope arithmetic: 200000 threads
at 64 KB per thread stack is a little more than 12 GB.
You mentioned, I think, that you have 1 GB of physical
memory on the system. Even if the kernel and the rest
of the program occupy no memory at all, you're more than
12x oversubscribed on RAM. Each time a thread wants to
run after a period of idleness (you described the threads
as "mostly idle"), it's almost certain that the thread
stack will have been paged out and will need to be paged
back in before the thread can run. If your disk can
sustain an impressive 200 I/O operations per second, you
can page in and start 100 threads per second (because
each page-in also involves a page-out).

So: Even the most optimistic projection suggests that
the system will crumple under the load offered by 0.05%
of your 200000 threads.
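
The arithmetic behind these figures, as a small sketch. The struct and
function names are illustrative; the inputs (200000 threads, 64 KB stacks,
1 GB of RAM, 200 disk operations per second, and one page-out plus one
page-in per wakeup) are the assumptions from this post, and the model
deliberately ignores everything else that uses memory.

```cpp
#include <cassert>

struct PagingEstimate {
    double stack_gib;         // total stack reservation in GiB
    double oversubscription;  // stack_gib divided by physical RAM in GiB
    double wakeups_per_sec;   // threads that can be paged in per second
    double percent_per_sec;   // wakeups_per_sec as a percentage of all threads
};

PagingEstimate estimate_paging(double threads, double stack_kib,
                               double ram_gib, double disk_iops) {
    assert(threads > 0 && ram_gib > 0);
    PagingEstimate e;
    e.stack_gib = threads * stack_kib / (1024.0 * 1024.0);
    e.oversubscription = e.stack_gib / ram_gib;
    e.wakeups_per_sec = disk_iops / 2.0;  // each wakeup costs a page-out and a page-in
    e.percent_per_sec = 100.0 * e.wakeups_per_sec / threads;
    return e;
}
```

estimate_paging(200000, 64, 1, 200) yields about 12.2 GiB of stack, 100
wakeups per second, and 0.05% of the threads per second.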

> [...]
>
>> Moving to 64-bit and expanding RAM to about 16 GB would let the
>> threads use a fairly stingy 64 KB stack allocation, after which the
>> other limits (whatever they are) might merit further study.
>
> ....using a 64-bit platform, which is definitely a possibility. I
> currently have the threads working in a 64KB page without problems.
> But before I can try this, I need to figure out whether it's possible
> to overcome these limits without running into other limits that may be
> much harder to overcome.

I'm sorry, but I can't help you there. The point of
all these fanciful calculations has been to convince you
that your proposed design is going to be expensive. Also
wasteful: If you upgrade to 16 GB of RAM and then blow
three-quarters of it on stacks for "mostly idle" threads,
what you've effectively done is bought 4 GB of RAM at a
quadruple price.

There are saner and less profligate ways to proceed.

--
Eric....@sun.com

Scott Gifford

Dec 10, 2007, 6:22:08 PM

Eric Sosman <Eric....@sun.com> writes:
> Scott Gifford wrote:

[...]

> More back-of-the-envelope arithmetic: 200000 threads
> at 64 KB per thread stack is a little more than 12 GB.

Right, but only if they use all of their 64KB stack. In my quick test,
where my threads are very simple and have a 64KB stack, when I run
into the 32K thread limit I am using 2.5GB of address space, but only
130MB of resident memory. Here's what top(1) says:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6673 sgifford  24   0 2526m 130m  352 S   12  6.5   0:01.10 threadtest2

That's about one 4KB page of resident memory per thread, so with 1GB
of RAM the physical memory should scale to 250K of these very simple
threads (though obviously the address space available to a single
process will not).
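
The same two numbers that top(1) reports as VIRT and RES can be sampled
from inside a test program. A minimal, Linux-only sketch, assuming /proc is
mounted; MemoryUsage and read_memory_usage are illustrative names. The
first two fields of /proc/self/statm are the total program size and the
resident set size, both counted in pages.

```cpp
#include <unistd.h>
#include <cassert>
#include <cstdio>

struct MemoryUsage {
    long long virtual_bytes;   // address space in use (VIRT in top)
    long long resident_bytes;  // pages actually in RAM (RES in top)
};

// Fill `out` from /proc/self/statm. Returns false if the file cannot be
// opened or parsed, for example on a non-Linux system.
bool read_memory_usage(MemoryUsage& out) {
    std::FILE* f = std::fopen("/proc/self/statm", "r");
    if (!f) return false;
    long long size_pages = 0, resident_pages = 0;
    int fields = std::fscanf(f, "%lld %lld", &size_pages, &resident_pages);
    std::fclose(f);
    if (fields != 2) return false;
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    out.virtual_bytes = size_pages * page;
    out.resident_bytes = resident_pages * page;
    return true;
}
```

Dividing resident_bytes by the number of live threads gives the per-thread
resident cost directly, rather than reading it off top by hand.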

Of course the threads of my real application will need more stack than
that to store their actual state...

> You mentioned, I think, that you have 1 GB of physical
> memory on the system. Even if the kernel and the rest
> of the program occupy no memory at all, you're more than
> 12x oversubscribed on RAM. Each time a thread wants to
> run after a period of idleness (you described the threads
> as "mostly idle"), it's almost certain that the thread
> stack will have been paged out and will need to be paged
> back in before the thread can run. If your disk can
> sustain an impressive 200 I/O operations per second, you
> can page in and start 100 threads per second (because
> each page-in also involves a page-out).

....but the state will need to be stored somewhere regardless of the
approach, and if there isn't enough physical memory to keep all of the
actual state in memory, other approaches are equally doomed. The only
solution is to reduce the amount of state, which should work
comparably well in a thread or some other per-client data structure.

[...]

>> But before I can try this, I need to figure out whether it's possible
>> to overcome these limits without running into other limits that may be
>> much harder to overcome.
>
> I'm sorry, but I can't help you there. The point of
> all these fanciful calculations has been to convince you
> that your proposed design is going to be expensive. Also
> wasteful: If you upgrade to 16 GB of RAM and then blow
> three-quarters of it on stacks for "mostly idle" threads,
> what you've effectively done is bought 4 GB of RAM at a
> quadruple price.

Of course, I could well be wildly incorrect with all of the above,
which is why I'd like to run some benchmarks to measure the actual
performance of using a large number of threads. I could be completely
convinced of everything you're saying if I tried running with 200,000
threads and found that I had all of my physical RAM used up and the
system was swapping with every request.

But unfortunately I'm running into this 32K limit that prevents me
from verifying that my design is truly disastrous.

If anybody has any suggestions for how to get past this limit, I would
very much appreciate them.

Thanks!

----Scott.

David Schwartz

Dec 10, 2007, 7:49:33 PM

On Dec 9, 9:35 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:

> * When I was searching for information about this, I read several
> posts along the line of "if you need to create this many threads,
> your design is broken". Is there an inherent reason why it's
> inefficient or otherwise bad to use a large number of mostly idle
> threads blocking on I/O? Or is this opinion based on the
> limitations of thread libraries, and if so do those limitations
> still apply?

That's just not what threads are for. Threads are expensive execution
vehicles.

Suppose you need five buses on the road at a time, but each one could
take one of twenty thousand routes. Which is more efficient:

1) You have ten buses. Each bus takes whatever route is needed at the
time.

2) You have 20,000 buses, one for each route. You find the appropriate
bus and send it on its route.

Here's the big problem. With ten buses, every time a bus comes into
the station, you can just send it back out on whatever route most
needs to be driven. With 20,000, you have to put each arriving bus
away, dig out the appropriate bus to send out, and send that
particular bus on its way. Your buses are just going to get in each
other's way.

Getting back to threads, the main issue is context switches. Suppose
you have one CPU and you receive ten bytes on each of 15,000
connections. With your approach, you need 15,000 context switches.
With mine, you need none.

That's really just the tip of the iceberg. It's simply horrendously
inefficient to use threads just for their stacks. All you need from
these threads is someplace to store the context of each connection.
That could more easily and more elegantly be done with a data
structure.
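
As a rough illustration of storing connection context in a data structure,
here is a minimal sketch of per-connection state for a line-oriented
protocol. The Connection type and its feed method are hypothetical, not
taken from any library or from the server under discussion; the point is
only that the partial line a blocked reader thread would keep on its stack
fits in one small object.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-connection state: what a blocked reader thread would
// otherwise hold on its stack between reads.
struct Connection {
    int fd;               // socket descriptor
    std::string pending;  // bytes received so far that do not yet end in '\n'

    explicit Connection(int descriptor) : fd(descriptor) {}

    // Append newly received bytes and return every complete line, without
    // its '\n'. A trailing partial line stays buffered for the next call.
    std::vector<std::string> feed(const char* data, std::size_t length) {
        assert(data != nullptr || length == 0);
        pending.append(data, length);
        std::vector<std::string> lines;
        std::size_t start = 0;
        for (std::size_t nl = pending.find('\n'); nl != std::string::npos;
             nl = pending.find('\n', start)) {
            lines.push_back(pending.substr(start, nl - start));
            start = nl + 1;
        }
        pending.erase(0, start);
        return lines;
    }
};
```

Any thread can call feed on whichever connection has data, so the number
of threads no longer has to match the number of connections.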

DS

Scott Gifford

Dec 10, 2007, 8:47:58 PM

Hello David,

David Schwartz <dav...@webmaster.com> writes:

> On Dec 9, 9:35 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:
>

[...]

>> Is there an inherent reason why it's inefficient or otherwise bad
>> to use a large number of mostly idle threads blocking on I/O?

[...]

> That's just not what threads are for. Threads are expensive execution
> vehicles.
>
> Suppose you need five buses on the road at a time, but each one could
> take one of twenty thousand routes. Which is more efficient:
>
> 1) You have ten buses. Each bus takes whatever route is needed at the
> time.
>
> 2) You have 20,000 buses, one for each route. You find the appropriate
> bus and send it on its route.

That's a very useful metaphor. You could also look at (2) as 20,000
vehicles, one for each traveler. 10 buses are more efficient, but
20,000 cars are much more convenient for the individual travelers. In
my case, development time is at a premium, so I will take convenience
over efficiency as long as I can get away with it. :-)

[...]

> the main issue is context switches. Suppose you have one CPU and you
> receive ten bytes on each of 15,000 connections. With your approach,
> you need 15,000 context switches. With mine, you need none.

Right, that definitely makes sense as a possible performance problem.
To measure it, I tried the benchmarks here:

http://www.gelato.unsw.edu.au/IA64wiki/NPTLbenchmarks

On my laptop, they show about 980K thread context switches per second.
Very optimistically, I would be handling 3500 requests/second on
faster hardware, so this performance seems more than adequate.

> That's really just the tip of the iceberg. It's simply horrendously
> inefficient

That seems to be the common wisdom, and it may well be exactly
correct, but all the same I'd like to be able to get past the 32K
thread limit to see the scalability issues for myself. Then I could
waste my own time running benchmarks, rather than this group's time.
:-)

> to use threads just for their stacks.

Well, and also for their blocking behavior and the I/O models that
permits.

> All you need from these threads is someplace to store the context of
> each connection. That could more easily and more elegantly be done
> with a data structure.

Your recommendation, then, would be to use one of the newer efficient
select(2) replacements, like epoll(7), and just implement the
buffering and parsing I would normally get from the C++ iostreams
library myself?
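
For concreteness, the epoll(7) side of such a rewrite could look roughly
like the sketch below. The helpers watch and poll_once are hypothetical
names, the map of per-descriptor buffers stands in for real per-connection
state, and the line parsing that iostreams currently provides is not shown.
It uses level-triggered readiness, so one read per ready descriptor is
enough for each pass.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <map>
#include <string>

// Register `fd` for level-triggered read readiness.
bool watch(int epoll_fd, int fd) {
    epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Wait once (up to `timeout_ms`) and append whatever arrives to the
// buffer of the descriptor it arrived on. Returns the number of ready
// descriptors, or -1 if epoll_wait fails.
int poll_once(int epoll_fd, std::map<int, std::string>& buffers, int timeout_ms) {
    assert(epoll_fd >= 0);
    const int kMaxEvents = 64;
    epoll_event events[kMaxEvents];
    int ready = epoll_wait(epoll_fd, events, kMaxEvents, timeout_ms);
    for (int i = 0; i < ready; ++i) {
        int fd = events[i].data.fd;
        char chunk[4096];
        ssize_t n = read(fd, chunk, sizeof chunk);
        if (n > 0) {
            buffers[fd].append(chunk, static_cast<std::size_t>(n));
        } else if (n == 0 || errno != EINTR) {
            // Peer closed the connection, or a real error: stop watching.
            epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, nullptr);
            close(fd);
            buffers.erase(fd);
        }
    }
    return ready;
}
```

A single thread calling poll_once in a loop on a descriptor obtained from
epoll_create can service every connection registered with watch.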

Thanks,

----Scott.

Eric Sosman

Dec 10, 2007, 10:55:27 PM

Scott Gifford wrote:
> Eric Sosman <Eric....@sun.com> writes:
>> Scott Gifford wrote:
>
> [...]
>
>> More back-of-the-envelope arithmetic: 200000 threads
>> at 64 KB per thread stack is a little more than 12 GB.
>
> Right, but only if they use all of their 64KB stack. In my quick test,
> where my threads are very simple and have a 64KB stack, when I run
> into the 32K thread limit I am using 2.5GB of address space, but only
> 130MB of resident memory. [...]

So the 32 Kthreads are using about 80 KB apiece -- that's
consistent with 64 KB of stack bracketed by *two* 4 KB guard
pages, or more likely one guard page plus another 128 KB of
process-wide code and data. (I repeat: I'm not intimate with
the details of Linux stack management, and you should double-
check my assumptions.)

At any rate, your 32 Kthreads are one 6.1th of the
desired 200000, and 6.1 times 2.5 GB is a little more than
15 GB -- a little worse than the 12+ GB that I predicted,
but still in the same ballpark.

> ....but the state will need to be stored somewhere regardless of the
> approach, and if there isn't enough physical memory to keep all of the
> actual state in memory, other approaches are equally doomed. The only
> solution is to reduce the amount of state, which should work
> comparably well in a thread or some other per-client data structure.

Give the man a cigar! I put it to you that the amount
of memory needed to record the state is likely to be a good
deal less than an eighth of a megabyte (two 64 KB stacks) per
connection. Not only that, but you can spend what you need
without rounding everything up to whole-page units.

> Of course, I could well be wildly incorrect with all of the above,
> which is why I'd like to run some benchmarks to measure the actual
> performance of using a large number of threads. I could be completely
> convinced of everything you're saying if I tried running with 200,000
> threads and found that I had all of my physical RAM used up and the
> system was swapping with every request.
>
> But unfortunately I'm running into this 32K limit that prevents me
> from verifying that my design is truly disastrous.

Consider it an omen, an augury, a sign from the gods that
you are treading the path of wickedness and are just begging
to be struck down by a thunderbolt. No, wrong metaphor: You
are begging for a vacation in, not at, the La Brea tar pits.

Or: "Doctor, it hurts when I hit my head with rocks."
"Well, stop hitting your head." "No, I want you to make it
stop hurting, so I can hit myself even harder and find out
if my skull is really as durable as it's cracked up to be!"

--
Eric Sosman
eso...@ieee-dot-org.invalid

Scott Gifford

Dec 11, 2007, 1:44:09 AM

Eric Sosman <eso...@ieee-dot-org.invalid> writes:

> Scott Gifford wrote:
>> Eric Sosman <Eric....@sun.com> writes:
>>> Scott Gifford wrote:
>> [...]
>>
>>> More back-of-the-envelope arithmetic: 200000 threads
>>> at 64 KB per thread stack is a little more than 12 GB.
>> Right, but only if they use all of their 64KB stack. In my quick test,
>> where my threads are very simple and have a 64KB stack, when I run
>> into the 32K thread limit I am using 2.5GB of address space, but only
>> 130MB of resident memory. [...]

[...]

> At any rate, your 32 Kthreads are one 6.1th of the
> desired 200000, and 6.1 times 2.5 GB is a little more than
> 15 GB -- a little worse than the 12+ GB that I predicted,
> but still in the same ballpark.

But you have missed the point of my last post, which is that while
scaling these numbers by a factor of 6.25 predicts that they will
require 15GB of *address space*, if they follow the pattern above,
they will only use 812.5MB of *resident memory*. So the system would
require 15GB of swap and 1GB of RAM, which is quite reasonable I
think.

[...]

> you can spend what you need without rounding everything up to
> whole-page units.

It is true that the threaded approach requires page alignment, which
wastes memory on any partial pages.

[...]

> Consider it an omen, an augury, a sign from the gods that you
> are treading the path of wickedness and are just begging to be
> struck down by a thunderbolt.

:-) Perhaps, though where possible I prefer interpreting performance
measurements rather than omens...

> No, wrong metaphor: You are begging for a vacation in, not at, the
> La Brea tar pits.
>
> Or: "Doctor, it hurts when I hit my head with rocks."
> "Well, stop hitting your head." "No, I want you to make it
> stop hurting, so I can hit myself even harder and find out
> if my skull is really as durable as it's cracked up to be!"

The thing is, apart from this limit, so far it doesn't hurt at all. A
few minutes ago I ran a benchmark with 16,000 clients issuing very
simple requests and reading the responses on my current server
implementation (not the simple benchmark I was using above), which
should measure the cost of the I/O, message queue, and any
synchronization and threading overhead. During this run, the server
used 548MB of RAM (2.7GB of address space) and processed about 15,000
requests/second on my laptop, with the client running on the same box,
and me browsing the Web while I waited for the benchmark to finish.
My target right now is 3,500 requests/second on a much faster server,
and 4GB of RAM is much cheaper than the time it will take to rewrite
the server to use nonblocking I/O. So I have not yet run into any
real performance issues with 32,000 threads, except for this limit.

Thanks, Eric, for all of your input,

----Scott.

Gil Hamilton

Dec 11, 2007, 7:39:29 AM

Scott Gifford <sgif...@suspectclass.com> wrote in
news:lymysi3...@gfn.org:

It's likely that you've simply run out of virtual address space. A
32-bit version of Linux has a 4GB virtual address space, of which 1GB is
typically reserved for the kernel (all addresses >= 0xC0000000). So, you
may have plenty of physical memory and even more available system virtual
memory but you don't have address space available within your process to
hold additional pages. Given guard pages and other "dead space" in the
page tables, you should expect to see this at something less than the
full 3GB of address space. (This particular restriction would be
alleviated if you were to move to a 64-bit implementation.)

GH

Markus Elfring

Dec 11, 2007, 11:47:15 AM

> That seems to be the common wisdom, and it may well be exactly
> correct, but all the same I'd like to be able to get past the 32K
> thread limit to see the scalability issues for myself.

Can alternative design approaches help in your application?
http://haproxy.1wt.eu/#desi
http://httpd.apache.org/docs/2.2/mod/event.html

> Then I could waste my own time running benchmarks, rather than this group's time.

Why do you want to "waste" any further test time if other software developers
have published corresponding experiences already?

How much development time can be saved by the reuse of proper class libraries
for your use case?

Regards,
Markus

Markus Elfring

Dec 11, 2007, 12:02:57 PM

> Your recommendation, then, would be to use one of the newer efficient
> select(2) replacements, like epoll(7), and just implement the
> buffering and parsing I would normally get from the C++ iostreams
> library myself?

Why do you want to develop all software components yourself?
Is the library "http://monkey.org/~provos/libevent/" good enough for you?

Regards,
Markus

Scott Gifford

Dec 11, 2007, 2:21:50 PM

Markus Elfring <Markus....@web.de> writes:

>> That seems to be the common wisdom, and it may well be exactly
>> correct, but all the same I'd like to be able to get past the 32K
>> thread limit to see the scalability issues for myself.
>
> Can alternative design approaches help in your application?
> http://haproxy.1wt.eu/#desi
> http://httpd.apache.org/docs/2.2/mod/event.html

Perhaps. That is definitely what I will do if I find that threading
can't do the job.

>> Then I could waste my own time running benchmarks, rather than this
>> group's time.
>
> Why do you want to "waste" any further test time if other software
> developers have published corresponding experiences already?

I have not heard anybody saying they have tried creating a large
number of very simple threads on a platform similar to mine and run
into performance problems. If somebody has done this, I would love to
know about their experiences, and also how they managed it, so I can
try it out and see the results with my own application.

> How much development time can be saved by the reuse of proper class
> libraries for your use case?

A bit, to be sure, but not nearly as much as I can save by reusing my
existing code. :-)

[...]

Markus Elfring <Markus....@web.de> writes:

>> Your recommendation, then, would be to use one of the newer efficient
>> select(2) replacements, like epoll(7), and just implement the
>> buffering and parsing I would normally get from the C++ iostreams
>> library myself?
>

> Why do you want to develop all software components yourself?
> Is the library "http://monkey.org/~provos/libevent/" good enough for you?

I certainly do not want to unless I have to. I've looked a bit at
Niels' libevent, and also at boost::asio, and ACE. They all seem to
be helpful for portable access to the wide variety of newer efficient
select(2) replacements, but do not seem to provide much help with the
buffering, parsing, and simple I/O model that C++ iostreams currently
provide me. I will probably use one of them if I need to rewrite my
code to be event-driven.

Thanks,

----Scott.

Scott Gifford

Dec 11, 2007, 2:26:25 PM

Gil Hamilton <gil_ha...@hotmail.com> writes:

> Scott Gifford <sgif...@suspectclass.com> wrote in

[...]

>> But unfortunately I'm running into this 32K limit that prevents me
>> from verifying that my design is truly disastrous.
>
> It's likely that you've simply run out of virtual address space. A
> 32-bit version of Linux has a 4GB virtual address space, of which 1GB is
> typically reserved for the kernel (all addresses >= 0xC0000000). So, you
> may have plenty of physical memory and even more available system virtual
> memory but you don't have address space available within your process to
> hold additional pages.

Thanks Gil,

That's what I initially thought, too. But a few things convinced me
this was probably not the case:

* When I shrink the stack down to the minimum allowable size, I'm
only using 500MB of address space when I run into trouble

* When I was actually running out of address space, pthread_create()
was returning ENOMEM, but in this case it's returning EAGAIN.

* When I tried dividing up the 32K threads among 8 processes, each
with their own address space, the processes each died around 4K
threads, which implies that it's a system-wide limit on threads,
and not an address space problem.
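
Nothing in this thread so far establishes which limit is being hit, but
the kernel tunables that could plausibly bear on it can at least be read in
one place. The sketch below is an assumption about where to look, not a
diagnosis; read_limit, ThreadLimits and read_thread_limits are illustrative
names. On Linux every thread takes an ID from the same space as process
IDs, which is bounded by /proc/sys/kernel/pid_max and commonly defaults to
32768, so that value is worth comparing against the observed ceiling.
/proc/sys/vm/max_map_count is a per-process cap on memory mappings, so it
would not by itself explain a system-wide limit.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read a single integer from a /proc-style file; returns -1 if the file
// is missing or does not start with a number.
long long read_limit(const std::string& path) {
    assert(!path.empty());
    std::ifstream in(path.c_str());
    long long value = -1;
    if (!(in >> value)) return -1;
    return value;
}

struct ThreadLimits {
    long long threads_max;    // system-wide cap on threads (already raised to 64000)
    long long pid_max;        // size of the ID space shared by processes and threads
    long long max_map_count;  // per-process cap on memory mappings
};

ThreadLimits read_thread_limits() {
    ThreadLimits limits;
    limits.threads_max = read_limit("/proc/sys/kernel/threads-max");
    limits.pid_max = read_limit("/proc/sys/kernel/pid_max");
    limits.max_map_count = read_limit("/proc/sys/vm/max_map_count");
    return limits;
}
```

Printing the three fields before and after changing a tunable shows
whether the ceiling on thread creation moves with it.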

----Scott.

Chris Thomasson

Dec 11, 2007, 3:18:17 PM

"Scott Gifford" <sgif...@suspectclass.com> wrote in message
news:lywsrl1...@gfn.org...

> Markus Elfring <Markus....@web.de> writes:
>
>>> That seems to be the common wisdom, and it may well be exactly
>>> correct, but all the same I'd like to be able to get past the 32K
>>> thread limit to see the scalability issues for myself.
>>
>> Can alternative design approaches help in your application?
>> http://haproxy.1wt.eu/#desi
>> http://httpd.apache.org/docs/2.2/mod/event.html
>
> Perhaps. That is definitely what I will do if I find that threading
> can't do the job.

[...]

Threading can most certainly do the job. What's wrong with an
event-processing system driven by a thread-pool? Thread-per-client is not
going to scale.
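
A minimal sketch of that arrangement, assuming a modern compiler (C++11 or
later): a fixed set of worker threads pulls events from one shared queue.
WorkerPool is a hypothetical name, not a class from boost or any other
library, and how I/O readiness gets turned into queued events is left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Lets the workers drain the queue, then stops and joins them.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    // Queue one event (for example "connection 42 has a complete line").
    void submit(std::function<void()> event) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            events_.push(std::move(event));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> event;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !events_.empty(); });
                if (events_.empty()) return;  // only reached once stopping
                event = std::move(events_.front());
                events_.pop();
            }
            event();  // run outside the lock so workers do not serialize
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> events_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

The pool size is chosen to match the number of CPUs rather than the number
of clients. Build with -pthread.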

Marcel Müller

Dec 11, 2007, 3:37:28 PM

Scott Gifford wrote:

> That's a very useful metaphor. You could also look at (2) as 20,000
> vehicles, one for each traveler. 10 buses are more efficient, but
> 20,000 cars are much more convenient for the individual travelers. In
> my case, development time is at a premium, so I will take convenience
> over efficiency as long as I can get away with it. :-)

Well that's the kind of software I really like. Each programmer ensures
that /his/ part of the code does not consume /all/ resources of reasonable
hardware. Devil-may-care.

SCNR.
(I have to fix code like this from big companies from time to time
because of performance problems. I do everything I can to prevent my
programmers from doing similar things, because we are in the
disadvantageous position of having to use our own software in-house.)

Marcel

Scott Gifford

Dec 11, 2007, 4:28:15 PM

Marcel Müller <news.5...@spamgourmet.com> writes:

> Scott Gifford wrote:
>> That's a very useful metaphor. You could also look at (2) as 20,000
>> vehicles, one for each traveler. 10 buses are more efficient, but
>> 20,000 cars are much more convenient for the individual travelers. In
>> my case, development time is at a premium, so I will take convenience
>> over efficiency as long as I can get away with it. :-)
>
> Well that's the kind of software I really like. Each programmer
> ensures that /his/ part of the code does not consume /all/ resources of
> reasonable hardware. Devil-may-care.

:-) Well, the devil is in the details, and the detail I'm thinking
about while I'm analyzing my current system is Hoare's dictum:
"premature optimization is the root of all evil". It's a generally
accepted engineering principle (at least in the circles where I hang
out) that before rewriting your code in a more complex way to improve
its performance, you should first measure the performance to verify
that it really is a problem, and that your improvements really help.
Many hours have been wasted, and many unstable and unmaintainable
programs written, in the name of keeping away imaginary performance
bogeymen.

It's the same tradeoff decision a programmer makes when deciding
whether to port a server from Java to C++ or C++ to assembler:
efficiency of execution vs. efficiency of development.

All I am trying to do is measure the performance impact of using a
large number of threads with my current code, so I can see its limits,
decide if it will scale as much as I need it to, and if not compare it
to other approaches. In other words, to determine whether this
performance bogeyman is real or imaginary for my application.

I am not advocating a particular approach (although I would clearly
prefer to use my existing code if its performance is acceptable), just
trying to find the limits of my current approach and if necessary
prototype some other approaches and compare the performance.

----Scott.

Chris Friesen

Dec 11, 2007, 4:47:57 PM

Scott Gifford wrote:

> :-) Well, the devil is in the details, and the detail I'm thinking
> about while I'm analyzing my current system is Hoare's dictum:
> "premature optimization is the root of all evil".

If you're looking at threads at all, then you're already concerned about
performance optimization. If you weren't, then you could use separate
processes and gain memory protection.

Chris

Scott Gifford

Dec 11, 2007, 5:15:50 PM

Chris Friesen <cbf...@mail.usask.ca> writes:

Certainly I'm concerned about performance optimization! Hoare's
dictum doesn't speak against optimizing the performance bottlenecks in
your program, just against optimizing things that will not
significantly improve the performance of your program. And the only
reliable way I know to find out your bottlenecks is to measure the
performance of your program.

My application is accessing a central shared in-memory data structure.
My current implementation uses shared memory threads because they were
the most straightforward way to provide client access to this data
structure.

If I find that the client threads present an unacceptable performance
bottleneck, the next implementation will take a different approach.

-----Scott.

David Schwartz

Dec 11, 2007, 6:22:04 PM

On Dec 10, 5:47 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:

> That's a very useful metaphor. You could also look at (2) as 20,000
> vehicles, one for each traveler. 10 buses are more efficient, but
> 20,000 cars are much more convenient for the individual travelers. In
> my case, development time is at a premium, so I will take convenience
> over efficiency as long as I can get away with it. :-)

You can't have it both ways. You can't say that development time is at
a premium but you also want world-class performance. World-class
anything takes time. You have to choose -- convenience or performance?
You can't have both.

Well, you sort of can. You just started out the wrong way. Why didn't
you use a high-performance I/O library like libevent?

DS

Eric Sosman

Dec 11, 2007, 6:24:12 PM

Scott Gifford wrote:
> [...]

> :-) Well, the devil is in the details, and the detail I'm thinking
> about while I'm analyzing my current system is Hoare's dictum:
> "premature optimization is the root of all evil".

s/Hoare/Knuth/

Also, "premature" and "before coding" are not always
synonymous.

It's clear you are enamored of your foll^H^H^H^Hplan
to use an excess^H^H^H^H^H^Hunusual number of threads, so
I'm going to give up trying to persuade you to change
course. Herewith, "enough rope:"

The symptoms you report (system-wide as opposed to
per-process limit, consistent limit value, clean failure
with EAGAIN) are strongly suggestive of a kernel limit
as opposed to a library limit. The kernel limit might
be a real resource exhaustion, or there's a possibility
that you're running into a per-userid quota (you should
rule this out by running your multiple processes under
different userids).
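
The per-userid quota mentioned above can also be inspected from inside
a process. A minimal sketch, assuming Linux, where RLIMIT_NPROC counts
threads as well as processes for the real user ID. The helper names
nproc_limit and print_nproc_limit are made up for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Fetch the per-user process/thread limit (RLIMIT_NPROC).
 * Returns 0 and fills *soft and *hard on success, -1 on failure. */
static int nproc_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Print the soft limit in a readable form. */
static void print_nproc_limit(void)
{
    rlim_t soft, hard;

    if (nproc_limit(&soft, &hard) != 0) {
        perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY)
        printf("RLIMIT_NPROC soft limit: unlimited\n");
    else
        printf("RLIMIT_NPROC soft limit: %llu\n", (unsigned long long)soft);
}
```

If the soft limit printed here is close to the observed ceiling, the
quota is a likely suspect; if it is far above it, the limit is more
likely system-wide.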

If it's a real resource exhaustion in the kernel,
get out your kernel source and fire up the kernel debugger
and trace through the failure to find out where the EAGAIN
comes from, and hence to identify what resource has run
out. If you're lucky, it'll be a resource you can expand
by tuning or patching or rebuilding the kernel, or maybe
just by adding more RAM and/or swap. So add more of
whatever it takes, re-run your tests, and find out where
the next-higher limit lies. Lather, rinse, repeat.

If you're unlucky, you'll find that the kernel uses
a 16-bit thread identifier with one bit serving as a status
flag, or something of that sort. Making more bits available
is a considerably harder job, but it can be done: Change
data types, repair whatever happens when data structures
grow, rebuild whatever system utilities rely on the kernel's
data structures, and build a new system.

Either way, I think the task will be highly educational
and you'll emerge with Linux kernel "chops" that will look
impressive on your resumé.

--
Eric....@sun.com

David Schwartz

Dec 11, 2007, 6:36:15 PM

On Dec 11, 1:28 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:

> :-) Well, the devil is in the details, and the detail I'm thinking
> about while I'm analyzing my current system is Hoare's dictum:
> "premature optimization is the root of all evil". It's a generally
> accepted engineering principle (at least in the circles where I hang
> out) that before rewriting your code in a more complex way to improve
> its performance, you should first measure the performance to verify
> that it really is a problem, and that your improvements really help.
> Many hours have been wasted, and many unstable and unmaintainable
> programs written, in the name of keeping away imaginary performance
> bogeymen.

Right, but that doesn't mean that if you're writing a new OS you start
out trying to write it in BASIC. You can accept the experience of
others that an OS in BASIC will not perform anywhere near reasonable
expectations.

Thread-per-connection is known to max out at about 800 connections.
Thread pool approaches and sophisticated I/O discovery mechanisms are
known to be needed to support 10,000 connections or more.

But as I explained elsewhere, storing the context of a logical object
(such as a connection) on the stack is simply wrong.

DS

Scott Gifford

Dec 11, 2007, 6:38:25 PM

David Schwartz <dav...@webmaster.com> writes:

> On Dec 10, 5:47 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:
>
>> That's a very useful metaphor. You could also look at (2) as 20,000
>> vehicles, one for each traveler. 10 buses are more efficient, but
>> 20,000 cars are much more convenient for the individual travelers. In
>> my case, development time is at a premium, so I will take convenience
>> over efficiency as long as I can get away with it. :-)
>
> You can't have it both ways. You can't say that development time is at
> a premium but you also want world-class performance. World-class
> anything takes time. You have to choose -- convenience or performance?
> You can't have both.

With various solutions, you get varying degrees of both, and good
engineering involves finding the right tradeoffs. In my experience,
the key to making good decisions about performance tradeoffs is
measuring the performance.

> Well, you sort of can. You just started out the wrong way. Why didn't
> you use a high-performance I/O library

The existing code I started off with was single-threaded and made
extensive use of iostreams for parsing. The most straightforward
port, then, was to put each client in its own thread, so it could have
its own istream and use the same parsing code as the single-threaded
version.

My general strategy is to get the code working first in the simplest
way possible, then see where the performance bottlenecks are and fix
them. That second part is what I'm working on right now.

> like libevent?

For portable and efficient event-driven C++ code, is libevent what you
would recommend?

Thanks!

----Scott.

Scott Gifford

Dec 11, 2007, 6:45:26 PM

Eric Sosman <Eric....@sun.com> writes:

> Scott Gifford wrote:
>> [...]
>> :-) Well, the devil is in the details, and the detail I'm thinking
>> about while I'm analyzing my current system is Hoare's dictum:
>> "premature optimization is the root of all evil".
>
> s/Hoare/Knuth/

I believe Knuth was quoting Hoare, but that's neither here nor there.

> Also, "premature" and "before coding" are not always
> synonymous.

Agreed.

> It's clear you are enamored of your foll^H^H^H^Hplan
> to use an excess^H^H^H^H^H^Hunusual number of threads, so
> I'm going to give up trying to persuade you to change
> course. Herewith, "enough rope:"

I'm not enamored of my current design, I just want to see what its
performance limits are, and avoid changing it without good reason and
a way to check that the new version does in fact perform better under
a heavier load.

[ ... extremely helpful information elided ... ]

The rest of your post was extremely helpful, and I very much
appreciate all the time you've put into responding to my posts! I
will try some of your suggestions and report back what I find.

Thanks!

-----Scott.

David Schwartz

Dec 12, 2007, 7:39:41 AM

On Dec 11, 3:38 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:

> My general strategy is to get the code working first in the simplest
> way possible, then see where the performance bottlenecks are and fix
> them. That second part is what I'm working on right now.

Right, but in this case, you steered yourself directly into one of the
best known performance disasters in the world of programming.

> > like libevent?

> For portable and efficient event-driven C++ code, is libevent what you
> would recommend?

Yes, but I have never used it. This is a recommendation based on
recommendations from others.

DS

Hallvard B Furuseth

Dec 12, 2007, 8:29:30 AM

Scott Gifford writes:
>Eric Sosman writes:

>> Scott Gifford wrote:
>>> :-) Well, the devil is in the details, and the detail I'm thinking
>>> about while I'm analyzing my current system is Hoare's dictum:
>>> "premature optimization is the root of all evil".
>>
>> s/Hoare/Knuth/
>
> I believe Knuth was quoting Hoare, but that's neither here nor there.

Premature pessimization still doesn't sound much better though:-)

--
Regards,
Hallvard

Markus Elfring

Dec 12, 2007, 9:00:25 AM

> For portable and efficient event-driven C++ code, is libevent what you
> would recommend?

Would you like to try a bigger step for the refactoring of your code base?
Is the well-known CORBA standard an appropriate software design option?
http://en.wikipedia.org/wiki/Common_Object_Request_Broker_Architecture

Regards,
Markus

Marcel Müller

Dec 12, 2007, 4:21:04 PM

Hi!

Scott Gifford schrieb:

> Marcel Müller <news.5...@spamgourmet.com> writes:
>
>> Well that's the kind of software I really like. Each programmer
>> ensures that /his/ part of code does not consume /all/ resources of
>> reasonable hardware. Devil-may-care.
>
> :-) Well, the devil is in the details, and the detail I'm thinking
> about while I'm analyzing my current system is Hoare's dictum:
> "premature optimization is the root of all evil". It's a generally
> accepted engineering principle (at least in the circles where I hang
> out) that before rewriting your code in a more complex way to improve
> its performance, you should first measure the performance to verify
> that it really is a problem, and that your improvements really help.
> Many hours have been wasted, and many unstable and unmaintainable
> programs written, in the name of keeping away imaginary performance
> bogeymen.

Of course, there are situations where the optimization of the runtime
behaviour is irrelevant. But more often I have seen situations where either
the additional effort is negligible or the benefit is quite obvious. In
these cases not doing it is simply a (quite common) bad habit. And this
is especially true if it is comparable or more work to find out whether
an optimization is required than simply doing it.

In case of your problem, I am pretty sure that the benefit is obvious.
Many platforms will simply fail with a runtime error if you go
significantly beyond 10^3 threads. So a solution with O(n) in the number
of threads over the number of clients will not work with 10^5 clients
anyway. (Except when one thread serves a large, fixed number of clients,
let's say 100.)

> It's the same tradeoff decision a programmer makes when deciding
> whether to port a server from Java to C++ or C++ to assembler:
> efficiency of execution vs. efficiency of development.

These are extreme situations. In my experience the gain from such a
decision is usually much smaller than the gain from looking twice at the
solution and its complexity with respect to resource utilization (memory,
CPU etc.). Of course, if the latter has already been done, one might need
to take the next step. But more than a factor of two is unusual, except
for number crunching. And in most cases it is sufficient to implement a
few core functions in low-level languages using JNI or assembler.

> All I am trying to do is measure the performance impact of using a
> large number of threads with my current code, so I can see its limits,
> decide if it will scale as much as I need it to, and if not compare it
> to other approaches. In other words, to determine whether this
> performance bogeyman is real or imaginary for my application.

The question is not relevant, if the intended solution will not work at all.

Marcel

Scott Gifford

Dec 12, 2007, 5:42:17 PM

Scott Gifford <sgif...@suspectclass.com> writes:

> Eric Sosman <Eric....@sun.com> writes:

[...]

> [ ... extremely helpful information elided ... ]

>
> The rest of your post was extremely helpful, and I very much
> appreciate all the time you've put into responding to my posts! I
> will try some of your suggestions and report back what I find.

OK, here is what I found.

The limitation I was hitting was running out of PIDs. I increased the
maximum PIDs in the system and recompiled, and was able to get up to
about 60,000 threads before the Linux out-of-memory killer killed my
process. It happened when my process reached about 1GB of VM, and I
have my kernel configured with a 3G/1G memory split, so I will try
with a 2G/2G split and see if that makes a difference.
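
One way to see the system-wide ceilings involved is to read them from
/proc. A minimal sketch, assuming a Linux /proc filesystem; read_long
and print_thread_ceilings are hypothetical helper names, not part of
the posted benchmark. Every thread takes an ID from the same space as
process IDs, so pid_max (commonly 32768 by default) can cap the number
of threads system-wide regardless of threads-max:

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a /proc-style file.
 * Returns the value, or -1 if the file is missing or unparsable. */
static long read_long(const char *path)
{
    FILE *f;
    long v = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &v) != 1)
        v = -1;
    fclose(f);
    return v;
}

/* Print the two system-wide ceilings relevant to thread creation. */
static void print_thread_ceilings(void)
{
    printf("pid_max     = %ld\n", read_long("/proc/sys/kernel/pid_max"));
    printf("threads-max = %ld\n", read_long("/proc/sys/kernel/threads-max"));
}
```

On 64-bit kernels pid_max can usually be raised at run time by writing
a larger value to the same file as root. 32-bit kernels cap it at a
compile-time constant, which would be consistent with the rebuild
described above.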

The test I'm doing is a very simple echo server. The server is in C
and uses pthreads, one thread per client. The client is in perl and
uses epoll. I've posted the code I used here (I modified it somewhat
to improve clarity, including increasing the stack size by 4KB to
allow printing error messages, so results will not be identical),
along with a description of the changes I made to my system and the
benchmark results:

http://www.suspectclass.com/~sgifford/manythreads/

Here are the results I saw:

THREADS  VIRT MEM  RES MEM  REQ/S
-------  --------  -------  -----
   1000       18m       9m  35218
  10000      159m      79m  35216
  20000      317m     159m  33659
  30000      474m     238m  33733
  40000      632m     318m  35183
  50000      789m     398m  28864
  60000      947m     477m  28018

It does start to slow down a bit, but overall the performance is
not too bad.

The client and server ran on the same machine, which kept network
latency to a minimum, but meant they shared the system's 2 cores. I
didn't try to determine whether the performance was limited by the
client or server, but top reported that the client was using about 10%
more of the CPU than the server on all runs.

I will report back if I have luck with the 2G/2G split. In the
meantime, any thoughts are welcome, particularly if you see flaws in
my methodology or code.

----Scott.

David Schwartz

Dec 12, 2007, 7:48:23 PM

On Dec 12, 2:42 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:

> Here are the results I saw:
>
> THREADS  VIRT MEM  RES MEM  REQ/S
> -------  --------  -------  -----
>    1000       18m       9m  35218
>   10000      159m      79m  35216
>   20000      317m     159m  33659
>   30000      474m     238m  33733
>   40000      632m     318m  35183
>   50000      789m     398m  28864
>   60000      947m     477m  28018
>
> It does start to slow down a bit, but overall the performance is
> not too bad.

No, the performance is awful. It just doesn't get too much worse as
the number of threads goes up.

The difference is, for your program to handle X requests, it must make
X context switches (which means the cost of an echo may be more the
cost of a context switch). For a more sensible program to handle X
requests, it doesn't have to make any beyond one per timeslice
exhaustion (which must have a negligible effect on performance or your
timeslices are too small).

DS

Scott Gifford

Dec 13, 2007, 2:28:10 AM

David Schwartz <dav...@webmaster.com> writes:

> On Dec 12, 2:42 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:
>
>> Here are the results I saw:
>>
>> THREADS  VIRT MEM  RES MEM  REQ/S
>> -------  --------  -------  -----
>>    1000       18m       9m  35218
>>   10000      159m      79m  35216
>>   20000      317m     159m  33659
>>   30000      474m     238m  33733
>>   40000      632m     318m  35183
>>   50000      789m     398m  28864
>>   60000      947m     477m  28018
>>
>> It does start to slow down a bit, but overall the performance is
>> not too bad.
>
> No, the performance is awful.

For comparison, I implemented a small epoll-based echo server in Perl.
The performance ranged from about 4% slower to about 20% faster.
Memory usage was substantially lower across the board, though. Here's
a comparison:

             Thread   Epoll           Thread   Epoll
Connections   REQ/S   REQ/S  Speedup  Mem MB  Mem MB  Savings
      10000   35216   41496      18%      79      11      86%
      20000   33659   40724      21%     159      19      88%
      30000   33733   39108      16%     238      25      89%
      40000   35183   38944      11%     318      33      90%
      50000   28864   28680      -1%     398      38      90%
      60000   28018   27018      -4%     477      41      91%
      70000   26537   29103      10%     565      49      91%

So, I'll agree that memory usage is awful, but the speed is not too
much worse. They both seem to slow down and consume more memory
roughly linearly.

The epoll server scales somewhat further, though it seems to hit a
file descriptor limit around 100,000 even with a much higher ulimit.
I haven't investigated further.
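
For comparison with the thread-per-client version, the core of an
epoll-based echo loop looks roughly like this in C. The server measured
above was written in Perl, so this is only a sketch of the same idea;
watch_fd and echo_once are hypothetical names, and short sends are
ignored for brevity:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Register fd for readability notifications on the epoll instance epfd.
 * Returns 0 on success, -1 on failure. */
static int watch_fd(int epfd, int fd)
{
    struct epoll_event ev = {0};

    assert(epfd >= 0 && fd >= 0);
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait up to timeout_ms, then echo pending data on every ready socket.
 * Returns the number of ready sockets, 0 on timeout, -1 on error. */
static int echo_once(int epfd, int timeout_ms)
{
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, timeout_ms);

    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        char buf[512];
        ssize_t got = recv(fd, buf, sizeof buf, 0);

        if (got <= 0) {         /* EOF or error: stop watching this client */
            epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
            close(fd);
            continue;
        }
        /* Sketch only: assumes the whole reply fits in the socket buffer. */
        send(fd, buf, (size_t)got, MSG_NOSIGNAL);
    }
    return n;
}
```

One thread calling echo_once() in a loop serves every connection, so
the per-connection cost is an epoll registration and a file descriptor
rather than a thread stack, which fits the memory figures above.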

Also, it doesn't require recompiling your kernel to use it, which is
an advantage. :-)

At any rate, it appears to be fast enough for my application, if I
give it enough memory.

For reference, I added the epoll server and benchmark data to my Web
site:

http://www.suspectclass.com/~sgifford/manythreads/

[...]

> The difference is, for your program to handle X requests, it must make
> X context switches (which means the cost of an echo may be more the
> cost of a context switch).

Surely that's true, but I'm not sure it accounts for the entire
performance difference. As I mentioned in this post (msgid
<lyejdu3...@gfn.org>):

http://groups.google.com/group/comp.programming.threads/msg/10791c41f6487839

the benchmark I tried measured my system at nearly a million context
switches per second.

I suspect the other differences come from better memory efficiency and
from OS resources required to schedule the task (as Marcel Müller
suggested), but I don't really have a way to measure this.

----Scott.

Scott Gifford

Dec 13, 2007, 2:36:02 AM

Scott Gifford <sgif...@suspectclass.com> writes:

[...]

> I will report back if I have luck with the 2G/2G split. In the
> meantime, any thoughts are welcome, particularly if you see flaws in
> my methodology or code.

Using a 2G/2G split got me to around 75,000 threads before I ran out of
physical RAM.

Interestingly, although my laptop has 2GB of RAM, it ran out of
physical memory with only about 1GB of resident memory consumed. It
seems that some of the nonresident virtual memory can't be swapped
out, or something else is going on. Here's what top said before the
system basically stopped responding:

top - 00:31:12 up 44 min, 4 users, load average: 0.62, 0.65, 0.61
Tasks: 74 total, 5 running, 69 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.8%us, 83.6%sy, 0.0%ni, 1.8%id, 2.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2074068k total, 2060464k used, 13604k free, 388k buffers
Swap: 2586424k total, 9620k used, 2576804k free, 19188k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
33306 root      25   0 1484m 599m  296 R   72 29.6   0:18.58 threadtest2
33307 root      16   0  351m 349m 1780 R   41 17.2   1:04.57 perl
 7041 root      15   0  4128 1924 1460 S    0  0.1   0:00.00 bash
 6986 root      15   0  4128 1376 1096 S    0  0.1   0:00.01 bash
 5357 root      15   0  4424 1068  568 R    2  0.1   1:54.68 screen

I found this surprising; I would expect "memory used" to be roughly
the sum of the resident memory, but it was closer to the sum of
virtual memory. When I ran two copies of a simple memory hog, the
application was paged out and "memory used" matched total resident
memory, as I would expect. If anybody can shed some light on why this
is happening, I would be very appreciative.

Thanks,

---Scott.

Eric Sosman

Dec 13, 2007, 9:21:43 AM

Scott Gifford wrote:
> [...]

> Here are the results I saw:
>
> THREADS  VIRT MEM  RES MEM  REQ/S
> -------  --------  -------  -----
>    1000       18m       9m  35218
>   10000      159m      79m  35216
>   20000      317m     159m  33659
>   30000      474m     238m  33733
>   40000      632m     318m  35183
>   50000      789m     398m  28864
>   60000      947m     477m  28018
>
> It does start to slow down a bit, but overall the performance is
> not too bad.

"A bit?" It's a twenty percent plunge! Are your friends
hedge fund managers, maybe?

--
Eric Sosman
eso...@ieee-dot-org.invalid

Zeljko Vrba

Dec 13, 2007, 9:51:43 AM

On 2007-12-13, Eric Sosman <eso...@ieee-dot-org.invalid> wrote:
>
> "A bit?" It's a twenty percent plunge! Are your friends
> hedge fund managers, maybe?
>

Sometimes people have their goals set in absolute numbers. So even a 20% drop
isn't significant if the absolute numbers stay well above the performance
requirement.

mojmir

Dec 13, 2007, 1:05:54 PM

On Dec 12, 12:38 am, Scott Gifford <sgiff...@suspectclass.com> wrote:

> My general strategy is to get the code working first in the simplest
> way possible, then see where the performance bottlenecks are and fix
> them. That second part is what I'm working on right now.

well it seems to me that you are creating bottlenecks right from the
start ;)

colleague of mine had the same idea: let's do this thread per connection
and then we'll see how it's doing. of course it went wrong the first time
we reached some 500+ connections. as a hot-hot-hotfix we recommended to
lower stack size and proposed a solution using thread pool. it never went
through management as the hotfix was already acceptable for them.
afaik it's still there: lurking in the dark and waiting for intensive
traffic...
perhaps you'll be the lucky guy with lost money ;-)

personally i would never do such thing again: first it's dangerous and i
do not want to come any closer to kernel limits. second it's simply wasting
kernel resources (threads are (usually) cheaper than processes, but still
not free). third it's _totally_ useless: even on multicore you cannot
process 100000 requests in parallel, because you're limited by hardware
cores. why waste cycles spawning myriads of threads when 16 for example
will do fine? few of them can receive/send packets and forward them to/from
queue that is watched by the rest of thread pool.
i find this design more natural.
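
The design described above (a few I/O threads feeding a queue that a
fixed pool of workers drains) can be sketched with a mutex and two
condition variables. This is an illustration under assumptions, not
mojmir's code: the queue holds plain ints standing in for requests, and
the worker merely totals them. All names here are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 256

/* Bounded FIFO of work items shared by producers and workers. */
struct work_queue {
    int items[QUEUE_CAP];
    size_t head, count;
    int closed;                 /* set once no more work will arrive */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static void wq_init(struct work_queue *q)
{
    q->head = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Block while the queue is full, then append one item. */
static void wq_push(struct work_queue *q, int item)
{
    pthread_mutex_lock(&q->lock);
    assert(!q->closed);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Block while empty. Returns 1 with *item set, or 0 once the queue
 * has been closed and fully drained. */
static int wq_pop(struct work_queue *q, int *item)
{
    int ok = 0;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        *item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        ok = 1;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Announce that no more items will come and wake all waiting workers. */
static void wq_close(struct work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Stand-in for real request handling: add each item to a shared total. */
static long total;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    struct work_queue *q = arg;
    int item;

    while (wq_pop(q, &item)) {
        pthread_mutex_lock(&total_lock);
        total += item;
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}
```

The number of workers is then a tuning knob independent of the number
of connections, which is the point of the design.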

mojmir

Scott Gifford

Dec 13, 2007, 5:59:24 PM

Eric Sosman <eso...@ieee-dot-org.invalid> writes:

It's about what I expected; my experience with servers has generally
been that throughput decreases under a high enough load. It's
actually slightly better than the "plunge" I saw with a simple
epoll-based echo server I wrote for comparison, although the epoll
server was still about 10% faster overall. See message
<lyve73o...@gfn.org> for the numbers and a link to the code:

http://groups.google.com/group/comp.programming.threads/msg/c1af24bd26a4860a

-----Scott.

llothar

Dec 15, 2007, 2:03:32 PM

On Dec 13, 2:36 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:

> Using a 2G/2G got me to around 75,000 threads before I ran out of
> physical RAM.

With only 64K of possible connections (Port numbers) this shows that
you are measuring something wrong here.

And please don't do an Echo server, do something useful that might
be comparable to your real request handling.

Scott Gifford

Dec 15, 2007, 3:11:18 PM

llothar <llo...@web.de> writes:

> On Dec 13, 2:36 pm, Scott Gifford <sgiff...@suspectclass.com> wrote:
>
>> Using a 2G/2G got me to around 75,000 threads before I ran out of
>> physical RAM.
>
> With only 64K of possible connections (Port numbers) this shows that
> you are measuring something wrong here.

Hi llothar,

On Linux, all addresses in 127.0.0.0/8 are treated as individual
loopback addresses. Since a TCP connection is defined as (source
address, source port, destination address, destination port) each of
those addresses effectively has its own 64K ports. My server listens
on INADDR_ANY, and the client loops through ports and addresses in the
127.* network. With a port required on the client and the server, I
can get about 32K ports per address, which gives me almost 550 billion
possible connections. I think I will run out of threads before then.
:-)
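
The arithmetic above works out to 2^24 loopback addresses times roughly
2^15 usable ports each, i.e. 2^39 = 549,755,813,888 address/port
combinations. One way a client can spread connections over 127.0.0.0/8
is to bind each socket to a different loopback source address before
connecting. A minimal sketch, assuming Linux loopback behaviour as
described above; loopback_source and connect_from are hypothetical
names, and the posted client may do this differently:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Pick a source address in 127.0.0.0/8 for connection number `index`,
 * moving to the next address after every conns_per_addr connections. */
static struct in_addr loopback_source(uint32_t index, uint32_t conns_per_addr)
{
    struct in_addr a;
    uint32_t host;

    assert(conns_per_addr > 0);
    host = index / conns_per_addr + 1;      /* 127.0.0.1, 127.0.0.2, ... */
    assert(host < (1u << 24));              /* stay inside the /8 */
    a.s_addr = htonl(0x7F000000u | host);
    return a;
}

/* Connect from src (kernel-chosen port) to 127.0.0.1:dst_port.
 * Returns the connected socket, or -1 on failure. */
static int connect_from(struct in_addr src, uint16_t dst_port)
{
    struct sockaddr_in local, remote;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr = src;
    local.sin_port = 0;                     /* let the kernel pick a port */

    memset(&remote, 0, sizeof remote);
    remote.sin_family = AF_INET;
    remote.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    remote.sin_port = htons(dst_port);

    if (bind(fd, (struct sockaddr *)&local, sizeof local) != 0 ||
        connect(fd, (struct sockaddr *)&remote, sizeof remote) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because the source address is part of the connection's identity, two
sockets bound to different 127.x.y.z addresses can use the same source
port without colliding.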

> And please don't do an Echo server, do something useful that might
> be comparable to your real request handling.

My goal was to measure the overhead of using threads, so the server
does the minimum possible that lets me verify that it's basically
working. And at any rate, my real request handling involves only
parsing a line of text then sending a message to a message queue, so
it's fairly comparable.

If you would like to see the results of a different benchmark than the
one I ran, I would encourage you to extend my benchmark code and post
your results. I posted a link to the source earlier; you can find it
here:

http://www.suspectclass.com/~sgifford/manythreads/

-----Scott.

Hallvard B Furuseth

Dec 15, 2007, 6:08:33 PM

Scott Gifford writes:
> THREADS VIRT MEM RES MEM REQ/S

> (...)
> 60000 947m 477m 28018

You don't seem to test what happens to your system and response time
when all these threads are active at the same time. You just start,
run and finish them as fast as your client/server can submit them.

To test that, you can do
    pthread_barrier_t barrier;
    ...
    pthread_barrier_init(&barrier, NULL, 60000);
and start your thread function with
    pthread_barrier_wait(&barrier);

If you use multiple processes, use a pthread_barrierattr_t as 2nd
arg, see pthread_barrierattr_init and pthread_barrierattr_setpshared.

Another point, you need to discard the threads' exit values with
pthread_join or by detaching the threads.

You ought to do some more than toy work too, to test how a real program
will work. You don't even do any malloc/free in the thread server.
(These need synchronization between threads, which takes time.)

--
Hallvard

Chris Thomasson

Dec 15, 2007, 6:47:02 PM

"Hallvard B Furuseth" <h.b.fu...@usit.uio.no> wrote in message
news:hbf.2007...@bombur.uio.no...
> Scott Gifford writes:
[...]

> Another point, you need to discard the threads' exit values with
> pthread_join or by detaching the threads.
>
> You ought to do some more than toy work too, to test how a real program
> will work. You don't even do any malloc/free in the thread server.
> (These need synchronization between threads, which takes time.)

One little quibble: malloc/free implementation does not need any
synchronization until a boundary condition is hit (e.g., heap exhausted,
need to ask the OS for more type scenarios):

http://groups.google.com/group/comp.arch/browse_frm/thread/6dc825ec9999d3a8

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

To the OP:

Correct me if I am wrong, but you seem to be arguing for using a shi%load
of kernel threads on systems with FAR fewer CPUS to handle them? There are a
number of reasons why that's not going to scale. If the activity on your
server starts to rise, the poor OS thread-scheduler will have to manage the
thundering herds of pissed off threads. :^0 There very well may be a limit
on the scalability of the underlying operating system sub-systems wrt
exposing it to tons of kernel-level threads, on say, a 2 CPU system.

If you insist on using tons of threads, I would advise you to code up a
user-level thread scheduler sub-system. Create wrapper functions for the I/O
and other various calls that may block. After that, you can keep
per-connection context in the "stack" of the user-level thread. To get some
more scalability, you can bind control threads to specific CPUS and run the
user-threads on top of those in order to cope with NUMA setups; think of
programming the Cell processor.

Chris Thomasson

Dec 15, 2007, 9:25:00 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:MtKdnTmZ5Lss-Pna...@comcast.com...

> "Hallvard B Furuseth" <h.b.fu...@usit.uio.no> wrote in message
> news:hbf.2007...@bombur.uio.no...
>> Scott Gifford writes:
> [...]
>> Another point, you need to discard the threads' exit values with
>> pthread_join or by detaching the threads.
>>
>> You ought to do some more than toy work too, to test how a real program
>> will work. You don't even do any malloc/free in the thread server.
>> (These need synchronization between threads, which takes time.)
>
> One little quibble: malloc/free implementation does not need any
> synchronization until a boundary condition is hit (e.g., heap exhausted,
> need to ask the OS for more type scenarios):

[...]

Another boundary condition is when a thread tries to free a block of memory
that was allocated by another thread. I use caching for the fast-path and
CAS for the slow-path in this case. You use the cache to try and amortize
the number of times you call CAS.

Scott Gifford

Dec 16, 2007, 1:16:47 AM

Hallvard B Furuseth <h.b.fu...@usit.uio.no> writes:

> Scott Gifford writes:
>> THREADS VIRT MEM RES MEM REQ/S
>> (...)
>> 60000 947m 477m 28018
>
> You don't seem to test what happens to your system and response time
> when all these threads are active at the same time. You just start,
> run and finish them as fast as your client/server can submit them.

I'm not exactly sure what you mean; the threads all exist at the same
time and run for the duration of the test, but you're right that only
one is doing any work at a particular time, since my benchmark program
is single-threaded. My machine has only 2 cores, so with one running
the benchmark program and the other running the server, most of the
work will be serialized by the scheduler anyways I think, but it still
doesn't do a good job of testing concurrent load or locks.

The reason I'm not that concerned is because the threads in my
particular application are very low-activity, primarily accepting a
piece of data every 30 seconds or so, and so a bit of latency is OK as
long as the server doesn't become overwhelmed.

> To test that, you can do
>     pthread_barrier_t barrier;
>     ...
>     pthread_barrier_init(&barrier, NULL, 60000);
> and start your thread function with
>     pthread_barrier_wait(&barrier);
>
> If you use multiple processes, use a pthread_barrierattr_t as 2nd
> arg, see pthread_barrierattr_init and pthread_barrierattr_setpshared.

Thanks for the tip!

> Another point, you need to discard the threads' exit values with
> pthread_join or by detaching the threads.

Yes, the real application does this. The test program is usually
killed with an unhandled SIGINT, so it doesn't matter much what
happens to the threads.

> You ought to do some more than toy work too, to test how a real program
> will work. You don't even do any malloc/free in the thread server.
> (These need synchronization between threads, which takes time.)

That's a good point, and I have done this with the real app, I just
can't post my actual code here. In a simple scenario, the real app
holds up just fine under this load. I will need to finish a bit more
of it to test out more than simple scenarios, though.

----Scott.

Scott Gifford

Dec 16, 2007, 1:32:58 AM

"Chris Thomasson" <cri...@comcast.net> writes:

[...]

> To the OP:
>
> Correct me if I am wrong, but you seem to be arguing that using a
> shi%load of kernel threads on systems with FAR fewer CPUS to handle
> them?

That's what I'm currently doing, yes. I'm not arguing that it's the
best possible approach, but it does seem to be performing well enough.

> There are a number of reasons why that's not going to scale. If the
> activity on your server starts to rise, the poor OS thread-scheduler
> will have to manage the thundering herds of pissed off threads. :^0
> There very well may be a limit on the scalability of the underlying
> operating system sub-systems wrt exposing it to tons of kernel-level
> threads, on say, a 2 CPU system.

I keep hearing that this isn't going to scale, but in all of the tests
I've done, it seems to scale well enough. I found it was on average
about 10% slower than an epoll-based server, and as I created more
clients it slowed down at about the same rate as the epoll-based
server. It does use a shi%load of memory, though. :-)

> If you insist on using tons of threads, I would advise you to code up
> a user-level thread scheduler sub-system. Create wrapper functions for
> the I/O and other various calls that may block. After that, you can
> keep per-connection context in the "stack" of the user-level
> thread. To get some more scalability, you can bind control threads to
> specific CPUS and run the user-threads on top of those in order to
> cope with NUMA setups; think of programming the Cell processor.

I thought about using GNU pth for this, to run the networking code in
user-level threads to avoid consuming OS resources, but I suspect it
would be less work to rewrite the code to use epoll() or libevent.
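
For comparison, here is a rough sketch of what one pass of an epoll-based echo loop might look like. It is not the epoll server that was benchmarked; watch_fd, echo_ready, MAX_EVENTS and the ECHO_BUFSIZE value are assumptions made for illustration. One thread services whichever descriptors are readable instead of dedicating a blocked thread to each connection.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>   /* the descriptors being watched are sockets */
#include <unistd.h>

#define MAX_EVENTS   64     /* assumed batch size */
#define ECHO_BUFSIZE 1024   /* assumed buffer size */

/* Register fd for readability notifications. Returns 0 or -1. */
int watch_fd(int epfd, int fd)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* One pass of the event loop: wait up to timeout_ms, then echo once on
 * every ready descriptor. Returns the number of ready descriptors, or
 * -1 if epoll_wait() failed. */
int echo_ready(int epfd, int timeout_ms)
{
    struct epoll_event events[MAX_EVENTS];
    int n, i;

    assert(timeout_ms >= -1);
    n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    for (i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        char buf[ECHO_BUFSIZE];
        ssize_t got = read(fd, buf, sizeof buf);
        ssize_t put = got > 0 ? write(fd, buf, (size_t)got) : -1;

        /* Sketch only: short writes are ignored; EOF or any error
         * closes the connection, which also removes it from epfd. */
        if (put < 0)
            close(fd);
    }
    return n;
}
```

The per-connection state that a thread keeps on its stack has to be stored explicitly in this model, which is where the cost of porting an existing blocking parser comes from.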

----Scott.

Dmitriy Vyukov

Dec 16, 2007, 3:21:26 AM

On Dec 16, 2:47 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> > You ought to do some more than toy work too, to test how a real program
> > will work. You don't even do any malloc/free in the thread server.
> > (These need synchronization between threads, which takes time.)
>
> One little quibble: malloc/free implementation does not need any
> synchronization until a boundary condition is hit (e.g., heap exhausted,
> need to ask the OS for more type scenarios):

Hmmm. But I think that at least 1 malloc() and free() call have to be
in every thread. Even if you use some amortizations and smart
tricks...
Or at least 2 CAS operations if you use global memory cache...

Dmitriy V'jukov

llothar

Dec 16, 2007, 3:50:12 AM

> My goal was to measure the overhead of using threads, so the server
> does the minimum possible that lets me verify that it's basically
> working. And at any rate, my real request handling involves only
> parsing a line of text then sending a message to a message queue, so
> it's fairly comparable.

Okay, I see.

This seems to work as long as you can do the parsing and queuing in one
time slice, so that there are only ready or idle tasks, and the work of
a thread is never delayed by serving 100000 other threads first before
it can do some work again. This might work well because it is a very,
very simple server.

On the other hand, what I don't understand is: if this is really so
easy, why do you waste so much time discussing it here? I can write your
threaded server and a libevent server in less than an afternoon if it's
really only about taking one line, parsing it, queueing something in
another event loop, and then waiting again.

It is just a contradiction to your "development time is precious"
argument.

Dmitriy Vyukov

Dec 16, 2007, 4:53:39 AM

On Dec 16, 11:50 am, llothar <llot...@web.de> wrote:
> > My goal was to measure the overhead of using threads, so the server
> > does the minimum possible that lets me verify that it's basically
> > working. And at any rate, my real request handling involves only
> > parsing a line of text then sending a message to a message queue, so
> > it's fairly comparable.
>
> Okay, I see.
>
> This seems to work as long as you can do the parsing and queuing in one
> time slice, so that there are only ready or idle tasks, and the work of
> a thread is never delayed by serving 100000 other threads first before
> it can do some work again. This might work well because it is a very,
> very simple server.

Good point. If request processing involves some system calls or memory
allocation or enqueueing into some queue protected by mutex, worker
thread can be blocked inside request processing, and things can get
bad.

Scott, you can try to insert something like this into request
processing:

// sleep 1 ms with 10% probability (sleep() takes seconds, so use usleep)
if (0 == (rand() % 10)) usleep(1000);

Dmitriy V'jukov

Scott Gifford

Dec 16, 2007, 3:53:23 PM

Dmitriy Vyukov <dvy...@gmail.com> writes:

> On Dec 16, 11:50 am, llothar <llot...@web.de> wrote:
>> > My goal was to measure the overhead of using threads, so the server
>> > does the minimum possible that lets me verify that it's basically
>> > working. And at any rate, my real request handling involves only
>> > parsing a line of text then sending a message to a message queue, so
>> > it's fairly comparable.
>>
>> Okay, I see.
>>
>> This seems to work as long as you can do the parsing and queuing in one
>> time slice, so that there are only ready or idle tasks, and the work of
>> a thread is never delayed by serving 100000 other threads first before
>> it can do some work again. This might work well because it is a very,
>> very simple server.
>
>
> Good point. If request processing involves some system calls or memory
> allocation or enqueueing into some queue protected by mutex, worker
> thread can be blocked inside request processing, and things can get
> bad.

Yes, a very good point.

> Scott, you can try to insert something like this into request
> processing:
>
> // sleep 1 ms with 10% probability (sleep() takes seconds, so use usleep)
> if (0 == (rand() % 10)) usleep(1000);

I tried a few variations on this, and they cut the performance roughly
in half. First I tried sleeping for 1ms with nanosleep. Next I tried
yielding the processor with sched_yield. Finally I tried holding a
mutex while writing to any socket. I just tried with 30K and 60K
threads. None are exact simulations of what my app does, but they're
still interesting. Here's what I saw.

        REQ/S   REQ/S   REQ/S   REQ/S
COUNT   ORIG    SLEEP   YIELD   MUTEX
-----   -----   -----   -----   -----
30000   33734   15394   19946   16818
60000   28019   11495   14341   12511

That performance is still acceptable for my app, but it is definitely
much worse. It is probably close to a worst case, though. Sleep and
yield always require the thread to give up its timeslice, while in an
actual app the lock would sometimes be uncontended and so not slow it
down at all. The mutex test simulates the lock contention accurately,
but OS writes take much longer than queueing to an in-memory data
structure, so in a real app the lock would be held for much less time.
So the actual performance will be somewhere between the best-case ORIG
and the worst-case SLEEP, depending on the workload and lock
contention.
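
The three variations could be sketched roughly as below. This is a reconstruction under assumptions, not the code that produced the numbers above: the names perturb, write_lock and perturbed_write are hypothetical, and the perturbation is applied on every call because the post does not say how often it was triggered.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

enum perturb { PERTURB_NONE, PERTURB_SLEEP, PERTURB_YIELD, PERTURB_MUTEX };

/* One process-wide lock, shared by every connection thread. */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Write len bytes to fd after applying the chosen perturbation.
 * Returns whatever write() returns. */
ssize_t perturbed_write(enum perturb mode, int fd, const void *buf, size_t len)
{
    ssize_t n;
    int rc;

    switch (mode) {
    case PERTURB_SLEEP: {
        struct timespec one_ms = { 0, 1000000L };   /* 1 ms */
        nanosleep(&one_ms, NULL);   /* always gives up the timeslice */
        break;
    }
    case PERTURB_YIELD:
        sched_yield();              /* let another runnable thread go first */
        break;
    case PERTURB_MUTEX:
        /* Serialize all socket writes behind a single lock. */
        rc = pthread_mutex_lock(&write_lock);
        assert(rc == 0);
        (void)rc;
        n = write(fd, buf, len);
        pthread_mutex_unlock(&write_lock);
        return n;
    case PERTURB_NONE:
        break;
    }
    return write(fd, buf, len);
}
```

Holding the lock across write() is deliberately pessimistic: a lock around an in-memory queue operation would be held for far less time than one around a system call.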

----Scott.

Hallvard B Furuseth

Dec 16, 2007, 4:14:32 PM

Scott Gifford writes:
>Hallvard B Furuseth <h.b.fu...@usit.uio.no> writes:
>>Scott Gifford writes:
>>> THREADS  VIRT MEM  RES MEM  REQ/S
>>>   (...)
>>>   60000      947m     477m  28018
>>
>> You don't seem to test what happens to your system and response time
>> when all these threads are active at the same time. You just start,
>> run and finish them as fast as your client/server can submit them.
>
> I'm not exactly sure what you mean; the threads all exist at the same
> time and run for the duration of the test, but you're right that only
> one is doing any work at a particular time, since my benchmark program
> is single-threaded.

Maybe I've looked at the wrong program - manythreads.zip:threadtest2.c.
Each echo_socket() thread reads from the socket, echoes back, and
terminates as quickly as possible. I see no code there to ensure that
all those 60000 threads will ever coexist, and that the system can
handle that well.

Though on second look I notice there is a waste_time() test too, that'll
test coexistence of all those threads. (Without them trying to do much
though, so you don't measure how that affects performance:-)

>> Another point, you need to discard the threads' exit values with
>> pthread_join or by detaching the threads.
>
> Yes, the real application does this. The test program is usually
> killed with an unhandled SIGINT, so it doesn't matter much what
> happens to the threads.

Might affect the test's performance. The system must keep the threads'
exit values until you collect them, and must reserve the thread IDs
until then. So it gets an unnecessarily big table of thread IDs.
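
As a small illustration of the detaching option, the sketch below creates a thread in the detached state so that its ID and exit value are released as soon as it finishes. The helper names spawn_detached, post_done and demo_detached are hypothetical; the semaphore is only there so that the caller can observe completion without pthread_join().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

/* Create a thread whose resources are reclaimed automatically on exit,
 * so it must not be joined. Returns 0 or a pthread error number. */
int spawn_detached(void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    assert(fn != NULL);
    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

static sem_t done;   /* static, so it outlives the worker's sem_post() */

static void *post_done(void *arg)
{
    (void)arg;
    sem_post(&done);
    return NULL;
}

/* Spawn one detached worker and wait for its signal. Returns 0 on success. */
int demo_detached(void)
{
    int rc;

    if (sem_init(&done, 0, 0) != 0)
        return -1;
    rc = spawn_detached(post_done, NULL);
    if (rc != 0)
        return rc;
    while (sem_wait(&done) != 0)
        ;   /* retry if interrupted by a signal */
    return 0;
}
```

An already running thread can get the same effect by calling pthread_detach(pthread_self()).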

--
Hallvard

Scott Gifford

Dec 16, 2007, 6:06:24 PM

Hallvard B Furuseth <h.b.fu...@usit.uio.no> writes:

> Scott Gifford writes:
>>Hallvard B Furuseth <h.b.fu...@usit.uio.no> writes:
>>>Scott Gifford writes:
>>>> THREADS  VIRT MEM  RES MEM  REQ/S
>>>>   (...)
>>>>   60000      947m     477m  28018
>>>
>>> You don't seem to test what happens to your system and response time
>>> when all these threads are active at the same time. You just start,
>>> run and finish them as fast as your client/server can submit them.
>>
>> I'm not exactly sure what you mean; the threads all exist at the same
>> time and run for the duration of the test

[...]

> Maybe I've looked at the wrong program - manythreads.zip:threadtest2.c.
> Each echo_socket() thread reads from the socket, echoes back, and
> terminates as quickly as possible. I see no code there to ensure that
> all those 60000 threads will ever coexist, and that the system can
> handle that well.

Ah, I see what you mean. Notice that the outer loop in echo_socket()
is:

while ((readBytes = read(fd, buf, ECHO_BUFSIZE)) > 0)
...

That will keep reading until EOF or error, so as long as the client
leaves the socket open, the thread will keep running. And the client
leaves all of the sockets open until it is done running, so all of the
server threads exist concurrently.
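
A self-contained sketch of such a thread function follows. It is not the threadtest2.c source: the ECHO_BUFSIZE value, the inner write loop and the echo_roundtrip helper are assumptions added so the example can be exercised over a socketpair.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define ECHO_BUFSIZE 1024   /* assumed; the real constant is not shown */

/* Per-connection thread: echo whatever arrives until EOF or error. */
static void *echo_socket(void *arg)
{
    char buf[ECHO_BUFSIZE];
    ssize_t readBytes;
    int fd;

    assert(arg != NULL);
    fd = *(int *)arg;
    /* On a blocking socket read() sleeps while no data is queued, so the
     * loop, and with it the thread, lives until the peer closes. */
    while ((readBytes = read(fd, buf, ECHO_BUFSIZE)) > 0) {
        ssize_t off = 0;
        while (off < readBytes) {   /* cope with short writes */
            ssize_t n = write(fd, buf + off, (size_t)(readBytes - off));
            if (n < 0)
                goto out;
            off += n;
        }
    }
out:
    close(fd);
    return NULL;
}

/* Send msg through an echo thread and collect the reply.
 * Returns 0 if len bytes came back, -1 otherwise. */
int echo_roundtrip(const char *msg, size_t len, char *reply)
{
    int sv[2];
    pthread_t tid;
    size_t got = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (pthread_create(&tid, NULL, echo_socket, &sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (write(sv[1], msg, len) == (ssize_t)len) {
        while (got < len) {
            ssize_t n = read(sv[1], reply + got, len - got);
            if (n <= 0)
                break;
            got += (size_t)n;
        }
    }
    close(sv[1]);   /* EOF: the server's read() returns 0 and it exits */
    pthread_join(tid, NULL);
    return got == len ? 0 : -1;
}
```

The thread only ends once the client side is closed, which is the property the benchmark relies on to keep all server threads alive at once.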

[...]

>>> Another point, you need to discard the threads' exit values with
>>> pthread_join or by detaching the threads.
>>
>> Yes, the real application does this. The test program is usually
>> killed with an unhandled SIGINT, so it doesn't matter much what
>> happens to the threads.
>
> Might affect the test's performance. The system must keep the threads'
> exit values until you collect them, and must reserve the thread IDs
> until then. So it gets an unnecessarily big table of thread IDs.

Ah, thanks, that's useful to know.

----Scott.

Scott Gifford

Dec 16, 2007, 6:28:08 PM

llothar <llo...@web.de> writes:

[...]

> On the other hand, what I don't understand is: if this is really so
> easy, why do you waste so much time discussing it here?

That is a fair question, and one I certainly considered while I was
running those benchmarks. I think it's useful to understand the
limitations of the "many threads" approach, and also to try out epoll
to see how it worked and what benefits it has. And I'll admit I got a
bit caught up in curiosity and probably spent more time on it than I
should have. But I did learn a fair amount, so it wasn't wasted time.

> I can write your threaded server and a libevent server in less than
> an afternoon if it's really only about taking one line, parsing it,
> queueing something in another event loop, and then waiting again.

The actual parsing is somewhat more complicated than that, and isn't
code that I wrote, so to rewrite the parser to use a nonblocking I/O
model will require carefully reading the existing code and "porting"
it to the new model, figuring out how to exercise boundary conditions
in unit tests, etc. It's not a giant project, but it will take a week
or two, and right now that time would be better spent on other
priorities than performance, as long as performance will be
acceptable.

Another way to put it is that I'm trying to be as lazy as possible,
but no lazier, and those tests helped me confirm that I'm not being
too lazy. :-)

----Scott.

Chris Thomasson

Dec 16, 2007, 6:53:27 PM

"Scott Gifford" <sgif...@suspectclass.com> wrote in message
news:lyabobb...@gfn.org...

> "Chris Thomasson" <cri...@comcast.net> writes:
>
> [...]
>
>> To the OP:
>>
>> Correct me if I am wrong, but you seem to be arguing for using a
>> shi%load of kernel threads on systems with FAR fewer CPUs to handle
>> them?
>
> That's what I'm currently doing, yes. I'm not arguing that it's the
> best possible approach, but it does seem to be performing well enough.
>
>> There are a number of reasons why that's not going to scale. If the
>> activity on your server starts to rise, the poor OS thread-scheduler
>> will have to manage the thundering herds of pissed off threads. :^0
>> There very well may be a limit on the scalability of the underlying
>> operating system sub-systems wrt exposing it to tons of kernel-level
>> threads, on say, a 2 CPU system.
>
> I keep hearing that this isn't going to scale, but in all of the tests
> I've done, it seems to scale well enough.

Well, if you're getting acceptable scalability characteristics, that's fine.

> I found it was on average
> about 10% slower than an epoll-based server, and as I created more
> clients it slowed down at about the same rate as the epoll-based
> server.

Have you tried using aio?

http://www.opengroup.org/onlinepubs/009695399/basedefs/aio.h.html

> It does use a shi%load of memory, though. :-)

:^)

[...]

Chris Thomasson

Dec 16, 2007, 7:09:30 PM

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:7f0e2c53-247c-4ea3...@e10g2000prf.googlegroups.com...

> On Dec 16, 2:47 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> > You ought to do some more than toy work too, to test how a real program
>> > will work. You don't even do any malloc/free in the thread server.
>> > (These need synchronization between threads, which takes time.)
>>
>> One little quibble: malloc/free implementation does not need any
>> synchronization until a boundary condition is hit (e.g., heap exhausted,
>> need to ask the OS for more type scenarios):
>
>
> Hmmm. But I think that at least 1 malloc() and free() call have to be
> in every thread.

[...]

I am not exactly sure what you're getting at here; could you please clarify a
bit?

Thanks.

Dmitriy Vyukov

Dec 16, 2007, 8:01:28 PM

On Dec 17, 03:09, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

I mean that thread start and thread end are the boundary conditions you
are talking about.
When the first malloc() in a thread is executed, the implementation must
use some kind of synchronization to get memory from the OS or from a
global pool.
When the thread ends, the implementation must use some kind of
synchronization to return memory to the OS or to the global pool.
All other malloc()/free() calls can avoid synchronization altogether.
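
To make the argument concrete, here is a toy per-thread allocator in which the lock is taken only at those two boundaries. It is an illustration, not how glibc malloc or any allocator mentioned in this thread works: all names are hypothetical, __thread is a GCC extension, each thread gets a single fixed-size chunk, and the fallback malloc() merely stands in for asking the OS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096   /* assumed; must stay a multiple of 16 */

struct chunk {
    struct chunk *next;
    size_t used;
    char data[CHUNK_SIZE];
};

static struct chunk *global_pool;            /* free chunks, shared */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int pool_lock_count;                  /* instrumentation only */
static __thread struct chunk *my_chunk;      /* private to one thread */

/* Boundary condition 1: first allocation in a thread (synchronized). */
static struct chunk *chunk_acquire(void)
{
    struct chunk *c;

    pthread_mutex_lock(&pool_lock);
    pool_lock_count++;
    c = global_pool;
    if (c != NULL)
        global_pool = c->next;
    pthread_mutex_unlock(&pool_lock);
    if (c == NULL)
        c = malloc(sizeof *c);   /* stand-in for getting memory from the OS */
    if (c != NULL) {
        c->next = NULL;
        c->used = 0;
    }
    return c;
}

/* Fast path: bump allocation from the thread's own chunk, no locking. */
void *local_alloc(size_t size)
{
    void *p;

    if (size == 0 || size > CHUNK_SIZE)
        return NULL;                       /* no arbitrary sizes here */
    size = (size + 15) & ~(size_t)15;      /* keep offsets multiples of 16 */
    if (my_chunk == NULL && (my_chunk = chunk_acquire()) == NULL)
        return NULL;
    if (my_chunk->used + size > CHUNK_SIZE)
        return NULL;                       /* sketch: no refill */
    p = my_chunk->data + my_chunk->used;
    my_chunk->used += size;
    assert(my_chunk->used <= CHUNK_SIZE);
    return p;
}

/* Boundary condition 2: thread end, chunk goes back to the shared pool. */
void local_release(void)
{
    if (my_chunk == NULL)
        return;
    pthread_mutex_lock(&pool_lock);
    pool_lock_count++;
    my_chunk->next = global_pool;
    global_pool = my_chunk;
    pthread_mutex_unlock(&pool_lock);
    my_chunk = NULL;
}

/* How many times the shared lock has been taken so far. */
int pool_lock_acquisitions(void)
{
    int n;

    pthread_mutex_lock(&pool_lock);
    n = pool_lock_count;
    pthread_mutex_unlock(&pool_lock);
    return n;
}
```

A thread that calls local_alloc() many times and local_release() once takes the shared lock exactly twice, at the cost of refusing any request that does not fit in its chunk.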

Dmitriy V'jukov

David Schwartz

Dec 17, 2007, 12:51:26 AM

On Dec 16, 1:53 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> Scott, you can try to insert something like this into request
> processing:
>
> // sleep 1 ms with 10% probability (sleep() takes seconds, so use usleep)
> if (0 == (rand() % 10)) usleep(1000);

No, no, no!

His problem is almost certainly too many context switches. Increasing
them by 10% isn't going to help!

DS

Hallvard B Furuseth

Dec 17, 2007, 1:06:26 AM

Scott Gifford writes:
> Notice that the outer loop in echo_socket() is:
>
> while ((readBytes = read(fd, buf, ECHO_BUFSIZE)) > 0)
> ...
>
> That will keep reading until EOF or error,

or until the socket is drained of its current contents. Then
read() returns 0.

--
Hallvard

Chris Thomasson

Dec 17, 2007, 1:37:07 AM

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:838c5167-15b2-488e...@e25g2000prg.googlegroups.com...

On Dec 17, 03:09, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
> news:7f0e2c53-247c-4ea3...@e10g2000prf.googlegroups.com...
>
> > On Dec 16, 2:47 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> >
> >> > You ought to do some more than toy work too, to test how a real
> >> > program will work. You don't even do any malloc/free in the thread
> >> > server. (These need synchronization between threads, which takes
> >> > time.)
> >>
> >> One little quibble: malloc/free implementation does not need any
> >> synchronization until a boundary condition is hit (e.g., heap
> >> exhausted, need to ask the OS for more type scenarios):
> >
> > Hmmm. But I think that at least 1 malloc() and free() call have to be
> > in every thread.
>
> [...]
>
> I am not exactly sure what you're getting at here; could you please
> clarify a bit?

[...]

> I mean that thread start and thread end are boundary conditions you
> are talking about.
> When first malloc() in thread is executed, implementation must use
> some kind of synchronization to get memory from OS, or from global
> pool.

Not with the distributed stack allocator I did:

http://groups.google.com/group/comp.programming.threads/msg/7cceeb76aa3301ce
(last paragraph...)

The basic outline of my algorithm was something like:

/* system-level thread entry */
void* vzthreadsys_entry(void* state) {
    vzthread* const _this = state;

    /* set up our stack-based raw buffers */
    char rawbuf1[32000] = { 0 };
    char rawbuf2[32000] = { 0 };
    char rawbuf3[32000] = { 0 };
    char* rawbufs[3] = { rawbuf1, rawbuf2, rawbuf3 };

    /* prime this thread's allocator with the raw buffers */
    vzthreadsys_heap_prime(_this, rawbufs, 3, 32000);

    /* call the user-level thread entry w/ user-state */
    _this->ufpentry(_this->ustate);

    /* release the heap, and wait if needed */
    if (vzthreadsys_heap_release(_this)) {
        vzthreadsys_heap_dtorwait(_this);
    }

    return NULL;
}

> When thread ends implementation must use some kind of synchronization
> to return memory to OS, or to global pool.

Yes. Even the stack allocator needed special synchronization on thread
termination when a certain scenario is encountered (e.g.,
vzthreadsys_heap_dtorwait).

[...]

Scott Gifford

Dec 17, 2007, 1:47:22 AM

Hallvard B Furuseth <h.b.fu...@usit.uio.no> writes:

The socket is in blocking mode (the default), so when the socket is
drained of its current contents, read() blocks until more content
becomes available.

----Scott.

Marcin ‘Qrczak’ Kowalczyk

Dec 17, 2007, 2:35:28 AM

On Mon, 17-12-2007 at 07:06 +0100, Hallvard B Furuseth wrote:

> or until the socket is drained of its current contents.
> Then read() returns 0.

read() does not return 0 in such case. It either blocks until data is
available (in blocking mode), or returns -1 with errno set to EAGAIN
(in non-blocking mode).
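
The cases can be told apart as in the sketch below, which assumes a descriptor that has been switched to non-blocking mode. The enum and the helpers classify_read and set_nonblocking are hypothetical names; read(), fcntl(), O_NONBLOCK and EAGAIN are the real interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

enum read_state { READ_DATA, READ_WOULD_BLOCK, READ_EOF, READ_ERROR };

/* Put fd in non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Perform one read() and classify the outcome; *nread gets the raw
 * return value. len must be non-zero, since a zero-length read also
 * returns 0 and would be mistaken for EOF. */
enum read_state classify_read(int fd, char *buf, size_t len, ssize_t *nread)
{
    ssize_t n;

    assert(len > 0);
    n = read(fd, buf, len);
    *nread = n;
    if (n > 0)
        return READ_DATA;
    if (n == 0)
        return READ_EOF;            /* peer closed the connection */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_WOULD_BLOCK;    /* drained, but still open */
    return READ_ERROR;
}
```

On a blocking descriptor the READ_WOULD_BLOCK case never occurs, because read() simply waits, so a loop of the form `while (read(...) > 0)` ends only on EOF or error.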

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Hallvard B Furuseth

Dec 17, 2007, 3:50:43 AM

Duh. Of course. Also read() before EOF can only return 0 with the
old and braindead SysV O_NDELAY, not with O_NONBLOCK.

--
Hallvard

Dmitriy Vyukov

Dec 18, 2007, 2:24:13 AM

On Dec 17, 9:37 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> > I mean that thread start and thread end are boundary conditions you
> > are talking about.
> > When first malloc() in thread is executed, implementation must use
> > some kind of synchronization to get memory from OS, or from global
> > pool.
>
> Not with the distributed stack allocator I did:

Ok. When the first malloc() in a thread is executed, the implementation
must use some kind of synchronization to get memory from the OS or from
a global pool, *if* we want to permit allocations of arbitrary size.
Are you satisfied now? :)

Dmitriy V'jukov