I finally sat down and wrote a blog detailing the issues in the ThreadPool
and why it's unsuitable for use in production grade server applications.
Any feedback on this is appriciated!
http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx
--
Chris Mullins
My only feedback is that this is something I've been aware of for a
while. When I tried to make the same point on the CLR mailing list, I
was variously accused of being naive or finding problems when there
really aren't any.
Half the problem is that *anything* can use the system threadpool -
there's no way of isolating different areas of code to use different
pools.
Oh well - at least I'm not alone in considering it a problem.
--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Thanks. I'm always worred about a post like that, as it goes against what so
many other people preach.
> I have some questions on how you handled issues with ThreadPool.
> Did you consider increasing number of worker threads in the threadpool?
The problem isn't the size of the threadpool. I can use one of the ICor
interfaces (I don't remember which one) to manipulate the size of the thread
pool, and have in fact done this. I've also run into this problem on
machines that have 400 threads (by default) in the threadpool.
The problems stems from the fact that no matter how many threads are in
there, I can exhaust them all. Because the CLR itself needs threadpool
threads, this means the whole system then deadlocks.
What's needed are two threadpools: One for my application to use, and one
for the CLR to use. There's just no getting around this.
--
Chris Mullins
[Jon has seen this too]
> My only feedback is that this is something I've been aware of for a
> while. When I tried to make the same point on the CLR mailing list, I
> was variously accused of being naive or finding problems when there
> really aren't any.
That's what I've been bumping into as well. I dismiss alot of it, as its
from people who write proof-of-concept applications for a living, rather
than production grade applications. As we all know, there's a world of
difference between them.
> Half the problem is that *anything* can use the system threadpool -
> there's no way of isolating different areas of code to use different
> pools.
Yea, I agree. This was, when we ran into this 3 years or so ago, the big
issue. We needed the ability to say, "CLR gets these threads, SoapBox Server
gets these threads". Without that ability, the threadpool runs out of
threads and the whole appdomain hangs.
--
Chris Mullins
CM> The problems stems from the fact that no matter how many threads
are in
there, I can exhaust them all. Because the CLR itself needs threadpool
threads, this means the whole system then deadlocks.
Strange workaround just came to my mind, it can solve the problem with
single thread pool, but its not optiomal, for the software that you
described in your post.
The solutions is multiple AppDomains. Every AppDomain has its own
ThreadPool, so we can create something like "worker" AppDomain.
This approach is an overkill, due to marshalling, but nevertheless this
solution exists, and application can be built using this solution.
CM> What's needed are two threadpools: One for my application to use,
and
one for the CLR to use. There's just no getting around this.
Yeah, that would simplify a lot of problems, with parallel exucution in
hight performance systems.
--
Regards,
Vadym Stetsyak.
Blog: http://vadmyst.blogspot.com
--
William Stacey [MVP]
"Vadym Stetsyak" <vad...@ukr.net> wrote in message
news:urHGmqhm...@TK2MSFTNGP03.phx.gbl...
Issues:
1) 100 slow clients (in this example) can stall the TP. This could be an
attack, or just 100 slow clients. We can mitigate attack by only allowing so
many connections from same IP. Other methods, such as an async server, can
also stall after many pending reads, so not sure how much of issue this is
in reality compared to other methods. After a client does a read/reply, it
frees that thread and posts another read to the queue so other clients get
time. The next client in the queue gets to read so the server will keep
running as long as client reads eventually happen.
2) Writes are done sync on the same thread. Not sure this is much of an
issue however. The write will be done quickly with no wait. Not sure we gain
anything by doing an async write.
Advantages:
1) Built in fairness. Client requests are handled FIFO.
2) Dynamic throttle. We can throttle and apply "back-pressure" just by
changing the number of MaxThreads in the TP. This will limit the number of
pending sync reads/readprocessing on the server at any one time. New clients
just queue up.
3) Logic is sync. However we don't assign 1 thread per client for the whole
client session. Each request/reply will be handle by another TP thread
(possibly the same if Q is empty). This makes it easy to code and find bugs.
We can also use the socket timeout feature on reads and writes. Slow clients
that timeout can just be dropped (or use some algo). After a timeout, we
free that thread for another client read.
4) Easy to replace system TP with a custom TP if needed.
5) Can handle many thousands of clients with little number of threads. Only
testing we find the optimal number of threads needed for the app.
PseudoCode (just the basics. Locks, exception handling, etc, not shown for
simplicity):
---------------------------------------------------------------------------------------
ThreadPool.MinThreads = 25;
ThreadPool.MaxThreads = 100;
while(started)
{
sock = Accept();
sock.ReadTimeout = 5000;
if ( userList.Count > 5000 ) // Set to how many concurrent clients you
want.
{
sock.Close(); //A hard close - too many clients. Can do this better.
continue;
}
userList.Add(sock);
// Kick off the request/response chain for client.
ThreadPool.QueueUserWorkItem(ReadMethod, sock);
}
private void ReadMethod(object state)
{
// This is on a threadpool thread.
//
Socket sock = (Socket)state;
byte[] buf = new byte[1024];
try
{
int read = sock.Receive(buf);
// Handle request processing.
// Write reply back to client.
sock.Send(writebuf);
if ( done )
{
sock.Close()
userList.Remove(sock);
return; // Client done.
}
// Release this thread to pool, but queue another read before we
leave.
// Note: If there is no other queued work items,
// the TP will most likely just use the same thread.
ThreadPool.QueueUserWorkItem(ReadMethod, sock);
}
catch(Timeout te)
{
//Handle timeout.
}
catch(Exception ex)
{
//Handle exception
}
}
--
William Stacey [MVP]
"Chris Mullins" <cmul...@yahoo.com> wrote in message
news:uN57HGam...@TK2MSFTNGP04.phx.gbl...
I've looked all over the place for a good threadpool implementation, and
have yet to find one.
My ideal thread pool would have the following features:
1 - Threads are dynamically added and removed form the queue depending on
load.
2 - Threads have some degree of processor affinity in order to minimize
cache issues
3 - The scheduler is smart about assigning work items to threads based on
processor load.
4 - A built-in watchdog. If a work item is exceeding some timeout, a
mechanism exists to abort it.
5 - Intelligent about locking on the work item queue
I've built a number of thread pools, and looked at several that I've found
on the Internet. None of them have ever had all these qualities..
--
Chris Mullins
> 4 - A built-in watchdog. If a work item is exceeding some timeout, a
> mechanism exists to abort it.
There's no good way to abort a thread, so I wouldn't be too optimistic
about finding this feature. Thread.Abort is dangerous and evil, and any
shared structures being currently manipulated by the thread will likely
be in an inconsistent state. The best you can do is tear down the
AppDomain.
-- Barry
| 1 - Threads are dynamically added and removed form the queue depending on
| load.
What defines "load" here? Is it Q size or cpu load. CPU load logic may
compete against min/max thread parms. So what is algo here?
| 2 - Threads have some degree of processor affinity in order to minimize
| cache issues
Windows using soft affinity automatically. It prefers to run a task on the
same processor as last time, but will swith if it has to based on its algo.
IMO, the system will do a better job then we can do here. The successful
usage cases of changing processor affinity are rare. However, I agree there
may be special cases - but wonder if that will be abused if available. What
are your thoughts here?
| 3 - The scheduler is smart about assigning work items to threads based on
| processor load.
What you mean? tia
| 4 - A built-in watchdog. If a work item is exceeding some timeout, a
| mechanism exists to abort it.
That could be a good add. It is overhead, but probably not too bad.
| 5 - Intelligent about locking on the work item queue
At a minimum, you need to lock on add and remove. Are you thinking of
something specific, or just seen poor locking in the TPs you reviewed?
--
William Stacey [MVP]
--
William Stacey [MVP]
"Barry Kelly" <barry....@gmail.com> wrote in message
news:vjc6a2pqi32e4s106...@4ax.com...
[Features a threadpool should have]
> | 1 - Threads are dynamically added and removed form the queue depending
> on
> | load.
>
> What defines "load" here? Is it Q size or cpu load. CPU load logic may
> compete against min/max thread parms. So what is algo here?
I think it would be more cpu load than anything else. I'm not sold however
on this being the correct metric.
The behavior that I explicitly want to avoid is dumping 100 items into a
Queue, and having the thread pool immediatly spin up 100 threads. Likewise,
if 100 threads are spun up, I wouldn't want to see them all terminated
immediatly should there be no items in the pool.
The System thread pool seems to have a timeout of some sort before spawning
new threads. This seems to work fairly well.
>
> | 2 - Threads have some degree of processor affinity in order to minimize
> | cache issues
[Thread Affinity]
> IMO, the system will do a better job then we can do here. The successful
> usage cases of changing processor affinity are rare. However, I agree
> there
> may be special cases - but wonder if that will be abused if available.
> What
> are your thoughts here?
I tend to agree with you here, and as a result I have never myself written
code that has processor affinity. Windows has always done a good job of
things.
However I've been runnnig alot of tests latley on this machine:
http://www.coversant.net/bigiron.png
That particular machine has 16 physical Itanium2 processors, and some of
it's behavior has been unexpected. We consistantly see certain processors
load signifigantly higher than others, and it's tough to tell quite why.
Down deep in the network stack, there are some issues (which are solved by
the new Chimney features in the Scalable Network Pack) with interrupts and
processor affinity, so this may explain some of it. On the other hand, there
seems to be room for improvement. This is a more of a feeling than a fact,
I'll readily admit. It's also compounded by the lack of profiling tools that
run on Itanium (there are a few, but they're not nearly as usefull for .Net
as the Compuware stuff is).
I would really like to be able to have a way to figure out what processor
cache (especially on these Itanium2 machines with their HUGE caches) my
datastructure is already in, have a thread on that processor wake up,
process the item from the cache, write the results to main memory. I don't
think I'm going to see this any time in the furture, but it's on my wish
list. I have that feeling (I know, I know) this will give a fiarly
signifigant boost to performance as the load increases.
> | 3 - The scheduler is smart about assigning work items to threads based
> on
> | processor load.
>
> What you mean? tia
Many of the threadpools that I have seen use this algorithm:
1 -thread that wakes up
2 - thread checks for shutdown
3 - thread checks for a work item
4 - thread processes work item
5 - thread goes back to sleep
6 - goto 1
Some use variations on this, and fancy means of putting threads to sleep -
I've seen a few algorithms:
1 - Keep all threads blocked in a Montior. Whenver a work items comes in, an
"PulseOne" (or, god help us, a Pulse-All with a "First one wins" algorithm)
2 - Use Events and Waits to test for shutdown, items in the queue, etc.
3 - Items are posted to a Queue, a control thread wakes up and gives the
work item to a particular work thread.
I would really like to see an algorithm that has, in the "busy" case, as few
context switches as possible. When I post an item to a queue, there should
be exactly 1 context switch needed to process my item.
>
> | 5 - Intelligent about locking on the work item queue
>
> At a minimum, you need to lock on add and remove. Are you thinking of
> something specific, or just seen poor locking in the TPs you reviewed?
I am thinking of some specific stuff. That 16 processor Itanium box has
signifigantly influenced the way I think about locking and contention.
Typically a queue has two locks - one on Enqueue, one on Dequeue. This is
how I've been coding it for years. However, after seeing some performance
that really puzzled me, I spent some time looking into the contention in our
system. While doing that, I came across this:
http://makeashorterlink.com/?Q4132185D
I played with this test for a while, and ran it on single, dual, quad, and
16 processor boxes, and got some shocking results. After looking into this
more and more, I became quite alarmed at just how much contention we were
seeing. It was.... shocking. I have all the numbers recorded, and will be
doing a blog entry on my findings here in the not-too-distant future.
Anyway, after much wailing and gnashing of teeth, I stumbled across
thread-safe, lock-free datastructures:
http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
myself.
I added these into the test suites, and got more interesting results. On a
single or dual processor machine they're signifigantly slower. On a 16 way
machine, they're waaaaaay faster.
The end result, is that on different processor architectures, and different
numbers of processors, the locking mechanisms need to change. The difference
in some cases was a full 4 orders of magnitude (in one very dramatic case,
what took 85 milliseconds on a single processor x86 machine, took 200
milliseconds on a dual-processor machine, took 2 minutes on a quad processor
machine, and took more than 30 minutes on a 16-way machine). Granted these
cases were contrived and almost nothing but thread contention, but they
illustrated the case and let me pick the right datastructures.
My hypothetical thread pool would take this into account and have a variety
of locking mechanisms. It would even be especially nice if, in a general
some case, a work item could be handed directly to a thread with no lock
required at all. In the "lots of threads awake" case, it seems as if there
should be a way to do this.
Now, I'll be the first to admit that locking and unlocking on the queue for
work items shouldn't be too big a hit, especially when compared to parsing
some xml, string manipulation, etc - but these locks are hit at least twice
per operation (one in, one out), and at 10k operations per second, that adds
up quick.
--
Chris
[lots more great discussion about thread pools and lock-free data
structures]
It seems to me that there's a great model of how to make a high performance
server make the most effective use of multiple CPUs out there already: SQL
Server (and, I would presume, Oracle and DB2 must use similar techniques).
Within SQL Server, they go to extraordinary lengths to try to have exactly 1
unblocked thread per CPU, almost never more, and fewer only when there are
fewer work items than CPUs. Further, they go to extraordinary lengths to
ensure that context switches never occur and that worker thread never block.
Considerable writing has been done on the inner workings of the User Mode
Scheduler that SQL Server uses. For example,
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqldev/html/sqldev_02252004.asp
How does that mesh with your wish-list for the ultimate thread pool? Of
course, it's more than a thread pool - it is, in effect, an entire Operating
System built on top of the OS, requiring that code running atop it use it's
I/O and memory allocation facilities exclusively.
-cd
Here is where trade-offs come into play and needing a specific behavior
instead of something general. A simple way (that I used) is to have each
thread check, after, it is done with the delegate if new threads need to be
created (i.e. Q.Count > 1 and TPThreads.Count < maxThreads). This avoids
needing a seperate manager thread running. This, in fact, may be the
required behavior for the app (i.e. don't wait but bound the max). Having
to walk the active threads and check if they are in fact blocked and only
then spin up more can induce delay and not sure there is a managed way to do
this.
| The System thread pool seems to have a timeout of some sort before
spawning
| new threads. This seems to work fairly well.
Not sure exact way they do it either. Put has something to do with checking
if threads are blocked, then spin 1 more, and so on, upto the max.
| > | 2 - Threads have some degree of processor affinity in order to
minimize
| > | cache issues
|
| [Thread Affinity]
Not sure how much thread affinity will help here and is more likely to slow
things down. You never want to wait, so if there is a ready, willing, and
able thread ready to pop the queue and start work, then that is better then
trying to wait for same thread used last time or spend cycles trying to
figure that out when the work could have already started or completed. So
not sure we get payoff here in general. Naturally, there are probably use
cases that prove my feeling wrong.
| Some use variations on this, and fancy means of putting threads to sleep -
| I've seen a few algorithms:
| 1 - Keep all threads blocked in a Montior. Whenver a work items comes in,
an
| "PulseOne" (or, god help us, a Pulse-All with a "First one wins"
algorithm)
Need to look very hard at this one. I use pulseall in my blocking-Q because
it is the right thing to do in that case. I have seen many incorrect
implementation that use Pulse(). It may look right at first, but there a
few funky things that can happen that can create a live-lock when using just
Pulse(). In most cases, if the loop is tight, a PulseAll is not going to
create a hurding issue and may, in fact, be the only way to ensure
correctness. Sure you may get a couple threads here and there that wake-up
only to sleep again, but that should be the exception.
| I would really like to see an algorithm that has, in the "busy" case, as
few
| context switches as possible. When I post an item to a queue, there should
| be exactly 1 context switch needed to process my item.
That is the way mine implementation works. I admit to more locking then I
would like, but can't figure out a way around it while keeping everything
correct and not getting into lock-free stuff. I'll post here, so please
take shoots at it.
--
William Stacey [MVP]
> It seems to me that there's a great model of how to make a high
> performance server make the most effective use of multiple CPUs out there
> already: SQL Server (and, I would presume, Oracle and DB2 must use
> similar techniques).
The SQL Server infrastructure is great, but I'm afraid it doesn't really do
me any good. Now, as soon as Microsoft releases SQL Server 2000 or 2005
under a GPL or BSD license, I'll use that code :)
I'm not so desperate yet as to use Fibers and write my own scheduler. Next
year, perhaps, but not yet!
--
Chris Mullins
> | 1 - Keep all threads blocked in a Montior. Whenver a work items comes
> in,
> an
> | "PulseOne" (or, god help us, a Pulse-All with a "First one wins"
> algorithm)
>
> Need to look very hard at this one. I use pulseall in my blocking-Q
> because
> it is the right thing to do in that case. I have seen many incorrect
> implementation that use Pulse(). It may look right at first, but there a
> few funky things that can happen that can create a live-lock when using
> just
> Pulse(). In most cases, if the loop is tight, a PulseAll is not going to
> create a hurding issue and may, in fact, be the only way to ensure
> correctness. Sure you may get a couple threads here and there that
> wake-up
> only to sleep again, but that should be the exception.
I used the word dreaded because the normal case (where only some threads
should be active) has Windows waking up all the threads, then putting most
of them right back to sleep. This can't be a good idea... Especially with
dozens (or hundreds) of threads in the pool.
I don't know the right answer to this, but there's gotta be a better way.
--
Chris Mullins
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
If you want a fast pathed semaphore which avoids system calls much like
what critical section does for mutex, you can check here
http://groups.google.com/groups?selm=412F5D8D...@xemaps.com
Note that if you use this on xbox360 the interlocked functions don't
have memory barriers so you'll have to provide your own.
For things like timer queues I use a gating mutex or semaphore to
avoid the thundering herd problem caused by having the entire
thread pool wait on the same timeout. The gating mutex ensures
only one thread is waiting on an autoreset event with infinite
timeout if there's no work and a specific timeout if there is
a scheduled piece of work. So in pseudo code
gate.lock();
queue.lock();
for (;;) {
if queue.empty()
timeout = INFINITE;
else
timeout = queue.peek().deadline - now;
if (timeout < 0)
break;
queue.unlock();
event.wait(); // autoreset event
queue.lock()
}
item = queue.pop();
queue.unlock();
gate.unlock();
Any queueing of work that changes the head of the queue (next item
to be scheduled) signals the autoreset event.
I have, a couple years back. It would be interestsing to dust that code
off, make it .NET friendly (it's native C++), and use it to build a UMS-like
framework for .NET. That'd be a powerful tool to have in the .NET toolbox.
-cd
| gate.lock();
| queue.lock();
| for (;;) {
| if queue.empty()
| timeout = INFINITE;
| else
| timeout = queue.peek().deadline - now;
|
|
| if (timeout < 0)
| break;
|
| queue.unlock();
| event.wait(); // autoreset event
| queue.lock()
| }
|
| item = queue.pop();
| queue.unlock();
| gate.unlock();
|
I am not following that totally.
1) if queue is empty timeout is set to -1 (i.e. INFINITE). The next "if" is
then true and does a break. Then you try to pop() an empty queue? This
does not seem right. Please help.
2) Where else do you wait on timeout? Should it be event.wait(timeout)
instead of event.wait()?
tia
> I have, a couple years back. It would be interestsing to dust that code
> off, make it .NET friendly (it's native C++), and use it to build a
> UMS-like framework for .NET. That'd be a powerful tool to have in the
> .NET toolbox.
... and here I thought I was hardcore. You, sir, are well and truly sick! :)
--
Chris Mullins
Wow. I just spent some time reading Maged Michael's paper's over at IBM
reasearch. He's got some great, great stuff in there. Thanks for the
pointer.
Any idea if I can find nice (and debugged) C# implementations of his
algorithms?
--
Chris Mullins
I'm not sure. Most of the stuff I did was read lock-free (as opposed
to write lock-free) and mainly used RCU, Maged's SMR hazard pointers,
or lock-free reference counting for the GC component of the lock-free
algorithms. So you could have read lock-free hash tables, linked lists,
or even balanced binary trees if I had ever gotten around to working out
the heuristics for balancing (you can't just plug in the current
AVL or Red/Black algorithms as is). But too much patent activity in
that area so I shut down those projects.
You could take a look at the java.util.concurrent classes in the latest
version of Java to see how they did it. JSR-166 added in lock-free support.
They basically use double wide interlocked compare and exchange to
implement the AtomicMarkableReference and AtomicStampedReference classes.
You need stuff like that to avoid the ABA problem. You can use tracing GC if
it's built in to avoid the ABA problem like the article above did but
for a high throughput queue, you are going to generate a lot of garbage
for the GC to collect.