So I have a very basic producer/consumer model application that
"works" in that the results are correct and there's no hanging, but
I'm not getting the performance that I want on multi-core processors.
Of course, multi-core processors were the main reason I bothered to
implement multithreading at all. I will explain it in general and
then explain it in more detail with possible code to follow if there
are responses.
Problem:
I've essentially implemented the producer/consumer model and in my
application, I can control the number of threads. On a single
processor (all Intel: Pentium 4s or Centrinos), I get about a 20%
speedup (compared to 1 thread) when I increase to 3 threads. Beyond
3 threads the performance declines, but gracefully.
On a dual-core machine (or my two dual-core machines at work), I get
good performance with 1 thread, but the performance with 2 or more
threads declines significantly; that is, it's about a 50% or more
performance decrease!
Implementation sketch:
I have P producer threads and 1 consumer thread that tallies the
results. There is no data sharing for each of the producer threads,
i.e., they each work on instances of the "simulation" class and they
each have separate random number generators (each starting on
different seeds). The way I'm allocating memory for the results
seems worth mentioning. Essentially, I have a result container class
that preallocates R instances of the result class. I typically make
R range from 2*P to 5*P but get the same results across this range. More
than 5*P seems pointless since the consumer is so much faster than the
producer (about 100x or more).
Before a producer thread begins a new simulation, it gets a pointer to
one of the preallocated instances of the result class. It then
simulates (which takes about 35 milliseconds), fills the result class and
tells the result container class that this pointer points to filled
result data and needs to be tallied by the consumer thread. Rinse and
repeat for the producer thread.
The consumer thread only takes filled results, collects the data, and
tallies the results (overall maybe about 100 nanoseconds). It then
tells the container class that the pointer points to available result
space (this is done reasonably efficiently).
Note: The allocation for the result container class is only done once
at the beginning of the program.
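In modern C++ terms, the scheme looks roughly like the sketch below (I'm actually using Qt's classes, so std::mutex and std::condition_variable here stand in for QMutex and QWaitCondition, and names like ResultPool are made up):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct Result { double value; };

// R result slots are preallocated once; producers take an empty slot,
// fill it, and mark it ready; the consumer tallies and recycles it.
// One mutex and two wait conditions, as in the description above.
class ResultPool {
public:
    explicit ResultPool(std::size_t r) : storage_(r) {
        for (auto& res : storage_) free_.push(&res);
    }
    Result* acquire() {                       // producer: get an empty slot
        std::unique_lock<std::mutex> lk(m_);
        have_free_.wait(lk, [this] { return !free_.empty(); });
        Result* r = free_.front(); free_.pop();
        return r;
    }
    void submit(Result* r) {                  // producer: slot is now filled
        std::lock_guard<std::mutex> lk(m_);
        filled_.push(r);
        have_filled_.notify_one();
    }
    Result* take_filled() {                   // consumer: wait for data
        std::unique_lock<std::mutex> lk(m_);
        have_filled_.wait(lk, [this] { return !filled_.empty(); });
        Result* r = filled_.front(); filled_.pop();
        return r;
    }
    void release(Result* r) {                 // consumer: slot is empty again
        std::lock_guard<std::mutex> lk(m_);
        free_.push(r);
        have_free_.notify_one();
    }
private:
    std::vector<Result> storage_;             // allocated once, up front
    std::queue<Result*> free_, filled_;
    std::mutex m_;
    std::condition_variable have_free_, have_filled_;
};
```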
I've verified that the results are right with single and multiple
threads. I've also verified with some friends who have done a small
amount of multithreading that I'm not doing anything obviously wrong
with my critical sections. (I am using Qt's QThread library, including
1 QMutex and 2 QWaitConditions.)
Questions:
- Am I doing something that is an obvious multithreading no-no on
multi-core machines?
- How should I go about solving this issue? I have it compiled on
Windows with MinGW32 and also on Linux with a fairly recent version
of gcc (whatever came with Fedora Core Zod).
- Does QThread have known problems with multi-core machines? (Nothing
is stated about this on Trolltech's forum.)
- Has anyone done a similar producer/consumer model with similar
simulation timings?
- Is there some possible cache thrashing problem going on?
I will send code to anyone who thinks it may be helpful, or post it to
the thread.
Any and all help is welcome and appreciated,
Chris
If you're cpu bound, you probably won't get any speed up having
more threads than processors.
The queue should be a straightforward blocking-queue implementation.
*Do not* queue a buffer unless it has been filled in. If you queue
preallocated buffers in place, you will end up waiting for the slowest
thread to finish a partially filled buffer even though other buffers
are already filled. It's functionally equivalent to locking the entire
queue while you're working on one buffer.
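A minimal sketch of such a blocking queue in standard C++ (names are illustrative; the key point is that a producer pushes a buffer only after it is completely filled):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

struct Result { double value; };

// Producers push only *completed* result buffers; the consumer blocks
// until one is available, so it never waits on a partially filled buffer.
class BlockingQueue {
public:
    void push(Result* r) {                  // producer: r is fully filled
        std::lock_guard<std::mutex> lk(m_);
        q_.push(r);
        cv_.notify_one();
    }
    Result* pop() {                         // consumer: blocks until non-empty
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Result* r = q_.front();
        q_.pop();
        return r;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Result*> q_;
};
```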
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
If more than one result object fits in each CPU cache line (64 bytes
on the CPUs you're using, IIRC), then implicit cache line sharing can
cause a huge slowdown if you have more than one CPU using that memory
at the same time. The problem is that under the covers, the CPUs use
a cache coherency protocol to control which CPU owns a cache line
during writes, and in the contested case, things really slow down.
One simple solution is to pad your result objects such that they're a
multiple of the cache line size. A different possibility that might
help some is to minimize the number of writes/reads to a result
object, perhaps by having the producer wait to write until the
computation is complete, then having the consumer copy the entire
result object in one operation.
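As a sketch of the padding approach (assuming a 64-byte line size, which you should verify for your CPUs; in modern C++, `alignas` rounds the struct's size up to a multiple of its alignment, so elements of an array land on distinct lines):

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size; check your CPU

struct Result {
    double sum;
    long   count;
};

// alignas forces 64-byte alignment, and sizeof is always a multiple of the
// alignment, so two PaddedResult objects never share a cache line.
struct alignas(kCacheLine) PaddedResult {
    Result r;
};

static_assert(sizeof(PaddedResult) % kCacheLine == 0,
              "result not padded to a cache-line multiple");
```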
Jason Evans
> On a duo-core (or my two duo-core machine at work), I get good
> performance with 1 thread, but the performance with 2 or more threads
> declines significantly, that is, it's about a 50% or more performance
> decrease!
Are these Pentium D machines? While those are "dual-core" in that there
are two complete processors in the package, they don't share any cache
levels inside the package and communicate strictly via the FSB. Core 2
Duos, or dual-core AMDs, communicate inside the package, which generally
makes things more efficient.
--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.
Also, I only queue buffers to be given to the consumer thread once
they are filled in by the appropriate producer. Therefore, this
waiting won't occur.
Chris
http://en.wikipedia.org/wiki/Xeon#5000-series_.22Dempsey.22
Thus, it is not a Pentium D.
According to Wikipedia, it has two caches (one per core) but I'm not
sure about the specifics of this or how I should program for it. I'm
not sure whether it communicates inside or outside the package as you
say. Either way, how would this affect how I should program for it?
Chris
http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30228078.aspx
http://softwarecommunity.intel.com/articles/eng/1050.htm
You must pad your data structures to the appropriate L2 cache-line size
AND align them on cache-line boundaries. Only then can you expect to
fully avoid false sharing.
If you want a working example of how to do this take a look at the following
library source code:
http://appcore.home.comcast.net/
Read the intro paragraph; it mentions that the data structures are
forcefully padded and aligned on cache lines...
This is extremely important if you're interested in any type of
performance wrt multiprocessor threading...
It's not such a matter of fundamental programming strategy as what you
can get away with and still get decent performance. From the look of
that chip, it has the same basic property for threading as a Pentium D,
in that moving a cache line between the two processors' cores requires
going via main memory.
What this means at a practical level is that you want the need for
synchronisation to occur at as long an interval as possible, because the
time for a critical section or mutex to make it from one processor's
cache to the other is hundreds of instructions' worth.
So at a practical level, I'd make the result buffers a load bigger.
A producer should get a pointer to space for maybe a hundred results,
generate them, and only then tell the consumer that it can take them.
You absolutely don't want to have a mutex claim/release on each result
generation when you have multiple processors, because the cache line
thrashing between different caches will take all your time. Doing it
this way with the same number of producer threads as you have cores
should hopefully work fairly well.
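A sketch of that batched handoff in standard C++ (the batch size of 100 and all names are illustrative, and the loop body stands in for the real ~35 ms simulation):

```cpp
#include <mutex>
#include <vector>

constexpr int kBatchSize = 100;   // assumed size; tune empirically

struct Result { double value; };

std::mutex queue_mutex;
std::vector<std::vector<Result>> filled_batches;  // consumed elsewhere

// Producer: fill a whole local batch without touching the shared mutex,
// then hand it over in one locked operation. The mutex (and its cache
// line) is contended once per 100 results instead of once per result.
void produce_batch() {
    std::vector<Result> batch;
    batch.reserve(kBatchSize);
    for (int i = 0; i < kBatchSize; ++i)
        batch.push_back(Result{double(i)});  // stand-in for the simulation
    std::lock_guard<std::mutex> lk(queue_mutex);
    filled_batches.push_back(std::move(batch));
}
```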
I'm not surprised that you don't see a problem with a single processor,
because then the mutex just stays in the one cache and is accessed very
fast.
So, in addition to possible cache-line thrashing problems with my
result structure (which I'm working on fixing), there may even be
cache-line sharing with my mutex itself?
Chris
> So, in addition to possible cache-line thrashing problems with my
> result structure (which I'm working on fixing), there may even be
> cache-line sharing with my mutex itself?
Absolutely. It is, after all, just a piece of ordinary memory. The only
special thing about it is the instructions that are used to access it,
when it's accessed as a mutex.
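For illustration, a sketch (in modern C++; the 64-byte line size is an assumption) of keeping a mutex on its own cache line, away from the data it guards:

```cpp
#include <cstddef>
#include <mutex>

constexpr std::size_t kCacheLine = 64;  // assumed; verify for your CPU

// The mutex gets its own cache line, and the consumer's tally gets a
// separate one, so contended lock traffic doesn't also drag the data
// line back and forth between the two caches.
struct SharedState {
    alignas(kCacheLine) std::mutex m;
    alignas(kCacheLine) long tally = 0;
};

static_assert(sizeof(SharedState) % kCacheLine == 0,
              "struct not padded to a cache-line multiple");
```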
Yes, a memory-stamping bug can trash mutex values. This would be
horribly difficult to debug, as far as I can see - I'm fortunate enough
to have never needed to do it.