So I have a very basic producer/consumer model application that
"works" in that the results are correct and there's no hanging, but
I'm not getting the performance that I want on multi-core processors.
Of course, multi-core processors were the main reason I bothered to
implement multithreading at all. I will explain it in general and
then explain it in more detail with possible code to follow if there
are responses.
Problem:
I've essentially implemented the producer/consumer model and in my
application, I can control the number of threads. On a single
processor (all Intel: Pentium 4s or Centrinos), I get about a 20%
speedup (compared to 1 thread) when I increase to 3 threads. Beyond
3 threads the performance declines, but gracefully.
On a dual-core machine (or my two dual-core machines at work), I get
good performance with 1 thread, but the performance with 2 or more
threads declines significantly; that is, it's about a 50% or more
performance decrease!
Implementation sketch:
I have P producer threads and 1 consumer thread that tallies the
results. There is no data sharing for each of the producer threads,
i.e., they each work on instances of the "simulation" class and they
each have separate random number generators (each starting on
different seeds). The way I'm allocating memory for the results
seems worth mentioning. Essentially, I have a result container class
that preallocates R instances of the result class. I typically make
R range from 2*P to 5*P but get the same results across this range. More
than 5*P seems pointless since the consumer is so much faster than the
producer (about 100x or more).
Before a producer thread begins a new simulation, it gets a pointer to
one of the preallocated instances of the result class. It then
simulates (which takes about 35 milliseconds), fills the result class and
tells the result container class that this pointer points to filled
result data and needs to be tallied by the consumer thread. Rinse and
repeat for the producer thread.
The consumer thread only takes filled results, collects the data, and
tallies the results (overall maybe about 100 nanoseconds). It then
tells the container class that the pointer points to available result
space (this is done reasonably efficiently).
Note: The allocation for the result container class is only done once
at the beginning of the program.
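In modern C++ terms, the scheme looks roughly like the sketch below (I'm actually using Qt's classes, so std::mutex and std::condition_variable here stand in for QMutex and QWaitCondition, and names like ResultPool are made up):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct Result { double value; };

// R result slots are preallocated once; producers take an empty slot,
// fill it, and mark it ready; the consumer tallies and recycles it.
// One mutex and two wait conditions, as in the description above.
class ResultPool {
public:
    explicit ResultPool(std::size_t r) : storage_(r) {
        for (auto& res : storage_) free_.push(&res);
    }
    Result* acquire() {                       // producer: get an empty slot
        std::unique_lock<std::mutex> lk(m_);
        have_free_.wait(lk, [this] { return !free_.empty(); });
        Result* r = free_.front(); free_.pop();
        return r;
    }
    void submit(Result* r) {                  // producer: slot is now filled
        std::lock_guard<std::mutex> lk(m_);
        filled_.push(r);
        have_filled_.notify_one();
    }
    Result* take_filled() {                   // consumer: wait for data
        std::unique_lock<std::mutex> lk(m_);
        have_filled_.wait(lk, [this] { return !filled_.empty(); });
        Result* r = filled_.front(); filled_.pop();
        return r;
    }
    void release(Result* r) {                 // consumer: slot is empty again
        std::lock_guard<std::mutex> lk(m_);
        free_.push(r);
        have_free_.notify_one();
    }
private:
    std::vector<Result> storage_;             // allocated once, up front
    std::queue<Result*> free_, filled_;
    std::mutex m_;
    std::condition_variable have_free_, have_filled_;
};
```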
I've verified that the results are right with single and multiple
threads. I've also verified with some friends who have done a small
amount of multithreading that I'm not doing anything obviously wrong
with my critical sections. (I am using Qt's QThread library, including
1 QMutex and 2 QWaitConditions.)
Questions:
- Am I doing something that is an obvious multithreading no-no on
multi-core machines?
- How should I go about solving this issue? I have it compiled on
Windows with MinGW32 and also on Linux with a fairly recent version
of gcc (whatever came with Fedora Core Zod).
- Does QThread have known problems with multi-core machines? (Nothing
is stated about this on Trolltech's forum.)
- Has anyone done a similar producer/consumer model with similar
simulation timings?
- Is there some possible cache thrashing problem going on?
I will send code to anyone who thinks it may be helpful, or post it to
the thread.
Any and all help is welcome and appreciated,
Chris
If you're cpu bound, you probably won't get any speed up having
more threads than processors.
The queue should be a straightforward blocking-queue implementation.
*Do not* queue a buffer unless it has been filled in. If you queue
preallocated buffers in place, you will end up waiting for the slowest
thread to finish a partially filled buffer even though other buffers
are already filled. It's functionally equivalent to locking the entire
queue while you're working on one buffer.
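A minimal sketch of such a blocking queue in standard C++ (names are illustrative; the key point is that a producer pushes a buffer only after it is completely filled):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

struct Result { double value; };

// Producers push only *completed* result buffers; the consumer blocks
// until one is available, so it never waits on a partially filled buffer.
class BlockingQueue {
public:
    void push(Result* r) {                  // producer: r is fully filled
        std::lock_guard<std::mutex> lk(m_);
        q_.push(r);
        cv_.notify_one();
    }
    Result* pop() {                         // consumer: blocks until non-empty
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Result* r = q_.front();
        q_.pop();
        return r;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Result*> q_;
};
```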
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
If more than one result object fits in each CPU cache line (64 bytes
on the CPUs you're using, IIRC), then implicit cache line sharing can
cause a huge slowdown if you have more than one CPU using that memory
at the same time. The problem is that under the covers, the CPUs use
a cache coherency protocol to control which CPU owns a cache line
during writes, and in the contested case, things really slow down.
One simple solution is to pad your result objects such that they're a
multiple of the cache line size. A different possibility that might
help some is to minimize the number of writes/reads to a result
object, perhaps by having the producer wait to write until the
computation is complete, then having the consumer copy the entire
result object in one operation.
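As a sketch of the padding approach (assuming a 64-byte line size, which you should verify for your CPUs; in modern C++, `alignas` rounds the struct's size up to a multiple of its alignment, so elements of an array land on distinct lines):

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size; check your CPU

struct Result {
    double sum;
    long   count;
};

// alignas forces 64-byte alignment, and sizeof is always a multiple of the
// alignment, so two PaddedResult objects never share a cache line.
struct alignas(kCacheLine) PaddedResult {
    Result r;
};

static_assert(sizeof(PaddedResult) % kCacheLine == 0,
              "result not padded to a cache-line multiple");
```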
Jason Evans
> On a duo-core (or my two duo-core machine at work), I get good
> performance with 1 thread, but the performance with 2 or more threads
> declines significantly, that is, it's about a 50% or more performance
> decrease!
Are these Pentium D machines? While those are "dual-core" in that there
are two complete processors in the package, they don't share any cache
levels inside the package and communicate strictly via the FSB. Core 2
Duos, or dual-core AMDs, communicate inside the package, which generally
makes things more efficient.
--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.
Also, I only queue buffers to be given to the consumer thread once
they are filled in by the appropriate producer. Therefore, this
waiting won't occur.
Chris
http://en.wikipedia.org/wiki/Xeon#5000-series_.22Dempsey.22
Thus, it is not a Pentium D.
According to Wikipedia, it has two caches (one per core) but I'm not
sure about the specifics of this or how I should program for it. I'm
not sure whether it communicates inside or outside the package as you
say. Either way, how would this affect how I should program for it?
Chris
http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30228078.aspx
http://softwarecommunity.intel.com/articles/eng/1050.htm
You must pad your data structures to the appropriate L2 cache-line size
AND align them on cache-line boundaries. Only then can you expect to
fully avoid false sharing.
If you want a working example of how to do this take a look at the following
library source code:
http://appcore.home.comcast.net/
Read the intro paragraph; it mentions that the data structures are
forcefully padded and aligned on cache lines...
This is extremely important if you're interested in any type of
performance wrt multiprocessor threading...
It's not such a matter of fundamental programming strategy as what you
can get away with and still get decent performance. From the look of
that chip, it has the same basic property for threading as a Pentium D,
in that moving a cache line between the two processors' cores requires
going via main memory.
What this means at a practical level is that you want the need for
synchronisation to occur at as long an interval as possible, because the
time for a critical section or mutex to make it from one processor's
cache to the other is hundreds of instructions' worth.
So at a practical level, I'd make the result buffers a load bigger.
A producer should get a pointer to space for maybe a hundred results,
generate them, and only then tell the consumer that it can take them.
You absolutely don't want to have a mutex claim/release on each result
generation when you have multiple processors, because the cache line
thrashing between different caches will take all your time. Doing it
this way with the same number of producer threads as you have cores
should hopefully work fairly well.
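A sketch of that batched handoff in standard C++ (the batch size of 100 and all names are illustrative, and the loop body stands in for the real ~35 ms simulation):

```cpp
#include <mutex>
#include <vector>

constexpr int kBatchSize = 100;   // assumed size; tune empirically

struct Result { double value; };

std::mutex queue_mutex;
std::vector<std::vector<Result>> filled_batches;  // consumed elsewhere

// Producer: fill a whole local batch without touching the shared mutex,
// then hand it over in one locked operation. The mutex (and its cache
// line) is contended once per 100 results instead of once per result.
void produce_batch() {
    std::vector<Result> batch;
    batch.reserve(kBatchSize);
    for (int i = 0; i < kBatchSize; ++i)
        batch.push_back(Result{double(i)});  // stand-in for the simulation
    std::lock_guard<std::mutex> lk(queue_mutex);
    filled_batches.push_back(std::move(batch));
}
```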
I'm not surprised that you don't see a problem with a single processor,
because then the mutex just stays in the one cache and is accessed very
fast.
So, in addition to possible cache-line thrashing problems with my
result structure (which I'm working on fixing), there may even be
cache-line sharing with my mutex itself?
Chris
> So, in addition to possible cache-line thrashing problems with my
> result structure (which I'm working on fixing), there may even be
> cache-line sharing with my mutex itself?
Absolutely. It is, after all, just a piece of ordinary memory. The only
special thing about it is the instructions that are used to access it,
when it's accessed as a mutex.
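For illustration, a sketch (in modern C++; the 64-byte line size is an assumption) of keeping a mutex on its own cache line, away from the data it guards:

```cpp
#include <cstddef>
#include <mutex>

constexpr std::size_t kCacheLine = 64;  // assumed; verify for your CPU

// The mutex gets its own cache line, and the consumer's tally gets a
// separate one, so contended lock traffic doesn't also drag the data
// line back and forth between the two caches.
struct SharedState {
    alignas(kCacheLine) std::mutex m;
    alignas(kCacheLine) long tally = 0;
};

static_assert(sizeof(SharedState) % kCacheLine == 0,
              "struct not padded to a cache-line multiple");
```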
Yes, a memory-stamping bug can trash mutex values. This would be
horribly difficult to debug, as far as I can see - I'm fortunate enough
to have never needed to do it.