Thanks
hmovva
You might want to provide some more information on what you're trying to
accomplish, current program design, what your throughput metrics are, etc.
Chris
Chris
The job is CPU bound and there is no I/O or any other activity. The work is
over a huge array, which I split in two; I set up parameters for a second
thread and created it. After creating the thread, I assigned it (within the
thread code) to the other CPU using CPU_SET and sched_setaffinity. Without
threading I get around 560 operations per second, and with threading about
the same, if not slightly less. I monitored CPU activity using the sysguard
utility, which shows the activity on each processor. It looks like something
is blocking the thread from achieving parallelism. As the original code is
too complex, I will work out an equivalent example and post it a little
later. The total time consumed is entirely due to the said loop, hence the
attempt to speed up the processing using both cores. I did not use any
locks, as each thread operates on an exclusive data area. I would appreciate
any ideas to overcome this.
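To give a rough idea while I prepare the real example, here is a minimal
sketch of the setup (the array size, work loop and names are placeholders,
not my actual code):
___________________
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define N 1000000
static double data[N];               /* stand-in for the real huge array */

struct half { double *base; long len; int cpu; };

static void process(double *p, long n)
{
    (void)p; (void)n;                /* the real CPU-bound loop goes here */
}

static void *worker(void *arg)
{
    struct half *h = arg;
    cpu_set_t set;

    /* pin this thread to its own core (Linux-specific calls) */
    CPU_ZERO(&set);
    CPU_SET(h->cpu, &set);
    sched_setaffinity(0, sizeof set, &set);

    process(h->base, h->len);
    return 0;
}

int main(void)
{
    pthread_t tid;
    struct half a = { data,         N / 2, 0 };  /* first half  -> CPU 0 */
    struct half b = { data + N / 2, N / 2, 1 };  /* second half -> CPU 1 */

    pthread_create(&tid, 0, worker, &b);
    worker(&a);                      /* main thread processes its own half */
    pthread_join(tid, 0);
    return 0;
}
___________________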
Thanks
hmovva
Here are some very brief comments and advice:
First of all, you need to try "real hard" to remove the "possibility" of
so-called "false sharing" occurring throughout your application's lifetime.
This can be a "fairly" tedious process, because you have to accomplish two
tasks for basically every point at which a thread could access shared data.
Also, the tasks are unfortunately not all that "portable". First, you need
to find out the size of the L2 cache line for the arch(s) you're targeting.
Then, you should make sure that all of your data that could be concurrently
accessed by multiple threads is padded up to at least a multiple of that
size. After you do that, you should make sure that the data is aligned on an
L2 cache-line boundary. Accomplishing those two steps:
___________________
1. Pad your structs up to L2 cache-line size
2. Align your structs on L2 cache-line boundaries
___________________
for your entire application infrastructure will most likely end up
drastically reducing the chance of false sharing:
http://groups.google.com/group/comp.programming.threads/msg/323db40e9fd4d704
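For instance, something along these lines (a sketch only; the 64-byte figure
and the GCC attribute are assumptions about your platform):
___________________
/* Assumed L2 cache-line size; 64 bytes is common, but verify it
   for the archs you are targeting. */
#define L2_LINE 64

/* Per-thread state padded out to a full cache line and aligned on
   a line boundary, so two threads' slots never share a line. */
struct thread_state {
    unsigned long counter;
    char pad[L2_LINE - sizeof(unsigned long)];
} __attribute__((aligned(L2_LINE)));     /* GCC-specific attribute */

static struct thread_state state[2];     /* one slot per thread */
___________________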
Padding and alignment along those lines help ensure that shared data is
indeed segregated from other "unrelated" shared data. Please keep in mind
that false sharing can have a "marked negative effect" on multiple
threads... For instance, if critical-section A shares a cache line with
something over in another, unrelated critical-section B, then both A and B
can indeed experience performance degradation from the false-sharing
phenomenon. Another quick example scenario: you don't want the stores to a
mutex A to affect the shared data contained in any critical section it may
protect, and vice versa. In other words, you don't want modifications to
mutex A to "interfere" with a thread that's reading a critical section
guarded by mutex A. Given that, if you reduce false sharing, you're usually
going to see some marked improvements.
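A sketch of that last point, keeping the mutex on its own cache line (same
64-byte / GCC-attribute assumptions as above):
___________________
#include <pthread.h>

#define L2_LINE 64

/* The lock and the data it guards each start on their own cache
   line, so lock/unlock stores never invalidate the data's line. */
struct guarded {
    pthread_mutex_t lock __attribute__((aligned(L2_LINE)));
    long            data __attribute__((aligned(L2_LINE)));
};
___________________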
There are some tools, like the Intel Thread Checker, which are designed to
help you identify some of the points where false sharing can occur. A
mixture of manual and automatic false-sharing detection can be useful
indeed. Personally, I do manual detection first, because I already know all
of the places where threads may be accessing shared data. If you find
yourself having to rely on automatic detection, then you should get better
acquainted with your application's architecture wrt threads accessing shared
data...
Now, if after you do all of the above you're __still__ experiencing marked
"performance" issues, the next course of action IMHO is to go ahead and
identify "any possible synchronization bottlenecks"... Check your code to
make sure that all of its critical sections are small, quick and to the
point. You don't want large critical sections... Another situation that can
cause a sync bottleneck is using a "sub-optimal" sync primitive to protect a
"critical, performance-sensitive" data structure that is "frequently"
accessed by multiple threads. For example, using a mutex to guard a
structure that sees very frequent read access and only moderate-to-low write
access from time to time. This type of access pattern is sometimes referred
to as "read-mostly". In that type of scenario, your application could gain
some performance and scalability if you substituted a rw-mutex for the mutex
and then divided the access to the data structure into specific
reader/writer critical regions.
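A minimal sketch of that substitution using a POSIX rw-mutex (the table and
function names are made up for illustration):
___________________
#include <pthread.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
static int lookup_table[1024];

int read_entry(int i)                /* frequent: readers run in parallel */
{
    int v;
    pthread_rwlock_rdlock(&table_lock);
    v = lookup_table[i];
    pthread_rwlock_unlock(&table_lock);
    return v;
}

void write_entry(int i, int v)       /* rare: writer gets exclusive access */
{
    pthread_rwlock_wrlock(&table_lock);
    lookup_table[i] = v;
    pthread_rwlock_unlock(&table_lock);
}
___________________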
Also, think of new design patterns that may allow you to skip the use of a
sync primitive altogether. For instance, if you can design your algorithm to
make use of per-thread data structures instead of global shared ones, then
you're "usually" going to end up reaping some sort of performance
benefit(s). The design of your overall synchronization scheme can greatly
affect the scalability, throughput and overall performance of your
application as a whole...
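For instance, per-thread partial results need no locking at all; a sketch
(names invented, padding as discussed above):
___________________
#define L2_LINE 64

/* One padded slot per thread; each thread touches only its own
   slot, so there is no lock and no (false) sharing. */
struct partial {
    double sum;
    char pad[L2_LINE - sizeof(double)];
} __attribute__((aligned(L2_LINE)));

static struct partial part[2];

static void *sum_half(void *arg)     /* pthread_create start routine */
{
    long id = (long)arg;
    /* ... loop over this thread's half of the array ... */
    part[id].sum += 1.0;             /* stands in for the real work */
    return 0;
}

/* after pthread_join: total = part[0].sum + part[1].sum */
___________________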
________________
After you do all of that, and you're still not satisfied with the way things
are running, you could start to explore some of the more "advanced"
synchronization techniques:
http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7
http://groups.google.de/group/comp.programming.threads/browse_frm/thread/d062e1bfa460a375
Was that helpful to you at all? Any thoughts?
:^)
Chris
> The job is cpu bound and there is no io or any other activity. [...]
Here is one that fits into the 'other means' category, humm, although it
does heavily rely on POSIX Threads...:
http://appcore.home.comcast.net/
Everything is sort-of designed around eliminating false-sharing...
> The throughput does not increase even though I managed
> to run the threads on both processors using scheduling functions.
Don't do that. Most likely you're just reducing the scheduler's
choices.
DS