Question about pmap


Johann Kraus

Aug 3, 2009, 5:46:57 AM
to Clojure
Hi all,

recently I ran some micro-benchmarks of parallel code on my 8-core
computer, but I don't understand this behaviour of pmap. Can
anyone explain it to me? The code is running on a dual quad-core
Intel machine (Xeon X5482, 3.20 GHz).

(defn maptest [cores] (doall (map (fn [x] (dotimes [_ 1000000000] (inc 0))) (range len))))
(defn pmaptest [cores] (doall (pmap (fn [x] (dotimes [_ 1000000000] (inc 0))) (range len))))
leads to:
(time (maptest 8)) : 20804.182202 msecs
(time (pmaptest 8)) : 2567.521341 msecs
i.e. a speedup of ~8.

Doing this with doubles:

(defn maptest [cores] (doall (map (fn [x] (dotimes [_ 1000000000] (inc 0.1))) (range len))))
(defn pmaptest [cores] (doall (pmap (fn [x] (dotimes [_ 1000000000] (inc 0.1))) (range len))))
leads to:
(time (maptest 8)) : 68044.060324 msecs
(time (pmaptest 8)) : 35051.174503 msecs
i.e. a speedup of ~2.

However, the CPU usage indicated by "top" is ~690%. What does the CPU
do?

Regards,

Johann

Johann Kraus

Aug 3, 2009, 6:08:38 AM
to Clojure
Sorry about the copy-and-paste error. I only partially renamed len to cores.
The code should read:

(defn maptest [cores] (doall (map (fn [x] (dotimes [_ 1000000000] (inc 0))) (range cores))))
(defn pmaptest [cores] (doall (pmap (fn [x] (dotimes [_ 1000000000] (inc 0))) (range cores))))

and

(defn maptest [cores] (doall (map (fn [x] (dotimes [_ 1000000000] (inc 0.1))) (range cores))))
(defn pmaptest [cores] (doall (pmap (fn [x] (dotimes [_ 1000000000] (inc 0.1))) (range cores))))

tmountain

Aug 3, 2009, 2:47:16 PM
to Clojure
> However, the CPU usage indicated by "top" is ~690%. What does the CPU do?

100% per core. So with dual quad-core processors, it'd mean roughly 7
cores were being pegged.

Sudish Joseph

Aug 3, 2009, 4:14:47 PM
to clo...@googlegroups.com
Johann Kraus <johann...@gmail.com> writes:
> Doing this with doubles:

> leads to:
> (time (maptest 8)) : 68044.060324 msecs
> (time (pmaptest 8)) : 35051.174503 msecs
> i.e. a speedup of ~2.
>
> However, the CPU usage indicated by "top" is ~690%. What does the CPU
> do?

My guess would be you're seeing the overhead for pmap since the
(inc 0.1) computation is so cheap. From the docs for pmap:
"Only useful for computationally intensive functions where the time of
f dominates the coordination overhead."

Maybe try this with something more expensive to compute? Also, making
the result a function of the parameter can't hurt since there's a chance
the JVM may optimize away the inc of a constant as it stands.

On my 2.4 GHz Core 2 Duo with JVM 1.6.0_13, here's what your pmaptest
with doubles looked like:

user=> (time (maptest 1))
"Elapsed time: 11979.023 msecs"
user=> (time (maptest 2))
"Elapsed time: 23930.877 msecs"
(nil nil)
user=> (time (pmaptest 1))
"Elapsed time: 11912.518 msecs"
(nil)
user=> (time (pmaptest 2))
"Elapsed time: 23917.375 msecs"

I.e., no speedup at all! The communication and coordination overhead
easily dwarfs the actual computation.


After replacing (inc 0.1) with (+ x 0.1 x) and reducing the number of
iterations by a factor of 10:

user=> (defn pmaptest [cores] (doall (pmap (fn [x] (dotimes [_ 100000000] (+ x 0.1 x))) (range cores))))

user=> (time (maptest 1))
"Elapsed time: 16173.675 msecs"
(nil)
user=> (time (maptest 2))
"Elapsed time: 32351.821 msecs"
user=> (time (pmaptest 1))
"Elapsed time: 16399.574 msecs"
(nil)
user=> (time (pmaptest 2))
"Elapsed time: 17574.663 msecs"

Now there's a not quite linear speedup.

This ratio should get closer to linear as the overhead for f gets higher.
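The coordination cost discussed above is not specific to Clojure's pmap; any thread-pool map pays a submit/get round trip per task, which only amortizes when f is expensive. A minimal Java analogue as an illustrative sketch (the class and method names here are mine, not anything from the thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelMap {
    // Applies f to each element on a fixed thread pool. The submit/get
    // round trip is the per-task coordination overhead: it only pays off
    // when f itself dominates it, just as the pmap docstring says.
    static <A, B> List<B> pmap(ExecutorService pool, Function<A, B> f, List<A> xs) {
        List<Future<B>> futures = new ArrayList<>();
        for (A x : xs)
            futures.add(pool.submit(() -> f.apply(x)));
        List<B> results = new ArrayList<>();
        for (Future<B> fut : futures) {
            try {
                results.add(fut.get());
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        System.out.println(pmap(pool, x -> x * x, Arrays.asList(1, 2, 3, 4))); // prints [1, 4, 9, 16]
        pool.shutdown();
    }
}
```

With a trivial f like x * x the pool version is slower than a plain loop, for exactly the reason above.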

--
Sudish Joseph

Johann Kraus

Aug 4, 2009, 4:59:25 AM
to Clojure
> My guess would be you're seeing the overhead for pmap since the
> (inc 0.1) computation is so cheap.  From the docs for pmap:
>   "Only useful for computationally intensive functions where the time of
>   f dominates the coordination overhead."

I don't think so, as the cheap computation (inc 0.1) is wrapped in
the long-running (dotimes). And it does work with (inc 0), where
I get a perfectly linear speedup. Merely switching from (inc 0) to
(inc 0.1) makes the speedup drop sharply, while the CPU load stays
close to the limit.

> Maybe try this with something more expensive to compute?  Also, making
> the result a function of the parameter can't hurt since there's a chance
> the JVM may optimize away the inc of a constant as it stands.

I would also expect the JVM to optimize something away, but it seems
it does not, because again with (inc 0) everything works
perfectly.

The point is:
Why do I get a linear speedup with (inc 0) -> 8 times and not with
(inc 0.1) -> 2 times, while the CPU load is 800% and 700%,
respectively?
__
Johann

Andy Fingerhut

Aug 4, 2009, 4:35:27 PM
to Clojure
Johann:

Could it be that your CPU has a single floating-point unit shared by 4
cores on a single die, and thus only 2 floating-point units total for
all 8 of your cores? If so, then that fact, plus the fact that each
core has its own separate ALU for integer operations, would seem to
explain the results you are seeing. I see similar results on my
2-core Intel Core 2 Duo when running your test.

I tried to check out that possibility by doing some searches on the
Internet about the Xeon, but couldn't quickly find a definitive
answer. I'm hoping someone else on the group might know the answer
off the top of their head.

Andy

Johann Kraus

Aug 5, 2009, 8:29:41 AM
to Clojure
> Could it be that your CPU has a single floating-point unit shared by 4
> cores on a single die, and thus only 2 floating-point units total for
> all 8 of your cores? If so, then that fact, plus the fact that each
> core has its own separate ALU for integer operations, would seem to
> explain the results you are seeing.

Exactly, this would explain the behaviour. But unfortunately it is not
the case. I implemented a small example using Java (Java threads) and
C (pthreads), and both times I get a linear speedup. See the attached
code below. The cores only share 12 MB of cache, but this should be
enough memory for my micro-benchmark. Given the linear speedup in
Java and C, I would rule out a hardware limitation.

_
Johann

### C ###

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NUM_THREADS 8

int inc(int);
double inc_d(int);

int inc(int x) {
    int y;
    y = x + 1;
    return y;
}

double inc_d(int x) {
    double y;
    y = (double)x + 1.0;
    return y;
}

void *BusyWork(void *t) {
    int i;
    long tid;
    int result = 0;
    tid = (long)t;
    printf("Thread %ld starting...\n", tid);
    for (i = 0; i < 1000000000; i++) {
        /* result = result + sin(i) * tan(i); */
        result = inc(i);
    }
    printf("Thread %ld done. Result = %i\n", tid, result);
    pthread_exit((void *)t);
}

void *BusyWork_d(void *t) {
    int i;
    long tid;
    double result = 0.0;
    tid = (long)t;
    printf("Thread %ld starting...\n", tid);
    for (i = 0; i < 1000000000; i++) {
        /* result = result + sin(i) * tan(i); */
        result = inc_d(i);
    }
    printf("Thread %ld done. Result = %e\n", tid, result);
    pthread_exit((void *)t);
}

void *BusyWork_single(void) {
    int i;
    double result = 0.0;
    for (i = 0; i < 1000000000; i++) {
        /* result = result + sin(i) * tan(i); */
        result = inc_d(i);
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    time_t start, end;
    double dif;

    pthread_t thread[NUM_THREADS];
    pthread_attr_t attr;
    int rc;
    long t;
    void *status;

    start = time(NULL);

    /* Running serial code */
    for (t = 0; t < NUM_THREADS; t++) {
        printf("Running serial code #: %ld\n", t);
        BusyWork_single();
    }

    end = time(NULL);
    dif = difftime(end, start);
    printf("Runtime for serial code: %f\n", dif);

    start = time(NULL);

    /* Initialize and set thread detached attribute */
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (t = 0; t < NUM_THREADS; t++) {
        printf("Main: creating thread %ld\n", t);
        /* Let's rock */
        /* rc = pthread_create(&thread[t], &attr, BusyWork, (void *)t); */
        rc = pthread_create(&thread[t], &attr, BusyWork_d, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    /* Free attribute and wait for the other threads */
    pthread_attr_destroy(&attr);
    for (t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(thread[t], &status);
        if (rc) {
            printf("ERROR; return code from pthread_join() is %d\n", rc);
            exit(-1);
        }
        printf("Main: completed join with thread %ld having a status of %ld\n",
               t, (long)status);
    }

    end = time(NULL);
    dif = difftime(end, start);
    printf("Runtime for parallel code: %f\n", dif);

    printf("Main: program completed. Exiting.\n");
    pthread_exit(NULL);
}

### Java ###

import java.text.DecimalFormat;

public class MapTest {

    public static class IntTest implements Runnable {
        long loops;
        long result;

        public IntTest(long loops) {
            this.loops = loops;
        }

        // stupid work
        public void run() {
            result = 0;
            for (long i = 0L; i < loops; i++) {
                result = result + 1;
            }
            System.out.println(result);
        }
    }

    public static class DoubleTest implements Runnable {
        long loops;
        double result;

        public DoubleTest(long loops) {
            this.loops = loops;
        }

        // stupid work
        public void run() {
            result = 0.0;
            for (long i = 0L; i < loops; i++) {
                result = result + 1.0;
            }
            System.out.println(result);
        }
    }

    public static void main(String[] args) {
        try {
            long loops = 10000000000L;
            // number of threads
            int tcount = 8;
            System.out.println("Number of Runs: " + tcount);

            // IntTest sequential in one Block ;)
            {
                Thread[] tarray = new Thread[tcount];
                for (int i = 0; i < tcount; i++)
                    tarray[i] = new Thread(new IntTest(loops));

                long startTime = System.nanoTime();
                long stopTime = 0;
                long runTime = 0;
                for (int i = 0; i < tcount; i++) {
                    tarray[i].start();
                    tarray[i].join();
                }

                stopTime = System.nanoTime();
                runTime = stopTime - startTime;
                System.out.println();
                System.out.println("Int ALL RUNS FINISHED for JOB.");
                System.out.println("Int OVERALL Time: "
                        + new DecimalFormat("0.0000").format((double) runTime / 1000000) + " ms");
                System.out.println();
            }
            System.out.println("Number of Runs: " + tcount);

            // DoubleTest sequential in one Block ;)
            {
                Thread[] tarray = new Thread[tcount];
                for (int i = 0; i < tcount; i++)
                    tarray[i] = new Thread(new DoubleTest(loops));

                long startTime = System.nanoTime();
                long stopTime = 0;
                long runTime = 0;
                for (int i = 0; i < tcount; i++) {
                    tarray[i].start();
                    tarray[i].join();
                }

                stopTime = System.nanoTime();
                runTime = stopTime - startTime;
                System.out.println();
                System.out.println("Double ALL RUNS FINISHED for JOB.");
                System.out.println("Double OVERALL Time: "
                        + new DecimalFormat("0.0000").format((double) runTime / 1000000) + " ms");
                System.out.println();
            }

            System.out.println("Number of Threads: " + tcount);

            // IntTest parallel in one Block ;)
            {
                Thread[] tarray = new Thread[tcount];
                for (int i = 0; i < tcount; i++)
                    tarray[i] = new Thread(new IntTest(loops));

                long startTime = System.nanoTime();
                long stopTime = 0;
                long runTime = 0;
                for (int i = 0; i < tcount; i++)
                    tarray[i].start();
                for (int i = 0; i < tcount; i++)
                    tarray[i].join();

                stopTime = System.nanoTime();
                runTime = stopTime - startTime;
                System.out.println();
                System.out.println("Int ALL THREADS FINISHED for JOB.");
                System.out.println("Int OVERALL Time: "
                        + new DecimalFormat("0.0000").format((double) runTime / 1000000) + " ms");
                System.out.println();
            }

            // DoubleTest parallel in one Block ;)
            {
                Thread[] tarray = new Thread[tcount];
                for (int i = 0; i < tcount; i++)
                    tarray[i] = new Thread(new DoubleTest(loops));

                long startTime = System.nanoTime();
                long stopTime = 0;
                long runTime = 0;
                for (int i = 0; i < tcount; i++)
                    tarray[i].start();
                for (int i = 0; i < tcount; i++)
                    tarray[i].join();
                stopTime = System.nanoTime();
                runTime = stopTime - startTime;
                System.out.println();
                System.out.println("Double ALL THREADS FINISHED for JOB.");
                System.out.println("Double OVERALL Time: "
                        + new DecimalFormat("0.0000").format((double) runTime / 1000000) + " ms");
                System.out.println();
            }
        } catch (Exception e) {
        }
    }
}

Rich Hickey

Aug 5, 2009, 9:09:42 AM
to clo...@googlegroups.com
On Wed, Aug 5, 2009 at 8:29 AM, Johann Kraus<johann...@gmail.com> wrote:
>
>> Could it be that your CPU has a single floating-point unit shared by 4
>> cores on a single die, and thus only 2 floating-point units total for
>> all 8 of your cores?  If so, then that fact, plus the fact that each
>> core has its own separate ALU for integer operations, would seem to
>> explain the results you are seeing.
>
> Exactly, this would explain the behaviour. But unfortunately it is not
> the case. I implemented a small example using Java (Java Threads) and
> C (PThreads) and both times I get a linear speedup. See the attached
> code below. The cores only share 12 MB cache, but this should be
> enough memory for my micro-benchmark. Seeing the linear speedup in
> Java and C, I would negate a hardware limitation.
>
> _
> Johann
>
> ### C ###
>

That's a lot of code for a message, could you please use a pastebin
next time? Thanks.

I looked briefly at your problem and don't see anything right off the
bat. Do you have a profiler and could you try that out? I'm
interested.

Thanks,

Rich

Andy Fingerhut

Aug 6, 2009, 6:07:43 AM
to Clojure
On Aug 5, 6:09 am, Rich Hickey <richhic...@gmail.com> wrote:
> On Wed, Aug 5, 2009 at 8:29 AM, Johann Kraus<johann.kr...@gmail.com> wrote:
>
> >> Could it be that your CPU has a single floating-point unit shared by 4
> >> cores on a single die, and thus only 2 floating-point units total for
> >> all 8 of your cores?  If so, then that fact, plus the fact that each
> >> core has its own separate ALU for integer operations, would seem to
> >> explain the results you are seeing.
>
> > Exactly, this would explain the behaviour. But unfortunately it is not
> > the case. I implemented a small example using Java (Java Threads) and
> > C (PThreads) and both times I get a linear speedup. See the attached
> > code below. The cores only share 12 MB cache, but this should be
> > enough memory for my micro-benchmark. Seeing the linear speedup in
> > Java and C, I would negate a hardware limitation.
>
> > _
> > Johann
>
>
> I looked briefly at your problem and don't see anything right off the
> bat. Do you have a profiler and could you try that out? I'm
> interested.
> Rich

I ran these tests on my iMac with 2.16 GHz Intel Core 2 Duo (2 cores)
using latest Clojure and clojure-contrib from git as of some time on
Aug 4, 2009. The Java implementation is from Apple, version 1.6.0_13.

----------------------------------------------------------------------
For int, there are 64 "jobs" run, each of which consists of doing
(inc 0) 1,000,000,000 times. See pmap-batch.sh and pmap-testing.clj
for details.

http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964ba4c2d55d0e487b7618efdc8b/misc/pmap-batch.sh

http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964ba4c2d55d0e487b7618efdc8b/misc/pmap-testing.clj

Yes, yes, I know. I should really use a library for command line
argument parsing to avoid so much repetitive code. I may do that some
day.


Results for int 1 thread - jobs run sequentially

"Elapsed time: 267547.789 msecs"
real 269.22
user 268.61
sys 1.79

int 2 threads - jobs run in 2 threads using modified-pmap, which
limits the number of futures running jobs to at most 2 at a time.

"Elapsed time: 177428.626 msecs"
real 179.14
user 330.30
sys 15.46

Comment: Elapsed time with 2 threads is about 2/3 of the elapsed time
with 1 thread. Not as good as the 1/2 we'd like on a 2-core machine,
but better than no speedup at all.

----------------------------------------------------------------------
For double, there are 16 "jobs" run, each of which consists of doing
(inc 0.1) 1,000,000,000 times.

double 1 thread

"Elapsed time: 258659.424 msecs"
real 263.28
user 247.29
sys 12.17

double 2 threads

"Elapsed time: 229382.68 msecs"
Dumping CPU usage by sampling running threads ... done.
real 231.05
user 380.79
sys 11.49

Comment: Elapsed time with 2 threads is about 7/8 of elapsed time with
1 thread. Hardly any improvement at all for something that should be
"embarrassingly parallel", and the user time reported by Mac OS X's
/usr/bin/time increased by a factor of about 1.5. That seems like way
too much overhead for thread coordination.


Here are hprof output files for the "double 1 thread" and "double 2
threads" tests:

http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5ed42a65adcef6d9e8bc3e1aad/misc/pmap-batch-output/pmap-double-1-hprof.txt

http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5ed42a65adcef6d9e8bc3e1aad/misc/pmap-batch-output/pmap-double-2-hprof.txt

In both cases, over 98% of the time is spent in
java.lang.Double.valueOf(double d). See the files for the full stack
backtraces if you are curious.

I don't see any reason why that method should have any kind of
contention or worse performance when running on 2 cores vs. 1 core,
but I don't know the guts of how it is implemented. At least in
OpenJDK all it does is "return new Double(d)", where d is the double
arg to valueOf(). Is there any reason why "new" might exhibit
contention between parallel threads?

Andy

Nicolas Oury

Aug 6, 2009, 9:32:36 AM
to clo...@googlegroups.com
Hello,

I'll hazard a guess. If 98% of the time is spent allocating Doubles, the program is loading a new cache line
every n Doubles. At some point down the cache hierarchy there is a common cache or main memory shared by both cores, and the bus to that memory has to be shared in some way.

So maybe you are measuring the throughput of your memory, and not the computational speed.

To confirm or refute the guess it would be interesting to
- see where the int version spends its time and compare the hprof output.
- replace doubles with floats. (I assume floats are smaller than doubles, but I have no clue whether that is the case.)
- replace ints with longs (corresponding assumption).
- I read somewhere about an option of the JVM for better cache-aware allocation, but I don't remember what it was
and whether it was in 6 or will be in 7.

Cheers,

Nicolas.

Nicolas Oury

Aug 6, 2009, 10:00:49 AM
to clo...@googlegroups.com
Hello again,


Another interesting test: replace the double operation with something longer that won't allocate anything
(a long chain of math functions on primitive types...), and see if the parallelism is better.

Best,

Nicolas.

Bradbev

Aug 6, 2009, 1:00:06 PM
to Clojure
On Aug 6, 3:07 am, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:
> http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
> http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> In both cases, over 98% of the time is spent in
> java.lang.Double.valueOf(double d).  See the files for the full stack
> backtraces if you are curious.
>
> I don't see any reason why that method should have any kind of
> contention or worse performance when running on 2 cores vs. 1 core,
> but I don't know the guts of how it is implemented.  At least in
> OpenJDK all it does is "return new Double(d)", where d is the double
> arg to valueOf().  Is there any reason why "new" might exhibit
> contention between parallel threads?

Can you run your benchmarks with the number of concurrent threads
being equal to the number of cores that you have? The increase in
system time is interesting to me - is it possible that the JVM or OS
can detect threads that don't use floating point registers & therefore
doesn't bother to save them when doing a thread context switch? If
so, that is a significant amount of memory that doesn't need to be
touched during context switch.

Brad

Sean Devlin

Aug 6, 2009, 1:29:12 PM
to Clojure
FYI

IEEE doubles are typically 64 bit

IEEE floats are typically 32 bit.

The wikipedia article is good:

http://en.wikipedia.org/wiki/IEEE_754-2008

The IEEE standard (requires login):

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4610935

I'm not sure how the JVM implements them precisely.

Hey, this is a *really* stupid question, but...

Is everyone using 64 bit hardware?

On Aug 6, 9:32 am, Nicolas Oury <nicolas.o...@gmail.com> wrote:
> Hello,
>
> I will try to have a guess. If 98% of time is spend allocating Doubles, the
> program is loading new lines of memory in cache
> every n Doubles. At some point down the different levels of cache, you have
> a common cache/main memory for both cores and the bus to this memory has to
> be shared in some way.
>
> So maybe you are measuring the throughput of your memory, and not the
> computational speed.
>
> To confirm/infirm the guess it would be interesting to
> - see where the int version spend its time and compare the hprof.
> - replace doubles with floats. (I assume Floats are smaller than Doubles but
> I have no clue whether it is the case).
> - replace ints with longs (corresponding assumption).
> - I read somewhere about an option of the JVM for better cache-aware
> allocation. But I don't remember what it was
> ans whether it was in 6 or will be in 7.
>
> Cheers,
>
> Nicolas.
>
> On Thu, Aug 6, 2009 at 11:07 AM, Andy Fingerhut <
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
> >http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...

Andy Fingerhut

Aug 6, 2009, 1:29:43 PM
to Clojure
Johann who started this thread has an 8 core machine, but I don't. My
machine has 2 cores. All of my tests were with 1 thread, using map,
or 2 threads, using my modified-pmap which I've tested and confirmed
that it tries to evaluate at most 2 future calls at a time, so at most
2 threads at a time.

Andy

John Harrop

Aug 6, 2009, 2:51:32 PM
to clo...@googlegroups.com
Cache misses are a possibility; try the integer version with long, so the size of the data is the same as with double.

The other possibility I'd consider likely is that the JDK you were using implements caching in Double.valueOf(double). This could be dealt with if Clojure boxing directly called new Double(double). Caching in valueOf methods has been known to cause trouble in other situations, involving Java, primitive boxing, and performance. Besides the caching wrecking memory access-pattern locality, causing CPU cache misses, it likely involves acquiring and releasing a global lock on the Double cache, which will kill multithreaded performance particularly.
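Whether a given JDK caches in Double.valueOf is easy to probe directly. In OpenJDK it simply returns new Double(d), so two calls with the same argument yield distinct objects; a quick check (the class name is mine, and behavior is not guaranteed identical across JDKs, since the valueOf Javadoc permits caching):

```java
public class DoubleBoxProbe {
    // True if two boxings of the same value return the same object,
    // i.e. the JDK served the box from a cache rather than allocating.
    static boolean sameBox(double d) {
        return Double.valueOf(d) == Double.valueOf(d);
    }

    public static void main(String[] args) {
        // On OpenJDK, Double.valueOf(d) is just "new Double(d)": no cache,
        // one allocation per call, which is what the hprof traces show.
        System.out.println(sameBox(0.1)); // prints false
    }
}
```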

Andy Fingerhut

Aug 6, 2009, 7:05:21 PM
to Clojure
Interesting how the caching can mess performance up, especially in
multi-threaded situations. I don't know yet how to confirm whether my
implementation uses that caching, but I did see a mention of it in
some docs for Double.valueOf(double), and didn't think about the
possibility of that introducing locking. In this particular test, I
doubt it would increase non-locality much if at all, since we are
always providing the same value (0.1) to the method, so the cache
should have exactly 1 element in it the whole time. But if it were
synchronized, that would definitely bring the performance of this
particular test, or in general any test with a significant number of
float or double operations, down closer to that of a single core.

If that is the performance bottleneck in this case, is there any way
around it? Perhaps I could try locally changing my Clojure Java
source to use new Double instead of Double.valueOf(double) and it
might actually be faster.

I'll try out some variations with different types like long and float,
and report back when I've got them.

Thanks,
Andy

Andy Fingerhut

Aug 8, 2009, 7:27:07 AM
to Clojure
Johann, if you are still following this thread, could you try running
this Clojure program on your 8 core machine?

http://github.com/jafingerhut/clojure-benchmarks/blob/3e45bd8f6c3eba47f982a0f6083493a9f076d0e9/misc/pmap-testing.clj

The first command below will do 8 jobs sequentially, each
doing 10^10 (inc c)'s, where c is a double primitive. The second will
do the 8 jobs in parallel, hopefully finishing about 8 times faster on
your machine:

java -cp <your_class_path> clojure.main pmap-testing.clj double2 8 10000000000 1
java -cp <your_class_path> clojure.main pmap-testing.clj double2 8 10000000000 8

If you replace double2 with double1, it should reproduce your initial
test case with (inc 0.1) in the inner loop -- the one that started
this thread about why there wasn't much speedup.


I created some Clojure and Java functions that are as similar as I
know how to make them, but frankly I don't really know whether my JVM
implementation (Apple's java 1.6.0_13) is using 'new Double', or a
cache as mentioned by John Harrop earlier in this discussion, in its
implementation of Double.valueOf(double). I've found that the
performance is very similar to a Java program that uses 'new Double'
explicitly in its inner loop.

In the results linked below, the 'sequential' runs are for calling the
function named twice in a row, whereas the 'parallel' runs are for
having two parallel threads, on my 2 core machine, each of which calls
the named function once. The total amount of computation in each case
should be the same, so you'd hope that the parallel case would finish
in about half the time, with about the same amount of total CPU time
being utilized.

That isn't what happens for the Java code with 'new Double' (called
NewDoubleTest), or for the Clojure code that has (inc 0.1) and calls
Double.valueOf(double) down in its implementation (called
spin-double1). The parallel case only saves about 14% of the elapsed
time with 2 cores, and takes about 50% more total CPU time.

The more typical code that takes a single double value, initializes
it, and then adds 1 to it each time through the inner loop (Java
DoubleTest, Clojure spin-double2), is significantly faster than the
versions mentioned above. Plus they exhibit the expected speedup from
using 2 cores instead of 1.

http://github.com/jafingerhut/clojure-benchmarks/blob/3e45bd8f6c3eba47f982a0f6083493a9f076d0e9/misc/RESULTS

I'm not sure how to determine why calling 'new Double' each time
through NewDoubleTest's inner loop causes 2 threads to perform not
much better than 1. The best possible explanation I've heard is from
Nicolas Oury -- perhaps we are measuring the bandwidth from cache to
main memory, not raw computational ability of the processor cores.

Andy

Bradbev

Aug 8, 2009, 12:49:18 PM
to Clojure
> I'm not sure how to determine why calling 'new Double' each time
> through NewDoubleTest's inner loop causes 2 threads to perform not
> much better than 1.  The best possible explanation I've heard is from
> Nicolas Oury -- perhaps we are measuring the bandwidth from cache to
> main memory, not raw computational ability of the processor cores.

I assume that 'new Double' is actually doing a heap alloc? In that
case, won't a lock need to be acquired for the allocation? With a high
enough alloc rate, GC and the heap lock will reduce performance a lot.

Brad

Nicolas Oury

Aug 8, 2009, 1:13:06 PM
to clo...@googlegroups.com
Hi Brad,

I think that there is no global lock for heap allocation, at least for
small objects.
As a support for this claim:

http://www.ibm.com/developerworks/java/library/j-jtp09275.html
(see more specifically: "Thread-local allocation", but the article is
really interesting as a whole.)

I am interested in this global cache for valueOf and haven't found
any good explanation of that phenomenon. Does someone have a link to
some resources?

Thank you very much,

Nicolas.

Chad Harrington

unread,
Aug 8, 2009, 2:33:47 PM8/8/09
to clo...@googlegroups.com
Andy,
I just thought I'd mention that for 80 cents you can rent an hour on an 8-core EC2 machine with 7GB of RAM.  We use EC2 a lot for such things at work.  It may be an easy way for you to accomplish your goals.
http://aws.amazon.com/ec2/instance-types/

Chad Harrington
chad.ha...@gmail.com

Johann Kraus

Aug 9, 2009, 7:51:46 AM
to Clojure
> Johann, if you are still following this thread, could you try running
> this Clojure program on your 8 core machine?
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/3e45bd8f6c3eba4...
>
> These first set of parameters below will do 8 jobs sequentially, each
> doing 10^10 (inc c)'s, where c is a double primitive.  The second will
> do the 8 jobs in parallel, hopefully finishing about 8 times faster on
> your machine:
>
> java -cp <your_class_path> clojure.main pmap-testing.clj double2 8
> 10000000000 1
> java -cp <your_class_path> clojure.main pmap-testing.clj double2 8
> 10000000000 8

OK, these are the results on my machine (the number in parentheses is
the value of the last parameter):

sequential (1) : "Elapsed time: 112843.711514 msecs"
parallel (8) : "Elapsed time: 28347.267593 msecs"

while using 800% CPU in case of parallel (8).

I tried with changing the last parameter and these are the results:

parallel (2) : "Elapsed time: 75269.155776 msecs"
parallel (3) : "Elapsed time: 47145.931658 msecs"
parallel (4) : "Elapsed time: 37715.217599 msecs"
parallel (5) : "Elapsed time: 37839.950079 msecs"
parallel (6) : "Elapsed time: 38357.797175 msecs"
parallel (7) : "Elapsed time: 37756.190205 msecs"

From 4 to 7 there is no speedup at all.

> If you replace double2 with double1, it should reproduce your initial
> test case with (inc 0.1) in the inner loop -- the one that started
> this thread about why there wasn't much speedup.

Results for double1:

sequential (1) : "Elapsed time: 776338.624798 msecs"
parallel (8) : "Elapsed time: 578037.874598 msecs"

> I created some Clojure and Java functions that are as similar as I
> know how to make them, but frankly I don't really know whether my JVM
> implementation (Apple's java 1.6.0_13) is using 'new Double', or a
> cache as mentioned by John Harrop earlier in this discussion, in its
> implementation of Double.valueOf(double).  I've found that the
> performance is very similar to a Java program that uses 'new Double'
> explicitly in its inner loop.

I changed my example with Java threads to just do a new Double and a
new Integer in the loop. In both cases there is no speedup at
all. So I think Java does not parallelize well when each thread
handles lots of objects. This is really a Java problem, but I did
not find any discussion of this issue in Java forums.

But there remains an open issue for Clojure:

If I do my pmaptest with a very large integer (inc 2000000000) instead
of (inc 0), it is as slow as the double version. My question is
whether Clojure has special handling for small integers, like
using primitives for small ints and allocating a new Integer for larger
ones?
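For comparison, Java itself does something like this for boxed integers: Integer.valueOf is required to cache the range -128..127 and typically allocates outside it, which would make a boxed (inc 0) cheap and (inc 2000000000) allocate on every iteration, assuming Clojure boxes through valueOf (an assumption; I have not checked the Clojure source). A quick probe (class name is mine; the out-of-range result assumes the default box-cache size):

```java
public class IntegerCacheProbe {
    // True when both boxings return the same object, i.e. the value
    // was served from the Integer.valueOf cache rather than allocated.
    static boolean cached(int i) {
        return Integer.valueOf(i) == Integer.valueOf(i);
    }

    public static void main(String[] args) {
        System.out.println(cached(127));        // inside the mandated -128..127 cache
        System.out.println(cached(2000000000)); // outside: a fresh allocation per call
    }
}
```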

By the way GC is running periodically, but while profiling it does not
report more than 1 sec of total runtime.

--
Johann

Nicolas Oury

Aug 9, 2009, 9:08:12 AM
to clo...@googlegroups.com

> If I do my pmaptest with a very large Integer (inc 2000000000) instead
> of (inc 0), it is as slow as the double version. My question is,
> whether Clojure may has a special handling for small integers? Like
> using primitives for small ints and doing a new Integer for larger
> ones?
>
That seems to confirm that it is a memory-allocation-related problem.
(I think for some reason this is stored as a long, the same size as a
double.)

This can mean two things:
- as I hypothesized before, we are measuring the memory bandwidth
- we fill the thread-local heap so fast that there is lock
contention for the common heap.

I don't know how to separate one from the other...

Best,

Nicolas.

Bradbev

Aug 9, 2009, 12:36:50 PM
to Clojure
On Aug 9, 6:08 am, Nicolas Oury <nicolas.o...@gmail.com> wrote:
> > If I do my pmaptest with a very large Integer (inc 2000000000) instead
> > of (inc 0), it is as slow as the double version. My question is,
> > whether Clojure may has a special handling for small integers? Like
> > using primitives for small ints and doing a new Integer for larger
> > ones?
>
> It seems a confirmation that it is a memory allocation related problem.
> (I think for some reason, this is store as a long, the same size as a
> double).
>
> This can means two things:
>  - as I hypothesized before, we are measuring the memory bandwidth
>  - we fill the Thread Local Heap so fast, that there is a lock
> contention for the common heap.
>
> I don't know how to separate one from the another...
You could write a task that doesn't allocate but does hammer on
memory bandwidth, and have it run in parallel at the same time. You'd
need an array that is larger than your cache, and then stride through
it in cache-line-size steps. If the runtime doesn't decrease then you
are seeing lock contention.
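A single-threaded sketch of the probe Brad describes, assuming 64-byte cache lines and a 64 MB buffer larger than the shared cache (both assumptions; adjust for your hardware, and run several instances in parallel to compare runtimes):

```java
import java.util.Arrays;

public class BandwidthProbe {
    static final int CACHE_LINE = 64; // assumed cache-line size in bytes

    // Touch one byte per cache line, forcing a line fill from memory
    // for each step once the buffer exceeds the last-level cache.
    static long stride(byte[] buf) {
        long sum = 0;
        for (int i = 0; i < buf.length; i += CACHE_LINE)
            sum += buf[i];
        return sum;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024 * 1024]; // 64 MB, bigger than typical caches
        Arrays.fill(buf, (byte) 1);
        long t0 = System.nanoTime();
        long sum = stride(buf);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("sum=" + sum + " in " + ms + " ms");
    }
}
```

If running two copies concurrently roughly doubles each copy's runtime, memory bandwidth is the shared bottleneck; if the runtime holds steady, the pmap slowdown points at lock contention instead.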

Cheers,
Brad

Berk Özbozkurt

Aug 9, 2009, 1:35:27 PM
to clo...@googlegroups.com
...

> parallel (6) : "Elapsed time: 38357.797175 msecs"
> parallel (7) : "Elapsed time: 37756.190205 msecs"
>
> From 4 to 7 there is no speedup at all.

This looks an awful lot like a Core i7 with 8 hardware threads but
only 4 physical cores. What is your hardware?

Sorry, I see you have already posted that.

cliffc

Aug 10, 2009, 12:53:03 PM
to Clojure

I'll volunteer to run your code on an Azul box.
- Azul gear has a great profiling tool. I should be able to rapidly
tell hot-locks/lock-contention from other resource bottlenecks.
- Azul gear has far more bandwidth than X86 gear, so if your X86 is
bandwidth bound - this won't show up on us.
- Azul gear supports a far higher allocation rate than X86 gear, so if
your X86 is allocation-rate bound - this won't show up on us.
- Azul has 100's of CPUs, so if you are looking at a
4-HT-cores-mimicking-8-real-cores issue - this also won't show up on us.
- Azul CPUs are slower than X86 CPUs (especially for FP); the timing
numbers won't be comparable. But the locking bottlenecks should be
the same.

If you want to do this, email me cliffc at acm.org with a library
tarball and a *bare* java command line - no scripts please as I have
to replace the java command line with my own.

Thanks,
Cliff