Having trouble getting full performance from a quad-core with trivial code

Zak Wilson

May 30, 2010, 12:31:15 PM

to Clojure

I'm running Clojure code on an early Mac Pro with OS X 10.5 and Java
1.6. It has two dual-core Xeon 5150s and 5GB of memory.

I'm not getting the performance I expected despite top reporting 390%
steady-state CPU use, so I wrote some trivial tests to see if I was
actually getting the benefit of all four cores. It runs about twice as
fast with four cores as with one, and only slightly faster with three
or four than with two. This code being trivially parallel, I was
expecting nearly 4x the speed with four cores.

Here are the tests and results: http://gist.github.com/418631

I'd appreciate it if anybody could

a. point out any problems with my code that might be hurting
performance
b. try this out on your own 3+ core machine and see if you have better
results
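
For readers without access to the gist, a minimal sketch of this kind of test follows. It is not the code from the gist: the names `elapsed-ms` and `compare-map-pmap` are hypothetical, `fac` is assumed to be a plain BigInteger factorial, and `*'` assumes Clojure 1.3 or later (in 2010-era Clojure, plain `*` promoted automatically).

```clojure
;; Assumed stand-in for the benchmark function: factorial with
;; auto-promoting multiplication so large results do not overflow.
(defn fac [n]
  (reduce *' (range 1 (inc n))))

;; Hypothetical helper: wall-clock milliseconds taken by a zero-arg fn.
(defn elapsed-ms [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

;; Hypothetical helper: run the same workload serially and with pmap.
;; dorun forces the lazy sequences so the work happens inside the timer.
(defn compare-map-pmap [n reps]
  (let [work (repeat reps n)]
    {:map-ms  (elapsed-ms #(dorun (map fac work)))
     :pmap-ms (elapsed-ms #(dorun (pmap fac work)))}))
```

Calling `(compare-map-pmap 50000 10)` would roughly mirror the workload sizes quoted later in the thread. A single run includes JIT warm-up, so repeated runs give more trustworthy numbers.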

Heinz N. Gies

May 30, 2010, 12:43:36 PM

to clo...@googlegroups.com

On May 30, 2010, at 18:31 , Zak Wilson wrote:

> I'm running Clojure code on an early Mac Pro with OS X 10.5 and Java
> 1.6. It has two dual-core Xeon 5150s and 5GB of memory.

Just an idea: two dual-core CPUs != four cores. Parallelism across more than one CPU is always slower than on one CPU with multiple cores; for the first few seconds the process might not even get moved to the second CPU at all. Perhaps that is why your speed gain isn't that high.

Try taking 10k instead of 5k for the tests; does that change anything?

Regards,
Heinz

Lee Spector

May 30, 2010, 12:49:30 PM

to clo...@googlegroups.com

Zak,

This may not be your main issue and I haven't done enough testing with my own code to know if it's even my main issue, but I've found that things appear to go better for me on multicore machines if I invoke java with the -XX:+UseParallelGC option.
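
When experimenting with a flag like this, it can help to confirm from the REPL that the JVM actually picked it up. The sketch below uses the standard `java.lang.management` beans; the function names `jvm-flags` and `gc-names` are hypothetical, and the collector names reported vary by JVM vendor and version.

```clojure
(import '(java.lang.management ManagementFactory GarbageCollectorMXBean))

;; Arguments the JVM was started with, e.g. "-XX:+UseParallelGC" if it was passed.
(defn jvm-flags []
  (vec (.getInputArguments (ManagementFactory/getRuntimeMXBean))))

;; Names of the garbage collectors the running JVM reports.
(defn gc-names []
  (mapv (fn [^GarbageCollectorMXBean b] (.getName b))
        (ManagementFactory/getGarbageCollectorMXBeans)))
```

Evaluating `(jvm-flags)` and `(gc-names)` before each benchmark run makes it easier to tell which collector a given timing belongs to.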

-Lee

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspe...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438

Check out Genetic Programming and Evolvable Machines:
http://www.springer.com/10710 - http://gpemjournal.blogspot.com/

Zak Wilson

May 30, 2010, 1:28:50 PM

to Clojure

Heinz - playing with the size of the number doesn't have much effect,
except that when it becomes very small, parallelization overhead
eventually exceeds compute time.

Lee - Parallel GC slowed it down by 3 seconds on the four core
benchmark.

ka

Jun 1, 2010, 3:11:00 AM

to Clojure

Hi Zak,

I tried your example on my i7 (4 physical cores, 8 logical); here are
the results -

1:298 user=> (time (do (doall (map fac (take 10 (repeat 50000)))) nil))
"Elapsed time: 54166.665145 msecs"

1:300 user=> (time (do (doall (pmap fac (take 10 (repeat 50000)))) nil))
"Elapsed time: 27418.26344 msecs"

With map CPU usage ~12.5%, with pmap (10 threads) ~50% average.

But when I change the fac function to say -

(defn fac
  [n]
  (let [n (* n 1000)]
    (loop [i 0]
      (when (< i n)
        (* 2 2)
        (recur (inc i))))))

1:308 user=> (time (do (doall (map fac (take 10 (repeat 50000)))) nil))
"Elapsed time: 48507.220449 msecs"

1:309 user=> (time (do (doall (pmap fac (take 10 (repeat 50000)))) nil))
"Elapsed time: 9320.92417 msecs"

With map CPU usage ~12.5%, with pmap (10 threads) ~95% average.

So I think it may have something to do with the very large BigIntegers
in your original fac, but I can't be sure. The point is that the
original fac somehow doesn't consume the entire CPU even with 10
threads?! If you find out why, let me know.

Also I didn't spend much time with your npmap function, but it seems
like a nice idea. One observation - partition-all is not very good at
dividing up work to be done. For example, suppose you want to divide
up a coll of length 9 among 4 threads; there is no way to do that with
partition-all directly -

1:335 user=> (clojure.core/partition-all 2 (range 9))
((0 1) (2 3) (4 5) (6 7) (8))
1:336 user=> (clojure.core/partition-all 3 (range 9))
((0 1 2) (3 4 5) (6 7 8))

Something like this might be better suited
(https://gist.github.com/54313ab02d570204393b) -
1:338 user=> (partition-work 4 (range 9))
((0 1) (2 3) (4 5) (6 7 8))

The above partition-work is based on ideas which I got from MPI
programming a while back.
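
The gist is not reproduced in this thread, so the following is only a sketch of one MPI-style block decomposition that yields the same output as the examples shown here. The name `partition-work-sketch` is hypothetical, and the boundary rule (chunk i covers indices floor(i*n/p) up to floor((i+1)*n/p)) is an assumption, not necessarily what the gist does.

```clojure
;; Sketch: split coll into at most p contiguous chunks whose sizes
;; differ by at most one. Chunk i covers [floor(i*n/p), floor((i+1)*n/p)).
(defn partition-work-sketch [p coll]
  (let [v (vec coll)
        n (count v)]
    (for [i (range p)
          :let [start (quot (* i n) p)
                end   (quot (* (inc i) n) p)]
          :when (< start end)]          ; skip empty chunks when n < p
      (seq (subvec v start end)))))
```

With this rule, `(partition-work-sketch 4 (range 9))` gives `((0 1) (2 3) (4 5) (6 7 8))`, the same split as the partition-work call above.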

- Thanks

Zak Wilson

Jun 2, 2010, 2:14:25 PM

to Clojure

ka, I ran some more tests, including partition-work and your version
of fac. I also ran some code from http://shootout.alioth.debian.org in
both C and Java.

On these 10-element sequences, partition-work seems to be a few tens
of milliseconds slower than partition-all. It does look generally
useful though; I'll run some more tests with it.

Your version of fac doesn't change the performance characteristics on
the Mac much: two cores are almost twice as fast as one, but three or
four result in single-digit percentage gains. My fac was 8% faster on
four cores than on two. Yours was 1% faster. I also tried with a
12-element sequence to try to ensure that all the cores had the same
amount of work. In that situation, four and two were even closer.

I ran the Java versions of the Mandelbrot and spectral norm benchmarks
from the above-linked site, as those appeared to keep all the cores
busy. Java versions of both run nearly twice as fast on the Mac Pro as
on my dual-core laptop, which is in line with expected results on a
trivially parallel problem (the laptop is slightly faster per-core).

The problem here seems to be Clojure-specific. There's some sort of
overhead here that keeps the CPUs busy (top reports 390%), but very
little extra desirable work is actually getting done. If this is
indeed a Clojure problem, I'm happy to try to help track it down. I'm
not very familiar with the profiling and monitoring tools for the JVM,
so a pointer in the right direction there would be appreciated.

ka

Jun 3, 2010, 4:07:24 AM

to Clojure

Hi Zak,

It seems very weird that my version of fac changes performance
characteristics on my machine and not yours (OS/hardware dependent?).
Can you tell me your hardware configuration, esp. the number of
physical and logical cores? Next I plan to drop pmap and run the thing
using threads directly, and also to write up the same implementation
in Java and compare. I'll do that when I have some time to spend on
it.
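
One way to run the work on threads directly, bypassing pmap, is a fixed-size pool from `java.util.concurrent`. This is a sketch under assumptions, not code from the thread: `run-on-threads` is a hypothetical name, and it relies on Clojure functions implementing `Callable`.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

;; Sketch: apply f to every element of args on a pool of n-threads
;; threads and return the results in input order.
(defn run-on-threads [n-threads f args]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      (let [tasks   (mapv (fn [a] (fn [] (f a))) args) ; fns are Callables
            futures (.invokeAll pool tasks)]           ; blocks until all finish
        (mapv (fn [^Future fut] (.get fut)) futures))
      (finally
        (.shutdown pool)))))
```

Something like `(time (run-on-threads 4 fac (repeat 10 50000)))` would then give a timing that does not depend on how pmap schedules its futures.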

Regarding partitioning -

You won't see any appreciable difference between partition-all and
partition-work in a few examples. But I'm quite sure that on average
partition-work will divide up the work in a much more balanced
fashion than partition-all. IMO partition-all is intended for a
different purpose than dividing a coll "equally".

partition-work guarantees (prove it!) that the number of elements in
each sub-coll is bounded above by ceil(n/p) and bounded below by
floor(n/p). n=count of coll, p=num of sub colls to divide into.
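
The bound can be checked mechanically if one assumes the boundary rule sketched here: chunk i runs from floor(i*n/p) to floor((i+1)*n/p). That rule is an assumption about how partition-work chooses its split points. Under it, each chunk size is a difference of two floors whose arguments differ by exactly n/p, so it is either floor(n/p) or ceil(n/p). The names `chunk-sizes` and `balanced?` are hypothetical.

```clojure
;; Sizes of the p chunks when n items are split at floor(i*n/p) boundaries.
(defn chunk-sizes [p n]
  (for [i (range p)]
    (- (quot (* (inc i) n) p)
       (quot (* i n) p))))

;; True when every chunk size lies in [floor(n/p), ceil(n/p)]
;; and the sizes add up to n.
(defn balanced? [p n]
  (let [sizes (chunk-sizes p n)
        lo    (quot n p)
        hi    (quot (+ n p -1) p)]   ; integer ceil(n/p)
    (and (= n (reduce + sizes))
         (every? #(<= lo % hi) sizes))))
```

For instance, `(every? true? (for [p (range 1 9) n (range 0 51)] (balanced? p n)))` exercises the bound over a small grid of cases.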

Example - suppose we need to partition 13 tasks (each taking roughly
the same amount of time) onto 4 CPUs.

1:345 user=> (partition-work 4 (range 13))
((0 1 2) (3 4 5) (6 7 8) (9 10 11 12))

1:346 user=> (clojure.core/partition-all 4 (range 13))
((0 1 2 3) (4 5 6 7) (8 9 10 11) (12))

If we just use partition-all (w/o any modifications) it puts 4 tasks
each on 3 CPUs and only 1 task on the last CPU. But partition-work
will divide - 3 tasks on the first 3 CPUs and 4 tasks on the last CPU;
leading to a more balanced work division.

Also as stated above you cannot use partition-all (directly in its
current form) to divide up 9 tasks among 4 cores -

1:335 user=> (clojure.core/partition-all 2 (range 9))
((0 1) (2 3) (4 5) (6 7) (8))
1:336 user=> (clojure.core/partition-all 3 (range 9))
((0 1 2) (3 4 5) (6 7 8))

1:338 user=> (partition-work 4 (range 9))
((0 1) (2 3) (4 5) (6 7 8))

- Thanks

Zak Wilson

Jun 3, 2010, 3:26:15 PM

to Clojure

> It seems very weird that my version of fac changes performance
> characteristics on my machine and not yours (OS/hardware dependent?).
> Can you tell your hardware configuration, esp. number of physical and
> logical cores?

It's an early Mac Pro with two dual-core Xeon 5150s, 5GB RAM, Mac OS
10.5 and Java 1.6 (installed through software update). There are four
physical and four logical cores.

One thing I have noticed is that the client VM shows improvements with
four cores over two, but the server VM does not. The client VM is
always slower than the server VM on these tests. The biggest
improvement I've seen with 4 cores over 2 on the server VM is 10%, on
an overclocked Core2 Extreme.

Zak Wilson

Jun 4, 2010, 5:36:38 PM

to Clojure

I have some new data that suggests there are issues inherent to pmap
and possibly other parallelism with Clojure on older Intel quad+ core
machines.

I added a noop loop to the benchmark. It looks like this:

(defn noops [n]
  (when (> n 0)
    (recur (- n 1))))

Running those in parallel is also no faster on the Xeon 5150 box with
four cores than it is with two. It has been suggested that memory
contention is the problem with this machine. I suspect Clojure's
overhead relative to Java is the reason that parallel Java benchmarks
get more out of the four cores on this machine, but don't quote me on
that.

I had someone run the benchmarks on an 8-core Nehalem Mac Pro. Those
results are quite a bit different from mine. On the true factorial
benchmark, four threads are twice as fast as two. Eight threads are
50% faster than four, but 16 threads are about twice as fast as four.
Intermediate numbers are a bit variable, but it seems like
hyperthreading actually speeds things up quite a bit on this
benchmark. ka's version of fac, which I've renamed spin-mult, scales
linearly with the number of physical cores, but slows down with
between 9 and 15 threads. 16 threads is about equal to 8.

I've put the benchmarks up on github: http://github.com/zakwilson/npmap

I'm going to try changing spin-mult to use dotimes and see how that
runs on several machines. Initial results on the Xeon 5150 box suggest
that using dotimes instead of recur solves the problem, and I'll
probably be changing the benchmarks to further explore the issue.
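
A dotimes-based variant could look like the sketch below. The name `spin-mult-dotimes` is hypothetical, and the code is only a guess at the shape of the change, modeled on ka's loop/recur version earlier in the thread; it is not taken from the npmap repository.

```clojure
;; Sketch: the same busy loop as ka's fac, written with dotimes
;; instead of an explicit loop/recur. Returns nil, like the original.
(defn spin-mult-dotimes [n]
  (dotimes [_ (* n 1000)]
    (* 2 2)))
```

Because the loop body discards its result, the JIT may be able to optimize much of it away, so timings from a loop like this say more about loop and scheduling overhead than about arithmetic throughput.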
