abysmal multicore performance, especially on AMD processors


Lee Spector

Dec 7, 2012, 8:25:14 PM
to clo...@googlegroups.com

I've been running compute-intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines, blithely assuming that since I saw near-maximal CPU utilization in "top" and the like, I was probably getting good speedups.

But a colleague recently did some tests and the results are really quite alarming.

On Intel machines we're seeing speedups, but much less than I expected -- only about a 2x speedup going from 1 to 8 cores.

But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost twice as long on 8 cores as on 1.

I'm baffled, and unhappy that my runs are probably going slower on 48-core and 64-core nodes than on single-core nodes.

It's possible that I'm just doing something wrong in the way that I dispatch the tasks, or that I've missed some Clojure or JVM setting... but right now I'm mystified and would really appreciate some help.

I'm aware that there's overhead for multicore distribution and that one can expect slowdowns if the computations that are being distributed are fast relative to the dispatch overhead, but this should not be the case here. We're distributing computations that take seconds or minutes, and not huge numbers of them (at least in our tests while trying to figure out what's going on).

I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads. But that also shouldn't be a big issue here because for this test all of the threads are doing the exact same computation. And I also tried using an agent-based dispatch approach that shouldn't have the pmap thread allocation issue, and the results were about the same.

Note also that all of the computations in this test are purely functional and independent -- there shouldn't be any resource contention issues.

The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications):

(defn burn
  ([] (loop [i 0
             value '()]
        (if (>= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons
                  (* (int i)
                     (+ (float i)
                        (- (int i)
                           (/ (float i)
                              (inc (int i))))))
                  value)))))
  ([_] (burn)))

Then I have a main function like this:

(defn -main
  [& args]
  (time (doall (pmap burn (range 8))))
  (System/exit 0))

We run it with "lein run" (we've tried both Leiningen 1.7.1 and 2.0.0-preview10) on Java 1.7.0_03 (Java HotSpot(TM) 64-Bit Server VM); we also tried Java 1.6.0_22. We've tried various JVM memory options (via :jvm-opts with -Xmx and -Xms settings), and also with and without -XX:+UseParallelGC. None of this seems to change the picture substantially.
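
For reference, the options went into the project file roughly like this (a sketch; the heap sizes here are illustrative, not the exact values from our runs):

;; in project.clj -- illustrative values only
:jvm-opts ["-Xmx4g" "-Xms4g" "-XX:+UseParallelGC"]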

The results that we get generally look like this:

- On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code above, it takes about 45 seconds (and all cores appear to be fully loaded as it does so). If we change the pmap to just plain map, so that we use only a single core, the time goes up to about 1 minute and 36 seconds. So the speedup for 8 cores is just about 2x, even though there are 8 completely independent tasks. So that's pretty depressing.

- But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, running the same test (with pmap) takes about 4 minutes and 2 seconds. That's really slow! Changing the pmap to map here produces a runtime of about 2 minutes and 20 seconds. So it's quite a bit faster on one core than on 8! And all of these times are terrible compared to those on the Intel.

Another strange observation is that we can run multiple instances of the test on the same machine and (up to some limit, presumably) they don't seem to slow each other down, even though just one instance of the test appears to be maxing out all of the CPU according to "top". I suppose that means that "top" isn't telling me what I thought -- my colleague says it can mean that something is blocked in some way with a full instruction queue. But I'm not interested in running multiple instances. I have single computations that involve multiple expensive but independent subcomputations, and I want to farm those subcomputations out to multiple cores -- and get speedups as a result. My subcomputations are so completely independent that I think I should be able to get speedups approaching a factor of n for n cores, but what I see is a factor of only about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD machines.

Any help would be greatly appreciated!

Thanks,

-Lee

--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspe...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438

Andy Fingerhut

Dec 7, 2012, 8:41:28 PM
to clo...@googlegroups.com
Lee:

I'll just give a brief description right now, but one thing I've found in the past on a 2-core machine that was achieving much less than 2x speedup was memory bandwidth being the limiting factor.

Not all Clojure code allocates memory, but a lot does. If the hardware in a system can write at rate X from a multicore processor to main memory, and a single-threaded Clojure program writes to memory at rate 0.5*X, then the most speedup you will ever get out of multicore execution of the same code on N cores will be 2x, no matter how large N is.

As one way to see if this is the problem, you could try changing your "burn" function so that, instead of consing to build up a list result, it first allocates a Java mutable array before the loop (as large as you will need it to be at the end) and writes values into that. You can convert it to some other Clojure type at the end of the loop if you prefer.
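
For example, something roughly like this (a minimal, untested sketch; burn-array is a hypothetical name, and the loop bound is copied from burn):

;; hypothetical sketch: same arithmetic as burn, but writing into a
;; preallocated mutable Java array instead of consing up a list
(defn burn-array []
  (let [n 10000
        ^objects arr (object-array n)]
    (loop [i 0]
      (when (< i n)
        (aset arr i (* (int i)
                       (+ (float i)
                          (- (int i)
                             (/ (float i)
                                (inc (int i)))))))
        (recur (inc i))))
    (vec arr)))  ; convert to a Clojure type at the end if you prefer

If memory bandwidth is the limiter, I'd expect this version to scale noticeably better across cores than the cons-based one.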


I have some C benchmark programs that test memory read and write bandwidth on single and multiple cores you can run on your Intel machine to see if that might be the issue. If this is the issue, I would expect to see at least a little speedup from 1 core to multiple cores, but capped at some maximum speedup that is determined by the memory bandwidth, not the number of cores you run in parallel.

I don't currently have any guess about what might be happening with the AMD multicore machine. If you are interested in wild guessing, perhaps there could be some kind of multicore cache coherency protocol that is badly configured, causing cache lines to be frequently invalidated when multiple cores are sharing memory? That would make more sense if multiple cores were reading from and writing to the same cache lines, which doesn't seem terribly likely for a typical Clojure program.

Let me know if you are interested and I will find those C programs for you to try out. I got them from somewhere on the Internet and may have tweaked them a little bit.

Andy

Lee Spector

Dec 7, 2012, 9:03:25 PM
to clo...@googlegroups.com

Thanks Andy.

My applications definitely allocate a lot of memory, which is reflected in all of that consing in the test I was using. It'd be hard to do what we do in any other way. I can see how a test using a Java mutable array would help to diagnose the problem, but if that IS the problem then it sounds like there'd be no solution short of radically re-engineering our systems. That would be sad!

My colleague who's running the tests is out for the weekend but we'll talk early next week and get back to you if we want to try your C programs etc.

-Lee

Andy Fingerhut

Dec 7, 2012, 9:42:03 PM
to clo...@googlegroups.com

On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

>
> Another strange observation is that we can run multiple instances of the test on the same machine and (up to some limit, presumably) they don't seem to slow each other down, even though just one instance of the test appears to be maxing out all of the CPU according to "top". I suppose that means that "top" isn't telling me what I thought -- my colleague says it can mean that something is blocked in some way with a full instruction queue. But I'm not interested in running multiple instances. I have single computations that involve multiple expensive but independent subcomputations, and I want to farm those subcomputations out to multiple cores -- and get speedups as a result. My subcomputations are so completely independent that I think I should be able to get speedups approaching a factor of n for n cores, but what I see is a factor of only about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD machines.

Lee:

When you say "we can run multiple instances of the test on the same machine", do you mean that, for example, on an 8-core machine you run 8 different JVMs in parallel, each doing a single-threaded 'map' in your Clojure code and not a 'pmap'?

And what kinds of speedups does that achieve?

The results could help indicate whether the problem you are seeing is due to the hardware/OS, or something about multiple threads within a single JVM.

Andy

Jim - FooBar();

Dec 8, 2012, 7:04:11 AM
to clo...@googlegroups.com
Even though this is very surprising (and sad) to hear, I'm afraid I've
got different experiences... My reducer-based parallel minimax is about
3x faster than the serial one on my 4-core AMD Phenom II, and a tiny bit
faster there than on my girlfriend's Intel i5 (2 physical cores + 2 virtual),
I suspect due to the slightly larger L3 cache... So no complaints on my
part from this...

Now, with regard to pmap... The only occasions where I'm finding pmap useful
are for coarse-grain concurrency... In other words, I'm only using pmap
where I'd use Executors: usually when mapping a very expensive fn
(that needs no coordination) across a very large collection of roughly
equally-sized elements. In these cases I'm getting proper linear speedup,
and I don't see why I wouldn't. If you're familiar with text-mining (I
think you are), consider annotating a collection of documents (500-1000)
based on some dictionaries and probabilistic taggers. This can
take more than 15 seconds per document, so pmap does pay off as you would
expect.

I've not done any experiments on any clusters with reducers, but I
wouldn't expect them to perform well when distributed across several
nodes. Basically you'd have horrible locality for something that
requires coordination (like minimax). However, if you could run
something like a genetic algorithm across several nodes and minimax on
each node, then I'd expect good data locality and thus tremendous speedup.
This is exactly what I'm planning to do for training my chess
neural net. Using 20-30 8-core machines would be ideal for this use case
(20-30 individuals in the population, each lifetime (5 games) running
on a separate machine that is entirely devoted to minimax)...

Anyway, like you, I am baffled by your experiences... You say that
monitoring the CPUs shows that they are all busy throughout the entire task.
If this is truly the case then I really don't understand! Otherwise, as
you say, "pmap may leave cores idle for a bit if some tasks take a lot
longer than others"... this is why I said "equally-sized" chunks
before... make sure you monitor your CPUs closely to verify that they are
indeed busy... This has bitten me in the past... Of course, in your case
(range 8) indeed contains 'equally-sized' elements, so I don't know
what to think! In any case there might be something going on that we're
not seeing...

Jim

Marshall Bockrath-Vandegrift

Dec 8, 2012, 9:36:56 AM
to clo...@googlegroups.com
Lee Spector <lspe...@hampshire.edu> writes:

> I'm also aware that the test that produced the data I give below,
> insofar as it uses pmap to do the distribution, may leave cores idle
> for a bit if some tasks take a lot longer than others, because of the
> way that pmap allocates cores to threads.

Although it doesn’t impact your benchmark, `pmap` may be further
adversely affecting the performance of your actual program. There’s an
open bug regarding `pmap` and chunked seqs:

http://dev.clojure.org/jira/browse/CLJ-862

The impact is that `pmap` with chunked seq input will spawn futures for
its function applications in flights of 32, spawning as many flights as
necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
initially launch 64 futures, then spawn an additional 32 every time the
number of active unrealized futures drops below 50, leading to
significant contention for a CPU-bound application.
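
One partial workaround (a sketch, not a fix for the underlying bug) is to strip the chunking off the input seq so that `pmap` sees one element at a time:

(defn unchunk
  "A fully lazy, one-element-at-a-time view of coll, defeating the
   32-element chunking of seqs like (range n)."
  [coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (cons (first s) (unchunk (rest s))))))

;; (pmap burn (unchunk (range 8)))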

I hope it can be made useful in a future version of Clojure, but right
now `pmap` is more of an attractive nuisance than anything else.

-Marshall

Lee Spector

Dec 8, 2012, 11:44:39 AM
to clo...@googlegroups.com

On Dec 7, 2012, at 9:42 PM, Andy Fingerhut wrote:
>
>
> When you say "we can run multiple instances of the test on the same machine", do you mean that, for example, on an 8-core machine you run 8 different JVMs in parallel, each doing a single-threaded 'map' in your Clojure code and not a 'pmap'?
>
> And what kinds of speedups does that achieve?
>
> The results could help indicate whether the problem you are seeing is due to the hardware/OS, or something about multiple threads within a single JVM.


Here are a couple more results, courtesy of my colleague Josiah Erikson (who may chime in here at some point too):

On a 4 core machine, each test within a single "lein run" call:

2:31 for (doall (pmap burn (range 8)))
1:29 for (doall (map burn (range 8)))
1:48 for (doall (pmapall burn (range 8))) [see my reply to Jim, immediately following this one, for pmapall]

Now with multiple "lein run" calls launched concurrently:

29 seconds to do 4 concurrent copies of (doall (map burn (range 2)))
33 seconds to do 8 concurrent copies of (burn 2) [the argument to burn doesn't matter, but that's what Josiah ran]
38 seconds to do 8 concurrent copies of (burn (rand-int 8)) [again, the argument shouldn't matter].

48 seconds to do 2 concurrent copies of (doall (map burn (range 4)))
1:07 to do 2 concurrent copies of (doall (pmap burn (range 4)))

What do we learn from this, aside from the fact that things are way wacky?

Thanks for the help!!

-Lee

Lee Spector

Dec 8, 2012, 11:47:16 AM
to clo...@googlegroups.com

On Dec 8, 2012, at 9:36 AM, Marshall Bockrath-Vandegrift wrote:
>
> Although it doesn’t impact your benchmark, `pmap` may be further
> adversely affecting the performance of your actual program. There’s a
> open bug regarding `pmap` and chunked seqs:
>
> http://dev.clojure.org/jira/browse/CLJ-862
>
> The impact is that `pmap` with chunked seq input will spawn futures for
> its function applications in flights of 32, spawning as many flights as
> necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
> initially launch 64 futures, then spawn an additional 32 every time the
> number of active unrealized futures drops below 50, leading to
> significant contention for a CPU-bound application.
>
> I hope it can be made useful in a future version of Clojure, but right
> now `pmap` is more of an attractive nuisance than anything else.


(I said this would be in a reply to Jim, but now I see it makes more sense in a reply to this message of Marshall's.)

In order to avoid issues with pmap we've also run tests using "pmapall", defined as follows:


(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

The results are more or less the same (awful).

-Lee

Paul deGrandis

Dec 8, 2012, 1:28:03 PM
to clo...@googlegroups.com
My experiences in the past are similar to the numbers that Jim is reporting.

I have recently been centering most of my crunching code around reducers.
Is it possible for you to cook up a small representative test using reducers+fork/join (and potentially primitives in the intermediate steps)?

Perhaps it can shed some light on contention or chunking issues (and eliminate some of the allocation concerns).

Paul

Lee Spector

Dec 8, 2012, 2:43:46 PM
to clo...@googlegroups.com
Just tried it (my first foray into reducers), but I must not be understanding something correctly:

(time (r/map burn (doall (range 4))))

returns in less than a second on my macbook pro, whereas

(time (doall (map burn (range 4))))

takes nearly a minute.

This feels like unforced laziness (although it's not quite that fast), but clojure.core.reducers/map involves no laziness, right?

I think my burn function (reproduced below for convenience) forces all of the laziness in there...

So what am I doing wrong?

BTW the runs described in this message use Clojure 1.5.0-alpha3 -- the ones in previous messages in this thread used Clojure 1.3.0.

-Lee

Andy Fingerhut

Dec 8, 2012, 3:31:25 PM
to clo...@googlegroups.com
I haven't analyzed your results in detail, but here are some results I had on my 2GHz 4-core Intel Core i7 MacBook Pro, vintage 2011.

When running multiple threads within a single JVM invocation, I never got a speedup of even 2. The highest speedup I measured was 1.82, when I ran 8 threads using -XX:+UseParallelGC. I tried -XX:+UseParNewGC but never got a speedup over 1.45 (with 4 threads in parallel -- it was lower with 8 threads).

When running multiple invocations of "lein2 run" in parallel as separate processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40 with 4 processes, and 5.34 with 8 processes (it went over 4 I think because of 2 hyperthreads per each of the 4 cores).

This is a strong indication that the issue is some kind of interference between multiple threads in the same JVM, not the hardware, at least on my hardware and OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java 1.6.0_37).

My first guess would be that even with -XX:+UseParallelGC or -XX:+UseParNewGC, there is either some kind of interference with garbage collection, or perhaps there is even some kind of interference between them when allocating memory?  Should JVM memory allocations be completely parallel with no synchronization when running multiple threads, or do memory allocations sometimes lock a shared data structure?

Andy


On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:

Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved results over pmap. It looks like this:


(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

Refer to Lee's first post for the benchmarking routine we're running.

I figured that, in order to determine whether Java's multithreading was the problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) to running 8 concurrent copies of (burn (rand-int 8)), or even just (burn 2), or 4 copies of (doall (map burn (range 2))), or whatever. Does this make sense? I THINK it does. If it doesn't, then that's cool - just let me know why and I'll feel less crazy, because I am finding my results rather confounding.

On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do (doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))), and 1:48 to do (doall (pmapall burn (range 8))).

So that's weird, because although we do see decreased slowdown from using pmapall, we still don't see a speedup compared to map. Watching processor utilization while these are going on shows that map is using one core, and both pmap and pmapall are using all four cores fully, as they should. So, maybe the OS or the hardware just can't deal with running that many copies of burn at once? Maybe there's a memory bottleneck?

Now here's the weird part: it takes around 29 seconds to do four concurrent copies of (doall (map burn (range 2))), and around 33 seconds to run 8 copies of (burn 2). Yes. Read that again. What? Watching top while this is going on shows what you would expect to see: when I run four concurrent copies, I've got four copies of Java using 100% of a core each, and when I run eight concurrent copies, I see eight copies of Java, each using around 50% of the processor.

Also, by the way, it takes 48 seconds to run two concurrent copies of (doall (map burn (range 4))) and 1:07 to run two concurrent copies of (doall (pmap burn (range 4))).

What is going on here? Is Java's multithreading really THAT bad? This appears to me to prove that Java, or Clojure, has something very seriously wrong with it, or has outrageous amounts of overhead when spawning a new thread. No?

all run with :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure 1.5.0-beta1
(I tried increasing the memory allowed for the pmap and pmapall runs, even to 8g, and it doesn't help at all)
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode):

on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP


Any thoughts or ideas?

There's more weirdness, too, in case anybody is interested. I'm getting results that vary strangely from other benchmarks that are available, and make no sense to me. Check this out (these are incomplete, because I decided to dig deeper with the above benchmarks, but you'll see, I think, why this is so confusing, if you know how fast these processors are "supposed" to be):

all run with :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure 1.5.0-beta1
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode):

Key:    1. (pmap burn (range 8)):
        2. (map burn (range 8)):
        3. 8 concurrent copies of (pmap burn (range 8)):
        4. 8 concurrent copies of (map burn (range 8)):
        5. (pmapall burn (range 8)):

4x AMD Opteron 6168:
    1. 4:02.06
    2. 2:20.29
    3.
    4.

AMD Phenom II X4 945:
    1. 2:31.65
    2. 1:29.90
    3. 3:32.60
    4. 3:08.97
    5. 1:48.36

AMD Phenom II X6 1100T:
    1. 2:03.71
    2. 1:14.76
    3. 2:20.14
    4. 1:57.38
    5. 2:14.43

AMD FX 8120:
    1. 4:50.06
    2. 1:25.04
    3. 5:55.84
    4. 2:46.94
    5. 4:36.61

AMD FX 8350:
    1. 3:42.35
    2. 1:13.94
    3. 3:00.46
    4. 2:06.18
    5. 3:56.95

Intel Core i7 3770K:
    1. 0:44
    2. 1:37.18
    3. 2:29.41
    4. 2:16.05
    5. 0:44.42

2 x Intel Paxville DP Xeon:
    1. 6:26.112
    2. 3:20.149
    3. 8:09.85
    4. 7:06.52
    5. 5:55.29

Andy Fingerhut

Dec 8, 2012, 3:42:16 PM
to clo...@googlegroups.com

On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

> The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications):
>
> (defn burn
>   ([] (loop [i 0
>              value '()]
>         (if (>= i 10000)
>           (count (last (take 10000 (iterate reverse value))))
>           (recur (inc i)
>                  (cons
>                   (* (int i)
>                      (+ (float i)
>                         (- (int i)
>                            (/ (float i)
>                               (inc (int i))))))
>                   value)))))
>   ([_] (burn)))

Lee:

I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing a linked list 10000 times, each time allocating 10000 cons cells (or Clojure's equivalent of a cons cell)? For a total of around 100,000,000 memory allocations of small objects per call to burn?

I know that this is just a bit of code for testing purposes, and not your real application, but do you really do that quantity of memory allocation versus arithmetic in your code (7*10,000 arithmetic operations for every 10,000*10,000 memory allocations?)

I still believe that what we are seeing here is an issue with either the memory allocation or the garbage collection implementation, which can hopefully be improved with some command-line options to the JVM that I don't know yet; but I know that I have seen much bigger speedups in code that didn't do this much memory allocation relative to non-allocating computation (e.g. the Computer Language Benchmarks Game Clojure programs).

Andy

Lee Spector

Dec 8, 2012, 3:54:39 PM
to clo...@googlegroups.com

On Dec 8, 2012, at 3:42 PM, Andy Fingerhut wrote:
>
> I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing a linked list 10000 times, each time allocating 10000 cons cells (or Clojure's equivalent of a cons cell)? For a total of around 100,000,000 memory allocations of small objects per call to burn?
>
> I know that this is just a bit of code for testing purposes, and not your real application, but do you really do that quantity of memory allocation versus arithmetic in your code (7*10,000 arithmetic operations for every 10,000*10,000 memory allocations?)
>
> I still believe that what we are seeing here is an issue with either the memory allocation or the garbage collection implementation, which can hopefully be improved with some command-line options to the JVM that I don't know yet; but I know that I have seen much bigger speedups in code that didn't do this much memory allocation relative to non-allocating computation (e.g. the Computer Language Benchmarks Game Clojure programs).

The numbers were guesses, but my real applications do have something of this flavor. Mostly I'm doing genetic programming, which involves generating, running, mutating and recombining programs. Population sizes are often in the thousands, with program sizes up to thousands of instructions/literals. Often the programs include list manipulation instructions so there's lots of memory allocation involved in executing them (in addition to the allocation required to generate and manipulate them).

-Lee

Andy Fingerhut

Dec 8, 2012, 4:01:40 PM
to clo...@googlegroups.com
One more possibility to consider:

Single-threaded versions are more likely to keep the working set in the processor's largest cache, whereas parallel versions that use N times the working set for N times the parallelism can cause that same cache to thrash to main memory.

Andy

Wm. Josiah Erikson

Dec 8, 2012, 4:43:07 PM
to clo...@googlegroups.com
I'm glad somebody else can duplicate our findings! I get results similar to this on Intel hardware. On AMD hardware, the disparity is bigger, and multiple threads of a single JVM invocation on AMD hardware consistently give me slowdowns compared to a single thread. Also, your results are on Mac OS and mine are on Linux, which raises the question: is this generally true of Java, or is it something about Clojure?

Marek Šrank

Dec 8, 2012, 8:16:57 PM
to clo...@googlegroups.com

Just tried it (my first foray into reducers), but I must not be understanding something correctly:

  (time (r/map burn (doall (range 4))))

returns in less than a second on my macbook pro, whereas

  (time (doall (map burn (range 4))))

takes nearly a minute.

This feels like unforced laziness (although it's not quite that fast), but clojure.core.reducers/map involves no laziness, right?


Yep, reducers don't use lazy seqs. But they just return something like transformed functions that will be applied when building the collection. So you can use them like this:

    (into [] (r/map burn (doall (range 4))))

See http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html

meteorfox

Dec 8, 2012, 10:19:43 PM
to clo...@googlegroups.com
Lee:

I ran Linux perf and also watched the run queue (with vmstat), and your bottleneck is basically memory access: the CPUs spend about 80% of their cycles stalled. Here's what I got on my machine.

Intel Core i7, 4 cores with Hyper-Threading (8 virtual processors)
16 GiB of Memory
Oracle JVM

and this is how I ran perf. (Note: after monitoring the GC, even a 1 GiB heap is a lot of memory for your benchmark; it mostly used a lot less than 256 MiB.)
Also, I ran the (time (doall ...)) three times to make sure that the JVM was warmed up. :)

 perf stat java -server -Xmx1024m -Xms1024m -jar target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar
"Elapsed time: 49194.258825 msecs"
Warm-up 1 (10000 10000 10000 10000 10000 10000 10000 10000)
"Elapsed time: 48200.221677 msecs"
Warm-up 2 (10000 10000 10000 10000 10000 10000 10000 10000)
"Elapsed time: 50050.013156 msecs"
Run  (10000 10000 10000 10000 10000 10000 10000 10000)

 Performance counter stats for 'java -server -Xmx1024m -Xms1024m -jar target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar':

    1094967.286876 task-clock                #    7.383 CPUs utilized          
           170,384 context-switches          #    0.000 M/sec                  
             2,932 CPU-migrations            #    0.000 M/sec                  
            24,734 page-faults               #    0.000 M/sec                  
 3,004,827,628,531 cycles                    #    2.744 GHz                     [83.34%]
 2,430,611,924,472 stalled-cycles-frontend   #   80.89% frontend cycles idle    [83.34%]
 1,976,836,501,358 stalled-cycles-backend    #   65.79% backend  cycles idle    [66.67%]
   650,868,504,974 instructions              #    0.22  insns per cycle        
                                             #    3.73  stalled cycles per insn [83.33%]
   124,479,033,687 branches                  #  113.683 M/sec                   [83.33%]
     1,753,021,734 branch-misses             #    1.41% of all branches         [83.33%]

     148.307812402 seconds time elapsed

Now if you run vmstat 1 while running your benchmark, you'll notice that the run queue will be at 8 most of the time, meaning that 8 "processes" are waiting for CPU; in this case this is due to memory accesses (it is not true for all applications).

So, I feel your benchmark may be artificial and may not truly represent your real application. Make sure to profile your real application and optimize according to the bottlenecks. There are really useful tools out there for profiling, such as VisualVM and perf.

Good Luck,

Carlos

meteorfox

Dec 8, 2012, 10:59:18 PM
to clo...@googlegroups.com
A correction regarding the run queue: what I said is not completely correct. :S But the point about stalled cycles and memory accesses still holds.

Sorry for the misinformation.


Lee Spector

Dec 9, 2012, 12:19:56 AM
to clo...@googlegroups.com

On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote:
>
> Yep, reducers don't use lazy seqs. But they just return something like transformed functions that will be applied when building the collection. So you can use them like this:
>
> (into [] (r/map burn (doall (range 4))))
>
> See http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
> and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more info...
>

Thanks Marek. This does fix the "too quick to be true" issue, but alas, on my Mac laptop (a 4-core Intel i7):

57522.869 msecs for (time (into [] (r/map burn (doall (range 4)))))

58263.312 msecs for (time (doall (map burn (doall (range 4)))))

So while I'm not getting a terrible slowdown from using the reducers version of map, I'm also not getting any speedup over the single-thread map.

We should try this on our other architectures too, but it doesn't look promising.

Lee Spector

Dec 9, 2012, 12:37:49 AM
to clo...@googlegroups.com

On Dec 8, 2012, at 10:19 PM, meteorfox wrote:
>
> Now if you run vmstat 1 while running your benchmark, you'll notice that the run queue will be at 8 most of the time, meaning that 8 "processes" are waiting for CPU; in this case this is due to memory accesses (it is not true for all applications).
>
> So, I feel your benchmark may be artificial and may not truly represent your real application. Make sure to profile your real application and optimize according to the bottlenecks. There are really useful tools out there
> for profiling, such as VisualVM and perf.

Thanks Carlos.

I don't actually think that the benchmark is particularly artificial. It's very difficult to profile my actual application because it uses random numbers all over the place and is highly and nonlinearly variable in lots of ways. But I think that the benchmark I'm running really is pretty representative.

In any event, WHY would all of that waiting be happening? Logically, nothing should have to be waiting for anything else. We know from the fact that we get good speedups from multiple simultaneous JVM runs, each just running one call to my burn function, that the hardware is capable of performing well with multiple instances of this benchmark running concurrently. It MUST be able to handle all of the memory allocation and everything else. It's just when we try to launch them all in parallel from the same Clojure process, using pmap OR agents OR reducers, that we fail to get the concurrency speedups. Why should this be? And what can be done about it?

I know nearly nothing about the internals of the JVM, but is there perhaps a bottleneck on memory allocation because there's a single serial allocator? Perhaps when we run multiple JVMs each has its own allocator so we don't have the bottleneck? If all of this makes sense and is true, then perhaps (wishful thinking) there's a JVM option like "use parallel memory allocators" that will fix it?!

Thanks,

-Lee

cameron

Dec 9, 2012, 2:26:16 AM
to clo...@googlegroups.com

Interesting problem: the slowdown seems to be caused by the reverse call (actually by the calls to conj with a list argument).
Calling conj in a multi-threaded environment seems to have a significant performance impact when lists are involved.
I created some alternate reverse implementations (the fastest uses a vector and cons); the gist with the test code can be found at https://gist.github.com/4243724

On a dual Xeon E5520 (8 physical cores) @ 2.27GHz with Linux running OpenJDK 1.7.0_09, I got the following results:

fast-reverse    : map-ms: 3.3, pmap-ms 0.7, speedup 4.97
list-cons       : map-ms: 4.0, pmap-ms 0.7, speedup 6.13
vec-conj        : map-ms: 4.0, pmap-ms 1.3, speedup 3.10
list-conj       : map-ms: 10.8, pmap-ms 21.2, speedup 0.51
clojure-reverse : map-ms: 13.5, pmap-ms 26.8, speedup 0.50 (this is equivalent to the original code)

The following JDK command line options were used:
  "-Xmx20G" "-XX:MaxPermSize=8G" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"

Some other notes:
I ran the sample under YourKit, and garbage collection represents a small percentage of execution time in both single and multi-threaded tests; there are no blocked threads for the duration of the test, and there is no unexpected monitor/lock usage.

Cheers,
    Cameron.

cameron

Dec 9, 2012, 2:32:10 AM
to clo...@googlegroups.com
I forgot to mention: I cut the number of reverse iterations down to 1000 (not 10000) so I wouldn't have to wait too long for Criterium; the speedup numbers are representative of the full test, though.

Cameron.

Jim - FooBar();

Dec 9, 2012, 7:28:36 AM
to clo...@googlegroups.com
Hi Lee,


Would it be difficult to try the following version of 'pmap'? It doesn't
use futures but executors instead so at least this could help narrow the
problem down... If the problem is due to the high number of futures
spawned by pmap then this should fix it...

(defn- with-thread-pool* [num-threads f]
  (let [pool (java.util.concurrent.Executors/newFixedThreadPool
               num-threads)]
    (try (f pool)
         (finally
           (when pool (.shutdown pool))))))

(defmacro with-thread-pool [[name num-threads] & body]
  `(with-thread-pool* ~num-threads (fn [~name] ~@body)))

(defn pmapp [f coll]
  (with-thread-pool [pool (.. Runtime getRuntime availableProcessors)]
    (doall
     (map #(.get %)
          (.invokeAll pool
                      (map (partial partial f) coll))))))
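
(The (partial partial f) trick turns each element x into (partial f x), a zero-argument fn; Clojure fns implement Callable, which is what invokeAll expects.) To compare against the earlier numbers, something like:

(time (doall (pmapp burn (range 8))))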

And btw, reducers are great, but usually you need to reformulate your
problem... r/map is not parallel. It is serial, but with no intermediate
allocation (unlike lazy seqs). If you want to make it parallel you need
to r/fold the result of r/map (the reducer), providing reducing &
combining fns. However, it still doesn't make sense for it to be slower than
map... If you've followed Rich's example with the pie-maker, consider this:

Both map & r/map will return immediately. The difference is that map
will build a recipe for the first item, whereas r/map will build a recipe
for the entire coll. In other words, the lazy pie-maker assistant
provides 1 apple at a time (after being asked for an apple), but the
reducer pie-maker assistant, as soon as you ask for the 1st apple,
will do the entire bag (without intermediate 'asking'). The lazy recipe
is recursive whereas the reducer-based one looks like a stream... It
should still be faster, albeit not terribly faster. You need r/fold to
see any parallelism...
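
For this benchmark the reformulation might look something like the following (a sketch, assuming we just sum the results; note that r/fold only runs in parallel over vectors, hence the vec):

(require '[clojure.core.reducers :as r])

;; partition size 1 so each call to burn becomes its own fork/join task;
;; + serves as both combining and reducing fn (its 0-arity supplies the identity)
(r/fold 1 + + (r/map burn (vec (range 8))))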


Jim

Marshall Bockrath-Vandegrift

Dec 9, 2012, 7:48:27 AM
to clo...@googlegroups.com
cameron <cdo...@gmail.com> writes:

> Interesting problem: the slowdown seems to be caused by the reverse
> call (actually by the calls to conj with a list argument).

Excellent analysis, sir! I think this points things in the right
direction.

> fast-reverse    : map-ms: 3.3, pmap-ms 0.7, speedup 4.97
> list-cons       : map-ms: 4.0, pmap-ms 0.7, speedup 6.13

The difference between these two I believe is just jitter. Once `cons`
is called on either a list or a vector, the result is a
`clojure.lang.Cons` object.

> vec-conj        : map-ms: 4.0, pmap-ms 1.3, speedup 3.10

For the sub-goal of optimizing `reverse`, I get even better times than for
the `cons`-based implementation by using transient vectors:

list-cons         : map-ms: 4.0, pmap-ms 0.6, speedup 6.42
tvec-conj         : map-ms: 0.9, pmap-ms 0.2, speedup 4.10

(defn tvec-conj [coll]
  (persistent! (reduce conj! (transient []) coll)))

> list-conj       : map-ms: 10.8, pmap-ms 21.2, speedup 0.51
> clojure-reverse : map-ms: 13.5, pmap-ms 26.8, speedup 0.50 (this is
> equivalent to the original code)

I add the following:

cons-conj         : map-ms: 3.3, pmap-ms 16.8, speedup 0.19

(defn cons-conj [coll]
  (reduce conj (clojure.lang.Cons. (first coll) nil) (rest coll)))

I think this is the key, but I don’t understand it.

The `cons` function just immediately creates a new `Cons` object. The
`conj` function calls the `.cons` method of the collection, and the
`.cons` implementation `Cons` inherits from `ASeq` just creates a new
`Cons` object!

It’s like there’s a lock of some sort sneaking in on the `conj` path.
Any thoughts on what that could be?

-Marshall

Softaddicts

Dec 9, 2012, 9:25:19 AM
to clo...@googlegroups.com
If the numbers of object allocations mentioned earlier in this thread are real,
then yes, VM heap management can be a bottleneck. There has to be some
locking done somewhere, otherwise the heap would become corrupted :)

The other bottleneck can come from garbage collection, which has to freeze
object allocation completely or partially.

This internal process has to reclaim unreferenced objects, otherwise you may end up
exhausting the heap. It can even suspend your app while GC is running, depending
on the strategy used.

In Java 6, the VM is supposed to adjust the GC strategy by observing your app's
behavior. That may take some time to profile, however. If your app acquires this heavy
memory requirement abruptly, the VM may not have enough data to adapt.

Have a look here, you will find how to set explicitly the gc behavior:

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

This is specific to Java 6; Java 5 is a different beast, as is Java 7.

Maybe this can help -- I emphasize *can help* -- not having looked deeply into
your problem.

Luc P.
--
Softaddicts<lprefo...@softaddicts.ca> sent by ibisMail from my ipad!

Andy Fingerhut

Dec 9, 2012, 12:52:36 PM
to clo...@googlegroups.com
Lee:

I don't yet know how to get good speedups with this workload in the Oracle JVM, although it might be possible with options I'm unaware of.

Azul's Zing JVM has a different memory allocator and GC implementation that might be better tuned for parallel workloads. I haven't used it myself yet -- this is just from hearing about it in the past and a brief scan of their web site. They have free trials available. Maybe you could try that to see if it gives you better results out of the box, or with minor tweaking of parameters?

I don't know the cost, but like many companies they might have significant discounts for educational customers.

Andy


Andy Fingerhut

Dec 9, 2012, 12:54:41 PM
to clo...@googlegroups.com

On Dec 9, 2012, at 4:48 AM, Marshall Bockrath-Vandegrift wrote:

It’s like there’s a lock of some sort sneaking in on the `conj` path.
Any thoughts on what that could be?

My current best guess is the JVM's memory allocator, not Clojure code.

Andy


Marshall Bockrath-Vandegrift

Dec 9, 2012, 1:19:26 PM
to clo...@googlegroups.com
Andy Fingerhut <andy.fi...@gmail.com> writes:

> My current best guess is the JVM's memory allocator, not Clojure code.

I didn’t mean to imply the problem was in Clojure itself, but I don’t
believe the issue is in the memory allocator either. I now believe the
problem is in a class of JIT optimization HotSpot is performing which
turns into a “pessimization” in the parallel case.

I have the following code, taking the structure from Cameron’s
benchmarks, but replacing the exact functions being tested:

https://gist.github.com/4246320

Note that `conj*` is simply a copy-paste of the `conj` implementation
from clojure/core.clj. The benchmark runs a reduce with that `conj*`
function, first on `Cons`es, then on `PersistentList`s, then again on
`Cons`es.

And here are the results:

cons-conj* : map-ms: 5.6, pmap-ms 1.1, speedup 5.08
list-conj* : map-ms: 10.1, pmap-ms 15.9, speedup 0.63
cons-conj* : map-ms: 10.0, pmap-ms 15.6, speedup 0.64

The function performs fine on `Cons` objects, but once applied to
`PersistentList` objects, `conj*` somehow becomes “tainted” and acquires
the inverse speedup factor. When I run with -XX:+PrintCompilation I
don’t see anything particularly suspicious, but I’m not well-versed in
the art of interpreting HotSpot’s entrails.

My idea for next steps is to create a copy of the PersistentList class
and selectively modify it in an attempt to identify what aspect of it
causes the JVM to produce this behavior.

-Marshall

Andy Fingerhut

Dec 9, 2012, 1:37:09 PM
to clo...@googlegroups.com

On Dec 9, 2012, at 6:25 AM, Softaddicts wrote:

> If the numbers of object allocations mentioned earlier in this thread are real,
> then yes, VM heap management can be a bottleneck. There has to be some
> locking done somewhere, otherwise the heap would become corrupted :)
>
> The other bottleneck can come from garbage collection, which has to freeze
> object allocation completely or partially.
>
> This internal process has to reclaim unreferenced objects, otherwise you may end up
> exhausting the heap. It can even suspend your app while GC is running, depending
> on the strategy used.

Agreed that memory allocation and garbage collection will in some cases need to coordinate between threads to work in the general case of arbitrary allocations and GCs.

However, one could have a central list of large "pages" of free memory (e.g. a few MBytes, or maybe even larger), and pass these out to concurrent memory allocators in these large chunks, and let them do small object allocations and GC within each thread completely concurrently.

The only times locking of any kind might be needed with such a strategy would be when one of the parallel threads requests a new big page from the central free list, or returned a completely empty free page back to the central list that it didn't need any more. All other memory allocation and GC could be completely concurrent. The idea of making those "pages" large is that such passing pages around would be infrequent, and thus could be made to have no measurable synchronization overhead.

That is pretty much what is happening when you run Lee's benchmark programs as 1 thread per JVM, but 4 or 8 different JVMs processes running in parallel. In that situation the OS has the central free list of pages, and the JVMs manage their small object allocations and GCs completely concurrently without interfering with each other.

If HotSpot's JVM could be configured to work like that, he would be seeing big speedups in a single JVM.

Andy

Softaddicts

Dec 9, 2012, 9:05:02 PM
to clo...@googlegroups.com
There's no magic here; everyone tuning their app hits this wall eventually
and ends up tweaking the JVM memory options :)

Luc

cameron

Dec 10, 2012, 3:48:46 AM
to clo...@googlegroups.com
Hi Marshall,
  I think we're definitely on the right track.
If I replace the reverse call with the following function, I get a parallel speedup of ~7.3 on an 8-core machine.

(defn copy-to-java-list [coll]
  (let [lst (java.util.LinkedList.)]
    (doseq [x coll]
      (.addFirst lst x))
    lst))
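
(That is, in burn, (iterate reverse value) presumably becomes (iterate copy-to-java-list value); the addFirst calls take care of the reversal.)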

This function should do as much memory allocation as the Clojure reverse, but it has vastly better parallel performance.
There does seem to be something unusual about conj and clojure.lang.PersistentList in this parallel test case, and I don't think it's related to the JVM's memory allocation.

Cameron.

Marko Topolnik

Dec 10, 2012, 6:17:01 AM
to clo...@googlegroups.com
The main GC feature here is the Thread-Local Allocation Buffer (TLAB). TLABs are on by default and are "automatically sized according to allocation patterns"; the size can also be fine-tuned with the -XX:TLABSize=n configuration option, so you may consider tweaking this setting to optimize runtime. Basically, everything that one call to your function allocates should fit into a TLAB, because it is all garbage upon exit. Allocation inside a TLAB is ultra-fast and completely concurrent.
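
For example (a sketch; the size is purely illustrative and would need tuning against your actual allocation pattern):

:jvm-opts ["-XX:+UseTLAB" "-XX:+ResizeTLAB" "-XX:TLABSize=512k"]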

Marshall Bockrath-Vandegrift

Dec 10, 2012, 6:17:47 AM
to clo...@googlegroups.com
cameron <cdo...@gmail.com> writes:

> There does seem to be something unusual about conj and
> clojure.lang.PersistentList in this parallel test case and I don't
> think it's related to the JVMs memory allocation.

I’ve got a few more data-points, but still no handle on what exactly is
going on.

My last benchmark, showing the `conj*` speedup for `Cons` objects
degrading as soon as it was used on a `PersistentList`, was incomplete.
In fact, the speedup degrades after `conj*` is used on objects of more than
one type. The effect just appears immediately with
`PersistentList` because '() is in fact a different class,
`PersistentList$EmptyList`. Using `conj*` first on the vector
implementation then results in the same inverse speedup on `Cons`es.

Even without your near-optimal speedup using Java standard library
types, I think your earlier benchmarks are enough to demonstrate that
this isn’t an issue with allocation alone. All of the implementations
based on `reduce` with `conj` must allocate and return a new object for
each iteration. If parallel allocation were the sole issue, I’d expect
all of the implementations to demonstrate the same behavior.

Unfortunately I have no idea what to connect from these facts:

- Parallel allocation of `Cons` and `PersistentList` instances through
a Clojure `conj` function remains fast as long as the function only
ever returns objects of a single concrete type

- Parallel allocation speed for `PersistentVector` instances is
unaffected by `conj` returning multiple types, and does not
demonstrate the inverse speedup seen for the previous types.

At this point I believe the symptoms point to cache contention, but I
don’t know where or why. Using OpenJDK 7 with -XX:+UseCondCardMark didn’t
appear to produce any improvement. Creating a private copy of
`PersistentList` which contained additional padding fields likewise
didn’t appear to produce any improvement.

So, Lee Spector: I think it’s possible to work around this, though, by
just not using `conj` on lists. It’s suboptimal, but it at least solves
the problem in your original benchmark. Further improvements are
obviously possible, but that’s a start.
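
Concretely, the workaround might look like this (a sketch, echoing the fast list-cons variant from Cameron's gist; reverse-via-cons is a hypothetical name):

;; build the reversed list with cons, which produces a single concrete
;; type (clojure.lang.Cons) at every step here, instead of conj, which
;; dispatches over -- and returns -- multiple concrete types
(defn reverse-via-cons [coll]
  (reduce (fn [acc x] (cons x acc)) '() coll))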

-Marshall

nicola...@gmail.com

Dec 10, 2012, 7:14:12 AM
to clojure
    cons-conj* : map-ms: 5.6, pmap-ms 1.1, speedup 5.08
    list-conj* : map-ms: 10.1, pmap-ms 15.9, speedup 0.63
    cons-conj* : map-ms: 10.0, pmap-ms 15.6, speedup 0.64

What happens if you run it a third time at the end?
(The question is related to the fact that there appear to be transition states between monomorphic and megamorphic call sites,
which might lead to an explanation.)

meteorfox

Dec 10, 2012, 11:21:19 AM
to clo...@googlegroups.com
 - Parallel allocation of `Cons` and `PersistentList` instances through 
    a Clojure `conj` function remains fast as long as the function only 
    ever returns objects of a single concrete type 

A possible explanation for this could be JIT deoptimization. Deoptimization happens when the JIT has optimized a hot code path based
on what it has learned from the executing program, but the assumptions behind that optimization no longer hold: the compiler notices it has made
an incorrect decision in the previous optimization, and performs the deoptimization.

It might be the case that after a deoptimization, the JIT compiler reoptimizes a code path that was previously optimized. This frequently happens
when dealing with different implementations of an interface, and/or with inheritance and type hierarchies, because the executed code path changes
constantly and has to go through multiple optimizations and deoptimizations.

To identify whether deoptimization is happening, you can add the following flag to the JVM args: -XX:+PrintCompilation
If the output contains "made not entrant", that indicates a deoptimization; that method will be interpreted until
a certain threshold of executions is surpassed, and then it gets optimized again.


Reference:
Java Performance, Charlie Hunt and Binu John

Wm. Josiah Erikson

Dec 10, 2012, 12:55:14 PM
to clo...@googlegroups.com
Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well).

 For instance, from a pmapall run:

33752  159             clojure.lang.Cons::next (10 bytes)   made zombie
  33752  164             clojure.lang.RT::conj (21 bytes)   made zombie
  33753  154             clojure.lang.RT::seq (32 bytes)   made zombie
  33753    5 %           clojure.core$reduce1::invoke @ -2 (184 bytes)   made zombie
  33753  167             clojure.core$conj::invoke (13 bytes)   made zombie
  33753  184             clojure.core$rest::invoke (7 bytes)
  33884  186             clojure.lang.Numbers$LongOps::isPos (15 bytes)
  34298  187             clojure.lang.Numbers::dec (17 bytes)
  34421  188             clojure.lang.Numbers$LongOps::dec (13 bytes)
  34897  189             clojure.core$take::invoke (24 bytes)
  34903  190             clojure.core$take$fn__4227::<init> (15 bytes)
  35386  191             clojure.core$iterate::invoke (39 bytes)
<SNIP>

  1168  154             clojure.lang.RT::seq (32 bytes)   made not entrant
   1169  168             clojure.lang.RT::seq (32 bytes)
   1171    5 %           clojure.core$reduce1::invoke @ -2 (184 bytes)   made not entrant
   1171  169             clojure.core$reduce1::invoke (184 bytes)
   1173  159             clojure.lang.Cons::next (10 bytes)   made not entrant
   1173  167             clojure.core$conj::invoke (13 bytes)   made not entrant
   1173  164             clojure.lang.RT::conj (21 bytes)   made not entrant
   1192  170             clojure.lang.PersistentList::first (5 bytes)
   1193  171             clojure.lang.PersistentList::next (18 bytes)
   1193  172             clojure.lang.Cons::next (10 bytes)
   1194  173             clojure.lang.RT::conj (21 bytes)
   1197  174             clojure.core$conj::invoke (13 bytes)
   1233    6 %           clojure.core$reduce1::invoke @ 0 (184 bytes)

And then, from a map run:

  1163  151             clojure.lang.RT::seq (32 bytes)   made not entrant
   1163  145             clojure.core$cons::invoke (10 bytes)   made not entrant
   1163  168             clojure.lang.RT::seq (32 bytes)
   1165    4 %           clojure.core$reduce1::invoke @ 0 (184 bytes)
   1169  144             clojure.lang.RT::cons (46 bytes)   made not entrant
   3467  169             clojure.lang.RT::cons (46 bytes)
   3470   24             clojure.lang.Util::equiv (65 bytes)   made zombie
   3470   23             java.lang.String::equals (88 bytes)   made zombie
   3470   18             clojure.lang.PersistentArrayMap::createWithCheck (80 bytes)   made zombie
   3470   26             clojure.lang.PersistentArrayMap::equalKey (6 bytes)   made zombie
   3617  170             clojure.core$cons::invoke (10 bytes)
   3622   41             clojure.lang.PersistentArrayMap::indexOf (34 bytes)   made zombie
   3622   30   !         java.net.URL::<init> (543 bytes)   made zombie
   3622   58             clojure.lang.Symbol::equals (49 bytes)   made zombie
   3623   73             java.lang.Object::equals (11 bytes)   made zombie
   3623   65             clojure.lang.Util::hasheq (43 bytes)   made zombie
   4249  171  s!         clojure.lang.LazySeq::sval (54 bytes)
   4259   77             clojure.lang.Util::equiv (65 bytes)   made zombie
   4259   89             clojure.lang.RT::first (35 bytes)   made zombie
   4259   88             clojure.lang.PersistentHashMap$NodeSeq::create (94 bytes)   made zombie
   4578  172     n       java.lang.Object::getClass (0 bytes)  
   5380  173             clojure.lang.AFunction::<init> (5 bytes)
   5634  174             java.lang.Long::longValue (5 bytes)
   5785  175  s          clojure.lang.LazySeq::seq (53 bytes)
   5830  176             clojure.lang.Numbers::ops (97 bytes)
   6168  177             clojure.lang.LazySeq::<init> (10 bytes)
   6169   32             java.lang.AbstractStringBuilder::append (48 bytes)   made zombie
  10727  178             java.lang.Long::valueOf (36 bytes)
  10730   37             java.lang.StringBuilder::append (8 bytes)   made zombie
  10730   49   !         sun.misc.URLClassPath$JarLoader::getResource (91 bytes)   made zombie
  10730   44             sun.misc.URLClassPath::getResource (74 bytes)   made zombie
  11121  179             clojure.lang.Numbers::num (5 bytes)
  11240  180             clojure.core$rest::invoke (7 bytes)
  11240  181             clojure.lang.RT::more (37 bytes)
  11242  144             clojure.lang.RT::cons (46 bytes)   made zombie
  11242  145             clojure.core$cons::invoke (10 bytes)   made zombie
  11242  151             clojure.lang.RT::seq (32 bytes)   made zombie




On Mon, Dec 10, 2012 at 11:21 AM, meteorfox <ctorresk8...@gmail.com> wrote:
-XX:+PrintCompilation

Wm. Josiah Erikson

Dec 10, 2012, 2:36:03 PM
to clo...@googlegroups.com
I tried some more performance tuning options in Java, just for kicks, and didn't get any advantages from them: "-server" "-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m"

Also, in case it's informative:

[josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-1.txt | wc -l
173
[josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-17.txt | wc -l
178
[josiah@compute-1-17 benchmark]$ grep zombie compilerOutputCompute-1-17.txt | wc -l
163
[josiah@compute-1-17 benchmark]$ grep zombie compilerOutputCompute-1-1.txt | wc -l
158
[josiah@compute-1-17 benchmark]$

Marshall Bockrath-Vandegrift

unread,
Dec 10, 2012, 3:33:11 PM12/10/12
to clo...@googlegroups.com
"Wm. Josiah Erikson" <wmjo...@gmail.com> writes:

> Aha. Not only do I get a lot of "made not entrant", I get a lot of
> "made zombie". However, I get this for both runs with map and with
> pmap (and with pmapall as well)

I’m not sure this is all that enlightening. From what I can gather,
“made not entrant” just means that a JITed version proved to be invalid
in light of later code and new invocation paths won’t use the previous
version. And “made zombie” just means all references to an old JIT’d
version have been lost, making it available to be GC’ed.

A copy of `conj` becomes “not entrant” after being used on both vectors
and lists, but the new version gets the same speed-up when used on
vectors as a copy which has only been used on vectors. There’s
something else going on which is specifically affecting parallel calls
to the polymorphic version when applied to instances of
`PersistentList` and `Cons`.

-Marshall

Wm. Josiah Erikson

unread,
Dec 10, 2012, 4:18:42 PM12/10/12
to clo...@googlegroups.com
Interesting. I tried the following:
:jvm-opts ["-Xmx10g" "-Xms10g" "-XX:+AggressiveOpts" "-server" "-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m" "-XX:TLABSize=1G" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-XX:+UseParNewGC" "-XX:+ResizeTLAB" "-XX:+UseTLAB"]

I got a slight slowdown, and the GC details are as follows:

[josiah@compute-1-1 benchmark]$ /usr/bin/time -f %E lein run
0.852: [GC 0.852: [ParNew: 2796224K->6726K(3145728K), 0.0136610 secs] 2796224K->6726K(10136256K), 0.0136930 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
0.871: [GC 0.871: [ParNew: 2802950K->8104K(3145728K), 0.0116360 secs] 2802950K->8104K(10136256K), 0.0116720 secs] [Times: user=0.06 sys=0.01, real=0.01 secs]
0.890: [GC 0.890: [ParNew: 2804328K->11538K(3145728K), 0.0112460 secs] 2804328K->11538K(10136256K), 0.0112720 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
0.904: [GC 0.904: [ParNew: 2807762K->11752K(3145728K), 0.0092300 secs] 2807762K->11752K(10136256K), 0.0092550 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
0.915: [GC 0.915: [ParNew: 2807976K->10702K(3145728K), 0.0072210 secs] 2807976K->10702K(10136256K), 0.0072480 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
0.969: [GC 0.969: [ParNew: 2806926K->12249K(3145728K), 0.0206880 secs] 2806926K->12249K(10136256K), 0.0207160 secs] [Times: user=0.13 sys=0.01, real=0.02 secs]
21.099: [GC 21.099: [ParNew: 2808473K->14256K(3145728K), 0.0174230 secs] 2808473K->14256K(10136256K), 0.0174580 secs] [Times: user=0.12 sys=0.00, real=0.02 secs]
46.533: [GC 46.533: [ParNew: 2810480K->10070K(3145728K), 0.0097840 secs] 2810480K->10070K(10136256K), 0.0098140 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
74.988: [GC 74.988: [ParNew: 2806294K->11576K(3145728K), 0.0134020 secs] 2806294K->11576K(10136256K), 0.0134330 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
105.143: [GC 105.143: [ParNew: 2807800K->12728K(3145728K), 0.0121870 secs] 2807800K->12728K(10136256K), 0.0122240 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
136.170: [GC 136.170: [ParNew: 2808952K->13336K(3145728K), 0.0144400 secs] 2808952K->13336K(10136256K), 0.0144720 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
167.703: [GC 167.703: [ParNew: 2809560K->14763K(3145728K), 0.0105520 secs] 2809560K->14763K(10136256K), 0.0105830 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
199.593: [GC 199.593: [ParNew: 2810987K->11407K(3145728K), 0.0142030 secs] 2810987K->11407K(10136256K), 0.0142350 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
231.894: [GC 231.894: [ParNew: 2807631K->15066K(3145728K), 0.0129290 secs] 2807631K->15066K(10136256K), 0.0129630 secs] [Times: user=0.10 sys=0.01, real=0.01 secs]
264.239: [GC 264.239: [ParNew: 2811290K->9632K(3145728K), 0.0119130 secs] 2811290K->9632K(10136256K), 0.0119850 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
"Elapsed time: 291038.415325 msecs"
Heap
 par new generation   total 3145728K, used 2700935K [0x000000057ae00000, 0x0000000650350000, 0x0000000650350000)
  eden space 2796224K,  96% used [0x000000057ae00000, 0x000000061f239bb0, 0x00000006258b0000)
  from space 349504K,   2% used [0x000000063ae00000, 0x000000063b768340, 0x0000000650350000)
  to   space 349504K,   0% used [0x00000006258b0000, 0x00000006258b0000, 0x000000063ae00000)
 tenured generation   total 6990528K, used 0K [0x0000000650350000, 0x00000007fae00000, 0x00000007fae00000)
   the space 6990528K,   0% used [0x0000000650350000, 0x0000000650350000, 0x0000000650350200, 0x00000007fae00000)
 compacting perm gen  total 21248K, used 11049K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
   the space 21248K,  52% used [0x00000007fae00000, 0x00000007fb8ca638, 0x00000007fb8ca800, 0x00000007fc2c0000)
No shared spaces configured.
4:53.06
[josiah@compute-1-1 benchmark]$





Marshall Bockrath-Vandegrift

unread,
Dec 11, 2012, 4:37:29 AM12/11/12
to clo...@googlegroups.com
"nicola...@gmail.com" <nicola...@gmail.com> writes:

> What happens if you run it a third time at the end?  (The question
> is related to the fact that there appear to be transition states
> between monomorphic and megamorphic call sites, which might lead to
> an explanation.)

Same results, but your comment nudged my reading and analysis in what I
believe is the right direction.

This is a potentially grandiose claim, but I believe the inverse speedup
is due to contention caused by JVM type profiling in megamorphic call
sites. Unfortunately -XX:-UseTypeProfile doesn’t appear to turn off
type profiling itself, so I can’t prove this claim definitively without
going even further down the rabbit hole and grubbing through the JVM
source code.

To back up this claim, I present the following modified `conj*`:

(defn conj*
  [coll x]
  (let [a (long-array 32)]
    (dotimes [n 5] (dotimes [i 32] (aset a i n)))
    (clojure.lang.RT/conj coll x)))

And the resulting profiling numbers:

list-conj* : map-ms: 42.0, pmap-ms 24.8, speedup 1.69
cons-conj* : map-ms: 39.7, pmap-ms 25.1, speedup 1.58
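(Here `conj*` presumably stands in for `conj` in the reduce-based `reverse` from earlier in the thread – e.g. `#(reduce conj* '() %)` for the list case.)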

Adding busy-work (and an extra allocation!) to the loop improved the
speed-up, I believe by decreasing the relative portion of the call
execution time during which the callsite type profile information is
being updated. If I’m correct, the `Cons` and `PersistentList` `.cons`
implementations are so tight that in their base versions the type
profiling forms a significant enough portion of the total call that the
updates are in frequent conflict. Adding busy-work to the `conj*` call
adds some jitter which prevents them from contending quite as
frequently.

I’m not sure what the next steps are. Open a bug on the JVM? This is
something one can attempt to circumvent on a case-by-case basis, but
IMHO has significant negative implications for Clojure’s concurrency
story.

-Marshall

Lee Spector

unread,
Dec 11, 2012, 10:37:50 AM12/11/12
to clo...@googlegroups.com
On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote:
> I’m not sure what the next steps are. Open a bug on the JVM? This is
> something one can attempt to circumvent on a case-by-case basis, but
> IMHO has significant negative implications for Clojure’s concurrency
> story.

I've gotten a bit lost in some of the diagnoses and experimental results and analyses and suggestions -- all of which I really appreciate! -- but I'm trying to get clear in my mind what the current state of things is from an application perspective.

Is the following a fair characterization pending further developments?

---
If you have a cons-intensive task then even if it can be divided into completely independent, long-running subtasks, there is currently no known way to get significant speedups by running the subtasks on multiple cores within a single Clojure process. In some cases you will be able to get significant speedups by separating the subtasks completely and running them in separate Clojure processes running on separate JVM instances. But the speedups will be lost (mostly, and you might even experience slowdowns) if you try to run them from within a single Clojure process.
---

Or have I missed a currently-available work-around among the many suggestions?

I realize that even if this is the current situation it might change via fixes on any one of several fronts (and I hope that it does!), but is this indeed the current situation?

Thanks so much,

-Lee

Gary Johnson

unread,
Dec 11, 2012, 11:35:20 AM12/11/12
to clo...@googlegroups.com
Lee,

  My reading of this thread is not quite as pessimistic as yours. Here is my synthesis for the practical application developer in Clojure from reading and re-reading all of the posts above. Marshall and Cameron, please feel free to correct me if I screw anything up here royally. ;-)

When running code in parallel, the fastest to slowest ways to allocate memory via Clojure (of those examined above) are as follows (see the sketch after the list):
1. Java array
2. Java linked list
3. conj onto a transient vector
4. cons onto a list
5. conj onto a vector
6. DO NOT conj onto a list EVER!!!
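To make that concrete, here's a minimal sketch of each option (untested, just to illustrate the styles; each builds a 10,000-element collection):

(let [n 10000]
  ;; 1. Java array
  (let [a (long-array n)]
    (dotimes [i n] (aset a i i)))
  ;; 2. Java linked list
  (let [ll (java.util.LinkedList.)]
    (dotimes [i n] (.addFirst ll i)))
  ;; 3. conj onto a transient vector
  (persistent! (reduce conj! (transient []) (range n)))
  ;; 4. cons onto a list
  (reduce #(cons %2 %1) '() (range n))
  ;; 5. conj onto a vector
  (reduce conj [] (range n)))
;; 6. (reduce conj '() (range n)) -- the pathological case this thread is about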

And that's pretty much all there is to it until Marshall figures out how to hack the JVM to overcome this very subtle JIT deoptimization bug relating to polymorphic conj calls on Cons and PersistentList types.

  Good luck,
    ~Gary

Marshall Bockrath-Vandegrift

unread,
Dec 11, 2012, 11:40:49 AM12/11/12
to clo...@googlegroups.com
Lee Spector <lspe...@hampshire.edu> writes:

> Is the following a fair characterization pending further developments?
>
> If you have a cons-intensive task then even if it can be divided into
> completely independent, long-running subtasks, there is currently no
> known way to get significant speedups by running the subtasks on
> multiple cores within a single Clojure process.

Not quite. If you’d been using `cons` (in the benchmark, if `reverse`
used `cons` in its implementation), then you’d be getting a perfectly
reasonable speedup. The problem child in this instance is `conj`.

If my analysis is correct, then the issue is any megamorphic call site –
such as `conj` – which is invoked in a tight loop by multiple threads
simultaneously. Any simultaneous invocation of such call sites
introduces contention and reduces speedup, but the problem only becomes
pathological in very, very tight loops, such as when performing the
minimal work required by the `.cons` [1] implementations of `Cons` and
`PersistentList`. In these cases the portion of the call which
introduces contention is a sufficient proportion of the overall call
time that the speedup becomes inverse.

> In some cases you will be able to get significant speedups by
> separating the subtasks completely and running them in separate
> Clojure processes running on separate JVM instances. But the speedups
> will be lost (mostly, and you might even experience slowdowns) if you
> try to run them from within a single Clojure process.

For this particular issue, splitting each task into a separate JVM
entirely negates the problem, because there is no simultaneous
invocation of the same call site.

> Or have I missed a currently-available work-around among the many
> suggestions?

You can specialize your application to avoid megamorphic call sites in
tight loops. If you are working with `Cons`-order sequences, just use
`cons` instead of `conj`. If you are working with vectors, create your
own private implementation of `conj` which you *only* call on vectors.
If you are depending on operations which may/do use `conj` in tight
loops, create your own private re-implementations which don’t, such as
with any of the faster versions of `reverse` earlier in this thread.
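For instance, a fresh private copy used only on vectors might look like this (a sketch; the point is just that the new IFn gets its own call site, which then only ever sees one type):

(defn vconj
  "A private conj, *only* ever called on vectors, so its call site
  stays monomorphic."
  [v x]
  (clojure.lang.RT/conj v x))

(reduce vconj [] (range 1000))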

This is suboptimal, but it’s totally possible to work around the issue
with a little bit of analysis and profiling.

[1] Possible point of confusion – the JVM interface method invoked by
the Clojure `conj` function is named `.cons`, for what I assume are
historical reasons. The Clojure `cons` function on the other hand just
allocates a `Cons` object in an entirely monomorphic fashion.
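Concretely:

(conj '(2 3) 1) ; => (1 2 3), via the polymorphic .cons interface method
(cons 1 '(2 3)) ; => (1 2 3), always allocates a clojure.lang.Cons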

-Marshall

Lee Spector

unread,
Dec 11, 2012, 12:03:22 PM12/11/12
to clo...@googlegroups.com

On Dec 11, 2012, at 11:40 AM, Marshall Bockrath-Vandegrift wrote:
>
>> Or have I missed a currently-available work-around among the many
>> suggestions?
>
> You can specialize your application to avoid megamorphic call sites in
> tight loops. If you are working with `Cons`-order sequences, just use
> `cons` instead of `conj`. If you are working with vectors, create your
> own private implementation of `conj` which you *only* call on vectors.
> If you are depending on operations which may/do use `conj` in tight
> loops, create your own private re-implementations which don’t, such as
> with any of the faster versions of `reverse` earlier in this thread.
>
> This is suboptimal, but it’s totally possible to work around the issue
> with a little bit of analysis and profiling.

Very informative. Thanks and thanks also to Gary.

If the application does lots of "list processing" but does so with a mix of Clojure list and sequence manipulation functions, then one would have to write private, list/cons-only versions of all of these things? That is -- overstating it a bit, to be sure, but perhaps not entirely unfairly -- re-implement Clojure's Lisp?

-Lee

Marshall Bockrath-Vandegrift

unread,
Dec 11, 2012, 1:06:51 PM12/11/12
to clo...@googlegroups.com
Lee Spector <lspe...@hampshire.edu> writes:

> If the application does lots of "list processing" but does so with a
> mix of Clojure list and sequence manipulation functions, then one
> would have to write private, list/cons-only versions of all of these
> things? That is -- overstating it a bit, to be sure, but perhaps not
> entirely unfairly -- re-implement Clojure's Lisp?

I just did a quick look over clojure/core.clj, and `reverse` is the only
function which stood out to me as hitting the most pathological case.
Every other `conj` loop over a user-provided datastructure is `conj`ing
into an explicit non-list/`Cons` type.

So I think if you replace your calls to `reverse` and any `conj` loops
you have in your own code, you should see a perfectly reasonable
speedup.

-Marshall

Wm. Josiah Erikson

unread,
Dec 11, 2012, 2:54:02 PM12/11/12
to clo...@googlegroups.com
OK WOW. You hit the nail on the head. It's "reverse" being called in a pmap that does it. When I redefine my own version of reverse (I totally cheated and just stole this) like this:

(defn reverse-recursively [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))



I speed it up from, get this, 4:32 to 0:25. Yeah.

In case anybody's curious, pmap and pmapall show identical performance in this case.

Wow.

-Josiah

Wm. Josiah Erikson

unread,
Dec 11, 2012, 2:57:24 PM12/11/12
to clo...@googlegroups.com
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7 3770K, when before it was very very much not so. So for some reason, this bug was tickled more dramatically on AMD multicore processors than on Intel ones.

Andy Fingerhut

unread,
Dec 11, 2012, 3:00:12 PM12/11/12
to clo...@googlegroups.com
Marshall:

I'm not practiced in recognizing megamorphic call sites, so I could be missing some in the example code below, modified from Lee's original code. It doesn't use reverse or conj, and as far as I can tell doesn't use PersistentList, either, only Cons.

(defn burn-cons [size]
  (let [size (long size)]
    (loop [i (long 0)
           value nil]
      (if (>= i size)
        (last value)
        (recur (inc i) (clojure.lang.Cons.
                        (* (int i)
                           (+ (float i)
                              (- (int i)
                                 (/ (float i)
                                    (inc (int i))))))
                        value))))))

(a) invoke (burn-cons 2000000) sequentially 64 times in a single JVM

(b) invoke (burn-cons 2000000) 64 times using a modified version of pmap that limits the number of active threads to 2 (see below), in a single JVM. I would hope that this would take about half the elapsed time of (a), but the elapsed time is longer than (a).

(c) start up two JVMs simultaneously and invoke (burn-cons 2000000) sequentially 32 times in each. The elapsed time here is less than (a), as I would expect.

(Clojure 1.4, Oracle/Apple JDK 1.6.0_37, Mac OS X 10.6.8, running on a machine with an Intel Core i7 with 4 physical cores; OS X reports it as 8, I think because of 2 hyperthreads per core -- more details available on request).

Can you try to reproduce to see if you get similar results? If so, do you know why we get bad parallelism in a single JVM for this code? If there are no megamorphic call sites, then it is examples like this that lead me to wonder about locking in memory allocation and/or GC.


With the functions below, my part (b) was measured by doing:

(time (doall (nthreads-pmap 2 (fn [_] (burn-cons 2000000)) (unchunk (range 64)))))

Andy



(defn unchunk [s]
  ;; rebuild s one element at a time, so that chunked seqs (e.g. range)
  ;; don't defeat the bounded lookahead in nthreads-pmap below
  (when (seq s)
    (lazy-seq
      (cons (first s)
            (unchunk (next s))))))

(defn nthreads-pmap
  "Like pmap, except can take an argument nthreads to control the
  maximum number of parallel threads used."
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))]
     (nthreads-pmap n f coll)))
  ([nthreads f coll]
   (if (= nthreads 1)
     (map f coll)
     (let [n (dec nthreads)
           rets (map #(future (f %)) coll)
           step (fn step [[x & xs :as vs] fs]
                  (lazy-seq
                   (if-let [s (seq fs)]
                     (cons (deref x) (step xs (rest s)))
                     (map deref vs))))]
       (step rets (drop n rets)))))
  ([nthreads f coll & colls]
   (let [step (fn step [cs]
                (lazy-seq
                 (let [ss (map seq cs)]
                   (when (every? identity ss)
                     (cons (map first ss) (step (map rest ss)))))))]
     (nthreads-pmap nthreads #(apply f %) (step (cons coll colls))))))

Wm. Josiah Erikson

unread,
Dec 11, 2012, 3:12:07 PM12/11/12
to clo...@googlegroups.com
...and, suddenly, the high-core-count Opterons show us what we wanted and hoped for. If I increase that range statement to 100 and run it on the 48-core node, it takes 50 seconds (before it took 50 minutes), while the FX-8350 takes 3:31.89 and the 3770K takes 3:48.95. Thanks Marshall! I think you may have just saved us outrageous quantities of time, though I don't think Lee is convinced that we know exactly what's going on with clojush yet. We haven't looked at it yet though.... I'm sure we will soon enough!

Wm. Josiah Erikson

unread,
Dec 11, 2012, 3:17:05 PM12/11/12
to clo...@googlegroups.com
Hm. Interesting. For the record, the exact code I'm running right now that I'm seeing great parallelism with is this:


(defn reverse-recursively [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))

(defn burn
  ([] (loop [i 0
             value '()]
        (if (>= i 10000)
          (count (last (take 10000 (iterate reverse-recursively value))))
          (recur (inc i)
                 (cons
                   (* (int i)
                      (+ (float i)
                         (- (int i)
                            (/ (float i)
                               (inc (int i))))))
                   value)))))
  ([_] (burn)))

(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

(defn -main
  [& args]
  (time (doall (pmapall burn (range 100))))
  (System/exit 0))

Lee Spector

unread,
Dec 11, 2012, 7:37:18 PM12/11/12
to clo...@googlegroups.com

On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote:
> So I think if you replace your calls to `reverse` and any `conj` loops
> you have in your own code, you should see a perfectly reasonable
> speedup.

Tantalizing, but on investigation I see that our real application actually does very little explicitly with reverse or conj, and I don't actually think that we're getting reasonable speedups (which is what led me to try that benchmark). So while I'm not sure of the source of the problem in our application, I think there can be a problem even if one avoids direct calls to reverse and conj. Andy's recent tests also seem to confirm this.

BTW benchmarking our real application (https://github.com/lspector/Clojush) is a bit tricky because it's riddled with random number generator calls that can have big effects, but we're going to look into working around that. Recent postings re: seedable RNGs may help, although changing all of the RNG code may be a little involved because we use thread-local RNGs (to avoid contention and get good multicore speedups... we thought!).

-Lee

cameron

unread,
Dec 12, 2012, 3:11:00 AM12/12/12
to clo...@googlegroups.com

Hi Marshall,
  the megamorphic call site hypothesis does sound plausible, but I'm not sure where the following test fits in.
If I understand correctly, we believe that the base case (a PersistentList$EmptyList instance)
and the normal case (a PersistentList instance) have different types, and that when run in parallel the interleaving
invocations cause problems for the JIT, whereas when we use a vector it's a single type so we see a better speedup.

I was toying with the idea of replacing the EmptyList class with a PersistentList instance to mitigate the problem
in at least one common case, however it doesn't seem to help.
If I replace the reverse call in burn with the following code:
  #(reduce conj (list nil) %)
I get the same slowdown as we see with reverse (equivalent to #(reduce conj '() %))

where:
  (class '())        => clojure.lang.PersistentList$EmptyList
  (class (list nil)) => clojure.lang.PersistentList

I'm not convinced we're at the bottom of it yet; perhaps the earlier post on memory contention by Carlos might
yield some clues. If the issue is related to memory access, it might help explain why the impact differed
significantly between Intel and AMD hardware.

Cameron.

Marshall Bockrath-Vandegrift

unread,
Dec 12, 2012, 8:38:05 AM12/12/12
to clo...@googlegroups.com
Andy Fingerhut <andy.fi...@gmail.com> writes:

> I'm not practiced in recognizing megamorphic call sites, so I could be
> missing some in the example code below, modified from Lee's original
> code. It doesn't use reverse or conj, and as far as I can tell
> doesn't use PersistentList, either, only Cons.

...

> Can you try to reproduce to see if you get similar results? If so, do
> you know why we get bad parallelism in a single JVM for this code? If
> there are no megamorphic call sites, then it is examples like this
> that lead me to wonder about locking in memory allocation and/or GC.

I think your benchmark is a bit different from Lee’s original. The
`reverse`-based versions allocate heavily as they repeatedly
reverse a sequence, but each thread will hold a sequence of length at
most 10,000 at any given time. In your benchmark, each thread holds a
sequence of at most 2,000,000 elements, for a naive 200x increase in
memory pressure and a potential increase in the number of objects being
promoted out of the young generation.

I ran your benchmark under a version of Cameron’s criterium-based
speed-up measurement wrapper I’ve modified to pass in the `pmap`
function to use. I reduced the number of iterations in your algorithm
by a factor of 5 to get it to run in a reasonable amount of time. And I
ran it using default JVM GC settings, on a 32-way AMD system.

I get the following numbers for 1-32 way parallelism with a 500MB heap:

andy 1 : smap-ms 7.5, pmap-ms 7.7, speedup 0.97
andy 2 : smap-ms 7.8, pmap-ms 9.8, speedup 0.80
andy 4 : smap-ms 8.5, pmap-ms 10.6, speedup 0.80
andy 8 : smap-ms 8.6, pmap-ms 11.5, speedup 0.75
andy 16 : smap-ms 8.1, pmap-ms 12.5, speedup 0.65
andy 32 : [java.lang.OutOfMemoryError: Java heap space]

And these numbers with a 4GB heap:

andy 1 : smap-ms 3.8, pmap-ms 4.0, speedup 0.95
andy 2 : smap-ms 4.2, pmap-ms 2.1, speedup 2.02
andy 4 : smap-ms 4.2, pmap-ms 1.7, speedup 2.48
andy 8 : smap-ms 4.2, pmap-ms 1.2, speedup 3.44
andy 16 : smap-ms 4.4, pmap-ms 1.0, speedup 4.52
andy 32 : smap-ms 4.0, pmap-ms 1.6, speedup 2.55

I’m running out of time for breakfast experiments, but it seems
relatively likely to me that the increased at-once sequence size in your
benchmark is increasing the number of objects making it out of the young
generation. This in turn is increasing the number of pause-the-world
GCs, which increase even further in frequency at lower heap sizes. I’ll
run these again later with GC logging and report if the results are
unexpected.

-Marshall

Marshall Bockrath-Vandegrift

unread,
Dec 12, 2012, 8:51:57 AM12/12/12
to clo...@googlegroups.com
cameron <cdo...@gmail.com> writes:

>   the megamorphic call site hypothesis does sound plausible but I'm
> not sure where the following test fits in.

...

> I was toying with the idea of replacing the EmptyList class with a
> PersistsentList instance to mitigate the problem
> in at least one common case, however it doesn't seem to help.
> If I replace the reverse call in burn with the following code:
>   #(reduce conj (list nil) %)
> I get the same slowdown as we see if reverse (equivalent to #(reduce
> conj '() %))

Ah, but include your own copy of `conj` and try those two cases. The
existing clojure.core/conj has already been used on multiple types, so
you need a new IFn class with a fresh call site. Here are the numbers I
get when I do that:

w/o EmptyList : smap-ms 6.1, pmap-ms 1.2, speedup 5.26
w/ EmptyList : smap-ms 10.4, pmap-ms 16.2, speedup 0.64
w/o EmptyList : smap-ms 10.5, pmap-ms 16.3, speedup 0.64

That said, I’m slightly less convinced than I was earlier. I’m having
difficulty producing a minimal example demonstrating the issue, and the
results wmjosiah reported for modifying their actual code are
disheartening. I just don’t have anything else which begins to explain
the transition from speedup to inverse speedup in the above benchmarking
sequence.

-Marshall

Andy Fingerhut

unread,
Dec 12, 2012, 10:03:28 AM12/12/12
to clo...@googlegroups.com
Lee:

I believe you said that with your benchmarking code achieved good speedup when run as separate JVMs that were each running a single thread, even before making the changes to the implementation of reverse found by Marshall. I confirmed that on my own machine as well.

Have you tried running your real application in a single thread in a JVM, and then run multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor.

Andy

Lee Spector

unread,
Dec 12, 2012, 10:11:13 AM12/12/12
to clo...@googlegroups.com

On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote:
>
> Have you tried running your real application in a single thread in a JVM, and then run multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor.

I don't think we've tried this exact test but we will now. Because we haven't yet engineered all of the randomness out of the application we'll do a bunch of runs in each condition and average the times. Based on recent tests I think this will give reasonably stable results if we do 10 runs or so in each condition. I'll keep you posted.

Thanks!

-Lee

Christophe Grand

unread,
Dec 12, 2012, 10:45:45 AM12/12/12
to clojure
Lee, while you are at benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently.

Christophe






--
On Clojure http://clj-me.cgrand.net/
Clojure Programming http://clojurebook.com
Training, Consulting & Contracting http://lambdanext.eu/

Lee Spector

unread,
Dec 12, 2012, 11:20:03 AM12/12/12
to clo...@googlegroups.com

On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote:
> Lee, while you are at benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently.

I'm not actually sure how to do that. We're starting runs with "lein run", which, if I understand correctly, launches a JVM and a Clojure instance within it. Within that Clojure instance we can do things in either a single-threaded or a multi-threaded way. How would I change our method to do what you're asking here? We're not particularly Java savvy (I come from the Lisp side of things).

Thanks,

-Lee

cameron

unread,
Dec 12, 2012, 4:35:08 PM12/12/12
to clo...@googlegroups.com


On Thursday, December 13, 2012 12:51:57 AM UTC+11, Marshall Bockrath-Vandegrift wrote:
cameron <cdo...@gmail.com> writes:

>   the megamorphic call site hypothesis does sound plausible but I'm
> not sure where the following test fits in.

...

> I was toying with the idea of replacing the EmptyList class with a
> PersistsentList instance to mitigate the problem
> in at least one common case, however it doesn't seem to help.
> If I replace the reverse call in burn with the following code:
>   #(reduce conj (list nil) %)
> I get the same slowdown as we see if reverse (equivalent to #(reduce
> conj '() %))

Ah, but include your own copy of `conj` and try those two cases.  The
existing clojure.core/conj has already been used on multiple types, so
you need a new IFn class with a fresh call site.  Here are the numbers I
get when I do that:
 
Ah, I've also been looking at this in the morning and missed that bit.
When I use a copy of conj I get similar results to yourself:
  w/ list-base             : map-ms: 4.0, pmap-ms 0.6, speedup 7.02
  w/ empty-list-base : map-ms: 9.9, pmap-ms 20.4, speedup 0.48
  w/ list-base             : map-ms: 9.4, pmap-ms 20.7, speedup 0.45
  w/ vector                 : map-ms: 4.5, pmap-ms 1.4, speedup 3.28

It does seem that once used on the empty list case, conj gets tainted and all future list
uses incur a performance penalty. On that basis it would seem reasonable
to convert the EmptyList uses in clojure.lang to a PersistentList instance.

Interestingly, using a tainted conj with other types doesn't seem to incur the penalty
(see the vector entry in the times above).

If I get time I might look at a protocol-based version of conj out of interest; something like the sketch below.
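A protocol-based version might look roughly like this (an untested sketch of my own; each extended type gets its own method body, and protocol call sites dispatch through their own inline cache rather than the JIT-profiled interface call):

(defprotocol Conjable
  (pconj [coll x]))

(extend-protocol Conjable
  clojure.lang.IPersistentVector
  (pconj [v x] (.cons ^clojure.lang.IPersistentVector v x))
  clojure.lang.PersistentList
  (pconj [l x] (.cons ^clojure.lang.PersistentList l x))
  clojure.lang.Cons
  (pconj [c x] (clojure.lang.Cons. x c)))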

Cameron.

Christophe Grand

unread,
Dec 12, 2012, 6:09:54 PM12/12/12
to clojure
See https://github.com/flatland/classlojure for a nearly ready-made solution to running several Clojures in one JVM.




Wm. Josiah Erikson

unread,
Dec 13, 2012, 1:41:59 PM12/13/12
to clo...@googlegroups.com
OK, I did something a little bit different, but I think it proves the same thing we were shooting for.

On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom's "Bowling" benchmark:

1: multithreaded. Average of 10 runs: 14:00.9
2. singlethreaded. Average of 10 runs: 23:35.3
3. singlethreaded, 8 simultaneous copies. Average run time of said concurrently running copies: 23:31.5

So we see a speedup of less than 2x running multithreaded in a single JVM instance. By contrast, running 8 simultaneous copies in 8 separate JVMs gives us a perfect 8x speedup over running a single instance of the same singlethreaded benchmark. This proves pretty conclusively that it's not a hardware limitation, it seems to me.... unless the problem is that it's trying to spawn 48 threads, and that creates contention.

I don't think so though, because on an 8-way FX-8120 with 16GB of RAM, we see a very similar lack of speedup going from singlethreaded to multithreaded (and it will only be trying to use 8 threads, right?), and then we see a much better speedup (around 4x - we're doing 8 times the work in twice the amount of time) going to 8 concurrent copies of the same thing in separate JVMs (even though I had to decrease RAM usage on the 8 concurrent copies to avoid swapping, thereby possibly slowing this down a bit):
1. 9:00.6
2. 14:15.6
3. 27:35.1

We're probably getting a better speedup with the concurrent copies on the 48-way node because of higher memory bandwidth, bigger caches (and more of them), and more memory.

Does this help? Should I do something else as well? I'm curious to try running like, say 16 concurrent copies on the 48-way node....


    -Josiah

Wm. Josiah Erikson

unread,
Dec 13, 2012, 1:42:52 PM12/13/12
to clo...@googlegroups.com
Ah. We'll look into running several clojures in one JVM too. Thanks.

Andy Fingerhut

unread,
Dec 13, 2012, 1:53:25 PM12/13/12
to clo...@googlegroups.com
I'm not saying that I know this will help, but if you are open to trying a different JVM that has had a lot of work done on it to optimize it for high concurrency, Azul's Zing JVM may be worth a try, to see if it increases parallelism for a single Clojure instance in a single JVM, with lots of threads.

It costs $$, but I'm guessing they might have steep discounts for educational institutions.  They have free trials, too.


Andy

Wm. Josiah Erikson

unread,
Dec 13, 2012, 1:57:33 PM12/13/12
to clo...@googlegroups.com
Cool. I've requested a free trial.

cameron

unread,
Dec 13, 2012, 4:21:26 PM12/13/12
to clo...@googlegroups.com


On Friday, December 14, 2012 5:41:59 AM UTC+11, Wm. Josiah Erikson wrote:

Does this help? Should I do something else as well? I'm curious to try running like, say 16 concurrent copies on the 48-way node....

Have you made any progress on a small deterministic benchmark that reflects your application's behaviour (i.e. the RNG seed work you were discussing)? I'm keen to help, but I don't have time to look at benchmarks that take hours to run.

I've also been using Clojure for genetic programming and have been getting very good parallel speedups on a single JVM; this type of problem should see excellent performance gains, since it has many medium-size, CPU-intensive sub-problems that require no shared resources. I used Hadoop to scale beyond single machines; in the worst case you can use this approach and you should see speed-ups equivalent to your multiple-JVM test. I just split the population into small blocks and had each map function calculate the fitness and return a map of individual-id -> fitness.

Cameron.
   

Lee Spector

unread,
Dec 13, 2012, 4:58:41 PM12/13/12
to clo...@googlegroups.com

On Dec 13, 2012, at 4:21 PM, cameron wrote:
>
> Have you made any progress on a small deterministic benchmark that reflects your applications behaviour (ie. the RNG seed work you were discussing)? I'm keen to help, but I don't have time to look at benchmarks that take hours to run.
>
> I've also been using clojure for genetic programming and have been getting very good parallel speedups on a single jvm, this type of problem should see excellent performance gains, it has many medium size cpu intensive sub-problems that require no shared resources. I used hadoop to scale beyond single machines, in the worst case you can use this approach and you should see speed-ups equivalent to your multiple jvm test. I just split the population into small blocks and had each map function calculate the fitness and return a map of individual-id -> fitness.
>
> Cameron.


I don't think we've quite worked out a deterministic test, but I'll look into it some more.

I'd be interested in seeing your GP system. The one we're using evolves "Push" programs and I suspect that whatever's triggering this problem with multicore utilization is stemming from something in the inner loop of my Push interpreter (https://github.com/lspector/Clojush)... but I don't know exactly what it is.

I've never used hadoop, but I've done lots of coarse-grained parallel GP using (gasp) shell scripts and the file system to glue the independent parts together. I realize that I could go back to this sort of thing, or something more modern and reasonable like hadoop, but of course it'd be a lot nicer if parallelization via agents (or some other mechanism central to the language) just didn't have whatever pathology we've uncovered.

-Lee

Herwig Hochleitner

unread,
Dec 14, 2012, 3:28:01 PM12/14/12
to clo...@googlegroups.com
I've created a test harness for this as a leiningen plugin: https://github.com/bendlas/lein-partest

You can just put 

    :plugins [[net.bendlas/lein-partest "0.1.0"]]

into your project and run 

    lein partest your.ns/testfn 6

to run 6 threads/processes in parallel

The plugin then runs the function as follows:

- (a default of) 2 warmup runs
- 6 times serially
- 6 times in parallel with futures
- 6 times in parallel with classloaders
- 6 times in parallel with processes

A run with the original function (with an i of 1000 instead of 10000) on my machine reveals that running in separate classloaders yields roughly the same times as running in the same classloader.
This hints at the bottleneck being in the JVM instead of the Clojure runtime, provided I have used the classloader feature of Leiningen correctly (please verify).
I'd be very interested in the performance on Azul's JVM.

> lein partest example/burn 4
##################### Run serial ... warmup ...
  #  0, elapsed:       187.57 msecs
  #  1, elapsed:       184.83 msecs
  #  2, elapsed:       183.99 msecs
  #  3, elapsed:       184.33 msecs
##################### Run parallel, same classloader ... warmup ...
  #  1, elapsed:      1063.02 msecs
  #  2, elapsed:      1071.12 msecs
  #  3, elapsed:      1091.50 msecs
  #  0, elapsed:      1100.42 msecs
##################### Run parallel, distinct classloaders ... warmup ...
 warmup ...
 warmup ...
 warmup ...
  #  2, elapsed:      1135.74 msecs
  #  3, elapsed:      1118.14 msecs
  #  1, elapsed:      1119.45 msecs
  #  0, elapsed:      1103.45 msecs
##################### Run parallel, distinct processes ... warmup ...
 warmup ...
 warmup ...
 warmup ...
  #  0, elapsed:       191.96 msecs
  #  2, elapsed:       209.49 msecs
  #  3, elapsed:       191.99 msecs
  #  1, elapsed:       201.93 msecs

cameron

unread,
Dec 14, 2012, 10:41:10 PM12/14/12
to clo...@googlegroups.com
Thanks Herwig,
  I used your plugin with the following 2 burn variants:

(defn burn-slow [& _]
  (count (last (take 1000 (iterate #(reduce conj '() %) (range 10000))))))

(defn burn-fast [& _]
  (count (last (take 1000 (iterate #(reduce conj* (list nil) %) (range 10000))))))

Where conj* is just a copy of clojure.core/conj (see below). In the first case I see the same
behaviour you saw: fast serially and in separate processes (~115 ms) but slow in parallel for
both the same and different class loaders (708 ms - 950 ms).
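For reference, the copy is just the two-argument body of clojure.core/conj under a fresh name, which is what gives it a fresh call site:

(defn conj*
  [coll x]
  (clojure.lang.RT/conj coll x))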

In the second case, where I avoid any references to clojure.lang.PersistentList$EmptyList
and use a clean copy of conj, I see much better performance in all cases (~40 ms - 50 ms) and no slowdown
in the parallel cases in the same JVM.

Until Lee has a representative benchmark for his application it's difficult to tell if he's
experiencing the same problem, but there would seem to be a case for changing the PersistentList
implementation in clojure.lang.

Cameron.

cameron

unread,
Dec 15, 2012, 1:14:09 AM12/15/12
to clo...@googlegroups.com
 

I'd be interested in seeing your GP system. The one we're using evolves "Push" programs and I suspect that whatever's triggering this problem with multicore utilization is stemming from something in the inner loop of my Push interpreter (https://github.com/lspector/Clojush)... but I don't know exactly what it is.

Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in Java for my GP work but for the last few years it's been GEVA with a Clojure wrapper I wrote (https://github.com/cdorrat/geva-clj).


 I realize that I could go back to this sort of thing, or something more modern and reasonable like hadoop, but of course it'd be a lot nicer if parallelization via agents (or some other mechanism central to the language) just didn't have whatever pathology we've uncovered.


Agreed. I'd like Clojure to be my preferred language when I have a 100-core machine; it's certainly something I'd like to get to the bottom of.

Cameron.

Lee Spector

unread,
Dec 16, 2012, 4:45:57 PM12/16/12
to clo...@googlegroups.com

On Dec 15, 2012, at 1:14 AM, cameron wrote:
>
> Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java for my GP work but for the last few years it's been GEVA with a clojure wrapper I wrote (https://github.com/cdorrat/geva-clj).

Ah yes -- I've actually downloaded and played a bit with your GEVA via Clojure. I've thought it would be interesting to try using GEVA to evolve Push programs (which have a trivial grammar), but haven't had the time to try that.

I've also used ECJ for some projects but haven't looked at its multicore performance. Sean Luke (ECJ's author and incidentally a former student of mine) even hacked together a way to use it to evolve Push programs but it was nasty (generating Push programs as strings) and would be low-performance for other reasons.

-Lee

Lee Spector

unread,
Dec 16, 2012, 4:54:41 PM12/16/12
to clo...@googlegroups.com

On Dec 14, 2012, at 10:41 PM, cameron wrote:
> Until Lee has a representative benchmark for his application it's difficult to tell if he's
> experiencing the same problem but there would seem to be a case for changing the PersistentList
> implementation in clojure.lang.

We put together a version of our application in which we just replaced all of the random calls with deterministic functions (returning either constants or deterministic functions of their arguments).

What we saw was a maximum speedup of less than 2x on a 48-core machine, which was interesting in part because the determinism meant that only a tiny subset of our code was being executed (because all of the Push programs in the population were identical and used only a single Push instruction). So I think that this may indeed help us to home in on the problem.

Before we share that code, however, we should make sure that the evaluations that are being done concurrently are sufficiently long-running, and maybe tweak a couple of other things. I think we'll have a chance to do that early in the week and we'll share the results/code when we do.

Thanks,

-Lee

Wm. Josiah Erikson

unread,
Dec 19, 2012, 11:57:45 AM12/19/12
to clo...@googlegroups.com
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly.

I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here: clojush_bowling_benchmark.tar.gz

Then run, for instance: /usr/bin/time -f %E lein run clojush.examples.benchmark-bowling

and then, when that has finished, edit src/clojush/examples/benchmark_bowling.clj and uncomment ":use-single-thread true" and run it again. I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse. We don't see slowdowns, but I cannot get any better than around 2x speedup on any hardware with this benchmark.

I hope this helps people get to the bottom of things.

-Josiah


Wm. Josiah Erikson

unread,
Dec 19, 2012, 12:00:09 PM12/19/12
to clo...@googlegroups.com
Whoops, sorry about the link. It should be able to be found here: http://gibson.hampshire.edu/~josiah/clojush/

On Wed, Dec 19, 2012 at 11:57 AM, Wm. Josiah Erikson <wmjo...@gmail.com> wrote:
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly.

I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here: clojush_bowling_benchmark.tar.gz

Then run, for instance: /usr/bin/time -f %E lein run clojush.examples.benchmark-bowling

and then, when that has finished, edit src/clojush/examples/benchmark_bowling.clj and uncomment ":use-single-thread true" and run it again. I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse. We don't see slowdowns, but I cannot get any better than around 2x speedup on any hardware with this benchmark.

Lee Spector

unread,
Dec 19, 2012, 1:00:09 PM12/19/12
to clo...@googlegroups.com

On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote:
> I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse.

Clarification: it's not just a tight loop involving reverse/conj, as our previous benchmark was. It's our real application but with deterministic versions of all of the "random" functions, and while the project includes some calls to reverse and conj I don't think they're playing a big role here.

Almost all of the time here is spent evaluating a Push program that just does a lot of integer addition and consing. As far as I can tell all of the consing is done with "cons" explicitly, and not conj.... although maybe I'm missing something, and I'm saying this only from looking at our code, not the Clojure libraries.

-Lee

Wm. Josiah Erikson

unread,
Dec 19, 2012, 1:23:51 PM12/19/12
to clo...@googlegroups.com
I tried redefining the few places in the code (string_reverse, I think) that used reverse to use the same version of reverse that I got such great speedups with in your code, and it made no difference. There are not any explicit calls to conj in the code that I could find.



Tassilo Horn

unread,
Dec 19, 2012, 1:30:56 PM12/19/12
to Wm. Josiah Erikson, clo...@googlegroups.com
"Wm. Josiah Erikson" <wmjo...@gmail.com> writes:

> Then run, for instance: /usr/bin/time -f %E lein run
> clojush.examples.benchmark-bowling
>
> and then, when that has finished, edit
> src/clojush/examples/benchmark_bowling.clj and uncomment
> ":use-single-thread true" and run it again. I think this is a
> succinct, deterministic benchmark that clearly demonstrates the
> problem and also doesn't use conj or reverse. We don't see slowdowns,
> but I cannot get any better than around 2x speedup on any hardware
> with this benchmark.

FWIW, I've just run it with these results:

- a dual-core intel notebook:
+ single-threaded: 4:00.09
+ multi-threaded: 2:27.35

- an AMD Opteron 8-core machine [*]
+ single-threaded: 3:03.51
+ multi-threaded: 1:31.58

So indeed, I also get only a speedup of factor 2.

[*] I'm not exactly sure what kind of machine that is. /proc/cpuinfo
reports 8 processors, each being a "Quad-Core AMD Opteron(tm) Processor
8387". Well, that would make 32 cores, but actually htop shows just 8.
But I think it's some virtualized machine, so maybe the host has 8
quad-cores, but the virtual machine gets only 8 of them...

Bye,
Tassilo

Marshall Bockrath-Vandegrift

unread,
Dec 21, 2012, 5:22:57 PM12/21/12
to clo...@googlegroups.com
"Wm. Josiah Erikson" <wmjo...@gmail.com> writes:

> I hope this helps people get to the bottom of things.

Not to the bottom of things yet, but found some low-hanging fruit –
switching the `push-state` from a struct-map to a record gives a flat
~2x speedup in all configurations I tested. So, that’s good?

I have however eliminated to my satisfaction the possibility that this
is something orthogonal to your system. I do see some speedup
improvements when I lower the level of concurrency to the number of
actual physical cores on my system, but each call to the
`error-function` still takes ~10x as long to complete when run in
parallel vs in serial.

For a while I had some hope that atom-access in the main interpreter
loop was the problem, due to causing extraneous fetches to main memory.
But removing all the atoms from the system didn’t have any appreciable
impact.

Anyway, still poking at it.

-Marshall

Lee Spector

unread,
Dec 21, 2012, 6:37:38 PM12/21/12
to clo...@googlegroups.com

On Dec 21, 2012, at 5:22 PM, Marshall Bockrath-Vandegrift wrote:
> Not to the bottom of things yet, but found some low-hanging fruit –
> switching the `push-state` from a struct-map to a record gives a flat
> ~2x speedup in all configurations I tested. So, that’s good?

I really appreciate your attention to this Marshall.

FWIW I used records for push-states at one point but did not observe a speedup and it required much messier code, so I reverted to struct-maps. But maybe I wasn't doing the right timings. I'm curious about how you changed to records without the messiness. I'll include below my sig the way that I had to do it... maybe you can show me what you did instead.

Still, I guess the gorilla in the room, which is eating the multicore performance, hasn't yet been found.

-Lee

--- tl;dr: the structs vs. records thing ---

Here's how I do what I need to do with struct-maps:

;; this is defined elsewhere, and I want push-states to have fields for each push-type that's defined here
(def push-types '(:exec :integer :float :code :boolean :string :zip
                  :tag :auxiliary :return :environment))

;; so I use a macro to produce the struct-map definition
(defmacro define-push-state-structure []
  `(defstruct push-state ~@push-types))

;; call the macro to define the struct
(define-push-state-structure)

;; and then when I want a new push-state I can just call struct-map:
(defn make-push-state
  "Returns an empty push state."
  []
  (struct-map push-state))

In order to do the same thing with records I had to resort to this:

(defn keyword->symbol [kwd]
  "Returns the symbol obtained by removing the : from a keyword."
  (read-string (name kwd)))

(defmacro define-push-state-record-type []
  "Defines the pushstate record type. The odd trick with read-string was a hack to
  avoid namespace qualification on the pushstate symbol."
  `(defrecord ~(read-string "pushstate") [~@(map keyword->symbol push-types)]))

(define-push-state-record-type)

(defmacro make-push-state
  "Returns an empty push state."
  []
  `(pushstate. ~@(map (fn [_] nil) push-types)))

Is there a much simpler way that I overlooked?




Marshall Bockrath-Vandegrift

unread,
Dec 21, 2012, 6:57:28 PM12/21/12
to clo...@googlegroups.com
Lee Spector <lspe...@hampshire.edu> writes:

> FWIW I used records for push-states at one point but did not observe a
> speedup and it required much messier code, so I reverted to
> struct-maps. But maybe I wasn't doing the right timings. I'm curious
> about how you changed to records without the messiness. I'll include
> below my sig the way that I had to do it... maybe you can show me what
> you did instead.

I just double-checked, and I definitely see a >2x speedup on Josiah’s
benchmark. That may still be synthetic, of course. Here’s what I did:

(eval `(defrecord ~'PushState [~'trace ~@(map (comp symbol name) push-types)]))

(let [empty-state (map->PushState {})]
  (defn make-push-state
    "Returns an empty push state."
    [] empty-state))

> Still, I guess the gorilla in the room, which is eating the multicore
> performance, hasn't yet been found.

No, not yet... I’ve become obsessed with figuring it out though, so
still slogging at it.

-Marshall

Meikel Brandmeyer

unread,
Dec 21, 2012, 6:59:37 PM12/21/12
to clo...@googlegroups.com
Hi,

On 22.12.12 00:37, Lee Spector wrote:

> ;; this is defined elsewhere, and I want push-states to have fields for each push-type that's defined here
> (def push-types '(:exec :integer :float :code :boolean :string :zip
> :tag :auxiliary :return :environment)
>
> (defn keyword->symbol [kwd]
> "Returns the symbol obtained by removing the : from a keyword."
> (read-string (name kwd)))

I'm afraid this one is still necessary, but I would use symbol instead
of read-string.
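That is:

(defn keyword->symbol
  "Returns the symbol obtained by removing the : from a keyword."
  [kwd]
  (symbol (name kwd)))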

> (defmacro define-push-state-record-type []
> "Defines the pushstate record type. The odd trick with read-string was a hack to
> avoid namespace qualification on the pushstate symbol."
> `(defrecord ~(read-string "pushstate") [~@(map keyword->symbol push-types)]))

Use ~'pushstate instead of read-string.

> (define-push-state-record-type)
>
> (defmacro make-push-state
> "Returns an empty push state."
> []
> `(pushstate. ~@(map (fn [_] nil) push-types)))

Use this:

(defn make-push-state
  []
  (map->pushstate {}))

> Is there a much simpler way that I overlooked?

I'm not sure it's simpler, but it's more straightforward, I'd say.

Kind regards
Meikel


Lee Spector

unread,
Dec 22, 2012, 9:51:33 AM12/22/12
to clo...@googlegroups.com

On Dec 21, 2012, at 6:59 PM, Meikel Brandmeyer wrote:
>
>> Is there a much simpler way that I overlooked?
>
> I'm not sure it's simpler, but it's more straight-forward, I'd say.
>

Thanks Marshall and Meikel for the struct->record conversion code. I'll definitely make a change along those lines.

-Lee

cameron

unread,
Dec 24, 2012, 3:53:29 AM12/24/12
to clo...@googlegroups.com
I've been moving house for the last week or so but I'll also give the benchmark another look.
My initial profiling seemed to show that the parallel version was spending a significant amount of time in java.lang.Class.isArray;
clojush.pushstate/stack-ref is calling nth on the result of cons, and since that isn't an instance of clojure.lang.Indexed, nth resorts to a series of tests before returning the value (including isArray).

I changed the clojush.pushstate/push-item implementation to
   (assoc state type (vec (cons value (type state)))) ;; vec added to ensure result is indexable

This slowed down my single-threaded version a bit but improved my parallel speedup from 2x to 3x on a machine with 8 physical cores. We could easily improve this by replacing the cons with conj and updating the code that pops the state, or by implementing a more efficient indexable stack with deftype (sketched below).
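Something like this (an untested sketch, names my own invention: a vector-backed stack with the top of the stack at index 0 and O(1) nth):

(deftype IndexedStack [v]
  clojure.lang.Indexed
  (nth [_ i] (nth v (- (count v) 1 i)))
  (nth [_ i not-found] (nth v (- (count v) 1 i) not-found))
  clojure.lang.Counted
  (count [_] (count v)))

(defn push-stack [^IndexedStack s x]
  (IndexedStack. (conj (.v s) x)))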

After the change above, clojush.interpreter/execute_instruction is looking like a hotspot, with clojure.lang.ArraySeq creation seeming to spend more time in java.lang.Class.getComponentType() in the parallel version than in the serial one.

Cameron.

cameron

unread,
Dec 28, 2012, 6:29:36 PM12/28/12
to clo...@googlegroups.com
Hi Lee,
  I've done some more digging and seem to have found the root of the problem:
it seems that Java native methods are much slower when called in parallel.
The following code illustrates the problem:

(letfn [(time-native [f]         
         (let [c (class [])]
           (time (dorun (f
                         (fn [_] (.isArray c))
                         (range 10000000))))))]
  (println "Sequential Test:")
  (time-native map)
  (println "Parallel Test:")
  (time-native pmap))

On a dual quad-core Xeon box I get the following results:
  Sequential Test:
   "Elapsed time: 1054.807725 msecs"
  Parallel Test:
   "Elapsed time: 15302.605697 msecs"

i.e. the code executed in parallel was 15 times slower than the sequential version.
The same can be seen with isInstance and isArray members of java.lang.Class.

I'm not sure where to go from here; these functions are frequently used by clojure.core.
Perhaps someone with more JVM implementation knowledge can help?

Cameron.

Leonardo Borges

unread,
Dec 28, 2012, 6:34:42 PM12/28/12
to clo...@googlegroups.com

In that case isn't context switching dominating your test?

.isArray isn't expensive enough to warrant the use of pmap

Leonardo Borges
www.leonardoborges.com


cameron

unread,
Dec 28, 2012, 6:46:39 PM12/28/12
to clo...@googlegroups.com
No, it's not the context switching: changing isArray (a native method) to getAnnotations (a normal JVM method) gives the same time for both the parallel and serial versions.
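That is, the control was the same harness with the non-native method swapped in (a sketch):

(letfn [(time-nonnative [f]
          (let [c (class [])]
            (time (dorun (f (fn [_] (.getAnnotations c))
                            (range 10000000))))))]
  (println "Sequential Test:")
  (time-nonnative map)
  (println "Parallel Test:")
  (time-nonnative pmap))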

Cameron.

cameron

unread,
Dec 30, 2012, 9:28:04 PM12/30/12
to clo...@googlegroups.com
I've posted a patch with some changes here (https://gist.github.com/4416803); it includes the record change from earlier in the thread and a small change to interpret-instruction, and with it the benchmark runs >2x faster than the default, as it did for Marshall.
The patch also modifies the main loop to use a thread pool instead of agents and allows you to set the number of threads, which might help in diagnosing the parallel performance issue.

On the modified benchmark I'm seeing a ~4x speedup with the parallel version on an 8-core machine, and the profiler reports that the parallel version is using twice as much CPU time.

I also had another look at the native calls issue and modified the Clojure runtime to avoid most of the calls the profiler said were taking significantly more time in the parallel version. It did speed things up, but only by ~6%, not the large margin the profiling results had led me to believe was possible; it looks like the profiler overstates these methods' times. The modified Clojure 1.5 is available here https://github.com/cdorrat/clojure/commit/dfb5f99eb5d0a45165978e079284bab1f25bd79f if anyone's interested.

YourKit reports that a number of clojure.core functions take longer in the parallel version than in the serial one, and they all seem to be methods that make one or more instanceof or instance? calls, but given the results above I'm not sure how much weight to give this.

It seems the elephant is still in the room, responsible for ~50% of the CPU time :)

Cameron.

Wm. Josiah Erikson

unread,
Jan 10, 2013, 10:25:23 AM1/10/13
to clo...@googlegroups.com
Am I reading this right that this is actually a Java problem, and not Clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in industry?

Thanks for the patches though - it's nice to see some improvement... I'll be fascinated to see how this turns out in the end. Have we found a large Java bug?

Marshall Bockrath-Vandegrift

unread,
Jan 30, 2013, 8:39:52 PM1/30/13
to clo...@googlegroups.com
"Wm. Josiah Erikson" <wmjo...@gmail.com> writes:

> Am I reading this right that this is actually a Java problem, and not
> clojure-specific? Wouldn't the rest of the Java community have noticed
> this? Or maybe massive parallelism in this particular way isn't
> something commonly done with Java in the industry?
>
> Thanks for the patches though - it's nice to see some improvement...
> I'll be fascinated to see how this turns out in the end. Have we found
> a large Java bug?

Apologies for my very slow reply here. I keep thinking that I’ll have
more time to look into this issue, and keep having other things
requiring my attention. And on top of that, I’ve temporarily lost the
many-way AMD system I was using as a test-bed.

I very much want to see if I can get my hands on an Intel system to
compare to. My AMD system is in theory 32-way – two physical CPUs, each
with 16 cores. However, Linux reports (via /proc/cpuinfo) the cores in
groups of 8 (“cpu cores : 8” etc). And something very strange happens
when extending parallelism beyond 8-way... I ran several experiments
using a version of your whole-application benchmark I modified to
control the level of parallelism. At parallelism 9+, the real time it
takes to complete the benchmark hardly budges, but the user/CPU time
increases linearly with the level of parallelism! As far as I can tell,
multi-processor AMD *is* a NUMA architecture, which might potentially
explain things. But enabling the JVM NUMA options doesn’t seem to
affect the benchmark.
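For concreteness, the options I mean are along these lines (a sketch; the jar name is a stand-in for the actual benchmark entry point, and -XX:+UseNUMA only takes effect with a NUMA-aware collector such as the parallel one):

# JVM-side: NUMA-aware heap allocation
java -server -XX:+UseNUMA -XX:+UseParallelGC -jar benchmark.jar

# OS-side alternative: interleave allocations across NUMA nodes
numactl --interleave=all java -server -jar benchmark.jar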

I think next steps are two-fold: (1) examine parallelism vs real & CPU
time on an Intel system, and (2) attempt to reproduce the observed
behavior in pure Java. I’m keeping my fingers crossed that I’ll have
some time to look at this more soon, but I’m honestly not very hopeful.

In the mean time, I hope you’ve managed to exploit multi-process
parallelism to run more efficiently?

-Marshall

Lee Spector

unread,
Jan 30, 2013, 9:20:13 PM1/30/13
to clo...@googlegroups.com

FYI we had a bit of a discussion about this at a meetup in Amherst, MA yesterday, and while I'm not sufficiently on top of the JVM or system issues to have briefed everyone on all of the details, there has been a little followup since the discussion, including the results of some different experiments by Chas Emerick, at: http://www.meetup.com/Functional-Programming-Connoisseurs/messages/boards/thread/30946382

-Lee

Andy Fingerhut

unread,
Jan 30, 2013, 11:02:24 PM1/30/13
to clo...@googlegroups.com
Josiah mentioned requesting a free trial of the Zing JVM. Did you ever get access to that, and were you able to try your code running on it?

Again, I have no direct experience with their product to guarantee you better results -- just that I've heard good things about their ability to handle concurrent workloads with a different GC than most JVMs.

Andy

Chas Emerick

unread,
Jan 31, 2013, 8:37:19 AM1/31/13
to clo...@googlegroups.com
Keeping the discussion here would make sense, esp. in light of meetup.com's horrible "discussion board".

I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I wrote in that meetup thread, but Lee's challenge(s) were too hard to resist:

> "Would your conclusion be something like 'Intensive list processing can't currently be done in Java (or Clojure/JVM) in a way that takes reasonable advantage of multiple cores.'?"


> "I see that "rewrite without lists" might be the only way out, but that'd be a bummer. Clojure is a multicore Lisp; if you can't use it for multicore list processing then... sigh."

The nature of the `burn` program is such that I'm skeptical of the ability of any garbage-collected runtime (lispy or not) to scale its operation across multiple threads. It generates a huge amount of garbage that is held for a very long time (in GC terms), producing a great deal of contention on the heap between the active job threads and the GC implementation, which constantly has to clean up after them, compact the results, etc. The workload is allocation-bound rather than CPU-bound, so I don't see it being parallelized effectively.

I'd be very interested in seeing an implementation for a runtime that proves me wrong, i.e. can attain significant performance benefits by running the e.g. 8 `burn` "jobs" across multiple threads.

In an attempt to do just that (prove myself wrong), I thought I'd golf the `burn` program (again; onlookers can go look at the Java implementation I knocked out a few days ago here: https://gist.github.com/4682554) on Racket, presumably as well-suited to "multicore list processing" as any other. The results:

https://gist.github.com/4682453

Please forgive my perhaps pitiful Racket-fu. I'm sure there are threadpool and CAS libraries available, but it was entertaining to hack away with the threading and semaphore primitives. If I've goofed something up badly (it's been some time since I've schemed), please speak up.

The Racket and Java implementations are effectively equivalent:

`java -server -Xmx2g cemerick.burn $THREAD_COUNT` (oracle java 1.7.0_09)
1 thread: 61s
2 threads: 126s

`java -server -Xmx2g cemerick.burn $THREAD_COUNT` (oracle java 1.7.0_09)
1 thread: 47s
2 threads: 92s

`time ./racket -t ~/burn.scm -m $THREAD_COUNT` (racket 5.3.1)
1 thread: 45s
2 threads: 126s

The above was run on OS X 10.7.5 with 4GB of physical memory. If someone knows of ways to get better perf on racket, let me know. I tried running from the results of `raco make` and such, but saw no improvements; I guess it is strictly producing bytecode that is later JIT'ed, and not doing any kind of additional AOT optimizations...

It'd be interesting to see timings from different operating systems, and very interesting to see timings from running burn.scm on other Schemes or on a Common Lisp implementation.

Cheers,

- Chas

Marshall Bockrath-Vandegrift

unread,
Jan 31, 2013, 9:23:04 AM1/31/13
to clo...@googlegroups.com
Chas Emerick <ch...@cemerick.com> writes:

> Keeping the discussion here would make sense, esp. in light of
> meetup.com's horrible "discussion board".

Excellent. Solves the problem of deciding the etiquette of jumping on
the meetup board for a meetup one has never been involved in. :-)

> The nature of the `burn` program is such that I'm skeptical of the
> ability of any garbage-collected runtime (lispy or not) to scale its
> operation across multiple threads.

Bringing you up to speed on this very long thread, I don’t believe the
original `burn` benchmark is a GC issue, due to results cameron first
reported here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/83uXZjLi3iAJ

I then narrowed that to what I still believe to be the explanation here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/jd8dpmzEtEYJ

And have more compact demonstration benchmark results here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/tCCkjXxTUMEJ

I haven’t been able to produce those results in a minimal Java-only test
case, though.

Then Wm. Josiah posted a full-application benchmark, which appears to
have entirely different performance problems from the synthetic `burn`
benchmark. I’d rejected GC as the cause for the slowdown there too, but
ATM can’t recall why or what I tested, so GC may definitely be a
candidate to re-examine:

https://groups.google.com/d/msg/clojure/48W2eff3caU/K224Aqwkn5YJ

Quite looking forward to additional insight...

-Marshall

Chas Emerick

unread,
Jan 31, 2013, 10:15:03 AM1/31/13
to clo...@googlegroups.com
On Jan 31, 2013, at 9:23 AM, Marshall Bockrath-Vandegrift wrote:

> Bringing you up to speed on this very long thread, I don’t believe the
> original `burn` benchmark is a GC issue, due to results cameron first
> reported here:
>
> https://groups.google.com/d/msg/clojure/48W2eff3caU/83uXZjLi3iAJ
>
> [...]
>
> Quite looking forward to additional insight...

Yeah, the thread is far too large for me to reasonably digest (and there's inevitably a lot of noise in rounds of microbenchmarking golf ;-)

BTW, I just realized I copy/pasted the wrong command for the faster of the Java implementation benchmarks in the previous mail; it used -XX:-UseParallelGC:

`java -server -Xmx2g -XX:-UseParallelGC cemerick.burn $THREAD_COUNT`

I've not looked at clojush at all; it is AFAICT a complex application in its own right, and I wouldn't know where to start. My understanding is that the `burn` function is considered representative of the bulk of operations in clojush, but I'm happy to assume that that's not the case. There are likely a number of things that go into the results of the `burn` benchmark, never mind the apparent performance of clojush.

Here's what I know:

* Two separate implementations (Java and Racket) exhibit very similar multicore performance characteristics. In particular, the Java implementation would not suffer from any particular megamorphic inefficiencies, since no such calls are made in that program. Racket is far more of a black box to me than the JVM, but it does not provide much of any polymorphism at all, so JIT-related issues are presumably not in scope there.

* Broadly speaking, heap allocation and its evil twin, GC, sabotage parallelism and performance in general. Functional programming with immutable values has been "made possible" in large part by the gains registered by persistent data structures, structural sharing, lazy evaluation, and so on; without them, we're back to copy-on-write, which is roughly what `burn` represents in grand style.
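That contrast in miniature (a sketch with throwaway values, not taken from the benchmark):

(let [v (vec (range 1000000))]
  (time (count (conj v :x)))    ; structural sharing: near-constant work
  (time (count (reverse v))))   ; copy-style: a fresh million-cell list per call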

Now, all this digital ink spilled, and just for kicks, I go back to run the Clojure `burn` again (https://gist.github.com/4683413 in a 1.5.0-RC2 REPL started with jvm args of `-Xmx2g -XX:-UseParallelGC`):

1 thread: 38s
2 threads: 32s
4 threads: 23s

Hardly a 2x or 4x improvement, but I had previously obtained badly degrading timings corresponding to those in my previous mail. Microbenchmarking sucks. At least I got to twiddle with Racket again. :-P

Cheers,

- Chas

Lee Spector

unread,
Jan 31, 2013, 11:31:14 AM1/31/13
to clo...@googlegroups.com

On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote:
>>
>> Then Wm. Josiah posted a full-application benchmark, which appears to
>> have entirely different performance problems from the synthetic `burn`
>> benchmark. I’d rejected GC as the cause for the slowdown there too, but
>> ATM can’t recall why or what I tested, so GC may definitely be a
>> candidate to re-examine:
>>
>
> I've not looked at clojush at all; it is AFAICT a complex application in its own right, and I wouldn't know where to start. My understanding is that the `burn` function is considered representative of the bulk of operations in clojush, but I'm happy to assume that that's not the case. There's likely a number of things that go into the results of the `burn` benchmark, nevermind the apparent performance of clojush.

FWIW I wrote the "burn" benchmark because "lots of intensive list manipulation" is at the heart of our real applications, and that seemed to be a nicely minimal way to see how that could scale across cores.

Our real applications may indeed raise other issues too, but getting "lots of intensive list manipulation" to scale well is bound to be good.

FWIW the "full-application benchmark" that Josiah posted was made deterministic in a way that means that big chunks of our code won't actually be executing. The system generates and interprets lots of "Push" programs, and if I recall correctly the benchmark hardcodes all of the Push programs to be the same constant program, meaning that not all of the Push instructions will run, etc. It'd be trickier to get the "real" mix of instructions in a way that makes for an easily reproducible benchmark, and this benchmark is at least testing the basic interpreter and population-level infrastructure. If we can get this to scale reasonably then we could try to see how much that helps under more-completely "real" conditions.

-Lee

Andy Fingerhut

unread,
Sep 26, 2013, 11:43:40 PM9/26/13
to clo...@googlegroups.com, Lee Spector, Wm. Josiah Erikson
Adding to this thread from almost a year ago.  I don't have conclusive proof with experiments to show right now, but I do have some experiments that have led me to what I think is a plausible cause, not just of Clojure programs running more slowly when multi-threaded than when single-threaded, but of any programs running on the JVM's memory model doing so.  Qualification: my explanation would be true only for multi-threaded programs running on the JVM that store significant amounts of data in memory, even if the data written by each thread is only read by that thread and there is no locking or other inter-thread communication, i.e. for "embarrassingly parallel" problems.

Outline of the argument:

When you start a thread, and then wait for the thread to complete, the JVM memory model requires all loads and stores to satisfy certain restrictions.  One of these is that any store done before the thread is created should 'happen before' the thread start, and thus the updated stored values must be visible to the new thread.  'Visible' here means that the thread doing the store must cause the CPU it is running on to update main memory from whatever locally modified values it has written into its local cache.  That rule isn't so relevant to my argument.

The one that is relevant is that any store performed by the thread is considered to 'happen before' a join operation on the thread.  Thus any store done by a thread must be written back to main memory, *even if the store is to a JVM object that later becomes garbage*.

So imagine a single-threaded program that creates X bytes of garbage while it runs.  Those X bytes will definitely be written to the CPU's local cache, but they will only be written to main memory if the cache space runs out before the garbage collector does its work and allows that memory to be reused for allocations.  The CPU-to-local-cache bandwidth in many modern systems is significantly higher than the local-cache-to-main-memory bandwidth.

Now take that same program and spread its work across 2 or more threads, with a join at the end of each one.  For the sake of example, say that each thread will write X/N bytes of data while it runs.  Even if the only data needed later in the rest of the program is a single Long object, for example, all of those X/N bytes of data will be copied from the local cache to main memory (if that did not already happen before the thread terminated).

If the number of threads is large enough, the amount of data written from all local caches to main memory can be higher in the multi-threaded case than in the single-threaded case.

Anyway, that is my hypothesis about what could be happening here.  It isn't Clojure-specific, but it can be exacerbated by the common behavior of a lot of Clojure code to allocate significant amounts of memory that becomes garbage.
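A minimal sketch of the access pattern I have in mind (hypothetical numbers; each worker allocates a pile of short-lived cons cells but hands back only a count across the join):

(defn alloc-garbage
  "Builds a large, short-lived list and returns only its length."
  [n]
  (loop [i 0, acc '()]
    (if (== i n)
      (count acc)
      (recur (inc i) (cons i acc)))))

;; Same total allocation split across t threads; the deref (join) after
;; each future is where the memory model forces the thread's stores --
;; garbage included -- to become visible.
(defn run-workers [t n]
  (time (->> (repeatedly t #(future (alloc-garbage (quot n t))))
             doall            ; start every future before joining any
             (map deref)
             doall)))

;; Compare e.g. (run-workers 1 8000000) with (run-workers 8 8000000)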

Andy




Wm. Josiah Erikson

unread,
Sep 27, 2013, 7:29:25 AM9/27/13
to Andy Fingerhut, clo...@googlegroups.com, Lee Spector

Interesting! If that is true of Java (I don't know Java at all), then your argument seems plausible. Cache-to-main-memory writes still take many more CPU cycles than processor-to-cache writes (an order of magnitude more, last I knew). I don't think it's so much a bandwidth issue as latency, AFAIK. Thanks for thinking about this more, so long after the fact. We still see the issue.

Neale Swinnerton

unread,
Sep 27, 2013, 8:22:00 AM9/27/13
to clo...@googlegroups.com
The Disruptor project from LMAX has wrestled with these sorts of issues at length and achieved astounding levels of performance on the JVM.

Martin Thompson, the original author of the Disruptor, is a leading light in the JVM performance space; his Mechanical Sympathy blog is a goldmine of information and a must-read for anyone wanting to understand the JVM/hardware interface.

Find more info at:



Neale Swinnerton
{t: @sw1nn, w: sw1nn.com }

Wm. Josiah Erikson

unread,
Nov 5, 2013, 9:23:52 PM11/5/13
to clo...@googlegroups.com

Neat, thanks for that. I skimmed it and don't know enough about Java to be able to tell quickly how easily we can use this to our advantage, but perhaps somebody else on the list will know.


Dave Tenny

unread,
Nov 6, 2013, 9:32:20 AM11/6/13
to clo...@googlegroups.com
As someone who has recently been dabbling with Clojure for evaluation purposes, I wondered if anybody wanted to post some links to parallel Clojure apps that have been clear and easy parallelism wins for the types of applications Clojure was designed for (to contrast with the lengthy discussion and analysis in this topic, which is *hopefully* the exception and not the rule).

Any good links here?  Any high profile stuff?


