Poor parallelization performance across 18 cores (but not 4)


David Iba

Nov 17, 2015, 12:38:39 AM
to Clojure
I have functions f1 and f2 below; say they take T1 and T2 to run as a single instance/thread.  The issue I'm facing is that running f2 in parallel across 18 cores takes anywhere from 2-5X T2 per run, and for more complex functions it takes absurdly long.

(defn f1 []
  (apply + (range 2e9)))

;; Note: each call to (f2) makes its own x* atom, so the 'swap!' should never retry.
(defn f2 []
  (let [x* (atom {})]
    (loop [i 1e9]
      (when-not (zero? i)
        (swap! x* assoc :k i)
        (recur (dec i))))))

Of note:
- On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2 for 4 runs in parallel).
- Running 18 f1's in parallel on the 18-core machine also parallelizes well.
- Disabling hyperthreading doesn't help.
- Based on jvisualvm monitoring, it doesn't seem to be GC-related.
- I also tried a dedicated 18-core EC2 instance with the same issues, so it's not shared-tenancy-related.
- If I make a jar that runs a single f2 and launch 18 of them in parallel, it parallelizes well (so I don't think it's machine/AWS-related).

Could it be that running 18 f2's in parallel on a single JVM instance is overworking the STM with all the swap!'s?  Any other theories?

Thanks!

Andy Fingerhut

Nov 17, 2015, 12:51:38 AM
to clo...@googlegroups.com
If you only have atoms and no refs, there is no STM involved, so it can't be STM-related.

I have a conjecture, but don't yet have a suggestion for an experiment that would prove or disprove it.

The JVM memory model requires that state changes made via something like swap! on an atom be visible to all threads.  I think this is often implemented by flushing local cache values to main memory, even if no other thread ever reads the value.

Your f1 code only does thread-local computation with no requirement to make its results visible to other threads.

Your f2 code must make its results visible to other threads.  Not only that, but each new value it must make visible allocates new memory (via the call to assoc).

Perhaps main memory is not fast enough to keep up with 18 cores running f2 at full rate, but it is fast enough to keep up with 4 cores running f2 at full rate?

Maybe collecting data on time to completion for every number of cores running f2 from 4 up to 18 on the same hardware would be illuminating?  Especially if it showed that there is some maximum total number of 'f2 iterations per second' that stays the same regardless of how many cores run f2 in parallel.
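
Something like the following sketch could collect that data (run-scaling-test is just an illustrative name, not code from the original post):

;; Illustrative sketch: time n parallel runs of f for each thread count
;; from 4 up to 18 and print the totals.
(defn run-scaling-test [f]
  (doseq [n (range 4 19)]
    (let [start (System/nanoTime)
          futs  (doall (repeatedly n #(future (f))))]
      (doseq [fut futs] @fut)
      (println n "threads:" (/ (- (System/nanoTime) start) 1e9) "seconds"))))

;; (run-scaling-test f2)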

I am not sure whether that would explain your results of running 18 separate processes each running 1 thread of f2 in parallel getting full speedup, unless the JVM can tell only one thread is running and thus no flushes to main memory are required.  Maybe try running 9 processes, each with 2 f2 threads, to see if it is as bad as 1 process with 18 threads?

Andy



Niels van Klaveren

Nov 17, 2015, 3:33:01 AM
to Clojure
Could you also show how you are running these functions in parallel and timing them?  The way you start the functions can have as much impact as the functions themselves.

Regards,
Niels

David Iba

Nov 17, 2015, 4:49:16 AM
to Clojure
Andy:  Interesting.  Thanks for educating me on the fact that atom swap's don't use the STM.  Your theory seems plausible... I will try those tests next time I launch the 18-core instance, but yeah, not sure how illuminating the results will be.

Niels: along the lines of this (so that each thread prints its time as well as printing the overall time):
(time
 (let [f f1
       n-runs 18
       futs (do (for [_ (range n-runs)]
                  (future (time (f)))))]
   (doseq [fut futs]
     @fut)))

David Iba

Nov 17, 2015, 5:00:45 AM
to Clojure
Correction: that "do" should be a "doall".  (My actual test code was a bit different, but each run printed some info when it started, so it doesn't have to do with delayed evaluation of lazy seqs or anything.)
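
With that fix, the launcher would look something like this (doall forces all the futures to start before any of them is dereferenced):

(time
 (let [f f1
       n-runs 18
       futs (doall (for [_ (range n-runs)]
                     (future (time (f)))))]
   (doseq [fut futs]
     @fut)))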

Andy Fingerhut

Nov 17, 2015, 2:28:49 PM
to clo...@googlegroups.com
David, you say "Based on jvisualvm monitoring, doesn't seem to be GC-related".

What is jvisualvm showing you related to GC and/or memory allocation when you tried the 18-core version with 18 threads in the same process?

Even memory allocation could become a point of contention, depending on how allocation works with many threads.  For example, it depends on whether a thread takes a global lock to get a large chunk of memory and then locally carves it up into the small pieces needed for each individual Java 'new', or takes a global lock for every 'new'.  The latter would give terrible performance as the number of cores increases, but I don't know how to tell which is the case without knowing more about how the memory allocator is implemented in your JVM.  Maybe digging through OpenJDK source code in the right place would tell?

Andy


gianluca torta

Nov 18, 2015, 10:13:27 AM
to Clojure
By the way, have you tried both Oracle JDK and OpenJDK, with the same results?
Gianluca

Timothy Baldridge

Nov 18, 2015, 10:38:55 AM
to clo...@googlegroups.com
This sort of code is somewhat the worst-case situation for atoms (or really for CAS). Clojure's swap! is based on the "compare-and-swap" or CAS operation that most x86 CPUs have as an instruction. If we expand swap! it looks something like this:

(loop [old-val @x*]
  (let [new-val (assoc old-val :k i)]
    (if (compare-and-swap x* old-val new-val)
      new-val
      (recur @x*))))

Compare-and-swap can be defined as "update the content of the reference to new-val only if the current value of the reference is equal to old-val".

So in essence, only one core can be modifying the contents of an atom at a time. If the atom is modified during the execution of the swap! call, then swap! will continue to re-run your function until it is able to update the atom without it being modified during the function's execution.

So let's say you have some super long task whose result you need to integrate into an atom. Here's one way to do it, but probably not the best:

(let [a (atom 0)]
  (dotimes [x 18]
    (future
        (swap! a long-operation-on-score some-param))))


In this case long-operation-on-score will need to be re-run every time another thread modifies the atom during the call. However, if our function only needs the state of the atom in order to add to it, then we can do something like this instead:

(let [a (atom 0)]
  (dotimes [x 18]
    (future
      (let [score (long-operation-on-score some-param)]
        (swap! a + score)))))

Now we only have a simple addition inside the swap! and we will have less contention between the CPUs because they will most likely be spending more time inside 'long-operation-on-score' instead of inside the swap.

TL;DR: do as little work as possible inside swap!; the more you have inside swap!, the higher the chance of throwing away work due to swap! retries.

Timothy
--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

David Iba

Nov 18, 2015, 11:00:01 AM
to Clojure
Timothy:  Each thread (call of f2) creates its own "local" atom, so I don't think there should be any swap retries.

Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into trying Oracle and report back.

Andy:  jvisualvm was showing pretty much all of the memory allocated in the eden space and a little in the first survivor (no major/full GC's), and total GC Time was very minimal.

I'm in the middle of running some more tests and will report back when I get a chance today or tomorrow.  Thanks for all the feedback on this!

Timothy Baldridge

Nov 18, 2015, 11:04:04 AM
to clo...@googlegroups.com
Oh, then I completely misunderstood the problem at hand here. If that's the case, then try the following:

Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes anything. 

Timothy

David Iba

Nov 18, 2015, 11:08:14 AM
to Clojure
No worries.  Thanks, I'll give that a try as well!

David Iba

Nov 19, 2015, 1:36:59 AM
to Clojure
OK, have a few updates to report:
  • Oracle vs OpenJDK did not make a difference
  • Whenever I run N>1 threads calling any of these functions with swap/vswap, there is some overhead compared to running 18 separate single-run processes in parallel.  This overhead seems to increase as N increases.
    • For both swap and vswap, the function timings from running 18 futures (from one JVM) show about 1.5X the time from running 18 separate JVM processes.
    • For the swap version (f2), very often a few of the calls would go rogue and take around 3X the time of the others.
      • this did not happen for the vswap version of f2.
  • Running 9 processes with 2 f2-calling threads each was maybe 4% slower than 18 processes of 1.
  • Running 4 processes with 4 f2-calling threads each was mostly the same speed as the 18x1, but there were a couple of those rogue threads that took 2-3X the time of the others.
Any ideas?

Herwig Hochleitner

Nov 19, 2015, 5:54:39 AM
to clo...@googlegroups.com
This reminds me of another thread, where performance issues related to concurrent allocation were explored in depth: https://groups.google.com/d/topic/clojure/48W2eff3caU/discussion
The main takeaway for me was that HotSpot slows down pretty dramatically as soon as two threads are allocating.

Could you try:

a) how performance develops when you take out the allocation (the assoc)? (see the sketch below)
b) whether increasing HotSpot's TLAB size makes any difference?
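
For (a), one way to strip the per-iteration allocation while keeping the write to the atom might be a sketch like this (f2-no-alloc is an illustrative name, not from the original posts; reset! stores an interned keyword, so no new object is built on each pass):

;; Variant of f2 with the per-iteration allocation removed: the write to
;; the atom remains, but no new map is built on each iteration.
(defn f2-no-alloc []
  (let [x* (atom :init)]
    (loop [i 1e9]
      (when-not (zero? i)
        (reset! x* :k)
        (recur (dec i))))))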

Fluid Dynamics

Nov 19, 2015, 8:08:55 AM
to Clojure

Try a one-element array and aset, and see if that's faster than atom/swap! and volatile!/vswap!. The latter two have memory barriers and the former does not, so if flushing the CPU cache is the key here, aset should be faster; if it's something else, it will probably be the same speed.
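
A sketch of that variant of f2 (f2-aset is an illustrative name; the assoc is kept so the allocation behaviour stays comparable to the original):

;; Same work as f2, but the value is stored in a one-element object array
;; via aset, which involves no memory barrier.
(defn f2-aset []
  (let [^objects x (object-array [{}])]
    (loop [i 1e9]
      (when-not (zero? i)
        (aset x 0 (assoc (aget x 0) :k i))
        (recur (dec i))))))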

David Iba

Nov 19, 2015, 8:55:55 AM
to clo...@googlegroups.com
Yeah, I actually tried using aset as well, and was still seeing these "rogue" threads taking much longer (although the ones that did finish in a normal amount of time had very similar completion times to those running in their own process.)

Herwig: I will try those suggestions when I get a chance.




Andy Fingerhut

Nov 19, 2015, 3:58:47 PM
to clo...@googlegroups.com
David:

No new suggestions to add right now.  Herwig's suggestion that it could be the Java allocator has some evidence for it given your results.  I'm not sure whether this StackOverflow Q&A on TLAB is fully accurate, but it may provide some useful info:

http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab

I mainly wanted to give you a virtual high-five, kudos, and thank-you thank-you thank-you thank-you thank-you for taking the time to run these experiments.  Similar performance issues with many threads in the same JVM on a many-core machine have come up before in the past, and so far I don't know if anyone has gotten to the bottom of it yet.

Andy

Niels van Klaveren

Nov 20, 2015, 3:53:42 AM
to Clojure
For what it's worth, here's the code I've been using while experimenting along with this at home.

Basically, it's a for loop over a collection of functions and a collection of core counts, running a fixed number of tasks.
So for every function it can step up from running f on one core n times to running f on x cores one time. I use com.climate/claypoole's unordered pmap, which gives a nice abstraction over spawning futures.

Included are two function sets: summation and key assoc (since the cross-comparison used in the OP bugged me a bit).
Suggestions for alterations are welcome, but the tests I ran seem to show that all variants of the functions slow down considerably the more they are run in parallel (2-3x overhead compared to a single-core run).

Granted, I could only test this on a 4-core (8 with hyperthreading) machine.
parallel-test.zip
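
For reference (this is not the attached code), the core of such a harness might look roughly like the following sketch, assuming com.climate/claypoole is on the classpath; bench-run is an illustrative name:

(require '[com.climate.claypoole :as cp])

;; Time n-tasks runs of f on a pool of n-threads and return total seconds.
(defn bench-run [f n-threads n-tasks]
  (cp/with-shutdown! [pool (cp/threadpool n-threads)]
    (let [start (System/nanoTime)]
      (dorun (cp/upmap pool (fn [_] (f)) (range n-tasks)))
      (/ (- (System/nanoTime) start) 1e9))))

;; e.g. (doseq [n [1 2 4 8]] (println n "threads:" (bench-run f2 n n)))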

David Iba

unread,
Nov 20, 2015, 4:09:06 AM11/20/15
to Clojure
Andy: Heh, glad to hear that I'm not the only one facing this issue, and I appreciate the encouragement since it's been kicking my ass the past week :)  On the bright side, as someone coming from more of a math background, this has forced me to learn a lot about how cpus/threads/memory/etc. work!

Herwig: I just got a chance to look through that thread you linked - sounds very very similar to what I'm encountering!

Niels: Glad to hear you're able to replicate the behavior.  I was also using claypoole's unordered pmap myself but excluded it in my code examples for simplicity :)  One thing to note that's tricky about benchmarking with hyperthreading enabled is that for fully CPU-bound jobs that don't share any cache and whatnot, if you're using all virtual-cores (8 in your case), a 2X slowdown would be expected.  Furthermore, if you launch less than the number of vCPUs available, it's possible that both threads get assigned to the same vCPU and thus again might run in 2X the time.  I noticed this seemed to happen more when the threads were spawned from the same java process (probably b/c it's presumed they can share cache) as opposed to separate processes.  So IMO the best way to test in this setting (without disabling HT) is to max out the vCPUs and compare against the expected 2X slowdown.

I think the "multiple threads allocating simultaneously" hypothesis makes the most sense so far.  This TLAB setting is interesting and I'll definitely give adjusting that a try - is setting the jvm option "-XX:+MinTLABSize" (like in the stackoverflow link Andy posted) the best way to go about this?