Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved results over pmap. It looks like this:
(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
  will not be lazy, 3) calls to f may occur in any order, to maximize
  multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]       ; one agent per element
    (dorun (map #(send % f) agents))   ; dispatch f to every agent
    (apply await agents)               ; block until all sends have run
    (doall (map deref agents))))       ; collect the results eagerly
Refer to Lee's first post for the benchmarking routine we're running.
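(Lee's burn isn't reproduced in this message, so for readers following along, here is a stand-in with the same flavor: CPU-bound and allocation-heavy, leaning on reverse the way the real one does. The body and constants below are mine, not Lee's; tune them so one call takes a few seconds on your machine.)

(defn burn
  "Stand-in for Lee's benchmark function (see his first post for the
  real one). Ignores its argument and grinds through list allocation
  by repeatedly reversing a list."
  [_]
  (count (last (take 2000 (iterate reverse (range 2000))))))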
I figured that, to work out whether Java's multithreading was the problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) to running 8 concurrent copies of (burn (rand-int 8)), or even just (burn 2), or 4 concurrent copies of (doall (map burn (range 2))). Does this make sense? I THINK it does. If it doesn't, that's cool - just let me know why and I'll feel less crazy, because I'm finding my results rather confounding.
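(In case it helps anyone reproduce this: the "concurrent copies" runs are just separate JVMs launched from the shell, along these lines. "bench.core" is a placeholder for whatever -main namespace wraps the burn call, not something from our project:)

for i in $(seq 1 8); do /usr/bin/time -f %E lein run -m bench.core 2 & done; wait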
On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do (doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))), and 1:48 to do (doall (pmapall burn (range 8))).
So that's weird: although pmapall slows things down less than pmap does, we still see no speedup over plain map. Watching processor utilization during these runs shows map using one core, and both pmap and pmapall using all four cores fully, as they should. So maybe the OS or the hardware just can't deal with running that many copies of burn at once? Maybe there's a memory bottleneck?
Now here's the weird part: it takes around 29 seconds to run four concurrent copies of (doall (map burn (range 2))), but around 33 seconds to run eight concurrent copies of (burn 2). Yes. Read that again. What? Watching top while this is going on shows what you would expect: with four concurrent copies I've got four copies of Java using 100% of a core each, and with eight concurrent copies I see eight copies of Java, each using around 50% of a core.
Also, by the way, it takes 48 seconds to run two concurrent copies of (doall (map burn (range 4))) and 1:07 to run two concurrent copies of (doall (pmap burn (range 4))).
What is going on here? Is Java's multithreading really THAT bad? This appears to me to show that Java, or Clojure, has something very seriously wrong with it, or has outrageous overhead when spawning new threads. No?
All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure 1.5.0-beta1.
(I tried increasing the memory allowed for the pmap and pmapall runs, even to 8g, and it doesn't help at all)
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP
Any thoughts or ideas?
There's more weirdness, too, in case anybody is interested. I'm getting results that vary strangely from other available benchmarks and make no sense to me. Check this out (these are incomplete, because I decided to dig deeper with the benchmarks above, but I think you'll see why this is so confusing if you know how fast these processors are "supposed" to be):
All runs used the same JVM options, Clojure version, and JVM as above.
Key:
  1. (doall (pmap burn (range 8)))
  2. (doall (map burn (range 8)))
  3. 8 concurrent copies of (doall (pmap burn (range 8)))
  4. 8 concurrent copies of (doall (map burn (range 8)))
  5. (doall (pmapall burn (range 8)))

Machine                      1        2        3        4        5
4x AMD Opteron 6168          4:02.06  2:20.29  -        -        -
AMD Phenom II X4 945         2:31.65  1:29.90  3:32.60  3:08.97  1:48.36
AMD Phenom II X6 1100T       2:03.71  1:14.76  2:20.14  1:57.38  2:14.43
AMD FX 8120                  4:50.06  1:25.04  5:55.84  2:46.94  4:36.61
AMD FX 8350                  3:42.35  1:13.94  3:00.46  2:06.18  3:56.95
Intel Core i7 3770K          0:44     1:37.18  2:29.41  2:16.05  0:44.42
2x Intel Paxville DP Xeon    6:26.112 3:20.149 8:09.85  7:06.52  5:55.29
Just tried my first foray into reducers, but I must not be understanding something correctly:
(time (r/map burn (doall (range 4))))
returns in less than a second on my MacBook Pro, whereas
(time (doall (map burn (range 4))))
takes nearly a minute.
This feels like unforced laziness (although it's not quite that fast), but clojure.core.reducers/map involves no laziness, right?
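(One check, if that's what's going on: r/map isn't lazy in the seq sense, but it also does no work on its own - it just returns a reducible, and burn only runs when something reduces it. This sketch is mine, not from the original post:)

(require '[clojure.core.reducers :as r])

;; Returns almost instantly: r/map just wraps the collection in a
;; reducible, so burn is never called here.
(time (r/map burn (doall (range 4))))

;; Forcing the reducible with into makes the work actually happen.
(time (into [] (r/map burn (range 4))))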
It’s like there’s a lock of some sort sneaking in on the `conj` path.
Any thoughts on what that could be?
cons-conj* : map-ms: 5.6,  pmap-ms: 1.1,  speedup: 5.08
list-conj* : map-ms: 10.1, pmap-ms: 15.9, speedup: 0.63
cons-conj* : map-ms: 10.0, pmap-ms: 15.6, speedup: 0.64
- Parallel allocation of `Cons` and `PersistentList` instances through a Clojure `conj` function remains fast as long as the function only ever returns objects of a single concrete type.
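(For anyone wanting to reproduce rows like the above, a harness along these lines produces them; the names and output shape here are my own sketch, not necessarily what generated those numbers:)

;; Hypothetical harness for the map-ms / pmap-ms / speedup rows.
(defn ms
  "Wall-clock milliseconds to run thunk f once."
  [f]
  (let [t0 (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) t0) 1e6)))

(defn compare-map-pmap
  "Times f over coll with map and with pmap and reports the speedup."
  [f coll]
  (let [m (ms #(doall (map f coll)))
        p (ms #(doall (pmap f coll)))]
    {:map-ms m, :pmap-ms p, :speedup (/ m p)}))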
-Marshall
cameron <cdo...@gmail.com> writes:
> the megamorphic call site hypothesis does sound plausible but I'm
> not sure where the following test fits in.
...
> I was toying with the idea of replacing the EmptyList class with a
> PersistentList instance to mitigate the problem in at least one
> common case, however it doesn't seem to help.
> If I replace the reverse call in burn with the following code:
> #(reduce conj (list nil) %)
> I get the same slowdown as we see with reverse (which is equivalent
> to #(reduce conj '() %)).
Ah, but include your own copy of `conj` and try those two cases. The
existing clojure.core/conj has already been used on multiple types, so
you need a new IFn class with a fresh call site. Here are the numbers I
get when I do that:
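(A sketch of what such a fresh copy might look like; my-conj is a made-up name. Each defn compiles to its own class, so the interop call below gets its own JVM call site and inline cache, independent of clojure.core/conj's history:)

;; Bypass clojure.core/conj (and RT/conj) so the virtual .cons call
;; below starts with an empty inline cache.
(defn my-conj [^clojure.lang.IPersistentCollection coll x]
  (.cons coll x))

;; Seeded with (list nil): every receiver is a PersistentList.
#(reduce my-conj (list nil) %)

;; Seeded with '(): the call site sees EmptyList once, then
;; PersistentList - two concrete receiver types.
#(reduce my-conj '() %)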
-Lee
Does this help? Should I do something else as well? I'm curious to try running, say, 16 concurrent copies on the 48-way node....
I'd be interested in seeing your GP system. The one we're using evolves "Push" programs and I suspect that whatever's triggering this problem with multicore utilization is stemming from something in the inner loop of my Push interpreter (https://github.com/lspector/Clojush)... but I don't know exactly what it is.
I realize that I could go back to this sort of thing, or something more modern and reasonable like Hadoop, but of course it'd be a lot nicer if parallelization via agents (or some other mechanism central to the language) just didn't have whatever pathology we've uncovered.
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly.
I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here: clojush_bowling_benchmark.tar.gz
Then run, for instance: /usr/bin/time -f %E lein run clojush.examples.benchmark-bowling
and then, when that has finished, edit src/clojush/examples/benchmark_bowling.clj, uncomment ":use-single-thread true", and run it again. I think this is a succinct, deterministic benchmark that clearly demonstrates the problem, and it doesn't use conj or reverse. We don't see slowdowns, but I cannot get better than around a 2x speedup on any hardware with this benchmark.
-Lee
In that case, isn't context switching dominating your test? .isArray isn't expensive enough to warrant the use of pmap.
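(To illustrate the point - this comparison is my own, not from the original post: pmap spawns a future per element, and for a call as cheap as .isArray the coordination costs far more than the work itself:)

(time (doall (map  #(.isArray (class %)) (range 100000))))
(time (doall (pmap #(.isArray (class %)) (range 100000))))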
Leonardo Borges
www.leonardoborges.com
Interesting! If that is true of Java (I don't know Java at all), then your argument seems plausible. Cache-to-main-memory writes still take many more CPU cycles (an order of magnitude more, last I knew) than processor-to-cache writes. I don't think it's so much a bandwidth issue as latency, AFAIK. Thanks for thinking about this more, so long after the fact. We still see the issue.
Neat, thanks for that. I skimmed it and don't know enough about Java to be able to tell quickly how easily we can use this to our advantage, but perhaps somebody else on the list will know.