Why does the following Clojure code take 10x the time the C# version does? How to improve the Clojure version?

Amith George

unread,

May 14, 2015, 4:02:42 AM5/14/15

to clo...@googlegroups.com

I wrote the following code to solve this challenge - https://www.reddit.com/r/dailyprogrammer/comments/35s2ds/20150513_challenge_214_intermediate_pile_of_paper/.

Code - https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/56ce1dbb6a08e96150dc85934caecfeb68108a53/src/rdp/214_intermediate.clj

I executed the -main function using `lein run 1`.

Output

    ;; lein run 1

    0 12605919
    1 3578145
    2 15356894
    3 19134293
    4 2394558
    5 15030409
    6 6424953
    7 14893444
    8 1592254
    9 1914025
    10 7075106
    "Elapsed time: 501168.972435 msecs"

The code originally used an immutable hashmap, but I lost patience waiting for the computation to end. With mutable hashmap, it still takes around 8 mins.

I wrote a C# version of the above code - https://gist.github.com/amithgeorge/766b8220f39d48221e58. It finishes under 40secs. The C# exe was built under Release mode and executed directly from the commandline. I expected the Clojure version to perform similarly.

Any tips on what I am doing wrong?

-----
Explanation of the code - Create a vector of all paper sheets, such that the sheet placed last is the first element of the vector and the last element is the canvas. To compute the frequency of each visible color - for each point in the canvas find the first sheet in the vector that covers the point. Store/increment its count in the hashmap. I understand there might be better more efficient ways to solve this, but currently I am interested in why the Clojure versions is so slow vis-a-vis the C# version.

Amith George

unread,

May 14, 2015, 4:07:00 AM5/14/15

to clo...@googlegroups.com

I forgot to link the input files

https://raw.githubusercontent.com/fsufitch/dailyprogrammer/0e4bb5ba1e3bc6e749b9e9bb49387513d5a623b7/ideas/pile_of_paper/100rects100x100.in
https://raw.githubusercontent.com/fsufitch/dailyprogrammer/0e4bb5ba1e3bc6e749b9e9bb49387513d5a623b7/ideas/pile_of_paper/100rects100Kx100K.in

Mikera

unread,

May 14, 2015, 5:33:52 AM5/14/15

to clo...@googlegroups.com

My general rule of thumb is that "idiomatic" Clojure is about 10x slower than Java. So this result doesn't really surprise me.

If you want to get maximum performance, you will have to do some more advanced / non-idiomatic things like:

- Add type hints for classes and primitive values

- Use "deftype" for your data structures

- Use loop/recur rather than more functional looping constructs

- Avoid laziness

- Use Java arrays for accumulating / mutating values

- Use Java classes like java.util.HashMap, java.util.ArrayList as appropriate

Raoul Duke

unread,

May 14, 2015, 2:36:17 PM5/14/15

to clo...@googlegroups.com

Ditto F# vs. C#.

One has to wonder when / where / if functional-pure-immutable
approaches will ever under the covers get "fast enough"?

Timothy Baldridge

unread,

May 14, 2015, 3:01:57 PM5/14/15

to clo...@googlegroups.com

I think it false view that immutable/functional constructs are slow by default. Instead, a lot of it comes down to lack of support in the popular VMs for these constructs. For example the following code:

(defn add-fn [& args]
(reduce -add 0 args))

(loop [x 0]
(if (eq x 10000)
x
(recur (add-fn x 1))))

Is never going to be super fast in Clojure, but it compiles down to a loop with 4 assembly operations in Pixie (https://github.com/pixie-lang/pixie) all while remaining fully dynamic. But that's because the VM is tuned to optimize things like varargs and reduce. You can't really tap into the VM at that level on the CLR or the JVM.

That's not to say that Pixie is always faster than these platforms, on the contrary, it's often much slower on GC heavy operations. But hopefully someday we'll reach the holy grail of a VM that aggressively optimizes dynamic languages without requiring hinting or rewriting into less idiomatic forms.

Timothy

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

Jorge Branco

unread,

May 14, 2015, 3:03:17 PM5/14/15

to clo...@googlegroups.com

I could be a bit off here, but I'd say the speed differences between Clojure and Java are actually due to immutability AND weak-typing, while in C# vs F# the difference should be less pronounced and governed primarily by the use of immutability (i.e., using mutable data-structures yield results in the same order of magnitude)?

Alex Miller

unread,

May 14, 2015, 3:44:38 PM5/14/15

to clo...@googlegroups.com

The major problem here is that you are using boxed math for everything instead of primitives.

0) Switch to Clojure 1.7.0-beta3 - it's faster and some things below are dependent on it for best performance. And use Java 1.8.

1) Parse the lines you're reading directly into longs (Clojure focuses on 64-bit primitives - longs and doubles)

2) Put the longs first into a data structure that preserves the primitive type. The two best options for that here are records (which can have primitive fields) and arrays. I would create a Canvas defrecord with ^long width and height and a Paper defrecord with all ^long fields for example.

3) Store the papers in a vector (using transient to create it)

4) I suspect visible-color and covered? could probably be tightened up into a reduce over papers or a single loop-recur over papers - can't say I totally get what's happening there.

5) In visible-color-frequencies, you could use "update" instead of get + transient assoc! on the acc map, but this is never going to be terribly fast. Another option here would be to create an array with the max color (you could track that while reading if it's not a well-known answer) and bash the array. That can retain int or long counters and will be *way* faster.

6) You can use (set! *unchecked-math* :warn-on-boxed) to get faster math (no overflow checks) and also issue warnings (added in 1.7) if you happened to use boxed math by accident.

Alex Miller

unread,

May 14, 2015, 3:46:59 PM5/14/15

to clo...@googlegroups.com

Oh also, the default Leiningen settings are optimized for reducing startup, not for speed. Use:

^:jvm-opts ^:replace ["-server"] ;; maybe also set max heap with "-Xmx1g" in there

Or just don't use lein when perf testing.

Alex Miller

unread,

May 14, 2015, 3:47:51 PM5/14/15

to clo...@googlegroups.com

Gah. Meant in project.clj:

:jvm-opts ^:replace ["-server"] ;; maybe also set max heap with "-Xmx1g" in there

David Nolen

unread,

May 14, 2015, 3:53:17 PM5/14/15

to clojure

Also using `lein run` without supplying correct JVM settings (like -server) is going to lead to pretty misleading results.

Lee Spector

unread,

May 14, 2015, 4:46:01 PM5/14/15

to clo...@googlegroups.com

I'd like to get more guidance on how to specify :jvm-opts for maximum performance. I've received some help on this topic from people on this list in the past (thanks!), but I'm pretty sure that I'm still not doing things right.

I'm almost always interested in maximum performance for long-running, compute-intensive and memory-intensive processes, almost never caring much at all about startup time or anything else.

I also run my code on different machines, with different numbers of cores and amounts of memory, and would prefer to be able to put something in my project.clj that will do something reasonable regardless of what machine it's running on.

I run my code with "lein run", and I'd like whatever I need to "run fast" to be in project.clj.

I don't know a lot about JVM options, and I've tried to figure out what to specify for :jvm-opts by asking questions here and web searches, but I'm not at all confident that I'm doing it right yet. And because my systems are also stochastic, it's not easy for me to do timing tests on the options I've tried.

I think there are options relevant to memory and also garbage collection and maybe also compilation... and what else? I wish there was a simple switch to get maximum performance of the sort I've outlined here (or at least a reasonable stab at it), but I gather that there isn't.

Anyway, here's what I've been using recently, which just deals with memory and GC (and maybe not in the best way):

:jvm-opts ~(let [mem-to-use (long (* (.getTotalPhysicalMemorySize
(java.lang.management.ManagementFactory/getOperatingSystemMXBean))
0.8))]
[(str "-Xmx" mem-to-use)
(str "-Xms" mem-to-use)
"-XX:+UseG1GC"])

After seeing Alex's post I thought that maybe I should add "-server", as follows:

:jvm-opts ~(let [mem-to-use (long (* (.getTotalPhysicalMemorySize
(java.lang.management.ManagementFactory/getOperatingSystemMXBean))
0.8))]
[(str "-Xmx" mem-to-use)
(str "-Xms" mem-to-use)
"-XX:+UseG1GC"
"-server"])

Is that right? Does it make sense? What does "-server" do? Also, should I also be using "^:replace"?

I've looked in https://github.com/technomancy/leiningen/blob/master/sample.project.clj in hopes that this would say more about this stuff, but it doesn't say anything about -server or ^:replace.

Looking into the compilation options, it looks from https://github.com/technomancy/leiningen/wiki/Faster that I should be specifying:

:jvm-opts ^:replace []

This is also familiar to me from some earlier discussions. But how would I combine this with the memory/GC/server(?) options above?

A guess would be that maybe I should do this:

:jvm-opts ^:replace
~(let [mem-to-use (long (* (.getTotalPhysicalMemorySize
(java.lang.management.ManagementFactory/getOperatingSystemMXBean))
0.8))]
[(str "-Xmx" mem-to-use)
(str "-Xms" mem-to-use)
"-XX:+UseG1GC"
"-server"
"-XX:-TieredCompilation"])

Note that this guess involves changing a + to a - in the last line, which was suggested for the opposite purpose at https://github.com/technomancy/leiningen/wiki/Faster -- but I don't know if it's legitimate.

Is this a reasonable thing to do to get maximum performance for long-running, compute-intensive and memory-intensive processes?

Is the tiered compilation thing maybe already done by including "-server"?

I'm probably at least somewhat confused about several different issues here...

Any help or pointers would be appreciated.

Thanks,

-Lee

Colin Yates

unread,

May 14, 2015, 4:54:13 PM5/14/15

to clo...@googlegroups.com

Probably not helpful, but I tend to rely on the jvm optimisations and just -server. I figured this is an area where a little knowledge is a dangerous thing.

At the very least I would have a realistic benchmark suite to prove to myself that these gains were worth it. In my experience the performance bottleneck is always in design.

Just my 2p, although with my self-confessed ignoramus status it is more like 0.5p :).

Mark Engelberg

unread,

May 14, 2015, 4:57:53 PM5/14/15

to clojure

I know that remembering to put "-server" used to be a huge issue, but I thought that on all recent versions of Java on all modern 64-bit machines, -server is now the default. I thought it was a non-issue at this point. Is that incorrect?

Colin Yates

unread,

May 14, 2015, 4:59:15 PM5/14/15

to clo...@googlegroups.com

Now I feel even more of an ignoramus :)

Lee Spector

unread,

May 14, 2015, 5:29:50 PM5/14/15

to clo...@googlegroups.com

FWIW if I run with no :jvm-opts at all then I often crash with an out-of-memory error, so I do know that whatever happens by default doesn't do what I'm doing with respect to memory, at least.

I don't know what happens with respect to the other issues (tiered compilation and whatever else) by default, or with -server, etc.

-Lee

--

Lee Spector, Professor of Computer Science

Director, Institute for Computational Intelligence
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspe...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438

Mark Engelberg

unread,

May 14, 2015, 5:38:33 PM5/14/15

to clojure

I always run my REPL inside of Counterclockwise, but as far as I know it uses leiningen to launch the REPL. In any case, when I attach a profiler such as the built-in jvisualvm to the REPL process, it clearly says that the REPL is running in 64-bit server mode.

Alex Miller

unread,

May 14, 2015, 5:42:21 PM5/14/15

to clo...@googlegroups.com

By default, I believe that lein will set: -XX:+TieredCompilation -XX:TieredStopAtLevel=1 for the JVM opts.

If you supply :jvm-opts, they will be *added* to these. Using ^:replace will instead *replace* the defaults with what you supply. There are a bunch of rules in the JVM for whether -server is used by default but they have are different across versions so I just use -server so I don't have to wonder. I often also set a max memory with -Xmx512m so that's known too.

Using -XX:+AggressiveOpts will enable optimizations that are available but not the default (yet) in your current JDK. Sometimes that's helpful but what that actually does will depend on your version.

That's a good place to start. There are lots of other things that *might* help (like GC settings), but you should probably understand what it all means before applying it, esp GC settings.

Andy Fingerhut

unread,

May 14, 2015, 6:15:32 PM5/14/15

to clo...@googlegroups.com

In general, running any kind of JVM via Leiningen, using the 'ps' command with lots of 'wwww' on the end of the command line options should give full command line options used to start the JVM, so you can see what is happening.

Understanding exactly which options are good or bad for performance, I don't have anything to add to the message I am replying to.

Andy

Colin Jones

unread,

May 14, 2015, 8:19:11 PM5/14/15

to clo...@googlegroups.com

I can heartily recommend Java Performance: The Definitive Guide to anyone interested in digging further into all the knobs you can set on the command line: http://www.amazon.com/Java-Performance-The-Definitive-Guide/dp/1449358454

Lee Spector

unread,

May 14, 2015, 8:45:09 PM5/14/15

to clo...@googlegroups.com

Thanks Colin and also Alex and Andy.

I'm trying to determine a reasonable way to do this without reading a book about it.

It sounds like I should use ^:replace, -server, and also -XX:-TieredCompilation (is that right, the way I've changed + to -?), and also -XX:+AggressiveOpts. Does it make sense to use all of these together?

And maybe I should get rid of "-XX:+UseG1GC", since I'm not really sure if that's a good thing.

Assuming that none of those things use as big a chunk of RAM as is available, I guess I should keep my messy code for the memory options.

So that would mean that overall I'd do the following to maximize performance on long-running, compute-intensive, memory-intensive runs:

:jvm-opts ^:replace ~(let [mem-to-use
(long (* (.getTotalPhysicalMemorySize
(java.lang.management.ManagementFactory/getOperatingSystemMXBean))
0.8))]
[(str "-Xmx" mem-to-use)
(str "-Xms" mem-to-use)

"-server"
"-XX:-TieredCompilation"
"-XX:+AggressiveOpts"])

Seem reasonable?

Thanks for all of the help!

-Lee

Fluid Dynamics

unread,

May 14, 2015, 9:15:21 PM5/14/15

to clo...@googlegroups.com

On Thursday, May 14, 2015 at 8:45:09 PM UTC-4, Lee wrote:

Thanks Colin and also Alex and Andy.

I'm trying to determine a reasonable way to do this without reading a book about it.

It sounds like I should use ^:replace, -server, and also -XX:-TieredCompilation (is that right, the way I've changed + to -?), and also -XX:+AggressiveOpts. Does it make sense to use all of these together?

And maybe I should get rid of "-XX:+UseG1GC", since I'm not really sure if that's a good thing.

Assuming that none of those things use as big a chunk of RAM as is available, I guess I should keep my messy code for the memory options.

So that would mean that overall I'd do the following to maximize performance on long-running, compute-intensive, memory-intensive runs:

:jvm-opts ^:replace ~(let [mem-to-use
(long (* (.getTotalPhysicalMemorySize
(java.lang.management.ManagementFactory/getOperatingSystemMXBean))
0.8))]
[(str "-Xmx" mem-to-use)
(str "-Xms" mem-to-use)
"-server"
"-XX:-TieredCompilation"
"-XX:+AggressiveOpts"])

Seem reasonable?

Umm, the :replace metadata needs to be on the vector. Attaching it to the let form won't do much of anything useful. So:

:jvm-opts ~(let [mem-to-use

                   (long (* (.getTotalPhysicalMemorySize
                              (java.lang.management.ManagementFactory/getOperatingSystemMXBean))
                            0.8))]

^:replace [(str "-Xmx" mem-to-use)

Mikera

unread,

May 14, 2015, 9:20:29 PM5/14/15

to clo...@googlegroups.com

I agree that the problem is immutable/functional constructs per se., but I don't think the problem is the lack of VM support either. It is possible to get *very* fast code on the JVM.

In Clojure the issue is more that the dynamic nature of many Clojure constructs and lack of static type inference make it impossible for some of the more advanced JVM optimisations to be applied. This isn't really a VM problem, it is more of a compiler problem.

When I optimise Clojure code, I always get the feeling that I am doing something that a "sufficiently smart compiler" should be able to do (forcing use of primitives rather than boxed numbers, adding type hints etc.)

Lee Spector

unread,

May 14, 2015, 9:22:52 PM5/14/15

to clo...@googlegroups.com

> On May 14, 2015, at 9:15 PM, Fluid Dynamics <a209...@trbvm.com> wrote:
>
> Umm, the :replace metadata needs to be on the vector. Attaching it to the let form won't do much of anything useful. So:
>
> :jvm-opts ~(let [mem-to-use
> (long (* (.getTotalPhysicalMemorySize
> (java.lang.management.ManagementFactory/getOperatingSystemMXBean))
> 0.8))]
> ^:replace [(str "-Xmx" mem-to-use)
> (str "-Xms" mem-to-use)
> "-server"
> "-XX:-TieredCompilation"
> "-XX:+AggressiveOpts"])

Thanks Fluid!

-Lee

Rangel Spasov

unread,

May 14, 2015, 9:50:22 PM5/14/15

to clo...@googlegroups.com

All the advice here is valid; type hinting + eliminating boxed math will probably give you the biggest gains. By adding one type hint in the proper place sometimes I've been able to make my code go 5-10x faster.

I've used tools like YourKit to great advantage to be able to pinpoint exactly which parts of the code contribute most to the slowness.

I.e. your time is better spent optimizing a fn that's called 1k times per second and it's a little slow (for example, missing a type hint and has to do reflection or using boxed math) vs. a fn that's very slow but is only called once a minute.

@raspasov

Raoul Duke

unread,

May 15, 2015, 12:57:53 AM5/15/15

to clo...@googlegroups.com

> I.e. your time is better spent optimizing a fn that's called 1k times per
> second and it's a little slow (for example, missing a type hint and has to
> do reflection or using boxed math) vs. a fn that's very slow but is only
> called once a minute.

not all apps, and not all developers, end up with code that has hot
spots (ha ha).

Rangel Spasov

unread,

May 15, 2015, 1:26:11 AM5/15/15

to clo...@googlegroups.com

Ok, IF there's such a spot : ) .

Amith George

unread,

May 15, 2015, 3:59:22 AM5/15/15

to clo...@googlegroups.com

Thanks for the detailed suggestions. Implementing them did bring the execution time down to around 250secs. Though that value is still much longer than 45secs. Could you please verify if I have implemented them correctly?

Code - https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/4ec02600e025d257a59d76fee0ad0eb01f4785ff/src/rdp/214_intermediate_arr.clj

project.clj contains the line

:jvm-opts ^:replace ["-Xms1024m" "-Xmx1g" "-server"]

> 0) Switch to Clojure 1.7.0-beta3 - it's faster and some things below are dependent on it for best performance. And use Java 1.8.

Done.

> 1) Parse the lines you're reading directly into longs (Clojure focuses on 64-bit primitives - longs and doubles)

    (defn- read-line-nums
      ([] (read-line-nums (read-line) #(Long/parseLong %1)))
      ([input-line parse-fn]
       (if-let [line input-line]
         (->> line
              (#(clojure.string/split %1 #" "))
              (map parse-fn)))))

> 2) Put the longs first into a data structure that preserves the primitive type. The two best options for that here are records (which can have primitive fields) and arrays. I would create a Canvas defrecord with ^long width and height and a Paper defrecord with all ^long fields for example.

    (defrecord Paper [^long color
                      ^long x1 ^long y1
                      ^long x2 ^long y2])

    (defrecord Canvas [^long width ^long height])

    (defn- make-paper
      ([^long w ^long h] (make-paper [0 0 0 w h]))
      ([^longs [color x y w h]]
       (Paper. color x y (+ x w -1) (+ y h -1))))

I had to make the second arity accept a vector, as it seems impossible to create a function accepting more than 4 primitive arguments. Though I suppose I could grouped the args into a `record` of its own?

> 3) Store the papers in a vector (using transient to create it)

The `read-input-file` function which creates and stores the papers, it finishes within 20ms. So I didn't bother using a transient.

> 4) I suspect visible-color and covered? could probably be tightened up into a reduce over papers or a single loop-recur over papers - can't say I totally get what's happening there.

The papers vector represents a sheets of papers that have been stacked on top of each other. ie, the last element in the vector is the canvas, and the first element is the topmost sheet. Each sheet is of different dimensions. Depending on the coordinates where they are placed, they may cover a part of the canvas. Different sheets may overlap in the areas they cover and thus we only want to consider the topmost sheet for that area.

The `visible-color` function accepts a canvas coordinate and the stack papers and goes through the stack to find the first sheet that covers said coordinate. That papers color will be visible color for that coordinate.

> 5) In visible-color-frequencies, you could use "update" instead of get + transient assoc! on the acc map, but this is never going to be terribly fast. Another option here would be to create an array with the max color (you could track that while reading if it's not a well-known answer) and bash the array. That can retain int or long counters and will be *way* faster.

    (defn- visible-color-frequencies-arr
      [{:keys [colors canvas papers]}]
      (let [colorCounts (long-array (count colors))]
        (reduce
         (fn [_ ^longs coord]
           (if-let [color (visible-color coord papers)]
             (aset-long colorCounts color (+ 1 (aget colorCounts color)))
             _))
         -1
         (for [^long y (range (:height canvas))
               ^long x (range (:width canvas))]
           [x y]))
         (zipmap (range) colorCounts)))

It's a bit messy with me abusing reduce and totally ignoring the accumulator, but the version using arrays is atleast 10 seconds faster than the one using transient maps. That said, even the hashmap version took around 260-270secs. So the bulk of the time savings is caused by some other change.

> 6) You can use (set! *unchecked-math* :warn-on-boxed) to get faster math (no overflow checks) and also issue warnings (added in 1.7) if you happened to use boxed math by accident.

I added these to both the old code (which didn't use type hints) and the new one. I ran the `-main` function of both from within the repl and using lein run. I didn't see any warnings in any of the executions. How am I supposed to use these?

    (defn -main
      ([] (-main "0" "false"))
      ([index use-arrays]
       (time
        (binding [*unchecked-math* :warn-on-boxed
                  *warn-on-reflection* true]
          (if-not (Boolean/parseBoolean use-arrays)
            (solve (input-files (Integer/parseInt index)))
            (solve-arr (input-files (Integer/parseInt index))))))))

> Use: ^:jvm-opts ^:replace ["-server"]

In my case there isn't any noticable difference between explicitly using '-server' and not using it. Which brings me to my final question,

> The major problem here is that you are using boxed math for everything instead of primitives.

From what I understand of the changes I made, a reduction of almost 250 seconds came about from simply using type hints. Is that normal? The incrementing of color counts in the original code, was that the boxed math you were referring to?

In C#, using generic types consistently meant I never had to worry about boxing. I tried using `no.disassemble` to verify if the type hints really changed anything, but I only got more confused.

Consider this example code where my-val is supposed to be a long and each of the functions is supposed to accept a long and return a long.

    (-> my-val
        (func-a)
        (func-b)
        (func-c))

I will have to add typehints to original value as well to each function's input and output, right? Can typehints be added to denote a collection/sequence of longs? What about a collection/sequence of records?

Finally, can the performance be improved any more? 250secs still feels too long compared to the C# version.

Jony Hudson

unread,

May 15, 2015, 7:52:45 AM5/15/15

to clo...@googlegroups.com

@puzzler -server is the default mostly, but Leiningen overrides some of the important options that -server enables, as detailed by Alex.

@Lee the first 15 minutes of this talk by Tom Crayford has some useful info about performance and JVM options in it:

https://skillsmatter.com/skillscasts/6148-opening-the-grimoire-some-things-about-clojure-performance

For the record for my long-running workloads (which I suspect are _very_ similar to yours) I use:

:jvm-opts ^:replace ["-server"
                     ;;"-XX:+AggressiveOpts"
                     ;;"-XX:+UseFastAccessorMethods"
                     ;;"-XX:+UseCompressedOops"
                     "-Xmx4g"]

The commented out options are in there so I don't forget them (I learned some of them from Tom's talk), but I haven't found them to make any much difference for my jobs.

Jony

Lee Spector

unread,

May 15, 2015, 9:07:25 AM5/15/15

to clo...@googlegroups.com

Thanks Jony -- very helpful!

-Lee

Steven Yi

unread,

May 15, 2015, 8:18:43 PM5/15/15

to clo...@googlegroups.com

Hi Amith,

I checked out your project from git and just doing 'lein run' I got a reported:

"Elapsed time: 185.651689 msecs"

However, if I modify the -main function in 214_intermediate.clj to wrap the time testing with (doseq [_ (range 20)]), to run the test multiple times, the behavior is much better after the first 9 or so runs, and by the end it is down to:

"Elapsed time: 35.574945 msecs"

I think you might be running into a situation where the VM hasn't run the functions enough times to optimize.

Rather than use the time function, you might want to try using criterium[1] to benchmark the code. The site explains all the wonderful things it does to get a good benchmark.

Unfortunately, if your data set is small, and you're running a one off calculation like this, I don't know if there's much to improve due to the warmup cost. You could fiddle a bit with VM flags to try to optimize with less calls (I can't recall the flag, but I think the JVM defaults to optimizing after a functions been called 10,000 times). On the other hand, if you're processing larger datasets, I think it's reassuring that once warmed up, the Clojure code performs pretty well.

For reference, this was run on an Macbook Pro 13" early 2011, Core i7 2.7ghz.

steven

[1] - https://github.com/hugoduncan/criterium/

Amith George

unread,

May 15, 2015, 9:07:38 PM5/15/15

to clo...@googlegroups.com

Hi Steven,

My bad. You need to invoke the code using the command

lein run -m rdp.214-intermediate-arr 1 true

The `1` tells it to select a certain input file, (in this case the biggest) and the `true` tells it to use the function that internally uses a java array (as opposed to the function that internally uses a transient map.)

The above mentioned command takes around 250secs on my laptop. My apologies again, I should have made it clear how to execute the project.

Steven Yi

unread,

May 15, 2015, 11:02:12 PM5/15/15

to clo...@googlegroups.com

Ah, I see. Well, I think then you can ignore the stuff about warming
up, as this certainly takes a while to run here:

"Elapsed time: 314763.666693 msecs"

I tried profiling with Yourkit and saw a couple of things to change:

;; I think lte with more than two args ends up being slower than
unrolling out here
;; I also tried type-hinting values from paper to ^long to avoid lte
calls with Number
(defn- covered?
[[^long canvas-x ^long canvas-y] paper]
(and (<= ^long (:x1 paper) canvas-x ) (<= canvas-x ^long (:x2 paper))
(<= ^long (:y1 paper) canvas-y ) (<= canvas-y ^long (:y2 paper))))

;; for the reduce function in visible-color-frequencies-arr
;; using aset instead of aset-long, as it looked like aset-long was
using reflection
(aset colorCounts color (+ 1 (aget colorCounts color)))

That got it down to:

"Elapsed time: 279864.041477 msecs"

I suspect you might get improvement too if you change
visible-color-frequencies-arr to use loop-recur instead of reduce
since you're doing a bit of magic there.

Unfortunately I have to stop at the moment as I have to leave on a
trip early in the morning, but hopefully that's useful.

steven

> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---

> You received this message because you are subscribed to a topic in the
> Google Groups "Clojure" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/clojure/JgxFQLP2E34/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to

Daniel

unread,

May 17, 2015, 3:12:43 PM5/17/15

to clo...@googlegroups.com

Actually, I'm more than a little curious about performance optimizations to my solution as well[0]. Running Yourkit shows that most of the execution time is spent in reduce, so I've tried switching to group-by instead. Also tried replacing with iterate. Neither of these improved overall execution time. Using 1.6.

[0] https://www.reddit.com/r/dailyprogrammer/comments/35s2ds/20150513_challenge_214_intermediate_pile_of_paper/crbs3co?context=3

Leif

unread,

May 18, 2015, 8:04:21 PM5/18/15

to clo...@googlegroups.com

Summary: With a new algorithm + parallelism, I reduced it from 528s to 11s.

This sounded fun, so I took a crack at it, starting with your solution. Description and patch files here: https://gist.github.com/leifp/a864bca941ecdacb5840

Starting with your solution, I:

Divided the canvas into 100 blocks, created an index of {blockindex -> papers that intersect that block}. The reasoning is that if we calculate which block a pixel is in, we only need to check the papers that intersect that block. In the extreme case, certain blocks only intersected one paper (the background). In the original code we would have had to check all 100 papers for each pixel in that block; now we just check one. (5x average speedup)
Changed the code to calculate color areas for a block at a time; after that, it was a simple 2-line change to parallelize the work using pmap. (8x speedup on 12-core machine)
Make paper a record; use direct field access (this resulted in a modest ~10% improvement, but maybe not worth it).

So clojure was helpful in trying out algorithm ideas and parallelizing the code. The final program would no doubt also be faster in C#, but my goal is "fast enough."

Further idea (which I don't think I'll implement): Index the papers using an data structure built for geometrical indexing, like an R-tree or variation, to get a near-optimal index without tuning.

I hope my solution is interesting to you. Questions / comments welcome.
Leif

P.S. I apologize for the messy, repetitive, stream-of-consciousness code.

Amith George

unread,

May 19, 2015, 4:38:13 AM5/19/15

to clo...@googlegroups.com

Hi,

Thanks for taking the time to profile the code. I implemented the two suggestions (using the two argument arity of lte and using aset instead of aset-long).

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/2655a83f7fcf51e4fedae164d7d17386a0c8854f/src/rdp/214_intermediate_arr.clj

lein run -m rdp.214-intermediate-arr 1 true

;; took around 250s.

The changes didn't seem to make a difference. The before and after runs all take between 250 - 260s. I kept the reduce as-is. From your reply, it looks like these two changes reduced the execution time by almost 30s. Any thoughts of why there isn't much of a difference for me? - I am using Clojure 1.7.3-beta3 and Oracle Java 1.8.0_45 on OSX Mavericks.

Amith George

unread,

May 19, 2015, 5:18:24 AM5/19/15

to clo...@googlegroups.com

Hi,

Thank you for taking the time. That is a rather smart algo. The code and the changes were easy to understand.

I didn't implement the parallelized version as my aim was to understand why the clojure version was so slow compared to the equivalent C# version. (Btw, you have access to a 12 core machine! :O)

Ok. Two versions of the code -

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/2655a83f7fcf51e4fedae164d7d17386a0c8854f/src/rdp/214_intermediate_leifp.clj

lein run -m rdp.214-intermediate-leifp 1
;; took around 100s

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/2655a83f7fcf51e4fedae164d7d17386a0c8854f/src/rdp/214_intermediate_arr_leifp.clj

lein run -m rdp.214-intermediate-arr-leifp 1 true
;; took around 70s

The first file (214_intermediate_leifp.clj) is similar to what you had after point 1.

The second file (214_intermediate_arr_leifp.clj) is your algo combined with the tips provided in the posts above - mainly - using type hints; native arrays, two argument arity of `<=` etc. It also uses a record `Paper` instead of a map. However the record values are accessed by the keyword instead of direct field access.

Applying those tips to my algo, reduced the time taken from 500s to 250s. For your algo, those tips only reduced the time taken by 1/4th. Which makes sense as your algo performs far less operations, so there are lesser number of operations that can get a speed bump...

That said, I am none the wiser on how to write to fast single threaded clojure :(

Steven Yi

unread,

May 19, 2015, 12:50:29 PM5/19/15

to clo...@googlegroups.com

Hi Amith,

Just got back from a trip and had a chance to look at this again. I'm
not sure why you didn't see a change there; I'm running the same
version of Clojure and Java here.

I did another change to move from using a reduce to nested
loop-recurs. That with a some additional type hints for ^Paper, and
moving the retrieval of :height and :width outside of the loops got a
performance increase of almost 3x from the original:

"Elapsed time: 109335.307414 msecs"

The modified code is here:

https://gist.github.com/kunstmusik/e1081d417142a90d5cfa

Some general notes:

- covered? - type-hinted ^Paper
- visible-color - switched to pass in ^long x and y, used fn form for
passed-in function and type-hinted ^Paper argument
- visible-color-frequencies - switched to nested loop-recur, moved out
height and width to single call outside critical loop

Could you try the version from the gist there to see if you get a
similar speedup?

Cheers!
steven

Steven Yi

unread,

May 19, 2015, 1:32:11 PM5/19/15

to clo...@googlegroups.com

Hi Amith,

One last optimization yields a 4x increase over the last version, and
now 12x over the original:

"Elapsed time: 22848.384642 msecs"

Code here:

https://gist.github.com/kunstmusik/db6e14118e818abf3bb8

Changes from last version:

- parse-inputs - Put parsed Paper objects into an array instead of a vector.
- visible-color iterates through the array of Paper objects using a loop-recur

I spent a moment looking at this using JVisualVM and saw from the
previous version that there was a lot of time spent in sequence
related stuff and in the some function. I realized from using some it
was going through the vector items pretty often, and since that list
of items is static, are read-only within the context of the code, and
the values aren't shared around a codebase, I went ahead and optimized
it to use a fixed array.

Hope this runs well on your side too!
steven

Amith George

unread,

May 20, 2015, 9:18:28 AM5/20/15

to clo...@googlegroups.com

Wow. This is rather hard to believe. But the execution time is now 10s. :D

All of your suggestions were on the mark. I implemented them in stages/commits. The command to execute the file remains the same - `lein run -m rdp.214-intermediate-arr 1 true`

1. Replaced the coordinates vector with separate x and y. Added relevant type hints.

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/0ba1992ff892928beb71400fb44cce471318d187/src/rdp/214_intermediate_arr.clj#L60

Best time - 110s. Can't believe that simply changing from `[[^long x ^long y]]` to `[^long x ^long y]` can reduce the execution time by half!!

2. Replaced the reduce in visible-color-frequencies-arr to nested loop-recurs.

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/75db0f839b748a4ca313bc319c4970a9a0329fb6/src/rdp/214_intermediate_arr.clj#L86

Best time - 90s. I thought reduce was equivalent to a loop recur. Might have to create a separate bare-minimum code sample to investigate why reduce was slower.

3. Stored papers in a native array. The parsed input still contains the papers in a vector. Its only when the main function is called with array true, that we convert to an array and store as a local value.

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/0f49b7979f9c2e0e7a522f5ad875de0807d4e2da/src/rdp/214_intermediate_arr.clj#L115

Best time - 10s. I really don't get why the `some` over a vector was so slow. I tried replacing the `some` with a loop-recur (one style using an incrementing index and `get papers index` and the other style using `rest papers`.) Surprisingly both styles were slower than using `some`.

Thanks a lot for figuring out the slow areas. You made me aware of two tools I could use to find bottlenecks. At the very least, I now know Clojure doesn't necessarily mean 10x slower.

That said, I really wanna know why storing papers in a native array made such a difference. Ideally I shouldn't have to expose a mutable array to other functions. In this case, the functions that accept/use the mutable array are meant to be private and under my control, so I could have some measure of confidence that mutation wouldn't occur.

Amith George

unread,

May 20, 2015, 9:37:16 AM5/20/15

to clo...@googlegroups.com

Oh, a few more things I observed;

- I figured out how to use *unchecked-math* and *warn-on-reflection* :)

Earlier, I was setting them within a binding, but that didn't seem to do anything. Seems the correct way is to set them at the very top of the file, the first 2 lines after the namespace declaration.

https://github.com/amithgeorge/reddit-dailyprogrammer-clojure/blob/0f49b7979f9c2e0e7a522f5ad875de0807d4e2da/src/rdp/214_intermediate_arr.clj#L4-L5

- With the warnings now being generated properly, I was able to figure out where the type hints needed to be added. For example,

(defn- covered?
  [^long canvas-x ^long canvas-y ^Paper paper]
  (and (<= ^long (.x1 paper) canvas-x ) (<= canvas-x ^long (.x2 paper)) 
       (<= ^long (.y1 paper) canvas-y ) (<= canvas-y ^long (.y2 paper))))

The `^long` type hint is not needed within the `<=` forms. As paper is typed to `Paper` and the various `x1, x2 etc` fields are of type long in the Paper definition.

- Also,

(recur (unchecked-inc i))

can be

(recur (inc i))

because `*unchecked-math*` is set to a truthy value.

Please correct me if I misunderstood anything :)

Leif

unread,

May 20, 2015, 11:01:59 AM5/20/15

to clo...@googlegroups.com

"Elapsed time: 1155.723592 msecs"

I implemented the changes Amith and Steven suggested on my multithreaded version. I think we can safely say that, with a modest effort, Clojure can compare favorably to C#. :)

https://gist.github.com/leifp/a864bca941ecdacb5840
Cheers,
Leif

Steven Yi

unread,

May 20, 2015, 1:37:29 PM5/20/15

to clo...@googlegroups.com

Hi Amith,

Very glad you're seeing the performance increase there too! I think
you're right about both the extraneous type hints and the
unchecked-math. I tend to turn on unchecked-math in a leiningen
profile that I use while developing, but leave it set to default for
normal builds. Because of that, I'm in the habit of using
unchecked-inc. For the extra type hints there, I think I had
originally used the :member syntax and changed it to the .member
syntax but had left the type hints in. It would have been better to
just use .member and leave the type hints out as you noted.

For some/sequences vs. loop-recur/arrays, I haven't done a deep enough
analysis and don't want to say something incorrect. Here's some notes
in looking at it just now:

- get - using get [1] calls to RT[2], which in turns does some
instanceof checks to then call the specific type's function for
getting a value. With PersistentVector[3], that should translate to
the ILookup.valAt() call, which in turn calls nth. Compared to the
inlined aget[4][5] call, it's a lot more layers of functions.

- some[6] - this ends up calling first and next[7], which ends up
calling first and next in PersistentVector[8], which ends up
allocating a new ChunkedSeq. Granted, object allocation on the JVM is
super cheap (pointer bump) compared to allocating memory with
something like malloc in C, there's still a cost there + the setting
of fields in the constructor. It's all very cheap operations on their
own but I'm guessing that in this context of a critical loop, plus the
logic to check for boundaries in the first/next calls, are adding up
compared to just indexing into the full array.

- some vs. loop-recur/rest - In this case, I'm not sure. It might be
due to how you coded the loop with the call to empty? (assuming the
commented out code in your example is what you used).

I guess then all of this becomes a series of tradeoffs and trying to
find the right balance. IMO, sequences offer so many benefits over
arrays. It's really only in certain critical sections of code where
the performance costs might outweigh programmer benefits. I guess even
then, the amortized cost of the sequence compared to array goes down
if the the operations on the individual items becomes more expensive.
In this case it happened to be that the cost of the operation on the
set of values was pretty small compared to the cost of calling
first/next. Maybe in other scenarios the use of an array won't give
the same benefit.

Well, hopefully if I missed something in the analysis above someone
here will correct me. :)

Cheers!
steven

[1] - https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L1421-L1429
[2] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L719-L744
[3] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentVector.java#L657-L671
[4] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L2321-L2323
[5] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L2321-L2323
[6] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L2321-L2323
[7] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L651-L679
[8] - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L651-L679

Reply all

Reply to author

Forward