Direct Linking Experience Reports?


Ghadi Shayban

Oct 1, 2015, 10:39:24 AM
to Clojure Dev
Have the direct linking changes been quantified from a performance perspective? I haven't seen any reports of positive or negative impacts. If anyone has some, I'd be interested to hear.

Alex Miller

Oct 1, 2015, 11:03:15 AM
to cloju...@googlegroups.com
I did some tests with the Alioth benchmarks when direct linking was added, but a) they're generally so short (<10 sec) that it's hard to see major effects and b) most of them are bound by hot loops where var invocation has already been factored out so there are no direct calls in the places where it matters. So, I did not see any significant differences.


--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev.
For more options, visit https://groups.google.com/d/optout.

Colin Fleming

Oct 1, 2015, 5:38:25 PM
to cloju...@googlegroups.com
Relatedly, has anyone investigated how good HotSpot is at inlining var indirection, i.e. how often it happens in practice? Tom Crayford talked a bit about this at EuroClojure but I haven't seen anyone sit down and look at what HotSpot is actually doing.

If HotSpot is already good at doing this where it matters, does that more or less obsolete the direct linking change? I'm not sure, just curious.

Alex Miller

Oct 1, 2015, 11:22:55 PM
to cloju...@googlegroups.com
I have looked at it. Direct static invocation is definitely faster than the var lookup. I did some tests with all the JIT inlining debug output turned on. HotSpot inlines either way, but it seems to be able to inline across bigger chunks of code with direct linking.
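For reference, here is roughly what a call through a var involves. Without direct linking, a call site such as `(fn4 a)` fetches the var's current root value and invokes it through the `IFn` interface on every call; with direct linking (Clojure 1.8+), eligible call sites instead call the target's static `invokeStatic` method. The sketch below emulates only the var path by hand, using the public `clojure.lang.Var.getRawRoot` method. `fn4` and `call-through-var` are illustrative names, and this approximates, rather than reproduces, the bytecode the compiler emits.

```clojure
(defn fn4 [a]
  (+ a 42))

;; Approximates a non-direct-linked call site (fn4 a): read the var's
;; current root value on every call, then invoke it through IFn.
(defn call-through-var [a]
  (.invoke ^clojure.lang.IFn (.getRawRoot #'fn4) a))

(call-through-var 1) ; => 43

;; Replacing the root value is seen by the very next call,
;; because nothing about the target is cached at the call site.
(alter-var-root #'fn4 (constantly (fn [a] (* a 2))))
(call-through-var 5) ; => 10
```

Because the lookup happens on every call, a redefinition is visible immediately. Direct-linked call sites give that up; vars marked `^:redef` or `^:dynamic` are not direct linked.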


Herwig Hochleitner

Oct 5, 2015, 10:34:52 PM
to cloju...@googlegroups.com
I looked at whether direct linking might make compiled code tree-shakable by ProGuard (very beneficial for Android). The short version is: nope, and it doesn't seem worth pursuing (ProGuard even chokes on some bytecode compiled from data.json).

Ghadi Shayban

Oct 27, 2015, 3:19:17 PM
to Clojure Dev
Does direct linking solve a particular problem or have measurable benefits?

Please forgive the "directness" (heh). It presents a few (totally surmountable) challenges for people experimenting with or forking the compiler, but I'm genuinely curious about a rationale or thoughtdump.

Daniel Compton

Oct 28, 2015, 5:13:21 PM
to Clojure Dev
This design wiki page has some of the wider context around direct linking, but not much detail on direct linking itself: http://dev.clojure.org/display/design/Build+Profiles.

--
Daniel

Timothy Baldridge

Nov 12, 2015, 5:03:17 PM
to cloju...@googlegroups.com
So I have a microbenchmark of direct linking I thought I'd share. It's quite trivial, but it demonstrates the cases where I would expect to see a performance improvement.

(ns dl-test.core
  (:require [criterium.core :refer [quick-bench]])
  (:gen-class))

(defn fn4 []
  1)

(defn fn3 []
  (fn4))

(defn fn2 []
  (fn3))

(defn fn1 []
  (fn2))

(defn test-fn []
  (dotimes [x 10000]
    (dotimes [y 10000]
      (fn1))))

(defn -main []
  (quick-bench
    (test-fn))
  (println "done"))


So as you see we're doing nothing but measuring the overhead of calling deep into 4 functions, as well as the time it takes to iterate the loops. These loops will make use of some clojure.core functions as well, so I'd expect to see some improvement here. 


My results:

1.7.0 
WARNING: Final GC required 3.7534489667856072 % of runtime
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 170.189143 ms
    Execution time std-deviation : 5.863419 ms
   Execution time lower quantile : 163.627363 ms ( 2.5%)
   Execution time upper quantile : 178.348332 ms (97.5%)
                   Overhead used : 1.722470 ns
done

1.8.0-RC1 without direct linking
WARNING: Final GC required 3.1782225321822644 % of runtime
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 173.798084 ms
    Execution time std-deviation : 8.781647 ms
   Execution time lower quantile : 165.613774 ms ( 2.5%)
   Execution time upper quantile : 186.626237 ms (97.5%)
                   Overhead used : 1.673530 ns
done

1.8.0-RC1 with direct linking 
WARNING: Final GC required 4.171942981416464 % of runtime
Evaluation count : 24 in 6 samples of 4 calls.
             Execution time mean : 35.457521 ms
    Execution time std-deviation : 3.025490 ms
   Execution time lower quantile : 32.932862 ms ( 2.5%)
   Execution time upper quantile : 39.592962 ms (97.5%)
                   Overhead used : 1.685840 ns
done


So I would expect to see some improvements inside tight inner loops that call lots of clojure.core functions with little other overhead (e.g. collection manipulation or GC work will most likely hide any perf improvement seen here).

Timothy Baldridge




“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

Nicola Mometto

Nov 12, 2015, 5:46:12 PM
to cloju...@googlegroups.com
As soon as you start doing some work (as trivial as adding two integers) inside fn4, the difference between 1.8 w/ direct linking and 1.7 w/o becomes significantly smaller:

(ns dl-test.core
  (:require [criterium.core :refer [quick-bench]]))

(defn fn4 [a]
  (+ a (rand-int 10)))

(defn fn3 [a]
  (fn4 a))

(defn fn2 [a]
  (fn3 a))

(defn fn1 [a]
  (fn2 a))

(defn test-fn []
  (dotimes [x 1000]
    (dotimes [y 1000]
      (fn1 (* y x)))))

(defn -main []
  (quick-bench (test-fn)))

1.7.0
WARNING: Final GC required 6.823549649317508 % of runtime
Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 62.453135 ms
    Execution time std-deviation : 3.237941 ms
   Execution time lower quantile : 57.752310 ms ( 2.5%)
   Execution time upper quantile : 66.066329 ms (97.5%)
                   Overhead used : 1.603176 ns

1.8.0-RC1 -Dclojure.compiler.direct-linking=true
WARNING: Final GC required 1.024493079167096 % of runtime
WARNING: Final GC required 7.296848357750578 % of runtime
Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 59.691313 ms
    Execution time std-deviation : 6.498723 ms
   Execution time lower quantile : 54.058580 ms ( 2.5%)
   Execution time upper quantile : 69.466560 ms (97.5%)
                   Overhead used : 1.588121 ns
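The `-Dclojure.compiler.direct-linking=true` flag used for the 1.8.0-RC1 run is read at startup into `clojure.core/*compiler-options*` (Clojure collects `clojure.compiler.*` system properties into that map), and the compiler consults that map when it compiles call sites. A minimal sketch for checking the setting from a REPL; `direct-linking-enabled?` is a hypothetical helper. Since the option is read at compile time, changing it only affects code compiled afterwards, not call sites that were already compiled.

```clojure
;; Hypothetical helper: reports whether the direct-linking compiler option is set.
;; Started with -Dclojure.compiler.direct-linking=true, *compiler-options*
;; should contain {:direct-linking true}.
(defn direct-linking-enabled? []
  (true? (:direct-linking *compiler-options*)))

(direct-linking-enabled?)

;; *compiler-options* is dynamic, so it can also be bound for a scope.
(binding [*compiler-options* {:direct-linking true}]
  (direct-linking-enabled?)) ; => true
```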

Nicola Mometto

Nov 12, 2015, 5:57:16 PM
to cloju...@googlegroups.com
Using bench rather than quick-bench:

1.7.0
Evaluation count : 960 in 60 samples of 16 calls.
             Execution time mean : 60.613051 ms
    Execution time std-deviation : 1.848730 ms
   Execution time lower quantile : 58.167461 ms ( 2.5%)
   Execution time upper quantile : 64.498360 ms (97.5%)
                   Overhead used : 1.588868 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 17.3924 % Variance is moderately inflated by outliers

1.8.0 w/ direct-linking
WARNING: Final GC required 1.071554270235752 % of runtime
Evaluation count : 1020 in 60 samples of 17 calls.
             Execution time mean : 58.102072 ms
    Execution time std-deviation : 1.838173 ms
   Execution time lower quantile : 55.564640 ms ( 2.5%)
   Execution time upper quantile : 61.313636 ms (97.5%)
                   Overhead used : 1.591066 ns

That's a ~4% perf improvement in code that does essentially nothing.
If this benchmark is representative, I suspect that in real code direct linking would produce no noticeable improvement, at the cost of a *significantly* more complex (and bug-prone!) Compiler implementation.

I'm admittedly not an expert in benchmarking on the JVM, and I hope benchmarks from folks more expert than me will prove me wrong, but if these numbers are right, they don't make a very compelling case for direct linking.


Colin Fleming

Nov 12, 2015, 6:57:58 PM
to cloju...@googlegroups.com
If I don't enable direct linking, to what extent are existing code paths in the compiler touched? i.e. what's the potential scope for regressions?

Michael Blume

Nov 12, 2015, 8:39:48 PM
to cloju...@googlegroups.com
I'm with Nicola, I'm honestly kind of confused about why we're moving forward on this when the gains have been so elusive.

Alex Miller

Nov 12, 2015, 9:27:53 PM
to cloju...@googlegroups.com
If you are not compiling with direct linking, then for the most part you are generating the same code as before.

Core itself is compiled with direct linking, however, so function invocations within core use the new invokeStatic calls instead. The non-direct code paths still exist.
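The `invokeStatic` entry points mentioned here are visible with plain Java reflection on a function's class. A minimal sketch, assuming Clojure 1.8 or later; `has-invoke-static?` is a hypothetical helper, not part of clojure.core. On 1.8+ the first call below is expected to return true and the closure case false, since a fn that closes over locals needs an instance to hold them. Having the static method is only a prerequisite: whether a particular call site uses it depends on how that call site was compiled.

```clojure
(defn has-invoke-static?
  "True if the class of fn value f exposes a public method named invokeStatic."
  [f]
  (boolean
    (some (fn [^java.lang.reflect.Method m] (= "invokeStatic" (.getName m)))
          (.getMethods (class f)))))

;; A top-level core fn with no closed-over locals.
(has-invoke-static? clojure.core/apply)

;; A closure over the local y; not expected to get a static entry point.
(has-invoke-static? (let [y 1] (fn [x] (+ x y))))
```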


Alex Miller

Nov 13, 2015, 12:54:27 PM
to Clojure Dev
I ran these tests too, along with some other variants. All tests were run with:

* java 1.8.0-b132
* -server -Xmx1024m
* criterium bench (not quick-bench)
* restarted JVM for each test so no cross-contamination

The example listed earlier, where:
(defn fn4 [a]
  (+ a (rand-int 10)))

;;dl=false
"Elapsed time: 57.399002 ms"
;;dl=true
"Elapsed time: 51.925353 ms"
;; 9.5% improvement

variant:
(def x 42)  ;; global var
(defn fn4 [a]
  (+ a x))

;;dl=false
"Elapsed time: 58.750851 ms"
;;dl=true
"Elapsed time: 20.766207 ms"
;; 64.6% improvement

constant:
(defn fn4 [a]
  (+ a 42))

;;dl=false
"Elapsed time: 25.724935 ms"
;;dl=true
"Elapsed time: 17.318167 ms"
;; 32.6% improvement
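The percentages above are relative reductions in elapsed time, (before - after) / before. A tiny helper (hypothetical, not from the thread) makes the arithmetic explicit and can be applied to any pair of timings posted here.

```clojure
(defn pct-improvement
  "Relative reduction in run time, in percent: (before - after) / before * 100."
  [before-ms after-ms]
  (* 100.0 (/ (- before-ms after-ms) before-ms)))

(pct-improvement 57.399002 51.925353) ; rand-int variant: about 9.54
(pct-improvement 58.750851 20.766207) ; global var variant: about 64.65
(pct-improvement 25.724935 17.318167) ; constant variant: about 32.68
```

Applied to the criterium `bench` means from the 1.7.0 vs 1.8.0 direct-linking run earlier in the thread (60.613051 ms vs 58.102072 ms), the same formula gives roughly 4.1%.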

Switching from var dereferencing to static call paths has consistently proven easier for HotSpot to optimize. Some code will not push var invocation hard enough for this to matter, but in other cases direct invocation can provide a significant benefit. Some of the improvements I got in the Alioth tests came from things like manual inlining to remove var invocation. With direct invocation, those changes would likely not have been necessary.
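A lighter-weight variant of the manual-inlining idea (a sketch, not necessarily what was done in the Alioth code) is to read the var once into a local before the hot loop, so the per-iteration call no longer goes through the var. `sum-via-var` and `sum-via-local` are hypothetical names. The local is still called through `IFn.invoke`, so this removes the var lookup but not the interface dispatch that `invokeStatic` avoids.

```clojure
(defn fn4 [a]
  (+ a 42))

;; The reducing fn calls fn4 through its var on every step
;; (unless this namespace is compiled with direct linking).
(defn sum-via-var [n]
  (reduce (fn [acc _] (fn4 acc)) 0 (range n)))

;; Workaround: read the var once into a local and close over the fn value.
;; Trade-off: a redefinition of fn4 made while this runs is not observed.
(defn sum-via-local [n]
  (let [f fn4]
    (reduce (fn [acc _] (f acc)) 0 (range n))))

(sum-via-var 10)   ; => 420
(sum-via-local 10) ; => 420
```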

One avenue I have not looked at yet, but that I think would be interesting: comp'ed transducer chains consist largely of nested function invocations, so they may be a situation where direct invocation allows for better HotSpot optimization.

Renzo Borgatti

Nov 13, 2015, 12:59:21 PM
to cloju...@googlegroups.com
Hi,

I did some preliminary testing on our webapp, sending live traffic to it with and without direct linking, and saw close to zero difference in average response time. We depend on a large number of libraries, though, so even though our own namespaces and core are direct-link compiled, roughly 80% of the Clojure code we run is not. To see a boost, I assume I would have to download our dependencies, build them with lein uberjar with AOT and direct linking enabled, use those builds instead of the current ones, and then see what happens.

Inspecting the profiler (here’s a screenshot https://dl.dropboxusercontent.com/u/1740372/direct-linking.png) shows that RestFn.invoke still dominates. I suppose it should be invokeStatic instead. So in the long term we'll have to wait until all libraries are on board to see the benefits.
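For context on that profiler output: the Clojure compiler makes variadic fns (those with a `& rest` arity) subclasses of `clojure.lang.RestFn`, so `RestFn.invoke` frames typically come from calls to variadic fns made through the `IFn` interface rather than through a static entry point. A quick REPL check, as a sketch; `variadic?` is a hypothetical helper name.

```clojure
(defn variadic?
  "True if fn value f was compiled as a RestFn, i.e. it has a variadic arity."
  [f]
  (instance? clojure.lang.RestFn f))

(variadic? str)                    ; => true, str has an [x & ys] arity
(variadic? inc)                    ; => false, inc has a single fixed arity
(variadic? (fn [& xs] (count xs))) ; => true
```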

Regards,
Renzo



Michael Blume

Nov 13, 2015, 4:47:38 PM
to cloju...@googlegroups.com
Renzo, most Clojure libraries are distributed as source-only, so whatever options you have affecting your code should affect your libraries as well. It'd be highly unusual to distribute a Clojure library as a jar full of class files.
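The setup Renzo describes can be sketched in Leiningen by setting the system property only for the uberjar build. Because AOT compilation is transitive, namespaces from source-only dependencies that your code loads are compiled in the same JVM with the same setting, which is Michael's point. The project name and main namespace below are placeholders, and the property requires Clojure 1.8 or later.

```clojure
;; project.clj (my-app and my-app.core are hypothetical placeholders)
(defproject my-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]]
  :main my-app.core
  :profiles {:uberjar {:aot :all
                       ;; Read by the Clojure compiler during AOT compilation
                       :jvm-opts ["-Dclojure.compiler.direct-linking=true"]}})
```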

rebo...@gmail.com

Nov 14, 2015, 11:51:21 AM
to Clojure Dev
Maybe you're right. I thought uberjarring with :aot :all was standard practice, but I might be wrong. I'll try to verify that for my dependencies and post results again if they differ.

Thanks
Renzo

Herwig Hochleitner

Nov 16, 2015, 7:15:13 AM
to cloju...@googlegroups.com
2015-11-13 18:54 GMT+01:00 Alex Miller <al...@puredanger.com>:

> One avenue I have not looked at yet but I think would be interesting is that comp'ed transducer chains consist largely of a bunch of nested function invocation. Thus that may be a situation where direct invocation would allow for better hotspot optimization.

Isn't direct invocation only for eliminating var-calls? The thing passed into comp would be the already deref'ed fn, so I wouldn't expect transducer stacks to benefit.

I would expect the biggest win to be in deep call trees, where the additional Var stack frames would lead to the inlining budget being blown. Transducer stacks would indeed be a candidate, if they didn't already side-step the issue by using higher-order composition.
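Herwig's point can be checked at the REPL: the step functions passed to `map` or `filter` are evaluated to fn values once, when the chain is built, so running the chain does not look those vars up again per element. In this sketch (`add-one` and `xf` are illustrative names), redefining the var afterwards leaves the existing chain unchanged, while a freshly built chain picks up the new definition. Calls made inside those step functions to other vars are a separate matter.

```clojure
(defn add-one [x]
  (inc x))

;; add-one is evaluated here, once, to its current fn value.
(def xf (comp (map add-one) (filter odd?)))

(into [] xf (range 5)) ; => [1 3 5]

;; Replace the var's root value. The already-built chain still holds the old fn.
(alter-var-root #'add-one (constantly (fn [x] (* 10 x))))

(into [] xf (range 5))                                 ; => [1 3 5]
(into [] (comp (map add-one) (filter odd?)) (range 5)) ; => []
```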

Brandon Adams

Nov 20, 2015, 12:48:23 PM
to cloju...@googlegroups.com


On Nov 14, 2015 10:51 AM, <rebo...@gmail.com> wrote:
>
> Maybe you're right, I thought uberjarring with :aot :all was standard practice, but I might be wrong. I'll try to verify that for my dependencies and post results again if different.

AOT with uberjar is common, but distributing libraries as an uberjar is not common. The only one I can think of is storm-core, and that's a huge headache because of its AOT deps.
