transducer (+ closures) efficiency question

193 views
Skip to first unread message

Sam Raker

unread,
Jun 23, 2015, 6:07:06 PM6/23/15
to clo...@googlegroups.com
Let's say that, as part of an xf, I want to filter out everything in a sequence that's also in some other sequence. Here are some ways of doing that:

(defn filter-contains1 [edn-file] (remove (partial contains? (set (read-edn-file edn-file)))))

(defn filter-contains2 [coll] (remove (partial contains? (set coll))))

(def filter-contains3 [coll] (let [coll-as-set (set coll)] (remove (partial contains? (set coll)))))

I have the strong suspicion that `filter-contains3` is the best of the 3, and `filter-contains1` the worst. The internal mechanics of transduce are a bit of a mystery to me, however: if `filter-contains2` were to be used on a collection of, say, a million items, would `coll` be cast to a set a million times, or is Clojure/the JVM smarter than that? I'm also wondering if anyone has any "best practices" (or whatever) they can share relating to this kind of intersection of transducers/xfs and closures. It seems to me, for example, that something like

(defn my-thing [coll & stuff]
 
(let [s (set coll)]
 
...
 
(comp
   
...
   
(map foo)
   
(filter bar)
   
(remove (partial contains? s))
   
...

is awkward, but that a lot of limited-use transducer factory functions (like the ones above) aren't exactly optimal, either.

Ghadi Shayban

unread,
Jun 23, 2015, 7:17:28 PM6/23/15
to clo...@googlegroups.com
Good question.

Clojure's evaluation semantics dictate that the arguments are evaluated (computed) before calling the function. So(set coll) is computed before being passed to `partial`.  Partial receives a function (a value) and arguments (also values) and returns back a new function that saves those original arguments (which happen to be stuffed away in Java final fields).

All three of your filters return a transducer.  All three of the inputs to `remove` are a partial function.  Each of the arguments to the call to partial is a set.  They are all essentially equivalent, and should perform the same.  (Except filter3 happens to create the set twice; once in the let, and once as the arg to partial).  So over a billion item collection, the set in your examples will only be computed once, once, and twice respectively.

Note however that sets *are* functions that evaluate whether the argument is in the set. This means you could remove the call to partial and shorten to:

(defn filter-contains1 [edn-file]
  (remove (set (read-edn-file edn-file))))


Tangentially:
(remove even?)
Will be faster than
(remove (fn [i] (even? i)))
because in the first case the dereference of the var 'even?' happens only once and the value inside the var will be passed to `remove` at the outset.  In the second example the var dereference happens for every single item (though it's very cheap).  The second example is equivalent to writing (remove #'even?)

Sam Raker

unread,
Jun 23, 2015, 7:23:10 PM6/23/15
to clo...@googlegroups.com
Oh fantastic! I was 100% wrong in literally the best way.


Thanks!

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to a topic in the Google Groups "Clojure" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/clojure/H0zoF6jlY-c/unsubscribe.
To unsubscribe from this group and all its topics, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Ben Wolfson

unread,
Jun 23, 2015, 7:25:12 PM6/23/15
to clo...@googlegroups.com
On Tue, Jun 23, 2015 at 4:17 PM, Ghadi Shayban <gsha...@gmail.com> wrote:

Tangentially:
(remove even?)
Will be faster than
(remove (fn [i] (even? i)))
because in the first case the dereference of the var 'even?' happens only once and the value inside the var will be passed to `remove` at the outset.  In the second example the var dereference happens for every single item (though it's very cheap).  The second example is equivalent to writing (remove #'even?)

I can't seem to observe a difference between the two cases.

--
Ben Wolfson
"Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]

Herwig Hochleitner

unread,
Jun 24, 2015, 11:30:04 AM6/24/15
to clo...@googlegroups.com
2015-06-24 1:25 GMT+02:00 Ben Wolfson <wol...@gmail.com>:
On Tue, Jun 23, 2015 at 4:17 PM, Ghadi Shayban <gsha...@gmail.com> wrote:

Tangentially:
(remove even?)
Will be faster than
(remove (fn [i] (even? i)))
because in the first case the dereference of the var 'even?' happens only once and the value inside the var will be passed to `remove` at the outset.  In the second example the var dereference happens for every single item (though it's very cheap).  The second example is equivalent to writing (remove #'even?)

I can't seem to observe a difference between the two cases.

A simple var reference like even? compiles to a fetch of its root val: @#'even?
In the first case, that happens only once, passing the resulting fn object to remove. In the second case, the deref happens every time the fn is called, thus allowing for redefinition of even?.

Have you ever passed passed #'handler to a server-start function (e.g. in a ring app), so that you might redefine the handler fn during run time of the server?

Ad. thread topic: As detailed, clojure's evaluation semantics already help with transducers. Even better: Built transducer stacks should be pretty accessible to the JIT, since they contain no more var references (in the wiring). Also, since transducers are a kind of lexically closed pipe from input effect to output effect, hotspot's escape analysis should also kick in nicely.

Ben Wolfson

unread,
Jun 24, 2015, 12:19:34 PM6/24/15
to clo...@googlegroups.com
On Wed, Jun 24, 2015 at 8:29 AM, Herwig Hochleitner <hhochl...@gmail.com> wrote:


2015-06-24 1:25 GMT+02:00 Ben Wolfson <wol...@gmail.com>:
On Tue, Jun 23, 2015 at 4:17 PM, Ghadi Shayban <gsha...@gmail.com> wrote:

Tangentially:
(remove even?)
Will be faster than
(remove (fn [i] (even? i)))
because in the first case the dereference of the var 'even?' happens only once and the value inside the var will be passed to `remove` at the outset.  In the second example the var dereference happens for every single item (though it's very cheap).  The second example is equivalent to writing (remove #'even?)

I can't seem to observe a difference between the two cases.

A simple var reference like even? compiles to a fetch of its root val: @#'even?
In the first case, that happens only once, passing the resulting fn object to remove. In the second case, the deref happens every time the fn is called, thus allowing for redefinition of even?.

I meant a difference in speed.

Leon Grapenthin

unread,
Jun 24, 2015, 12:34:20 PM6/24/15
to clo...@googlegroups.com
Using a set as filter predicate is idiomatic Clojure.

If you don't know whether your filter coll is a set, you can write your filter-contains as

(def filter-contains (comp filter set))

E. g. use as

(into [] (filter-contains [1 2 3]) [1 2 3 4]) 
;-> [4]

I am not up to date about optimizations in the PersistentHashSet world, but I'd assume for very small filter colls you can gain speed with a linear scan instead of contains?.

If you are ok with input, output and filter being a set, I'd recommend looking at clojure.set/intersection for that task. It optimizes by reducing the smaller set. Depending on the sizes of the collections you are dealing with, the set conversion cost may be worth the speed up you can gain. 

Leon Grapenthin

unread,
Jun 24, 2015, 12:35:18 PM6/24/15
to clo...@googlegroups.com
Correction: The result in the example should be [1 2 3] of course.

Herwig Hochleitner

unread,
Jun 24, 2015, 12:48:19 PM6/24/15
to clo...@googlegroups.com
2015-06-24 18:19 GMT+02:00 Ben Wolfson <wol...@gmail.com>:

I meant a difference in speed.


A var-root fetch is a volatile read, which can have several adverse performance effects. Most importantly, it hinders inlining, since the JIT has to place a guard for every inlining site of a volatile read, thus it won't so readily inline.
Even when inlined, there must be a memory barrier for the guard, which can lead to pipeline flushes, bus communication and other stalls in execution.

Alan Shaw

unread,
Jun 24, 2015, 3:59:42 PM6/24/15
to clo...@googlegroups.com

As with all discussions involving interoperability, I'd love to see the same questions answered for the JS target too. (This would of course apply to books such as the new Clojure Applied. )

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages