Pass through pre-constructed data structures?

187 views
Skip to first unread message

Peter Taoussanis

unread,
Apr 22, 2014, 3:08:16 AM4/22/14
to cloju...@googlegroups.com
Hi all, just noticed some behaviour that surprised me: 

`(set <preexisting-set>)` and `(vector <preexisting-vector>)` actually reconstruct their input rather than just passing them through.

Is there a reason we couldn't/shouldn't extend PersistentHashSet/PersistentVector to do so? Seems like this'd be a simple change that buys us some performance for nothing.

Thanks, cheers! :-)

Alex Miller

unread,
Apr 22, 2014, 10:56:37 AM4/22/14
to cloju...@googlegroups.com
Seems like this would need to be a change in set and vector right? 


--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev.
For more options, visit https://groups.google.com/d/optout.

Peter Taoussanis

unread,
Apr 22, 2014, 11:06:53 AM4/22/14
to cloju...@googlegroups.com

Seems like this would need to be a change in set and vector right? 

Sure, okay - can I create an issue for this & submit a patch?

Gary Fredericks

unread,
Apr 22, 2014, 11:23:17 AM4/22/14
to cloju...@googlegroups.com
What would be the behavior when passed a specialized data structure? E.g., (set (sorted-set 1 2 3))?
On Tue, Apr 22, 2014 at 10:06 AM, Peter Taoussanis <ptaou...@gmail.com> wrote:

Seems like this would need to be a change in set and vector right? 

Sure, okay - can I create an issue for this & submit a patch?

Peter Taoussanis

unread,
Apr 22, 2014, 11:56:59 AM4/22/14
to cloju...@googlegroups.com

What would be the behavior when passed a specialized data structure? E.g., (set (sorted-set 1 2 3))?

I propose to only pass through types that would be unaffected by a rewrapping. So a PersistentTreeSet would get reconstructed as a HashSet in this case. I.e. no semantics would be changed by the optimization.

Jean Niklas L'orange

unread,
Apr 22, 2014, 12:51:16 PM4/22/14
to cloju...@googlegroups.com
A better example is perhaps self-designed data structures, which extends PersistentSet or PersistentVector: I wouldn't be too surprised if someone out there has extended PersistentVector, and want `vec` to convert the specialized structure back to the standard PersistentVector.

I would thus suggest that this optimisation checks that the data structure passed in is exactly of the type PersistentSet/PersistentVector, and not a subtype of them.
--
Regards,
Jean Niklas L'orange

Aphyr Null

unread,
Apr 24, 2014, 3:26:15 PM4/24/14
to cloju...@googlegroups.com
On Tuesday, April 22, 2014 9:51:16 AM UTC-7, Jean Niklas L'orange wrote:
I would thus suggest that this optimisation checks that the data structure passed in is exactly of the type PersistentSet/PersistentVector, and not a subtype of them.

Hmm, I was actually going to suggest the opposite; that wherever (vec? x) returns true, (vec x) should be exactly x. We have explicit constructors (hash-set) and (sorted-set), (vector), (array-map) and (hash-map), etc; so I think (set) should simply ensure that the thing is a set, and (apply hash-set x) or (into (hash-set) x) should be used to make a thing explicitly a hash set. Also means I don't have to keep writing (if (set? x) x (set x)). :)

Alex Miller

unread,
Apr 24, 2014, 3:57:50 PM4/24/14
to cloju...@googlegroups.com
I think I'm more in agreement with Kyle that "set" only needs to guarantee that the return is an IPersistentSet, not that it's a PersistentHashSet. It would of course be interesting to know whether code exists that relies on this (instead of more properly calling hash-set).

Another benefit is that the alternate interpretation would require you to compare the type of coll to PersistentHashSet in the code and generally that's a good sign something we are not programming to abstractions.

The set impl could then be: 

(if (set? coll) coll (clojure.lang.PersistentHashSet/create (seq coll))))


--

Peter Taoussanis

unread,
Apr 26, 2014, 3:53:01 AM4/26/14
to cloju...@googlegroups.com
Hey Alex,
 
I think I'm more in agreement with Kyle that "set" only needs to guarantee that the return is an IPersistentSet, not that it's a PersistentHashSet.

Given that this is a core fn, my thinking was just to suggest the most conservative possible move. I can imagine that some folks may be incorrectly depending on `set` to return a PersistentHashSet specifically.

I'd be happy with a `set?` passthrough myself if that seems reasonable.

Have opened an issue with the relevant patches: http://dev.clojure.org/jira/browse/CLJ-1410

This my first Clojure submission, so please double-check for possible mistakes. Thanks a lot, cheers! :-)

Steve Miner

unread,
Apr 26, 2014, 10:20:51 AM4/26/14
to cloju...@googlegroups.com
I'd like to see some benchmarks. I have outsmarted myself in the past by optimizing special cases only to find out that the benefit was negligible and the cost was higher than expected for general usage. Also, the JIT has surprised me -- it's smarter than I thought most of the time, and extra branches can just get in the way of better compilation.

It would also be interesting to know how often (set already-a-set) was called in some existing projects. In my code, it would be rare, but that's just me.

Just a suggestion about the proposed patches in CLJ-1410: I would not change the doc strings. I would like to leave some room for the implementation to do what it thinks is best without demanding that the result is identical to an existing set. That's just a performance optimization, not a language guarantee. IMHO, of course.

Steve Miner


Peter Taoussanis

unread,
Apr 26, 2014, 10:56:33 AM4/26/14
to cloju...@googlegroups.com
Hi Steve, thanks for the input!

I'd like to see some benchmarks.  I have outsmarted myself in the past by optimizing special cases only to find out that the benefit was negligible and the cost was higher than expected for general usage.  

Set/vector construction don't appear to be particularly well-suited to JIT optimization:
(let [coll #{:a :b :c :d :e :f :g :h :i :j :k}]
  (time (dotimes [_ 100000] (set coll)))                       ; ~480ms
  (time (dotimes [_ 100000] (if (set? coll) coll (set coll)))) ; ~11ms
  )

Will a difference like this be significant? Depends on the context, I guess. I tend to deal with large collections, and these tend to exaggerate the effects of unnecessary reconstruction. Spotted this when a tight loop was performing worse than expected.

It would also be interesting to know how often (set already-a-set) was called in some existing projects.  In my code, it would be rare, but that's just me.

Anecdotally, I count around 36 cases in the ~100k lines I've got on hand. Of those maybe 3-4 that matter. That may or may not be representative though.

To clarify: I don't think this is a big deal either way (and it's easy to work around once you're aware of it).

Just a suggestion about the proposed patches in CLJ-1410: I would not change the doc strings.  I would like to leave some room for the implementation to do what it thinks is best without demanding that the result is identical to an existing set.

Sure, wasn't sure what the preference there would be.

Cheers :-)

Jean Niklas L'orange

unread,
Apr 26, 2014, 11:00:52 AM4/26/14
to cloju...@googlegroups.com
On 26 April 2014 09:53, Peter Taoussanis <ptaou...@gmail.com> wrote:
Hey Alex,
 
I think I'm more in agreement with Kyle that "set" only needs to guarantee that the return is an IPersistentSet, not that it's a PersistentHashSet.

Given that this is a core fn, my thinking was just to suggest the most conservative possible move. I can imagine that some folks may be incorrectly depending on `set` to return a PersistentHashSet specifically.

That was my thought as well, but if it's okay to change the semantics of the implementation, I'd agree with Kyle and Alex that it is better to have `set`/`vec` short-circuit if the input collection already satisfies `set?`/`vector?`.

Alex Miller

unread,
Apr 26, 2014, 11:13:55 AM4/26/14
to cloju...@googlegroups.com
I agree with Steve's comments on this ticket but I'd recommend moving this discussion to the ticket comments at this point.

As a general rule of thumb, any change made for performance reasons should demonstrate repeatable benefits in the ticket.



--
Reply all
Reply to author
Forward
0 new messages