What recommendation to make for updating hash calc of alternate Clojure collections?

328 views
Skip to first unread message

Andy Fingerhut

unread,
Feb 7, 2014, 11:14:50 AM2/7/14
to cloju...@googlegroups.com
With Clojure 1.6's new hash function, there are several alternate Clojure collection implementations that maintain hash consistency with Clojure 1.5 and earlier, i.e.: if (= coll1 coll2), then (= (hash coll1) (hash coll2)), for all collections coll1 and coll2 that can compare = to each other.

I know there are patches in progress to make the Murmur3 hash APIs available to these collection authors, but there remain at least 2 choices for them to make:

+ Use those new APIs in all cases in their code.  This makes the release of the library making that change no longer work with Clojure 1.5.1 and earlier.

+ Use those new APIs only if they exist in the version of Clojure in use.  The library could check this condition once the first time it calculated the hash of one of its collections.  The benefit to library users is that they aren't required to switch over to Clojure 1.6 in order to update to a later release of the library.

Mark, any thoughts on what you planned to do for data.priority-map?

Andy

Mark Engelberg

unread,
Feb 7, 2014, 12:43:30 PM2/7/14
to clojure-dev
On Fri, Feb 7, 2014 at 8:14 AM, Andy Fingerhut <andy.fi...@gmail.com> wrote:

Mark, any thoughts on what you planned to do for data.priority-map?

priority-map should work out of the box, cross-version, I think, because internally, one of its components is a regular map, and the hash of the priority-map is simply the hash of the underlying map.  So it should behave the way regular maps behave.

 

Alex Miller

unread,
Feb 7, 2014, 12:57:00 PM2/7/14
to cloju...@googlegroups.com
Equality has not changed, just hashing. So, the major place where this would matter is when collections are put inside hashmaps as keys or inside hashsets (the latter seems more common). 

If you had either all internal or all external collection instances, they would use different rules for their hashes, but would be consistent, so this case seems fine to me.

If you had a mixture of internal and external collections inside a map key or set following different rules then you could get two "equal" maps with different hashcodes, which would be bad. The chances of this seem relatively small though. (Please correct me if this sounds like something that could happen to you though!)

FYI, the new hashing algorithms for external collections is now documented here: 




--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev.
For more options, visit https://groups.google.com/groups/opt_out.

Andy Fingerhut

unread,
Feb 7, 2014, 2:04:07 PM2/7/14
to cloju...@googlegroups.com
Alex:

I am glad you have documented how other collections should implement hasheq going forward from Clojure 1.6, but that doesn't address the question of what a collection library ought to do if it wants to continue to support Clojure 1.5.1 as well as the 1.6 and later.

data.priority-map was a bad example, as Mark pointed out it simply passes on the hash calculation to an existing Clojure collection, so it is coming along 'for free' with the hash change.

data.avl is probably a better example -- it currently implements hasheq on its collections in its own code.

In fact, it calls the Java method mapHasheq, which still exists in Clojure's Java code, but still implements the old hash calculation.  Is it planned to deprecate that method?  Update it?  Something else?

Thanks,
Andy

Alex Miller

unread,
Feb 7, 2014, 2:17:22 PM2/7/14
to cloju...@googlegroups.com
On Fri, Feb 7, 2014 at 1:04 PM, Andy Fingerhut <andy.fi...@gmail.com> wrote:
Alex:

I am glad you have documented how other collections should implement hasheq going forward from Clojure 1.6, but that doesn't address the question of what a collection library ought to do if it wants to continue to support Clojure 1.5.1 as well as the 1.6 and later.

That was just an FYI, I didn't intend it as an answer to the first question.

My reasoning in last email is that the scope for breakage is relatively small, so one response could be: do nothing. But I would be interested in knowing whether that reasoning does not hold up for someone. 

I don't see any easy solution in Clojure itself to the issue - library writers could of course provide multiple version-specific implementations to work around this until Clojure 1.6 is pervasive.
 
data.priority-map was a bad example, as Mark pointed out it simply passes on the hash calculation to an existing Clojure collection, so it is coming along 'for free' with the hash change.

data.avl is probably a better example -- it currently implements hasheq on its collections in its own code.

In fact, it calls the Java method mapHasheq, which still exists in Clojure's Java code, but still implements the old hash calculation.  Is it planned to deprecate that method?  Update it?  Something else?

Interesting - I think that should be updated. deftype is using it to implement hasheq right now.

Michał Marczyk

unread,
Feb 9, 2014, 2:56:30 AM2/9/14
to cloju...@googlegroups.com
I was planning on using a macro to check (Class/forName
"clojure.lang.Murmur3") or *clojure-version* in both core.rrb-vector
and data.avl. Clojure already uses this approach (checking
Class/forName) to compile against the appropriate ForkJoin
implementation in clojure.core.reducers. As long as the new hashing
API remains stable, this should be a straightforward way to make an
external impl's hashing strategy future-proof. And if things change
again, we'll just use the same trick. (Andy has just provided a patch
to data.avl using the *clojure-version* approach -- thanks!)

Incidentally, I wish the basic methods -- hashOrdered, hashUnordered
-- were exported by a class with a generic name, perhaps HashEq or
HashUtils, just so that they wouldn't need to change if the algorithm
gets swapped out for another one at some point. Introducing Clojure
functions for hashing would work too, I suppose, although I think I'd
actually prefer static methods, at least until Clojure's interfaces /
protocols are unified across platforms.

As for doing nothing: not sure if that refers to Clojure itself or to
library implementers? If the former, with a comprehensive public
hashing API coming to Clojure soon there is indeed nothing else to be
done IMO. If the latter, I don't think this is an acceptable solution
(I think extensions of core ADTs should be capable of being used as
drop-in replacements for the built-ins, for reasons including "least
surprise", ability to be fed to existing data transformations
alongside the built-ins etc.); happily, with the new hashing API in
place, there's also no reason not to do the right thing.

Cheers,
Michał

al...@puredanger.com

unread,
Feb 9, 2014, 9:17:11 AM2/9/14
to cloju...@googlegroups.com, cloju...@googlegroups.com
On Feb 9, 2014, at 1:56 AM, Michał Marczyk <michal....@gmail.com> wrote:

I was planning on using a macro to check (Class/forName
"clojure.lang.Murmur3") or *clojure-version* in both core.rrb-vector

*clojure-version* is the preferred approach. The use of Murmur3 (class or the algorithm) should be considered an implementation detail that may change in the future.  The parts you should consider standard and public are exactly the parts documented at http://clojure.org/data_structures#hashNamely, that unordered collections sum the hasheq of their elements and that ordered collections use the 31* + algorithm and that both finish with a call to mix-collection-hash. The algorithm for mix-collection-hash is subject to change.

and data.avl. Clojure already uses this approach (checking
Class/forName) to compile against the appropriate ForkJoin
implementation in clojure.core.reducers.
As long as the new hashing
API remains stable, this should be a straightforward way to make an
external impl's hashing strategy future-proof. And if things change
again, we'll just use the same trick. (Andy has just provided a patch
to data.avl using the *clojure-version* approach -- thanks!)

Just to be doubly clear, the Murmur3 class is not part of the API and may not remain stable. Following the advice above is the best way to remain future- proof.


Incidentally, I wish the basic methods -- hashOrdered, hashUnordered
-- were exported by a class with a generic name, perhaps HashEq or
HashUtils, just so that they wouldn't need to change if the algorithm
gets swapped out for another one at some point. Introducing Clojure
functions for hashing would work too, I suppose, although I think I'd
actually prefer static methods, at least until Clojure's interfaces /
protocols are unified across platforms.

These methods should not be considered public or stable. That's why they are not better exposed.

As for doing nothing: not sure if that refers to Clojure itself or to
library implementers?

I meant in Clojure itself. I think external collections should of course be moving towards a consistent hash algorithm.

Andy Fingerhut

unread,
Feb 9, 2014, 2:29:20 PM2/9/14
to cloju...@googlegroups.com
On Sun, Feb 9, 2014 at 6:17 AM, al...@puredanger.com <al...@puredanger.com> wrote:
The parts you should consider standard and public are exactly the parts documented at http://clojure.org/data_structures#hashNamely, that unordered collections sum the hasheq of their elements and that ordered collections use the 31* + algorithm and that both finish with a call to mix-collection-hash. The algorithm for mix-collection-hash is subject to change.

Alex, I understand that it is useful to publish an API that allows collections that want to calculate their hashes incrementally, in such a way that they can know how to do it.

I was wondering if for those collections that do not want to implement incremental hashing, if it might also be helpful to publish as part of the API 2 functions that calculate the entire hash of ordered and unordered collections, perhaps named something like (hash-ordered-coll <ordered-coll>) and (hash-unordered-coll <set-or-map>).  These would be straightforward implementations of the published methods of calculating the hashes.

The advantage to doing so that I can see would be (1) slightly simpler to implement hashing for collections that do not want to implement incremental hashing, and (2) less likely to need updating if the hash function changes in the future.

Andy

Alex Miller

unread,
Feb 9, 2014, 2:45:11 PM2/9/14
to cloju...@googlegroups.com
I suggested similar to Rich and he didn't seem to want to do this now, but I will raise it again before we go beta. Need to consider performance too - if your collection happens to be implemented in Java, then it is probably better to implement hasheq directly in your class. If in Clojure, the published algorithms may not be the fastest way to implement hash-ordered and hash-unordered - I'd guess a loop/recur version is faster. I chose to write it with reduce for clarity.

Alex Miller

unread,
Feb 10, 2014, 9:17:48 AM2/10/14
to cloju...@googlegroups.com
It occurred to me that another option for maps (once http://dev.clojure.org/jira/browse/CLJ-1344 is applied) is to call APersistentMap/mapHasheq(). 

Alex Miller

unread,
Feb 10, 2014, 10:01:01 AM2/10/14
to cloju...@googlegroups.com
We will add public functions for these.

Rich Hickey

unread,
Feb 16, 2014, 9:20:17 AM2/16/14
to cloju...@googlegroups.com
And we could back port these entry points to 1.5 (over the 1.5 logic for hashing).

Any interest in that? It would mean asking your 1.5 lib users to move to 1.5.2 rather than forcing to 1.6, but you could then target both with one set of code.

Andy Fingerhut

unread,
Feb 16, 2014, 4:08:50 PM2/16/14
to cloju...@googlegroups.com
I can't answer for the maintainers of the collection libraries, but for those cases where a change is needed to support the new Clojure 1.6 hashes, the changes I have worked on for several libraries appear to be fairly easy to support Clojure 1.6.0 all the way back to 1.4.0, even if no Clojure 1.5.2 as you describe is created.

For example, here is a proposed set of changes to immutable-bitset to make it compatible with 1.5.1 and 1.6.0: https://github.com/jafingerhut/immutable-bitset/commit/515c8607f6472bdcad3443600761b7229ded1f4d

and here are similar proposed changes (currently only on a non-master branch) for core.rrb-vector: https://github.com/clojure/core.rrb-vector/commit/a869c9dcc48be3894ee14ba23a7b4e32a7658a54

Such changes make it easy to maintain compatibility back to Clojure 1.4.0, but not 1.3.0, because I believe 1.4.0 is the first time when the IHashEq interface was introduced.

Andy

Michał Marczyk

unread,
Feb 16, 2014, 4:33:04 PM2/16/14
to cloju...@googlegroups.com
I agree with Andy. And since it's so easy to support both the new and
the old hashing protocol, I see no reason to break compatibility with
<= 1.5.1:

(compile-if (resolve 'clojure.core/hash-ordered-coll) new-hash old-hash)

(I lifted (a simplified version of) compile-if from c.c.reducers to do
this in core.rrb-vector -- on a branch, as Andy mentioned, but I'll be
merging it to master and cutting a release soon.)

Cheers,
Michał

Mark Engelberg

unread,
Feb 18, 2014, 9:30:39 PM2/18/14
to clojure-dev
mix-collection-hash's inputs appear to be type hinted as long, but the underlying java algorithm takes ints.


Andy Fingerhut

unread,
Feb 19, 2014, 12:04:51 AM2/19/14
to cloju...@googlegroups.com
Isn't it the case that Clojure functions can only have primitive type hints on args and return values that are long or double?  I thought that restriction was made to avoid combinatorial explosion in some part of the implementation.

Andy

Alex Miller

unread,
Feb 19, 2014, 12:06:28 AM2/19/14
to cloju...@googlegroups.com
It's not possible to type hint an ^int return but it should be a long in int range. If you're using the specified algorithm to compute the basis, you should be accumulating with int overflow (which you can get in Clojure with unchecked-add-int or in Java with ints of course).

Mark Engelberg

unread,
Feb 19, 2014, 1:01:20 AM2/19/14
to clojure-dev
The point is that mix-collection-hash takes a hash basis and a count of a collection, both of which are ints, and returns an int, which presumably you're going to store in some field that is a primitive int.  So without the ability to type-hint the function as an int, there's a perf hit vs going directly to the Java function:

=> (time (dotimes [n 100000000] (mix-collection-hash 1 2)))
"Elapsed time: 957.731399 msecs"
nil
=> (let [x (int 1) y (int 1)] (time (dotimes [n 100000000] (mix-collection-hash x y))))
"Elapsed time: 498.924461 msecs"
nil
=> (let [x (int 1) y (int 1)] (time (dotimes [n 100000000] (clojure.lang.Murmur3/mixCollHash x y))))
"Elapsed time: 47.864854 msecs"

Maybe mix-collection-hash should be a macro, since it can't be properly type-hinted?

Alex Miller

unread,
Feb 19, 2014, 8:30:14 AM2/19/14
to cloju...@googlegroups.com
The return type of mix-collection-hash is currently unhinted and if there is boxing there, that's probably the bulk of the cost. Typically the long/int conversion is a non-issue on modern hardware. I will take a look at this today.

Bronsa

unread,
Feb 19, 2014, 8:38:57 AM2/19/14
to cloju...@googlegroups.com

an option would be to provide an :inline implementation to avoid all casting/boxing

Alex Miller

unread,
Feb 20, 2014, 9:56:47 PM2/20/14
to cloju...@googlegroups.com
Mark, can you chime in here on this issue:

I have a patch on there for inlining - if you could point to the code using this and also try out the patch and report differences, that would be helpful.

Alex



On Wed, Feb 19, 2014 at 12:01 AM, Mark Engelberg <mark.en...@gmail.com> wrote:

Mark Engelberg

unread,
Feb 21, 2014, 12:15:40 AM2/21/14
to clojure-dev
Thanks.  I should clarify -- I don't really believe that the lack of a type hint on the mix function is the primary factor causing the overall slowdown I've been seeing with 1.6.

My thought process was that I noticed a slowdown, went investigating, and since adding the mix function was one of the only things new in my code, looked closely at the implementation.  I noticed the type hint seemed insufficient to get the best possible performance, and thought it might be a factor.  I actually tried changing my code to directly call Murmur3.java's method, and it only made a small difference (my number of calls to the mix function is on the order of thousands, which likely isn't enough to see a huge difference), but I reported the type issue anyway because it appeared to potentially be limiting the performance of the function in clojure.core (based on the simple timing tests I reported here).  I thought you might want to make sure the type hints are correct so people aren't tempted to just go straight to the Murmur3.java.

Other than the microbenchmark I posted here, I don't have any concrete evidence that the mix function is a problem.

To modify for 1.6, my incrementally-hashing data structure now needs to track two cached hash codes (premix and postmix).  As things are added to the collection, the premix hashcode is updated using the same algorithm as before, but the postmix hashcode is also updated by calling clojure's mix function on the new premix hash.

Tonight, looking through the code again, I noticed a potential culprit.  I had created a PremixHash protocol to serve as a getter for the premix hashcode, with a protocol function (premix-hash self), implemented by the collection to just return the premix-hashcode.  Since protocols can't be type hinted with primitive return values, this means my getter returns the premix-hashcode boxed.  Ugh.  By nixing this getter protocol, and changing the code to retrieve the premix-hashcode with appropriately type-hinted calls to `.premix-hashcode`, I have gotten a significant speed-up.  Now, my code runs only 5% slower on 1.6 versus 1.5.

I find it plausible that the remaining speed difference is due to the fact that hashing is slightly slower in Clojure 1.6, and I do lots of hashing, putting things in small collections where I don't reap much benefit from the improved hash distribution.  But it's also possible that there's still lurking in my code some subtle boxing issue like what I discovered tonight.

I do believe that the inlined patch helps, so I hope you keep it, but in the context of my own code, the impact is too subtle for me to be certain that it makes a difference.
Reply all
Reply to author
Forward
0 new messages