
Hashing strategies - Executive summary


Mark Engelberg

Oct 26, 2013, 5:37:49 PM
to clojure-dev
I've attempted to summarize my observations and thoughts about hashing in a report, which hopefully can form the basis for a design page on the issue:

https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7OTPZLuv3Uo_duY5o/edit?usp=sharing

Andy Fingerhut

Oct 26, 2013, 5:48:52 PM
to cloju...@googlegroups.com
Very nicely written, Mark.  Thank you.  I will let you know if I think of any comments, questions, or improvements, but it looks very good the way it is now.  This would be a great base on which to add experimental results, starting with the N-queens program, and perhaps also some of the performance issues you have experienced, I think while optimizing Instaparse.

Are there any other suggestions for other projects using large sets and maps that would be good to benchmark with these kinds of proposed changes?

Andy


On Sat, Oct 26, 2013 at 2:37 PM, Mark Engelberg <mark.en...@gmail.com> wrote:
I've attempted to summarize my observations and thoughts about hashing in a report, which hopefully can form the basis for a design page on the issue:

https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7OTPZLuv3Uo_duY5o/edit?usp=sharing


Tim Visher

Oct 27, 2013, 7:30:20 AM
to cloju...@googlegroups.com
I don't have much to add to the discussion as a whole as I haven't
experienced many needs to optimize my code yet, but I did want to say
that the write up is fantastic. Thanks for putting the effort into
writing that!

kovas boguta

Oct 27, 2013, 7:06:23 PM
to cloju...@googlegroups.com
Nice job putting together the summary.

Maybe my impression is an artifact of the examples, but isn't the
main problem that the hash of numbers is not evenly distributed?

If you have a higher-order hash for e.g. collections, you want to make
the assumption that the input hash codes are evenly distributed.
Otherwise, someone is failing by definition. You don't want to be
re-munging existing codes "just to be safe"; that's a recipe everyone
then has to remember to implement.

Suppose, for argument's sake, that every number hashed to a random number.

In the vector example, I get 39,999 unique hashes (out of 40,000),
compared to 6,369 in the current scheme.

So the magic number of 31 is only a problem if the inputs are not
evenly distributed to begin with. The proposed solution is basically
deferring a true hash of a number until it is used in a collection.

From a design point of view, it would probably be simpler and more
robust for numbers to get hashed upfront. Then the contract for the
hashing function is simply: if the inputs are evenly distributed, then
the output is evenly distributed.
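The claimed counts are easy to reproduce at a REPL. A hedged sketch (rand-hash is an illustrative stand-in that simulates a well-distributed number hash by memoizing one random int per input, and identity stands in for the current behavior of small numbers hashing to themselves):

(def rand-hash (memoize (fn [_] (rand-int Integer/MAX_VALUE))))

(defn pair-hash
  "Java-style ordered hash of a pair, parameterized by the element hash."
  [elem-hash [a b]]
  (+ (* 31 (+ 31 (elem-hash a))) (elem-hash b)))

(let [pairs (for [a (range 200), b (range 200)] [a b])]
  {:current (count (distinct (map #(pair-hash identity %) pairs)))    ;=> 6369
   :random  (count (distinct (map #(pair-hash rand-hash %) pairs)))}) ;=> ~40000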

Colin Fleming

Oct 27, 2013, 9:02:30 PM
to cloju...@googlegroups.com
Fantastic write-up, Mark - thanks.


On 27 October 2013 10:37, Mark Engelberg <mark.en...@gmail.com> wrote:
I've attempted to summarize my observations and thoughts about hashing in a report, which hopefully can form the basis for a design page on the issue:

https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7OTPZLuv3Uo_duY5o/edit?usp=sharing


Mark Engelberg

Oct 28, 2013, 1:24:51 AM
to clojure-dev
If number hashes are evenly distributed, I agree that you don't need to make any alteration to the magic number of 31.  Vector hashing will be fine.

However, for example, for all numbers a,b,c,d:
#{[a b] [c d]} will still hash to the same value as
#{[c b] [a d]}

#{#{a b} #{c d}} will still hash to the same value as
#{#{a c} #{b d}}

and other similar issues that arise from permuting values.  I don't think Paul's code would be helped all that much if only number hashing were improved.
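For instance, with concrete values under the 1.5.1 scheme (pairs hash to 31*(31 + a) + b and sets to the sum of their element hashes):

(hash [1 2]) ;=> 994
(hash [3 4]) ;=> 1058
(hash [3 2]) ;=> 1056
(hash [1 4]) ;=> 996

;; 994 + 1058 = 1056 + 996 = 2052, so:
(= (hash #{[1 2] [3 4]}) (hash #{[3 2] [1 4]})) ;=> true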

Michał Marczyk

Oct 28, 2013, 5:29:55 AM
to cloju...@googlegroups.com
Great write-up indeed, wow.

With regard to alternative implementations of the various abstract
data types present in Clojure's core library: I already use static
methods like APersistentMap/mapEquals and APersistentMap/mapHash in
preference to my own reimplementations;
unfortunately only some of the map / set methods are backed by such
static methods (set hashCode / hasheq are actually methods of
APersistentSet, inherited by the concrete types). If this could be
changed so that every abstract data type had hash functions (the one
used in hashCode and the one used in hasheq) implemented as static
methods somewhere, possibly taking something like an iterator as the
input argument (or whatever general approach is most efficient), that
would make the problem of out-of-sync reimplementations go away
completely (and in fact it would make things more DRY etc.). I'm
hereby volunteering to write any necessary patches along these lines.

(In fact in ClojureScript set hashing uses cljs.core/hash-iset, so
e.g. avl.clj is able to call that. As for compatibility of libraries
with earlier Clojure versions -- I'd just use conditional compilation
to accommodate any differences in hashing strategies, which is easy to
accomplish with macros if the condition in question only involves the
Clojure version number. Clearly if Clojure introduces an appropriate
static method for an abstract data type, the version where this
happens will be the final one for which a branch will need to be
introduced. Conditional compilation is also an acceptable solution if
hand-rolled reimplementations happen to be significantly faster than
the static method using an iterator for some data structure.)

As for the general problem of hashing strategies, there are two
remarks I'd like to make. I'd really love it if Clojure could adopt
some universal hashing scheme for hasheq in place of the fixed hash
code approach. (Clearly any such scheme would need to be benchmarked
very extensively.) Also, I'm not too worried about a small decrease in
hashing performance -- hashing is useful pretty much exclusively in
the context of putting things in hash maps / sets (right?), where I'd
expect avoiding collisions to have a much greater performance impact
than speeding up the hash function by a small factor.

Cheers,
Michał

Christophe Grand

Oct 28, 2013, 7:21:25 AM
to clojure-dev
Great write-up Mark!

There's one type of collision with maps (under the current scheme or your proposed one) that bothers me: all [x x] entries hash to 0.
Should/could something reasonably be done about that?

Christophe




On Sat, Oct 26, 2013 at 11:37 PM, Mark Engelberg <mark.en...@gmail.com> wrote:
I've attempted to summarize my observations and thoughts about hashing in a report, which hopefully can form the basis for a design page on the issue:

https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7OTPZLuv3Uo_duY5o/edit?usp=sharing




--
On Clojure http://clj-me.cgrand.net/
Clojure Programming http://clojurebook.com
Training, Consulting & Contracting http://lambdanext.eu/

Stuart Halloway

Oct 28, 2013, 1:17:42 PM
to cloju...@googlegroups.com
Thanks Mark (and others) for pulling this information together.  I talked to Rich, and he is open to fast-tracking hasheq improvements, so long as we can do the diligence to get it right in one go.

One thing that is missing is more appeal to authority. :-)  We will need to expand the algorithmic recommendations to include pointers to what experts recommend, and what experienced users report.  This is a lot less work than doing the basic research and benchmarking ourselves.

As a starting point, why isn't Murmur3 (a well-known algorithm) used by Scala (a language that experienced a similar problem) a good choice?   The relevant Scala commit is https://github.com/scala/scala/commit/5f905da8b66fe69b09d8b62c917970cde14e23ba.  I don't think they took on numbers, but we could.

Cheers,
Stu


kovas boguta

Oct 28, 2013, 3:39:31 PM
to cloju...@googlegroups.com
Improving the problem with the permutations & nested structure is
important, but it's not the dominant factor.

The dominant factor is the integer partitions, and that can be fixed
by randomizing the number hashes.

In the collision data, all the samples have a certain structure:
taking the vector pairs, the first numbers sum to 10, and the second
numbers sum to 5. That these hash to the same thing is a consequence
of the vector and hashset interaction.

Now, the point is that there are many ways to sum to 10; in fact there
are 42 distinct integer sets that sum to 10, and that's *before* you
get into permutations.

To match our real-world example, let's put a cap of length 5 on each
partition, and only allow integers <= 4 as in our data. There are 12
such partitions that sum to 10, and 6 that sum to 5.

The total number of permutations on top of these partitions is 381 and
121 respectively, which leads to a grand total of 46,101 ways to get
the same hash code by substituting numbers between 0-4 into the sample
data.

If numbers hash to random numbers, the integer partition problem goes
away, and we have a much smaller set of equivalence classes based on
permutations. The set of 46,101 breaks up into 72 smaller sets, with a
median size of 600 elements and a max of 3600.

So that's a pretty big improvement over 46,101.
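Those partition counts are easy to verify with a quick sketch (the function is illustrative, not from the thread):

(defn partitions
  "Integer partitions of n into at most len parts, each part <= cap."
  [n cap len]
  (cond (neg? n) []
        (zero? n) [[]]
        (or (zero? cap) (zero? len)) []
        :else (concat (map #(cons cap %) (partitions (- n cap) cap (dec len)))
                      (partitions n (dec cap) len))))

(count (partitions 10 4 5)) ;=> 12
(count (partitions 5 4 5))  ;=> 6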

Andy Fingerhut

Oct 28, 2013, 4:11:26 PM
to cloju...@googlegroups.com
Kovas:

Pick any hash function you like on the integers, arbitrarily expensive to compute, but deterministic.

I believe it is easy to prove that if you leave the vector hash function for a pair as the current (31*hash(1st elem) + hash(2nd elem)) (or perhaps it is 31 times that plus a constant, but that is irrelevant to my point), and leave the set hash function as the sum of the hashes of the elements, then the hash for these two sets will still be the same:

#{[a b] [c d]}
#{[c b] [a d]}

And there are many, many other such sets of 2-element vectors that have equal hashes.  In fact, all of those where the first elements contain a multiset of values A, and the second elements contain a multiset of values B, will have the same hash.

If I'm wrong on that, I would appreciate it if you can provide a counterexample where that statement is false.  It seems pretty simple to prove it is true.

The hash function of collections is a significant factor here.

Andy

Andy Fingerhut

Oct 28, 2013, 4:35:19 PM
to cloju...@googlegroups.com
I am assuming the smiley after "appeal to authority" indicates you really mean something closer to "appeal to experience and knowledge".  Otherwise we can all go back to believing the Earth is immobile at the center of the universe. :-)

Murmur3 [1] looks to me like a reasonable candidate for integers and vectors/sequences, but unless you are willing to sort the hash values of all set elements before calculating the hash of the set, I do not see how it will help provide a good hash function on sets.

I wanted to toss out some hastily-found links for thought, in case anyone had time and interest to dig into them.  Most of them mention multisets, which is more general than we need for sets, but that shouldn't be a problem. [2] [3] [4]  I don't want to turn this into a research project -- just wanted to indicate that some careful thought had been put into possibly relevant ideas.

Andy


kovas boguta

Oct 28, 2013, 5:16:53 PM
to cloju...@googlegroups.com
Hi Andy,

You are totally correct. It is significant. And I'm all in favor of
changing this property.

The question is, what is the biggest problem overall? If you could
change 1 thing, what would have the most impact?

My argument is that numbers affect everything else, all the upstream consumers.

The counterclaim was that my proposal is irrelevant to the performance
of the sample problem at hand. My counter-counter-claim shows that in
fact it is likely to significantly improve it.

Of course one can and should improve the hashing for the specific case
of vectors inside sets (and more generally account for structure). But
there is evidence that fixing numbers will lift all boats, including
many of the examples under consideration, with a single change.

kovas boguta

Oct 28, 2013, 5:25:54 PM
to cloju...@googlegroups.com
Btw, does anyone have a github project with benchmark code for this problem?

I'd be happy to test my hypothesis about numbers, but that would be
easier if I could just reuse some set of benchmarks.

Mark Engelberg

Oct 28, 2013, 8:14:19 PM
to clojure-dev
On Mon, Oct 28, 2013 at 4:21 AM, Christophe Grand <chris...@cgrand.net> wrote:
Great write-up Mark!

There's one type of collision with maps (under the current scheme or your proposed one) that bothers me: all [x x] entries hash to 0.
Should/could something reasonably be done about that?

Christophe

Thanks for pointing that out.  I've expanded my write-up of map hashing to address that, and altered my recommendation at the end of that section.  I'm continuing to investigate, benchmark, and think about the point you raised about map hashing, but I wanted to update the document to reflect where I'm at right now.  Let me know what you think of the solution I've outlined.

Mark Engelberg

Oct 28, 2013, 8:45:07 PM
to clojure-dev
On Mon, Oct 28, 2013 at 10:17 AM, Stuart Halloway <stuart....@gmail.com> wrote:
As a starting point, why isn't Murmur3 (a well-known algorithm) used by Scala (a language that experienced a similar problem) a good choice?   The relevant Scala commit is https://github.com/scala/scala/commit/5f905da8b66fe69b09d8b62c917970cde14e23ba.  I don't think they took on numbers, but we could.

I looked briefly at Murmur3 and observed that it had a lot more complexity than the current Java-inspired algorithms.  I mostly didn't consider it at that point because I surmised that any solution that spent more than a few cycles over the existing algorithm wouldn't be a serious contender.  I approached it all from the standpoint of "What is the simplest possible change that would address the collision problem?"  Once we had a way to solve the problem with a few cheap xors and bitshifts, I figured we were on the right track.

I agree that we need to test out a variety of solutions, including established solutions like murmur.  I think now that we know there is interest in improving hashing, we can create a ticket to hang potential patches on.  With the patches, we can then test the effect of hashing variations on substantial code bases.
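For reference, the "few cheap xors and bitshifts" amount to a mix step along these lines (an illustrative xorshift variant with 13/17/5 shifts; the exact constants in the report may differ):

(defn xorshift32
  "One xorshift round over the low 32 bits of x."
  ^long [^long x]
  (let [m 0xffffffff
        x (bit-and x m)
        x (bit-and (bit-xor x (bit-shift-left x 13)) m)
        x (bit-xor x (bit-shift-right x 17))
        x (bit-and (bit-xor x (bit-shift-left x 5)) m)]
    x))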

Andy Fingerhut

Oct 28, 2013, 8:55:59 PM
to cloju...@googlegroups.com
Probably multiple people have their own little tests similar to this one, but here [1] is one that I forked from Paul Butcher's chess-clojure Github repo that had the original Scala-converted-to-Clojure N-queens program that started these discussions.  Do 'lein run' from the project root to see some stats output on the average and max number of hash collisions that occur among some selected sets of objects using Clojure's built-in hash and 2 alternate ones.  See doc strings for alt-hash-1 and alt-hash-2 in the code for what they are intended to do, but in brief alt-hash-1 only changes the hash of integers, but alt-hash-2 changes the hash of integers and of vectors.  In both cases it is using Murmur3.

Caveat emptor: I have not carefully tested against any other reference implementation.  This code probably has bugs.  Feel free to send pull requests or comments if you find any.  It is currently not good for measuring the run-time performance of different hash functions -- only for comparing how many collisions they have.

Interesting note on Paul's N-queens program: Clojure's built-in hash on the solutions has some cases where the *average* number of collisions is in the hundreds.  That is, the average linked-list length in a HashCollisionNode of a PersistentHashSet containing those solutions would have hundreds of elements, scanned linearly.  The longest linked lists are in the thousands of elements.  You have to change the first two args to the 'solve' function to 6 6 to see those results.

[1] https://github.com/jafingerhut/chess-clojure

Andy

Christophe Grand

Oct 29, 2013, 5:16:01 AM
to clojure-dev

On Mon, Oct 28, 2013 at 6:17 PM, Stuart Halloway <stuart....@gmail.com> wrote:
As a starting point, why isn't Murmur3 (a well-known algorithm) used by Scala (a language that experienced a similar problem) a good choice?   The relevant Scala commit is https://github.com/scala/scala/commit/5f905da8b66fe69b09d8b62c917970cde14e23ba.  I don't think they took on numbers, but we could.

I haven't spent enough time trying to factor them, but:
* Murmur3 does not seem amenable to structural caching.
* Murmur2A is incremental but in an "append-only" way: if a vector is already hashed and you assoc on its first item you'll have to recompute the whole hash.

Christophe

Mark Engelberg

Oct 30, 2013, 2:10:43 AM
to clojure-dev
I've added two sections at the end of the Report addressing some of these questions.  I discuss the munge operations and point to the source paper, and also discuss my impression of Murmur3.
For those that want to play around with Murmur, it's really easy to incorporate it into Clojure from the Guava library.
Add [com.google.guava/guava "15.0"] to your dependencies.

Here is an example of how to use Murmur to thoroughly mash a 64-bit long into a 32-bit int hash:
(import '[com.google.common.hash Hashing HashFunction])

(let [^HashFunction f (Hashing/murmur3_32)]
  (defn number-hash [^long l]
    (.asInt (.hashLong f l))))

Speaking of mashing 64-bit longs into 32-bit ints, Mikera has suggested that the Clojure-universe hashes (as opposed to the hashCode method) should just stay longs.  Does anyone know if there are aspects of Clojure's built-in data structures such as hash maps and hash sets that rely on the fact that hashes are ints?

Andy Fingerhut

Oct 30, 2013, 2:42:39 AM
to cloju...@googlegroups.com
I have a few results added to this git repo [1].  In particular, this file [2].  Here are a few sample lines of output and what they mean.  Unfortunately they only look right in a fixed-width font:

          # hash  avg         max
# items   vals    collisions  collision  hash fn name
-------  -------  ----------  ---------  ------------------

Solve (:K :K :N :R :B :Q) example set elem #{[:B [0 0]] [:K [0 2]] [:K [0 4]] [:N [2 0]] [:Q [4 3]] [:R [3 1]]}
 180568      301      599.89       4976  Clojure 1.5.1 hash
 180568     8321       21.70       1648  alt-hash-1
 180568   180567        1.00          2  alt-hash-2
 180568   180419        1.00          2  engelberg-hash


The Clojure 1.5.1 hash line numbers are: 180,568 different values (N-queens solutions), which have 301 unique hash values by Clojure's hash function.  This makes an average number of values per hash value of 599.89, and among those 301 different hash values, one of them had 4,976 different values that all hashed to it.

The other lines show the same statistics for 3 other hash functions, including the one recommended in Mark's writeup on the last line.  alt-hash-1 and alt-hash-2 use Murmur3_32, or at least a version that I coded up but have not compared against the one in the guava library that Mark mentioned.



Andy Fingerhut

Oct 31, 2013, 10:31:43 AM
to cloju...@googlegroups.com
I just posted almost this same message to the Clojure group, but it seemed reasonable to add it to this discussion thread:

Mark Engelberg has a very nice document describing the existing hash functions used by Clojure, examples with why they can cause collisions so often, and suggestions for changes to them to have fewer collisions in these cases.  https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7OTPZLuv3Uo_duY5o/edit

I have made modifications to a fork of the Clojure code base with his suggested hash modifications here, for experimentation purposes: https://github.com/jafingerhut/clojure   I also have some modifications to Paul's N-queens Clojure program to print out additional statistics and try out different hash functions here: https://github.com/jafingerhut/chess-clojure

Running Paul's N-queens problem with a 6x6 board without any changes to Clojure takes about 7 mins on one of my computers.  With those changes to the hash functions it finishes in about 12 sec.

I haven't let a 6x9 board run to completion without those hash changes, but it takes a long time :-)  With those changes to the hash functions it finishes in about 11 minutes.

The reason for the currently slow behavior is a combination of the current hash functions and the implementation of the PersistentHashSet data structure used to implement sets.  A PersistentHashSet is, very roughly, a 32-way branching tree with the hash function of the elements used to decide which branch to take.  When you hit a leaf in the tree, all elements with the same hash function value are put into an array that is scanned linearly whenever you want to check whether a value is already in the set.  So the more hash collisions, the more it is like using an array to store and look up set elements in O(n) time.

With the 6x9 board, there are 20,136,752 solutions.  The way those solutions are represented in the program as sets of vectors, those 20 million values only have 17,936 different hash values.  Thus the average length of those arrays of values in the tree leaves is 1,122 values.  The longest of those arrays has 81,610 values in it, all with the same hash value.

With Mark's proposed hash function changes, those 20,136,752 values hash to 20,089,488 different ints, for an average array length of 1 value, and a longest array length of 3 values.  Big speedup.
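For anyone reproducing such numbers, the statistics amount to grouping values by hash and inspecting the bucket sizes. A sketch (not the code in the repo):

(defn hash-collision-stats
  "Distinct-hash count plus average and max bucket size for xs."
  [xs]
  (let [sizes (map count (vals (group-by hash xs)))]
    {:num-values (count xs)
     :num-hashes (count sizes)
     :avg-bucket (double (/ (count xs) (count sizes)))
     :max-bucket (apply max sizes)}))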

There is nothing unique about Mark's particular hash functions that achieves this.  Other hash modifications can give similar improvements.  You can see the stats for a few other choices of hash functions on the 20 million set of values used in the 6x9 board problem at [1]

[1] https://github.com/jafingerhut/chess-clojure/blob/master/lein-run-nqueens-6-9-hash-set-output.txt

Andy

Andy Fingerhut

Nov 4, 2013, 12:26:15 PM
to cloju...@googlegroups.com
Alex Miller suggested creating a design page for this topic.  There is one here:

    http://dev.clojure.org/display/design/Better+hashing

It links to Mark's document, but does not contain everything in his document.  It attempts to be significantly shorter and to call out the main points.

If anyone has suggestions for what else should be added to it, feel free to post them here, or edit the page yourself if you like.

Andy

Mikera

Nov 4, 2013, 7:56:37 PM
to cloju...@googlegroups.com
Perhaps this is obvious, but benchmarking really needs to be done on more representative code bases, not just on a specific micro-benchmark (which has been selected precisely because it triggers worst-case behaviour).

It would be bad to optimise for a special case if it slows down general purpose code. So I'd suggest a couple of alternative benchmarks. e.g.
- Compiling and testing Clojure itself
- A general hashing benchmark designed to represent the common cases (e.g. keys are a mix of regular keywords, strings, small numbers)

Andy Fingerhut

Nov 5, 2013, 12:55:40 AM
to cloju...@googlegroups.com
I have added results to the design page for compiling and testing Clojure itself:

    http://dev.clojure.org/display/design/Better+hashing

Regarding your suggestion of a general hashing benchmark, I am not sure if that would prove anything useful, because the proposed hash function changes do not modify how hashes are calculated for keywords or strings at all.

Andy

Mikera

Nov 5, 2013, 7:53:58 AM
to cloju...@googlegroups.com
Thanks Andy - very useful! Good to see that the overall Clojure compile isn't being adversely affected by these changes.

A hashing benchmark would mostly be useful to see how much extra overhead we are creating. This won't show up in the worst-case benchmark (we already know the new approach will be much better here) and it also probably won't show up in the Clojure compile (too much noise from other random factors). This would be particularly important if we make changes to the infrastructure around hashing - e.g. 
a) if we start swapping Object.hashCode() for Util.hasheq(Object) in a lot of places
b) If we change the way that hashes are incrementally calculated / cached

Changes like these can impact performance even in cases where we haven't changed the hash calculation for the specific keys being considered.

e.g. I just tried the following (with criterium aliased as c):
(c/quick-bench (clojure.lang.Util/hasheq "foo"))
=> Execution time mean : 25.753074 ns

(c/quick-bench (.hashCode "foo"))
=> Execution time mean : 7.919031 ns

So... the differences in overhead can definitely be noticeable if you benchmark in a targeted way.

Andy Fingerhut

Nov 5, 2013, 3:27:03 PM
to cloju...@googlegroups.com
OK, I do have some small benchmarks that compare the performance of hasheq in today's 1.6.0-alpha1 release (same hash as Clojure 1.5.1, and probably as several major releases before it) against the performance of hasheq with Mark Engelberg's proposed changes as described on the design wiki.  The table of performance results on that page is probably getting a bit too large now and should be presented in a different way, but for now the timing results are there.

    http://dev.clojure.org/display/design/Better+hashing

Summary: If all you are doing is hashing longs in a loop, the new hash is about 12% slower on one machine/OS/JDK combo I tested.  If you are doing hashes of various size vectors and sets containing longs, it is about 3% to 4% slower.

I may add some more timing results for maps, too.  I'm open for suggestions on what else people think would be needed in order to make a decision.

Andy

Andy Fingerhut

Jan 20, 2014, 11:32:14 PM
to cloju...@googlegroups.com
I have copied the implementation of the Murmur3 hash that is in the latest released version of Scala (v2.10.3) and put it into a branch of a forked Clojure repo, and run the same experiments that I ran earlier on Mark Engelberg's proposed hash functions.  The results are in the table on the dev wiki:

    http://dev.clojure.org/display/design/Better+hashing

The modified Clojure code is in branch try-murmurhash3 of this Github repo: https://github.com/jafingerhut/clojure

They would probably need some refining before they would be ready for a patch worthy of consideration (e.g. they do not modify the hashes of records) -- they were made primarily to test performance and hash function quality against the current Clojure hash function, and against Mark Engelberg's proposed hash.

Andy

Mark Engelberg

Jan 21, 2014, 1:11:18 AM
to clojure-dev
Thanks.  It's worth pointing out that Andy's implementation is modeled after Scala's, and uses Murmur3's hashing strategy for ordered and unordered collections.

Alex Miller made an excellent case that another way to use Murmur3 would be to implement many of the changes I outlined in my summary doc, but use Murmur3_32 to munge the hashes everywhere I suggested using xorshift32.

I like that Alex Miller's proposal gets the benefits of Murmur3's thorough mixing, while maintaining the ability to create collections that do incremental hashing (which is very difficult with Scala's Murmur3 strategy).

--Mark

Rich Hickey

Jan 29, 2014, 12:10:24 AM
to cloju...@googlegroups.com
I would like to thank everyone for your input on this issue. It's been a great help.

I've pushed a full set of hashing changes, basically Murmur3 throughout.

I understand the issues with incremental hashing, but we don't do that yet and I'm not yet convinced it would matter much in practice unless one is frequently hashing large collections.

In any case, this will let us experience better hashing throughout and do some testing.

The hashing functions in Murmur3 could be surfaced in core to aid developers of other collections that want to retain compatibility with their 'categories', and the recipes are simple - lists use hashOrdered, sets and maps use hashUnordered.
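In code form the recipes look like this (using the Murmur3 statics as they appear in the pushed change; the surfaced core API may end up elsewhere):

;; list/vector category:
(clojure.lang.Murmur3/hashOrdered [1 2 3])
;; set/map category (for maps, over the entries):
(clojure.lang.Murmur3/hashUnordered #{1 2 3})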

Please try it out and provide feedback here.

Thanks again,

Rich

Mark Engelberg

Jan 29, 2014, 2:19:15 AM
to clojure-dev
On Tue, Jan 28, 2014 at 9:10 PM, Rich Hickey <richh...@gmail.com> wrote:
I understand the issues with incremental hashing, but we don't do that yet and I'm not yet convinced it would matter much in practice unless one is frequently hashing large collections.

Rich, thanks for your work evaluating and incorporating a new hashing strategy.

I would like to point out a concrete situation where I needed incremental hashing when developing the instaparse library.

As an example, consider the trivial "S = 'a'+" parser that takes a string like "aaaaa" and produces [:S "a" "a" "a" "a" "a"].
The S parser builds up a series of intermediate results as it parses the string: [:S "a"], [:S "a" "a"], etc., conjing each new "a" into the result vector it is building up.  These intermediate results are stored in a hash set in order to prevent infinite loops and stop other sorts of redundant work that can occur in certain types of grammar.

I was trying to figure out why I was getting O(n^2) performance that was making it impractical to parse a string of just 3000 characters.  I realized that I was generating 3000 intermediate results, each a simple conj onto the previous result, but then when putting the results into the set, the hashing process traversed the entire vector.  So these full traversals over vectors of length 1 through 3000 were giving me the n^2 running time.

I solved the problem by creating a deftype that worked like a vector but updated the hash immediately when a single new item was conjed onto the end of the vector.  This strategy worked beautifully.
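Roughly, the idea looks like this (a simplified sketch, not the actual instaparse type):

(deftype HashedVec [v ^int h]
  clojure.lang.IHashEq
  (hasheq [_] h)   ; O(1): the hash rides along with the vector
  Object
  (hashCode [_] h)
  (equals [_ other]
    (and (instance? HashedVec other) (= v (.-v ^HashedVec other)))))

(defn hconj
  "conj x onto hv, extending the cached ordered hash in O(1)."
  [^HashedVec hv x]
  (HashedVec. (conj (.-v hv) x)
              (unchecked-add-int (unchecked-multiply-int 31 (.-h hv))
                                 (hash x))))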

I remember that initially, I felt annoyed when I realized what was going on, feeling as if Clojure's lack of incremental hashing violated the promise that conj was an O(1)ish operation, because once I started using these conjed vectors as keys, there was, in some sense, a hidden O(n) cost associated with each new conj.  But after I got over my initial annoyance, I mainly felt gratitude that Clojure was sufficiently flexible (and the hashing strategy was sufficiently transparent) to allow me to make something that behaved like a vector but better suited my needs.

I can't make a compelling case that Clojure needs incremental hashing built into all the standard collections as part of the language.  But there is real value to using a hashing strategy that allows others to develop their own incremental hashing collections.  It really did prove essential for this project, and the straight Murmur3 hashing strategy will definitely break the performance of my code in a way that can't easily be repaired.

I found Alex Miller's proposal to be strong; he advocated using Murmur3 in a way that maintains good support for incremental hashing.

--Mark

Rich Hickey

Jan 29, 2014, 9:06:33 AM
to cloju...@googlegroups.com
I don't think Murmur strictly precludes incremental hashing. You could perform and cache all but the finalizing mix step incrementally, then finalize on each call to hasheq, resulting in the desired constant-time hash. dissoc/disj might be tricky. Alternatively, given the approach taken for sets and maps, you could incrementally do the sum/xor/prods and save all the murmur for the hasheq call, as it is done only once on those results. This would make dissoc/disj easy. Etc.

There are a number of issues with the 'fix it in the mix' approach of munging. Munges of dupes are themselves dupes, and bad combinations (the Java algos for list and set) of munges are still bad combinations, e.g. the list algo will still do poorly when given pairs with frequent repeats of the first value, no matter what you munge it to, etc. Applying Murmur3 per item doesn't convey the strong properties it has for ordered collections when applied across items.

This patch gives us really good hashes, across integers, strings, symbols, keywords and collections. That was and is the objective. I think with some creative thinking it could support incremental hashing. But that work remains.

Other notes:

I had to disable a bunch of tests that encoded hash collection order. Such tests are bad and should be fixed or removed.

I think it is important that the hasheq world have the same collection category (sets/maps/lists are equal/hash due to sharing algos) properties as does Java, which means we need to convey the recipes for alternative implementations. Helper fns for the hashOrdered/Unordered, and perhaps the mixing/finalizing fns would be required.

We need not have the same algo in CLJS if it proves too expensive, but we should have the same encapsulating helper API, thus rendering collections portable independent of the strategy. This would also let us change/improve the algos at will.

I would appreciate help/patches for the above.

Thanks,

Rich

Nicola Mometto

Jan 29, 2014, 9:40:52 AM
to cloju...@googlegroups.com

On the same note, some contrib libraries are now failing their test
suites on clojure-1.6.0-SNAPSHOT. I've verified for core.cache that its
implementation (probably unintentionally) made assumptions about element
order in maps that are no longer true.

This patch had the good side-effect of exposing such bugs.

Kevin Downey

Jan 29, 2014, 12:51:50 PM
to cloju...@googlegroups.com
it looks like something is not right with the patch; I haven't been able
to narrow it down yet, but running our tests using Clojure master I see
failures like

FAIL in (test-pool-records) (db.clj:59)
expected: (= a b)
actual: (not (= #{0 1 1799999 3599999 2 1800002 1800001 1800000} #{0
1800000 1 1800001 2 1800002 3599999 1799999}))

if I add some prns to the test and run on 1.5.1 I get

"hash a" 10800004
"hash b" 10800004
(= (seq (sort a)) (seq (sort b))) true

no failures, just the prns

if I rerun on the HEAD of master with the prns I get

FAIL in (test-pool-records) (db.clj:59)
expected: (= a b)
actual: (not (= #{0 1 1799999 3599999 2 1800002 1800001 1800000} #{0
1800000 1 1800001 2 1800002 3599999 1799999}))
"hash a" -879625494
"hash b" 1826386109
(= (seq (sort a)) (seq (sort b))) true

so if I sort the sets and compare, they are equal, but the sets are not.

in this test one of the sets is built from numbers pulled from a db, the
other via subvec'ing a vector and calling set on that.
--
And what is good, Phaedrus,
And what is not good—
Need we ask anyone to tell us these things?

Alex Miller

Jan 29, 2014, 1:32:22 PM
to cloju...@googlegroups.com
If you sort the sets then you are really comparing seqs instead right?

Andy Fingerhut

Jan 29, 2014, 1:46:55 PM
to cloju...@googlegroups.com
Kevin:

I can't think offhand how that could be happening.  Just in case it is due to the numbers having different classes from each other, could you also print out something like (map (juxt identity class) my-set) for the sets that are not comparing as equal but should be?  Probably they are all Longs, but it would be nice to confirm.

Andy



Kevin Downey

Jan 29, 2014, 1:55:58 PM
to cloju...@googlegroups.com
the original test is comparing sets; I added a prn of the result of
comparing the sorted seqs of those sets.

the original set-based test fails on the HEAD of master now, but passes
on 1.5.1. the sorted-seq-based test passes on both.

so what I am seeing is: given sets A and B, A and B are not equal (and
do not hash the same), but the sorted sequences of the elements of A
and B are equal, which seems to violate some set invariants.

Rich Hickey

Jan 29, 2014, 2:19:52 PM
to cloju...@googlegroups.com
Could there be some java.math.BigIntegers in there?

Alex Miller

Jan 29, 2014, 2:31:30 PM
to cloju...@googlegroups.com
I would be particularly suspicious of any values returned from a DB, which could be whatever JDBC returns (like custom Oracle number types, etc.).

Rich Hickey

Jan 29, 2014, 2:33:00 PM
to cloju...@googlegroups.com
I've pushed an enhancement that makes hasheq of java.math.BigIntegers conform to longs when in long range.

See if that helps.

Rich

Kevin Downey

Jan 29, 2014, 3:11:33 PM
to cloju...@googlegroups.com
the latest changes did the trick, thanks!

I think what threw me is that = on seqs doesn't short-circuit on
hashing (it is entirely an equality check), while = on sets does seem
to short-circuit on hash mismatches.

I thought if there was a hashing mismatch for a particular type it
should show up in both equality tests.

Kevin Downey

Jan 29, 2014, 3:12:08 PM
to cloju...@googlegroups.com
On 1/29/14, 11:19 AM, Rich Hickey wrote:
> Could there be some java.math.BigIntegers in there?

there could be, but if equality and hashing of BigIntegers were the
problem, I would expect taking the same elements and comparing them as
sorted seqs to also fail.

Rich Hickey

Jan 29, 2014, 3:57:58 PM
to cloju...@googlegroups.com
Sets have to do lookups in order to implement equality, thus they use the hashes.

Mark Engelberg

Jan 29, 2014, 8:00:35 PM
to clojure-dev
On Wed, Jan 29, 2014 at 6:06 AM, Rich Hickey <richh...@gmail.com> wrote:
I don't think Murmur strictly precludes incremental hashing.

I agree that some aspects of incremental hashing remain in the "possible but difficult" category, but there are also aspects that will never be possible with Murmur.  I hate to keep pushing on this, because you've clearly spent time weighing the options, and have already come to the conclusion that incremental hashing is not necessarily worth preserving as an option, or could be achieved with sufficient effort.  But since this decision will affect my code, I want to be certain you've thought this through.

Concrete examples:

1. Assoc at an arbitrary index in a vector.  This is easy to do incrementally with Java's hashing scheme, but probably impossible with Murmur.

2. Let's say you want to create a new sequential data structure that has constant or logarithmic time concatenation.  Specifically, in instaparse, I needed a data structure with incremental hashing and constant time concatenation, and was easily able to craft something simple that fit my needs.  But with murmur, as far as I know it is impossible to combine the hash codes of two sequences to derive the hash code of the concatenated sequence -- you'd have to traverse all the elements of the second sequence, mixing them into the first hash one by one.
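To make example 1 concrete: in the Java-style polynomial hash, replacing the element at index i of an n-element vector changes exactly one term, so given the cached hash the update needs no traversal of the other elements (helper names are mine; arithmetic wraps mod 2^32 via the unchecked int ops):

(defn pow31
  "31^k as a wrapped 32-bit int."
  [k]
  (loop [k k, acc (int 1)]
    (if (zero? k) acc (recur (dec k) (unchecked-multiply-int 31 acc)))))

(defn assoc-hash
  "Hash after replacing old-elem with new-elem at index i of an
  n-element vector whose cached ordered hash is h."
  [h n i old-elem new-elem]
  (unchecked-add-int
    h
    (unchecked-multiply-int
      (unchecked-subtract-int (hash new-elem) (hash old-elem))
      (pow31 (- n 1 i)))))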

This patch gives us really good hashes, across integers, strings, symbols, keywords and collections. That was and is the objective. I think with some creative thinking it could support incremental hashing. But that work remains.

I fully support the overall objective of the patch and agree that Murmur3 has the strongest overall protection against hash collision, but this makes it extremely difficult for people who need custom data structures that support constant-time modifications and are used as keys.  There's no evidence that Murmur3's "great hashes" convey any benefit beyond "good hashes" for this purpose, so it seems a shame to close the door on the very real benefits of incremental hashing (for some algorithms) in the name of achieving a level of collision protection whose need has not been established.


Mark Engelberg

Jan 29, 2014, 8:25:16 PM
to clojure-dev
On Wed, Jan 29, 2014 at 12:11 PM, Kevin Downey <red...@gmail.com> wrote:
I think what threw me is = on seqs doesn't short circuit on hashing, it
is entirely an equality check, while = on sets does seem to short
circuit on hash mismatches.

I have also thought that equality on seqs would benefit from a line that basically says: if the cached hash values for both seqs are not equal to -1, and are unequal to one another, then short-circuit with a false answer.  We don't want to force a hash to happen on a sequence just because you're comparing for equality, but I don't see any reason not to use it if it is already computed.

Andy Fingerhut

Jan 29, 2014, 9:16:44 PM
to cloju...@googlegroups.com
On Wed, Jan 29, 2014 at 6:06 AM, Rich Hickey <richh...@gmail.com> wrote:
I had to disable a bunch of tests that encoded hash collection order. Such tests are bad and should be fixed or removed.

Ticket CLJ-1328 has a patch that updates many, but not all, of the removed/disabled tests so they become independent of Clojure's hash function.

CLJ-1331 has a proposed patch to update the hash of Clojure primitive vectors to the new Murmur3 hash.

I have added performance and hash quality measurements for the latest changes to the table on the wiki page below.  They do not include Alex Miller's tests of hash quality (i.e. variety) on other data structures, which I think would be useful to add, but I haven't done that yet.

    http://dev.clojure.org/display/design/Better+hashing

    http://dev.clojure.org/jira/browse/CLJ-1328
    http://dev.clojure.org/jira/browse/CLJ-1331

Andy

Rich Hickey

Jan 30, 2014, 11:09:03 AM
to cloju...@googlegroups.com
It's not merely about hash collision, it's about good hashes (bit dispersion, freedom from arithmetic properties, avalanche). Vector hashes, in all the proposals, did not become "good enough". They still encode simple sums regardless of how good the element hashes are, or what the multiplying factor is. In fact, it's that property you are leveraging in order to make them incremental, spliceable etc. That's why sums of hashes of similar vectors can collapse again. This is not a problem with set's hash, it is a problem with the vector hashes themselves - however 'distinct' they may appear, they have arithmetic relationships that reappear when combined.

I'm less concerned with murmur across elements than I am with hashes being returned with these arithmetic weaknesses. My primary beef with the proposals is they propagate bad hashes and make fixing them the responsibility of the next collection ("fix it in the mix").

It is simple enough to merely include the fix *before* returning the hash, rather than leaving it to the next guy. I've updated the inside of hashOrdered to do that (and also simplified unordered to be just a single sum).

An incremental strategy need only delay the final step, caching the basis, and moving unordered back to single sum means the basis in each case is just a single int. The fix step has been exposed a la carte, and includes the size.
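Concretely, for the unordered case that shape is (a hedged sketch: mixCollHash per the current commit, the other names mine):

;; O(1) basis maintenance on conj/disj:
(defn basis-conj [basis x] (unchecked-add-int basis (hash x)))
(defn basis-disj [basis x] (unchecked-subtract-int basis (hash x)))

;; pay for the Murmur finalization only when hasheq is called:
(defn finalize-unordered [basis n]
  (clojure.lang.Murmur3/mixCollHash (int basis) (int n)))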

Please stop pushing for public hashes that can be arithmetically manipulated - that's not going to be "good enough". But the internals are now completely amenable to incrementality, or at least as much as they ever were, IMO.

Exposing APIs so collection implementors can play along is still a "help wanted" item.

Thanks,

Rich

kovas boguta

Jan 30, 2014, 1:01:44 PM
to cloju...@googlegroups.com
On Thu, Jan 30, 2014 at 10:09 AM, Rich Hickey <richh...@gmail.com> wrote:
> I'm less concerned with murmur across elements than I am with hashes being returned with these arithmetic weaknesses. My primary beef with the proposals is they propagate bad hashes and make fixing them the responsibility of the next collection ("fix it in the mix").

This is why I argue that numbers should not hash to themselves.

Though having APIs for collection implementors would solve most of my concern.

Rich Hickey

Jan 30, 2014, 2:10:56 PM
to cloju...@googlegroups.com
Numbers no longer hash to themselves.

Mark Engelberg

Jan 30, 2014, 2:45:06 PM
to clojure-dev
On Thu, Jan 30, 2014 at 8:09 AM, Rich Hickey <richh...@gmail.com> wrote:
But the internals are now completely amenable to incrementality, or at least as much as they ever were, IMO.

Thanks Rich.  I think your latest push nicely balances and addresses the competing needs.


Exposing APIs so collection implementors can play along is still a "help wanted" item.


I think the functions in Murmur3.java are already pretty clear and easy to use.  When you talk about exposing an API, is it mainly about creating a level of abstraction for compatibility between Clojure and ClojureScript, or is there something more you have in mind?

Rich Hickey

Jan 30, 2014, 4:05:32 PM
to cloju...@googlegroups.com
Encapsulating Murmur, providing Clojure entry points, abstracting for Clojure/CLJS. We need to define what it means to be compatible with each of the collection categories.

I imagine if we are going to support incrementality we might want to enhance collections so they can report their hasheq bases.