Improving the problem with the permutations & nested structure is
important, but its not the dominant factor.
The dominant factor is the integer partitions, and that can be fixed
by randomizing the number hashes.
In the collision data, all the samples have a certain structure:
taking the vector pairs, the first numbers sum to 10, and the second
numbers sum to 5. That these hash to the same thing is a consequence
of the vector and hashset interaction.
Now, the point is that there are many ways to sum to 10, in fact there
are 42 distinct integer sets that sum to 10, and thats *before* you
get into permutations.
To match our real world example, lets put a cap of length 5 on each
partition, and only allow integers <= 4 as in our data. There are 12
such partitions that sum to 15, and 6 that sum to 5.
The total number permutations on top of these partitions is 381 and
121 respectively, which leads to a grand total of 46,101 ways to get
the same hash code by substituting numbers between 0-4 into the sample
data.
If numbers hash to random numbers, the integer partition problem goes
away, and have a much smaller set of equivalence classes based on
permutations. The set of 46,101 breaks up into 72 smaller sets with a
median # of elts of 600, and a max # of 3600.
So thats a pretty big improvement over 46,101.
381
Permuting the vectors among the hashsets will not change the code, and
that leads to collisions. But there are actually more ways to sum
integers to get 10 than there are to
IntegerPartitions[10]
Notice that the numbers are not simply permutations. There are 176
distinct sets of integers that sum to 15, know as integer partitions.
This is the root of the combinatorial explosion.
The total number of ordered lists of length 10 that sum to 15 is
1,307,706 . These will all sum to the same hash code
There are 176 distinct sets of integers that sum to 15. These are the
integer partitions, and are the root of the combinatorial explosion.
If the hash code was random, the 176 distinct sets would sum to
different codes, rather than the same one (15).
1,307,706
Permutations come into play on top of the integer partitions.
For each of these distinct sets, take #{4, 2, 2, 2, 1, 1, 1, 1, 1}
for instance,
we can expand it to length 10 as in the example, arbitrarily permute
it , and get the same hash. For this instance, there are 5040
permutations.
To get the total number of possible hash collisions, we take the
integer partitions, pad them to 0 to get length 10, count the
permutations and sum it up.
The sample collisions obtained by Andy all have in common 1 thing: