I've analyzed the properties of the different hash algorithms
with the following program:
#include <iostream>
#include <cstddef>
#include <random>
#include <limits>
#include <vector>
#include <cmath>
#include <functional>
#include <algorithm>
using namespace std;
void showHashStatistics( vector<unsigned> &v );
int main()
{
    size_t const MAX64_PRIME = 0xffffffffffffffc5u; // a large prime just below 2^64
    size_t const N_BUCKETS = 0x100000;              // 2^20 buckets
    random_device rd;
    uniform_int_distribution<size_t> uidHash( 0,
        numeric_limits<size_t>::max() );
    vector<unsigned> v;
    v.resize( N_BUCKETS, 0 );
    // simple prime hash: multiply a random key by a large prime, let it
    // wrap around and take the result modulo the bucket count
    cout << "simple prime hash:" << endl;
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[(size_t)(uidHash( rd ) * MAX64_PRIME) % N_BUCKETS];
    showHashStatistics( v );
    cout << endl;
    // MS hash<size_t>: hash consecutive keys with the standard hasher
    cout << "hash<size_t>:" << endl;
    fill( v.begin(), v.end(), 0 );
    hash<size_t> hasher;
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[hasher( i ) % N_BUCKETS];
    showHashStatistics( v );
}
void showHashStatistics( vector<unsigned> &v )
{
    // minimum and maximum bucket load
    unsigned min = numeric_limits<unsigned>::max();
    unsigned max = 0;
    for( unsigned x : v )
        min = x < min ? x : min,
        max = x > max ? x : max;
    cout << "min: " << min << " max: " << max << endl;
    // average bucket load
    double avg = 0;
    for( unsigned x : v )
        avg += x;
    avg /= v.size();
    cout << "average: " << avg << endl;
    // standard deviation around the average
    double rmsd = 0.0;
    for( unsigned x : v )
    {
        double sq = x - avg;
        rmsd += sq * sq;
    }
    rmsd = sqrt( rmsd / v.size() );
    cout << "standard deviation: " << rmsd << endl << endl;
    // histogram: how many buckets hold exactly i elements
    cout << "distribution:" << endl;
    vector<unsigned> vDist;
    vDist.resize( max + 1, 0 );
    for( unsigned x : v )
        ++vDist[x];
    for( unsigned i = 0; i <= max; ++i )
        cout << vDist[i] << endl;
}
It does the following: it fills a vector of 2^20 (about one
million) unsigneds with zeroes. Then it calculates 10 million
hashes. Each hash is taken modulo the vector size and used as an
index into the vector, and the element at that index is
incremented. That's like not having any real buckets but just
counting the fill degree of each bucket.
Then I output different statistics: the maximum load, the minimum
load, the average load and the standard deviation around the
average.
For the trivial solution min is zero, max is 29, the average is
9.54 and the standard deviation is 3.09. For the MS hash<size_t>
solution min is 4, max is 15, and the average is 9.54 (the program
outputs some more digits and they're equal to the average of the
simple solution!). The standard deviation is a bit smaller: 1.52
vs. 3.09, around an average of 9.54. And if someone here thinks
that's not a subject for standard deviations because we won't have
anything similar to a normal distribution, here's the proof:
https://ibin.co/4vuLPDgTcwYY.png
The upper one is the simple algorithm and the lower one is that of
MS.
Because of the higher number of distinct values one might think
that the MS algorithm is significantly better, but that isn't
really true, because the average is the same and the values don't
vary much differently. So I don't know why MS makes such a high
effort for such a small difference. And the time spent on roughly
1.5 more entries per bucket might well be outweighed by the simpler
hashing algorithm.
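
By the way, here's a small sketch (my own calculation, not part of
the program above) of what a perfectly uniform random assignment of
the 10 million keys to the 2^20 buckets would give. The bucket
loads then follow a binomial distribution with mean n / m and
standard deviation sqrt( n * p * (1 - p) ) with p = 1 / m:

#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    double const n = 10'000'000.0; // number of keys
    double const m = 1048576.0;    // 2^20 buckets
    double const p = 1.0 / m;      // chance of hitting one particular bucket
    cout << "expected average load: " << n / m << endl;                         // ~9.54
    cout << "expected standard deviation: " << sqrt( n * p * (1 - p) ) << endl; // ~3.09
}

That gives an average of about 9.54 and a standard deviation of
about 3.09, i.e. the simple prime hash behaves here pretty much
like an ideal random hash.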
And now for something different: when I displayed the distribution
with OpenOffice, I was surprised that I almost got a normal
distribution. This means that prime number arithmetic results in
distribution properties which can be found in nature. But that's
higher mathematics I won't understand in this life.
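
For what it's worth, the bell shape can also be reproduced from
pure counting: if a hash spreads the keys uniformly, the expected
number of buckets holding exactly k elements follows a Poisson
curve with lambda = average load (about 9.54), and for such a
lambda that curve already looks very much like a Gaussian. Here's a
small sketch (again my own addition, same numbers as above) that
prints these expected counts so they can be compared with the
distribution the program prints:

#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    double const lambda = 10'000'000.0 / 1048576.0; // average load, ~9.54
    double const nBuckets = 1048576.0;              // 2^20
    // expected number of buckets with exactly k elements:
    // nBuckets * e^-lambda * lambda^k / k!
    double term = exp( -lambda ); // k == 0
    for( unsigned k = 0; k <= 29; ++k )
    {
        cout << k << ": " << nBuckets * term << endl;
        term *= lambda / (k + 1); // step from k to k + 1
    }
}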