
hash<size_t>


Bonita Montero

Sep 21, 2019, 12:08:08 PM

Can anyone tell me why hash<size_t>, or in fact all hash<> specializations
for integral types, don't return the value itself as the hash code on my
platform? That doesn't make sense to me, because if you have n different
input values for the hash generator you get at most n different output
values (if the integral type is less than or equal in size to size_t,
which is usually true on 64-bit platforms); or even more: when you hash
an integral value at most the size of a size_t, you might even get
collisions depending on the hash algorithm, whereas with a 1:1 mapping
you won't get any collisions.

Bonita Montero

Sep 21, 2019, 12:59:46 PM

OK, it's only MSVC's standard library that
handles it this way. Consider the following code:

#include <iostream>
#include <functional>

using namespace std;

int main()
{
    hash<size_t> hasher;
    size_t hash = hasher( 123 );
    cout << hash << endl;
}

"MSVC" ouptputs: 10276731894227909406
glibc++ is so reasonable as to simply output: 123.

Juha Nieminen

Sep 21, 2019, 1:30:08 PM

You have a misconception there.

The situation is exactly the opposite: Using the integer value itself
as its hash is actually an extremely *poor* hashing function, not
the "optimal" one.

It would be optimal if the hash table had as many buckets as the maximum
value of the integer (i.e. 2^32 in the case of 32-bit integers).
However, that's not feasible in the least. The typical number of buckets
in a hash table is very small (in relation to the maximum value of the
key), and you want to distribute elements *as randomly as possible*
into them. Using the key itself as the hash value gives extremely poor
randomness when there are significantly fewer buckets than the
maximum key value, and will usually cause an enormous number of
collisions.

The default hash functions for integral values in libstdc++ (used by
gcc) simply return the value itself, which is extremely bad. I assume
they do this because of pure laziness (or to avoid some kind of
controversy, as there is no universally agreed "best" hashing
function for integers).
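
To make that bucket mapping concrete, here is a small illustrative
sketch (not from either post; it assumes libstdc++'s identity hash for
integers and its bucket = hash % bucket_count policy, which is
implementation-specific):

#include <cstddef>
#include <iostream>
#include <unordered_set>

int main()
{
    std::unordered_set<std::size_t> s;
    s.reserve( 8 );                        // a small table with only a handful of buckets
    std::size_t const n = s.bucket_count();

    for( std::size_t k = 1; k <= 4; ++k )
        s.insert( k * n );                 // keys that all share the same residue mod bucket_count

    // With an identity hash and bucket = hash % bucket_count, all of these
    // keys land in the same bucket even though the table is nearly empty.
    for( std::size_t k : s )
        std::cout << k << " -> bucket " << s.bucket( k ) << '\n';
}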

Bonita Montero

Sep 21, 2019, 1:37:24 PM

> The situation is exactly the opposite: Using the integer value itself
> as its hash is actually an extremely *poor* hashing function, not
> the "optimal" one.

No, when the integer value has no more bits than the hash value,
a 1:1 mapping is a perfect solution. And libstdc++ behaves exactly
this way.

> It would be optimal if the hash table had as many buckets as the maximum
> value of the integer (i.e. 2^32 in the case of 32-bit integers).

The number of buckets in a hashtable is a different discussion.

> However, that's not feasible in the least. The typical number of buckets
> in a hash table is very small (in relation to the maximum value of the
> key), and you want to distribute elements *as randomly as possible*
> into them.

If you have n distinct input values and the input value has no more
bits than the output value, you get a maximum of n output values.
So you can stick with a 1:1 mapping and discard the higher bits that
don't fit in the hash index. That's a perfect solution.

> The default hash functions for integral values in libstdc++ (used
> by gcc) simply return the value itself, which is extremely bad.

No, that's absolutely not bad.

Bonita Montero

Sep 21, 2019, 1:46:25 PM

>> However, that's not feasible in the least. The typical number of buckets
>> in a hash table is very small (in relation to the maximum value of the
>> key), and you want to distribute elements *as randomly as possible*
>> into them.

> If you have n distinct input values and the input value has no more
> bits than the output value, you get a maximum of n output values.
> So you can stick with a 1:1 mapping and discard the higher bits that
> don't fit in the hash index. That's a perfect solution.

And even more: you get an equal distribution across all buckets if
all keys occur equally often.

Melzzzzz

Sep 21, 2019, 10:53:49 PM

On 2019-09-21, Juha Nieminen <nos...@thanks.invalid> wrote:
> Bonita Montero <Bonita....@gmail.com> wrote:
>> Can anyone tell me why hash<size_t>, or in fact all hash<> specializations
>> for integral types, don't return the value itself as the hash code on my
>> platform? That doesn't make sense to me, because if you have n different
>> input values for the hash generator you get at most n different output
>> values (if the integral type is less than or equal in size to size_t,
>> which is usually true on 64-bit platforms); or even more: when you hash
>> an integral value at most the size of a size_t, you might even get
>> collisions depending on the hash algorithm, whereas with a 1:1 mapping
>> you won't get any collisions.
>
> You have a misconception there.
>
> The situation is exactly the opposite: Using the integer value itself
> as its hash is actually an extremely *poor* hashing function, not
> the "optimal" one.
>
> It would be optimal if the hash table had as many buckets as the maximum
> value of the integer (i.e. 2^32 in the case of 32-bit integers).
> However, that's not feasible in the least. The typical number of buckets
> in a hash table is very small (in relation to the maximum value of the
> key), and you want to distribute elements *as randomly as possible*
> into them. Using the key itself as the hash value gives extremely poor
> randomness when there are significantly fewer buckets than the
> maximum key value, and will usually cause an enormous number of
> collisions.
>

How so? It depends on the value itself. You can map the value to a
smaller index simply as value % buckets.


> The default hash functions for integral values in libstdc++ (used by
> gcc) simply return the value itself, which is extremely bad. I assume
> they do this because of pure laziness (or to avoid some kind of
> controversy, as there is no universally agreed "best" hashing
> function for integers).

Mapping the value to buckets is performed anyway, but using the value
itself has the advantage of zero collisions in the hash value, so what
could be better?


--
press any key to continue or any other to quit...
I don't enjoy anything as much as my INVALID status. -- Zli Zec
In the Wild West there wasn't actually that much violence, precisely
because everyone was armed. -- Mladen Gogala

Bonita Montero

Sep 22, 2019, 12:46:50 AM

>> The default hash functions for integral values in libstdc++ (used by
>> gcc) simply return the value itself, which is extremely bad. I assume
>> they do this because of pure laziness (or to avoid some kind of
>> controversy, as there is no universally agreed "best" hashing
>> function for integers).

> Mapping the value to buckets is performed anyway, but using the value
> itself has the advantage of zero collisions in the hash value, so what
> could be better?

There's one seeming refutation: imagine you have a 1:1 hash key over
the values 0 to 2^16 - 1 and you have 256 buckets, so you drop the
upper byte of the hash key to get the bucket index. And imagine the
input values are all multiples of 256; in this case a 1:1 mapping
wouldn't be good, as all the keys would be mapped to the same bucket.
So multiplying the input value by a prime would be better here. But
the truth is: such a distribution of input values is not common.
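
A minimal sketch of that scenario (an illustration added here, not code
from the post): 16-bit keys that are all multiples of 256, and 256
buckets. The 1:1 mapping takes the low byte as the bucket index; the
prime variant multiplies by 65521 and, because an odd prime leaves the
low bits of a multiple of 256 at zero, takes the high byte of the
wrapped 16-bit product instead:

#include <array>
#include <cstdint>
#include <iostream>

int main()
{
    std::array<unsigned, 256> identity{}, primeHigh{};
    unsigned const PRIME16 = 65521;                       // largest prime below 2^16

    for( unsigned key = 0; key < 0x10000; key += 256 )    // keys: multiples of 256
    {
        ++identity[key & 0xff];                           // 1:1 hash, bucket = low byte
        std::uint16_t h = (std::uint16_t)(key * PRIME16); // multiply by a prime, wrap to 16 bits
        ++primeHigh[h >> 8];                              // bucket = high byte of the product
    }

    unsigned usedIdentity = 0, usedPrime = 0;
    for( unsigned b = 0; b < 256; ++b )
        usedIdentity += identity[b] != 0,
        usedPrime += primeHigh[b] != 0;

    // The 1:1 mapping fills a single bucket, the prime variant fills all 256.
    std::cout << "buckets used, 1:1: " << usedIdentity
              << ", prime multiply: " << usedPrime << '\n';
}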

Bonita Montero

Sep 22, 2019, 8:40:32 AM

> with a 1:1 mapping you won't get any collisions.

The whole issue is becoming more and more strange: I've just written
a tiny function that uses std::hash<size_t> to return a hashed value
of an input size_t. This is the code:

#include <functional>
using namespace std;
size_t f123( size_t s )
{
    hash<size_t> hasher;
    return hasher( s );
}

Now here's the MSVC code with full optimizations:

?f123@@YA_K_K@Z PROC
mov r8, rcx
movzx r9d, cl
mov r10, 1099511628211 ; 00000100000001b3H
mov rax, -3750763034362895579 ; cbf29ce484222325H
xor r9, rax
mov rax, rcx
shr rax, 8
movzx edx, al
mov rax, rcx
shr rax, 16
movzx ecx, al
mov rax, r8
shr rax, 24
imul r9, r10
xor r9, rdx
imul r9, r10
xor r9, rcx
movzx ecx, al
imul r9, r10
mov rax, r8
shr rax, 32
xor r9, rcx
movzx ecx, al
mov rax, r8
shr rax, 40
movzx eax, al
imul r9, r10
xor r9, rcx
mov rcx, r8
imul r9, r10
shr rcx, 48
xor rax, r9
movzx edx, cl
imul rax, r10
shr r8, 56
xor rax, rdx
imul rax, r10
xor rax, r8
imul rax, r10
ret
?f123@@YA_K_K@Z ENDP

_What are they smoking in Redmond?_
Are they trying to get crypto hashes from a size_t? ;-)
Actually, output value = input value would be sufficient. If you don't
want the repeated-bucket problem I described in another article in this
thread, multiplying by a prime would be OK, discarding the bits that
don't fit in a size_t. Here's an implementation of a multiplication by
the largest prime fitting in a size_t on a 64-bit machine (this is
still suitable as long as the corresponding prime isn't a Mersenne
prime):

#include <functional>
using namespace std;
size_t f123( size_t s )
{
    size_t const MAXPRIME64 = 0xffffffffffffffc5u;
    return s * MAXPRIME64;
}

This is the resulting code:


?f123@@YA_K_K@Z PROC
imul rax, rcx, -59 ; ffffffffffffffc5H
ret 0
?f123@@YA_K_K@Z ENDP

Bonita Montero

Sep 22, 2019, 9:33:08 AM

>     mov      r10, 1099511628211            ; 00000100000001b3H
>     mov      rax, -3750763034362895579     ; cbf29ce484222325H

These two values made me curious. I searched through MS's STL
implementation and there was a constexpr "variable" defined as follows:
constexpr size_t _FNV_offset_basis = 14695981039346656037ULL;
Furthermore, 0x00000100000001b3u, i.e. 1099511628211 in decimal, was to
be found as:
size_t _FNV_prime = 1099511628211ULL;
Then I searched for "_FNV_offset_basis" and came across the following
Wikipedia article: https://bit.ly/2Pa3Owh
There the algorithm is described as a non-cryptographic hash algorithm
(which, with only a 64-bit output, it could hardly be anyway). In any
case it's quite expensive for what one typically needs.

Bonita Montero

Sep 22, 2019, 9:49:35 AM

Sorry, that response should have gone to a German newsgroup.
So here's the English translation:
Both values made me curious. I searched through MS's STL implementation
and there was a constexpr "variable" that was defined like this:
constexpr size_t _FNV_offset_basis = 14695981039346656037ULL;
Furthermore, 0x00000100000001b3u, i.e. 1099511628211 in decimal, was to
be found as: size_t _FNV_prime = 1099511628211ULL;
I googled for "_FNV_offset_basis" and found this Wikipedia article:
https://bit.ly/2Pa3Owh . The algorithm described there is a
non-cryptographic hash algorithm (which, with only a 64-bit output, it
could hardly ever be). Anyway, that's a rather expensive implementation.
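
For reference, here is a minimal sketch of the 64-bit FNV-1a loop those
constants belong to; it mirrors what the MSVC disassembly above does
byte by byte, but it is a reconstruction, not MS's source:

#include <cstddef>
#include <cstdint>
#include <cstring>

// 64-bit FNV-1a over the bytes of a size_t, using the offset basis and
// prime from the disassembly (14695981039346656037 and 1099511628211).
std::size_t fnv1a_size_t( std::size_t s )
{
    std::uint64_t const FNV_offset_basis = 14695981039346656037ULL;
    std::uint64_t const FNV_prime        = 1099511628211ULL;

    unsigned char bytes[sizeof s];
    std::memcpy( bytes, &s, sizeof s );

    std::uint64_t h = FNV_offset_basis;
    for( unsigned char b : bytes )
    {
        h ^= b;            // xor in the next byte ...
        h *= FNV_prime;    // ... then multiply by the FNV prime
    }
    return (std::size_t)h;
}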

Bonita Montero

Sep 22, 2019, 12:25:37 PM

I've analyzed the properties of the different hash algorithms
with the following program:

#include <iostream>
#include <cstddef>
#include <random>
#include <limits>
#include <vector>
#include <cmath>
#include <functional>
#include <algorithm>

using namespace std;

void showHashStatistics( vector<unsigned> &v );

int main()
{
    size_t const MAX64_PRIME = 0xffffffffffffffc5u;
    size_t const N_BUCKETS = 0x100000;
    random_device rd;
    uniform_int_distribution<size_t> uidHash( 0,
                                              numeric_limits<size_t>::max() );
    vector<unsigned> v;

    // one counter per bucket
    v.resize( N_BUCKETS, 0 );

    cout << "simple prime hash:" << endl;
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[(size_t)(uidHash( rd ) * MAX64_PRIME) % N_BUCKETS];
    showHashStatistics( v );

    cout << endl;
    cout << "hash<size_t>:" << endl;
    fill( v.begin(), v.end(), 0 );
    hash<size_t> hasher;
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[hasher( i ) % N_BUCKETS];
    showHashStatistics( v );
}

void showHashStatistics( vector<unsigned> &v )
{
    unsigned min = numeric_limits<unsigned>::max();
    unsigned max = 0;
    for( unsigned x : v )
        min = x < min ? x : min,
        max = x > max ? x : max;
    cout << "min: " << min << " max: " << max << endl;

    double avg = 0;
    for( unsigned x : v )
        avg += x;
    avg /= v.size();
    cout << "average: " << avg << endl;

    double rmsd = 0.0;
    double sq;
    for( unsigned x : v )
        sq = (x - avg),
        rmsd += sq * sq;
    rmsd = sqrt( rmsd / v.size() );
    cout << "standard deviation: " << rmsd << endl << endl;

    cout << "distribution:" << endl;
    vector<unsigned> vDist;
    vDist.resize( max + 1, 0 );
    for( unsigned x : v )
        ++vDist[x];
    for( unsigned i = 0; i < vDist.size(); ++i )
        cout << vDist[i] << endl;
}

It does the following: it fills a vector of 2^20 (about one million)
unsigneds with zeroes. Then it calculates 10 million hashes. Each hash
is taken modulo the vector size and used as an index into the vector,
and the corresponding element is incremented. That's like not having
any buckets but just counting the fill level of each bucket.
Then I print various statistics: the maximum load, the minimum load,
the average load and the standard deviation around the average.
For the trivial solution min is 0, max is 29, the average is 9.54
and the standard deviation is 3.09. For the MS hash<size_t> solution
min is 4, max is 15 and the average is 9.54 (the program outputs some
more digits and they're equal to the average of the simple solution!).
The standard deviation is somewhat smaller: 1.52 vs. 3.09, around an
average of 9.54. And if someone here thinks this isn't a subject for
standard deviations because we won't get anything like a normal
distribution, here's the proof:
https://ibin.co/4vuLPDgTcwYY.png
The upper plot is the simple algorithm and the lower one is MS's.
Because of the high number of distinct values one might think that
MS's algorithm is significantly better, but this isn't really true
because the average is the same and the values don't vary that
differently. So I don't know why MS goes to such an effort for such
a small difference. And consider that the time spent on about 1.5
extra entries per bucket might be outweighed by the simpler hashing
algorithm.
And now for something different: when I displayed the distribution
with OpenOffice, I was surprised that I got almost a normal
distribution. This means that prime-number arithmetic results in
distribution properties which can be found in nature. But that's
higher mathematics I won't understand in this life.
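
As a sanity check on those numbers (a back-of-the-envelope aside, not
part of the original program): if 10 million items were thrown into
2^20 buckets uniformly at random, the bucket loads would be roughly
Poisson-distributed, and the standard deviation of a Poisson
distribution is the square root of its mean:

#include <cmath>
#include <iostream>

int main()
{
    double mean = 10'000'000.0 / 0x100000;                            // ~9.54 items per bucket
    std::cout << "expected mean:    " << mean << '\n';
    std::cout << "expected std dev: " << std::sqrt( mean ) << '\n';   // ~3.09
}

That matches the 9.54 / 3.09 reported for the simple prime hash over
random inputs, and a Poisson distribution with that mean already looks
much like a bell curve, which would also account for the near-normal
shape of the plotted histogram.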

Bonita Montero

Sep 22, 2019, 1:32:27 PM

>     cout << "distrubution:" << endl;
>     vector<unsigned> vDist;
>     vDist.resize( max + 1, 0 );
>     for( unsigned x : v )
>         ++vDist[x];
>     for( unsigned i = 0; i < max; ++i )
>         cout << vDist[i] << endl;

And I added some code after that:

    double mean = (double)numeric_limits<size_t>::max() / 2.0;
    normal_distribution<double> nd( mean, sqrt( mean ) );
    cout << endl;
    cout << "normal distribution:" << endl;
    fill( v.begin(), v.end(), 0 );
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[((size_t)nd( rd ) * MAX64_PRIME) % N_BUCKETS];
    showHashStatistics( v );

This uses a normal distribution around 2^63 - 1 with a standard
deviation of sqrt( 2^63 - 1 ) as the random input. This input is
further multiplied by MAX64_PRIME and taken modulo the number of
buckets. The distribution is really strange: nearly all buckets,
exactly 2^20 - 1024 of them, fall on the first load value, i.e. they
stay empty! The rest is distributed over the remaining 1024 buckets.
So the relationship doesn't only go in the direction that the hash
calculations give a normal distribution; it also goes the other way:
a normal distribution projected through the hash algorithm ends up
filling only a tiny fraction of the buckets. WTF is going on there?
Does anyone have a mathematical explanation for that?
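
One plausible explanation, sketched here as a quick check rather than a
verified answer: nd( rd ) returns a double, and near 2^63 a double can
only represent every 1024th integer (below the mean) or every 2048th
(above it), so the cast back to size_t yields only multiples of 1024.
Multiplying by an odd prime keeps them multiples of 1024, and modulo
2^20 that leaves just 2^20 / 1024 = 1024 reachable bucket indices:

#include <cmath>
#include <cstddef>
#include <iostream>
#include <limits>

int main()
{
    // Spacing between adjacent doubles around the mean used above (2^63):
    // the 53-bit mantissa can no longer represent every integer there.
    double mean = (double)std::numeric_limits<std::size_t>::max() / 2.0;
    std::cout << "spacing below the mean: " << mean - std::nextafter( mean, 0.0 ) << '\n';  // 1024
    std::cout << "spacing above the mean: " << std::nextafter( mean, 1e30 ) - mean << '\n'; // 2048

    // Multiples of 1024 times an odd prime stay multiples of 1024, so
    // modulo 2^20 only 2^20 / 1024 bucket indices can ever be hit.
    std::cout << "reachable buckets: " << 0x100000 / 1024 << '\n';
}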

Juha Nieminen

Sep 23, 2019, 4:13:58 AM

Melzzzzz <Melz...@zzzzz.com> wrote:
> How so? It depends on the value itself. You can map the value to a
> smaller index simply as value % buckets.

Which is very bad. For example, if the keys have only even values, half
of the buckets will always be completely empty, and the remaining ones
will fill at twice the speed.

If, however, the keys are hashed with a strong hashing function, the
buckets will still all be filled at about equal rates.

> Mapping the value to buckets is performed anyway, but using the value
> itself has the advantage of zero collisions in the hash value, so what
> could be better?

There will be zero collisions only if there are as many buckets as the
maximum value of the integer key (e.g. 2^32 if it's a 32-bit integer).
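
A minimal sketch of that effect (an illustration added here, not from
the post), with an identity hash, even keys only, an even number of
buckets and the bucket chosen as hash % bucket count:

#include <array>
#include <cstddef>
#include <iostream>

int main()
{
    std::size_t const N_BUCKETS = 256;                   // an even bucket count
    std::array<unsigned, N_BUCKETS> load{};

    for( std::size_t key = 0; key < 100000; key += 2 )   // keys: even values only
        ++load[key % N_BUCKETS];                         // identity hash, bucket = key % buckets

    unsigned empty = 0;
    for( unsigned l : load )
        empty += l == 0;

    // Every odd-numbered bucket stays empty: exactly half of them.
    std::cout << "empty buckets: " << empty << " of " << N_BUCKETS << '\n';
}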

Juha Nieminen

Sep 23, 2019, 4:18:18 AM

Bonita Montero <Bonita....@gmail.com> wrote:
>> The situation is exactly the opposite: Using the integer value itself
>> as its hash is actually an extremely *poor* hashing function, not
>> the "optimal" one.
>
> No, when the integer value has no more bits than the hash value,
> a 1:1 mapping is a perfect solution. And libstdc++ behaves exactly
> this way.

No, it's not, because the distribution becomes extremely dependent on
regular patterns in the keys. For example, if all the keys are even
(and you have an even number of buckets), half of the buckets will
always be empty and the other half will fill at double the rate.

With a good-quality hashing function that won't happen. If you have a
cryptographically strong hashing function, no regular pattern in the
keys will ever cause a significant imbalance in the filling of the buckets.

>> It would be optimal if the hash table had as many buckets as the maximum
>> value of the integer (i.e. 2^32 in the case of 32-bit integers).
>
> The number of buckets in a hashtable is a different discussion.

No, it's not. It's extremely relevant to the question of how good the
hashing function has to be.

> If you have n distinct input values and the input value has no more
> bits than the output value, you get a maximum of n output values.
> So you can stick with a 1:1 mapping and discard the higher bits that
> don't fit in the hash index. That's a perfect solution.

No, it's not. The distribution becomes extremely susceptible to any
patterns that may appear in the keys.

With a high-quality hashing function that won't happen.

Bonita Montero

Sep 23, 2019, 4:26:57 AM

> No, it's not, because the distribution becomes extremely dependent
> on regular patterns in the keys.

Regular patterns that result in only a small number of buckets filling
up are rare, if they even happen.

> For example, if all the keys are even
> (and you have an even number of buckets), half of the buckets will
> always be empty and the other half will fill at double the rate.

That also doesn't really happen.

Bonita Montero

Sep 23, 2019, 5:27:31 AM

>> For example, if all the keys are even
>> (and you have an even number of buckets), half of the buckets will
>> always be empty and the other half will fill at double the rate.

Check this with the complex hashing algorithm of MSVC and
you get an idea that such hashing won't help here either:

#include <iostream>
#include <cstdint>
#include <cstring>
#include <functional>

using namespace std;

void print( unsigned *a );

int main()
{
    unsigned const MAXPRIME16 = 65521;
    unsigned a[0x100];

    memset( a, 0, sizeof a );
    unsigned i = 0;
    do
        ++a[(uint8_t)(i * MAXPRIME16)];   // low byte of the prime product as bucket
    while( (i += 2) != 0x10000 );         // even keys only
    print( a );

    memset( a, 0, sizeof a );
    hash<uint8_t> hasher;
    cout << endl;
    i = 0;
    do
        ++a[(uint8_t)hasher( i )];        // i is truncated to uint8_t before hashing
    while( (i += 2) != 0x10000 );
    print( a );
}

void print( unsigned *a )
{
    unsigned i = 0;
    do
        cout << i << " " << (unsigned)a[i] << endl;
    while( ++i != 0x100 );
}

Juha Nieminen

Sep 23, 2019, 5:37:27 AM

Bonita Montero <Bonita....@gmail.com> wrote:
>> No, it's not, because the distribution becomes extremely dependent
>> on regular patterns in the keys.
>
> Regular patterns that result in only a small number of buckets filling
> up are rare, if they even happen.

Like all the keys being even? Yeah, very rare.

Now you are just reaching. As I said, with a strong hashing function, any
regular patterns in the keys will have very little effect on the quality
of the distribution. So what exactly is the reason you deliberately want
to avoid an actual hashing function, and instead use the trivial version
that is extremely susceptible to producing very poor distributions if
there are regular patterns in the keys? What possible advantage could
there be?

Using the key value as its own hashing function is an *extremely poor*
form of hashing, as it's very easily susceptible to poor distributions.
There's a good reason why tons of research has been done to come up with
cryptographically strong hashing functions that "randomize" the result
as much as possible. Why do you think that is?

This is basic comp sci theory. Why are you so stubborn about it?

Bonita Montero

Sep 23, 2019, 5:45:05 AM

> Like all the keys being even? Yeah, very rare.

That almost never happens. And as you can read in my other post
(<qma35p$ksp$1...@news.albasani.net>), using the Fowler-Noll-Vo hash MS
uses, or a simpler hash like multiplying by a prime, doesn't help
here. So you can stick with a 1:1 mapping.

Bonita Montero

Sep 23, 2019, 5:47:57 AM

> Like all the keys being even? Yeah, very rare.

And if you know the keys are even, divide them by two.

Juha Nieminen

Sep 23, 2019, 8:00:40 AM

Bonita Montero <Bonita....@gmail.com> wrote:
>> Like all the keys being even? Yeah, very rare.
>
> That almost never happens.

Is that like using gets()? "Nobody would input a line that's longer
than 1000 characters. A 1000-character buffer should be enough."
Or "nobody would ever send a malicious .exe file via email, there's
no need to check incoming emails". Or "quicksort ought to be fast
enough. It's very unlikely that the input is already sorted."

You have still not explained why you oppose the idea of using a
strong hashing function which ensures that no regular pattern in the
key values will cause imbalance in their distribution. It's quite
clear that you are just opposing it because of stubborness and
unwillingness to admit having made a mistake.

> And as you can read in my other post
> (<qma35p$ksp$1...@news.albasani.net>), using the Fowler-Noll-Vo hash MS
> uses, or a simpler hash like multiplying by a prime, doesn't help
> here. So you can stick with a 1:1 mapping.

Nowhere did I suggest using either one of those functions. I suggested
using strong hashing functions. If those give a crappy result, then
they quite obviously are not strong. You can't prove that using a
good hashing function doesn't help by using crappy hashing functions.
Multiplying by a prime is a horrible hashing function, not significantly
better than just not doing anything at all.

Bonita Montero

Sep 23, 2019, 9:03:54 AM

> Is that like using gets()? "Nobody would input a line that's longer
> than 1000 characters. A 1000-character buffer should be enough."
> Or "nobody would ever send a malicious .exe file via email, there's
> no need to check incoming emails". Or "quicksort ought to be fast
> enough. It's very unlikely that the input is already sorted."

That's not an apt analogy; it doesn't fit this case.

> You have still not explained why you oppose the idea of using a
> strong hashing function which ensures that no regular pattern in
> the key values will cause imbalance in their distribution. It's
> quite clear that you are just opposing it because of stubborness
> and unwillingness to admit having made a mistake.

It's simply not necessary because the input values almost never
have the kind of regularity you have in mind.

MS uses the FNV hash. This isn't cryptographic, but gives a good
distribution.

> Nowhere did I suggest using either one of those functions.
> I suggested using strong hashing functions. If those give
> a crappy result, then they quite obviously are not strong.

FNV hashing isn't crappy.

> Multiplying by a prime is a horrible hashing function, ...

No, that's suitable for hashtables.

Juha Nieminen

Sep 23, 2019, 10:33:13 AM

Bonita Montero <Bonita....@gmail.com> wrote:
>> Multiplying by a prime is a horrible hashing function, ...
>
> No, that's suitable for hashtables.

Says the person who thinks that using the key value as its own hash
function is suitable for hashtables.

Multiplying by a constant is absolutely horrible hashing. It doesn't
randomize the value at all. A linear increase in the value produces
a linear result. That's pretty much as poor as returning the value
itself. The constant being a prime has absolutely no effect on this.

Bonita Montero

Sep 23, 2019, 10:45:45 AM

>> No, that's suitable for hashtables.

> Says the person who thinks that using the key value as its own hash
> function is suitable for hashtables.

Run the numbers and write a program that checks the distribution of
bucket depths for input values with regular patterns instead of talking
nonsense.

> Multiplying by a constant is absolutely horrible hashing.

No, if you extend the input value to the size of the output hash value
and multiply it by the largest prime that fits in this type (as long
as that isn't a Mersenne number; otherwise choose the next lower prime),
that's perfectly suitable hashing.
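
As a concrete illustration of that proposal (a sketch added here, not
code from the thread; the functor name is made up and a 64-bit size_t
is assumed), such a multiplicative hash can be plugged into an
unordered container in place of std::hash<size_t>:

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// The key, widened to size_t, times the largest prime below 2^64
// (0xffffffffffffffc5, which is not a Mersenne number).
struct PrimeMulHash
{
    std::size_t operator()( std::size_t s ) const noexcept
    {
        return s * 0xffffffffffffffc5u;
    }
};

int main()
{
    std::unordered_map<std::size_t, std::string, PrimeMulHash> m;
    m[123] = "abc";                  // hashed with the functor above
    std::cout << m[123] << '\n';
}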

Bonita Montero

Sep 24, 2019, 12:27:13 AM

So I added another hash-code generation.
I'm not showing just the changes but the full source now:

#include <iostream>
#include <cstddef>
#include <random>
#include <limits>
#include <vector>
#include <cmath>
#include <functional>
#include <algorithm>
#include <intrin.h>

using namespace std;

void showHashStatistics( vector<unsigned> &v );

int main()
{
    size_t const MAX64_PRIME = 0xFFFFFFFFFFFFFFC5u;
    size_t const N_BUCKETS = 0x100000;
    random_device rd;
    uniform_int_distribution<size_t> uidHash( 0,
                                              numeric_limits<size_t>::max() );
    vector<unsigned> v;

    v.resize( N_BUCKETS, 0 );
    cout << "simple prime hash:" << endl;
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[(size_t)(uidHash( rd ) * MAX64_PRIME) % N_BUCKETS];
    showHashStatistics( v );

    size_t low, high;
    fill( v.begin(), v.end(), 0 );
    cout << "extended prime hash:" << endl;
    for( size_t i = 0; i != 10'000'000; ++i )
        // full 128-bit product via the intrinsic, high half folded into the low half
        low = _mul128( uidHash( rd ), MAX64_PRIME, &(__int64 &)high ),
        ++v[(low ^ high) % N_BUCKETS];
    showHashStatistics( v );

    cout << endl;
    cout << "hash<size_t>:" << endl;
    fill( v.begin(), v.end(), 0 );
    hash<size_t> hasher;
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[hasher( i ) % N_BUCKETS];
    showHashStatistics( v );

    double mean = (double)numeric_limits<size_t>::max() / 2.0;
    normal_distribution<double> nd( mean, sqrt( mean ) );
    cout << endl;
    cout << "normal distribution:" << endl;
    fill( v.begin(), v.end(), 0 );
    for( size_t i = 0; i != 10'000'000; ++i )
        ++v[((size_t)nd( rd ) * MAX64_PRIME) % N_BUCKETS];
    showHashStatistics( v );
}

void showHashStatistics( vector<unsigned> &v )
{
    unsigned min = numeric_limits<unsigned>::max();
    unsigned max = 0;
    for( unsigned x : v )
        min = x < min ? x : min,
        max = x > max ? x : max;
    cout << "min: " << min << " max: " << max << endl;

    double avg = 0;
    for( unsigned x : v )
        avg += x;
    avg /= v.size();
    cout << "average: " << avg << endl;

    double rmsd = 0.0;
    double sq;
    for( unsigned x : v )
        sq = (x - avg),
        rmsd += sq * sq;
    rmsd = sqrt( rmsd / v.size() );
    cout << "standard deviation: " << rmsd << endl << endl;

    cout << "distribution:" << endl;
    vector<unsigned> vDist;
    vDist.resize( max - min + 1, 0 );
    for( unsigned x : v )
        ++vDist[x - min];
    for( unsigned i = 0; i < vDist.size(); ++i )
        cout << vDist[i] << endl;
}

The relevant new part here is:

    size_t low, high;
    fill( v.begin(), v.end(), 0 );
    cout << "extended prime hash:" << endl;
    for( size_t i = 0; i != 10'000'000; ++i )
        low = _mul128( uidHash( rd ), MAX64_PRIME, &(__int64 &)high ),
        ++v[(low ^ high) % N_BUCKETS];
    showHashStatistics( v );

There I'm multiplying a random value by the largest prime fitting in
64 bits into a 128-bit value via an intrinsic, and xoring the upper
64 bits into the lower 64 bits. One might think you'd get pure
randomness here, but the distribution of bucket depths is almost the
same as when simply multiplying the input value by the mentioned prime.
So we have bucket depths from 0 to 29 instead of 0 to 28, the average
is exactly the same and the standard deviation is almost the same; so
WTF is going on here? Where's the relationship between randomness
based on prime numbers and the normal distribution?
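
For what it's worth, the same high/low fold can be written without the
MSVC intrinsic on compilers that provide unsigned __int128 (GCC and
Clang do); note this is an unsigned multiply, so the upper half differs
from the signed _mul128 result when the top bits are set. A minimal
sketch under those assumptions:

#include <cstddef>

std::size_t foldedPrimeHash( std::size_t s )
{
    std::size_t const MAX64_PRIME = 0xffffffffffffffc5u;
    unsigned __int128 p = (unsigned __int128)s * MAX64_PRIME;   // full 128-bit product
    return (std::size_t)(p >> 64) ^ (std::size_t)p;             // xor the upper 64 bits into the lower
}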