[Caml-list] range of hash function

Grégoire Seux

unread,

Feb 20, 2010, 3:54:34 AM2/20/10

to caml-list

hello !

i would like to use the polymorhpic hash function on strings. But i would
like to know what is the probability of a collision between two hashes.

my first question is about the range of the Hashtbl.hash function: what is
its range ? ( string -> [1..N] ?)

the second question is : can i assume that the result is a uniform
distribution over [1..N] ? (for 10⁶ words which is an estimation of the
english vocabulary size)

the third one is : is it possible to predict which will be the collision ? I
mean collisions are between words which are very 'similar' (for ex: "boy"
and "boys") or are completely unpredictable.

thanks !

--
Grégoire Seux

Goswin von Brederlow

unread,

Feb 20, 2010, 5:52:18 AM2/20/10

to Grgoire Seux, caml-list

Grégoire Seux <kamarad...@gmail.com> writes:

> hello !
>
> i would like to use the polymorhpic hash function on strings. But i would like
> to know what is the probability of a collision between two hashes.
>
> my first question is about the range of the Hashtbl.hash function: what is its
> range ? ( string -> [1..N] ?)

It is range [min_int..max_int]. Meaning 31 or 63 bits.

> the second question is : can i assume that the result is a uniform distribution

> over [1..N] ? (for 10? words which is an estimation of the english vocabulary
> size)

If the function is any good it will be reasonably uniform. And if it is
crap then I bet someone have replaced it already.

> the third one is : is it possible to predict which will be the collision ? I
> mean collisions are between words which are very 'similar' (for ex: "boy" and
> "boys") or are completely unpredictable.

You can read the source, analyze the math and try to figure out a
collision. From past experience every widely used hash function so far
has been shown to be vulnerable. Like md5 is now considered compromised
due to the ease of creating a collision. Sha1 is deemed unsafe too. It
is just a matter of how much brainpower and cpu time you want to throw
at it. Or pure bad luck.

Is the Hashtbl.hash the same function used to hash polymorphic variant
types? If so then there are collisions between words which are verry
similar. I knew of an example once but google isn't verry helpfull
finding it again.

MfG
Goswin

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

ygrek

unread,

Feb 20, 2010, 7:40:25 AM2/20/10

to caml-list

On Sat, 20 Feb 2010 11:52:05 +0100
Goswin von Brederlow <goswi...@web.de> wrote:

> GrÃ©goire Seux <kamarad...@gmail.com> writes:
>
> > hello !
> >
> > i would like to use the polymorhpic hash function on strings. But i would like

> > to know what is the probability of a collision between two hashes.Â

> >
> > my first question is about the range of the Hashtbl.hash function: what is its
> > range ? ( string -> [1..N] ?)
>
> It is range [min_int..max_int]. Meaning 31 or 63 bits.
>
> > the second question is : can i assume that the result is a uniform distribution
> > over [1..N] ? (for 10? words which is an estimation of the english vocabulary
> > size)

The best way is to test yourself. Or rely on someone else's tests :)

http://alaska-kamtchatka.blogspot.com/2009/06/hash-n-mix.html

> You can read the source, analyze the math and try to figure out a
> collision. From past experience every widely used hash function so far
> has been shown to be vulnerable. Like md5 is now considered compromised
> due to the ease of creating a collision. Sha1 is deemed unsafe too. It
> is just a matter of how much brainpower and cpu time you want to throw
> at it. Or pure bad luck.

Hashtbl.hash function is not required to be cryptographically strong, but rather
uniformly random on "usual" input data and fast.

--
ygrek
http://ygrek.org.ua

Jacques Garrigue

unread,

Feb 20, 2010, 8:54:27 AM2/20/10

to kamarad...@gmail.com, caml...@inria.fr

From: Grégoire Seux <kamarad...@gmail.com>

> i would like to use the polymorhpic hash function on strings. But i would
> like to know what is the probability of a collision between two hashes.
>
> my first question is about the range of the Hashtbl.hash function: what is
> its range ? ( string -> [1..N] ?)

Just to get things straight: this is 0..2^30-1 (0..0x3fffffff).
The result of the hash function is the same on 32-bit and 64-bit
architectures.

> the second question is : can i assume that the result is a uniform
> distribution over [1..N] ? (for 10⁶ words which is an estimation of the
> english vocabulary size)

The algorithm for strings is as follows:

i = caml_string_length(obj);
for (p = &Byte_u(obj, 0); i > 0; i--, p++)
hash_accu = hash_accu * 19 + *p;

So you can assume an uniform distribution for sufficiently long
strings.

> the third one is : is it possible to predict which will be the collision ? I
> mean collisions are between words which are very 'similar' (for ex: "boy"
> and "boys") or are completely unpredictable.

Since you have the algorithm, you can predict collisions. Basically
shifting character n by 1 is equivalent to shifting character n+1 by
19, so you have lots of collisions. But this hash function being
intended for hashtables, collisions are not a problem, uniform
distribution matters more.

By the way, for polymorphic variants collisions matter, and the hash
function is different. The range is 31-bits rather than 30-bits, and
the factor is 243, so that names of no more than 4 characters are
guaranteed to be different. You still have collisions, but they are
going to be less similar.

Both hash functions are defined in byterun/hash.c.

Hope this helps,

Jacques Garrigue

Richard Jones

unread,

Feb 21, 2010, 6:10:16 PM2/21/10

to Jacques Garrigue, kamarad...@gmail.com, caml...@inria.fr

On Sat, Feb 20, 2010 at 10:54:12PM +0900, Jacques Garrigue wrote:
> Since you have the algorithm, you can predict collisions. Basically
> shifting character n by 1 is equivalent to shifting character n+1 by
> 19, so you have lots of collisions. But this hash function being
> intended for hashtables, collisions are not a problem, uniform
> distribution matters more.

I would slightly dispute your assertion that collisions are not a
problem, because of algorithmic complexity attacks:

http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/index.html

The hash implementation used by Perl was changed to avoid this attack:

http://perl5.git.perl.org/perl.git/blob/HEAD:/perl.c#l1465

Rich.

--
Richard Jones
Red Hat