Databases for testing hash functions

César Estébanez

unread,

Mar 10, 2010, 1:57:36 PM3/10/10

to Hash Functions

Hello,

I'm writing a P.h.D. thesis about the use of Artificial Intelligence
techniques to automatically generate hash functions. In order to
compare the performance of the generated functions with the state of
the art, I'm coding a benchmarking tool in Java that includes several
tests of avalanche effect, collisions and distribution of hashes. My
problem now is how to choose the set of databases for the tests. I
want to ask if anybody knows about any database that have been used in
the past to compare hash functions and that is accepted as a good
metric.

At the moment I'm using synthetic databases that I generate following
some common patterns pointed out by Bob Jenkins in his web:

<<
1) [...] Values consist of common substrings arranged in different
orders.

2) All the characters are ASCII, with the high bit of every byte set
to 0. [...] Some rows even have identical values in some columns.
Values often differ in only a few bits.

[...]

3) Another common pattern (not present in this example) is for keys to
be nearly all zero, with only a few bits set.

Human-selected and computer-generated sets of keys almost always match
at least one of these patterns
>>

I greatly appreciate any advice/critic/suggestion/feedback you can
give me.

Thank you very much.

César.

Paul Hsieh

unread,

Mar 14, 2010, 6:54:49 AM3/14/10

to hash-fu...@googlegroups.com

Hi César,

In general, the purpose of a (non-secure) hash is to help perform a
backwards map from output back to input for some easily obtained
forwards map through table look up. This application is wide ranging,
and should be seen confined to exactly what it is supposed to be used
for.

What that means is that *any* database column which is not an index
column is a candidate for hashing. So you could quite literally scan
the web for .csv files, and take nearly anything that isn't the first
column (if its the usual 1, 2, 3, ...) and it would be a good
candidate for hashing. This will, of course, lead to the kind of
biased data Bob Jenkins suggests (ASCII, low numerics, strings with
common substrings, etc), but it may include other kinds as well.

Another place other than raw databases are computer programs. When
the compiler or interpretor generates the "symbol table" for any given
source code they usually hash up the variable names so they can be
quickly matched up from a hash table in future lookups. So you could
also go to a code searching website (like koders.com or Google's code
search) and download random source code and use regular expressions to
find all their variables/symbols to hash.

Finally, many word based programs, such as scrabble-like games or spam
filters will typically have the entire dictionary hashed in order to
check and score words one way or another. So you could take various
language dictionaries (it would be hard to avoid english) and simply
use them as collections to hash in their entirety. This actually
removes the word use bias, but this actually makes sense for some
applications.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Christian Esteve

unread,

Mar 15, 2010, 9:16:25 AM3/15/10

to Hash Functions

César,

as Paul pointed out, the data sets are application domain-specific.

Let me suggest you two pieces of work that may help you during your
research:

Mitzenmacher and Vadhan [1] have recently shown why simple hash
functions work well, in practice, as long as there is enough
randomness in the data input:

Now the question is what is "working well". You already mentioned some
good tests to try out (avalanche effect, collisions, distribution).

Henke et al. [2] have published a thorough empirical evaluation of 25
hash functions you may want to look at. Bein a networking application,
the input data sets are IP addresses if I recall right.

Hope it helps!

Best,
Christian

[1] Mitzenmacher, M. and Vadhan, S. 2008. Why simple hash functions
work: exploiting the entropy in a data stream. In Proceedings of the
Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (San
Francisco, California, January 20 - 22, 2008). Symposium on Discrete
Algorithms. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 746-755.

[2] Henke, C., Schmoll, C., and Zseby, T. 2008. Empirical evaluation
of hash functions for multipoint measurements. SIGCOMM Comput. Commun.
Rev. 38, 3 (Jul. 2008), 39-50. DOI= http://doi.acm.org/10.1145/1384609.1384614

> Paul Hsiehhttp://www.pobox.com/~qed/http://bstring.sf.net/

Andres Valloud

unread,

Mar 17, 2010, 7:34:44 PM3/17/10

to hash-fu...@googlegroups.com

Also, you may consider a couple data points. First, I wrote a book
called Hashing in Smalltalk: Theory and Practice. If the 2008 paper
abstract interests you, you may also be interested in the book. It
has a detailed survey with more than 50 hash functions, plus analysis
and improvements for over 100 more. Also, it attempts (I think with
some success) to standardize on what "good quality" means for
non-cryptographic hash functions. Finally, it has 230 exercises or so
which expand on the material as well. You can get it here:

www.lulu.com/avSmalltalkBooks

In addition to that, you can get the tool I wrote to do the research.
It already comes with ~300 hash functions implemented, as well as with
~100 data sets. The tool is called the Hash Analysis Tool, and it's
available under the MIT license. To get the tool, download a copy of
VisualWorks non-commercial (from www.cincomsmalltalk.com), connect to
the public Store repository (Store is the version control system for
VisualWorks), and load the bundles called "Hash Analysis Tool" and
"Hash Analysis Tool Extensions".

Andres.

> --
> You received this message because you are subscribed to the Google Groups "Hash Functions" group.
> To post to this group, send email to hash-fu...@googlegroups.com.
> To unsubscribe from this group, send email to hash-function...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/hash-functions?hl=en.
>
>

César Estébanez

unread,

Mar 19, 2010, 9:48:25 AM3/19/10

to Hash Functions

Hello,

I've been coped with so much teaching work this week, and I couldn't
answer your posts. I appreciate a lot your advise:

-- Paul: I could use the idea of hashing source code. Right now, apart
from the synthetic data sets I mentioned in my previous post, I'm
using several sets that probably match very well your suggestions:

1) A huge synthetic MySQL table of clients (name, address, age,
nationality, etc.) from which I extract some columns;

2) A passwords file used by hackers to make dictionary attacks (it
contains millions of words in 23 different languages, as well as
combinations of characters and numbers, and other typical password-
like words).

3) A file containing a large list of common spanish names.

4) The next data set I want to use is an English dictionary:
http://www.gutenberg.org/etext/673. I'm looking for conveniently
formatted open dictionaries on other languages.

-- Christian: I already had the [Henke 2008] paper: very in-depth,
interesting work, although it is a little bit application-specific
(good for networking practitioners, not as good for me).

I didn't know about [Mitzenmacher and Vadhan 2008] paper. Abstract and
Conclusions (the sections that a researcher always reads in the first
place, and (sadly) often the only ones :-) sounds like very
interesting information for my research. Also the bibliography
contains some of references I could use. I can't wait to thoroughly
read it. I appreciate a lot this kind of information.

So, sure it helps! Thank you very much.

-- Andres: I already have your book. I just received my hard copy some
days ago, but I still didn't have time to read it. Also, I finished
installing VisualWorks a few minutes ago. I'm not familiar with
Smalltalk or VW, so I'm still trying to figure out how to use it. I
will let you know if I get into serious trouble :-)

Thank you very much again. My main research field is Artificial
Intelligence, so all the feedback coming from experts on Hash
Functions is incredible valuable for me.

Best regards,

César.

On 18 mar, 00:34, Andres Valloud <andres.vall...@gmail.com> wrote:
> Also, you may consider a couple data points. First, I wrote a book
> called Hashing in Smalltalk: Theory and Practice. If the 2008 paper
> abstract interests you, you may also be interested in the book. It
> has a detailed survey with more than 50 hash functions, plus analysis
> and improvements for over 100 more. Also, it attempts (I think with
> some success) to standardize on what "good quality" means for
> non-cryptographic hash functions. Finally, it has 230 exercises or so
> which expand on the material as well. You can get it here:
>
> www.lulu.com/avSmalltalkBooks
>
> In addition to that, you can get the tool I wrote to do the research.
> It already comes with ~300 hash functions implemented, as well as with
> ~100 data sets. The tool is called the Hash Analysis Tool, and it's
> available under the MIT license. To get the tool, download a copy of

> VisualWorks non-commercial (fromwww.cincomsmalltalk.com), connect to

> > Rev. 38, 3 (Jul. 2008), 39-50. DOI=http://doi.acm.org/10.1145/1384609.1384614

Andres Valloud

unread,

Mar 19, 2010, 7:44:52 PM3/19/10

to hash-fu...@googlegroups.com

Thank you Cesar, let me know your thoughts about the book so I can
make a better second edition sometime in the next few years. Also, if
you are interested in VisualWorks, you should subscribe to the VWNC
mailing list. Here are the instructions on how to subscribe:

http://www.cincomsmalltalk.com/CincomSmalltalkWiki/VisualWorks+Online+Communities