
ANN: hashlib v1.0.0.1 released


CBFalconer

May 31, 2003, 9:39:45 PM
Thanks to feedback from Hans-Juergen Taenzer, a mild memory leak in
hashlib has been exterminated. The revised release is available
at:

<http://cbfalconer.home.att.net/download/hashlib.zip>

in source form, under GNU licence. The code is entirely portable
ISO standard C. A test suite is included.

Hashlib is a generalized system for building, searching, deleting,
and generally manipulating data in hash tables. The user supplies
several simple functions on opening the hashtable, including ones
to copy data objects and to hash them. The library then controls
all other aspects, including the size of the hashtable.
The library can manipulate multiple hash tables with the same code
and is re-entrant. Examples show several usages, including how to
sort a snapshot of the table contents.
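
To make the calling pattern concrete, here is a sketch of the kind of
user-supplied helpers such a library expects at open time. The names
and signatures are purely illustrative, not the actual hashlib
identifiers; see the hashlib sources for the real interface.

/* Illustrative helpers only -- not the real hashlib API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct item { char *key; long value; } item;

/* primary and secondary string hashes, using two different multipliers */
static unsigned long item_hash(const void *p)
{
    const char *s = ((const item *)p)->key;
    unsigned long h = 0;
    while (*s) h = 31 * h + (unsigned char)*s++;
    return h;
}

static unsigned long item_rehash(const void *p)
{
    const char *s = ((const item *)p)->key;
    unsigned long h = 0;
    while (*s) h = 37 * h + (unsigned char)*s++;
    return h;
}

/* equality/ordering on the key */
static int item_cmp(const void *a, const void *b)
{
    return strcmp(((const item *)a)->key, ((const item *)b)->key);
}

/* deep copy and matching destructor, so the table can own its entries */
static void *item_copy(const void *p)
{
    const item *src = p;
    item *dst = malloc(sizeof *dst);
    if (!dst) return NULL;
    dst->key = malloc(strlen(src->key) + 1);
    if (!dst->key) { free(dst); return NULL; }
    strcpy(dst->key, src->key);
    dst->value = src->value;
    return dst;
}

static void item_free(void *p)
{
    item *it = p;
    free(it->key);
    free(it);
}

int main(void)
{
    /* In real use these five pointers would be handed to the table-open
       call and never touched again; here we just exercise them directly. */
    item probe = {"example", 42};
    void *copy = item_copy(&probe);
    if (copy) {
        printf("hash=%lu rehash=%lu cmp=%d\n",
               item_hash(copy), item_rehash(copy), item_cmp(copy, &probe));
        item_free(copy);
    }
    return 0;
}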

In general a hash table allows data storage and retrieval with
O(1) speed. Performance can be heavily affected by the
suitability of the hash functions to the actual data, and thus
those functions are completely under the control of the user.
Some useful hash functions for operating on strings are
available, together with links to functions for other forms of
data.

--
Chuck F (cbfal...@yahoo.com) (cbfal...@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Dan Pop

Jun 2, 2003, 1:25:17 PM
In <3ED957E2...@yahoo.com> CBFalconer <cbfal...@yahoo.com> writes:

>Subject: Re: ANN: hashlib v1.0.0.1 released

I thought that only Microsoft had such a high bug rate that it needed
a 4-number versioning scheme...

Dan ;-)
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan...@ifh.de

arun

Jun 3, 2003, 3:02:50 AM

"CBFalconer" <cbfal...@yahoo.com> wrote in message
news:3ED957E2...@yahoo.com...

> Thanks to feedback from Hans-Juergen Taenzer a mild memory leak in
> hashlib has been exterminated. The revised release is available
> at:
>
> <http://cbfalconer.home.att.net/download/hashlib.zip>
>
> in source form, under GNU licence. The code is entirely portable
> ISO standard C. A test suite is included.
>
> Some userful hash functions for operating on strings are
> available, together with links to functions for other forms of
> data.

I couldn't find the links to functions for other forms of data.
Can I have them? (unsigned int especially)

Also, regarding the string hash functions, I have heard in
umpteen places that Torek's algorithm is the best for string
hashing. I have used it but not seen it. What's the difference
here?

Thanks
arun


Joona I Palaste

Jun 3, 2003, 3:05:35 AM
Dan Pop <Dan...@cern.ch> scribbled the following:

> In <3ED957E2...@yahoo.com> CBFalconer <cbfal...@yahoo.com> writes:

>>Subject: Re: ANN: hashlib v1.0.0.1 released

> I thought that only Microsoft had such a high bug rate that it needed
> a 4-number versioning scheme...

If it were from Microsoft it would be called hashlib 2000 alpha Service
Pack 4. =)

--
/-- Joona Palaste (pal...@cc.helsinki.fi) ---------------------------\
| Kingpriest of "The Flying Lemon Tree" G++ FR FW+ M- #108 D+ ADA N+++|
| http://www.helsinki.fi/~palaste W++ B OP+ |
\----------------------------------------- Finland rules! ------------/
"I am lying."
- Anon

CBFalconer

Jun 3, 2003, 7:31:44 AM

Look in the hashtest.c code for routines and references. There is
a suitable one there. What is Torek's algorithm?

In general the suitability of a hash function is highly data
specific; the beauty of hashlib is that it will tolerate a
terrible function at the expense of performance. One of the tests
does exactly that, to show that the net result is unchanged. The
summaries (probes, misses, etc.) show the performance difference.
Note that the rehash function MUST be different from the hash
function, else it won't break up clustering.
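
The role of the second function is easiest to see in a bare
double-hashing loop. The sketch below is a generic open-addressing
probe, not the hashlib source; the point is only that deriving the
probe step from a second, different hash keeps keys that share a home
slot on separate probe paths instead of piling them into one cluster.

/* Generic double-hashing sketch (not the hashlib code). */
#include <stdio.h>
#include <string.h>

#define TSIZE 101                       /* prime table size */

static const char *table[TSIZE];

static unsigned long hash1(const char *s)
{
    unsigned long h = 0;
    while (*s) h = 31 * h + (unsigned char)*s++;
    return h;
}

static unsigned long hash2(const char *s)
{
    unsigned long h = 0;
    while (*s) h = 37 * h + (unsigned char)*s++;
    return h;
}

/* returns the slot index, inserting if absent; assumes the table
   never fills (a real library grows the table instead) */
static int lookup(const char *key)
{
    unsigned long i    = hash1(key) % TSIZE;
    unsigned long step = hash2(key) % (TSIZE - 1) + 1;   /* never zero */

    while (table[i] && strcmp(table[i], key) != 0)
        i = (i + step) % TSIZE;         /* rehash breaks up clustering */
    if (!table[i]) table[i] = key;
    return (int)i;
}

int main(void)
{
    printf("%d %d %d\n", lookup("He"), lookup("ID"), lookup("J#"));
    return 0;
}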

Dann Corbit

Jun 3, 2003, 2:51:10 PM
"CBFalconer" <cbfal...@yahoo.com> wrote in message
news:3EDC76EB...@yahoo.com...

> arun wrote:
> >
> > "CBFalconer" <cbfal...@yahoo.com> wrote in message
> > news:3ED957E2...@yahoo.com...
> > > Thanks to feedback from Hans-Juergen Taenzer a mild memory leak in
> > > hashlib has been exterminated. The revised release is available
> > > at:
> > >
> > > <http://cbfalconer.home.att.net/download/hashlib.zip>
> > >
> > > in source form, under GNU licence. The code is entirely portable
> > > ISO standard C. A test suite is included.
> > >
> > > Some userful hash functions for operating on strings are
> > > available, together with links to functions for other forms of
> > > data.
> >
> > I couldnt find the links to functions for other forms of data.
> > Can i have them ? ( unsigned int especially)
> >
> > And also regarding the string hash functions, i have heard in
> > umpteen places that Toreks algorithm is the best for string
> > hashing. I have used it but not seen it. Whats the difference
> > here ?
>
> Look in the hashtest.c code for routines and references. There is
> a suitable one there. What is Toreks algorithm?

unsigned long hash(const char *s)
{
    unsigned long h;
    for (h = 5381; *s; s++) {
        h *= 33;
        h += *s;
    }
    return h;
}

Sometimes attributed to Chris Torek, but I think that Dan Bernstein
is the actual inventor.

An improved version is found in the public domain djbdns code base:
http://cr.yp.to/djbdns.html

> In general the suitability of a hash function(s) is highly data
> specific, the beauty of hashlib is that it will tolerate a
> terrible function at the expense of performance.

A terrible hash function will result in terrible hashing. See (for
instance) K&R1 for an example of a terrible hash function.
http://www.cs.yorku.ca/~oz/hash.html
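
For reference, the K&R1 function usually criticized is, as commonly
reproduced (reconstructed from memory, so treat the details as
approximate), a plain sum of the characters reduced modulo the table
size:

/* Approximate reconstruction of the oft-criticized K&R1-style hash:
   adding characters commutes, so "abc", "bca" and "cab" all collide,
   and short strings only ever reach the low end of the table. */
unsigned kr1_style_hash(const char *s, unsigned table_size)
{
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h % table_size;
}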

> One of the tests
> does exactly that, to show that the net result is unchanged. The
> summaries (probes, misses, etc.) show the performance difference.
> Note that the rehash function MUST be different from the hash
> function, else it won't break up clustering.

The function used to hash is of paramount importance. There are some
strong hashing functions here:
ftp://cap.connx.com/chess-engines/new-approach/crypto-mod.zip
Sorry, it's C++ rather than C.

For long strings and 64 bit hashing, UMAC is hard to beat:
http://www.cs.ucdavis.edu/~rogaway/umac/
--
C-FAQ: http://www.eskimo.com/~scs/C-faq/top.html
"The C-FAQ Book" ISBN 0-201-84519-9
C.A.P. FAQ: ftp://cap.connx.com/pub/Chess%20Analysis%20Project%20FAQ.htm

CBFalconer

Jun 3, 2003, 7:14:50 PM
Dann Corbit wrote:
> "CBFalconer" <cbfal...@yahoo.com> wrote in message
> > >
... snip ...

> >
> > Look in the hashtest.c code for routines and references. There is
> > a suitable one there. What is Toreks algorithm?
>
> unsigned long hash(const char *s)
> {
> unsigned long h;
> for (h = 5381; *s; s++) {
> h *= 33;
> h += *s;
> }
> return h;
> }
>
> Sometimes attributed to Chris Torek, I think that Dan Bernstein is
> the actual inventor. An improved version is found in the public
> domain djbdns code base: http://cr.yp.to/djbdns.html

The GP ones I supplied with hashlib replace 5381 with 0, and 33
with 31 or 37 for hash and rehash. Cribbed from Kernighan & Pike,
"The Practice of Programming". Keeping the multiplier prime seems
like a good idea in general.

>
> > In general the suitability of a hash function(s) is highly data
> > specific, the beauty of hashlib is that it will tolerate a
> > terrible function at the expense of performance.
>
> A terrible hash function will result in terrible hashing. See (for
> instance) K&R1 for an example of a terrible hash function.
> http://www.cs.yorku.ca/~oz/hash.html

Can you imagine one worse than "#define hash(x) 0"? Except that
it has to be a real function so its address can be passed, hashlib
will survive this one. However, the performance is no longer O(1).
:-(

>
> > One of the tests
> > does exactly that, to show that the net result is unchanged. The
> > summaries (probes, misses, etc.) show the performance difference.
> > Note that the rehash function MUST be different from the hash
> > function, else it won't break up clustering.
>
> The function used to hash is of paramount importance. There are some
> strong hashing functions here:
> ftp://cap.connx.com/chess-engines/new-approach/crypto-mod.zip
> Sorry, it's C++ rather than C.
>
> For long strings and 64 bit hashing, UMAC is hard to beat:
> http://www.cs.ucdavis.edu/~rogaway/umac/

However, the optimum still depends on the input data set. For a
fixed data set it is usually possible (but not necessarily
worthwhile) to find a perfect hash. The primary use I have seen
of such is to quickly resolve whether or not an input string is a
reserved word.
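
A toy illustration of the reserved-word case (an invented example,
not taken from hashlib): when the keywords happen to have distinct
lengths, strlen itself already serves as a perfect hash.

/* Toy perfect hash: for this particular keyword set the lengths
   2, 4, 5 and 6 are all distinct, so strlen() maps each keyword to
   its own slot with no collisions at all. */
#include <stdio.h>
#include <string.h>

static const char *slot[7] = {
    0, 0, "if", 0, "else", "while", "return"
};

static int is_reserved(const char *word)
{
    size_t len = strlen(word);
    return len < 7 && slot[len] && strcmp(slot[len], word) == 0;
}

int main(void)
{
    printf("%d %d\n", is_reserved("while"), is_reserved("whale"));  /* 1 0 */
    return 0;
}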

Hashlib originally grew out of a test bed for hashing functions.
It still maintains statistics to allow easy measurement.

Dann Corbit

Jun 3, 2003, 9:09:12 PM
"CBFalconer" <cbfal...@yahoo.com> wrote in message
news:3EDD0CEA...@yahoo.com...

> Dann Corbit wrote:
> > "CBFalconer" <cbfal...@yahoo.com> wrote in message
> > > >
> ... snip ...
> > >
> > > Look in the hashtest.c code for routines and references. There is
> > > a suitable one there. What is Toreks algorithm?
> >
> > unsigned long hash(const char *s)
> > {
> > unsigned long h;
> > for (h = 5381; *s; s++) {
> > h *= 33;
> > h += *s;
> > }
> > return h;
> > }
> >
> > Sometimes attributed to Chris Torek, I think that Dan Bernstein is
> > the actual inventor. An improved version is found in the public
> > domain djbdns code base: http://cr.yp.to/djbdns.html
>
> The GP ones I supplied with hashlib replace 5381 with 0, and 33
> with 31 or 37 for hash and rehash. Cribbed from Kernighan & Pike,
> "Practice of Programming". Keeping the multiplier prime seems
> like a good idea in general.

33 is better. Nobody knows why.


Here are the tests I would like to see:
1. Funneling -- do several distinct inputs map to one output unnecessarily
(e.g. your '#define hash(x) 0' is a pure funnel).
2. Are the hashes normal when given a large set of inputs?
3. Timing -- how fast is it?
4. Degeneration when long chains form?
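
As a rough illustration of test 2 (a made-up harness, not part of
hashlib or of any posted test suite): hash a batch of synthetic keys
into a fixed number of buckets and compare the fullest and emptiest
buckets with the expected average.

/* Minimal bucket-distribution check for a string hash. */
#include <stdio.h>

#define NKEYS   100000
#define NBUCKET 1024

static unsigned long djb_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    static unsigned long count[NBUCKET];
    char key[32];
    unsigned long i, min, max;

    for (i = 0; i < NKEYS; i++) {
        sprintf(key, "key%lu", i);             /* synthetic key set */
        count[djb_hash(key) % NBUCKET]++;
    }
    min = max = count[0];
    for (i = 1; i < NBUCKET; i++) {
        if (count[i] < min) min = count[i];
        if (count[i] > max) max = count[i];
    }
    printf("expected %lu per bucket, observed min %lu, max %lu\n",
           (unsigned long)NKEYS / NBUCKET, min, max);
    return 0;
}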

Chris Torek

Jun 3, 2003, 11:26:55 PM
>"CBFalconer" <cbfal...@yahoo.com> wrote in message
>news:3EDC76EB...@yahoo.com...
>> Look in the hashtest.c code for routines and references. There is
>> a suitable one there. What is Toreks algorithm?

In article <bbiqm...@enews3.newsguy.com>
Dann Corbit <dco...@connx.com> writes:
>unsigned long hash(const char *s)
>{
> unsigned long h;
> for (h = 5381; *s; s++) {
> h *= 33;
> h += *s;
> }
> return h;
>}

Actually I always write this as something like:

for (h = 0; (c = *s) != '\0'; s++)
    h = h * 33 + c;

I have no idea where "5381" came from; it should have no real
effect anyway.

>Sometimes attributed to Chris Torek, I think that Dan Bernstein is the
>actual inventor.

I stole it from Gosling Emacs, back in the mid-1980s.

I thought 31 would be a "better" multiplier than 33, so I tried
31, and it turned out to be worse in the tests I did. Later I
suggested adding it to a more thorough set of tests by some folks
at Berkeley, who found the same thing -- 33 works better than 31,
for almost all our sample inputs. None of us can explain this. :-)

There are hash functions with better distributions, but one must
trade hash distribution against hash computation-time. The "best"
tradeoff is likely to change over time, but when we were doing our
tests (me in the mid-1980s, Berkeley in the early 1990s), this
simple "multiply by 33" method kept winning because of its rapid
computation. That is, we could afford the extra rehashes (linear
probing, chains, what-have-you) resulting from collisions, "spending"
only some of the time saved by faster initial hash.
--
In-Real-Life: Chris Torek, Wind River Systems (BSD engineering)
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://67.40.109.61/torek/index.html (for the moment)
Reading email is like searching for food in the garbage, thanks to spammers.

CBFalconer

Jun 4, 2003, 2:35:57 AM
Dann Corbit wrote:
> "CBFalconer" <cbfal...@yahoo.com> wrote in message
> >
...snip...
> >
> > Hashlib originally grew out of a test bed for hashing functions.
> > It still maintains statistics to allow easy measurement.
>
> Here are the tests I would like to see:
> 1. Funneling -- do several identical inputs map to one output
> unnecessarily (e.g. your '#define hash(x) 0' is a pure funnel).
> 2. Are the hashes normal when given a large set of inputs?
> 3. Timing -- how fast is it?
> 4. Degeneration when long chains form?

I'm not sure what you mean. As far as 4 is concerned, hashlib
rehashes on collisions, specifically to avoid chains and
clustering. I forget the correct terminology. In general I find an
average of 1.5 to 3 probes to insert or find things. The hash
functions are external, and entirely up to the user. Hashlib
automatically reorganizes into a larger table when full, so the
user does not need to worry about maximum capacity. This
maintains the table in the 50% to 88% full range.
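
One plausible reading of that policy (a guess at the mechanics, not
code from hashlib): track the live entry count, and once the load
factor crosses roughly 7/8, rebuild into a table about 7/4 as large
and rehash every stored item into it, which drops the load back to
about 1/2 and keeps it inside the 50%-88% band mentioned above.

/* Hedged sketch of such a resize policy -- invented numbers, not
   the hashlib source. */
#include <stddef.h>

struct table_stats {
    size_t entries;   /* live items currently stored      */
    size_t slots;     /* current size of the bucket array */
};

/* returns the new slot count, or 0 if no reorganization is needed */
size_t next_table_size(const struct table_stats *t)
{
    if (t->entries * 8 < t->slots * 7)   /* still under 7/8 full */
        return 0;
    return t->slots / 4 * 7 + 1;         /* grow by roughly 7/4 */
}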

CBFalconer

Jun 4, 2003, 2:36:00 AM
Chris Torek wrote:
> Dann Corbit <dco...@connx.com> writes:
>
> >unsigned long hash(const char *s)
> >{
> > unsigned long h;
> > for (h = 5381; *s; s++) {
> > h *= 33;
> > h += *s;
> > }
> > return h;
> >}
>
... snip ...

>
> I thought 31 would be a "better" multiplier than 33, so I tried
> 31, and it turned out to be worse in the tests I did. Later I
> suggested adding it to a more thorough set of tests by some folks
> at Berkeley, who found the same thing -- 33 works better than 31,
> for almost all our sample inputs. None of us can explain this. :-)

Could it be that you had a system that looked for patterns formed
by adding two power-of-two multiples of the multiplier? 31 works
just as well by subtracting. Are you talking about how the
function distributed the values, or how fast it ran?

I suspect the shift and add methods are largely pointless with
today's hardware multipliers.

I am comparing effectively:

inline unsigned int mul31(unsigned int m)
{
    unsigned int mm;

    mm = m << 5;
    return mm - m;
} /* mul31 */

inline unsigned int mul33(unsigned int m)
{
    unsigned int mm;

    mm = m << 5;
    return mm + m;
} /* mul33 */

with m = m * 33; and m = m * 31;

Chris Torek

Jun 4, 2003, 7:19:59 AM
[I wrote, regarding the simple (h = h * 33 + *s++) recurrence hash]:

>> I thought 31 would be a "better" multiplier than 33, so I tried
>> 31, and it turned out to be worse in the tests I did. Later I
>> suggested adding it to a more thorough set of tests by some folks
>> at Berkeley, who found the same thing -- 33 works better than 31,
>> for almost all our sample inputs. None of us can explain this. :-)

In article <3EDD8AAC...@yahoo.com>
CBFalconer <cbfal...@worldnet.att.net> writes:
>Could it be that you had a system that looked for patterns formed
>by adding two powers of 2 multiples of the multiplier? 31 works
>just as well by subtracting. Are you talking about how the
>function distributed the values, or how fast it ran?

Distribution of course. Any C compiler worth its salt will turn
the two multiplies into shift operations (and back in the 1980s,
we just did this manually).

In any case, this simple recurrence hash had "pretty good"
distributions, not just for ASCII data but even for some binary
data that the Berkeley folks tested. (Of course, they had to
change the function to take a byte count.) All of the "better"
functions tested -- "better" in terms of hash distribution --
were substantially more expensive computationally, hurting their
overall performance.
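
For what it's worth, the counted-length form they would have ended up
with is just the same recurrence driven by a byte count instead of a
terminator (a sketch of the obvious adaptation, not the actual
Berkeley test code):

#include <stddef.h>

/* Same multiply-by-33 recurrence, with an explicit length so it also
   works on binary data containing embedded zero bytes. */
unsigned long hash_mem(const void *data, size_t len)
{
    const unsigned char *p = data;
    unsigned long h = 0;

    while (len--)
        h = h * 33 + *p++;
    return h;
}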

Dann Corbit

Jun 5, 2003, 7:19:05 PM
"Chris Torek" <nos...@elf.eng.bsdi.com> wrote in message
news:bbjotv$bf0$1...@elf.eng.bsdi.com...

> >"CBFalconer" <cbfal...@yahoo.com> wrote in message
> >news:3EDC76EB...@yahoo.com...
> >> Look in the hashtest.c code for routines and references. There is
> >> a suitable one there. What is Toreks algorithm?
>
> In article <bbiqm...@enews3.newsguy.com>
> Dann Corbit <dco...@connx.com> writes:
> >unsigned long hash(const char *s)
> >{
> > unsigned long h;
> > for (h = 5381; *s; s++) {
> > h *= 33;
> > h += *s;
> > }
> > return h;
> >}
>
> Actually I always write this as something like:
>
> for (h = 0; (c = *s) != '\0'; s++)
> h = h * 33 + c;
>
> I have no idea where "5381" came from; it should have no real
> effect anyway.
>
> >Sometimes attributed to Chris Torek, I think that Dan Bernstein is the
> >actual inventor.
>
> I stole it from Gosling Emacs, back in the mid-1980s.


That certainly does not bode well for its use, considering the emacs
debacle.
[snip]

Bob Jenkins

Jun 5, 2003, 8:22:25 PM
Chris Torek <nos...@elf.eng.bsdi.com> wrote in message news:<bbjotv$bf0$1...@elf.eng.bsdi.com>...

'He', 'ID', and 'J#' have the same hash value. Any two strings
differing only by replacing one of those substrings with another will
have the same hash value. Most 2-letter printable ASCII strings have
two synonyms like this.
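
That is easy to check: with h = h*33 + c, 'H'*33+'e', 'I'*33+'D' and
'J'*33+'#' all come to 2477, so a common prefix and suffix around any
of the three pairs leaves the final value unchanged. A quick check
program (illustrative, not from any posted code):

#include <stdio.h>

static unsigned long hash33(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    /* each line should print identical values */
    printf("%lu %lu %lu\n", hash33("He"), hash33("ID"), hash33("J#"));
    printf("%lu %lu\n", hash33("xHey"), hash33("xIDy"));
    return 0;
}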

If you restrict yourself to just lowercase ASCII letters, no
substrings of length 6 will collide. The same goes for 31 instead of
33, though I don't know why one would be better than the other.

Slightly more complicated hashes:

for (h = 5381; (c = *s) != '\0'; s++)
    h = h * 33 ^ c;          /* Swapping He, ID won't always collide */

for (h = 5381; (c = *s) != '\0'; s++) {
    h += c;
    h += (h<<26)+(h>>6);     /* Rotates are fast on Intel */
}

for (h = 5381; (c = *s) != '\0'; s++) {
    h += c;
    h += (h<<10);
    h ^= (h>>6);             /* one-at-a-time, minus postprocessing */
}

for (h = 2166136261; (c = *s) != '\0'; s++)
    h = (h * 16777619) ^ c;  /* FNV without postprocessing */

The 5381, or something nonzero, causes all-zero strings of different
lengths to have different hash values. All-zero strings are common in
some applications.
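
For completeness, the "minus postprocessing" remark refers to a short
final mixing step; the commonly published full one-at-a-time hash is
roughly the following (reconstructed from memory and designed around
32-bit words, so verify against burtleburtle.net before relying on it):

#include <stddef.h>

unsigned long one_at_a_time(const unsigned char *key, size_t len)
{
    unsigned long h = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        h += key[i];
        h += (h << 10);
        h ^= (h >> 6);
    }
    h += (h << 3);      /* the final avalanche ("postprocessing") */
    h ^= (h >> 11);
    h += (h << 15);
    return h;
}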

CBFalconer

Jun 5, 2003, 10:18:11 PM
Bob Jenkins wrote:
> Chris Torek <nos...@elf.eng.bsdi.com> wrote in message
> > Dann Corbit <dco...@connx.com> writes:
> > >"CBFalconer" <cbfal...@yahoo.com> wrote in message
> > >
> > >> Look in the hashtest.c code for routines and references.
> > >> There is a suitable one there. What is Toreks algorithm?
> >

In C those aren't strings, they are arrays, and the zero can't
normally arise. What needs to be looked out for is strings of
blanks, which are system dependent.

A ccitcrc16 might be appropriate, though. You need either
assembly or a large table to implement it efficiently in C, where
large is 256 entries. And this is an application where hash speed
matters.

At any rate hashlib allows you to try all these out very easily.
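
In case anyone wants to try it, here is a hedged sketch of the
table-driven approach being described, using the CCITT polynomial
0x1021 with a 256-entry table built once at startup. The 0xFFFF
starting value is one common convention; other CRC-16 variants differ
in such parameters. Not code from hashlib.

#include <stdio.h>
#include <stddef.h>

static unsigned short crctab[256];

/* build the 256-entry table for the CCITT polynomial 0x1021 */
static void crc16_init(void)
{
    unsigned i, b;
    for (i = 0; i < 256; i++) {
        unsigned short crc = (unsigned short)(i << 8);
        for (b = 0; b < 8; b++)
            crc = (crc & 0x8000U) ? (unsigned short)((crc << 1) ^ 0x1021U)
                                  : (unsigned short)(crc << 1);
        crctab[i] = crc;
    }
}

/* one table lookup per input byte */
static unsigned short crc16(const unsigned char *p, size_t n)
{
    unsigned short crc = 0xFFFFU;
    while (n--)
        crc = (unsigned short)((crc << 8) ^ crctab[((crc >> 8) ^ *p++) & 0xFFU]);
    return crc;
}

int main(void)
{
    crc16_init();
    printf("%04X\n", crc16((const unsigned char *)"123456789", 9));
    return 0;
}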

Daniel Vallstrom

Jun 6, 2003, 6:38:02 AM
CBFalconer <cbfal...@yahoo.com> wrote in message news:<3EDFF065...@yahoo.com>...

AFAIK, the only purpose of having an offset, like 5381 and
2166136261 above, is to map all-zeros of different lengths to
different values. The specific termination condition != '\0'
above of course ensures that no used character value c is zero
so there can be no all-zeros of different lengths. But for a
general hash function you might want to use an offset as
Jenkins says.

> What needs to be looked out for is strings of
> blanks, which are system dependant.

You don't need to look out for strings of all blanks since it's
guaranteed that blanks differ from 0.

>
> A ccitcrc16 might be appropriate, though.

What does 'ccitcrc16' mean?


Daniel Vallstrom

Alex

Jun 6, 2003, 1:22:57 PM
CBFalconer <cbfal...@yahoo.com> wrote:
> Bob Jenkins wrote:
>> Chris Torek <nos...@elf.eng.bsdi.com> wrote in message
>> > Dann Corbit <dco...@connx.com> writes:
>> > >"CBFalconer" <cbfal...@yahoo.com> wrote in message
>> > >

[snip]

> A ccitcrc16 might be appropriate, though. You need either

ITYM: CCITT CRC16.

Alex

Kamilche

Jun 12, 2003, 5:42:10 PM
"Dann Corbit" <dco...@connx.com> wrote in message news:<bbjgs...@enews4.newsguy.com>...

> 33 is better. Nobody knows why.

Isn't it interesting that it's one larger than 32, which is the
number of bits in an integer? If the integer size of the machine were
to change to 64 bits, perhaps 65 would be the 'new magic number.'

Dann Corbit

Jun 13, 2003, 2:07:46 AM
"Kamilche" <klac...@home.com> wrote in message
news:889cbba0.03061...@posting.google.com...

65 = 5*13, so it would probably not be such a good one.
67 might be good, though.

65599 works well (65536+63).
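
65599 is the multiplier used in the widely circulated sdbm string
hash, usually written with shifts (reproduced from memory of the
public-domain form, so verify before use):

/* The shift form is equivalent to h * 65599 + c,
   since 65599 = 65536 + 64 - 1. */
unsigned long sdbm_hash(const char *s)
{
    unsigned long h = 0;
    unsigned char c;

    while ((c = (unsigned char)*s++) != '\0')
        h = c + (h << 6) + (h << 16) - h;
    return h;
}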

Walter Faxon

Jun 13, 2003, 5:27:05 AM
bob_j...@burtleburtle.net (Bob Jenkins) wrote in message news:<a5d787df.03060...@posting.google.com>...


Bob, why are you hiding your light under a bushel? You published a
fast, good, open-source string hashing function years ago. I refer to
your "Algorithm Alley" article in Dr. Dobbs Journal #269, Sept. 1997,
pp. 107-109 & 115-116. Strings were processed four bytes at a time, a
lot faster than the methods that read the next byte then process it
into the hash value, etc.

Your website from that time is gone, but I'm sure you have the code
somewhere. Why not republish it in this thread?

-- Walter

Dann Corbit

Jun 13, 2003, 9:19:01 PM

Bob Jenkins

Jun 20, 2003, 5:14:53 PM
wfa...@gis.net (Walter Faxon) wrote in message news:<57958c65.03061...@posting.google.com>...

> bob_j...@burtleburtle.net (Bob Jenkins) wrote in message news:<a5d787df.03060...@posting.google.com>...
> > Slightly more complicated hashes:
> >
> > for (h = 5381; (c = *s) != '\0'; s++)
> > h = h * 33 ^ c; /* Swapping He, ID won't always collide */
> >
> > for (h = 5381; (c = *s) != '\0'; s++) {
> > h += c;
> > h += (h<<26)+(h>>6); /* Rotates are fast on Intel */
> > }
> >
> > for (h = 5381; (c = *s) != '\0'; s++) {
> > h += c;
> > h += (h<<10);
> > h ^= (h>>6); /* one-at-a-time, minus postprocessing */
> > }
> >
> > for (h = 2166136261; (c = *s) != '\0'; s++)
> > h = (h * 16777619) ^ c; /* FNV without postprocessing */
> >
> > The 5381, or something nonzero, causes all-zero strings of different
> > lengths to have different hash values. All-zero strings are common in
> > some applications.
>
> Your website from that time is gone, but I'm sure you have the code
> somewhere. Why not republish it in this thread?
>
> -- Walter

http://burtleburtle.net/bob/hash/doobs.html
http://burtleburtle.net/bob/c/lookup2.c

Unfortunately, there's no hash that's best in all circumstances. The
one in lookup2.c does a good job of mixing any class of keys and is
near optimal speed for relatively short strings (8 to 1000 bytes).
Above 1000 bytes, Rogaway has published good hashes that are much
faster. Below 8 bytes, one character at a time is usually faster. On
Intel, FNV is 3x faster than lookup2.c on 4 byte strings. On Sun,
lookup2.c is 3x faster than FNV on 100 byte strings. For 8-byte hash
values you're probably best with MD5, especially on an 8-byte
platform. If you can inline the hash in the loop that first walks the
string looking for '\0', hashes that do one character at a time (like
FNV, or one-at-a-time in doobs.html) are faster than lookup2.c because the
loop overhead is already paid for. For chess programs, Zobrist
hashing can't be beat. Cryptographic hashes are a whole other ball of
wax. And on and on.

The application at hand works on short strings and is inlining the
hash in the loop that looks for '\0'. The hashes I listed are good
things to try in that circumstance. They will probably be faster than
my hash in lookup2.c because the loop overhead is already paid for.
The remaining question is to balance speed against thoroughness of the
hash. The application might even have a special domain where certain
weak hashes could do better than random on average.
