Hashing strings

5 views
Skip to first unread message

Wailer at the Gates of Dawn

unread,
Mar 11, 1992, 8:28:44 PM3/11/92
to

Anyone have any favorite string hashing functions? I've pulled
them from gcc and some other utilities but I'd love to see some more.
Following the charter, pseudo-code is fine. Thanks.

--
The Wailer at the Gates of Dawn | ban...@cats.UCSC.EDU |
Just who ARE you calling a FROOFROO Head? | |
DoD#0667 "Just a friend of the beast." | ban...@ucscb.UCSC.EDU |
2,3,5,7,13,17,19,31,61,89,107,127,521,607....| ban...@ucscb.BITNET |

Raghu Hudli

unread,
Mar 12, 1992, 9:33:30 AM3/12/92
to
In article <30...@darkstar.ucsc.edu>, ban...@cats.ucsc.edu (Wailer at


A recent paper has a catalog of hashing functions.

The reference for the paper is ....

B.J. McKenzie, R.Harries, and T. Bell, "Selecting a Hashing
Algorithm", Software - Practice and Experience Vol. 20(2),
Feb. 1990, pp 209-224

and the C functions that implement the hashing functions given in the paper
follow .....

/**************** Cut here *************************/

unsigned int BITS(unsigned int h ,unsigned int n ); /* return the least
significant n bits of h */


unsigned int hasheth(const char *key) /* ETH Modula-2 Cross Compiler */
/* Recommended table length = 1699 */

{
int tableLength = 1699;
register unsigned int h=0;
register const char *c=key;

while (*c) h = BITS(*c++,8)*(h % 257)+1;
return(h%tableLength);
}


unsigned int hashgnuccpp(const char *key) /* GNU C preprocessor */
/* Recomended table length=1403 */
{
int tableLength = 1403;
register unsigned int h=0;
register const char *c=key;

while (*c) h = 4*h + *c++;
return(BITS(h,31)%tableLength);
}

unsigned int hashgnucc1(const char *key) /* GNU C compiler front end */
/* Recomended table length=1008 */
{
int tableLength = 1008;
register unsigned int h=0;
register const char *c=key;

while (*c) h = 613*h + *c++;
return(BITS(h,30)%tableLength);
}

unsigned int hashpcc(const char *key) /* Portable C Compiler */
/* Recomended table length=1013 */
{

int tableLength = 1013;
register int h=0;
register const char *c=key;

while (*c) h = 2*h + *c++;
return (((unsigned int)BITS(h,15))%tableLength);
}

unsigned int hashcpp(const char *key) /* Unix 4.3BSD C preprocessor based */
/* Recomended table length=2000 */
{
int tableLength=2000;
register int h=0;
register const char *c=key;
register unsigned int index;

while (*c) h = 2*h + *c++;

if ((index = h % 2000)>=0)
return (index%tableLength);
else
return (index %tableLength + 2000);
}

unsigned int hashcplusplus(const char *key) /*AT&T C++ compiler based. */
/* Recomended table length=257 */
{
int tableLength = 257;
register unsigned int h=0;
register const char *c=key;

while (*c) h = 2*h + *c++;
return(h%tableLength);
}

unsigned int hashicon(const char *key) /* ICON translator */
/* Recommended table length = 257 */
{
int tableLength = 257;
register unsigned int h=0;
register const char *c=key;

while (*c) h += BITS(*c++,8);
return(h%tableLength);
}

unsigned int BITS(unsigned int h ,unsigned int n )
/* Return the least significant n */
/* bits in h. */
{
register int bits=0;
while(n--){
bits = bits << 1;
bits += 1;
}
return(bits&h);
}

/********** Cut here ******************/

Raghu
-----------------------------------------------------------------------
Raghu V. Hudli Tel: (914) 892-4188
IBM Corporation Fax: (914) 892-4316
E. Fishkill, NY email: ra...@vnet.ibm.com
------------------------------------------------------------------------
Disclaimer: The opinions/views expressed here are mine and my employer
does not necessarily concur with me.

John DiMarco

unread,
Mar 12, 1992, 2:59:40 PM3/12/92
to
ban...@cats.ucsc.edu (Wailer at the Gates of Dawn) writes:
> Anyone have any favorite string hashing functions? I've pulled
>them from gcc....

Anybody who pulls any code from gcc, any other piece of Gnu software,
or any other program covered by the Gnu public licence had better read
the licence document carefully. The GPL has a contagion clause which
essentially asserts that any software written using any piece of code
from a GPL-licenced program falls under the Gnu public licence itself.
This basically means that you have to give it away free, with source.
Not a good idea if you're in the business of selling software.

John
--
John DiMarco j...@db.toronto.edu/j...@db.utoronto.ca
University of Toronto, CSRI DBOIS group BITNET: jdd%db.toronto.edu@UTORGW
(416) 978-1928 UUCP: uunet!utai!db!jdd

Jack Radigan

unread,
Mar 12, 1992, 6:25:35 PM3/12/92
to
j...@db.toronto.edu (John DiMarco) writes:

>Anybody who pulls any code from gcc, any other piece of Gnu software,
>or any other program covered by the Gnu public licence had better read
>the licence document carefully. The GPL has a contagion clause which
>essentially asserts that any software written using any piece of code
>from a GPL-licenced program falls under the Gnu public licence itself.
>This basically means that you have to give it away free, with source.
>Not a good idea if you're in the business of selling software.

I know the simplistic answer is to create original work or to only use
published sources of algorithms like Sedgewick, Knuth et al. But, how can
I be sure of the exact origin of a piece of source that may have mutated
(algorithmically and copyright status) over the course of several contributing
authors? Sure, I can contact them to inquire about each item, but what if I
can't or they can't produce the exact origin of the source?

On a darker note, what about the possibility of "source code terrorism"?
Say a disgruntled ex-employee alters some code and posts it with a PD status
and I or someone else comes across it and uses it? Doesn't the legal tenant
of "ignorance is no excuse" still apply even though I document the origin of
the code used?

If I can draw from an existing body of work legally, I'd rather do that
instead of having to reinvent the wheel. But, from the sound of it, the
only real way to get any work done is to do it yourself so that you don't
have to waste time researching the origin of every piece of code you use.

-jack-

Geoff Collyer

unread,
Mar 12, 1992, 8:56:37 PM3/12/92
to
Raghu Hudli:

unsigned int BITS(unsigned int h ,unsigned int n )
/* Return the least significant n */
/* bits in h. */
{
register int bits=0;
while(n--){
bits = bits << 1;
bits += 1;
}
return(bits&h);
}

This seems like a lot of bother. Is there any good reason not to just use

/* return the least significant n bits of h (safe macro) */
#define BITS(h, n) ((unsigned long)(h) & (((unsigned long)1<<(n))-1))

(Yes, this doesn't do the right thing for n==(sizeof(long)*8), assuming
8-bit bytes. Does anyone care?)
--
Geoff Collyer world.std.com!geoff, uunet.uu.net!geoff

Mats Wichmann

unread,
Mar 12, 1992, 7:14:47 PM3/12/92
to
j...@db.toronto.edu (John DiMarco) writes:

>ban...@cats.ucsc.edu (Wailer at the Gates of Dawn) writes:
>> Anyone have any favorite string hashing functions? I've pulled
>>them from gcc....

>Anybody who pulls any code from gcc, any other piece of Gnu software,
>or any other program covered by the Gnu public licence had better read
>the licence document carefully. The GPL has a contagion clause which
>essentially asserts that any software written using any piece of code
>from a GPL-licenced program falls under the Gnu public licence itself.
>This basically means that you have to give it away free, with source.
>Not a good idea if you're in the business of selling software.

I just received a bellyful of flames from several people for disseminating
bad information on another group (hey, I didn't know it was bad info), so
I'm trying to be nicer here....

The above simply isn't true, and the new gnu license which is being
included with most of the newer revisions of programs has clarified
the situation. There are some caveats that come along with the gnu
license; best to refer to the gnu.* groups for information (although,
of course, they are rife with bad information like all other groups...).
--
Mats Wichmann
Systems Software Consultant
alruna!ma...@ossi.com (or ma...@netcom.com)

brian.g.beuning

unread,
Mar 13, 1992, 12:52:03 AM3/13/92
to
From article <30...@darkstar.ucsc.edu>, by ban...@cats.ucsc.edu (Wailer at the Gates of Dawn):

>
> Anyone have any favorite string hashing functions?

Many people like the shift and add style, the trouble is
for long strings they ignore part of the input string.

What you would really want is a rotate and add routine
so the bits that a shift would loose wrap around instead.
This leads to my favorite, which goes like this (in C)

int
hash( char * s )
{
union { char c[ 4 ];
int i;
} u;
u.i = 0;
while( 1 ) {
if( *s == '\0' ) break;
c[0] ^= *s++;
if( *s == '\0' ) break;
c[1] ^= *s++;
if( *s == '\0' ) break;
c[2] ^= *s++;
if( *s == '\0' ) break;
c[3] ^= *s++;
}
return i;
}

Brian Beuning

John D. Mitchell

unread,
Mar 13, 1992, 12:54:32 AM3/13/92
to
In article <26...@faatcrl.UUCP> jp...@faatcrl.UUCP (Jack Radigan) writes:
[...]

> On a darker note, what about the possibility of "source code terrorism"?
>Say a disgruntled ex-employee alters some code and posts it with a PD status
>and I or someone else comes across it and uses it? Doesn't the legal tenant
>of "ignorance is no excuse" still apply even though I document the origin of
>the code used?

I think your still hosed (but I'm no lawyer :-).

> If I can draw from an existing body of work legally, I'd rather do that
>instead of having to reinvent the wheel. But, from the sound of it, the
>only real way to get any work done is to do it yourself so that you don't
>have to waste time researching the origin of every piece of code you use.

For the most part this is true. The problem is that even if you develop
some new fangled neat-o, whiz-bang doohicky, it could have already been
patented by somebody else and you are completely out of luck. That's one
of the many pitfalls of the current patent system.

John

P.S. These are my opinions. Get your own.

Jack Radigan

unread,
Mar 13, 1992, 7:50:18 AM3/13/92
to
jo...@cory.Berkeley.EDU (John D. Mitchell) writes:

>For the most part this is true. The problem is that even if you develop
>some new fangled neat-o, whiz-bang doohicky, it could have already been
>patented by somebody else and you are completely out of luck. That's one
>of the many pitfalls of the current patent system.

That's something that really irks me. How can a system like the patent
process lend equal credence to something like the RSA PKA and an XOR'd cursor?

-jack-



Vadim Antonov

unread,
Mar 13, 1992, 12:08:34 PM3/13/92
to
ge...@world.std.com (Geoff Collyer) writes:

> /* return the least significant n bits of h (safe macro) */
> #define BITS(h, n) ((unsigned long)(h) & (((unsigned long)1<<(n))-1))

>(Yes, this doesn't do the right thing for n==(sizeof(long)*8), assuming
>8-bit bytes. Does anyone care?)

Hm. It does do the right thing on the most existant machines:

(i assume 32bit longs):

(1<<32)-1 == 0-1 == 0xffffffff (32 bits)

>Geoff Collyer world.std.com!geoff, uunet.uu.net!geoff

Vadim Antonov
Berkeley Software Design, Inc.

John DiMarco

unread,
Mar 13, 1992, 12:34:57 PM3/13/92
to
ma...@netcom.com (Mats Wichmann) writes:

>j...@db.toronto.edu (John DiMarco) writes:

>>ban...@cats.ucsc.edu (Wailer at the Gates of Dawn) writes:
>>> Anyone have any favorite string hashing functions? I've pulled
>>>them from gcc....

>>Anybody who pulls any code from gcc, any other piece of Gnu software,
>>or any other program covered by the Gnu public licence had better read
>>the licence document carefully. The GPL has a contagion clause which
>>essentially asserts that any software written using any piece of code
>>from a GPL-licenced program falls under the Gnu public licence itself.
>>This basically means that you have to give it away free, with source.
>>Not a good idea if you're in the business of selling software.

>The above simply isn't true, and the new gnu license which is being


>included with most of the newer revisions of programs has clarified
>the situation. There are some caveats that come along with the gnu
>license; best to refer to the gnu.* groups for information (although,
>of course, they are rife with bad information like all other groups...).

Permit me to quote from the GNU public licence as distributed with
GCC 2.0:

BEGIN QUOTE
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

a) ...
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) ...

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
END QUOTE

In simple english, this means:

- any work you publish that contains any part of a GPL-covered program
falls *as a whole* under the GPL itself.

John D. Mitchell

unread,
Mar 13, 1992, 1:20:53 PM3/13/92
to
In article <26...@faatcrl.UUCP> jp...@faatcrl.UUCP (Jack Radigan) writes:

If you want to learn more about this you can jump into the gnu.misc.discuss
newsgroups and/or join/get-information-from the League for Programming
Freedom. My quick take on the subject is that it's arguable whether
software should be patentable or not. The current system sucks mostly
because it works (like the courts) very much with precedence and relys on
the qualifications of the patent investigators. Well there aren't any
software engineering patent investigators. Therefore all of these patents
have been reviewed/accepted by people who have no clue about what's
"obvious", "original", etc.

John
jo...@cory.Berkeley.EDU

Stavros Macrakis

unread,
Mar 13, 1992, 12:51:55 PM3/13/92
to
In article <26...@faatcrl.UUCP> jp...@faatcrl.UUCP (Jack Radigan) writes:
Say a disgruntled ex-employee alters some code and posts it with a PD status
and I or someone else comes across it and uses it? Doesn't the legal tenant
of "ignorance is no excuse" still apply even though I document the origin of
the code used?

In article <1992Mar13.0...@pasteur.Berkeley.EDU> jo...@cory.Berkeley.EDU (John D. Mitchell) replies:

I think your still hosed (but I'm no lawyer :-).

Well, you certainly wouldn't be liable for any sort of criminal
penalty or punitive damages. But you would probably be liable for
actual damages. I am not a lawyer either, but it's very similar to
the following case: a friend gives you a gift which he stole (but you
had no way of knowing that). The rightful owner can probably recover
it from you, although you did nothing criminal.

...For the most part this is true. The problem is that even if you


develop some new fangled neat-o, whiz-bang doohicky, it could have
already been patented by somebody else and you are completely out
of luck. That's one of the many pitfalls of the current patent
system.

This is true for patents, but not for copyrights. Independently
developed code cannot infringe a copyright. (Although its execution
can if it, e.g., reproduces a copyrighted game.)

-s

Mats Wichmann

unread,
Mar 13, 1992, 7:31:00 PM3/13/92
to
In reply to:

>>Anybody who pulls any code from gcc, any other piece of Gnu software,
>>or any other program covered by the Gnu public licence had better read
>>the licence document carefully. The GPL has a contagion clause which
>>essentially asserts that any software written using any piece of code
>>from a GPL-licenced program falls under the Gnu public licence itself.
>>This basically means that you have to give it away free, with source.
>>Not a good idea if you're in the business of selling software.

I wrote:

>The above simply isn't true, and the new gnu license which is being
>included with most of the newer revisions of programs has clarified
>the situation.

And of course I'm wrong. I'm sure this will be too late to forestall
lots of notes to that effect, but hey, I blew it. Been a bad week
that way.

Dave Schaumann

unread,
Mar 13, 1992, 6:48:45 PM3/13/92
to
[this has become a C issue. followups directed to comp.lang.c]

In article <BL142...@world.std.com> ge...@world.std.com (Geoff Collyer) writes:
|Raghu Hudli:
| unsigned int BITS(unsigned int h ,unsigned int n )
|

| { register int bits=0; /* Return the least significant n */
| while(n--){ /* bits in h. */


| bits = bits << 1;
| bits += 1;
| }
| return(bits&h);
| }
|
|This seems like a lot of bother. Is there any good reason not to just use
|
| /* return the least significant n bits of h (safe macro) */
| #define BITS(h, n) ((unsigned long)(h) & (((unsigned long)1<<(n))-1))
|
|(Yes, this doesn't do the right thing for n==(sizeof(long)*8),

It also assumes 2's complement arithmetic.

|assuming 8-bit bytes. Does anyone care?)

Well, sure, considering that it's just as easy to get right:

#define ALL_1S ( ~((unsigned long) 0) ) /* added for clarity */
#define BITS(h, n) ( (unsigned long)(h) & ~(ALL_1S <<(n)) )

--
You're traveling through another dimension -- a dimension not only of sight and
sound but of mind. A journey into a wonderous land whose boundaries are that of
imagination. That's a signpost up ahead: your next stop -- the Twilight Zone

Dave Schaumann

unread,
Mar 13, 1992, 7:00:35 PM3/13/92
to
In article <1992Mar13.0...@cbnewsd.att.com> bg...@cbnewsd.att.com (brian.g.beuning) writes:
>From article <30...@darkstar.ucsc.edu>, by ban...@cats.ucsc.edu (Wailer at the Gates of Dawn):
>>
>> Anyone have any favorite string hashing functions?
>
>Many people like the shift and add style, the trouble is
>for long strings they ignore part of the input string.

I'd just like to add a reality check to all this. Not too long ago, I
wrote a program that required a hash table. So I used the shift & add
code, thinking I was Really Clever (tm). Just to be complete, I took
a look at the hash values I was getting. Surprise, surprise, the values
wern't very random at all. I tried again with just adding the bytes and
dropping the high bits, and I got a nice random distribution.

The moral of the story is, don't spend time with fancy hashing function
where a simple one will due just as well (and maybe better).

I would suspect that unless you want to use a table size larger than
the alphabet of your keys, or you are expecting a lot of very similar key
values, a simple byte add hash function will give you as much `randomness'
as you need.

[most of the code deleted]

> union { char c[ 4 ];
> int i;
> } u;

Say it with me: `all the world does *not* have 4 byte ints'. `The sizeof
operator is my *friend*.'

Geoff Collyer

unread,
Mar 13, 1992, 9:02:26 PM3/13/92
to
Dave Schaumann:

>> union { char c[ 4 ];
>> int i;
>> } u;

>Say it with me: `all the world does *not* have 4 byte ints'. `The sizeof
>operator is my *friend*.'

Ah, well if we're allowed to pick nits, there are a bunch of ``u.''s
missing; the code doesn't compile.

Silver

unread,
Mar 14, 1992, 5:58:32 AM3/14/92
to
> Anyone have any favorite string hashing functions?

Peter K. Pearson, "Fast hashing of Variable-Length Text Strings", ACM 6/1990.
He describes a neat twist to the standard rotating-xor algorithm. By pumping
intermediate byte values through a random byte permutation, values really start
distributing well. Plus, by fatootsing with the byte permutation, you can get
an almost-minimal perfect hash for free. If you want more info, send me mail.

Regards, [Ag] gay...@paul.rutgers.edu

Daniel Sleator

unread,
Mar 14, 1992, 10:08:12 AM3/14/92
to

My favorite hash function starts with a random table of integers, then
xors a subset of these together based on the string. It would be a
bad idea to simply xor the set specified by the string, since that
would give a commutative function of the letters of the string (useful
for computing anagrams perhaps). So, I use the bits of the hash
function computed so far to affect the choice of which element in the
random table to use. I simply add the hash function so far to the
current character modulo the size of the random table, and use that to
index into the table. The result is a fast and "extremely random"
function.

I'm probably not the first to point this out -- but I've never seen it
anywhere else.

Here is an ANSI-C implementation.

#define RTSIZE 256
/* size of random table for computing the
hash functions. must be a power of 2 */

unsigned int randtable[RTSIZE];

void init_hash_function(void) {
/* called once to initialize the random table */
int i;
srandom(10);
for (i=0; i<RTSIZE; i++) {
randtable[i] = random();
}
}

int hash(char * s) {
/* returns a "random" 32 bit integer hash function */
int i=0;
while(*s != '\0') {
i ^= randtable[(*s + i) & (RTSIZE-1)];
s++;
}
return i;
}

D. Sleator

Silver

unread,
Mar 14, 1992, 4:09:52 PM3/14/92
to
Some blathering idiot wrote:
> [Concerning Pearson's hashing algorithm...] Plus, by fatootsing with the

> byte permutation, you can get an almost-minimal perfect hash for free.

Eh, it's not almost-minimal. It's minimal. Uh duh.

Regards, [Ag]

Edward Reid

unread,
Mar 15, 1992, 9:41:11 AM3/15/92
to
In article <30...@darkstar.ucsc.edu> (comp.programming), ban...@cats.ucsc.edu (Wailer at the Gates of Dawn) writes:
> Anyone have any favorite string hashing functions? I've pulled
> them from gcc and some other utilities but I'd love to see some more.

I've had very good results with very simple functions: xor all bits of the
string (plus the length), breaking it into chunks as large as the largest
integer the target system can handle, then mod it by a prime. (*Never* omit
any bits of the string from the hash.)

I'm sure there are applications where this won't give satisfactory results.
For large databases and most internal tables, however, I've found that the
results are consistently a little more uniform than Poisson.

The systems I work with most often, can handle a 39-bit unsigned integer. This
has two advantages: first, chopping a string into 39-bit chunks rotates the
bytes without any extra effort; I would probably add a rotation otherwise.
Second, 2^39 is much larger than any of my target spaces, so any nonuniformity
introduced by the mod is undetectable. I'll decide on a target space size and
decrease it to a prime, to be used as the modulus. Again, it's not applicable
to all problems, but it works very well for a surprisingly wide variety.

Edward Reid (8-}>
eel: e...@titipu.meta.com or nosc.mil!titipu.meta.com!ed
snail: PO Box 378/Greensboro FL 32330

Preston Bannister

unread,
Mar 16, 1992, 3:05:08 PM3/16/92
to
From article <30...@darkstar.ucsc.edu>, by ban...@cats.ucsc.edu (Wailer at the Gates of Dawn):

>
> Anyone have any favorite string hashing functions? I've pulled
> them from gcc and some other utilities but I'd love to see some more.
> Following the charter, pseudo-code is fine. Thanks.

My favorite is from Communications of the ACM (see the reference below).

Basically you build a table T of the values 0 to 255 in random order.
(Nothing special about how you build the table). You generate the
hash by computing:

hash = T[hash ^ c]

for each character c in the string.

It is very fast to compute, and generates very different values for
similar strings.

I've used it for name lookup where the hash value is used as an index
to select a hash bucket, and then do a linear search of the bucket.
(If my problem size was bigger, then a binary tree in the bucket would
have been a better idea...). Since similar names tend to end up in
different buckets, the string compares in the searching the bucket
generally terminate after comparing only a character or two.

My library function follows...

================================================================

/* The string hash function (StringAsHash below) was taken from:
"Fast Hashing of Variable-Length Text Strings", Peter K. Pearson,
Communications of the ACM, June 1990. */

static unsigned char T[] = { 166, 231, 148, 61, 50, 131, 0, 57, 126,
223, 44, 245, 138, 251, 24, 113, 86, 215, 196, 173, 226, 115,
48, 169, 46, 207, 92, 101, 58, 235, 72, 225, 6, 199, 244, 29,
146, 99, 96, 25, 222, 191, 140, 213, 234, 219, 120, 81, 182,
183, 36, 141, 66, 83, 144, 137, 142, 175, 188, 69, 154, 203,
168, 193, 102, 167, 84, 253, 242, 67, 192, 249, 62, 159, 236,
181, 74, 187, 216, 49, 22, 151, 132, 109, 162, 51, 240, 105,
238, 143, 28, 37, 250, 171, 8, 161, 198, 135, 180, 221, 82,
35, 32, 217, 158, 127, 76, 149, 170, 155, 56, 17, 118, 119,
228, 77, 2, 19, 80, 73, 78, 111, 124, 5, 90, 139, 104, 129,
38, 103, 20, 189, 178, 3, 128, 185, 254, 95, 172, 117, 10,
123, 152, 241, 214, 87, 68, 45, 98, 243, 176, 41, 174, 79,
220, 229, 186, 107, 200, 97, 134, 71, 116, 157, 18, 227, 224,
153, 94, 63, 12, 85, 106, 91, 248, 209, 54, 55, 164, 13, 194,
211, 16, 9, 14, 47, 60, 197, 26, 75, 40, 65, 230, 39, 212,
125, 114, 195, 64, 121, 190, 31, 108, 53, 202, 59, 88, 177,
150, 23, 4, 237, 34, 179, 112, 233, 110, 15, 156, 165, 122,
43, 136, 33, 70, 7, 52, 93, 210, 163, 160, 89, 30, 255, 204,
21, 42, 27, 184, 145, 246, 247, 100, 205, 130, 147, 208, 201,
206, 239, 252, 133, 218, 11, 232, 1, 0
};


unsigned StringAsHash (s)
register unsigned char *s;
{
register unsigned h;
for (h=0;*s;++s) {
h = T[h ^ *s];
}
return h;
}
--
Preston L. Bannister
USENET: pre...@filenet.com
BIX: plb | CompuServe: 71350,3505 | GEnie: p.bannister

j.lee

unread,
Mar 16, 1992, 9:19:36 PM3/16/92
to
In <16...@fritz.filenet.com> pre...@felix.filenet.com (Preston Bannister) writes:

>Basically you build a table T of the values 0 to 255 in random order.
>(Nothing special about how you build the table). You generate the
>hash by computing:
> hash = T[hash ^ c]
>for each character c in the string.

This only generates values in the range 0..255 (or sizeof(T)/sizeof(*T))
hence it is not very general. Also, I'd question the comment that
there is "Nothing special about how you build the table", since obviously
the identity function (T[n]==n) make for a bad hash function.

A slight varient on this scheme though, is to use table driven CRC
calculation to compute a hash value (easily done up to the machine word
size). I've often wondered about this as a hashing scheme but never had
the time to prove/disprove its effectiveness. Does anyone know if this
would work? (After all, CRCs involve "polynomial division", so they
should "mix" the bits as well as a time-honoured division based hash.)

--
First there was Unix... Now there's AIX, AU/X, BSD, BSDI, Dynix, EP/IX, FTX,
Hurricane, HP-UX, Irix, Linux, Mach, Minix, Open Desktop, OSF/1, OSx, PC/IX,
Plan 9, Polyx, Posix, QNX, Risc/OS, Risc/ix, SCO Unix, Sinix, Solaris,
Sprite, SunOS, SVRx, Topaz, Tunis, Ultrix, Unicos, V, v10, Xenix, ..."

Phong Vo[dgb]

unread,
Mar 17, 1992, 12:07:38 AM3/17/92
to
Many people have proposed various hash functions from strings to 32-bit ints.
None offered any concrete evidence supporting their function
but the usual "it works for me". I thought it may be useful to have a more
concrete test. So, proposers, please take a few minutes to implement
your functions, suitably change the below code, run it and post the result.

By default, the code uses /usr/dict/words as the test dictionary.
If you want to change that, just redefine the line that #defines DICTIONARY.
Collisions are reported to stderr in case you want to redirect it elsewhere.
Seeing them flashes on the screen can be painful :-).

See if you can beat the below simple linear congruential hash function.
Have fun. Phong.

----cut here for testhash.c----------------------------------------------------
#include <stdio.h>
#ifndef DICTIONARY
#define DICTIONARY "/usr/dict/words"
#endif
#define ulong unsigned long
extern char* malloc();

/* Replace the hash_string function below with your favorite function,
** then compile it and run. Post or send me the answer. I'll summarize
** if there is enough interests.
**
** Phong Vo, k...@ulysses.att.com
*/

ulong hash_string(s)
char* s;
{
ulong hash = 0;
while(*s)
hash = hash*129 + (ulong)(*s++) + 987654321L;
return hash;
}


typedef struct _entry_ Entry_t;
struct _entry_
{
char* word;
ulong hash;
Entry_t *next;
};

#define TABSIZE ((2<<12)-1)
Entry_t *Table[TABSIZE];

main()
{
FILE* fp;
char word[1024];
ulong hash;
int collision;
Entry_t* ep;

if(freopen(DICTIONARY,"r",stdin) != stdin)
{ fprintf(stderr,"Can't open %s\n", DICTIONARY);
return -1;
}

collision = 0;
while(fgets(word,sizeof(word),stdin))
{
word[strlen(word)-1] = '\0';
hash = hash_string(word);

for(ep = Table[hash%TABSIZE]; ep; ep = ep->next)
{ if(ep->hash != hash || strcmp(word,ep->word) == 0)
continue;

fprintf(stderr,"%#0x: %s, %s\n", hash, word, ep->word);
collision += 1;
}

if(!(ep = (Entry_t*)malloc(sizeof(Entry_t))) ||
!(ep->word = malloc(strlen(word)+1)) )
{ fprintf(stderr,"Out of space\n");
break;
}

strcpy(ep->word,word);
ep->hash = hash;
ep->next = Table[hash%TABSIZE];
Table[hash%TABSIZE] = ep;
}

printf("# collisions = %d\n", collision);
return 0;
}

Phong Vo[dgb]

unread,
Mar 17, 1992, 7:47:28 AM3/17/92
to
In article <1992Mar17....@sobeco.com>, jl...@sobeco.com (j.lee) writes:
: In <16...@fritz.filenet.com> pre...@felix.filenet.com (Preston Bannister) writes:
:
: A slight varient on this scheme though, is to use table driven CRC

: calculation to compute a hash value (easily done up to the machine word
: size). I've often wondered about this as a hashing scheme but never had
: the time to prove/disprove its effectiveness. Does anyone know if this
: would work? (After all, CRCs involve "polynomial division", so they
: should "mix" the bits as well as a time-honoured division based hash.)
:
The crc checksum is also a very good hash value. I've tested this on
ethernet data with more than 400K packages. As I remember (this was a while back),
there were only under 20 collision pairs. This is slightly better than random
scattering. There is currently a proposal on the table for such a standard
checksum function for POSIX. Another good checksum is a linear congruential hash
function with some well-chosen multiplier and constant term.

Daniel Sleator

unread,
Mar 17, 1992, 1:29:56 PM3/17/92
to
Whoops....Don't bother to run Phong Vo's experiment on my hash
function.....It gives 45 collisions, while his gives none! There's
clearly a serious problem with my function, and I withdraw it pending
more thought.

On the other hand, suppose you want to use a hash table whose size is
a power of 2 (a natural thing to want to do because of the efficiency
of the mod operations). Also, suppose in your application, for some
reason, all the words happen to have the same number of characters.

To simulate this situation, I classified the words of /usr/dict/words
(all 24474 of them on my machine) by size. I hashed the words of each
size into a table of size 2^14 = 16384. Here are the number of
collisions incurred by Phong Vo's hash funciton and mine on this data:

collisions
size words Vo's Mine
1 26 0 0
2 91 0 0
3 759 105 12
4 2142 467 158
5 3097 561 292
6 3796 662 447
7 4045 731 510
8 3578 544 405
9 2970 369 295
10 1890 131 110
11 1072 55 34
12 547 17 17
13 275 5 3
14 111 1 0
15 41 0 0
16 17 0 0
17 8 0 0
18 5 0 0

Touche'

D. Sleator

Silver

unread,
Mar 17, 1992, 5:41:12 PM3/17/92
to
pre...@felix.filenet.com (Preston Bannister) writes:
> hash = T[hash ^ c]

jl...@sobeco.com (j.lee) writes:
> This only generates values in the range 0..255 (or sizeof(T)/sizeof(*T))
> hence it is not very general.

In his paper, Pearson describes running it `in parallel' for each byte,
starting with (uint8)(key[0] + 0), (uint8)(key[0] + 1), ..., until you have a
sufficient number of bytes in your hash value. So, a one-, two-, and four-byte
hash function would read (and this is off-the-cuff):

uint8 hash8(uint8 * key)
{uint8 h1;

if (!*key)
then {return(0);}
else {h1 = permutation[(uint8)(*key + 0)];
while (*++key)
{h1 = permutation[h1 ^ *key];}
return((h1<<0));}}

uint16 hash16(uint8 * key)
{uint8 h1;
uint8 h2;

if (!*key)
then {return(0);}
else {h1 = permutation[(uint8)(*key + 0)];
h2 = permutation[(uint8)(*key + 1)];
while (*++key)
{h1 = permutation[h1 ^ *key];
h2 = permutation[h2 ^ *key];}
return((h1<<0) | (h2<<8));}}

uint32 hash32(uint8 * key)
{uint8 h1;
uint8 h2;
uint8 h3;
uint8 h4;

if (!*key)
then {return(0);}
else {h1 = permutation[(uint8)(*key + 0)];
h2 = permutation[(uint8)(*key + 1)];
h3 = permutation[(uint8)(*key + 2)];
h4 = permutation[(uint8)(*key + 3)];
while (*++key)
{h1 = permutation[h1 ^ *key];
h2 = permutation[h2 ^ *key];
h3 = permutation[h3 ^ *key];
h4 = permutation[h4 ^ *key];}
return((h1<<0) | (h2<<8) | (h3<<16) | (h4<<24));}}

I think this looks just a tad wasteful, but he claims that it yields an
extremely good distribution in general. I'm wondering if this kind of
excessive table access could be eliminated by using a table of N consecutive
256-byte permutations, where every 1 ... Nth element also forms a permutation.
That is, for N = 4, elements 0, 4, ..., 1020 form a permutation, 1, 5, ...,
1021 form a permutation, ..., and 3, 7, ..., 1023 form a permutation. Hmmm...
Gotta scratch my head a little more on this rambling.

> Also, I'd question the comment that there is "Nothing special about how you
> build the table", since obviously the identity function (T[n]==n) make for a
> bad hash function.

Heh, that brings you right back to rotating xor. But this is hashing, mon,
we're throwing the dice knowing that the odds are with us. It works well for
the average permutation.

Regards, [Ag]

Dan Hoey

unread,
Mar 18, 1992, 6:47:17 PM3/18/92
to
ra...@vnet.ibm.com writes:

> A recent paper has a catalog of hashing functions.
>...
> B.J. McKenzie, R.Harries, and T. Bell, "Selecting a Hashing
> Algorithm", Software - Practice and Experience Vol. 20(2),
> Feb. 1990, pp 209-224
>...

I hope they note the drawbacks of the methods they list, where (on
machines with the popular sizeof(unsigned)==4):

gnucpp's hash function ignores all but the last 16 characters of
the string,

the hash functions of pcc, cpp, and c++ ignore all but the last 32
characters of the string, and

icon's hash function ignores the order of characters within a
string.

These problems aren't so harmful in C, since its paralyzed macro
processor (and in some implementations, few significant characters in
identifiers) militate against the long, uniform identifiers often
created by code-generating tools. But I've seen such code
incorporated into Lisp interpreters, where it can bring some programs
to their knees.

Don't do that.

Dan
Ho...@AIC.NRL.Navy.Mil

Reply all
Reply to author
Forward
0 new messages