hash function for text in C

cameron shelley

unread,

Oct 16, 1990, 2:49:51 PM10/16/90

to

(Greeting(time == morning? Goodmorning: (
time == afternoon? Goodafternoon:
Goodevening)
)
);

I have been (futily :<) experimenting with hash functions to
hash english words into a large array. No book I can find around
here gives the topic more than cursory discussion. Does anyone
out there know of a good, simple one? They must exist, and I'm
getting a little tired of mine... (er, function that is, I can't
afford another data structures book right now :).

I'll be happy to post mine if anyone needs a good laugh, btw.

Cam

--
Cameron Shelley | "Saw, n. A trite popular saying, or proverb.
cpsh...@violet.waterloo.edu| So called because it makes its way into a
Davis Centre Rm 2136 | wooden head."
Phone (519) 885-1211 x3390 | Ambrose Bierce

Scott C. Mac Phee

unread,

Oct 17, 1990, 8:47:59 AM10/17/90

to

In article <1990Oct16....@watdragon.waterloo.edu> cpsh...@violet.waterloo.edu (cameron shelley) writes:
>
> I have been (futily :<) experimenting with hash functions to
>hash english words into a large array. No book I can find around
>here gives the topic more than cursory discussion. Does anyone
>out there know of a good, simple one? They must exist, and I'm
>getting a little tired of mine... (er, function that is, I can't
>afford another data structures book right now :).

Here is some code I've been using. It is an adapted version of some
hashing algorithms I've seen pass through this newsgroup.

Hope it helps...

===========================================================================

/*
* ============================================================================
* HASH TABLE FUNCTIONS
* ============================================================================
*/

#define HASH_TABLE_SIZE 1000
static int eventhash[HASH_TABLE_SIZE] ;
static int hash_entries, /* number in table... */
collisions ; /* count of collisions */

int
init_hash_table()
{
register int i ;

hash_entries = collisions = 0 ;

for (i=0; i<HASH_TABLE_SIZE;) eventhash[i++] = 0 ;

}

/*
* calc_hash_value.c
*
* calculate a hash table location from the given name.
*
*/

int
calc_hash_value(name)
char *name ;
{
register int hash ;

for (hash=0; *name;)
hash = (hash<<2) ^ *name++ ;

return(abs(hash % HASH_TABLE_SIZE)) ;

}

--
===============================================================================
CONVEX Computer Corporation (CVX) Richardson, Texas
Scott C. Mac Phee (..uunet.uu.net!convex.com!macphee) 214-497-4772
===============================================================================

Scott C. Mac Phee

unread,

Oct 17, 1990, 9:06:25 AM10/17/90

to

In article <107...@convex.convex.com> mac...@convex.COM (Scott C. Mac Phee) writes:
>In article <1990Oct16....@watdragon.waterloo.edu> cpsh...@violet.waterloo.edu (cameron shelley) writes:
>>
>> I have been (futily :<) experimenting with hash functions to
>>hash english words into a large array. No book I can find around
>>here gives the topic more than cursory discussion. Does anyone
>>out there know of a good, simple one? They must exist, and I'm
>>getting a little tired of mine... (er, function that is, I can't
>>afford another data structures book right now :).
>
>Here is some code I've been using. It is an adapted version of some
>hashing algorithms I've seen pass through this newsgroup.
>
>Hope it helps...
>
>===========================================================================
>
>/*
> * ============================================================================
> * HASH TABLE FUNCTIONS
> * ============================================================================
> */
>
>#define HASH_TABLE_SIZE 1000
>static int eventhash[HASH_TABLE_SIZE] ;
>static int hash_entries, /* number in table... */
> collisions ; /* count of collisions */

Sorry... amended code...

the hash table is an array of pointers to a structure, not int.

ex:
struct event {
char * name ;
struct event * next ;
} eventhash[HASH_TABLE_SIZE] ;

If a collision occurs use the hashed entry as the head of a linked
list and walk the linked list until the end is found, then add
the entry at the end.

I pulled the code out and then added the declarations without looking
at the original code. Ten minutes later I realized I'd messed up.

Is today Monday, again?

Chris Torek

unread,

Oct 18, 1990, 12:50:35 AM10/18/90

to

The following implements expanding chained hash tables for strings and
numbers. This is unmodified from a special-purpose program (having to
do with reading a large number of files with name=id pairs and doing
various operations on them). Chain lengths are not monitored; instead,
the table is invariably expanded whenever it becomes 2/3 full. Entries
may be neither removed nor modified.

An essentially-unrelated function called `string' maintains a string
pool such that string(x) == string(y) iff strcmp(x,y)==0 (the program
typically stores each name several times, and often needs to test for
name equality, so this helps there).

: Run this shell script with "sh" not "csh"
PATH=/bin:/usr/bin:/usr/ucb:/etc:$PATH
export PATH
all=false
if [ x$1 = x-a ]; then
all=true
fi
echo Extracting hash.h
sed 's/^X//' <<'//go.sysin dd *' >hash.h
X/*
X * $Id: hash.h,v 1.1 90/09/24 23:58:38 chris Exp $
X *
X * Hash table entries keep track of name (or id) = value pairs.
X * Values are simply pointers. Hash tables themselves are named
X * (for debugging).
X */
X
Xstruct hashtab;
X
X/*
X * The `ins' functions return nonzero if the old value existed,
X * without changing the value.
X * The iterate functions calls a given function with name,value
X * or id,value pairs.
X */
Xstruct hashtab *ht_new(); /* given name, create new hash table */
Xchar *ht_nget(); /* given table and name, get value */
Xchar *ht_iget(); /* given table and ID, get value */
Xint ht_nins(); /* given table and name, insert new value */
Xint ht_iins(); /* given table and id, insert new value */
Xvoid ht_niterate(); /* given table and function, iterate */
Xvoid ht_iiterate(); /* given table and function, iterate */
X
X/*
X * Some things that ought not to be here, but are anyway.
X * goodnumber() takes a number of objects and a size and returns
X * a new number of objects, such that malloc(goodnumber(n,size)*size)
X * calls malloc with a `good' size value (resulting in less wasted
X * memory). emalloc is malloc with program-abort on out-of-memory.
X * string() makes a `read-only' copy of a string, reusing the previous
X * copy if any.
X */
Xint goodnumber(); /* given n & sizeof, return new n */
Xchar *emalloc(); /* malloc, exit on error */
Xchar *string(); /* make an `ideal' copy of a string */
//go.sysin dd *
if [ `wc -c < hash.h` != 1415 ]; then
made=false
echo error transmitting hash.h --
echo length should be 1415, not `wc -c < hash.h`
else
made=true
fi
if $all; then
echo ' Changing owner to bin'
chown bin hash.h
else
echo ' Original owner was bin'
fi
if $made; then
chmod 444 hash.h
echo -n ' '; ls -ld hash.h
fi
echo Extracting hash.c
sed 's/^X//' <<'//go.sysin dd *' >hash.c
X#ifndef lint
Xstatic char rcsid[] = "$Id: hash.c,v 1.2 90/10/02 13:32:23 chris Exp $";
X#endif
X
X/*
X * Hash table routines.
X */
X
X#include <stdio.h>
X#include <stdlib.h>
X#include <string.h>
X#include "hash.h"
X
X/*
X * Hash table entries keep track of name=value pairs.
X * The names may be numeric IDs instead (by having a null name).
X */
Xstruct hashent {
X struct hashent *h_next;/* next in chain */
X int h_hash; /* hash value or ID */
X char *h_name; /* name (null if from numeric ID) */
X char *h_value; /* value */
X};
X
Xstruct hashtab {
X int ht_size; /* size (power of 2) */
X int ht_mask; /* == ht_size - 1 */
X int ht_used; /* number of entries used */
X int ht_lim; /* when to expand */
X struct hashent **ht_tab;/* base of table */
X char ht_name[1]; /* table name; actually larger */
X};
X
Xextern char *progname;
X
Xchar *
Xemalloc(n)
X size_t n;
X{
X register char *p = malloc(n);
X
X if (p == NULL) {
X (void) fprintf(stderr, "%s: out of memory\n", progname);
X exit(1);
X }
X return (p);
X}
X
X/* round up to next multiple of y, where y is a power of 2 */
X#define ROUND(x, y) (((x) + (y) - 1) & ~((y) - 1))
X
X/* compute a `good' number of objects to allocate via malloc */
Xint
Xgoodnumber(n, s)
X int n;
X size_t s;
X{
X
X /* 16384 is a guess at a good page size for malloc */
X /* 32 is a guess at malloc's overhead */
X return ((int)((ROUND(n * s + 32, 16384) - 32) / s));
X}
X
X/*
X * Make a new hash table.
X */
Xstruct hashtab *
Xht_new(name)
X char *name;
X{
X register struct hashtab *ht;
X register struct hashent **h;
X register int n;
X
X ht = (struct hashtab *)emalloc(sizeof *ht + strlen(name));
X ht->ht_tab = h = (struct hashent **)emalloc(128 * sizeof *h);
X ht->ht_size = 128;
X ht->ht_mask = 127;
X for (n = 128; --n >= 0;)
X *h++ = NULL;
X ht->ht_used = 0;
X ht->ht_lim = (128 * 2) / 3;
X (void) strcpy(ht->ht_name, name);
X return (ht);
X}
X
X/*
X * Expand an existing hash table.
X */
Xstatic void
Xht_expand(ht)
X register struct hashtab *ht;
X{
X register int n = ht->ht_size * 2, i;
X register struct hashent *p, **h, **oldh, *q;
X
X h = (struct hashent **)emalloc(n * sizeof *h);
X for (i = n; --i >= 0;)
X *h++ = NULL;
X h -= n;
X oldh = ht->ht_tab;
X n--;
X for (i = ht->ht_size; --i >= 0;) {
X for (p = *oldh++; p != NULL; p = q) {
X q = p->h_next;
X p->h_next = h[p->h_hash & n];
X h[p->h_hash & n] = p;
X }
X }
X free((char *)ht->ht_tab);
X ht->ht_tab = h;
X ht->ht_mask = n;
X ht->ht_size = ++n;
X ht->ht_lim = (n * 2) / 3;
X}
X
X/*
X * Make a new hash entry. Its h_next will be NULL.
X */
Xstatic struct hashent *
Xnewhashent(hash, name, value)
X int hash;
X char *name, *value;
X{
X static struct hashent *hfree;
X register struct hashent *h;
X register int n, nalloc;
X
X if ((h = hfree) != NULL)
X hfree = h->h_next;
X else {
X nalloc = goodnumber(2, sizeof *h); /* need at least 2 */
X hfree = h = (struct hashent *)emalloc(nalloc * sizeof *h) + 1;
X for (n = nalloc - 2; --n >= 0; h++)
X h->h_next = h + 1;
X h->h_next = NULL;
X h -= nalloc - 1;
X }
X h->h_next = NULL;
X h->h_hash = hash;
X h->h_name = name;
X h->h_value = value;
X return (h);
X}
X
X#define HASH(str, h, p) \
X for (p = str, h = 0; *p;) h = (h << 5) - h + *p++
X
X/*
X * Look up a name=value.
X */
Xchar *
Xht_nget(ht, name)
X register struct hashtab *ht;
X char *name;
X{
X register char *p;
X register int hash;
X register struct hashent *h;
X
X HASH(name, hash, p);
X p = name;
X for (h = ht->ht_tab[hash & ht->ht_mask]; h != NULL; h = h->h_next)
X if (h->h_hash == hash && h->h_name != NULL &&
X strcmp(h->h_name, p) == 0)
X return (h->h_value);
X return (NULL);
X}
X
X/*
X * Look up an ID=value.
X */
Xchar *
Xht_iget(ht, id)
X register struct hashtab *ht;
X register int id;
X{
X register struct hashent *h;
X
X for (h = ht->ht_tab[id & ht->ht_mask]; h != NULL; h = h->h_next)
X if (h->h_hash == id && h->h_name == NULL)
X return (h->h_value);
X return (NULL);
X}
X
X/*
X * Insert (do not clobber) a name=value.
X * Return zero on success.
X */
Xint
Xht_nins(ht, name, value)
X register struct hashtab *ht;
X char *name, *value;
X{
X register char *p;
X register int hash;
X register struct hashent *h, **hp;
X
X HASH(name, hash, p);
X p = name;
X for (hp = &ht->ht_tab[hash & ht->ht_mask]; (h = *hp) != NULL;
X hp = &h->h_next)
X if (h->h_hash == hash && h->h_name != NULL &&
X strcmp(h->h_name, p) == 0)
X return (-1);
X *hp = newhashent(hash, name, value);
X if (++ht->ht_used > ht->ht_lim)
X ht_expand(ht);
X return (0);
X}
X
X/*
X * Insert (do not clobber) an ID=value.
X * Return zero on success.
X */
Xint
Xht_iins(ht, id, value)
X register struct hashtab *ht;
X register int id;
X char *value;
X{
X register struct hashent *h, **hp;
X
X for (hp = &ht->ht_tab[id & ht->ht_mask]; (h = *hp) != NULL;
X hp = &h->h_next)
X if (h->h_hash == id && h->h_name == NULL)
X return (-1);
X *hp = newhashent(id, (char *)NULL, value);
X if (++ht->ht_used > ht->ht_lim)
X ht_expand(ht);
X return (0);
X}
X
X/*
X * Stash a copy of a string away; it will never be freed.
X */
Xstatic char *
Xpoolstr(s)
X char *s;
X{
X register char *p;
X register size_t l = strlen(s) + 1;
X static char *poolp;
X static size_t nleft;
X
X if (nleft < l)
X poolp = emalloc(nleft = goodnumber(l, 1));
X bcopy(s, p = poolp, l);
X poolp += l;
X return (p);
X}
X
X/*
X * Generate a single unique copy of the given string.
X */
Xchar *
Xstring(s)
X char *s;
X{
X register char *p;
X register int hash;
X register struct hashent *h, **hp;
X static struct hashtab *ht;
X
X if (ht == NULL)
X ht = ht_new("strings");
X HASH(s, hash, p);
X p = s;
X for (hp = &ht->ht_tab[hash & ht->ht_mask]; (h = *hp) != NULL;
X hp = &h->h_next)
X if (h->h_hash == hash && strcmp(h->h_name, p) == 0)
X return (h->h_name);
X *hp = h = newhashent(hash, poolstr(s), (char *)NULL);
X if (++ht->ht_used > ht->ht_lim)
X ht_expand(ht);
X return (h->h_name);
X}
X
X/*
X * Call fn on all the name=value pairs.
X */
Xvoid
Xht_niterate(ht, fn)
X struct hashtab *ht;
X register void (*fn)();
X{
X register struct hashent *h, **hp;
X register int n;
X
X for (n = ht->ht_size, hp = ht->ht_tab; --n >= 0;)
X for (h = *hp++; h != NULL; h = h->h_next)
X if (h->h_name != NULL)
X (*fn)(h->h_name, h->h_value);
X}
X
X/*
X * Call fn on all the id=value pairs.
X */
Xvoid
Xht_iiterate(ht, fn)
X struct hashtab *ht;
X register void (*fn)();
X{
X register struct hashent *h, **hp;
X register int n;
X
X for (n = ht->ht_size, hp = ht->ht_tab; --n >= 0;)
X for (h = *hp++; h != NULL; h = h->h_next)
X if (h->h_name == NULL)
X (*fn)(h->h_hash, h->h_value);
X}
//go.sysin dd *
if [ `wc -c < hash.c` != 6291 ]; then
made=false
echo error transmitting hash.c --
echo length should be 6291, not `wc -c < hash.c`
else
made=true
fi
if $all; then
echo ' Changing owner to bin'
chown bin hash.c
else
echo ' Original owner was bin'
fi
if $made; then
chmod 444 hash.c
echo -n ' '; ls -ld hash.c
fi
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain: ch...@cs.umd.edu Path: uunet!mimsy!chris

Silver

unread,

Oct 18, 1990, 1:13:09 AM10/18/90

to

You must supply definitions of uint8 (an 8-bit unsigned int), uint16
(similarly), and uint32 (likewise).

----------------------------------- hash.h ------------------------------------
/* This is an implementation of Peter K. Pearson's hash algorithm as */
/* presented in "Fast Hashing of Variable-Length Text Strings", ACM 6/90. */
/* The conclusion of his article is cited below. */
/* */
/* The main advantages of the hashing function presented here are: */
/* 1. No restriction is placed on the length of the text string. */
/* 2. The length of the text string does not need to be known beforehand. */
/* 3. Very little arithmetic is performed on each character being hashed. */
/* 4. Similar strings are not likely to collide. */
/* 5. Minimal, perfect hashing functions can be built in this form. */
/* [6. It's average distribution is quite random.] */
/* */
/* Its principal disadvantages are: */
/* 1. Output value ranges that are not powers of 2 are somwehat more */
/* complicated to provide. */
/* 2. More auxiliary memory (the 256-byte table T) is required by this */
/* hashing function than by many traditional functions. */
/* [3. I blew 50 cents copying the article and then had to type it in.] */
/* */
/* Elaborating on advantage 5 above, by fatootsing with the permutation */
/* table T, one can make hash8 perform perfect minimal hashes. The author */
/* provides the following guidelines for construction of such a table: */
/* */
/* [C[N] are the characters of the string being hashed and h[N] is the Nth */
/* iteration of the loop.] */
/* */
/* 1. A table T was consructed by pseudorandom permutation of the integers */
/* (0 ... 255). */
/* 2. One by one, the desired values were assigned to the words in the */
/* list. Each assignment was effected by exchanging two elemnts in the */
/* table. */
/* 3. For each word, the first candidate considered for exchange was */
/* T[h[n-1] ^ C[n]], the last table element referenced in the */
/* computation of the hash function for that word. */
/* 4. A table element could not be exchanged if it was referenced during */
/* the hashing of a previously assgined word or if it was referenced */
/* earlier in the hashing of the same word. */
/* 5. If the necessary exchange was forbidden by Rule 4, attention was */
/* shifted to the previously referenced table element, */
/* T[[h[n-2]] ^ C[n-1]]. */
/* */
/* The procedure is not always successful. For example, using the ascii */
/* character codes, if the word "a" hashes to 0 and the word "i" hashes to */
/* 15, it turns out that the word "in" must hash to 0. Initial attempts */
/* to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly */
/* this reason. The shift to the range (1 ... 31) was an ad hoc tactic to */
/* circumvent this problem. */

/* Hash s into an 8, 16, or 32 bit unsigned integer. The larger-sized */
/* values are generated by applying hash8 in parallel. */
extern uint8 hash8(uint8 * s);
extern uint16 hash16(uint8 * s);
extern uint32 hash32(uint8 * s);

/* The hash permutation table. */
extern uint32 * hashpermutation;
----------------------------------- hash.c ------------------------------------
#include "hash.h"

static uint8 permutation[256] =
{
51, 140, 233, 27, 118, 125, 170, 138, 119, 132, 174, 97, 25, 110, 1, 14,
65, 36, 40, 188, 73, 173, 7, 30, 68, 56, 169, 234, 107, 177, 197, 87,
28, 210, 186, 67, 2, 15, 115, 48, 223, 148, 211, 57, 190, 104, 213, 49,
144, 172, 147, 124, 157, 238, 167, 183, 78, 75, 58, 22, 70, 103, 181, 12,
254, 41, 198, 168, 46, 79, 241, 156, 83, 128, 66, 60, 86, 141, 161, 176,
221, 54, 192, 252, 116, 95, 206, 35, 88, 133, 154, 250, 237, 253, 85, 178,
93, 159, 155, 42, 9, 89, 3, 61, 201, 158, 106, 82, 240, 255, 218, 102,
189, 8, 33, 4, 145, 16, 150, 26, 99, 100, 195, 175, 34, 50, 80, 166,
194, 195, 164, 29, 134, 105, 55, 143, 122, 130, 245, 208, 72, 77, 64, 121,
139, 232, 191, 108, 228, 137, 59, 74, 11, 126, 171, 5, 242, 101, 239, 193,
112, 113, 98, 21, 207, 225, 151, 251, 92, 91, 17, 127, 20, 81, 24, 6,
43, 196, 204, 247, 212, 224, 220, 94, 32, 13, 187, 199, 214, 18, 226, 84,
71, 231, 165, 19, 202, 217, 90, 129, 136, 153, 182, 111, 244, 45, 236, 249,
109, 47, 180, 205, 215, 160, 53, 162, 114, 246, 179, 62, 227, 96, 142, 230,
184, 146, 117, 39, 69, 37, 23, 63, 52, 216, 0, 135, 149, 31, 38, 44,
209, 120, 76, 203, 229, 123, 131, 152, 10, 219, 243, 248, 235, 222, 200, 163
};

uint8 hash8(register uint8 * s)
{register static uint8 h;
if (*s)
{h = permutation[*s];
while (*++s)
h ^= permutation[h ^ *s];
return(h);}
else
{return((uint8)0);}

uint16 hash16(register uint8 * s);
{register uint8 h1;
register uint8 h2;
if (*s)
{h1 = permutation[*s];
h2 = permutation[*s + 1];
while (*++s)
{h1 ^= permutation[h1 ^ *s];
h2 ^= permutation[h2 ^ *s];}
return((uint16)((h1<<8) & h2));}
else
{return((uint16)0);}}

uint32 hash32(register uint8 * s);
{register uint8 h1;
register uint8 h2;
register uint8 h3;
register uint8 h4;
if (*s)
{h1 = permutation[*s]
h2 = permutation[*s + 1];
h2 = permutation[*s + 2];
h2 = permutation[*s + 3];
while (*++s)
{h1 ^= permutation[h1 ^ *s];
h2 ^= permutation[h2 ^ *s];
h3 ^= permutation[h3 ^ *s];
h4 ^= permutation[h4 ^ *s];}
return((uint32)((h1<<24) & (h2<<16) & (h3<<8) & h4));}
else
{return((uint32)0);}}

Ron Irvine

unread,

Oct 19, 1990, 4:54:08 PM10/19/90

to

I have used many hash functions. Each one has advantages
and disadvantages. Here is a powerful but simple one.
It simply does a crc16 on the character string. The crc16 is
a great hash function since it is designed to produce a
unique code for different character strings. So for "reasonable"
text it gives a good distribution of hash numbers (a big problem
to solve in general). The crc16 function I list below uses nibbles
to build the crc from each byte, this is faster than a bit by bit and
less room than a 256 entry table needed for a byte by byte (fits
in cache). So, it is fast (no multiply, just shifts and masks),
well behaved and simple. If you have a VAX you could even use the
"crc" instruction instead of the crc16 function - what a CISC.
To use, do a crc16 on the string and the truncate the 16 bit
result to the size you need (this should be a power of 2).

/* crc16 - crc16 routine
*
* R.K. Irvine
*
* This routine returns the crc16 for the string pointed
* to by "in".
* crc16 is given by: x^16 + x^15 + x^2 + 1
*/
unsigned short
crc16(register char *in) {
register unsigned int n, crc;

static unsigned short crc16l[] = {
0x0000,0xc0c1,0xc181,0x0140,
0xc301,0x03c0,0x0280,0xc241,
0xc601,0x06c0,0x0780,0xc741,
0x0500,0xc5c1,0xc481,0x0440,
};

static unsigned short crc16h[] = {
0x0000,0xcc01,0xd801,0x1400,
0xf001,0x3c00,0x2800,0xe401,
0xa001,0x6c00,0x7800,0xb401,
0x5000,0x9c01,0x8801,0x4400,
};

crc = 0;
while (n = (unsigned char)(*in++)) {
n ^= crc;
crc = crc16l[n&0x0f] ^ crc16h[(n>>4)&0x0f] ^ (crc>>8);
}
return(crc);
}