hash function for mac needed

3316 views
Skip to first unread message

Dave Harris

unread,
Jun 16, 1991, 10:18:36 AM6/16/91
to

>From: bkot...@falcon.aamrl.wpafb.af.mil (Brett Kottmann)
>Date: 10 Jun 91 21:08:47 GMT
>Organization: Logicon Technical Services, Inc.
>Message-ID: <1991Jun10....@falcon.aamrl.wpafb.af.mil>
>Newsgroups: comp.lang.c


> Are there C implementations of hash functions lying around out
>there?
>I need a hash function that will take a string (up to 255 chars + null)
>and
>make a unique key for it. It should hash to the same key for the same
>string
>:).

When somebody compresses 255 bytes down to 2 or 4 let me know :). "Unique"
isn't exactly the word you want I don't think. The simple thing to do is to
add all the characters up then mod by the size of your hash, then use that
number to index into an array. If the spot is used, either link list off the
array location or skip cirularly in the array until you hit the next unused
location. You don't necessarly need to add all 255 characters together.
Every 5th, 10th or whatever may work just as well, depending on the data.
Simply keep a counter of the number of seeks required to find your hash
locations so you can determine which hash scheme works with the fewest seeks
for your particular data.

To anybody else. If I have n data elements, with a hash of size x, what is
considered a good number of searches based on n and x (i.e. ln(x/n)). Thanks

Dave Harris.


--
Uucp: ...{gatech,ames,rutgers}!ncar!asuvax!stjhmc!15!14!Dave.Harris
Internet: Dave....@f14.n15.z1.fidonet.org

Chris Torek

unread,
Jun 18, 1991, 3:37:44 AM6/18/91
to
In article <14207.2...@stjhmc.fidonet.org>
Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:
>... The simple thing to do is to add all the characters up then mod by
>the size of your hash, then use that number to index into an array. ...

>You don't necessarly need to add all 255 characters together.

Indeed: as it turns out, while many people have spent much time looking
for `good' hashing functions, fewer seem to have noted that many
hashing systems spend more time computing hashes than searching. I
find that this is usually the case, and that a fast hash function is
better than a slower-but-better-distribution function. One good hash
function, useful in many applications, is this:

hash = 0;
for (each character)
hash = (hash * 33) + (the character);
hash_index = hash & ((some power of two) - 1);

In C, this might be written as

struct hashent *
lookup(table, string)
register struct hashtab *table;
char *string;
{
register unsigned hash = 0;
register char *p, c;
register struct hashent *hp;

for (p = string; (c = *p) != '\0'; p++)
hash = (hash << 5) + hash + c;
p = string;
for (hp = table->ht_entries[hash & table->ht_mask];
hp != NULL; hp = hp->h_next)
if (hp->h_hash == hash && strcmp(hp->h_key, p) == 0)
return (hp);
return (NULL);
}

>To anybody else. If I have n data elements, with a hash of size x, what is
>considered a good number of searches based on n and x (i.e. ln(x/n)).

If you have a perfectly uniform distribution, each chain or hash
sequence will have n/x elements; assuming the probability of any
element lookup is identical, this is the best that can be done. I do
not know what others consider `good', but a factor of 1.3 or so over
that seems reasonable to me (I cannot quite think why).

Typically neither assumption holds true. There are too many ways to
take advantage of this to characterize here. One thing that can be
done, however, is to use the above hash table scheme with expanding
tables. As the number of entries increases, the hash table size can be
doubled, and table->ht_mask adds one bit:

table->ht_mask = (table->ht_mask << 1) | 1;

Since the hash values of each <key,datum> pair are maintained, it is
easy to move the entries from their old hash slots to their new ones.
Doubling occurs O(log2 n) times if the doubling criterion is total
number of table entries; other doubling criteria, such as maximum chain
length, are possible but rarely worthwhile---one ends up spending more
time optimizing hashes than searching, just as with more complicated
hash functions. Trading space for time this way seems to work well,
and you can select the amount of space by selecting the doubling factor.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA Domain: to...@ee.lbl.gov

Richard Harter

unread,
Jun 19, 1991, 12:38:20 AM6/19/91
to
In article <14...@dog.ee.lbl.gov>, to...@elf.ee.lbl.gov (Chris Torek) writes:

> Indeed: as it turns out, while many people have spent much time looking
> for `good' hashing functions, fewer seem to have noted that many
> hashing systems spend more time computing hashes than searching. I
> find that this is usually the case, and that a fast hash function is
> better than a slower-but-better-distribution function. One good hash
> function, useful in many applications, is this:

> hash = 0;
> for (each character)
> hash = (hash * 33) + (the character);
> hash_index = hash & ((some power of two) - 1);

And it may well pay to be even more radical in trading speed for 'perfection'.
In one of my programs I found it quite profitable to only look at the last
ten characters of a string, summing all but the last two, and doing shift
and sum on the last two. E.g. if n is the string length

hash = 0;
cc = string + n;
switch (n>10 ? 10 : n) {
case 10: hash += (int)*--cc;
case 9: hash += (int)*--cc;
case 8: hash += (int)*--cc;
case 7: hash += (int)*--cc;
case 6: hash += (int)*--cc;
case 5: hash += (int)*--cc;
case 4: hash += (int)*--cc;
case 3: hash += (int)*--cc;
case 2: hash += ((int)*--cc)<<1;
case 1: hash += ((int)*--cc)<<3;
default: break;
}
hash &= 1023;

Admittedly this appears crude, but in practice the last ten bytes were
all that were really needed to distinguish most strings (in the application
of interest, of course.) Nor was it really needful to do anything more
than summing bytes. I should note that the string length was stored in
the table; in effect it served as a secondary key for the bucket search
loop. The speed up was quite sharp.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.

Henry Spencer

unread,
Jun 19, 1991, 12:19:59 PM6/19/91
to
In article <5...@smds.UUCP> r...@smds.UUCP (Richard Harter) writes:
>> ... a fast hash function is
>> better than a slower-but-better-distribution function...

>In one of my programs I found it quite profitable to only look at the last
>ten characters of a string...

>Admittedly this appears crude, but in practice the last ten bytes were
>all that were really needed...

In fact, when I originally did the hashed newsgroup-lookup code in C News
expire, a quick investigation did not reveal any hashing function which
had overall performance much better than simply using the length of the
string! (The newsgroup list has grown so much that this decision is in
need of revision, mind you.)
--
"We're thinking about upgrading from | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5." | he...@zoo.toronto.edu utzoo!henry

Dave Harris

unread,
Jun 21, 1991, 1:11:40 AM6/21/91
to
In a message of <Jun 18 20:29>, Chris Torek (1:114/15) writes:

>Indeed: as it turns out, while many people have spent much time looking
>for `good' hashing functions, fewer seem to have noted that many
>hashing systems spend more time computing hashes than searching. I
>find that this is usually the case, and that a fast hash function is
>better than a slower-but-better-distribution function. One good hash
>function, useful in many applications, is this:

> hash = 0;
> for (each character)
> hash = (hash * 33) + (the character);
> hash_index = hash & ((some power of two) - 1);

Is the *33 use to keep short sequences has key more random or does it really
help in the distribution of large string too?

Dan Bernstein

unread,
Jun 20, 1991, 7:53:42 PM6/20/91
to
In article <14...@dog.ee.lbl.gov> to...@elf.ee.lbl.gov (Chris Torek) writes:
> hash = (hash * 33) + (the character);

I see I've converted you from hash * 31 to hash * 33. :-)

[ on the typical distribution of hash values with chaining ]

It's worth remembering the power series for exp. If your load factor---
that is, the number N of stored elements divided by the number of hash
values---is x, then there will be about exp(-x)N zero-element chains,
exp(-x)Nx one-element chains, exp(-x)Nx^2/2 two-element, exp(-x)Nx^3/3!
three-element, and so on. This holds up quite well in practice, and lets
you judge accurately how often a branch will be taken or how far it's
worth unrolling a loop when you're working with hash tables.

---Dan

Rich Salz

unread,
Jun 21, 1991, 1:35:04 PM6/21/91
to
In <1991Jun19.1...@zoo.toronto.edu> he...@zoo.toronto.edu (Henry Spencer) writes:
|In fact, when I originally did the hashed newsgroup-lookup code in C News
|expire, a quick investigation did not reveal any hashing function which
|had overall performance much better than simply using the length of the
|string! (The newsgroup list has grown so much that this decision is in
|need of revision, mind you.)

I think you'll want to re-check. The distributions are real uneven -- there
are more than 40 newsgroups of length 10, for example, so you have to resolve
lots of conflicts, and if you don't put a secondary hash, that means lots of
strcmp's. Of course, the design is very dependant on the newsgroup that a
site receives and the distribution of articles within those sites.

I use Chris Torek's hash function (amazingly good distribution and amazingly
cheap to calculate) with a table size of 128 and sort the conflicts based on
the highest article number. It seems to give the biggest win for the biggest
number of systems.
/r$
--
Please send comp.sources.unix-related mail to rs...@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

Chris Torek

unread,
Jun 21, 1991, 6:37:59 PM6/21/91
to
In an article whose referent was deleted by bogus fidonet software
(and whose tabs were changed to spaces), I provided this example
hashing function:

>> hash = 0;
>> for (each character)
>> hash = (hash * 33) + (the character);
>> hash_index = hash & ((some power of two) - 1);

In article <14491.2...@stjhmc.fidonet.org>


Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:
>Is the *33 use to keep short sequences has key more random or does it
>really help in the distribution of large string too?

I am not sure what you are asking here. There are basically three
questions you can ask about a hash function:

a) Is it fast enough? (complicated functions often take longer than
the searches they replace)

b) Does it produce a good distribution? (the hash function
h(input)=0 is very fast but produces horrible distributions)

c) If (b) is true, why does it work?

The first two can really only be answered in some specific context.
In all the contexts I have tried, this one answers `yes' to both
(which is why I recommend it).

The third question is the hardest. What *does* that 33 do? I have no
idea. All I know is that I have experimented with `easy' functions
(functions for which `fast enough' should generally be true) and this
one also produces a good distribution. I doubt I still have the mail,
but when Margo Seltzer was writing the database code to replace ndbm,
she tried a bunch of different functions on a bunch of different data.
This one usually worked well---it did not always give the best
distributions, but it never performed poorly; some functions that gave
better distributions on some data gave much worse on other.

Maybe there are some theory people out there who can explain it.
Probably some staring at Knuth vol. 3 would help.

Thomas Wang

unread,
Jun 18, 1991, 2:57:39 PM6/18/91
to

Read CACM June 1990, page 677 "Fast Hashing of Variable-Length Text Strings".


-Thomas Wang
(Everything is an object.)
wa...@hpdmsjlm.cup.hp.com
tho...@hpcupt1.cup.hp.com

Phong Vo[drew]

unread,
Jun 23, 1991, 3:03:19 PM6/23/91
to
Somewhere in this thread, Chris Torek proposed as a hash function:

> >> hash = 0;
> >> for (each character)
> >> hash = (hash * 33) + (the character);
> >> hash_index = hash & ((some power of two) - 1);
>
Chris then asked,
> ... What *does* that 33 do?

>
> Maybe there are some theory people out there who can explain it.
> Probably some staring at Knuth vol. 3 would help.

A good property for a hash function is if you keep feeding it
a constant character, it performs as if it's a random number generator.
The hash function f(h,c) = 33*h + c is derived from a class of random
number generators called linear congruential. At the start of Knuth Vol 2,
there is a good discussion of these generators. Let's apply some of the
theorems in that section and see what we get. Keep in mind that we are working
with random numbers mod 2^x where x is usually 16 or 32. Assuming that c is some
fixed odd value, Theorem A guarantees that 33 is good because (33-1)%4 == 0.
In the same reasoning, 31 is not good since (31-1)%4 == 2 != 0.
Here good means that if you keep feeding c, you'll cycle thru the entire
set of values [0,2^x). On the other hand, 33 is not a primitive element for 2^x
when x > 3 (Theorem C), so it isn't a good multiplicator if you keep feeding in 0's.
For example, 37 is a better value from that point of view. To summarize,
a good multiplicator is any number with 101 as their low order 3 bits.

At this point, we depart from theory. We typically hash byte strings and we want
the bits to mix in the 16-bit or 32-bit registers relatively fast. This suggests
that some value larger than 2^8 (but not too large) may do better as a multiplicator.
After a little experimentation, here's a pretty decent hash function:
hash = 0;
for(each byte c)
hash = (hash<<8) + (hash<<2) + hash + c;
This corresponds to the multiplicator 261 whose bit pattern is 100000101.
Try it, you'll like it.

Chris Torek

unread,
Jun 24, 1991, 5:15:58 AM6/24/91
to
In an article whose referent is long gone, I described this hash function:

for (each character)
hash = (hash * 33) + (the character);
hash_index = hash & ((some power of two) - 1);

and noted:
>>Maybe there are some theory people out there who can explain [why 33
>>works well].

In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:
>A good property for a hash function is if you keep feeding it
>a constant character, it performs as if it's a random number generator.
>The hash function f(h,c) = 33*h + c is derived from a class of random
>number generators called linear congruential. At the start of Knuth Vol 2,
>there is a good discussion of these generators.

... and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all
sorts of interesting stuff, such as a proof that the low-order n bits of
the usual rand() repeat with period $2^n$. (rand() uses a modulus of
either $2^15$ or $2^31$.)

>... Assuming that c is some fixed odd value, Theorem A guarantees that
>33 is good because (33-1)%4 == 0. ... Here good means that if you keep


>feeding c, you'll cycle thru the entire set of values [0,2^x).

This is approaching what I was getting at, but is not quite there.
Basically it says that 33 is `good' because it is `not bad'. An LCRNG
is defined as

h = ( a h + c ) mod m (where h is each a value of the
n+1 n n

variable `hash' in the for loop)

(in quote `>' above, $m = 2^x$). With a `bad' multiplier for $a$---and
what is `bad' depends on $c$ and $m$---the generator will not produce all
$m$ possible values; with a `good' one it will. Intuitively, then, the
`bad' one will miss some of the hash buckets entirely; the `good' one
will not.

But *why* is it true that `A good property is [that] it performs as if
it's a [good] random number generator'? This is now `obvious' (from the
above result) *if* we feed in constant values of $c$, i.e., if all the
strings are made of the same characters, but are different lengths. But
we rarely do that---the strings we hash are full of weird junk. How do
LCRNGs get in there? Why do the varying values of $c$ not mess things
up? (Of course, they do, but not enough to matter. Why not?)

>At this point, we depart from theory. We typically hash byte strings ...

Actually, I typically hash ASCII text. 33 is the `easiest' `good'
multiplier (as defined above) that is >= 26. I have no idea whether
this is significant, or merely coincidental.

>... This suggests that some value larger than 2^8 (but not too large)
>may do better as a multiplicator. ...


> hash = 0;
> for(each byte c)
> hash = (hash<<8) + (hash<<2) + hash + c;

I will see if I can get this one run on the same data as the `33 hash'.
This is more work than

hash = (hash << 5) + hash + c;

and distribution improvements may be outweighed by additional hash time
(or may not: who knows?).

Dan Bernstein

unread,
Jun 24, 1991, 11:57:34 PM6/24/91
to
In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:
> A good property for a hash function is if you keep feeding it
> a constant character, it performs as if it's a random number generator.

No, I don't think so. This is commonly true of good hash functions and
commonly false of bad ones, but keep in mind that pseudo-random
sequences have to keep working 100, 1000, 1000000 elements later, while
simple string hash functions are rarely applied to strings with more
than 80 characters.

[ as a generator mod 2^n, 37 is better than 33 is better than 31 ]

Again, I don't think this is accurate. Each number has a sufficiently
long period mod 2^n that you don't have to worry about repeats.

> We typically hash byte strings and we want
> the bits to mix in the 16-bit or 32-bit registers relatively fast.

[ so use a multipler of 261 ]

Again, practically any good multiplier works. I think you're worrying
about the fact that 31c + d doesn't cover any reasonable range of hash
values if c and d are between 0 and 255. That's why, when I discovered
the 33 hash function and started using it in my compressors, I started
with a hash value of 5381. I think you'll find that this does just as
well as a 261 multiplier.

---Dan

Henry Spencer

unread,
Jun 25, 1991, 1:21:33 AM6/25/91
to
In article <36...@litchi.bbn.com> rs...@bbn.com (Rich Salz) writes:
>|In fact, when I originally did the hashed newsgroup-lookup code in C News
>|expire, a quick investigation did not reveal any hashing function which
>|had overall performance much better than simply using the length of the
>|string! ...

>
>I think you'll want to re-check. The distributions are real uneven -- there
>are more than 40 newsgroups of length 10, for example, so you have to resolve
>lots of conflicts, and if you don't put a secondary hash, that means lots of
>strcmp's...

Strcmps cost much less if you prefix them with an inline check of the
first character. This strategy isn't as effective with newsgroups (which
tend to have common prefixes) as with random strings, but it still helps
quite a bit. And *as I mentioned*, the above-mentioned results are out
of date; the hash chains were typically 2-3 long back then.

Phong Vo[drew]

unread,
Jun 25, 1991, 5:27:14 PM6/25/91
to
> Again, practically any good multiplier works. I think you're worrying
> about the fact that 31c + d doesn't cover any reasonable range of hash
> values if c and d are between 0 and 255. That's why, when I discovered
> the 33 hash function and started using it in my compressors, I started
> with a hash value of 5381. I think you'll find that this does just as
> well as a 261 multiplier.
>
> ---Dan

261 is just a number that was picked out of a hat that satisfies
the conditions outlined in the theorems in Knuth. These conditions are
pretty minimal so yes there are lots and lots of numbers that would do
just as well. For compressors, if you work with binary files, it isn't
infrequent to have strings of 0's. In such a case, you'll want the multiplier
to have the nice properties of a RNG. For this reason, 261 type of numbers
may be a little better than 33 (in either case, don`t forget to start the
hash sum at some odd value!).

For everyone's entertainment, I did a simple (and admittedly contrived) test
of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
This is done by treating each number which is stored in a word as a
string of 4 bytes. With 261, 927300 of these numbers are
in their own buckets, the rest are in 36350 buckets each with 2 elements.
This is almost as good as a perfect hash function. The result for 33 is
much worse. The largest bucket has 64 elements in it and there are 4050
such buckets. The key difference is the 8th bit in 261 which causes the
mixing to go a little faster.

Dan Bernstein

unread,
Jun 26, 1991, 4:17:50 AM6/26/91
to
In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:
> For everyone's entertainment, I did a simple (and admittedly contrived) test
> of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
> This is done by treating each number which is stored in a word as a
> string of 4 bytes.
[ results supposedly showing 261 much better than 33 ]

Sorry, but the mod has to be a power of 2. Also, your test is remarkably
contrived: we're talking about string hash functions, and typical
strings (a) are concentrated on a few character values, not spread
evenly over the entire character set; (b) do not all have the same high
byte; (c) do not all have the same length.

Surely you noticed that a multiplier of 256 will produce perfect hashing
in this case. Does that mean you're recommending a multiplier of 256?

The profiling statistics I've saved from various compressors form a much
more realistic sample. They show 33 as slightly better than random
hashing on typical files, 37 as slightly worse. I've never tried 261 or
31, but I'll bet 261 does worse than 33 on text files.

---Dan

Dik T. Winter

unread,
Jun 26, 1991, 9:25:56 AM6/26/91
to
In article <17918.Jun2...@kramden.acf.nyu.edu> brn...@kramden.acf.nyu.edu (Dan Bernstein) writes:
> In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:
> > For everyone's entertainment, I did a simple (and admittedly contrived) test
> > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
> > This is done by treating each number which is stored in a word as a
> > string of 4 bytes.
>
> Sorry, but the mod has to be a power of 2.
Why? (If you are telling me for speed I'll scream.)

>
> Surely you noticed that a multiplier of 256 will produce perfect hashing
> in this case. Does that mean you're recommending a multiplier of 256?
Only on big-endian machines of course :-)

>
> The profiling statistics I've saved from various compressors form a much
> more realistic sample. They show 33 as slightly better than random
> hashing on typical files, 37 as slightly worse. I've never tried 261 or
> 31, but I'll bet 261 does worse than 33 on text files.
>
I did some simple tests, hashing all entries of /usr/dict/words, counting
how many entries fall in each bucket. I did this with the four multipliers
Dan just mentioned. I did this for 16 buckets to 8192 buckets (only powers
of two done). I calculated also the standard deviations involved. I will
not bore you with all those digits, the result can be stated simply: it does
not matter very much what multiplier you use. For each multiplier there
is a number of buckets where that multiplier has the smallest standard
deviation. So use whatever you like.
--
dik t. winter, cwi, amsterdam, nederland
d...@cwi.nl

Martin Golding

unread,
Jun 26, 1991, 4:11:34 PM6/26/91
to
In <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:

>For everyone's entertainment, I did a simple (and admittedly contrived) test
>of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
>This is done by treating each number which is stored in a word as a
>string of 4 bytes. With 261, 927300 of these numbers are
>in their own buckets, the rest are in 36350 buckets each with 2 elements.
>This is almost as good as a perfect hash function. The result for 33 is
>much worse. The largest bucket has 64 elements in it and there are 4050
>such buckets. The key difference is the 8th bit in 261 which causes the
>mixing to go a little faster.

Your test is probably irrelevant, being contiguous sequential numbers.
The best hash in this case is
for( hash = (i = 0); i < len(numstr); hash = hash*10 + numstring[i++]);
hash = hash % numberofbuckets;

This will yield the best possible distribution for any number of hash
buckets. The proof is left to the reader.

The point is not that there is a better hash for sequential numbers, but
that the performance of _any_ hash function is highly dependent on your keys;
if you any idea of the values or kinds of values to expect you can choose
a better function. If you can _control_ the values of your keys you can
sometimes get a _much_ better function; viz the numbers above.

If all this is old and boring stuff, I apologise for interrupting you all
just to look like an idiot (flame me by email), and I now return you to
your regularly scheduled newsgroup.


Martin Golding | sync, sync, sync, sank ... sunk:
Dod #0236 | He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or mar...@adpplz.uucp

Martin Golding

unread,
Jun 26, 1991, 7:46:35 PM6/26/91
to
In <8...@adpplz.UUCP> mar...@adpplz.UUCP (Martin Golding) writes:

A bunch of nonsense, unless you realize I misread the previous post,
and was thinking of a string of ascii digits. Then it all becomes
clear, and I stand by everything I said, all of which was irrelevant.

Sorry.

Bruce.Evans

unread,
Jun 26, 1991, 7:27:24 AM6/26/91
to
In article <14...@dog.ee.lbl.gov> to...@elf.ee.lbl.gov (Chris Torek) writes:
>>> hash = 0;
>>> for (each character)
>>> hash = (hash * 33) + (the character);
>>> hash_index = hash & ((some power of two) - 1);
>
>In article <14491.2...@stjhmc.fidonet.org>
>Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:
>>Is the *33 use to keep short sequences has key more random or does it
>>really help in the distribution of large string too?
>
>The third question is the hardest. What *does* that 33 do? I have no
>idea. All I know is that I have experimented with `easy' functions

I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
avoid throwing away bits. Rotating the bits instead of shifting left would
probably be a better way to avoid this.

The following hash function worked well for me in an assembler. It implements
the ideas that hashing all the chars in the string is unnecessary (and maybe
even bad if too many bits get shifted out) and that the middle char is more
random than leading or trailing chars.

In the program this was extracted from, the string length was already known.
Avoiding looking at all the chars in the string is not such a good idea when
strlen has to do it anyway. Also, this probably does best with a lot of
short strings.
---
extern unsigned strlen(); /* #include <string.h> */

#define SPTSIZE 1024

extern struct sym_s *spt[SPTSIZE];

#define hconv(ch) ((unsigned char) (ch) - 0x41) /* better form for hashing */

struct sym_s **gethashptr(str)
register char *str;
{
register unsigned hashval;
register unsigned length;

/* Hash function is a weighted xor of 1 to 4 chars in the string.
* This seems to work better than looking at all the chars.
* It is important that the function be fast.
* The multiplication by MULTIPLIER should compile as a shift.
*/

#define MULTIPLIER (SPTSIZE / (1 << USEFUL_BITS_IN_ASCII))
#define USEFUL_BITS_IN_ASCII 6

length = strlen(str);
if (length <= 3)
{
if (length <= 2)
hashval = hconv(str[-1]) * MULTIPLIER;
else
hashval = hconv(str[-2]) * MULTIPLIER,
hashval ^= hconv(str[-1]);
}
else
hashval = hconv(str[-(length / 2)]) * MULTIPLIER,
hashval ^= hconv(str[-2]) << 2,
hashval ^= hconv(str[-1]);
return spt + (hashval ^ (hconv(str[0]) << 1)) % SPTSIZE;
}
---
--
Bruce Evans ev...@syd.dit.csiro.au

Chris Torek

unread,
Jun 27, 1991, 8:28:24 AM6/27/91
to
>In article <14...@dog.ee.lbl.gov> I wrote:
>>The [`why'] question is the hardest. What *does* that 33 do? I have no
>>idea. ...

In article <1991Jun26.1...@syd.dit.CSIRO.AU>


ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:
>I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
>The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
>mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
>avoid throwing away bits.

This answer just does not hold up: if this were the case, 65, 129,
etc., should work as well as 33. They do not. Why not?

>The following hash function worked well for me in an assembler. ...
> length = strlen(str); ... hashval = hconv(str[-1]) * MULTIPLIER;

At this point, str[-1] is unlikely to exist. The function seems to
expect a pointer to the last character, rather than the first.

Rich Salz

unread,
Jun 28, 1991, 1:07:34 PM6/28/91
to
I do test the first character, Henry -- I'm pretty good at taking code and ideas
from experts. The Winter 87 paper "News Need Not Be Slow" that you and Geoff
wrote was the first reference I've seen to how effective that can be.

I apologize if my "you'll want to recheck" made it sound like a smack on
the head. Of course your article said your data was known to be old. I
just meant to add a data point of my own (for what my points are worth)
agreeing that things are very different now.

We now return you to your standard "what does a||b do" discussion...

Brian Bliss

unread,
Jul 1, 1991, 12:52:04 PM7/1/91
to
In article <14...@dog.ee.lbl.gov>, to...@elf.ee.lbl.gov (Chris Torek) writes:
|> >In article <14...@dog.ee.lbl.gov> I wrote:
|> >>The [`why'] question is the hardest. What *does* that 33 do? I have no
|> >>idea. ...
|>
|> In article <1991Jun26.1...@syd.dit.CSIRO.AU>
|> ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:
|> >I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
|> >The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
|> >mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
|> >avoid throwing away bits.
|>
|> This answer just does not hold up: if this were the case, 65, 129,
|> etc., should work as well as 33. They do not. Why not?
|>

Just check to see if you are throwing away a bits, and if you are,
circular shift them back in on the right. i.e., the following worked
great for me when hashing C identifier names:

int hash_symbol (char *name, int max)
{
register int retval = 0;
register char *ch = name;
register int mask = max - 1;
register int nmask = ~mask;

for (; (*ch != '\0'); ch++) {
retval <<= 1;
retval += *ch;
if (retval & nmask) {
retval = (retval & mask) + 1;
}
}
return (retval);
}

here, I assume that max is a power of 2, so that mask = 0x[137f]fff...
whenever retval exceeds max, subtract max and increment. The idea
is easily adaptable to max being a non-power of two, or multiplying
by a number larger than two, at the cost of some speed.
Now that I look at it closer, I guess I could have and'ed with
max instead of nmask in the if-condition...

bb

Brian Bliss

unread,
Jul 1, 1991, 5:02:50 PM7/1/91
to
In article <1991Jul1.1...@csrd.uiuc.edu>, bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:
|>
|> int hash_symbol (char *name, int max)
|> {
|> register int retval = 0;
|> register char *ch = name;
|> register int mask = max - 1;
|> register int nmask = ~mask;
|>
|> for (; (*ch != '\0'); ch++) {
|> retval <<= 1;
|> retval += *ch;
|> if (retval & nmask) {
|> retval = (retval & mask) + 1;
|> }
|> }
|> return (retval);
|> }
|>
|> here, I assume that max is a power of 2, so that mask = 0x[137f]fff...
|> whenever retval exceeds max, subtract max and increment. The idea
|> is easily adaptable to max being a non-power of two, or multiplying
|> by a number larger than two, at the cost of some speed.
|> Now that I look at it closer, I guess I could have and'ed with
|> max instead of nmask in the if-condition...
|>

damit! I should have just posted the code I had that worked, and not
optimized it. You can't and with max instead of nmask, because
the left shift and the addition can make retval >= 2 * max,
and (retval & mask == 0). not only that, but

retval = (retval & mask) + 1;

need to be changed to

retval = (retval + 1) & mask;

I'll go crawl back into my hole.

bb

Bruce.Evans

unread,
Jul 3, 1991, 4:31:36 AM7/3/91
to
In article <14...@dog.ee.lbl.gov> to...@elf.ee.lbl.gov (Chris Torek) writes:
>>In article <14...@dog.ee.lbl.gov> I wrote:
>>>The [`why'] question is the hardest. What *does* that 33 do? I have no
>>>idea. ...
>
>In article <1991Jun26.1...@syd.dit.CSIRO.AU>
>ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:
>>I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
>>The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
>>mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
>>avoid throwing away bits.
>
>This answer just does not hold up: if this were the case, 65, 129,
>etc., should work as well as 33. They do not. Why not?

In my code the major part of the shift is selected according to the hash
table size. Clearly multiplying by (2**31 + 1) would be no good (unless
overflow is helpful :-). 65 might work better than 33 with a large table.
The reason that 33 works OK in practice is probably that it distinguishes
most (lower case) letters. I tried fine-tuning this idea with an auxiliary
table hconv[] based on letter frequencies. However the rule
hconv[x] = (x) - 'A' was faster in practice.

>>The following hash function worked well for me in an assembler. ...
>> length = strlen(str); ... hashval = hconv(str[-1]) * MULTIPLIER;
>
>At this point, str[-1] is unlikely to exist. The function seems to
>expect a pointer to the last character, rather than the first.

Oops. I must have edited too much while removing hair from the code. News
is way behind and my article has expired so I'm not sure how much was
wrong. For all of the negative offsets, replace 'str' by 'strend' where
strend = str + length.
--
Bruce Evans ev...@syd.dit.csiro.au

Reply all
Reply to author
Forward
0 new messages