3492 views

Skip to first unread message

Jun 16, 1991, 10:18:36 AM6/16/91

to

>From: bkot...@falcon.aamrl.wpafb.af.mil (Brett Kottmann)

>Date: 10 Jun 91 21:08:47 GMT

>Organization: Logicon Technical Services, Inc.

>Message-ID: <1991Jun10....@falcon.aamrl.wpafb.af.mil>

>Newsgroups: comp.lang.c

> Are there C implementations of hash functions lying around out

>there?

>I need a hash function that will take a string (up to 255 chars + null)

>and

>make a unique key for it. It should hash to the same key for the same

>string

>:).

When somebody compresses 255 bytes down to 2 or 4 let me know :). "Unique"

isn't exactly the word you want I don't think. The simple thing to do is to

add all the characters up then mod by the size of your hash, then use that

number to index into an array. If the spot is used, either link list off the

array location or skip cirularly in the array until you hit the next unused

location. You don't necessarly need to add all 255 characters together.

Every 5th, 10th or whatever may work just as well, depending on the data.

Simply keep a counter of the number of seeks required to find your hash

locations so you can determine which hash scheme works with the fewest seeks

for your particular data.

To anybody else. If I have n data elements, with a hash of size x, what is

considered a good number of searches based on n and x (i.e. ln(x/n)). Thanks

Dave Harris.

--

Uucp: ...{gatech,ames,rutgers}!ncar!asuvax!stjhmc!15!14!Dave.Harris

Internet: Dave....@f14.n15.z1.fidonet.org

Jun 18, 1991, 3:37:44 AM6/18/91

to

In article <14207.2...@stjhmc.fidonet.org>

Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:

>... The simple thing to do is to add all the characters up then mod by

>the size of your hash, then use that number to index into an array. ...

>You don't necessarly need to add all 255 characters together.

Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:

>... The simple thing to do is to add all the characters up then mod by

>the size of your hash, then use that number to index into an array. ...

>You don't necessarly need to add all 255 characters together.

Indeed: as it turns out, while many people have spent much time looking

for `good' hashing functions, fewer seem to have noted that many

hashing systems spend more time computing hashes than searching. I

find that this is usually the case, and that a fast hash function is

better than a slower-but-better-distribution function. One good hash

function, useful in many applications, is this:

hash = 0;

for (each character)

hash = (hash * 33) + (the character);

hash_index = hash & ((some power of two) - 1);

In C, this might be written as

struct hashent *

lookup(table, string)

register struct hashtab *table;

char *string;

{

register unsigned hash = 0;

register char *p, c;

register struct hashent *hp;

for (p = string; (c = *p) != '\0'; p++)

hash = (hash << 5) + hash + c;

p = string;

for (hp = table->ht_entries[hash & table->ht_mask];

hp != NULL; hp = hp->h_next)

if (hp->h_hash == hash && strcmp(hp->h_key, p) == 0)

return (hp);

return (NULL);

}

>To anybody else. If I have n data elements, with a hash of size x, what is

>considered a good number of searches based on n and x (i.e. ln(x/n)).

If you have a perfectly uniform distribution, each chain or hash

sequence will have n/x elements; assuming the probability of any

element lookup is identical, this is the best that can be done. I do

not know what others consider `good', but a factor of 1.3 or so over

that seems reasonable to me (I cannot quite think why).

Typically neither assumption holds true. There are too many ways to

take advantage of this to characterize here. One thing that can be

done, however, is to use the above hash table scheme with expanding

tables. As the number of entries increases, the hash table size can be

doubled, and table->ht_mask adds one bit:

table->ht_mask = (table->ht_mask << 1) | 1;

Since the hash values of each <key,datum> pair are maintained, it is

easy to move the entries from their old hash slots to their new ones.

Doubling occurs O(log2 n) times if the doubling criterion is total

number of table entries; other doubling criteria, such as maximum chain

length, are possible but rarely worthwhile---one ends up spending more

time optimizing hashes than searching, just as with more complicated

hash functions. Trading space for time this way seems to work well,

and you can select the amount of space by selecting the doubling factor.

--

In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)

Berkeley, CA Domain: to...@ee.lbl.gov

Jun 19, 1991, 12:38:20 AM6/19/91

to

In article <14...@dog.ee.lbl.gov>, to...@elf.ee.lbl.gov (Chris Torek) writes:

> Indeed: as it turns out, while many people have spent much time looking

> for `good' hashing functions, fewer seem to have noted that many

> hashing systems spend more time computing hashes than searching. I

> find that this is usually the case, and that a fast hash function is

> better than a slower-but-better-distribution function. One good hash

> function, useful in many applications, is this:

> hash = 0;

> for (each character)

> hash = (hash * 33) + (the character);

> hash_index = hash & ((some power of two) - 1);

And it may well pay to be even more radical in trading speed for 'perfection'.

In one of my programs I found it quite profitable to only look at the last

ten characters of a string, summing all but the last two, and doing shift

and sum on the last two. E.g. if n is the string length

hash = 0;

cc = string + n;

switch (n>10 ? 10 : n) {

case 10: hash += (int)*--cc;

case 9: hash += (int)*--cc;

case 8: hash += (int)*--cc;

case 7: hash += (int)*--cc;

case 6: hash += (int)*--cc;

case 5: hash += (int)*--cc;

case 4: hash += (int)*--cc;

case 3: hash += (int)*--cc;

case 2: hash += ((int)*--cc)<<1;

case 1: hash += ((int)*--cc)<<3;

default: break;

}

hash &= 1023;

Admittedly this appears crude, but in practice the last ten bytes were

all that were really needed to distinguish most strings (in the application

of interest, of course.) Nor was it really needful to do anything more

than summing bytes. I should note that the string length was stored in

the table; in effect it served as a secondary key for the bucket search

loop. The speed up was quite sharp.

--

Richard Harter, Software Maintenance and Development Systems, Inc.

Net address: jjmhome!smds!rh Phone: 508-369-7398

US Mail: SMDS Inc., PO Box 555, Concord MA 01742

This sentence no verb. This sentence short. This signature done.

Jun 19, 1991, 12:19:59 PM6/19/91

to

In article <5...@smds.UUCP> r...@smds.UUCP (Richard Harter) writes:

>> ... a fast hash function is

>> better than a slower-but-better-distribution function...

>In one of my programs I found it quite profitable to only look at the last

>ten characters of a string...

>Admittedly this appears crude, but in practice the last ten bytes were

>all that were really needed...

>> ... a fast hash function is

>> better than a slower-but-better-distribution function...

>In one of my programs I found it quite profitable to only look at the last

>Admittedly this appears crude, but in practice the last ten bytes were

In fact, when I originally did the hashed newsgroup-lookup code in C News

expire, a quick investigation did not reveal any hashing function which

had overall performance much better than simply using the length of the

string! (The newsgroup list has grown so much that this decision is in

need of revision, mind you.)

--

"We're thinking about upgrading from | Henry Spencer @ U of Toronto Zoology

SunOS 4.1.1 to SunOS 3.5." | he...@zoo.toronto.edu utzoo!henry

Jun 21, 1991, 1:11:40 AM6/21/91

to

In a message of <Jun 18 20:29>, Chris Torek (1:114/15) writes:

>Indeed: as it turns out, while many people have spent much time looking

>for `good' hashing functions, fewer seem to have noted that many

>hashing systems spend more time computing hashes than searching. I

>find that this is usually the case, and that a fast hash function is

>better than a slower-but-better-distribution function. One good hash

>function, useful in many applications, is this:

> hash = 0;

> for (each character)

> hash = (hash * 33) + (the character);

> hash_index = hash & ((some power of two) - 1);

Is the *33 use to keep short sequences has key more random or does it really

help in the distribution of large string too?

Jun 20, 1991, 7:53:42 PM6/20/91

to

In article <14...@dog.ee.lbl.gov> to...@elf.ee.lbl.gov (Chris Torek) writes:

> hash = (hash * 33) + (the character);

> hash = (hash * 33) + (the character);

I see I've converted you from hash * 31 to hash * 33. :-)

[ on the typical distribution of hash values with chaining ]

It's worth remembering the power series for exp. If your load factor---

that is, the number N of stored elements divided by the number of hash

values---is x, then there will be about exp(-x)N zero-element chains,

exp(-x)Nx one-element chains, exp(-x)Nx^2/2 two-element, exp(-x)Nx^3/3!

three-element, and so on. This holds up quite well in practice, and lets

you judge accurately how often a branch will be taken or how far it's

worth unrolling a loop when you're working with hash tables.

---Dan

Jun 21, 1991, 1:35:04 PM6/21/91

to

In <1991Jun19.1...@zoo.toronto.edu> he...@zoo.toronto.edu (Henry Spencer) writes:

|In fact, when I originally did the hashed newsgroup-lookup code in C News

|expire, a quick investigation did not reveal any hashing function which

|had overall performance much better than simply using the length of the

|string! (The newsgroup list has grown so much that this decision is in

|need of revision, mind you.)

|In fact, when I originally did the hashed newsgroup-lookup code in C News

|expire, a quick investigation did not reveal any hashing function which

|had overall performance much better than simply using the length of the

|string! (The newsgroup list has grown so much that this decision is in

|need of revision, mind you.)

I think you'll want to re-check. The distributions are real uneven -- there

are more than 40 newsgroups of length 10, for example, so you have to resolve

lots of conflicts, and if you don't put a secondary hash, that means lots of

strcmp's. Of course, the design is very dependant on the newsgroup that a

site receives and the distribution of articles within those sites.

I use Chris Torek's hash function (amazingly good distribution and amazingly

cheap to calculate) with a table size of 128 and sort the conflicts based on

the highest article number. It seems to give the biggest win for the biggest

number of systems.

/r$

--

Please send comp.sources.unix-related mail to rs...@uunet.uu.net.

Use a domain-based address or give alternate paths, or you may lose out.

Jun 21, 1991, 6:37:59 PM6/21/91

to

In an article whose referent was deleted by bogus fidonet software

(and whose tabs were changed to spaces), I provided this example

hashing function:

>> hash = 0;

>> for (each character)

>> hash = (hash * 33) + (the character);

>> hash_index = hash & ((some power of two) - 1);

(and whose tabs were changed to spaces), I provided this example

hashing function:

>> hash = 0;

>> for (each character)

>> hash = (hash * 33) + (the character);

>> hash_index = hash & ((some power of two) - 1);

In article <14491.2...@stjhmc.fidonet.org>

Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:

>Is the *33 use to keep short sequences has key more random or does it

>really help in the distribution of large string too?

I am not sure what you are asking here. There are basically three

questions you can ask about a hash function:

a) Is it fast enough? (complicated functions often take longer than

the searches they replace)

b) Does it produce a good distribution? (the hash function

h(input)=0 is very fast but produces horrible distributions)

c) If (b) is true, why does it work?

The first two can really only be answered in some specific context.

In all the contexts I have tried, this one answers `yes' to both

(which is why I recommend it).

The third question is the hardest. What *does* that 33 do? I have no

idea. All I know is that I have experimented with `easy' functions

(functions for which `fast enough' should generally be true) and this

one also produces a good distribution. I doubt I still have the mail,

but when Margo Seltzer was writing the database code to replace ndbm,

she tried a bunch of different functions on a bunch of different data.

This one usually worked well---it did not always give the best

distributions, but it never performed poorly; some functions that gave

better distributions on some data gave much worse on other.

Maybe there are some theory people out there who can explain it.

Probably some staring at Knuth vol. 3 would help.

Jun 18, 1991, 2:57:39 PM6/18/91

to

Read CACM June 1990, page 677 "Fast Hashing of Variable-Length Text Strings".

-Thomas Wang

(Everything is an object.)

wa...@hpdmsjlm.cup.hp.com

tho...@hpcupt1.cup.hp.com

Jun 23, 1991, 3:03:19 PM6/23/91

to

Somewhere in this thread, Chris Torek proposed as a hash function:

> >> hash = 0;

> >> for (each character)

> >> hash = (hash * 33) + (the character);

> >> hash_index = hash & ((some power of two) - 1);

>

Chris then asked,

> ... What *does* that 33 do?

>

> Maybe there are some theory people out there who can explain it.

> Probably some staring at Knuth vol. 3 would help.

> >> hash = 0;

> >> for (each character)

> >> hash = (hash * 33) + (the character);

> >> hash_index = hash & ((some power of two) - 1);

>

> ... What *does* that 33 do?

>

> Maybe there are some theory people out there who can explain it.

> Probably some staring at Knuth vol. 3 would help.

A good property for a hash function is if you keep feeding it

a constant character, it performs as if it's a random number generator.

The hash function f(h,c) = 33*h + c is derived from a class of random

number generators called linear congruential. At the start of Knuth Vol 2,

there is a good discussion of these generators. Let's apply some of the

theorems in that section and see what we get. Keep in mind that we are working

with random numbers mod 2^x where x is usually 16 or 32. Assuming that c is some

fixed odd value, Theorem A guarantees that 33 is good because (33-1)%4 == 0.

In the same reasoning, 31 is not good since (31-1)%4 == 2 != 0.

Here good means that if you keep feeding c, you'll cycle thru the entire

set of values [0,2^x). On the other hand, 33 is not a primitive element for 2^x

when x > 3 (Theorem C), so it isn't a good multiplicator if you keep feeding in 0's.

For example, 37 is a better value from that point of view. To summarize,

a good multiplicator is any number with 101 as their low order 3 bits.

At this point, we depart from theory. We typically hash byte strings and we want

the bits to mix in the 16-bit or 32-bit registers relatively fast. This suggests

that some value larger than 2^8 (but not too large) may do better as a multiplicator.

After a little experimentation, here's a pretty decent hash function:

hash = 0;

for(each byte c)

hash = (hash<<8) + (hash<<2) + hash + c;

This corresponds to the multiplicator 261 whose bit pattern is 100000101.

Try it, you'll like it.

Jun 24, 1991, 5:15:58 AM6/24/91

to

In an article whose referent is long gone, I described this hash function:

for (each character)

hash = (hash * 33) + (the character);

hash_index = hash & ((some power of two) - 1);

for (each character)

hash = (hash * 33) + (the character);

hash_index = hash & ((some power of two) - 1);

and noted:

>>Maybe there are some theory people out there who can explain [why 33

>>works well].

In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:

>A good property for a hash function is if you keep feeding it

>a constant character, it performs as if it's a random number generator.

>The hash function f(h,c) = 33*h + c is derived from a class of random

>number generators called linear congruential. At the start of Knuth Vol 2,

>there is a good discussion of these generators.

... and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all

sorts of interesting stuff, such as a proof that the low-order n bits of

the usual rand() repeat with period $2^n$. (rand() uses a modulus of

either $2^15$ or $2^31$.)

>... Assuming that c is some fixed odd value, Theorem A guarantees that

>33 is good because (33-1)%4 == 0. ... Here good means that if you keep

>feeding c, you'll cycle thru the entire set of values [0,2^x).

This is approaching what I was getting at, but is not quite there.

Basically it says that 33 is `good' because it is `not bad'. An LCRNG

is defined as

h = ( a h + c ) mod m (where h is each a value of the

n+1 n n

variable `hash' in the for loop)

(in quote `>' above, $m = 2^x$). With a `bad' multiplier for $a$---and

what is `bad' depends on $c$ and $m$---the generator will not produce all

$m$ possible values; with a `good' one it will. Intuitively, then, the

`bad' one will miss some of the hash buckets entirely; the `good' one

will not.

But *why* is it true that `A good property is [that] it performs as if

it's a [good] random number generator'? This is now `obvious' (from the

above result) *if* we feed in constant values of $c$, i.e., if all the

strings are made of the same characters, but are different lengths. But

we rarely do that---the strings we hash are full of weird junk. How do

LCRNGs get in there? Why do the varying values of $c$ not mess things

up? (Of course, they do, but not enough to matter. Why not?)

>At this point, we depart from theory. We typically hash byte strings ...

Actually, I typically hash ASCII text. 33 is the `easiest' `good'

multiplier (as defined above) that is >= 26. I have no idea whether

this is significant, or merely coincidental.

>... This suggests that some value larger than 2^8 (but not too large)

>may do better as a multiplicator. ...

> hash = 0;

> for(each byte c)

> hash = (hash<<8) + (hash<<2) + hash + c;

I will see if I can get this one run on the same data as the `33 hash'.

This is more work than

hash = (hash << 5) + hash + c;

and distribution improvements may be outweighed by additional hash time

(or may not: who knows?).

Jun 24, 1991, 11:57:34 PM6/24/91

to

> A good property for a hash function is if you keep feeding it

> a constant character, it performs as if it's a random number generator.

> a constant character, it performs as if it's a random number generator.

No, I don't think so. This is commonly true of good hash functions and

commonly false of bad ones, but keep in mind that pseudo-random

sequences have to keep working 100, 1000, 1000000 elements later, while

simple string hash functions are rarely applied to strings with more

than 80 characters.

[ as a generator mod 2^n, 37 is better than 33 is better than 31 ]

Again, I don't think this is accurate. Each number has a sufficiently

long period mod 2^n that you don't have to worry about repeats.

> We typically hash byte strings and we want

> the bits to mix in the 16-bit or 32-bit registers relatively fast.

[ so use a multipler of 261 ]

Again, practically any good multiplier works. I think you're worrying

about the fact that 31c + d doesn't cover any reasonable range of hash

values if c and d are between 0 and 255. That's why, when I discovered

the 33 hash function and started using it in my compressors, I started

with a hash value of 5381. I think you'll find that this does just as

well as a 261 multiplier.

---Dan

Jun 25, 1991, 1:21:33 AM6/25/91

to

In article <36...@litchi.bbn.com> rs...@bbn.com (Rich Salz) writes:

>|In fact, when I originally did the hashed newsgroup-lookup code in C News

>|expire, a quick investigation did not reveal any hashing function which

>|had overall performance much better than simply using the length of the

>|string! ...>|In fact, when I originally did the hashed newsgroup-lookup code in C News

>|expire, a quick investigation did not reveal any hashing function which

>|had overall performance much better than simply using the length of the

>

>I think you'll want to re-check. The distributions are real uneven -- there

>are more than 40 newsgroups of length 10, for example, so you have to resolve

>lots of conflicts, and if you don't put a secondary hash, that means lots of

Strcmps cost much less if you prefix them with an inline check of the

first character. This strategy isn't as effective with newsgroups (which

tend to have common prefixes) as with random strings, but it still helps

quite a bit. And *as I mentioned*, the above-mentioned results are out

of date; the hash chains were typically 2-3 long back then.

Jun 25, 1991, 5:27:14 PM6/25/91

to

> Again, practically any good multiplier works. I think you're worrying

> about the fact that 31c + d doesn't cover any reasonable range of hash

> values if c and d are between 0 and 255. That's why, when I discovered

> the 33 hash function and started using it in my compressors, I started

> with a hash value of 5381. I think you'll find that this does just as

> well as a 261 multiplier.

>

> ---Dan

> about the fact that 31c + d doesn't cover any reasonable range of hash

> values if c and d are between 0 and 255. That's why, when I discovered

> the 33 hash function and started using it in my compressors, I started

> with a hash value of 5381. I think you'll find that this does just as

> well as a 261 multiplier.

>

> ---Dan

261 is just a number that was picked out of a hat that satisfies

the conditions outlined in the theorems in Knuth. These conditions are

pretty minimal so yes there are lots and lots of numbers that would do

just as well. For compressors, if you work with binary files, it isn't

infrequent to have strings of 0's. In such a case, you'll want the multiplier

to have the nice properties of a RNG. For this reason, 261 type of numbers

may be a little better than 33 (in either case, don`t forget to start the

hash sum at some odd value!).

For everyone's entertainment, I did a simple (and admittedly contrived) test

of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.

This is done by treating each number which is stored in a word as a

string of 4 bytes. With 261, 927300 of these numbers are

in their own buckets, the rest are in 36350 buckets each with 2 elements.

This is almost as good as a perfect hash function. The result for 33 is

much worse. The largest bucket has 64 elements in it and there are 4050

such buckets. The key difference is the 8th bit in 261 which causes the

mixing to go a little faster.

Jun 26, 1991, 4:17:50 AM6/26/91

to

> For everyone's entertainment, I did a simple (and admittedly contrived) test

> of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.

> This is done by treating each number which is stored in a word as a

> string of 4 bytes.

[ results supposedly showing 261 much better than 33 ]> of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.

> This is done by treating each number which is stored in a word as a

> string of 4 bytes.

Sorry, but the mod has to be a power of 2. Also, your test is remarkably

contrived: we're talking about string hash functions, and typical

strings (a) are concentrated on a few character values, not spread

evenly over the entire character set; (b) do not all have the same high

byte; (c) do not all have the same length.

Surely you noticed that a multiplier of 256 will produce perfect hashing

in this case. Does that mean you're recommending a multiplier of 256?

The profiling statistics I've saved from various compressors form a much

more realistic sample. They show 33 as slightly better than random

hashing on typical files, 37 as slightly worse. I've never tried 261 or

31, but I'll bet 261 does worse than 33 on text files.

---Dan

Jun 26, 1991, 9:25:56 AM6/26/91

to

In article <17918.Jun2...@kramden.acf.nyu.edu> brn...@kramden.acf.nyu.edu (Dan Bernstein) writes:

> In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:

> > For everyone's entertainment, I did a simple (and admittedly contrived) test

> > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.

> > This is done by treating each number which is stored in a word as a

> > string of 4 bytes.

>

> In article <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:

> > For everyone's entertainment, I did a simple (and admittedly contrived) test

> > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.

> > This is done by treating each number which is stored in a word as a

> > string of 4 bytes.

>

> Sorry, but the mod has to be a power of 2.

Why? (If you are telling me for speed I'll scream.)>

> Surely you noticed that a multiplier of 256 will produce perfect hashing

> in this case. Does that mean you're recommending a multiplier of 256?

>

> The profiling statistics I've saved from various compressors form a much

> more realistic sample. They show 33 as slightly better than random

> hashing on typical files, 37 as slightly worse. I've never tried 261 or

> 31, but I'll bet 261 does worse than 33 on text files.

>

how many entries fall in each bucket. I did this with the four multipliers

Dan just mentioned. I did this for 16 buckets to 8192 buckets (only powers

of two done). I calculated also the standard deviations involved. I will

not bore you with all those digits, the result can be stated simply: it does

not matter very much what multiplier you use. For each multiplier there

is a number of buckets where that multiplier has the smallest standard

deviation. So use whatever you like.

--

dik t. winter, cwi, amsterdam, nederland

d...@cwi.nl

Jun 26, 1991, 4:11:34 PM6/26/91

to

In <15...@ulysses.att.com> k...@ulysses.att.com (Phong Vo[drew]) writes:

>For everyone's entertainment, I did a simple (and admittedly contrived) test

>of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.

>This is done by treating each number which is stored in a word as a

>string of 4 bytes. With 261, 927300 of these numbers are

>in their own buckets, the rest are in 36350 buckets each with 2 elements.

>This is almost as good as a perfect hash function. The result for 33 is

>much worse. The largest bucket has 64 elements in it and there are 4050

>such buckets. The key difference is the 8th bit in 261 which causes the

>mixing to go a little faster.

Your test is probably irrelevant, being contiguous sequential numbers.

The best hash in this case is

for( hash = (i = 0); i < len(numstr); hash = hash*10 + numstring[i++]);

hash = hash % numberofbuckets;

This will yield the best possible distribution for any number of hash

buckets. The proof is left to the reader.

The point is not that there is a better hash for sequential numbers, but

that the performance of _any_ hash function is highly dependent on your keys;

if you any idea of the values or kinds of values to expect you can choose

a better function. If you can _control_ the values of your keys you can

sometimes get a _much_ better function; viz the numbers above.

If all this is old and boring stuff, I apologise for interrupting you all

just to look like an idiot (flame me by email), and I now return you to

your regularly scheduled newsgroup.

Martin Golding | sync, sync, sync, sank ... sunk:

Dod #0236 | He who steals my code steals trash.

A poor old decrepit Pick programmer. Sympathize at:

{mcspdx,pdxgate}!adpplz!martin or mar...@adpplz.uucp

Jun 26, 1991, 7:46:35 PM6/26/91

to

In <8...@adpplz.UUCP> mar...@adpplz.UUCP (Martin Golding) writes:

A bunch of nonsense, unless you realize I misread the previous post,

and was thinking of a string of ascii digits. Then it all becomes

clear, and I stand by everything I said, all of which was irrelevant.

Sorry.

Jun 26, 1991, 7:27:24 AM6/26/91

to

>>> hash = 0;

>>> for (each character)

>>> hash = (hash * 33) + (the character);

>>> hash_index = hash & ((some power of two) - 1);

>

>In article <14491.2...@stjhmc.fidonet.org>

>Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:

>>Is the *33 use to keep short sequences has key more random or does it

>>really help in the distribution of large string too?

>

>>> for (each character)

>>> hash = (hash * 33) + (the character);

>>> hash_index = hash & ((some power of two) - 1);

>

>In article <14491.2...@stjhmc.fidonet.org>

>Dave....@f14.n15.z1.fidonet.org (Dave Harris) writes:

>>Is the *33 use to keep short sequences has key more random or does it

>>really help in the distribution of large string too?

>

>The third question is the hardest. What *does* that 33 do? I have no

>idea. All I know is that I have experimented with `easy' functions

>idea. All I know is that I have experimented with `easy' functions

I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.

The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all

mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to

avoid throwing away bits. Rotating the bits instead of shifting left would

probably be a better way to avoid this.

The following hash function worked well for me in an assembler. It implements

the ideas that hashing all the chars in the string is unnecessary (and maybe

even bad if too many bits get shifted out) and that the middle char is more

random than leading or trailing chars.

In the program this was extracted from, the string length was already known.

Avoiding looking at all the chars in the string is not such a good idea when

strlen has to do it anyway. Also, this probably does best with a lot of

short strings.

---

extern unsigned strlen(); /* #include <string.h> */

#define SPTSIZE 1024

extern struct sym_s *spt[SPTSIZE];

#define hconv(ch) ((unsigned char) (ch) - 0x41) /* better form for hashing */

struct sym_s **gethashptr(str)

register char *str;

{

register unsigned hashval;

register unsigned length;

/* Hash function is a weighted xor of 1 to 4 chars in the string.

* This seems to work better than looking at all the chars.

* It is important that the function be fast.

* The multiplication by MULTIPLIER should compile as a shift.

*/

#define MULTIPLIER (SPTSIZE / (1 << USEFUL_BITS_IN_ASCII))

#define USEFUL_BITS_IN_ASCII 6

length = strlen(str);

if (length <= 3)

{

if (length <= 2)

hashval = hconv(str[-1]) * MULTIPLIER;

else

hashval = hconv(str[-2]) * MULTIPLIER,

hashval ^= hconv(str[-1]);

}

else

hashval = hconv(str[-(length / 2)]) * MULTIPLIER,

hashval ^= hconv(str[-2]) << 2,

hashval ^= hconv(str[-1]);

return spt + (hashval ^ (hconv(str[0]) << 1)) % SPTSIZE;

}

---

--

Bruce Evans ev...@syd.dit.csiro.au

Jun 27, 1991, 8:28:24 AM6/27/91

to

>In article <14...@dog.ee.lbl.gov> I wrote:

>>The [`why'] question is the hardest. What *does* that 33 do? I have no

>>idea. ...

>>The [`why'] question is the hardest. What *does* that 33 do? I have no

>>idea. ...

In article <1991Jun26.1...@syd.dit.CSIRO.AU>

ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:

>I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.

>The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all

>mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to

>avoid throwing away bits.

This answer just does not hold up: if this were the case, 65, 129,

etc., should work as well as 33. They do not. Why not?

>The following hash function worked well for me in an assembler. ...

> length = strlen(str); ... hashval = hconv(str[-1]) * MULTIPLIER;

At this point, str[-1] is unlikely to exist. The function seems to

expect a pointer to the last character, rather than the first.

Jun 28, 1991, 1:07:34 PM6/28/91

to

I do test the first character, Henry -- I'm pretty good at taking code and ideas

from experts. The Winter 87 paper "News Need Not Be Slow" that you and Geoff

wrote was the first reference I've seen to how effective that can be.

from experts. The Winter 87 paper "News Need Not Be Slow" that you and Geoff

wrote was the first reference I've seen to how effective that can be.

I apologize if my "you'll want to recheck" made it sound like a smack on

the head. Of course your article said your data was known to be old. I

just meant to add a data point of my own (for what my points are worth)

agreeing that things are very different now.

We now return you to your standard "what does a||b do" discussion...

Jul 1, 1991, 12:52:04 PM7/1/91

to

|> >In article <14...@dog.ee.lbl.gov> I wrote:

|> >>The [`why'] question is the hardest. What *does* that 33 do? I have no

|> >>idea. ...

|>

|> In article <1991Jun26.1...@syd.dit.CSIRO.AU>

|> ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:

|> >I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.

|> >The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all

|> >mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to

|> >avoid throwing away bits.

|>

|> This answer just does not hold up: if this were the case, 65, 129,

|> etc., should work as well as 33. They do not. Why not?

|>

|> >>The [`why'] question is the hardest. What *does* that 33 do? I have no

|> >>idea. ...

|>

|> In article <1991Jun26.1...@syd.dit.CSIRO.AU>

|> ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:

|> >I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.

|> >The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all

|> >mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to

|> >avoid throwing away bits.

|>

|> This answer just does not hold up: if this were the case, 65, 129,

|> etc., should work as well as 33. They do not. Why not?

|>

Just check to see if you are throwing away a bits, and if you are,

circular shift them back in on the right. i.e., the following worked

great for me when hashing C identifier names:

int hash_symbol (char *name, int max)

{

register int retval = 0;

register char *ch = name;

register int mask = max - 1;

register int nmask = ~mask;

for (; (*ch != '\0'); ch++) {

retval <<= 1;

retval += *ch;

if (retval & nmask) {

retval = (retval & mask) + 1;

}

}

return (retval);

}

here, I assume that max is a power of 2, so that mask = 0x[137f]fff...

whenever retval exceeds max, subtract max and increment. The idea

is easily adaptable to max being a non-power of two, or multiplying

by a number larger than two, at the cost of some speed.

Now that I look at it closer, I guess I could have and'ed with

max instead of nmask in the if-condition...

bb

Jul 1, 1991, 5:02:50 PM7/1/91

to

In article <1991Jul1.1...@csrd.uiuc.edu>, bl...@sp64.csrd.uiuc.edu (Brian Bliss) writes:

|>

|> int hash_symbol (char *name, int max)

|> {

|> register int retval = 0;

|> register char *ch = name;

|> register int mask = max - 1;

|> register int nmask = ~mask;

|>

|> for (; (*ch != '\0'); ch++) {

|> retval <<= 1;

|> retval += *ch;

|> if (retval & nmask) {

|> retval = (retval & mask) + 1;

|> }

|> }

|> return (retval);

|> }

|>

|> here, I assume that max is a power of 2, so that mask = 0x[137f]fff...

|> whenever retval exceeds max, subtract max and increment. The idea

|> is easily adaptable to max being a non-power of two, or multiplying

|> by a number larger than two, at the cost of some speed.

|> Now that I look at it closer, I guess I could have and'ed with

|> max instead of nmask in the if-condition...

|>

|>

|> int hash_symbol (char *name, int max)

|> {

|> register int retval = 0;

|> register char *ch = name;

|> register int mask = max - 1;

|> register int nmask = ~mask;

|>

|> for (; (*ch != '\0'); ch++) {

|> retval <<= 1;

|> retval += *ch;

|> if (retval & nmask) {

|> retval = (retval & mask) + 1;

|> }

|> }

|> return (retval);

|> }

|>

|> here, I assume that max is a power of 2, so that mask = 0x[137f]fff...

|> whenever retval exceeds max, subtract max and increment. The idea

|> is easily adaptable to max being a non-power of two, or multiplying

|> by a number larger than two, at the cost of some speed.

|> Now that I look at it closer, I guess I could have and'ed with

|> max instead of nmask in the if-condition...

|>

damit! I should have just posted the code I had that worked, and not

optimized it. You can't and with max instead of nmask, because

the left shift and the addition can make retval >= 2 * max,

and (retval & mask == 0). not only that, but

retval = (retval & mask) + 1;

need to be changed to

retval = (retval + 1) & mask;

I'll go crawl back into my hole.

bb

Jul 3, 1991, 4:31:36 AM7/3/91

to

>>In article <14...@dog.ee.lbl.gov> I wrote:

>>>The [`why'] question is the hardest. What *does* that 33 do? I have no

>>>idea. ...

>

>In article <1991Jun26.1...@syd.dit.CSIRO.AU>

>ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:

>>I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.

>>The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all

>>mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to

>>avoid throwing away bits.

>

>This answer just does not hold up: if this were the case, 65, 129,

>etc., should work as well as 33. They do not. Why not?

>>>The [`why'] question is the hardest. What *does* that 33 do? I have no

>>>idea. ...

>

>In article <1991Jun26.1...@syd.dit.CSIRO.AU>

>ev...@syd.dit.CSIRO.AU (Bruce.Evans) writes:

>>I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.

>>The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all

>>mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to

>>avoid throwing away bits.

>

>This answer just does not hold up: if this were the case, 65, 129,

>etc., should work as well as 33. They do not. Why not?

In my code the major part of the shift is selected according to the hash

table size. Clearly multiplying by (2**31 + 1) would be no good (unless

overflow is helpful :-). 65 might work better than 33 with a large table.

The reason that 33 works OK in practice is probably that it distinguishes

most (lower case) letters. I tried fine-tuning this idea with an auxiliary

table hconv[] based on letter frequencies. However the rule

hconv[x] = (x) - 'A' was faster in practice.

>>The following hash function worked well for me in an assembler. ...

>> length = strlen(str); ... hashval = hconv(str[-1]) * MULTIPLIER;

>

>At this point, str[-1] is unlikely to exist. The function seems to

>expect a pointer to the last character, rather than the first.

Oops. I must have edited too much while removing hair from the code. News

is way behind and my article has expired so I'm not sure how much was

wrong. For all of the negative offsets, replace 'str' by 'strend' where

strend = str + length.

--

Bruce Evans ev...@syd.dit.csiro.au

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu