fast little hash function

Cesium

Oct 26, 2007, 7:00:21 PM10/26/07
to Hash Functions
Long ago and far away... Paul Hsieh published a fast little hash
function: http://www.azillionmonkeys.com/qed/hash.html
One of the principal themes in this hash function is the idea of
moving the avalanche code outside the main loop.

That was a pretty big gauntlet Paul threw down, so I decided to look
for hash functions with comparable bit mixing quality that ran
faster. (Paul's routine has good, but not great, bit mixing quality.
Jenkins' lookup2 and lookup3 have much better mixing quality.)

The main loop is:

hash += get16bits (data);
tmp = (get16bits (data+2) << 11) ^ hash;
hash = (hash << 16) ^ tmp;
data += 2*sizeof (uint16_t);
hash += hash >> 11;

and the final avalanching code is:

hash ^= hash << 3;
hash += hash >> 5;
hash ^= hash << 4;
hash += hash >> 17;
hash ^= hash << 25;
hash += hash >> 6;
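
(For reference, a minimal compilable sketch that assembles those two
fragments into one function. The get16bits helper and the loop framing
are filled in here as assumptions; Paul's published version also handles
the leftover bytes (len % 4) case by case, which this sketch skips.)

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Assumed helper: unaligned 16-bit load. */
static uint32_t get16bits(const char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Sketch of a Hsieh-style hash: main loop plus final avalanche.
   Simplified: trailing bytes (len % 4) are ignored. */
uint32_t hsieh_style_hash(const char *data, size_t len)
{
    uint32_t hash = (uint32_t)len, tmp;   /* seeded with the length */

    for (; len >= 4; len -= 4) {
        hash += get16bits(data);
        tmp = (get16bits(data + 2) << 11) ^ hash;
        hash = (hash << 16) ^ tmp;
        data += 2 * sizeof(uint16_t);
        hash += hash >> 11;
    }

    hash ^= hash << 3;
    hash += hash >> 5;
    hash ^= hash << 4;
    hash += hash >> 17;
    hash ^= hash << 25;
    hash += hash >> 6;
    return hash;
}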

If you construct a dataflow graph of the main loop, you'll see that
it is long and thin rather than short and fat, suggesting that the
loop can be sped up. A dataflow graph shows which instructions can
execute in parallel and which must wait for the results of previous
computations. We can diagram this by listing the instructions that
execute in parallel on one line:

0: t1 = get16bits(data) ; t2 = get16bits(data+2)
1: hash += t1; t3 = t2 << 11; data += 4
2: tmp = t3 ^ hash ; t4 = hash << 16;
3: hash = tmp ^ t4
4: t5 = hash ; hash >>= 11;
5: hash += t5

Thus, the inner loop requires at least 6 cycles, and more if fewer
than two functional units are available. Most chips also have a load
delay. On modern x86 chips (with a fast shifter) this takes about 9
cycles. (In theory, chips can look ahead across loop boundaries and
overlap the load operations with the tail end of the loop, which could
take things back down to 5 cycles. In practice, I haven't seen this
happen.)

I attacked the problem by trying to minimize the dataflow graph. I
also used the trick of using a rotate instead of a pair of shifts. My
inner loop looks like:

hash += rotate(hash, 13);
hash ^= get32bits(data);
data += 4;

And the final "avalanching" code that I use looks like:

hash += rotate(hash, 17);
hash ^= rotate(hash, 9);
hash += rotate(hash, 5);
hash ^= rotate(hash, 3);
hash += rotate(hash, 1);

The dataflow diagram for my inner loop looks like:

0: t = hash; hash = rotate(hash, 13); d = get32bits(data);
1: hash += t; data += 4;
2: hash ^= d

On modern x86 chips, this will take about 4 cycles. We carefully
schedule the load of data at the beginning of the loop and the
combining of that data with the hash at the end of the loop.

Although the inner loop looks too simple to be useful, this hash
function scrambles bits as well as (or ever so slightly better than)
Hsieh's function on random strings, sparse strings, and strings whose
characters follow a Zipf distribution.

The code has an interesting interpretation: it is a (not very good)
pseudo-random number generator into which we periodically xor a word
of the buffer we are hashing.
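
(To make that concrete, here is a minimal self-contained sketch of the
function described above. The rotate and get32bits helpers are the
obvious definitions; the seed and the handling of trailing bytes are
illustrative placeholders, not something specified above.)

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Assumed helpers: 32-bit left rotate and unaligned 32-bit load. */
static uint32_t rotate(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

static uint32_t get32bits(const char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

uint32_t rot13_hash(const char *data, size_t len)
{
    uint32_t hash = (uint32_t)len;    /* placeholder seed */

    /* inner loop: one rotate, one add, one xor per 32-bit word */
    for (; len >= 4; len -= 4) {
        hash += rotate(hash, 13);
        hash ^= get32bits(data);
        data += 4;
    }

    /* trailing bytes, folded in one at a time (placeholder choice) */
    for (; len > 0; len--, data++) {
        hash += rotate(hash, 13);
        hash ^= (uint32_t)(unsigned char)*data;
    }

    /* final avalanche */
    hash += rotate(hash, 17);
    hash ^= rotate(hash, 9);
    hash += rotate(hash, 5);
    hash ^= rotate(hash, 3);
    hash += rotate(hash, 1);
    return hash;
}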

Andres Valloud

Oct 27, 2007, 12:46:08 AM10/27/07
to hash-fu...@googlegroups.com
Hey, this is a lot of work that you have done!  I will be taking a look at it later.

Thanks,
Andres.

Andres

Oct 31, 2007, 5:54:17 PM10/31/07
to Hash Functions
Hey... I am sorry I have not had a chance to review this yet... I've
been very busy with work and things. I will take a look as soon as I
can.

Thanks,
Andres.

On Oct 26, 9:46 pm, "Andres Valloud" <andres.vall...@gmail.com> wrote:
> Hey, this is a lot of work that you have done! I will be taking a look at
> it later.
>
> Thanks,
> Andres.
>

Andres Valloud

Nov 11, 2007, 8:07:27 PM11/11/07
to Hash Functions
I have to say... I could write this up in the hash analysis tool I wrote, but the number of times I have seen loops and avalanching code like the ones you mention makes me think of questions that I seldom see answered.  For example,

* why were the particular rotations / additions chosen?

* why is it expected that this hash function will perform well in practice, other than "it seems to pass some test"?

* is the fact that the collision set of this kind of hash function appears hard to figure out proof enough that it does not appear in practice?

* faster does not always mean better; how does this kind of hash function compare with others?

* can all hash functions of this kind be encompassed in some class description that generates them all?

Thanks,
Andres.

Blah Blah

Nov 13, 2007, 11:25:46 PM11/13/07
to hash-fu...@googlegroups.com
Yes, I agree.

On Nov 11, 2007 5:07 PM, Andres Valloud <andres....@gmail.com> wrote:
> I have to say... I could write this up in the hash analysis tool I wrote,
> but the number of times I have seen loops and avalanching code like the ones
> you mention makes me think of questions that I seldom see answered. For
> example,
>
> * why were the particular rotations / additions chosen?

At the current time, we choose rotations and operations by
exhaustively searching a function space to see what seems to work
best. Bleah.
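
(As a toy illustration of that kind of search, far cruder than a real
harness: score every rotation constant for the one-rotate inner loop
with a rough avalanche measure and keep the best. The scoring here is
illustrative, not the test suite actually used.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t rot(uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }

static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* One iteration of the candidate inner loop with rotation constant r. */
static uint32_t step(uint32_t h, uint32_t w, int r)
{
    h += rot(h, r);
    return h ^ w;
}

/* Crude avalanche score: flip one bit of the input word, run a few
   rounds, and measure how far the number of flipped output bits lands
   from the ideal 16.  Lower is better. */
static double score(int r, int trials)
{
    double err = 0.0;
    for (int t = 0; t < trials; t++) {
        uint32_t h  = (uint32_t)rand() ^ ((uint32_t)rand() << 16);
        uint32_t w  = (uint32_t)rand() ^ ((uint32_t)rand() << 16);
        uint32_t wb = w ^ (UINT32_C(1) << (rand() % 32));
        uint32_t a = h, b = h;
        for (int i = 0; i < 3; i++) { a = step(a, w, r); b = step(b, wb, r); }
        err += abs(popcount32(a ^ b) - 16);
    }
    return err / trials;
}

int main(void)
{
    int best_r = 1;
    double best = 1e9;
    for (int r = 1; r < 32; r++) {  /* skip r = 0: a no-op (and UB in rot) */
        double s = score(r, 100000);
        if (s < best) { best = s; best_r = r; }
    }
    printf("best rotation constant: %d (score %.3f)\n", best_r, best);
    return 0;
}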

>
> * why is it expected that this hash function will perform well in practice,
> other than "it seems to pass some test"?

This is one of the things I'm trying to estimate and quantify. I'm
currently suggesting that we can approximately estimate what a hash
function computes by pretending all operations are xors or rotates,
and thus looking at the list of distinct input bit positions that are
combined into an output bit position.
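
(A small sketch of that bookkeeping: under the approximation, every add
is treated as an xor, so each output bit position just carries a set of
input bit positions, combined by xor. Two 32-bit input words are
modeled here, giving a 64-bit mask per output bit; the rotation
constants are the ones from the loop earlier in the thread.)

#include <stdint.h>
#include <stdio.h>

/* bits[i] = mask of input bit positions feeding hash bit i, under the
   pretend-everything-is-xor-or-rotate approximation. */
typedef struct { uint64_t bits[32]; } BitSets;

/* hash = rotate(hash, k): permute the per-bit sets. */
static void model_rotate(const BitSets *h, int k, BitSets *out)
{
    for (int i = 0; i < 32; i++)
        out->bits[(i + k) % 32] = h->bits[i];
}

/* hash += rotate(hash, k), approximated as hash ^= rotate(hash, k);
   under xor, input bits that appear twice cancel, hence the ^=. */
static void model_add_rotate(BitSets *h, int k)
{
    BitSets r;
    model_rotate(h, k, &r);
    for (int i = 0; i < 32; i++)
        h->bits[i] ^= r.bits[i];
}

/* hash ^= word w: input bit (32*w + j) feeds hash bit j. */
static void model_xor_word(BitSets *h, int w)
{
    for (int j = 0; j < 32; j++)
        h->bits[j] ^= UINT64_C(1) << (32 * w + j);
}

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

int main(void)
{
    BitSets h = {{0}};
    for (int w = 0; w < 2; w++) {        /* two words of input */
        model_add_rotate(&h, 13);
        model_xor_word(&h, w);
    }
    static const int tail[] = {17, 9, 5, 3, 1};
    for (int i = 0; i < 5; i++)
        model_add_rotate(&h, tail[i]);
    for (int i = 0; i < 32; i++)
        printf("output bit %2d combines %2d distinct input bits\n",
               i, popcount64(h.bits[i]));
    return 0;
}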

>
> * is the fact that the collision set of this kind of hash function appears
> hard to figure out proof enough that it does not appear in practice?
>
> * faster does not always mean better; how does this kind of hash function
> compare with others?

Right. We can measure the quality of Jenkins' lookup3 using a
chi-square or chi-square-like test, and it has great quality and good
speed. It's reasonably likely that we can maintain the same quality
and get a 10% performance improvement by removing the hidden temporary
variable that lookup3 uses, and then finding the right rotation
constants. It's plausible that there exists a hash function of
equivalent quality that uses more instructions and higher
instruction-level parallelism to run slightly faster.
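
(For concreteness, the usual shape of such a chi-square test: hash a
pile of keys into B buckets and compare the observed counts against the
uniform expectation; for a well-behaved hash the statistic should land
near B - 1. The hash under test below is the rotate-13 function from
earlier in the thread with improvised tail handling; any 32-bit hash
can be swapped in.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t rot(uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }

/* Function under test (illustrative tail handling and seed). */
static uint32_t hash_fn(const char *p, size_t len)
{
    uint32_t h = (uint32_t)len, w;
    for (; len >= 4; len -= 4, p += 4) {
        memcpy(&w, p, 4);
        h += rot(h, 13);
        h ^= w;
    }
    for (; len > 0; len--, p++) {
        h += rot(h, 13);
        h ^= (uint32_t)(unsigned char)*p;
    }
    h += rot(h, 17); h ^= rot(h, 9); h += rot(h, 5);
    h ^= rot(h, 3);  h += rot(h, 1);
    return h;
}

int main(void)
{
    enum { KEYS = 1000000, BUCKETS = 4096 };
    static int count[BUCKETS];
    char key[16];

    for (int i = 0; i < KEYS; i++) {
        int len = snprintf(key, sizeof key, "key%d", i);
        count[hash_fn(key, (size_t)len) % BUCKETS]++;
    }

    double expected = (double)KEYS / BUCKETS, chi2 = 0.0;
    for (int b = 0; b < BUCKETS; b++) {
        double d = count[b] - expected;
        chi2 += d * d / expected;
    }
    printf("chi^2 = %.1f (expect roughly %d)\n", chi2, BUCKETS - 1);
    return 0;
}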

I'm also looking at a class of hash functions that has the same
quality and speed, slightly simpler code, and might be more resistant
to denial of service attacks. (Darn it, if I can't find a faster,
different, hash function the old fashioned way, I'll cheat and change
the rules for what we're looking for. ;-)

>
> * can all hash functions of this kind be encompassed in some class
> description that generates them all?

Absolutely. Select a subset of instructions that you are willing to
execute. Select the amount of memory (number of registers) you are
willing to use. Select the number of instructions you are willing to
execute. Then do an exhaustive search across the space. You can toss
in a simulation of instruction level parallelism along the way. Of
course, it's kind of a large space to search...
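
(Just to illustrate how fast that space grows, a back-of-the-envelope
count under one made-up encoding: each instruction slot is either an
add or xor between two registers, or a rotate of one register by a
constant.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t n_regs = 4;
    /* add/xor pick a dst and src register; rotate picks a dst register
       and a constant in 1..31 */
    const uint64_t per_slot = 2 * n_regs * n_regs + 31 * n_regs;

    uint64_t total = 1;
    for (int len = 1; len <= 8; len++) {
        total *= per_slot;   /* programs of exactly this length */
        printf("%d instructions: %llu candidate programs\n",
               len, (unsigned long long)total);
    }
    return 0;
}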

Andres Valloud

Nov 13, 2007, 11:29:11 PM11/13/07
to hash-fu...@googlegroups.com
Sigh... I ran into the same problems when writing the book.  Other than running tests geared towards finding obvious faults, there's usually not a whole lot to say :(.

Andres.