[PATCH] Switching out the Bernstein Hash function for Murmur Hash

129 views
Skip to first unread message

Max De Marzi

unread,
Jun 29, 2009, 2:55:09 PM6/29/09
to Redis DB
Hello,

The Bernstein Hash function in redis doesn't mix well which may cause
more hash collisions than expected.
See http://murmurhash.googlepages.com/avalanche for details.

The Murmur hash below is used in memcached and should yield less
collisions.


Left file: C:\Data\C Scripts\redis-0.900\dict.c Right file: C:\Data
\C Scripts\murmur\dict.c
102,103c102
< /* Generic hash function (a popular one from Bernstein).
< * I tested a few and this was the best. */
---
> /* Murmur hash function. */
105c104,117
< unsigned int hash = 5381;
---
>
> /*
> 'm' and 'r' are mixing constants generated offline. They're not
> really 'magic', they just happen to work well.
> */
>
> const unsigned int m= 0x5bd1e995;
> const unsigned int seed= (0xdeadbeef * len);
> const int r= 24;
>
>
> // Initialize the hash to a 'random' value
>
> unsigned int hash= seed ^ len;
107,109c119,159
< while (len--)
< hash = ((hash << 5) + hash) + (*buf++); /* hash * 33 + c */
< return hash;
---
> // Mix 4 bytes at a time into the hash
>
> const unsigned char * data= (const unsigned char *)buf;
>
> while(len >= 4)
> {
> unsigned int k = *(unsigned int *)data;
>
> k *= m;
> k ^= k >> r;
> k *= m;
>
> hash *= m;
> hash ^= k;
>
> data += 4;
> len -= 4;
> }
>
> // Handle the last few bytes of the input array
>
> switch(len)
> {
> case 3: hash ^= data[2] << 16;
> case 2: hash ^= data[1] << 8;
> case 1: hash ^= data[0];
> hash *= m;
> };
>
> /*
> Do a few final mixes of the hash to ensure the last few bytes are
> well-incorporated.
> */
>
> hash ^= hash >> 13;
> hash *= m;
> hash ^= hash >> 15;
>
> return hash;
>
>

Chris Lamb

unread,
Jun 30, 2009, 7:37:11 AM6/30/09
to redi...@googlegroups.com
Max De Marzi wrote:

> The Murmur hash below is used in memcached and should yield less
> collisions.

Same patch attached in a non-retarded diff format, because I'm feeling
nice.


Regards,

--
Chris Lamb, UK ch...@chris-lamb.co.uk
GPG: 0x634F9A20

patch.diff.txt
signature.asc

Salvatore Sanfilippo

unread,
Jun 30, 2009, 8:07:55 AM6/30/09
to redi...@googlegroups.com
On Tue, Jun 30, 2009 at 1:37 PM, Chris Lamb<ch...@chris-lamb.co.uk> wrote:
> Max De Marzi wrote:
>
>> The Murmur hash below is used in memcached and should yield less
>> collisions.
>
> Same patch attached in a non-retarded diff format, because I'm feeling
> nice.

Thank you Chris, Max. I know the current hashing function is not
perfect, but I run different tests against key sin the form
"identifier<incremental number>" and surprisingly the djb hash
function was the one performing better. So I want to re-run this tests
after the Redis-1.0 release against some kind of real-world dataset in
order to reality-check if this is a real win.

Basically I want to switch hash function only if the evidence will
suggest it's worth it, and given that the hash function library is
able to count collisions it should be trivial to run the test, and
after the 1.0 stable release.

Cheers,
Salvatore

>
> Regards,
>
> --
> Chris Lamb, UK                                     ch...@chris-lamb.co.uk
>                                                          GPG: 0x634F9A20
>

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay

Reply all
Reply to author
Forward
0 new messages