MurmurHash 2.0 - Over 100% faster than SuperFastHash & Lookup3

244 views
Skip to first unread message

Austin Appleby

unread,
Mar 14, 2008, 6:11:31 AM3/14/08
to
No joke. I found a mix function that's much faster and mixes even
better - no differentials, no weaknesses with sparse keys.

Perf numbers on my Intel Core 2 Duo @ 2.4ghz -

OneAtATime
- 354.163715 mb/sec

FNV
- 443.668038 mb/sec

SuperFastHash
- 985.335173 mb/sec

lookup3
- 988.080652 mb/sec

MurmurHash 1.0
- 1363.293480 mb/sec

MurmurHash 2.0
- 2056.885653 mb/sec

Code is below, more info is at the website - http://murmurhash.googlepages.com
- I've updated the aligned version to use the new mix as well.

Standard disclaimers apply - this is still NOT A CRYPTOGRAPHIC HASH.
Do not use it as such. It should be good enough now to use in place of
CRC, though. :)

-Austin

//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby - Now with turbo speed!

// Note - This code makes a few assumptions about how your machine
behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-
endian
// machines.

#define MIX(h,k,m) { k *= m; k ^= k >> r; k *= m; h += k; h *= m; }

unsigned int MurmurHash2 ( const void * key, int len, unsigned int
seed )
{
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.

const unsigned int m = 0xc6a4a793;
const int r = 16;

// Initialize the hash to a 'random' value

unsigned int h = seed ^ (len * m);

// Mix 4 bytes at a time into the hash

const unsigned char * data = (const unsigned char *)key;

while(len >= 4)
{
unsigned int k = *(unsigned int *)data;

MIX(h,k,m);

data += 4;
len -= 4;
}

// Handle the last few bytes of the input array

if(len)
{
unsigned int k = 0;

switch(len)
{
case 3: k += data[2] << 16;
case 2: k += data[1] << 8;
case 1: k += data[0];
};

MIX(h,k,m);
}

// Do a few final mixes of the hash to ensure the last few
// bytes are well-incorporated.

h *= m;
h ^= h >> 10;
h *= m;
h ^= h >> 17;

return h;
}

Cristiano

unread,
Mar 14, 2008, 7:05:14 AM3/14/08
to
Austin Appleby wrote:
> No joke. I found a mix function that's much faster and mixes even
> better - no differentials, no weaknesses with sparse keys.

I used your mix function as a pseudo-random number generator. It seems very
good because the bit diffusion is very high.

Cristiano


Joseph Ashwood

unread,
Mar 14, 2008, 7:01:06 AM3/14/08
to
"Austin Appleby" <tan...@gmail.com> wrote in message
news:b5368f8a-1cbf-4dc4...@d21g2000prf.googlegroups.com...

> Standard disclaimers apply - this is still NOT A CRYPTOGRAPHIC HASH.


Then I'm sure I speak for many in sci.crypt that we would rather you stopped
posting off topic here.
Joe

yar...@gmail.com

unread,
Mar 14, 2008, 10:07:55 AM3/14/08
to
On 14 maalis, 12:11, Austin Appleby <tanj...@gmail.com> wrote:
> No joke. I found a mix function that's much faster and mixes even
> better - no differentials, no weaknesses with sparse keys.
>
> Perf numbers on my Intel Core 2 Duo @ 2.4ghz -
>
> OneAtATime
> - 354.163715 mb/sec
>
> FNV
> - 443.668038 mb/sec
>
> SuperFastHash
> - 985.335173 mb/sec
>
> lookup3
> - 988.080652 mb/sec
>
> MurmurHash 1.0
> - 1363.293480 mb/sec
>
> MurmurHash 2.0
> - 2056.885653 mb/sec
>
> Code is below, more info is at the website -http://murmurhash.googlepages.com

This is more flawed than the previous version. Not being cryptographic
is not an excuse at all in my opinion.
The state only ever gets mixed upwards (besides the finalization).
This causes very bad problems. Each possible block has a pair that
definitely results in only the highest bit in the state changing.
Because the state is never mixed downwards, this bit will never get
properly mixed in the state. This can for example, be used to generate
definite collisions without knowing anything about the state. Even
worse, it will definitely cause the bits 0-3 and 15-20 to equal in the
final hash.

For example:
"\x15\x5B\xE4\x1A"
"\x15\xDB\x31\x62"
Pick any buffer and append it to both of those. The hashes will ALWAYS
equal at bits 0-3 and 15-20.

This is only an example of the horror that the flawed state mixing
causes.

toby

unread,
Mar 14, 2008, 11:51:22 AM3/14/08
to
On Mar 14, 6:11 am, Austin Appleby <tanj...@gmail.com> wrote:
> No joke. I found a mix function that's much faster and mixes even
> better - no differentials, no weaknesses with sparse keys.
>
> Perf numbers on my Intel Core 2 Duo @ 2.4ghz -

Can you compare with djb2 [1].

--Toby

[1] http://www.cse.yorku.ca/~oz/hash.html

Austin Appleby

unread,
Mar 14, 2008, 7:09:20 PM3/14/08
to
On Mar 14, 3:11 am, Austin Appleby <tanj...@gmail.com> wrote:
> No joke. I found a mix function that's much faster and mixes even
> better - no differentials, no weaknesses with sparse keys.
>
> Perf numbers on my Intel Core 2 Duo @ 2.4ghz -
>
> OneAtATime
> - 354.163715 mb/sec
>
> FNV
> - 443.668038 mb/sec
>
> SuperFastHash
> - 985.335173 mb/sec
>
> lookup3
> - 988.080652 mb/sec
>
> MurmurHash 1.0
> - 1363.293480 mb/sec
>
> MurmurHash 2.0
> - 2056.885653 mb/sec
>
> Code is below, more info is at the website -http://murmurhash.googlepages.com

Code has been updated slightly since this post -

New mix function should be

#define MIX(h,k,m) { k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; }

and new mix constant should be

const unsigned int m = 0x5bd1e995;

Avalanche bias should now be under 0.5%, performance is the same.


-Austin

user923005

unread,
Mar 14, 2008, 10:43:27 PM3/14/08
to
On Mar 14, 3:11 am, Austin Appleby <tanj...@gmail.com> wrote:
> No joke. I found a mix function that's much faster and mixes even
> better - no differentials, no weaknesses with sparse keys.
>
> Perf numbers on my Intel Core 2 Duo @ 2.4ghz -
>
> OneAtATime
> - 354.163715 mb/sec
>
> FNV
> - 443.668038 mb/sec
>
> SuperFastHash
> - 985.335173 mb/sec
>
> lookup3
> - 988.080652 mb/sec
>
> MurmurHash 1.0
> - 1363.293480 mb/sec
>
> MurmurHash 2.0
> - 2056.885653 mb/sec
>
> Code is below, more info is at the website -http://murmurhash.googlepages.com

> - I've updated the aligned version to use the new mix as well.
>
> Standard disclaimers apply - this is still NOT A CRYPTOGRAPHIC HASH.
> Do not use it as such. It should be good enough now to use in place of
> CRC, though. :)
>
> -Austin
>
> //-------------------------------------------------------------------------­----


It looks like an interesting hash, but a note about calculation of
percent change from your web site:

"UPDATE UPDATE UPDATE March 14, 2008 - I've discovered a mix function
that is both faster and better than the previous version. MurmurHash2
is now over 100% faster than SuperFastHash or lookup3, and over 50%
faster than the previous MurmurHash.


Excellent distribution - Passes chi-squared tests for practically all
keysets & bucket sizes.

Excellent avalanche behavior - Maximum bias is under 0.5%.


Excellent collision resistance - No collisions possible for 4-byte
keys, no small (1- to 5-bit) differentials.


Excellent Performance - measured on an Intel Core 2 Duo @ 2.4 ghz


OneAtATime - 354.163715 mb/sec
FNV - 443.668038 mb/sec
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 1.0 - 1363.293480 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec"

Percent change is calculated as ((Vnew - Vold)/Vold) * 100
So the percent change for MurmurHash 2 verses MurmurHash 1 is:
((2056.885653 - 1363.293480)/2056.885653)*100=33.72%
And the percent change for Superfast Hash is:
((2056.885653 - 988.080652)/2056.885653)*100=51.96%

Sebastian G.

unread,
Mar 15, 2008, 12:13:08 AM3/15/08
to
user923005 wrote:


> Percent change is calculated as ((Vnew - Vold)/Vold) * 100
> So the percent change for MurmurHash 2 verses MurmurHash 1 is:
> ((2056.885653 - 1363.293480)/2056.885653)*100=33.72%
> And the percent change for Superfast Hash is:
> ((2056.885653 - 988.080652)/2056.885653)*100=51.96%


Now please compare the formula with your concrete values to spot the error.

user923005

unread,
Mar 15, 2008, 1:35:23 AM3/15/08
to

Oops.
((2056.885653 - 988.080652)/988.080652)*100=108.17%
((2056.885653 - 1363.293480)/1363.293480)*100=50.88%

Austin Appleby

unread,
Mar 15, 2008, 3:45:18 AM3/15/08
to

*laugh*

I like math.

Cristiano

unread,
Mar 15, 2008, 7:35:50 AM3/15/08
to
Austin Appleby wrote:
> New mix function should be
>
> #define MIX(h,k,m) { k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; }

It seems worse; it fails sistematically two statistical tests.
There are big deviations on the expected number of runs of zeros of various
lengths.

Cristiano


Austin Appleby

unread,
Mar 15, 2008, 2:47:35 PM3/15/08
to

What tests & keysets? It passes my tests & keysets very well (chi-
squared test, avalanche test, dictionary keys, artificial text keys,
sparse keys), otherwise I wouldn't have posted it.

-Austin

Cristiano

unread,
Mar 15, 2008, 3:59:06 PM3/15/08
to
Austin Appleby wrote:
> On Mar 15, 4:35 am, "Cristiano" <cristiano...@NSquipo.it> wrote:
>> Austin Appleby wrote:
>>> New mix function should be
>>
>>> #define MIX(h,k,m) { k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; }
>>
>> It seems worse; it fails sistematically two statistical tests.
>> There are big deviations on the expected number of runs of zeros of
>> various lengths.
>>
>> Cristiano
>
> What tests & keysets?

I use this:
http://www.webalice.it/cristiano.pi/rabigete/
there are many tests.

I use MIX to 'diffuse' a very bad input (the 'c' counter) and then I check
'h' to see how good your function is:

unsigned long MH2(void)
{
static unsigned long c = random_seed; c++;
unsigned long k=c, h=c*m;
MIX(h,k,m)
h *= m; h ^= h >> 17; h *= m; h ^= h >> 10;
return h;
}

It's not the intended use, but that code is good to check a mix function.

> It passes my tests & keysets very well (chi-
> squared test, avalanche test, dictionary keys, artificial text keys,
> sparse keys), otherwise I wouldn't have posted it.

I'm sure.

Cristiano


Austin Appleby

unread,
Mar 15, 2008, 5:47:02 PM3/15/08
to

Oh, that makes sense - the mix isn't in any way intended to be a
random number generator, and in fact the "no collisions on 4 byte
keys" property that makes it good for hash tables also makes it very
bad for random number generation as you won't get anywhere near as
many repeated values as expected.

I'll download your code and try it out, but in the meantime you could
make a slightly better random number generator out of the mix by
repeating 'c' a few times (say 4) to produce a longer key, hashing the
128-bit key, and feeding that hash to the random number test.

Also, the mixing constant and shift values have been updated slightly
since the original post - you should grab the latest from
http://murmurhash.googlepages.com.

-Austin

Clive D. W. Feather

unread,
Mar 16, 2008, 5:16:06 PM3/16/08
to
In article
<b5368f8a-1cbf-4dc4...@d21g2000prf.googlegroups.com>,
Austin Appleby <tan...@gmail.com> writes

> unsigned int k = *(unsigned int *)data;

This is not portable code.

* On some systems, an unsigned int is not 4 bytes.
* On some systems, a byte isn't 8 bits.
* Even when it is, the order of the bytes isn't guaranteed, so k will
have different values on different systems.
* There is no requirement that the value of k be generated from all the
bits in the bytes pointed to by data. For example, it would be
legitimate (though unusual) for only 5 bits to be taken from each byte
to construct the value of k.

All these factors will change the value of k and, therefore, the
behaviour of your hash function. I note that you've mentioned some of
them in the comments, but not all.

Some suggestions for you:
* Use uint_fast32_t instead of unsigned int. This is the generally
fastest unsigned integer type that has at least 32 bits.
* Read the bytes into k using shift-and-OR:
k = data [0];
k |= data [1] << 8;
k |= data [2] << 16;
k |= data [3] << 24;
or similar.
* Put a guard test in of the form:
#if CHAR_BIT != 8
#error "This code requires 8 bit bytes"
#endif

--
Clive D.W. Feather | Home: <cl...@davros.org>
Tel: +44 20 8495 6138 (work) | Web: <http://www.davros.org>
Fax: +44 870 051 9937 | Work: <cl...@demon.net>
Please reply to the Reply-To address, which is: <cl...@davros.org>

Austin Appleby

unread,
Mar 16, 2008, 7:35:17 PM3/16/08
to
On Mar 16, 2:16 pm, "Clive D. W. Feather" <cl...@on-the-
train.demon.co.uk> wrote:
> In article
> <b5368f8a-1cbf-4dc4-b637-0608056d9...@d21g2000prf.googlegroups.com>,
> Austin Appleby <tanj...@gmail.com> writes


Unfortunately Visual Studio (my current development environment)
doesn't ship with either stdint.h or inttypes.h, so using any of the
explicitly-sized types defined in C99 wouldn't work without
instructing users to download the 3rd-party versions of those headers.

Also, shift-and-ORing the bytes into k fixes the endian dependency but
cuts performance in half, which was the purpose of the hash in the
first place.

I chose instead to make the code use only ANSI C at the cost of
portability, while trying to make the limitations/assumptions explicit
in the comments. It's pretty brain-dead simple code, so porting to
platforms with different word/byte sizes should be trivial.

-Austin

Reply all
Reply to author
Forward
0 new messages