Perf numbers on my Intel Core 2 Duo @ 2.4ghz -
OneAtATime
- 354.163715 mb/sec
FNV
- 443.668038 mb/sec
SuperFastHash
- 985.335173 mb/sec
lookup3
- 988.080652 mb/sec
MurmurHash 1.0
- 1363.293480 mb/sec
MurmurHash 2.0
- 2056.885653 mb/sec
Code is below, more info is at the website - http://murmurhash.googlepages.com
- I've updated the aligned version to use the new mix as well.
Standard disclaimers apply - this is still NOT A CRYPTOGRAPHIC HASH.
Do not use it as such. It should be good enough now to use in place of
CRC, though. :)
-Austin
//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby - Now with turbo speed!
// Note - This code makes a few assumptions about how your machine
behaves -
// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4
// And it has a few limitations -
// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-
endian
// machines.
#define MIX(h,k,m) { k *= m; k ^= k >> r; k *= m; h += k; h *= m; }
unsigned int MurmurHash2 ( const void * key, int len, unsigned int
seed )
{
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.
const unsigned int m = 0xc6a4a793;
const int r = 16;
// Initialize the hash to a 'random' value
unsigned int h = seed ^ (len * m);
// Mix 4 bytes at a time into the hash
const unsigned char * data = (const unsigned char *)key;
while(len >= 4)
{
unsigned int k = *(unsigned int *)data;
MIX(h,k,m);
data += 4;
len -= 4;
}
// Handle the last few bytes of the input array
if(len)
{
unsigned int k = 0;
switch(len)
{
case 3: k += data[2] << 16;
case 2: k += data[1] << 8;
case 1: k += data[0];
};
MIX(h,k,m);
}
// Do a few final mixes of the hash to ensure the last few
// bytes are well-incorporated.
h *= m;
h ^= h >> 10;
h *= m;
h ^= h >> 17;
return h;
}
I used your mix function as a pseudo-random number generator. It seems very
good because the bit diffusion is very high.
Cristiano
Then I'm sure I speak for many in sci.crypt that we would rather you stopped
posting off topic here.
Joe
This is more flawed than the previous version. Not being cryptographic
is not an excuse at all in my opinion.
The state only ever gets mixed upwards (besides the finalization).
This causes very bad problems. Each possible block has a pair that
definitely results in only the highest bit in the state changing.
Because the state is never mixed downwards, this bit will never get
properly mixed in the state. This can for example, be used to generate
definite collisions without knowing anything about the state. Even
worse, it will definitely cause the bits 0-3 and 15-20 to equal in the
final hash.
For example:
"\x15\x5B\xE4\x1A"
"\x15\xDB\x31\x62"
Pick any buffer and append it to both of those. The hashes will ALWAYS
equal at bits 0-3 and 15-20.
This is only an example of the horror that the flawed state mixing
causes.
Can you compare with djb2 [1].
--Toby
Code has been updated slightly since this post -
New mix function should be
#define MIX(h,k,m) { k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; }
and new mix constant should be
const unsigned int m = 0x5bd1e995;
Avalanche bias should now be under 0.5%, performance is the same.
-Austin
It looks like an interesting hash, but a note about calculation of
percent change from your web site:
"UPDATE UPDATE UPDATE March 14, 2008 - I've discovered a mix function
that is both faster and better than the previous version. MurmurHash2
is now over 100% faster than SuperFastHash or lookup3, and over 50%
faster than the previous MurmurHash.
Excellent distribution - Passes chi-squared tests for practically all
keysets & bucket sizes.
Excellent avalanche behavior - Maximum bias is under 0.5%.
Excellent collision resistance - No collisions possible for 4-byte
keys, no small (1- to 5-bit) differentials.
Excellent Performance - measured on an Intel Core 2 Duo @ 2.4 ghz
OneAtATime - 354.163715 mb/sec
FNV - 443.668038 mb/sec
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 1.0 - 1363.293480 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec"
Percent change is calculated as ((Vnew - Vold)/Vold) * 100
So the percent change for MurmurHash 2 verses MurmurHash 1 is:
((2056.885653 - 1363.293480)/2056.885653)*100=33.72%
And the percent change for Superfast Hash is:
((2056.885653 - 988.080652)/2056.885653)*100=51.96%
> Percent change is calculated as ((Vnew - Vold)/Vold) * 100
> So the percent change for MurmurHash 2 verses MurmurHash 1 is:
> ((2056.885653 - 1363.293480)/2056.885653)*100=33.72%
> And the percent change for Superfast Hash is:
> ((2056.885653 - 988.080652)/2056.885653)*100=51.96%
Now please compare the formula with your concrete values to spot the error.
Oops.
((2056.885653 - 988.080652)/988.080652)*100=108.17%
((2056.885653 - 1363.293480)/1363.293480)*100=50.88%
*laugh*
I like math.
It seems worse; it fails sistematically two statistical tests.
There are big deviations on the expected number of runs of zeros of various
lengths.
Cristiano
What tests & keysets? It passes my tests & keysets very well (chi-
squared test, avalanche test, dictionary keys, artificial text keys,
sparse keys), otherwise I wouldn't have posted it.
-Austin
I use this:
http://www.webalice.it/cristiano.pi/rabigete/
there are many tests.
I use MIX to 'diffuse' a very bad input (the 'c' counter) and then I check
'h' to see how good your function is:
unsigned long MH2(void)
{
static unsigned long c = random_seed; c++;
unsigned long k=c, h=c*m;
MIX(h,k,m)
h *= m; h ^= h >> 17; h *= m; h ^= h >> 10;
return h;
}
It's not the intended use, but that code is good to check a mix function.
> It passes my tests & keysets very well (chi-
> squared test, avalanche test, dictionary keys, artificial text keys,
> sparse keys), otherwise I wouldn't have posted it.
I'm sure.
Cristiano
Oh, that makes sense - the mix isn't in any way intended to be a
random number generator, and in fact the "no collisions on 4 byte
keys" property that makes it good for hash tables also makes it very
bad for random number generation as you won't get anywhere near as
many repeated values as expected.
I'll download your code and try it out, but in the meantime you could
make a slightly better random number generator out of the mix by
repeating 'c' a few times (say 4) to produce a longer key, hashing the
128-bit key, and feeding that hash to the random number test.
Also, the mixing constant and shift values have been updated slightly
since the original post - you should grab the latest from
http://murmurhash.googlepages.com.
-Austin
This is not portable code.
* On some systems, an unsigned int is not 4 bytes.
* On some systems, a byte isn't 8 bits.
* Even when it is, the order of the bytes isn't guaranteed, so k will
have different values on different systems.
* There is no requirement that the value of k be generated from all the
bits in the bytes pointed to by data. For example, it would be
legitimate (though unusual) for only 5 bits to be taken from each byte
to construct the value of k.
All these factors will change the value of k and, therefore, the
behaviour of your hash function. I note that you've mentioned some of
them in the comments, but not all.
Some suggestions for you:
* Use uint_fast32_t instead of unsigned int. This is the generally
fastest unsigned integer type that has at least 32 bits.
* Read the bytes into k using shift-and-OR:
k = data [0];
k |= data [1] << 8;
k |= data [2] << 16;
k |= data [3] << 24;
or similar.
* Put a guard test in of the form:
#if CHAR_BIT != 8
#error "This code requires 8 bit bytes"
#endif
--
Clive D.W. Feather | Home: <cl...@davros.org>
Tel: +44 20 8495 6138 (work) | Web: <http://www.davros.org>
Fax: +44 870 051 9937 | Work: <cl...@demon.net>
Please reply to the Reply-To address, which is: <cl...@davros.org>
Unfortunately Visual Studio (my current development environment)
doesn't ship with either stdint.h or inttypes.h, so using any of the
explicitly-sized types defined in C99 wouldn't work without
instructing users to download the 3rd-party versions of those headers.
Also, shift-and-ORing the bytes into k fixes the endian dependency but
cuts performance in half, which was the purpose of the hash in the
first place.
I chose instead to make the code use only ANSI C at the cost of
portability, while trying to make the limitations/assumptions explicit
in the comments. It's pretty brain-dead simple code, so porting to
platforms with different word/byte sizes should be trivial.
-Austin