Murmur3 hash function

2,436 views
Skip to first unread message

Lucas Miller

unread,
Mar 21, 2011, 9:04:56 PM3/21/11
to alemb...@googlegroups.com
Hey folks,

A while ago I mentioned that I was looking at another faster hashing function as an alternative to MD5.

http://code.google.com/p/smhasher/wiki/MurmurHash3
http://smhasher.googlecode.com/svn/trunk/MurmurHash3.cpp
See: MurmurHash3_x64_128

The advantages of using it over MD5 is mainly it's speed (3-5 times faster), the code is also a lot simpler.
Testing it against a lot of our production data, it appears to be just as collision resistant as MD5.

A quick test implementation is showing speedups in AbcExport for some of our larger assets that don't take a long time to compute of about 4-5%.
The gains will be negligible for things that take a long time to compute (sims).
For assets that don't take a long time to compute, the savings could be 6-8%. (assuming that key calculation is about 10% of the asset write cost, which I was consistently seeing in my profiling results when I could generate them)

Questions, opinions?

Lucas

Joe Ardent

unread,
Mar 22, 2011, 5:38:54 PM3/22/11
to alemb...@googlegroups.com
Lucas Miller wrote:
> Hey folks,
>
> A while ago I mentioned that I was looking at another faster hashing
> function as an alternative to MD5.
>
> http://code.google.com/p/smhasher/wiki/MurmurHash3
> http://smhasher.googlecode.com/svn/trunk/MurmurHash3.cpp
> See: MurmurHash3_x64_128
>
> The advantages of using it over MD5 is mainly it's speed (3-5 times
> faster), the code is also a lot simpler.
> Testing it against a lot of our production data, it appears to be just
> as collision resistant as MD5.
>

Hey Lucas, I'm not opposed to this in principle, but I'm a little leery
of using this hash solely due to its lack of track record and proven
collision-resistance. Being generous and saying, "This will reduce
export times by 10%," that means that a one-hour export with the current
hashing function would take 54 minutes with murmurhash. I'm not
convinced that the decreased confidence in the hash is worth that
marginal gain. I appreciate that it doesn't seem to collide with the
assets you've tried out, but that's anecdotal evidence.

If we can rigorously show that MurmurHash is as collision-resistant as
MD5 (within, say, 10%), then I'm totally on board. Otherwise, I'm not
convinced that this is a bottleneck worth widening by changing the hash
function.

-Joe

Lucas Miller

unread,
Mar 22, 2011, 6:02:13 PM3/22/11
to alemb...@googlegroups.com
Absolutely 100% agree, I wouldn't have even suggested it if I wasn't fairly certain that it is as collision-resistant as MD5.

A few months ago, I calculated MD5 and Murmur3 keys for all the datasets within a Tako asset for the latest version of all of our assets for every shot for every show we had in production at that time.  The collision resistance on both algorithms appeared to be the same.
I can double check the implementation (I think a slight change was made to the algorithm since I last ran the test) and run it again.

I think we can also use:
http://code.google.com/p/smhasher/wiki/SMHasher

to help compare MD5 and Murmur3 collision-resistance.

Lucas

Joe Ardent

unread,
Mar 22, 2011, 6:46:09 PM3/22/11
to alemb...@googlegroups.com
Lucas Miller wrote:
> Absolutely 100% agree, I wouldn't have even suggested it if I wasn't
> fairly certain that it is as collision-resistant as MD5.
> .....

>
> I think we can also use:
> http://code.google.com/p/smhasher/wiki/SMHasher
>
> to help compare MD5 and Murmur3 collision-resistance.
>

Hmm, that test suite looks pretty good, and if the results from it are
encouraging, that would go a long way towards me getting on board with this.

And doing some more reading, I see that it's being used in things like
memcached and other large-dataset applications. I'm starting to come
around to this idea :)


-Joe

Lucas Miller

unread,
Mar 24, 2011, 9:23:29 PM3/24/11
to alemb...@googlegroups.com
Update:

Rerunning the test against the latest version of our assets has demonstrated that the latest version of Murmur3 is as collision-resistant as MD5 on this type of data and was on average about 5 times faster.

I just need to mess around with:
http://code.google.com/p/smhasher/wiki/SMHasher

to be thoroughly convinced.

Lucas

Joe Ardent

unread,
Mar 25, 2011, 3:55:31 PM3/25/11
to alemb...@googlegroups.com
Lucas Miller wrote:
> Update:
>
> Rerunning the test against the latest version of our assets has
> demonstrated that the latest version of Murmur3 is as
> collision-resistant as MD5 on this type of data and was on average about
> 5 times faster.
>
> I just need to mess around with:
> http://code.google.com/p/smhasher/wiki/SMHasher
>
> to be thoroughly convinced.
>

Yeah, I'll be convinced if we get good numbers for it out of that test.
Thanks, Lucas!

Lucas Miller

unread,
Mar 25, 2011, 8:46:24 PM3/25/11
to alemb...@googlegroups.com
Yeah, the results look VERY good.

I only had to make a very small addition in order to get it to use all the bits for MD5 (the one included for comparison was only using 32 bits)

I don't know if I would trust the speed info for md5, I'm guessing the version being used was optimized for 32 bit machines.
The version of Murmur3 I'm using was optimized for 64 bit machines.

The only test I left off was the Seed test, since the implementation of md5 that was included didn't use any seed values, so it's failure rate was high for that reason.

Let me know if you have any questions, or see something I may have missed.

Lucas
md5_128.txt
murmur_128.txt

Joe Ardent

unread,
Mar 25, 2011, 9:43:54 PM3/25/11
to alemb...@googlegroups.com
Sweet!

OK, I'm down with this. The only remaining hurdle to clear is:

- is this as robust as md5 with a 64-bit hash returned (that is, is
taking the first 64 bits as valid as md5)?

- does this return the same hash for the same value on 32 vs. 64-bit
machines?

If the answer is yes, then let's DO THIS SHIT.

Lucas Miller

unread,
Mar 25, 2011, 10:02:27 PM3/25/11
to alemb...@googlegroups.com
Feel free to correct me if I'm wrong, but I believe the Avalanche and the "Combination Lowbits", "Combination Highbits" tests demonstrate
this.

I believe the low bias values show that it is as suitable as md5 for use of those 64 bits in  boost::unordered_hashmap.


- is this as robust as md5 with a 64-bit hash returned (that is, is taking the first 64 bits as valid as md5)?



If not, there is an implementation optimized for 32 bit machines.
I'll compare keys on a handful of inputs to make sure things are correct between 32 and 64.

Reply all
Reply to author
Forward
0 new messages