MurmurHash statistical analysis done, reports posted.

93 visualizações
Pular para a primeira mensagem não lida

Austin Appleby

não lida,
11 de mar. de 2008, 03:28:3611/03/2008
para Hash Functions
Results can be found here - http://murmurhash.googlepages.com/statistics

Summary - Murmur is not perfectly random and shows very slight
weaknesses with sparse keys, but even in the worst case should perform
within 0.25% or so of a perfectly random hash.


It's 37% faster than both SuperFastHash and lookup3 and 307% faster
than FNV in bulk speed tests.

When hashing small keys, it's 15% faster than SuperFastHash, 7% faster
than lookup3, and 31% faster than FNV.

-Austin

Andres Valloud

não lida,
11 de mar. de 2008, 03:49:3811/03/2008
para hash-fu...@googlegroups.com
Austin,

Something I've found quite useful is to run the chi^2 tests modulo p, for several of the p that will be used to look at hash values modulo p.  In other words, this assumes that if f(x) is a hash function, the indexing to the hash table will be done via f(x) mod p.  This has a number of advantages that are described in Knuth's TAOCP, but basically refer to avoiding clustering.  Note that the values of p have to be very well chosen... not just any p is good.

Furthermore, the Hash Analysis Tool uses as many chi^2 buckets as it needs to achieve full resolution.  In other words, if for a dataset of 10^6 objects the hash function produces 999000 hash values, then it will measure all of the 999000 hash buckets.

This is done in an unbounded fashion first, just looking at f(x).  This helps identify problems with f(x) before mod p.  Then, the test is repeated for 25 values of p greater than the amount of objects being hashed, again measuring with one bucket per value of f(x) mod p.  This can find subtle distribution problems that things like avalanche alone will definitely not find.  In fact, I do not like the avalanche test a whole lot.  It usually fails for perfect hash functions, and it can also give excellent scores for very poor hash functions.

Usually, multiplicative hash functions such as FNV fail the chi^2 mod p test because unacceptable values are found for some p.  As the full resolution chi^2 / chi^2 mod p tests work by simulating the inner workings of the hashed collection, I find it much easier to trust their results.

Since the prime chosen for FNV-32 does not have many bits turned on, this results in poor mixing as you comment.  Knuth advises against this as well, and gives precise criteria to find good factor candidates.  All of this plus many other details are in the hash book.

Andres.

Austin Appleby

não lida,
11 de mar. de 2008, 13:02:5511/03/2008
para Hash Functions
I skimmed through your book at the last saturdayhouse meeting - I'll
probably buy a copy. :)

My bucketing test is pretty speedy, I'll add some non-power-of-2 sizes
and see what I get.

The main thing I was trying to prove with the stats is that Murmur is
at least as good for "normal" hash table use (power-of-2 buckets) as
lookup3, and potentially much faster. So, I'm not too worried if it
fails for certain particular values of p. :)



On Mar 11, 12:49 am, "Andres Valloud" <andres.vall...@gmail.com>
wrote:
> On Tue, Mar 11, 2008 at 12:28 AM, Austin Appleby <tanj...@gmail.com> wrote:
>
> > Results can be found here -http://murmurhash.googlepages.com/statistics

Austin Appleby

não lida,
11 de mar. de 2008, 14:56:5511/03/2008
para Hash Functions
I'm not seeing anything interesting/different for bucket sizes near a
power of two (tried all sizes within 20 of each power). Are there
"difficult" sizes I should try? Primes, etc?




On Mar 11, 12:49 am, "Andres Valloud" <andres.vall...@gmail.com>
wrote:
> On Tue, Mar 11, 2008 at 12:28 AM, Austin Appleby <tanj...@gmail.com> wrote:
>
> > Results can be found here -http://murmurhash.googlepages.com/statistics

Andres Valloud

não lida,
11 de mar. de 2008, 15:16:4011/03/2008
para hash-fu...@googlegroups.com
Austin,

I cannot think of bad values for the modulo that stand out... on the other hand I don't think I'd see them because the tool (and the hashed collections) use well chosen primes.

Andres.

Austin Appleby

não lida,
11 de mar. de 2008, 16:22:4711/03/2008
para Hash Functions
What primes are you using?


On Mar 11, 12:16 pm, "Andres Valloud" <andres.vall...@gmail.com>
wrote:
> Austin,
>
> I cannot think of bad values for the modulo that stand out... on the other
> hand I don't think I'd see them because the tool (and the hashed
> collections) use well chosen primes.
>
> Andres.
>

Andres Valloud

não lida,
11 de mar. de 2008, 17:43:1111/03/2008
para hash-fu...@googlegroups.com
Austin,

The implementation of Set class>>largePrimes is below.  It has the list of primes and the method of construction.

Andres.

Set class>>largePrimes
    "Answer a list of large prime numbers, suitable for the sizes of
    collections.  See below the table for more information"

    ^#(
        2237 2423 2617 2797 2999 3167 3359 3539 3727 3911
        4441 4787 5119 5471 5801 6143 6521 6827 7177 7517 7853
        8887 9587 10243 10937 11617 12289 12967 13649 14341 15013 15727
        17749 19121 20479 21859 23209 24593 25939 27329 28669 30047 31469
        35507 38231 40961 43711 46439 49157 51893 54617 57347 60077 62801
        70583 75619 80669 85703 90749 95783 100823 105871 110909 115963 120997 126031
        141157 151237 161323 171401 181499 191579 201653 211741 221813 231893 241979 252079
        282311 302483 322649 342803 362969 383143 403301 423457 443629 463787 483953 504121
        564617 604949 645313 685609 725939 766273 806609 846931 887261 927587 967919 1008239
        1123477 1198397 1273289 1348177 1423067 1497983 1572869
            1647761 1722667 1797581 1872461 1947359 2022253
        2246953 2396759 2546543 2696363 2846161 2995973 3145739
            3295541 3445357 3595117 3744941 3894707 4044503
        4493921 4793501 5093089 5392679 5692279 5991883 6291469
            6591059 6890641 7190243 7489829 7789447 8089033
        8987807 9586981 10186177 10785371 11384539 11983729 12582917
            13182109 13781291 14380469 14979667 15578861 16178053
        17895707 19014187 20132683 21251141 22369661 23488103 24606583
            25725083 26843549 27962027 29080529 30198989 31317469 32435981
        35791397 38028379 40265327 42502283 44739259 46976221 49213237
            51450131 53687099 55924061 58161041 60397993 62634959 64871921
        71582857 76056727 80530643 85004567 89478503 93952427 98426347 102900263
            107374217 111848111 116322053 120795971 125269877 129743807
        143165587 152113427 161061283 170009141 178956983 187904819 196852693 205800547
            214748383 223696237 232644089 241591943 250539763 259487603
        268435399)

    "Rather than list every prime, we've listed only enough so that
    between any number N and 2*N, there will be at least about 8 to
    choose from. This is a balance between a small list of primes, and
    getting a collection size that doesn't waste too much space.

    In addition, the primes in this table were chosen so that they do
    not divide 256^k +- a, for 1<=k<=8, and -32<=a<=32.  This is so that
    using them as a modulo will not have a tendency to just throw away
    the most significant bits of the object's hash.  See Knuth on hashing
    for details, TAOCP vol 3.

    Moreover, no primes are close to any power of two to avoid aliasing
    between hash functions based on bit manipulation and the moduli.
    This has been sporadically observed for hash functions that work
    mostly by bitwise concatenation of integer values.  See the Hash
    Analysis Tool for more details.

    The list is organized so that each line lists primes for a particular
    range starting at a power of two, starting at 2^11 and up to 2^27.
    The code used to generate this list is in a comment below the table.

    Note that for the following code to run, you will need an implementation
    of Integer>>isPrime.  Here is a sample one which, although far from a
    solution suitable for every single case, is enough for the purpose at hand.

Integer>>isPrime
    ""An integer is prime when it is positive and has exactly
    four distinct integer divisors: 1, -1, itself, and itself negated""

    self < 2 ifTrue: [^false].
    self even ifTrue: [^self = 2].
    3 to: self sqrt floor by: 2 do:
        [:eachDivisor |
            self \\ eachDivisor = 0 ifTrue: [^false]
        ].
    ^true

    The workspace code to generate the table above is as follows.

radix := 256.
valuesOfK := 1 to: 8.
valuesOfA := -32 to: 32.
valuesToCheck := OrderedCollection new.
(valuesOfK collect: [:each | radix raisedTo: each]) do:
    [:each | valuesOfA do: [:some | valuesToCheck add: each + some]].

(11 to: 27) collect:
    [:eachBasePower |
        base := 1 bitShift: eachBasePower.
        primesPerBaseMinusOne := eachBasePower // 4 + 8.
        distanceBetweenPrimes := base // (primesPerBaseMinusOne + 1).
        pivotInterval :=
            distanceBetweenPrimes + base
                to: base * 2 - distanceBetweenPrimes + 1
                by: distanceBetweenPrimes.
        pivotInterval collect:
            [:eachPivot |
                (eachPivot to: base * 2) detect:
                    [:any |
                        any isPrime and:
                            [valuesToCheck allSatisfy: [:such | such \\ any > 0]]
                    ]
            ]
    ].

((1 bitShift: 28) to: 1 by: -1) detect:
    [:any | any isPrime and: [valuesToCheck allSatisfy: [:such | such \\ any > 0]]]
Responder a todos
Responder ao autor
Encaminhar
0 nova mensagem