Hello Vladut,
You are right. I did some experimental tests with the most obvious way
key names are generated, that is "prefix:<num>", and I don't like the
results too much.
In a 100-node cluster this is not going to be a problem since most
buckets are balanced, but from time to time there are buckets with
very, very few keys.
If one has a 4096-node cluster, or a node has the "wrong" buckets
assigned, then this is going to create problems. I did some tests
previously, but always simulating real clusters, never checking what
the distribution was in the actual 4096 buckets. A cluster of 100 nodes
will mask the bias, but you can see it when you look at the single
buckets.
I guess it is time to change it now. Ideas about what we should use?
It should be fast and *easy* to implement, as client implementations
will need to have it as well.
What we need is a hash function with good distribution, *especially*
in the lower 12 bits.
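To make the test concrete, this is more or less the kind of check I run
(a C sketch, with djb2 used only as an example hash under test and 1
million "key:<num>" keys; it counts how many keys land in each of the
4096 buckets):

    /* Sketch of the distribution test: hash "key:<num>" keys, take the
     * lower 12 bits as the bucket, and report best/worst bucket occupancy.
     * djb2 is just the example hash under test here, not a proposal. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t djb2(const char *s) {
        uint32_t h = 5381;
        while (*s) h = ((h << 5) + h) + (uint8_t)*s++; /* h*33 + c */
        return h;
    }

    int main(void) {
        static int buckets[4096];
        char key[32];
        for (int i = 0; i < 1000000; i++) {
            snprintf(key, sizeof(key), "key:%d", i);
            buckets[djb2(key) & 0xFFF]++;   /* lower 12 bits -> bucket */
        }
        int min = buckets[0], max = buckets[0];
        for (int j = 1; j < 4096; j++) {
            if (buckets[j] < min) min = buckets[j];
            if (buckets[j] > max) max = buckets[j];
        }
        printf("expected per bucket: %d, best: %d, worst: %d\n",
               1000000 / 4096, max, min);
        return 0;
    }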
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotle
Actually, while the djb hash function has a very bad distribution, the
crc16 is showing a very good distribution.
Actual test: 1 million keys from "key:0" to "key:999999".
The distribution is the following (I can't spam the list with too much
data so I'll not provide all the details):
Expected elements per bucket: 244
What I got instead:
"best" bucket: 269 elements
"worst" bucket: 206 elements
I would say this is good, especially since this key generation is a
very good test: the tail of the string always uses the same 10 bytes
(the digits) and the prefix is always the same.
Under the same test djb performs very, very poorly: for instance, the
best bucket contains 781 elements and the worst just 10.
Waiting for your tests as well, but I'm pretty confident we are well
served by crc16, especially since we use almost all of its output (12
of the 16 bits), and this can be good with small keys compared to
bigger hash functions where we would take just the lowest 12 bits of a
much larger output.
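By the way, a bitwise CRC16 is short enough that clients can embed it
easily. As an example (the exact variant is not fixed yet; this sketch
just assumes the CCITT/XMODEM polynomial 0x1021 with a zero initial
value), the bucket is simply the lowest 12 bits of the output:

    /* Sketch: one possible CRC16 (CCITT/XMODEM, polynomial 0x1021,
     * initial value 0). The variant is an assumption, not a decision. */
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t crc16(const char *buf, size_t len) {
        uint16_t crc = 0;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)((uint8_t)buf[i]) << 8;
            for (int b = 0; b < 8; b++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    /* Key -> bucket: keep only the lowest 12 bits (4096 buckets). */
    static unsigned key_to_bucket(const char *key, size_t len) {
        return crc16(key, len) & 0x0FFF;
    }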
Salvatore
Hi Sergei, the problem with MD5 is that it is too slow for this application.
Redis internally needs to perform many key -> slot conversions, and
using MD5 would waste resources: MD5 not only offers good distribution
(so good that it is impossible to statistically tell the difference
from true random data), but is also designed to be "one way", so that
it is computationally very hard to recover information about the input
from the output, and so forth.
Fortunately we don't need all that: we only need a decent distribution
in the lowest 12 bits of the output when hashing biased strings (keys
are often far from random, with a prefix that is always fixed and with
bytes mostly in the A-z 0-9 range).
So we can pick something much faster. More about CRC16 in the reply to
the other email from you :)
Exactly, CRC16 and CRC32 are designed to make sure that flipped bits or
a few different bytes in the input will *likely* result in a different
output. A hash function instead should be designed for good
distribution, that is, different inputs should produce any of the
possible 65536 outputs (in the case of a 16 bit output) with the same
probability, even when hashing very related inputs, that is, inputs
that are not random (like our keys in the form key:0, key:1 and so
forth). (Note: this is not the only property of a hash function;
another is, for instance, how much the output changes when you perform
a very small change in the input.)
Now what is interesting is that even if CRC16 is designed for a
different purpose, this does not mean it completely fails at the other
task. Also note that we use a lot of the CRC16 output, and always the
*same* part of it, that is, the lowest 12 bits, so this is not like
using CRC16 to implement a hash table that can have different sizes.
We are interested in the specific distribution of the lowest 12 bits
of the CRC16 output.
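Just to spell out the difference (the helper names here are only for
illustration): with a resizable hash table the bits of the output that
matter change with the table size, while in our case the 4096 buckets
are fixed, so we always read exactly the same 12 bits:

    #include <stdint.h>
    #include <stddef.h>

    /* Resizable hash table: which output bits matter depends on the
     * current table size. */
    static size_t table_index(uint32_t hash, size_t table_size) {
        return hash % table_size;
    }

    /* Our case: the bucket count is fixed at 4096, so we always use
     * exactly the same lowest 12 bits of the same 16 bit output. */
    static unsigned fixed_bucket(uint16_t crc) {
        return crc & 0x0FFF;
    }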
And well, this works remarkably well for some reason I'm not
completely sure I understand.
See for instance the output of the following Ruby program I
implemented for this purpose:
https://gist.github.com/1256122
You can switch the UseSHA1 constant to true or false in order to
switch from crc16 to SHA1; you'll see that for this use case the two
hash functions perform about the same.
You may think that it is performing so well because the sample size is
1000000 elements, but actually even if you use just 4096 as the sample
size, where you would expect to find roughly from 0 to 2 elements per
hash bucket, the result is very similar to a perfect random
distribution.
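To give an idea of what "perfect random" means with 4096 keys over
4096 buckets: the occupancy should roughly follow a Poisson
distribution with mean 1, so about 37% of the buckets empty, 37% with
one key, 18% with two, and very few with more. A tiny C sketch of that
reference distribution (pure math, nothing Redis specific):

    /* Expected number of buckets holding k keys when 4096 keys are
     * spread uniformly at random over 4096 buckets: with mean 1 the
     * occupancy is approximately Poisson, P(k) = e^-1 / k! */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double m = 4096.0;   /* number of buckets */
        double p = exp(-1.0);      /* P(0) */
        for (int k = 0; k <= 4; k++) {
            printf("buckets with %d keys: ~%.0f (%.1f%%)\n",
                   k, m * p, 100.0 * p);
            p /= (k + 1);          /* P(k+1) = P(k) / (k+1) */
        }
        return 0;
    }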
So I think CRC16 is fine for us.
From a "mathematical" standpoint I think that even if CRC16 is a weak
hash function we are getting the best here since we get a 16 bit
output obtained using an prefix that is always the same (so this is an
invariant and we can ignore it) and an input that is usually even for
small datasets more than 3 bytes. So this is enough to guarantee us
with a great lower 12 bit distribution.
If some expert could confirm that I would be much happier ;)
Cheers,
Salvatore
Just noticed this thread about hashing and your concerns about distribution.
For the Redis use cases, cryptographic hashes are slower and provide no value relative to a good non-crypto hash function.
Given your stated common case of <common prefixes><numbers>,
I strongly suggest looking at the Larson hash function.
It's simple, fast, and has particularly good behavior in similar tests:
- http://www.strchr.com/hash_functions
(See in particular their test cases called "Numbers" and "Prefix".)
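If it helps, here is a minimal C sketch of the Larson hash as I
understand it from the page above (the seed value is an arbitrary
example):

    /* Sketch of the Larson multiplicative string hash: fold each byte
     * in with a multiply by 101, starting from a seed. */
    #include <stdint.h>

    static uint32_t larson_hash(const char *s, uint32_t seed) {
        uint32_t h = seed;
        while (*s)
            h = h * 101 + (unsigned char)*s++;
        return h;
    }

    /* Example mapping onto the 4096 buckets discussed in this thread:
     *     unsigned bucket = larson_hash(key, 0) & 0x0FFF;             */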
Go wild,
John
Given the good results I obtained in my tests, and also the evidence
in that table that CRC is not so bad for hashing, contrary to common
belief, I would take CRC16 unless evidence against it in our use case
(lower 12 bits for hashing keys into a fixed-size table) is found.
Thanks for this email, very interesting link!
Salvatore
"For table size less than 2^16, we can improve the quality of hash
function by XORing high and low words, so that more letters will be
taken into account"
This appears to support my hypothesis that, since we take 12 bits of
output from a 16 bit hash function, we are getting better results
compared to a 32 bit hash function where only 12 of the output bits
are used.
I guess we could similarly improve the result even more by XORing the
upper 4 bits into odd positions across the lower 12 bits of the
output, but this seems to make the specification more complex without
a good reason...
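For completeness, this is the kind of folding I mean, in a simplified
form (just an illustration, not a proposal to change the plain
"& 0xFFF" rule):

    #include <stdint.h>

    /* 32 bit hash: XOR the high and low 16 bit words before keeping
     * the lower 12 bits (the improvement quoted above). */
    static unsigned bucket_from_u32(uint32_t h) {
        return ((h >> 16) ^ (h & 0xFFFF)) & 0x0FFF;
    }

    /* 16 bit CRC: fold the top 4 bits back into the lower 12 bits,
     * a simpler variant of the idea described above. */
    static unsigned bucket_from_u16(uint16_t h) {
        return (unsigned)((h >> 12) ^ h) & 0x0FFF;
    }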
Cheers,
Salvatore