My responses also inline.
On Fri, Aug 31, 2012 at 11:59 AM, Brice Burgess <
bric...@gmail.com> wrote:
> Thanks Josiah! Responses inline;
> On Friday, August 31, 2012 12:53:38 PM UTC-5, Josiah Carlson wrote:
>>
>> What are you trying to do? Determine whether an entry exists?
>
>
> That is correct. Store all hashes in redis, and then check to see if a
> particular hash exists.
>
> Further, I want to compare the performance of this routine across the
> Redis, hfind, and MySQL backends.
I'll leave you to use those other backends, but for Redis it is straightforward.
>> So if you have a million sha1 hashes, you can expect roughly 22-23
>> MBytes of memory to be used by Redis, and you can check for the
>> existence of a hash at a pipelined rate of at least 100k per second.
>
> This sounds more than adequate... I do want to get an idea of resident
> memory usage for this, since the performance tradeoff of a disk-based
> database using binary search (e.g. MySQL MyISAM or hfind) will potentially
> allow more flexibility.
Overhead is a function of what structures you use and your particular
access patterns.
>> > Can anyone suggest the most efficient method to store and lookup SHA1
>> > hashes
>> > in redis? Do you suggest using binary keys (SHA1 -> 160bit hash value ->
>> > CHAR(40) or BINARY(20))?
>>
>> Redis doesn't do anything to the strings you send it (unless it can be
>> interpreted as a base-10 integer). So if you send it 20 bytes, they
>> will be stored more efficiently than 40 bytes.
> That sounds good. One intriguing way to [potentially] keep memory usage down
> is to do a lookup of a lookup (a sort of key compression). So for the hash
> "FFFFFE0800F778DC4FEDC6BD756ADD3BF5CE56CE", I would take the first 5
> characters and use them as the key, and store the remainder in a set under
> that key. E.g.
>
> redis> SADD FFFFF E0800F778DC4FEDC6BD756ADD3BF5CE56CE
> (integer) 1
> redis> SISMEMBER FFFFF E0800F778DC4FEDC6BD756ADD3BF5CE56CE
> (integer) 1
>
> For this to be effective I assume I'd want to find the optimal area of
> collision among the hashes... e.g. the first 4 characters, the first 5, last
> 5, etc.
>
> Keep in mind that we're dealing with *47 million* hashes. Would you
> recommend this segmented approach?
The typical way of reducing memory usage for strictly lookup-like
operations is to "shard" your structures. Your use of a prefix, suffix,
etc., of the hashes is the same idea.
At 47 million entries, assuming a desired shard size of at most 1000
entries (not unreasonable for a <20 byte data value), you will need
approximately 94,000 shards, i.e. an average of roughly 500 entries per
shard (if you assume your sha1 hashes are essentially random, you can
expect the fullest bucket to hold roughly 2x what you would expect on
average; if it is closer to 3x, you may want to consider a different set
of bytes/bits). If you round that up to 128k shards, that's 17 bits of
the hash. That's not super convenient to extract (64k would have been
easier), but it's also not bad. For checking whether you should use a
prefix, suffix, or something else, you can assume 16 bits and find out
whether any bucket has more than 1435 entries (2 * 47 million / 65536).
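If it helps, here is a rough sketch of that bucket check in plain Python
(untested; "digests" stands for whatever iterable yields your 47 million
sha1 values as raw 20-byte strings - that part is up to you):

    from collections import Counter

    def bucket_counts(digests):
        # Count how many digests land in each 16-bit prefix bucket.
        counts = Counter()
        for digest in digests:          # each digest is 20 raw bytes
            counts[digest[:2]] += 1     # first 2 bytes = 16-bit shard key
        return counts

    # If the fullest bucket stays under ~1435 (2x the ~717 average), a
    # 16-bit prefix is fine; if it is closer to 3x, try a different slice.
    print(max(bucket_counts(digests).values()))

The same loop works for a suffix or any other slice of the hash - just
change which two bytes you pull out of each digest.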
As for actually storing it, the SET structure in Redis only uses a
high-efficiency encoding for sets of integers that fit within the signed
long int limit on your platform (or it may be 64-bit on all platforms; I
don't remember offhand). Sharded SETs of hash strings don't get a
compact encoding, so you won't see a size benefit there. On
the other hand, you could map your entries into sharded HASH
structures, storing an empty string as the value. The overhead of an
empty string is 1 byte in a ziplist, plus the 1 byte for the short
key, plus the 18 bytes for the hash itself (if you trimmed off the
part of the hash used for sharding). That means that if you use a
properly sharded HASH, you are storing your data about as efficiently
as you could store it in a sorted file (plus overall structure
overhead, which would be at most a bit per entry for hashes that max
out at around 1000 entries).
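Here is a rough, untested sketch of that layout with the redis-py
client, splitting each raw 20-byte digest into a 2-byte shard key and an
18-byte field, with an empty string as the value (the "sha1:" key prefix
is just my choice, nothing special):

    import binascii
    import redis

    r = redis.StrictRedis()

    def add_hash(hex_sha1):
        digest = binascii.unhexlify(hex_sha1)   # 40 hex chars -> 20 raw bytes
        shard, rest = digest[:2], digest[2:]    # 16-bit shard key, 18-byte field
        r.hset(b'sha1:' + shard, rest, '')      # empty string as the value

    def has_hash(hex_sha1):
        digest = binascii.unhexlify(hex_sha1)
        return r.hexists(b'sha1:' + digest[:2], digest[2:])

    def has_hashes(hex_sha1s):
        # Pipelining the lookups is what gets you into the 100k+
        # checks/second range, since it amortizes the round trips.
        pipe = r.pipeline(transaction=False)
        for hex_sha1 in hex_sha1s:
            digest = binascii.unhexlify(hex_sha1)
            pipe.hexists(b'sha1:' + digest[:2], digest[2:])
        return pipe.execute()                   # list of booleans

One caveat: for ~1000-entry shards to keep the compact encoding, you
need to raise hash-max-ziplist-entries (hash-max-zipmap-entries on older
Redis versions) above your per-shard maximum in the config.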
My calculator puts the expected data size at roughly 940 megabytes to
store all of the hashes, though it will probably take an extra 10-20%
or so due to memory fragmentation. Call it a bit over a gig.
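(That is just the arithmetic from above spelled out: 47,000,000 entries
times roughly 20 bytes per entry comes to about 940 MB before
fragmentation.)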
As for Redis' performance compared to MySQL or another on-disk B-tree:
Redis will likely win even if you give the contenders enough cache to
keep everything in memory. Why? Access patterns on binary search trees
are roughly equivalent to a random walk through memory, which is awful
for CPU caches. B+ trees are not significantly better, unless you *just*
accessed the tree node. A ziplist, on the other hand, is a linear packed
blob of data, so the memory access pattern is very cache-friendly: you
get one cache miss up front, followed by streamed memory reads - meaning
you run into memory bandwidth limits rather than memory latency limits.
Regards,
- Josiah