Fastest way to store and lookup SHA1 hashes

Brice Burgess

Aug 30, 2012, 4:37:47 PM
to redi...@googlegroups.com
We're currently using `hfind` to store and check files against the NIST
NSRL ( http://www.nsrl.nist.gov/ ). This process is referred to as
de-nisting, and provides a means to ignore known system [and irrelevant]
files when examining/crawling a filesystem.

I'd like to compare (via benchmarks and memory usage) the `hfind` tool,
which uses a BTREE search, against Redis. `hfind` is part of the
sleuthkit project ( http://www.sleuthkit.org ).

Can anyone suggest the most efficient method to store and lookup SHA1
hashes in redis? Do you suggest using binary keys (SHA1 -> 160bit hash
value -> CHAR(40) or BINARY(20))?

Many thanks for your input!

~ Brice






Josiah Carlson

Aug 31, 2012, 1:53:33 PM
to redi...@googlegroups.com
On Thu, Aug 30, 2012 at 1:37 PM, Brice Burgess <bric...@gmail.com> wrote:
> We're currently using `hfind` to store and check files against the NIST NSRL
> ( http://www.nsrl.nist.gov/ ). This process is referred to as de-nisting,
> and provides a means to ignore known system [and irrelevant] files when
> examining/crawling a filesystem.
>
> I'd like to compare (via benchmarks ++ memory usage) the `hfind` tool which
> uses BTREE search against Redis. `hfind` is part of the sleuthkit project (
> http://www.sleuthkit.org ).

What are you trying to do? Determine whether an entry exists? If so,
there are a variety of methods that can get you roughly 2-3 bytes of
overhead (in aggregate) for the hashes you want to store, and will let
you check hashes at a rate that is dependent on your network
performance (100k checks/second would take 2+MBytes/second network
bandwidth).

So if you have a million sha1 hashes, you can expect roughly 22-23
MBytes of memory to be used by Redis, and you can check for the
existence of a hash at a pipelined rate of at least 100k per second.
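
As a rough, untested sketch of what that pipelined check could look
like (assuming the redis-py client, a plain SET under an arbitrary key
"nsrl:sha1" holding the stored hashes, and an arbitrary batch size):

import redis

r = redis.Redis()

def check_hashes(hashes, batch=1000):
    """Yield (hash, exists) pairs via pipelined SISMEMBER calls."""
    for i in range(0, len(hashes), batch):
        chunk = hashes[i:i + batch]
        pipe = r.pipeline(transaction=False)
        for h in chunk:
            pipe.sismember("nsrl:sha1", h)
        for h, found in zip(chunk, pipe.execute()):
            yield h, bool(found)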

> Can anyone suggest the most efficient method to store and lookup SHA1 hashes
> in redis? Do you suggest using binary keys (SHA1 -> 160bit hash value ->
> CHAR(40) or BINARY(20))?

Redis doesn't do anything to the strings you send it (unless they can
be interpreted as a base-10 integer). So if you send it 20 bytes, they
will be stored more efficiently than 40 bytes.
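
For instance (just a sketch; Python shown only because the conversion
is a one-liner):

import hashlib

digest_hex = hashlib.sha1(b"file contents here").hexdigest()  # 40 hex chars
digest_raw = bytes.fromhex(digest_hex)                        # 20 raw bytes
assert len(digest_hex) == 40 and len(digest_raw) == 20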

Regards,
- Josiah

Brice Burgess

Aug 31, 2012, 2:59:32 PM
to redi...@googlegroups.com
Thanks Josiah! Responses inline:


On Friday, August 31, 2012 12:53:38 PM UTC-5, Josiah Carlson wrote:
What are you trying to do? Determine whether an entry exists?

That is correct. Store all hashes in redis, and then check to see if a particular hash exists.

Further, I want to test the performance of this routine across the Redis, hfind, and MySQL backends.



So if you have a million sha1 hashes, you can expect roughly 22-23
MBytes of memory to be used by Redis, and you can check for the
existence of a hash at a pipelined rate of at least 100k per second.


This sounds more than adequate... I do want to get an idea of resident memory usage for this, as the performance tradeoff of a disk-based database using binary search (e.g. MySQL MyISAM and hfind) may potentially allow more flexibility.

 

> Can anyone suggest the most efficient method to store and lookup SHA1 hashes
> in redis? Do you suggest using binary keys (SHA1 -> 160bit hash value ->
> CHAR(40) or BINARY(20))?

Redis doesn't do anything to the strings you send it (unless it can be
interpreted as a base-10 integer). So if you send it 20 bytes, they
will be stored more efficiently than 40 bytes.


That sounds good. One intriguing way to [potentially] keep memory usage down is to do a lookup of a lookup (a sort of key compression). So for the hash "FFFFFE0800F778DC4FEDC6BD756ADD3BF5CE56CE", I would take the first 5 characters and use them as the key lookup, and store the remainder in a set under that key. E.g.

redis> SADD FFFFF E0800F778DC4FEDC6BD756ADD3BF5CE56CE
(integer) 1
redis> SISMEMBER FFFFF E0800F778DC4FEDC6BD756ADD3BF5CE56CE
(integer) 1
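
In Python (assuming the redis-py client; untested sketch, and the
5-character prefix is just a first guess), that would look roughly like:

import redis

r = redis.Redis()
PREFIX_LEN = 5  # characters of the hex digest used as the set key

def add_hash(sha1_hex):
    # split "FFFFFE08..." into key "FFFFF" and member "E08..."
    prefix, rest = sha1_hex[:PREFIX_LEN], sha1_hex[PREFIX_LEN:]
    return r.sadd(prefix, rest)

def has_hash(sha1_hex):
    prefix, rest = sha1_hex[:PREFIX_LEN], sha1_hex[PREFIX_LEN:]
    return bool(r.sismember(prefix, rest))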

For this to be effective, I assume I'd want to find the slice of the hash that gives the most even collision/bucketing behavior... e.g. the first 4 characters, the first 5, the last 5, etc.

Keep in mind that we're dealing with *47 million* hashes. Would you recommend this segmented approach?

Many thanks!

~ Brice

Sergei Tulentsev

Sep 1, 2012, 1:54:04 AM
to redi...@googlegroups.com
Hi, Brice!

One intriguing way to [potentially] keep memory usage down is to do a lookup of a lookup (a sort of key compression). So for the hash "FFFFFE0800F778DC4FEDC6BD756ADD3BF5CE56CE", I would take the first 5 characters and use them as the key lookup, and store the remaining in a set for that key.

I'm not sure how this is supposed to save you memory. At any rate, with 47M hashes you will save mere megabytes (at most). Is it worth spending your time on it?


--
Best regards,
Sergei Tulentsev

Josiah Carlson

Sep 1, 2012, 10:07:10 AM
to redi...@googlegroups.com
My responses also inline.

On Fri, Aug 31, 2012 at 11:59 AM, Brice Burgess <bric...@gmail.com> wrote:
> Thanks Josiah! Responses inline;
> On Friday, August 31, 2012 12:53:38 PM UTC-5, Josiah Carlson wrote:
>>
>> What are you trying to do? Determine whether an entry exists?
>
>
> That is correct. Store all hashes in redis, and then check to see if a
> particular hash exists.
>
> Further; I want to test the performance of this routine against using Redis,
> hfind, and mysql backends.

I'll leave you to use those other backends, but for Redis it is straightforward.

>> So if you have a million sha1 hashes, you can expect roughly 22-23
>> MBytes of memory to be used by Redis, and you can check for the
>> existence of a hash at a pipelined rate of at least 100k per second.
>
> This sounds more than adequate... I do want to get an idea of resident
> memory usage for this, as potentially the performance tradeoff of a disk
> based database using binary search (e.g. mysql myisam && hfind) will allow
> more flexibility.

Overhead is a function of what structures you use and your particular
access patterns.

>> > Can anyone suggest the most efficient method to store and lookup SHA1
>> > hashes
>> > in redis? Do you suggest using binary keys (SHA1 -> 160bit hash value ->
>> > CHAR(40) or BINARY(20))?
>>
>> Redis doesn't do anything to the strings you send it (unless it can be
>> interpreted as a base-10 integer). So if you send it 20 bytes, they
>> will be stored more efficiently than 40 bytes.

> That sounds good. One intriguing way to [potentially] keep memory usage down
> is to do a lookup of a lookup (a sort of key compression). So for the hash
> "FFFFFE0800F778DC4FEDC6BD756ADD3BF5CE56CE", I would take the first 5
> characters and use them as the key lookup, and store the remaining in a set
> for that key. E.g.
>
> redis> SADD FFFFF E0800F778DC4FEDC6BD756ADD3BF5CE56CE
> (integer) 1
> redis> SISMEMBER FFFFF E0800F778DC4FEDC6BD756ADD3BF5CE56CE
> (integer) 1
>
> For this to be effective I assume I'd want to find the optimal area of
> collision among the hashes... e.g. the first 4 characters, the first 5, last
> 5, etc.
>
> Keep in mind that we're dealing with *47 million* hashes. Would you
> recommend this segmented approach?

The typical way of reducing memory usage for strictly lookup-like
operations is to "shard" your structures. Your use of a prefix,
suffix, etc., of the hashes is the same idea.

At 47 million entries, assuming a desired shard size of at most 1000
entries (not unreasonable for a <20 byte data value), you will need
approximately 94,000 shards, i.e. an average of ~500 entries per shard
(if you assume your sha1 hashes are essentially random, you can expect
the bucket with the highest number of entries to be roughly 2x what
you would expect on average, hence the 2x headroom below the
1000-entry cap; at 3x, you may want to consider a different set of
bytes/bits). If you round that up to 128k shards, that's 17 bits of
the hash. That's not super convenient to extract (64k would have been
easier), but it's also not bad. For checking whether you should use a
prefix, suffix, or something else, you can assume 16 bits and find out
whether you have any buckets with more than 1435 entries
(2 * 47 million / 65536).
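
A quick way to check a candidate slice is to just count the bucket
sizes offline (untested sketch; assumes a text file with one 40-char
hex digest per line, and the filename is only illustrative):

from collections import Counter

def worst_bucket(path, start=0, length=4):
    """Count entries per slice of the digest and return the fullest bucket."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[line[start:start + length]] += 1
    return counts.most_common(1)[0]  # (bucket, entry count)

# e.g. worst_bucket("nsrl_sha1.txt")            -> first 4 hex chars (16 bits)
#      worst_bucket("nsrl_sha1.txt", start=36)  -> last 4 hex chars
# A worst bucket much past ~1435 entries suggests trying a different slice.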

As for actually storing it, it turns out that the SET structure in
Redis only uses a high-efficiency encoding for sets of integers that
fit within the signed long int limit on your platform (or it may be
64-bit on all platforms, I can't remember the source). Sharded sets of
non-integer members don't get the compact ziplist-style encoding, so
you won't see a size benefit there. On the other hand, you could map
your entries into sharded HASH structures, storing an empty string as
the value. The overhead of an empty string is 1 byte in a ziplist,
plus the 1 byte for the short key, plus the 18 bytes for the hash
itself (if you trimmed off the part of the hash used for sharding).
That means that if you use a properly sharded HASH, you are storing
your data about as efficiently as you could store it in a sorted file
(plus overall structure overhead, which would be at most a bit per
entry for a hash size that maxes out at 1000 entries).
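
In code, the sharded HASH approach could look something like this
(untested sketch with redis-py; the 4-hex-char shard key is one way to
take 16 bits, and you would want hash-max-ziplist-entries in your
config raised above your largest expected bucket so the compact
encoding sticks):

import redis

r = redis.Redis()
SHARD_CHARS = 4  # first 16 bits of the hash used as the HASH key

def add_hash(sha1_hex):
    shard = sha1_hex[:SHARD_CHARS]
    rest = bytes.fromhex(sha1_hex[SHARD_CHARS:])  # remaining 18 raw bytes
    r.hset(shard, rest, "")                       # empty string as the value

def has_hash(sha1_hex):
    shard = sha1_hex[:SHARD_CHARS]
    rest = bytes.fromhex(sha1_hex[SHARD_CHARS:])
    return r.hexists(shard, rest)

# e.g. raise the encoding limit (Redis 2.6+) so buckets of up to ~2000
# entries stay in the compact representation:
# r.config_set("hash-max-ziplist-entries", 2048)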

My calculator puts the expected data size at roughly 940 megabytes to
store all of the hashes, though it will probably take an extra 10-20%
or so due to memory fragmentation. Call it a bit over a gig.

As for Redis' performance compared to MySQL or another on-disk
B-tree: Redis will likely win even if you give the contenders enough
cache to store everything in memory. Why? Access patterns on binary
search trees are roughly equivalent to a random walk through memory,
which is awful for CPU caches. B+ trees are not significantly better
unless you *just* accessed the tree node. On the other hand, because a
ziplist encoding is a linear packed blob of data, the memory access
pattern is very easy to cache, so you get one cache miss up front,
followed by streamed memory - meaning you run at memory bandwidth
limits (and not memory latency limits).

Regards,
- Josiah