An early BloomFilter implementation


tomgibara

Jun 1, 2010, 7:14:11 AM
to guava-discuss
I've been slowly working towards creating a solid BloomFilter
implementation to replace my previous ad-hoc implementations. It's
based on the interface listed in the Guava issue tracker. I've
found the tricky thing to be defining a practical API for generalizing
hashing.

There's a little more information in this blog post:

http://blog.tomgibara.com/post/652979675/bloom-filters-and-hashing

Though it's very early in the implementation, could it form the
basis for something useful in the Guava libraries (which, as far as I
know, still lack a BloomFilter implementation)?

tomgibara

Jun 7, 2010, 8:25:45 AM
to guava-discuss
Irrespective of whether my Bloom filter implementation is useful, this
may be of interest:

http://blog.tomgibara.com/post/672898521/multi-hashing-effectiveness

I just spent an hour investigating workable multi-hashing strategies
and wrote up my first findings (very informally) above.

Tom.

Kevin Bourrillion

Jun 7, 2010, 10:45:08 AM
to guava-...@googlegroups.com
Great, thank you for this... it was also our experience that the Bloom filter itself was easy, but figuring out how to deal with the multiple hash functions was the part that brought us to a halt and put it on the back burner for years. :-(

For a reminder to others, the issue in question is:
http://code.google.com/p/guava-libraries/issues/detail?id=12

I may not be able to think further on this issue for a few weeks :-( but I hope to get back to it.





--
Kevin Bourrillion @ Google
http://guava-libraries.googlecode.com

Benjamin Manes

Jun 7, 2010, 2:23:32 PM
to guava-...@googlegroups.com
It would also be nice to see a comparison using double hashing, which will probably be faster than the random functions. I can send along an example implementation using this approach.

-Ben
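The double hashing Ben suggests is commonly realized with the Kirsch-Mitzenmacher construction, in which the k probe positions are derived from just two base hashes as h1 + i*h2 (mod m). A minimal sketch of that idea, with hypothetical names (this is not the implementation Ben sent):

```java
import java.util.Arrays;

public class DoubleHashing {
    // Derive k Bloom filter indexes from two base hashes:
    // index_i = (h1 + i * h2) mod m  (Kirsch-Mitzenmacher scheme)
    static int[] indexes(int h1, int h2, int k, int m) {
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // floorMod keeps the index non-negative even for negative sums
            result[i] = Math.floorMod(h1 + i * h2, m);
        }
        return result;
    }

    public static void main(String[] args) {
        // h1 = 17, h2 = 31, k = 4 probes into a 64-bit filter
        System.out.println(Arrays.toString(indexes(17, 31, 4, 64)));
        // prints [17, 48, 15, 46]
    }
}
```

Only two hash evaluations are needed per key, which is where the speed advantage over drawing k values from a Random comes from.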

Tom Gibara

Jun 7, 2010, 2:31:46 PM
to guava-...@googlegroups.com
I'd be happy to try out alternative hash implementations in my little test app and update the comparison spreadsheet. Hashing algorithms aren't an area I know much about, but if you can show me some code, it should be easy to do a preliminary test.

Tom.


-- 
Tom Gibara
email: m...@tomgibara.com
web: http://www.tomgibara.com
blog: http://blog.tomgibara.com
twitter: tomgibara

Martin Buchholz

Jun 7, 2010, 2:33:42 PM
to guava-...@googlegroups.com
Whenever java.util.Random is a performance bottleneck,
one can consider using a non-synchronized version.
One example of which is provided in JDK7/jsr166y via ThreadLocalRandom

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166y/ThreadLocalRandom.java?view=co

Martin
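ThreadLocalRandom did eventually land in java.util.concurrent (Java 7); usage is roughly:

```java
import java.util.concurrent.ThreadLocalRandom;

public class TlrDemo {
    public static void main(String[] args) {
        // current() returns a per-thread generator: no locks, no sharing
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int n = rnd.nextInt(100);  // uniform in [0, 100)
        System.out.println(n >= 0 && n < 100);  // true
    }
}
```

One caveat relevant to this thread: ThreadLocalRandom.setSeed throws UnsupportedOperationException, so it cannot serve a use case that needs a reproducible sequence from a given seed, which (as Tim notes below) Tom's multi-hashing application requires.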

Benjamin Manes

Jun 7, 2010, 2:39:37 PM
to guava-...@googlegroups.com
Sure. I wrote this as a code kata exercise when interviewing, so I didn't spend much time on it. There are a lot of obvious optimizations, but it was good enough as a quick-n-dirty implementation. You'll probably want to rewrite it for your test and strip out bits like the copy-on-write semantics.

Tom Gibara

Jun 7, 2010, 2:51:53 PM
to guava-...@googlegroups.com
Thanks,

I'm really just going to rip out the hashing function and test that.

I'll try and find some time to do that this evening. I really need to test over a wide variety of different Bloom filter configurations for these results to become more meaningful, but one step at a time...

Tom.


Tom Gibara

Jun 7, 2010, 4:45:00 PM
to guava-...@googlegroups.com
I've quickly incorporated the hash function from your BloomFilter implementation and the results were excellent:


I did make one change though: in your method indexes(Object) you initialize the probe modulo the table length, but four lines down you take the hashes modulo the number of bits. I couldn't see any reason for the discrepancy, so I changed both to use the number of bits.

Tom.
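The change Tom describes, reducing every probe position modulo the same quantity (the number of bits), might look roughly like this; a reconstruction with hypothetical names, not the actual code under discussion:

```java
public class BloomIndexes {
    // Generate k probe positions for one key, every one reduced modulo the
    // same quantity (the number of bits), so all land in [0, bits).
    static int[] indexes(Object o, int k, int bits) {
        int[] result = new int[k];
        int h = o.hashCode();
        for (int i = 0; i < k; i++) {
            h = rehash(h);
            // floorMod sidesteps the negative-hashCode problem without Math.abs
            result[i] = Math.floorMod(h, bits);
        }
        return result;
    }

    // Doug Lea's scrambling step from ConcurrentHashMap, used illustratively
    static int rehash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        for (int ix : indexes("test", 4, 64)) {
            System.out.println(ix >= 0 && ix < 64);  // all true
        }
    }
}
```

Using Math.floorMod rather than Math.abs also avoids the sign-bit edge case (Integer.MIN_VALUE) that Ben mentions below.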




Benjamin Manes

Jun 7, 2010, 6:46:25 PM
to guava-...@googlegroups.com
Great! That's probably a good change, as it looks off to me as well. I don't remember too much, but I think you can probably get rid of one of the Math#abs() calls. Given that this was a quick hack, I didn't think through the sign-bit issues beyond avoiding a negative hashcode causing an index-out-of-bounds error. You can probably use some additional bitwise tricks to squeeze out more performance throughout, but as a baseline this is promising.

Pat Farrell

Jun 9, 2010, 3:42:24 PM
to guava-...@googlegroups.com
On Mon, Jun 7, 2010 at 2:33 PM, Martin Buchholz <mart...@google.com> wrote:
Whenever java.util.Random is a performance bottleneck,
one can consider using a non-synchronized version.

One similar idea we used long ago at CyberCash, where we needed a lot of quality random numbers, was to have a separate thread generate random numbers and keep a queue of a suitable number (X) ready for instant access/delivery. The daemon can nap while there is a suitable backlog of values in the queue, and get to work when the reorder threshold is met.

This turns the problem into a well-understood reorder-strategy optimization.
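Pat's scheme might be sketched like this; a hypothetical illustration, not CyberCash's actual design:

```java
import java.util.Random;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RandomReservoir {
    private final BlockingQueue<Long> queue;

    RandomReservoir(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            Random rnd = new Random();
            while (true) {
                try {
                    // put() blocks once the backlog is full -- the daemon's "nap"
                    queue.put(rnd.nextLong());
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        producer.setDaemon(true);  // don't keep the JVM alive for the producer
        producer.start();
    }

    long next() throws InterruptedException {
        return queue.take();  // blocks only when the reservoir runs dry
    }

    public static void main(String[] args) throws InterruptedException {
        RandomReservoir r = new RandomReservoir(128);
        System.out.println(r.next() != r.next());  // distinct with prob. ~1 - 2^-64
    }
}
```

As the replies below point out, every consumer contends on the shared queue, which is the crux of Tim's objection.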
 

Tim Peierls

Jun 9, 2010, 4:00:44 PM
to guava-...@googlegroups.com
What kind of queue would you use to communicate the random numbers to the rest of the world? I'd worry that having lots of processors constantly bang on the queue would make this approach much less attractive than something like Martin's suggestion of ThreadLocalRandom.

But this is OT, anyway, since Tom Gibara's multi-hashing application requires reproducibility.

--tim

Benjamin Manes

Jun 9, 2010, 4:21:01 PM
to guava-...@googlegroups.com
In Pat's example you might retrieve a batch of random numbers rather than each individually, to reduce the contention on the queue. The batch could be stored in a thread-local if values are used individually. This is basically the same idea behind the common idiom of handling database sequences by caching a range rather than making I/O calls for every insert; Pat is just extending it to randoms rather than unique ids.
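The batching Ben describes can lean on BlockingQueue#drainTo. A rough sketch, with hypothetical sizes, of draining a block of values into a per-thread buffer so the shared queue is touched once per batch:

```java
import java.util.ArrayDeque;
import java.util.Random;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchedRandoms {
    static final int BATCH = 32;  // hypothetical batch size
    static final BlockingQueue<Long> shared = new ArrayBlockingQueue<>(1024);
    // Each thread keeps its own drained batch, touching the shared queue rarely
    static final ThreadLocal<ArrayDeque<Long>> local =
            ThreadLocal.withInitial(ArrayDeque::new);

    static long next() {
        ArrayDeque<Long> batch = local.get();
        if (batch.isEmpty()) {
            shared.drainTo(batch, BATCH);  // one contended operation per BATCH values
            if (batch.isEmpty()) {
                batch.add(new Random().nextLong());  // fallback if the producer lags
            }
        }
        return batch.poll();
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int i = 0; i < 256; i++) {
            shared.offer(rnd.nextLong());  // stand-in for Pat's producer thread
        }
        System.out.println(next() != next());
    }
}
```

This is analogous to the cached-sequence-range idiom: amortize one contended fetch over many local consumptions.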

Raymond Conner

Jun 9, 2010, 9:43:38 PM
to guava-...@googlegroups.com
It's not quite the same. For the database sequence, it's important
that no one get the same number. For random numbers, it's usually not
important that clients get numbers from the same sequence.

A thread local *should* be faster, but always test in as close to a
"real world" situation as possible to be sure. Beware of
micro-benchmarks. How either approach behaves in practice will depend
on many things, including the number of cores, number of threads,
contention, etc. The basic reason is that the best way to improve
throughput with multiple threads is to remove contention entirely
(paraphrased from Brian Goetz).

I should qualify that: this holds for a random number generator which
is, itself, fast. If your random number generator is not fast,
pre-calculating the numbers could certainly be an improvement.

- Ray

Pat Farrell

Jun 10, 2010, 9:13:24 PM
to guava-...@googlegroups.com
On Wed, Jun 9, 2010 at 4:00 PM, Tim Peierls <t...@peierls.net> wrote:
What kind of queue would you use to communicate the random numbers to the rest of the world? I'd worry that having lots of processors constantly bang on the queue would make this approach much less attractive than something like Martin's suggestion of ThreadLocalRandom.

These days, Java has nice blocking queues, there is no banging. Daemon processes are easy.

A BlockingQueue<T> is what you want. 

Tim Peierls

Jun 10, 2010, 11:32:18 PM
to guava-...@googlegroups.com
On Thu, Jun 10, 2010 at 9:13 PM, Pat Farrell <pat2...@gmail.com> wrote:
On Wed, Jun 9, 2010 at 4:00 PM, Tim Peierls <t...@peierls.net> wrote:
What kind of queue would you use to communicate the random numbers to the rest of the world? I'd worry that having lots of processors constantly bang on the queue would make this approach much less attractive than something like Martin's suggestion of ThreadLocalRandom.

These days, Java has nice blocking queues, there is no banging.

By "constantly bang on" I mean "repeatedly take from". 

My point is that it is usually better to have multiple pieces of (unshared) data than it is to have threads contend for access to a single piece of shared data -- I believe that's why Martin brought up ThreadLocalRandom. 

Blocking queues are _not_ particularly nice in this regard because they, uh, block when more than one thread tries to remove from the queue at the same time. That's potentially much worse than doing a little extra computation.

 
Daemon processes are easy.

Do you mean daemon threads? If you mean that it's easy to create a daemon thread to fill the random number queue, then yes, that's true, but it's better not to have to pass data between threads in the first place.

 
A BlockingQueue<T> is what you want. 

When there's little or no contention for the queue, yes. But you described an approach that doesn't limit the number of random number consumer threads, which leads to heavy contention if all the consumers need a random number at once.

At any rate, a random number queue with multiple consumers wouldn't produce a reproducible sequence for any one consumer, and Tom Gibara's multi-hashing approach requires reproducibility.

--tim

Pat Farrell

Jun 11, 2010, 8:54:13 PM
to guava-...@googlegroups.com
These days, Java has nice blocking queues, there is no banging.

By "constantly bang on" I mean "repeatedly take from". 

You are concerned about getting a piece of data?
No matter what you do, you have to repeatedly take data from a random number generator. That's the nature of the beast.

Daemon processes are easy.
Do you mean daemon threads?


At a suitably high level of abstraction, there is no difference between a daemon thread and a daemon process. The point is that your application just asks the random number store and a number is there. Quick, easy.

How expensive a process is versus a thread is very OS dependent and can vary with hardware. Sometimes it's not expensive at all, but that is moving way OT.

Dimitris Andreou

Jul 7, 2010, 10:15:22 PM
to guava-...@googlegroups.com
Hi all,

I found the subject quite interesting, so I spared some time and created a benchmark for various hashing-related things (separate benchmarks for hashtables with chaining, open-addressing hashtables, and Bloom filters).

It can easily do side-by-side comparisons of whatever scrambling/probing/multihash function one can provide. (Note I don't measure time in any of these - that would probably be fraught with problems.)

Currently, I'm not sure what the best dataset(s) would be for the various cases (you may see I'm using rather simplistic ones).

In my simplistic datasets, I see that the approach with Random (that Gibara suggested) works pretty well. For example, I compared it with universal hashing, and on the dataset where I only put "new Object()" into the Bloom filter, it was not easy to (randomly) find constants for the universal hashing engine that yield a smaller false positive rate. But then I can't really tell: I had to pick an arbitrary big prime to create it, and I don't know how much better options potentially exist (I could try looking for a good prime too). I also implemented Ben Manes' suggestion, which also fares well at a glance.
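The universal hashing Dimitris mentions is typically the Carter-Wegman form h(x) = ((a*x + b) mod p) mod m, with p a prime larger than the key range and a, b drawn at random. A minimal sketch with an arbitrarily chosen prime (as he notes, the choice of prime is itself an open question here):

```java
import java.util.Random;

public class UniversalHash {
    // Carter-Wegman universal hashing: h(x) = ((a*x + b) mod p) mod m,
    // with p prime, a in [1, p), b in [0, p)
    static final long P = 2_147_483_647L;  // 2^31 - 1, an arbitrary Mersenne prime

    final long a, b;

    UniversalHash(Random rnd) {
        a = 1 + (long) rnd.nextInt((int) (P - 1));
        b = rnd.nextInt((int) P);
    }

    int hash(int x, int m) {
        long ux = x & 0xFFFFFFFFL;  // treat the 32-bit hashCode as unsigned
        return (int) (((a * ux + b) % P) % m);
    }

    public static void main(String[] args) {
        UniversalHash h = new UniversalHash(new Random(42));
        int v = h.hash("example".hashCode(), 64);
        System.out.println(v >= 0 && v < 64);  // true
    }
}
```

Drawing fresh (a, b) pairs gives a family of functions with pairwise collision probability at most 1/m, which is the property Doug Lea's fixed scrambling function (remark 1 below) does not provide.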

Not anything conclusive here, I'd probably need better input data, but this should be a useful tool for anyone interested.

Some (late) remarks on this thread:
1) Doug Lea's scrambling function (ConcurrentHashMap#hash(int)) does not constitute universal hashing.

2) Using it to create a sequence of positions does not constitute double hashing either (in the strict sense, the latter must generate a permutation of all indexes of the hashtable - that is not what happens here, though a Bloom filter does not require as much, of course).

3) Estimating the false positive rate by either the expected probability of 1s or the actual number of 1s in the Bloom filter can be misleading - these estimates assume both independent and uniform hash functions, which we only approximate. One can also measure this rate experimentally, but the result depends on the data tried, so having quality, real-world representative datasets is quite important. (Does anyone know what the designers of HashMap and friends used as datasets?)

4) SecureRandom certainly cannot be used as a replacement for plain Random to generate k table positions given a hashCode as a seed. For example, the following prints 3 different numbers:

SecureRandom s = SecureRandom.getInstance("SHA1PRNG");
s.setSeed(1);
System.out.println(s.nextInt());
s.setSeed(1);
System.out.println(s.nextInt());
s.setSeed(1);
System.out.println(s.nextInt());
//also note that setSeed(0) calls are completely ignored

So a Bloom filter depending on this would malfunction.

This behavior is explained in the SecureRandom.setSeed javadoc (which is surprising and unfortunate - in retrospect perhaps this method should have been overridden to throw UnsupportedOperationException, with a different method performing the above).
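On remark 3, for reference: the estimate being cautioned about is the standard formula for a filter with m bits, k hash functions and n inserted keys, expected false positive rate (1 - e^(-kn/m))^k, valid only under the idealized independence and uniformity assumptions. A quick computation:

```java
public class BloomFpr {
    // Expected false positive rate (1 - e^(-k*n/m))^k for m bits,
    // k hash functions, n inserted keys -- assumes independent,
    // uniform hash functions, which real implementations only approximate
    static double expectedFpr(int m, int k, int n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // e.g. 10 bits per key with the near-optimal k = 7
        System.out.printf("%.4f%n", expectedFpr(10_000, 7, 1_000));  // ~0.0082
    }
}
```

An experimentally measured rate on real data can differ noticeably from this figure when the hash functions are not close to uniform, which is exactly Dimitris's point.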

That's all for now.

Regards,
Dimitris
