Bloom filter

3,188 views
Skip to first unread message

Will Fitzgerald

unread,
May 21, 2011, 10:24:05 AM5/21/11
to golan...@googlegroups.com
I've written an implementation of Bloom filters, using the BitSet package. You can find it at


As usual, comments welcome. 

Will

William Waites

unread,
May 21, 2011, 10:35:53 AM5/21/11
to golan...@googlegroups.com
* [2011-05-21 07:24:05 -0700] Will Fitzgerald <will.fi...@gmail.com> �crit:

] I've written an implementation of Bloom filters, using the BitSet package.

] You can find it at
]
] https://github.com/willf/bloom
]
] As usual, comments welcome.

Hello, a while ago I made something similar:

https://bitbucket.org/ww/bloom

I guess the main differences are that the hash algorithm is
configurable when creating the filter and that instead of providing
the number of hashes up front and estimating the false positive
probability, you provide the probability and it calculates the number
of hashes it needs. I think users of the package have an easier time
getting a handle on p than k.

Apart from that, BitSet is nice if it had existed a few months ago
I could have avoided a lot of messy arithmetic :)

Cheers,
-w

--
William Waites <mailto:w...@styx.org>
http://river.styx.org/ww/ <sip:w...@styx.org>
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45

Will Fitzgerald

unread,
May 21, 2011, 10:41:59 AM5/21/11
to golan...@googlegroups.com
Actually, I do neither :) I just allow a way to estimate the FP rate, given n (cardinality of the set), m (size of the bloom filter in bits) and k (the number of hashes).

Are you using your Bloom filter for anything in production?

Will

William Waites

unread,
May 21, 2011, 10:54:31 AM5/21/11
to golan...@googlegroups.com
* [2011-05-21 07:41:59 -0700] Will Fitzgerald <will.fi...@gmail.com> �crit:

] Actually, I do neither :) I just allow a way to estimate the FP rate, given

] n (cardinality of the set), m (size of the bloom filter in bits) and k (the
] number of hashes).

Yes, I see. It's just that (m, k) which need to be provided to the
intialiser are not what I think the average user of a bloom filter
will have in hand - they're more likely to be able to give reasonable
values for (n, p). Just a bit of algebra to convert them, but
worthwhile because it saves users of the library having to look up and
implement the formula.

] Are you using your Bloom filter for anything in production?

Nah. I have some back-burner plans to try to speed up some graph
traversal things, but they remain on the back burner. Feel free to
take any parts of my implementation you might find useful.

Will Fitzgerald

unread,
May 22, 2011, 5:26:57 PM5/22/11
to golan...@googlegroups.com
I guess my choice was to pick a really fast hash function (one call to hash.fnv) to choose the bits. I know how to do the algebra; my very initial exploration seemed to indicate that the implementation wasn't following the predicted models exactly. This is worth some further investigation.

Thanks for the offer of code, by the way. This is generous.

Will

Will Fitzgerald

unread,
May 23, 2011, 12:05:34 AM5/23/11
to golan...@googlegroups.com
William (et al.),

I have added a new constructor, NewWithEstimates(n uint, fp float64), that will estimate the size of the underlying bitset and the number of hash functions based on the number of items requiring storage and a required false positive rate. Based on some quick experiments (see the benchmark), this works best with fp<= 0.001. 

Also, I added a benchmark for testing the filter (with 10 hash functions). The test reports 0.002636 milliseconds per test on my MacBook Air -- pretty good, I think.


Will

Will Fitzgerald

unread,
Mar 11, 2013, 9:40:28 PM3/11/13
to golan...@googlegroups.com
After a long hiatus, I've updated the Bloom filter code to support 1.x style installation and formatting. Comments welcome.

greg...@gmail.com

unread,
Jan 25, 2014, 2:17:38 AM1/25/14
to golan...@googlegroups.com
Hello, I seem to be unable to Unmarshal a bitset over size 32..

xrf...@gmail.com

unread,
Nov 28, 2016, 11:12:35 AM11/28/16
to golang-nuts
Hi Will,

Could you please explain the memory characteristics of your implementation? i.e. given boom.New(m, k), how much memory will the filter consume? Is the memory consumption related to number of items in the filter? If appropriate, function such as 

func EstimateParameterByMemory(mem int) (m, k int)
func (f *BloomFilter) MemoryConsumption() int

will be very useful in my case.

Thank you very much!

xrfang

在 2011年5月21日星期六 UTC+8下午10:24:05,Will Fitzgerald写道:

Ian Davis

unread,
Nov 28, 2016, 11:21:00 AM11/28/16
to golan...@googlegroups.com
On Sat, Jan 25, 2014, at 07:17 AM, greg...@gmail.com wrote:
Hello, I seem to be unable to Unmarshal a bitset over size 32..

What have you tried and what errors did you encounter?

If you share some code then your question will be easier to answer.

Stanislav Paskalev

unread,
Nov 29, 2016, 3:18:14 AM11/29/16
to golang-nuts, xrf...@gmail.com
Hi xrfang,

Myself and a friend recently wrote https://github.com/solarsea/quorum for a hackaton as a proof of concept whether bloom filters can be used as the basis for an anonymous voting system. The project includes a floom filter implementation (tweaked to show accumulated conflicts) and hash function generator based on the pearson hash function.

Free free to fork it away and use it :)

Regards,
Stanislav

lemo...@gmail.com

unread,
Oct 1, 2017, 12:16:58 AM10/1/17
to golang-nuts
I have read the code, one thing I'm not clearly is that the choosing of the k hash function. Is there any reason you use such location function to implement the k function, What's the advantage?


在 2011年5月21日星期六 UTC+8下午10:24:05,Will Fitzgerald写道:
I've written an implementation of Bloom filters, using the BitSet package. You can find it at

丁瑞文

unread,
Jun 11, 2022, 5:50:32 PM6/11/22
to golang-nuts
  hi, guys. I use https://github.com/willf/bloom in production, I store the data into the bloom filter first, and then store the bloom filter into the cache, but there is a problem, when I want to insert the data into the bloom filter , I need to take the bloom filter out of the cache first, then insert the data, and finally store the bloom filter in the cache. Inserting data into the Bloom filter causes a lot of performance consumption, how do I need to optimize it? Because it is a distributed system, there will be inconsistencies in directly storing the bloom filter in memory.  
Reply all
Reply to author
Forward
0 new messages