Bloom Filters should be a 1st class data structure

296 views
Skip to first unread message

Mitchell Hashimoto

unread,
Oct 14, 2011, 12:46:36 PM10/14/11
to redi...@googlegroups.com
Hello!


Bloom filters are an extremely space/computation efficient data structure for a probabilistic hash table. They're used in basically every architecture I've ever seen, and having bloom filters as a service would be an extremely valuable resource. Currently, our company has resorted to writing our own implementation (https://github.com/kiip/bloomd). While this is quite efficient, I do love Redis and I feel this is the sort of thing that Redis should have as a 1st class resource.

It was suggested in the previous topic that bloom filters can already be implemented on top of previous Redis primitives, which is very true, but not efficient. Having to go across the network just to set a bit or get a bit is kind of silly, and then you would also have to build in all the logic of scalable bloom filters into your layer above Redis.

I think Redis would benefit greatly for adding support for both bloom filters and scalable bloom filters.

I'd love feedback on this idea, from both the community and antirez, and I'd like to ask again what are objections to this? From everywhere I've seen Redis in use so far, bloom filters appear to fit the philosophy of Redis perfectly.

Best,
Mitchell

Jeremy Zawodny

unread,
Oct 14, 2011, 12:57:32 PM10/14/11
to redi...@googlegroups.com
What would the interface (the command set) look like?

Jeremy

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To view this discussion on the web visit https://groups.google.com/d/msg/redis-db/-/GyZ6XOlKDMkJ.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

Salvatore Sanfilippo

unread,
Oct 14, 2011, 1:04:03 PM10/14/11
to redi...@googlegroups.com
On Fri, Oct 14, 2011 at 6:57 PM, Jeremy Zawodny <Jer...@zawodny.com> wrote:

> What would the interface (the command set) look like?

I think an hypothetical interface could be able to do this with just
two commands one doing SETBIT in the specified bit range (always using
SHA1 or alike against the given string), and one for mass-testing bits
given a string and just return a bool.

So it's like:

BLOOM SET keyname <filter size> "whatever string"
BLOOM TEST keyname <filter size> "whatever string"

The first command will hash the string and set the bits accordingly.
The second command will test all the bits returning 0 or 1.

Also a "BLOOM FLUSH" could be useful as well.

Basically no new data type is added, this just operates on strings.
This also means you can get your filter using just GET <keyname>.

The implementation should be trivial. However there is always the
"problem" that scripting would solve this very well, we even have
Redis.sha1() in Lua.

Worth opening an issue to reason about it.

Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Josiah Carlson

unread,
Oct 14, 2011, 1:33:45 PM10/14/11
to redi...@googlegroups.com
One more field is necessary: number of bits to set.

The thing is, bloom filters are sufficiently nuanced in the way they
work (in particular how many bits to set, what will be the FP rate
given a total number of insertions, etc.) that just having the
commands may be more confusing than not.

At least with leaving it to 3rd party libraries (which can use
pipelining now, scripting later), they can include all of the useful
utility functions for calculating proper bloom filter size and number
of bits to set.

Regards,
- Josiah

Salvatore Sanfilippo

unread,
Oct 14, 2011, 2:08:33 PM10/14/11
to redi...@googlegroups.com
On Fri, Oct 14, 2011 at 7:33 PM, Josiah Carlson
<josiah....@gmail.com> wrote:
> One more field is necessary: number of bits to set.

Right, the interface should have an additional argument indeed.

> The thing is, bloom filters are sufficiently nuanced in the way they
> work (in particular how many bits to set, what will be the FP rate
> given a total number of insertions, etc.) that just having the
> commands may be more confusing than not.
>
> At least with leaving it to 3rd party libraries (which can use
> pipelining now, scripting later), they can include all of the useful
> utility functions for calculating proper bloom filter size and number
> of bits to set.

Usually I tend to think exactly along on this lines, however I think
that in this specific case it is worth testing the performances we can
achieve using Lua scripting, since testing for, for instance, 128
bits, can be pretty slow without scripting, can be much faster with
scripting, but still may be one or even two order of magnitudes slower
than if done inside Redis.

I really think the only way to know this is by testing it.

However I guess that the commands footprint should be like:

BLOOM SET key <bits> <size> <string>
BLOOM TEST key <bits> <size> <string>
BLOOM CLEAR key

It does not seems like a too complex interface to me.

SET/TEST would use the hash function in "counter mode" so that it's
not a problem to have more bits than the output size.
No new type is needed so this is completely backward / forward
compatible from the point of view of RDB files.

Sounds like a very easy task... I'm sure not more than 100 lines of
code, so if the difference in performances is as extreme as I think
may be worth it. Also I've the feeling that providing direct support
would boost the adoption of such a great data structure.

Salvatore

Armon Dadgar

unread,
Oct 14, 2011, 5:05:17 PM10/14/11
to redi...@googlegroups.com
Hey, just wanted to chime in on this.

I think that in keeping with some of the various data structures in Redis, it would be more appropriate to implement a scaling bloom filter as opposed to a static bloom filter. The distinction being that based on the create-time parameters, static bloom filters have a fixed capacity. Scaling bloom filters on the other hand can grow as necessary to add capacity, and maintain a fixed FP rate.

If that is used instead, then specifying the bits/size does not make a lot of sense as an interface. It makes more sense to instantiate with a FP rate and an initial capacity. Those parameters can be used to determine the appropriate size and bits / hash rounds to use. And this way you are free to scale the size of the bloom filter as more keys are added.

Also, I'm not sure about what other hash functions are available internally, but using a SHA1 is a bit of overkill for a bloom filter. Since cryptographic requirements are not needed, much faster and more lightweight hashes would be more appropriate.

But otherwise, I think it would be awesome to see bloom filters integrated into Redis, as they are an incredibly useful data structure with limited support. It would be much nicer to have the hashes computed and the bits checked server side, because otherwise the performance would be simply abysmal.

Best Regards,
Armon Dadgar


Dvir Volk

unread,
Oct 14, 2011, 5:06:26 PM10/14/11
to redi...@googlegroups.com
If we're in the subject of new data structures, I'd like to re-raise my request for a simple random access vector structure.
sure, the functionality can be achieved with a string, a list or  a sorted sets, but each have their drawbacks in terms of memory, flexibility and performance.
I know I'd personally use it.
Dvir Volk
System Architect, DoAT, http://doat.com

Mitchell Hashimoto

unread,
Oct 14, 2011, 5:24:21 PM10/14/11
to redi...@googlegroups.com


On Friday, October 14, 2011 11:06:26 PM UTC+2, Dvir Volk wrote:
If we're in the subject of new data structures, I'd like to re-raise my request for a simple random access vector structure.
sure, the functionality can be achieved with a string, a list or  a sorted sets, but each have their drawbacks in terms of memory, flexibility and performance.
I know I'd personally use it.


Before this thread is derailed, could you please make a new topic for this? Bloom filters are very important to me and I'd like to keep this thread on-topic. Thanks :)

A colleague of mine I believe has posted a response to Salvatore above and its simply awaiting moderation.

Thank you for responding Salvatore!

Best,
Mitchell

Bohu TANG

unread,
Oct 15, 2011, 11:46:10 AM10/15/11
to redi...@googlegroups.com

nessDB's bloom filter implementation(c code) https://github.com/shuttler/nessDB/blob/v1.7/src/bloom.c

BohuTANG

Salvatore Sanfilippo

unread,
Oct 15, 2011, 11:59:21 AM10/15/11
to redi...@googlegroups.com
On Sat, Oct 15, 2011 at 5:46 PM, Bohu TANG <overred....@gmail.com> wrote:

> nessDB's bloom filter implementation(c code)
> https://github.com/shuttler/nessDB/blob/v1.7/src/bloom.c

Writing the code is the simplest step ;)
The hard part is understanding if it can be avoided at all, and
designing the interface well if it can not be avoided.

Salvatore

MrIhaveAnOpinionOnEverything

unread,
Oct 16, 2011, 10:57:02 PM10/16/11
to redi...@googlegroups.com
I concur in pushing for the support of this data structure. Our company has been using redis for the longest time. And we are very happy with it's performance. However there are use cases where bloom filter data structures appears to be most appropriate. It would be great if redis would be able to support this data structure. So I can keep it as the default data store in our platform.

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.




--
"When I look at you I see two people, ther person you are
and the person you are supposed to be. Someday these two
people will meet. And when they do, they will achieve great things"
- Gene Hackman, The Replacements

Catch up on me and check my BLOG at
http://mrihaveanopiniononeverything.blogspot.com
Reply all
Reply to author
Forward
0 new messages