Bloom filter index


John Tantalo

Dec 11, 2011, 10:17:19 PM
to google-a...@googlegroups.com
I wonder whether anybody has tried to build an in-memory bloom filter in front of an index to reduce datastore read operations?

In my application, I have an exact-match query on a single field, and it commonly matches no results. However, I still have to pay for datastore read operations in this case.

My idea was to build a bloom filter on every value of the field in my datastore. Given a query input, if the bloom filter says the value is a member of the set, I will query the datastore for it, which may or may not match results (i.e., a false positive).

The bloom filter would be wrapped in an app engine model and stored in the datastore and memcached. The write rate to the datastore for this index is rather low, so I plan to update the bloom filter transactionally and cache it on every write. The updates could also be done offline in a task queue.

The goal is to reduce the cost of searches, especially in the "no matches" case. I believe this change would reduce costs on datastore read operations, but increase CPU time because each request would have to read and deserialize a potentially large bloom filter from memcached. Clearly, this tradeoff could be tuned to the needs of the app, as a larger bloom filter would produce fewer false positives and wasted datastore reads.
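
A minimal sketch of the read path described above (the class and function names here are illustrative, not from any real library; a production filter would be sized and persisted as discussed):

```python
import hashlib

class BloomFilter(object):
    """Minimal pure-Python Bloom filter (illustrative, not a real library)."""

    def __init__(self, size_bits=8192, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value):
        # Derive k bit positions from k salted hashes of the value.
        for i in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (i, value)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

def find_by_field(bloom, value, run_datastore_query):
    """Only pay for the datastore read when the filter says 'maybe'."""
    if not bloom.might_contain(value):
        return []  # definite miss: no datastore read operation at all
    return run_datastore_query(value)  # may still return [] on a false positive
```

In the common no-match case, `find_by_field` never touches the datastore at all, which is exactly the cost being targeted.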

Thoughts?

Brandon Wirtz

Dec 11, 2011, 10:35:55 PM
to google-a...@googlegroups.com

Your CPU time will likely be lower, not higher. Your Bloom filter won't be able to be more than about 28 MB in size.

You may consider "short" misses if your data doesn't all fit: using partial matches based on fewer characters of the complete value. Look up "Robert ZCXYNVsnup"; do I have any last names that start with "ZCXY"? No? Great, don't look in the datastore.

You may want to play with instance RAM vs. memcache. Which is better will likely be a factor of instance lifetime and the number of running instances.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Jeff Schnitzer

Dec 12, 2011, 10:24:14 AM
to google-a...@googlegroups.com
Seems like even a 1MB Bloom filter (the largest size you can fit in a single entity/memcache bucket) would accommodate an enormously large dataset.

The not-so-obvious part is how to update the memcache in sync with the datastore. If you have transactional requirements, you can't update the memcache on datastore write without risking collisions. The answer is to clear the value on write and use CAS operations on read. If you're in Javaland you can use Objectify's CachingAsyncDatastoreService directly with the low-level API. If you're in Pythonland you'll have to implement this yourself... just remember, don't try to update memcache on datastore write. It will fail.
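
For readers in Pythonland, the clear-on-write / CAS-on-read pattern can be sketched against a toy cache with a memcache-style gets/cas interface (the `CasCache` class below is a self-contained stand-in for illustration, not the App Engine API):

```python
class CasCache(object):
    """Tiny in-memory stand-in for memcache's gets/cas interface."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        return self._data.get(key)  # (value, version) or None

    def set(self, key, value):
        _, version = self._data.get(key, (None, 0))
        self._data[key] = (value, version + 1)

    def delete(self, key):
        self._data.pop(key, None)  # this is all a datastore write does

    def cas(self, key, value, version):
        item = self._data.get(key)
        if item is None or item[1] != version:
            return False  # somebody wrote or cleared the key in between
        self._data[key] = (value, version + 1)
        return True

def read_filter(cache, key, load_from_datastore):
    """Read-through cache that cannot clobber a concurrent write's delete()."""
    item = cache.gets(key)
    if item is not None and item[0] is not None:
        return item[0]
    cache.set(key, None)                  # reserve a slot to cas against
    _, version = cache.gets(key)
    fresh = load_from_datastore()
    cache.cas(key, fresh, version)        # fails harmlessly if a write cleared it
    return fresh
```

The key point, as above: the writer only ever deletes; the reader repopulates with `cas`, so a stale filter loaded before a write can never overwrite the cleared slot.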

Jeff

--
We are the 20%

Tom Gibara

Dec 12, 2011, 8:07:58 PM
to google-a...@googlegroups.com
I've used Bloom filters for a number of purposes, including the scenario you describe - though not on App Engine.

Bloom filters are an extremely versatile tool, but are not a panacea. Firstly they aren't as compact as one might expect. By my reckoning, a 1MB bloom filter will accommodate 1M entries with a false positive rate of around 2%. A well implemented trie with a suitable set of keys can approach similar levels of compactness.
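
That 2% figure can be checked with the standard false-positive formula for a Bloom filter using the optimal number of hash functions:

```python
import math

def false_positive_rate(m_bits, n_items):
    """Rate for an m-bit Bloom filter holding n items, with the optimal
    number of hash functions k = (m/n) * ln 2."""
    k = max(1, int(round((m_bits / float(n_items)) * math.log(2))))
    return (1.0 - math.exp(-k * n_items / float(m_bits))) ** k

# A 1MB (8,388,608-bit) filter holding 1M entries, as in the estimate above:
rate = false_positive_rate(8 * 1024 * 1024, 1000 * 1000)  # roughly 0.018
```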

Secondly, Bloom filters are obviously sensitive to their initial sizing; though they degrade gracefully, in some applications it can be difficult to resize them in-situ without causing unacceptable spikes in throughput.

Finally, just in case you hadn't considered it; Bloom filters don't support element removal. This may or may not be an issue for your application.

Also, I'm not sure if Brandon's suggestion to consider "short misses" was targeted at storing more data in the Bloom filter. Bloom filter capacity is independent of key size, and smaller keys may result in poorer performance due to weaker hashing.

If it's any use I've made an open source Bloom filter implementation available as part of my collections package at http://code.google.com/p/tomgibara/ 

Tom.

--

Ikai Lan (Google)

Dec 13, 2011, 2:33:31 PM
to google-a...@googlegroups.com
Tom,

Can you link directly to the Bloom Filter implementation? It's not obvious from the main project page how to find it. Any time we can get people to think about computer science-y solutions to big data problems instead of just subscribing to RDBMS dogma, we win =).

While Bloom filters are not a solution for everything, there are places where they would greatly improve performance. One place where you'd likely see huge benefits from Bloom filters is in N x N comparison-type operations. Here's an example from RapLeaf:


By doing a single pass over all entities to build the initial filter, you can do the cross-check on the second pass in much less time if you know that matches tend to be sparse (for other folks reading this thread who are new to the concept, this means spread out: most comparison operations will be NO MATCH). In the N x N hash comparison example, you know that you are only going to do writes in the first phase with no deletes, and that you will only be reading the filter later on, so you can use local memory.
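
The two-pass sparse comparison might be sketched like this (a single-hash bit array stands in for a real multi-hash Bloom filter to keep the example short):

```python
def sparse_join(left_keys, right_keys, filter_bits=1 << 16):
    """Two-pass sparse join: pass 1 builds a bit array over left_keys; pass 2
    only does the (expensive) exact check for right keys that hit the filter.
    One hash function keeps the sketch short; a real filter would use several."""
    bits = bytearray(filter_bits // 8)
    for key in left_keys:                       # pass 1: build the filter
        pos = hash(key) % filter_bits
        bits[pos // 8] |= 1 << (pos % 8)

    left_set = set(left_keys)                   # stands in for the exact lookup
    matches = []
    for key in right_keys:                      # pass 2: most keys stop here
        pos = hash(key) % filter_bits
        if not bits[pos // 8] & (1 << (pos % 8)):
            continue                            # definite NO MATCH, no exact check
        if key in left_set:                     # exact check catches false positives
            matches.append(key)
    return matches
```

When matches are sparse, almost every right-hand key is rejected by the bit test alone, which is where the time savings come from.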

--
Ikai Lan 
Developer Programs Engineer, Google App Engine

Tom Gibara

Dec 13, 2011, 8:04:49 PM
to google-a...@googlegroups.com
Ikai,

Prompted by your request for a link, I took time out this evening to write a walkthrough of how to use my Bloom filter implementation, and combined it with some very brief build instructions to make this blog post:


Hopefully others will find it useful, but it was high time I documented it anyway.

Having just scanned my previous comments, I realize that they might be dissuasive; that wasn't my intention. I use Bloom filters heavily in a variety of situations. For example, I recently came up with the idea of using them to identify which columns in a large dataset contain unique values: a first pass populates the Bloom filter, and possible duplicates are stored in an in-memory set that a second pass then uses to verify that genuine duplicates exist (incidentally, if anyone knows a better approach I'm interested, but I was pleased to come up with this one).
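
The two-pass duplicate check might look roughly like this (a plain set stands in where the Bloom filter would go, so pass 1 here happens to produce no false positives; with a real filter the candidate set would include some and pass 2 would weed them out):

```python
def find_duplicates(values):
    """Two-pass duplicate detection in the shape Tom describes: pass 1 flags
    *possible* duplicates, pass 2 confirms the genuine ones."""
    seen = set()          # replace with a Bloom filter for large datasets
    candidates = set()
    for v in values:                  # pass 1
        if v in seen:
            candidates.add(v)         # with a real filter, may be a false positive
        seen.add(v)

    counts = {}
    for v in values:                  # pass 2: verify genuine duplicates
        if v in candidates:
            counts[v] = counts.get(v, 0) + 1
    return set(v for v, c in counts.items() if c > 1)
```

A column is unique exactly when this returns an empty set; the memory win is that `seen` becomes O(1) and only the (small) candidate set needs exact storage.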

On the subject of computer science-y solutions: if Bloom filters provide a weakened abstraction of a set, then compact approximators provide an equivalent abstraction for maps. Unfortunately they have nowhere near the popularity of Bloom filters, even though it's possible to find all sorts of opportunities for them to 'bound' queries over big data.


For example, in the problem initially posed, a compact approximator could be used to return an actual upper bound on the number of matches instead of simply whether a match exists or not. This comes at the expense of a greater memory overhead, and whether it's worthwhile is naturally dependent on the application.
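
One concrete structure with exactly this "upper bound on the count" behaviour is a count-min sketch; a minimal version (illustrative sizes) might look like:

```python
import hashlib

class CountMinSketch(object):
    """A count-min sketch: each key maps to one counter per row; query()
    takes the minimum, which can over-count on collisions but never
    under-counts, so it is a true upper bound on the number of matches."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, key):
        # One salted hash per row; md5 used only for illustration.
        for i in range(self.depth):
            digest = hashlib.md5(('%d:%s' % (i, key)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in zip(self.rows, self._cols(key)):
            row[col] += count

    def query(self, key):
        # Each of the key's cells is >= its true count, so the min is an upper bound.
        return min(row[col] for row, col in zip(self.rows, self._cols(key)))
```

For the original question, `query(value) == 0` plays the same role as a Bloom filter miss, while a non-zero answer additionally bounds how many results the datastore query could return.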

Tom.

Ikai Lan (Google)

Dec 13, 2011, 8:32:53 PM
to google-a...@googlegroups.com
Tom,

Awesome! Your writing is really accessible, and I'd love to see more of it.

I think Bloom filters have caught on because, conceptually, it's much easier to figure out optimizations when you know you can do space- and time-efficient set-inclusion checks. Bloom filters are used everywhere; they're used in several places here, including in parts of the datastore. You don't hear as much about compact approximators. In fact... I can't seem to find a library for them. Someone needs to take this paper and turn it into code:


--
Ikai Lan 
Developer Programs Engineer, Google App Engine



Brandon Wirtz

Dec 13, 2011, 9:30:27 PM
to google-a...@googlegroups.com

Short misses:

Let's say you are building an app that pulls all of the tax, owner, and last-sale prices for a given address for a real-estate app. You have decided to optimize it with a Bloom filter.

Step 1: Query validation

Doing data-structure validation makes a nice poor man's Bloom filter. In my ongoing battle with the Googlebot, which is perfectly happy to read a form and then stuff gibberish into it to see what comes out, we found we could drastically reduce datastore calls by validating that a query was asking for data in a format that made sense before actually processing it.

This is not a Bloom filter, but knowing that a phone number is only numeric, or that first and last names don't have numbers in them, or that addresses don't have symbols, can save you a lot of calls made by bots, or by humans who type poorly.
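
Such validation can be as simple as a few field-specific regular expressions (the patterns below are illustrative, not a complete or internationalization-safe validator):

```python
import re

# Hypothetical per-field validators: reject obviously malformed queries
# before making any datastore call at all.
VALIDATORS = {
    'phone':   re.compile(r'^[0-9]{7,15}$'),        # digits only
    'name':    re.compile(r"^[A-Za-z'\- ]+$"),      # no digits in names
    'address': re.compile(r'^[A-Za-z0-9#.\- ]+$'),  # no other symbols
}

def looks_plausible(field, value):
    """True only when the value has a shape we could possibly have stored."""
    pattern = VALIDATORS.get(field)
    return bool(pattern and pattern.match(value))
```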

Step 2: Compact approximators

You can get much better cache hit rates if you format valid queries the same way every time. Consider forcing the case of the query, stripping characters and whitespace, and adding an ignore list for things which may not be needed:

1313 MockingBird Lane
1313 MockingBird LN
1313 MockingBird ln
1313 Mockingbird Lane

are all the same.
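
A canonicalization step along these lines might look like the following (the suffix table is illustrative and far from complete):

```python
import re

# Hypothetical street-type synonym table; a real one would be much larger.
SUFFIXES = {'lane': 'ln', 'ln': 'ln', 'street': 'st', 'st': 'st',
            'avenue': 'ave', 'ave': 'ave'}

def canonicalize(address):
    """Force case, strip punctuation, and collapse street-type synonyms so
    every spelling of an address produces the same cache/filter key."""
    words = re.sub(r'[^a-z0-9 ]', '', address.lower()).split()
    if words and words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]
    return ' '.join(words)
```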

When I mentioned "short misses" I was looking for "compact approximators".

Converting 1313 Mockingbird Lane to a compact approximator helps you avoid lookups for things you don't have.

A sample implementation might be to look only at the non-vowels, and to drop the street-type designation.

So you would look at 1313mckngbrd.

For the purpose of your Bloom lookup you would not include the city and state.
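
That compaction rule can be sketched as follows (assuming, for illustration only, that the street type is always the last word):

```python
import re

def short_key(address):
    """Drop the street-type word and the vowels, so '1313 MockingBird Lane'
    keys as '1313mckngbrd' for the Bloom filter lookup."""
    words = re.sub(r'[^a-z0-9 ]', '', address.lower()).split()
    if len(words) > 1:
        words = words[:-1]            # drop 'lane', 'ln', 'st', ...
    return re.sub(r'[aeiou]', '', ''.join(words))
```

Smaller keys don't make the filter itself smaller (as Tom notes, capacity is independent of key size), but they do make more spellings of the same address collapse onto one entry.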

This may not shrink your data set enough to make it fit in the Bloom filter, or you may want a tiered Bloom filter based on the number of unique entries: one filter for "have we got any data on addresses with number 1313?" and a street-name filter, checking the actual datastore only if you have both a parcel number and a street name that match the request. This is of course a bad example, since across the US nearly every number up to 100,000 is likely in use, but I didn't want to change my example, so pretend that for whatever reason it works.

Step 3: Keeping the Bloom filter alive

Bloom filters are for speeding things up, so they don't have to be 100% up to date, but you'd rather they say "we might have this, look it up" than a wrong but declarative "we don't have this, don't bother". Because of this, you want to update your filter on the creation of new entries and on changes to searchable entries, but you don't need to update it on deletion.

Because of this you will likely want to keep a transaction log of recent writes and update the Bloom filter on a timer. When checking for a value you would search the transaction log, then the filter, then the datastore. This prevents false "we don't have this" responses on new data, while still keeping up with new data being added.
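
The log-then-filter-then-datastore read path might be sketched as below (the write log here is an in-memory set for illustration; on App Engine it would live in the datastore or memcache, with `flush` run from a task queue or cron timer):

```python
class BloomWithWriteLog(object):
    """New keys go into a small write log immediately; a periodic task folds
    the log into the Bloom filter. Reads consult the log first, so fresh
    entries never get a false 'not present'."""

    def __init__(self, bloom):
        self.bloom = bloom        # any object with add()/might_contain()
        self.write_log = set()

    def record_write(self, key):
        self.write_log.add(key)   # cheap; the filter itself is untouched

    def flush(self):
        """Fold logged writes into the filter; run this on a timer."""
        for key in self.write_log:
            self.bloom.add(key)
        self.write_log.clear()

    def might_contain(self, key):
        # Log first, then filter; only a 'maybe' goes on to the datastore.
        return key in self.write_log or self.bloom.might_contain(key)
```

Deletions are simply never applied, matching the bias above: the filter may say "maybe" for gone data (one wasted read), but never "no" for present data.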

John Tantalo

Dec 15, 2011, 12:06:13 AM
to google-a...@googlegroups.com
Thanks for the lib, Tom, but I'm in python-land so unfortunately I can't use it.

Tom Gibara

Dec 15, 2011, 5:30:41 AM
to google-a...@googlegroups.com
Already done :)

The package containing my Bloom filter implementation already contains a general-purpose compact approximator implementation; something else I need to document though :(

Tom Gibara

Dec 15, 2011, 5:35:20 AM
to google-a...@googlegroups.com
Pity, because I was going to add that, if you're already using Guava, a BloomFilter implementation that includes a general-purpose hashing package has just been added.


On 15 December 2011 05:06, John Tantalo <john.t...@gmail.com> wrote:
Thanks for the lib, Tom, but I'm in python-land so unfortunately I can't use it.


John Tantalo

Dec 15, 2011, 11:41:23 AM
to google-a...@googlegroups.com
I found a pure-Python Bloom filter implementation by Kevin Scott, based on BitVector, that appears to work on GAE (python27).

