Your CPU time will likely be lower. Your Bloom filter won't be able to be more than about 28 MB in size.
You may consider "short" misses if your data doesn't all fit. (Use partial matches based on fewer characters of the complete value: to look up "Robert ZCXYNVsnup", ask "do I have any last names that start with ZCXY?" No? Great, don't look in the data store.)
You may want to play with instance RAM vs. Memcache. This will likely be a factor of instance lifetime and the number of running instances.
--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
The not-so-obvious part is how to keep memcache in sync with the
datastore. If you have transactional requirements, you can't update
memcache on datastore write without risking collisions. The answer is
to clear the value on write and use CAS operations on read. If you're
in Javaland you can use Objectify's CachingAsyncDatastoreService
directly with the low-level API. If you're in Pythonland you'll have
to implement this yourself... just remember: don't try to update
memcache on datastore write. It will fail.
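To make that concrete, here is a minimal sketch of the clear-on-write / CAS-on-read pattern. A small in-memory class stands in for memcache (the real App Engine client exposes a similar gets/add/cas interface); the marker object and function names are my own illustration, not any library's API.

```python
_LOCK = object()  # hypothetical marker placed in the cache before a datastore read

class FakeCache:
    """In-memory stand-in for memcache with a gets/add/cas interface."""
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        return self._data.get(key)  # (value, version) or None

    def add(self, key, value):
        if key in self._data:
            return False
        self._data[key] = (value, 0)
        return True

    def cas(self, key, value, version):
        entry = self._data.get(key)
        if entry is None or entry[1] != version:
            return False  # a concurrent write (delete) intervened
        self._data[key] = (value, version + 1)
        return True

    def delete(self, key):
        self._data.pop(key, None)

cache, datastore = FakeCache(), {}

def put(key, value):
    datastore[key] = value
    cache.delete(key)  # clear on write -- never write the new value here

def get(key):
    entry = cache.gets(key)
    if entry is not None and entry[0] is not _LOCK:
        return entry[0]  # cache hit
    cache.add(key, _LOCK)      # place a marker so a racing write is detectable
    entry = cache.gets(key)
    value = datastore.get(key)
    if entry is not None:
        # If a write deleted the marker in the meantime, cas fails
        # harmlessly and the stale value is never cached.
        cache.cas(key, value, entry[1])
    return value
```

The key property: a writer only ever deletes, so the worst race outcome is a cache miss on the next read, never a stale cached value.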
Jeff
--
We are the 20%
--
Short Misses:
Let's say you are building a real estate app that pulls the tax, owner, and last sale price for a given address. You have decided to optimize it with a Bloom filter.
Step 1: Query Validation
Doing data-structure validation makes a nice poor man's Bloom. In my ongoing battle with the Googlebot, which is perfectly happy to read a form and then stuff gibberish into it to see what comes out, we found we could drastically reduce datastore calls by validating that a query was asking for data in a format that made sense before actually processing it.
This is not a Bloom filter, but knowing that a phone number is only numeric, or that first and last names don't have numbers in them, or that addresses don't have symbols, can save you a lot of calls made by bots. Or by humans who type poorly.
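A few shape checks of that kind might look like this sketch; the patterns are illustrative assumptions (real phone, name, and address rules are messier), not a complete rule set:

```python
import re

# Hypothetical pre-query validators: reject queries whose shape can't
# possibly match anything in the datastore, before paying for a call.
PHONE = re.compile(r"^\+?[0-9()\-\s]{7,15}$")       # digits and punctuation only
NAME = re.compile(r"^[A-Za-z][A-Za-z'\- ]*$")        # no digits in names
ADDRESS = re.compile(r"^[0-9]+ [A-Za-z0-9 ]+$")      # number then street, no symbols

def looks_like_phone(q):
    return bool(PHONE.match(q))

def looks_like_name(q):
    return bool(NAME.match(q))

def looks_like_address(q):
    return bool(ADDRESS.match(q))
```

Anything that fails its check is answered without touching the datastore at all.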
Step 2: Compact Approximators
You can get much better cache hits if you format valid queries the same way every time. Consider forcing the case of the query, stripping characters and whitespace, and adding an ignore list for things which may not be needed…
1313 MockingBird Lane
1313 MockingBird LN
1313 MockingBird ln
1313 Mockingbird Lane
…are all the same.
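A canonicalizer along those lines might look like this sketch; the ignore list and regex are illustrative assumptions, not a complete rule set:

```python
import re

# Illustrative ignore list of street-type words; a real one would be longer.
IGNORE = {"lane", "ln", "street", "st", "avenue", "ave", "road", "rd"}

def canonical(query):
    q = query.lower()                   # force case
    q = re.sub(r"[^a-z0-9\s]", "", q)   # strip symbols
    words = [w for w in q.split() if w not in IGNORE]
    return " ".join(words)              # collapse whitespace
```

All four variants of the address canonicalize to "1313 mockingbird", so they share a single cache entry.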
When I mentioned "short misses" I was looking for "compact approximators".
Converting "1313 Mockingbird Lane" to a compact approximator helps you avoid lookups for things you don't have.
A sample implementation might be to look only at the non-vowels, sans the street-type designation.
So you would look at 1313mckngbrd.
For the purposes of your Bloom lookup you would not include the city and state.
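A sketch of that approximator (dropping vowels and a hypothetical list of street-type words; both lists are my assumptions):

```python
import re

STREET_TYPES = {"lane", "ln", "street", "st", "avenue", "ave", "road", "rd"}
VOWELS = set("aeiou")

def approximate(address):
    # Lower-case, strip symbols, drop the street-type word...
    words = re.sub(r"[^a-z0-9\s]", "", address.lower()).split()
    words = [w for w in words if w not in STREET_TYPES]
    # ...then keep only the non-vowel characters.
    return "".join(c for w in words for c in w if c not in VOWELS)
```

"1313 Mockingbird Lane" comes out as "1313mckngbrd", which is what gets tested against the Bloom.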
This may not shrink your data set enough to make it fit in the Bloom, or you may want a tiered Bloom based on the number of unique entries: "have we got any data on addresses with the number 1313?" plus a street-name Bloom, and you only check the actual datastore if you have both a parcel with the number requested and the street requested. This is of course a bad example, since across all of the US nearly every number up to 100,000 is likely used, but I didn't want to change my example, so pretend that for whatever reason it works.
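A tiered check along those lines might be sketched like this; the Bloom class is a plain textbook implementation (the bit size and hash count are arbitrary demo values), not anything App Engine specific:

```python
import hashlib

class Bloom:
    """Textbook Bloom filter; sizes here are arbitrary for the demo."""
    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting a hash with the index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

numbers, streets = Bloom(), Bloom()
numbers.add("1313")
streets.add("mckngbrd")

def worth_querying(number, street):
    # Hit the datastore only when both tiers say "maybe".
    return numbers.might_contain(number) and streets.might_contain(street)
```

Either tier answering "definitely not" is enough to skip the datastore call entirely.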
Step 3: Keeping the bloom alive
Blooms are for speeding things up, so they don't have to be 100% up to date, but you'd rather they say "we might have this, look it up" than the definitive "we don't have this, don't bother". Because of this, you want to update your Bloom on the creation of new entries and on changes to searchable entries, but you don't need to update it on deletion.
You will therefore likely want a transaction log of recent writes, and to update the Bloom on a timer. When checking, you would search the transaction log, then the Bloom, then the datastore. This prevents false "we don't have this" answers on new data, while still keeping up with new data being added.
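That lookup order can be sketched like this; a plain Python set stands in for the Bloom filter, and the names are my own:

```python
recent_writes = set()   # transaction log of keys written since the last rebuild
bloom = set()           # stand-in for the Bloom filter's contents
datastore = {}

def write(key, value):
    datastore[key] = value
    recent_writes.add(key)   # the Bloom itself is only refreshed on a timer

def rebuild_bloom():
    bloom.update(recent_writes)  # timer-driven: fold the log into the Bloom
    recent_writes.clear()

def lookup(key):
    # Log first, then Bloom, then datastore.
    if key in recent_writes or key in bloom:
        return datastore.get(key)   # "might have it": pay for the real lookup
    return None                     # definite miss: skip the datastore
```

New writes are visible immediately via the log, so a slightly stale Bloom never produces a false "we don't have this".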
Thanks for the lib, Tom, but I'm in python-land so unfortunately I can't use it.