First off, I _really_ like the idea of AppEngine. We're hosting most
of our sites (versionsapp.com, checkoutapp.com) on it and it works
great for that purpose. Impressed as we were, I started trying to
build our first serious webapp on AppEngine; a small webstore. What I
basically wanted to do is building a page where you could browse 50
products at a time, optionally limited by a search or tag, based on
Django. All went well until I wanted to implement search...
I grabbed a few example datasets from users of our Mac POS app, they
mostly have around a 1000 products with a name, brand and description.
I knew substring search was hard to implement, so I skipped that and
used the undocumented ext.search models with my own set of search
words, mostly around 40 per product. Because some searches took quite
long I decided to slice the results by 50 at a time, not impressive
but good enough. And because product descriptions don't change often I
could use memcache to cache each result 'slice'.
To do so, I figured I needed to at least order by name, so the slicing
would make sense. Therefore I had to create an index on name,
searchIndex (set of searchable words). Worked great, until I tried to
type two words. I told me I had to create another index: name,
searchIndex, searchIndex. Seemed logical, but then people may want to
type more then 2 words at the same time so I added 5 indexes to get up
to five words.
From that moment inserting products started to take more then a
second, which I guess was due to the indexing, but even worse, I
started to get errors like BadRequestError: Too many indexed
properties for entity: app. I found out the max indexes for an entity
is capped at 5000, and calculated that if I were to use max 3 words
for a search I could only add 17 searchable words in an items' index
(17*17*17=4913). I'm still not sure that math is 100% correct, but at
least I was able to insert most of my products again, but searching is
just horribly slow.
So now I'm stuck with a _lot_ of limitations for a simple app: only 3
search terms, no substring searching, 50 results at a time and in my
opinion pretty bad performance, especially when inserting. Maybe I'm
doing something horribly wrong, and I don't consider myself the best
db programmer out there but this just seems silly.
Again, this is not to bash AppEngine because I really see the
potential and like the idea. But as it stands now it's just unusable
for us (a small indie dev company that wants their products to scale),
and it's hard for me to imagine that anyone else can live with these
limitations. I guess what I'm saying is: I love this thing to work for
us! We want to use it and pay for it! Seriously, please fix the
searching... you're Google for god sake :-)
I'm pretty new to app engine so there may be some
gotcha's in what I say, but I think my logic is ok.
It seems you have a pretty simple table with a few
fields (name, brand, description) and you want to be
able to search any one of the words in any of the
fields.
I'd suggest building a search term table. Keep your
current process of adding data to your table, your search
table will reference the data it contains.
The search table will be something like
class searchTerms(db.Model):
value = db.StringProperty()
key = db.KeyProperty() # If there's no such thing use the id
To build the search table tokenize the name, brand and
description so that you have a list of separate words.
Add each of these words to the search terms table with the
key that references the item in your data table.
When you want to search via keyword(s) you just query the
value field of the searchTerms table. You will however need
to do a query for each search term.
I thought about your idea before, but it also has huge downsides:
- Inserting a single entity with 40 token words (not uncommon) will
lead to 41 inserts, and inserting is BigTables biggest weakness. Same
goes for deleting.
- It's (I think) not possible with this setup to get the second slice
of results (50-100) from a query with three words, sorted by name. I
would have to do 3 queries, which will return more results than I'm
looking for, see if they are in the other two results (intersect), and
sort them.
Please correct me if I'm wrong...
Kindest regards,
Koen Bok - madebysofa.com
On Dec 8, 1:32 am, lock <lachlan.hu...@gmail.com> wrote:
> I'm pretty new to app engine so there may be some
> gotcha's in what I say, but I think my logic is ok.
> It seems you have a pretty simple table with a few
> fields (name, brand, description) and you want to be
> able to search any one of the words in any of the
> fields.
> I'd suggest building a search term table. Keep your
> current process of adding data to your table, your search
> table will reference the data it contains.
> The search table will be something like
> class searchTerms(db.Model):
> value = db.StringProperty()
> key = db.KeyProperty() # If there's no such thing use the id
> To build the search table tokenize the name, brand and
> description so that you have a list of separate words.
> Add each of these words to the search terms table with the
> key that references the item in your data table.
> When you want to search via keyword(s) you just query the
> value field of the searchTerms table. You will however need
> to do a query for each search term.
You could try culling some of the token words, the simple
ones like 'and', 'a', 'or' to limit the number of puts,
but that will only go so far.
I don't think sorting and checking for intersection will
cost you too much. Its the number of results returned from
the queries that chew up the CPU cycles.
Slicing/pagination doesn't really seem worthwhile if you've
got all the results (which would occur if you used this
approach).
I'm out of ideas, would be interested to know if you do find
a solution though. It sounds like something a lot of apps
would need to do.
> First off, I _really_ like the idea of AppEngine. We're hosting most
> of our sites (versionsapp.com, checkoutapp.com) on it and it works
> great for that purpose. Impressed as we were, I started trying to
> build our first serious webapp on AppEngine; a small webstore. What I
> basically wanted to do is building a page where you could browse 50
> products at a time, optionally limited by a search or tag, based on
> Django. All went well until I wanted to implement search...
> I grabbed a few example datasets from users of our Mac POS app, they
> mostly have around a 1000 products with a name, brand and description.
> I knew substring search was hard to implement, so I skipped that and
> used the undocumented ext.search models with my own set of search
> words, mostly around 40 per product. Because some searches took quite
> long I decided to slice the results by 50 at a time, not impressive
> but good enough. And because product descriptions don't change often I
> could use memcache to cache each result 'slice'.
> To do so, I figured I needed to at least order by name, so the slicing
> would make sense. Therefore I had to create an index on name,
> searchIndex (set of searchable words). Worked great, until I tried to
> type two words. I told me I had to create another index: name,
> searchIndex, searchIndex. Seemed logical, but then people may want to
> type more then 2 words at the same time so I added 5 indexes to get up
> to five words.
> From that moment inserting products started to take more then a
> second, which I guess was due to the indexing, but even worse, I
> started to get errors like BadRequestError: Too many indexed
> properties for entity: app. I found out the max indexes for an entity
> is capped at 5000, and calculated that if I were to use max 3 words
> for a search I could only add 17 searchable words in an items' index
> (17*17*17=4913). I'm still not sure that math is 100% correct, but at
> least I was able to insert most of my products again, but searching is
> just horribly slow.
> So now I'm stuck with a _lot_ of limitations for a simple app: only 3
> search terms, no substring searching, 50 results at a time and in my
> opinion pretty bad performance, especially when inserting. Maybe I'm
> doing something horribly wrong, and I don't consider myself the best
> db programmer out there but this just seems silly.
> Again, this is not to bash AppEngine because I really see the
> potential and like the idea. But as it stands now it's just unusable
> for us (a small indie dev company that wants their products to scale),
> and it's hard for me to imagine that anyone else can live with these
> limitations. I guess what I'm saying is: I love this thing to work for
> us! We want to use it and pay for it! Seriously, please fix the
> searching... you're Google for god sake :-)
There's a definite need for a full-text search API (particularly
substring matches) that just works for the datastore.
For public-facing information like your products, you might consider
outsourcing the search to something like Google Custom Search:
http://www.google.com/coop/cse/ (There's an add-free option if you want to cough up some money.)
> First off, I _really_ like the idea of AppEngine. We're hosting most
> of our sites (versionsapp.com, checkoutapp.com) on it and it works
> great for that purpose. Impressed as we were, I started trying to
> build our first serious webapp on AppEngine; a small webstore. What I
> basically wanted to do is building a page where you could browse 50
> products at a time, optionally limited by a search or tag, based on
> Django. All went well until I wanted to implement search...
> I grabbed a few example datasets from users of our Mac POS app, they
> mostly have around a 1000 products with a name, brand and description.
> I knew substring search was hard to implement, so I skipped that and
> used the undocumented ext.search models with my own set of search
> words, mostly around 40 per product. Because some searches took quite
> long I decided to slice the results by 50 at a time, not impressive
> but good enough. And because product descriptions don't change often I
> could use memcache to cache each result 'slice'.
> To do so, I figured I needed to at least order by name, so the slicing
> would make sense. Therefore I had to create an index on name,
> searchIndex (set of searchable words). Worked great, until I tried to
> type two words. I told me I had to create another index: name,
> searchIndex, searchIndex. Seemed logical, but then people may want to
> type more then 2 words at the same time so I added 5 indexes to get up
> to five words.
> From that moment inserting products started to take more then a
> second, which I guess was due to the indexing, but even worse, I
> started to get errors like BadRequestError: Too many indexed
> properties for entity: app. I found out the max indexes for an entity
> is capped at 5000, and calculated that if I were to use max 3 words
> for a search I could only add 17 searchable words in an items' index
> (17*17*17=4913). I'm still not sure that math is 100% correct, but at
> least I was able to insert most of my products again, but searching is
> just horribly slow.
> So now I'm stuck with a _lot_ of limitations for a simple app: only 3
> search terms, no substring searching, 50 results at a time and in my
> opinion pretty bad performance, especially when inserting. Maybe I'm
> doing something horribly wrong, and I don't consider myself the best
> db programmer out there but this just seems silly.
> Again, this is not to bash AppEngine because I really see the
> potential and like the idea. But as it stands now it's just unusable
> for us (a small indie dev company that wants their products to scale),
> and it's hard for me to imagine that anyone else can live with these
> limitations. I guess what I'm saying is: I love this thing to work for
> us! We want to use it and pay for it! Seriously, please fix the
> searching... you're Google for god sake :-)
Yes, you're encountering 'exploding indexes'. The number of index
entries needed for a 3-term index is, for example, n*(n-1)*(n-2).
My suggestion would be this:
- Keep indexes for one and two uses of the term list.
- Keep a separate entity type that tracks how many occurrences there
are of each term in the database.
- To search, look up the terms, find the two least common ones, and
query on those. Then filter the remaining results for any other terms.
On Dec 7, 9:37 pm, Koen Bok <k...@madebysofa.com> wrote:
> First off, I _really_ like the idea of AppEngine. We're hosting most
> of our sites (versionsapp.com, checkoutapp.com) on it and it works
> great for that purpose. Impressed as we were, I started trying to
> build our first serious webapp on AppEngine; a small webstore. What I
> basically wanted to do is building a page where you could browse 50
> products at a time, optionally limited by a search or tag, based on
> Django. All went well until I wanted to implement search...
> I grabbed a few example datasets from users of our Mac POS app, they
> mostly have around a 1000 products with a name, brand and description.
> I knew substring search was hard to implement, so I skipped that and
> used the undocumented ext.search models with my own set of search
> words, mostly around 40 per product. Because some searches took quite
> long I decided to slice the results by 50 at a time, not impressive
> but good enough. And because product descriptions don't change often I
> could use memcache to cache each result 'slice'.
> To do so, I figured I needed to at least order by name, so the slicing
> would make sense. Therefore I had to create an index on name,
> searchIndex (set of searchable words). Worked great, until I tried to
> type two words. I told me I had to create another index: name,
> searchIndex, searchIndex. Seemed logical, but then people may want to
> type more then 2 words at the same time so I added 5 indexes to get up
> to five words.
> From that moment inserting products started to take more then a
> second, which I guess was due to the indexing, but even worse, I
> started to get errors like BadRequestError: Too many indexed
> properties for entity: app. I found out the max indexes for an entity
> is capped at 5000, and calculated that if I were to use max 3 words
> for a search I could only add 17 searchable words in an items' index
> (17*17*17=4913). I'm still not sure that math is 100% correct, but at
> least I was able to insert most of my products again, but searching is
> just horribly slow.
> So now I'm stuck with a _lot_ of limitations for a simple app: only 3
> search terms, no substring searching, 50 results at a time and in my
> opinion pretty bad performance, especially when inserting. Maybe I'm
> doing something horribly wrong, and I don't consider myself the best
> db programmer out there but this just seems silly.
> Again, this is not to bash AppEngine because I really see the
> potential and like the idea. But as it stands now it's just unusable
> for us (a small indie dev company that wants their products to scale),
> and it's hard for me to imagine that anyone else can live with these
> limitations. I guess what I'm saying is: I love this thing to work for
> us! We want to use it and pay for it! Seriously, please fix the
> searching... you're Google for god sake :-)
I haven't tried this "workaround" yet, but until we have a Search API
available, it might work...
Try uploading your products to Google Base and then using their API to
search your own stuff. We upload all our books to google base, and if
I go to http://base.google.com/ and search for our domain and what I'm
looking for (e.g. betterworld.com google hacks ), it comes up with
lightning Google speed.
Note: I haven't checked the terms and conditions to see if this is
legal or not...