Approach to full text searching.

1,833 views
Skip to first unread message

Nerdy Tommy

unread,
Jun 9, 2008, 7:51:56 AM6/9/08
to Google App Engine
Hi all,

I was wondering what is the best approach to implement full text
search in the GAE application. Lets say we have to implement app that
will search companies by their names. There is no such syntax in GQL
like SQL 'LIKE' that we would usually hire to do the job in
'traditional' app.

I came with the idea to build the list of all substrings of the
company name for each company and keep the list along with the entity
in GAE Datastore. It's not so complex as the number of possible
substrings is limited and quite cheap to generate.

So, lets discuss the approach further. For the simplification let's
assume the company name is just one word (string of characters w/o any
spaces). The definition of the indexes has also been omitted (it's
just obvious).

The entity would be:

class Company(db.Model)
name = db.StringProperty()
keywords = db.StringListProperty()

On insert time we would generate list of all substrings of string
stored in 'name' property and set it to 'keyword' property. When it
comes to searching the query would look like following one:

Company.gql('WHERE keywords = :word', word='foo')

As a result we would get a list of all companies that name would match
the given substring of the name (a word in GQL).

What do you think of this approach? Is is valid of are there any
potential pitfalls? Any input would be highly appreciated.

Cheers,
Tommy






Nerdy Tommy

unread,
Jun 9, 2008, 4:53:58 PM6/9/08
to Google App Engine
Guy, does anyone have any ideas on the topic? Has anyone of you
implemented or tried to implement full-text search for GAE app?

Thanks,
Tomasz

Raffaele Sena

unread,
Jun 9, 2008, 5:39:33 PM6/9/08
to google-a...@googlegroups.com
Can't you use the SearchableQuery described in... the bulk upload page ?
http://code.google.com/appengine/articles/bulkload.html

Unfortunately the only place where it seems to be described is the
HowTo for bulk upload, but it should be easy to experiment with

Tim Hoffman

unread,
Jun 10, 2008, 1:15:45 AM6/10/08
to Google App Engine
Have you looked at the SearchableModel

db.search.SearchableModel ?

class SearchableModel(db.Model):
"""A subclass of db.Model that supports full text search and
indexing.

Automatically indexes all string-based properties. To search, use
the all()
method to get a SearchableModel.Query, then use its search() method.
"""



Rgds

Tim

Nerdy Tommy

unread,
Jun 10, 2008, 7:05:50 AM6/10/08
to Google App Engine
I indeed have looked at he SearchableModel. Yesterday I did a few
experiments and it didn't seem to do the good job, unfortunately. As
far as I've discovered all it does is the maintenance of the index of
string properties. The issue is that the index contains only the
*whole* words, which is in my opinion significant limitation and it's
far from SQL 'LIKE' construct.

Let's come back to our example from my initial post. We have a bunch
of companies and we want to efficiently search for those companies by
name.

Say, in data store we have list of three companies:
'Google'
'Microsoft'
'Sun Microsystems'

Let's consider a few search phrases and expected results as an
example:
'Google' > 'Google'
'GOOG' > 'Google'
'Apple' > None
'micro' > 'Microsoft', 'Sun Microsystems'

I think you've got the grasp what I meant referring to full-text
search now :)

What SearchableModel offers is simple extension of Query class that
allows to search for entities by whole words. In our example the
result using SearchableModel would be slightly different:
'Google' > 'Google'
'GOOG' > None
'Apple' > None
'micro' > None

What's more there is quite odd limitation of SearchableModel query.
When search phrase is *two* character long (or less) it returns the
whole list with all the results including all the irrelevant ones:
'XX' > 'Google', 'Microsoft', 'Sun Microsystems'
'' > 'Google', 'Microsoft', 'Sun Microsystems'

This is the reason I've started to think of approach to the real full-
text search. Can you please put some of your ideas on that? I'm not
expert in algorithms related to searching, but I really try to build
my app right. What do you think about the limitations of
SearchableModel? Do you guys from Google are going to introduce more
functionality as presented above?

Cheers,
Tommy






wave connexion(BQ)

unread,
Jun 10, 2008, 7:59:35 AM6/10/08
to google-a...@googlegroups.com
you might try to use regex with case-insensitive prior to SearchableModel
--
BQ

Nerdy Tommy

unread,
Jun 10, 2008, 8:38:30 AM6/10/08
to Google App Engine
Hi BQ,

Could you elaborate some more on your idea? I guess the regular
expression here would be great. As far as I know the only implemented
'operators' in GAE Query are: less than ('<'), less than or equal
('<='), greater than ('>'), greater than or equal ('>='), and equal
('=' and '=='). Therefore I can't see any way to use regular
expressions in queries. Or maybe I'm missing something?

Regards,
Tommy

Peter Dolan

unread,
Jun 10, 2008, 1:24:38 PM6/10/08
to google-a...@googlegroups.com
<plug for the system I built>
If you're looking to build a secondary index on your data with which to handle finely grained searches, you might find HTTPMR useful: http://httpmr.googlecode.com/.  The single sample that I have running (and testing) right now is actually one that constructs a token index of a set of documents for searching.  So, if the search index that AppEngine builds for you with SearchableModel doesn't work out, you can build your own.
</plug for the system I built>

Though actually, I suspect that you need to give up on the notion of a 'like' operator and work with the limitations of the SearchableModel.  'LIKE' doesn't scale well on a scalable datastore in the bigtable paradigm because you need to read every single entry in the database.  The same would hold true for regular expression matching operators.  Relational databases handle this, I think (and I'm not an expert), by maintaining indices that allow somewhat efficient prefix matching in the case of "... LIKE 'xxx%'", but also have to read every row for very aggressive matches like "... LIKE '%x%'".  Those are slow even on a relational database running on one machine, prohibitively expensive on a distributed database.

If you're looking for "Sun Microsystems", why doesn't segmenting the query into "Sun", "Microsystems", querying for both tokens, and taking the union of the resultset work?

- Peter

Alan

unread,
Jun 10, 2008, 6:40:55 PM6/10/08
to Google App Engine
Tommy - I've been playing around with something similar to what you
are asking about - using a ListProperty field called indexable_list to
store all the substrings from all the words in the entity.

See http://testalan1.appspot.com/

In the field at the bottom, enter "SU", and you see anything with a
word beginning with "SU"

now enter a space, and "M", and the list gets smaller.

Now enter a space, and a "P" (for Palo Alto).

This is where the problems come. Appengine says that it needs an
index with the indexable_list repeated 3 times (use Firebug to see
this):

NeedIndexError: no matching index found

This query needs this index:

- kind: Party1
properties:
- name: indexable_list
- name: indexable_list
- name: indexable_list
- name: name

If this index was created, would lead to a combinatorial explosion in
index size. (Every combination of 3 from the list is indexed)

Today there was a notification that index sizes are capped:
http://groups.google.com/group/google-appengine/browse_thread/thread/d5f4dcb7d00ed4c6#
http://groups.google.com/group/google-appengine/browse_thread/thread/392fd8e7be160086

Solution could be to take a maximum of 2 words from the user input
(the 2 longest words), perform the query, and do additional filtering
on the other words in code.

Nerdy Tommy

unread,
Jun 10, 2008, 7:03:16 PM6/10/08
to Google App Engine
Hi Peter,

I referred LIKE in my description only to make my explanation of the
problem easier to understand. I'm fully aware that queries based on
LIKE, wildcards and/or regular expressions don't scale and this is the
reason why those concepts doesn't exist in GAE data store. But there
is still open question how to implement efficient search not limited
only to the whole strings matching (equals operator).

I've done some research on basics of approximate string searching.
There are two main approaches: online-search and offline-search.
Mentioned before LIKE, pattern matching, etc. are mostly techniques of
the first approach. The latter one is based on string preprocessing
and index building. And this offline-search approach seems to be the
right way to go with GAE architecture. It would be really good to have
low-level implementation of any indexing techniques combined with high
level API for developer to easily and efficiently perform full-text
fuzzy searching. I believe this is definitely one of the ways GAE
should evolve in the near future.

But for now I still need to implement search in my simple app I've
been building recently. I'm not very keen to live with the limitations
of SearchableModel, so I still think it is possible to build index
using high-level API and GAE concepts. I'm going to go with the idea
of generation of list of all substrings of every word in the content
of my searchable StringProperty and store it is designated
StringListProperty. The algorithm that I use to generate list of
substrings is quite efficient, and not so complex (linear complexity,
I guess). I also have idea of some tweaks to make the list more
compact and easier to search upon.

I also did some thinking on more fuzzy-like search and came across
interesting theory about so called string k-neighborhood generation.
The idea is to generate list of all strings that distance (using
Levenshtein distance measure) from an original string is acceptable
and store these in the index as well. The ideas is quite fresh and it
seems to be doable only for relatively short strings, because of the
complexity of the algorithm that might be used to generate the
neighborhood (it's exponential, I guess).

I'm still open to discuss those and other ideas. Furthermore I'm going
to publish the source code of my app to present my approach in greater
details.

Cheers,
Tommy

Tim Hoffman

unread,
Jun 10, 2008, 9:25:41 PM6/10/08
to Google App Engine
Hi Peter

You might like to have a look at how textindexng is implemented as a
searchable index in Zope.

http://plone.org/products/textindexng

Still may not be what you want, but you may find some of the
approaches useful for creating your own
Searchable model

Rgds

Tim

Edoardo Marcora

unread,
Jun 10, 2008, 9:58:52 PM6/10/08
to Google App Engine
Damn, shouldn't datastore come with something better than
searchablemodel??? This is Google after all!!!

wave connexion(BQ)

unread,
Jun 11, 2008, 2:25:35 AM6/11/08
to google-a...@googlegroups.com
Hi Tommy,

I am not currently getting into SearchableModel detail. As Peter pointed out, regex approach might not scale in appengine. I recall some one posted similar question, searching "searchablequery" on "bulkupload" on this forum might be a help for you.
--
BQ

Ben the Indefatigable

unread,
Jun 11, 2008, 11:41:01 AM6/11/08
to Google App Engine
I've thought a lot about this, and implemented SearchableModel, and
came to the conclusion that if I need real full text search I can't
use google appengine. Even if SearchableModel was adequate in terms of
search quality, they said it won't scale. Ideas I've had for a
homemade search are problematic in terms of adding a page to the
indexes, whether it has 10 words or 1000 it will affect disparate
parts of the index and require too many writes. On the other hand, I
am guessing that within a couple of years google will have to come out
with a good solution for full text search, assuming appengine stays in
their priorities.

Waleed

unread,
Jun 11, 2008, 7:11:16 PM6/11/08
to Google App Engine
You might've considered this, but here goes just in case. Not knowing
more details about your requirements (like how many company names you
expect in your list), this might or might not work for you.

1. Create an index like this:

aa, list of entities that contian aa
ab, list of entities that contian ab
ac, list of entities that contian ac
....
ba, list of entities that contian ba
bb, list of entities that contian ba


2. If a user searchs for "abc", you pull the list of items that
contain "ab" and those contain "bc" and find the items that occur in
both lists. The number of lists you'll need to pull will depend on the
number of letters in the word (list n = len(word) -1) .




On Jun 10, 4:05 am, Nerdy Tommy <nerdy.to...@gmail.com> wrote:

Ben the Indefatigable

unread,
Jun 16, 2008, 5:33:51 PM6/16/08
to Google App Engine
Waleed, yes I have thought of that. Searching based on that kind of
design might be doable especially on simple searches. But I
specifically said the cost of *adding* a page to the indexes is where
the problem lies. If you have 10 to 1000 words in a page, that's
likely near ten to several hundred index entries you need to write.
So, adding a single page of text to your indexing system becomes a
very intensive process.

Jimpa

unread,
Jun 28, 2008, 3:27:15 PM6/28/08
to Google App Engine
I agree, haven't found a solution to this..... shame really

gime...@gmail.com

unread,
Jun 30, 2008, 10:00:55 AM6/30/08
to Google App Engine

Bruno Patini Furtado

unread,
Jun 30, 2008, 1:35:00 PM6/30/08
to google-a...@googlegroups.com
Have you considered PyLucene? http://pylucene.osafoundation.org/
--
"I am not young enough to know everything" --Oscar Wilde.
"I know nothing except the fact of my ignorance." --Socrates.
Bruno Patini Furtado
Software Developer
webpage: http://bpfurtado.net
text adventures suite: http://bpfurtado.net/tas
software development blog: http://bpfurtado.livejournal.com

Bruno Patini Furtado

unread,
Jun 30, 2008, 2:11:22 PM6/30/08
to google-a...@googlegroups.com
Forget PyLucene in this scenario as it uses the Java implementation of Lucene under the covers, won't work on the GoogLe Apps engine.

SMF

unread,
Mar 8, 2012, 9:24:04 AM3/8/12
to google-a...@googlegroups.com
I tried this too... its not worth doing.
managing index was consuming my free cpu quota (i am talking about earlier pricing model) based on cpu not instance hours
Reply all
Reply to author
Forward
0 new messages