I was wondering what is the best approach to implement full text
search in the GAE application. Lets say we have to implement app that
will search companies by their names. There is no such syntax in GQL
like SQL 'LIKE' that we would usually hire to do the job in
'traditional' app.
I came with the idea to build the list of all substrings of the
company name for each company and keep the list along with the entity
in GAE Datastore. It's not so complex as the number of possible
substrings is limited and quite cheap to generate.
So, lets discuss the approach further. For the simplification let's
assume the company name is just one word (string of characters w/o any
spaces). The definition of the indexes has also been omitted (it's
just obvious).
The entity would be:
class Company(db.Model)
name = db.StringProperty()
keywords = db.StringListProperty()
On insert time we would generate list of all substrings of string
stored in 'name' property and set it to 'keyword' property. When it
comes to searching the query would look like following one:
Company.gql('WHERE keywords = :word', word='foo')
As a result we would get a list of all companies that name would match
the given substring of the name (a word in GQL).
What do you think of this approach? Is is valid of are there any
potential pitfalls? Any input would be highly appreciated.
> I was wondering what is the best approach to implement full text
> search in the GAE application. Lets say we have to implement app that
> will search companies by their names. There is no such syntax in GQL
> like SQL 'LIKE' that we would usually hire to do the job in
> 'traditional' app.
> I came with the idea to build the list of all substrings of the
> company name for each company and keep the list along with the entity
> in GAE Datastore. It's not so complex as the number of possible
> substrings is limited and quite cheap to generate.
> So, lets discuss the approach further. For the simplification let's
> assume the company name is just one word (string of characters w/o any
> spaces). The definition of the indexes has also been omitted (it's
> just obvious).
> The entity would be:
> class Company(db.Model)
> name = db.StringProperty()
> keywords = db.StringListProperty()
> On insert time we would generate list of all substrings of string
> stored in 'name' property and set it to 'keyword' property. When it
> comes to searching the query would look like following one:
On Mon, Jun 9, 2008 at 1:53 PM, Nerdy Tommy <nerdy.to...@gmail.com> wrote:
> Guy, does anyone have any ideas on the topic? Has anyone of you > implemented or tried to implement full-text search for GAE app?
> Thanks, > Tomasz
> On Jun 9, 12:51 pm, Nerdy Tommy <nerdy.to...@gmail.com> wrote: >> Hi all,
>> I was wondering what is the best approach to implement full text >> search in the GAE application. Lets say we have to implement app that >> will search companies by their names. There is no such syntax in GQL >> like SQL 'LIKE' that we would usually hire to do the job in >> 'traditional' app.
>> I came with the idea to build the list of all substrings of the >> company name for each company and keep the list along with the entity >> in GAE Datastore. It's not so complex as the number of possible >> substrings is limited and quite cheap to generate.
>> So, lets discuss the approach further. For the simplification let's >> assume the company name is just one word (string of characters w/o any >> spaces). The definition of the indexes has also been omitted (it's >> just obvious).
>> The entity would be:
>> class Company(db.Model) >> name = db.StringProperty() >> keywords = db.StringListProperty()
>> On insert time we would generate list of all substrings of string >> stored in 'name' property and set it to 'keyword' property. When it >> comes to searching the query would look like following one:
> I was wondering what is the best approach to implement full text
> search in the GAE application. Lets say we have to implement app that
> will search companies by their names. There is no such syntax in GQL
> like SQL 'LIKE' that we would usually hire to do the job in
> 'traditional' app.
> I came with the idea to build the list of all substrings of the
> company name for each company and keep the list along with the entity
> in GAE Datastore. It's not so complex as the number of possible
> substrings is limited and quite cheap to generate.
> So, lets discuss the approach further. For the simplification let's
> assume the company name is just one word (string of characters w/o any
> spaces). The definition of the indexes has also been omitted (it's
> just obvious).
> The entity would be:
> class Company(db.Model)
> name = db.StringProperty()
> keywords = db.StringListProperty()
> On insert time we would generate list of all substrings of string
> stored in 'name' property and set it to 'keyword' property. When it
> comes to searching the query would look like following one:
I indeed have looked at he SearchableModel. Yesterday I did a few
experiments and it didn't seem to do the good job, unfortunately. As
far as I've discovered all it does is the maintenance of the index of
string properties. The issue is that the index contains only the
*whole* words, which is in my opinion significant limitation and it's
far from SQL 'LIKE' construct.
Let's come back to our example from my initial post. We have a bunch
of companies and we want to efficiently search for those companies by
name.
Say, in data store we have list of three companies:
'Google'
'Microsoft'
'Sun Microsystems'
Let's consider a few search phrases and expected results as an
example:
'Google' > 'Google'
'GOOG' > 'Google'
'Apple' > None
'micro' > 'Microsoft', 'Sun Microsystems'
I think you've got the grasp what I meant referring to full-text
search now :)
What SearchableModel offers is simple extension of Query class that
allows to search for entities by whole words. In our example the
result using SearchableModel would be slightly different:
'Google' > 'Google'
'GOOG' > None
'Apple' > None
'micro' > None
What's more there is quite odd limitation of SearchableModel query.
When search phrase is *two* character long (or less) it returns the
whole list with all the results including all the irrelevant ones:
'XX' > 'Google', 'Microsoft', 'Sun Microsystems'
'' > 'Google', 'Microsoft', 'Sun Microsystems'
This is the reason I've started to think of approach to the real full-
text search. Can you please put some of your ideas on that? I'm not
expert in algorithms related to searching, but I really try to build
my app right. What do you think about the limitations of
SearchableModel? Do you guys from Google are going to introduce more
functionality as presented above?
On Tue, Jun 10, 2008 at 7:05 PM, Nerdy Tommy <nerdy.to...@gmail.com> wrote:
> I indeed have looked at he SearchableModel. Yesterday I did a few > experiments and it didn't seem to do the good job, unfortunately. As > far as I've discovered all it does is the maintenance of the index of > string properties. The issue is that the index contains only the > *whole* words, which is in my opinion significant limitation and it's > far from SQL 'LIKE' construct.
> Let's come back to our example from my initial post. We have a bunch > of companies and we want to efficiently search for those companies by > name.
> Say, in data store we have list of three companies: > 'Google' > 'Microsoft' > 'Sun Microsystems'
> Let's consider a few search phrases and expected results as an > example: > 'Google' > 'Google' > 'GOOG' > 'Google' > 'Apple' > None > 'micro' > 'Microsoft', 'Sun Microsystems'
> I think you've got the grasp what I meant referring to full-text > search now :)
> What SearchableModel offers is simple extension of Query class that > allows to search for entities by whole words. In our example the > result using SearchableModel would be slightly different: > 'Google' > 'Google' > 'GOOG' > None > 'Apple' > None > 'micro' > None
> What's more there is quite odd limitation of SearchableModel query. > When search phrase is *two* character long (or less) it returns the > whole list with all the results including all the irrelevant ones: > 'XX' > 'Google', 'Microsoft', 'Sun Microsystems' > '' > 'Google', 'Microsoft', 'Sun Microsystems'
> This is the reason I've started to think of approach to the real full- > text search. Can you please put some of your ideas on that? I'm not > expert in algorithms related to searching, but I really try to build > my app right. What do you think about the limitations of > SearchableModel? Do you guys from Google are going to introduce more > functionality as presented above?
Could you elaborate some more on your idea? I guess the regular
expression here would be great. As far as I know the only implemented
'operators' in GAE Query are: less than ('<'), less than or equal
('<='), greater than ('>'), greater than or equal ('>='), and equal
('=' and '=='). Therefore I can't see any way to use regular
expressions in queries. Or maybe I'm missing something?
<plug for the system I built> If you're looking to build a secondary index on your data with which to handle finely grained searches, you might find HTTPMR useful: http://httpmr.googlecode.com/. The single sample that I have running (and testing) right now is actually one that constructs a token index of a set of documents for searching. So, if the search index that AppEngine builds for you with SearchableModel doesn't work out, you can build your own. </plug for the system I built>
Though actually, I suspect that you need to give up on the notion of a 'like' operator and work with the limitations of the SearchableModel. 'LIKE' doesn't scale well on a scalable datastore in the bigtable paradigm because you need to read every single entry in the database. The same would hold true for regular expression matching operators. Relational databases handle this, I think (and I'm not an expert), by maintaining indices that allow somewhat efficient prefix matching in the case of "... LIKE 'xxx%'", but also have to read every row for very aggressive matches like "... LIKE '%x%'". Those are slow even on a relational database running on one machine, prohibitively expensive on a distributed database.
If you're looking for "Sun Microsystems", why doesn't segmenting the query into "Sun", "Microsystems", querying for both tokens, and taking the union of the resultset work?
On Tue, Jun 10, 2008 at 4:05 AM, Nerdy Tommy <nerdy.to...@gmail.com> wrote:
> I indeed have looked at he SearchableModel. Yesterday I did a few > experiments and it didn't seem to do the good job, unfortunately. As > far as I've discovered all it does is the maintenance of the index of > string properties. The issue is that the index contains only the > *whole* words, which is in my opinion significant limitation and it's > far from SQL 'LIKE' construct.
> Let's come back to our example from my initial post. We have a bunch > of companies and we want to efficiently search for those companies by > name.
> Say, in data store we have list of three companies: > 'Google' > 'Microsoft' > 'Sun Microsystems'
> Let's consider a few search phrases and expected results as an > example: > 'Google' > 'Google' > 'GOOG' > 'Google' > 'Apple' > None > 'micro' > 'Microsoft', 'Sun Microsystems'
> I think you've got the grasp what I meant referring to full-text > search now :)
> What SearchableModel offers is simple extension of Query class that > allows to search for entities by whole words. In our example the > result using SearchableModel would be slightly different: > 'Google' > 'Google' > 'GOOG' > None > 'Apple' > None > 'micro' > None
> What's more there is quite odd limitation of SearchableModel query. > When search phrase is *two* character long (or less) it returns the > whole list with all the results including all the irrelevant ones: > 'XX' > 'Google', 'Microsoft', 'Sun Microsystems' > '' > 'Google', 'Microsoft', 'Sun Microsystems'
> This is the reason I've started to think of approach to the real full- > text search. Can you please put some of your ideas on that? I'm not > expert in algorithms related to searching, but I really try to build > my app right. What do you think about the limitations of > SearchableModel? Do you guys from Google are going to introduce more > functionality as presented above?
Tommy - I've been playing around with something similar to what you
are asking about - using a ListProperty field called indexable_list to
store all the substrings from all the words in the entity.
Solution could be to take a maximum of 2 words from the user input
(the 2 longest words), perform the query, and do additional filtering
on the other words in code.
I referred LIKE in my description only to make my explanation of the
problem easier to understand. I'm fully aware that queries based on
LIKE, wildcards and/or regular expressions don't scale and this is the
reason why those concepts doesn't exist in GAE data store. But there
is still open question how to implement efficient search not limited
only to the whole strings matching (equals operator).
I've done some research on basics of approximate string searching.
There are two main approaches: online-search and offline-search.
Mentioned before LIKE, pattern matching, etc. are mostly techniques of
the first approach. The latter one is based on string preprocessing
and index building. And this offline-search approach seems to be the
right way to go with GAE architecture. It would be really good to have
low-level implementation of any indexing techniques combined with high
level API for developer to easily and efficiently perform full-text
fuzzy searching. I believe this is definitely one of the ways GAE
should evolve in the near future.
But for now I still need to implement search in my simple app I've
been building recently. I'm not very keen to live with the limitations
of SearchableModel, so I still think it is possible to build index
using high-level API and GAE concepts. I'm going to go with the idea
of generation of list of all substrings of every word in the content
of my searchable StringProperty and store it is designated
StringListProperty. The algorithm that I use to generate list of
substrings is quite efficient, and not so complex (linear complexity,
I guess). I also have idea of some tweaks to make the list more
compact and easier to search upon.
I also did some thinking on more fuzzy-like search and came across
interesting theory about so called string k-neighborhood generation.
The idea is to generate list of all strings that distance (using
Levenshtein distance measure) from an original string is acceptable
and store these in the index as well. The ideas is quite fresh and it
seems to be doable only for relatively short strings, because of the
complexity of the algorithm that might be used to generate the
neighborhood (it's exponential, I guess).
I'm still open to discuss those and other ideas. Furthermore I'm going
to publish the source code of my app to present my approach in greater
details.
> I referred LIKE in my description only to make my explanation of the
> problem easier to understand. I'm fully aware that queries based on
> LIKE, wildcards and/or regular expressions don't scale and this is the
> reason why those concepts doesn't exist in GAE data store. But there
> is still open question how to implement efficient search not limited
> only to the whole strings matching (equals operator).
> I've done some research on basics of approximate string searching.
> There are two main approaches: online-search and offline-search.
> Mentioned before LIKE, pattern matching, etc. are mostly techniques of
> the first approach. The latter one is based on string preprocessing
> and index building. And this offline-search approach seems to be the
> right way to go with GAE architecture. It would be really good to have
> low-level implementation of any indexing techniques combined with high
> level API for developer to easily and efficiently perform full-text
> fuzzy searching. I believe this is definitely one of the ways GAE
> should evolve in the near future.
> But for now I still need to implement search in my simple app I've
> been building recently. I'm not very keen to live with the limitations
> of SearchableModel, so I still think it is possible to build index
> using high-level API and GAE concepts. I'm going to go with the idea
> of generation of list of all substrings of every word in the content
> of my searchable StringProperty and store it is designated
> StringListProperty. The algorithm that I use to generate list of
> substrings is quite efficient, and not so complex (linear complexity,
> I guess). I also have idea of some tweaks to make the list more
> compact and easier to search upon.
> I also did some thinking on more fuzzy-like search and came across
> interesting theory about so called string k-neighborhood generation.
> The idea is to generate list of all strings that distance (using
> Levenshtein distance measure) from an original string is acceptable
> and store these in the index as well. The ideas is quite fresh and it
> seems to be doable only for relatively short strings, because of the
> complexity of the algorithm that might be used to generate the
> neighborhood (it's exponential, I guess).
> I'm still open to discuss those and other ideas. Furthermore I'm going
> to publish the source code of my app to present my approach in greater
> details.
> I indeed have looked at he SearchableModel. Yesterday I did a few
> experiments and it didn't seem to do the good job, unfortunately. As
> far as I've discovered all it does is the maintenance of the index of
> string properties. The issue is that the index contains only the
> *whole* words, which is in my opinion significant limitation and it's
> far from SQL 'LIKE' construct.
> Let's come back to our example from my initial post. We have a bunch
> of companies and we want to efficiently search for those companies by
> name.
> Say, in data store we have list of three companies:
> 'Google'
> 'Microsoft'
> 'Sun Microsystems'
> Let's consider a few search phrases and expected results as an
> example:
> 'Google' > 'Google'
> 'GOOG' > 'Google'
> 'Apple' > None
> 'micro' > 'Microsoft', 'Sun Microsystems'
> I think you've got the grasp what I meant referring to full-text
> search now :)
> What SearchableModel offers is simple extension of Query class that
> allows to search for entities by whole words. In our example the
> result using SearchableModel would be slightly different:
> 'Google' > 'Google'
> 'GOOG' > None
> 'Apple' > None
> 'micro' > None
> What's more there is quite odd limitation of SearchableModel query.
> When search phrase is *two* character long (or less) it returns the
> whole list with all the results including all the irrelevant ones:
> 'XX' > 'Google', 'Microsoft', 'Sun Microsystems'
> '' > 'Google', 'Microsoft', 'Sun Microsystems'
> This is the reason I've started to think of approach to the real full-
> text search. Can you please put some of your ideas on that? I'm not
> expert in algorithms related to searching, but I really try to build
> my app right. What do you think about the limitations of
> SearchableModel? Do you guys from Google are going to introduce more
> functionality as presented above?
I am not currently getting into SearchableModel detail. As Peter pointed out, regex approach might not scale in appengine. I recall some one posted similar question, searching "searchablequery" on "bulkupload" on this forum might be a help for you.
On Tue, Jun 10, 2008 at 8:38 PM, Nerdy Tommy <nerdy.to...@gmail.com> wrote:
> Hi BQ,
> Could you elaborate some more on your idea? I guess the regular > expression here would be great. As far as I know the only implemented > 'operators' in GAE Query are: less than ('<'), less than or equal > ('<='), greater than ('>'), greater than or equal ('>='), and equal > ('=' and '=='). Therefore I can't see any way to use regular > expressions in queries. Or maybe I'm missing something?
I've thought a lot about this, and implemented SearchableModel, and
came to the conclusion that if I need real full text search I can't
use google appengine. Even if SearchableModel was adequate in terms of
search quality, they said it won't scale. Ideas I've had for a
homemade search are problematic in terms of adding a page to the
indexes, whether it has 10 words or 1000 it will affect disparate
parts of the index and require too many writes. On the other hand, I
am guessing that within a couple of years google will have to come out
with a good solution for full text search, assuming appengine stays in
their priorities.
You might've considered this, but here goes just in case. Not knowing
more details about your requirements (like how many company names you
expect in your list), this might or might not work for you.
1. Create an index like this:
aa, list of entities that contian aa
ab, list of entities that contian ab
ac, list of entities that contian ac
....
ba, list of entities that contian ba
bb, list of entities that contian ba
2. If a user searchs for "abc", you pull the list of items that
contain "ab" and those contain "bc" and find the items that occur in
both lists. The number of lists you'll need to pull will depend on the
number of letters in the word (list n = len(word) -1) .
On Jun 10, 4:05 am, Nerdy Tommy <nerdy.to...@gmail.com> wrote:
> I indeed have looked at he SearchableModel. Yesterday I did a few
> experiments and it didn't seem to do the good job, unfortunately. As
> far as I've discovered all it does is the maintenance of the index of
> string properties. The issue is that the index contains only the
> *whole* words, which is in my opinion significant limitation and it's
> far from SQL 'LIKE' construct.
> Let's come back to our example from my initial post. We have a bunch
> of companies and we want to efficiently search for those companies by
> name.
> Say, in data store we have list of three companies:
> 'Google'
> 'Microsoft'
> 'Sun Microsystems'
> Let's consider a few search phrases and expected results as an
> example:
> 'Google' > 'Google'
> 'GOOG' > 'Google'
> 'Apple' > None
> 'micro' > 'Microsoft', 'Sun Microsystems'
> I think you've got the grasp what I meant referring to full-text
> search now :)
> What SearchableModel offers is simple extension of Query class that
> allows to search for entities by whole words. In our example the
> result using SearchableModel would be slightly different:
> 'Google' > 'Google'
> 'GOOG' > None
> 'Apple' > None
> 'micro' > None
> What's more there is quite odd limitation of SearchableModel query.
> When search phrase is *two* character long (or less) it returns the
> whole list with all the results including all the irrelevant ones:
> 'XX' > 'Google', 'Microsoft', 'Sun Microsystems'
> '' > 'Google', 'Microsoft', 'Sun Microsystems'
> This is the reason I've started to think of approach to the real full-
> text search. Can you please put some of your ideas on that? I'm not
> expert in algorithms related to searching, but I really try to build
> my app right. What do you think about the limitations of
> SearchableModel? Do you guys from Google are going to introduce more
> functionality as presented above?
Waleed, yes I have thought of that. Searching based on that kind of
design might be doable especially on simple searches. But I
specifically said the cost of *adding* a page to the indexes is where
the problem lies. If you have 10 to 1000 words in a page, that's
likely near ten to several hundred index entries you need to write.
So, adding a single page of text to your indexing system becomes a
very intensive process.
On Jun 16, 10:33 pm, Ben the Indefatigable <bcbry...@gmail.com> wrote:
> Waleed, yes I have thought of that. Searching based on that kind of
> design might be doable especially on simple searches. But I
> specifically said the cost of *adding* a page to the indexes is where
> the problem lies. If you have 10 to 1000 words in a page, that's
> likely near ten to several hundred index entries you need to write.
> So, adding a single page of text to your indexing system becomes a
> very intensive process.
I agree, haven't found a solution to this..... shame really
> On Jun 16, 10:33 pm, Ben the Indefatigable <bcbry...@gmail.com> wrote:
> > Waleed, yes I have thought of that. Searching based on that kind of
> > design might be doable especially on simple searches. But I
> > specifically said the cost of *adding* a page to the indexes is where
> > the problem lies. If you have 10 to 1000 words in a page, that's
> > likely near ten to several hundred index entries you need to write.
> > So, adding a single page of text to your indexing system becomes a
> > very intensive process.
> I agree, haven't found a solution to this..... shame really
> On Jun 28, 9:27 pm, Jimpa <jamie.nichol...@gmail.com> wrote: > > On Jun 16, 10:33 pm, Ben the Indefatigable <bcbry...@gmail.com> wrote:
> > > Waleed, yes I have thought of that. Searching based on that kind of > > > design might be doable especially on simple searches. But I > > > specifically said the cost of *adding* a page to the indexes is where > > > the problem lies. If you have 10 to 1000 words in a page, that's > > > likely near ten to several hundred index entries you need to write. > > > So, adding a single page of text to your indexing system becomes a > > > very intensive process.
> > I agree, haven't found a solution to this..... shame really
>> On Jun 28, 9:27 pm, Jimpa <jamie.nichol...@gmail.com> wrote: >> > On Jun 16, 10:33 pm, Ben the Indefatigable <bcbry...@gmail.com> wrote:
>> > > Waleed, yes I have thought of that. Searching based on that kind of >> > > design might be doable especially on simple searches. But I >> > > specifically said the cost of *adding* a page to the indexes is where >> > > the problem lies. If you have 10 to 1000 words in a page, that's >> > > likely near ten to several hundred index entries you need to write. >> > > So, adding a single page of text to your indexing system becomes a >> > > very intensive process.
>> > I agree, haven't found a solution to this..... shame really
> -- > "I am not young enough to know everything" --Oscar Wilde. > "I know nothing except the fact of my ignorance." --Socrates. > Bruno Patini Furtado > Software Developer > webpage: http://bpfurtado.net > text adventures suite: http://bpfurtado.net/tas > software development blog: http://bpfurtado.livejournal.com