Google Groups Home
Help | Sign in
Approach to full text searching.
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  20 messages - Collapse all
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
Nerdy Tommy  
View profile
(1 user)  More options Jun 9, 7:51 am
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Mon, 9 Jun 2008 04:51:56 -0700 (PDT)
Local: Mon, Jun 9 2008 7:51 am
Subject: Approach to full text searching.
Hi all,

I was wondering what is the best approach to implement full text
search in the GAE application. Lets say we have to implement app that
will search companies by their names. There is no such syntax in GQL
like SQL 'LIKE' that we would usually hire to do the job in
'traditional' app.

I came with the idea to build the list of all substrings of the
company name for each company and keep the list along with the entity
in GAE Datastore. It's not so complex as the number of possible
substrings is limited and quite cheap to generate.

So, lets discuss the approach further. For the simplification let's
assume the company name is just one word (string of characters w/o any
spaces). The definition of the indexes has also been omitted (it's
just obvious).

The entity would be:

class Company(db.Model)
  name = db.StringProperty()
  keywords = db.StringListProperty()

On insert time we would generate list of all substrings of string
stored in 'name' property and set it to 'keyword' property. When it
comes to searching the query would look like following one:

Company.gql('WHERE keywords = :word', word='foo')

As a result we would get a list of all companies that name would match
the given substring of the name (a word in GQL).

What do you think of this approach? Is is valid of are there any
potential pitfalls? Any input would be highly appreciated.

Cheers,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile
(1 user)  More options Jun 9, 4:53 pm
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Mon, 9 Jun 2008 13:53:58 -0700 (PDT)
Local: Mon, Jun 9 2008 4:53 pm
Subject: Re: Approach to full text searching.
Guy, does anyone have any ideas on the topic? Has anyone of you
implemented or tried to implement full-text search for GAE app?

Thanks,
Tomasz

On Jun 9, 12:51 pm, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Raffaele Sena  
View profile
 More options Jun 9, 5:39 pm
From: "Raffaele Sena" <raff...@gmail.com>
Date: Mon, 9 Jun 2008 14:39:33 -0700
Local: Mon, Jun 9 2008 5:39 pm
Subject: Re: [google-appengine] Re: Approach to full text searching.
Can't you use the SearchableQuery described in... the bulk upload page ?
http://code.google.com/appengine/articles/bulkload.html

Unfortunately the only place where it seems to be described is the
HowTo for bulk upload, but it should be easy to experiment with


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Hoffman  
View profile
 More options Jun 10, 1:15 am
From: Tim Hoffman <zutes...@gmail.com>
Date: Mon, 9 Jun 2008 22:15:45 -0700 (PDT)
Local: Tues, Jun 10 2008 1:15 am
Subject: Re: Approach to full text searching.
Have you looked at the SearchableModel

db.search.SearchableModel ?

class SearchableModel(db.Model):
  """A subclass of db.Model that supports full text search and
indexing.

  Automatically indexes all string-based properties. To search, use
the all()
  method to get a SearchableModel.Query, then use its search() method.
  """

Rgds

Tim

On Jun 9, 7:51 pm, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile
 More options Jun 10, 7:05 am
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Tue, 10 Jun 2008 04:05:50 -0700 (PDT)
Local: Tues, Jun 10 2008 7:05 am
Subject: Re: Approach to full text searching.
I indeed have looked at he SearchableModel. Yesterday I did a few
experiments and it didn't seem to do the good job, unfortunately. As
far as I've discovered all it does is the maintenance of the index of
string properties. The issue is that the index contains only the
*whole* words, which is in my opinion significant limitation and it's
far from SQL 'LIKE' construct.

Let's come back to our example from my initial post. We have a bunch
of companies and we want to efficiently search for those companies by
name.

Say, in data store we have list of three companies:
'Google'
'Microsoft'
'Sun Microsystems'

Let's consider a few search phrases and expected results as an
example:
'Google' > 'Google'
'GOOG' > 'Google'
'Apple' > None
'micro' > 'Microsoft', 'Sun Microsystems'

I think you've got the grasp what I meant referring to full-text
search now :)

What SearchableModel offers is simple extension of Query class that
allows to search for entities by whole words. In our example the
result using SearchableModel would be slightly different:
'Google' > 'Google'
'GOOG' > None
'Apple' > None
'micro' > None

What's more there is quite odd limitation of SearchableModel query.
When search phrase is *two* character long (or less) it returns the
whole list with all the results including all the irrelevant ones:
'XX' > 'Google', 'Microsoft', 'Sun Microsystems'
'' > 'Google', 'Microsoft', 'Sun Microsystems'

This is the reason I've started to think of approach to the real full-
text search. Can you please put some of your ideas on that? I'm not
expert in algorithms related to searching, but I really try to build
my app right. What do you think about the limitations of
SearchableModel? Do you guys from Google are going to introduce more
functionality as presented above?

Cheers,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wave connexion(BQ)  
View profile
 More options Jun 10, 7:59 am
From: "wave connexion(BQ)" <waveconnex...@gmail.com>
Date: Tue, 10 Jun 2008 19:59:35 +0800
Local: Tues, Jun 10 2008 7:59 am
Subject: Re: [google-appengine] Re: Approach to full text searching.

you might try to use regex with case-insensitive prior to SearchableModel

--
BQ

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile
 More options Jun 10, 8:38 am
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Tue, 10 Jun 2008 05:38:30 -0700 (PDT)
Local: Tues, Jun 10 2008 8:38 am
Subject: Re: Approach to full text searching.
Hi BQ,

Could you elaborate some more on your idea? I guess the regular
expression here would be great. As far as I know the only implemented
'operators' in GAE Query are: less than ('<'), less than or equal
('<='), greater than ('>'), greater than or equal ('>='), and equal
('=' and '=='). Therefore I can't see any way to use regular
expressions in queries. Or maybe I'm missing something?

Regards,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter Dolan  
View profile
 More options Jun 10, 1:24 pm
From: "Peter Dolan" <peterjdo...@gmail.com>
Date: Tue, 10 Jun 2008 10:24:38 -0700
Local: Tues, Jun 10 2008 1:24 pm
Subject: Re: [google-appengine] Re: Approach to full text searching.

<plug for the system I built>
If you're looking to build a secondary index on your data with which to
handle finely grained searches, you might find HTTPMR useful:
http://httpmr.googlecode.com/.  The single sample that I have running (and
testing) right now is actually one that constructs a token index of a set of
documents for searching.  So, if the search index that AppEngine builds for
you with SearchableModel doesn't work out, you can build your own.
</plug for the system I built>

Though actually, I suspect that you need to give up on the notion of a
'like' operator and work with the limitations of the SearchableModel.
'LIKE' doesn't scale well on a scalable datastore in the bigtable paradigm
because you need to read every single entry in the database.  The same would
hold true for regular expression matching operators.  Relational databases
handle this, I think (and I'm not an expert), by maintaining indices that
allow somewhat efficient prefix matching in the case of "... LIKE 'xxx%'",
but also have to read every row for very aggressive matches like "... LIKE
'%x%'".  Those are slow even on a relational database running on one
machine, prohibitively expensive on a distributed database.

If you're looking for "Sun Microsystems", why doesn't segmenting the query
into "Sun", "Microsystems", querying for both tokens, and taking the union
of the resultset work?

- Peter


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alan  
View profile
 More options Jun 10, 6:40 pm
From: Alan <alan.fotherg...@gmail.com>
Date: Tue, 10 Jun 2008 15:40:55 -0700 (PDT)
Local: Tues, Jun 10 2008 6:40 pm
Subject: Re: Approach to full text searching.
Tommy - I've been playing around with something similar to what you
are asking about - using a ListProperty field called indexable_list to
store all the substrings from all the words in the entity.

See http://testalan1.appspot.com/

In the field at the bottom, enter "SU", and you see anything with a
word beginning with "SU"

now enter a space, and "M", and the list gets smaller.

Now enter a space, and a "P" (for Palo Alto).

This is where the problems come.  Appengine says that it needs an
index with the  indexable_list repeated 3 times (use Firebug to see
this):

    NeedIndexError: no matching index found

    This query needs this index:

    - kind: Party1
      properties:
      - name: indexable_list
      - name: indexable_list
      - name: indexable_list
      - name: name

If this index was created, would lead to a combinatorial explosion in
index size. (Every combination of 3 from the list is indexed)

Today there was a notification that index sizes are capped:
http://groups.google.com/group/google-appengine/browse_thread/thread/...
http://groups.google.com/group/google-appengine/browse_thread/thread/...

Solution could be to take a maximum of 2 words from the user input
(the 2 longest words), perform the query, and do additional filtering
on the other words in code.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile
 More options Jun 10, 7:03 pm
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Tue, 10 Jun 2008 16:03:16 -0700 (PDT)
Local: Tues, Jun 10 2008 7:03 pm
Subject: Re: Approach to full text searching.
Hi Peter,

I referred LIKE in my description only to make my explanation of the
problem easier to understand. I'm fully aware that queries based on
LIKE, wildcards and/or regular expressions don't scale and this is the
reason why those concepts doesn't exist in GAE data store. But there
is still open question how to implement efficient search not limited
only to the whole strings matching (equals operator).

I've done some research on basics of approximate string searching.
There are two main approaches: online-search and offline-search.
Mentioned before LIKE, pattern matching, etc. are mostly techniques of
the first approach. The latter one is based on string preprocessing
and index building. And this offline-search approach seems to be the
right way to go with GAE architecture. It would be really good to have
low-level implementation of any indexing techniques combined with high
level API for developer to easily and efficiently perform full-text
fuzzy searching. I believe this is definitely one of the ways GAE
should evolve in the near future.

But for now I still need to implement search in my simple app I've
been building recently. I'm not very keen to live with the limitations
of SearchableModel, so I still think it is possible to build index
using high-level API and GAE concepts. I'm going to go with the idea
of generation of list of all substrings of every word in the content
of my searchable StringProperty and store it is designated
StringListProperty. The algorithm that I use to generate list of
substrings is quite efficient, and not so complex (linear complexity,
I guess). I also have idea of some tweaks to make the list more
compact and easier to search upon.

I also did some thinking on more fuzzy-like search and came across
interesting theory about so called string k-neighborhood generation.
The idea is to generate list of all strings that distance (using
Levenshtein distance measure) from an original string is acceptable
and store these in the index as well. The ideas is quite fresh and it
seems to be doable only for relatively short strings, because of the
complexity of the algorithm that might be used to generate the
neighborhood (it's exponential, I guess).

I'm still open to discuss those and other ideas. Furthermore I'm going
to publish the source code of my app to present my approach in greater
details.

Cheers,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Hoffman  
View profile
 More options Jun 10, 9:25 pm
From: Tim Hoffman <zutes...@gmail.com>
Date: Tue, 10 Jun 2008 18:25:41 -0700 (PDT)
Local: Tues, Jun 10 2008 9:25 pm
Subject: Re: Approach to full text searching.
Hi Peter

You might like to have a look