Web Images Videos Maps News Shopping Gmail more »
Recently Visited Groups | Help | Sign in
Google Groups Home
Approach to full text searching.
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  20 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Nerdy Tommy  
View profile  
(1 user)  More options Jun 9 2008, 7:51 am
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Mon, 9 Jun 2008 04:51:56 -0700 (PDT)
Local: Mon, Jun 9 2008 7:51 am
Subject: Approach to full text searching.
Hi all,

I was wondering what is the best approach to implement full text
search in the GAE application. Lets say we have to implement app that
will search companies by their names. There is no such syntax in GQL
like SQL 'LIKE' that we would usually hire to do the job in
'traditional' app.

I came with the idea to build the list of all substrings of the
company name for each company and keep the list along with the entity
in GAE Datastore. It's not so complex as the number of possible
substrings is limited and quite cheap to generate.

So, lets discuss the approach further. For the simplification let's
assume the company name is just one word (string of characters w/o any
spaces). The definition of the indexes has also been omitted (it's
just obvious).

The entity would be:

class Company(db.Model)
  name = db.StringProperty()
  keywords = db.StringListProperty()

On insert time we would generate list of all substrings of string
stored in 'name' property and set it to 'keyword' property. When it
comes to searching the query would look like following one:

Company.gql('WHERE keywords = :word', word='foo')

As a result we would get a list of all companies that name would match
the given substring of the name (a word in GQL).

What do you think of this approach? Is is valid of are there any
potential pitfalls? Any input would be highly appreciated.

Cheers,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile  
(1 user)  More options Jun 9 2008, 4:53 pm
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Mon, 9 Jun 2008 13:53:58 -0700 (PDT)
Local: Mon, Jun 9 2008 4:53 pm
Subject: Re: Approach to full text searching.
Guy, does anyone have any ideas on the topic? Has anyone of you
implemented or tried to implement full-text search for GAE app?

Thanks,
Tomasz

On Jun 9, 12:51 pm, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Raffaele Sena  
View profile  
 More options Jun 9 2008, 5:39 pm
From: "Raffaele Sena" <raff...@gmail.com>
Date: Mon, 9 Jun 2008 14:39:33 -0700
Local: Mon, Jun 9 2008 5:39 pm
Subject: Re: [google-appengine] Re: Approach to full text searching.
Can't you use the SearchableQuery described in... the bulk upload page ?
http://code.google.com/appengine/articles/bulkload.html

Unfortunately the only place where it seems to be described is the
HowTo for bulk upload, but it should be easy to experiment with


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Hoffman  
View profile  
 More options Jun 10 2008, 1:15 am
From: Tim Hoffman <zutes...@gmail.com>
Date: Mon, 9 Jun 2008 22:15:45 -0700 (PDT)
Local: Tues, Jun 10 2008 1:15 am
Subject: Re: Approach to full text searching.
Have you looked at the SearchableModel

db.search.SearchableModel ?

class SearchableModel(db.Model):
  """A subclass of db.Model that supports full text search and
indexing.

  Automatically indexes all string-based properties. To search, use
the all()
  method to get a SearchableModel.Query, then use its search() method.
  """

Rgds

Tim

On Jun 9, 7:51 pm, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile  
(1 user)  More options Jun 10 2008, 7:05 am
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Tue, 10 Jun 2008 04:05:50 -0700 (PDT)
Local: Tues, Jun 10 2008 7:05 am
Subject: Re: Approach to full text searching.
I indeed have looked at he SearchableModel. Yesterday I did a few
experiments and it didn't seem to do the good job, unfortunately. As
far as I've discovered all it does is the maintenance of the index of
string properties. The issue is that the index contains only the
*whole* words, which is in my opinion significant limitation and it's
far from SQL 'LIKE' construct.

Let's come back to our example from my initial post. We have a bunch
of companies and we want to efficiently search for those companies by
name.

Say, in data store we have list of three companies:
'Google'
'Microsoft'
'Sun Microsystems'

Let's consider a few search phrases and expected results as an
example:
'Google' > 'Google'
'GOOG' > 'Google'
'Apple' > None
'micro' > 'Microsoft', 'Sun Microsystems'

I think you've got the grasp what I meant referring to full-text
search now :)

What SearchableModel offers is simple extension of Query class that
allows to search for entities by whole words. In our example the
result using SearchableModel would be slightly different:
'Google' > 'Google'
'GOOG' > None
'Apple' > None
'micro' > None

What's more there is quite odd limitation of SearchableModel query.
When search phrase is *two* character long (or less) it returns the
whole list with all the results including all the irrelevant ones:
'XX' > 'Google', 'Microsoft', 'Sun Microsystems'
'' > 'Google', 'Microsoft', 'Sun Microsystems'

This is the reason I've started to think of approach to the real full-
text search. Can you please put some of your ideas on that? I'm not
expert in algorithms related to searching, but I really try to build
my app right. What do you think about the limitations of
SearchableModel? Do you guys from Google are going to introduce more
functionality as presented above?

Cheers,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wave connexion(BQ)  
View profile  
 More options Jun 10 2008, 7:59 am
From: "wave connexion(BQ)" <waveconnex...@gmail.com>
Date: Tue, 10 Jun 2008 19:59:35 +0800
Local: Tues, Jun 10 2008 7:59 am
Subject: Re: [google-appengine] Re: Approach to full text searching.

you might try to use regex with case-insensitive prior to SearchableModel

--
BQ

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile  
 More options Jun 10 2008, 8:38 am
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Tue, 10 Jun 2008 05:38:30 -0700 (PDT)
Local: Tues, Jun 10 2008 8:38 am
Subject: Re: Approach to full text searching.
Hi BQ,

Could you elaborate some more on your idea? I guess the regular
expression here would be great. As far as I know the only implemented
'operators' in GAE Query are: less than ('<'), less than or equal
('<='), greater than ('>'), greater than or equal ('>='), and equal
('=' and '=='). Therefore I can't see any way to use regular
expressions in queries. Or maybe I'm missing something?

Regards,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter Dolan  
View profile  
 More options Jun 10 2008, 1:24 pm
From: "Peter Dolan" <peterjdo...@gmail.com>
Date: Tue, 10 Jun 2008 10:24:38 -0700
Local: Tues, Jun 10 2008 1:24 pm
Subject: Re: [google-appengine] Re: Approach to full text searching.

<plug for the system I built>
If you're looking to build a secondary index on your data with which to
handle finely grained searches, you might find HTTPMR useful:
http://httpmr.googlecode.com/.  The single sample that I have running (and
testing) right now is actually one that constructs a token index of a set of
documents for searching.  So, if the search index that AppEngine builds for
you with SearchableModel doesn't work out, you can build your own.
</plug for the system I built>

Though actually, I suspect that you need to give up on the notion of a
'like' operator and work with the limitations of the SearchableModel.
'LIKE' doesn't scale well on a scalable datastore in the bigtable paradigm
because you need to read every single entry in the database.  The same would
hold true for regular expression matching operators.  Relational databases
handle this, I think (and I'm not an expert), by maintaining indices that
allow somewhat efficient prefix matching in the case of "... LIKE 'xxx%'",
but also have to read every row for very aggressive matches like "... LIKE
'%x%'".  Those are slow even on a relational database running on one
machine, prohibitively expensive on a distributed database.

If you're looking for "Sun Microsystems", why doesn't segmenting the query
into "Sun", "Microsystems", querying for both tokens, and taking the union
of the resultset work?

- Peter


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alan  
View profile  
 More options Jun 10 2008, 6:40 pm
From: Alan <alan.fotherg...@gmail.com>
Date: Tue, 10 Jun 2008 15:40:55 -0700 (PDT)
Local: Tues, Jun 10 2008 6:40 pm
Subject: Re: Approach to full text searching.
Tommy - I've been playing around with something similar to what you
are asking about - using a ListProperty field called indexable_list to
store all the substrings from all the words in the entity.

See http://testalan1.appspot.com/

In the field at the bottom, enter "SU", and you see anything with a
word beginning with "SU"

now enter a space, and "M", and the list gets smaller.

Now enter a space, and a "P" (for Palo Alto).

This is where the problems come.  Appengine says that it needs an
index with the  indexable_list repeated 3 times (use Firebug to see
this):

    NeedIndexError: no matching index found

    This query needs this index:

    - kind: Party1
      properties:
      - name: indexable_list
      - name: indexable_list
      - name: indexable_list
      - name: name

If this index was created, would lead to a combinatorial explosion in
index size. (Every combination of 3 from the list is indexed)

Today there was a notification that index sizes are capped:
http://groups.google.com/group/google-appengine/browse_thread/thread/...
http://groups.google.com/group/google-appengine/browse_thread/thread/...

Solution could be to take a maximum of 2 words from the user input
(the 2 longest words), perform the query, and do additional filtering
on the other words in code.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nerdy Tommy  
View profile  
(1 user)  More options Jun 10 2008, 7:03 pm
From: Nerdy Tommy <nerdy.to...@gmail.com>
Date: Tue, 10 Jun 2008 16:03:16 -0700 (PDT)
Local: Tues, Jun 10 2008 7:03 pm
Subject: Re: Approach to full text searching.
Hi Peter,

I referred LIKE in my description only to make my explanation of the
problem easier to understand. I'm fully aware that queries based on
LIKE, wildcards and/or regular expressions don't scale and this is the
reason why those concepts doesn't exist in GAE data store. But there
is still open question how to implement efficient search not limited
only to the whole strings matching (equals operator).

I've done some research on basics of approximate string searching.
There are two main approaches: online-search and offline-search.
Mentioned before LIKE, pattern matching, etc. are mostly techniques of
the first approach. The latter one is based on string preprocessing
and index building. And this offline-search approach seems to be the
right way to go with GAE architecture. It would be really good to have
low-level implementation of any indexing techniques combined with high
level API for developer to easily and efficiently perform full-text
fuzzy searching. I believe this is definitely one of the ways GAE
should evolve in the near future.

But for now I still need to implement search in my simple app I've
been building recently. I'm not very keen to live with the limitations
of SearchableModel, so I still think it is possible to build index
using high-level API and GAE concepts. I'm going to go with the idea
of generation of list of all substrings of every word in the content
of my searchable StringProperty and store it is designated
StringListProperty. The algorithm that I use to generate list of
substrings is quite efficient, and not so complex (linear complexity,
I guess). I also have idea of some tweaks to make the list more
compact and easier to search upon.

I also did some thinking on more fuzzy-like search and came across
interesting theory about so called string k-neighborhood generation.
The idea is to generate list of all strings that distance (using
Levenshtein distance measure) from an original string is acceptable
and store these in the index as well. The ideas is quite fresh and it
seems to be doable only for relatively short strings, because of the
complexity of the algorithm that might be used to generate the
neighborhood (it's exponential, I guess).

I'm still open to discuss those and other ideas. Furthermore I'm going
to publish the source code of my app to present my approach in greater
details.

Cheers,
Tommy


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Hoffman  
View profile  
 More options Jun 10 2008, 9:25 pm
From: Tim Hoffman <zutes...@gmail.com>
Date: Tue, 10 Jun 2008 18:25:41 -0700 (PDT)
Local: Tues, Jun 10 2008 9:25 pm
Subject: Re: Approach to full text searching.
Hi Peter

You might like to have a look at how textindexng is implemented as a
searchable index in Zope.

http://plone.org/products/textindexng

Still may not be what you want, but you may find some of the
approaches useful for creating your own
Searchable model

Rgds

Tim

On Jun 11, 7:03 am, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Dado  
View profile  
 More options Jun 10 2008, 9:58 pm
From: Dado <edoardo.marc...@gmail.com>
Date: Tue, 10 Jun 2008 18:58:52 -0700 (PDT)
Local: Tues, Jun 10 2008 9:58 pm
Subject: Re: Approach to full text searching.
Damn, shouldn't datastore come with something better than
searchablemodel??? This is Google after all!!!

On Jun 10, 4:05 am, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wave connexion(BQ)  
View profile  
 More options Jun 11 2008, 2:25 am
From: "wave connexion(BQ)" <waveconnex...@gmail.com>
Date: Wed, 11 Jun 2008 14:25:35 +0800
Local: Wed, Jun 11 2008 2:25 am
Subject: Re: [google-appengine] Re: Approach to full text searching.

Hi Tommy,

I am not currently getting into SearchableModel detail. As Peter pointed
out, regex approach might not scale in appengine. I recall some one posted
similar question, searching "searchablequery" on "bulkupload" on this forum
might be a help for you.

On Tue, Jun 10, 2008 at 8:38 PM, Nerdy Tommy <nerdy.to...@gmail.com> wrote:

> Hi BQ,

> Could you elaborate some more on your idea? I guess the regular
> expression here would be great. As far as I know the only implemented
> 'operators' in GAE Query are: less than ('<'), less than or equal
> ('<='), greater than ('>'), greater than or equal ('>='), and equal
> ('=' and '=='). Therefore I can't see any way to use regular
> expressions in queries. Or maybe I'm missing something?

> Regards,
> Tommy

--
BQ

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ben the Indefatigable  
View profile  
 More options Jun 11 2008, 11:41 am
From: Ben the Indefatigable <bcbry...@gmail.com>
Date: Wed, 11 Jun 2008 08:41:01 -0700 (PDT)
Local: Wed, Jun 11 2008 11:41 am
Subject: Re: Approach to full text searching.
I've thought a lot about this, and implemented SearchableModel, and
came to the conclusion that if I need real full text search I can't
use google appengine. Even if SearchableModel was adequate in terms of
search quality, they said it won't scale. Ideas I've had for a
homemade search are problematic in terms of adding a page to the
indexes, whether it has 10 words or 1000 it will affect disparate
parts of the index and require too many writes. On the other hand, I
am guessing that within a couple of years google will have to come out
with a good solution for full text search, assuming appengine stays in
their priorities.

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Waleed  
View profile  
 More options Jun 11 2008, 7:11 pm
From: Waleed <waleed.abdu...@gmail.com>
Date: Wed, 11 Jun 2008 16:11:16 -0700 (PDT)
Local: Wed, Jun 11 2008 7:11 pm
Subject: Re: Approach to full text searching.
You might've considered this, but here goes just in case. Not knowing
more details about your requirements (like how many company names you
expect in your list), this might or might not work for you.

1. Create an index like this:

aa,    list of entities that contian aa
ab,    list of entities that contian ab
ac,    list of entities that contian ac
....
ba,    list of entities that contian ba
bb,    list of entities that contian ba

2. If a user searchs for "abc", you pull the list of items that
contain "ab" and those contain "bc" and find the items that occur in
both lists. The number of lists you'll need to pull will depend on the
number of letters in the word (list n = len(word) -1) .

On Jun 10, 4:05 am, Nerdy Tommy <nerdy.to...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ben the Indefatigable  
View profile  
 More options Jun 16 2008, 5:33 pm
From: Ben the Indefatigable <bcbry...@gmail.com>
Date: Mon, 16 Jun 2008 14:33:51 -0700 (PDT)
Local: Mon, Jun 16 2008 5:33 pm
Subject: Re: Approach to full text searching.
Waleed, yes I have thought of that. Searching based on that kind of
design might be doable especially on simple searches. But I
specifically said the cost of *adding* a page to the indexes is where
the problem lies. If you have 10 to 1000 words in a page, that's
likely near ten to several hundred index entries you need to write.
So, adding a single page of text to your indexing system becomes a
very intensive process.

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jimpa  
View profile  
 More options Jun 28 2008, 3:27 pm
From: Jimpa <jamie.nichol...@gmail.com>
Date: Sat, 28 Jun 2008 12:27:15 -0700 (PDT)
Local: Sat, Jun 28 2008 3:27 pm
Subject: Re: Approach to full text searching.
On Jun 16, 10:33 pm, Ben the Indefatigable <bcbry...@gmail.com> wrote:

> Waleed, yes I have thought of that. Searching based on that kind of
> design might be doable especially on simple searches. But I
> specifically said the cost of *adding* a page to the indexes is where
> the problem lies. If you have 10 to 1000 words in a page, that's
> likely near ten to several hundred index entries you need to write.
> So, adding a single page of text to your indexing system becomes a
> very intensive process.

I agree, haven't found a solution to this..... shame really

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
gimenete@gmail.com  
View profile  
 More options Jun 30 2008, 10:00 am
From: "gimen...@gmail.com" <gimen...@gmail.com>
Date: Mon, 30 Jun 2008 07:00:55 -0700 (PDT)
Local: Mon, Jun 30 2008 10:00 am
Subject: Re: Approach to full text searching.
Please star this issue: http://code.google.com/p/googleappengine/issues/detail?id=217

On Jun 28, 9:27 pm, Jimpa <jamie.nichol...@gmail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bruno Patini Furtado  
View profile  
 More options Jun 30 2008, 1:35 pm
From: "Bruno Patini Furtado" <bpfurt...@gmail.com>
Date: Mon, 30 Jun 2008 14:35:00 -0300
Subject: Re: [google-appengine] Re: Approach to full text searching.

Have you considered PyLucene? http://pylucene.osafoundation.org/

On Mon, Jun 30, 2008 at 11:00 AM, gimen...@gmail.com <gimen...@gmail.com>
wrote:

--
"I am not young enough to know everything" --Oscar Wilde.
"I know nothing except the fact of my ignorance." --Socrates.
Bruno Patini Furtado
Software Developer
webpage: http://bpfurtado.net
text adventures suite: http://bpfurtado.net/tas
software development blog: http://bpfurtado.livejournal.com

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bruno Patini Furtado  
View profile  
 More options Jun 30 2008, 2:11 pm
From: "Bruno Patini Furtado" <bpfurt...@gmail.com>
Date: Mon, 30 Jun 2008 15:11:22 -0300
Local: Mon, Jun 30 2008 2:11 pm
Subject: Re: [google-appengine] Re: Approach to full text searching.

Forget PyLucene in this scenario as it uses the Java implementation of
Lucene under the covers, won't work on the GoogLe Apps engine.

On Mon, Jun 30, 2008 at 2:35 PM, Bruno Patini Furtado <bpfurt...@gmail.com>
wrote:

--
"I am not young enough to know everything" --Oscar Wilde.
"I know nothing except the fact of my ignorance." --Socrates.
Bruno Patini Furtado
Software Developer
webpage: http://bpfurtado.net
text adventures suite: http://bpfurtado.net/tas
software development blog: http://bpfurtado.livejournal.com

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google