Like Search

6 views
Skip to first unread message

Neves

unread,
Jun 12, 2009, 4:31:26 PM6/12/09
to Google App Engine
I have an idea to do LIKE search with small words in GAE.

The solution is create a Word model with the follow fields:

class Word(db.Model):
word = db.StringProperty()
like = db.StringListProperty()

# usage
word = Word()

word.word = "open"
# with the assignment above, the like property would automatically be
filled with the follow string list:
word.like = ["o", "op", "ope", "open", "p", "pe", "pen", "e", "en",
"n"]
word.put()

# a search would be like this:
part = "pen"
results = db.GqlQuery("SELECT * FROM Word WHERE like = :1", part)
# results would contain the word "open" cause it contains the
substring "pen" on the like list.

For optimization, repeated parts would not be saved on the list, for
example, the word "popo" would became:
word.like = ["p", "po", "pop", popo", "o", "op", "opo"]

The math to know how much words would exist on the list, is just do:
length * (1 + length) / 2
So a word with 20 letters would have a maximum of 210 subwords, minus
the repeated ones.
Thats why it just works for small words, in my case, for domain names,
or email address.

This idea came from http://code.google.com/events/io/sessions/BuildingScalableComplexApps.html

Barry Hunter

unread,
Jun 12, 2009, 11:20:17 PM6/12/09
to google-a...@googlegroups.com
Yeh that should work, and reasonably well. Its very similar to how the
SearchableModel demo works for whole word full text search, split the
text into words and put in a string list.

The main issue becomes when you want to be able to search on mulitple
words, an index will be created that has 'like' property twice. This
quickly leads to exploding indexes, as an entry in the index needs to
be created for every word, with every other word, so for your example
requiring a 210 words, would result in 44,100 index entries! Thats
clearly not sustainable - espically if the result set is big and/or
need to search three words!

Not saying dont do it, just be weary that it will still quickly break down...


But there is a further way to optimise. See [1] for a tip on prefix
matching. Which I think should still work on stringlists, so can
actully just store

word.like = ["open", "pen", "en", "n"] #(for 'open')

as 'op' for example would prefix match on "open"

(but only works if you need your inequality filter on a different property... )

[1] http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html
--
Barry

- www.nearby.org.uk - www.geograph.org.uk -

Neves

unread,
Jun 13, 2009, 10:02:59 AM6/13/09
to Google App Engine
Sad for hearing that. But I think this kind of query is not possible
with a list property:
prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd"

is it?

On 13 jun, 00:20, Barry Hunter <barrybhun...@googlemail.com> wrote:
> Yeh that should work, and reasonably well. Its very similar to how the
> SearchableModel demo works for whole word full text search, split the
> text into words and put in a string list.
>
> The main issue becomes when you want to be able to search on mulitple
> words, an index will be created that has 'like' property twice. This
> quickly leads to exploding indexes, as an entry in the index needs to
> be created for every word, with every other word, so for your example
> requiring a 210 words, would result in 44,100 index entries! Thats
> clearly not sustainable - espically if the result set is big and/or
> need to search three words!
>
> Not saying dont do it, just be weary that it will still quickly break down...
>
> But there is a further way to optimise. See [1] for a tip on prefix
> matching. Which I think should still work on stringlists, so can
> actully just store
>
> word.like = ["open", "pen", "en", "n"]   #(for 'open')
>
> as 'op' for example would prefix match on "open"
>
> (but only works if you need your inequality filter on a different property... )
>
> [1]http://code.google.com/appengine/docs/python/datastore/queriesandinde...
> >  This idea came fromhttp://code.google.com/events/io/sessions/BuildingScalableComplexApps...
>
> --
> Barry
>
> -www.nearby.org.uk-www.geograph.org.uk-

Nick Johnson (Google)

unread,
Jun 15, 2009, 8:50:18 AM6/15/09
to google-a...@googlegroups.com
Hi Neves,

Yes, that form of query should work fine - and eliminates the requirement for you to include all subsets of a word in your list - you only need to include all suffixes.

-Nick Johnson
Reply all
Reply to author
Forward
0 new messages