Simple use case, not getting the performance I expect

Alex Furman

Apr 12, 2013, 1:26:27 PM

to who...@googlegroups.com

Hi, and apologies in advance for asking a newbie question. I did look around and I did not find the answer on my own. Thought I'd mention that :)

I have what I think is a pretty simple use case - and I'm trying to figure out if I'm barking up the wrong tree altogether.

I have a mid-size list of terms, somewhere in the tens of thousands. I need to be able to do a substring query on that list which always winds up looking like *query*. When I do a straight up array scan in pure python:

filter(lambda t: query_string in t.name, terms)

I get my results in something like 1/3 of a second. Using a pre-compiled regexp for search gives me similar performance.

Then I came to Whoosh and figured that since I'm looking for substring matches, I should just treat my terms as n-grams and index the damn thing. So I created a simple Whoosh schema:

self.schema = Schema(content=NGRAMWORDS(2, 10), id=NUMERIC(stored=True))

Wrote my index to the file system

def write_index(self):
    if not os.path.exists("index"):
        os.mkdir("index")
    self.ix = create_in("index", self.schema)
    self.ix_writer = self.ix.writer()
    print "Writing Index"
    for idx, term in enumerate(self.terms):
        if idx % 100 == 0:
            print "\t%s" % idx
        self.ix_writer.add_document(id=idx, content=term.name)
    self.ix_writer.commit()
    print "done writing index"

And expected to get wonderfully fast searches:

def _invitae_search(self, search_string):
    parser = QueryParser("content", self.ix.schema)
    with self.ix.searcher() as searcher:
        searcher.set_caching_policy(save=True)
        results = searcher.search(parser.parse(unicode("*%s*" % search_string)))

And what I'm seeing is that these all take about the same 1/3 of a second as a full array scan with string comparisons does. I'm probably doing something very, very basic wrong.

Ideas?

Thanks a lot!

alex

Matt Chaput

Apr 12, 2013, 3:06:49 PM

to who...@googlegroups.com

> def _invitae_search(self, search_string):
>     parser = QueryParser("content", self.ix.schema)
>     with self.ix.searcher() as searcher:
>         searcher.set_caching_policy(save=True)
>         results = searcher.search(parser.parse(unicode("*%s*" % search_string)))
>
> And what I'm seeing is that these all take about the same 1/3 of a second as a full array scan with string comparisons does. I'm probably doing something very, very basic wrong.
>
> Ideas?

Try removing the wildcards from the search string.

Matt
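
A rough illustration of why the suggestion above (dropping the wildcards) can help. This is a toy in pure Python 3, not Whoosh's actual implementation. With an n-gram field, every substring of length 2 to 10 is already a term in the index, so a plain query can be answered by a few term lookups. A `*query*` wildcard, as far as I understand it, has to be expanded by walking the terms of the field, which is much closer to the original array scan. The function names (`ngrams`, `build_index`, `substring_search`) and the sample names are made up for this sketch, and unlike NGRAMWORDS it does not split the text into words first.

```python
from collections import defaultdict


def ngrams(text, minsize=2, maxsize=10):
    """Yield every substring of text whose length is in [minsize, maxsize]."""
    for size in range(minsize, maxsize + 1):
        for start in range(len(text) - size + 1):
            yield text[start:start + size]


def build_index(names, minsize=2, maxsize=10):
    """Map each n-gram to the set of ids (list positions) of names containing it."""
    index = defaultdict(set)
    for doc_id, name in enumerate(names):
        for gram in ngrams(name.lower(), minsize, maxsize):
            index[gram].add(doc_id)
    return index


def substring_search(index, names, query, minsize=2, maxsize=10):
    """Return sorted ids of names containing query, using gram lookups."""
    query = query.lower()
    if len(query) < minsize:
        raise ValueError("query is shorter than the smallest indexed n-gram")
    # Cut the query into overlapping grams of the largest indexed size.
    size = min(maxsize, len(query))
    grams = [query[i:i + size] for i in range(len(query) - size + 1)]
    # A matching name must contain every gram, so intersect the posting sets.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # For queries longer than maxsize the grams could match in separate
    # places, so confirm each candidate with a real substring check.
    return sorted(i for i in candidates if query in names[i].lower())


if __name__ == "__main__":
    names = ["hemoglobin", "myoglobin", "globulin", "insulin"]
    index = build_index(names)
    # No wildcards in the query: the n-gram index already covers substrings.
    print(substring_search(index, names, "globin"))
```

The trade-off is the usual one for n-gram indexing: the index is much larger than the term list, in exchange for lookups that do not have to touch every term.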

Alex Furman

Apr 15, 2013, 7:17:38 PM

to who...@googlegroups.com

Thank you! That helped - I am not seeing the performance that I expected to see. All in all, using Whoosh (for an admittedly pretty simple task) has been a pleasure. Everything just works. Thank you!

Thomas Waldmann

Apr 23, 2013, 5:00:37 PM

to who...@googlegroups.com

On Tuesday, April 16, 2013 1:17:38 AM UTC+2, Alex Furman wrote:

> Thank you! That helped - I am not seeing the performance that I expected to see.

Did you mean "I am NOW seeing ..."?
