Simple use case, not getting the performance I expect

Alex Furman

Apr 12, 2013, 1:26:27 PM
to who...@googlegroups.com
Hi, and apologies in advance for asking a newbie question. I did look around and I did not find the answer on my own. Thought I'd mention that :)

I have what I think is a pretty simple use case - and I'm trying to figure out if I'm barking up the wrong tree altogether. 

I have a mid-size list of terms, somewhere in the tens of thousands. I need to run substring queries against that list, so every query winds up looking like *query*. When I do a straight-up array scan in pure Python:

filter(lambda t: query_string in t.name, terms)

I get my results in something like 1/3 of a second. Using a pre-compiled regexp for search gives me similar performance.

Then I came to Whoosh and figured that since I'm looking for substring matches, I should just treat my terms as NGRAMS and index the damn thing. So I created a simple Whoosh schema:

self.schema = Schema(content=NGRAMWORDS(2, 10), id=NUMERIC(stored=True))
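
For readers unfamiliar with n-gram fields, here is a rough sketch in plain Python 3 of what a field like NGRAMWORDS(2, 10) might put in the index. It assumes the field splits text into words, lowercases them, and emits every substring of 2 to 10 characters from each word; `word_ngrams` is a hypothetical helper written for illustration, not Whoosh's actual tokenizer.

```python
import re

def word_ngrams(text, minsize=2, maxsize=10):
    """Return every n-gram (minsize..maxsize characters) of each word in text.

    Illustrative stand-in for an n-gram analyzer; not Whoosh code.
    """
    grams = []
    for word in re.findall(r"\w+", text.lower()):
        # Words shorter than minsize produce no n-grams at all.
        for size in range(minsize, min(maxsize, len(word)) + 1):
            for start in range(len(word) - size + 1):
                grams.append(word[start:start + size])
    return grams

grams = word_ngrams("Insulin")
print(len(grams))   # number of n-grams produced for a single 7-letter word
print(grams[:6])    # the 2-character n-grams come first
```

Under these assumptions, every substring of a word (within the size limits) becomes its own index term, which is what makes substring matching a direct term lookup.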

Wrote my index to the file system:

    def write_index(self):
        if not os.path.exists("index"):
            os.mkdir("index")

        self.ix = create_in("index", self.schema)
        self.ix_writer = self.ix.writer()
        print "Writing Index"
        for idx, term in enumerate(self.terms):
            if idx % 100 == 0:
                print "\t%s" % idx
            self.ix_writer.add_document(id = idx, content = term.name)
        self.ix_writer.commit()
        print "done writing index"

And expected to get wonderfully fast searches:

    def _invitae_search(self, search_string):
        parser = QueryParser("content", self.ix.schema)
        with self.ix.searcher() as searcher:
            searcher.set_caching_policy(save=True)
            results = searcher.search(parser.parse(unicode("*%s*" % search_string)))

And what I'm seeing is that these all take about the same 1/3 of a second as a full array scan with string comparisons does. I'm probably doing something very very basic wrong.

Ideas?

Thanks a lot!

alex

Matt Chaput

Apr 12, 2013, 3:06:49 PM
to who...@googlegroups.com
>     def _invitae_search(self, search_string):
>         parser = QueryParser("content", self.ix.schema)
>         with self.ix.searcher() as searcher:
>             searcher.set_caching_policy(save=True)
>             results = searcher.search(parser.parse(unicode("*%s*" % search_string)))
>
> And what I'm seeing is that these all take about the same 1/3 of a second as a full array scan with string comparisons does. I'm probably doing something very very basic wrong.
>
> Ideas?

Try removing the wildcards from the search string.

Matt
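
One way to picture why dropping the wildcards could matter, sketched in plain Python 3 rather than with Whoosh itself: if every 2 to 10 character substring of each word is already an index term, a bare query is a single term lookup, while a *query* pattern has to be tested against every term in the index. The helpers below (`build_index`, `lookup`, `wildcard_scan`) and the sample names are hypothetical and only illustrate the idea; they are not how Whoosh is implemented.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

def word_ngrams(text, minsize=2, maxsize=10):
    """Yield every n-gram of each lowercased word (illustrative only)."""
    for word in re.findall(r"\w+", text.lower()):
        for size in range(minsize, min(maxsize, len(word)) + 1):
            for start in range(len(word) - size + 1):
                yield word[start:start + size]

def build_index(names):
    """Map each n-gram to the set of document ids whose name contains it."""
    index = defaultdict(set)
    for doc_id, name in enumerate(names):
        for gram in word_ngrams(name):
            index[gram].add(doc_id)
    return index

def lookup(index, query):
    """No wildcards: the query is itself an indexed term, so one dict lookup."""
    return set(index.get(query.lower(), ()))

def wildcard_scan(index, query):
    """With wildcards: the pattern is checked against every term in the index."""
    pattern = "*%s*" % query.lower()
    hits = set()
    for term, doc_ids in index.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits

names = ["hemoglobin", "globulin", "insulin receptor", "myoglobin"]
index = build_index(names)
print(sorted(lookup(index, "glob")))
print(sorted(wildcard_scan(index, "glob")))
```

Both functions return the same documents for a single-word query of 2 to 10 characters, but `wildcard_scan` touches every term, so its cost grows with the size of the term dictionary, much like the original list scan.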

Alex Furman

Apr 15, 2013, 7:17:38 PM
to who...@googlegroups.com
Thank you! That helped - I am not seeing the performance that I expected to see. All in all, using Whoosh (for an admittedly pretty simple task) has been a pleasure. Everything just works. Thank you! 


Thomas Waldmann

Apr 23, 2013, 5:00:37 PM
to who...@googlegroups.com


On Tuesday, April 16, 2013 1:17:38 AM UTC+2, Alex Furman wrote:
> Thank you! That helped - I am not seeing the performance that I expected to see.

Did you mean "I am NOW seeing ..."?