Hi, and apologies in advance for asking a newbie question. I did look around and I did not find the answer on my own. Thought I'd mention that :)
I have what I think is a pretty simple use case - and I'm trying to figure out if I'm barking up the wrong tree altogether.
I have a mid-size list of terms, somewhere in the tens of thousands. I need to be able to do a substring query on that list, which always winds up looking like *query*. When I do a straight-up array scan in pure Python:
filter(lambda t: query_string in t.name, terms)
I get my results in something like 1/3 of a second. Using a pre-compiled regexp for search gives me similar performance.
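To make the baseline concrete, here is a minimal, self-contained sketch of the two scans I am comparing against. The `Term` namedtuple and the sample names are hypothetical stand-ins for my real term objects (only the `.name` attribute matters), and the function names are made up for illustration:

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the real term objects; only .name is used.
Term = namedtuple("Term", ["name"])


def scan_substring(terms, query_string):
    """Linear scan: keep every term whose name contains query_string."""
    return [t for t in terms if query_string in t.name]


def scan_regex(terms, query_string):
    """Same scan with a pre-compiled regexp; the query is escaped so it matches literally."""
    pattern = re.compile(re.escape(query_string))
    return [t for t in terms if pattern.search(t.name)]


# Made-up sample data, just to show the call shape.
sample_terms = [Term("cardiomyopathy"), Term("myopia"), Term("cardiac arrest")]
substring_hits = scan_substring(sample_terms, "myop")
regex_hits = scan_regex(sample_terms, "myop")
```

Both functions walk the whole list on every query, so the cost grows linearly with the number of terms.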
Then I came to Whoosh and figured that since I'm looking for substring matches, I should just treat my terms as NGRAMS and index the damn thing. So I created a simple Whoosh schema:
self.schema = Schema(content=NGRAMWORDS(2, 10), id=NUMERIC(stored=True))
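As I understand it, the idea behind that field type is that each word gets broken into all of its substrings between 2 and 10 characters, and those substrings are what end up in the index. The sketch below is plain Python that only illustrates that idea; `word_ngrams` is a hypothetical helper of my own, not Whoosh's actual tokenizer, and the lowercasing is my assumption about how the field normalizes text:

```python
def word_ngrams(word, minsize=2, maxsize=10):
    """Return every substring of `word` with length in [minsize, maxsize]."""
    word = word.lower()  # assumption: the index stores lowercased grams
    grams = []
    for start in range(len(word)):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end > len(word):
                break  # longer grams from this start would run off the word
            grams.append(word[start:end])
    return grams


# Example call on a made-up term, using small sizes to keep the list short.
example_grams = word_ngrams("gene", 2, 3)
```

If that picture is right, a substring query of 2 to 10 characters should correspond to a single indexed gram, which is why I expected lookups to be fast.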
Wrote my index to the file system
def write_index(self):
    if not os.path.exists("index"):
        os.mkdir("index")
    self.ix = create_in("index", self.schema)
    self.ix_writer = self.ix.writer()
    print "Writing Index"
    for idx, term in enumerate(self.terms):
        if idx % 100 == 0:
            print "\t%s" % idx
        self.ix_writer.add_document(id=idx, content=term.name)
    self.ix_writer.commit()
    print "done writing index"
And expected to get wonderfully fast searches:
def _invitae_search(self, search_string):
    parser = QueryParser("content", self.ix.schema)
    with self.ix.searcher() as searcher:
        searcher.set_caching_policy(save=True)
        results = searcher.search(parser.parse(unicode("*%s*" % search_string)))
And what I'm seeing is that these all take about the same 1/3 of a second as a full array scan with string comparisons does. I'm probably doing something very, very basic wrong.
Ideas?
Thanks a lot!
alex