# A Practical Guide to Whoosh
### By Ruowei Wang, 2/1/2018

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python with detailed documentation. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Some of Whoosh's features include:

1. Pythonic API.
2. Pure Python and open source. No compilation or binary packages needed, no mysterious crashes.
3. Fielded indexing and search.
4. Fast indexing and retrieval -- faster than any other pure-Python search solution I know of. See Benchmarks (https://bitbucket.org/mchaput/whoosh/wiki/Benchmarks).
5. Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
6. Powerful query language (supporting inexact search and proximity search).
7. Production-quality pure-Python spell-checker (as far as I know, the only one). See http://whoosh.readthedocs.io/en/latest/spelling.html

To install Whoosh, use one of the following:
1. If you want to use it in a Jupyter notebook, you can run ``conda install whoosh``.
2. ``easy_install Whoosh`` and ``pip install Whoosh`` also work if you have ``setuptools`` or ``pip`` installed.
3. Download a source release from PyPI at http://pypi.python.org/pypi/Whoosh/, or clone the repository with ``hg clone http://bitbucket.org/mchaput/whoosh``.
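
Whichever installation route you choose, it can be worth checking that the package is visible to the interpreter your notebook is running. The following is a minimal sketch using only the standard library (Python 3); ``is_installed`` is a helper name I made up for illustration, not part of Whoosh.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None

# Prints True or False depending on the current environment.
print("whoosh available:", is_installed("whoosh"))
```

If this prints ``False`` inside a notebook, the kernel is probably using a different environment from the one where you ran the install command.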

Now let's start using Whoosh!

In [156]:
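# Query parser classes and plug-ins used in this guide
from whoosh.qparser import (QueryParser, MultifieldParser, OrGroup, GtLtPlugin,
                            FuzzyTermPlugin, PhrasePlugin, SequencePlugin)
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED, NUMERIC
from whoosh.analysis import StemmingAnalyzer, StandardAnalyzer
from whoosh import index
import os, os.path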

Each document can have multiple fields, such as title, content, url, date, etc. First, we need to create a schema for our corpus that specifies the fields of the documents in an index.

The schema is the set of all possible fields in a document. Each individual document might only use a subset of the available fields in the schema.

Note:
If the query parser is not given a schema, it will not analyze the text of the user query with the fields' analyzers, so, for example, phrase searching will not work as expected.

Here is an example of creating a schema:

In [157]:
schema = Schema(year=NUMERIC(stored=True),
                author=TEXT(analyzer=StandardAnalyzer(stoplist=None),stored=True),
                title=TEXT(analyzer=StandardAnalyzer(stoplist=None),stored=True),
                abstract=TEXT(analyzer=StandardAnalyzer(stoplist=None),stored=True),
                body=TEXT(analyzer=StandardAnalyzer(stoplist=None)),
                subject=KEYWORD(commas=True,scorable=True),
                keywords=KEYWORD(commas=True, scorable=True))

Here are the predefined field types I used:

1. whoosh.fields.NUMERIC:
This field stores int, long, or floating-point numbers in a compact, sortable format.

2. whoosh.fields.TEXT:
TEXT fields index the text and store term positions (by default, ``TEXT(phrase=True)``) to allow phrase searching.
This field uses ``StandardAnalyzer`` by default. To specify a different analyzer, use the ``analyzer`` keyword argument to the constructor, e.g. ``TEXT(analyzer=analysis.StemmingAnalyzer())``.
The analyzers are documented here: http://whoosh.readthedocs.io/en/latest/api/analysis.html#analyzers. ``StandardAnalyzer`` only splits the text into words, lowercases them, and filters them with a simple stop-word list (the schema above disables that list with ``stoplist=None``, so every word is indexed).
By default, TEXT fields are not stored, which means the content of the field will not be shown in the search results. Usually you will not want to store the body text in the search index, but you can use ``TEXT(stored=True)`` to specify that the text should be stored in the index.

3. whoosh.fields.KEYWORD:
This field type is designed for space- or comma-separated keywords. It is indexed and searchable (and optionally stored). It does not support phrase searching.
To store the value of the field in the index, use ``stored=True`` in the constructor. To automatically lowercase the keywords before indexing them, use ``lowercase=True``. To separate the keywords by commas (which allows keywords containing spaces), use ``commas=True``; otherwise the keywords are space-separated. If users will search on the keyword field, use ``scorable=True``.
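
To make the difference between TEXT and KEYWORD analysis concrete, here is a rough sketch in plain Python of what the two kinds of processing described above do to a value. It is not Whoosh's implementation: ``analyze``, ``split_keywords``, and the short ``STOP_WORDS`` list are names and data I made up for illustration, and the real analyzers handle more cases.

```python
import re

# Hypothetical, abbreviated stop-word list, for illustration only.
STOP_WORDS = frozenset(["a", "and", "in", "is", "of", "the", "to", "with"])

def analyze(text, stoplist=STOP_WORDS):
    """TEXT-like analysis: split into words, lowercase, drop stop words."""
    tokens = [word.lower() for word in re.findall(r"\w+", text)]
    if stoplist:
        tokens = [token for token in tokens if token not in stoplist]
    return tokens

def split_keywords(value, commas=False, lowercase=False):
    """KEYWORD-like analysis: split on commas or on whitespace."""
    parts = value.split(",") if commas else value.split()
    parts = [part.strip() for part in parts]
    if lowercase:
        parts = [part.lower() for part in parts]
    return [part for part in parts if part]  # drop empty entries

print(analyze("Gone with the Wind"))                 # ['gone', 'wind']
print(analyze("Gone with the Wind", stoplist=None))  # ['gone', 'with', 'the', 'wind']
print(split_keywords("The Frog King,  Rapunzel", commas=True))  # ['The Frog King', 'Rapunzel']
print(split_keywords("The Frog King,  Rapunzel"))    # ['The', 'Frog', 'King,', 'Rapunzel']
```

The last two lines show why ``commas=True`` matters for the schema above: without it, a keyword such as "The Frog King" would be broken into three separate keywords.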

Note: there are many other predefined field types to choose from; see http://whoosh.readthedocs.io/en/latest/api/fields.html#pre-made-field-types.

Note: Whoosh can also create a schema declaratively: subclass the ``SchemaClass`` base class and pass the declarative class to ``create_in()`` or ``create_index()`` instead of a ``Schema`` instance.

After creating the schema, we index each document in the corpus. In this example, I use just two books, "Gone with the Wind" and "Grimms' Fairy Tales", for demonstration.
Note:
1. Indexed text fields must be passed a unicode value (``u"..."`` in Python 2).
2. Opening a writer locks the index for writing. In a multi-threaded or multi-process environment, opening a writer may raise an error if another writer is already open. The more advanced writer objects ``whoosh.writing.AsyncWriter`` and ``whoosh.writing.BufferedWriter`` can solve this problem.
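
Regarding the first note: text that comes from a file or a network socket often arrives as a byte string and has to be decoded before it is passed to an indexed field. Here is a minimal helper, assuming the data is UTF-8 encoded; ``to_text`` is a hypothetical name of mine, not a Whoosh function.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# UTF-8 bytes, e.g. as read from a file opened in binary mode.
raw = b"Kinder- und Hausm\xc3\xa4rchen"
print(to_text(raw))
```

Values that are already Unicode strings pass through unchanged, so the helper is safe to apply to every text field before indexing.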

In [158]:
#create an index in a directory
#note: create_in() creates a new, empty index, so any existing index in "indexdir" is cleared
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)
#an existing index can be reopened later with open_dir()
ix = index.open_dir("indexdir")
#create a writer object to add documents to the index
writer = ix.writer()
#now we can add documents to the index

abstract1=u'''It depicts the struggles of young Scarlett O'Hara, the spoiled daughter of a well-to-do plantation owner, who must use every means at her disposal to claw her way out of poverty following Sherman's destructive 'March to the Sea'. This historical novel features a Bildungsroman or coming-of-age story, with the title taken from a poem written by Ernest Dowson'''

abstract2=u'''Children's and Household Tales (German: Kinder- und Hausmärchen) is a collection of fairy tales first published in 20 December 1812 by the Grimm brothers, Jacob and Wilhelm. The collection is commonly known in English as Grimms' Fairy Tales.'''

writer.add_document(year=u"1936",
                author=u"Margaret Mitchell",
                title=u"Gone with the wind",
                abstract=abstract1,
                subject=u"novel, love",
                keywords=u"Scarlett, Rhett")
writer.add_document(year=u"1812",
                author=u" Jacob and Wilhelm",
                title=u"Grimms' Fairy Tales",
                abstract=abstract2,
                subject=u"story, children",
                keywords=u"The Frog King,  Rapunzel")
#commit() saves the added documents to the index and closes the writer
#always call commit() when you finish adding documents; otherwise opening
#another writer on this index later will raise an error
writer.commit()
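
It may help to picture what an index stores. The sketch below builds a toy inverted index in plain Python that maps each term to the documents and positions where it occurs. This is only my simplified mental model, not Whoosh's on-disk format, and ``build_index`` and ``toy_index`` are made-up names.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to {doc_id: [positions]}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, match in enumerate(re.finditer(r"\w+", text.lower())):
            postings[match.group()].setdefault(doc_id, []).append(pos)
    return postings

docs = {
    1: "Gone with the wind",
    2: "Grimms' Fairy Tales",
}
toy_index = build_index(docs)
print(toy_index["gone"])   # {1: [0]}
print(toy_index["tales"])  # {2: [2]}
```

Storing positions alongside document ids is what makes phrase and proximity searches possible, which is why TEXT fields record them by default.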

After indexing the documents, we can write a query string and convert it into a query object with the query parser.

Create a ``whoosh.qparser.QueryParser`` object, passing it the name of the default field to search and the schema of the index you’ll be searching.

The query parser is built on modular plug-ins. For example, ``qparser.WildcardPlugin``, which is in the parser's default plug-in list, gives the parser the ability to search with wildcards. Some frequently used plug-ins are shown in the following code.

You can use the ``plugins`` argument when creating the object to override the default list of plug-ins, or use ``add_plugin()`` and/or ``remove_plugin_class()`` to change the plug-ins included in the parser.

Here is the list of available plug-ins: http://whoosh.readthedocs.io/en/latest/api/qparser.html#plug-ins.

Note:
The query string must be a unicode value.

In [159]:
#parsing the query
#this is a simple parser with a default field
parser = QueryParser("abstract", schema=schema)
#if you want "unfielded" terms to search both the title and abstract fields, use a whoosh.qparser.MultifieldParser
#parser = MultifieldParser(["title", "abstract"], schema=schema)
#call parse() on the parser to turn a query string into a query object
result = parser.parse(u"apple company department")
print(result)

(abstract:apple AND abstract:company AND abstract:department)


In [160]:
#by default, the parser treats the words as if they were connected by AND
#pass the "group" keyword argument if you want them connected by OR instead, e.g.
#parser = MultifieldParser(["title", "abstract"], schema=schema, group=OrGroup)
#the line above is commented out, so the terms below are still joined with AND
result = parser.parse(u"apple company department")
print(result)

(abstract:apple AND abstract:company AND abstract:department)
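
The practical difference between AND and OR grouping is easiest to see on a tiny example. The sketch below uses plain Python sets over made-up postings (term to matching document ids); it only illustrates the boolean logic and says nothing about how Whoosh evaluates or scores queries. ``postings``, ``match_all``, and ``match_any`` are hypothetical names.

```python
# Toy postings: term -> set of matching document ids (made-up data).
postings = {
    "fairy": {2},
    "tales": {2},
    "novel": {1},
}

def match_all(terms):
    """AND grouping: documents that contain every term."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def match_any(terms):
    """OR grouping: documents that contain at least one term."""
    return set().union(*(postings.get(term, set()) for term in terms))

print(match_all(["fairy", "novel"]))  # set()
print(match_any(["fairy", "novel"]))  # {1, 2}
```

With AND grouping, one unmatched word empties the result, so OR grouping is often the friendlier default for free-text search boxes.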


In [161]:
#you can use add_plugin() to make the parser more powerful
#GtLtPlugin() lets you use >, <, >=, <=, =>, or =< after a field specifier,
#and translates the expression into the equivalent range
parser.add_plugin(GtLtPlugin())
result = parser.parse(u"year:<2000")
print(result)

year:[ TO 2000}
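
In the output above, a square bracket marks an inclusive bound and a curly brace an exclusive one, so ``[ TO 2000}`` reads as "anything up to, but not including, 2000". The following sketch mimics the translation in plain Python. ``comparison_to_range`` is a hypothetical helper, not the plug-in's code, and only the ``<`` case is checked against the output shown here; the other strings are my extrapolation of the same notation.

```python
def comparison_to_range(field, op, value):
    """Rewrite a one-sided comparison such as year:<2000 as a range string.

    "[" and "]" mark an inclusive bound, "{" and "}" an exclusive one.
    """
    if op not in ("<", "<=", "=<", ">", ">=", "=>"):
        raise ValueError("unsupported operator: %r" % (op,))
    exclusive = "=" not in op
    if "<" in op:
        # Upper bound only: the start of the range is left open.
        return "%s:[ TO %s%s" % (field, value, "}" if exclusive else "]")
    # Lower bound only: the end of the range is left open.
    return "%s:%s%s TO ]" % (field, "{" if exclusive else "[", value)

print(comparison_to_range("year", "<", 2000))   # year:[ TO 2000}
print(comparison_to_range("year", ">=", 1900))  # year:[1900 TO ]
```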


In [162]:
#FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that don't have to match exactly
#a fuzzy term matches any similar term within a certain number of "edits"
parser.add_plugin(FuzzyTermPlugin())
#matches "margare" and every indexed term within one edit of it, for example "margaret" (insert t)
result = parser.parse(u"author:margare~")
print(result)
#a searcher object is used to find the matching documents
#open the searcher in a with statement so it is closed automatically when you are done with it
#ix is the index we created earlier
with ix.searcher() as searcher:
    results = searcher.search(result)  #the Results object acts like a list of the matched documents
    print(results[0])

author:margare~
<Hit {'title': u'Gone with the wind', 'abstract': u"It depicts the struggles of young Scarlett O'Hara, the spoiled daughter of a well-to-do plantation owner, who must use every means at her disposal to claw her way out of poverty following Sherman's destructive 'March to the Sea'. This historical novel features a Bildungsroman or coming-of-age story, with the title taken from a poem written by Ernest Dowson", 'author': u'Margaret Mitchell', 'year': u'1936'}>
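
To see what "within one edit" means, here is a small edit-distance function in plain Python. It computes the ordinary Levenshtein distance (insertions, deletions, substitutions), which is a simplification: as far as I can tell from the Whoosh documentation, its fuzzy matching also counts a transposition of two adjacent characters as a single edit. ``edit_distance`` is my own helper, not a Whoosh function.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(edit_distance("margare", "margaret"))  # 1
```

Because "margaret" is one insertion away from "margare", the query ``author:margare~`` finds the document by Margaret Mitchell.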


In [163]:
#the default phrase query tokenizes the text between the quotes and searches for those terms in proximity
#print(parser.default_set())
#use single quotation marks for the Python string, since double quotation marks delimit the phrase
#matches documents whose title has "the" within 2 words after "gone"
#(the field's analyzer lowercases the query text, so "gonE" becomes "gone")
result = parser.parse(u'title:"gonE the"~2')
print(result)
with ix.searcher() as searcher:
    results = searcher.search(result)
    print(results)

title:"gone the"
<Top 1 Results for Phrase(u'title', [u'gone', u'the'], slop=2, boost=1.000000) runtime=0.00116706623885>
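
The ``slop`` value in the output above controls how far apart the phrase words may be. Here is a simplified sketch of that check over a list of tokens, assuming each phrase word must appear after the previous one and at most ``slop`` positions later. It is my reading of the behaviour, not Whoosh's matching code, and ``phrase_matches`` is a made-up name.

```python
def phrase_matches(tokens, phrase, slop=1):
    """Return True if the words of `phrase` occur in order in `tokens`,
    each at most `slop` positions after the previous one."""
    if not phrase:
        return False

    def search(start, remaining):
        if not remaining:
            return True
        last = min(start + slop, len(tokens) - 1)
        return any(tokens[i] == remaining[0] and search(i, remaining[1:])
                   for i in range(start + 1, last + 1))

    return any(token == phrase[0] and search(i, phrase[1:])
               for i, token in enumerate(tokens))

title = "gone with the wind".split()
print(phrase_matches(title, ["gone", "the"], slop=1))  # False
print(phrase_matches(title, ["gone", "the"], slop=2))  # True
```

With ``slop=1`` the words must be adjacent, so "gone the" only matches once the slop is raised to 2 to allow for the word "with" in between.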


In [164]:
#you can use * or ? for inexact term search
#? stands for a single character and * for any number of characters
#matches documents whose title contains a term starting with "go"
result = parser.parse(u'title:go*')
print(result)
with ix.searcher() as searcher:
    results = searcher.search(result)
    print(results)
    print(results[0])

title:go*
<Top 1 Results for Prefix(u'title', u'go') runtime=0.00056364723423>
<Hit {'title': u'Gone with the wind', 'abstract': u"It depicts the struggles of young Scarlett O'Hara, the spoiled daughter of a well-to-do plantation owner, who must use every means at her disposal to claw her way out of poverty following Sherman's destructive 'March to the Sea'. This historical novel features a Bildungsroman or coming-of-age story, with the title taken from a poem written by Ernest Dowson", 'author': u'Margaret Mitchell', 'year': u'1936'}>
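
Conceptually, a wildcard query expands to every indexed term that fits the pattern. The sketch below imitates this with the standard library's ``fnmatch`` module over a small, made-up vocabulary. ``expand_wildcard`` is a hypothetical helper, and ``fnmatch`` also understands ``[...]`` character sets, which are not part of the wildcard syntax described here.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """Return the terms that match a pattern using ? (one character)
    and * (any number of characters)."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]

# Made-up vocabulary standing in for the terms of an index.
vocabulary = ["gone", "with", "the", "wind", "household"]
print(expand_wildcard("go*", vocabulary))        # ['gone']
print(expand_wildcard("w?nd", vocabulary))       # ['wind']
print(expand_wildcard("ho*sehold", vocabulary))  # ['household']
```

Note that these wildcards are not regular expressions: in a regular expression, ``go*`` would mean "g followed by zero or more o", which is a different pattern.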


In [166]:
import re

#for more complex proximity searches, you can replace the phrase plug-in with whoosh.qparser.SequencePlugin,
#which allows any query between the quotes

#remove the ability to specify phrase queries inside double quotes
parser.remove_plugin_class(PhrasePlugin)
#add the ability to group arbitrary queries inside double quotes,
#producing a query that matches the individual sub-queries in sequence
parser.add_plugin(SequencePlugin())
#IMPORTANT: unlike a phrase query, where the field is specified outside the double quotation marks,
#here you need to specify the field inside the quotation marks for each sub-query
#the query string below expresses 'abstract:"(child OR childr*) ho*sehold"~3 AND title:tale*'
result = parser.parse(u'"abstract:(child OR childr*) abstract:ho*sehold"~3 AND title:tale*')
print(result)
with ix.searcher() as searcher:
    results = searcher.search(result)
    print(results)
    #print(results[0])
    #we can get the positions of a term manually by re-analyzing the stored abstract
    analyzer = StandardAnalyzer(stoplist=None)
    for hit in results:
        #the wildcard tale* means "starts with tale", which is the regular expression r"tale"
        positions = [t.pos for t in analyzer(hit['abstract'], positions=True) if re.match(r"tale", t.text)]
        print("the position of the word pattern <tale*> in document <" + hit['title'] + "> is:")
        print(positions)

(((abstract:child OR abstract:childr*) NEAR abstract:ho*sehold) AND title:tale*)
<Top 1 Results for And([Sequence([Or([Term(u'abstract', u'child'), Prefix(u'abstract', u'childr')]), Wildcard(u'abstract', u'ho*sehold')], slop=3, boost=1.000000), Prefix(u'title', u'tale')]) runtime=0.00529691550764>
the position of the word pattern <tale*> in document <Grimms' Fairy Tales> is:
[4, 14, 38]


### Reference:
Whoosh documentation website. http://whoosh.readthedocs.io/en/latest/index.html