Why is QueryParser dropping terms?

Nikolaus Rath

Sep 21, 2021, 8:27:18 AM
to Whoosh

Hello,

Why is the "will" term dropped here?

import whoosh
whoosh.__version__
Out[3]: (2, 7, 4)

import whoosh.index
import whoosh.qparser

ix = whoosh.index.open_dir('.')
qp = whoosh.qparser.QueryParser('content', schema=ix.schema)

qp.parse('foo bar')
Out[10]: And([Term('content', 'foo'), Term('content', 'bar')])

qp.parse('last will')
Out[11]: Term('content', 'last')


Maybe this is related to "will" being such a common term? But dropping it from the AND clause makes the query even more generic, if anything...?


Thanks,
-Nikolaus

David Lowry-Duda

Sep 21, 2021, 11:15:50 AM
to who...@googlegroups.com
The word "will" is a 'stop word': one of a set of very common words that
the default analyzer removes, both when indexing and when parsing a
query. You can change this behavior.

When you make a schema, you can choose how you want the text to be
analyzed. By default, a 'StandardAnalyzer' is used, something like the
following.

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT

# Spell out the analyzer that TEXT uses by default
schema = Schema(content=TEXT(analyzer=StandardAnalyzer()))

The StandardAnalyzer has a stoplist (i.e. list of words to ignore) as
follows:

{'from', 'when', 'us', 'your', 'yet', 'are', 'if', 'an', 'is', 'on',
'may', 'and', 'with', 'have', 'as', 'of', 'to', 'or', 'you', 'for',
'will', 'in', 'we', 'tbd', 'at', 'a', 'by', 'it', 'that', 'this', 'not',
'be', 'the', 'can'}

You could possibly use a different analyzer, or perhaps remove/alter the
stoplist in the standard analyzer.

- DLD