> Any feedback/information on the geometric versus geometri (and
> template versus templat) is appreciated!
geometri is not an English word, and does not appear to have a typical
suffix. It's not surprising, then, that the stemming algorithm leaves
it alone. -e does appear as a removable suffix (i.e. it can be
replaced with -ing or -ed to form related words). Here is a way to
experiment with this:
In [12]: import whoosh.analysis
In [13]: sa = whoosh.analysis.StemmingAnalyzer()
In [14]: for x in sa(u'templat template templating templated geometri '
   ....:             u'geometric geometry geometrical'):
   ....:     print(x.text)
   ....:
templat
templat
templat
templat
geometri
geometr
geometri
geometr
You can see that the stemming algorithm does make a real mistake in
recognizing -cal instead of -ical. Stemming is an imperfect art, and I
think most algorithms will fail in the face of spelling mistakes, but
if you find a better algorithm, it is fortunately pretty easy to plug
in. The default implementation is very commonly used but is not the
most sophisticated one available.
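
As a rough sketch of what plugging in a replacement could look like: the
idea is to wrap whatever base stemmer you like in a function that checks a
small table of known problem words first. Everything named here
(make_stemfn, naive_stem, OVERRIDES, build_analyzer) is made up for
illustration, and naive_stem is a toy stand-in, not the Porter algorithm.
The build_analyzer helper assumes Whoosh is installed and that your version
of StemmingAnalyzer accepts a stemfn keyword, so check that against your
release before relying on it.

```python
# Hypothetical names throughout; naive_stem is a toy, NOT the Porter stemmer.

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ical", "ing", "ic", "ed", "e", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def make_stemfn(base_stem, overrides):
    """Return a stemming function that consults `overrides` before `base_stem`."""
    def stemfn(word):
        if word in overrides:
            return overrides[word]
        return base_stem(word)
    return stemfn


# Map a known misspelling onto the stem its correctly spelled relatives get.
OVERRIDES = {"geometri": "geometr"}

stemfn = make_stemfn(naive_stem, OVERRIDES)

words = ("templat template templating templated "
         "geometri geometric geometry geometrical").split()
stems = [stemfn(w) for w in words]


def build_analyzer(fn):
    """Assumption: Whoosh is installed and StemmingAnalyzer takes `stemfn`."""
    from whoosh.analysis import StemmingAnalyzer
    return StemmingAnalyzer(stemfn=fn)
```

With this toy stemmer all four template variants collapse to one stem and
all four geometry variants to another, so a search for the misspelling would
match the correctly spelled forms. Swapping naive_stem for a real stemmer
keeps the override table working the same way.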
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco