Highlighting


Chris Clark

Mar 27, 2009, 6:55:56 PM
to Whoosh
I've been playing with highlight.py and I think I don't understand the
stemming part of the process. I've hacked highlight.py, from:

.....
txt = open(".. some file...").read().decode("utf8")
.....

to:

txt = u"""NOTE the text below is nonsense! Just picked words that
_may_ hit the highlighter. NOTE no highlight with the G word but there
is with the T word.
The template was NOT very good, in fact it was ungeometric in its use.

In mathematics, a geometric (made up geometri) progression, also known
as a geometric sequence, is a sequence of numbers where each term
after the first is found by doing stuff.

geometri was though.

"""

I.e. basically gibberish :-) so I can see what the highlighting
results look like.

Here is the highlight call:

fs = highlight(txt, ["templat", "geometri"], sa, SentenceFragmenter(),
               UppercaseFormatter())

This is what I'm getting out:

--------
TEMPLATE was not very good, in fact it was ungeometric in its
use....mathematics, a geometric (made up GEOMETRI) progression, also
known as a geometric sequence, is a sequence of numbers where each
term after the first is found by doing stuff....GEOMETRI was though.
--------

So "template" is being matched with "templat" (no 'e') but "geometri"
is not being matched with "geometric". Is this expected? I'm really
not sure as I'm not versed in stemming.

As a side note, you may notice that the unmatched text in the output
is completely lower-cased. This is because I updated
UppercaseFormatter() with explicit lower() calls:

def _format_fragment(self, text, fragment):
    output = []
    index = fragment.startchar

    for t in fragment.matches:
        # Lower-case any text between the previous token and this one.
        if t.startchar > index:
            output.append(text[index:t.startchar].lower())

        ttxt = text[t.startchar:t.endchar]
        if t.matched:
            ttxt = ttxt.upper()
        else:  ## not sure we need the else.. but just in case...
            ttxt = ttxt.lower()
        output.append(ttxt)
        index = t.endchar

    # Lower-case whatever follows the last token in the fragment.
    output.append(text[index:fragment.endchar].lower())
    return "".join(output)
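
For anyone who wants to try the formatting logic without an index or analyzer, here is a minimal standalone sketch of the method above. Token and Fragment are hypothetical stand-ins that carry only the attributes the method reads (startchar, endchar, matched, matches); they are not Whoosh's own classes.

```python
from collections import namedtuple

# Hypothetical stand-ins exposing only the attributes the method reads.
Token = namedtuple("Token", "startchar endchar matched")
Fragment = namedtuple("Fragment", "startchar endchar matches")


def format_fragment(text, fragment):
    """Upper-case matched tokens and lower-case everything else."""
    output = []
    index = fragment.startchar
    for t in fragment.matches:
        # Text between the previous token and this one is lower-cased.
        if t.startchar > index:
            output.append(text[index:t.startchar].lower())
        ttxt = text[t.startchar:t.endchar]
        output.append(ttxt.upper() if t.matched else ttxt.lower())
        index = t.endchar
    # Trailing text after the last token is lower-cased as well.
    output.append(text[index:fragment.endchar].lower())
    return "".join(output)


text = "The template was NOT very good"
frag = Fragment(0, len(text), [Token(4, 12, True)])
print(format_fragment(text, frag))  # the TEMPLATE was not very good
```

The single matched token covers "template" (characters 4 to 12), so only that word is upper-cased, which mirrors the start of the output shown earlier.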

Any feedback/information on the geometric versus geometri (and
template versus templat) is appreciated!

Chris

Robert Kern

Mar 27, 2009, 7:09:11 PM
to who...@googlegroups.com
On Fri, Mar 27, 2009 at 17:55, Chris Clark <Chris...@ingres.com> wrote:

> Any feedback/information on the geometric versus geometri (and
> template versus templat) is appreciated!

geometri is not an English word, and does not appear to have a typical
suffix. It's not surprising, then, that the stemming algorithm leaves
it alone. -e, on the other hand, is a removable suffix (i.e. it can be
replaced with -ing or -ed to form related words), which is why
"template" is reduced to "templat". Here is a way to experiment with
this:

In [12]: import whoosh.analysis

In [13]: sa = whoosh.analysis.StemmingAnalyzer()

In [14]: for x in sa('templat template templating templated geometri geometric geometry geometrical'):
   ....:     print x.text
   ....:
   ....:
templat
templat
templat
templat
geometri
geometr
geometri
geometr


You can see that the stemming algorithm does make a real mistake in
recognizing -cal instead of -ical. Stemming is an imperfect art, and I
think most algorithms will fail in the face of spelling mistakes, but
if you find a better algorithm, it is fortunately pretty easy to plug
in. The default implementation is very commonly used but is not the
most sophisticated one available.
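
To connect the stems in the session above back to the original highlight call, here is a minimal sketch. It assumes the highlighter marks a word as matched when its stemmed form equals one of the supplied terms, which is consistent with the output in this thread but is not taken from the Whoosh source. The STEMS table is copied from the session output rather than computed, and matched_words is a hypothetical helper, not part of Whoosh.

```python
# Stems copied from the session output above; nothing is computed here.
STEMS = {
    "templat": "templat",
    "template": "templat",
    "templating": "templat",
    "templated": "templat",
    "geometri": "geometri",
    "geometric": "geometr",
    "geometry": "geometri",
    "geometrical": "geometr",
}


def matched_words(words, query_terms):
    """Return the words whose stem is one of the query terms."""
    terms = set(query_terms)
    return [w for w in words if STEMS.get(w, w) in terms]


words = ["template", "geometric", "geometri", "geometry"]

# Terms from the original call: "geometric" stems to "geometr", so it is missed.
print(matched_words(words, ["templat", "geometri"]))
# ['template', 'geometri', 'geometry']

# Using the stem "geometr" instead picks up "geometric".
print(matched_words(words, ["templat", "geometr"]))
# ['template', 'geometric']
```

Under that assumption, passing each search word through the same analyzer before handing it to the highlighter, rather than typing the stems by hand, should keep the terms and the text in agreement.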

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
