standard analyzer ignores stop-words

Willem Rein Oudshoorn

Sep 9, 2012, 3:14:24 PM

to montez...@googlegroups.com

I think I found a small bug in the standard analyzer.
If I do

(montezuma:all-tokens * nil "a is bee there")

with '*' bound to the standard analyzer, I expect only one token, 'bee',
but I get a token for every word.

As far as I can see, the reason is that the generic method

(token-stream ((self standard-analyzer) ...))

ignores the stop-words that are stored in the 'standard-analyzer' class
(which is a subclass of stop-analyzer).

The call to

(token-stream ((self stop-analyzer) ...))

does take the stop-words into account.
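
To illustrate the dispatch pattern I suspect, here is a minimal toy sketch in plain CLOS. It does not use Montezuma's real classes or code: every name in it (split-words, drop-stop-words, toy-tokens and the toy-* classes) is made up for this example, and the three stop-words are just a small sample. It only shows how a method specialised on a subclass can bypass a slot that the parent's method honours.

```lisp
;; Toy model only: these definitions are hypothetical and are NOT
;; Montezuma's actual classes or methods.

(defun split-words (string)
  "Split STRING on single spaces and return the non-empty words."
  (loop with start = 0
        for end = (position #\Space string :start start)
        for word = (subseq string start end)
        unless (string= word "") collect word
        while end
        do (setf start (1+ end))))

(defclass toy-stop-analyzer ()
  ((stop-words :initarg :stop-words
               :reader stop-words
               :initform '("a" "is" "there"))))

;; Both subclasses inherit the STOP-WORDS slot from the parent.
(defclass toy-broken-analyzer (toy-stop-analyzer) ())
(defclass toy-fixed-analyzer (toy-stop-analyzer) ())

(defun drop-stop-words (analyzer words)
  "Remove every word that is listed in ANALYZER's stop-words slot."
  (remove-if (lambda (word)
               (member word (stop-words analyzer) :test #'string=))
             words))

(defgeneric toy-tokens (analyzer string)
  (:documentation "Return the list of tokens ANALYZER produces for STRING."))

;; Parent method: honours the stop-words slot.
(defmethod toy-tokens ((self toy-stop-analyzer) string)
  (drop-stop-words self (split-words string)))

;; Suspected pattern: the more specific method replaces the parent's
;; method and never looks at the inherited slot.
(defmethod toy-tokens ((self toy-broken-analyzer) string)
  (split-words string))

;; One possible fix: the specialised method filters through the slot too.
(defmethod toy-tokens ((self toy-fixed-analyzer) string)
  (drop-stop-words self (split-words string)))

(toy-tokens (make-instance 'toy-broken-analyzer) "a is bee there")
;; => ("a" "is" "bee" "there")
(toy-tokens (make-instance 'toy-fixed-analyzer) "a is bee there")
;; => ("bee")
```

In this toy version the broken subclass still carries the stop-words in its slot, which matches what the inspector shows in my transcript, but its own method simply never consults them.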

(See below for a full transcript to see where I think it goes wrong.)

I have made a patch to fix this.
Shall I send a pull request?

(Or maybe this isn't a problem at all and I am misunderstanding what the
intention is.)

Kind regards,
Wim Oudshoorn.

PS. Transcript of my confusion:

im-xml> (montezuma:analyzer *index*)
#<MONTEZUMA:STANDARD-ANALYZER {100BFFB503}>
im-xml> (inspect *)

The object is a STANDARD-OBJECT of type MONTEZUMA:STANDARD-ANALYZER.
0. STOP-WORDS: ("a" "an" "and" "are" "as" "at" "be" "but" "by" "for" "if"
"in" "into" "is" "it" "no" "not" "of" "on" "or" "s" "such"
"t" "that" "the" "their" "then" "there" "these" "they"
"this" "to" "was" "will" "with")
> (montezuma:all-tokens * nil "a is bee there")

(#S(MONTEZUMA::TOKEN :IMAGE "a" :START 0 :END 1 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "is" :START 2 :END 4 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "bee" :START 5 :END 8 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "there" :START 9 :END 14 :INCREMENT 1 :TYPE :WORD))
