Missing features vs. Lucene?

IanEslick

Jun 19, 2008, 3:04:33 PM
to montezuma-dev
Montezuma is a great tool! Thanks for putting this out there. I
notice that development has been quiet for a while. What are people
using for full-text search on Lisp-based websites?

A couple of questions:
- I wasn't able to find or see the effect of grouping queries "field:
(+term1 -term2)" so I presume they are not supported?
- How do we handle special characters? Do we escape when we index and
when we query, only when we query, etc?
I add the alist (("title" . "My lisp function test-it-now"))
I query via "test\\-it\\-now"?
Currently I cheat and use "test?it?now"
- The system doesn't seem to default to using all fields for searching
and I couldn't see how to enable this. That is, if I want to look for
"lisp" in both title and content of a message. Do I always have to
add the field specifiers myself?

Thank you!
Ian

John Wiseman

Jun 19, 2008, 7:56:38 PM
to montez...@googlegroups.com
> Montezuma is a great tool! Thanks for putting this out there. I
> notice that development has been quiet for a while. What are people
> using for full-text search on Lisp-based websites?

Hi, Ian. It's true that I haven't worked on Montezuma in quite a
while (about a year and a half, it looks like), and I'm not currently
planning on doing any more development on it. Anyone who is
interested in using it or working on it is encouraged to ask me for
commit access to the repository.


> A couple of questions:

I'll try to answer as well as I remember.


> - I wasn't able to find or see the effect of grouping queries
> "field: (+term1 -term2)" so I presume they are not supported?

I think that's correct, yes.


> - How do we handle special characters? Do we escape when we index
> and when we query, only when we query, etc?
> I add the alist (("title" . "My lisp function test-it-now"))
> I query via "test\\-it\\-now"?
> Currently I cheat and use "test?it?now"

This is handled by whatever analyzer is being used. The default is
STANDARD-ANALYZER, which uses STANDARD-TOKENIZER, which uses a crazy
regex I stole from Ferret to do the tokenization.

CL-USER> (defparameter *a* (make-instance 'montezuma:standard-analyzer))
*A*
CL-USER> (montezuma:all-tokens *a* nil "My lisp function test-it-now")
(#S(MONTEZUMA::TOKEN :IMAGE "my" :START 0 :END 2 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "lisp" :START 3 :END 7 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "function" :START 8 :END 16 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "test" :START 17 :END 21 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "it" :START 22 :END 24 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "now" :START 25 :END 28 :INCREMENT 1 :TYPE :WORD))

I'll point out that the analyzer is used to tokenize documents before
they're indexed, and it's also used to tokenize the relevant bits of
queries.

STANDARD-ANALYZER is kind of a generic analyzer. Montezuma also comes
with WHITESPACE-ANALYZER, which just splits tokens on whitespace, and
STOP-ANALYZER, which splits tokens on any non-alpha character,
converts everything to lower case, and removes stop words. For source
code or other specialized types of documents you might want to write
your own analyzer & tokenizer, which isn't hard.
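
As a rough illustration of why the choice of analyzer matters for a name like test-it-now, here is a sketch in plain Common Lisp. It is not Montezuma's code: SPLIT-TOKENS, WHITESPACE-TOKENS and ALPHA-TOKENS are hypothetical helpers that only mimic the splitting rules described above (no stop-word removal). Because the same analyzer runs at index time and at query time, what matters is whether the hyphenated name survives as one token.

```lisp
;; Plain Common Lisp sketch, NOT Montezuma's tokenizers.
;; SPLIT-TOKENS is a hypothetical helper written for this illustration.
(defun split-tokens (text token-char-p)
  "Return the maximal runs of characters in TEXT that satisfy TOKEN-CHAR-P."
  (let ((tokens '())
        (start nil))
    (dotimes (i (length text))
      (if (funcall token-char-p (char text i))
          (unless start (setf start i))       ; a new token begins here
          (when start                         ; a token just ended
            (push (subseq text start i) tokens)
            (setf start nil))))
    (when start                               ; flush the trailing token
      (push (subseq text start) tokens))
    (nreverse tokens)))

(defun whitespace-tokens (text)
  "Split on whitespace only, so hyphenated names stay whole."
  (split-tokens text
                (lambda (c) (not (member c '(#\Space #\Tab #\Newline))))))

(defun alpha-tokens (text)
  "Split on every non-alphabetic character and downcase the result."
  (mapcar #'string-downcase (split-tokens text #'alpha-char-p)))

(whitespace-tokens "My lisp function test-it-now")
;; => ("My" "lisp" "function" "test-it-now")
(alpha-tokens "My lisp function test-it-now")
;; => ("my" "lisp" "function" "test" "it" "now")
```

With the second rule, escaping the hyphens in the query cannot help, because the index never contained a "test-it-now" token in the first place.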


> - The system doesn't seem to default to using all fields for
> searching and I couldn't see how to enable this. That is, if I want
> to look for "lisp" in both title and content of a message. Do I
> always have to add the field specifiers myself?

I think this code in index/index.lisp means that you can set the
default search field when you create an index, but it should default
to searching all fields:

(setf default-search-field (or (get-index-option options :default-search-field)
                               (get-index-option options :default-field)
                               "*"))

Is that not happening?


John

Francis Leboutte

Jun 20, 2008, 2:38:40 AM
to montez...@googlegroups.com
On 20/06/2008 01:56, John Wiseman wrote:
>...

> > - How do we handle special characters? Do we escape when we index
> > and when we query, only when we query, etc?
> > I add the alist (("title" . "My lisp function test-it-now"))
> > I query via "test\\-it\\-now"?
> > Currently I cheat and use "test?it?now"
>
>This is handled by whatever analyzer is being used. The default is
>STANDARD-ANALYZER, which uses STANDARD-TOKENIZER, which uses a crazy
>regex I stole from Ferret to do the tokenization.
>...

Crazy indeed. Look also at the normalize methods, which are called by next-token. I wrote my own tokenizer and regex, which was easier than trying to understand the existing ones.

(defclass my-tokenizer (regexp-tokenizer)
  ())

(defmethod normalize ((self my-tokenizer) str)
  str)

(let ((cached-scanner nil))
  (defmethod token-regexp ((self my-tokenizer))
    (cond
      (cached-scanner cached-scanner)
      (t (let ((reg-expr ...))
           ;; Store the scanner so later calls reuse it.
           (setf cached-scanner (cl-ppcre:create-scanner reg-expr)))))))

If you want to see the regex, just ask (I'd have to clean it up first).

Regards,

Francis

IanEslick

Jul 24, 2008, 8:29:27 AM
to montezuma-dev

>  > - The system doesn't seem to default to using all fields for
>  > searching and I couldn't see how to enable this.  That is, if I want
>  > to look for "lisp" in both title and content of a message.  Do I
>  > always have to add the field specifiers myself?
>
> I think this code in index/index.lisp means that you can set the
> default search field when you create an index, but it should default
> to searching all fields:
>
>    (setf default-search-field (or (get-index-option options :default-search-field)
>                                   (get-index-option options :default-field)
>                                   "*"))
>
> Is that not happening?
>

Hi John, I finally got back to this. My experience is I can do a
query like this:

"+class:question +prompt:exercise" and get back the objects I expect
"+class:question exercise" returns all question objects
"exercise" returns no objects

I appear to have to specify fields explicitly in my queries; terms
without +<field> specifiers don't match.

default-search-field is set to "*" in my index, but only because I set
it manually; that's not done by default.

I just traced that back to initialize-instance :after on index, which
always sets :default-search-field to "". The options list
gives :default-field a default value of "" before :default-search-field
is computed, so :default-search-field always picks up the value
of :default-field. I switched the order and things seem to work
better. I had been setting the default-search-field slot manually, and
it may be that the options plist also needs the
appropriate :default-search-field parameter set to "*".
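
The ordering problem can be reproduced in plain Common Lisp. The sketch below is not the code in index/index.lisp: both functions and the plist are hypothetical stand-ins, assuming the defaulting works roughly as described above. The key point is that "" is a non-NIL value in Lisp, so OR returns it and never reaches the "*" fallback.

```lisp
;; Plain Common Lisp sketch with a hypothetical plist;
;; NOT the code in index/index.lisp.
(defun search-field-defaulting-first (options)
  "Fill in :default-field with \"\" first, then compute the search field."
  (let ((opts (copy-list options)))
    (setf (getf opts :default-field)
          (or (getf opts :default-field) ""))
    ;; "" is non-NIL, so OR stops at :default-field and never reaches "*".
    (or (getf opts :default-search-field)
        (getf opts :default-field)
        "*")))

(defun search-field-computed-first (options)
  "Compute the search field before :default-field receives its \"\" default."
  (or (getf options :default-search-field)
      (getf options :default-field)
      "*"))

(search-field-defaulting-first '())                      ; => ""
(search-field-computed-first '())                        ; => "*"
(search-field-computed-first '(:default-field "title"))  ; => "title"
```

Switching the order, as in the second function, lets the "*" fallback apply when neither option is supplied.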

Ian


