Hi, Ian. It's true that I haven't worked on Montezuma in quite a
while (about a year and a half, it looks like), and I'm not currently
planning on doing any more development on it. Anyone who is
interested in using it or working on it is encouraged to ask me for
commit access to the repository.
> A couple of questions:
I'll try to answer as well as I remember.
> - I wasn't able to find or see the effect of grouping queries
> "field: (+term1 -term2)" so I presume they are not supported?
I think that's correct, yes.
> - How do we handle special characters? Do we escape when we index
> and when we query, only when we query, etc?
> I add the alist (("title" . "My lisp function test-it-now"))
> I query via "test\\-it\\-now"?
> Currently I cheat and use "test?it?now"
This is handled by whatever analyzer is being used. The default is
STANDARD-ANALYZER, which uses STANDARD-TOKENIZER, which uses a crazy
regex I stole from Ferret to do the tokenization.
CL-USER> (defparameter *a* (make-instance 'montezuma:standard-analyzer))
*A*
CL-USER> (montezuma:all-tokens *a* nil "My lisp function test-it-now")
(#S(MONTEZUMA::TOKEN :IMAGE "my" :START 0 :END 2 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "lisp" :START 3 :END 7 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "function" :START 8 :END 16 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "test" :START 17 :END 21 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "it" :START 22 :END 24 :INCREMENT 1 :TYPE :WORD)
 #S(MONTEZUMA::TOKEN :IMAGE "now" :START 25 :END 28 :INCREMENT 1 :TYPE :WORD))
I'll point out that the analyzer is used to tokenize documents before
they're indexed, and it's also used to tokenize the relevant bits of
queries.
STANDARD-ANALYZER is kind of a generic analyzer. Montezuma also comes
with WHITESPACE-ANALYZER, which just splits tokens on whitespace, and
STOP-ANALYZER, which splits tokens on any non-alpha character,
converts everything to lower case, and removes stop words. For source
code or other specialized types of documents you might want to write
your own analyzer & tokenizer, which isn't hard.
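For comparison, here's a sketch (untested, but assuming the same ALL-TOKENS interface shown above) of what WHITESPACE-ANALYZER would do with the same string. Since it splits only on whitespace, "test-it-now" should come back as a single token instead of three:

```lisp
;; Sketch, assuming the same ALL-TOKENS interface as the
;; STANDARD-ANALYZER transcript above.  WHITESPACE-ANALYZER splits
;; only on whitespace, so the hyphens should survive tokenization.
CL-USER> (defparameter *w* (make-instance 'montezuma:whitespace-analyzer))
*W*
CL-USER> (montezuma:all-tokens *w* nil "My lisp function test-it-now")
```

That's also why the "test\\-it\\-now" escaping question doesn't really arise with the default analyzer: the hyphens are gone by the time the query terms are matched against the index.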
> - The system doesn't seem to default to using all fields for
> searching and I couldn't see how to enable this. That is, if I want
> to look for "lisp" in both title and content of a message. Do I
> always have to add the field specifiers myself?
I think this code in index/index.lisp means that you can set the
default search field when you create an index, but it should default
to searching all fields:
(setf default-search-field (or (get-index-option options :default-search-field)
                               (get-index-option options :default-field)
                               "*"))
Is that not happening?
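For example, something like this should work (a sketch, not tested against the current tree; the :DEFAULT-SEARCH-FIELD option name is taken from the index.lisp snippet above):

```lisp
;; Sketch: create an index that searches all fields by default.
;; The :DEFAULT-SEARCH-FIELD option name comes from the index.lisp
;; snippet above; "*" means "all fields".
(defparameter *index*
  (make-instance 'montezuma:index :default-search-field "*"))

(montezuma:add-document-to-index *index*
  '(("title" . "My lisp function") ("content" . "some lisp content")))

;; With "*" as the default field, an unqualified query should be
;; matched against both "title" and "content":
(montezuma:search *index* "lisp")
```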
John
Crazy indeed. Look also at the NORMALIZE methods, which are called by NEXT-TOKEN. I wrote my own tokenizer and regex, which was easier than trying to understand it:
(defclass my-tokenizer (regexp-tokenizer)
  ())

;; Don't lowercase or otherwise alter tokens.
(defmethod normalize ((self my-tokenizer) str)
  str)

;; Compile the scanner once and reuse it on subsequent calls.
(let ((cached-scanner nil))
  (defmethod token-regexp ((self my-tokenizer))
    (or cached-scanner
        (let ((reg-expr ...))
          (setf cached-scanner (cl-ppcre:create-scanner reg-expr))))))
If you want to see the regex, just ask (I'll have to clean it up first).
Regards,
Francis