Previously (version 0.1.1), I had been successfully indexing and
searching my 5000+ document corpus. With version 0.1.3a I'm receiving
the following error (the first "none" is terminated by one #x0, and the
second "none" by two):
term out of order: #S(MONTEZUMA::TERM
:FIELD #1="link"
:TEXT "none") < #S(MONTEZUMA::TERM
:FIELD #1#
:TEXT "none")
[Condition of type SIMPLE-ERROR]
Restarts:
0: [RETRY] Retry SLIME REPL evaluation request.
1: [ABORT] Return to SLIME's top level.
2: [TERMINATE-THREAD] Terminate this thread (#<THREAD "repl-thread"
RUNNING {AB8CE69}>)
> term out of order: #S(MONTEZUMA::TERM
> :FIELD #1="link"
> :TEXT "none") < #S(MONTEZUMA::TERM
> :FIELD #1#
> :TEXT "none")
> [Condition of type SIMPLE-ERROR]
I couldn't reproduce that, are you able to isolate a test case
or send me your corpus?
Yes, but the same special things that don't break 0.1.1. I'll try to
use my own
http://yrk.livejournal.com/235234.html montezuma-indexfiles package to
try and isolate the bug on an unmodified version.
With the exact same corpus and my modified code (sorry, didn't get
around to indexing with a vanilla montezuma yet) I get the "term out
of order" error only with version 0.1.3a.
For good measure, I cleaned out all fasls for all of the dependencies
between the two tests.
I'll post more detailed info if I manage to isolate some more useful
info.
> It's especially good to know that 0.1.3a is responsible.
I broke the Reuters corpus into 18,000+ files (1 per report) and
0.1.3a indexed it fine. My conclusion is that Montezuma indexing is
currently undefined if the corpus isn't ASCII clean.
Next I'll try to "poison" the Reuters corpus and try to re-create the
term-out-of-order bug.
A couple of things: the first is that I don't completely understand
the above patch. 0.1.3a of Montezuma calls babel, not sb-ext. What
version of the code is that patch against?
More importantly I've since updated and recompiled all of the
dependencies and can now successfully index and search Hebrew with
0.1.3a with no errors (and without the above patch). Perhaps the "term
out of order" error was a result of a slightly incompatible version of
a library? This isn't a very satisfying resolution to the bug, but
once I start integrating the Hebrew index and search into my work
project I might know more (the loads on the system will be a lot
higher).
For the benefit of posterity, here are the library versions that worked
for me:
sbcl 1.0.25
montezuma-0.1.3a
cl-ppcre-2.0.1
cl-fad-0.6.2
babel_0.3.0
> A couple of things: the first is that I don't completely understand
> the above patch. 0.1.3a of Montezuma calls babel, not sb-ext. What
> version of the code is that patch against?
Oh, sorry. I have an older version of that code around here that still uses SB-EXT. Just replace sb-ext with babel and you should be fine.
> More importantly I've since updated and recompiled all of the
> dependencies and can now successfully index and search Hebrew with
> 0.1.3a with no errors (and without the above patch). Perhaps the "term
> out of order" error was a result of a slightly incompatible version of
> a library? This isn't a very satisfying resolution to the bug, but
> once I start integrating the Hebrew index and search into my work
> project I might know more (the loads on the system will be a lot
> higher).
I'm still suspicious. Well, we'll see what you get. :)
It works! I made sure to clean out fasls and start with a clean Lisp
image before each indexing and am getting the "term out of order"
error only *without* the patch. Library versions remain as I posted
above.
> Oh, sorry. I have an older version of that code around here
> that still uses SB-EXT. Just replace sb-ext with babel and you
> should be fine.
I manually entered the code instead. I don't understand the #+(or)
conditional compilation thingy so I might have written it wrong (but hey
it works). My string-to-bytes now looks like this:
(defun string-to-bytes (string &key (start 0) end)
  "Converts a string to a sequence of bytes (unsigned-byte 8) using
the implementation's default character encoding."
  (let ((s (sb-ext:string-to-octets string)))
    (subseq s start (or end (length s))))
  #+(or)
  (let ((s (subseq string start end)))
    (sb-ext:string-to-octets s))
  #+(or)
  (let ((s (subseq string start end)))
    (babel:string-to-octets s)))
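For what it's worth, here is a small sketch of how the active form in the string-to-bytes above differs from the two forms disabled with #+(or). It is SBCL-only and assumes UTF-8. The helper names and the sample Hebrew string are made up for illustration and are not part of Montezuma. The active form slices by byte offset after encoding. The disabled forms slice by character offset before encoding. The two agree for ASCII but not for multi-byte text. I'm not claiming this is the whole story behind the bug, only that the two orderings are not interchangeable.

```lisp
;; Sketch only: SBCL-specific, assumes UTF-8. Helper names and the
;; sample strings are illustrative, not taken from Montezuma.

(defun bytes-then-slice (string start end)
  "Encode the whole string, then slice the octet vector (byte offsets)."
  (let ((octets (sb-ext:string-to-octets string :external-format :utf-8)))
    (subseq octets start (or end (length octets)))))

(defun slice-then-bytes (string start end)
  "Slice the string (character offsets), then encode the slice."
  (sb-ext:string-to-octets (subseq string start end)
                           :external-format :utf-8))

(let ((ascii "none")
      ;; Four Hebrew letters, built with CODE-CHAR so the source file
      ;; itself stays ASCII.
      (hebrew (coerce (list (code-char #x05E9) (code-char #x05DC)
                            (code-char #x05D5) (code-char #x05DD))
                      'string)))
  ;; ASCII is one byte per character, so both orders give the same bytes.
  (print (equalp (bytes-then-slice ascii 0 2)
                 (slice-then-bytes ascii 0 2)))
  ;; Each of these Hebrew letters is two bytes in UTF-8, so the same
  ;; START/END select different amounts of data.
  (print (length (bytes-then-slice hebrew 0 2)))   ; 2 bytes (one letter)
  (print (length (slice-then-bytes hebrew 0 2))))  ; 4 bytes (two letters)
```

So whichever form is used, the callers have to agree on whether start and end count bytes or characters.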
> It works! I made sure to clean out fasls and start with a clean Lisp
> image before each indexing and am getting the "term out of order"
> error only *without* the patch. Library versions remain as I posted
> above.
Great, I'm going to release the beta version with the fix soon then!