Unicode Support?


Yarek Kowalik

Jan 12, 2009, 2:39:47 AM
to montezuma-dev
I've just run into a problem with Montezuma: some of my text contains
Unicode chars (RIGHT_SINGLE_QUOTATION_MARK - #8217), and writing them
fails miserably at WRITE-BYTE.

Is there unicode support in Montezuma?

Yarek
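
For context on why WRITE-BYTE would reject this character: a binary stream of element type (unsigned-byte 8) accepts only integers from 0 to 255, and RIGHT SINGLE QUOTATION MARK has code 8217. The sketch below is portable Common Lisp, not Montezuma's code; `utf8-octets` is a hypothetical helper. It shows how one such code point becomes three octets under UTF-8:

```lisp
;; Hypothetical helper, not part of Montezuma: encode one Unicode
;; code point as a list of UTF-8 octets. Surrogates and out-of-range
;; values are not rejected in this sketch.
(defun utf8-octets (code)
  (cond ((< code #x80)                       ; 1 octet: plain ASCII
         (list code))
        ((< code #x800)                      ; 2 octets
         (list (logior #xC0 (ash code -6))
               (logior #x80 (logand code #x3F))))
        ((< code #x10000)                    ; 3 octets
         (list (logior #xE0 (ash code -12))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))
        (t                                   ; 4 octets
         (list (logior #xF0 (ash code -18))
               (logior #x80 (logand (ash code -12) #x3F))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))))

;; 8217 is not of type (unsigned-byte 8), so it cannot be written
;; as a single byte.
(typep 8217 '(unsigned-byte 8))   ; => NIL
(utf8-octets 8217)                ; => (226 128 153)
```

Code that stores such text byte by byte would need to write each of these octets in turn instead of the raw character code.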

Leslie P. Polzer

Jan 12, 2009, 4:23:52 AM
to montez...@googlegroups.com

I can reproduce this; we should strive to fix it.

Leslie P. Polzer

Jan 12, 2009, 4:34:14 AM
to montez...@googlegroups.com

Additionally, string matching seems to be broken when
searching for a Unicode string in a memory index:

MONTEZUMA(59): (defparameter *index* (make-instance 'montezuma:index))

*INDEX*
MONTEZUMA(60): (montezuma:add-document-to-index *index* '(("title" . "äöü’")))

NIL
MONTEZUMA(61): (search *index* "title:*ä*")

#<TOP-DOCS 1 hits, (#<SCORE-DOC #(0) -> 0.31 {D3DE5C9}>)>
MONTEZUMA(62): (search *index* "title:*’*")

#<TOP-DOCS 0 hits, NIL>


Leslie P. Polzer

Jan 12, 2009, 8:53:05 AM
to montez...@googlegroups.com

Attached is a highly experimental SBCL-specific patch for this.
Let me know if it works for you.

Leslie

utf8_experimental_sbcl.diff

Yarek Kowalik

Jan 12, 2009, 6:58:58 PM
to montezuma-dev
Hi Leslie!!!

Thanks for the patch. I applied it and I get this error now:


Illegal :UTF-8 character starting at byte position 87.
[Condition of type SB-IMPL::INVALID-UTF8-STARTER-BYTE]

Restarts:
0: [USE-VALUE] Supply a replacement string designator.
1: [RETRY] Retry SLIME REPL evaluation request.
2: [ABORT] Return to SLIME's top level.
3: [TERMINATE-THREAD] Terminate this thread (#<THREAD "new-repl-
thread" RUNNING {100266C3F1}>)

Backtrace:
0: (SB-IMPL::DECODING-ERROR #(97 100 105 100 97 115 ...) 87
88 :UTF-8 SB-IMPL::INVALID-UTF8-STARTER-BYTE 87)
Locals:
SB-DEBUG::ARG-0 = #(97 100 105 100 97 115 ...)
SB-DEBUG::ARG-1 = 87
SB-DEBUG::ARG-2 = 88
SB-DEBUG::ARG-3 = :UTF-8
SB-DEBUG::ARG-4 = SB-IMPL::INVALID-UTF8-STARTER-BYTE
SB-DEBUG::ARG-5 = 87
1: (SB-IMPL::BYTES-PER-UTF8-CHARACTER-AREF #<unavailable argument>
#<unavailable argument> #<unavailable argument>)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = :<NOT-AVAILABLE>
2: (SB-IMPL::UTF8->STRING-AREF #<unavailable argument> #<unavailable
argument> #<unavailable argument>)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = :<NOT-AVAILABLE>
3: ((SB-PCL::FAST-METHOD MONTEZUMA:GET-DOCUMENT (MONTEZUMA::FIELDS-
READER T)) #(0 NIL 1 NIL 2 NIL) #<unavailable argument>
#<MONTEZUMA::FIELDS-READER {10065D08B1}> 0)
Locals:
SB-DEBUG::ARG-0 = #(0 NIL 1 NIL 2 NIL)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::FIELDS-READER {10065D08B1}>
SB-DEBUG::ARG-3 = 0
4: ((LAMBDA (MONTEZUMA:READER)) #<MONTEZUMA::SEGMENT-READER
"_70" (10 docs, 0 deleted docs, 5 field infos) {10065C76C1}>)
Locals:
SB-DEBUG::ARG-0 = #<MONTEZUMA::SEGMENT-READER "_70" (10 docs,
0 deleted docs, 5 field infos) ..
5: (SB-IMPL::%MAP-FOR-EFFECT-ARITY-1 ..)
Locals:
SB-DEBUG::ARG-0 = #<CLOSURE (LAMBDA (MONTEZUMA:READER))
{10067313D9}>
SB-DEBUG::ARG-1 = #(#<MONTEZUMA::SEGMENT-READER "_70" (10
docs, 0 deleted docs, 5 field infos..
6: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE-FIELDS (MONTEZUMA::SEGMENT-
MERGER)) #(0 NIL 4 NIL 3 NIL ...) #<unavailable argument>
#<MONTEZUMA::SEGMENT-MERGER {10065C6F51}>)
Locals:
SB-DEBUG::ARG-0 = #(0 NIL 4 NIL 3 NIL ...)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-MERGER {10065C6F51}>
7: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE (MONTEZUMA::SEGMENT-
MERGER)) #(4 NIL) #<unavailable argument> #<MONTEZUMA::SEGMENT-MERGER
{10065C6F51}>)
Locals:
SB-DEBUG::ARG-0 = #(4 NIL)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-MERGER {10065C6F51}>
8: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE-SEGMENTS (MONTEZUMA:INDEX-
WRITER T)) #(0 NIL 12 NIL 11 NIL ...) #<unused argument>
#<MONTEZUMA:INDEX-WRITER {1006970711}> 2 #<unavailable argument>)
Locals:
SB-PCL::.PV. = #(0 NIL 12 NIL 11 NIL ...)
MONTEZUMA::MAX-SEGMENT = :<NOT-AVAILABLE>
MONTEZUMA::MAX-SEGMENT-SUPPLIED-P = NIL
MONTEZUMA::MIN-SEGMENT = 2
MONTEZUMA::SELF = #<MONTEZUMA:INDEX-WRITER {1006970711}>
9: ((SB-PCL::FAST-METHOD MONTEZUMA::MAYBE-MERGE-SEGMENTS
(MONTEZUMA:INDEX-WRITER)) #(6 NIL 4 NIL 5 NIL ...) #<unavailable
argument> #<MONTEZUMA:INDEX-WRITER {1006970711}>)
Locals:
SB-DEBUG::ARG-0 = #(6 NIL 4 NIL 5 NIL ...)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA:INDEX-WRITER {1006970711}>
10: ((SB-PCL::FAST-METHOD MONTEZUMA:ADD-DOCUMENT-TO-INDEX
(MONTEZUMA:INDEX T)) #(6 NIL 8 NIL 2 NIL ...) #<unused argument>
#<MONTEZUMA:INDEX {1002854EF1}> #<unavailable argument> NIL)
Locals:
SB-PCL::.PV. = #(6 NIL 8 NIL 2 NIL ...)
MONTEZUMA:ANALYZER = NIL
MONTEZUMA:DOC = :<NOT-AVAILABLE>
MONTEZUMA::SELF = #<MONTEZUMA:INDEX {1002854EF1}>

Yarek


Yarek Kowalik

Jan 12, 2009, 7:32:35 PM
to montezuma-dev
Oh, I think the old index data is not compatible with the patch. When I
start with a clean index and put in the data, the patch seems to work fine.

Note: I am using SBCL.

Yarek

Yarek Kowalik

Jan 12, 2009, 7:43:13 PM
to montezuma-dev
Hmm... I spoke too soon. I still get the same error as above, though
this trace should make it clearer. I think the offending character is
#\REGISTERED_SIGN, and it appears to happen (I'm totally guessing
here) when reading a previous document during term-merge:

Illegal :UTF-8 character starting at byte position 4.
[Condition of type SB-IMPL::INVALID-UTF8-STARTER-BYTE]

Restarts:
0: [USE-VALUE] Supply a replacement string designator.
1: [RETRY] Retry SLIME REPL evaluation request.
2: [ABORT] Return to SLIME's top level.
3: [TERMINATE-THREAD] Terminate this thread (#<THREAD "repl-thread"
RUNNING {1002CAA181}>)

Backtrace:
0: (SB-IMPL::DECODING-ERROR #(121 111 117 114 174) 4 5 :UTF-8 SB-
IMPL::INVALID-UTF8-STARTER-BYTE 4)
Locals:
SB-DEBUG::ARG-0 = #(121 111 117 114 174)
SB-DEBUG::ARG-1 = 4
SB-DEBUG::ARG-2 = 5
SB-DEBUG::ARG-3 = :UTF-8
SB-DEBUG::ARG-4 = SB-IMPL::INVALID-UTF8-STARTER-BYTE
SB-DEBUG::ARG-5 = 4
1: (SB-IMPL::BYTES-PER-UTF8-CHARACTER-AREF #<unavailable argument>
#<unavailable argument> #<unavailable argument>)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = :<NOT-AVAILABLE>
2: (SB-IMPL::UTF8->STRING-AREF #<unavailable argument> #<unavailable
argument> #<unavailable argument>)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = :<NOT-AVAILABLE>
3: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-TERM-BUFFER
(MONTEZUMA::TERM-BUFFER T T)) ..)
Locals:
SB-DEBUG::ARG-0 = #(2 NIL 3 NIL 0 NIL ...)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::TERM-BUFFER field:""
text:"ykk®" {10031DE1C1}>
SB-DEBUG::ARG-3 = #<MONTEZUMA::RAM-INDEX-INPUT
file:#<MONTEZUMA::RAM-FILE name:"_19.tis" size..
SB-DEBUG::ARG-4 = #<MONTEZUMA::FIELD-INFOS {1003114771}>
4: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::SEGMENT-TERM-
ENUM)) #(8 NIL 3 NIL 1 NIL ...) #<unavailable argument>
#<MONTEZUMA::SEGMENT-TERM-ENUM {10031DDBC1}>)
Locals:
SB-DEBUG::ARG-0 = #(8 NIL 3 NIL 1 NIL ...)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-TERM-ENUM {10031DDBC1}>
5: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE-TERM-INFOS
(MONTEZUMA::SEGMENT-MERGER)) #(8 NIL 3 NIL) #<unavailable argument>
#<MONTEZUMA::SEGMENT-MERGER {10030CA0D1}>)
Locals:
SB-DEBUG::ARG-0 = #(8 NIL 3 NIL)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-MERGER {10030CA0D1}>
6: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE-TERMS (MONTEZUMA::SEGMENT-
MERGER)) #(0 NIL 4 NIL 5 NIL ...) #<unavailable argument>
#<MONTEZUMA::SEGMENT-MERGER {10030CA0D1}>)
Locals:
SB-DEBUG::ARG-0 = #(0 NIL 4 NIL 5 NIL ...)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-MERGER {10030CA0D1}>
7: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE (MONTEZUMA::SEGMENT-
MERGER)) #(4 NIL) #<unavailable argument> #<MONTEZUMA::SEGMENT-MERGER
{10030CA0D1}>)
Locals:
SB-DEBUG::ARG-0 = #(4 NIL)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-MERGER {10030CA0D1}>
8: ((SB-PCL::FAST-METHOD MONTEZUMA::MERGE-SEGMENTS (MONTEZUMA:INDEX-
WRITER T)) #(0 NIL 12 NIL 11 NIL ...) #<unused argument>
#<MONTEZUMA:INDEX-WRITER {100241D581}> 3 #<unavailable argument>)
Locals:
SB-PCL::.PV. = #(0 NIL 12 NIL 11 NIL ...)
MONTEZUMA::MAX-SEGMENT = :<NOT-AVAILABLE>
MONTEZUMA::MAX-SEGMENT-SUPPLIED-P = NIL
MONTEZUMA::MIN-SEGMENT = 3
MONTEZUMA::SELF = #<MONTEZUMA:INDEX-WRITER {100241D581}>
9: ((SB-PCL::FAST-METHOD MONTEZUMA::MAYBE-MERGE-SEGMENTS
(MONTEZUMA:INDEX-WRITER)) #(6 NIL 4 NIL 5 NIL ...) #<unavailable
argument> #<MONTEZUMA:INDEX-WRITER {100241D581}>)
Locals:
SB-DEBUG::ARG-0 = #(6 NIL 4 NIL 5 NIL ...)
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA:INDEX-WRITER {100241D581}>
10: ((SB-PCL::FAST-METHOD MONTEZUMA:ADD-DOCUMENT-TO-INDEX
(MONTEZUMA:INDEX T)) #(6 NIL 8 NIL 2 NIL ...) #<unused argument>
#<MONTEZUMA:INDEX {1006E8A001}> #<unavailable argument> NIL)
Locals:
SB-PCL::.PV. = #(6 NIL 8 NIL 2 NIL ...)
MONTEZUMA:ANALYZER = NIL
MONTEZUMA:DOC = :<NOT-AVAILABLE>
MONTEZUMA::SELF = #<MONTEZUMA:INDEX {1006E8A001}>
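
A reading of the first backtrace frame, offered as an assumption rather than a confirmed diagnosis: the octets 121 111 117 114 spell "your", and the fifth octet, 174, is the Latin-1 code of the registered sign. In UTF-8 that character is the two-octet sequence 194 174, and a lone 174 has the bit pattern of a continuation byte, so it cannot start a character. That would fit data written one octet per character and later read back as UTF-8. The helper below is hypothetical (it is not part of Montezuma or SBCL) and only checks the structure of the octet sequence:

```lisp
;; Hypothetical helper: return the index where the first structurally
;; invalid UTF-8 sequence starts, or NIL if the vector looks valid.
;; Overlong forms and surrogates are not checked in this sketch.
(defun first-invalid-utf8-position (octets)
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((b (aref octets i))
                    ;; Number of continuation octets the lead octet requires.
                    (extra (cond ((< b #x80) 0)         ; 0xxxxxxx
                                 ((<= #xC2 b #xDF) 1)   ; 110xxxxx
                                 ((<= #xE0 b #xEF) 2)   ; 1110xxxx
                                 ((<= #xF0 b #xF4) 3)   ; 11110xxx
                                 (t (return-from first-invalid-utf8-position i)))))
               ;; Each following octet must look like 10xxxxxx.
               (loop for j from (1+ i) to (+ i extra)
                     do (unless (and (< j n)
                                     (= (logand (aref octets j) #xC0) #x80))
                          (return-from first-invalid-utf8-position i)))
               (setf i (+ i extra 1))))
    nil))

;; The octets from the backtrace: invalid at index 4.
(first-invalid-utf8-position #(121 111 117 114 174))      ; => 4
;; The same text with the registered sign encoded as UTF-8.
(first-invalid-utf8-position #(121 111 117 114 194 174))  ; => NIL
```

The result for the first vector matches the "byte position 4" reported in the error message.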

Leslie P. Polzer

Jan 13, 2009, 3:11:48 AM
to montez...@googlegroups.com

>
> Oh, I think that old data is not compatible. When I start with a
> clean index, and put in the data, the patch seems to be working fine.

Yes, I forgot to mention this. The patch includes changes to the
serializer.

Leslie P. Polzer

Jan 13, 2009, 3:14:10 AM
to montez...@googlegroups.com

>
> Hmm... I spoke too early. I do get the same error as above, though
> this trace should make it clearer. I think the offending character is #
> \REGISTERED_SIGN, and it appears to be happening (I'm totally guessing
> this) when reading a previous document during term-merge:

Can you give me an example to reproduce this?
I'm going to take a look at it today in any case.

Leslie P. Polzer

Jan 15, 2009, 3:44:25 PM
to montezuma-dev
Ah, you're using an in-memory index, right?

I have only fixed the persistent serialization engine so it might well
be that additional corrections are necessary for RAM indices.

Leslie P. Polzer

Jan 15, 2009, 4:02:36 PM
to montezuma-dev
> Ah, you're using an in-memory index, right?
>
> I have only fixed the persistent serialization engine so it might well
> be that additional corrections are necessary for RAM indices.

I just tried it with a RAM index and it works for me, too.

So I'd really need some small snippet to reproduce this...

Yarek Kowalik

Jan 16, 2009, 1:44:16 AM
to montezuma-dev
I was using a disk index. I'll get some examples soon; I'm just trying to
juggle some high-priority items first. For now I have a solution that
tides me over: clean up non-ASCII characters.

Thanks,

Yarek
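
A minimal sketch of that kind of workaround, assuming it means dropping every character outside 7-bit ASCII before the text reaches the index. `strip-non-ascii` is a hypothetical name, not a Montezuma function:

```lisp
;; Hypothetical helper illustrating the workaround; not Montezuma code.
;; Removes every character whose code is above 127.
(defun strip-non-ascii (string)
  (remove-if (lambda (ch) (> (char-code ch) 127)) string))

;; Code 174 is the registered sign.
(strip-non-ascii (format nil "your~C brand" (code-char 174)))  ; => "your brand"
```

This is lossy: a typographic apostrophe inside a word simply disappears, so it only makes sense as a stopgap until non-ASCII text can be indexed reliably.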


Leslie P. Polzer

Jan 20, 2009, 7:59:13 AM
to montezuma-dev


On Jan 16, 7:44 am, Yarek Kowalik <yarek.kowa...@gmail.com> wrote:
> I was using a disk index.  I'll get some examples soon, just trying to
> juggle some high priority items first.  For now I have a solution that
> ties me over: clean up non-ASCII characters.

Any news on this?

This bug isn't triggered by the test suite; I have seen it myself
but wasn't able to reproduce it.

John Wiseman

Jan 23, 2009, 3:43:02 PM
to montezuma-dev
I won't be working on Montezuma any time soon; would either of you be
interested in becoming a project owner at
http://code.google.com/p/montezuma/ ?


John

Leslie P. Polzer

Jan 24, 2009, 5:03:34 AM
to montez...@googlegroups.com

> I won't be working on Montezuma any time soon; would either of you be
> interested in becoming a project owner at http://code.google.com/p/montezuma/
> ?

Montezuma is an essential part of my latest application, so I likely
need it to work for years to come.

My schedule is crammed, but I can apply patches, release packages,
and do light patch review and support if needed.

If you intend to drop the project then a transfer of maintainership
would probably make sense -- a little work and time is better than
nothing.

John Wiseman

Feb 15, 2009, 6:34:03 PM
to montezuma-dev
Oops, I thought I had already added you as a project admin but
apparently not. But you're an admin now!


Thanks,
John



Leslie P. Polzer

Feb 19, 2009, 4:44:06 AM
to montezuma-dev
On Feb 16, 12:34 am, John Wiseman <jjwise...@gmail.com> wrote:
> Oops, I thought I had already added you as a project admin but
> apparently not.  But you're an admin now!

Thanks!

On-topic again: r398 contains the UTF8 changes. I haven't encountered
the aforementioned bug so far, so maybe it was due to stale FASLs.

I'm going to put out a release candidate for the next version so people
can choose whether they want beta UTF8 support or rather stable
single-byte charset support.

Does anyone else have patches they want integrated?

Leslie

Leslie P. Polzer

Mar 19, 2009, 4:19:16 PM
to montezuma-dev
On Feb 16, 12:34 am, John Wiseman <jjwise...@gmail.com> wrote:
> Oops, I thought I had already added you as a project admin but
> apparently not.  But you're an admin now!

Hi John,

could you also add me as an admin for the Google group? It doesn't seem
as if this was picked up there, and I'd like to delete the spam...

John Wiseman

Mar 19, 2009, 4:52:45 PM
to montez...@googlegroups.com

Sure thing, Leslie--you're now a group owner.


John
