Issue 5: ":store-term-vector isn't char-safe"

10 views
Skip to first unread message

Yoni Rabkin

unread,
Oct 12, 2009, 9:44:17 AM10/12/09
to montez...@googlegroups.com

The attached patch stops issue 5 from occurring for me. As a result I
can access term position info for Hebrew text:

MONTEZUMA> (montezuma::get-term-vectors (montezuma::open-index-reader "/montezuma-hacking/index0/" :close-directory-p nil) 0)
#(#<MONTEZUMA::SEGMENT-TERM-VECTOR field:"contents" terms:#("וכתובים" "נביאים"
"תורה") term-freqs:#(1
1
1) positions:#(#(2)
#(1)
#(0)) offsets:#(#(#<MONTEZUMA::TERM-VECTOR-OFFSET-INFO start: 12 end: 19>)
#(#<MONTEZUMA::TERM-VECTOR-OFFSET-INFO start: 5 end: 11>)
#(#<MONTEZUMA::TERM-VECTOR-OFFSET-INFO
start:
0
end:
4>))
{B32F639}>)

If the patch makes sense I'll write a test for it with some far-out
UTF-8.

issue-5-proprosed-fix.patch

Leslie P. Polzer

unread,
Oct 13, 2009, 6:02:45 AM10/13/09
to montezuma-dev
On Oct 12, 3:44 pm, Yoni Rabkin <y...@rabkins.net> wrote:
> The attached patch stops issue 5 from occurring for me. As a result I
> can access term position info for Hebrew text:
>
> If the patch makes sense I'll write a test for it with some far-out
> UTF-8.
>

This looks quite sane and very similar to the other fixes that were
needed
for proper string support.

If it passes all the tests and has its own then we should put it into
upstream.

Thanks!

Leslie

Yoni Rabkin

unread,
Oct 14, 2009, 12:23:05 PM10/14/09
to montezuma-dev
> If it passes all the tests and has its own then we should put it into
> upstream.

It doesn't cause any of the other tests to fail.

I've uploaded a patch which extends the term-vector-io test suite. It
will pass only if my fix is present:

http://montezuma-dev.googlegroups.com/web/issue-5-test.patch

Yoni Rabkin

unread,
Oct 15, 2009, 4:13:31 AM10/15/09
to montezuma-dev
I'm seeing some errors I haven't seen before with this patch
and :store-term-vector :with-positions-offsets (not with testing, but
with real-world use). Let the patcher beware.

Yoni Rabkin

unread,
Oct 15, 2009, 4:58:14 AM10/15/09
to montezuma-dev
Continuing my stream-of-consciousness posting here: the error I'm
referring to occurs because in term-vectors-io.lisp there is no
checking here:

(defmethod close ((self term-vectors-reader))
;; FIXME handle errors?
(with-slots (tvx tvd tvf) self
(close tvx)
(close tvd)
(close tvf)))

It can be trivially avoided by using the "don't do that then" approach
and wrapping each CLOSE with a `when' but surely that isn't the Right
Thing.

Leslie P. Polzer

unread,
Oct 15, 2009, 6:27:57 AM10/15/09
to montezuma-dev
On Oct 15, 10:58 am, Yoni Rabkin <yonirab...@gmail.com> wrote:
> On Oct 15, 10:13 am, Yoni Rabkin <yonirab...@gmail.com> wrote:
>
> > I'm seeing some errors I haven't seen before with this patch
> > and :store-term-vector :with-positions-offsets (not with testing, but
> > with real-world use). Let the patcher beware.
>
> Continuing my stream-of-consciousness posting here: the error I'm
> referring to occurs because in term-vectors-io.lisp there is no
> checking here:
>
> (defmethod close ((self term-vectors-reader))
>    ;; FIXME handle errors?

So this is quite independent from your original patch?


>    (with-slots (tvx tvd tvf) self
>     (close tvx)
>     (close tvd)
>     (close tvf)))
>
> It can be trivially avoided by using the "don't do that then" approach
> and wrapping each CLOSE with a `when' but surely that isn't the Right
> Thing.

What are tv{x,d,f}? Who closes them before CLOSE does?

Yoni Rabkin

unread,
Oct 15, 2009, 7:27:29 AM10/15/09
to montezuma-dev
On Oct 15, 12:27 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:
> On Oct 15, 10:58 am, Yoni Rabkin <yonirab...@gmail.com> wrote:
>
> > On Oct 15, 10:13 am, Yoni Rabkin <yonirab...@gmail.com> wrote:
>
> > > I'm seeing some errors I haven't seen before with this patch
> > > and :store-term-vector :with-positions-offsets (not with testing, but
> > > with real-world use). Let the patcher beware.
>
> > Continuing my stream-of-consciousness posting here: the error I'm
> > referring to occurs because in term-vectors-io.lisp there is no
> > checking here:
>
> > (defmethod close ((self term-vectors-reader))
> >    ;; FIXME handle errors?
>
> So this is quite independent from your original patch?

Not really since without term-vectors this code will never run, and
therefore never fail.

> >    (with-slots (tvx tvd tvf) self
> >     (close tvx)
> >     (close tvd)
> >     (close tvf)))
>
> > It can be trivially avoided by using the "don't do that then" approach
> > and wrapping each CLOSE with a `when' but surely that isn't the Right
> > Thing.
>
> What are tv{x,d,f}?

According to http://lucene.apache.org/java/2_1_0/fileformats.html#Term%20Vectors:

tvf = a list of the terms, their frequencies and, optionally, position
and offest information.

tvd = is used to map out the fields that have term vectors stored and
where the field information is in tvf

tvx = for each document, a pointer to the document data in the
Document tvd

> Who closes them before CLOSE does?

MERGE checks if there are term-vectors (which is why this error
should only appear when the index has positions, offsets or
both):

(defmethod merge ((self segment-merger))
...
(when (has-vectors-p field-infos)
(merge-vectors self))
...

The call to MERGE-VECTORS then calls CLOSE.

Note that this error _doesn't always show up_, so I guess I'll be
checking to see what kind of search/indexing condition/s cause this.

Yoni Rabkin

unread,
Oct 15, 2009, 7:57:07 AM10/15/09
to montezuma-dev
I'll just clarify that the above describes the writer while the error
is signaled by the reader. The reader seems to be unconditionally
closed by DO-CLOSE specialized by SEGMENT-READER in segment-
reader.lisp.

Maybe this is similar to the unicode-safe bug the original patch
fixes; the problem was with the writer but the error was signaled by
the reader.

Leslie P. Polzer

unread,
Oct 15, 2009, 8:09:14 AM10/15/09
to montezuma-dev
On Oct 15, 1:57 pm, Yoni Rabkin <yonirab...@gmail.com> wrote:

> Maybe this is similar to the unicode-safe bug the original patch
> fixes; the problem was with the writer but the error was signaled by
> the reader.

I think this is only partly the case. I do agree with your statement,
but in the original UTF8 fixes we had a problem with on-disk data
(i.e. a writer wrote incorrect stuff that its reader pendant wouldn't
grok), whereas now we seem to have a problem with in-memory
management of data structures...

Yoni Rabkin

unread,
Oct 15, 2009, 2:27:17 PM10/15/09
to montezuma-dev
> whereas now we seem to have a problem with in-memory
> management of data structures...

I'm unsure of if it is a problem per-se since tvx, tvd and tvf are
explicitly set to nil during initialization of a TERM-VECTORS-READER
(instead of being set to be compound file streams), but only if
[segment].tvx _isn't_ in the segment file table.

Leslie P. Polzer

unread,
Oct 16, 2009, 4:05:00 AM10/16/09
to montezuma-dev
On Oct 15, 8:27 pm, Yoni Rabkin <yonirab...@gmail.com> wrote:

> I'm unsure of if it is a problem per-se since tvx, tvd and tvf are
> explicitly set to nil during initialization of a TERM-VECTORS-READER
> (instead of being set to be compound file streams), but only if
> [segment].tvx _isn't_ in the segment file table.

Yes, indeed. So CLOSE doesn't take this branch into account.

Let's add a check for tvx=nil in CLOSE (the other methods for the
reader
also have that), add a test case for this issue and give the whole
thing a
test run.

Yoni Rabkin

unread,
Oct 20, 2009, 4:26:18 PM10/20/09
to montezuma-dev

> Let's add a check for tvx=nil in CLOSE (the other methods for the
> reader
> also have that), add a test case for this issue and give the whole
> thing a
> test run.

Here are the patches for the fix, CLOSE fix and tests for both:

http://groups.google.com/group/montezuma-dev/web/issue-5-test-with-closing.patch
http://groups.google.com/group/montezuma-dev/web/issue-5-fix-with-closing.patch

Leslie P. Polzer

unread,
Oct 21, 2009, 5:32:36 AM10/21/09
to montezuma-dev
On Oct 20, 10:26 pm, Yoni Rabkin <yonirab...@gmail.com> wrote:

> Here are the patches for the fix, CLOSE fix and tests for both:

Looks great, thanks! Committed as r419, issue #5 closed.
Reply all
Reply to author
Forward
0 new messages