Concatenating lexically-partitioned indexes


Valentin Tablan

Jan 29, 2014, 1:01:13 PM
to mg...@googlegroups.com
Hi,

Is Concatenate the correct tool to combine a set of indexes that form a
lexical cluster?

I have a set of indexes that work as expected when lexically clustered,
but the joint index obtained by concatenating them seems to not return
any results.

Before I go over my code looking for bugs, I thought I'd double check
that it is supposed to work as I expect it to...

Thanks,
Valentin

Sebastiano Vigna

Jan 29, 2014, 1:05:42 PM
to mg...@googlegroups.com
Well, no. Concatenate offsets all document pointers using the sum of the sizes of the previous indices. This is not what you want. It is the reverse of contiguous documental partitioning.
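To make the offsetting concrete, here is a minimal sketch in plain Java. It is a toy model, not MG4J code: `ToyIndex` and `concatenate` are hypothetical names, and a sub-index is reduced to a document count plus a map from terms to document pointers. It assumes a lexical cluster in which both sub-indexes cover the same three documents but hold different terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConcatenateOffsetSketch {
    /** Hypothetical toy sub-index: a document count plus term -> document pointers. */
    static final class ToyIndex {
        final long numberOfDocuments;
        final Map<String, long[]> postings;

        ToyIndex(long numberOfDocuments, Map<String, long[]> postings) {
            this.numberOfDocuments = numberOfDocuments;
            this.postings = postings;
        }
    }

    /** Toy concatenation: pointers of each index are shifted by the total
     *  number of documents in the indices that precede it. */
    static Map<String, List<Long>> concatenate(List<ToyIndex> indices) {
        Map<String, List<Long>> result = new TreeMap<>();
        long offset = 0;
        for (ToyIndex index : indices) {
            for (Map.Entry<String, long[]> e : index.postings.entrySet()) {
                List<Long> list = result.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                for (long pointer : e.getValue()) list.add(pointer + offset);
            }
            offset += index.numberOfDocuments;
        }
        return result;
    }

    public static void main(String[] args) {
        // Same 3 documents (0..2) in both sub-indexes, disjoint terms.
        ToyIndex first = new ToyIndex(3, Map.of("apple", new long[] {0, 2}));
        ToyIndex second = new ToyIndex(3, Map.of("zebra", new long[] {1, 2}));
        // prints {apple=[0, 2], zebra=[4, 5]}
        System.out.println(concatenate(List.of(first, second)));
    }
}
```

In this model the postings of "zebra" end up pointing at documents 4 and 5, which do not exist in the original three-document collection. That is the right behaviour for a documental partition and the wrong one for a lexical partition.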

You can use Merge. Merge does not offset document pointers. Of course it's not doing any merge for real--each index has a nonintersecting set of terms, so it is slightly overkill, but that's the easiest solution.

In principle it is possible to write a fast-as-light combiner that just concatenates the bitstreams correctly.

Ciao,

seba

Valentin Tablan

Jan 29, 2014, 1:19:23 PM
to mg...@googlegroups.com
Hi,

Is Merge really supposed to work here? The Javadocs say:

This class merges indices by performing a simple ordered list merge.
Documents appearing in two indices will cause an error.

Because this is a lexical cluster, there are documents that appear in
different sub-indexes. As advertised, I am indeed getting exceptions
when trying to merge:

============
Caused by: java.lang.IllegalArgumentException: Document 0 has nonzero length in two indices
    at it.unimi.di.big.mg4j.tool.Merge.combineSizes(Merge.java:153)
    at it.unimi.di.big.mg4j.tool.Combine.run(Combine.java:563)
    at gate.mimir.index.AtomicIndex.compactIndex(AtomicIndex.java:1352)
    ... 2 more
============
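The message is consistent with a size-combining step that assumes every document belongs to exactly one input index. The following is only a toy sketch of that kind of check, not the actual source of Merge.combineSizes; the class name, the array-based representation, and the assumption that all size arrays have the same length are mine.

```java
import java.util.Arrays;

public class CombineSizesSketch {
    /** Toy check: each document may have a nonzero size in at most one index.
     *  Assumes every row of sizesPerIndex has the same length. */
    static int[] combineSizes(int[][] sizesPerIndex) {
        int numberOfDocuments = sizesPerIndex[0].length;
        int[] combined = new int[numberOfDocuments];
        for (int[] sizes : sizesPerIndex) {
            for (int doc = 0; doc < numberOfDocuments; doc++) {
                if (sizes[doc] == 0) continue;
                if (combined[doc] != 0) {
                    throw new IllegalArgumentException(
                        "Document " + doc + " has nonzero length in two indices");
                }
                combined[doc] = sizes[doc];
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        // Disjoint documents: document 0 only in the first index, document 1 only in the second.
        System.out.println(Arrays.toString(combineSizes(new int[][] {{5, 0}, {0, 7}})));
        // Lexical cluster in this toy model: every sub-index reports a size for document 0.
        try {
            combineSizes(new int[][] {{5, 3}, {5, 3}});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, any input in which two sub-indexes both report a size for the same document fails on the first shared document, which is document 0 here.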

Am I missing something?

Thanks,
Valentin

Sebastiano Vigna

Jan 29, 2014, 1:25:38 PM
to mg...@googlegroups.com
On 29 Jan 2014, at 10:19 AM, Valentin Tablan <v.ta...@gmail.com> wrote:

> This class merges indices by performing a simple ordered list merge. Documents appearing in two indices will cause an error.

I didn't sleep well. Of course you're right.

Well, there's no easy way I can think of, actually. In theory you can write a small piece of code that opens the cluster and feeds it to an IndexWriter...
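The shape of such a combiner can be sketched with plain collections. This is a toy model only: it does not use the MG4J cluster or IndexWriter APIs, and `ToyWriter` and `combine` are hypothetical names. It assumes each sub-index is a sorted map from terms to document pointers, and that the terms of different sub-indexes do not overlap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class LexicalCombineSketch {
    /** Hypothetical stand-in for an index writer; receives terms in sorted order. */
    interface ToyWriter {
        void write(String term, long[] pointers);
    }

    /** Feeds every term of every sub-index to the writer, in term order,
     *  leaving document pointers untouched (no offsetting). */
    static void combine(List<SortedMap<String, long[]>> subIndexes, ToyWriter writer) {
        SortedMap<String, long[]> all = new TreeMap<>();
        for (SortedMap<String, long[]> sub : subIndexes) {
            for (Map.Entry<String, long[]> e : sub.entrySet()) {
                // A lexical partition assigns each term to exactly one sub-index.
                if (all.put(e.getKey(), e.getValue()) != null) {
                    throw new IllegalArgumentException(
                        "Term " + e.getKey() + " appears in two sub-indexes");
                }
            }
        }
        all.forEach(writer::write);
    }

    public static void main(String[] args) {
        SortedMap<String, long[]> first = new TreeMap<>();
        first.put("zebra", new long[] {1, 2});
        SortedMap<String, long[]> second = new TreeMap<>();
        second.put("apple", new long[] {0, 2});
        // Terms come out in sorted order with their original pointers.
        combine(List.of(first, second),
            (term, pointers) -> System.out.println(term + " " + Arrays.toString(pointers)));
    }
}
```

A real version would stream posting lists from the cluster rather than hold them in memory, and would have to carry over counts, positions, and document sizes as well.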

Ciao,

seba
