Document pointers in documental clusters

Valentin Tablan

Dec 20, 2013, 6:16:55 AM

to mg...@googlegroups.com

Hi,

My interpretation of DocumentalConcatenatedCluster is that it expects
its sub-indexes to return global document pointers, i.e.

- member-0 has documents 0..N_0
- member-1 has documents N_0+1 .. N_1,
- etc.

However, when I try to create a new Quasi-succinct index, I get an error
if I try to write document pointers that don't start from zero. At least
that's my guess as to the source of the following error:

Caused by: java.lang.IllegalArgumentException: Too large prefix sum: 88 >= 84
    at it.unimi.di.big.mg4j.index.QuasiSuccinctIndexWriter$Accumulator.add(QuasiSuccinctIndexWriter.java:447)
    at it.unimi.di.big.mg4j.index.QuasiSuccinctIndexWriter.writeDocumentPointer(QuasiSuccinctIndexWriter.java:683)
    at gate.mimir.index.AtomicIndex$PostingsList.write(AtomicIndex.java:234)

When I reset the document pointer to zero at the start of each batch,
the whole process works fine, which suggests that the problem is the
too-large pointers.

If I'm reading the javadocs correctly, I should be able to use
zero-based sub-indexes if I put them all into a merged cluster. However,
that seems to do more work than is strictly necessary, as it
needs to remap document pointers.

Am I correct in assuming I should be able to write arbitrary document
pointers into an index?

Thanks,
Valentin

Sebastiano Vigna

Dec 20, 2013, 11:58:09 AM

to mg...@googlegroups.com

On 20 Dec 2013, at 3:16 AM, Valentin Tablan <v.ta...@gmail.com> wrote:

> Hi,
>
>
> My interpretation of DocumentalConcatenatedCluster is that it expects
> its sub-indexes to return global document pointers, i.e.
>
> - member-0 has documents 0..N_0
> - member-1 has documents N_0+1 .. N_1,

Nonono.

* <p>This class assumes that the global document pointers returned by each index will be increasing.
* Using this assumption, no merge is performed; simply, when an index iterator is exhausted we look
* into the next one.

*Global* means the documents returned from the strategy. *Local* is the numbering in each index.

The idea is that you use a ContiguousDocumentalStrategy. The strategy gives you the cutpoints of the global space: essentially 0, ndoc0, ndoc0+ndoc1, ndoc0+ndoc1+ndoc2, etc.

Each local index is a standard index starting at 0. The strategy will turn the local indices into global indices.
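
To make the cutpoint idea concrete, here is a minimal sketch in plain Java. It is a toy model, not MG4J code: the class name CutpointSketch, its methods, and the sub-index sizes are all made up for illustration, and it assumes every sub-index contains at least one document.

```java
import java.util.Arrays;

/** Toy model of contiguous cutpoints; hypothetical names, not the MG4J API. */
class CutpointSketch {
    /** Builds the cutpoints 0, n0, n0+n1, ... from the per-index document counts. */
    static long[] cutpoints(long[] sizes) {
        long[] cut = new long[sizes.length + 1];
        for (int i = 0; i < sizes.length; i++) cut[i + 1] = cut[i] + sizes[i];
        return cut;
    }

    /** Maps a zero-based local pointer of sub-index k to its global pointer. */
    static long globalPointer(long[] cut, int k, long localPointer) {
        if (localPointer < 0 || localPointer >= cut[k + 1] - cut[k])
            throw new IllegalArgumentException("Local pointer out of range: " + localPointer);
        return cut[k] + localPointer;
    }

    /** Finds the sub-index that owns a global pointer (assumes no empty sub-index). */
    static int localIndex(long[] cut, long globalPointer) {
        if (globalPointer < 0 || globalPointer >= cut[cut.length - 1])
            throw new IllegalArgumentException("Global pointer out of range: " + globalPointer);
        int pos = Arrays.binarySearch(cut, globalPointer);
        // An exact hit is the first document of that sub-index;
        // otherwise the owner is the cutpoint just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        long[] cut = cutpoints(new long[] { 84, 50, 10 }); // made-up sizes
        System.out.println(Arrays.toString(cut));
        System.out.println(globalPointer(cut, 1, 4));
        System.out.println(localIndex(cut, 88));
    }
}
```

With these made-up sizes, local document 4 of the second sub-index maps to global document 88, while the sub-index itself only ever stores the pointer 4.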

> If I'm reading the javadocs correctly, I should be able to use
> zero-based sub-indexes if I put them all into a merged cluster. However,
> that seems to do more work than is strictly necessary, as it
> needs to remap document pointers.

You just need a DocumentalConcatenatedCluster and a strategy.

> Am I correct in assuming I should be able to write arbitrary document
> pointers into an index?

You can, but the numberOfDocuments parameter must be a strict upper bound on the document pointers you put in.
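
As an illustration of that constraint, the sketch below validates a posting list before it would be written. It is a hypothetical stand-in, not the check MG4J performs internally: the class PointerBoundSketch and its messages are invented here, and it assumes pointers in a posting list are strictly increasing.

```java
/** Toy pre-write validation; hypothetical, not the MG4J implementation. */
class PointerBoundSketch {
    /** Throws unless the pointers are strictly increasing and all below numberOfDocuments. */
    static void checkPointers(long[] pointers, long numberOfDocuments) {
        long previous = -1;
        for (long p : pointers) {
            if (p <= previous)
                throw new IllegalArgumentException("Pointers must increase: " + p + " <= " + previous);
            // Strict bound: the largest legal pointer is numberOfDocuments - 1.
            if (p >= numberOfDocuments)
                throw new IllegalArgumentException("Pointer too large: " + p + " >= " + numberOfDocuments);
            previous = p;
        }
    }

    public static void main(String[] args) {
        checkPointers(new long[] { 0, 5, 83 }, 84); // accepted
        try {
            checkPointers(new long[] { 80, 88 }, 84); // 88 is not below 84
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```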

In principle, you could write actual global pointers in sub-indices and use something like an IdentityDocumentalStrategy, but compression would be horrible.
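
One plausible way to see the compression cost is the textbook Elias-Fano estimate, the coding that quasi-succinct indices are based on: n increasing values below an upper bound u take roughly n * (2 + floor(log2(u / n))) bits. The sketch below only evaluates that estimate. The class EliasFanoCostSketch and the collection sizes are assumptions for illustration; the real writer has further special cases and overheads that this ignores.

```java
/** Rough Elias-Fano size estimate; an illustrative approximation, not MG4J code. */
class EliasFanoCostSketch {
    /** Approximate bits needed for n increasing values that are all below upperBound. */
    static long approxBits(long n, long upperBound) {
        if (n <= 0 || upperBound < n)
            throw new IllegalArgumentException("Need 0 < n <= upperBound");
        // Lower bits stored verbatim per value: floor(log2(upperBound / n)).
        int lowerBits = 63 - Long.numberOfLeadingZeros(upperBound / n);
        // Upper bits are coded in unary and cost about 2 bits per value.
        return n * (lowerBits + 2);
    }

    public static void main(String[] args) {
        long n = 1_000; // hypothetical: a term occurring in 1,000 documents of one batch
        System.out.println(approxBits(n, 100_000L));     // bound = batch size (local pointers)
        System.out.println(approxBits(n, 100_000_000L)); // bound = whole collection (global pointers)
    }
}
```

Under these made-up numbers the estimate grows from 8,000 to 18,000 bits for the same posting list, purely because the upper bound is the size of the whole collection rather than of the batch.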

If you have a look at Scan's usage of ContiguousDocumentalStrategy, things should be clearer.

And before you tell me, yes, the whole process is underdocumented. Actually, everything is documented carefully in the Javadocs, but there's nothing global guiding you through the process. :(

Ciao,

seba

Valentin Tablan

Dec 20, 2013, 1:23:46 PM

to mg...@googlegroups.com

Thanks, that helps! What does the Merged documental cluster do then; how
is it different from the Concatenated one? It sounds like what I need is
the concatenated one, but I just want to understand the difference.

thanks,
Valentin

Sebastiano Vigna

Dec 20, 2013, 1:27:27 PM

to mg...@googlegroups.com

On 20 Dec 2013, at 10:23 AM, Valentin Tablan <v.ta...@gmail.com> wrote:

> Thanks, that helps! What does the Merged documental cluster do then; how is it different from the Concatenated one? It sounds like what I need is the concatenated one, but I just want to understand the difference.

DocumentalMergedCluster does not expect global document pointers to appear in order, so it uses an indirect heap to keep track of the last document returned by each sub-index and picks the smallest one in turn. It is significantly slower, of course, but it can be useful in some situations. For instance, if you renumber documents, each batch will contain arbitrary global document pointers, and at that point a DocumentalMergedCluster is what you want.
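
The difference between the two traversal styles can be sketched on plain arrays of global pointers. This is a toy comparison, not how MG4J implements its clusters: the class ClusterIterationSketch is hypothetical, and a standard java.util.PriorityQueue stands in for the indirect heap mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy comparison of concatenated and merged traversal; not MG4J code. */
class ClusterIterationSketch {
    /** Concatenation: exhaust each list in turn. Sorted output requires pointers to increase across lists. */
    static List<Long> concatenate(long[][] lists) {
        List<Long> out = new ArrayList<>();
        for (long[] list : lists)
            for (long p : list) out.add(p);
        return out;
    }

    /** Merge: keep the current head of each list in a heap and always emit the smallest. */
    static List<Long> merge(long[][] lists) {
        // Heap entries are {list index, position}, ordered by the pointer they refer to.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Long.compare(lists[a[0]][a[1]], lists[b[0]][b[1]]));
        for (int i = 0; i < lists.length; i++)
            if (lists[i].length > 0) heap.add(new int[] { i, 0 });
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            out.add(lists[head[0]][head[1]]);
            // Advance the list we just consumed from, if it has more pointers.
            if (head[1] + 1 < lists[head[0]].length) heap.add(new int[] { head[0], head[1] + 1 });
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] renumbered = { { 0, 3, 7 }, { 2, 5 }, { 1, 9 } }; // interleaved global pointers
        System.out.println(concatenate(renumbered));
        System.out.println(merge(renumbered));
    }
}
```

On interleaved input like this, plain concatenation emits the pointers out of order, while the merge pays a heap operation per posting to restore the global order.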

Ciao,

seba
