Recommended way to write an index

Valentin Tablan

Dec 18, 2013, 2:38:40 PM
to mg...@googlegroups.com
Hi,

What is the most maintainable way to write an index from Java? Our
current implementation is a modified version of Scan, which is likely to
diverge from the official version. Is there an API that one is supposed
to use?

IndexWriter looks promising, though even that seems to require good
knowledge of the internals. Is there anything that offers sensible
defaults, and has constructors with fewer than 5 parameters? :)

Thanks,
Valentin


Sebastiano Vigna

Dec 18, 2013, 2:50:53 PM
to mg...@googlegroups.com

On 18 Dec 2013, at 11:38 AM, Valentin Tablan <v.ta...@gmail.com> wrote:

> What is the most maintainable way to write an index from Java? Our current implementation is a modified version of Scan, which is likely to diverge from the official version. Is there an API that one is supposed to use?
>
> IndexWriter looks promising, though even that seems to require good knowledge of the internals. Is there anything that offers sensible defaults, and has constructors with fewer than 5 parameters? :)


Never! :)

No, you have to use an IndexWriter. QuasiSuccinctIndexWriter has 6 arguments, and most have obvious defaults (or you can get them from IndexBuilder). I'll be happy to help.

Ciao,

seba

Valentin Tablan

Dec 19, 2013, 11:59:31 AM
to mg...@googlegroups.com
Hi Sebastiano,

I'm trying to write with the quasi-succinct index writer, and I'm getting the
following error while writing out the positions:

Caused by: java.lang.IllegalArgumentException: Too large prefix sum: 16926 >= 15425
    at it.unimi.di.big.mg4j.index.QuasiSuccinctIndexWriter$Accumulator.add(QuasiSuccinctIndexWriter.java:447)
    at it.unimi.di.big.mg4j.index.QuasiSuccinctIndexWriter.writeDocumentPositions(QuasiSuccinctIndexWriter.java:700)


I've tried to understand what the problem may be, but it's not clear to
me from just reading the code. I found some comments about the positions
accumulator being strict, which means that zeros are not stored? I'm not
sure what that means - zero positions are clearly valid, so it must be
something else...


In case it helps, this is how I create the index writer:

    QuasiSuccinctIndexWriter indexWriter = new QuasiSuccinctIndexWriter(
        IOFactory.FILESYSTEM_FACTORY,
        mg4jBasename,
        documentPointer,
        Fast.mostSignificantBit(QuasiSuccinctIndex.DEFAULT_QUANTUM),
        QuasiSuccinctIndexWriter.DEFAULT_CACHE_SIZE,
        CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX,
        ByteOrder.nativeOrder());



When indexing some very simple documents, the error does not occur - so
it's not something that's systematically wrong. My problem is that I
don't understand the semantics of the error message, so I don't know
where to look.


Thanks,
Valentin

Sebastiano Vigna

Dec 19, 2013, 12:59:06 PM
to mg...@googlegroups.com
On 19 Dec 2013, at 8:59 AM, Valentin Tablan <v.ta...@gmail.com> wrote:

> Hi Sebastiano,
>
> I'm trying to write with the quasi-succinct index writer, and I'm getting the
> following error while writing out the positions:
>
> Caused by: java.lang.IllegalArgumentException: Too large prefix sum: 16926 >= 15425
>     at it.unimi.di.big.mg4j.index.QuasiSuccinctIndexWriter$Accumulator.add(QuasiSuccinctIndexWriter.java:447)
>     at it.unimi.di.big.mg4j.index.QuasiSuccinctIndexWriter.writeDocumentPositions(QuasiSuccinctIndexWriter.java:700)

In general, it means that the upper bound you provided (in this case, the sumMaxPos argument to newInvertedList) is not actually an upper bound on the values you're feeding in. There must be some mistake in your computation of that parameter.

Unfortunately the Accumulator class is used by all three pieces of the index, so it is difficult to give a better message.

Ciao,

seba
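
To make the shape of that check concrete, here is a minimal toy sketch of an accumulator that is told its upper bound up front and rejects any value that pushes the running total to or past it. This is only an illustration of the failure mode; the class BoundedPrefixSumSketch and its methods are hypothetical, and this is not the code of QuasiSuccinctIndexWriter's Accumulator, which may transform the values before accumulating them.

```java
/** Toy illustration only: NOT MG4J's QuasiSuccinctIndexWriter.Accumulator. */
final class BoundedPrefixSumSketch {
    /** Upper bound declared in advance by the caller (hypothetical parameter). */
    private final long upperBound;
    /** Running total of all values accepted so far. */
    private long prefixSum;

    BoundedPrefixSumSketch(final long upperBound) {
        this.upperBound = upperBound;
    }

    /** Accepts a non-negative value unless the new running total would reach the bound. */
    void add(final long value) {
        if (value < 0) throw new IllegalArgumentException("Negative value: " + value);
        final long next = prefixSum + value;
        if (next >= upperBound)
            throw new IllegalArgumentException("Too large prefix sum: " + next + " >= " + upperBound);
        prefixSum = next;
    }

    /** Returns the running total accepted so far. */
    long prefixSum() {
        return prefixSum;
    }
}
```

With a declared bound of 10, adding 3 and then 4 is accepted, while a further 5 is rejected. So the exception says nothing about the data being malformed; it only says that the bound promised in advance was too small for what was then fed in.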

Valentin Tablan

Dec 19, 2013, 1:53:56 PM
to mg...@googlegroups.com
On 19/12/13 17:59, Sebastiano Vigna wrote:
> In general, it means that the upper bound you provided (in this case, the sumMaxPos argument to newInvertedList) is not actually an upper bound on the values you're feeding in. There must be some mistake in your computation of that parameter.

OK, this makes sense, as I had to guess what the value should be :).
What is this parameter supposed to contain: the sum of all maximum
positions in one posting list, or in the whole index?

Thanks,
Valentin

Sebastiano Vigna

Dec 19, 2013, 1:56:15 PM
to mg...@googlegroups.com
Why guess? :)

* @param sumMaxPos the sum of the maximum position in each document (unused if positions are not indexed).

I changed it to

* @param sumMaxPos the sum of the maximum position in each document of the inverted list (unused if positions are not indexed).

maybe it's clearer.

Ciao,

seba
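
Read literally, the Javadoc above asks for one number per inverted list: take the largest position of the term in each document of that list, and add those maxima up. A minimal sketch of that computation follows. The class SumMaxPosSketch, the method name, and the int[][] layout (one ascending array of positions per document) are assumptions made for illustration, not part of MG4J.

```java
/** Hypothetical helper: computes sumMaxPos for ONE inverted list, as the Javadoc describes it. */
final class SumMaxPosSketch {
    /**
     * @param positionsPerDocument for each document of the inverted list, the positions of
     *        the term in that document, sorted in ascending order (assumed layout).
     * @return the sum, over the documents of the list, of the maximum position.
     */
    static long sumMaxPos(final int[][] positionsPerDocument) {
        long sum = 0;
        for (final int[] positions : positionsPerDocument) {
            if (positions.length == 0)
                throw new IllegalArgumentException("A document in the list has no positions");
            // Positions are ascending, so the last one is the maximum for this document.
            sum += positions[positions.length - 1];
        }
        return sum;
    }
}
```

For a term occurring at positions {0, 3, 7}, {2} and {1, 5} in three documents, the maxima are 7, 2 and 5. The value has to be recomputed for every inverted list; a total taken over the whole index, or a value carried over from a previous list, would not match what is being fed in.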

Valentin Tablan

Dec 19, 2013, 1:57:00 PM
to mg...@googlegroups.com
On 19/12/13 18:53, Valentin Tablan wrote:
>
> OK, this makes sense, as I had to guess what the value should be :).
> What is this parameter supposed to contain: the sum of all maximum
> positions in one posting list, or in the whole index?
You can ignore my question - it clearly refers to one postings list.
I'll double check how that value is calculated.

Thanks,
Valentin

Valentin Tablan

Dec 19, 2013, 2:08:39 PM
to mg...@googlegroups.com
Your prompt guided me to the bug, which I have now squashed.

Thanks!

Valentin

Valentin Tablan

Dec 20, 2013, 5:10:18 AM
to mg...@googlegroups.com
Hi,

Another question, in the same thread. Is the .sizes file necessary? I
noticed that Scan creates it, but IndexWriter doesn't. If I use
IndexWriter to produce indexes, do I need to take care of creating it by
hand, or can I ignore it?

Thanks,
Valentin


Sebastiano Vigna

Dec 20, 2013, 11:28:32 AM
to mg...@googlegroups.com
On 20 Dec 2013, at 2:10 AM, Valentin Tablan <v.ta...@gmail.com> wrote:

> Another question, in the same thread. Is the .sizes file necessary? I
> noticed that Scan creates it, but IndexWriter doesn't. If I use
> IndexWriter to produce indexes, do I need to take care of creating it by
> hand, or can I ignore it?


Unless you need it (e.g., for computing BM25), it is not necessary. Just pass false for the document-sizes argument of Index.getInstance() when you load the index.

Ciao,

seba
