Lucene Postings-List Against MG4J Quasi-Succint Postings

Ravikumar Govindarajan

unread,

Jul 31, 2015, 3:49:54 AM7/31/15

to MG4J

Hi,

I ran a contrived test of temporarily generating artificial postings-list {Varying sizes from 10-100k docs} of about 100K terms from both Lucene {version 4.6.1} and MG4J Quasi-Succint index {Latest Version}…

Lucene uses ForUtil encoding in blocks of 128 docs. Any postings-list shorter than 128, it encodes it as Var-Int…

I needed to change a little-bit of lucene source to ensure fair comparison

1. Lucene writes frequencies in the same-file{DOC file} along with postings. But MG4J writes them in 2 separate files. So I indexed the fields as DOCS_ONLY in lucene. {No freq, only postings}

2. Same is the case with skip-lists. I also separated skip-lists for lucene to be written into a separate file, as MG4J does the same thing.

3. Made 2 classes of lucene & MG4J public for running the test-case {Lucene41PostingsWriter & QuasiSuccintIndexWriter.Accumulator}

The results were surprising that quasi-succint indexes were taking up a huge amount of storage in-comparison to lucene {100kb on lucene to 10Mb on MG4J}.

Even if I allow lucene to write skips in the same-file, the order of magnitude difference remains...

I am completely newbie to MG4J and hence very reluctant to trust the output of my test…

It would be great if someone in this group can confirm my test-case and whether I've used MG4J API correctly.

--

Ravi

TestMg4jPostings.java

Sebastiano Vigna

unread,

Jul 31, 2015, 5:40:03 AM7/31/15

to mg...@googlegroups.com

> On 31 Jul 2015, at 09:49, Ravikumar Govindarajan <ravikumar.g...@gmail.com> wrote:
>
> 2. Same is the case with skip-lists. I also separated skip-lists for lucene to be written into a separate file, as MG4J does the same thing.

No, MG4J has no separate skip list. If you are thinking of the offsets file, this is just the file that tells where each posting list starts in the file.

> The results were surprising that quasi-succint indexes were taking up a huge amount of storage in-comparison to lucene {100kb on lucene to 10Mb on MG4J}.

You can look at

https://github.com/lintool/IR-Reproducibility/blob/master/Gov2.md

for some replicable, principled results. Quasi-succinct indices are smaller than anything else, in particular if you consider positions, which are difficult to compress. The only way to beat them is to use a local Golomb coding for position--which is deadly slow.

> Even if I allow lucene to write skips in the same-file, the order of magnitude difference remains...
>
> I am completely newbie to MG4J and hence very reluctant to trust the output of my test…

Yes, I would definitely not trust it :).

> It would be great if someone in this group can confirm my test-case and whether I've used MG4J API correctly.

If you do not now exactly what you're doing, do not try to build quasi-succinct list manually. It's much easier as safer to create a simple, fake document collection where terms are, say, just numbers, and index it.

For instance, here:

for(int i=0;i<docLen;i++) {
int k;
do {
k = rand.nextInt(docLen);
} while(docIds.contains(k));
docIds.add(k);
}

You are generating all numbers from 0 to docLen. This is totally nonsensical. I guess you wanted to do rand.nextInt(maxDoc). In this way you're compressing at 1 bit per element in Lucene, and at log(maxDoc/docLen) bits per element in MG4J. But this is not the way a posting list is done. Even in a simple Bernoullian model, gaps are distributed geometrically.

Ciao,

seba

Ravikumar Govindarajan

unread,

Jul 31, 2015, 6:02:02 AM7/31/15

to MG4J, vi...@di.unimi.it

Yes I did correct that test-bug and was about to reply to the group but you were quick with the help !!!

Now, MG4J is showing a cool 8-10% improvement over lucene postings...

I wanted to directly measure postings file compression of lucene over MG4J. I did not intend to do a full-scale comparison of all index related features…

Thanks for the help and apologies for the false alarm

--

Ravi

Reply all

Reply to author

Forward