> On 31 Jul 2015, at 09:49, Ravikumar Govindarajan <
ravikumar.g...@gmail.com> wrote:
>
> 2. Same is the case with skip-lists. I also separated skip-lists for lucene to be written into a separate file, as MG4J does the same thing.
No, MG4J has no separate skip list. If you are thinking of the offsets file, this is just the file that tells where each posting list starts in the file.
> The results were surprising that quasi-succint indexes were taking up a huge amount of storage in-comparison to lucene {100kb on lucene to 10Mb on MG4J}.
You can look at
https://github.com/lintool/IR-Reproducibility/blob/master/Gov2.md
for some replicable, principled results. Quasi-succinct indices are smaller than anything else, in particular if you consider positions, which are difficult to compress. The only way to beat them is to use a local Golomb coding for position--which is deadly slow.
> Even if I allow lucene to write skips in the same-file, the order of magnitude difference remains...
>
> I am completely newbie to MG4J and hence very reluctant to trust the output of my test…
Yes, I would definitely not trust it :).
> It would be great if someone in this group can confirm my test-case and whether I've used MG4J API correctly.
If you do not now exactly what you're doing, do not try to build quasi-succinct list manually. It's much easier as safer to create a simple, fake document collection where terms are, say, just numbers, and index it.
For instance, here:
for(int i=0;i<docLen;i++) {
int k;
do {
k = rand.nextInt(docLen);
} while(docIds.contains(k));
docIds.add(k);
}
You are generating all numbers from 0 to docLen. This is totally nonsensical. I guess you wanted to do rand.nextInt(maxDoc). In this way you're compressing at 1 bit per element in Lucene, and at log(maxDoc/docLen) bits per element in MG4J. But this is not the way a posting list is done. Even in a simple Bernoullian model, gaps are distributed geometrically.
Ciao,
seba