EOFException while trying to combine indices


Daniel Bahrdt

Oct 6, 2016, 12:20:55 PM
to mg...@googlegroups.com
Hello,
I'm trying to create an index using my own DocumentCollection,
DocumentSequence etc. For my purpose a document already has processed
tokens. Consequently I'm using the NullTermProcessor together with my
own simple WordReader. I'm using the IndexBuilder to create the index.
The indexing seems to work fine but the combination step always crashes.
I tried my best to find the problem but could not find it.
Here's the stack trace of an execution:


18:09:11.848 [main] INFO it.unimi.di.big.mg4j.tool.Combine - Sizes combined.
18:09:11.849 [main] INFO it.unimi.di.big.mg4j.tool.Combine - Combining lists...
java.io.EOFException
at it.unimi.dsi.io.InputBitStream.read(InputBitStream.java:435)
at it.unimi.dsi.io.InputBitStream.readUnary(InputBitStream.java:877)
at it.unimi.dsi.io.InputBitStream.readLongGamma(InputBitStream.java:943)
at it.unimi.di.big.mg4j.tool.Combine.run(Combine.java:615)
at it.unimi.di.big.mg4j.tool.IndexBuilder.run(IndexBuilder.java:534)
at de.fmi.ocse.MG4JSearchBase$MyIndexBuilder.buildWithIdxBuilder(MG4JSearchBase.java:239)
at de.fmi.ocse.MG4JSearchBase$MyIndexBuilder.build(MG4JSearchBase.java:338)
at de.fmi.ocse.MG4JSearchBase.index(MG4JSearchBase.java:355)
at de.fmi.ocse.MG4JSearch.index(MG4JSearch.java:101)
at de.fmi.ocse.Worker.index(Worker.java:167)
at de.fmi.ocse.Main.main(Main.java:52)

It always crashes with the same error. I've attached the source-code of
the base class which handles all the interfacing with mg4j. The
buildWithIdxBuilder() method has the options for the index builder. I
hope the comments are enough to roughly understand what is happening.

I'd be grateful if anyone had any idea what the problem could be.

Regards,
Daniel Bahrdt
MG4JSearchBase.java

Sebastiano Vigna

Oct 7, 2016, 6:58:35 AM
to mg...@googlegroups.com

> On 6 Oct 2016, at 18:20, Daniel Bahrdt <daniel.m...@funroll-loops.de> wrote:
>
> Hello,
> I'm trying to create an index using my own DocumentCollection,
> DocumentSequence etc. For my purpose a document already has processed
> tokens. Consequently I'm using the NullTermProcessor together with my
> own simple WordReader. I'm using the IndexBuilder to create the index.
> The indexing seems to work fine but the combination step always crashes.
> I tried my best to find the problem but could not find it.

So, the best would be a small setup in which I can replicate the problem. From what I can see, for some strange reason Combine does not find all the occurrences it expects. Can you try keeping the batches and listing the lengths of the *.occurrencies files?
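
For reference, listing those lengths could look like the following minimal Java sketch. It assumes the batch files were kept in a single directory and that they end in ".occurrencies" as written above; the class and method names are hypothetical and not part of MG4J.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of MG4J: reports the size in bytes of
// every *.occurrencies file found directly inside one directory.
final class OccurrenciesSizes {

    /** Maps each matching file name to its length in bytes, sorted by name. */
    static Map<String, Long> sizes(Path batchDir) throws IOException {
        Map<String, Long> result = new TreeMap<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(batchDir, "*.occurrencies")) {
            for (Path p : stream) {
                result.put(p.getFileName().toString(), Files.size(p));
            }
        }
        return result;
    }
}
```

Printing each entry of the returned map gives one name and length per batch, which makes a truncated or empty file easy to spot.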

A quick fix would be commenting out line 615 of Combine.java and recompiling. If everything then goes fine, we know that the problem is just there.

Ciao,

seba

Daniel Bahrdt

Oct 10, 2016, 6:52:38 PM
to mg...@googlegroups.com
Hello,
thank you for your quick answer.
While trying to come up with a small example I think I've discovered the
problem. I'm not 100% sure, but at least my smaller data sets seem to
pass. I have not implemented the query part yet, so something may still
be broken. The problem seems to be line feed and carriage return
control characters in words. My WordReader does not filter them. Is it
expected behavior for a WordReader to exclude these characters from
words? I did not find this in the documentation. If it is, then the
same should apply to the other Unicode line terminators, though I
would prefer that words were interpreted as arrays of Unicode code points.
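
A minimal sketch of the kind of filtering described above, to be applied to each token before a WordReader hands it on. The class and method names are hypothetical, and whether MG4J actually requires terms to be free of these characters is an assumption based only on the behavior observed here, not something taken from the documentation.

```java
// Hypothetical helper, not part of MG4J or dsiutils: removes line
// terminators from a token, working on code points rather than chars.
final class LineTerminatorFilter {

    /** True for LF, VT, FF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029). */
    static boolean isLineTerminator(int codePoint) {
        switch (codePoint) {
            case 0x000A: case 0x000B: case 0x000C: case 0x000D:
            case 0x0085: case 0x2028: case 0x2029:
                return true;
            default:
                return false;
        }
    }

    /** Returns the term with every line terminator dropped. */
    static String stripLineTerminators(CharSequence term) {
        StringBuilder sb = new StringBuilder(term.length());
        term.codePoints()
            .filter(cp -> !isLineTerminator(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Dropping the characters is only one option; splitting the token at each terminator would be another, depending on what the tokens mean in the collection.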

If you're still interested in a test case I'd be happy to provide one.

Regards,
Daniel Bahrdt