question regarding parsing the corpus in java


tronder

Jun 13, 2012, 5:20:36 AM
to trec...@googlegroups.com
Hi,

I have a question about parsing the "thrifted" version of the corpus, specifically the tiny one included in the toy system. As our infrastructure is all in Java, we prefer to do everything in this language.
Parsing the decompressed file gives the following error:

org.apache.thrift.transport.TTransportException: FileTransport error: bad event size
at org.apache.thrift.transport.TFileTransport.readEvent(TFileTransport.java:327)
at org.apache.thrift.transport.TFileTransport.read(TFileTransport.java:468)
at org.apache.thrift.transport.TFileTransport.readAll(TFileTransport.java:439)
at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:372)
at kba.thrift.ContentItem$ContentItemStandardScheme.read(ContentItem.java:591)            

And we parse the file with the following code:

    TTransport transport = new TFileTransport(new TStandardFile(file), true);
    transport.open();
    TProtocol protocol = new TBinaryProtocol(transport);
    while (true) {
        StreamItem doc = new StreamItem();
        doc.read(protocol);
        p.process(doc);
    }

Is anybody else experiencing the same issue? I can't seem to find working Java parser code for the corpus.

Thanks for any tips in advance.

Regards,
Naimdjon


S.C.C.

Jun 13, 2012, 1:51:17 PM
to TREC-KBA
I haven't had much luck with Thrift either. But the organizer has been
quite helpful.
Kinda wish the cleansed + ner data was in JSON format though... :)

John R. Frank

Jun 14, 2012, 9:00:31 AM
to trec...@googlegroups.com
> TTransport transport = new TFileTransport(new TStandardFile(file),true);
> transport.open();
> TProtocol protocol = new TBinaryProtocol(transport);
> while (true) {
>     StreamItem doc = new StreamItem();
>     doc.read(protocol);
>     p.process(doc);
> }


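The issue here is the choice of transport, plus buffering. TFileTransport
expects Thrift's own event-framed file format, whereas the corpus files are
StreamItem structs serialized back to back. Reading them through a buffered
TIOStreamTransport works: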

static public void parse(String filename) throws Exception
{
    FileInputStream fileInputStream = new FileInputStream(filename);
    BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream);
    TTransport transport = new TIOStreamTransport(bufferedInputStream);
    transport.open();
    TProtocol protocol = new TBinaryProtocol(transport);
    try {
        while (true) {
            StreamItem doc = new StreamItem();
            try {
                doc.read(protocol);
            } catch (TTransportException e) {
                if (e.getType() == TTransportException.END_OF_FILE) {
                    break;  // reached the end of the corpus file
                }
                throw e;  // any other transport error is a real failure
            }
            System.out.println("stream_id: " + doc.stream_id);
        }
    } finally {
        transport.close();  // also closes the underlying streams
    }
}
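
For anyone who wants to try the read-until-end-of-file pattern without
installing Thrift first, here is a minimal sketch that uses only java.io.
It is an analogy, not the Thrift wire format: the class name
RecordStreamSketch, the helpers writeRecords and readAll, and the record
layout (one string per record, written with writeUTF) are all made up for
illustration. The point is the same as in parse() above: records sit back
to back in the file, and hitting end-of-file is the normal way the loop ends.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class RecordStreamSketch {

    // Hypothetical record format, for illustration only: each record is a
    // single string written with writeUTF. This is NOT the Thrift format.
    public static byte[] writeRecords(String... ids) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String id : ids) {
                out.writeUTF(id);
            }
        }
        return bytes.toByteArray();
    }

    // Reads records one after another from a buffered stream until the
    // stream is exhausted, then returns everything that was read.
    public static List<String> readAll(InputStream raw) throws IOException {
        List<String> ids = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(raw))) {
            while (true) {
                try {
                    ids.add(in.readUTF());
                } catch (EOFException e) {
                    // End of input ends the loop. Note that this simple sketch
                    // cannot tell a clean end from a record cut off midway.
                    break;
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writeRecords("a", "b", "c");
        System.out.println(readAll(new ByteArrayInputStream(data)));  // prints [a, b, c]
    }
}
```

Any IOException other than EOFException propagates to the caller, which
mirrors rethrowing transport errors that are not END_OF_FILE.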


If anyone would like a complete maven project that pulls down all the
dependencies, please contact me off list.


Related point of clarification: there are TWO kinds of things called
'thrift'.

1) The thrift compiler takes a text file containing thrift struct
definitions and creates a set of custom client classes for that specific
set of structs.

For example, the KBA structs are defined in this text file:
http://trec-kba.org/schemas/v1.0/kba.thrift

and you can construct the corresponding client classes for java or python
by running:

thrift -r --gen py kba.thrift
thrift -r --gen java kba.thrift


2) The other thing called 'thrift' is the framework library available in
each language. You need to import generic thrift components from this
library in order to use the custom client classes that were generated by
the thrift compiler. For example:

In python, you import framework components from the 'thrift' module, which
is different from the 'thrift' compiler on the command line:

# import thrift framework components
from thrift import Thrift
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

# import the class generated by the thrift compiler from kba.thrift
from kba_thrift.ttypes import StreamItem



In Java it is a bit more verbose, and clearer that org.apache.thrift is
not the command-line 'thrift' compiler.

// import thrift framework components
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TIOStreamTransport;
import org.apache.thrift.transport.TTransportException;

// you probably also want:
import java.io.FileInputStream;
import java.io.BufferedInputStream;

// import the class generated by the thrift compiler from kba.thrift
import kba.StreamItem;




If you are wondering how to get thrift, the most common way is to
download it from http://thrift.apache.org and follow these steps:

http://wiki.apache.org/thrift/ThriftInstallation

(Note that ./bootstrap.sh is not present in the distribution, even though
it is listed on that web page.)


If you are only using python, a shortcut is to do this:

wget http://pypi.python.org/packages/source/t/thrift/thrift-0.8.0.tar.gz
tar xzf thrift-0.8.0.tar.gz
cd thrift-0.8.0
python setup.py build
...
cp -r build/lib.linux-i686-2.6/thrift/ ../your-working-directory/thrift

This constructs the 'thrift' python module, which you can treat as a
locally importable package instead of installing it system-wide, which is
useful if you do not have root on your system. (The name of the build/lib.*
directory depends on your platform and python version.)


Don't hesitate to reach out --- we'll help you get up and running.


jrf

Vasundhara Ranga

Apr 26, 2013, 8:59:35 AM
to trec...@googlegroups.com
Hi,
 
I want to transfer files (images, documents, PDFs, etc.) over the network from a source to a destination. Please provide the .thrift file, source, and client information.
 
Please also let me know the maximum file size that can be transferred from client to server and vice versa.
 
Thanks,
Vasundhara

Vasundhara Ranga

Apr 26, 2013, 9:01:18 AM
to trec...@googlegroups.com
Hi,
 
I want to transfer files (images, documents, PDFs, etc.) over the network from a source to a destination. Please provide the .thrift file, source, and handler class in Java.
Please also let me know the maximum file size that can be transferred from client to server and vice versa.
 
Thanks,
Vasundhara