Extract text content out of .sc files

Cristina Garbacea

May 9, 2015, 1:53:47 PM

to stream...@googlegroups.com

Hi,

I am trying to extract the text content from TREC KBA *.sc files. Running the ReadThrift.java class (https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java) throws the following error:

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:251)
    at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:215)
    at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1496)
    at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1489)
    at streamcorpus.StreamItem.read(StreamItem.java:1329)
    at test.ReadThrift.main(ReadThrift.java:28)

I know my question duplicates this post: https://groups.google.com/forum/#!topic/streamcorpus/u8oNK3CqiCs. However, I couldn't find a solution there, and the script on GitHub doesn't seem to have been updated. Is there a workaround for extracting the text content from these files, or perhaps an equivalent class in Python?

Many thanks,

Cristina

John R. Frank

May 9, 2015, 3:22:01 PM

to Cristina Garbacea, stream...@googlegroups.com

Hi Cristina,

There is extensive tooling in Python. See http://streamcorpus.org/ and in particular the Chunk class, which handles decompression automatically.

The Java examples also work. I'm not aware of any bugs in them. You have to handle decompression yourself, either by adding an lzma reader to the Java code or by decompressing the files with the xz command-line tools beforehand.
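For the "decompress separately" route, here is a minimal sketch in Python that uses only the standard-library lzma module. It assumes the input is a plain xz-compressed chunk; the function name decompress_xz and the file names in the usage note are made up for illustration and are not part of the streamcorpus tooling.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst):
    """Decompress the xz file at `src` into `dst`.

    Returns the number of decompressed bytes written.
    Hypothetical helper, not part of the streamcorpus package.
    """
    src, dst = Path(src), Path(dst)
    # lzma.open detects the .xz container format automatically when reading.
    with lzma.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Stream in blocks so large chunk files never sit fully in memory.
        shutil.copyfileobj(compressed, plain)
    return dst.stat().st_size
```

Calling decompress_xz("example.sc.xz", "example.sc") (hypothetical file names) should leave an uncompressed file that a Thrift reader such as ReadThrift.java can be pointed at. Whether this resolves the TTransportException above depends on the files actually being xz-compressed, which the thread does not confirm.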

Happy to answer more questions.

John
