Extract text content out of .sc files

Cristina Garbacea

May 9, 2015, 1:53:47 PM
to stream...@googlegroups.com

I am trying to extract the text content from TREC KBA *.sc files. Running the class ReadThrift.java (https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java) throws an error with the following stack trace:

at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:251)
at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:215)
at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1496)
at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1489)
at streamcorpus.StreamItem.read(StreamItem.java:1329)
at test.ReadThrift.main(ReadThrift.java:28)

I know my question is a duplicate of this post https://groups.google.com/forum/#!topic/streamcorpus/u8oNK3CqiCs but I couldn't find any solution to this problem and the script on git doesn't seem to have been updated. Is there any workaround to extract the text content out of these files, or maybe an equivalent class in Python?

Many thanks,

John R. Frank

May 9, 2015, 3:22:01 PM
to Cristina Garbacea, stream...@googlegroups.com
Hi Cristina,

There is extensive tooling in Python. See http://streamcorpus.org/ and in particular the Chunk class, which handles decompression automatically.

The Java examples also work; I'm not aware of any bugs in them. You have to handle decompression yourself, either by adding an LZMA reader to the Java code or by decompressing the files separately with the xz command-line tools.
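
As a rough illustration of the "decompress separately" route, here is a minimal Python sketch that uses only the standard-library lzma module. It assumes the files are xz-compressed; the helper names is_xz and decompress_xz are made up for this example and are not part of the streamcorpus package. It only produces the raw decompressed bytes and does not parse the Thrift records itself.

```python
import lzma
import shutil

# Every .xz stream begins with these six bytes (FD 37 7A 58 5A 00).
XZ_MAGIC = b"\xfd7zXZ\x00"


def is_xz(path):
    """Return True if the file at `path` starts with the xz magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(XZ_MAGIC)) == XZ_MAGIC


def decompress_xz(src_path, dst_path):
    """Stream-decompress an xz file to `dst_path` without loading it all into memory."""
    with lzma.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Calling is_xz on a file first is a quick way to check whether it is still compressed. If it is, decompress_xz(compressed_path, plain_path) writes an uncompressed copy, which is roughly what `xz -dk` does on the command line, and that copy is what a plain Thrift reader would expect as input.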

Happy to answer more questions.
