Extract text content out of .sc files


Cristina Garbacea

May 9, 2015, 1:53:47 PM
to stream...@googlegroups.com
Hi,

I am trying to extract the text content from TREC KBA *.sc files. Running the class ReadThrift.java (https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java) throws the following error:

org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:251)
at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:215)
at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1496)
at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1489)
at streamcorpus.StreamItem.read(StreamItem.java:1329)
at test.ReadThrift.main(ReadThrift.java:28)

I know my question duplicates this post: https://groups.google.com/forum/#!topic/streamcorpus/u8oNK3CqiCs. However, I couldn't find a solution there, and the script on GitHub doesn't seem to have been updated. Is there a workaround to extract the text content from these files, or perhaps an equivalent class in Python?

Many thanks,
Cristina

John R. Frank

May 9, 2015, 3:22:01 PM
to Cristina Garbacea, stream...@googlegroups.com
Hi Cristina,

There is extensive tooling in Python. See http://streamcorpus.org/ and in particular the Chunk class, which handles decompression automatically.

The Java examples also work; I'm not aware of any bugs in them. You do have to handle decompression yourself, either by adding an LZMA reader to the Java code or by running the xz command-line tools separately first.
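
As a rough illustration of the "decompress separately" route, here is a minimal Python sketch using only the standard library. It assumes the file is xz-compressed, which it checks by looking at the first six bytes, and it only produces a raw copy for a Thrift reader such as ReadThrift.java to consume; it does not parse the Thrift data itself. The function names is_xz and decompress_chunk are made up for this example and are not part of the streamcorpus tooling.

```python
import lzma
import shutil

# First six bytes of every .xz stream (per the xz file format).
XZ_MAGIC = b"\xfd7zXZ\x00"


def is_xz(path):
    """Return True if the file at `path` starts with the xz magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(XZ_MAGIC)) == XZ_MAGIC


def decompress_chunk(src_path, dst_path):
    """Write a raw (uncompressed) copy of src_path to dst_path.

    Hypothetical helper: xz-compressed input is decompressed;
    any other input is copied through unchanged.
    """
    opener = lzma.open if is_xz(src_path) else open
    with opener(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Calling decompress_chunk("input.sc.xz", "input.sc") (both file names are placeholders) would leave a plain file that can then be passed to the Java reader. If is_xz returns False for a file that still fails to parse, the problem is probably somewhere other than compression.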

Happy to answer more questions.

John