> I installed streamcorpus using pip. I am trying to get data out of the
> .sc files. Using the streamcorpus_dump tool by running the following
> command -"streamcorpus_dump
> input1.sc --component clean_visible
> >input.txt"
Try `streamcorpus_dump input.sc.xz --smart-dump` to get a pretty-printed
form of the StreamItems. That tool is meant to help developers see what is
in a particular data set.
> Or an alternative way to get data out of the .sc files
If you want to process the text in StreamItems, then you should write code
that reads the Thrift messages. See the examples in Java, Scala, C++, and
Python. Python has the most developed tooling. You can say this in
Python:
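    from streamcorpus import Chunk

    # Iterate over every StreamItem stored in the chunk file
    for si in Chunk(path_to_chunk_file):
        print(si.stream_id)
        # body or clean_visible can be unset on some items
        text = getattr(si.body, 'clean_visible', None)
        if text:
            print(text)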
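
If what you are after is a plain text file like the input.txt in your command, something along these lines might work. Treat it as a minimal sketch: the helper names dump_clean_visible and dump_chunk_to_file are made up for illustration, and it assumes clean_visible may come back as UTF-8 bytes and may be missing on some items.

```python
import io


def dump_clean_visible(stream_items, out):
    """Write each item's clean_visible text to `out`, one item per line block.

    Returns the number of items that had text. (Hypothetical helper.)
    """
    written = 0
    for si in stream_items:
        # body may be None, and clean_visible may be unset
        text = getattr(si.body, 'clean_visible', None)
        if not text:
            continue
        # Assumption: the text may arrive as UTF-8 encoded bytes
        if isinstance(text, bytes):
            text = text.decode('utf-8', 'replace')
        out.write(text)
        out.write(u'\n')
        written += 1
    return written


def dump_chunk_to_file(chunk_path, out_path):
    """Read one chunk file and write its clean_visible text to out_path."""
    # Imported here so dump_clean_visible can be tried without streamcorpus
    from streamcorpus import Chunk
    with io.open(out_path, 'w', encoding='utf-8') as out:
        return dump_clean_visible(Chunk(chunk_path), out)
```

Calling dump_chunk_to_file('input1.sc', 'input.txt') would then give roughly what the redirect in your command was aiming for.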
jrf