streamcorpus_dump tool to get data out of the .sc files

Kanika Parashar

Jun 27, 2015, 5:51:32 PM
to stream...@googlegroups.com
I installed streamcorpus using pip and am trying to get data out of the .sc files.
Running the streamcorpus_dump tool with the following command
"streamcorpus_dump input1.sc --component clean_visible >input.txt"
gives an empty file. Can anybody suggest how to use this command,
or an alternative way to get data out of the .sc files?
Thanks

John R. Frank

Jun 28, 2015, 12:16:17 PM
to Kanika Parashar, stream...@googlegroups.com

> I installed streamcorpus using pip. I am trying to get data out of the
> .sc files.  Using the streamcorpus_dump tool by running the following
> command -"streamcorpus_dump input1.sc --component clean_visible
> >input.txt"

Try `streamcorpus_dump input.sc.xz --smart-dump` to get a pretty-printed
form of the StreamItems. It is a tool for helping developers see what is in
a particular data set.


> Or an alternative way to get data out of the .sc files

If you want to process the text in StreamItems, then you should write code
that reads the Thrift messages. See the examples in Java, Scala, C++, and
Python. Python has the most developed tooling. For example, in Python:

    from streamcorpus import Chunk
    for si in Chunk(path_to_chunk_file):
        print(si.stream_id)
        if hasattr(si.body, 'clean_visible'):
            print(si.body.clean_visible)
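
To get the same text into a file rather than onto the terminal, the loop can be wrapped in a small helper. This is only a sketch: the function names `dump_clean_visible` and `dump_chunk_file` are made up for illustration and are not part of streamcorpus, and the output layout (stream_id, then the text, then a blank line) is an arbitrary choice. It assumes that items whose body or clean_visible is unset should simply be skipped; if every item is skipped, that may be why a dump of clean_visible comes out empty, though that is a guess.

```python
import io


def dump_clean_visible(stream_items, out):
    """Write stream_id and clean_visible for each item; return how many were written.

    Hypothetical helper, not part of streamcorpus.
    """
    written = 0
    for si in stream_items:
        body = getattr(si, "body", None)
        text = getattr(body, "clean_visible", None)
        if not text:
            # Body is missing or clean_visible was never populated: skip the item.
            continue
        if isinstance(text, bytes):
            # Assumption: the text is UTF-8; undecodable bytes are replaced.
            text = text.decode("utf-8", errors="replace")
        out.write("%s\n%s\n\n" % (si.stream_id, text))
        written += 1
    return written


def dump_chunk_file(path_to_chunk_file, out_path):
    """Read one chunk file and write its clean_visible text to out_path."""
    # Imported here so the helper above can be used without streamcorpus installed.
    from streamcorpus import Chunk

    with io.open(out_path, "w", encoding="utf-8") as out:
        return dump_clean_visible(Chunk(path_to_chunk_file), out)
```

Calling `dump_chunk_file("input1.sc", "input.txt")` returns the number of items that had text, so a return value of 0 means the chunk was read but no item carried clean_visible.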


jrf