StreamItem.body.getClean_visible


wim.g...@gmail.com

Dec 4, 2013, 8:53:30 AM
to stream...@googlegroups.com
Hi, John

I used Java to obtain the body content of a StreamItem with the following code:
String contentString = StreamItem.body.getClean_visible();
However, the body content returned by this method is always incomplete. I can only get the full text through the approach shown in the example script "dump-sentences.py", which seems a little complicated. Is there another, perhaps easier, way to accomplish the same thing?

Thank you very much for your kind consideration!

Best,
wim


The code in "dump-sentences.py" is shown below.

#!/usr/bin/python
import streamcorpus
import sys


## iterate over StreamItem messages in a flat file
for si in streamcorpus.Chunk(path=sys.argv[1]):
    ## iterate over the sentences map for each tagger, using 'lingpipe' segmentation
    for sentence_index in range(len(si.body.sentences["lingpipe"])):
        # unique document id
        document_id = si.stream_id
        # seconds from 1970 (UTC)
        document_time = "%d" % (si.stream_time.epoch_ticks)
        # sentence index
        sentence_index_string = "%d" % (sentence_index)
        # sentence tokens
        sentence_tokens = si.body.sentences["lingpipe"][sentence_index].tokens
        # concatenate token strings into a sentence (leaves a trailing space)
        sentence = ""
        for token in sentence_tokens:
            sentence = "%s%s " % (sentence, token.token)
        # one tab-separated line per sentence
        print("\t".join([document_id, document_time, sentence_index_string, sentence]))
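
The token-joining step in the script above can be pulled into a small helper. What follows is only a minimal sketch, assuming objects shaped like the ones the script reads: a sentences map keyed by tagger name, sentences with a .tokens list, and tokens with a .token string. Token, Sentence, sentence_text and document_text are hypothetical names introduced here for illustration, not part of streamcorpus. The result is an approximation of the document text, since the original whitespace between tokens is not recovered.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic only the attributes the script reads;
# they are NOT the real streamcorpus classes.
Token = namedtuple("Token", ["token"])
Sentence = namedtuple("Sentence", ["tokens"])


def sentence_text(sentence):
    """Join the token strings of one sentence with single spaces."""
    return " ".join(tok.token for tok in sentence.tokens)


def document_text(sentences_by_tagger, tagger="lingpipe"):
    """Rebuild an approximate document text, one sentence per line.

    Returns an empty string if the tagger produced no sentences.
    """
    sentences = sentences_by_tagger.get(tagger, [])
    return "\n".join(sentence_text(s) for s in sentences)


if __name__ == "__main__":
    demo = {
        "lingpipe": [
            Sentence([Token("Hello"), Token("world"), Token(".")]),
            Sentence([Token("Second"), Token("sentence"), Token(".")]),
        ]
    }
    print(document_text(demo))
```

With a real StreamItem, the call would presumably be document_text(si.body.sentences). Unlike the loop in the script, the join here leaves no trailing space after the last token.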

John R Frank

Dec 4, 2013, 9:21:44 AM
to wim.g...@gmail.com, stream...@googlegroups.com

The getClean_visible method is specific to the Java implementation of Thrift. Have you tried accessing StreamItem.body.clean_visible directly?

Can you try writing a test for this in Java?

The test can create a StreamItem with a long clean_visible property, serialize it to a file, read it back in, and then check the length of the string returned by this method.

If that test fails, then we should ask the Thrift users list at Apache.

wim.g...@gmail.com

Dec 5, 2013, 4:05:26 AM
to stream...@googlegroups.com
Sure, I am happy to do that, but I am not quite clear on what you mean. I just used "StreamItem.body.getClean_visible( )" to get the body content and found that it does not return the full text. You mentioned creating a StreamItem with a long clean_visible property, and I don't really understand that part. Would you mind explaining it in more detail?

wim.g...@gmail.com

Dec 5, 2013, 4:05:36 AM
to stream...@googlegroups.com

ashwin

Dec 6, 2013, 11:00:00 PM
to stream...@googlegroups.com
StreamItem.body.clean_visible.
body is a ContentItem with a clean_visible field.