Which fields are defined under which circumstances?


Laura Dietz

Aug 5, 2013, 2:20:38 PM
to stream...@googlegroups.com
Hi Everyone,

I saw some discussion on when fields such as clean_html and clean_visible are filled in. Now I have noticed a few other idiosyncrasies. Can someone tell me whether this is expected behavior or a bug in the Thrift protocol? (I used the Java bindings.)

Can someone tell me whether I have found all the issues, or are there more that I am not even aware of?

  1. item.getSource_metadata() sometimes contains the sole key "kba-2012" (e.g. 2012-06-08-08), and sometimes is empty (e.g. 2012-07-10-08)
  2. Reading from the stream always ends with an org.apache.thrift.transport.TTransportException. Does this signal the end of the stream?
  3. item.getBody().getRaw() is always null
  4. item.getBody().getTaggings() sometimes contains "lingpipe", sometimes "stanford". Even if a key is present, item.getBody().getTaggings().get(key).getRaw_tagging() may return null.
  5. From batch 2012-06-08-08 onwards, for instance, I can't find a non-null raw_tagging. Was no NER data processed for recent documents, or is this a bug in the Java bindings? Is getRaw_tagging() not the right way to access NER data?
  6. "other_content" sometimes contains a key called "anchor"; can anyone explain what kind of anchor information is stored in this field?


Does anyone know at which time points these formats change?

Thanks a lot!

Laura


John R. Frank

Aug 5, 2013, 4:07:54 PM
to stream...@googlegroups.com
Hi Laura,


> Does anyone know at which time points these formats change?

See the PDF graph of the source data plotted as a function of time in the
recently released data update. That will probably help you ask more
questions :-)




> I saw some discussion on when fields such as clean_html and
> clean_visible are filled in. Now I noticed a few other idiosyncrasies.
> Can someone tell me whether this is expected, or a bug in the thrift
> protocol (I used the java bindings).

Did you use this java example?

https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java



> 1. item.getSource_metadata() sometimes contains the sole key "kba-2012"
> (e.g. 2012-06-08-08) , sometimes is empty (e.g. 2012-07-10-08)

source_metadata is a map, and the key "kba-2012" is populated for some of
the source types from the 2012 KBA corpus.

I don't think there are any other keys in source_metadata.



> 2. Reading from the stream always throws a
> org.apache.thrift.transport.TTransportException. Does this flag the end
> of the stream?

See the example linked above.



> 3. item.getBody().getRaw() is always null

It shouldn't be null --- at least not in the primary corpus:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/index.html



> 4. item.getBody().getTaggings() sometimes contains "lingpipe" sometimes
> "stanford". Even if a key is present,
> item.getBody().getTaggings().get(key).getRaw_tagging() may return null.

StreamItem.body.taggings['stanford'].raw_tagging is the only raw_tagging
that is populated in the corpus.


> 5. For instance from batch 2012-06-08-08 onwards I can't find non-null
> raw_tagging. Was no NER data processed for recent documents or is this a
> bug in the java bindings? Is getRaw_tagging() not the right way to
> access NER data?

You probably want to look in:

StreamItem.body.sentences['lingpipe']

which is an array of Sentence objects, so you can do something like this:

for sentence in StreamItem.body.sentences['lingpipe']:
    for tok in sentence.tokens:
        print tok.token, tok.entity_type, tok.entity_id, tok.mention_id


We transformed the lingpipe data (with coref chains) into this fully
thrifted structure. We didn't transform the old stanford data (which was
run without dcoref), because the byte offsets changed.


> 6. "other_content" sometimes contains a key called "anchor", any
> explanation what kind of anchor information is stored in this field?

For source=social, it contains 'anchor' and 'title', which are ContentItem
instances containing the title text and the string contained within an
HTML anchor tag that pointed at the page -- these were gathered by the
upstream aggregator.




jrf

Laura Dietz

Aug 5, 2013, 4:58:52 PM
to stream...@googlegroups.com
Hi John,

Thanks for the clarifications.

On 08/05/2013 04:07 PM, John R. Frank wrote:
> Hi Laura,
>
>
>> Does anyone know at which time points these formats change?
>
> See the PDF graph of the source data plotted as a function of time in
> the recently released data update. That will probably help you ask
> more questions :-)

Can you point me to that PDF? I can't find it on GitHub or on the
kba-stream-corpus-2013 website.


>> 3. item.getBody().getRaw() is always null
>
> It shouldn't be null --- at least not in the primary corpus:
>
> http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/index.html
>
>
OK, I used the cleansed version. I found a comment saying that this is expected.


>
>
>
>> 5. For instance from batch 2012-06-08-08 onwards I can't find
>> non-null raw_tagging. Was no NER data processed for recent documents
>> or is this a bug in the java bindings? Is getRaw_tagging() not the
>> right way to access NER data?
>
I suppose that Sentences/Tokens are the intended way to access NER
information, and that the taggings provided by Stanford CoreNLP are
mostly for backwards compatibility?


Cheers,
Laura


John R. Frank

Aug 5, 2013, 5:01:47 PM
to stream...@googlegroups.com
>> See the PDF graph of the source data plotted as a function of time in
>> the recently released data update. That will probably help you ask
>> more questions :-)
>
> Can you point me to that PDF? I can't find it on github or
> kba-stream-corpus-2013 website.

https://groups.google.com/d/msg/trec-kba/nfBHzBa04y8/JHkgllQZ8igJ



>>> 5. For instance from batch 2012-06-08-08 onwards I can't find
>>> non-null raw_tagging. Was no NER data processed for recent documents
>>> or is this a bug in the java bindings? Is getRaw_tagging() not the
>>> right way to access NER data?
>>
> I suppose that the Sentences/Token is the intended way to access NER
> information, and the taggings provided by the stanford CoreNLP are
> mostly for backwards compatibility?

Yes, exactly.



jrf

Laura Dietz

Aug 5, 2013, 5:22:47 PM
to stream...@googlegroups.com

I noticed that when I iterate over lingpipe's Sentences and Tokens, all
token.getLemma() and token.getPos() return null.

Is this intended?

(I sampled from the beginning and end of the stream).

Cheers,
Laura

John R. Frank

Aug 5, 2013, 6:35:40 PM
to stream...@googlegroups.com
> I noticed that when I iterate over lingpipe's Sentences and Tokens, all
> token.getLemma() and token.getPos() return null.
>
> Is this intended?

Yes. The way we used LingPipe only produced entity recognition and coref
chains.


John

wim.g...@gmail.com

Aug 5, 2013, 11:28:18 PM
to stream...@googlegroups.com
Hi John,

> I saw some discussion on when fields such as clean_html and 
> clean_visible are filled in. Now I noticed a few other idiosyncrasies.
> Can someone tell me whether this is expected, or a bug in the thrift
> protocol (I used the java bindings).

> Did you use this java example?
>
> https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java


We have used the Java example you linked above. However, we still get org.apache.thrift.transport.TTransportException and java.io.EOFException when we read from streams. To clarify: we use the XZ stream reader to read StreamItems without first decompressing the chunk files, and the exception is not always present. We would like to know whether it is caused by the XZ stream reader or by something else.
The Exception details are as follows:

org.apache.thrift.transport.TTransportException: java.io.EOFException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:354)
    at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:347)
    at streamcorpus.ContentItem$ContentItemStandardScheme.read(ContentItem.java:1626)
    at streamcorpus.ContentItem$ContentItemStandardScheme.read(ContentItem.java:1)
    at streamcorpus.ContentItem.read(ContentItem.java:1415)
    at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1549)
    at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1)
    at streamcorpus.StreamItem.read(StreamItem.java:1326)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(Unknown Source)
    at org.tukaani.xz.rangecoder.RangeDecoder.prepareInputBuffer(Unknown Source)
    at org.tukaani.xz.LZMA2InputStream.decodeChunkHeader(Unknown Source)
    at org.tukaani.xz.LZMA2InputStream.read(Unknown Source)
    at org.tukaani.xz.BlockInputStream.read(Unknown Source)
    at org.tukaani.xz.SingleXZInputStream.read(Unknown Source)
    at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.read(XZCompressorInputStream.java:113)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 12 more


By the way, sometimes the exception is as follows:
java.io.EOFException
    at java.io.DataInputStream.readFully(Unknown Source)
    at java.io.DataInputStream.readFully(Unknown Source)
    at org.tukaani.xz.SingleXZInputStream.initialize(Unknown Source)
    at org.tukaani.xz.SingleXZInputStream.<init>(Unknown Source)
    at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.<init>(XZCompressorInputStream.java:98)
    at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.<init>(XZCompressorInputStream.java:72)

Thank you very much for your kind consideration! We are looking forward to your answers!

 Best,
 wim


John R. Frank

Aug 6, 2013, 5:49:50 AM
to stream...@googlegroups.com
> We have used the java example you give above. However, we still get
> 'org.apache.thrift.transport.TTransportException' and
> 'java.io.EOFException' when we read from streams. To clarify, we use the
> XZ stream reader to read streamitems without decompressing the chunk
> files and the Exception is not always present. We want to know whether
> it is caused by using the XZ stream reader or by other reasons.


Unfortunately, I have no experience with the XZ stream reader for Java.
The XZ libraries for Python were buggy, so our tools for working with XZ
compression always fork a child process to run the xz command-line
utility. (Yes, that's inefficient and kludgy, but it has worked well.)

Maybe you can write some tests for the XZ tool you are using --- to remove
other variables.
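
For reference, the fork-a-child-process approach looks roughly like this in Python. This is a minimal sketch, not our actual tool code: `read_xz_chunk` is a hypothetical helper name, and it assumes the `xz` command-line utility is on the PATH.

```python
import subprocess

def read_xz_chunk(path):
    """Decompress an .xz chunk file by forking the `xz` CLI and reading
    its stdout, instead of using an in-process XZ library.
    Hypothetical helper; assumes `xz` is installed and on the PATH."""
    proc = subprocess.Popen(
        ["xz", "--decompress", "--stdout", path],
        stdout=subprocess.PIPE,
    )
    # Raw decompressed bytes, ready to hand to a thrift transport.
    data = proc.stdout.read()
    if proc.wait() != 0:
        raise IOError("xz exited with status %d" % proc.returncode)
    return data
```

The decompressed bytes can then be wrapped in a thrift transport (e.g. a TIOStreamTransport over an in-memory stream) and read with the binary protocol as usual.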

jrf

Christan Grant

Aug 6, 2013, 1:18:35 PM
to wim.g...@gmail.com, stream...@googlegroups.com

Thrift throws an end-of-file exception when it is finished processing a set of stream items. We use a mix of Scala and Java, but here is how we catch the errors:

...
try {
  s.read(protocol)
  successful = true
} catch {
  case e: java.lang.OutOfMemoryError => logError("OOM Error: %s".format(e.getStackTrace.mkString("\n"))); None
  case e: TTransportException => e.getType match {
    case TTransportException.END_OF_FILE => logDebug("mkstream Finished."); None
    case TTransportException.ALREADY_OPEN => logError("mkstream already opened."); None
    case TTransportException.NOT_OPEN => logError("mkstream not open."); None
    case TTransportException.TIMED_OUT => logError("mkstream timed out."); None
    case TTransportException.UNKNOWN => logError("mkstream unknown."); None
    case e => logError("Error in mkStreamItem: %s".format(e.toString)); None
  }
  case e: Exception => logDebug("Error in mkStreamItem"); None
}
...


We use the same Java-based XZ reader, and it works for us. You may be trying to read from the protocol after it has already finished; you should check the exception type.




--
Christan Grant
<><

wim.g...@gmail.com

Aug 7, 2013, 4:43:43 AM
to stream...@googlegroups.com, wim.g...@gmail.com, christ...@gmail.com

Hi Christan,

Thank you for your valuable suggestions!

Best,
wim

wim.g...@gmail.com

Aug 7, 2013, 4:46:29 AM
to stream...@googlegroups.com
Hi John,

Thank you for your prompt reply!

Best,
 wim