Missing fields in sc chunks?


Tom Kenter

Jul 30, 2013, 12:29:19 PM

to stream...@googlegroups.com

Hi,

Could it be that in some cases the attributes for the processed fields (clean_html, clean_visible, sentences, etc) are not filled?

For example, I cannot find the clean_visible for document 1320455640-3b39d6ac2c03048d282173a01930531f in chunk news-234-fc50a7cd7588aeb4ca9e8e173ac4b2b8-e87f403a9d64d3be8efae503d405ab5d.sc (and the same appears to be true for more documents).

Also, the 'clean_html' appears to be empty.

The 'raw' data for the same document is non-empty, though, and when I look at the actual page it seems to be a normal page.

I redownloaded the file from http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/2011-11-05-01/index.html but that didn't help.

Am I doing something wrong, or is there something else going on?

Thanks!

Tom

John R. Frank

Aug 1, 2013, 12:11:38 AM

to Tom Kenter, stream...@googlegroups.com

> Could it be that in some cases the attributes for the processed fields
> (clean_html, clean_visible, sentences, etc) are not filled?

Yes, clean_html and clean_visible were only generated when the documents
appeared to be English or unknown language, and even then the stripping
pipeline sometimes failed to generate output.
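
A quick way to see which documents in a chunk are affected is to compare the lengths of the three fields per item. The following is only a minimal sketch, not the corpus tooling itself: the field names raw, clean_html, and clean_visible come from this thread, while the `item.stream_id` / `item.body` layout, the helper names `field_lengths` and `find_unprocessed`, and the stand-in demo objects are assumptions made for illustration. Real items would come from whatever reader you use for the .sc chunk files.

```python
from types import SimpleNamespace

# Field names as discussed in this thread.
FIELDS = ("raw", "clean_html", "clean_visible")


def field_lengths(item):
    """Return {field: length}; a missing body or unset field counts as 0."""
    body = getattr(item, "body", None)
    lengths = {}
    for name in FIELDS:
        value = getattr(body, name, None)
        lengths[name] = len(value) if value else 0
    return lengths


def find_unprocessed(items):
    """Yield (stream_id, lengths) for items with raw data but no clean_visible."""
    for item in items:
        lengths = field_lengths(item)
        if lengths["raw"] > 0 and lengths["clean_visible"] == 0:
            yield item.stream_id, lengths


# Hypothetical stand-ins for stream items, only to make the sketch runnable.
demo_items = [
    SimpleNamespace(
        stream_id="doc-a",
        body=SimpleNamespace(
            raw=b"<html>hi</html>",
            clean_html=b"<html>hi</html>",
            clean_visible=b"hi",
        ),
    ),
    SimpleNamespace(
        stream_id="doc-b",
        body=SimpleNamespace(raw=b"<html>x</html>", clean_html=None, clean_visible=None),
    ),
]

for stream_id, lengths in find_unprocessed(demo_items):
    print(stream_id, lengths)
```

Only items that have raw content but an empty or unset clean_visible are reported, which matches the case described above where the stripping pipeline produced no output.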

The file named below, from the recent KBA data update, lists the lengths
of raw, clean_html, and clean_visible. One document in the truth set
lacks clean_visible. Fortunately, it has rating=-1.

trec-kba-ccr-judgments-2013-07-08.before-cutoff.chunk-to-stream_id-and-sizes.txt

Any other questions about this?

John
