> Could it be that in some cases the attributes for the processed fields
> (clean_html, clean_visible, sentences, etc) are not filled?
Yes, clean_html and clean_visible were only generated when the documents
appeared to be English or unknown language, and even then the stripping
pipeline sometimes failed to generate output.
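In practice this means code that reads the corpus should not assume the processed fields are present. A minimal sketch of a defensive fallback is below. It assumes each document body exposes raw, clean_html, and clean_visible as attributes that may be None or empty; the helper name best_available_text and the SimpleNamespace stand-ins are hypothetical, used only for illustration, and are not part of any corpus tooling.

```python
from types import SimpleNamespace


def best_available_text(body):
    """Return (field_name, value) for the most processed non-empty field.

    Tries clean_visible first, then clean_html, then raw.
    Returns (None, None) if none of them is filled.
    """
    for name in ("clean_visible", "clean_html", "raw"):
        value = getattr(body, name, None)
        if value:  # skips missing attributes, None, and empty strings/bytes
            return name, value
    return None, None


# Hypothetical stand-ins for document bodies (not real corpus objects).
full = SimpleNamespace(raw=b"<p>hi</p>", clean_html="<p>hi</p>", clean_visible="hi")
partial = SimpleNamespace(raw=b"<p>hola</p>", clean_html=None, clean_visible=None)

print(best_available_text(full))
print(best_available_text(partial))
```

Whether falling back to raw is appropriate depends on the task; a system that needs clean text may prefer to skip such documents instead.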
This file from the recent KBA data update has the lengths of raw,
clean_html, and clean_visible:
trec-kba-ccr-judgments-2013-07-08.before-cutoff.chunk-to-stream_id-and-sizes.txt
There is one document in the truth set that lacks clean_visible.
Fortunately, it has rating=-1.
Any other questions about this?
John