Matteo Bernardon

Sep 4, 2015, 10:58:42 AM
Hi guys!! 
I'm working on the TREC KBA 2014 corpus.
Given its size, can I download (using wget) only the English content?
Thanks everybody!!

John R. Frank

Sep 6, 2015, 8:53:36 PM
The version of the KBA corpus listed *second from the top* of the main
page [1] is probably what you are looking for. This is the Serif-tagged
subset of the largest version of the corpus, which was made in 2014. We
ran Serif on only the English and unknown-language part of the corpus,
which was about half.

See the full list of 2,669,424 file paths in the serif-only subset of the
2014 corpus [2]. Each path must be prefixed with the base URL [3].


[1] http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html

[2] http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only.s3-paths.txt.xz

[3] http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/
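A minimal sketch of how [2] and [3] combine, written in Python for brevity. It assumes the decompressed list [2] holds one relative path per line; the helper names `full_urls` and `urls_from_xz` are hypothetical and not part of any KBA tooling.

```python
import lzma

# Base URL [3] from the post above.
BASE_URL = (
    "http://s3.amazonaws.com/aws-publicdatasets/trec/kba/"
    "kba-streamcorpus-2014-v0_3_0-serif-only/"
)

def full_urls(path_lines, base_url=BASE_URL):
    """Yield one download URL per non-empty relative path."""
    for line in path_lines:
        rel = line.strip()
        if rel:
            # Join with exactly one slash between base and path.
            yield base_url.rstrip("/") + "/" + rel.lstrip("/")

def urls_from_xz(list_path, base_url=BASE_URL):
    """Read the xz-compressed path list [2] and yield full URLs."""
    with lzma.open(list_path, "rt", encoding="utf-8") as fh:
        yield from full_urls(fh, base_url)
```

Writing the resulting URLs to a text file, one per line, should let wget fetch them with `wget -i urls.txt`.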

Matteo Bernardon

Sep 8, 2015, 5:30:03 AM
Thanks John!! 

Each document I download has a name of the form filename.sc.xz.gpg:

-gpg: the file was encrypted and so I decrypted it with the appropriate key;

-xz: the file was compressed and so I decompressed it;

-sc: the file was serialized and I MUST deserialize it.
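The first two steps can be scripted. This is a minimal Python sketch, assuming the `gpg` command-line tool is installed and the corpus key has already been imported into the local keyring; the function names are hypothetical.

```python
import lzma
import subprocess

def gpg_decrypt(encrypted_path):
    """Return the decrypted bytes of a .gpg file via the gpg CLI.

    Assumes the private key is already in the local keyring.
    """
    result = subprocess.run(
        ["gpg", "--batch", "--quiet", "--decrypt", encrypted_path],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if gpg fails
    )
    return result.stdout

def xz_decompress(data):
    """Return the decompressed bytes of an xz-compressed buffer."""
    return lzma.decompress(data)

def read_chunk_bytes(path):
    """Undo the .gpg and .xz layers of a filename.sc.xz.gpg file."""
    return xz_decompress(gpg_decrypt(path))
```

The bytes returned by `read_chunk_bytes` are still Thrift-serialized, so reading them requires the streamcorpus Thrift definitions.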

The problem is the deserialization step. In the documentation I read that the data was serialized with Thrift.

My question is: how can I use Thrift to deserialize my file filename.sc?

I just want to extract the content of the news articles, forum posts, etc. and perhaps save it to a text file. In an old post (called "decrypting the corpus") I read that you recommend local classes in Java and Python. I would prefer to use Java. Can you help me, please?

John R Frank

Sep 8, 2015, 8:49:58 AM
Hi Matteo,

See these dirs for info on how to use the streamcorpus thrift definitions in Java:


