KBA with only content in English

Matteo Bernardon

Sep 4, 2015, 10:58:42 AM

to streamcorpus

Hi guys!!

I'm working on TREC KBA 2014.

Given the size of the corpus, can I download only the English content (using wget)?

Thanks everybody!!

John R. Frank

Sep 6, 2015, 8:53:36 PM

to Matteo Bernardon, streamcorpus

> Hi guys!! I'm working about KBA TREC 2014. Given the size of it, can I
> download KBA (using wget) only getting content in English?

Matteo,

The version of the KBA corpus listed *second from the top* of the main
page [1] is probably what you are looking for. This is the Serif-tagged
subset of the largest version of the corpus, which was made in 2014. We
ran Serif on only the English and unknown-language part of the corpus,
which was about half.

See the full list of 2,669,424 file paths to the serif-only subset of the
2014 corpus [2]. Each path must be prefixed with [3].

John

[1] http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html

[2] http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only.s3-paths.txt.xz

[3] http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/
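
As a rough illustration of the prefixing step described above, here is a minimal Java sketch. It assumes the path list [2] has already been decompressed with xz and that each line holds one path relative to the prefix [3]. The class name KbaUrls and its methods are hypothetical and not part of the corpus tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

class KbaUrls {
    // Prefix [3] quoted in the message above.
    static final String PREFIX =
        "http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/";

    // Turns one line of the path list into a full URL.
    // Assumption: lines are paths relative to PREFIX; a leading "/" is tolerated.
    static String toUrl(String relativePath) {
        String p = relativePath.trim();
        while (p.startsWith("/")) {
            p = p.substring(1);
        }
        return PREFIX + p;
    }

    // Converts all non-blank lines, preserving their order.
    static List<String> toUrls(List<String> lines) {
        return lines.stream()
            .filter(line -> !line.trim().isEmpty())
            .map(KbaUrls::toUrl)
            .collect(Collectors.toList());
    }

    // Usage: java KbaUrls paths.txt > urls.txt
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String url : toUrls(lines)) {
            System.out.println(url);
        }
    }
}
```

If the output is saved to a file such as urls.txt, that file could then be handed to wget with its -i option, for example wget -i urls.txt.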

Matteo Bernardon

Sep 8, 2015, 5:30:03 AM

to streamcorpus, matteo.b...@gmail.com

Thanks John!!

When I download a document, its name has the form filename.sc.xz.gpg:

-gpg: the file was encrypted, so I decrypted it with the appropriate key;

-xz: the file was compressed, so I decompressed it;

-sc: the file was serialized, so I must deserialize it.
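
The decryption and decompression steps listed above can be sketched in Java by shelling out to the gpg and xz command-line tools. This is only a sketch under assumptions: both tools are installed and on the PATH, the decryption key has already been imported into the local GPG keyring and needs no passphrase prompt, and the class UnpackChunk with its helper methods is hypothetical. It stops at the .sc file and does not deserialize it.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class UnpackChunk {
    // Removes one expected suffix, e.g. "a.sc.xz.gpg" minus ".gpg" is "a.sc.xz".
    static String stripSuffix(String name, String suffix) {
        if (!name.endsWith(suffix)) {
            throw new IllegalArgumentException(name + " does not end with " + suffix);
        }
        return name.substring(0, name.length() - suffix.length());
    }

    // Builds the gpg command that decrypts X.gpg into X.
    static List<String> gpgCommand(String encrypted) {
        String output = stripSuffix(encrypted, ".gpg");
        return Arrays.asList("gpg", "--batch", "--yes",
                             "--output", output, "--decrypt", encrypted);
    }

    // Builds the xz command that decompresses X.xz into X, keeping X.xz.
    static List<String> xzCommand(String compressed) {
        stripSuffix(compressed, ".xz"); // validates the file name only
        return Arrays.asList("xz", "--decompress", "--keep", compressed);
    }

    // Runs a command and fails loudly on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command.get(0) + " exited with code " + exitCode);
        }
    }

    // Usage: java UnpackChunk filename.sc.xz.gpg
    public static void main(String[] args) throws Exception {
        String encrypted = args[0];
        String compressed = stripSuffix(encrypted, ".gpg");
        run(gpgCommand(encrypted));
        run(xzCommand(compressed));
        System.out.println("Serialized chunk: " + stripSuffix(compressed, ".xz"));
    }
}
```

The command construction is kept separate from the execution so that the file-name handling can be checked without gpg or xz being present.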

The problem is the deserialization step. The documentation says that the data was serialized with Thrift.

My question is: how can I use Thrift to deserialize my file filename.sc?

I just want to extract the content of the news, forums, etc. and perhaps save it to a text file. In an older post (called "decrypting the corpus") I read that local classes in Java and Python are recommended. I would prefer to use Java. Can you help me, please?

Thanks John!!

Matteo

John R Frank

Sep 8, 2015, 8:49:58 AM

to Matteo Bernardon, stream...@googlegroups.com

Hi Matteo,

See these dirs for info on how to use the streamcorpus thrift definitions in Java:

https://github.com/trec-kba/streamcorpus/tree/master/examples/java

https://github.com/trec-kba/streamcorpus/tree/master/java

John
