All users of the KBA corpus,
The corpus stripped of non-English texts is 99% complete. I will post
when the last few hundred chunks finish. You can list the directories like this:
s3cmd ls s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/
Previously, I suggested that instead of downloading the whole corpus,
you can process it in EC2. To check that this is easy and cheap, we ran
the attached C++ example of deserializing the corpus in this simple
Python and shell pipeline:
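import sys
from subprocess import Popen, PIPE

# chunk_count and gpg_dir are set earlier in the script (not shown)
for line in sys.stdin:
    chunk_count += 1
    url = 'http://s3.amazonaws.com/aws-publicdatasets/' + line.strip()
    # fetch, decrypt, decompress, and count one chunk; stderr goes to a log
    cmd = '(wget -O - %s | gpg --homedir %s --no-permission-warning --trust-model always --output - --decrypt - | xz --decompress | ./streamcorpus-counter) 2>> subprocess_errors.log' % (url, gpg_dir)
    child = Popen(cmd, stdout=PIPE, shell=True)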
We ran this on an EC2 cc2.8xlarge by splitting the list of 2.2M input
paths into pieces and feeding them into GNU parallel like this:
ls paths/x???? | ./parallel -j 32 --eta "cat {} | python filter-streamitems-cpp.py 1> {}.speed.log 2> {}.errors.log {}" &> parallel.log &
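One way to produce pieces named like paths/x???? is sketched below in Python. This is only an illustration: the function name split_path_list, the piece size, and the four-letter suffix scheme are assumptions rather than the script used for the run above, and the standard split utility would do the same job.

```python
import itertools
import os
import string


def split_path_list(lines, out_dir, lines_per_piece):
    """Write `lines` into out_dir/xaaaa, out_dir/xaaab, ... and return the file names.

    Hypothetical helper: the names mimic the paths/x???? pattern used above.
    """
    os.makedirs(out_dir, exist_ok=True)
    lines = [ln.rstrip("\n") for ln in lines]
    # Four lowercase letters give 26**4 possible piece names.
    suffixes = itertools.product(string.ascii_lowercase, repeat=4)
    written = []
    for start in range(0, len(lines), lines_per_piece):
        name = os.path.join(out_dir, "x" + "".join(next(suffixes)))
        with open(name, "w") as out:
            for ln in lines[start:start + lines_per_piece]:
                out.write(ln + "\n")
        written.append(name)
    return written
```

Called with an open file of chunk paths, for example split_path_list(open("all-paths.txt"), "paths", 1000), it writes one file per 1000 paths; the input file name and the piece size here are placeholders.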
This pipeline processes about 123 MB/sec, or roughly 10 TB/day, of compressed, encrypted chunk files.
That size EC2 machine costs $2.10/hr, so you can process the whole corpus
for about $50, assuming your initial filtering technique is fast.
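The arithmetic behind those figures can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and a constant processing rate; the amount of data it prints is implied by the $50 estimate, not a separately measured corpus size.

```python
MB = 1e6    # assumption: decimal megabytes
TB = 1e12   # assumption: decimal terabytes

rate_bytes_per_sec = 123 * MB   # throughput quoted above
price_per_hour = 2.10           # hourly price quoted above
budget = 50.0                   # approximate total cost quoted above

# Data processed in one day at the quoted rate.
tb_per_day = rate_bytes_per_sec * 86400 / TB

# Machine hours the budget buys, and the data processed in that time.
hours_for_budget = budget / price_per_hour
tb_for_budget = rate_bytes_per_sec * 3600 * hours_for_budget / TB

print("%.1f TB/day" % tb_per_day)
print("%.1f hours for $%.0f" % (hours_for_budget, budget))
print("%.1f TB processed in that time" % tb_for_budget)
```

So $50 buys a little under a day of machine time, which at this rate covers a bit more than 10 TB of compressed input.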
Doing it with the Python code generated by Thrift is more than 100x slower.
I'd expect Java to be about as fast as the C++.
jrf