Re: Big Corpora, how to deal with?

John R. Frank

Jun 3, 2014, 8:32:18 AM
to stream...@googlegroups.com
Hi Wallace,

I am moving your question from trec-kba to the streamcorpus discussion forum.


> Are there any APIs to get part data I want? 
> Or are there any Web Services I can use? 
> Or need I download the corpora in my server?

There are tools available here:

http://streamcorpus.org/


For KBA, I am generating a trimmed-down, English-only corpus, which I will
post in a few days.

For processing the corpus or any trimmed-down version of it, I highly
recommend using EC2 spot instances to do parallel processing. Here's an
example tool that is fully self-contained and has processed the entire
corpus for a couple hundred dollars' worth of spot instance time. The
trimmed-down corpus will make that even cheaper.

https://github.com/trec-kba/streamcorpus-pipeline/blob/master/examples/verify_kba2014.py

This example script uses rejester, which is a tool we made for batch
processing in AWS. It is still a bit green, but there is
documentation on streamcorpus.org.

This example can also be easily adapted to run without rejester, for
instance under something like GNU parallel. Let us know what
questions you have, and we'll help you.
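
To give a rough idea of what the GNU parallel route could look like, here
is a minimal sketch. It is not the verify_kba2014.py script and does not
use the streamcorpus or rejester APIs. The file name process_one.py, the
function process_chunk, and the per-file work (size plus SHA-256) are all
hypothetical stand-ins for whatever processing you actually need. The only
point is the shape: each invocation handles the paths it is given and
prints one result line per file, so an external tool can fan the work out.

```python
#!/usr/bin/env python
"""process_one.py: hypothetical per-file worker for use under GNU parallel."""
import hashlib
import sys


def process_chunk(path, block_size=1 << 20):
    """Return (size_in_bytes, sha256_hexdigest) for the file at `path`.

    Placeholder work only: replace the body with real per-chunk processing.
    The file is read in blocks so large chunk files are never held in memory.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
            size += len(block)
    return size, digest.hexdigest()


def main(argv):
    """Process every path in argv[1:]; return the number of failures."""
    failures = 0
    for path in argv[1:]:
        try:
            size, hexdigest = process_chunk(path)
        except OSError as exc:
            # Report the failure and keep going with the remaining paths.
            sys.stderr.write("FAILED\t%s\t%s\n" % (path, exc))
            failures += 1
            continue
        sys.stdout.write("%s\t%d\t%s\n" % (path, size, hexdigest))
    return failures


if __name__ == "__main__":
    # Exit non-zero only if some file failed, so the caller can detect it.
    if main(sys.argv):
        sys.exit(1)
```

Assuming a file paths.txt (also hypothetical) with one local chunk path per
line, this could be driven with something like:

    parallel -j 8 python process_one.py {} < paths.txt > results.tsv

Keeping each file independent is what makes the job easy to spread across
cores or across several spot instances.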

John