KBAers,
Several people have asked about EC2 after reading this post:
https://groups.google.com/d/msg/streamcorpus/fi8Y8yseF8o/viJjiFNVLNsJ
To run a high-recall filter on the corpus, you need to access
StreamItem.body.clean_visible (or clean_html). The most computationally
expensive step in accessing the corpus is deserializing the Thrift
messages. Including decryption and decompression, both Java and C++ can
deserialize the corpus's Thrift format at about 120MB/sec.
That means that you can rip through the whole corpus with around $50 of
EC2 time. EC2 machines are just regular Linux machines. We use Ubuntu:
https://help.ubuntu.com/community/EC2StartersGuide
To get started in EC2, make an account at aws.amazon.com and then install
the ec2tools,
http://aws.amazon.com/developertools/351
so you can run commands like this:
ec2-run-instances ami-9251c2fb -b /dev/sda1=:200 -t t1.micro --key mykey-name
I copied that command from this page:
http://cloud-images.ubuntu.com/precise/current/
I added "-b /dev/sda1=:200" so that the root volume is 200GB and
automatically resizes as the instance starts up.
To list all your EC2 machines, you can do this:
ec2-describe-instances
After you launch an EC2 instance, you can watch the EC2 console at
aws.amazon.com and the instance will show up in the list. When you
right-click on it, one of the options will be "Connect", which shows you a
DNS name for the machine so you can log in to it, like this:
ssh -i .ssh/jrf2.pem ubu...@ec2-23-22-197-146.compute-1.amazonaws.com
You can also launch/stop/terminate machines from the console in
http://aws.amazon.com
t1.micro costs about $0.50/day
cc2.8xlarge costs about $50/day
We recently used a cc2.8xlarge to generate the reverse mapping from
stream_id to chunk_path. It took a day.
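As a rough sanity check on the numbers in this message, here is a back-of-envelope calculation. It assumes a single process sustains the ~120MB/sec rate for a full day, which is my simplification and not a measurement; the variable names are just for illustration.

```shell
# Back-of-envelope: data one deserializing stream can read in a day.
rate_mb_per_sec=120                   # rate quoted in this message
seconds_per_day=$((24 * 60 * 60))     # 86400 seconds
mb_per_day=$((rate_mb_per_sec * seconds_per_day))
tb_per_day=$((mb_per_day / 1000000))  # decimal TB, rounded down
usd_per_day=50                        # approximate cc2.8xlarge price above
echo "one stream: ${mb_per_day} MB/day (about ${tb_per_day} TB/day) for roughly \$${usd_per_day}"
```

If several cores each keep up that rate, the per-day volume scales accordingly, as long as disk and network can feed them.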
Alternatively, you could use spot instances in Amazon Elastic MapReduce,
which is based on Hadoop. It is very cool, but has a steeper learning
curve. For KBA, I'd suggest doing high-recall filtering with just a
couple of big machines and something like GNU parallel to saturate all the
cores on the box. Logging and debugging on one or two machines is
*much* easier than debugging Hadoop.
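Here is a minimal sketch of the GNU parallel approach. The function name filter_chunks, the directory layout, and the filter script are all hypothetical: the filter stands in for your own program that takes one chunk path and writes its matches to stdout. If GNU parallel is not installed, the sketch falls back to xargs -P, which is a different tool doing the same job.

```shell
# filter_chunks CHUNK_DIR FILTER OUT_DIR
# Runs FILTER once per file under CHUNK_DIR, one job per core, and writes
# each result to OUT_DIR/<chunk-name>.out. All names here are illustrative.
filter_chunks() {
  chunk_dir=$1
  filter=$2
  out_dir=$3
  jobs=$(getconf _NPROCESSORS_ONLN)   # number of online cores
  mkdir -p "$out_dir"

  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --joblog records runtime and exit status per chunk, so failed
    # chunks can be found and rerun later. {/} is the chunk's basename.
    find "$chunk_dir" -type f -print0 |
      parallel -0 -j "$jobs" --joblog "$out_dir/joblog.tsv" \
        "'$filter' {} > '$out_dir'/{/}.out"
  else
    # Fallback without GNU parallel: same fan-out, but no job log.
    find "$chunk_dir" -type f -print0 |
      xargs -0 -P "$jobs" -I{} sh -c \
        '"$1" "$2" > "$3/$(basename "$2").out"' sh "$filter" {} "$out_dir"
  fi
}
```

You would call it as something like: filter_chunks ./corpus ./my-filter.sh ./filtered. Keeping one output file per chunk makes it easy to see which chunk a problem came from.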
As you look at different machines to launch, you may want to explore these
two pages:
http://aws.amazon.com/ec2/pricing/
http://aws.amazon.com/ec2/instance-types/
Have fun!
jrf