EC2 pointers for TREC KBA

John R. Frank

May 1, 2013, 9:36:46 PM

to trec...@googlegroups.com

KBAers,

Several people have asked about EC2 after reading this post:

https://groups.google.com/d/msg/streamcorpus/fi8Y8yseF8o/viJjiFNVLNsJ

To run a high-recall filter on the corpus, you need to access
StreamItem.body.clean_visible (or clean_html). The most computationally
expensive step in accessing the corpus is deserializing the Thrift
messages. Including decryption and decompression, both Java and C++ can
deserialize the Thrift format of the corpus at about 120MB/sec.

That means that you can rip through the whole corpus with around $50 of
EC2 time. EC2 machines are just regular Linux machines. We use Ubuntu:

https://help.ubuntu.com/community/EC2StartersGuide
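
For anyone who wants to redo the cost arithmetic for their own setup, here
is a minimal Python sketch. It assumes the ~120MB/sec rate applies per
worker process and uses the roughly $50/day cc2.8xlarge price quoted later
in this message. The function name, the workers parameter, and the corpus
size in the example call are placeholders of mine, not figures from the
corpus documentation.

```python
def estimate_cost(corpus_gb, mb_per_sec=120.0, workers=1, usd_per_day=50.0):
    """Return (hours, dollars) for one deserialization pass over the corpus.

    Assumes throughput scales linearly with the number of workers, which
    is an idealization; disk and network limits will lower it in practice.
    """
    total_mb = corpus_gb * 1000.0
    seconds = total_mb / (mb_per_sec * workers)
    hours = seconds / 3600.0
    dollars = hours / 24.0 * usd_per_day
    return hours, dollars


# Placeholder corpus size: substitute the real size of the chunks you fetch.
hours, dollars = estimate_cost(corpus_gb=5000, workers=4)
print("%.1f hours, about $%.2f" % (hours, dollars))
```

Plugging in your own measured throughput and the current instance price
should give a rough budget before you launch anything.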

To get started in EC2, make an account at aws.amazon.com and then install
the EC2 command-line tools,

http://aws.amazon.com/developertools/351

so you can run commands like this:

ec2-run-instances ami-9251c2fb -b /dev/sda1=:200 -t t1.micro --key mykey-name

I copied that command from this page:

http://cloud-images.ubuntu.com/precise/current/

I added "-b /dev/sda1=:200" so that it would make the root volume 200GB
and automatically resize as it starts up.
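
If you would rather script the launch from Python, the same request can be
sketched with the boto3 library. This is a different tool from the EC2
command-line tools used above, and it assumes boto3 is installed and AWS
credentials are configured. The helper names build_run_request and launch
are hypothetical, and the AMI ID is just the one from the command above,
which may not be available in your region.

```python
def build_run_request(ami_id, key_name, instance_type="t1.micro", root_gb=200):
    """Build keyword arguments for boto3's EC2 client run_instances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
        # Same intent as "-b /dev/sda1=:200": a 200 GB root volume.
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": root_gb}}
        ],
    }


def launch(request):
    """Send the request to EC2. Not called here, because it costs money."""
    import boto3  # imported lazily so the sketch loads without boto3

    return boto3.client("ec2").run_instances(**request)


request = build_run_request("ami-9251c2fb", "mykey-name")
print(request["BlockDeviceMappings"])
```

Calling launch(request) would actually start a billable instance, so the
sketch only builds and prints the request.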

To list all your EC2 machines, you can do this:

ec2-describe-instances

After you launch an EC2 instance, it will show up in the instance list in
the EC2 console at aws.amazon.com. When you right-click on it, one of the
options will be Connect, which shows you a DNS name for the machine so you
can log in to it, like this:

ssh -i .ssh/jrf2.pem ubu...@ec2-23-22-197-146.compute-1.amazonaws.com

You can also launch/stop/terminate machines from the console in
http://aws.amazon.com

t1.micro costs about $0.50/day

cc2.8xlarge costs about $50/day

We recently used a cc2.8xlarge to generate the reverse mapping from
stream_id to chunk_path. It took a day.

Alternatively, you could use spot instances in Amazon Elastic MapReduce,
which is based on Hadoop. It is very cool, but it has a steeper learning
curve. For KBA, I'd suggest doing high-recall filtering with just a
couple of big machines and something like GNU parallel to saturate all the
cores on the box. Logging and debugging on one or two machines is
*much* easier than debugging Hadoop.
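
To make the one-big-machine approach concrete, here is a rough Python
sketch that uses the standard multiprocessing module in place of GNU
parallel to run one filter worker per core. Everything named here is a
placeholder: TARGET_NAMES is a made-up list, and filter_chunk reads a plain
text file with one document per line as a stand-in for decrypting,
decompressing, and Thrift-deserializing a real chunk.

```python
import sys
from multiprocessing import Pool

# Hypothetical target names; a real run would load the KBA target entities.
TARGET_NAMES = ["example entity", "another entity"]


def mentions_target(text, names=TARGET_NAMES):
    """High-recall test: keep the text if any name appears, ignoring case."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in names)


def filter_chunk(path):
    """Stand-in worker: treat each line of a text file as one document.

    Real code would decrypt, decompress, and deserialize the chunk, then
    test each StreamItem.body.clean_visible instead of each line.
    """
    hits = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if mentions_target(line):
                hits += 1
    return path, hits


if __name__ == "__main__":
    paths = sys.argv[1:]
    if paths:
        with Pool() as pool:  # defaults to one worker process per core
            for path, hits in pool.imap_unordered(filter_chunk, paths):
                print(path, hits)
```

Each worker handles whole files independently, so a failure can be
reproduced by running filter_chunk on that one path in a single process.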

As you look at different machines to launch, you may want to explore these
two pages:

http://aws.amazon.com/ec2/pricing/
http://aws.amazon.com/ec2/instance-types/

Have fun!

jrf