Is there a preferred AWS Region to run an EMR job in?

j...@webmaster.ms

Jan 26, 2012, 8:55:49 PM

to Common Crawl

Hi.

I would like to ask whether there is a preferred AWS Region for running
an EMR job, in order to optimize processing speed and cost:
US West, US East, or Europe?
Where is the Common Crawl data located?

thanks

Mat Kelcey

Jan 27, 2012, 11:28:41 PM

to Common Crawl

I can think of three reasons why you'd want to run in us-east:

1) (fact) compute costs are lowest in us-east

2) (fact) the Common Crawl data is located in us-east, and data transfer
between Elastic MapReduce and S3 is only free within the same region

3) (opinion) one of the best things about Elastic MapReduce is that
it's easy to get either a large number of small instances or a small
number of large instances. My personal experience is that the cc1s and
cc2s provide the best balance of CPU/memory/network for processing
this data (and data sets like it). Since you're going to have to
get some serious instances running to do anything non-trivial, you're
better off running cc1s and cc2s, which are only available in us-east.
(I guess your mileage may vary on this one depending on what you're
trying to do.)

Mat

jaidee...@gmail.com

Jan 29, 2012, 4:57:59 AM

to common...@googlegroups.com

Hi,

Thanks for the information, it's very useful. Please also share any estimate you have of how long a job against the complete data set is supposed to run. I am running a job on a 9-node cluster (1 + 8 * c1.medium), and according to my calculations it will take around 4 days against just the 2010/01/ data, which I guess covers only one month.

Thanks,

Jaideep

--
Jaideep Dhok

Mat Kelcey

Jan 30, 2012, 1:13:24 AM

to Common Crawl

Hi again,

It's hard to estimate the cost of processing the data since it totally
depends on what processing you're doing.

The arc files are, though, all the same size and have roughly the same
kind of content spread across them, so you can reasonably extrapolate
up from how long it takes to run on, say, 100 of the 300,000 total.
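
As a rough illustration of that extrapolation, here is a minimal sketch. It
assumes runtime scales linearly with the number of arc files on a fixed
cluster. The function name and the 0.5 hour sample timing are hypothetical;
only the 100 and 300,000 file counts come from the figures above.

```python
def estimate_total_hours(sample_files, sample_hours, total_files=300_000):
    """Linearly scale a sample run's wall-clock time up to the full file set.

    Assumes every arc file costs about the same to process and that the
    cluster size stays the same between the sample run and the full run.
    """
    if sample_files <= 0:
        raise ValueError("sample_files must be positive")
    return sample_hours * (total_files / sample_files)


# Hypothetical measurement: 100 arc files took 0.5 hours on some cluster.
hours = estimate_total_hours(sample_files=100, sample_hours=0.5)
print(f"~{hours:.0f} hours (~{hours / 24:.1f} days) on the same cluster")
```

Fixed startup overhead and uneven task scheduling are ignored here, so a
larger sample should give a more trustworthy number.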

The 2010/01 subdirectory is roughly 1/4 of the data (see my
previous post from "what's the latest date of the data") and is 7+TB
alone.
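
Scaling from that fraction can be sketched the same way. This assumes the
"roughly 1/4" figure is accurate enough to divide by, and it reuses the
4 day estimate quoted earlier in the thread for the 9-node cluster. The
helper name is hypothetical, and since 7 TB is a lower bound the result is
one too.

```python
def scale_to_full(value_for_subset, subset_fraction):
    """Scale a quantity measured on a subset up to the whole data set."""
    if not 0 < subset_fraction <= 1:
        raise ValueError("subset_fraction must be in (0, 1]")
    return value_for_subset / subset_fraction


full_days = scale_to_full(4, 0.25)  # 4 days estimated for 2010/01 alone
full_tb = scale_to_full(7, 0.25)    # 2010/01 is at least 7 TB

print(f"full data set: roughly {full_days:.0f} days, at least {full_tb:.0f} TB")
```

Both numbers are only as good as the 1/4 estimate, and the days figure only
holds for the same job on the same cluster.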

Mat

