EMR running only one map task


Joan Smith

Sep 1, 2012, 7:38:33 PM9/1/12
to mr...@googlegroups.com
Hi all,

I've been stuck on this problem for a while; I thought I had it solved, but no luck. Running my job on EMR, I can only get it to launch one (or at most two) map tasks. My job is certainly big enough to need many more than that: I have a 200 MB gzip input file, and each line requires a fairly CPU-intensive computation.

After some digging in the docs (both mrjob and hadoop) I added the following to my mrjob command:
--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.map.tasks.maximum=10 -m mapred.tasktracker.reduce.tasks.maximum=10"

This seemed to let me start two tasks on two m1.small instances, but when I tested with a few c1.xlarge instances, it only started one map task.
Any insight toward how to get it to launch a reasonable number of tasks?

Thanks for the help,
Joan

Shivkumar Shivaji

Sep 2, 2012, 2:25:15 PM9/2/12
to mr...@googlegroups.com
There is an old AWS thread on this: https://forums.aws.amazon.com/message.jspa?messageID=279671

In summary, you cannot easily control the number of map tasks. You can offer a hint when the job is started via map.tasks.maximum or map.tasks, but EMR guesses the number of map tasks based on the instance type.

In the case of a 200 MB gzip file, I also feel that one map task should typically be enough. We have one map task for every large gzip file in most of the jobs at my firm.

I would be concerned if each line in a MapReduce job requires fairly intensive CPU computation. If that cannot be improved, you can resort to specifying the split size, or even a custom input format that assigns every 1,000 lines to a mapper, as the thread above indicates. However, it would be better (in terms of code maintenance) if the job can be adapted so that EMR's estimate of the number of map tasks is good enough.
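If the input is splittable, the split-size hint can be passed straight through mrjob as a jobconf option. This is only a sketch: the property name is the Hadoop 1.x one, and the job script and bucket path are placeholders.

```shell
# Hypothetical invocation: cap input splits at ~16 MB so Hadoop creates
# more map tasks for the same input (only effective on splittable input).
python my_job.py -r emr \
    --jobconf mapred.max.split.size=16777216 \
    s3://my-bucket/input/
```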

Shiv

Dave Marin

Sep 4, 2012, 3:08:04 PM9/4/12
to mr...@googlegroups.com
Hadoop can't split gzip files (this is a limitation of the gzip
format, not Hadoop), so if all your input is in a single gzip file,
Hadoop has no choice but to assign it all to the same mapper.

Try leaving the input file uncompressed, or using a splittable
compression format (e.g. bzip2, LZO).
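If re-encoding to bzip2 or LZO isn't an option, another workaround is to re-chunk the single .gz into many smaller .gz files, one per desired mapper. A rough sketch, with made-up filenames and chunk size; the `seq` line just fabricates a stand-in for the real input:

```shell
# Fabricate a small stand-in for the real 200 MB gzip input.
seq 1 250000 | gzip > input.gz

# Re-chunk into several smaller .gz files, ~100k lines each, so Hadoop
# can run one mapper per file instead of one mapper for everything.
gunzip -c input.gz | split -l 100000 - part-
for f in part-*; do gzip "$f"; done

ls part-*.gz   # part-aa.gz part-ab.gz part-ac.gz
```

Hadoop still reads each part with a single mapper, but with enough parts you get as much parallelism as you need.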

-Dave
--

Yelp is looking to hire great engineers! See http://www.yelp.com/careers.

Shivkumar Shivaji

Sep 4, 2012, 3:17:31 PM9/4/12
to mr...@googlegroups.com
Thanks for pointing this out. I missed this aspect of the original question.

Unfortunately, a lot of our files on S3 are in gzip format. A terrible choice, with the benefit of hindsight.

Shiv

Joan Smith

Sep 4, 2012, 3:21:43 PM9/4/12
to mr...@googlegroups.com
Ah! That could be it. The first set of solutions hasn't gotten me any further just yet. I'll take a look tonight and report back.

Joan

Dave Marin

Sep 4, 2012, 3:32:53 PM9/4/12
to mr...@googlegroups.com
Knowing that Hadoop will never split .gz files can come in handy. For
example, at Yelp we use it to sort our access logs so hits in the same
session are always in the same .gz file.
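The trick is just to sort on the session key before compressing. A toy sketch with fabricated data (real logs would be parsed, not printf'd):

```shell
# Fabricated sample access log: "session_id<TAB>request" per line.
printf 's2\tGET /biz\ns1\tGET /\ns2\tGET /review\ns1\tGET /search\n' > access.log

# Sort by session id, then compress the whole stream into one .gz file.
# Since Hadoop never splits a .gz, each session reaches one mapper intact.
sort -s -k1,1 access.log | gzip > access_log.gz
```

With multiple output files you would partition on the session key first, so each session still lands in exactly one .gz file.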

-Dave

Shivkumar Shivaji

Sep 4, 2012, 3:43:47 PM9/4/12
to mr...@googlegroups.com
Thanks for that insight!

I was aware of the .gz file split issue, but did not realize its significance. I assumed a gz file occupying one block in HDFS (say, <256 MB) was fine. However, when running a job that spans about a year's worth of data (and >60K files), I see memory and I/O exceptions, because the default number of parallel mappers proposed by EMR is somewhat ambitious when the gz files are large.

I wonder whether future logs should all be LZO-compressed instead. The optimization at Yelp is an interesting in-between solution.

Shiv