Increasing number of mappers beyond number of file splits in Scalding

Wynn Chen

Mar 25, 2015, 8:53:05 PM
to cascadi...@googlegroups.com
Hey everyone,

I am running into a lot of trouble increasing the number of mappers in Scalding. I am reading from a large input data set split into about 500 partitions, each with about 1.5 GB of data. I am trying to increase the number of mappers to reduce the runtime of each mapper, which is currently about 30 minutes. However, no matter what I try, the number of mappers keeps falling back to the default behavior: the size of each part-___ file on HDFS divided by the default block size.
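
For concreteness, here's the back-of-the-envelope math I'm working from (a rough sketch; the numbers are approximate and the object name is just for illustration):

    // Rough estimate of mapper count under the default behavior.
    object DefaultMapperEstimate extends App {
      val partitionBytes = 1.5 * 1024 * 1024 * 1024 // ~1.5 GB per part file
      val blockBytes     = 512L * 1024 * 1024       // default block size, 512 MB
      val partitions     = 500

      val mappersPerFile = math.ceil(partitionBytes / blockBytes).toInt // ~3
      val totalMappers   = mappersPerFile * partitions                  // ~1500
      println(s"~$mappersPerFile mappers per part file, ~$totalMappers mappers total")
    }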

My understanding is that I can increase the number of mappers by decreasing mapred.max.split.size below the default block size (512 MB). That should produce more splits per input file, and therefore more mappers.
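
For reference, this is roughly how I'm setting the property right now (a sketch, assuming a Scalding version whose Job exposes a config override; the class name and elided pipe code are placeholders):

    import com.twitter.scalding._

    class MyLargeInputJob(args: Args) extends Job(args) {
      // Sketch: merge the split-size override into the Hadoop configuration
      // for the whole flow. 268435456 bytes = 256 MB.
      override def config: Map[AnyRef, AnyRef] =
        super.config ++ Map("mapred.max.split.size" -> "268435456")

      // ... pipe definitions elided ...
    }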

However, setting mapred.max.split.size=268435456 is not working for me. Looking at my job logs, the property appears to be picked up correctly, but then ignored.

I've also tried setting dfs.block.size=268435456 and cascading.hadoop.hdfs.combine.max.size=268435456, neither of which has affected the number of mappers.

The one thing that did seem to have an effect was setting mapred.map.tasks=6000. However, this is a poor fit for the rest of the tasks in my flow, since only one of them reads from the large input file.

Am I misunderstanding some concept? I thought MapReduce would take min(default block size, mapred.max.split.size) and use that as the maximum amount of data each mapper could read. Why are more mappers not being allocated for my task?
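
To spell out my mental model (which may be exactly where I'm going wrong), this is the per-file calculation I'm assuming happens:

    // My assumed split-size calculation, not necessarily what Hadoop actually does.
    object SplitSizeAssumption extends App {
      val blockBytes     = 512L * 1024 * 1024                 // default block size, 512 MB
      val maxSplitBytes  = 268435456L                         // mapred.max.split.size, 256 MB
      val partitionBytes = (1.5 * 1024 * 1024 * 1024).toLong  // ~1.5 GB per part file

      val splitBytes     = math.min(blockBytes, maxSplitBytes)                   // expect 256 MB
      val mappersPerFile = math.ceil(partitionBytes.toDouble / splitBytes).toInt // expect ~6
      println(s"split size = $splitBytes bytes, ~$mappersPerFile mappers per part file")
    }

If that assumption were right, I'd expect roughly 6 mappers per part file instead of the ~3 per file I get from the default behavior.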

Any help or pointers would be much appreciated. Thanks!