Problem running multiple mappers in parallel


Samudra Banerjee

Mar 28, 2014, 9:02:34 PM
to dkpro-big...@googlegroups.com
Hi all,

This may be a general Hadoop question, but I thought I would post it here as well. When I alter the number of map tasks in DkproHadoopDriver.java, the total number of map tasks is as expected, but they do not run in parallel; only 2 run at a time. For instance, when I set the number of map tasks to 5, 2 of them execute first, followed by 2 more, followed by the last 1. I have an 8-core system and would like to utilize it fully. Some online hunting suggested a few things, and I tried the following:

1. Adjusted the parameter "mapred.tasktracker.map.tasks.maximum" in mapred-site.xml, which controls the number of tasks running in parallel. I set it to 8.
2. Reduced the parameter "mapred.max.split.size". My input sequence file is 8448509 bytes, or approximately 8 MB, so I set it to 2097152 (2 MB).
3. Lowered the DFS block size, "dfs.block.size", in hdfs-site.xml. I learnt that the block size is 64 MB by default; I lowered it to 2097152 (2 MB).
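For reference, changes like these would go into the respective site files as property entries; a sketch with the values from the steps above (note that mapred.tasktracker.map.tasks.maximum is an MRv1 TaskTracker setting and has no effect on a YARN cluster):

```xml
<!-- mapred-site.xml: values tried in steps 1 and 2
     (mapred.tasktracker.map.tasks.maximum is MRv1-only and is ignored under YARN) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>2097152</value>
</property>

<!-- hdfs-site.xml: value tried in step 3 -->
<property>
  <name>dfs.block.size</name>
  <value>2097152</value>
</property>
```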

In spite of all this, I see no change in behavior: it is still 2 map tasks at a time. I did not format my HDFS and reload the sequence file after step 3; I am not sure whether that is the reason.

Attached are my configuration files. Am I missing something here?

Thanks and Regards,
Samudra
hdfs-site.xml
mapred-site.xml
yarn-site.xml

Hans-Peter Zorn

Mar 29, 2014, 2:16:05 PM
to dkpro-big...@googlegroups.com
Hi Samudra,

did you put the files on HDFS again after lowering the block size? The block size is fixed once the file
has been written. You may specify it when uploading the data, without the need to change hdfs-site.xml:
hadoop fs -Ddfs.block.size=2097152 -put ..
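Spelled out as a full sequence, that might look like the following (the path is only an example; adjust it to your layout):

```shell
# Remove the old copy; its block size was fixed when it was first written
hadoop fs -rm /user/sabanerjee/perftest/input/part-00000
# Re-upload with a 2 MB block size set for this file only
hadoop fs -Ddfs.block.size=2097152 -put part-00000 /user/sabanerjee/perftest/input/
# Verify how many blocks the file now occupies
hadoop fsck /user/sabanerjee/perftest/input/part-00000 -files -blocks
```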

Best,
-ho

Samudra Banerjee

Mar 30, 2014, 1:30:46 AM
to dkpro-big...@googlegroups.com
Hi Hans,

Thanks for the info. I uploaded the sequence file once again and checked its properties using fsck. Here is the output:

Connecting to namenode via http://localhost:50070
FSCK started by sabanerjee (auth:SIMPLE) from /127.0.0.1 for path /user/sabanerjee/perftest/input/part-00000 at Sun Mar 30 00:20:48 EDT 2014
.Status: HEALTHY
 Total size:    8448509 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      5 (avg. block size 1689701 B)
 Minimally replicated blocks:   5 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     1.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          1
 Number of racks:               1
FSCK ended at Sun Mar 30 00:20:48 EDT 2014 in 6 milliseconds

So there are 5 blocks, as expected. However, the number of parallel map tasks is still 2. I also noticed that the total number of map tasks is now 6, even though it is set to 5 in DkproHadoopDriver.

I searched around a bit more and made the following config changes:

1. Set yarn.nodemanager.resource.cpu-vcores to 5 in yarn-site.xml.
2. Removed the option mapreduce.map.memory.mb, which was set to 3072. Since YARN is allocated 6 GB in yarn-site.xml, I thought maybe this limit was capping the number of mappers (my total memory is 8 GB).
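That guess about mapreduce.map.memory.mb seems plausible: YARN fits containers into the node's memory and vcore budget, so with about 6 GB for YARN and 3072 MB per map container, only two maps can run concurrently. A minimal sketch of that capacity arithmetic (simplified; real scheduling also rounds allocations to yarn.scheduler.minimum-allocation-mb and reserves room for the ApplicationMaster container):

```python
def max_concurrent_maps(node_memory_mb: int, node_vcores: int,
                        map_memory_mb: int, map_vcores: int = 1) -> int:
    """How many map containers fit on one node at once, ignoring the
    ApplicationMaster and minimum-allocation rounding (simplified model)."""
    return min(node_memory_mb // map_memory_mb, node_vcores // map_vcores)

# 6144 MB for YARN, 3072 MB per map: only 2 maps fit at a time
print(max_concurrent_maps(6144, 8, 3072))   # -> 2
# With a lower per-map memory of, say, 1024 MB, 6 would fit
print(max_concurrent_maps(6144, 8, 1024))   # -> 6
```

So removing (or lowering) the 3072 MB per-map setting should indeed lift the cap, as long as the node manager's memory and vcore limits leave enough headroom.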

My new configurations are at 


Any other thoughts?

Regards,
Samudra