RHadoop (rmr2) job does not complete

RK

Dec 8, 2014, 2:34:00 PM
to rha...@googlegroups.com
I installed RHadoop on Hortonworks sandbox 2.1. I used the instructions of:
 
I was facing the problem that MR jobs started but would not complete. The map progress in the terminal stayed at 0% (and in Ambari it was stuck at 5%).
 
I noticed that several other people were having the same problem and that they gave up:
After several hours of struggling, I finally fixed it by changing the memory settings. The following resources were especially helpful:
The following settings worked for me:
- yarn.nodemanager.resource.memory-mb: 3072
- yarn.scheduler.minimum-allocation-mb: 512
- yarn.scheduler.maximum-allocation-mb: 3072
- yarn.nodemanager.vmem-pmem-ratio: 10
- mapreduce.map.memory.mb: 1024
- mapreduce.reduce.memory.mb: 1024
- yarn.app.mapreduce.am.resource.mb: 1024
- yarn.app.mapreduce.am.command-opts: -Xmx768m
- mapreduce.task.io.sort.mb: 512
- mapreduce.map.java.opts: -Xmx768m
- mapreduce.reduce.java.opts: -Xmx768m
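As a rough sanity check of how the numbers in this list relate to each other, here is a minimal R sketch. It assumes a single NodeManager (as on the sandbox) and a scheduler that rounds container requests up to a multiple of the minimum allocation. The variable names are illustrative, not Hadoop identifiers.

```r
# Values copied from the settings list above (MB, except the ratio).
node_mb      <- 3072  # yarn.nodemanager.resource.memory-mb
min_alloc_mb <- 512   # yarn.scheduler.minimum-allocation-mb
task_mb      <- 1024  # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb
am_mb        <- 1024  # yarn.app.mapreduce.am.resource.mb
heap_mb      <- 768   # -Xmx768m in the java.opts / command-opts settings
vmem_ratio   <- 10    # yarn.nodemanager.vmem-pmem-ratio

# Assumption: requests are rounded up to a multiple of the minimum
# allocation, so confirm these sizes are already exact multiples.
stopifnot(task_mb %% min_alloc_mb == 0, am_mb %% min_alloc_mb == 0)

# Task containers that fit next to one ApplicationMaster on the node.
task_slots <- (node_mb - am_mb) %/% task_mb

# Share of each task container taken by the JVM heap, and the virtual
# memory limit implied by the vmem-pmem ratio.
heap_fraction <- heap_mb / task_mb
vmem_cap_mb   <- task_mb * vmem_ratio

cat("task containers alongside the AM:", task_slots, "\n")
cat("JVM heap as fraction of container:", heap_fraction, "\n")
cat("virtual memory cap per task (MB):", vmem_cap_mb, "\n")
```

With these values, one ApplicationMaster plus two task containers fill the 3072 MB available to the NodeManager, and each JVM heap takes three quarters of its container, leaving the rest for non-heap memory.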
Notes:
- Beforehand, I increased the VirtualBox VM's memory from 4096 to 5291 MB
- When running R, I entered the following after loading rmr2:
    rmr.options(backend.parameters = list(
      hadoop = list(D = "mapreduce.map.memory.mb=1024", D = "mapreduce.reduce.memory.mb=1024")
    ))
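If more properties need to be passed this way, the repeated `D = "key=value"` entries can be generated rather than typed out. The following is a minimal sketch in base R; `hadoop_d_args` is a hypothetical helper, not part of rmr2, and the final `rmr.options` call is left commented out so the snippet runs without a Hadoop installation.

```r
# Hypothetical helper (not part of rmr2): turn a named character vector of
# Hadoop properties into a list whose elements are all named "D" and hold
# "key=value" strings, matching the shape used in the call above.
hadoop_d_args <- function(props) {
  args <- as.list(paste0(names(props), "=", props))
  names(args) <- rep("D", length(args))
  args
}

# Property values are kept as strings so they are pasted verbatim.
props <- c("mapreduce.map.memory.mb"    = "1024",
           "mapreduce.reduce.memory.mb" = "1024")

d_args <- hadoop_d_args(props)
str(d_args)

# With rmr2 loaded, this would mirror the call shown above:
# rmr.options(backend.parameters = list(hadoop = d_args))
```

Keeping the properties in one named vector makes it easier to see which values R overrides per job and which are left to the cluster configuration.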
I would like to know if my settings are optimal for HDP 2.1 Sandbox, or if further optimization is possible.

Antonio Piccolboni

Jan 5, 2015, 12:06:59 PM
to rha...@googlegroups.com
Thanks for your report and list of links. These settings are complex for sure and not particularly friendly to non-Java applications like the ones we are running. In the latest rmr2 release we added a help entry, help(hadoop.settings), to try and collect information about them and how to set them. It's a first attempt and we'll use feedback to try and make progress. As for your optimality question, it's way beyond what I am able to answer, but I'd suggest that optimality depends on the application and sometimes even on the specific input. So my opinion is that your question is ill-posed. Finally, the sandbox is a learning environment; before you start thinking about optimizing your application, you should switch to a real cluster.