In terms of "production" clusters, we have been working with YARN for a while, but we are still using rmr 3.1.2; unfortunately, we haven't had a chance to upgrade to 3.2 yet. The absence of defaults in 3.1.2 worked fine for us, and when it didn't we found combinations that worked for specific memory-intensive jobs.

One problem I did run into was installing rmr 3.2 on the HortonWorks Data Platform sandbox (2.1, haven't had a chance to try 2.2 yet), which I use for teaching. The sandbox is a fully functional single-node cluster that lives in a virtual machine, so memory there is very limited (4 GB of RAM for the whole machine). When running code there, I include the following lines at the beginning of each script:

## hadoop specific definitions
rmr.options(backend.parameters = list(hadoop = list(D = "mapreduce.map.memory.mb=1024")))

The HortonWorks sandbox is a very limited platform, for teaching/training purposes only, so I wouldn't take its limitations as a real guideline for usability. On the other hand, these sandboxes (both HortonWorks and Cloudera offer one) are sometimes people's first foray into the world of Hadoop and R, so maybe the defaults should allow people to work with these very limited resources.

The extent to which I would 'compel' users to set these limits would be to have a warning in case the out-of-the-box defaults are used, similar to the "backend.parameters is deprecated" warning. As a novice user, I would prefer to do as little as possible.
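To make that concrete, this is roughly what a fuller set of overrides could look like on a memory-constrained single-node machine such as the sandbox. It is only a sketch: the reduce-side values and the -Xmx figure are illustrative, simply mirroring the parameter names discussed elsewhere in this thread, and would need tuning for any real cluster:

rmr.options(
  backend.parameters = list(
    hadoop = list(
      D = "mapreduce.map.memory.mb=1024",
      D = "mapreduce.reduce.memory.mb=1024",
      D = "mapreduce.map.java.opts=-Xmx400M",
      D = "mapreduce.reduce.java.opts=-Xmx400M"
    )
  )
)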
On Thu, Nov 27, 2014 at 12:16 AM, Antonio Piccolboni <picc...@gmail.com> wrote:
Hi,

Since users started switching to YARN we have had reports of tasks running out of memory. The problem seems to be that containers have a certain amount of memory allocated to them, most of it gets used up by Java, and none is left for R. In 3.2.0 the mapreduce call tries to modify some Hadoop settings so that the above is less likely to happen. The defaults are:

> str(rmr.options("backend.parameters"))
List of 1
 $ hadoop:List of 4
  ..$ D: chr "mapreduce.map.java.opts=-Xmx400M"
  ..$ D: chr "mapreduce.reduce.java.opts=-Xmx400M"
  ..$ D: chr "mapreduce.map.memory.mb=4096"
  ..$ D: chr "mapreduce.reduce.memory.mb=4096"

While these worked on our test clusters, other types of failure ensued: Java running out of heap space, container size exceeding max limits, etc. We are about to release rmr 3.3.0 and, while there probably are no settings that will work across all deployments and applications, if we could get it to a point where it works for most people, until they are experienced enough to take charge, and they don't get discouraged, that would be progress. Our current view is to keep only the *.java.opts settings, which should work unless the record size is large, and to go with the cluster defaults for the *.memory.mb settings, which are so hardware dependent. Another possibility is to compel the user to set these parameters before launching jobs, rather than trying to run with guesstimated values and letting people dig through the logs to figure out what went wrong. I would love to hear from those of you who have made the switch to YARN and MR2: are these settings working, and if not, how did you modify them? What would you like to see in 3.3.0? Thanks
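For anyone who wants to approximate the proposed 3.3.0 behavior ahead of the release, something along these lines should override only the per-task Java heap while leaving the *.memory.mb settings to the cluster defaults; this assumes rmr.options replaces the whole backend.parameters list, and the -Xmx400M value is just the current 3.2.0 default rather than a recommendation:

rmr.options(
  backend.parameters = list(
    hadoop = list(
      D = "mapreduce.map.java.opts=-Xmx400M",
      D = "mapreduce.reduce.java.opts=-Xmx400M"
    )
  )
)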