New to Hadoop: Will it help with R's high memory requirements during processing?

James Wayne

Oct 25, 2015, 2:23:33 PM
to RHadoop
R holds data in memory while it works on it, and in my current scenario that is already taking well over 300GB, so I'm starting to research new ways of managing a single R environment's need for 500GB+ of memory.
As I understand it, Hadoop will help with the storage side, but when that data is read back into an algorithm for processing, the local R environment will still need the same amount of memory to run my operations and analysis on it. Am I missing something here? I understand that Hadoop will give me faster access to more data, but how do people deal with my scenario? I'm currently building a cluster with Rmpi to see whether it will solve R's memory requirements.
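Roughly the pattern I'm experimenting with is sketched below - the shard paths and the partial_fit() step are just placeholders for my real code - with each worker reading only its own piece of the data:

library(Rmpi)

# One worker per data shard (8 here, purely for illustration).
mpi.spawn.Rslaves(nslaves = 8)

# Placeholder for the real per-shard computation; make it available on
# the workers.
partial_fit <- function(shard) colSums(shard)
mpi.bcast.Robj2slave(partial_fit)

# Each worker reads only its own shard, so no single R process needs
# the full 500GB at once.
partial_results <- mpi.parLapply(1:8, function(i) {
  shard <- readRDS(sprintf("/data/shard_%02d.rds", i))
  partial_fit(shard)
})

# Combine the per-shard results on the master.
combined <- Reduce(`+`, partial_results)

mpi.close.Rslaves()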

Thanks for your suggestions and ideas.

spe...@spencerboucher.com

Oct 25, 2015, 4:08:28 PM
to rha...@googlegroups.com

James,

All the benefits of using Hadoop to store your data disappear if you
then try to pull it all into memory at once. You will need a version of
your data-processing algorithm that is specifically designed to work
with a lot of data and that explicitly avoids loading everything into
memory at the same time. There are lots of libraries for this kind of
thing these days, and fortunately many of them have R bindings. Look
into Spark, H2O, etc. (I'm sure others will have more suggestions).
What specifically are you trying to accomplish?
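To give a concrete flavor of what I mean: with the h2o package, for
instance, the data frame lives in the H2O cluster rather than in your R
session, and R only holds a reference to it. A rough sketch (the file
path and column names are made up):

library(h2o)

# Start (or connect to) an H2O cluster; with several nodes the data is
# spread across their combined memory.
h2o.init()

# The frame is loaded into the H2O cluster, not into your R session;
# 'training' is just a handle.
training <- h2o.importFile("hdfs://namenode/data/training.csv")

# The model is also trained inside the cluster.
model <- h2o.glm(x = c("x1", "x2", "x3"),
                 y = "outcome",
                 training_frame = training,
                 family = "gaussian")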

James Wayne

Oct 25, 2015, 5:29:18 PM
to RHadoop
Thanks Spencer,

I'm working on a machine learning project, so it isn't really feasible to paste all the R code here. R makes a lot of very complex things easy - and that pulls a load of data into memory - so it seems to make sense to give R access to that much of the data all at once. Do these other approaches require rewriting the R algorithms? I'll wait for more people to pipe in with suggestions. There are some cool things out there, like http://www.teraproc.com/front-page-posts/r-on-demand/ - but that would only give R a single instance's memory (30GB) at once.

spe...@spencerboucher.com

Oct 25, 2015, 7:37:36 PM
to rha...@googlegroups.com

No need to paste any code, just let us know which algorithms you are
using. Unless you can get hold of a machine with 500GB of RAM, you are
going to have to turn to alternative implementations. Fortunately, some
machine learning techniques are relatively easy to parallelize; it's
even possible that the packages you are using already provide that
ability.
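To give one concrete example of a chunk-wise implementation: the biglm
package fits a linear model by updating it one chunk at a time, so only
the current chunk has to sit in R's memory. A rough sketch (the file
layout and formula here are made up):

library(biglm)

# One file per chunk of the training data.
chunk_files <- list.files("data/chunks", full.names = TRUE)

# Fit on the first chunk, then fold the remaining chunks in one at a
# time; memory use stays at roughly one chunk's worth.
fit <- biglm(outcome ~ x1 + x2, data = read.csv(chunk_files[1]))
for (f in chunk_files[-1]) {
  fit <- update(fit, read.csv(f))
}

summary(fit)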