Setting up RHadoop using Amazon EMR cluster

Kevin Baker

Jan 13, 2015, 3:52:11 PM
to rha...@googlegroups.com

Hello. I am trying to scale my R analysis to handle more and larger data, and I think I should use either RHadoop or parallel processing directly inside my R code (the foreach and parallel packages). The problem involves performing singular value decomposition (SVD) many times (~10,000 times), where each decomposed matrix is large (~10,000 x 10). I currently have a complete analysis written in R that uses a sequential for loop to do this on far fewer, much smaller matrices, so now I just need to scale it. In the future I may need to increase the matrix size and the number of calculations even further, so I need a solution that runs much faster and can easily be run in parallel.
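
For reference, the workload described above could be sketched with the parallel package roughly as follows. This is only a minimal sketch under stated assumptions, not the actual analysis: the matrix sizes are scaled down, the input is random data, and the names svd_one, n_matrices, and n_workers are hypothetical.

```r
library(parallel)

# Hypothetical sizes, scaled down from ~10,000 matrices of ~10,000 x 10
n_matrices <- 8
n_rows <- 1000
n_cols <- 10

# One decomposition. Seeding by task index makes the random stand-in
# data reproducible, so both runs below see the same matrices.
svd_one <- function(i, n_rows, n_cols) {
  set.seed(i)
  m <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
  svd(m)$d  # keep only the singular values
}

# Sequential baseline, equivalent to a plain for loop
seq_result <- lapply(seq_len(n_matrices), svd_one,
                     n_rows = n_rows, n_cols = n_cols)

# Parallel version on a socket cluster, which works on Windows and Linux
n_workers <- 2  # assumption; detectCores() reports what the machine has
cl <- makeCluster(n_workers)
par_result <- parLapply(cl, seq_len(n_matrices), svd_one,
                        n_rows = n_rows, n_cols = n_cols)
stopCluster(cl)
```

Because each decomposition is independent of the others, the same pattern should carry over to more workers or larger matrices, although whether it is fast enough at full scale would need to be measured.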

Currently I run my analysis in RStudio on a local Windows machine with 4 processors, and I also have access to a Linux server with 8 cores. I could use parallel processing directly in my R code on either machine, but I think I should use something that scales more easily to even larger matrices, more iterations, additional analyses, etc. Therefore, I created an Amazon Web Services account and tried to set up an Elastic MapReduce (EMR) cluster so I could use RHadoop to distribute my data across their servers. I wanted to work through some examples before modifying my existing R analyses, so I followed a guide posted by Amazon (http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR), but I was unable to finish the exercise. Even starting an Amazon EMR cluster with R was difficult, because I could not get the cluster created correctly. I also tried the examples in the rmr2 tutorial (https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md), but could not run them because I cannot find the required R libraries on CRAN. Can someone point me to where I can learn more, ideally with more detailed step-by-step instructions? Alternatively, I would be interested in a paid service or online course that helps me get set up so that I can then run my analyses myself.

Thank you in advance for your assistance/suggestions.     
