Hello. I am trying to scale my R analysis to handle larger and more numerous data sets, and I think I should use either RHadoop or parallel processing directly in my R code (the foreach and parallel packages). The problem involves performing singular value decomposition (SVD) many times (~10,000 times), where each matrix being decomposed is large (~10,000 x 10). I already have a complete analysis written in R that does this with a sequential for loop, but on much smaller matrices and fewer of them, so now I just need to scale it. In the future I may need to increase the matrix size and the number of decompositions even further, so I am looking for a solution that runs much faster and can easily be run in parallel.
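To make the problem concrete, here is a minimal sketch of the kind of workload I mean. This is not my actual code: the data is random, the sizes are reduced so it runs quickly, and `svd_one` is just a made-up name for one iteration of the loop. It compares a sequential `lapply` with `parLapply` from the base parallel package, assuming each SVD is independent of the others.

```r
library(parallel)

# Hypothetical stand-in for one iteration of the analysis: build a tall,
# thin matrix from a seed and return its singular values.
svd_one <- function(seed, n_rows = 1000, n_cols = 10) {
  set.seed(seed)
  m <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
  # nu = 0, nv = 0 skips the singular vectors; drop these if U and V are needed
  svd(m, nu = 0, nv = 0)$d
}

n_tasks <- 20  # would be ~10,000 in the real analysis

# Sequential version, equivalent to the current for loop
seq_result <- lapply(seq_len(n_tasks), svd_one)

# Parallel version: start worker processes, spread the tasks over them,
# then shut the workers down again
cl <- makeCluster(2)  # on a real machine, something like detectCores() - 1
par_result <- parLapply(cl, seq_len(n_tasks), svd_one)
stopCluster(cl)
```

Since every task sets its own seed, both versions should produce the same list of singular values, which makes it easy to check that the parallel run is doing the same work before scaling up the sizes.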
Currently I run my analysis in RStudio on a local Windows machine with 4 processors, and I also have access to a Linux server with 8 cores. I could use parallel processing directly in my R code on either of these machines, but I think I should use something that scales more easily to even larger matrices, more iterations, additional analyses, etc. Therefore, I created an Amazon Web Services account and tried to set up an Elastic MapReduce (EMR) cluster so that I could use RHadoop to distribute my data on their servers. I wanted to work through some examples before modifying my existing R analysis, so I followed a guide posted by Amazon (http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR), but I was unable to finish the exercise. Even starting an Amazon EMR cluster with R was difficult, because I could not get the cluster created correctly. I also tried to run some of the examples from the rmr2 tutorial (https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md), but could not, because I cannot find the R libraries needed to run the code on CRAN. Can someone point me to where I can learn more, ideally with more detailed step-by-step instructions? I would also be interested in a paid service or an online course that could help me get everything set up, so that I could then run my analyses by myself.
Thank you in advance for your assistance/suggestions.