Building a Random Forest model with SparkR


Sean Farrell

Mar 19, 2015, 8:15:26 PM
to spark...@googlegroups.com
Hi all,

I've been working on building a Random Forest model on my modest AWS cluster (6 nodes, each with 16 cores and 32 GB RAM) using Spark. My training set is pretty big (~30,000 rows and ~2,000 columns), so it won't run on a single node. I had previously cobbled together some code to build the model as a MapReduce job, and now I've hacked together some code to do the same thing in Spark (see below). It takes about 15 seconds to build a model on my cluster using this code (the equivalent MapReduce implementation took about 15 minutes).

In this example the training set lives in HDFS as a CSV file with no column or row names. The column names are instead saved in the file train.names.rda, which is loaded in before building the model. To combine the individual random forests into a single model at the end, I use the my_combine function that was helpfully provided by Joran at http://stackoverflow.com/questions/19170130/combining-random-forests-built-with-different-training-sets-in-r.
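
(For anyone not familiar with combining forests: the randomForest package's built-in combine() shows the basic idea, as in the untested snippet below, which merges two small forests grown on the same data. As I understand it, my_combine is essentially combine() with its consistency checks relaxed so that forests grown on different subsets of the data can be merged.)

library(randomForest)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 10)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 10)
both <- combine(rf1, rf2)
both$ntree   # 20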

I'm sure there's a better way of doing this, so please feel free to suggest improvements (or alternative approaches)!


### START

# Load libraries
library(randomForest)
library(SparkR)

# Set spark memory allocation (important even though I set the spark executor and driver memories when initializing the Spark context, as without it I run into Java heap space errors)
Sys.setenv(SPARK_MEM="40g")

# Initialise Spark context 
sc <- sparkR.init(master = "yarn-client",
                  sparkEnvir = list(spark.executor.memory = "24g",
                                    spark.driver.memory = "24g",
                                    spark.executor.instances = "5",
                                    spark.executor.cores = "16",
                                    spark.driver.maxResultSize = "10g"))

# Pass randomForest package to all nodes
includePackage(sc, randomForest)


### Build randomForest models of 10 trees each for entire data set sliced into 50 chunks (so 500 trees in the forest)

# Load the column names of the training set
load("train.names.rda")

# Define RDD with 50 slices
rdd <- textFile(sc, "hdfs:<server_address>/user/spark/spark_training.csv", 50)

# Define mapPartition function to build randomForest models
rf <- mapPartitions(rdd, function(input) {

  # Parse this partition's lines into a data frame
  df <- data.frame(do.call('rbind', strsplit(as.character(input), ',', fixed = TRUE)))

  # Set the data frame column names
  names(df) <- train.names

  # Fix the data type to numeric for all but the classification column
  for (i in 1:(ncol(df) - 1)) {
    df[, i] <- as.numeric(as.character(df[, i]))
  }

  # Get rid of quotation marks in the classification column
  df$MyClass <- as.factor(gsub("\"", "", df$MyClass))

  # Build a 10-tree RF model on this partition's slice of the data
  randomForest(df[, -ncol(df)], df[, ncol(df)], ntree = 10)
})

# Trigger the job and collect the per-partition models
output.rf <- collect(rf)

# Define a function that splits the collected output back into individual
# models and combines them into one uber model. collect() returns the models
# flattened into their components, and a classification randomForest object
# has 18 components, hence the chunks of 18.
uberRF <- function(x) {
        rf.list <- list()
        j <- 1
        for (i in 1:(length(x) / 18)) {
                temp <- x[j:(i * 18)]
                class(temp) <- "randomForest"
                rf.list[[length(rf.list) + 1]] <- temp
                j <- j + 18
        }

        # Combine the models together
        final.rf <- do.call(my_combine, rf.list)
        final.rf
}

# Combine the RF models
forest <- uberRF(output.rf)

### END
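
(In case it's useful: the combined object should behave like an ordinary randomForest model for scoring, so something like the line below ought to work. test.df here is just a placeholder for held-out data parsed into the same column layout as the training set.)

preds <- predict(forest, newdata = test.df)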

Shivaram Venkataraman

Mar 20, 2015, 1:49:00 PM
to Sean Farrell, spark...@googlegroups.com, Evan Sparks
Hi Sean

Great to know that your random forest implementation is faster now. I am not much of an expert on random forests, but I asked some other machine learning experts in the AMPLab about this:

1. There is a distributed random forest implementation in Spark's MLlib. Unfortunately we don't have R bindings for it yet, but Dan Putler has been working on this and we should have it soon.

2. For the native R version, one thing that might improve the statistical properties is sampling each partition with replacement from the original dataset instead of simply partitioning it. This is similar to creating bootstrap samples of the original data. However, it's tricky to create these samples directly using RDDs. BTW, can your data be loaded into the memory of a single machine? If so, you could create 'N' partitions by sampling with replacement on a single machine and then call parallelize on the 'N' partitions, along the lines of the sketch below. However, I don't know whether this overhead will be worth the improvement in accuracy.
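
An untested sketch of what I mean, assuming the full training data fits in a driver-side data frame (called 'train' here, with 50 partitions to match your example; all of these names are made up):

# Draw 50 bootstrap samples (each the size of the original data) on the driver
boots <- lapply(1:50, function(i) train[sample(nrow(train), replace = TRUE), ])

# Distribute them, one bootstrap sample per partition
rdd <- parallelize(sc, boots, 50)

# Each partition now holds a ready-made data frame rather than raw CSV lines,
# so the mapPartitions function only has to fit the forest (assuming one
# sample lands in each partition)
rf <- mapPartitions(rdd, function(input) {
  df <- input[[1]]
  randomForest(df[, -ncol(df)], df[, ncol(df)], ntree = 10)
})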

Thanks
Shivaram




Gonzalo Andrés Moreno Gómez

Jul 3, 2016, 8:27:00 PM
to SparkR Developers
Hi Sean:

Where do I find the function includePackage(sc, randomForest)?

Regards,

Gonzalo