Building a Random Forest model with SparkR


Sean Farrell

Mar 19, 2015, 8:15:26 PM
to spark...@googlegroups.com
Hi all,

I've been working on building a Random Forest model on my modest AWS cluster (6 nodes, each with 16 cores and 32 GB RAM) using Spark. My training set is pretty big (~30,000 rows and ~2,000 columns), so it won't run on a single node. I had previously cobbled together some code to build the model as a MapReduce job, and now I've hacked together some code to do the same thing in Spark (see below). It takes about 15 seconds to build a model on my cluster using this code (the equivalent MapReduce implementation took about 15 minutes).

In this example the training set lives in HDFS as a CSV file with no column or row names. The column names are instead saved in the file train.names.rda, which is loaded in before building the model. To combine the individual random forests into a single model at the end, I use the my_combine function that was helpfully provided by Joran at http://stackoverflow.com/questions/19170130/combining-random-forests-built-with-different-training-sets-in-r.
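
(For anyone not familiar with combining forests: the randomForest package's built-in combine() shows the basic idea, as in the untested snippet below, which merges two small forests grown on the same data. As I understand it, my_combine is essentially combine() with its consistency checks relaxed so that forests grown on different subsets of the data can be merged.)

library(randomForest)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 10)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 10)
both <- combine(rf1, rf2)
both$ntree   # 20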

I'm sure there's a better way of doing this, so please feel free to suggest improvements (or alternative approaches)!


### START

# Load libraries
library(randomForest)
library(SparkR)

# Set spark memory allocation (important even though I set the spark executor and driver memories when initializing the Spark context, as without it I run into Java heap space errors)
Sys.setenv(SPARK_MEM="40g")

# Initialise Spark context 
sc <- sparkR.init(master = "yarn-client",
                  sparkEnvir = list(spark.executor.memory = "24g",
                                    spark.driver.memory = "24g",
                                    spark.executor.instances = "5",
                                    spark.executor.cores = "16",
                                    spark.driver.maxResultSize = "10g"))

# Pass randomForest package to all nodes
includePackage(sc, randomForest)


### Build randomForest models of 10 trees each for entire data set sliced into 50 chunks (so 500 trees in the forest)

# Load the column names of the training set
load("train.names.rda")

# Define RDD with 50 slices
rdd <- textFile(sc, "hdfs:<server_address>/user/spark/spark_training.csv", 50)

# Define mapPartition function to build randomForest models
rf <- mapPartitions(rdd, function(input) {

  # Parse this partition's lines into a data frame
  df <- data.frame(do.call('rbind', strsplit(as.character(input), ',', fixed = TRUE)))

  # Set the data frame column names
  names(df) <- train.names

  # Fix the data type to numeric for all but the classification column
  for (i in 1:(ncol(df) - 1)) {
    df[, i] <- as.numeric(as.character(df[, i]))
  }

  # Get rid of quotation marks in the classification column
  df$MyClass <- as.factor(gsub("\"", "", df$MyClass))

  # Build a 10-tree RF model on this partition's slice of the data
  randomForest(df[, -ncol(df)], df[, ncol(df)], ntree = 10)
})

# Trigger the job and collect the per-partition models
output.rf <- collect(rf)

# Define a function that splits the collected output back into individual
# models and combines them into one uber model. collect() returns the models
# flattened into their components, and a classification randomForest object
# has 18 components, hence the chunks of 18.
uberRF <- function(x) {
        rf.list <- list()
        j <- 1
        for (i in 1:(length(x) / 18)) {
                temp <- x[j:(i * 18)]
                class(temp) <- "randomForest"
                rf.list[[length(rf.list) + 1]] <- temp
                j <- j + 18
        }

        # Combine the models together
        final.rf <- do.call(my_combine, rf.list)
        final.rf
}

# Combine the RF models
forest <- uberRF(output.rf)

### END
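
(In case it's useful: the combined object should behave like an ordinary randomForest model for scoring, so something like the line below ought to work. test.df here is just a placeholder for held-out data parsed into the same column layout as the training set.)

preds <- predict(forest, newdata = test.df)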

Shivaram Venkataraman

Mar 20, 2015, 1:49:00 PM
to Sean Farrell, spark...@googlegroups.com, Evan Sparks
Hi Sean

Great to know that your random forest implementation is faster now. I am not much of an expert on random forests, but I asked some other machine learning experts in the AMPLab about this:

1. There is a distributed random forest implementation in Spark's MLlib. Unfortunately we don't have R bindings for it yet, but Dan Putler has been working on this and we should have it soon.

2. For the native R version, one thing that might improve the statistical properties is sampling each partition with replacement from the original dataset instead of simply partitioning it. This is similar to creating bootstrap samples of the original data. However, it's tricky to create these samples directly using RDDs. BTW, can your data be loaded into the memory of a single machine? If so, you could create 'N' partitions by sampling with replacement on a single machine and then call parallelize on the 'N' partitions, along the lines of the sketch below. However, I don't know whether this overhead will be worth the improvement in accuracy.
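
An untested sketch of what I mean, assuming the full training data fits in a driver-side data frame (called 'train' here, with 50 partitions to match your example; all of these names are made up):

# Draw 50 bootstrap samples (each the size of the original data) on the driver
boots <- lapply(1:50, function(i) train[sample(nrow(train), replace = TRUE), ])

# Distribute them, one bootstrap sample per partition
rdd <- parallelize(sc, boots, 50)

# Each partition now holds a ready-made data frame rather than raw CSV lines,
# so the mapPartitions function only has to fit the forest (assuming one
# sample lands in each partition)
rf <- mapPartitions(rdd, function(input) {
  df <- input[[1]]
  randomForest(df[, -ncol(df)], df[, ncol(df)], ntree = 10)
})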

Thanks
Shivaram




Gonzalo Andrés Moreno Gómez

Jul 3, 2016, 8:27:00 PM
to SparkR Developers
Hi Sean:

Where do I find the function includePackage(sc, randomForest)?

Regards,

Gonzalo