Hello!
I tried SparkR and pbdR for some use cases (logistic regression and a back-propagation neural network) a few months ago, so I can share some of my own experience with pbdR vs. SparkR.
1. About pbdR.
Firstly, the SPMD programming model used by MPI is the core programming model in pbdR, so you need to understand SPMD to use pbdR. On top of SPMD, pbdR provides two API levels: the pbdDMAT package and the pbdMPI package.
If your program mainly consists of linear algebra operations (that is, it relies almost exclusively on matrix operations), adapting R code to pbdR code is easy. pbdDMAT provides near-native R matrix-style APIs, so the main change is usually in your matrix initialization code. We have a small logistic regression training program of 27 lines of native R; the pbdR version differs from the original by only one line.
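As an illustration (a minimal sketch of my own, not our actual 27-line program), here is a gradient-descent logistic regression step in native R. With pbdDMAT, the loop body would stay the same; the change is in how the input matrices are initialized.

```r
# Minimal logistic regression by gradient descent in native R.
# Illustrative sketch only, not the 27-line program mentioned above.
sigmoid <- function(z) 1 / (1 + exp(-z))

logistic_train <- function(X, y, lr = 0.1, iters = 100) {
  w <- matrix(0, ncol(X), 1)
  for (i in seq_len(iters)) {
    p <- sigmoid(X %*% w)               # predictions
    grad <- t(X) %*% (p - y) / nrow(X)  # gradient of the log-loss
    w <- w - lr * grad
  }
  w
}

# With pbdDMAT the training loop is unchanged; the main difference is
# distributing the inputs, e.g. X <- as.ddmatrix(X), after
# library(pbdDMAT); init.grid() under mpirun.
```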
However, if your program has performance requirements, or it needs special operations that cannot be expressed with linear algebra operators, you will have to learn MPI programming and pbdMPI. By explicitly designing your program in the MPI programming model, you will gain performance.
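A minimal SPMD sketch with pbdMPI (assuming pbdMPI is installed and the script is launched with something like `mpirun -np 4 Rscript script.R`): every rank runs the same code on its own slice of data, and `allreduce()` combines the partial results.

```r
# SPMD-style script: every MPI rank executes this same code.
# Assumes pbdMPI is installed; launch via: mpirun -np 4 Rscript this_script.R
library(pbdMPI)
init()

# Each rank works on its own slice of the data.
my.rank <- comm.rank()          # 0, 1, 2, ...
local.x <- (my.rank * 10 + 1):(my.rank * 10 + 10)
local.sum <- sum(local.x)

# Combine the partial results across all ranks.
total <- allreduce(local.sum, op = "sum")
comm.print(total, rank.print = 0)  # print once, from rank 0

finalize()
```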
2. About SparkR
Actually, SparkR before and after Spark 1.4 has different sets of APIs. Before Spark 1.4, SparkR was a wrapper around the Spark RDD APIs; from Spark 1.4 on, SparkR focuses mainly on the DataFrame API. Adopting SparkR means you have to redesign your whole algorithm explicitly in the Spark RDD programming model. The good thing is that migrating such code from R to Scala later is easy.
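For comparison, here is a minimal SparkR DataFrame-style sketch in the post-1.4 API (assuming Spark with SparkR is installed; the entry-point function names below follow the Spark 1.4-era API and changed in later releases).

```r
# SparkR DataFrame-style sketch (Spark >= 1.4 API).
library(SparkR)
sc <- sparkR.init(master = "local")   # Spark 1.4-era entry point
sqlContext <- sparkRSQL.init(sc)

# Create a Spark DataFrame from a local R data.frame.
df <- createDataFrame(sqlContext, faithful)

# DataFrame-style operations instead of RDD-level map/reduce.
long.waits <- filter(df, df$waiting > 70)
head(summarize(groupBy(long.waits, long.waits$waiting),
               count = n(long.waits$waiting)))

sparkR.stop()
```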
As for your question about MapReduce also being an SPMD technique: yes, you can implement MapReduce in MPI code. In fact, you can implement a MapReduce-style program in pbdR via explicit MPI programming, and if you do so, you will see a performance gain.
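A minimal sketch of that MapReduce-style pattern in pbdMPI (again assuming an mpirun launch; `load.local.chunk` is a hypothetical helper standing in for whatever reads each rank's share of the data):

```r
# MapReduce-style pattern via explicit MPI with pbdMPI (sketch).
# Launch with: mpirun -np 4 Rscript this_script.R
library(pbdMPI)
init()

# "Map": each rank applies a function to its local chunk of records.
# load.local.chunk() is a hypothetical helper, not a pbdMPI function.
local.records <- load.local.chunk(comm.rank())
local.counts  <- sapply(local.records, nchar)

# "Reduce": merge the per-rank partial results into a global result.
global.total <- allreduce(sum(local.counts), op = "sum")

comm.print(global.total, rank.print = 0)
finalize()
```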
The following figure shows the results of a performance comparison between a pbdR-based system, "iPLAR", and Spark MLlib. The "iPLAR (R)" system mainly relies on the pbdDMAT package, while the "iPLAR-Data Parallel (R)" system adopts the data-parallel model and relies on a MapReduce-like framework implemented with pbdDMAT and pbdMPI. As we can see, by adopting the data-parallel programming model, pbdR programs achieve nearly the same performance as Spark MLlib (implemented on the JVM). The same programs written in SparkR perform worse than Spark MLlib, so we did not include them in the figure.
If you rely entirely on linear algebra operations (that is, you only use functions provided by pbdDMAT), the migration cost from serial code is small, but performance will be much poorer. Learning some MPI is worthwhile if you want to take full advantage of pbdR.
Best wishes,
Dale(Zhaokang) Wang
On Wednesday, December 2, 2015 at 8:24:08 PM UTC+8, Sebastian Konietzny wrote: