pbdR vs. SparkR


Sebastian Konietzny

Dec 2, 2015, 7:24:08 AM
to rbigdatap...@googlegroups.com
Hey!

I am curious about how pbdR and SparkR differ in their applications. As far as I know, SparkR is a wrapper for a map-reduce framework. pbdR, in contrast, implements the SPMD paradigm. But what does this mean in practice? Isn't map-reduce also an SPMD technique, in the sense that the same operations are executed on different data sets?

I am looking for an easy-to-use framework that allows me to be flexible in the choice of program execution on i) a single laptop, ii) a compute cluster, or iii) a multi-core shared-memory architecture. For example, I have used base R for the first scenario and snow for the second, and I thought that SparkR would be a better choice than snow for the future. But what are the benefits of pbdR?

The effort needed to adapt the R code when switching between scenarios i) to iii) should be as small as possible.


Ostrouchov, George

Dec 2, 2015, 11:41:00 AM
to rbigdatap...@googlegroups.com
Yes, Spark is a map-reduce framework. Probably the most comprehensive comparison to date of MPI and map-reduce is at http://arxiv.org/abs/1403.1528

One way to look at it is that MPI gives you a choice of explicit communication modes that have vendor and architecture optimizations. Map-reduce hides communication from the user and uses only one kind: the shuffle, which is conceptually equivalent to a block matrix transpose (alltoall or alltoallv under MPI). Some things, like indexing documents and serving document queries from many users, fit this framework perfectly and can be very efficient. Many divide-recombine computations also fit, but they are not as efficient as they would be with an MPI-based implementation that takes advantage of any associative and commutative properties of the recombine step.
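
To make the two communication patterns above concrete, here is a minimal sketch in plain Python that only simulates ranks with lists. It is not pbdR, MPI, or Spark code; the helper names `alltoall` and `tree_reduce` are made up for this illustration. The first function shows why a shuffle looks like a block matrix transpose, and the second shows how an associative recombine step can be done in about log2(n) pairwise rounds instead of routing everything through a shuffle.

```python
import operator

def alltoall(send):
    """Simulate an alltoall: send[i][j] is the block rank i sends to rank j.

    After the exchange recv[j][i] == send[i][j], i.e. a block transpose.
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

def tree_reduce(values, op):
    """Combine one value per rank pairwise, level by level.

    This regrouping is valid only because op is associative.
    """
    values = list(values)
    rounds = 0
    while len(values) > 1:
        paired = [op(values[k], values[k + 1])
                  for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:  # an odd rank out is carried to the next round
            paired.append(values[-1])
        values = paired
        rounds += 1
    return values[0], rounds

# Four simulated ranks; the block "i->j" travels from rank i to rank j.
send = [[f"{i}->{j}" for j in range(4)] for i in range(4)]
recv = alltoall(send)

# Recombine step: add up one partial sum per rank, for eight simulated ranks.
partials = [sum(range(r * 10, (r + 1) * 10)) for r in range(8)]
total, rounds = tree_reduce(partials, operator.add)

print(recv[2])
print(total, rounds)
```

In this sketch every rank ends up holding column j of the original block layout after the exchange, while the sum over eight ranks needs only three combining rounds.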

Lots of examples of pbdR use are in pbdDEMO. For a couple of examples of adapting R code that we have not integrated into pbdDEMO yet, from a talk I gave recently in Tokyo, see http://rbigdata.github.io/misc.html. The examples should run on all three architectures that you mention, when MPI and pbdR packages are installed.

George

--
Programming with Big Data in R
Simplifying Scalability
http://r-pbd.org/
---
You received this message because you are subscribed to the Google Groups "RBigDataProgramming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rbigdataprogram...@googlegroups.com.
To post to this group, send email to rbigdatap...@googlegroups.com.
Visit this group at http://groups.google.com/group/rbigdataprogramming.
To view this discussion on the web visit https://groups.google.com/d/msgid/rbigdataprogramming/90d8de65-0e2d-4b1d-b69e-a43f9685d9ed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sebastian Konietzny

Dec 2, 2015, 11:58:34 AM
to RBigDataProgramming
Thanks a lot! I am looking forward to reading the referenced paper. Best, Sebastian


Dale Wang

Dec 9, 2015, 12:25:54 AM
to rbigdataprogramming
Hello!

I tried SparkR and pbdR for some use cases (logistic regression and a back-propagation neural network) some months ago, so I can share some of my own experience with pbdR vs. SparkR.

1. About pbdR
First, the SPMD programming model used by MPI is the core programming model in pbdR, so you need to understand SPMD to use pbdR. On top of SPMD, pbdR provides two API levels: the pbdDMAT package and the pbdMPI package.
If your program consists mainly of linear algebra operations (that is, it relies almost entirely on matrix operations), adapting R code to pbdR is easy. pbdDMAT provides an API that is close to native R matrix syntax, so adapting your code mostly means changing the matrix initialization. We have a small logistic regression training program that is 27 lines of code in native R. The pbdR version differs from the original by only one line.

However, if you have performance requirements, or you have special operations that cannot be expressed with linear algebra operators, you need to learn MPI programming and pbdMPI. By explicitly designing your program in the MPI programming model, you will gain performance.

2. About SparkR
Actually, SparkR has a different set of APIs before and after Spark 1.4. Before Spark 1.4, SparkR was a wrapper for the Spark RDD API. From Spark 1.4 on, SparkR focuses mainly on the data frame API.
Adopting SparkR in your program means that you need to redesign your whole algorithm explicitly in the Spark RDD programming model. The good thing is that migrating your code from R to Scala later is then easy.

As for your question of whether map-reduce is also an SPMD technique: yes, you can implement MapReduce in MPI code. In fact, you can implement a MapReduce-style program in pbdR through explicit MPI programming. If you do so, you will gain performance.
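
As a rough illustration of a MapReduce-style computation written the SPMD way, here is a small word count in plain Python. The ranks are only simulated with list comprehensions, and the names `owner`, `map_phase`, and `reduce_phase` are hypothetical helpers for this sketch, not pbdMPI or Spark functions. Every simulated rank runs the same code on its own slice of the input, sends each key to the rank that owns it, and merges what it receives.

```python
from collections import Counter
import zlib

def owner(word, size):
    """Deterministically pick the rank responsible for a key."""
    return zlib.crc32(word.encode("utf-8")) % size

def map_phase(rank, size, lines):
    """Same code on every rank, different data: count the words in the
    local chunk, then split the counts into one block per destination."""
    local = Counter(w for line in lines[rank::size] for w in line.split())
    outgoing = [Counter() for _ in range(size)]
    for word, count in local.items():
        outgoing[owner(word, size)][word] = count
    return outgoing

def reduce_phase(incoming):
    """Merge the blocks received from all ranks. Addition is associative
    and commutative, so the arrival order does not matter."""
    merged = Counter()
    for block in incoming:
        merged.update(block)
    return merged

lines = ["spark maps and reduces", "mpi sends and receives", "spark and mpi"]
size = 3

send = [map_phase(rank, size, lines) for rank in range(size)]
# The shuffle: rank j receives block j from every rank i (an alltoall).
recv = [[send[i][j] for i in range(size)] for j in range(size)]
per_rank = [reduce_phase(recv[j]) for j in range(size)]

# Gather the per-rank results only to display them.
word_counts = Counter()
for part in per_rank:
    word_counts.update(part)
print(dict(word_counts))
```

Each word ends up on exactly one rank after the exchange, which is the same guarantee a map-reduce shuffle gives to its reducers.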

The following figure shows the results of a performance comparison between a pbdR-based system, "iPLAR", and Spark MLlib. The "iPLAR(R)" system relies mainly on the pbdDMAT package. The "iPLAR-Data Parallel (R)" system adopts the data-parallel model and relies on a MapReduce-like framework implemented with pbdDMAT and pbdMPI. As we can see, by adopting the data-parallel programming model, pbdR programs achieve nearly the same performance as Spark MLlib (which runs on the JVM). The same program written in SparkR performs worse than Spark MLlib, so we did not include it in the figure.
If you rely entirely on linear algebra operations (that is, you use only the functions provided by pbdDMAT), the migration cost from serial code is small, but performance is much poorer.
Learning some MPI is useful when you want to take full advantage of pbdR.
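
For readers unfamiliar with the data-parallel model mentioned above, the following is a minimal sketch in plain Python, assuming a toy data set and simulated ranks. It is not the iPLAR code and it does not use pbdDMAT or pbdMPI; `allreduce_sum` is a hypothetical stand-in for an MPI allreduce with a sum operation. Each rank computes the logistic regression gradient on its own rows only, and one summing step gives every rank the full gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(rows, labels, w):
    """Sum of per-example gradients of the logistic loss (not averaged)."""
    g = [0.0] * len(w)
    for x, y in zip(rows, labels):
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for k, xk in enumerate(x):
            g[k] += err * xk
    return g

def allreduce_sum(vectors):
    """Stand-in for an allreduce with op = sum: every rank gets the total."""
    total = [sum(col) for col in zip(*vectors)]
    return [list(total) for _ in vectors]

# Toy data (an assumption for this sketch); first column is the intercept.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, 2.0], [1.0, -0.5], [1.0, 3.0]]
y = [0, 1, 0, 1, 0, 1]
size, lr = 3, 0.1

w = [0.0, 0.0]
for _ in range(200):
    # Each simulated rank sees only its own rows (a cyclic split here).
    local = [gradient(X[r::size], y[r::size], w) for r in range(size)]
    g = allreduce_sum(local)[0]  # identical on every rank after the sum
    w = [wi - lr * gi / len(X) for wi, gi in zip(w, g)]

# Check: the summed local gradients match a serial gradient over all rows.
serial = gradient(X, y, w)
parallel = allreduce_sum(
    [gradient(X[r::size], y[r::size], w) for r in range(size)])[0]
print(w, serial, parallel)
```

The only communication per iteration is one small vector sum, which is why this layout can be cheaper than expressing every step as a distributed matrix operation.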



Best wishes,
Dale (Zhaokang) Wang
 


Dale Wang

Dec 9, 2015, 12:27:17 AM
to RBigDataProgramming

Here is the figure that was missing from my last reply.


[Figure: performance comparison between iPLAR (pbdR) and Spark MLlib; image not preserved]

Sebastian Konietzny

Dec 12, 2015, 5:34:28 AM
to RBigDataProgramming
Wow, thanks a lot for your detailed answer. Great insights.