All,
I would like to get your opinions on how RHadoop and Spark fit together. RHadoop is wonderful, and rmr2/plyrmr run on top of Hadoop MapReduce.
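(For context, here is a minimal sketch of the kind of rmr2 job I mean — the standard squaring example from the rmr2 tutorials, run with the `local` backend so no cluster is needed:)

```r
library(rmr2)

# Use the local backend so this runs without a Hadoop cluster
rmr.options(backend = "local")

# Push a small vector into the (local) DFS
small.ints <- to.dfs(1:10)

# Each map call receives a key/value pair; emit the value and its square
result <- mapreduce(input = small.ints,
                    map = function(k, v) keyval(v, v^2))

# Pull the results back into R as a key/value structure
from.dfs(result)
```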
However, based on what I hear from people in the industry, the next big thing seems to be Spark, and it is going to largely replace Hadoop MapReduce (not Hadoop as a whole, just the MapReduce part). People often describe Spark as "in-memory," which is not the whole story: it degrades gracefully when data no longer fits in memory, spilling to disk and continuing to work. So although I have not personally tried Spark (or the SparkR bindings to R), it seems like a pretty cool thing, especially for iterative algorithms. And if you really have non-iterative jobs over data twenty times the size of your memory, the thing of the future seems to be Apache Tez rather than MapReduce.
So what I've heard (and please correct me if I'm wrong) is that for new development, Hadoop MapReduce is going to be replaced mostly by Spark or Tez, because there no longer seems to be a use case where MapReduce is superior. (I'm talking about new projects, not existing infrastructure, where MapReduce will continue to be used and supported.)
That leaves me with the question of how RHadoop, which is largely (rmr2/plyrmr) built on top of MapReduce, will fit into the picture going forward.
To be clear, I love RHadoop and I firmly believe that
Antonio has done an absolutely superb job integrating Hadoop with R. I'm just curious how this all fits together. I've searched the archives, and there was a brief mention of Spark in 2013
here, but it was more about Apache Hama. (I'm not sure whether Hama is taking off as well, but it looks like Spark has far more momentum than Hama right now.)
Thanks,
M