All,
I would like to get your opinions on how RHadoop and Spark fit together. RHadoop is wonderful, and rmr2/plyrmr run on top of Hadoop MapReduce.
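(For context, here is a minimal sketch of the kind of rmr2 job I mean — the standard squaring example from the rmr2 tutorials, run with the `local` backend so no cluster is needed:)

```r
library(rmr2)

# Use the local backend so this runs without a Hadoop cluster
rmr.options(backend = "local")

# Push a small vector into the (local) DFS
small.ints <- to.dfs(1:10)

# Each map call receives a key/value pair; emit the value and its square
result <- mapreduce(input = small.ints,
                    map = function(k, v) keyval(v, v^2))

# Pull the results back into R as a key/value structure
from.dfs(result)
```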
However, based on what I hear from people in the industry, the next big thing seems to be Spark, and it is going to largely replace Hadoop MapReduce (not Hadoop as a whole, just the MapReduce part). People often describe Spark as "in-memory," which is not the whole story: it degrades gracefully when data no longer fits in memory, spilling to disk and continuing to work. So although I have not personally tried Spark (or the SparkR bindings to R), it seems like a pretty cool thing, especially for iterative algorithms. And if you really have non-iterative jobs over data twenty times the size of your memory, the thing of the future seems to be Apache Tez rather than MapReduce.
So what I've heard (and please correct me if I'm wrong) is that for new development, Hadoop MapReduce is going to be replaced mostly by Spark or Tez, because there no longer seems to be a use case where MapReduce is superior. (I'm talking about new projects, not existing infrastructure, where MapReduce will continue to be used and supported.)
That leaves me with the question of how RHadoop, which is largely (rmr2/plyrmr) built on top of MapReduce, will fit into the picture going forward.
To be clear, I love RHadoop and I firmly believe that
Antonio has done an absolutely superb job integrating Hadoop with R. I'm just curious how this all fits together. I've searched the archives, and there was a brief mention of Spark in 2013
here, but it was more about Apache Hama. (I'm not sure whether Hama is taking off as well, but it looks like Spark has far more momentum than Hama right now.)
Thanks,
M