This is to announce a new package, dplyr.spark, which allows you to use Spark as a backend for dplyr.
dplyr is an uber-popular data manipulation package that lets you do all sorts of data manipulation on data frames and, with some limitations, on all sorts of relational databases. This is a 0.1.1 version and, while I went through the dplyr tutorial with few problems, I would not bet my production systems on it quite yet, as there is no test suite in place. I also have to warn you that the setup is somewhat complicated, including a custom build of Spark, which was pretty smooth for your humble developer but may not sit well with your 1000-node cluster admin. But it certainly is a good time to kick the tires, provide feedback and maybe do some research work with it. We are of course looking into ways to make the setup smoother, one approach being to make this package a smashing success and then lobby the main distributors en masse to provide JDBC connectivity right out of the box. I need your help to do this.
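To give a feel for what a session looks like once the setup is in place, here is a minimal sketch. Take the connection helper src_SparkSQL() and its arguments as indicative rather than definitive, and the host, port and table name as placeholders for your own cluster and data:

    library(dplyr.spark)
    library(dplyr)

    # connect to the thrift server exposed by the custom Spark build
    # ("localhost" and 10000 are placeholders for your own setup)
    my_db <- src_SparkSQL(host = "localhost", port = 10000)

    # from here on it is ordinary dplyr: the verbs are translated to SQL
    # and executed by SparkSQL
    flights <- tbl(my_db, "flights")
    flights %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay)) %>%
      collect()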
You may want to better understand the relation between dplyr.spark and plyrmr's Spark backend. They are externally similar, but the internals are completely different. The plyrmr Spark backend relies on a package, now part of the main Spark project, called SparkR, which allowed distributed R code execution, including user-specified functions. This means you could write things like mutate(data, whatever.function(col1, col2, col3)) and have it run in Spark. It also meant quite a bit of serialization and deserialization overhead, the potential to run into R efficiency issues (small group syndrome), memory limits (large group syndrome), the data being opaque to Spark, and so on. It meant full power, but also quite a bit of complexity and a performance penalty. Then the SparkR project took a sudden change of direction whereby all the functions that plyrmr relied upon are now private, and critical bugs in SparkR that plyrmr needed fixed in order to work properly are no longer high priority. Some of this functionality may be brought back in version 1.5 (see, and please comment on, https://issues.apache.org/jira/browse/SPARK-7230 if you believe distributed R execution is important for your use case), but nothing is certain. It's unfortunate, but we are talking about a pre-release project building upon another pre-release project. It appears our bet was premature.
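To make the distinction concrete, here is a toy, purely local illustration of the kind of expression that needs distributed R execution. delay_score is an arbitrary function I just made up: it has no SQL counterpart, so only a backend that executes R on the workers, like the SparkR-based one in plyrmr, could run it on the cluster.

    library(dplyr)

    # an arbitrary user-defined R function: nothing in SQL corresponds to it
    delay_score <- function(dep, arr, dist) (arr - dep) / dist

    local_flights <- data.frame(dep_delay = c(5, 20),
                                arr_delay = c(10, 15),
                                distance  = c(300, 1200))

    # works on a local data frame, and plyrmr could distribute it over Spark;
    # a SQL-translating backend cannot
    mutate(local_flights, score = delay_score(dep_delay, arr_delay, distance))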
dplyr.spark represents a different approach. It doesn't rely on distributed R execution but on the tried and true SQL translation layer in dplyr and the SparkSQL execution layer. The data flows through optimized Scala code without ever touching the R interpreter (with limited exceptions, of course). What you can do is limited to SQL-ish operations, and you can forget about user-defined functions, at least in the short term, but you get the full SparkSQL speed and scalability. What it does, it does very well, and it is a much simpler piece of code than rmr2, SparkR or plyrmr. So in a way it's a less ambitious approach, but a simpler one, and it could be very effective in many use cases. Please don't hesitate to send your comments and feedback.
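As a parting sketch, and continuing the hypothetical connection from the first example, every verb in the pipeline below has a direct SQL counterpart, so the whole thing is translated and pushed down to SparkSQL; dplyr's explain() should print the generated query without pulling any data into R.

    flights %>%
      filter(dep_delay > 0) %>%
      group_by(carrier) %>%
      summarise(n = n(), avg_gain = mean(arr_delay - dep_delay)) %>%
      explain()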
Antonio