Hi all,
I've posted a new release of Spark, 0.5.0, which has now become the master branch. The purpose of this release is twofold:
1) Merge the Mesos 0.9 support into the mainline to make it possible to run on the latest releases of Mesos.
2) Provide a stable branch for the current Spark codebase, in preparation for some refactorings we're going to merge in next. These were developed in the Spark Streaming project, which adds support for stream processing to Spark and speeds up several core components to make this possible. (There's a short paper on it at
http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf). We haven't merged the actual streaming code yet because some aspects are still in progress, but we've merged in the core engine improvements in a branch called "dev".
Another important change is that this release switches the default Hadoop version to 0.20.205.0, which supports HDFS security. This will likely *not* work with older versions of HDFS, but you can change the Hadoop version you build against by editing project/SparkBuild.scala and recompiling (sbt/sbt clean compile). The Mesos configuration in spark-env.sh has also changed -- there's now a variable called MESOS_NATIVE_LIBRARY instead of MESOS_HOME (see
https://github.com/mesos/spark/wiki/Running-Spark-on-Mesos).
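As a rough illustration of the spark-env.sh change, here is a minimal sketch. The library path is only an assumed example location, not something the release requires; point it at wherever the Mesos native library is actually installed on your machines (the wiki page above has the details).

```shell
# spark-env.sh (sketch; the path below is an assumed example location)
# MESOS_NATIVE_LIBRARY replaces the old MESOS_HOME setting.
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# If you changed the Hadoop version in project/SparkBuild.scala,
# rebuild from the Spark directory afterwards with:
#   sbt/sbt clean compile
```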
The release also includes a number of bug fixes, as well as built-in scripts for launching on EC2, so you no longer need to download Mesos separately for that.
You can download the release as a zip at
https://github.com/mesos/spark/zipball/v0.5.0, or just check out the master branch from Git. (Further fixes to 0.5 will be committed to the master branch.) The old pre-0.9 Mesos code is still available in the branch old-mesos and will receive any major bug fixes to 0.5.
I also wanted to provide a quick sketch of what will be in the next release. Apart from refactoring to make the code simpler, the dev branch also includes a much-improved cache (now called the Block Store), which supports a different storage level per RDD (including in-memory, serialized, on-disk, and replicated); a significantly faster shuffle; faster network communication using Akka; straggler mitigation through speculative execution; and faster scheduling for short jobs. In addition, we're working on code to make Spark easier to deploy, both through a pure-Java "standalone" cluster mode and through bindings to run on Hadoop NextGen. We'll make the next release as soon as these are ready and tested, likely in about a month. You can try out some of these things now by checking out the "dev" branch, but much of it is not finalized or documented yet.
Finally, I want to say thanks to all the contributors who made this possible! This release saw contributions both from new people at Berkeley and new external contributors, and was guided by issue reports and feature requests from the whole community.
Matei