We are happy to announce Shark 0.8.0, which is a major release the brings many new capabilities and performance improvements. You can download the release here: https://github.com/amplab/shark/releases
We’ve implemented a new data serialization format that substantially improved shuffle performance in the case of large aggregations and joins. The new format is more CPU-efficient, while also reducing the size of the data sent across the network. This can improve performance by up to 3X for queries that have large aggregations or joins.
Memory is precious. To enable you fitting more data into memory, Shark now implements CPU-efficient compression algorithms, including dictionary encoding and run-length encoding. In addition to using less space in-memory compression actually improves the response time of many queries. This is because it reduces GC pressure and improves locality leading to better CPU cache performance. The compression ratio is workload-dependent, however, we have seen anywhere from 2X to 30X compression in real-workloads.
There is also no need to worry about picking the best compression scheme. When first loading the data into memory, Shark will automatically determine the best scheme to apply for the given dataset.
A typical query usually only looks at a small subset of overall data. Partition pruning allows Shark to skip looking at partitions that it knows for sure does not contain any data satisfying the query predicates. For one early user of Shark, this allowed query processing to skip examining 98% of the data.
Different from Hive's partitioning feature, partition pruning refers to Shark's usage of column statistics - collected during in-memory data materialization - to automatically reduce the number of RDD partitions that need to be scanned.
First and foremost, through its Spark 0.8.0 support, this new version of Shark supports a number of important features, including:
Spark’s internal job scheduler has been refactored and extended to include more sophisticated scheduling policies such as fair scheduling. The fair scheduler a fair scheduler allows multiple users to share an instance of Spark, which helps users running shorter jobs to achieve good performance, even when longer-running jobs are running in parallel.
Shark users can also take advantage of this new capability by setting the configuration variablespark.scheduler.cluster.fair.pool
to a specific scheduling pool at runtime. For example:
set mapred.fairscheduler.pool=short_query_pool;
select count(*) from my_shark_in_memory_table;
A continuous integration script has been added that would automatically fetch all the Shark dependencies (Scala, Hive, Spark) and execute both the Shark internal unit tests and the Hive compatibility unit tests. This has been used in various places as part of their Jenkins pipeline.
Users can now build Shark against specific versions of Hadoop without modifying the build file. Simply specify the Hadoop version using the SHARK_HADOOP_VERSION
environmental variable before running the build.
SHARK_HADOOP_VERSION=1.0.5 sbt/sbt package
We would like to thank Konstantin Boudnik, Jason Dai, Harvey Feng, Sarah Gerweck, Jason Giedymin, Cheng Hao, Mark Hamstra, Jon Hartlaub, Shane Huang, Nandu Jayakumar, Andy Konwinski, Haoyuan Li, Harold Lim, Raymond Liu, Antonio Lupher, Kay Ousterhout, Alexander Pivovarov, Sun Rui, Andre Schumacher, Mingfei Shi, Amir Shimoni, Ram Sriharsha, Patrick Wendell, Andrew Xia, Matei Zaharia, and Lin Zhao for their contributions to the release.