By Intel SSG STO BDT
This document fills the gaps in enabling Big-Bench on Spark SQL. So far, Spark local, standalone, and yarn-client modes are supported without modifying any Big-Bench code. All enabling steps in this document are based on CDH 5.3 and CentOS 6.4.
As our enabling work is based on the Cloudera Distribution of Apache Hadoop (CDH), please refer to the CDH documentation for hardware requirements and best practices.
Before running Big-Bench, verify that the following software is installed and configured on your machines:
· CDH 5.3 or higher
Note that CDH can be deployed via Cloudera Manager (CM) to set up a working Hadoop environment easily.
The following set of supported software dependencies must be installed:
· HDFS, YARN
· Spark
· Hive
· Mahout
· JDK 1.7 (64-bit recommended). If you use the parcel installation method, a suitable JDK is installed along with CDH.
· Python
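As a quick sanity check before proceeding, a small shell loop can confirm that the expected client commands are on the PATH. The command names below are typical for a CDH client node and are assumptions; adjust them to your deployment:

```shell
#!/bin/sh
# Report whether each client tool Big-Bench relies on is available.
# The command names assume a typical CDH client node; adjust as needed.
for tool in hadoop yarn spark-submit hive mahout java python; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```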
If you are using Spark 1.3 or higher, you can skip this section.
Before you run Big-Bench on Spark SQL, please make sure your Spark includes the following patches; none of the Big-Bench queries can pass without them. Note that Spark 1.2.1, the latest Spark GA release at the time of writing, does not include these patches.
For better performance, we also highly recommend applying the following performance-related patches.
· SPARK-4570 This patch enables map join for left semi join. It can greatly improve the performance of Big-Bench queries that use left semi join.
If your Spark does not include these patches, please apply them yourself. We applied these patches on top of the CDH5.3.0 release tag. You can fetch the patches or pull requests from the above Apache JIRA links. After merging the code, build Spark according to the Building Spark guide.
All these fixes are based on Spark master (Spark 1.3 so far), so if you want to apply them to Spark 1.2.x, please note the following differences.
· org.apache.spark.sql.types.StringType in Spark 1.3 should be replaced with org.apache.spark.sql.catalyst.types.StringType in Spark 1.2.x
Don’t forget to provision your patched Spark jar to all Spark nodes.
Please add the following parameters to your spark-defaults.conf:
· spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native
· spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native
· spark.driver.extraJavaOptions -XX:PermSize=128M -XX:MaxPermSize=512M (required if you run Big-Bench on Spark at the 1TB data scale)
· spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
· spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
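These properties can also be appended with a small idempotent snippet. This is only a sketch: the conf-file path and the HADOOP_INSTALL_DIR/HIVE_INSTALL_DIR/HIVE_CONF_DIR defaults below are placeholder assumptions for your environment.

```shell
#!/bin/sh
# Append each required property to spark-defaults.conf unless a line
# for that key already exists. All default paths are placeholders.
CONF="${CONF:-./spark-defaults.conf}"
HADOOP_INSTALL_DIR="${HADOOP_INSTALL_DIR:-/opt/hadoop}"
HIVE_INSTALL_DIR="${HIVE_INSTALL_DIR:-/opt/hive}"
HIVE_CONF_DIR="${HIVE_CONF_DIR:-/opt/hive}"

add_prop() {  # add_prop <key> <value>
    grep -q "^$1" "$CONF" 2>/dev/null || printf '%s=%s\n' "$1" "$2" >>"$CONF"
}

touch "$CONF"
add_prop spark.driver.extraLibraryPath   "$HADOOP_INSTALL_DIR/hadoop/lib/native"
add_prop spark.executor.extraLibraryPath "$HADOOP_INSTALL_DIR/hadoop/lib/native"
add_prop spark.driver.extraJavaOptions   "-XX:PermSize=128M -XX:MaxPermSize=512M"
add_prop spark.executor.extraClassPath   "$HIVE_INSTALL_DIR/hive/lib/*:$HIVE_CONF_DIR/hive/conf"
add_prop spark.driver.extraClassPath     "$HIVE_INSTALL_DIR/hive/lib/*:$HIVE_CONF_DIR/hive/conf"
```

Running the snippet a second time leaves the file unchanged, so it is safe to include in a node-setup script.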
Please follow the Big-Bench preparation guide to install and configure Big-Bench.
Edit the following portion of $BIG-BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf to point to the Spark SQL binary that Big-Bench will use.
BINARY="$SPARK_INSTALL_DIR/spark/bin/spark-sql"
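If you prefer to script this edit, a sed one-liner can rewrite the BINARY line in place. This is a sketch: the demo creates a stand-in conf file first, and both ENGINE_CONF and SPARK_INSTALL_DIR are placeholder paths.

```shell
#!/bin/sh
# Rewrite the BINARY= line of engineSettings.conf to point at spark-sql.
# ENGINE_CONF and SPARK_INSTALL_DIR are placeholder paths for this demo.
ENGINE_CONF="${ENGINE_CONF:-./engineSettings.conf}"
SPARK_INSTALL_DIR="${SPARK_INSTALL_DIR:-/opt/spark}"

# Stand-in conf file so the example is self-contained.
[ -f "$ENGINE_CONF" ] || echo 'BINARY="/usr/bin/hive"' >"$ENGINE_CONF"

sed -i "s|^BINARY=.*|BINARY=\"$SPARK_INSTALL_DIR/spark/bin/spark-sql\"|" "$ENGINE_CONF"
cat "$ENGINE_CONF"
```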
So far the Big-Bench driver still has some functionality issues, so please edit the following portion of $BIG-BENCH_INSTALL_DIR/Big-Bench/conf/userSettings.conf as a workaround.
export BIG_BENCH_DEFAULT_ENGINE="spark"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
Please also change BINARY_PARAMS in $BIG-BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf.
BINARY_PARAMS="-v --driver-memory <memory for Spark driver> --executor-memory <memory per Spark executor> --master <spark master URL> --deploy-mode <spark deploy mode> --jars <jars used by big-bench queries> --files <files used by big-bench queries>"
· --driver-memory: we recommend 4g for the 1TB data scale.
· --executor-memory: we recommend 20g for the 1TB data scale.
· --master: please refer to the Spark SQL help.
· --deploy-mode: please refer to the Spark SQL help. (So far, Big-Bench supports local mode, standalone mode, and yarn-client mode.)
· --jars: please add the following jars.
  · From the Big-Bench install dir: opennlp-maxent, opennlp-tools, bigbenchqueriesmr
  · From the Hive install dir: hive-common, hive-cli, hive-exec, hive-service, hive-metastore, libfb303, jdo-api, antlr-runtime, datanucleus-api, datanucleus-core, datanucleus-rdbms, derby
  Please also add your hive-site.xml.
· --files: please add the following files.
  · From the Big-Bench install dir: reducer_q3.py, q4_reducer1.py, q4_reducer2.py, q8_reducer.py, reducer_q30.py, reducer_q29.py
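Putting the table together, a fully assembled BINARY_PARAMS for yarn-client mode at the 1TB scale could look like the sketch below. Every jar and file path is an illustrative placeholder and the lists are abbreviated; substitute the real locations from your Hive and Big-Bench install dirs, and include every jar and file from the table above.

```shell
#!/bin/sh
# Compose an example BINARY_PARAMS string. All paths are placeholders,
# and JARS/FILES are abbreviated relative to the full lists above.
HIVE_LIB="${HIVE_INSTALL_DIR:-/opt/hive}/hive/lib"
BB_DIR="${BIG_BENCH_INSTALL_DIR:-/opt/Big-Bench}"

JARS="$HIVE_LIB/hive-exec.jar,$HIVE_LIB/hive-metastore.jar,$BB_DIR/jars/bigbenchqueriesmr.jar"
FILES="$BB_DIR/files/reducer_q3.py,$BB_DIR/files/q8_reducer.py"

BINARY_PARAMS="-v --driver-memory 4g --executor-memory 20g \
--master yarn --deploy-mode client --jars $JARS --files $FILES"
echo "$BINARY_PARAMS"
```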
If you run Big-Bench in Spark standalone or yarn mode, please provision your Big-Bench directory to all worker nodes.
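Provisioning can be scripted; the dry-run sketch below just prints the copy command it would run for each worker. The hostnames are placeholders, and you would remove the echo to actually copy with rsync:

```shell
#!/bin/sh
# Dry run: print one rsync command per worker node. The hostnames are
# placeholders; remove 'echo' to perform the copy for real.
BB_DIR="${BIG_BENCH_INSTALL_DIR:-/opt/Big-Bench}"
for host in worker1 worker2 worker3; do
    echo rsync -a "$BB_DIR/" "$host:$BB_DIR/"
done
```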
Please follow the Big-Bench guide to run the driver. Note that the option '-e spark' is required to run Big-Bench on Spark SQL.
E.g., to run query #1, use the following command:
"$INSTALL_DIR/bin/bigBench" runQuery -q 1 -e spark
· Big-Bench does not support Spark yarn-cluster mode.
· SPARK-5707 This bug causes an exception when spark.sql.codegen is enabled while running some Big-Bench queries.
· SPARK-5791 This bug severely degrades the performance of joins with an ON condition.
https://github.com/jameszhouyi/Spark-for-BigBench.git