How to use BigBench on Spark SQL


Yi Yao

Feb 24, 2015, 9:26:36 PM
to big-...@googlegroups.com

Preparation


Applying Spark Patches

Before you can run Big-Bench on Spark SQL, please make sure your Spark build includes the following patches:

SPARK-5202

SPARK-5237

SPARK-5364

Note that Spark 1.2.1, the latest Spark GA release at the time of writing, does not include the above patches.

 

Configuring Spark

Please add the following parameters to your spark-defaults.conf:

·         spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=512M    (needed if you want to run Big-Bench on Spark at the 1 TB data scale)

·         spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf

·         spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
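
Put together, the relevant portion of spark-defaults.conf would look as follows. Note that spark-defaults.conf is not processed by a shell, so the ${...} placeholders must be replaced with the actual paths on your system:

spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native
spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native
spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=512M
spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf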

 

Installing and Configuring Big-Bench

Please follow the Big-Bench preparation guide to install and configure Big-Bench.

 

Configuring and running Big-Bench on Spark

Configuring Big-Bench

Edit the following portion of $BIG_BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf to point to the Spark SQL binary that Big-Bench will use:

BINARY="$SPARK_INSTALL_DIR/spark/bin/spark-sql"

 

Please also change BINARY_PARAMS in $BIG_BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf:

BINARY_PARAMS="-v --driver-memory <memory for Spark driver> --executor-memory <memory per Spark executor> --master <spark master URL> --deploy-mode <spark deploy mode> --jars <jars used by big-bench queries> --files <files used by big-bench queries>"

--driver-memory

Recommended: use 4g for the 1 TB data scale

--executor-memory

Recommended: use 20g for the 1 TB data scale

--master

Please refer to the Spark SQL help (spark-sql --help)

--deploy-mode

Please refer to the Spark SQL help. (So far, Big-Bench supports local, standalone, and yarn-client modes.)

--jars

Please add the following jars:

opennlp-maxent, opennlp-tools, bigbenchqueriesmr, hive-common, hive-cli, hive-exec, hive-service, hive-metastore, libfb303, jdo-api, antlr-runtime, datanucleus-api, datanucleus-core, datanucleus-rdbms, derby

 

Please also add your hive-site.xml.

 

--files

Please add the following files:

reducer_q3.py, q4_reducer1.py, q4_reducer2.py,  q8_reducer.py,  reducer_q30.py, reducer_q29.py

 

If you run Big-Bench in Spark standalone or YARN mode, please provision your Big-Bench directory to all worker nodes (a sketch follows below).
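
A minimal sketch of such provisioning, assuming passwordless SSH and a workers.txt file with one worker hostname per line (both are assumptions, not part of Big-Bench):

# Mirror the Big-Bench install dir to every worker node.
for host in $(cat workers.txt); do
  rsync -az "$BIG_BENCH_INSTALL_DIR/" "$host:$BIG_BENCH_INSTALL_DIR/"
done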

 

Running Big-Bench on Spark

Please follow the BigBench guide to use the BigBench driver. Note that the option '-e spark' is required if you want to run BigBench on Spark SQL.
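
For example, to run query #1 with the Spark engine:

"$INSTALL_DIR/bin/bigBench" runQuery -q 1 -e spark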


By Intel SSG STO BDT

Bhaskar Gowda

Feb 24, 2015, 9:28:30 PM
to big-...@googlegroups.com
Thanks, Josh. Can you please put this document on the GitHub site?

Manuel Danisch

Feb 26, 2015, 11:04:35 AM
to big-...@googlegroups.com
Hi,

Thanks a lot for your work on Spark. Good to know that our scripting solution works as intended regarding Spark integration. :)

As a side note, if you want to use Spark as the default execution engine, you can set

export BIG_BENCH_DEFAULT_ENGINE="spark"

in conf/userSettings.conf

Best regards,
Manuel

Max Beer

Mar 19, 2015, 7:12:03 PM
to big-...@googlegroups.com
Hi

Thanks for this documentation.

Queries 4 and 30 get stuck when I run them with Spark. Have you experienced similar issues?
No error message is displayed. Maybe it's some Spark setting?

Kind regards
Max

Bhaskar Gowda

Mar 20, 2015, 12:10:18 AM
to big-...@googlegroups.com
Max, can you please verify your settings against the instructions below? If this doesn't work, Yi Yao is the expert; he should take a look and be able to guide you.



Introduction

This document fills the gaps in enabling Big-Bench on Spark SQL. So far, Spark local, standalone, and yarn-client modes are supported without modifying any Big-Bench code. All the enabling steps in this document are based on CDH 5.3 and CentOS 6.4.

 

Hardware Requirements

As our enabling experiment is based on the Cloudera Distribution of Apache Hadoop (CDH), please refer to the CDH documentation for hardware requirements and best practices.

 

Software Requirements

Before running Big-Bench, verify that the following software is installed and configured on your machines:

·         CDH 5.3 or higher

Note that CDH can be deployed via Cloudera Manager (CM) to easily set up a working Hadoop environment.

 

Software Dependencies

The following set of supported software dependencies must be installed:

·         HDFS, YARN

·         Spark

·         Hive

·         Mahout

·         JDK 1.7 is required; 64-bit is recommended. A suitable JDK is installed along with CDH if you use the parcel installation method.

·         Python

 

Preparation

Applying Spark Patches

If you are using Spark 1.3 or higher, you can skip this section.

Before you run Big-Bench on Spark SQL, please make sure your Spark build includes the following patches; none of the Big-Bench queries can pass without them.

·         SPARK-5202

·         SPARK-5237

·         SPARK-5364

For better performance, we highly recommend applying the following performance-related patches.

·         SPARK-4570      This patch enables map join for left semi join. It can greatly improve the performance of Big-Bench queries that use left semi join.

If your Spark does not include these patches, please apply them yourself. We applied these patches on top of the CDH 5.3.0 release tag. You can fetch the patches or pull requests from the Apache JIRA links above. After merging the code, please build Spark according to the Building Spark guide; a sketch follows below.
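
A rough sketch of this workflow; the release tag and patch file names are placeholders, not real artifact names, and the build profiles follow the Building Spark guide:

# In your Spark source tree, start from the CDH 5.3.0 release tag:
git checkout <cdh5.3.0-release-tag>
# Apply the patches downloaded from the JIRA issues above:
git apply SPARK-5202.patch SPARK-5237.patch SPARK-5364.patch SPARK-4570.patch
# Rebuild Spark with Hive support:
mvn -Phive -Phive-thriftserver -DskipTests clean package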

All these fixes are based on Spark master (Spark 1.3 so far). So, if you want to apply them to Spark 1.2.x, please note the following difference:

·         org.apache.spark.sql.types.StringType in Spark 1.3 should be replaced with org.apache.spark.sql.catalyst.types.StringType in Spark 1.2.x

Don’t forget to provision your patched Spark jar to all Spark nodes.

 

Configuring Spark

Please add the following parameters to your spark-defaults.conf:

·         spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=512M    (needed if you want to run Big-Bench on Spark at the 1 TB data scale)

·         spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf

·         spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf

 

So far the Big-Bench driver still has some functionality issues, so please edit the following portion of $BIG_BENCH_INSTALL_DIR/Big-Bench/conf/userSettings.conf as a workaround:

export BIG_BENCH_DEFAULT_ENGINE="spark"

 

export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"

 

 

Please also change BINARY_PARAMS in $BIG_BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf:

BINARY_PARAMS="-v --driver-memory <memory for Spark driver> --executor-memory <memory per Spark executor> --master <spark master URL> --deploy-mode <spark deploy mode> --jars <jars used by big-bench queries> --files <files used by big-bench queries>"

 

--driver-memory

Recommended: use 4g for the 1 TB data scale

--executor-memory

Recommended: use 20g for the 1 TB data scale

--master

Please refer to the Spark SQL help (spark-sql --help)

--deploy-mode

Please refer to the Spark SQL help. (So far, Big-Bench supports local, standalone, and yarn-client modes.)

--jars

Please add the following jars:

·         opennlp-maxent

·         opennlp-tools

·         bigbenchqueriesmr

You can find the above jars in the Big-Bench install dir.

 

·         hive-common

·         hive-cli

·         hive-exec

·         hive-service

·         hive-metastore

·         libfb303

·         jdo-api

·         antlr-runtime

·         datanucleus-api

·         datanucleus-core

·         datanucleus-rdbms

·         derby

You can find the above jars in the Hive install dir.

 

Please also add your hive-site.xml.

--files

Please add the following files:

·         reducer_q3.py

·         q4_reducer1.py

·         q4_reducer2.py

·         q8_reducer.py

·         reducer_q30.py

·         reducer_q29.py

You can find these files in the Big-Bench install dir.
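
As an illustration only, a filled-in BINARY_PARAMS for yarn-client mode might look like the following. The memory values come from the recommendations above; the jar versions and paths are examples and depend on your installation, and the angle-bracket placeholders must be filled in:

BINARY_PARAMS="-v --driver-memory 4g --executor-memory 20g --master yarn --deploy-mode client --jars ${BIG_BENCH_INSTALL_DIR}/engines/hive/queries/Resources/opennlp-maxent-3.0.3.jar,${BIG_BENCH_INSTALL_DIR}/engines/hive/queries/Resources/opennlp-tools-1.5.3.jar,${BIG_BENCH_INSTALL_DIR}/engines/hive/queries/Resources/bigbenchqueriesmr.jar,<comma-separated Hive jars from ${HIVE_INSTALL_DIR}/hive/lib> --files <hive-site.xml and the reducer .py files, comma separated>"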

 

If you run Big-Bench in Spark standalone or YARN mode, please provision your Big-Bench directory to all worker nodes.

 

Running Big-Bench on Spark

Please follow the Big-Bench guide to run the driver. Note that the option '-e spark' is required.

E.g., if you want to run query #1, please use the following command:

"$INSTALL_DIR/bin/bigBench" runQuery -q 1 –e spark

 

Known Limitations

Big-Bench Limitations

·         Big-Bench does not support Spark yarn-cluster mode.

Spark Limitations

·         SPARK-5707   This bug causes an exception if spark.sql.codegen is enabled while running some Big-Bench queries.

·         SPARK-5791   This bug severely degrades performance when using joins with ON clauses.

Yi Yao

Mar 20, 2015, 1:55:33 AM
to big-...@googlegroups.com
Hi Max,
Did you add the paths of q4_reducer1.py, q4_reducer2.py, and reducer_q30.py in BINARY_PARAMS?
If yes, did you specify spark.sql.shuffle.partitions for these two queries? I suggest increasing your partitions to 8000~10000 for them.
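
One way to do this (a sketch; the setting can also go into spark-defaults.conf) is to set it at the start of the spark-sql session or query file:

SET spark.sql.shuffle.partitions=8000;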

Regards,
Yi

Max Beer

Mar 20, 2015, 3:09:28 PM
to big-...@googlegroups.com
Hi

Thanks, guys. :)
Increasing spark.sql.shuffle.partitions was the right answer.

Because spark.sql.shuffle.partitions has to be set manually, and this setting seems to be strongly correlated with the size of the tables (scale factor), it is really hard to find a reasonable value for the different queries and scale factors.

In my case, setting spark.sql.shuffle.partitions to 10000 for query 30 worked for scale factor 1, but for scale factor 10 I had to raise it (with 20000 it worked fine). Probably for running query 30 with scale factor 1000 I have to raise it to a significantly higher value.
On the other hand, setting spark.sql.shuffle.partitions to 10000 was sufficient for query 29 with scale factor 10. 

I just have to figure out an appropriate value for this setting. I would prefer having a universal value for the different queries (3, 4, 29, 30) and scale factors, but I am not really sure if there is such a value.

Do you run your tests with scale factor 1000?
If so, what have you set for spark.sql.shuffle.partitions (universal or query specific)?

Kind regards
Max

Yi Yao

Mar 29, 2015, 10:35:10 PM
to big-...@googlegroups.com
"Do you run your tests with scale factor 1000?"
yes.


"If so, what have you set for spark.sql.shuffle.partitions (universal or query specific)?"
It depends on your data scale and hardware.

Steve Anderson

Sep 22, 2015, 1:22:22 PM
to Big Data Benchmark for BigBench
Hi,
I'm seeing this error when I use the Spark engine:

[root@hdp3ims bin]# ./bigBench runBenchmark
BigBench clean all                                      (this might take a while...)
BigBench clean all                                      finished. Time:          0h:00m:02s:498ms
BigBench engine validation: Data generation             (this might take a while...)
BigBench engine validation: Data generation             finished. Time:          0h:01m:15s:241ms
BigBench engine validation: Populate metastore          (this might take a while...)
==============
Benchmark run terminated
Reason: An error occured while running a command
==============
java.io.IOException: Error while running module populateMetastore. More information in corresponding logfile in /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/logs
        at io.bigdatabenchmark.v1.driver.BigBench.runModuleTimed(BigBench.java:710)
        at io.bigdatabenchmark.v1.driver.BigBench.run(BigBench.java:389)
        at io.bigdatabenchmark.v1.driver.RunBigBench.main(RunBigBench.java:52)
[root@hdp3ims bin]# java.io.IOException: Error while running module populateMetastore

If you can point me to where I should be looking to resolve this, I would really appreciate it.

I can configure it to run Hive without issues but have not had Spark run successfully.
I'm sure I must have done something wrong, and if you can point me to where I should start looking, that would be ideal.

Any advice or guidance would be really appreciated :)

Thanks

dafridgie

Steve Anderson

Sep 22, 2015, 3:05:54 PM
to Big Data Benchmark for BigBench
Ah, my bad, I haven't installed Spark standalone correctly; it's YARN-integrated, so I will rework that to resolve my errors.

Dafridgie

Steve Anderson

Sep 22, 2015, 4:27:38 PM
to Big Data Benchmark for BigBench
Hi, I have now built out Spark standalone in my cluster successfully. However, I am still seeing errors when trying to run BigBench with Spark.

Here is the error:

spark.driver.extraClassPath -> ${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
Classpath elements:
file:/benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/queries/Resources/opennlp-maxent-3.0.3.jar
file:/benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/queries/Resources/opennlp-tools-1.5.3.jar
file:/benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/queries/Resources/bigbenchqueriesmr.jar


java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:274)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.
An error occured while running command:
==========
runEngineCmd -f /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/population/hiveCreateLoad_decimal.sql
==========
Please check the log files for details
======= Load data into hive time =========
Start timestamp: 2015/09/22:21:19:23 1442953163
Stop  timestamp: 2015/09/22:21:19:25 1442953165
Duration:  0h 0m 2s
----- result -----
Load data FAILED exit code: 101
time&status: /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/logs/times.csv
full log: /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/logs/populateMetastore-run_query.log

I'm using Cloudera 5.4 with Spark 1.3

Any suggestions or guidance in getting this error resolved would be really appreciated 


Dafridgie



Yi Yao

Sep 23, 2015, 3:39:31 AM
to Big Data Benchmark for BigBench
Hi, please compile your Spark using Maven with the profiles -Phive and -Phive-thriftserver. Note that Spark 1.3 does not support Hive 1.1; we suggest using Hive 0.13.1. Besides, Spark 1.3 has several critical bugs which block some of the BigBench queries. We suggest compiling Spark from the following source tree:

https://github.com/jameszhouyi/Spark-for-BigBench.git
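
A minimal sketch of that build, assuming Maven is installed (the profile flags come from the error message above):

git clone https://github.com/jameszhouyi/Spark-for-BigBench.git
cd Spark-for-BigBench
mvn -Phive -Phive-thriftserver -DskipTests clean package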


Best Regards,
Yi

Steve Anderson

Sep 23, 2015, 8:08:45 AM
to Big Data Benchmark for BigBench
Hi Yi, thanks for the quick response. I had hoped to use the Cloudera 5.4 version of Spark, as I had interpreted the guidance above to say that the CDH parcel version of Spark 1.3 would work.
I'll build Spark manually using the method you describe and remove the CDH version of Spark before trying BigBench again.

Can you tell me if there are any plans to develop BigBench so that it can run natively on CDH without having to resort to a partially manual installation of the Spark and Hive components?
Ideally we would like to be able to run BigBench with a standard Cloudera Manager based CDH 5.x deployment using parcels.

Thanks for your time, Yi, it's really appreciated.

Michael Frank

Sep 23, 2015, 8:46:15 AM
to Big Data Benchmark for BigBench
Hi,
Spark versions < 1.4 are just too buggy and unstable, and therefore not able to execute all of the BigBench workloads. But the Spark team did a great job of fixing these issues in v1.4 and v1.5. For us there is no incentive to put a lot of effort into supporting an outdated Spark version when the fixes are already available in a higher version. It's just a matter of months until the major vendors release the next versions of their Hadoop distributions, including a more recent Spark version. Until then you have to upgrade your Spark version manually.

Cheers,
Michael