How to use BigBench on Spark SQL


Yi Yao

Feb 24, 2015, 9:26:36 PM
to big-...@googlegroups.com

Preparation


Applying Spark Patches

Before you can run Big-Bench on Spark SQL, please make sure your Spark build includes the following patches:

SPARK-5202

SPARK-5237

SPARK-5364

Note that Spark 1.2.1, the latest Spark GA release at the time of writing, does not include the above patches.

 

Configuring Spark

Please add the following parameters to your spark-defaults.conf:

·         spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=512M    (needed if you want to run Big-Bench on Spark at the 1 TB data scale)

·         spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf

·         spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
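
Put together, the relevant portion of spark-defaults.conf would look as follows. Note that spark-defaults.conf is not processed by a shell, so the ${...} placeholders must be replaced with the actual paths on your system:

spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native
spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native
spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=512M
spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf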

 

Installing and Configuring Big-Bench

Please follow the Big-Bench preparation guide to install and configure Big-Bench.

 

Configuring and running Big-Bench on Spark

Configuring Big-Bench

Edit the following portion of $BIG_BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf to point to the Spark SQL binary that Big-Bench will use:

BINARY="$SPARK_INSTALL_DIR/spark/bin/spark-sql"

 

Please also change BINARY_PARAMS in $BIG_BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf:

BINARY_PARAMS="-v --driver-memory <memory for Spark driver> --executor-memory <memory per Spark executor> --master <spark master URL> --deploy-mode <spark deploy mode> --jars <jars used by big-bench queries> --files <files used by big-bench queries>"

--driver-memory

Recommended: use 4g for the 1 TB data scale

--executor-memory

Recommended: use 20g for the 1 TB data scale

--master

Please refer to the Spark SQL help (spark-sql --help)

--deploy-mode

Please refer to the Spark SQL help. (So far, Big-Bench supports local, standalone, and yarn-client modes.)

--jars

Please add the following jars:

opennlp-maxent, opennlp-tools, bigbenchqueriesmr, hive-common, hive-cli, hive-exec, hive-service, hive-metastore, libfb303, jdo-api, antlr-runtime, datanucleus-api, datanucleus-core, datanucleus-rdbms, derby

 

Please also add your hive-site.xml.

 

--files

Please add the following files:

reducer_q3.py, q4_reducer1.py, q4_reducer2.py,  q8_reducer.py,  reducer_q30.py, reducer_q29.py

 

If you run Big-Bench in Spark standalone or YARN mode, please provision your Big-Bench directory to all worker nodes (a sketch follows below).
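
A minimal sketch of such provisioning, assuming passwordless SSH and a workers.txt file with one worker hostname per line (both are assumptions, not part of Big-Bench):

# Mirror the Big-Bench install dir to every worker node.
for host in $(cat workers.txt); do
  rsync -az "$BIG_BENCH_INSTALL_DIR/" "$host:$BIG_BENCH_INSTALL_DIR/"
done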

 

Running Big-Bench on Spark

Please follow the BigBench guide to use the BigBench driver. Note that the option '-e spark' is required if you want to run BigBench on Spark SQL.
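
For example, to run query #1 with the Spark engine:

"$INSTALL_DIR/bin/bigBench" runQuery -q 1 -e spark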


By Intel SSG STO BDT

Bhaskar Gowda

Feb 24, 2015, 9:28:30 PM
to big-...@googlegroups.com
Thanks, Josh. Can you please put this document on the GitHub site?

Manuel Danisch

Feb 26, 2015, 11:04:35 AM
to big-...@googlegroups.com
Hi,

Thanks a lot for your work on Spark. Good to know that our scripting solution works as intended regarding Spark integration. :)

As a side note, if you want to use Spark as the default execution engine, you can set

export BIG_BENCH_DEFAULT_ENGINE="spark"

in conf/userSettings.conf

Best regards,
Manuel

Max Beer

Mar 19, 2015, 7:12:03 PM
to big-...@googlegroups.com
Hi

Thanks for this documentation.

Queries 4 and 30 get stuck when I run them with Spark. Have you experienced similar issues?
No error message is displayed. Maybe it's some Spark setting?

Kind regards
Max

Bhaskar Gowda

Mar 20, 2015, 12:10:18 AM
to big-...@googlegroups.com
Max, can you please verify your settings against the instructions below? If this doesn't work, Yi Yao is the expert; he should take a look and be able to guide you.



Introduction

This document fills the gaps in enabling Big-Bench on Spark SQL. So far, Spark local, standalone, and yarn-client modes are supported without modifying any Big-Bench code. All the enabling steps in this document are based on CDH 5.3 and CentOS 6.4.

 

Hardware Requirements

As our enabling experiment is based on the Cloudera Distribution of Apache Hadoop (CDH), please refer to the CDH documentation for hardware requirements and best practices.

 

Software Requirements

Before running Big-Bench, verify that the following software is installed and configured on your machines:

·         CDH 5.3 or higher

Note that CDH can be deployed via Cloudera Manager (CM) to easily set up a working Hadoop environment.

 

Software Dependencies

The following set of supported software dependencies must be installed:

·         HDFS, YARN

·         Spark

·         Hive

·         Mahout

·         JDK 1.7 is required; 64-bit is recommended. A suitable JDK is installed along with CDH if you use the parcel installation method.

·         Python

 

Preparation

Applying Spark Patches

If you are using Spark 1.3 or higher, you can skip this section.

Before you run Big-Bench on Spark SQL, please make sure your Spark build includes the following patches; none of the Big-Bench queries can pass without them.

·         SPARK-5202

·         SPARK-5237

·         SPARK-5364

For better performance, we highly recommend applying the following performance-related patches.

·         SPARK-4570      This patch enables map join for left semi join. It can greatly improve the performance of Big-Bench queries that use left semi join.

If your Spark does not include these patches, please apply them yourself. We applied these patches on top of the CDH 5.3.0 release tag. You can fetch the patches or pull requests from the Apache JIRA links above. After merging the code, please build Spark according to the Building Spark guide; a sketch follows below.
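
A rough sketch of this workflow; the release tag and patch file names are placeholders, not real artifact names, and the build profiles follow the Building Spark guide:

# In your Spark source tree, start from the CDH 5.3.0 release tag:
git checkout <cdh5.3.0-release-tag>
# Apply the patches downloaded from the JIRA issues above:
git apply SPARK-5202.patch SPARK-5237.patch SPARK-5364.patch SPARK-4570.patch
# Rebuild Spark with Hive support:
mvn -Phive -Phive-thriftserver -DskipTests clean package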

All these fixes are based on Spark master (Spark 1.3 so far). So, if you want to apply them to Spark 1.2.x, please note the following difference:

·         org.apache.spark.sql.types.StringType in Spark 1.3 should be replaced with org.apache.spark.sql.catalyst.types.StringType in Spark 1.2.x

Don’t forget to provision your patched Spark jar to all Spark nodes.

 

Configuring Spark

Please add the following parameters to your spark-defaults.conf:

·         spark.driver.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.executor.extraLibraryPath=${HADOOP_INSTALL_DIR}/hadoop/lib/native

·         spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=512M    (needed if you want to run Big-Bench on Spark at the 1 TB data scale)

·         spark.executor.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf

·         spark.driver.extraClassPath=${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf

 

So far the Big-Bench driver still has some functionality issues, so please edit the following portion of $BIG_BENCH_INSTALL_DIR/Big-Bench/conf/userSettings.conf as a workaround:

export BIG_BENCH_DEFAULT_ENGINE="spark"

 

export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"

 

 

Please also change BINARY_PARAMS in $BIG_BENCH_INSTALL_DIR/Big-Bench/engines/spark/conf/engineSettings.conf:

BINARY_PARAMS="-v --driver-memory <memory for Spark driver> --executor-memory <memory per Spark executor> --master <spark master URL> --deploy-mode <spark deploy mode> --jars <jars used by big-bench queries> --files <files used by big-bench queries>"

 

--driver-memory

Recommended: use 4g for the 1 TB data scale

--executor-memory

Recommended: use 20g for the 1 TB data scale

--master

Please refer to the Spark SQL help (spark-sql --help)

--deploy-mode

Please refer to the Spark SQL help. (So far, Big-Bench supports local, standalone, and yarn-client modes.)

--jars

Please add the following jars:

·         opennlp-maxent

·         opennlp-tools

·         bigbenchqueriesmr

You can find the above jars in the Big-Bench install dir.

 

·         hive-common

·         hive-cli

·         hive-exec

·         hive-service

·         hive-metastore

·         libfb303

·         jdo-api

·         antlr-runtime

·         datanucleus-api

·         datanucleus-core

·         datanucleus-rdbms

·         derby

You can find the above jars in the Hive install dir.

 

Please also add your hive-site.xml.

--files

Please add the following files:

·         reducer_q3.py

·         q4_reducer1.py

·         q4_reducer2.py

·         q8_reducer.py

·         reducer_q30.py

·         reducer_q29.py

You can find these files in the Big-Bench install dir.
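
As an illustration only, a filled-in BINARY_PARAMS for yarn-client mode might look like the following. The memory values come from the recommendations above; the jar versions and paths are examples and depend on your installation, and the angle-bracket placeholders must be filled in:

BINARY_PARAMS="-v --driver-memory 4g --executor-memory 20g --master yarn --deploy-mode client --jars ${BIG_BENCH_INSTALL_DIR}/engines/hive/queries/Resources/opennlp-maxent-3.0.3.jar,${BIG_BENCH_INSTALL_DIR}/engines/hive/queries/Resources/opennlp-tools-1.5.3.jar,${BIG_BENCH_INSTALL_DIR}/engines/hive/queries/Resources/bigbenchqueriesmr.jar,<comma-separated Hive jars from ${HIVE_INSTALL_DIR}/hive/lib> --files <hive-site.xml and the reducer .py files, comma separated>"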

 

If you run Big-Bench in Spark standalone or YARN mode, please provision your Big-Bench directory to all worker nodes.

 

Running Big-Bench on Spark

Please follow the Big-Bench guide to run the driver. Note that the option '-e spark' is required.

E.g., if you want to run query #1, please use the following command:

"$INSTALL_DIR/bin/bigBench" runQuery -q 1 –e spark

 

Known Limitations

Big-Bench Limitations

·         Big-Bench does not support Spark yarn-cluster mode.

Spark Limitations

·         SPARK-5707   This bug causes an exception if spark.sql.codegen is enabled while running some Big-Bench queries.

·         SPARK-5791   This bug severely degrades performance when using joins with ON clauses.

Yi Yao

Mar 20, 2015, 1:55:33 AM
to big-...@googlegroups.com
Hi Max,
Did you add the paths of q4_reducer1.py, q4_reducer2.py, and reducer_q30.py in BINARY_PARAMS?
If yes, did you specify spark.sql.shuffle.partitions for these two queries? I suggest increasing your partitions to 8000~10000 for them.
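
One way to do this (a sketch; the setting can also go into spark-defaults.conf) is to set it at the start of the spark-sql session or query file:

SET spark.sql.shuffle.partitions=8000;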

Regards,
Yi

Max Beer

Mar 20, 2015, 3:09:28 PM
to big-...@googlegroups.com
Hi

Thanks, guys. :)
Increasing spark.sql.shuffle.partitions was the right answer.

Because spark.sql.shuffle.partitions has to be set manually, and this setting seems to be strongly correlated with the size of the tables (scale factor), it is really hard to find a reasonable value for the different queries and scale factors.

In my case, setting spark.sql.shuffle.partitions to 10000 for query 30 worked for scale factor 1, but for scale factor 10 I had to raise it (with 20000 it worked fine). Probably for running query 30 with scale factor 1000 I have to raise it to a significantly higher value.
On the other hand, setting spark.sql.shuffle.partitions to 10000 was sufficient for query 29 with scale factor 10. 

I just have to figure out an appropriate value for this setting. I would prefer having a universal value for the different queries (3, 4, 29, 30) and scale factors, but I am not really sure if there is such a value.

Do you run your tests with scale factor 1000?
If so, what have you set for spark.sql.shuffle.partitions (universal or query specific)?

Kind regards
Max

Yi Yao

Mar 29, 2015, 10:35:10 PM
to big-...@googlegroups.com
"Do you run your tests with scale factor 1000?"
yes.


"If so, what have you set for spark.sql.shuffle.partitions (universal or query specific)?"
It depends on your data scale and hardware.

Steve Anderson

Sep 22, 2015, 1:22:22 PM
to Big Data Benchmark for BigBench
Hi,
I'm seeing this error when I use the Spark engine:

[root@hdp3ims bin]# ./bigBench runBenchmark
BigBench clean all                                      (this might take a while...)
BigBench clean all                                      finished. Time:          0h:00m:02s:498ms
BigBench engine validation: Data generation             (this might take a while...)
BigBench engine validation: Data generation             finished. Time:          0h:01m:15s:241ms
BigBench engine validation: Populate metastore          (this might take a while...)
==============
Benchmark run terminated
Reason: An error occured while running a command
==============
java.io.IOException: Error while running module populateMetastore. More information in corresponding logfile in /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/logs
        at io.bigdatabenchmark.v1.driver.BigBench.runModuleTimed(BigBench.java:710)
        at io.bigdatabenchmark.v1.driver.BigBench.run(BigBench.java:389)
        at io.bigdatabenchmark.v1.driver.RunBigBench.main(RunBigBench.java:52)
[root@hdp3ims bin]# java.io.IOException: Error while running module populateMetastore

If you can point me to where I should be looking to resolve this, I would really appreciate it.

I can configure it to run Hive without issues but have not had Spark run successfully.
I'm sure I must have done something wrong, and if you can point me to where I should start looking, that would be ideal.

Any advice or guidance would be really appreciated :)

Thanks

dafridgie

Steve Anderson

Sep 22, 2015, 3:05:54 PM
to Big Data Benchmark for BigBench
Ah, my bad, I haven't installed Spark standalone correctly; it's YARN-integrated, so I will rework that to resolve my errors.

Dafridgie

Steve Anderson

Sep 22, 2015, 4:27:38 PM
to Big Data Benchmark for BigBench
Hi, I have now built out Spark standalone in my cluster successfully. However, I am still seeing errors when trying to run BigBench with Spark.

Here is the error:

spark.driver.extraClassPath -> ${HIVE_INSTALL_DIR}/hive/lib/*:${HIVE_CONF_DIR}/hive/conf
Classpath elements:
file:/benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/queries/Resources/opennlp-maxent-3.0.3.jar
file:/benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/queries/Resources/opennlp-tools-1.5.3.jar
file:/benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/queries/Resources/bigbenchqueriesmr.jar


java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:274)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.
An error occured while running command:
==========
runEngineCmd -f /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/engines/hive/population/hiveCreateLoad_decimal.sql
==========
Please check the log files for details
======= Load data into hive time =========
Start timestamp: 2015/09/22:21:19:23 1442953163
Stop  timestamp: 2015/09/22:21:19:25 1442953165
Duration:  0h 0m 2s
----- result -----
Load data FAILED exit code: 101
time&status: /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/logs/times.csv
full log: /benchmark/Big-Data-Benchmark-for-Big-Bench-0.5/logs/populateMetastore-run_query.log

I'm using Cloudera 5.4 with Spark 1.3

Any suggestions or guidance in getting this error resolved would be really appreciated 


Dafridgie



Yi Yao

Sep 23, 2015, 3:39:31 AM
to Big Data Benchmark for BigBench
Hi, please compile your Spark using Maven with the profiles -Phive and -Phive-thriftserver. Note that Spark 1.3 does not support Hive 1.1; we suggest using Hive 0.13.1. Besides, Spark 1.3 has several critical bugs which block some of the BigBench queries. We suggest compiling Spark from the following source tree:

https://github.com/jameszhouyi/Spark-for-BigBench.git
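
A minimal sketch of that build, assuming Maven is installed (the profile flags come from the error message above):

git clone https://github.com/jameszhouyi/Spark-for-BigBench.git
cd Spark-for-BigBench
mvn -Phive -Phive-thriftserver -DskipTests clean package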


Best Regards,
Yi

Steve Anderson

Sep 23, 2015, 8:08:45 AM
to Big Data Benchmark for BigBench
Hi Yi, thanks for the quick response. I had hoped to use the Cloudera 5.4 version of Spark, as I had interpreted the guidance above to say that the CDH parcel version of Spark 1.3 would work.
I'll build Spark manually using the method you describe and remove the CDH version of Spark before trying BigBench again.

Can you tell me if there are any plans to develop BigBench so that it can run natively on CDH without having to resort to a partially manual installation of the Spark and Hive components?
Ideally we would like to be able to run BigBench with a standard Cloudera Manager based CDH 5.x deployment using parcels.

Thanks for your time, Yi, it's really appreciated.

Michael Frank

Sep 23, 2015, 8:46:15 AM
to Big Data Benchmark for BigBench
Hi,
Spark versions < 1.4 are just too buggy and unstable, and therefore not able to execute all of the BigBench workloads. But the Spark team did a great job of fixing these issues in v1.4 and v1.5. For us there is no incentive to put a lot of effort into supporting an outdated Spark version when the fixes are already available in a higher version. It's just a matter of months until the major vendors release the next versions of their Hadoop distributions, including a more recent Spark version. Until then you have to upgrade your Spark version manually.

Cheers,
Michael