Hadoop parameters for dataGen, populateMetastore


Dave Jaffe

Mar 29, 2016, 11:42:43 AM
to Big Data Benchmark for BigBench
Is there a way to set specific Hadoop parameters (like mapreduce.map.memory.mb) for the dataGen and populateMetastore phases similar to the engineLocalSettings.sql files for the queries?

Michael Frank

Mar 29, 2016, 12:06:30 PM
to Big Data Benchmark for BigBench
Hi Dave,
during the dataGen stage the PDGF tool runs as a separate process inside each mapper (it just piggybacks on Hadoop as a "clusterexec"), so there is no need to "tune" anything. The only tuning option you have is the number of parallel (map) tasks, via bigbench's "-m <tasks>" option.
As a guideline, set <tasks> to (<number of vcores/yarn containers> - 1) * n, where n is the number of waves. E.g., if you have a 10-node cluster with 16 cores each and want to do data generation in 2 waves: ((10*16) - 1) * 2 = 318 => -m 318.
For everything below 1 TB (scale factor 1000) you probably want to go with a single wave.
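As a minimal shell sketch of that sizing arithmetic (the -m flag is the one described above; the ./bigBench driver path and dataGen module name are assumptions about your checkout):

    # Illustrative cluster shape: 10 worker nodes, 16 vcores each, 2 waves
    NODES=10
    VCORES=16
    WAVES=2
    TASKS=$(( (NODES * VCORES - 1) * WAVES ))   # (160 - 1) * 2 = 318
    ./bigBench dataGen -m "$TASKS"              # assumed invocation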

The populateMetastore phase has no per-phase "local" settings like the queries do.
Just place your settings in engines/hive/conf/engineSettings.sql (see the sketch below) or create your own version of the engines/hive/population/hiveCreateLoad.sql file.
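As a hedged example, settings like these could go into engines/hive/conf/engineSettings.sql so they take effect during population; the property names are standard MapReduce/Hive ones, but the values are purely illustrative and need sizing for your own cluster:

    -- Illustrative container sizing for the load phase (tune for your cluster)
    set mapreduce.map.memory.mb=4096;
    set mapreduce.map.java.opts=-Xmx3276m;
    set mapreduce.reduce.memory.mb=8192;
    set mapreduce.reduce.java.opts=-Xmx6553m;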
You can switch population SQL files by editing:
engines/hive/conf/engineSettings.conf
line 163: export BIG_BENCH_POPULATE_METASTORE_FILE="${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
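Because that default uses the ${USER_POPULATE_FILE:-...} fallback syntax, you should also be able to point the loader at a custom file without editing the config at all; a sketch, where myCreateLoad.sql is a hypothetical copy:

    cp engines/hive/population/hiveCreateLoad.sql engines/hive/population/myCreateLoad.sql
    # edit myCreateLoad.sql, then override the default before running populateMetastore:
    export USER_POPULATE_FILE="$PWD/engines/hive/population/myCreateLoad.sql"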

Cheers,
Michael

Dave Jaffe

Mar 29, 2016, 2:10:34 PM
to Big Data Benchmark for BigBench
Thanks for pointing me to the hiveCreateLoad.sql file, Michael. You confirmed my understanding of dataGen.