Hi Dave,
During the dataGen stage, the PDGF tool runs as a separate process inside each mapper (it simply piggybacks on Hadoop as a "clusterexec"), so there is nothing to tune on the Hadoop side. The only tuning option you have is the number of parallel (map) tasks, set with BigBench's "-m <tasks>" option.
As a guideline, set <tasks> to (<number of vcores/YARN containers> - 1) * n, where n is the number of waves. For example, on a 10-node cluster with 16 cores per node, generating the data in 2 waves gives ((10 * 16) - 1) * 2 = 318, so you would pass -m 318.
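If it helps, the sizing rule above can be written as a tiny shell calculation. NODES, VCORES_PER_NODE and WAVES are placeholder names for your own cluster parameters, not BigBench variables:

```shell
# Sketch of the -m sizing rule: (total vcores - 1) * number of waves.
NODES=10
VCORES_PER_NODE=16
WAVES=2
TASKS=$(( (NODES * VCORES_PER_NODE - 1) * WAVES ))
echo "-m $TASKS"   # prints: -m 318
```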
For everything below 1 TB (scale factor 1000) you probably want to go with a single wave.
The populate metastore phase has no "local" settings the way the queries do.
Just place your settings in engines/hive/conf/engineSettings.sql or create your own version of the /engines/hive/population/hiveCreateLoad.sql file.
You can switch population sql files by editing:
engines/hive/conf/engineSettings.conf
line 163: export BIG_BENCH_POPULATE_METASTORE_FILE="${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
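Note the ${USER_POPULATE_FILE:-...} default in that line: because of it, you can also point to your own population file by exporting USER_POPULATE_FILE instead of editing the conf. A small sketch of how that expansion behaves (myCreateLoad.sql is a hypothetical filename):

```shell
# The :- expansion falls back to the default only when the variable is unset.
BIG_BENCH_POPULATION_DIR="engines/hive/population"

unset USER_POPULATE_FILE
echo "${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
# prints: engines/hive/population/hiveCreateLoad.sql

# Exporting USER_POPULATE_FILE overrides the default without touching the conf:
export USER_POPULATE_FILE="$BIG_BENCH_POPULATION_DIR/myCreateLoad.sql"
echo "${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
# prints: engines/hive/population/myCreateLoad.sql
```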
Cheers,
Michael