Hi Dave,
During the dataGen stage, the PDGF tool runs as a separate process inside each mapper (it simply piggybacks on Hadoop as a "clusterexec"), so there is nothing to tune on the Hadoop side. The only tuning option you have is the number of parallel (map) tasks, set with BigBench's "-m <tasks>" option.
As a guideline, set <tasks> to (<number of vcores/YARN containers> - 1) * n, where n is the number of waves. For example, on a 10-node cluster with 16 cores per node, generating the data in 2 waves gives ((10 * 16) - 1) * 2 = 318, so you would pass -m 318.
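If it helps, the sizing rule above can be written as a tiny shell calculation. NODES, VCORES_PER_NODE and WAVES are placeholder names for your own cluster parameters, not BigBench variables:

```shell
# Sketch of the -m sizing rule: (total vcores - 1) * number of waves.
NODES=10
VCORES_PER_NODE=16
WAVES=2
TASKS=$(( (NODES * VCORES_PER_NODE - 1) * WAVES ))
echo "-m $TASKS"   # prints: -m 318
```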
For everything below 1 TB (scale factor 1000) you probably want to go with a single wave.
The populate metastore phase has no "local" settings the way the queries do.
Just place your settings in engines/hive/conf/engineSettings.sql or create your own version of the /engines/hive/population/hiveCreateLoad.sql file.
You can switch population sql files by editing:
engines/hive/conf/engineSettings.conf
line 163: export BIG_BENCH_POPULATE_METASTORE_FILE="${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
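Note the ${USER_POPULATE_FILE:-...} default in that line: because of it, you can also point to your own population file by exporting USER_POPULATE_FILE instead of editing the conf. A small sketch of how that expansion behaves (myCreateLoad.sql is a hypothetical filename):

```shell
# The :- expansion falls back to the default only when the variable is unset.
BIG_BENCH_POPULATION_DIR="engines/hive/population"

unset USER_POPULATE_FILE
echo "${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
# prints: engines/hive/population/hiveCreateLoad.sql

# Exporting USER_POPULATE_FILE overrides the default without touching the conf:
export USER_POPULATE_FILE="$BIG_BENCH_POPULATION_DIR/myCreateLoad.sql"
echo "${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"}"
# prints: engines/hive/population/myCreateLoad.sql
```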
Cheers,
Michael