Hi Dwayne,
As an example, at SF 3000 (3 TB) I use:
"Client Java Heap Size in Bytes" = 3340763136 (Starting to launch local task to process map join; maximum memory = 3340763136)
mapreduce.input.fileinputformat.split.maxsize=134217728 // 128 MB
hive.exec.reducers.bytes.per.reducer=67108864 // 64 MB
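These byte values are just megabyte counts expressed in bytes; a quick sanity check:

```python
# Verify that the byte values quoted above match the stated MB sizes.
MB = 1024 * 1024

split_maxsize = 134217728       # mapreduce.input.fileinputformat.split.maxsize
bytes_per_reducer = 67108864    # hive.exec.reducers.bytes.per.reducer

print(split_maxsize // MB)      # 128
print(bytes_per_reducer // MB)  # 64
```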
As you can see, 8 GB should be plenty.
If you are still experiencing "exit status: 3" despite having a lot of memory dedicated to the "Client Java Heap Size in Bytes", you should consider reducing "hive.mapjoin.smalltable.filesize".
Hive uses this variable to decide whether a table is processed with a fast local map join or a slow normal join:
hive.mapjoin.smalltable.filesize=25000000
hive.mapjoin.localtask.max.memory.usage=0.9
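As an illustration (the 10 MB threshold below is my own example value, not a recommendation), you could lower the threshold before running the query so that fewer tables qualify for the local map join:

```
-- Illustrative values only: lower the small-table threshold so fewer
-- tables are picked for the in-memory local map join.
set hive.mapjoin.smalltable.filesize=10000000;  -- ~10 MB instead of the 25 MB default
set hive.mapjoin.localtask.max.memory.usage=0.9;
```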
Query 9 is ideal for testing your map-join settings, as it is very sensitive to the map-join configuration. If query 9 runs at your desired scale factor and settings, the other queries will most likely run as well.
./bin/bigBench runBenchmark -i "POWER_TEST" -b -q 9 // runs only query 9 with debug prints enabled. Requires that you generated the data and populated the Hive database in a previous run.
The BigBench FAQ covers this topic as well:
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench/blob/master/README.md#execution-failed-with-exit-status-3

The right cluster settings depend strongly on your cluster and the data set size you choose to run with. Because of this, we do not provide "default settings" for the various SFs: it would be impossible to give you the "right" ones, and doing so could be misleading and cause more harm than good.
Now that I have placed my warning: here are some parameters I use at different SFs, running on a 10-node AWS cluster (16 vCores/node). Depending on your cluster, you may have to use totally different settings.
Note that the settings are rather radical for SF < 1000; you would never use them like that in a production system! The goal of these settings was to maximize cluster utilization, even with small data sizes.
For any serious benchmarking of a BigData system you want to consider running at least SF 1000.
--sf 1 settings; good values between 4 and 6 MB (won't achieve 100% utilization - jobs don't run long enough, mainly Hive/MR startup overhead)
--set mapreduce.input.fileinputformat.split.minsize=4194304;
--set mapreduce.input.fileinputformat.split.maxsize=6291456;
--set hive.exec.reducers.bytes.per.reducer=6291456;
--sf 10 settings
--set mapreduce.input.fileinputformat.split.minsize=4194304;
--set mapreduce.input.fileinputformat.split.maxsize=8388608;
--set hive.exec.reducers.bytes.per.reducer=8388608;
--sf 100 settings
--set mapreduce.input.fileinputformat.split.minsize=4194304;
--set mapreduce.input.fileinputformat.split.maxsize=16777216;
--set hive.exec.reducers.bytes.per.reducer=16777216;
--sf 1000 settings; good values between 32 and 64 MB
set mapreduce.input.fileinputformat.split.minsize=4194304;
set mapreduce.input.fileinputformat.split.maxsize=67108864;
set hive.exec.reducers.bytes.per.reducer=67108864;
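The intuition behind scaling the split size with the SF can be sketched as follows. As a rough rule of thumb (my own approximation, not from the benchmark docs), the number of map tasks is about the input size divided by split.maxsize, so small splits buy parallelism on small data sets, while larger splits avoid drowning a big run in task startup overhead:

```python
# Rough rule of thumb (assumption, not from the benchmark docs):
# number of map tasks ~= input size / split.maxsize, rounded up.
def approx_map_tasks(input_bytes, split_maxsize):
    return -(-input_bytes // split_maxsize)  # ceiling division

SF1_BYTES = 10**9        # SF 1 is roughly 1 GB of raw data
SF1000_BYTES = 10**12    # SF 1000 is roughly 1 TB of raw data

# SF 1 with 6 MB splits still yields enough tasks to occupy a small cluster,
# while SF 1000 with 64 MB splits avoids tens of thousands of tiny tasks.
print(approx_map_tasks(SF1_BYTES, 6291456))
print(approx_map_tasks(SF1000_BYTES, 67108864))
```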
Cheers,
Michael