Hi Bart,
Yes, BigBench's data generation for the raw data overrides the HDFS default replication factor.
You can configure this behaviour in your Big-Bench/setEnvVars file with the BIG_BENCH_DATAGEN_DFS_REPLICATION variable.
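As a rough sketch, the relevant line in setEnvVars might look like the following. The value 3 is only an example, and the exact quoting and position of the line may differ in your Big-Bench version, so check your own copy of the file.

```shell
# Excerpt from Big-Bench/setEnvVars (sketch; 3 is an example value, not a recommendation)
# Replication factor used for the raw data written by the data generator.
export BIG_BENCH_DATAGEN_DFS_REPLICATION="3"
```

The new value only affects data generated after the change; raw data that already exists keeps the replication factor it was written with.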
We consider the raw staging data to be volatile. Besides, the data generation is already distributed across all nodes. So we think it is better to keep storage consumption low for data that is read only once, hence the replication factor of 1.
But you mixed something up: /user/bart/benchmarks/bigbench/data/customer/ is NOT the data the benchmark runs with!
The Hive loading stage reads this data, converts it to ORC, and stores it in Hive's warehouse/ directory. When Hive writes the ORC tables, the default HDFS replication factor is applied.
To see the tables the benchmark actually runs on, you may want to check:
hadoop fs -ls /user/hive/warehouse/bigbench.db/customer/
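For files, the second column of the `hadoop fs -ls` output is the replication factor, so you can compare the two directories directly. Here is a small helper as a sketch; the function name show_replication is made up for this example, and it assumes paths without spaces.

```shell
# Read `hadoop fs -ls` output on stdin and print "<replication> <path>" per file.
# The "Found N items" header has too few fields and is skipped; directories show
# "-" in the replication column and are skipped as well.
show_replication() {
  awk 'NF >= 8 && $2 != "-" { print $2, $NF }'
}

# Usage on the cluster (paths taken from this thread):
#   hadoop fs -ls /user/bart/benchmarks/bigbench/data/customer/ | show_replication
#   hadoop fs -ls /user/hive/warehouse/bigbench.db/customer/ | show_replication
```

The first command should report the data generator's replication factor for the raw files, the second the HDFS default for the ORC files.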
best regards,
Michael