replication factor of generated data in HDFS


Bart Vandewoestyne

Oct 7, 2014, 4:59:24 AM
to big-...@googlegroups.com
Just out of curiosity, I checked the replication factor of the files generated by Big-Bench. Although my dfs.replication is set to the default value of 3, the generated files seem to have a replication factor of 1. See, for example, the customer directory:

bart@sandy-quad-1:~$ hadoop fs -ls /user/bart/benchmarks/bigbench/data/customer/ | head -5
Found 256 items
-rw-r--r--   1 yarn bart     733356 2014-10-03 17:27 /user/bart/benchmarks/bigbench/data/customer/customer_1.dat
-rw-r--r--   1 yarn bart     740415 2014-10-03 17:27 /user/bart/benchmarks/bigbench/data/customer/customer_10.dat
-rw-r--r--   1 yarn bart     745401 2014-10-03 17:33 /user/bart/benchmarks/bigbench/data/customer/customer_100.dat
-rw-r--r--   1 yarn bart     745462 2014-10-03 17:33 /user/bart/benchmarks/bigbench/data/customer/customer_101.dat

Does PDGF somehow override the replication factor of the HDFS filesystem, and if so, why?

Regards,
Bart

Michael Frank

Oct 7, 2014, 9:24:08 AM
to big-...@googlegroups.com
Hi Bart,

Yes, BigBench's data generation for the raw data overrides HDFS's default replication factor.
You can configure this behaviour in your Big-Bench/setEnvVars file with the BIG_BENCH_DATAGEN_DFS_REPLICATION variable.
We consider the raw staging data to be volatile. Besides, the data generation is already distributed over all nodes. So we think it is better to keep storage consumption low for data that is read only once, hence the replication factor of 1.
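
As a rough illustration of the setting, here is a minimal sketch of what the relevant line in Big-Bench/setEnvVars might look like. Only the variable name comes from this thread; the `export` form and the quoting are assumptions, so check your own copy of the file for the actual syntax.

```shell
# Hypothetical excerpt from Big-Bench/setEnvVars (syntax assumed, not copied from the file).
# "1" keeps a single copy of the raw staging data, as described above.
# Setting it to "3" would match the usual dfs.replication default instead.
export BIG_BENCH_DATAGEN_DFS_REPLICATION="1"
```

The new value should only affect data generated after the change; files that already exist in HDFS keep the replication factor they were written with.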

But you mixed something up: /user/bart/benchmarks/bigbench/data/customer/ is NOT the data the benchmark runs with!
The Hive loading stage reads this data, converts it to ORC, and stores it within Hive's warehouse/ directory. When Hive writes the ORC-format tables, the default HDFS replication factor is applied.
You may want to check:
hadoop fs -ls /user/hive/warehouse/bigbench.db/customer/
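
To compare the raw staging directory with the Hive warehouse directory without reading the listing by eye, something like the following could be used. It is only a sketch: the helper name `summarize_replication` is made up for illustration, and it relies on `hadoop fs -ls` printing the replication factor in the second column for files (directories show "-" there).

```shell
#!/bin/sh
# Hypothetical helper: summarize the replication factors in `hadoop fs -ls` output.
# Reads the listing from stdin and counts files per replication factor.
# Only lines whose permission string starts with "-" (regular files) are counted,
# so the "Found N items" header and directory entries are skipped.
summarize_replication() {
    awk '$1 ~ /^-/ { count[$2]++ }
         END { for (r in count) printf "replication=%s files=%d\n", r, count[r] }' | sort
}

# Example usage (requires a working hadoop client; substitute your own paths):
#   hadoop fs -ls /user/bart/benchmarks/bigbench/data/customer/ | summarize_replication
#   hadoop fs -ls <your Hive warehouse table directory> | summarize_replication
```

If the explanation above holds, the first listing should report only a replication factor of 1, and the second should report whatever dfs.replication is set to.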

best regards,
Michael

Bart Vandewoestyne

Oct 7, 2014, 10:10:33 AM
to big-...@googlegroups.com
OK.  Thanks for explaining.  This removes my confusion :-)

Small typo however: it's /user/hive/warehouse/bigbenchorc.db/customer/ instead of what you typed ;-)

Regards,
Bart