I think I have figured out what the possible problem is:
I am giving a list of files as input in order to get more than one map task (my input is 5G of binary data, hence not splittable). So I create, say, 10 binary files in HDFS and pass them as input using a regex. With this, I notice that the reported data size is 0 and hence the number of chunks is 1 (even though the total data is >2G).
This is what the output looks like:
options: --chunksize 2040109465 --inputformat VoldLkpInputFormat --input file.dat.[0-9]*
10/12/13 17:12:06 INFO mr.HadoopStoreBuilder: Data size = 0, replication factor = 2, numNodes = 3, chunk size = 2040109465, num.chunks = 1
10/12/13 17:12:06 INFO mr.HadoopStoreBuilder: Number of reduces: 3
...
If I use only one split (a single input file), then I do get the data size:
10/12/13 17:24:23 INFO mr.HadoopStoreBuilder: Data size = 512739299, replication factor = 2, numNodes = 3, chunk size = 2040109465, num.chunks = 1
The related code in contrib/hadoop-store-builder/src/java/voldemort/store/readonly/mr/HadoopStoreBuilder.java is:

    // delete output dir if it already exists
    FileSystem tempFs = tempDir.getFileSystem(conf);
    tempFs.delete(tempDir, true);

    long size = sizeOfPath(tempFs, inputPath);
    int numChunks = Math.max((int) (storeDef.getReplicationFactor() * size
                                    / cluster.getNumberOfNodes() / chunkSizeBytes),
                             1);
    logger.info("Data size = " + size + ", replication factor = "
                + storeDef.getReplicationFactor() + ", numNodes = "
                + cluster.getNumberOfNodes() + ", chunk size = " + chunkSizeBytes
                + ", num.chunks = " + numChunks);
    conf.setInt("num.chunks", numChunks);
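My guess is that sizeOfPath() does not expand the glob pattern (file.dat.[0-9]*), so it reports 0 for the multi-file input. A minimal sketch of what a glob-aware size computation could look like, assuming Hadoop's FileSystem.globStatus() and getContentSummary() APIs (sizeOfGlob is a hypothetical helper of mine, not existing Voldemort code):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GlobSize {
        // Hypothetical helper: expand the input pattern (e.g. file.dat.[0-9]*)
        // and add up the length of every path that matches, instead of
        // asking for the size of the unexpanded pattern itself.
        public static long sizeOfGlob(FileSystem fs, Path inputPattern) throws IOException {
            long total = 0;
            FileStatus[] matches = fs.globStatus(inputPattern);
            if(matches != null) {
                for(FileStatus status: matches)
                    total += fs.getContentSummary(status.getPath()).getLength();
            }
            return total;
        }
    }

If sizeOfPath() is effectively taking the size of the literal, unexpanded pattern path, that would explain the "Data size = 0" above, but this is only a guess on my part.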
Further, when I give the whole input as one big 5G binary file, I run into "Error: Java heap space":
10/12/13 18:05:27 INFO mapred.JobClient: Failed map tasks=1
voldemort.VoldemortException: java.io.IOException: Job failed!
        at voldemort.store.readonly.mr.HadoopStoreBuilder.build(HadoopStoreBuilder.java:242)
        at voldemort.store.readonly.mr.HadoopStoreJobRunner.run(HadoopStoreJobRunner.java:180)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at voldemort.store.readonly.mr.HadoopStoreJobRunner.main(HadoopStoreJobRunner.java:257)
Caused by: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1293)
        at voldemort.store.readonly.mr.HadoopStoreBuilder.build(HadoopStoreBuilder.java:192)
        ... 3 more
Any workaround?