I have been trying to index 1.0 GB bz2 files with HadoopDruidIndexer but couldn't make it work.
These are the options I've tried:
1. With
"partitionsSpec" : {
  "type" : "hashed",
  "targetPartitionSize" : 1000000,
  "maxPartitionSize" : 1500000,
  "assumeGrouped" : true
},
I am getting an Out of Memory error even after changing all memory parameters to the maximum values allowed:
Error: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
at com.google.common.io.Files.map(Files.java:680)
at com.google.common.io.Files.map(Files.java:666)
at com.google.common.io.Files.map(Files.java:635)
at com.google.common.io.Files.map(Files.java:609)
at com.metamx.common.io.smoosh.SmooshedFileMapper.mapFile(SmooshedFileMapper.java:124)
at io.druid.segment.IndexIO$DefaultIndexIOHandler.convertV8toV9(IndexIO.java:350)
at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:843)
at io.druid.segment.IndexMerger.merge(IndexMerger.java:306)
at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:168)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:372)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:247)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:636)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:396)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1300)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:846)
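For reference, here is roughly how I raised the memory settings, passed through the indexer spec's jobProperties (a sketch only: the property names assume Hadoop 2/YARN, and the values here are illustrative rather than the exact ones from my cluster):

"jobProperties" : {
  "mapreduce.map.memory.mb" : "4096",
  "mapreduce.map.java.opts" : "-Xmx3584m",
  "mapreduce.reduce.memory.mb" : "8192",
  "mapreduce.reduce.java.opts" : "-Xmx7168m"
}

What puzzles me is that the failure comes from com.google.common.io.Files.map inside convertV8toV9, i.e. while memory-mapping the merged segment, so mmap itself seems to be failing rather than the heap filling up, which may be why raising -Xmx didn't help.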
2. With
"partitionsSpec" : {
  "type" : "hashed",
  "targetPartitionSize" : 300000,
  "maxPartitionSize" : 450000,
  "assumeGrouped" : true
},
I am getting "Number of shards [329] exceed the maximum limit of [128],"
File System Counters
  FILE: Number of bytes read=212872
  FILE: Number of bytes written=79740927
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=1210690603
  HDFS: Number of bytes written=18808
  HDFS: Number of read operations=1201
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=401
Job Counters
  Killed map tasks=3
  Launched map tasks=203
  Launched reduce tasks=200
  Other local map tasks=42
  Data-local map tasks=101
  Rack-local map tasks=60
  Total time spent by all maps in occupied slots (ms)=213572352
  Total time spent by all reduces in occupied slots (ms)=125947408
Map-Reduce Framework
  Map input records=106487384
  Map output records=200
  Merged Map outputs=40000
  GC time elapsed (ms)=134680
  CPU time spent (ms)=13432860
  Physical memory (bytes) snapshot=189270577152
  Virtual memory (bytes) snapshot=1439910318080
  Total committed heap usage (bytes)=182140207104
Shuffle Errors
  BAD_ID=0
  CONNECTION=0
  IO_ERROR=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
File Input Format Counters
  Bytes Read=1210652603
File Output Format Counters
  Bytes Written=18800
2014-07-03 23:36:35,527 INFO [main] io.druid.indexer.DetermineHashedPartitionsJob - Job completed, loading up partitions for intervals[Optional.of([2014-07-01T08:00:00.000Z/2014-07-01T09:00:00.000Z])].
2014-07-03 23:36:35,853 ERROR [main] io.druid.cli.CliHadoopIndexer - failure!!!!
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:104)
at io.druid.cli.Main.main(Main.java:92)
Caused by: com.metamx.common.ISE: Number of shards [329] exceed the maximum limit of [128], either targetPartitionSize is too low or data volume is too high
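Doing the math on the counters above: with Map input records=106487384 (apparently all in the single hour interval shown in the log), a targetPartitionSize of 300,000 implies roughly 106487384 / 300000 ≈ 355 shards (the job determined 329 after hashing), and staying under the 128-shard limit would need targetPartitionSize ≥ 106487384 / 128 ≈ 832,000 rows, which is back in the size range that hit the OOM in option 1.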
3. "partitionsSpec" : {
"type" : "hashed",
"numShards": 200
},
With this, the job launches only one reducer and the process runs forever.
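For scale, numShards = 200 over the same ~106M rows works out to about 106487384 / 200 ≈ 532,000 rows per shard, so I expected on the order of 200 reducers for the interval rather than one.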
Could somebody suggest what would be the right way to go?
Thanks.