In trying to index a fairly large file (~750 mb when gzipped, around 8.3 million lines) in a Hadoop indexer task I get an
This particular run had a ton of issues with ZooKeeper, as far as I can tell from the logs, but this has happened repeatedly, also without any ZK issues in the log output.
If I use a smaller subset of the file - 50.000 lines or 500.000 lines - it finishes processing as expected. I haven't experimented with other sizes yet.
2014-08-12 00:32:43,630 WARN [Thread-448] org.apache.hadoop.mapred.LocalJobRunner - job_local1920330805_0003
java.lang.Exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at java.lang.String.replace(String.java:2180)
at io.druid.data.input.MapBasedRow.getFloatMetric(MapBasedRow.java:106)
at io.druid.segment.incremental.SpatialDimensionRowFormatter$5.getFloatMetric(SpatialDimensionRowFormatter.java:156)
at io.druid.segment.incremental.SpatialDimensionRowFormatter$5.getFloatMetric(SpatialDimensionRowFormatter.java:156)
at io.druid.segment.incremental.IncrementalIndex$1$2.get(IncrementalIndex.java:237)
at io.druid.query.aggregation.LongSumAggregator.aggregate(LongSumAggregator.java:60)
at io.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:367)
at io.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:141)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:303)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:253)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-08-12 00:32:43,758 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Job job_local1920330805_0003 failed with state FAILED due to: NA
2014-08-12 00:32:43,826 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Counters: 38
File System Counters
FILE: Number of bytes read=1857818703704
FILE: Number of bytes written=1942485935992
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
S3N: Number of bytes read=0
S3N: Number of bytes written=8451918159
S3N: Number of read operations=0
S3N: Number of large read operations=0
S3N: Number of write operations=0
Map-Reduce Framework
Map input records=8294987
Map output records=8294987
Map output bytes=8116137478
Map output materialized bytes=8149317612
Input split bytes=132
Combine input records=0
Combine output records=0
Reduce input groups=31
Reduce shuffle bytes=8149317612
Reduce input records=8204124
Reduce output records=0
Spilled Records=24045003
Shuffled Maps =31
Failed Shuffles=0
Merged Map outputs=31
GC time elapsed (ms)=4772813
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=58565591040
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=763481881
File Output Format Counters
Bytes Written=248
This is the indexing task I use for the job:
{
"task":"index_hadoop_ds_2014-08-11T21:48:12.607Z",
"payload":{
"id":"index_hadoop_ds_2014-08-11T21:48:12.607Z",
"spec":{
"dataSchema":{
"dataSource":"ds",
"parser":{
"type":"string",
"parseSpec":{
"format":"json",
"timestampSpec":{
"column":"timestamp",
"format":"iso"
},
"dimensionsSpec":{
"dimensions":[
"lid",
"timestamp",
"sessionorder",
"responsetime",
"internal",
"onclick",
"url",
"domain",
"referrer",
"referrer_domain",
"sessioncookie",
"permcookie",
"useragent",
"browser",
"browserversion",
"os_name",
"os_family",
"device_type",
"resolution",
"ipaddress",
"network",
"country",
"region",
"city",
"organization",
"search_term",
"lastpage",
"lastlink",
"ext_search"
],
"dimensionExclusions":[
"timestamp",
"unique_visitors",
"exits",
"responsetime_max",
"sessions",
"responsetime_min",
"count",
"search_hits",
"entries",
"bounces"
],
"spatialDimensions":[
]
}
}
},
"metricsSpec":[
{
"type":"count",
"name":"count"
},
{
"type":"longSum",
"name":"entries",
"fieldName":"entry"
},
{
"type":"longSum",
"name":"exits",
"fieldName":"exit"
},
{
"type":"longSum",
"name":"bounces",
"fieldName":"bounce"
},
{
"type":"longSum",
"name":"search_hits",
"fieldName":"search_hits"
},
{
"type":"min",
"name":"responsetime_min",
"fieldName":"responsetime"
},
{
"type":"max",
"name":"responsetime_max",
"fieldName":"responsetime"
},
{
"type":"cardinality",
"name":"sessions",
"fieldNames":[
"sessioncookie"
]
},
{
"type":"cardinality",
"name":"unique_visitors",
"fieldNames":[
"permcookie"
]
}
],
"granularitySpec":{
"type":"uniform",
"segmentGranularity":"DAY",
"queryGranularity":{
"type":"duration",
"duration":60000,
"origin":"1970-01-01T00:00:00.000Z"
},
"intervals":[
"2014-03-01T00:00:00.000Z/2014-03-31T23:59:59.000Z"
]
}
},
"ioConfig":{
"type":"hadoop",
"inputSpec":{
"type":"static",
"paths":"/path/to/data.json.gz"
},
"metadataUpdateSpec":null,
"segmentOutputPath":null
},
"tuningConfig":{
"type":"hadoop",
"workingPath":null,
"version":"2014-08-11T21:48:12.607Z",
"partitionsSpec":{
"type":"dimension",
"partitionDimension":null,
"targetPartitionSize":5000000,
"maxPartitionSize":7500000,
"assumeGrouped":false,
"numShards":-1
},
"shardSpecs":{
},
"rowFlushBoundary":500000,
"leaveIntermediate":false,
"cleanupOnFailure":true,
"overwriteFiles":false,
"ignoreInvalidRows":false,
"jobProperties":{
},
"combineText":false
}
},
"hadoopDependencyCoordinates":null,
"groupId":"index_hadoop_ds_2014-08-11T21:48:12.607Z",
"dataSource":"ds",
"resource":{
"availabilityGroup":"index_hadoop_ds_2014-08-11T21:48:12.607Z",
"requiredCapacity":1
}
}
}