Indexing task ingesting data from S3 failing

Brian Webb

Apr 27, 2016, 9:09:21 PM
to Druid User
Hello,

I have the following situation: I have data on S3, in a single directory containing many part files, with one JSON event per line. I want to run a batch indexing job to load this data into Druid, so I've created an index_hadoop task. The task appears to read all of the data from S3 and process it successfully, but then fails at the very end with an exception. Any suggestions on how to resolve this?
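
To give a sense of the data format, a single line in one of the part files looks something like this (field names here are made up for illustration, not my real schema):

{"timestamp": "2016-04-27T12:34:56Z", "userId": "abc123", "eventType": "click", "value": 1}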

Here is the exception from the indexing task:

2016-04-27T23:56:21,905 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_experiment-events_2016-04-27T23:53:09.664Z, type=index_hadoop, dataSource=experiment-events}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:160) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:338) [druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:318) [druid-indexing-service-0.9.0.jar:0.9.0]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:157) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        ... 7 more
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:343) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
        at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:157) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        ... 7 more
2016-04-27T23:56:21,916 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_experiment-events_2016-04-27T23:53:09.664Z",
  "status" : "FAILED",
  "duration" : 187999
}

I am running Druid 0.9.0 on EC2. I do not have a separate, dedicated Hadoop cluster.

Relevant common configuration:
druid.extensions.loadList=["druid-s3-extensions", "mysql-metadata-storage"]
druid.storage.type=s3
druid.storage.bucket=<my bucket>
druid.storage.baseKey=druid/segments
druid.s3.accessKey=<my key>
druid.s3.secretKey=<my secret>

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=<my bucket>
druid.indexer.logs.s3Prefix=druid/indexing-logs

Relevant Overlord config:
druid.indexer.runner.type=remote

Relevant middleManager config:
druid.worker.capacity=3
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=<my path>
druid.server.http.numThreads=8
druid.processing.buffer.sizeBytes=256000000
druid.processing.numThreads=2
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]

Relevant parts of index job config:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://<my bucket>/experiment"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "fs.s3.awsAccessKeyId": "<my key>",
        "fs.s3.awsSecretAccessKey": "<my secret>",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "<my key>",
        "fs.s3n.awsSecretAccessKey": "<my secret>",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
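
The dataSchema section is omitted above since it doesn't seem related to the failure; it follows the usual layout for a JSON parseSpec in a Hadoop batch job, roughly like this (my actual dimensions and intervals elided):

  "dataSchema": {
    "dataSource": "experiment-events",
    "parser": {
      "type": "hadoopyString",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["<my dimensions>"] }
      }
    },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "intervals": ["<my interval>"]
    }
  }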


Thanks!


Fangjin Yang

Apr 29, 2016, 9:06:07 PM
to Druid User
Do you have the full task log?