Indexing task ingesting data from S3 failing

Brian Webb

Apr 27, 2016, 21:09:21
to: Druid User
Hello,

I have the following situation: my data lives on S3 in a single directory containing many part files, with one JSON event per line. I want to run a batch indexing job to load this data into Druid, so I created an index_hadoop task. The task appears to read all of the data from S3 and process it successfully, but then fails at the very end with an exception. Any suggestions on how to resolve this?
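
For concreteness, each part file contains newline-delimited JSON, one event per line, roughly like this (the field names here are made up for illustration, not my actual schema):

{"timestamp": "2016-04-27T12:00:00Z", "experiment": "exp-a", "userId": "u-123", "value": 1}
{"timestamp": "2016-04-27T12:00:05Z", "experiment": "exp-b", "userId": "u-456", "value": 3}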

Here is the exception from the indexing task:

2016-04-27T23:56:21,905 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_experiment-events_2016-04-27T23:53:09.664Z, type=index_hadoop, dataSource=experiment-events}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:160) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:338) [druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:318) [druid-indexing-service-0.9.0.jar:0.9.0]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:157) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        ... 7 more
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:343) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
        at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:157) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        ... 7 more
2016-04-27T23:56:21,916 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_experiment-events_2016-04-27T23:53:09.664Z",
  "status" : "FAILED",
  "duration" : 187999
}

I am running Druid 0.9.0 on EC2. I do not have a separate, dedicated Hadoop cluster.

Relevant common configuration:
druid.extensions.loadList=["druid-s3-extensions", "mysql-metadata-storage"]
druid.storage.type=s3
druid.storage.bucket=<my bucket>
druid.storage.baseKey=druid/segments
druid.s3.accessKey=<my key>
druid.s3.secretKey=<my secret>

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=<my bucket>
druid.indexer.logs.s3Prefix=druid/indexing-logs

Relevant Overlord config:
druid.indexer.runner.type=remote

Relevant middleManager config:
druid.worker.capacity=3
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=<my path>
druid.server.http.numThreads=8
druid.processing.buffer.sizeBytes=256000000
druid.processing.numThreads=2
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]
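
(As I understand it, the Hadoop client coordinates can also be overridden per task via a top-level hadoopDependencyCoordinates field in the task spec, roughly as sketched below; I am currently just relying on the default above, and the version shown is only an example.)

{
  "type": "index_hadoop",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.3.0"],
  "spec": { ... }
}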

Relevant parts of the index job config:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://<my bucket>/experiment"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "fs.s3.awsAccessKeyId": "<my key>",
        "fs.s3.awsSecretAccessKey": "<my secret>",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "<my key>",
        "fs.s3n.awsSecretAccessKey": "<my secret>",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
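
The dataSchema section is omitted above for brevity; it follows the standard shape for a Hadoop indexing task, roughly like the sketch below (the dimensions, metric, and interval shown are placeholders, not my actual values):

{
  "dataSchema": {
    "dataSource": "experiment-events",
    "parser": {
      "type": "hadoopyString",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["experiment", "userId"] }
      }
    },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "intervals": ["2016-04-01/2016-05-01"]
    }
  }
}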


Thanks!


Fangjin Yang

Apr 29, 2016, 21:06:07
to: Druid User
Do you have the full task log?