0.10.1 S3A Batch Ingestion Issues


Robert Ervin

Aug 24, 2017, 10:35:13 PM
to Druid User
I'm attempting to run batch ingestion with a base `0.10.1` installation. The only extensions loaded are "druid-s3-extensions" and "postgresql-metadata-storage".
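
For reference, the relevant line from my common.runtime.properties (filename assumed, from a default install):

druid.extensions.loadList=["druid-s3-extensions", "postgresql-metadata-storage"]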

When I create an indexing task with an "s3a://<file_url>" inputSpec path and jobProperties like this:
"jobProperties":{
    "fs.s3a.impl":"org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.server-side-encryption-algorithm":"AES256",
    "fs.s3a.connection.ssl.enabled":"true"
    ...<map-reduce-configs>
}

it throws the following error:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) ~[?:?]
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373) ~[?:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:110) ~[?:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) ~[?:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) ~[?:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) ~[?:?]
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) ~[?:?]
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_141]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_141]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) ~[?:?]
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) ~[?:?]
        at io.druid.indexer.DetermineHashedPartitionsJob.run(DetermineHashedPartitionsJob.java:117) ~[druid-indexing-hadoop-0.10.1.jar:0.10.1]
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:372) ~[druid-indexing-hadoop-0.10.1.jar:0.10.1]
        at io.druid.indexer.HadoopDruidDetermineConfigurationJob.run(HadoopDruidDetermineConfigurationJob.java:91) ~[druid-indexing-hadoop-0.10.1.jar:0.10.1]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:307) ~[druid-indexing-service-0.10.1.jar:0.10.1]


Robert Ervin

Aug 25, 2017, 1:28:56 PM
to Druid User
I also tried installing `hadoop-aws:2.7.3`, and that threw the following error on a batch load job:

Caused by: java.lang.NoSuchMethodError: com.amazonaws.AmazonWebServiceRequest.copyPrivateRequestParameters()Ljava/util/Map;
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3506) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994) ~[?:?]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373) ~[?:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:110) ~[?:?]
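
My guess is that this is a version clash: hadoop-aws:2.7.3 was compiled against aws-java-sdk 1.7.4, which still had AmazonWebServiceRequest.copyPrivateRequestParameters(), while the aws-java-sdk jar that Druid 0.10.1 ships is newer and no longer has that method. Something like this should show the two SDK jars side by side (paths assumed relative to the Druid install dir):

ls lib/ | grep aws-java-sdk
ls hadoop-dependencies/hadoop-aws/2.7.3/ | grep aws-java-sdk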

Ryan Cole

Aug 25, 2017, 5:23:08 PM
to Druid User
I'm having the same problem with using S3A with 0.10.1. Let me know if you figure out how to make it work!

Slim Bouguerra

Aug 25, 2017, 5:35:01 PM
to druid...@googlegroups.com
Can you share the job spec?
Is this on EMR?

-- 

B-Slim
_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______


Ryan Cole

Aug 25, 2017, 6:35:36 PM
to Druid User
Here's my index job:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3a://path/to/data/"
      }
    },
    "dataSchema": {
      "dataSource": "sf",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "updateTime",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              {
                "type": "long",
                "name": "userId"
              },
              "city",
              "state",
              "zipCode",
              "countryCode",
              {
                "type": "long",
                "name": "component"
              }
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "thetaSketch",
          "name": "userId_sketch",
          "fieldName": "userId"
        },
        {
          "type": "thetaSketch",
          "name": "component_sketch",
          "fieldName": "component"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "WEEK",
        "queryGranularity": "none",
        "intervals": ["2017-04-01/2017-07-23"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": true,
        "fs.s3.awsAccessKeyId": "xxx",
        "fs.s3.awsSecretAccessKey": "xxx",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "xxx",
        "fs.s3n.awsSecretAccessKey": "xxx",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3a.awsAccessKeyId": "xxx",
        "fs.s3a.awsSecretAccessKey": "xxx",
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
      },
      "leaveIntermediate": false
    }
  }
}

This is on EMR. Is it still the case that s3a won't work at all with EMR?

Also, when I now try s3n or s3, I get a "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found" error, which wasn't happening before I upgraded to 0.10.1. Did I maybe miss loading something? I have druid-s3-extensions in my load list.

Thanks!
Ryan

Robert Ervin

Aug 27, 2017, 1:04:55 PM
to Druid User
My job spec was:

{  
  "type" : "index_hadoop",
  "spec":{  
     "dataSchema":{  
        "dataSource":"<datasource>",
        "parser":{  
           "type":"hadoopyString",
           "parseSpec":{  
              "format":"csv",
              "columns":[<columns>],
              "timestampSpec":{  
                 "column":"timestamp",
                 "format":"auto"
              },
              "dimensionsSpec":{  
                 "dimensions":[<dimensions>]
              }
           }
        },
        "metricsSpec":[<metricsSpec>],
        "granularitySpec":{  
           "type":"uniform",
           "segmentGranularity":"YEAR",
           "queryGranularity":{  
              "type":"none"
           },
           "rollup":false,
           "intervals":[  
              "2012-07-06T19:23:37.000Z/2017-06-16T20:38:36.000Z"
           ]
        }
     },
     "ioConfig":{  
        "type":"hadoop",
        "inputSpec":{  
           "type":"static",
           "paths":"s3a://<bucket>/<path>.csv.gz"
        },
        "metadataUpdateSpec":null,
        "segmentOutputPath":null
     },
     "tuningConfig":{  
        "type":"hadoop",
        "workingPath":null,
        "version":"2017-08-24T21:48:23.199Z",
        "partitionsSpec":{  
           "type":"hashed",
           "targetPartitionSize":500000,
           "maxPartitionSize":750000,
           "assumeGrouped":false,
           "numShards":-1,
           "partitionDimensions":[  

           ]
        },
        "shardSpecs":{  

        },
        "indexSpec":{  
           "bitmap":{  
              "type":"concise"
           },
           "dimensionCompression":"lz4",
           "metricCompression":"lz4",
           "longEncoding":"longs"
        },
        "maxRowsInMemory":50000,
        "leaveIntermediate":false,
        "cleanupOnFailure":true,
        "overwriteFiles":true,
        "ignoreInvalidRows":false,
        "jobProperties":{
           "fs.s3a.impl":"org.apache.hadoop.fs.s3a.S3AFileSystem",
           "fs.s3a.server-side-encryption-algorithm":"AES256",
           "fs.s3a.connection.ssl.enabled":"true",
           "mapreduce.job.classloader": "true",
           "mapreduce.job.user.classpath.first": "true",
           "mapreduce.task.timeout":"1800000",
           "mapreduce.job.maps":"1",
           "mapreduce.job.reduces":"1",
           "mapreduce.map.memory.mb":"256",
           "mapreduce.map.java.opts":"-server -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
           "mapreduce.reduce.memory.mb":"256",
           "mapreduce.reduce.java.opts":"-server -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
           "mapreduce.map.output.compress":"true",
           "mapred.map.output.compress.codec":"org.apache.hadoop.io.compress.SnappyCodec"
        },
        "combineText":false,
        "useCombiner":false,
        "buildV9Directly":true,
        "numBackgroundPersistThreads":0,
        "forceExtendableShardSpecs":false,
        "useExplicitVersion":false,
        "allowedHadoopPrefix":[  

        ]
     }
  },
  "dataSource":"<datasource>",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3"]
}

This is not running on EMR. Just an Amazon Linux EC2 machine.

Johnson Johnson

Sep 5, 2017, 3:03:04 PM
to Druid User
Robert, were you able to get past this issue?

Robert Ervin

Sep 5, 2017, 4:33:10 PM
to Druid User
@Johnson. I have not been able to get past this. I spent a whole day on it and got nowhere. I was able to get s3n to work again, though, by explicitly including "hadoop-aws:2.7.3" in my hadoopDependencies with a base 0.10.1 installation.
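
For the record, this is the relevant field from my task spec:

"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3"]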

Johnson Johnson

Sep 5, 2017, 5:37:30 PM
to Druid User
Thanks, Robert - I ended up getting this when I reverted to that hadoop dependency in my ingestion spec:

2017-09-05T21:36:16,114 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_ip_queries-201, type=index_hadoop, dataSource=ip_queries}]
io.druid.java.util.common.ISE: Hadoop dependency [/opt/druid/druid-0.10.1-rc3/hadoop-dependencies/hadoop-aws/2.7.3] didn't exist!?
        at io.druid.initialization.Initialization.getHadoopDependencyFilesToLoad(Initialization.java:274) ~[druid-server-0.10.1-rc3.jar:0.10.1-rc3]
        at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:156) ~[druid-indexing-service-0.10.1-rc3.jar:0.10.1-rc3]
        at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:130) ~[druid-indexing-service-0.10.1-rc3.jar:0.10.1-rc3]

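I'm guessing those jars were never pulled down into hadoop-dependencies/. If I'm reading the docs right, something like this, run from the Druid install directory, should populate it:

java -classpath "lib/*" io.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:2.7.3"
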
Have you seen this one?

Robert Ervin

Sep 5, 2017, 5:51:13 PM
to Druid User
Sorry, I should have been more clear.

My hadoopDependencyCoordinates are 

Lawrence Huang

Sep 29, 2017, 12:49:47 PM
to Druid User
I managed to fix this issue by adding the hadoop-aws jar to the hadoop dependencies. I was getting the following error during hadoop batch ingestion:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I fixed it by adding hadoop-aws-2.7.3.jar (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3) to the hadoop-dependencies/hadoop-client/2.7.3/ directory.
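
Roughly what I ran, from the Druid install directory (the URL is just the standard Maven Central layout for that artifact):

wget -P hadoop-dependencies/hadoop-client/2.7.3/ https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar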

Artem Moskvin

Nov 2, 2017, 6:43:07 AM
to Druid User
@Lawrence. Were you able to make it work with S3A? What region is your bucket in? I was only able to make it work with S3N in v2 regions. I can't make it work in v4 regions with either S3A or S3N.

biswa...@gmail.com

Nov 30, 2017, 5:30:43 PM
to Druid User
Is anyone using hdfs to push segments to S3 via s3a? What I noticed is that if I use hdfs, the config converter by default sets "segmentOutputPath" to the Hadoop path from the config, i.e. "segmentOutputPath" : "hdfs://<HADOOP>:9000", even though I have:

druid.storage.type=hdfs
druid.storage.bucket=druid-data
druid.storage.baseKey=druid_
druid.storage.useS3aSchema=True
druid.storage.storageDirectory=s3a://druid-data

~ Biswajit

biswa...@gmail.com

Dec 1, 2017, 2:40:06 PM
to Druid User
Bump.

Anyone have suggestions for S3A with Hadoop 2.7.3?

Mohanraj Naidu

Jan 10, 2018, 4:10:13 PM
to Druid User
For me it works with s3n.

Hope this helps.

Lawrence Huang

May 24, 2018, 7:54:28 PM
to Druid User
See my comment here for using s3a deep storage: