Hi,
My data resides in S3 in RC file format, Snappy-compressed. When I try to ingest it into Druid through a Hadoop job, the indexing job running on the Hadoop EMR cluster is not able to parse it and fails with the following error:
2017-03-21T08:45:56,788 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Running job: job_1490079260199_0003
2017-03-21T08:46:02,900 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Job job_1490079260199_0003 running in uber mode : false
2017-03-21T08:46:02,902 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - map 0% reduce 0%
2017-03-21T08:46:20,632 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1490079260199_0003_m_000000_0, Status : FAILED
Error: com.metamx.common.RE: Failure on row[RCF )org.apache.hadoop.io.compress.SnappyCodec��� hive.io.rcfile.column.number 77V:���n��s��:� � ��� '?� u�� '?� u��� �� �� � � � M � <� 9� � � �� 4 � � �} � � � � � � � � � � Q � ,� ; � � � 3T{ � � � �� �� � � � OH� � � � / �� � U �� 6@� �� � � �� � � � ]
at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:98)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:152)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!
at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:55)
at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:96)
at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:91)
at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:106)
at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:79)
... 8 more
Caused by: java.lang.NullPointerException: Null timestamp in input: {imp=[RCF, null, )org.apache.hadoop.io.compress.SnappyCodec���, hive.io.rcfile.column.number 77V:��...
at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:46)
... 12 more
1. The timestamps in the data are proper and not null.
2. The format I have specified for the timestamp is yyyy-MM-dd HH:mm:ss, which is exactly the format my timestamps are in.
3. The row shown in the error does not look like one of my data rows; it looks like the RCFile header itself (RCF ... SnappyCodec ... hive.io.rcfile.column.number) being handed to the parser as text.
The ingestion spec that I am using is:
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "dataSET_1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "timestampSpec": {
            "column": "dt",
            "format": "yyyy-MM-dd HH:mm:ss"
          },
          "columns": ["the columns that I have used"],
          "delimiter": "\t",
          "dimensionsSpec": {
            "dimensions": [
              "includes the dimensions"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2017-02-28/2017-03-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://path/where/data/is/located/"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "fs.s3.awsAccessKeyId": "xx",
        "fs.s3.awsSecretAccessKey": "xx",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "xx",
        "fs.s3n.awsSecretAccessKey": "xx",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
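For what it is worth, the Druid Hadoop batch ingestion docs suggest the "hadoopyString" parser type for Hadoop indexing tasks rather than "string". Below is a minimal sketch of the parser block as I understand it (the column and dimension names are placeholders), though I doubt this alone lets a TSV parseSpec read a binary RC file:

"parser": {
  "type": "hadoopyString",
  "parseSpec": {
    "format": "tsv",
    "timestampSpec": {
      "column": "dt",
      "format": "yyyy-MM-dd HH:mm:ss"
    },
    "delimiter": "\t",
    "columns": ["dt", "other columns"],
    "dimensionsSpec": {
      "dimensions": ["the dimensions"]
    }
  }
}

Is that the right direction, or can the string/TSV parser simply not read RC files at all, meaning I would have to export the data as plain delimited text (or another supported format) before ingesting?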