Snappy Compressed data ingestion through hadoop indexer is failing


Bikash

Mar 21, 2017, 8:02:03 AM
to Druid User
Hi,
My data resides in S3 as RC files generated with Snappy compression. When I try to ingest it into Druid through a Hadoop indexing job, I get the following error: the job running on the Hadoop EMR cluster is not able to parse the data.

2017-03-21T08:45:56,788 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Running job: job_1490079260199_0003
2017-03-21T08:46:02,900 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Job job_1490079260199_0003 running in uber mode : false
2017-03-21T08:46:02,902 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 0% reduce 0%
2017-03-21T08:46:20,632 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1490079260199_0003_m_000000_0, Status : FAILED
Error: com.metamx.common.RE: Failure on row[RCF  )org.apache.hadoop.io.compress.SnappyCodec���  hive.io.rcfile.column.number 77V:���n��s��:� � ��� '?� u�� '?� u��� �� �� � �	 � M  � <� 9� �  � �� 4  �  � �} �  �    �  �  �  �  �  �  �  � Q  �   ,� ;  �  �  � 3T{ � � � ��  ��  �  � � OH� � �  � /  ��  � U  �� 6@�  ��  �  �  �� � �  �   ]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:98)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:152)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:55)
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:96)
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:91)
	at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:106)
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:79)
	... 8 more
Caused by: java.lang.NullPointerException: Null timestamp in input: {imp=[RCF, null, )org.apache.hadoop.io.compress.SnappyCodec���,  hive.io.rcfile.column.number 77V:��...
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:46)
	... 12 more


1. The timestamps are valid and not null.
2. The format I have specified for the timestamp is yyyy-MM-dd HH:mm:ss, which is the format my timestamps are in.


The ingestion spec that I am using is:
{
	"type": "index_hadoop",
	"spec": {
		"dataSchema": {
			"dataSource": "dataSET_1",
			"parser": {
				"type": "string",
				"parseSpec": {
					"format": "tsv",
					"timestampSpec": {
						"column": "dt",
						"format": "yyyy-MM-dd HH:mm:ss"
					},
					"columns": ["the cloumns that i have used"],
					"delimiter": "\t",
					"dimensionsSpec": {
						"dimensions": [
							"includes the dimensions"
						],
						"dimensionExclusions": [],
						"spatialDimensions": []
					}
				}
			},
			"metricsSpec": [{
				"type": "count",
				"name": "count"
			}],
			"granularitySpec": {
				"type": "uniform",
				"segmentGranularity": "DAY",
				"queryGranularity": "NONE",
				"intervals": ["2017-02-28/2017-03-02"]
			}
		},
		"ioConfig": {
			"type": "hadoop",
			"inputSpec": {
				"type": "static",
				"paths": "s3n://path/where/data/is/located/"
			}
		},
		"tuningConfig": {
			"type": "hadoop",
			"jobProperties": {
				"fs.s3.awsAccessKeyId": "xx",
				"fs.s3.awsSecretAccessKey": "xx",
				"fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
				"fs.s3n.awsAccessKeyId": "xx",
				"fs.s3n.awsSecretAccessKey": "xx",
				"fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
				"io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
			}
		}
	}
}
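
One more observation: the failing row in the error above starts with "RCF ... SnappyCodec", which looks like the raw RCFile header, so it seems the tsv parseSpec is being fed the binary file contents as plain text lines, and that would explain the null timestamp. The Druid docs mention that a "static" inputSpec can also take an "inputFormat" class. Would something along the lines of the sketch below be the right direction? (The RCFile input format class name is only my guess, and I am not sure whether the string parser can handle the records it would emit.)

		"ioConfig": {
			"type": "hadoop",
			"inputSpec": {
				"type": "static",
				"inputFormat": "org.apache.hive.hcatalog.rcfile.RCFileMapReduceInputFormat",
				"paths": "s3n://path/where/data/is/located/"
			}
		},

If that is not workable, I could also try writing the data out as Snappy-compressed delimited text instead of RCFile, since the tsv parseSpec expects plain text lines.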




Anuj Singhania

Aug 24, 2017, 4:55:19 AM
to Druid User
Hi,

We are also facing the same issue. Did you find a solution?