NULL Pointer Exception concerning hadoop ingestion to HDFS deep storage


PE Montabrun

Nov 25, 2014, 5:17:33 AM
to druid-de...@googlegroups.com
Hi guys,

I was testing ingestion from HDFS into HDFS deep storage and ran into this issue with the Hadoop index generator job.
I was working through the wikipedia example on the 0.6.160 release. Is this a bug or a misconfiguration?

Error: com.metamx.common.RE: Failure on row[{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:81)
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:33)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NullPointerException
	at io.druid.indexer.HadoopDruidIndexerConfig.getBucket(HadoopDruidIndexerConfig.java:343)
	at io.druid.indexer.IndexGeneratorJob$IndexGeneratorMapper.innerMap(IndexGeneratorJob.java:228)
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:77)
	... 9 more

my task (using the wikipedia example):

{
    "type" : "index_hadoop",
  "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.4.0"],
    "config" : {
      "dataSource" : "wikipedia",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dataSpec" : {
        "format" : "json",
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      },
      "granularitySpec" : {
        "type" : "uniform",
        "gran" : "DAY",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      },
      "pathSpec" : {
        "type" : "static",
        "paths" : "hdfs://someserver.com:8020/logs/test.json"
      },
      "targetPartitionSize" : 5000000,
      "rollupSpec" : {
        "aggs": [
          {
            "type" : "count",
            "name" : "count"
          },
          {
            "type" : "doubleSum",
            "name" : "added",
            "fieldName" : "added"
          },
          {
            "type" : "doubleSum",
            "name" : "deleted",
            "fieldName" : "deleted"
          },
          {
            "type" : "doubleSum",
            "name" : "delta",
            "fieldName" : "delta"
          }
        ],
        "rollupGranularity" : "none"
      }
    }
}

Thanks a lot,

Pierre-Edouard

Charles Allen

Nov 25, 2014, 12:13:58 PM
to druid-de...@googlegroups.com
Config for jobs has undergone a LOT of internal changes recently. Some of the docs might not properly reflect these changes yet.

The three non-deprecated items for the Hadoop config are:
@JsonProperty("dataSchema") DataSchema dataSchema,
@JsonProperty("ioConfig") HadoopIOConfig ioConfig,
@JsonProperty("tuningConfig") HadoopTuningConfig tuningConfig,


Please try the following config (modifying for your DB and your input/output paths of course):
{
  "dataSchema" : {
    "dataSource" : "wikipedia",
    "timestampSpec" : {
      "column" : "timestamp",
      "format" : "auto"
    },
    "parser" : {
      "parseSpec" : {
        "format" : "json",
        "timestampSpec":{
          "column":"timestamp",
          "format":"iso"
        },
        "dimensionsSpec" : {
          "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
        }
      }
    },
    "granularitySpec" : {
      "type" : "uniform",
      "gran" : "DAY",
      "intervals" : [ "2013-08-31/2013-09-01" ]
    },
    "metricsSpec" : [
      {
        "type" : "count",
        "name" : "count"
      },
      {
        "type" : "doubleSum",
        "name" : "added",
        "fieldName" : "added"
      },
      {
        "type" : "doubleSum",
        "name" : "deleted",
        "fieldName" : "deleted"
      },
      {
        "type" : "doubleSum",
        "name" : "delta",
        "fieldName" : "delta"
      }
    ]
  },
  "ioConfig" : {
    "inputSpec" : {
      "paths" : "/Users/charlesallen/bin/wrk/testbad.json",
      "type" : "static"
    },
    "metadataUpdateSpec" : {
      "type":"mysql",
      "connectURI" : "jdbc:mysql://localhost:3306/druid",
      "password" : "diurd",
      "segmentTable" : "druid_segments",
      "user" : "druid"
    },
    "segmentOutputPath" : "/Users/charlesallen/bin/wrk/data/index/output",
    "type" : "hadoop"
  },
  "tuningConfig" : {
    "workingPath" : "/tmp",
    "bufferSize" : 20971520,
    "rowFlushBoundary" : 500000,
    "type" : "hadoop"
  }
}

Charles Allen

Nov 25, 2014, 12:15:52 PM
to druid-de...@googlegroups.com
You should be able to see a line like:

2014-11-25 17:11:53,147 INFO [pool-4-thread-1] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:

which reflects your configuration back to you. That should give you hints about what may be missing as your ingestion specs get more complex.

Fangjin Yang

Nov 25, 2014, 12:22:40 PM
to druid-de...@googlegroups.com

Charles, this is 0.6.160.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/ce7b70af-4f6c-4b48-8aa4-d30bc74949f2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fangjin Yang

Nov 25, 2014, 12:48:28 PM
to druid-de...@googlegroups.com
Hi,

I think the problem here is timezones. Can you make sure you have the JVM argument -Duser.timezone=UTC set on both Druid and your external Hadoop cluster, if that is where you are running the batch job?
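To see why a non-UTC mapper JVM can break bucketing, here is a small standalone sketch. It is not Druid code (Druid 0.6.x used Joda-Time internally; this uses java.time purely for illustration), but it shows the failing row's timestamp landing on a different calendar day depending on the JVM's timezone:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class TimezoneBucketDemo {
    public static void main(String[] args) {
        // Timestamp from the failing wikipedia row.
        Instant ts = Instant.parse("2013-08-31T01:02:33Z");

        // Interpreted in UTC, the event falls on 2013-08-31,
        // inside the task's interval 2013-08-31/2013-09-01.
        LocalDate utcDay = ts.atZone(ZoneId.of("UTC")).toLocalDate();

        // Interpreted in a zone behind UTC (here US Pacific, UTC-7 in
        // August), it falls on 2013-08-30, outside the interval --
        // so no bucket is found for the row.
        LocalDate pacificDay = ts.atZone(ZoneId.of("America/Los_Angeles")).toLocalDate();

        System.out.println("UTC day:     " + utcDay);     // 2013-08-31
        System.out.println("Pacific day: " + pacificDay); // 2013-08-30
    }
}
```

A row whose computed day falls outside every configured interval has no bucket, which matches the NullPointerException in HadoopDruidIndexerConfig.getBucket above.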


PE Montabrun

Nov 26, 2014, 9:50:56 AM
to druid-de...@googlegroups.com
Hi,

It was indeed a UTC problem; setting these two properties in mapred-site.xml solved the issue (from http://druid.io/docs/latest/Hadoop-Configuration.html):
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
</property>

Thanks,

Pierre-Edouard Montabrun
