HadoopDruidIndexerMain Error (No suitable partitioning dimension found!)


shazam

May 15, 2013, 10:14:02 PM
to druid-de...@googlegroups.com
Hi ,

I am trying to use HadoopDruidIndexer for batch ingestion from a CSV file. I have installed Hadoop on Ubuntu in single-node cluster mode. According to the wiki, we can leave the "partitionDimension" option blank for the Index node. But when I run HadoopDruidIndexer I see the error below:


attempt_201305151342_0016_r_000000_0: com.metamx.common.ISE: No suitable partitioning dimension found!
attempt_201305151342_0016_r_000000_0: at com.metamx.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionReducer.innerReduce(DeterminePartitionsJob.java:652)
attempt_201305151342_0016_r_000000_0: at com.metamx.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionBaseReducer.reduce(DeterminePartitionsJob.java:444)
attempt_201305151342_0016_r_000000_0: at com.metamx.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionBaseReducer.reduce(DeterminePartitionsJob.java:417)
attempt_201305151342_0016_r_000000_0: at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
attempt_201305151342_0016_r_000000_0: at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
attempt_201305151342_0016_r_000000_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
attempt_201305151342_0016_r_000000_0: at org.apache.hadoop.mapred.Child.main(Child.java:170)
attempt_201305151342_0016_r_000000_0: 2013-05-15 14:52:13,016 INFO [main] org.apache.hadoop.mapred.TaskRunner - Runnning cleanup for the task
2013-05-15 14:52:28,660 INFO [main] org.apache.hadoop.mapred.JobClient - Task Id : attempt_201305151342_0016_r_000000_1, Status : FAILED
attempt_20

Here is the config that I am using:

{
  "dataSource":"customer",
  "timestampColumn": "ts",
  "timestampFormat": "auto",
  "dataSpec": {
    "format": "csv",
    "columns": [ "ts","City", "State", "ZipCode", "Country"],
    "dimensions": ["City", "State"]
  },
  "granularitySpec": {
    "type":"uniform",
    "intervals":["2013-05-10/2013-05-11"],
    "gran":"day"
  },
  "pathSpec": { "type": "granularity",
                "dataGranularity": "year",
                "inputPath": "hdfs://localhost:54310/home/test/app/hadoop/tmp/customer",
                "filePattern": ".*" },
  "rollupSpec": { "aggs": [
                    { "type": "count", "name":"event_count" },
                    { "type": "count", "fieldName": "ZipCode", "name": "revenue" }
                  ],
                  "rollupGranularity": "minute"},
  "workingPath": "/home/sharmin/app/hadoop/tmp",
  "segmentOutputPath": "hdfs://localhost:54310/home/test/app/hadoop/tmp/customer/output",
  "leaveIntermediate": "false",
  "partitionsSpec": {
    "targetPartitionSize": 5000000
  },
  "updaterJobSpec": {
    "type":"db",
    "connectURI":"jdbc:mysql://10.244.198.252:3306/druid",
    "user":"root",
    "password":"root",
    "segmentTable":"prod_segments"
  }
}

Not sure what I am missing here. Any help appreciated.

Thanks,
Sharmin

Fangjin Yang

May 16, 2013, 2:14:49 PM
to druid-de...@googlegroups.com
Hi Sharmin,

This error can occur if Druid tries to create a shard that is more than 50% larger than the target partition size, or if it is attempting to partition on a multi-value column. I imagine you are probably encountering the first case. A common cause of that problem is a single-valued dimension that has a lot of nulls: those nulls all end up in the same shard and push it past the target partition size. A possible solution is to increase your target partition size to something greater than 5M rows.
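For example, a sketch of the kind of change I mean (the 10,000,000 figure below is just an arbitrary illustration, not a tuned recommendation):

  "partitionsSpec": {
    "targetPartitionSize": 10000000
  }

Alternatively, if you know one of your dimensions is well populated (few nulls), you could set the "partitionDimension" option explicitly instead of leaving it blank, per the wiki.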

One way we can more accurately determine the root cause is if you send us the reducer logs for the dim-selection job. There may be a lot of reducers, but one of them is actually doing the dim selection; its logs should have messages about trying to find a dimension to partition on.

I also have some comments in line.

--FJ

Regarding the { "type": "count", "fieldName": "ZipCode", "name": "revenue" } aggregator in your rollupSpec: ZipCode appears to be a dimension you want to filter on, not a metric you want to aggregate over. I notice that in your list of columns you did not include the metrics you wanted. The specification you may be looking for here is:
{"type":"doubleSum", "fieldName":"revenue", "name":"revenue"}
You will also need to include the revenue metric as part of your column schema earlier on. The specification you have right now will create a count metric called revenue. The result of any aggregations over this metric will be the exact same as the event_count metric.
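To make that concrete, the relevant parts of the spec would look roughly like the sketch below. Note the assumption here: your CSV would need to actually carry a numeric revenue value as an extra column; if it does not, there is nothing to doubleSum and you should leave that aggregator out.

  "dataSpec": {
    "format": "csv",
    "columns": [ "ts", "City", "State", "ZipCode", "Country", "revenue" ],
    "dimensions": ["City", "State"]
  },
  ...
  "rollupSpec": {
    "aggs": [
      { "type": "count", "name": "event_count" },
      { "type": "doubleSum", "fieldName": "revenue", "name": "revenue" }
    ],
    "rollupGranularity": "minute"
  }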

shazam

May 16, 2013, 3:09:51 PM
to druid-de...@googlegroups.com
Hi Fangjin,

Thank you very much for replying. I am attaching the index node console log and the reducer log. I am only trying to upload a very small amount of data, just for testing.

Thanks,
Sharmin
IndexNodeConsole.log
attempt_201305151342_0016_r_000000_1.log

Fangjin Yang

May 16, 2013, 4:23:18 PM
to druid-de...@googlegroups.com
Hi Shazam,

The logs are unclear as to what exactly is wrong. Druid may be dropping events because the timestamps of the events in the data set are outside the range specified in the schema. You can try running the indexer with -Duser.timezone=UTC. You can also try removing the targetPartitionSize entirely; I believe that will give us more meaningful exceptions about what is happening. Also, you can send me a sample of the data you are feeding into the indexer and I will see if anything seems out of the ordinary.
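For reference, the kind of invocation I mean is something along these lines; the classpath and spec-file path are placeholders for whatever you are already running with, not exact values:

  java -Duser.timezone=UTC -cp <your_druid_classpath> com.metamx.druid.indexer.HadoopDruidIndexerMain <path_to_spec_file>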

Thanks!
FJ



shazam

May 17, 2013, 1:30:20 PM
to druid-de...@googlegroups.com
Hi FJ,

Thank you for your help. It looks like the issue was indeed "the timestamps of the events in the data set are outside the range specified in the schema". After fixing the timestamps, I am not getting the partitioning issue any more. But now I am getting a NullPointerException from line 173 of com.metamx.druid.indexer.IndexGeneratorJob. I added some log statements and found that "fs.listStatus(descriptorInfoDir)" is actually returning null.

For me, the descriptorInfoDir is "hdfs://localhost:54310/..../2013-05-17T094933.554-0700/segmentDescriptorInfo". In its parent directory, "hdfs://localhost:54310/..../2013-05-17T094933.554-0700/", three other directories are created (_logs, groupedData, and part-m-00000) after running the indexer job, but there is no segmentDescriptorInfo directory.

In my specFile I have these directories specified:

 "workingPath": "hdfs://localhost:54310/home/test/app/hadoop/tmp"

 "inputPath": "hdfs://localhost:54310/home/test/app/hadoop/tmp/customer"

"segmentOutputPath": "hdfs://localhost:54310/home/test/app/hadoop/tmp/customer_output"

And the data in the CSV file is in this format:

2013-05-10T00:14:00.000Z,Bland,VA,24315,US
2013-05-10T00:14:00.000Z,Las Vegas,NM,87701,US
2013-05-10T00:14:00.000Z,Enon,OH,45323,US


Am I missing something in the specFile? I got the same NullPointerException after removing the targetPartitionSize as well.

Thank you!

Fangjin Yang

May 17, 2013, 2:14:24 PM
to druid-de...@googlegroups.com
Hi Shazam,

What version of Druid are you using? We have been investigating the problem and there is a minor bug in the HadoopDruidIndexer that appears to be the exact issue you are facing; it exists in versions 0.4.9 to 0.4.14. We are in the process of making a new stable, 0.4.12.1, to address this problem. I'll let you know as soon as it is ready.

Thanks for your patience!
FJ


shazam

May 17, 2013, 2:47:56 PM
to druid-de...@googlegroups.com
Hi FJ,

I see the 0.4.9-SNAPSHOT version in the pom. Looks like I will have to wait for the 0.4.12.1 release.

Thank You!

Fangjin Yang

May 17, 2013, 6:44:46 PM
to druid-de...@googlegroups.com
Hi Shazam, our latest stable is 0.4.12.2.

I notice you are running a SNAPSHOT version; are you running Druid directly from the master branch? If so, you may want to update your master to the latest version. The latest SNAPSHOT will have the fixes in place, but a lot of new features were added as well, so it may be worthwhile to work with the stable branch/tag instead. Please let me know about any developments on your side.

Thanks again,
FJ

