Error when trying to reindex to add dimension

Ben Vee

Aug 10, 2016, 12:46:13 PM
to Druid User
Hello, I have been experimenting with Druid for less than a day to figure out whether it might be interesting for an upcoming project.

So I loaded some data via batch file ingestion.

The data is in JSON format like this:

{"datetime":"2016-08-01T10:10:10","serial_no":"12321432","id":"1232423","type":"some_type","data":{"some":{"nested":"data"} }}

The data schema for the first run looked like this:

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "journal/journal-sample.json"
      }
    },
    "dataSchema" : {
      "dataSource" : "journal",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2016-07-19/2016-08-04"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : ["id","type","serial_no"]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "datetime"
          }
        }
      },
      "metricsSpec" : [
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0"]
}

This worked perfectly, but now I have decided to reindex my ingested data to add the data key as a dimension. (I am assuming the entire data was written to HDFS, since that is my deep storage, and that only the specified dimensions were indexed, so I can always reindex the entire data to add and query new dimensions - is this right?)

Actually, since only the data key was not indexed, I wanted to leave the dimensions empty in the dimensionsSpec so that all keys are used as dimensions. (This is the default behaviour according to the documentation.)

So this is my ingestion spec for reindexing (only the ioConfig was adjusted, following the documentation at http://druid.io/docs/0.9.1.1/ingestion/update-existing-data.html):

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "journal",
          "intervals" : ["2016-07-19/2016-08-04"]
        }
      }
    },
    "dataSchema" : {
      "dataSource" : "journal",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2016-07-19/2016-08-04"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : []
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "datetime"
          }
        }
      },
      "metricsSpec" : [
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0"]
}

However, when I run this ingestion I get this error:

2016-08-10T16:17:45,449 INFO [LocalJobRunner Map Task Executor #0] io.druid.indexer.hadoop.DatasourceRecordReader - load schema [{"dataSource":"journal","intervals":["2016-07-19T00:00:00.000Z/2016-08-04T00:00:00.000Z"],"segments":null,"filter":null,"granularity":{"type":"none"},"dimensions":["id","type","serial_no"],"metrics":[],"ignoreWhenNoSegments":true}]
2016-08-10T16:17:45,449 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2016-08-10T16:17:45,459 INFO [Thread-24] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2016-08-10T16:17:45,459 WARN [Thread-24] org.apache.hadoop.mapred.LocalJobRunner - job_local561690682_0001
java.lang.Exception: com.metamx.common.ISE: load schema does not have metrics
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: com.metamx.common.ISE: load schema does not have metrics
at io.druid.indexer.hadoop.DatasourceRecordReader.readAndVerifyDatasourceIngestionSpec(DatasourceRecordReader.java:185) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexer.hadoop.DatasourceRecordReader.initialize(DatasourceRecordReader.java:69) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:525) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]

I intentionally didn't add any metrics initially and I do not want to add any when reindexing. So is there anything else wrong with the schema for reindexing, or is my understanding wrong?

Also, why is the new dimension not added when I batch-ingest the data from my original file with an updated schema definition?
Apparently, added dimensions are ignored when the initial batch ingestion is run again for the same interval.

Fangjin Yang

Aug 15, 2016, 7:19:21 PM
to Druid User
Hi Ben,

You cannot have an empty set of metrics when ingesting data into Druid. For simplicity, can you add a count metric to the metricsSpec?
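
A minimal sketch of such a metricsSpec, using a count aggregator (the metric name "count" here is just an example):

  "metricsSpec" : [
    { "type" : "count", "name" : "count" }
  ]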

The added dimensions should definitely not get ignored when you are reindexing for the same time period. If the indexing succeeds, you should see a new segment with the new schema for the same time interval.