Index CSV Multiple Columns for TimestampSpec


SS
Jul 7, 2016, 3:18:21 AM
to Druid User
Hi all,

I am new to Druid, but for an OLAP application we need fast querying over big data. I have a 25 GB CSV file with month-on-month volume data spanning four years. Can I give two columns (month and year) in the timestampSpec of the indexing spec so that Druid indexes over both?

My CSV has the following format:

year,month,volume,store_id,product_id
2015,jul,2.113,22653,45


My index.json is as follows:

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "../../data/volume-data.csv"
      }
    },
    "dataSchema" : {
      "dataSource" : "volume-data",
      "listDelimiter" : ",",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "month",
        "queryGranularity" : "none",
        "intervals" : ["2011-09-12/2016-09-13"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "columns" : ["year","month","volume","store_id","product_id"],
          "dimensionsSpec" : {
            "dimensions" : [
              "store_id",
              "product_id",
              "month",
              "year"
            ]
          },
          "timestampSpec" : [{
            "format" : "YYYY",
            "column" : "year"
          },
          {
            "format" : "mmm",
            "column" : "month"
          }]
        }
      },
      "metricsSpec" : [
        {
          "name" : "volume",
          "type" : "doubleSum",
          "fieldName" : "volume"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  }
}



I get the following exception in the logs:
2016-07-07T07:07:58,660 WARN [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - job_local1102702440_0001
java.lang.Exception: java.lang.IllegalArgumentException: Can not deserialize instance of io.druid.data.input.impl.TimestampSpec out of START_ARRAY token
 at [Source: N/A; line: -1, column: -1]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: java.lang.IllegalArgumentException: Can not deserialize instance of io.druid.data.input.impl.TimestampSpec out of START_ARRAY token
 at [Source: N/A; line: -1, column: -1]
	at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:2774) ~[jackson-databind-2.4.6.jar:2.4.6]
	at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:2700) ~[jackson-databind-2.4.6.jar:2.4.6]
	at io.druid.segment.indexing.DataSchema.getParser(DataSchema.java:101) ~[druid-server-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexer.HadoopDruidIndexerConfig.verify(HadoopDruidIndexerConfig.java:567) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexer.HadoopDruidIndexerConfig.fromConfiguration(HadoopDruidIndexerConfig.java:209) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineHashedPartitionsPartitioner.setConf(DetermineHashedPartitionsJob.java:399) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]


Is the only option to merge the month and year columns into a new timestamp column, or can I use the two directly?

SS
Jul 7, 2016, 5:56:09 AM
to Druid User
Fixed it.

You cannot use two columns: timestampSpec expects a single object pointing at one column, which is why the array form fails to deserialize. I had to create a date column out of month and year, and it works now.
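
For anyone who lands on the same error, here is a minimal sketch of what the reworked parseSpec could look like. It assumes the CSV has been preprocessed so that the year and month columns are merged into a single ISO-8601 column (hypothetically named "date", e.g. 2015-07-01); the column name and dimension list below are illustrative, not taken from the original post.

"parseSpec" : {
  "format" : "csv",
  "columns" : ["date","volume","store_id","product_id"],
  "dimensionsSpec" : {
    "dimensions" : ["store_id","product_id"]
  },
  "timestampSpec" : {
    "format" : "auto",
    "column" : "date"
  }
}

With "auto", Druid detects ISO-8601 strings (or millis) in the timestamp column. If you instead keep the raw "2015" and "jul" values concatenated into one column, you would likely need a matching Joda-style pattern (something like "yyyy-MMM"), so preprocessing to ISO-8601 is the simpler and safer route.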