Batch loading data from CSV file


Ranjit Iyer

Mar 17, 2015, 7:40:14 PM
to druid...@googlegroups.com
Hi,

I am struggling to craft a correct spec file for loading CSV data. Quite simply, I want to load data in the following format from a CSV file.

<app_id,time,name,version,num_starts,num_crashes>

Is this the correct format of the spec file?

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "AppInfo",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "column": "time"
          },
          "dimensionsSpec": {
            "dimensions": [
              "app_id",
              "time",
              "name",
              "version",
              "num_starts",
              "num_crashes"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "num_starts",
          "fieldName": "num_starts"
        },
        {
          "type": "doubleSum",
          "name": "num_crashes",
          "fieldName": "num_crashes"
        }                                                
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2010-01-01/2015-03-15"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "dataFolder/",
        "filter": "appdata.csv",
        "parser" : {
          "timestampSpec" : {
            "column" : "time"
          },
          "data" : {
            "format" : "csv",
            "columns" : ["app_id","time","name","version","num_starts","num_crashes"],
            "dimensions" : ["app_id","time","name","version","num_starts","num_crashes"]
          }
        }
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 0
    }
  }
}

Thanks,
Ranjit
(attachment: appinfo.json)

Ranjit Iyer

Mar 17, 2015, 7:54:43 PM
to druid...@googlegroups.com
When I submit the indexing task, I get the following error:

<html>

<head>

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>

<title>Error 500 </title>

</head>

<body>

<h2>HTTP ERROR: 500</h2>

<p>Problem accessing /druid/indexer/v1/task. Reason:

<pre>    javax.servlet.ServletException: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class io.druid.data.input.impl.CSVParseSpec] value failed: columns</pre></p>

<hr /><i><small>Powered by Jetty://</small></i>

</body>

</html>

Fangjin

Mar 17, 2015, 8:05:45 PM
to Ranjit Iyer, druid...@googlegroups.com
Hi Ranjit:
Here is an example index task:

Please read this to understand how to construct a data schema:

You are missing a "columns" field.
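For reference, a CSV parseSpec needs a "columns" list naming every field of each row, in order; the parser uses it to map values to names. A minimal sketch using the field names from the spec above (illustrative only; adjust to the actual data):

```json
"parseSpec": {
  "format": "csv",
  "timestampSpec": {
    "column": "time",
    "format": "auto"
  },
  "columns": ["app_id", "time", "name", "version", "num_starts", "num_crashes"],
  "dimensionsSpec": {
    "dimensions": ["app_id", "name", "version"],
    "dimensionExclusions": [],
    "spatialDimensions": []
  }
}
```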

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/39bc5908-276a-4ab1-b8a8-eeb27d973393%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Ranjit Iyer

Mar 18, 2015, 1:47:28 PM
to druid...@googlegroups.com, ranji...@gmail.com
Hi Fangjin,

Thanks for your response. I am running into an error after adding 'columns'. Perhaps I am off by something tiny here, and I thought it would help to have another pair of eyes look at it. If you could be so kind as to take a look at my spec, I'd highly appreciate it.

Spec
====
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "AppInfo",
      "parser": {
        "parseSpec": {
          "type":"csv",
          "timestampSpec": {
            "column": "time",
            "format":"auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "app_id",
              "time",
              "name",
              "version",
              "num_starts",
              "num_crashes"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          },
          "columns" : ["app_id","time","name","version","num_starts","num_crashes"]
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "num_starts",
          "fieldName": "num_starts"
        },
        {
          "type": "doubleSum",
          "name": "num_crashes",
          "fieldName": "num_crashes"
        }                                                
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2010-01-01/2015-03-15"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "datamart/",
        "filter": "appinfo.csv",
        "parser" : {
          "timestampSpec" : {
            "column" : "time"
          },
          "data" : {
            "format" : "csv",
            "columns" : ["app_id","time","name","version","num_starts","num_crashes"],
            "dimensions" : ["app_id","time","name","version","num_starts","num_crashes"]
          }
        }
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 0
    }
  }
}

Error
====

<p>Problem accessing /druid/indexer/v1/task. Reason:

<pre>    javax.servlet.ServletException: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class io.druid.data.input.impl.StringInputRowParser] value failed: null</pre></p>

Ranjit Iyer

Mar 18, 2015, 2:09:58 PM
to druid...@googlegroups.com, ranji...@gmail.com
I got past that error after I removed the "data" section from "firehose".

Now, I am running into a timestamp issue. Does the timestamp have to be the first column in the data? My data looks like this:

"app_a","2013-08-31T01:02:33Z","A",1,2,1
"app_b","2013-09-31T01:02:33Z","B",1,2,1
"app_c","2013-07-31T01:02:33Z","C",1,2,1
"app_d","2013-06-31T01:02:33Z","D",1,2,1

But the Indexer seems to have taken the full line as the app_id.

Null timestamp in input: {app_id=app_a,2013-08-31T01:02:33Z,A,1,2,1}

Caused by: java.lang.NullPointerException: Null timestamp in input: {app_id=app_a,2013-08-31T01:02:33Z,A,1,2,1}
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:46) ~[druid-api-0.3.4.jar:0.3.4]
	... 11 more
2015-03-18T18:03:41,860 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_AppInfo_2015-03-18T18:03:32.070Z",
  "status" : "FAILED",
  "duration" : 62
}

Thanks,
Ranjit

Ranjit Iyer

Mar 18, 2015, 3:38:19 PM
to druid...@googlegroups.com, ranji...@gmail.com
I reduced it down further but still get date parse errors.


Caused by: java.lang.IllegalArgumentException: Invalid format: "2015-08-31T01:02:33Z,"app_a"" is malformed at ","app_a""
	at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[joda-time-2.6.jar:2.6]

CSV
====
2015-08-31T01:02:33Z,"app_a"
Spec
===
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "AppInfo",
      "parser": {
        "parseSpec": {
          "type":"csv",
          "timestampSpec": {
            "column": "time"
          },
          "dimensionsSpec": {
            "dimensions": [
              "app_id"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          },
          "columns" : ["time","app_id"]
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        }                                                
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2010-01-01/2015-03-15"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "datamart/",
        "filter": "appinfo.csv"
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 0
    }
  }
}

Fangjin

Mar 18, 2015, 4:06:15 PM
to Ranjit Iyer, druid...@googlegroups.com
"2015-08-31T01:02:33Z,"app_a" is missing a "


Ranjit Iyer

Mar 18, 2015, 4:36:56 PM
to druid...@googlegroups.com, ranji...@gmail.com
Although the error seems to say I'm missing quotes, I don't have them right now around the date in the CSV file. Do I need to have double quotes around the date?

CSV
====
2015-08-31T01:02:33Z,"app_a"
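(As a quick sanity check, not part of the original thread: that line is valid CSV on its own. Python's standard csv module, for instance, splits it into exactly two fields, so the quoting itself is not the problem here.)

```python
import csv
import io

# The exact line from the CSV file above.
line = '2015-08-31T01:02:33Z,"app_a"'

# Parse it with Python's standard CSV reader.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['2015-08-31T01:02:33Z', 'app_a']
```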

Fangjin

Mar 18, 2015, 5:32:48 PM
to Ranjit Iyer, druid...@googlegroups.com
Can you include a line of your csv data?

Ranjit Iyer

Mar 18, 2015, 5:39:26 PM
to druid...@googlegroups.com, ranji...@gmail.com
Sure,

CSV File (Just one line)
====
2015-08-31T01:02:33Z,"app_a"

Spec File
====
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "AppInfo",
      "parser": {
        "parseSpec": {
          "type":"csv",
          "timestampSpec": {
            "column": "time",
            "format":"auto"
          },
          "dimensionsSpec": {
            "dimensions": [              
              "time",

Task Error
====

com.metamx.common.parsers.ParseException: Unparseable timestamp found!
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:55) ~[druid-api-0.3.4.jar:0.3.4]
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:96) ~[druid-api-0.3.4.jar:0.3.4]
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:91) ~[druid-api-0.3.4.jar:0.3.4]
	at io.druid.data.input.impl.FileIteratingFirehose.nextRow(FileIteratingFirehose.java:54) ~[druid-api-0.3.4.jar:0.3.4]
	at io.druid.indexing.common.task.IndexTask.getDataIntervals(IndexTask.java:207) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:164) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:235) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:214) [druid-indexing-service-0.7.0.jar:0.7.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_31]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_31]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_31]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]
Caused by: java.lang.IllegalArgumentException: Invalid format: "2015-08-31T01:02:33Z,"app_a"" is malformed at ","app_a""
	at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[joda-time-2.6.jar:2.6]
	at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:780) ~[joda-time-2.6.jar:2.6]
	at org.joda.time.convert.StringConverter.getInstantMillis(StringConverter.java:65) ~[joda-time-2.6.jar:2.6]
	at org.joda.time.base.BaseDateTime.<init>(BaseDateTime.java:175) ~[joda-time-2.6.jar:2.6]
	at org.joda.time.DateTime.<init>(DateTime.java:257) ~[joda-time-2.6.jar:2.6]
	at com.metamx.common.parsers.TimestampParser$1.apply(TimestampParser.java:46) ~[java-util-0.26.14.jar:?]
	at com.metamx.common.parsers.TimestampParser$1.apply(TimestampParser.java:38) ~[java-util-0.26.14.jar:?]
	at io.druid.data.input.impl.TimestampSpec.extractTimestamp(TimestampSpec.java:51) ~[druid-api-0.3.4.jar:0.3.4]
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:43) ~[druid-api-0.3.4.jar:0.3.4]
	... 11 more
2015-03-18T21:39:00,645 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_AppInfo_2015-03-18T21:38:51.001Z",
  "status" : "FAILED",
  "duration" : 62 
} 

Fangjin

Mar 18, 2015, 5:57:16 PM
to Ranjit Iyer, druid...@googlegroups.com
instead of "type":"csv", can you use "format":"csv"?
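In other words, the parseSpec would look roughly like this (a sketch based on the reduced spec earlier in the thread; the only change is the key "type" becoming "format"):

```json
"parseSpec": {
  "format": "csv",
  "timestampSpec": {
    "column": "time"
  },
  "dimensionsSpec": {
    "dimensions": ["app_id"],
    "dimensionExclusions": [],
    "spatialDimensions": []
  },
  "columns": ["time", "app_id"]
}
```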


Ranjit Iyer

Mar 18, 2015, 6:01:47 PM
to druid...@googlegroups.com, ranji...@gmail.com
That fixed it! Thanks so much! My sincere apologies for this newbie mistake. I got confused between 'type' and 'format'.

Ranjit Iyer

Mar 18, 2015, 6:06:02 PM
to druid...@googlegroups.com, ranji...@gmail.com
I think I got confused because the 'CSVParser' section in this doc http://druid.io/docs/latest/Ingestion.html mentions 'type'. Not sure if I interpreted this incorrectly or if the doc needs to be updated.

Fangjin

Mar 18, 2015, 8:22:40 PM
to Ranjit Iyer, druid...@googlegroups.com
Hi Ranjit, this is a bug in the documentation. Great catch. Fixing now.



Ranjit Iyer

Mar 19, 2015, 5:57:20 PM
to druid...@googlegroups.com, ranji...@gmail.com
Hey Fangjin,

I am loading a ~700MB CSV and the Indexer has been running for a good 5 hours now but hasn't completed. I see the logs getting updated, but I'm not sure whether the loading is complete. What should I look for to know how close it is to finishing?

Thanks,
Ranjit

Fangjin Yang

Mar 19, 2015, 8:44:26 PM
to druid...@googlegroups.com, ranji...@gmail.com
Hi Ranjit, that is very slow, even for the index task. How do you have your peons set up?

Ranjit Iyer

Mar 20, 2015, 1:28:42 AM
to druid...@googlegroups.com, ranji...@gmail.com
I only ran the Indexer. The Middle Manager and Peons were not running during this time. Do they need to be explicitly started?

Nishant Bangarwa

Mar 20, 2015, 4:07:47 AM
to Ranjit Iyer, druid...@googlegroups.com
Hi Ranjit, 
you seem to be running with an interval of around 5 years at daily granularity; that will generate around 1800 segments for 700MB of data.
The IndexTask is not well optimized for generating lots of small segments.
We usually suggest segment sizes around 500MB.
Can you try changing the segment granularity to YEAR or MONTH?
Also, specifying the exact interval for which the data is being ingested helps.
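Applied to the spec above, that would mean a granularitySpec along these lines (the interval below is illustrative; it should match the actual span of the data being ingested):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "MONTH",
  "queryGranularity": "NONE",
  "intervals": ["2013-01-01/2015-03-15"]
}
```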

