Configure Druid to parse JSON files with nested structures - failing


Scott Kinney

Jun 2, 2016, 5:59:08 PM
to Druid User
New to Druid.
I'd like to query lots of JSON files gzipped in S3, but I'm testing on a small local sample file first.
The JSON has a few nested levels.

Looks like...

{
    "arr": [
        {
            "data": [
                {
                    "delta_t": 1,
                    "f": 60,
                    "i": [
                        -1,
                        -1,
                        -1
                    ],
                    "kw": [
                        68.948,
                        79.242,
                        67.05
                    ],
                    "orig_t": "2015-07-28T15:19:18.769",
                    "t": "2015-07-28T15:19:18.769",
                    "v": [
                        -1,
                        -1,
                        -1
                    ]
                }
            ],
            "id": "this-that-the-pther"
        }
    ],
    "ver": "1.0"
}
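
What I'm after is basically one flat row per file, roughly like this (field names match the spec below):

    {
        "t": "2015-07-28T15:19:18.769",
        "orig_t": "2015-07-28T15:19:18.769",
        "delta_t": 1,
        "f": 60,
        "id": "this-that-the-pther",
        "ver": "1.0",
        "v_0": -1, "v_1": -1, "v_2": -1,
        "i_0": -1, "i_1": -1, "i_2": -1,
        "kw_0": 68.948, "kw_1": 79.242, "kw_2": 67.05
    }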

I configured a job schema like...

{
    "type" : "index_hadoop",
    "spec" : {
        "ioConfig" : {
            "type" : "hadoop",
            "inputSpec" : {
                "type" : "static",
                "paths" : "/home/ubuntu/datawarehouse/data.json"
            }
        },
        "dataSchema" : {
            "dataSource" : "test-job",
            "granularitySpec" : {
                "type" : "uniform"
            },
            "parser" : {
                "type" : "string",
                "parseSpec": {
                    "format": "json",
                    "flattenSpec": {
                        "useFieldDiscovery": true,
                        "fields": [
                            {
                                "type": "nested",
                                "name": "id",
                                "expr": "$.arr.id"
                            },
                            {
                                "type": "nested",
                                "name": "t",
                                "expr": "$.arr.data.t"
                            },
                            {
                                "type": "nested",
                                "name": "orig_t",
                                "expr": "$.arr.data.orig_t"
                            },
                            {
                                "type": "nested",
                                "name": "f",
                                "expr": "$.arr.data.f"
                            },
                            {
                                "type": "nested",
                                "name": "v_0",
                                "expr": "$.arr.data.v[0]"
                            },
                            {
                                "type": "nested",
                                "name": "v_1",
                                "expr": "$.arr.data.v[1]"
                            },
                            {
                                "type": "nested",
                                "name": "v_2",
                                "expr": "$.arr.data.v[2]"
                            },
                            {
                                "type": "nested",
                                "name": "i_0",
                                "expr": "$.arr.data.i[0]"
                            },
                            {
                                "type": "nested",
                                "name": "i_1",
                                "expr": "$.arr.data.i[1]"
                            },
                            {
                                "type": "nested",
                                "name": "i_2",
                                "expr": "$.arr.data.i[2]"
                            },
                            {
                                "type": "nested",
                                "name": "kw_0",
                                "expr": "$.arr.data.kw[0]"
                            },
                            {
                                "type": "nested",
                                "name": "kw_1",
                                "expr": "$.arr.data.kw[1]"
                            },
                            {
                                "type": "nested",
                                "name": "kw_2",
                                "expr": "$.arr.data.kw[2]"
                            },
                            {
                                "type": "nested",
                                "name": "delta_t",
                                "expr": "$.arr.data.delta_t"
                            }
                        ]
                    },
                    "dimensionsSpec" : {
                        "dimensions" : ["ver", "id"]
                    },
                    "timestampSpec" : {
                        "format" : "auto",
                        "column" : "t"
                    }
                }
            },
            "metricsSpec" : [
                {"name": "views", "type": "count"}
            ]
        },
        "tuningConfig" : {
            "type" : "hadoop",
            "partitionsSpec" : {
                "type" : "hashed",
                "targetPartitionSize" : 5000000
            }
        }
    }
}

It tries to run, then fails.
I can't make sense of the logs, but one thing that stands out is...

2016-06-02T21:43:52,564 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2016-06-02T21:43:52,573 INFO [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2016-06-02T21:43:52,574 WARN [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - job_local348698092_0001
java.lang.Exception: java.lang.IllegalArgumentException: Can not construct instance of io.druid.data.input.impl.JSONPathFieldType, problem: No enum constant io.druid.data.input.impl.JSONPathFieldType.NESTED
 at [Source: N/A; line: -1, column: -1] (through reference chain: java.util.ArrayList[0])
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: java.lang.IllegalArgumentException: Can not construct instance of io.druid.data.input.impl.JSONPathFieldType, problem: No enum constant io.druid.data.input.impl.JSONPathFieldType.NESTED
 at [Source: N/A; line: -1, column: -1] (through reference chain: java.util.ArrayList[0])

Any ideas where I'm screwing up?
Thanks, y'all!
-SK

Jonathan Wei

Jun 2, 2016, 6:23:40 PM
to druid...@googlegroups.com
Hi Scott,

Can you try changing the "type" property on the field definitions to use "path" instead of "nested"? The docs are out of date for that section.
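
For example, each field entry in the flattenSpec would then look like this (same name and expr as in your spec, only the type changed):

    {
        "type": "path",
        "name": "id",
        "expr": "$.arr.id"
    }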

Thanks,
Jon


Scott Kinney

Jun 2, 2016, 6:32:46 PM
to Druid User
Ah ha! Thank you!
Got past that; now it's failing with...
java.lang.Exception: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]

This probably suggests my 'flattenSpec' is incorrect? It seems to treat each JSON blob as an array. Do I need to update my paths to something like...
                            {
                                "type": "path",
                                "name": "delta_t",
                                "expr": "$[0].arr.data.delta_t"
                            }
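
Or, guessing from the sample above: since arr and data are the arrays, maybe the indices belong inside the path instead, something like (not verified, just reading the sample):

    {
        "type": "path",
        "name": "delta_t",
        "expr": "$.arr[0].data[0].delta_t"
    }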

Jonathan Wei

Jun 2, 2016, 6:56:53 PM
to druid...@googlegroups.com
The row looks like an array there, but I think the outer brackets in the exception are coming from the exception message itself:
throw new RE(e, "Failure on row[%s]", value);

Did you see any more detailed exceptions in the logs that might point to the field(s) that had errors?


Scott Kinney

Jun 2, 2016, 8:28:50 PM
to Druid User
Yep, I didn't look carefully enough.

Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!

which is:
"t": "2015-07-28T15:19:18.769",

Maybe Druid doesn't like the .769, which is probably milliseconds.


Fangjin Yang

Jun 3, 2016, 7:57:33 PM
to Druid User
See this link about the timestamp formats Druid supports:

http://druid.io/docs/0.9.0/ingestion/index.html#timestampspec

Scott Kinney

Jun 5, 2016, 7:14:09 PM
to Druid User
That says Druid supports ISO, and this is pretty clearly ISO:
 "t": "2015-07-28T15:19:18.769"

Scott Kinney

Jun 5, 2016, 7:29:07 PM
to Druid User
These stand out:

Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!

Caused by: java.lang.NullPointerException: Null timestamp in input: {ver=1.0}

A larger snippet of the log:

2016-06-05T23:20:05,534 INFO [LocalJobRunner Map Task Executor #0] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:
{
  "spec" : {
    "dataSchema" : {
      "dataSource" : "vmonitor.site.telemetry",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "t",
            "format" : "iso"
          },
          "flattenSpec" : {
            "useFieldDiscovery" : true,
            "fields" : [ {
              "type" : "path",
              "name" : "id",
              "expr" : "$.arr.id"
            }, {
              "type" : "path",
              "name" : "t",
              "expr" : "$.arr.data.t"
            }, {
              "type" : "path",
              "name" : "orig_t",
              "expr" : "$.arr.data.orig_t"
            }, {
              "type" : "path",
              "name" : "f",
              "expr" : "$.arr.data.f"
            }, {
              "type" : "path",
              "name" : "v_0",
              "expr" : "$.arr.data.v[0]"
            }, {
              "type" : "path",
              "name" : "v_1",
              "expr" : "$.arr.data.v[1]"
            }, {
              "type" : "path",
              "name" : "v_2",
              "expr" : "$.arr.data.v[2]"
            }, {
              "type" : "path",
              "name" : "i_0",
              "expr" : "$.arr.data.i[0]"
            }, {
              "type" : "path",
              "name" : "i_1",
              "expr" : "$.arr.data.i[1]"
            }, {
              "type" : "path",
              "name" : "i_2",
              "expr" : "$.arr.data.i[2]"
            }, {
              "type" : "path",
              "name" : "kw_0",
              "expr" : "$.arr.data.kw[0]"
            }, {
              "type" : "path",
              "name" : "kw_1",
              "expr" : "$.arr.data.kw[1]"
            }, {
              "type" : "path",
              "name" : "kw_2",
              "expr" : "$.arr.data.kw[2]"
            }, {
              "type" : "path",
              "name" : "delta_t",
              "expr" : "$.arr.data.delta_t"
            } ]
          },
          "dimensionsSpec" : {
            "dimensions" : [ "ver", "id" ]
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "views"
      }, {
        "type" : "count",
        "name" : "kw_0"
      }, {
        "type" : "count",
        "name" : "kw_1"
      }, {
        "type" : "count",
        "name" : "kw_2"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : {
          "type" : "none"
        },
        "intervals" : null
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/home/ubuntu/datawarehouse/vmonitor.site.telemetry.json"
      },
      "metadataUpdateSpec" : null,
      "segmentOutputPath" : "file:/home/ubuntu/druid-0.9.0/var/druid/segments/vmonitor.site.telemetry"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "var/druid/hadoop-tmp",
      "version" : "2016-06-05T23:19:59.090Z",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : null,
        "metricCompression" : null
      },
      "maxRowsInMemory" : 80000,
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : { },
      "combineText" : false,
      "useCombiner" : false,
      "buildV9Directly" : false,
      "numBackgroundPersistThreads" : 0
    },
    "uniqueId" : "be11a9ca18e748b2b4f681ff9d42cdf7"
  }
}
2016-06-05T23:20:05,544 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2016-06-05T23:20:05,550 INFO [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2016-06-05T23:20:05,551 WARN [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - job_local761517617_0001
java.lang.Exception: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:88) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:282) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]
Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:72) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:136) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:131) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:98) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:69) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:282) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]
Caused by: java.lang.NullPointerException: Null timestamp in input: {ver=1.0}
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:63) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:136) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:131) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:98) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:69) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:282) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]

Scott Kinney

Jun 6, 2016, 4:39:42 PM
to Druid User
Changing to 'path' stopped the errors, but the JSON is not flattening: I can't see any of the dimensions I've flattened. The only dimensions that are queryable are in the root of the JSON.




Fangjin

Jun 6, 2016, 5:15:05 PM
to Druid User
Hi Scott, are you sure all of your timestamps are valid? 

It appears you have this row in your data: "{ver=1.0}", and that row definitely doesn't have a timestamp.

Scott Kinney

Jun 6, 2016, 6:27:39 PM
to Druid User
Hi Fangjin, 
The timestamp is correct, but my Jayway JsonPath expressions were incorrect.
I was following the flattenSpec docs on druid.io, but the JsonPath in that example is wrong.
https://github.com/jayway/JsonPath was a big help.
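
For anyone who hits the same thing: since both arr and data are arrays, dotted paths like $.arr.data.t match nothing, which is why only the root-level ver survived flattening and the timestamp came out null. A corrected flattenSpec would index into the arrays explicitly; a sketch against the single-element sample above (not the exact spec from this thread):

    "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
            { "type": "path", "name": "id",      "expr": "$.arr[0].id" },
            { "type": "path", "name": "t",       "expr": "$.arr[0].data[0].t" },
            { "type": "path", "name": "orig_t",  "expr": "$.arr[0].data[0].orig_t" },
            { "type": "path", "name": "delta_t", "expr": "$.arr[0].data[0].delta_t" },
            { "type": "path", "name": "f",       "expr": "$.arr[0].data[0].f" },
            { "type": "path", "name": "kw_0",    "expr": "$.arr[0].data[0].kw[0]" }
        ]
    }

(and likewise for the remaining v, i, and kw fields).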

shiva...@media.net

Oct 2, 2017, 8:27:50 AM
to Druid User
Hello,

Will it be possible to create a parseSpec for arrays with no fixed size?
My file contains multiple JSON records with different array sizes.