Inferring schema from JSON records


Buntu Dev

Feb 12, 2015, 1:51:30 PM
to cdk...@cloudera.org
Hi Ryan/Team,

I wanted to follow up on a comment from the other thread about merging schemas, i.e., "We needed it to infer a schema for JSON records".

Is there anything in the kite-dataset CLI that can infer a schema from JSON records in the Kite 0.18.0 release?


Thanks!

Ryan Blue

Feb 12, 2015, 2:09:14 PM
to Buntu Dev, cdk...@cloudera.org
Yes, we just added initial support for JSON! The CLI now has json-schema
and json-import. The API also has a JSON format.

JSON support is like the CSV support. Kite assumes that the JSON records
adhere to some schema and constructs records according to that schema.
If a required field is missing, Kite will complain. If the JSON contains
data for fields not in the schema, it is ignored.
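
The behavior described above can be illustrated with a short Python sketch. This is not Kite's implementation; the conform function is a hypothetical helper, the "rating" field is made up for illustration, and treating a field with a default as optional is an assumption.

```python
import json

# Reduced schema for illustration. "rating" and its default are hypothetical
# and only show how an optional field might be handled.
SCHEMA = {
    "type": "record",
    "name": "Movie",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "title", "type": "string"},
        {"name": "rating", "type": ["null", "int"], "default": None},
    ],
}


def conform(record, schema):
    """Build a record that contains exactly the fields named in the schema."""
    out = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            # Assumption: a field with a default is not required.
            out[name] = field["default"]
        else:
            raise ValueError("missing required field: " + name)
    # Keys in `record` that the schema does not name are never copied.
    return out


line = '{"id": 1, "title": "Toy Story (1995)", "imdb_url": "http://us.imdb.com/"}'
movie = conform(json.loads(line), SCHEMA)
print(movie)
```

Here "imdb_url" is dropped because the reduced schema does not name it, while a record without "id" or "title" would raise an error.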

We're happy to get some feedback on how we should be improving the
support, so please try it out and get back to us!

Here's the output of json-schema on a JSON sample:

blue@work:~/tmp$ head movies.json -n 1
{"id": 1, "title": "Toy Story (1995)", "release_date": "01-Jan-1995",
"video_release_date": "", "imdb_url":
"http:\/\/us.imdb.com\/M\/title-exact?Toy%20Story%20(1995)"}
blue@work:~/tmp$ kite-dataset json-schema movies.json --class Movie
{
  "type" : "record",
  "name" : "Movie",
  "fields" : [ {
    "name" : "id",
    "type" : "int",
    "doc" : "Type inferred from '1'"
  }, {
    "name" : "title",
    "type" : "string",
    "doc" : "Type inferred from '\"Toy Story (1995)\"'"
  }, {
    "name" : "release_date",
    "type" : "string",
    "doc" : "Type inferred from '\"01-Jan-1995\"'"
  }, {
    "name" : "video_release_date",
    "type" : "string",
    "doc" : "Type inferred from '\"\"'"
  }, {
    "name" : "imdb_url",
    "type" : "string",
    "doc" : "Type inferred from '\"http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)\"'"
  } ]
}
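
For readers curious how output like this could be produced, here is a rough Python sketch of inferring a flat record schema from one sample. It is not Kite's code: avro_type and infer_schema are hypothetical names, the int/long cutoff is an assumption, and nested objects and arrays are left out.

```python
import json


def avro_type(value):
    """Map a parsed JSON value to an Avro primitive type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        # Assumption: values outside the 32-bit range become "long".
        return "int" if -2**31 <= value < 2**31 else "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    raise TypeError("nested values are not handled in this sketch")


def infer_schema(record, name):
    """Build a flat Avro record schema from a single sample record."""
    fields = [
        {
            "name": key,
            "type": avro_type(value),
            "doc": "Type inferred from '%s'" % json.dumps(value),
        }
        for key, value in record.items()
    ]
    return {"type": "record", "name": name, "fields": fields}


sample = json.loads('{"id": 1, "title": "Toy Story (1995)", "video_release_date": ""}')
print(json.dumps(infer_schema(sample, "Movie"), indent=2))
```

Inferring from a single sample is why an empty string such as "video_release_date" ends up typed as a plain string.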

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.

Buntu Dev

Feb 12, 2015, 2:25:16 PM
to Ryan Blue, cdk...@cloudera.org
Awesome, thanks!

I will check out the CLI reference docs and provide feedback.

Buntu Dev

Feb 12, 2015, 4:35:59 PM
to Ryan Blue, cdk...@cloudera.org
Hi Ryan,

I was able to use 'json-schema' to infer the schema for a reasonably complex set of JSON objects. It also added "doc" fields to the generated schema.

When I tried to create the dataset using the generated schema, I ran into this error:

~~~~~~
Unknown error: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value at [Source: java.io.StringReader@2eba4fff; line: 1, column: 6001]

~~~~~

The good news is that after removing the "doc" fields from the schema, I was able to create the dataset and import the JSON files using 'json-import'.
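
For anyone who wants to script the same workaround, a minimal Python sketch follows. The strip_docs helper is a hypothetical name and the embedded schema is a shortened stand-in for the real one; it assumes the schema is plain JSON and that every "doc" key can be dropped safely.

```python
import json


def strip_docs(node):
    """Return a copy of a parsed Avro schema with every "doc" key removed."""
    if isinstance(node, dict):
        return {k: strip_docs(v) for k, v in node.items() if k != "doc"}
    if isinstance(node, list):
        return [strip_docs(item) for item in node]
    return node


# Shortened stand-in for a schema produced by json-schema.
schema_text = """
{
  "type" : "record",
  "name" : "Movie",
  "fields" : [ {
    "name" : "id",
    "type" : "int",
    "doc" : "Type inferred from '1'"
  }, {
    "name" : "title",
    "type" : "string",
    "doc" : "Type inferred from '\\"Toy Story (1995)\\"'"
  } ]
}
"""

stripped = strip_docs(json.loads(schema_text))
# Compact separators shrink the serialized schema further.
compact = json.dumps(stripped, separators=(",", ":"))
print(len(schema_text), "->", len(compact), "characters")
```

Only keys named "doc" are removed, so a field whose name happens to be "doc" is untouched because that string is a value, not a key.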


Thanks!


Ryan Blue

Feb 12, 2015, 9:34:45 PM
to Buntu Dev, cdk...@cloudera.org

It sounds like you are creating a Hive dataset and the schema is too big
to fit in the table properties. A fix is on the way to store schemas in
HDFS, but in the meantime, if you create the table from a schema already
stored in HDFS, you will avoid the error.

Buntu Dev

Feb 12, 2015, 10:28:56 PM
to Ryan Blue, cdk...@cloudera.org
Thanks Ryan for the info.
