Strategy for improving validation performance?

Jim Klo

Apr 18, 2013, 6:22:04 PM
to json-...@googlegroups.com
Greetings,

I've generated a rather large and somewhat complex schema. https://gist.github.com/5416348

Does anyone have any ideas on strategies for improving performance?  This schema takes between 30 seconds and 1 minute per object to validate.  My initial thought was to remove the dependencies expressed through "oneOf", "allOf" and "anyOf" and just specify all properties directly, so that there are fewer traversals of the schema, but I run into the issue that a property can be declared with a generalized type while actually storing a more specific type.

Given this definition:
{
    "properties": {
        "myThing": { "$ref": "#/definitions/Thing" }
    },
    "definitions": {
        "Widget": {
            "properties": {
                "type": { "enum": ["Widget"] },
                "shape": { "type": "string" }
            },
            "allOf": [
                { "$ref": "#/definitions/Thing" }
            ]
        },
        "Thing": {
            "properties": {
                "type": { "enum": ["Thing", "Widget"] },
                "name": { "type": "string" }
            }
        }
    }
}

and this object instance:
{
    "myThing": {
        "type": "Widget",
        "shape": "square",
        "name": "Square Bolt"
    }
}

how would I alter the schema to be more efficient in validation?
would this be the right approach?
{
    "properties": {
        "myThing": {
            "oneOf": [
                { "$ref": "#/definitions/Thing" },
                { "$ref": "#/definitions/Widget" }
            ]
        }
    },
    "definitions": {
        "Widget": {
            "properties": {
                "type": { "enum": ["Widget"] },
                "shape": { "type": "string" },
                "name": { "type": "string" }
            }
        },
        "Thing": {
            "properties": {
                "type": { "enum": ["Thing"] },
                "name": { "type": "string" }
            }
        }
    }
}
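
One possible alternative to letting the validator try every "oneOf" branch is to select the single matching definition in application code, using the "type" value as a discriminator. The following is only a minimal sketch under that assumption, written in Python with the standard library; `build_type_index` and `select_definition` are hypothetical helper names, not part of any validator, and the schema is the second one above with each "type" enum listing exactly one value.

```python
def build_type_index(schema):
    """Map each value in a definition's "type" enum to that definition's name."""
    index = {}
    for name, definition in schema.get("definitions", {}).items():
        type_schema = definition.get("properties", {}).get("type", {})
        for value in type_schema.get("enum", []):
            if value in index:
                # Two definitions claiming the same type value would make
                # the dispatch ambiguous, so refuse to build the index.
                raise ValueError(
                    f"{value!r} is claimed by both {index[value]!r} and {name!r}"
                )
            index[value] = name
    return index


def select_definition(schema, index, instance):
    """Return the one definition to validate against, or None if unknown.

    Assumes instance["type"] is a plain string, as in the toy example.
    """
    name = index.get(instance.get("type"))
    return schema["definitions"][name] if name is not None else None


SCHEMA = {
    "definitions": {
        "Widget": {
            "properties": {
                "type": {"enum": ["Widget"]},
                "shape": {"type": "string"},
                "name": {"type": "string"},
            }
        },
        "Thing": {
            "properties": {
                "type": {"enum": ["Thing"]},
                "name": {"type": "string"},
            }
        },
    }
}

# Build the index once, then reuse it for every object.
INDEX = build_type_index(SCHEMA)
widget = {"type": "Widget", "shape": "square", "name": "Square Bolt"}
selected = select_definition(SCHEMA, INDEX, widget)
```

The selected definition would then be handed to whichever validator is in use, so each object is checked against one sub-schema instead of all of them. Note that this only works when no two definitions share a "type" value, which is not true of the first schema above.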



Geraint

Apr 18, 2013, 10:35:06 PM
to json-...@googlegroups.com
Whoah.  O_o   Do you have some example data?

Also, I'd be interested to know which validator you are using.  It looks like there are a lot of "$ref"s in there - it might be worth checking whether your validation library pre-processes the schemas to resolve those references, or whether it does a lookup each time.

I know that for my validator library, I haven't looked at that kind of optimisation, because I've only been using it on schemas orders of magnitude smaller than yours.
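
The difference between resolving a reference once and walking it on every use can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code of any validator mentioned in this thread; `resolve_pointer` and `RefCache` are hypothetical names, and only same-document references of the form "#/a/b" are handled.

```python
from urllib.parse import unquote


def resolve_pointer(document, ref):
    """Walk a same-document reference such as "#/definitions/Thing"."""
    if not ref.startswith("#"):
        raise ValueError(f"only local references are supported: {ref!r}")
    pointer = unquote(ref[1:])  # the fragment may be percent-encoded
    if pointer == "":
        return document  # "#" refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"not a JSON Pointer fragment: {ref!r}")
    node = document
    for token in pointer.split("/")[1:]:
        # JSON Pointer escapes: "~1" means "/", "~0" means "~".
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


class RefCache:
    """Resolve each distinct "$ref" string once and remember the result."""

    def __init__(self, document):
        self.document = document
        self._resolved = {}
        self.walks = 0  # how many times the document was actually walked

    def get(self, ref):
        if ref not in self._resolved:
            self.walks += 1
            self._resolved[ref] = resolve_pointer(self.document, ref)
        return self._resolved[ref]


schema = {
    "properties": {"myThing": {"$ref": "#/definitions/Thing"}},
    "definitions": {"Thing": {"properties": {"name": {"type": "string"}}}},
}
cache = RefCache(schema)
first = cache.get("#/definitions/Thing")
second = cache.get("#/definitions/Thing")
```

Here the second call is a plain dictionary lookup, so `cache.walks` stays at 1 however many times the same reference is used.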

Jim Klo

Apr 19, 2013, 2:14:49 PM
to json-...@googlegroups.com

On Thursday, April 18, 2013 7:35:06 PM UTC-7, Geraint wrote:
Whoah.  O_o   Do you have some example data?


I guess I win for largest JSON Schema?

Example data:
{
    "items": [{
        "type": ["http://schema.org/CreativeWork"],
        "id": "urn:www.khanacademy.org:node_slug:e/area_of_a_circle",
        "properties": {
            "name": ["Area of a circle"],
            "author": [{
                "type": ["http://schema.org/Person"],
                "properties": {
                    "name": ["Omar Rizwan"]
                }
            }],
            "dateCreated": ["2012-04-13T23:13:03Z"],
            "educationalAlignment": [{
                "type": ["http://schema.org/AlignmentObject"],
                "id": "urn:corestandards.org:guid:8111E58EA0054B8C8DE2CF7AA27F2FD8",
                "properties": {
                    "alignmentType": ["teaches"],
                    "educationalFramework": ["Common Core State Standards"],
                    "targetName": ["CCSS.Math.Content.7.G.B.4"],
                    "targetDescription": ["Know the formulas for the area and circumference of a circle and use them to solve problems; give an informal derivation of the relationship between the circumference and area of a circle."],
                    "targetUrl": ["http://corestandards.org/Math/Content/7/G/B/4"]
                }
            }],
            "learningResourceType": ["exercise"]
        }
    }]
}

and 

{
    "items": [{
        "type": ["http://schema.org/CreativeWork"],
        "id": "urn:www.khanacademy.org:node_slug:v/writing-and-using-inequalities-2",
        "properties": {
            "name": ["Writing and using inequalities 2"],
            "author": [{
                "type": ["http://schema.org/Person"],
                "properties": {
                    "name": ["Sal Khan"]
                }
            }, {
                "type": ["http://schema.org/Person"],
                "properties": {
                    "name": ["Monterey Institute for Technology and Education"]
                }
            }],
            "interactionCount": ["28103 UserPlays"],
            "datePublished": ["2011-02-20T16:28:11Z"],
            "educationalAlignment": [{
                "type": ["http://schema.org/AlignmentObject"],
                "id": "urn:corestandards.org:guid:666AA11DF39241C68112FFBE28D811C4",
                "properties": {
                    "alignmentType": ["teaches"],
                    "educationalFramework": ["Common Core State Standards"],
                    "targetName": ["CCSS.Math.Content.6.EE.B.6"],
                    "targetDescription": ["Use variables to represent numbers and write expressions when solving a real-world or mathematical problem; understand that a variable can represent an unknown number, or, depending on the purpose at hand, any number in a specified set."],
                    "targetUrl": ["http://corestandards.org/Math/Content/6/EE/B/6"]
                }
            }],
            "learningResourceType": ["video"],
            "video": [{
                "id": "urn:www.youtube.com:videoid:cCMpin3Te4s",
                "type": ["http://schema.org/VideoObject"],
                "properties": {
                    "playerType": ["youtube"],
                    "thumbnailUrl": ["http://s3.amazonaws.com/KA-youtube-converted/cCMpin3Te4s.mp4/cCMpin3Te4s.png"],
                    "encodingFormat": ["mpeg4"]
                }
            }],
            "keywords": ["U05_L1_T3_we2, Writing, and, using, inequalities, CC_6_EE_6, CC_6_EE_8, CC_7_EE_4, CC_7_EE_4_b, CC_39336_A-CED_1, CC_39336_A-CED_3, CC_39336_A-REI_3"],
            "description": ["Writing and using inequalities 2"]
        }
    }]
}

I have about 2K records of sample data, however most of it is in the same format - and I know that derivatives of CreativeWork can be used in place - this is just so bleeding edge; there's not a lot of data out there quite yet in the format.

The background here is we are using the Schema.org vocabulary modeled as HTML Microdata transformed into JSON, per the HTML Microdata specification.  I know it's not exactly perfect but it meets our general needs. FWIW the source 'schema' that I'm using for building the JSON Schema is http://schema.rdfs.org/all.json

Another strategy I've considered is splitting Classes, which indicate type, from property sets, which don't, such that a schema for a CreativeWork might start to look something like:
{
    "properties": {
        "type": {
            "enum": ["http://schema.org/CreativeWork"]
        },
        "properties": {
            "anyOf": [
                { "$ref": "#/definitions/props_CreativeWork" },
                { "$ref": "#/definitions/props_Thing" }
            ]
        }
    },
    "definitions": {
        "props_CreativeWork": {
            "properties": {
                "comment": {},
                "creator": {}
            }
        },
        "props_Thing": {
            "properties": {
                "name": {},
                "uri": {},
                "image": {}
            }
        }
    }
}
 
Also, I'd be interested to know which validator you are using.  It looks like there are a lot of "$ref"s in there - it might be worth checking whether your validation library pre-processes the schemas to resolve those references, or whether it does a lookup each time.


So the validators I'm using right now to develop with are a Python one and a Java one, and I may ultimately be looking for one that works in JavaScript and can run inside a CouchDB VDU function.

FWIW: I don't have any issue contributing to existing validators out there to cache ahead... 

Ideally this schema would probably lend itself better to translating the JSON to XML then using an XSD; however because of the CouchDB dependency XSD validation is futile at the moment.

At least for the python validator - I don't think that lookup should be bad as it's just a hash lookup.  The Java one is a tad too complex for me to just casually look at the source to figure out what it's doing - it's a bit hard to follow the code flow (Sorry FGE if you're seeing this... good work though in any case!)

It's possible that there just isn't a way to make it faster... in which case I may need to fall back to a reduced validation set.

Geraint

Apr 19, 2013, 3:55:29 PM
to json-...@googlegroups.com
Just in case I'm misunderstanding something - with that schema from the Gist, and those example data objects, validation is taking 30 seconds per object?

That doesn't seem right to me - I tried it with tv4 (my JavaScript library) and they validated in 200-300ms each in Chrome (code attached).

Or have I missed something?
timer.zip
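
A timing harness along these lines can help confirm where the time goes. This is a sketch in Python rather than the attached JavaScript; `time_validation` is a hypothetical helper, and the `validate` callable stands in for whichever library is under test (here replaced by a trivial `dummy_validate` so the example runs on its own).

```python
import time


def time_validation(validate, schema, instances):
    """Return the wall-clock time, in milliseconds, spent on each instance."""
    timings = []
    for instance in instances:
        start = time.perf_counter()
        validate(instance, schema)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def dummy_validate(instance, schema):
    """Stand-in validator: only checks that the instance is a JSON object."""
    if not isinstance(instance, dict):
        raise ValueError("instance must be an object")


# The schema is loaded and parsed before timing starts, so only the
# validation step itself is measured.
timings = time_validation(dummy_validate, {}, [{"items": []}, {"items": []}])
```

Timing each object separately makes it easier to tell a slow first run (schema loading, cache warm-up) from validation that is slow every time.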

Francis Galiegue

Apr 19, 2013, 4:13:35 PM
to json-...@googlegroups.com
On Fri, Apr 19, 2013 at 8:14 PM, Jim Klo <jim...@sri.com> wrote:
[...]
>
> FWIW: I don't have any issue contributing to existing validators out there
> to cache ahead...
>

I do already cache, but of course there is a size limit. And here you
don't seem to be using $refs to external sources either. So, this is
simple JSON Pointer resolution from the root.
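
To confirm that a schema really contains only same-document references, a walk like the following could be used. It is a minimal sketch using only the Python standard library; `find_external_refs` is a hypothetical helper, and it assumes that any "$ref" string not starting with "#" points outside the document.

```python
def find_external_refs(node, path="#"):
    """Return (location, ref) pairs for every "$ref" that is not local.

    The location is a readable path for reporting only; keys are not
    escaped, so it is not guaranteed to be a valid JSON Pointer.
    """
    found = []
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and not ref.startswith("#"):
            found.append((path, ref))
        for key, value in node.items():
            found.extend(find_external_refs(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found.extend(find_external_refs(value, f"{path}/{i}"))
    return found


example = {
    "properties": {
        "a": {"$ref": "#/definitions/A"},
        "b": {"$ref": "http://example.com/schema.json#/B"},
    },
    "definitions": {"A": {"allOf": [{"$ref": "other.json"}]}},
}
external = find_external_refs(example)
```

An empty result would mean that no network or file access should be needed to resolve references, which narrows down where a long delay can come from.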

The fact that you say it takes 30 _seconds_ to validate one entry is
rather puzzling to me. Have you tried and forked the app locally and
run it on your machine?


--
Francis Galiegue, fgal...@gmail.com
JSON Schema in Java: http://json-schema-validator.herokuapp.com

Jim Klo

Apr 19, 2013, 7:54:06 PM
to json-...@googlegroups.com


On Friday, April 19, 2013 1:13:35 PM UTC-7, fge wrote:
On Fri, Apr 19, 2013 at 8:14 PM, Jim Klo <jim...@sri.com> wrote:
[...]
>
> FWIW: I don't have any issue contributing to existing validators out there
> to cache ahead...
>

I do already cache, but of course there is a size limit. And here you
don't seem to be using $refs to external sources either. So, this is
simple JSON Pointer resolution from the root.

The fact that you say it takes 30 _seconds_ to validate one entry is
rather puzzling to me. Have you tried and forked the app locally and
run it on your machine?

I was intentionally not trying to use external sources for that reason...

And yes.. I forked and checked out locally and ran... mostly because I was developing without direct internet access...

Francis Galiegue

Apr 19, 2013, 9:46:38 PM
to json-...@googlegroups.com
On Sat, Apr 20, 2013 at 1:54 AM, Jim Klo <jim...@sri.com> wrote:
[...]
>>
>> I do already cache, but of course there is a size limit. And here you
>> don't seem to be using $refs to external sources either. So, this is
>> simple JSON Pointer resolution from the root.
>>
>> The fact that you say it takes 30 _seconds_ to validate one entry is
>> rather puzzling to me. Have you tried and forked the app locally and
>> run it on your machine?
>>
> I was intentionally not trying to use external sources for that reason...
>
> And yes.. I forked and checked out locally and ran... mostly because I was
> developing without direct internet access...
>

That still does not explain the 30 seconds... My library is ordinarily
ultra fast. Care to share a schema and sample data?

Craig McClanahan

Apr 19, 2013, 9:52:48 PM
to json-...@googlegroups.com
Just as a total WAG, the last couple of times I had unexplainable 30 second timeouts, they were DNS calls that (for some reason) didn't complete correctly and timed out after 30 seconds.  It's worth trying this in a different environment just to eliminate that possibility.

Craig






Jim Klo

Apr 20, 2013, 12:56:28 AM
to json-...@googlegroups.com


On Friday, April 19, 2013 6:46:38 PM UTC-7, fge wrote:
On Sat, Apr 20, 2013 at 1:54 AM, Jim Klo <jim...@sri.com> wrote:
[...]
>>
>> I do already cache, but of course there is a size limit. And here you
>> don't seem to be using $refs to external sources either. So, this is
>> simple JSON Pointer resolution from the root.
>>
>> The fact that you say it takes 30 _seconds_ to validate one entry is
>> rather puzzling to me. Have you tried and forked the app locally and
>> run it on your machine?
>>
> I was intentionally not trying to use external sources for that reason...
>
> And yes.. I forked and checked out locally and ran... mostly because I was
> developing without direct internet access...
>

That still does not explain the 30 seconds... My library is ordinarily
ultra fast. Care to share a schema and sample data?

I'm still refining the schema... trying to ensure I have it working... 
however use lrmi.json here https://gist.github.com/5416348 (there's 2 files there)

here's 2 objects... first one works, second one 'should' work, but doesn't - trying to figure out why... either way... it's taking a long time... Granted... the lrmi.json is 22K lines... 

 {
    "items": [{
        "type": ["http://schema.org/CreativeWork"],
        "id": "urn:www.khanacademy.org:node_slug:e/area_of_a_circle",
        "properties": {
            "name": ["Area of a circle"],
            "author": [{
                "type": ["http://schema.org/Person"],
                "properties": {
                    "name": ["Omar Rizwan"]
                }
            }],
            "dateCreated": ["2012-04-13T23:13:03Z"],
            "educationalAlignment": [{
                "type": ["http://schema.org/AlignmentObject"],
                "id": "urn:corestandards.org:guid:8111E58EA0054B8C8DE2CF7AA27F2FD8",
                "properties": {
                    "alignmentType": ["teaches"],
                    "educationalFramework": ["Common Core State Standards"],
                    "targetName": ["CCSS.Math.Content.7.G.B.4"],
                    "targetDescription": ["Know the formulas for the area and circumference of a circle and use them to solve problems; give an informal derivation of the relationship between the circumference and area of a circle."],
                    "targetUrl": ["http://corestandards.org/Math/Content/7/G/B/4"]
                }
            }],
            "learningResourceType": ["exercise"]
        }
    }]
}


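{
    "items": [{
        "type": ["http://schema.org/CreativeWork"],
        "id": "urn:www.khanacademy.org:node_slug:v/writing-and-using-inequalities-2",
        "properties": {
            "name": ["Writing and using inequalities 2"],
            "author": [{
                "type": ["http://schema.org/Person"],
                "properties": {
                    "name": ["Sal Khan"]
                }
            }, {
                "type": ["http://schema.org/Person"],
                "properties": {
                    "name": ["Monterey Institute for Technology and Education"]
                }
            }],
            "interactionCount": ["28103 UserPlays"],
            "datePublished": ["2011-02-20T16:28:11Z"],
            "educationalAlignment": [{
                "type": ["http://schema.org/AlignmentObject"],
                "id": "urn:corestandards.org:guid:666AA11DF39241C68112FFBE28D811C4",
                "properties": {
                    "alignmentType": ["teaches"],
                    "educationalFramework": ["Common Core State Standards"],
                    "targetName": ["CCSS.Math.Content.6.EE.B.6"],
                    "targetDescription": ["Use variables to represent numbers and write expressions when solving a real-world or mathematical problem; understand that a variable can represent an unknown number, or, depending on the purpose at hand, any number in a specified set."],
                    "targetUrl": ["http://corestandards.org/Math/Content/6/EE/B/6"]
                }
            }],
            "learningResourceType": ["video"],
            "video": [{
                "id": "urn:www.youtube.com:videoid:cCMpin3Te4s",
                "type": ["http://schema.org/VideoObject"],
                "properties": {
                    "playerType": ["youtube"],
                    "thumbnailUrl": ["http://s3.amazonaws.com/KA-youtube-converted/cCMpin3Te4s.mp4/cCMpin3Te4s.png"],
                    "encodingFormat": ["mpeg4"]
                }
            }],
            "keywords": ["U05_L1_T3_we2, Writing, and, using, inequalities, CC_6_EE_6, CC_6_EE_8, CC_7_EE_4, CC_7_EE_4_b, CC_39336_A-CED_1, CC_39336_A-CED_3, CC_39336_A-REI_3"],
            "description": ["Writing and using inequalities 2"]
        }
    }]
}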



 

Francis Galiegue

Apr 20, 2013, 11:58:57 AM
to json-...@googlegroups.com
On Sat, Apr 20, 2013 at 6:56 AM, Jim Klo <jim...@sri.com> wrote:
[...]
>>
>> That still does not explain the 30 seconds... My library is ordinarily
>> ultra fast. Care to share a schema and sample data?
>>
> I'm still refining the schema... trying to ensure I have it working...
> however use lrmi.json here https://gist.github.com/5416348 (there's 2 files
> there)
>
> here's 2 objects... first one works, second one 'should' work, but isn't -
> trying to figure out why... either way... it's taking a long time...
> Granted... the lrmi.json is 22K lines...
>
[...]

OK, I can reproduce easily... I'll work on that!

At first glance I'd say the fact that I validate "deeply" in the case of
the *Of keywords (allOf, anyOf, oneOf) looks like one culprit, and the
fact that the cache is overloaded looks like another.

Care to open an issue?

Julian Berman

Apr 20, 2013, 10:08:38 PM
to json-...@googlegroups.com
I'd be highly surprised if my validator takes 30 seconds too.

We do do ref caching by default.

I'll take a look later at this.

Julian

Jim Klo

Apr 22, 2013, 7:50:52 PM
to json-...@googlegroups.com
No, yours performs relatively fast - as expected.. :)

techgan...@gmail.com

Jul 25, 2015, 3:12:38 AM
to json-...@googlegroups.com
I have a plain JSON schema. Validating 28,000 records takes around 32 seconds. I am using the GitHub JSON validation framework. Can you please share some performance tips for JSON validation?
