Understanding how the IDs are handled when the schema has no provenance.

Xample

Mar 7, 2012, 6:09:04 AM
to json-...@googlegroups.com

It is sometimes hard to understand how the "id" works, especially when the root id (the one at the root of the schema) is missing or incomplete and therefore has to be completed by the environment.


By default, a URL is composed of:

1. A scheme

2. An authority

3. A path

4. A query

5. A fragment


- The scheme is the string that appears before the first ":" in a url, for example "aScheme:"

- The authority is the string right after the double slash, delimited by the end of the string or by the next "/", for example "//authority" or "//authority/"

- The path is absolute if it starts with a "/", and relative to the current position otherwise. Example: "/path/to/aFile.json". We sometimes call "/path/to/" the base path and the trailing "aFile.json" the relative part.

- The query (not relevant in our case)

- The fragment starts with a "#". In json schema a fragment can even be a path (a JSON Pointer), for example "#/fragment/path"


By default, when accessing a resource we use:

- The scheme: to define the protocol to use; http, file, ftp and urn are popular schemes.

- The authority: it identifies the server-side root for the document, for example "www.json.com" if it is a domain, but it can also be "123-456-789-123" if it is a uuid (associated with the urn scheme)

- The path: it defines the server-side path the server has to browse to in order to return the right document, for example "/path/to/stuff.json"

- The query: if the path points to a service, the query passes arguments to the service that is supposed to return the content.

- The fragment: it defines the client-side path within the document. Once you have a document, the fragment can point to a part of it using a fragment path (or JSON Pointer)
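
As a rough illustration of the five parts listed above, here is a minimal Python sketch using the standard urllib.parse module. The sample url is made up for the example, and note that Python calls the authority "netloc".

```python
from urllib.parse import urlsplit

# Hypothetical url, built only to exercise all five components.
url = "http://www.json.com/path/to/aFile.json?key=value#/fragment/path"

parts = urlsplit(url)
components = {
    "scheme": parts.scheme,      # before the first ":"
    "authority": parts.netloc,   # after "//", up to the next "/"
    "path": parts.path,          # absolute here, since it starts with "/"
    "query": parts.query,        # after "?", not relevant for schema ids
    "fragment": parts.fragment,  # after "#", here a JSON Pointer
}

for name, value in components.items():
    print(f"{name:9} = {value}")
```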


Even if you are used to incomplete urls like "www.json.com", "/", "#" and so on… the engine behind them always has to build an absolute one.

For example, in a browser:

- "www.json.com" will be implicitly resolved to http://www.json.com/#

- "/" resolves to the root of your current url. If you are at "scheme://authority/base/path/relativePath.com#fragment", you will be redirected to "scheme://authority/#" (the path is changed to "/", the fragment is reset)

- "#" resolves to the root of your current document. "scheme://authority/base/path/relativePath.com#fragment/path" becomes "scheme://authority/base/path/relativePath.com#"
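
The "/" and "#" cases can be tried out with Python's urllib.parse.urljoin. This is only a sketch: the current location is made up, and it uses the "http" scheme because urljoin only resolves references for schemes it knows about.

```python
from urllib.parse import urljoin, urldefrag

# Made-up current location (assumption for the example).
current = "http://authority/base/path/relativePath.com#fragment/path"

to_root = urljoin(current, "/")      # path reset to "/", old fragment dropped
to_document = urljoin(current, "#")  # same document, fragment reset

# An empty fragment and no fragment designate the same document, so the
# second result is shown with its fragment split off.
print(to_root)
print(urldefrag(to_document).url)
```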


Now… when you store schemas in the environment, they are automatically associated with a url (based on the root "id" property of the schema). If there is no "id" in your schema, or if it is incomplete, the environment will complete it for you so that it can later retrieve your schema correctly.


Since the minimum requirement to find a document is the scheme and the path (or authority, it depends on the protocol [1]), your environment has to provide a default scheme and be able to forge a unique path (or authority) if none is provided.

Note that by default, if the schema has been retrieved from a url, the environment uses that url's scheme, authority and path to create an id; when this information is not available, it is the environment's job to create its own.

For example, the JSV environment uses "urn" as the default scheme, and a uuid (universally unique identifier) to build a default path (or authority; again, it depends on how you see it)


Here is how your schemas will be implicitly resolved (here in JSV):


A full url

{ "id" : "http://www.json.com/#root" }

Scheme: http

Hier-part: www.json.com/

Fragment: root

Stays the same

{ "id" : "http://www.json.com/#root" }

Scheme: http

Hier-part: www.json.com/

Fragment: root


A url without any scheme

{ "id" : "www.json.com/#root" }

Scheme: 

Hier-part: www.json.com/

Fragment: root

The default "urn" scheme is added

{ "id" : "urn://www.json.com/#root" }

Scheme: urn

Hier-part: www.json.com/

Fragment: root


Anything without any scheme

{ "id" : "mickey" }

Scheme: 

Hier-part: mickey

Fragment:

The default "urn" scheme is added

{ "id" : "urn:mickey#" }

Scheme: urn

Hier-part: mickey

Fragment:


Only a scheme

{ "id" : "scheme:" }

Scheme: scheme

Hier-part:

Fragment:

The path is considered as empty (or "")

{ "id" : "scheme:#" }

Scheme: scheme

Hier-part: 

Fragment:


Nothing

{ }

Scheme:

Hier-part:

Fragment:

The default "urn" scheme is added, a default path (uuid) is created

{ "id" : "urn:uuid:12345678-1234-1234-1234-1234567890ab#" }

Scheme: urn

Hier-part: uuid:12345678-1234-1234-1234-1234567890ab

Fragment:


Only a fragment

{ "id" : "#mySchemaName" }

Scheme: 

Hier-part:

Fragment: mySchemaName

No scheme, no hier-part: it is handled the same way as "Nothing", i.e. the default "urn" scheme is added and a default path (uuid) is created. The fragment stays the same

{ "id" : "urn:uuid:12345678-1234-1234-1234-1234567890ab#mySchemaName" }

Scheme: urn

Hier-part: uuid:12345678-1234-1234-1234-1234567890ab

Fragment: mySchemaName


In summary, in the case where the schema has no provenance (has not been retrieved from a remote url but just inserted into the env):

- If there is no id in the schema root, we create one using a uuid

- If there is no scheme defined in the id, we use "urn" as the default scheme
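
The cases above can be condensed into a small Python sketch. It is not JSV's code: the function name complete_root_id, the make_uuid parameter and the rule "a scheme-less id containing a "/" gets "//" in front of it" are assumptions, the last one made only to reproduce both the "www.json.com/#root" and the "mickey" examples.

```python
import re
import uuid

DEFAULT_SCHEME = "urn"  # the default scheme used in the examples above

# Scheme syntax: a letter, then letters, digits, "+", "-" or ".", then ":".
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(.*)$")


def complete_root_id(raw_id=None, make_uuid=uuid.uuid4):
    """Return an absolute id for a schema that has no provenance."""
    rest, _, fragment = (raw_id or "").partition("#")
    match = _SCHEME.match(rest)
    if match:
        scheme, hier_part = match.group(1), match.group(2)
    else:
        scheme, hier_part = DEFAULT_SCHEME, rest
        if not hier_part:
            # No id at all, or only a fragment: forge a unique hier-part.
            hier_part = "uuid:%s" % make_uuid()
        elif "/" in hier_part:
            # Assumption: something with a "/" is read as authority + path.
            hier_part = "//" + hier_part
    return "%s:%s#%s" % (scheme, hier_part, fragment)


for sample in ("http://www.json.com/#root", "www.json.com/#root",
               "mickey", "scheme:", None, "#mySchemaName"):
    print(repr(sample), "->", complete_root_id(sample))
```

The uuid is random on every call; passing a fixed make_uuid makes the result reproducible, which is handy in tests.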


[1] : In some cases, there is simply no authority. For example, when you type file:///some/path/to/a/file the authority is blank, i.e. "", which explains why there are 3 slashes. The reason is that you are supposed to be on the computer where you are looking for the file, and therefore there is no need for an authority. In an environment it is the same: we are inside our environment, and therefore an authority is not mandatory. However, it is not obvious whether to call a uuid a path or an authority, because it is neither one nor the other. This is why we usually talk about the "hier-part", which is simply the string identifier that points to a document according to a scheme.

Francis Galiegue

Mar 7, 2012, 8:09:00 AM
to json-...@googlegroups.com
Excellent summing up, thanks!

I still have concerns though, as the spec in theory allows any type of
URI anywhere, and that is a problem.

For instance, this is legal, in theory:

{
"type": "object",
"properties": {
"p1": {
"id": "http://some.site/path/to/my.json#/some/fragment/here"
}
}
}

What are you supposed to do with such a schema?

Similarly, fragments as IDs: why?

{
"type": "object",
"properties": {
"p1": {
"id": "#myfragment"
}
}
}

This conflicts with JSON Pointer and makes two ways to access the same
schema: urn:uuid:whatever:#myfragment and
urn:uuid:whatever:#/properties/p1

This is why I have voiced the URI concern so long ago and maintain my
stance: not every URI should be allowed in "id", and subschemas should
only ever be accessed using JSON Pointer.

--
Francis Galiegue, fgal...@gmail.com
"It seems obvious [...] that at least some 'business intelligence'
tools invest so much intelligence on the business side that they have
nothing left for generating SQL queries" (Stéphane Faroult, in "The
Art of SQL", ISBN 0-596-00894-5)

Xample

Mar 7, 2012, 11:54:05 AM
to json-...@googlegroups.com

It is not a problem to be able to access the same schema through 2 different paths. It is a bit like using a folder alias on a hard drive, or having different domains pointing to the same ip address.

The opposite case would be problematic, i.e. having 2 schemas under the same url, but this only happens if you override an existing path with an id. For example:

{

    "id" : "#notUnique",

    "properties" :

        {

        "aKey":

            {

            "id" : "#notUnique"

        }

    }

}


As long as you are not in this case, you are free to identify your schema or inner schemas the way you want (even if the scheme and hier-part values you set have no logical meaning)
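
A quick way to spot that problematic case is to collect every "id" in a schema and look for repeats. The Python sketch below is only an illustration: the function names are made up, and it compares the ids as written, without resolving them to absolute urls first.

```python
from collections import Counter


def iter_ids(node):
    """Yield every string "id" found in a schema, walking it depth first."""
    if isinstance(node, dict):
        if isinstance(node.get("id"), str):
            yield node["id"]
        for value in node.values():
            yield from iter_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_ids(item)


def duplicate_ids(schema):
    """Return the ids that are declared more than once, sorted."""
    counts = Counter(iter_ids(schema))
    return sorted(name for name, count in counts.items() if count > 1)


schema = {
    "id": "#notUnique",
    "properties": {"aKey": {"id": "#notUnique"}},
}
print(duplicate_ids(schema))
```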


For your first example, your env should index the inner schema:

{

"id" : "http://some.site/path/to/my.json#/some/fragment/here"

}

simply under the explicit url "http://some.site/path/to/my.json#/some/fragment/here".

That's all… That said, I would strongly discourage using a fragment path in an id: it will just mess up your index table and be open to ambiguous interpretation.


The choice of an id is risky if it is not made wisely… The purpose of an id is to be unique and efficient. Nothing prevents you from doing the following:

myscheme://fge/bigProject1/users/#address

myscheme://fge/bigProject1/users/#purchases

myscheme://fge/bigProject1/users/#history


As for the fragment as id, this is just the usual way to name an inner schema. When you specify an id that is not absolute (and a lone fragment obviously is not), you have to resolve it to make it absolute.

When I parse an "id", the rules are as follows:


1. If this "id" is absolute, index it as it is; otherwise resolve it relative to the root "id" property of the containing schema.

2. If the root "id" of the containing schema is relative (or empty), resolve it relative to the source "url" the schema was retrieved from (for example an html page or a file).

3. If the source does not exist, i.e. the schema has been created from "the code" and not loaded from anywhere… use urn:uuid:123456 as the default url.
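
The three rules can be read as a fallback chain, sketched here in Python. The names resolve_id, absolutize and DEFAULT_URL are made up for the illustration, and "urn:uuid:123456" is the placeholder from rule 3, not a real uuid. Fragment-only ids are joined by hand because urllib.parse.urljoin does not resolve anything against a urn base.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_URL = "urn:uuid:123456"  # placeholder default url from rule 3


def is_absolute(url):
    """An id is treated as absolute as soon as it carries a scheme."""
    return bool(urlsplit(url).scheme)


def absolutize(reference, base):
    if is_absolute(reference):
        return reference
    if reference == "" or reference.startswith("#"):
        # Fragment-only: keep the base document, swap the fragment.
        return base.partition("#")[0] + reference
    # urljoin only merges paths for the schemes it knows (http, file, ...).
    return urljoin(base, reference)


def resolve_id(inner_id, root_id="", source_url=""):
    environment_base = source_url or DEFAULT_URL  # rule 3
    root = absolutize(root_id, environment_base)  # rule 2
    return absolutize(inner_id, root)             # rule 1


print(resolve_id("#fragment", "scheme://authority/aPath/schema.json"))
print(resolve_id("/path", "http://authority/aPath/schema.json"))
print(resolve_id("#bar"))
```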


Now, if you do not want to use fragments, here is a "#fragment" vs "/path" comparison:


{

    "id":"scheme://authority/aPath/schema.json",

    "property":{"key":

            {

            "id":"#fragment" // accessible through scheme://authority/aPath/schema.json#fragment

        }

    }

}


{

    "id":"scheme://authority/aPath/schema.json",

    "property":{"key":

            {

            "id":"/path" // accessible through scheme://authority/path/#

        }

    }

}


PS: Note that normalization is important when dealing with urls. For example, http://domain/something has to be treated as equal to http://domain/something/.//# otherwise your environment will not be able to retrieve anything from the index table…
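
Here is a minimal Python sketch of such a normalization, to be applied to a url before it is used as a key in the index table. It deliberately follows the loose convention of the example above (a trailing "/" and an empty fragment make no difference), which goes further than what the URI rfc itself treats as equivalent; the function name is made up.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Return a canonical key: cleaned path, explicit (maybe empty) fragment."""
    parts = urlsplit(url)
    path = parts.path
    if path.startswith("/"):
        path = "/" + path.lstrip("/")  # a single leading slash
    if path:
        # Removes "." segments and repeated slashes, resolves "..",
        # and drops the trailing "/".
        path = posixpath.normpath(path)
    if parts.netloc and path in ("", "."):
        path = "/"
    base = urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))
    return base + "#" + parts.fragment


print(normalize("http://domain/something"))
print(normalize("http://domain/something/.//#"))
print(normalize("http://domain.com"))
```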

Francis Galiegue

Mar 7, 2012, 4:52:22 PM
to json-...@googlegroups.com
On Wed, Mar 7, 2012 at 17:54, Xample <flavien...@gmail.com> wrote:
> This is not a problem to be able to access the same schema using 2 different
> path. This is a bit like using a folder alias in an HD, or having different
> domains pointing to the same ip address.

No, it is not the same.

You mention the case of using a "folder alias in an HD": this does not
identify a resource at all. On a filesystem, I can very well make it
so that file:///a/b/c and file:///foo/bar/baz point to exactly the
same object -- ON THE SAME MACHINE. All I have to do is a well-crafted
set of mount points and symlinks. The only unique identifier for the
resource is local only and is the inode number of the relevant file.
Which can have any other URI for that matter, it is just a matter of
how _I_ decide to make it available. _But that doesn't make it
available to the outside world_.

On the other hand, using JSON Pointer, there are no two ways of
referring to a path into a JSON document (whether it be a schema or
anything else), there can only be ONE. And that is the core value of
it. The _ROOT_ of the document doesn't matter. It may be
"myscheme://myauthority/my/path" or
"http://some.site/path/to/the.json" -- who cares? The third party
accessing the document does NOT have to second guess what
"#somefragment" may refer to if you use JSON Pointer.

Have a look at that:

{
"type": "object",
"id": "#/foo",
"foo": {
"type": "integer"
}
}

And you are asked to validate an instance with this schema and path
"#/foo": what do you do?

I persist: what can be found in "id" MUST be restricted, and NEVER
include fragments. Fragments, in JSON Schema, MUST be the domain of
JSON Pointer, since this is the only spec currently able to address
all of a JSON document, whether it be a JSON Schema or other, in a
unique, non ambiguous way.
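
To make the point concrete, here is a minimal Python sketch of JSON Pointer resolution applied to the schema above. It is an illustration, not a reference implementation: the function name is made up, and the "~0" / "~1" escapes follow the JSON Pointer specification as it was later published. The lookup never consults "id", so "#/foo" can only mean the "foo" member of the root.

```python
from urllib.parse import unquote


def resolve_pointer(document, fragment):
    """Follow a "#/a/b" style fragment from the root of a JSON document."""
    pointer = fragment[1:] if fragment.startswith("#") else fragment
    pointer = unquote(pointer)  # fragments are percent-encoded in a URI
    if pointer and not pointer.startswith("/"):
        raise ValueError("not a JSON Pointer: %r" % fragment)
    node = document
    for token in pointer.split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


schema = {
    "type": "object",
    "id": "#/foo",
    "foo": {"type": "integer"},
}

print(resolve_pointer(schema, "#/foo"))
print(resolve_pointer(schema, "#") is schema)
```

The ambiguity only appears once an implementation also indexes the "id" value "#/foo", because then the same fragment names both the root and the "foo" member.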

Xample

Mar 8, 2012, 3:02:07 AM
to json-...@googlegroups.com
I think you misunderstood my alias example; what you are writing is exactly what I wanted to describe, i.e. 2 paths can point to the same file, and this does not matter.

Concerning your example with fragments, using a fragment such as #foo is never a problem (it is a bit like creating a folder alias with a unique name at a unique path); you will never end up in an ambiguous case.
In a json schema, adding an id is equivalent to adding an alias at the root of the schema. If 2 aliases have the same name, one will override the other.
Moreover, I would never use a fragment path (such as #/foo) as an id; in that case it is virtually equivalent to adding an alias inside another folder. Again, if there is already a folder where you are adding the alias (in your example you alias the root at the very place where the #/foo sub schema lives), you are just doing it wrong.

It is not a problem to make it available to the outside world: the schema scope (scheme + hier-part) is enough to restrict the links to the right scheme. Again, you should absolutize your schema (or a copy of it) before using it; that is much easier. Here is a little example of how your schema should be represented once indexed (of course, in your env you will use objects (pointers) and not duplicate the schema N times). Note especially:
- How the id urls are completed: #bar becomes http://domain.com/#bar // in the example I did not change them "in place", I just used them to index the inner schemas.
- How they are normalized: http://domain.com becomes http://domain.com/#
- How each id becomes a line of the indexed table
- How the $refs are absolutized: "$ref":"#" becomes "$ref":"http://domain.com/#"

{
    "id" : "http://domain.com",
    "properties":
        {
        "foo":
            {
            "id":"#bar",
            "description":"foo inner schema"
        },
        "something" : 
            {
            "$ref":"#"
        }
    }
}


var indexed = {
    "http://domain.com/#":
        {
        "id" : "http://domain.com",
        "properties":
            {
            "foo":
                {
                "id":"#bar",
                "description":"foo inner schema"
            },
            "something" : 
                {
                "$ref":"http://domain.com/#"
            }
        }
    },
    "http://domain.com/#bar":
        {
        "id":"#bar",
        "description":"foo inner schema"
    }
}

Note that if you had fetched this schema from the url http://www.aDomain.com/, I would even have indexed the same content as "http://domain.com/#" under one more key, http://www.aDomain.com/#
That way I could always refer to the schema elements using either the real url where the schema is stored or its explicitly defined id.
For this last example, those would point to the same inner schema:
http://domain.com/#/properties/foo
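
Building such an index table could look like the following Python sketch. The names index_key and build_index are made up, the fallback url is a placeholder, and only absolute ids and fragment-only ids are handled, which is all the example needs. The table stores references to the schema objects, not copies.

```python
from urllib.parse import urlsplit, urlunsplit


def index_key(reference, base):
    """Absolutize an id (or $ref) and give it an explicit fragment."""
    if not reference or reference.startswith("#"):
        reference = base.partition("#")[0] + reference
    parts = urlsplit(reference)
    path = parts.path or ("/" if parts.netloc else "")
    locator = urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
    return locator + "#" + parts.fragment


def build_index(schema, fallback="urn:uuid:123456"):
    """Map every absolutized id to the (sub)schema that declares it."""
    root_key = index_key(schema.get("id", ""), fallback)
    index = {root_key: schema}
    stack = list(schema.values())
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if isinstance(node.get("id"), str):
                index[index_key(node["id"], root_key)] = node
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return index


schema = {
    "id": "http://domain.com",
    "properties": {
        "foo": {"id": "#bar", "description": "foo inner schema"},
        "something": {"$ref": "#"},
    },
}

index = build_index(schema)
print(sorted(index))
print(index_key("#", "http://domain.com/#"))  # how "$ref": "#" is absolutized
```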

Hope it helps

Francis Galiegue

Mar 8, 2012, 3:22:10 AM
to json-...@googlegroups.com
On Thu, Mar 8, 2012 at 09:02, Xample <flavien...@gmail.com> wrote:
> I think you misunderstood my alias example, what you are writing is exactly
> the example I wanted to describe. i.e. 2 path can point to the same file but
> this does not matter.
>
> Concerning your example with fragments, using a fragment #foo is never a
> problem (this is a bit like creating a folder alias with an unique name at
> an unique path) you will never come into an ambiguous case.

Again, why? There is JSON Pointer.

> In a json schema, adding an ID is equivalent to adding an alias in the root
> of the schema. If 2 aliases have the same name, one will override the
> second.

I don't understand, what do you mean?

> Moreover, I would never use fragments path (such as #/foo) as an ID[...]

... but the spec currently allows for it. The spec allows for anything
as an URI within "id" and I consider this an error.

Another example to illustrate my point:

{
"s1": {},
"s2": {
"id": "#/s1"
}
}

This is legal. What is #/s1 in this schema?

If you forbid embedded ids/aliases/etc, forbid non empty fragments in
schema IDs, and mandate the use of JSON Pointer for addressing, you
get rid of all these issues instantly. This is 10 lines to be added to
the spec and we are definitely done with all addressing problem and
simplify implementations a _lot_. You get all the advantages and none
of the drawbacks.

Xample

Mar 9, 2012, 3:28:12 AM
to json-...@googlegroups.com
A bit like in html pages, the id is supposed to be unique, and it is there so that you do not always have to refer to items from the root.
I strongly prefer writing {"$ref":"#aSchema"} to {"$ref":"#/properties/some/properties/path/properties/to/properties/aSchema"}
especially if the schema's structure has changed and the second path does not exist anymore…
It is allowed to drink poison but I would not do it; it is allowed to make a schema with ambiguous id paths but I would not do that either. In your example, yes, it is legal but ambiguous. It is then up to your implementation's convention to decide whether priority is given to json pointer paths or to explicitly set ids. I would strongly suggest giving priority to the "id" (and the longest indexed path first) for the following reasons:
I do agree with you that it is really tricky (not to say dangerous) to use fragment paths in ids. That said, there are 2 cases where they can be really useful:
1. Redefining a ghost path: you modified your schema but would like to keep backward compatibility; you can still create aliases using {"id":"#/missingKey/", "extends":{"$ref":"#/newKey"}}
2. Overriding a schema at some path: {"id":"#/aDenied/properties/key", "deny":"all"}

In my opinion, the rfc should keep allowing fragment paths in ids, but set a convention on how to manage ambiguous cases (which schema to trust first). I like going from the most specific to the most generic, i.e. longest id first, then json pointers (like CSS, actually)

Francis Galiegue

Mar 9, 2012, 4:47:13 AM
to json-...@googlegroups.com
On Fri, Mar 9, 2012 at 09:28, Xample <flavien...@gmail.com> wrote:
> A bit like html pages, the id tags is supposed to be unique and present not
> to have to always refer items from the root.
> I strongly prefer writing {"$ref":"#aSchema"} than
> {"$ref":"#/properties/some/properties/path/properties/to/properties/aSchema"}
> especially if the schema's structure moved and that the second path does not
> exists anymore…

Fair point. But if you have such a JSON Pointer to write, you have a
schema design problem :)

> It is allowed to drink poison but I would not do it, it is allowed to make a
> schema with ambiguous id paths but I would neither not do it. In your
> example, yes it is legal but ambiguous. It is then to your implementation
> convention to rule if the priority is made on json pointers path or on set
> ids.

Disagree. If you leave this up to implementations, you'll end up with
different validations for the same schema. It is a recipe for
disaster... Subschema resolution, with fragment ids and JSON Pointer,
and their priority, should be decided upon and put as a requirement in
the spec proper.

> Even if I agree with you that it is really tricky (not to say dangerous) to
> use fragment paths in ids. This said there is 2 cases where they can be
> really useful
> 1. Redefining a ghost path : you modified your schema but would like to keep
> retrocompatibility, you can still create aliases using {"id":"#/missingKey/"
> "extends":{"$ref":"#/newKey"}}

Careful, "#/missingKey/" and "#/missingKey" are not the same thing!
The first looks for key "" of key "missingKey" from the root of the
JSON document...
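
The difference is easy to see once the pointer is split into its reference tokens. A tiny Python sketch (the helper name and the sample document are made up for the illustration):

```python
def pointer_tokens(fragment):
    """Split a "#/a/b" style fragment into its reference tokens."""
    pointer = fragment[1:] if fragment.startswith("#") else fragment
    return pointer.split("/")[1:]


print(pointer_tokens("#/missingKey"))   # one token
print(pointer_tokens("#/missingKey/"))  # two tokens, the second one is ""

# Hypothetical document with a member whose name is the empty string.
document = {"missingKey": {"": "the member named by the empty key"}}
node = document
for token in pointer_tokens("#/missingKey/"):
    node = node[token]
print(node)
```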

I disagree with this proposal though: if your core schema changes, so
should its id. Problem solved.

> 2. Overriding a schema at somepath : {"id":"#/aDenied/properties/key",
> "deny":"all"}
>

I don't understand that...

> In my opinion, the rfc should keep allowing using fragments in path but put
> a convention on how to manage ambiguous cases (which schema to trust in
> priority). I like the way : the most specific to the more generic i.e.
> longest id to json pointers. (like CSS actually)
>

Fair enough. Here is my proposal:

* at the root of a schema, if there is an ID, it should be absolute,
with either no fragment or an empty fragment;
* IDs in subschemas can only be fragments, but MUST NOT be JSON Pointers;
* no two subschemas in the same schema can have the same ID;
* when resolving a schema referenced via $ref:
- if the URI is absolute, the implementation should split the URI
between the locator and fragment, then resolve the locator; no
fragment and an empty fragment are the same;
- if the fragment is not a JSON Pointer, then the implementation
should look for a subschema with an ID matching the fragment from the
resolved schema;
- if it is a JSON Pointer, then the implementation should compute
the path from the root of the resolved schema (see the JSON Pointer
spec);
- $ref should be resolved recursively;
* if, in the process of $ref resolution, one such event happens:
- the locator cannot be resolved to a JSON document; or
- recursive $ref resolution leads to a loop; or
- a document resolved by $ref is not a JSON Object;
then the implementation MUST consider validation a failure.
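
As a rough sketch of how little code this proposal needs, here is one possible reading of it in Python. Everything named here (RefError, resolve_ref, the documents dictionary standing in for "resolving the locator") is hypothetical, and relative locators other than a bare fragment are not handled.

```python
from urllib.parse import unquote, urldefrag


class RefError(Exception):
    """Any of the failure cases: unresolvable locator, loop, non-object."""


def follow_pointer(document, pointer):
    node = document
    for token in pointer.split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        try:
            node = node[int(token)] if isinstance(node, list) else node[token]
        except (KeyError, IndexError, ValueError, TypeError):
            raise RefError("pointer %r does not resolve" % pointer)
    return node


def find_by_id(node, name):
    """Depth-first search for the subschema whose id is "#<name>"."""
    if isinstance(node, dict):
        if node.get("id") == "#" + name:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_by_id(child, name)
        if found is not None:
            return found
    return None


def resolve_ref(ref, current_locator, documents, seen=None):
    seen = set() if seen is None else seen
    locator, fragment = urldefrag(ref)      # no fragment == empty fragment
    locator = locator or current_locator    # "#..." stays in the same document
    if (locator, fragment) in seen:
        raise RefError("$ref loop at %s#%s" % (locator, fragment))
    seen.add((locator, fragment))
    document = documents.get(locator)       # stand-in for fetching the locator
    if not isinstance(document, dict):
        raise RefError("%r does not resolve to a JSON object" % locator)
    fragment = unquote(fragment)
    if fragment == "" or fragment.startswith("/"):
        target = follow_pointer(document, fragment)  # JSON Pointer
    else:
        target = find_by_id(document, fragment)      # fragment id
    if not isinstance(target, dict):
        raise RefError("%r is not a JSON object" % ref)
    if "$ref" in target:
        return resolve_ref(target["$ref"], locator, documents, seen)
    return target
```

A caller would pass the $ref value, the locator of the schema it appears in, and a dictionary of already loaded documents keyed by locator; any RefError is then reported as a validation failure.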

This needs better wording, but you get the idea...

Xample

Mar 9, 2012, 6:13:03 PM
to json-...@googlegroups.com

- As uniformity between validators matters, I would suggest creating an open json-schema test suite so that everyone can make their validator conform to the same behaviour (especially in tricky cases).

- Reading an RFC is never a pleasure; examples covering all the cases (if applicable) would help a lot (ideally combined with a common test suite…)

- I opened a topic about the trailing slash / in json pointers (very nice of you to have noticed this)

- Overriding the path #/properties/anOverridenPath with the schema {"deny":"any"} would make the following document invalid: {"anOverridenPath":"something"} (this could have been useful in MongoDB queries)


For your proposal:

1. I do not agree, again. The rfc specifies the following fallback: the id is resolved relative to the url the schema was fetched from, which is itself resolved relative to the environment. This is only a matter of building absolute paths from relative ones. RFC 2396 tells how to do it; you should be able to use java.net.URL: URL url = new URL( "http://base.com/url" , "../relativeOne.html");

Normally: new URL( "http://absoluteOne.com" , "http://notARelative.com"); will give you http://notARelative.com


Try: 

String environment = "urn:"; // your fallback scheme (you could have made a whole fallback id as well "urn:uuid:123456")

String downloadedFrom = "//www.domain.com/theFile.json";

downloadedFrom = (new URL( environment , downloadedFrom)).toString(); // will absolutize the downloadedFrom url

String rootID = "#fragmentName";

rootID = (new URL( downloadedFrom , rootID)).toString(); // will absolutize the rootID


-> rootID should be = urn://www.domain.com/theFile.json/#fragmentName

-> index your schema in a dictionary under the (absolute and normalized) url rootID, you are done…

-> add one more step for inner schema (where the absolute path will be the rootID and the relative one the id's url)


If your rootID had been "http://www.mickey.com/#", it would simply not have been changed.

If downloadedFrom had been "", we would simply have skipped this step. (new URL( anAbsolutePath , "")).toString().equals(anAbsolutePath) is true.



2. Any url but no fragment path (json pointers) -> no more overriding, well… not bad in some way

3. An id should be unique -> I agree; that said, it should already be the case (or would you validate them all? :-) )

4. a. About url normalization ( http://en.wikipedia.org/wiki/URL_normalization ): using the URL class will probably take care of this problem by itself. Never store a url string as-is. Create a URL from the url string, normalize it (if applicable), then serialize it; this will solve a lot of things by itself.

5. I agree with all of those cases

   a. The ref is not found, timeout, unavailable -> failure

   b. A recursive $ref will probably result in a stack overflow… the worst case being {"$id":"#A", "$ref":"#A"} -> failure

   c. The ref is not a valid schema -> failure

 

Francis Galiegue

Mar 12, 2012, 4:20:35 AM
to json-...@googlegroups.com
Just one more thing about my proposal...

On Fri, Mar 9, 2012 at 09:28, Xample <flavien...@gmail.com> wrote:

[...]


> It is allowed to drink poison but I would not do it

... and the best solution for this is to _not_ have poison to drink in
the first place. Hence my proposal.

Xample

Mar 12, 2012, 4:31:02 AM
to json-...@googlegroups.com
Haha, yes… until there is a good reason to… (which is actually the problem)

Francis Galiegue

Mar 12, 2012, 4:38:41 AM
to json-...@googlegroups.com
On Sat, Mar 10, 2012 at 00:13, Xample <flavien...@gmail.com> wrote:
[...]

>
> 1. I do not agree again. The rfc specifies the following fallback : id
> relatively to the url where the shema has been got, relatively to the
> environment. This is only a matter of building absolute paths from relative
> one.The rfc RFC 2396 tells how to do it, you should be able to use the
> java.net.URL by URL url = new URL( "http://base.com/url" ,
> "../relativeOne.html");
>
> Normally: new URL( "http://absoluteOne.com" , "http://notARelative.com");
> will give you http://notARelative.com
>

And if the machine names don't exist, this fails. URL does name
resolution. You cannot rely on that.

Please, understand my point of view: I want to keep things _simple_,
it is the best way to kickstart a more widespread use of JSON Schema.
Restricting the values of "id" the way I proposed it makes
implementations easy.
