Extending JSON-stat to support not only dataset dissemination but also dataset discovery would mean that a single format (and a single library) could be used for all the stages of statistics dissemination. In the current state of the standard, this extension seems simple, as it only requires enabling a general grouping mechanism.
JSON-stat supports different native responses (bundle, dataset, dimension). Bundles were just a convenient way to retrieve data from different datasets in a single query (like a waste table and a population table to compute waste per person): they weren't meant as a meaningful way of grouping datasets.
A general grouping mechanism would allow a data provider to group datasets (with data or only metadata [full tabulations of a survey]), but also to group dimensions [classifications], or even to group groups [for example, grouping surveys by topic].
If this addition is approved (v. 1.03), a JSON-stat response of class "collection" would look like this:
{
"class" : "collection",
"label" : "The label (optional)",
"href" : "The self link (optional)",
"link" : {
"item" : [...]
}
}
This is 100% backward compatible with the current specification: it contains a subset of the properties used when class is "dataset". When encountering a class of "collection", a client must look for the link.item array, as this is the core content of this class of native response.
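As a rough illustration of the client rule above, here is a minimal Python sketch that extracts the items of a "collection" response. The helper name `collection_items` is made up for this example (it is not part of any JSON-stat library), and the two items in the sample are invented placeholders.

```python
import json

def collection_items(response):
    """Return the link.item array of a "collection" response (hypothetical helper)."""
    if response.get("class") != "collection":
        raise ValueError("not a collection response")
    # "link" or "item" could be absent in an empty collection: fall back to []
    return response.get("link", {}).get("item", [])

# Sample built from the skeleton above; the two items are invented placeholders
sample = json.loads("""
{
  "class" : "collection",
  "label" : "The label (optional)",
  "href" : "The self link (optional)",
  "link" : {
    "item" : [
      { "class" : "dataset", "label" : "A dataset" },
      { "class" : "collection", "label" : "A nested collection" }
    ]
  }
}
""")

for item in collection_items(sample):
    print(item["class"], "-", item["label"])
```

A client that only understands "dataset" responses would simply never call this helper, which is why the addition should not break existing code.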
Let's see some examples of a possible full JSON-stat API that benefits from this addition:
1) Get the API root
Response:
{
"class" : "collection",
"label" : "Provider.org API root ontology",
"link" : {
"item" : [
{
"class" : "collection",
"label" : "Topics",
"href" : "..."
},
{
"class" : "collection",
"label" : "Statistics",
"href" : "..."
}
]
}
}
Explanation:
This API declares (using a response of class "collection") two paths to retrieve data: browsing through a collection of topics and browsing through a collection of statistics. This response is a collection of collections.
Of course, any provider can organize its datasets as it pleases: "topics" and "statistics" are not elements of the JSON-stat standard: the API root request is where you specify these things.
2) Get an ordered list of available topics
Response:
{
"class" : "collection",
"label" : "Main topics",
"link" : {
"item" : [
{
"class" : "collection",
"label" : "Labor",
"href" : "..."
},
{
"class" : "collection",
"label" : "Prices",
"href" : "..."
}
]
}
}
Explanation:
Again we have a collection of collections. The top-level topics for this provider are "Labor" and "Prices". We know that "Labor" and "Prices" are collections, but in this response it's not yet possible to know if they are collections of datasets or, again, collections of collections (subtopics). You need to continue browsing to know this. In version 1.03 "extension" (http://json-stat.org/format/#extension) will also be available inside Relation ID (http://json-stat.org/format/#relationid) and a provider would be able to use it to describe the nature of these collections.
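To make "continue browsing" concrete, here is a small Python sketch of a client that follows the "href" of each item until it runs out of collections. Everything specific in it is an assumption for illustration: the `walk` function, the `fetch` callable (which stands in for an HTTP request plus JSON parsing) and the paths in `FAKE_API` are all made up and do not belong to any real provider.

```python
def walk(href, fetch, depth=0, seen=None):
    """Depth-first browse starting at href.

    fetch(href) must return a parsed JSON-stat response (a dict).
    Returns a list of (depth, class, label) tuples in visiting order.
    """
    if seen is None:
        seen = set()
    if href in seen:  # guard against collections that link back to each other
        return []
    seen.add(href)
    response = fetch(href)
    rows = [(depth, response.get("class"), response.get("label"))]
    if response.get("class") == "collection":
        for item in response.get("link", {}).get("item", []):
            child = item.get("href")
            if child is not None:
                rows.extend(walk(child, fetch, depth + 1, seen))
    return rows

# Invented in-memory "API": keys are hypothetical paths, not real URLs
FAKE_API = {
    "/topics": {"class": "collection", "label": "Main topics", "link": {"item": [
        {"class": "collection", "label": "Labor", "href": "/topics/labor"},
        {"class": "collection", "label": "Prices", "href": "/topics/prices"}]}},
    "/topics/labor": {"class": "collection",
                      "label": "Statistics available for topic Labor",
                      "link": {"item": [
        {"class": "collection", "label": "Labor Force Survey", "href": "/stats/lfs"}]}},
    "/stats/lfs": {"class": "collection", "label": "Labor Force Survey tables",
                   "link": {"item": [
        {"class": "dataset", "label": "Metadata for employment by sex and age"}]}},
    "/topics/prices": {"class": "collection", "label": "Prices", "link": {"item": []}},
}

tree = walk("/topics", FAKE_API.__getitem__)
for depth, cls, label in tree:
    print("  " * depth + f"{label} [{cls}]")
```

Only after fetching "Labor" does the client learn that it holds further collections rather than datasets, which is exactly the limitation that "extension" inside the link items could remove.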
3) Get the contents of the "Labor" topic
Request:
Response:
{
"class" : "collection",
"label" : "Statistics available for topic Labor",
"link" : {
"item" : [
{
"class" : "collection",
"label" : "Labor Force Survey",
"href" : ".../stats/..."
},
{
"class" : "collection",
"label" : "Structure of Earnings Survey",
"href" : ".../stats/..."
}
]
}
}
Explanation:
Again a collection of collections, as in the previous request. For this provider, topic "Labor" has two children: the Labor Force Survey and the Structure of Earnings Survey. These could have been subtopics of Labor, but for this provider they are something called "Statistics" (see 1)). This is noticeable in the "href" property (.../stats/...). It also means that you can reach these "statistics" by a different path: instead of browsing through topics as we have been doing, you can also find them in the full list of stats or as a result of a search.
Again, if you want to declare these collections as something "special" you can use the "extension" property: from a format point of view, they are just plain JSON-stat collections.
4) Get the content of the Labor Force Survey
Request:
Response:
{
"class" : "collection",
"label" : "Labor Force Survey tables",
"link" : {
"item" : [
{
"class" : "dataset",
"label" : "Metadata for employment by sex and age",
"href" : "..."
},
{
"class" : "dataset",
"label" : "Metadata for unemployment by sex and age",
"href" : "..."
}
]
}
}
Explanation:
Finally, we have a collection of datasets: that's the meaning of "statistic" (survey) for this provider (a collection of tables). The Labor Force Survey contains two datasets (tables): "Employment by sex and age" and "Unemployment by sex and age".
According to the "label" property (for humans), these "datasets" only contain metadata. At this stage, though, machines cannot be aware of this (dataset metadata, dataset data and dataset metadata+data all use the same native response: "dataset").
Again, the provider can use the "extension" property to specify that these datasets are metadata-only.
Could the provider have skipped this step? Yes, it could have chosen to offer datasets with full info at this stage. But in some cases there's a good reason not to (see the end of this message).
5) Get the metadata of the "Employment by sex and age" table
Request:
Response:
{
"class" : "dataset",
"label" : "Metadata for employment by sex and age",
"link" : {
"describes" : [
{
"class" : "dataset",
"label" : "Data for employment by sex and age",
"href" : "..."
}
]
},
"source" : "...",
"updated" : "...",
"dimension" : {...}
//No "value" present
}
Explanation:
This is not a collection anymore. It's a response of class "dataset". Because "value" is missing, a machine can know that this response only contains the metadata for the table. It's also noticeable because in "link" the "describes" relation name is present (it points to the table data).
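The check described above could be sketched in Python like this. The function name `dataset_kind` and its return strings are my own invention for the example; the only rule taken from the text is that a missing "value" means metadata-only and that the presence of "dimension" signals metadata.

```python
def dataset_kind(response):
    """Classify a "dataset" response by the parts it carries (hypothetical helper)."""
    if response.get("class") != "dataset":
        raise ValueError("not a dataset response")
    has_data = "value" in response          # the cells of the table
    has_metadata = "dimension" in response  # the structure of the table
    if has_data and has_metadata:
        return "metadata+data"
    if has_data:
        return "data-only"
    if has_metadata:
        return "metadata-only"
    return "unknown"

# Trimmed-down version of the response above; "dimension" is left empty on purpose
metadata_response = {
    "class": "dataset",
    "label": "Metadata for employment by sex and age",
    "link": {"describes": [
        {"class": "dataset", "label": "Data for employment by sex and age"}]},
    "dimension": {},
}

print(dataset_kind(metadata_response))
```

The "describes" relation is a second hint a client could use, but the sketch relies only on the presence of "value", since that is the more direct signal.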
6) Get the "Employment by sex and age" table
Request:
Response:
{
"class" : "dataset",
"label" : "Employment by sex and age data",
"link" : {
"describedby" : [
{
"class" : "dataset",
"label" : "Metadata for employment by sex and age",
"href" : "..."
}
]
},
"source" : "...",
"updated" : "...",
"dimension" : {...},
"value" : [...],
"status" : {...}
}
Explanation:
Because this response contains, for example, the "dimension" property, a machine is aware that this is not a data-only response (it contains metadata too). The provider could have chosen not to include the metadata: in that case, thanks to the "describedby" relation, a machine would know where to go to retrieve the associated metadata.
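For the data-only case, a client could look up the "describedby" relation along these lines. This is only a sketch: `related_hrefs` is a name I made up, and the values and the "/hypothetical/metadata" path in the sample are invented for illustration.

```python
def related_hrefs(response, relation):
    """Return the hrefs listed under link[relation], e.g. "describedby"."""
    items = response.get("link", {}).get(relation, [])
    return [item["href"] for item in items if "href" in item]

# Invented data-only response: it has "value" but no "dimension"
data_only = {
    "class": "dataset",
    "label": "Employment by sex and age data",
    "value": [1, 2, 3],  # made-up values
    "link": {"describedby": [
        {"class": "dataset",
         "label": "Metadata for employment by sex and age",
         "href": "/hypothetical/metadata"}]},
}

if "dimension" not in data_only:
    # No metadata in the response itself: these are the places to fetch it from
    print(related_hrefs(data_only, "describedby"))
```

The same helper works for "describes" in a metadata-only response, since both relations use the same array-of-links shape.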
In this example, we have assumed that the provider has pre-defined the tabulation of its surveys: these surveys are just a collection of small cubes (tables). Sometimes this is not the case: sometimes what we have is a queryable supercube. That's why step 4 makes sense: you'll get the metadata of the supercube and (new) a description of how it can be queried (which dimensions are required and which aren't, how to filter a particular category of a dimension).
I'll explain how I propose to do this in another message. First, I need your feedback on the "collection" class addition. Does the API I have described in this post make sense to you?