Streaming Support

Duraid Abbas

May 31, 2012, 10:19:47 AM
to json...@googlegroups.com
Hi, 

I'm new here and I'm interested in the JSON-stat format. I had a quick look at the dataset format, and it occurred to me that, given the way the data is structured, you have to read a full dataset before you can process it record by record.

In the CSV format, for example, one can read the file record by record, which scales well.

So my question is: how does JSON-stat support streaming?

Xavier Badosa

May 31, 2012, 1:19:08 PM
to json...@googlegroups.com
Duraid,

I'm not familiar with the streaming JSON parsers out there (Clarinet, Gson, JSON-stream, Jackson...) or with how they handle a big array. I sincerely hope they don't need to load the full array into memory.

If someone is acquainted with these parsers, please speak up.

X.

Xavier Badosa

Jun 11, 2012, 12:11:48 PM
to json...@googlegroups.com
Duraid,

Very recently the US Census Bureau published its API.

The interesting thing about it (I'm sure you'd love it) is that it is a CSV file expressed in JSON.

Have a look at this response (apparently, population by US state according to the 2010 Census):

[
["P0010001","NAME","state"],

["710231","Alaska","02"], 
["4779736","Alabama","01"], 
["2915918","Arkansas","05"], 
["6392017","Arizona","04"], 
["37253956","California","06"],
...
]

Because the response is an array (an array of arrays), if it can be read sequentially (streamed), then a JSON-stat response can be too: in JSON-stat the data are also inside an array (an array of single values).
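
A minimal sketch of what reading such an array sequentially could look like, using only Python's standard library rather than any of the parsers named earlier in the thread. The function name `iter_array_items` and the chunk size are made up for illustration. The point is that only the element currently being parsed is buffered, whether that element is a Census row or a single JSON-stat value.

```python
import io
import json


def iter_array_items(stream, chunk_size=64):
    """Yield the elements of a top-level JSON array one at a time.

    Hypothetical helper: it is lenient about separators and is meant
    as an illustration, not as a validating parser.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False  # True once the opening "[" has been consumed
    while True:
        chunk = stream.read(chunk_size)
        eof = chunk == ""
        buf += chunk
        # Parse as many complete elements as the buffer currently holds.
        while True:
            buf = buf.lstrip(" \t\r\n,") if started else buf.lstrip()
            if not buf:
                break
            if not started:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
                continue
            if buf[0] == "]":
                return
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # element is incomplete: read another chunk
            # Accept the element only when a delimiter follows it, so a
            # number split across two chunks is never emitted half-read.
            if end == len(buf) or buf[end] not in " \t\r\n,]":
                break
            yield item
            buf = buf[end:]
        if eof:
            raise ValueError("input ended before the array was closed")


# Array of arrays (Census style) and array of single values (JSON-stat style).
census = io.StringIO(
    '[["P0010001","NAME","state"],\n'
    '["710231","Alaska","02"],\n'
    '["4779736","Alabama","01"]]'
)
for row in iter_array_items(census):
    print(row)

values = io.StringIO("[710231, 4779736, 2915918]")
for value in iter_array_items(values, chunk_size=4):
    print(value)
```

The same loop serves both layouts; what differs is how much metadata travels with each element.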

It's true that in the CSV model the "metadata" are beside every single datum. But it's also true that this model does not contain rich metadata, and particularly no metadata about the dataset (in the example, no time reference, no unit, no source...).

In my opinion, the US Census Bureau initiative stresses the need to have a standard like JSON-stat: what we're seeing here is yet another way of expressing statistical results in JSON for dissemination purposes. I'm not saying that JSON-stat as it is is the final solution: I'm just saying that we should agree on a single solution.

Duraid Abbas

Jun 12, 2012, 2:15:58 PM
to json...@googlegroups.com
Hi Xavier, 

I'm not sure I understand how the JSON-stat format can also support streaming. Can you please provide an equivalent in JSON-stat format for the US Census data above?

Let me stress that streaming does not depend on the format (whether CSV, JSON or XML) but on how the data is laid out, because all formats can support streaming (some more naturally than others) if the data is laid out in a way that supports it.

A simple and very common use case for streaming is inserting the data into a database. For the US Census data that you provided, the program only needs to hold one row (or record) in memory at a time, which scales very well.
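
The use case described here can be sketched roughly as follows, assuming the rows arrive one at a time from some streaming reader. The generator `census_rows`, the table name `population` and its columns are all hypothetical, and only the five rows quoted earlier in the thread are used.

```python
import sqlite3

# The rows shown earlier in the thread; the header row comes first.
CENSUS_ROWS = [
    ["P0010001", "NAME", "state"],
    ["710231", "Alaska", "02"],
    ["4779736", "Alabama", "01"],
    ["2915918", "Arkansas", "05"],
    ["6392017", "Arizona", "04"],
    ["37253956", "California", "06"],
]


def census_rows():
    """Stand-in for a streaming parser: hands out one row at a time."""
    yield from CENSUS_ROWS


def load_population(conn, rows):
    """Insert records one by one; only the current row is held in memory."""
    rows = iter(rows)
    header = next(rows)
    if header != ["P0010001", "NAME", "state"]:
        raise ValueError(f"unexpected header: {header}")
    conn.execute(
        "CREATE TABLE population "
        "(state TEXT PRIMARY KEY, name TEXT, persons INTEGER)"
    )
    inserted = 0
    for persons, name, state in rows:
        conn.execute(
            "INSERT INTO population (state, name, persons) VALUES (?, ?, ?)",
            (state, name, int(persons)),
        )
        inserted += 1
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
inserted = load_population(conn, census_rows())
print(inserted, "rows inserted")
```

Each record is self-describing once the header is known, which is what makes this layout easy to process row by row.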

Xavier Badosa

Jun 13, 2012, 12:33:27 PM
to json...@googlegroups.com
Duraid,

I'm assuming that streaming is only needed when datasets are very big. In that case, it makes sense to request only the data, something that JSON-stat allows: either you already retrieved the categories used in the different dimensions in a previous request, or you simply know how the response will be structured because that information is already contained in the request.

The response would mainly contain an array of data (the US population example):

{
  "P0010001": {
    "value": [
      710231,
      4779736,
      2915918,
      6392017,
      37253956,
      ...
    ]
  }
}
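
A rough sketch of how a client might consume such a data-only response, assuming it already knows the order of the categories from an earlier request. The list `KNOWN_STATE_ORDER` and the helper `pair_values` are hypothetical; the order is taken from the five Census rows quoted earlier, and the elided values are left out.

```python
import json

# Assumed to be known beforehand (e.g. from an earlier metadata request):
# the state codes in the order in which their values appear.
KNOWN_STATE_ORDER = ["02", "01", "05", "04", "06"]

# Data-only response restricted to the five values shown in the thread.
RESPONSE = """
{
  "P0010001": {
    "value": [710231, 4779736, 2915918, 6392017, 37253956]
  }
}
"""


def pair_values(response_text, dataset_id, category_ids):
    """Attach each value to its category purely by position."""
    values = json.loads(response_text)[dataset_id]["value"]
    if len(values) != len(category_ids):
        raise ValueError("value array and category list differ in length")
    return dict(zip(category_ids, values))


population = pair_values(RESPONSE, "P0010001", KNOWN_STATE_ORDER)
print(population["06"])
```

The values carry no labels of their own, so the pairing is only as reliable as the client's knowledge of the category order.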

You could even include the dimension categories (the States in the example) in the JSON-stat response. Because properties in JSON are not ordered, JSON-stat allows the dimension information to be placed before the data in those situations where streaming is important.

That said, JSON-stat tries to be as light-weight as possible (lots of info in a few bytes) and as flexible as possible (only include the info needed by the consumer; requests can be broken down in several queries), two features that should minimize the need for streaming.

X.

Duraid Abbas

Jun 14, 2012, 10:14:22 AM
to json...@googlegroups.com
Thank you Xavier. It's clearer now, but can you give an example of what the request that contains the dimension information would look like for the above example?

I understand that it does not have to be a separate request but included in the same request before the data.

Xavier Badosa

Jun 25, 2012, 12:23:10 PM
to json...@googlegroups.com
Duraid,

Sorry for this late reply (I've been busy or out of town lately).

> can you give an example of what the request that contains the dimension information would look like for the above example?

First of all (case 1), the user should be able to request a response with data but no metadata. In that case, the dimension information is not included in the response (or it's only included "by reference").

> I understand that it does not have to be a separate request but included in the same request before the data.

Both things should be possible. In case 1, we are assuming the user retrieved the categories of a classification in a previous request (maybe they were linked to a different dataset; maybe it was a metadata-only request).

For example, imagine a request that looks like this (this could be the request of the Census data example):

metric=pop&time=2010&geo=us&source=census&mult=0&class=state&scheme=censusStateCode&resp=dataonly

All the metadata are already in the request and it's unnecessary to include them in the response. The only important thing to include is the URI of the censusStateCode scheme, so that the user can retrieve that classification if she needs to.
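
To illustrate how much the request alone already tells the client, here is a small sketch that recovers the metadata from the query string given above. The parameter names are the ones from the example request (which is itself only imagined in this thread); the helper `request_metadata` is hypothetical.

```python
from urllib.parse import parse_qsl

QUERY = (
    "metric=pop&time=2010&geo=us&source=census&mult=0"
    "&class=state&scheme=censusStateCode&resp=dataonly"
)


def request_metadata(query):
    """Split the query into descriptive metadata and the response mode."""
    params = dict(parse_qsl(query, strict_parsing=True))
    data_only = params.pop("resp", None) == "dataonly"
    return params, data_only


metadata, data_only = request_metadata(QUERY)
print(data_only)
print(metadata["scheme"])
```

Everything except the classification itself is recoverable this way, which is why only a reference to the scheme needs to travel with the response.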

That's why "uri" is a possible key inside "category" (see "Shared vocabularies as partial responses").

In case 2 (the case you are interested in), the censusStateCode scheme is included in the response BEFORE the data array. It would look like this:

{
  "P0010001": {
    "dimension" : {
      "id" : ["metric", "time", "geo"], "size" : [1, 1, 50],
      "metric": {
        "category" : {
          "label" : { "pop" : "Population" } ,
          "unit" : {
            "type" : { "pop" : "count" } ,
            "base" : { "pop" : "Person" } ,
            "symbol" : { "pop" : null } ,
            "mult" : { "pop" : 0 }
          }
        }
      } ,
      "time" : {
        "category" : {
          "index": {"20100101":0}
        }
      } ,
      "geo" : {
        "category" : {
          "index" : {
            "02" : 0 ,
            "01" : 1 ,
            ...
          } ,
          "label" : {
            "01" : "Alabama" ,
            "02" : "Alaska" ,
            ...
          }
        }
      }
    } ,
    "value": [
      710231,
      4779736,
      2915918,
      6392017,
      37253956,
      ...
    ]
  }
}

In this case, I'm assuming the censusStateCode is a scheme defined by the Census Bureau for the census data that is unknown to the user. Imagine that instead of that particular classification the user requests an international standard (like ISO-3166-2:US) she is familiar with.

(The weird order of the geo "index" is necessary because the Census data array, taken from the Census Bureau API docs, lists the population of Alaska before Alabama's.)
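
A minimal sketch of how the "index" and "label" objects in a response like the one above could be used to look up a value. It is restricted to the five states whose figures appear in the thread, so "size" is [1, 1, 5] here instead of 50. The helpers `value_at` and `label_of` are hypothetical, and the code assumes that values are stored in row-major order following "id" and "size", and that a dimension without "index" has a single category at position 0.

```python
import json

dataset = json.loads("""
{
  "dimension": {
    "id": ["metric", "time", "geo"], "size": [1, 1, 5],
    "metric": {"category": {"label": {"pop": "Population"}}},
    "time": {"category": {"index": {"20100101": 0}}},
    "geo": {"category": {
      "index": {"02": 0, "01": 1, "05": 2, "04": 3, "06": 4},
      "label": {"01": "Alabama", "02": "Alaska", "05": "Arkansas",
                "04": "Arizona", "06": "California"}
    }}
  },
  "value": [710231, 4779736, 2915918, 6392017, 37253956]
}
""")


def value_at(dataset, **coords):
    """Return the value for one category per dimension."""
    dims = dataset["dimension"]
    position = 0
    for dim_id, size in zip(dims["id"], dims["size"]):
        index = dims[dim_id]["category"].get("index")
        # Assumption: no "index" means a single category at position 0.
        offset = index[coords[dim_id]] if index is not None else 0
        position = position * size + offset  # row-major position
    return dataset["value"][position]


def label_of(dataset, dim_id, category_id):
    """Human-readable label of a category."""
    return dataset["dimension"][dim_id]["category"]["label"][category_id]


for state in ("02", "01"):
    print(
        label_of(dataset, "geo", state),
        value_at(dataset, metric="pop", time="20100101", geo=state),
    )
```

Because "dimension" precedes "value", a sequential reader would already hold the index when the first value arrives, so each value can be labelled as it is read.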

X.

Duraid Abbas

Jun 26, 2012, 6:31:35 AM
to json...@googlegroups.com
Thank you Xavier. The picture is very clear to me now.