[erlang-questions] SAX-like JSON Parser for Erlang


Yash Ganthe

Jan 18, 2013, 7:18:32 AM
to erlang-questions

SAX allows the client application to read an XML document piece by piece. This differs from DOM, which expects the entire XML document to be loaded in memory. SAX is thus useful for reading very large XML documents.

 

Mochijson is a good JSON parser which emits structs that correspond to individual JSON strings. However, it expects the entire JSON string to be given to its functions.

If I have about 10000 records in a JSON document such as this,

{ "d" :
  [
    { "ID": 1, "Name": "p1", "Email": "p...@p1.com" },
    { "ID": 2, "Name": "p2", "Email": "p...@p2.com" },
    { "ID": 3, "Name": "p3", "Email": "p...@p3.com" },
    { "ID": 4, "Name": "p4", "Email": "p...@p4.com" } . . . . .
  ]
}

the entire JSON string would have to be first obtained and then passed to mochijson/mochijson2.

 

I am looking for a way to let the module give me one record at a time from the large JSON-formatted array. Is there any module that lets us do this?
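
For illustration only, here is a minimal sketch of the kind of behaviour being asked for. It is not an existing library: the module name `json_records` and its `new/0` and `feed/2` functions are hypothetical. It assumes the exact shape shown above (records are objects that open at nesting depth 2, inside `{"d": [ ... ]}`), tracks only bracket depth and string state, and hands back each complete record as a binary so that it can be decoded on its own.

```erlang
-module(json_records).
-export([new/0, feed/2]).

%% Hypothetical helper, not part of mochijson or any other library.
%% depth: current {/[ nesting; str: out | in | esc (string state);
%% acc: reversed bytes of the record being collected, or none.
-record(st, {depth = 0, str = out, acc = none}).

new() -> #st{}.

%% feed(Chunk, State) -> {CompleteRecords, NewState}
%% Chunks may split the input anywhere, even inside a string.
feed(Chunk, St) when is_binary(Chunk) ->
    scan(Chunk, St, []).

scan(<<>>, St, Out) ->
    {lists:reverse(Out), St};
scan(<<C, Rest/binary>>, St, Out) ->
    case step(C, St) of
        {emit, Rec, St1} -> scan(Rest, St1, [Rec | Out]);
        St1 -> scan(Rest, St1, Out)
    end.

%% Inside a string, brackets are data; only track escapes and the end quote.
step(C, St = #st{str = esc}) -> push(C, St#st{str = in});
step($\\, St = #st{str = in}) -> push($\\, St#st{str = esc});
step($", St = #st{str = in}) -> push($", St#st{str = out});
step(C, St = #st{str = in}) -> push(C, St);
step($", St) -> push($", St#st{str = in});
%% An object opening at depth 2 starts a new record.
step(${, St = #st{depth = 2, acc = none}) -> St#st{depth = 3, acc = [${]};
step(C, St = #st{depth = D}) when C =:= ${; C =:= $[ ->
    push(C, St#st{depth = D + 1});
step(C, St = #st{depth = D}) when C =:= $}; C =:= $] ->
    case push(C, St#st{depth = D - 1}) of
        #st{depth = 2, acc = Acc} = St1 when Acc =/= none ->
            %% Back at depth 2 while collecting: the record is complete.
            {emit, list_to_binary(lists:reverse(Acc)), St1#st{acc = none}};
        St1 -> St1
    end;
step(C, St) -> push(C, St).

%% Keep the byte only while a record is being collected.
push(_C, St = #st{acc = none}) -> St;
push(C, St = #st{acc = Acc}) -> St#st{acc = [C | Acc]}.
```

Each binary returned by `feed/2` is a complete JSON object, so it could be passed to `mochijson2:decode/1` one record at a time while the rest of the input is still arriving. The sketch does no validation; malformed input is simply mis-split.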

 

Thanks,

Yash

Dmitry Kolesnikov

Jan 18, 2013, 7:21:12 AM
to Yash Ganthe, erlang-questions
Hello,


- Dmitry

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Thomas Lindgren

Jan 18, 2013, 8:55:01 AM
to erlang-questions


One potential problem with SAX-style parsing is that repeated keys in a JSON object will use/return only the last value (a behaviour inherited from JavaScript, I believe). Thus, to find the value of a given key you need in principle to parse the whole object anyway. This makes SAX less convenient, though using it might still save memory.

If you know repeated keys will never happen (the JSON RFC says "The names within an object SHOULD be unique."), you could roll the dice and go ahead anyway.

Best,
Thomas



Willem de Jong

Jan 18, 2013, 2:43:50 PM
to Yash Ganthe, erlang-questions
Hi Yash,

I don't know if it meets your requirements, but I wrote a SAX-like JSON parser a couple of years ago. You can find it on trapexit.org in the user contributions section:


Good luck,
Willem

Richard O'Keefe

Jan 22, 2013, 12:51:42 AM
to Thomas Lindgren, erlang-questions

On 19/01/2013, at 2:55 AM, Thomas Lindgren wrote:

>
>
> One potential problem with SAX-style parsing is that repeated keys in a JSON object will use/return only the last value (a behaviour inherited from javascript, I believe).

When implementing a JSON parser, I found the JSON specification
silent about this and asked for clarification. Taking the last
value for repeated keys is *NOT* a defined property of JSON nor
is it something you can expect as a de-facto standard. Some
parsers will take the first value. Some will take the last one.
Some will raise a sensible exception. Some a non-sensible one.
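
To make the difference concrete, here is a small sketch (not taken from any particular parser) of three possible policies, applied to an object's members as a list of `{Key, Value}` pairs in document order. The module name `dup_keys` and its function names are hypothetical.

```erlang
-module(dup_keys).
-export([first/2, last/2, strict/2]).

%% Members: [{Key, Value}] in the order the parser reported them.

%% First occurrence wins; a streaming reader could stop at the first hit.
first(Key, Members) ->
    case lists:keyfind(Key, 1, Members) of
        {Key, Value} -> {ok, Value};
        false -> error
    end.

%% Last occurrence wins; this needs every member up to the closing brace.
last(Key, Members) ->
    first(Key, lists:reverse(Members)).

%% Duplicates are rejected; this also needs the whole object.
strict(Key, Members) ->
    case [Value || {K, Value} <- Members, K =:= Key] of
        [Only] -> {ok, Only};
        [] -> error;
        _Several -> {error, duplicate_key}
    end.
```

With `Members = [{<<"a">>, 1}, {<<"b">>, 2}, {<<"a">>, 3}]`, looking up `<<"a">>` gives `{ok, 1}` from `first/2`, `{ok, 3}` from `last/2` and `{error, duplicate_key}` from `strict/2`.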

Thomas Lindgren

Jan 22, 2013, 7:41:11 AM
to Richard O'Keefe, erlang-questions




----- Original Message -----
> From: Richard O'Keefe <o...@cs.otago.ac.nz>
> To: Thomas Lindgren <thomasl...@yahoo.com>
> Cc: erlang-questions <erlang-q...@erlang.org>
> Sent: Tuesday, January 22, 2013 6:51 AM
> Subject: Re: [erlang-questions] SAX-like JSON Parser for Erlang
>
>
> On 19/01/2013, at 2:55 AM, Thomas Lindgren wrote:
>
>>
>>
>> One potential problem with SAX-style parsing is that repeated keys in a JSON object will use/return only the last value (a behaviour inherited from javascript, I believe).
>
> When implementing a JSON parser, I found the JSON specification
> silent about this and asked for clarification.  Taking the last
> value for repeated keys is *NOT* a defined property of JSON nor
> is it something you can expect as a de-facto standard.  Some
> parsers will take the first value.  Some will take the last one.
> Some will raise a sensible exception.  Some a non-sensible one.

Could you give a reference to that clarification? It could be useful in upcoming discussions.

Taking the RFC at face value, it seems the behaviour is unspecified (since it only says the use of repeated keys is discouraged). As far as I'm aware, though, the common expectation would be that the decoder behaves "like JavaScript". I've seen some proposed usage that relies on the parser handling duplicate keys (without throwing exceptions). See the fairly recent node.js mailing list archives.

Best,
Thomas

Michel Rijnders

Jan 22, 2013, 7:59:32 AM
to erlang-questions
Yep, I was recently looking into this because I'm writing a library that will provide XPath-like querying for JSON. The only thing RFC 4627 (JSON) says is: "The names within an object SHOULD be unique." But this is "should" in the sense of RFC 2119, i.e. it means "recommended"...
--
My other car is a cdr.

Richard O'Keefe

Jan 22, 2013, 9:03:12 PM
to Michel Rijnders, erlang-questions

On 23/01/2013, at 1:59 AM, Michel Rijnders wrote:

> Yep, I was recently looking into this because I'm writing a library that will provide XPath like querying for JSON.

That would be JSON pointers, would it?
http://tools.ietf.org/html/draft-pbryan-zyp-json-pointer-00
the last time I looked at it. What's happening about that,
anyway?

Bob Ippolito

Jan 22, 2013, 10:41:33 PM
to Richard O'Keefe, erlang-questions
JSON pointers are much less powerful than XPath, it's not really a query language. No predicates, result must be a single node. It's just '/foo/bar/baz/0' instead of 'obj.foo.bar.baz[0]' (in JS).
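
As a rough sketch of how little a pointer does, the following resolves one against a term in the mochijson2 style (objects as `{struct, [{BinaryKey, Value}]}`, arrays as lists). The module name `json_pointer_sketch` is hypothetical, and the `~0`/`~1` escapes from the JSON pointer draft are deliberately left out.

```erlang
-module(json_pointer_sketch).
-export([resolve/2]).

%% resolve(<<"/foo/bar/baz/0">>, Json) -> {ok, Value} | error
%% The empty pointer <<>> refers to the whole document.
resolve(Pointer, Json) when is_binary(Pointer) ->
    case binary:split(Pointer, <<"/">>, [global]) of
        [<<>> | Tokens] -> walk(Tokens, Json);
        _ -> error                       % a non-empty pointer must start with "/"
    end.

walk([], Value) ->
    {ok, Value};
walk([Key | Rest], {struct, Props}) ->
    %% Object step: the token is a member name.
    case lists:keyfind(Key, 1, Props) of
        {Key, Value} -> walk(Rest, Value);
        false -> error
    end;
walk([Token | Rest], List) when is_list(List) ->
    %% Array step: the token is a zero-based index.
    try binary_to_integer(Token) of
        I when I >= 0, I < length(List) ->
            walk(Rest, lists:nth(I + 1, List));
        _ ->
            error
    catch
        error:badarg -> error
    end;
walk(_, _) ->
    error.
```

Every step selects exactly one child, which is why the result is always a single node: there is nowhere to put a predicate or a wildcard.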

Something in-between (more powerful than JSON pointers, less powerful than XPath) would be something like https://github.com/etrepum/kvc -- It won't generate results from a stream, so you'd need to use it with a standard JSON parser.

-bob

Richard O'Keefe

Jan 23, 2013, 12:29:42 AM
to Bob Ippolito, erlang-questions

On 23/01/2013, at 4:41 PM, Bob Ippolito wrote:

> JSON pointers are much less powerful than XPath, it's not really a query language. No predicates, result must be a single node. It's just '/foo/bar/baz/0' instead of 'obj.foo.bar.baz[0]' (in JS).
>
> Something in-between (more powerful than JSON pointers, less powerful than XPath) would be something like https://github.com/etrepum/kvc -- It won't generate results from a stream, so you'd need to use it with a standard JSON parser.

The examples in the README.md of https://github.com/etrepum/kvc
do things like

wibble =:= kvc:path(foo.bar.baz, [{foo, [{bar, [{baz, wibble}]}]}]).

This feels wrong to me. A path should be a _list_ of
step descriptions, [foo,bar,baz]. Reasons:
(1) You can have integer steps (this element of a tuple)
    as well as atom steps (this entry in a dict &c).
    And you can also have atom steps that _look_ like integers.
(2) It is more efficient not to have to split a path into steps
    at run time.
(3) Perhaps the most painfully obvious:
    a named step might need to include a dot (or any other fixed
    character) in its name.
(4) Given a recursive data structure with a small set of labels,
    using dotted atoms you can quickly exhaust the size of
    Erlang's atom table.

In many ways, this is a perfect example of "strings are wrong".
The abstract concept we need here is
    "path"
and
    "path" = sequence of "step"
and
    "step" = receiver-specific position or label

Packing a path up as a dotted atom or any other kind of
string representation means having to recover at run time
and unreliably information that has been _hidden_ inside
the string, when it could have been made directly available
as a simple data structure.
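
A minimal sketch of the list-of-steps idea, for illustration only: this is not the kvc implementation, and the module name `steps_sketch` is hypothetical. It assumes atom steps look up entries in a list of `{Key, Value}` pairs and integer steps pick an element of a tuple; a missing key or wrong container simply crashes.

```erlang
-module(steps_sketch).
-export([path/2]).

%% path(Steps, Data): Steps is already a list, so nothing is parsed
%% or split at run time.
path([], Value) ->
    Value;
path([Step | Rest], Tuple) when is_integer(Step), is_tuple(Tuple) ->
    %% Integer step: this element of a tuple.
    path(Rest, element(Step, Tuple));
path([Step | Rest], Props) when is_list(Props) ->
    %% Any other step: this entry in a key-value list.
    {Step, Value} = lists:keyfind(Step, 1, Props),
    path(Rest, Value).
```

With this shape, `steps_sketch:path([foo, bar, baz], [{foo, [{bar, [{baz, wibble}]}]}])` gives `wibble`, and steps such as `'a.b'` (a name containing a dot) or `'1'` (an atom that looks like an integer) need no special treatment, as in `steps_sketch:path(['a.b', 2], [{'a.b', {skip, ok}}])`.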

I have a key-value component for my Smalltalk library, and
in that (1,3,4) were not issues, but (2) had my programmer's
instincts screaming 'this is a bad idea'. In fact one of the
things on my TODO list is to replace
    aPath subStrings: ' .' asClass: Symbol trimmed: true
by
    aPath _steps
in order to let aPath be returned if it's a sequence of steps
already.

Fortunately(?) the README.md is incomplete, and that KVC
implementation _does_ accept a list of steps (which look as
though they have to be binaries). That module is better than
it sounds.

By the way, things like this remind me irresistibly of
Niklaus Wirth's "Professor Cleverbyte's Visit to Heaven"
(Software Practice and Experience, Vol 7 pp155-158, 1977).

Bob Ippolito

Jan 23, 2013, 2:02:38 AM
to Richard O'Keefe, erlang-questions
Sounds like I ought to fix up that documentation. 

My primary goal was to just clone Apple's Key-Value Coding API (which I should've stated more explicitly in the README). I never used this library for anything but introspecting data structures interactively in the shell. In that environment I find it to be friendlier if I can input the paths as an atom or string. Efficiency wasn't much of a concern.

Paul Davis

Jan 23, 2013, 1:30:34 PM
to Bob Ippolito, erlang-questions
And similar to kvc is the props library:



Richard O'Keefe

Jan 23, 2013, 7:06:37 PM
to Bob Ippolito, erlang-questions

On 23/01/2013, at 8:02 PM, Bob Ippolito wrote:

> Sounds like I ought to fix up that documentation.
>
> My primary goal was to just clone Apple's Key-Value Coding API (which I should've stated more explicitly in the README). I never used this library for anything but introspecting data structures interactively in the shell. In that environment I find it to be friendlier if I can input the paths as an atom or string. Efficiency wasn't much of a concern.

That should definitely be stated in the documentation.
I don't find [foo,bar,ugh] any harder to type than <<"foo.bar.ugh">>
or for that matter "foo.bar.ugh". Same number of characters as the
string, four fewer than the binary. foo.bar.ugh I grant you.

There is one crucial feature of Apple's key-value coding
that does not apply in the Erlang context. Each non-@ step
must be the name of an Objective C selector or variable.
That means that such steps cannot themselves contain dots
or at-signs. Erlang atoms _may_ contain dots and at-signs,
and may even do so without quoting. alp...@beth.gamow is a
perfectly good Erlang unquoted atom.

> I have a key-value component for my Smalltalk library, and
> in that (1,3,4) were not issues, but (2) had my programmer's
> instincts screaming 'this is a bad idea'. In fact one of the
> things on my TODO list is to replace
> aPath subStrings: ' .' asClass: Symbol trimmed: true
> by
> aPath _steps
> in order to let aPath be returned if it's a sequence of steps
> already.

Did it last night. I decided to call it #asPropertyPath rather
than #_steps.