[ANN] clarinet - SAX based streaming JSON parser in JavaScript for the browser and nodejs


Nuno Job

Dec 21, 2011, 11:55:41 AM
to nod...@googlegroups.com
I just released clarinet, a SAX-based JSON parser written in JavaScript. It works in node.js and in the browser:


For more information on why you would need this and how it compares to native JSON.parse, feel free to read this article (includes a performance benchmark):


Feedback is (always) welcome :)

Nuno

Dave Clements

Dec 21, 2011, 5:42:16 PM
to nodejs
hey nuno,

I'm not too hot with stats etc., but if I'm reading it right, your conclusion says that JSON.parse is faster, especially with larger docs... does this mean clarinet only has an advantage with large JSON blocks where memory limitations are concerned?

Nuno Job

Dec 21, 2011, 6:11:58 PM
to nod...@googlegroups.com

Yes, you are correct. JSON.parse "cheats" because it's precompiled, so no pure JavaScript parser can, in theory, have the same performance characteristics [1].

The advantages are:

1) Having a SAX-like API (for my use case the parsed JSON is about as useful as the string itself).
2) Memory usage, as you said.
3) JSON.parse cannot process chunks that aren't valid JSON on their own.
4) I think JSON.parse is blocking, but I would love someone here to confirm this. If so, you might rather spend some more time in clarinet while not blocking node. However, this is unlikely to be a factor in most applications, because it only matters when many users are dealing with large JSON files at the same time.
5) Clarinet works outside v8 too :)
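To make point 3 concrete: JSON.parse throws as soon as it sees a truncated chunk, while a SAX-style parser keeps state across writes. A minimal sketch follows; the clarinet-like event names in the comments are taken from its README and should be treated as an assumption here, not a spec:

```javascript
// JSON.parse needs the whole document at once: a truncated chunk throws.
let threw = false;
try {
  JSON.parse('{"user": "nuno", "tags": ["json"'); // cut off mid-array
} catch (e) {
  threw = true; // SyntaxError: this chunk is not valid JSON by itself
}
console.log(threw); // true

// A SAX-style parser instead accepts input in pieces and emits events
// as values complete. Hypothetical sketch with clarinet-like events:
//
//   const parser = clarinet.parser();
//   parser.onkey   = (key)   => console.log('key:',   key);
//   parser.onvalue = (value) => console.log('value:', value);
//   parser.write('{"user": "nu');  // no error: parser state is kept
//   parser.write('no"}');          // onkey/onvalue fire here
//   parser.close();
```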

Bottom line: if in doubt, use JSON.parse. It's likely that if you need something like clarinet, you know you need something like clarinet :)

Hope this helps
Nuno

[1] See the comment by Alex Russell at http://www.quora.com/Where-can-I-find-JSON-parser-used-in-V8-JavaScript-Engine

--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en

Mikeal Rogers

Dec 21, 2011, 6:27:42 PM
to nod...@googlegroups.com
Calling C from JS is not free. Precompiling doesn't necessarily mean that it's going to be faster.

One thing that I'm sure is not helping pure-JS parser performance is the conversion from buffers to strings for every chunk, in addition to the conversions into JavaScript types.

The only way to make a JS parser faster is to limit conversions. Parsing buffers rather than strings could eliminate some conversions, and a declarative model for the parser, so that it only selectively converted values to JavaScript types, could do the same.

-Mikeal

Micheil Smith

Dec 21, 2011, 6:32:15 PM
to nod...@googlegroups.com
Nuno:

Would an ideal use case for clarinet be the bulk uploading of records
as JSON where the data is in the form of an Array? eg:

{ data: [
    { ... },
    { ... },
    ...
] }

Then you'd be able to get an event back for each item added to the
'data' key's array?

Regards,
Micheil Smith
--
BrandedCode.com

Nuno Job

Dec 21, 2011, 6:42:13 PM
to nod...@googlegroups.com

Yes, you will get one event fired for each part of the array. It's a good use case, especially if you have no need for the document per se. But if the file is not chunked and/or large enough, why bother? Unless you want to learn about streaming parsers, of course. Or if you like all things JavaScript :) Both good reasons in my book. Plus I did the stats on performance, so you can reason about it. Is 1 second to parse 13MB of JSON that slow? For most applications the answer is probably no.
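As a toy illustration of that per-item eventing, here is a hand-rolled sketch (not clarinet itself): it tracks brace depth and string state across chunk boundaries and assumes the payload has the shape { "data": [ {…}, {…} ] } with plain objects as items:

```javascript
// Feed chunks of {"data":[{...},{...}]} and invoke a callback per
// completed item, even when chunk boundaries fall mid-item.
function makeItemStream(onItem) {
  let depth = 0, buf = '', inStr = false, esc = false, collecting = false;
  return function write(chunk) {
    for (const ch of chunk) {
      if (collecting) buf += ch;          // accumulate the current item
      if (inStr) {                        // skip structure chars in strings
        if (esc) esc = false;
        else if (ch === '\\') esc = true;
        else if (ch === '"') inStr = false;
        continue;
      }
      if (ch === '"') { inStr = true; continue; }
      if (ch === '{') {
        depth++;
        if (depth === 2) { collecting = true; buf = '{'; } // item starts
      } else if (ch === '}') {
        if (depth === 2) {                // item complete: emit it
          collecting = false;
          onItem(JSON.parse(buf));
        }
        depth--;
      }
    }
  };
}

const items = [];
const write = makeItemStream(it => items.push(it));
write('{"data":[{"id":1},{"i');  // chunk boundary falls mid-item
write('d":2}]}');
console.log(items); // [ { id: 1 }, { id: 2 } ]
```

With clarinet itself you would build the same abstraction from its open/close-object events instead of scanning raw characters.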

Mikeal was asking for filters to do something like that and only selectively emit. Clarinet is IMO lower level than that, but one could write an abstraction on top of clarinet that does it.

On Dec 21, 2011 11:32 PM, "Micheil Smith" <mic...@brandedcode.com> wrote:

Nuno Job

Dec 22, 2011, 11:25:50 AM
to nod...@googlegroups.com
> One thing that I'm sure is not helping with pure js parser performance are the conversions from buffers to strings for all the chunks in addition to the conversions in to javascript types.

Well, in my sync tests (the ones reflected on the page) this is not a problem, because the conversion is done before I run the tests. It's assumed that the buffer and the strings are available before the test starts.

This was the only fair way to test the behavior in the browser, where buffers don't exist.

Still, Mikeal is right. There are some things you can do to make a streaming parser faster:

1) Emit less (and, overall, call fewer functions).
2) Do less string work and more buffer work.
3) Avoid working on native types and doing type conversions.
4) Avoid calling functions to do comparisons in the loop.
5) When possible, avoid state transitions and favor an inner loop that then substrings the chunk.
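Tip 5 in concrete form, using a hypothetical string-scanning step of a tokenizer (not clarinet's actual code): rather than checking state once per character, let a native scan find the boundary and take one substring:

```javascript
// Per-character version: one state check per char in JS land.
function readStringPerChar(chunk, start) {
  let i = start;
  while (i < chunk.length && chunk[i] !== '"') i++;
  return chunk.slice(start, i);
}

// Substring version: indexOf does the scan natively, then slice once.
function readStringSubstring(chunk, start) {
  const end = chunk.indexOf('"', start);
  return chunk.slice(start, end === -1 ? chunk.length : end);
}

const chunk = '"hello world", "next"';
console.log(readStringPerChar(chunk, 1));   // hello world
console.log(readStringSubstring(chunk, 1)); // hello world
```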

Also check the awesome blog post (and the presentation that goes with it) by @mraleph: http://blog.mrale.ph/. And for performance work, this is a really cool thing to have in mind (thanks @sidorares @mraleph @ryah):

npm install -g tick; node --prof server.js; node-tick-processor | less

The problem is that you are left dealing with all the encoding issues in the world, plus in the browser this doesn't even apply, so you need two code bases. Having checked that my parser was still faster, for the use case I wanted, than jsonparse (which emits a lot less and works on buffers, not strings), I would argue the small performance benefit is not sufficient to make buffers attractive for me (if I thought it was, I would have reimplemented everything with buffers). Other things have more impact, like emitting a lot and doing type conversions. I still do those because I believe they are fundamental to a streaming JSON parser.

Anyway, if you find this interesting and feel like I screwed up, feel free to fork my project and improve performance. Faster parsers will do everyone a favor. But fast is not the only requirement for a parser.
 
> and a declarative model for the parser so that it only selectively converted values to javascript types

Mikeal, can you give me some pseudocode so I understand what this is? I didn't follow. Do you mean giving the write functions something to look for and selectively ignoring everything else until it is found?

Like, I'm interested in author.name, and then when I parse, if I see something that is not author I just go on and ignore everything until I pop back to the level I want author in? IMHO this would be an information-retrieval technique, more suited to indexes than to the parsing stage. Still, if this is what you're looking for, I think it's super interesting. You could go from over the wire to response in no time for really large responses. It's still inefficient as a generic use case (in the don't-emit-your-full-database-to-get-an-aggregate sense) but open to super cool use cases (e.g. I'm streaming Wikipedia and I only want to look at x keys).

If this is it, I'll probably do some research on it after the Node Summit, or maybe we can try to hack a spec together when I'm in SFO next.

Also, once again, if anyone wants to give it a try in clarinet, feel free. Just contact me about the syntax for the selectors, because I won't be very happy with anything more than dot notation and simple predicates, e.g. book.tags[1]
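A sketch of how such a selector layer might compile dot-notation selectors and match them against the parser's current path. Everything here is hypothetical (the function names and the idea of the parser exposing a path stack are assumptions, not clarinet's actual interface):

```javascript
// Compile "book.tags[1]" into a path array: [ 'book', 'tags', 1 ].
// Only dot notation plus a simple numeric index predicate is supported.
function compileSelector(sel) {
  return sel.split('.').flatMap(part => {
    const m = part.match(/^([^\[]+)(?:\[(\d+)\])?$/);
    return m[2] === undefined ? [m[1]] : [m[1], Number(m[2])];
  });
}

// Compare the parser's current path stack against a compiled selector.
function matches(path, compiled) {
  return path.length === compiled.length &&
         compiled.every((p, i) => p === path[i]);
}

const compiled = compileSelector('book.tags[1]');
console.log(compiled);                               // [ 'book', 'tags', 1 ]
console.log(matches(['book', 'tags', 1], compiled)); // true
console.log(matches(['book', 'tags', 0], compiled)); // false
```

The layer on top of the parser would push keys and array indices onto a path stack as open/key events arrive, pop them on close events, and emit a value only when matches() returns true.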

nuno

Nuno Job

Jan 18, 2012, 11:54:49 AM
to nod...@googlegroups.com
An update on this. Jann Horn and I are working on implementing the selectors @mikeal asked for. Jann has added them to clarinet core without compromising performance (we are now parsing all of npmjs in 0.8 seconds, down from almost 2 seconds when we started clarinet).

We think the best way to expose selectors is with a library that builds on top of clarinet. We are currently discussing what that interface should look like, so comments from the community are most welcome:


Thank you guys, happy JSON streaming ;)
Nuno