[Haskell-cafe] JSON parser that returns the rest of the string that was not used


Ryan Newton

May 29, 2016, 1:10:26 PM
to Haskell Cafe
As someone who spent many years putting data in S-expression format, it seems natural to me to write multiple S-expressions (or JSON objects) to a file, and expect a reader to be able to read them back one at a time.

This seems comparatively uncommon in the JSON world.  Accordingly, it looks like the most popular JSON parsing lib, Aeson, doesn't directly provide this functionality.  Functions like decode just return a "Maybe a", not the left-over input, meaning that you would need to somehow split up your multi-object file before attempting to parse, which is annoying and error-prone.

It looks like maybe you can get Aeson to do what I want by dropping down to the attoparsec layer and messing with IResult.
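Concretely, dropping down to the attoparsec layer might look like the following sketch (assuming the aeson and attoparsec packages, with module and constructor names as of aeson-0.11; decodeWithRest is a hypothetical name, not an aeson function):

```haskell
import Data.Aeson (Value)
import Data.Aeson.Parser (json)
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B

-- Parse the first JSON value off the front of the input and return
-- the unconsumed remainder alongside it.
decodeWithRest :: B.ByteString -> Maybe (Value, B.ByteString)
decodeWithRest bs =
  case A.feed (A.parse json bs) B.empty of  -- feed "" to resolve a Partial
    A.Done rest v -> Just (v, rest)
    _             -> Nothing
```

Looping this, feeding the returned remainder back in, would read a file of concatenated objects one value at a time.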

But is there a better way to do this?  Would this be a good convenience routine to add to aeson in a PR?  I.e. would anyone else use this?

Thanks,
  -Ryan


Sanae

May 29, 2016, 1:24:22 PM
to haskel...@haskell.org
You could drop down to the attoparsec layer, but instead of messing with IResults, use it to make another parser that will parse all the objects in the file.

E.g.  json `sepBy` skipSpace :: Parser [Value]

sepBy and skipSpace both taken from Data.Attoparsec.Text
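Spelled out, that parser might look like the following sketch (since aeson's json parser works over ByteString, the ByteString variants of sepBy and skipSpace are assumed here; allValues and parseAll are hypothetical names):

```haskell
import Data.Aeson (Value)
import Data.Aeson.Parser (json)
import Data.Attoparsec.ByteString.Char8 (Parser, parseOnly, sepBy, skipSpace)
import qualified Data.ByteString.Char8 as B8

-- Parse every JSON value in the input, separated by whitespace.
allValues :: Parser [Value]
allValues = json `sepBy` skipSpace

parseAll :: B8.ByteString -> Either String [Value]
parseAll = parseOnly allValues
```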
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Stephen Tetley

May 29, 2016, 1:54:00 PM
to Haskell Cafe
Hi Ryan

Isn't this a problem of JSON rather than its parsers?

That is to say, I believe (but could easily be wrong...) that a file
with multiple JSON objects would be ill-formed; it would be
well-formed if the multiple objects were in a single top-level array.

Nikita Volkov

May 29, 2016, 2:48:38 PM
to rrne...@gmail.com, Haskell Cafe
I know of at least two packages providing incremental JSON parsing functionality: json-stream and json-incremental-decoder.

Being the author of the latter, I recommend checking out both.

Sun, May 29, 2016 at 20:10, Ryan Newton <rrne...@gmail.com>:

Adam Bergmark

May 29, 2016, 4:28:08 PM
to Nikita Volkov, Haskell Cafe

Ryan Newton

May 29, 2016, 5:53:59 PM
to Sanae, Haskell Cafe
Thanks, I'll have to try it and see if the Parser [Value] can enable streaming/incremental IO.

Carter Schonwald

May 29, 2016, 8:18:15 PM
to rrne...@gmail.com, Haskell Cafe
Last spring some colleagues and I wrote a properly streaming JSON parser that could incrementally emit JSON primitive values as it's fed bytestrings.

There are some corner cases that come up around how attoparsec's float parser parses 0.0 as two different zero literals depending on how it's split between chunks, but that aside it's a pretty simple stack machine implementation that worked out pretty well.

Ryan Newton

May 29, 2016, 8:53:00 PM
to Stephen Tetley, Haskell Cafe
On Sun, May 29, 2016 at 1:53 PM, Stephen Tetley <stephen...@gmail.com> wrote:
Isn't this a problem of JSON rather than its parsers?

I can understand that a file with multiple JSONs is not a legal "JSON text".  But... isn't that issue separate from whether parsers expect terminated strings, or, conversely, are tolerant of arbitrary text following the JSON expr?  Scheme "read" functions from time immemorial would read the first expression off a handle without worrying about what followed it!  It doesn't mean the whole file needs to be valid JSON, just that each prefix chewed off the front is valid JSON.

Thanks to Nikita for the links to json-stream and json-incremental-decoder.  My understanding is that if I use a top-level array to wrap the objects, then these approaches will let me retain a streaming/incremental IO.  I'm not sure yet how to use this to stream output from a monadic computation.

Let me be specific about the scenario I'm trying to handle:

Criterion loops over benchmarks and, after running each, writes the report out to disk, appending it to a file:


This way, the report doesn't sit in memory affecting subsequent benchmarks.  (I.e. polluting the live set for major GC.)  When all benchmarks are complete, the reports are read back from the file.

There are bugs in the binary serialization used in the linked code.  We want to switch it to dump and read back in JSON instead.

In this case, we can just write an initial "[" to the file, and then serialize one JSON object at a time, interspersed with ",".  That's ok... but it's kind of an ugly solution -- it requires that we, the clients of the JSON serialization API, make assumptions about the serialization format and reimplement a tiny fraction of it.
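A stdlib-only sketch of that workaround, where encodeRecord is a hypothetical stand-in for a real per-record JSON encoder (in the Criterion scenario each piece would be appended to the file as soon as its benchmark finishes, rather than concatenated in memory):

```haskell
import Data.List (intercalate)

-- Hypothetical per-record encoder standing in for a real JSON encoder.
encodeRecord :: Int -> String
encodeRecord n = "{\"report\":" ++ show n ++ "}"

-- Emit the array brackets and separating commas ourselves, so the
-- result is one well-formed top-level JSON array.
writeReports :: [Int] -> String
writeReports rs = "[" ++ intercalate "," (map encodeRecord rs) ++ "]"
```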

Cheers,
   -Ryan

Richard A. O'Keefe

May 29, 2016, 10:26:06 PM
to haskel...@haskell.org


On 30/05/16 5:53 AM, Stephen Tetley wrote:
> Hi Ryan
>
> Isn't this a problem of JSON rather than its parsers?
>
> That is to say, I believe (but could easily be wrong...) that a file
> with multiple JSON objects would be ill-formed; it would be
> well-formed if the multiple objects were in a single top-level array.

"A file with multiple JSON objects would be ill-formed" -- it would be an
ill-formed *what*?

The media type application/json appears to describe a format
containing precisely one JSON value, but RFC 7159 is otherwise silent
about streams of JSON values.

JSON is sometimes used as the format for entries in logs;
it would be pretty useless for that if you couldn't have more than
one in a sequence.

If a JSON value is true, false, or null it ends at its last letter;
if it's a string it ends at the closing double quote;
if it's an array it ends at the closing ];
if it's an object it ends at the closing };
only if it is a number is there any need to check the next
character, but then only one character needs to be checked,
and thanks to the requirement that numbers be in ASCII, only
one byte needs to be checked, there being no need to decode
the next Unicode code point in full.
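These ending rules are simple enough to sketch directly in plain Haskell; the splitValue below is a hypothetical, simplified illustration of them (it matches brackets and honours string escapes, but does no other validation):

```haskell
import Data.Char (isDigit, isSpace)

-- Split one JSON value off the front of the input, returning the
-- value's text and the unconsumed rest.
splitValue :: String -> Maybe (String, String)
splitValue s0 = case dropWhile isSpace s0 of
  s@('{':_) -> bracket s
  s@('[':_) -> bracket s
  s@('"':_) -> Just (string s)
  s | take 4 s == "true"  -> Just (splitAt 4 s)
    | take 4 s == "null"  -> Just (splitAt 4 s)
    | take 5 s == "false" -> Just (splitAt 5 s)
    | (n@(_:_), r) <- span isNumChar s -> Just (n, r)
  _ -> Nothing
  where
    isNumChar c = isDigit c || c `elem` "-+.eE"

    -- a string ends at its closing double quote (escapes respected)
    string ('"':cs) = go cs "\""
      where
        go ('\\':c:cs') acc = go cs' (c : '\\' : acc)
        go ('"':cs')    acc = (reverse ('"':acc), cs')
        go (c:cs')      acc = go cs' (c:acc)
        go []           acc = (reverse acc, [])
    string s = ("", s)

    -- an array/object ends at the bracket that closes its opener
    bracket = go (0 :: Int) ""
      where
        go d acc (c:cs)
          | c == '"'             = let (v, r) = string (c:cs)
                                   in go d (reverse v ++ acc) r
          | c == '{' || c == '[' = go (d + 1) (c:acc) cs
          | c == '}' || c == ']' =
              if d == 1 then Just (reverse (c:acc), cs)
                        else go (d - 1) (c:acc) cs
          | otherwise            = go d (c:acc) cs
        go _ _ [] = Nothing
```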

Ryan Newton

May 30, 2016, 9:17:10 AM
to Richard A. O'Keefe, Haskell Cafe
Thanks Richard.  I didn't know that the spec was precise about the JSON expr not going beyond the closing character.  (I wasn't sure, for instance, if it would also include whitespace after that point.)

For logging, I bet it helps if people try to enforce the invariant that JSON text doesn't internally include newlines...  

Best,
  -Ryan

Richard A. O'Keefe

May 30, 2016, 8:42:24 PM
to rrne...@gmail.com, Haskell Cafe


On 31/05/16 1:16 AM, Ryan Newton wrote:
> Thanks Richard. I didn't know that the spec was precise about the
> JSON expr not going beyond the closing character.


First, older versions of JSON required a value to be an object or
an array. Second, the JSON grammar in RFC 7159 is quite
precise.

Insignificant whitespace is allowed before or after any of the six
structural characters.
-- That is, [ ] { } , :
-- Oddly enough, the specification does NOT say that whitespace
-- is allowed before or after any other token; it appears that
-- "false" is legal at JSON top level but " false" is not.

ws = *(
%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ) ; Carriage return


value = false / null / true / object / array / number / string

false = %x66.61.6c.73.65 ; false

null = %x6e.75.6c.6c ; null

true = %x74.72.75.65 ; true

Note that it's not a matter of scanning a sequence of letters
and then checking for particular values, it must be one of
those three exact sequences. Once you have read the "e"
of "false" there is no point in reading any further. You
certainly don't need to skip white space, indeed, if you take
the specification literally, you mustn't. (But, sigh, it IS ok
to skip white space after a final ] or }. Such are the standards
the net is made from.)

object = ws %x7B ws [ member *( ws %x2C ws member ) ]
ws %x7D ws

member = string ws %x3A ws value

array = ws %x5B ws [ value *( ws %x2C ws value ) ] ws %x5D ws

And yes, the grammar is ambiguous. Consider
"[ [ ] ]"
Does the first white space character go with the first left bracket
or the second one?
All they needed to do was to say that strings, any other values,
and , : ] } can be preceded by insignificant white space, and the
ambiguity would be gone and " false" would be legal.

Every kind of number ends with a block of digits; since white space
isn't allowed after numbers, the next character, whatever it is,
should not be consumed, but must be checked to make sure it is
not a digit.

I wonder if anyone has a JSON parser that follows the letter of the
standard? Preparing this message has made me realise that
(a) mine doesn't and (b) I don't really want it to.