RFC for opt-in streaming support in encoding/json package


Jonathan Hall

Aug 9, 2019, 10:33:07 AM
to golang-nuts

I debated posting here, or straight to GitHub. If that's the better place, I can move the thread there. I have long wanted proper streaming support in the `encoding/json` library. Lately I’ve been doing some digging to understand the current state of things, and I think I’ve come to grips with most of it.


A number of previous issues relate to this topic: https://github.com/golang/go/issues/7872, https://github.com/golang/go/issues/11046, https://github.com/golang/go/issues/12001, https://github.com/golang/go/issues/14140


I have read through each of these issues, and believe I have a fair understanding of the problems associated with streaming JSON input/output. If I'm overlooking something, please enlighten me.


In a nutshell: The library implicitly guarantees that marshaling will never write an incomplete JSON object due to an error, and that during unmarshaling, it will never pass an incomplete JSON message to `UnmarshalJSON`.
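To make the marshaling half of that guarantee concrete, here is a small, self-contained illustration (the `failing` type is my own, invented for the example): even with `json.Encoder`, a marshaling error means nothing at all reaches the underlying writer.

    package main

    import (
        "bytes"
        "encoding/json"
        "errors"
        "fmt"
    )

    // failing is a stand-in for any type whose marshaling fails partway.
    type failing struct{}

    func (failing) MarshalJSON() ([]byte, error) {
        return nil, errors.New("boom")
    }

    func main() {
        var buf bytes.Buffer
        err := json.NewEncoder(&buf).Encode(map[string]interface{}{
            "ok":  "fine",
            "bad": failing{},
        })
        fmt.Println(err)       // marshaling error from failing.MarshalJSON
        fmt.Println(buf.Len()) // 0: nothing was written to the destination
    }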


Work toward this was done about 3 years ago, on this CL: https://go-review.googlesource.com/c/go/+/13818/

The work was eventually abandoned, apparently because the author was unsure how to make the new behavior opt-in. I believe this proposal will solve that issue.


The problem to be solved


Dealing with large JSON structures is inefficient, due to the internal buffering done by `encoding/json`. `json.NewEncoder` and `json.NewDecoder` appear to offer streaming benefits, but the advantage is mostly idiomatic, not performance-related, as internal buffering still takes place. Particular aspects of the broader problem are already addressed in the links above, and the benefits of streaming should be easy to imagine, so I won't bore people with details unless someone asks.

A naïve solution


I believe a simple solution (simple from the perspective of a consumer of the library--the internal changes are not so simple) would be to add two interfaces:


    type StreamMarshaler interface {
        MarshalJSONStream(io.Writer) error
    }

    type StreamUnmarshaler interface {
        UnmarshalJSONStream(io.Reader) error
    }


During (un)marshaling, where `encoding/json` looks for `json.Marshaler` and `json.Unmarshaler` respectively, it will now look for (and possibly prefer) the new interfaces instead. Wrapping either the old or new interfaces to work as the other is a trivial matter.
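To illustrate that wrapping, rough sketches might look like the following. `StreamMarshaler` is the proposed interface above, so none of this exists in today's `encoding/json`:

    // Buffer a StreamMarshaler so it satisfies json.Marshaler--this is
    // also the encoder's default (non-partial) behavior described below.
    type bufferedMarshaler struct {
        sm StreamMarshaler
    }

    func (b bufferedMarshaler) MarshalJSON() ([]byte, error) {
        var buf bytes.Buffer
        if err := b.sm.MarshalJSONStream(&buf); err != nil {
            return nil, err
        }
        return buf.Bytes(), nil
    }

    // And the reverse: expose a json.Marshaler as a StreamMarshaler.
    type streamedMarshaler struct {
        m json.Marshaler
    }

    func (s streamedMarshaler) MarshalJSONStream(dst io.Writer) error {
        b, err := s.m.MarshalJSON()
        if err != nil {
            return err
        }
        _, err = dst.Write(b)
        return err
    }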


With this change, and the requisite internal changes, it would be possible to begin streaming large JSON data to a server immediately, from within a `MarshalJSONStream()` implementation, for instance.
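As a concrete (and hypothetical) example, a type along these lines could stream a large binary attachment as a base64-encoded JSON string, without ever holding the encoded form in memory. Base64 output contains no characters that need JSON escaping, so it can be written raw between quotes:

    // attachment wraps a large data source, e.g. a file or network stream.
    type attachment struct {
        r io.Reader
    }

    func (a attachment) MarshalJSONStream(dst io.Writer) error {
        if _, err := io.WriteString(dst, `"`); err != nil {
            return err
        }
        enc := base64.NewEncoder(base64.StdEncoding, dst)
        if _, err := io.Copy(enc, a.r); err != nil {
            return err
        }
        if err := enc.Close(); err != nil { // flush any partial block
            return err
        }
        _, err := io.WriteString(dst, `"`)
        return err
    }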


The drawback is that this violates the above-mentioned promise that reads and writes are complete, even when errors occur.


Opt-in


To accommodate this requirement, I believe it would be possible to expose the streaming functionality _only_ via the `json.Encoder` and `json.Decoder` implementations, and only when `EnablePartial*` (name TBD) is enabled. To that end, the following two methods would be added to the public API:


    func (*Encoder) EnablePartialWrites(on bool)


    func (*Decoder) EnablePartialReads(on bool)


The default behavior, even when a type implements one of the new `Stream*` interfaces, will be to operate on an entire JSON object at once. That is to say, the Encoder will internally buffer `MarshalJSONStream`'s output and process any error before continuing, and the Decoder will read an entire JSON object into a buffer, passing it to `UnmarshalJSONStream` only if there are no errors.


However, when `EnablePartial*` is enabled, the library will bypass this internal buffering, allowing for immediate streaming to/from the source/destination.
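Hypothetical usage (again, these method names are part of this proposal, not the current API):

    enc := json.NewEncoder(conn) // conn: e.g. a net.Conn or HTTP request body
    enc.EnablePartialWrites(true)
    if err := enc.Encode(hugeDoc); err != nil {
        // Some output may already have reached conn before the error;
        // the caller must be prepared for a truncated JSON stream.
        return err
    }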


For many users, enabling streaming with the `EnablePartial*` toggle alone could be enough to see a benefit, even without the additional interfaces above.


Toggling `EnablePartial*` on will, of course, enable streaming for all types, not just those which implement the new interfaces above, so this could be considered a separate part of the proposal. In my opinion, this alone would be worth implementing, even if the new interface types above are added later, or never.


Internals


CL 13818 is very informative for this part of the discussion. I've also done some digging in the `encoding/json` package (as of 1.12) recently, for more current context. A large number of internal changes will be necessary to allow for this. I started playing around with a few internals, and I believe this is doable, but it will mean a lot of code churn, so it will need to be done carefully, in small steps, with good code review.


As an exercise, I have successfully rewritten `indent()` to work with streams rather than byte slices, and have begun doing the same with `compact()`. The `encodeState` type would need to work with a standard `io.Writer`, rather than specifically a `bytes.Buffer`. This seems to be a bigger change, but not a technically difficult one. I know there are other changes needed--I haven't done a complete audit of the code.
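To give a sense of the shape of that change, compare the exported `Indent()` with a streaming equivalent (`indentStream` is a name I've made up for illustration):

    // Today: both source and destination are fully buffered.
    func Indent(dst *bytes.Buffer, src []byte, prefix, indent string) error

    // A streaming equivalent (sketch): same semantics, but it reads and
    // writes incrementally, so neither side need fit in memory.
    func indentStream(dst io.Writer, src io.Reader, prefix, indent string) error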


An open question is how these changes might impact performance. My benchmarks after changing `indent()` showed no change in performance, but it wasn't a particularly rigorous test.


With the internals rewritten to support streams, it's then just a matter of doing the internal buffering at the appropriate places, such as at API boundaries (i.e. in `Marshal()` and `Unmarshal()`), rather than as a built-in fundamental concept. Then, as described above, that buffering is turned off when the encoder or decoder is configured for partial reads or writes.
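In other words, something like the following sketch, where `writeValue` stands in for a hypothetical stream-native internal encoder:

    func Marshal(v interface{}) ([]byte, error) {
        var buf bytes.Buffer
        if err := writeValue(&buf, v); err != nil {
            // The buffer is discarded, so the existing guarantee--no
            // partial output on error--still holds at this boundary.
            return nil, err
        }
        return buf.Bytes(), nil
    }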


Final comments


To be clear, I am interested in working on this. I'm not just trying to throw out a "nice to have, now would somebody do this for me?" type of proposal. But I want to make sure I fully understand the history and context of this situation before I head too far down this rabbit hole.


I'm curious to hear the opinions of others who have been around longer. Perhaps such a proposal was already discussed (and possibly rejected?) at greater length than I can find in the above-linked tickets. If so, please point me to the relevant conversation(s).


I am aware of several third-party libraries that offer support along these lines, but most have drawbacks (reliance on code generation, or overly complex APIs). I would love to see this kind of support in the standard library. And one last aside: CL 13818 also added support for marshaling channels. That may or may not be a good idea (my personal feeling: probably not), but it can be addressed separately.


burak serdar

Aug 9, 2019, 10:46:19 AM
to Jonathan Hall, golang-nuts
Instead of modifying the existing Encoder/Decoder, wouldn't it be
better to do this as a separate encoder/decoder?




Jonathan Hall

Aug 9, 2019, 10:53:13 AM
to golang-nuts
Can you say more? Better in which way?



On Friday, August 9, 2019 at 4:46:19 PM UTC+2, burak serdar wrote:

Instead of modifying the existing Encoder/Decoder, wouldn't it be
better to do this as a separate encoder/decoder?





burak serdar

Aug 9, 2019, 11:14:52 AM
to Jonathan Hall, golang-nuts
On Fri, Aug 9, 2019 at 8:53 AM Jonathan Hall <fli...@flimzy.com> wrote:
>
> Can you say more? Better in which way?

Better in the way that it wouldn't change existing code. Also, I think
the use cases for existing and proposed json encoders/decoders are
different enough to justify a separate implementation. A while ago,
with concerns similar to the ones you described, I wrote a separate
json encoder/decoder, using the existing ones, to deal with large json
files. The purpose was to stream json input/output, and to allow
direct json manipulation in memory. I ended up using it only for cases
where I expect large json data, and use the std encoder/decoder for
small json processing like config files, or API calls with known
structures.

Jonathan Hall

Aug 9, 2019, 11:38:09 AM
to golang-nuts
Thanks for the reply. My responses inline below.


On Friday, August 9, 2019 at 5:14:52 PM UTC+2, burak serdar wrote:
On Fri, Aug 9, 2019 at 8:53 AM Jonathan Hall <fli...@flimzy.com> wrote:
>
> Can you say more? Better in which way?

Better in the way that it wouldn't change existing code.

That doesn't seem like a benefit in its own right.

I understand the desire not to just change code for its own sake, or add extra features nobody needs. But people have been asking for these types of features for several years.  This doesn't seem like a frivolous code change to me.
 
Also, I think
the use cases for existing and proposed json encoders/decoders are
different enough to justify a separate implementation.

I don't think I agree with this.

The proposal deals with a subset of current use cases, but not, strictly speaking, a _different set_ of use cases. And the number of commenters on the issues linked above lends weight, I think, to the idea that the use cases this proposal addresses are neither insignificant nor fundamentally "different".

If I were to fork the standard `encoding/json` library, and add my proposed functionality, the code would still be 95% the same. Standard reasons for sharing code apply, as far as I can tell.

Robert Engels

Aug 9, 2019, 11:53:41 AM
to Jonathan Hall, golang-nuts
In other environments (e.g. Java), the streaming processors are distinct from the instance-oriented ones, usually for good reason, as the APIs are very different: the former is usually event based, while the latter returns realized instances as a whole, or an error. The streaming processors can often skip ill-formed entities, and/or have them manipulated during decoding.


Jonathan Hall

Aug 9, 2019, 12:00:46 PM
to golang-nuts
An interesting observation.

Although in a sense, we already have the decoding half of that low-level streaming API exposed by way of the `json.Decoder` type.

Would you be suggesting that a separate, stream-based API makes sense even within the stdlib?

I'm not really sure what that separate API would look like, or do differently than my proposal (I'm open to new ideas, though).

Given that the "Go way" of handling streams is with io.Reader/io.Writer (as opposed to events, for example), and the internal implementation of `encoding/json` is already so close to that, I wonder if the APIs would end up looking very much the same anyway.

Jonathan

Ian Davis

Aug 9, 2019, 12:15:31 PM
to golan...@googlegroups.com
On Fri, 9 Aug 2019, at 3:33 PM, Jonathan Hall wrote:

[original proposal quoted in full; snipped]


You may also be interested in a CL I created last year to add an unbuffered write mode to the encoder.


I think I addressed all the review comments, but it stalled behind a tangential issue around the current version's use of sync.Pool: https://github.com/golang/go/issues/27735

Ian


Robert Engels

Aug 9, 2019, 4:10:23 PM
to Jonathan Hall, golang-nuts
I'm sorry, maybe I didn't understand your original concern. There is an example of doing stream-based processing in the json package (using Decoder).

How is this not sufficient?

The only problem I see with it is that it doesn't allow error correction/continuation, but in the modern world that seems rather rare (or very difficult to do well).




burak serdar

Aug 9, 2019, 4:17:41 PM
to Robert Engels, Jonathan Hall, golang-nuts
On Fri, Aug 9, 2019 at 2:10 PM Robert Engels <ren...@ix.netcom.com> wrote:
>
> I'm sorry, maybe I didn't understand your original concern. There is an example of doing stream-based processing in the json package (using Decoder).
>
> How is this not sufficient?
>
> The only problem I see with it is that it doesn't allow error correction/continuation, but in the modern world that seems rather rare (or very difficult to do well).

I was thinking similarly, and after reading those github issues, it
looks like the main problem is with the Encoder, not the Decoder. The
Encoder's problem can be solved by providing an unbuffered output
option that writes directly to the io.Writer.

I like the idea of stream-friendly marshaler/unmarshaler interfaces.

Jonathan Hall

Aug 9, 2019, 5:27:43 PM
to golang-nuts
The problem is that, when encoding, even with json.Encoder, the entire object is marshaled into memory before it is written to the io.Writer. My proposal allows writing the JSON output immediately, rather than waiting for the entire process to complete successfully first.

The same problem occurs in reverse--when reading a large JSON response, you cannot begin processing the result until the entire result is received.

To anchor these abstract concepts to real life, let me offer an example of each where this is quite painful:

When writing a CouchDB document, it may contain an arbitrary amount of data, possibly even including Base64-encoded attachments. In some extreme cases, these documents may be multiple megabytes; dozens or hundreds of kilobytes is not at all unusual. A typical use case may have 10 kB of normal JSON, with an additional 200 kB of, say, an image. The current JSON implementation buffers this entire payload, ensures there are no marshaling errors, then writes to the `io.Writer`. My proposal would allow writing immediately, with no need to buffer hundreds of kilobytes of JSON.

In the case of CouchDB, the reverse may actually be more harmful (and I've already gone to some lengths to mitigate the worst of it, using json.Decoder's tokenizer API):

A typical query returns multiple documents (which, again, may be up to hundreds of kilobytes each). With the existing implementation, one must read the entire resultset from the network before parsing the first document. My proposal would make it possible to begin reading the individual JSON documents (and indeed, even individual parts of those documents) without waiting for the entire result to be buffered.
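For the curious, the mitigation looks roughly like this with the current `json.Decoder` token API. The `rows` field name and `handle` callback are illustrative, and this assumes `rows` is the first key in the response:

    func streamRows(r io.Reader, handle func(json.RawMessage) error) error {
        dec := json.NewDecoder(r)
        // Consume the opening `{`, the "rows" key, and the opening `[`.
        for i := 0; i < 3; i++ {
            if _, err := dec.Token(); err != nil {
                return err
            }
        }
        for dec.More() { // one document at a time
            var doc json.RawMessage
            if err := dec.Decode(&doc); err != nil {
                return err
            }
            if err := handle(doc); err != nil {
                return err
            }
        }
        _, err := dec.Token() // the closing `]`
        return err
    }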

Jonathan Hall

Aug 9, 2019, 5:28:46 PM
to golang-nuts
Oh, thanks for pointing that out.

It is indeed very similar to my proposal. What do you think the chances are of getting it resurrected and merged? Is more discussion still needed with respect to sync.Pool?

Jonathan Hall

Aug 9, 2019, 5:31:54 PM
to golang-nuts
I think you're right that most people are frustrated by the encoder, but as I mentioned in another message just a bit ago, the same fundamental problem exists with the decoder, and for certain workloads, I believe it should be solved.

Having said that, I think tackling the writer first definitely makes the most sense.

With the decoder, at least it's possible, although terribly cumbersome, to cobble together a solution with the tokenizer interface of json.Decoder. And truth be told, my proposal wouldn't really eliminate the use of the tokenizer interface--but it would make it possible to use it on a buried type (i.e. within an UnmarshalJSONStream() method), to achieve the benefit of stream reading.
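A sketch of what that might look like, reusing the same tokenizer loop as my earlier example (the `rows` type is invented, and `UnmarshalJSONStream` is of course the proposed interface, not the current one):

    // rows handles each element of a (possibly huge) JSON array as it
    // arrives, without buffering the whole array.
    type rows struct {
        handle func(json.RawMessage) error
    }

    func (r *rows) UnmarshalJSONStream(src io.Reader) error {
        dec := json.NewDecoder(src)
        if _, err := dec.Token(); err != nil { // opening `[`
            return err
        }
        for dec.More() {
            var doc json.RawMessage
            if err := dec.Decode(&doc); err != nil {
                return err
            }
            if err := r.handle(doc); err != nil {
                return err
            }
        }
        _, err := dec.Token() // closing `]`
        return err
    }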


burak serdar

Aug 9, 2019, 5:41:17 PM
to Jonathan Hall, golang-nuts
On Fri, Aug 9, 2019 at 3:32 PM Jonathan Hall <fli...@flimzy.com> wrote:
>
> I think you're right that most people are frustrated by the encoder, but as I mentioned in another message just a bit ago, the same fundamental problem exists with the decoder, and for certain workloads, I believe it should be solved.
>
> Having said that, I think tackling the writer first definitely makes the most sense.
>
> With the decoder, at least it's possible, although terribly cumbersome, to cobble together a solution with the tokenizer interface of json.Decoder. And truth be told, my proposal wouldn't really eliminate the use of the tokenizer interface--but it would make it possible to use it on a buried type (i.e. within an UnmarshalJSONStream() method), to achieve the benefit of stream reading.

It may not be terribly cumbersome, and may not need the tokenizer
interface. I have a json streaming library (streaming in the sense of
multiple json docs one after the other, not one large doc) based on
the std encoder/decoder, and something like that can be developed to
deal with large json docs.

https://github.com/bserdar/jsonstream

Jonathan Hall

Aug 10, 2019, 10:18:11 AM
to golang-nuts
You're absolutely right. I just meant that the tokenizer interface wouldn't be completely replaced. There are still some corner cases where it will be necessary.