How to stream across json-seq RFC-7464

Greg Saylor

Mar 27, 2021, 3:42:40 PM
to golang-nuts
Good afternoon,

I have a file containing a sequence of JSON objects (they could be arrays too, as the underlying object type seems irrelevant), as per RFC 7464.  I cannot figure out how to handle this in a memory-efficient way that doesn't involve pulling each blob entirely into memory.

I've tried to express this on Go playground here: https://play.golang.org/p/Aqx0gnc39rn
Note that I'm using exponent-io/jsonpath as the JSON decoder, but certainly that could be swapped for something else.

In essence here is an example of the input bytes:

{
   "elements" : [
      {
         "Space" : "YCbCr",
         "Point" : {
            "Cb" : 0,
            "Y" : 255,
            "Cr" : -10
         }
      },
      {
         "Point" : {
            "B" : 255,
            "R" : 98,
            "G" : 218
         },
         "Space" : "RGB"
      }
   ]
}
{
   "elements" : [
      {
         "Space" : "YCbCr",
         "Point" : {
            "Cb" : 3000,
            "Y" : 355,
            "Cr" : -310
         }
      },
      {
         "Space" : "RGB",
         "Point" : {
            "B" : 355,
            "G" : 318,
            "R" : 108
         }
      }
   ]
}
{
   "elements" : [
      {
         "Space" : "YCbCr",
         "Point" : {
            "Cr" : -410,
            "Cb" : 400,
            "Y" : 455
         }
      },
      {
         "Space" : "RGB",
         "Point" : {
            "B" : 455,
            "R" : 118,
            "G" : 418
         }
      }
   ]
}

I can iterate through that with this code:

w := json.NewDecoder(bytes.NewReader(j))
for w.More() {
    var v interface{}
    w.Decode(&v)
    fmt.Printf("%+v\n", v)
}

This works, but the downside is that each {...} blob has to be pulled into memory.  And the function that is called is already designed to receive an io.Reader and parse the VERY large inner blob in an efficient manner.

So in principle, this is kind of what I want to do, but maybe I'm looking at it all wrong:


w := json.NewDecoder(bytes.NewReader(j))
for w.More() {
    reader2 := ???? // some io.Reader that represents each of the 3 json-seq blocks
    secondDecoder(reader2)
}

func secondDecoder(reader io.Reader) {
    w2 := json.NewDecoder(reader)
    var v interface{}
    w2.Decode(&v)
    fmt.Printf("%+v\n", v)
}

Any ideas on how to solve this problem?

I should note that it is not possible for the input to change in this case as the system that consumes it is not the same one that has been generating it for the past 5 years.

Thanks!

- Greg

Brian Candler

Mar 28, 2021, 4:26:17 AM
to golang-nuts
> This works, but the downside is that each {...} blob has to be pulled into memory.  And the function that is called is already designed to receive an io.Reader and parse the VERY large inner blob in an efficient manner.

Is the inner blob decoder actually using a json.Decoder, as shown in your example func secondDecoder()?  In that case, the simplest and most efficient answer is to create a persistent json.Decoder which wraps the underlying io.Reader directly, and just keep calling w2.Decode(&v) on each call.  It will happily consume the stream, one object at a time.

If that's not possible for some reason, then it sounds like you want to break the outer stream at top-level object boundaries, i.e. { ... }, without fully parsing it.  You can do that with json.RawMessage.

However, you've still read each object as a stream of bytes into memory, and you've still done some of the work of parsing the JSON to find the start and end of each object.  You can turn it back into an io.Reader by wrapping it in bytes.NewBuffer, if that's what the inner parser requires.  But if each object is large, and you really need to avoid reading it into memory at all, then you'd need some sort of rewindable stream.

Another approach is to stop the source generating pretty-printed JSON, and have it generate JSON Lines format instead.  It sounds like you're unable to change the source, but you might be able to un-pretty-print the JSON with an external tool (jq can do this with its -c flag).  You could then make a custom io.Reader which returns data up to a newline, then returns EOF and hands you a fresh io.Reader for the next line.

But this is all very complicated, when keeping the inner Decoder around from object to object is a simple solution to the problem that you described.  Is there some other constraint which prevents you from doing this?

Greg Saylor

Mar 28, 2021, 2:17:32 PM
to golang-nuts
The inner blob parser is expecting an io.Reader.  But perhaps I can change that to pass a Decoder, based on what you are saying.  For some reason I hadn't grokked that that's how Decoder works.  Just to reiterate what I think you are saying (and in case anyone stumbles across this thread later), assume a file that has this type of structure (call each of the outer blobs A, B, C for reference):

{
  [
   {...},
   {...}
  ]
}
{
  [
   {...},
   {...}
  ]
}
[
  {...},
  {...}
]


The first call to Decode() will move the pointer to the first `{` in A.
   Something like exponent-io/jsonpath SeekTo() could be used to advance to A's `[`
   The second call to Decode(), with the embedded reader, will set the position at A's first inner {...}
   Each subsequent call to Decode() will process each inner {...} of A one at a time until More() is false, at which point the position is at A's `]`

The third call to Decode() will move the pointer to the first `{` in B.  Question: is this in fact correct?  If not, how do I get the reader to this point of the stream?
   The fourth call to Decode() will allow me to stream-read to B's `[` (in this case using exponent-io/jsonpath SeekTo() or some other mechanism)
   Each subsequent call to Decode() will process each inner {...} of B one at a time until More() is false, at which point the position is at B's `]`

The fifth call to Decode() will move the pointer to the first `[` in C.
   Each subsequent call to Decode() will process each inner {...} of C one at a time until More() is false


I realize this may not be what is actually going on internally inside these packages, but at a high level, is that conceptually close to what is going on?

If this is true, I gotta say this is one of the things I *LOVE* about Go.  I cannot count the number of times I've had some complicated problem which Go makes a whole lot easier.  Or put another way: I was over-complicating the problem and not recognizing the underlying code defect that should change.  In fact, refactoring this code, even though it's used in about 100 places, would be trivial.  I could probably just use perl -pi -e to fix the code.  And, if I may be a bit indulgent here, the quality of the answers that come out of the Golang community is just amazing.  I love reading this mailing list even though I've only posted to it a few times.

- Greg

Brian Candler

Mar 28, 2021, 3:15:34 PM
to golang-nuts
No, it's even simpler than that:

* The first call to decoder.Decode() will return the first object in the stream.
* The second call to decoder.Decode() will return the second object in the stream.
* And so on...

By "object" I mean top-level object: everything between the opening "{" and its matching closing "}", including all its nested values.  (Define a struct which contains all the nested attributes, for it to be deserialized into).

If an io.Reader stream consists of a series of separate JSON objects - as yours does - then you get one object at a time.  They don't have to be separated by whitespace or newlines, but they can be.

Don't think about seeking.  I don't know the internals of decoder.Decode(), but I would expect that it reads in chunks from the io.Reader.  This means it will likely overshoot the object boundaries, but will buffer the excess and process it on the next call to Decode.

Greg Saylor

Mar 28, 2021, 5:35:20 PM
to golang-nuts

I've tried this suggestion, and although it's certainly a bit more refactoring than I expected, the outcome looks to be exactly as you described here.

Thank you so much for the suggestion, take a bow!

- Greg

Sean Liao

Mar 29, 2021, 7:23:03 AM
to golang-nuts
If you can guarantee your input is always pretty-printed like that, you could use bufio with a custom SplitFunc to match `\n{`; no need to double-parse the JSON.