How to parse JSON arrays 1 row at a time


Derek Perkins

Dec 8, 2014, 6:37:34 PM
to golan...@googlegroups.com
I have large JSON arrays that I would like to parse one row at a time. The encoding/json documentation has the following example, which shows how to parse a stream of independent JSON objects. Each "row" is parsed on its own, and the for loop iterates five times.
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    const jsonStream = `
{"Name": "Ed", "Text": "Knock knock."}
{"Name": "Sam", "Text": "Who's there?"}
{"Name": "Ed", "Text": "Go fmt."}
{"Name": "Sam", "Text": "Go fmt who?"}
{"Name": "Ed", "Text": "Go fmt yourself!"}
`
    type Message struct {
        Name, Text string
    }
    dec := json.NewDecoder(strings.NewReader(jsonStream))
    for {
        var m Message
        if err := dec.Decode(&m); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s: %s\n", m.Name, m.Text)
    }
}

This is the same code, but the JSON stream contains a single array rather than many ungrouped objects. The output shows that the entire array was decoded in one pass, so the for loop iterates only once.
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    const jsonStream = `[
{"Name": "Ed", "Text": "Knock knock."},
{"Name": "Sam", "Text": "Who's there?"},
{"Name": "Ed", "Text": "Go fmt."},
{"Name": "Sam", "Text": "Go fmt who?"},
{"Name": "Ed", "Text": "Go fmt yourself!"}
]`
    type Message struct {
        Name, Text string
    }
    dec := json.NewDecoder(strings.NewReader(jsonStream))
    var m []Message
    for i := 1; ; i++ {
        if err := dec.Decode(&m); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("row count: %d\n", i)
    }
    for _, msg := range m {
        fmt.Printf("%s: %s\n", msg.Name, msg.Text)
    }
}

If the inbound JSON is an array of 1 million rows, I don't want to allocate the entire structure at once. I only need to allocate enough memory for each row to be parsed independently. What's the best approach for handling this?

Thanks,
Derek

Dave Cheney

Dec 8, 2014, 7:07:34 PM
to golan...@googlegroups.com
I don't know if JSON has the idea of a row, but looking at the data, if you wrote a bufio.Scanner custom split function that splits between { and }, you could feed each chunk to json.Unmarshal individually.

Kevin Malachowski

Dec 8, 2014, 7:09:17 PM
to golan...@googlegroups.com
This is a perfect use for channels and goroutines: http://play.golang.org/p/yUVYh8dejY

Kevin Malachowski

Dec 8, 2014, 7:12:27 PM
to golan...@googlegroups.com
Well, now that I think of it that may only help if your program is already designed to work with this sort of style.

What do you need to do with the messages you receive? Why not just use your first example and do whatever you have to do rather than fmt.Println'ing? If you need them to be asynchronous you could either use a work group or just spawn a new goroutine and just continue with the next loop iteration.

Derek Perkins

Dec 8, 2014, 7:16:42 PM
to golan...@googlegroups.com
I don't know if JSON has the idea of a row, but looking at the data, if you wrote a bufio.Scanner custom split function that splits between { and }, you could feed each chunk to json.Unmarshal individually.

Not a row exactly, but each nested object in the array at least, and order doesn't matter. It would have to be a smart scanner able to handle nested objects and braces inside string data. It seemed like a fairly common use case, so I figured someone else might have tackled it already. The standard json decoder already knows how to parse individual objects in order to support json.RawMessage, so maybe I could copy or hook into that.

Derek Perkins

Dec 8, 2014, 7:18:00 PM
to golan...@googlegroups.com
Kevin - It's not a matter of goroutines or asynchronous processing. It's a question of the incoming data being objects in an array (like the second example) vs completely separate and unrelated json objects like the first example.

Nate Brennand

Dec 8, 2014, 7:43:27 PM
to Derek Perkins, golan...@googlegroups.com
Hi Derek,

I wrote something to parse large JSON list files. I wasn't able to figure out how to generalize it at the time, because you need to pass it a typed channel, but you should be able to adapt it to your purpose easily:

Let me know if I can clarify anything,
Nate


On Mon, Dec 8, 2014 at 7:18 PM, Derek Perkins <de...@derekperkins.com> wrote:
Kevin - It's not a matter of goroutines or asynchronous processing. It's a question of the incoming data being objects in an array (like the second example) vs completely separate and unrelated json objects like the first example.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tamás Gulácsi

Dec 9, 2014, 12:21:36 AM
to golan...@googlegroups.com
Just skip the first [ and use the Decoder in a loop on the rest.

tejo...@gmail.com

Dec 9, 2014, 12:22:03 AM
to golan...@googlegroups.com

As of now, there is no provision in the Go standard library's encoding/json API for streaming codecs.


--Tejorupan

Matt Harden

Dec 9, 2014, 9:24:45 PM
to Tamás Gulácsi, golan...@googlegroups.com
That actually doesn't work. The Decoder rejects the commas separating the array elements. Also the Decoder reads ahead into its own buffer, so you can't just consume the commas yourself. There was a conversation on this exact subject earlier on this list, and it had some solutions.

On Mon Dec 08 2014 at 11:22:36 PM Tamás Gulácsi <tgula...@gmail.com> wrote:
Just skip the first [ and use the Decoder in a loop on the rest.


Matt Harden

Dec 9, 2014, 9:26:51 PM
to Tamás Gulácsi, golan...@googlegroups.com

nkatsaros

Dec 9, 2014, 11:02:40 PM
to golan...@googlegroups.com, tgula...@gmail.com
You can use io.MultiReader(decoder.Buffered(), r) to consume the commas yourself.

I just checked the thread you mentioned and that's pretty much what it suggests.


On Tuesday, December 9, 2014 9:24:45 PM UTC-5, Matt Harden wrote:
That actually doesn't work. The Decoder rejects the commas separating the array elements. Also the Decoder reads ahead into its own buffer, so you can't just consume the commas yourself. There was a conversation on this exact subject earlier on this list, and it had some solutions.

On Mon Dec 08 2014 at 11:22:36 PM Tamás Gulácsi <tgula...@gmail.com> wrote:
Just skip the first [ and use the Decoder in a loop on the rest.


Matt Harden

Dec 10, 2014, 9:36:28 AM
to nkatsaros, golan...@googlegroups.com, tgula...@gmail.com
Yes, but if you're not careful you can end up with a tall "tower" of MultiReaders containing other MultiReaders.

Another option I've considered is to insert a filter that "stops" at every delimiter (untested):

type commaReader struct{ *bufio.Reader }

func (r *commaReader) Read(p []byte) (int, error) {
    // Make sure at least one byte is buffered (or report EOF).
    if _, err := r.Peek(1); err != nil {
        return 0, err
    }
    // Inspect everything buffered and cap the read so it stops
    // just after the next ',' or ']'.
    buf, _ := r.Peek(r.Buffered())
    n := len(buf)
    if i := bytes.IndexAny(buf, ",]"); i >= 0 {
        n = i + 1 // include the delimiter itself
    }
    if len(p) > n {
        p = p[:n]
    }
    return r.Reader.Read(p)
}

In this case, after each successful decode, Decoder.Buffered() should only contain zero or more whitespace, followed by a comma or closing bracket.

Andrew Bursavich

Dec 10, 2014, 2:45:55 PM
to golan...@googlegroups.com
If you know none of your JSON strings will contain braces, it should be fairly trivial to write a reader that counts unclosed braces (increment on an open brace, decrement on a close brace) and replaces commas and brackets with newlines when the count is zero. If your JSON strings may contain braces, it gets a little more complicated.

Cheers,
Andy

Andrew Bursavich

Dec 10, 2014, 10:22:48 PM
to golan...@googlegroups.com
Here's a quick version of what I was suggesting. It should handle all valid JSON with the exception of certain Unicode strings, which might be good enough for your use case; proper rune decoding would have to handle multi-byte runes that span multiple reads.


Cheers,
Andy

Tommi Virtanen

Dec 11, 2014, 1:25:45 AM
to golan...@googlegroups.com
On Monday, December 8, 2014 3:37:34 PM UTC-8, Derek Perkins wrote:
I have large amounts of JSON arrays that I would like to parse one row at a time.

I wrote this a while ago: https://github.com/tv42/jsonarray

Andrew Bursavich

Dec 11, 2014, 2:04:38 AM
to golan...@googlegroups.com
I thought about it a little more and the code I posted will work fine with all UTF-8.

Matt Harden

Dec 11, 2014, 2:04:25 PM
to Andrew Bursavich, golan...@googlegroups.com
And it's very close to being a full JSON lexer. It would be nice if json.Decoder gave us access to the lexer tokens like xml.Decoder does, thus avoiding people writing their own lexers to work around the limitations of the stdlib package.

On Thu Dec 11 2014 at 1:04:49 AM Andrew Bursavich <aburs...@gmail.com> wrote:
I thought about it a little more and the code I posted will work fine with all UTF-8.

--