How to concurrently process a large JSON file / map. Facing a problem with concurrent recursive calls


Abhijit Kadam

Apr 24, 2014, 2:52:00 AM
to golan...@googlegroups.com
I have to parse and process a large JSON file, 60 MB or more. What is the fastest way to process it?
In the sample I am parsing a small JSON document from memory, while the real input will come from file I/O.


There is a function ProcessMap that calls itself when a value is a map. How can I invoke the recursive call concurrently when there is a map inside a map? I have used a WaitGroup, but it doesn't work and the function just exits.
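The original ProcessMap code is not reproduced in this thread, so the following is only a minimal sketch of one way to run the recursive call concurrently with sync.WaitGroup. The function name collectLeaves and the sample JSON are hypothetical. The two details that usually matter are calling wg.Add before starting each goroutine (not inside it) and calling wg.Wait before the caller returns.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"sync"
)

// collectLeaves (hypothetical name) walks a decoded JSON object and returns
// the dotted paths of all values that are not maps. Every nested map is
// walked in its own goroutine.
func collectLeaves(doc map[string]interface{}) []string {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex // guards leaves
		leaves []string
	)

	var walk func(prefix string, m map[string]interface{})
	walk = func(prefix string, m map[string]interface{}) {
		defer wg.Done()
		for k, v := range m {
			path := prefix + k
			if child, ok := v.(map[string]interface{}); ok {
				wg.Add(1) // Add before starting the goroutine, never inside it.
				go walk(path+".", child)
				continue
			}
			mu.Lock()
			leaves = append(leaves, path)
			mu.Unlock()
		}
	}

	wg.Add(1)
	go walk("", doc)
	wg.Wait() // Without this the caller returns before the goroutines finish.

	sort.Strings(leaves) // Goroutine scheduling order is not deterministic.
	return leaves
}

func main() {
	// Hypothetical sample input; the real file layout is not shown in the thread.
	input := []byte(`{"section1":{"details":{"F1":"a","F2":"b"}},"section2":{"count":2}}`)
	var doc map[string]interface{}
	if err := json.Unmarshal(input, &doc); err != nil {
		panic(err)
	}
	fmt.Println(collectLeaves(doc))
}
```

One goroutine per nested map is cheap to write but gives no control over the number of goroutines; for very wide documents a fixed pool of workers may be a better fit.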

C Banning

Apr 24, 2014, 6:44:48 AM
to golan...@googlegroups.com
http://play.golang.org/p/uHcn_w24Dx - but you aren't guaranteed sequential processing.

First, however, you need to handle the correct value types in the switch - see code behind http://godoc.org/github.com/clbanning/mxj#Map.StringIndent.

If your file is a collection of JSON strings, perhaps something like http://godoc.org/github.com/clbanning/mxj#HandleJsonReader would work, where you can hand each JSON string off to a goroutine.

Kyle Wolfe

Apr 24, 2014, 10:20:04 AM
to golan...@googlegroups.com
Is this file going to contain one large JSON document or many? My question on top of his would be: can you have the I/O spool each document in, rather than parsing after the whole file is in memory?

From this sample data, I'd think the flow would be:

Read file into memory (or spool?)
Read the headers
Go func to spool each data element into channel
Go create n workers to read from channel
wait on group

Abhijit Kadam

Apr 24, 2014, 10:49:43 AM
to golan...@googlegroups.com
It contains many huge documents; in the sample format, "section1" and "section2" can be thought of as documents. However, the format is not something I can change or decide. With the sample format as is, is it possible to spool or buffer it concurrently?

Kyle Wolfe

Apr 24, 2014, 11:17:52 AM
to golan...@googlegroups.com
I think this is a documented example of it against a string. http://golang.org/pkg/encoding/json/#example_Decoder

So I'd say: start main, fire off a goroutine that reads the stream into a channel, then create x workers to read from that channel. Beyond that, I'm not sure what you're doing with each document, so I can't help you make it more concurrent.

Henrik Johansson

Apr 24, 2014, 11:18:07 AM
to Abhijit Kadam, golang-nuts
What Kyle suggests would probably work just fine for your sections.

Feed sections one by one into the channel and have the channel readers (maybe 12? You'd have to try it, I guess) consume and process them as you need to. It could speed things up for you, but it really depends on many factors.





--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

C Banning

Apr 24, 2014, 11:25:30 AM
to golan...@googlegroups.com
Something like this might get you there as well: http://play.golang.org/p/Li8sOr7o7Q

Abhijit Kadam

Apr 24, 2014, 11:29:22 AM
to golan...@googlegroups.com, Abhijit Kadam
I knew that example; however, the objects in that document appear as a stream. In my case they are enclosed inside "{ ... }", which is one big object, not a stream of objects.
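For the single-enclosing-object case, one standard-library option is to decode the top level into a map of json.RawMessage values and then decode each section in its own goroutine. This is only a sketch: the function name fieldCounts and the sample data are hypothetical, and the first Unmarshal still scans the whole input, it just defers building the nested values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// fieldCounts (hypothetical name) splits one large top-level object into its
// sections without fully decoding them, then decodes every section
// concurrently. It returns the number of keys found in each section.
func fieldCounts(data []byte) (map[string]int, error) {
	// First pass: keep each section as raw bytes.
	var sections map[string]json.RawMessage
	if err := json.Unmarshal(data, &sections); err != nil {
		return nil, err
	}

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex // guards counts and firstErr
		counts   = make(map[string]int, len(sections))
		firstErr error
	)
	for name, raw := range sections {
		wg.Add(1)
		go func(name string, raw json.RawMessage) {
			defer wg.Done()
			// Second pass: the expensive decoding happens here, in parallel.
			var body map[string]interface{}
			err := json.Unmarshal(raw, &body)

			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			counts[name] = len(body)
		}(name, raw)
	}
	wg.Wait()
	return counts, firstErr
}

func main() {
	// Hypothetical sample shaped like the "section1"/"section2" description.
	input := []byte(`{"section1":{"F1":1,"F2":2},"section2":{"F1":3}}`)
	counts, err := fieldCounts(input)
	fmt.Println(counts, err)
}
```

With many sections, the goroutine-per-section loop could be swapped for a fixed number of workers reading section names from a channel.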

Abhijit Kadam

Apr 24, 2014, 11:33:45 AM
to golan...@googlegroups.com
Thanks, I will look into the mxj lib. I hope the files I have follow some convention like "section1", "section2".

C Banning

Apr 24, 2014, 11:53:31 AM
to golan...@googlegroups.com
I routinely use it to process large data sets containing multiple 8-10 MB documents (XML).  If the individual JSON objects are large and you know the key's location, ValuesForPath() can be much faster than ValuesForKey().

Abhijit Kadam

Apr 24, 2014, 11:57:07 AM
to C Banning, golang-nuts

OK. How do these functions work? Do they look up and return values from the already parsed file's map, or do they parse the file as we ask for values by key or path?

On Apr 24, 2014 9:21 PM, "C Banning" <clba...@gmail.com> wrote:
if you know the key's location ValuesForPath() can be much faster than ValuesForKey().


Abhijit Kadam

Apr 25, 2014, 11:20:51 AM
to golan...@googlegroups.com, C Banning
Charles,
I tried your XML lib with "ValuesForPath" and it is very convenient to use: ValuesForPath("section1.details.F1") gives the desired value. Great! However, I have a few things to discuss, if you do not mind.

I used mxj.NewMapJsonReader to read my 50 MB plus file. This function hung for a long time. So I used Go's ioutil.ReadFile to read the contents, then "mxj.NewMapJson" to parse them into a map, and then functions like ValuesForPath(). I have not had a chance to look further into NewMapJsonReader, as right now I am focusing on processing. The file format is '{ {sections...1},...{sections ...n} }', not '{section..1} {section..2}....{section..n}'.

Another thing: when using ValuesForPath on an array, is there a way to access a single array element?
data = m.ValuesForPath("section1.data.F1") returns an array, and then with data[0] I can get the desired value. However, it fetches the whole array, which will not always be efficient.
Something like data = m.ValuesForPath("section1.data[0]F1") would be useful.

Abhijit Kadam

Apr 25, 2014, 11:26:15 AM
to golan...@googlegroups.com, C Banning
Some clarification. In the second point of the post above, I mean that "section1.data" is an array and each element of the array has a field F1. However, what is returned is another array holding the F1 values from data[0] and data[1]. If that returned array is a copy, it may be more efficient to access the single element directly.

C Banning

Apr 28, 2014, 10:37:21 AM
to golan...@googlegroups.com, C Banning
Abhijit, thanks for your suggestion. It's now available: http://godoc.org/github.com/clbanning/mxj#Map.ValuesForPath

Kevin Gillette

Apr 28, 2014, 11:32:53 AM
to golan...@googlegroups.com
Your data would have to be very deeply nested to reach any recursion limits. If that does happen, you can design your system to simulate recursion using a trampoline. Essentially, this means having a for loop that calls functions which themselves return functions (often closures). This flattens the call stack.
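A minimal sketch of the trampoline idea applied to a decoded JSON value, with hypothetical names (step, trampoline, countLeaves): each step handles one pending value and returns the next step, and a plain for loop drives them, so nesting depth never grows the call stack.

```go
package main

import "fmt"

// step is one unit of work. It returns the next step to run, or nil when
// there is nothing left to do.
type step func() step

// trampoline runs steps in a loop, so the call stack stays flat no matter
// how many steps there are.
func trampoline(s step) {
	for s != nil {
		s = s()
	}
}

// countLeaves (hypothetical name) counts the values in a decoded JSON
// document that are neither objects nor arrays, without recursive calls.
func countLeaves(root interface{}) int {
	count := 0
	pending := []interface{}{root} // values still waiting to be visited

	var next step
	next = func() step {
		if len(pending) == 0 {
			return nil // done, the loop in trampoline stops
		}
		v := pending[len(pending)-1]
		pending = pending[:len(pending)-1]

		switch t := v.(type) {
		case map[string]interface{}:
			for _, child := range t {
				pending = append(pending, child)
			}
		case []interface{}:
			pending = append(pending, t...)
		default:
			count++
		}
		return next // hand control back to the loop instead of recursing
	}

	trampoline(next)
	return count
}

func main() {
	doc := map[string]interface{}{
		"section1": map[string]interface{}{"F1": "a", "F2": []interface{}{1.0, 2.0}},
		"section2": "b",
	}
	fmt.Println(countLeaves(doc))
}
```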