Chunking text files into Goroutines?

mk...@yahoo.com

Aug 31, 2013, 11:23:31 PM
to golan...@googlegroups.com
Coming to Go from Python, I have been approaching problems in a Pythonic way and trying to "translate" them into Go. Working with large text files in ETL scripts, I have used this approach with the multiprocessing module to split up a task: http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/ (not my blog).

Is there a module or common pattern I would use in Go to accomplish the same, where a large file would be divided up into chunks and passed to multiple goroutines for processing?

Kyle Lemons

Sep 1, 2013, 10:20:02 PM
to mk...@yahoo.com, golang-nuts
Remember the rules of optimization:
1. Don't optimize.
2. (experts only) Don't optimize yet.

If you're reading a file from magnetic storage, there's a nonzero probability that reading the data from the drive is going to be slower than the entirety of the processing you need to do on the file.

That being said, however, the idea is very natural with goroutines: http://play.golang.org/p/_9daZBbJXT



Konstantin Khomoutov

Sep 2, 2013, 12:39:45 PM
to Kyle Lemons, mk...@yahoo.com, golang-nuts
On Sun, 1 Sep 2013 19:20:02 -0700
Kyle Lemons <kev...@google.com> wrote:

> Remember the rules of optimization:
> 1. Don't optimize.
> 2. (experts only) Don't optimize yet.
>
> If you're reading a file from magnetic storage, there's a nonzero
> probability that reading the data from the drive is going to be
> slower than the entirety of the processing you need to do on the file.
>
> That being said, however, the idea is very natural with goroutines:
> http://play.golang.org/p/_9daZBbJXT

[...]

I'd note that the prospective process() function missing in the example
should call wg.Done() after it completes its task -- preferably using a
deferred call placed somewhere at the beginning, like this:

func process(wg *sync.WaitGroup, lines []string) {
	// Take the WaitGroup by pointer: Done on a copy never reaches the caller's Wait.
	defer wg.Done()
	for _, line := range lines {
		_ = line // do something with the line...
	}
}

Kyle Lemons

Sep 2, 2013, 12:48:52 PM
to Konstantin Khomoutov, mksql, golang-nuts
I included the process function in my example :).

Matt K

Sep 2, 2013, 6:14:43 PM
to golan...@googlegroups.com, mk...@yahoo.com
> Remember the rules of optimization:

This is a good point. In the ETL process in question, I was hitting performance limits in the Python interpreter, and going to multiple processes resulted in a significant speed-up. However, I have not yet fully tested a single-process Go version, which may not have the same bottlenecks; in that case the choke point would be the storage.