Reading a file concurrently

Boopathi Rajaa

Jun 14, 2013, 4:09:21 AM
to golan...@googlegroups.com
What is the best method to read a file concurrently?

I set a buffer size (say BUF = 1024 lines), and each goroutine reads BUF lines, parses each line, and appends to the report.

Use case: parsing log files (Apache, MySQL).

Let's say the log file contains N lines. 

I saw that there is a File.Seek function that moves the file pointer, so the next ReadLine() reads from that position.
So should I open the file (N / BUF) times, once per goroutine? (Since there will be N / BUF goroutines.)

Or is there a simpler way to do it ?

Péter Szilágyi

Jun 14, 2013, 4:11:55 AM
to Boopathi Rajaa, golang-nuts
Hi,

  If you open the file a load of times and start seeking, it will kill performance. Just open it once, have one goroutine read all the data, and every N lines fire up a processor for them. I think this is the simplest and fastest solution.
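
In rough code, the idea is something like this (an untested sketch; the file name, the pool of 4 workers and processBatch are placeholders for your own parsing):

package main

import (
	"bufio"
	"log"
	"os"
	"sync"
)

const batchSize = 1024 // lines per batch; tune for your workload

// processBatch stands in for the actual apache/mysql log parsing.
func processBatch(lines []string) {
	for _, line := range lines {
		_ = line // parse the line, append to the report, etc.
	}
}

func main() {
	f, err := os.Open("access.log") // open the file exactly once
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	batches := make(chan []string)
	var wg sync.WaitGroup

	// A fixed pool of processors consuming batches of lines.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range batches {
				processBatch(b)
			}
		}()
	}

	// A single sequential reader fires off a batch every batchSize lines.
	scanner := bufio.NewScanner(f)
	batch := make([]string, 0, batchSize)
	for scanner.Scan() {
		batch = append(batch, scanner.Text())
		if len(batch) == batchSize {
			batches <- batch
			batch = make([]string, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		batches <- batch
	}
	close(batches)
	wg.Wait()

	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}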

Cheers,
  Peter



Péter Szilágyi

Jun 14, 2013, 5:02:25 AM
to andrey mirtchovski, Boopathi Rajaa, golang-nuts
Yes, but that has mostly to do with networking issues, imho. If you want to read a single file locally and process it, a single sequential read should always outperform parallel reads: the storage device has to read the same amount of data either way, but jumping around adds a much higher per-operation cost. Of course, if the I/O throughput is higher than what your process can handle, parallel reads can be beneficial.

So all in all, I think we can agree that the "optimal" solution is scenario specific :)


On Fri, Jun 14, 2013 at 11:21 AM, andrey mirtchovski <mirtc...@gmail.com> wrote:
reading a file from multiple goroutines will not necessarily kill
performance, especially when the file is served over a high latency
connection. the cp code below, which does exactly what you suggest,
may not be idiomatic go (it's a translation from a similar C program
using Plan9's go-like libthread), but it definitely outperforms the
run-of-the-mill "cp" for very large or very far away files:

go get github.com/rminnich/u-root/cp
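
(The trick boils down to roughly the following sketch; this is not the actual u-root code, and the file name and 1 MiB chunk size are made up. It relies on *os.File satisfying io.ReaderAt, which permits parallel ReadAt calls on the same file:)

package main

import (
	"io"
	"log"
	"os"
	"sync"
)

const chunkSize = 1 << 20 // 1 MiB per chunk; an arbitrary number, tune it

func main() {
	src, err := os.Open("big.file")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	fi, err := src.Stat()
	if err != nil {
		log.Fatal(err)
	}
	size := fi.Size()
	buf := make([]byte, size) // assumes the file fits in memory

	// Each goroutine fetches its own chunk at its own offset,
	// with no shared seek position to coordinate.
	var wg sync.WaitGroup
	for off := int64(0); off < size; off += chunkSize {
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			end := off + chunkSize
			if end > size {
				end = size
			}
			if _, err := src.ReadAt(buf[off:end], off); err != nil && err != io.EOF {
				log.Fatal(err)
			}
		}(off)
	}
	wg.Wait()
	// buf now holds the whole file; hand it to the parser, write it out, etc.
}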

Kyle Lemons

Jun 14, 2013, 12:59:13 PM
to Péter Szilágyi, Boopathi Rajaa, golang-nuts
On Fri, Jun 14, 2013 at 1:11 AM, Péter Szilágyi <pet...@gmail.com> wrote:
Hi,

  If you open the file a load of times and start seeking, it will kill performance. Just open it once, have one goroutine read all the data, and every N lines fire up a processor for them. I think this is the simplest and fastest solution.

Does mmapping the file help with this sort of thing? I have to admit, I've only ever done single-threaded use of an mmapped file, so I can't really speculate.

James Bardin

Jun 14, 2013, 1:58:16 PM
to golan...@googlegroups.com, Péter Szilágyi, Boopathi Rajaa


On Friday, June 14, 2013 12:59:13 PM UTC-4, Kyle Lemons wrote:

Does mmapping the file help with this sort of thing? I have to admit, I've only ever done single-threaded use of an mmapped file, so I can't really speculate.
 

It's the random I/O that really kills performance, mostly on spinning disks, and an mmapped file is still going to read its blocks in on demand. In the end, though, you're really throttled by the speed at which you can read the file, so even if random I/O were as fast as sequential, there would be nothing to gain from the added overhead.
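
For what it's worth, multi-goroutine access to a mapping would look roughly like this. A Unix-only sketch using syscall.Mmap; the file name is made up, and splitting the buffer in half can cut a line at the boundary, which a real parser would have to handle:

package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"sync"
	"syscall"
)

func main() {
	f, err := os.Open("access.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Map the whole file read-only; the kernel faults pages in on demand.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Munmap(data)

	// Two goroutines scan disjoint halves of the mapping concurrently.
	var wg sync.WaitGroup
	mid := len(data) / 2
	for _, part := range [][]byte{data[:mid], data[mid:]} {
		wg.Add(1)
		go func(p []byte) {
			defer wg.Done()
			fmt.Println(bytes.Count(p, []byte{'\n'}), "newlines")
		}(part)
	}
	wg.Wait()
}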

