Why is buffer size not always an integer multiple of 4096 when reading file line by line?


mingLiu

Jul 6, 2014, 8:57:25 AM
to golan...@googlegroups.com
Hi everyone,
I'm a newbie Go programmer. Please correct me if I get something wrong.
The sample code is,
    // test.go
    package main

    import (
        "bufio"
        "os"
    )

    func main() {
        if len(os.Args) != 2 {
            println("Usage:", os.Args[0], "")
            os.Exit(1)
        }
        fileName := os.Args[1]
        fp, err := os.Open(fileName)
        if err != nil {
            println(err.Error())
            os.Exit(2)
        }
        defer fp.Close()

        // Read the file line by line, keeping every line in memory.
        r := bufio.NewScanner(fp)
        var lines []string
        for r.Scan() {
            lines = append(lines, r.Text())
        }
    }
c:\>go build test.go

c:\>test.exe test.txt

Then I monitored its process using process monitor when executing it, part of the output is:
    test.exe  ReadFile  SUCCESS     Offset: 4,692,375, Length: 8,056
    test.exe  ReadFile  SUCCESS     Offset: 4,700,431, Length: 7,198
    test.exe  ReadFile  SUCCESS     Offset: 4,707,629, Length: 8,134
    test.exe  ReadFile  SUCCESS     Offset: 4,715,763, Length: 7,361
    test.exe  ReadFile  SUCCESS     Offset: 4,723,124, Length: 8,056
    test.exe  ReadFile  SUCCESS     Offset: 4,731,180, Length: 4,322
    test.exe  ReadFile  END OF FILE  Offset: 4,735,502, Length: 8,192

The equivalent java code is,
    // Test.java
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class Test {
        public static void main(String[] args) {
            try {
                FileInputStream in = new FileInputStream("test.txt");
                BufferedReader br = new BufferedReader(new InputStreamReader(in));
                String strLine;
                // Read every line and discard it.
                while ((strLine = br.readLine()) != null) {
                    ;
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
c:\>javac Test.java
c:\>java Test

Then part of the monitoring output is:
    java.exe  ReadFile  SUCCESS      Offset: 4,694,016, Length: 8,192
    java.exe  ReadFile  SUCCESS      Offset: 4,702,208, Length: 8,192
    java.exe  ReadFile  SUCCESS      Offset: 4,710,400, Length: 8,192
    java.exe  ReadFile  SUCCESS      Offset: 4,718,592, Length: 8,192
    java.exe  ReadFile  SUCCESS      Offset: 4,726,784, Length: 8,192
    java.exe  ReadFile  SUCCESS      Offset: 4,734,976, Length: 526
    java.exe  ReadFile  END OF FILE  Offset: 4,735,502, Length: 8,192

As you can see, the buffer size in Java is 8192 and it reads 8192 bytes each time. Why does the Length in Go change from one read to the next?
I have tried bufio.Reader's ReadString('\n') and ReadBytes('\n'), and both of them have the same problem.

Benjamin Measures

Jul 6, 2014, 8:47:35 PM
to golan...@googlegroups.com
On Sunday, 6 July 2014 13:57:25 UTC+1, mingLiu wrote:
As you can see, the buffer size in Java is 8192 and it reads 8192 bytes each time. Why does the Length in Go change from one read to the next?

Java's BufferedReader reads (up to buffer size) when it's empty, whilst Go's bufio reads (up to max buffer size) when it needs more.

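Where a line spans reads, Java's BufferedReader copies the partial line out (into temporary storage) before refilling its buffer, whilst Go's bufio keeps the partial line in the buffer and reads more into the space that is left. Go's bufio does very little (mostly zero) allocation and doesn't suffer from unbounded lines.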
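To make the point above concrete, here is a minimal sketch that records the size of each read bufio.Scanner issues against its underlying reader. The names loggingReader, scanReadSizes, and sampleInput are hypothetical helpers made up for this illustration, and the in-memory input is only a stand-in for test.txt; the exact sizes depend on the Scanner's internal buffer handling in your Go version.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// loggingReader (hypothetical helper) wraps an io.Reader and records
// how many bytes each underlying Read call returned.
type loggingReader struct {
	r     io.Reader
	sizes []int
}

func (l *loggingReader) Read(p []byte) (int, error) {
	n, err := l.r.Read(p)
	if n > 0 {
		l.sizes = append(l.sizes, n)
	}
	return n, err
}

// sampleInput builds 1000 lines of 100 bytes each (99 'x' plus a newline).
// This is an assumed stand-in for the test.txt used in the thread.
func sampleInput() string {
	return strings.Repeat(strings.Repeat("x", 99)+"\n", 1000)
}

// scanReadSizes scans input line by line with bufio.Scanner and returns
// the size of every underlying read together with the number of lines.
func scanReadSizes(input string) (sizes []int, lines int, err error) {
	lr := &loggingReader{r: strings.NewReader(input)}
	sc := bufio.NewScanner(lr)
	for sc.Scan() {
		lines++
	}
	return lr.sizes, lines, sc.Err()
}

func main() {
	sizes, lines, err := scanReadSizes(sampleInput())
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	fmt.Println("lines:", lines)
	fmt.Println("read sizes:", sizes)
}
```

Because the buffer usually still holds the tail of an incomplete line when the Scanner asks for more data, later reads should request only the remaining free space, so their sizes are generally not multiples of 4096.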

I have tried bufio.Reader's ReadString('\n') and ReadBytes('\n'), and both of them have the same problem.

Why is this a problem?

ming...@gmail.com

Jul 6, 2014, 10:56:09 PM
to golan...@googlegroups.com
 Thanks for your explanation.

On Monday, 7 July 2014 at 8:47:35 AM UTC+8, Benjamin Measures wrote:
I posted the question to Stack Overflow yesterday and received many good replies. FYI: https://stackoverflow.com/questions/24597157/why-isnt-buffer-size-always-an-integer-multiple-of-4096-when-reading-file-line
My concern is performance. The system page size is 4096, so maybe reading a multiple of 4096 bytes each time would give better performance, right?

Benjamin Measures

Jul 7, 2014, 7:59:57 PM
to golan...@googlegroups.com
On Monday, 7 July 2014 03:56:09 UTC+1, Ming Liu wrote:
My concern is performance. The system page size is 4096, so maybe reading a multiple of 4096 bytes each time would give better performance, right?

No, since (at least, Intel) processors can't copy memory in units of 4KB.

Besides, any memcpy that fits in L2 cache would be measured in the tens of GB/s. Since you're reading files, this all sounds like premature optimisation to me. Have you tried benchmarking and found something lacking?
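One rough way to try the benchmarking suggested above is sketched below. It is only a sketch under stated assumptions: writeSample and countLines are hypothetical helpers, the generated temporary file stands in for test.txt, and the buffer sizes 4096, 5000, and 8192 are arbitrary picks. A single timed run is noisy, so treat the numbers as a first look rather than a result; Go's testing package benchmarks would be the more careful tool.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// writeSample (hypothetical helper) writes n lines of 100 bytes each to a
// temporary file and returns its path plus a function that deletes it.
func writeSample(n int) (string, func(), error) {
	f, err := os.CreateTemp("", "lines-*.txt")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.Remove(f.Name()) }
	w := bufio.NewWriter(f)
	line := strings.Repeat("x", 99) + "\n"
	for i := 0; i < n; i++ {
		if _, err := w.WriteString(line); err != nil {
			f.Close()
			cleanup()
			return "", nil, err
		}
	}
	if err := w.Flush(); err != nil {
		f.Close()
		cleanup()
		return "", nil, err
	}
	if err := f.Close(); err != nil {
		cleanup()
		return "", nil, err
	}
	return f.Name(), cleanup, nil
}

// countLines scans the file line by line with a Scanner whose buffer is
// fixed at bufSize bytes, and returns the number of lines seen.
func countLines(path string, bufSize int) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, bufSize), bufSize)
	n := 0
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

func main() {
	path, cleanup, err := writeSample(200000)
	if err != nil {
		fmt.Println("setup error:", err)
		return
	}
	defer cleanup()
	// Compare page-multiple buffer sizes with one that is not a multiple.
	for _, size := range []int{4096, 5000, 8192} {
		start := time.Now()
		n, err := countLines(path, size)
		if err != nil {
			fmt.Println("scan error:", err)
			return
		}
		fmt.Printf("buffer %5d bytes: %d lines in %v\n", size, n, time.Since(start))
	}
}
```

Note that this varies only the buffer size. The individual reads still start wherever the previous partial line ended, so it measures whether the buffer size matters in practice, not the effect of page-aligned offsets.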