Re: [go-nuts] Which is the most efficient way to read STDIN lines 100s of MB long and tens of MB info is passed to it


Jan Mercl

unread,
May 7, 2022, 4:46:33 PM5/7/22
to Constantine Vassilev, golang-nuts
On Sat, May 7, 2022 at 10:24 PM Constantine Vassilev <ths...@gmail.com> wrote:

> I need to write a program that reads STDIN and should output every line that contains a search word "test" to STDOUT.

Piping the data through grep(1) would be my first option.

Amnon

unread,
May 7, 2022, 5:49:08 PM5/7/22
to golang-nuts
How about something like 

func grep(pat []byte, r io.Reader, w io.Writer) error {
    scanner := bufio.NewScanner(r)
    for scanner.Scan() {
        if bytes.Contains(scanner.Bytes(), pat) {
            w.Write(scanner.Bytes())
            w.Write([]byte("\n")) // Scan strips the trailing newline, so add it back
        }
    }

    return scanner.Err()
}

and for extra speed, just allocate a bigger buffer to the scanner...
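
Concretely, something like this (a sketch; the helper name and the 1 MiB / 1 GiB sizes are mine, chosen for illustration):

```
func grepBig(pat []byte, r io.Reader, w io.Writer) error {
	scanner := bufio.NewScanner(r)
	// Start with a 1 MiB buffer and allow lines up to 1 GiB before
	// Scan fails with bufio.ErrTooLong.
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<30)
	for scanner.Scan() {
		if bytes.Contains(scanner.Bytes(), pat) {
			w.Write(scanner.Bytes())
			w.Write([]byte("\n"))
		}
	}
	return scanner.Err()
}
```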

Amnon

unread,
May 7, 2022, 5:53:54 PM5/7/22
to golang-nuts
p.s. If you changed the above code to use strings rather than []byte 
it would run many times slower due to the cost of allocation.

Amnon

unread,
May 7, 2022, 6:18:51 PM5/7/22
to golang-nuts
The other interesting question is what algorithm we use to find the pattern in each line.
Generally bytes.Contains uses Rabin-Karp. But as the pattern is the word "test" which is only 4 bytes long,
a brute force search is used, using SSE type instructions where available.
So the naive Go approach will give you a very fast execution. The main thing is to set up your scanner with a large buffer, to minimize the number
of file system reads, and to avoid the newbie error of working with strings rather than []byte, and forcing the code to do vast numbers of 
unnecessary and expensive allocations.

Dan Kortschak

unread,
May 7, 2022, 6:31:31 PM5/7/22
to golan...@googlegroups.com
On Sat, 2022-05-07 at 15:18 -0700, Amnon wrote:
> The other interesting question is what algorithm we use to find the
> pattern in each line.
> Generally bytes.Contains uses Rabin-Karp. But as the pattern is the
> word "test" which is only 4 bytes long,
> a brute force search is used, using SSE type instructions where
> available.
> So the naive Go approach will give you a very fast execution. The
> main thing is to set up your scanner with a large buffer, to minimize
> the number
> of file system reads, and to avoid the newbie error of working with
> strings rather than []byte, and forcing the code to do vast numbers
> of
> unnecessary and expensive allocations.

There is an interesting post from the author of grep that goes over
some of these details:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html


Const V

unread,
May 7, 2022, 7:03:09 PM5/7/22
to golang-nuts
Now the next question is whether I have to handle runes.

Const V

unread,
May 7, 2022, 7:15:01 PM5/7/22
to golang-nuts
Here is what I came up with:

func TestGrep1(t *testing.T) {
	cmd := exec.Command("./read.bash")
	fmt.Printf("%v\n", cmd)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		t.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}
	fmt.Printf("%v\n", stdout)
	find := []byte("bufio")
	grep(find, stdout, os.Stdout)
	actual := ""
	expected := ""
	if actual != expected {
		t.Fatalf("Expected: %q, actual: %q\n", expected, actual)
	}
}

////////////
read.bash
/////////////
#!/usr/bin/env bash
echo cat Grep1.go
cat Grep1.go


Const V

unread,
May 7, 2022, 7:16:04 PM5/7/22
to golang-nuts
The question is will scanner.Scan handle a line of 100s MB?

Dan Kortschak

unread,
May 7, 2022, 7:43:59 PM5/7/22
to golan...@googlegroups.com
On Sat, 2022-05-07 at 16:16 -0700, Const V wrote:
> The question is will scanner.Scan handle a line of 100s MB?

No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer).
But at that point you want to start questioning why you're doing what
you're doing.

Your invocation of grep can be a lot simpler

```
func grep(match string, r io.Reader, stdout, stderr io.Writer) error {
	cmd := exec.Command("fgrep", match)
	cmd.Stdin = r
	cmd.Stdout = stdout
	cmd.Stderr = stderr
	return cmd.Run()
}
```

You can then do whatever you want with the buffered output and stderr.
You can also make this use a pipe with minor changes if you expect the
output to be long and that stream processing of that would be sensible
(use cmd.Start instead of Run and look into StdoutPipe).
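
A rough sketch of that pipe variant (the function name and error handling are mine, not from the thread):

```
func grepPipe(match string, r io.Reader, stderr io.Writer) (io.ReadCloser, *exec.Cmd, error) {
	cmd := exec.Command("fgrep", match)
	cmd.Stdin = r
	cmd.Stderr = stderr
	out, err := cmd.StdoutPipe()
	if err != nil {
		return nil, nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, nil, err
	}
	// The caller streams from out, then calls cmd.Wait() once it has read everything.
	return out, cmd, nil
}
```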


Const V

unread,
May 7, 2022, 7:55:28 PM5/7/22
to golang-nuts
I cannot use fgrep in this environment.

Looks like the Scan buffer is resizable:
https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190
// Is the buffer full? If so, resize.
if s.end == len(s.buf) {
	// Guarantee no overflow in the multiplication below.
	const maxInt = int(^uint(0) >> 1)
	if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
		s.setErr(ErrTooLong)
		return false
	}
	newSize := len(s.buf) * 2
	if newSize == 0 {
		newSize = startBufSize
	}
	if newSize > s.maxTokenSize {
		newSize = s.maxTokenSize
	}
	newBuf := make([]byte, newSize)
	copy(newBuf, s.buf[s.start:s.end])
	s.buf = newBuf
	s.end -= s.start
	s.start = 0
}

Amnon

unread,
May 8, 2022, 1:25:29 AM5/8/22
to golang-nuts
So you raise a couple of questions:

1) How about handling runes?
The nice thing about UTF-8 is that you don't have to care. If you are searching for the ASCII word "test", you can
simply compare byte by byte - the letter 't' is represented by 0x74, and that byte in the search buffer can
only represent 't' - it can never be part of a multi-byte rune, not in UTF-8.
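
In code terms (a trivial sketch of the point, not from the thread): bytes below 0x80 never appear inside a multi-byte UTF-8 sequence, so a plain byte search is enough and no rune decoding is needed.

```
func containsTest(line []byte) bool {
	// Safe even if line contains multi-byte runes: the byte 't' (0x74)
	// can never be a continuation byte of another rune in UTF-8.
	return bytes.Contains(line, []byte("test"))
}
```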

2) Can we handle lines of hundreds of MB?
Not really. The default limits MaxTokenSize to 64K.
If you set your own buffer (recommended) then the size of this buffer will be your maximum line length.
Unfortunately you can't really solve this problem without slurping the entire file into memory, because
the whole file could consist of a single line, with the word "test" at the very end - and at that point you would
still need access to the beginning of the file in order to output it.
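
If Scanner's fixed cap is the worry, one alternative (my sketch, not from the thread) is bufio.Reader.ReadBytes, which grows the returned slice as needed; the cost is still holding each whole line in memory:

```
func grepReadBytes(pat []byte, r io.Reader, w io.Writer) error {
	br := bufio.NewReaderSize(r, 1<<20)
	for {
		// ReadBytes keeps growing the returned slice until it sees '\n',
		// so there is no fixed maximum line length.
		line, err := br.ReadBytes('\n')
		if len(line) > 0 && bytes.Contains(line, pat) {
			if _, werr := w.Write(line); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```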

3) The Mike Haertel message (from 2010) is interesting, as is the reference to the paper
"Fast String Searching" by Andrew Hume and Daniel Sunday, in the November 1991 issue of Software Practice & Experience.
All this is quite old. Computer architecture has changed significantly since then: CPUs have got a lot faster and added vector
instructions which speed up dumb algorithms a lot. The relative cost of accessing memory has massively increased, as memory
speeds have not increased at anything like the rate of CPU speeds (though the Apple M1 family of systems on chips has changed this),
so memory -> CPU is often the bottleneck, and optimising CPU performance through clever algorithms often merely results in
the CPU being stuck in stall cycles waiting for cache misses to be serviced.
That said, some of the points Mike makes are still valid. One is the massive cost of breaking up lines, which slows down performance
by an order of magnitude by forcing you to examine every byte. So searching for the search text, and then, when you find it, searching backwards
from that point to find the enclosing line, is a big win (sketched below).
The other point he makes which is still valid is to consider the cost of copying data.
This will probably be the dominant cost. Running `cat > /dev/null` will tell you how long it takes merely to read the data, and this will give
you a baseline figure. A reasonably carefully written grep clone should achieve roughly the same runtime.
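
A minimal sketch of that "search first, find the line boundaries afterwards" idea, assuming the data (or at least a large chunk of it) is already in buf; the function name is mine:

```
func matchingLine(buf, pat []byte) ([]byte, bool) {
	i := bytes.Index(buf, pat)
	if i < 0 {
		return nil, false
	}
	// Walk back to the previous newline and forward to the next one to
	// recover the enclosing line, instead of splitting every line up front.
	start := bytes.LastIndexByte(buf[:i], '\n') + 1 // 0 when there is no earlier newline
	end := bytes.IndexByte(buf[i:], '\n')
	if end < 0 {
		end = len(buf)
	} else {
		end += i
	}
	return buf[start:end], true
}
```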

As you specified the problem as reading from stdin, using memory mapped files, using faster filesystems and fast SSDs will not help you.

And, as always in Go, allocations trump almost everything else. If you end up converting each line into a string, forcing allocations and GC work,
that will knock orders of magnitude off your performance. 

Disclaimer: I have not actually written any code or run any benchmarks here. But there are some interesting questions which could make an
interesting paper. To what extent has Hume and Sunday's 1991 paper been invalidated by hardware changes? Are smart algorithms which
avoid looking at every byte still a win, or do they actually slow down performance by adding branch mispredictions and preventing vectorisation
optimisations? To what extent does the Apple/Arm system-on-chip family, with its much lower memory latency, change these considerations?
How much slower will a simple Go clone of grep be than the C version, which has had decades of optimisation?
And how much effort would it take to optimise the Go clone to reach the performance of the C version?
If anyone wants to write the code, do the benchmarks, and publish the conclusions, then send me a link. It will be an interesting read.

Barnim Dzwillo

unread,
May 8, 2022, 4:24:40 PM5/8/22
to golang-nuts
I had a similar use case in the past and got the best performance when using ReadSlice() instead of scanner.Scan().
See sample code here: https://go.dev/play/p/EfvadCURcXt
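
For reference, a rough sketch of a ReadSlice-based loop - my own illustration of the approach, not the code behind the playground link:

```
func grepReadSlice(pat []byte, r io.Reader, w io.Writer) error {
	br := bufio.NewReaderSize(r, 1<<20)
	var line []byte // accumulates the current line when it is longer than the buffer
	for {
		chunk, err := br.ReadSlice('\n')
		line = append(line, chunk...) // ReadSlice's slice is only valid until the next read, so copy
		if err == bufio.ErrBufferFull {
			continue // still in the middle of a long line; keep accumulating
		}
		if len(line) > 0 && bytes.Contains(line, pat) {
			if _, werr := w.Write(line); werr != nil {
				return werr
			}
		}
		line = line[:0]
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```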

Const V

unread,
May 8, 2022, 4:26:34 PM5/8/22
to golang-nuts
Using r.ReadLine() I can successfully read a 100 MB line, using the following loop, which
keeps appending to the buffer until '\n' is encountered:
line, isPrefix, err := r.ReadLine() // first read; isPrefix is true while the line is longer than the reader's buffer
ln := append([]byte(nil), line...)
for isPrefix && err == nil {
	line, isPrefix, err = r.ReadLine()
	ln = append(ln, line...)
}

Now the last problem is how to search for a substring in a 100MB string.

Const V

unread,
May 8, 2022, 4:27:52 PM5/8/22
to golang-nuts
pre-allocating a buffer is not an option, it should be dynamic

Robert Engels

unread,
May 8, 2022, 5:31:49 PM5/8/22
to Const V, golang-nuts
Way over-complicating this. Use a buffered reader. Keep track of the position where the last newline was seen. It is a trivial state machine: find 'test', continue to the next newline, then seek to the stored last-newline position and do a buffered read-and-write to stdout until the next newline.


On May 8, 2022, at 3:28 PM, Const V <ths...@gmail.com> wrote:

pre-allocating a buffer is not an option, it should be dynamic

Const V

unread,
May 8, 2022, 5:40:47 PM5/8/22
to golang-nuts
write to stdout is not working for MB long strings

Amnon BC

unread,
May 8, 2022, 6:21:52 PM5/8/22
to Const V, golang-nuts
On Sun, May 8, 2022 at 10:41 PM Const V <ths...@gmail.com> wrote:
write to stdout is not working for MB long strings


That is very surprising indeed.
How did you reach that conclusion?
How can we replicate that failure?

Const V

unread,
May 8, 2022, 6:36:36 PM5/8/22
to golang-nuts
reading a 1-line, '\n'-delimited 100MB file:
r1 := bufio.NewReader(file)
s := ReadWithReadLine(r1)
InputProcessing(strings.NewReader(s), os.Stdout)

in InputProcessing:
w.Write([]byte(s)) -> waiting forever
w.Write([]byte(s)[:100]) -> working
---
func InputProcessing(r io.Reader, w io.Writer) {
	find := "error"
	/////////////////////////////////
	s := ReadWithReadLine(r)
	if strings.Contains(s, find) {
		w.Write([]byte(s)[:100])
		w.Write([]byte("\n"))
	} else {
		w.Write([]byte(" \n"))
	}
}
---

Bakul Shah

unread,
May 8, 2022, 9:43:02 PM5/8/22
to Constantine Vassilev, golang-nuts
On May 7, 2022, at 1:24 PM, Constantine Vassilev <ths...@gmail.com> wrote:
>
> I need to write a program that reads STDIN and should output every line that contains a search word "test" to STDOUT.
>
> How I can test that considering the problem is a line can be 100s of MB long (\n is line end) and tens of MB info is passed to it.

What do you mean by "tens of MB info"? Do you mean the search
string can be tens of MB long or something else? Is the search
string really "test" or can it be anything? Can it be a regular
expression? Describing a problem clearly and accurately is half
the battle.

Anyway, think about what you would have to do if lines were so long
they *didn't* fit in memory! [Alternatively think what you would
have to do on a PDP11!]

Amnon BC

unread,
May 8, 2022, 11:33:09 PM5/8/22
to Const V, golang-nuts

Why don't you try redirecting stdout to /dev/null and see how your program behaves.

Also, which OS are you using?

--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/IUMIxWx9aLk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/b1259910-58ca-4a57-bb05-1dde2fc73443n%40googlegroups.com.

Const V

unread,
May 8, 2022, 11:59:37 PM5/8/22
to golang-nuts
I'm using OSX. 

The only reason I need to redirect is to catch the STDOUT output in a string for testing.
It seems the Pipe has limited capacity. Maybe there is another way.

Amnon

unread,
May 9, 2022, 12:05:51 AM5/9/22
to golang-nuts
So what happens when you run your program > /dev/null?

For testing I would write a function that reads from an io.Reader and writes to an io.Writer.
Write a unit test which uses a bytes.Buffer to catch the output.
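
Something along these lines, using the grep function from earlier in the thread (which writes a newline after each matching line); the test input here is made up:

```
func TestGrep(t *testing.T) {
	in := strings.NewReader("no match here\nthis line has test in it\nanother line\n")
	var out bytes.Buffer
	if err := grep([]byte("test"), in, &out); err != nil {
		t.Fatal(err)
	}
	want := "this line has test in it\n"
	if got := out.String(); got != want {
		t.Fatalf("got %q, want %q", got, want)
	}
}
```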

Const V

unread,
May 9, 2022, 12:36:34 AM5/9/22
to golang-nuts
The program is working fine with STDOUT and there is no need for program > /dev/null.

I was using the Pipe for the unit tests with _test.go to catch the STDOUT.
How would you write the function you mention?

Const V

unread,
Jun 9, 2022, 2:37:29 AM6/9/22
to golang-nuts
I did implementations and profiling of the following:

BigGrepStrFnd - Boyer-Moore (grep)

BigGrepBytes - Rabin-Karp

BigGrepStr - Rabin-Karp

BigGrepScan - search with sliding window

Additionally implemented them using concurrency.

Tested on 100 files, each containing one 100MB line, searching for a 20-character string at the end of the line as the worst-case scenario.

The program reads each file and loads the 100MB line into memory for analysis. With 100 files I'm getting consistent results.

Here are the results. I also needed to know the memory requirements.

Created two profiles:

go test -run=NONE -bench=cpuprofile

go test -run=NONE -bench=memprofile

Then analyzed with go tool pprof

Computer Mac-mini M1 8 cores, 16 GB

Concurrent versions are performing the search in parallel on 8 slices of the 100MB line.

Fastest:

3.04s - 60.47GB - BigGrepBytes_Concurrent, ~2.85 times faster than the non-concurrent version

8.55s - 60.48GB - BigGrepBytes

Of the non-concurrent versions the fastest is BigGrepStrFnd. Note it has bigger memory requirements.

5.44s - 72.96GB - BigGrepStrFnd - Boyer-Moore (grep), ~1.63 times faster than BigGrepBytes

3.82s - 72.93GB - BigGrepStrFndConcurrent

----

go test -run=NONE -bench=cpuprofile

goos: darwin
goarch: arm64
pkg: mobiledatabooks.com/biggrep
Benchmark_100MB_End__20charsBigGrepBytes_Concurrent-8              1    7653778209 ns/op
Benchmark_100MB_End__20charsBigGrepBytes_-8                        1    12326358083 ns/op
Benchmark_100MB_End__20charsBigGrepScan__-8                        1    14934611708 ns/op
Benchmark_100MB_End__20charsBigGrepStr___-8                        1    13178685417 ns/op
Benchmark_100MB_End__20charsBigGrepStr___Concurrent-8              1    8302944958 ns/op
Benchmark_100MB_End__20charsBigGrepStrFnd-8                        1    8765527416 ns/op
Benchmark_100MB_End__20charsBigGrepStrFndConcurrent-8              1    7361004166 ns/op
Benchmark_100MB_End__20charsBigGrepScan__Concurrent-8              1    8413663834 ns/op
PASS
ok     mobiledatabooks.com/biggrep    81.204s

----

Type: cpu
Time: Jun 8, 2022 at 10:15pm (PDT)
Duration: 81.11s, Total samples = 89.36s (110.17%)
Active filters:
   focus=Benchmark
   hide=benchm
   show=Benchmark
Showing nodes accounting for 47.88s, 53.58% of 89.36s total
      flat  flat%   sum%        cum   cum%
    10.37s 11.60% 11.60%     10.37s 11.60%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepScan__
     8.99s 10.06% 21.67%      8.99s 10.06%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStr___
     8.55s  9.57% 31.23%      8.55s  9.57%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepBytes_
     5.44s  6.09% 37.32%      5.44s  6.09%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStrFnd
     4.02s  4.50% 41.82%      4.02s  4.50%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStr___Concurrent
     3.82s  4.27% 46.09%      3.82s  4.27%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStrFndConcurrent
     3.65s  4.08% 50.18%      3.65s  4.08%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepScan__Concurrent
     3.04s  3.40% 53.58%      3.04s  3.40%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepBytes_Concurrent


----

Type: alloc_space
Time: Jun 8, 2022 at 10:17pm (PDT)
Active filters:
   focus=Benchmark
   hide=benchm
   show=Benchmark
Showing nodes accounting for 533.60GB, 100% of 533.61GB total
      flat  flat%   sum%        cum   cum%
   72.96GB 13.67% 13.67%    72.96GB 13.67%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStrFnd
   72.95GB 13.67% 27.34%    72.95GB 13.67%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStr___Concurrent
   72.94GB 13.67% 41.01%    72.94GB 13.67%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStr___
   72.93GB 13.67% 54.68%    72.93GB 13.67%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepStrFndConcurrent
   60.48GB 11.33% 66.01%    60.48GB 11.33%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepBytes_
   60.47GB 11.33% 77.35%    60.47GB 11.33%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepBytes_Concurrent
   60.45GB 11.33% 88.67%    60.45GB 11.33%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepScan__Concurrent
   60.43GB 11.33%   100%    60.43GB 11.33%  mobiledatabooks.com/biggrep.Benchmark_100MB_End__20charsBigGrepScan__


----
