Which is the most efficient way to read STDIN lines 100s of MB long and tens of MB info is passed to it?


Const V

May 7, 2022, 4:40:58 PM
to golang-nuts

I need to write a program that reads STDIN and writes every line that contains the search word "test" to STDOUT.

How can I test that, considering that a single line can be hundreds of MB long (\n is the line end) and tens of MB of data are passed to it?

Tamás Gulácsi

May 8, 2022, 4:36:15 AM
to golang-nuts
If it is acceptable to hold those "100s of MB" in memory, then bufio.Scanner with a properly sized scanner.Buffer is the easiest and most battle-tested solution.

If not, then you have to reimplement the same thing with a temp file as the buffer: read the input, store it in the file, seek back to the start on each '\n', and copy the file to STDOUT when "test" is found.
Not too hard, but it has a few edge cases.

Manlio Perillo

May 8, 2022, 10:20:22 AM
to golang-nuts
You can try to use mmap. Keep the start and end (or size) of each line in memory, so that when a line contains your word, you can pass its address directly to libc.write, using cgo.
Alternatively, you can create a slice backed by the mapped memory using https://pkg.go.dev/reflect#SliceHeader.

Not tested.

Manlio


Peter Galbavy

May 8, 2022, 4:24:37 PM
to golang-nuts
Rather than overthink it at this stage, just use bufio.Scanner and strings.Contains() and see what performance is like. I suspect that for a plain string and a large-ish file it will be about as good as it gets.

Const V

May 8, 2022, 5:20:27 PM
to golang-nuts
Using r.ReadLine repeatedly I was able to read the full line in memory. It is working very fast.

for isPrefix && err == nil {
line, isPrefix, err = r.ReadLine()
ln = append(ln, line...)
}

if strings.Contains(s, "error") {

finds the substring very fast

Now comes the last step: writing the 100 MB string to STDOUT
w io.Writer
w.Write([]byte(s))

It is too big and freezes with a large string.

Marvin Renich

May 10, 2022, 10:39:07 AM
to golan...@googlegroups.com
* Const V <ths...@gmail.com> [220508 17:20]:
> Using r.ReadLine repeatedly I was able to read the full line in memory. It
> is working very fast.
>
> for isPrefix && err == nil {
> line, isPrefix, err = r.ReadLine()
> ln = append(ln, line...)
> }
>
> if strings.Contains(s, "error") {
>
> finds the substring very fast
>
> Now comes the last step: writing the 100 MB string to STDOUT
> w io.Writer
> w.Write([]byte(s))
>
> It is too big and freezes with a large string.

I'm not sure why w.Write would freeze; is your process starting to swap
and it is not really frozen but just taking a long time? Is it being
killed by the kernel's out-of-memory monitor?

However, you shouldn't convert the []byte to string and then back to
[]byte again. You are doing two huge allocations unnecessarily. Also,
your code using ReadLine is doing essentially the same thing as
ReadBytes, and the documentation for ReadLine suggests using ReadBytes
instead.

Does https://go.dev/play/p/mSZ-Ft6C8pZ do what you want?

Note that if you are trying to be _really_ efficient, you should look at
the "Why Is Grep So Fast" article mentioned previously in this thread.
I would allocate a slice of large (1M? 10M?) byte slices, using them as
a ring buffer, adding one []byte to the ring at a time as needed. Then
implement Boyer-Moore or Rabin-Karp in a way that handles the boundaries
between []byte buffers properly. The total size of the ring buffer will
max out at the longest line, which is necessary anyway, and you will
minimize allocations and copying of data.

...Marvin

Amnon

May 10, 2022, 12:52:04 PM
to golang-nuts
> I'm not sure why w.Write would freeze; is your process starting to swap
> and it is not really frozen but just taking a long time? Is it being
> killed by the kernel's out-of-memory monitor?

In the OP's code, w.Write was writing a large amount of data to a pipe, while nothing was reading from the other end.
This was the reason for the deadlock.

Marvin Renich

May 10, 2022, 1:24:48 PM
to golan...@googlegroups.com
* Amnon <amn...@gmail.com> [220510 12:52]:
That makes sense. Ah, now, looking through
https://groups.google.com/g/golang-nuts (I normally use email), I see a
different thread to which I had not paid attention.

...Marvin
