Re: "big size file" for Go


Kevin Gillette

Jan 26, 2013, 5:32:55 AM
to golan...@googlegroups.com, itmi...@gmail.com
Are you referring to go source files, or the size of files that os.Open can deal with?

On Saturday, January 26, 2013 3:14:22 AM UTC-7, itmi...@gmail.com wrote:
Is it safe to assume:

- small size: <100KB
- medium size: <1MB
- big size: <10MB
- huge size: >10MB

?

Thanks,
Mitică

chris dollin

Jan 26, 2013, 5:43:38 AM
to itmi...@gmail.com, golan...@googlegroups.com
On 26 January 2013 10:14, <itmi...@gmail.com> wrote:
> Is it safe to assume:
>
> - small size: <100KB
> - medium size: <1MB
> - big size: <10MB
> - huge size: >10MB

10MB isn't "huge" for a file.

Where are these assumptions being applied?

Chris

--
Chris "allusive" Dollin

Donovan Hide

Jan 26, 2013, 5:57:50 AM
to itmi...@gmail.com, golang-nuts, ehog....@googlemail.com
Input and output in Go make use of the Reader and Writer interfaces, which allow the programmer to decide how much of a stream of data they process at a time. This means that the file size doesn't really have any effect on a Go programme's ability to deal with it.

Of course you can bypass this by using something like ioutil.ReadAll() which will put all data into memory, in which case you might have issues :-)


On 26 January 2013 10:49, <itmi...@gmail.com> wrote:
In file processing, byte by byte.


chris dollin

Jan 26, 2013, 5:58:37 AM
to itmi...@gmail.com, golan...@googlegroups.com
On 26 January 2013 10:49, <itmi...@gmail.com> wrote:
> In file processing, byte by byte.

Sorry, I wasn't clear. What effect does labelling
those sizes with those names have? What
does "gracefully" mean?

Is what you're asking something like "how long
does it take on a typical [1] current machine to
read through this much data from a file?"?

Chris [2]

[1] Not that there is a "typical" ...

[2] Absent knowing an answer, attempting to
understand the question.

--
Chris "allusive" Dollin

Jesse McNelis

Jan 26, 2013, 6:08:23 AM
to Dumitru Ungureanu, golang-nuts
On Sat, Jan 26, 2013 at 9:42 PM, <itmi...@gmail.com> wrote:
I mean the file sizes Go handles gracefully, not source files, sorry.
I know it's machine and platform dependent, but still.

It's dependent on the system. But for most modern file systems the maximum size of a file is likely larger than the storage you have, e.g. ext4 on Linux supports files up to 16TB.


peterGo

Jan 26, 2013, 6:15:43 AM
to golan...@googlegroups.com, itmi...@gmail.com
Mitică,

It's not safe to assume that; it's a nonsense question.

How would you interpret the result of my test on an 8GB page file?

package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open("/home/peter/pagefile.sys")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer f.Close()
    data := make([]byte, 4096)
    var bytes, zeros int64
    for {
        data = data[:cap(data)]
        n, err := f.Read(data)
        if err != nil {
            if err == io.EOF {
                break
            }
            fmt.Println(err)
            return
        }
        data = data[:n]
        for _, b := range data {
            bytes++
            if b == 0 {
                zeros++
            }
        }
    }
    fmt.Println("bytes:", bytes, "zeros:", zeros)
}

$ time go run temp.go
bytes: 8553422848 zeros: 978805027

real    1m45.525s
user    0m28.286s
sys    0m4.932s
$

Peter


On Saturday, January 26, 2013 5:14:22 AM UTC-5, itmi...@gmail.com wrote:
Is it safe to assume:

- small size: <100KB
- medium size: <1MB
- big size: <10MB
- huge size: >10MB

?

Thanks,
Mitică

Donovan Hide

Jan 26, 2013, 6:20:47 AM
to itmi...@gmail.com, golang-nuts, ehog....@googlemail.com

The read buffer is not the issue.
Consider the case of "jumping" inside the file, to certain byte positions.

os.File implements the io.ReaderAt interface for exactly that situation:

http://golang.org/pkg/os/#File.ReadAt

And if you need to jump backwards, it implements io.ReadSeeker:

http://golang.org/pkg/os/#File.Seek
 
To make use of this functionality, you'd build an io.SectionReader, which still only loads n bytes at a time, so file size is not an issue unless you want to load a part of a file that is larger than your available memory. The only thing that might be a concern is the speed of processing. Is that what your question is about?

Donovan Hide

Jan 26, 2013, 6:32:54 AM
to Dumitru Ungureanu, golang-nuts, ehog....@googlemail.com
Looking at a "modest typical current machine", or at various other samples, at what file sizes does performance degradation from repeated ReadAt() positioning operations become noticeable?

This isn't really a function of Go, or of the file size. It's a function of your disk/page cache and the seek time of your persistent storage.

peterGo

Jan 26, 2013, 7:44:52 AM
to golan...@googlegroups.com, itmi...@gmail.com
Mitică,

On Saturday, January 26, 2013 6:32:29 AM UTC-5, itmi...@gmail.com wrote:
On Saturday, January 26, 2013 1:15:43 PM UTC+2, peterGo wrote:
Mitică,

It's not safe to assume that; it's a nonsense question.

What I'm trying is an assessment before non-linear processing.

That's exactly my point. Unless you simulate exactly what you want to do you are going to get nonsense answers.

How would you interpret the results of this test, which does random reads?


package main

import (
    "fmt"
    "math/rand"
    "os"
)

func main() {
    f, err := os.Open("/home/peter/linuxmint-14-cinnamon-dvd-64bit.iso")

    if err != nil {
        fmt.Println(err)
        return
    }
    defer f.Close()
    fi, err := f.Stat()

    if err != nil {
        fmt.Println(err)
        return
    }
    data := make([]byte, 4096)
    var reads int64
    for reads = 0; reads < 1e7; reads++ {
        data = data[:cap(data)]
        n, err := f.ReadAt(data, rand.Int63n(fi.Size()-int64(len(data))))

        if err != nil {
            fmt.Println(err)
            return
        }
        data = data[:n]
    }
    fmt.Println("file size:", fi.Size(), "reads:", reads)

}

$ time go run temp.go
file size: 923795456 reads: 10000000
real    0m32.998s
user    0m5.840s
sys    0m27.098s
$

Peter



andrey mirtchovski

Jan 26, 2013, 11:37:28 AM
to Dumitru Ungureanu, golang-nuts
here's a threaded 'cp' using ReadAt which does significantly better
than the host OSs cp for large file sizes:

https://github.com/rminnich/u-root/tree/master/cp

Kevin Gillette

Jan 26, 2013, 1:31:48 PM
to golan...@googlegroups.com, itmi...@gmail.com
peterGo: You should be doing error handling _after_ you process the data that's returned. The io.Reader documentation explicitly states that an implementation may either return the data read up to the point at which the error occurred along with the error, _or_ wait to return the error on the next call with no bytes having been read; anyone using Read should account for both cases by processing n before err. In theory, the last n bytes (as returned by Read) of the pagefile could contain zeros that you're not counting.

minux

Jan 26, 2013, 1:46:36 PM
to Kevin Gillette, golan...@googlegroups.com, itmi...@gmail.com
On Sun, Jan 27, 2013 at 2:31 AM, Kevin Gillette <extempor...@gmail.com> wrote:
peterGo: You should be doing error handling _after_ you process the data that's returned. The io.Reader documentation explicitly states that an implementation may either return the data read up to the point at which the error occurred along with the error, _or_ wait to return the error on the next call with no bytes having been read; anyone using Read should account for both cases by processing n before err. In theory, the last n bytes (as returned by Read) of the pagefile could contain zeros that you're not counting.
also note that users of io.Reader should be aware that the seemingly
impossible returns 0, nil and len(buf), err (err != nil) are in fact allowed.

And it's very tricky to get code calling Read directly 100% correct.

In fact, even io.ReadFull can't correctly handle the case when a
bad reader always returns len(buf), err where err != nil && err != io.EOF
(ReadFull will return len(buf), and a non-nil error, violating the
documented interface if you assume the sentence
"It returns the number of bytes copied and an error if fewer bytes were read."
means that whenever it returns non-nil error, n < len(buf))

However, we can't ban len(buf), err return from readers as if we do,
the reader is forced to keep the error until next read, and a lot of reader
implementations don't do that, so Go 1 contract will be broken.

The takeaway is:
strictly speaking, even ignoring the returned `n` from io.ReadFull is
an error.

Daniel Bryan

Jan 26, 2013, 5:37:44 PM
to golan...@googlegroups.com, Kevin Gillette, itmi...@gmail.com

Very interesting, thank you for that. I have a feeling this might be the source of quite a few of my bugs..

Nate Finch

Jan 30, 2013, 8:43:26 AM
to golan...@googlegroups.com
Here's my understanding....

n, err := r.Read(buf)

If Read returns a non-zero number of bytes read, you need to process those bytes, regardless of what err is, because *something* got read.
If err is non-nil, you won't be getting any more bytes out of Read, so what you've gotten so far is all you'll get.
If err is nil, you can Read again, if you didn't get enough bytes the first time.

Something like this:

// Returns a buffer of n bytes, or a nil buffer if we couldn't read that many
// error == nil iff the buffer was successfully filled
// error != nil iff the buffer was not completely filled
func ReadN(r io.Reader, size int) ([]byte, error) {
    buf := make([]byte, size)
    tmp := buf  // temp buffer that we can reslice as needed
    cnt := 0  // total number of bytes read
    for {
        n, err := r.Read(tmp)
        cnt += n
        if cnt == size {
            // successfully read size bytes
            // err might be non-nil, but we got the bytes asked for, so we can return
            // yes, this potentially drops a returned error on the floor... probably not the best idea
            return buf, nil
        }
        if err != nil {
            // not enough bytes, but we got an error, so we know we can't get more
            return nil, err
        }
        // no error, but not enough bytes, keep trying
        // reslice tmp so we start writing after the last byte written before
        tmp = tmp[n:]
    }
}

Peter Waller

Jan 30, 2013, 8:54:57 AM
to Nate Finch, golan...@googlegroups.com
On 30 January 2013 13:43, Nate Finch <nate....@gmail.com> wrote:
Here's my understanding....

n, err := r.Read(buf)

[snip behaviour of read]

I'm happy that I understand how read behaves. The problem is that usually I want simpler semantics, so it would be nice if I had a function which provided those semantics.

[...] func ReadN(r io.Reader, size int) ([]byte, error) { [...]

Yup, like this! I too wrote a function, then I realised there was io.ReadFull, which was supposed to do (almost exactly) this for me. Great! Except.. ReadFull doesn't behave how I thought it was supposed to?


// Returns a buffer of n bytes, or a nil buffer if we couldn't read that many
// error == nil iff the buffer was successfully filled
// error != nil iff the buffer was not completely filled

This sounds almost exactly like the io.ReadFull docs:


    "It returns the number of bytes copied and an error if fewer bytes were read. "

.. except from what minux wrote, I understand this might not be as straightforward as I hoped from my reading of the above. Ideally, there would be a standard library function I can use to get the desired behaviour: "Read N bytes from `fd` into `byteslice` (retrying if necessary) except in case of critical failure, in which case err != nil". I'm even happy if the failure wasn't with reading my N bytes, but `Read` encountered a problem now which will affect subsequent reads.

Is io.ReadFull that function? If not, is there something simple that I can use instead? Ideally without copy-pasting code into every project where I want such a "read-like" function..

Donovan Hide

Jan 30, 2013, 9:36:42 AM
to peter....@gmail.com, golang-nuts, Kevin Gillette, Dumitru Ungureanu, minux

To state an example scenario, I'm reading some header of an arbitrary binary format, I'm some way into it, and I want to read the next 32 bytes. What's the best way to achieve that under all circumstances except unrecoverable failure? Are there cases where it's necessary to manually retry a io.ReadFull?

My naive code might have looked like:

    buf := make([]byte, 32)
    _, err := io.ReadFull(fd, buf)
    if err != nil { ... invoke panic, or ideally do something sensible ... }


I'm guessing LimitedReader and SectionReader would be useful in this situation. It's hard to answer this question correctly without knowing the state of the reader immediately before this section of code. Maybe post an example?

Peter Waller

Jan 30, 2013, 10:51:24 AM
to Donovan Hide, golang-nuts, Kevin Gillette, Dumitru Ungureanu, minux
On 30 January 2013 14:36, Donovan Hide <donov...@gmail.com> wrote:
I'm guessing  LimitedReader and SectionReader would be useful in this situation. It's hard to answer this question correctly without knowing the state of the reader immediately before this section of code. Maybe post an example?

My example: I'm reading from a file descriptor on linux (or any operating system for that matter, since I hope I'm not programming to a specific OS when I write go, ideally).

I guess that's really one of the problems: I fear that I don't (and can't, reasonably) know what "all possible states of the input reader" could potentially be. I write a lot of code to read from readers of all kinds. I was hoping there existed a function which would work under all circumstances, regardless of reader. I just want a `ReadFull`-like function which either works when it can, or fails when something unrecoverable happens.

After having digested above, my current feeling is that ReadFull will do what I expect (as stated earlier in this conversation), except in addition it might return (N==len(input), err) with err != nil && err != EOFofSomeSort. Therefore, if this is the last thing I'm reading from a stream (and I don't intend to do any more reads), instead of just failing if err != nil, I should first check if N == len(input), in which case the read succeeded but the input stream might be tainted - but I don't care.

I still fear that ReadFull might return an error which is recoverable, such as EINTR ([1]), whereas as a (semi-)high-level programmer I would prefer to ignore such details (since I don't want to learn all of them), and just say "please keep reading until you cannot because you exploded".

It may be that I can't escape such details, if so, I'd like to know that. Does anyone have any insight?

I currently don't see any way to get the semantics I'm after, looking through the source:

http://golang.org/src/pkg/os/file_unix.go#L171
http://golang.org/src/pkg/syscall/zsyscall_linux_amd64.go?h=Read#L736

It seems as though EINTR (and others - are there others I should care about?) will get propagated to my program if I use ReadFull on a file descriptor.

[1] http://factor-language.blogspot.co.uk/2010/09/two-things-every-unix-developer-should.html

Nate Finch

Jan 30, 2013, 11:37:11 AM
to golan...@googlegroups.com, Nate Finch, p...@pwaller.net
On Wednesday, January 30, 2013 8:54:57 AM UTC-5, Peter Waller wrote:
On 30 January 2013 13:43, Nate Finch <nate....@gmail.com> wrote:
Here's my understanding....

[...] func ReadN(r io.Reader, size int) ([]byte, error) { [...]

Yup, like this! I too wrote a function, then I realised there was io.ReadFull, which was supposed to do (almost exactly) this for me. Great! Except.. ReadFull doesn't behave how I thought it was supposed to?

// Returns a buffer of n bytes, or a nil buffer if we couldn't read that many
// error == nil iff the buffer was successfully filled
// error != nil iff the buffer was not completely filled

This sounds almost exactly like the io.ReadFull docs:

Yeah, I realized I was basically rewriting ReadFull halfway through.... and figured I'd just continue so as to show how to use a reader manually...

I think minux's point was just that ReadFull could return an error even if it returned what you were asking for... which to me is not a big deal. It just means you can't rely on the error alone to tell you if you got what you were looking for... but you can just check the length of the buffer for that.

This looks to be proper use of ReadFull:

n, err := io.ReadFull(r, buf)
if n != len(buf) {
    // handle incomplete read... err should generally not be nil, so that probably indicates the error
} else {
    // success!
    if err != nil {
         // (mostly)
         // you got the full buffer you asked for, but there was also an error
         // probably not a big deal, but you might want to log the error (or not)
    }
}


Kyle Lemons

Jan 30, 2013, 12:58:34 PM
to peter....@gmail.com, golang-nuts, Kevin Gillette, Dumitru Ungureanu, minux
On Wed, Jan 30, 2013 at 5:06 AM, <peter....@gmail.com> wrote:
On Saturday, 26 January 2013 18:46:36 UTC, minux wrote:
And it's very tricky to get code calling Read directly 100% correct.

In fact, even io.ReadFull can't correctly handle the case when a

I'm often faced with wanting to read a known number of bytes into a slice, so io.ReadFull is my tool of choice (or io/ioutil.ReadAll() if I want to read to EOF). I've had many issues stemming from forgetting a check here or there (that N was what I was expecting, etc).

I had to think about the text below for quite a while before I came to the conclusion: reading is hard. This is way more subtle than I realised.

So a question: What does "correct" code look like?

To state an example scenario, I'm reading some header of an arbitrary binary format, I'm some way into it, and I want to read the next 32 bytes. What's the best way to achieve that under all circumstances except unrecoverable failure? Are there cases where it's necessary to manually retry a io.ReadFull?

My naive code might have looked like:

    buf := make([]byte, 32)
    _, err := io.ReadFull(fd, buf)
    if err != nil { ... invoke panic, or ideally do something sensible ... }

If ReadFull returns an incomplete read with a nil error, that is a bug.

To quote the docs: "It returns the number of bytes copied and an error if fewer bytes were read."
 
I would have assumed that either my program bails because of an unrecoverable problem (e.g, the file is short and therefore corrupted), or everything is okay and I read the expected number of bytes into buf. I'm surprised to hear if there are other possibilities - maybe this should be considered a documentation bug?

Is the above code wrong? If so, under what circumstances will it fail, and how can I get this right?

Is it sufficient to check (s/_/N/) N after the above code, and if we're short some bytes, am I able to perform another read, or is `fd` dead at that point?

p.s. oh, and please also substitute `fd` for a socket or readable of your choice which illustrates issues with the above.

On Saturday, 26 January 2013 18:46:36 UTC, minux wrote:
also note that users of io.Reader should be aware that the seemingly
impossible returns 0, nil and len(buf), err (err != nil) are in fact allowed.

And it's very tricky to get code calling Read directly 100% correct.

In fact, even io.ReadFull can't correctly handle the case when a
bad reader always returns len(buf), err where err != nil && err != io.EOF
(ReadFull will return len(buf), and a non-nil error, violating the
documented interface if you assume the sentence
"It returns the number of bytes copied and an error if fewer bytes were read."
means that whenever it returns non-nil error, n < len(buf))

However, we can't ban len(buf), err return from readers as if we do,
the reader is forced to keep the error until next read, and a lot of reader
implementations don't do that, so Go 1 contract will be broken.

The takeaway is:
strictly speaking, even ignoring the returned `n` from io.ReadFull is
an error.


Nate Finch

Jan 30, 2013, 3:12:03 PM
to golan...@googlegroups.com
On Wednesday, January 30, 2013 12:58:34 PM UTC-5, Kyle Lemons wrote:

    buf := make([]byte, 32)
    _, err := io.ReadFull(fd, buf)
    if err != nil { ... invoke panic, or ideally do something sensible ... }

If ReadFull returns an incomplete read with a nil error, that is a bug.

To quote the docs: "It returns the number of bytes copied and an error if fewer bytes were read."

 I think the point was that it could return a complete read with a non-nil error.

Kyle Lemons

Jan 30, 2013, 3:56:05 PM
to Nate Finch, golang-nuts
The docs specify that EOF will only be reported if zero bytes were read. If the error is not EOF and not ErrUnexpectedEOF, I think I'd rather not trust the data that was read either, even if it had the right number of bytes.

minux

Jan 30, 2013, 4:03:26 PM
to Kyle Lemons, Nate Finch, golang-nuts
please try supplying a Reader that always returns len(buf), err with err != nil && err != io.EOF
to io.ReadFull to see what happens.

The docs seem to suggest that when ReadFull returns an error, buf was not completely filled, but that's not the case for the reader I just mentioned.
Of course, the docs for ReadFull are correct regarding this if you read them carefully enough.

However, this means that calling io.ReadFull like this is definitely wrong:
_, err = io.ReadFull(rd, buf)
if err != nil {
      // handle the error without touching buf
}

A sad truth is that there are quite a few usages like this even in the standard library.

Kyle Lemons

Jan 30, 2013, 4:08:17 PM
to minux, Nate Finch, golang-nuts
I'm not sure that it's definitely wrong.  The only error that I expect when reading is io.EOF.  Imagine a reader that reads N bytes and then a short checksum of those bytes.  If you read N bytes successfully but the checksum is wrong, what should the reader do?  Return an error.  If I get an error other than EOF, I don't trust the data that was read.

Steve McCoy

Jan 30, 2013, 4:21:56 PM
to golan...@googlegroups.com
On Wednesday, January 30, 2013 4:03:26 PM UTC-5, minux wrote:

However, this means that calling io.ReadFull like this is definitely wrong:
_, err = io.ReadFull(rd, buf)
if err != nil {
      // handle the error without touching buf
}

A sad truth is that there are quite a few usages like this even in the standard library.


I think this is fine. Chances are, if somebody's using ReadFull, they really want it all.

Steve McCoy

Jan 30, 2013, 4:25:39 PM
to golan...@googlegroups.com
D'oh, please ignore me; I misunderstood what you meant.

minux

Jan 30, 2013, 4:28:11 PM
to Kyle Lemons, Nate Finch, golang-nuts
That's only one case of a reader returning len(buf) with a non-EOF, non-nil error.
And this is not what the docs for Read say:
    When Read encounters an error or end-of-file condition after
    successfully reading n > 0 bytes, it returns the number of bytes read.
    It may return the (non-nil) error from the same call or return the error
    (and n == 0) from a subsequent call. An instance of this general case is
    that a Reader returning a non-zero number of bytes at the end of the
    input stream may return either err == EOF or err == nil. The next Read
    should return 0, EOF regardless.
    Callers should always process the n > 0 bytes returned before
    considering the error err. Doing so correctly handles I/O errors that
    happen after reading some bytes and also both of the allowed EOF
    behaviors.
Which seems to imply that in your bad-checksum case, it should return 0, ECHECKSUM,
because if the n here is non-zero, the caller of Read must treat it as valid read data.

Kyle Lemons

Jan 30, 2013, 4:35:42 PM
to minux, Nate Finch, golang-nuts
Indeed.

The Read API is probably something that needs to be addressed someday.  If it's so easy to get wrong that we do so repeatedly in the standard library, we should change it.  We probably can't require all readers to return a nil error if n == len(buf) until Go 2, but we could at least audit the standard library and document that such behavior is deprecated and considered dangerous (likely to cause data loss).

Nate Finch

Jan 30, 2013, 4:37:01 PM
to golan...@googlegroups.com, Kyle Lemons, Nate Finch
On Wednesday, January 30, 2013 4:28:11 PM UTC-5, minux wrote:

    Callers should always process the n > 0 bytes returned before
    considering the error err. Doing so correctly handles I/O errors that
    happen after reading some bytes and also both of the allowed EOF
    behaviors.
Which seems to imply that in your bad checksum case, it should return 0, ECHECKSUM
because if the n here is non-zero, the caller of Reader must treat it as valid read data.

Yep, that's my assumption from that clause - if there's data returned, that data is valid, regardless of what err is.  If that data is somehow invalidated by the error, it shouldn't have been returned to me. 

minux

Jan 31, 2013, 12:32:37 PM
to golan...@googlegroups.com
FYI, Russ proposed https://codereview.appspot.com/7235074/ to fix the io.ReadFull/ReadAtLeast
problem.

Thus io.ReadFull and io.ReadAtLeast will behave as expected and we can just ignore the n returned.