Hello everyone,
I've been using the archive/zip package to read large archives (40+ gigabytes) directly from a network file system (NFS/S3). I've noticed that the performance is limited by the number of small read operations.
After looking into the code, it seems zip.Reader uses a fixed internal buffer of 4096 bytes in reader.go. While that is a sensible default for local disk access, it's inefficient for high-latency network reads, where each read operation carries significant overhead.
To address this, I'd like to discuss the possibility of allowing the buffer size to be configured by the user.
A potential API could be a new function, zip.NewReaderSize, similar to bufio.NewReaderSize:

func NewReaderSize(r io.ReaderAt, size int64, bufSize int) (*Reader, error)
The existing zip.NewReader would just call this new function with the default 4096-byte buffer to maintain backward compatibility.
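For illustration, a caller reading from a high-latency source could then opt into a larger buffer. This is hypothetical usage, since NewReaderSize does not exist yet:

f, err := os.Open("big.zip") // placeholder path on the network file system
if err != nil {
    log.Fatal(err)
}
defer f.Close()

fi, err := f.Stat()
if err != nil {
    log.Fatal(err)
}

// Ask for a 4 MB internal buffer instead of the current 4096-byte default.
zr, err := zip.NewReaderSize(f, fi.Size(), 4<<20)
if err != nil {
    log.Fatal(err)
}
fmt.Println("entries:", len(zr.File))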
In a preliminary test, reading the central directory and parsing the file headers of a 40 GB zip file over the network takes more than 10 minutes with the current buffer. With a 4 MB buffer, the same operation takes about 10 seconds.
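In the meantime, the effect can be approximated from outside the package by wrapping the io.ReaderAt so that archive/zip's many small ReadAt calls are served from larger cached chunks. Below is a minimal sketch of that workaround; all names in it are mine, not part of any proposal, and it only caches a single chunk:

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "log"
    "os"
)

// chunkedReaderAt reads the underlying source in fixed-size chunks and caches
// the most recently read chunk, so many small ReadAt calls turn into few
// large reads against the slow source.
type chunkedReaderAt struct {
    src       io.ReaderAt
    size      int64  // total size of the source
    chunkSize int64  // e.g. 4 MB
    buf       []byte // last chunk read
    bufOff    int64  // offset of buf[0] in src; -1 means nothing cached
}

func newChunkedReaderAt(src io.ReaderAt, size, chunkSize int64) *chunkedReaderAt {
    return &chunkedReaderAt{src: src, size: size, chunkSize: chunkSize, bufOff: -1}
}

func (c *chunkedReaderAt) ReadAt(p []byte, off int64) (int, error) {
    n := 0
    for n < len(p) {
        cur := off + int64(n)
        if cur >= c.size {
            return n, io.EOF
        }
        // Load the chunk containing cur if it is not the one currently cached.
        start := cur - cur%c.chunkSize
        if c.bufOff != start {
            end := start + c.chunkSize
            if end > c.size {
                end = c.size
            }
            buf := make([]byte, end-start)
            if _, err := c.src.ReadAt(buf, start); err != nil && err != io.EOF {
                return n, err
            }
            c.buf, c.bufOff = buf, start
        }
        n += copy(p[n:], c.buf[cur-start:])
    }
    return n, nil
}

func main() {
    f, err := os.Open("big.zip") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }

    // Wrap the file so the zip reader's 4 KB reads hit 4 MB cached chunks.
    ra := newChunkedReaderAt(f, fi.Size(), 4<<20)
    zr, err := zip.NewReader(ra, fi.Size())
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("entries:", len(zr.File))
}

This works, but having the buffer size configurable in the package itself would avoid the extra copy and the wrapper boilerplate.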