files, readers, byte arrays (slices?), byte buffers and http.requests

Sri G

Jul 2, 2016, 2:15:19 AM
to golang-nuts
I'm working on receiving uploads through a form.

The tricky part is validation.

I attempt to read the first 1024 bytes to check the MIME type of the file and then, if it is valid, read the rest, hash it, and save it to disk. Reading the MIME type is successful, and I've gotten it to work by chaining TeeReader, but it seems very hackish. What's the idiomatic way to do this?

I'm trying something like this: 


// Parse my multi part form 
...
// Get file handle
file, err := fh.Open()

var a bytes.Buffer

io.CopyN(&a, file, 1024)

mime := mimemagic.Match("", a.Bytes())
// Check mime type (this works fine)

// I'm trying to seek a stream, so this should be a no-op
file.Seek(0, 0)

// The file stored on disk is 1 KB larger than the original, so this appears to
// re-copy the entire file and append it to the bytes.Buffer
io.Copy(&a, file)

checksum := md5.New()
b := io.TeeReader(&a, checksum)

md5hex := hex.EncodeToString(checksum.Sum(nil))
fmt.Println("md5=", md5hex)

//Open file f for writing to disk
...
//Save file
io.Copy(f, b)


I checked the md5 of (1 KB of original + original) and of (original - first 1 KB); neither matches the md5 of the file being hashed.

Why can't I append the rest of the stream to the byte buffer to get the complete file in memory and why is the byte buffer being "consumed"?

I simply need to read the same array of bytes multiple times; I don't need to "copy" them. I'm coming from a C background, so I'm wondering what is going on behind the scenes as well.

Tamás Gulácsi

Jul 2, 2016, 3:18:51 AM
to golang-nuts

If you know you'll have to read the whole file into memory, then do that, and use bytes.NewReader to create a reader for that byte slice.

If you read partly, to decide whether to go on, then use fh.Read or io.ReadAtLeast with a byte slice.

If you read something and then want to read the whole thing from the beginning, construct a Reader with io.MultiReader(bytes.NewReader(b), fh).

You can combine these approaches, but if the whole file is smaller than a few KiB, I think it is easier, simpler and more performant (!) to read the whole file into memory, into a bytes.Buffer, and construct the needed readers with bytes.NewReader(buf.Bytes()).
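
To illustrate why a bytes.Buffer looks "consumed" while a bytes.Reader over the same slice does not, here is a small sketch. Reading from a Buffer advances its internal offset, whereas bytes.NewReader wraps an existing slice without copying it and each Reader keeps its own offset. The helper name readTwice and the sample data are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readTwice drains a bytes.Buffer, then reads the same data twice more
// through independent bytes.Readers. It returns how many unread bytes the
// Buffer has left plus the two results. (Hypothetical helper, for illustration.)
func readTwice(data []byte) (left int, first, second string) {
	buf := bytes.NewBuffer(data)
	io.Copy(io.Discard, buf) // reading moves the Buffer's offset to the end
	left = buf.Len()         // 0: the Buffer has nothing more to give

	// bytes.NewReader wraps data without copying it; each Reader keeps its
	// own offset, so both start from byte 0.
	a, _ := io.ReadAll(bytes.NewReader(data))
	b, _ := io.ReadAll(bytes.NewReader(data))
	return left, string(a), string(b)
}

func main() {
	left, first, second := readTwice([]byte("abc"))
	fmt.Println(left, first, second) // 0 abc abc
}
```

Note that io.ReadAll itself allocates a result slice; the point is only that creating the Readers does not duplicate the underlying bytes.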

Sri G

Jul 2, 2016, 5:48:45 PM
to golang-nuts
Thanks for the pointer. I also found the article "Asynchronously Split an io.Reader in Go (golang)" (Rodaine) helpful, but I'm still missing something.

Version 1: the saved file has 1024 extra bytes at the end (too big):

mimebuf := make([]byte, 1024)
_, err = file.Read(mimebuf)

mime := mimemagic.Match("", mimebuf)

fileReader := io.MultiReader(bytes.NewReader(mimebuf), file)

checksum := md5.New()

b := io.TeeReader(fileReader, checksum)

md5hex := hex.EncodeToString(checksum.Sum(nil))

// Save file
io.Copy(f, b)

Version 2: the saved file is truncated by 1024 bytes (too small). This makes sense, since the first 1024 bytes of the file were consumed:

mimebuf := make([]byte, 1024)
_, err = file.Read(mimebuf)

mime := mimemagic.Match("", mimebuf)

checksum := md5.New()

// Adding file.Seek(0,0) here does not fix this issue

b := io.TeeReader(file, checksum)

md5hex := hex.EncodeToString(checksum.Sum(nil))

// Save file
io.Copy(f, b)


What is incorrect which is causing this? How do I get the goldilocks version that's just right?

Sri G

Jul 2, 2016, 6:20:41 PM
to golang-nuts
Update:

Adding file.Seek(0,0) does fix the issue in Version 2. The uploaded file is the correct size on disk with the correct md5. Without it, the uploaded file which is saved is missing the first 1024 bytes. This makes sense.

There is something wrong with the way the md5 is calculated: it keeps giving the same hash. Any ideas?

This version, while most likely not idiomatic, works:

mimebuf := make([]byte, 1024)
_, err = file.Read(mimebuf)

mime := mimemagic.Match("", mimebuf)

file.Seek(0, 0)

checksum := md5.New()

io.Copy(checksum, file)

md5hex := hex.EncodeToString(checksum.Sum(nil))
fmt.Println("md5=", md5hex)

file.Seek(0, 0)
io.Copy(f, file)

It would be much appreciated if someone who understands the idiomatic way to do this could explain it.

Dave Cheney

Jul 2, 2016, 9:27:15 PM
to golang-nuts
The hash is always the same because you ask for the hash value before writing any data through it with io.Copy.

Sri G

Aug 4, 2016, 1:29:20 AM
to golang-nuts
Doh. Thanks. I did the setup but didn't click "execute".

Revisiting this because it's now a bottleneck, since it directly impacts user experience (how long a request will take to process) and scalability (requests per second a single instance can handle). It wasn't premature optimization, rather proper architecture planning :)

In C, the request would come into a ring buffer of structs of arrays (read up on SoA vs. AoS on Intel x86), and a pointer to the POST data is kept. That pointer is used to check the MIME type as well as to compute the md5. Then it is passed on to be written to disk before it is released. No copies are needed.

How can I accomplish this idiomatically in Go? When I say idiomatic, I mean efficient in space, time and verbosity depending on the requirements and, most importantly, not fighting the language.

I'm having difficulty grokking whether a command copies data or uses a reference to the underlying buffer (pointer). Or does everything copy because data needs to be in each stack for each goroutine?

I've read the source code of io.Copy: if the reader implements WriteTo or the writer implements ReadFrom, the copy uses the existing buffer, avoiding an allocation and a copy. However, the crypto/md5 hash does not have either of these, so it seems it's not possible to compute the md5 without copying data. Is this because the md5 library is written for streaming data vs static data?

Is there a way to accomplish this? i.e. here's a buffer of data, compute the md5 on it.
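
For a buffer that is already in memory, crypto/md5 also offers the one-shot function md5.Sum, which takes a []byte and returns a [16]byte array. Passing the slice hands over only the slice header, not a copy of the data. A minimal sketch, with a hypothetical helper name md5Hex and made-up sample data:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// md5Hex hashes an in-memory buffer in one call, with no reader involved.
func md5Hex(data []byte) string {
	sum := md5.Sum(data) // sum has type [md5.Size]byte
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(md5Hex([]byte("hello world"))) // 5eb63bbbe01eeed093cb22bb8f5acdc3
}
```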

Re: the mimetype, I should be able to create a 1024 byte slice of the file and pass it to mimemagic. This should avoid the copy.
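
A sketch of that idea, assuming the whole upload is already held in a []byte. Slicing creates a new slice header over the same backing array, so no bytes are copied. Because mimemagic is a third-party package, the standard library's http.DetectContentType is swapped in here purely for illustration; it considers at most the first 512 bytes. The helper name sniff is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff returns the leading bytes of data (at most n of them) together with
// a detected content type. It assumes n is not negative.
func sniff(data []byte, n int) ([]byte, string) {
	if len(data) < n {
		n = len(data)
	}
	head := data[:n] // shares data's backing array; nothing is copied
	return head, http.DetectContentType(head)
}

func main() {
	// PNG signature followed by zero filler, standing in for an upload.
	data := append([]byte("\x89PNG\r\n\x1a\n"), make([]byte, 2000)...)
	head, mime := sniff(data, 1024)
	fmt.Println(len(head), mime)      // 1024 image/png
	fmt.Println(&head[0] == &data[0]) // true: same memory
}
```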



Tamás Gulácsi

Aug 4, 2016, 2:48:29 AM
to golang-nuts
If md5 is enough at the end, use an io.TeeReader. If not, you need to buffer it, with bytes.Buffer. That can be reused with sync.Pool (don't forget the Reset).

For mime, the first 1024 bytes are enough. Read them into a [1024]byte and create a Reader with io.MultiReader.
