Improving Performance of Hashing Large Files


Sanjay Menakuru

Oct 9, 2011, 12:02:33 AM
to golan...@googlegroups.com
Hi,

I am building a personal media server and thought that, to handle file renaming cleanly, I could identify files by their MD5/SHA-1/SHA-512 sum. Then I realized that hashing many large files could be a real time sink for a 2 TB library, so I wrote a trivial Go program that hashes a single file, given by a flag, to measure hashing speed.

Given a 1.5 GB file, this program took 11.809 s for MD5 and 17.998 s for SHA-1 on my laptop (my server would be faster).
What is odd is that openssl took only 3.407 s for MD5 and 4.696 s for SHA-1 on the same file.

OpenSSL is written in C, so some difference is expected, but for such a purely computational benchmark I would have expected the performance to be much closer.

Does anyone have any tips on improving the performance of Go's hash libraries? 

Thanks in advance,
Sanjay

PS - my code is available here: http://pastebin.com/DgurHbpe
PPS - Yes, I did run the test a few times and reported only a stable value.
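
A minimal sketch of what such a benchmark can look like, written against the current standard library (this is an illustration, not the pastebin code; the -file flag is assumed from the description above):

package main

import (
    "crypto/md5"
    "crypto/sha1"
    "flag"
    "fmt"
    "hash"
    "io"
    "log"
    "os"
    "time"
)

// timeHash streams the whole file through h once and reports the digest
// and the elapsed wall-clock time.
func timeHash(name, path string, h hash.Hash) {
    f, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    start := time.Now()
    if _, err := io.Copy(h, f); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s: %x (%v)\n", name, h.Sum(nil), time.Since(start))
}

func main() {
    file := flag.String("file", "", "file to hash")
    flag.Parse()
    if *file == "" {
        log.Fatal("usage: -file <path>")
    }
    timeHash("md5", *file, md5.New())
    timeHash("sha1", *file, sha1.New())
}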

Evan Shaw

Oct 9, 2011, 12:27:01 AM
to golan...@googlegroups.com
I don't think anyone has put much effort into optimizing Go's md5 or sha1
packages. The routines OpenSSL uses are probably at least partially
written in assembly, while Go's packages are pure Go.

- Evan

Oct 9, 2011, 7:59:12 AM
to golang-nuts
Wouldn't it be sufficient to hash only a fraction of the data in a media
file? That approach works for smplayer on Linux.
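
As a sketch of that idea, here is a fingerprint built from the file size plus the first and last 64 KB; the chunk size and layout are arbitrary choices for illustration, not what smplayer actually does:

package main

import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"
    "io"
    "log"
    "os"
)

const chunk = 64 * 1024 // bytes taken from each end; an arbitrary choice

// partialHash fingerprints a file from its length plus the first and last
// chunk bytes, instead of reading the entire file.
func partialHash(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    fi, err := f.Stat()
    if err != nil {
        return "", err
    }
    h := sha1.New()
    // Mix in the size so files with identical ends still differ.
    binary.Write(h, binary.LittleEndian, fi.Size())
    // First chunk (or the whole file, if it is shorter than one chunk).
    if _, err := io.CopyN(h, f, chunk); err != nil && err != io.EOF {
        return "", err
    }
    // Last chunk, when the file is large enough to have one.
    if fi.Size() > chunk {
        tail := io.NewSectionReader(f, fi.Size()-chunk, chunk)
        if _, err := io.Copy(h, tail); err != nil {
            return "", err
        }
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    if len(os.Args) != 2 {
        log.Fatal("usage: partialhash <file>")
    }
    sum, err := partialHash(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(sum)
}

The obvious trade-off is that two files differing only in the middle collide, which is usually acceptable for identifying media files but not for integrity checking.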

Jens-Uwe Mager

Oct 9, 2011, 8:19:29 AM
to golan...@googlegroups.com
I see the same problem with my still-in-progress web-based MP3 player. I address MP3 files by their SHA-1 hash, and it takes ages to build the index (nearly 2 hours for 300 GB on an iMac).

If one wanted to re-implement git in Go, one would hit the same problem, since git hashes every file.

Sanjay Menakuru

Oct 9, 2011, 2:19:16 PM
to golan...@googlegroups.com
Thanks a lot!

That's a great idea.

 - Sanjay

Dave Cheney

Oct 9, 2011, 5:34:47 PM
to golan...@googlegroups.com
Are you able to post any code of the main indexing routine?

Cheers

Dave

Sent from my iPhone

Jens-Uwe Mager

Oct 11, 2011, 11:52:09 AM
to golan...@googlegroups.com
I use the following index func (please ignore the naive error handling for now):

// (Imports cover the whole file, not just the function shown.)
import (
    "bufio"
    "crypto/sha1"
    "encoding/gob"
    "encoding/hex"
    "flag"
    "fmt"
    "io"
    "net/http"
    "os"
    "path"
    "path/filepath"
    "sort"

    "github.com/ascherkus/go-id3/src"
)

func indexMusicLib(dir string) (res MusicLib) {
    res.Root = dir
    res.Hash = make(map[string]*MP3File)
    res.Tracks = make(map[string][]*MP3File)
    res.Albums = make(map[string][]string)
    walkFn := func(path string, info os.FileInfo, err error) error {
        if err != nil {
            debug("walkFn path %v, err %v\n", path, err)
            return err
        }
        // Continue to walk into all directories.
        if info.IsDir() {
            return nil
        }
        fullpath, err := filepath.Abs(path)
        if err != nil {
            panic(err)
        }
        f, err := os.Open(fullpath)
        if err != nil {
            panic(err)
        }
        defer f.Close()
        mp3 := id3.Read(f)
        if mp3 == nil {
            // Ignore non-MP3 files.
            return nil
        }
        //debug("id3.File = %#v\n", mp3)
        // Rewind so the hash covers the whole file, not just what is left
        // after the ID3 parse.
        _, err = f.Seek(0, 0)
        if err != nil {
            panic(err)
        }
        reader := bufio.NewReader(f)
        h := sha1.New()
        _, err = io.Copy(h, reader)
        if err != nil {
            panic(err)
        }
        res.Files = append(res.Files, MP3File{FileName: f.Name(), SHA1Hash: hex.EncodeToString(h.Sum(nil)), ID3Info: *mp3})
        // Careful: this points into res.Files' current backing array; a later
        // append may reallocate it, leaving the maps holding stale (though
        // never-mutated) copies.
        n := &res.Files[len(res.Files)-1]
        debug("n = %#v\n", n)
        res.Hash[n.SHA1Hash] = n
        res.Tracks[n.ID3Info.Album] = append(res.Tracks[n.ID3Info.Album], n)
        if !containsString(res.Albums[n.ID3Info.Artist], n.ID3Info.Album) {
            res.Albums[n.ID3Info.Artist] = append(res.Albums[n.ID3Info.Artist], n.ID3Info.Album)
        }
        if !containsString(res.Artists, n.ID3Info.Artist) {
            res.Artists = append(res.Artists, n.ID3Info.Artist)
        }
        return nil
    }
    err := filepath.Walk(dir, walkFn)
    if err != nil {
        debug("walk err %v\n", err)
    }
    sort.Strings(res.Artists)
    for _, v := range res.Albums {
        sort.Strings(v)
    }
    return
}

Kyle Lemons

Oct 11, 2011, 1:41:31 PM
to golan...@googlegroups.com

        reader := bufio.NewReader(f)
        h := sha1.New()
        _, err = io.Copy(h, reader)

I find that for things like this, you can get a lot of performance out of creating the bufio.Reader with a larger buffer size. In my case 4 MB was best (I ran a throughput testing.Benchmark at logarithmic increments from 1024 bytes up to 1 GB and picked the size with the highest throughput).

~K 
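
A sketch of that kind of sweep, driving testing.Benchmark from an ordinary main function (the sample file path is illustrative):

package main

import (
    "bufio"
    "crypto/sha1"
    "fmt"
    "io"
    "os"
    "testing"
)

// benchBufSize hashes the file once per iteration through a bufio.Reader
// of the given size, so the reported MB/s reflects that buffer size.
func benchBufSize(path string, size int) testing.BenchmarkResult {
    return testing.Benchmark(func(b *testing.B) {
        fi, err := os.Stat(path)
        if err != nil {
            b.Fatal(err)
        }
        b.SetBytes(fi.Size())
        for i := 0; i < b.N; i++ {
            f, err := os.Open(path)
            if err != nil {
                b.Fatal(err)
            }
            h := sha1.New()
            if _, err := io.Copy(h, bufio.NewReaderSize(f, size)); err != nil {
                b.Fatal(err)
            }
            f.Close()
        }
    })
}

func main() {
    // Sweep buffer sizes at logarithmic increments; the 64 MB upper bound
    // here just keeps the example short.
    for size := 1024; size <= 64<<20; size *= 2 {
        fmt.Printf("%9d bytes: %s\n", size, benchBufSize("testdata/sample.mp3", size))
    }
}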

Jens-Uwe Mager

Oct 12, 2011, 11:08:36 AM
to golan...@googlegroups.com
OK, I tried using:

bufio.NewReaderSize(f, 4*1024*1024)

and now it takes more than an hour longer, and the process size grows to more than 19 GB. I cannot see where the memory leak could be in this simple loop.
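
One plausible contributor, as a guess rather than a diagnosis: the walk function allocates a fresh 4 MB bufio.Reader for every file, and the runtime of that era was slow to return that churn to the OS. A sketch of allocating the buffer once and reusing it with Reset (added to bufio in a later release, Go 1.2):

package main

import (
    "bufio"
    "crypto/sha1"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func main() {
    // One 4 MB buffer for the whole walk, rather than one per file.
    br := bufio.NewReaderSize(nil, 4*1024*1024)
    err := filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        br.Reset(f) // point the shared buffer at this file
        h := sha1.New()
        if _, err := io.Copy(h, br); err != nil {
            return err
        }
        fmt.Printf("%x  %s\n", h.Sum(nil), path)
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}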

Kyle Lemons

Oct 12, 2011, 12:24:14 PM
to golan...@googlegroups.com
That seems suspiciously large. I'd profile it to see where the in-use objects were allocated.
~K
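
A sketch of doing that with the standard runtime/pprof package (the output filename is illustrative); the resulting file can be inspected with go tool pprof:

package main

import (
    "log"
    "os"
    "runtime"
    "runtime/pprof"
)

// writeHeapProfile dumps a heap profile showing where live objects were
// allocated; inspect it afterwards with: go tool pprof heap.pprof
func writeHeapProfile(path string) {
    f, err := os.Create(path)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    runtime.GC() // flush recent allocations into up-to-date statistics
    if err := pprof.WriteHeapProfile(f); err != nil {
        log.Fatal(err)
    }
}

func main() {
    // ... run the indexing here ...
    writeHeapProfile("heap.pprof")
}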