Duplicate File Checker Performance


Sri G

Oct 15, 2016, 5:15:29 AM
to golang-nuts
I wrote a multi-threaded duplicate file checker using MD5; the complete source is here: https://github.com/hbfs/dupe_check/blob/master/dupe_check.go

I benchmarked two variants on the same machine, on the same set of files (a ~1.7GB folder with ~600 files, each averaging ~3MB), multiple times, purging the disk cache between runs.

With this code:

    hash := md5.New()

    // streaming: io.Copy feeds the file through the hasher using its internal 32 KiB buffer
    if _, err := io.Copy(hash, file); err != nil {
      fmt.Println(err)
    }

    var md5sum [md5.Size]byte
    copy(md5sum[:], hash.Sum(nil)[:16])

// 3.35s user 105.20s system 213% cpu 50.848 total, memory usage is ~ 30MB


With this code:

  // whole-file: read everything into memory, then hash it in one call
  data, err := ioutil.ReadFile(path)
  if err != nil {
    fmt.Println(err)
  }

  md5sum := md5.Sum(data)

 // 3.10s user 31.75s system 104% cpu 33.210 total, memory usage is ~ 1.52GB

The memory usage makes sense, but why is the streaming version ~3x slower than the version that reads the entire file into memory? This trade-off doesn't make sense to me: in both cases the file is read from disk, which should be the limiting factor, and then the md5sum is computed.

In the streaming version there is an extra copy from []byte to [16]byte, but that should be negligible.

The only theory I can think of is context switching.

Streaming version:
disk -> processor
The processor waits on the disk read, so the scheduler switches to reading another file and puts the goroutine to sleep.

Entire-file version:
disk -> memory -> processor
The file is already in memory, so there is not as much context switching.

What do you think? Thanks!

Kevin Malachowski

Oct 15, 2016, 3:25:42 PM
to golang-nuts
It also might be inefficient to call Sum multiple times with smaller slices compared to calling it once with a larger slice. Try using io.CopyBuffer and passing in larger buffer sizes to see if the CPU and memory usage start to converge.
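
Something like this, maybe (untested sketch, reusing the hash and file variables from your snippet; the 1 MiB size is just an example to contrast with io.Copy's default 32 KiB):

    buf := make([]byte, 1<<20) // 1 MiB buffer instead of io.Copy's internal 32 KiB
    if _, err := io.CopyBuffer(hash, file, buf); err != nil {
        fmt.Println(err)
    }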

Kevin Malachowski

Oct 15, 2016, 3:27:38 PM
to golang-nuts
Sorry, I meant that calling Write on the hash type might be slower if it's called more often.

(I'm on mobile right now. When I get back to a keyboard I'll try to come up with an example)
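
Roughly the kind of thing I have in mind (an untested sketch, not real measurements): hash the same in-memory data with different Write chunk sizes and compare.

    // chunk_bench_test.go: compare MD5 throughput with small vs. large Writes.
    package main

    import (
        "crypto/md5"
        "testing"
    )

    var data = make([]byte, 1<<24) // 16 MiB stand-in for file contents

    func benchChunked(b *testing.B, chunk int) {
        b.SetBytes(int64(len(data)))
        for i := 0; i < b.N; i++ {
            h := md5.New()
            for off := 0; off < len(data); off += chunk {
                h.Write(data[off : off+chunk])
            }
            h.Sum(nil)
        }
    }

    func BenchmarkWrite512B(b *testing.B) { benchChunked(b, 512) }
    func BenchmarkWrite64KB(b *testing.B) { benchChunked(b, 64<<10) }
    func BenchmarkWrite1MB(b *testing.B)  { benchChunked(b, 1<<20) }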

Sri G

Oct 15, 2016, 6:30:53 PM
to golang-nuts
To diagnose this issue, I ran some benchmarks with time-tested tools:

On the same directory:

find DIR -type f -exec md5 {} \; 

5.36s user 2.93s system 50% cpu 16.552 total

Adding a hashmap on top of that wouldn't significantly increase the time.

Making this multi-processed (32 processes): 

find DIR -type f -print0 | xargs -0 -n 1 -P 32 md5

5.32s user 3.24s system 43% cpu 19.503 total

With 64 processes, matching GOMAXPROCS=64 on this machine:

find DIR -type f -print0 | xargs -0 -n 1 -P 64 md5

5.31s user 3.66s system 42% cpu 20.999 total

So it seems disk access is the bottleneck, as it should be, and the biggest performance hit comes from synchronization.

I wrote a python script to do the same, code is here: https://github.com/hbfs/dupe_check/blob/master/dupe_check.py

2.97s user 0.92s system 24% cpu 15.590 total, memory usage is ~ 8MB

My next step is to try a single-threaded/single-goroutine version in Go to replicate this level of performance and get a deeper understanding of how Go is built and how to use it more effectively. Advice appreciated!
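
Something like this is what I have in mind for the single-goroutine version (just a sketch, reusing one 64 KiB buffer like the Python script; error handling is minimal):

    package main

    import (
        "crypto/md5"
        "fmt"
        "io"
        "os"
        "path/filepath"
    )

    func main() {
        buf := make([]byte, 64*1024)            // one reusable read buffer
        hashes := map[[md5.Size]byte][]string{} // digest -> paths with that digest

        filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
            if err != nil || !info.Mode().IsRegular() {
                return nil
            }
            f, err := os.Open(path)
            if err != nil {
                return nil
            }
            defer f.Close()

            h := md5.New()
            if _, err := io.CopyBuffer(h, f, buf); err != nil {
                return nil
            }
            var sum [md5.Size]byte
            copy(sum[:], h.Sum(nil))
            hashes[sum] = append(hashes[sum], path)
            return nil
        })

        for sum, paths := range hashes {
            if len(paths) > 1 {
                fmt.Printf("%x has %d copies: %v\n", sum, len(paths), paths)
            }
        }
    }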

Sri G

Oct 15, 2016, 9:46:00 PM
to golang-nuts
Thanks. I made the Go code similar to the Python version by using io.CopyBuffer with a block size of 65536.

    buf := make([]byte, 65536) // 64 KiB buffer, matching the Python script's block size

    if _, err := io.CopyBuffer(hash, file, buf); err != nil {
        fmt.Println(err)
    }

It didn't make much of a difference; it was slightly faster.

What got it to the same place was running ComputeHash in the same goroutine as the Walk function instead of in its own goroutine for each file:

+    ComputeHash(path, info, queue, wg)
-    go ComputeHash(path, info, queue, wg)


2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~ 7MB

Here are the before and after pprof webs (graph images not reproduced here):

BEFORE, with 'go ComputeHash(...)': [pprof web graph]

AFTER, with 'ComputeHash(...)': [pprof web graph]

Since disk reads are so much slower, computing the hash for each file in its own goroutine caused a huge slowdown.

Btw, the numbers above are on RAID10; on an SSD:

Old code, SSD: 3.31s user 17.87s system 244% cpu 8.667 total

New code, SSD: 2.88s user 0.84s system 69% cpu 5.369 total

Shows you can throw hardware at a problem, BUT the old code locks up my system momentarily.
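
One way I could keep some parallelism for the SSD without a goroutine per file is to bound concurrency with a buffered-channel semaphore (just a sketch, not the code in the repo; hashOneFile is a hypothetical stand-in for ComputeHash):

    sem := make(chan struct{}, 8) // at most 8 files hashed at once
    var wg sync.WaitGroup

    filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil || !info.Mode().IsRegular() {
            return nil
        }
        wg.Add(1)
        sem <- struct{}{} // blocks while 8 hashes are already in flight
        go func(p string) {
            defer wg.Done()
            defer func() { <-sem }()
            hashOneFile(p) // hypothetical stand-in for ComputeHash
        }(path)
        return nil
    })
    wg.Wait()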

Michael Jones

Oct 16, 2016, 1:26:24 PM
to Sri G, golang-nuts

Sri G,

How does this time compare to my “Dup” program? I can’t test for you…since it is your filesystem…but I thought I had it going about as fast as possible a few years ago when I wrote that one.

https://github.com/MichaelTJones/dup

Michael


Sri G

Oct 16, 2016, 3:17:40 PM
to golang-nuts, sriakhil...@gmail.com
This isn't exactly the same because I deleted some files, but it shouldn't really matter.

I switched it to MD5:

--- a/dup.go
+++ b/dup.go
@@ -3,7 +3,7 @@ package main
-       "crypto/sha256"
+       "crypto/md5"

@@ -207,8 +207,8 @@ func main() {
+type Hash [16]byte // appropriate for MD5
+// type Hash [32]byte // appropriate for SHA-256

 func hashFile(p string, hash []byte, prefix int64) (count int64) {
@@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64) {
+       hasher := md5.New() // select MD5 in concert with "Hash" above
+       // hasher := sha256.New() // select SHA-256 in concert with "Hash" above


Checking only same-sized files is a huge speedup (82x fewer bytes checked):

2016/10/16 14:33:51      total:      566 files ( 100.00%),    1667774744 bytes ( 100.00%)
2016/10/16 14:33:51   examined:        9 files (   1.59%),      20271440 bytes (   1.22%) in 0.4209 seconds
2016/10/16 14:33:51 duplicates:        9 files (   1.59%),      20271440 bytes (   1.22%)

Checking the first 4KB of files and only hashing the full file if the prefixes match is another cool optimization (on average 768x fewer bytes checked in my case). Really nice, Michael.

With workers = 8:

RAID10: 0.05s user 0.04s system 37% cpu 0.231 total; couldn't check memory usage, but it's probably negligible
SSD:    0.05s user 0.04s system 59% cpu 0.137 total

Since SSDs and my filesystem are optimized for 4K random reads, it makes sense to use multiple threads/goroutines.

Optimal # of workers=9 on RAID 10: 0.05s user 0.04s system 40% cpu 0.220 total
on SSD workers = 8~9:              0.04s user 0.04s system 68% cpu 0.117 total

Not so much when you're doing a full sequential read. Because I use the md5 for other purposes, the entire file must be hashed, so sadly I can't use these optimizations.
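
Even though I can't use them here, a simplified sketch of the two pruning steps for anyone following along (my own illustration, not Michael's actual dup code):

    package main

    import (
        "crypto/md5"
        "fmt"
        "io"
        "os"
    )

    // groupBySize buckets paths by file size; only buckets with more than one
    // entry can possibly contain duplicates.
    func groupBySize(paths []string) map[int64][]string {
        bySize := map[int64][]string{}
        for _, p := range paths {
            if info, err := os.Stat(p); err == nil && info.Mode().IsRegular() {
                bySize[info.Size()] = append(bySize[info.Size()], p)
            }
        }
        return bySize
    }

    // prefixKey hashes only the first 4 KB of a file, a cheap filter before
    // deciding whether a full-file hash is worth doing.
    func prefixKey(path string) ([md5.Size]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return [md5.Size]byte{}, err
        }
        defer f.Close()

        buf := make([]byte, 4096)
        n, err := io.ReadFull(f, buf)
        if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
            return [md5.Size]byte{}, err
        }
        return md5.Sum(buf[:n]), nil
    }

    func main() {
        // Bucket the command-line arguments by size, then by 4 KB prefix; only
        // the surviving groups would need a full-file hash.
        for size, group := range groupBySize(os.Args[1:]) {
            if len(group) < 2 {
                continue // unique size, cannot be a duplicate
            }
            byPrefix := map[[md5.Size]byte][]string{}
            for _, p := range group {
                if key, err := prefixKey(p); err == nil {
                    byPrefix[key] = append(byPrefix[key], p)
                }
            }
            for _, candidates := range byPrefix {
                if len(candidates) > 1 {
                    fmt.Printf("size %d: full-hash candidates: %v\n", size, candidates)
                }
            }
        }
    }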

Michael Jones

Oct 16, 2016, 4:32:34 PM
to Sri G, golang-nuts

Oh, I see. Well if you must read and hash every byte of every file then you really are mostly measuring device speed.

Sri G

Oct 21, 2016, 3:15:05 PM
to golang-nuts, sriakhil...@gmail.com
Yea :/

Appreciate you sharing your project and your code! I learned a lot of useful Go patterns (referencing a fixed-size byte buffer as a slice) and how to re-use byte buffers, like in the Python version, to keep memory usage down.
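
For example, the pattern looks roughly like this (my own minimal illustration, not code from dup):

    var sum [md5.Size]byte // fixed-size array owned by the caller, reused per file
    h := md5.New()
    if _, err := io.Copy(h, file); err != nil { // file: an already-opened *os.File
        fmt.Println(err)
    }
    h.Sum(sum[:0]) // appends the 16-byte digest into sum's backing array, no extra allocation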