Resumable Hash Functions


josh...@docker.com

Apr 2, 2015, 9:24:14 PM
to golan...@googlegroups.com
Hi golang-dev,

My team is building an application where clients upload content to a digest-addressed store. Clients push up the data in multiple chunks with multiple HTTP requests, then finalize the upload by telling the server what the SHA2 digest of the content should be. The server verifies this then places the content in the store, keyed by that SHA2 digest. The problem that we have is that final step where the server has to verify the digest. It has to read all of the content again from the backend and compute the digest to compare it with what the client computed. This is very slow for large uploads.

A performance optimization that we've experimented with is for the server to use a resumable version of SHA2, which allows you to snapshot the current digest state at the end of each upload chunk request and store it, so that a hash object can be restored from that state on the next chunk upload. When finalizing the upload, the server only has to restore the digest state, call `Sum(nil)`, and compare it with the digest from the client. This saves us the slow/expensive read from our storage backend (AWS S3).

I was able to do this by forking `crypto/sha256` and `crypto/sha512` and adding 3 new methods to implement a new ResumableHash interface:

    type ResumableHash interface {
        // ResumableHash is a superset of hash.Hash.
        hash.Hash
        // Len returns the number of bytes written to the Hash so far.
        Len() uint64
        // State returns a snapshot of the state of the Hash.
        State() ([]byte, error)
        // Restore resets the Hash to the given state.
        Restore(state []byte) error
    }

and I've published the implementation here: https://github.com/jlhawn/go-crypto

Of course, I didn't modify any of the SHA2 logic. All of my changes are in a new file [1], which uses `encoding/gob` to marshal/unmarshal the `digest` struct, plus a new test file [2] which compares it with the stdlib Hash to ensure that State() and Restore() work correctly. Any other changes are only to imports or return types.


I know that there are very few use cases for this, but it is a huge performance improvement for our application. Is there any chance of getting something like this implemented upstream in the Go standard library?

- Josh


Adam Langley

Apr 3, 2015, 1:21:04 PM
to josh...@docker.com, golang-dev
The standard answer for hashing large amounts of data is
https://en.wikipedia.org/wiki/Merkle_tree


Cheers

AGL

josh...@docker.com

Apr 3, 2015, 4:38:53 PM
to golan...@googlegroups.com, josh...@docker.com
Thanks for the reply AGL,

My team has considered using hash trees, but unfortunately that's not an option for us at the moment because it would require changing our API. Perhaps a later version of our API will use hash trees, though it does make the client a bit more complex. Given that we can't go out and change the already existing clients, do you think that the solution I outlined with the resumable hash interface is okay for us to use for now? It only requires us to change our server to use resumable hashes, while our clients require no change at all.

On the topic of hash trees and the stdlib, what do you guys think of having a generic hash tree implementation added to the `hash` package? It could have a constructor that takes 3 arguments:

- A `hash.Hash` object, like `sha256.New()`. This will be the digest algorithm used for each block and node of the tree.
- A desired block size, say 1024*1024 (for 1MB blocks).
- A branching factor (number of child nodes), 2 for a binary hash tree and more if you so desire.

- Josh

Brad Fitzpatrick

Apr 6, 2015, 3:57:51 AM
to josh...@docker.com, Adam Langley, golang-dev
On Fri, Apr 3, 2015 at 10:38 PM, <josh...@docker.com> wrote:
Thanks for the reply AGL,

My team has considered using hash trees, but unfortunately that's not an option for us at the moment because it would require changing our API. Perhaps a later version of our API will use hash trees, though it does make the client a bit more complex. Given that we can't go out and change the already existing clients, do you think that the solution I outlined with the resumable hash interface is okay for us to use for now? It only requires us to change our server to use resumable hashes, while our clients require no change at all.

On the topic of hash trees and the stdlib, what do you guys think of having a generic hash tree implementation added to the `hash` package?

Camlistore uses hash trees and I can't think of a generic interface which would be useful in the standard library.

I have, however, found myself wanting something like your original request:

    package sha1
    func NewFromSum(sum [Size]byte) hash.Hash

... but I've generally worked around it.

I think your proposed "ResumableHash" interface is probably too much.

Damian Gryski

Apr 8, 2015, 8:04:08 AM
to golan...@googlegroups.com, josh...@docker.com

On Friday, April 3, 2015 at 3:24:14 AM UTC+2, josh...@docker.com wrote:
Hi golang-dev,

My team is building an application where clients upload content to a digest-addressed store. Clients push up the data in multiple chunks with multiple HTTP requests, then finalize the upload by telling the server what the SHA2 digest of the content should be. The server verifies this then places the content in the store, keyed by that SHA2 digest. The problem that we have is that final step where the server has to verify the digest. It has to read all of the content again from the backend and compute the digest to compare it with what the client computed. This is very slow for large uploads.

A performance optimization that we've experimented with is for the server to use a resumable version of SHA2, which allows you to snapshot the current digest state at the end of each upload chunk request and store it, so that a hash object can be restored from that state on the next chunk upload. When finalizing the upload, the server only has to restore the digest state, call `Sum(nil)`, and compare it with the digest from the client. This saves us the slow/expensive read from our storage backend (AWS S3).

I was able to do this by forking `crypto/sha256` and `crypto/sha512` and adding 3 new methods to implement a new ResumableHash interface:

    type ResumableHash interface {
        // ResumableHash is a superset of hash.Hash.
        hash.Hash
        // Len returns the number of bytes written to the Hash so far.
        Len() uint64
        // State returns a snapshot of the state of the Hash.
        State() ([]byte, error)
        // Restore resets the Hash to the given state.
        Restore(state []byte) error
    }



FWIW, a friend implemented a Tor relay in Go which required patching the sha1 package to add .Clone() to be able to add some data to a hash and back it out if the digest didn't match.  Just an example of a second real-world use case.

Damian