Understanding hash.Hash's io.WriteString and Sum()


Ukiah Danger Smith

Feb 16, 2014, 12:03:03 PM
to golan...@googlegroups.com
Help me understand the hash.Hash interface. What is the difference between using io.WriteString and h.Sum()?

Why would you use one over the other? Or are they meant to be used together somehow?

I know they are different because of this simple example: http://play.golang.org/p/T06uCN4FJ-


Thanks,

Ukiah
Message has been deleted

Ukiah Danger Smith

Feb 16, 2014, 12:29:58 PM
to golan...@googlegroups.com


On Sunday, February 16, 2014 12:12:33 PM UTC-5, Islan Dberry wrote:
Write adds data to the running hash. Sum returns the current hash. Typical use is to call Write one or more times followed by Sum.

The argument to Sum is not added to the running hash. http://golang.org/pkg/hash/#Hash


This is very close to what the documentation says, but I don't understand the underlying theory. My experience with hash functions is with implementations that take a string and return a hash as a string.

What does it mean to add data to the running hash? What is a running hash? Would I use Write() to add a secret key to the hash, and use Sum() to get the specific hashes of the data I want to hash? Why does using Write() vs. Sum() return different data for the same string? I don't understand what they are doing differently from each other.

Caleb Spare

Feb 16, 2014, 12:54:27 PM
to Ukiah Danger Smith, golang-nuts
On Sun, Feb 16, 2014 at 9:29 AM, Ukiah Danger Smith <super...@walledcity.com> wrote:


On Sunday, February 16, 2014 12:12:33 PM UTC-5, Islan Dberry wrote:
Write adds data to the running hash. Sum returns the current hash. Typical use is to call Write one or more times followed by Sum.

The argument to Sum is not added to the running hash. http://golang.org/pkg/hash/#Hash


This is very close to what the documentation says, but I don't understand the underlying theory. My experience with hash functions is with implementations that take a string and return a hash as a string.

The idea is that you can compute a hash iteratively, chunk-by-chunk, rather than all at once. This allows you to compute the hash of a large amount of data (say a big file, or an unknown-length stream) as you read it, rather than buffering all the data at once.
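
A minimal sketch of that point, assuming SHA-1 and two illustrative helper names (hashWhole and hashChunks are not from the thread): feeding the same bytes in one Write or in many small Writes should produce the same digest.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hashWhole hashes all of data with a single Write call.
func hashWhole(data []byte) string {
	h := sha1.New()
	h.Write(data) // hash.Hash documents that Write never returns an error
	return hex.EncodeToString(h.Sum(nil))
}

// hashChunks feeds the same bytes to the hash in pieces of at most size bytes.
func hashChunks(data []byte, size int) string {
	if size < 1 {
		size = 1 // guard against a non-positive chunk size
	}
	h := sha1.New()
	for len(data) > 0 {
		n := size
		if n > len(data) {
			n = len(data)
		}
		h.Write(data[:n])
		data = data[n:]
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	data := []byte("The quick brown fox jumps over the lazy dog")
	fmt.Println(hashWhole(data))
	fmt.Println(hashChunks(data, 7)) // same digest as the line above
}
```

Only one chunk needs to be in memory at a time, which is what makes this approach workable for large files and streams.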

Sum() appends to the provided buffer, which allows you to reuse a buffer rather than generating garbage every time.

This is the mode of operation that the Hash interface supports, although you can also hash a big chunk of data in one go by calling Write(yourdata) and then Sum(nil). It's a little more complicated than something simpler, such as having each hash be a func([]byte) []byte, but it lets you be much more efficient in many common use cases in ways that a simplistic hash interface would not allow.

It's also worth pointing out that some of the particular hash implementations (although not the general Hash interface) have shorthand methods for the simpler use case as well: for instance sha1.Sum or crc32.ChecksumIEEE.
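
As a rough illustration of those shorthand functions (the helper names sha1Hex and sha1HexStreaming are made up for this sketch), the one-shot sha1.Sum and the Write-then-Sum(nil) route should agree:

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"hash/crc32"
)

// sha1Hex uses the one-shot helper. sha1.Sum returns a [20]byte array,
// not a slice.
func sha1Hex(data []byte) string {
	sum := sha1.Sum(data)
	return fmt.Sprintf("%x", sum)
}

// sha1HexStreaming computes the same digest through the hash.Hash interface.
func sha1HexStreaming(data []byte) string {
	h := sha1.New()
	h.Write(data)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	data := []byte("hello")
	fmt.Println(sha1Hex(data))
	fmt.Println(sha1HexStreaming(data)) // same digest as the line above
	fmt.Printf("%08x\n", crc32.ChecksumIEEE(data))
}
```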


What does it mean to add data to the running hash? What is a running hash? Would I use Write() to add a secret key to the hash, and use Sum() to get the specific hashes of the data I want to hash? Why does using Write() vs. Sum() return different data for the same string? I don't understand what they are doing differently from each other.

Please re-read the hash.Hash documentation comments. They are terse, but they are also fairly precise and complete, and they answer most of these questions. In particular, note that Sum() "does not change the underlying hash state", so you should not pass data to Sum() that you want to be hashed. You can provide nil if there's no []byte that you're trying to reuse.
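
A small sketch of why the two calls give different results for the same string (viaWrite and viaSumArg are illustrative names, and SHA-1 is assumed): data passed to Write is hashed, while data passed to Sum is only used as a prefix for the returned slice.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// viaWrite hashes s by writing it into the running hash.
func viaWrite(s string) []byte {
	h := sha1.New()
	h.Write([]byte(s))
	return h.Sum(nil) // 20 bytes: the SHA-1 digest of s
}

// viaSumArg passes s to Sum instead. Nothing was written, so the result is
// the bytes of s followed by the digest of empty input.
func viaSumArg(s string) []byte {
	h := sha1.New()
	return h.Sum([]byte(s)) // len(s) + 20 bytes
}

func main() {
	fmt.Printf("% x\n", viaWrite("hello"))  // 20 bytes
	fmt.Printf("% x\n", viaSumArg("hello")) // 25 bytes, starting with "hello"
}
```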


Ukiah Danger Smith

Feb 16, 2014, 3:47:41 PM
to golan...@googlegroups.com, Ukiah Danger Smith


On Sunday, February 16, 2014 12:54:27 PM UTC-5, Caleb Spare wrote:

The idea is that you can compute a hash iteratively, chunk-by-chunk, rather than all at once. This allows you to compute the hash of a large amount of data (say a big file, or an unknown-length stream) as you read it, rather than buffering all the data at once.

Thank you, this makes sense.  But I'm not certain of the mechanics of it. Would it work something like this?

    h := sha1.New()
    for {
        io.WriteString(h, getChunkFromVeryVeryLargeFile())
    }
    fmt.Printf("% x\n", h.Sum(nil))


This is the mode of operation that the Hash interface supports, although you can also hash a big chunk of data in one go by calling Write(yourdata) and then Sum(nil). It's a little more complicated than something simpler, such as having each hash be a func([]byte) []byte, but it lets you be much more efficient in many common use cases in ways that a simplistic hash interface would not allow.

It's also worth pointing out that some of the particular hash implementations (although not the general Hash interface) have shorthand methods for the simpler use case as well: for instance sha1.Sum or crc32.ChecksumIEEE.

sha1.Sum() is closer to my needs than hash.Hash is. But for the sake of learning I'm going to follow this a little further.
 
Please re-read the hash.Hash documentation comments. They are terse, but they are also fairly precise and complete, and they answer most of these questions. In particular, note that Sum() "does not change the underlying hash state", so you should not pass data to Sum() that you want to be hashed. You can provide nil if there's no []byte that you're trying to reuse.

OK, so the hash.Hash is a hash of []byte data that can be updated by adding new data via the io.Writer interface. As new data is added, the hash value updates. Is that right?

I'm still confused, then, as to what Sum() is used for. The docs say that Sum() appends the current hash to the passed []byte slice and returns it. So you end up with a []byte slice that consists of what you passed plus the current hash. What would this be used for? It looks like Sum() is only useful when passing it nil and getting the current hash value. What else would be passed to Sum()? And what would I use the returned []byte slice for, if it isn't for data and it isn't the hash?

 
Sum() appends to the provided buffer, which allows you to reuse a buffer rather than generating garbage every time.
 
I don't know what you mean about generating garbage.

Thank you for your help, it's starting to make sense.

Ukiah


Ukiah Danger Smith

Feb 16, 2014, 4:23:32 PM
to golan...@googlegroups.com, Ukiah Danger Smith
I found an example of what is presumably a large file being processed with the hash.Hash interface:

https://github.com/bradfitz/camlistore/blob/830c6966a11ddb7834a05b6106b2530284a4d036/pkg/fileembed/genfileembed/genfileembed.go#L293

Caleb Spare

Feb 16, 2014, 4:26:42 PM
to Ukiah Danger Smith, golang-nuts
On Sun, Feb 16, 2014 at 12:47 PM, Ukiah Danger Smith <super...@walledcity.com> wrote:


On Sunday, February 16, 2014 12:54:27 PM UTC-5, Caleb Spare wrote:

The idea is that you can compute a hash iteratively, chunk-by-chunk, rather than all at once. This allows you to compute the hash of a large amount of data (say a big file, or an unknown-length stream) as you read it, rather than buffering all the data at once.

Thank you, this makes sense.  But I'm not certain of the mechanics of it. Would it work something like this?

    h := sha1.New()
    for {
        io.WriteString(h, getChunkFromVeryVeryLargeFile())
    }
    fmt.Printf("% x\n", h.Sum(nil))

No need to use strings unnecessarily -- this is probably a case where you would use []byte throughout.

That's the general idea though. This might give you more ideas -- it wraps an io.Reader and computes the SHA-1 hash as you read through it:
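
A minimal sketch of such a wrapper, not the code originally linked. The type and function names (hashingReader, newHashingReader, hashStream, hashString) are hypothetical, and SHA-1 is assumed:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"hash"
	"io"
	"strings"
)

// hashingReader is a hypothetical wrapper: it passes reads through to r and
// feeds every byte it returns into h.
type hashingReader struct {
	r io.Reader
	h hash.Hash
}

func newHashingReader(r io.Reader) *hashingReader {
	return &hashingReader{r: r, h: sha1.New()}
}

// Read implements io.Reader and updates the running hash as a side effect.
func (hr *hashingReader) Read(p []byte) (int, error) {
	n, err := hr.r.Read(p)
	hr.h.Write(p[:n]) // hash only the n bytes actually read
	return n, err
}

// SumHex returns the hex digest of everything read so far.
func (hr *hashingReader) SumHex() string {
	return hex.EncodeToString(hr.h.Sum(nil))
}

// hashStream drains r through the wrapper and returns the final digest.
func hashStream(r io.Reader) (string, error) {
	hr := newHashingReader(r)
	if _, err := io.Copy(io.Discard, hr); err != nil {
		return "", err
	}
	return hr.SumHex(), nil
}

// hashString is a convenience for trying the wrapper on in-memory data.
func hashString(s string) (string, error) {
	return hashStream(strings.NewReader(s))
}

func main() {
	sum, err := hashString("The quick brown fox jumps over the lazy dog")
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```

The standard library's io.TeeReader(r, h) gives a similar effect without a custom type, since every hash.Hash is also an io.Writer.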




This is the mode of operation that the Hash interface supports, although you can also hash a big chunk of data in one go by calling Write(yourdata) and then Sum(nil). It's a little more complicated than something simpler, such as having each hash be a func([]byte) []byte, but it lets you be much more efficient in many common use cases in ways that a simplistic hash interface would not allow.

It's also worth pointing out that some of the particular hash implementations (although not the general Hash interface) have shorthand methods for the simpler use case as well: for instance sha1.Sum or crc32.ChecksumIEEE.

sha1.Sum() is closer to my needs than hash.Hash is. But for the sake of learning I'm going to follow this a little further.
 
Please re-read the hash.Hash documentation comments. They are terse, but they are also fairly precise and complete, and they answer most of these questions. In particular, note that Sum() "does not change the underlying hash state", so you should not pass data to Sum() that you want to be hashed. You can provide nil if there's no []byte that you're trying to reuse.

OK, so the hash.Hash is a hash of []byte data that can be updated by adding new data via the io.Writer interface. As new data is added, the hash value updates. Is that right?

yes
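
To make the "running hash" idea concrete, here is a short sketch (runningSums is an illustrative name, SHA-1 is assumed). Sum can be called at any point to read the digest of everything written so far, and later writes continue from the same state:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
)

// runningSums writes each part to a single hash and records the digest after
// every write. Sum does not reset or alter the state, so each digest covers
// all parts written up to that point.
func runningSums(parts ...string) []string {
	h := sha1.New()
	sums := make([]string, 0, len(parts))
	for _, p := range parts {
		io.WriteString(h, p)
		sums = append(sums, hex.EncodeToString(h.Sum(nil)))
	}
	return sums
}

func main() {
	for _, s := range runningSums("The quick brown fox ", "jumps over ", "the lazy dog") {
		fmt.Println(s)
	}
	// The last line printed is the SHA-1 of the whole sentence.
}
```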
 

I'm still confused, then, as to what Sum() is used for. The docs say that Sum() appends the current hash to the passed []byte slice and returns it. So you end up with a []byte slice that consists of what you passed plus the current hash. What would this be used for?

It's useful for the not-generating garbage thing I mentioned. See my example below.
 
It looks like Sum() is only useful when passing it nil and getting the current hash value.

It's true that this API can be confusing (I didn't understand it without staring at it for a bit and I've noticed it trips up a lot of newcomers). Most of the time when I've used it, I have passed in nil. But there are valid reasons to pass in a non-nil slice; see below.
 
What else would be passed to Sum()? And what would I use the returned []byte slice for, if it isn't for data and it isn't the hash?

It is the hash, appended to the provided buffer. Here is an example of reusing a buffer to avoid generating garbage:


The first iteration, we're passing in a nil []byte, so Sum will append to nil and allocate some space, but every iteration thereafter, we're reusing the same memory and overwriting it, and so we avoid generating new garbage to be collected on every loop iteration.
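
A sketch of that buffer-reuse pattern, under the assumption that many inputs are hashed in a loop with SHA-1 (countMatching and digest are hypothetical names, not the code from the original message). "Garbage" here means heap memory that is allocated, used once, and then left for the garbage collector to reclaim:

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"fmt"
)

// digest returns the SHA-1 of data as a slice.
func digest(data []byte) []byte {
	sum := sha1.Sum(data)
	return sum[:]
}

// countMatching reports how many inputs have the SHA-1 digest want. It reuses
// one hash and one output buffer for every input.
func countMatching(inputs [][]byte, want []byte) int {
	h := sha1.New()
	var buf []byte // nil on the first iteration, so Sum allocates once
	n := 0
	for _, in := range inputs {
		h.Reset() // start a fresh hash for this input
		h.Write(in)
		// buf[:0] keeps the storage but sets the length to zero, so Sum
		// overwrites the previous digest instead of allocating a new slice.
		buf = h.Sum(buf[:0])
		if bytes.Equal(buf, want) {
			n++
		}
	}
	return n
}

func main() {
	inputs := [][]byte{[]byte("a"), []byte("b"), []byte("b")}
	fmt.Println(countMatching(inputs, digest([]byte("b")))) // prints 2
}
```

Calling h.Sum(nil) inside the loop would also be correct; it would just allocate a fresh 20-byte slice on each iteration.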
 

 


Ukiah Danger Smith

Feb 16, 2014, 6:01:35 PM
to golan...@googlegroups.com, Ukiah Danger Smith


On Sunday, February 16, 2014 4:26:42 PM UTC-5, Caleb Spare wrote:


It is the hash, appended to the provided buffer. Here is an example of reusing a buffer to avoid generating garbage:


The first iteration, we're passing in a nil []byte, so Sum will append to nil and allocate some space, but every iteration thereafter, we're reusing the same memory and overwriting it, and so we avoid generating new garbage to be collected on every loop iteration.

Ah! Ok, now I think I'm beginning to understand it better.

Thank you.