Something wrong doing HASH (SHA1 and MD5)


Phrozen

Jun 10, 2011, 5:21:00 PM
to golang-nuts
Hi, I'm starting to get into Go. I have programming experience, but
this is pretty new to me. So, in the process of learning, I tried to make a
simple command-line MD5/SHA1 calculator (simple and useful). The problem
is: everything compiles and runs as intended, but my Go hashes are
different from other tools' hashes, so maybe there is something wrong with my
implementation.

Code: http://pastebin.com/mkDhd4J5

It's pretty simple and straightforward. I tried to use as much of the
language as I could. I used the flag package, and I read somewhere that since
hashes implement the Writer interface and files implement the Reader interface, I could
simply use io.Copy to do the "math", so maybe that's the problem.
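
In essence the approach is this (a trimmed-down sketch rather than the exact
pastebin code, written with present-day Go syntax; the original uses the
pre-Go 1 API and flag handling is omitted here):

-----------------
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open(os.Args[1]) // flag handling omitted in this sketch
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	h := md5.New()                           // hash.Hash is an io.Writer
	if _, err := io.Copy(h, f); err != nil { // *os.File is an io.Reader
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("MD5: %x\n", h.Sum(nil))
}
-----------------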

Well, thanks for your help, and PLEASE correct anything you see fit:
syntactic sugar commonly used in Go, optimization using Go natives,
idioms, etc. TY

Guillermo Estrada

Marko Mikulicic

Jun 10, 2011, 5:42:37 PM
to Phrozen, golang-nuts
Hi Guillermo,

I made a couple of changes to fix a compilation issue I had with your code and made the output match the format
of the Unix md5sum and sha1sum tools:

https://gist.github.com/1019818

As you can see, the computed hash is the same. Which other tools did you use for comparison?
Keep in mind that a hash is a sequence of bytes which can be rendered in different forms. This is the hex encoding, which is
the most common, but you might also encounter a base64 encoding (e.g. o/gSuf4ykyKj1rAHImlOhA==).

Cheers,
Marko

On 10 June 2011 23:21, Phrozen <phro...@gmail.com> wrote:
Hi, I'm starting to get into Go. I have programming experience, but
this is pretty new to me. So, in the process of learning, I tried to make a
simple command-line MD5/SHA1 calculator (simple and useful). The problem
is: everything compiles and runs as intended, but my Go hashes are
different from other tools' hashes, so maybe there is something wrong with my
implementation.

Code: http://pastebin.com/mkDhd4J5

It's pretty simple and straightforward. I tried to use as much of the
language as I could. I used the flag package, and I read somewhere that since
hashes implement the Writer interface and files implement the Reader interface, I could
simply use io.Copy to do the "math", so maybe that's the problem.

It works. There is also another nice thing about the Go io libraries which I found very useful:

...
 dst := io.MultiWriter(destfile, md5h)
 io.Copy(dst, somereader)
 ...

This can be used to save a given reader to a file and compute the hash in one pass, with only a couple of lines of code. 
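
Fleshed out a bit, it could look roughly like this (just a sketch with made-up
names, in present-day Go syntax; assumes the crypto/md5, fmt, io and os imports):

func saveAndHash(dstPath string, src io.Reader) (string, error) {
	dst, err := os.Create(dstPath)
	if err != nil {
		return "", err
	}
	defer dst.Close()

	h := md5.New()
	// every write is duplicated to both the file and the hash
	if _, err := io.Copy(io.MultiWriter(dst, h), src); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}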

Cheers,
Marko

Kyle Lemons

Jun 10, 2011, 5:50:10 PM
to Phrozen, golang-nuts
If you're interested, I modified your code a bit to demonstrate a few of the standard Go idioms, but other than that I didn't see any issues with your code (other than the previously mentioned failure to compile):

And it produces the correct sums for me:
$ ./mdsha -md5 mdsha
MD5: a8f6b394a802b9d4b0e755cacf011ca6
$ md5sum mdsha
a8f6b394a802b9d4b0e755cacf011ca6  mdsha

Phrozen

Jun 10, 2011, 5:56:59 PM
to golang-nuts
Thanks Marko, and sorry about those errors; I hope it was just the 'f' on the flag
variables (the funny thing is I had already corrected that, I don't know why it
didn't stick, and you used exactly the same names, just adding 'f'). Well...

phrozen@phrozen-VirtualBox:~/Ubuntu One/code/hash$ 6g hash.go
phrozen@phrozen-VirtualBox:~/Ubuntu One/code/hash$ 6l -o hash hash.6
phrozen@phrozen-VirtualBox:~/Ubuntu One/code/hash$ ./hash -md5 -sha1 hash.go
MD5: d3220931f087e46477988da02bbd6cfc
SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
phrozen@phrozen-VirtualBox:~/Ubuntu One/code/hash$ md5sum hash.go
d3220931f087e46477988da02bbd6cfc hash.go
phrozen@phrozen-VirtualBox:~/Ubuntu One/code/hash$ sha1sum hash.go
c2a6aa2a3040716d90b286d6f39378b5e217d37b hash.go

MD5 seems to be fine, but SHA1....

On Jun 10, 4:42 pm, Marko Mikulicic <marko.mikuli...@isti.cnr.it>
wrote:
> Hi Guillermo,
>
> I made a couple of changes to fix a compilation issue I had with your code
> and made the output in the same format
> as the unix md5sum and sha1sum tools.
>
> https://gist.github.com/1019818
>
> As you can see the computed hash is the same. Which other tools did you see
> for comparison?
> Keep in mind that the hash is a sequence of bytes which can be rendered in
> different forms. This is the hex encoding, which is
> more popular but you might also encounter a base64 encoding
> (e.g. o/gSuf4ykyKj1rAHImlOhA==)
>
> Cheers,
> Marko
>

Kyle Lemons

Jun 10, 2011, 6:01:05 PM
to Phrozen, golang-nuts
Ah.  If you have a look at my code, you'll see that I don't allow for multiple hashes simultaneously.  My guess is that you read the entire file into the MD5 hash and there is nothing left to read into the SHA1 hash.

Cheers,
~K

Phrozen

Jun 10, 2011, 6:06:39 PM
to golang-nuts
Excellent!!! I love what you did. I was trying to reuse the variable by
declaring it as Hash, but the docs say Hash is in the "crypto" package
and I didn't get it right (it should be hash.Hash).

Apart from that, I didn't use a switch clause because I want to be able
to use one flag, the other, or both (both flags calculate both
hashes in one run), and a switch excludes the other case (like if-else), so
I used two ifs so that each one calculates and prints its part. I like how you
showed the Go idioms and common formatting; I'll be needing that a
lot.

Apart from that, as you can see, I get MD5 right but not SHA1. Any
insight?

Phrozen

Jun 10, 2011, 6:09:22 PM
to golang-nuts
SO, TRUE! I'll have to reset the "pointer/cursor" or reopen the file.
Thanks for all your insight. I'll be porting a raytracer, and I
need to get used to all of this Go idiom/interface/type stuff
before even hoping to do without OOP.
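
Something like this is what I have in mind (untested sketch, error handling
omitted, present-day Go syntax, usual imports assumed):

f, _ := os.Open(fname)
m := md5.New()
io.Copy(m, f) // first pass consumes the file

f.Seek(0, 0) // rewind to the start before the second hash
s := sha1.New()
io.Copy(s, f)

fmt.Printf("MD5:  %x\nSHA1: %x\n", m.Sum(nil), s.Sum(nil))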

Phrozen

Jun 10, 2011, 6:12:13 PM
to golang-nuts
Just as a side note, I knew that hash looked familiar.

From Wikipedia, the hash of the zero-length string is:

SHA1("")
= da39a3ee 5e6b4b0d 3255bfef 95601890 afd80709

and mine was:

SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709

Thanks for the help!



Marko Mikulicic

Jun 10, 2011, 7:14:12 PM
to Phrozen, golang-nuts
Hi,

The problem with the previous examples is that there is a lot of code per hash type. Imagine you have 20 algorithms: in Kyle's version the io.Copy
invocation is only written once, but there are still several variables declared for each hash type.

Furthermore, the example can compute only one hash type at a time, because computing it consumes the input. One possibility
would be to re-seek the input to the beginning of the file, but this a) works only with files (not network sockets) and b) requires the file to be read again (if the file is huge you can feel the difference).
 
So, I wondered how to exploit io.MultiWriter in order to compute multiple hashes in one pass, and came up with this:

https://gist.github.com/1019970

It allows the user to compute md5, sha1, sha256, sha512 and ripemd160. You can add more hash types by just changing two lines of code (the algorithm name and the algorithm constructor).

It's still not perfect; in particular, I struggled to find a way to splice a slice into a variadic invocation (something like Python's "doit(arg, *otherargs)" or Scala's "doit(args :_*)"),
and I ended up with a chain (actually a leaning tree) of MultiWriters.
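
Roughly this shape (sketch only; it assumes writers is the []io.Writer of
selected hashes, with at least one element, and infile is the open file):

var dst io.Writer = writers[0]
for _, w := range writers[1:] {
	dst = io.MultiWriter(dst, w) // fold into nested two-writer MultiWriters
}
io.Copy(dst, infile)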


If there is no intention to enhance the language in order to provide this splicing, perhaps it could be useful to provide the definition of:

// MultiWriter creates a writer that duplicates its writes to all the
// provided writers, similar to the Unix tee(1) command.
func MultiWriter(writers ...Writer) Writer {
        return &multiWriter{writers}
}

func MultiWriterFromSlice(writers []Writer) Writer {
        return &multiWriter{writers}
}


Cheers,
Marko

Steven

Jun 10, 2011, 7:19:41 PM
to Marko Mikulicic, Phrozen, golang-nuts

Marko Mikulicic

Jun 10, 2011, 7:37:30 PM
to Steven, Phrozen, golang-nuts
On 11 June 2011 01:19, Steven <stev...@gmail.com> wrote:
Use `writers...`


Thank you. I had looked through that part of the spec without seeing it. The old mailing list threads are full of ways to fake this with reflection and other constructs, so I guess
it's a recent addition.

Cheers,
Marko

PS:

I made a mistake in https://gist.github.com/1019970, confusing length and capacity:

writers := make([]io.Writer, 2)

This creates a slice that already contains two nil values. It should have been:

writers := make([]io.Writer, 0, len(impls))
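
With that fix and the spread syntax, the selection code boils down to roughly
this (a sketch; impls and the flag lookup are assumptions standing in for what
the gist actually does):

writers := make([]io.Writer, 0, len(impls)) // length 0, capacity len(impls)
for name, h := range impls {
	if *selected[name] { // hypothetical: only the hashes requested on the command line
		writers = append(writers, h)
	}
}
io.Copy(io.MultiWriter(writers...), infile) // splice the slice into the variadic call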

Phrozen

Jun 10, 2011, 7:41:48 PM
to golang-nuts
I like what you implemented. From what I get from Steven's post:

dest := io.MultiWriter(writers...)
io.Copy(dest, infile)

would be the "correct" form. Also, you can create the writers
slice with the correct size by counting while checking the flags
(I don't know how the append implementation works, but appending usually has a penalty). I
love the idea of MultiWriter (I have to check the source, maybe it's
already parallelized), but it leaves no room for concurrency. In this
example, just copying the data to each writer in order is probably fine, but thinking of
other tasks, it would be nice to be able to concurrently calculate all the
given algorithms (writers) with the same input data (one file read). I
bring this up because, as I said, my plan is to port a raytracer,
and that is a case where you may have to run multiple writers from a
single data input. Any insight on this?



Kyle Lemons

Jun 10, 2011, 7:42:59 PM
to Steven, Marko Mikulicic, Phrozen, golang-nuts
Ah, now that's fun.  Now you can even use closures!

Phrozen

Jun 10, 2011, 7:55:43 PM
to golang-nuts
Nice!! Now, would it be easier to elaborate on Marko's code to define the
array of algorithms and then, while checking the flags, push the corresponding
hash.Hash into the writers slice? That way you can add as many hashes
as you want without an if for each one, just by adding to
both arrays (algos and impls). Although Marko's solution creates all the
hash.Hash values, even those not used, as far as I have read those are
only pointers and have no execution time/memory penalty, right?

Phrozen

Jun 10, 2011, 8:04:49 PM
to golang-nuts
So... just standing on the shoulders of the giants:

https://gist.github.com/1020054

phrozen@phrozen-pc:~/code/hash$ ./hash -md5 -sha1 -sha256 hash.go
md5: 17908cbf723b6745e5d59f5414ba55b5
sha1: b6c0029f576d928cc61fc770cce132c0bf30259b
sha256: 9ee6676271f3f19023a06f249d8c7b1de70ff3568bef0d4511a0227cf2ac275c

Marko Mikulicic

Jun 10, 2011, 7:59:26 PM
to Phrozen, golang-nuts
On 11 June 2011 01:55, Phrozen <phro...@gmail.com> wrote:
Nice!! Now, would it be easier to elaborate on Marko's code to define the
array of algorithms and then, while checking the flags, push the corresponding
hash.Hash into the writers slice? That way you can add as many hashes
as you want without an if for each one, just by adding to
both arrays (algos and impls). Although Marko's solution creates all the
hash.Hash values, even those not used, as far as I have read those are
only pointers and have no execution time/memory penalty, right?

It's possible that these hash packages involve some initialization overhead, but it's a small constant number of them. It might be
interesting to tweak as an exercise, but I don't think it's necessary in this concrete scenario. The important thing is that unnecessary hashes are not computed over the whole stream.

Cheers,
Marko 

Marko Mikulicic

Jun 10, 2011, 9:02:18 PM
to Phrozen, golang-nuts
On 11 June 2011 01:41, Phrozen <phro...@gmail.com> wrote:
I like what you implemented. From what I get from Steven's post:

dest := io.MultiWriter(writers...)
io.Copy(dest, infile)

would be the "correct" form. Also, you can create the writers
slice with the correct size by counting while checking the flags
(I don't know how the append implementation works, but appending usually has a penalty). I
love the idea of MultiWriter (I have to check the source, maybe it's
already parallelized), but it leaves no room for concurrency. In this
example, just copying the data to each writer in order is probably fine, but thinking of
other tasks, it would be nice to be able to concurrently calculate all the
given algorithms (writers) with the same input data (one file read). I
bring this up because, as I said, my plan is to port a raytracer,
and that is a case where you may have to run multiple writers from a
single data input. Any insight on this?

It depends on what you want to do.

If we continue with our use case here, we want to achieve three things:
a) scan the input file once, b) avoid keeping the whole input file in memory, and c) possibly parallelize the hash calculation
across multiple CPUs.

In my experience a+b are the most important. The vast majority of day-to-day performance issues in normal applications arise because
applications read whole files into memory, copy network buffers to temporary files so that they can do multiple passes, and so on (ease of use prevailed).
This is why I think modern toolkits/languages should make it easy to avoid this without too much work.

So, with a and b solved, how do we do c?

io.MultiWriter simply creates an implementation of the io.Writer interface that repeats each Write call to every writer it is configured with.
So, if you want to parallelize the hash computation, you could try an io.Writer implementation that delegates the actual writes to another goroutine, and set
the GOMAXPROCS env var > 1 (or call runtime.GOMAXPROCS(..)).

In practice there are issues regarding 1) termination and 2) error handling.

These could be solved by implementing WriteCloser instead of just Writer. Then, before taking the sums, we close all the parallel writers; this waits for all the goroutines to exit
and can perhaps raise an error if some of them failed.

Alternatively, you might want to make the error handling more explicit, expose an error channel from each parallel writer, and stop execution on the first error.


I tried something like this for a start:

-----------------
type ParallelWriter struct {
	writer io.Writer
	ch     chan []byte
}

func NewParallelWriter(writer io.Writer) ParallelWriter {
	ch := make(chan []byte, 10)

	// background goroutine: drain the channel and forward each chunk
	// to the underlying writer (e.g. a hash.Hash)
	go func() {
		for chunk := range ch {
			writer.Write(chunk)
		}
	}()

	return ParallelWriter{writer, ch}
}

func (self ParallelWriter) Write(p []byte) (n int, err os.Error) {
	// copy the buffer: the caller (io.Copy) reuses p for the next read
	cp := make([]byte, len(p))
	copy(cp, p)
	self.ch <- cp

	return len(p), nil
}
// ......
// ....
time.Sleep(5e8) // all writers have finished
// print the sums
--------------------------

(I wasted some time on a stupid bug: the reader reuses the slice for subsequent buffers, so you have to make a copy of the data before passing it through the channel.)

Here is the full example, which supports a "-parallel" option that attempts to use this trick:


Of course, the code that waits for the writers to finish is still missing.
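
One way to fill that gap would be roughly the following (an untested sketch
that reworks the type above slightly; it uses the current error type where the
code above used os.Error):

-----------------
type ParallelWriter struct {
	writer io.Writer
	ch     chan []byte
	done   chan struct{} // closed when the background goroutine exits
}

func NewParallelWriter(w io.Writer) *ParallelWriter {
	pw := &ParallelWriter{writer: w, ch: make(chan []byte, 100), done: make(chan struct{})}
	go func() {
		defer close(pw.done)
		for chunk := range pw.ch {
			pw.writer.Write(chunk)
		}
	}()
	return pw
}

func (pw *ParallelWriter) Write(p []byte) (int, error) {
	cp := make([]byte, len(p)) // copy: io.Copy reuses p
	copy(cp, p)
	pw.ch <- cp
	return len(p), nil
}

// Close replaces the time.Sleep hack: it closes the channel so the
// goroutine's range loop ends, then waits for it to finish.
func (pw *ParallelWriter) Close() error {
	close(pw.ch)
	<-pw.done
	return nil
}
-----------------

You would call Close on every parallel writer before printing the sums.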

I did a quick test doing only sha512 + sha256 on a biggish file, but I don't see huge benefits (the file is cached in memory). The CPU load also didn't increase much (although it improved after I increased the channel buffer size
from 10 to 100).
When I added all of the hashes, the load saturated my 2-core laptop and there was a noticeable gain. However, without ripemd160 (which is probably compute-intensive),
in the other cases the overhead probably isn't worth it:

GOMAXPROCS=2 ./hash -md5 -sha1 -sha512 -sha256 -ripemd160 -parallel   32.25s user 0.28s system 151% cpu 21.413 total
GOMAXPROCS=2 ./hash -md5 -sha1 -sha512 -sha256 -ripemd160   30.77s user 0.12s system 99% cpu 31.105 total



Cheers,
Marko

Phrozen

Jun 13, 2011, 11:16:26 AM
to golang-nuts
Hey Marko, I've been looking at your parallel version and it seems quite
nice. I'm starting to catch up with how interfacing via types is done
in Go and how simple it is to implement interfaces and just pass them to
methods; it's a nice way of duck typing. I also need to read up on
channels!! Those things look amazing. Anyway, now I see what a
concurrent implementation looks like. Just a question about the other
part. You mentioned 3 things to achieve:

a) scanning the input file once
b) avoid to keep the whole input file in memory
c) possibly parallelize hash calculation on multiple cpus

And you actually said that a+b are the most important (I agree), and that
those two were solved. But afterwards, in the benchmarks, you mention
that the biggish file is cached in memory. I would like the
os.File Read implementation and the io.Copy into hash.Hash.Write
to be buffered and parallel, but even though I haven't
checked the code, I'm pretty sure they aren't. Read will usually read
the whole file into memory and then Copy it via MultiWriter or
ParallelWriter to process and output. A buffered read of the file should
be used (for video files larger than 1 GB, or distro images and ISOs
larger than 4 GB that usually require SHA1 checks), so that you only ever
have one piece of the file in the buffer: send it to the writers and then
read some more. Can that be done via channels?
Also, it is my understanding that a buffered approach will of course mess
up the parallel approach, because not all algorithms calculate at the same
speed, and without having all of the file in memory to feed each
algorithm, you'll be waiting for the slowest one to finish.

Anyway, I want to say thanks a lot to both you and Kyle; I've been
learning a lot, and fast! I'll be reading you around soon.

Go Nuts!
Guillermo

On Jun 10, 8:02 pm, Marko Mikulicic <marko.mikuli...@isti.cnr.it>
wrote:
> GOMAXPROCS env var > 1 (or a call to runtime.GOMAXPROCS(..)).
>
> In practice there are issues regarding 1) termination and 2) error handling.
>
> This could be solved by implementing WriteCloser instead of just Writer.

Marko Mikulicic

Jun 13, 2011, 11:36:16 AM
to Phrozen, golang-nuts
On 13 June 2011 17:16, Phrozen <phro...@gmail.com> wrote:
Hey Marko, I've been looking at your parallel version and it seems quite
nice. I'm starting to catch up with how interfacing via types is done
in Go and how simple it is to implement interfaces and just pass them to
methods; it's a nice way of duck typing. I also need to read up on
channels!! Those things look amazing. Anyway, now I see what a
concurrent implementation looks like. Just a question about the other
part. You mentioned 3 things to achieve:

a) scanning the input file once
b) avoid to keep the whole input file in memory
c)  possibly parallelize hash calculation on multiple cpus

And you actually said that a+b are the most important (I agree), and that
those two were solved. But afterwards, in the benchmarks, you mention
that the biggish file is cached in memory.

I meant cached by the operating system. This was important in this mini-benchmark in order
to measure only the cpu speedup.
 
I would like the
os.File Read implementation and the io.Copy into hash.Hash.Write
to be buffered and parallel, but even though I haven't
checked the code, I'm pretty sure they aren't. Read will usually read
the whole file into memory and then Copy it via MultiWriter or
ParallelWriter to process and output.

Copy() is implemented by invoking Read() in a loop, each time reading into a small buffer (some kilobytes) and then writing it to the output.
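
In pseudo-form, Copy does roughly this (a simplified sketch, not the real
implementation, and using the current error type rather than os.Error):

-----------------
func copySketch(dst io.Writer, src io.Reader) (written int64, err error) {
	buf := make([]byte, 32*1024) // one fixed buffer, reused for every Read
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return written, werr
			}
			written += int64(n)
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}
-----------------

So only one small buffer is alive at a time, no matter how big the file is.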

You can see how many buffers your file is split into by inserting a print into our ParallelWriter.Write method:
-----------------
func (self ParallelWriter) Write(p []byte) (n int, err os.Error) {
	fmt.Printf("writing %d bytes\n", len(p)) // add this

	cp := make([]byte, len(p))
	copy(cp, p)
	self.ch <- cp
	.....
-----------------

Furthermore, Read() reuses the same buffer array (Copy passes the same buffer to every Read call). This is why we have to copy the data before sending it on the channel.

A buffered read of the file should
be used (for video files larger than 1 GB, or distro images and ISOs
larger than 4 GB that usually require SHA1 checks), so that you only ever
have one piece of the file in the buffer: send it to the writers and then
read some more. Can that be done via channels?
Also, it is my understanding that a buffered approach will of course mess
up the parallel approach, because not all algorithms calculate at the same
speed, and without having all of the file in memory to feed each
algorithm, you'll be waiting for the slowest one to finish.

Yes, this happens implicitly because the channels block: each of them holds a fixed number of buffers.
Of course, if one of our hash writers is too slow it will block the whole multiwriter.

This might be a problem for you, especially if you want to do something with the other hashes before the last one is finished.
I think it would be a premature and excessive optimization to worry about wasted cycles while faster hashes are waiting for data because the input
stream is blocked on a slow hasher: by definition this problem only matters when there are multiple CPUs and the other hashes are faster than the one
that causes the blocking, so whenever a new input buffer is unblocked, the other CPUs will have work to do and will eventually complete before the slower hash.
 
It would be worse if the speed of processing were uneven and irregular. But in that case one could just increase the buffering by making the channels longer.

You might want to try it out: instead of having one input file and one multiwriter with our parallel trick, just spawn N goroutines, each reading its own copy of the file
(OS caching will make it fast enough, but you might also want to read the file into a byte array and use the "bytes" package).

It would be interesting to see whether the overall time diverges significantly from the version capped by the slowest writer.
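
Roughly along these lines (an untested sketch; the function name is made up,
and it uses current Go syntax and assumes the obvious imports: hash, io, os,
sync, encoding/hex):

-----------------
// One goroutine per hash, each opening and reading the file on its own,
// so a slow hasher cannot block the others.
func hashIndependently(path string, algos map[string]func() hash.Hash) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for name, newHash := range algos {
		wg.Add(1)
		go func(name string, newHash func() hash.Hash) {
			defer wg.Done()
			f, err := os.Open(path) // each goroutine gets its own file handle
			if err != nil {
				return
			}
			defer f.Close()
			h := newHash()
			if _, err := io.Copy(h, f); err != nil {
				return
			}
			mu.Lock()
			results[name] = hex.EncodeToString(h.Sum(nil))
			mu.Unlock()
		}(name, newHash)
	}
	wg.Wait()
	return results
}
-----------------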


Anyway, I want to say thanks a lot to both you and Kyle; I've been
learning a lot, and fast! I'll be reading you around soon.

:-)

Cheers,
Marko
