Help creating a rsync "alike" tool in go

2,165 views
Skip to first unread message

josvazg

unread,
Mar 20, 2013, 10:49:05 AM3/20/13
to golan...@googlegroups.com
I am working on a small tool to allow receiving big (GByte sized) files over the Internet or though slow links in general in less time and with less data transfer, provided you already have a previous version of that big file, of course.

It is like something in between rsync and zsync:

The project is at github:

Right now I am just implementing a naive difference detection algorithm: 
The file is sliced in blocks of some size (1MiB by default) and only the blocks that match in-place in both sides are used from the local file.

This seems quite fast, (when both the server and client side bulk hash dumps are pregenerated), but does not cover shifting and shuffling changes that might happen as well sometimes.

To cover those cases rsync/zsync use a rolling checksum of a block size that can be calculated at any point (scrolling) just by adding the new byte to the checksum and "deleting/removing" the first from it.

I was pointed to rsync's author docs on rsync design:

And I now have a few go related questions...

Is go's hash/adler32 checksum a rolling hash I can use for this?

If so... How can I roll a byte from the end?
(Any code sample available?)

If not, is there already go implementation of a rolling checksum I could use for my tool?
(That would save me quite some time)

Any more help and guidance on this topic is well come.

Péter Szilágyi

unread,
Mar 20, 2013, 11:02:46 AM3/20/13
to josvazg, golang-nuts
Hi,

  Adler32 in general is a rolling hash. Go's implementation is not (usually it's a rare necessity to require such a thing).

  There's a nice wiki article about the hash: http://en.wikipedia.org/wiki/Adler-32, I'm quite sure you can figure out how to turn this into a rolling one (to chop off the first input byte) I think your best bet would be to "fork" the http://golang.org/src/pkg/hash/adler32/adler32.go and add the corrections yourself.

Hints (without too much thought):
  • A = (A - byte) mod 65521
  • B = (B - n * byte - 1) mod 65521

Cheers,
  Peter



--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

josvazg

unread,
Mar 21, 2013, 3:52:08 AM3/21/13
to golan...@googlegroups.com, josvazg
Thanks Peter,

I'll give it a try and get back here...

Jose

josvazg

unread,
Mar 21, 2013, 11:49:49 AM3/21/13
to golan...@googlegroups.com, josvazg
You were absolutely right!

This code did the trick:

func roll(a, b uint32, window, oldest, newest uint32) (aa, bb uint32) {
a += newest - oldest
b += a - (window * oldest) - 1
if b > (0xffffffff-255)/2 {
a %= mod
b %= mod
}
return a, b
}

Basically you add the new byte and take the old one from a ("the sum of bytes"...) and then you have to add a to b (the "sum of sums"...) AND take the "accumulated weight" of the old byte in the window: that is substract it "window times" AND also take away the initial 1 you carried from the start.

This is the code I use to make "it more hash.Hash(32) like":

// Roll32 will displace the window checksum window by one position, 
// taking old-byte from the beginning and adding new-byte at the end
// and returns the new current checksum 32bit state
func (d *digest) Roll32(window uint32, oldbyte, newbyte byte) uint32 {
d.a, d.b = roll(d.a, d.b, window, uint32(oldbyte), uint32(newbyte))
return d.Sum32()
}

// Roll calls Roll32 and returns the rolled checksum as a byte slice
func (d *digest) Roll(window uint32, oldbyte, newbyte byte) []byte {
d.Roll32(window, oldbyte, newbyte)
return d.Sum(nil)
}

And:

// RollingHash can roll or scroll the hash window
type RollingHash interface {
hash.Hash
Roll(window uint32, oldbyte, newbyte byte) []byte
}

// RollingHash32 is a rollingHash that produces 32bit hashes
type RollingHash32 interface {
hash.Hash32
Roll(window uint32, oldbyte, newbyte byte) []byte
Roll32(window uint32, oldbyte, newbyte byte) uint32
}

I couldn't extend the original hash/adler32 on the standard library, cause I needed access to the internal digest state a and b variables, so I just copied the full code at http://golang.org/src/pkg/hash/adler32/adler32.go and add what I needed in my own package.

I wonder if the go core team cares for a "RolllingHash" interface and this "RollingAdler32" implementation into the standard library...

My test is for this is quite dumb, I just read the package go code files and put them together in memory, one after the other, and compare the rolling checksum with a stdlib hash/adler32 recalculated on each slice completely. You can see the code here if you may:

Thanks again!

Jose

El miércoles, 20 de marzo de 2013 16:02:46 UTC+1, Péter Szilágyi escribió:

André Moraes

unread,
Mar 21, 2013, 12:01:22 PM3/21/13
to josvazg, golan...@googlegroups.com
Take a look at:
http://doc.cat-v.org/plan_9/4th_edition/papers/venti/
http://plan9.bell-labs.com/sys/doc/venti/venti.html

Venti is a CAS store, so every block can only be acessed by the hash
of it's contents, You could setup something similar to hash all blocks
of all files from a directory and the client could send a list of his
hashes, then you just send the hash's that are missing.

To identify what hash is part of what file, check Fossil, it's a
filesystem built on top of Venti.

http://doc.cat-v.org/plan_9/4th_edition/papers/fossil/

Or you could just send together with the hashes a list of files and
it's associated hashes.

The advantage of Venti is that if you have two blocks that are equal
(in the same file or in different files) you just transfer the block
once.

--
André Moraes
http://amoraes.info

josvazg

unread,
Mar 22, 2013, 3:07:13 AM3/22/13
to golan...@googlegroups.com, josvazg
Thank you André,

I'll definitely take a look at those papers, cause it's a subject I am really interested in beyond this development.

The problem is that they are a storage and filesystems respectively, that is a more ambitious scope than my tool. 

I just planned to do a simple rsync alike http based tool, something you could add to an http deployment without loading it too much or breaking the server side. Then you just download a client that does the heavy processing part and allows you to download huge files faster when you already have a previous version of them.

Jose

David Anderson

unread,
Mar 22, 2013, 3:20:05 AM3/22/13
to josvazg, golang-nuts
Sounds a lot like Cloudflare's Railgun (also written in Go) : http://www.cloudflare.com/railgunhttp://blog.cloudflare.com/go-at-cloudflare

- Dave


--

Ingo Oeser

unread,
Mar 22, 2013, 4:31:44 AM3/22/13
to golan...@googlegroups.com
Railgun does difference + sync between dynamic content, too.

David Anderson

unread,
Mar 22, 2013, 4:41:37 AM3/22/13
to Ingo Oeser, golang-nuts
Dynamic content is just very young static content ;-).

But yes, the distinction matters when serving stuff over HTTP. But the algorithm works just as well regardless of content age.

- Dave


On Fri, Mar 22, 2013 at 1:31 AM, Ingo Oeser <night...@googlemail.com> wrote:
Railgun does difference + sync between dynamic content, too.

Tamás Gulácsi

unread,
Mar 26, 2013, 1:24:21 AM3/26/13
to golan...@googlegroups.com
Sorry Jose to hijack this thread a little, but I'm very interested in storage, and André mentioned Venti, which is very promising.
But I just can't find a nice proper implementation of it...
goventi is missing any version, govt seems to be left half (no nice arena implementation) implemented...

Any ideas? I'd need a solution for AIX, maybe Linux, but rock solid and future proof (If such thing exists).

Thanks,
GThomas

Brad Fitzpatrick

unread,
Mar 26, 2013, 2:13:45 AM3/26/13
to Tamás Gulácsi, golan...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages