[ANN] rclone - rsync for object storage


Nick Craig-Wood

May 30, 2014, 12:47:33 PM
to golang-nuts
I'm looking for testers, users and developers for rclone, the somewhat
ambitiously subtitled "rsync for object storage".

It is written in Go so may be of interest here!

-----

Rclone is a command line program to sync files and directories to and from:

* Google Drive
* Amazon S3
* Openstack Swift / Rackspace cloud files / Memset Memstore
* The local filesystem

Features

* MD5SUMs checked at all times for file integrity
* Timestamps preserved on files
* Partial syncs supported on a whole file basis
* Copy mode to just copy new/changed files
* Sync mode to make a directory identical
* Check mode to check all MD5SUMs
* Can sync to and from network, eg two different Drive accounts
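
For a flavour of the modes above, usage looks roughly like this (command shapes assumed from the rclone.org docs; "remote:" stands for whichever backend you have configured, and exact flags may differ):

```shell
# Copy mode: transfer only new or changed files to the destination.
rclone copy /home/user/files remote:backup

# Sync mode: make the destination identical to the source,
# deleting extra files on the destination side.
rclone sync /home/user/files remote:backup

# Check mode: compare MD5SUMs without transferring any data.
rclone check /home/user/files remote:backup
```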

Links

* http://rclone.org/ - main website
* http://github.com/ncw/rclone - code/bugs
* https://google.com/+RcloneOrg - G+

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Gustavo Niemeyer

May 30, 2014, 8:06:02 PM
to Nick Craig-Wood, golang-nuts
That looks very interesting, and I'll certainly have a closer look next week.

Thanks for publishing, Nick.


(trivial note: s/go install/go get/ on docs)



--

gustavo @ http://niemeyer.net

Aaron Cannon

May 30, 2014, 9:36:06 PM
to Nick Craig-Wood, golang-nuts
Sounds awesome. Question: any reason you chose md5 over a non-deprecated hash algorithm?

Aaron

--
This message was sent from a mobile device

Daniel Theophanes

May 30, 2014, 11:46:38 PM
to golan...@googlegroups.com, ni...@craig-wood.com
rsync uses a hash to identify uniqueness in a non-security setting. In this case you want the fastest hash that does the job.
The only other hash I'd suggest using is blake2 (https://blake2.net/) but it probably doesn't have a practical difference for most people.

Gustavo Niemeyer

May 30, 2014, 11:54:48 PM
to Daniel Theophanes, golan...@googlegroups.com, Nick Craig-Wood
On Sat, May 31, 2014 at 5:46 AM, Daniel Theophanes <kard...@gmail.com> wrote:
> rsync uses a hash to identify uniqueness in a non-security setting. In this
> case you want the fastest hash that does the job.

The stronger digest is only ever applied after the rolling checksum
has matched, and the rolling checksum is only ever applied on a buffer
window with a given probability, so I doubt the speed difference is
relevant in this case. Much more likely, he's using MD5 because that's
already in use by some of those services. For example:

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
https://developers.google.com/drive/v1/reference/files


gustavo @ http://niemeyer.net

Daniel Theophanes

May 31, 2014, 12:02:00 AM
to golan...@googlegroups.com, kard...@gmail.com, ni...@craig-wood.com
I'm aware of how the rsync algo works [1]. When I tested it in my package, the speed difference was enough that I kept md5 as the default (though you can replace it with whatever Hasher you want). But at the time I was using it for rdiff on a file over 2 GiB, so the hash time added up.

1. https://bitbucket.org/kardianos/rsync/src/

Gustavo Niemeyer

May 31, 2014, 12:23:41 AM
to Daniel Theophanes, golan...@googlegroups.com, Nick Craig-Wood
On Sat, May 31, 2014 at 6:02 AM, Daniel Theophanes <kard...@gmail.com> wrote:
> I'm aware of how the rsync algo works [1], when I tested on my package it
> made enough difference that I kept it as md5 by default (though you can
> replace it with whatever Hasher you want). But at the time I was using it
> for rdiff for a file over 2 GiB, so the hash time added up.

Even for large files, the cost difference should be roughly proportional
to the difference between applying each algorithm to the file once.

> 1. https://bitbucket.org/kardianos/rsync/src/

That doesn't look right. The signature is being created based on fixed
block sizes. If you prepend 1 byte to the whole file, it'll resend the
whole file as the delta, because all of the blocks have been shifted.
You need to apply the same rolling checksum logic when creating the
signature to split the blocks at appropriate places and allow for
reuse.


gustavo @ http://niemeyer.net

Aaron Cannon

May 31, 2014, 12:32:26 AM
to Daniel Theophanes, golan...@googlegroups.com, ni...@craig-wood.com
Have you considered CRC64 then?  I haven't tested it, but I suspect it would be an order of magnitude faster than any hash designed for security. 

Aaron

--
This message was sent from a mobile device

Gustavo Niemeyer

May 31, 2014, 1:10:39 AM
to Daniel Theophanes, golan...@googlegroups.com, Nick Craig-Wood
On Sat, May 31, 2014 at 6:23 AM, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
>> 1. https://bitbucket.org/kardianos/rsync/src/
>
> That doesn't look right. The signature is being created based on fixed
> block sizes. If you prepend 1 byte to the whole file, it'll resend the

Sorry, I see your apply implementation also diverges from rsync:
you're computing the weak hash and looking for a weak match on every
single byte rather than doing what rsync does. You would get a
significant boost in speed by making the match more probabilistic,
though the blocks would then vary in size.

gustavo @ http://niemeyer.net
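
The reason a per-byte scan with the weak hash can still be cheap is that rsync's weak checksum rolls: sliding the window by one byte is O(1), not a rehash of the whole window. A sketch of that two-part sum (names and window size are illustrative; real rsync also packs the two 16-bit halves into one value):

```go
package main

import "fmt"

// weak is a sketch of rsync-style weak checksum state: a is the plain
// sum of the window's bytes, b weights each byte by its distance from
// the window's end, n is the window length.
type weak struct {
	a, b, n uint32
}

// newWeak computes the checksum of a window from scratch.
func newWeak(win []byte) weak {
	w := weak{n: uint32(len(win))}
	for i, c := range win {
		w.a += uint32(c)
		w.b += uint32(len(win)-i) * uint32(c)
	}
	return w
}

// roll slides the window one byte in O(1): `out` leaves, `in` enters.
func (w *weak) roll(out, in byte) {
	w.a = w.a - uint32(out) + uint32(in)
	w.b = w.b - w.n*uint32(out) + w.a
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog")
	w := newWeak(data[:16])
	for i := 16; i < len(data); i++ {
		w.roll(data[i-16], data[i]) // O(1) per position
	}
	// Rolled state matches a from-scratch computation of the last window.
	fmt.Println(w == newWeak(data[len(data)-16:])) // prints "true"
}
```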

John Souvestre

May 31, 2014, 2:58:06 AM
to Aaron Cannon, Daniel Theophanes, golan...@googlegroups.com, ni...@craig-wood.com

I had the same thought. Also, if you aren't comfortable with the 96 bits provided by the combination of the rolling checksum (32 bits) and a 64-bit CRC, then you could use a 128-bit CRC. I understand it would take about 50% longer than a 64-bit CRC.

John

    John Souvestre - New Orleans LA

Nick Craig-Wood

May 31, 2014, 6:16:59 AM
to Aaron Cannon, golang-nuts
On 31/05/14 02:35, Aaron Cannon wrote:
> Sounds awesome. Question: any reason you chose md5, over a non deprecated hash algorithm?

md5 is because it seems to have become a standard in object storage
systems for integrity checking.

All of S3, Swift and Google Drive can report md5sums of the remote
objects, so you can check against the md5sum of the local file without
having to download the object.

Nick Craig-Wood

May 31, 2014, 6:19:01 AM
to Gustavo Niemeyer, Daniel Theophanes, golan...@googlegroups.com
That is exactly the reason - you are right!

The fact that the remote object storage system can report the MD5
without you having to download the object is what makes rclone work.