hello there!


Zooko Wilcox-O'Hearn

Nov 25, 2011, 2:42:22 PM
to backshift
Dear backshifters:

I just happened to discover backshift from a post by Dan Stromberg to
the PyPy mailing list. Judging from the docs, it looks great! I'm
happy to see another implementation of the "convergent variable-length
block deduplication" idea. As far as I know, that idea originated in
bup (https://github.com/apenwarr/bup ). Is that right? Did you get it
from bup?
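
(For anyone new to the idea, here is a minimal sketch of what
content-defined, variable-length chunking means in practice. The window
size, mask, and the toy additive rolling hash are illustrative stand-ins,
not the actual parameters or hash used by bup or backshift; the point is
only that block boundaries are chosen by content, so an insertion early
in a file disturbs only the nearby blocks and everything downstream still
deduplicates.)

    from collections import deque

    WINDOW = 48    # bytes in the sliding window
    MASK = 0xFFF   # split where the low 12 bits are zero: ~4 KiB average

    def chunks(data, min_block=1024, max_block=65536):
        # Yield variable-length blocks whose boundaries depend on content.
        win, h, start = deque(), 0, 0
        for i, byte in enumerate(data):
            win.append(byte)
            h += byte
            if len(win) > WINDOW:
                h -= win.popleft()   # slide: drop the oldest byte's term
            length = i + 1 - start
            if (length >= min_block and (h & MASK) == 0) or length >= max_block:
                yield data[start:i + 1]
                start, h = i + 1, 0
                win.clear()
        if start < len(data):
            yield data[start:]       # final partial block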

I've added backshift to the "Related Projects" page of the open source
storage project that I work on -- Tahoe-LAFS:
https://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects#OtherProjects

I'll suggest to the author of the Tahoe-LAFS Weekly News
(https://tahoe-lafs.org/trac/tahoe-lafs/wiki/TahoeLAFSWeeklyNews ) that
he should look into featuring backshift as the "Open Source Project of
the Week".

I really like the comparison table here:
http://stromberg.dnsalias.org/~dstromberg/backshift/documentation/comparison/index.html

Bup should be added to it, definitely. Here are my guesses for some of
the values:

bup:

Gist: variable-length block deduplication, uses some of git's internals
Transmitting small changes to large files (EG: Linux DVD torrent, Log
files): Pretty good: Only sends changed blocks
Storing small changes to large files (EG: Linux DVD torrent, Log
files): Pretty good
Deduplication: Deduplicates down to content-based, variable-length
blocks of $SOMETHING_OR_OTHER megabytes on average for large files,
across the same or different machines

I would suggest adding a licence row to the table. bup is GPLv2.

Now, about Tahoe-LAFS (https://tahoe-lafs.org/trac/tahoe-lafs ). It is
two things -- a secure, decentralized, fault-tolerant storage system
(backend) and a file-level deduplicating backup tool (frontend). The
frontend is a command line tool named "tahoe backup" that inspects
files and backs them up if they've changed. The backend automatically
deduplicates at the whole-file level, so the combination of this
frontend and that backend makes for file-level deduplication with
excellent browsability (each file and directory is separately
accessible through an HTTP server or other interfaces).

I would be pretty interested in combining the backshift frontend
behavior with the Tahoe-LAFS backend storage! It looks like from
http://stromberg.dnsalias.org/~strombrg/backshift/documentation/for-all/how-it-works.html
that the interface between backshift and its persistent storage is
just that it stores compressed blocks into a filesystem under the
filename of the hash of the compressed block. That sounds like a
pretty easy requirement for Tahoe-LAFS to support.
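
(In other words, the storage contract looks roughly like the following
sketch. The hash and compressor here, sha256 and zlib, are stand-ins; I
haven't checked which ones backshift actually uses.)

    import hashlib, os, zlib

    def store_block(repo_dir, raw_block):
        # Write one block under the name of the hash of its compressed
        # form; finding the name already present *is* the deduplication.
        compressed = zlib.compress(raw_block)
        name = hashlib.sha256(compressed).hexdigest()
        path = os.path.join(repo_dir, name)
        if not os.path.exists(path):
            tmp = path + '.tmp'
            with open(tmp, 'wb') as f:
                f.write(compressed)
            os.rename(tmp, path)     # atomic publish on POSIX
        return name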

In fact, once backshift has done the variable-length block chopping
and the compression, it could then rely on the Tahoe-LAFS backend to
do the deduplication (by putting each compressed backshift block into
a Tahoe-LAFS file).
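
(Concretely, the glue could be about this small -- a sketch assuming a
local Tahoe-LAFS node listening on the default webapi port, 3456. A PUT
to /uri stores the request body as an immutable file and returns its
read capability.)

    import urllib.request

    def put_block(compressed_block, node="http://127.0.0.1:3456"):
        # "PUT /uri" uploads an immutable file; identical contents
        # converge to the same shares, which is where the dedup happens.
        req = urllib.request.Request(node + "/uri",
                                     data=compressed_block, method="PUT")
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("ascii")  # a cap like "URI:CHK:..."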

Let me know if you are interested! There is already a duplicity
front-end to Tahoe-LAFS, so there is precedent for this sort of
combination being useful.

Regards,

Zooko

Dan Stromberg

Nov 25, 2011, 9:49:24 PM
to back...@googlegroups.com
On Fri, Nov 25, 2011 at 11:42 AM, Zooko Wilcox-O'Hearn <zoo...@gmail.com> wrote:
> Dear backshifters:
>
> I just happened to discover backshift from a post by Dan Stromberg to
> the PyPy mailing list. Judging from the docs, it looks great! I'm
> happy to see another implementation of the "convergent variable-length
> block deduplication" idea. As far as I know, that idea originated in
> bup (https://github.com/apenwarr/bup ). Is that right? Did you get it
> from bup?

I really don't recall where I first learned of variable-length content-based blocking.

> I've added backshift to the "Related Projects" page of the open source
> storage project that I work on -- Tahoe-LAFS:
> https://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects#OtherProjects

Cool.  :)
 
> I'll suggest to the author of the Tahoe-LAFS Weekly News
> (https://tahoe-lafs.org/trac/tahoe-lafs/wiki/TahoeLAFSWeeklyNews ) that
> he should look into featuring backshift as the "Open Source Project of
> the Week".

Wonderful.
 
> I really like the comparison table here:
> http://stromberg.dnsalias.org/~dstromberg/backshift/documentation/comparison/index.html
>
> Bup should be added to it, definitely.

Thanks.  Agreed, I should add a Bup column, and perhaps a Tahoe-LAFS column as well.
 
> Here are my guesses for some of the values:
>
> bup:
>
> Gist: variable-length block deduplication, uses some of git's internals
> Transmitting small changes to large files (EG: Linux DVD torrent, Log
> files):         Pretty good: Only sends changed blocks
> Storing small changes to large files (EG: Linux DVD torrent, Log
> files):         Pretty good
> Deduplication:  Deduplicates down to content-based, variable-length
> blocks of $SOMETHING_OR_OTHER megabytes on average for large files,
> across the same or different machines

> I would suggest adding a licence row to the table. bup is GPLv2.

Also a good idea.
 
> Now, about Tahoe-LAFS (https://tahoe-lafs.org/trac/tahoe-lafs ). It is
> two things -- a secure, decentralized, fault-tolerant storage system
> (backend) and a file-level deduplicating backup tool (frontend). The
> frontend is a command line tool named "tahoe backup" that inspects
> files and backs them up if they've changed. The backend automatically
> deduplicates at the whole-file level, so the combination of this
> frontend and that backend makes for file-level deduplication with
> excellent browsability (each file and directory is separately
> accessible through an HTTP server or other interfaces).

I see.

Is it fixed-blocksize?
 
> I would be pretty interested in combining the backshift frontend
> behavior with the Tahoe-LAFS backend storage! It looks like from
> http://stromberg.dnsalias.org/~strombrg/backshift/documentation/for-all/how-it-works.html
> that the interface between backshift and its persistent storage is
> just that it stores compressed blocks into a filesystem under the
> filename of the hash of the compressed block. That sounds like a
> pretty easy requirement for Tahoe-LAFS to support.

Yes, that's correct (about what backshift does), and it'd be interesting to integrate with Tahoe-LAFS.  What might benefit backshift the most is if there were some atomic way of saying "Do you have this block yet, under such-and-so hash?" and if yes, just reuse it, and if not, hand in the block itself.  Perhaps some syscall or ioctl?
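
(Something shaped like this, where the first round trip carries only the
hash. This is purely hypothetical -- store.has() and store.put() don't
exist as a syscall or in Tahoe-LAFS as far as I know; it just shows the
desired shape of the API.)

    def ensure_block(store, digest, read_block):
        # Hypothetical API: ask by hash first, send bytes only on a miss.
        # As two calls this is racy; a real protocol would want the
        # check-and-claim to happen server-side in one atomic step.
        if store.has(digest):
            return "reused"
        store.put(digest, read_block())
        return "stored"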

Right now, backshift stores 2^20 files in one directory, then 2^24 files in subdirectories thereof.  It's probably the slowest part of the system. I'm finding that it works fine for large files, like movies, but is kinda slow for hordes of little files, like C header files.
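
(The usual mitigation is hash-prefix fanout, something like the sketch
below. I'm not claiming this is exactly backshift's current layout, just
the general shape of the fix.)

    import os

    def fanout_path(repo_dir, hexdigest, levels=2, width=2):
        # Map e.g. "abcdef99..." to repo/ab/cd/abcdef99... so that each
        # fanout level holds at most 256 entries (two hex digits each).
        parts = [hexdigest[i * width:(i + 1) * width] for i in range(levels)]
        return os.path.join(repo_dir, *parts, hexdigest)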

> In fact, once backshift has done the variable-length block chopping
> and the compression, it could then rely on the Tahoe-LAFS backend to
> do the deduplication (by putting each compressed backshift block into
> a Tahoe-LAFS file).

Nod.  This might be pretty cool.
 
> Let me know if you are interested! There is already a duplicity
> front-end to Tahoe-LAFS, so there is precedent for this sort of
> combination being useful.

Interesting topic.

Thanks!

--
Dan Stromberg

Zooko O'Whielacronx

Dec 13, 2011, 11:37:05 AM
to back...@googlegroups.com
Dear Dan Stromberg:

Thanks for adding Tahoe-LAFS to the comparison table at
http://stromberg.dnsalias.org/~dstromberg/backshift/documentation/comparison/index.html
-- here are my suggestions for the fields.

The gist: I wouldn't say "The gist" includes FUSE. Tahoe-LAFS does
optionally support FUSE, but most users treat it as a backup or
file-sharing application—interacting with it through a web UI or the
tahoe-specific executable on the command line—rather than as a local
filesystem.

Backs up hardlinks?: It backs up the files just fine and deduplicates
them on the server side, but its restore doesn't recreate the
hardlinks, so on restore you would end up with two copies of the
identical file.

Uses hardlinks for deduplication?: Doesn't use a filesystem at all on
the server side, so the answer is "mu" [1]. :-) But it does deduplicate.

Transmitting small changes to large files: no, it uploads and stores
the entire contents of each file each time (except when the identical
complete file has already been uploaded, in which case it deduplicates
that).

Storing small changes to large files: same as above

Compression over the wire: no

Compression at rest: no

Data format: Custom, but accessible via an open Python API

Incremental behavior: deduplicates entire files, keeps a client-side
cache of timestamps of backed-up files so that it doesn't attempt to
back up files that haven't changed since the last backup (a sketch of
this check appears just after this list).

Deduplication: Yes, but whole files.

CLI?: yes

GUI?: WUI (Tahoe's web user interface)

Scheduling?: Only via cron or launchd or task scheduler

Failed backup notices?: Only via wrapper scripts

Transmits data encrypted?: yes

Stores data encrypted?: yes

Means of restores?: recursively download the files (therefore
hardlinks aren't saved and restored, nor is metadata such as file
ownership, permission bits, timestamps)

Permissions / ownership: not stored

Media (disk to disk, disk to tape): Disk to cloud. :-)

Supported platforms: all manner of *ix, including Mac OS X, plus
Cygwin (Windows); see
https://tahoe-lafs.org/trac/tahoe-lafs/browser/docs/quickstart.rst

Can expire old files?: yes, on a per-file granularity
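
(For the "Incremental behavior" row above, the client-side check is
conceptually just the following -- a hedged sketch of the idea, not the
actual code "tahoe backup" uses for its cache.)

    import os

    def needs_backup(path, cache):
        # Skip files whose size and mtime match the previous backup's
        # record; anything else gets (re)uploaded and re-recorded.
        st = os.stat(path)
        key = (st.st_size, st.st_mtime)
        if cache.get(path) == key:
            return False
        cache[path] = key
        return True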

One more thing: some of these are likely to change in the near future
as we're working on improvements in things like scheduling of backups
and failure notices, and probably other improvements, too. To prevent
your table from making stale statements, you should add a timestamp or
indicator of what version of Tahoe-LAFS these statements are about. My
answers above (plus your answers that are already on the table) are
correct for Tahoe-LAFS v1.9.

Thanks again for maintaining that informative table!

Regards,

Zooko

[1] https://en.wikipedia.org/wiki/Mu_%28negative%29

Dan Stromberg

Dec 13, 2011, 6:57:59 PM
to zo...@zooko.com, back...@googlegroups.com

Thanks for the info!  Feel free (or not) to vet the changes as I've made them: http://stromberg.dnsalias.org/~strombrg/backshift/documentation/comparison/index.html
--
Dan Stromberg