I just happened to discover backshift from a post by Dan Stromberg to
the PyPy mailing list. Judging from the docs, it looks great! I'm
happy to see another implementation of the "convergent variable-length
block deduplication" idea. As far as I know, that idea originated in
bup (https://github.com/apenwarr/bup ). Is that right? Did you get it
from bup?
I've added backshift to the "Related Projects" page of the open source
storage project that I work on -- Tahoe-LAFS:
https://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects#OtherProjects
I'll suggest to the author of the Tahoe-LAFS Weekly News
(https://tahoe-lafs.org/trac/tahoe-lafs/wiki/TahoeLAFSWeeklyNews ) that
he should look into featuring backshift as the "Open Source Project of
the Week".
I really like the comparison table here:
http://stromberg.dnsalias.org/~dstromberg/backshift/documentation/comparison/index.html
Bup should be added to it, definitely. Here are my guesses for some of
the values:
bup:
Gist: variable-length block deduplication, uses some of git's internals
Transmitting small changes to large files (EG: Linux DVD torrent, Log
files): Pretty good: Only sends changed blocks
Storing small changes to large files (EG: Linux DVD torrent, Log
files): Pretty good
Deduplication: Deduplicates down to content-based, variable-length
blocks of $SOMETHING_OR_OTHER megabytes on average for large files,
across the same or different machines
I would suggest adding a licence row to the table. bup is GPLv2.
Now, about Tahoe-LAFS (https://tahoe-lafs.org/trac/tahoe-lafs ). It is
two things -- a secure, decentralized, fault-tolerant storage system
(backend) and a file-level deduplicating backup tool (frontend). The
frontend is a command line tool named "tahoe backup" that inspects
files and backs them up if they've changed. The backend automatically
deduplicates at the whole-file level, so the combination of this
frontend and that backend makes for file-level deduplication with
excellent browsability (each file and directory is separately
accessible through an HTTP server or other interfaces).
I would be pretty interested in combining the backshift frontend
behavior with the Tahoe-LAFS backend storage! Judging from
http://stromberg.dnsalias.org/~strombrg/backshift/documentation/for-all/how-it-works.html ,
the interface between backshift and its persistent storage is
simply that it stores compressed blocks in a filesystem, each under the
filename of the hash of the compressed block. That sounds like a
pretty easy requirement for Tahoe-LAFS to support.
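
To make that concrete, here is a minimal Python sketch of the storage
interface as I understand it from the docs. It is not backshift's actual
code: the function names, the choice of SHA-256 and zlib, and the flat
directory layout are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile
import zlib


def store_block(root, block):
    """Compress a block and store it under the hash of the compressed bytes."""
    compressed = zlib.compress(block)
    # Assumption: SHA-256 hex digest as the filename; backshift may differ.
    name = hashlib.sha256(compressed).hexdigest()
    path = os.path.join(root, name)
    if not os.path.exists(path):
        # Write to a temporary file and rename it into place, so a reader
        # never sees a partially written block.
        fd, tmp = tempfile.mkstemp(dir=root)
        with os.fdopen(fd, "wb") as f:
            f.write(compressed)
        os.replace(tmp, path)
    return name


def load_block(root, name):
    """Read a stored block back and decompress it."""
    with open(os.path.join(root, name), "rb") as f:
        return zlib.decompress(f.read())
```

Storing the same block twice yields the same filename, so the second
call writes nothing. Any backend that can map a name to an immutable
blob could stand in for the directory used here.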
In fact, once backshift has done the variable-length block chopping
and the compression, it could then rely on the Tahoe-LAFS backend to
do the deduplication (by putting each compressed backshift block into
a Tahoe-LAFS file).
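
For readers unfamiliar with the "variable-length block chopping" step,
here is a rough Python sketch of content-defined chunking. It uses a
Gear-style rolling hash, which is an assumption of this sketch and not
necessarily what bup or backshift use. The table, the function names and
the default average block size are hypothetical, and a real tool would
also enforce minimum and maximum block sizes.

```python
import hashlib

# Hypothetical per-byte table for a Gear-style rolling hash, derived
# deterministically so the sketch gives the same cuts on every run.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]


def chunk(data, avg_bits=6):
    """Split data wherever the low avg_bits bits of a rolling hash are zero."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The left shift pushes a byte's influence out of the 32-bit value
        # after 32 steps, so h depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def block_ids(data):
    """Hash each chunk; equal chunks get equal IDs and can be deduplicated."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because each cut point depends only on nearby content, inserting a byte
near the start of a large file changes the IDs of the first block or so
and leaves the rest identical. Those unchanged blocks are what the
backend could then deduplicate.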
Let me know if you are interested! There is already a duplicity
frontend to Tahoe-LAFS, so there is precedent for this sort of
combination being useful.
Regards,
Zooko
Dear backshifters:
Thanks for adding Tahoe-LAFS to
http://stromberg.dnsalias.org/~dstromberg/backshift/documentation/comparison/index.html .
Here are my suggestions for the fields.
The gist: I wouldn't say "The gist" includes FUSE. Tahoe-LAFS does
optionally support FUSE, but most users treat it as a backup or
file-sharing application—interacting with it through a web UI or the
tahoe-specific executable on the command line—rather than as a local
filesystem.
Backs up hardlinks?: It backs up the files just fine and deduplicates
them on the server side, but it doesn't have a restore which would
restore the hardlinks, so on restore you would end up with two copies
of the identical file.
Uses hardlinks for deduplication?: Doesn't use a filesystem at all on
the server side, so the answer is "mu". :-) But it does deduplicate.
Transmitting small changes to large files: no, it uploads and stores
the entire contents of each file each time (except if the identical
complete file is already uploaded, in which case it deduplicates that).
Storing small changes to large files: same as above
Compression over the wire: no
Compression at rest: no
Data format: Custom, but accessible via an open Python API
Incremental behavior: deduplicates entire files, keeps a client-side
cache of timestamps of backed up files so that it doesn't attempt to
back up files that haven't changed since the last backup.
Deduplication: Yes, but whole files.
CLI?: yes
GUI?: WUI
Scheduling?: Only via cron or launchd or task scheduler
Failed backup notices?: Only via wrapper scripts
Transmits data encrypted?: yes
Stores data encrypted?: yes
Means of restores?: recursively download the files (therefore
hardlinks aren't saved and restored, nor is metadata such as file
ownership, permission bits, timestamps)
Permissions / ownership: not stored
Media (disk to disk, disk to tape): Disk to cloud. :-)
Supported platforms: <a
href="https://tahoe-lafs.org/trac/tahoe-lafs/browser/docs/quickstart.rst">All
manner of *ix, including Mac OS/X, plus Cygwin (Windows)</a>
Can expire old files?: yes, on a per-file granularity
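
To illustrate the "Incremental behavior" answer above, here is a minimal
Python sketch of a client-side timestamp cache. It is only a stand-in
for the idea, not Tahoe-LAFS's implementation: the function name, the
JSON cache file, and the use of modification time plus size as the
change test are all assumptions.

```python
import json
import os


def files_needing_backup(paths, cache_file):
    """Return the paths whose (mtime, size) differ from the cached values."""
    try:
        with open(cache_file) as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}
    changed = []
    for path in paths:
        st = os.stat(path)
        stamp = [st.st_mtime_ns, st.st_size]
        if cache.get(path) != stamp:
            changed.append(path)
            # Simplification: a real tool would record the new stamp only
            # after the upload of this file has succeeded.
            cache[path] = stamp
    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return changed
```

A second run over unchanged files returns an empty list, so nothing is
read or uploaded again. A file that did change is still uploaded in
full, as described in the "Transmitting small changes" answer.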
One more thing: some of these are likely to change in the near future
as we're working on improvements in things like scheduling of backups
and failure notices, and probably other improvements, too. To prevent
your table from making stale statements, you should add a timestamp or
indicator of what version of Tahoe-LAFS these statements are about. My
answers above (plus your answers that are already on the table) are
correct for Tahoe-LAFS v1.9.
Thanks again for maintaining that informative table!
Regards,
Zooko