bup already compresses the chunks that it splits out using gzip - just
like git does. The compression ratio isn't ideal, though, since gzip
works better if it has more data to work with (and bup chunks are
usually 8k or less).
I've been thinking that it would be possible to improve the gzip
compression ratio by preloading its dictionary with the previous few
file chunks, but that would slow things down and complicate things a
bit. Even so, bup's compression is already much better than (for
example) bzipping a VMware image.
Have you done any size/speed comparisons between bup's compression and
librsync/duplicity's compression? That would be interesting to see.
LZMA is pretty slow compared to gzip and would greatly slow down your
backups, I think.
> Adding encryption to the mix would really make the perfect backup
> tool.
We had some discussion about this on the mailing list.
It sounds doable (particularly if you didn't want to include the
"distributed repository shared with untrusted partners" part) but I'm
not sure how valuable it would actually be. Encrypted backups ==
backups you're less likely to be able to restore in an emergency. If
you lost all your files, will you still have your key? If your key is
just a password (as opposed to a really secure 256-bit symmetric key
or whatever), is it even worth anything?
> There is a similar project called duplicity which uses librsync +
> encryption + gzip compression.
> The main issue with duplicity is that it uses a lot of different
> processes / libs to perform its work (like gpg), and it's slow.
Yeah, gpg is *really* slow, so that would be a pretty bad idea for
performance. Of course, with cryptography, there's often a tradeoff
between speed and security. :(
Have fun,
Avery
Only if you don't lose your key. But okay, fair enough.
> I'm not an expert, but I've written crypto code multiple times in the
> past, and the simplest yet efficient way would be to perform this:
> 1) A user-secret password is derived (with algorithms like mcrypt or
> ccrypt) to decrypt a (newly generated) private key.
> 2) The matching public key is used to encrypt a random symmetric key.
> 3) The symmetric key is used to cipher the data.
> 4) When a given amount of data has been encrypted, a new random
> symmetric key is used, and the public key encrypts it.
This method has a few problems. Basically, you're adding a lot of
layers that don't seem to improve the cryptography at all.
I'm certainly not a cryptographer, but my understanding is that the
mathematically weakest part of cryptography at the moment is stuff
based on public keys. Symmetric key encryption, when you have a good
key generation algorithm, is on much firmer ground. Now, people
obviously depend on public keys, so they're not that weak, but I don't
see how adding a public/private keypair to the mix (and then
distributing the private key!) can possibly make things better.
To be more specific, including the private key along with the backup
set is roughly equivalent to taking my gpg private key (which is
encrypted with a passphrase) and posting it on my website. That
passphrase is nice, but that's not what you're supposed to use it for.
Second, a cryptosystem is only as strong as its weakest link. In this
case, that's the human-memorizable password in step 1. Using 256-bit
session keys and 4096-bit public keys and 7 layers of encryption
sounds great, but if guessing the password means I can see all the
data, that's going to be by far the easiest vector of attack. And if
so, then we might as well do something trivial like just using a hash
of the password as the symmetric key. What does gpg do when you tell
it to use symmetric encryption? I bet there's no public/private key
stuff involved.
Third, you can't just encrypt all your backup sets separately with
different keys and expect deduplication to work. Restoring a
particular backup set will require retrieving objects from prior
backup sets. So we need some way to support that, which means we
can't just encrypt stuff as a single stream - and that reduces your
security vs. anything that can encrypt an entire stream in one go.
Finally, any encryption scheme will automatically make our
repositories incompatible (at least in some way) with git
repositories. That's kind of too bad, although not the end of the
world.
> The weakness of this algorithm resides in the low entropy of the
> user password, but I've never heard of mcrypt or ccrypt failing yet,
> so I guess it's secure enough.
They will certainly fail if the user is dumb enough to choose an
easily-guessable password :)
But yes, for most people using a password is the only reasonable
option, so that's the best we can do. And if you use a password, then
you need a good algorithm for hashing it into a key, and I gather
those algorithms are fine (I've mostly used bcrypt for password
hashing, but whatever).
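For reference, the standard-library version of "hashing a password into
a key" looks like this (PBKDF2 via hashlib, since Python ships it;
bcrypt and scrypt are the same idea with different cost functions - the
function name is illustrative):

```python
import hashlib
import os


def password_to_key(password: str, salt: bytes, rounds: int = 200_000) -> bytes:
    # Stretch the password: each guess costs the attacker `rounds`
    # HMAC-SHA256 evaluations, and the random salt defeats
    # precomputed tables. Returns a 32-byte symmetric key.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, rounds)


salt = os.urandom(16)                   # stored with the repo; not secret
key = password_to_key("hunter2", salt)  # the actual symmetric key
```

The salt lives alongside the backup; only the password has to be
remembered.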
Have fun,
Avery
I don't think layer 3 is actually necessary. Or rather, there's no
reason it should be using public key crypto. You could just as easily
generate a single strong symmetric key, and encrypt all your session
keys with that.
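As a sketch of that shape only - this is toy code with made-up names,
emphatically not production crypto - "one strong master key wrapping
per-stream session keys" could look like this, using an HMAC-derived
keystream plus a tag so a wrong master key is detected:

```python
import hashlib
import hmac
import os


def wrap_key(master_key: bytes, session_key: bytes) -> bytes:
    # Toy key wrap (NOT production crypto): XOR the session key with a
    # keystream derived from the master key and a fresh nonce, then
    # append an HMAC tag so unwrapping with a wrong key fails loudly.
    assert len(session_key) <= 32
    nonce = os.urandom(16)
    pad = hmac.new(master_key, b"enc" + nonce, hashlib.sha256).digest()
    ct = bytes(a ^ b for a, b in zip(session_key, pad))
    tag = hmac.new(master_key, b"mac" + nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def unwrap_key(master_key: bytes, blob: bytes) -> bytes:
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expect = hmac.new(master_key, b"mac" + nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("wrong master key or corrupted blob")
    pad = hmac.new(master_key, b"enc" + nonce, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(ct, pad))
```

No public key anywhere: the wrapped session keys can sit right next to
the encrypted data, GPG-style.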
Like I said above: I'm pretty sure GPG doesn't use a public/private
keypair when you use its password encryption option (--symmetric),
right? And one would hope that GPG is doing something at least as
secure as what we're trying to do.
>> I'm certainly not a cryptographer, but my understanding is that the
>> mathematically weakest part of cryptography at the moment is stuff
>> based on public keys.
>
> This is wrong. Don't make a rule from (old) RSA weaknesses. Now most
> PKI systems are based on elliptic curve cryptography, whose keys are
> smaller and safer than RSA's.
> In ECC, a 256-bit key is expected to be safe for the next 10 years.
This all still depends on it being hard to solve the discrete logarithm
problem on elliptic curves. If that became untrue, ECC would be obsolete
overnight. My understanding is that overcoming the best symmetric
crypto algorithms - in a fundamental way - is much more difficult. So
if public key crypto doesn't buy us anything over symmetric, we
shouldn't be building it in.
Besides which, is there a standard ECC algorithm to use that's not
patented? All my GPG/SSH keys are either DSA or RSA.
> Also, breaking one ECC key pair doesn't mean you've broken the whole
> cryptosystem.
> If key pairs are generated regularly (for example, every 128MB of data),
> only the slice you've broken is recoverable (which, IMHO, is not enough
> to get any interesting data, if it's delta-only data).
That doesn't really make sense; if I break the keypair that's being
used to encrypt all the session keys, then fundamentally I can access
the entire repo. Unless I'm storing a whole *batch* of keys locally and
you need the whole batch of them in order to decrypt a backup.
> The idea is to avoid distributing the private key (obviously).
> You save the private key on the "safe" system (that is, my ".bup"
> folder on my computer), basically encrypted with some hash of a
> password.
> The file permissions will be restricted to u:rw (chmod 0600).
> This is exactly what SSH et al. do, as there is no way to get better
> security on a computer.
>
> One can back up his private keys elsewhere, but, again, that's how
> other security-based packages work.
> If you lose your SSH private key, anyone can connect to your server.
That sounds nice in theory, but I certainly don't want to be the guy
who's responsible for producing a backup system where the backed-up
files can't be recovered if you lose your original disk. Then when
someone inevitably loses their key because they ran the system in the
default way, it's my fault.
If we encrypt stuff and expect it to be decryptable using a password,
then we can't require people to *also* have a keyfile lying around.
> Notice that using a symmetric key here doesn't change the deal at all;
> you still have to keep your key from prying eyes.
Right: it doesn't change anything. Except symmetric algorithms are
vastly simpler, so we should be using them unless public key gives us
some kind of advantage.
> Using only the TRANSFORMED(password) as a key for symmetric crypto is
> dumb for multiple reasons:
Yes yes, we obviously want to use strong session keys and not overuse
any one key. I skimmed over that in my email, but it's important.
>> Third, you can't just encrypt all your backup sets separately with
>> different keys and expect deduplication to work. Restoring a
>> particular backup set will require retrieving objects from prior
>> backup sets. So we need some way to support that, which means we
>> can't just encrypt stuff as a single stream - and that reduces your
>> security vs. anything that can encrypt an entire stream in one go.
>
> I can't tell. Duplicity uses signature files (and archives) for
> computing the rolling checksum/deduplication.
> IIRC, librsync supports this: you can compute the exact difference
> between 2 files by only comparing the file on your system to its
> previously saved signature file.
> The signature file is very small compared to the file itself
> (something like 1/16th of the file size).
Well, bup definitely doesn't use the same sort of signatures that
rsync does. The tricky part in bup isn't comparing the signatures -
we have indexfiles for that - but in restoring the data when we come
back later. We might need to pull chunks out of 15 previous backups,
which would involve decrypting each one. If we use (ideally) a
chaining cipher for each file, that would involve reading through the
whole file just to decrypt as little as one block. If we *don't* use
a chaining cipher (or restart the chain for each bup chunk, about 4k)
we potentially reduce the security.
Other backup systems (possibly including duplicity, I don't know) have
big sequential files that they have to read through *anyway*, which
makes the encryption not a big deal. But bup files are random-access.
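One hedged sketch of reconciling random access with per-chunk
encryption - assuming a per-file key, which is my invention here, not
anything bup has - is to derive an independent key for each chunk from
its index, so there is no chain to replay and no single key gets reused
across chunks:

```python
import hashlib
import hmac


def chunk_key(file_key: bytes, chunk_index: int) -> bytes:
    # Derive an independent key per chunk from the file key and the
    # chunk's position, so decrypting chunk N never requires reading
    # (or decrypting) chunks 0..N-1.
    return hmac.new(file_key, chunk_index.to_bytes(8, "big"),
                    hashlib.sha256).digest()
```

Whether restarting the cipher per ~4k chunk this way gives up anything
meaningful versus one long chain is exactly the open question above.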
>> Finally, any encryption scheme will automatically make our
>> repositories incompatible (at least in some way) with git
>> repositories. That's kind of too bad, although not the end of the
>> world.
>
> I think there is an easy workaround. Provide a pipe-filter process for
> this, and it'll probably work correctly.
> So instead of calling "git checkout some-revision" you'd call
> "bupf-git checkout some-revision".
> Most software dealing with git allows you to specify the path to the
> git binary, so it's possible to use them transparently.
That won't work, unfortunately. Or rather, if we did this, the
*filenames* wouldn't be encrypted. Or we could encrypt the filenames
(somehow) but it would still be obvious how many files are in each
directory. We really should be encrypting entire tree objects, not
the individual items in each tree. But if the tree object is
encrypted, git won't be able to use it *at all*. Like I said, that's
not the end of the world - we can just do all the work ourselves in
bup. But it's not as straightforward as you make it sound.
Have fun,
Avery
But why do you think you need to use a public key to generate the
session keys? You can generate them using any random number
generator. You still have to store them somewhere.
Think of it this way: the advantage of a public-key system is that you
can distribute the public key to the *public*, and it won't negatively
impact your security. Who are we going to give the public key to?
The only person we ever want to be able to decrypt the data is the
person who encrypted it in the first place. That's purely symmetric.
> You still have to save those session keys somewhere (not in the same
> place as the server's backup obviously).
That's not *actually* obvious. For example, in the case of SSL or
GPG, the session keys are obviously stored (albeit encrypted) in the
same stream as the data.
I agree that you're better off if you store them elsewhere, but how
much better off? I don't know. Does the security advantage outweigh
the inconvenience disadvantage? I don't know.
I object to just being dogmatic about it, however, since I don't know
of any professionally-designed systems that work using a
separately-stored list of session keys. (Duplicity is a fine product,
but I don't think it was designed by cryptographers. GPG, SSL, and
SSH were, and they inline their session keys.)
> Making your own system with symmetric crypto is a bit more "unsafe",
> as you can make errors (crypto errors are very hard to spot, as only a
> few devs understand them).
We're already talking about putting together an entirely new crypto
system here. It has to be new, because bup's file format is special
and there aren't any pre-existing products (that I know of) that do
the whole end-to-end security.
Well, that's not entirely true: actually bup already uses ssh for
end-to-end transport security. So I trust the fact that people can't
steal your data while it's in flight. If there was a well-designed,
well-audited system for controlled random access to encrypted files on
a disk, I would trust that too. But anything we design here is
effectively going to be crap.
>> Besides which, is there a standard ECC algorithm to use that's not
>> patented? All my GPG/SSH keys are either DSA or RSA.
>
> I think ECC patents are a hoax (or maybe it's possible to patent an
> idea or mathematical equation in the US?).
> In any case, ECC is a problem known since last century, so the patents
> would be invalid anyway.
>
> A specific implementation might be patented, but, for example, I doubt
> OpenSSL's implementation is patented in any way.
Hoax patents are as good as real patents when it comes to losing tons
of money on legal bills. They probably won't sue "bup" as an open
source project with no corporation attached, but any company wanting
to use bup in its products would be at risk.
>> That sounds nice in theory, but I certainly don't want to be the guy
>> who's responsible for producing a backup system where the backed-up
>> files can't be recovered if you lose your original disk. Then when
>> someone inevitably loses their key because they ran the system in the
>> default way, it's my fault.
>
> GPG has big red messages for this. SSH does this too.
> There is no solution against users ignoring what the software says.
Yes there is: make the software do the right thing by default. I
*hate* those idiotic messages from GPG and SSH. They're constantly
whining about how I'm a bad person for not being secure enough, but
they don't actually make it easy to be a good person. So I, like
everybody else, ignore them.
>> If we encrypt stuff and expect it to be decryptable using a password,
>> then we can't require people to *also* have a keyfile lying around.
>
> Good point. I'm viewing my SSH private key like my home's keys.
That's a terrible analogy. If I lose the keys to my house, I can just
hire a locksmith and be into my house within 30 minutes. If I lose
the super-secure private key that I used to encrypt all my backups,
I'll probably *never* get back in (unless maybe I can afford to hire
the NSA). The data will be lost forever.
That's not actually the level of security normal people want.
> If you update the keys for each file (or every 1MB, for example), then
> it's only that part that should be downloaded.
> I don't think it's so bad to add 32 bytes (new random key) every 4kB
> of data.
> I don't think it decreases the security anyway if you do that.
Well, there's nothing really stopping us from doing a key per 4kb...
except if we're keeping those locally, that's 32 bytes of key per 4k
chunk - about 1/128th of your data, so a couple of gigabytes of key
material if you're backing up a few hundred gigs. Which is not *so*
bad I guess, but that's a couple of gigs of stuff that you now have
to... back up somewhere. So that you can access your backups.
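For the record, the overhead arithmetic (under the per-4kB and per-1MB
granularities from the quote above) works out like this:

```python
def key_overhead(data_bytes: int, bytes_per_key: int, key_bytes: int = 32) -> int:
    # One key_bytes-sized key for every bytes_per_key of data,
    # rounding the chunk count up.
    return -(-data_bytes // bytes_per_key) * key_bytes


GiB = 2 ** 30
per_4k = key_overhead(200 * GiB, 4096)     # 32/4096 = 1/128 of the data
per_1m = key_overhead(200 * GiB, 2 ** 20)  # one key per megabyte
```

At one key per 4kB, 200 gigs of backups carries about 1.6 GiB of key
material; at one key per 1MB it drops to a little over 6 MiB.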
Further, one of bup's big advantages is that it can deduplicate data
between completely separate backup sets. For example, if I have
workstations A, B, and C, and I want to back them up to server S,
it'll automatically remove duplicated data between them. Now, we can
do that using chunk hashes even if the contents of each chunk are
encrypted - but as Peter McCurdy brought up in the earlier thread on
encryption, just knowing a particular hash is present can give you key
information about a system (eg. that /etc/shadow contains a particular
default password). So if you're paranoid about that (which actually
I'm not, so I don't really care) you basically can't do deduplication
between client machines. That loses a huge part of bup's advantage
for me. If you (like me) don't care about that, you can leave the
object hashes visible, but if the chunks from A are encrypted using a
key that's only visible on A, and B has the same chunks, then to
restore a deduplicated backup of B, you need the keys from A *and* B.
How will the user on B get the keys from A?
> The number of files is not sensitive information by itself.
In the crypto industry I think they call that "famous last words."
> Or ask git developers to include this. Yes, it's not as easy as I
> said.
> But when everything else is done (backup + restore with encrypted
> content), adding this is probably not that complex.
Well, if you think it's easy, feel free to start working on some patches :)
Have fun,
Avery