bup already compresses the chunks that it splits out using gzip - just
like git does. The compression ratio isn't ideal, though, since gzip
works better if it has more data to work with (and bup chunks are
usually 8k or less).
I've been thinking that it would be possible to improve the gzip
compression ratio by preloading its dictionary with the previous few
file chunks, but that would slow things down and complicate things a
bit. Even so, bup's compression is already much better than (for
example) bzipping a VMware image.
Have you done any size/speed comparisons between bup's compression and
librsync/duplicity's compression? That would be interesting to see.
LZMA is pretty slow compared to gzip and would greatly slow down your
backups, I think.
> Adding encryption to the mix would really make the perfect backup
> tool.
We had some discussion about this on the mailing list.
It sounds doable (particularly if you didn't want to include the
"distributed repository shared with untrusted partners" part) but I'm
not sure how valuable it would actually be. Encrypted backups ==
backups you're less likely to be able to restore in an emergency. If
you lost all your files, will you still have your key? If your key is
just a password (as opposed to a really secure 256-bit symmetric key
or whatever), is it even worth anything?
> There is a similar project called duplicity which uses librsync +
> encryption + gzip compression.
> The main issue with duplicity is that it uses a lot of different
> processes / libs to perform its work (like gpg), and it's slow.
Yeah, gpg is *really* slow, so that would be a pretty bad idea for
performance. Of course, with cryptography, there's often a tradeoff
between speed and security. :(
Have fun,
Avery
Only if you don't lose your key. But okay, fair enough.
> I'm not an expert, but I've written crypto code multiple times in the
> past, and the simplest yet efficient way would be to perform this:
> 1) A user-secret password is derived (with algorithms like mcrypt or
> ccrypt) to decrypt a (newly generated) private key.
> 2) The matching public key is used to encrypt a random symmetric key.
> 3) The symmetric key is used to cipher the data.
> 4) When a given amount of data has been encrypted, a new random
> symmetric key is used, and the public key encrypts it.
This method has a few problems. Basically, you're adding a lot of
layers that don't seem to improve the cryptography at all.
I'm certainly not a cryptographer, but my understanding is that the
mathematically weakest part of cryptography at the moment is stuff
based on public keys. Symmetric key encryption, when you have a good
key generation algorithm, is on much firmer ground. Now, people
obviously depend on public keys, so they're not that weak, but I don't
see how adding a public/private keypair to the mix (and then
distributing the private key!) can possibly make things better.
To be more specific, including the private key along with the backup
set is roughly equivalent to taking my gpg private key (which is
encrypted with a passphrase) and posting it on my website. That
passphrase is nice, but that's not what you're supposed to use it for.
Second, a cryptosystem is only as strong as its weakest link. In this
case, that's the human-memorizable password in step 1. Using 256-bit
session keys and 4096-bit public keys and 7 layers of encryption
sounds great, but if guessing the password means I can see all the
data, that's going to be by far the easiest vector of attack. And if
so, then we might as well do something trivial like just using a hash
of the password as the symmetric key. What does gpg do when you tell
it to use symmetric encryption? I bet there's no public/private key
stuff involved.
Third, you can't just encrypt all your backup sets separately with
different keys and expect deduplication to work. Restoring a
particular backup set will require retrieving objects from prior
backup sets. So we need some way to support that, which means we
can't just encrypt stuff as a single stream - and that reduces your
security vs. anything that can encrypt an entire stream in one go.
Finally, any encryption scheme will automatically make our
repositories incompatible (at least in some way) with git
repositories. That's kind of too bad, although not the end of the
world.
> The weakness of this algorithm resides in the low entropy of the
> user password, but I've never heard of mcrypt or ccrypt failing yet,
> so I guess it's secure enough.
They will certainly fail if the user is dumb enough to choose an
easily-guessable password :)
But yes, for most people using a password is the only reasonable
option, so that's the best we can do. And if you use a password, then
you need a good algorithm for hashing it into a key, and I gather
those algorithms are fine (I've mostly used bcrypt for password
hashing, but whatever).
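For reference, the standard-library version of "hashing a password into
a key" looks like this (PBKDF2 via hashlib, since Python ships it;
bcrypt and scrypt are the same idea with different cost functions - the
function name is illustrative):

```python
import hashlib
import os


def password_to_key(password: str, salt: bytes, rounds: int = 200_000) -> bytes:
    # Stretch the password: each guess costs the attacker `rounds`
    # HMAC-SHA256 evaluations, and the random salt defeats
    # precomputed tables. Returns a 32-byte symmetric key.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, rounds)


salt = os.urandom(16)                   # stored with the repo; not secret
key = password_to_key("hunter2", salt)  # the actual symmetric key
```

The salt lives alongside the backup; only the password has to be
remembered.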
Have fun,
Avery
I don't think layer 3 is actually necessary. Or rather, there's no
reason it should be using public key crypto. You could just as easily
generate a single strong symmetric key, and encrypt all your session
keys with that.
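As a sketch of that shape only - this is toy code with made-up names,
emphatically not production crypto - "one strong master key wrapping
per-stream session keys" could look like this, using an HMAC-derived
keystream plus a tag so a wrong master key is detected:

```python
import hashlib
import hmac
import os


def wrap_key(master_key: bytes, session_key: bytes) -> bytes:
    # Toy key wrap (NOT production crypto): XOR the session key with a
    # keystream derived from the master key and a fresh nonce, then
    # append an HMAC tag so unwrapping with a wrong key fails loudly.
    assert len(session_key) <= 32
    nonce = os.urandom(16)
    pad = hmac.new(master_key, b"enc" + nonce, hashlib.sha256).digest()
    ct = bytes(a ^ b for a, b in zip(session_key, pad))
    tag = hmac.new(master_key, b"mac" + nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def unwrap_key(master_key: bytes, blob: bytes) -> bytes:
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expect = hmac.new(master_key, b"mac" + nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("wrong master key or corrupted blob")
    pad = hmac.new(master_key, b"enc" + nonce, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(ct, pad))
```

No public key anywhere: the wrapped session keys can sit right next to
the encrypted data, GPG-style.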
Like I said above: I'm pretty sure GPG doesn't use a public/private
keypair when you use its password encryption option (--symmetric),
right? And one would hope that GPG is doing something at least as
secure as what we're trying to do.
>> I'm certainly not a cryptographer, but my understanding is that the
>> mathematically weakest part of cryptography at the moment is stuff
>> based on public keys.
>
> This is wrong. Don't make a rule from (old) RSA weaknesses. Now most
> PKI systems are based on elliptic curve cryptography, whose keys are
> smaller and safer than RSA's.
> In ECC, a 256-bit key is expected to be safe for the next 10 years.
This all still depends on it being hard to solve the discrete logarithm
problem on elliptic curves. If that became untrue, ECC would be obsolete
overnight. My understanding is that overcoming the best symmetric
crypto algorithms - in a fundamental way - is much more difficult. So
if public key crypto doesn't buy us anything over symmetric, we
shouldn't be building it in.
Besides which, is there a standard ECC algorithm to use that's not
patented? All my GPG/SSH keys are either DSA or RSA.
> Also, breaking one ECC key pair doesn't mean you've broken the whole
> cryptosystem.
> If key pairs are generated regularly (for example, every 128MB of data),
> only the slice you've broken is recoverable (which, IMHO, is not enough
> to get any interesting data, if it's delta-only data).
That doesn't really make sense; if I break the keypair that's being
used to encrypt all the session keys, then fundamentally I can access
the entire repo. Unless I'm storing a whole *batch* of keys locally and
you need the whole batch of them in order to decrypt a backup.
> The idea is to avoid distributing the private key (obviously).
> You save the private key on the "safe" system (that is, my ".bup"
> folder on my computer), basically encrypted with some hash of a
> password.
> The file permissions will be restricted to u:rw (chmod 0600).
> This is exactly what SSH et al. do, as there is no way to get better
> security on a computer.
>
> One can back up his private keys elsewhere, but, again, that's how
> other security-based packages work.
> If you lose your SSH private key, anyone can connect to your server.
That sounds nice in theory, but I certainly don't want to be the guy
who's responsible for producing a backup system where the backed-up
files can't be recovered if you lose your original disk. Then when
someone inevitably loses their key because they ran the system in the
default way, it's my fault.
If we encrypt stuff and expect it to be decryptable using a password,
then we can't require people to *also* have a keyfile lying around.
> Notice that using a symmetric key here doesn't change the deal at all;
> you still have to keep your key from prying eyes.
Right: it doesn't change anything. Except symmetric algorithms are
vastly simpler, so we should be using them unless public key gives us
some kind of advantage.
> Using only the TRANSFORMED(password) as a key for symmetric crypto is
> dumb for multiple reasons:
Yes yes, we obviously want to use strong session keys and not overuse
any one key. I skimmed over that in my email, but it's important.
>> Third, you can't just encrypt all your backup sets separately with
>> different keys and expect deduplication to work. Restoring a
>> particular backup set will require retrieving objects from prior
>> backup sets. So we need some way to support that, which means we
>> can't just encrypt stuff as a single stream - and that reduces your
>> security vs. anything that can encrypt an entire stream in one go.
>
> I can't tell. Duplicity uses signature files (and archives) for
> computing the rolling checksum/deduplication.
> IIRC, librsync supports this: you can compute the exact difference
> between 2 files by only comparing the file on your system to its
> previously saved signature file.
> The signature file is very small compared to the file itself
> (something like 1/16th of the file size).
Well, bup definitely doesn't use the same sort of signatures that
rsync does. The tricky part in bup isn't comparing the signatures -
we have indexfiles for that - but in restoring the data when we come
back later. We might need to pull chunks out of 15 previous backups,
which would involve decrypting each one. If we use (ideally) a
chaining cipher for each file, that would involve reading through the
whole file just to decrypt as little as one block. If we *don't* use
a chaining cipher (or restart the chain for each bup chunk, about 4k)
we potentially reduce the security.
Other backup systems (possibly including duplicity, I don't know) have
big sequential files that they have to read through *anyway*, which
makes the encryption not a big deal. But bup files are random-access.
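One hedged sketch of reconciling random access with per-chunk
encryption - assuming a per-file key, which is my invention here, not
anything bup has - is to derive an independent key for each chunk from
its index, so there is no chain to replay and no single key gets reused
across chunks:

```python
import hashlib
import hmac


def chunk_key(file_key: bytes, chunk_index: int) -> bytes:
    # Derive an independent key per chunk from the file key and the
    # chunk's position, so decrypting chunk N never requires reading
    # (or decrypting) chunks 0..N-1.
    return hmac.new(file_key, chunk_index.to_bytes(8, "big"),
                    hashlib.sha256).digest()
```

Whether restarting the cipher per ~4k chunk this way gives up anything
meaningful versus one long chain is exactly the open question above.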
>> Finally, any encryption scheme will automatically make our
>> repositories incompatible (at least in some way) with git
>> repositories. That's kind of too bad, although not the end of the
>> world.
>
> I think there is an easy workaround. Provide a pipe-filter process for
> this, and it'll probably work correctly.
> So instead of calling "git checkout some-revision" you'd call
> "bupf-git checkout some-revision".
> Most software dealing with git allows you to specify the path to the
> git binary, so it's possible to use them transparently.
That won't work, unfortunately. Or rather, if we did this, the
*filenames* wouldn't be encrypted. Or we could encrypt the filenames
(somehow) but it would still be obvious how many files are in each
directory. We really should be encrypting entire tree objects, not
the individual items in each tree. But if the tree object is
encrypted, git won't be able to use it *at all*. Like I said, that's
not the end of the world - we can just do all the work ourselves in
bup. But it's not as straightforward as you make it sound.
Have fun,
Avery
But why do you think you need to use a public key to generate the
session keys? You can generate them using any random number
generator. You still have to store them somewhere.
Think of it this way: the advantage of a public-key system is that you
can distribute the public key to the *public*, and it won't negatively
impact your security. Who are we going to give the public key to?
The only person we ever want to be able to decrypt the data is the
person who encrypted it in the first place. That's purely symmetric.
> You still have to save those session keys somewhere (not in the same
> place as the server's backup obviously).
That's not *actually* obvious. For example, in the case of SSL or
GPG, the session keys are obviously stored (albeit encrypted) in the
same stream as the data.
I agree that you're better off if you store them elsewhere, but how
much better off? I don't know. Does the security advantage outweigh
the inconvenience disadvantage? I don't know.
I object to just being dogmatic about it, however, since I don't know
of any professionally-designed systems that work using a
separately-stored list of session keys. (Duplicity is a fine product,
but I don't think it was designed by cryptographers. GPG, SSL, and
SSH were, and they inline their session keys.)
> Making your own system with symmetric crypto is a bit more "unsafe",
> as you can make errors (crypto errors are very hard to spot, as only a
> few devs understand them).
We're already talking about putting together an entirely new crypto
system here. It has to be new, because bup's file format is special
and there aren't any pre-existing products (that I know of) that do
the whole end-to-end security.
Well, that's not entirely true: actually bup already uses ssh for
end-to-end transport security. So I trust the fact that people can't
steal your data while it's in flight. If there was a well-designed,
well-audited system for controlled random access to encrypted files on
a disk, I would trust that too. But anything we design here is
effectively going to be crap.
>> Besides which, is there a standard ECC algorithm to use that's not
>> patented? All my GPG/SSH keys are either DSA or RSA.
>
> I think ECC patents are a hoax (or maybe it's possible to patent an
> idea or mathematical equation in the US?).
> In any case, ECC is a problem known since last century, so the patents
> would be invalid anyway.
>
> A specific implementation might be patented, but, for example, I doubt
> OpenSSL's implementation is patented in any way.
Hoax patents are as good as real patents when it comes to losing tons
of money on legal bills. They probably won't sue "bup" as an open
source project with no corporation attached, but any company wanting
to use bup in its products would be at risk.
>> That sounds nice in theory, but I certainly don't want to be the guy
>> who's responsible for producing a backup system where the backed-up
>> files can't be recovered if you lose your original disk. Then when
>> someone inevitably loses their key because they ran the system in the
>> default way, it's my fault.
>
> GPG has big red messages for this. SSH does this too.
> There is no solution against users ignoring what the software says.
Yes there is: make the software do the right thing by default. I
*hate* those idiotic messages from GPG and SSH. They're constantly
whining about how I'm a bad person for not being secure enough, but
they don't actually make it easy to be a good person. So I, like
everybody else, ignore them.
>> If we encrypt stuff and expect it to be decryptable using a password,
>> then we can't require people to *also* have a keyfile lying around.
>
> Good point. I'm viewing my SSH private key like my home's keys.
That's a terrible analogy. If I lose the keys to my house, I can just
hire a locksmith and be into my house within 30 minutes. If I lose
the super-secure private key that I used to encrypt all my backups,
I'll probably *never* get back in (unless maybe I can afford to hire
the NSA). The data will be lost forever.
That's not actually the level of security normal people want.
> If you update the keys for each file (or every 1MB, for example), then
> it's only that part that should be downloaded.
> I don't think it's so bad to add 32 bytes (new random key) every 4kB
> of data.
> I don't think it decreases the security anyway if you do that.
Well, there's nothing really stopping us from doing a key per 4kb...
except if we're keeping those locally, that's 32 bytes of key per 4k
chunk - about 1/128th of your data, so a couple of gigabytes of key
material if you're backing up a few hundred gigs. Which is not *so*
bad I guess, but that's a couple of gigs of stuff that you now have
to... back up somewhere. So that you can access your backups.
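For the record, the overhead arithmetic (under the per-4kB and per-1MB
granularities from the quote above) works out like this:

```python
def key_overhead(data_bytes: int, bytes_per_key: int, key_bytes: int = 32) -> int:
    # One key_bytes-sized key for every bytes_per_key of data,
    # rounding the chunk count up.
    return -(-data_bytes // bytes_per_key) * key_bytes


GiB = 2 ** 30
per_4k = key_overhead(200 * GiB, 4096)     # 32/4096 = 1/128 of the data
per_1m = key_overhead(200 * GiB, 2 ** 20)  # one key per megabyte
```

At one key per 4kB, 200 gigs of backups carries about 1.6 GiB of key
material; at one key per 1MB it drops to a little over 6 MiB.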
Further, one of bup's big advantages is that it can deduplicate data
between completely separate backup sets. For example, if I have
workstations A, B, and C, and I want to back them up to server S,
it'll automatically remove duplicated data between them. Now, we can
do that using chunk hashes even if the contents of each chunk are
encrypted - but as Peter McCurdy brought up in the earlier thread on
encryption, just knowing a particular hash is present can give you key
information about a system (eg. that /etc/shadow contains a particular
default password). So if you're paranoid about that (which actually
I'm not, so I don't really care) you basically can't do deduplication
between client machines. That loses a huge part of bup's advantage
for me. If you (like me) don't care about that, you can leave the
object hashes visible, but if the chunks from A are encrypted using a
key that's only visible on A, and B has the same chunks, then to
restore a deduplicated backup of B, you need the keys from A *and* B.
How will the user on B get the keys from A?
> The number of files is not sensitive information by itself.
In the crypto industry I think they call that "famous last words."
> Or ask git developers to include this. Yes, it's not as easy as I
> said.
> But when everything else is done (backup + restore with encrypted
> content), adding this is probably not that complex.
Well, if you think it's easy, feel free to start working on some patches :)
Have fun,
Avery