bup encryption and distribution


Zoran Zaric

Apr 16, 2010, 2:49:14 PM
to bup-...@googlegroups.com
hey,

i have an idea and am interested in your thoughts about it.

I read the README and the DESIGN document; I hope the terminology I use
is correct.

I think bup could be extended into a p2p backup system. If I understand git
and bup correctly, the chunks in packfiles are reused. This deduplication
could be used to store not only MY file revisions efficiently, but
OTHERS', too.

Of course chunks I send to other people should be encrypted. GnuPG
should be fine for this.

What do you think?

Cheers,
Zoran




Avery Pennarun

Apr 16, 2010, 4:33:51 PM
to Zoran Zaric, bup-...@googlegroups.com
On Fri, Apr 16, 2010 at 2:49 PM, Zoran Zaric <li...@zoranzaric.de> wrote:
> I think bup could be extended into a p2p backup system. If I understand git
> and bup correctly, the chunks in packfiles are reused. This deduplication
> could be used to store not only MY file revisions efficiently, but
> OTHERS', too.

Yeah, this does indeed sound fun. And letting multiple people share
the same repo could result in a pretty big space savings...

> Of course chunks I send to other people should be encrypted. GnuPG
> should be fine for this.

...but encryption is a bit of a problem. You can't really deduplicate
chunks between users if the chunks are all encrypted using different
people's keys. Even if you could do it somehow, then when person #2
went to restore his file and it gave him a chunk from person #1, he
wouldn't have the right key to decrypt it.

Now of course, there are other options. If you can trust the server
(and who says you can't?), you could just store the data unencrypted
on the server, but only transmit it back and forth using SSL and only
to/from people with a valid public key.

Another option would be to encrypt each block using a function of the
block itself. For example, we name chunks based on their sha1sum; we
could *encrypt* the blocks using, say, their sha256 (something other
than their sha1) as a symmetric key. Only people who know the
original block's content can calculate the block's sha256, and thus
the encryption key, but *everybody* who knows the original block's
content will calculate the same one, so deduplication is still
possible.

After that, you just store an index of sha1->sha256 mappings, gpg
that, and also store it in bup. (Obviously that part will be
non-deduplicatable, but it's a heckuvalot smaller than the original
data so it's probably okay.)
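A minimal Python sketch of the scheme described above, assuming a toy XOR keystream as a stand-in for a real symmetric cipher (it is NOT secure; it's only here to show the deduplication property):

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: hash the key with a counter. NOT a real cipher --
    # a stand-in for a proper symmetric cipher like AES.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(chunk: bytes):
    """Name the chunk by its sha1; encrypt it under its own sha256."""
    name = hashlib.sha1(chunk).hexdigest()   # public chunk id
    key = hashlib.sha256(chunk).digest()     # derivable only from the content
    cipher = bytes(a ^ b for a, b in zip(chunk, keystream(key, len(chunk))))
    return name, key, cipher

# Two users holding the same chunk produce identical names and
# ciphertexts, so the server can deduplicate them.
chunk = b"some shared file block"
name1, key1, ct1 = convergent_encrypt(chunk)
name2, key2, ct2 = convergent_encrypt(chunk)
assert ct1 == ct2 and name1 == name2

# Each user keeps a private index of name -> key, gpg-encrypted.
index = {name1: key1}
```

Only the small name->key index needs conventional (gpg) protection; the bulk data stays shareable.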

But I'm not a cryptographer so my idea probably has fatal flaws.
Anybody want to point out what they are?

Have fun,

Avery

Zoran Zaric

Apr 16, 2010, 5:02:17 PM
to Avery Pennarun, bup-...@googlegroups.com
On Friday, Apr 16, 2010 at 16:33 -0400, Avery Pennarun wrote:
> ...but encryption is a bit of a problem. You can't really deduplicate
> chunks between users if the chunks are all encrypted using different
> people's keys. Even if you could do it somehow, then when person #2
> went to restore his file and it gave him a chunk from person #1, he
> wouldn't have the right key to decrypt it.

I don't know how well deduplication works with binary (encrypted)
data; I assumed it works pretty well :) This might be my initial mistake.

I just tested adding some 10M files with /dev/urandom-data to a git index and
well... not much deduplication... damn. Any experience on a bigger scale?

> Now of course, there are other options. If you can trust the server
> (and who says you can't?), you could just store the data unencrypted
> on the server, but only transmit it back and forth using SSL and only
> to/from people with a valid public key.

When it comes to backups of things like passwords and personal data, I
don't trust anybody.

I'm thinking of a p2p backup net that just anybody can join, not just trusted
parties. But this might be an option, too.

> Another option would be to encrypt each block using a function of the
> block itself. For example, we name chunks based on their sha1sum; we
> could *encrypt* the blocks using, say, their sha256 (something other
> than their sha1) as a symmetric key. Only people who know the
> original block's content can calculate the block's sha256, and thus
> the encryption key, but *everybody* who knows the original block's
> content will calculate the same one, so deduplication is still
> possible.

Perhaps I'm just misunderstanding, but if I have to know the content of
a block to decrypt it, and I need my BACKUP, I think I have a problem...

> After that, you just store an index of sha1->sha256 mappings, gpg
> that, and also store it in bup. (Obviously that part will be
> non-deduplicatable, but it's a heckuvalot smaller than the original
> data so it's probably okay.)
>
> But I'm not a cryptographer so my idea probably has fatal flaws.
> Anybody want to point out what they are?
>
> Have fun,
>
> Avery

Cheers Zoran

Avery Pennarun

Apr 16, 2010, 5:15:06 PM
to Zoran Zaric, bup-...@googlegroups.com
On Fri, Apr 16, 2010 at 5:02 PM, Zoran Zaric <li...@zoranzaric.de> wrote:
> On Friday, Apr 16, 2010 at 16:33 -0400, Avery Pennarun wrote:
>> ...but encryption is a bit of a problem.  You can't really deduplicate
>> chunks between users if the chunks are all encrypted using different
>> people's keys.  Even if you could do it somehow, then when person #2
>> went to restore his file and it gave him a chunk from person #1, he
>> wouldn't have the right key to decrypt it.
>
> I don't know how well deduplication works with binary (encrypted)
> data; I assumed it works pretty well :) This might be my initial mistake.
>
> I just tested adding some 10M files with /dev/urandom-data to a git index and
> well... not much deduplication... damn. Any experience on a bigger scale?

Haha, well, random data doesn't *have* duplication to remove. If it
did, it wouldn't be very random, would it?

Similarly, encryption deliberately mangles your data as much as
possible. If two people encrypt the same thing, the results will be
completely different; if they weren't, then someone without your key
would be able to decrypt it. So you can't deduplicate either.

bup works great at deduplicating *binary* data (and text data too of
course), but it isn't magic.
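To make the point concrete, here's a toy sketch (a hash-derived XOR keystream standing in for a real cipher): encrypting the same block under two different keys yields completely different ciphertexts, so there's nothing left to deduplicate.

```python
import hashlib

def toy_encrypt(key: bytes, chunk: bytes) -> bytes:
    # Stand-in for a real cipher: XOR with a hash-derived keystream.
    stream = b""
    n = 0
    while len(stream) < len(chunk):
        stream += hashlib.sha256(key + n.to_bytes(4, "big")).digest()
        n += 1
    return bytes(a ^ b for a, b in zip(chunk, stream))

chunk = b"identical block on two machines"
ct_alice = toy_encrypt(b"alice's key", chunk)
ct_bob = toy_encrypt(b"bob's key", chunk)

# Different keys -> completely different ciphertexts: no deduplication.
assert ct_alice != ct_bob
```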

> I'm thinking of a p2p backup net that just anybody can join, not just trusted
> parties. But this might be an option, too.

I agree that would be better :)

>> Another option would be to encrypt each block using a function of the
>> block itself.  For example, we name chunks based on their sha1sum; we
>> could *encrypt* the blocks using, say, their sha256 (something other
>> than their sha1) as a symmetric key.  Only people who know the
>> original block's content can calculate the block's sha256, and thus
>> the encryption key, but *everybody* who knows the original block's
>> content will calculate the same one, so deduplication is still
>> possible.
>
> Perhaps I'm just misunderstanding, but if I have to know the content of
> a block to decrypt it, and I need my BACKUP, I think I have a problem...

Under my proposal, basically your backup is divided into two parts.
First is the sha1->sha256 mapping, which is encrypted using your gpg
key or a password or whatever. You can restore this part knowing just
the gpg key.

Once you have the first part, you can use the mapping to decrypt all
your other blocks, which, as it happens, everyone will have encrypted in the
same way as you (ie. everyone's sha1->sha256 mapping is the same), so
deduplication is possible.

It seems a bit too good to be true though. Looking forward to being
shot down by someone who knows better :)

Have fun,

Avery

Zoran Zaric

Apr 16, 2010, 6:12:40 PM
to Avery Pennarun, bup-...@googlegroups.com
On Friday, Apr 16, 2010 at 17:15 -0400, Avery Pennarun wrote:
> On Fri, Apr 16, 2010 at 5:02 PM, Zoran Zaric <li...@zoranzaric.de> wrote:
> > On Friday, Apr 16, 2010 at 16:33 -0400, Avery Pennarun wrote:
> >> ...but encryption is a bit of a problem. You can't really deduplicate
> >> chunks between users if the chunks are all encrypted using different
> >> people's keys. Even if you could do it somehow, then when person #2
> >> went to restore his file and it gave him a chunk from person #1, he
> >> wouldn't have the right key to decrypt it.
> >
> > I don't know how well deduplication works with binary (encrypted)
> > data; I assumed it works pretty well :) This might be my initial mistake.
> >
> > I just tested adding some 10M files with /dev/urandom-data to a git index and
> > well... not much deduplication... damn. Any experience on a bigger scale?
>
> Haha, well, random data doesn't *have* duplication to remove. If it
> did, it wouldn't be very random, would it?

OK, I think I assumed a little too much magic ;)


> Under my proposal, basically your backup is divided into two parts.
> First is the sha1->sha256 mapping, which is encrypted using your gpg
> key or a password or whatever. You can restore this part knowing just
> the gpg key.
>
> Once you have the first part, you can use the mapping to decrypt all
> your other blocks, which, as it happens, everyone will have encrypted in the
> same way as you (ie. everyone's sha1->sha256 mapping is the same), so
> deduplication is possible.

OK, I'll try to explain it to myself in my own words and you tell me if I got
it.

Here's an ASCII graph of the normal hashsplitting and saving steps:

+----------------+                  +-++-++-++-++-++-+
|      file      | --------------> |b||b||b||b||b||b|
+----------------+  hashsplitting  +-++-++-++-++-++-+
                                            |
    Legend                                  | save
    b - block                               v
                                 +--------------------+
                                 |      packfile      |
                                 |                    |
                                 | +-++-++-++-++-++-+ |
                                 | |b||b||b||b||b||b| |
                                 | +-++-++-++-++-++-+ |
                                 | +-++-++-++-++-++-+ |
                                 | |b||b||b||b||b||b| |
                                 | +-++-++-++-++-++-+ |
                                 |                    |
                                 +--------------------+

Using your encryption technique we'd get something like this:

+----------------+                  +-++-++-++-++-++-+
|      file      | --------------> |b||b||b||b||b||b|
+----------------+  hashsplitting  +-++-++-++-++-++-+
                                            |
                               +-----------------+
                               |                 |
                              +-+                |
                           ---|f|----------------+
                           |  +-+                |
                           v                     v
                   +-----------+       +--------------------+
                   |   Table   |       |      packfile      |
                   +-----------+       |                    |
                   |sha1|sha256|       | +-++-++-++-++-++-+ |
                   |sha1|sha256|       | |x||x||x||x||x||x| |
                   |sha1|sha256|       | +-++-++-++-++-++-+ |
                   +-----------+       | +-++-++-++-++-++-+ |
                                       | |x||x||x||x||x||x| |
    Legend                             | +-++-++-++-++-++-+ |
    b - block                          |                    |
    x - crypted                        +--------------------+
        block
    f - sha256

Did I get that right?

> It seems a bit too good to be true though. Looking forward to being
> shot down by someone who knows better :)

OK, that really sounds good. How much work do you expect a proof of
concept to be?

Sadly I haven't written any Python code yet, but I might just start
learning it.

Cheers,
Zoran

Thomas Kula

Apr 16, 2010, 11:14:22 PM
to bup-...@googlegroups.com
On Fri, Apr 16, 2010 at 05:15:06PM -0400, Avery Pennarun wrote:
>
> Under my proposal, basically your backup is divided into two parts.
> First is the sha1->sha256 mapping, which is encrypted using your gpg
> key or a password or whatever. You can restore this part knowing just
> the gpg key.
>
> Once you have the first part, you can use the mapping to decrypt all
> your other blocks, which, as it happens, everyone will have encrypted in the
> same way as you (ie. everyone's sha1->sha256 mapping is the same), so
> deduplication is possible.
>
> It seems a bit too good to be true though. Looking forward to being
> shot down by someone who knows better :)

I've been following bup for a while, since I came across it during some
other stuff I've been working on. Thought I'd pipe up here.

First, this stuff is always tricky, and I'm not a cryptographer,
but I think essentially what you are doing is taking a chunk
(let's call it chunk A) and encrypting it with some key. It just
so happens that the key itself is derivable from the contents of
the chunk itself. You are storing the key in a file (let's call it
a keylist file) that is encrypted more conventionally, with a
passphrase or private key or whatnot.

In order to decrypt the chunk you need the key. The only way to get
the key is:

- Know the password/private key to the keylist file.
  You have to keep this secret anyway; nothing new here.
- Have the contents of the chunk already. If you have the
  contents of the chunk already, you don't really need to
  decrypt it....

As has been said before, encrypting each block with a unique
key and storing all of those unique keys in a keylist file that
is encrypted with something an end-user supplies is nothing new.
You could certainly use random keys for each unique key; it's
just that by using a key derived from the chunk itself, you are
getting deduplication benefits, because any other user that
happens to have the exact same chunk will store the encrypted
data exactly the same way. And because any other user that happens to
have the exact same chunk already has that chunk, you don't have
to worry about them knowing the decryption key.

Also, I do not think you need to use a different hash to derive
the chunk key. I think you can salt the chunk itself with some
data and use the sha1 hash of that to derive the key. That
key will be different from the sha1 hash that identifies the
chunk, and still will not be derivable without knowing the contents
of the chunk. An example:

Say you have a chunk whose contents are:

File Contents\n

The sha1 hash of that is 4bbcd2924d4ba804ea743175ec5ab83b76b2cc12

To generate the key, prepend the chunk with some agreed upon salt,
say:

bupencryptedfileaes-256-cbcFile Contents\n

That is, the string "bupencryptedfile" concatenated with "aes-256-cbc"
(an example encryption algorithm I pulled out of thin air) concatenated
with the chunk contents. This salt is completely arbitrary; you just
need to create *some* convention and follow it --- there may be some
reason to have a salt of a particular length, or at least of some
minimum length, but I'm not sure offhand if that's the case here.

The sha1 of that is c09629c703cffc6c143466f61a3a91854336033c. That's what
you use as the key (or derive the key from).

Again, that key cannot be derived unless you know the contents of the
chunk to begin with, and it is radically different from the sha1 identifier
of the original chunk. I think you may want to prepend the salt rather than
append it, since I have this suspicion (I have no idea if it is
actually true) that if you know the sha1 hash of a stream, you can derive
the sha1 hash of the stream + some other stuff at the end,
without knowing the contents of the stream. I have no idea how much state
sha1 keeps around, and whether this is actually possible, but I think it certainly
won't work if you prepend the salt. It's still fairly efficient to do this:
as you are chunking up a file, you just keep track of two sha1 hashes; you
just happen to pre-populate the second one with the salt.
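In Python's hashlib, this idea (the salt string here is just the example convention from above) looks like:

```python
import hashlib

SALT = b"bupencryptedfile" + b"aes-256-cbc"  # any agreed-upon convention

chunk = b"File Contents\n"

# Public chunk id: plain sha1 of the contents.
chunk_id = hashlib.sha1(chunk).hexdigest()

# Chunk key: sha1 of salt + contents -- only derivable with the chunk in hand.
chunk_key = hashlib.sha1(SALT + chunk).hexdigest()

assert chunk_id != chunk_key

# While hashsplitting you can keep both digests running at once,
# pre-populating the second with the salt:
h_id = hashlib.sha1()
h_key = hashlib.sha1(SALT)
for piece in (b"File ", b"Contents\n"):
    h_id.update(piece)
    h_key.update(piece)
assert h_id.hexdigest() == chunk_id
assert h_key.hexdigest() == chunk_key
```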


Just an idea, I'd like to know what other folks think.

--
Thomas L. Kula | ku...@tproa.net | http://kula.tproa.net/

Avery Pennarun

Apr 17, 2010, 1:42:45 AM
to Thomas Kula, bup-...@googlegroups.com
On Fri, Apr 16, 2010 at 11:14 PM, Thomas Kula <ku...@tproa.net> wrote:
> First, this stuff is always tricky, and I'm not a cryptographer,
> but I think essentially what you are doing is taking a chunk
> (let's call it chunk A) and encrypting it with some key. It just
> so happens that the key itself is derivable from the contents of
> the file itself. You are storing the key in a file (let's call it
> a keylist file) that is encrypted more conventially, with a
> passphrase or private key or whatnot.

Mmm, a keylist. That totally sounds like we know what we're doing. I like it.

It occurred to me that even the keylist is mostly redundant with other
people's copies. If we divided it up into chunks somehow - perhaps
one chunk per git tree - we could deduplicate *that* part too.
Fundamentally, the only part we need to encrypt with a passphrase (or
whatever) is a single hash that points at the toplevel keylist object.

> You could certainly use random keys for each unique key; it's
> just that by using a key derived from the chunk itself, you are
> getting deduplication benefits, because any other user that
> happens to have the exact same chunk will store the encrypted
> data exactly the same way. And because any other user that happens to
> have the exact same chunk already has that chunk, you don't have
> to worry about them knowing the decryption key.

Yup. After a few more hours of thinking about it, I think there's
only one *primary* flaw in this whole scheme: if I know the contents
of /etc on my system, and I know what blocks you backed up (and
especially if I know in which order you backed them up) then I can
tell a little bit about your system.

Put another way, I can play a plaintext-guessing attack against you:
try a plaintext, see if it matches any of the blocks you backed up.
With normal encryption (where the attacker doesn't know the key at
all) this kind of attack should be impossible.

I don't know the ramifications of this though. Maybe nothing
important. After all, that's also an attribute of sha1, and that
attribute doesn't help people make fake cookies, etc.
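A sketch of that guessing attack, assuming the attacker can see the (public) sha1 chunk names and the shadow-file lines are made-up illustrative data:

```python
import hashlib

# Public information: the sha1 names of the chunks Alice backed up.
stored_names = {hashlib.sha1(b"root:$1$abcd$guess1:14000:0:99999:7:::\n").hexdigest()}

# Eve hashes plausible plaintexts -- no key needed, because the chunk
# name is a deterministic function of the content.
guesses = [
    b"root:$1$abcd$guess1:14000:0:99999:7:::\n",
    b"root:$1$abcd$guess2:14000:0:99999:7:::\n",
]
hits = [g for g in guesses if hashlib.sha1(g).hexdigest() in stored_names]
assert hits == [guesses[0]]  # the first guess matches a stored chunk
```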

> Also, I do not think you need to use a different hash to derive
> the chunk key. I think you can salt the chunk itself with some
> data and use the sha1 hash of that to derive the key. That
> key will be different from the sha1 hash that identifies the
> chunk, and still will not be derivable without knowing the contents
> of the chunk. An example:

Ooh, an excellent point. That sounds much better than a totally
separate algorithm, you're right.

> I think you may want to prepend the salt rather than
> append the salt, since I have this suspicion (I have no idea if it is
> actually true) that if you know the sha1 hash of a stream you can derive
> the sha1 hash of what the stream + some other stuff at the end would be,
> without knowing the contents of the stream.

Good suspicion - I think exactly that problem was the basis of a
Facebook and/or Twitter cookie attack a few months back. However,
it's also another reminder that misusing hashes for things they
weren't designed to do (like generating encryption keys, in this case)
can lead to very surprising failures. Thus I'm nervous. A research
paper telling me this is a good idea would alleviate my fears a bit :)

Have fun,

Avery

Avery Pennarun

Apr 17, 2010, 1:50:26 AM
to Zoran Zaric, bup-...@googlegroups.com
On Fri, Apr 16, 2010 at 6:12 PM, Zoran Zaric <li...@zoranzaric.de> wrote:
> Ok I'll try to explain it to myself in my words and you tell me if I got
> it.
[...]
> Did I get that right?

Yes, with one additional clarification: the sha1->sha256 mapping table
can *also* be stored as an object in the packfile. So you're still
only storing a bunch of git objects.

(Read Thomas Kula's email for why we should just use a seeded-sha1
instead of sha256 though.)

>> It seems a bit too good to be true though.  Looking forward to being
>> shot down by someone who knows better :)
>
> Ok that really sounds good, how much work do you expect a proof of
> concept to be?

In fact, I think it wouldn't really be all that hard. To get started,
you could just change git.PackWriter to encrypt every block and keep
an extra sha1->key mapping somewhere. Then you would want to write
out that mapping (encrypted using a user-defined key) once at the very
end. (Eventually you'd want to implement it using some sort of
tree/fanout so that you don't end up with 400 million mapping entries
while doing a backup, though.)

Then, on restore, you'd need to open up the sha1->key list, followed
by the individual objects, and decrypt them by looking up their keys
in the list.

That would be the proof of concept. The details to make it fast and
scalable are, as usual, a bit more work and I certainly haven't
thought it all the way through yet :)
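For illustration, here's a self-contained sketch of that flow. The real thing would hook into bup's git.PackWriter; here a dict stands in for the packfile and a toy XOR keystream stands in for a real cipher:

```python
import hashlib

class EncryptingStore:
    """Proof-of-concept shape: encrypt each block on write, remember a
    sha1 -> key mapping, and use the mapping again on restore. The
    keylist would be written out gpg-encrypted at the end of a backup."""

    def __init__(self):
        self.pack = {}     # sha1 hex -> encrypted block ("packfile")
        self.keylist = {}  # sha1 hex -> key (encrypt this with the user's key)

    def _crypt(self, key: bytes, data: bytes) -> bytes:
        # Toy XOR keystream; symmetric, so it both encrypts and decrypts.
        stream = b""
        n = 0
        while len(stream) < len(data):
            stream += hashlib.sha256(key + n.to_bytes(4, "big")).digest()
            n += 1
        return bytes(a ^ b for a, b in zip(data, stream))

    def write(self, block: bytes) -> str:
        name = hashlib.sha1(block).hexdigest()
        key = hashlib.sha1(b"somesalt" + block).digest()  # salted-sha1 key
        self.pack[name] = self._crypt(key, block)
        self.keylist[name] = key
        return name

    def read(self, name: str) -> bytes:
        return self._crypt(self.keylist[name], self.pack[name])

store = EncryptingStore()
name = store.write(b"hello block")
assert store.read(name) == b"hello block"
```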

Have fun,

Avery

Peter McCurdy

Apr 17, 2010, 3:54:02 AM
to Avery Pennarun, Thomas Kula, bup-...@googlegroups.com
The word you're looking for is HMAC: http://en.wikipedia.org/wiki/HMAC
- if Twitter had remembered this, they'd have been much happier. But
yes, this just goes to prove again the first rule of designing
cryptographic systems: Don't Do It. But that's hardly stopped anyone
before (see Twitter), so wheee, off we go!
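For the record, the fix is a one-liner in Python: hmac.new with the salt as the key replaces a hand-rolled salted hash (the salt string here is just an example):

```python
import hmac
import hashlib

chunk = b"File Contents\n"
salt = b"bupencryptedfile"

# Hand-rolled salted hash (prone to length-extension issues when the
# salt is appended rather than prepended):
naive_key = hashlib.sha1(salt + chunk).hexdigest()

# HMAC-SHA1: the standard keyed-hash construction, immune to
# length-extension tricks by design.
hmac_key = hmac.new(salt, chunk, hashlib.sha1).hexdigest()

assert naive_key != hmac_key
```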

The second rule of designing cryptographic systems is that you can
only judge a system against the expected threats and usage, which we
haven't really gotten into. My personal interest with bup and
cryptography would be to not care about multi-user encrypted
deduplication; what I want is to make a remote backup to a
semi-trusted server (e.g. some shared host somewhere) and for it to be
impossible for anyone with access to this backup to get any of my
private data. Other people's data is their own problem; they can make
their own backups to their own servers.

For this threat model, the simplest (and therefore massively best, in
cryptanalysis terms) solution is to gpg-encrypt every chunk with my
private key, and leave it at that. I'll still be leaking some
information compared with backing up a big fat GPG-encrypted tarball
(e.g. the distribution of the chunks in my backup data, filenames,
file sizes, etc) but for my purposes, it's probably worth the time and
space savings.

If you do want multi-user deduplication, there's no two ways about it,
you will inherently be leaking information, to wit: whether or not
anyone else has saved the same chunk. In certain contexts, this can,
in fact, be troublesome.

For instance, imagine two users who each set up a new standard
installation of, say, Ubuntu, and the first thing they do right away
is back up /etc with bup. The only difference between the two backups
will pretty much be /etc/passwd and /etc/shadow, and they'll only be
off by one line each. A malicious user (Eve, by convention, short for
Eavesdropper) could mount a brute-force offline attack against the
encrypted password file of the other sap (Alice). This is rather more
computational work than just trying to brute-force an SSH login, but
not overwhelmingly so (Eve just has to guess Alice's user name, salt,
and password, the username is probably known and there's only 4096
different salt values) and the fact that Eve can do this offline is a
pretty big advantage: computers are rather fast these days, this is
massively parallelizable, and Eve can avoid the network and all its
attendant delays and IDSes. And come to think of it, they don't even
have to have backed up at the same time; at any point, Eve can try to
find any version of /etc/shadow that Alice (or any other user) has
ever saved.

Maybe this isn't something any of you care about; if so, then the
proposed scheme looks good as far as I can tell. But I'll again refer
you to the first rule: these things can be tremendously tricky and
subtle, as Twitter found out. My advice would be that, unless you
really, really need multiuser deduplication, you should try to keep
things as simple as possible for now and just skip it.

--
Peter McCurdy
Co-founder, The Navarra Group
http://navarra.ca

Thomas Kula

Apr 17, 2010, 8:12:52 AM
to bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 03:54:02AM -0400, Peter McCurdy wrote:
>
> Maybe this isn't something any of you care about; if so, then the
> proposed scheme looks good as far as I can tell. But I'll again refer
> you to the first rule: these things can be tremendously tricky and
> subtle, as Twitter found out. My advice would be that, unless you
> really, really need multiuser deduplication, you should try to keep
> things as simple as possible for now and just skip it.
>

I would certainly agree whole-heartedly about the first rule,
and probably wouldn't actually use this backup scheme unless
someone who was a real cryptographic expert (read: not me)
vetted the thing.




Thomas Kula

Apr 17, 2010, 8:19:41 AM
to bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 01:42:45AM -0400, Avery Pennarun wrote:
> Mmm, a keylist. That totally sounds like we know what we're doing. I like it.
>
> It occurred to me that even the keylist is mostly redundant with other
> people's copies. If we divided it up into chunks somehow - perhaps
> one chunk per git tree - we could deduplicate *that* part too.
> Fundamentally, the only part we need to encrypt with a passphrase (or
> whatever) is a single hash that points at the toplevel keylist object.

That I do not think is true. Now all you are protecting with a secret
is where to go to find all the other secrets. It means that there will
exist a chunk somewhere in the repository that contains a naked
chunkid->encryption key mapping, and all you have to do is scan the
entire repository to find it. Tiresome, but highly automatable and
completely trivial.

There are some things you do not want to de-dup. The list of your
encryption keys is, I believe, one of those.



Avery Pennarun

Apr 17, 2010, 2:33:55 PM
to Peter McCurdy, Thomas Kula, bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 3:54 AM, Peter McCurdy <peter....@gmail.com> wrote:
> On Sat, Apr 17, 2010 at 1:42 AM, Avery Pennarun <apen...@gmail.com> wrote:
>> On Fri, Apr 16, 2010 at 11:14 PM, Thomas Kula <ku...@tproa.net> wrote:
> But yes, this just goes to prove again the first rule of designing
> cryptographic systems: Don't Do It.  But that's hardly stopped anyone
> before (see Twitter), so wheee, off we go!
>
> The second rule of designing cryptographic systems is that you can
> only judge a system against the expected threats and usage, which we
> haven't really gotten into.

Yes, I certainly appreciate these two rules. To put it in context,
I've been "successful" at this before, in the sense that I made
an open-source package that used flawed encryption to provide a useful
service; the data it was protecting was never important enough to
steal, but leaving it in plaintext wouldn't have been good enough.
In fact, the flawed encryption was never pointed out to me; I
only figured out some of the flaws myself, five years later.

> My personal interest with bup and
> cryptography would be to not care about multi-user encrypted
> deduplication; what I want is to make a remote backup to a
> semi-trusted server (e.g. some shared host somewhere) and for it to be
> impossible for anyone with access to this backup to get any of my
> private data.  Other people's data is their own problem; they can make
> their own backups to their own servers.

To be honest, bup so far has been designed with an even simpler threat
model: only trusted people have access to my servers, so I don't need
any encryption at all. It's actually working fine with that model so
far, and we shouldn't be too quick to discount it :)

I also think your model above is a very sensible one, especially for
people who might want to store on a hosting provider somewhere, or S3,
or whatever, where you don't completely trust the participants but
you're the one who ends up paying for storage.

At the same time, full-on peer-to-peer deduplication + encryption is
pretty much the holy grail of distributed storage systems. So it
seems a shame to discount it too quickly :)

> For this threat model, the simplest (and therefore massively best, in
> cryptanalysis terms) solution is to gpg-encrypt every chunk with my
> private key, and leave it at that.  I'll still be leaking some
> information compared with backing up a big fat GPG-encrypted tarball
> (e.g. the distribution of the chunks in my backup data, filenames,
> file sizes, etc) but for my purposes, it's probably worth the time and
> space savings.

People keep saying things like that - and other backup systems I've
seen claim to do it this way - but is this even remotely plausible?
Can you run gpg on every 8k block separately and hope to get even
halfway sane performance? There's public key crypto in every single
block that way, AFAIK.

On the other hand, if you do the public key crypto once, eg. to
encrypt a symmetric key, then you're overusing a symmetric key and I
think that leads to problems too (which is why things like SSL change
their session key occasionally).

gpg was originally designed for sending emails back and forth. Does
encrypting terabytes of data in 8k chunks (basically billions of small
messages) violate its assumptions to a degree that weakens the
encryption? Beats me.

My point here is just that even applying gpg to the problem is not
exactly the same as using an existing cryptosystem, since the
"existing" cryptosystem was designed for a different purpose. In some
ways it's the same mistake twitter made when applying SHA1 where they
should have used HMAC-SHA1.

> If you do want multi-user deduplication, there's no two ways about it,
> you will inherently be leaking information, to wit: whether or not
> anyone else has saved the same chunk.  In certain contexts, this can,
> in fact, be troublesome.
>
> For instance, imagine two users who each set up a new standard
> installation of, say, Ubuntu, and the first thing they do right away
> is back up /etc with bup.  The only difference between the two backups
> will pretty much be /etc/passwd and /etc/shadow, and they'll only be
> off by one line each. [...]

That's a very good example indeed :) On the other hand, the
cost/benefit ratio in this example has a bit of an exciting symmetry
to it. Alice's entire Ubuntu system's backup can be only one block
long :) The huge benefit here could outweigh the danger of revealing
/etc/shadow... or maybe it just means we need to think harder so that
successful attacks like this become implausible.

One idea that comes to mind would be to just use really slow
encryption, eg. bcrypt, which is known to make password brute forcing
"too hard." But then of course the backup would be "too slow." :)

...oh hey. If /etc/shadow is using bcrypt or something like it, then
brute forcing is *already* too slow. So maybe this particular example
doesn't work so well anyway. Though I'm sure there are lots of
similar ones.

> Maybe this isn't something any of you care about; if so, then the
> proposed scheme looks good as far as I can tell.  But I'll again refer
> you to the first rule: these things can be tremendously tricky and
> subtle, as Twitter found out.  My advice would be that, unless you
> really, really need multiuser deduplication, you should try to keep
> things as simple as possible for now and just skip it.

I think this is good advice. Maybe the real lesson here is: the
encryption schemes should be pluggable. How can we design an API that
lets us use:

0) no encryption
1) full encryption using a single key (for really important data)
2) deduplicating encryption as I described (for data that's only semi-private)
...
n) Any other method we come up with?

It's basically some sort of blob/tree transform, which ought to be
possible to plug in somewhere without too much trouble. Yay dynamic
languages.

Have fun,

Avery

Avery Pennarun

Apr 17, 2010, 2:36:24 PM
to Thomas Kula, bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 8:19 AM, Thomas Kula <ku...@tproa.net> wrote:
> On Sat, Apr 17, 2010 at 01:42:45AM -0400, Avery Pennarun wrote:
>> Mmm, a keylist.  That totally sounds like we know what we're doing.  I like it.
>> It occurred to me that even the keylist is mostly redundant with other
>> people's copies.  If we divided it up into chunks somehow - perhaps
>> one chunk per git tree - we could deduplicate *that* part too.
>> Fundamentally, the only part we need to encrypt with a passphrase (or
>> whatever) is a single hash that points at the toplevel keylist object.
>
> That I do not think is true. Now all you are protecting with a secret
> is where to go to find all the other secrets. It means that there will
> exist a chunk somewhere in the repository that contains a naked
> chunkid->encryption key mapping, and all you have to do is scan the
> entire repository to find it. Tiresome, but highly automatable and
> completely trivial.
>
> There are some things you do not want to de-dup. The list of your
> encryption keys is, I believe, one of those.

No, you don't understand what I was saying. You're encrypting the
keylist using the same scheme (let's call it "reflexive encryption")
that you used for the rest of the data. So the root of the keylist is
encrypted with gpg; it contains a pointer to the first level's
encrypted keylist, and a key to decrypt it. The first level points to
the second level, and so on. You still can't decrypt any level of the
keylist unless you have the key to that level, so it's as secure as
the rest of the system (which, as we're discussing, is not known for
sure).
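
A toy sketch of that chain, with an XOR keystream standing in for a real symmetric cipher (all names here are illustrative; this is not bup code):

```python
import hashlib

def stream(key: bytes, n: int) -> bytes:
    # Counter-mode hash stream; a stand-in for a real cipher like AES.
    out, i = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        i += 1
    return out[:n]

def xor(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; applying it twice decrypts.
    return bytes(a ^ b for a, b in zip(data, stream(key, len(data))))

store = {}  # blob id -> encrypted bytes (what the shared repo would hold)

# Deepest level: the actual chunk-key mappings.
leaf_key = hashlib.sha256(b"leaf secret").digest()
store["leaf"] = xor(leaf_key, b"chunk keys live here")

# Next level up: points at the leaf and carries its key.
mid_key = hashlib.sha256(b"mid secret").digest()
store["mid"] = xor(mid_key, b"leaf:" + leaf_key)

# Only the root record (pointer + key) needs gpg; modeled as plaintext here.
root = ("mid", mid_key)

# Walking the chain from the root recovers every level in turn.
ptr, key = root
blob = xor(key, store[ptr])
nxt, nxt_key = blob.split(b":", 1)   # first ':' is the one after "leaf"
plain = xor(nxt_key, store[nxt.decode()])
assert plain == b"chunk keys live here"
```

Each level's blob is opaque without the key carried one level above it, which is the property Avery describes.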

Have fun,

Avery

Zoran Zaric

Apr 17, 2010, 2:57:55 PM
to Avery Pennarun, bup-...@googlegroups.com
Am Samstag, den 17.04.2010, 14:33 -0400 schrieb Avery Pennarun:
> > My personal interest with bup and
> > cryptography would be to not care about multi-user encrypted
> > deduplication; what I want is to make a remote backup to a
> > semi-trusted server (e.g. some shared host somewhere) and for it to be
> > impossible for anyone with access to this backup to get any of my
> > private data. Other people's data is their own problem; they can make
> > their own backups to their own servers.
>
> To be honest, bup so far has been designed with an even simpler threat
> model: only trusted people have access to my servers, so I don't need
> any encryption at all. It's actually working fine with that model so
> far, and we shouldn't be too quick to discount it :)

Sure, that never was my goal :)

Having encryption, multi-server/user deduplication and a p2p net should
all be optional choices.

I think encryption and distribution can be designed and developed
independently. Sure, if I used the world-wide p2p option I'd switch on
encryption, but having a little p2p net with people I fully trust would
be an option, too.

A second use case for the "private p2p-net" would be a personal p2p net
for several servers, desktops, notebooks... you name it.

> People keep saying things like that - and other backup systems I've
> seen claim to do it this way - but is this even remotely plausible?
> Can you run gpg on every 8k block separately and hope to get even
> halfway sane performance? There's public key crypto in every single
> block that way, AFAIK.

I'd like to benchmark using gpg on bup's 8k blocks. What is the easiest
way to extract them from the packfiles into single files?

> > Maybe this isn't something any of you care about; if so, then the
> > proposed scheme looks good as far as I can tell. But I'll again refer
> > you to the first rule: these things can be tremendously tricky and
> > subtle, as Twitter found out. My advice would be that, unless you
> > really, really need multiuser deduplication, you should try to keep
> > things as simple as possible for now and just skip it.
>
> I think this is good advice. Maybe the real lesson here is: the
> encryption schemes should be pluggable. How can we design an API that
> lets us use:
>
> 0) no encryption
> 1) full encryption using a single key (for really important data)
> 2) deduplicating encryption as I described (for data that's only semi-private)

3) when backing up, bup encrypts files and directories given via a list,
on a per-file basis, using a pluggable encryption method

> ...
> n) Any other method we come up with?

Cheers,
Zoran

Avery Pennarun

Apr 17, 2010, 3:17:46 PM
to Zoran Zaric, bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 2:57 PM, Zoran Zaric <li...@zoranzaric.de> wrote:
> Am Samstag, den 17.04.2010, 14:33 -0400 schrieb Avery Pennarun:
>> People keep saying things like that - and other backup systems I've
>> seen claim to do it this way - but is this even remotely plausible?
>> Can you run gpg on every 8k block separately and hope to get even
>> halfway sane performance?  There's public key crypto in every single
>> block that way, AFAIK.
>
> I'd like to benchmark using gpg on bup's 8k blocks. What is the easiest
> way to extract them from the packfiles into single files?

Take a look at 'man git-unpack-objects'.

Have fun,

Avery

Rob Browning

Apr 17, 2010, 6:05:59 PM
to Peter McCurdy, Avery Pennarun, Thomas Kula, bup-...@googlegroups.com
Peter McCurdy <peter....@gmail.com> writes:

> My personal interest with bup and cryptography would be to not care
> about multi-user encrypted deduplication; what I want is to make a
> remote backup to a semi-trusted server (e.g. some shared host
> somewhere) and for it to be impossible for anyone with access to this
> backup to get any of my private data. Other people's data is their
> own problem; they can make their own backups to their own servers.

One point of reference here, for those who haven't already seen it, is
duplicity: http://duplicity.nongnu.org/

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Avery Pennarun

Apr 17, 2010, 6:20:13 PM
to Rob Browning, Peter McCurdy, Thomas Kula, bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 6:05 PM, Rob Browning <r...@defaultvalue.org> wrote:
> Peter McCurdy <peter....@gmail.com> writes:
>> My personal interest with bup and cryptography would be to not care
>> about multi-user encrypted deduplication; what I want is to make a
>> remote backup to a semi-trusted server (e.g. some shared host
>> somewhere) and for it to be impossible for anyone with access to this
>> backup to get any of my private data.  Other people's data is their
>> own problem; they can make their own backups to their own servers.
>
> One point of reference here, for those who haven't already seen it, is
> duplicity: http://duplicity.nongnu.org/

I've never used duplicity, but it looks like they do things in a way
that wouldn't work for bup. Namely, they produce a single big file
and encrypt it from beginning to end. This is a great way to do
encryption (great security, great speed) but not a great way to do
deduplication.
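
To make that concrete, here is a toy demonstration (XOR keystream standing in for gpg; none of this is duplicity or bup code) of why whole-stream encryption defeats chunk-level deduplication: the same chunk at a different offset meets a different part of the keystream, so its ciphertext differs, whereas keying each chunk by its own content hash makes identical chunks encrypt identically.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # Counter-mode hash stream; a stand-in for a real cipher.
    out, i = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        i += 1
    return out[:n]

def enc(key: bytes, data: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

chunk = b"x" * 8192
backup1 = chunk + b"A" * 8192      # shared chunk first
backup2 = b"B" * 8192 + chunk      # same chunk, different offset

key = b"user key"
# Whole-stream encryption: the shared chunk lands at a different
# keystream offset, so its ciphertext differs -> nothing to dedup.
s1, s2 = enc(key, backup1), enc(key, backup2)
assert s1[:8192] != s2[8192:]

# Per-chunk, content-derived keys: identical chunks encrypt identically.
def conv(block: bytes) -> bytes:
    return enc(hashlib.sha256(block).digest(), block)

assert conv(backup1[:8192]) == conv(backup2[8192:])
```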

Avery

Avery Pennarun

Apr 17, 2010, 6:22:53 PM
to Zoran Zaric, bup-...@googlegroups.com
On Fri, Apr 16, 2010 at 4:33 PM, Avery Pennarun <apen...@gmail.com> wrote:
> Another option would be to encrypt each block using a function of the
> block itself.  For example, we name chunks based on their sha1sum; we
> could *encrypt* the blocks using, say, their sha256 (something other
> than their sha1) as a symmetric key.  Only people who know the
> original block's content can calculate the block's sha256, and thus
> the encryption key, but *everybody* who knows the original block's
> content will calculate the same one, so deduplication is still
> possible.
>
> After that, you just store an index of sha1->sha256 mappings, gpg
> that, and also store it in bup.  (Obviously that part will be
> non-deduplicatable, but it's a heckuvalot smaller than the original
> data so it's probably okay.)
>
> But I'm not a cryptographer so my idea probably has fatal flaws.
> Anybody want to point out what they are?

Aha, I knew I couldn't have been the first person to have thought of
this idea. It's called "convergent encryption":

http://www.google.ca/search?q=convergent+encryption
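
A minimal sketch of the scheme as Avery describes it, using an XOR keystream in place of a real cipher (illustrative only, not bup code): chunks are named by sha1 as in git, encrypted under a key derived from their own sha256, and each user keeps a gpg-protected sha1-to-key index.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # Counter-mode hash stream; a stand-in for a real symmetric cipher.
    out, i = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        i += 1
    return out[:n]

def enc(key: bytes, data: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def store_chunk(repo: dict, index: dict, block: bytes) -> str:
    name = hashlib.sha1(block).hexdigest()   # public chunk id, as in git
    key = hashlib.sha256(block).digest()     # derived only from the content
    repo[name] = enc(key, block)
    index[name] = key                        # this mapping gets gpg'd per user
    return name

repo, alice_index, bob_index = {}, {}, {}
id_a = store_chunk(repo, alice_index, b"shared 8k block")
id_b = store_chunk(repo, bob_index, b"shared 8k block")
assert id_a == id_b and len(repo) == 1       # both users share one ciphertext

# Either user can decrypt using the key from their own index.
assert enc(bob_index[id_b], repo[id_b]) == b"shared 8k block"
```

Only someone who already knows the plaintext (or holds an index entry) can derive the key, yet every such user derives the same one, so deduplication survives.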

Rob Browning

Apr 17, 2010, 7:05:44 PM
to Avery Pennarun, Peter McCurdy, Thomas Kula, bup-...@googlegroups.com
I just mentioned it as a tool that handles remote encrypted backups
(with intra-file incremental diffs), which I think is at least part of
what Peter wanted.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4


Thomas Kula

Apr 19, 2010, 8:11:08 PM
to bup-...@googlegroups.com
On Sat, Apr 17, 2010 at 02:36:24PM -0400, Avery Pennarun wrote:

> No, you don't understand what I was saying. You're encrypting the
> keylist using the same scheme (let's call it "reflexive encryption")
> that you used for the rest of the data. So the root of the keylist is
> encrypted with gpg; it contains a pointer to the first level's
> encrypted keylist, and a key to decrypt it. The first level points to
> the second level, and so on. You still can't decrypt any level of the
> keylist unless you have the key to that level, so it's as secure as
> the rest of the system (which, as we're discussing, is not known for
> sure).

Ah, that makes sense.

--
Thomas L. Kula | ku...@tproa.net | http://kula.tproa.net/

