On Sat, Apr 17, 2010 at 3:54 AM, Peter McCurdy <peter....@gmail.com> wrote:
> On Sat, Apr 17, 2010 at 1:42 AM, Avery Pennarun <apen...@gmail.com> wrote:
>> On Fri, Apr 16, 2010 at 11:14 PM, Thomas Kula <ku...@tproa.net> wrote:
> But yes, this just goes to prove again the first rule of designing
> cryptographic systems: Don't Do It. But that's hardly stopped anyone
> before (see Twitter), so wheee, off we go!
>
> The second rule of designing cryptographic systems is that you can
> only judge a system against the expected threats and usage, which we
> haven't really gotten into.
Yes, I certainly appreciate these two rules. To put it in context,
I've been successful at this before, where by "successful" I mean I
made an open-source package that used flawed encryption to provide a
useful service; the data it was protecting was never important enough
for anyone to steal, but leaving it in plaintext wouldn't have been
acceptable either. In fact, nobody ever even pointed out the flawed
encryption to me; I only figured out some of the flaws myself, five
years later.
> My personal interest with bup and
> cryptography would be to not care about multi-user encrypted
> deduplication; what I want is to make a remote backup to a
> semi-trusted server (e.g. some shared host somewhere) and for it to be
> impossible for anyone with access to this backup to get any of my
> private data. Other people's data is their own problem; they can make
> their own backups to their own servers.
To be honest, bup so far has been designed with an even simpler threat
model: only trusted people have access to my servers, so I don't need
any encryption at all. It's actually working fine with that model so
far, and we shouldn't be too quick to discount it :)
I also think your model above is a very sensible one, especially for
people who might want to store on a hosting provider somewhere, or S3,
or whatever, where you don't completely trust the participants but
you're the one who ends up paying for storage.
At the same time, full-on peer-to-peer deduplication + encryption is
pretty much the holy grail of distributed storage systems. So it
seems a shame to discount it too quickly :)
> For this threat model, the simplest (and therefore massively best, in
> cryptanalysis terms) solution is to gpg-encrypt every chunk with my
> private key, and leave it at that. I'll still be leaking some
> information compared with backing up a big fat GPG-encrypted tarball
> (e.g. the distribution of the chunks in my backup data, filenames,
> file sizes, etc) but for my purposes, it's probably worth the time and
> space savings.
People keep saying things like that - and other backup systems I've
seen claim to do it this way - but is this even remotely plausible?
Can you run gpg on every 8k block separately and hope to get even
halfway sane performance? AFAIK, doing it that way means doing
public-key crypto for every single block.
On the other hand, if you do the public key crypto only once, e.g. to
encrypt a symmetric key, then you're reusing that one symmetric key
across a huge amount of data, and I think that leads to problems too
(which is why things like SSL renegotiate their session key
occasionally).
gpg was originally designed for sending emails back and forth. Does
encrypting terabytes of data in 8k chunks (basically billions of small
messages) violate its assumptions to a degree that weakens the
encryption? Beats me.
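For what it's worth, one way to sidestep both problems at once would
be to do the public-key step exactly once (wrapping a random master
key) and then derive an independent symmetric key per chunk from it.
A minimal sketch of the derivation step, using an HKDF-style HMAC
expand - the `chunk_key` name and the label string are my invention,
not anything bup or gpg actually does:

```python
import hmac
import hashlib

def chunk_key(master_key: bytes, chunk_index: int) -> bytes:
    # Derive an independent 32-byte key for each chunk from one master
    # key, so no single symmetric key gets reused across billions of
    # chunks. The master key itself would be wrapped just once with
    # the user's public key.
    info = b"bup-chunk-key:" + chunk_index.to_bytes(8, "big")
    return hmac.new(master_key, info, hashlib.sha256).digest()

master = b"\x00" * 32  # stand-in for a randomly generated master key
k0 = chunk_key(master, 0)
k1 = chunk_key(master, 1)
assert k0 != k1 and len(k0) == 32
```

Each per-chunk key would then feed an ordinary symmetric cipher, so
the expensive public-key operation happens once per backup rather
than once per block.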
My point here is just that even applying gpg to the problem is not
exactly the same as using an existing cryptosystem, since the
"existing" cryptosystem was designed for a different purpose. In some
ways it's the same mistake Twitter made when applying SHA1 where they
should have used HMAC-SHA1.
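(For the curious, the Python stdlib makes that distinction easy to
see; this is just an illustration of the general SHA1-vs-HMAC-SHA1
point, not anything from Twitter's actual code:)

```python
import hashlib
import hmac

key, msg = b"secret-key", b"some message"

# Bare hash of key+message: vulnerable to length-extension attacks,
# which is roughly the mistake of using SHA1 as if it were a MAC.
bare = hashlib.sha1(key + msg).hexdigest()

# HMAC: the construction actually designed for keyed authentication.
mac = hmac.new(key, msg, hashlib.sha1).hexdigest()

assert bare != mac
```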
> If you do want multi-user deduplication, there's no two ways about it,
> you will inherently be leaking information, to wit: whether or not
> anyone else has saved the same chunk. In certain contexts, this can,
> in fact, be troublesome.
>
> For instance, imagine two users who each set up a new standard
> installation of, say, Ubuntu, and the first thing they do right away
> is back up /etc with bup. The only difference between the two backups
> will pretty much be /etc/passwd and /etc/shadow, and they'll only be
> off by one line each. [...]
That's a very good example indeed :) On the other hand, the
cost/benefit ratio in this example has a bit of an exciting symmetry
to it. Alice's entire Ubuntu system's backup can be only one block
long :) The huge benefit here could outweigh the danger of revealing
/etc/shadow... or maybe it just means we need to think harder so that
successful attacks like this become implausible.
One idea that comes to mind would be to just use really slow
encryption, e.g. bcrypt, which is known to make password brute
forcing "too hard." But then of course the backup would be "too
slow." :)
...oh hey. If /etc/shadow is using bcrypt or something like it, then
brute forcing is *already* too slow. So maybe this particular example
doesn't work so well anyway. Though I'm sure there are lots of
similar ones.
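For concreteness, I'm assuming the deduplicating scheme we're talking
about is something like convergent encryption, where the chunk key is
derived from the plaintext itself. A toy sketch (the function name
and label are made up) showing in one place both why dedup works and
why guessable chunks leak:

```python
import hashlib

def convergent_key(chunk: bytes) -> bytes:
    # Convergent encryption: the key is a hash of the plaintext, so
    # identical chunks encrypt to identical ciphertexts and dedupe.
    return hashlib.sha256(b"convergent:" + chunk).digest()

line = b"root:x:0:0:root:/root:/bin/bash\n"
alice = convergent_key(line)
bob = convergent_key(line)

# Same plaintext -> same key (and thus same ciphertext): dedup works.
# But an attacker who can *guess* a chunk's plaintext can compute the
# same key and confirm the guess - exactly the /etc/shadow problem.
assert alice == bob
```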
> Maybe this isn't something any of you care about; if so, then the
> proposed scheme looks good as far as I can tell. But I'll again refer
> you to the first rule: these things can be tremendously tricky and
> subtle, as Twitter found out. My advice would be that, unless you
> really, really need multiuser deduplication, you should try to keep
> things as simple as possible for now and just skip it.
I think this is good advice. Maybe the real lesson here is: the
encryption schemes should be pluggable. How can we design an API that
lets us use:
0) no encryption
1) full encryption using a single key (for really important data)
2) deduplicating encryption as I described (for data that's only semi-private)
...
n) Any other method we come up with?
It's basically some sort of blob/tree transform, which ought to be
possible to plug in somewhere without too much trouble. Yay dynamic
languages.
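To make the idea concrete, here's a hypothetical sketch of what such
a pluggable blob transform might look like (all names invented; XOR
stands in for a real cipher purely to keep the sketch self-contained
- a real plugin 1 would use an authenticated cipher):

```python
import hashlib

class NullTransform:
    """Plugin 0: no encryption (bup's current trust model)."""
    def pack(self, blob: bytes) -> bytes:
        return blob
    def unpack(self, data: bytes) -> bytes:
        return data

class XorTransform:
    """Stand-in for plugin 1: one secret key for everything.
    XOR is a placeholder, NOT real encryption."""
    def __init__(self, key: bytes):
        self.key = key
    def pack(self, blob: bytes) -> bytes:
        return bytes(b ^ self.key[i % len(self.key)]
                     for i, b in enumerate(blob))
    unpack = pack  # XOR is its own inverse

def save_blob(store: dict, transform, blob: bytes) -> str:
    # The transform sits between the blob and the object store, so
    # the stored id is computed over the transformed bytes.
    data = transform.pack(blob)
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data
    return oid

store = {}
t = XorTransform(b"key")
oid = save_blob(store, t, b"hello")
assert t.unpack(store[oid]) == b"hello"
```

The point is just that scheme 0, 1, 2, ..., n all present the same
pack/unpack surface, so the rest of bup never needs to know which one
is in use.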
Have fun,
Avery