Cutting files

o lu

Jul 31, 2020, 3:31:56 AM
to Perkeep
How does Perkeep know where to cut media files when deduping?

Ian Denhardt

Jul 31, 2020, 4:33:43 AM
to per...@googlegroups.com
Quoting o lu (2020-07-31 01:37:47)

> How does Perkeep know where to cut media files when deduping?

The key idea is: after each byte, compute a (non-cryptographic) hash of
the most recent N bytes seen so far. If the bottom M bits of the hash
are zero, split there, otherwise keep going. IIRC perkeep chooses N = 64
and M = 13.

For an intuition of why this works well, imagine the case where some
characters are inserted in the middle of a large file, which was
previously stored in several chunks. All of the chunks before the
location of the insertion will be the same. The insertion itself will
result in not re-using the chunk that contained that section of the
file, but because the split point is chosen based on the content at that
point, the new chunk will probably end at the same place the old one
did, and then for subsequent chunks we'll have "synchronized" with the
old copy, allowing us to re-use the old chunks after the modification as
well.

The particular hash function used is designed to make incrementally
hashing a sliding window like this efficient, but for the purposes of
choosing a "good" split point you could use any hash function, it would
just be slower. The code for the hash function is in the package
go4.org/rollsum

The full implementation takes into account some other details -- things
like enforcing a maximum & minimum chunk size -- but the above is where
the magic comes from. The actual code that does the splitting is in
writeFileChunks in pkg/schema/filewriter.go, if you're interested in all
the gory details.
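
To give a concrete feel for the shape of that loop, here's a stripped-down
sketch in Go -- not the writeFileChunks code, just an illustration. It assumes
go4.org/rollsum's New/Roll/OnSplit methods, and the min/max constants are made
up for the example rather than taken from perkeep:

    package chunker

    import (
        "bufio"
        "io"

        "go4.org/rollsum"
    )

    // Illustrative limits only -- not Perkeep's actual values.
    const (
        minChunk = 4 << 10 // never split a chunk smaller than 4 KiB
        maxChunk = 1 << 20 // always split once a chunk reaches 1 MiB
    )

    // splitChunks rolls a hash over every byte of r and ends a chunk
    // wherever the hash says to, subject to the min/max sizes above.
    // (This sketch keeps rolling across chunk boundaries; whether to
    // reset the window at a split is one of the details a spec would
    // have to pin down.)
    func splitChunks(r io.Reader) ([][]byte, error) {
        var (
            chunks [][]byte
            cur    []byte
        )
        rs := rollsum.New()
        br := bufio.NewReader(r)
        for {
            b, err := br.ReadByte()
            if err == io.EOF {
                break
            }
            if err != nil {
                return nil, err
            }
            rs.Roll(b)
            cur = append(cur, b)
            if (rs.OnSplit() && len(cur) >= minChunk) || len(cur) >= maxChunk {
                chunks = append(chunks, cur)
                cur = nil
            }
        }
        if len(cur) > 0 {
            chunks = append(chunks, cur)
        }
        return chunks, nil
    }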

I have kicking around on my hard drive a WIP library that just provides
the splitting algorithm, without being entangled with the rest of the
file upload logic. I should finish that up and publish it...

Hope this is useful,

-Ian

Bob Glickstein

Jul 31, 2020, 9:53:51 AM
to per...@googlegroups.com
On Fri, Jul 31, 2020 at 1:33 AM Ian Denhardt <i...@zenhack.net> wrote:
I have kicking around on my hard drive a WIP library that just provides
the splitting algorithm, without being entangled with the rest of the
file upload logic.

I did that recently too. It's not identical to the implementation in Perkeep though. It's at https://github.com/bobg/hashsplit.

Cheers,
- Bob

Ian Denhardt

Jul 31, 2020, 3:07:15 PM
to Bob Glickstein, per...@googlegroups.com
Nice! My implementation wasn't exactly the same either, and it seems
like you came up with a very similar API to what I had in mind -- so
maybe I'll just use yours.

It would be nice to have a precisely specified splitting function, so
that it was possible to write interoperable software without having to
deal with different tools computing different splits (thereby somewhat
negating the dedup benefits) or having to just rely on the same library
everywhere (and thus the same language...). With cryptographic hash functions
we have names for specific widely implemented functions (e.g. sha224);
it would be nice to have the same for hash splitting.

I could be talked into helping out with a formal spec.

-Ian

Nick S

Jul 31, 2020, 4:03:00 PM
to Perkeep
That's a pretty cool method. Is it actually a Rabin fingerprint? I noticed the number of bottom bits (M = 13) is the same as what they give in the Wikipedia article.

Ian Denhardt

Jul 31, 2020, 4:43:48 PM
to Nick S, Perkeep
Quoting Nick S (2020-07-31 16:03:00)

> Is it actually a Rabin fingerprint?

I haven't looked too carefully at the concrete implementation, so it could
be, but there are a number of different choices of rolling hash
function here:

https://en.wikipedia.org/wiki/Rolling_hash

> I noticed the number of bottom bits (M = 13) is the same as what they
> give in the [1]Wikipedia article.

Note that the choice of M is entirely orthogonal to the choice of hash
function. M determines the average block size; in particular the
probability of a split at a given point is 1/(2^M), so the average block
size will be 2^M bytes. So for M = 13 you get 8KiB blocks on average.
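
(To spell out the step from the probability to the average: the gap between
split points is geometrically distributed with success probability
p = 1/(2^M), so the expected chunk length is 1/p = 2^M bytes; with M = 13
that's 2^13 = 8192 bytes, i.e. 8 KiB.)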

-Ian

Nick S

Jul 31, 2020, 9:55:52 PM
to Perkeep
Oh yeah, it's funny: I knew a choice of M = 13 gives you 8 KiB blocks on average, which is why it was chosen. But I hadn't put it together that this means another algorithm choosing 13 isn't that much of a coincidence.

Thanks for the pointer to the rolling hash article! Somehow I hadn't come across it.

Bob Glickstein

Aug 9, 2020, 1:49:28 PM
to Ian Denhardt, per...@googlegroups.com
On Fri, Jul 31, 2020 at 12:07 PM Ian Denhardt <i...@zenhack.net> wrote:
It would be nice to have a precisely specified splitting function, so
that it was possible to write interoperable software without having to
deal with different tools computing different splits

I heartily agree! And thanks for the suggestion. In order to contribute to such a specification I would first need to educate myself on the pluses and minuses of different rolling hashes (e.g. beginning here). What I have right now in my hashsplit package is a little cargo-culty.

Not sure what my time will allow, but I will make this effort and keep you posted. Meanwhile, if you make any progress I'd love to hear about it.

Cheers,
- Bob

Ian Denhardt

Aug 10, 2020, 10:46:22 PM
to Bob Glickstein, bo...@emphatic.com, per...@googlegroups.com
Quoting Bob Glickstein (2020-08-09 13:49:15)

> I heartily agree! And thanks for the suggestion. In order to contribute
> to such a specification I would first need to educate myself on the
> pluses and minuses of different rolling hashes (e.g. beginning
> [2]here). What I have right now in my hashsplit package is a little
> cargo-culty.

I spent a bit of time researching, and it looks like there's a long
history of cargo culting; go4.org/rollsum looks like it's an exact
transliteration of bup's implementation:

https://github.com/bup/bup/blob/bb092f8a5c148534655b6a709896d853e3587dc8/lib/bup/bupsplit.c

Which in turn is almost exactly the rsync algorithm:

https://rsync.samba.org/tech_report/node3.html

...the only difference being that each byte is offset by 31, per the
comment in bup's source:

// According to librsync/rollsum.h:
// "We should make this something other than zero to improve the
// checksum algorithm: tridge suggests a prime number."
// apenwarr: I unscientifically tried 0 and 7919, and they both ended up
// slightly worse than the librsync value of 31 for my arbitrary test data.
#define ROLLSUM_CHAR_OFFSET 31

Frankly, I think we should just stick to this hash for any
formalization: it works fine and there are apparently already a couple
implementations -- for the hash itself I think we should just nab the
rsync doc, stick the char offset constant in those formulae and be done
with it.
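
To illustrate (this is not a verbatim copy of any of those implementations,
just my reading of the rsync formulae with the char offset folded in, packed
into a 32-bit digest the way bup and go4.org/rollsum do):

    package rollsketch

    // Sketch of the rsync-style rolling checksum with bup's char offset
    // folded in. Window size and the s1<<16 | s2 digest packing follow
    // the bup/go4.org/rollsum convention; treat the details as
    // illustrative rather than normative.
    const (
        charOffset = 31
        windowSize = 64
    )

    type rollSum struct {
        s1, s2 uint32
        window [windowSize]byte
        pos    int
    }

    func newRollSum() *rollSum {
        return &rollSum{
            // Initial state is equivalent to a window of 64 zero bytes,
            // each offset by charOffset.
            s1: windowSize * charOffset,
            s2: windowSize * (windowSize - 1) * charOffset,
        }
    }

    // Roll slides the window forward one byte: drop the oldest byte,
    // add the new one, and update the two running sums.
    func (r *rollSum) Roll(b byte) {
        drop := uint32(r.window[r.pos])
        r.window[r.pos] = b
        r.pos = (r.pos + 1) % windowSize
        r.s1 += uint32(b) - drop
        r.s2 += r.s1 - windowSize*(drop+charOffset)
    }

    // Digest packs the two sums into the 32-bit value the split test
    // looks at.
    func (r *rollSum) Digest() uint32 {
        return r.s1<<16 | r.s2&0xffff
    }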

Note that bup's splitting function appears to look for a number of
bits that are *set*, rather than bits that are clear.

Documenting the splitting algorithm on top of that is, I think, the
more significant amount of work, along with figuring out how much we
want to be configurable, whether we want to try to capture the
parameters of existing implementations, etc.

-Ian

Bob Glickstein

Aug 11, 2020, 11:00:46 AM
to Ian Denhardt, per...@googlegroups.com
On Mon, Aug 10, 2020 at 7:46 PM Ian Denhardt <i...@zenhack.net> wrote:
Frankly, I think we should just stick to this hash for any
formalization

I have been on the fence about this. On the plus side, there's a lot of existing code, so no need to invent something new. On the minus side, as you pointed out the existing implementations are all slightly mutually incompatible, so what is there to preserve, exactly?

Then just now I did an experiment that convinced me: https://github.com/bobg/hashsplit/commit/5a64185e331e11fb69800411edd763ad0a6f6bf4

If you run my new benchmark with environment variable BENCHMARK_ROLLSUM_COUNT_ZEROES=1, it counts how often each bit of the digest is 0. Should be about 50% for every bit, all the time, right? But look:

    BenchmarkRollsum: rollsum_test.go:25: using seed 1597156860
    BenchmarkRollsum: rollsum_test.go:52: with b.N == 8493040:
    BenchmarkRollsum: rollsum_test.go:54:   bit 0 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 1 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 2 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 3 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 4 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 5 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 6 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 7 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 8 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 9 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 10 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 11 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 12 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 13 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 14 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 15 is zero 48.3% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 16 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 17 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 18 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 19 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 20 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 21 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 22 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 23 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 24 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 25 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 26 is zero 46.7% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 27 is zero 56.1% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 28 is zero 99.9% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 29 is zero 0.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 30 is zero 100.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 31 is zero 100.0% of the time

The highest few bits do not exhibit good randomness.

Now the hunt is on for a better rolling checksum: one that's fast (this one's only about 6-7 ns on my computer for each Roll/Digest call pair, when running with BENCHMARK_ROLLSUM_COUNT_ZEROES turned off), where every bit is zero 50% of the time, and where there is little or no measurable correlation between bits (a test I haven't written yet).

You might argue we shouldn't care about the highest few bits for hashsplitting purposes - we seldom care about more than 20 or so of the lowest bits - but I don't feel ready to settle for that until we've evaluated some other candidate algorithms.
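
In case anyone wants to reproduce the flavor of this without digging into my
benchmark, the core of the zero-counting check is something like the following
sketch -- not the actual benchmark code, and it assumes go4.org/rollsum's
New/Roll/Digest API:

    package main

    import (
        "fmt"
        "math/rand"

        "go4.org/rollsum"
    )

    // Feed pseudo-random bytes through the rolling hash and count, for
    // each bit of the 32-bit digest, how often that bit is zero. A
    // well-behaved hash should hover around 50% for every bit.
    func main() {
        const n = 1 << 22
        rs := rollsum.New()
        rng := rand.New(rand.NewSource(1))
        var zeros [32]int
        for i := 0; i < n; i++ {
            rs.Roll(byte(rng.Intn(256)))
            d := rs.Digest()
            for bit := 0; bit < 32; bit++ {
                if d&(1<<bit) == 0 {
                    zeros[bit]++
                }
            }
        }
        for bit, z := range zeros {
            fmt.Printf("bit %d is zero %.1f%% of the time\n", bit, 100*float64(z)/n)
        }
    }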

Cheers,
- Bob

Ian Denhardt

Aug 11, 2020, 6:38:55 PM
to Bob Glickstein, bo...@emphatic.com, per...@googlegroups.com
Quoting Bob Glickstein (2020-08-11 11:00:33)

> as you pointed out the existing implementations are all slightly
> mutually incompatible, so what is there to preserve, exactly?

The difference between go4.org/rollsum and bup's choice of OnSplit() can
be factored out into a choice of bitmask and expected value easily
enough, and can be computed as:

digest() & mask == value

For go4.org/rollsum, you have:

mask = ((1 << 13) - 1)
value = 0

And for bup you have:

mask = ((1 << 13) - 1)
value = mask
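
In code, that factoring is tiny -- a sketch, with the two parameter sets
exactly as above:

    package splitparams

    // splitParams captures the post-hash split decision as data rather
    // than code: split when digest & mask == value.
    type splitParams struct {
        mask, value uint32
    }

    var (
        perkeepParams = splitParams{mask: 1<<13 - 1, value: 0}         // go4.org/rollsum, per the above
        bupParams     = splitParams{mask: 1<<13 - 1, value: 1<<13 - 1} // bup, per the above
    )

    func (p splitParams) onSplit(digest uint32) bool {
        return digest&p.mask == p.value
    }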

I modified your benchmark (after applying the rng pr I sent you) to
compute an `onsplit` this way, and the performance difference was small;
less than 3% difference vs. calling rs.OnSplit(). For most applications,
this function is fast enough that its cost is going to be completely
swallowed by other overheads.

All of this is to say I think it is reasonable to capture the difference
between bup & perkeep in slightly different parameters to the same
algorithm. We could add a third parameter to also capture the rsync
version, though I haven't measured the overhead of making that a
run-time value -- but I expect it to be small enough not to matter
much.

Unfortunately, making the rolling checksum itself an interface
approximately doubles the running time. But I'm not sure whether
that actually matters for any real applications. We're still talking 6
ns/op; for perkeep & bup I expect that difference is still going to be
completely swallowed by the computation of a cryptographic hash for the
block as a whole.

But, I think a good spec would document the hash function orthogonally
to the rest of the algorithm, so this isn't a blocker on getting
started on the other parts.

> The highest few bits do not exhibit good randomness.
>
> [...]
>
> You might argue we shouldn't care about the highest few bits for
> hashsplitting purposes - we seldom care about more than 20 or so of
> the lowest bits - but I don't feel ready to settle for that until we've
> evaluated some other candidate algorithms.

I'm open to alternative suggestions, but (1) I don't see a problem;
you're getting to 32MiB blocks before you start to have any differences
in variance, and realistically +/-5% probably isn't a big deal -- so
you're looking at max avg block size of 128MiB, which seems like plenty.
(2) I do think it would be nice to arrange things such that some
existing systems' algorithms fall out of the spec as choices of
particular parameters. For that to work for perkeep & bup, we'd need to
keep some version of this hash function.

If we decide it's ok to put the function behind an interface in the
code, we could explore multiple possible hash functions, but (1) I think
it's valuable for this function to be one of the options for
compatibility, and (2) I don't want to give too many options; a
different hash function is much more extra implementation work than a
numeric parameter, and if we add too many we've sort of missed the
point of standardization. So I'd only want to do this if there are clear
compelling use cases for each of the functions we include.

Whatever parameters we decide to add, we should pick a
default/recommended set of values for them.

Bob Glickstein

Aug 13, 2020, 10:58:41 AM
to Ian Denhardt, per...@googlegroups.com
On Tue, Aug 11, 2020 at 3:38 PM Ian Denhardt <i...@zenhack.net> wrote:
I think
it's valuable for this function to be one of the options for
compatibility

I agree there, but I'm becoming increasingly convinced that it should not be the default, or even recommended.

I've added a new level of analysis to my benchmark function as of the latest commit: in addition to counting how often each bit is zero (which should approach 50%), it counts how often each pair of bits is correlated (which should also approach 50%). The results for rollsum are not great. On the other hand, the results for the other algorithms in github.com/chmduquesne/rollinghash, which are now added to the benchmark, are great (except for adler32). Try running with and without the env var BENCHMARK_ROLLSUM_ANALYZE=1 to see the results.
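
Conceptually the pairwise check is just this (a sketch of the idea, not the
code that's actually in the benchmark): collect digests from the rolling
hash, then count how often each pair of bits agrees.

    package analysis

    // pairAgreement takes a stream of rolling-hash digests and, for
    // each pair of digest bits (i, j), reports the fraction of digests
    // in which the two bits have the same value. For an unbiased,
    // uncorrelated hash every pair should come out near 0.5.
    func pairAgreement(digests []uint32) [32][32]float64 {
        var agree [32][32]int
        for _, d := range digests {
            for i := 0; i < 32; i++ {
                for j := i + 1; j < 32; j++ {
                    if (d>>i)&1 == (d>>j)&1 {
                        agree[i][j]++
                    }
                }
            }
        }
        var frac [32][32]float64
        for i := 0; i < 32; i++ {
            for j := i + 1; j < 32; j++ {
                frac[i][j] = float64(agree[i][j]) / float64(len(digests))
            }
        }
        return frac
    }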

By the way, there's probably more sophisticated analysis that could be done on the distribution produced by these hashes but I suspect we're into diminishing returns after the pairwise bit correlations I'm now doing. I could be wrong though. Whether I am, and how else the results should be analyzed, are left as an exercise for other readers of this thread.

Cheers,
- Bob

Ian Denhardt

Aug 13, 2020, 2:30:31 PM
to Bob Glickstein, bo...@emphatic.com, per...@googlegroups.com
Okay, that's starting to seem a bit more relevant.

It would be interesting to measure empirically what the mean & median
block sizes for a real-world perkeep store are; if there's a high
correlation between bits one would expect these metrics to be lower
than a uniformly random hash function would provide. It would be
valuable on its own to have documentation describing how to get a
desired block size with this statistically imperfect function.

I'm okay with having a statistically more compelling hash be our
"recommended" choice, including this one only for compatibility.

I started writing a spec for this hash (based on the rsync document, but
including the deviations made by perkeep & bup). I'll publish it
somewhere soon so we can collaborate.

-Ian

Joe Moore

Aug 14, 2020, 12:21:50 PM
to per...@googlegroups.com
I'm a bit late to this part of the discussion, but I think this is relevant:

I know why splitting blobs is useful (performance, scale, dedup) but is there a reason why all backends need to use the same algorithm/settings?

If I'm pulling a file from Amazon glacier, and being charged per-file-requested, I'd rather have the file in large chunks.

If it's a video file, there may be better ways to split it up (such as on key frames) than just looking at the rolling hash. 

If it's just tweets, why ever bother calculating the rollsum? It's never going to be that big.

Proposal:
Why not let the backend blobstore service take care of chunking a blob into pieces however it makes sense?

The use cases above (huge single-chunk file, content-aware splitting, optimize for small blobs) naturally resolve with different parameters.
 
For the developers, you could replicate from blobstore-with-rollsum (dechunk, transfer data, chunk differently) over to blobstore-with-different-rollsum to compare performance or sizes or whatever.

It would make for a very simple existing-filesystem layer: set the split size to \inf and point the backend at existing (whole) files.

The downside as I see it would be needing a reasonable-default config option for blobservers that choose to implement this parameter.

--Joe

ta...@gulacsi.eu

Aug 14, 2020, 2:39:28 PM
to per...@googlegroups.com
The client does the chunking because it can then hash each chunk and check whether that chunk is already at the backend, saving transfer costs.
Different clients should chunk the same way; otherwise this "already there" phenomenon would be rare.

Ian Denhardt

Aug 14, 2020, 2:54:17 PM
to Joe Moore, per...@googlegroups.com
Quoting Joe Moore (2020-08-14 12:21:34)

> I know why splitting blobs is useful (performance, scale, dedup) but is
> there a reason why all backends need to use the same
> algorithm/settings?
> If I'm pulling a file from Amazon glacier, and being charged
> per-file-requested, I'd rather have the file in large chunks.
> If it's a video file, there may be better ways to split it up (such as
> on key frames) than just looking at the rolling hash.
> If it's just tweets, why ever bother calculating the rollsum, it's
> never going to be that big.

I think making constants (things like target block size, min/max sizes)
tunable is fine, and esp. given that there are already systems out in
the wild using different parameters here, libraries building this stuff
ought to allow the caller to provide these parameters. It's easy enough
to do since they're just the choice of numbers.

I think there's a stronger reason to avoid a proliferation of hash
functions though, as each additional hash adds meaningful implementation
burden. We may still want to specify a number of options to capture
existing systems, but we should avoid adding hash functions without
clear use cases.

> Proposal:
> Why not let the backend blobstore service take care of chunking a blob
> into pieces however it makes sense?

I can think of a couple reasons why hiding the splits from the client
might not be desirable:

- If the client doesn't know about the splitting algorithm, it has to
transfer whole files, every time, potentially wasting lots of
bandwidth. If the splitting is done client side, only modified blocks
need to be transferred.
- If the splits are opaque, transferring from one store to another using
only public interfaces basically has to re-duplicate the blobs, as all
of the files are stitched back together, transferred, and then
split again. If you do this really naively it could result in
potentially exponential space blowup, but even being smart about only
copying whole files once this could be very expensive. As a point of
reference, I migrated to perkeep from a backup tool I'd written myself
that did hash-based dedup at the file level only, and the perkeep
version used about half the storage of the old system.

Your proposal may be fine for some use cases, but I still think there's
value in being able to replicate the same split without using the same
piece of software.

-Ian

Ian Denhardt

Aug 14, 2020, 7:37:34 PM
to Bob Glickstein, bo...@emphatic.com, per...@googlegroups.com
Quoting Ian Denhardt (2020-08-13 14:30:24)

> I started writing a spec for this hash (based on the rsync document,
> but including the deviations made by perkeep & bup). I'll publish it
> somewhere soon so we can collaborate.

Published:

https://github.com/zenhack/hashsplit-spec

-Ian