File schema


Bob Glickstein

Dec 14, 2019, 12:13:15 PM
to per...@googlegroups.com
Hello! I have a few questions about the file schema.

In schema/filewriter.go, files are split into trees of chunks at "interesting rollsum boundaries."

The comment describing an "interesting" rollsum boundary says it's when the trailing 13 bits of the rolling checksum are "set the same way," which sounds like it means all-zeroes or all-ones. But the implementation says they have to be all ones.
  • Question 1: Which is right, the comment or the implementation?
  • Question 1a: If the comment is right, then the intention is for 2 out of every 1<<13 checksum values to satisfy OnSplit. Why not 1 out of every 1<<12?
This chunk splitting happens on the second and subsequent chunks of a file, once the size of the chunk surpasses 64KB; by my calculation a split then happens, on average, within the following 5,678 bytes.
  • Question 2: Why make irregularly sized chunks at all based on this obscure property? Why not split at 64kb boundaries?
Each chunk created gets a "bits" score which seems to be the number of trailing ones in its rolling checksum (though I'm not quite sure about that). If this is larger than the bits score of the last 1 or more chunks, those are made "children" of this new chunk.
  • Question 3: Why?
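For concreteness, here is my reading of the split condition and the "bits" score as a small Go sketch. This is not Perkeep's actual code; the all-ones interpretation is what I see in the implementation, and the function names are mine.

```go
package main

import "fmt"

const splitBits = 13
const splitMask = 1<<splitBits - 1 // 0x1FFF

// onSplit reports whether the low 13 bits of the rolling checksum
// are all ones (the implementation's condition, as I read it).
func onSplit(sum uint32) bool {
	return sum&splitMask == splitMask
}

// trailingOnes is how I understand each chunk's "bits" score: the
// number of trailing one bits in its rolling checksum.
func trailingOnes(sum uint32) int {
	n := 0
	for sum&1 == 1 {
		n++
		sum >>= 1
	}
	return n
}

func main() {
	fmt.Println(onSplit(0x1fff), onSplit(0x0fff)) // true false
	fmt.Println(trailingOnes(0x17))               // 0b10111 has 3 trailing ones
}
```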
Thanks,
- Bob

Aaron Boodman

Dec 14, 2019, 9:36:57 PM
to per...@googlegroups.com
See https://github.com/bup/bup/blob/master/DESIGN#L92 for the answer to most of these questions.

--
You received this message because you are subscribed to the Google Groups "Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email to perkeep+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/CAEf8c49V3gbxBXX%3DTeFnsi8kP_8tKTyZR9f26%3D_aaL-5%3DTYZFg%40mail.gmail.com.

Yuval Kogman

Dec 14, 2019, 9:37:14 PM
to per...@googlegroups.com
Hi,

On Sat, 14 Dec 2019 at 17:13, Bob Glickstein <bob.gli...@gmail.com> wrote:
  • Question 1: Which is right, the comment or the implementation?
  • Question 1a: If the comment is right, then the intention is for 2 out of every 1<<13 checksum values to satisfy OnSplit. Why not 1 out of every 1<<12?
I will refrain from addressing these, since it's just a parameter, and I don't know what the original motivations for setting it that way were.

This chunk splitting happens on the second and subsequent chunks of a file, once the size of the chunk surpasses 64KB; by my calculation a split then happens, on average, within the following 5,678 bytes.
  • Question 2: Why make irregularly sized chunks at all based on this obscure property? Why not split at 64kb boundaries?
A somewhat contrived but helpful example: suppose you have a very large text file. You store it in Perkeep. Then you find a typo in the beginning, fix it, and store the new version. Although the files are mostly identical, with a Levenshtein distance of 1, a fixed-width approach will encode them into completely disjoint sets of chunks.

By using subregions of the contents to determine the boundary positions, identical data will be split at the same boundaries even when it sits at different absolute offsets in the two files.

In this example, the first chunk will differ between the two files (with high probability), but all subsequent chunks will be identical, allowing the data to be deduplicated. The tradeoff is the complexity of variable-sized chunks and of computing the hashes over sliding windows. Furthermore, this is convergent independently of the order in which the data was inserted (unlike, say, Git, which depends on the commit ordering when deduplicating similar but non-identical files while constructing pack files). There are many variants of this technique; I believe the idea originated as Rabin fingerprinting, and that rsync was the first prominent application intended specifically for a distributed setting (don't quote me on that, though; I did not verify).

A less contrived example might be large documents with embedded resources (e.g. a slideshow with images), or binary files with editable metadata (although I believe such formats tend to place the metadata at the end).

Each chunk created gets a "bits" score which seems to be the number of trailing ones in its rolling checksum (though I'm not quite sure about that). If this is larger than the bits score of the last 1 or more chunks, those are made "children" of this new chunk.
  • Question 3: Why?
Having stared at the code for a bit, I'm not sure exactly why this tree structure is computed the way that it is.

The rationale for the Bytes schema object being recursive is that if the chunks of a file/bytes object were encoded as a linear sequence, there would be no possibility of reusing intermediate nodes.

If I'm not mistaken, this logic organizes the tree structure based on the bits score just to retain the same kind of order insensitivity as the chunking itself. Using the bits score is a stochastic approach to getting a structure that is roughly balanced, by promoting nodes with progressively rarer bit scores (equally, some other metric for grouping the chunks could be used, c.f. prolly trees in Noms). I'm sure someone will correct me if I've missed something.
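To make that concrete, here is a toy version of my reading (not the actual filewriter logic; the names and the strictly-lower comparison are my guesses): chunks arrive in file order, and each new chunk adopts the trailing run of lower-scored top-level chunks as its children.

```go
package main

import "fmt"

type node struct {
	name     string
	bits     int // trailing-ones score of the chunk's rolling checksum
	children []*node
}

// buildTree promotes chunks with rarer (higher) bits scores: a new
// chunk pops the trailing top-level nodes with strictly lower scores
// and attaches them as its children.
func buildTree(chunks []*node) []*node {
	var stack []*node // top-level nodes so far
	for _, c := range chunks {
		i := len(stack)
		for i > 0 && stack[i-1].bits < c.bits {
			i--
		}
		c.children = append([]*node(nil), stack[i:]...) // copy to avoid aliasing
		stack = append(stack[:i], c)
	}
	return stack
}

func main() {
	chunks := []*node{
		{name: "a", bits: 13}, {name: "b", bits: 14},
		{name: "c", bits: 13}, {name: "d", bits: 16},
	}
	// "d" has the rarest score, so it ends up as the single root,
	// with "b" (holding "a") and "c" as its children.
	for _, root := range buildTree(chunks) {
		fmt.Println(root.name, "children:", len(root.children))
	}
}
```

Higher scores are exponentially rarer, so the expected depth stays logarithmic-ish, which matches the "roughly balanced" intuition above.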

Bob Glickstein

Dec 15, 2019, 8:11:44 PM
to per...@googlegroups.com
Very helpful - thank you!

Cheers,
- Bob
