DESIGN document: How can one git checkout a file if it consists of multiple blobs?

mle...@gmail.com

Dec 16, 2022, 2:26:25 PM

to bup-list

Hi,

I am confused about how files stored by bup can be restored by a standard git checkout when they are split into multiple blobs. I hope you can help me, and that I am not bothering you too much these days. The DESIGN file says:

> Anyway, so we're dividing up those files into chunks based on the rolling
> checksum. Then we store each chunk separately (indexed by its sha1sum) as a git blob.
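To make the quoted step concrete, here is a minimal Python sketch of content-defined chunking followed by git blob hashing. It is not bup's actual splitter: the rolling sum, the `WINDOW` and `MASK` values, and the function names `split_chunks` and `git_blob_id` are assumptions chosen for illustration. Only the blob id rule (SHA-1 over `blob <size>\0` plus the content) is git's real format.

```python
import hashlib

WINDOW = 64             # bytes covered by the rolling sum (assumed value)
MASK = (1 << 10) - 1    # cut when the low 10 bits are all ones (assumed value)


def split_chunks(data: bytes):
    """Yield consecutive chunks of data, cut where a toy rolling sum matches MASK."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte                  # byte entering the window
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # byte leaving the window
        if (rolling & MASK) == MASK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]               # trailing chunk without a boundary


def git_blob_id(chunk: bytes) -> str:
    """Return the id git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(chunk)
    return hashlib.sha1(header + chunk).hexdigest()
```

Because a boundary depends only on the bytes inside the window, the same content tends to produce the same chunks, and therefore the same blob ids, wherever it appears in a file.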

And then a sequence of multiple blobs is stored as a tree:

> The next problem is less obvious: after you store your series of chunks as
> git blobs, how do you store their sequence?
> (...)
> We didn't split this list in the same way. We could
> have, in fact, but it wouldn't have been very "git-like", since we'd like to
> store the list as a git 'tree' object in order to make sure git's
> refcounting and reachability analysis doesn't get confused. Never mind the
> fact that we want you to be able to 'git checkout' your data without any special tools.

But that is strange, since the recommended reading "git for computer scientists" and the git internals documentation both say that a tree represents a directory structure, not a list of blobs that are concatenated into a single file:

> tree: Directories are represented by tree object. They refer to blobs that have the
> contents of files (filename, access mode, etc is all stored in the tree), and to other
> trees for subdirectories. ("git for computer scientists" https://eagain.net/articles/git-for-computer-scientists/)

> The next type of Git object we’ll examine is the tree, (...) All the content is
> stored as tree and blob objects, with trees corresponding to UNIX directory entries and
> blobs corresponding more or less to inodes or file contents. A single tree object
> contains one or more entries, each of which is the SHA-1 hash of a blob or subtree with
> its associated mode, type, and filename.
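The entry layout described in these quotes can be sketched in Python. This shows git's tree serialization as I understand it: each entry is the mode, a space, the name, a NUL byte, and the 20-byte binary SHA-1, with entries sorted by name and subtrees compared as if their name ended in `/`. The function name `git_tree_object` and the tuple layout are made up for this example and are not an API of git or bup.

```python
import hashlib

TREE_MODE = "40000"  # mode string git stores for a subtree entry


def git_tree_object(entries):
    """Serialize (mode, name, hex_sha1) entries; return (tree_id, raw_body)."""

    def sort_key(entry):
        mode, name, _hex_id = entry
        key = name.encode()
        # git orders subtrees as if their name had a trailing slash
        return key + b"/" if mode == TREE_MODE else key

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(hex_id)    # 20 raw bytes, not 40 hex digits
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest(), body
```

A tree is therefore just an ordered list of named pointers. Nothing in the format says the entries are parts of one file, so a plain git checkout will materialize them as separate files in a directory.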

How is it possible to assemble one file from multiple blobs?

Best regards,

Moritz

Greg Troxel

Dec 16, 2022, 8:18:45 PM

to mle...@gmail.com, bup-list

The short answer is that files are split into blobs, and blobs are
aggregated by trees, and files point to these trees. I think that the
idea that a file can be restored by a standard git checkout is
incorrect.


Johannes Berg

Dec 18, 2022, 3:25:08 PM

to Greg Troxel, mle...@gmail.com, bup-list

Indeed it cannot be, but it could be post-processed fairly simply - e.g.
a 'largefile' would be saved as a folder/tree

largefile.bup/

with a bunch of stuff inside

largefile.bup/000/000
largefile.bup/000/001

etc.

and all you really have to do is concatenate those files in
(numerical) order to get back 'largefile'.

johannes

mle...@gmail.com

Dec 21, 2022, 11:44:56 AM

to bup-list

Thank you for your help. This is indeed what I found when I did a "git checkout", except that I got hexadecimal folder and file names:

find file.JPG.bup/ -type f |sort
file.JPG.bup/000000/00000
file.JPG.bup/000000/00d0a
file.JPG.bup/000000/0105f
...

file.JPG.bup/023a5c/01530
file.JPG.bup/023a5c/01bc5
...

from which I could extract the original file using "find file.JPG.bup -type f | sort | xargs cat > file.JPG"
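For readers who prefer a script to a shell pipeline, the same reassembly can be sketched in Python. The function name `reassemble` is hypothetical, and the sketch assumes what the listing above suggests: that names at each level are zero-padded hexadecimal of equal width, so plain lexicographic path order equals numerical order.

```python
import os
import shutil


def reassemble(bup_dir: str, out_path: str) -> None:
    """Concatenate every file under bup_dir, in sorted path order, into out_path."""
    paths = []
    for root, _dirs, files in os.walk(bup_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    # Lexicographic order is only correct for fixed-width, zero-padded names.
    paths.sort()
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
```

Calling `reassemble("file.JPG.bup", "file.JPG")` would mirror the command above. Unlike `xargs cat`, it does not depend on how file names with spaces are tokenized.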

Johannes Berg

Apr 25, 2023, 1:30:20 PM

to mle...@gmail.com, bup-list

On Wed, 2022-12-21 at 08:44 -0800, mle...@gmail.com wrote:
> Thank you for your help. This is indeed what I found when I did a "git
> checkout", except that I got hexadecimal folder and file names:
>
> find file.JPG.bup/ -type f |sort
> file.JPG.bup/000000/00000
> file.JPG.bup/000000/00d0a
> file.JPG.bup/000000/0105f
> ...
> file.JPG.bup/023a5c/01530
> file.JPG.bup/023a5c/01bc5

Just for the record, yes, indeed. And those numbers are the byte offsets
from the beginning of the file/containing folder, so random access to
such files and file size calculation (in _normal_or_chunked_file_size())
is easier.

johannes
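The point about byte offsets can be illustrated with a short Python sketch. It is not bup's code and does not reproduce `_normal_or_chunked_file_size()`; the helper names `locate` and `total_size` are hypothetical. It assumes only what is stated above: within one tree level, each entry name is the hexadecimal byte offset at which that entry starts, relative to its containing folder.

```python
import bisect


def locate(offset_names, position):
    """Return (entry_name, offset_within_entry) for a byte position.

    offset_names must be the names of one tree level in ascending order.
    """
    offsets = [int(name, 16) for name in offset_names]
    index = bisect.bisect_right(offsets, position) - 1
    if index < 0:
        raise ValueError("position lies before the first entry")
    return offset_names[index], position - offsets[index]


def total_size(last_entry_name, last_entry_size):
    """Size of a level: start offset of its last entry plus that entry's size."""
    return int(last_entry_name, 16) + last_entry_size
```

With the names from the listing, `locate(["00000", "00d0a", "0105f"], 0xd0a + 5)` selects the entry `00d0a` and a position 5 bytes into it, without reading any earlier chunk. Likewise, the size of a level needs only its last entry, which is what makes size calculation cheap.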
