[PATCH 1/4] DESIGN: add repository taxonomy describing expected variations

Rob Browning

Nov 19, 2025, 5:36:17 PM (3 days ago) Nov 19
to bup-...@googlegroups.com
Signed-off-by: Rob Browning <r...@defaultvalue.org>
---
DESIGN.md | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/DESIGN.md b/DESIGN.md
index 65ca4ed1..ef172d5e 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -523,6 +523,39 @@ different set of filesystems than the save tree, complete sets of
hardlinks may not be restored.


+Repository Taxonomy
+-------------------
+
+The format of the data that may appear in a repository has varied over
+time, both as a result of intentional changes and earlier bugs.
+
+ - A tree object may not have bup created metadata (i.e. may not have
+ a `.bupm` file). Perhaps because it was created by git or a version
+ of bup before metadata support was added. Eventually that might
+ also result from repairs, though for the moment, it's not
+ possible. The abridgement repair (see below) comes close, but ends
+ up leaving a `.bupm` with empty entries for everything except ".".
+
+ - A `.bupm` file may be abridged, i.e. have missing entries due to a
+ bug introduced in 0.25 by 16f9f9829038f25aec80ebfae3c882a66281e145
+ ("save-cmd.py: don't crash when a path disappears between index and
+ save") and fixed for 0.30.1 by
+ 47891d8951a95b8e0d9ca94387107cdf12ca3d3c ("save: add empty metadata
+ if reading fails"). Related: `bup-validate-refs --bupm` and `bup
+ get --repair`.
+
+ - A `.bupm` may have "empty" entries, i.e. a path's entry in a
+ `.bupm` might be the encoding of a `Metadata()` object with no
+ attributes. This may be because it's a "." entry for a "synthetic"
+ directory (created via save strip/graft operations), or it may be
+ due to the fix for the abridgement issue described above, and it
+ can also occur as the result of repairs (cf. `bup-get`(1)).
+
+ - Repositories created before the introduction of split trees won't
+ of course have split trees, nor will current repositories with
+ bup.split.trees set to false.
+
+
Filesystem Interaction
======================

--
2.47.3
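
For readers following along, the `.bupm` variations listed in the taxonomy above can be sketched as a small classifier. This is only a toy model, not bup's API: `BupmState`, `classify_bupm`, the use of plain mappings to stand in for decoded metadata entries, and the caller-supplied `expected_count` are all hypothetical names and assumptions made for illustration.

```python
from enum import Flag, auto
from typing import Mapping, Optional, Sequence


class BupmState(Flag):
    """Toy labels for the .bupm variations named in the taxonomy."""
    COMPLETE = 0
    MISSING = auto()        # the tree has no .bupm at all
    ABRIDGED = auto()       # fewer entries than the tree calls for
    EMPTY_ENTRIES = auto()  # at least one entry carries no attributes


def classify_bupm(entries: Optional[Sequence[Mapping]],
                  expected_count: int) -> BupmState:
    """Classify a hypothetical, already-decoded .bupm.

    entries is None when the tree has no .bupm; otherwise each item is
    a mapping of attribute names to values (an assumption, standing in
    for a real metadata object).  expected_count is supplied by the
    caller, since this sketch does not model how bup derives it.
    """
    if entries is None:
        return BupmState.MISSING
    state = BupmState.COMPLETE
    if len(entries) < expected_count:
        state |= BupmState.ABRIDGED
    if any(len(entry) == 0 for entry in entries):
        state |= BupmState.EMPTY_ENTRIES
    return state
```

Note that the states can combine: a `.bupm` that is short an entry and also contains an attribute-free "." entry would be reported as both `ABRIDGED` and `EMPTY_ENTRIES`.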

Rob Browning

Nov 19, 2025, 5:36:17 PM (3 days ago) Nov 19
to bup-...@googlegroups.com
Signed-off-by: Rob Browning <r...@defaultvalue.org>
---
DESIGN.md | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/DESIGN.md b/DESIGN.md
index ef172d5e..5581477a 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -788,9 +788,9 @@ comprehensive solution. In theory, this might be sufficient, but our
initial randomized testing discovered that some binary arguments would
crash Python during startup[1]. Eventually Johannes Berg tracked down
the [cause](https://sourceware.org/bugzilla/show_bug.cgi?id=26034),
-and we hope that the problem will be fixed eventually in glibc or
-worked around by Python, but in either case, it will be a long time
-before any fix is widely available.
+and we hoped that the problem would be fixed eventually in glibc or
+worked around by Python, but in either case, we assumed it would be a
+long time before any fix was widely available.

Before we tracked down that bug we were pursuing an approach that
would let us side step the issue entirely by manipulating the
@@ -798,14 +798,13 @@ LC_CTYPE, but that approach was somewhat complicated, and once we
understood what was causing the crashes, we decided to just let Python
3 operate "normally", and work around the issues.

-Consequently, we've had to wrap a number of things ourselves that
-incorrectly return Unicode strings (libacl, libreadline, hostname,
-etc.) and we've had to come up with a way to avoid the fatal crashes
+Consequently, we ended up wrapping a number of things ourselves that
+incorrectly returned Unicode strings (libacl, libreadline, hostname,
+etc.) and we had to come up with a way to avoid the fatal crashes
caused by some command line arguments (sys.argv) described above. To
-fix the latter, for the time being, we just use a trivial sh wrapper
-to redirect all of the command line arguments through the environment
-in BUP_ARGV_{0,1,2,...} variables, since the variables are unaffected,
-and we can access them directly in Python 3 via environb.
+fix the latter, we changed bup from a Python script to a binary
+executable, to allow direct access to the process argv. (This also
+makes bup appear as "bup" to the system, rather than python.)

[1] Our randomized argv testing found that the byte smuggling approach
was not working correctly for some values (initially discovered in
--
2.47.3
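
As background for the "byte smuggling" mentioned in this patch: Python 3 normally turns undecodable argument bytes into lone surrogates via the `surrogateescape` error handler, so that the original bytes can be recovered later. The sketch below shows only that intended round trip, assuming a UTF-8 encoding; it does not reproduce the glibc-related startup crash, and the helper names `smuggle` and `unsmuggle` are made up for illustration.

```python
def smuggle(raw: bytes) -> str:
    """Decode bytes the way Python 3 is meant to treat argv under UTF-8:
    bytes that are not valid UTF-8 become lone surrogates."""
    return raw.decode('utf-8', 'surrogateescape')


def unsmuggle(text: str) -> bytes:
    """Recover the original bytes from a smuggled string."""
    return text.encode('utf-8', 'surrogateescape')


raw_arg = b'caf\xe9'         # Latin-1 encoded, not valid UTF-8
as_str = smuggle(raw_arg)    # the 0xe9 byte becomes U+DCE9
round_trip = unsmuggle(as_str)
```

When this mechanism works, `round_trip == raw_arg`. Reading the process argv directly as bytes, as the patch describes, avoids depending on the round trip at all.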

Rob Browning

Nov 19, 2025, 5:36:17 PM (3 days ago) Nov 19
to bup-...@googlegroups.com
Some updates prompted during work on rewrite/repair support. Describe
repository format variations, describe the bupm ordering issue, etc.

Pushed to main.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

Nov 19, 2025, 5:36:18 PM (3 days ago) Nov 19
to bup-...@googlegroups.com
Signed-off-by: Rob Browning <r...@defaultvalue.org>
---
README.md | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 235b9d81..6b335d03 100644
--- a/README.md
+++ b/README.md
@@ -71,14 +71,17 @@ Reasons you might want to avoid bup
more likely to eat your data. It's also missing some
probably-critical features, though fewer than it used to be.

- - It requires python 3.7 or newer, a C compiler, and an installed git
- version >= 1.7.2. It also requires par2 if you want fsck to be
- able to generate the information needed to recover from some types
- of corruption.
-
- - It currently only works on Linux, FreeBSD, NetBSD, OS X >= 10.4,
- Solaris, or Windows (with Cygwin, and WSL). Patches to support
- other platforms are welcome.
+ - While it is intended to work with python 3.7 or newer, a C
+ compiler, and an installed git version >= 1.7.2, it is currently
+ only automatically tested against some python versions 3.9 and
+ newer and git versions 2.3 and newer. Please report any
+ discrepancies. It also requires par2 if you want fsck to be able
+ to generate the information needed to recover from some types of
+ corruption.
+
+ - It has only been reported to work on Linux, FreeBSD, NetBSD, OS X
+ >= 10.4, Solaris, or Windows (with Cygwin, and WSL). Patches to
+ support other platforms are welcome.

- Any items in "Things that are stupid" below.

--
2.47.3

Rob Browning

Nov 19, 2025, 5:36:18 PM (3 days ago) Nov 19
to bup-...@googlegroups.com
Signed-off-by: Rob Browning <r...@defaultvalue.org>
---
DESIGN.md | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/DESIGN.md b/DESIGN.md
index 5581477a..f4cebe18 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -469,11 +469,20 @@ store file contents with a small bit of extra information, like
symlink targets and executable bits, so we have to store the rest
some other way.

-Bup stores more complete metadata in the VFS in a file named .bupm in
-each tree. This file contains one entry for each file in the tree
-object, sorted in the same order as the tree. The first .bupm entry
-is for the directory itself, i.e. ".", and its name is the empty
-string, "".
+Excepting much earlier versions, bup stores more complete metadata in
+the repository in a file named .bupm in each tree. This file contains
+one entry for each file in the tree object. The first .bupm entry is
+for the directory itself, i.e. ".", and its name is the empty string,
+"".
+
+The .bupm entries were intended to be in the same order as the git
+tree so that you could walk through the tree and .bupm incrementally,
+in parallel, but unfortunately, they currently aren't. The bupm
+entries are ordered by the corresponding tree entry's "real"
+(unmangled) name, not the actual name in the tree. Though the .bupm
+ordering does account for the fact that git sorts trees (including
+chunked trees) as if their names ended with "/" (so "fo" sorts after
+"fo." iff fo is a directory).

Each .bupm entry contains a variable length sequence of records
containing the metadata for the corresponding path. Each record
@@ -486,10 +495,6 @@ The .bupm file is optional, and when it's missing, bup will behave as
it did before the addition of metadata, and restore files using the
tree information.

-The nice thing about this design is that you can walk through each
-file in a tree just by opening the tree and the .bupm contents, and
-iterating through both at the same time.
-
Since the contents of any .bupm file should match the state of the
filesystem when it was *indexed*, bup must record the detailed
metadata in the index. To do this, bup records four values in the
--
2.47.3
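
The trailing-slash rule described in this patch ("fo" sorts after "fo." iff fo is a directory) can be sketched in a few lines. This only illustrates git's directory ordering rule; it makes no attempt to model bup's name mangling or chunked trees, and `git_tree_key` and `sort_like_git` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[bytes, bool]  # (name, is_dir)


def git_tree_key(name: bytes, is_dir: bool) -> bytes:
    """Sort key that compares directory names as if they ended in '/'."""
    return name + b'/' if is_dir else name


def sort_like_git(entries: Iterable[Entry]) -> List[Entry]:
    """Order (name, is_dir) pairs using the trailing-slash rule."""
    return sorted(entries, key=lambda entry: git_tree_key(*entry))


# "fo" as a directory compares as b"fo/", and "/" (0x2f) sorts after
# "." (0x2e), so the directory lands after the file "fo.".
with_dir = sort_like_git([(b'fo', True), (b'fo.', False)])

# As a plain file, "fo" is a prefix of "fo." and so sorts first.
with_file = sort_like_git([(b'fo.', False), (b'fo', False)])
```

Here `with_dir` places `fo.` before `fo`, while `with_file` places `fo` before `fo.`.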

Stefan Monnier

Nov 19, 2025, 5:41:22 PM (3 days ago) Nov 19
to Rob Browning, bup-...@googlegroups.com
> + - A tree object may not have bup created metadata (i.e. may not have

In my world this means "A tree object is not allowed to ...".
Maybe "A tree object may fail to have ..." or "A tree object may lack ..."?


Stefan

Rob Browning

Nov 19, 2025, 6:01:39 PM (3 days ago) Nov 19
to Stefan Monnier, bup-...@googlegroups.com
Hah, fair.

...or "A tree object might not have bup created metadata...".

I'll fix it.

Thanks