Index Accumulates Deletes

Freshleaf Media Ltd

Sep 1, 2021, 4:58:21 AM
to bup-list
Hello,

Is it expected behaviour that `bup index --print --status` shows all deletes that have ever happened?

Steps to reproduce:
  1. Initialise new Bup index
  2. Index and save some files
  3. Delete some files
  4. Index and save again
  5. Index status still shows the files as deleted

I expected that once the deletes had been indexed and saved, they wouldn't appear in the index status output again. I noticed this because, after using Bup for a while and running backups every day, printing the index status shows every file that has ever been deleted. This obscures what has actually changed.

Additionally, the documentation (https://bup.github.io/man/bup-index.html#modes) explicitly says:

> "...marked in the index as added, modified, deleted, or unchanged since the last backup"

Full steps to reproduce:

Initialise new Bup index
```
mkdir /tmp/bup
cd /tmp/bup
bup init
```

Index and save some files
```
touch /tmp/bup/{0..9}

bup index --update /tmp/bup
bup save --name alpha --compress 3 -vv /tmp/bup
```

Delete some files
```
rm /tmp/bup/0
```

Index and save again
```
bup index --update /tmp/bup
bup save --name alpha --compress 3 -vv /tmp/bup
```

Index status still shows the files as deleted

```
bup index --print --status
```

```
  9
  8
  7
  6
  5
  4
  3
  2
  1
D 0
  ./
```
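
As a stopgap for the noise described above, the status text can be filtered so that lingering deletes are hidden. This is only a minimal sketch: it assumes the status flag is the first column followed by a space, as in the listing above, and `hide_deleted` is a hypothetical helper, not part of bup.

```python
def hide_deleted(status_output: str) -> str:
    """Drop lines flagged 'D' from captured `bup index --print --status` text.

    Assumption: each line starts with a one-character status flag (or a
    space) followed by a space and the path, as in the listing above.
    """
    kept = [line for line in status_output.splitlines()
            if not line.startswith("D ")]
    return "\n".join(kept)
```

This only changes what is displayed; the deleted entries remain in the index itself.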

Rob Browning

Sep 2, 2021, 1:33:19 AM
to Freshleaf Media Ltd, bup-list
Freshleaf Media Ltd <goo...@freshleafmedia.co.uk> writes:

> Hello,
>
> Is it expected behaviour that `bup index --print --status` shows all
> deletes that have ever happened?

If I recall correctly, that may be the expected behavior of the current
index implementation. There's even a question in cmd/index.py:

# FIXME: shouldn't we remove deleted entries eventually? When?

I suspect we'll address that later. I was toying with a potential
replacement for the current index a while ago, but haven't gotten back
to it to see if it really works out.

For now, you could --clear the index to drop all of those, but the next
index run after that will be expensive (i.e. it'll have to re-read *all*
the filesystem data).

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Johannes Berg

Sep 2, 2021, 3:09:00 AM
to Rob Browning, Freshleaf Media Ltd, bup-list
On Thu, 2021-09-02 at 00:33 -0500, Rob Browning wrote:
>
> For now, you could --clear the index to drop all of those, but the next
> index run after that will be expensive (i.e. it'll have to re-read *all*
> the filesystem data).

To clarify - the next *index* won't be a big difference, but the next
*save* after that will re-read all the data, and thus be expensive,
because after --clear it no longer knows the checksums of on-disk files.

I suppose a --clear-deleted wouldn't be hard to implement.

johannes

Greg Troxel

Sep 2, 2021, 8:06:19 AM
to Johannes Berg, Rob Browning, Freshleaf Media Ltd, bup-list

Johannes Berg <joha...@sipsolutions.net> writes:

> I suppose a --clear-deleted wouldn't be hard to implement.

It wouldn't, and it would let people try it and find out if there are
any issues.

I realize Rob is talking about different formats -- but I wonder if that
also means different semantics.

Freshleaf Media Ltd

Sep 2, 2021, 8:10:46 AM
to bup-list
A --clear-deleted argument would solve my use case, and I would be happy to report back on any issues I come across when using it.

Johannes Berg

Sep 2, 2021, 8:27:36 AM
to Greg Troxel, Rob Browning, Freshleaf Media Ltd, bup-list
I think he's talking about the sqlite index thoughts/work, which would
in fact make this easier - today I think it would require rewriting the
index.

But I think you're right - I'm not sure we _should_ be making semantic
changes when the format changes, seems that should be orthogonal. OTOH,
the biggest reason for having them linger in the index is that it
requires rewriting and that's harder than just newly indexing/updating.

If I find time later, I might take a look at just implementing something
like --clear-deleted (probably better called --prune-deleted?)

johannes

Rob Browning

Sep 3, 2021, 12:26:52 PM
to Johannes Berg, Greg Troxel, Freshleaf Media Ltd, bup-list
Johannes Berg <joha...@sipsolutions.net> writes:

> I think he's talking about the sqlite index thoughts/work, which would
> in fact make this easier - today I think it would require rewriting the
> index.

Right, it doesn't have to be sqlite, but that's what I've been toying
with. The main purpose is to make it easier to handle various
operations (and enhancements), including deletes, and even more
importantly, to be able to handle additional information, like the
current metadata safely.

Right now (extended) metadata is stored separately, because the
current index is an efficient, mmapped data structure (array
representation of a tree with inter-node pointers represented by array
offsets) that has no easy way to handle variable length data that might
change size -- without something like a rewrite (merge). That's why the
extended metadata was originally stored externally with a fixed length
integer pointer field linking the index entry to its (extended)
metadata.

Updates to existing index entries (mtime changes, etc. -- basic
fixed-length stat data is in the main index) happen in-place via direct
array writes to the mmapped data structure, and so there's no
(pedestrian) way to avoid the potential for data races, "torn" updates,
etc. The integer value linking an entry to its externally stored,
extended metadata is also updated that way whenever the metadata
changes. We change the "id" because the extended metadata entries are
deduplicated and we never modify one in-place.
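
To illustrate why deletes tend to linger in a structure like this, here is a rough sketch of fixed-length records in an mmapped file. The record layout (`mtime`, `flags`, `meta_id`), the `FLAG_DELETED` bit, and all function names are hypothetical and are not bup's actual index format. Flagging an entry as deleted is a cheap in-place write, whereas actually dropping it means writing out a new, shorter array.

```python
import mmap
import struct
import tempfile

# Hypothetical fixed-length record: mtime (u64), flags (u16), meta_id (u32).
# This is NOT bup's real on-disk layout.
RECORD = struct.Struct("<QHI")
FLAG_DELETED = 0x1


def mark_deleted(mm, i):
    """Set the deleted flag of record i with a direct in-place write."""
    off = i * RECORD.size
    mtime, flags, meta_id = RECORD.unpack_from(mm, off)
    RECORD.pack_into(mm, off, mtime, flags | FLAG_DELETED, meta_id)


def live_records(mm):
    """Return every record whose deleted flag is not set."""
    count = len(mm) // RECORD.size
    recs = (RECORD.unpack_from(mm, i * RECORD.size) for i in range(count))
    return [r for r in recs if not r[1] & FLAG_DELETED]


def demo():
    with tempfile.TemporaryFile() as f:
        for rec in [(100, 0, 1), (200, 0, 2), (300, 0, 3)]:
            f.write(RECORD.pack(*rec))
        f.flush()
        mm = mmap.mmap(f.fileno(), 0)
        try:
            size_before = len(mm)
            mark_deleted(mm, 1)           # in place: the file size is unchanged
            size_after = len(mm)
            survivors = live_records(mm)
        finally:
            mm.close()
    # Pruning means producing a new, shorter array (a rewrite).
    pruned = b"".join(RECORD.pack(*r) for r in survivors)
    return size_before, size_after, survivors, len(pruned)
```

In a real tree-shaped index the rewrite would also have to fix up any offsets that point past the removed entries, which this flat sketch does not model.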

I believe the separation between the index proper, and the metadata
store is the source of some of the "broken index" problems we see
reported periodically. With sqlite, of course, it'd be easy to handle
those updates transactionally.
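
As a rough illustration of what "transactionally" could buy here, the sketch below uses Python's standard `sqlite3` module with an invented two-table schema (`entry` and `meta`); none of these names come from bup or from the experimental index work. The point is only that the new metadata row and the entry that points at it are committed together or not at all.

```python
import sqlite3


def make_db():
    """Create a toy in-memory index; the schema is purely hypothetical."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE meta (id INTEGER PRIMARY KEY, blob TEXT NOT NULL);
        CREATE TABLE entry (path TEXT PRIMARY KEY,
                            mtime INTEGER NOT NULL,
                            meta_id INTEGER NOT NULL REFERENCES meta(id));
    """)
    with db:
        db.execute("INSERT INTO meta (id, blob) VALUES (1, 'mode=0644')")
        db.execute("INSERT INTO entry VALUES ('/tmp/bup/1', 100, 1)")
    return db


def update_entry(db, path, mtime, blob, fail=False):
    """Store new metadata and repoint the entry in a single transaction."""
    with db:  # commits on success, rolls back if an exception escapes
        cur = db.execute("INSERT INTO meta (blob) VALUES (?)", (blob,))
        if fail:
            raise RuntimeError("simulated crash between the two writes")
        db.execute(
            "UPDATE entry SET mtime = ?, meta_id = ? WHERE path = ?",
            (mtime, cur.lastrowid, path),
        )
```

If the simulated failure fires, the half-finished metadata insert is rolled back, so the entry never ends up pointing at a missing or stale metadata record.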

> But I think you're right - I'm not sure we _should_ be making semantic
> changes when the format changes, seems that should be orthogonal. OTOH,
> the biggest reason for having them linger in the index is that it
> requires rewriting and that's harder than just newly indexing/updating.

Right, I'd been preserving the existing semantics, though expected we
might want to discuss deletes if some of the current semantics were
partially a result of the storage method.

I've vaguely wondered about what semantics I think I might *want* with
respect to deletes. Offhand, conceptually, I could imagine wanting them
to disappear after the next save that includes their parent directory,
but given the fact (for example) that you can save arbitrary individual
paths...
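
One way to picture that rule and its complication is the sketch below. It is purely illustrative: `prunable` is a hypothetical helper and "covered by a save" is my own reading of the idea, not an agreed semantic. A deleted path is treated as droppable only if one of the saved paths is its parent directory or an ancestor of it.

```python
from pathlib import PurePosixPath


def prunable(deleted, saved_paths):
    """Return deleted paths whose parent directory was included in a save.

    Hypothetical rule: a save "includes" a directory if the saved path is
    that directory or one of its ancestors.
    """
    saved = [PurePosixPath(p) for p in saved_paths]
    result = []
    for d in deleted:
        parent = PurePosixPath(d).parent
        if any(parent == s or s in parent.parents for s in saved):
            result.append(d)
    return result
```

Under this rule, saving `/tmp/bup` would let the deleted `/tmp/bup/0` disappear, but saving only the individual path `/tmp/bup/1` would not, which is the awkward case with arbitrary individual paths.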

> If I find time later, I might take a look at just implementing something
> like --clear-deleted (probably better called --prune-deleted?)

If there's no obvious way we already think we should handle deletes more
implicitly, then sounds plausible, and might be plausible anyway.

But I'd want to give it a bit of thought if we have any suspicions that
we might not want/need to keep the option/semantics if we were to rework
the index.