[PATCH 1/1] bup save could fail trying to read the metadata of a deleted file

78 views
Skip to first unread message

Damien Robert

unread,
Oct 13, 2012, 5:09:59 PM10/13/12
to bup-...@googlegroups.com, Damien Robert
If bup index see that a file foo has no change, bup save still need to read
foo in order to update the metadata. If the user removed foo between the
index and the save, the metadata call metadata.from_path would result in a
non intercepted error. This commit adds a try/except to handle this case.

Signed-off-by: Damien Robert <Damien.Olivi...@gmail.com>
---
cmd/save-cmd.py | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/cmd/save-cmd.py b/cmd/save-cmd.py
index bd72b76..7afc994 100755
--- a/cmd/save-cmd.py
+++ b/cmd/save-cmd.py
@@ -332,9 +332,18 @@ for (transname,ent) in r.filter(extra, wantrecurse=wantrecurse_during):
shalists[-1].append(git_info)
sort_key = git.shalist_item_sort_key((ent.mode, file, id))
hlink = find_hardlink_target(hlink_db, ent)
- metalists[-1].append((sort_key,
+ # In this case, the index version of the file has not changed since
+ # the last update. We still need to upload the metadata, but this
+ # can fail if the file has been removed after the indexing (or
+ # because we indexed a different subset of files, so in the index
+ # we do not see that the file is missing)
+ try:
+ metalists[-1].append((sort_key,
metadata.from_path(ent.name,
hardlink_target=hlink)))
+ except (OSError, IOError), e:
+ add_error(e)
+ lastskip_name = ent.name
else:
if stat.S_ISREG(ent.mode):
try:
@@ -375,9 +384,15 @@ for (transname,ent) in r.filter(extra, wantrecurse=wantrecurse_during):
shalists[-1].append(git_info)
sort_key = git.shalist_item_sort_key((ent.mode, file, id))
hlink = find_hardlink_target(hlink_db, ent)
- metalists[-1].append((sort_key,
+ # an error here is much less probable since we have just read
+ # the file, but there still could be a race condition
+ try:
+ metalists[-1].append((sort_key,
metadata.from_path(ent.name,
hardlink_target=hlink)))
+ except (OSError, IOError), e:
+ add_error(e)
+ lastskip_name = ent.name
if exists and wasmissing:
count += oldsize
subcount = 0
--
Patched on top of bup-0.25-rc1-69-gee00d53

Damien Robert

unread,
Oct 13, 2012, 5:09:58 PM10/13/12
to bup-...@googlegroups.com, Damien Robert
Whith the metadata branch, bup save reads the metadata of a file even if it
is not updated in the index. This lead to a bug if the file was deleted in
the meantime, because the exception in xstat.py was not catched.

This patch catch the exception, so bup save prints some warnings, but at
least the save is done.

This is still not quite satisfactory, because each next save will produce
the same warning. So you need to remove .bup_index to solve the problem.
When I have time I'll try to write a refresh option to bup index.

More precisely there is two ways this could be usefull. Let's say you save
3 files (or dirs) a b c in dir foo.
cd foo
bup index a b c
bup save .

Now you modify c, bup index a b c, rm c, bup save .
This produces a warning that c does not exist.

Another possibility is that you modify a, bup index a b c, rm c, bup save.
Here bup save crashes because c does not exist anymore when xstat tries to
read the metadata. The patch corrects this and print a warning.

But still, in each of the two cases there will be a warning. To teach bup
that c is not here anymore, you can do bup index foo (but if you have other
files/dirs in foo, this will add themp which is not what you want), or
remove the index and reconstruct it.


Damien Robert (1):
bup save could fail trying to read the metadata of a deleted file

cmd/save-cmd.py | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

Rob Browning

unread,
Oct 14, 2012, 4:43:45 PM10/14/12
to Damien Robert, bup-...@googlegroups.com, Damien Robert
Damien Robert <damien.oli...@gmail.com> writes:

> Whith the metadata branch, bup save reads the metadata of a file even if it
> is not updated in the index. This lead to a bug if the file was deleted in
> the meantime, because the exception in xstat.py was not catched.

Thanks. I'll take a look at this more carefully when I get a chance.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Oei, Yung-Chin

unread,
Oct 19, 2012, 11:35:57 AM10/19/12
to Damien Robert, bup-...@googlegroups.com, Damien Robert
On Sat, Oct 13, 2012 at 11:09:59PM +0200, Damien Robert wrote:
> If bup index see that a file foo has no change, bup save still need to read
> foo in order to update the metadata. If the user removed foo between the
> index and the save, the metadata call metadata.from_path would result in a
> non intercepted error. This commit adds a try/except to handle this case.

Thanks for bringing this up. I think there are actually worse scenarios
than the "removed foo" that you fix in this patch - and therefore, I
feel we might need a more drastic fix. Eg. what if foo got replaced by
another file named foo?

Appended below is a quick script that shows that in that case, we will
end up restoring the contents of the old file (that's fine, it's
expected), but applying the new file's metadata to it.

My gut-feeling would be that the only "safe" way to go about this, is to
bup-join the .bupm file from the repository, and replace only the lines
for files that we really update? That's a hell of a change though...

YC
meta_test.sh
meta_test.out

Damien Robert

unread,
Oct 19, 2012, 4:11:47 PM10/19/12
to bup-...@googlegroups.com
"Oei, Yung-Chin" wrote in message <20121019153557.GA22910@D630new>:
> Thanks for bringing this up. I think there are actually worse scenarios
> than the "removed foo" that you fix in this patch - and therefore, I
> feel we might need a more drastic fix. Eg. what if foo got replaced by
> another file named foo?
>
> Appended below is a quick script that shows that in that case, we will
> end up restoring the contents of the old file (that's fine, it's
> expected), but applying the new file's metadata to it.
>
> My gut-feeling would be that the only "safe" way to go about this, is to
> bup-join the .bupm file from the repository, and replace only the lines
> for files that we really update? That's a hell of a change though...

I agree with you that the only sane way is to only update the metadata of
files that were indexed. Or more precisely: when running bup index, add to
the changed list also the files whose metadata has changed, and when
running bup save, don't store the metadata of files that are not in the
changelist. In other words treat metadata change as if it was a content change.

Otherwise we get situations like the one you showed, where the metadata get
updated but not the content, which can't be right...

Rob Browning

unread,
Oct 19, 2012, 9:42:16 PM10/19/12
to Oei, Yung-Chin, Damien Robert, bup-...@googlegroups.com, Damien Robert
"Oei, Yung-Chin" <yung...@yungchin.nl> writes:

> Appended below is a quick script that shows that in that case, we will
> end up restoring the contents of the old file (that's fine, it's
> expected), but applying the new file's metadata to it.

If I understand this correctly, it's a show-stopper for the
tmp/pending/meta integration (fortunately the relevant bits aren't in
master yet). But I'll need to look into it more carefully (probably
tomorrow).

Thanks to you both.

Rob Browning

unread,
Oct 20, 2012, 12:28:50 PM10/20/12
to Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
Damien Robert <damien.oli...@gmail.com> writes:

> Whith the metadata branch, bup save reads the metadata of a file even if it
> is not updated in the index. This lead to a bug if the file was deleted in
> the meantime, because the exception in xstat.py was not catched.
>
> This patch catch the exception, so bup save prints some warnings, but at
> least the save is done.
>
> This is still not quite satisfactory, because each next save will produce
> the same warning. So you need to remove .bup_index to solve the problem.
> When I have time I'll try to write a refresh option to bup index.
>
> More precisely there is two ways this could be usefull. Let's say you save
> 3 files (or dirs) a b c in dir foo.
> cd foo
> bup index a b c
> bup save .
>
> Now you modify c, bup index a b c, rm c, bup save .
> This produces a warning that c does not exist.

So let's start here -- while there may well be a bug with the metadata
code, and its interactions with the index, my first question would be
"How is this intended to work"? At least back as far as bup-0.25-rc1,
bup has treated this situation (a missing file) as a deferred error.

And during my work, I've assumed that the only way you're guaranteed to
end up with a completely accurate backup is if you use something like
snapshots to make sure that the filesystem doesn't change between index
and save. That said, I think we should try to handle any discrepancies
gracefully whenever feasible.

In general, I do think it's a reasonable choice to treat a missing file
as a deferred error. At least for my current uses, that would always
indicate a problem that I'd want to know about.

> Another possibility is that you modify a, bup index a b c, rm c, bup save.
> Here bup save crashes because c does not exist anymore when xstat tries to
> read the metadata. The patch corrects this and print a warning.

Right -- bup certainly shouldn't crash.

> But still, in each of the two cases there will be a warning. To teach bup
> that c is not here anymore, you can do bup index foo (but if you have other
> files/dirs in foo, this will add themp which is not what you want), or
> remove the index and reconstruct it.

Right -- offhand, right now, I don't think we have a clean way to ask
the index to notice that c is gone without reindexing the whole tree.

The current index is additive only as far as I know. If you want to
remove anything from the tree that's going to be saved, you have to
start over.

Thanks

Rob Browning

unread,
Oct 20, 2012, 1:01:29 PM10/20/12
to Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
Rob Browning <r...@defaultvalue.org> writes:

> The current index is additive only as far as I know. If you want to
> remove anything from the tree that's going to be saved, you have to
> start over.

...additive wasn't a good choice of words. I meant that you can't
remove individual files from a directory in the index. For that I
believe you have to reindex a relevant subtree.

Rob Browning

unread,
Oct 21, 2012, 1:49:01 PM10/21/12
to Oei, Yung-Chin, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
"Oei, Yung-Chin" <yung...@yungchin.nl> writes:

> My gut-feeling would be that the only "safe" way to go about this, is to
> bup-join the .bupm file from the repository, and replace only the lines
> for files that we really update? That's a hell of a change though...

Let me see if I understand the situation. The problem is that as
currently envisioned, tmp/pending/meta can only capture an entire
directory's worth of metadata, which it stores in the .bupm file inside
that directory.

And of course, if *anything* changes in a dir, or the dir itself
changes, bup has to rewrite the entire .bupm file, i.e. if foo/ or
anything in foo changes, even just a timestamp, bup will completely
rebuild foo/.bupm.

This approach works fine in the simple case where there are no
--exclude*s, and no partial indexing/saving. Although the user still
has to make sure the filesystem doesn't change inappropriately between
index and save. But this approach doesn't work fine in the more
complicated cases.

It sounds like you're suggesting that in the relevant cases we try to
arrange for bup to extract an old .bupm (if one exists), and apply only
the "appropriate changes" to the old .bupm to generate the new one.

Offhand, I'm not exactly sure how difficult that might be.

[1] Recall that foo/.bupm contains the metadata records for foo itself,
and each of its items, in git-tree order, with the metadata for foo
as the first entry (with a name of "").

Thanks

Oei, Yung-Chin

unread,
Oct 21, 2012, 7:26:38 PM10/21/12
to Rob Browning, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
On 21 October 2012 18:49, Rob Browning <r...@defaultvalue.org> wrote:
> Let me see if I understand the situation. [...]

Yep, or at least that's exactly the same as how I understand it.

> This approach works fine in the simple case where there are no
> --exclude*s, and no partial indexing/saving. Although the user still
> has to make sure the filesystem doesn't change inappropriately between
> index and save.

About that last point: what's inappropriate? I mean, I always use
read-only snapshots, so the potential for problems never occurred to me
until I read Damien's report. But unless snapshotting becomes mandatory,
I think even the simplest use case may have unexpected behaviour: if you
just update the index for a whole tree, and it finds that some files
haven't changed, and then there are changes to those files before you
run a save, that would also give you old data with new metadata (often
just new mtimes, but still).

(simplified test script for that case attached below - I tried it on
tmp/pending/meta)

> It sounds like you're suggesting that in the relevant cases we try to
> arrange for bup to extract an old .bupm (if one exists), and apply only
> the "appropriate changes" to the old .bupm to generate the new one.

Yes, but I don't know the metadata code closely enough to see any
potentially smarter tricks. That is, I'm not suggesting it's the optimal
solution, it would just seem the safest bet, to just say "never read-in
metadata without reading-in the data".

The thing is, when you apply such a simple rule, you'll pretty much
_always_ fetch an old bupm for every directory that's changed or
contains changed files, because there will almost always be some
directory contents that didn't change. So for performance's sake I hope
there's a better trick?

Thanks, YC

meta_test.sh

Rob Browning

unread,
Oct 27, 2012, 3:00:50 PM10/27/12
to Oei, Yung-Chin, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
"Oei, Yung-Chin" <yung...@yungchin.nl> writes:

> On 21 October 2012 18:49, Rob Browning <r...@defaultvalue.org> wrote:
>> Let me see if I understand the situation. [...]
>
> Yep, or at least that's exactly the same as how I understand it.

So ignoring efficiency for a moment, could this be a reasonable
improvement (perhaps it's what you were thinking about):

Whenever bup save isn't going to look at a path's content, because it
sees that the repository already contains the (valid) index sha1 for
that path, then bup finds finds the relevant .bupm in the current
save's parent commit, and uses that metadata blob for the path.

A simple implementation might just traverse the .bupm metadata entries
during save in parallel with the existing index and filesystem
traversals.

Damien Robert

unread,
Oct 27, 2012, 3:22:28 PM10/27/12
to bup-...@googlegroups.com
Rob Browning wrote in message
<87lier6...@trouble.defaultvalue.org>:
> So ignoring efficiency for a moment, could this be a reasonable
> improvement (perhaps it's what you were thinking about):
>
> Whenever bup save isn't going to look at a path's content, because it
> sees that the repository already contains the (valid) index sha1 for
> that path, then bup finds finds the relevant .bupm in the current
> save's parent commit, and uses that metadata blob for the path.

But then, if a file metadata is updated without content change, bup save
will not save the new metadata.

As I said in a previous post, I am afraid the only way to be consistent is
to treat metadata as data. Meaning that bup index look for data and meta
data change. Bup save saves the data and metadata of any files that were
marked as changed by the index, and only them (currently it update the
metadata of every file in a folder indexed by the index).

Any other way and there will be a potential for desyncrhonization between
data and metadata I think.

> A simple implementation might just traverse the .bupm metadata entries
> during save in parallel with the existing index and filesystem
> traversals.

If efficiency is needed maybe the .bupm metada entries could be replaced by
a sqlite database in the .bup/ folder.

Cheers,
Damien

Oei, Yung-Chin

unread,
Oct 27, 2012, 7:22:49 PM10/27/12
to Rob Browning, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
On Sat, Oct 27, 2012 at 02:00:50PM -0500, Rob Browning wrote:
> So ignoring efficiency for a moment, could this be a reasonable
> improvement (perhaps it's what you were thinking about): [...]

Yes, that's pretty much what I had in mind. But there's a concern that
just crept up on me: is it guaranteed that the previous commit holds the
right metadata for the path? One problematic example is something I've
done in my own repo's more than a few times: run a few bup-saves of a
system, then, say, discover an old dd-image of that same system on a USB
disk, and bup-save it under the same refname (with the -d option to mark
it as an older backup). If we now save the live system again, and for
this retrieve old bupm files, we might have a mess?

(Please don't ask what's my alternative, for now - thinking about it
hard, but not getting anywhere!)

(I also find it hard not to think about efficiency, even for the moment:
in particular, I'd expect the save performance to a remote repository to
be seriously degraded by having to fetch the bupm files.)


On Sat, Oct 27, 2012 at 07:22:28PM +0000, Damien Robert wrote:
> But then, if a file metadata is updated without content change, bup save
> will not save the new metadata.

I think there's no risk of that happening: if there's been any metadata
change for a file whatsoever, bup-index will always mark its hash
invalid, so that file's data will always be read again.

YC

Rob Browning

unread,
Oct 27, 2012, 8:37:49 PM10/27/12
to Damien Robert, bup-...@googlegroups.com
Damien Robert <damien.olivie...@gmail.com> writes:

> But then, if a file metadata is updated without content change, bup save
> will not save the new metadata.

I'm not sure that's true, but I'll have to look more carefully later.

Right now, we rely on the assumption that a change to *any* metadata
will cause a change to one of the limited bits of metadata that bup
index actually watches (via Entry.from_stat()). In most cases, I
imagine that will mean that the mtime changes.

Once that happens, bup index's call to from_stat() will invalidate the
path, and bup save will eventually store the path's "new" content along
with the new metadata.

Again, that might not be right, but that's what it looks like to me from
a quick investigation.

Thanks

Rob Browning

unread,
Oct 28, 2012, 12:03:53 PM10/28/12
to Oei, Yung-Chin, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
"Oei, Yung-Chin" <yung...@yungchin.nl> writes:

> Yes, that's pretty much what I had in mind. But there's a concern that
> just crept up on me: is it guaranteed that the previous commit holds the
> right metadata for the path?

I wondered about that too, but haven't had a chance to check carefully
yet. I suspect that save *always* commits the next backup to "HEAD".
You should be able to look at save-cmd.py to be sure, but from a brief
scan, I think -d may not have any effect on the branch structure.

> (I also find it hard not to think about efficiency, even for the moment:
> in particular, I'd expect the save performance to a remote repository to
> be seriously degraded by having to fetch the bupm files.)

At least currently, you'd have to fetch a .bupm for each directory that
changed, or contained a changed file.

Oei, Yung-Chin

unread,
Oct 28, 2012, 10:16:51 PM10/28/12
to Rob Browning, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
OK, maybe in the end that idea of fetching the previous bupm files
doesn't seem so solid...

On Sun, Oct 28, 2012 at 11:03:53AM -0500, Rob Browning wrote:
> > Yes, that's pretty much what I had in mind. But there's a concern that
> > just crept up on me: is it guaranteed that the previous commit holds the
> > right metadata for the path?
>
> I wondered about that too, but haven't had a chance to check carefully
> yet. I suspect that save *always* commits the next backup to "HEAD".

Yes, I think it does too. So I think the problem can be summarised in
that if you want to safely reload old bupm files, you need to be able to
tell whether the state of the bupindex is the same state as the ref's
HEAD.

In my example use I mentioned, that's not the case, because when I slip
in a backup from an old backup disk, that will look like a second,
different tree to bup-index (because of the different device number). So
when I then write to the same branch-name, the original tree in the
index will now be in a different state than the ref's HEAD.

Or, for a clearer example: someone might back up to the same refname in
a remote repository from separate sources (I guess that doesn't sound
advisable, but technically it's allowed - imagine some crazy scheme for
backing up database mirrors). Each of those sources doesn't even know
about the other's bupindex, and at any point the refname may always
refer to a different state than what's in bupindex.

Given all that, maybe the safest place to keep the old metadata [*]
would be the bupindex? I mean, if bup-save, when it does save a path,
would store the metadata it sees at that time in the index, then it can
be sure that it's ok to reuse that whenever it decides not to update
that path later, I think.

Would that be manageable? If I read things correctly, there's already
a bit of metadata going into the index - would the extra size if you
stick the full metadata for every path in there, be acceptible?

Thanks, YC

[*]: or maybe, the second-safest... you could also imagine catting the
metadata block together with the data, and writing those as one object.
That way data and metadata would never diverge for sure. But that would
probably result in horrible overhead with small or empty files and
changes of atimes etc etc. And worse, it would break compatibility with
the current storage format.

Rob Browning

unread,
Oct 28, 2012, 10:51:53 PM10/28/12
to Oei, Yung-Chin, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
Rob Browning <r...@defaultvalue.org> writes:

> I wondered about that too, but haven't had a chance to check carefully
> yet. I suspect that save *always* commits the next backup to "HEAD".
> You should be able to look at save-cmd.py to be sure, but from a brief
> scan, I think -d may not have any effect on the branch structure.

Actually, after working on the problem for a bit today, I think this may
be the least of our worries.

Perhaps I'm missing something, but given that --strip/graft options can
change from save run to save run, how can you assume any correspondence
between paths from two adjacent saves?

And even if you didn't have to worry about that, if we eventually add
support for --exclude* at save time, then the set of paths within a
directory might change from save to save, so you won't be able to take
the naive approach of traversing the .bupm in parallel with the
directory/index traversal. I think you'll probably have to grab the
relevant tree object from the previous save (in addition to the .bupm),
and carefully track the filename correspondences.

Rob Browning

unread,
Oct 28, 2012, 10:57:19 PM10/28/12
to Oei, Yung-Chin, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
"Oei, Yung-Chin" <yung...@yungchin.nl> writes:

> Yes, that's pretty much what I had in mind. But there's a concern that
> just crept up on me: is it guaranteed that the previous commit holds the
> right metadata for the path? One problematic example is something I've
> done in my own repo's more than a few times: run a few bup-saves of a
> system, then, say, discover an old dd-image of that same system on a USB
> disk, and bup-save it under the same refname (with the -d option to mark
> it as an older backup). If we now save the live system again, and for
> this retrieve old bupm files, we might have a mess?

Are you saying you do those multiple saves of different filesystems
without reindexing?

Rob Browning

unread,
Oct 28, 2012, 11:22:08 PM10/28/12
to Oei, Yung-Chin, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
"Oei, Yung-Chin" <yung...@yungchin.nl> writes:

> Or, for a clearer example: someone might back up to the same refname in
> a remote repository from separate sources (I guess that doesn't sound
> advisable, but technically it's allowed - imagine some crazy scheme for
> backing up database mirrors). Each of those sources doesn't even know
> about the other's bupindex, and at any point the refname may always
> refer to a different state than what's in bupindex.

Hmm, sounds like maybe this is another variation on the problem I just
mentioned wrt --strip/graft. Basically, we don't know how to reliably
find the .bupm for a given dir in the "previous" backup, and as your
example demonstrates, we can't even identify the previous backup.

> Given all that, maybe the safest place to keep the old metadata [*]
> would be the bupindex? I mean, if bup-save, when it does save a path,
> would store the metadata it sees at that time in the index, then it can
> be sure that it's ok to reuse that whenever it decides not to update
> that path later, I think.

Offhand, I think that may be right -- if we stored the metadata in the
index, then we could just grab it from there whenever we decided not to
(re)save the current path. We'd be putting all of the metadata into the
index instead of a limited subset.

Of course if we ever need to record really large metadata, this approach
could be awkward -- for example, I believe macos used to put audio clips
and images in the resource fork. (Although that's not something our
current approach would handle well either.)

> Would that be manageable? If I read things correctly, there's already
> a bit of metadata going into the index - would the extra size if you
> stick the full metadata for every path in there, be acceptible?

(Among other things,) I don't know enough about how the index works to
have a reasonable impression right now.

Another solution that was discussed and rejected back when we were
starting the metadata work was to record the metadata for each path in a
separate file:

foo/
foo/.bupm
foo/bar
foo/bar.bupm
foo/baz
foo/baz.bupm

That would also solve the "rebuild the .bupm file" problem, and make
tasks like retrieving VFS metadata easier. Though for it to work, we'd
also have to add the sha1 for the .bupm files to the index entries for
each "real" path. i.e. an index entry for a path would include
(data_sha1, bupm_sha1).

Of course, this approach has its significant drawbacks (which is why it
was rejected), not the least of which is doubling the number of paths in
every repository.

Oei, Yung-Chin

unread,
Oct 29, 2012, 11:57:00 AM10/29/12
to Rob Browning, Damien Robert, Damien Robert, bup-...@googlegroups.com, Damien Robert, Avery Pennarun
On Sun, Oct 28, 2012 at 09:57:19PM -0500, Rob Browning wrote:
> Are you saying you do those multiple saves of different filesystems
> without reindexing?

No I do, but bup-index sees them as different paths altogether, so it's
indeed sort of as-if there was no reindexing inbetween.

On Sun, Oct 28, 2012 at 10:22:08PM -0500, Rob Browning wrote:
> Basically, we don't know how to reliably find the .bupm for a given
> dir in the "previous" backup, and as your example demonstrates, we
> can't even identify the previous backup.

Yes, this really sums it up. I couldn't capture it clearly before (and
after giving it more thought, I think there are tons of less-contrived
problem cases than that one :)).

> We'd be putting all of the metadata into the index instead of a
> limited subset.
>
> Of course if we ever need to record really large metadata, this approach
> could be awkward -- for example, I believe macos used to put audio clips
> and images in the resource fork.

Maybe we could allow a future extension to a "hybrid" approach - a
little bit of what you describe with the individual .bupm files, but not
all the way? So say, if some file requires a lot of metadata space,
we'll set a flag in its metadata entry, that tells us there will also be
a "filename.bupm" file?

> Another solution that was discussed and rejected back when we were
> starting the metadata work was to record the metadata for each path in a
> separate file:

Right, I can see how that could get pretty inefficient, also blowing up
the idx sizes (which I would think a big downside). If anyone else is
looking for the original discussion:
https://groups.google.com/d/topic/bup-list/RXIQpwcQu8o/discussion

Anyway, I've done some very illiterate poking around, looking at typical
sizes of .bupm files and bupindex files here. It looks like I see an
average of ~45-50 bytes per .bupm entry. If that's right, it should be
possible to cache full metadata in the bupindex, uncompressed, with it
not even doubling in size. That sounds acceptable to me.

Thanks, YC

Damien Robert

unread,
Nov 20, 2012, 3:52:44 PM11/20/12
to bup-...@googlegroups.com, Damien Robert
The preceding patch only handled the case of a removed file when trying to
collect the metadata. Collecting the metadata of a dir is done in another
part, so we also need to catch the exception if the dir was removed. These
two patches could probably be squashed together.

Damien Robert (2):
bup save could fail trying to read the metadata of a deleted file
Correcting a metadata bug when directories are removed

cmd/save-cmd.py | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)

--
Patched on top of bup-0.25-rc1-78-g03c25d6

Damien Robert

unread,
Nov 20, 2012, 3:52:45 PM11/20/12
to bup-...@googlegroups.com, Damien Robert
If bup index see that a file foo has no change, bup save still need to read
foo in order to update the metadata. If the user removed foo between the
index and the save, the metadata call metadata.from_path would result in a
non intercepted error. This commit adds a try/except to handle this case.

Signed-off-by: Damien Robert <Damien.Olivi...@gmail.com>
---
cmd/save-cmd.py | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

Damien Robert

unread,
Nov 20, 2012, 3:52:46 PM11/20/12
to bup-...@googlegroups.com, Damien Robert
The preceding commit handled the case when a file was removed but still
present in an index. In this case, bup save tries to read the metadata of
the file, which gives an exception since the file does not exist anymore,
and the exception was not catched.

This commit do the same but for directories rather than files. The code for
directories is not at the same place than for the files, so the preceding
catch did not catch the exception corresponding to a removed directory.

I also added the catch in
if first_root == None
although admittedly it would only produce if the whole structure that bup
save tries to save was gone, so it would not make much sense for the user
to save it...

Signed-off-by: Damien Robert <Damien.Olivi...@gmail.com>
---
cmd/save-cmd.py | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/cmd/save-cmd.py b/cmd/save-cmd.py
index 7afc994..f4913dc 100755
--- a/cmd/save-cmd.py
+++ b/cmd/save-cmd.py
@@ -292,8 +292,12 @@ for (transname,ent) in r.filter(extra, wantrecurse=wantrecurse_during):
if first_root == None:
dir_name, fs_path = dirp[0]
first_root = dirp[0]
- meta = metadata.from_path(fs_path) if fs_path else metadata.Metadata()
- _push(dir_name, meta)
+ try:
+ meta = metadata.from_path(fs_path) if fs_path else metadata.Metadata()
+ _push(dir_name, meta)
+ except (OSError, IOError), e:
+ add_error(e)
+ lastskip_name = dir_name
elif first_root != dirp[0]:
root_collision = True

@@ -304,8 +308,12 @@ for (transname,ent) in r.filter(extra, wantrecurse=wantrecurse_during):
# If switching to a new sub-tree, start a new sub-tree.
for path_component in dirp[len(parts):]:
dir_name, fs_path = path_component
- meta = metadata.from_path(fs_path) if fs_path else metadata.Metadata()
- _push(dir_name, meta)
+ try:
+ meta = metadata.from_path(fs_path) if fs_path else metadata.Metadata()
+ _push(dir_name, meta)
+ except (OSError, IOError), e:
+ add_error(e)
+ lastskip_name = dir_name

if not file:
if len(parts) == 1:

Damien Robert

unread,
Nov 20, 2012, 4:06:26 PM11/20/12
to bup-...@googlegroups.com
Rob Browning wrote in message <87r4otu...@trouble.defaultvalue.org>:
> In general, I do think it's a reasonable choice to treat a missing file
> as a deferred error. At least for my current uses, that would always
> indicate a problem that I'd want to know about.

Yes I agree there should be an error or at least a warning. The question
is: should bup remove the missing file from the index? I believe it should
(this way the next bup save won't have this error), but this is not what my
patch does.

>> Another possibility is that you modify a, bup index a b c, rm c, bup save.
>> Here bup save crashes because c does not exist anymore when xstat tries to
>> read the metadata. The patch corrects this and print a warning.
> Right -- bup certainly shouldn't crash.

I have just sent an updated version, the previous one would only prevent
crash against missing files, not missing directories.

> Right -- offhand, right now, I don't think we have a clean way to ask
> the index to notice that c is gone without reindexing the whole tree.

That's why I don't bother removing c from the index, I only catch the
error, leaving to the user the responsibility of 'refreshing the index'. If
I have time, I'll try to add options like -u and -A to bup index (similar
to git add) to help for that.

Damien Robert

unread,
Nov 20, 2012, 4:10:33 PM11/20/12
to bup-...@googlegroups.com
Rob Browning wrote in message
<87bofn4...@trouble.defaultvalue.org>:
> Damien Robert <damien.olivie...@gmail.com> writes:
> Right now, we rely on the assumption that a change to *any* metadata
> will cause a change to one of the limited bits of metadata that bup
> index actually watches (via Entry.from_stat()). In most cases, I
> imagine that will mean that the mtime changes.
>
> Once that happens, bup index's call to from_stat() will invalidate the
> path, and bup save will eventually store the path's "new" content along
> with the new metadata.

Oh that's great, I hadn't catched that the index already store the mtime.
Well from the discussion it look like the metadata problem I have
mentioned in my cover letter will be pretty hard to handle in a correct
way; and I am afraid I don't know bup enough to help you with that, good
luck with that :)

(Personally the situation while not ideal is fine for me. I don't really
care if the metadata get out of sync from time to time, and now that bup
does not crash anymore on missing file I can always refresh the index if
need be).

Rob Browning

unread,
Nov 21, 2012, 11:41:24 PM11/21/12
to Damien Robert, bup-...@googlegroups.com
Damien Robert <damien.olivie...@gmail.com> writes:

> Oh that's great, I hadn't catched that the index already store the mtime.
> Well from the discussion it look like the metadata problem I have
> mentioned in my cover letter will be pretty hard to handle in a correct
> way; and I am afraid I don't know bup enough to help you with that, good
> luck with that :)
>
> (Personally the situation while not ideal is fine for me. I don't really
> care if the metadata get out of sync from time to time, and now that bup
> does not crash anymore on missing file I can always refresh the index if
> need be).

I've actually been working on adding full-resolution (ns) timestamps to
the index, along with all the metadata.

I think I'm fairly close, but still not quite finished. I hope that
these changes, if acceptable, will fix most of the problems we've been
discussing.

Tim Riemenschneider

unread,
Feb 1, 2013, 12:57:48 PM2/1/13
to bup-...@googlegroups.com
I revisited this issue this week. While the current meta-branch includes metadata for the files in the index, this only removes 1 of 4 places of metadata.from_path this patch(es) guarded. While the case "root dir missing" is more esoteric, the other 2 could be hit more easily.
I rebased/reworked the patch to the current meta-data-branch (and I added a testcase).
Currently that patch sits in my github-account ( https://github.com/timri/bup/commits/tmp/save-no-crash-on-missing ), should I repost the series to the list?

Zoran Zaric

unread,
Feb 2, 2013, 11:19:44 AM2/2/13
to Tim Riemenschneider, bup-...@googlegroups.com
On Fri, Feb 01, 2013 at 09:57:48AM -0800, Tim Riemenschneider wrote:
> I revisited this issue this week.

Thanks for bringing it back up!


> While the current meta-branch includes metadata for the files in the index,
> this only removes 1 of 4 places of metadata.from_path this patch(es) guarded.
> While the case "root dir missing" is more esoteric, the other 2 could be hit
> more easily. I rebased/reworked the patch to the current meta-data-branch
> (and I added a testcase).
> Currently that patch sits in my github-account (
> https://github.com/timri/bup/commits/tmp/save-no-crash-on-missing ), should I
> repost the series to the list?

Since the patches sit ontop of tmp/pending/meta and we're progressing pretty
well with mergning into master, I'd suggest you post the patches once we merged
tmp/pending/meta. Is this ok for you?

Thanks,
Zoran

Damien Robert

unread,
Feb 3, 2013, 7:30:52 AM2/3/13
to bup-...@googlegroups.com
Zoran Zaric wrote in message
<20130202161...@trunks.SSG5-Serial>:
> Since the patches sit ontop of tmp/pending/meta and we're progressing pretty
> well with mergning into master, I'd suggest you post the patches once we merged
> tmp/pending/meta. Is this ok for you?

If anybody has time, it would be nice if these patches, when they detect
that a file is missing, remove the file from the index instead of just
outputting a warning (currently, the missing files stay in the index so we
get a warning each time).

Otherwise I'll try to do it, but I am afraid it won't be before the merge
of tmp/pending/meta since I am quite busy right now :) (In this case the
current patches are better than nothing since they allow bup to operates
rather than crash).

Cheers!
Damien

Reply all
Reply to author
Forward
0 new messages