Metadata support in bup-fuse

Gabriele Santilli

Jul 11, 2012, 6:20:21 AM
to bup-list
I don't know any python, but maybe I can try doing this if someone's
willing to help me.

What would it take to add metadata support to bup-fuse? Where should I
look? Any pointers and suggestions?

Oei, YC

Jul 11, 2012, 8:55:47 AM
to Gabriele Santilli, bup-list, Rob Browning
This would be brilliant to have - I'd very much like to be able to eg.
verify backups without the need for a restore (nor the space for
that).

On 11 July 2012 11:20, Gabriele Santilli <santilli...@gmail.com> wrote:
> I don't know any python, but maybe I can try doing this if someone's
> willing to help me.

I'm terrible in python despite working with it every day, but would be
keen to work on this.

For the fuse side of things, I think it's "just" a matter of defining
some more methods in BupFs, for instance getxattr(). Also some things
currently have a dummy-implementation, eg. if you follow st_nlink it's
eventually just a "return 1". I figure the tricky bit is in the
metadata "backend" side of things, where you'd have to consult the
.bupm files all the time (and should probably do some sort of
caching).
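
A minimal sketch of what the fuse side could look like, written in Python 3 and deliberately independent of any FUSE binding. `FakeMetadata`, `fill_stat()` and this `getxattr()` are hypothetical names for illustration, not bup's actual classes; returning a negative errno follows the convention of some Python FUSE bindings and is an assumption here.

```python
import errno
import stat
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for one decoded .bupm entry."""
    mode: int
    uid: int
    gid: int
    mtime: int
    xattrs: Dict[str, bytes] = field(default_factory=dict)


def fill_stat(meta: Optional[FakeMetadata], is_dir: bool) -> dict:
    """Build the attribute dict a getattr() handler could return."""
    # Dummy defaults, used when no metadata was saved for the item.
    st = {
        "st_mode": (stat.S_IFDIR | 0o755) if is_dir else (stat.S_IFREG | 0o444),
        "st_nlink": 1,
        "st_uid": 0,
        "st_gid": 0,
        "st_mtime": 0,
    }
    if meta is not None:
        # Real values replace the dummies when metadata is available.
        st.update(st_mode=meta.mode, st_uid=meta.uid,
                  st_gid=meta.gid, st_mtime=meta.mtime)
    return st


def getxattr(meta: Optional[FakeMetadata], name: str):
    """Return the attribute value, or a negative errno if it is missing."""
    if meta is None or name not in meta.xattrs:
        return -errno.ENODATA
    return meta.xattrs[name]
```

The point of the split is that the fuse methods stay thin: everything interesting happens in whatever produces the metadata object.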

I'm optimistic, mostly out of ignorance. Probably Rob's the best
person to say if this has any chance of success?

YC

Gabriele Santilli

Jul 11, 2012, 9:19:15 AM
to Oei, YC, bup-list, Rob Browning
On Wed, Jul 11, 2012 at 2:55 PM, Oei, YC <oei.yu...@gmail.com> wrote:

> For the fuse side of things, I think it's "just" a matter of defining
> some more methods in BupFs, for instance getxattr().

That was my hope...

> Also some things
> currently have a dummy-implementation, eg. if you follow st_nlink it's
> eventually just a "return 1". I figure the tricky bit is in the
> metadata "backend" side of things, where you'd have to consult the
> .bupm files all the time (and should probably do some sort of
> caching).

Right, and here is where I don't know how much work needs to be done.
I've been holding off attempting this for a long time, because I don't
really have free time and I don't know python. But... having this
feature might save me quite a bit of time (if my assumption about
something else is correct).

So my plan, I guess, is that I'll be throwing bad code to the list in
the hope that someone will review and fix it. :/

Gabriel Filion

Jul 11, 2012, 12:40:08 PM
to Gabriele Santilli, Oei, YC, bup-list, Rob Browning
Hello,

On 12-07-11 09:19 AM, Gabriele Santilli wrote:
> On Wed, Jul 11, 2012 at 2:55 PM, Oei, YC <oei.yu...@gmail.com> wrote:
>
>> For the fuse side of things, I think it's "just" a matter of defining
>> some more methods in BupFs, for instance getxattr().
>
> That was my hope...

I started work last week (before being buried by a ton and a half of
work) on "bup ls -l", which would be exposing permissions and other
stuff that ls -l exposes.

I hit the point where I need to expose metadata in vfs.py ... and that
would make exposing them in fuse trivial! :)

so if we have this last key part, we can easily expose metadata in both
fuse and ls -l.

we'd probably have to mirror what bup-restore is doing to fetch metadata
for a single file and move that inside vfs.py to expose metadata in Node
objects.

>> Also some things
>> currently have a dummy-implementation, eg. if you follow st_nlink it's
>> eventually just a "return 1". I figure the tricky bit is in the
>> metadata "backend" side of things, where you'd have to consult the
>> .bupm files all the time (and should probably do some sort of
>> caching).
>
> Right, and here is where I don't know how much work needs to be done.
> I've been holding off attempting this for a long time, because I don't
> really have free time and I don't know python. But... having this
> feature might save me quite a bit of time (if my assumption about
> something else is correct).

caching can be implemented in a second phase (it will probably help a
bunch). but we can start by exposing metadata, however slow that might
prove to be, and then add caching to that to optimize performance.

--
Gabriel Filion

Gabriele Santilli

Jul 11, 2012, 12:47:35 PM
to Gabriel Filion, Oei, YC, bup-list, Rob Browning
On Wed, Jul 11, 2012 at 6:40 PM, Gabriel Filion <lel...@gmail.com> wrote:

> I hit the point where I need to expose metadata in vfs.py ... and that
> would make exposing them in fuse trivial! :)

Cool, let us know how things proceed then. :)

> caching can be implemented in a second phase (it will probably help a
> bunch). but we can start by exposing metadata, however slow that might
> prove to be, and then add caching to that to optimize performance.

Agreed.

Gabriel Filion

Jul 11, 2012, 1:01:28 PM
to Gabriele Santilli, Oei, YC, bup-list, Rob Browning
On 12-07-11 12:47 PM, Gabriele Santilli wrote:
>> I hit the point where I need to expose metadata in vfs.py ... and that
>> would make exposing them in fuse trivial! :)
> Cool, let us know how things proceed then. :)

I hope to try my luck at it soon, but I can't promise when: I've had
multiple power failures in datacenters so I'm going to be moving things
out of there for some time now... bleh

I'll keep you posted. If HOPE 9 gets boring at some point, maybe I'll
have some time this weekend ;)

(p.s.: if some of you guys are going to HOPE, maybe we can arrange [with
private mails] a meeting there ^-^ -- sorry for this _totally_ off topic
message)

--
Gabriel Filion

Gabriele Santilli

Jul 11, 2012, 1:04:18 PM
to Gabriel Filion, Oei, YC, bup-list, Rob Browning
On Wed, Jul 11, 2012 at 7:01 PM, Gabriel Filion <lel...@gmail.com> wrote:

> I hope to try my luck at it soon, but I can't promise when: I've had
> multiple power failures in datacenters so I'm going to be moving things
> out of there for some time now... bleh

No worries, it would probably take me more time anyway...

Rob Browning

Jul 11, 2012, 5:07:34 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Gabriel Filion <lel...@gmail.com> writes:

> caching can be implemented in a second phase (it will probably help a
> bunch). but we can start by exposing metadata, however slow that might
> prove to be, and then add caching to that to optimize performance.

I haven't thought about it carefully yet, but one very simple approach
could be to just change vfs.py so that we have a Node._metadata, rather
than a Dir._metadata, and so that Node._metadata stores the actual
Metadata object for *that* Node, rather than a File object for the Dir's
.bupm.

Then we could add a Dir.metadata() like this:

    def metadata(self):
        if self._subs is None:
            self._mksubs()
        return self._metadata

and have Dir's _mksubs() populate _metadata for itself and all of its
immediate children. Non-dirs would just return self._metadata.
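
To make the proposal concrete, here is a small self-contained sketch in Python 3. It is not the real vfs.py: the `entries` list is a hypothetical stand-in for the decoded contents of a directory's .bupm (assumed here to list the directory itself first, then its children), and plain dicts stand in for Metadata objects.

```python
class Node:
    """Any item in the tree; holds the metadata for *this* item only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self._metadata = None

    def metadata(self):
        # Non-dirs just return what the parent's _mksubs() stored.
        return self._metadata


class Dir(Node):
    def __init__(self, name, entries, parent=None):
        super().__init__(name, parent)
        # Hypothetical: (name, metadata) pairs, directory's own entry first.
        self._entries = entries
        self._subs = None

    def _mksubs(self):
        """Populate metadata for this dir and all immediate children."""
        self._subs = {}
        records = iter(self._entries)
        _, self._metadata = next(records)
        for name, meta in records:
            child = Node(name, parent=self)
            child._metadata = meta
            self._subs[name] = child

    def metadata(self):
        if self._subs is None:
            self._mksubs()
        return self._metadata

    def sub(self, name):
        if self._subs is None:
            self._mksubs()
        return self._subs[name]
```

Note that the first call to `metadata()` or `sub()` creates one object per child, which is exactly the cost that makes this questionable as a default for very large directories.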

Offhand, I think this should be fairly easy to implement, and if we
like, I could probably handle it immediately, but I suppose it might be
too expensive as a default.

Thoughts?
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

Jul 12, 2012, 1:39:19 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Rob Browning <r...@defaultvalue.org> writes:

> Offhand, I think this should be fairly easy to implement, and if we
> like, I could probably handle it immediately, but I suppose it might be
> too expensive as a default.

...and as a bit of example, imagine the case where a directory has 100k+
files (i.e. debian-devel maildir). Calling metadata() for that
directory or any file in it, or calling _mksubs() indirectly, would
result in the immediate creation of 100k+ Metadata objects.

Gabriel Filion

Jul 12, 2012, 1:44:16 PM
to Rob Browning, Gabriele Santilli, Oei, YC, bup-list
On 12-07-11 05:07 PM, Rob Browning wrote:
> I haven't thought about it carefully yet, but one very simple approach
> could be to just change vfs.py so that we have a Node._metadata, rather
> than a Dir._metadata, and so that Node._metadata stores the actual
> Metadata object for *that* Node, rather than a File object for the Dir's
> .bupm.
>
> Then we could add a Dir.metadata() like this:
>
>     def metadata(self):
>         if self._subs is None:
>             self._mksubs()
>         return self._metadata

that function is already in the Dir class (named metadata_file() ). So
we only need to implement something that exposes metadata in File and
Symlink.
Then we can retrieve the metadata of each object when iterating
over them in "front-end" commands like bup ls, bup fuse, bup web.

--
Gabriel Filion

Rob Browning

Jul 12, 2012, 3:31:35 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Gabriel Filion <lel...@gmail.com> writes:

> that function is already in the Dir class (named metadata_file() ). So
> we only need to implement something that exposes metadata in File and
> Symlink.

Right, but metadata_file() is the file containing the entire set of
metadata entries (in encoded format) for all of the items in the
directory, including the directory itself. If that's all you need, (and
plan to linearly search it for a particular entry, or will normally just
be iterating over them all -- which is efficient) then that's fine, but
I was assuming you might need to be able to retrieve the metadata for
each individual file/dir/symlink quickly.

Gabriel Filion

Jul 14, 2012, 4:16:28 PM
to Rob Browning, Gabriele Santilli, Oei, YC, bup-list
On 12-07-12 03:31 PM, Rob Browning wrote:
> Gabriel Filion <lel...@gmail.com> writes:
>> that function is already in the Dir class (named metadata_file() ). So
>> we only need to implement something that exposes metadata in File and
>> Symlink.
>
> Right, but metadata_file() is the file containing the entire set of
> metadata entries (in encoded format) for all of the items in the
> directory, including the directory itself. If that's all you need, (and
> plan to linearly search it for a particular entry, or will normally just
> be iterating over them all -- which is efficient) then that's fine, but
> I was assuming you might need to be able to retrieve the metadata for
> each individual file/dir/symlink quickly.

right, I'd want that. but for now the .bupm files don't have an index,
so we need to iterate over it to find metadata for one particular file,
don't we?

we'd need to either make the code cache metadata entries, or add an
index to the .bupm file.

the former sounds like it can suffer on performance and memory footprint
in cases where directories contain a lot of files.

--
Gabriel Filion

Rob Browning

Jul 14, 2012, 4:38:48 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Gabriel Filion <lel...@gmail.com> writes:

> right, I'd want that. but for now the .bupm files don't have an index,
> so we need to iterate over it to find metadata for one particular file,
> don't we?

Right.

> we'd need to either make the code cache metadata entries, or add an
> index to the .bupm file.

Right -- the former was what I was initially talking about. I think I
can add that easily (and probably fairly immediately), but I wasn't sure
if it was reasonable.

> the former sounds like it can suffer on performance and memory footprint
> in cases where directories contain a lot of files.

Exactly.

Though now that I think about it -- I wonder if we could just implement
the dumb caching approach for now, and add indexing (or something else
smarter) when we get a chance.

For example, if the fs_item.metadata() API seems reasonable, then for
the moment, I could just implement it via (potentially expensive)
on-demand caching, and later, we could be smarter.

Come to think of it, if we do end up with indexes, I suppose as a first
pass, they could also be generated on-demand, the first time we retrieve
an object from that .bupm.
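
As a rough illustration of the on-demand idea, the Python 3 sketch below builds a name-to-offset index during the first lookup and reuses it afterwards. The record format (one `name<TAB>uid` line per entry) is invented purely for this example and is not the real .bupm encoding; `LazyIndex` is a hypothetical name.

```python
import io


class LazyIndex:
    """Index a metadata stream the first time any entry is requested."""

    def __init__(self, stream):
        self._stream = stream
        self._offsets = None  # built lazily, on the first lookup

    def _build(self):
        # One linear pass over the stream, recording where each record starts.
        self._offsets = {}
        self._stream.seek(0)
        while True:
            pos = self._stream.tell()
            line = self._stream.readline()
            if not line:
                break
            name = line.split(b"\t", 1)[0]
            self._offsets[name] = pos

    def lookup(self, name):
        if self._offsets is None:
            self._build()
        # Later lookups seek straight to the record instead of rescanning.
        self._stream.seek(self._offsets[name])
        _, uid = self._stream.readline().rstrip(b"\n").split(b"\t")
        return {"uid": int(uid)}
```

Compared with caching every decoded entry, this keeps only one small offset per name in memory, at the price of decoding a record again on each lookup.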

Thoughts?

Gabriel Filion

Jul 14, 2012, 4:50:12 PM
to Rob Browning, Gabriele Santilli, Oei, YC, bup-list
On 12-07-14 04:38 PM, Rob Browning wrote:
> Though now that I think about it -- I wonder if we could just implement
> the dumb caching approach for now, and add indexing (or something else
> smarter) when we get a chance.
>
> For example, if the fs_item.metadata() API seems reasonable, then for
> the moment, I could just implement it via (potentially expensive)
> on-demand caching, and later, we could be smarter.
>
> Come to think of it, if we do end up with indexes, I suppose as a first
> pass, they could also be generated on-demand, the first time we retrieve
> an object from that .bupm.

sounds like a reasonable plan. we'll have at least something to use for
exposing data in the front-ends and we can optimize later, since adding
an index means more thinking and more work.

--
Gabriel Filion

Rob Browning

Jul 14, 2012, 5:00:58 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Gabriel Filion <lel...@gmail.com> writes:

> sounds like a reasonable plan. we'll have at least something to use for
> exposing data in the front-ends and we can optimize later, since adding
> an index means more thinking and more work.

OK, well, the newer metadata bits are still in their own branch for now
anyway, so I think I'll implement the simple approach I initially
mentioned, and push it so those interested can try it out. We can
always rework whatever's broken.

I've also nearly finished some preliminary support for "save [[--ignore
<pattern>] ...]" which works more or less like top-level gitignore
patterns. I'll probably post that to the list soon, so people can take
a look. At the moment, it requires a system-level (i.e. non-python)
fnmatch().

Thanks

Rob Browning

Jul 16, 2012, 10:28:48 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Gabriel Filion <lel...@gmail.com> writes:

> sounds like a reasonable plan. we'll have at least something to use for
> exposing data in the front-ends and we can optimize later, since adding
> an index means more thinking and more work.

OK, I got it working at debconf. I'll probably push a patch for
evaluation in the next week. The interface is just meta =
vfs_obj.metadata(), and then you can call meta.uid, etc.
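
For a front-end such as "bup ls -l", using that interface might look roughly like the Python 3 sketch below. Only `metadata()` and `meta.uid` come from the description above; the `mode` and `gid` attributes, the possibility of `None` for items saved without metadata, and the `format_long()` helper are assumptions made for illustration.

```python
import stat
from types import SimpleNamespace


def format_long(name, meta):
    """Render one ls -l style line from a metadata object."""
    if meta is None:
        # Assumed case: the item was saved without metadata.
        return "?????????? ?/? " + name
    perms = stat.filemode(meta.mode)  # e.g. '-rw-r--r--'
    return "%s %d/%d %s" % (perms, meta.uid, meta.gid, name)


# Stand-in for what vfs_obj.metadata() might return.
meta = SimpleNamespace(mode=stat.S_IFREG | 0o644, uid=1000, gid=100)
print(format_long("notes.txt", meta))
```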

Gabriel Filion

Jul 17, 2012, 7:25:12 PM
to Rob Browning, Gabriele Santilli, Oei, YC, bup-list
On 12-07-16 10:28 PM, Rob Browning wrote:
> Gabriel Filion <lel...@gmail.com> writes:
>> sounds like a reasonable plan. we'll have at least something to use for
>> exposing data in the front-ends and we can optimize later, since adding
>> an index means more thinking and more work.
>
> OK, I got it working at debconf. I'll probably push a patch for
> evaluation in the next week. The interface is just meta =
> vfs_obj.metadata(), and then you can call meta.uid, etc.

very nice, I'll be checking it out when you can send it.

I couldn't work on anything useful during hope. but once you send your
patch, I'll make sure to try and allocate some time to test it out and
to base my work on it to finish bup ls -l

--
Gabriel Filion

Rob Browning

Jul 21, 2012, 5:56:27 PM
to Gabriel Filion, Gabriele Santilli, Oei, YC, bup-list
Gabriel Filion <lel...@gmail.com> writes:

> very nice, I'll be checking it out when you can send it.
>
> I couldn't work on anything useful during hope. but once you send your
> patch, I'll make sure to try and allocate some time to test it out and
> to base my work on it to finish bup ls -l

OK, I've pushed node_dir_file_or_symlink.metadata() to tmp/pending/meta
here:

http://git.debian.org/?p=users/rlb/bup.git
clone URLs:
git://anonscm.debian.org/users/rlb/bup.git
git+ssh://git.debian.org/git/users/rlb/bup.git
http://anonscm.debian.org/git/users/rlb/bup.git

Consider it an initial attempt, and let me know if you have any
trouble.

I've also finished (but not included) initial support for
a gitignore(5)-style "bup index --ignore ..." facility:

bup index --ignore '/mnt/*' --ignore '/proc/*' --ignore '*.tmp' ...

The behavior is intended to be mostly identical to gitignore,
except that a trailing "/*" will ignore the contents of a directory, but
not the directory itself. The current matcher probably needs help -- it
just splits complex paths on '/' and otherwise relies on python's
fnmatch.
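
A rough approximation of such a matcher, in Python 3, is sketched below. It is not the actual patch: `is_ignored()` is a hypothetical helper, and the rules it applies (slash-free patterns match any path component, patterns containing '/' are anchored at the root and compared component by component) are assumptions based on the description above.

```python
from fnmatch import fnmatchcase


def is_ignored(path, patterns):
    """Return True if path matches any of the ignore patterns."""
    parts = path.strip("/").split("/")
    for pattern in patterns:
        if "/" not in pattern:
            # Simple pattern: test every component, so anything beneath
            # an ignored directory is ignored as well.
            if any(fnmatchcase(part, pattern) for part in parts):
                return True
            continue
        # Complex pattern: split on '/' and match component by component.
        pat_parts = pattern.strip("/").split("/")
        if len(parts) >= len(pat_parts) and all(
            fnmatchcase(part, pat) for part, pat in zip(parts, pat_parts)
        ):
            return True
    return False


patterns = ["/mnt/*", "/proc/*", "*.tmp"]
print(is_ignored("/mnt", patterns))      # directory itself is kept
print(is_ignored("/mnt/usb", patterns))  # its contents are ignored
```

Because '/mnt/*' has two components, the one-component path '/mnt' can never match it, which gives the "contents but not the directory itself" behaviour without special-casing the trailing "/*".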

I'll post separately about that soon (and send the patch to the list,
since it's not metadata-specific).

I think that may be the last thing I really needed (though "bup gc"
would be nice), before starting to test with real backups here.

Thanks