FUSE reload

Stefan Buller

Nov 6, 2012, 12:38:08 PM

to bup-...@googlegroups.com

bup fuse currently ignores new commits once it's started. I'm serving up my bup fuse with samba, and as a result need to manually restart fuse whenever I make a commit. This is complicated by processes holding the fs open. I'm planning to deal with this with a 'fuser -mk $MOUNTPOINT'. I'm not terribly worried about data loss here, but there is a race condition.

There are a few solutions that could help here:

1. Have bup fuse watch for new commits, reloading when necessary

2. Have bup fuse register itself in the BUP_DIR, and have bup save force a reload

3. Provide a means (-HUP) to manually force a reload.

All of this requires code for reloading or attaching new commits in the fuse layer of bup. I, unfortunately, don't feel prepared to tackle this myself. I feel that this is beyond my understanding of the code, and am working on other projects at the moment.

--
Stefan Buller

Gabriel Filion

Dec 28, 2012, 1:45:12 PM

to Stefan Buller, bup-...@googlegroups.com

zoran and I just tried to bite the bullet and implement some kind of
proof of concept for this... and it was harder than we first thought. I
actually decided to abandon the idea and review stuff from rob to make
my time useful. but here's a report of what we saw/thought about:

we thought about implementing this by having the bup-fuse process set up
an inotify watch on $BUP_DIR/refs and then make it poll inotify and
update things.

the polling / updating objects looks like it can only be achieved from
inside methods of the fuse object (we thought about injecting it into
readdir() ). so things will only get updated if there is interaction with
the fuse mount (possibly making things unresponsive after a long moment
of inactivity)

while thinking about how to place things we thought about a corner case
that's gonna need discussion here:

suppose you've launched a script for processing and/or copying stuff
from a branch's "latest" symlink. then a new backup gets pushed while
this is happening. the latest link gets updated to point to the latest
commit, as expected of it, however your command might start working on
stuff from a different commit, which is very not expected.

so we're wondering if there could be a way to work around this issue, or
if we should remove the "latest" symlink for adding this feature so that
we don't expose a "dangerous" link.

--
Gabriel Filion

Avery Pennarun

Dec 28, 2012, 3:58:26 PM

to Gabriel Filion, Stefan Buller, bup-...@googlegroups.com

On Fri, Dec 28, 2012 at 1:45 PM, Gabriel Filion <lel...@gmail.com> wrote:
> On 11/06/2012 12:38 PM, Stefan Buller wrote:
>> All of this requires code for reloading or attaching new commits in the
>> fuse layer of bup. I, unfortunately, don't feel prepared to tackle this
>> myself. I feel that this is beyond my understanding of the code, and am
>> working on other projects at the moment.
>
> zoran and I just tried to bite the bullet and implement some kind of
> proof of concept for this... and it was harder than we first thought. I
> actually decided to abandon the idea and review stuff from rob to make
> my time useful. but here's a report of what we saw/thought about:
>
> we thought about implementing this by having the bup-fuse process set up
> an inotify watch on $BUP_DIR/refs and then make it poll inotify and
> update things.
>
> the polling / updating objects looks like it can only be achieved from
> inside methods of the fuse object (we thought about injecting it into
> readdir() ). so things will only get updated if there is interaction with
> the fuse mount (possibly making things unresponsive after a long moment
> of inactivity)

This should actually not be so hard. I apologize for being lazy and
not implementing this the first time around :)

You shouldn't need any tricks like inotify, because you can cheat
instead. The important things to note are:
- commits, trees, and objects never disappear from the repo once
  they're there (other than Zoran's repacking patches, but we can treat
  repacking as a special case later)
- even if an object wasn't present when you supplied the readdir()
  contents, people can still read it if they know the exact name
- therefore only the list of branches and the list of contents for
  each branch will ever change.

Thus, the easy way to implement this would be to just update the list
of objects *every* time someone does readdir() on the list of refs or
the contents of a given ref. Something like this:
- move list of child objects to a tmpdict
- for each object that now exists:
    - if object existed before: move it from the tmpdict
    - else: create new object
- delete any objects (trees) now in the tmpdict
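
The refresh steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not bup's actual code: the Node class, the refresh_children() helper, and the make_child parameter are hypothetical names, and current_names stands in for whatever listing the repository returns at readdir() time.

```python
class Node:
    """Hypothetical stand-in for a VFS node; not bup's real class."""

    def __init__(self, name):
        self.name = name
        self.children = {}


def refresh_children(parent, current_names, make_child=Node):
    """Rebuild parent.children from current_names, reusing old objects.

    Returns the sorted names of children that no longer exist.
    """
    # Move the existing children into a temporary dict.
    tmpdict = parent.children
    parent.children = {}

    for name in current_names:
        if name in tmpdict:
            # Object existed before: move it back out of the tmpdict.
            parent.children[name] = tmpdict.pop(name)
        else:
            # Object is new: create it.
            parent.children[name] = make_child(name)

    # Anything still in the tmpdict has vanished from the listing.
    removed = sorted(tmpdict)
    tmpdict.clear()  # drop the last strong references to stale nodes
    return removed
```

Calling refresh_children() from the readdir() handler for the ref list (and for each ref's contents) would keep already-known nodes, and anything cached under them, while picking up new entries.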

To prevent memory leaks, make sure that the tree structure is held
correctly: child objects should either not have references to parent
objects, or if they do, they should be made using the python weakref
module. Parent objects just hold *normal* (non-weak) references to
child objects. I can't remember if this code is correct right now
(and since objects currently never disappear because the tree never
refreshes, we wouldn't suffer even if there was a bug in this
respect).
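
As a rough sketch of the reference layout described here (strong references from parent to child, weak references from child to parent), using Python's standard weakref module. The TreeNode class and its attribute names are hypothetical and are not taken from bup's source.

```python
import weakref


class TreeNode:
    """Hypothetical node type; names here are illustrative only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = {}  # normal (strong) references to children
        # Weak reference upward, so a child never keeps its parent alive.
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children[name] = self

    @property
    def parent(self):
        """Return the parent node, or None if it is gone (or never existed)."""
        return self._parent() if self._parent is not None else None
```

With this layout, dropping a subtree from its parent's children dict is enough for the subtree to be freed, because nothing below it holds a strong reference back up the tree.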

> while thinking about how to place things we thought about a corner case
> that's gonna need discussion here:
>
> suppose you've launched a script for processing and/or copying stuff
> from a branch's "latest" symlink. then a new backup gets pushed while
> this is happening. the latest link gets updated to point to the latest
> commit, as expected of it, however your command might start working on
> stuff from a different commit, which is very not expected.
>
> so we're wondering if there could be a way to work around this issue, or
> if we should remove the "latest" symlink for adding this feature so that
> we don't expose a "dangerous" link.

symlink-to-latest is a standard Unix technique and bup is doing it
correctly. It's perfectly okay for bup to change that link whenever
it wants, since that's the point of the link.

However, if you want to avoid the race condition you're talking about,
the client program *using* the link needs to be written carefully,
using one of these tricks:

a) chdir() to the 'latest' directory (or one of its children) before
accessing any of the contents. When the symlink changes, your cwd
will still point at the old (symlink-resolved) location so you'll keep
getting the old set of files.

b) open(filename, O_DIRECTORY) the 'latest' directory or a dir inside
it, and then use openat() (and similar whateverat() functions) to
access files using relative paths.

c) readlink() the 'latest' link to find out where it points, then
explicitly use that as the base path.

I recommend (a) unless you have to use this trick on two symlinks
simultaneously, at which point you have to use (b) since you can only
chdir() to one place at a time. So far I've never needed two symlinks
at once. I don't like (c) because it's inelegant and makes
assumptions about the directory structure (ie. that 'latest' is always
a symlink, and the only symlink) where (a) works in all cases and is
very easy even from a shell script.
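
For illustration, trick (b) might look like this in Python, where os.open() with the dir_fd argument uses openat() underneath. This is a sketch under stated assumptions: it runs against a throwaway temporary directory rather than a real bup fuse mount, it assumes a Linux-like platform with O_DIRECTORY and symlink support, and the names open_pinned(), read_relative(), demo(), snap1, snap2, and data.txt are all made up for the example.

```python
import os
import tempfile


def open_pinned(latest_path):
    """Resolve the symlink once and return a directory fd for its target."""
    return os.open(latest_path, os.O_RDONLY | os.O_DIRECTORY)


def read_relative(dir_fd, relpath):
    """Read a file by a path relative to dir_fd (openat() semantics)."""
    fd = os.open(relpath, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, 'rb') as f:
        return f.read()


def demo():
    """Show that a pinned directory fd ignores later changes to 'latest'."""
    with tempfile.TemporaryDirectory() as root:
        # Two fake snapshots, each holding a file that names its snapshot.
        for snap in ('snap1', 'snap2'):
            os.mkdir(os.path.join(root, snap))
            with open(os.path.join(root, snap, 'data.txt'), 'w') as f:
                f.write(snap)

        latest = os.path.join(root, 'latest')
        os.symlink('snap1', latest)

        dir_fd = open_pinned(latest)  # pinned to snap1 from here on
        try:
            # Simulate a new backup: atomically repoint 'latest' at snap2.
            tmp = latest + '.tmp'
            os.symlink('snap2', tmp)
            os.replace(tmp, latest)

            pinned = read_relative(dir_fd, 'data.txt')
            with open(os.path.join(latest, 'data.txt'), 'rb') as f:
                via_link = f.read()
        finally:
            os.close(dir_fd)
        return pinned, via_link
```

Here demo() returns the contents seen through the pinned descriptor (still the old snapshot) alongside the contents seen by re-resolving the link (the new one). Trick (a) has the same effect with less code: os.chdir() into 'latest' once, then use relative paths.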

Have fun,

Avery

Gabriel Filion

Dec 29, 2012, 9:19:04 AM

to Avery Pennarun, Stefan Buller, bup-...@googlegroups.com

you're totally right. I continued to work together with posativ (from
IRC) yesterday and we reached the same conclusion with some kind of
proof of concept patch: by not passing in a pre-built RefList object and
instead re-instantiating it every time, and also by excluding the
".commit" and branch directories from cache, we can make changes appear
via fuse.

> Something like this:
> - move list of child objects to a tmpdict
> - for each object that now exists:
>     - if object existed before: move it from the tmpdict
>     - else: create new object
> - delete any objects (trees) now in the tmpdict

I'm not sure I get what this relates to exactly (or I haven't slept
enough in the last two days..) could you give some very crude example of
what you mean?

the concern was not really with how to build such a tool but whether
existing tools or scripts could possibly read unexpected data (like
switching to a newer commit while doing cp -a, or having a file suddenly
disappear while doing something with a for loop in a bash script)

of course, documenting that reading from the "latest" link is
potentially unpredictable is a possibility.

--
Gabriel Filion

Simon Sapin

Dec 29, 2012, 10:22:00 AM

to bup-...@googlegroups.com

On 29/12/2012 15:19, Gabriel Filion wrote:

> the concern was not really with how to build such a tool but whether
> existing tools or scripts could possibly read unexpected data (like
> switching to a newer commit while doing cp -a, or having a file suddenly
> disappear while doing something with a for loop in a bash script)
>
> of course, documenting that reading from the "latest" link is
> potentially unpredictable is a possibility.

Wouldn’t such scripts have the same issues with "normal" symlinks anyway?

--
Simon Sapin

Gabriel Filion

Dec 29, 2012, 10:27:02 AM

to Simon Sapin, bup-...@googlegroups.com

On 12/29/2012 10:22 AM, Simon Sapin wrote:

> On 29/12/2012 15:19, Gabriel Filion wrote:
>> the concern was not really with how to build such a tool but whether
>> existing tools or scripts could possibly read unexpected data (like
>> switching to a newer commit while doing cp -a, or having a file suddenly
>> disappear while doing something with a for loop in a bash script)
>>
>> of course, documenting that reading from the "latest" link is
>> potentially unpredictable is a possibility.
>

> Wouldn’t such scripts have the same issues with "normal" symlinks anyway?

if your "normal" symlinks are updated automatically by some external
process while you work on them, yes :)

I was just wondering how much users would get bitten by some weird file
disappearances or other such quirks when using the "latest" link.

--
Gabriel Filion

Stefan Buller

Jan 23, 2013, 4:35:04 PM

to Gabriel Filion, Avery Pennarun, bup-...@googlegroups.com

Hey,

Thanks for tackling this problem. I finally got some time to take a look at the proposed solution. I applied a couple patches by Martin Zimmermann:

> [PATCH 2/2] FUSE shows new commits

> [PATCH] only calculate ref list when not in cache

These seem to be related to the discussion here (let me know if I got that wrong, or missed anything important)...

Unfortunately with these patches (or indeed with only the first), performance is unacceptably poor. Reading my 'jobs' directory directly under the fuse mountpoint takes between 1 and 2 minutes - every time. Reading subdirectories of this varies (possibly based on how much they hold), running between 1 and 30 seconds, but generally dropping to about 50% after the first read for any particular directory.

These numbers contrast with 'a handful of seconds' (3 this time, which I suspect is on the low side) for an initial read, and a fraction of a second for any subsequent read - no matter the directory.

My impression is that the cache is quite important for performance.

--
Stefan Buller

Avery Pennarun

Jan 23, 2013, 6:16:28 PM

to Stefan Buller, Gabriel Filion, bup-...@googlegroups.com

On Wed, Jan 23, 2013 at 4:35 PM, Stefan Buller <stefan...@gmail.com> wrote:
> Unfortunately with these patches (or indeed with only the first),
> performance is unacceptably poor. Reading my 'jobs' directory directly under
> the fuse mountpoint takes between 1 and 2 minutes - every time. Reading
> subdirectories of this varies (possibly based on how much they hold),
> running between 1 and 30 seconds, but generally dropping to about 50% after
> the first read for any particular directory.
>
> These numbers contrast with 'a handful of seconds' (3 this time, which I
> suspect is on the low side) for an initial read, and a fraction of a second
> for any subsequent read - no matter the directory.
>
> My impression is that the cache is quite important for performance.

[Disclaimer: I haven't actually read the patches.]

The trick is to cache exactly the right stuff, and not cache the other
stuff. If you can read the toplevel directory in 3 seconds the first
time in the old version, then you should be able to do it in the new
version, because (the first time you read it) it wasn't even cached in
the old version.

Reading the contents of subdirectories should be equally fast in the
new version as in the old version, since they can be cached forever,
because they go via git object ids, which never change.

So if there's slowness, it's probably an implementation detail rather
than a fatal flaw. Not that that solves your problem :)
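
A minimal sketch of the caching split described above: the ref list is re-read on every request, while tree listings are cached forever under their object id. The TreeCache class and the read_refs / read_tree callables are hypothetical placeholders for whatever actually reads the repository. They are not bup APIs, and this is not what the patches under discussion do.

```python
class TreeCache:
    """Cache tree listings by immutable object id; never cache the ref list."""

    def __init__(self, read_refs, read_tree):
        # Both callables are assumptions made for this sketch:
        #   read_refs() -> dict mapping branch name to object id
        #   read_tree(object_id) -> listing for that tree
        self._read_refs = read_refs
        self._read_tree = read_tree
        self._trees = {}

    def list_refs(self):
        # Branch tips can move at any time, so always ask the repository.
        return self._read_refs()

    def list_tree(self, object_id):
        # An object id always names the same content, so the cached
        # listing never needs to be invalidated.
        if object_id not in self._trees:
            self._trees[object_id] = self._read_tree(object_id)
        return self._trees[object_id]
```

Under this split, a new commit costs one fresh read of the ref list plus reads of any trees not seen before. Subdirectories shared with older commits are served from the cache.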

Have fun,

Avery
