Last night I thought about how to switch to bup.
I think it would be nice if one could import existing backups into bup.
There are two options:
1) Time travel
Could be time consuming and expensive to work that out.
2) Setting the date for import-commits and manipulating the file-paths
a) Setting the date for import-commits
As you can read in the git-commit-tree manpage, it supports the
environment variables GIT_AUTHOR_DATE and GIT_COMMITTER_DATE.
I tested it with a git repository and 'GIT_AUTHOR_DATE="2000-01-01
12:12:12 +0000" git commit -m "foo"' produces a commit that is displayed
with a date of 2000-01-01.
I tried it with bup ('GIT_AUTHOR_DATE="2000-01-01 12:12:12 +0000" bup
save -n foo /tmp/foo') but it didn't work.
I didn't dig into git.py, yet. Do you guys think it will be much work to
support those variables?
b) manipulating the file-paths
Of course we would need to strip the location of the backup from the
file paths. An index and save option like --strip-path-prefix would be
possible and not too difficult to implement.
When those basic building blocks are ready we could write something like
"bup import".
I'd then write an importer for rsnapshot.
What do you guys think?
Cheers,
Zoran
Actually, if you wanted to work on that one, you could eliminate the
need for bup entirely :)
> 2) Setting the date for import-commits and manipulating the file-paths
> a) Setting the date for import-commits
> As you can read in the git-commit-tree manpage. It supports the
> environment variables GIT_AUTHOR_DATE and GIT_COMMITTER_DATE.
>
> I tested it with a git repository and 'GIT_AUTHOR_DATE="2000-01-01
> 12:12:12 +0000" git commit -m "foo"' produces a commit that is displayed
> with a date of 2000-01-01.
>
> I tried it with bup ('GIT_AUTHOR_DATE="2000-01-01 12:12:12 +0000" bup
> save -n foo /tmp/foo') but it didn't work.
>
> I didn't dig into git.py, yet. Do you guys think it will be much work to
> support those variables?
bup.git.PackWriter._new_commit() already takes options to let you
specify the two dates; PackWriter.new_commit() just supplies 'now' for
both of them. I don't think it would be a problem to make those dates
configurable. I don't know that I'd use an environment variable for
it, though; a command-line option would probably be more obvious.
Feel free to submit a patch to add this feature.
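
For illustration, a git-style date string like the one used in the test above can be converted into the seconds-since-the-epoch value that a commit timestamp needs. This is only a minimal Python sketch: the name git_date_to_epoch is hypothetical and not part of bup or git, and it accepts just this one date format.

```python
from datetime import datetime


def git_date_to_epoch(text):
    """Convert 'YYYY-MM-DD HH:MM:SS +ZZZZ' to seconds since the epoch.

    Hypothetical helper for illustration; not part of bup or git.
    """
    # %z parses a numeric UTC offset such as +0000 or -0500.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")
    return parsed.timestamp()


print(git_date_to_epoch("2000-01-01 12:12:12 +0000"))
```

A command-line option could then take either form and hand the resulting number to the commit code.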
> b) manipulating the file-paths
> Of course we would need to strip the location of the backup from the
> file paths. a index and save option like --strip-path-prefix would be
> possible and not too difficult to implement.
What part do you want to strip and why? Perhaps an example would be best.
For importing from another backup system, it might be easier to not
use 'bup save' at all, but write another program instead that would
work however you wanted, including pathname translation.
> When those basic building blocks are ready we could write something like
> "bup import".
> I'd then write an importer for rsnapshot.
That sounds quite interesting. I suspect it should be called 'bup
import-rsnapshot' or something though, since I'm not sure there would
be a lot of shared code between import-rsnapshot and
import-anything-else.
Have fun,
Avery
Yeah I'll think about that one. Perhaps there will be a lib for python 3.
> bup.git.PackWriter._new_commit() already takes options to let you
> specify the two dates; PackWriter.new_commit() just supplies 'now' for
> both of them. I don't think it would be a problem to make those dates
> configurable. I don't know that I'd use an environment variable for
> it, though; a command-line option would probably be more obvious.
>
> Feel free to submit a patch to add this feature.
Yeah, command-line options would be more obvious, but since we can see bup
as git for backups, and it mimics many of the ways git handles things, being
compatible with git wouldn't hurt.
I'll write a patch that uses command-line options first.
>> b) manipulating the file-paths
>> Of course we would need to strip the location of the backup from the
>> file paths. a index and save option like --strip-path-prefix would be
>> possible and not too difficult to implement.
> What part do you want to strip and why? Perhaps an example would be best.
My rsnapshot root directory is /var/backup.
For every snapshot there's a directory like /var/backup/daily.2.
In there, there's a directory for every server: /var/backup/daily.2/web1.
Inside that directory is the server's /, so I have
/var/backup/daily.2/web1/etc for web1's /etc from two days ago.
What bup save or a new command should be able to do is strip
'/var/backup/daily.2/web1' and save the content to a given branch.
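
To make the layout described above concrete, here is a rough Python sketch that splits such a path into snapshot, server and the path as seen on the server. The function name split_rsnapshot_path is made up for this example, and it assumes exactly the <root>/<snapshot>/<server>/... layout.

```python
import posixpath


def split_rsnapshot_path(root, path):
    """Split <root>/<snapshot>/<host>/<rest> into its three parts.

    Hypothetical helper; assumes the rsnapshot layout described above.
    """
    rel = posixpath.relpath(path, root)
    parts = rel.split("/")
    # Reject paths outside the root or without a <snapshot>/<host> pair.
    if parts[0] == ".." or len(parts) < 2:
        raise ValueError("%r is not below %r/<snapshot>/<host>" % (path, root))
    snapshot, host = parts[0], parts[1]
    # Whatever follows the host directory is the path on the server.
    rest = "/" + "/".join(parts[2:])
    return snapshot, host, rest


print(split_rsnapshot_path("/var/backup", "/var/backup/daily.2/web1/etc"))
```

The host part would be a natural branch name, and the last part is the path that should end up in the backup.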
> For importing from another backup system, it might be easier to not
> use 'bup save' at all, but write another program instead that would
> work however you wanted, including pathname translation.
Yeah, that might make sense. I'll work on it as soon as my exams are done.
>> When those basic building blocks are ready we could write something like
>> "bup import".
>> I'd then write an importer for rsnapshot.
> That sounds quite interesting. I suspect it should be called 'bup
> import-rsnapshot' or something though, since I'm not sure there would
> be a lot of shared code between import-rsnapshot and
> import-anything-else.
Yeah, sure. My first thought was to specify the source with an option, but
having separate commands (which don't necessarily need to be in bup's
core) is not a bad idea.
Oh, I see. I didn't realize it was that simple :) In that case, just
having a path stripping option to 'bup save' should be enough, you're
right. I assumed the data format was more complicated for some
reason.
The 'tool' for importing rsnapshot backups could then just be a
*really* short shell script :)
Hmm, I guess the import script could run a lot faster if you actually
stripped the prefix at *index* time, though. Basically you would do
    for d in *; do
        bup index -u --strip=$d $d
        bup save --date=$d --strip=$d $d
    done
Any file that had the same name+size+ctime from dir1 to dir2, bup
would already know it had already been backed up, so the 'index' phase
wouldn't need to mark it as dirty again, so 'bup save' would go
extremely fast.
We could also find out how bup's disk space usage compares to
rsnapshot, which would be very interesting :)
>>> When those basic building blocks are ready we could write something like
>>> "bup import".
>>> I'd then write an importer for rsnapshot.
>> That sounds quite interesting. I suspect it should be called 'bup
>> import-rsnapshot' or something though, since I'm not sure there would
>> be a lot of shared code between import-rsnapshot and
>> import-anything-else.
>
> yeah sure. my first thought was to specify the source with a option, but
> having seperate commands (which don't necessarily need to be in bup's
> core) is no bad idea.
For this purpose, I guess a simple separate command that calls into
'bup save' would be nice. I think it would be okay to include such a
thing in core, as long as we had some decent unit tests for it so we
can tell if someone breaks it.
Have fun,
Avery
my take on that one:
https://gist.github.com/eae6bb7a1dc25a26d65c
I'm no shell expert, so I'm open to any comments.
> We could also find out how bup's disk space usage compares to
> rsnapshot, which would be very interesting :)
When things are ready I'll report that.
I started implementing the strip option and pushed it to my git repo [1].
My current problem is that indexing doesn't work correctly. Path prefixes
are stripped correctly and I can see the individual directories in the
bupindex file, but bup index -p doesn't print any files.
I'd appreciate it very much if somebody with a deeper understanding of
the index could check out my diffs and point me at my mistake.
Thanks,
Zoran
I just realized that my changes to bup save and bup index are garbage...
We don't need any changes for the index. bup-import-rsnapshot should use
a temporary separate index, but our index represents the source
filesystem...
I'm not sure how to save files and directories to different trees than
the ones they would normally end up in...
I'll just try to figure it out until somebody points me at the correct
lines.
thanks for your patience...
Zoran
this [1] is the diff to the current master. If everybody is ok with it
I'll submit the patches.
[1] http://github.com/zoranzaric/bup/compare/master...import#diff-1
> this [1] is the diff to the current master. If everybody is ok with it
> I'll submit the patches.
Well, that's kind of backwards; the point of sending the patches to the
mailing list is so that people can see them and comment on them *before* we
put them in. Otherwise I could just as easily git pull rather than making
you go through the hassle of posting them.
But anyway, because I'm so nice and I have Unix Power, I can comment on them
anyhow:
> +d,date= date for the commit (seconds since the epoch)
I suppose we might someday want a better date parser, like git has. But
this is fine for now.
> +if opt.strip and opt.strip.endswith("/"):
> + opt.strip = opt.strip[:-1]
> +
This makes me a little nervous. Maybe strip_path() should do this for us?
Alternatively, maybe --strip-path shouldn't take a parameter at all; perhaps
it should just strip the part of the backed-up paths that correspond to the
command-line argument. That is,
bup save -t --strip-path /etc
would back up the contents of /etc *and* remove /etc from all the paths.
Are there any situations you can think of where we might want to do
otherwise?
One really elegant thing about doing it that way is that you don't have to
worry about naming the right prefix when symlinks are involved; bup can
always remove them the same way.
> @@ -265,7 +270,15 @@ if opt.tree:
> if opt.commit or opt.name:
> msg = 'bup save\n\nGenerated by command:\n%r' % sys.argv
> ref = opt.name and ('refs/heads/%s' % opt.name) or None
> - commit = w.new_commit(oldref, tree, msg)
> + if opt.date:
> + try:
> + #TODO make date parsing more robust
> + date = float(opt.date)
> + except ValueError, e:
> + o.fatal('the date doesn\'t seem to be in the right format')
> + else:
> + date = time.time()
> + commit = w.new_commit(oldref, tree, date, msg)
It would be best to parse the date further up, before doing the whole
backup. It would be kind of disappointing to wait for a whole backup to run
and then not be able to commit it.
Also, when doing a string containing a ' character, just quote the whole
string in "" instead of '' so you can avoid the backslash.
Rather than saying "seem to be", which is a bit waffly, I would say "invalid
date format (should be a float): %r"
> [in cmd/split]
> + if opt.date:
> + try:
> + #TODO make date parsing more robust
> + date = float(opt.date)
> + except ValueError, e:
> + o.fatal('the date doesn\'t seem to be in the right format')
> + else:
> + date = time.time()
Unfortunately this is some duplicated code; maybe put a
parse_date_or_fatal() in helpers.py.
> +def strip_path(prefix, path):
> + if prefix != None and path.startswith(prefix):
> + return path[len(prefix):]
> + else:
> + return path
This makes me a little nervous. I don't think it's good if we ask it to
strip a particular prefix, and then throw paths at it that don't even *have*
that prefix, and then it silently chooses not to strip the prefix, right?
The end result would be pretty weird. At the very least, I think we should
at least throw an exception if the strip prefix is missing from a path.
This makes another vote in favour of --strip-path not taking a parameter;
this edge case would never happen.
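
As a sketch of the stricter behaviour suggested here, a strip_path() could trim the trailing slash itself and raise an exception when a path does not carry the prefix. This is not the code from the patch, only an illustration; the extra check on the "/" boundary is an assumption of mine so that /snapshots/1 does not accidentally match /snapshots/10.

```python
def strip_path(prefix, path):
    """Remove prefix from path, or raise if path is not below prefix.

    Sketch only: a stricter variant of the strip_path() under review.
    """
    # Normalize here so callers need not trim a trailing slash themselves.
    prefix = prefix.rstrip("/")
    if path == prefix:
        return "/"
    # Require a "/" after the prefix so /snapshots/1 does not match /snapshots/10.
    if not path.startswith(prefix + "/"):
        raise ValueError("path %r does not start with prefix %r" % (path, prefix))
    return path[len(prefix):]


print(strip_path("/snapshots/1/", "/snapshots/1/etc/passwd"))
```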
...
Also, I think we actually probably *do* want to allow path stripping in
cmd/index; the reason is to make backups go faster. Why are they faster?
Because we'd be reusing the same index entries for files that are actually
the same:
/snapshots/1/etc/passwd
/snapshots/2/etc/passwd
If we --strip /snapshots/1 and then /snapshots/2, the filename is
/etc/passwd each time. After running 'bup save', /etc/passwd will be marked
in the index as IX_HASHVALID, and we'll have stored the hash of that file
along with its size, dev, ctime, owner, etc. When we then index
/snapshots/2, we can see that /etc/passwd is the same file as it was last
time (same inode, ctime, etc, because it's hardlinked to the other one) and
so the file won't be marked as modified. That means 'bup save' will be able
to run *really* fast on the incremental backups... it can entirely skip
reading the unmodified files.
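
The effect can be modelled with a toy index in Python. This is only a sketch of the idea, not bup's index format: the index here is a plain dict keyed by the stripped path, and the stat values are made up.

```python
def update_index(index, entries):
    """Record entries and return the paths that must be re-read.

    index maps stripped path -> stat signature; entries is an iterable
    of (stripped_path, signature) pairs. Toy model, not bup's index format.
    """
    dirty = []
    for path, signature in entries:
        if index.get(path) != signature:
            # New or changed file: remember it and mark it for saving.
            index[path] = signature
            dirty.append(path)
    return dirty


index = {}
# signature = (size, ctime, dev, inode); the values are made up.
snapshot1 = [("/etc/passwd", (1204, 1270000000, 2049, 4711)),
             ("/etc/motd", (31, 1270000000, 2049, 4712))]
# /etc/passwd is hardlinked to the previous snapshot; /etc/motd changed.
snapshot2 = [("/etc/passwd", (1204, 1270000000, 2049, 4711)),
             ("/etc/motd", (45, 1270086400, 2049, 4800))]

print(update_index(index, snapshot1))  # both paths are new
print(update_index(index, snapshot2))  # only /etc/motd needs reading
```

Without stripping, the two snapshots would produce four different index keys and nothing could be reused.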
As you suggested earlier, probably we should use a "temporary index" file if
we're going to do this; that would be easy if we implemented a
BUP_INDEX_FILE environment variable, mirroring the GIT_INDEX_FILE variable
in git.
Have fun,
Avery
I did this because I like GitHub's interface for inspecting diffs. I
don't have any problem with posting here. No bad intentions. I will do
so once I have worked on it again with your feedback.
>> +d,date= date for the commit (seconds since the epoch)
>
> I suppose we might someday want a better date parser, like git has. But
> this is fine for now.
Yeah, that's for sure. I can put in a better date parser now, too, if you
want me to.
>> +if opt.strip and opt.strip.endswith("/"):
>> + opt.strip = opt.strip[:-1]
>> +
>
> This makes me a little nervous. Maybe strip_path() should do this for us?
I guess this was premature optimization. My first thought was that strip_path()
would be called often, and I didn't want the trimming to happen every time.
> Alternatively, maybe --strip-path shouldn't take a parameter at all; perhaps
> it should just strip the part of the backed-up paths that correspond to the
> command-line argument. That is,
>
> bup save -t --strip-path /etc
>
> would back up the contents of /etc *and* remove /etc from all the paths.
> Are there any situations you can think of where we might want to do
> otherwise?
that totally makes sense... I'll change that.
> It would be best to parse the date further up, before doing the whole
> backup. It would be kind of disappointing to wait for a whole backup to run
> and then not be able to commit it.
I'll move it up.
> Also, when doing a string containing a ' character, just quote the whole
> string in "" instead of '' so you can avoid the backslash.
>
> Rather than saying "seem to be", which is a bit waffly, I would say "invalid
> date format (should be a float): %r"
ok
>> [in cmd/split]
>> + if opt.date:
>> + try:
>> + #TODO make date parsing more robust
>> + date = float(opt.date)
>> + except ValueError, e:
>> + o.fatal('the date doesn\'t seem to be in the right format')
>> + else:
>> + date = time.time()
>
> Unfortunately this is some duplicated code; maybe put a
> parse_date_or_fatal() in helpers.py.
allrighty
>> +def strip_path(prefix, path):
>> + if prefix != None and path.startswith(prefix):
>> + return path[len(prefix):]
>> + else:
>> + return path
>
> This makes me a little nervous. I don't think it's good if we ask it to
> strip a particular prefix, and then throw paths at it that don't even *have*
> that prefix, and then it silently chooses not to strip the prefix, right?
> The end result would be pretty weird. At the very least, I think we should
> at least throw an exception if the strip prefix is missing from a path.
>
> This makes another vote in favour of --strip-path not taking a parameter;
> this edge case would never happen.
you already got me ;)
> ...
>
> Also, I think we actually probably *do* want to allow path stripping in
> cmd/index; the reason is to make backups go faster. Why are they faster?
> Because we'd be reusing the same index entries for files that are actually
> the same:
>
> /snapshots/1/etc/passwd
> /snapshots/2/etc/passwd
>
> If we --strip /snapshots/1 and then /snapshots/2, the filename is
> /etc/passwd each time. After running 'bup save', /etc/passwd will be marked
> in the index as IX_HASHVALID, and we'll have stored the hash of that file
> along with its size, dev, ctime, owner, etc. When we then index
> /snapshots/2, we can see that /etc/passwd is the same file as it was last
> time (same inode, ctime, etc, because it's hardlinked to the other one) and
> so the file won't be marked as modified. That means 'bup save' will be able
> to run *really* fast on the incremental backups... it can entirely skip
> reading the unmodified files.
It's not just /snapshots/1/etc/passwd
it's
/daily.0/www1/etc/apache2/httpd.conf
and
/daily.0/mail1/etc/postfix/main.cf
But I guess it would still work.
> As you suggested earlier, probably we should use a "temporary index" file if
> we're going to do this; that would be easy if we implemented a
> BUP_INDEX_FILE environment variable, mirroring the GIT_INDEX_FILE variable
> in git.
Yeah, at the moment bup index has an -f option for an alternate index
file, but bup save doesn't.
Ok some more work to do, some more reasons to procrastinate stuff for
university :)
I think I found a problem with our gitignore or the Makefile. Where
is bup-import-rsnapshot supposed to live? cmd/bup-import-rsnapshot is
ignored by the cmd/bup-* pattern.
Thanks for your patience!
Zoran
I think I would call it cmd/import-snapshot-cmd.sh, and have the
Makefile auto-symlink it to bup-import-snapshot like it does for
*-cmd.py.
Have fun,
Avery
> would back up the contents of /etc *and* remove /etc from all the paths.
> Are there any situations you can think of where we might want to do
> otherwise?
Yes, actually: a user could want to migrate from rsnapshot or the like
to bup but have limited free space on backup media. Partial path
stripping could help make it possible to perform destructive(!)
migration on a subtree by subtree basis. I admit that that's a corner
case for which other changes would most likely be in order as well,
though, and like the idea of making it easy to strip the whole path.
Perhaps --strip could take an optional numeric parameter indicating
how many leading path elements to keep or strip; --strip=1 would strip
the first element (like patch -p1) and --strip=-2 would strip all but
the last two elements.
> so the file won't be marked as modified. That means 'bup save' will be able
> to run *really* fast on the incremental backups... it can entirely skip
> reading the unmodified files.
Sounds good. Speaking of hard links, it would be awesome if bup could
track them properly; its extensive use of deduplication obviously
means that a naïve treatment will incur essentially no storage
overhead, but there's still a cost in terms of restoration fidelity.
--
Aaron M. Ucko, KB1CJC (amu at alum.mit.edu, ucko at debian.org)
http://www.mit.edu/~amu/ | http://stuff.mit.edu/cgi/finger/?a...@monk.mit.edu
I think what Avery means is the following:
We have our rsnapshot backups in /var/backup and want to import the last
backup of www1 into bup. This means we want /var/backup/daily.0/www1 be
handled as / so /var/backup/daily.0/www1/etc becomes /etc. When speaking
of removing we only speak of path prefixes.
Your use case would be perfectly feasible:
bup save -n www1 --strip-path /var/backup/daily.0/www1 &&
rm -rf /var/backup/daily.0/www1
bup save -n www1 --strip-path /var/backup/daily.1/www1 &&
rm -rf /var/backup/daily.1/www1
and so on.
I hope I didn't misread anything.
Zoran
> bup save -n www1 --strip-path /var/backup/daily.0/www1 &&
> rm -rf /var/backup/daily.0/www1
> bup save -n www1 --strip-path /var/backup/daily.1/www1 &&
> rm -rf /var/backup/daily.1/www1
Sure, but if I don't have space to store all of
/var/backup/daily.0/www1, I might need to work on a subtree by subtree
basis:
bup save -n www1 --strip-path=-2 /var/backup/daily.0/www1/home/foo &&
rm -rf /var/backup/daily.0/www1/home/foo
bup save -n www1 --strip-path=-2 /var/backup/daily.1/www1/home/foo &&
rm -rf /var/backup/daily.1/www1/home/foo
[...]
bup save -n www1 --strip-path=-2 /var/backup/daily.0/www1/home/bar &&
rm -rf /var/backup/daily.0/www1/home/bar
bup save -n www1 --strip-path=-2 /var/backup/daily.1/www1/home/bar &&
rm -rf /var/backup/daily.1/www1/home/bar
There would need to be some additional magic to unify the sub-backups
of daily.0, daily.1, etc.
Ok sorry.
> bup save -n www1 --strip-path=-2 /var/backup/daily.0/www1/home/foo &&
> rm -rf /var/backup/daily.0/www1/home/foo
> bup save -n www1 --strip-path=-2 /var/backup/daily.1/www1/home/foo &&
> rm -rf /var/backup/daily.1/www1/home/foo
> [...]
> bup save -n www1 --strip-path=-2 /var/backup/daily.0/www1/home/bar &&
> rm -rf /var/backup/daily.0/www1/home/bar
> bup save -n www1 --strip-path=-2 /var/backup/daily.1/www1/home/bar &&
> rm -rf /var/backup/daily.1/www1/home/bar
>
> There would need to be some additional magic to unify the sub-backups
> of daily.0, daily.1, etc.
Yeah, I don't really know how to handle that. We might merge commits with
the same timestamp or something like that.
I personally would prefer --prefix over a --strip-path offset.
That example seems a bit contrived to me, but I guess it's possible
that other such situations might exist.
Perhaps we could have a --strip and a separate --strip-prefix=<prefix>
option. If you specify --strip, it chooses a prefix automatically; if
you provide --strip-prefix, it forces the prefix to be the specified
one. We would then throw a fatal error if any of your paths don't
start with the specified prefix.
> Perhaps --strip could take an optional numeric parameter indicating
> how many leading path elements to keep or strip; --strip=1 would strip
> the first element (like patch -p1) and --strip=-2 would strip all but
> the last two elements.
A numeric parameter just hides the confusion, it doesn't really make
it go away. Then you'd have the oddball behaviour of intending to
strip "/usr/tmp" but you end up stripping "/var/tmp" instead by
accident, and there's no way for bup to help you by printing a warning
or error, because you just wanted to strip "the first two elements"
and it did, but the results were meaningless.
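
A small Python model of the numeric proposal shows the problem. The function strip_components is hypothetical; it only mimics the semantics described in the quoted text (positive counts strip leading elements, negative counts keep the trailing ones).

```python
def strip_components(path, count):
    """Strip leading path elements, patch -p style.

    count > 0 strips that many leading elements; count < 0 keeps only
    the last -count elements. Hypothetical model of the proposal.
    """
    parts = [p for p in path.split("/") if p]
    # Slicing covers both cases: parts[2:] drops two, parts[-2:] keeps two.
    kept = parts[count:]
    return "/" + "/".join(kept)


# Both calls succeed and give the same result, so a mistyped source
# directory cannot be detected.
print(strip_components("/usr/tmp/cache/a", 2))
print(strip_components("/var/tmp/cache/a", 2))
print(strip_components("/var/backup/daily.0/www1/home/foo", -2))
```

With an explicit prefix string, the second call could be rejected because /var/tmp does not start with /usr/tmp.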
>> so the file won't be marked as modified. That means 'bup save' will be able
>> to run *really* fast on the incremental backups... it can entirely skip
>> reading the unmodified files.
>
> Sounds good. Speaking of hard links, it would be awesome if bup could
> track them properly; its extensive use of deduplication obviously
> means that a naïve treatment will incur essentially no storage
> overhead, but there's still a cost in terms of restoration fidelity.
Yeah, this would be something that should be implemented in the
metadata stuff, I suppose. It's not high priority for me (the
heaviest use of hard links seems to be ... backups :)) so probably
someone else will have to submit a patch for it.
Have fun,
Avery
> Yeah, this would be something that should be implemented in the
> metadata stuff, I suppose. It's not high priority for me (the
> heaviest use of hard links seems to be ... backups :)) so probably
> someone else will have to submit a patch for it.
I'd like to include support, but I haven't thought hard about it yet.
Might be a good time to do so...
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
> accident, and there's no way for bup to help you by printing a warning
> or error, because you just wanted to strip "the first two elements"
> and it did, but the results were meaningless.
That's fair; taking either an explicit string to strip (or, for that
matter, to keep or substitute) should address that case as well.
(/usr/tmp, eh? How old-school BSD. ;-)
> Yeah, this would be something that should be implemented in the
> metadata stuff, I suppose. It's not high priority for me (the
> heaviest use of hard links seems to be ... backups :)) so probably
> someone else will have to submit a patch for it.
No problem; it sounds like Rob's up for including it with other
metadata (which strikes me as a reasonable home for it as well), and
if not I could look into implementing it myself.
> I'd like to include support, but I haven't thought hard about it yet.
> Might be a good time to do so...
Great; thanks! FWIW, one way to keep hard-link tracking from being
too memory-intensive is to keep track of how many of each inode's
names you've already seen.
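
One possible reading of that idea, as a Python sketch: remember an inode only while some of its names are still outstanding, and forget it as soon as the number of names seen reaches its link count. The function name and the tuple layout are assumptions for this example; in practice the values would come from something like os.lstat().

```python
def group_hard_links(entries):
    """Group paths that share an inode, forgetting inodes once complete.

    entries yields (path, dev, ino, nlink) tuples, e.g. from os.lstat().
    Returns a list of path groups. Illustrative sketch only.
    """
    pending = {}   # (dev, ino) -> paths seen so far
    groups = []
    for path, dev, ino, nlink in entries:
        if nlink < 2:
            continue  # not hard-linked; nothing to track
        names = pending.setdefault((dev, ino), [])
        names.append(path)
        if len(names) == nlink:
            # All names seen: emit the group and free the memory.
            groups.append(pending.pop((dev, ino)))
    return groups


entries = [("/a", 1, 10, 2), ("/b", 1, 11, 1), ("/c", 1, 10, 2)]
print(group_hard_links(entries))
```

Inodes with names outside the backed-up tree never complete in this sketch, so a real implementation would still have to deal with whatever is left in the pending table at the end.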
fatal is an options.Options method, which needs the command we're running.
Is there any sane way to call options.Options from helpers with the
right parameters without stuffing everything into parse_date_or_fatal()?
You could just have parse_date_or_fatal take a 'fatal' method that you
pass as o.fatal.
It's all a bit gross, but that's what helpers.py is: a bunch of gross
functions :)
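
Roughly like this, as a Python sketch. The name parse_date_or_fatal and the error text come from the earlier review comments; treating a missing date as "now" is an assumption based on the behaviour of the patch.

```python
import time


def parse_date_or_fatal(text, fatal):
    """Return text as epoch seconds, or report the error via fatal().

    fatal is a callable such as o.fatal; a None or empty text means "now".
    """
    if not text:
        return time.time()
    try:
        return float(text)
    except ValueError:
        # The caller's fatal() is expected to exit or raise.
        fatal("invalid date format (should be a float): %r" % text)
```

Both cmd/save and cmd/split could then call it as parse_date_or_fatal(opt.date, o.fatal) before starting any real work.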
Have fun,
Avery
There's a TODO with the stuff I want to get finished before submitting
the patches.
Just wanted to report asap.
Zoran
Thanks for the update. I'd suggest submitting small things at a time
rather than generating a big queue: the --date option sounds like an
easy place to start.
It's also much less scary to review small patches at a time :)
Have fun,
Avery
# rsnapshot du web2
3,2G /var/backup/daily.0/web2
198M /var/backup/daily.1/web2
154M /var/backup/daily.2/web2
182M /var/backup/daily.3/web2
148M /var/backup/daily.4/web2
148M /var/backup/daily.5/web2
195M /var/backup/daily.6/web2
260M /var/backup/weekly.0/web2
186M /var/backup/weekly.1/web2
276M /var/backup/weekly.2/web2
156M /var/backup/weekly.3/web2
183M /var/backup/monthly.0/web2
128M /var/backup/monthly.1/web2
75M /var/backup/monthly.2/web2
5,4G total
After running only the commands for the target web2 from
import-rsnapshot-cmd.sh,
my bupdir is 2.1G.
# rsnapshot du mail2
4,4G /var/backup/daily.0/mail2
189M /var/backup/daily.1/mail2
193M /var/backup/daily.2/mail2
170M /var/backup/daily.3/mail2
183M /var/backup/daily.4/mail2
180M /var/backup/daily.5/mail2
202M /var/backup/daily.6/mail2
201M /var/backup/weekly.0/mail2
243M /var/backup/weekly.1/mail2
327M /var/backup/weekly.2/mail2
277M /var/backup/weekly.3/mail2
228M /var/backup/monthly.0/mail2
258M /var/backup/monthly.1/mail2
234M /var/backup/monthly.2/mail2
7,2G total
after bupping mail2
bupdir 4.6G
web2 and mail2 both run the same version of the same OS, so there's some
potential for deduplication. mail2 is, as the name says, a mail server, so
it holds a lot of text files, which is good, too.
rsnapshot: 12.6G
bup: 4.6G
i think we have a winner here.
Zoran
Even that is a 2.5x ratio. Doesn't rsnapshot compress things? That's
a little odd, if not. But I guess I'll take it :)
> rsnapshot: 12.6G
> bup: 4.6G
>
> i think we have a winner here.
2.7x smaller. I like it :)
Thanks for doing the test!
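
For reference, the ratios can be recomputed from the reported sizes with a few lines of Python. The numbers are the ones posted in this thread; reading the 2.5x figure as the web2-only run is an assumption.

```python
# Sizes in GB as reported in the thread.
rsnapshot_web2, bup_web2 = 5.4, 2.1
rsnapshot_total = 5.4 + 7.2   # web2 + mail2
bup_total = 4.6

print("web2 only: %.2fx" % (rsnapshot_web2 / bup_web2))
print("web2 + mail2: %.2fx" % (rsnapshot_total / bup_total))
```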
Have fun,
Avery
No, rsnapshot rsyncs the plain files. It hardlinks files that didn't
change. Time Machine does it the same way, I think.
Oh well, I guess we win the easy way then :)
Time machine is a little funky... I think they might be using
filesystem-layer compression so you don't see that the files are
compressed. I heard MacOS supports this now. I've never used time
machine though.
Have fun,
Avery