git-annex and bup - the other way around ???


Olaf TNSB

Jun 17, 2016, 1:30:56 AM
to bup list
I've been wondering about using `git-annex` and `bup` together - but _not_ with `bup` as the backend, but rather backing up `bup` repos using `git-annex`.

Let me try to explain...

* `bup` is an awesome deduplicating backup tool, but it does not have encryption
* `git-annex` is awesome in so many ways, including 1) multiple copies and 2) encryption

(I know the following reads like the motivation for `git-annex`, but let me add the word **backup**)
 
* Recovering large backups over the internet can be costly and time consuming
* Local copies are fast, but are risky

So I was wondering about having my `bup` repos in `git-annex`, with multiple copies, including, say, an encrypted S3 bucket and some local copies.

Then, if I had a problem and needed to restore I could use my local copies for as much as I could and then only pull part of the backup from the complete remote backup.

If that all works, I then have:
1. A more complicated process than a simple backup tool  :-(
2. Multiple complete backups available :-D
3. Encrypted, offsite backups :-D
4. Small transfers (only need to transfer any new files created by my latest `bup` backup) :-D

---

I've not done this yet, but was thinking it through.

Can anyone share some opinions, thoughts, concerns or high-5s for the awesomeness of my idea??  ;-)



I posted almost the same comment on the git-annex page: https://git-annex.branchable.com/forum/git-annex_and_bup_-_the_other_way_around___63____63____63__/

Zoran Zaric

Jun 17, 2016, 2:00:00 AM
to Olaf TNSB, bup list
Hello Olaf,

> On 17.06.2016 at 07:30, Olaf TNSB <still.anot...@gmail.com> wrote:
>
> I've been wondering about using `git-annex` and `bup` together - but _not_ with `bup` as the backend, but rather backing up `bup` repos using `git-annex`.
>
> Let me try to explain...
>
> * `bup` is an awesome deduplicating backup tool, but it does not have encryption
> * `git-annex` is awesome in so many ways, including 1) multiple copies and 2) encryption
>
> (I know the following reads like the motivation for `git-annex`, but let me add the word **backup**)
>
> * Recovering large backups over the internet can be costly and time consuming
> * Local copies are fast, but are risky
>
> So I was wondering about having my `bup` repos in `git-annex`, with multiple copies, including, say, an encrypted S3 bucket and some local copies.

I don't see any problem with this. bup's packfiles are immutable. They can even be placed in Amazon's Glacier.

The local index, the MIDX files and refs will be changed/replaced.
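
Because pack files are never rewritten, the set of files to upload after a backup run should be just the difference between two file listings. A minimal sketch of that idea, assuming a POSIX shell; the function names `list_repo_files` and `new_files_since` and the listing file names are hypothetical and not part of bup or git-annex:

```shell
# List every file under a repository directory, one relative path per line,
# sorted in the C locale. Assumes file names contain no newlines.
list_repo_files() (
    cd "$1" && find . -type f | LC_ALL=C sort
)

# Print the paths that appear in the second listing but not in the first.
# Both listings must be sorted the same way (C locale here).
new_files_since() {
    LC_ALL=C comm -13 "$1" "$2"
}

# Hypothetical usage around a backup run:
#   list_repo_files "$BUP_DIR" > before.list
#   ... run your usual bup index / bup save here ...
#   list_repo_files "$BUP_DIR" > after.list
#   new_files_since before.list after.list
```

Files that are replaced under the same name (refs, for example) do not show up in this name-based diff, so this only illustrates the "new pack files" part of the transfer.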


> Then, if I had a problem and needed to restore I could use my local copies for as much as I could and then only pull part of the backup from the complete remote backup.
>
> If that all works, I then have:
> 1. A more complicated process than a simple backup tool :-(
> 2. Multiple complete backups available :-D
> 3. Encrypted, offsite backups :-D
> 4. Small transfers (only need to transfer any new files created by my latest `bup` backup) :-D
>
> ---
>
> I've not done this yet, but was thinking it through.
>
> Can anyone share some opinions, thoughts, concerns or high-5s for the awesomeness of my idea?? ;-)

Hope this helps and here's a high five: *5*

Thanks,
Zoran

still.anot...@gmail.com

Jun 17, 2016, 2:51:38 AM
to bup-list, still.anot...@gmail.com, z...@zoranzaric.de
Hi Zoran,


On Friday, June 17, 2016 at 4:00:00 PM UTC+10, Zoran Zaric wrote:
>> So I was wondering about having my `bup` repos in `git-annex`, with multiple copies, including, say, an encrypted S3 bucket and some local copies.
>
> I don't see any problem with this. bup's packfiles are immutable.  They can even be placed in Amazon's Glacier.
>
> The local index, the MIDX files and refs will be changed/replaced.

Ah, interesting - I'll have to look at them.  Is there a way to recreate the MIDX files?  Can I just not back them up and then recreate?

>> Can anyone share some opinions, thoughts, concerns or high-5s for the awesomeness of my idea??  ;-)
>
> Hope this helps and here's a high five: *5*

LOL.  You've made my day!
 

Zoran Zaric

Jun 17, 2016, 3:00:25 AM
to still.anot...@gmail.com, bup-list
On 17.06.2016 at 08:51, still.anot...@gmail.com wrote:

> Hi Zoran,
>
> On Friday, June 17, 2016 at 4:00:00 PM UTC+10, Zoran Zaric wrote:
>>> So I was wondering about having my `bup` repos in `git-annex`, with multiple copies, including, say, an encrypted S3 bucket and some local copies.
>>
>> I don't see any problem with this. bup's packfiles are immutable.  They can even be placed in Amazon's Glacier.
>>
>> The local index, the MIDX files and refs will be changed/replaced.
>
> Ah, interesting - I'll have to look at them.  Is there a way to recreate the MIDX files?  Can I just not back them up and then recreate?

To my (old) knowledge: yes, you can recreate MIDX files.  They are just for performance improvements. Stuff you need is: *.pack, *.idx, and the refs.

Can somebody with fresher knowledge double-check this?

>>> Can anyone share some opinions, thoughts, concerns or high-5s for the awesomeness of my idea??  ;-)
>>
>> Hope this helps and here's a high five: *5*
>
> LOL.  You've made my day!


:)

Rob Browning

Jun 17, 2016, 7:55:47 PM
to Zoran Zaric, still.anot...@gmail.com, bup-list
Zoran Zaric <z...@zoranzaric.de> writes:

> To my (old) knowledge: yes, you can recreate MIDX files. They are just
> for performance improvements. Stuff you need is: *.pack, *.idx, and
> the refs.
>
> Can somebody with fresher knowledge double-check this?

So I *think* in theory, you should have effectively "all" of your data
if you have the pack files, and the contents of refs/heads.

I believe you can regenerate:

- the midx via "bup midx",
- any par2 files via "bup fsck -g",
- the bloom filter via "bup bloom", and
- the .idx files via:
$ (set -e; for f in objects/pack/*.pack; do git index-pack "$f"; done)

Though you might well want to keep the .idx and par2 files too for extra
redundancy. And of course, I'd definitely recommend testing your
restores, once you have it all working.
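
The split described above can be sketched as a small classifier. This is only an illustration of the list as stated, not something bup provides: the function name `classify_bup_file` is hypothetical, the bloom filter file name `bup.bloom` is an assumption, and "regenerable" carries the same "I believe" caveat as the list itself.

```shell
# Classify a path inside a bup repository according to the list above.
# Prints "essential", "regenerable", or "other".
classify_bup_file() {
    case "$1" in
        # Pack files and branch refs: the data that cannot be rebuilt.
        *.pack|refs/heads/*|*/refs/heads/*) echo essential ;;
        # Indexes, midx, par2 and the bloom filter (file name assumed).
        *.idx|*.midx|*.par2|bup.bloom|*/bup.bloom) echo regenerable ;;
        *) echo other ;;
    esac
}

# Hypothetical usage from the top of a repository:
#   find . -type f | while IFS= read -r f; do
#       printf '%s\t%s\n' "$(classify_bup_file "$f")" "$f"
#   done
```

Anything reported as "other" would need a decision of its own before being left out of a backup.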

Hope this helps
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Greg Troxel

Jun 17, 2016, 7:59:52 PM
to Rob Browning, Zoran Zaric, still.anot...@gmail.com, bup-list

Rob Browning <r...@defaultvalue.org> writes:

> So I *think* in theory, you should have effectively "all" of your data
> if you have the pack files, and the contents of refs/heads.
>
> I believe you can regenerate:
>
> - the midx via "bup midx",

would this happen automatically if they were needed? As I understand
it, they are merely an efficient combined representation of the set of
idx files.

> - the bloom filter via "bup bloom", and

Does/should bup fsck check that, and/or regen if missing?

> - the .idx files via:
> $ (set -e; for f in objects/pack/*.pack; do git index-pack "$f"; done)

Does/should bup fsck check that, and/or regen if missing?

> Though you might well want to keep the .idx and par2 files too for extra
> redundancy. And of course, I'd definitely recommend testing your
> restores, once you have it all working.

Definitely testing would be good :-) I suppose this could turn into a
few new test cases for bup.

Rob Browning

Jun 17, 2016, 8:27:39 PM
to Greg Troxel, Zoran Zaric, still.anot...@gmail.com, bup-list
Greg Troxel <g...@lexort.com> writes:

> Rob Browning <r...@defaultvalue.org> writes:

>> I believe you can regenerate:
>>
>> - the midx via "bup midx",
>
> would this happen automatically if they were needed? As I understand
> it, they are merely an efficient combined representation of the set of
> idx files.

Sounds like it:

    Note: you should no longer need to run this command by hand. It gets
    run automatically by bup-save(1) and similar commands.

>> - the bloom filter via "bup bloom", and
>
> Does/should bup fsck check that, and/or regen if missing?

I doubt fsck does, but I suspect bup itself (or the relevant commands
that need a bloom filter) will.

>> - the .idx files via:
>> $ (set -e; for f in objects/pack/*.pack; do git index-pack "$f"; done)
>
> Does/should bup fsck check that, and/or regen if missing?

I suspect not since by default fsck runs verify-pack to try to make sure
that the packs are OK -- in part according to the .idx files.

As it stands now, fsck can verify the pack files against their .idx
files, or generate par2 data, or attempt repairs via par2 data.

Generally speaking, I suspect missing .idx files would normally be
extremely suspicious since they're effectively immutable per-pack
summaries. If they're missing, I'd guess that either someone's been
messing around, or something's gone terribly wrong (even though git can
regenerate them).

Gernot Schulz

Jun 18, 2016, 5:40:13 AM
to bup list
On Fri, Jun 17, 2016 at 03:30:55PM +1000, Olaf TNSB wrote:
> So I was wondering about having my `bup` repos in `git-annex`, with
> multiple copies, including, say, an encrypted S3 bucket and some local
> copies.

I have done something like this before which I tried to document
here:

https://git-annex.branchable.com/tips/Bup_repositories_in_git-annex/

Maybe you'll find it useful.

still.anot...@gmail.com

Jun 19, 2016, 6:59:32 PM
to bup-list, ne...@gernot-schulz.com
Hi Gernot,

Apologies, I missed your link; not sure how that happened.  :-(

BTW - *5* for a great idea!  :-D

Reading your comment it comes across as if you are not still using this approach...  Can you share what went wrong or why you aren't doing it any more?


Thanks for your notes - it's a great resource.

Greg Troxel

Jun 19, 2016, 7:18:57 PM
to Rob Browning, Zoran Zaric, still.anot...@gmail.com, bup-list

Rob Browning <r...@defaultvalue.org> writes:

> Greg Troxel <g...@lexort.com> writes:
>
>> Rob Browning <r...@defaultvalue.org> writes:
>
>>> I believe you can regenerate:
>>>
>>> - the midx via "bup midx",
>>
>> would this happen automatically if they were needed? As I understand
>> it, they are merely an efficient combined representation of the set of
>> idx files.
>
> Sounds like it:
>
> Note: you should no longer need to run this command by hand. It gets
> run automatically by bup-save(1) and similar commands.

Related, it's not clear to me if the latest midx is sufficient, and why
the earlier ones are kept around.

>>> - the bloom filter via "bup bloom", and
>>
>> Does/should bup fsck check that, and/or regen if missing?
>
> I doubt fsck does, but I suspect bup itself (or the relevant commands
> that need a bloom filter) will.

I would argue that there should be a specification for how the repo
ought to be, and that all commands should, when not having an error
exit, leave it that way, and that fsck should verify the invariant and
to the extent sane repair it, more or less how fsck on a filesystem
does.

>>> - the .idx files via:
>>> $ (set -e; for f in objects/pack/*.pack; do git index-pack "$f"; done)
>>
>> Does/should bup fsck check that, and/or regen if missing?
>
> I suspect not since by default fsck runs verify-pack to try to make sure
> that the packs are OK -- in part according to the .idx files.
>
> As it stands now, fsck can verify the pack files against their .idx
> files, or generate par2 data, or attempt repairs via par2 data.

So it seems like fsck needs an "ensure idx files exist and perhaps check
them" step.
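
Until something like that exists, the "do the idx files exist" half can be approximated outside of bup. A rough sketch, assuming a POSIX shell and the usual `objects/pack` layout; `packs_missing_idx` is a hypothetical helper, not a bup or git command, and it checks only for presence, not correctness:

```shell
# Print each pack file under DIR/objects/pack that has no matching .idx file.
packs_missing_idx() {
    for pack in "$1"/objects/pack/*.pack; do
        [ -e "$pack" ] || continue    # the glob matched nothing
        [ -e "${pack%.pack}.idx" ] || printf '%s\n' "$pack"
    done
}

# Hypothetical usage: rebuild only the missing indexes.
#   packs_missing_idx "$BUP_DIR" | while IFS= read -r p; do
#       git index-pack "$p"
#   done
```

Any output at all is worth investigating first, since a missing .idx file is itself a sign that something went wrong.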

> Generally speaking, I suspect missing .idx files would normally be
> extremely suspicious since they're effectively immutable per-pack
> summaries. If they're missing, I'd guess that either someone's been
> messing around, or something's gone terribly wrong (even though git can
> regenerate them).

True. But the point of fsck is to detect wrong things and fix them; in
a system that has not encountered storage errors and does not have code
errors it should be pointless to run.

Rob Browning

Jun 19, 2016, 11:00:34 PM
to Greg Troxel, Zoran Zaric, still.anot...@gmail.com, bup-list
Greg Troxel <g...@lexort.com> writes:

> Related, it's not clear to me if the latest midx is sufficient, and why
> the earlier ones are kept around.

From this, without checking the code, I'd guess that they're cumulative
unless you --force:

    -f, --force
        force generation of a single new .midx file containing all
        your .idx files, even if other .midx files already exist.
        This will result in the fastest backup performance, but
        may take a long time to run.

> I would argue that there should be a specification for how the repo
> ought to be, and that all commands should, when not having an error
> exit, leave it that way, and that fsck should verify the invariant and
> to the extent sane repair it, more or less how fsck on a filesystem
> does.

Offhand, it also seems fine if every command is capable of lazily
regenerating what it needs. Though I think it'd be fine to have a
command that will generate "everything, right now".

> So it seems like fsck needs an "ensure idx files exist and perhaps check
> them" step.

Sure, might be a reasonable addition.

> True. But the point of fsck is to detect wrong things and fix them; in
> a system that has not encountered storage errors and does not have code
> errors it should be pointless to run.

Indeed, I think it's possible we might want to make fsck more
comprehensive. I was mostly just describing what it does now.

Thanks

Gernot Schulz

Jul 16, 2016, 4:51:07 PM
to bup-list

Gernot Schulz

Jul 17, 2016, 6:41:11 AM
to bup-list
Hey!

On Sun, Jun 19, 2016 at 03:59:31PM -0700, still.anot...@gmail.com wrote:
> Reading your comment it comes across as if you are not still using this
> approach... Can you share what went wrong or why you aren't doing it any
> more?

You're right, for the past year or so I haven't been using it to do
uploads to S3. The reasons have nothing to do with git-annex,
though.

I haven't found anything wrong with the concept. In fact, my
repositories are still set up with git-annex, and I use it to track
archived bup repositories on other hard drives.

Now, having been reminded of this, I'm tempted to set up automated
uploads again. :)

PS: Sorry for that empty message before.