Can I send around saves for a multi-tier multi-user backup system?


mle...@gmail.com

Nov 18, 2022, 4:22:34 PM
to bup-list
Hello,

I am on the lookout for a data management scheme for our computational biology working group.

The goal is a group archive that is also the collaboration space: if I want data from a colleague, I download it from the versioned archive. A version is uploaded into the group archive at important milestones (e.g. a talk or a paper) or to provide data to a colleague.

We have the following constraints:

1. Big computation results on Windows PCs, and occasionally Linux clusters
2. A common group drive via Active Directory (Windows) or NFS (Linux) with limited space
3. A tape-backed write-only large backup space that the IT wants to be used responsibly
4. Biologists with self-taught R knowledge (so they can copy/paste bash commands, but no git etc.)

I think that bup should be powerful enough to fulfil constraint 1 and easy enough for constraint 4 (maybe with a small wrapper script). The problem is that because of the limited space (2) we cannot keep our group archive versions on the group drive indefinitely, and to use the big storage we need to present a plan that is not just "let everyone dump whatever they want" (3).

Therefore, I thought of an automated process that will move saves from the group drive to the archive once they are, say, one year old and tagged. The archive is thus distilled out of the daily exchange and not so easily forgotten among the daily chores.

How can I move saves from one bup store (group drive) to another (tape-backed data freezer)? Using git, I would maybe register the big tape drive as a remote and then do something like a cherry-pick onto branch "remote/backup-set", and then git push. Is this the right way?

Best,
Moritz

Aaron M. Ucko

Nov 18, 2022, 5:24:09 PM
to bup-...@googlegroups.com
On 11/18/22 16:22, mle...@gmail.com wrote:
> I am on the lookout for a data management scheme for our computational
> biology working group.

You may be interested in git-annex (https://git-annex.branchable.com/),
which supports using bup as a "special remote".

--

Aaron M. Ucko, KB1CJC (amu at alum.mit.edu, ucko at debian.org)
http://www.mit.edu/~amu/ | http://stuff.mit.edu/cgi/finger/?a...@monk.mit.edu

mle...@gmail.com

Nov 18, 2022, 7:50:00 PM
to bup-list
Aaron M. Ucko wrote on Friday, November 18, 2022 at 23:24:09 UTC+1:
> You may be interested in git-annex (https://git-annex.branchable.com/),
> which supports using bup as a "special remote".

TL;DR: git-annex was my short love affair, but there's no future for us.
Much too finicky, not cross-platform enough, and not understandable
for people without in-depth git experience. The default for my colleagues
is to send undocumented data via mail; if it gets much more complicated
than that, they'll revert to it. Especially because, sadly, keeping track of
data provenance matters far less for getting a PhD than writing
papers fast, backup and sharing must be really fast and easy. The long version
and a list of complicated corner cases is below. I therefore hope bup can
achieve this more easily, combined with the fantastic datalad idea of a
dataset UUID.


Long version (actually not so much on topic of this thread anymore):

 I am actually investigating bup as a replacement for git-annex. That program
is impressive, but not user-friendly enough. Also, it relies on symbolic
links and long file names, both of which are not available on Windows.
I know that there are NTFS symlinks in developer mode, but:

a) I need a robust system with minimal system requirements for research data
storage that should be as accessible as possible.
b) Symlinks are represented differently on our Active Directory share
depending on whether you create them in Linux or Windows/MSYS2. Add
to that the special way that git-annex represents annexed files internally,
which you sometimes have to get rid of by running 'git annex smudge'. On Windows,
such a file then has e.g. a DOCX name extension but contains only one text line
"annex/objects/(hash)", leading to errors when you double-click it.
c) Then the problems when moving symlinks, which have to be fixed by
'git annex fix'. And adjusted branches, a nightmare when
merging/rebasing/amending the latest commit, etc.
d) git-annex creates long filenames, which are not supported on Windows and are
confusing, frustrating and dangerous when inexperienced users run into
them.

So, sadly, much too complicated for non-informaticians. And datalad,
a wrapper that combines this with git submodules to create linkable
versioned datasets, notes in its manual that Windows is a
second-class citizen, mostly, I think, because of the problems it
inherits from using git-annex. Finally, git-annex is written in Haskell, so
the number of programmers who can keep the project running is limited.
My impression is that it's currently
run by the heroic dedication of its author and grant money via the
datalad project...



Greg Troxel

Nov 18, 2022, 7:50:25 PM
to mle...@gmail.com, bup-list

"mle...@gmail.com" <mle...@gmail.com> writes:

> The goal is a group archive that is also the collaboration space, if I want
> data from a colleague, I download it from the versioned archive. A version
> is uploaded into the group archive at important milestones (e.g. a talk or
> a paper) or to provide data to a colleague.
>
> We have the following constraints:
>
> 1. Big computation results on Windows PCs, and occasionally Linux clusters
> 2. A common group drive via Active Directory (Windows) or NFS (Linux) with
> limited space
> 3. A tape-backed write-only large backup space that the IT wants to be used
> responsibly

This point is hard to grasp precisely.

> 4. Biologists with self-taught R knowledge (so they can copy/paste bash
> commands, but no git etc.)

> I think that bup should be powerful enough to fulfill constraint 1. and
> easy enough for 4. (maybe with a small wrapper script). The problem is that
> because of the limited space (2.) we cannot keep our group archive versions
> on the group drive indefinitely and to use the big storage we need to
> present a plan that is not just "let everyone dump whatever they want" (3.).

This is getting complicated!

> Therefore, I thought of a automated process that will move saves from the
> group drive to the archive once they are, say, 1 year old and tagged. Thus,
> the archive is distilled out of the daily exchange and thus not so easily
> forgotten among the daily chores.

bup's fundamental operation is to record a state of a filesystem,
associating it with a tag, and to keep the data in a way that
deduplicates across all previous data, expecting that the bup repo is
append only.

There is "bup gc" now, but that's technically experimental, even though
people think it works.

"bup get" will fetch refs and what they point to from one repo to
another. This is all "make another copy" and not "move".

You are wanting to copy (works well) and then to prune. You'd want to
read the man pages for "bup rm" and "bup gc".
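As a minimal sketch of that copy step (repo paths and the branch name here are hypothetical; see bup-get(1) for the real option semantics):

```shell
# Sketch: copy a branch, and every save it points to, from the group
# repo into the archive repo. "--ff" fast-forwards the branch in the
# destination and refuses to rewrite unrelated history.
copy_branch() {
    src=$1    # source bup repo, e.g. /groupdrive/bup
    dst=$2    # destination bup repo, e.g. /archive/bup
    branch=$3 # branch (backup set) to transfer, e.g. "results"
    bup -d "$dst" get -s "$src" --ff "$branch"
}
```

Removing the copied saves from the source afterwards is the separate, riskier step covered by the bup-rm and bup-gc man pages.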

> How can I move saves from one bup store (group drive) to another
> (tape-backed data freezer)? Using git, I would maybe register the big tape
> drive as a remote and then do something like a cherry-pick onto branch
> "remote/backup-set", and then git push. Is this the right way?

bup get *copies* saves. The scary/tricky bit is removing them from the
original.

I am not clear on if you want your working store in bup format. If
that's the usual

report_final_final(2)_draft_tuesday.docx

nightmare, then you can just use bup daily to back it up to the archive.

mle...@gmail.com

Nov 18, 2022, 8:45:55 PM
to bup-list
Greg Troxel wrote on Saturday, November 19, 2022 at 01:50:25 UTC+1:

> "mle...@gmail.com" <mle...@gmail.com> writes:
>
> > The goal is a group archive that is also the collaboration space (...)
> >
> > We have the following constraints:
> >
> > 3. write-only large backup space that the IT wants to be used
> > responsibly
>
> This point is hard to grasp precisely.

... currently, they allow file uploads via a database frontend where you
can enter descriptions. I was maybe a bit vague because I think I can
negotiate with them as long as the space does not get cluttered with
unordered files that no one knows/cares about.
 

> "bup get" will fetch refs and what they point to from one repo to
> another. This is all "make another copy" and not "move".

Hmm, 'bup get --pick' could be what I'm looking for; it seems to 'get' a single save and
not also all its ancestors. That would be the way to transfer to the long-term
archive.


> bup get *copies* saves. The scary/tricky bit is removing them from the
> original.
 
As the pruning would then only affect the group drive, we have
the Shadow Copy Service (aka Windows File History) as a failsafe.
Then I would need a way to check backup integrity, maybe by keeping
a list of paths/md5sums that I expect in the backup and trying to
recover random samples thereof... Then I would notice problems before the
shadow copies time out...
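That sampling idea could be scripted roughly like this (a hypothetical sketch; the helper names, paths and sample size are made up, and 'shuf'/'md5sum' come from GNU coreutils):

```shell
# Sketch: keep a manifest of "md5sum  path" lines for files expected in
# the backup, then verify a random sample of restored copies against it.
make_manifest() {
    dir=$1; manifest=$2
    # Record a checksum for every file under $dir.
    ( cd "$dir" && find . -type f -exec md5sum {} + ) > "$manifest"
}
check_sample() {
    restored=$1; manifest=$2; n=$3
    # Pick n random manifest entries and check them in the restored tree.
    shuf -n "$n" "$manifest" | ( cd "$restored" && md5sum -c --quiet - )
}
```

check_sample exits non-zero if any sampled file is missing or differs, which could trigger an alert well before the shadow copies expire.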


> I am not clear on if you want your working store in bup format.

I guess that's the plan, maybe accompanied by a read-only checkout
of the latest version...

Rob Browning

Nov 19, 2022, 12:54:31 PM
to Greg Troxel, mle...@gmail.com, bup-list
Greg Troxel <g...@lexort.com> writes:

> "mle...@gmail.com" <mle...@gmail.com> writes:

> There is "bup gc" now, but that's technically experimental, even though
> people think it works.

We finally removed the scary doc warnings in 0.33, and I think people
have been using it for a good while (me included). So hopefully it's
"OK" now. That said, it *is* still a destructive operation.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Greg Troxel

Nov 19, 2022, 1:01:02 PM
to mle...@gmail.com, bup-list

"mle...@gmail.com" <mle...@gmail.com> writes:

>> > 3. write-only large backup space that the IT wants to be used
>> > responsibly
>>
>> This point is hard to grasp precisely.
>
> ... currently, they allow file uploads via a database frontend where you
> can enter descriptions. I was maybe a bit vague because I think I can
> negotiate with them as long as the space does not get cluttered with
> unordered files that noone knows/cares about.

That seems like a user problem :-(

>> "bup get" will fetch refs and what they point to from one repo to
>> another. This is all "make another copy" and not "move".
>
> Hmm, 'bup get --pick' could be what I'm looking for, it seems to 'get'
> a single save and not also all its ancestors, that would be the way to
> transfer to the long-term archive

If the long-term is supposed to be very large and append only, then it
would make sense to not use --pick and have it be the true backup, and
the online version just recent saves.

>> bup get *copies* saves. The scary/tricky bit is removing them from the
>> original.
>
> As the pruning would then only affect the group drive, we have
> the Shadow Copy Service (aka Windows File History) as a failsafe.
> Then I would need a way to check for backup integrity, maybe by keeping
> a list of paths/md5sums that I expect in the backup and trying to
> recover random samples thereof... Then I would notice before the
> shadow copies time out...

One thing is "bup fsck". But yes, having a weekly recovery drill is
great. I used to ask my group's sysadmin, once a week, to find a random
group member and have them identify a random important file and
retrieve it from backups. That doesn't have anything to do with bup
specifically, but I stand by the advice :-)
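A hypothetical cron-style wrapper around "bup fsck" might look like this (repo path and helper name invented; running "bup fsck -g" once beforehand generates par2 recovery blocks, if par2 is installed, that "bup fsck -r" can later use):

```shell
# Sketch: verify a repo's packfiles and complain loudly on failure,
# so a scheduler or mail hook can pick the message up.
check_repo() {
    repo=$1
    bup -d "$repo" fsck || echo "bup fsck FAILED for $repo" >&2
}
```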


Rob Browning

Nov 19, 2022, 1:27:43 PM
to mle...@gmail.com, bup-list
"mle...@gmail.com" <mle...@gmail.com> writes:

> ... currently, they allow file uploads via a database frontend where you
> can enter descriptions. I was maybe a bit vague because I think I can
> negotiate with them as long as the space does not get cluttered with
> unordered files that noone knows/cares about.

Right, one thing I wasn't sure about either was the tape system. If
that doesn't behave like a "normal" filesystem somehow, i.e. if it
requires streaming, then you wouldn't be able to use bup directly for
the archival step.

>> How can I move saves from one bup store (group drive) to another
>> (tape-backed data freezer)? Using git, I would maybe register the big tape
>> drive as a remote and then do something like a cherry-pick onto branch
>> "remote/backup-set", and then git push. Is this the right way?

> bup get *copies* saves. The scary/tricky bit is removing them from the
> original.

Right, you might use bup to manage the working set, though the main
benefit of using bup there, as compared to a pile of compressed tar
archives, would I imagine mostly be deduplication.

With that arrangement access to the working set would be via
restore/fuse/web (though most likely restore for more intensive use). I
suppose you could also handle giant "blob" data sets (if there are any)
via split/join.

In terms of archiving, as I think you suggest, one option might be to
create a new repo with the things you want to archive using "bup get",
then run "bup rm" to drop the refs from the working repo, and finally
run "bup gc" to reclaim space.

That leaves you with the new "transfer repo" which would need to be
stored on the tapes somehow. Of course you'd have effectively no
deduplication across transfer repos (unless the tape system can do that
somehow itself -- doubtful(?), since each transfer repo would be a set
of new, custom-subset packfiles).
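Those three steps might be sketched like this (paths, the save name, and the helper are all hypothetical; bup-rm and bup-gc both insist on --unsafe, so read their man pages before trying anything similar):

```shell
# Sketch: copy one save into a fresh "transfer repo", drop it from the
# working repo, then reclaim the space. Stops at the first failure.
archive_save() {
    work=$1     # working repo on the group drive
    transfer=$2 # fresh repo destined for the tape system
    save=$3     # e.g. "results/2021-11-18-142233"
    bup -d "$transfer" get -s "$work" --pick "$save" &&
    bup -d "$work" rm --unsafe "/$save" &&
    bup -d "$work" gc --unsafe
}
```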

Another possibility I wondered about, would be streaming the archive set
directly to the tapes via "git archive". Not sure if "git archive"
would work OK for big saves/repos, but if it would, perhaps interesting.

And of course if git-archive can't handle big saves/repos, something
like that could be added to bup, given the time.
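That speculative idea could look roughly like this, since a bup repository is a git repository underneath (repo path, save name, and tape target are placeholders):

```shell
# Sketch: stream the tree of one save as an uncompressed tar straight
# to the tape target, without needing scratch space for a restore.
stream_save_to_tape() {
    repo=$1; save=$2; tape=$3
    git --git-dir="$repo" archive --format=tar "$save" > "$tape"
}
```

Note that this would presumably stream the raw stored tree, including bup's internal chunk directories for large files and .bupm metadata entries, so the tar would not be a faithful restore; that may be part of why it's an open question above.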

(Hmm, I also hadn't really thought about how "git bundle" might work for
bup repos, though I don't think it can stream, so if it worked, it'd be
more an alternative to a "bup get" subset.)

> Hmm, 'bup get --pick' could be what I'm looking for, it seems to 'get'
> a single save and not also all its ancestors, that would be the way to
> transfer to the long-term archive

Right, bup-get is (overly?) flexible. You can cherry-pick saves from a
branch (or even across branches) to create a new branch (even within the
same repo), you can also promote subtrees to saves, etc.

> As the pruning would then only affect the group drive, we have
> the Shadow Copy Service (aka Windows File History) as a failsafe.
> Then I would need a way to check for backup integrity, maybe by keeping
> a list of paths/md5sums that I expect in the backup and trying to
> recover random samples thereof... Then I would notice before the
> shadow copies time out...

If you had enough scratch space, you could also do what I do in some
cases, and double-check that a restore (or part of a restore) matches
the original (filesystem or other restore) via rsync -ni... We have a
helper for that in the source tree called compare-trees which the tests
rely on heavily.
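A sketch of that check (hypothetical helper and paths; compare-trees in the bup source does this more carefully):

```shell
# Sketch: after restoring a save to scratch space, ask rsync to report,
# but not apply, any differences from the original tree.
verify_restore() {
    original=$1; restored=$2
    # -n: dry run; -i: itemize changes; -a: compare metadata too;
    # --checksum: compare contents, not just size/mtime.
    # Any output at all means a mismatch.
    diffs=$(rsync -nia --checksum --delete "$original/" "$restored/")
    [ -z "$diffs" ]
}
```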

One other issue -- you mentioned the working set being on a remote
filesystem. That might introduce some additional questions/concerns.
The happy path for bup is still either direct filesystem or ssh access
for saves/restores, etc. There is also (currently) no locking, or other
carefully vetted coordination with respect to concurrency.

Some operations may well be fairly safe. We do try to handle various
operations safely, at least for local "normal" filesystems, but we don't
do NFS-safe locking for example, and I doubt *all* of the code was
written with concurrency in mind (since that wasn't a goal). Given
that, you may need to handle locking, if you need it, at a higher level
somehow.

I also don't know how well bup will perform with NFS in general[1],
definitely with respect to concurrency, and also with respect to
performance for larger repos, given how much bup (at the moment) and git
rely on mmapping, etc.

I suppose I don't really have enough recent, careful experience with NFS
to have a well informed opinion.

[1] ...and we've had reports of actual trouble that might or might not
have been related to CIFS -- including one recent report.

Hope this helps.

Greg Troxel

Nov 19, 2022, 1:41:29 PM
to Rob Browning, mle...@gmail.com, bup-list

I will say to the OP: it is not the least bit clear that using bup for
your problem makes sense.

I say this as a longtime bup user and (minor) contributor.


Rob Browning

Nov 19, 2022, 2:59:04 PM
to Greg Troxel, mle...@gmail.com, bup-list
Greg Troxel <g...@lexort.com> writes:

> One thing is "bup fsck". But yes, having a weekly recovery drill is
> great. I used to ask my group's sysadmin to once a week, find a
> random group member, and have them identify a random important file,
> and retrieve it from backups. That doesn't have anything to do with
> bup specifically, but I stand by the advice :-)

Nice.

I suppose it might also be worth mentioning to everyone that in the end,
if you have the relevant set (or superset) of *.pack files along with
the corresponding refs you care about (the "root" hashes), then that's
"all your data", i.e. that set is the only critical thing in a bup
repository; everything else could be reconstructed (with some effort).

Of course you might also want par2 files for the packfiles and the list
of refs.
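For example (a hypothetical helper; 10% redundancy and the refs-snapshot name are arbitrary choices):

```shell
# Sketch: generate par2 recovery data for each packfile and for a
# snapshot of the refs, so bit rot on the archive media is repairable.
add_par2() {
    repo=$1
    # Snapshot the refs (the "root" hashes) alongside the packs.
    git --git-dir="$repo" show-ref > "$repo/refs-snapshot.txt"
    for pack in "$repo"/objects/pack/*.pack; do
        par2 create -r10 "$pack.par2" "$pack"
    done
    par2 create -r10 "$repo/refs-snapshot.par2" "$repo/refs-snapshot.txt"
}
```

Alternatively, "bup fsck -g" will generate per-packfile par2 recovery data for you when par2 is installed.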

Nix

Dec 17, 2022, 6:15:20 PM
to Rob Browning, Greg Troxel, mle...@gmail.com, bup-list
On 19 Nov 2022, Rob Browning said:
> I suppose it might also be worth mentioning to everyone that in the end,
> if you have the relevant set (or superset) of *.pack files along with
> the corresponding refs you care about (the "root" hashes), then that's
> "all your data", i.e. that set is the only critical thing in a bup
> repository; everything else could be reconstructed (with some effort).

You can often recover even if you had something eat your refs, as long
as you have the logs/ dir with the git reflogs in it. "git log -g" will
usually show you the last few months of refs, even if they were
accidentally deleted, and given that, you can recreate them trivially.
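Assuming the repo kept its reflogs, that recovery could be sketched as (repo path, branch name, and hash are hypothetical):

```shell
# Sketch: scan the reflog for the lost commit, then point the branch
# at it again.
list_reflog() {
    repo=$1
    # Shows where HEAD has recently pointed, even for deleted refs.
    git --git-dir="$repo" log -g --oneline HEAD
}
restore_ref() {
    repo=$1; branch=$2; hash=$3
    # Recreate the branch at a commit found in the reflog output.
    git --git-dir="$repo" branch -f "$branch" "$hash"
}
```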

--
NULL && (void)