Restore into a single stream?


Stefan Monnier

Oct 6, 2023, 12:00:38 PM
to bup-...@googlegroups.com
Is there a convenient way to extract a backup into a pipe-able single
bytestream, like a tarball or cpio archive (assuming of course that the
backup was made with index+save rather than split)?

It would be best if that bytestream uses a standard archive format like
tar/cpio, but I would settle for something like the Git pack files, so
I suspect it could be a simple matter of getting access to the stream
connecting the two ends of a `bup on get ...`, especially if I can then
pipe that data back to `bup` to do the actual restore (or to save it as
a new commit).


Stefan

Rob Browning

Oct 7, 2023, 1:11:20 PM
to Stefan Monnier, bup-...@googlegroups.com
There's not right now, I don't think. Though I have contemplated some
"bup export" command off and on. Python includes at least basic tar
support that could help, but I don't know how good it is.
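As a rough illustration of what that basic stdlib support can do -- a minimal sketch only, with invented member names and an invented `stream_tar` helper, not anything bup provides:

```python
import io
import tarfile

def stream_tar(members, out):
    # 'w|' writes a non-seekable stream, so `out` can be a pipe
    # (e.g. sys.stdout.buffer) instead of a file on disk.
    with tarfile.open(mode="w|", fileobj=out) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# Demo against an in-memory buffer rather than a real pipe:
buf = io.BytesIO()
stream_tar([("backup/hello.txt", b"hello\n")], buf)
print(len(buf.getvalue()) % 512 == 0)  # tar streams come in 512-byte blocks
```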

That said, if you can afford the space on the destination, I suppose
bup-get(1) might suffice. If you already had some of the related data
in a destination repo, it'd also be faster.

And of course, you could use a temporary repo if you didn't want to have
to (rm and/or) gc afterwards to get rid of the temporary transfer. I
might recommend that unless you did want to use existing destination
data to speed up the get.

Hope this helps
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Stefan Monnier

Oct 12, 2023, 11:02:57 AM
to Rob Browning, bup-...@googlegroups.com
> That said, if you can afford the space on the destination, I suppose
> bup-get(1) might suffice. If you already had some of the related data
> in a destination repo, it'd also be faster.

Yes, `bup get` is what I use, currently.

BTW, my main use is to just measure the "size" of individual backups
(so I `bup get` into a temp repo and then throw away the result).
I use that to keep track of their "sizes" so as to detect important
changes, which at least once brought to my attention that my backup
script was not backing up the right data any more :-).


Stefan

Rob Browning

Oct 13, 2023, 2:05:22 PM
to Stefan Monnier, bup-...@googlegroups.com
Stefan Monnier <mon...@iro.umontreal.ca> writes:

> BTW, my main use is to just measure the "size" of individual backups
> (so I `bup get` into a temp repo and then throw away the result).
> I use that to keep track of their "sizes" so as to detect important
> changes, which at least once brought to my attention that my backup
> script was not backing up the right data any more :-).

Ahh. For a very crude way of detecting save size, I run du on the repo
before and after. Though I probably should be running it on
REPO/objects...

But that of course tells nothing about the contents of the save.
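For illustration, that before/after measurement could be scripted as in this minimal sketch (the `tree_bytes` helper and the "REPO" path are placeholders, not bup tooling):

```python
import os

def tree_bytes(root):
    # Roughly what `du -sb root` reports: total file bytes under root.
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# before = tree_bytes("REPO/objects")
# ... run `bup save` ...
# after = tree_bytes("REPO/objects")
# print("save added roughly", after - before, "bytes")
```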

Stefan Monnier

Oct 13, 2023, 3:24:42 PM
to Rob Browning, bup-...@googlegroups.com
>> BTW, my main use is to just measure the "size" of individual backups
>> (so I `bup get` into a temp repo and then throw away the result).
>> I use that to keep track of their "sizes" so as to detect important
>> changes, which at least once brought to my attention that my backup
>> script was not backing up the right data any more :-).
>
> Ahh. For a very crude way of detecting save size, I run du on the repo
> before and after. Though I probably should be running it on
> REPO/objects...

That tells you about the appearance of new objects but not about the
disappearance of old objects.

> But that of course tells nothing about the contents of the save.

No, and that's what I'm after, because I already had my share of
problems where the backups "work" but don't back up some part of the
data any more, which the method you suggest doesn't catch (been there,
done that).


Stefan

Rob Browning

Oct 13, 2023, 8:21:39 PM
to Stefan Monnier, bup-...@googlegroups.com
Stefan Monnier <mon...@iro.umontreal.ca> writes:

> No, and that's what I'm after, because I already had my share of
> problems where the backups "work" but don't back up some part of the
> data any more, which the method you suggest doesn't catch (been there,
> done that).

Right -- for that I have some automated restore spot-checks of expected
paths, but that only helps with respect to those paths and/or "is this
working at all" -- and also, who's checking the checker?

Anton Khirnov

Oct 19, 2023, 5:01:44 AM
to bup-...@googlegroups.com
Quoting Rob Browning (2023-10-07 19:11:18)
> "'Stefan Monnier' via bup-list" <bup-...@googlegroups.com> writes:
>
> > Is there a convenient way to extract a backup into a pipe-able single
> > bytestream, like a tarball or cpio archive (assuming of course that the
> > backup was made with index+save rather than split)?
> >
> > It would be best if that bytestream uses a standard archive format like
> > tar/cpio, but I would settle for something like the Git pack files, so
> > I suspect it could be a simple matter of getting access to the stream
> > connecting the two ends of a `bup on get ...`, especially if I can then
> > pipe that data back to `bup` to do the actual restore (or to save it as
> > a new commit).
>
> There's not right now, I don't think. Though I have contemplated some
> "bup export" command off and on. Python includes at least basic tar
> support that could help, but I don't know how good it is.

I'd also appreciate having something like 'bup tar' - my backups live in
an unprivileged container with an idmapped mount, so certain metadata
operations fail even with fakeroot. It would be useful to have a way of
exporting a single backup's data, even when it might not be
representable on the filesystem.

--
Anton Khirnov

Stefan S

May 15, 2024, 1:09:57 PM
to bup-list
I've been looking into the same thing. I use bup save for backups
because it's the most efficient way, but when we restore, the
application expects a tar file, which I have to construct manually and
thus use basically 3 times the actual space: once for bup restore to
files, then packing it all into a tar, and finally having it extracted
again by the application.

If I could create a fifo and have bup restore stream into it, it would
use basically no additional space at all and be more efficient. I looked
into various "virtual filesystems" like archivemount and avfs, but
neither seems to work for writing tar files...
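For what it's worth, the fifo idea can be sketched with stdlib pieces on POSIX systems. This toy example (the directories and the `tar_dir_to_fifo` helper are invented stand-ins for a real restore tree) tars a directory through a named pipe, so the archive itself never lands on disk:

```python
import os
import tarfile
import tempfile
import threading

def tar_dir_to_fifo(src_dir, fifo_path):
    # Opening the write end blocks until a reader opens the fifo.
    with open(fifo_path, "wb") as out:
        with tarfile.open(mode="w|", fileobj=out) as tar:
            tar.add(src_dir, arcname="restore")

# Throwaway stand-in for a restored tree:
src = tempfile.mkdtemp()
with open(os.path.join(src, "file.txt"), "w") as f:
    f.write("data")
fifo = os.path.join(tempfile.mkdtemp(), "stream.fifo")
os.mkfifo(fifo)

writer = threading.Thread(target=tar_dir_to_fifo, args=(src, fifo))
writer.start()
# The "application" side: consume the tar stream straight from the pipe.
with open(fifo, "rb") as inp:
    with tarfile.open(mode="r|", fileobj=inp) as tar:
        names = [member.name for member in tar]
writer.join()
print(names)
```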

Stefan

Rob Browning

May 15, 2024, 10:35:03 PM
to Anton Khirnov, bup-...@googlegroups.com
Anton Khirnov <an...@khirnov.net> writes:

> I'd also appreciate having something like 'bup tar' - my backups live in
> an unprivileged container with an idmapped mount, so certain metadata
> operations fail even with fakeroot. It would be useful to have a way of
> exporting a single backup's data, even when it might not be
> representable on the filesystem.

Without having given it really *any* thought, I just wondered if you
might be able to use git bundle or bup get as a workaround to pull what
you want out of the container -- if it's OK to have enough scratch space
for the intermediate representation inside/outside the container...

Rob Browning

May 15, 2024, 10:39:11 PM
to Stefan S, bup-list
Stefan S <ste...@kalaam.org> writes:

> I've been looking into the same thing.

Hmm, I toyed with tar construction a bit a while ago, I think. I might
see about digging that up after I finish merging the first big batch of
Johannes' repository refactoring that I have pending.

Rob Browning

May 16, 2024, 7:49:58 PM
to Stefan S, bup-list
Rob Browning <r...@defaultvalue.org> writes:

> Hmm, I toyed with tar construction a bit a while ago, I think. I might
> see about digging that up after I finish merging the first big batch of
> Johannes' repository refactoring that I have pending.

Well that was "easy". I poked at it last night, and (in part thanks to
the built-in tarfile module, of course) I have a very incomplete "bup
export"(?) working, for some limited version of working.

Unsurprisingly, it's going to need notably more help before we could
really consider it for inclusion. Among other things, we'll need to
figure out the encoding question, because (as I suppose we should expect
at this point) tarfile doesn't support bytes paths/users/groups/etc.
We'll need to work around that, which I think we may be able to do via
judicious use of latin-1.
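A minimal sketch of that latin-1 round trip, using an invented, deliberately non-UTF-8 path: every byte value maps 1:1 through latin-1, so a bytes path can pass through tarfile's str-only member names and come back intact.

```python
import io
import tarfile

raw_name = b"dir/f\xffile"  # arbitrary bytes; not valid UTF-8

buf = io.BytesIO()
with tarfile.open(mode="w", fileobj=buf, encoding="latin-1") as tar:
    # Decode the bytes path to str via latin-1 just for tarfile's benefit.
    info = tarfile.TarInfo(name=raw_name.decode("latin-1"))
    tar.addfile(info)

buf.seek(0)
with tarfile.open(fileobj=buf, encoding="latin-1") as tar:
    # Re-encode with latin-1 to recover the original bytes path.
    restored = tar.getmembers()[0].name.encode("latin-1")

print(restored == raw_name)
```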

It also took me a bit to find/remember what the options are for
acls/attrs/xattrs -- I think we'll need to add some appropriate PAX
extended headers.
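For example, tarfile can already attach per-member PAX extended headers; the SCHILY.xattr.* key below follows the convention star and GNU tar use for xattrs, and the attribute name and value are invented:

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(mode="w", fileobj=buf, format=tarfile.PAX_FORMAT) as tar:
    info = tarfile.TarInfo(name="file.txt")
    # Extra key/value pairs are written as a PAX extended header record.
    info.pax_headers = {"SCHILY.xattr.user.comment": "example value"}
    tar.addfile(info)

buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    headers = tar.getmembers()[0].pax_headers

print(headers.get("SCHILY.xattr.user.comment"))
```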

So no promises, but might be able to come up with a first pass in a bit.

Rob Browning

Jun 2, 2024, 12:44:45 AM
to Stefan S, Anton Khirnov, bup-list

[Resending this to the list -- accidentally omitted before.]

Rob Browning <r...@defaultvalue.org> writes:

> Well that was "easy". I poked at it last night, and (in part thanks to
> the built-in tarfile module, of course) I have a very incomplete "bup
> export"(?) working, for some limited version of working.
>
> Unsurprisingly, it's going to need notably more help before we could
> really consider it for inclusion.

Given the possible limitations of the tarfile module, and the work
that'd be involved in getting this to a plausible state that's of high
enough quality (including testing) "for the long term", I thought I
might step back and ask both of you whether an archive export is what
you'd *really* want.

For example, would a hypothetical "bup on HOST restore ..." be as good
or better, or for the "what happened" case, some hypothetical
improvements to "bup ls", or the addition of some new command like "bup
find" or "bup du", or...?

And of course cost-wise, we should try to make sure that any new
additions are enough of an improvement to be worth it as compared to
approaches that rely on existing tools like "bup get".

Thanks

Stefan Monnier

Jun 2, 2024, 10:23:31 AM
to Rob Browning, Stefan S, Anton Khirnov, bup-list
> Given the possible limitations of the tarfile module, and the work
> that'd be involved in getting this to a plausible state that's of high
> enough quality (including testing) "for the long term", I thought I
> might step back and ask both of you whether an archive export is what
> you'd *really* want.

Personally, I'm just interested in a streamable format that's as cheap
as possible.

E.g. one of the best options might be to get access to the stream passed
between the two repositories in a `bup get` (e.g. when the target
repository starts empty). If that can be a stream whose content obeys
the format of something standard like a Git pack, that's even better,
of course.


Stefan

Rob Browning

Jun 2, 2024, 11:56:26 AM
to Stefan Monnier, Stefan S, Anton Khirnov, bup-list
Stefan Monnier <mon...@iro.umontreal.ca> writes:

> Personally, I'm just interested in a streamable format that's as cheap
> as possible.
>
> E.g. one of the best options might be to get access to the stream passed
> between the two repositories in a `bup get` (e.g. when the target
> repository starts empty). If that can be a stream whose content obeys
> the format of something standard like a Git pack, that's even better,
> of course.

Hmm. I'm not sure I understand your constraints well enough yet.

Do you care about file content at all, or are you really just looking
for the metadata? For example, conceptually, would the information
produced by something like "find RESTORE_TREE -ls" be sufficient?

Stefan Monnier

Jun 2, 2024, 1:09:07 PM
to Rob Browning, Stefan S, Anton Khirnov, bup-list
>> E.g. one of the best options might be to get access to the stream passed
>> between the two repositories in a `bup get` (e.g. when the target
>> repository starts empty). If that can be a stream whose content obeys
>> the format of something standard like a Git pack, that's even better,
>> of course.
> Hmm. I'm not sure I understand your constraints well enough yet.

The main purpose for me is to measure the sizes of snapshots.
Currently, I end up doing a `bup get` into a dummy repository, then use
`du`, but that requires actual disk space (and a fairly large amount of
it), so it's not always practical.

> Do you care about file content at all, or are you really just looking
> for the metadata?

I don't care very much about the exact definition of "size" here:
I mostly use those sizes to see how they evolve. But including the
contents (compressed or not) is definitely a plus, since it better
represents the size.


Stefan

Greg Troxel

Jun 2, 2024, 1:47:54 PM
to 'Stefan Monnier' via bup-list, Rob Browning, Stefan S, Anton Khirnov
I think it could make sense to have a 'bup restore-as-tar' that, instead
of writing to the current dir, outputs a tar stream.

But, for this need, I think we should have some sort of command that
will return the amount of space used by a backup. Perhaps both the
uncompressed space represented by the backup and the space actually used
on disk.

For space used, deduplication makes this harder but it is also very
important to understand.

My bias is that the most recent backup is primary, so I'd want all the
objects needed for the most recent commit to be allocated to it, then
the next commit to be allocated whatever objects it needs that are not
already spoken for, and so on. This is sort of like what rdiff-backup
can show.

I can see three columns (since the traversal is the costly part):

- total bytes represented by blobs in this commit
- total size of objects (commit, tree, blobs) needed to represent this
  commit
- as above, but not counting objects needed for an earlier commit,
  either:
  - shown for this name
  - shown for any name that has been shown in this run of "bup sizes"
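A minimal sketch of that attribution scheme, assuming an invented `attribute_sizes` helper over made-up object ids and sizes: walk saves newest-first and charge each object to the first save that needs it.

```python
def attribute_sizes(saves):
    """saves: list of (name, {object_id: size}) pairs, newest first."""
    seen = set()
    report = []
    for name, objects in saves:
        # Total size of everything this save references.
        total = sum(objects.values())
        # Size of objects not already spoken for by a newer save.
        new = sum(s for oid, s in objects.items() if oid not in seen)
        seen.update(objects)
        report.append((name, total, new))
    return report

saves = [
    ("2024-06-02", {"a": 10, "b": 20, "c": 5}),  # most recent save
    ("2024-06-01", {"a": 10, "b": 20}),
    ("2024-05-30", {"a": 10, "d": 7}),
]
for name, total, new in attribute_sizes(saves):
    print(name, total, new)
```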

Rob Browning

Jun 2, 2024, 2:46:59 PM
to Stefan Monnier, Anton Khirnov, bup-list
"'Stefan Monnier' via bup-list" <bup-...@googlegroups.com> writes:

> I don't care very much about the exact definition of "size" here:
> I mostly use those sizes to see how they evolve. But including the
> contents (compressed or not) is definitely a plus, since it better
> represents the size.

OK, so just as an example, and assuming you don't really want to have
the data on the remote, might some new save output format provide what
you need (reported during the save)?

For example (ignoring the details):

bup save --report SOMETHING foo
2318 foo/z.py
731 foo/x/y.py
4361 foo/a.c
...
$

Then you could traverse that stdout data to answer your questions?
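Such output would be easy to post-process; a minimal sketch over the sample lines above (the --report flag itself is hypothetical), summing per top-level directory:

```python
from collections import defaultdict

# Stand-in for the hypothetical `bup save --report` stdout: "SIZE PATH".
sample = """\
2318 foo/z.py
731 foo/x/y.py
4361 foo/a.c
"""

totals = defaultdict(int)
for line in sample.splitlines():
    size, path = line.split(None, 1)
    totals[path.split("/", 1)[0]] += int(size)

print(dict(totals))
```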

Stefan Monnier

Jun 2, 2024, 3:11:04 PM
to Rob Browning, Anton Khirnov, bup-list
> For example (ignoring the details):
>
> bup save --report SOMETHING foo
> 2318 foo/z.py
> 731 foo/x/y.py
> 4361 foo/a.c
> ...
> $
>
> Then you could traverse that stdout data to answer your questions?

That'd work, yes.


Stefan
