bup gc resources


Mark J Hewitt

Sep 11, 2021, 5:53:47 AM
to bup-list
I have not looked in detail at the implementation, so it is possible
there is already a ready analytical answer to my question, but in case
this has already been done ...

Is there any way of estimating the machine resources (time, memory,
etc.)  that 'bup gc' would need?  I ask because I have occasionally
removed unwanted branches and saves from small repositories (say 10-20G)
and 'bup gc' completes in a very reasonable time even on small
hardware.  However, on a larger repository (say 5T) this is of course
taking a long time, as I expected, but I would like to know if that
'long time' is factorially multiple lifetimes or just a few days!  Just
FYI, the virtual memory size of the 'git cat-file' process in this test
is already 3.3T and growing :-)).

Mark.

--
Mark J Hewitt

Greg Troxel

Sep 11, 2021, 7:38:29 AM
to Mark J Hewitt, bup-list

Mark J Hewitt <m...@idnet.com> writes:

A really good question.

> Just FYI, the virtual memory size of the 'git cat-file' process in
> this test is already 3.3T and growing :-)).

I guess this is file-backed vm; I don't see how that could work
otherwise.

I guess this is a clue that gc is not likely to work on 32-bit cpu
types.

Rob Browning

Sep 11, 2021, 5:40:44 PM
to Greg Troxel, Mark J Hewitt, bup-list
Greg Troxel <g...@lexort.com> writes:

> I guess this is file-backed vm; I don't see how that could work
> otherwise.

Right, that's just git mmapping "everything", I assume.

> I guess this is a clue that gc is not likely to work on 32-bit cpu
> types.

Don't know whether git might have fallbacks for that case (i.e. when
mmap fails). I have a tree here where I was working on that for bup
proper, but no idea what git does.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

Sep 11, 2021, 5:50:36 PM
to m.he...@computer.org, bup-list
Mark J Hewitt <m...@idnet.com> writes:

> Is there any way of estimating the machine resources (time, memory,
> etc.)  that 'bup gc' would need?  I ask because I have occasionally
> removed unwanted branches and saves from small repositories (say 10-20G)
> and 'bup gc' completes in a very reasonable time even on small
> hardware.  However, on a larger repository (say 5T) this is of course
> taking a long time, as I expected, but I would like to know if that
> 'long time' is factorially multiple lifetimes or just a few days!  Just
> FYI, the virtual memory size of the 'git cat-file' process in this test
> is already 3.3T and growing :-)).

I'd have to look back at the code to be sure, but offhand, I suspect
that the working set memory is likely to be proportional to the object
count (for the "have we seen this hash?" set), and the IO cost is likely
to be proportional to the repo size (i.e. perhaps to scale "linearly"(?)
with it).

And also if I recall correctly, I believe I just set that up to
"mark"/sweep, so it fist traverses the entire graph (all the blobs
reachable from the root "refs") without reading any of the "leaf" blob
data, and then it traverses all the packs, rewriting any (to drop
orphaned blobs) if the pack is more than N% orphans.
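
(As an aside for readers: a toy, self-contained mark/sweep over an
in-memory "repo", just to show the shape described above -- bup's real
version walks git packs and uses a bloom filter rather than a set:)

    repo = {                        # oid -> (kind, child oids)
        'c1': ('commit', ['t1']),
        't1': ('tree', ['t2', 'b1']),
        't2': ('tree', ['b2']),
        'b1': ('blob', []), 'b2': ('blob', []),
        'b3': ('blob', []),         # orphaned: no ref reaches it
    }
    refs = ['c1']

    live, stack = set(), list(refs)
    while stack:                    # mark: walk from the refs,
        oid = stack.pop()           # never reading blob payloads
        if oid in live:
            continue
        live.add(oid)
        kind, children = repo[oid]
        if kind in ('commit', 'tree'):
            stack.extend(children)

    dead = set(repo) - live         # sweep: just list the orphans
    print(sorted(dead))             # here; bup instead rewrites any
                                    # pack that exceeds N% orphans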

...and now that I think about it -- I wonder if we might make that a lot
more efficient by relying on git tools to do the traversal, assuming we
can get them to produce a list of the orphaned (dangling) blobs.

I believe "git fsck" can do that, but we'd have to see if it can do just
what we need. Might be worth investigation; assuming no one else beats
me to it, I'll put it on the list for "later".
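
(If git fsck does fit, the plumbing might look roughly like this -- a
sketch only, and whether fsck's notion of "unreachable" matches what gc
needs is exactly the open question:)

    import subprocess

    def unreachable_objects(git_dir):
        # "git fsck --unreachable" prints lines of the form
        # "unreachable <type> <sha1>" for objects no ref reaches.
        out = subprocess.run(
            ['git', '--git-dir', git_dir, 'fsck',
             '--unreachable', '--no-progress'],
            check=True, stdout=subprocess.PIPE).stdout
        for line in out.splitlines():
            parts = line.split()
            if parts and parts[0] == b'unreachable':
                yield parts[1], parts[2]  # (type, oid)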

Rob Browning

Sep 12, 2021, 3:53:02 AM
to m.he...@computer.org, bup-list
Rob Browning <r...@defaultvalue.org> writes:

> ...and now that I think about it -- I wonder if we might make that a lot
> more efficient by relying on git tools to do the traversal, assuming we
> can get them to produce a list of the orphaned (dangling) blobs.
>
> I believe "git fsck" can do that, but we'd have to see if it can do just
> what we need. Might be worth investigation; assuming no one else beats
> me to it, I'll put it on the list for "later".

Oh, of course, I forgot gc uses a bloom filter to track the live set
(hence the "probabilistic" in the manpage), so I'd expect it to use a
lot less memory than a precise implementation like (I'd imagine) git
fsck.
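
(Back-of-envelope, using the standard bloom-filter sizing formula; the
false-positive rate below is illustrative, not bup's actual tuning:)

    from math import log

    def bloom_bytes(n, p):
        # m = -n * ln(p) / (ln 2)^2 bits for n entries at
        # false-positive rate p
        return -n * log(p) / (log(2) ** 2) / 8

    # ~653M objects (the repo in this thread) at p = 0.1%:
    print('%.2f GiB' % (bloom_bytes(652851559, 0.001) / 2 ** 30))
    # -> ~1.09 GiB, vs. 20 bytes per sha1 (plus considerable set
    # overhead) for a precise in-memory implementation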

(I also hacked up a couple of alternate, precise approaches, one using a
python set, and another sqlite, but I set them aside, thinking we could
always add something like that later if it seemed warranted.)

Suppose it might be worth comparing our current approach's run-time and
memory use with "git fsck --connectivity-only"'s for a bigger
repository, if nothing else, just to check assumptions.

Mark Hewitt

Sep 13, 2021, 3:54:00 PM
to Rob Browning, bup-list
On 12/09/2021 08:52, Rob Browning wrote:
> Suppose it might be worth comparing our current approach's run-time and
> memory use with "git fsck --connectivity-only"'s for a bigger
> repository, if nothing else, just to check assumptions.
>
I would try the comparison here, but the 'bup gc' I started 6 days ago
is still running ...


Nix

Sep 13, 2021, 4:56:56 PM
to Rob Browning, Greg Troxel, Mark J Hewitt, bup-list
On 11 Sep 2021, Rob Browning outgrape:

> Greg Troxel <g...@lexort.com> writes:
>> I guess this is a clue that gc is not likely to work on 32-bit cpu
>> types.
>
> Don't know whether git might have fallbacks for that case (i.e. when
> mmap fails).

It does (it also works on platforms with no mmap at all, falling back to
direct reading in that case). Even repack works (very slowly) in that
situation.

--
NULL && (void)

Rob Browning

Sep 13, 2021, 9:02:38 PM
to m.he...@computer.org, bup-list
Mark Hewitt <mjh.br...@gmail.com> writes:

> I would try the comparison here, but the 'bup gc' I started 6 days ago
> is still running ...

Hmm, and can you tell whether it's mostly in the trace or sweep phase?
It'd also be interesting to see whether it's saturating the cpu or the
disk (say via "atop 3" and a wide enough terminal, or similar).

I'd guess that in the end it'd likely have a lower bound on the run time
set by the seeks (IOPs) per second available during the mark phase, or
the MB/s from disk available during the sweep phase.

But just a guess atm.
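
(To make that guess concrete, a rough estimator -- every input below is
an assumption to replace with real measurements:)

    def gc_floor_hours(metadata_objects, iops, repo_bytes, mb_per_s):
        mark = metadata_objects / iops         # ~1 seek per commit/tree
        sweep = repo_bytes / (mb_per_s * 1e6)  # streaming pack rewrite
        return mark / 3600.0, sweep / 3600.0

    # e.g. 50M commits/trees at 100 IOPS (a USB spinning disk), and
    # a 5T repo read back at 80 MB/s:
    print(gc_floor_hours(50e6, 100, 5e12, 80))
    # -> (~139 hours marking, ~17 hours sweeping)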

Mark Hewitt

Sep 14, 2021, 5:56:29 AM
to Rob Browning, bup-list
On 14/09/2021 02:02, Rob Browning wrote:
> Hmm, and can you tell whether it's mostly in the trace or sweep phase?
> Also be interesting to see if it's saturating the cpu or disk (say via
> atop 3 and a wide enough terminal or similar).
>
> I'd guess that in the end it'd likely have a lower bound on the run time
> set by the seeks (IOPs) per second available during the mark phase, or
> the MB/s from disk available during sweep phase.
>
The last output produced is

found 652851559 objects (5165/5165
pack-5ab4be864afa037055f57535aa712beb44af545e.idx)

and git cat-file is still running, so I guess this is still in mark phase.

Note that I don't expect this to be fast: this is a small service
machine dedicated only to providing a remote bup server for other
devices to use.  The repository is on an external USB drive and the
processing hardware is only an 8G Raspberry Pi 4 (with 16G swap to an
SSD, should it need it) running a 64-bit Ubuntu 20.04 OS.  The gc
performance is acceptable for smaller repositories, and for backup
saves to this larger repository too.  My question was about estimating
and setting expectations for any hardware, though that in itself can of
course lead to improvements down the line.  If this scales linearly
(and clearly, if it gets pushed into swapping, it will be anything
but!), then given that a bup gc on a 15G repository completes in about
20 minutes, I might expect maybe 5 days for this 5T repository.
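
(That extrapolation as straight arithmetic, under the same linearity
assumption:)

    small_gb, small_minutes = 15, 20
    big_gb = 5 * 1024
    days = big_gb / small_gb * small_minutes / 60 / 24
    print('%.1f days' % days)  # -> ~4.7 days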

There is nothing especially alarming in the 'atop' output - just lots
of disk reads and very low CPU usage.  Seek performance for external
USB drives is never good.  However, for 'small' deployments (home, home
office, or home worker settings rather than medium commercial ones),
external devices, possibly with a rotation strategy, would be more
normal than large local-bus or SAN storage.

As an aside - I guess it is not safe to be writing to a repository when
gc is running either - or are there sufficient locks in place to make
this a possible concurrent activity?

Mark.



Johannes Berg

Sep 14, 2021, 6:13:59 AM
to m.he...@computer.org, Rob Browning, bup-list
On Tue, 2021-09-14 at 10:56 +0100, Mark Hewitt wrote:
>
> As an aside - I guess it is not safe to be writing to a repository when
> gc is running either - or are there sufficient locks in place to make
> this a possible concurrent activity?

Jumping in just for that question - I don't think it's safe, the mark
phase in the gc wouldn't see your new commits, and that new save might
use objects that gc is then going to remove after not finding any
references in the commits it looked at.

I don't think locking would help anything, unless gc were to remove the
pack/idx from the midx/bloom before, but then a concurrent save might
save an object that gc is going to keep, leaving you with the opposite
problem (missed dedup opportunity).

johannes

Rob Browning

Sep 14, 2021, 10:58:16 PM
to m.he...@computer.org, bup-list
Mark Hewitt <mjh.br...@gmail.com> writes:

> and git cat-file is still running, so I guess this is still in mark phase.

Given your situation, yeah, I'd imagine the primary limit during the
mark phase is the "seek limit", since marking has bup jumping "all over
the place" to read each commit/tree/etc.

We've discussed the possibility of making that better (maybe a lot
better?) by writing commits, trees, and maybe .bupm files to packs that
are separate from the bulk data in order to increase the information
density (cache locality) for tree and metadata-related operations.

We could also teach gc about that separation, and support some
optimization command that just duplicates the existing treeish objects
to separate packs until the next gc, assuming that data is not "too
large" (which will be more likely once we can set the treesplit
granularity higher -- i.e. once we have more of Johannes' work
incorporated).

> As an aside - I guess it is not safe to be writing to a repository
> when gc is running either - or are there sufficient locks in place to
> make this a possible concurrent activity?

Probably not advisable right now, but I noticed a trick that I think
git gc uses, which we might want to adopt and which could help. I'll
have to think it through, though: could we ignore any pack files
created after the gc start time to allow saves to continue (or
something...)?
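
(A very speculative sketch of that idea -- and as the next message
notes, the midx/bloom handling probably makes it more complicated than
this:)

    import glob, os, time

    def packs_predating(objects_dir, gc_start):
        # Only consider packs that already existed when gc began;
        # anything a concurrent save writes later is left alone.
        pat = os.path.join(objects_dir, 'pack', '*.idx')
        for idx in glob.glob(pat):
            if os.stat(idx).st_mtime < gc_start:
                yield idx

    gc_start = time.time()
    # ... mark/sweep would then walk only
    # packs_predating(objects_dir, gc_start)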

Rob Browning

Sep 14, 2021, 11:02:26 PM
to Johannes Berg, m.he...@computer.org, bup-list
Johannes Berg <joha...@sipsolutions.net> writes:

> I don't think locking would help anything, unless gc were to remove the
> pack/idx from the midx/bloom before, but then a concurrent save might
> save an object that gc is going to keep, leaving you with the opposite
> problem (missed dedup opportunity).

Hmm, in the reply I sent just before this, I wondered about the
possibility of allowing concurrent saves maybe in part by having gc
ignore packs newer than the start of the gc, though I assumed it might
be more complicated than that -- and now that I remember that gc clears
the bloom and midx files while it's working, yeah, it's likely to be
"more complicated" :)

Johannes Berg

Sep 15, 2021, 6:39:11 AM
to Rob Browning, m.he...@computer.org, bup-list
On Tue, 2021-09-14 at 21:58 -0500, Rob Browning wrote:
> Mark Hewitt <mjh.br...@gmail.com> writes:
>
> We've discussed the possibility of making that better (maybe a lot
> better?) by writing commits, trees, and maybe .bupm files to packs that
> are separate from the bulk data in order to increase the information
> density (cache locality) for tree and metadata-related operations.

That might in fact help here, since we need to read commit/tree
objects, but only need to know that blob objects exist.

I've got most of the machinery needed for this in my tree (because I
want/need it for encrypted repositories), but it's not hooked up to the
normal git repository writing at the moment. Wouldn't be hard to do
though.

johannes

Stefan Monnier

Sep 15, 2021, 8:00:20 AM
to Johannes Berg, m.he...@computer.org, Rob Browning, bup-list
Johannes Berg [2021-09-14 12:13:56] wrote:
> On Tue, 2021-09-14 at 10:56 +0100, Mark Hewitt wrote:
>> As an aside - I guess it is not safe to be writing to a repository when
>> gc is running either - or are there sufficient locks in place to make
>> this a possible concurrent activity?
> Jumping in just for that question - I don't think it's safe, the mark

Making it safe to allow GC while the repository is modified can be
tricky, indeed, but `bup` should absolutely make sure that the user can
safely launch a backup (or any other normal operation) while a GC is in
progress (and vice versa).

This should be fairly easy to do by using a repository-wide lock: the
backup will be delayed until after the GC terminates but that's much
better than a corrupted repository.
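
(A minimal sketch of such a lock using POSIX flock; the lock-file name
is made up here, and as the reply below notes, portable/NFS-safe
locking is harder than this:)

    import fcntl, os
    from contextlib import contextmanager

    @contextmanager
    def repo_lock(repo_dir):
        # One lock file per repository; gc and save would both take
        # it, so whichever starts second simply waits.
        fd = os.open(os.path.join(repo_dir, 'repo.lock'),
                     os.O_CREAT | os.O_RDWR, 0o666)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until free
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)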


Stefan

Rob Browning

Sep 15, 2021, 10:39:20 PM
to Stefan Monnier, Johannes Berg, m.he...@computer.org, bup-list
Stefan Monnier <mon...@iro.umontreal.ca> writes:

> Making it safe to allow GC while the repository is modified can be
> tricky, indeed, but `bup` should absolutely make sure that the user can
> safely launch a backup (or any other normal operation) while a GC is in
> progress (and vice versa).
>
> This should be fairly easy to do by using a repository-wide lock: the
> backup will be delayed until after the GC terminates but that's much
> better than a corrupted repository.

...plus or minus all the problems with portable filesystem locks, nfs,
etc. If/when we try this, I'd probably start by looking at the
algorithm Debian policy requires (cf. liblockfile). But yes, it'd be
great to make that kind of thing safe where we can.

For now, though, any concurrent operations on a bup repo should be
considered potentially risky.

Mark Hewitt

Sep 21, 2021, 5:44:42 AM
to Rob Browning, Stefan Monnier, Johannes Berg, bup-list
On 16/09/2021 03:38, Rob Browning wrote:
>
> For now, though, any concurrent operations on a bup repo should be
> considered potentially risky.
>
And in my case, 'bup gc' finished on 21st September - having run for 14
days continuously.  However, it did not finish well:

found 652851559 objects (5165/5165
pack-5ab4be864afa037055f57535aa712beb44af545e.idx)
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/bup/bup/main.py", line 425, in <module>
    main()
  File "/usr/local/lib/bup/bup/main.py", line 422, in main
    wrap_main(lambda : run_subcmd(cmd_module, subcmd))
  File "/usr/local/lib/bup/bup/compat.py", line 201, in wrap_main
    sys.exit(main())
  File "/usr/local/lib/bup/bup/main.py", line 422, in <lambda>
    wrap_main(lambda : run_subcmd(cmd_module, subcmd))
  File "/usr/local/lib/bup/bup/main.py", line 417, in run_subcmd
    run_module_cmd(module, args)
  File "/usr/local/lib/bup/bup/main.py", line 300, in run_module_cmd
    import_and_run_main(module, args)
  File "/usr/local/lib/bup/bup/main.py", line 295, in import_and_run_main
    module.main(args)
  File "/usr/local/lib/bup/bup/cmd/gc.py", line 41, in main
    bup_gc(threshold=opt.threshold,
  File "/usr/local/lib/bup/bup/gc.py", line 235, in bup_gc
    live_objects = find_live_objects(existing_count, cat_pipe,
  File "/usr/local/lib/bup/bup/gc.py", line 115, in find_live_objects
    for item in walk_object(cat_pipe.get, hexlify(ref_id), stop_at=stop_at,
  File "/usr/local/lib/bup/bup/git.py", line 1437, in walk_object
    get_oidx, typ, _ = next(item_it)
  File "/usr/local/lib/bup/bup/git.py", line 1311, in get
    raise GitError('expected object (id, type, size), got %r' % info)
bup.git.GitError: expected object (id, type, size), got [b'']

Needless to say, I probably won't be trying that again on a repository
this large.

Mark.



Rob Browning

Sep 24, 2021, 12:05:18 PM
to m.he...@computer.org, Stefan Monnier, Johannes Berg, bup-list
Mark Hewitt <mjh.br...@gmail.com> writes:

>     raise GitError('expected object (id, type, size), got %r' % info)
> bup.git.GitError: expected object (id, type, size), got [b'']
>
> Needless to say, I probably won't be trying that again on a repository
> this large.

That's not good. If I'm reading it right, it appears that readline() on
the pipe connected to git cat-file returned b'', i.e. EOF. I wonder if
that might indicate that git crashed.

In any case, we should probably handle the readline() return value more
carefully, i.e. when it's EOF, at least say that.
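
(For instance, something along these lines -- a sketch against the code
in the traceback, not an actual patch:)

    class GitError(Exception):  # stands in for bup.git.GitError
        pass

    def read_header_line(pipe, git_proc):
        # pipe: git cat-file's stdout; git_proc: its Popen handle
        line = pipe.readline()
        if not line:  # b'' means EOF: cat-file went away
            raise GitError('unexpected EOF from git cat-file '
                           '(exit status %r)' % git_proc.poll())
        return line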

I assume there weren't any other git-related errors in the output? I'll
go see if we'd even expect any (and if not, we should fix that too,
i.e. make sure we always see the cat-file process' stderr).

Thanks for reporting the problem

Rob Browning

Sep 24, 2021, 12:09:11 PM
to m.he...@computer.org, Stefan Monnier, Johannes Berg, bup-list
Rob Browning <r...@defaultvalue.org> writes:

> I assume there weren't any other git-related errors in the output? I'll
> go see if we'd even expect any (and if not, we should fix that too,
> i.e. make sure we always see the cat-file process' stderr).

Looks like (via git.CatPipe.restart()) stderr should always be
visible on bup's stderr, so if git did die with any output there, I
suspect we should have seen it (likely earlier in the output).

Mark Hewitt

Sep 25, 2021, 5:17:30 AM
to Rob Browning, Stefan Monnier, Johannes Berg, bup-list
On 24/09/2021 17:08, Rob Browning wrote:
> Rob Browning <r...@defaultvalue.org> writes:
>
>> I assume there weren't any other git-related errors in the output? I'll
>> go see if we'd even expect any (and if not, we should fix that too,
>> i.e. make sure we always see the cat-file process' stderr).
> Looks looks like (via git.CatPipe.restart()) stderr should always be
> visible on bup's stderr, so if git did die with any output there, I
> suspect we should have seen it (likely earlier in the output).
>
I have a log of all the console output ... the command was actually:

/usr/local/bin/bup
--bup-dir=/bupsys/data/External-010/bupdata/a2/repository/store gc
--verbose --unsafe 2>&1 | tee -a
/home/bupuser/Sandbox/localgit/mjh/BackupTools/Bup_backup/EXPIRE.log

and it ends like this:

found 652323376 objects (5162/5165
pack-c0279ad27cc5299d5d003d07e2517dfcfe64e6d5.idx)^Mfound 652455291
objects (5163/5165
pack-370af2342e148763c395ac9097e4373dc2cb37ad.idx)^Mfound 652587166
objects (5164/5165
pack-f081ad2e3fceb71b3a425916b1e42906cde97f7a.idx)^Mfound 652718764
objects (5165/5165
pack-5ab4be864afa037055f57535aa712beb44af545e.idx)^Mfound 652851559 objects
    raise GitError('expected object (id, type, size), got %r' % info)
bup.git.GitError: expected object (id, type, size), got [b'']

There are no other errors.

Mark.

Rob Browning

Sep 25, 2021, 1:59:40 PM
to m.he...@computer.org, Stefan Monnier, Johannes Berg, bup-list
Mark Hewitt <mjh.br...@gmail.com> writes:

> and it ends like this:

[...]

>     raise GitError('expected object (id, type, size), got %r' % info)
> bup.git.GitError: expected object (id, type, size), got [b'']
>
> There are no other errors.

OK, thanks. My current guess is that either git didn't print anything,
or we lost the end of git's stderr somehow -- more likely if our "line
filtering" was involved, but your redirection to a pipe should have
prevented that because isatty() should have returned false in main.py,
preventing the filter from engaging.

In any case, I'll probably add some more diagnostics in catpipe get() so
that next time we'll at least see the git exit status.