Possible race condition in midx

19 views
Skip to first unread message

Mark Hewitt

unread,
Mar 31, 2026, 7:08:19 AM (9 days ago) Mar 31
to bup list
For a large remote repository, I was finding that some saves hung for a
long time. After a little investigation, it seemed that running midx
manually also hung on the same repository's local index-cache. To get a
little more information, I ran with BUP_DEBUG=1, and of course, this
completes. Sigh!

However, this midx completes leaving multiple (7) midx files even with
--force. Running again consolidates those to one, and which midx --check
is happy with.

This could be related to the the debug1() logging that forces a stdout
flush, or could be the _hard_write() logic it uses unblocking things
here. I have not yet discovered exactly where this is having its effect,
but if you have seen this before, maybe there is a workaround,
otherwise, I'll continue looking.

This is bup 0.33.8, and the (midx debug output) repository size is:

First Pass:

midx: 14481 indexes; want no more than 1.
midx: b''creating from 14481 files (1900530701 objects).
midx: table size: 67108864 (24 bits)

Second Pass:

midx: 14475 indexes; want no more than 1.
midx: b''creating from 14475 files (1518879293 objects).
midx: table size: 33554432 (23 bits)

The machine running this (Linux Ubuntu 24.04.4 LTS) has the file
descriptor ulimit set at 50000, and has 128G of RAM, so resource issues
are unlikely, and there are no OOM killed process logged either.

If you have any ideas where best to focus my search for this issue, I'd
be grateful as it takes a long time for each pass with this many objects!

Mark.



Rob Browning

unread,
Mar 31, 2026, 12:46:33 PM (9 days ago) Mar 31
to Mark Hewitt, bup list
"'Mark Hewitt' via bup-list" <bup-...@googlegroups.com> writes:

> For a large remote repository, I was finding that some saves hung for a
> long time. After a little investigation, it seemed that running midx
> manually also hung on the same repository's local index-cache. To get a
> little more information, I ran with BUP_DEBUG=1, and of course, this
> completes. Sigh!

By hung, I assume you mean without any cpu or IO activity? And not that
I suspect it matters, but is this Linux/ext4 or something else?

Also what was the command, just midx --force directly on the
index-cache repo?

> However, this midx completes leaving multiple (7) midx files even with
> --force.

I'm not positive, but offhand with --force and not --auto, I'm not sure
it looks at (and removes) the existing midxes. e.g. for current 0.33.x:

https://codeberg.org/bup/bup/src/commit/0fd0f5a8e928c9c93900f9d6f54fec61c855614a/lib/bup/cmd/midx.py#L202

But I may be overlooking something else.

> This could be related to the the debug1() logging that forces a stdout
> flush, or could be the _hard_write() logic it uses unblocking things
> here.

I'd be a little surprised if _hard_write helps avoid the hang; I'd only
expect it to reveal more output before a hang. (I've also wondered
lately if we still need it at all...)

I'd also be a bit surprised if logging affects whether or not it hangs
overall. My current vague guess is that the cause is "something else"
that's just not deterministic. So I'd guess that if you wanted to,
adding some more log() statements in bup.midx or bup.cmd.midx to try to
narrow down where it's hanging might be fine.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Mark Hewitt

unread,
Mar 31, 2026, 2:02:33 PM (9 days ago) Mar 31
to Rob Browning, bup list
Hi Rob,

"Hung" means that the command sits around apparently doing nothing but
taking a small amount of CPU and no IO. That particular midx (originally
called automatically from the save but reproducible when called
directly) normally completes in a little under an hour when there is no
existing midx files to work from, but "hung" means I have waited days
and it is still sat there.

I'm currently working through variously instrumented versions of bup,
and I'll take a look with strace or something when it hangs again. I did
look with strace (also truss on a FreeBSD system) when I first noticed,
it it seemed like a lot of pipe waits on one of two threads with rapidly
expiring timeouts looping around - but at that point, I did not dig too
deeply as I suspected just a resource issue.

With regard to filesystems - actually I first noticed this on a
FreeBSD/xfs system, but I transferred the local (index cache) repository
to a larger Linux machine (again, thinking probably resource issues)
with an ext4 fs and reproduced the problem there.

And yes, investigating on the index-cache directly, with --force (though
the internal invocation is obviously just with --auto).

Yes, I'd be surprised if the stdout flushes or select-based
_hard_write() was directly responsible, but that was an avenue to
investigate. And yes, possibly nondeterministic - my favourite kind of
bug swatting!

Mark.

Mark Hewitt

unread,
Mar 31, 2026, 6:39:09 PM (8 days ago) Mar 31
to bup-...@googlegroups.com
Hmm ... I guess this might be relevant:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/bup/bup/main.py", line 384, in <module>
    main()
  File "/usr/local/lib/bup/bup/main.py", line 381, in main
    wrap_main(lambda : run_subcmd(cmd_module, subcmd))
  File "/usr/local/lib/bup/bup/compat.py", line 94, in wrap_main
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/bup/bup/main.py", line 381, in <lambda>
    wrap_main(lambda : run_subcmd(cmd_module, subcmd))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/bup/bup/main.py", line 376, in run_subcmd
    run_module_cmd(module, args)
  File "/usr/local/lib/bup/bup/main.py", line 297, in run_module_cmd
    return import_and_run_main(module, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/bup/bup/main.py", line 292, in import_and_run_main
    module.main(args)
  File "/usr/local/lib/bup/bup/cmd/midx.py", line 318, in main
    do_midx_dir(path, opt.output, byte_stream(sys.stdout),
  File "/usr/local/lib/bup/bup/cmd/midx.py", line 252, in do_midx_dir
    all = list(do_midx_group(path, outfilename, part1,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/bup/bup/cmd/midx.py", line 272, in do_midx_group
    rv = _do_midx(outdir, outfilename, sublist, gprefix,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/bup/bup/cmd/midx.py", line 170, in _do_midx
    count = merge_into(fmap, bits, total, inp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: total too big for unsigned int

Mark


Rob Browning

unread,
Apr 1, 2026, 2:54:42 AM (8 days ago) Apr 1
to Mark Hewitt, bup-...@googlegroups.com
"'Mark Hewitt' via bup-list" <bup-...@googlegroups.com> writes:

>   File "/usr/local/lib/bup/bup/cmd/midx.py", line 170, in _do_midx
>     count = merge_into(fmap, bits, total, inp)
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> OverflowError: total too big for unsigned int

Without looking closely yet, I think that may be the total object count
over all the indexes (and midxes?) it's processing. How does "git count
objects -v" look for the repo?

Mark Hewitt

unread,
Apr 1, 2026, 3:59:45 AM (8 days ago) Apr 1
to bup-...@googlegroups.com
This error is just coming from the local index-cache: not sure I can
convince git to see it as a git repository as there are no packs here.
So for the index-cache repository, I only see

count: 0
size: 0
in-pack: 0
packs: 0
size-pack: 0
prune-packable: 0
garbage: 0
size-garbage: 0

For the remote repository to which it relates, I see

warning: garbage found:
/mnt/BUPDATA-06/bupdata/repository/megastore//objects/pack/bup.bloom
warning: garbage found:
/mnt/BUPDATA-06/bupdata/repository/megastore//objects/pack/midx-d5cf67704d1a88a2254d6b5c1bca69cc429b393d.midx-phdv3ljn
warning: garbage found:
/mnt/BUPDATA-06/bupdata/repository/megastore//objects/pack/midx-a55cef90234dd29e6b642b7c99501a6dc6cf27be.midx
warning: garbage found:
/mnt/BUPDATA-06/bupdata/repository/megastore//objects/pack/midx-3a1cf85444f6312724b0481eb62db4c5ed0b84b1.midx-1f8nyuy7
warning: garbage found:
/mnt/BUPDATA-06/bupdata/repository/megastore//objects/pack/midx-eee3e98203c166192d941506b9f6caf0188f0ef0.midx
count: 0
size: 0
in-pack: 1524990918
packs: 14543
size-pack: 11168946133
prune-packable: 0
garbage: 5
size-garbage: 37867141

However, I think this may just be because of a left-over midx file in
the index-cache that was not cleared, as I have

-rw-rw-r-- 1 mjh mjh 72974762439 Mar 31 23:25
midx-a31f6a32ad71c26ffb78d753f54291900247848e.midx

which is twice the size of those that I can build here (with debugging
enabled).

Mark.

Rob Browning

unread,
Apr 1, 2026, 12:27:12 PM (8 days ago) Apr 1
to Mark Hewitt, bup-...@googlegroups.com
"'Mark Hewitt' via bup-list" <bup-...@googlegroups.com> writes:

> This error is just coming from the local index-cache: not sure I can
> convince git to see it as a git repository as there are no packs here.

Right, for that something like this should at least show the object
counts:

for idx in $(find -name "*.idx"); do git show-index < "$idx" | wc -l; done

I was looking for a quick impression of what the total really ought to
be since merge_into appears to be claiming it won't fit in (presumably
for your arch) 32-bits.

And while I'll need to double check, with --force and without --auto, I
don't think it'll be including any midx files in the total.

Rob Browning

unread,
Apr 1, 2026, 1:15:20 PM (8 days ago) Apr 1
to Mark Hewitt, bup list
"'Mark Hewitt' via bup-list" <bup-...@googlegroups.com> writes:

> I'm currently working through variously instrumented versions of bup,
> and I'll take a look with strace or something when it hangs again.

Given the OverflowError you reported, perhaps also include a
log(f'...{total}...\n') before the merge_into call in the instrumented
version if you can.

Mark J Hewitt

unread,
Apr 1, 2026, 6:00:06 PM (7 days ago) Apr 1
to bup-...@googlegroups.com
This is getting confusing.

Your suggested 
for idx in $(find -name "*.idx"); do git show-index < "$idx" | wc -l; done

produces 13375 lines of the individual index counts. If you meant

for idx in $(find -name "*.idx"); do git show-index < "$idx"; done | wc -l

then the answer is 1518879293.

Regarding the OverflowError, this is running with both --auto and --force here. Though apparently completing successfully (and an exit code of 0), a previous midx invocation left behind this:

-rw-rw-r--  1 mjh  mjh  72974762439 Mar 31 23:25 midx-a31f6a32ad71c26ffb78d753f54291900247848e.midx

which is the apparent cause of the subsequent overflow. If I remove this, the generated midx is then

-rw-rw-r-- 1 mjh mjh 36487381225 Apr  1 17:07 midx-f20258f451b0dce6ee866fd3e27805f03754c552.midx

and now (nondeterminism, or at least ignorance of relevant parameters is at work here), can now be built without the debugging.

So several weird things on a large repository, but nothing as concrete now as it was a week ago when I found the apparently hung "bup midx" processes, and the ease with which I could reproduce that on another machine with a copy of the repo.

I'll continue digging - I'll add your suggested line below soon.

Mark


On 01/04/2026 18:15, Rob Browning wrote:
Given the OverflowError you reported, perhaps also include a
log(f'...{total}...\n') before the merge_into call in the instrumented
version if you can.


-- 
Mark J Hewitt

Rob Browning

unread,
Apr 1, 2026, 6:14:55 PM (7 days ago) Apr 1
to m.he...@computer.org, bup-...@googlegroups.com
Mark J Hewitt <m...@idnet.com> writes:

> for idx in $(find -name "*.idx"); do git show-index < "$idx" | wc -l; done
>
> produces 13375 lines of the individual index counts. If you meant
>
> for idx in $(find -name "*.idx"); do git show-index < "$idx"; done | wc -l
>
> then the answer is 1518879293.

Yeah, I meant the total number of lines across all the show-index
commands, but pasted the wrong thing. I really just wanted to be sure
an overflow wasn't plausible from the indexes themselves.

> which is the apparent cause of the subsequent overflow. If I remove this, the generated midx is then
>
> -rw-rw-r-- 1 mjh mjh 36487381225 Apr 1 17:07 midx-f20258f451b0dce6ee866fd3e27805f03754c552.midx
>
> and now (nondeterminism, or at least ignorance of relevant parameters
> is at work here), can now be built without the debugging.
>
> So several weird things on a large repository, but nothing as concrete
> now as it was a week ago when I found the apparently hung "bup midx"
> processes, and the ease with which I could reproduce that on another
> machine with a copy of the repo.

Just a guess, but I was wondering if the hang might actually be a
client/server issue where something crashes (like that OverflowError),
but we manage not to exit, and so the client or server just blocks
waiting on a reply that'll never come.

(And I believe we have in fact made some things in that general area
(hopefully) more reliable in main, though I don't recall anything *midx*
specific offhand.)

> I'll continue digging - I'll add your suggested line below soon.

So you don't think that overflow was the only issue?

(We may also need improvements to bup-midx(1).)

Thanks
Reply all
Reply to author
Forward
0 new messages