binary search debugging of compilers

Russ Cox

unread,

May 12, 2023, 9:34:41 PM5/12/23

to

Hi all,

In the Go compiler, we have been using a technique for the past eight
years that I assume has been done in other compilers before, but I'm
unaware of any. I describe it below, along with refinements we've made
over the years. I'd be very interested to hear about earlier uses of
approaches like this, or of any other ideas for use cases.

In 2015, Keith Randall was working on a new SSA backend for the Go
compiler. The old and new backends coexisted, and we could use either
for any given function in the compilation. Keith introduced an
environment variable GOSSAHASH that specifies the last few binary
digits of the hash of function names that should use the new backend.
So GOSSAHASH=0110 means compile only those functions whose names hash
to a value with last four bits 0110. When a test is failing with the
new backend, you try GOSSAHASH=0 and GOSSAHASH=1, and then use binary
search to progressively refine the pattern, narrowing the failure down
until only a single function is being compiled with the new backend.
This was invaluable for approaching failures in large real-world tests
(tests for libraries or production code, not for the compiler) that we
had not written and did not understand.

In 2016, David Chase was debugging a new optimizer rewrite rule that
should have been correct but was causing mysterious test failures.
He reused the same technique at finer granularity: the bit pattern now
matched against a hash of the current file name and line number,
so that the binary search pinpoints the exact line of code that is
miscompiled (meaning causes a test failure) when the suspect rewrite
rule is used.

Since then we have continued to find uses for this approach.
For example, when we added automatic FMA insertion on certain
architectures, some tests started failing. By making the rewrite rule
controlled by a file:line hash pattern, we can pinpoint the exact line
or lines where FMA insertion causes the failure.

As another example, we are considering a small change to the semantics
of variables declared in for loops, along the lines of what C# 5 did
and what JavaScript does in for-let loops. By using a hash pattern to
toggle whether the new semantics applies at a given file:line, we can
identify the specific loops where the semantic change provokes a
test failure.

The binary search does not have to stop at a single failure. With a
richer pattern syntax adding an AND-NOT operator, we can disable the
instances we've found, and if the test still fails, run a new binary
search to find another culprit. (Technically, AND-NOT is not necessary
for doing a complete binary traversal of the search space guided by
single failures, but it is easy, and it is necessary for the next step.)

The binary search can also be adapted to finding pairs or larger
groups of changes, all of which are necessary to provoke a failure.
If suffix B provokes the failure but neither 0B nor 1B do, then
(assuming the test is not flaky!) you start using the pattern “0B OR 1B”,
refine the left half of the expression with binary search
(for example the first step checks “00B OR 1B” versus “10B OR 1B”),
and then having refined the left half, move on to refining the right half.
After refining both, you have an OR of patterns that each match one
change, and all the changes are necessary to provoke the failure.
The splitting naturally recurses as needed to find a locally minimal set
of changes, in the sense that by construction, removing any single
change would not provoke the failure.

Stepping slightly away from compilers temporarily, we have recently
realized that this technique can be adapted to identifying specific
bug-inducing contexts for library functions as well. For example,
suppose we want to introduce a new qsort implementation, but that
causes tests to fail in a large program (maybe even a compiler) that
makes use of qsort in many contexts. Binary search selecting the old
or new algorithm according to a hash of the call stack will quickly
identify the exact call stack affected by the new qsort.

As I mentioned at the top, I am interested to hear about earlier uses
of approaches like this, or any other ideas for situations where the
approach might be applicable. Clearly the general problem has overlaps
with group testing [1], and in particular Hwang's binary search-based
approach (1972) [2]. I've also found one paper describing this
algorithm in the context of identifying which of a large set of
changes between GDB versions caused a crash (1999) [3].
(That paper's version of the algorithm for finding pairs or larger
groups is buggy unless there is only one such group.)

It seems like the idea of binary search to narrow down a buggy
software change is so natural that there must have been many earlier
uses for debugging before 1999, especially in compilers.

Thanks for any references, interesting history, or ideas.

If anyone is interested in code, we've published the binary search tool [4]
and the code that interprets the patterns [5].

Best,
Russ

[1] https://en.wikipedia.org/wiki/Group_testing (a remarkably good article)
[2] https://www.jstor.org/stable/2284447
[3] https://dl.acm.org/doi/10.1145/318774.318946
[4] https://pkg.go.dev/golang.org/x/tools/cmd/bisect
[5] https://pkg.go.dev/golang.org/x/tools/internal/bisect

Fernando

unread,

May 13, 2023, 11:15:53 AM5/13/23

to

Hi Russ, that's very interesting. You should consider submitting a report of
the technique to CGO! The next deadline is on the 19th, but then there will be
another deadline on September 1st.

As for a related work, see John Regehr's research on test case reduction for
compilers [1].

[1] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, Xuejun Yang:
Test-case reduction for C compiler bugs. PLDI 2012: 335-346

Regards,

Fernando

Kaz Kylheku

unread,

May 13, 2023, 11:16:31 AM5/13/23

to

On 2023-05-12, Russ Cox <r...@swtch.com> wrote:
> In 2015, Keith Randall was working on a new SSA backend for the Go
> compiler. The old and new backends coexisted, and we could use either
> for any given function in the compilation. Keith introduced an
> environment variable GOSSAHASH that specifies the last few binary
> digits of the hash of function names that should use the new backend.
> So GOSSAHASH=0110 means compile only those functions whose names hash
> to a value with last four bits 0110. When a test is failing with the
> new backend, you try GOSSAHASH=0 and GOSSAHASH=1, and then use binary
> search to progressively refine the pattern, narrowing the failure down
> until only a single function is being compiled with the new backend.
> This was invaluable for approaching failures in large real-world tests
> (tests for libraries or production code, not for the compiler) that we

> had not written and did not understand. ...

> It seems like the idea of binary search to narrow down a buggy
> software change is so natural that there must have been many earlier
> uses for debugging before 1999, especially in compilers.

I have used binary searching to find bugs in the TXR Lisp compiler.

Say for the sake of simplicity that I already have a pretty reliable
idea in which file the miscompiled function resides.

Because the compiler executes the top-level forms that it compiles,
I can go to 50% of that file (literally by typing 50% in Vim).
Then stick a (setf *opt-level* 0) at that line. The bottom half
of the file is then compiled without optimizations. If the bug goes
away, then something is being mis-optimized in the second half.
And so it goes.

Another trick I have used is to suppress the compilation of
specific forms according to this pattern:

(defun fun (...) ...) --> (eval '(defun fun () ...))

I.e. take the form, and eval its quoted version:

form -> (eval 'form)

So the compiler now is compiling nothing more than an eval call which
takes a canned code literal as its argument. When the resulting compiled
file is loaded, fun will be an interpreted function, not a compiled
fucntion: the compiled version of the eval call executes, interpreting
the defun form, resulting in an interpreted function being defined. If
the bug goes away, that function was being miscompiled.

(This works in Lisp dialects that have a pure interpreter behind eval;
dialects whose eval is actually a compiler may not support this
technique well.)

The hash technique seems nice. If we have no idea where in the
code base the miscompilation is happening, it seems like it could help;
it seems easier than some of the following techniques.

I often have a good idea where in the code base the problem is because
during the rebuild of TXR, files are compiled one by one. As each file
in the stdlib/ is compiled, the next compile job now uses that file (if
it is implied for loading). So more ofen than not, as soon as we
miscompile some file that used during compilation, the compilation of
the next file will misbehave. Or possibly the same file. The compiler
evaluates what it compiles (unless told not to). If the compiler is
compiling some code that it's also itself using, it may misbehave soon
after miscompiling some of it, before that file is done compiling.

A useful technique used sometimes is to substitute known-good compiled
files. If the problem goes away if we substitute a known-good
compiled file (touching the timestamps so that it's not clobbered
with a recompile), we can suspect that the file is being miscompiled.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazi...@mstdn.ca

Kaz Kylheku

unread,

May 14, 2023, 12:20:54 PM5/14/23

to

On 2023-05-13, Fernando <pron...@gmail.com> wrote:
> Hi Russ, that's very interesting. You should consider submitting a report of
> the technique to CGO! The next deadline is on the 19th, but then there will be
> another deadline on September 1st.

We can summarize it briefly.

Background:

Suppose we have some data processing technique which transforms N
entities. A new version of the technique changes how all the entities
are transformed. One or more of them are transformed erroneously,
resulting in a system regression (one which doesn't readily identify
the bad element). Because all elements are transformed with the new
technique, there are big differences in all of them compared to the
old technique: comparing old and new is futile.

Setup:

We can (somehow) enumerate the N entities with integers, which we
express using a pure binary enumeration. Then, we can discover the
number of an erroneously processed element, one bit at a time.

Procedure:

First we treat all elements numbered ..XXXXXX0 using the new technique
technique, and all others using the the old technique. If the
system fails, we know that it's the ..XXXXX0 set which has one or more
badly processed elements. Otherwise it's the other set, whose
enumerations end in a 1. Either way, we have discovered the identity of
the last bit.

Suppose that discovered bit is 1. We then proceed with the next bit: we
process ..XXXXXX01 entities with the new technique and all others
with the old technique. If the problem reproduces, we have narrowed
to the ...XXXXX01 set, otherwise to the .XXXXX11 set.

In this manner we discover every bit, at which point we have
a useful piece of information: the identity of a misprocessed element.
We can then focus the investigation on how it is misprocessed.

Thomas Koenig

unread,

May 14, 2023, 10:42:02 PM5/14/23

to

Russ Cox <r...@swtch.com> schrieb:

> As I mentioned at the top, I am interested to hear about earlier uses
> of approaches like this, or any other ideas for situations where the
> approach might be applicable. Clearly the general problem has overlaps
> with group testing [1], and in particular Hwang's binary search-based
> approach (1972) [2].

For GCC regression-hunting, two techniques are routinely used.

One of them is the use of "gcc bisect", described at
https://git-scm.com/docs/git-bisect . If there is a failing test
case, this can (at the cost of some CPU time) usually pinpoint
offending commit.

For reducing test cases, at least for C and C++, cvise
https://github.com/marxin/cvise is now the mehod of choice; it is
brutally efficient at reducing test cases to an absolute minimum.
Because it can transform C and C++ programs, it can make
transformations of the program which are language-aware.

Both are tools external to the compiler.

gah4

unread,

May 14, 2023, 10:43:38 PM5/14/23

to

On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:

(snip)

> Setup:
>
> We can (somehow) enumerate the N entities with integers, which we
> express using a pure binary enumeration. Then, we can discover the
> number of an erroneously processed element, one bit at a time.
>
> Procedure:
>
> First we treat all elements numbered ..XXXXXX0 using the new technique
> technique, and all others using the the old technique. If the
> system fails, we know that it's the ..XXXXX0 set which has one or more
> badly processed elements. Otherwise it's the other set, whose
> enumerations end in a 1. Either way, we have discovered the identity of
> the last bit.

There is the assumtion that only one of the N is in error.

Reminds me of ECC memory, which can correct one bit errors, and detect
two bit errors. After log(N) tests, you find the one that it is
supposed to be, fix it, and it either works or doesn't.

Then you work on a new set of tests.

Or start out with a more complicated set, which can learn more in one set.

Cameron McInally

unread,

May 15, 2023, 12:15:47 PM5/15/23

to

On Sun, May 14, 2023 at 22:42 Thomas Koenig <tko...@netcologne.de> wrote:

> Russ Cox <r...@swtch.com> schrieb:
>
> > As I mentioned at the top, I am interested to hear about earlier uses
> > of approaches like this, or any other ideas for situations where the
> > approach might be applicable. Clearly the general problem has overlaps
> > with group testing [1], and in particular Hwang's binary search-based
> > approach (1972) [2].
>
> For GCC regression-hunting, two techniques are routinely used.
>
> One of them is the use of "gcc bisect", described at
> https://git-scm.com/docs/git-bisect
> . If there is a failing test
> case, this can (at the cost of some CPU time) usually pinpoint
> offending commit.

Ah, this is close to my heart. The story I heard was that automated
regression bisection originated in the Cray compiler group around the
mid 1990s. I didn't join that team until the aughts, but can attest to
the quality of Cray's system, which is still in use today AFAIK.

https://ieeexplore.ieee.org/document/625082

For the general case of reducing runtime errors, the Wolf Fencing algorithm
would be a good thread to pull for uncovering the history.

https://dl.acm.org/doi/abs/10.1145/358690.358695

Hope that helps,
Cam

Kaz Kylheku

unread,

May 15, 2023, 9:53:10 PM5/15/23

to

On 2023-05-14, Thomas Koenig <tko...@netcologne.de> wrote:
> Russ Cox <r...@swtch.com> schrieb:
>
>> As I mentioned at the top, I am interested to hear about earlier uses
>> of approaches like this, or any other ideas for situations where the
>> approach might be applicable. Clearly the general problem has overlaps
>> with group testing [1], and in particular Hwang's binary search-based
>> approach (1972) [2].
>
> For GCC regression-hunting, two techniques are routinely used.
>
> One of them is the use of "gcc bisect", described at
> https://git-scm.com/docs/git-bisect . If there is a failing test
> case, this can (at the cost of some CPU time) usually pinpoint
> offending commit.

The technique described by Russ Cox is applicable when you know what
the offending commit is. E.g. you made some changes, say, to the
optimizer, and something is going wrong when you boostrap the compiler.
Obviously, the optimization changes are breaking something; the prior
commit works. But you don't understand what is breaking where.

>
> For reducing test cases, at least for C and C++, cvise
> https://github.com/marxin/cvise is now the mehod of choice; it is
> brutally efficient at reducing test cases to an absolute minimum.
> Because it can transform C and C++ programs, it can make
> transformations of the program which are language-aware.

This is only applicable if you have an external test case which
reproduces some compiler problem (like a crash) and would like
to make it smaller.

The binary search technique described by Cox is uniquely applicable
in the situation that something goes wrong with the bootstrapping
of a compiler.

1. A bug has been introduced into the compiler source.
You already know exactly which commit did that, and
what change is responsible, due to having done the git bisect
or whatever. Maybe this is just new, uncommited work that is
breaking compared to the working HEAD.

2. Consequently, the compiler miscompiles something in its own
source code (anywhere in its source: in the compiler proiper, or
standard library support routines of that language which are
used by the compiler.)

3. Consequently, the compiled compiler misbehaves somehow.
(Or possibly, the boostrapping even fails: compiled code is
loaded into the compiler's image, and it cannot continue
the boostrapping process.)

Characterizing what is wrong in (3) is not too difficult (though it can
be). The problem is that the issues exhibited in 3 are in good parts of
the compiler. So they are red herring issues. If you look at the source
code for those parts, it is fine and has not been touched. Rather, the
object code is incorrect due to the introduced compiler problem.

You might know the commit which introduces the problem, yet be stumped
as to how, due to the complexity of all the pieces and how they
interact, and the difficulty of some of the algorithms. (Plus hidden
problems, like the new work actually being correct, but exposing a bug
in existing code; so staring at just the new code until you are blue
does no good.)

Cox's trick lets you turn on the new, known-bad code only for a
seletected subset of functions (or whatever logical units of the image).

You can iteratively shrink the subset until it contains only one
mistranslated unit.

Then you know---aha!--the bad optimizer change is, say, mistranslating
the partition function in stdlib/quicksort.mylang. If just that function
is treated with the buggy new change, and nothing else, the show grinds
to a halt.

So now you know you can focus your investigation on what the buggy
compiler change is doing to the partition function in
stdlib/quicksort.mylang.

You can write test cases exercising that function to understand
how its behavior has gone wrong, and study the object code to
see how it's organized to bring about the wrong behavior, and how
that relates the bad new code in the compiler.

You can tweak the code, and build the image with just that partition
function being processed with the new code to see whether the regression
has gone away, and iterate on that. Then try boostrapping the entire
image with the repaired code.

Etc.

The bug can show up as a mistranslation of anything, which is why I
picked the partition helper of quicksort example. That could wreck the
compiler's behavior, if some logic in it somemwhere depends on
sorting something. You could really be on a wild goose chase trying
to debug *that*, rather than it completely unrelated root cause.

> Both are tools external to the compiler.

Right; so not applicable to this internal technique where basically we
subdivide the compiler's own image into "parts compiled with the
known-good compiler" and "parts compiled with the new, known-bad change
in effect", where the parts are on a finer granularity than files.

Kaz Kylheku

unread,

May 15, 2023, 9:54:08 PM5/15/23

to

On 2023-05-14, gah4 <ga...@u.washington.edu> wrote:
> On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:
>
> (snip)
>
>> Setup:
>>
>> We can (somehow) enumerate the N entities with integers, which we
>> express using a pure binary enumeration. Then, we can discover the
>> number of an erroneously processed element, one bit at a time.
>>
>> Procedure:
>>
>> First we treat all elements numbered ..XXXXXX0 using the new technique
>> technique, and all others using the the old technique. If the
>> system fails, we know that it's the ..XXXXX0 set which has one or more
>> badly processed elements. Otherwise it's the other set, whose
>> enumerations end in a 1. Either way, we have discovered the identity of
>> the last bit.
>

> There is the assumption that only one of the N is in error.

You might think, but it's not actually a problem with the algorithm.

When we bisect a change history, there is the assumption that the
bug showed up, so there are good commits before that, and bad commits
after. If the history is not like that: the bug disappears and
reappears, then the bisect will not identify *the* bad commit reliably,
because there isn't one.

I have a story like this; but it digresses from my present point, so
I will return to it below.

In this situation, we are bisecting a space of N entities. All we need
is to find one bad one. If there are three bad ones, it doesn't matter
which one we find.

Say we have a basket of apples with a rotten smell emanating from it.
We can subdivide it, and smell one half and the other. If both halves
smell, we know we have two or more rotten apples and they ended up
in different halves. This doesn't matter. We just pick one half and
keep subdividing. As long as we stay on the trail of the rotten scent, we will
get down to one rotten apple, and we can use that apple to analyze further:
what kind of mould or bacterium has infected it and so on. Probably
the other rotten apples have the same root cause. If they have different root
causes, we can do another search after fixing the one root cause we have found.

Now about the story. The conclusion of the story was that there was
a GC-related bug in an ash (arithmetic shift) function, dealing with
heap-allocated bignum integers.

A bignum integer was being prematurely reclaimed by the garbage
collector.

The fix looked like this:

commit 9cf992835c8a0f2e6c4ace07b67fea2acb762cc5
Author: Kaz Kylheku <k...@kylheku.com>
Date: Wed Jun 19 23:21:39 2019 -0700

ash: gc problem.

On 32 bit x86 Solaris 10, with gcc 4.9.0, this issue caused a
miscompilation of the pset macro, due to ash abruptly
returning zero, causing the op-code field of an instruction to
be NOOP rather than GCALL.

I'm committing this fix just for reference; I will immediately
replace it with a refactoring of the function.

* arith.c (ash): Add a gc_hint to prevent the a bignum from
being reclaimed if the b = make_bignum() triggers gc.
This happens in the case when the previous case computes
a = make_bignum() and falls through.

diff --git a/arith.c b/arith.c
index fdb294b7..f374ddf4 100644
--- a/arith.c
+++ b/arith.c
@@ -3235,6 +3235,7 @@ val ash(val a, val bits)
b = make_bignum();
if ((mpe = mp_shift(mp(a), mp(b), bn)) != MP_OKAY)
break;
+ gc_hint(a);
return normalize(b);
default:
goto bad3;

The problem only showed upon Solaris (first red flag). When I git bisected it,
it was randomly appearing and disappearing, so git bisect was of no use.

Not all problems introduce themselves in a way that there is a reproducible
test case.

Here, the garbage collector had to go off exactly at the wrong time during the
execution of the ash function. Small perturbations to completely unrelated code
can make it go away.

In the end, I had to debug it on Solaris (no working gdb or anything), using
the repro that I had.

Once I discovered the bad VM instruction, which was affecting the behavior of a
certain Lisp macro in the standard library, I traced it through the compiler
right down to the assembler, where the opcode field for a good assembly
instruction was being miscalculated. I could not reproduce the bad ash
calculation though in isolation.

In summary:

1. non-gc-aware, sloppy coding in ash function caused ...
2. rare problem in assembler's masking and shifting calculations, causing ...
3. Bad instruction when compiling the "pset" (parallel assignment) macro ...
4. Which failed some code relying pset.

All of this reproduced only one one platform, one particular commit.
Small perturbations of Lisp code just about anywhere made it go away.
Bisect was useless.

gah4

unread,

May 17, 2023, 1:04:03 PM5/17/23

to

On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:

(snip)

> Say we have a basket of apples with a rotten smell emanating from it.
> We can subdivide it, and smell one half and the other. If both halves
> smell, we know we have two or more rotten apples and they ended up
> in different halves. This doesn't matter. We just pick one half and
> keep subdividing.

But the algorithm described, at least as I remember it, doesn't test both
halves.

Sniffing two baskets doesn't take so long, but two tests might.

I was suspecting that there were more efficient ways than doing
both halves, though didn't try to figure out what they might be.

> As long as we stay on the trail of the rotten scent, we will
> get down to one rotten apple, and we can use that apple to analyze further:
> what kind of mould or bacterium has infected it and so on. Probably
> the other rotten apples have the same root cause. If they have different
root
> causes, we can do another search after fixing the one root cause we have
found.

I don't know apple statistics so well. If you suspect more than one from the
beginning,

As a binary search, it should be 50% probability on each test.
If you see higher than 50% as the tests go on, it might look
suspicious already.

Russ Cox

unread,

May 17, 2023, 1:15:15 PM5/17/23

to

Thanks to everyone for the replies and pointers.

As far as 'git bisect' is concerned, we have found 'git bisect'
incredibly useful too, of course. It is clearly another instance of
binary search to find bugs, but it is binary search over time, while
the system I described is binary search over the program being
compiled. "What change broke the compiler?" vs "What code does that
change miscompile?". The two are independent and complementary.

I had originally not mentioned binary search over time because it
seemed non-compiler-specific, but as Cameron McInally pointed out,
one of its independent inventions was at Cray Research in the mid-1990s
in a compiler context [1]. I reached out to Brian Ness to find out more
of the history of that work, and I've included his response below,
with permission. The other independent invention I know of was by
Larry McVoy, also in the mid-1990s, in the context of Linux kernel
development [2]. Linux did not have any kind of whole-tree source
control snapshots, but they did post frequent enough manual snapshots
to make the technique applicable. I also reached out to Larry McVoy for
more of the history of that work, and I've included his response below,
also with permission. That branch of the history led directly to
'git bisect' [3].

In general, binary search for narrowing a problem being debugged was
obviously well known by the mid-1990s. At that time, whole snapshot
source history was becoming common enough that the idea of binary
search over it was inevitable. I expect there are other instances of
independent invention we don't know about.

Best,
Russ

[1] https://ieeexplore.ieee.org/document/625082
[2] https://elixir.bootlin.com/linux/1.3.73/source/Documentation/BUG-HUNTING
[3] https://groups.google.com/g/fa.linux.kernel/c/cp6abJnEN5U/m/5Z5s14LkzR4J

---------- Forwarded message ---------
From: Russ Cox <r...@swtch.com>
Date: Mon, May 15, 2023 at 1:18 PM
Subject: history of source control binary search for debugging

Hi Larry and Brian,

I am interested in the origin of the idea of binary search over source
control history running tests to identify the change that introduced a
given failure. I have traced it back as far as the mid-1990s with
Larry's Documentation/BUG-HUNTING file in the Linux kernel
(March 1996) and Brian's COMPSAC'97 paper with Viet Ngo "Regression
containment through source change isolation" (August 1997).
(I have been unable to find Viet Ngo's email address or I would add
him to this thread.)

Can you share any background about when you first encountered the idea,
or any pointers to earlier systems or engineers I might reach out to?
Thanks very much.

Best,
Russ

---------- Forwarded message ---------
From: Larry McVoy <l...@mcvoy.com>
Date: Mon, May 15, 2023 at 1:57 PM
Subject: Re: history of source control binary search for debugging

Good to hear from you Russ.

I'm pretty sure I "invented" that idea, which means I hadn't seen it before.
All I was doing was using the fact that binary search is log(n). And giving
non-kernel people a way to do some grunt work and then get it close and
then hand that off to the kernel people.

But note that the BUG-HUNTING was using snapshot, not source management.
BitKeeper hadn't been invented yet and the other source management systems
sucked because they were reproducible only at tagged points. At least
that was true of CVS.

When BitKeeper came along, we added this:

--longest Restrict the deltas to those on the longest line
between the two range endpoints. Unlike a range, the
lower bound is included in the output.

because BitKeeper history is a lattice, not a straight line. So your
test points became

bk changes -nd:REV: > REVS

and binary search over those. That gets you the longest "straight" line
in the graph. I dunno if Git has something like that, maybe. Lots of
stuff got lost in the transition to Git, still chaps my hide to this
day.

---------- Forwarded message ---------
From: Brian Ness <b...@bubblecreek.com>
Date: Mon, May 15, 2023 at 3:47 PM
Subject: Re: history of source control binary search for debugging

Hi Russ,

I haven’t been in contact with Viet since about 2005 or 2006, so I
don’t know his whereabouts now.

Viet and I worked together in the compiler development group at Cray.
There were language specific front-end teams for Fortan90 and C/C++,
an optimizer team, and a backend team. Viet and I both worked on the
optimizer team. Each team delivered its component as a set of
relocatables (.o files), identified by a component version string.
We had a tool for building the various compilers from the relocatable
sets, by simply specifying the desired version of each component in a
version string given as an input parameter to the build tool. There
were other options, such as for linking up debug versions of the
compilers. The build tool retrieved the relocatables from an archive
we had set up, and linked them together as an executable. Each night,
cron jobs would build compilers for testing from the most recent
versions of each component. When regressions were discovered, we would
build compilers with all previously validated components and execute
the newly failing tests. If a given test failed with a previously
validated compiler, we might have a test that had been modified or an
external library that had regressed, but when previously validated
compiler components passed the newly failing test, we would try one
new compiler component (e.g. the optimizer) with previously validated
other components. Since putting together the different combinations of
components was easy with our build tool, we could quickly identify
which new component introduced the regression.

The optimizer group did something similar, but with source code rather
than relocatables. Cray had its own version control system called USM
(Unicos Source Manager) which had support for atomic changes, meaning
that a set of related changes in multiple files could be encapsulated
and treated as a single thing. We called those atomic change sets,
“mods”. As far as I know, there were no other version control systems
supporting that at the time. Atomic change sets made it easy to
backtrack through the source code to identify the particular set of
changes causing a regression (or other behavior change). After proving
the viability of this approach with linear searches through the code,
we optimized it by using binary searches. This was important, because
an optimizer build from source could involve compiling something like
a million lines of code. Even on our supercomputers this could take a
while, so we wanted to do the minimum number of builds from source.
We called the process “mod isolation”.

After identifying all the mods causing regressions using our mod
isolation process, we would remove those mods and rebuild the
optimizer for the next round of testing. This was repeated until the
optimizer had no known regressions, and that version would be marked
as a release candidate. The “pulled” mods would be reworked by the
author and re-applied for a future release candidate.

This process allowed us to move from infrequent delivery of
non-regressing optimizers to one or two week delivery cycles.

I wasn’t the one to come up with the idea for mod isolation at Cray,
but I was the one who implemented it and automated much of the
process. The COMPSAC paper resulted from Viet’s Phd advisor being
hired by Cray to evaluate our development processes. During his
evaluation, he discovered our mod isolation process, and said he had
never seen this being done before. He asked us to write the paper.
I presented it at COMPSAC. We were not able to find much prior work on
this, probably because there wasn’t support for atomic changes in the
version control tools in use at that time, with the exception of USM.
Of course, now this functionality is available within git as “git
bisect”.
(I used “git bisect” just yesterday.)

I hope this helps you. Let me know if you want any other info.

-Brian

Cameron McInally

unread,

May 17, 2023, 1:46:24 PM5/17/23

to

On Wed, May 17, 2023 at 13:13 Russ Cox <r...@swtch.com> wrote:

> Thanks to everyone for the replies and pointers.
>
> As far as 'git bisect' is concerned, we have found 'git bisect'
> incredibly useful too, of course. It is clearly another instance of
> binary search to find bugs, but it is binary search over time, while
> the system I described is binary search over the program being
> compiled. "What change broke the compiler?" vs "What code does that
> change miscompile?". The two are independent and complementary.

As for a binary search over the program being compiled, LLVM has a
mechanism to bisect optimization passes. Not exactly the same, but worth
noting.

https://llvm.org/docs/OptBisect.html

Also, Viet is still in compilers at Coherent Logix.

Kaz Kylheku

unread,

May 17, 2023, 3:08:27 PM5/17/23

to

On 2023-05-17, gah4 <ga...@u.washington.edu> wrote:
> On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:
>
> (snip)
>
>> Say we have a basket of apples with a rotten smell emanating from it.
>> We can subdivide it, and smell one half and the other. If both halves
>> smell, we know we have two or more rotten apples and they ended up
>> in different halves. This doesn't matter. We just pick one half and
>> keep subdividing.
>
> But the algorithm described, at least as I remember it, doesn't test both
> halves.

If the problem reproducews with the half you're testing, you assume
the misprocessed function is in that half. (There could be a bad
function in the other half also, and you could test that, but it's
unnecessary.)

If the problem deosn't reproduce in the half you're testing, you
assume the misprocessed functions lie in the other half, and
recurse on that half instead.

The algorithm will work even if every single function is miscompiled.
(And will take the same logarithmic number of steps, since we cut
in half no matter what.)

It will arbitrarily narrow it down to one, showing that even if the
compiler change is enabled for just one function, there is a problem.
That function can be investigated. Then maybe then the fix solve the
issue for the five hundred other broken functions, too.

The main problem in that situation is that the function you narrow it
down to may not be the easiest one to investigate. Say that a
three line function is broken, and a 1000 line function is broken;
the binary search finds the 1000 line function, whereas it would be
easier to investigate how the three line function is miscompiled.

There could be the following issue: among several functions that
are miscompiled, there are two: foo and bar. *Both* must be
miscompiled for the problem to reproduce, due to an interaction.
If there is a recursive subdivision such that foo and bar are in
different halves, then the problem doesn't reproduce for either half.

Therefore, when the problem fails to reproduce for the currently
tested half, it may be wise to cross-test the other half. If it also
doesn't reproduce, the halves have to be recombined and some other
approach taken, like perturbing the enumeration so that the partitioning
is done differently, or doing a slow linear elimination.

Kaz Kylheku

unread,

May 17, 2023, 3:09:26 PM5/17/23

to

On 2023-05-17, Russ Cox <r...@swtch.com> wrote:
> because BitKeeper history is a lattice, not a straight line. So your
> test points became

That's one reason I religiously keep git histories linear, though
they could be lattices also, like in BK.

I've not used the "git merge" command since 2009, other than by accident
(via git pull, having forgotten to reconfigure pull to do rebase rather
than merge). And not even that; I quickly trained myself never to run
"git pull"; I banished that command also. Always "git fetch" and then
"git rebase". Before rebasing, I usually take a look at the diverging
upstream: how much new is coming in, and what it's like.

I really, really hate the whole concept of a commit having more than one
parent.

> From: Brian Ness <b...@bubblecreek.com>
> than relocatables. Cray had its own version control system called USM
> (Unicos Source Manager) which had support for atomic changes, meaning
> that a set of related changes in multiple files could be encapsulated
> and treated as a single thing. We called those atomic change sets,
> “mods”. As far as I know, there were no other version control systems
> supporting that at the time.

Simple archives of a snapshot of the entire source code support
mods, though obviously in an unwieldy way when the code is large;
and if you want every change isolated, that requires keeping a lot
of archives.

(And that's basically what git is: every commit in git is snapshot
rather than a delta.)

Linux kernel people used to bisect with kernel snapshots, before
they settled on a version control system.

gah4

unread,

May 18, 2023, 1:00:12 PM5/18/23

to

On Wednesday, May 17, 2023 at 12:08:27 PM UTC-7, Kaz Kylheku wrote:
> On 2023-05-17, gah4 <ga...@u.washington.edu> wrote:
> > On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:
> >
> > (snip)
> >
> >> Say we have a basket of apples with a rotten smell emanating from it.
> >> We can subdivide it, and smell one half and the other. If both halves
> >> smell, we know we have two or more rotten apples and they ended up
> >> in different halves. This doesn't matter. We just pick one half and
> >> keep subdividing.

> > But the algorithm described, at least as I remember it, doesn't test both
> > halves.

> If the problem reproducews with the half you're testing, you assume
> the misprocessed function is in that half. (There could be a bad
> function in the other half also, and you could test that, but it's
> unnecessary.)
>
> If the problem deosn't reproduce in the half you're testing, you
> assume the misprocessed functions lie in the other half, and
> recurse on that half instead.

I was mostly noting it, as the apple test was written to test
both halves.

> The algorithm will work even if every single function is miscompiled.
> (And will take the same logarithmic number of steps, since we cut
> in half no matter what.)

> It will arbitrarily narrow it down to one, showing that even if the
> compiler change is enabled for just one function, there is a problem.
> That function can be investigated. Then maybe then the fix solve the
> issue for the five hundred other broken functions, too.

As to the problem of debugging by miscompiled code, this reminds
me of stories of debugging compilers by feeding them cards from
the card recycling bin. It seems that cards were especially high grade
paper and so worth keeping separate for recycling. I never actually
saw anyone do that, though.

In my younger days, some said that I was especially good at finding
compiler errors. Maybe not as good now. I would try things that were
allowed, but others might not have thought about doing.

Only one I remember now, is a C compiler that miscompiled the ++
operator on a double variable. Allowed in C, but maybe not often used.

Spiros Bousbouras

unread,

May 18, 2023, 1:02:22 PM5/18/23

to

On Wed, 17 May 2023 10:55:27 -0400
Russ Cox <r...@swtch.com> wrote:
> When BitKeeper came along, we added this:
>
> --longest Restrict the deltas to those on the longest line
> between the two range endpoints. Unlike a range, the
> lower bound is included in the output.
>
> because BitKeeper history is a lattice, not a straight line. So your
> test points became
>
> bk changes -nd:REV: > REVS
>
> and binary search over those. That gets you the longest "straight" line
> in the graph.

Any partial ordering can be extended to a total ordering. Why can't you
just do that instead of using 'the longest "straight" line' ?

Russ Cox

unread,

May 18, 2023, 5:04:07 PM5/18/23

to

> Any partial ordering can be extended to a total ordering. Why can't you
> just do that instead of using 'the longest "straight" line' ?

A precondition for binary search is that you are searching a function
f over some domain [0, n) with the property that f(i) == true implies
f(j) == true for all j >= i. That is, once the condition "flips", it
stays flipped. Otherwise binary search is not a correct algorithm.

The assumption in tools like git bisect is that some commit introduces the
bug, and then the bug is detectable in all future commits as well,
establishing the precondition. The test f(i), testing at commit i, can be
viewed as asking whether the bad commit is in the range [0, i] (the full
history of that commit). With that definition, clearly f(i) implies f(j)
for all j >= i, because the commit history [0, j] is a superset of the
commit history [0, i].

If instead you squash unrelated commit lines together in a total order, the
precondition is no longer true: if i is from one line but j > i is from a
different unmerged line, then testing j does not include some history that
testing i does. The bug can disappear as you cross from one line to
another. Focusing on the longest straight line in the commit history is one
way to establish the necessary precondition.

Another possibility, which is what git bisect does, is to maintain a set of
commits not yet decided and then find some "good" midpoint along the
longest line of commits remaining. You may in general end up with multiple
disconnected lines in your set of commits not yet decided, and you have to
handle each one separately until you find the bug.

What git bisect does, then, is a more complex algorithm that uses
binary search over longest lines as a subroutine. Restricted to a
single line of commits the precondition holds and binary search works.
(I am simplifying a bit. The actual implementation is perhaps not this
principled, but this is the essence of what it is doing and why it
works.)

The binary search I described to start this thread is also not exactly
the same binary search algorithm, because the underlying input problem
is not a single-input function with domain [0, n). Instead the input
problem (in the simplest form) is a function taking a set of possible
changes [lo, hi) that reports whether the bug is in that set. We
recognize it as binary search because it's still the same code and
idea, and every binary search does maintain this progressively
narrower range, but normal binary searches don't tell f the range.

Best,
Russ

Kaz Kylheku

unread,

May 19, 2023, 3:47:54 PM5/19/23

to

On 2023-05-17, gah4 <ga...@u.washington.edu> wrote:
> As to the problem of debugging by miscompiled code, this reminds
> me of stories of debugging compilers by feeding them cards from
> the card recycling bin. It seems that cards were especially high grade
> paper and so worth keeping separate for recycling. I never actually
> saw anyone do that, though.

Obviously, the card recycling bin is going to be a source of garbage
that is invalid inputs to the compiler, with perhaps some valid bits.

This was effectively an early form of fuzzing.

> In my younger days, some said that I was especially good at finding
> compiler errors. Maybe not as good now.

Fuzzing has matured now. There are tools now which instrument the
executable, and monitor what is going on in terms of the program
control flow in response to random inputs. Then they adjust the bits
to try to stimulate paths not taken to tickle bugs. That's the
jist of it.

> I would try things that were
> allowed, but others might not have thought about doing.

I've not used fuzzing in a while ... since 2016 it turns out! I played with it
in the summer of that year. The software was AFL (American Fuzzy Lop) 2.30b.

I had found three bugs in a short timespan back then.

One of them was exponential memory explosion in nested quasiquoting syntax like
^^^....^^exr. (TXR Lisp's quasiquoting operator is written ^ rather than the
usual backtick ` used in most traditional Lisp dialects, otherwise it is the
same thing).

AFL also discovered that if it puts a huge number into the op syntax like this:
(op list @123455234), the op expander will obligingly generate a lambda with
that many arguments! E.g. (op list @3) generates something similar to
(lambda (arg0 arg1 arg3) (list arg3)). So imagine if we replace @3 with a
large integer.

AFL also found a crash in the regex parser. It was not doing a range check on
numeric character escapes, making it possible for the program to sneak a
negatively valued character code into the regex copiler, where it resulted in
an out-of-bounds array access problem.

I have been meaning brush the dust off this technique again.

> Only one I remember now, is a C compiler that miscompiled the ++
> operator on a double variable. Allowed in C, but maybe not often used.

Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
I doubt that if such a bug were introduced as an easter egg into GCC,
it would go very long without being discovered by the FOSS distros and
other downstream users.

Kaz Kylheku

unread,

May 19, 2023, 3:51:18 PM5/19/23

to

I think, not without introducing tie breaker rules that don't
necessarily make any sense.

But, maybe that doens't matter as something else. Say we have:

D-------E
/ \
A---B---C I----J
\ /
F---G---H

A is before B and so on. Are F-G-H before D-E or after?

Suppose the bug was introduced by E. We would like to know that.

The longest path between the A-J bisection points goes through F-G-H,
ignoring D-E. So it may wrongly look like I is the culprit, if E is
never tested. I has the bug, H doesn't.

Say we pick a precedence and flatten everything, in these
two ways:

A---B---C---D---E---F---G---H---I----J

or

A---B---C---F---G---H---D---E---I----J

problem is, under the first order, the bug is bouncing into
and out of existence:

A---B---C---D---E---F---G---H---I----J
X X X

E has the bug, and so do I and J. But F-G-H do not. So the bisect
algorithm has a problem; it is predicated on the idea that the bug
exists in the upper endpoint, is absent in the lower and has appeared
once in between.

Now, if we actually cherry pick all those commits into that order
(solving whatever conflicts are required), then we can do a meaningful
bisect. Then the rebased F G H commits do have the bug:

A---B---C---D---E---F'--G'--H'--I'---J
X X X X X X

It's unthinkable to be bisecting some lineage whipped up on-the-fly
that doesn't actually exist in the repo.

The best thing is to just keep the actual history that way.

Or else, do that longest-path things when bisecting. Then
if a merge point is implicated as the bad commit, then go
into the branch. When I is implicated, we note that it
has two parents, E and H. Only the H parentage was tested.
C was good. Thus we bisect C to E.

A smarter bisect algorithm could do this.

To make this sort of topical for comp.compilers, we note that the
revision history is a graph which resembles the control flow
graph of a program (one with no backwards branches).

The graph can be divided into our favorite chunks: basic blocks.

Basic blocks are sequence of commits in which no commit has multiple
parents (other than the first one), or is the parent of multiple children
(other than the last one).

The bisect algorithm should divide the graph into basic blocks,
and recurse into then when necessary.

So in:

D-------E
/ \
A---B---C I----J
\ /
F---G---H

we have these basic blocks:

/
A---B---C
\

D---E
\

/
F---G---H

I---J

First, the bisect could process the longest-path basic blocks: those
headed by A, F and I. The goal is to determine which basic block
introduces the bug.

When a buggy basic block is identified, then bisect actually
does commit-granularity bisection in that block.

If we discover that the bad commit is the head of a block,
and that block has multiple parents, we then recurse over
those parents somehow.

We identify paths (through the basic block graph) we have not taken
and similarly analyze them.

Cliff Click

unread,

May 19, 2023, 3:58:44 PM5/19/23

to

HotSpot C2 compiler has a set of -XX debugging flags for doing binary
search on top-level methods to compile with C2 (or C1 or interpret), and
a similar control over which methods to inline. While I was at Sun I had
a bisect shell script to narrow down failing compilations (and trim the
set of inlined methods).

Cliff

Max B

unread,

May 20, 2023, 6:02:50 PM5/20/23

to

We did something similar -- delta debugging -- with the Cinder JIT. Except
we didn't have hashes, but instead lists of function names. It's an
incredibly useful approach.

See more at https://bernsteinbear.com/blog/cinder-jit-bisect/

Thomas Koenig

unread,

May 20, 2023, 6:05:13 PM5/20/23

to

Kaz Kylheku <864-11...@kylheku.com> schrieb:

> Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
> I doubt that if such a bug were introduced as an easter egg into GCC,
> it would go very long without being discovered by the FOSS distros and
> other downstream users.

It would very likely be caught by regression testing. GCC has an
extensive test suite, and it is a requirement that this is run
before submitting a patch.

If that slips through (as can happen from time to time, for
example for different architectures or operating systems), then
the automated regression testers will flag it.

If not that, the automated SPEC testers also have a good chance
of catching it.

As an aside, the post-increment operator is converted to its
equivalent assignment statement very early, during gimplification.

In the following example, the result of the first two passes that
GCC performs are shown - the first is a reproduction of the source
code, as parsed, and the second one after lowering to GIMPLE.
File names may differ bit depending on whic version of GCC you use.

$ cat double.c
double c;

void foo()
{
c++;
}
$ gcc -c -fdump-tree-original -fdump-tree-gimple double.c
$ cat double.c.004t.original

;; Function foo (null)
;; enabled by -tree-original

{
c++ ;
}

$ cat double.c.005t.gimple
foo ()
{
c.0_1 = c;
_2 = c.0_1 + 1.0e+0;
c = _2;
}

Mike Stump

unread,

May 20, 2023, 6:06:13 PM5/20/23

to

>From: Larry McVoy <l...@mcvoy.com>
>Date: Mon, May 15, 2023 at 1:57 PM
>Subject: Re: history of source control binary search for debugging
>
>Good to hear from you Russ.
>
>I'm pretty sure I "invented" that idea, which means I hadn't seen it before.
>All I was doing was using the fact that binary search is log(n). And giving
>non-kernel people a way to do some grunt work and then get it close and
>then hand that off to the kernel people.
>
>But note that the BUG-HUNTING was using snapshot, not source management.
>BitKeeper hadn't been invented yet and the other source management systems
>sucked because they were reproducible only at tagged points. At least
>that was true of CVS.

With cvs update -D, one can select a date and update the tree to that
specific date. Indeed, I've been known to use '1 day ago', ... '50
days ago' and binary search that way as well. Nice interface as you
only have to deal with changing a single number. And the granularity
in days is a nice metric. It gets you down to a day or two and then
you dig in from there. Further, with cvs, you could update only parts
of the tree if you suspect which part might be failing. Front end,
loop unrolling and so on.

Another technique I've used to is save off part of the compiler
(cc1/cc1plus) binary from the nightly build system, and then when you
want state on a bug, you simply for i in */cc1; do test $i; done and
then you get a total view of that code against all the compilers. You
can see bugs pop in and out, or if they are simple, always worked
before, and never after.

These methods are so basic that there is no need to write them down.
I've been using such techniques longer than linux has been around.
:-)

gah4

unread,

May 21, 2023, 2:27:50 PM5/21/23

to

On Saturday, May 20, 2023 at 3:05:13 PM UTC-7, Thomas Koenig wrote:
> Kaz Kylheku <864-11...@kylheku.com> schrieb:
> > Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
> > I doubt that if such a bug were introduced as an easter egg into GCC,
> > it would go very long without being discovered by the FOSS distros and
> > other downstream users.

> It would very likely be caught by regression testing. GCC has an
> extensive test suite, and it is a requirement that this is run
> before submitting a patch.

I believe it was Borland Turbo C, maybe 2.0.

At one point, I had a few different C compilers for MS-DOS, not so long
before I went to using OS/2. It seems that Borland also had compilers
for OS/2, but I don't remember using those.

Hans-Peter Diettrich

unread,

May 22, 2023, 9:21:23 AM5/22/23

to

On 5/21/23 5:20 AM, gah4 wrote:

> I believe it was Borland Turbo C, maybe 2.0.
>
> At one point, I had a few different C compilers for MS-DOS, not so long
> before I went to using OS/2. It seems that Borland also had compilers
> for OS/2, but I don't remember using those.

Borland also developed Borland C++ for the Atari ST. The library was not
so perfect, as my decompiler could reveal during beta test. But it was
by far the best C/C++ compiler available for the ST.

DoDi