
Naive question; Branch Prediction


gareth evans
Sep 8, 2021, 6:54:49 AM
There must be some background processing taking place at
a speed that is greater than the execution of program
instructions in order to estimate a future branch prediction.

OK; so why isn't that faster processing given over to
the instructions themselves?

Terje Mathisen
Sep 8, 2021, 8:26:54 AM
gareth evans wrote:
> There must be some background processing taking place at
> a speed that is greater than the execution of program
> instructions in order to estimate a future branch prediction.

Your premise is wrong.
>
> OK; so why isn't that faster processing given over to
> the instructions themselves?

The branch prediction tables only need to be updated fast enough that
they can be used the next time the same branch comes around, i.e.
this can happen cycles after the actual branch has been taken (or not).

There have been CPUs which depended on having a few non-branch cycles in
order to update the branch tables; there you could hit a bad slowdown in
hand-written code that actually did have a branch every cycle.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup
Sep 8, 2021, 11:53:54 AM
On Wednesday, September 8, 2021 at 5:54:49 AM UTC-5, gareth evans wrote:
> There must be some background processing taking place at
> a speed that is greater than the execution of program
> instructions in order to estimate a future branch prediction.
<
Terje is correct: your premise is incorrect.
>
> OK; so why isn't that faster processing given over to
> the instructions themselves?
<
Branch prediction is all done by recording branch histories
from the past and using the past to predict the future.

Stephen Fuld
Sep 8, 2021, 12:00:59 PM
Yes, and to Gareth's question, the "processing" is pretty trivial. It
could be as simple as comparing the prediction to the reality and
incrementing or decrementing a small counter, so there is not much
"processing power" to "give back" to processing instructions.
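
As a rough C sketch (illustrative only; the classic form is the two-bit
saturating counter, and real hardware does this with a few gates, not
code):

  /* 2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken */
  typedef unsigned char ctr_t;

  int predict(ctr_t c) { return c >= 2; }

  ctr_t update(ctr_t c, int taken)
  {
      if (taken) return (c < 3) ? c + 1 : c;  /* count up, saturate at 3 */
      else       return (c > 0) ? c - 1 : c;  /* count down, saturate at 0 */
  }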


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB
Sep 8, 2021, 2:41:19 PM
Yeah.


For example, in my case predicting is something like:
Form a small hash-index based on the address and branch history;
Fetch a 3-bit predictor;
Predict direction based on the predictor.
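
A rough C sketch of that lookup (the table size, hash, and state decode
here are guesses, not the actual implementation):

  /* Hypothetical: table of 3-bit states, indexed by a hash of the
     branch address and recent branch history. */
  #define PRED_BITS 12
  static unsigned char pred_tbl[1 << PRED_BITS];
  static unsigned ghist;   /* shift register of recent outcomes */

  int predict_taken(unsigned pc)
  {
      unsigned idx = (pc ^ ghist) & ((1 << PRED_BITS) - 1);
      unsigned char st = pred_tbl[idx];
      /* one plausible decode: bit 0 = direction, bit 2 = "alternating"
         (predict the opposite of last time); see the state list below */
      return (st & 4) ? !(st & 1) : (st & 1);
  }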

This part is done fairly early in the pipeline, and effectively diverts
instruction fetching to another location. The destination typically
comes from the branch instruction (or certain "fixed" registers).

The unconditional branch instructions may also do this.

The predictor also handles a few other cases ('RTS' and 'JMP R1', which
are used in function returns), though these instructions have an
interlock check of sorts against the rest of the pipeline to make sure
nothing has LR or R1 as a destination register.


Then, once the branch is executed, a similar process happens, this time
updating the predictor based on its current state and the actual branch
direction.

The states are, e.g.:
000: Small, Not-Taken
001: Small, Taken
010: Large, Not-Taken
011: Large, Taken
100: Small, Not-Taken, Alternating
101: Small, Taken, Alternating
110: Large, Not-Taken, Alternating
111: Large, Taken, Alternating

The first 4 predict branches which nearly always branch in the same
direction, whereas the alternating cases predict branches that tend to
branch the opposite direction from whatever they took last time. These
states are less common, but can help slightly.

Adding more levels to increase the "strength" of taken or not-taken
states was not effective in my tests (can actually make it worse).

The state update is basically a 4-bit table lookup to find the next state.
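
A guessed-at version of that table in C (the actual transition values
aren't given above, so these entries are only one plausible reading of
the state list, with the direction bit of the alternating states
recording the last outcome):

  /* Hypothetical next-state table, indexed by (state << 1) | taken;
     entries are one plausible reading, not the actual values. */
  static const unsigned char next[16] = {
  /* 000 Small,NT      NT,T: */ 2, 5,  /* strengthen / guess alternating */
  /* 001 Small,T       NT,T: */ 4, 3,  /* guess alternating / strengthen */
  /* 010 Large,NT      NT,T: */ 2, 0,  /* hold / weaken */
  /* 011 Large,T       NT,T: */ 1, 3,  /* weaken / hold */
  /* 100 Small,NT,Alt  NT,T: */ 0, 7,  /* drop Alt / alternated, promote */
  /* 101 Small,T,Alt   NT,T: */ 6, 1,  /* alternated, promote / drop Alt */
  /* 110 Large,NT,Alt  NT,T: */ 4, 7,  /* demote / alternated */
  /* 111 Large,T,Alt   NT,T: */ 6, 5,  /* alternated / demote */
  };

  unsigned char update_state(unsigned char st, int taken)
  {
      return next[(st << 1) | (taken & 1)];
  }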


During the execute stage, the branch checks whether the prediction was
correct:
Correct: Behave mostly like a NOP;
Incorrect: Initiate a branch to the new location.


As can be noted, my ISA currently has several branch displacement sizes:
Disp8s, 16-bit branch ops, handled by predictor;
Disp20s, 32-bit branch ops, handled by predictor;
Disp33s, 64-bit branch ops, execute-stage (ignored by predictor);
Abs48, 64-bit branch ops, handled by predictor.

Nearly all local branches can fit into a 20 bit displacement (+/- 1MB in
this case). Non-local branches can use Abs48 (48-bit absolute address)
or a branch through a pointer.

The 33-bit branches are rare enough, and expensive enough to handle
early, that they can be left as non-predicted without too much adverse
effect. This could happen, though, in a binary with a ".text" section
exceeding 1MB.

Part of the cost is due to adder latency, where a smaller adder has less
latency. This is not really an issue for Abs48 branches.


Previously, there was an issue with Mod-4GB branch-addressing due to the
adder latency issue, but then I came up with the idea that the
branch-predictor can ignore any branches which produce a carry outside
the low 24 bits or so (these would be left to the EX stages to deal
with). This allows dropping the Mod-4GB branch restriction.
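
A sketch of such a check (the 24-bit cutoff is the "or so" figure above;
the rest is my guess at the mechanism):

  /* Hypothetical: let the predictor handle the branch only if adding
     the displacement does not carry/borrow out of the low 24 bits of
     the PC; anything else is left for the EX stages. */
  int predictor_can_handle(unsigned long long pc, long long disp)
  {
      unsigned long long target = pc + (unsigned long long)disp;
      return ((pc ^ target) >> 24) == 0;  /* high bits unchanged? */
  }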

Currently, the branch predictor will also ignore branches that are put
into a bundle, with the minor drawback that this effectively prevents
bundling instructions in parallel with a branch.

...


None of this is "general purpose" processing that would be of much use
as part of the ISA.

Thomas Koenig
Sep 8, 2021, 3:53:20 PM
MitchAlsup <Mitch...@aol.com> wrote:
I'm always astonished that it appears to work so well.

for (i=0; i<3; i++)
  for (j=0; j<3; j++)

would already be hard, correct?

MitchAlsup
Sep 8, 2021, 4:11:47 PM
Not when you include branch history in the prediction hash!
The above loops have an inner pattern of TTN and an outer pattern
of TTN. The inner pattern is at one virtual address, the outer
pattern is at a different virtual address, so they use different
prediction cells in the branch prediction table (2×3 cells). Given
1K-16K of these cells, there is scant interference (0.6%-0.01%),
as long as your hash function works well.
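
As a toy illustration (parameters mine, not necessarily Mitch's: a
256-entry table of two-bit counters indexed by branch address XORed
with 2 bits of per-branch history), a simulation of the above nest
settles into near-perfect prediction of both TTN patterns after a
brief warm-up:

  /* Toy model: 2-bit counters indexed by (branch address ^ that
     branch's own 2-bit outcome history). Sizes are arbitrary. */
  #include <stdio.h>

  static unsigned char ctr[256];   /* 2-bit saturating counters */
  static int miss;

  static void branch(unsigned pc, unsigned *hist, int taken)
  {
      unsigned idx = (pc ^ *hist) & 255;
      int pred = ctr[idx] >= 2;
      if (pred != taken) miss++;
      if (taken)  { if (ctr[idx] < 3) ctr[idx]++; }
      else        { if (ctr[idx] > 0) ctr[idx]--; }
      *hist = ((*hist << 1) | taken) & 3;   /* keep last 2 outcomes */
  }

  int main(void)
  {
      unsigned hi = 0, ho = 0;              /* per-branch histories */
      for (int n = 0; n < 1000; n++)        /* re-run the nest */
          for (int i = 0; i < 3; i++) {     /* outer loop */
              for (int j = 0; j < 3; j++)
                  branch(0x40, &hi, j < 2); /* inner closing branch: T,T,N */
              branch(0x80, &ho, i < 2);     /* outer closing branch: T,T,N */
          }
      printf("%d mispredicts out of %d branches\n", miss, 1000 * 12);
      return 0;
  }

In steady state each branch occupies three distinct cells (one per
history value), matching the 2×3 cells mentioned above.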