Yeah.
For example, in my case, prediction goes something like:
Form a small hash-index based on the address and branch history;
Fetch a 3-bit predictor;
Predict direction based on the predictor.
This part is done fairly early in the pipeline, and effectively diverts
instruction fetching to another location. The destination typically
comes from the branch instruction (or certain "fixed" registers).
The unconditional branch instructions may also do this.
This is along with a few other cases ('RTS' and 'JMP R1', which are used
in function returns), though these instructions have an interlock-check
of sorts against the rest of the pipeline to make sure nothing has LR or
R1 as a destination register.
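As a rough sketch of the lookup side in C (hash index, then fetch a 3-bit state): the table size, the XOR-fold hash, and the history width here are all made up for illustration, since the description only says the index is "small" and built from the address and branch history. The names (bp_table, bp_index, and so on) are hypothetical as well.

```c
#include <stdint.h>

/* ASSUMPTION: 64 entries; the real table size is not stated. */
#define BP_INDEX_BITS 6
#define BP_ENTRIES    (1u << BP_INDEX_BITS)
#define BP_INDEX_MASK (BP_ENTRIES - 1u)

static uint8_t  bp_table[BP_ENTRIES]; /* one 3-bit state per entry */
static uint32_t bp_history;           /* recent outcomes, newest in bit 0 */

/* Fold the branch address and the history into a table index.
 * ASSUMPTION: a plain XOR fold; the actual hash is not given. */
static unsigned bp_index(uint64_t pc, uint32_t history)
{
    /* Drop bit 0, assuming instructions are at least 16-bit aligned. */
    uint32_t a = (uint32_t)(pc >> 1);
    return (a ^ (a >> BP_INDEX_BITS) ^ history) & BP_INDEX_MASK;
}

/* Fetch the 3-bit predictor state for the branch at pc. */
static unsigned bp_fetch(uint64_t pc)
{
    return bp_table[bp_index(pc, bp_history)] & 7u;
}

/* Shift a resolved branch outcome into the history register. */
static void bp_record(int taken)
{
    bp_history = ((bp_history << 1) | (taken ? 1u : 0u)) & BP_INDEX_MASK;
}
```

Mixing history into the index lets the same branch land on different entries depending on the path that led to it, at the cost of more aliasing in a small table.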
Then, once the branch is executed, a similar process happens, but this time
updating the predictor based on its current state and the actual branch direction.
States are, e.g.:
000: Small, Not-Taken
001: Small, Taken
010: Large, Not-Taken
011: Large, Taken
100: Small, Not-Taken, Alternating
101: Small, Taken, Alternating
110: Large, Not-Taken, Alternating
111: Large, Taken, Alternating
The first 4 predict branches which nearly always branch in the same
direction, whereas the alternating cases predict branches that tend to
branch the opposite direction from whatever they took last time. The
alternating states are less common, but can help slightly.
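One way to read a direction out of these states, as a minimal C sketch. The bit layout is taken from the list above (bit 0 = direction, bit 1 = large, bit 2 = alternating). The handling of the alternating states is an assumption: here bit 0 is taken to record the direction seen last time, so the prediction is its opposite. The actual encoding may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit layout as read off the state list. */
enum { BP_DIR = 1u, BP_LARGE = 2u, BP_ALT = 4u };

/* Predict the direction for a 3-bit state.
 * ASSUMPTION: in alternating states, bit 0 is the direction seen last
 * time, and the branch is predicted to go the other way. */
static bool bp_predict_taken(unsigned state)
{
    assert(state < 8u);
    bool dir         = (state & BP_DIR) != 0;
    bool alternating = (state & BP_ALT) != 0;
    return alternating ? !dir : dir;
}
```

Under this reading, 011 (Large, Taken) predicts taken, while 101 (Small, Taken, Alternating) predicts not-taken. The Small/Large bit does not affect the direction here; it only matters when the state is updated.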
Adding more levels to increase the "strength" of taken or not-taken
states was not effective in my tests (it can actually make things worse).
The state update is basically a 4-bit table lookup (the 3-bit current state
plus the 1-bit branch direction) to find the next state.
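The lookup itself can be sketched in C as a 16-entry table. The transitions below are hypothetical, invented only to show the shape of the lookup; the real table is not given. The policy assumed here: a correct prediction moves to Large; a miss in a Large state drops to Small; a miss in a Small state flips between the plain and alternating modes. As an assumption, bit 0 of an alternating state records the direction last seen.

```c
#include <assert.h>
#include <stdint.h>

/* Next-state table, indexed by (state << 1) | taken.
 * HYPOTHETICAL transitions, for illustration only. */
static const uint8_t bp_next[16] = {
    /* state           not-taken  taken */
    /* 000 S,NT     */ 2,         5,
    /* 001 S,T      */ 4,         3,
    /* 010 L,NT     */ 2,         0,
    /* 011 L,T      */ 1,         3,
    /* 100 S,NT,Alt */ 0,         7,
    /* 101 S,T,Alt  */ 6,         1,
    /* 110 L,NT,Alt */ 4,         7,
    /* 111 L,T,Alt  */ 6,         5,
};

/* Combine the 3-bit state and the resolved direction into a 4-bit index. */
static unsigned bp_next_state(unsigned state, int taken)
{
    assert(state < 8u);
    return bp_next[(state << 1) | (taken ? 1u : 0u)];
}
```

With this table, a branch that is always taken settles in 011, and one that strictly alternates ends up bouncing between 110 and 111.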
The branch, during the execute stage, checks whether the prediction
was correct:
Correct: Behave mostly like a NOP;
Incorrect: Initiate a branch to the new location.
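In C terms, the execute-stage check might look roughly like the following. The struct and function names are hypothetical. The one detail filled in here is that a branch predicted taken but actually not taken has to redirect back to the fall-through address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of resolving a branch in the execute stage (hypothetical type). */
typedef struct {
    bool     redirect; /* true if fetch must be steered elsewhere */
    uint64_t next_pc;  /* only meaningful when redirect is true */
} bp_resolve_t;

/* Compare the early prediction against the actual outcome. */
static bp_resolve_t bp_resolve(bool predicted_taken, bool taken,
                               uint64_t target, uint64_t fallthrough)
{
    bp_resolve_t r = { false, 0 };
    if (predicted_taken != taken) {
        /* Mispredicted: branch to wherever fetch should have gone. */
        r.redirect = true;
        r.next_pc  = taken ? target : fallthrough;
    }
    /* Predicted correctly: nothing to do, the branch acts like a NOP. */
    return r;
}
```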
As can be noted, my ISA currently has several branch displacement sizes:
Disp8s, 16-bit branch ops, handled by predictor;
Disp20s, 32-bit branch ops, handled by predictor;
Disp33s, 64-bit branch ops, execute-stage (ignored by predictor);
Abs48, 64-bit branch ops, handled by predictor.
Nearly all local branches can fit into a 20-bit displacement (+/- 1MB in
this case). Non-local branches can use Abs48 (48-bit absolute address)
or a branch through a pointer.
The 33-bit branches are rare enough, and expensive enough to handle early,
that they can be left as non-predicted without too much adverse effect.
They could show up, though, in a binary with a ".text" section exceeding 1MB.
Part of the cost is due to adder latency (a smaller adder has less
latency). This is not really an issue for Abs48 branches, since an absolute
target does not need to be added to PC.
Previously, there was an issue with Mod-4GB branch-addressing due to the
adder latency issue, but then I came up with the idea that the
branch-predictor can ignore any branches which produce a carry outside
the low 24 bits or so (these would be left to the EX stages to deal
with). This allows dropping the Mod-4GB branch restriction.
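A minimal C sketch of that carry check, under one reading of it: the predictor only adds the low 24 bits, and skips the branch whenever the full-width add would have changed the bits above them. The 24-bit width follows the "or so" above; the function names are hypothetical, and disp is assumed to be an already sign-extended byte displacement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BP_ADD_BITS 24
#define BP_ADD_MASK ((UINT64_C(1) << BP_ADD_BITS) - 1)

/* Add the low 24 bits of pc and disp; any carry ends up in bit 24. */
static uint64_t bp_low_sum(uint64_t pc, int32_t disp)
{
    /* Bits 24 and up of disp must be pure sign extension. */
    assert(disp >= -(1 << BP_ADD_BITS) && disp < (1 << BP_ADD_BITS));
    return (pc & BP_ADD_MASK) + ((uint64_t)(int64_t)disp & BP_ADD_MASK);
}

/* True if the upper bits of pc survive the add unchanged. */
static bool bp_can_predict(uint64_t pc, int32_t disp)
{
    bool carry = (bp_low_sum(pc, disp) >> BP_ADD_BITS) != 0;
    /* Forward: a carry would bump the upper bits.
     * Backward: a missing carry means a borrow from the upper bits. */
    return carry == (disp < 0);
}

/* Target from the short adder; only valid when bp_can_predict() is true. */
static uint64_t bp_fast_target(uint64_t pc, int32_t disp)
{
    return (pc & ~BP_ADD_MASK) | (bp_low_sum(pc, disp) & BP_ADD_MASK);
}
```

For example, a branch at 0x00FFFFF0 with a displacement of +0x20 crosses a 16MB boundary, so bp_can_predict() returns false and the branch would be left to the EX stages.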
Currently, the branch predictor will also ignore branches that are put
into a bundle, with the minor drawback that this effectively prevents
bundling instructions in parallel with a branch.
...
None of this is "general purpose" processing that would be of much use
as part of the ISA.