m...@iae.nl writes:
>On Wednesday, April 15, 2020 at 5:40:01 PM UTC+2, Anton Ertl wrote:
>> Cecil - k5nwa <cba...@cbayona.com> writes:
>> >Why small
>> >? So it can fit in the cache of the CPU and be lightning fast among
>> >other things
>[..]
>> By contrast, --no-super reduces the "branch-misses" (branch
>> mispredictions) by 172M. If we assume that each misprediction costs
>> 16 cycles (which is realistic), this explains the performance
>> difference of 2.8G cycles.
>
>That is interesting. Is there some special instruction or
>secret sequence that we can use to advance-warn the CPU that
>we are going to branch or call somewhere soon?
Depends on the architecture:
PowerPC has the ctr (counter) and lr (link) registers, and you have to
move the target address to one of these registers (with mtctr or mtlr)
before you branch there (with bctr or blr and their variants); the
implementations could use this as an advance warning; I don't know if
they do so, though. I found that gcc just does not separate them
(probably these registers are not modeled in gcc, and modeling them
would require quite a bit of work).
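The split looks like this in PowerPC assembly (a sketch; the choice of
r12 and the load that produces the target are illustrative):

```asm
lwz   r12, 0(r3)   # load the branch target address
mtctr r12          # move it to ctr -- the potential advance warning
# ... independent instructions could be scheduled here ...
bctr               # branch to the address in ctr
```

As noted, gcc emits the mtctr right next to the bctr, so any benefit
from scheduling the mtctr early goes unrealized.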
IA-64 splits indirect branches in a similar way.
IIRC, on the Itanium 2 (or I may be confusing this with the PowerPC
970), when the branch part is seen, the CPU uses the contents of the
register at that time to predict the target; if the register is set
too late, this results in a mispredicted branch. I dimly remember
that a latency of 6 cycles is needed between the first and the second
part.
SPARC acquired some prepare-to-indirect-branch instruction with much
fanfare ("speeds up Java") at some point, but I have not looked
closely at it.
The Intel/AMD side acquired indirect branch predictors relatively
early: 1993 for Intel (Pentium), 1999 for AMD (K7). Initially they
had branch target buffers; later (e.g., Core 2 Duo) they used more
history for predicting indirect branches. This works pretty well and
does not need compiler enhancements that never come.
While CPUs still only had BTBs, we developed replication (--no-super;
and by extension, dynamic superinstructions with replication, the
default in Gforth) to get a better prediction accuracy out of BTBs.
History-based predictors already do quite well without replication
(for the small benchmarks, Skylake produces the same performance with
--no-dynamic and --no-super; for larger benchmarks, replication still
produces an advantage on Haswell; I have not measured on Skylake). So
I presented results on a Goldmont, where replication still makes a big
difference.
One other aspect is that most high-performance CPUs have a "return
stack" (not the Forth return stack) for predicting returns. This
usually works quite well.
>I suppose that if the hardware knew that we are
>going to jump to $xxxx it could already prefetch
>the code there. A pattern like "pop ebx, jmp ebx,"
>or "ret," etc. is common enough that the hardware
>could speculate with some confidence that the value
>at the top of the return stack is a good candidate
>for an imminent IP change.
It does not do that, and it's hard for an OoO or deeply pipelined
in-order CPU. That's because the front end of the CPU runs far ahead
of the back end where data is handled.
Instead, the CPU's return stack is a separate structure inside the
CPU's front end: when the front end sees a call, it pushes the return
address; when it sees a return, it pops it as the prediction. If you
perform return-address manipulation, the return will mispredict.
>If such a mechanism were in place, it would mean
>that we should avoid using the return stack for data
>because this data effectively hides where we are going to
>jump/return: "pop ebx, pop ecx, ret" would be a bad idea,
>"pop ebx pop ecx ...<useful code>... ret" is then much
>better.
The branch predictor does not see that.
>I guess that RISC V exploits the fact that the
>BL instruction exposes the return address in the link
>register.
AFAIK on RISC-V, the link register is a general-purpose register
(unlike on PowerPC), and the return is an ordinary indirect branch.
Of course, a RISC-V implementation can be designed to work better for
the usual return idiom (with a return stack) than for others.
Some literature:
@InProceedings{ertl&gregg03,
author = "M. Anton Ertl and David Gregg",
title = "Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters",
crossref = "sigplan03",
url = "http://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
abstract = "Interpreters designed for efficiency execute a huge
number of indirect branches and can spend more than
half of the execution time in indirect branch
mispredictions. Branch target buffers are the best
widely available form of indirect branch prediction;
however, their prediction accuracy for existing
interpreters is only 2\%--50\%. In this paper we
investigate two methods for improving the prediction
accuracy of BTBs for interpreters: replicating
virtual machine (VM) instructions and combining
sequences of VM instructions into superinstructions.
We investigate static (interpreter build-time) and
dynamic (interpreter run-time) variants of these
techniques and compare them and several combinations
of these techniques. These techniques can eliminate
nearly all of the dispatch branch mispredictions,
and have other benefits, resulting in speedups by a
factor of up to 3.17 over efficient threaded-code
interpreters, and speedups by a factor of up to 1.3
over techniques relying on superinstructions alone."
}
@Proceedings{sigplan03,
booktitle = "SIGPLAN Conference on Programming Language
Design and Implementation (PLDI'03)",
title = "SIGPLAN Conference on Programming Language
Design and Implementation (PLDI'03)",
year = "2003",
key = "PLDI '03"
}
@Article{ertl&gregg03jilp,
author = {M. Anton Ertl and David Gregg},
title = {The Structure and Performance of \emph{Efficient}
Interpreters},
journal = {The Journal of Instruction-Level Parallelism},
year = {2003},
volume = {5},
month = nov,
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
note = {http://www.jilp.org/vol5/},
abstract = {Interpreters designed for high general-purpose
performance typically perform a large number of
indirect branches (3.2\%--13\% of all executed
instructions in our benchmarks). These branches
consume more than half of the run-time in a number
of configurations we simulated. We evaluate how
accurate various existing and proposed branch
prediction schemes are on a number of interpreters,
how the mispredictions affect the performance of the
interpreters and how two different interpreter
implementation techniques perform with various
branch predictors. We also suggest various ways in
which hardware designers, C compiler writers, and
interpreter writers can improve the performance of
interpreters.}
}
@Article{casey+07toplas,
author = {Kevin Casey and M. Anton Ertl and David Gregg},
title = {Optimizing indirect branch prediction accuracy in
virtual machine interpreters},
journal = {ACM Transactions on Programming Languages and Systems},
year = {2007},
volume = {29},
number = {6},
pages = {37:1--37:36},
month = oct,
url = {http://doi.acm.org/10.1145/1286821.1286828},
abstract = {Interpreters designed for efficiency execute a huge
number of indirect branches and can spend more than
half of the execution time in indirect branch
mispredictions. Branch target buffers (BTBs) are the
most widely available form of indirect branch
prediction; however, their prediction accuracy for
existing interpreters is only 2\%--50\%. In this
article we investigate two methods for improving the
prediction accuracy of BTBs for interpreters:
replicating virtual machine (VM) instructions and
combining sequences of VM instructions into
superinstructions. We investigate static
(interpreter build-time) and dynamic (interpreter
runtime) variants of these techniques and compare
them and several combinations of these
techniques. To show their generality, we have
implemented these optimizations in VMs for both Java
and Forth. These techniques can eliminate nearly all
of the dispatch branch mispredictions, and have
other benefits, resulting in speedups by a factor of
up to 4.55 over efficient threaded-code
interpreters, and speedups by a factor of up to 1.34
over techniques relying on dynamic superinstructions
alone.}
}