The data he shows at the end is not representative of what an
interpreter sees:

First of all, on every dispatch he eventually calls a function that
invalidates the cache line at the start of the dispatch function.
That makes each iteration of the benchmark very slow (reflected in the
low instructions per cycle) and not at all dominated by the dispatch,
unlike efficient interpreters, which typically run from the I-cache
much of the time, unless they call some big library function (and then
the dispatch overhead becomes irrelevant anyway).
Also, he calls one function 80% of the time, and a random one out of
255 the rest of the time; that's not representative of an interpreter.
An interpreter tends to repeat sequences of dispatches (e.g., in loops
and in called functions), and a history-based branch predictor can
exploit that. How well that works out with indirect branches vs. a
tree of conditional branches has to be measured, and may vary from CPU
to CPU.
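To illustrate what a history-based predictor can exploit, here is a
minimal sketch of threaded-code dispatch in GNU C (computed gotos);
the instruction set is invented and this is not Gforth's actual code.
The point is that NEXT is replicated at the end of every handler, so
each indirect jump gets its own prediction history:

typedef void *Inst;

/* Called with ip==NULL, returns the handler table so a loader can
   translate opcode numbers into label addresses; otherwise it
   executes the threaded code at ip. */
void *engine(Inst *ip, long *sp)
{
  static void *handlers[] = { &&do_lit, &&do_add, &&do_halt };
  if (ip == NULL)
    return handlers;
#define NEXT goto **ip++ /* replicated dispatch */
  NEXT;
do_lit:  *++sp = (long)*ip++; NEXT; /* push inline literal */
do_add:  sp[-1] += sp[0]; sp--; NEXT;
do_halt: return sp;
}

With a single shared dispatch point, one indirect branch has to
predict the successors of all instructions; with the replicated NEXT,
the jump at the end of do_add only has to predict what typically
follows an addition in the given program.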
Note also that, with current CPUs, you have to use benchmarks of
significant size. He cites "Branch Prediction and the Performance of
Interpreters - Don't Trust Folklore"; I ran experiments with
substantial Forth programs on Ivy Bridge and Haswell, and found that
the interpreter optimizations we use in Gforth are still very
beneficial, even on Haswell.
The dependence on the actual CPU also holds for his benchmark: on a
Ryzen 1800X, his x86_64-vtable beats his x86_64-binary with the same
parameters:
Performance counter stats for './x86_64-binary 5000000' (5 runs):

      739.637786      task-clock (msec)
              62      context-switches
               0      cpu-migrations
              21      page-faults
   2,994,569,072      cycles
       5,108,587      stalled-cycles-frontend
      50,210,715      stalled-cycles-backend
     284,352,758      instructions
      82,447,723      branches
       5,808,693      branch-misses

     0.740464011 seconds time elapsed

Performance counter stats for './x86_64-vtable 5000000' (5 runs):

      713.539066      task-clock (msec)
              60      context-switches
               0      cpu-migrations
              22      page-faults
   2,885,008,628      cycles
      14,138,003      stalled-cycles-frontend
      15,179,665      stalled-cycles-backend
     218,699,025      instructions
      52,264,833      branches
       3,759,984      branch-misses

     0.714335864 seconds time elapsed
>Switching on opcode number is a special case of a general switch.
>Specifically:
>1. opcodes are dense and contiguous;
>2. opcodes typically take values over a relatively small range.
>For example, CPython's opcodes range from 1 to 161 (excepting EXCEPT_HANDLER).
>The jump table is thus a perfect hash: We need only store the addresses of the
>opcode handler functions in a smallish table and use the opcode value as an
>index into the table. Given (1) and (2), it's not clear to me that, say, a
>binary search with Log_2(161) ~ 8 comparisons is better than the lookup table.
>Is that really what the recent literature is saying?
Not that I know of.
> Are concerns about cache
>misses really valid for a table of 161 addresses?
No. That would be <1.3KB on a 64-bit machine, or, for the 256 cases
used in the benchmark, 2KB. By comparison, the binary-search dispatch
function used for the benchmark above is 2567 bytes longer than the
table dispatch code, so what you gain in D-cache, you more than lose
in I-cache.
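For concreteness, the table dispatch is just something like this (a
sketch; the handler names are invented):

/* 161 pointers * 8 bytes = 1288 bytes of data on a 64-bit machine
   (256 entries = 2KB); the dispatch itself is one load plus one
   indirect call. */
typedef void (*handler)(void);
extern void op_pop_top(void), op_binary_add(void); /* ... */

static handler table[161] = {
  op_pop_top, op_binary_add, /* ... one entry per opcode ... */
};

void dispatch(unsigned opcode)
{
  table[opcode](); /* opcode assumed to be in range */
}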
> Is a load instruction that
>much more expensive than comparison instructions with branch prediction?
If the indirect branch predicts correctly, the load is essentially
free. If it mispredicts, the load latency (4 cycles) adds to the
misprediction penalty (~20 cycles). In the correctly predicted case,
the indirect branch itself costs one cycle or so.
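To put rough numbers on it (illustrative arithmetic with the
latencies above, not a measurement): at a 10% misprediction rate,
table dispatch costs about 0.9*1 + 0.1*(20+4) = 3.3 cycles per
dispatch on average; at 5%, about 2.2 cycles.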
For your binary tree of conditional branches, processing the tree
costs some cycles even when correctly predicted. E.g., assume that
the core can process two branches per cycle, at most one of them
taken. With our 8-deep tree, even if we always achieve two branches
per cycle, that's 4 cycles spent in dispatch, and paths with many
taken branches actually take up to 8 cycles, all with correct
prediction. You may be able to get this down by skewing the tree to
have shorter depths (and beneficial code arrangement) for the more
frequent instructions, but in the end the minimal cost of this
approach will still be higher than that of the table dispatch.
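For reference, this kind of dispatch looks roughly as follows (a
sketch over 8 opcodes, i.e., a 3-deep tree; the 256-case version has
the same shape at depth 8, and the handler names are invented):

extern void op0(void), op1(void), op2(void), op3(void),
            op4(void), op5(void), op6(void), op7(void);

/* Every path executes 3 conditional branches; a skewed version
   would test the most frequent opcodes first, at lower depth. */
void dispatch(unsigned op)
{
  if (op < 4) {
    if (op < 2) { if (op == 0) op0(); else op1(); }
    else        { if (op == 2) op2(); else op3(); }
  } else {
    if (op < 6) { if (op == 4) op4(); else op5(); }
    else        { if (op == 6) op6(); else op7(); }
  }
}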
However, the big cost is the mispredictions; their frequency is hard
to predict for interpreters, so it is better to measure them.
My feeling, though, is that the binary tree loses for such an
interpreter on current CPUs.
>I also have a question about optimization. Testing for opcode values in order
>of frequency of occurrence optimizes for the 1-gram entropy of opcodes. While
>we expect the conditional entropy of a given opcode to be high, I would
>still think that a bigram optimization would be a benefit. One wouldn't have
>to compute all of the conditional probability distributions for every opcode
>to do this. Just do it for the statistically most common bigrams, including
>checks of the next opcode against the most probable successor in the current
>opcode's handler code.
Chapter 4.6 of Wirth's Algorithms and Data Structures
(http://www.ethoberon.ethz.ch/WirthPubl/AD.pdf) discusses the
construction of an optimal search tree given access probabilities.
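In terms of the computed-goto sketch further up, your bigram idea
would amount to something like this in each handler (the assumed most
probable successor is invented):

do_add:
  sp[-1] += sp[0]; sp--;
  if (*ip == &&do_lit) {  /* assumed most probable successor */
    ip++;                 /* an easily predicted conditional branch */
    *++sp = (long)*ip++;  /* execute the literal push inline */
    NEXT;
  }
  NEXT;                   /* general dispatch otherwise */

Whether the extra compare pays off depends on how dominant the bigram
actually is; with a flat successor distribution you just put a
hard-to-predict conditional branch in front of the indirect one.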
- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/