
Opcode handler dispatch in an interpreter: Implementing switch on OpCode.

Robert Jacobson

Oct 5, 2017, 3:05:16 PM

I am trying to wrap my mind around the issue of dynamic dispatch in the
context of switching on opcode in a bytecode interpreter. I came across Julian
Squires' recent well-cited blog post (via r/programming), "Are Jump Tables
Always Fastest?"
(http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html). There
appears to be a sizable literature on the subject. Optimal designs apparently
require careful consideration of the processor's branch prediction capability.
I have some questions.

Switching on opcode number is a special case of a general switch.
Specifically:
1. opcodes are dense and contiguous;
2. opcodes typically take values over a relatively small range.
For example, CPython's opcodes range from 1 to 161 (excepting EXCEPT_HANDLER).
The jump table is thus a perfect hash: We need only store the addresses of the
opcode handler functions in a smallish table and use the opcode value as an
index into the table. Given (1) and (2), it's not clear to me that, say, a
binary search with Log_2(161) ~ 8 comparisons is better than the lookup table.
Is that really what the recent literature is saying? Are concerns about cache
misses really valid for a table of 161 addresses? Is a load instruction that
much more expensive than comparison instructions with branch prediction?
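
(For concreteness, here is the kind of lookup-table dispatch I have in mind: a
minimal C sketch, with two invented handlers and a three-byte "program"
standing in for the ~161 real opcodes.)

#include <stdio.h>

typedef void (*handler_t)(void);

static int acc;
static void op_inc(void)   { acc++; }
static void op_print(void) { printf("%d\n", acc); }

/* The "perfect hash": the opcode value is a direct index into the table.
   A real table would simply be sized to the largest opcode value. */
static handler_t table[] = { op_inc, op_print };

int main(void)
{
    unsigned char code[] = { 0, 0, 1 };      /* inc, inc, print */
    for (size_t pc = 0; pc < sizeof code; pc++)
        table[code[pc]]();                   /* one load + one indirect call */
    return 0;                                /* output: 2 */
}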

I also have a question about optimization. Testing for opcode values in order
of frequency of occurrence optimizes for the 1-gram entropy of opcodes. While
we expect the conditional entropy of a given opcode to be high, I would still
think that a bigram optimization would be a benefit. One wouldn't have
to compute all of the conditional probability distributions for every opcode
to do this. Just do it for the statistically most common bigrams, including
checks of the next opcode against the most probable successor in the current
opcode's handler code. Is this ever done in practice? Why or why not?
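
(Concretely, the kind of check I mean, as a toy C sketch: the opcodes, their
semantics, and the choice of "most probable successor" are all invented for
illustration.)

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

int main(void)
{
    /* programs are assumed to end in OP_HALT, so peeking at code[pc] below
       never runs off the end */
    unsigned char code[] = { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    int stack[16], sp = 0, acc = 0, pc = 0;

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH1:
            stack[sp++] = 1;
            if (code[pc] == OP_ADD) {        /* assumed most common bigram */
                pc++;                        /* consume the ADD here ...   */
                acc += stack[--sp];          /* ... and skip its dispatch  */
            }
            continue;
        case OP_ADD:
            acc += stack[--sp];
            continue;
        case OP_PRINT:
            printf("%d\n", acc);             /* prints 1 for this program */
            continue;
        case OP_HALT:
            return 0;
        }
    }
}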

Best,

Robert

bartc

Oct 6, 2017, 11:26:29 AM

On 05/10/2017 17:29, Robert Jacobson wrote:
> I am trying to wrap my mind around the issue of dynamic dispatch in the
> context of switching on opcode in a bytecode interpreter. I came across Julian
> Squires' recent well-cited blog post (via r/programming), "Are Jump Tables
> Always Fastest?" ...

> Switching on opcode number is a special case of a general switch.
> Specifically:
> 1. opcodes are dense and contiguous;
> 2. opcodes typically take values over a relatively small range.
> For example, CPython's opcodes range from 1 to 161 (excepting EXCEPT_HANDLER).
> The jump table is thus a perfect hash: We need only store the addresses of the
> opcode handler functions in a smallish table and use the opcode value as an
> index into the table. Given (1) and (2), it's not clear to me that, say, a
> binary search with Log_2(161) ~ 8 comparisons is better than the lookup table.

I don't understand what this binary search is looking for here. You've
got an index of 1 to 161, and a table of addresses indexed from 1 to 161;
what more is needed? You just do an index operation to find the address
to jump to for that byte-code.

> Is that really what the recent literature is saying? Are concerns about cache
> misses really valid for a table of 161 addresses? Is a load instruction that
> much more expensive than comparison instructions with branch prediction?
>
> I also have a question about optimization. Testing for opcode values in order
> of frequency of occurrence optimizes for the 1-gram entropy of opcodes. While
> we expect the conditional entropy of a given opcode to be high, I would still
> think that a bigram optimization would be a benefit. One wouldn't have
> to compute all of the conditional probability distributions for every opcode
> to do this. Just do it for the statistically most common bigrams, including
> checks of the next opcode against the most probable successor in the current
> opcode's handler code. Is this ever done in practice? Why or why not?

Are you trying to implement something or doing academic research?

I can only give my own practical experience based on four kinds of
dispatchers I've used (minimal C sketches of the first three follow the
list):

(1) Function-pointer dispatcher.

Within the byte-code program, byte-code indices (c. 1 to 200 in my
interpreter) are replaced before starting by the actual address of the
corresponding handler. Then there is a very simple loop that passes
control directly to the handler for the current byte-code.

(2) Switch-based dispatcher.

Indices are retained within the byte-code, and used to jump to one of
the 200 different case-labels (in C jargon). Most of those will then
perform a function-call (but if using C, the compiler is likely to inline
those functions).

(3) Label-pointer dispatcher.

Similar to the block of 200 case-labels, there are 200 ordinary labels,
each of which will again call the handler (which may again be inlined).

Within the byte-code, each index is replaced by the address of the
corresponding label (if using C, the compiler needs support for this
feature).

(4) Thread-code dispatcher.

Within the byte-code, each index is replaced by the address of the
corresponding handler, each of which is now a threaded-code function
(jump into the function, jump out again straight to the next handler).
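
For concreteness, here are minimal C sketches of (1), (2) and (3). The
handlers, opcodes and "byte-code" programs are invented, the handler work is
inlined rather than called, and operands are ignored; take them as
illustrations only.

(1) Function-pointer dispatch, after the loader has replaced each index with
the address of its handler:

#include <stdio.h>

typedef void (*handler_t)(void);

static int acc;
static void k_inc(void)   { acc++; }
static void k_print(void) { printf("%d\n", acc); }
static void k_stop(void)  { }               /* sentinel */

int main(void)
{
    /* the "byte-code" after the loader pass: handler addresses, not indices */
    handler_t prog[] = { k_inc, k_inc, k_print, k_stop };

    for (int pc = 0; prog[pc] != k_stop; )  /* the very simple dispatch loop */
        prog[pc++]();
    return 0;                               /* output: 2 */
}

(2) Switch dispatch on the retained indices (with a couple of hundred dense
cases a C compiler will normally turn the switch into a jump table):

#include <stdio.h>

enum { K_INC, K_PRINT, K_STOP };

int main(void)
{
    unsigned char prog[] = { K_INC, K_INC, K_PRINT, K_STOP };
    int acc = 0, pc = 0;

    for (;;) {
        switch (prog[pc++]) {
        case K_INC:   acc++;               break;
        case K_PRINT: printf("%d\n", acc); break;   /* prints 2 */
        case K_STOP:  return 0;
        }
    }
}

(3) Label-pointer dispatch using the GNU C labels-as-values extension
(gcc/clang); the byte-code holds label addresses, and every handler ends with
its own dispatch:

#include <stdio.h>

int main(void)
{
    int acc = 0, pc = 0;
    /* the "byte-code" after the loader pass: label addresses */
    void *prog[] = { &&k_inc, &&k_inc, &&k_print, &&k_stop };

    goto *prog[pc++];
k_inc:   acc++;               goto *prog[pc++];
k_print: printf("%d\n", acc); goto *prog[pc++];   /* prints 2 */
k_stop:  return 0;
}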

In terms of performance, (1) was usually the slowest, (2) faster, and
(3) fastest.

A direct comparison with (4) is difficult, because this dispatcher was
written as an ASM-based overlay on top of (1). If no other optimisations
are done, then this overlay (4) ends up slightly slower than (1).

(In practice, (4) will directly handle many simple byte-codes in in-line
assembly, and it will keep global values resident in registers as much
as it can. So it generally ends up the fastest of all.)

The overheads of byte-code dispatch also depend on the kind of language
being interpreted. For my dynamic one, which also needs to deal with
type-dispatch, and where some opcodes can do a lot of work, they might
not be as significant as for lower-level languages.

(I've since dropped dispatchers (2) and (3), and use either (1) or (4).)

--
bartc

Hans-Peter Diettrich

Oct 6, 2017, 11:26:54 AM

Am 05.10.2017 um 18:29 schrieb Robert Jacobson:
> I am trying to wrap my mind around the issue of dynamic dispatch in the
> context of switching on opcode in a bytecode interpreter

> Switching on opcode number is a special case of a general switch.
> Specifically:
> 1. opcodes are dense and contiguous;

This is not a requirement; it can be worked around (see below).

> 2. opcodes typically take values over a relatively small range.

ACK

> For example, CPython's opcodes range from 1 to 161 (excepting EXCEPT_HANDLER).
> The jump table is thus a perfect hash: We need only store the addresses of the
> opcode handler functions in a smallish table and use the opcode value as an
> index into the table. Given (1) and (2), it's not clear to me that, say, a
> binary search with Log_2(161) ~ 8 comparisons is better than the lookup table.
> Is that really what the recent literature is saying? Are concerns about cache
> misses really valid for a table of 161 addresses? Is a load instruction that
> much more expensive than comparison instructions with branch prediction?

I second your doubts, at least once more values have to be considered
than are used in the examples.

I don't see how branch prediction will have an impact on runtime, when
the prediction only applies to conditional branches, and the next opcode
is one-of-many. If the cache can hold the jump table, nothing can be
improved with this model. Consider how many *possible* comparisons and
branches must be encoded, so that they will lead to *all* possible (161)
target addresses.


Many years ago I saw a 68k emulator where the opcode was multiplied by
16 and used as an offset into the emulator code, eliminating the branch
table. The 16-byte blocks cover the emulator code for most instructions,
and the factor can be scaled to any sufficiently large block size if
required. If some bytecodes require more than one block of executable
code, the codes can be made discontiguous, with gaps wherever the
code-to-execute is longer than the chosen blocking factor.

Of course such an approach lacks any cache support, but a large cache
may still hold the most frequently used code sequences automatically. If
the cache is too small to hold the code for multiple opcodes, it's of no
use at all.

DoDi

Anton Ertl

Oct 6, 2017, 9:31:23 PM

Robert Jacobson <rljac...@gmail.com> writes:
>I am trying to wrap my mind around the issue of dynamic dispatch in the
>context of switching on opcode in a bytecode interpreter. I came across Julian
>Squires' recent well-cited blog post (via r/programming), "Are Jump Tables
>Always Fastest?"
>(http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html).

The data he shows at the end is not representative of what an
interpreter sees:

First of all, on every dispatch he eventually calls a function that
invalidates the cache line at the start of the dispatch function.
That makes each iteration of the benchmark very slow (reflected in the
low instructions per cycle) and not at all dominated by the dispatch
(unlike efficient interpreters); conversely, efficient interpreters
typically run out of the I-cache much of the time, unless they call
some big library function (and then the dispatch overhead becomes
irrelevant).

Also, he calls one function 80% of the time, and a random one out of
255 the rest of the time; that's not representative of an interpreter.
An interpreter tends to repeat sequences of dispatches (e.g., in loops
and in called functions), and a history-based branch predictor can
exploit that. How well that works out with indirect branches vs. a
tree of conditional branches has to be measured, and may vary from CPU
to CPU.

Note that you also have to have benchmarks of significant size with
current CPUs. He cites "Branch Prediction and the Performance of
Interpreters - Don't Trust Folklore"; I ran experiments with
substantial Forth programs on Ivy Bridge and Haswell, and found that
the interpreter optimizations we use in Gforth are still very
beneficial, even on Haswell.

The dependence on the actual CPU is also true for his benchmark: On a
Ryzen 1800X his x86_64-vtable beats x86_64-binary for the same
parameters:

Performance counter stats for './x86_64-binary 5000000' (5 runs):

739.637786 task-clock (msec)
62 context-switches
0 cpu-migrations
21 page-faults
2994569072 cycles
5108587 stalled-cycles-frontend
50210715 stalled-cycles-backend
284352758 instructions

82447723 branches
5808693 branch-misses

0.740464011 seconds time elapsed


Performance counter stats for './x86_64-vtable 5000000' (5 runs):

713.539066 task-clock (msec)
60 context-switches
0 cpu-migrations
22 page-faults
2885008628 cycles
14138003 stalled-cycles-frontend
15179665 stalled-cycles-backend
218699025 instructions

52264833 branches
3759984 branch-misses

0.714335864 seconds time elapsed

>Switching on opcode number is a special case of a general switch.
>Specifically:
>1. opcodes are dense and contiguous;
>2. opcodes typically take values over a relatively small range.
>For example, CPython's opcodes range from 1 to 161 (excepting EXCEPT_HANDLER).
>The jump table is thus a perfect hash: We need only store the addresses of the
>opcode handler functions in a smallish table and use the opcode value as an
>index into the table. Given (1) and (2), it's not clear to me that, say, a
>binary search with Log_2(161) ~ 8 comparisons is better than the lookup table.
>Is that really what the recent literature is saying?

Not that I know of.

> Are concerns about cache
>misses really valid for a table of 161 addresses?

No. That would be <1.3KB on a 64-bit machine, or, for the 256 cases
used in the benchmark, 2KB. By comparison, the binary-search dispatch
function used for the benchmark above is 2567 bytes longer than the
table dispatch code, so what you gain in D-cache, you more than lose
in I-cache.

> Is a load instruction that
>much more expensive than comparison instructions with branch prediction?

If the indirect branch predicts correctly, the load is essentially for
free. If it doesn't, it increases the misprediction penalty (~20
cycles) by its latency (4 cycles). In the correctly predicted case,
the indirect branch itself costs one cycle or so.

For your binary tree of conditional branches, processing the tree
costs some cycles even if correctly predicted. E.g., assume that the
core can process one taken and one not-taken branch per cycle. With
our 8-deep tree, even if we can always process two branches per cycle,
that's 4 cycles spent in dispatch; and some cases actually take 8, all
with correct prediction. You may be able to get this down by
skewing the tree to have shorter depths (and beneficial code
arrangements) for the more frequent instructions, but in the end the
minimal cost of this approach will be higher.
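
(To make the comparison concrete, a toy version of dispatch via a tree of
conditional branches, with four invented opcodes; here the tree is only 2
deep, i.e. two conditional branches per dispatch instead of one table load
plus an indirect branch.)

#include <stdio.h>

enum { K_INC = 0, K_DEC = 1, K_PRINT = 2, K_STOP = 3 };

int main(void)
{
    unsigned char prog[] = { K_INC, K_INC, K_DEC, K_PRINT, K_STOP };
    int acc = 0, pc = 0;

    for (;;) {
        int op = prog[pc++];
        if (op <= K_DEC) {                  /* left half of the tree */
            if (op == K_INC) acc++;
            else             acc--;
        } else {                            /* right half */
            if (op == K_PRINT) printf("%d\n", acc);   /* prints 1 */
            else               return 0;              /* K_STOP */
        }
    }
}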

However, the big cost is the mispredictions; they are hard to predict
for interpreters, so better measure them.

My feeling, though, is that the binary tree loses for such an
interpreter on current CPUs.

>I also have a question about optimization. Testing for opcode values in order
>of frequency of occurrence optimizes for the 1-gram entropy of opcodes. While
>we expect the conditional entropy of a given opcode to be high, I would still
>think that a bigram optimization would be a benefit. One wouldn't have
>to compute all of the conditional probability distributions for every opcode
>to do this. Just do it for the statistically most common bigrams, including
>checks of the next opcode against the most probable successor in the current
>opcode's handler code.

Chapter 4.6 of http://www.ethoberon.ethz.ch/WirthPubl/AD.pdf discusses
the construction of an optimal search tree given access probabilities.
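
(For reference, a sketch of that kind of construction: the classic O(n^3)
dynamic program for an optimal binary search tree, here over five opcodes
with invented access probabilities; e[1][N] is the expected number of
comparisons for the optimal tree.)

#include <stdio.h>

#define N 5

int main(void)
{
    /* invented access probabilities for opcodes 1..N, summing to 1 */
    double p[N + 1] = { 0, 0.40, 0.25, 0.15, 0.12, 0.08 };
    double e[N + 2][N + 1] = { { 0 } };  /* e[i][j]: expected comparisons, i..j */
    double w[N + 2][N + 1] = { { 0 } };  /* w[i][j]: total probability of i..j  */

    for (int len = 1; len <= N; len++) {
        for (int i = 1; i + len - 1 <= N; i++) {
            int j = i + len - 1;
            w[i][j] = w[i][j - 1] + p[j];
            e[i][j] = 1e30;
            for (int r = i; r <= j; r++) {          /* try each opcode as root */
                double c = e[i][r - 1] + e[r + 1][j] + w[i][j];
                if (c < e[i][j])
                    e[i][j] = c;
            }
        }
    }
    printf("expected comparisons for the optimal tree: %.3f\n", e[1][N]);
    return 0;
}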

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

George Neuner

Oct 7, 2017, 1:26:16 PM

On Fri, 6 Oct 2017 02:56:06 +0200, Hans-Peter Diettrich
<DrDiet...@netscape.net> wrote:

>Am 05.10.2017 um 18:29 schrieb Robert Jacobson:
>> I am trying to wrap my mind around the issue of dynamic dispatch in the
>> context of switching on opcode in a bytecode interpreter ...
>
>I don't see how branch prediction will have an impact on runtime, when
>the prediction only applies to conditional branches, and the next opcode
>is one-of-many. If the cache can hold the jump table, nothing can be
>improved with this model. Consider how many *possible* comparisons and
>branches must be encoded, so that they will lead to *all* possible (161)
>target addresses.

Unfortunately, on modern CPUs, *all* branches are predicted. An
indirect jump through the table will mispredict virtually every time.
The same will be true of an indirect jump via register based address.

The best you can do with an interpreter is to have all the code in L1
code cache. As soon as you have to go to L2 (which typically is
shared between code and data) or deeper, you risk taking large hits if
the code is not resident.

comp.lang.asm.x86 has seen extensive discussions of mispredict
problems in interpreters and JIT compiled code. The conclusions there
are applicable to most CPU architectures.


>Many years ago I saw a 68k emulator where the opcode was multiplied by
>16 and used as an offset into the emulator code, eliminating the branch
>table. The 16-byte blocks cover the emulator code for most instructions,
>and the factor can be scaled to any sufficiently large block size if
>required. If some bytecodes require more than one block of executable
>code, the codes can be made discontiguous, with gaps wherever the
>code-to-execute is longer than the chosen blocking factor.
>
>Of course such an approach lacks any cache support, but a large cache
>may still hold the most frequently used code sequences automatically. If
>the cache is too small to hold the code for multiple opcodes, it's of no
>use at all.

Caching is relevant mainly to minimizing the effect of the mispredict
- i.e. whether it takes 15 cycles or 315 cycles to get the pipeline
moving again. Minimizing mispredicts buys you more than caching.

George

bartc

Oct 7, 2017, 2:37:43 PM

On 07/10/2017 17:55, George Neuner wrote:
>> Am 05.10.2017 um 18:29 schrieb Robert Jacobson:
>>> I am trying to wrap my mind around the issue of dynamic dispatch in the
>>> context of switching on opcode in a bytecode interpreter ...

> Unfortunately, on modern CPUs, *all* branches are predicted. An
> indirect jump through the table will mispredict virtually every time.
> The same will be true of an indirect jump via register based address.
>
> The best you can do with an interpreter is to have all the code in L1
> code cache. As soon as you have to go to L2 (which typically is
> shared between code and data) or deeper, you risk taking large hits if
> the code is not resident.
>
> comp.lang.asm.x86 has seen extensive discussions of mispredict
> problems in interpreters and JIT compiled code. The conclusions there
> are applicable to most CPU architectures.

I tried an experiment a few years ago, where the byte-code for a test
program was expanded into a sequence of instructions in a statically
compiled language (or it might have been done at run-time; I can't
remember). Each byte-code was represented by some mix of function
call, or some inline code.

This would have been expected to benefit by eliminating dispatch
overheads (it just steps into the next lot of code like an ordinary
program), and also by having dedicated code for some things that were
otherwise a parameter in a generic byte-code instruction.

But in fact the results weren't that great at all; most normal
dispatchers were actually faster!

Perhaps it was affected by having to fill up the instruction cache
with 1000 copies of the same PUSH code sequence, instead of re-using
the same single copy when running the byte-code dispatcher.

--
bartc

Anton Ertl

Oct 8, 2017, 2:41:55 PM

George Neuner <gneu...@comcast.net> writes:
>Unfortunately, on modern CPUs, *all* branches are predicted. An
>indirect jump through the table will mispredict virtually every time.
>The same will be true of an indirect jump via register based address.

No. Since the Pentium, we have had indirect branch predictors. For
the early BTB-based indirect branch predictors, you get about 50%-60%
mispredictions in a virtual machine (VM) interpreter if you have a
separate dispatch per VM instruction (and approximately 100%
mispredictions if you have a shared dispatch routine). See
<https://www.jilp.org/vol5/v5paper12.pdf>.

In the meantime at least Intel (and probably others) have improved
indirect branch prediction, giving much better results. The paper
"Branch Prediction and the Performance of Interpreters - Don't Trust
Folklore" even concluded that branch mispredictions are no longer a
performance issue for interpreters; my own tests
<2015Aug1...@mips.complang.tuwien.ac.at>
<2015Sep...@mips.complang.tuwien.ac.at> did not confirm this
conclusion, but still, the branch prediction has been improved by a
lot.

>The best you can do with an interpreter is to have all the code in L1
>code cache. As soon as you have to go to L2 (which typically is
>shared between code and data) or deeper, you risk taking large hits if
>the code is not resident.

That's the case for any code and any data. And yet we found that an
optimization that expands code size by a lot (far out of L1 size)
<http://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz> gives a
very good speedup by reducing the number of mispredictions. The
postings referenced above show that this optimization is still useful
even with Haswell's indirect branch predictor (I have not repeated the
measurements on Skylake or Zen, but what I have seen on these
processors indicates that the optimization still provides a speedup).
So, no, the best you can do is not to have all the code in the L1
cache; other considerations also play a role, and caches usually also
work if the code/data is larger than the cache. A simple heuristic
like the one you propose is not good enough. You have to try it out
and measure.

David Gregg and I once tried out whether it would be beneficial to
have smaller dynamic superinstructions (by splitting long dynamic
superinstructions in several parts) that could be shared between more
uses, resulting in lower code size while still giving most of the
indirect branch prediction benefit. We found that the code was
smaller, but the result had more I-cache misses than dynamic
superinstructions; we explained that by having worse spatial locality:
If a long dynamic superinstruction occupies one cache line, and is
split in two parts that reside in different parts of the code space,
now at least 2 cache lines need to be resident to execute the same
code. We did not publish that work.

George Neuner

Oct 8, 2017, 2:43:00 PM

Possibly.

Some of the things to come out of the discussions in c.l.a.x86 were
that the fastest [simple] JIT sequence for an interpreter is a list of
explicit call instructions: e.g.,

call <bytecode_function>
call <bytecode_function>
:

rather than a list of function addresses with a dispatcher. This is
because the call leverages the CPU branch predictor rather than
fighting with it.

Also, if you use an address list coding, tail threading[*] the
interpreter is usually a win. Threading is better than a central
dispatcher in cases where "instruction" sequences are reused often in
(roughly) the same order ... as in loops.

George
[*] in tail threading, instead of returning to a central dispatcher,
each function ends with a bit of dispatch logic that jumps directly to
the next function.
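
(A minimal C sketch of this tail-threading arrangement: the program is a list
of handler addresses and each handler ends with its own dispatch. Handlers
and the program are invented; with gcc or clang at -O2 the tail calls are
normally compiled as sibling-call jumps rather than real calls.)

#include <stdio.h>

typedef void handler_fn(int pc);            /* pc indexes into prog[] */

static handler_fn k_inc, k_print, k_stop;   /* forward declarations */
static handler_fn *prog[] = { k_inc, k_inc, k_print, k_stop };
static int acc;

#define NEXT(pc) prog[(pc)]((pc) + 1)       /* the per-handler dispatch logic */

static void k_inc(int pc)   { acc++;               NEXT(pc); }
static void k_print(int pc) { printf("%d\n", acc); NEXT(pc); }   /* prints 2 */
static void k_stop(int pc)  { (void)pc; }   /* end of program: just return */

int main(void)
{
    NEXT(0);                                /* execute the first "instruction" */
    return 0;
}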

bartc

Oct 8, 2017, 7:26:20 PM

On 08/10/2017 19:36, George Neuner wrote:
> On Sat, 7 Oct 2017 19:05:10 +0100, bartc <b...@freeuk.com> wrote:

>> I tried an experiment a few years ago, where the byte-code for a test
>> program was expanded into a sequence of instructions in a statically
>> compiled language (or it might have been done at run-time; I can't
>> remember). Each byte-code was represented by some mix of function
>> call, or some inline code.
>>
>> This would have been expected to benefit by eliminating dispatch
>> overheads (it just steps into the next lot of code like an ordinary
>> program), and also by having dedicated code for some things that were
>> otherwise a parameter in a generic byte-code instruction.
>>
>> But in fact the results weren't that great at all; most normal
>> dispatchers were actually faster!

> Some of the things to come out of the discussions in c.l.a.x86 were
> that the fastest [simple] JIT sequence for an interpreter is a list of
> explicit call instructions: e.g.,
>
> call <bytecode_function>
> call <bytecode_function>
> :
>
> rather than a list of function addresses with a dispatcher. This is
> because the call leverages the CPU branch predictor rather than
> fighting with it.

I think that's the sort of thing I was doing. Of course when it comes to
function calls, conditional code and so on (all things that require
jumping about in the byte-code), it starts to get mixed up with the
language used to express those calls.

But I thought I'd do one quick experiment on one specific program in the
interpreted language (a very poor benchmark but just trying to see if
the runtime can be improved):

  i:=0

  while i<100 million do
    ++i
  end

So a while-loop counting to 100 million. The normal byte-code that is
generated for that fragment:

2: 00006 ----------push_ci 0d
3: 00008 ----------pop_f [while.start.i:-1]
4: 00010 ----------jump L14
5: 00012 L12:
----------incrto_f [while.start.i:-1]
4: 00014 L14:
----------push_f [while.start.i:-1]
4: 00016 ----------push_ci 100000000d
4: 00018 ----------jumplt L12

These are normally executed in some sort of loop (not the loop indicated
here). The experiment emulates five of those byte-codes, by implementing
them as functions callable from static code representing this program,
as shown below.

The normally interpreted byte-code took 2.2 seconds.
The static version took 1.9 seconds.

A small difference. But this code uses my own poor x64 compiler. If
converted to C and compiled with gcc -O3 -m64, then the figures are 1.7
seconds and 0.75 seconds. (1.8 and 1.1 using -m32.)

So a significant difference this time, up to twice the speed. However,
that's comparing against the slow function-pointer dispatcher; a
switch- or label-pointer-based dispatcher would be rather brisker. But
the static code would still be quite a bit faster I think.

(But note: the very small number of handlers in my test can be easily
inlined by gcc, especially as they are only called once or twice. It's
might be a different story when they are called 100s of times. And also
the code might be generated via JIT methods; inlining too much could
have its own problems.)

There's one more comparison to make however: normally I run my byte-code
using my ASM overlay dispatcher, coupled with my poor compiler. For this
test, it runs the normal byte-code in 0.55 seconds. Faster than the
static code compiled with gcc -O3.


-------------------------------------------------------------------

The test function 'testwhile()', and the five special byte-code
handlers, written in the same static language as the interpreter, and
hidden away in a corner of the interpreter.

The local 'i' variable of the original has been changed to a static 'ii'
variable here to simplify matters (no stack frame has been created):

varrec ii

proc testwhile= # emulate the byte-code of the 'while' test

do_push_ci(0)
do_pop_m(&ii)

goto L14
L12:
do_incrto_m(&ii)
L14:
do_push_m(&ii)
do_push_ci(100 million)
if do_lt() then goto L12 fi

println "Done, ii = ",ii.value
end

proc do_push_ci(int a)=
(--sptr)^.tagx:=tint
sptr^.value:=a
end

proc do_pop_m(ref varrec a)=
vx_ufree(a) when a^.hasref
a^:=sptr++^
end

proc do_incrto_m(ref varrec a)=

switch ttbasetype[a^.tag]
when tint then
++a^.value
when trefvar then
++a^.varptr
when trefpacked then
a^.ptr+:=ttsize[a^.refelemtag]
else
pcustype("INCRTO_M",a)
endswitch
end

proc do_push_m(ref varrec a)=
(--sptr)^:= a^

if sptr^.hasref then
++sptr^.objptr^.refcount
fi
end

function do_lt:int=
ref varrec x,y
int xt,res

y:=sptr++
x:=sptr++
xt:=x^.tag

if xt<>y^.tag then
if xt:=vx_mixed(x,y) then
goto retry
fi
pcmxtypes("LT/MIXED",x,y)
fi
retry:

switch xt
when tint then
if x^.value < y^.value then
return 1
fi
..... etc
return 0
end

--
bartc

Hans-Peter Diettrich

Oct 11, 2017, 6:12:35 PM

Am 08.10.2017 um 20:36 schrieb George Neuner:

> Some of the things to come out of the discussions in c.l.a.x86 were
> that the fastest [simple] JIT sequence for an interpreter is a list of
> explicit call instructions: e.g.,
>
> call <bytecode_function>
> call <bytecode_function>
> :
>
> rather than a list of function addresses with a dispatcher. This is
> because the call leverages the CPU branch predictor rather than
> fighting with it.

I'm not sure there is a single common instruction cache and branch
prediction architecture; the behaviour may differ amongst architectures.
The memory bus interface instead should be similar in all architectures,
so that it should speed up execution if function code is aligned to the
machine's data bus width (the size of a cache line). This will make
available the highest possible number of sequential instructions after
the first read at a non-sequential (jump/call) address.
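
(In GNU C, one way to request that alignment is the "aligned" attribute on
functions, or -falign-functions=N for all of them; a minimal sketch, assuming
a 64-byte cache line.)

#include <stdio.h>

/* Each handler's first instruction is placed on a 64-byte boundary, so a
   call or jump to it lands at the start of a cache line. */
__attribute__((aligned(64))) void op_inc(int *acc)   { (*acc)++; }
__attribute__((aligned(64))) void op_print(int *acc) { printf("%d\n", *acc); }

int main(void)
{
    int acc = 0;
    op_inc(&acc);
    op_inc(&acc);
    op_print(&acc);    /* prints 2 */
    return 0;
}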

Tail threading instead can benefit from micro code, if that instruction
sequence can be implemented in a single dedicated micro code sequence.
Again this behaviour depends heavily on the particular machine
architecture. Perhaps more instruction sequences in the emulator code
can be optimized this way?

DoDi

Hans-Peter Diettrich

Oct 11, 2017, 6:13:06 PM

Am 08.10.2017 um 23:49 schrieb bartc:

> But I thought I'd do one quick experiment on one specific program in the
> interpreted language (a very poor benchmark but just trying to see if
> the runtime can be improved):
>
>  i:=0
>
>  while i<100 million do
>    ++i
>  end

Many years ago the AIX compiler came with a similar benchmark, with
computations inside the loop. It demonstrated that the code took a few
seconds on the AIX system, 8 minutes on another workstation, and on the
HP-UX machine I killed the process after one hour. It turned out that
the AIX compiler optimized out the useless computations, and the HP-UX
compiler included clumsy debug features into the code, all that by
default compiler settings. After adding an output statement to the code,
after the loop, it took 8 minutes even on the AIX system, and all
machines were equally fast with explicit settings of the compiler
switches :-]

Loop optimization and CSE have been researched and well known for ages, and
cheating in benchmarks is also common practice. With interpreted code or
JIT compilers the impact is not easily predictable, since in contrast to
statically (once) compiled code such optimizations would have to be
applied on every invocation of the program. Appropriate hints in the
bytecode may help here - doesn't .NET already allow such meta
information to be included in the code, for use by the JIT compiler?

DoDi

bartc

Oct 17, 2017, 3:55:00 PM

On 11/10/2017 23:13, Hans-Peter Diettrich wrote:
> Am 08.10.2017 um 23:49 schrieb bartc:
>
>> But I thought I'd do one quick experiment on one specific program in the
>> interpreted language (a very poor benchmark but just trying to see if
>> the runtime can be improved):
>>
>>  i:=0
>>
>>  while i<100 million do
>>    ++i
>>  end
>
> Many years ago the AIX compiler came with a similar benchmark, with
> computations inside the loop. It demonstrated that the code took a few
> seconds on the AIX system, 8 minutes on another workstation, and on the
> HP-UX machine I killed the process after one hour. It turned out that
> the AIX compiler optimized out the useless computations, and the HP-UX
> compiler included clumsy debug features into the code, all that by
> default compiler settings. After adding an output statement to the code,
> after the loop, it took 8 minutes even on the AIX system, and all
> machines were equally fast with explicit settings of the compiler
> switches :-]

That's always been a problem with comparing the performance of a
compiler's code with that of something like gcc -O3.

I ended up comparing how long my code took to do a task with how long
gcc took to /not/ do it!

Now I don't pay much attention to micro-benchmarks, and especially not
to gcc -O3. But I will use the latter to measure the runtime of a real
application, which is a fairer test.


> Loop optimization and CSE are researched and well known since ages, and
> cheating in benchmarks also is common practice. With interpreted code or
> JIT compilers the impact is not easily predictable, when in contrast to
> statically (once) compiled code such optimizations would have to be
> applied with every invocation of the program. According hints in the
> bytecode may help here - doesn't .NET already allow to include such meta
> information in the code, for use by the JIT compiler?

Actually there was a similar effect testing the PyPy version of the
Python interpreter. This is a very complicated JIT system (I think few
people know exactly how it works).

With simple integer benchmarks, it might typically have been double the
speed of my own accelerated interpreter (the one with the ASM overlay,
but still executing a byte-code at a time, while PyPy probably executed
some dedicated native code).

But if there was a loop involved, then increasing the count by a factor
of 10 usually made mine run ten times as long, while PyPy was only
slightly slower! That rendered comparisons pointless.

PyPy is good at optimising loops. In real applications, PyPy /can/ be
very good, if there are many loops with high iteration counts!

--
bartc

George Neuner

Oct 17, 2017, 3:55:31 PM

On Wed, 11 Oct 2017 18:12:33 -0400 (EDT), Hans-Peter Diettrich
<DrDiet...@netscape.net> wrote:

>Am 08.10.2017 um 20:36 schrieb George Neuner:
>
>> Some of the things to come out of the discussions in c.l.a.x86 were
>> that the fastest [simple] JIT sequence for an interpreter is a list of
>> explicit call instructions: e.g.,
>>
>> call <bytecode_function>
>> call <bytecode_function>
>> :
>>
>> rather than a list of function addresses with a dispatcher. This is
>> because the call leverages the CPU branch predictor rather than
>> fighting with it.
>
>I'm not sure there is a single common instruction cache and branch
>prediction architecture; the behaviour may differ amongst architectures.

Branch prediction strategies vary considerably, but for the last ~25
years or so, most new CPUs have had separate L1 code and data caches.
Very few carry the separation any deeper than L1, however.

There are cached MPUs and DSPs that still use a shared L1, but CPUs
migrated away from that structure long ago. [And, of course, CPUs
increasingly are being used as MPUs.]


>The memory bus interface instead should be similar in all architectures,
>so that it should speed up execution if function code is aligned to the
>machine's data bus width (the size of a cache line). This will make
>available the highest possible number of sequential instructions after
>the first read at a non-sequential (jump/call) address.

Yes and no. Remember that a cached CPU fetches from memory by lines,
not by instructions, and modern decoders are wide: accepting and
decoding whole cache lines at a time. On many machines there is no
problem picking out an instruction from the middle of the cache line.

That said, it is usually somewhat faster to _branch_ to a target
address that is aligned to the beginning of a cache line. But many
architectures do not actually require this even if there are other
alignment requirements for instructions (e.g., even address, etc.).


>Tail threading instead can benefit from micro code, if that instruction
>sequence can be implemented in a single dedicated micro code sequence.
>Again this behaviour depends heavily on the particular machine
>architecture. Perhaps more instruction sequences in the emulator code
>can be optimized this way?

Yes. When the same sequence of interpreter functions [bytecode
instructions] is used repeatedly, tail threading the functions will
train the branch predictor to anticipate that sequence.

This effectively turns the sequence into a single "macro" instruction
from the point of view of the branch predictor. As long as the
sequence is not varied too often, the tail threaded version will be
the most performant.

George