Why is the fib benchmark still slow - part 1

Python runs the fib test roughly twice as fast as Parrot. I don't like
that ;) So what's the problem?

1) CPU cache issues

First, if you'd like to investigate the issue yourself (and you run
i386/linux or similar, and don't have it yet), get valgrind. Great
tool. It contains a plugin called cachegrind which shows detailed
information about executed instructions, including cache misses.

For "parrot -j -Oc examples/benchmarks/fib.imc" [1] (and perl, python,
...) we get:

                    IRefs    DRefs    Time    L2 Misses
  Perl 5.8.0        2.0e9    1.2e9    2.5s        7.600
  Perl 5.8.0-thr    2.2e9    1.4e9    3.0s        8.500
  Python 2.3.3      1.2e9    0.7e9    1.4s       54.000
  Parrot IMS        1.1e9    0.7e9    3.3s    7.000.000
  Parrot MS         1.7e9    0.7e9    2.6s    4.800.000
  Lua 5.0 [2]       0.7e9    0.4e9    0.85s !!!   4.000

  IRefs ... instructions executed
  DRefs ... data reads + writes
  IMS   ... Parrot with incremental M&S (settings.h), compiled -O3
  MS    ... Parrot CVS with stop-the-world M&S, -O3

DRefs are boring - except that Perl5 takes 50% more. IRefs are quite
interesting: IMS has the fewest, even less than Python. But the
benchmark timings differ totally: IMS is slower than MS, and both are
much slower than Python.

The reason is in the last column. The valgrind manual states (and the
numbers above second it) that *one Level 2 cache miss takes roughly
the time of 200 instructions*. These 7 mio cache misses account for
~1.4e9 IRefs, which more than doubles the execution time. Timings
might be better on a fat Xeon CPU, but that doesn't really matter
(above is with an AMD 800 and 256 KB L2 cache).

The cache misses happen basically in two places:

a) accessing a new register frame and context
b) during DOD/GC

We have to address both areas to get rid of the majority of the cache
misses.

ad a) For each level of recursion, we are allocating a new context
structure and a new register frame. Half of these comes from the
recently implemented return continuation and register frame caches.
The other half has to be freshly allocated.
We get an L2 cache miss for exactly every second function call, for
both the context and the register structure. We can't do much against
the cache misses in the context, except make it as small as possible,
e.g. by moving the now unused old register stack pointers (4 words)
out of the context, or by tossing these 4 pointers entirely.

But we can address the problem with the register frame by - well -
making it smaller. Python runs the code on the stack; it mostly
touches only SP(0) and SP(1). Register usage analysis of fib shows
that it could run on Parrot with just *one* persistent register per
kind. More about that is coming and was already said: "sliding
register window".

ad b) The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
possibility to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the object memory itself isn't
touched; just a nibble per object is used, which holds the relevant
information like "is_live".

A second way to address the DOD issues is to make the PMCs variable
sized. I've proposed that already, not too long ago.

Third, and independent of these, is to not touch most of the memory at
all, by implementing a generational garbage collector. After an object
has survived a few GC cycles, it's moved into an older generation,
which isn't subject to a GC sweep.

Both the "make PMCs variable sized" and the generational GC very
likely need another indirection to access a PMC. But by avoiding just
one cache miss, you can do 200 of such indirect accesses.

Thanks for reading til here & comments welcome,
leo

[1] fib.imc uses integers, which is already an optimization. But
    that's not the point here.

[2] valgrind --skin=cachegrind /opt/src/lua-5.0/bin/lua
    /opt/src/lua-5.0/test/fib.lua 28
    and that's even comparing a memoized fib with a plain one, too.