Why is the fib benchmark still slow - part 1

Python runs the fib test roughly twice as fast as Parrot. I don't like
that ;) So what's the problem?

1) CPU cache issues

First, if you'd like to investigate the issue yourself (and you run
i386/linux or similar, and don't have it yet), get valgrind. Great
tool. It contains a plugin called cachegrind which shows detailed
information about executed instructions, including cache misses.

For "parrot -j -Oc examples/benchmarks/fib.imc" [1] (and perl, python,
...) we get:

                    IRefs    DRefs    Time    L2 Misses
  Perl 5.8.0        2.0e9    1.2e9    2.5s        7.600
  Perl 5.8.0-thr    2.2e9    1.4e9    3.0s        8.500
  Python 2.3.3      1.2e9    0.7e9    1.4s       54.000
  Parrot IMS        1.1e9    0.7e9    3.3s    7.000.000
  Parrot MS         1.7e9    0.7e9    2.6s    4.800.000
  Lua 5.0 [2]       0.7e9    0.4e9    0.85s !!!   4.000

  IRefs ... instructions executed
  DRefs ... data reads + writes
  IMS   ... Parrot with incremental M&S (settings.h), compiled -O3
  MS    ... Parrot CVS with stop-the-world M&S, -O3

DRefs are boring - except that Perl5 takes 50% more. IRefs are quite
interesting: IMS has the fewest, even less than Python. But the
benchmark timings differ totally: IMS is slower than MS, and both are
much slower than Python.

The reason is in the last column. The valgrind manual states (and the
numbers above second it) that *one Level 2 cache miss takes roughly
the time of 200 instructions*. These 7 mio cache misses account for
~1.4e9 IRefs, which more than doubles the execution time. Timings
might be better on a fat Xeon CPU, but that doesn't really matter
(above is with an AMD 800 and 256 KB L2 cache).

The cache misses happen basically in two places:

a) accessing a new register frame and context
b) during DOD/GC

We have to address both areas to get rid of the majority of the cache
misses.

ad a) For each level of recursion, we are allocating a new context
structure and a new register frame. Half of these comes from the
recently implemented return continuation and register frame caches.
The other half has to be freshly allocated.
We get an L2 cache miss for exactly every second function call, for
both the context and the register structure. We can't do much against
the cache misses in the context, except make it as small as possible,
e.g. by moving the now unused old register stack pointers (4 words)
out of the context, or by tossing these 4 pointers entirely.

But we can address the problem with the register frame by - well -
making it smaller. Python runs the code on the stack; it mostly
touches only SP(0) and SP(1). Register usage analysis of fib shows
that it could run on Parrot with just *one* persistent register per
kind. More about that is coming and was already said: "sliding
register window".

ad b) The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
possibility to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the object memory itself isn't
touched; just a nibble per object is used, which holds the relevant
information like "is_live".

A second way to address the DOD issues is to make the PMCs variable
sized. I've proposed that already, not too long ago.

Third, and independent of these, is to not touch most of the memory at
all, by implementing a generational garbage collector. After an object
has survived a few GC cycles, it's moved into an older generation,
which isn't subject to a GC sweep.

Both the "make PMCs variable sized" and the generational GC very
likely need another indirection to access a PMC. But by avoiding just
one cache miss, you can do 200 of such indirect accesses.

Thanks for reading til here & comments welcome,
leo

[1] fib.imc uses integers, which is already an optimization. But
    that's not the point here.

[2] valgrind --skin=cachegrind /opt/src/lua-5.0/bin/lua
    /opt/src/lua-5.0/test/fib.lua 28
    and that's even comparing a memoized fib with a plain one, too.