Stefan is exactly right, and I just want to echo that matrix multiply
really is a special case --- it is basically the only commonly-used
math kernel that is different *per cpu*. Not architecture (e.g. x86),
mind you, but tuned for specific CPU models. The triple loop will make
any language, including C, look bad. Fortran will probably do much
better, since it avoids C's aliasing problems and Fortran compilers
tend to apply more loop transformations.
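To make the triple loop concrete, here is roughly what I mean, written
in Julia (just a sketch for illustration; the function name is made up
and this is not code from base):

function naive_matmul(A::Matrix{Float64}, B::Matrix{Float64})
    m, n = size(A)
    n2, p = size(B)
    n == n2 || error("dimension mismatch")
    C = zeros(m, p)
    for i = 1:m, j = 1:p
        s = 0.0
        for k = 1:n
            # walks across a row of A, which is strided in column-major storage
            s += A[i,k]*B[k,j]
        end
        C[i,j] = s
    end
    return C
end

Reordering the loops helps somewhat, but without blocking there is
little cache reuse and no register-level kernel, which is where a tuned
BLAS gets its speed.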
Near the top of base/linalg_dense.jl there is a slightly more
optimized matrix multiply in Julia that uses some manual hoisting and
loop blocking, if you are curious.
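To give a flavor of what the hoisting and blocking buy you, here is a
rough sketch of the idea (not the code from linalg_dense.jl; the block
size is just a placeholder that a real implementation would tune):

function blocked_matmul(A::Matrix{Float64}, B::Matrix{Float64})
    m, n = size(A)
    n2, p = size(B)
    n == n2 || error("dimension mismatch")
    C = zeros(m, p)
    blk = 64   # block size; a tuned BLAS picks this per CPU model
    for jj = 1:blk:p, kk = 1:blk:n, ii = 1:blk:m
        # work on one blk-by-blk tile at a time so the operands stay in cache
        for j = jj:min(jj+blk-1, p)
            for k = kk:min(kk+blk-1, n)
                b = B[k,j]               # hoisted out of the innermost loop
                for i = ii:min(ii+blk-1, m)
                    C[i,j] += A[i,k]*b   # unit-stride access to A and C
                end
            end
        end
    end
    return C
end

Even this is still far from a tuned BLAS, but it closes some of the gap
over the naive loop.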
A good benchmark might be to find a Matlab or Python/NumPy routine
that somebody around the lab uses (of moderate size, maybe 100 lines
or so), port it to Julia, and compare. That would also make it more
relevant and interesting to people.