I am getting unexpectedly fast results from MATLAB's matrix multiplication, and I am wondering whether it is doing something besides straight multiplication, such as using Strassen's algorithm.
I am benchmarking the matrix multiply at over 30 GFLOPS on my CPU, even though I am running only one thread (using -singleCompThread at startup) and my CPU clock is 2.25 GHz. Since I am using single precision, SSE instructions allow 4 multiplies or adds per instruction, which gives 9 GFLOPS. Most CPUs can execute multiple instructions per cycle; however, I know from my own tests, and from consulting the Intel optimization manual
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf (Tables C-10 and C-10a), that my CPU can issue at most 2 floating-point instructions per cycle. This puts the theoretical maximum at 18 GFLOPS, yet MATLAB seems to give much higher performance. How can this be?
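Spelled out, the bound above is just a product of three factors (a sketch using the figures from this question; the 2-instructions-per-cycle number is my reading of the Intel tables, so adjust the constants for your own CPU):

    % Back-of-envelope single-thread peak for this machine
    clockHz = 2.25e9;      % CPU clock
    simdLanes = 4;         % single-precision floats per 128-bit SSE register
    fpInstrPerCycle = 2;   % e.g. one SSE multiply + one SSE add per cycle
    peakGFlops = clockHz * simdLanes * fpInstrPerCycle / 1e9   % = 18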
Below is the code I tested with:
sz = 2.^(0:12);
for s = 1:numel(sz)
    A = rand(sz(s), sz(s), 'single');
    tic;
    A'*A;
    t = toc;
    fprintf('sz=%d, time=%3.3gms, GFlops=%3.3g\n', sz(s), t*1000, 2*size(A,1)^3/1e9/t)
end
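A single tic/toc measurement can be noisy, and the first run in particular pays one-off warm-up costs. If your MATLAB version has timeit (which calls the function several times and returns a median), a steadier measurement of one size would be something like this sketch:

    % More robust timing of a single size using timeit
    n = 1024;
    A = rand(n, n, 'single');
    t = timeit(@() A'*A);   % median over several runs, warm-up excluded
    fprintf('sz=%d, time=%3.3gms, GFlops=%3.3g\n', n, t*1000, 2*n^3/1e9/t)

The shape of the results is the same either way, so the numbers below are from the tic/toc loop as written.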
OUTPUT
sz=1, time=0.0457ms, GFlops=4.37e-005
sz=2, time=0.0199ms, GFlops=0.000803
sz=4, time=0.00634ms, GFlops=0.0202
sz=8, time=0.0086ms, GFlops=0.119
sz=16, time=0.0086ms, GFlops=0.952
sz=32, time=0.029ms, GFlops=2.26
sz=64, time=0.0503ms, GFlops=10.4
sz=128, time=0.268ms, GFlops=15.7
sz=256, time=1.61ms, GFlops=20.9
sz=512, time=10.8ms, GFlops=24.8
sz=1024, time=76.4ms, GFlops=28.1
sz=2048, time=572ms, GFlops= 30
sz=4096, time=4.36e+003ms, GFlops=31.5