
SSE2


Marcel Hendrix

Jul 27, 2012, 10:20:47 AM
I had a fresh look at SSE2. Strangely, my tests fail to show up any
performance difference with the good ol' FPU.

The test I did is based on an SSE2 DDOT implementation (DDOT_sse2, vector
dot product). I won't show the code; it is too long and too specific.

What I did is derived from the MiniBLAS sources. As SSE2 operates on
4 doubles at a time, speedups of around 4 are suggesting themselves.
However, I can find no trace of this. An obvious reason could be that
memory throughput can not keep up with the FP units. Strange, as one
would expect this hardware problem to be fixed by now.

Does anybody know if there is (currently) a hardware limitation in
i7 920 CPUs? I vaguely remember having read somewhere that some corners
are cut because not everybody is using 64-bit software yet, and obviously
clock speed 'sells' better than actual FP performance.

FORTH> 100000000 TEST
Using 64 bits floats.
mul : 1.2400000000000000000e+0011 2.055 seconds elapsed.
mul1 : 1.2400000000000000000e+0011 3.141 seconds elapsed.
mul2 : 1.2400000000000000000e+0011 1.230 seconds elapsed.
mul3 : 1.2400000000000000000e+0011 1.604 seconds elapsed.
mmul : 4.6000000000000000000e+0010 2.376 seconds elapsed.
mmul1 : 4.6000000000000000000e+0010 2.854 seconds elapsed.
mmul3 : 4.6000000000000000000e+0010 3.785 seconds elapsed.
mmul4 : 4.6000000000000000000e+0010 2.933 seconds elapsed. ok

The fastest DDOT variant, MUL2, uses direct memory addressing. In
general, this can only work when Forth compiles at runtime (with
quotations :-).

MUL1 uses indexed addressing. This is quite a lot slower (which I hate,
because I have put a lot of effort into making the iForth compiler
support it fully; after 10 years, it looks like Intel has still not
found a way to improve the speed of this flexible addressing scheme).

MUL uses a pointer instead of computing an array index. This is
again faster than indexed addressing, but not as fast as an immediate
address (probably because every instruction, both integer and
fp, counts here).

MUL3 uses DDOT_sse2. It is slower than MUL2: not bad, but far
from 400% faster.

The MMUL and MMUL1 are matrix multiplication words. They use the
DDOTs defined earlier. Performance is as expected.

The performance of MMUL3, which uses DDOT_sse2, is suspiciously low.
It turns out that having a normal FPU operation (here DF!) at the end
of a tight loop is a bad idea. (I also noticed this for DDOT_0, and
it is the reason that I unrolled the loops.) The word MMUL4 therefore
unrolls everything. Not surprisingly, the performance increases, but
only enough to be in line with MUL3.

-marcel

-- -----------

CREATE a1 #16 DFLOATS ALLOT
CREATE a2 #16 DFLOATS ALLOT
CREATE a3 #16 DFLOATS ALLOT

: filla ( -- ) a1 #16 0 DO I S>F DF!+ LOOP DROP ;
: fillb ( -- ) a2 #16 0 DO I S>F DF!+ LOOP DROP ;

: DDOT_0 ( a1 a2 n -- ) ( F: -- r )
0e 2 RSHIFT
0 ?DO DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
LOOP 2DROP ;

: DDOT_1 ( a1 a2 n -- ) ( F: -- r )
0e
0 ?DO DUP I DFLOAT[] DF@ OVER I DFLOAT[] DF@ F* F+
DUP I 1+ DFLOAT[] DF@ OVER I 1+ DFLOAT[] DF@ F* F+
DUP I 2+ DFLOAT[] DF@ OVER I 2+ DFLOAT[] DF@ F* F+
DUP I 3 + DFLOAT[] DF@ OVER I 3 + DFLOAT[] DF@ F* F+
4 +LOOP 2DROP ;

: mul ( F: -- r ) a1 a2 #16 DDOT_0 ;
: mul1 ( F: -- r ) a1 a2 #16 DDOT_1 ;

: mul2 ( F: -- r )
0e
a1 0 DFLOAT[] DF@ a2 0 DFLOAT[] DF@ F* F+
a1 1 DFLOAT[] DF@ a2 1 DFLOAT[] DF@ F* F+
a1 2 DFLOAT[] DF@ a2 2 DFLOAT[] DF@ F* F+
a1 3 DFLOAT[] DF@ a2 3 DFLOAT[] DF@ F* F+

a1 4 DFLOAT[] DF@ a2 4 DFLOAT[] DF@ F* F+
a1 5 DFLOAT[] DF@ a2 5 DFLOAT[] DF@ F* F+
a1 6 DFLOAT[] DF@ a2 6 DFLOAT[] DF@ F* F+
a1 7 DFLOAT[] DF@ a2 7 DFLOAT[] DF@ F* F+

a1 8 DFLOAT[] DF@ a2 8 DFLOAT[] DF@ F* F+
a1 9 DFLOAT[] DF@ a2 9 DFLOAT[] DF@ F* F+
a1 10 DFLOAT[] DF@ a2 10 DFLOAT[] DF@ F* F+
a1 11 DFLOAT[] DF@ a2 11 DFLOAT[] DF@ F* F+

a1 12 DFLOAT[] DF@ a2 12 DFLOAT[] DF@ F* F+
a1 13 DFLOAT[] DF@ a2 13 DFLOAT[] DF@ F* F+
a1 14 DFLOAT[] DF@ a2 14 DFLOAT[] DF@ F* F+
a1 15 DFLOAT[] DF@ a2 15 DFLOAT[] DF@ F* F+ ;

: mul3 a1 a2 #16 DDOT_sse2 ; ( F: -- r )

: fillm1 ( -- ) filla ;

: fillm12 ( -- ) \ a2 = a1^T
fillm1
4 0 DO 4 0 DO J 4 * I + a1 []DFLOAT DF@
I 4 * J + a2 []DFLOAT DF!
LOOP
LOOP ;

: mmul ( F: -- r )
0e
4 0 DO 4 0 DO J 4 * a1 []DFLOAT
I 4 * a2 []DFLOAT 4 DDOT_0
J 4 * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;

: mmul1 ( F: -- r )
0e
4 0 DO 4 0 DO J 4 * a1 []DFLOAT
I 4 * a2 []DFLOAT 4 DDOT_1
J 4 * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;

: mmul3 ( F: -- r )
0e
4 0 DO 4 0 DO J 4 * a1 []DFLOAT
I 4 * a2 []DFLOAT 4 DDOT_sse2
J 4 * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;

: mmul4 ( F: -- r )
0e
4 0 DO
I 4 * a1 []DFLOAT
a2 4 DDOT_sse2
I 4 * a3 []DFLOAT FDUP DF! F+

I 4 * a1 []DFLOAT
4 a2 []DFLOAT 4 DDOT_sse2
I 4 * 1+ a3 []DFLOAT FDUP DF! F+

I 4 * a1 []DFLOAT
8 a2 []DFLOAT 4 DDOT_sse2
I 4 * 2+ a3 []DFLOAT FDUP DF! F+

I 4 * a1 []DFLOAT
#12 a2 []DFLOAT 4 DDOT_sse2
I 4 * 3 + a3 []DFLOAT FDUP DF! F+
LOOP ;

: TEST ( u -- )
1 UMAX LOCAL #tries
#tries 3 RSHIFT 1 UMAX LOCAL #mtrys
CR ." Using 64 bits floats."
filla fillb
CR ." mul : " TIMER-RESET 0e #tries 0 DO mul F+ LOOP +E. SPACE .ELAPSED
CR ." mul1 : " TIMER-RESET 0e #tries 0 DO mul1 F+ LOOP +E. SPACE .ELAPSED
CR ." mul2 : " TIMER-RESET 0e #tries 0 DO mul2 F+ LOOP +E. SPACE .ELAPSED
CR ." mul3 : " TIMER-RESET 0e #tries 0 DO mul3 F+ LOOP +E. SPACE .ELAPSED

CR ." mmul : " TIMER-RESET 0e #mtrys 0 DO mmul F+ LOOP +E. SPACE .ELAPSED
CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO mmul1 F+ LOOP +E. SPACE .ELAPSED
CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO mmul3 F+ LOOP +E. SPACE .ELAPSED
CR ." mmul4 : " TIMER-RESET 0e #mtrys 0 DO mmul4 F+ LOOP +E. SPACE .ELAPSED ;


Paul Rubin

Jul 27, 2012, 11:03:51 AM
m...@iae.nl (Marcel Hendrix) writes:
> What I did is derived from the MiniBLAS sources. As SSE2 operates on
> 4 doubles at a time, speedups of around 4 are suggesting themselves.

Scalar operations use 2 doubles at a time so that suggests speedup of 2
rather than 4. But, a parallel 64 bit multiplier is probably about the
most expensive resource in the ALU, so there may be just one of them.
In that case it would be the bottleneck and DDOT would be about the same
as 2 scalar multiplications. You could try a single precision version
that would give more speedup.

Are you using the DPPD (double precision dot product) instruction from
SSE4.1? I'd expect that to be fastest.

Later i7's (Sandy Bridge) have the 256 bit AVX instructions that may be
faster still, and the forthcoming Haswell will have fused MAC. For
maximum speed of course you want to use a GPU.

Anton Ertl

Jul 27, 2012, 10:56:05 AM
m...@iae.nl (Marcel Hendrix) writes:
>What I did is derived from the MiniBLAS sources. As SSE2 operates on
>4 doubles at a time, speedups of around 4 are suggesting themselves.
>However, I can find no trace of this. An obvious reason could be that
>memory throughput can not keep up with the FP units. Strange, as one
>would expect this hardware problem to be fixed by now.

If the problem in your benchmark is that it is memory bandwidth
limited, no, that problem is not "fixed". Much software run on PCs is
not bandwidth limited, so it would not be economical to add an
expensive memory system for more bandwidth, certainly not for every
PC. There are "PCs" with more memory bandwidth (typically servers),
and supercomputers often have more memory bandwidth relative to FLOPS,
but even there the trend is to lower that ratio (because additional
FPUs are cheaper than additional bandwidth, and additional FPUs are
useful for some supercomputer software, and because many
supercomputers are based on PC CPUs nowadays). GPUs tend to have a
high memory bandwidth, but also high FLOPS.

> FORTH> 100000000 TEST
> Using 64 bits floats.
> mul : 1.2400000000000000000e+0011 2.055 seconds elapsed.
> mul1 : 1.2400000000000000000e+0011 3.141 seconds elapsed.
> mul2 : 1.2400000000000000000e+0011 1.230 seconds elapsed.
> mul3 : 1.2400000000000000000e+0011 1.604 seconds elapsed.
> mmul : 4.6000000000000000000e+0010 2.376 seconds elapsed.
> mmul1 : 4.6000000000000000000e+0010 2.854 seconds elapsed.
> mmul3 : 4.6000000000000000000e+0010 3.785 seconds elapsed.
> mmul4 : 4.6000000000000000000e+0010 2.933 seconds elapsed. ok
>
>The fastest DDOT variant, MUL2, uses direct memory addressing. In
>general, this can only work when Forth compiles at runtime (with
>quotations :-).

Hmm, if the problem is memory bandwidth, I would expect all variants
to have the same performance (unless you use additional indirection
vectors or varying memory layouts).

>MUL1 uses indexed addressing. This is quite a lot slower (which I hate,
>because I have put a lot of effort into making the iForth compiler
>support it fully; after 10 years, it looks like Intel has still not
>found a way to improve the speed of this flexible addressing scheme).

If the memory accesses are the same, then your problem is not memory
bandwidth.

My _guess_ is that your compiler produces some stack memory stores for
some of the variants, with some stack memory fetches soon after, and
these cost quite a bit of performance. At least they used to. You
don't have a data-near-code problem, do you?

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2012: http://www.euroforth.org/ef12/

Paul Rubin

Jul 27, 2012, 11:56:12 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> If the problem in your benchmark is that it is memory bandwidth
> limited, no, that problem is not "fixed".

Yeah I wouldn't have expected memory bandwidth to be the problem either,
in a sequential-access problem like dot product. One simple test is to
access all the data before beginning DDOT, to get it all into cache.
There are also some cache prefetch instructions in the x86 that let you
get the data into cache ahead of time, when you know what the access
pattern is going to be.
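
For instance, a sketch in standard Forth (hypothetical word, just to
illustrate the warm-up idea): read every element once before the timed
runs, so the vectors start out in cache.

: warm ( addr n -- )    \ touch every dfloat of the vector once
   0 ?DO  DUP I DFLOATS + DF@ FDROP  LOOP  DROP ;

a1 #16 warm   a2 #16 warm   \ e.g. before the DDOT timings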

Albert van der Horst

Jul 27, 2012, 12:31:15 PM
In article <5615190...@frunobulax.edu>, Marcel Hendrix <m...@iae.nl> wrote:
>I had a fresh look at SSE2. Strangely, my tests fail to show up any
>performance difference with the good ol' FPU.
>
>The test I did is based on an SSE2 DDOT implementation (DDOT_sse2, vector
>dot product). I won't show the code; it is too long and too specific.
>
>What I did is derived from the MiniBLAS sources. As SSE2 operates on
>4 doubles at a time, speedups of around 4 are suggesting themselves.
>However, I can find no trace of this. An obvious reason could be that
>memory throughput can not keep up with the FP units. Strange, as one
>would expect this hardware problem to be fixed by now.
>
>Does anybody know if there is (currently) a hardware limitation in
>i7 920 CPUs? I vaguely remember having read somewhere that some corners
>are cut because not everybody is using 64-bit software yet, and obviously
>clock speed 'sells' better than actual FP performance.
>
> FORTH> 100000000 TEST
> Using 64 bits floats.
> mul : 1.2400000000000000000e+0011 2.055 seconds elapsed.
> mul1 : 1.2400000000000000000e+0011 3.141 seconds elapsed.
> mul2 : 1.2400000000000000000e+0011 1.230 seconds elapsed.
> mul3 : 1.2400000000000000000e+0011 1.604 seconds elapsed.
> mmul : 4.6000000000000000000e+0010 2.376 seconds elapsed.
> mmul1 : 4.6000000000000000000e+0010 2.854 seconds elapsed.
> mmul3 : 4.6000000000000000000e+0010 3.785 seconds elapsed.
> mmul4 : 4.6000000000000000000e+0010 2.933 seconds elapsed. ok

I'm missing something here. Which of this is FPU/stack-based and which is
SSE-based?

<SNIP>

Groetjes Albert

--

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Anton Ertl

Jul 27, 2012, 12:28:01 PM
Paul Rubin <no.e...@nospam.invalid> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> If the problem in your benchmark is that it is memory bandwidth
>> limited, no, that problem is not "fixed".
>
>Yeah I wouldn't have expected memory bandwidth to be the problem either,
>in a sequential-access problem like dot product.

It can be, if your data does not fit into the cache.

E.g., if you have 2 channels with DDR3-12800 memory, you get at best
25600MB/s of memory bandwidth. If your CPU core is capable of fetching
2 8-byte floats per cycle and of doing one F+ and one F* per cycle and
has a 3GHz clock speed, that would be a 2*8*3000=48000MB/s bandwidth
requirement; and that's just one core.

>One simple test is to
>access all the data before beginning DDOT, to get it all into cache.
>There are also some cache prefetch instructions in the x86 that let you
>get the data into cache ahead of time, when you know what the access
>pattern is going to be.

For a fixed-stride access pattern the hardware prefetcher should be
able to cover that nicely, no software prefetching needed.

David Kuehling

Jul 27, 2012, 7:47:41 PM
I'd suggest to use n*m dot-products to compute a product of a kxn with a
mxk matrix. This will be closer to some real-world use with (given that
k is small enough) most data accesses going to the caches.

David
--
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205 D016 7DEF 5323 C174 7D40

Marcel Hendrix

Jul 28, 2012, 9:28:17 AM
Paul Rubin <no.e...@nospam.invalid> writes Re: SSE2

> m...@iae.nl (Marcel Hendrix) writes:
>> What I did is derived from the MiniBLAS sources. As SSE2 operates on
>> 4 doubles at a time, speedups of around 4 are suggesting themselves.

> Scalar operations use 2 doubles at a time so that suggests speedup of 2
> rather than 4. But, a parallel 64 bit multiplier is probably about the
> most expensive resource in the ALU, so there may be just one of them.
> In that case it would be the bottleneck and DDOT would be about the same
> as 2 scalar multiplications. You could try a single precision version
> that would give more speedup.

Sorry, of course I should have said 200%, not 400%. SSE2 packs 2 doubles
in a "word".

I followed your suggestion and tried single-precision (SF@ etc.).

> Are you using the DPPD (double precision dot product) instruction from
> SSE4.1? I'd expect that to be fastest.

It seems to me that this instruction is just more convenient, not faster.
It still works on 2 doubles.

> Later i7's (Sandy Bridge) have the 256 bit AVX instructions that may be
> faster still, and the forthcoming Haswell will have fused MAC. For
> maximum speed of course you want to use a GPU.

Interesting suggestions, but unfortunately it would be real work to
update my 64 bit assembler for AVX. Going for a GPU is a very interesting
option. I see they do double-precision, have at least 6 GB of memory and
can do (some) special functions. But would they be a 'transparent'
resource?

The original problem was that I ran DDOT for a vector size of 16. This is too
small for the code I used (it goes flat out with 16/32 floats / block).

Conclusions (in the context of iForth):

1) There is almost no difference between SINGLE and DOUBLE for smallish
vectors (16). For 1024 element vectors the SSE2 speed difference is
about 1.5. I assume it will reach 2 eventually.

2) SSE2 works for large vectors and gives about 4x more performance for
SDOT. It also works for DDOT and gives more than 2x the performance
that can be had otherwise.

3) The matrix code does not speed up as much as I expected. It should
be proportional to the S/DDOT speed, but apparently there is extra
overhead that does not go down very far (yet) with 1024 element vectors.

Most of my work is with quite small vectors (3 - 6 elements). However,
SPICE jobs should speed up a great deal with SSE and friends.

-marcel

PS: In the below printout only MUL3 and MMUL3 use SSE2.

-- ------------------------------------------
(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Array code compiler
* CATEGORY : Example
* AUTHOR : Marcel Hendrix
* LAST CHANGE : July 26, 2012, Marcel Hendrix
*)



NEEDS -miscutil

REVISION -arraymulc "--- Array test Version 0.00 ---"

PRIVATES

DOC
(*
FORTH> 10000000 TEST
Using 64 bits floats.
Vector size = 16
mul : 1.2400000000000000000e+0010 0.218 seconds elapsed.
mul1 : 1.2400000000000000000e+0010 0.314 seconds elapsed.
mul2 : 1.2400000000000000000e+0010 0.198 seconds elapsed.
mul3 : 1.2400000000000000000e+0010 0.161 seconds elapsed.
mmul : 4.6000000000000000000e+0009 0.238 seconds elapsed.
mmul1 : 4.6000000000000000000e+0009 0.286 seconds elapsed.
mmul3 : 4.6000000000000000000e+0009 0.358 seconds elapsed. ok

FORTH> 10000000 TEST
Using 32 bits floats.
Vector size = 16
mul : 1.2400000000000000000e+0010 0.214 seconds elapsed.
mul1 : 1.2400000000000000000e+0010 0.314 seconds elapsed.
mul2 : 1.2400000000000000000e+0010 0.150 seconds elapsed.
mul3 : 1.2400000000000000000e+0010 0.134 seconds elapsed.
mmul : 4.6000000000000000000e+0009 0.239 seconds elapsed.
mmul1 : 4.6000000000000000000e+0009 0.284 seconds elapsed.
mmul3 : 4.6000000000000000000e+0009 0.356 seconds elapsed. ok


FORTH> 1000000 TEST
Using 32 bits floats.
Vector size = 1024
mul : 3.5738982400000000000e+0014 1.117 seconds elapsed.
mul1 : 3.5738982400000000000e+0014 1.837 seconds elapsed.
mul3 : 3.5738979200000000000e+0014 0.233 seconds elapsed.
mmul : 1.0719948800000000000e+0015 4.995 seconds elapsed.
mmul1 : 1.0719948800000000000e+0015 7.768 seconds elapsed.
mmul3 : 1.0719948800000000000e+0015 2.526 seconds elapsed. ok


FORTH> 1000000 TEST
Using 64 bits floats.
Vector size = 1024
mul : 3.5738982400000000000e+0014 1.067 seconds elapsed.
mul1 : 3.5738982400000000000e+0014 1.788 seconds elapsed.
mul3 : 3.5738982400000000000e+0014 0.358 seconds elapsed.
mmul : 1.0719948800000000000e+0015 4.913 seconds elapsed.
mmul1 : 1.0719948800000000000e+0015 7.784 seconds elapsed.
mmul3 : 1.0719948800000000000e+0015 2.952 seconds elapsed. ok
*)
ENDDOC

4 =: /rsz
/rsz DUP * =: /size
FALSE =: 64bitsf?

64bitsf?
[IF]

CREATE a1 /size DFLOATS ALLOT
CREATE a2 /size DFLOATS ALLOT
CREATE a3 /size DFLOATS ALLOT

: filla ( -- ) a1 /size 0 DO I S>F DF!+ LOOP DROP ;
: fillb ( -- ) a2 /size 0 DO I S>F DF!+ LOOP DROP ;

: DDOT_0 ( a1 a2 n -- ) ( F: -- r )
0e 2 RSHIFT
0 ?DO DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
LOOP 2DROP ;

: DDOT_1 ( a1 a2 n -- ) ( F: -- r )
0e
0 ?DO DUP I DFLOAT[] DF@ OVER I DFLOAT[] DF@ F* F+
DUP I 1+ DFLOAT[] DF@ OVER I 1+ DFLOAT[] DF@ F* F+
DUP I 2+ DFLOAT[] DF@ OVER I 2+ DFLOAT[] DF@ F* F+
DUP I 3 + DFLOAT[] DF@ OVER I 3 + DFLOAT[] DF@ F* F+
4 +LOOP 2DROP ;

: mul ( F: -- r ) a1 a2 /size DDOT_0 ;
: mul1 ( F: -- r ) a1 a2 /size DDOT_1 ;

: mul2 ( F: -- r )
0e /size #16 <> ?EXIT
a1 0 DFLOAT[] DF@ a2 0 DFLOAT[] DF@ F* F+
a1 1 DFLOAT[] DF@ a2 1 DFLOAT[] DF@ F* F+
a1 2 DFLOAT[] DF@ a2 2 DFLOAT[] DF@ F* F+
a1 3 DFLOAT[] DF@ a2 3 DFLOAT[] DF@ F* F+

a1 4 DFLOAT[] DF@ a2 4 DFLOAT[] DF@ F* F+
a1 5 DFLOAT[] DF@ a2 5 DFLOAT[] DF@ F* F+
a1 6 DFLOAT[] DF@ a2 6 DFLOAT[] DF@ F* F+
a1 7 DFLOAT[] DF@ a2 7 DFLOAT[] DF@ F* F+

a1 8 DFLOAT[] DF@ a2 8 DFLOAT[] DF@ F* F+
a1 9 DFLOAT[] DF@ a2 9 DFLOAT[] DF@ F* F+
a1 10 DFLOAT[] DF@ a2 10 DFLOAT[] DF@ F* F+
a1 11 DFLOAT[] DF@ a2 11 DFLOAT[] DF@ F* F+

a1 12 DFLOAT[] DF@ a2 12 DFLOAT[] DF@ F* F+
a1 13 DFLOAT[] DF@ a2 13 DFLOAT[] DF@ F* F+
a1 14 DFLOAT[] DF@ a2 14 DFLOAT[] DF@ F* F+
a1 15 DFLOAT[] DF@ a2 15 DFLOAT[] DF@ F* F+ ;

: mul3 a1 a2 /size DDOT_sse2 ; ( F: -- r )

: fillm1 ( -- ) filla ;

: fillm12 ( -- ) \ a2 = a1^T
fillm1
/rsz 0 DO
/rsz 0 DO
J /rsz * I + a1 []DFLOAT DF@
I /rsz * J + a2 []DFLOAT DF!
LOOP
LOOP ;

: mmul ( F: -- r )
0e
/rsz 0 DO
/rsz 0 DO J /rsz * a1 []DFLOAT
I /rsz * a2 []DFLOAT /rsz DDOT_0
J /rsz * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;

: mmul1 ( F: -- r )
0e
/rsz 0 DO
/rsz 0 DO J /rsz * a1 []DFLOAT
I /rsz * a2 []DFLOAT /rsz DDOT_1
J /rsz * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;

: mmul3 ( F: -- r )
0e
/rsz 0 DO
/rsz 0 DO J /rsz * a1 []DFLOAT
I /rsz * a2 []DFLOAT /rsz DDOT_sse2
J /rsz * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;

[ELSE]

CREATE a1 /size SFLOATS ALLOT
CREATE a2 /size SFLOATS ALLOT
CREATE a3 /size SFLOATS ALLOT

: filla ( -- ) a1 /size 0 DO I S>F SF!+ LOOP DROP ;
: fillb ( -- ) a2 /size 0 DO I S>F SF!+ LOOP DROP ;

: SDOT_0 ( a1 a2 n -- ) ( F: -- r )
0e 2 RSHIFT
0 ?DO SF@+ swap SF@+ F* F+
SF@+ swap SF@+ F* F+
SF@+ swap SF@+ F* F+
SF@+ swap SF@+ F* F+
LOOP 2DROP ;

: SDOT_1 ( a1 a2 n -- ) ( F: -- r )
0e
0 ?DO DUP I SFLOAT[] SF@ OVER I SFLOAT[] SF@ F* F+
DUP I 1+ SFLOAT[] SF@ OVER I 1+ SFLOAT[] SF@ F* F+
DUP I 2+ SFLOAT[] SF@ OVER I 2+ SFLOAT[] SF@ F* F+
DUP I 3 + SFLOAT[] SF@ OVER I 3 + SFLOAT[] SF@ F* F+
4 +LOOP 2DROP ;

: mul ( F: -- r ) a1 a2 /size SDOT_0 ;
: mul1 ( F: -- r ) a1 a2 /size SDOT_1 ;

: mul2 0e /size #16 <> ?EXIT
a1 0 SFLOAT[] SF@ a2 0 SFLOAT[] SF@ F* F+
a1 1 SFLOAT[] SF@ a2 1 SFLOAT[] SF@ F* F+
a1 2 SFLOAT[] SF@ a2 2 SFLOAT[] SF@ F* F+
a1 3 SFLOAT[] SF@ a2 3 SFLOAT[] SF@ F* F+

a1 4 SFLOAT[] SF@ a2 4 SFLOAT[] SF@ F* F+
a1 5 SFLOAT[] SF@ a2 5 SFLOAT[] SF@ F* F+
a1 6 SFLOAT[] SF@ a2 6 SFLOAT[] SF@ F* F+
a1 7 SFLOAT[] SF@ a2 7 SFLOAT[] SF@ F* F+

a1 8 SFLOAT[] SF@ a2 8 SFLOAT[] SF@ F* F+
a1 9 SFLOAT[] SF@ a2 9 SFLOAT[] SF@ F* F+
a1 10 SFLOAT[] SF@ a2 10 SFLOAT[] SF@ F* F+
a1 11 SFLOAT[] SF@ a2 11 SFLOAT[] SF@ F* F+

a1 12 SFLOAT[] SF@ a2 12 SFLOAT[] SF@ F* F+
a1 13 SFLOAT[] SF@ a2 13 SFLOAT[] SF@ F* F+
a1 14 SFLOAT[] SF@ a2 14 SFLOAT[] SF@ F* F+
a1 15 SFLOAT[] SF@ a2 15 SFLOAT[] SF@ F* F+ ;

: mul3 a1 a2 /size SDOT_sse2 ; ( F: -- r )

: fillm1 ( -- ) filla ;

: fillm12 ( -- ) \ a2 = a1^T
fillm1
/rsz 0 DO /rsz 0 DO J /rsz * I + a1 []SFLOAT SF@
I /rsz * J + a2 []SFLOAT SF!
LOOP
LOOP ;

: mmul ( F: -- r )
0e
/rsz 0 DO /rsz 0 DO J /rsz * a1 []SFLOAT
I /rsz * a2 []SFLOAT /rsz SDOT_0
J /rsz * I + a3 []SFLOAT FDUP SF!
F+
LOOP
LOOP ;

: mmul1 ( F: -- r )
0e
/rsz 0 DO /rsz 0 DO J /rsz * a1 []SFLOAT
I /rsz * a2 []SFLOAT /rsz SDOT_1
J /rsz * I + a3 []SFLOAT FDUP SF!
F+
LOOP
LOOP ;

: mmul3 ( F: -- r )
0e
/rsz 0 DO /rsz 0 DO J /rsz * a1 []SFLOAT
I /rsz * a2 []SFLOAT /rsz SDOT_sse2
J /rsz * I + a3 []SFLOAT FDUP SF!
F+
LOOP
LOOP ;

[THEN]

: TEST ( u -- )
1 UMAX LOCAL #tries
#tries 3 RSHIFT 1 UMAX LOCAL #mtrys
CR ." Using " 64bitsf? IF #64 ELSE #32 ENDIF DEC. ." bits floats."
CR ." Vector size = " /size DEC.
filla fillb
CR ." mul : " TIMER-RESET 0e #tries 0 DO mul F+ LOOP +E. SPACE .ELAPSED
CR ." mul1 : " TIMER-RESET 0e #tries 0 DO mul1 F+ LOOP +E. SPACE .ELAPSED
/size #16 = IF CR ." mul2 : " TIMER-RESET 0e #tries 0 DO mul2 F+ LOOP +E. SPACE .ELAPSED ENDIF
CR ." mul3 : " TIMER-RESET 0e #tries 0 DO mul3 F+ LOOP +E. SPACE .ELAPSED

CR ." mmul : " TIMER-RESET 0e #mtrys 0 DO mmul F+ LOOP +E. SPACE .ELAPSED
CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO mmul1 F+ LOOP +E. SPACE .ELAPSED
CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO mmul3 F+ LOOP +E. SPACE .ELAPSED ;

:ABOUT CR ." Try: ( u -- ) TEST " ;

.ABOUT -arraymulc CR
DEPRIVE

(* End of Source *)

Marcel Hendrix

Jul 28, 2012, 9:28:54 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: SSE2

> m...@iae.nl (Marcel Hendrix) writes:
>>What I did is derived from the MiniBLAS sources. As SSE2 operates on
>>4 doubles at a time, speedups of around 4 are suggesting themselves.
>>However, I can find no trace of this. An obvious reason could be that
>>memory throughput can not keep up with the FP units. Strange, as one
>>would expect this hardware problem to be fixed by now.

> If the problem in your benchmark is that it is memory bandwidth
> limited, no, that problem is not "fixed".

See my answer to Paul. The problem was that the S/DDOT code needs
large vectors to become effective. For small sizes it does almost
nothing.

[..]

> Hmm, if the problem is memory bandwidth, I would expect all variants
> to have the same performance (unless you use additional indirection
> vectors or varying memory layouts).

That's a good argument.

[..]
> My _guess_ is that your compiler produces some stack memory stores for
> some of the variants, with some stack memory fetches soon after, and
> these cost quite a bit of performance. At least they used to.

I would be interested to read more about this. What is the problem here,
exactly? Is a stackframe better than push/pop?

> You
> don't have a data-near-code problem, do you?

No, I always check for these, but I have not come across anything
like it for quite a long time.

-marcel

Marcel Hendrix

Jul 30, 2012, 6:28:41 PM
David Kuehling <dvdk...@gmx.de> writes Re: SSE2

[..]

> I'd suggest to use n*m dot-products to compute a product of a kxn with a
> mxk matrix. This will be closer to some real-world use with (given that
> k is small enough) most data accesses going to the caches.

I suppose you mean "gaxpy," not "dot," do you?

Golub and Van Loan's 'Matrix Computations' has some gems once you know what
to look for. It is possible to specify matrix multiplication in dot and
gaxpy operations (six equivalent possibilities). There is one that
accesses data strictly row by row (!), and it is based on gaxpy. I used it
for the below iForth results.
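
As a rough sketch in standard Forth (not the iForth code used for the
timings below; ma, mb, mc, n and the helper m[] are just placeholder
names), that row-by-row "ikj" ordering looks like this: C(i,:) is built
up as a sum of A(i,k) * B(k,:), so A, B and C are all walked row by row.

0 VALUE n  0 VALUE ma  0 VALUE mb  0 VALUE mc  \ n x n row-major DFLOAT matrices

: m[] ( base row col -- addr ) SWAP n * + DFLOATS + ;

: GAXPY-MM ( -- )
        mc n n * DFLOATS ERASE                     \ clear C
        n 0 ?DO                                    \ I = i
           ma I 0 m[]  mc I 0 m[]                  ( arow crow )
           n 0 ?DO                                 \ I = k
              OVER I DFLOATS + DF@                 \ F: A(i,k)
              mb I 0 m[]                           ( arow crow brow )
              n 0 ?DO                              \ I = j
                 FDUP  DUP I DFLOATS + DF@ F*      \ A(i,k) * B(k,j)
                 1 PICK I DFLOATS + DUP DF@ F+ DF! \ C(i,j) += product
              LOOP
              DROP FDROP                           ( arow crow )
           LOOP
           2DROP
        LOOP ;

(Set n, ma, mb, mc with TO before calling GAXPY-MM; the sketch only shows
the loop order, it is not meant to be fast.)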

I have now implemented DAXPY and SAXPY for (64-bit mode) SSE2, and use them
to do inner products and matrix multiplications for small and large vectors.
The result is shown below.

Summary: SSE2 does not pay off for small single-precision matrix
multiplication (a simple pointer may win). In all other cases, it
wins by sometimes a very wide margin.

I changed the code of a previous posting to be fully general, that is, all
routines (mul, mul1, ... ) now work for any size (not only for multiples
of 4 or 16). Therefore the shown timings are different.

Thanks for the trigger!

-marcel

-- --------------------------------------------------------------
FORTH> 1000000 TEST-DOT
DOT using 32 bits floats.
Vector size = 1024
mul : 5.2377600000000000000e+0011 1.100 seconds elapsed.
mul1 : 5.2377600000000000000e+0011 1.820 seconds elapsed.
mul2 : 5.2377600000000000000e+0011 1.612 seconds elapsed.
mul3 (SSE2) : 5.2377600000000000000e+0011 0.187 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

FORTH> 100000 TEST-AXPY
AXPY using 32 bits floats.
Vector size = 1024
mmul : 1.0828195840000000000e+0014 0.569 seconds elapsed.
mmul1 : 1.0828195840000000000e+0014 0.874 seconds elapsed.
mmul2 : 1.0828195840000000000e+0014 1.001 seconds elapsed.
mmul3 (SSE2) : 1.0828195840000000000e+0014 0.287 seconds elapsed.
mmul4 (SSE2) : 6.7019493475742391600e+0017 0.230 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

FORTH> 1000000 TEST-DOT
DOT using 64 bits floats.
Vector size = 1024
mul : 5.2377600000000000000e+0011 1.089 seconds elapsed.
mul1 : 5.2377600000000000000e+0011 1.813 seconds elapsed.
mul2 : 5.2377600000000000000e+0011 1.627 seconds elapsed.
mul3 (SSE2) : 5.2377600000000000000e+0011 0.365 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

FORTH> 100000 TEST-AXPY
AXPY using 64 bits floats.
Vector size = 1024
mmul : 1.0828195840000000000e+0014 0.569 seconds elapsed.
mmul1 : 1.0828195840000000000e+0014 0.869 seconds elapsed.
mmul2 : 1.0828195840000000000e+0014 1.000 seconds elapsed.
mmul3 (SSE2) : 1.0828195840000000000e+0014 0.343 seconds elapsed.
mmul4 (SSE2) : 6.7015868170240000000e+0017 0.283 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

FORTH> 100000000 TEST-DOT
DOT using 64 bits floats.
Vector size = 16
mul : 1.2000000000000000000e+0010 2.121 seconds elapsed.
mul1 : 1.2000000000000000000e+0010 3.330 seconds elapsed.
mul2 : 1.2000000000000000000e+0010 4.689 seconds elapsed.
mul3 (SSE2) : 1.2000000000000000000e+0010 1.602 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

FORTH> 100000000 TEST-AXPY
AXPY using 64 bits floats.
Vector size = 16
mmul : 4.9000000000000000000e+0010 3.708 seconds elapsed.
mmul1 : 4.9000000000000000000e+0010 4.332 seconds elapsed.
mmul2 : 4.9000000000000000000e+0010 8.950 seconds elapsed.
mmul3 (SSE2) : 4.9000000000000000000e+0010 4.304 seconds elapsed.
mmul4 (SSE2) : 2.8750007200000000000e+0017 3.104 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

FORTH> 10000000 TEST-DOT
DOT using 32 bits floats.
Vector size = 16
mul : 1.2000000000000000000e+0009 0.230 seconds elapsed.
mul1 : 1.2000000000000000000e+0009 0.337 seconds elapsed.
mul2 : 1.2000000000000000000e+0009 0.514 seconds elapsed.
mul3 (SSE2) : 1.2000000000000000000e+0009 0.144 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

FORTH> 10000000 TEST-AXPY
AXPY using 32 bits floats.
Vector size = 16
mmul : 4.9000000000000000000e+0009 0.391 seconds elapsed.
mmul1 : 4.9000000000000000000e+0009 0.430 seconds elapsed.
mmul2 : 4.9000000000000000000e+0009 0.965 seconds elapsed.
mmul3 (SSE2) : 4.9000000000000000000e+0009 0.445 seconds elapsed.
mmul4 (SSE2) : 2.8791162680142440000e+0015 0.316 seconds elapsed.
Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

Anton Ertl

Jul 31, 2012, 10:29:43 AM
m...@iae.nl (Marcel Hendrix) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: SSE2
>> My _guess_ is that your compiler produces some stack memory stores for
>> some of the variants, with some stack memory fetches soon after, and
>> these cost quite a bit of performance. At least they used to.
>
>I would be interested to read more about this. What is the problem here,
>exactly? Is a stackframe better than push/pop?

At least in earlier CPUs a dependency through memory had a long
latency, like 15 cycles or so (and all stuff that went through memory
was affected): the data went into the store buffer and eventually went
out to cache, and only then the load could access it for real. I
think newer CPUs have added a bypass (store-to-load-forwarding) to
make such cases faster; or maybe they just made the case faster where
the address of the store is different from the address of the load
(i.e., no bypass).

A benchmark I used was a simple counted loop where the counter was in
memory. IIRC this needed about 15 cycles/iteration or so on Athlon
and Pentium 3.
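
Something like this sketch (standard Forth, hypothetical; not the exact
benchmark, just the memory-carried dependency it measures):

VARIABLE ctr    \ the loop counter lives in memory, not in a register

: mem-count ( u -- )                     \ u >= 1
   ctr !
   BEGIN  ctr @ 1-  DUP ctr !  0= UNTIL ;

Each iteration loads what the previous iteration just stored, so every
iteration waits for that store to become visible to the load.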

Forth systems that are not very sophisticated (or, for sophisticated
compilers, as a fallback option) might store stack items to the
stack in memory and soon after load that item from the stack, and
so might run into this problem.

David Kuehling

Aug 1, 2012, 5:27:01 AM
>>>>> "Marcel" == Marcel Hendrix <m...@iae.nl> writes:

> David Kuehling <dvdk...@gmx.de> writes Re: SSE2 [..]

>> I'd suggest to use n*m dot-products to compute a product of a kxn
>> with a mxk matrix. This will be closer to some real-world use with
>> (given that k is small enough) most data accesses going to the
>> caches.

> I suppose you mean "gaxpy," not "dot," do you?

I didn't really think this through to the end, I guess. Last time I did
SIMD, it was on the Cell CPU and I just made sure that data was
formatted in a way that a 4-way parallel (single-precision) dot product
made sense (i.e. everything, all matrices etc. were stored 4-way
parallel).

> Golub and Van Loan's 'Matrix Computations' has some gems once you know
> what to look for. It is possible to specify matrix multiplication in
> dot and gaxpy operations (six equivalent possibilities). There is one
> that accesses data strictly row by row (!), and it is based on
> gaxpy. I used it for the below iForth results.

Quite some time since I last had a look at the Golub. Looks like I
completely missed the part you mention...

> I have now implemented DAXPY and SAXPY for (64-bit mode) SSE2, and use them
> to do inner products and matrix multiplications for small and large
> vectors. The result is shown below.

> Summary: SSE2 does not pay off for small single-precision matrix
> multiplication (a simple pointer may win). In all other cases, it wins
> by sometimes a very wide margin.

For the Cell CPU I got quite some performance gains by implementing loop
pipelining, i.e. fetching the data for the next loop iteration already
one loop iteration ahead, with first and last iteration peeled off the
loop to handle the special cases, and the loop body duplicated to
implement the double-buffering prefetch via two sets of registers. I
guess with speculatively executing modern CPUs that may not bring any
performance, though (unless you plan to run your code on an Intel Atom
CPU :)

Actually I did some more optimizations to manage the excessive floating
point latencies on the Cell. On the Cell, a floating point add has 6 cycles
of latency, so however fast your multiplication step is, you won't be
faster than one iteration per 6 cycles when adding to the same
accumulator once per iteration. I countered that by computing 8 dot
products in parallel; you may also just use N accumulators that are
added together when done (and unrolling N iterations of the loop body).
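
In high-level Forth terms, a sketch of the multiple-accumulator idea
(standard Forth, hypothetical word, two partial sums instead of N):

: DDOT-2ACC ( a1 a2 n -- ) ( F: -- r )
   0e 0e                             \ two independent partial sums
   0 ?DO
      OVER I DFLOATS + DF@           \ a1[i]
      DUP  I DFLOATS + DF@  F*       \ a1[i] * a2[i]
      FROT F+                        \ add to the older sum, swap their roles
   LOOP
   2DROP F+ ;                        \ drop the addresses, combine the sums

With two sums the F+ of one iteration no longer has to wait for the F+ of
the previous one; more accumulators (plus unrolling) push this further.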

> I changed the code of a previous posting to be fully general, that is,
> all routines (mul, mul1, ... ) now work for any size (not only for
> multiples of 4 or 16). Therefore the shown timings are different.

> Thanks for the trigger!

BTW, just wondering, why aren't you just linking with libatlas and using
the BLAS API?

cheers,

Marcel Hendrix

Aug 1, 2012, 2:23:43 PM
David Kuehling <dvdk...@gmx.de> writes Re: SSE2

>>>>>> "Marcel" == Marcel Hendrix <m...@iae.nl> writes:

>> David Kuehling <dvdk...@gmx.de> writes Re: SSE2 [..]
[..]
>> Summary: SSE2 does not pay off for small single-precision matrix
>> multiplication (a simple pointer may win). In all other cases, it wins
>> by sometimes a very wide margin.

> For the Cell CPU I got quite some performance gains by implementing loop
> pipelining, i.e. fetching the data for the next loop iteration already
> one loop iteration ahead, with first and last iteration peeled off the
> loop to handle the special cases, and the loop body duplicated to
> implement the double-buffering prefetch via two sets of registers. I
> guess with speculatively executing modern CPUs that may not bring any
> performance, though (unless you plan to run your code on an Intel Atom
> CPU :)

This SSE2 code uses 8 XMM registers, with the aligned accesses to memory
interleaved between the addpd and mulpd instructions to prevent stalls.
The intermediate results go to 6 accumulators which are picked apart and
added outside of the main loop. The back jump takes insignificant time
compared to the work done. There should not be many wasted cycles
there.

The most difficult thing to do was handling the cases where the number
of vector elements does not match the unrolled loop size. (Of course,
that is trivial in high-level Forth.)
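
In high-level Forth the tail handling is indeed just a second loop; a
sketch (standard Forth, hypothetical word; the bulk runs in chunks of 4,
the leftover 0..3 elements in a plain loop):

: DDOT-ANY ( a1 a2 n -- ) ( F: -- r )
   0e
   DUP 3 INVERT AND              \ n4 = n rounded down to a multiple of 4
   TUCK 2>R                      ( a1 a2 n4 ) ( R: n n4 )
   0 ?DO                         \ unrolled bulk: 4 elements per pass
      OVER I     DFLOATS + DF@  DUP I     DFLOATS + DF@  F* F+
      OVER I 1+  DFLOATS + DF@  DUP I 1+  DFLOATS + DF@  F* F+
      OVER I 2 + DFLOATS + DF@  DUP I 2 + DFLOATS + DF@  F* F+
      OVER I 3 + DFLOATS + DF@  DUP I 3 + DFLOATS + DF@  F* F+
   4 +LOOP
   2R> ?DO                       \ leftover elements n4 .. n-1
      OVER I DFLOATS + DF@  DUP I DFLOATS + DF@  F* F+
   LOOP
   2DROP ;

In the assembly version the same split has to be spelled out by hand, which
is what made it tedious.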

Hmm, I see my matrix code does not yet work for odd rowsize (because
then a row address may become a multiple of 8, not 16, as required).

> Actually I did some more optimizations to manage the excessive floating
> point latencies on the Cell. On the Cell, a floating point add has 6 cycles
> of latency, so however fast your multiplication step is, you won't be
> faster than one iteration per 6 cycles when adding to the same
> accumulator once per iteration. I countered that by computing 8 dot
> products in parallel; you may also just use N accumulators that are
> added together when done (and unrolling N iterations of the loop body).

Yes, that seems very similar to SSE2, although the numbers differ.

[..]

> BTW, just wondering, why aren't you just linking with libatlas and using
> the BLAS API?

I have that too (the ACML)! Somewhat to my initial surprise, that code
runs yet another 2x faster for midsize vectors (N=120) and even 4 times faster
for N=500 (about 21 GFLOPS). Apparently the 4 cores are all put to work.

The reasons for my own tries are

a) curiosity,

b) it is convenient not to have to maintain too many 32/64-bit MinGW projects
that have to work on Windows, Linux and OSX, and

c) the very small vectors that I use every day:

FORTH> SPECIAL-TEST4
4x4 mm - iForth DAXPY based 474.24 MFlops, 6.32 ticks/flop, 0.000 us
6x6 mm - iForth DAXPY based 927.40 MFlops, 3.23 ticks/flop, 0.000 us
8x8 mm - iForth DAXPY based 1089.43 MFlops, 2.75 ticks/flop, 0.000 us
10x10 mm - iForth DAXPY based 1272.48 MFlops, 2.35 ticks/flop, 1.000 us
12x12 mm - iForth DAXPY based 1520.02 MFlops, 1.97 ticks/flop, 2.000 us
14x14 mm - iForth DAXPY based 1683.61 MFlops, 1.78 ticks/flop, 3.000 us
16x16 mm - iForth DAXPY based 1897.78 MFlops, 1.58 ticks/flop, 4.000 us
32x32 mm - iForth DAXPY based 2881.72 MFlops, 1.04 ticks/flop, 22.000 us

where the best ACML results are:

FORTH> SPECIAL-TEST2
4x4 mm - iForth DGEMM1 411.91 MFlops, 7.28 ticks/flop, 0.000 us
6x6 mm - iForth DGEMM1 814.29 MFlops, 3.68 ticks/flop, 0.000 us
8x8 mm - iForth DGEMM1 193.67 MFlops, 15.48 ticks/flop, 5.000 us
10x10 mm - iForth DGEMM1 353.35 MFlops, 8.49 ticks/flop, 5.000 us
12x12 mm - iForth DGEMM1 600.39 MFlops, 4.99 ticks/flop, 5.000 us
14x14 mm - iForth DGEMM1 845.41 MFlops, 3.54 ticks/flop, 6.000 us
16x16 mm - iForth DGEMM1 1208.30 MFlops, 2.48 ticks/flop, 6.000 us
32x32 mm - iForth DGEMM1 3552.92 MFlops, 0.84 ticks/flop, 18.000 us ok

-marcel

Marcel Hendrix

Aug 5, 2012, 3:00:02 PM
m...@iae.nl (Marcel Hendrix) writes Re: SSE2
[..]

> Hmm, I see my matrix code does not yet work for odd rowsize (because
> then a row address may become a multiple of 8, not 16, as required)

I have now fixed this, and guess what: although I use movups (unaligned
fetch) instead of movaps (aligned fetch) everywhere, the speed drops by
less than 5 - 10%. Interesting. This is so close to optimal that I didn't
special-case aligned access for the iForth kernel words (see the final
results below).

For the new tests, I only compare general purpose words -- the
high-level Forth and SSE replacements are fully general (work
aligned / unaligned, for any size, and no buffer is used to keep
transposed matrices).

The main conclusion is again that SSE2 has advantages, but for short
double vectors you won't see it.

-marcel

=== ( singles ) ===

FORTH> TESTS
DOT/AXPY using 32 bits floats.
Vector size = 16
mul0 (dot) : 1.2000000000000000000e+0009 0.226 seconds elapsed.
mul1 (dot_sse2) : 1.2000000000000000000e+0009 0.136 seconds elapsed.
mmul0 (axpy) : 6.0000000000000000000e+0008 0.358 seconds elapsed.
mmul1 (axpy_sse2) : 6.0000000000000000000e+0008 0.341 seconds elapsed.
Note: SINGLE maxint == 16777217, printout may be wrong. ok

FORTH> TESTS
DOT/AXPY using 32 bits floats.
Vector size = 1024
mul0 (dot) : 5.2377600000000000000e+0011 1.088 seconds elapsed.
mul1 (dot_sse2) : 5.2377600000000000000e+0011 0.189 seconds elapsed.
mmul0 (axpy) : 2.0951040000000000000e+0012 5.648 seconds elapsed.
mmul1 (axpy_sse2) : 2.0951040000000000000e+0012 2.329 seconds elapsed.
Note: SINGLE maxint == 16777217, printout may be wrong. ok

=== ( doubles ) ===

FORTH> TESTS
DOT/AXPY using 64 bits floats.
Vector size = 16
mul0 (dot) : 1.2000000000000000000e+0009 0.237 seconds elapsed.
mul1 (dot_sse2) : 1.2000000000000000000e+0009 0.146 seconds elapsed.
mmul0 (axpy) : 6.0000000000000000000e+0008 0.435 seconds elapsed.
mmul1 (axpy_sse2) : 6.0000000000000000000e+0008 0.349 seconds elapsed. ok

FORTH> TESTS
DOT/AXPY using 64 bits floats.
Vector size = 1024
mul0 (dot) : 5.2377600000000000000e+0011 1.095 seconds elapsed.
mul1 (dot_sse2) : 5.2377600000000000000e+0011 0.370 seconds elapsed.
mmul0 (axpy) : 2.0951040000000000000e+0012 5.718 seconds elapsed.
mmul1 (axpy_sse2) : 2.0951040000000000000e+0012 3.080 seconds elapsed. ok

=== mm_old.frt ===

500x500 mm - normal algorithm 0.290 secs.
500x500 mm - temporary variable in loop 0.439 secs.
500x500 mm - unrolled inner loop, factor of 4 0.321 secs.
500x500 mm - unrolled inner loop, factor of 8 0.294 secs.
500x500 mm - unrolled inner loop, factor of 16 0.280 secs.
500x500 mm - pointers used to access matrices 0.350 secs.
500x500 mm - pointers used, unrolled by 8 0.258 secs.
500x500 mm - transposed B matrix 0.400 secs.
500x500 mm - interchanged inner loops 0.423 secs.
500x500 mm - blocking, step size of 20 0.457 secs.
500x500 mm - Robert's algorithm 0.064 secs.
500x500 mm - T. Maeno's algorithm, subarray 20x20 0.372 secs.
500x500 mm - Generic Maeno, subarray 20x20 0.392 secs.
500x500 mm - D. Warner's algorithm, subarray 20x20 0.372 secs.
500x500 mm - SSE2 0.064 secs.
========================================================= =====
Total using no extensions and using no hackery 4.776 secs. ok
