Paul Rubin <no.e...@nospam.invalid> writes Re: SSE2
>
m...@iae.nl (Marcel Hendrix) writes:
>> What I did is derived from the MiniBLAS sources. As SSE2 operates on
>> 4 doubles at a time, speedups of around 4 are suggesting themselves.
> Scalar operations use 2 doubles at a time so that suggests speedup of 2
> rather than 4. But, a parallel 64 bit multiplier is probably about the
> most expensive resource in the ALU, so there may be just one of them.
> In that case it would be the bottleneck and DDOT would be about the same
> as 2 scalar multiplications. You could try a single precision version
> that would give more speedup.
Sorry, of course I should have said 200%, not 400%. SSE2 packs 2 doubles
(or 4 singles) in a 128-bit "word".
I followed your suggestion and tried single-precision (SF@ etc.).
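To make "2 doubles in a word" concrete, here is a minimal C sketch using SSE2 intrinsics. It is not the iForth DDOT_sse2 routine (that source is not part of this post). The name ddot_packed is made up for illustration, n is assumed to be even, and a plain scalar loop is used when the compiler does not target SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dot product of two double vectors, n must be even. */
double ddot_packed(const double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);  /* one 128-bit register holds exactly 2 doubles */
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();              /* two partial sums */
    for (size_t i = 0; i < n; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);         /* x[i], x[i+1] */
        __m128d b = _mm_loadu_pd(y + i);         /* y[i], y[i+1] */
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b)); /* 2 multiply-adds per step */
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];                  /* combine the two lanes */
#else
    double sum = 0.0;                            /* portable fallback */
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
#endif
}
```

Called as ddot_packed(a, b, 16), the loop runs n/2 times instead of n, which is where the ideal factor of 2 for doubles comes from.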
> Are you using the DPPD (double precision dot product) instruction from
> SSE4.1? I'd expect that to be fastest.
It seems to me that this instruction is just more convenient, not faster.
It still works on 2 doubles.
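As a rough illustration of why DPPD looks like a convenience rather than a speedup, the C sketch below uses the _mm_dp_pd intrinsic (which maps to DPPD) when the compiler targets SSE4.1, for example with -msse4.1 on GCC or Clang. It multiplies both lanes and sums them in one instruction, but still consumes only 2 doubles per step. The name ddot_dppd is hypothetical, n is assumed to be even, and without SSE4.1 the code falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1 intrinsics, including _mm_dp_pd */
#endif

/* Hypothetical helper: dot product of two double vectors, n must be even. */
double ddot_dppd(const double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);
#if defined(__SSE4_1__)
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);
        __m128d b = _mm_loadu_pd(y + i);
        /* Mask 0x31: multiply lanes 0 and 1, write their sum to lane 0. */
        acc = _mm_add_sd(acc, _mm_dp_pd(a, b, 0x31));
    }
    return _mm_cvtsd_f64(acc);   /* running sum lives in the low lane */
#else
    double sum = 0.0;            /* portable fallback without SSE4.1 */
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
#endif
}
```

Compared with a packed multiply and add, this saves the final lane-combining step but does a horizontal add on every iteration, so it is not obviously a win for long vectors.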
> Later i7's (Sandy Bridge) have the 256 bit AVX instructions that may be
> faster still, and the forthcoming Haswell will have fused MAC. For
> maximum speed of course you want to use a GPU.
Interesting suggestions, but unfortunately it would be real work to
update my 64 bit assembler for AVX. Going for a GPU is a very interesting
option. I see they do double-precision, have at least 6 GB of memory and
can do (some) special functions. But would they be a 'transparent'
resource?
The original problem was that I ran DDOT for a vector size of 16. This is too
small for the code I used (it goes flat out with 16/32 floats / block).
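The effect of block size can be sketched in C. The following is an assumption about the general shape of such a kernel, not the actual SDOT_sse2 code: it handles 16 floats per loop iteration with four independent accumulators and finishes any remainder with a scalar loop. The name sdot_blocked is hypothetical. With a 16-element vector the main loop runs exactly once, so call and setup overhead dominate.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* also provides the SSE single-precision intrinsics */
#endif

/* Hypothetical helper: single-precision dot product, 16 floats per block. */
float sdot_blocked(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    float sum = 0.0f;
#if defined(__SSE2__)
    /* Four accumulators, each holding 4 partial sums. */
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i),      _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 4),  _mm_loadu_ps(y + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(x + i + 8),  _mm_loadu_ps(y + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(x + i + 12), _mm_loadu_ps(y + i + 12)));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3)));
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    for (; i < n; i++)           /* scalar tail (or everything without SSE2) */
        sum += x[i] * y[i];
    return sum;
}
```

Because the partial sums are kept in 32-bit lanes and added in a different order than in a scalar loop, the last digits of the result can differ slightly from a scalar computation for large inputs.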
Conclusions (in the context of iForth):
1) There is almost no difference between SINGLE and DOUBLE for smallish
vectors (16). For 1024 element vectors the SSE2 speed difference is
about 1.5. I assume it will reach 2 eventually.
2) SSE2 works for large vectors and gives about 4x more performance for
SDOT. It also works for DDOT and gives more than 2x the performance
that can be had otherwise.
3) The matrix code does not speed up as much as I expected. It should
be proportional to the S/DDOT speed, but apparently there is extra
overhead that does not go down very far (yet) with 1024 element vectors.
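The ratios behind these conclusions can be recomputed from the elapsed times in the printout. The small C sketch below does only that arithmetic; the function names are made up, and the timings are copied from the 1024-element runs.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of an optimized run relative to a baseline, from elapsed seconds. */
double speedup(double baseline_s, double optimized_s)
{
    assert(optimized_s > 0.0);
    return baseline_s / optimized_s;
}

/* Print the ratios for the 1024-element runs (times taken from the printout). */
void print_speedups(void)
{
    printf("SDOT  mul  / mul3  : %.2f\n", speedup(1.117, 0.233));
    printf("DDOT  mul  / mul3  : %.2f\n", speedup(1.067, 0.358));
    printf("DDOT  mul3 / SDOT mul3 : %.2f\n", speedup(0.358, 0.233));
    printf("32-bit mmul / mmul3 : %.2f\n", speedup(4.995, 2.526));
    printf("64-bit mmul / mmul3 : %.2f\n", speedup(4.913, 2.952));
}
```

This gives roughly 4.8x for SDOT, 3.0x for DDOT, a single/double ratio of about 1.5, and only about 2.0x and 1.7x for the matrix code. One possible contributor to the smaller matrix gain: mmul3 calls the dot kernel on rows of /rsz elements (32 when /size is 1024), so per-call overhead weighs more there.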
Most of my work is with quite small vectors (3 - 6 elements). However,
SPICE jobs should speed up a great deal with SSE and friends.
-marcel
PS: In the below printout only MUL3 and MMUL3 use SSE2.
-- ------------------------------------------
(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Array code compiler
* CATEGORY : Example
* AUTHOR : Marcel Hendrix
* LAST CHANGE : July 26, 2012, Marcel Hendrix
*)
NEEDS -miscutil
REVISION -arraymulc "--- Array test Version 0.00 ---"
PRIVATES
DOC
(*
FORTH> 10000000 TEST
Using 64 bits floats.
Vector size = 16
mul : 1.2400000000000000000e+0010 0.218 seconds elapsed.
mul1 : 1.2400000000000000000e+0010 0.314 seconds elapsed.
mul2 : 1.2400000000000000000e+0010 0.198 seconds elapsed.
mul3 : 1.2400000000000000000e+0010 0.161 seconds elapsed.
mmul : 4.6000000000000000000e+0009 0.238 seconds elapsed.
mmul1 : 4.6000000000000000000e+0009 0.286 seconds elapsed.
mmul3 : 4.6000000000000000000e+0009 0.358 seconds elapsed. ok
FORTH> 10000000 TEST
Using 32 bits floats.
Vector size = 16
mul : 1.2400000000000000000e+0010 0.214 seconds elapsed.
mul1 : 1.2400000000000000000e+0010 0.314 seconds elapsed.
mul2 : 1.2400000000000000000e+0010 0.150 seconds elapsed.
mul3 : 1.2400000000000000000e+0010 0.134 seconds elapsed.
mmul : 4.6000000000000000000e+0009 0.239 seconds elapsed.
mmul1 : 4.6000000000000000000e+0009 0.284 seconds elapsed.
mmul3 : 4.6000000000000000000e+0009 0.356 seconds elapsed. ok
FORTH> 1000000 TEST
Using 32 bits floats.
Vector size = 1024
mul : 3.5738982400000000000e+0014 1.117 seconds elapsed.
mul1 : 3.5738982400000000000e+0014 1.837 seconds elapsed.
mul3 : 3.5738979200000000000e+0014 0.233 seconds elapsed.
mmul : 1.0719948800000000000e+0015 4.995 seconds elapsed.
mmul1 : 1.0719948800000000000e+0015 7.768 seconds elapsed.
mmul3 : 1.0719948800000000000e+0015 2.526 seconds elapsed. ok
FORTH> 1000000 TEST
Using 64 bits floats.
Vector size = 1024
mul : 3.5738982400000000000e+0014 1.067 seconds elapsed.
mul1 : 3.5738982400000000000e+0014 1.788 seconds elapsed.
mul3 : 3.5738982400000000000e+0014 0.358 seconds elapsed.
mmul : 1.0719948800000000000e+0015 4.913 seconds elapsed.
mmul1 : 1.0719948800000000000e+0015 7.784 seconds elapsed.
mmul3 : 1.0719948800000000000e+0015 2.952 seconds elapsed. ok
*)
ENDDOC
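4 =: /rsz \ matrix row size; 32 gives the 1024-element runs
/rsz DUP * =: /size \ number of elements per vector / matrix
FALSE =: 64bitsf? \ TRUE = 64-bit floats, FALSE = 32-bit floats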
64bitsf?
[IF]
CREATE a1 /size DFLOATS ALLOT
CREATE a2 /size DFLOATS ALLOT
CREATE a3 /size DFLOATS ALLOT
: filla ( -- ) a1 /size 0 DO I S>F DF!+ LOOP DROP ;
: fillb ( -- ) a2 /size 0 DO I S>F DF!+ LOOP DROP ;
\ Dot product using post-incrementing fetches, unrolled 4x; n must be a multiple of 4.
: DDOT_0 ( a1 a2 n -- ) ( F: -- r )
0e 2 RSHIFT
0 ?DO DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
DF@+ swap DF@+ F* F+
LOOP 2DROP ;
\ Dot product using indexed fetches, unrolled 4x; n must be a multiple of 4.
: DDOT_1 ( a1 a2 n -- ) ( F: -- r )
0e
0 ?DO DUP I DFLOAT[] DF@ OVER I DFLOAT[] DF@ F* F+
DUP I 1+ DFLOAT[] DF@ OVER I 1+ DFLOAT[] DF@ F* F+
DUP I 2+ DFLOAT[] DF@ OVER I 2+ DFLOAT[] DF@ F* F+
DUP I 3 + DFLOAT[] DF@ OVER I 3 + DFLOAT[] DF@ F* F+
4 +LOOP 2DROP ;
: mul ( F: -- r ) a1 a2 /size DDOT_0 ;
: mul1 ( F: -- r ) a1 a2 /size DDOT_1 ;
\ Fully unrolled dot product; only valid for /size = 16, otherwise returns 0e.
: mul2 ( F: -- r )
0e /size #16 <> ?EXIT
a1 0 DFLOAT[] DF@ a2 0 DFLOAT[] DF@ F* F+
a1 1 DFLOAT[] DF@ a2 1 DFLOAT[] DF@ F* F+
a1 2 DFLOAT[] DF@ a2 2 DFLOAT[] DF@ F* F+
a1 3 DFLOAT[] DF@ a2 3 DFLOAT[] DF@ F* F+
a1 4 DFLOAT[] DF@ a2 4 DFLOAT[] DF@ F* F+
a1 5 DFLOAT[] DF@ a2 5 DFLOAT[] DF@ F* F+
a1 6 DFLOAT[] DF@ a2 6 DFLOAT[] DF@ F* F+
a1 7 DFLOAT[] DF@ a2 7 DFLOAT[] DF@ F* F+
a1 8 DFLOAT[] DF@ a2 8 DFLOAT[] DF@ F* F+
a1 9 DFLOAT[] DF@ a2 9 DFLOAT[] DF@ F* F+
a1 10 DFLOAT[] DF@ a2 10 DFLOAT[] DF@ F* F+
a1 11 DFLOAT[] DF@ a2 11 DFLOAT[] DF@ F* F+
a1 12 DFLOAT[] DF@ a2 12 DFLOAT[] DF@ F* F+
a1 13 DFLOAT[] DF@ a2 13 DFLOAT[] DF@ F* F+
a1 14 DFLOAT[] DF@ a2 14 DFLOAT[] DF@ F* F+
a1 15 DFLOAT[] DF@ a2 15 DFLOAT[] DF@ F* F+ ;
: mul3 ( F: -- r ) a1 a2 /size DDOT_sse2 ;
: fillm1 ( -- ) filla ;
: fillm12 ( -- ) \ a2 = a1^T
fillm1
/rsz 0 DO
/rsz 0 DO
J /rsz * I + a1 []DFLOAT DF@
I /rsz * J + a2 []DFLOAT DF!
LOOP
LOOP ;
\ a3[j][i] = dot(row j of a1, row i of a2); r is the sum of all a3 elements.
: mmul ( F: -- r )
0e
/rsz 0 DO
/rsz 0 DO J /rsz * a1 []DFLOAT
I /rsz * a2 []DFLOAT /rsz DDOT_0
J /rsz * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;
: mmul1 ( F: -- r )
0e
/rsz 0 DO
/rsz 0 DO J /rsz * a1 []DFLOAT
I /rsz * a2 []DFLOAT /rsz DDOT_1
J /rsz * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;
: mmul3 ( F: -- r )
0e
/rsz 0 DO
/rsz 0 DO J /rsz * a1 []DFLOAT
I /rsz * a2 []DFLOAT /rsz DDOT_sse2
J /rsz * I + a3 []DFLOAT FDUP DF!
F+
LOOP
LOOP ;
[ELSE] \ 32-bit (SFLOAT) versions of the same words
CREATE a1 /size SFLOATS ALLOT
CREATE a2 /size SFLOATS ALLOT
CREATE a3 /size SFLOATS ALLOT
: filla ( -- ) a1 /size 0 DO I S>F SF!+ LOOP DROP ;
: fillb ( -- ) a2 /size 0 DO I S>F SF!+ LOOP DROP ;
: SDOT_0 ( a1 a2 n -- ) ( F: -- r )
0e 2 RSHIFT
0 ?DO SF@+ swap SF@+ F* F+
SF@+ swap SF@+ F* F+
SF@+ swap SF@+ F* F+
SF@+ swap SF@+ F* F+
LOOP 2DROP ;
: SDOT_1 ( a1 a2 n -- ) ( F: -- r )
0e
0 ?DO DUP I SFLOAT[] SF@ OVER I SFLOAT[] SF@ F* F+
DUP I 1+ SFLOAT[] SF@ OVER I 1+ SFLOAT[] SF@ F* F+
DUP I 2+ SFLOAT[] SF@ OVER I 2+ SFLOAT[] SF@ F* F+
DUP I 3 + SFLOAT[] SF@ OVER I 3 + SFLOAT[] SF@ F* F+
4 +LOOP 2DROP ;
: mul ( F: -- r ) a1 a2 /size SDOT_0 ;
: mul1 ( F: -- r ) a1 a2 /size SDOT_1 ;
: mul2 ( F: -- r )
0e /size #16 <> ?EXIT
a1 0 SFLOAT[] SF@ a2 0 SFLOAT[] SF@ F* F+
a1 1 SFLOAT[] SF@ a2 1 SFLOAT[] SF@ F* F+
a1 2 SFLOAT[] SF@ a2 2 SFLOAT[] SF@ F* F+
a1 3 SFLOAT[] SF@ a2 3 SFLOAT[] SF@ F* F+
a1 4 SFLOAT[] SF@ a2 4 SFLOAT[] SF@ F* F+
a1 5 SFLOAT[] SF@ a2 5 SFLOAT[] SF@ F* F+
a1 6 SFLOAT[] SF@ a2 6 SFLOAT[] SF@ F* F+
a1 7 SFLOAT[] SF@ a2 7 SFLOAT[] SF@ F* F+
a1 8 SFLOAT[] SF@ a2 8 SFLOAT[] SF@ F* F+
a1 9 SFLOAT[] SF@ a2 9 SFLOAT[] SF@ F* F+
a1 10 SFLOAT[] SF@ a2 10 SFLOAT[] SF@ F* F+
a1 11 SFLOAT[] SF@ a2 11 SFLOAT[] SF@ F* F+
a1 12 SFLOAT[] SF@ a2 12 SFLOAT[] SF@ F* F+
a1 13 SFLOAT[] SF@ a2 13 SFLOAT[] SF@ F* F+
a1 14 SFLOAT[] SF@ a2 14 SFLOAT[] SF@ F* F+
a1 15 SFLOAT[] SF@ a2 15 SFLOAT[] SF@ F* F+ ;
: mul3 ( F: -- r ) a1 a2 /size SDOT_sse2 ;
: fillm1 ( -- ) filla ;
: fillm12 ( -- ) \ a2 = a1^T
fillm1
/rsz 0 DO /rsz 0 DO J /rsz * I + a1 []SFLOAT SF@
I /rsz * J + a2 []SFLOAT SF!
LOOP
LOOP ;
: mmul ( F: -- r )
0e
/rsz 0 DO /rsz 0 DO J /rsz * a1 []SFLOAT
I /rsz * a2 []SFLOAT /rsz SDOT_0
J /rsz * I + a3 []SFLOAT FDUP SF!
F+
LOOP
LOOP ;
: mmul1 ( F: -- r )
0e
/rsz 0 DO /rsz 0 DO J /rsz * a1 []SFLOAT
I /rsz * a2 []SFLOAT /rsz SDOT_1
J /rsz * I + a3 []SFLOAT FDUP SF!
F+
LOOP
LOOP ;
: mmul3 ( F: -- r )
0e
/rsz 0 DO /rsz 0 DO J /rsz * a1 []SFLOAT
I /rsz * a2 []SFLOAT /rsz SDOT_sse2
J /rsz * I + a3 []SFLOAT FDUP SF!
F+
LOOP
LOOP ;
[THEN]
: TEST ( u -- )
1 UMAX LOCAL #tries
#tries 3 RSHIFT 1 UMAX LOCAL #mtrys \ matrix tests run #tries/8 times (at least once)
CR ." Using " 64bitsf? IF #64 ELSE #32 ENDIF DEC. ." bits floats."
CR ." Vector size = " /size DEC.
filla fillb
CR ." mul : " TIMER-RESET 0e #tries 0 DO mul F+ LOOP +E. SPACE .ELAPSED
CR ." mul1 : " TIMER-RESET 0e #tries 0 DO mul1 F+ LOOP +E. SPACE .ELAPSED
/size #16 = IF CR ." mul2 : " TIMER-RESET 0e #tries 0 DO mul2 F+ LOOP +E. SPACE .ELAPSED ENDIF
CR ." mul3 : " TIMER-RESET 0e #tries 0 DO mul3 F+ LOOP +E. SPACE .ELAPSED
CR ." mmul : " TIMER-RESET 0e #mtrys 0 DO mmul F+ LOOP +E. SPACE .ELAPSED
CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO mmul1 F+ LOOP +E. SPACE .ELAPSED
CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO mmul3 F+ LOOP +E. SPACE .ELAPSED ;
:ABOUT CR ." Try: ( u -- ) TEST " ;
.ABOUT -arraymulc CR
DEPRIVE
(* End of Source *)