This is a call for testers concerning an experimental OCaml compiler
back-end that uses SSE2 instructions for floating-point arithmetic.
This code generation strategy was discussed before on this list, and I
include below a summary in Q&A style.
The new back-end is being considered for inclusion in the next major
release (3.12), but performance testing done so far at INRIA and by
Caml Consortium members is not conclusive. Additional results
from members of this list would therefore be very welcome.
We're not terribly interested in small (< 50 LOC), Shootout-style
benchmarks, since their performance is very sensitive to code and data
placement. However, if some of you have a sizeable (> 500 LOC) body
of float-intensive Caml code, we'd be very interested to hear about
the compared speed of the SSE2 back-end and the old back-end on your
code.
Switching to Q&A style:
Q: Where can I get the code?
A: From the SVN repository:
svn checkout http://caml.inria.fr/svn/ocaml/branches/sse2 ocaml-sse2
Source-code only. Very lightly tested under Windows, so you might be
better off testing under Unix.
Q: What is this SSE2 thingy?
A: An extension of the Intel/AMD x86 instruction set that provides,
among other things, 64-bit float arithmetic instructions operating
over 64-bit float registers. Before SSE2, the only way to perform
64-bit float arithmetic on x86 was the x87 instructions, which compute
in 80-bit precision and use a stack instead of registers.
Q: Why this sudden interest in SSE2?
A: SSE2 has several potential advantages over x87, including:
- The register-based SSE2 model fits the OCaml back-end much better
than the stack-based x87 model. In particular, "let"-bound intermediate
results of type "float" can be kept in SSE2 registers, while in
the current x87 mode they are systematically flushed to the stack.
- SSE2 implements exactly 64-bit IEEE arithmetic, giving float results
that are consistent with those obtained on other platforms and with
the OCaml bytecode interpreter. The 80-bit format of x87 produces
different results and can cause surprises such as "double rounding"
errors. (For more explanations, see David Monniaux's excellent article,
http://hal.archives-ouvertes.fr/hal-00128124/ )
- Some x86 processors execute SSE2 instructions faster than their x87
counterparts. This speed difference was notable on the Pentium 4
in particular, but is much smaller on more recent processors such as
Core 2.
Note that 64-bit x86 systems, as well as Mac OS X, already use SSE2 as
their default floating-point model.
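To make the precision point concrete, here is a minimal probe whose intermediate result overflows a 64-bit double but fits in an 80-bit extended value. It is only a sketch: the name `probe` is made up for this illustration, and the x87 outcome assumes the compiler keeps the intermediate product in an x87 register, which is not guaranteed.

```ocaml
(* Hypothetical probe, not part of any OCaml library.
   For x = 1e308 the product x *. 10. exceeds the largest finite
   64-bit double (about 1.8e308) but not the 80-bit x87 range. *)
let probe (x : float) : float = x *. 10. /. 10.

let () =
  let r = probe 1e308 in
  if r = infinity then
    print_endline "intermediate rounded to 64 bits (SSE2 or bytecode behaviour)"
  else
    print_endline "intermediate kept in extended precision (x87 behaviour)"
```

With strict 64-bit arithmetic the product overflows to infinity and stays there. If the intermediate stays in extended precision, the division brings it back to a finite value.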
SSE2 also has some potential disadvantages:
- The instructions are bigger than x87 instructions, causing some
increase in code size and potentially some decrease in instruction
cache efficiency.
- Computing intermediate results in 80-bit precision, like x87 does,
can improve the numerical stability of poorly-conditioned float
computations, although it doesn't make a difference for well-written
numerical code.
Q: Is SSE2 universally available on x86 processors?
A: Not universally but pretty close. SSE2 made its debut in 2000, in
the Pentium 4 processor. All x86 machines built in the last 4 years
or so support SSE2, but pre-Pentium 4 and pre-Athlon64 processors do not.
Q: So if you adopt this new back-end, OCaml will stop working on my
trusty 1995-vintage Pentium?
A: No. Under friendly pressure from our Debian friends, we agreed to
keep the x87 back-end alive for a while in parallel with the SSE2
back-end. The x87 back-end is selected at configuration time if the
processor doesn't support SSE2 or if a special flag is given to the
configure script.
Q: I observed a 20% (speedup|slowdown)! Should I tell the world about it?
A: If your benchmark spends all its time in 10 lines of OCaml, maybe
not. On such small codes, variations in code and data placement alone
(without changing the instructions that are actually executed) can
result in performance variations by 20%, so this is just experimental
noise. Larger programs are less sensitive to this noise, which is why
we're much more interested in results obtained on real OCaml
applications. Finally, one micro-benchmark slowed down by a factor of
2 for reasons we couldn't explain.
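One way to judge whether a measured difference exceeds the noise is to repeat each measurement and compare the spread within one compiler against the gap between the two. The harness below is a minimal sketch: `median`, `time_runs` and `workload` are names invented for this example, and `workload` is a placeholder standing in for a real float-intensive program.

```ocaml
(* Median of a non-empty array of timings. *)
let median (samples : float array) : float =
  let a = Array.copy samples in
  Array.sort compare a;
  let n = Array.length a in
  if n mod 2 = 1 then a.(n / 2)
  else (a.(n / 2 - 1) +. a.(n / 2)) /. 2.

(* Run [f] [runs] times; return the CPU time of each run, in seconds. *)
let time_runs ~(runs : int) (f : unit -> unit) : float array =
  Array.init runs (fun _ ->
    let t0 = Sys.time () in
    f ();
    Sys.time () -. t0)

(* Placeholder workload; replace with the application being measured. *)
let workload () =
  let acc = ref 0. in
  for i = 1 to 1_000_000 do
    acc := !acc +. sqrt (float_of_int i)
  done;
  ignore (Sys.opaque_identity !acc)

let () =
  let samples = time_runs ~runs:5 workload in
  Printf.printf "median %.4fs, min %.4fs, max %.4fs\n"
    (median samples)
    (Array.fold_left min infinity samples)
    (Array.fold_left max neg_infinity samples)
```

If the min-to-max spread for a single compiler is comparable to the difference between the two back-ends, the difference is probably noise.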
Q: What are those inconclusive results you mentioned?
A: On medium-sized numerical kernels (e.g. FFT, Gaussian process
regression), we've observed speedups of about 8% on Core 2 processors
and somewhat higher on recent AMD processors. On bigger OCaml
applications that perform floating-point computations but not
exclusively, the performance difference was lost in the noise.
Looking forward to interesting experimental results,
- Xavier Leroy
_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
> - The register-based SSE2 model fits the OCaml back-end much better
> than the stack-based x87 model. In particular, "let"-bound intermediate
> results of type "float" can be kept in SSE2 registers, while in
> the current x87 mode they are systematically flushed to the stack.
>
> Note that x86-64 bits systems as well as Mac OS X already use SSE2 as
> their default floating-point model.
>
I have a bunch of biological sequence analysis stuff that could be
interesting but I am already in x86-64 ("Wow! A 64 bit architecture!"). The
above seems pretty clear but just to verify - I would not benefit from this
new back-end, right?
Mike
Right. Sorry for not mentioning this. The x86-64 code generator for
OCaml uses SSE2 floats, like all C compilers for this platform. The
experimental back-end I announced is for 32-bit x86. Some more Q&A:
Q: I have OCaml installed on my x86 machine, how do I know if it's 32
or 64 bits?
A: Do:
grep ^ARCH `ocamlopt -where`/Makefile.config
If it says "amd64", it's 64 bits with SSE2 floats.
If it says "i386", it's 32 bits with x87 floats.
If it says "ia32", it's the experimental back-end: 32 bits with SSE2 floats.
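The 32/64-bit distinction can also be checked from inside an OCaml program with `Sys.word_size`. This is a rough sketch only: it assumes an x86 machine, the helper `describe_word_size` is invented for this example, and it cannot tell "i386" from "ia32", so the ARCH line above remains the authoritative check.

```ocaml
(* Sys.word_size is the size of one machine word, in bits, on the
   machine running the program: 32 or 64. *)
let describe_word_size (bits : int) : string =
  match bits with
  | 64 -> "64 bits: on x86 this means SSE2 floats already"
  | 32 -> "32 bits: on x86 this means x87 or experimental SSE2 floats; check ARCH"
  | n -> Printf.sprintf "unexpected word size: %d" n

let () = print_endline (describe_word_size Sys.word_size)
```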
Q: If I compile from sources, which code generator is chosen by
default? 32 or 64 bits?
A: OCaml's configure script chooses whatever mode the C compiler
defaults to. For instance, on a 32-bit Linux installation, the 32-bit
generator is selected, and on a 64-bit Linux installation, it's the
64-bit generator. Mac OS X is more tricky: 10.5 and earlier default
to 32 bits, but 10.6 defaults to 64 bits...
Will Farr wrote:
> Oops. I just ran a bunch of tests on my Mac OS 10.6 system---does
> that mean that I compared two sse2 backends? The ocaml-sse2 branch
> definitely produced different code than the trunk, but that could
> easily be due to any small difference in the two compilers, and not
> due to a change of architecture.
It is quite possible you ended up with two 64-bit, SSE2-float back-ends.
Oops. Sorry for the lost time. And, yes, unrelated changes between
release 3.11.2 and the experimental sources I released (based on what
will become 3.12.0) can account for small speed differences.
Gaëtan Dubreil
I cannot provide any benchmark yet but even not taking into account
the better register organization there are at least two areas where
SSE2 can outperform x87 significantly.
1. Float to integer conversion
Is quite inefficient on x87 because you have to explicitly set and
restore rounding mode. Typical
let round x = truncate (x +. 0.5)
Translates to
_camlT__round_58:
        sub esp, 8
L100:
        fld L101
        fadd REAL8 PTR [eax]
        sub esp, 8
        fnstcw [esp+4]
        mov ax, [esp+4]
        mov ah, 12
        mov [esp], ax
        fldcw [esp]
        fistp DWORD PTR [esp]
        mov eax, [esp]
        fldcw [esp+4]
        add esp, 8
        lea eax, DWORD PTR [eax+eax+1]
        add esp, 8
        ret
but just to
_camlT__round_58:
L100:
        movlpd xmm0, L101
        addsd xmm0, REAL8 PTR [eax]
        cvttsd2si eax, xmm0
        lea eax, DWORD PTR [eax+eax+1]
        ret
with SSE2.
2. Float compare
Does not set flags on x87 so
let fmin (x:float) y = if x < y then x else y
ends up with
_camlT__fmin_58:
        sub esp, 8
L101:
        mov ecx, eax
        fld REAL8 PTR [ebx]
        fld REAL8 PTR [ecx]
        fcompp
        fnstsw ax
        and ah, 69
        cmp ah, 1
        jne L100
        mov eax, ecx
        add esp, 8
        ret
L100:
        mov eax, ebx
        add esp, 8
        ret
on SSE2 you just have
_camlT__fmin_58:
L101:
        movlpd xmm1, REAL8 PTR [ebx]
        movlpd xmm0, REAL8 PTR [eax]
        comisd xmm1, xmm0
        jbe L100
        ret
L100:
        mov eax, ebx
        ret
As for the SSE2 back-end as presented, I have some thoughts regarding the
code (fast math functions via x87 are questionable, optimization of
floating compare, etc.). Where should these be discussed: just here, or is
there some entry in Mantis?
- Dmitry Bely
Sure, it was just a code generation example. I should probably have
named it round_positive.
> that's not what you want. You want :
>
>> let round x = floor (x +. 0.5)
...and you get a floor() C call. I would rather use
let round x = truncate (x +. (if x > 0. then 0.5 else -0.5))
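The three variants discussed here differ only on negative inputs. The sketch below puts them side by side. The names `round_floor` and `round_symmetric` are invented for this comparison, and the floor-based variant is wrapped in `truncate` so that all three return an `int`, which the quoted one-liner does not do.

```ocaml
(* The original example: correct only for non-negative inputs. *)
let round_positive (x : float) : int = truncate (x +. 0.5)

(* Floor-based variant: rounds halves upward, at the cost of a call to floor. *)
let round_floor (x : float) : int = truncate (floor (x +. 0.5))

(* Sign-aware variant: rounds halves away from zero, no C call needed. *)
let round_symmetric (x : float) : int =
  truncate (x +. (if x > 0. then 0.5 else -0.5))

let () =
  List.iter
    (fun x ->
      Printf.printf "%5.1f -> %d %d %d\n" x
        (round_positive x) (round_floor x) (round_symmetric x))
    [ 1.7; 2.5; -1.7; -2.5 ]
```

For -1.7, `round_positive` returns -1 because `truncate` rounds toward zero, while the other two return -2. The last two disagree only on negative halves such as -2.5.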
>> This is a call for testers concerning an experimental OCaml compiler
>> back-end that uses SSE2 instructions for floating-point arithmetic.[...]
>
> I cannot provide any benchmark yet
Too bad :-( I got very little feedback to my call: just one data point
(thanks Gaetan). Perhaps most OCaml users interested in numerical
computations have switched to x86-64bits already? At any rate, given
such a lack of interest, this x86-32/SSE2 port isn't going to make it
into the OCaml distribution.
> but even not taking into account
> the better register organization there are at least two areas where
> SSE2 can outperform x87 significantly.
>
> 1. Float to integer conversion
> Is quite inefficient on x87 because you have to explicitly set and
> restore rounding mode.
Right. The mode change makes the conversion about 10x slower on x87
than on SSE2. Apparently, float->int conversion is uncommon in
numerical code, otherwise we'd observe bigger speedups on real
applications...
> 2. Float compare
> Does not set flags on x87 so
The SSE2 code is prettier than the x87 code, but this doesn't seem to
translate into a significant performance gain, in my limited testing.
> As for SSE2 backend presented I have some thoughts regarding the code
> (fast math functions via x87 are questionable,
Most x86-32bits C libraries implement sin(), cos(), etc with the x87
instructions, so I'm curious to know what you find objectionable here.
> optimization of floating compare etc.) Where to discuss that - just
> here or there is some entry in Mantis?
Why not start on this list? We'll move to private e-mail if the
discussion becomes too heated :-)
- Xavier Leroy
>>> This is a call for testers concerning an experimental OCaml compiler
>>> back-end that uses SSE2 instructions for floating-point arithmetic.[...]
>>
>> I cannot provide any benchmark yet
>
> Too bad :-( I got very little feedback to my call: just one data point
> (thanks Gaetan). Perhaps most OCaml users interested in numerical
> computations have switched to x86-64bits already? At any rate, given
> such a lack of interest, this x86-32/SSE2 port isn't going to make it
> into the OCaml distribution.
It's a pity. Probably even my (future) benchmarks won't help...
>> but even not taking into account
>> the better register organization there are at least two areas where
>> SSE2 can outperform x87 significantly.
>>
>> 1. Float to integer conversion
>> Is quite inefficient on x87 because you have to explicitly set and
>> restore rounding mode.
>
> Right. The mode change makes the conversion about 10x slower on x87
> than on SSE2. Apparently, float->int conversion is uncommon in
> numerical code, otherwise we'd observe bigger speedups on real
> applications...
>
>> 2. Float compare
>> Does not set flags on x87 so
>
> The SSE2 code is prettier than the x87 code, but this doesn't seem to
> translate into a significant performance gain, in my limited testing.
>
>> As for SSE2 backend presented I have some thoughts regarding the code
>> (fast math functions via x87 are questionable,
>
> Most x86-32bits C libraries implement sin(), cos(), etc with the x87
> instructions, so I'm curious to know what you find objectionable here.
Microsoft's implementation for P4 and above is SSE2-based. And Intel
itself recommends doing so:
[quote]
What Is AM Library?
===================
Ever missed a sine or arctangent instruction among Intel Streaming
SIMD Extensions? Ever wished there were a way to calculate logarithm
or exponent in about a dozen cycles? Here is a new release of
Approximate Math Library (AM Library) -- a set of fast routines to
calculate math functions using Intel(R) Streaming SIMD Extensions
(SSE) and Streaming SIMD Extensions 2 (SSE2). The Library offers
trigonometric, reverse trigonometric, logarithmic, and exponential
functions for packed and scalar arguments. The processing speed is
many times faster than that of x87 instructions and even of table
lookups. The accuracy of AM Library routines can be adequate for many
applications. It is comparable with that of reciprocal SSE
instructions, and is hundreds times better than what is achievable
with lookup tables.
The AM Library is provided along with the full source code and a usage sample.
[end of quote]
http://www.intel.com/design/pentiumiii/devtools/AMaths.zip
Another interesting read:
http://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf
>> optimization of floating compare etc.) Where to discuss that - just
>> here or there is some entry in Mantis?
>
> Why not start on this list? We'll move to private e-mail if the
> discussion becomes too heated :-)
OK
1. My variant of emit_float_test (in many cases eliminates extra jump).
let emit_float_test cmp neg arg lbl =
  let opcode_jp cmp =
    match (cmp, neg) with
      (Ceq, false) -> ("je", true)
    | (Ceq, true) -> ("jne", true)
    | (Cne, false) -> ("jne", true)
    | (Cne, true) -> ("je", true)
    | (Clt, false) -> ("jb", true)
    | (Clt, true) -> ("jae", true)
    | (Cle, false) -> ("jbe", true)
    | (Cle, true) -> ("ja", true)
    | (Cgt, false) -> ("ja", false)
    | (Cgt, true) -> ("jbe", false)
    | (Cge, false) -> ("jae", true)
    | (Cge, true) -> ("jb", false) in
  let branch_opcode, need_jp = opcode_jp cmp in
  let branch_opcode, arg0, arg1, need_jp =
    match arg.(1).loc with
      Reg _ when need_jp ->
        (* swap args if it excludes jmp *)
        let (branch_opcode_swap, need_jp_swap) =
          opcode_jp
            (match cmp with
               Ceq -> Ceq
             | Cne -> Cne
             | Clt -> Cgt
             | Cle -> Cge
             | Cgt -> Clt
             | Cge -> Cle) in
        if need_jp_swap
        then (branch_opcode, arg.(0), arg.(1), true)
        else (branch_opcode_swap, arg.(1), arg.(0), false)
    | _ ->
        (branch_opcode, arg.(0), arg.(1), need_jp)
  in
  begin match cmp with
  | Ceq | Cne -> ` ucomisd `
  | _ -> ` comisd `
  end;
  `{emit_reg arg0}, {emit_reg arg1}\n`;
  let branch_if_not_comparable =
    if cmp = Cne then not neg else neg in
  if need_jp then
    if branch_if_not_comparable then begin
      ` jp {emit_label lbl}\n`;
      ` {emit_string branch_opcode} {emit_label lbl}\n`
    end else begin
      let next = new_label() in
      ` jp {emit_label next}\n`;
      ` {emit_string branch_opcode} {emit_label lbl}\n`;
      `{emit_label next}:\n`
    end
  else begin
    ` {emit_string branch_opcode} {emit_label lbl}\n`
  end
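The branch table above can be sanity-checked outside the compiler with a small model of the flags that comisd/ucomisd set (unordered: ZF=PF=CF=1; greater: all clear; less: CF=1; equal: ZF=1). The following is a sketch under that assumption, not part of the back-end: `compare_flags`, `taken`, `branches` and `holds` are names invented here, the table is copied from the variant above, and the operand-swap step is left out.

```ocaml
(* Comparison kinds, named after the constructors used in the code above. *)
type comparison = Ceq | Cne | Clt | Cle | Cgt | Cge

(* Flags as set by comisd/ucomisd a, b. *)
type flags = { zf : bool; pf : bool; cf : bool }

let compare_flags (a : float) (b : float) : flags =
  if a <> a || b <> b then { zf = true; pf = true; cf = true } (* unordered *)
  else if a > b then { zf = false; pf = false; cf = false }
  else if a < b then { zf = false; pf = false; cf = true }
  else { zf = true; pf = false; cf = false }

(* Whether a conditional jump is taken for the given flags. *)
let taken (opcode : string) (f : flags) : bool =
  match opcode with
  | "je" -> f.zf
  | "jne" -> not f.zf
  | "jb" -> f.cf
  | "jae" -> not f.cf
  | "jbe" -> f.cf || f.zf
  | "ja" -> not (f.cf || f.zf)
  | "jp" -> f.pf
  | _ -> invalid_arg "taken: unknown opcode"

(* The (opcode, need_jp) table, copied from the variant above. *)
let opcode_jp (cmp : comparison) (neg : bool) : string * bool =
  match (cmp, neg) with
  | (Ceq, false) -> ("je", true)
  | (Ceq, true) -> ("jne", true)
  | (Cne, false) -> ("jne", true)
  | (Cne, true) -> ("je", true)
  | (Clt, false) -> ("jb", true)
  | (Clt, true) -> ("jae", true)
  | (Cle, false) -> ("jbe", true)
  | (Cle, true) -> ("ja", true)
  | (Cgt, false) -> ("ja", false)
  | (Cgt, true) -> ("jbe", false)
  | (Cge, false) -> ("jae", true)
  | (Cge, true) -> ("jb", false)

(* Model of the emitted jump sequence: does control reach lbl? *)
let branches (cmp : comparison) (neg : bool) (a : float) (b : float) : bool =
  let opcode, need_jp = opcode_jp cmp neg in
  let f = compare_flags a b in
  let branch_if_not_comparable = if cmp = Cne then not neg else neg in
  if not need_jp then taken opcode f
  else if branch_if_not_comparable then taken "jp" f || taken opcode f
  else (not (taken "jp" f)) && taken opcode f

(* Source-level meaning of each comparison on floats. *)
let holds (cmp : comparison) (a : float) (b : float) : bool =
  match cmp with
  | Ceq -> a = b
  | Cne -> a <> b
  | Clt -> a < b
  | Cle -> a <= b
  | Cgt -> a > b
  | Cge -> a >= b

let () =
  let values = [ nan; 1.0; 2.0 ] in
  List.iter (fun cmp ->
    List.iter (fun neg ->
      List.iter (fun a ->
        List.iter (fun b ->
          (* The branch must be taken exactly when (cmp a b) xor neg. *)
          assert (branches cmp neg a b = (holds cmp a b <> neg)))
          values)
        values)
      [ false; true ])
    [ Ceq; Cne; Clt; Cle; Cgt; Cge ];
  print_endline "branch table agrees with float comparison semantics"
```

Under this model every row of the table gives the intended result for ordered and unordered operands. The same model suggests that `jae` alone would also be correct for `Cge` without negation, since CF is set in the unordered case, but that is an observation about the model, not a tested change to the emitter.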
2. My variant of fast math functions (see explanation above)
let emit_floatspecial = function
    "sqrt" -> ` sqrtsd `
  | _ -> assert false
3. Loading st(0) can be two instructions shorter :)
` sub esp, 8\n`;
` fstp REAL8 PTR [esp]\n`;
` movsd {emit_reg dst}, REAL8 PTR [esp]\n`;
` add esp, 8\n`
can be written as
` fstp REAL8 PTR [esp-8]\n`;
` movlpd {emit_reg dst}, REAL8 PTR [esp-8]\n`;
4. Unnecessary instruction in Lop(Iload(Single, addr))
` movss {emit_reg dest}, REAL4 PTR {emit_addressing addr i.arg 0}\n`;
` cvtss2sd {emit_reg dest}, {emit_reg dest}\n`
can be written as
` cvtss2sd {emit_reg dest}, REAL4 PTR {emit_addressing addr i.arg 0}\n`
- Dmitry Bely