
[Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C


Paul Stodghill

Oct 21, 2002, 6:18:16 AM
to Xavier Leroy, caml...@inria.fr
>>>>> "Xavier" == Xavier Leroy <xavier...@inria.fr> writes:
Xavier> No, it doesn't. It doesn't hoist loop-invariant computations
Xavier> either. (See http://caml.inria.fr/ocaml/numerical.html)

I'll take a look. Thanks.

Xavier> Multiplications are pretty expensive on modern processors,
Xavier> so your code would run much faster if you hoist the
Xavier> computation i*120 out of the j and k loops, and
Xavier> strength-reduce k*120 in the innermost loop.

I don't see how to do strength-reduction without introducing references,
which appears to reduce performance more than is gained.
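For concreteness, here is a hedged sketch of what the ref-based, strength-reduced inner loop might look like on the flattened 120x120 case (the function and array names are my own, not the original code):

```ocaml
(* Hypothetical sketch: i*120 hoisted out of the j and k loops, and
   k*120 strength-reduced into the running offset k120. The ref cells
   sum and k120 are exactly the references whose cost is at issue. *)
let mmm (a : float array) (b : float array) (c : float array) =
  for i = 0 to 119 do
    let i120 = i * 120 in                  (* hoisted invariant *)
    for j = 0 to 119 do
      let sum = ref 0.0 in
      let k120 = ref 0 in                  (* replaces k * 120 *)
      for k = 0 to 119 do
        sum := !sum +. a.(i120 + k) *. b.(!k120 + j);
        k120 := !k120 + 120                (* strength-reduction step *)
      done;
      c.(i120 + j) <- !sum
    done
  done
```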

Xavier> Another option worth considering is to swap the j and k
Xavier> loops, thus making k*120 and a.(i*120+k) two invariants of
Xavier> the inner loop. In this case, you might not even have to
Xavier> strength-reduce k*120.

I tried this, and indeed the performance of the ML code goes up to 95
Mflops. However, the performance of the C code goes down, as the ikj
loop ordering requires an additional store of C[i,j] in every iteration.
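For reference, the swapped ikj ordering might look like this sketch (hypothetical names; note the extra load and store of c.(i120 + j) on every inner iteration, and that c must start zeroed):

```ocaml
(* Hypothetical ikj ordering on the flattened 120x120 multiply:
   a.(i120 + k) and k*120 are invariant in the innermost j loop,
   so no strength reduction is needed. c must be zero-initialized. *)
let mmm_ikj (a : float array) (b : float array) (c : float array) =
  for i = 0 to 119 do
    let i120 = i * 120 in
    for k = 0 to 119 do
      let aik = a.(i120 + k) in            (* invariant in the j loop *)
      let k120 = k * 120 in
      for j = 0 to 119 do
        c.(i120 + j) <- c.(i120 + j) +. aik *. b.(k120 + j)
      done
    done
  done
```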

Xavier> Other loop optimizations such as loop unrolling would (I
Xavier> guess) make much less of a difference -- modern processors
Xavier> are very good at branch prediction, making loop overhead
Xavier> low.

Loop unrolling can also increase ILP. It can make a big difference in
performance.

Loop unrolling can be thought of as a way of getting some of the effect
of software pipelining without the complicated scheduling.
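To illustrate the ILP point, here is a 4-way unrolled dot product with independent accumulators (a sketch under my own naming, not code from the thread):

```ocaml
(* Illustrative 4-way unrolling: four independent accumulators break
   the floating-point add dependency chain, exposing instruction-level
   parallelism, and the loop overhead is paid once per four elements. *)
let dot (a : float array) (b : float array) =
  let n = Array.length a in
  let s0 = ref 0.0 and s1 = ref 0.0 and s2 = ref 0.0 and s3 = ref 0.0 in
  let k = ref 0 in
  while !k + 3 < n do
    s0 := !s0 +. a.(!k)     *. b.(!k);
    s1 := !s1 +. a.(!k + 1) *. b.(!k + 1);
    s2 := !s2 +. a.(!k + 2) *. b.(!k + 2);
    s3 := !s3 +. a.(!k + 3) *. b.(!k + 3);
    k := !k + 4
  done;
  while !k < n do                          (* scalar cleanup, 0-3 elements *)
    s0 := !s0 +. a.(!k) *. b.(!k);
    incr k
  done;
  !s0 +. !s1 +. !s2 +. !s3
```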


Xavier> Some more thoughts:

Xavier> Bagley's code works on a "proper" matrix (an array of
Xavier> arrays), while yours flattens the matrix into one single
Xavier> array. There are pros and cons to either approach, but
Xavier> notice that your approach (without strength reduction)
Xavier> entails fewer loads but more multiplications, which can be
Xavier> more expensive...

Right. So it is a win when it is done in combination with strength reduction.


Xavier> Tiling loops is good for really large matrices, but yours
Xavier> (in this example) occupies "only" 115 Kb, thus fits
Xavier> comfortably in L2 cache. Perhaps tiling isn't beneficial in
Xavier> this case.

True. This was a miscalculation on my part. However, the blocksize that
I chose makes the blocks fit in the L1 cache, so I was still getting a
measurable benefit.

Xavier> - Xavier Leroy

Here is what I was trying to accomplish - I am involved with a project
that is trying to automate/generalize some of the tricks that ATLAS
(http://math-atlas.sourceforge.net/) uses for getting maximal
performance from matrix-matrix multiply (MMM). I knew about Bagley's
conclusion that O'Caml was competitive with C for MMM, so I was curious
how close O'Caml came to GCC on Blocked MMM. I would like to use O'Caml
as a target language over C or Fortran.

My conclusion, at this point, is that the current native O'Caml compiler
is not going to be competitive with a native C or Fortran compiler because
it does not optimize the innermost loop as aggressively.

Thanks for your help.

-------------------
To unsubscribe, mail caml-lis...@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

Xavier Leroy

Oct 21, 2002, 6:21:44 AM
to Paul Stodghill, caml...@inria.fr
> Looking at the assembly produced by O'Caml and GCC, it appears that GCC
> is performing loop unrolling (as requested with -funroll-loops) and
> strength reduction in the inner loops. I can easily see why these two
> optimizations can result in such a tremendous performance difference.
>
> My question is this: I can obviously perform loop unrolling myself
> by hand - does ocamlopt perform strength reduction?

No, it doesn't. It doesn't hoist loop-invariant computations either.
(See http://caml.inria.fr/ocaml/numerical.html)

Multiplications are pretty expensive on modern processors, so your
code would run much faster if you hoist the computation i*120 out of
the j and k loops, and strength-reduce k*120 in the innermost
loop.

Another option worth considering is to swap the j and k loops, thus
making k*120 and a.(i*120+k) two invariants of the inner loop. In
this case, you might not even have to strength-reduce k*120.

Other loop optimizations such as loop unrolling would (I guess) make
much less of a difference -- modern processors are very good at branch
prediction, making loop overhead low.

Some more thoughts:

Bagley's code works on a "proper" matrix (an array of arrays), while
yours flattens the matrix into one single array. There are pros and
cons to either approach, but notice that your approach (without
strength reduction) entails fewer loads but more multiplications,
which can be more expensive...

Tiling loops is good for really large matrices, but yours (in this
example) occupies "only" 115 Kb, thus fits comfortably in L2 cache.
Perhaps tiling isn't beneficial in this case.

- Xavier Leroy

Issac Trotts

Oct 21, 2002, 6:33:54 AM
to OCaml Mailing List
> I don't see how to do strength-reduction without introducing references,
> which appears to reduce performance more than is gained.

You might try converting your references to mutable fields.
Here's an example of the performance gain:

% cat ref.ml
let x = ref 1.0 in
let n = int_of_string Sys.argv.(1) in
for i = 1 to n do x := !x +. 1.0 done

% cat ref2.ml
type t = { mutable f:float };;
let x = { f = 1.0 } in
let n = int_of_string Sys.argv.(1) in
for i = 1 to n do x.f <- x.f +. 1.0 done

% ocamlopt -o ref{,.ml}
% ocamlopt -o ref2{,.ml}

% time ./ref 100000000
./ref 100000000 2.51s user 0.00s system 99% cpu 2.515 total

% time ./ref2 100000000
./ref2 100000000 1.54s user 0.01s system 100% cpu 1.542 total

The standard references take 1.6 times as much cpu time in this
case.

Issac

Xavier Leroy

Oct 21, 2002, 8:56:15 AM
to Paul Stodghill, caml...@inria.fr
> I don't see how to do strength-reduction without introducing references,
> which appears to reduce performance more than is gained.

The compiler can optimize "local" references into variables, avoiding
the extra indirection normally associated with references. But even
with this optimization, there might be issues related to the small
number of registers available on the x86.
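For example, a reference that never escapes the function body, like the accumulator below, can be represented as a plain mutable variable with no heap cell (a minimal sketch; the function name is my own):

```ocaml
(* s is local and never escapes the function: the compiler can keep it
   in a register or stack slot instead of allocating a ref cell. *)
let sum (a : float array) =
  let s = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    s := !s +. a.(i)
  done;
  !s
```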

> Here is what I was trying to accomplish - I am involved with a project
> that is trying to automate/generalize some of the tricks that ATLAS
> (http://math-atlas.sourceforge.net/) uses for getting maximal
> performance from matrix-matrix multiply (MMM). I knew about Bagley's
> conclusion that O'Caml was competitive with C for MMM, so I was curious
> how close O'Caml came to GCC on Blocked MMM. I would like to use O'Caml
> as a target language over C or Fortran.

OCaml is a fine language for writing the problem analysis and code
generation of specialized numerical components; see e.g. http://www.fftw.org/
However, I agree with your conclusion that for ultimate performance
you probably want to generate Fortran (for a good parallelizing compiler)
or C (with asm inserts for SIMD instructions).

- Xavier Leroy

Christophe TROESTLER

Oct 23, 2002, 6:16:30 AM
to caml...@inria.fr
On Sun, 20 Oct 2002, Issac Trotts <ijtr...@ucdavis.edu> wrote:
>
> You might try converting your references to mutable fields.
>
> let x = ref 1.0 in
> let n = int_of_string Sys.argv.(1) in
> for i = 1 to n do x := !x +. 1.0 done
>
> ./ref 100000000 2.51s user 0.00s system 99% cpu 2.515 total
>
> type t = { mutable f:float };;
> let x = { f = 1.0 } in
> let n = int_of_string Sys.argv.(1) in
> for i = 1 to n do x.f <- x.f +. 1.0 done
>
> ./ref2 100000000 1.54s user 0.01s system 100% cpu 1.542 total

A few questions in view of this. First, on my machine (an AMD Athlon
1GHz running GNU/Linux), the timings favor ref.ml:

time ./ref 100000000
real 0m1.279s user 0m1.280s sys 0m0.000s
time ./ref2 100000000
real 0m1.411s user 0m1.380s sys 0m0.000s

What could be a reason for that?

Second, aren't references optimized when their type is statically
known to be float ref? (I thought so; please confirm or correct.)

It seems to me there are three main issues concerning floats:
* storing (avoid unnecessary indirections but take care of GC)
* comparisons (= is not reflexive in IEEE 754 arithmetic)
* conversions

About "conversions", float : int -> float seems to be slow (compared
to a cast in C at least). Is there any way to optimize it when it
intervene in an algebraic expression? (Frequently on has to write
things like: for i=0 to n do let x = float i / float n in ... done)
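One workaround is to hoist the invariant conversion out of the loop, as in this small sketch (the function name is hypothetical):

```ocaml
(* float n is loop-invariant, so convert it once before the loop;
   only the conversion of i remains in the loop body. *)
let normalized n =
  let fn = float n in                      (* hoisted conversion *)
  Array.init (n + 1) (fun i -> float i /. fn)
```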

I understand that float values need to be boxed to "dialog" with
polymorphic functions. Let me picture it as

+--------+ f : float -> float +--------+
| | --------------------> | |
+--------+ +--------+
| |
V V
+--------+ +--------+
| double | | double |
+--------+ +--------+

However, couldn't we imagine that functions with float arguments or
return values have "two interfaces": the standard one (which works on
the pointer) and another one (which works on the value):

+--------+ f : float -> float +--------+
| | --------------------> | |
+--------+ +--------+
| |
V V
+--------+ f': float -> float +--------+
| double | --------------------> | double |
+--------+ +--------+

The _idea_, in C parlance, is to declare f(x) as &(f'(*x)). Now, the
boxing should allow the GC to take care of these values. But, if a
function returning a float feeds another function expecting a float,
the compiler could connect the "bottom lines" instead of passing
through the pointers:

f : 'a -> float +--------+ +--------+ g : float -> 'b
-----------------> | | | | ----------------->
+--------+ +--------+
| |
V V
f': 'a -> float +--------+ +--------+ g': float -> 'b
-----------------> | double |----->| double | ----------------->
+--------+ +--------+

This kind of idea could also apply to recursive functions passing
float values along the recursion...

My question is: is this type of idea workable? Is it difficult to
implement? (In a way it just generalizes the special treatment of
arithmetic expressions.) Maybe this can be generalized further to put
float references & equalities under the same umbrella?

Bear in mind that I am not a compiler expert (and even the people
giving compiler courses here are not very helpful), so my questions
are also a way for me to learn a little something...

Cheers,
ChriS

malc

Oct 23, 2002, 8:10:30 AM
to Christophe TROESTLER, caml...@inria.fr
On Wed, 23 Oct 2002, Christophe TROESTLER wrote:

> On Sun, 20 Oct 2002, Issac Trotts <ijtr...@ucdavis.edu> wrote:
> >
> > You might try converting your references to mutable fields.
> >
> > let x = ref 1.0 in
> > let n = int_of_string Sys.argv.(1) in
> > for i = 1 to n do x := !x +. 1.0 done
> >
> > ./ref 100000000 2.51s user 0.00s system 99% cpu 2.515 total
> >
> > type t = { mutable f:float };;
> > let x = { f = 1.0 } in
> > let n = int_of_string Sys.argv.(1) in
> > for i = 1 to n do x.f <- x.f +. 1.0 done
> >
> > ./ref2 100000000 1.54s user 0.01s system 100% cpu 1.542 total
>
> A few questions in view of this. First, on my machine (AMD Athlon
> 1GHz running GNU/Linux), the timings give a preference to ref.ml
>
> time ./ref 100000000
> real 0m1.279s user 0m1.280s sys 0m0.000s
> time ./ref2 100000000
> real 0m1.411s user 0m1.380s sys 0m0.000s
>
> What could be a reason for that?

I think the reason is simple: both are more or less no-ops, since
x and x.f are not used anywhere, hence there is no need to allocate
the float. This short example highlights the difference:

let useref n =
  let x = ref 1.0 in
  for i = 1 to n do x := !x +. 1.0 done;
  !x

type t = { mutable f:float };;

let userec n =
  let x = { f = 1.0 } in
  for i = 1 to n do x.f <- x.f +. 1.0 done;
  x.f

let _ =
  let n = int_of_string Sys.argv.(2) in
  Printf.printf "%f\n"
    (if Sys.argv.(1) = "ref" then useref n else userec n)

ref# time ./refrec rec 100000000
100000001.000000

real 0m2.283s
user 0m2.280s
sys 0m0.000s
ref# time ./refrec ref 100000000
100000001.000000

real 0m1.916s
user 0m1.910s
sys 0m0.010s

More or less the same machine here.

--
mailto:ma...@pulsesoft.com

malc

Oct 23, 2002, 8:02:40 PM
to Christophe TROESTLER, caml...@inria.fr

I believe i should add something here. Let's look at the inner loops.

Mutable version:
.L107:
fld1
faddl (%ecx)
fstpl (%ecx)
addl $2, %eax
cmpl %ebx, %eax
jle .L107

Suboptimal but ok...

Reference version:
.L101:
.L103: movl young_ptr, %eax
subl $12, %eax
movl %eax, young_ptr
cmpl young_limit, %eax
jb .L104
leal 4(%eax), %ecx
movl $2301, -4(%ecx)
fld1
faddl (%esi)
fstpl (%ecx)
movl %ecx, %esi
addl $2, %ebx
cmpl %edx, %ebx
jle .L101

Lots of instructions plus boxing... and yet it's faster than the
mutable version. Wonders of modern CPUs.

My first take at the simplest asm code doing the same:
mov eax, n
fld1
fld1
LL:
fadd st, st(1)
dec eax
jnz LL

fstp result
fstp st

ref# time ./c 100000000
100000001.000000

real 0m0.394s
user 0m0.390s
sys 0m0.000s

(It turned out that both gcc and icc produce similar code, give or take.)

P.S. It would be interesting to see timings produced on a P3/P4.
