
Value types (Was: [Caml-list] ocamlopt LLVM support)


Jon Harrop

Dec 12, 2010, 9:54:50 AM
to caml...@inria.fr
The Haskell guys got their biggest performance improvement from moving to LLVM
on the hailstone benchmark, so it is interesting to examine this case as well.
I just added support for 64-bit ints to HLVM in order to implement that
benchmark, and my code is:

let rec collatzLen ((c, n): int * int64) : int =
  if n = 1L then c else
    collatzLen (c+1, if n % 2L = 0L then n / 2L else 3L * n + 1L);;

let rec loop ((i, (nlen, n)): int64 * (int * int64)) : int64 =
  if i = 1L then n else
    let ilen = collatzLen(1, i) in
    let nlen, n = if ilen > nlen then ilen, i else nlen, n in
    loop (i-1L, (nlen, n));;

When run without using LLVM's optimization passes, this produces the
following output for 10k, 100k and 1M, respectively:

- : `Int64 = 6171L
Live: 0
0.00704098s total; 0s suspend; 0s mark; 0s sweep
- : `Int64 = 77031L
Live: 0
0.087815s total; 0s suspend; 0s mark; 0s sweep
- : `Int64 = 837799L
Live: 0
1.06907s total; 0s suspend; 0s mark; 0s sweep

With LLVM's default optimization passes enabled, I get:

- : `Int64 = 6171L
Live: 0
0.00508595s total; 0s suspend; 0s mark; 0s sweep
- : `Int64 = 77031L
Live: 0
0.0626569s total; 0s suspend; 0s mark; 0s sweep
- : `Int64 = 837799L
Live: 0
0.759738s total; 0s suspend; 0s mark; 0s sweep

Note the ~30% performance improvement in this case. In other cases, LLVM's
default optimization passes can degrade performance significantly.

Here’s the equivalent OCaml code:

let rec collatzLen(c, n) : int =
  if n = 1L then c else
    collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
                     Int64.add (Int64.mul 3L n) 1L);;

let rec loop(i, (nlen, n)) =
  if i = 1L then n else
    let ilen = collatzLen(1, i) in
    let nlen, n = if ilen > nlen then ilen, i else nlen, n in
    loop (Int64.sub i 1L, (nlen, n));;

and Haskell:

import Data.Int

collatzLen :: Int -> Int64 -> Int
collatzLen c 1 = c
collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else 3*n+1

pmax x n = x `max` (collatzLen 1 n, n)

main = print $ foldl pmax (1,1) [2..1000000]

and C99:

#include <stdio.h>

int collatzLen(int c, long long n) {
  return (n==1 ? c : collatzLen(c+1, (n%2==0 ? n/2 : 3*n+1)));
}

long long loop(long long m) {
  int nlen=1;
  long long n = 1;
  for (long long i=2; i<=m; ++i) {
    const int ilen = collatzLen(1, i);
    if (ilen > nlen) {
      nlen = ilen;
      n = i;
    }
  }
  return n;
}

int main() {
  printf("%lld", loop(1000000));
}

And here are my timings:

OCaml: 24.3s ocamlopt
Haskell: 24.0s ghc6.10 --make -O2
F#.NET: 3.45s fsc --optimize+ --platform:x86
C: 1.20s gcc -O3 -std=c99
HLVM: 1.07s ./repl --compile
HLVM (opt): 0.76s opt -tailcallelim -std-compile-opts

These results really demonstrate two things:

1. Unboxing can give huge performance improvements on serial code, let alone
parallel code. The optimized HLVM is running 32× faster than the OCaml here.

2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it to beat
GCC-compiled C code here.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Török Edwin

Dec 12, 2010, 10:55:52 AM
to Jon Harrop, caml...@inria.fr
On Sun, 12 Dec 2010 14:54:14 -0000
"Jon Harrop" <j...@ffconsultancy.com> wrote:

> The Haskell guys got their best performance improvement moving to
> LLVM from the hailstone benchmark so it is interesting to examine
> this case as well. I just added support for 64-bit ints to HLVM to
> implement that benchmark and my code is:
>

> Here’s the equivalent OCaml code:
>
> let rec collatzLen(c, n) : int =
> if n = 1L then c else
> collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else

OK, but boxing itself has nothing to do with the performance degradation
here. It is the lack of compiler optimizations on the Int64 type. This
could be solved by implementing compiler optimizations for it (or
manually optimizing some integer arithmetic that is slow).

Let's run the code under a profiler, or look at the assembly (I used
'perf record' and 'perf report'). Two 'idiv' instructions turn up as the
top offenders in the profile.

Problem #1: Int64.rem n 2 -> another idiv instruction

A C compiler would optimize this to an 'and' instruction.
Change that to 'Int64.logand n 1L = 0L'.

Problem #2: Int64.div n 2 -> idiv instruction.

A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds
up the code.

With these changes I get almost the same speed as the C code:
$ ocamlopt x.ml -o x && time ./x
837799
real 0m0.664s
user 0m0.667s
sys 0m0.000s

$ gcc -O3 x.c && time ./a.out
837799
real 0m0.635s
user 0m0.633s
sys 0m0.000s

Here's the OCaml code:


let rec collatzLen(c, n) : int =
  if n = 1L then c else
    collatzLen (c+1, if Int64.logand n 1L = 0L then Int64.shift_right n 1
                     else Int64.add (Int64.mul 3L n) 1L);;

let rec loop(i, (nlen, n)) =
  if i = 1L then n else
    let ilen = collatzLen(1, i) in
    let nlen, n = if ilen > nlen then ilen, i else nlen, n in
    loop (Int64.sub i 1L, (nlen, n));;

let _ =
  let s = loop (1000000L, (1,1000000L)) in
  print_int (Int64.to_int s);;

> 1. Unboxing can give huge performance improvements on serial code,

s/Unboxing/arithmetic optimizations/
Please find an example where the performance benefit is due to
unboxing, and not due to arithmetic optimizations performed on the
unboxed code.

> let alone parallel code. The optimized HLVM is running 32× faster
> than the OCaml here.
>
> 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it
> to beat GCC-compiled C code here.
>

One advantage of using LLVM is that it would notice arithmetic
optimizations like this and perform them itself (even if you use the
boxed representation).

Best regards,
--Edwin

Jon Harrop

Dec 12, 2010, 12:15:01 PM
to Török Edwin, caml...@inria.fr
Török Edwin wrote:
> Problem #1: Int64.rem n 2 -> another idiv instruction
>
> A C compiler would optimize this to an 'and' instruction.
> Change that to 'Int64.logand n 1L = 0L'.

Yes. LLVM did that for me.

> Problem #2: Int64.div n 2 -> idiv instruction.
>
> A C compiler would optimize this to a right shift. Changing that to
> 'Int64.shift_right n 1' speeds
> up the code.

Yes. LLVM also did that for me. In fact, I have been bitten by ocamlopt not optimizing div and mod by a constant in real OCaml code before. This problem also turns up in the context of hash table implementations where you want to % by the length of the spine.
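For the hash-table case there is a well-known manual workaround: keep the spine length a power of two so the modulo becomes a mask. A tiny sketch (hypothetical helper names, purely for illustration):

  (* Valid only when len is a power of two and h >= 0. *)
  let bucket_div h len = h mod len           (* may compile to an idiv *)
  let bucket_mask h len = h land (len - 1)   (* a single 'and' *)

  let () = assert (bucket_div 12345 64 = bucket_mask 12345 64)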

> With these changes I get almost the same speed as the C code:
> $ ocamlopt x.ml -o x && time ./x
> 837799
> real 0m0.664s
> user 0m0.667s
> sys 0m0.000s
>
> $ gcc -O3 x.c && time ./a.out
> 837799
> real 0m0.635s
> user 0m0.633s
> sys 0m0.000s
>
> Here's the OCaml code:
> let rec collatzLen(c, n) : int =
> if n = 1L then c else
> collatzLen (c+1, if Int64.logand n 1L = 0L then Int64.shift_right
> n 1 else Int64.add (Int64.mul 3L n) 1L);;
>
> let rec loop(i, (nlen, n)) =
> if i = 1L then n else
> let ilen = collatzLen(1, i) in
> let nlen, n = if ilen > nlen then ilen, i else nlen, n in
> loop (Int64.sub i 1L, (nlen, n));;
>
> let _ =
> let s = loop (1000000L, (1,1000000L)) in
> print_int (Int64.to_int s);;

I am unable to reproduce your results. Here, the time falls from 24s to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26× slower than HLVM.

> > 1. Unboxing can give huge performance improvements on serial code,
>
> s/Unboxing/arithmetic optimizations/
> Please find an example where the performance benefit is due to
> unboxing, and not due to arithmetic optimizations performed on the
> unboxed code.

The last example I gave (array of key-value pairs) demonstrates some of the performance improvements offered by unboxing in the heap (12.3× faster than OCaml in that case). And I'm not yet convinced that this example is invalid, because I cannot reproduce your results.

> > let alone parallel code. The optimized HLVM is running 32× faster
> > than the OCaml here.
> >
> > 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it
> > to beat GCC-compiled C code here.
>
> One advantage of using LLVM is that it would notice arithmetic
> optimizations like this and perform it itself (even if you use the
> boxed representation).

Yes. Hopefully LLVM optimizes div/mod by any constant, which is quite tricky in the general case.

Cheers,
Jon.

Jon Harrop

Dec 12, 2010, 1:01:31 PM
to Török Edwin, caml...@inria.fr
Török Edwin wrote:
> Do you really need to use Int64 for that though? Won't the 63-bit
> version do?

I'm running 32-bit.

> > I am unable to reproduce your results. Here, the time falls from 24s
> > to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26×
> > slower than HLVM.

Sorry, I'm actually using an Opteron x86 (logged in from an Intel x86!).

> Do you still have 'idiv' in the compiled code? See my attached
> assembly, and compare it with yours please.
> I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0.

I get what appear to be calls to C code:

camlCollatz__collatzLen_1030:
subl $8, %esp
.L103:
movl %eax, 4(%esp)
movl %ebx, 0(%esp)
pushl $camlCollatz__10
pushl %ebx
movl $caml_equal, %eax
call caml_c_call
.L104:
addl $8, %esp
cmpl $1, %eax
je .L102
movl 4(%esp), %eax
addl $8, %esp
ret
.align 16
.L102:
pushl $camlCollatz__8
movl 4(%esp), %eax
pushl %eax
movl $caml_int64_and, %eax
call caml_c_call
.L105:
addl $8, %esp
pushl $camlCollatz__9
pushl %eax
movl $caml_equal, %eax
call caml_c_call
.L106:
addl $8, %esp
cmpl $1, %eax
je .L101
pushl $3
movl 4(%esp), %eax
pushl %eax
movl $caml_int64_shift_right, %eax
call caml_c_call
.L107:
addl $8, %esp
movl %eax, %ebx
jmp .L100
.align 16
.L101:
movl 0(%esp), %eax
pushl %eax
pushl $camlCollatz__6
movl $caml_int64_mul, %eax
call caml_c_call
.L108:
addl $8, %esp
pushl $camlCollatz__7
pushl %eax
movl $caml_int64_add, %eax
call caml_c_call
.L109:
addl $8, %esp
movl %eax, %ebx
.L100:
movl 4(%esp), %eax
addl $2, %eax
jmp .L103

> FWIW the original code took 2.8 seconds here, so only 4x slower (this
> is an AMD Phenom II x6 1090T CPU). It probably depends how fast/slow
> the 'idiv' is on your CPU.

The performance of idiv is irrelevant here. The bottleneck may be those C calls but I don't understand why they are being generated.

Török Edwin

Dec 12, 2010, 1:22:28 PM
to Jon Harrop, caml...@inria.fr
On Sun, 12 Dec 2010 18:01:13 -0000
"Jon Harrop" <j...@ffconsultancy.com> wrote:

> Török Edwin wrote:
> > Do you really need to use Int64 for that though? Won't the 63-bit
> > version do?
>
> I'm running 32-bit.

That explains it: in a 32-bit chroot my modified version is still slow.
I thought 32-bit systems were not that relevant anymore (except on
Windows, but people are moving to 64-bit there as well).

>
> > > I am unable to reproduce your results. Here, the time falls from
> > > 24s to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still
> > > 26× slower than HLVM.
>
> Sorry, I'm actually using an Opteron x86 (logged in from an Intel
> x86!).
>
> > Do you still have 'idiv' in the compiled code? See my attached
> > assembly, and compare it with yours please.
> > I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0.
>
> I get what appear to be calls to C code:
>
> camlCollatz__collatzLen_1030:
> subl $8, %esp

> .L103:


> movl %eax, 4(%esp)
> movl %ebx, 0(%esp)
> pushl $camlCollatz__10
> pushl %ebx
> movl $caml_equal, %eax
> call caml_c_call

Yes, that is quite bad. I don't know how OCaml's code generator works,
but it looks like it calls the C implementation if the CPU doesn't
support the operation directly. And since this is 32-bit you need all
the extra pushes and movs to actually call something.
If only it could inline those calls, then it could optimize away most
of the overhead (LLVM would help here again).

>
> > FWIW the original code took 2.8 seconds here, so only 4x slower
> > (this is an AMD Phenom II x6 1090T CPU). It probably depends how
> > fast/slow the 'idiv' is on your CPU.
>
> The performance of idiv is irrelevant here. The bottleneck may be
> those C calls but I don't understand why they are being generated.

I think for the same reason gcc has __udivdi3 in libgcc: there is no
direct way of executing a 64-bit divide on a 32-bit machine, and it
saves code space to do it in a function.
However that doesn't make much sense for mul and add, which don't need
that many instructions to implement on 32-bit.
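To see why: a 64-bit add is just two 32-bit adds plus a carry, which a code generator could emit inline instead of calling caml_int64_add. A rough OCaml sketch of that decomposition (illustration only; run it on a 64-bit host so that native ints can hold unsigned 32-bit halves):

  (* Represent a 64-bit value as an (lo, hi) pair of 32-bit halves. *)
  let add64 (lo1, hi1) (lo2, hi2) =
    let lo = (lo1 + lo2) land 0xFFFFFFFF in
    let carry = if lo < lo1 then 1 else 0 in   (* detects the unsigned wrap *)
    (lo, (hi1 + hi2 + carry) land 0xFFFFFFFF)

  let () =
    (* 0x1_FFFFFFFF + 0x1 = 0x2_00000000 *)
    let lo, hi = add64 (0xFFFFFFFF, 0x1) (0x1, 0x0) in
    Printf.printf "hi=%x lo=%x\n" hi lo   (* prints: hi=2 lo=0 *)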

Brian Hurt

Dec 12, 2010, 2:53:51 PM
to Jon Harrop, caml...@inria.fr

I'm going to regret this. I know I'm going to regret this.

On Sun, 12 Dec 2010, Jon Harrop wrote:

> Here's the equivalent OCaml code:


>
> let rec collatzLen(c, n) : int =
> if n = 1L then c else
> collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
> Int64.add (Int64.mul 3L n) 1L);;
>
> let rec loop(i, (nlen, n)) =
> if i = 1L then n else
> let ilen = collatzLen(1, i) in
> let nlen, n = if ilen > nlen then ilen, i else nlen, n in
> loop (Int64.sub i 1L, (nlen, n));;

Congratulations, Jon, you win today's Captain Obvious award. Using
Int64s, which are forced to be boxed, really slows things down.
Uncurrying all your arguments slows things down as well. Running your
original code on my 64-bit laptop, it took 6.35s to run the 1M example.
The following alternate code took only 0.82s, for a speed-up of almost
7.75x. Scaling your timings by a similar amount gives OCaml a running
speed of 3.14s in your setup, or competitive with F#.

My code:
let rec collatzLen c n =
  if n = 1 then c else
    collatzLen (c+1)
      (if (n land 1) == 0 then (n lsr 1) else ((n * 3) + 1))
;;

let rec loop i nlen n =
  if i = 1 then n else
    let ilen = collatzLen 1 i in
    if ilen > nlen then
      loop (i - 1) ilen i
    else
      loop (i - 1) nlen n
;;

let () = print_int (loop 1000000 0 0)


Here is where you insert a lecture about how OCaml ints being only 63 (or
31) bits aren't "real" ints, and that therefore this isn't a valid
comparison at all.

Brian

Török Edwin

Dec 12, 2010, 2:56:05 PM
to Benedikt Meurer, caml...@yquem.inria.fr
On Sun, 12 Dec 2010 20:09:00 +0100
Benedikt Meurer <benedik...@googlemail.com> wrote:

>
> On Dec 12, 2010, at 16:55 , Török Edwin wrote:
>
> > [...]


> > Problem #2: Int64.div n 2 -> idiv instruction.
> >
> > A C compiler would optimize this to a right shift. Changing that to
> > 'Int64.shift_right n 1' speeds up the code.
>

> This is easy to fix in ocamlopt (see attached patch
> ocamlopt-natint.patch), by applying the same optimizations already
> used for constant int's to constant natint's (Int64 is Nativeint on
> 64bit).

Nice patch. I definitely agree that what can be fixed in ocamlopt's
high-level opts should be fixed there, rather than hoping that LLVM will
just do everything.

> [...]


> >
> > One advantage of using LLVM is that it would notice arithmetic
> > optimizations like this and perform it itself (even if you use the
> > boxed representation).
>

> In case of x86-32, it won't, simply because LLVM will be presented
> with the calls to caml_int32_* functions.

I was thinking that the runtime could be compiled to .bc, and then those
calls would get inlined and optimized away at link time. That is overkill
for something as simple as integer arithmetic, but could be useful for more
complicated runtime functions (or C bindings).

> You'd need to change the
> Cmm code instead (changing the low-level stuff is straight-forward as
> demonstrated).

I agree that not emitting those C calls in the first place is better.

> For 64bit targets, see attached patch.
>
> This doesn't mean that LLVM wouldn't be useful (in fact, I've just
> started an LLVM backend for ocamlopt).

Great! If you notice some more optimizations missed by ocamlopt while
working on it, could you write up a list of those?

I think I suggested earlier in this thread that LLVM could be used in
two ways: write a backend with it, or port some of the LLVM
optimizations to ocamlopt.
Maybe it'd be good to write some documentation on ocamlopt's internals
and how one can add more optimizations there. Something simple would
suffice for a start: what the IR format is (like LLVM's LangRef), how
optimizations are performed (the equivalent of LLVM's passes), and what
helper modules are available (the equivalent of LLVM's analyses).
Does something like this exist already?

> But it is important to note
> that LLVM is not the solution to everything. As the name implies,
> it's "low level", it does a few "higher level" optimizations for C,
> but these are special cases (and somewhat ugly if you take the time
> to read the code). It won't make a faster OCaml magically, just like
> it didn't make a faster Haskell by itself.
>
> I could go on by quoting common "Harrop jokes" like "you need types
> in the low-level IR", etc. trying to tell him that this is simply
> wrong; but after reading through the Caml/LISP mailing list archives
> (thanks for the pointers by several readers), I'm pretty much
> convinced that Jon simply decided to end his war against LISP just to
> start a new one against ocamlopt.
>
> If anyone is serious about ocamlopt with LLVM, feel free to contact
> me (tho, my time is limited atm).

I wouldn't have much time to dedicate to this, so I can't say I'm
serious about it.
AFAICT LLVM's OCaml bindings are only good for generating LLVM IR from
OCaml, not for actually performing transformations on it (there is no
binding to retrieve the type of a value for example). I'll probably be
looking into fixing that in the near future, and this may indirectly
help your LLVM backend (if you intend to write OCaml specific
transformations on the LLVM IR).
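For anyone who hasn't tried them, emitting IR is roughly all the current bindings are good for. A minimal sketch (function names are from the llvm OCaml package; treat the exact API as illustrative, since it changes between releases):

  (* Build a trivial "add two i64s" function and print the textual IR. *)
  let () =
    let ctx = Llvm.global_context () in
    let m = Llvm.create_module ctx "demo" in
    let i64 = Llvm.i64_type ctx in
    let fty = Llvm.function_type i64 [| i64; i64 |] in
    let f = Llvm.define_function "add64" fty m in
    let b = Llvm.builder_at_end ctx (Llvm.entry_block f) in
    let sum = Llvm.build_add (Llvm.params f).(0) (Llvm.params f).(1) "sum" b in
    ignore (Llvm.build_ret sum b);
    print_string (Llvm.string_of_llmodule m)

Inspecting or rewriting IR that was built elsewhere is where the bindings currently fall short.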

Jon Harrop

Dec 12, 2010, 3:41:30 PM
to Brian Hurt, caml...@inria.fr
Brian Hurt wrote:
> On Sun, 12 Dec 2010, Jon Harrop wrote:
> > let rec collatzLen(c, n) : int =
> > if n = 1L then c else
> > collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
> > Int64.add (Int64.mul 3L n) 1L);;
> >
> > let rec loop(i, (nlen, n)) =
> > if i = 1L then n else
> > let ilen = collatzLen(1, i) in
> > let nlen, n = if ilen > nlen then ilen, i else nlen, n in
> > loop (Int64.sub i 1L, (nlen, n));;
>
> Congratulations, Jon, you win today's Captain Obvious award. Using
> Int64's, which are forced to be boxed, really slows things down.

Apparently boxing isn't the issue here, as I had assumed. On 32-bit, OCaml
compiles each arithmetic operation on the int64s to a C function call.

> Also, uncurrying all your arguments also slows things down.

I see <3% performance improvement from currying everything.

> Running your
> original code on my 64-bit laptop, it took 6.35s to run the 1M example.
> The following alternate code only took 0.82s, for a speed up of almost
> 7.75x.

According to Edwin, you should be able to get C-like performance by running
the OCaml in 64-bit and replacing the div and mod operations with shifts and
logical ANDs.

> Scaling your timings by a similar amount gives Ocaml a running
> speed of 3.14s in your set up, or competitive with F#.

I'd be wary of scaling timings by measurements made across different
architectures. OCaml seems to be doing completely different things on x86
and x64 here.

Cheers,
Jon.

Jon Harrop

Dec 12, 2010, 4:50:39 PM
to Benedikt Meurer, caml...@inria.fr
Benedikt wrote:
> > A C compiler would optimize this to a right shift. Changing that to
> > 'Int64.shift_right n 1' speeds up the code.
>
> This is easy to fix in ocamlopt (see attached patch ocamlopt-
> natint.patch), by applying the same optimizations already used for
> constant int's to constant natint's (Int64 is Nativeint on 64bit). Note
> however, that "mod 2" is not really "and 1", neither is "div 2"
> equivalent to "lsr 1"; that would be the case for unsigned arithmetic
> (doesn't matter in your example, tho).

That's great. Does it optimize div and mod by any constant integer as C
compilers do?

> >> 1. Unboxing can give huge performance improvements on serial code,
> >
> > s/Unboxing/arithmetic optimizations/
> > Please find an example where the performance benefit is due to
> > unboxing, and not due to arithmetic optimizations performed on the
> > unboxed code.
>

> The boxing involved is relevant, but boxing in general is not the
> issue. In this special case, the "let nlen, n = if..." code requires
> heap allocation, because of the way the pattern is compiled. This could
> be fixed by moving the condition out of the code and using two if's to
> select n/nlen separately (doesn't speed up that much). Fixing the
> pattern compiler to handle these cases might be interesting for general
> benefit.
>
> I already mentioned this multiple times, but here we go again: Unboxing
> optimizations may indeed prove useful if applied wisely (cmmgen.ml is
> of special interest here, the unboxing optimizations are more or less
> special cases; that could be extended to include interesting cases like
> moving boxing out of if-then-else in return position, etc).
>
> But (here comes the special "Harrop note") this has absolutely nothing
> to do with LLVM (and of course really, really, really nothing to do
> with HLVM). Using a different data representation for the heap requires
> a nearly complete rewrite of the OCaml system (you would probably need
> to start at the Typing level); if one wants to do this, enjoy and come
> back with code. But even then, data representation issues will have to
> be considered long before it comes to actual code generation (if you
> are serious, you'll have to think about the runtime first prior to
> talking about code generation for a non-existing runtime), so even then
> it has nothing to do with LLVM (or C-- or C or whatever IR you can
> think of).

OCaml programmers can benefit from more appropriate data representations by
using LLVM as a library to generate code from OCaml. HLVM is an example of
this that anyone can play with.

> Combining alloc's across if-then-else constructs further reduces code
> size in your example (and probably other programs as well), see
> attached file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty,
> but it should illustrate the idea.

I think that is an even more valuable improvement to ocamlopt than the int
optimization.

> This doesn't mean that LLVM wouldn't be useful (in fact, I've just

> started an LLVM backend for ocamlopt). But it is important to note that


> LLVM is not the solution to everything. As the name implies, it's "low
> level", it does a few "higher level" optimizations for C, but these are
> special cases (and somewhat ugly if you take the time to read the
> code). It won't make a faster OCaml magically, just like it didn't make
> a faster Haskell by itself.

Absolutely.

> I could go on by quoting common "Harrop jokes" like "you need types in
> the low-level IR", etc. trying to tell him that this is simply wrong;
> but after reading through the Caml/LISP mailing list archives (thanks
> for the pointers by several readers), I'm pretty much convinced that
> Jon simply decided to end his war against LISP just to start a new one
> against ocamlopt.

Suggesting that OCaml programmers use LLVM as a library because you can get
huge performance gains is hardly a "war against ocamlopt".

Cheers,
Jon.

Jon Harrop

Dec 12, 2010, 5:06:00 PM
to Török Edwin, caml...@inria.fr
Edwin wrote:
> AFAICT LLVM's OCaml bindings are only good for generating LLVM IR from
> OCaml, not for actually performing transformations on it (there is no
> binding to retrieve the type of a value for example). I'll probably be
> looking into fixing that in the near future, and this may indirectly
> help your LLVM backend (if you intend to write OCaml specific
> transformations on the LLVM IR).

That's a lot of work. Wouldn't it be preferable to do the passes on the OCaml side and focus on generating high quality LLVM IR?

Cheers,
Jon.

Jon Harrop

Dec 12, 2010, 6:41:37 PM
to Török Edwin, caml...@inria.fr
Edwin wrote:

> On Sun, 12 Dec 2010 22:05:34 -0000
> Jon Harrop <jonathand...@googlemail.com> wrote:
> > Edwin wrote:
> > > AFAICT LLVM's OCaml bindings are only good for generating LLVM IR
> > > from OCaml, not for actually performing transformations on it
> > > (there is no binding to retrieve the type of a value for example).
> > > I'll probably be looking into fixing that in the near future, and
> > > this may indirectly help your LLVM backend (if you intend to write
> > > OCaml specific transformations on the LLVM IR).
> >
> > That's a lot of work. Wouldn't it be preferable to do the passes on
> > the OCaml side and focus on generating high quality LLVM IR?
>
> Yes, that is probably a better approach (generating the optimized IR in
> the first place).

FWIW, just the basic existing bindings were still buggy last I looked. For
example, HLVM had to use a workaround to add a dummy argument to a function
with no arguments because the bindings didn't handle that case correctly.
Ironing the bugs out of the existing bindings would be more useful than
making them even bigger, IMHO.

Going back to Richard's idea, writing an OCaml library that uses LLVM to JIT
interface code on the fly, as a replacement for writing separate C stubs,
would be incredibly useful, and you could even use it to interface to LLVM
itself more easily!

Eray Ozkural

Dec 12, 2010, 9:13:50 PM
to Jon Harrop, caml...@inria.fr
It's better to focus on low-level optimizations within the LLVM IR, I think;
high-level optimizations are better done beforehand. It's not a bad
marriage, though. :)

--
Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct

Alain Frisch

Dec 13, 2010, 3:43:57 AM
to Benedikt Meurer, caml...@yquem.inria.fr
On 12/12/2010 08:09 PM, Benedikt Meurer wrote:
> The boxing involved is relevant, but boxing in general is not the
> issue. In this special case, the "let nlen, n = if..." code requires
> heap allocation, because of the way the pattern is compiled. This could
> be fixed by moving the condition out of the code and using two if's to
> select n/nlen separately (doesn't speed up that much). Fixing the
> pattern compiler to handle these cases might be interesting for general
> benefit.

Instead of duplicating the conditional, you could also push the
assignments to bound variables down the expression. For instance:

let (x, y) = if b then (u, v) else (v, u) in ...

can be replaced, conceptually, by:

let x = <dummy> in
let y = <dummy> in
if b then (x <- u; y <- v) else (x <- v; y <- u);
..

and similarly when the bound expression is a pattern matching.
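A user-level sketch of the same idea (illustration only; the compiler performs this on the Lambda form, with mutable temporaries rather than source-level refs):

  (* Before: every call allocates an intermediate pair just to destructure it. *)
  let pick b u v =
    let x, y = if b then (u, v) else (v, u) in
    x * 10 + y

  (* After, conceptually: the bindings are pushed into the branches, so no
     pair is ever built. *)
  let pick_no_alloc b u v =
    let x = ref u and y = ref v in
    if not b then (x := v; y := u);
    !x * 10 + !y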


I played with this a few months ago and observed significant speedups
(27% and 20%) on two micro-benchmarks.

The diff is really small:

http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?rev=10475&sortby=date&r2=10475&r1=10474


-- Alain

Micro benchmark 1:

let () =
  let r = ref 0 in
  for k = 1 to 10000 do
    for i = 1 to 100000 do
      let (x, y) =
        if i mod 2 = 0 then (1, i * 2)
        else (2, i * 3)
      in
      r := !r * x + y
    done
  done

Micro benchmark 2:

let f x y z =
  let a, b =
    match z with
    | Some (u, v) -> u, v * 2
    | None -> 10, 20
  in
  a * x + b * y

let () =
  let r = ref 0 in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      r := !r + f k i (Some (k, i));
      r := !r + f k i None
    done
  done

Goswin von Brederlow

Dec 14, 2010, 4:54:55 AM
to Edwin, Jon Harrop, caml...@inria.fr
Török Edwin <edwin...@gmail.com> writes:

> On Sun, 12 Dec 2010 14:54:14 -0000
> "Jon Harrop" <j...@ffconsultancy.com> wrote:
>
>> The Haskell guys got their best performance improvement moving to
>> LLVM from the hailstone benchmark so it is interesting to examine
>> this case as well. I just added support for 64-bit ints to HLVM to
>> implement that benchmark and my code is:
>>
>> Here’s the equivalent OCaml code:
>>
>> let rec collatzLen(c, n) : int =
>> if n = 1L then c else
>> collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
>
> OK, but boxing itself has nothing to do with the performance degradation
> here. It is the lack of compiler optimizations on the Int64 type. This
> could be solved by implementing compiler optimizations for it (or
> manually optimizing some integer arithmetic that is slow).
>
> Lets run the code under a profiler, or look at the assembly (I used
> 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top
> offenders in the profile.
>
> Problem #1: Int64.rem n 2 -> another idiv instruction
>
> A C compiler would optimize this to an 'and' instruction.
> Change that to 'Int64.logand n 1L = 0L'.

But C would have a uint64_t there (if you are smart). Otherwise it
isn't equivalent:

#include <stdint.h>
int64_t a;

void foo(int64_t x) {
    a = x % 2;
}

0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 c1 e8 3f shr $0x3f,%rax
7: 48 01 c7 add %rax,%rdi
a: 83 e7 01 and $0x1,%edi
d: 48 29 c7 sub %rax,%rdi
10: 48 89 3d 00 00 00 00 mov %rdi,0x0(%rip) # 17 <foo+0x17>
17: c3 retq

> Problem #2: Int64.div n 2 -> idiv instruction.
>
> A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds
> up the code.

Same thing again:

void foo(int64_t x) {
    a = x / 2;
}

0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 c1 e8 3f shr $0x3f,%rax
7: 48 8d 3c 38 lea (%rax,%rdi,1),%rdi
b: 48 d1 ff sar %rdi
e: 48 89 3d 00 00 00 00 mov %rdi,0x0(%rip) # 15 <foo+0x15>
15: c3 retq

In both cases gcc avoids the expensive idiv instruction but it needs to
handle the sign of the number with some tricks. Still faster than idiv
though.


A UInt64 module would be nice here.
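The same caveat is quick to check from OCaml (irrelevant for the Collatz code, where n stays positive, but worth keeping in mind):

  (* Int64.div/rem round toward zero, so for negative operands they are not
     equivalent to an arithmetic shift / logand. *)
  let () =
    Printf.printf "%Ld %Ld\n" (Int64.div (-7L) 2L) (Int64.shift_right (-7L) 1);
    (* prints: -3 -4 *)
    Printf.printf "%Ld %Ld\n" (Int64.rem (-7L) 2L) (Int64.logand (-7L) 1L)
    (* prints: -1 1 *)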

MfG
Goswin

Benedikt Meurer

Dec 15, 2010, 5:29:21 AM
to caml...@yquem.inria.fr

On Dec 13, 2010, at 09:43 , Alain Frisch wrote:

> On 12/12/2010 08:09 PM, Benedikt Meurer wrote:
>> The boxing involved is relevant, but boxing in general is not the
>> issue. In this special case, the "let nlen, n = if..." code requires
>> heap allocation, because of the way the pattern is compiled. This could
>> be fixed by moving the condition out of the code and using two if's to
>> select n/nlen separately (doesn't speed up that much). Fixing the
>> pattern compiler to handle these cases might be interesting for general
>> benefit.
>
> Instead of duplicating the conditional, you could also push the assignments to bound variables down the expression. For instance:
>
> let (x, y) = if b then (u, v) else (v, u) in ...
>
> can be replaced, conceptually, by:
>
> let x = <dummy> in
> let y = <dummy> in
> if b then (x <- u; y <- v) else (x <- v; y <- u);

> ...


>
> and similarly when the bound expression is a pattern matching.
>
>
> I've played with this a few months ago and could observe important speedups (27%, 20%) on two micro-benchmarks.
>
> The diff is really small:
>
> http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?rev=10475&sortby=date&r2=10475&r1=10474

Nice. But it would be even better to avoid the dummy; in your example:

let x = u in
let y = v in
if not b then (x <- v; y <- u)

This not only avoids the dummy, but would also allow lowering to "cmovcc" instructions in the backend selector (at least for x86-32/64).

> -- Alain

Benedikt

Jon Harrop

Dec 15, 2010, 8:16:01 AM
to caml...@inria.fr
Benedikt wrote:
> Nice. But it would be even better to avoid the dummy; in your example:
>
> let x = u in
> let y = v in
> if not b then (x <- v; y <- u)
>
> This not only avoids the dummy, but would also allow lowering to
> "cmovcc" instructions in the backend selector (at least for x86-32/64).

FWIW, when you're using LLVM (as a library!) to generate equivalent code,
you can just represent the tuples as structs (value types) and LLVM
will take care of all of this for you. Again, it generates really efficient
code with minimal effort.

The latest example front-end for HLVM represents tuples as structs so it
gets this for free. However, polymorphic recursion is *much* harder to
implement with that design choice. Also, there is the question of when boxed
vs unboxed tuples are more efficient.
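As a tiny illustration of what "tuples as structs" means at the IR level (again via the Llvm bindings; names are illustrative, not HLVM's actual code):

  (* An unboxed int * float pair becomes a first-class { i64, double } struct
     in the IR instead of a pointer into the OCaml heap. *)
  let () =
    let ctx = Llvm.global_context () in
    let pair = Llvm.struct_type ctx [| Llvm.i64_type ctx; Llvm.double_type ctx |] in
    print_endline (Llvm.string_of_lltype pair)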

Cheers,
Jon.
