
[Caml-list] MetaOcaml and high-performance [was: AST versus Ocaml]


ol...@okmij.org

Nov 8, 2009, 11:24:02 PM
to caml...@inria.fr

Jon Harrop wrote:
> I did some experiments with MetaOCaml. Firstly, it is x86 only and not x64
> which means poor floating-point performance, many times slower than HLVM's.
> The language itself is also very restrictive, e.g. you cannot generate
> pattern matches dynamically so you cannot leverage the pattern match
> compiler, which is a huge loss. In essence, effective use of MetaOCaml
> requires writing in continuation passing style which OCaml and, therefore,
> MetaOCaml do not handle well (closure invocations are slow in OCaml and CPS
> is not optimized back into anything sane). So I do not consider MetaOCaml to
> be a viable option for performance users.

A few clarifications seem to be in order. First of all, the original
poster asked about _offshoring_ with MetaOCaml. When the generated
code is run with offshoring, a C or Fortran file is created, which can
be compiled with your favorite compiler and dynamically linked back
into the running OCaml program. Alternatively, you can use the
generated C/Fortran code as it is, as a part of a C/Fortran project. We
did exactly the latter in our FFT project: we used MetaOCaml to create C
files for FFT kernels, and plugged the files into the FFTW benchmarking
framework, which is pure C. It worked as expected.

Because offshoring produces a portable C or Fortran source file, you can
use the code on 32- and 64-bit platforms. The reason native MetaOCaml
without offshoring does not work on amd64 is that, at the time,
OCaml did not emit PIC code for amd64, so dynamic linking was
impossible. That problem has long been fixed in later versions of
OCaml. Offshoring is a good way to get around it, thanks to Jan Kybic.

Offshoring could just as well produce Verilog or LLVM code. Alas,
we didn't get around to exploring that idea.
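The generate-then-compile workflow can be mimicked in miniature in plain OCaml. The sketch below is not MetaOCaml's offshoring API (which translates staged OCaml terms to C automatically, with typed code values); it simply emits specialized C text by hand, for a dot product whose size is known at generation time. The name `gen_dot_c` is made up for illustration.

```ocaml
(* Illustrative sketch only: emit C source for a dot product fully
   unrolled for a size n known at generation time.  Real offshoring
   derives such C code from staged OCaml, not from string templates. *)
let gen_dot_c (n : int) : string =
  let body =
    List.init n (fun i -> Printf.sprintf "x[%d]*y[%d]" i i)
    |> String.concat " + "
  in
  Printf.sprintf
    "double dot%d(const double *x, const double *y) {\n  return %s;\n}\n"
    n body

(* For n = 4 this prints:
   double dot4(const double *x, const double *y) {
     return x[0]*y[0] + x[1]*y[1] + x[2]*y[2] + x[3]*y[3];
   }                                                       *)
let () = print_string (gen_dot_c 4)
```

The emitted file can then be compiled with any C compiler and linked back into the running program, or dropped into a pure C project, exactly as described above for the FFT kernels.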

Regarding continuation-passing style (CPS): if your code _generator_
needs let-insertion, then the generator _may_ need to be written in
CPS. The _generated_ code is _not_ in CPS; it is in the `conventional',
so-called `direct style'. Even if we assume that CPS code is
difficult to compile efficiently (which is not certain: in CPS, all
`important' function calls are tail calls, and OCaml is very good with
tail calls), that difficulty affects only the code generator rather
than the generated code. Most of the time it is the generated code
that is performance-critical.
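To illustrate why it is the generator, not the generated code, that ends up in CPS, here is a small sketch in plain OCaml. Generated code is modelled as strings rather than MetaOCaml's typed code values, and `genlet`, `gensym`, and `power4` are illustrative names, not a real API.

```ocaml
(* Sketch: CPS in the generator for let-insertion.  Generated code is
   modelled as strings.  To share a subexpression in the GENERATED
   code, the generator inserts a `let' at an outer point; passing a
   continuation gives it access to that outer context. *)
type code = string

let gensym =
  let n = ref 0 in
  fun () -> incr n; Printf.sprintf "t%d" !n

(* Bind [e] once; the continuation [k] receives the variable and may
   mention it many times, yet [e] is computed only once. *)
let genlet (e : code) (k : code -> code) : code =
  let v = gensym () in
  Printf.sprintf "let %s = %s in\n%s" v e (k v)

(* Generate code for x^4 that computes x *. x only once.  The
   generator is in CPS; the generated code is plain direct style. *)
let power4 (x : code) : code =
  genlet (x ^ " *. " ^ x) (fun x2 -> x2 ^ " *. " ^ x2)

(* prints:
   let t1 = x *. x in
   t1 *. t1            *)
let () = print_endline (power4 "x")
```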

One may object that writing the generator in CPS is a bother; I have
heard such objections. Our PEPM2009 paper describes a way to address
it and to write generators in the conventional style:
http://okmij.org/ftp/Computation/Generative.html#circle-shift
See in particular the example of writing flexible Gaussian Elimination
code that pays no penalty for its abstractions.
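The flavor of a direct-style generator can be sketched in plain OCaml. The sketch below is only an approximation of the idea: the PEPM2009 paper uses delimited control, whereas here pending let-bindings are accumulated in mutable state, and `genlet`, `reify`, and the string representation of code are all illustrative, not the paper's actual API.

```ocaml
(* Sketch: direct-style let-insertion via mutable state, approximating
   the idea of writing generators without CPS.  Generated code is
   modelled as strings. *)
type code = string

(* Pending let-bindings, most recent first. *)
let bindings : (string * code) list ref = ref []

let fresh =
  let n = ref 0 in
  fun () -> incr n; Printf.sprintf "t%d" !n

(* Direct-style genlet: record the binding, return the variable.
   No continuation argument is needed. *)
let genlet (e : code) : code =
  let v = fresh () in
  bindings := (v, e) :: !bindings;
  v

(* Run a generator and wrap its result in the accumulated lets. *)
let reify (gen : unit -> code) : code =
  bindings := [];
  let body = gen () in
  List.fold_left
    (fun acc (v, e) -> Printf.sprintf "let %s = %s in\n%s" v e acc)
    body !bindings

(* The generator body reads as ordinary direct-style code:
   prints
   let t1 = x *. x in
   t1 *. t1            *)
let () =
  print_endline
    (reify (fun () -> let x2 = genlet "x *. x" in x2 ^ " *. " ^ x2))
```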

Fortunately, some people have considered MetaOCaml to be a viable
option for performance users and have reported good results. For
example,

Tuning MetaOCaml Programs for High Performance
Diploma Thesis of Tobias Langhammer.
http://www.infosun.fmi.uni-passau.de/cl/arbeiten/Langhammer.pdf

Here is a good quotation from the Introduction:

``This thesis proposes MetaOCaml for enriching the domain of high-performance
computing by multi-staged programming. MetaOCaml extends the OCaml
language.
..
Benchmarks for all presented implementations confirm that the
execution time can be reduced significantly by high-level
optimizations. Some MetaOCaml programs even run as fast as respective
C implementations. Furthermore, in situations where optimizations in
pure MetaOCaml are limited, computation hotspots can be explicitly or
implicitly exported to C. This combination of high-level and low-level
techniques allows optimizations which cannot be obtained in pure C
without enormous effort.''

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Jon Harrop

Nov 10, 2009, 10:38:01 AM
to caml...@yquem.inria.fr
On Monday 09 November 2009 04:23:28 ol...@okmij.org wrote:
> Because offshoring produces a portable C or Fortran source file, you can
> use the code on 32- and 64-bit platforms. The reason native MetaOCaml
> without offshoring does not work on amd64 is that, at the time,
> OCaml did not emit PIC code for amd64, so dynamic linking was
> impossible. That problem has long been fixed in later versions of
> OCaml...

Has the problem been fixed in MetaOCaml?

> Fortunately, some people have considered MetaOCaml to be a viable
> option for performance users and have reported good results. For
> example,
>
> Tuning MetaOCaml Programs for High Performance
> Diploma Thesis of Tobias Langhammer.
> http://www.infosun.fmi.uni-passau.de/cl/arbeiten/Langhammer.pdf
>
> Here is a good quotation from the Introduction:
>
> ``This thesis proposes MetaOCaml for enriching the domain of
> high-performance computing by multi-staged programming. MetaOCaml extends
> the OCaml language.

> ...


> Benchmarks for all presented implementations confirm that the
> execution time can be reduced significantly by high-level
> optimizations. Some MetaOCaml programs even run as fast as respective
> C implementations. Furthermore, in situations where optimizations in
> pure MetaOCaml are limited, computation hotspots can be explicitly or
> implicitly exported to C. This combination of high-level and low-level
> techniques allows optimizations which cannot be obtained in pure C
> without enormous effort.''

That thesis contains three benchmarks:

1. Dense float matrix-matrix multiply.

2. Blur of an int image matrix as convolution with a 3x3 stencil matrix.

3. Polynomial multiplication with distributed parallelism.

I don't know about the polynomial multiplication (suffice it to say that
it does not leverage shared-memory parallelism, which is what performance
users value in today's multicore era), but the code for the first two
benchmarks is probably 10-100x slower than any decent implementation. For
example, his fastest 2048x2048 matrix multiply takes 167s, whereas Matlab
takes only 3.6s here.

In essence, the performance gain (if any) from offshoring to C or Fortran is
dwarfed by the lack of shared-memory parallelism.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
