
[Caml-list] Slow allocations with 64bit code?


Markus Mottl

Apr 20, 2007, 4:33:29 PM
to ocaml
Hi,

I wonder whether others have already noticed that allocations can,
surprisingly, be slower on 64-bit platforms than on 32-bit ones.

I compiled the following code with an OCaml compiler that generates 32-bit code:

-------------------------
let () =
  for i = 1 to 100000000 do
    ignore (Int32.add 42l 24l)
  done
-------------------------

I ran it on a 64-bit platform (Intel(R) Pentium(R) D CPU 2.80GHz), and
it took 0.65 seconds to finish. Then I recompiled it on the same
platform using an OCaml compiler that generates 64-bit code.
Surprisingly, the resulting executable took 0.72 seconds to run!

This is only a difference of about 10%, but I have seen more complex
cases where there are timing differences in excess of 50%, which is
already pretty substantial.

Looking at the assembly, there is really no difference in the loop
other than the use of quad-word instructions, which should not
take longer on the exact same platform (i.e. same CPU frequency). But
there is a suspicious call to "caml_alloc2", which might cause these
differences. Can it be that there are alignment problems or something
similar in the runtime?

In the considerably more complex code I'm currently working on, it also
seemed to me that it's the allocations (i.e. the runtime) that cause the
performance difference.
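
For reference, here is a self-timing variant of the snippet above. This is
only a sketch: Sys.time, the printing, and the loop-variable name are
additions for illustration, not the harness that produced the numbers above.

-------------------------
(* Sketch: time the allocation loop from inside the program;
   Sys.time reports CPU time in seconds. *)
let () =
  let t0 = Sys.time () in
  for _i = 1 to 100000000 do
    ignore (Int32.add 42l 24l)
  done;
  Printf.printf "elapsed: %.2fs\n" (Sys.time () -. t0)
-------------------------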

Regards,
Markus

--
Markus Mottl http://www.ocaml.info markus...@gmail.com

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

skaller

Apr 20, 2007, 10:58:37 PM
to Markus Mottl
On Fri, 2007-04-20 at 16:31 -0400, Markus Mottl wrote:

> This is only a difference of about 10%, but I have seen more complex
> cases where there are timing differences in excess of 50%, which is
> already pretty substantial.

I am surprised! The 64 bit code is so fast!

You are using 64-bit pointers. They're twice as big as 32-bit
pointers, so every 'box' and every heap slot is double the size.

On a memory-intensive operation, you'd expect the 64-bit model
to run at half the speed. In your example:

let () =
  for i = 1 to 100000000 do
    ignore (Int32.add 42l 24l)
  done

it would appear the expression forces an allocation (the boxed Int32
result), which is subsequently garbage collected. So you have both
allocation and collection, touching double the memory compared to the
32-bit model. It seems likely the reason the code is only 10% slower
here is that the minor-heap collector successfully keeps this code
from touching much memory, possibly keeping the whole thing in cache.

This will be a gc tuning detail.
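
To check how much the loop actually allocates, one can read the GC
counters from the standard Gc module. A minimal sketch (the helper name
and the smaller iteration count are arbitrary choices):

-------------------------
(* Sketch: measure minor-heap allocation caused by the loop.
   Gc.quick_stat returns cumulative counters without walking the heap. *)
let minor_words () = (Gc.quick_stat ()).Gc.minor_words

let () =
  let before = minor_words () in
  for _i = 1 to 1000000 do
    ignore (Int32.add 42l 24l)
  done;
  let after = minor_words () in
  Printf.printf "minor words per iteration: %.1f\n"
    ((after -. before) /. 1000000.)
-------------------------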

Try folding over a large list. The 64-bit version should
take twice as long, because the memory accesses are the only
part of the operation that takes significant time.
[Everything else should fit in the cache except reading
the list: boxing/unboxing the accumulator and invoking
the argument closure should all be effectively zero cost.]
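
A minimal sketch of such a test (list length, element values, and the
repeat count are arbitrary choices, not prescribed above):

-------------------------
(* Sketch: build a large list, then time repeated folds over it.
   Walking the cons cells is dominated by memory traffic, so the
   32-bit/64-bit gap should be more visible here. *)
let () =
  let l = Array.to_list (Array.init 1000000 (fun i -> i land 1)) in
  let t0 = Sys.time () in
  let total = ref 0 in
  for _i = 1 to 100 do
    total := !total + List.fold_left (+) 0 l
  done;
  Printf.printf "total = %d, elapsed: %.2fs\n" !total (Sys.time () -. t0)
-------------------------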


--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

Jon Harrop

Apr 21, 2007, 5:15:03 AM
to caml...@yquem.inria.fr
On Friday 20 April 2007 21:31, Markus Mottl wrote:
> In the considerably more complex code I'm currently working on it also
> seemed to me that it's allocations (the run time) that cause the
> performance difference.

Are you sure it isn't just eating the minor heap 2x faster?

I did quite a few benchmarks when I first got my AMD64 and found 64-bit to be
faster on all but tree-based algorithms. I put that down to 64-bit pointers
consuming 2x more memory (although the performance difference was much less
than 2x).

Doing the benchmark again (nth.opt 50 1 cfg-10k-aSi) I get:

7.438s 32-bit metaocamlopt 3.09.1
5.289s 64-bit ocamlopt 3.10.0+beta

What version of OCaml are you using?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The F#.NET Journal
http://www.ffconsultancy.com/products/fsharp_journal/?e

Xavier Leroy

Apr 22, 2007, 6:24:28 AM
to Markus Mottl
> I wonder whether others have already noticed that allocations may
> surprisingly be slower on 64bit platforms than on 32bit ones.

As already mentioned, on 64-bit platforms almost all Caml data
representations are twice as large as on 32-bit platforms (exceptions:
strings, float arrays), so the processor has twice as much data to
move through its memory subsystem.

However, you certainly don't get a slowdown by a factor of 2, for two
reasons: 1- the processor doesn't spend all its time doing memory
accesses, there are some computations here and there; 2- cache lines
are much bigger than 32 bits, meaning that accessing 64 bits at a
given address is much cheaper than accessing two 32-bit
quantities at two random addresses (spatial locality).

Moreover, x86 in 64-bit mode is much more compiler-friendly than in
32-bit mode: twice as many registers, a sensible floating-point model
at last. So, OCaml in 64-bit mode generates better code than in
32-bit mode.

All in all, your 10% slowdown seems reasonable and in line with what
others reported using C benchmarks.

> This is only a difference of about 10%, but I have seen more complex
> cases where there are timing differences in excess of 50%, which is
> already pretty substantial.

Be careful with timings: I've seen simple changes in code placement
(e.g. introducing or removing dead code) cause performance differences
in excess of 20%. It's an unfortunate fact of today's processors that
their performance is very hard to predict.

> Looking at the assembly, there is really no difference in the loop
> other than the use of the quad word instructions, which should not
> take longer on the exact same platform (i.e. same CPU-frequency). But
> there is a suspicious call to "caml_alloc2", which might cause these
> differences. Can it be that there are alignment problems or similar
> in the run time?

ocamlopt compiles module initialization code in the so-called
"compact" model, where code size is reduced by not open-coding some
operations such as heap allocation, but instead going through
auxiliary functions like "caml_alloc2". This makes sense since
initialization code is usually large but not performance-critical.
I recommend you put performance-critical code in functions, not in the
initialization code.
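
A minimal sketch of that restructuring, applied to the loop from the start
of the thread (the function name is just a placeholder):

-------------------------
(* Sketch: keep the hot loop in a function, so ocamlopt open-codes the
   allocation there; the module initializer only contains the call. *)
let bench n =
  for _i = 1 to n do
    ignore (Int32.add 42l 24l)
  done

(* Initialization code: still compiled in the "compact" model, but now
   it is just one call, not the performance-critical loop itself. *)
let () = bench 100000000
-------------------------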

- Xavier Leroy

Markus Mottl

Apr 22, 2007, 12:13:26 PM
to Xavier Leroy
On 4/22/07, Xavier Leroy <Xavier...@inria.fr> wrote:
> > I wonder whether others have already noticed that allocations may
> > surprisingly be slower on 64bit platforms than on 32bit ones.
>
> As already mentioned, on 64-bit platforms almost all Caml data
> representations are twice as large as on 32-bit platforms (exceptions:
> strings, float arrays), so the processor has twice as much data to
> move through its memory subsystem.

Interesting: I was obviously under the wrong assumption that a 64-bit
machine would scale appropriately when accessing 64-bit words in
memory. Of course, I'm aware that cache effects also play a role, but
the minor heap should easily fit into the cache of any modern machine
in any case, and it's not as if this experiment were eating memory.

> However, you certainly don't get a slowdown by a factor of 2, for two
> reasons: 1- the processor doesn't spend all its time doing memory
> accesses, there are some computations here and there; 2- cache lines
> are much bigger than 32 bits, meaning that accessing 64 bits at a
> given address is much cheaper than accessing two 32-bit
> quantities at two random addresses (spatial locality).
>
> Moreover, x86 in 64-bit mode is much more compiler-friendly than in
> 32-bit mode: twice as many registers, a sensible floating-point model
> at last. So, OCaml in 64-bit mode generates better code than in
> 32-bit mode.
>
> All in all, your 10% slowdown seems reasonable and in line with what
> others reported using C benchmarks.

This seems reasonable. It just seemed surprising to me that in some
of my tests a 64-bit machine could be slower handling even "large"
Int64 values than in 32-bit mode, where it always has to perform two
memory accesses and possibly some additional computation steps.

> Be careful with timings: I've seen simple changes in code placement
> (e.g. introducing or removing dead code) cause performance differences
> in excess of 20%. It's an unfortunate fact of today's processors that
> their performance is very hard to predict.

This surely also calls for some caution when interpreting mini-benchmarks.

> ocamlopt compiles module initialization code in the so-called
> "compact" model, where code size is reduced by not open-coding some
> operations such as heap allocation, but instead going through
> auxiliary functions like "caml_alloc2". This makes sense since
> initialization code is usually large but not performance-critical.
> I recommend you put performance-critical code in functions, not in the
> initialization code.

Thanks, this is a very important bit of information that I wasn't
aware of! I used to run mini-benchmarks from initialization code in
most cases, which is obviously a bad idea...

Regards,
Markus


Markus Mottl

Apr 23, 2007, 4:14:45 PM
to ocaml
On 4/22/07, Markus Mottl <markus...@gmail.com> wrote:
> On 4/22/07, Xavier Leroy <Xavier...@inria.fr> wrote:
> > Be careful with timings: I've seen simple changes in code placement
> > (e.g. introducing or removing dead code) cause performance differences
> > in excess of 20%. It's an unfortunate fact of today's processors that
> > their performance is very hard to predict.

After performing many extensive tests on 32-bit and 64-bit platforms,
it seems that code placement is indeed a major cause of many, if not
most, of the timing differences I have seen, especially when the
difference is unusually big.

Other developers who want their code to run fast independently of the
platform should therefore be cautioned that a program compiled for
different architectures may be slower or faster for fairly random
reasons that have nothing to do with not having optimized well enough
for the particular case. This is especially true for low-level code,
where such effects do not easily cancel each other out.
